Setting up to run Mistral LLM locally
Mistral is a series of Large Language Models (LLMs) developed by the French company Mistral AI. These models are notable for their high performance and efficiency in generating text. Many of Mistral's models are open source and can be used without restriction under the Apache 2.0 license. Here is a quick troubleshooting walkthrough of how I set up dolphin-2.5-mixtral-8x7b to run locally with the cuBLAS back-end for hardware acceleration. I hit a few annoying hang-ups in the process, so hopefully this little note helps anyone experiencing the same.
Set up the environment
First, make sure you have CUDA installed:
nvcc --version
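If nvcc is not on your PATH but you do have an NVIDIA GPU, nvidia-smi is another quick way to confirm the driver is installed and see the highest CUDA version it supports (an optional sanity check, not strictly required):
nvidia-smi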
Create an environment definition file mistral.yml to be installed by mamba/conda:
name: mistral
channels:
  - pytorch
  - nvidia
  - defaults
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - anaconda
  - langchain
  - huggingface_hub
  - pytorch
  - pytorch-cuda=12.1
  - pip:
      - torchaudio
      - torchvision
Notably, llama-cpp-python is missing from the package list. We will install it separately using pip, with additional build parameters to enable cuBLAS.
Create and switch into the environment:
mamba env create -f mistral.yml
mamba activate mistral
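Optionally, verify that the PyTorch build inside the new environment actually sees your GPU before moving on (a quick sanity check, not strictly part of the setup):
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"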
Set environment variables so that CMake compiles llama.cpp with the cuBLAS back-end, then install using pip. The --no-cache-dir flag is important because you want to make sure you are building fresh from source rather than using a potentially pre-cached wheel. The --verbose flag will help you troubleshoot if CUDA or cuBLAS issues arise during the build.
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
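The set syntax above is for the Windows command prompt. On Linux, the same build flags can be passed inline (a sketch, assuming a bash-like shell):
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
A quick import check confirms the package built and installed correctly:
python -c "from llama_cpp import Llama; print('llama-cpp-python imported OK')"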
Use the huggingface CLI to get the appropriate model. We will use the recommended 4-bit quantized model dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf.
huggingface-cli download TheBloke/dolphin-2.5-mixtral-8x7b-GGUF dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
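Alternatively, the same file can be fetched from Python using huggingface_hub, which is already in the environment. This is just a sketch mirroring the CLI command above:
from huggingface_hub import hf_hub_download

# Download the GGUF file into the current directory without symlinks,
# matching the CLI flags above.
hf_hub_download(
    repo_id="TheBloke/dolphin-2.5-mixtral-8x7b-GGUF",
    filename="dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
    local_dir=".",
    local_dir_use_symlinks=False,
)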
Minimal Python code
Now let’s use langchain to set up a minimal running example.
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
template = """Question: {question}
Answer: Let's work this out in a step by
step way to be sure we have the right
answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path=r'.\dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf',
    n_gpu_layers=3,  # number of model layers to offload to the GPU; raise this if you have spare VRAM
    n_batch=500,  # number of tokens processed per batch
    callback_manager=callback_manager,  # streams tokens to stdout as they are generated
    verbose=True,
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is the best planet in the galaxy?"
llm_chain.run(question)
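If you want to confirm GPU offloading independently of LangChain, you can talk to llama-cpp-python directly. This is a minimal sketch, not part of the original example; n_gpu_layers=-1 asks llama.cpp to offload all the layers it can, assuming you have the VRAM for it:
from llama_cpp import Llama

# Load the same local GGUF model; verbose=True prints llama.cpp's load and
# system info, which shows whether the GPU back-end is active.
llm = Llama(
    model_path=r'.\dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf',
    n_gpu_layers=-1,
    verbose=True,
)

out = llm("Question: What is the best planet in the galaxy? Answer:", max_tokens=128)
print(out["choices"][0]["text"])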