
Here is a quick troubleshooting walkthrough of how I set up dolphin-2.5-mixtral-8x7b to run locally with the cuBLAS back-end for hardware acceleration. I hit a few annoying hang-ups in the process, so hopefully this little note can help anyone running into the same issues.

Set up the environment

First, make sure you have CUDA installed:

nvcc --version

Create an environment definition file mistral.yml to be installed by mamba/conda:

name: mistral
channels:
  - pytorch
  - nvidia
  - defaults
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - anaconda
  - langchain
  - huggingface_hub
  - pytorch
  - pytorch-cuda=12.1
  - pip:
      - torchaudio
      - torchvision

Notably, you will see that llama-cpp-python is missing from the package list. We will install it separately using pip with additional build parameters to enable cuBLAS.

Create and switch into the environment:

mamba env create -f mistral.yml
mamba activate mistral
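
Before moving on, it can be worth checking that the PyTorch build inside the new environment actually sees your GPU. This is just a quick sanity check of my own, not part of the original setup:

import torch

# Should print True if the CUDA build of PyTorch can see the GPU,
# plus the CUDA version it was built against (12.1 for this environment)
print(torch.cuda.is_available())
print(torch.version.cuda)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU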

Set environment variables so that CMake compiles llama.cpp with the cuBLAS back-end, then install with pip. The --no-cache-dir flag is important because you want to make sure you are building fresh from source rather than reusing a potentially pre-cached wheel. The --verbose flag will help you troubleshoot if CUDA or cuBLAS issues arise during the build.

set CMAKE_ARGS=-DLLAMA_CUBLAS=on  
set FORCE_CMAKE=1  
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
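
If you want a quick check that the freshly built wheel imports cleanly, something like the following is enough (a small sanity check, not strictly required):

import llama_cpp

# A clean import means the newly built wheel is on the path;
# the version string shows which llama-cpp-python release was compiled
print(llama_cpp.__version__)

During the install itself, the --verbose output should also mention cuBLAS somewhere in the CMake configuration step; if it does not, the CMAKE_ARGS variable most likely was not picked up by the build.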

Use the Hugging Face CLI to download the appropriate model. We will use the recommended 4-bit quantized model dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf.

huggingface-cli download TheBloke/dolphin-2.5-mixtral-8x7b-GGUF dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
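
Alternatively, since huggingface_hub is already in the environment, the same file can be fetched from Python. This is just an equivalent sketch of the CLI command above:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF file into the current directory
hf_hub_download(
    repo_id="TheBloke/dolphin-2.5-mixtral-8x7b-GGUF",
    filename="dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
    local_dir=".",
)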

Minimal Python code

Now let’s use langchain to set up a minimal running example.

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by 
step way to be sure we have the right 
answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path=r'.\dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf',
    n_gpu_layers=3,   # number of transformer layers offloaded to the GPU
    n_batch=500,      # tokens processed in parallel per batch
    callback_manager=callback_manager,  # streams tokens to stdout as they are generated
    verbose=True,     # print llama.cpp load details, useful to confirm cuBLAS is active
    )

llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is the best planet in the galaxy?"
llm_chain.run(question)
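
With n_gpu_layers=3 only a few layers end up on the GPU; depending on how much VRAM you have, you can raise this number (llama-cpp-python treats -1 as "offload every layer") and watch the verbose load output to see how many layers actually land on the GPU.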