Setting up to run Mistral LLM locally
Mistral is a series of Large Language Models (LLMs) developed by the French company Mistral AI. These models are notable for their high performance and efficiency in generating text. Many of Mistral's models are open source and can be used without restriction under the Apache 2.0 license. Here is a quick troubleshooting walkthrough of how I set up dolphin-2.5-mixtral-8x7b to run locally with the cuBLAS back-end for hardware acceleration. I hit a few annoying hang-ups in the process, so hopefully this little note helps anyone experiencing the same.
Set up the environment
First, make sure you have CUDA installed:
nvcc --version
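If nvcc is not on your PATH but you do have an NVIDIA GPU, nvidia-smi is another quick way to confirm the driver is installed and see the highest CUDA version it supports (an optional sanity check, not strictly required):
nvidia-smi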
Create an environment definition file mistral.yml to be installed by mamba/conda:
name: mistral
channels:
  - pytorch
  - nvidia
  - defaults
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - anaconda
  - langchain
  - huggingface_hub
  - pytorch
  - pytorch-cuda=12.1
  - pip:
      - torchaudio
      - torchvision
Notably, llama-cpp-python is missing from the package list. We will install it separately using pip, with additional build parameters to enable cuBLAS.
Create and switch into the environment:
mamba env create -f mistral.yml
mamba activate mistral
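Optionally, verify that the PyTorch build inside the new environment actually sees your GPU before moving on (a quick sanity check, not strictly part of the setup):
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"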
Set environment variables so that CMake compiles llama.cpp with the cuBLAS back-end, then install using pip. The --no-cache-dir flag is important because you want to make sure you are building fresh from source rather than using a potentially pre-cached wheel. The --verbose flag will help you troubleshoot if CUDA or cuBLAS issues arise during the build.
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
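The set syntax above is for the Windows command prompt. On Linux, the same build flags can be passed inline (a sketch, assuming a bash-like shell):
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
A quick import check confirms the package built and installed correctly:
python -c "from llama_cpp import Llama; print('llama-cpp-python imported OK')"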
Use the huggingface CLI to get the appropriate model. We will use the recommended 4-bit quantized model dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf.
huggingface-cli download TheBloke/dolphin-2.5-mixtral-8x7b-GGUF dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
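Alternatively, the same file can be fetched from Python using huggingface_hub, which is already in the environment. This is just a sketch mirroring the CLI command above:
from huggingface_hub import hf_hub_download

# Download the GGUF file into the current directory without symlinks,
# matching the CLI flags above.
hf_hub_download(
    repo_id="TheBloke/dolphin-2.5-mixtral-8x7b-GGUF",
    filename="dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
    local_dir=".",
    local_dir_use_symlinks=False,
)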
Minimal Python code
Now let’s use langchain to set up a minimal running example.
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
template = """Question: {question}
Answer: Let's work this out in a step by
step way to be sure we have the right
answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path=r'.\dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf',
    n_gpu_layers=3,  # number of model layers to offload to the GPU; raise this if you have spare VRAM
    n_batch=500,  # number of tokens processed per batch
    callback_manager=callback_manager,  # streams tokens to stdout as they are generated
    verbose=True,
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is the best planet in the galaxy?"
llm_chain.run(question)
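If you want to confirm GPU offloading independently of LangChain, you can talk to llama-cpp-python directly. This is a minimal sketch, not part of the original example; n_gpu_layers=-1 asks llama.cpp to offload all the layers it can, assuming you have the VRAM for it:
from llama_cpp import Llama

# Load the same local GGUF model; verbose=True prints llama.cpp's load and
# system info, which shows whether the GPU back-end is active.
llm = Llama(
    model_path=r'.\dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf',
    n_gpu_layers=-1,
    verbose=True,
)

out = llm("Question: What is the best planet in the galaxy? Answer:", max_tokens=128)
print(out["choices"][0]["text"])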