How to Manage conda Environments
If you are using Anaconda, the popular Python distribution for data science, you know that it comes with its own package and environment manager called conda.
There are many tutorials out there that tell you to update your conda environments using conda update --all. Luckily, you know that this is a terrible idea!
When distributions like Anaconda are created, a lot of care goes into ensuring version compatibility between the included packages. Performing a bulk update with conda update --all unpins package versions and destroys all of this hard work.
So instead of letting chaos reign, let’s learn to harness the power of conda as an environment manager. Just remember the main tenet: you really should have a separate environment for each of your projects. While it may seem onerous, in the long term it will save you a lot of pain and surprises like “it was working just yesterday!”
Sidenote: Never use the base environment
The base conda environment is not for you! It is there to support the basic functionality of conda itself. The only thing you should do from the base environment is update conda:
conda update conda
You should never install anything extraneous into base.
In the examples below, we will be dealing with a hypothetical environment called envy.
Managing environments from the command line
Let’s say you are starting a new project that uses a large distribution or collection of packages (Anaconda in our examples) and a handful of individual packages.
I like to prioritize either the language version or the large distribution version as the defining feature of an environment and install them at the creation step.
So, if you need the latest version of the distribution:
conda create -n envy anaconda
Or you can pin a specific release:
conda create -n envy anaconda=2021.05
Now let’s say you care more about a specific version of python:
conda create -n envy python=3.8 anaconda
Of course, you can also specify versions of both components. But unless you have very specific needs, try to specify only one of the two, and let the package manager figure out the most compatible counterpart.
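If you script your environment setup, the three variants above can be sketched as one small helper. This is only an illustration: the function name create_command and its parameters are my own invention, not part of conda, and it only builds the argument list without running anything.

```python
import warnings


def create_command(name, python=None, distribution="anaconda", release=None):
    """Build the argument list for `conda create` (hypothetical helper)."""
    if python and release:
        # The article suggests pinning only one of the two when possible.
        warnings.warn("both Python and the distribution release are pinned")
    cmd = ["conda", "create", "-n", name]
    if python:
        cmd.append(f"python={python}")
    # Pin the distribution only when a release was requested.
    cmd.append(f"{distribution}={release}" if release else distribution)
    return cmd


# Latest distribution, pinned release, pinned Python version.
print(" ".join(create_command("envy")))
print(" ".join(create_command("envy", release="2021.05")))
print(" ".join(create_command("envy", python="3.8")))
```

The returned list can be handed to subprocess.run if you want to actually execute it.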
I usually don’t install individual packages at the environment creation step. This is because at each individual installation command, conda creates a revision (i.e., a snapshot) of the environment. If an installed package messes with the rest of the environment, you can always roll it back to a previous state.
To view the list of available snapshots:
conda list --revisions
To roll back to a specific revision:
conda install --revision N
where N is the revision number.
At the same time, don’t try to create an overly complex revision history by installing one package at a time. This can lead to dependency conflicts. Instead, group installations of related packages, or packages that tend to interact with each other a lot.
Sidenote: using pip
conda repositories may not contain all packages present in PyPI. You can use pip to install such packages, but doing so indiscriminately can break things. Always try to install as many packages with conda as possible, then finish things off with pip. If further modifications are needed to the environment, it is best to create a new environment rather than running conda after pip. Never run pip from the base environment.
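The ordering rules above (related packages grouped per conda install, pip strictly last, nothing in base) can be sketched as a small planning function. Treat it as an assumption-laden illustration: install_plan is a hypothetical helper of mine, and it only returns the commands rather than executing them. It relies on conda run -n to invoke pip inside the target environment.

```python
def install_plan(env, conda_groups, pip_packages=()):
    """Return install commands in order: grouped conda installs, then pip."""
    if env == "base":
        raise ValueError("do not install packages into the base environment")
    # One `conda install` per group, so each group becomes one revision.
    plan = [["conda", "install", "-n", env, *group]
            for group in conda_groups if group]
    if pip_packages:
        # pip goes last and runs inside the target environment.
        plan.append(["conda", "run", "-n", env,
                     "python", "-m", "pip", "install", *pip_packages])
    return plan


plan = install_plan(
    "envy",
    conda_groups=[["pytorch", "pytorch-cuda"], ["umap-learn"]],
    pip_packages=["gpt4all"],
)
for command in plan:
    print(" ".join(command))
```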
Now, let’s imagine that you’ve been using your environment for a while. You’ve written a lot of code that works well with this environment. But now you need to update, add, or remove a package. As we’ve learned, we are not going to do it to your original environment. We are going to clone it instead!
conda create --name envyMar2023 --clone envy
Then make the necessary changes in the new environment. As you see, it’s helpful to add differentiators (like a date) in the environment name to help manage them better.
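If you clone often, generating the dated name automatically keeps the naming consistent. Here is a minimal sketch, assuming the month-plus-year suffix used above; clone_command is a made-up helper name, and the function only builds the command.

```python
from datetime import date

# Fixed English abbreviations so the name does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def clone_command(env, when=None):
    """Build a `conda create --clone` command with a dated clone name."""
    when = when or date.today()
    new_name = f"{env}{MONTHS[when.month - 1]}{when.year}"
    return ["conda", "create", "--name", new_name, "--clone", env]


print(" ".join(clone_command("envy", date(2023, 3, 15))))
```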
Managing environments with YAML files
If you want to have full control over your environments, you need to learn about environment YAML files. These let you maintain reusable environment definitions at whatever level of specificity you desire. They are invaluable when you intend to recreate and update environments frequently.
To get an example of what an environment YAML file looks like, activate your target environment and run:
conda env export -f envy.yml
This will create the envy.yml file in the current working directory. It will contain the list of all installed packages, their versions, and their builds. Such a file works great when you want to reproduce an environment exactly. In our case, though, we care less about the specific versions of every package and more about defining the minimal skeleton of an environment, then letting conda come up with the best combination of versions. For this, we need to create a definition file manually:
name: envy
channels:
  - pytorch
  - nvidia
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - anaconda
  - langchain
  - psycopg2
  - pytorch
  - pytorch-cuda
  - sentence-transformers
  - umap-learn
  - pip:
    - gpt4all
    - pymupdf
    - torchaudio
    - torchvision
    - hdbscan
As in the case with the command line, I have pinned the language version, and left the other packages unpinned. Packages to be installed using pip are listed as well.
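If you would rather start from a full export than from a blank file, the pinned entries can be stripped down to skeleton form. The sketch below is one possible approach, not a conda feature: skeleton_spec is a hypothetical helper, and the version and build strings in the sample list are invented for illustration. It keeps only major.minor for the packages you choose to pin and drops versions and builds everywhere else.

```python
def skeleton_spec(spec, keep_pinned=("python",)):
    """Reduce an exported 'name=version=build' entry to a skeleton entry."""
    name, _, rest = spec.partition("=")
    if name in keep_pinned and rest:
        version = rest.split("=")[0]
        # Keep only major.minor, e.g. 3.10.9 -> 3.10.
        major_minor = ".".join(version.split(".")[:2])
        return f"{name}={major_minor}"
    return name


# Made-up sample entries in the format of a full export.
exported = [
    "python=3.10.9=h7a1cb2a_0",
    "numpy=1.23.5=py310hd5efca6_0",
    "umap-learn=0.5.3=py310h06a4308_0",
]
skeleton = [skeleton_spec(entry) for entry in exported]
print(skeleton)
```

A full export also lists every transitive dependency, so you would still trim the result by hand to the packages you actually care about. Alternatively, conda env export --from-history writes only the packages you explicitly requested.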
Now, when you want to create an environment from the YAML file:
conda env create -f envy.yml
You can also use a YAML file to update an environment (hopefully a clone of an environment that you have been using previously). The --prune flag removes packages that are no longer listed in the file:
conda env update -f envy.yml --prune
Bonus: ditch conda for mamba!
conda is a decent environment manager, but it is awfully slow. Mamba is a conda rewrite that is much faster and more reliable. I suggest installing a minimal mambaforge distribution, which comes pre-configured with the conda-forge channel and a lean base environment. Then just substitute conda in all of the previous commands with mamba, and enjoy all of the free time you gain from managing environments!