ESM Local Setup


The overview was the what and why. This post is how to run all of it on hardware I own. The goal is narrow. Get from a clean checkout to three things: ESMC embeddings, an ESMFold2 structure prediction, and sparse-autoencoder features, all on a single RTX 3090.

Two terms up front. An embedding is a protein language model's internal vector representation of each residue: a list of numbers per amino acid that encodes what the model has learned about that position in context. These vectors are what downstream models (like a folding trunk) actually consume. A sparse autoencoder is an unsupervised network trained to re-express a model's dense internal activations as a sparse combination of many interpretable "features." Only a handful of the ~16,000 features fire for any given residue, and each one tends to correspond to a recognizable piece of biology.

The pleasant surprise up front: unlike OpenFold 3, where 24 GB was a constant squeeze, ESM has plenty of headroom on a 3090. The heaviest thing I ran peaked at about 14 GB. The hard part was not VRAM or kernels. It was downloading 25 GB of weights over a flaky connection, which I will get to.

The environment

ESM ships a pixi workspace, same as OpenFold, so I went the same route. Pixi gives you a real lockfile and a named environment, and pixi install resolves the whole thing in one shot: Python 3.12, PyTorch 2.6 built against CUDA 12.4, and the rest.

pixi install

As with OpenFold, typing pixi run in front of every command gets old, so I aliased it. I am a fish user:

alias --save esmrun "pixi run"

One thing to flag right away, because it is the load-bearing dependency and it is unusual. ESM does not use mainline transformers. The pyproject.toml pins a fork:

transformers @ git+https://github.com/Biohub/transformers.git@main

That fork is where the esmc, esmc_sae, and esmfold2 model classes actually live. I half expected this to be the thing that broke the build. A git-installed fork of a library this size is exactly the kind of dependency that fights you. It resolved and imported on the first try. A quick smoke test confirms the stack is live:

esmrun python -c "import torch, transformers; \
  print(torch.__version__, torch.cuda.is_available()); \
  import transformers.models.esmfold2; print('esmfold2 ok')"
# 2.6.0+cu124 True
# esmfold2 ok

Sidebar for fellow Arch/CachyOS people: nothing GPU-specific bit me here. The 3090 is Ampere, the CUDA 12.4 wheels just work, and the “fused kernel not found” warnings below are cosmetic. This was a much smoother ride than the OpenFold kernel-compatibility dance.

The kernels you do not have, and why it is fine

The moment you load ESMC you get a wall of warnings like this:

ESMC: neither xformers nor flash-attn is installed; falling back to
PyTorch F.scaled_dot_product_attention.
ESMC: transformer_engine is not installed; falling back to pure-PyTorch
LayerNorm+Linear.
ESMC: flash-attn rotary kernel not installed; falling back to pure-PyTorch RoPE.

Do not panic. These are all optional fused kernels, and ESM degrades to stock PyTorch for every one of them. The fused paths (flash-attn, xformers, and NVIDIA’s transformer_engine) buy you speed, and in the fp8 case lower memory. But fp8 needs an H100-class card anyway, so it is irrelevant on a 3090. The warnings even quantify the numerical drift from skipping them, which works out to a few ULP after the final LayerNorm, with perplexity inside rounding noise. I ran everything below on the pure-PyTorch fallbacks and never hit a real wall.
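If you want to know ahead of time which fallbacks you will get, you can probe for the optional packages without importing them. This is a minimal sketch using only the standard library. The import names flash_attn, xformers, and transformer_engine are the usual ones for those projects, and the helper name is mine, not part of ESM.

```python
from importlib.util import find_spec

# Top-level import names of the optional fused-kernel packages.
OPTIONAL_KERNELS = ("flash_attn", "xformers", "transformer_engine")


def installed_kernels(names=OPTIONAL_KERNELS):
    """Map each package name to True if it is importable in this environment."""
    # find_spec on a top-level name locates the package without importing it.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in installed_kernels().items():
        status = "installed" if present else "missing, expect the PyTorch fallback"
        print(f"{name}: {status}")
```

On my setup all three would report missing, which matches the three warnings above.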

Running ESMC (and a bug I tripped over)

The first thing I tried was the legacy SDK path the cookbook advertises:

from esm.models.esmc import ESMC
model = ESMC.from_pretrained("esmc_300m", use_flash_attn=False)

This downloads the weights and then dies:

ValueError: Directory '.../models--biohub--esmc-300m-2024-12/snapshots/...'
does not contain a valid checkpoint.

It is a real bug, and a small one. The loader calls huggingface_hub.load_torch_model(model, snapshot_dir), which scans the root of the snapshot for a checkpoint. But in this repo the 300M and 600M weights ship one level down, at data/weights/esmc_300m_2024_12_v0.pth, and the helper never looks there. The 6B model is fine, because it ships sharded safetensors with an index at the root, which is exactly what the helper expects. Loading the nested .pth directly works:

The workaround, and a candidate PR

import glob, torch
from accelerate import init_empty_weights
from esm.models.esmc import ESMC
from esm.tokenization import get_esmc_model_tokenizers
from esm.utils.constants.esm3 import data_root

# Build the 300M architecture without allocating real weight memory yet.
with init_empty_weights():
    model = ESMC(d_model=960, n_heads=15, n_layers=30,
                 tokenizer=get_esmc_model_tokenizers(), use_flash_attn=False).eval()
# The checkpoint sits one level below the snapshot root, where the helper never looks.
pth = glob.glob(str(data_root("esmc-300") / "data/weights/*.pth"))[0]
# assign=True swaps the loaded tensors in for the empty placeholder parameters.
model.load_state_dict(torch.load(pth, map_location="cpu"), assign=True)
model = model.to("cuda")

The fix in the library is to point the loader at the nested data/weights/ path for the 300M and 600M builders, the same way the ESM3 builders in that file already use torch.load. It is a small, self-contained change, the kind of newcomer-sized contribution I like as an on-ramp.
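To make the shape of that fix concrete, here is a sketch of a lookup that tries the snapshot root first and falls back to the nested directory. It is not the library's code. The function name and the list of root-level filenames are my assumptions (the filenames are the standard Hugging Face ones), and only the data/weights/*.pth layout comes from what I saw on disk.

```python
from pathlib import Path

# Filenames a root-level scan would plausibly accept (standard Hugging Face names).
ROOT_CANDIDATES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
)


def find_checkpoint(snapshot_dir):
    """Return a checkpoint path: root-level file first, nested .pth as fallback."""
    root = Path(snapshot_dir)
    for name in ROOT_CANDIDATES:
        if (root / name).is_file():
            return root / name
    # Fallback for the 300M/600M layout: weights live under data/weights/.
    weights_dir = root / "data" / "weights"
    nested = sorted(weights_dir.glob("*.pth")) if weights_dir.is_dir() else []
    if nested:
        return nested[0]
    raise FileNotFoundError(f"no checkpoint found under {root}")
```

The 6B layout would hit the first branch through its index file, and the 300M and 600M layouts would hit the fallback.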

The cleaner route, and the one the README actually leads with, is the Hugging Face transformers API. It has no such issue:

import torch
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("biohub/ESMC-6B",
                                  dtype=torch.bfloat16, device_map="cuda").eval()

Numbers from my box, with a 260-residue protein:

| Model | Params | VRAM (loaded) | Forward pass |
|---|---|---|---|
| ESMC-300M | 333 M | ~1.3 GB | 1.2 s (cold) |
| ESMC-6B (bf16) | 6 B | ~12.7 GB | 0.6 s |

The 6B model loading into 12.7 GB is the headline number for the whole post. In bfloat16 it leaves more than enough room on a 24 GB card, which is what makes everything downstream comfortable.
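The loaded-VRAM numbers are roughly what parameter count times bytes per parameter predicts. The sketch below is weights-only arithmetic. It uses the nominal parameter counts from the table, the helper name is mine, and it ignores activations, buffers, and the CUDA context, so it should land a little under what the card reports.

```python
# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_gb(n_params, dtype):
    """Weights-only footprint in GB (1e9 bytes); ignores activations and CUDA context."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


print(weight_gb(333e6, "fp32"))  # 1.332
print(weight_gb(6e9, "bf16"))    # 12.0
print(weight_gb(6e9, "fp32"))    # 24.0
```

The first figure is consistent with the 300M model sitting in fp32, the second with the bf16 load of the 6B model, and the third with the size of the fp32 shards on disk.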

Running ESMFold2

ESMFold2 is the structure predictor, and it is the reason you need the 6B model. The folding trunk is tiny, a roughly 0.9 GB diffusion and pairformer module, but it consumes ESMC-6B embeddings, so loading it pulls the full backbone:

from transformers.models.esmfold2.modeling_esmfold2 import ESMFold2Model
from esm.models.esmfold2 import ProteinInput, StructurePredictionInput, ESMFold2InputBuilder

# Human ubiquitin, 76 residues.
UBIQUITIN = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"

model = ESMFold2Model.from_pretrained("biohub/ESMFold2").cuda().eval()
spi = StructurePredictionInput(sequences=[ProteinInput(id="A", sequence=UBIQUITIN)])
result = ESMFold2InputBuilder().fold(
    model, spi, num_loops=10, num_sampling_steps=100, num_diffusion_samples=1, seed=0
)
print(result.plddt.mean(), result.ptm)
# Write the predicted structure as mmCIF; the context manager closes the file.
with open("ubq.cif", "w") as fh:
    fh.write(result.complex.to_mmcif())

On the 3090, single-sequence mode:

  • Loading the ESMC-6B backbone plus trunk takes about 28 s and settles at 13.7 GB.
  • Folding ubiquitin (76 aa) at 10 loops by 100 steps takes 5.6 s.
  • Peak VRAM is 14.0 GB, comfortably inside 24 GB.

That fold is the ubiquitin structure from the overview, which lands 1.12 Å from the experimental crystal structure. One API gotcha worth knowing. Pass num_diffusion_samples > 1 and fold() returns a list of results, one per sample, not a single object. Rank them by pTM and take the best.

Two definitions, since both matter here. AlphaFold 3 and its relatives generate atomic coordinates with a diffusion model, starting from noise and denoising into a structure. Each sample is one such run, so more samples means more candidate structures to rank. pTM is the predicted TM-score: a single 0 to 1 number estimating how close the predicted fold is to the true structure overall. Higher is better, and values above about 0.5 usually mean the global topology is right. ipTM is the same idea scored across chains in a complex.
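A small wrapper makes the two return shapes of fold() uniform. This is a sketch, not part of the ESM API. FakeResult is a hypothetical stand-in so the snippet runs without a GPU, and the only attribute it relies on is .ptm, which the real result exposes in the code above.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for a fold() result; only .ptm is used here."""
    ptm: float


def best_by_ptm(result_or_list):
    """Accept a single result or a list of them and return the highest-pTM one."""
    if isinstance(result_or_list, list):
        results = result_or_list
    else:
        results = [result_or_list]
    # float() also handles a zero-dimensional tensor if .ptm is one.
    return max(results, key=lambda r: float(r.ptm))


samples = [FakeResult(0.61), FakeResult(0.88), FakeResult(0.74)]
print(best_by_ptm(samples).ptm)  # 0.88
```

With the real model you would pass whatever fold() returned straight into best_by_ptm, regardless of num_diffusion_samples.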

Running the sparse autoencoders

The SAEs were the part I was most curious about, and the easiest to run once ESMC-6B is cached. You attach an SAE to specific layers of the loaded language model, and it returns, per residue, a sparse vector over a roughly 16,000-entry feature codebook. The one detail that saves you a lot of disk is to download only the layers you want with allow_patterns, not the whole 27 GB SAE repo.

sae = AutoModel.from_pretrained(
    "biohub/ESMC-6B-sae-k64-codebook16384",
    allow_patterns=["config.json", "layer_30.safetensors", "layer_60.safetensors"],
    device=model.device,
)
sae.initialize_layers([30, 60])
model.add_sae_models([sae.layers["30"], sae.layers["60"]])
out = model(**inputs)
out["sae_outputs"]["layer60"]   # torch.sparse_coo, shape (L, 16384)

For a 145-residue input this added almost no time (the forward was 0.6 s) and peaked at 13.4 GB. The output is a sparse COO tensor with exactly 64 active features per residue. The k64 in the model name is the sparsity budget. What those 64 features mean is the payoff, and it is where the deep dive goes next.
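Checking the 64-per-residue claim only needs the row indices of the sparse tensor. The sketch below works on a plain list of row indices so it runs without torch. With the real output I would expect to get that list from something like t.coalesce().indices()[0].tolist(), which is standard PyTorch for a sparse COO tensor but untested against this model's output. The helper names are mine.

```python
from collections import Counter


def active_per_residue(rows, n_residues):
    """Count stored SAE features per residue from the row indices of a COO tensor."""
    counts = Counter(rows)
    return [counts.get(i, 0) for i in range(n_residues)]


def check_budget(rows, n_residues, k=64):
    """True if every residue has exactly k active features."""
    return all(c == k for c in active_per_residue(rows, n_residues))


# Toy example: 3 residues, 64 stored entries each.
toy_rows = [i for i in range(3) for _ in range(64)]
print(check_budget(toy_rows, 3))  # True
```

One caveat: this counts stored entries, so an explicitly stored zero would be counted as active.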

The 24 GB question: it is not the bottleneck

This is the inverted version of the OpenFold post. There, the whole back half was about fighting 24 GB. The full picture for ESM:

| Workload | Peak VRAM | Fits on a 3090? |
|---|---|---|
| ESMC-300M embeddings | ~2 GB | trivially |
| ESMC-6B embeddings (bf16) | ~13 GB | comfortably |
| ESMFold2 fold (76 aa) | ~14 GB | comfortably |
| ESMC-6B plus SAE (145 aa) | ~13 GB | comfortably |

The headroom is real, with one asterisk. I tested small proteins. ESMFold2’s memory grows with sequence length and with the number of chains in a complex, and the diffusion sampler’s cost scales with num_sampling_steps × num_loops × num_diffusion_samples. A big multi-chain complex with a lot of samples is a different story, and finding exactly where that falls over on 24 GB is a follow-up I want to do. But for “fold a protein and look at it,” the 3090 is not the constraint.
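The scaling product is easy to turn into a rough planning number. This sketch only encodes the proportionality stated above, relative to the 10-loop, 100-step, 1-sample setting I used for ubiquitin. It is not a timing model, it ignores fixed overheads and sequence length, and the function names are mine.

```python
def sampler_cost(num_loops, num_sampling_steps, num_diffusion_samples):
    """Unitless work estimate: the product the sampler's cost scales with."""
    return num_loops * num_sampling_steps * num_diffusion_samples


def relative_cost(num_loops, num_sampling_steps, num_diffusion_samples,
                  baseline=(10, 100, 1)):
    """Cost relative to a baseline (loops, steps, samples) setting."""
    work = sampler_cost(num_loops, num_sampling_steps, num_diffusion_samples)
    return work / sampler_cost(*baseline)


print(relative_cost(10, 100, 5))  # 5.0
print(relative_cost(10, 200, 5))  # 10.0
```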

The actual hard part: the download

The constraint turned out to be bytes. ESMC-6B is 25 GB on disk, six fp32 safetensors shards, and ESMFold2 needs all of it. On my connection, snapshot_download stalled twice and sat dead for an hour each time before I noticed.

The root cause is worth knowing if you hit it. The huggingface_hub downloader has a connection timeout but no body-read timeout. When the connection half-opens and the byte stream goes silent mid-shard, the download does not error and retry. It just blocks forever. My first attempt at a fix, bumping etag_timeout, did nothing, because that only covers the metadata request, not the file-content stream.
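For illustration, this is roughly what a stall check on the body stream looks like, written as a pure function over progress samples. It is my own sketch of the idea, not huggingface_hub code, and the function name and sample format are assumptions.

```python
def is_stalled(samples, min_bytes_per_s, window_s):
    """Decide whether a download has stalled.

    samples: list of (time_in_seconds, total_bytes_received), ascending in time.
    Returns True if average throughput over at least the last window_s seconds
    is below min_bytes_per_s. Returns False while there is too little history.
    """
    if not samples:
        return False
    t_end, b_end = samples[-1]
    start = None
    # Find the latest sample that is at least window_s older than the newest one.
    for t, b in samples:
        if t <= t_end - window_s:
            start = (t, b)
        else:
            break
    if start is None:
        return False
    t0, b0 = start
    return (b_end - b0) / (t_end - t0) < min_bytes_per_s


healthy = [(0, 0), (10, 1_000_000), (20, 2_000_000), (30, 3_000_000)]
silent = [(0, 0), (10, 5_000_000), (20, 5_000_100), (30, 5_000_200)]
print(is_stalled(healthy, 30_000, 20))  # False
print(is_stalled(silent, 30_000, 20))   # True
```

A downloader that ran a check like this on every chunk could drop the connection and retry instead of blocking.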

What worked was abandoning the Python downloader for curl, which can abort a stalled stream:

curl -L --continue-at - \
     --connect-timeout 30 --speed-limit 30000 --speed-time 20 \
     --retry 1000 --retry-all-errors --retry-delay 5 \
     -o "$blob" "$url"

--speed-time 20 --speed-limit 30000 is the part that matters. If throughput drops below 30 KB/s for 20 seconds, curl kills the connection, and --retry with --continue-at - reconnects and resumes from where it left off. I pointed it at the Hugging Face resolve/ URLs for each shard, wrote into the same blob paths the HF cache uses, then sha256-verified every shard before loading. A download that has been truncated and resumed a dozen times is exactly the kind of thing that silently corrupts. All six verified clean.
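The verification step is a few lines of standard-library Python. This sketch streams each file through SHA-256 in chunks, so a multi-gigabyte shard never has to fit in RAM. Where the expected hash comes from is up to you. I am assuming you copy it from the file's page on the Hub, and the function names are mine.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA-256 matches the expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call verify(blob_path, expected) once per shard and refuse to load the model if any of them returns False.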

If you are on a fast, stable connection none of this matters and snapshot_download is fine. But if you are watching a progress bar sit at “15.7 GB” for the third time, reach for curl.

A structure I actually ran

To confirm the whole stack works end to end, not just that it imports cleanly, I folded sperm whale myoglobin from sequence alone. Myoglobin is a fitting smoke test: 153 residues, all alpha-helical, and the first protein structure ever solved by crystallography (Kendrew, 1958). About as friendly a target as exists.

Figure: Sperm whale myoglobin (P02185), predicted with ESMFold2 in single-sequence mode · avg pLDDT ≈ 88, pTM ≈ 0.90 · colored by pLDDT on the AlphaFold confidence scale (blue high, orange low)

ESMFold2 folds it beautifully, the whole thing deep-blue high-confidence. Superposed on the 1.6 Å crystal structure (PDB 1MBO) it lands at 0.69 Å Cα RMSD over the 151-residue rigid core, from a single sequence, no MSA, in about twelve seconds on the 3090. This is the single-sequence bet paying off as advertised: for a well-represented, locally packed helical fold, the language model has internalized everything it needs. Worth holding onto that number, because the deep dive folds something the model finds much harder.
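For anyone who wants to reproduce a number like that, Cα RMSD after superposition is usually computed with the Kabsch algorithm. The sketch below is a generic NumPy version, not the script I used. It assumes you have already extracted two equal-length, residue-matched arrays of Cα coordinates, and choosing which residues count as the rigid core is a separate step it does not cover.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T
    return float(np.sqrt(((P_rot - Qc) ** 2).sum(axis=1).mean()))
```

Feeding it the same structure rotated and translated should return a value at floating-point noise level, which is a quick sanity check before trusting it on real coordinates.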

Where I’m headed

Everything runs. The deep dive gets into how it is built: ESMC as a language model, how ESMFold2 bolts a diffusion structure module onto it, and what the sparse-autoencoder features turn out to mean.