ESM Technical Deep Dive
- 1 A Brief Overview of ESM
- 2 ESM Local Setup
- 3 ESM Technical Deep Dive (you are here)
- 4 Can You Write Down a Protein Model's Mind in English?
The overview was the pitch and the setup post got it running. This one is the architecture. What the three pieces actually are under the hood, and how they compose. I will go in dependency order, because that is how the system is built. ESMC is the foundation, ESMFold2 sits on top of it, and the sparse autoencoders hang off its side.
The shape of the codebase
There are really two repositories in play. The esm package itself is fairly lean. Model definitions for the older ESM3 line, the ESMC modules, tokenizers, an SDK for the hosted platform, and a pile of structure utilities. The new model classes, esmc, esmc_sae, and esmfold2, actually live in a fork of Hugging Face transformers, registered as normal AutoModel architectures.
That is an interesting packaging choice. Rather than ship a bespoke model class, EvolutionaryScale taught transformers about their models. So AutoModel.from_pretrained("biohub/ESMC-6B") works with the rest of the HF ecosystem out of the box: tokenizers, device_map, dtype handling. The cost is the git-pinned fork, which is the one dependency I would worry about long term.
ESMC: a language model that happens to be about proteins
ESMC is a bidirectional transformer encoder trained with masked language modeling. Architecturally it is much closer to BERT than to GPT. Protein design and analysis care about infilling, meaning what belongs at this position given everything around it, more than they care about left-to-right generation. The building blocks are the modern-LLM kit:
- Pre-LayerNorm transformer blocks (UnifiedTransformerBlock), attention plus FFN with residual streams.
- RoPE (rotary position embeddings) rather than learned or sinusoidal position encodings, the same choice most current LLMs have converged on.
- SwiGLU feed-forward with a 4x expansion ratio, hidden dim rounded to a multiple of 256 for kernel alignment.
- A tiny vocabulary: the 20 amino acids plus a handful of special tokens, around 64 total. The per-position logits (the raw, pre-softmax scores the model emits over its vocabulary) become a distribution over that vocabulary after a softmax. They tell you which residues the model thinks are plausible at each position, which is why a single forward pass gives you a per-residue likelihood you can use to score mutations.
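To make the RoPE bullet concrete, here is a minimal pure-Python sketch of rotary embeddings on a toy 4-dimensional query and key. The function name `rope`, the base of 10000, and the adjacent-pair layout are my assumptions for illustration, not a transcription of the ESMC code, and real implementations differ in how they pair up dimensions. It demonstrates the property that makes RoPE attractive: the attention score depends only on the offset between two positions, not on where they sit in the sequence.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive (x, y) pairs of `vec` by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        # Each pair gets its own frequency; early pairs rotate fastest.
        theta = position * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up numbers).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 residues apart) at two very different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 2))
score_far_along = dot(rope(q, 105), rope(k, 102))

# A different offset (1 residue apart) gives a different score.
score_other_offset = dot(rope(q, 5), rope(k, 4))
```

The first two scores agree up to floating-point error, while the third does not, which is the relative-position behavior that learned absolute embeddings do not give you for free.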
It is released at three sizes, and the scaling is clean:
| Model | d_model | heads | layers | params |
|---|---|---|---|---|
| ESMC-300M | 960 | 15 | 30 | 333 M |
| ESMC-600M | 1152 | 18 | 36 | ~600 M |
| ESMC-6B | 2560 | 40 | 80 | 6 B |
The whole ESM thesis is that pushing up this table, more parameters on the same objective, keeps buying you emergent capability. Better long-range structural understanding that nobody trained for directly. The 6B model is the one ESMFold2 and the SAEs are built on, presumably because that structure signal is strongest there.
What you pull out of ESMC is one of three things. The logits, for masked-position and mutation scoring. The embeddings, one vector per residue encoding what the model has learned about that position in context, which are the residual-stream vectors that downstream models such as a folding trunk consume. Or the hidden states at any layer, for probing what is represented where.
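As a sketch of what mutation scoring from logits can look like, the snippet below turns one position's logits into log-probabilities and scores a substitution as a log-likelihood ratio against the wild-type residue. The logits are made up, the vocabulary is cut down to the 20 amino acids in an arbitrary order (the real tokenizer has around 64 tokens and its own indexing), and `mutation_score` is a hypothetical helper, not part of the esm package.

```python
import math

# Assumed toy vocabulary: just the 20 amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def mutation_score(position_logits, wild_type, mutant):
    """log p(mutant) - log p(wild_type) at a single position; negative means disfavored."""
    log_probs = log_softmax(position_logits)
    return log_probs[AMINO_ACIDS.index(mutant)] - log_probs[AMINO_ACIDS.index(wild_type)]

# Made-up logits for one masked position where the model strongly prefers leucine.
toy_logits = [0.0] * len(AMINO_ACIDS)
toy_logits[AMINO_ACIDS.index("L")] = 4.0
toy_logits[AMINO_ACIDS.index("I")] = 3.0
toy_logits[AMINO_ACIDS.index("D")] = -2.0

conservative = mutation_score(toy_logits, "L", "I")  # leucine -> isoleucine
disruptive = mutation_score(toy_logits, "L", "D")    # leucine -> aspartate
```

Because the normalizer cancels, the score is just the difference of the two logits, so the conservative swap is penalized far less than the charged one.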
ESMFold2: diffusion on top of language
ESMFold2 is the part that goes up against AlphaFold 3, and it is not a from-scratch structure predictor. It is a folding head bolted onto frozen ESMC-6B embeddings. The config describes it tidily: “SWA atom encoders with 3D RoPE, a diffusion transformer, a folding trunk, and an ESMC 6B PLM backbone.” Unpacking that:
- The ESMC-6B backbone embeds the sequence. This is where the evolutionary knowledge enters, and it is why ESMFold2 can run without an MSA (multiple sequence alignment: a stack of evolutionarily related sequences for the same protein, whose patterns of residues changing together are a big part of how these models infer 3D structure). AlphaFold-style models read evolution out of an MSA at inference. ESMFold2 reads it out of the language model’s weights.
- A folding trunk, a relatively shallow stack. The confidence head’s trunk is just 4 layers. That is small next to AlphaFold’s 48-block Evoformer, and it is why the trunk weights are under a gigabyte. The heavy lifting was moved into pre-training the backbone.
- A diffusion transformer that generates atomic coordinates, the same move AlphaFold 3 made over AlphaFold 2’s direct coordinate regression. Atoms get 3D RoPE, and coordinates are denoised from noise over num_sampling_steps steps. Each sample is one such denoising run, so more samples means more candidate structures to rank.
- A confidence head that predicts the familiar bins: pLDDT (predicted Local Distance Difference Test, an AlphaFold-style per-residue confidence score from 0 to 100, where higher means the model is more sure about how that part of the structure is positioned; 50 bins), PAE and PDE (64 bins each), and a 128-bin distogram.
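For readers wondering how 50 bins become a single 0 to 100 score, the usual AlphaFold-style convention is to softmax the bin logits and take the expected value over bin centers. The sketch below assumes ESMFold2 decodes pLDDT the same way, which I have not verified against the code, and `expected_plddt` is a hypothetical name.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_plddt(bin_logits):
    """Collapse per-residue bin logits into one 0-100 score (expected value over bin centers)."""
    probs = softmax(bin_logits)
    width = 100.0 / len(probs)  # 50 bins -> each bin covers 2 pLDDT points
    centers = [(i + 0.5) * width for i in range(len(probs))]
    return sum(p * c for p, c in zip(probs, centers))

# A residue the model has no opinion about: flat logits across all 50 bins.
unsure = expected_plddt([0.0] * 50)

# A residue the model is confident about: nearly all mass in the 94-96 bin.
peaked_logits = [-10.0] * 50
peaked_logits[47] = 10.0
confident = expected_plddt(peaked_logits)
```

Flat logits land in the middle of the scale, and a sharply peaked distribution lands at the center of its bin, so the scalar you see in a colored structure is a summary of a full distribution.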
The fold() call exposes the knobs that matter. num_loops recycles the trunk, num_sampling_steps is the diffusion denoising count, and num_diffusion_samples is how many independent structures to generate and rank by pTM (predicted TM-score: a single 0 to 1 number estimating how close the predicted fold is to the true structure overall; higher is better, and values above about 0.5 usually mean the global topology is right; ipTM is the same idea scored across chains in a complex). On my ubiquitin run, with 20 loops, 200 steps, and 3 samples, the best sample came back at pTM 0.76 and matched the crystal structure to 1.12 Å, all in about 14 GB.
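The ranking step is simple enough to sketch without the model. The `FoldSample` container and its field names are hypothetical, and apart from the 0.76 that my best sample reported, the scores are invented for illustration. The point is only that generating several diffusion samples buys you a pool to pick from.

```python
from dataclasses import dataclass

@dataclass
class FoldSample:
    """Hypothetical container for one diffusion sample's summary scores."""
    sample_id: int
    ptm: float
    mean_plddt: float

def best_by_ptm(samples):
    """Return the sample with the highest predicted TM-score."""
    return max(samples, key=lambda s: s.ptm)

# Three samples, as with num_diffusion_samples=3. Scores are illustrative.
samples = [
    FoldSample(sample_id=0, ptm=0.71, mean_plddt=82.0),
    FoldSample(sample_id=1, ptm=0.76, mean_plddt=84.5),
    FoldSample(sample_id=2, ptm=0.68, mean_plddt=80.1),
]
best = best_by_ptm(samples)
```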
One detail I appreciated while reading modeling_esmfold2.py. There is a whole fp8 and transformer_engine path for H100-class hardware that re-quantizes the ESMC backbone’s weights to fp8 to save memory. On a 3090 it is gated behind a TE_AVAILABLE flag and falls back to bf16, so the consumer-GPU path is a first-class citizen rather than an afterthought.
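The gating pattern looks roughly like the sketch below. This is my reconstruction of the idea, not the code in modeling_esmfold2.py: the flag name TE_AVAILABLE comes from the source, but `choose_backbone_dtype` and the compute-capability check are assumptions I added to show why a 3090 (compute capability 8.6) ends up on bf16 even if the library happens to be installed.

```python
import importlib.util

# True only if the optional transformer_engine package can be imported.
TE_AVAILABLE = importlib.util.find_spec("transformer_engine") is not None

def choose_backbone_dtype(compute_capability, te_available=TE_AVAILABLE):
    """Pick a weight dtype name; fp8 needs both the library and fp8-capable hardware."""
    # Assumption: fp8 tensor cores start at Ada (8.9) and Hopper (9.0).
    has_fp8_hardware = tuple(compute_capability) >= (8, 9)
    if te_available and has_fp8_hardware:
        return "fp8"
    return "bf16"

rtx_3090 = choose_backbone_dtype((8, 6), te_available=True)
h100 = choose_backbone_dtype((9, 0), te_available=True)
h100_without_te = choose_backbone_dtype((9, 0), te_available=False)
```

Either missing condition drops you to bf16, which is the behavior that keeps the consumer-GPU path working without special casing.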
Why single-sequence folding is a different tradeoff, not strictly better
It is tempting to read “no MSA needed” as “MSAs are obsolete,” and that is not the claim. The MSA is an explicit, per-target summary of evolutionary variation. The language model is an amortized, learned-once summary of all variation. For proteins well represented in the training distribution, ubiquitin say, the amortized version is plenty. For a genuinely novel or poorly-sampled sequence, an MSA built fresh for that target can still carry signal the language model never saw enough of to internalize. The two approaches degrade in different places, which is why both still exist.
The sparse autoencoders: reading the model’s mind
This is the piece with no AlphaFold analog, and the one I keep coming back to. A sparse autoencoder is a small unsupervised network trained to re-express ESMC’s dense residual-stream activations, 2560 numbers per residue, as a sparse combination of a much larger feature codebook. The variant I ran has a 16,384-feature codebook and a top-k of 64. For any given residue, the SAE describes ESMC’s state as a blend of just 64 features pulled from those 16,384. I confirmed exactly that on my own run. The output is a torch.sparse_coo tensor with precisely 64 non-zeros per position.
Mechanically it is a simple idea. Encode the dense vector into 16,384 candidate activations, keep the 64 largest, zero the rest, decode back, and train to reconstruct. The bet, which comes from the LLM interpretability literature this borrows from, is that the sparse basis is far more interpretable than the dense one. Individual features tend to fire for recognizable, monosemantic pieces of biology, a particular binding motif, a structural element, a functional family, rather than the polysemantic mush of raw neurons.
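Those four steps fit in a few lines. The sketch below is a toy top-k sparse autoencoder forward pass in pure Python, with random untrained weights and tiny dimensions standing in for 2560, 16,384, and 64. All names are mine, and details such as whether a ReLU is applied before the top-k, or whether the decoder bias is subtracted from the input first, vary between SAE implementations and are not taken from the ESM code.

```python
import random

def encode(x, w_enc, b_enc):
    """Dense pre-activations: one score per codebook feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]

def top_k(pre_acts, k):
    """Keep the k largest activations as {feature_index: value}; the rest are implicitly zero."""
    keep = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k]
    return {i: pre_acts[i] for i in keep}

def decode(sparse, w_dec, b_dec):
    """Reconstruct the dense vector as a weighted sum of the active features' decoder rows."""
    recon = list(b_dec)
    for i, act in sparse.items():
        for j, w in enumerate(w_dec[i]):
            recon[j] += act * w
    return recon

def mse(a, b):
    """Reconstruction error, the quantity training would minimize."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
D_MODEL, N_FEATURES, K = 4, 16, 3  # toy stand-ins for 2560, 16,384 and 64
w_enc = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_FEATURES)]
w_dec = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_FEATURES)]
b_enc = [0.0] * N_FEATURES
b_dec = [0.0] * D_MODEL

# One residue's dense activation (random here; a real one comes from ESMC).
residue_activation = [random.gauss(0, 1) for _ in range(D_MODEL)]
pre_acts = encode(residue_activation, w_enc, b_enc)
sparse_features = top_k(pre_acts, K)
reconstruction = decode(sparse_features, w_dec, b_dec)
loss = mse(residue_activation, reconstruction)
```

Storing only the surviving index-value pairs is the same idea as the sparse COO output I saw: exactly k entries per position, everything else zero by omission. With untrained weights the loss is meaningless, and training is what makes those k features line up with anything a biologist would recognize.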
EvolutionaryScale ships SAEs for every one of ESMC-6B’s 80 layers, and ships natural-language descriptions for the 16,384 features, generated by a pipeline that maps each feature onto known biology from protein databases. The way you use them is nice. You load only the layers you want, add_sae_models them onto the live ESMC model, and a normal forward pass now returns sae_outputs alongside the usual logits and embeddings. Interpretation as a side channel, with no second model to run.
How inference runs, end to end
Stepping back, the data flow for the full system is:
sequence
└─ ESMC-6B (80-layer MLM encoder, RoPE + SwiGLU)
   ├─ logits -> mutation / likelihood scoring
   ├─ embeddings ----┬-> ESMFold2 trunk + diffusion -> coordinates + pLDDT/pTM
   └─ hidden states -┴-> SAE (top-64 of 16,384) -> interpretable features
It is a composable design. One backbone forward pass is the expensive step, about 0.6 s for the 6B model on a 3090, and the three heads, language modeling, folding, and interpretability, all hang off it. That is a different philosophy from a monolithic structure predictor, and it is what makes ESM feel less like an AlphaFold competitor than a foundation you build other tools on. Which is the framing the overview was pointing at from the start.
A structure the model gets wrong
Every other fold in this series came back clean, so here is the counterexample, which taught me the most. I gave ESMFold2 green fluorescent protein, the 238-residue beta-barrel from Aequorea victoria, again from sequence alone.
You can watch it go wrong. The strands are there but the barrel never closes, and the coloring says it plainly: orange and yellow throughout, the model’s own confidence on the floor. Against the crystal structure (PDB 1EMA) it is 13.9 Å Cα RMSD over the full chain, and even the best-fitting rigid core is 4.8 Å. That is not a near-miss, it is the wrong structure, and the model knew it.
So why does the same tool that nails myoglobin to 0.69 Å come apart here? The honest answer is the single-sequence tradeoff described above, made concrete. Myoglobin is a bundle of helices held together by local, short-range contacts the language model sees constantly. A beta-barrel is the opposite: its shape comes from pairing strands that sit far apart in the sequence, and the classic signal for those long-range contacts is coevolution, the columns of an MSA that move together. In single-sequence mode there is no MSA. ESMFold2 is betting that ESMC absorbed enough of that coevolutionary structure during pre-training to stand in for one, and for a fold this dependent on the barrel closing, that bet did not pay off. It is the visible version of the whole tradeoff: the MSA you dropped for speed was doing real work, and you feel its absence most exactly where the fold needs information that is not local. That is the thread I pick up in a follow-up, where I try handing the MSA back and see how much of the barrel returns.
Wrap
My read by now is that ESM and OpenFold are solving overlapping problems from opposite ends. One amortizes evolution into a language model and reads structure out as one of several heads. The other treats structure prediction as the whole game and feeds it evolution explicitly through MSAs. Both run on my 3090. The thing I will keep poking at is the SAEs. Being able to ask a 6-billion-parameter model what it learned about a protein, and get back features a biologist could name, is the most novel thing in this release. That itch became the next post, where I borrow Anthropic’s natural language autoencoder idea and try to write ESMC’s representations down in English, then measure what survives.