A Brief Overview of ESM

In the OpenFold 3 overview I left myself a parenthetical note to circle back to:

ESM (Evolutionary Scale Modeling) is also really appealing as a project to get involved in. It comes at biology from the language-model angle rather than structure prediction, and the open weights and active community make it another strong on-ramp. Something for a future post, maybe.

This is that future post. I spent a weekend getting the whole ESM stack running on my own machine: the language model, the folding model, and the interpretability tooling. It turned out to be a richer system than I expected. This post is the “what and why.” The next one is how I got it running on a 3090, and the third is the architecture.

What is ESM?

ESM stands for Evolutionary Scale Modeling, and the bet is right there in the name. Treat evolution as a training signal. A protein language model is trained the same way a masked language model like BERT is. You show it a protein sequence with some amino acids masked out, ask it to fill in the blanks, and repeat across billions of sequences. The only real difference from a text model is the alphabet (20 amino acids instead of a tokenizer’s worth of word-pieces) and the corpus (the proteins evolution actually produced, rather than the text humans actually wrote).
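
To make the fill-in-the-blanks objective concrete, here is a minimal sketch of the masking step in plain Python. The 15% mask rate is borrowed from BERT-style training and is an assumption, as are the `<mask>` token name and the `mask_sequence` helper. None of it is taken from the ESMC training code.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"  # hypothetical mask token name


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens, targets): tokens is the sequence with some positions
    replaced by MASK, targets maps each masked position to the residue
    the model would be asked to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        tokens[i] = MASK
    return tokens, targets


# Human ubiquitin, 76 residues.
ubiquitin = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ"
    "QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)
tokens, targets = mask_sequence(ubiquitin)
print("".join(t if t != MASK else "_" for t in tokens))
print(targets)
```

A real training loop would convert the tokens to integer IDs, run the model, and score its predictions only at the masked positions. The sketch stops at the data side, which is the part that carries over unchanged from text to proteins.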

The real claim is what falls out of that. Train a model big enough on enough sequences and it stops being a spell-checker for proteins. It starts encoding structure, function, and evolutionary relationships, none of which it was ever explicitly told about. The release I dug into bundles three things that all lean on that one idea:

ESMC is the language model itself, released at 300M, 600M, and 6B parameters. This is the foundation. Everything else is built on its embeddings.
ESMFold2 is a structure predictor built on top of the ESMC 6B model. It pairs those embeddings with a diffusion module that generates 3D coordinates. This is the part that goes up against AlphaFold 3, and by extension OpenFold 3.
ESM Atlas and the sparse autoencoders round it out. The Atlas is a map of roughly 6.8 billion proteins, and the sparse autoencoders are trained to crack ESMC’s internal representations into human-readable features. This is the interpretability story, and it has no real analog in the AlphaFold world.
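
The sparse-autoencoder idea in that last item fits in a few lines. This is a toy NumPy sketch of the general technique: a wide ReLU encoder, a linear decoder, and an L1 penalty that pushes most features toward zero. The weights are random, the sizes are made up, and the 16-dimensional inputs are random stand-ins for ESMC activations. It is not the architecture or training recipe of the released SAEs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64  # toy sizes; the feature layer is wider on purpose

# Randomly initialised parameters (a trained SAE would learn these).
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)


def sae_forward(x):
    """Encode embeddings into non-negative features, then reconstruct."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features >= 0
    recon = features @ W_dec + b_dec
    return features, recon


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    features, recon = sae_forward(x)
    mse = np.mean((x - recon) ** 2)
    sparsity = np.mean(np.abs(features).sum(axis=-1))
    return mse + l1_coeff * sparsity


x = rng.normal(size=(8, d_model))  # stand-in for 8 per-residue embeddings
features, recon = sae_forward(x)
print(features.shape, recon.shape, sae_loss(x))
```

The interpretability payoff comes after training: each of the wide feature columns tends to fire on a narrow pattern, and those patterns are what a person can inspect and label.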

How this differs from the OpenFold bet

OpenFold 3, AlphaFold 3, Boltz, and the rest are structure predictors first. You hand them a sequence plus an MSA (multiple sequence alignment), a stack of evolutionarily related sequences, and they hand back coordinates. The MSA does a lot of the work. It is an explicit, hand-assembled summary of evolutionary variation.

ESM’s wager is that a big enough language model has already internalized that variation during pre-training, so you should not need to assemble an MSA at inference time at all. ESMFold2 leans into this with a single-sequence mode. No MSA search, no genetic databases, just sequence in and structure out, an order of magnitude faster.

There is a name for this kind of bet. Rich Sutton’s bitter lesson is the observation that in AI, general methods that scale with computation tend to win out over methods that bake in our own knowledge of the problem. The MSA is exactly that kind of baked-in knowledge, a feature we hand-assemble because we know evolutionary variation matters. ESM’s bet is that you should not have to hand-assemble it, that a big enough model trained on enough raw sequence learns that signal on its own, and then some. ESMFold2 is that bet pointed straight at structure prediction.

When I first read that I was skeptical. The OpenFold post taught me how much these models lean on MSAs. My offline, no-MSA cytochrome c run cratered to a pLDDT around 40. So the obvious test was to point ESMFold2 at a protein and see whether the language model alone could carry the fold.

A structure I actually ran

Below is human ubiquitin, predicted on my own machine (an RTX 3090) with ESMFold2, from sequence alone. No MSA, single-sequence mode, about six seconds end to end. Ubiquitin is a good first target. It is 76 residues, absurdly well studied, and it is the small protein cells use as a molecular tag. They attach it to other proteins to mark them for degradation, move them around, or flip their activity. The 2004 Nobel in Chemistry went to the discovery of ubiquitin-mediated degradation. It is the protein the cell uses to take out the trash.

Human ubiquitin, predicted with ESMFold2 in single-sequence mode · avg pLDDT ≈ 81, pTM ≈ 0.76 · colored by ESMFold2 pLDDT on the AlphaFold confidence scale (blue high, orange low)

The coloring is ESMFold2’s per-residue confidence (pLDDT) on the standard AlphaFold scale. Deep blue is high confidence, easing to cyan and yellow at the flexible C-terminal tail. That tail is the ...LRGG that actually does the conjugating, and it is floppy in reality too. The model recovered the classic β-grasp fold, a five-stranded β-sheet wrapped around a single α-helix, without ever being shown an experimental ubiquitin structure in this run.
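
For anyone who wants to reproduce the average or the color bands from a structure file, here is a small sketch. It assumes the prediction is written as a PDB file with per-residue pLDDT stored in the B-factor column on a 0 to 100 scale, which is the AlphaFold convention. I have not confirmed that ESMFold2 writes its output that way, so treat that as an assumption. The band cutoffs (90, 70, 50) are the AlphaFold database ones, and `pdb_atom_line` is a hypothetical helper that only exists to build a toy input.

```python
def plddt_band(score):
    """Map a pLDDT score (0-100) to the AlphaFold database confidence band."""
    if score > 90:
        return "very high (dark blue)"
    if score > 70:
        return "confident (light blue)"
    if score > 50:
        return "low (yellow)"
    return "very low (orange)"


def ca_plddt(pdb_text):
    """Read one pLDDT value per residue from the B-factor column of CA atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def pdb_atom_line(serial, name, res, resseq, xyz, bfactor):
    """Hypothetical helper: format a fixed-width PDB ATOM record for chain A."""
    x, y, z = xyz
    return (
        f"ATOM  {serial:5d} {name:4s} {res:>3s} A{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{bfactor:6.2f}"
    )


# Toy three-residue input with made-up coordinates and scores.
toy_pdb = "\n".join([
    pdb_atom_line(1, " N  ", "MET", 1, (0.0, 0.0, 0.0), 92.5),
    pdb_atom_line(2, " CA ", "MET", 1, (1.5, 0.0, 0.0), 92.5),
    pdb_atom_line(3, " CA ", "GLN", 2, (5.3, 0.0, 0.0), 81.0),
    pdb_atom_line(4, " CA ", "GLY", 3, (9.1, 0.0, 0.0), 45.25),
])

scores = ca_plddt(toy_pdb)
mean_plddt = sum(scores) / len(scores)
for s in scores:
    print(s, plddt_band(s))
print("mean pLDDT:", mean_plddt)
```

Reading only the Cα atoms gives one score per residue, so the mean is not weighted by side-chain size.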

Then the same sanity check I ran in the OpenFold series. I superposed the prediction on the experimental crystal structure (PDB 1UBQ, a 1.8 Å X-ray structure from 1987) and measured the backbone deviation.

It came back at 1.12 Å Cα RMSD. That is essentially experiment-accurate for a monomer this size. It is also a clean demonstration of the ESM thesis. The evolutionary signal that OpenFold reads out of an MSA at inference time, ESMC has baked into its weights at training time. For a small, well-represented protein like ubiquitin, that is enough.
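
The superpose-and-measure step is the Kabsch algorithm followed by a root-mean-square distance. Below is a NumPy sketch of that calculation, not the exact script behind the number above. It assumes you have already extracted matched Cα coordinates from both structures as two arrays of shape (N, 3) in the same residue order. The demo data is random and only checks that a rotated, shifted copy superposes back onto the original.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))


# Demo: a rigidly moved copy of random points should superpose almost exactly.
rng = np.random.default_rng(0)
ref = rng.normal(size=(76, 3))
theta = 0.7
Rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = ref @ Rz.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(moved, ref))  # close to zero, up to floating-point error
```

With real structures the fiddly part is the matching, not the math. The two coordinate arrays have to cover the same residues in the same order, and whether to include a floppy tail is a choice that moves the number.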

I do not want to oversell it. Ubiquitin is about the friendliest target you can pick, and the single-sequence trick gets shakier on large, novel, or poorly sampled proteins. That is exactly where MSAs still earn their keep. The honest read is that these are two tools that happen to overlap, not a knockout. But it is a good result to get out of a language model on a consumer GPU.

Who is behind it?

ESM came out of the protein team at Meta’s FAIR lab, led by Alexander Rives. The original ESM and ESM2 papers are from there. That team spun out into an independent company, EvolutionaryScale, which is where ESM3 and this latest generation (ESMC and ESMFold2) come from. The models in this release are distributed through biohub.ai. You will still see Forge scattered through the SDK as a holdover from the old forge.evolutionaryscale.ai API.

The funding and openness angle is different from OpenFold’s. OpenFold is a nonprofit consortium whose openness is the product. EvolutionaryScale is a venture-backed company that open-sources weights under a permissive license while also running a paid hosted platform. That is closer to the Chai and Boltz “open model, commercial business” shape than to OpenFold’s commons. Worth keeping in mind, but the weights for ESMC 6B, ESMFold2, and the SAEs are out under MIT, which is what matters for building on them.

Where I’m headed

That was the orientation. Two things surprised me, and they are what I want to dig into next. First, how cheaply all of this runs on 24 GB, once you get past a cursed download. Second, the sparse autoencoders, which let you ask what the model has actually learned and get back something a biologist could read. The setup post covers getting it running. The deep dive covers how it is built.
