A Brief Overview of ESM
- 1 A Brief Overview of ESM (you are here)
- 2 ESM Local Setup
- 3 ESM Technical Deep Dive
- 4 Can You Write Down a Protein Model's Mind in English?
In the OpenFold 3 overview I left myself a parenthetical note to circle back to:
ESM (Evolutionary Scale Modeling) is also really appealing as a project to get involved in. It comes at biology from the language-model angle rather than structure prediction, and the open weights and active community make it another strong on-ramp. Something for a future post, maybe.
This is that future post. I spent a weekend getting the whole ESM stack running on my own machine: the language model, the folding model, and the interpretability tooling. It turned out to be a richer system than I expected. This post is the "what and why." The next one is how I got it running on a 3090, and the third is the architecture.
What is ESM?
ESM stands for Evolutionary Scale Modeling, and the bet is right there in the name: treat evolution as a training signal. A protein language model is a transformer trained on raw protein sequences the way an LLM is trained on text. Instead of predicting the next word, it predicts masked-out amino acids, and in doing so learns an internal representation of protein biology that transfers to structure, function, and design tasks. You show it a protein sequence with some amino acids masked out, ask it to fill in the blanks, and repeat across billions of sequences. Apart from that masked objective, the only real differences from a text LLM are the alphabet (20 amino acids instead of a tokenizer's worth of word-pieces) and the corpus (the proteins evolution actually produced, rather than the text humans actually wrote).
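To make the fill-in-the-blanks setup concrete, here is a minimal sketch of how one masked training example could be built from a sequence. It uses no ESM code at all. The `<mask>` token name, the 15% mask fraction, and the `mask_sequence` helper are my own illustrative assumptions, not details taken from the ESMC training recipe.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
MASK = "<mask>"  # placeholder token name (an assumption, not a specific tokenizer's)


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens, targets): tokens is the sequence with some positions
    replaced by MASK, targets maps each masked position to the hidden residue.
    The 15% default is an illustrative choice, not ESMC's documented setting.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model would be asked to predict
        tokens[pos] = MASK
    return tokens, targets


# Human ubiquitin, 76 residues.
ubiquitin = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)
tokens, targets = mask_sequence(ubiquitin)
print("".join(t if t != MASK else "_" for t in tokens))
print(targets)
```

A real model would take `tokens` as input and be scored on how well it recovers the residues in `targets`. Everything the post goes on to describe is learned from that one signal.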
The real claim is what falls out of that. Train a model big enough on enough sequences and it stops being a spell-checker for proteins. It starts encoding structure, function, and evolutionary relationships, none of which it was ever explicitly told about. The release I dug into bundles three things that all lean on that one idea:
- ESMC is the language model itself, released at 300M, 600M, and 6B parameters. This is the foundation. Everything else is built on its embeddings: the model's internal vector representation of each residue, a list of numbers per amino acid that encodes what the model has learned about that position in context. These vectors are what downstream models (like a folding trunk) actually consume.
- ESMFold2 is a structure predictor built on top of the ESMC 6B model. It pairs those embeddings with a diffusion module that generates 3D coordinates. (AlphaFold 3 and its relatives generate atomic coordinates with a diffusion model, starting from noise and denoising into a structure. Each sample is one such run, so more samples means more candidate structures to rank.) This is the part that goes up against AlphaFold 3, and by extension OpenFold 3.
- ESM Atlas and the sparse autoencoders are the third piece: a map of roughly 6.8 billion proteins, plus sparse autoencoders trained to crack ESMC's internal representations into human-readable features. A sparse autoencoder is an unsupervised network trained to re-express a model's dense internal activations as a sparse combination of many interpretable "features." Only a handful of the ~16,000 features fire for any given residue, and each one tends to correspond to a recognizable piece of biology. This is the interpretability story, and it has no real analog in the AlphaFold world.
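The sparse autoencoder idea in the last bullet is easier to see in code than in prose. The sketch below is a generic toy version in NumPy with random, untrained weights and tiny dimensions. It uses a top-k rule to enforce sparsity, which is one common SAE variant. I am assuming that variant for illustration, and nothing here reflects the actual architecture or sizes of the released ESMC SAEs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k = 64, 512, 8  # toy sizes, chosen only for illustration

# Random, untrained weights: stand-ins for what training would learn.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map dense activations (n_residues, d_model) to sparse feature activations."""
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # ReLU
    # Keep only the k largest activations per residue (top-k variant, an assumption).
    threshold = np.partition(pre, -k, axis=1)[:, -k][:, None]
    return np.where(pre >= threshold, pre, 0.0)


def decode(features):
    """Rebuild the dense activations as a weighted sum of feature directions."""
    return features @ W_dec + b_dec


x = rng.normal(size=(76, d_model))  # fake per-residue embeddings, one row per residue
features = encode(x)
reconstruction = decode(features)
active_per_residue = (features > 0).sum(axis=1)
print(features.shape, active_per_residue.max())
```

Training would push `reconstruction` toward `x` while keeping the active set small. The interpretability payoff comes from inspecting which residues make a given feature column fire.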
How this differs from the OpenFold bet
OpenFold 3, AlphaFold 3, Boltz, and the rest are structure predictors first. You hand them a sequence plus an MSA (multiple sequence alignment), a stack of evolutionarily related sequences for the same protein, and they hand back coordinates. The MSA does a lot of the work: the patterns of which residues change together are a big part of how these models infer 3D structure. It is an explicit, hand-assembled summary of evolutionary variation.
ESM's wager is that a big enough language model has already internalized that variation during pre-training, so you should not need to assemble an MSA at inference time at all. ESMFold2 leans into this with a single-sequence mode. No MSA search, no genetic databases, just sequence in and structure out, an order of magnitude faster.
There is a name for this kind of bet. Rich Sutton's bitter lesson is the observation that in AI, general methods that scale with computation tend to win out over methods that bake in our own knowledge of the problem. The MSA is exactly that kind of baked-in knowledge, a feature we hand-assemble because we know evolutionary variation matters. ESM's bet is that you should not have to hand-assemble it, that a big enough model trained on enough raw sequence learns that signal on its own, and then some. ESMFold2 is that bet pointed straight at structure prediction.
When I first read that I was skeptical. The OpenFold post taught me how much these models lean on MSAs. My offline, no-MSA cytochrome c run cratered to a pLDDT around 40. (pLDDT, the Predicted Local Distance Difference Test, is an AlphaFold-style per-residue confidence score from 0 to 100. Higher means the model is more sure about how that part of the structure is positioned.) So the obvious test was to point ESMFold2 at a protein and see whether the language model alone could carry the fold.
A structure I actually ran
Below is human ubiquitin, predicted on my own machine (an RTX 3090) with ESMFold2, from sequence alone. No MSA, single-sequence mode, about six seconds end to end. Ubiquitin is a good first target. It is 76 residues, absurdly well studied, and it is the small protein cells use as a molecular tag. They attach it to other proteins to mark them for degradation, move them around, or flip their activity. The 2004 Nobel in Chemistry went to the discovery of ubiquitin-mediated degradation. It is the protein the cell uses to take out the trash.
The coloring is ESMFold2's per-residue confidence (pLDDT) on the standard AlphaFold scale. Deep blue is high confidence, easing to cyan and yellow at the flexible C-terminal tail. That tail is the ...LRGG that actually does the conjugating, and it is floppy in reality too. The model recovered the classic β-grasp fold, a five-stranded β-sheet wrapped around a single α-helix, without ever being shown an experimental ubiquitin structure in this run.
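For readers who have not met the AlphaFold color scale, it is usually presented as four confidence bands. The helper below is a small sketch of that banding. The cutoffs at 90, 70, and 50 follow the convention used by the AlphaFold Protein Structure Database. The function name is hypothetical, and the viewer behind the figure may well interpolate colors smoothly rather than binning them.

```python
def plddt_band(score):
    """Return (confidence label, color name) for a pLDDT score in [0, 100]."""
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "very high", "dark blue"
    if score > 70:
        return "confident", "light blue"
    if score > 50:
        return "low", "yellow"
    return "very low", "orange"


# Example scores chosen for illustration; they are not values from my run.
for score in (96.0, 82.5, 61.0, 40.0):
    print(score, plddt_band(score))
```

On this scale the cytochrome c run from the OpenFold post, at a pLDDT around 40, sits in the lowest band.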
Then the same sanity check I ran in the OpenFold series. I superposed the prediction on the experimental crystal structure (PDB 1UBQ, a 1.8 Å X-ray structure from 1987) and measured the backbone deviation.
It came back at 1.12 Å Cα RMSD. That is essentially experiment-accurate for a monomer this size. It is also a clean demonstration of the ESM thesis. The evolutionary signal that OpenFold reads out of an MSA at inference time, ESMC has baked into its weights at training time. For a small, well-represented protein like ubiquitin, that is enough.
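If you want to reproduce this kind of check, superposition plus RMSD comes down to a few lines of linear algebra. The sketch below uses the Kabsch algorithm in NumPy, which is the standard way to find the optimal rigid rotation. I am naming it as a reasonable stand-in, not as the exact tool used for the number above. It assumes you have already extracted matching Cα coordinates into two arrays of the same shape, and the demo runs on synthetic points, not on the real ubiquitin structures.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition.

    Row i of P must correspond to row i of Q (for example, the same residue's
    C-alpha atom in the predicted and experimental structures).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centering both point sets on their centroids.
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)

    # Covariance matrix and its SVD give the optimal rotation.
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    P_rot = P_c @ R.T  # apply R to every row of P_c
    return float(np.sqrt(((P_rot - Q_c) ** 2).sum(axis=1).mean()))


# Synthetic demo: rotate and shift a random point cloud, then add small noise.
rng = np.random.default_rng(0)
P = rng.normal(size=(76, 3))
theta = 0.7
Rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
Q_exact = P @ Rz.T + np.array([1.0, -2.0, 3.0])
Q_noisy = Q_exact + rng.normal(scale=0.5, size=P.shape)

print(kabsch_rmsd(P, Q_exact))  # effectively zero: a pure rigid motion is undone
print(kabsch_rmsd(P, Q_noisy))  # nonzero: only the noise remains
```

The design point is that RMSD only means something after the rigid motion is removed. Two identical structures in different orientations would otherwise report a large, meaningless deviation.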
I do not want to oversell it. Ubiquitin is about the friendliest target you can pick, and the single-sequence trick gets shakier on large, novel, or poorly-sampled proteins. That is exactly where MSAs still earn their keep. The honest read is that these are two tools that happen to overlap, not a knockout. But it is a good result to get out of a language model on a consumer GPU.
Who is behind it?
ESM came out of the protein team at Meta's FAIR lab, led by Alexander Rives. The original ESM and ESM2 papers are from there. That team spun out into an independent company, EvolutionaryScale, which is where ESM3 and this latest generation (ESMC and ESMFold2) come from. The models in this release are distributed through biohub.ai. You will still see Forge scattered through the SDK as a holdover from the old forge.evolutionaryscale.ai API.
The funding and openness angle is different from OpenFold's. OpenFold is a nonprofit consortium whose openness is the product. EvolutionaryScale is a venture-backed company that open-sources weights under a permissive license while also running a paid hosted platform. That is closer to the Chai and Boltz "open model, commercial business" shape than to OpenFold's commons. Worth keeping in mind, but the weights for ESMC 6B, ESMFold2, and the SAEs are out under MIT, which is what matters for building on them.
Where I'm headed
That was the orientation. Two things surprised me, and they are what I want to dig into next. First, how cheaply all of this runs on 24 GB, once you get past a cursed download. Second, the sparse autoencoders, which let you ask what the model has actually learned and get back something a biologist could read. The setup post covers getting it running. The deep dive covers how it is built.