OpenFold 3 Technical Deep Dive


Before being able to contribute, it’s necessary to get a strong understanding of the overall codebase. Here I will detail the shape of the codebase, the tools/libraries/utilities used by the library, and some other details that I think will be helpful to understand. In particular I’ll cover the approach to testing and CI throughout the codebase. I’ll wrap up with a basic overview of inference vs training and detail where/when each takes place. The inference section will be more detailed, since most of us (myself included) will never really spend much time (read: no time) dealing with the training runs or infra.

The next post will be more of a tactical primer on contributing. I’ll cover codebase/commit/PR etiquette, and the high level “lanes” or areas available to budding contributors to focus on. Stay tuned!

Language and libraries

Python and PyTorch

The main language, unsurprisingly, is Python. A realistic version range right now seems to be 3.10 to 3.13, and there are a few reasons why 3.14 is not your best bet yet. The project uses snakemake for data analysis workflows, and apparently the pinned version does not support 3.14 (a newer version may; I’m unfamiliar with snakemake). More importantly, the project depends (at least historically) on various features of TorchScript, an older PyTorch utility that has been essentially replaced by torch.compile. You can read more in the [pytorch torch.compiler docs].

The quick version is that PyTorch does some fancy stuff to compile Python code in a few different ways. TorchScript was an older way to do this, but PyTorch now offers a full JIT compiler in torch.compile. TorchScript is a frozen feature that is essentially no longer supported, with the PyTorch team actively steering users away from it and towards torch.compile. For this reason you’ll see warnings during inference if you use 3.14, since anything using TorchScript seems not to compile at all and instead runs as normal Python. This means you’re also likely leaving some performance on the table.
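If you want a guard for this in your own scripts, the version range above can be written as a quick check. This is only a minimal sketch: the bounds are the range quoted in this post rather than anything read from the project's packaging metadata, and the function name is hypothetical.

```python
import sys

# Bounds taken from the range discussed in this post (an assumption,
# not read from the project's pyproject.toml).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 13)


def python_in_supported_range(version=None):
    """Return True if (major, minor) falls inside the assumed range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED


if not python_in_supported_range():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is outside "
        "3.10-3.13; behaviour may differ from what this post describes."
    )
```

Passing a tuple such as (3, 14) lets you test the check without switching interpreters.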

As you explore the codebase, you’ll come across uses of jit/torchscript via examples such as:

# openfold3/core/model/primitives/attention.py:104
@torch.jit.ignore
def _deepspeed_evo_attn(...):
    ...

# attention.py:123  (the scripting itself is currently disabled)
# @torch.jit.script
def _attention(query, key, value, biases, ...):
    ...

PyTorch Lightning

PyTorch Lightning from lightning.ai is used as the training and inference harness (trainer, prediction loop, callbacks, etc.). I’ll cover this in more detail later, but it’s a good idea to browse the docs briefly to get a general idea for what’s going on.

Biology Ecosystem

Biotite is OpenFold-3’s core structural-biology toolkit. Biotite’s AtomArray is the in-memory representation of molecular structures throughout the data and output pipeline. Main uses:

  1. The structure data model (biotite.structure, 47 uses). AtomArray / Atom / BondList are the canonical objects passed around in core/data/primitives/structure/ (query.py, tokenization.py, metadata.py, unresolved.py, template.py) for tokenization, bond handling, and building model inputs. Also used in metrics (rasa.py for solvent accessibility) and tensor conversion (tensor_utils.py).

  2. File I/O (biotite.structure.io.pdbx, .io). Reading/writing CIF / BinaryCIF / PDB: CIFFile, CIFBlock, CIFCategory, BinaryCIFFile. Used to parse input structures and templates, and the output writer (core/runners/writer.py) serializes predictions to CIF/PDB.

  3. The CCD - Chemical Component Dictionary (biotite.setup_ccd). setup_openfold.py and dev scripts call setup_ccd to install/build the local components.bcif, the reference ligand/residue chemistry the pipeline looks up.

  4. Chemistry helpers. biotite.structure.info for bond/link types (patches.py, metadata.py), biotite.interface.rdkit (from_mol/to_mol) to bridge to RDKit for ligand handling, and biotite.database.rcsb to fetch structures from the PDB.

  5. Tests. Heavily used to construct synthetic AtomArrays and assert on structures across the test suite.

In short: biotite is the structural representation + CIF/PDB I/O + CCD chemistry layer; RDKit and the model tensors sit on either side of it.

The docs stack

Generator: Sphinx. The standard Python documentation tool. It reads source files under docs/source/, is configured by docs/source/conf.py, and is built via docs/Makefile (e.g. make html). Output is HTML.

Source format: Markdown via myst-parser. Rather than Sphinx’s default reStructuredText, this project writes docs in Markdown using the MyST parser. Three MyST extensions are enabled in conf.py:

  • colon_fence - :::-style fenced blocks for admonitions/directives.
  • dollarmath - inline/display math with $...$ and $$...$$.
  • amsmath - LaTeX environments for multi-line equations.

Theme: furo. A clean, modern, responsive HTML theme (html_theme = "furo"), also used by many well-known Python projects.

Diagrams: sphinxcontrib-mermaid. Lets you embed Mermaid diagrams (flowcharts, etc.) directly in the docs, rendered at build time.

Hosting: Read the Docs. Published at openfold-3.readthedocs.io; RTD rebuilds the Sphinx site on changes.

Dependencies. Declared in two places: the optional docs extra in pyproject.toml (sphinx, myst-parser, furo) and the pixi env (pixi.toml), which also pins sphinx, myst-parser, and furo (plus the mermaid extension that conf.py imports). A docs/environment.yml exists for a conda-based build too.

Notably absent: no autodoc/napoleon API-from-docstrings extensions are enabled, so the docs are hand-written prose/Markdown, not auto-generated from the code.
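To make the stack concrete, here is roughly what a Sphinx conf.py with these pieces looks like. Treat it as an illustrative sketch and not a copy of the project's file: the setting names (extensions, myst_enable_extensions, html_theme) are standard Sphinx and MyST options, but the project title is a placeholder and the real file will contain more.

```python
# Illustrative sketch of a docs/source/conf.py, not the project's actual file.

project = "OpenFold3 docs"  # placeholder title (assumption)

# Sphinx extensions: MyST for Markdown sources, Mermaid for diagrams.
extensions = [
    "myst_parser",
    "sphinxcontrib.mermaid",
]

# The three MyST syntax extensions described above.
myst_enable_extensions = [
    "colon_fence",
    "dollarmath",
    "amsmath",
]

# The furo theme.
html_theme = "furo"
```

Note there is no sphinx.ext.autodoc or sphinx.ext.napoleon entry in the extensions list, which is what "hand-written, not auto-generated" means in practice.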

Various

The shape of the codebase

At the top level there are three directories worth knowing, and they map cleanly onto three different jobs.

openfold3/core is the actual library: the reusable, model-agnostic machinery. Inside it you’ll find the pieces you’d expect from a model of this size:

  • config - configuration plumbing and the linear-init defaults
  • data - the entire data pipeline (this turns out to be a much bigger world than the rest, more on that later)
  • kernels - the GPU kernel wrappers (cuEquivariance and Triton)
  • loss - the training losses (diffusion, distogram, confidence)
  • metrics - quality and confidence scoring, and sample ranking
  • model - the network itself, split into embedders, trunk, layers, primitives, structure, and heads
  • runners - the Lightning glue (the ModelRunner base class and the output writer)
  • utils - the grab bag (chunking, checkpointing, atomization, EMA, schedulers, and so on)

openfold3/entry_points is the command-and-control layer: the ExperimentRunner classes that stand up a PyTorch Lightning trainer for either training or inference, plus input validation and parameter download.

openfold3/projects is where an abstract pile of core components becomes a specific, runnable model. Right now there is one project, of3_all_atom, and it bundles a concrete model.py, a runner.py (OpenFold3AllAtom, which subclasses the core ModelRunner), and a config/ directory holding the real model config and the preset YAML. The entry point, project_entry.py, is the thing that hands you a fully-composed config via get_model_config_with_presets.

Why the indirection? core does not know anything about “OpenFold 3” specifically. It knows about embedders and trunks and diffusion modules in the abstract. A project is what pins down which of those you use, at what sizes, and with which config, so that the same training and inference machinery can in principle host more than one model. If you are coming to contribute, the practical takeaway is this: read projects/of3_all_atom first to see how the model is actually wired together, then drop into core to read the one piece you care about.
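The core-versus-project split is easier to see in miniature. The sketch below uses made-up names throughout (ToyConfig, CoreRunner, ToyProjectRunner); it is not the real ModelRunner or OpenFold3AllAtom, just the same shape of relationship: a generic base class that owns the shared machinery, and a project subclass that decides what gets built.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a fully composed model config."""
    num_blocks: int
    hidden_dim: int


class CoreRunner(ABC):
    """Plays the role of a model-agnostic base class living in core."""

    def __init__(self, config: ToyConfig):
        self.config = config
        self.model = self.build_model()

    @abstractmethod
    def build_model(self):
        """Each project decides which components to assemble."""

    def describe(self) -> str:
        # Shared machinery: works for any project that fills in build_model.
        return f"{type(self).__name__} with {len(self.model)} blocks"


class ToyProjectRunner(CoreRunner):
    """Plays the role of a project: it pins down a concrete model."""

    def build_model(self):
        return [
            f"block_{i}(dim={self.config.hidden_dim})"
            for i in range(self.config.num_blocks)
        ]


runner = ToyProjectRunner(ToyConfig(num_blocks=4, hidden_dim=128))
print(runner.describe())  # prints "ToyProjectRunner with 4 blocks"
```

A second project would be another subclass with its own config, reusing everything in the base class untouched.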

The model, end to end

If you trace a single prediction through the network, it moves through five stages, and the directories under core/model follow them almost one-to-one.

  1. Embedding the inputs (model/feature_embedders). The raw features (sequence, MSA, templates) get turned into the model’s working representations. input_embedders.py holds the all-atom input embedder and the MSA module embedder; template_embedders.py handles structural templates.

  2. The trunk (model/latent). This is the heart of the AlphaFold-style architecture, where the model iterates on two representations at once: a per-token “single” representation and a pairwise “pair” representation. msa_module.py mixes information out of the MSA, and pairformer.py is the Pairformer stack (the successor to AlphaFold 2’s Evoformer, which also still lives here as evoformer.py). This is where the expensive triangle operations run.

  3. The primitives and layers (model/primitives, model/layers) are what the trunk is built from. Primitives are the small reusable pieces: attention, LayerNorm (normalization.py; a normalization layer that rescales activations to keep training stable, which OpenFold 3 leans on heavily), linear layers, and activations. Layers are the bigger named blocks straight out of the paper: triangular_attention.py, triangular_multiplicative_update.py, outer_product_mean.py, attention_pair_bias.py, and the transitions.

  4. Structure generation (model/structure). diffusion_module.py is the diffusion model that actually produces 3D coordinates, denoising from noise into a structure conditioned on the trunk’s representations. This is the big architectural shift from AlphaFold 2, and it is why “diffusion samples” showed up as a flag back in the setup post.

  5. The heads (model/heads). prediction_heads.py and head_modules.py produce the confidence outputs: the pLDDT, pTM, and PAE scores, plus the distogram.

So the whole flow is: features go in, the embedders lift them into single and pair representations, the trunk refines those, the diffusion module turns them into coordinates, and the heads score the result. Every structure I rendered in this series came out the far end of exactly this pipeline.
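As a mental model of that flow, here is a toy version that only tracks shapes. Nothing in it is real OpenFold 3 code: the function names are hypothetical, the channel sizes are tiny made-up numbers, no maths happens, and the toy emits one coordinate triple per token where the real model works per atom. What it does show is which representation each stage consumes and produces.

```python
# Toy shape bookkeeping for the five stages; all names and sizes are
# illustrative assumptions, not OpenFold 3's real modules or dimensions.
C_SINGLE, C_PAIR = 4, 2


def embed(sequence):
    """Lift raw inputs into a single [N, c_s] and a pair [N, N, c_z] rep."""
    n = len(sequence)
    single = [[0.0] * C_SINGLE for _ in range(n)]
    pair = [[[0.0] * C_PAIR for _ in range(n)] for _ in range(n)]
    return single, pair


def trunk(single, pair):
    """The real trunk refines both representations; here they pass through."""
    return single, pair


def diffusion(single, pair, num_samples=1):
    """One (x, y, z) triple per token per sample in this toy."""
    return [[(0.0, 0.0, 0.0) for _ in single] for _ in range(num_samples)]


def heads(single, pair):
    """Confidence outputs: per-token pLDDT and an [N, N] PAE matrix."""
    n = len(single)
    return {"plddt": [0.0] * n, "pae": [[0.0] * n for _ in range(n)]}


single, pair = trunk(*embed("MKTAYIAK"))
coords = diffusion(single, pair, num_samples=5)
confidence = heads(single, pair)
print(len(single), len(pair), len(pair[0]), len(coords), len(confidence["pae"]))
# prints "8 8 8 5 8"
```

The thing to notice is that the pair representation and the PAE matrix are both N by N, which is where the memory pressure discussed in the next section comes from.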

Kernels and performance

A model this size lives or dies on a handful of hand-optimized GPU kernels, and OpenFold 3 leans on two families. core/kernels/cueq_utils.py wraps NVIDIA’s cuEquivariance kernels for the triangle operations, and core/kernels/triton/ holds a set of Triton kernels (a fused softmax, a SwiGLU, and an Evoformer kernel). On top of that, attention can route through DeepSpeed’s EvoformerAttention, which is the @torch.jit.ignore-wrapped path you saw in primitives/attention.py earlier.

The reason all of this exists is the same reason the setup post hit a memory wall: the pair and triangle tensors scale with the square of the sequence length, so both compute and memory blow up fast. The model fights back with a couple of levers you can see directly in model_setting_presets.yml: a chunk_size that splits the big operations into smaller pieces, and an offload_inference flag that pushes activations off to the CPU instead of holding them on the GPU. The predict preset turns these on modestly, and the low_mem preset turns them up.
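The idea behind chunk_size can be shown without any GPU at all. The sketch below is a generic illustration of chunked evaluation, not the project's chunking utility: chunked_apply and expensive_row_op are hypothetical names, and the "expensive" operation is just a row sum. The point is that processing rows in slices gives the same answer while only one slice's worth of intermediates needs to exist at a time.

```python
def expensive_row_op(rows):
    """Stand-in for a memory-hungry operation applied row by row."""
    return [sum(row) for row in rows]


def chunked_apply(fn, rows, chunk_size=None):
    """Apply fn to rows in slices of chunk_size and stitch the results.

    With chunk_size=None the whole input is processed in one go.
    """
    if chunk_size is None:
        return fn(rows)
    out = []
    for start in range(0, len(rows), chunk_size):
        out.extend(fn(rows[start:start + chunk_size]))
    return out


rows = [[i * j for j in range(6)] for i in range(10)]
full = chunked_apply(expensive_row_op, rows)
chunked = chunked_apply(expensive_row_op, rows, chunk_size=4)
print(full == chunked)  # prints "True"
```

Smaller chunks trade speed for peak memory, which matches the description of low_mem turning these levers up relative to predict.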

One more thing worth knowing if you ever go chasing speed: inference defaults to full fp32 (precision: "32-true" in the trainer args). On a memory-bound card that is a real lever left untouched, and it is one of the threads I want to pull on in a future, perf-focused post. For now the mental model is simple. It is correct first, and not yet squeezed.
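Some back-of-the-envelope arithmetic shows why precision is a lever. The sketch below estimates the size of a single pair tensor of shape [N, N, c_pair]. The default c_pair=128 is my assumption, taken from the AlphaFold 3 paper's pair channel width and not read from this codebase, and the function name is hypothetical. The precision keys reuse Lightning's precision strings, with 4 bytes per element for 32-true and 2 for bf16-true.

```python
# Bytes per element for two Lightning precision settings.
BYTES_PER_ELEMENT = {"32-true": 4, "bf16-true": 2}


def pair_tensor_bytes(num_tokens, c_pair=128, precision="32-true"):
    """Size in bytes of one [N, N, c_pair] tensor.

    c_pair=128 is an assumed channel width, not a value read from the repo.
    """
    return num_tokens * num_tokens * c_pair * BYTES_PER_ELEMENT[precision]


for n in (512, 1024, 2048):
    fp32 = pair_tensor_bytes(n) / 2**20
    bf16 = pair_tensor_bytes(n, precision="bf16-true") / 2**20
    print(f"N={n:5d}  fp32={fp32:6.0f} MiB  bf16={bf16:6.0f} MiB")
```

Under those assumptions a single pair tensor goes from 128 MiB at 512 tokens to 2048 MiB at 2048 tokens in fp32, and half that in bf16. That is one tensor; the trunk holds several such intermediates at once.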

Configuration system

Configuration is ml_collections-based, and the thing that makes it tractable is presets. projects/of3_all_atom/config/model_setting_presets.yml defines a few named settings blocks, train, predict, and low_mem, each one toggling things like chunk sizes and offloading. You compose them: inference stacks predict, and on a tight GPU you stack low_mem on top, which is exactly the model_update.presets: [predict, low_mem] line from the setup post. project_entry.get_model_config_with_presets is what resolves all of that into one concrete config.
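Preset stacking is just ordered overriding, and a plain-dict sketch captures the behaviour. Everything here is illustrative: the preset contents, the key layout, and the helper names (deep_update, apply_presets) are assumptions of mine, and the real resolution happens inside get_model_config_with_presets on ml_collections objects.

```python
import copy

# Hypothetical preset contents; the real values live in
# model_setting_presets.yml and will differ.
PRESETS = {
    "predict": {"settings": {"chunk_size": 64, "offload_inference": False}},
    "low_mem": {"settings": {"chunk_size": 4, "offload_inference": True}},
}


def deep_update(base, override):
    """Return a copy of base with override merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_presets(base_config, preset_names):
    """Apply presets in order; later presets win on conflicting keys."""
    config = base_config
    for name in preset_names:
        config = deep_update(config, PRESETS[name])
    return config


base = {
    "settings": {"chunk_size": None, "offload_inference": False},
    "model": {"num_blocks": 48},  # made-up value, untouched by the presets
}
config = apply_presets(base, ["predict", "low_mem"])
print(config["settings"])
# prints "{'chunk_size': 4, 'offload_inference': True}"
```

Order matters: with [predict, low_mem] the low_mem values win wherever both presets set the same key.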

Separately, the PyTorch Lightning trainer has its own small typed config in entry_points/validator.py (PlTrainerArgs), which is where things like precision (defaulting to 32-true) and the profiler live. If you want to flip the model into bf16 or attach a profiler, that is the surface to do it from, via a runner YAML and no code change.
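To show what a small typed trainer config buys you, here is a stand-in built on a stdlib dataclass. It is not PlTrainerArgs: the class name, the helper, and the validation are hypothetical, and the real class may be built differently. The precision strings are a subset of those PyTorch Lightning accepts.

```python
from dataclasses import dataclass
from typing import Optional

# A subset of the precision strings PyTorch Lightning accepts.
ALLOWED_PRECISION = {"32-true", "16-mixed", "bf16-mixed", "bf16-true"}


@dataclass
class TrainerArgsSketch:
    """Hypothetical stand-in for a typed trainer-args config."""
    precision: str = "32-true"
    profiler: Optional[str] = None

    def __post_init__(self):
        # Fail early on typos instead of deep inside a long run.
        if self.precision not in ALLOWED_PRECISION:
            raise ValueError(f"unsupported precision: {self.precision!r}")


def trainer_args_from_mapping(raw):
    """Build the args from an already-parsed YAML mapping (a plain dict)."""
    return TrainerArgsSketch(**raw)


default_args = trainer_args_from_mapping({})
bf16_args = trainer_args_from_mapping(
    {"precision": "bf16-mixed", "profiler": "simple"}
)
print(default_args.precision, bf16_args.precision, bf16_args.profiler)
# prints "32-true bf16-mixed simple"
```

The value of the typed layer is the early failure: a misspelled precision or an unknown key is rejected when the config is loaded, before any model is built.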

How inference runs

Here is the path a prediction actually takes, top to bottom:

  1. run_openfold predict (the click CLI in run_openfold.py) parses your flags and your query JSON.
  2. That hands off to an InferenceExperimentRunner in entry_points/experiment_runner.py. Its setup builds the model and the data, and its run does the one thing that matters: self.trainer.predict(...).
  3. The trainer is a plain PyTorch Lightning pl.Trainer, constructed from the PlTrainerArgs mentioned above.
  4. The LightningModule it drives is core/runners/model_runner.py (ModelRunner), subclassed by projects/of3_all_atom/runner.py (OpenFold3AllAtom). The interesting hook is predict_step, where a batch becomes a structure.
  5. Output is handled by a writer callback in core/runners/writer.py, which serializes each prediction to CIF (and PDB) using biotite.
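The five steps above can be sketched as a toy control flow. None of these classes are the real ones: ToyInferenceRunner, ToyTrainer, ToyModule, and ToyWriter are hypothetical stand-ins, and the callback hook borrows Lightning's on_predict_batch_end name with a simplified signature. The sketch only shows who calls whom.

```python
class ToyWriter:
    """Stand-in for the writer callback; records the files it would write."""

    def __init__(self):
        self.written = []

    def on_predict_batch_end(self, outputs, batch):
        self.written.append(f"{outputs['query_id']}.cif")


class ToyModule:
    """Stand-in for the LightningModule; predict_step maps batch to output."""

    def predict_step(self, batch):
        return {"query_id": batch["query_id"], "num_tokens": len(batch["sequence"])}


class ToyTrainer:
    """Stand-in for the trainer: loops over batches and fires callbacks."""

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def predict(self, module, batches):
        results = []
        for batch in batches:
            outputs = module.predict_step(batch)
            for callback in self.callbacks:
                callback.on_predict_batch_end(outputs, batch)
            results.append(outputs)
        return results


class ToyInferenceRunner:
    """Stand-in for the experiment runner: setup builds, run predicts."""

    def __init__(self, queries):
        self.queries = queries

    def setup(self):
        self.module = ToyModule()
        self.writer = ToyWriter()
        self.trainer = ToyTrainer(callbacks=[self.writer])

    def run(self):
        return self.trainer.predict(self.module, self.queries)


runner = ToyInferenceRunner([{"query_id": "demo", "sequence": "MKTAYIAK"}])
runner.setup()
results = runner.run()
print(results, runner.writer.written)
```

The writer never gets called by the model directly; the trainer loop invokes it after each predict_step, which is why output handling lives in a callback and not in the model code.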

The one branch worth calling out is the MSA step, because it is the single biggest lever on quality (as the setup post showed). The tooling lives in core/data/tools: colabfold_msa_server.py is the hosted path you get with --use_msa_server, while jackhmmer.py, hhblits.py, and hhsearch.py are there for the precomputed, run-it-yourself route. Turn the server off and provide nothing, and you fall back to a single-sequence dummy MSA, which is fast and offline and, as we saw, a lot less accurate.
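The three MSA routes boil down to a small decision. The function below is a hypothetical sketch of that decision, not code from the repo: the names are mine, and the choice to let the server flag take priority when both options are given is an assumption I have not checked against the implementation.

```python
from enum import Enum
from typing import Optional


class MsaSource(Enum):
    SERVER = "colabfold_msa_server"
    PRECOMPUTED = "precomputed"
    SINGLE_SEQUENCE = "single_sequence_dummy"


def choose_msa_source(use_msa_server: bool, precomputed_dir: Optional[str]) -> MsaSource:
    """Pick an MSA route.

    Assumption: the server wins if both are supplied. The fallback to a
    single-sequence dummy MSA is the behaviour described in the post.
    """
    if use_msa_server:
        return MsaSource.SERVER
    if precomputed_dir:
        return MsaSource.PRECOMPUTED
    return MsaSource.SINGLE_SEQUENCE


print(choose_msa_source(False, None).value)  # prints "single_sequence_dummy"
```

The last branch is the one to watch for: it succeeds silently, so a run with no MSA looks like any other run until you inspect the confidence scores.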

How training runs

I will keep this short, both because the docs cover it well in training.md and because, like most people running inference, it is not where I spend my time.

The training side mirrors the inference side: a TrainingExperimentRunner stands up the same kind of Lightning trainer, just pointed at the losses in core/loss (diffusion, distogram, and confidence) and the metrics in core/metrics (quality, confidence, sample ranking). The heavy lifting that makes training even possible is in core/data, which is a much bigger world than the inference path lets on: a full preprocessing and featurization pipeline plus a dataset cache system, because you cannot re-derive features for millions of structures on every step.

The part most worth knowing, and the part the overview post already gestured at, is that OpenFold reproduced the AlphaFold 3 training recipe, including the large MGnify-based distillation dataset. Training runs distributed through Lightning, and the conda-and-pixi parity that CI enforces (more on that next) matters a lot more here than it does for a one-off local inference.

Testing and CI

Testing is pytest, run in parallel with pytest-xdist, and the plugins it leans on tell you what the project cares about. pytest-regressions drives the numerical snapshot tests (the ones that compare arrays against committed baselines, and the ones that bit me on a non-reference GPU in the setup post). There is also pytest-benchmark for performance regressions and pytest-recording for replaying network interactions. A shared seeded_rng fixture keeps the randomness deterministic, which, as my flaky-test PR showed, is not optional for a model full of random initialization.
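Here is the seeded-RNG-plus-snapshot idea reduced to the standard library. It is a sketch under assumptions: the project's seeded_rng is a pytest fixture and its baselines are managed by pytest-regressions, whereas this version is a plain function with a hand-rolled tolerance check, and toy_model_output and matches_baseline are hypothetical names.

```python
import math
import random


def seeded_rng(seed=0):
    """Return an isolated, deterministic random generator."""
    return random.Random(seed)


def toy_model_output(rng, n=4):
    """Stand-in for a model whose output depends on random initialization."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def matches_baseline(values, baseline, rel_tol=1e-5, abs_tol=1e-8):
    """Compare against a stored baseline within a numerical tolerance."""
    return len(values) == len(baseline) and all(
        math.isclose(v, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for v, b in zip(values, baseline)
    )


baseline = toy_model_output(seeded_rng(42))
rerun = toy_model_output(seeded_rng(42))
other_seed = toy_model_output(seeded_rng(43))

print(matches_baseline(rerun, baseline))       # prints "True"
print(matches_baseline(other_seed, baseline))  # prints "False"
```

The tolerance is the part that bites on different hardware: a different GPU can shift results by more than the allowed margin even with identical seeds, which is what a snapshot failure on a non-reference card looks like.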

Linting and formatting are ruff, configured with an 88-character line length and the E/F/UP/B/SIM/I/TID rule sets. Two choices stand out: relative imports are banned outright (ban-relative-imports = "all"), so everything is imported by full path, and tests are exempt from the line-length rule.

CI is where you really see that this is a systems project and not just a model. There are dual test pipelines, one conda and one pixi, that build a Docker image and run the suite on cloud GPU runners, pushing images to GHCR. There is a heavier integration-test path for the slow, full-fat tests, a workflow that caches the model parameters from S3 so every run is not re-downloading gigabytes, a ruff gate, and a PyPI publish pipeline that uses cibuildwheel to ship the compiled kernels as wheels. If you contribute, the conda-and-pixi duplication is the thing to keep in mind: a change that works in one environment has to work in the other, because CI checks both.

Wrap

This ended up being a lot longer than I expected. Make no mistake: I myself understand only the very surface level of many of these details, as I’m still ramping up on the project. Don’t be intimidated by the amount of information. Just use this as an overview, and focus on what’s relevant to you as you work on contributions. Next up I’ll dive into everything relevant to contributing and how to pick a lane to swim in.

Folding is fun

The structure for this post is the obvious one for a deep dive: ATP synthase subunit β, the catalytic heart of the machine that converts the mitochondrial proton gradient into ATP. The whole respiratory chain, the electron relay that cytochrome c and cytochrome c1 are part of in the other posts, exists to pump protons across the inner membrane. ATP synthase is what cashes that gradient back in. Subunit β is where the chemistry happens.

Human ATP synthase F1 subunit β (ATP5F1B), predicted with OpenFold 3 · avg pLDDT ≈ 86 · colored by OpenFold 3 pLDDT on the AlphaFold confidence scale (blue high, orange low)

The large blue body is the nucleotide-binding fold, predicted with high confidence. The long orange strand peeling off it is the N-terminal mitochondrial targeting presequence, the tag that gets the protein imported and is then cleaved; OpenFold 3 correctly has no confidence in its position, because in the mature protein it does not exist. Superposed against AlphaFold’s model of the same sequence (UniProt P06576), the confident core matches to 0.47 Å backbone RMSD over all 472 residues, which is about as close as two independent predictions of the same fold get. Same caveat as the rest of the series: the public AlphaFold database is AlphaFold 2, since there is no public bulk download of AlphaFold 3 structures, but for a conserved monomer like this it is a fair reference.
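For anyone unfamiliar with the metric, RMSD is the square root of the mean squared distance between matched atoms. The function below is a minimal sketch that assumes the two coordinate sets are already superposed and listed in the same atom order; it does not perform the superposition itself, and it is not the code used to produce the number above.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Assumes both lists are already superposed and in the same atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]  # every atom moved by 5 units
print(rmsd(reference, reference), rmsd(reference, shifted))  # prints "0.0 5.0"
```

A value well under 1 Å means the matched atoms sit nearly on top of each other after alignment.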