
OpenFold 3 Local Setup


The overview was the “what and why” of OpenFold 3. This post is the “how do I actually run it on hardware I own.” My goal here is narrow: get from a clean checkout to structure prediction on a single consumer GPU, and understand which knobs to reach for when it inevitably runs out of memory. If you walk away able to install OF3, download the parameters, run an inference, and read the confidence numbers it hands back, this post did its job.

The constraint that shapes everything below: I am on a single RTX 3090 with 24GB of VRAM, and the docs recommend 32GB or more. That gap is the focus of the back half of this post.

That said, you will probably be better off consulting the OpenFold3 Docs directly, since most of this is likely to drift.

The environment

OF3 supports both conda and pixi, and the docs lean toward pixi. I did too. Pixi gives you a real lockfile and a named environment, which means the next person (or the next me, six months from now) gets the exact same dependency graph instead of a vague environment.yml that resolves differently every time. The environment I use throughout is openfold3-cuda12, and essentially every command in this post is prefixed with pixi run, e.g.:

pixi run -e openfold3-cuda12 <command>
# or drop into a shell with the env active:
pixi shell -e openfold3-cuda12 # I never do this, personally

Typing pixi run -e openfold3-cuda12 in front of every command gets old fast, so I aliased it to ofrun. Every command in the rest of this series becomes ofrun pytest openfold3/tests, ofrun ruff check --fix, and so on. I’m a fish user, so:

alias --save ofrun "pixi run -e openfold3-cuda12"

The --save flag writes it out as an autoloaded function in ~/.config/fish/functions/, so it survives new shells. If you’re on bash or zsh, the equivalent is alias ofrun="pixi run -e openfold3-cuda12" in your rc file.

The stack underneath is heavy and version-sensitive: PyTorch 2.10 built against CUDA 12.9, DeepSpeed, the cuEquivariance kernels, and Python 3.10. The thing to internalize is that the GPU kernels (DeepSpeed’s EvoformerAttention and the cuEquivariance triangle kernels) have to line up with your CUDA toolkit. Pixi handling that resolution is most of why I did not fight a single kernel-compatibility error. I do tend to prefer uv over everything else these days in my own projects. There is a pyproject.toml file available, so I’m sure you could use uv without too much hassle, but I just went the path the docs recommended (you should probably do the same).

Sidebar, for fellow Arch/CachyOS people: the pixi installer only updated my .zshrc, so pixi was not on my fish PATH. The fix that actually persisted was fish_add_path -U $HOME/.pixi/bin (the universal flag matters; the non-universal version did not survive new sessions).

New to Python's packaging zoo? A primer on venv, pip, conda, pixi, and uv (optional)

Quick tangent, since it may be relevant for people new to the ecosystem.

The core problem. Python projects need:

  • a specific Python version
  • a set of third-party libraries.

Different projects want different versions of each, so you isolate them in separate environments instead of installing everything globally. Tools split into two jobs (though most newer tools do both):

  • making isolated environments
  • installing packages into them

Where packages come from. Two main “stores”:

  • PyPI - the official index of Python-only packages (what pip install uses).
  • conda channels (e.g. conda-forge)
    • these also carry non-Python pieces like CUDA, compilers, and C libraries.

Environment managers (make the isolated box)

  • venv
    • built into Python. Creates an isolated environment from a Python you already have; doesn’t install anything itself
    • you use pip inside it. The classic combo is venv + pip.
  • virtualenv
    • older third-party version of venv. same idea, a few more features.
  • conda / pixi manage their own environments and install into them, so they replace venv.
  • uv
    • also creates environments for you, replacing venv.
    • newer, extremely fast, and typically preferred by me and others I know who are active in the community

Installers (put packages into an environment)

  • pip
    • Python’s built-in installer. PyPI only; Python packages only.
  • uv
    • fast modern pip replacement (Rust). Same PyPI world, but also installs Python versions and resolves dependencies far quicker.
    • again, typically I’d prefer uv over all else, if nothing is tying me to a specific tool
  • conda
    • installs from conda channels
    • handles non-Python system libraries too (CUDA, compilers)
    • Heavier and slower.
  • pixi
    • fast modern take on conda (Rust): same channels and system-library power, plus a reproducible project lockfile
    • can also pull from PyPI.

How they fit together

  • pip / uv install into an environment; venv makes one. Older baseline: venv + pip.
  • uv, conda, pixi each bundle isolation + installation in one tool.
  • conda / pixi uniquely reach beyond Python to system-level dependencies.

Quick rule of thumb

  • Pure-Python project → uv (or venv + pip).
  • Need CUDA, compilers, or system libraries (e.g. GPU/scientific work) → pixi (or conda).

(You may also see poetry/hatch, which are project managers focused on building and publishing PyPI packages, in the pip/uv family.)

setup_openfold, step by step

Install pixi and set up the environment.

# 0. clone openfold (I recommend the `gh` cli)
gh repo clone aqlaboratory/openfold-3 # or your own fork

# 1. install pixi
curl -fsSL https://pixi.sh/install.sh | bash

# 2. from repo root, resolve the GPU env
pixi install -e openfold3-cuda12

# 3. download model weights + CCD
pixi run -e openfold3-cuda12 setup_openfold

Once the environment exists, setup_openfold is a short interactive flow that does five things: pick a cache directory, pick a parameter directory, download model parameters, set up the CCD (the Chemical Component Dictionary biotite needs to build molecules), and optionally run the integration tests. I would go ahead and run the integration tests. It’s a good sanity check that everything is running correctly. As you’ll see, I ran into some issues and even landed a PR to fix flaky tests that were failing for me locally.

The parameter download menu offers three options: the default checkpoint, all checkpoints, or a specific one by name. There are two checkpoints worth knowing about: openfold3-p2-155k (the default) and openfold3-p2-145k, which is an earlier training snapshot with 10k fewer steps and no documented benefit. Take the default. The downloads come from a public, anonymous S3 bucket, so there are no credentials to set up.

Everything lands under ~/.openfold3: the checkpoint itself (of3-p2-155k.pt, about 2.2GB), a ckpt_root pointer, and the CCD components.bcif. After this, run_openfold auto-discovers the checkpoint through that cache directory, so you never pass a model path by hand.
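If you want to confirm the download landed before running anything, a minimal sketch like the one below can help. The three file names come from the paragraph above. Where exactly they sit inside the cache is an assumption, so the sketch searches recursively instead of hard-coding a layout. find_artifacts is a hypothetical helper, not part of OF3.

```python
from pathlib import Path

# File names as mentioned in the post; their location inside the cache is
# an assumption, so we search recursively rather than hard-code subfolders.
EXPECTED = ("of3-p2-155k.pt", "ckpt_root", "components.bcif")


def find_artifacts(cache_dir, names=EXPECTED):
    """Map each expected file name to the first matching path, or None if absent."""
    root = Path(cache_dir).expanduser()
    found = {}
    for name in names:
        matches = sorted(root.rglob(name)) if root.is_dir() else []
        found[name] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for name, path in find_artifacts("~/.openfold3").items():
        status = f"ok  {path}" if path else "MISSING"
        print(f"{name:20s} {status}")
```

A MISSING line here only means the file was not found under that directory. If you picked a different cache or parameter directory during setup_openfold, point the helper there instead.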

One gotcha: option 2, “download all,” exited on me with “No directory specified” rather than downloading anything. If you only need to run inference, option 1 (the default checkpoint) is all you want anyway, so I didn’t go back to chase it at the time. I did eventually try to reproduce it so I could file an issue or PR, and couldn’t trigger it again. Maybe user error?

Running a prediction

Inference is run_openfold predict pointed at a query JSON. The query format is small: a set of named queries, each with one or more chains, each chain carrying a molecule type and a sequence.

{
  "queries": {
    "cytochrome_c1": {
      "chains": [
        {
          "molecule_type": "protein",
          "chain_ids": ["A"],
          "sequence": "MAAAAASLRG..."
        }
      ]
    }
  }
}

# Basic inference example
pixi run -e openfold3-cuda12 run_openfold predict \
  --query_json=query.json \
  --output_dir=out \
  --use_msa_server=True \
  --use_templates=False \
  --num_diffusion_samples=5 \
  --num_model_seeds=1
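Writing the query JSON by hand gets tedious past one or two sequences. Here is a minimal sketch that builds the same structure from a name-to-sequence mapping. It uses only the keys shown in the example above and assumes every query is a single chain with ID "A". build_query is a hypothetical helper, and the sequence in the demo is a placeholder, not a real protein.

```python
import json


def build_query(sequences, molecule_type="protein"):
    """Build a query dict with one single-chain query per named sequence."""
    queries = {}
    for name, seq in sequences.items():
        queries[name] = {
            "chains": [
                {
                    "molecule_type": molecule_type,
                    "chain_ids": ["A"],  # assumption: one chain per query
                    "sequence": seq,
                }
            ]
        }
    return {"queries": queries}


if __name__ == "__main__":
    # Placeholder sequence for illustration only.
    query = build_query({"example_peptide": "MKTAYIAKQR"})
    print(json.dumps(query, indent=2))
```

Redirect the printed output to query.json and it can be passed to the predict command above. Multi-chain complexes would need more than this sketch handles.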

The single most consequential flag is --use_msa_server. With it on, OF3 submits your sequence to the ColabFold MSA server and folds with a real multiple sequence alignment (MSA): a stack of evolutionarily related sequences for the same protein, where the patterns of which residues change together are a big part of how these models infer 3D structure. With it off, you fold from the single sequence alone (a dummy MSA), which is faster and fully offline but dramatically less accurate. Concretely: the cytochrome c from the overview post hit an average pLDDT (Predicted Local Distance Difference Test, an AlphaFold-style per-residue confidence score from 0 to 100, where higher means the model is more sure about how that part of the structure is positioned) of around 84 with MSAs and cratered to around 40 without. Evolutionary signal, it seems, is not a nice-to-have for these models; it is most of the predictive power.

The other knobs I reach for: --num_diffusion_samples (how many candidate structures the diffusion module generates per seed, then ranks) and --num_model_seeds. For background, AlphaFold 3 and its relatives generate atomic coordinates with a diffusion model, starting from noise and denoising into a structure. Each sample is one such run, so more samples means more candidate structures to rank. Each prediction writes out a model file plus a confidence JSON, where the numbers that matter are avg_plddt (per-residue confidence averaged over the structure, 0 to 100), ptm (a global fold-confidence score), has_clash, and a sample_ranking_score used to order the samples.
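To compare samples without opening each file, something like the following sketch works. The four metric names are the ones listed above. Everything else is an assumption: the glob pattern for the confidence files is a guess at the naming, each JSON is assumed to be a flat object, and a higher sample_ranking_score is assumed to be better. rank_samples is a hypothetical helper.

```python
import json
from pathlib import Path


def rank_samples(output_dir, pattern="*confidence*.json"):
    """Return (path, metrics) pairs sorted best-first by sample_ranking_score.

    The file pattern and the 'higher is better' ordering are assumptions.
    """
    rows = []
    for path in Path(output_dir).rglob(pattern):
        metrics = json.loads(path.read_text())
        # Skip JSON files that do not carry a ranking score.
        if "sample_ranking_score" in metrics:
            rows.append((path, metrics))
    rows.sort(key=lambda row: row[1]["sample_ranking_score"], reverse=True)
    return rows


if __name__ == "__main__":
    for path, m in rank_samples("out"):
        print(path.name, m.get("avg_plddt"), m.get("ptm"), m.get("has_clash"))
```

If the pattern matches nothing in your output directory, list the directory and adjust it to whatever the confidence files are actually called.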

Low-memory mode, or: the 24GB reality

My local card (RTX 3090) has only 24GB of VRAM, as opposed to the recommended minimum of 32GB. To find the ceiling I ran a length sweep with default settings (single chain, one sample, fp32) and watched peak VRAM and runtime climb:

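| Residues | Peak VRAM | Runtime |
|----------|-----------|---------|
| 128      | 2.7 GB    | 6.5 s   |
| 512      | 6.5 GB    | 34 s    |
| 768      | 11.7 GB   | 73 s    |
| 1024     | 18.4 GB   | 137 s   |
| 1536     | 22.9 GB   | 333 s   |
| 2048     | OOM       | n/a     |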

The out-of-memory knee sits somewhere around 1600 to 1900 residues. Past the smallest inputs, runtime grows roughly with the square of length, and memory a little slower than that, because the dominant cost is the O(N^2) pair and triangle tensors rather than the fixed model weights.
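One way to sanity-check the scaling claim is to estimate a local exponent k between consecutive rows of the sweep, assuming a power law y ~ N^k over each interval. The sketch below does that with the numbers from the table. local_exponents is a hypothetical helper, and with only five points the estimates are noisy.

```python
import math

# (residues, peak VRAM in GB, runtime in s), copied from the sweep table.
SWEEP = [
    (128, 2.7, 6.5),
    (512, 6.5, 34.0),
    (768, 11.7, 73.0),
    (1024, 18.4, 137.0),
    (1536, 22.9, 333.0),
]


def local_exponents(rows, column):
    """Estimate k in y ~ N**k between each pair of consecutive rows."""
    out = []
    for lo, hi in zip(rows, rows[1:]):
        k = math.log(hi[column] / lo[column]) / math.log(hi[0] / lo[0])
        out.append((lo[0], hi[0], k))
    return out


if __name__ == "__main__":
    for label, column in (("VRAM", 1), ("runtime", 2)):
        for n_lo, n_hi, k in local_exponents(SWEEP, column):
            print(f"{label:8s} {n_lo:5d} -> {n_hi:5d}  k = {k:.2f}")
```

From 512 residues upward the runtime exponents come out at roughly 1.9 to 2.2, which fits the quadratic reading. The first interval is lower, presumably because fixed overhead dominates at small sizes. The VRAM exponents jump around more, so treat them as a rough guide only.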

The lever for getting past that wall is the low_mem preset, which you stack on top of predict through a runner YAML (model_update.presets: [predict, low_mem]). What it actually changes is mechanical: it chunks the big attention and triangle operations, offloads activations to the CPU instead of holding them on the GPU, and applies token cutoffs. It costs you runtime in exchange for headroom, so I leave it off for small inputs like the examples here and only switch it on when a query is large enough to threaten the ceiling (or once I’ve experienced an OOM directly).
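As a sketch of that decision, the snippet below picks presets from the query length and renders the model_update.presets setting as YAML text. The 1500-residue cutoff is an assumption read off the sweep on a 24GB card (1536 residues fit, 2048 did not), not an OF3 value, and both function names are hypothetical. How the YAML file is handed to run_openfold is not shown here; check the docs for the runner YAML option.

```python
def presets_for(n_residues, low_mem_above=1500):
    """Pick presets by query size; the 1500 default is an assumption, not an OF3 value."""
    presets = ["predict"]
    if n_residues > low_mem_above:
        presets.append("low_mem")
    return presets


def render_runner_yaml(presets):
    """Render the model_update.presets setting as a small YAML document."""
    lines = ["model_update:", "  presets:"]
    lines += [f"    - {name}" for name in presets]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A 2048-residue query is past the cutoff, so low_mem gets stacked on.
    print(render_runner_yaml(presets_for(2048)), end="")
```

The cutoff is deliberately a little conservative. Raise it if you would rather keep full speed until you actually hit an OOM.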

Testing and the contributor loop

The contributor loop is the same one the docs prescribe: ruff format && ruff check --fix, then pytest openfold3/tests. Tests run in parallel and lean on a seeded_rng fixture for determinism. Two things tripped me up early, though neither is a defect in the model itself.

First, an actually-flaky test: test_template_module_offload compared the offloaded and non-offloaded code paths with torch.allclose at its default tolerances, but the module’s weights were initialized unseeded, so a legitimate floating point difference on the order of 1e-7 would straddle the 1e-8 default absolute tolerance and fail maybe 30 to 50% of the time. The fix was to seed the initialization and loosen the tolerance to something physically reasonable. That became my second tiny PR, #256.
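To see why a 1e-7 difference sometimes passes and sometimes fails, it helps to write out the check. The sketch below is a scalar version of the documented torch.allclose criterion, |a - b| <= atol + rtol * |b|, with the same defaults (rtol=1e-5, atol=1e-8). It is an illustration in plain Python, not the test from the PR, and the loosened 1e-6 value is illustrative only.

```python
def is_close(a, b, rtol=1e-5, atol=1e-8):
    """Scalar form of the torch.allclose check: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


diff = 1e-7  # the size of discrepancy described above

# Near zero the relative term vanishes, so the 1e-8 absolute floor decides.
print(is_close(0.0, 0.0 + diff))             # False
# Around 1.0 the relative term (about 1e-5) absorbs the same difference.
print(is_close(1.0, 1.0 + diff))             # True
# A looser absolute tolerance passes even near zero.
print(is_close(0.0, 0.0 + diff, atol=1e-6))  # True
```

So whether the comparison passes depends on how close the compared values are to zero, which is consistent with a test that flips when unseeded initial weights change from run to run.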

Second, failures that look alarming but are not: the triangular-attention test_shape[cuda] tests compare against committed numerical snapshots that were generated on a different GPU and CUDA version (an NVIDIA GB10 on CUDA 13). On my 3090 and CUDA 12 they fail at a 1e-6 tolerance purely from hardware-level floating point differences. Running with --force-regen regenerates the baselines locally and they pass, which confirms it is a hardware mismatch and not a bug. The important part: do not commit those regenerated snapshots, since you would just be swapping in baselines that fail for everyone else.

A structure I actually ran

To close, and to keep a thread running through this series, here is a structure I folded while writing this post: human cytochrome c1, the heme protein of respiratory Complex III. It is the upstream partner of the cytochrome c from the overview. Cytochrome c1 is the thing that actually hands electrons to cytochrome c, which then carries them onward to Complex IV. So part 1 and part 2 of this series are two consecutive steps in the same electron relay.

[Figure: interactive 3D structure of human cytochrome c1 (CYC1), predicted with OpenFold 3 · avg pLDDT ≈ 79 · colored by OpenFold 3 pLDDT on the AlphaFold confidence scale (blue high, orange low)]

This one is a more honest picture of a real prediction than the tidy cytochrome c was. The compact, deep-blue globular domain is the heme-binding core, and OpenFold 3 is confident about it. The orange and cyan strands trailing off it are the mitochondrial targeting presequence and the C-terminal membrane anchor, and the model correctly flags them as low confidence: pulled out of the membrane and stripped of their context, they genuinely do not have one fixed structure, and the pLDDT coloring says exactly that. This is the kind of thing the confidence scores are for.

For the sanity check, I superposed the confident core against AlphaFold’s model of the same sequence (UniProt P08574). The globular domain agrees to about 0.57 Å backbone RMSD, essentially identical, while the floppy anchor sits at a different angle in the two models, which is exactly what you would expect for a region neither model is sure about. As in the overview, the caveat holds: the public AlphaFold database is AlphaFold 2, not 3, but for a domain like this AF2 is a fair, near-experimental yardstick.
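For reference, a backbone RMSD after superposition is usually defined as below, assuming N matched backbone atoms with coordinates x_i in one model and y_i in the other, and an optimal rigid-body fit (rotation R, translation t). Which atoms and which tool went into the number above is not specified here, so treat the symbols as illustrative.

```latex
\mathrm{RMSD} = \min_{R,\,\mathbf{t}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert R\,\mathbf{x}_i + \mathbf{t} - \mathbf{y}_i \right\rVert^{2}}
```

Restricting the sum to the confident core is what keeps the floppy anchor from dominating the result.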

Next up: the technical deep dive into how the codebase is actually built.