A Brief Overview of OpenFold 3
- 1 A Brief Overview of OpenFold 3 (you are here)
- 2 OpenFold 3 Local Setup
- 3 OpenFold 3 Technical Deep Dive
- 4 Contributing to OpenFold 3: A Primer
What is OpenFold?
OpenFold is an open source reimplementation of AlphaFold. There are actually quite a few of these out there, including Boltz, Chai-1, Protenix, and HelixFold3, among others. More than I was expecting when I first started digging into the space. I’m personally aiming to move my career into AI for Science, and I’m targeting Bio/ML as an on ramp because I find the field genuinely interesting. I decided recently that I wanted to focus on open source contributions over personal toy projects, because I want my time to have an actual impact. I’m still learning from various sources as time permits (for example, working through Deep Learning for Biology), but open source projects are where I want most of my time to go. I’ve dabbled a bit in cccl/parrot/CUDA locally, but OpenFold 3 caught my eye this week (I’ve actually already landed 2 tiny PRs at the time of writing: a broken Slack invite link and a flaky test fix).
(Aside: OpenFold isn’t the only thing on my radar. ESM (Evolutionary Scale Modeling) is also really appealing as a project to get involved in. It comes at biology from the language-model angle rather than structure prediction, and the open weights and active community make it another strong on ramp. Something for a future post, maybe.)
Why did I personally choose OpenFold 3? I’ve been deeply interested in proteins and microbiology since I first learned about AlphaFold and truly understood how impactful these kinds of models could be for the world. LLMs are neat, don’t get me wrong, but there’s something far more moving about the ability to impact drug discovery or disease research. Having lost many loved ones to cancer and other devastating diseases, I find the idea of making an impact here very compelling. Is working on OpenFold 3 the best way to do that? Unclear. But it’s certainly a step in the right direction. Something about the funding situation caught my eye as well, as I’ll explain.
Why does OpenFold exist?
So why are there open source reimplementations of AlphaFold? I think some of the labs doing these projects (e.g. Baidu, ByteDance) are doing it as a form of signaling/posturing for hiring. Some of the labs are likely doing it for the love of the game, learning, etc. I think there’s been a lot of tension around AlphaFold 3. My understanding is that AlphaFold 2 was fully open sourced. AlphaFold 3 was originally closed source; the code was released later under a non-commercial license, but I believe the weights are still restricted. So largely the goal is to bring the benefit of protein folding via ML to the world without limitations or centralization. In the words of the OpenFold 3 team in the codebase itself:
“…aiming to be a bitwise reproduction of DeepMind’s AlphaFold3… This research preview is intended to gather community feedback and allow developers to start building on top of the OpenFold ecosystem. The OpenFold project is committed to long-term maintenance and open source support, and our repository is freely available for academic and commercial use under the Apache 2.0 license.”
Before going further, a word on who “the OpenFold project” actually is, rather than hand-waving at a faceless “consortium.” It started back in 2022 out of Mohammed AlQuraishi’s lab at Columbia (the same group behind the original OpenFold reproduction of AlphaFold 2), and Nazim Bouatta has been a key co-lead. The consortium is hosted by the Open Molecular Software Foundation, a nonprofit, and chaired by Psivant founder Woody Sherman. The members actually footing the bill are a mix of big pharma and infrastructure players: Bristol Myers Squibb, Bayer, Novo Nordisk, Biogen, and others, with Amazon, Microsoft, and NVIDIA backing the compute side. So it’s less a vague collective and more a named academic lab with deep-pocketed industry sponsors who all want an open AF3 to build on.
A more detailed explanation comes from the OpenFold Consortium’s publicly stated goals. Basically it comes down to three things:
- AlphaFold3’s licensing is the core driver. Unlike AF1/AF2, DeepMind released AF3 under a non-commercial, weights-restricted model. There were no open weights initially, and there is no commercial use. OpenFold3 exists to provide an Apache-2.0, commercially usable, open-weights equivalent. This seems to be the load-bearing reason. The “bitwise reproduction”[1] framing is the means; open commercial access is the end.
- Democratizing a foundation model for biology. Per consortium chair Woody Sherman, the goal is matching AF3’s capability while giving biotech/pharma/academia a foundation they can actually build on and fine-tune and, quote, “raise the bar for the entire field.”
- A platform, not just a clone. The repo language (“build on top of the OpenFold ecosystem”) reflects intent for OF3 to be a base for downstream work, and that’s already happening (e.g. SandboxAQ’s AQAffinity builds affinity prediction on top of OF3).
Who are the AlphaFold contenders, and who funds them?
I couldn’t help but take a quick detour here, as I was doing more research to satisfy my own curiosity. I’ll keep it brief.
Turns out there are a lot of them, and the more interesting question is who’s footing the bill. They sort into four camps:
- Big Tech showing off: ByteDance (Protenix) and Baidu (HelixFold) do open releases that double as hiring and posturing, the same playbook they run in the LLM space.
- A VC-backed startup: Chai Discovery (Chai-1/-2), funded by OpenAI and Thrive Capital at a $1.3B valuation. Started open, now drifting closed, which is basically AF3’s trajectory.
- Pharma funding a pipeline: Boltz (MIT plus Recursion). Recursion pays for it because binding-affinity prediction feeds its own drug discovery. The model is a means, not the product.
- A consortium funding the commons: OpenFold3, run out of AlQuraishi’s lab and bankrolled by pooled pharma and infra sponsors (BMS, Bayer, Novo Nordisk, AWS, NVIDIA, and more) with no single corporate owner.
The thing that stuck with me: everyone else’s openness is contingent on a business model. OpenFold’s openness is the business model. That’s largely why I picked it.
A structure I actually ran
That’s all neat, but we’re here for protein folding, after all. Below is human cytochrome c, predicted on my own machine with OpenFold 3 (an RTX 3090, ColabFold MSAs, five diffusion samples, about seven seconds end to end). An MSA (Multiple Sequence Alignment) is a stack of evolutionarily related sequences for the same protein; the patterns of which residues change together are a big part of how these models infer 3D structure. As for the diffusion samples: AlphaFold 3 and its relatives generate atomic coordinates with a diffusion model, starting from noise and denoising into a structure. Each sample is one such run, so more samples means more candidate structures to rank.
Cytochrome c is a small but essential cog in the mitochondrial respiratory chain. It ferries electrons to Complex IV (cytochrome c oxidase), the enzyme that finally hands those electrons off to O2 and helps pump the protons whose gradient ends up powering ATP synthesis. So when you breathe, this little protein is part of why the oxygen matters.
The choice isn’t random. I’ve been reading Nick Lane’s The Vital Question, which argues that the proton gradients mitochondria maintain across their membranes are not a biochemical footnote but something central to the origin of complex life itself. It is hard to read that book and not come away wanting to stare at one of the little machines that actually run the chemistry.
The coloring shows OpenFold 3’s per-residue confidence (pLDDT) on the standard AlphaFold color scale. pLDDT stands for Predicted Local Distance Difference Test: an AlphaFold-style per-residue confidence score from 0 to 100, where higher means the model is more sure about how that part of the structure is positioned. Here the whole chain comes out high confidence, deep blue through the helical core and easing to lighter cyan only at the very ends. The model never saw an experimental cytochrome c structure in this run. It folded the chain from sequence and evolutionary context alone and still recovered the classic compact fold (minus the heme group, which I left out of the query). The heme is an iron-containing cofactor. In cytochrome c the heme is what actually carries the electron; the protein chain is the scaffold that holds and tunes it. When I first ran this fully offline with no MSA, confidence cratered to a pLDDT around 40, a nice illustration of how much these models lean on evolutionary signal rather than sequence alone.
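To make the color scale concrete, here is a minimal sketch of how a pLDDT score maps to a confidence band. The thresholds and hex colors follow the legend used on the AlphaFold Database. The function names and the example scores are hypothetical, and the sketch assumes the per-residue scores have already been pulled out of the prediction output; it does not use any OpenFold 3 API.

```python
from collections import Counter

# Confidence bands from the AlphaFold Database legend:
# (exclusive lower bound, label, hex colour).
BANDS = [
    (90.0, "very high", "#0053D6"),  # dark blue
    (70.0, "confident", "#65CBF3"),  # light blue
    (50.0, "low", "#FFDB13"),        # yellow
]
VERY_LOW = ("very low", "#FF7D45")   # orange


def plddt_band(score):
    """Return (label, hex colour) for one pLDDT score in [0, 100].

    A score sitting exactly on a threshold falls into the lower band;
    that tie-breaking rule is a choice made for this sketch.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    for lower, label, colour in BANDS:
        if score > lower:
            return label, colour
    return VERY_LOW


def summarize(scores):
    """Count how many residues land in each confidence band."""
    return Counter(plddt_band(s)[0] for s in scores)


if __name__ == "__main__":
    # Hypothetical scores for illustration, not values from the real run.
    example = [96.1, 93.4, 88.0, 71.5, 40.2]
    for score in example:
        print(score, plddt_band(score))
    print(dict(summarize(example)))
```

On this scale, the no-MSA run at a pLDDT around 40 sits in the lowest band, while the MSA-backed run sits almost entirely in the top two.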
As a sanity check I superposed the prediction on AlphaFold’s model of the same sequence (UniProt P99999): the two agree to 0.34 Å backbone RMSD, which is to say they are basically the same structure. One honest caveat, since OpenFold 3 specifically targets AlphaFold 3: the public AlphaFold database holds AlphaFold 2 predictions, and there is no public bulk download of AlphaFold 3 structures. For a compact, well-studied monomer like this, AF2 is itself essentially experiment-accurate, so it is a fair yardstick. The same comparison shows up in each post in this series.
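For anyone curious what "superpose and measure RMSD" involves, here is a minimal sketch using the Kabsch algorithm in NumPy. It is not necessarily the tool behind the number above. It assumes the backbone atoms of both structures have already been extracted into two (N, 3) arrays with atoms in matching order; the function name `kabsch_rmsd` and the synthetic coordinates are made up for illustration.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimally
    superposing P onto Q with a rigid rotation and translation."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Covariance matrix and its SVD give the optimal rotation.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Guard against an improper rotation (a reflection).
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P (row vectors, hence R.T) and measure the residual.
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


if __name__ == "__main__":
    # Synthetic check: a rotated and shifted copy of the same points
    # should superpose with an RMSD very close to zero.
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(20, 3))
    theta = np.deg2rad(40.0)
    rot_z = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ])
    moved = ref @ rot_z.T + np.array([5.0, -2.0, 1.0])
    print(kabsch_rmsd(moved, ref))
```

The reflection guard matters for proteins: without it, the math could "improve" the fit by mirroring the structure, which would hide a chirality error instead of reporting it.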
Where I’m headed
This post was really just me getting my bearings: what OpenFold is, why it exists, and who’s behind it. The fun part starts now. I’m planning a series of follow-ups as I dig into the codebase and actually contribute. Next up is the technical deep dive I keep threatening: a walk through the OpenFold 3 architecture, getting inference running locally, and what I found when I profiled it on a single consumer GPU.[2]
This is the concrete version of something I worked through in On motivation and meaning: wanting to point my career at real science instead of the next frontend framework. OpenFold is me actually doing that, one tiny PR at a time. More soon.
Footnotes
1. “Bitwise” is doing a lot of work in that sentence. The goal is not “a model roughly as good as AF3” but one that reproduces its behavior closely enough to be a true open stand-in. In practice the repo enforces this with snapshot regression tests, which I promptly tripped over the moment I ran them on a non-reference GPU. More on that in the deep dive.
2. An RTX 3090, 24GB. The docs recommend 32GB or more, which turns “run inference” into “run inference and watch exactly where it falls over.” That turns out to be a feature, not a bug, if what you’re after is finding the performance bottlenecks.