AlphaFold: if it's just supervised learning, what was actually hard?

The provocation

“AlphaFold is essentially supervised learning with deep neural networks. So what was the hard part?”

The premise is technically correct: AlphaFold 2 (Jumper et al., Nature 2021) is trained by gradient descent on a loss between predicted and experimentally-determined protein structures. Sequences in, coordinates out, labels from the Protein Data Bank. By that definition, yes — supervised learning.

That framing is also the kind of thinking that leaves you stuck at a GDT_TS of around 60. CASP14 was the breakthrough event it was (a median GDT_TS of 92.4 across all targets and 87.0 on the free-modeling targets, where the best free-modeling median at CASP13 had been just under 60) because the team solved a stack of problems that don’t usually show up in “standard” supervised learning pipelines. The lift was not “we trained a bigger network.” It was roughly eight separate inventions that had to land together.

Below is what I think the actual difficult parts were, in roughly the order they bite you when you try to do this naively.

1. The labels are wrong, or rather, there aren’t enough of them

Naive view: PDB has ~170,000 experimentally-determined structures. That’s a labeled dataset. Train on it.

What’s actually in PDB:

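After clustering at 40% sequence identity, you have maybe ~25,000–30,000 non-redundant sequence clusters, and far fewer distinct folds (fold classifications such as CATH and SCOP count them on the order of a thousand).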
Many entries are near-duplicates of the same protein (different ligands, different organisms, different crystallographic conditions).
Coverage is wildly biased toward proteins that are easy to crystallize — soluble, stable, small enough.

By the standards of, say, ImageNet (1.28M cleanly-labeled examples for a 1000-way classification problem), this is a tiny dataset for a target space that is enormously richer (atomic coordinates of chains hundreds-to-thousands of residues long, in continuous 3D space).

The actual signal is hiding in unlabeled data. Every protein sequence in nature is a sample from evolution. If residue i and residue j are in physical contact in 3D, then mutations at i often correlate with compensating mutations at j across millions of years of evolution — because the contact has to be preserved for the protein to fold and function. This co-evolutionary signal is observable in Multiple Sequence Alignments (MSAs): stack up thousands of homologous sequences from across the tree of life, and the columns that move together are talking to each other in 3D space.
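As a rough illustration of the co-evolution signal described above, here is a toy sketch that scores how strongly two alignment columns co-vary using plain mutual information. The alignment and the function name are made up for illustration. Real contact predictors used more careful statistics (for example, corrections for phylogenetic bias), and AlphaFold 2 learns from the MSA directly rather than computing a statistic like this.

```python
from collections import Counter
from math import log2

def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Hypothetical toy alignment: columns 1 and 3 mutate together
# (K pairs with E, R pairs with D); column 2 varies independently.
toy_msa = [
    "AKGEL",
    "AKSEL",
    "ARGDL",
    "ARSDL",
]

mi_coupled = column_mutual_information(toy_msa, 1, 3)
mi_uncoupled = column_mutual_information(toy_msa, 1, 2)
```

Columns that co-vary score high and independent columns score near zero, which is the raw signal that a contact between those two positions might exist.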

The MSA isn’t labels — it’s correlated unlabeled data. AlphaFold’s first hard problem was: how do you build a supervised architecture whose input is the MSA, not just the target sequence? The labels you have (PDB) are vastly outnumbered by the unlabeled side-information that actually carries the signal.

This is closer to semi-supervised representation learning than to vanilla supervised learning.

2. The output isn’t a vector — it’s a geometric object

A typical supervised problem has an output that lives in ℝⁿ for some fixed n, or in a discrete class space. You compute a loss, propagate gradients, move on.

A protein structure is the 3D coordinates of every atom, where:

“Every atom” depends on the sequence — different residues have different numbers of atoms (alanine has 5 heavy atoms, tryptophan has 14).
The whole structure is invariant under rotation and translation of the global frame. A protein rotated 90° is the same protein. Your loss must not punish rotations.
The dominant geometric constraints are internal: bond lengths, bond angles, the dihedrals (φ, ψ, χ angles) that define backbone and side-chain conformations.

You can’t just predict a flat vector of coordinates and apply MSE — you’d be implicitly punishing the model for choosing the “wrong” global rotation. You can’t trivially apply a transformer either, because attention as it appears in NLP doesn’t have an opinion about whether two tokens are 4 Å apart or 40 Å apart in 3D.
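To make the rotation problem concrete, here is a small sketch with made-up coordinates: the same four-point “structure” is rotated 90° about the z-axis. Coordinate-wise MSE reports a large error even though nothing about the shape changed, while the internal pairwise distances are untouched. The helper names are illustrative only.

```python
import math

def rotate_z(points, angle_rad):
    """Rotate 3D points about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def coordinate_mse(a, b):
    """Mean squared coordinate error per point, with no alignment."""
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return total / len(a)

def pairwise_distances(points):
    """All internal distances; these do not depend on the global frame."""
    return [math.dist(p, q)
            for idx, p in enumerate(points)
            for q in points[idx + 1:]]

# Hypothetical toy coordinates, spaced roughly like consecutive C-alpha atoms.
truth = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
rotated = rotate_z(truth, math.pi / 2)

mse = coordinate_mse(truth, rotated)
max_distance_change = max(
    abs(d1 - d2)
    for d1, d2 in zip(pairwise_distances(truth), pairwise_distances(rotated))
)
```

Pairwise distances alone are not a complete fix either, because they cannot tell a structure from its mirror image. That is one reason to work with oriented local frames instead.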

The invention that solves this in AlphaFold 2 is the Structure Module built around Invariant Point Attention (IPA). Each residue carries a local 3D frame (a rotation and translation defining where that residue is in space and how it’s oriented). IPA performs attention where the queries, keys, and values include points expressed in those local frames; the attention scores are invariant under any global rigid-body transformation. The structure is then refined iteratively: eight weight-shared layers each update every residue’s frame, starting from all frames placed at the origin.

This is custom geometric machine learning. There is essentially no pre-existing “off the shelf” component you can grab from a library to do this — it had to be designed for the problem.

3. The loss function is also bespoke

If invariance is the constraint on the output space, the loss has to respect it. AlphaFold 2 introduced FAPE (Frame Aligned Point Error):

For each predicted residue frame, transform every predicted atom position into that frame’s local coordinate system (Cα atoms only for the intermediate backbone loss, all heavy atoms for the final loss). Do the same for the ground truth. Compute the per-point distance, clamp it (at 10 Å in the paper), and average over all (frame, point) pairs.

The result is a loss that:

Is invariant under global rotation/translation of either the prediction or the ground truth.
Penalizes local geometric errors strongly (a residue’s neighborhood must be correct).
Doesn’t artificially blow up when a small early mistake produces a globally rotated downstream chunk.
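A minimal NumPy sketch of a Cα-only FAPE, under stated assumptions: frames are built from each residue’s N, Cα, and C atoms by Gram-Schmidt, the clamp and scale are both 10 Å, and the coordinates are random toy data. The function names are mine, and this is not the reference implementation, which also handles side-chain frames, masks, and all heavy atoms.

```python
import numpy as np

def frames_from_backbone(atoms):
    """atoms: (N_res, 3, 3) array of N, CA, C positions per residue.
    Returns rotations (N_res, 3, 3) and translations (N_res, 3)."""
    n, ca, c = atoms[:, 0], atoms[:, 1], atoms[:, 2]
    e1 = c - ca
    e1 = e1 / np.linalg.norm(e1, axis=-1, keepdims=True)
    u2 = n - ca
    e2 = u2 - e1 * np.sum(e1 * u2, axis=-1, keepdims=True)
    e2 = e2 / np.linalg.norm(e2, axis=-1, keepdims=True)
    e3 = np.cross(e1, e2)
    # Columns of each rotation matrix are the frame's basis vectors.
    return np.stack([e1, e2, e3], axis=-1), ca

def to_local(rot, trans, points):
    """Express every point j in every frame i: R_i^T (x_j - t_i)."""
    diff = points[None, :, :] - trans[:, None, :]
    return np.einsum("iab,ija->ijb", rot, diff)

def fape(pred_atoms, true_atoms, clamp=10.0, scale=10.0, eps=1e-8):
    pred_rot, pred_ca = frames_from_backbone(pred_atoms)
    true_rot, true_ca = frames_from_backbone(true_atoms)
    delta = to_local(pred_rot, pred_ca, pred_ca) - to_local(true_rot, true_ca, true_ca)
    dist = np.sqrt(np.sum(delta ** 2, axis=-1) + eps)
    return float(np.mean(np.minimum(dist, clamp)) / scale)

rng = np.random.default_rng(0)
true_atoms = rng.normal(scale=3.0, size=(6, 3, 3))  # made-up toy coordinates

# A random proper rotation (determinant +1) plus a translation.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1.0
moved = true_atoms @ q.T + np.array([5.0, -2.0, 1.0])
noisy = true_atoms + rng.normal(scale=0.5, size=true_atoms.shape)
mirrored = true_atoms * np.array([-1.0, 1.0, 1.0])

fape_moved = fape(moved, true_atoms)        # rigid motion: essentially zero
fape_noisy = fape(noisy, true_atoms)        # real geometric error: penalized
fape_mirrored = fape(mirrored, true_atoms)  # mirror image: also penalized
```

The mirrored case is worth noticing: because the frames are oriented, a reflected structure is scored as wrong, so the loss can distinguish chirality in a way a pure distance-matrix loss cannot.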

This is a non-obvious loss function. It is the kind of thing you only design once you have spent time staring at why simpler losses fail.

Alongside FAPE, AlphaFold 2 uses auxiliary losses on intermediate predictions: a distogram (predicted distribution over pairwise distances), a masked-MSA cross-entropy (recover randomly masked MSA entries, a BERT-style self-supervised loss baked into the supervised pipeline), and per-residue confidence prediction (pLDDT). These auxiliaries shape the representations the trunk learns; the paper’s ablations show accuracy dropping when they are removed.

4. The architecture is bespoke (the Evoformer)

The “neural network” in “deep neural networks” hides the central architectural innovation: the Evoformer, a 48-block trunk that simultaneously maintains two representations and lets them talk:

MSA representation — a tensor of shape (Nseq × Nres × c) capturing the aligned homologous sequences and their per-residue features.
Pair representation — a tensor of shape (Nres × Nres × c) capturing the relationship between every pair of residues (eventually the predicted distances and orientations).

Every block updates both representations and lets them inform each other:

Row-wise gated self-attention along the MSA, biased by the pair representation. (Information flows pair → MSA.)
Column-wise self-attention along the MSA. (Information flows across the evolutionary dimension.)
Outer product mean updates the pair representation from the MSA. (Information flows MSA → pair.)
Triangle multiplicative updates and triangle self-attention within the pair representation.
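To show how information can flow from the MSA representation into the pair representation, here is a stripped-down sketch of an outer product mean. It assumes plain linear projections with random weights and toy sizes, and it leaves out the layer normalization and biases a real block would have. All names and dimensions are illustrative.

```python
import numpy as np

def outer_product_mean(msa_rep, w_a, w_b, w_out):
    """msa_rep: (N_seq, N_res, c_m) -> pair update of shape (N_res, N_res, c_z)."""
    a = msa_rep @ w_a  # (N_seq, N_res, c_hidden)
    b = msa_rep @ w_b  # (N_seq, N_res, c_hidden)
    # Outer product of the features at residues i and j, averaged over sequences.
    outer = np.einsum("sic,sjd->ijcd", a, b) / msa_rep.shape[0]
    n_res = msa_rep.shape[1]
    return outer.reshape(n_res, n_res, -1) @ w_out

rng = np.random.default_rng(0)
n_seq, n_res, c_m, c_hidden, c_z = 7, 5, 8, 4, 6  # toy sizes, not the real ones
msa_rep = rng.normal(size=(n_seq, n_res, c_m))
w_a = rng.normal(size=(c_m, c_hidden))
w_b = rng.normal(size=(c_m, c_hidden))
w_out = rng.normal(size=(c_hidden * c_hidden, c_z))

pair_update = outer_product_mean(msa_rep, w_a, w_b, w_out)
# Averaging over sequences makes the result independent of MSA row order.
shuffled = outer_product_mean(msa_rep[rng.permutation(n_seq)], w_a, w_b, w_out)
```

The average over the sequence axis is what turns per-sequence features into a statistic about residue pairs, which is the learned analogue of a covariance between alignment columns.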

The triangle operations deserve special attention. If you know the distance from i to j and from j to k, then the distance from i to k is constrained by the triangle inequality. A pair representation that does not obey those constraints corresponds to no consistent 3D structure. Triangle updates explicitly couple every pair (i, j) to the pairs (i, k) and (j, k) for every third residue k, so the architecture carries an inductive bias toward geometrically consistent pair features rather than hoping the network discovers one. The constraint is encouraged, not enforced as a hard rule.
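A simplified sketch of a triangle multiplicative update in the “outgoing edges” style, assuming random weights and omitting layer normalization. The parameter names are hypothetical. The point is only the einsum: the update for pair (i, j) sums over every third residue k, combining features of (i, k) and (j, k).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triangle_multiply_outgoing(z, params):
    """z: pair representation (N_res, N_res, c). Returns an update of the same shape."""
    a = sigmoid(z @ params["gate_a"]) * (z @ params["proj_a"])
    b = sigmoid(z @ params["gate_b"]) * (z @ params["proj_b"])
    # Edge (i, j) collects information from edges (i, k) and (j, k) for all k.
    tri = np.einsum("ikc,jkc->ijc", a, b)
    return sigmoid(z @ params["gate_out"]) * (tri @ params["proj_out"])

rng = np.random.default_rng(0)
n_res, c = 5, 4  # toy sizes
z = rng.normal(size=(n_res, n_res, c))
names = ("gate_a", "proj_a", "gate_b", "proj_b", "gate_out", "proj_out")
params = {name: rng.normal(size=(c, c)) for name in names}

base = triangle_multiply_outgoing(z, params)

# Perturb a single edge (0, 3) and see which outputs respond.
z_perturbed = z.copy()
z_perturbed[0, 3] += 1.0
after = triangle_multiply_outgoing(z_perturbed, params)

edge_01_changed = not np.allclose(base[0, 1], after[0, 1])  # shares residue 0
edge_21_unchanged = np.allclose(base[2, 1], after[2, 1])    # no shared triangle edge
```

Changing edge (0, 3) moves the update for edge (0, 1), which sits in a triangle with it, and leaves edge (2, 1) alone in this single update. Stacking blocks lets the influence spread further.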

This is, again, not generic deep learning. It is a model class invented for the geometry of distance matrices in 3D Euclidean space.

5. One forward pass is not enough — recycling

Even with all of the above, you don’t get a good structure from a single forward pass. The network’s outputs (the pair representation, the first row of the MSA representation, and the predicted structure) are fed back in as extra inputs three times during inference, for four passes in total, and a varying number of times during training. Each cycle refines the MSA and pair representations using the previous cycle’s structural prediction as additional context.

This is closer to iterative refinement / fixed-point computation than to a typical feedforward classifier. The team had to figure out how to train such a thing efficiently — gradients aren’t backpropagated through all recycling iterations during training (too expensive); instead, a random number of iterations is sampled, only the last one’s gradients are kept, and the model learns to be a useful one-step improvement operator that composes.
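The control flow of that training trick can be sketched without any deep learning framework. In the toy below, `toy_step` stands in for a full forward pass, the cap of four cycles is an assumption taken from the inference setting described above, and the gradient handling is only recorded as a flag, since plain Python has no autodiff.

```python
import random

def sample_num_cycles(rng, max_cycles=4):
    """Pick how many passes to run for this training example."""
    return rng.randint(1, max_cycles)

def run_recycling(step_fn, features, prev, num_cycles):
    """Feed each pass's output back in as input to the next pass."""
    grad_flags = []
    for cycle in range(num_cycles):
        is_last = cycle == num_cycles - 1
        # In an autodiff framework, the non-final passes would run under
        # stop-gradient / no-grad; here we only record which pass would train.
        prev = step_fn(features, prev)
        grad_flags.append(is_last)
    return prev, grad_flags

def toy_step(target, estimate):
    """Stand-in for the network: move the estimate halfway to the target."""
    return estimate + 0.5 * (target - estimate)

rng = random.Random(0)
num_cycles = sample_num_cycles(rng)

final_estimate, grad_flags = run_recycling(toy_step, 8.0, 0.0, 4)
```

Each pass only has to be a good one-step improvement on whatever it is handed, which is why training the last pass alone can still produce a model that composes well over several passes.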

6. Self-distillation: making more labels

With ~170k PDB structures, of which maybe ~25k are non-redundant at 40% sequence identity, supervised data is the bottleneck. The team addressed this with self-distillation:

Train AlphaFold 2 on real PDB data.
Run it on roughly 350,000 sequences from Uniclust30 that have no known structure.
Filter to the subset of predictions where the model is highly confident.
Add those confident predictions to the training set as if they were labels.
Retrain.
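The filtering and merging steps above can be sketched in a few lines. Everything specific here is hypothetical: the record fields, the sequence identifiers, and especially the 0.8 threshold are placeholders chosen for illustration, not values from the paper.

```python
def build_distillation_set(experimental, predictions, min_confidence=0.8):
    """Merge real structures with the model's confident predictions.

    min_confidence is a hypothetical threshold on a 0-1 confidence score.
    """
    confident = [p for p in predictions if p["confidence"] >= min_confidence]
    labeled = [{"seq_id": e["seq_id"], "source": "experimental"} for e in experimental]
    pseudo = [{"seq_id": p["seq_id"], "source": "self_distilled"} for p in confident]
    return labeled + pseudo

# Made-up records standing in for PDB entries and model predictions.
experimental = [{"seq_id": "pdb_A"}, {"seq_id": "pdb_B"}]
predictions = [
    {"seq_id": "unlabeled_1", "confidence": 0.95},
    {"seq_id": "unlabeled_2", "confidence": 0.40},
    {"seq_id": "unlabeled_3", "confidence": 0.85},
]

training_set = build_distillation_set(experimental, predictions)
num_pseudo_labels = sum(1 for r in training_set if r["source"] == "self_distilled")
```

Keeping a `source` tag on each record is a design choice of this sketch. It makes it possible to sample or weight the two kinds of example differently during retraining.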

This is noisy-student-style self-distillation, a technique that works only because the model’s confidence estimates are reliable enough to trust, which was itself a hard problem to solve (see next section).

The final training set is therefore a hybrid: real experimental structures + the model’s own confident guesses on much larger sequence databases.

7. Knowing when it’s wrong

A model that’s right 90% of the time but doesn’t tell you which 10% are wrong is much less useful than one that’s right 85% of the time and tells you which 15% to discard. For protein structure prediction to actually be useful downstream — for drug design, for experiment planning, for biology — you need calibrated per-residue confidence.

AlphaFold 2 predicts:

pLDDT (predicted lDDT-Cα): per-residue confidence, 0–100. Correlates strongly with actual accuracy.
Predicted Aligned Error (PAE): pairwise confidence in the relative position of residue i with respect to residue j. Useful for identifying which domains are well-resolved versus which inter-domain geometries are uncertain.

These confidence heads are trained by additional supervised losses against the actual error of the model’s own predictions during training. Getting them well-calibrated is a non-trivial training problem; it’s also what enables (6) above.
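The regression target for pLDDT is the true lDDT-Cα of the model’s own prediction, so it helps to see how that score is computed. The sketch below assumes the usual lDDT settings (a 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2, and 4 Å) and uses made-up coordinates. It is a plain-Python illustration, not the reference implementation.

```python
import math

def lddt_ca(pred, true, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Per-residue lDDT on C-alpha atoms, scaled to 0-100.

    For residue i, look at every other residue within `cutoff` in the true
    structure and count how many of those distances the prediction preserves
    to within each threshold.
    """
    scores = []
    for i in range(len(true)):
        preserved, total = 0, 0
        for j in range(len(true)):
            if i == j:
                continue
            d_true = math.dist(true[i], true[j])
            if d_true >= cutoff:
                continue
            error = abs(math.dist(pred[i], pred[j]) - d_true)
            preserved += sum(error < t for t in thresholds)
            total += len(thresholds)
        scores.append(100.0 * preserved / total if total else 0.0)
    return scores

# Hypothetical toy chain; the prediction misplaces only the last residue.
true_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
pred_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 3.0, 0.0)]

perfect = lddt_ca(true_ca, true_ca)
perturbed = lddt_ca(pred_ca, true_ca)
```

Because the score only compares internal distances, it needs no superposition, and the misplaced residue gets the lowest score while its well-placed neighbors are only mildly affected. That locality is what makes it a sensible per-residue confidence target.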

8. The data pipeline

Less glamorous but real:

MSA generation requires searching huge sequence databases (UniRef90, MGnify, and BFD, the “Big Fantastic Database” of roughly 2.2 billion protein sequences clustered into about 66 million families). At training time this is precomputed; at inference time it’s a meaningful chunk of the latency.
Template search against PDB70 to optionally include known related structures as additional context.
MSA subsampling and cropping so that long alignments fit in memory while preserving signal.
Multi-chain and multimer handling, which arrived later with AlphaFold-Multimer. Ligands are not modeled in AlphaFold 2.

Part of CASP14’s lead was arguably not “the architecture is better” but “the input pipeline extracts more from the available data than competitors’ pipelines did.”

So what was actually invented vs. assembled?

It’s fair to be honest about this. AlphaFold 2 is a synthesis. Key things it inherited:

Deep attention/transformer-style models (NLP).
Co-evolutionary analysis of MSAs (decades of bioinformatics work — EVfold, PSICOV, GREMLIN, RaptorX-Contact, trRosetta).
Distogram / contact prediction as an intermediate target (RaptorX, AlphaFold 1, trRosetta).
ResNets for contact prediction (Wang et al. 2017, PLOS Computational Biology).
The general lesson from ImageNet-era deep learning that learned representations plus lots of compute beat hand-crafted features.

What it actually invented or first-time-combined:

End-to-end differentiable 3D structure prediction at high accuracy, with no separate optimization step needed to build the structure. Prior top methods, including AlphaFold 1, produced distograms or contact maps and then ran a separate gradient-descent or fragment-assembly step on a physical energy. AlphaFold 2 made the entire pipeline a single network producing coordinates (an optional Amber relaxation afterwards only tidies up stereochemistry).
The Evoformer trunk: the specific architecture for letting MSA and pair representations co-update, with triangle-aware updates.
Invariant Point Attention and the frame-based Structure Module.
FAPE as a loss function.
Recycling as a training and inference technique.
The self-distillation flywheel at the specific scale they used it.
pLDDT and PAE as routinely-used, well-calibrated confidence outputs.

The fairest summary: AlphaFold 2 is not a single brilliant idea. It is roughly eight specific ML inventions, each non-trivial, that had to land simultaneously and be trained together with extraordinary care. The data (MSAs + PDB + sequence databases) was the actual gold mine, and the geometry (SE(3) equivariance + atomic constraints) was the actual constraint.

Why “just supervised learning” undersells it

Reframing the question: what made this a hard supervised learning problem?

The output space is non-Euclidean (it’s the space of 3D structures modulo rigid-body symmetry). Standard losses, standard architectures, and standard tokenization do not respect this. You need IPA, FAPE, frame-based representations.
The labels are scarce, but the side-information is abundant. PDB is 170k labeled examples; sequence databases are billions of unlabeled examples that carry the real signal via co-evolution. You need MSA-based architectures and self-distillation to bridge this gap.
Long-range, structured dependencies. Residue 5 may contact residue 350. Pairs of pairs are related by triangle inequalities. You need triangle attention to bake that in.
The output is iteratively refinable. One pass underspecifies the structure. You need recycling.
Confidence must be calibrated for the system to be useful, not just accurate.
Atomic geometry is hard. Bond lengths, angles, chirality, clashes — get any of these wrong and the structure is biophysically nonsense even if globally close. The Structure Module + auxiliary geometric losses handle this.
Compute and engineering at scale. Training used 128 TPUv3 cores for roughly 11 days (about a week of initial training plus about four days of fine-tuning). Memory-efficient attention, gradient checkpointing, careful mixed-precision, careful curriculum on crop sizes: the engineering work is the kind of thing that barely shows up in a methods section but is invisibly load-bearing.

Calling all of that “just supervised learning with deep neural networks” is like calling the Apollo program “just chemistry and Newton’s laws.” Technically not wrong. But it skips over every part that was actually hard.

Concrete CASP14 numbers (for grounding)

AlphaFold 2 median GDT_TS: 92.4 across all targets, 87.0 on free-modeling targets
Next-best team (Baker lab, RoseTTAFold’s predecessor) median GDT_TS: ~60–65
A GDT_TS of ~90 is essentially “indistinguishable from experimental error” for the backbone — a level structural biologists had not expected this decade.
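For readers who have not met the metric, GDT_TS is the average, over cutoffs of 1, 2, 4, and 8 Å, of the percentage of Cα atoms that land within that cutoff of the experimental position. The sketch below is a simplification: it assumes the two structures are already superimposed and takes the per-residue deviations as given, whereas the real score searches over superpositions for each cutoff. The deviations are made-up numbers.

```python
def gdt_ts(ca_deviations, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS from per-residue C-alpha deviations (in angstroms).

    Assumes a single fixed superposition; the real metric maximizes the
    fraction under each cutoff over many superpositions.
    """
    n = len(ca_deviations)
    percentages = [
        100.0 * sum(d <= cutoff for d in ca_deviations) / n
        for cutoff in cutoffs
    ]
    return sum(percentages) / len(cutoffs)

# Hypothetical deviations for a five-residue toy example.
toy_deviations = [0.5, 0.9, 1.5, 3.0, 9.0]
toy_score = gdt_ts(toy_deviations)
```

Read this way, a score in the low 90s means nearly every residue sits within about 2 Å of the experimental position, and most are within 1 Å.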

AlphaFold 1 (CASP13, 2018) was already #1, and AlphaFold 2 (CASP14, 2020) beat it by margins comparable to those by which AlphaFold 1 had beaten everyone else. Same team, two years later: the inventions above are what closed the gap from “best, but not solved” to “essentially solved for monomer prediction.”

TL;DR

If a junior ML engineer believes AlphaFold is “supervised learning + a big transformer,” ask them how they would handle SE(3) equivariance in the output, where they’d get their labels from beyond PDB, how they’d encode triangle-inequality constraints on pairwise distances, what their loss function would be, and how confidence calibration is supposed to fall out for free. The answers to those questions are the innovations.

The hard part wasn’t “deep learning.” The hard part was building a deep learning system that respects the geometry of three-dimensional space, mines an unlabeled evolutionary signal, refines its own predictions iteratively, knows when it’s wrong, and trains stably across all of those simultaneously.

Primary sources

Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 596, 583–589 (2021). The AlphaFold 2 paper.
Senior et al., “Improved protein structure prediction using potentials from deep learning,” Nature 577, 706–710 (2020). AlphaFold 1.
AlphaFold 2 Supplementary Information — the algorithm pseudocode lives here, and it is long.
CASP14 official results: https://predictioncenter.org/casp14/
Wang et al., “Accurate de novo prediction of protein contact map by ultra-deep learning model,” PLOS Comput Biol (2017). Earlier deep-learning contact prediction milestone.
Yang et al. (Baker lab), “Improved protein structure prediction using predicted interresidue orientations,” PNAS (2020). trRosetta.