Engineering Biology: The Unreasonable Effectiveness of Design / Build / Test

Jacob Oppenheim, PhD

June 20, 2024

Over the past year, I’ve been collecting papers with surprising results about the success of machine learning in biology, results that run against the grain of popular conceptions and throw into question whether our models are learning biology at all. Papers that demonstrate models fixating on patterns in datasets too complex for a human to identify, patterns that turn out to be noise, not signal. Works that show limits to model performance despite interrogating dramatically more data. Many of these are negative results, or weak positive results; this is not the fault of the authors, whom I admire for their willingness to push computational hypotheses, rigorously test them, and publish the results however they look. Rather, these are critical results for scientific and technological progress.

Consider:
• Protein Structure Models built off of Protein Language Models memorize motifs of interacting domains and do not learn representations of protein folding or biophysics. Despite promising early evidence that AlphaFold and related models were learning approximations of the energetics of protein folding, more detailed analyses of ESM, a competing structure model1, show clear signs of effectively memorizing which pieces of sequence interact with one another, in effect “scaling up” previous generations of computational approaches2. Larger models can in practice memorize more chunks of interacting structure; they do not represent a fundamentally different approach to modeling proteins.

• Similarly, early evidence that AlphaFold could predict conformational switching in dynamic proteins did not replicate in a larger survey. It had seemed that by tinkering with the sequences fed to AlphaFold, one could identify multiple 3d structures of a protein corresponding to different dynamic states, such as “open” and “closed.” When this work was expanded to a larger number of proteins, the quantitative indications of multiple conformations did not recur, suggesting the initial results were due to specific properties of the proteins examined.

• Protein Language Models are biased by species of origin strongly enough that if you try to generate proteins with higher thermostability or salt tolerance3, improvements in the designed sequences will rapidly peter out. Why? The model’s training data is so strongly enriched for organisms (and thus proteins) that inhabit moderate temperatures and salt concentrations that the most stable proteins it has seen (or can generate) are assigned exceedingly low likelihood. The (un)likelihood of these sequences swamps their improved properties.


Despite the incredible success of protein structure prediction, heralded by AlphaFold2, we are still far from understanding the “why,” the “how,” and the limits of these models, an understanding that will be critical to using them in practice.

In the case of DNA:
• Widespread claims that general DNA language models naturally learn the “grammar” of gene regulation turn out to be a function of training procedure. Older models perform just as well given the same training, and newer models do not learn regulatory grammar without specific training on regulatory sequences.

Results like these are not limited to biological sequences. In the case of small molecules, we see:
• Building ever larger models of small molecules off of billions of 2d + 3d structures does not materially improve the ability of a model to capture the underlying “space” of small molecules. Despite the authors’ clever addition of 3d structures to a more traditional model4 adapted to learn the rules of small molecule structure and generate novel ones, overall predictions and generated molecules are not materially different from those of previous methods. Nor has the addition of dramatically more training data helped: even the “bitter lesson” of more data and compute outperforming better model architectures has clearly not kicked in.

• Novel Machine Learning methods and models adapted to chemical structures are not needed to design better small molecules if you continue to run design-build-test cycles. Simply running linear regression on standard chemical fingerprints5 to predict the next cycle of molecules to test will yield improved molecules over time, even though the underlying machine learning algorithm knows nothing about chemistry. In fact, linear regression performed as well as many considerably fancier models, while being orders of magnitude faster to run (a minimal sketch of this kind of design step follows below).
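
To make that concrete, here is a minimal sketch of such a chemistry-agnostic design step, assuming RDKit and scikit-learn are available; the SMILES strings and assay values are purely illustrative and not drawn from the paper.

```python
# Minimal sketch (illustrative, not the paper's code): one "Design" step of a
# design-build-test cycle using plain linear regression on Morgan fingerprints.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import Ridge


def fingerprint(smiles, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit-vector fingerprint for a single SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(list(fp), dtype=float)


# "Test" results from the previous cycle: molecules and a measured property
# (hypothetical values on an arbitrary scale).
assayed = {"CCO": 0.21, "c1ccccc1O": 0.55, "CC(=O)Nc1ccc(O)cc1": 0.80}

X = np.array([fingerprint(s) for s in assayed])
y = np.array(list(assayed.values()))

# The (regularized) linear model knows nothing about chemistry; it only sees fingerprint bits.
model = Ridge(alpha=1.0).fit(X, y)

# "Design": rank untested candidates and send the top picks to be built and assayed.
candidates = ["CCN", "c1ccccc1N", "CC(=O)Oc1ccccc1C(=O)O"]
scores = model.predict(np.array([fingerprint(s) for s in candidates]))
for smi, score in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print(f"{smi}\tpredicted {score:.2f}")
```

In practice the candidates would come from enumeration or a generative model and the measured values from the previous build-test round; the point is simply that the learner itself is generic.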


Both of these results are reminiscent of the hoary computational chemistry wisdom on QSAR6 models: it’s easy to build models that are effective locally but not globally. When claims of excellent global performance are made, they reflect small training sets, bad testing hygiene, or highly biased evaluation data. We lack the right representations of the space of small molecules to build performant models across all of small molecule space.

Where does this leave us? While these papers may suggest the claimed capabilities of our models are overwrought, they manifestly do not imply that the models are useless. Recall that even linear regression on molecular fingerprints was useful for designing better molecules. Similarly, even if protein models are biased and are not learning the principles of protein structure, putting them in a Design-Build-Test framework leads to dramatically accelerated and diversified antibody fragment identification. In a vastly different domain, identification of a synthetic gut microbial consortium that could clear a drug-resistant pathogen was accelerated by parallel methods: performing broad functional screens of diverse microbial consortia and building very simple machine learning models on top of the results rapidly led to the design of an extraordinarily effective one.

Perhaps it is not Machine Learning that matters, but instead Active Learning.

We are far from “zero-shot” molecular generation and property prediction. Our models can interpolate, but may never extrapolate. They need data specific to the problem in question (“local data”) to meaningfully learn, but can latch onto surprisingly small amounts of it to help us design the next experiment. From this perspective we need three components: (i) a base model or representation of our biochemical units: a protein language model, molecular fingerprints, etc.; (ii) the ability to generate new candidates rapidly and in a diverse fashion, e.g. through synthesis; and (iii) an assay to rapidly test these candidates. Design. Build. Test.
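
As a sketch of how those three components might plug together in code (all of the callables below are placeholders to be supplied by the reader, not a real library), the loop looks something like this:

```python
# Minimal sketch of a design-build-test loop. Every callable here is a
# placeholder supplied by the user; none of this is a real API.
from typing import Callable, Dict


def design_build_test(
    featurize: Callable,   # (i) base representation: PLM embedding, fingerprint, ...
    propose: Callable,     # (ii) generate new candidates (enumeration, synthesis plans, ...)
    assay: Callable,       # (iii) a rapid experimental test of a single candidate
    fit: Callable,         # a simple local model, e.g. linear regression
    seed: Dict,            # initial candidates mapped to measured values ("local data")
    n_cycles: int = 5,
    batch_size: int = 10,
) -> Dict:
    measured = dict(seed)
    for _ in range(n_cycles):
        # Design: fit a local model on everything measured so far and rank new candidates.
        model = fit([featurize(c) for c in measured], list(measured.values()))
        ranked = sorted(
            propose(measured),
            key=lambda c: model.predict([featurize(c)])[0],
            reverse=True,
        )
        # Build + Test: make and assay the top picks, then fold the results back in.
        for c in ranked[:batch_size]:
            measured[c] = assay(c)
    return measured
```

Each cycle’s assay results become the next cycle’s local data; that feedback, rather than the sophistication of the model, appears to be what does the work.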

Operationalizing Design-Build-Test cycles requires neither omniscient nor perfectly accurate models. All it requires is that the representations used, and the models built on top of them, are faithful enough that we can make local predictions from their results that are approximately correctly ordered. If anything, these assumptions are weaker than we have traditionally thought. If linear regression on molecular fingerprints works for small molecule design, we could have been doing it 40 years ago7. Active Learning is unreasonably effective8 in ways we could not have anticipated.
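
One cheap way to check the “approximately correctly ordered” condition on local data is a rank correlation between a cycle’s predictions and its subsequent assay results; a toy example with made-up numbers, assuming scipy is available:

```python
# Toy check of "approximately correctly ordered": compare the ranking of model
# predictions to assay results with a Spearman correlation. Numbers are made up.
from scipy.stats import spearmanr

predicted = [0.9, 0.7, 0.4, 0.3, 0.1]  # model scores for five candidates
measured = [1.2, 1.5, 0.6, 0.7, 0.2]   # what the assay actually returned

rho, _ = spearmanr(predicted, measured)
print(f"Spearman rho = {rho:.2f}")  # high rho: good enough to pick the next batch
```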

In a recent piece, the eminent statistician David Donoho suggested that the rapid progress in Data Science and Machine Learning is due to a computational equivalent of Design-Build-Test: Data Sharing, Code Sharing, and Competitive Challenges9. If you design a novel algorithm, anyone can use it to build a version of your model on a new dataset and test it on a competitive challenge. What he terms “Frictionless Reproducibility” has so accelerated progress as to suggest a hidden superpower or an approaching “singularity.”

Donoho’s paradigm clarifies what we need in biopharma: internal “Frictionless Reproducibility.” The ability to rapidly and intelligently analyze results, pick the next set of candidates to test, and then assay them. We need digital systems that capture, store, and serve results to users in the form they need, not Excel sheets full of raw data and ad hoc QCs. We need computational systems that facilitate the training of models and the exploration of their predictions. We need transparent experimental queues and design tools to enable collaboration across experts, functions, and yes, computers10, in picking the next experiment(s). We need laboratory systems that make implementing those experiments painless. We need systems to track projects, programs, and progress across Design-Build-Test cycles and to call a halt when progress slows or stops. The transformation of processes designed for solo scientists into active-learning-driven industrial discovery will demand a new set of tools and systems, most of which do not yet exist11.

– Jacob Oppenheim, PhD



(1) Technically, ESM is a protein language model with a structure predictor on top. AlphaFold and its descendants have a somewhat more complex architecture, but all approaches begin by modeling protein sequences and their coevolution and predict folds from there.

(2) e.g. Potts Models and Evolutionary Coupling analysis.

(3) As measured by Isoelectric point.

(4) Using SMILES strings, an alphabet that encodes 2d small molecule structures.

(5) Compressed, vector representations of small molecules that are standard in the computational chemistry literature, such as Morgan or ECFP.

(6) Quantitative Structure Activity Relationship, essentially classical ML methods applied to molecular fingerprints to predict properties such as solubility, binding, or absorption.

(7) Or, to quote Pat Walters: “Several recent papers have shown that AL isn’t particularly sensitive to the ML model used. In our 2022 paper, we evaluated 5 different ML models and saw minimal differences in performance… Perspective: Active learning is a simple yet powerful technique that enables computational access to large and ultra-large (I’m not sure where the cutoff is) collections of molecules. Papers published in 2023 showed how robust the method is. Multiple studies showed that while the choice of the acquisition function matters, the method is largely insensitive to the composition of the initial training set. These papers also showed that AL can be applied to various ML methods and molecular representations. As the field progresses, it will be interesting to see whether similar approaches can be applied to high-throughput experimentation.”

(8) With apologies to Wigner.

(9) An excellent synopsis by Ben Recht here.

(10) My suspicion remains that design will remain a collaboration between computational methods and humans for a long time yet, for some of the reasons Pat Walters elucidates here.

(11) Or if they do, only as one-offs, pet projects, and internal tools.
