
Engineering Biology: Learning from Evolution—Why Protein Language Models Work

Jacob Oppenheim, PhD


March 26, 2024

Over at the new OpenProtein.ai blog, Tristan Bepler and I wrote about the seemingly mysterious power of Deep Protein Language Models. Not only do they identify related proteins, they predict functionality, stability, and immunogenicity, in many cases “out-of-the-box.” Why should this be?

While structure conveys function, it is the underlying DNA, encoding protein sequences, that evolves. Billions of years of adaptive radiation have led to the vast variety of proteins we observe in the natural world. Without purifying selection guiding towards a set of general “rules,” neutral evolution would diversify sequences independently in each organism. But the distribution of proteins in the natural world is non-random. Patterns in these sequences reflect the functional and structural constraints imposed by natural selection. We can intuit what these constraints are from our knowledge of biology: non-immunogenicity, stability, lack of aggregation, and conservation of function, all under the physiological conditions of the organism.

The question then arises: if these constraints and patterns are encoded in sequence, why have we only now been able to identify and manipulate them? Why deep models?

Biological “grammar” is composed of statistical patterns. Certain combinations of residues tend to fold in certain ways, leading to stereotyped secondary structures, such as alpha helices and beta sheets. Functional domains are often not strictly stereotyped, but rather reflect diverse and distinct combinations of amino acids in sequence that convey the same chemical properties. In gene expression, the search for stereotyped binding motifs has been unable to fully explain the variety of regulatory patterns and states we find. Similarly, innate immunity in eukaryotes relies on recognition of patterns of residues likely to originate from pathogens, both definite and statistical. Biology lies in statistical patterns, not stereotyped motifs.
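
To make the contrast between a stereotyped motif and a statistical pattern concrete, here is a minimal Python sketch. It is an illustration only, not the method discussed in the post: the aligned sequences are invented, the names `build_log_odds`, `consensus`, and `score` are hypothetical, and the scoring is a simple per-position log-odds profile against a uniform background. A deep language model goes much further, since it can also capture dependencies between positions, which this toy profile ignores.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Invented toy alignment for illustration; not real protein data.
TOY_FAMILY = ["LKELL", "IKELL", "LRELL", "LKDLL", "LKELI", "VKELL"]


def consensus(aligned):
    """Stereotyped view: the single most common residue at each position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def build_log_odds(aligned, pseudocount=1.0):
    """Statistical view: per-position log2 odds versus a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    table = []
    for col in zip(*aligned):
        counts = Counter(col)
        table.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return table


def score(table, seq):
    """Sum the per-position log-odds for a sequence of the same length."""
    return sum(col[aa] for col, aa in zip(table, seq))


profile = build_log_odds(TOY_FAMILY)
motif = consensus(TOY_FAMILY)

variant = "IRDLI"    # differs from the consensus at four of five positions
unrelated = "GGGGG"  # residues never observed in the toy family

for seq in (motif, variant, unrelated):
    print(seq, "exact match:", seq == motif, "profile score:", round(score(profile, seq), 2))
```

In this toy setup, exact matching against the consensus rejects the variant outright, while the profile still scores it above the unrelated sequence, because each of its residues was seen at that position somewhere in the family.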

Read more here:
openprotein.ai/learning-the-grammar-of-biology-why-protein-machine-learning-works
