From Structure Prediction to Real-World Drug Design
The 2024 Nobel Prize in Chemistry, awarded in part for the development of AlphaFold2, highlights the transformative role of AI in biology. What comes next after protein folding?
Generating “useful” proteins
PLAID is a multimodal generative model that learns to sample from the latent space of protein folding models, generating sequence and all-atom structure simultaneously. It accepts compositional function and organism prompts, and it can be trained on sequence databases, which are far larger than structure databases.
Limitations of existing models
- All-atom generation: Many current models output only backbone atoms. Placing sidechains requires knowing the sequence, so all-atom generation is a multimodal problem in which sequence and structure must be generated together. PLAID generates both.
- Organism specificity: Proteins intended for use in humans need to be humanized to avoid immune rejection. PLAID addresses this by accepting an organism prompt, such as human, as a conditioning input.
- Control specification: Complex constraints like solubility and ease of transport can be specified, allowing for the design of proteins tailored for specific uses.
Training using sequence-only training data
A key aspect of PLAID is that its generative model can be trained on sequence databases alone. Sequence data is far more abundant and cheaper to obtain than structural data, so this greatly enlarges the pool of usable training examples.
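To make the sequence-only idea concrete, here is a minimal sketch of how training examples could be built without any structures. It assumes a diffusion-style denoising objective, which is an assumption made for illustration. `toy_encode` is a hypothetical stand-in for a frozen folding-model encoder, and the sequences, noise level, and latent size are all made up.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LATENT_DIM = 4  # toy size; real folding-model latents are far wider


def toy_encode(sequence):
    """Hypothetical stand-in for a frozen folding-model encoder.

    Maps each residue to a fixed pseudo-random vector, so the same
    residue always receives the same latent.
    """
    latents = []
    for residue in sequence:
        rng = random.Random(AMINO_ACIDS.index(residue))
        latents.append([rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)])
    return latents


def make_denoising_example(sequence, noise_level, rng):
    """Build one (noisy latent, noise target) pair from a sequence alone."""
    clean = toy_encode(sequence)
    keep = math.sqrt(1.0 - noise_level)
    mix = math.sqrt(noise_level)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in clean]
    noisy = [
        [keep * c + mix * n for c, n in zip(c_row, n_row)]
        for c_row, n_row in zip(clean, noise)
    ]
    return noisy, noise


rng = random.Random(0)
sequence_db = ["MKTAYIAK", "GSHMLE"]  # made-up sequences, no structures needed
examples = [make_denoising_example(s, noise_level=0.3, rng=rng) for s in sequence_db]
```

Nothing in this loop touches a structure file: the only input is the sequence, and the encoder supplies the latent that a generative model would learn to denoise.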
How does it work?
- PLAID learns a generative model over the latent space of a protein folding model. During inference, it samples a latent and then uses frozen weights from ESMFold to decode the structure.
- The method addresses the need for all-atom structures by leveraging structural understanding encoded in the weights of pretrained models like ESMFold.
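The inference flow described in this list can be sketched as a toy pipeline. Every function here is a hypothetical placeholder: `sample_latent` stands in for a trained generative model, and the two decoders stand in for frozen pretrained components. The toy structure decoder returns one point per residue rather than all atoms. The point is only to show that one sampled latent feeds both decoders, so sequence and structure stay aligned.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LATENT_DIM = 8  # toy size


def sample_latent(length, rng):
    """Placeholder sampler: a trained generative model would go here."""
    return [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(length)]


def decode_sequence(latent):
    """Hypothetical frozen sequence decoder: one residue per latent row."""
    residues = []
    for row in latent:
        index = int(abs(sum(row)) * 1000) % len(AMINO_ACIDS)
        residues.append(AMINO_ACIDS[index])
    return "".join(residues)


def decode_structure(latent):
    """Hypothetical frozen structure decoder: one (x, y, z) per latent row."""
    return [tuple(row[:3]) for row in latent]


def generate(length, seed=0):
    """Sample one latent, then decode both modalities from it."""
    rng = random.Random(seed)
    latent = sample_latent(length, rng)
    return decode_sequence(latent), decode_structure(latent)


sequence, coords = generate(length=12)
```

Because neither decoder is trained in this flow, only the sampler needs to be learned, which is what allows the training data to be sequence-only.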
Compressing the latent space of protein folding models
To make this approach more scalable, we propose CHEAP, a model that learns a compressed joint representation of protein sequence and structure.
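As a rough illustration of what compressing a latent means in terms of shapes, the sketch below shrinks a (length x channels) array along both axes by simple averaging. This is not CHEAP's learned compression; mean pooling is swapped in purely to show the shape change, and the function names and factors are hypothetical.

```python
def mean_pool(values, factor):
    """Average consecutive groups of `factor` numbers."""
    if len(values) % factor != 0:
        raise ValueError("length must be divisible by factor")
    return [sum(values[i:i + factor]) / factor for i in range(0, len(values), factor)]


def compress(latent, length_factor, channel_factor):
    """Shrink a (length x channels) latent along both axes by averaging."""
    # Channels first: pool within each residue's vector.
    narrow = [mean_pool(row, channel_factor) for row in latent]
    # Then length: pool each channel across neighbouring residues.
    columns = [mean_pool(list(col), length_factor) for col in zip(*narrow)]
    # Transpose back to (length x channels).
    return [list(row) for row in zip(*columns)]


# 6 residues x 4 channels, filled with 0.0 .. 23.0
latent = [[float(r * 4 + c) for c in range(4)] for r in range(6)]
small = compress(latent, length_factor=2, channel_factor=2)
# small has 3 rows and 2 columns
```

A learned compressor would replace the averaging with trained layers, but the payoff is the same: a generative model working on the smaller array has fewer values to model per protein.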
Further links
If you’ve found our papers useful, consider citing:
@article{lu2024generating,
  title={Generating All-Atom Protein Structure from Sequence-Only Training Data},
  author={Lu, Amy X and Yan, Wilson and Robinson, Sarah A and Yang, Kevin K and Gligorijevic, Vladimir and Cho, Kyunghyun and Abbeel, Pieter and Frey, Nathan},
  journal={bioRxiv},
  pages={2024--12},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
@article{lu2024tokenized,
  title={Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure},
  author={Lu, Amy X and Yan, Wilson and Yang, Kevin K and Gligorijevic, Vladimir and Cho, Kyunghyun and Abbeel, Pieter and Bonneau, Richard and Frey, Nathan},
  journal={bioRxiv},
  pages={2024--08},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
You can also check out our preprints (PLAID, CHEAP) and codebases (PLAID, CHEAP).
What’s Next?
We can adapt this method to perform multimodal generation for any pair of modalities where a predictor exists from the more abundant modality to the less abundant one. As sequence-to-structure predictors such as AlphaFold3 begin to tackle increasingly complex systems, the same method could be applied to multimodal generation over those systems as well.
Some bonus protein generation fun!
Here are some additional examples of PLAID-generated proteins:
Acknowledgements
Thanks to Nathan Frey for detailed feedback on this article, and co-authors across BAIR, Genentech, Microsoft Research, and New York University: Wilson Yan, Sarah A. Robinson, Simon Kelow, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Richard Bonneau, Pieter Abbeel, and Nathan C. Frey.