
Application of Language Models in Molecular Structure Prediction

This talk presents how large language models are leveraged for predicting protein, antibody, and RNA structures, covering background, model stability, generative approaches, antibody-specific models, RNA modeling, and protein‑RNA interaction prediction, along with experimental results and future research directions.

DataFunSummit

The presentation explores the use of large language models for molecular structure prediction, focusing on proteins, antibodies, and RNA.

It begins with a background on protein sequences as token strings, the history of the CASP competition, and the breakthroughs of AlphaFold2, while highlighting remaining challenges such as homolog search speed, performance on low‑homology proteins, and stability under mutations.

A dense‑retrieval approach is introduced, where a language model maps protein sequences to vectors, enabling fast similarity search in massive databases and efficient MSA construction.
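The retrieval idea can be sketched in a few lines of NumPy. The hash-based `embed` function below is a runnable stand-in for the actual protein language model encoder (which the talk does not specify here); only the search logic, embedding once offline and then ranking by a single matrix-vector similarity product, reflects the described approach.

```python
import numpy as np

def embed(sequence: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a protein language-model encoder: a deterministic
    hash-seeded embedding so this sketch is runnable without a model."""
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

def build_index(database: list[str]) -> np.ndarray:
    """Embed every database sequence once, offline."""
    return np.stack([embed(s) for s in database])

def retrieve(query: str, index: np.ndarray, database: list[str], k: int = 3):
    """Cosine-similarity search: one matrix-vector product replaces a
    slow profile-based homolog search over the raw sequences."""
    scores = index @ embed(query)
    top = np.argsort(-scores)[:k]
    return [(database[i], float(scores[i])) for i in top]

db = ["MKTAYIAKQR", "MKTAYIAKQQ", "GAVLIPFMW", "MKTAYLAKQR"]
index = build_index(db)
hits = retrieve("MKTAYIAKQR", index, db)
```

In practice the index would hold millions of vectors and use an approximate-nearest-neighbor library rather than a dense matrix product; the retrieved hits then feed MSA construction.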

The stability of AlphaFold2 is examined by applying small mutations; an evolutionary algorithm generates insertions, deletions, or substitutions to assess prediction robustness, revealing cases where minimal changes cause large structural deviations.
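A minimal version of that search loop might look as follows. The mutation operators (substitution, insertion, deletion) match the ones described; the hill-climbing acceptance rule and the `score_fn` interface are simplifying assumptions standing in for the evolutionary algorithm and the structure-quality score of the actual study.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(sequence: str, rng: random.Random) -> str:
    """Apply one random edit (substitution, insertion, or deletion),
    mirroring the minimal perturbations used to probe stability."""
    pos = rng.randrange(len(sequence))
    op = rng.choice(["substitute", "insert", "delete"])
    if op == "substitute":
        return sequence[:pos] + rng.choice(AMINO_ACIDS) + sequence[pos + 1:]
    if op == "insert":
        return sequence[:pos] + rng.choice(AMINO_ACIDS) + sequence[pos:]
    return sequence[:pos] + sequence[pos + 1:]

def evolve(sequence: str, score_fn, generations: int = 50, seed: int = 0) -> str:
    """Hill climbing that keeps any mutant with a lower score, i.e. a
    more degraded prediction, exposing brittle cases."""
    rng = random.Random(seed)
    best = sequence
    for _ in range(generations):
        candidate = mutate(best, rng)
        if score_fn(candidate) < score_fn(best):
            best = candidate
    return best

# Toy objective standing in for a structure-quality score.
adversarial = evolve("MKTAYIAKQR", score_fn=lambda s: len(set(s)), generations=20)
```

In the real setting `score_fn` would run the folding model and compare the mutant's predicted structure against the original, flagging sequences where a one-residue edit causes a large structural deviation.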

To address scarce homologs, a generative model based on a T5 architecture with column attention is proposed to augment MSAs. Experiments on the Artificial CASP‑14 dataset show significant accuracy gains, especially when the original MSA contains fewer than ten sequences.
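The column-attention component can be illustrated in plain NumPy: self-attention applied across MSA rows at each aligned column, so a residue attends to the residues aligned with it in other sequences. The identity Q/K/V projections and tensor shapes here are illustrative assumptions, not the model's actual parameters.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def column_attention(msa: np.ndarray) -> np.ndarray:
    """Attention over the sequence axis, one column at a time.
    msa has shape (num_seqs S, length L, dim D); a trained model would
    apply learned Q, K, V projections instead of the identity used here."""
    q = k = v = msa
    # scores[l] is an (S, S) attention matrix among rows at column l
    scores = np.einsum("sld,tld->lst", q, k) / np.sqrt(msa.shape[-1])
    attn = softmax(scores, axis=-1)             # normalize over source rows
    return np.einsum("lst,tld->sld", attn, v)   # mix rows, column by column

rng = np.random.default_rng(0)
msa = rng.standard_normal((4, 7, 8))            # 4 aligned sequences
out = column_attention(msa)
```

Interleaving such column attention with ordinary row (per-sequence) attention lets a T5-style decoder generate plausible new MSA rows conditioned on the few sequences available.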

An antibody‑specific masked language model is described, exploiting the high conservation of framework regions and the variability of the CDRs. Combined with a folding module, it predicts antibody structures without an MSA and outperforms existing methods such as OmegaFold and IgFold.
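One way to exploit that conservation pattern in a masked-LM objective is to mask CDR positions more aggressively than framework positions, concentrating the learning signal on the variable loops. The function below is a hedged sketch of that idea; the span boundaries, rates, and `[MASK]` token are illustrative, not the talk's reported settings.

```python
import random

def mask_for_training(tokens, cdr_spans, mask_rate=0.15, seed=0):
    """Masked-LM corruption biased toward CDR positions.

    tokens: list of residue tokens; cdr_spans: (start, end) index pairs
    (end exclusive) marking CDR loops. Returns corrupted inputs and a
    labels list that is None everywhere except masked positions."""
    rng = random.Random(seed)
    cdr_positions = {i for start, end in cdr_spans for i in range(start, end)}
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Illustrative 3x higher masking rate inside CDRs.
        rate = mask_rate * 3 if i in cdr_positions else mask_rate
        if rng.random() < rate:
            inputs[i] = "[MASK]"
            labels[i] = tok
    return inputs, labels

heavy_chain = list("EVQLVESGGGLVQPGGSLRLSCAAS")
inputs, labels = mask_for_training(heavy_chain, cdr_spans=[(20, 25)])
```

Because framework residues are nearly invariant across antibodies, the model recovers them easily and spends most of its capacity on the CDRs, which is what the downstream folding module needs.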

An RNA masked language model is also developed, improving secondary‑structure prediction and achieving superior 3‑D structure prediction compared with state‑of‑the‑art tools such as FarFar2, demonstrated on coronavirus RNA sequences.
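For context on what secondary-structure prediction computes, here is the classical Nussinov dynamic program, which maximizes the number of nested Watson-Crick and wobble base pairs. This is a deliberately simple baseline for illustration, not the talk's language-model method, and the minimum hairpin-loop length of 3 is a common convention, not a reported parameter.

```python
def nussinov(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in an RNA sequence,
    allowing A-U, G-C, and G-U pairs with at least `min_loop`
    unpaired bases inside every hairpin."""
    n = len(rna)
    valid = lambda a, b: {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # interval length j - i
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                  # base i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if valid(rna[i], rna[k]):        # pair i with k, split rest
                    right = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + dp[i + 1][k - 1] + right)
            dp[i][j] = best
    return dp[0][n - 1]
```

A language model improves on this kind of baseline by learning pairing preferences and pseudoknot-adjacent signals from sequence data rather than from a fixed scoring rule.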

For protein‑RNA interaction prediction, a multimodal framework extracts features from both sequences via language models and from spatial information via a point‑cloud CNN, leading to better binding‑site prediction and enhanced zero‑shot performance.
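The fusion step of such a framework can be sketched as follows. Both feature extractors below are runnable stand-ins (a hash-seeded embedding for the language model, a fixed linear lift for the point-cloud CNN), and the logistic scoring head is a hypothetical placeholder; only the concatenate-then-score pattern reflects the described design.

```python
import numpy as np

def sequence_features(seq: str, dim: int = 32) -> np.ndarray:
    """Stand-in for per-residue language-model embeddings."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.standard_normal((len(seq), dim))

def point_cloud_features(coords: np.ndarray, dim: int = 32) -> np.ndarray:
    """Stand-in for a point-cloud CNN: a fixed linear lift of 3-D
    coordinates into the same per-residue feature width."""
    w = np.ones((3, dim)) / 3.0
    return coords @ w

def fuse_and_score(seq: str, coords: np.ndarray) -> np.ndarray:
    """Concatenate the two modalities per residue and score each
    position as a candidate binding site via a logistic head."""
    feats = np.concatenate(
        [sequence_features(seq), point_cloud_features(coords)], axis=1
    )
    head = np.ones(feats.shape[1]) / feats.shape[1]  # hypothetical weights
    return 1.0 / (1.0 + np.exp(-feats @ head))

seq = "MKTAYIAKQR"
coords = np.zeros((len(seq), 3))  # placeholder residue coordinates
probs = fuse_and_score(seq, coords)
```

Keeping the sequence branch independent of structure is what enables the zero-shot behavior mentioned above: when spatial input is unavailable or unreliable, the language-model features still carry usable binding signal.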

The talk concludes with a summary of the proposed methods and an invitation to join the Shanghai AI Lab for further research on applying NLP techniques to biological sequences.

Tags: Generative Models, language models, protein structure prediction, AI for biology, antibody modeling, RNA modeling
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
