HelixDock: A Large-Scale Pretrained Full-Atom Diffusion Model for Protein–Small Molecule Docking
HelixDock is a full‑atom diffusion model pretrained on a billion‑scale simulated docking dataset covering roughly 200,000 protein targets. It delivers state‑of‑the‑art docking accuracy (an 85.6% success rate on PoseBusters) and generalizes strongly to cross‑docking benchmarks, evidence that scaling up data and models substantially improves AI‑driven drug discovery. Its code and data are fully open‑source.
Protein–small molecule binding-conformation prediction is a critical task in drug discovery: given a ligand and its target protein, predict the bound pose. Traditional physics‑based docking tools suffer from limited conformational sampling and imprecise scoring functions, while recent deep‑learning approaches are hampered by scarce training data and consequently generalize poorly.
The Baidu PaddleHelix team has developed and open‑sourced HelixDock, a full‑atom diffusion model trained with massive pretraining on a billion‑scale simulated docking dataset covering ~200,000 targets and over 2,000 protein families. The project, conducted jointly with a national super‑computing center, Tsinghua University’s School of Pharmacy, and Beijing Tuoling Botai, has already identified six high‑potential lead compounds for an autoimmune‑related target.
HelixDock markedly improves docking accuracy. On the PoseBusters benchmark (428 cases), it achieves an 85.6% success rate, second only to DeepMind’s AlphaFold 3 and far surpassing other methods.
Robustness is demonstrated on low‑similarity targets and cross‑docking datasets: success rates of 80.7% on PDBbind‑CrossDocked‑Core and 68.1% on APObind‑Core, confirming strong generalization.
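The success rates above follow the standard docking criterion: a prediction counts as a success if the predicted ligand pose lies within 2 Å RMSD of the experimental pose. A minimal sketch of that metric (the function names are ours, not from the HelixDock codebase, and symmetry-equivalent atoms are ignored for brevity):

```python
import numpy as np

def ligand_rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between predicted and reference heavy-atom coordinates.

    Both arrays are (n_atoms, 3) with identical atom ordering; symmetry
    correction for chemically equivalent atoms is omitted in this sketch.
    """
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def success_rate(pred_poses, ref_poses, threshold=2.0):
    """Fraction of predicted poses within `threshold` angstroms RMSD of
    the experimental pose (the conventional docking success criterion)."""
    hits = [ligand_rmsd(p, r) <= threshold
            for p, r in zip(pred_poses, ref_poses)]
    return sum(hits) / len(hits)
```

In practice, benchmark suites such as PoseBusters additionally apply symmetry-corrected RMSD so that equivalent atom orderings are not penalized.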
In PoseBusters' physical-plausibility checks, HelixDock's predicted poses also pass at a high rate, indicating chemically and physically valid conformations.
For virtual screening, HelixDock’s predicted poses on the DUD‑E benchmark (102 targets) yield superior enrichment factors (EF1%, EF5%) compared with competing approaches.
Large‑scale experiments reveal scaling laws in AI for Science: with pretraining, model accuracy continuously rises with increasing parameters and data volume, whereas without pretraining, larger models do not gain accuracy. This underscores the importance of massive data and models for AI‑driven drug discovery.
HelixDock’s code and the billion‑scale training dataset are fully open‑source for the academic community. Resources include the GitHub repository (https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/molecular_docking/helixdock), a free data-request link (https://paddlehelix.baidu.com/partnership), the official website (https://paddlehelix.baidu.com/), a free online demo (https://paddlehelix.baidu.com/app/drug/helix-dock/forecast), and the accompanying arXiv paper (https://arxiv.org/abs/2310.13913). For further inquiries, contact [email protected].
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.