HelixDock: A Large-Scale Pretrained Full-Atom Diffusion Model for Protein–Small Molecule Docking
HelixDock is a full‑atom diffusion model pretrained on a billion‑scale simulated docking dataset covering roughly 200,000 protein targets. It delivers state‑of‑the‑art docking accuracy (an 85.6% success rate on PoseBusters) and generalizes strongly to cross‑docking benchmarks, evidence that scaling up data and models substantially improves AI‑driven drug discovery. Its code and data are fully open‑source.
Protein–small molecule binding-conformation prediction is a critical task in drug discovery: given a ligand and its target protein, predict the bound pose. Traditional physics‑based docking tools suffer from limited conformational sampling and imprecise scoring functions, while recent deep‑learning approaches are hampered by scarce training data and consequently generalize poorly.
The Baidu PaddleHelix team has developed and open‑sourced HelixDock, a full‑atom diffusion model trained with massive pretraining on a billion‑scale simulated docking dataset covering ~200,000 targets and over 2,000 protein families. The project, conducted jointly with a national super‑computing center, Tsinghua University’s School of Pharmacy, and Beijing Tuoling Botai, has already identified six high‑potential lead compounds for an autoimmune‑related target.
HelixDock markedly improves docking accuracy. On the PoseBusters benchmark (428 cases), it achieves an 85.6% success rate, second only to DeepMind’s AlphaFold 3 and far surpassing other methods.
Robustness is demonstrated on low‑similarity targets and cross‑docking datasets: success rates of 80.7% on PDBbind‑CrossDocked‑Core and 68.1% on APObind‑Core, confirming strong generalization.
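The success rates above follow the standard docking criterion: a prediction counts as a success if the predicted ligand pose lies within 2 Å RMSD of the experimental pose. A minimal sketch of that metric (the function names are ours, not from the HelixDock codebase, and symmetry-equivalent atoms are ignored for brevity):

```python
import numpy as np

def ligand_rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between predicted and reference heavy-atom coordinates.

    Both arrays are (n_atoms, 3) with identical atom ordering; symmetry
    correction for chemically equivalent atoms is omitted in this sketch.
    """
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def success_rate(pred_poses, ref_poses, threshold=2.0):
    """Fraction of predicted poses within `threshold` angstroms RMSD of
    the experimental pose (the conventional docking success criterion)."""
    hits = [ligand_rmsd(p, r) <= threshold
            for p, r in zip(pred_poses, ref_poses)]
    return sum(hits) / len(hits)
```

In practice, benchmark suites such as PoseBusters additionally apply symmetry-corrected RMSD so that equivalent atom orderings are not penalized.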
In PoseBusters' physical-plausibility checks, HelixDock's predicted poses also pass at a high rate, indicating chemically and physically valid conformations.
For virtual screening, HelixDock’s predicted poses on the DUD‑E benchmark (102 targets) yield superior enrichment factors (EF1%, EF5%) compared with competing approaches.
Large‑scale experiments reveal scaling laws in AI for Science: with pretraining, model accuracy continuously rises with increasing parameters and data volume, whereas without pretraining, larger models do not gain accuracy. This underscores the importance of massive data and models for AI‑driven drug discovery.
HelixDock’s code and the billion‑scale training dataset are fully open‑source for the academic community. Resources include the GitHub repository (https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/molecular_docking/helixdock), a free data-request link (https://paddlehelix.baidu.com/partnership), the official website (https://paddlehelix.baidu.com/), a free online demo (https://paddlehelix.baidu.com/app/drug/helix-dock/forecast), and the accompanying arXiv paper (https://arxiv.org/abs/2310.13913). For further inquiries, contact [email protected].
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.