Medical NLP at Alibaba: Data, Algorithms, and Knowledge for Smart Healthcare

This article reviews Alibaba Cloud senior algorithm expert Chen Mosha's presentation on medical NLP, covering Alibaba's healthcare business, data types, electronic medical record quality inspection, span‑based and nested NER models, term normalization, clinical trial outcome prediction, knowledge‑enhanced language models, and the CBLUE benchmark dataset.

DataFunTalk

The talk, presented by Alibaba Cloud senior algorithm expert Chen Mosha and organized by DataFunTalk, introduced the rapid growth of NLP applications in smart healthcare, focusing on three layers: data, algorithms, and knowledge.

Alibaba's healthcare portfolio includes Alibaba Cloud services for hospitals and public health, Alibaba Health's e‑commerce and online consultation platforms, Ant Insurance's intelligent claims, the Quark vertical search engine, and DAMO Academy teams working on medical AI and imaging.

Key medical data sources were described: electronic medical records (EMR) with high variability, drug instructions, examination reports, online consultation dialogues, and medical textbooks. The primary use case highlighted was EMR quality inspection, which checks consistency and diagnostic adequacy against regional standards.

To address EMR inspection, a span‑based backbone model using BERT was proposed to jointly extract entities and attributes, handling nested and non‑contiguous entities. An improved version reduces inference complexity from O(N²) to O(m·N) by thresholding start/end probabilities.
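The pruning idea can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes the model emits one start probability and one end probability per token, and it reads m as the number of tokens that pass the start threshold (the summary does not define m). The function names and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive token indices


def all_spans(n: int) -> List[Span]:
    """Baseline: enumerate every span, n * (n + 1) / 2 candidates, i.e. O(N^2)."""
    return [(i, j) for i in range(n) for j in range(i, n)]


def pruned_spans(
    start_probs: List[float],
    end_probs: List[float],
    threshold: float = 0.5,  # hypothetical value, tuned on held-out data in practice
) -> List[Span]:
    """Keep only spans whose start and end both pass the threshold.

    With m surviving start positions, the scan below touches at most
    m * N positions instead of all N^2 pairs.
    """
    n = len(start_probs)
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    spans = []
    for i in starts:                 # m candidate starts
        for j in range(i, n):        # at most N end positions each
            if end_probs[j] >= threshold:
                spans.append((i, j))
    return spans


# Toy scores for a 5-token sentence (made-up numbers for illustration only).
start_probs = [0.9, 0.1, 0.8, 0.1, 0.2]
end_probs = [0.1, 0.7, 0.1, 0.9, 0.1]
candidates = pruned_spans(start_probs, end_probs)
```

Here only 3 of the 15 possible spans reach the span classifier, and the nested pair (0, 3) and (2, 3) both survive, so nesting is still representable after pruning.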

For nested entity recognition, a constituent‑parsing‑based approach with a TreeCRF was presented, achieving high F1 scores on Chinese medical NER benchmarks while lowering computational cost.
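One reason a parsing view suits nested entities is that the spans of a constituency tree may nest or sit side by side but never partially overlap. The sketch below only checks that structural constraint on a set of predicted spans; it is not a TreeCRF, and the helper names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive token indices


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans partially overlap (neither nested nor disjoint)."""
    (s1, e1), (s2, e2) = a, b
    return (s1 < s2 <= e1 < e2) or (s2 < s1 <= e2 < e1)


def is_tree_compatible(spans: Iterable[Span]) -> bool:
    """True if all spans could appear together as constituents of one tree."""
    return not any(crosses(a, b) for a, b in combinations(list(spans), 2))


nested = [(0, 3), (2, 3), (5, 6)]   # (2, 3) sits inside (0, 3); (5, 6) is disjoint
crossing = [(0, 2), (1, 3)]         # partial overlap, cannot share a tree
```

A tree-structured decoder rules out sets like `crossing` by construction, while still allowing the nested case.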

Medical term normalization was tackled with a two‑step pipeline: BM25 retrieval of candidate concepts from an ICD dictionary followed by BERT‑based re‑ranking, producing top‑3 candidates for expert verification.
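The retrieve-then-re-rank flow can be sketched in plain Python. This is an illustration under stated assumptions: BM25 is implemented by hand with common default parameters (k1 = 1.5, b = 0.75), the dictionary is a four-entry toy using ICD-10 category titles, and the BERT re-ranker is replaced by a token-overlap stand-in (`jaccard_scorer`) so the example runs without model weights. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_retrieve(query: str, concepts: Dict[str, str], top_k: int = 10,
                  k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Step 1: rank dictionary entries by BM25 score, return matching codes."""
    docs = {code: tokenize(name) for code, name in concepts.items()}
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    doc_freq = Counter(t for d in docs.values() for t in set(d))
    scored = []
    for code, doc in docs.items():
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((code, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [code for code, _ in scored[:top_k]]


def jaccard_scorer(mention: str, name: str) -> float:
    """Stand-in for the BERT re-ranker (an assumption, not the talk's model)."""
    a, b = set(tokenize(mention)), set(tokenize(name))
    return len(a & b) / len(a | b) if a | b else 0.0


def normalize(mention: str, concepts: Dict[str, str],
              scorer: Callable[[str, str], float] = jaccard_scorer,
              top_k: int = 10, top_n: int = 3) -> List[str]:
    """Step 2: re-rank the BM25 candidates and keep top_n for expert review."""
    candidates = bm25_retrieve(mention, concepts, top_k)
    ranked = sorted(candidates, key=lambda c: scorer(mention, concepts[c]), reverse=True)
    return ranked[:top_n]


# Toy dictionary for illustration; a real ICD dictionary has tens of thousands of entries.
ICD_TOY = {
    "E10": "type 1 diabetes mellitus",
    "E11": "type 2 diabetes mellitus",
    "I10": "essential (primary) hypertension",
    "J45": "asthma",
}
suggestions = normalize("diabetes mellitus type 2", ICD_TOY)
```

In a real system the `scorer` argument would wrap a fine-tuned BERT cross-encoder that scores each (mention, concept name) pair; keeping it as a parameter lets the cheap retrieval step stay unchanged while the re-ranker is swapped.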

The speaker also described a clinical‑trial outcome prediction model, EBM‑Net, which takes a trial's background, population, intervention, comparison, and outcome (BPICO) as input and frames result prediction as a language‑model task, making it possible to prioritize promising trial designs.

A knowledge‑enhanced biomedical language model (KeBioLM) was introduced, integrating entity extraction and knowledge‑graph information to improve representation for Chinese medical text.

The CBLUE benchmark, a Chinese biomedical language understanding leaderboard co‑organized with CHIP, was explained, covering entity extraction, QA, classification, and term normalization, with plans for a 2.0 version adding generative tasks.

The Q&A session addressed data access challenges, annotation workflows, dataset expansion, multi‑modal extensions, and practical considerations for term standardization and model deployment.

Tags: knowledge graph, dataset, Alibaba Cloud, entity extraction, Healthcare AI, medical NLP, Clinical Trial Prediction
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
