Optimizing Question‑Answer Search Similarity in Haodf Online: A Semantic Similarity Model Case Study
This article describes how Haodf Online improved its medical question‑answer search by analyzing search challenges, adopting semantic similarity models based on pre‑trained language embeddings, designing contrastive training tasks, and evaluating the resulting increase in click‑through rate and user engagement.
With the rapid advancement of natural language processing, many tasks that were previously hard to automate can now be deployed in online services, delivering efficient user experiences. This article records Haodf Online's exploration of optimizing question‑answer search similarity.
In a search engine, recall and ranking are crucial; the core problems include understanding user intent, identifying truly relevant documents, and ensuring trustworthy content.
Haodf's search faces specific challenges: medical queries are highly descriptive and often colloquial, while the corpus is rigorously vetted, which limits the amount of noisy data available. Traditional BM25 relies on exact term matching supplemented by synonym dictionaries, and it struggles with misspellings, vague expressions, and low‑frequency medical terms.
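To make BM25's limitation concrete, here is a minimal sketch of BM25 scoring over pre-tokenized documents (function and variable names are illustrative, not from Haodf's system). Because the score is built purely from exact term overlap, a paraphrase that shares no surface terms with the query scores zero, no matter how close it is semantically.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with BM25 (illustrative sketch).

    corpus is a list of tokenized documents, used for document
    frequencies and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]  # f == 0 for any query term absent from the document
        score += idf * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
    return score
```

For example, the query "hand swelling" gives a positive score to a document containing those exact tokens, but zero to the paraphrase "palm swollen", which is exactly the mismatch a semantic similarity model is meant to fix.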
The goal is to develop a relevance scoring model that takes two sentences or documents as input and outputs a similarity score, so that semantically close results rank higher (e.g., "hand swelling pain" vs. "hand palm swelling pain").
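Such a scorer is commonly deployed as a bi-encoder: each text is encoded into a vector, and candidates are ranked by cosine similarity to the query vector. The sketch below shows only the scoring and ranking step, with the encoder abstracted away; the function names are illustrative assumptions, not Haodf's actual interface.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    """candidates: list of (doc_id, embedding) pairs.

    Returns (doc_id, score) pairs sorted by similarity, highest first.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With good embeddings, "hand swelling pain" and "hand palm swelling pain" land near each other in vector space even though their token sets differ, which is what lets the semantic model outrank BM25 on paraphrased queries.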
Recent NLP breakthroughs enable encoding text into vectors; models have evolved from Word2Vec to large pre‑trained models like FLAN. However, large models are costly for online inference, so Haodf explores smaller models with knowledge distillation (DistilBERT, AutoTinyBERT) and inference optimizations (TVM).
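The core of the knowledge-distillation approaches mentioned above (as in DistilBERT) is training a small student model to match a large teacher's temperature-softened output distribution. A minimal sketch of that loss, assuming plain logit arrays rather than any specific framework:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numeric stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in the standard distillation recipe (Hinton et al.)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the student reproduces the teacher exactly and grows as their distributions diverge; in practice it is combined with a task loss on labeled data, and the distilled model is then cheap enough for online inference.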
Training tasks progress from pointwise to pairwise and listwise ranking, with contrastive learning approaches such as SimCSE and sentence‑BERT. Additional modules enforce domain‑specific entity alignment and topic consistency.
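The contrastive objective used by SimCSE-style training is typically an in-batch InfoNCE loss: each query is pulled toward its paired positive and pushed away from every other document in the batch. A minimal NumPy sketch under that assumption (the exact loss and temperature in Haodf's system are not specified in the article):

```python
import numpy as np

def info_nce_loss(q_emb, d_emb, temperature=0.05):
    """In-batch contrastive loss.

    q_emb, d_emb: (B, dim) arrays where q_emb[i] and d_emb[i] form a
    positive pair; every other row of d_emb serves as a negative.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = (q @ d.T) / temperature           # (B, B) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)   # numeric stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))  # cross-entropy on the diagonal
```

Pairwise and listwise ranking objectives can be seen as variants of the same idea: the model is trained on relative order among candidates rather than on absolute relevance labels.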
Initial experiments using a modest dataset produced a model that distinguishes descriptive medical texts, but issues remained with rare drug names and noisy query terms.
To strengthen the model, Haodf designed new unsupervised and semi‑supervised tasks inspired by MUM and other works, including entity detection, replacement, correction, and rewrite contrastive tasks, employing losses like am‑softmax and KL‑regularized dropout.
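Of the losses named above, am-softmax is the additive-margin variant of softmax cross-entropy: a fixed margin is subtracted from the positive class's cosine similarity before scaling, forcing the positive to beat the negatives by a buffer rather than just barely. A minimal sketch for a single example (scale and margin values are common defaults, not Haodf's reported settings):

```python
import numpy as np

def am_softmax_loss(cos_sims, target, s=30.0, m=0.35):
    """Additive-margin softmax loss for one example.

    cos_sims: (C,) cosine similarities to each candidate/class;
    target: index of the positive. The margin m is subtracted from
    the positive similarity before scaling by s.
    """
    logits = s * np.asarray(cos_sims, dtype=float)
    logits[target] = s * (cos_sims[target] - m)
    logits -= logits.max()  # numeric stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return float(-log_prob[target])
```

Because the margin handicaps the positive during training, the loss stays nonzero until the positive's similarity clears the negatives by at least m, which tightens the decision boundary; KL-regularized dropout (as in R-Drop) complements this by penalizing divergence between two dropout-perturbed forward passes of the same input.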
After deploying the optimized similarity model, the click‑through rate for Q&A search results increased by 4.6%, and the average user interaction length grew by 8.5%, indicating better relevance and user satisfaction.
Future work will continue to enrich medical knowledge bases, refine data and model capabilities, and address remaining challenges in semantic similarity for medical search.
HaoDF Tech Team
HaoDF Online tech practice and sharing—join us to discuss and help create quality healthcare through technology.