
Building a Medical Knowledge Base with RAG: A Step‑by‑Step Example

This article demonstrates how to construct an AI‑powered medical knowledge base for diabetes treatment by preprocessing literature, performing semantic chunking, generating BioBERT embeddings, storing them in a FAISS vector database, and using a RAG framework together with a knowledge graph to retrieve and generate accurate answers.


Recent inquiries about AI knowledge bases have highlighted the need for practical examples; this guide uses a medical scenario (diabetes treatment) to illustrate preprocessing, vectorization, storage, and retrieval with a Retrieval‑Augmented Generation (RAG) pipeline.

A Simple RAG Case

Assume a paper titled "Latest Advances in Diabetes Treatment". The text is first cleaned and formatted.

Diabetes is a common chronic disease, and in recent years its treatment has advanced significantly. In pharmacotherapy, new drugs such as SGLT2 inhibitors and GLP-1 receptor agonists have become important options for treating type 2 diabetes. Recent studies show that SGLT2 inhibitors significantly reduce the incidence of cardiovascular events and improve renal function.

The preprocessing step removes redundant tags and prepares the text for chunking.
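What "cleaning" involves depends on the source format; as a minimal sketch (the function name and rules here are illustrative, not the article's actual pipeline), it might strip residual markup and normalize whitespace:

```python
import re

def clean_text(raw: str) -> str:
    """Strip markup tags and collapse whitespace before chunking."""
    text = re.sub(r"<[^>]+>", "", raw)        # drop residual HTML/XML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(clean_text("<p>Diabetes  is a common\nchronic disease.</p>"))
# → Diabetes is a common chronic disease.
```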

1. Text Chunking

Semantic chunking preserves the meaning of each paragraph, resulting in three blocks:

Background and treatment progress of diabetes.

Introduction of new drugs: SGLT2 inhibitors and GLP‑1 agonists.

Recent research results on SGLT2 inhibitor efficacy.

Each block will later be vectorized.
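A true semantic chunker typically compares embedding similarity between adjacent sentences, but a minimal rule-based stand-in (the splitting rule and size budget below are illustrative assumptions) conveys the idea: split into sentences, then pack sentences into chunks under a size limit.

```python
import re

def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Greedy chunker: split on sentence boundaries, then pack
    consecutive sentences into chunks no longer than max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("Diabetes is a common chronic disease. "
       "SGLT2 inhibitors and GLP-1 receptor agonists are new drug options. "
       "Recent studies show SGLT2 inhibitors reduce cardiovascular events.")
for c in chunk_text(doc, max_chars=70):
    print(c)
```

With a 70-character budget, each of the three sentences lands in its own chunk, mirroring the three blocks above.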

2. Vectorization

Each chunk is encoded with a biomedical model such as BioBERT to obtain embedding vectors.

from transformers import BertTokenizer, BertModel
import torch

# Load the BioBERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
model = BertModel.from_pretrained('dmis-lab/biobert-v1.1')
model.eval()

# Text chunks extracted from the paper
paragraphs = [
    "Diabetes is a common chronic disease, and in recent years its treatment has advanced significantly.",
    "In pharmacotherapy, new drugs such as SGLT2 inhibitors and GLP-1 receptor agonists have become important options for treating type 2 diabetes.",
    "Recent studies show that SGLT2 inhibitors significantly reduce the incidence of cardiovascular events and improve renal function.",
]

def get_embedding(text):
    # Tokenize (truncating to BERT's 512-token limit), then mean-pool
    # the last hidden states into a single sentence vector
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embeddings = [get_embedding(p) for p in paragraphs]

for i, embedding in enumerate(embeddings):
    print(f"Paragraph {i+1} embedding shape:", embedding.shape)

The resulting embeddings capture the semantic information of each paragraph.
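Once vectors are available, cosine similarity is the standard way to compare them; the example below mocks the embeddings as small NumPy arrays rather than real 768-dimensional BioBERT outputs, so the values are illustrative only:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for paragraph embeddings
drug_para  = np.array([0.9, 0.1, 0.3])   # drug-therapy paragraph
study_para = np.array([0.8, 0.2, 0.4])   # related research paragraph
background = np.array([0.1, 0.9, 0.1])   # background paragraph

print(cosine_similarity(drug_para, study_para))  # high: related topics
print(cosine_similarity(drug_para, background))  # lower: different topics
```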

3. Storing Vectors

Embeddings are stored in a FAISS index for efficient similarity search.

import faiss
import numpy as np

# Stack the torch embeddings into one float32 matrix (FAISS requires float32)
all_embeddings = np.vstack([e.numpy() for e in embeddings]).astype('float32')

dim = all_embeddings.shape[1]   # 768 for BioBERT
index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) search
index.add(all_embeddings)

# Query example
query_embedding = get_embedding("What are the latest drugs for diabetes?").numpy().astype('float32')
D, I = index.search(query_embedding, k=3)
print(f"Most relevant document indices: {I}")

The index returns the most similar document chunks for a given query.

4. Retrieval and Generation

Retrieved paragraphs are fed to a generation model (e.g., GPT) to produce a concise answer.

# Assume the second paragraph is the most relevant
relevant_paragraph = paragraphs[1]
prompt = f"Answer the user's question based on the following literature:\n{relevant_paragraph}\n\nQuestion: What are the latest drugs for diabetes?"
# Pseudo-code: in practice the prompt is sent to a generative model such as GPT
generated_answer = "According to recent research, SGLT2 inhibitors and GLP-1 receptor agonists are effective drugs for treating type 2 diabetes."
print(generated_answer)

The system thus returns a precise answer to the user query.
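In a real pipeline the prompt is assembled from the top-ranked retrieved chunks rather than hardcoded; a minimal helper (names and wording are illustrative) might build the string that is then passed to whichever LLM API is in use:

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate retrieved context and the user question into one prompt."""
    context = "\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "Answer the question using only the literature excerpts below.\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What are the latest drugs for diabetes?",
    ["SGLT2 inhibitors and GLP-1 receptor agonists are key treatments for type 2 diabetes."],
)
print(prompt)
```

Keeping prompt assembly in one function makes it easy to later add citation markers or truncate the context to fit a model's token limit.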

5. Making RAG More Effective

To keep the knowledge base up‑to‑date, automate periodic crawling of sources such as PubMed or Google Scholar, apply relevance and timeliness filters (e.g., only papers from the last five years), and add metadata tags for better retrieval.
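The relevance-and-timeliness filter can start as a simple check on metadata fields before ingestion; the schema below (title/year dictionaries, keyword matching on titles) is an assumption for illustration:

```python
def is_ingestible(paper: dict, keywords: set[str], current_year: int,
                  max_age_years: int = 5) -> bool:
    """Keep a paper only if it is recent enough and its title
    matches at least one keyword."""
    age = current_year - paper["year"]
    title_words = set(paper["title"].lower().split())
    return age <= max_age_years and bool(keywords & title_words)

papers = [
    {"title": "SGLT2 inhibitors in type 2 diabetes", "year": 2023},
    {"title": "Historical overview of insulin", "year": 2001},
]
fresh = [p for p in papers if is_ingestible(p, {"diabetes", "sglt2"}, current_year=2025)]
print([p["title"] for p in fresh])  # keeps only the recent, on-topic paper
```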

Integrating a knowledge graph (Neo4j, Apache Jena, GraphDB) can enrich the RAG pipeline by providing structured relationships between diseases, drugs, and symptoms, enabling more comprehensive reasoning.
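In production this lookup would be a Cypher query against Neo4j (or SPARQL against Jena/GraphDB); as a minimal in-memory stand-in, a triple list shows how structured drug-disease-effect relations can be pulled alongside vector retrieval (the triples themselves are illustrative):

```python
# Minimal in-memory triple store standing in for a real graph database
triples = [
    ("SGLT2 inhibitor", "treats", "type 2 diabetes"),
    ("SGLT2 inhibitor", "reduces", "cardiovascular events"),
    ("GLP-1 receptor agonist", "treats", "type 2 diabetes"),
]

def related(entity: str) -> list[tuple[str, str]]:
    """Return (relation, object) pairs for a subject entity."""
    return [(r, o) for s, r, o in triples if s == entity]

# Structured facts can be appended to retrieved text before generation
facts = related("SGLT2 inhibitor")
print(facts)  # [('treats', 'type 2 diabetes'), ('reduces', 'cardiovascular events')]
```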

Conclusion

The example shows how to preprocess medical literature, generate embeddings with BioBERT, store them in FAISS, retrieve relevant chunks, and combine them with a generative model to answer clinical questions, while highlighting the importance of continuous updates, relevance filtering, and knowledge‑graph integration.

Tags: RAG, FAISS, knowledge graph, medical AI, vector embedding, BioBERT
Written by DevOps

DevOps shares premium content and events on trends, applications, and practices in development efficiency, AI, and related technologies. The IDCF (International DevOps Coach Federation) trains end-to-end development-efficiency talent, linking high-performing organizations and individuals to achieve excellence.
