
Advanced RAG with Semi‑Structured Data Using LangChain, Unstructured, and ChromaDB

This tutorial demonstrates how to build an advanced Retrieval‑Augmented Generation (RAG) system for semi‑structured PDF data by leveraging LangChain, the unstructured library, ChromaDB vector store, and OpenAI models, covering installation, PDF partitioning, element classification, summarization, and query execution.


Preface

RAG (Retrieval‑Augmented Generation) is a natural‑language‑processing technique that combines retrieval (typically over a vector database) with a generative AI model, grounding answers in retrieved context to improve their quality and factual accuracy.

Naive RAG

Naive RAG refers to the most basic retrieve‑and‑generate pipeline, which includes document chunking, embedding, and semantic similarity search based on user queries. While simple, its performance and quality are limited, motivating the move to Advanced RAG.
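The naive retrieve‑and‑generate loop can be sketched without any external services. The following toy example uses a bag‑of‑words count vector in place of a learned embedding, purely to illustrate the chunk → embed → similarity‑search flow (a real pipeline would use a proper embedding model and vector store):

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a sparse word-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Rank chunks by similarity to the query and return the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The statement lists equity transactions for the reporting person.",
    "Semi-structured PDFs mix tables with free-form text.",
    "RAG combines retrieval with a generative language model.",
]
# The retrieved chunk would then be passed as context to a generative model.
context = retrieve("how does rag combine retrieval with generation", chunks)
```

The limitation Advanced RAG addresses is visible even here: naive similarity search treats every chunk the same way, which works poorly when chunks are tables rather than prose.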

Semi‑Structured Data

Semi‑structured data lies between structured and unstructured data, mixing tabular formats with free‑form text, images, or other media. Examples include PDF statements that contain text, tables, and figures. Handling such data requires both SQL‑like processing for the structured parts and embedding‑based retrieval for the unstructured parts.

The demo uses the unstructured package to create custom pipelines for processing these elements, LangChain to orchestrate the RAG workflow, and ChromaDB as the vector store.

Nvidia Equity Change Statement

The example PDF is an Nvidia equity‑change declaration, chosen for its compact size and mix of structured tables and unstructured text.

Practical Steps

Install the required Python packages (the extra is quoted so the brackets are not expanded by the shell):

```shell
!pip install -q -U langchain "unstructured[all-docs]" pydantic lxml openai chromadb tiktoken
```

Download the PDF and name it statement_of_changes.pdf (note the capital `-O`, which sets the output filename; lowercase `-o` would redirect wget's log instead):

```shell
!wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf
```

Install system utilities for PDF extraction and OCR (poppler‑utils, tesseract‑ocr):

```shell
!apt-get install poppler-utils tesseract-ocr
```

Set the OpenAI API key:

```python
import os

os.environ["OPENAI_API_KEY"] = ""
```

Partition the PDF into elements using unstructured's partition_pdf, inferring table structure and chunking by title:

```python
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3000,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)
```

Count element categories to understand the composition of the document:

```python
category_counts = {}
for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

unique_categories = set(category_counts.keys())
category_counts
```

Separate table and text elements into distinct lists (the class paths must be spelled exactly, `unstructured.documents.elements.Table` and `...CompositeElement`, or no element will match):

```python
class Element(BaseModel):
    type: str
    text: Any

table_elements = []
text_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        table_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(Element(type="text", text=str(element)))

print(len(table_elements))
print(len(text_elements))
```
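Matching on the full dotted path inside str(type(element)) is easy to mistype. A less fragile alternative is to compare type(element).__name__. Here is a minimal sketch of that approach using stand‑in classes (the real Table and CompositeElement live in unstructured.documents.elements):

```python
class Table: ...             # stand-in for unstructured.documents.elements.Table
class CompositeElement: ...  # stand-in for unstructured.documents.elements.CompositeElement

def classify(elements):
    """Split parsed elements into table strings and text strings by class name."""
    tables, texts = [], []
    for el in elements:
        name = type(el).__name__  # robust to module-path changes and string typos
        if name == "Table":
            tables.append(str(el))
        elif name == "CompositeElement":
            texts.append(str(el))
    return tables, texts

tables_out, texts_out = classify([Table(), CompositeElement(), CompositeElement()])
```

An isinstance check against the imported classes would be stricter still, at the cost of importing them explicitly.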

Summarize each element using a LangChain chain (note that StrOutputParser must be instantiated with parentheses, otherwise the pipe composition fails at runtime):

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

prompt_text = """You are responsible for concisely summarizing a table or text chunk.

{element}"""
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Summarize tables
tables = [i.text for i in table_elements]
table_summarizes = summarize_chain.batch(tables, {"max_concurrency": 5})

# Summarize texts
texts = [i.text for i in text_elements]
text_summarizes = summarize_chain.batch(texts, {"max_concurrency": 5})
```

Build a MultiVectorRetriever that links summaries (which get embedded and searched) with the original documents through a shared ID:

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# Text documents
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summarizes)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Table documents
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summarizes)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
```
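The linking pattern itself is simple enough to illustrate with plain dictionaries. This is a toy sketch of the idea, not LangChain's implementation: similarity search runs over short summaries, but the doc_id stored alongside each summary recovers the full original document for the prompt context.

```python
import uuid

originals = {}   # stands in for the docstore: doc_id -> full document
summaries = []   # stands in for the vectorstore: (summary, doc_id) pairs

def index(doc, summary):
    """Store an original document and its summary under a shared id."""
    doc_id = str(uuid.uuid4())
    originals[doc_id] = doc
    summaries.append((summary, doc_id))

def lookup(matched_summary):
    """After search picks a summary, fetch the original document via its id."""
    doc_id = next(i for s, i in summaries if s == matched_summary)
    return originals[doc_id]

index(
    "A multi-row table of equity transactions with dates and amounts.",
    "Table of equity transactions.",
)
full_doc = lookup("Table of equity transactions.")
```

This is why tables work better here than in naive RAG: the embedding is computed over a clean prose summary, while the model still receives the exact table text.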

Create the final chain that takes a user question, retrieves relevant context, and generates an answer:

```python
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4")
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
```

Execute a sample query:

```python
chain.invoke("How many stocks were disposed? Who is the beneficial owner?")
```

Summary

MultiVectorRetriever for linking summaries with original documents.

Unstructured library for parsing semi‑structured PDFs.

ChromaDB and InMemoryStore for vector storage and document retrieval.

References

"RAG with Semi‑Structured Data" (Episode 01 of the series).

Nvidia equity‑change statement PDF.

Source code repository.

Tags: Python, AI, LangChain, RAG, ChromaDB, Semi-Structured Data, Unstructured
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
