How Graphify Builds Codebase Knowledge Graphs and Replaces Vector Search with Graph Traversal

Graphify is a Python tool and Claude Code skill that builds a persistent, queryable knowledge graph of code, documentation, and media, which its README reports cuts token usage by up to 71.5× compared with raw file reads. It does so through a three‑pass pipeline that combines deterministic AST extraction, optional local audio transcription, and AI‑driven semantic extraction.

Data Party THU

May 24, 2026

Why Graphify Exists

Even for large language models (LLMs) with generous context windows, such as Claude Sonnet 4.6 (200K tokens) and GPT‑5.4 (1M tokens), feeding hundreds of source files into every query is prohibitively expensive and slow. Traditional Retrieval‑Augmented Generation (RAG) works well for prose but falls short for code, because relationships like process_payment calling validate_card are structural, not semantic: embedding similarity does not reliably reveal which function calls which.

Graphify’s Core Idea

Instead of embedding files and performing similarity search, Graphify builds an explicit knowledge graph where nodes represent entities (functions, classes, concepts, document sections) and edges represent relationships (calls, imports, references, inferred dependencies). Queries traverse this graph, mirroring how an experienced engineer navigates an unfamiliar codebase.
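
The idea can be illustrated with a toy in-memory graph. This is a minimal sketch, not Graphify's actual data structure: the KnowledgeGraph class, its method names, and the checkout function are assumptions made for illustration. It shows how a structural question ("what calls validate_card?") becomes a lookup over explicit edges rather than a similarity search.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str  # e.g. "calls", "imports", "references"


class KnowledgeGraph:
    """Toy graph for illustration only (hypothetical, not Graphify's API)."""

    def __init__(self):
        self.nodes = {}                     # node id -> kind of entity
        self.outgoing = defaultdict(list)   # node id -> edges leaving it
        self.incoming = defaultdict(list)   # node id -> edges arriving at it

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, source, target, relation):
        edge = Edge(source, target, relation)
        self.outgoing[source].append(edge)
        self.incoming[target].append(edge)

    def callers_of(self, node_id):
        """Answer 'what calls X?' by following incoming 'calls' edges."""
        return sorted(
            e.source for e in self.incoming[node_id] if e.relation == "calls"
        )


graph = KnowledgeGraph()
for name in ("checkout", "process_payment", "validate_card"):
    graph.add_node(name, "function")
graph.add_edge("checkout", "process_payment", "calls")
graph.add_edge("process_payment", "validate_card", "calls")

print(graph.callers_of("validate_card"))  # ['process_payment']
```

Keeping an index of incoming edges alongside outgoing ones makes "who depends on this?" as cheap to answer as "what does this depend on?".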

What Graphify Produces

Running /graphify in a directory generates a graphify-out/ folder containing:

An interactive HTML graph rendered with vis.js.

A persistent JSON graph for programmatic queries.

A Markdown report highlighting high‑degree nodes and community clusters (Leiden algorithm).

Optional outputs: an Obsidian vault, Neo4j database, SVG, GraphML, or an MCP server that exposes the graph as an LLM‑callable tool.
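
To make "high-degree nodes" concrete, the sketch below ranks nodes by how many edges touch them. It is an illustration under stated assumptions: the edge list and the names in it are hypothetical, and the source does not describe how Graphify computes its report or how its JSON graph is laid out.

```python
from collections import Counter

# Hypothetical edge list; in practice the edges would come from the graph.
edges = [
    ("checkout", "process_payment"),
    ("process_payment", "validate_card"),
    ("process_payment", "charge_gateway"),
    ("refund", "charge_gateway"),
]


def top_degree_nodes(edge_list, limit=3):
    """Rank nodes by total degree (incoming plus outgoing edges)."""
    degree = Counter()
    for source, target in edge_list:
        degree[source] += 1
        degree[target] += 1
    # Sort by descending degree, then by name so ties are deterministic.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:limit]


print(top_degree_nodes(edges))
# [('process_payment', 3), ('charge_gateway', 2), ('checkout', 1)]
```

Nodes at the top of such a ranking are usually the hubs worth reading first in an unfamiliar codebase.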

Three Passes of Data Processing

Pass 1 – Deterministic AST Extraction (code stays on‑machine)

Source files are parsed with tree‑sitter, a deterministic, grammar‑based parser, across 23 supported languages (Python, TypeScript, Go, Rust, Java, C/C++, etc.). The result is a dictionary of nodes and edges that reflects every function, class, import, and call found in the source. Each entry is labeled with the confidence tag EXTRACTED to indicate factual certainty.
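
The following sketch shows what deterministic extraction looks like in practice. Note the substitution: it uses Python's built-in ast module instead of tree‑sitter so that it runs without extra dependencies, and it only handles Python source. The extract function, the tuple layout, and the "<module>" placeholder are assumptions for illustration, not Graphify's output format.

```python
import ast

# Sample source to analyze; it is parsed, never executed.
SOURCE = '''
import logging

def validate_card(card):
    return bool(card)

def process_payment(card, amount):
    if validate_card(card):
        return amount
'''


def extract(source):
    """Return (nodes, edges) found by walking the syntax tree."""
    tree = ast.parse(source)
    nodes, edges = [], []
    for item in ast.walk(tree):
        if isinstance(item, ast.Import):
            for alias in item.names:
                edges.append(("<module>", alias.name, "imports", "EXTRACTED"))
        elif isinstance(item, ast.FunctionDef):
            nodes.append((item.name, "function"))
            # Record direct calls to plain names inside this function.
            for inner in ast.walk(item):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.append((item.name, inner.func.id, "calls", "EXTRACTED"))
    return nodes, edges


nodes, edges = extract(SOURCE)
for edge in edges:
    print(edge)
```

Because the output depends only on the syntax tree, running the extraction twice on the same file yields the same nodes and edges, which is what justifies the EXTRACTED label. This simplified version ignores method calls such as obj.method() and would attribute calls in nested functions to the enclosing function as well.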

Pass 2 – Local Audio/Video Transcription (optional)

If the target directory contains audio or video, Graphify invokes faster‑whisper locally (installed via pip install "graphifyy[video]") to generate transcripts without uploading any media. Transcripts become document nodes in the graph.

Pass 3 – Semantic Extraction (documents and images)

Markdown, PDF, RST, PNG, JPG, WebP, and GIF files are sent to the user‑configured AI provider (Anthropic, OpenAI, etc.) using the user's existing API key. The provider extracts entities and relationships, which are added to the graph with the confidence tag INFERRED (model‑derived) or AMBIGUOUS (uncertain). Unlike the first two passes, this step sends file content off the machine, but only to that provider; no central server or telemetry is involved.

Confidence System

Each edge carries one of three labels:

EXTRACTED: directly observed in the source code (e.g., validate_card called by process_payment).

INFERRED: derived from co‑occurrence in documentation (e.g., PaymentService and FraudDetector often appear together).

AMBIGUOUS: the model is unsure; such edges are retained but should not be used for decisive reasoning without human verification.

The design mirrors a citation system: EXTRACTED is a page‑referenced fact, INFERRED a footnote, and AMBIGUOUS a “verify later” note.
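
A consumer of the graph might use the labels as a filter. The sketch below is one way to do that, assuming edges are stored as simple tuples; the Confidence enum, the select helper, the relation names, and the LegacyBilling node are hypothetical and not taken from Graphify.

```python
from enum import Enum


class Confidence(Enum):
    EXTRACTED = "EXTRACTED"   # observed directly in source code
    INFERRED = "INFERRED"     # derived by the model
    AMBIGUOUS = "AMBIGUOUS"   # model unsure; needs human verification


# Hypothetical edges: (source, target, relation, confidence)
EDGES = [
    ("process_payment", "validate_card", "calls", Confidence.EXTRACTED),
    ("PaymentService", "FraudDetector", "related_to", Confidence.INFERRED),
    ("PaymentService", "LegacyBilling", "related_to", Confidence.AMBIGUOUS),
]


def select(edges, allowed):
    """Keep only edges whose confidence tag is in the allowed set."""
    return [edge for edge in edges if edge[3] in allowed]


trusted = select(EDGES, {Confidence.EXTRACTED})
exploratory = select(EDGES, {Confidence.EXTRACTED, Confidence.INFERRED})
to_verify = select(EDGES, {Confidence.AMBIGUOUS})

print(len(trusted), len(exploratory), len(to_verify))  # 1 2 1
```

Restricting decisive reasoning to the trusted set while still surfacing the exploratory set keeps speculative links visible without letting them drive conclusions.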

Getting Started

# Install the core package
pip install graphifyy
# Register the Claude Code skill
graphify install

Optional extras:

# Audio/video support
pip install "graphifyy[video]"
# Office document support
pip install "graphifyy[office]"
# MCP server support
pip install "graphifyy[mcp]"
# Install everything
pip install "graphifyy[all]"

Typical commands:

/graphify                     # Standard analysis of current directory
/graphify --deep             # Aggressive relationship inference
/graphify ./src/auth         # Analyze a specific subdirectory
/graphify --watch            # Rebuild graph on file changes
/graphify query "..."        # Natural‑language query
/graphify path "UserService" "DatabasePool"
/graphify explain "PaymentProcessor"
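
The source does not say how the path command is implemented. Breadth-first search is one common way to find a shortest chain of relationships between two nodes, and the sketch below uses it purely as an illustration. UserService and DatabasePool come from the command above; the intermediate node names and the adjacency-list layout are invented.

```python
from collections import deque

# Hypothetical adjacency list; intermediate node names are invented.
GRAPH = {
    "UserService": ["UserRepository", "AuthMiddleware"],
    "UserRepository": ["DatabasePool"],
    "AuthMiddleware": ["TokenStore"],
    "TokenStore": [],
    "DatabasePool": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(" -> ".join(shortest_path(GRAPH, "UserService", "DatabasePool")))
# UserService -> UserRepository -> DatabasePool
```

The search follows edge direction, so asking for a path from DatabasePool back to UserService returns None in this toy graph.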

Graphify can install Git hooks (/graphify --install-hooks) so that any git commit or git checkout triggers an incremental update, ensuring the graph always reflects the current branch.

Token‑Reduction Claim

The README reports a 71.5× token‑usage reduction on a mixed‑corpus benchmark (derived from the worked/ directory). Architecturally this makes sense: a query that asks “what calls process_payment?” traverses a few graph nodes instead of loading all files. The exact multiplier varies with repository size, file types, and query specificity, and has not yet been validated on public benchmarks.

Suitable and Unsuitable Scenarios

Graphify shines for large monorepos that combine code, architecture docs, design PDFs, and recorded meetings, especially when the same codebase is queried repeatedly and Claude Code is used as an assistant to lower API costs. It is less appropriate for tiny projects (< 20 files), pure‑text repositories where flat RAG outperforms graph traversal, environments where the AI provider forbids sending document content, or use‑cases requiring fully verified analysis (because INFERRED and AMBIGUOUS edges may be speculative).

Limitations

The project is a personal open‑source effort (v0.4.10) without corporate backing, so long‑term maintenance is uncertain. The PyPI package is named graphifyy (with a double “y”), a temporary mismatch with the project name that users should double‑check before installing.

Future Directions

The MCP server integration could make Graphify’s graph a foundational component for autonomous agents that need structured codebase understanding rather than simple file search.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
