How an LLM Discovered a Real-World SQLite Stack Buffer Underflow

Google’s Big Sleep AI agent, built on the Project Naptime framework, used a large language model to analyze recent SQLite commits, identify a previously unknown stack‑buffer‑underflow bug, and generate a reproducible test case, demonstrating that LLMs can effectively perform real‑world vulnerability research.

Smart Era Software Development

Background

Recent advances in how well large language models (LLMs) understand code have led Google researchers to explore whether LLMs can augment, or partly replace, human security analysts. The Project Naptime write-up introduced a framework for evaluating LLMs on offensive and defensive tasks, and the follow-up Big Sleep project applied this framework to real-world software.

Discovery of a SQLite Vulnerability

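The AI agent identified a previously unknown, exploitable stack-buffer-underflow in the widely used open-source database engine SQLite. The bug is in seriesBestIndex, the xBestIndex callback of the generate_series virtual table. A constraint on ROWID arrives with iColumn set to -1, a value the function does not expect, so the index derived from it is negative and can lead to an out-of-bounds write before the start of a stack buffer.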

struct sqlite3_index_constraint {
    int iColumn;              /* Column constrained. -1 for ROWID */
    unsigned char op;         /* Constraint operator */
    unsigned char usable;    /* True if this constraint is usable */
    int iTermOffset;         /* Used internally – xBestIndex should ignore */
} *aConstraint;               /* Table of WHERE clause constraints */

static int seriesBestIndex(sqlite3_vtab *pVTab, sqlite3_index_info *pIdxInfo){
    ...
    for(i=0; i<pIdxInfo->nConstraint; i++, pConstraint++){
        int iCol = pConstraint->iColumn - SERIES_COLUMN_START;
        assert(iCol>=0 && iCol<=2);   // <-- fails when iColumn == -1
        int iMask = 1 << iCol;
        ...
    }
    ...
}

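The assertion is compiled in only for debug builds. In a release build, where assert() expands to nothing, the negative index would pass unchecked into the code that follows, so the bug could corrupt memory silently instead of stopping at the assertion.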
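The difference between a debug-only assertion and a check that survives release builds can be sketched as follows. This is a minimal illustration and not necessarily the fix SQLite adopted: the helper name series_slot_for_column is hypothetical, and the value of SERIES_COLUMN_START is an assumption, since the excerpt above does not show it.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the hidden "start" column is column 1, so that start, stop
** and step map to slots 0, 1 and 2. The article's excerpt does not show
** this value. */
#define SERIES_COLUMN_START 1

/*
** Hypothetical helper: map a constraint's iColumn to a slot in 0..2.
** Returns true and writes *pSlot on success. Returns false for ROWID (-1)
** or any other column outside that range, leaving *pSlot untouched.
** Unlike assert(), the range check is still present when NDEBUG is defined.
*/
static bool series_slot_for_column(int iColumn, int *pSlot){
    int iCol = iColumn - SERIES_COLUMN_START;
    if( iCol<0 || iCol>2 ){
        return false;                 /* caller should skip this constraint */
    }
    assert( iCol>=0 && iCol<=2 );     /* debug-only record of the invariant */
    *pSlot = iCol;
    return true;
}
```

A caller would skip any constraint for which the helper returns false, so a negative value never reaches an array index or the shift 1 << iCol, which is undefined behavior in C when the shift count is negative.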

Methodology and Toolchain

The researchers collected a series of recent SQLite commits, filtered out documentation‑only changes, and crafted a prompt that supplied both commit messages and code diffs to the AI agent. The agent interacted with the codebase through a set of custom tools:

Code Browser: lets the agent navigate source files and locate specific entities.

Python Sandbox: executes intermediate calculations and generates precise program inputs.

Debugger: enables setting breakpoints and evaluating expressions during execution.

Reporter: structures progress updates and signals success (e.g., a crash) or termination.
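One way to picture this tool set is as a small dispatch table that maps a tool name requested by the agent to a handler category. The sketch below is purely illustrative: the identifiers (ToolKind, ToolSpec, tool_lookup, and the lowercase tool names) are hypothetical and do not come from the Big Sleep implementation, which the article does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical categories, one per tool described in the article. */
typedef enum {
    TOOL_CODE_BROWSER,
    TOOL_PYTHON_SANDBOX,
    TOOL_DEBUGGER,
    TOOL_REPORTER,
    TOOL_UNKNOWN
} ToolKind;

typedef struct {
    const char *name;      /* hypothetical identifier the agent would emit */
    ToolKind    kind;
    const char *purpose;   /* summary taken from the article's description */
} ToolSpec;

static const ToolSpec kTools[] = {
    { "code_browser",   TOOL_CODE_BROWSER,   "navigate source files and locate entities" },
    { "python_sandbox", TOOL_PYTHON_SANDBOX, "run calculations and generate program inputs" },
    { "debugger",       TOOL_DEBUGGER,       "set breakpoints and evaluate expressions" },
    { "reporter",       TOOL_REPORTER,       "report progress, success (a crash) or termination" },
};

/* Resolve a requested tool name; unknown names are rejected, not guessed. */
static ToolKind tool_lookup(const char *name){
    assert( name!=NULL );
    for(size_t i=0; i<sizeof(kTools)/sizeof(kTools[0]); i++){
        if( strcmp(kTools[i].name, name)==0 ) return kTools[i].kind;
    }
    return TOOL_UNKNOWN;
}
```

Keeping the set closed, with an explicit TOOL_UNKNOWN result, reflects the idea that the agent acts only through a fixed, purpose-built interface.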

Execution Trace and Reasoning

ASSISTANT: "By introducing termFromWhereClause, the CL change improves handling of shared WHERE conditions in virtual-table queries. Previously the mapping was incorrect, potentially causing inefficient plans or wrong results. I will first check allocateIndexInfo."

After several investigative steps, the agent realized that the test harness required the TCL virtual‑table module, which was missing. It adjusted the input to use the built‑in generate_series virtual table and crafted the following SQL query to trigger the bug:

SELECT * FROM generate_series(1,10,1) WHERE ROWID = 1;

Running this query caused the assertion failure:

sqlite3: src/shell.c:6816: int seriesBestIndex(...): Assertion `iCol>=0 && iCol<=2' failed.
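To see why this particular query reaches the assertion, it helps to model the constraint list that the query would hand to seriesBestIndex. The sketch below is a simplified stand-in, not SQLite code: the struct constraint_model type and first_out_of_range function are hypothetical, the column numbers for start, stop and step (1, 2, 3) and the value of SERIES_COLUMN_START are assumptions, and the order of the constraints is illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define SERIES_COLUMN_START 1   /* assumption, not shown in the article */

/* Simplified stand-in for sqlite3_index_constraint. */
struct constraint_model {
    int iColumn;            /* column constrained, -1 for ROWID */
    unsigned char usable;   /* true if the constraint is usable */
};

/* Return the position of the first constraint whose derived index falls
** outside 0..2, or -1 if every constraint maps to a valid slot. */
static int first_out_of_range(const struct constraint_model *a, int n){
    assert( a!=NULL || n==0 );
    for(int i=0; i<n; i++){
        int iCol = a[i].iColumn - SERIES_COLUMN_START;
        if( iCol<0 || iCol>2 ) return i;
    }
    return -1;
}

/* Illustrative constraints for:
**   SELECT * FROM generate_series(1,10,1) WHERE ROWID = 1;
*/
static const struct constraint_model kQueryConstraints[] = {
    {  1, 1 },   /* start = 1  (function argument) */
    {  2, 1 },   /* stop  = 10 (function argument) */
    {  3, 1 },   /* step  = 1  (function argument) */
    { -1, 1 },   /* ROWID = 1  (WHERE clause)      */
};
```

Under these assumptions, the three function arguments map to slots 0 through 2, and only the ROWID term produces a negative index, which matches the condition named in the assertion message.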

Why Traditional Fuzzing Missed the Issue

OSS-Fuzz's SQLite configuration does not enable the generate_series extension, and its fuzzingshell.c contains an older version of seriesBestIndex that is not vulnerable. The researchers also ran AFL against the same binary for 150 CPU-hours, with the necessary keywords added to the SQL dictionary, and it did not rediscover the bug. A plausible reason is that coverage feedback alone does not single out this edge case.

Michal Zalewski's 2015 study found AFL highly effective against SQLite. Since then, the low-hanging-fruit vulnerabilities in the codebase appear to have been largely exhausted, which makes LLM-guided analysis comparatively more valuable for finding subtle logic errors.

Conclusion

The Big Sleep team demonstrated that, when equipped with purpose‑built tooling, current LLMs can locate and reproduce previously unknown memory‑safety bugs in widely deployed software before the code is released. The results are experimental, and the authors stress that targeted fuzzers may still be competitive, but AI‑driven variant analysis offers a promising asymmetric advantage for defenders.
