Deploying a Private LLM Knowledge Base on a MacBook
The guide walks through installing and quantizing the open‑source ChatGLM3‑6B model and the m3e‑base embedder on a MacBook, wrapping them in a FastAPI service with OpenAI‑compatible endpoints, routing requests through a One‑API gateway, storing metadata in MongoDB and vectors in PostgreSQL with pgvector, deploying FastGPT for retrieval‑augmented generation (RAG), and ingesting data. It demonstrates 5–7 second response times and closes with directions for future improvement.
This article describes how to set up a private large‑language‑model (LLM) knowledge‑base solution on a MacBook to assist personal knowledge management.
It first explains the motivation for a local deployment (data security, flexibility) and then outlines the overall architecture, which combines the Chinese open‑source model ChatGLM3‑6B, the embedding model m3e‑base, a FastAPI wrapper exposing OpenAI‑compatible endpoints, the One‑API gateway, and the FastGPT knowledge‑base platform.
Model preparation: download ChatGLM3‑6B from Hugging Face or ModelScope, quantize it with chatglm.cpp (8‑bit or lower), and verify the quantized model interactively with ./build/bin/main -m chatglm3-ggml-q8.bin -i. Also download the m3e‑base embedding model.
Model API service: a FastAPI application (see code excerpt) provides /v1/chat/completions and /v1/embeddings endpoints, using chatglm_cpp.Pipeline for inference and SentenceTransformer for embeddings. The service is run with uvicorn chatglm_cpp.openai_api:app --host 127.0.0.1 --port 8000.
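At its core, the wrapper translates between the OpenAI wire format and the local pipeline. A minimal sketch of that translation (the make_chat_completion helper and the inlined reply are illustrative, not the guide's actual code) builds a response shaped like OpenAI's /v1/chat/completions payload:

```python
import time
import uuid

def make_chat_completion(model: str, reply_text: str) -> dict:
    """Wrap locally generated text in an OpenAI-style
    /v1/chat/completions response body."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply_text},
                "finish_reason": "stop",
            }
        ],
    }

# What a FastAPI handler would return after calling the local pipeline
resp = make_chat_completion("chatglm3-6b", "你好！有什么可以帮你？")
print(resp["choices"][0]["message"]["content"])
```

Because the body matches what OpenAI clients expect, existing SDKs can point at the local service unchanged.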
One‑API gateway: the open‑source One‑API project (Go/Node) is compiled and configured to route requests to the local ChatGLM3‑6B and m3e‑base services, providing unified API management and token accounting.
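Conceptually, the gateway keeps a routing table from model name to local upstream, so clients talk to one endpoint regardless of which model serves the request. A toy sketch of that mapping (the UPSTREAMS table and route helper are hypothetical; the port matches the FastAPI service above):

```python
# Toy model-name -> upstream routing table, mimicking what the
# One-API gateway does for this setup (port 8000 is the local
# FastAPI wrapper started above).
UPSTREAMS = {
    "chatglm3-6b": "http://127.0.0.1:8000/v1/chat/completions",
    "m3e-base": "http://127.0.0.1:8000/v1/embeddings",
}

def route(model: str) -> str:
    """Return the upstream URL that should serve the requested model."""
    try:
        return UPSTREAMS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}")

print(route("chatglm3-6b"))
```

The real gateway adds key management and token accounting on top of this lookup, but the dispatch idea is the same.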
Knowledge‑base backend: MongoDB stores metadata while PostgreSQL with the pgvector extension stores the vector embeddings. Installation commands for both databases on macOS are provided.
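pgvector's job here is nearest-neighbour search over the stored embeddings; its <=> operator returns cosine distance, which can be sketched in plain Python:

```python
import math

def cosine_distance(a, b):
    """Cosine distance as computed by pgvector's <=> operator:
    1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Identical directions -> distance 0; orthogonal -> distance 1
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

In SQL, retrieval then amounts to ordering rows by this distance against the query embedding and taking the closest few.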
FastGPT deployment: FastGPT (an open‑source RAG platform) is cloned and its environment variables are pointed at the One‑API endpoint, MongoDB, and PostgreSQL. After installing the Node.js dependencies (pnpm i) and launching the app (pnpm dev), the web UI is accessible at http://localhost:3000.
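As a rough illustration, the wiring amounts to a handful of .env entries. The variable names below follow FastGPT's configuration templates at the time of writing, and the One‑API port, key, and credentials are placeholders; verify every name against the .env.template shipped with your FastGPT version.

```shell
# Assumed FastGPT .env fragment -- names and values are placeholders,
# check the project's own .env.template for your version.
OPENAI_BASE_URL=http://localhost:3001/v1      # One-API gateway (port is a placeholder)
CHAT_API_KEY=sk-xxxx                          # key issued by One-API
MONGODB_URI=mongodb://localhost:27017/fastgpt # metadata store
PG_URL=postgresql://postgres:password@localhost:5432/fastgpt  # pgvector store
```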
Knowledge ingestion: the guide shows how to create a knowledge base, select m3e‑base as the index model, and import data via manual entry, CSV upload, or the API. Example queries demonstrate successful retrieval with source citations.
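Retrieval quality depends heavily on how imported documents are split before embedding. A minimal fixed-size chunker with overlap (sizes here are illustrative; FastGPT applies its own splitting logic) can be sketched as:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so content
    straddling a boundary appears in two adjacent chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# 500 characters, 200-char chunks, 50-char overlap -> 3 chunks
chunks = chunk_text("a" * 500, size=200, overlap=50)
print(len(chunks))  # 3
```

The overlap is what lets a sentence cut at a chunk boundary still be retrieved whole from at least one chunk.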
Validation and results : sample chat interactions with ChatGLM3 show response times around 5–7 seconds and memory usage of ~3.8 GB on a 16 GB MacBook.
Future work : suggestions include improving chunking and embedding strategies, advanced prompt engineering, workflow orchestration, scaling hardware, and applying the system to real business problems.
DaTaobao Tech
Official account of DaTaobao Technology