Master LLMs: Basics, Prompt Engineering, RAG, Agents & Multimodal AI
This article provides a comprehensive overview of large language models: their fundamental concepts, historical milestones, parameter scaling, prompt engineering techniques, retrieval‑augmented generation, autonomous agents, and multimodal models, and how these technologies are reshaping AI capabilities across domains.
1. LLM Basics
1.1 What is an LLM?
LLM stands for Large Language Model, a deep‑learning‑based natural‑language‑processing system that can understand and generate natural language (and, in multimodal variants, images and audio). Trained on massive corpora, LLMs excel at translation, writing, dialogue, summarisation, and many other tasks.
1.2 History
Key milestones include the 2017 introduction of the Transformer architecture by Vaswani et al., followed by models such as GPT and BERT that leveraged self‑attention to achieve parallel computation and superior contextual capture.
1.3 Model Size (B)
Parameters are measured in billions (B). For example, GPT‑3 uses 175 B parameters; larger models generally have stronger representation ability but require more data and compute, and excessive parameters can lead to over‑fitting if training data are insufficient.
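As a rough sketch of what this scale implies, the memory footprint of the weights alone can be estimated from the parameter count (assuming 16‑bit weights; real serving needs additional memory for activations and the KV cache):

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for the model weights alone (fp16/bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

# GPT-3 scale: 175 B parameters in fp16 ~ 350 GB of weights
print(f"{weight_memory_gb(175e9):.0f} GB")  # prints "350 GB"
```

This back‑of‑the‑envelope figure explains why models at this scale must be sharded across many accelerators rather than loaded on a single device.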
2. Prompt Engineering
2.1 Prompt Concept
A prompt is a carefully designed instruction or sentence that guides the model to produce outputs aligned with user intent.
2.2 Prompt Components
Instruction (required): tells the model what to do.
Context (optional): additional knowledge, often retrieved from a vector database.
Input Data (optional): the user’s query or data to be processed.
Output Indicator (optional): marks the beginning of the desired output.
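The four components above can be assembled mechanically; a minimal sketch (the template, field labels, and function name are illustrative, not a standard):

```python
def build_prompt(
    instruction: str,
    context: str = "",
    input_data: str = "",
    output_indicator: str = "Answer:",
) -> str:
    """Combine the four prompt components; optional parts are skipped when empty."""
    parts = [instruction]
    if context:
        parts.append(f"Context: {context}")
    if input_data:
        parts.append(f"Question: {input_data}")
    parts.append(output_indicator)
    return "\n".join(parts)

print(build_prompt(
    instruction="Answer the question based on the context below.",
    context="LLMs are large neural networks trained on text.",
    input_data="What is an LLM?",
))
```

Only the instruction is mandatory; the other fields are appended when supplied, mirroring the required/optional split above.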
2.3 Design Principles
Clear goal: define the task explicitly.
Specific guidance: provide concrete constraints.
Concise language: keep prompts short and clear.
Appropriate cues: use examples or boundary questions.
Iterative optimisation: refine based on model outputs.
2.4 Prompt Types
Zero‑Shot Prompting
Few‑Shot Prompting
Chain‑of‑Thought (CoT)
Self‑Consistency
Tree of Thoughts (ToT)
ReAct framework
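As a concrete instance of one type from the list above, a few‑shot prompt prepends a handful of labelled examples so the model can infer the task format (the sentiment examples here are illustrative):

```python
# Few-shot prompting: show the model labelled examples before the real input.
examples = [
    ("This movie was fantastic!", "positive"),
    ("Utterly boring and too long.", "negative"),
]
query = "The plot surprised me in the best way."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```

The trailing "Sentiment:" acts as the output indicator, steering the model to complete the pattern rather than continue the instructions.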
<code>prompt = """ Answer the question based on the context below. If the question cannot be answered using the information provided answer with "I don't know".
Context: Large Language Models (LLMs) are the latest models used in NLP. Their superior performance over smaller models has made them incredibly useful for developers building NLP‑enabled applications. These models can be accessed via Hugging Face's `transformers` library, via OpenAI using the `openai` library, and via Cohere using the `cohere` library.
Question: Which libraries and model providers offer LLMs?
Answer: """</code>
3. Retrieval‑Augmented Generation (RAG)
RAG first retrieves relevant documents from a knowledge base and then feeds them into the LLM, improving factual accuracy and mitigating hallucinations.
3.1 Problems Addressed
Hallucination: models may generate plausible but false statements.
Knowledge cutoff: static training data cannot cover real‑time or proprietary information.
Data security: on‑premise retrieval keeps sensitive data within the enterprise.
3.2 Architecture
RAG can be viewed as "retrieval + generation". Retrieval uses vector databases (FAISS, Milvus, etc.) to fetch relevant chunks; generation uses a prompt that combines the user query with the retrieved context.
3.3 Workflow
Data preparation: extraction → text splitting → embedding → indexing.
Application: user query → similarity or full‑text search → prompt injection → LLM generation.
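The workflow can be sketched end to end with a toy retriever; bag‑of‑words cosine similarity stands in for a real embedding model and vector database such as FAISS or Milvus:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Data preparation: split documents into chunks and index their embeddings.
chunks = [
    "RAG retrieves documents from a knowledge base before generation.",
    "Transformers use self-attention for parallel computation.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Application: embed the query, retrieve the most similar chunk,
# and inject it into the prompt handed to the LLM.
query = "How does RAG use a knowledge base?"
best_chunk = max(index, key=lambda item: cosine(embed(query), item[1]))[0]
prompt = f"Context: {best_chunk}\nQuestion: {query}\nAnswer:"
print(prompt)
```

The final prompt grounds the model in retrieved text, which is how RAG improves factual accuracy relative to generation from parametric memory alone.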
4. AI Agents
4.1 Concept
Agents are AI systems that perceive an environment, plan actions, execute them, and learn from feedback, using an LLM as the reasoning core.
4.2 Core Components
LLM : provides reasoning and language generation.
Tools : external APIs, code execution, search, etc.
Memory : short‑term (context window) and long‑term (vector store) storage of interaction history.
Planning : task decomposition, CoT, ToT, ReAct.
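A toy loop wiring these components together; the tool, the list‑based memory, and the scripted "LLM" are all stand‑ins for real implementations:

```python
# Minimal agent skeleton: an LLM core chooses actions, tools execute them,
# and memory accumulates the interaction history.
def calculator_tool(expression: str) -> str:
    """Example tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator_tool}
memory: list[str] = []  # short-term memory: the interaction history

def fake_llm(history: list[str]) -> tuple[str, str]:
    """Stand-in for a real LLM: scripted to call a tool once, then finish."""
    if not any("Observation" in h for h in history):
        return ("calculator", "6 * 7")
    return ("finish", history[-1].split(": ")[-1])

task = "What is 6 * 7?"
memory.append(f"Task: {task}")
while True:
    action, arg = fake_llm(memory)      # planning/reasoning step
    if action == "finish":
        memory.append(f"Answer: {arg}")
        break
    observation = TOOLS[action](arg)    # tool-execution step
    memory.append(f"Observation: {observation}")

print(memory[-1])  # prints "Answer: 42"
```

Swapping `fake_llm` for a real model call and `TOOLS` for search, code execution, or APIs yields the agent architecture described above.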
4.3 ReAct Example
The ReAct loop interleaves Thought, Action, and Observation steps, allowing the agent to query external tools and refine its reasoning.
<code>Thought: Need to find programs that can control Apple Remote.
Action: Search["Apple Remote control programs"]
Observation: ...
... (repeated many times)</code>
5. Multimodal Models
5.1 Definition
Multimodal models process and understand multiple data types—text, images, audio, video—simultaneously.
5.2 Why Multimodal?
The real world is multimodal; integrating diverse signals yields richer understanding, higher robustness, and better generalisation.
5.3 Characteristics & Applications
Information integration across modalities.
Enhanced expressive power.
Improved robustness when one modality is missing.
Use cases: medical diagnosis, autonomous driving, intelligent customer service.