Understanding OpenAI o1: Chain‑of‑Thought, Scaling Laws, and Training Strategies
The article explains how OpenAI’s o1 model leverages chain‑of‑thought prompting, dual‑system cognitive theory, and new scaling laws (pre‑training on code and math, plus post‑training reinforcement learning with step‑wise reward models) to achieve superior reasoning, safety, and performance over GPT‑4, heralding a shift toward models that learn to think.
This article uses Asimov’s short story “The Last Question” to illustrate the limits of entropy and introduces the cognitive dual‑system theory (System 1 fast thinking vs. System 2 slow, deliberate reasoning) as a lens for analyzing large language models (LLMs).
It shows how current LLMs such as ChatGPT often fail on simple arithmetic reasoning tasks (e.g., a GSM8K word problem), producing hallucinated answers and highlighting the need for explicit reasoning steps.
The concept of Chain‑of‑Thought (CoT) prompting is presented: by asking the model to generate intermediate rationales (“Let’s think step by step”), LLMs achieve substantial gains on reasoning benchmarks, especially on the small fraction (roughly 5%) of problems that require System 2 processing.
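The zero‑shot CoT trigger can be sketched in a few lines of Python. The prompt template and the answer‑extraction heuristic below are illustrative assumptions, not the article’s own code:

```python
import re

def build_cot_prompt(question: str) -> str:
    """Wrap a question in a zero-shot chain-of-thought prompt
    (the "Let's think step by step" trigger)."""
    return f"Q: {question}\nA: Let's think step by step."

def extract_final_answer(completion: str):
    """Pull the last number out of a model completion -- a simple
    heuristic often used to grade arithmetic CoT on GSM8K-style tasks."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

# Hypothetical model completion, for illustration only:
completion = ("There are 3 boxes with 4 apples each, so 3 * 4 = 12 apples. "
              "The answer is 12.")
print(build_cot_prompt("How many apples are in 3 boxes of 4?"))
print(extract_final_answer(completion))  # -> 12
```

The point of the template is only to elicit the intermediate rationale; grading then keys off the final number rather than the reasoning text.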
Two scaling laws are discussed. The pre‑training scaling law relates model performance to compute, parameters, and data size, while the post‑training scaling law shows that increasing test‑time compute (e.g., more inference steps) can be more effective than enlarging the model.
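The pre‑training law summarized above is commonly written in the Chinchilla form (Hoffmann et al.); the equation below is a sketch of that standard functional shape, not necessarily the article’s exact formula:

```latex
% Chinchilla-style pre-training scaling law:
%   N     = parameter count
%   D     = number of training tokens
%   E     = irreducible loss
%   A, B, \alpha, \beta = fitted constants
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The post‑training observation is the complement: once $N$ and $D$ are fixed, spending additional compute at inference time (more sampled reasoning paths, longer chains) can lower effective error faster than growing $N$ would.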
OpenAI’s o1 model (released September 2024) is described as a breakthrough LLM with strong complex‑reasoning abilities, outperforming GPT‑4 on STEM tasks, Codeforces, AIME, GPQA, and safety benchmarks. o1 is offered in two versions: o1‑preview (the stronger reasoner, but slower and more expensive) and o1‑mini (faster and cheaper, tuned for STEM at the cost of broader world knowledge).
The training pipeline consists of three stages: (1) Pre‑training a base model on massive code, mathematics, and CoT data while de‑emphasizing world‑knowledge text; (2) Supervised fine‑tuning (SFT) on curated CoT datasets to give the model basic step‑by‑step reasoning; (3) Post‑training reinforcement learning (RL) with self‑play, where a generator produces CoT and a verifier (reward model) evaluates each step. Two reward‑model designs are compared: Outcome‑supervised Reward Model (ORM) that only looks at the final answer, and Process‑supervised Reward Model (PRM) that scores every intermediate step.
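The ORM/PRM distinction can be made concrete with a small sketch. The scoring functions below are hypothetical stand‑ins for learned reward models; the product aggregation for the PRM follows the grading used in process‑supervision work (Lightman et al.), with `min()` being another common choice:

```python
from math import prod

def orm_score(final_answer: str, reference: str) -> float:
    """Outcome-supervised reward: 1.0 only if the final answer matches.
    (Hypothetical stand-in for a learned outcome reward model.)"""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def prm_score(step_probs: list) -> float:
    """Process-supervised reward: aggregate per-step correctness
    probabilities. The product penalizes any single bad step."""
    return prod(step_probs)

# A solution whose middle step is shaky gets a low PRM score even if
# the final answer happens to be right:
steps = [0.95, 0.30, 0.90]
print(orm_score("12", "12"))  # 1.0
print(prm_score(steps))       # 0.2565
```

This is why the article favors PRMs for RL: the ORM gives the generator one bit of feedback per solution, while the PRM localizes credit to the exact step where the reasoning went wrong.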
For CoT data generation, the article highlights the STaR (Self‑Taught Reasoner, from the paper “STaR: Bootstrapping Reasoning with Reasoning”) framework and its variant Quiet‑STaR, which iteratively let the model generate rationales, filter the ones that lead to correct answers, and fine‑tune on them. It also outlines three PRM‑based search methods: Best‑of‑N weighted search, beam search, and lookahead search, used to select the most promising reasoning path.
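Of the three search methods, Best‑of‑N weighted search is the simplest to sketch: sample N full solutions, group them by final answer, and sum the verifier’s scores per group, so that many mediocre paths to the same answer can outvote one high‑scoring outlier. The pairs below are made‑up (answer, verifier‑score) samples:

```python
from collections import defaultdict

def best_of_n_weighted(candidates: list) -> str:
    """Best-of-N weighted search: group sampled solutions by final
    answer, sum each group's reward-model scores, and return the
    answer with the highest total. This marginalizes over different
    reasoning paths that reach the same answer."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical (final_answer, verifier_score) pairs for N = 5 samples:
samples = [("2", 0.6), ("1", 0.9), ("2", 0.5), ("3", 0.2), ("2", 0.4)]
print(best_of_n_weighted(samples))  # "2" (total 1.5) beats "1" (0.9)
```

Beam and lookahead search differ in that they apply the PRM during generation, pruning or extending partial chains step by step rather than scoring only completed solutions.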
Safety improvements are achieved by training the model on CoT that includes safety‑related reasoning, allowing o1 to refuse illicit requests (e.g., “grandma‑prompt” jailbreak) more effectively than previous models. The article also discusses critic models (CriticGPT) that can generate natural‑language feedback on code and other outputs.
Finally, the piece argues that o1’s post‑training innovations represent a shift from merely “teaching” models toward letting them “learn to think,” and that future models (e.g., the upcoming Orion) will inherit these techniques to achieve human‑like yet superhuman reasoning. The article closes with a sample of the kind of step‑wise CoT data a process reward model is trained on:
{
"instruction": "For which integer $a$ does $x^2 - x + a$ divide $x^{13} + x + 90?",
"responses": [
"Let's try to divide $x^{13} + x + 90$ by $x^2 - x + a$.",
"We should get a remainder of 0 for this to work.",
"Yeah, so let's do polynomial long division.",
"Ok, so the first term of our dividend is $x^{13}$, and the first term of our divisor is $x^2$.",
"So $x^{13}$ divided by $x^2$ is $x^{11}$.",
"And $x^{11}$ times $x^2$ is $x^{13}$.",
"So our first term matches.",
"The next term of the dividend is $x$, and the next term of the divisor is $-x$.",
"$x^{11}$ multiplied by $x$ is $x^{12}$.",
"And $x^{11}$ multiplied by $a$ is $ax^{11}$.",
"So, the remainder is $x^{12}-ax^{11}+90$.",
"And since $x^{12}$ divided by $x^2$ is $x^{10}$, the next term of the quotient is $x^{10}$.",
"So $x^{10}$ times $x^2$ is $x^{12}$.",
"And $x^{10}$ times $-x$ is $-x^{11}$.",
"Ok, we can take a quick road here. Let's see that if $x=0$, then $a$ must divide 90. And if $x=1$, $a$ must divide 92.",
"Right. So $a$ must be a factor of 90 and 92.",
"The factors of 90 are 1, 2, 3, 5, 6, 9, 10, 15, 18, 30, 45, and 90. The factors of 92 are 1, 2, 4, 23, 46, and 92.",
"And the only factors in common are $1$ and $2$.",
"So $a$ can be $1$ or $2$.",
"Let's see if either of these work. If $a=1$, then our divisor is $x^2-x+1$, and our dividend is $x^{13}+x+90$.",
"Let's put $x=1$.",
"Sorry, let's put $x=-1$."
]
}

DaTaobao Tech