
Index-1.9B-32K: A 2% GPT-Size Model with Powerful Long-Context Capabilities

Index-1.9B-32K is a 1.9B-parameter model with a 32K-token context window. Trained via continued long-text pre-training and supervised fine-tuning, it achieves long-text performance comparable to much larger models while using only about 2% of GPT-4's compute, at the cost of reduced short-context ability.

Bilibili Tech

Index-1.9B-32K is a 1.9B-parameter language model that supports a 32K-token context window, enabling it to read documents longer than 35,000 characters in a single pass.

The model achieves strong long-text performance while consuming only about 2% of the compute budget of GPT-4. In benchmark comparisons, its scores surpass those of 7B-parameter models and approach larger models such as GPT-4 and Qwen2.

Training involved two stages after the base Index‑1.9B model:

Long PT: continued pre-training on a curated >100B-token long-text corpus, using doc-packing, a token-level batch size of 4M, a peak learning rate of 1e-5, a cosine LR schedule with warm-up, weight decay of 0.1, and gradient clipping at 1.0.

Long SFT: supervised fine-tuning on more than 30K long-text instructions plus 50K general instructions, with a token-level batch size of 1M, a peak learning rate of 5e-6, and the same regularization settings as Long PT.
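The doc-packing used in the Long PT stage can be sketched as follows: tokenized documents are concatenated (separated by an end-of-sequence token) and sliced into fixed-length training sequences, so no sequence is padded. The function name, separator convention, and handling of the trailing partial buffer are illustrative assumptions, not the authors' actual code.

```python
def pack_documents(docs, seq_len, eos_id):
    """Pack tokenized documents into contiguous sequences of seq_len tokens."""
    buffer, sequences = [], []
    for doc in docs:
        buffer.extend(doc + [eos_id])  # separate documents with an EOS token
        while len(buffer) >= seq_len:
            sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return sequences  # any trailing partial buffer is dropped in this sketch

packed = pack_documents([[1, 2, 3], [4, 5, 6, 7, 8]], seq_len=4, eos_id=0)
# packed -> [[1, 2, 3, 0], [4, 5, 6, 7]]
```

Packing avoids wasting compute on padding tokens, which matters when target sequences are 32K tokens long.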

Key hyperparameters include a RoPE base of 32 × 10,000 (i.e., 320,000), a maximum sequence length of 32,768, and a matching position-encoding length. The RoPE base was selected after theoretical analysis and empirical experiments, which confirmed that larger values (e.g., in the millions) do not yield further gains.
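The effect of raising the RoPE base can be seen in the standard rotary-embedding frequency computation: a larger base slows the per-position rotation, so positions remain distinguishable across the 32,768-token window. The values below come from the article (base = 320,000 vs. the common default of 10,000); the function is a generic RoPE frequency sketch, not the authors' exact implementation.

```python
def rope_inv_freq(dim, base):
    """Per-pair inverse frequencies for rotary position embeddings."""
    # Standard RoPE: theta_i = base^(-2i/dim) for each of dim/2 rotation pairs.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

default_freqs = rope_inv_freq(dim=64, base=10_000)    # trained short-context base
long_ctx_freqs = rope_inv_freq(dim=64, base=320_000)  # Index-1.9B-32K base

# Except for the first pair (always 1.0), every frequency shrinks,
# i.e. each dimension rotates more slowly per position.
assert all(l < d for l, d in zip(long_ctx_freqs[1:], default_freqs[1:]))
```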

Evaluation covered both long-text and short-text abilities. Long-text performance was measured with NeedleBench, LongBench, and LEval, yielding scores of 91.08, 35.23, and 35.86 respectively — often exceeding larger models. Short-text ability, assessed with a custom benchmark and MMLU, dropped by roughly 25%, highlighting the trade-off between long- and short-context competence.

OpenCompass was used for all evaluations, and the evaluation code is fully open‑source, allowing reproducibility. During long‑context testing, a prompt‑truncation issue was identified; the solution retains the first and last 0.5 × max_prompt_len tokens while discarding the middle portion.
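The truncation fix described above can be sketched in a few lines: when a prompt exceeds the limit, keep the first and last 0.5 × max_prompt_len tokens and drop the middle, so both the task instructions (usually at the ends) and the needle context near them survive. The function name is illustrative; token lists stand in for real tokenizer output.

```python
def truncate_middle(tokens, max_prompt_len):
    """Keep the first and last halves of the budget; discard the middle."""
    if len(tokens) <= max_prompt_len:
        return tokens  # short enough, no truncation needed
    half = max_prompt_len // 2
    # First half of the budget from the front, the remainder from the back.
    return tokens[:half] + tokens[len(tokens) - (max_prompt_len - half):]

print(truncate_middle(list(range(10)), 6))  # -> [0, 1, 2, 7, 8, 9]
```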

Additional experiments compared training‑free context‑extension methods such as Dynamic NTK and naive extrapolation. Dynamic NTK with a scaling factor of 8 was evaluated, and results were visualized alongside other techniques.
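Dynamic NTK, one of the training-free methods compared above, rescales the RoPE base at inference time as the sequence grows past the trained length. The sketch below follows the commonly used Dynamic NTK heuristic (as popularized in open-source inference stacks); the scaling factor of 8 matches the value evaluated in the article, but the exact formula the authors used is an assumption.

```python
def dynamic_ntk_base(base, dim, seq_len, trained_len, scale=8):
    """Rescale the RoPE base once seq_len exceeds the trained context length."""
    if seq_len <= trained_len:
        return base  # within the trained window: leave the base unchanged
    # Common Dynamic NTK heuristic: grow the base with the overshoot ratio,
    # exponentiated by dim/(dim-2) so the lowest frequency stays in range.
    ratio = scale * seq_len / trained_len - (scale - 1)
    return base * ratio ** (dim / (dim - 2))

print(dynamic_ntk_base(10_000, 64, 2048, 2048))  # unchanged: 10000
print(dynamic_ntk_base(10_000, 64, 8192, 2048))  # grows well beyond 10000
```

Because the adjustment happens purely at inference time, it serves as a cheap baseline against which trained extensions like Long PT/SFT can be compared.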

The authors also discuss several failed attempts — context-length warm-up, packing vs. non-packing strategies, and using only a 0.1% long-instruction fraction for SFT — all of which yielded negligible or negative effects.

Limitations and disclaimer: the model may generate inaccurate, biased, or harmful content. Users must perform their own safety testing and avoid deploying the model for malicious purposes.

Tags: AI, Fine-tuning, Open-source, large language model, long context, evaluation, Pretraining
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
