Unlock 1‑Minute AI Video Generation with TTT‑Video‑Dit: Break the 3‑Second Limit
TTT‑Video‑Dit is an open‑source framework that combines test‑time training with hierarchical attention to generate coherent 63‑second videos, including style transfer. According to the figures reported for the project, it cuts GPU memory use enough for a single RTX 4090 to handle workloads that would otherwise call for H100‑class hardware, making long AI video practical for creators and developers.
Overview
TTT-Video-Dit is an open‑source framework for generating long video clips (up to 63 s) with consistent style using test‑time training (TTT) and a layered attention architecture. It builds on the CogVideoX‑5B diffusion model and runs on a single consumer‑grade GPU.
Key Technical Features
Segment‑extension + global attention: Starts from a 3‑second seed, fine‑tunes style, then iteratively extends the clip to 9 s, 18 s, 30 s and finally 63 s. The reported continuity improvement is ≈80 % over naïve stitching of independent clips.
Unified style‑transfer: Local attention preserves short‑clip detail, while an added TTT layer enforces global style consistency. Reported fine‑detail fidelity is roughly 3× that of conventional style‑transfer pipelines.
GPU memory optimisation: Hierarchical training and parameter reuse reportedly cut peak memory from ~48 GB on an H100 to ~24 GB; an optional low‑memory mode lowers it further to ~16 GB, which fits within the 24 GB of an RTX 4090.
Extensible configuration: Video length (3‑63 s), resolution (up to 1080p), and style template (10+ built‑in) are set in a YAML file.
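The segment‑extension schedule above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not code from the repository: it assumes every stage is tiled by 3‑second segments (the seed length quoted above) that local attention handles independently, and the names SEGMENT_SECONDS, STAGE_SECONDS and segment_windows are hypothetical.

```python
# Illustrative sketch only; not taken from the ttt-video-dit code base.
# Assumption: each stage is tiled by 3-second segments handled by local
# attention, while the TTT layer is what ties the segments together globally.

SEGMENT_SECONDS = 3                  # seed clip length quoted in the article
STAGE_SECONDS = [3, 9, 18, 30, 63]   # extension schedule quoted in the article


def segment_windows(duration_s, segment_s=SEGMENT_SECONDS):
    """Return (start, end) second offsets of the segments covering a clip."""
    if duration_s % segment_s != 0:
        raise ValueError("duration must be a multiple of the segment length")
    return [(t, t + segment_s) for t in range(0, duration_s, segment_s)]


if __name__ == "__main__":
    for stage in STAGE_SECONDS:
        windows = segment_windows(stage)
        print(f"{stage:>2} s stage: {len(windows):>2} segments, last window {windows[-1]}")
```

Under this assumption the final 63‑second stage consists of 21 segments, which is the span over which the global layer has to keep style and content consistent.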
Typical Use Cases
Social‑media content creation
Generate a 63‑second cartoon clip from a single text prompt.
python sample.py --prompt "A ginger cat sunbathing on a balcony in Ghibli style, warm colors" --duration 63 --style ghibli
Corporate marketing – style transfer
Apply a vintage film style to existing 1‑minute footage while preserving detail.
python scripts/style_transfer.py --input_video ./factory.mp4 --style vintage --output ./factory_vintage.mp4
Research – custom long‑video model fine‑tuning
Fine‑tune on a domain‑specific dataset (e.g., 100 three‑second pet clips) using progressive training stages.
# Example training pipeline (pseudo‑commands)
# 1. Prepare data
# 2. Activate conda environment with CUDA 12.3+ and gcc 11+
# 3. Train sequentially: 3s → 9s → 18s → 30s → 63s
# 4. Resulting model generates 63‑second videos with ~22 GB peak memory
Quick‑Start Guide (5 steps)
Step 1 – Clone repository and create environment
# Clone the project (including submodules)
git clone --recursive https://github.com/test-time-training/ttt-video-dit.git
cd ttt-video-dit
# Create conda environment
conda env create -f environment.yaml
conda activate ttt-video
# Install TTT‑MLP kernel (requires CUDA 12.3+ and gcc 11+)
cd ttt-tk && python setup.py install
cd ..
Step 2 – Download pretrained weights
Download the CogVideoX‑5B safetensors shards, the VAE and the T5 encoder from Hugging Face and place them under models/:
models/
├─ vae/
├─ t5-encoder/
├─ diffusion_pytorch_model-00001-of-00002.safetensors
└─ diffusion_pytorch_model-00002-of-00002.safetensors
Step 3 – Create generation config
prompt: "A white puppy chasing butterflies in a park, bright Disney style"
duration: 63 # seconds
resolution: "1080p"
style: "disney"
batch_size: 1
device: "cuda"
Step 4 – Run generation
python sample.py --config configs/sample_63s.yaml
Step 5 – Inspect output
The resulting MP4 file is saved in outputs/ and can be played directly or imported into editing software.
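Because a full run is slow, it can be worth sanity‑checking a Step 3 config before launching Step 4. The following is a minimal sketch, not part of the repository: check_config is a hypothetical helper, the config is shown as a plain dict rather than parsed YAML, and the limits are simply the ranges quoted under Key Technical Features.

```python
# Hypothetical pre-flight check; the repository may validate configs differently.
MIN_SECONDS, MAX_SECONDS = 3, 63   # length range quoted in the article
MAX_HEIGHT = 1080                  # "up to 1080p" as quoted in the article


def check_config(cfg):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []

    duration = cfg.get("duration")
    if not isinstance(duration, int) or not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration must be an integer from {MIN_SECONDS} to {MAX_SECONDS} seconds")

    resolution = str(cfg.get("resolution", ""))
    height = resolution[:-1]  # "1080p" -> "1080"
    if not (resolution.endswith("p") and height.isdigit() and int(height) <= MAX_HEIGHT):
        problems.append(f"resolution must look like '720p' and not exceed {MAX_HEIGHT}p")

    if not str(cfg.get("prompt", "")).strip():
        problems.append("prompt must be a non-empty string")

    return problems


if __name__ == "__main__":
    sample = {
        "prompt": "A white puppy chasing butterflies in a park, bright Disney style",
        "duration": 63,
        "resolution": "1080p",
        "style": "disney",
    }
    print(check_config(sample) or "config looks sane")
```

The style field is deliberately not validated here, since the article does not list the built‑in template names.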
Performance Notes
Peak GPU memory ≈22 GB for 63‑second generation on RTX 4090.
Low‑memory mode can be enabled with --low_mem true, reducing memory to ~16 GB at the cost of slower inference.
Generation time for a 63‑second clip on RTX 4090 is roughly 10 minutes (depends on prompt complexity).
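The memory figures above suggest a simple rule for deciding whether to pass --low_mem true. The sketch below only encodes the numbers reported in this article; pick_mode and the 1 GB headroom are assumptions made for illustration, and actual peak usage on a given machine may differ.

```python
# Illustrative rule of thumb built from the figures reported in the article.
DEFAULT_PEAK_GB = 22   # reported peak for 63-second generation
LOW_MEM_PEAK_GB = 16   # reported peak with --low_mem true


def pick_mode(free_vram_gb, headroom_gb=1.0):
    """Return "default", "low_mem", or None if neither reported peak fits.

    headroom_gb is an assumed safety margin, not a value from the project.
    """
    if free_vram_gb >= DEFAULT_PEAK_GB + headroom_gb:
        return "default"
    if free_vram_gb >= LOW_MEM_PEAK_GB + headroom_gb:
        return "low_mem"
    return None


if __name__ == "__main__":
    for free in (24, 20, 12):
        print(f"{free} GB free -> {pick_mode(free)}")
```

A card with 24 GB free clears the default threshold, while one with 20 GB free would fall back to low‑memory mode and its slower inference.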
Repository
https://github.com/test-time-training/ttt-video-dit
Old Meng AI Explorer
Tracking global AI developments 24/7, focusing on large model iterations, commercial applications, and tech ethics. We break down hardcore technology into plain language, providing fresh news, in-depth analysis, and practical insights for professionals and enthusiasts.
