
Architects' Tech Alliance
Feb 24, 2025 · Artificial Intelligence

NSA: Hardware‑Optimized Sparse Attention Mechanism from DeepSeek, Peking University and University of Washington

NSA (Native Sparse Attention) introduces a hardware-optimized sparse attention architecture built from three branches (token compression, token selection, and a sliding window), combined through learnable gating to balance global and local context. The result is substantially faster, more efficient inference for long-context large language models.
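To make the gating idea concrete, here is a minimal sketch, assuming PyTorch, of the mixing step alone: the three branch outputs are taken as given inputs, and per-token sigmoid gates weight each branch before summing. The class and argument names (`attn_cmp`, `attn_sel`, `attn_win`) are illustrative stand-ins, not the paper's actual kernels or API.

```python
import torch
import torch.nn as nn

class GatedThreeBranchMix(nn.Module):
    """Illustrative NSA-style gated combination of three attention branches.
    A sketch of the mixing step only; branch computation is not shown."""

    def __init__(self, d_model: int):
        super().__init__()
        # One gate score per branch, predicted from the query representation.
        self.gate = nn.Linear(d_model, 3)

    def forward(self, q, attn_cmp, attn_sel, attn_win):
        # q: (batch, seq, d_model) query-side features driving the gates.
        # attn_*: (batch, seq, d_model) outputs of the compression,
        # selection, and sliding-window branches, respectively.
        g = torch.sigmoid(self.gate(q))                        # (B, S, 3)
        branches = torch.stack([attn_cmp, attn_sel, attn_win], dim=-1)
        # Broadcast gates over the feature dimension and sum the branches.
        return (branches * g.unsqueeze(-2)).sum(dim=-1)        # (B, S, d_model)
```

Sigmoid gates (rather than a softmax over branches) let the model weight each branch independently per token, so global and local context can both contribute at full strength when useful.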

AI architecture · DeepSeek · Large Language Models
5 min read