Tagged articles

777 articles

May 31, 2026 · Artificial Intelligence

Qwen3.6-35B-A3B NVFP4: A Stable, Highly Compressed Quantized Model

NVIDIA's NVFP4 quantization shrinks Qwen3.6-35B-A3B's memory footprint threefold with almost no accuracy loss, offers plug‑and‑play deployment via vLLM, and outperforms other 4‑bit formats on Hopper/Blackwell GPUs, making it a practical choice for production AI workloads.

MoE, NVFP4, Qwen3.6-35B-A3B

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

May 30, 2026 · Artificial Intelligence

Breaking the Agent Training Bottleneck: Open‑Source ClawGym Data, Training, and Evaluation Pipeline

ClawGym provides a complete open‑source framework for Claw‑style personal agents, linking a 13.5K synthetic task dataset, black‑box rollout training, sandbox‑parallel reinforcement learning, and a rigorously verified benchmark of 200 tasks, and demonstrates that synthetic data can lift a 30B model beyond a 235B baseline.

ClawGym, OpenClaw, agent training

0 likes · 16 min read

SuanNi

May 30, 2026 · Artificial Intelligence

Step 3.7 Flash: High‑Efficiency Pro‑Level Agent Model with 400 TPS and Low Cost

Step 3.7 Flash is a 196B‑parameter, 11B‑activation multimodal agent model that delivers 400 TPS inference, superior code‑generation and cross‑framework stability, cost‑effective Advisor Mode, and strong vision and search performance, with extensive benchmark gains over its predecessor and competing models.

AI agent, Advisor Mode, Multimodal

0 likes · 12 min read

Machine Heart

May 30, 2026 · Artificial Intelligence

Can MIT’s Attention Matching Cut LLM Memory 50× Without Accuracy Loss?

MIT researchers introduce Attention Matching, a latent‑space KV‑cache compaction technique that reduces large‑language‑model memory usage up to 50‑fold with negligible precision loss, outperforming token‑pruning, summarization, and prior compaction methods across benchmarks like QuALITY, LongHealth, and AIME‑2025.

Attention Matching, KV Cache, LLM

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

May 29, 2026 · Artificial Intelligence

Claude Opus 4.8 Surpasses Mythos in Key Tasks and Enables Hundreds of Parallel Agents

Claude Opus 4.8, released just 43 days after 4.7, improves honesty, cuts code‑defect miss rates to a quarter, reduces over‑confident answers, outperforms Mythos on several benchmarks, and introduces Dynamic Workflows that let hundreds of sub‑agents run in parallel for complex tasks.

AI model, Claude Opus 4.8, benchmark

0 likes · 8 min read

SuanNi

May 29, 2026 · Artificial Intelligence

SenseNova-U1-8B-MoT-Infographic: Academic Charts, Posters, Recipes

The SenseNova-U1-8B-MoT-Infographic model dramatically improves AI‑generated infographics by enhancing dense‑text rendering, layout stability, and chart accuracy through targeted data, extended mid‑training, and reinforcement‑learning fine‑tuning, achieving top scores on BizGenEval and IGenBench and surpassing many commercial rivals.

AI model, Multimodal, SenseNova

0 likes · 9 min read

Machine Heart

May 29, 2026 · Artificial Intelligence

Why Vendors Bet on Step 3.7 Flash: An Agent‑Optimized Model for High‑Cost AI

Step 3.7 Flash is an open‑source, sparse‑MoE flash model built for real‑world Agent workflows, offering 11B active parameters, 400 TPS, 256K context, multimodal perception and tool use, and achieves top‑tier scores on benchmarks such as ClawEval‑1.1, Toolathlon and SimpleVQA, while dramatically reducing the token costs that have plagued large‑scale AI deployments.

Agent, Cost, Flash

0 likes · 10 min read

Machine Heart

May 28, 2026 · Artificial Intelligence

Can a Pre‑trained Embodied Model Work Out‑of‑the‑Box? New Chinese Open‑Source VLA Model Shows Yes

The newly open‑sourced Wall‑OSS‑0.5 VLA model demonstrates that a large‑scale pre‑trained embodied robot brain can achieve strong zero‑shot performance on 17 real‑world tasks, exhibit staircase emergence with longer pre‑training, and far surpass the industry baseline after fine‑tuning, while also revealing current precision limits.

Embodied AI, VLA, benchmark

0 likes · 15 min read

Machine Learning Algorithms & Natural Language Processing

May 28, 2026 · Artificial Intelligence

Open‑Source 35B Intern‑S2‑Preview Rivals Trillion‑Parameter Models on Scientific Benchmarks

The open‑source 35‑billion‑parameter Intern‑S2‑Preview model achieves scientific‑task performance comparable to trillion‑parameter models, thanks to end‑to‑end “general‑specialized” training, reinforcement‑learning scaling, and hardware‑aware optimizations, and it outperforms leading closed‑source models on benchmarks such as MolecularIQ and crystal‑structure generation.

InternLM, Large Language Model, Open Source

0 likes · 11 min read

Architects' Tech Alliance

May 27, 2026 · Industry Insights

Nvidia Vera CPU Smashes Intel and AMD x86 Titans in AI Workloads

Nvidia's Vera, an 88‑core custom ARM CPU designed for AI agents, delivers up to 55% higher overall performance than Intel Xeon 6980P, 10% over AMD EPYC 9575F and 63% over Nvidia Grace, while offering 1.2 TB/s LPDDR5X bandwidth, 500 W power envelope and a single‑chip design that could reshape the server CPU market.

AI server, ARM CPU, LPDDR5X

0 likes · 10 min read

ShiZhen AI

May 27, 2026 · Artificial Intelligence

Turning Click‑Based Web Agents into Repeatable Scripts with Microsoft’s Open‑Source Webwright

Microsoft’s open‑source Webwright framework redefines browser agents by replacing step‑by‑step click actions with generated Playwright scripts, enabling repeatable, debuggable web tasks; the article details its architecture, workflow, benchmark results on Online‑Mind2Web and Odysseys, and discusses practical benefits and limitations.

GPT-5.4, LLM agents, Microsoft

0 likes · 9 min read

Machine Heart

May 27, 2026 · Artificial Intelligence

RoboMemArena: A Comprehensive Benchmark that Truly Tests Robot Memory for Embodied AI

RoboMemArena introduces a systematic, long‑horizon robot memory benchmark with 26 tasks, 151 sub‑tasks, multimodal annotations, and real‑robot evaluations, exposing the limitations of existing benchmarks and demonstrating that the dual‑system PrediMem model markedly outperforms baselines both in simulation and on physical robots.

Embodied AI, PrediMem, RoboMemArena

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

May 26, 2026 · Artificial Intelligence

Terminal-World: Large-Scale Environment Synthesis for Terminal Agents

The paper presents Terminal-World, an automated pipeline that uses Agent Skills to generate diverse terminal‑agent training data, builds over 5,700 environments, and trains models that outperform existing baselines on multiple benchmarks despite using far less data.

Agent Skills, Terminal-World, benchmark

0 likes · 4 min read

SuanNi

May 26, 2026 · Artificial Intelligence

Why Tokens Are Burning Out and a Free Claude Opus 4.6‑Level Model Is Coming

The SkyClaw‑v1.0 model from Skywork AI offers a free, soon‑to‑be open‑source large‑language model for agent applications that matches Claude Opus 4.6 in performance while cutting token costs dramatically, and the article details its benchmarks, training pipeline, and deployment recommendations.

Agent, Large Language Model, OpenAI API

0 likes · 7 min read

SuanNi

May 26, 2026 · Artificial Intelligence

MiniCPM5-1B Sets New Benchmark for Sub‑2B Models – AI‑Trained, 10% Cheaper Than Nvidia

The 1‑billion‑parameter MiniCPM5-1B model tops the AA leaderboard with a 17.9 score, outperforms 2‑billion‑parameter rivals, uses an AI‑generated training framework that cuts cost by 10%, and runs on virtually any device thanks to aggressive quantisation and open‑source tooling.

AI model, Edge AI, ForgeTrain

0 likes · 9 min read

Machine Heart

May 26, 2026 · Artificial Intelligence

What Agent Harness Do AI Phones Like OpenAI’s AI Phone and Gemini on Android Really Need?

PhoneHarness, a mixed‑action orchestration framework and benchmark from Tencent Hunyuan and academic partners, argues that AI‑powered smartphones must go beyond GUI clicks, integrating CLI, GUI, and host tools while providing verifiable evidence of task completion, reshaping agents from screen‑talkers to true mobile assistants.

AI Phone, Android, PhoneHarness

0 likes · 11 min read

Tencent Technical Engineering

May 26, 2026 · Information Security

AI Era Vulnerability Benchmark Revamp: 3,632 CVE Insights & VulnGym Release

Analyzing 3,632 high‑severity GitHub Advisory reports from 2025‑2026, the authors reveal a sharp rise in business‑logic flaws—especially in high‑star projects—prompting a redesign of vulnerability‑detection benchmarks, and introduce VulnGym, a real‑project, white‑box dataset with 400+ paths and detailed entry‑point, trace, and critical‑operation annotations.

AI security, Business Logic Bugs, Open Source

0 likes · 17 min read

Data Party THU

May 26, 2026 · Artificial Intelligence

Stanford’s LLM-as-a-Verifier Beats Claude Mythos and GPT‑5.5 on Agent Benchmarks

Stanford, Berkeley and Nvidia researchers introduce LLM-as-a-Verifier, a universal verification framework that enhances agent performance, safety and stability on long‑horizon tasks, and outperforms Claude Mythos and GPT‑5.5 on the Terminal‑Bench and SWE‑Bench benchmarks.

AI agents, Agent verification, LLM-as-a-Verifier

0 likes · 7 min read

Architect's Guide

May 26, 2026 · Backend Development

How Much Memory Do 1 Million Concurrent Tasks Consume in Different Languages?

This article benchmarks the peak memory usage of one, ten thousand, one hundred thousand, and one million concurrent tasks across Rust, Go, Java, C#, Node.js, Python, and Elixir, revealing surprising differences in runtime memory footprints and scalability.

Async, C, Elixir

0 likes · 14 min read

SuanNi

May 24, 2026 · Artificial Intelligence

Meituan’s Open‑Source Digital Human Model Delivers Real‑World Performance Across MV, E‑Commerce, and More

Meituan’s LongCat‑Video‑Avatar 1.5 replaces its audio encoder with Whisper‑Large, cuts inference to eight steps, and, after a 770‑person, 13,240‑rating evaluation, outperforms competing models in lip‑sync, style generalization, multi‑person scenes, and overall visual fidelity.

AI, LongCat-Video-Avatar, Video Generation

0 likes · 7 min read

IT Services Circle

May 24, 2026 · Artificial Intelligence

2026 AI Coding Agent Benchmark: Cursor, Claude Code, and Codex – Who Leads?

A comprehensive 2026 benchmark evaluates major AI coding agents—Cursor CLI, Claude Code, OpenAI Codex, and Google Gemini—across performance, token consumption, cost per task, and execution time, revealing a tight top‑three score margin and highlighting cost‑efficiency and latency as the new competitive frontiers.

AI coding agents, Claude Code, Cost

0 likes · 6 min read

Open Source Tech Hub

May 24, 2026 · Backend Development

FastJSON: A Drop‑In PHP 8.3+ JSON Extension Up to 6× Faster Than ext/json

FastJSON is a high‑performance PHP 8.3+ JSON extension that serves as a drop‑in replacement for ext/json, offering namespaced fastjson_* APIs, full compatibility with json_last_error, and delivering up to six‑fold speed gains in encoding, decoding, and validation while detailing installation steps, supported flags, memory trade‑offs, and benchmark results.

FastJSON, JSON, PHP

0 likes · 7 min read

AI Architecture Path

May 24, 2026 · Artificial Intelligence

How agentmemory Fixes Claude Code Forgetting and Slashes Token Usage by 92%

The article explains how the open‑source agentmemory system solves common AI‑coding assistant pain points—session forgetfulness, repetitive context feeding, and high token costs—by providing automatic, cross‑tool persistent memory, hybrid retrieval, and a zero‑dependency deployment that reduces token consumption by 92% while offering detailed benchmarks and configuration guides.

AI agent, MCP, agentmemory

0 likes · 15 min read

SuanNi

May 22, 2026 · Artificial Intelligence

Why Qwen3.7-Max Is Sending Overseas Developers Into a Frenzy

Qwen3.7-Max demonstrates product‑level long‑task autonomy with 35 hours of uninterrupted operation, 1,158 tool calls, and kernel‑level optimizations, while outperforming Gemini 3.5‑Flash, Claude Opus, and GPT‑5.5 across a wide range of benchmarks, cost‑effectiveness, and real‑world agent scenarios.

AI, Agent, Kernel Optimization

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

May 22, 2026 · Artificial Intelligence

ESI‑Bench: The ImageNet‑Style Benchmark for Embodied Spatial Intelligence

ESI‑Bench, introduced by Fei‑Fei Li's team, transforms the observer into an active agent to evaluate embodied spatial intelligence across 10 task categories and 3,081 instances, revealing that perception is not the bottleneck, action strategies are critical, imperfect 3D reconstructions can hurt performance, and current models suffer from action blindness and metacognitive deficits compared with humans.

Embodied AI, action blindness, benchmark

0 likes · 11 min read

Data Party THU

May 22, 2026 · Artificial Intelligence

First Survey of Agent Harnesses: What Powers Agents Beyond the Model?

The article surveys recent research on Agent Harness engineering, showing that real‑world agent instability stems from system‑level factors beyond model capability, introduces the seven‑layer ETCLOVG architecture, presents benchmark gains from harness tweaks, maps open‑source projects to the framework, and outlines five key open research directions.

AI, Agent Harness, ETCLOVG

0 likes · 12 min read

Meituan Technology Team

May 22, 2026 · Artificial Intelligence

From High-Fidelity to Real-World Use: LongCat Video Avatar 1.5 Open‑Source Release

LongCat Video Avatar 1.5 is now open‑source, delivering commercial‑grade lip sync, physical realism, long‑video stability, multi‑person interaction and 15× faster inference through Whisper‑large audio encoding, DMD 8‑step distillation and LoRA adapters, and it outperforms leading closed‑source models in extensive human‑rated benchmarks.

AI, LongCat-Video-Avatar, Video Generation

0 likes · 9 min read

SuanNi

May 20, 2026 · Artificial Intelligence

Why Harness Is the Future of AI Agents: Insights from CMU, Yale, and Amazon

The article argues that an AI agent’s performance now hinges on its surrounding Harness rather than the model itself, presenting the ETCLOVG seven‑layer architecture, benchmark gains up to ten‑fold, and a roadmap of engineering stages evolving from prompt to context to harness design.

AI agents, Context Management, ETCLOVG

0 likes · 13 min read

IT Services Circle

May 20, 2026 · Artificial Intelligence

Google I/O 2026 Unveils Gemini Omni and Gemini 3.5 Flash – A Leap in Multimodal AI

At Google I/O 2026 the company introduced Gemini Omni, a truly multimodal model that can ingest any combination of text, image, audio or video and generate high‑quality content, and Gemini 3.5 Flash, which outperforms Gemini 3.1 Pro across major benchmarks while delivering four‑times faster token throughput, alongside the new Antigravity 2.0 agent platform and the Gemini Spark personal AI assistant.

AI Generation, Agent Platform, Gemini

0 likes · 13 min read

Machine Heart

May 20, 2026 · Artificial Intelligence

Qwen3.7-Max Sets New Agent Benchmarks – China’s New Model King

Alibaba’s Qwen3.7‑Max model tops multiple Arena leaderboards, achieves SOTA scores in programming, reasoning, and multilingual benchmarks, runs a 35‑hour autonomous coding task on a custom AI chip with 10× speedup, and demonstrates end‑to‑end desktop app creation and web‑search agents, illustrating a rapid monthly model‑iteration strategy.

AI Chip, Agent, Alibaba

0 likes · 13 min read

Java Backend Technology

May 20, 2026 · Artificial Intelligence

Claude Code vs Codex: 10× Cost, 4× Speed – A Deep Comparative Review

The article provides a data‑driven comparison between Anthropic's Claude Code and OpenAI's Codex, covering benchmark scores (SWE‑bench, Terminal‑Bench), blind‑test code‑quality results, token consumption, real‑world cost scenarios, ecosystem integration (MCP), and community feedback to help teams choose the right AI coding agent for their workflow.

AI coding agents, Claude Code, Codex

0 likes · 14 min read

AI Insight Log

May 19, 2026 · Artificial Intelligence

Gemini 3.5 Flash Launches with 4× Speed, Beats Gemini 3.1 Pro in Coding Benchmarks

Google unveiled Gemini 3.5 Flash at I/O 2026, claiming roughly four times faster token output than comparable frontier models, half the price, and benchmark results that surpass its own Gemini 3.1 Pro in coding, agent, and multimodal tasks, while noting trade‑offs in deep reasoning and long‑context performance.

AI, Agent, Antigravity

0 likes · 12 min read

SuanNi

May 19, 2026 · Artificial Intelligence

Is Google Search Obsolete? How AnySearch Builds AI‑Era Search Infrastructure

AnySearch launches a unified API that aggregates 22 professional data sources for AI agents, using intent classification and RRF fusion to cut token usage by up to 70% and to improve on Parallel and Brave in both accuracy and latency, while offering architecture‑level privacy protections.

AI Search, RRF, benchmark

0 likes · 9 min read
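
The "RRF fusion" named in the AnySearch summary presumably refers to reciprocal rank fusion, a standard way to merge ranked lists from several sources. The sketch below is a generic illustration of that technique, not AnySearch's implementation: the function name `rrf_fuse`, the smoothing constant `k=60` (a commonly used default), and the sample source rankings are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    where rank starts at 1. `k` is a hypothetical default, not a value taken
    from the article.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings returned by three different data sources.
source_rankings = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "d"],
]
fused = rrf_fuse(source_rankings)
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization across sources, which is one reason it is a common choice when the underlying sources return incomparable relevance scores.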

PaperAgent

May 19, 2026 · Artificial Intelligence

Why Long-Term Memory Needs Vision: How MemEye Evaluates Multimodal Agent Recall

MemEye is a multimodal memory benchmark that tests agents across eight real‑world scenarios, measuring visual evidence granularity and reasoning depth, and reveals that captions fall short for fine‑grained visual recall, highlighting the need for true visual memory in long‑term AI agents.

AI agents, MemEye, benchmark

0 likes · 4 min read

Machine Heart

May 19, 2026 · Artificial Intelligence

HyperEyes: Parallel Multimodal Search Agents Move from Deep to Wide for Efficiency

HyperEyes introduces a unified‑location‑as‑search (UGS) action space, parallel data synthesis, and a dual‑granularity efficiency‑aware RL framework that enable multimodal agents to perform simultaneous multi‑target retrieval, dramatically reducing interaction rounds while improving accuracy and cost‑efficiency across benchmark evaluations.

Agent, Efficiency, benchmark

0 likes · 9 min read

AI Insight Log

May 19, 2026 · Artificial Intelligence

Cursor Returns with Composer 2.5: Openly Built on Kimi, 10× Lower Cost, Musk Endorses

Cursor unveiled Composer 2.5, reporting benchmark scores comparable to Opus 4.7 and GPT‑5.5, a ten‑fold cost reduction, explicit use of Moonshot’s Kimi K2.5 as a base, new RL training techniques, and a partnership with SpaceXAI that multiplies compute power, all highlighted by Elon Musk’s retweet.

AI model, Composer 2.5, Cursor

0 likes · 10 min read

Big Data Technology & Architecture

May 19, 2026 · Artificial Intelligence

Why Pure AI Black‑Box Text2SQL Fails in Enterprise Deployments

The article analyzes the inherent shortcomings of black‑box Text2SQL solutions—highlighting benchmark collapses, lack of auditability, and unacceptable error rates—and proposes a white‑box approach with a human‑readable intermediate language that enables deterministic, enterprise‑grade SQL generation.

NLQ, SQL, Text2SQL

0 likes · 13 min read

Machine Heart

May 18, 2026 · Artificial Intelligence

JiuwenSwarm Launches Coordination Engineering for the ‘Beekeeping’ Era of AI Agents

openJiuwen’s open‑source JiuwenSwarm implements Coordination Engineering—a full‑stack system comprising Agent Swarm, Swarm Skills, a Skills Hub and self‑evolution—enabling autonomous multi‑agent collaboration, demonstrated by medical, coding, video and game case studies and achieving a 94.2% PinchBench score with 34.8% token savings.

AI agents, Coordination Engineering, JiuwenSwarm

0 likes · 13 min read

AIWalker

May 17, 2026 · Artificial Intelligence

From Image Captioning to Detective‑Style Perception: Pixel‑Searcher Beats Closed‑Source Models

Pixel‑Searcher introduces an agentic search‑driven visual perception framework that integrates web‑based evidence with pixel‑level grounding, and the new WebEyes benchmark demonstrates its superiority over existing open‑ and closed‑source multimodal models across localization, segmentation, and VQA tasks.

Multimodal, Pixel-Searcher, WebEyes

0 likes · 16 min read

Machine Heart

May 16, 2026 · Artificial Intelligence

Why Robots Need World Models: A Joint Survey from Leading Institutions

This article surveys recent advances in robot world models, explaining why predictive models are essential for embodied intelligence, how they integrate with Vision‑Language‑Action systems, the various architectural approaches, benchmark trends, and the remaining challenges for reliable deployment.

Simulation, Survey, Vision-Language-Action

0 likes · 14 min read

Data Party THU

May 16, 2026 · Artificial Intelligence

SubQ Beats Transformers: 12‑Million‑Token Context Model at Only 5% of Opus Cost

The article analyzes SubQ, a new LLM architecture using Subquadratic Sparse Attention (SSA) to achieve a 12‑million‑token context window with linear compute scaling, delivering up to 52× speedup and costing just 5% of Opus while matching dense‑attention performance on long‑context benchmarks.

SSA, Sparse Attention, SubQ

0 likes · 14 min read

Machine Heart

May 16, 2026 · Artificial Intelligence

Embodied AI Breakthrough: Beijing Humanoid’s Pelican‑Unify 1.0 Tops WorldArena and Wins Dual Crown

The article details how Beijing Humanoid’s Pelican‑Unify 1.0 model achieved top scores on WorldArena—including a 66.03 overall rating and 98.12% 3D accuracy—by unifying perception, reasoning, imagination and action in a single latent space, marking a milestone for model‑based end‑to‑end embodied intelligence.

Embodied AI, Multimodal Learning, Pelican-Unify

0 likes · 17 min read

AI Engineering

May 16, 2026 · Backend Development

Cut 92% of Claude Code Tool Calls for Large Codebases with CodeGraph

CodeGraph builds a semantic knowledge graph of a codebase so Claude Code can query the graph instead of scanning files, reducing tool calls by an average of 92% and speeding up exploration by 71% across multiple large, multi‑language projects.

AI code assistance, Claude Code, CodeGraph

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

May 15, 2026 · Artificial Intelligence

ClawMark: A Living‑World Benchmark for Multi‑Turn, Multi‑Day, Multimodal Coworker Agents

The ClawMark benchmark introduces 100 multi‑turn, multi‑day tasks across 13 professional scenarios and five stateful sandbox services, evaluating seven cutting‑edge agent systems with a top weighted score of 75.8 but only a 20% strict success rate, highlighting the difficulty of end‑to‑end collaborative agent performance.

LLM, agent performance, benchmark

0 likes · 4 min read

PaperAgent

May 15, 2026 · Artificial Intelligence

How a 0.6B Model Beats GPT‑5.2 at Agent Privacy – Introducing MemPrivacy

The article analyzes the long‑standing privacy dilemma of cloud‑based agents, presents MemPrivacy’s three‑stage de‑identification framework and four‑level privacy taxonomy, details its two‑phase training with the MemPrivacy‑Bench dataset, and shows benchmark results where a 0.6B model outperforms GPT‑5.2 while keeping latency under 0.5 seconds.

Agent, MemPrivacy, benchmark

0 likes · 11 min read

Machine Heart

May 15, 2026 · Artificial Intelligence

When AI Knows Too Much: How MemPrivacy Secures Agent Memory

MemPrivacy introduces a reversible, fine‑grained privacy layer for edge‑cloud agents, outperforming OpenAI's privacy‑filter by over 50% F1 while keeping system utility loss under 2%, thus enabling agents to remain useful without exposing raw sensitive data.

AI, Agent Memory, F1

0 likes · 16 min read

Machine Heart

May 14, 2026 · Artificial Intelligence

How SenseNova U1’s Native Unified Architecture Lets a Small Model Beat Larger Ones

SenseNova U1 introduces the NEO‑Unify native unified architecture that eliminates separate vision encoders and VAEs, enabling simultaneous multimodal understanding, reasoning, and generation, and achieves state‑of‑the‑art benchmark scores that surpass larger proprietary models across vision‑language, reasoning, and generation tasks.

Model architecture, NEO-Unify, Open Source

0 likes · 19 min read

Xiaomi Tech

May 13, 2026 · Artificial Intelligence

Xiaomi OneVL: A Breakthrough Open‑Source Model for Fast, Accurate Autonomous Driving

Xiaomi unveils OneVL, an open‑source stepwise latent language‑vision reasoning framework that unifies VLA, world‑model and latent inference, delivering higher accuracy than explicit CoT and inference speed comparable to answer‑only models, with SOTA benchmark results across multiple autonomous‑driving tests.

Autonomous Driving, OneVL, Open Source

0 likes · 8 min read

SuanNi

May 13, 2026 · Artificial Intelligence

How MiniCPM-V 4.6 Achieves Lightning‑Fast Multimodal AI on Smartphones (Open‑Source)

MiniCPM-V 4.6 combines a SigLIP2 visual encoder with a Qwen3.5 LLM, cuts FLOPs by over 50%, lowers token cost up to 43×, scores 13 on the Artificial Analysis Intelligence Index, and runs with 75 ms first‑token latency on 3136×3136 images across iOS, Android and HarmonyOS, all with fully open‑source code and extensive quantization support.

MiniCPM-V, Open Source, benchmark

0 likes · 6 min read

AI Engineering

May 13, 2026 · Artificial Intelligence

First End‑to‑End Voice Agent Benchmark Shows Grok Leads with 52% Real‑World Success Rate

Artificial Analysis released the τ‑Voice benchmark, testing speech‑to‑speech agents across 278 real‑world customer‑service scenarios, and found the top‑performing Grok Voice Think Fast 1.0 achieves only a 52.1% task‑completion rate while average dialogue lengths stay under seven minutes.

Grok Voice, benchmark, speech-to-speech

0 likes · 7 min read

Bighead's Algorithm Notes

May 11, 2026 · Artificial Intelligence

Analyzing CN‑Buzz2Portfolio: A Chinese Market Dataset for LLM‑Driven Macro and Sector Asset Allocation

This article reviews the CN‑Buzz2Portfolio benchmark, which maps daily Chinese hot‑news streams to macro‑ and industry‑level ETF allocations, introduces a three‑stage CPA pipeline for evaluating large language models as autonomous financial agents, and reports extensive experiments on nine state‑of‑the‑art LLMs across two rolling market periods.

CN-Buzz2Portfolio, CPA framework, LLM

0 likes · 18 min read

Machine Heart

May 11, 2026 · Artificial Intelligence

Why Visual Perception Limits STEM Large Models and How CodePercept Breaks the Barrier

The authors demonstrate that visual perception, not reasoning, is the primary bottleneck for STEM multimodal large language models, introduce the CodePercept paradigm and the ICC-1M dataset, and show that code‑driven perception dramatically improves performance, surpassing much larger models on new benchmarks.

CVPR2026, CodePercept, STEM

0 likes · 9 min read

Geek Labs

May 11, 2026 · Artificial Intelligence

Train a 64M LLM from Scratch in 2 Hours for $3 and Master LLM Systems

This article introduces two open‑source projects—MiniMind, which lets you train a 64M‑parameter LLM in about two hours for under $3, and Happy‑LLM, a systematic tutorial that explains LLM theory and practice—detailing their features, training pipelines, benchmarks, data, and how they complement each other for comprehensive LLM learning.

AI, Happy-LLM, LLM

0 likes · 7 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

AI Code‑Generation Benchmarks Show Zero Pass Rate for GPT, Claude, and Gemini

A new benchmark called ProgramBench challenges top‑tier LLMs to rebuild 200 real‑world software projects from scratch, revealing that GPT‑5.4, Claude Opus, and Gemini all achieve a 0% full‑pass score while exposing design flaws, language‑choice biases, and rampant cheating when network access is allowed.

AI Code GenerationProgramBenchbenchmark

0 likes · 11 min read

Machine Heart

May 9, 2026 · Artificial Intelligence

BARD-VL Achieves New SOTA for Multimodal Diffusion Models via Autoregressive‑Diffusion Bridge

The BARD-VL framework bridges pretrained autoregressive vision‑language models to diffusion‑based VLMs, preserving or surpassing original performance while boosting decoding throughput up to three times, through progressive block merging, stage‑wise diffusion distillation, and engineering optimizations validated on multiple benchmarks.

BARD-VL · Efficiency · Multimodal

0 likes · 9 min read

Architects' Tech Alliance

May 7, 2026 · Artificial Intelligence

Huawei Ascend AI Chip Detailed Specs Comparison (2025‑2028 Roadmap)

The article analyzes Huawei's Ascend AI chip evolution from the 910C baseline through the 950 series' low‑precision FP8/FP4 breakthrough to the 960/970 generation’s 8 PFLOPS performance, highlighting architectural innovations, memory and interconnect upgrades, scenario‑specific models, and a cost advantage over competing solutions.

AI Chip · Ascend · FP8

0 likes · 6 min read

Machine Heart

May 7, 2026 · Artificial Intelligence

How TACO Lets CLI Agents Self‑Evolve to Drop Useless Context

TACO is a plug‑and‑play, training‑free framework that lets terminal‑based autonomous agents automatically learn compression rules to filter low‑value output while preserving critical decision cues, achieving higher task success rates and better token efficiency across multiple terminal‑related benchmarks.

Context Compression · LLM · Self‑Evolving Rules

0 likes · 14 min read

Bighead's Algorithm Notes

May 6, 2026 · Artificial Intelligence

AI‑Trader: Real‑time Benchmark for Autonomous LLM Agents in Financial Markets

The AI‑Trader benchmark evaluates large language model agents in fully autonomous, real‑time US stock, Chinese A‑share, and cryptocurrency markets, revealing that general intelligence alone does not guarantee profitable trading, while robust risk‑control mechanisms drive cross‑market stability and excess returns.

LLM · Risk Management · autonomous agents

0 likes · 17 min read

Data Party THU

May 6, 2026 · Artificial Intelligence

When AI Seems Obedient, Hidden Alignment Risks Surface

The AutoControl Arena framework offers a high‑fidelity, low‑cost automated safety evaluation for frontier AI agents, exposing a dramatic rise in alignment‑illusion risk—from 21.7% under low pressure to 54.5% under high pressure—through a logic‑narrative decoupling design, a 70‑scenario benchmark, and validation against real‑world red‑team environments.

AI safety · AutoControl Arena · alignment illusion

0 likes · 9 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

Luma’s Uni‑1.1 API Launch: Third‑Place Ranking and Text Rendering Near GPT‑Image 2

Luma released the Uni‑1.1 image‑generation API, which ranks third on the Arena blind‑test leaderboard, offers sub‑half‑price per image, and demonstrates production‑grade capabilities such as multi‑reference fusion, multi‑turn editing, and a decoder‑only transformer that jointly models text and image tokens.

API pricing · Luma · benchmark

0 likes · 13 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels

PromptEcho computes a continuous reward for text‑to‑image generation by measuring how well a frozen vision‑language model can reconstruct the original prompt from the generated image, eliminating the need for annotated data or a trained reward model and outperforming prior methods across multiple benchmarks.

PromptEcho · Reward Modeling · benchmark

0 likes · 10 min read

Old Zhang's AI Learning

May 5, 2026 · Artificial Intelligence

Claude Enters Finance: 10 Open‑Source Financial Agent Templates Unveiled

Anthropic released ten ready‑to‑use financial Agent templates that bundle skills, data connectors and sub‑agents, can run natively in Excel, PowerPoint, Word and Outlook, are open‑sourced on GitHub, support two deployment modes, score 64.37% on the Vals AI finance benchmark, and integrate dozens of market data sources, while offering both strengths and notable limitations.

Agent Templates · Claude · Data Connectors

0 likes · 14 min read

PaperAgent

May 4, 2026 · Artificial Intelligence

Why Claude 4.6 Scores Only 66%: Claw‑Eval‑Live Shows Terminal Skills Aren’t Enough

The article explains that modern AI agents must be judged on actual task execution and audit evidence, and Claw‑Eval‑Live reveals that while agents can use terminals, they still fail dramatically on cross‑system workflows such as HR, management, and operations, with no model surpassing a 70% pass rate.

AI agents · Claw-Eval · LLM

0 likes · 7 min read

Machine Heart

May 4, 2026 · Artificial Intelligence

Thought-Based Gloss-Free Sign Language Translation Model for the Deaf (ACL 2026)

The paper introduces SignThought, a gloss‑free sign language translation framework that uses a latent chain‑of‑thought reasoning layer and a plan‑then‑ground decoder, evaluates it on five benchmarks with state‑of‑the‑art BLEU‑4 and ROUGE scores, and releases a large new Hong Kong sign language dataset.

ACL 2026 · Gloss-Free · Latent Thoughts

0 likes · 11 min read

Old Zhang's AI Learning

May 4, 2026 · Artificial Intelligence

How DeepSeek’s New Paper Redefines Multimodal Reasoning with Visual Primitives

DeepSeek’s new paper "Thinking with Visual Primitives" tackles the reference gap in multimodal models by introducing points and boxes as reasoning units, achieving up to 8× token efficiency and leading benchmark scores in counting, spatial reasoning, and maze navigation compared with GPT‑5.4, Claude‑Sonnet‑4.6 and Gemini‑3‑Flash.

DeepSeek · Multimodal · Visual Primitives

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 3, 2026 · Artificial Intelligence

Do Large Language Models Wear Two Faces? New Study Reveals Alignment Illusion Under Pressure

A joint study from Fudan, Shanghai Chuangzhi, and Oxford introduces AutoControl Arena, a logical‑narrative decoupling framework that shows AI agents’ risk rates jump from 21.7% to 54.5% under high pressure and temptation, and provides an open‑source benchmark for systematic safety evaluation.

AI safety · AutoControl Arena · alignment illusion

0 likes · 9 min read

PaperAgent

May 2, 2026 · Artificial Intelligence

Can Harnesses Self‑Evolve? Fudan & Peking University’s Agentic Harness Engineering Breakthrough

The paper introduces Agentic Harness Engineering (AHE), showing that a 10‑round evolution improves Coding Agent pass@1 from 69.7% to 77.0% on Terminal‑Bench 2—outperforming Codex‑CLI—and that the evolved harness transfers zero‑shot to SWE‑bench and multiple model families, thanks to three observability pillars.

Ablation Study · Coding Agent · Harness Engineering

0 likes · 11 min read

Node.js Tech Stack

May 2, 2026 · Databases

Why Drizzle ORM on Bun Beats Go’s Latency – Even Evan You Uses It

Drizzle ORM v1.0.0‑rc.1 introduces JIT row mappers and Effect v4 integration, delivering a benchmark where Bun + Drizzle achieves 7.3 ms latency versus Go’s 18.1 ms, with higher CPU usage, and the article analyzes the feature changes, performance trade‑offs, and migration considerations.

Bun · Drizzle ORM · Go

0 likes · 10 min read

Machine Heart

May 1, 2026 · Artificial Intelligence

Can Large Language Models Truly Understand Your Daily Life? Introducing CL‑Bench Life

The new CL‑Bench Life benchmark evaluates how well large language models learn from fragmented, real‑world daily contexts, revealing that even top models solve only about 14‑22% of 405 tasks, with context misuse as the primary failure mode.

AI assistants · CL-Bench Life · benchmark

0 likes · 14 min read

Su San Talks Tech

May 1, 2026 · Artificial Intelligence

Xiaomi Unveils 1.02‑Trillion‑Parameter MiMo 2.5 Model – Token Grant Guide and Real‑World Benchmarks

Xiaomi has launched the MiMo 2.5 series, featuring a 1.02‑trillion‑parameter MoE model with 1 M‑token context, offers a token‑grant program for developers, and delivers benchmark scores that rival leading models such as DeepSeek‑V4‑Pro, Kimi K2, GPT‑5 and Gemini 3.0.

AI · Large Language Model · MiMo

0 likes · 9 min read

Old Meng AI Explorer

Apr 30, 2026 · Artificial Intelligence

How to Use Kimi K2.6 for Free: The Open‑Source Chinese LLM That Beats Top Models

The article provides a deep technical overview of Kimi K2.6—including its MoE architecture, benchmark superiority over GPT‑5.4 and Claude Opus, six free‑access channels, practical usage tips, and real‑world scenarios—so developers can evaluate and adopt the model without cost.

Agent Swarm · Free API · Kimi K2.6

0 likes · 13 min read

PaperAgent

Apr 30, 2026 · Artificial Intelligence

DeepSeek Unveils Open‑Source Multimodal Model: “Thinking with Visual Primitives”

DeepSeek releases an open‑source multimodal LLM that introduces a visual‑primitive framework—elevating bounding boxes and points to token level—to close the reference gap, achieve extreme KV‑cache compression, and outperform GPT‑5.4, Claude‑Sonnet‑4.6 and Gemini‑3‑Flash on counting, spatial reasoning, maze navigation and path‑tracing benchmarks.

DeepSeek · LLM · Multimodal

0 likes · 13 min read

ArcThink

Apr 29, 2026 · Artificial Intelligence

DeepSeek V4 Vision Mode: Architecture Breakdown and Benchmark vs Top Models

The article dissects DeepSeek V4's newly released vision mode, explains its mounted visual‑language architecture, compares its multimodal capabilities and costs against GPT‑5.5, Gemini 3 and Claude Opus 4.7, and outlines a roadmap from image understanding to native multimodal AI.

AI · DeepSeek · Multimodal

0 likes · 15 min read

SuanNi

Apr 29, 2026 · Artificial Intelligence

SenseNova U1: Open‑Source SOTA Multimodal Model Unifies Vision and Language

SenseNova U1, an open‑source multimodal model from SenseTime, replaces traditional visual encoders and VAEs with a native NEO‑unify architecture, delivering near‑lossless pixel‑level fidelity, a Mixture‑of‑Transformers backbone, and unified training objectives that achieve SOTA performance on diverse vision‑language benchmarks while running efficiently on multiple Chinese chips.

Multimodal · NEO-Unify · Open Source

0 likes · 9 min read

Lao Guo's Learning Space

Apr 29, 2026 · Artificial Intelligence

What’s Inside GPT‑6’s ‘Spud’ Release? 5‑6 Trillion Parameters and 2 M Token Context

OpenAI’s GPT‑6 ‘Spud’ launch packs 5‑6 trillion parameters with MoE sparsity, a unified Symphony multimodal architecture, dual System‑1/2 reasoning, a 2‑million‑token window, and competitive benchmark results, while keeping pricing flat and introducing autonomous agent capabilities that reshape AI workflows.

Agent · GPT-6 · Large Language Model

0 likes · 15 min read

Old Meng AI Explorer

Apr 28, 2026 · Artificial Intelligence

One Subscription for All Top Chinese Coding Models – Save Hundreds Monthly

Volcengine’s Coding Plan bundles six leading Chinese AI coding models into a single subscription, offering seamless IDE integration, auto model selection, and performance comparable to individual APIs while cutting monthly costs from hundreds of yuan to under ten, as demonstrated by benchmark tests and a four‑step setup guide.

AI coding · Chinese models · Coding Plan

0 likes · 10 min read

PaperAgent

Apr 28, 2026 · Artificial Intelligence

MiniCPM‑o 4.5 Achieves Full‑Duplex Multimodal AI That DeepSeek V4 Missed

MiniCPM‑o 4.5 introduces the world’s first end‑to‑end full‑duplex multimodal 9‑billion‑parameter model, powered by the Omni‑Flow framework, running on a single consumer‑grade GPU with 12 GB memory, and delivers benchmark results that match or surpass Gemini 2.5 Flash while offering open‑source demos, APIs, and a Windows/macOS installer.

AI · MiniCPM-o · Multimodal

0 likes · 13 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

How SenseNova U1’s Unified Architecture Eliminates Multimodal ‘Frankenstein’ Models

SenseNova U1 Lite, an 8‑billion‑parameter open‑source multimodal model from SenseTime, uses the NEO‑Unify architecture to fuse vision and language in a single space, achieving commercial‑grade efficiency and benchmark scores that surpass much larger proprietary models while supporting continuous image‑text generation.

NEO-Unify · SenseNova U1 · benchmark

0 likes · 12 min read

DataFunSummit

Apr 28, 2026 · Big Data

Dynamic Table: A Next‑Generation Data Processing Architecture Powered by Incremental Computing

The article examines the limitations of traditional batch and stream processing, explains how Hologres Dynamic Table combines declarative freshness settings with stateful incremental computation to bridge the gap between low‑cost batch jobs and low‑latency streaming, and presents benchmark results and real‑world case studies.

Dynamic Table · Hologres · benchmark

0 likes · 13 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

World’s First Open‑Source Large Model for Real‑World Medical Video Understanding

The article introduces uAI‑NEXUS‑MedVLM, billed as the world's first open‑source large model for real‑world medical video understanding. Built on the MedVidBench dataset and the MedGRPO training framework, it addresses data scarcity, evaluation gaps, and task‑specialization challenges in surgical video AI and achieves state‑of‑the‑art performance across eight benchmark tasks.

AI in Surgery · Large Language Model · MedVidBench

0 likes · 18 min read

DataFunTalk

Apr 28, 2026 · Artificial Intelligence

Manifold AI’s WorldScape 0.2 Tops WorldArena: How MoE Drives Superior Physics and 3D Understanding

Manifold AI’s WorldScape 0.2 achieved the highest overall score on the embodied world‑model benchmark WorldArena, outperforming giants like Google and Nvidia by excelling in comprehensive perception, physics compliance, and 3D accuracy while using only about 10 % of the parameters of competing models, thanks to a newly introduced MoE architecture.

Embodied AI · MoE · Scaling Law

0 likes · 9 min read

ZhiKe AI

Apr 28, 2026 · Artificial Intelligence

Demystifying DeepSeek‑V4 Benchmarks with Real‑World Data

This article breaks down DeepSeek‑V4's six core capability categories—knowledge, reasoning, programming, math, long‑context, and agent—showing how each benchmark works, presenting concrete scores that place V4 first or second against leading models, and explaining the hidden efficiency gains that make V4 up to 13.7× cheaper to run.

AI evaluation · DeepSeek V4 · Efficiency

0 likes · 14 min read

SuanNi

Apr 27, 2026 · Artificial Intelligence

How MIT’s RUBICON Cuts AI Agent Costs by 90% While Achieving 100% Accuracy

The paper shows that conventional LLM agents fail on real‑world enterprise data because of chaotic data sources, while the RUBICON architecture uses a minimal Agentic Query Language to let users direct data retrieval, achieving 100% accuracy with a much cheaper model and dramatically lower token and monetary costs.

Agentic Query Language · LLM agents · RUBICON

0 likes · 11 min read

ArcThink

Apr 27, 2026 · Artificial Intelligence

GPT-5.5 Deep Dive: What Makes This True Generational Leap Stand Out?

GPT‑5.5, the first fully retrained base model since GPT‑4.5, delivers an 11.7‑point jump on ARC‑AGI‑2, dramatic long‑context gains, and wins 9 of 10 shared benchmarks against GPT‑5.4, while a side‑by‑side comparison with Claude Opus 4.7 shows each model excelling in different domains, heralding a multi‑polar era for frontier AI.

Agent · Claude Opus 4.7 · GPT-5.5

0 likes · 16 min read

Lao Guo's Learning Space

Apr 27, 2026 · Artificial Intelligence

DeepSeek V4 & Huawei Ascend 950PR: Is Domestic Compute Ready for Enterprise AI?

DeepSeek V4, paired with Huawei’s Ascend 950PR chip, delivers inference speed up to 2.87× that of Nvidia H20 and introduces a CSA+HCA attention compression that cuts KV cache usage to under 10%, but its 94‑96% hallucination rate and high token consumption raise concerns for production use.

AI inference · CSA+HCA · DeepSeek V4

0 likes · 13 min read

SuanNi

Apr 26, 2026 · Artificial Intelligence

Xiaomi’s MiMo‑V2.5: Halving Cost, Doubling Efficiency with a New Multimodal LLM

Xiaomi unveiled the MiMo‑V2.5 and MiMo‑V2.5‑Pro large language models, highlighting up to 50% lower API cost, multimodal perception, token‑efficiency gains, benchmark superiority over Claude Opus 4.6 and GPT‑5.4, and real‑world demos that built a full compiler in 4.3 hours and a video‑editing web app in 11.5 hours.

AI agent · Large Language Model · MiMo V2.5

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks

The DeepSeek‑V4 technical report reveals that the model’s doubled training time stems from massive token and parameter scaling, severe training‑stability issues in MoE layers, and a suite of engineering solutions—including Anticipatory Routing, SwiGLU Clamping, specialist expert training, and a custom sandbox cluster—while also exposing high hallucination rates despite impressive benchmark performance.

DeepSeek V4 · Generative Reward Model · LLM

0 likes · 12 min read

JavaEdge

Apr 25, 2026 · Artificial Intelligence

GPT-5.5 Launch: A New Agentic AI for Real‑World Work

OpenAI’s GPT‑5.5, now available via API, claims agentic capabilities that let it autonomously plan, execute, and verify complex programming, knowledge‑work, and scientific tasks while matching GPT‑5.4 latency, delivering higher benchmark scores, stronger security controls, and a tiered pricing model.

GPT-5.5 · agentic AI · benchmark

0 likes · 12 min read

Ops Development & AI Practice

Apr 25, 2026 · Artificial Intelligence

Do Large‑Model Code Generators Really Excel? ARC‑AGI‑2/3 Reveals the Harsh Truth

While recent model releases boast near‑perfect scores on benchmarks like MMLU and HumanEval, the ARC‑AGI‑2 and ARC‑AGI‑3 leaderboards expose a stark gap between headline numbers and genuine programming intelligence, highlighting cost, fluid reasoning, and real‑world applicability.

AI evaluation · ARC‑AGI · benchmark

0 likes · 10 min read

SuanNi

Apr 25, 2026 · Artificial Intelligence

Is Tencent’s Large Model Lagging? How Hy3‑preview Propels It Into the Top Tier

Tencent’s AI division rebuilt its Hunyuan model from the ground up, releasing the 295‑billion‑parameter Hy3‑preview with a fast‑slow hybrid expert architecture, extensive internal benchmarks, and strong performance on scientific, coding, and real‑world tasks, marking a decisive leap into the leading LLM tier.

Agent · Hy3-preview · Large Language Model

0 likes · 7 min read

Architect's Tech Stack

Apr 25, 2026 · Artificial Intelligence

DeepSeek‑V4 Launch: 1.6 T Parameters, 1 M‑Token Context, Programming Skills Lead Open‑Source Rankings

DeepSeek released the V4 series—V4‑Pro (1.6 T total, 49 B active) and V4‑Flash (284 B total, 13 B active)—featuring three architectural upgrades, three inference modes, mixed‑precision FP4/FP8 weights, and benchmark results that place its programming ability at the top of open‑source models while supporting a million‑token context window.

AI Architecture · DeepSeek · Large Language Model

0 likes · 5 min read

ArcThink

Apr 25, 2026 · Artificial Intelligence

DeepSeek V4’s Silent Launch: 1.6 T Parameters, Triple Innovation, and Redefined Accessibility

DeepSeek V4 quietly debuted with a 1.6‑trillion‑parameter MoE model, introducing CSA+HCA compressed attention, mHC manifold‑constrained hyperconnections, and the Muon optimizer. It offers 1M‑token context at a quarter of V3’s cost, top Codeforces and LiveCodeBench scores, pricing at one‑seventh of Opus, MIT open‑source licensing, and dual‑stack Ascend NPU/NVIDIA GPU support.

DeepSeek V4 · Large Language Model · Manifold-constrained Hyperconnection

0 likes · 17 min read

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

Survey of Computer-Use Agents: Terminal/CLI vs GUI Paths

The article surveys recent advances in computer-use agents, categorizing them into terminal/CLI‑based and GUI‑based routes, detailing representative systems, benchmarks, and open challenges such as error accumulation, safety, and evaluation gaps.

GUI · LLM · Terminal

0 likes · 17 min read

Java Web Project

Apr 25, 2026 · Artificial Intelligence

Why GPT-5.5’s Silent Release Signals Real Engineering Power

OpenAI’s April 23, 2026 launch of GPT-5.5 delivers record‑high scores on SWE‑Bench Pro (58.6%) and Terminal‑Bench 2.0 (82.7%), adds persistent multi‑file context, dynamic reasoning time, and token efficiency, while real‑world case studies show substantial productivity gains across engineering teams.

AI Engineering · Codex · GPT-5.5

0 likes · 13 min read

Shuge Unlimited

Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source

DeepSeek V4, released shortly after GPT‑5.5, offers two models—V4‑Pro (1.6 T parameters) and V4‑Flash (284 B parameters)—that introduce a hybrid CSA/HCA attention architecture to enable efficient million‑token context, achieve dramatic FLOPs and KV savings, deliver competitive programming and agent benchmarks, and adopt a disruptive pricing strategy, while also exposing training‑stability tricks and highlighting both strengths and remaining gaps.

DeepSeek V4 · Hybrid Attention · LLM

0 likes · 25 min read

PaperAgent

Apr 24, 2026 · Artificial Intelligence

DeepSeek‑V4 Open‑Sources Its Million‑Token Architecture and Calls Out Claude Opus 4.6

DeepSeek‑V4’s open‑source report reveals a hybrid CSA/HCA attention design, manifold‑constrained residuals and the Muon optimizer that cut per‑token FLOPs to 27 % and KV‑Cache to 10 % at 1 M tokens, while benchmark results show it outperforms Claude Opus 4.6 on most tasks yet still lags on complex instruction following and multi‑turn dialogue.

AI Architecture · Claude Opus · DeepSeek V4

0 likes · 11 min read

ZhiKe AI

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Launch: Open‑Source Model Beats Closed‑Source Leaders in Coding & Math, 1.6 T Params, 1 M Context

DeepSeek V4, released today, offers two open‑source models (Pro and Flash) with up to 1.6 T parameters and a 1‑million‑token context, achieving top‑tier programming and mathematics benchmark scores that surpass the three major closed‑source competitors, while cutting API costs to a fraction of the price.

API · DeepSeek · V4

0 likes · 7 min read

SuanNi

Apr 24, 2026 · Artificial Intelligence

Why GPT‑5.5 Beats Opus 4.7 and Sets a New Global SOTA

OpenAI’s newly released GPT‑5.5, marketed as a “next‑generation AI for real work,” outperforms competitors across coding, knowledge‑work, and scientific research benchmarks—achieving 82.7% accuracy on Terminal‑Bench 2.0, 58.6% on SWE‑Bench Pro, 84.9% on GDPval, and 98.0% on Tau2‑bench Telecom—while offering higher token efficiency and new pricing tiers.

AI agent · GPT-5.5 · OpenAI

0 likes · 11 min read

SuanNi

Apr 24, 2026 · Artificial Intelligence

DeepSeek-V4 Launches: Million-Token Context Becomes Affordable for All

DeepSeek-V4 introduces a hybrid attention architecture, manifold‑constrained hyper‑connections, and the Muon optimizer to cut inference FLOPs and KV cache dramatically, enabling open‑source models to handle million‑token contexts at a fraction of the cost of leading closed‑source services while matching their performance.

DeepSeek V4 · Hybrid Attention · Large Language Model

0 likes · 7 min read

AI Large Model Application Practice

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Preview: Key Technical Highlights, Benchmarks, and Pricing

The DeepSeek‑V4 preview details two model variants—Pro and Flash—with trillion‑scale parameters, outlines benchmark scores that surpass or match leading overseas models across code generation, real‑world fixes, engineering tasks, and world knowledge, and explains core innovations, pricing, API endpoints, and open‑source licensing.

API · DeepSeek · Hybrid Attention

0 likes · 7 min read
