Tagged articles

183 articles

Page 1 of 2

May 31, 2026 · Artificial Intelligence

Defining a Good Answer in the Agent Era: A Rubrics Survey

This survey examines how rubrics can decompose the vague notion of a "good answer" for large language models into concrete, multi‑dimensional evaluation criteria, detailing their definition, construction methods, applications in training and evaluation, and the open challenges they present.

AI alignment · Large Language Models · agentic AI

0 likes · 13 min read

DataFunTalk

May 31, 2026 · Artificial Intelligence

The Most Comprehensive Survey of Agent Harness Engineering

This article summarizes the Agent Harness Engineering survey, outlining the evolution from Prompt to Context to Harness engineering, presenting the seven‑layer ETCLOVG framework, benchmark findings, and the shift toward platform‑level observability, governance, and trace‑native evaluation for reliable AI agents.

Agent Harness · Context Engineering · ETCLOVG

0 likes · 12 min read

DataFunTalk

May 29, 2026 · Artificial Intelligence

From Prompt to Context to Harness: Unpacking the Three Paradigm Shifts in Agent Engineering

The survey "Agent Harness Engineering: A Survey" reveals how agent systems have evolved from prompt engineering to context engineering and now to harness engineering, introduces the seven‑layer ETCLOVG framework, shows benchmark gains from better harnesses, and argues that observability, governance, and trace‑native evaluation are essential for production‑grade AI agents.

AI agents · Context Engineering · Governance

0 likes · 14 min read

AI Engineer Programming

May 29, 2026 · Artificial Intelligence

How to Build a Reliable RAG Test Dataset

The article explains why a structured test set is essential for Retrieval‑Augmented Generation systems, outlines failure modes, describes layered evaluation of retrieval and generation, details infrastructure like chunk IDs and manifests, and provides a complete annotation pipeline with cold‑start and adversarial strategies.

LLM · RAG · adversarial

0 likes · 24 min read

DataFunTalk

May 28, 2026 · Artificial Intelligence

The Most Comprehensive Survey on Agent Harness Engineering Revealed

This article summarizes the 71‑page survey "Agent Harness Engineering: A Survey", detailing the shift from prompt to context to harness engineering, introducing the seven‑layer ETCLOVG framework, benchmark results showing up to 10× gains, and arguing that future competition will focus on the engineering shell surrounding LLM agents rather than model size alone.

AI Systems · Agent · Framework

0 likes · 15 min read

大转转FE

May 21, 2026 · Artificial Intelligence

Why AI Buzzwords Multiply Faster Than My Hair Falls

The article maps three generations of AI engineering—Prompt Engineering, Context Engineering, and Harness Engineering—explaining their core capabilities, key terms like LLM, RAG, Agent, and evaluation methods, while offering practical tips, pitfalls, and a concise three‑question checklist to stay grounded amid the rapid influx of new AI jargon.

AI · Agent · Harness

0 likes · 19 min read

PaperAgent

May 19, 2026 · Artificial Intelligence

Why Long-Term Memory Needs Vision: How MemEye Evaluates Multimodal Agent Recall

MemEye is a multimodal memory benchmark that tests agents across eight real‑world scenarios, measuring visual evidence granularity and reasoning depth, and reveals that captions fall short for fine‑grained visual recall, highlighting the need for true visual memory in long‑term AI agents.

AI agents · MemEye · benchmark

0 likes · 4 min read

DataFunTalk

May 19, 2026 · Industry Insights

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Forks for Data Platforms

A live discussion dissected the shift from single‑point Copilot assistants to platform‑level Agentic data platforms, exposing hard architectural, security, knowledge‑base, evaluation, stability‑cost, and governance challenges while debating whether the future will favor a super‑agent or a multi‑agent ecosystem.

Big Data · Data Platform · Enterprise Governance

0 likes · 18 min read

High Availability Architecture

May 19, 2026 · Artificial Intelligence

5 Essential Tools to Install Before Building an AI Agent

The article outlines five critical setup steps (privacy with direnv and a secret manager, token handling via litellm or portkey, context management using uv and git commits, visibility through mitmproxy, and rigorous evaluation with inspect‑ai) and reports that they cut token waste by 68.3%, reduce costs by 92.5%, and raise evaluation pass rates to 94.2% across 347 runs.

AI agents · DevOps · cost optimization

0 likes · 9 min read

DataFunSummit

May 18, 2026 · Artificial Intelligence

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Paths for Data Platforms

A 90‑minute live discussion examined how data platforms must evolve from simple Copilot assistants to fully agentic systems, covering architectural redesign, security guardrails, knowledge‑base integration, evaluation pitfalls, cost management, and whether the future favors a super‑agent or a multi‑agent ecosystem.

Cost Management · Data Platform · Future Trends

0 likes · 20 min read

James' Growth Diary

May 11, 2026 · Artificial Intelligence

Mastering RAG Evaluation: Recall@K, MRR, NDCG, and RAGAS Explained

This article breaks down RAG evaluation into a two‑layer framework, explains the four core metrics—Recall@K, MRR, NDCG, and the four RAGAS scores—shows how to implement them with LangChain.js, highlights common pitfalls, and offers scenario‑specific metric combinations for reliable performance monitoring.

LangChain · MRR · NDCG

0 likes · 20 min read

Wuming AI

May 10, 2026 · Artificial Intelligence

Can Large Models Really Understand 1 M Tokens? Lessons from the RULER Benchmark

The article examines why a model’s advertised context window (e.g., 128 K or 1 M tokens) does not guarantee effective long‑context reasoning, summarizing the RULER framework that breaks long‑context ability into retrieval, interference resistance, multi‑hop tracking, aggregation, and multi‑answer recall, and offering practical guidance for evaluating and using such models.

LLM · RULER · aggregation

0 likes · 16 min read

Machine Heart

May 10, 2026 · Artificial Intelligence

Stop Fragmenting Long Texts: HiLight Lets AI Highlight Key Points Directly

The HiLight approach inserts lightweight highlight tags into full-length inputs, training a small Emphasis Actor to score token importance and guide a frozen large language model, improving performance on tasks like recommendation and QA without modifying the solver, while keeping latency and training cost low.

LLM · Low latency · evaluation

0 likes · 9 min read

Architect

May 4, 2026 · Artificial Intelligence

What Skills Architects Must Master in the Agent Era and Which Will Last Six Months

In the fast‑changing Agent era, architects should focus on durable engineering capabilities—context management, tool design, evaluation, harness, permissions, and cost control—rather than chasing the latest frameworks, ensuring agents remain stable and controllable in production systems.

AI agents · Context Management · Harness

0 likes · 26 min read

PaperAgent

May 4, 2026 · Artificial Intelligence

Why Claude 4.6 Scores Only 66%: Claw‑Eval‑Live Shows Terminal Skills Aren’t Enough

The article explains that modern AI agents must be judged on actual task execution and audit evidence, and Claw‑Eval‑Live reveals that while agents can use terminals, they still fail dramatically on cross‑system workflows such as HR, management, and operations, with no model surpassing a 70% pass rate.

AI agents · Claw-Eval · LLM

0 likes · 7 min read

AI Engineering

May 4, 2026 · Artificial Intelligence

Why the Big‑Model Race Is Over: Where Real Value Lies in AI Infrastructure

The article argues that the competition over which large language model will dominate is outdated, explaining that true value now comes from building multi‑model routing, context engineering, standardized tool protocols, intelligent orchestration, and robust evaluation layers that turn models into reliable AI infrastructure.

AI infrastructure · MCP · Model routing

0 likes · 6 min read

PMTalk Product Manager Community

May 4, 2026 · Product Management

2026 AI Product Manager: The Essential Capability Model

By 2026, AI product managers must shift from merely using models to delivering stable, valuable results, mastering seven core abilities—demand judgment, evaluation-driven iteration, context design, RAG strategy, agent orchestration, solution planning, and rapid Vibe Coding—to close the loop between business needs and AI capabilities.

AI product management · Agent Design · Context Engineering

0 likes · 13 min read

AgentGuide

May 3, 2026 · Artificial Intelligence

How to Evaluate an AI Agent Beyond Just Accuracy

Evaluating AI agents requires more than accuracy; you must measure task completion, execution trace, tool usage, latency, cost, error rates, and both explicit and implicit user feedback, using observability, offline smoke‑test and regression suites, and continuous online monitoring to create a closed‑loop improvement process.

AI Agent · Metrics · Observability

0 likes · 14 min read

AI Architecture Hub

May 3, 2026 · Artificial Intelligence

What to Learn, Build, and Skip in AI Agents

The article analyzes the fast‑changing AI‑agent landscape, proposes five concrete criteria for filtering new technologies, outlines essential concepts such as context engineering, tool design, scheduler‑subagent patterns, and evaluation frameworks, and recommends a stable 2026 tech stack while warning against hype‑driven tools.

AI agents · Context Engineering · LangGraph

0 likes · 27 min read

AI Engineer Programming

May 2, 2026 · Artificial Intelligence

From Demo to Production: How to Evaluate RAG Effectively

This guide outlines a comprehensive RAG evaluation framework covering failure modes, multi‑layer metrics, test‑set construction, open‑source tools, CI/CD quality gates, production monitoring, and special considerations for agentic RAG to ensure reliable, trustworthy retrieval‑augmented generation systems.

AI · LLM · Metrics

0 likes · 18 min read

Machine Learning Algorithms & Natural Language Processing

Apr 28, 2026 · Artificial Intelligence

When Unprompted, Large Language Models Can Still Deceive

A recent ICLR 2026 oral paper shows that even without malicious prompting, many leading LLMs produce inconsistent or strategically biased answers, revealing a form of deception that grows with question complexity and is not guaranteed to diminish with model size.

AI safety · CSQ framework · Large Language Models

0 likes · 10 min read

MaGe Linux Operations

Apr 28, 2026 · Artificial Intelligence

Why Your RAG Performance Is Poor: Common Issues and Optimization Strategies

This article systematically analyzes why Retrieval‑Augmented Generation pipelines often underperform—covering embedding model selection, chunking strategies, hybrid retrieval, reranking, context window waste, evaluation metrics, and a detailed troubleshooting checklist—while providing concrete code examples and best‑practice recommendations for engineers.

Chunking · Embedding · Hybrid Retrieval

0 likes · 19 min read

PaperAgent

Apr 27, 2026 · Artificial Intelligence

A Comprehensive Review of Modern LLM Agent Memory Frameworks

The article surveys recent LLM‑based agent memory research, presenting a unified framework that breaks memory systems into four components, detailing their design choices, experimental evaluation on LOCOMO and LONGMEMEVAL, key findings, and a new low‑token SOTA architecture.

Agent Memory · LLM · Memory Management

0 likes · 8 min read

AI Engineer Programming

Apr 23, 2026 · Artificial Intelligence

From Zero to One: A Roadmap for Building Trustworthy AI Agent Evaluations

The article outlines why rigorous, automated evaluation is essential for AI agents, defines core concepts such as tasks, trials, graders, and frameworks, compares code‑based, model‑based and human graders, and presents an eight‑step roadmap—from early testing to open‑source maintenance—to create reliable, scalable agent assessments.

AI agents · Benchmarking · LLM grading

0 likes · 22 min read

MaGe Linux Operations

Apr 22, 2026 · Artificial Intelligence

5 Essential Design Principles for Building High‑Quality RAG Systems

This article outlines five critical design principles for constructing high‑quality Retrieval‑Augmented Generation (RAG) systems, covering document chunking strategies, embedding model selection, hybrid retrieval architectures, metadata filtering with multi‑level indexes, and reranking mechanisms, and provides concrete code snippets and evaluation metrics.

Embedding · Hybrid Retrieval · RAG

0 likes · 17 min read

PMTalk Product Manager Community

Apr 22, 2026 · Product Management

AI Product Managers Have Stopped Sketching Wireframes – Here’s Why

The article explains how AI product managers have shifted from creating prototype diagrams to designing continuous evaluation “exams”, using real‑world examples, data‑driven testing, cross‑team collaboration, and iterative error analysis to deliver truly useful AI products.

AI product management · Continuous Improvement · Data Testing

0 likes · 8 min read

Su San Talks Tech

Apr 21, 2026 · Artificial Intelligence

How to Turn Bad Prompts into High‑Scoring AI Prompts: A Step‑by‑Step Guide

This article walks through a complete prompt‑engineering workflow—starting from a weak baseline, building an evaluation pipeline, and applying four concrete techniques (clarity, specificity, XML structuring, and examples) that lift a Claude score from 3.4 to over 9, with code, metrics, and real‑world examples.

AI · Claude · Prompt Engineering

0 likes · 19 min read

FunTester

Apr 20, 2026 · Artificial Intelligence

Why Self‑Evaluating Agents Fail and How to Build Reliable Multi‑Agent Systems

The article analyzes why letting the same AI agent both generate and evaluate its own output leads to over‑confident but flawed results, especially for subjective tasks, and proposes a three‑stage multi‑agent architecture with independent evaluation, concrete standards, and prompt‑based calibration to improve reliability as models evolve.

AI · Prompt Engineering · System Design

0 likes · 9 min read

Java One

Apr 20, 2026 · Artificial Intelligence

From Bad Prompts to 9.5 Scores: A Step‑by‑Step Prompt Engineering Guide

This article walks through an iterative prompt‑engineering workflow—starting with a weak baseline, applying four concrete techniques (clarity & directness, specificity, XML structuring, and examples), evaluating each change with a PromptEvaluator, and showing how scores jump from 3.4 to over 9.5 using real code snippets and concrete data.

AI · Claude · Prompt Engineering

0 likes · 20 min read

Machine Heart

Apr 17, 2026 · Artificial Intelligence

Can LLMs Truly Mimic Human Shopping Behavior? The OPeRA Dataset and Evaluation

The paper introduces OPeRA, a step‑wise online‑shopping dataset capturing observations, personas, rationales, and actions from real users, and uses it to benchmark LLMs on next‑action prediction, revealing that even top models like GPT‑4.1 achieve only about 20 % accuracy on fine‑grained actions, with persona information offering limited benefit while rationales prove crucial.

AI · LLM · dataset

0 likes · 9 min read

Data Party THU

Apr 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Emotions? Inside the MME-Emotion Benchmark

The MME-Emotion benchmark, introduced by researchers from CUHK and Alibaba Tongyi and accepted at ICLR 2026, provides a large‑scale, multimodal evaluation of emotional intelligence in large language models, revealing current models’ limited emotion recognition and reasoning abilities across diverse real‑world scenarios.

AI · MME-Emotion · Multimodal LLM

0 likes · 10 min read

Machine Heart

Apr 10, 2026 · Artificial Intelligence

Why Generalist’s Success Shifts Embodied AI Competition From Models to Infrastructure

The launch of Generalist AI’s GEN‑1 model demonstrates a breakthrough in success rate, speed and resilience, but the article argues that the true competitive frontier has moved from model performance to the underlying data, simulation and evaluation infrastructure that enables continuous learning and scalable testing for embodied intelligence.

AI models · Data Infrastructure · Embodied AI

0 likes · 12 min read

DataFunSummit

Apr 10, 2026 · Artificial Intelligence

How Can AI Agents Truly Remember? A Deep Dive into Long‑Term Memory Engineering

This article examines the shortcomings of current AI assistants, outlines the ideal of long‑term memory engineering, reviews mainstream industry solutions such as hard‑context models and Retrieval‑Augmented Generation, proposes a four‑layer memory loop architecture, and looks ahead to online learning and collective intelligence for future agents.

AI · Agent · Foundation Model

0 likes · 15 min read

Data STUDIO

Apr 10, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Writing Effective Agent Skill.md Files

This article explains what Agent Skills are, shows the folder layout and SKILL.md format, introduces the progressive‑disclosure design, provides concrete best‑practice tips, testing and evaluation methods, and demonstrates how to package scripts for reliable AI‑assistant automation.

AI Assistant · Agent Skills · Automation

0 likes · 29 min read

AI Step-by-Step

Apr 8, 2026 · Operations

How to Light Up the Black Box of LLM Agents with Full‑Stack Observability

The article explains why traditional logs are insufficient for LLM agents, outlines five observability dimensions—tracing, metrics, behavioral governance, state & memory, and evaluation—and provides concrete, open‑source‑based steps to instrument, monitor, and act on agent workloads in production.

Behavioral Governance · LLM Agents · Metrics

0 likes · 11 min read

Machine Heart

Apr 5, 2026 · Artificial Intelligence

How Imitation Learning Powers Dexterous Manipulation: A 2021‑2025 Technical Roadmap

This survey maps the 2021‑2025 progress of imitation learning for dexterous manipulation, detailing theoretical foundations, datasets, algorithms, hardware platforms, and evaluation protocols, and highlights challenges such as data quality, hardware dependence, and the need for standardized benchmarks to advance embodied AI.

Algorithms · Dexterous Manipulation · datasets

0 likes · 11 min read

AI Code to Success

Apr 3, 2026 · Artificial Intelligence

Can Your AI Agent Earn a College Degree? Exploring Clawvard’s Evaluation Platform

The author explores Clawvard, an AI‑agent assessment platform that tests agents across eight dimensions, shares personal test results showing an initial A‑ rating with a critical retrieval weakness, details the customized improvement rules applied, and demonstrates a subsequent A+ rating, while also discussing the platform’s limits and practical use cases.

AI Agent · Artificial Intelligence · Prompt Engineering

0 likes · 8 min read

AgentGuide

Apr 3, 2026 · Artificial Intelligence

How to Evaluate RAG Systems: Key Metrics and the Ragas Framework

The article explains how to assess Retrieval-Augmented Generation (RAG) projects using the Ragas automated evaluation framework, detailing four key dimensions—recall quality, answer faithfulness, answer relevance, and context utilization—and describes the underlying metrics for both retrieval and generation stages.

LLM · Metrics · RAG

0 likes · 5 min read

AI Engineer Programming

Apr 2, 2026 · Artificial Intelligence

How to Rigorously Test Your Own Trained LLM and Choose the Right Benchmarks

This guide outlines a systematic LLM evaluation framework, covering goal definition, core and code‑oriented benchmarks, agent and safety tests, data‑contamination mitigation, toolchain choices, result reporting, and the inherent structural limits of static benchmarks.

Agent · LLM · Safety

0 likes · 14 min read

Top Architecture Tech Stack

Mar 28, 2026 · Artificial Intelligence

How Anthropic’s Multi‑Agent Harness Keeps Claude Running for Hours

Anthropic’s engineering recap reveals a GAN‑inspired multi‑agent framework that separates generation, evaluation, and planning to overcome Claude’s context anxiety and self‑evaluation bias, enabling the model to sustain multi‑hour, high‑quality tasks across frontend design, full‑stack apps, and game‑engine projects.

AI · Claude · evaluation

0 likes · 19 min read

SuanNi

Mar 26, 2026 · Artificial Intelligence

Unveiling Omni-WorldBench: How 18 AI Video Models Stack Up on 4D Interaction Tests

The Omni-WorldBench framework introduces a comprehensive 4D evaluation suite with 1,068 test cases and three interaction levels, applying novel metrics to assess video quality, controllability, and physical interaction fidelity across 18 state‑of‑the‑art AI video models, revealing strengths, weaknesses, and future research directions.

4D interaction · Omni-WorldBench · benchmark

0 likes · 14 min read

SuanNi

Mar 25, 2026 · Artificial Intelligence

Can Harness Engineering Enable AI Agents to Master Complex Long‑Running Tasks?

This article analyses the concept of Harness engineering introduced by OpenAI and Anthropic, explains how multi‑agent architectures decompose and manage long‑running AI tasks, examines practical experiments such as a retro game maker and a web‑audio workstation, and distills lessons for future AI system design.

AI Engineering · Anthropic · Claude

0 likes · 16 min read

o-ai.tech

Mar 25, 2026 · Artificial Intelligence

From Code Writing to Continuous Development: Anthropic’s Long‑Running Agent Harness Design

Anthropic’s article dissects a three‑role harness—planner, generator, evaluator—for building long‑running AI applications, explaining how structured specs, sprint contracts, iterative evaluation, and context management transform a single model into a reliable software‑engineering pipeline, with concrete front‑end and full‑stack case studies.

AI agents · Evaluator · Harness

0 likes · 23 min read

Frontend AI Walk

Mar 25, 2026 · Artificial Intelligence

Slow Learning Agents: 7 Cognitive Shifts from Using ChatGPT to Truly Understanding Agents

The article outlines seven essential mindset transitions for building robust LLM agents—recognizing agents as autonomous decision loops, prioritizing harness over model size, layering context, designing tools for agent goals, structuring multi‑layer memory, coordinating multiple agents with isolation and protocols, and aligning evaluation with the real environment.

Context Management · Harness · LLM Agents

0 likes · 16 min read

AgentGuide

Mar 22, 2026 · Artificial Intelligence

How to Design Prompt Engineering in Your Project: A Complete Workflow

The article outlines a systematic Prompt Engineering process that starts with defining task goals and metrics, structures prompts into modular components, uses offline evaluation and bad‑case analysis, incorporates RAG or tools when needed, and continuously monitors accuracy, hallucination, latency and cost.

AI workflow · Few-shot · Large Language Model

0 likes · 7 min read

ByteDance SE Lab

Mar 20, 2026 · Artificial Intelligence

How to Build a Multi‑Repo Semantic Code Q&A System with OpenViking

This guide explains the challenges of multi‑repository code retrieval, presents an experimental evaluation of OpenViking's semantic search, and provides step‑by‑step instructions for installing, configuring, importing repositories, and integrating the system into AI agents and chatbots.

AI Assistant · Multi-repo · OpenViking

0 likes · 16 min read

AI Frontier Lectures

Mar 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Human Emotions? Introducing the MME-Emotion Benchmark

This article presents MME-Emotion, a large‑scale multimodal benchmark that evaluates both emotion recognition and reasoning abilities of multimodal large language models across 27 real‑world scenarios, revealing current models’ significant gaps in emotional intelligence and outlining future research directions.

AI · Multimodal LLM · benchmark

0 likes · 9 min read

Alibaba Cloud Developer

Mar 16, 2026 · Artificial Intelligence

HeartBench: Building the First Chinese AI Humanization Benchmark

This article details the creation of HeartBench, a Chinese benchmark for evaluating large language models' emotional and social intelligence, describing its background, design principles, data pipeline, evaluation methods, multi‑stage versioning, blind‑test validation, and lessons for building transferable AI assessment frameworks.

AI Benchmark · Emotion AI · Humanization

0 likes · 25 min read

PaperAgent

Mar 15, 2026 · Artificial Intelligence

Why LLM Tool‑Calling Benchmarks Miss Real Users: Introducing WildToolBench

WildToolBench reveals that existing LLM tool‑calling benchmarks overlook real‑world user behavior, and a comprehensive evaluation of 58 models shows even the strongest agents achieve less than 15% session accuracy, highlighting a huge gap between reported performance and practical usability.

LLM · agentic AI · benchmark

0 likes · 10 min read

Old Zhang's AI Learning

Mar 11, 2026 · Artificial Intelligence

Upgrade All Your Claude Skills Now: Harness the New Skill‑Creator Engine

Anthropic’s updated skill‑creator turns Skills into a core, engineering‑focused capability for Claude, offering a systematic workflow—baseline A/B testing, quantitative assertions, visual evaluation, and iterative description optimization—so developers can rebuild, refine, and reliably trigger their Skills for higher productivity.

AI agents · Anthropic · Automation

0 likes · 13 min read

PaperAgent

Mar 9, 2026 · Artificial Intelligence

How SkillNet Turns AI Agent Experience into Reusable Skills

SkillNet proposes a three‑layer infrastructure that extracts, evaluates, and connects over 200,000 AI‑agent skills into a structured graph, dramatically improving performance across benchmark environments while turning transient agent experience into durable, reusable assets.

AI agents · LLM · Machine Learning

0 likes · 6 min read

AI Tech Publishing

Mar 7, 2026 · Artificial Intelligence

A Practical Guide to Evaluating Agent Skills

This article explains why many Agent Skills are released without testing, defines measurable success criteria, and presents a lightweight evaluation framework—including prompt set creation, deterministic checks, optional LLM‑based qualitative checks, and best‑practice recommendations—demonstrated by improving a Gemini Interactions API skill from 66.7% to 100% pass rate.

AI agents · Agent Skills · Gemini

0 likes · 13 min read

Amap Tech

Mar 5, 2026 · Artificial Intelligence

How MobilityBench Measures the Real Power of AI Route‑Planning Agents

MobilityBench is an open‑source benchmark built from over 100 000 real user queries that evaluates AI route‑planning agents with a deterministic sandbox, multi‑dimensional metrics, and support for ReAct and Plan‑and‑Execute frameworks, revealing performance gaps between open‑source and closed‑source models.

AI agents · MobilityBench · Plan-and-Execute

0 likes · 6 min read

Data Party THU

Feb 18, 2026 · Artificial Intelligence

Why Top AI Agents Fail in Real Work: Inside the Trainee‑Bench Benchmark

The article analyzes the gap between high benchmark scores and poor real‑world performance of AI agents, introduces the Trainee‑Bench workplace simulator, details its three evaluation dimensions, construction steps, and reveals that even state‑of‑the‑art models achieve low success rates, highlighting the need for autonomous learning and zero‑hand‑over.

AI agents · Trainee-Bench · continuous learning

0 likes · 11 min read

AI Engineering

Jan 29, 2026 · Artificial Intelligence

How a Tiny AGENTS.md Change Boosted AI Coding Accuracy from 53% to 100%

A Vercel team experiment shows that replacing the Skills approach with a small 8 KB AGENTS.md file raised AI coding agents' pass rate from 53% to a perfect 100%, revealing the fragility of explicit tool calls and the strength of passive, always‑available context.

AGENTS.md · AI coding agents · Next.js

0 likes · 11 min read

JD Tech

Jan 27, 2026 · Artificial Intelligence

How Uni-Layout Unifies Cross‑Task Layout Generation with Human‑Like Evaluation

Uni-Layout introduces a unified framework that integrates a universal layout generator, a human‑feedback‑simulating evaluator, and a dynamic margin preference optimization technique to align generation and evaluation across diverse e‑commerce design tasks, backed by a new 100k human‑annotated dataset.

Human Feedback · Multimodal LLM · dynamic margin optimization

0 likes · 11 min read

Architect

Jan 19, 2026 · Artificial Intelligence

How Cursor Scales Autonomous Coding Agents to Hundreds: Architecture Lessons for AI Systems

This article analyzes Cursor's engineering choices for running autonomous coding agents at scale, detailing the long‑running, drift, and evaluation concepts, the Planner‑Worker‑Judge pipeline, concurrency challenges, experimental results, and actionable rules for building robust multi‑agent systems.

System architecture · evaluation · software engineering

0 likes · 17 min read

Old Zhao – Management Systems Only

Jan 15, 2026 · Operations

Why Most Supplier Evaluation Systems Fail and the 4 Metrics That Actually Matter

The article explains why traditional supplier evaluation forms often become meaningless, introduces four decisive metrics—delivery stability, quality consistency, cost transparency, and collaboration willingness—provides concrete scoring formulas for each, and shows how an SRM system can automate and visualize these indicators to help companies decide whether to replace a supplier.

Operations · SRM · evaluation

0 likes · 10 min read

JD Cloud Developers

Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Unifying Layout Generation with Human Feedback and Dynamic Alignment

Uni-Layout introduces a unified framework that combines a multimodal large language model‑based generator, a human‑like evaluator trained on the large Layout‑HF100k dataset, and a Dynamic Margin Preference Optimization (DMPO) method to align generation and evaluation, achieving state‑of‑the‑art results across diverse layout tasks.

DMPO · Human Feedback · Multimodal LLM

0 likes · 11 min read

JD Tech Talk

Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Harnessing Human Feedback for Unified Layout Generation and Evaluation

Uni-Layout introduces a unified framework that generates layouts across diverse tasks, simulates human evaluation with a novel feedback dataset, and aligns generation and assessment through dynamic margin preference optimization, achieving state‑of‑the‑art performance on multiple benchmarks.

AI design · Human Feedback · Multimodal LLM

0 likes · 11 min read

PMTalk Product Manager Community

Jan 14, 2026 · Product Management

From Docs to Evals: Essential AI Skills for Modern Product Managers

AI product managers are shifting from static PRDs to dynamic evaluation frameworks—Evals—that define product quality through automated tests, golden conversations, and LLM judges, enabling continuous iteration, error-driven requirement discovery, and architecture decisions in complex AI systems.

AI · LLM · evals

0 likes · 7 min read

AI Insight Log

Jan 10, 2026 · Artificial Intelligence

Anthropic’s Full Practical Guide to Evaluating AI Agents – Key Insights

The article explains why evaluating AI agents is far more complex than testing deterministic code, outlines Anthropic’s anatomy of a complete evaluation system—including tasks, transcripts, and three grader types—and offers concrete best‑practice recommendations for building reliable agent pipelines.

AI agents · Anthropic · LLM testing

0 likes · 9 min read

JD Retail Technology

Jan 8, 2026 · Artificial Intelligence

Uni-Layout: Unified Cross-Task Layout Generation with Human-Aligned Evaluation

Uni-Layout introduces a unified layout generation framework that consolidates diverse design tasks, leverages multimodal large language models for flexible generation, and aligns outputs with human perception through a novel human‑feedback dataset (Layout‑HF100k) and a dynamic margin preference optimization (DMPO) evaluator.

ACM Multimedia · Human Feedback · Multimodal LLM

0 likes · 11 min read

DataFunSummit

Jan 3, 2026 · Artificial Intelligence

What Is Memory Engineering? Unlocking AI’s Long‑Term Recall and Future Potential

A comprehensive dialogue among industry experts explores the concept of memory engineering for AI agents, covering its definition, system‑level challenges from edge to cloud, hybrid technical routes, evaluation metrics, privacy safeguards, audience questions, future directions, and practical advice for developers.

AI agents · Hybrid Architecture · evaluation

0 likes · 24 min read

AI Product Manager Community

Dec 27, 2025 · Product Management

Embracing Uncertainty: Redesigning AI Product Requirements

The article explores how product managers must shift from deterministic PRDs to uncertainty‑driven specifications for AI chatbots, replacing exhaustive logic with value‑based constraints, fuzzy‑evaluation metrics, dynamic benchmarks, and sample‑based requirements to better align with probabilistic large‑model behavior.

AI · PRD · Prompt Engineering

0 likes · 9 min read

Alibaba Cloud Native

Dec 19, 2025 · Artificial Intelligence

What Enterprises Are Learning from the State of Agent Engineering Report

The recent LangChain "State of Agent Engineering" report, combined with data from the AI‑Native Application Architecture whitepaper, reveals rapid production adoption of AI agents, persistent quality challenges, widespread observability, multi‑model strategies, and evolving evaluation practices across organizations of all sizes.

AI agents · LLM · Observability

0 likes · 10 min read

Model Perspective

Dec 19, 2025 · Fundamentals

How a Multi‑Dimensional Model Ranks China’s Historical TV Dramas

This study builds a comprehensive evaluation model for Chinese historical drama series, defining four primary and nine secondary indicators, standardizing data, applying weighted calculations and a time‑compensation factor to score 127 candidates and produce a TOP‑100 ranking that highlights the influence of audience reputation, market impact, professional recognition, and historical value.

evaluation · historical drama · media

0 likes · 18 min read

Youzan Coder

Nov 21, 2025 · Artificial Intelligence

How to Build, Evaluate, and Optimize AI Test Agents: A Practical Guide

This guide walks you through creating AI‑powered test agents, defining success metrics, building evaluation datasets, crafting and refining system prompts with techniques like chain‑of‑thought, XML, few‑shot and concise inputs, and scaling the workflow by splitting agents and managing prompt versions.

AI agents · LLM · Prompt Engineering

0 likes · 21 min read

Tencent Cloud Developer

Nov 18, 2025 · Artificial Intelligence

Building a Fully Autonomous AI Data Analyst: Agent Architecture & Planning

This article explores how to create a self‑thinking AI data analyst by detailing agent fundamentals, core modules such as planning, memory and tool scheduling, practical development steps, multi‑agent collaboration, evaluation benchmarks, and real‑world examples like stock backtesting.

AI Agent · MCP · Planning

0 likes · 35 min read

Wu Shixiong's Large Model Academy

Nov 4, 2025 · Artificial Intelligence

Why Financial RAG Fails and How to Solve Its Core Challenges

This article explains why Retrieval‑Augmented Generation (RAG) projects in the financial sector often underperform, highlighting data‑structure complexities, document‑parsing hurdles, chunking strategies, compliance constraints, evaluation metrics, and engineering requirements, and offers practical solutions and code examples.

Chunking · Compliance · Engineering

0 likes · 10 min read

Open Source Tech Hub

Oct 23, 2025 · Backend Development

Boost PHP Performance with CEL-PHP: A Fast, Safe Expression Engine

This guide introduces CEL-PHP, a high‑performance, non‑Turing‑complete expression engine for PHP 8+, showing how to install it, evaluate simple and contextual expressions, handle parsing and optimization, integrate caching, register custom functions, and avoid common pitfalls for robust backend rule evaluation.

CEL · Caching · Expression Language

0 likes · 8 min read

AI2ML AI to Machine Learning

Oct 20, 2025 · Artificial Intelligence

nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation

This article revisits nanochat's core components, detailing the preparation of diverse training datasets, the scaling calculations for tokens and parameters, the model's MQA and KV‑cache design, the full training pipeline with gradient accumulation and mixed‑precision, cost breakdown, inference optimizations, evaluation tasks, and identified limitations with suggested improvements.

KV Cache · LLM · MQA

0 likes · 9 min read

Alibaba Cloud Developer

Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

API · LLM · Prompt Engineering

0 likes · 28 min read

HyperAI Super Neural

Oct 14, 2025 · Artificial Intelligence

NeurIPS 2025: OCRBench v2 Shows Gemini Leads Chinese OCR Ranking Yet Scores Only Pass

OCRBench v2, introduced at NeurIPS 2025, evaluates 58 multimodal models on 23 OCR‑related tasks in Chinese and English, revealing that even top models like Gemini‑2.5‑Pro barely exceed the passing threshold and that most models struggle with fine‑grained text localization and multilingual performance.

Gemini · Large Language Models · NeurIPS 2025

0 likes · 8 min read

Old Zhao – Management Systems Only

Oct 13, 2025 · Operations

How to Build a Fail‑Proof Procurement Process with Data‑Driven SRM

This article explains why many procurement processes fail despite formal procedures and provides a step‑by‑step, data‑driven approach—clarifying requirements, using SRM templates, screening suppliers with performance data, scoring comprehensively, ensuring traceability, and conducting post‑award reviews—to select the right suppliers and turn procurement into a strategic advantage.

SRM · data driven · evaluation

0 likes · 8 min read

Fun with Large Models

Sep 17, 2025 · Artificial Intelligence

Evaluating Fine-Tuned Large Model Performance: Methods and Interview Tips

The article explains how to assess fine‑tuned large models using both human judgment and dataset‑driven metrics, outlines common pitfalls, introduces benchmark datasets and evaluation frameworks, and provides concise answers to related interview questions.

EvalScope · benchmark datasets · evaluation

0 likes · 7 min read

Data Thinking Notes

Sep 10, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Uncovering the Statistical Roots

OpenAI’s latest research reveals that language model hallucinations stem from training and evaluation incentives that favor confident guesses over acknowledging uncertainty, and proposes revised scoring methods that reward modesty, highlighting statistical mechanisms behind false answers and offering pathways to reduce hallucinations.

AI safety · Language Models · evaluation

0 likes · 10 min read

Architect

Sep 9, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Insights from OpenAI’s New Study

This article explains why large language models often produce confident but incorrect answers, detailing statistical inevitability, data scarcity, and model capacity limits, and proposes concrete solutions such as confidence thresholds and allowing abstention to reduce hallucinations.

AI safety · Language Models · Prompt Engineering

0 likes · 8 min read

DataFunSummit

Aug 23, 2025 · Artificial Intelligence

Mastering Role‑Playing AI Agents: Challenges, Techniques, and Future Directions

This article surveys the latest research on role‑playing AI agents, covering their definition, core components, application scenarios, three main challenges—role fidelity, long‑term memory, and evaluation—and presents four technical approaches for each challenge along with future research directions and references.

AI agents · Large Language Models · Memory

0 likes · 22 min read

Data Party THU

Aug 23, 2025 · Artificial Intelligence

How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning

The article presents MiroMind‑M1, an open‑source math‑reasoning language model that combines a 719K high‑quality SFT dataset, a novel CAMPO reinforcement‑learning algorithm, and extensive evaluations on AIME24, AIME25, and MATH‑500, demonstrating state‑of‑the‑art performance while reducing token usage.

CAMPO · evaluation · math reasoning

0 likes · 11 min read

JD Tech Talk

Jul 27, 2025 · Artificial Intelligence

Evaluating JoyAgent‑JDGenie: A Lightweight Multi‑Agent AI Framework in Action

This article presents a thorough evaluation of the open‑source JoyAgent‑JDGenie multi‑agent AI framework, covering its background, test cases for restaurant recommendation and travel planning, deployment steps, performance metrics, and concluding recommendations, highlighting its efficiency, ease of deployment, and result quality.

AI · Deployment · agents

0 likes · 8 min read

Zhihu Tech Column

Jul 25, 2025 · Artificial Intelligence

Boost Creative Writing with Zhi-Create-Qwen3-32B: Training, Eval & Deployment

This article introduces the open‑source Zhi‑Create‑Qwen3‑32B model, detailing its fine‑tuned training on creative‑writing data, the multi‑domain dataset strategy, curriculum‑learning based SFT, evaluation on WritingBench, and practical deployment options across various hardware and inference frameworks.

Deployment · Large Language Model · creative writing

0 likes · 11 min read

ELab Team

Jul 9, 2025 · Artificial Intelligence

How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding

This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.

AI · Speculative Decoding · code editing

0 likes · 14 min read

DataFunTalk

Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

LLM · Model Compression · Multimodal

0 likes · 15 min read

DataFunSummit

Jun 19, 2025 · Artificial Intelligence

How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights

In a detailed interview, ByteDance AI specialist Cai Conghuai explains how large‑model techniques such as SFT, DPO and RAG address Douyin’s multimodal user‑experience challenges, improve signal detection, root‑cause analysis, and outline future AI‑agent breakthroughs for content platforms.

AI Algorithms · Multimodal Learning · RAG

0 likes · 11 min read

Aikesheng Open Source Community

Jun 17, 2025 · Artificial Intelligence

Introducing SCALE: An Open‑Source Benchmark Redefining LLM SQL Capabilities

This article presents SCALE, a community‑driven, open‑source benchmark that expands beyond simple Text‑to‑SQL accuracy to evaluate large language models on performance, dialect conversion, and deep SQL understanding, offering developers, researchers, and CTOs a realistic measure of AI‑assisted database tasks.

AI · LLM · SQL

0 likes · 10 min read

Tencent Technical Engineering

Jun 16, 2025 · Artificial Intelligence

Mastering RAG and AI Agents: Practical Tips, Code Samples, and Evaluation Strategies

This comprehensive guide walks you through the fundamentals of Retrieval‑Augmented Generation (RAG) and AI agents, explains their inner workings, shares optimization tricks, provides ready‑to‑run code snippets, and demonstrates how to evaluate performance with metrics such as recall, faithfulness, and answer relevance.

AI agents · LLM · Prompt Engineering

0 likes · 36 min read

Baobao Algorithm Notes

May 26, 2025 · Artificial Intelligence

Why Do Reasoning LLMs Lose Instruction-Following Ability? A Deep Dive into Recent Findings

This article compares two recent papers that investigate why large reasoning models such as Llama and Qwen show degraded instruction‑following performance when using chain‑of‑thought prompting, analyzing attention patterns, training effects, and proposed mitigation strategies.

LLM · attention · chain-of-thought

0 likes · 11 min read

Model Perspective

May 25, 2025 · Fundamentals

Why We Pretend to Win: The Hidden Math Behind Evaluation Bias

The article explores how people manipulate evaluation systems by redefining variables, adjusting weights, and shifting perspectives, turning losses into perceived wins, and reveals the psychological and statistical biases that create this illusion, urging more honest, multi‑dimensional, transparent modeling for genuine assessment.

Bias · Psychology · decision-making

0 likes · 9 min read

DataFunSummit

May 9, 2025 · Artificial Intelligence

Practical Experience Building Zhihu Direct Answer: An AI‑Powered Search Product

This article presents a comprehensive overview of Zhihu Direct Answer, describing its AI‑driven search architecture, RAG framework, query understanding, retrieval, chunking, reranking, generation, evaluation mechanisms, engineering optimizations, and the professional edition, while sharing concrete performance‑boosting practices and future development plans.

AI · Product Development · RAG

0 likes · 14 min read

Architect

Apr 17, 2025 · Artificial Intelligence

The Second Half of AI: From Model Innovation to Real‑World Utility

The article argues that artificial intelligence has entered a new phase where reinforcement learning finally generalizes, evaluation becomes more important than pure model performance, and researchers must redesign benchmarks and utility‑focused tasks to drive truly transformative progress.

evaluation · research strategy

0 likes · 16 min read

Nightwalker Tech

Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI Agent · AutoGLM · Large Language Model

0 likes · 4 min read

Meituan Technology Team

Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100 k AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation—including a “vague‑to‑specific” strategy for long prompts—surpassing existing benchmarks.

AIGC · Machine Learning · Multimodal

0 likes · 9 min read

Alibaba Cloud Developer

Mar 24, 2025 · Artificial Intelligence

Boost LLM Evaluation with Semantic Enrichment and Vector Search

This article explains how semantic enrichment, vector and hybrid search, and clustering techniques can be applied to large language model logs to evaluate inputs and outputs, improve compliance auditing, and enhance model iteration across various business scenarios.

AI · LLM · evaluation

0 likes · 12 min read

Alibaba Cloud Developer

Mar 24, 2025 · Artificial Intelligence

Why LLM Internet Search Fails and How to Fix It: A Deep Dive into Qwen, Doubao, and DeepSeek

This article analyses the shortcomings of large‑model internet search—such as unverifiable sources, fabricated content, and poor instruction compliance—by comparing Qwen‑max, Doubao‑1.5‑pro‑256k, and DeepSeek‑v3, and proposes prompt engineering, post‑processing, and custom tool improvements to boost reliability.

AI · LLM · evaluation

0 likes · 22 min read

DaTaobao Tech

Mar 19, 2025 · Artificial Intelligence

Retrieval Augmented Generation (RAG): Principles, Challenges, and Implementation Techniques

Retrieval‑augmented generation (RAG) enhances large language models by integrating a preprocessing pipeline—cleaning, chunking, embedding, and vector storage—with a query‑driven retrieval and prompt‑injection workflow, leveraging vector databases, multi‑stage recall, advanced prompting, and comprehensive evaluation metrics to mitigate knowledge cut‑off, hallucinations, and security issues.

LLM · RAG · Retrieval-Augmented Generation

0 likes · 27 min read

Efficient Ops

Mar 12, 2025 · Operations

How BizDevOps Is Accelerating Digital Transformation in Finance

This article explains the governmental push for digital transformation in financial institutions, introduces the BizDevOps integration model and its domestic and international standards, outlines the evaluation framework and process, showcases case studies, and announces the open registration for the 2025 BizDevOps assessment.

BizDevOps · Financial Industry · Operations

0 likes · 9 min read

AI Algorithm Path

Feb 20, 2025 · Artificial Intelligence

What Is Perplexity in Large Language Models?

The article explains perplexity as a metric for evaluating large language models, walks through a step‑by‑step probability calculation for a sample sentence, shows how to normalize by sentence length using the geometric mean, and demonstrates that lower perplexity indicates a more accurate and less uncertain model.

AI · Perplexity · evaluation

0 likes · 6 min read

JD Tech

Feb 14, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant – Multi‑Agent System Architecture, Planning, and Evaluation

JD’s Merchant Intelligent Assistant leverages a large‑language‑model‑based multi‑agent architecture to provide 24/7 e‑commerce support, detailing its evolution, planning techniques, online inference, evaluation methods, sample generation, and practical insights for scalable AI‑driven operations.

Automation · E-commerce AI · LLM

0 likes · 22 min read

JD Retail Technology

Feb 10, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant: Multi‑Agent Architecture and Technical Exploration

The JD Merchant Intelligent Assistant employs a large‑language‑model‑driven multi‑agent architecture with dynamic ReAct planning, enabling merchants to query and execute store operations in under a second with over 90 % decision accuracy, while reducing inference cost, hallucinations, and engineering effort across diverse e‑commerce tasks.

AI · LLM · ReAct

0 likes · 25 min read
