Tag

model evaluation


DataFunTalk
Jun 9, 2025 · Artificial Intelligence

Can AI Models Pass the Chinese Math Gaokao? A Fair, Objective Test

The author conducts a transparent, objective assessment of several large language models on the 2025 Chinese national math exam, converting all questions to LaTeX, applying strict Gaokao scoring rules, and revealing each model's strengths and weaknesses across single‑choice, multiple‑choice, and fill‑in‑the‑blank items.

AI benchmarking · Gaokao · large language models
0 likes · 7 min read
Baidu Tech Salon
May 21, 2025 · Artificial Intelligence

Baidu AI Day 2024: Wenxin X1 Turbo Sets New Benchmark with Top‑Level Evaluation and Advanced Multimodal Capabilities

At Baidu AI Day in Beijing, the company unveiled the Wenxin 4.5 Turbo and X1 Turbo models, detailing multimodal training breakthroughs, self‑feedback loops, enhanced reasoning and tool‑calling, while the China Academy of Information and Communications Technology awarded X1 Turbo the highest "4+" rating across 24 capability tests, highlighting its leading position in domestic large‑model performance.

Artificial Intelligence · Baidu · Wenxin
0 likes · 9 min read
DataFunTalk
Apr 8, 2025 · Artificial Intelligence

Meta AI VP Responds to Llama 4 Controversies and Allegations of Benchmark Manipulation

Meta AI Vice President Ahmad Al‑Dahle addressed recent criticisms of the newly released Llama 4 model, denying claims of test‑set cheating, explaining quality variations as post‑release optimization, and acknowledging internal concerns that led to staff resignations and calls for transparency.

Artificial Intelligence · Benchmarking · Llama 4
0 likes · 5 min read
Java Tech Enthusiast
Feb 22, 2025 · Artificial Intelligence

Grok‑3 Evaluation Controversy and Community Reactions

Three days after Grok‑3’s launch, xAI was accused of inflating the model’s benchmark scores by reporting “cons@64” results, which take the consensus of 64 sampled answers, a practice critics say unfairly skews comparisons with single‑attempt scores for models like o3‑mini; meanwhile, developers have already begun experimenting with the model in simple games.

AI · Grok 3 · OpenAI
0 likes · 5 min read
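
As a rough illustration of why a cons@64 score is not comparable to a single‑attempt score, the sketch below takes a majority vote over k sampled answers and compares it with using only the first sample. The solver is a hypothetical stand‑in (`noisy_solver`, with a made‑up 40% per‑sample accuracy and made‑up answer labels), not any real model or the evaluation code from the article.

```python
import random
from collections import Counter


def consensus_answer(samples):
    """Return the most frequent answer among the sampled answers (majority vote)."""
    return Counter(samples).most_common(1)[0][0]


def noisy_solver(correct, wrong_pool, p_correct, rng):
    """Hypothetical stand-in for one model sample: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct
    return rng.choice(wrong_pool)


def compare_single_vs_consensus(n_problems=200, k=64, p_correct=0.4, seed=0):
    """Estimate single-sample accuracy and cons@k accuracy on simulated problems."""
    rng = random.Random(seed)
    # Assumption: wrong answers are spread over several distinct options.
    wrong_pool = ["A", "B", "C", "D", "E"]
    single_hits = 0
    consensus_hits = 0
    for _ in range(n_problems):
        samples = [noisy_solver("X", wrong_pool, p_correct, rng) for _ in range(k)]
        single_hits += samples[0] == "X"                    # single attempt
        consensus_hits += consensus_answer(samples) == "X"  # vote over k attempts
    return single_hits / n_problems, consensus_hits / n_problems


single_acc, cons_acc = compare_single_vs_consensus()
print(f"single-sample accuracy: {single_acc:.2f}, cons@64 accuracy: {cons_acc:.2f}")
```

Under these assumptions the vote recovers the correct answer far more often than one sample does, because the wrong answers split across several options. That gap is why the two kinds of score should not be plotted side by side without labeling.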
Architect
Feb 21, 2025 · Artificial Intelligence

DeepSeek Model Innovations: Architecture, Training Methods, and Performance Evaluation

This article reviews DeepSeek's recent breakthroughs, including the MLA attention redesign, GRPO alignment algorithm, MoE enhancements, multi‑stage training pipelines (SFT, RL, preference tuning, distillation), and comparative performance against GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

DeepSeek · Mixture of Experts · architecture
0 likes · 16 min read
DataFunTalk
Feb 18, 2025 · Artificial Intelligence

CODEI/O: Leveraging Code to Train Large Language Models for Enhanced Reasoning

The DeepSeek team introduced CODEI/O, a large dataset that converts code into natural‑language reasoning chains, and showed that training large language models on it with a two‑stage strategy markedly improves performance on diverse reasoning tasks, including non‑code domains.

AI training · CODEI/O · code reasoning
0 likes · 8 min read
DevOps
Feb 7, 2025 · Artificial Intelligence

OpenAI Releases o3-mini Chain‑of‑Thought: First Tests, Community Reactions, and Critical Analysis

OpenAI has publicly disclosed the chain‑of‑thought reasoning of its o3‑mini model, prompting a wave of community experiments, critiques of its authenticity, and discussion of the model’s limitations, and offering insights into AI interpretability and the trade‑offs of revealing internal reasoning.

Artificial Intelligence · Chain-of-Thought · O3-mini
0 likes · 6 min read
Model Perspective
Dec 23, 2024 · Fundamentals

Mastering Mathematical Modeling: 5 Stages & Common Pitfalls to Avoid

From the excitement of first encountering mathematical modeling to becoming a seasoned practitioner, this guide outlines five progressive stages, reveals typical misconceptions at each level, and offers practical advice to help learners avoid common traps and develop both technical and soft skills.

Common Pitfalls · Data quality · learning stages
0 likes · 8 min read
DataFunSummit
Nov 26, 2024 · Information Security

AI‑Driven Security Operations (AISECOPS): Architecture, Practices, and Evaluation

This article explains how large‑model AI can be integrated into security operations (AISECOPS) to simplify application integration, improve fault detection, and automate protection across complex north‑south and east‑west network layers, while addressing challenges such as data quality, cost control, model selection, and safety frameworks.

AISECOPS · Large Models · cost-optimization
0 likes · 22 min read
Model Perspective
Nov 24, 2024 · Fundamentals

Mastering Baselines: How to Evaluate and Improve Your Mathematical Models

This article explains the concept of baselines in mathematical modeling, outlines how to construct various types such as empirical, random, theoretical, and heuristic baselines, and demonstrates their crucial role in model evaluation, resource allocation, and fostering innovation through practical case studies.

Case Study · Performance Metrics · baseline
0 likes · 7 min read
Test Development Learning Exchange
Nov 23, 2024 · Artificial Intelligence

Evaluating Linear Regression Model Performance with K-Fold Cross-Validation in Python

This tutorial shows how to evaluate a linear regression model's performance using K‑fold cross‑validation in Python, covering data loading and preparation, computing the MSE and R² metrics, visualizing predictions with matplotlib, and interpreting the results.

MSE · Python · R2
0 likes · 6 min read
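
The tutorial's own code is not reproduced on this page, so the following is only a minimal, standard‑library sketch of the technique it names: K‑fold cross‑validation of a one‑feature linear regression, reporting MSE and R² on each held‑out fold. The helper names (`fit_line`, `k_fold_indices`, `cross_validate`) and the synthetic data are assumptions made for illustration, and plotting is left out.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k=5, seed=0):
    """Return one (MSE, R^2) pair per fold, each measured on the held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        y_true = [ys[i] for i in held_out]
        y_pred = [slope * xs[i] + intercept for i in held_out]
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        y_bar = mean(y_true)
        ss_tot = sum((t - y_bar) ** 2 for t in y_true)
        scores.append((ss_res / len(y_true), 1 - ss_res / ss_tot))
    return scores


# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

scores = cross_validate(xs, ys, k=5)
avg_mse = mean(m for m, _ in scores)
avg_r2 = mean(r for _, r in scores)
print(f"mean MSE = {avg_mse:.3f}, mean R^2 = {avg_r2:.3f}")
```

Averaging the per‑fold scores gives a less optimistic estimate than scoring the model on the data it was fitted to, which is the main reason to cross‑validate at all.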
DataFunSummit
Sep 28, 2024 · Artificial Intelligence

Seat Copilot: Design, Large‑Model Architecture, and Business Impact in Financial Services

This article introduces the Seat Copilot developed by Qifu Technology, explains its composition, design, and core large‑model architecture, details data engineering, training and evaluation processes, and presents quantitative results showing improvements in operator efficiency, conversion rates, and management productivity.

AI · call center automation · financial technology
0 likes · 18 min read
DataFunSummit
Sep 13, 2024 · Artificial Intelligence

Research on Domain Large Models by Fudan University Knowledge Workshop Lab

This article presents the Fudan University Knowledge Workshop Lab's comprehensive research on domain large models, covering background, domain adaptation, capability enhancement, collaborative workflows, challenges such as inference cost and alignment, and proposed solutions including source‑enhanced training, self‑correction mechanisms, and hybrid retrieval‑augmented generation.

AI research · Knowledge Graphs · Retrieval-Augmented Generation
0 likes · 16 min read
IT Services Circle
Sep 8, 2024 · Artificial Intelligence

10 Essential Plots for Linear Regression with Python Code Examples

This tutorial explains ten crucial visualizations for linear regression—scatter plot, trend line, residual plot, normal probability plot, learning curve, bias‑variance tradeoff, residuals vs fitted, partial regression, leverage, and Cook's distance—each illustrated with clear Python code using scikit‑learn, matplotlib, seaborn, and statsmodels.

Matplotlib · Python · data visualization
0 likes · 21 min read
Model Perspective
Aug 18, 2024 · Fundamentals

How to Judge a Mathematical Model: 6 Practical Criteria for Success

This article outlines six essential criteria—accuracy, robustness, simplicity, explainability, generalization, and scalability—for evaluating the quality of mathematical models such as e‑commerce recommendation systems, helping readers assess whether a model is truly reliable or merely a flashy façade.

Recommendation systems · accuracy · explainability
0 likes · 3 min read
Kuaishou Tech
Jul 31, 2024 · Artificial Intelligence

Kuaishou’s Kolors Text‑to‑Image Model: Architecture, Evaluation, and Real‑World Applications

The article presents a comprehensive overview of Kuaishou’s Kolors (formerly 可图) multimodal generative model, detailing its data collection strategy, diffusion‑based architecture, evaluation metrics, derived capabilities such as prompt refinement and interactive generation, and a range of practical applications from AI‑powered live‑stream gifts to virtual try‑on, while also offering strategic advice for the domestic visual‑generation community.

AI applications · Kolors · diffusion models
0 likes · 27 min read
Kuaishou Tech
Jul 11, 2024 · Artificial Intelligence

Kuaishou Open-Sources Kolors: A High-Performance Text-to-Image Model Rivaling Midjourney v6

Kuaishou has officially open-sourced Kolors, a state-of-the-art text-to-image diffusion model that leverages ChatGLM3 for advanced bilingual text understanding and employs a two-stage training strategy to achieve photographic image quality rivaling leading proprietary systems.

Computer Vision · Text-to-Image Generation · diffusion models
0 likes · 8 min read
Xiaohongshu Tech REDtech
Jun 20, 2024 · Artificial Intelligence

Xiaohongshu 2024 Large Model Frontier Paper Sharing Live Event

On June 27, 2024, Xiaohongshu’s technical team will livestream a two‑hour session across WeChat Channels, Bilibili, Douyin and Xiaohongshu, showcasing six top‑conference papers on large‑model advances—including early‑stopping and fine‑grained self‑consistency, novel evaluation methods, negative‑sample‑assisted distillation, and LLM‑based note recommendation—followed by a Q&A and recruitment briefing.

AI research · Recommendation systems · Self-Consistency
0 likes · 12 min read
IT Services Circle
May 1, 2024 · Artificial Intelligence

Summary of Andrew Ng’s AI Agent Talk: Models, Workflows, and Design Patterns

The article summarizes Andrew Ng’s presentation on AI agents, contrasting traditional single‑prompt large‑model usage with iterative agent‑based workflows, reporting experimental accuracy gains, and outlining four agent design patterns—reflection, tool use, planning, and multi‑agent collaboration—while discussing practical trade‑offs such as latency and token speed.

AI Agent · design patterns · large language model
0 likes · 7 min read