A/B Comparison of Direct Document Feeding vs Semantic Governance for Industrial Software Test Case Generation

The article describes an A/B experiment that compares a baseline AI pipeline, which consumes documentation directly, with a knowledge‑embedded pipeline that adds semantic governance. With output size held constant, structured knowledge assets raise the overall quality score of generated test points and test cases from 52.7 to 59.2, against an expert baseline of 82.0.

AI Large-Model Wave and Transformation Guide

Experiment Overview

The study evaluates AI‑generated test points and test cases for the Mass Flow Outlet feature of an industrial simulation software package. Two generation pipelines are compared with the output size held identical (18 test points, 65 test cases), so that differences in quality can be attributed to semantic governance rather than to volume.

Generation Methods

Gen1 (baseline): Direct generation from requirements, design documents, and test context, without semantic modeling, entity modeling, or type validation.

Gen2 (knowledge‑embedded): Builds on the Gen1 step. Extracted test patterns, domain knowledge, semantic alignment, a knowledge graph, and rule constraints are injected before generation.

Gen1 Quantitative Results

Gen1 produced the expected quantity (18 points, 65 cases) and covered 122 requirement rules (vs. 72 expert rules) with 87 % semantic equivalence, but structural quality was poor.

Explicit state count: 0 (expert 61) – missing state modeling.

Risk count: 24 (expert 53) – 45 % of expert coverage.

Domain knowledge fields: 25 (expert 40) – missing expert explanations.

Defect association: 18 (expert 30) – insufficient mapping.

Boundary values: 6 (expert 12) – only half covered, biased to numeric.
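The counts above can be turned into coverage ratios with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the dictionary keys and the function name are illustrative, not taken from the original scoring script.

```python
# Gen1 count and expert count per dimension, as reported in the list above.
GEN1_VS_EXPERT = {
    "explicit_states": (0, 61),
    "risks": (24, 53),
    "domain_knowledge_fields": (25, 40),
    "defect_associations": (18, 30),
    "boundary_values": (6, 12),
}


def coverage_ratios(counts):
    """Return Gen1 count divided by expert count, rounded to 3 decimals."""
    return {name: round(gen1 / expert, 3) for name, (gen1, expert) in counts.items()}


ratios = coverage_ratios(GEN1_VS_EXPERT)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

The risk ratio comes out at about 45 % and the boundary-value ratio at 50 %, which matches the figures quoted in the list.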

Gen1 Comprehensive Score

A multi‑dimensional scoring framework gave Gen1 an overall score of 52.7 (Grade D). High marks were obtained for semantic equivalence (87 / 100, Grade A) and rule coverage (85 / 100, Grade A). Very low marks were recorded for state modeling (5 / 100, F), workflow (40 / 100, D), boundary values (45 / 100, D), and persistence (25 / 100, F).
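The article does not state the grade thresholds used by the scoring framework. As a rough sketch, the bands below are an assumption chosen only because they reproduce every score and grade pair reported above; the B and C cut-offs in particular are not constrained by the data and are purely illustrative.

```python
# Hypothetical grade bands (minimum score, letter). Not taken from the article.
GRADE_BANDS = [(85, "A"), (70, "B"), (60, "C"), (40, "D")]

# Score and grade pairs reported for Gen1 in the text above.
REPORTED = {
    "overall": (52.7, "D"),
    "semantic_equivalence": (87, "A"),
    "rule_coverage": (85, "A"),
    "state_modeling": (5, "F"),
    "workflow": (40, "D"),
    "boundary_values": (45, "D"),
    "persistence": (25, "F"),
}


def grade(score):
    """Map a 0-100 score to a letter using the assumed bands."""
    for minimum, letter in GRADE_BANDS:
        if score >= minimum:
            return letter
    return "F"


for name, (score, reported_letter) in REPORTED.items():
    print(f"{name}: {score} -> {grade(score)} (reported {reported_letter})")
```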

Knowledge Assets Added in Gen2

State‑modeling library with 53 patterns (activation, deactivation, mutual exclusion, degradation, visibility, persistence verification, batch operations, dynamic UI updates).

Physical‑reasoning rule set (7 rules) linking boundary types, code‑architecture assumptions, default values, numeric intervals, solver algorithms, mass‑conservation checks, and difference‑driven validation.

Scenario‑specific persistence rules derived from historical test cases (e.g., constant mode, profile mode, multiphase, radiation, DPM, degradation).

Three generation constraints: each test point must output a state, reasoning must contain a physical explanation, and persistence requirements must be differentiated by scenario.
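One way to enforce the three generation constraints is a post-generation check. The sketch below is an assumption about how such a check could look: the GeneratedPoint fields, the keyword list used as a crude proxy for "physical explanation", and both function names are hypothetical and do not come from the original pipeline.

```python
from dataclasses import dataclass

# Hypothetical keyword list; a crude proxy for "contains a physical explanation".
PHYSICS_TERMS = ("mass flow", "mass conservation", "reverse flow", "solver")


@dataclass
class GeneratedPoint:
    """Hypothetical shape of one generated test point."""
    name: str
    scenario: str
    state: str
    reasoning: str
    persistence: str


def check_point(point):
    """Return the per-point constraint violations as a list of strings."""
    problems = []
    if not point.state.strip():
        problems.append("missing state")
    if not any(term in point.reasoning.lower() for term in PHYSICS_TERMS):
        problems.append("reasoning lacks a physical explanation")
    if not point.persistence.strip():
        problems.append("missing persistence requirement")
    return problems


def undifferentiated_scenarios(points):
    """Return pairs of distinct scenarios that share identical persistence text."""
    first_seen = {}
    clashes = []
    for point in points:
        key = point.persistence.strip().lower()
        if not key:
            continue
        owner = first_seen.setdefault(key, point.scenario)
        if owner != point.scenario:
            clashes.append((owner, point.scenario))
    return clashes


points = [
    GeneratedPoint("constant-mode outlet", "constant mode", "Activated",
                   "Negative mass flow implies reverse flow.",
                   "Value is restored after the case is saved and reopened."),
    GeneratedPoint("profile-mode outlet", "profile mode", "",
                   "Checks that the field is shown.",
                   "Value is restored after the case is saved and reopened."),
]

for p in points:
    print(p.name, check_point(p))
print(undifferentiated_scenarios(points))
```

The second example point fails the first two constraints, and the pair of scenarios is flagged because both reuse the same persistence text.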

Knowledge‑Embedding Examples

Semantic objects: Raw test steps are transformed into JSON‑like entities describing intent, covered rules, states, risks, applicable test patterns, and boundary values.
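A semantic object of this kind might look like the following sketch. The six field names follow the list in the sentence above; the class name, the example values, and the rule identifier are hypothetical, since the article does not reproduce the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SemanticTestObject:
    """Hypothetical entity; field names follow the list in the text."""
    intent: str
    covered_rules: list = field(default_factory=list)
    states: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    test_patterns: list = field(default_factory=list)
    boundary_values: list = field(default_factory=list)


def to_json(obj):
    """Serialize the entity to a stable, human-readable JSON string."""
    return json.dumps(asdict(obj), indent=2, sort_keys=True)


example = SemanticTestObject(
    intent="Verify that a negative mass-flow rate is handled at the outlet",
    covered_rules=["R-EXAMPLE-01"],  # hypothetical rule id
    states=["Mass Flow Outlet: Activated"],
    risks=["reverse flow not reported to the user"],
    test_patterns=["activation"],
    boundary_values=[-1.0, 0.0],
)
print(to_json(example))
```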

Domain reasoning rule PR‑004: Maps numeric intervals to physical meanings, enabling inference such as “negative mass‑flow rate implies reverse flow”.
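An interval-to-meaning rule of this kind can be sketched as a small lookup. Only the negative interval and its meaning ("reverse flow") come from the article; the zero and positive entries, the class, and the function name are assumptions added to make the example complete.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IntervalRule:
    label: str
    contains: Callable[[float], bool]
    meaning: str


# Sketch of a PR-004-style rule table for a mass-flow rate.
PR_004 = [
    IntervalRule("negative", lambda x: x < 0, "reverse flow"),  # from the article
    IntervalRule("zero", lambda x: x == 0, "no flow through the outlet (assumed)"),
    IntervalRule("positive", lambda x: x > 0, "nominal outflow (assumed)"),
]


def interpret_mass_flow(rate):
    """Return the physical meaning attached to the interval containing rate."""
    if math.isnan(rate):
        raise ValueError("mass-flow rate must be a number")
    for rule in PR_004:
        if rule.contains(rate):
            return rule.meaning
    raise ValueError(f"no interval covers {rate!r}")


print(interpret_mass_flow(-0.5))
```

Keeping the rules as data rather than as branches means a generator can cite the matching rule label in its reasoning field.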

State‑pattern entry: Example JSON for a chemical‑reaction “On” state with triggers and observable properties, reusable for the Mass Flow Outlet scenario.
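The article does not reproduce the state-pattern JSON itself. The sketch below shows what such an entry and its reuse could look like; every key, the pattern identifier, and the trigger and observable strings are hypothetical placeholders.

```python
import copy

# Hypothetical state-pattern entry for a chemical-reaction "On" state.
CHEMICAL_REACTION_ON = {
    "pattern_id": "SP-EXAMPLE",  # hypothetical id
    "feature": "Chemical Reaction",
    "state": "On",
    "triggers": ["user enables the feature"],
    "observables": ["dependent input fields become visible"],
}


def retarget(pattern, feature):
    """Copy a pattern and point it at another feature, leaving the original untouched."""
    reused = copy.deepcopy(pattern)
    reused["feature"] = feature
    return reused


mfo_pattern = retarget(CHEMICAL_REACTION_ON, "Mass Flow Outlet")
print(mfo_pattern["feature"], mfo_pattern["state"])
```

A deep copy is used so that editing the triggers of the reused entry does not silently change the library entry it was derived from.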

Knowledge‑graph nodes (BC‑MFO, PC‑MassFlux, etc.) and relations encode constraints, UI elements, and knowledge items, allowing the agent to traverse the graph during generation.
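Graph traversal during generation can be illustrated with a plain adjacency map and a breadth-first walk. Only the node names BC-MFO and PC-MassFlux appear in the article; the other nodes, all relation names, and the helper functions are hypothetical, and the real system may use a graph database instead.

```python
from collections import deque

# (source, relation, target) triples. Only BC-MFO and PC-MassFlux are named in
# the article; the remaining nodes and every relation name are hypothetical.
EDGES = [
    ("BC-MFO", "has_parameter", "PC-MassFlux"),
    ("PC-MassFlux", "shown_in", "UI-MassFluxField"),
    ("PC-MassFlux", "constrained_by", "K-ReverseFlow"),
    ("BC-MFO", "has_parameter", "PC-Direction"),
]


def build_adjacency(edges):
    """Group outgoing (relation, target) pairs by source node."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
    return adjacency


def reachable(adjacency, start):
    """Breadth-first traversal; returns nodes in visit order, excluding start."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in adjacency.get(node, []):
            if target not in visited:
                visited.add(target)
                order.append(target)
                queue.append(target)
    return order


graph = build_adjacency(EDGES)
print(reachable(graph, "BC-MFO"))
```

Starting from the boundary-condition node, the walk collects the parameters first and then the UI elements and knowledge items attached to them, which is the context a generator would need for one test point.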

Full‑Scale A/B Statistics

Test‑point level (output size unchanged):

Explicit states: 0 → 59 (+59).

Domain knowledge fields: 619 → 619 (no change).

Persistence requirements: 0 → 54 (+54).

Test‑case level:

Explicit states: 0 → 65 (+65).

Domain knowledge fields: 1,692 → 4,556 (≈ 2.7×).

Persistence fields: 0 → 84 (+84).

Reasoning field size: 0 → 6,807 (+6,807 characters).
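The deltas and the ratio quoted above follow directly from the raw counts. The sketch below recomputes them; the numbers are the ones reported in this section, while the dictionary keys and the function name are illustrative.

```python
# (Gen1, Gen2) counts as reported in the statistics above.
TEST_POINT_LEVEL = {
    "explicit_states": (0, 59),
    "domain_knowledge_fields": (619, 619),
    "persistence_requirements": (0, 54),
}
TEST_CASE_LEVEL = {
    "explicit_states": (0, 65),
    "domain_knowledge_fields": (1692, 4556),
    "persistence_fields": (0, 84),
    "reasoning_chars": (0, 6807),
}


def summarize(stats):
    """Return the absolute delta and, where Gen1 is non-zero, the Gen2/Gen1 ratio."""
    rows = {}
    for name, (gen1, gen2) in stats.items():
        ratio = round(gen2 / gen1, 2) if gen1 else None  # undefined when Gen1 is 0
        rows[name] = {"delta": gen2 - gen1, "ratio": ratio}
    return rows


for level, stats in (("test point", TEST_POINT_LEVEL), ("test case", TEST_CASE_LEVEL)):
    for name, row in summarize(stats).items():
        print(level, name, row)
```

Most Gen1 counts are zero, so a ratio is only meaningful for the domain knowledge fields, where 4,556 / 1,692 gives roughly 2.7.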

Overall Evaluation

The scoring script compared Gen1, Gen2, and a human expert:

Gen1 baseline: 52.7 (Grade D).

Gen2 knowledge‑embedded: 59.2 (+6.5 points).

Human expert baseline: 82.0.

Gen2 reaches 72.2 % of the expert score (59.2 / 82.0). Improvements are concentrated in the following dimensions:

State modeling: 5 → 29.5 (+24.5).

Domain reasoning: 45 → 80 (+35).

Persistence verification: 25 → 100 (+75).

Risk awareness: 55 → 85 (+30).

Workflow propagation: 40 → 65 (+25).

Two dimensions regressed:

Boundary‑value coverage: 45 → 20 (‑25).

Strategy‑quality: 65 → 1.7 (‑63.3) because the new format did not preserve a stable strategy field recognizable by the evaluation script.
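The overall figures and the per-dimension changes can be rechecked from the reported scores. The sketch below uses only numbers from this section; the dictionary keys are illustrative, and the weighting that combines dimensions into the overall score is not given in the article, so it is not reproduced here.

```python
GEN1_TOTAL, GEN2_TOTAL, EXPERT_TOTAL = 52.7, 59.2, 82.0

# (Gen1, Gen2) dimension scores as reported above.
DIMENSIONS = {
    "state_modeling": (5, 29.5),
    "domain_reasoning": (45, 80),
    "persistence_verification": (25, 100),
    "risk_awareness": (55, 85),
    "workflow_propagation": (40, 65),
    "boundary_value_coverage": (45, 20),
    "strategy_quality": (65, 1.7),
}

deltas = {name: round(gen2 - gen1, 1) for name, (gen1, gen2) in DIMENSIONS.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]
overall_gain = round(GEN2_TOTAL - GEN1_TOTAL, 1)
share_of_expert = round(GEN2_TOTAL / EXPERT_TOTAL * 100, 1)

print(deltas)
print("regressed:", regressions)
print(f"overall gain: {overall_gain}, share of expert score: {share_of_expert}%")
```

The two regressions are large enough to offset much of the gain in the other five dimensions, which is consistent with the modest overall improvement of 6.5 points.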

Key Findings

Semantic governance transforms AI output from readable text into usable structured assets by adding explicit state, persistence, workflow, and risk fields.

Embedding knowledge assets yields measurable quality gains without increasing the number of generated items.

Remaining challenges include stabilizing boundary‑value generation and representing test‑strategy information as structured fields.

Evaluation reports themselves become valuable semantic assets that should be continuously ingested into the knowledge base.
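The strategy-quality regression is attributed to the evaluation script no longer recognizing the strategy field in the new format. The sketch below illustrates that failure mode and one possible mitigation; the field names, the alias list, and both functions are hypothetical and are not the article's scoring script.

```python
# Hypothetical alternative names under which strategy text might appear.
STRATEGY_ALIASES = ("strategy", "test_strategy", "strategy_notes")


def strict_strategy_rate(items, key="strategy"):
    """Fraction of items with a non-empty value under one fixed key."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.get(key)) / len(items)


def normalize(item):
    """Copy the first non-empty alias into the canonical 'strategy' key."""
    out = dict(item)
    for alias in STRATEGY_ALIASES:
        if out.get(alias):
            out["strategy"] = out[alias]
            break
    return out


items = [
    {"name": "case 1", "strategy": "equivalence partitioning"},
    {"name": "case 2", "test_strategy": "boundary analysis"},
    {"name": "case 3", "reasoning": "strategy is only implied in free text"},
]

print(strict_strategy_rate(items))
print(strict_strategy_rate([normalize(item) for item in items]))
```

Normalizing field names before scoring recovers strategies that were merely renamed, but not those buried in free text, which is why the finding above calls for a structured strategy field.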

Conclusion

The A/B experiment indicates that feeding raw documents to a large language model can generate many test items, but without semantic governance the results lack state modeling, scenario‑aware persistence, workflow propagation, domain reasoning, and numeric stability. By extracting test patterns, consolidating domain knowledge, aligning semantics, building a knowledge graph, and embedding rules, Gen2 improves the overall score from 52.7 to 59.2 and reaches 72.2 % of the expert score of 82.0. The study also highlights that semantic governance is an ongoing process: boundary‑value coverage and strategy quality both regressed in Gen2, so boundary‑value handling and strategy‑field structuring still require dedicated knowledge‑base work.
