Background & Motivation
Frontier language models perform well on structured assessments and domain examinations. They perform substantially worse on the kind of reasoning that practitioners actually apply in high-stakes conditions: holding competing hypotheses simultaneously, recognizing when not to act, calibrating confidence explicitly, and updating assessments as incomplete information arrives.
This gap is documented. HealthBench (OpenAI, May 2025) showed frontier models scoring 32% on realistic clinical evaluations — a 39 to 45 point gap versus exam performance. MedXpertQA (ICML 2025) documented similar divergence between benchmark scores and applied clinical competence. The pattern is consistent across domains: models that pass expert examinations do not reliably demonstrate expert judgment under realistic conditions.
The underlying cause is a data problem. The reasoning that makes practitioners irreplaceable — the deliberative process of working through ambiguity toward a calibrated decision — has never been systematically captured as training data. What has been captured is outcomes, not process. Answers, not reasoning chains. The post-training pipelines that shape model behavior are built primarily on what experts conclude, not on how they think.
This study tests a specific hypothesis: that structured annotation of expert reasoning, used as fine-tuning data for a base language model, can produce a model whose reasoning outputs are not reliably distinguishable from those of the source practitioner by independent domain experts operating under blind conditions.
Why a military domain for the first study
The military operational domain was selected for three reasons. First, it imposes the tacit knowledge challenge in its most acute form: nine years of operational deployments produces both deliberative reasoning capability and deep pattern recognition that is genuinely difficult to verbalize. If fine-tuning on reasoning traces produces indistinguishable outputs in this domain, the finding generalizes more confidently to other expert domains.
Second, the domain has a natural adversarial structure for the evaluation: evaluators from different national military traditions with different doctrinal baselines provide a more rigorous test of reasoning quality than evaluators from a shared training background.
Third, and most relevant to the labs this work is designed to inform: the specific reasoning capabilities tested here — calibrated decision-making under incomplete information, when-not-to-act judgment, explicit uncertainty expression — are precisely the capabilities identified as most deficient in current frontier models across all high-stakes domains.
Expert Profile & Capture Directions
The source practitioner is a French military specialist with nine years of active service and multiple operational external deployments (OPEX). Their professional background includes extensive experience with high-stakes operational decisions made under time pressure, incomplete situational information, and rules of engagement constraints. Identity is withheld.
Three capture directions were defined prior to the first session, co-designed around the practitioner's operational profile and the demand signal from frontier AI labs regarding post-training gaps.
Each direction was designed to probe the specific reasoning capabilities that current frontier models most consistently fail to demonstrate — not general domain knowledge, but the process of reasoning under real operational constraints.
Capture Methodology
The practitioner produced 250 reasoning traces over a twelve-week engagement, at a maximum pace of 20 to 25 traces per week. The pace constraint was deliberate: cognitive performance on structured reasoning tasks degrades with fatigue, and the quality of individual traces is the primary determinant of agent quality.
Two-phase case design
Cases were not randomly assigned. Prior to the capture sessions, 250 cases were designed in two phases. Phase 1 cases (200) were breadth-focused: representative of the practitioner's normal operational environment, varying difficulty, full coverage of all three capture directions. Phase 2 cases (50) were discrimination-focused: harder, more atypical, specifically designed to probe the failure modes the evaluation would measure. The practitioner was not informed of the phase designation during sessions.
Trace structure
Each trace followed a five-section format. The first four sections were self-directed.
The fifth section — the tacit probes — was administered by a facilitator after each trace was complete. These probes are designed to surface pre-verbal signal and implicit decision boundaries that structured verbalization alone does not capture. They do not fully resolve the tacit knowledge problem (see Section 8, Limitations), but they consistently move more of the tacit layer into articulable range.
Each trace was tagged by the practitioner as deliberative, pattern-based, or mixed, based on their own characterization of the primary reasoning mode used.
Dataset split
Of the 250 traces, 130 were assigned to fine-tuning and 120 were held out for evaluation. The held-out set was further divided: 60 cases were used to generate agent outputs (test set A), and 60 cases were re-engaged by the source practitioner independently, without access to their original traces, producing fresh human reference outputs (test set B). This re-engagement design ensures that human reference outputs reflect the practitioner's cold reasoning on evaluation cases, not recall of previously written traces.
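The 130/60/60 partition above can be sketched as follows. The study does not describe its assignment mechanism, so a seeded shuffle is assumed here purely for reproducibility, and the case identifiers are placeholders.

```python
import random

def split_cases(case_ids, seed=0):
    """Partition 250 case IDs into fine-tuning (130), agent test set A (60),
    and practitioner re-engagement test set B (60)."""
    assert len(case_ids) == 250
    rng = random.Random(seed)          # fixed seed: assumed, for reproducibility
    shuffled = case_ids[:]
    rng.shuffle(shuffled)
    return {
        "fine_tune": shuffled[:130],   # supervised fine-tuning traces
        "test_a": shuffled[130:190],   # agent generates outputs on these
        "test_b": shuffled[190:250],   # practitioner re-engages these cold
    }

splits = split_cases([f"case-{i:03d}" for i in range(250)])
```

The three subsets are disjoint by construction, which is what makes the re-engagement design meaningful: the practitioner's fresh outputs cannot leak from the training set.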
Fine-tuning Protocol
The 130 training traces were formatted as prompt-completion pairs for supervised fine-tuning. Each pair consisted of a system prompt establishing the practitioner's operational profile (without identifying information), a user prompt presenting the operational scenario, and a completion containing the full five-section reasoning trace.
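The pairing described above can be serialized in the chat-style layout most supervised fine-tuning tooling accepts. The field names and system prompt text below are illustrative; the study does not publish its exact schema.

```python
import json

# Hypothetical system prompt; the study's actual wording is not published.
SYSTEM_PROMPT = (
    "You are an experienced military operational practitioner. "
    "Reason through the scenario using the five-section trace format."
)

def to_training_record(scenario: str, trace: str) -> str:
    """Serialize one case as a JSON line: system prompt (practitioner
    profile), user prompt (scenario), completion (full reasoning trace)."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": scenario},
            {"role": "assistant", "content": trace},
        ]
    }
    return json.dumps(record, ensure_ascii=False)

line = to_training_record("Checkpoint scenario ...", "1. Situation: ...")
```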
Fine-tuning was performed using Low-Rank Adaptation (LoRA) on Llama 3.1 8B Instruct on a single A100 GPU instance via RunPod. The choice of LoRA over full fine-tuning reflects three considerations: computational efficiency, reduced risk of catastrophic forgetting of the base model's general capabilities, and sufficiency for reasoning style and pattern capture at this data volume.
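The efficiency argument can be made concrete: for a single frozen weight matrix, LoRA trains two rank-r factors whose product is added to the original weight, rather than updating the full matrix. The dimensions and rank below are illustrative only, not the study's actual hyperparameters.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Trainable parameters for one weight matrix: full fine-tuning updates
    d_in * d_out entries; LoRA trains two low-rank factors A (rank x d_in)
    and B (d_out x rank) and adds B @ A to the frozen weight."""
    full = d_in * d_out
    lora = rank * d_in + d_out * rank
    return full, lora

# Illustrative: a 4096 x 4096 attention projection adapted at rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
ratio = lora / full  # fraction of parameters that remain trainable
```

At these example dimensions, under 1% of the matrix's parameters are trained, which is what makes single-GPU fine-tuning tractable and limits drift from the base model's general capabilities.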
The resulting fine-tuned model — referred to as the agent throughout this document — was evaluated against held-out cases prior to the formal blind evaluation to confirm basic output quality and format compliance. No adjustments were made to fine-tuning parameters between this internal review and the formal evaluation.
Evaluation Protocol
The evaluation was designed to answer one question: can experienced domain practitioners reliably distinguish agent-generated reasoning outputs from outputs produced by the source human practitioner?
Evaluator composition
Five independent military evaluators were recruited from three different countries, representing at minimum two distinct national military doctrinal traditions. No two evaluators had a prior professional relationship or prior acquaintance with each other. Evaluators were not informed that they were participating in an AI evaluation study; they were told they were assessing the quality of operational reasoning produced by different practitioners in a research context.
This blind design is critical. Evaluators who know they are judging AI versus human outputs look for AI tells — stylistic patterns, hedging conventions, formatting choices — rather than reasoning quality. Blind design forces evaluation on the substance of the reasoning itself.
Evaluation materials
120 outputs were prepared: 60 agent-generated (test set A) and 60 practitioner-generated (test set B). All outputs were anonymized with random identifiers. Identifying information, including any reference to the source practitioner's nationality, unit, or specific deployment context, was removed. Outputs were randomized and presented in a single pool to each evaluator.
Evaluation rubric
Each output was scored on four dimensions. Doctrinal alignment — whether the approach matched any specific national military doctrine — was explicitly excluded from scoring. Evaluators were instructed to score based on reasoning quality, not doctrinal correspondence. This exclusion was designed to prevent outputs from being penalized for reflecting a different national approach rather than inferior reasoning.
| Dimension | Scale | What is being assessed |
|---|---|---|
| Situational reasoning quality | 1–5 | Does the author identify the relevant factors? Is the threat assessment coherent and complete? |
| Decision logic | 1–5 | Is the engagement or abort decision well-reasoned? Are the conditions for action clearly articulated? |
| Uncertainty calibration | 1–5 | Does the author correctly identify what is unknown? Is confidence appropriately hedged? |
| Probable origin | Human / AI / Impossible to determine | Based on reasoning quality and structure only — not on writing style, formatting conventions, or doctrinal alignment. |
Each evaluator independently scored all 120 outputs. No communication between evaluators occurred during the evaluation period.
Statistical analysis
Three primary metrics were computed from the evaluation data:
Indistinguishability rate: the percentage of agent outputs (test set A) rated as "human" or "impossible to determine" across all evaluators. Reported with 95% confidence intervals. At n=60 agent outputs across 5 raters, confidence intervals are approximately ±12 percentage points.
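The study does not name the interval method behind the reported range. As one plausible reconstruction, a Wilson score interval on per-output classifications (n = 60, as stated above) reproduces the 61%–83% range in Section 6 for 44 of 60 outputs (~73%):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% at z = 1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half

# 44 of 60 agent outputs rated "human" or "impossible to determine".
p, lo, hi = wilson_ci(44, 60)
```

Treating the 60 outputs (rather than the 300 individual ratings) as the unit of analysis is an assumption here; it matches the n = 60 framing in the text.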
Quality delta: the difference in average score across the three quality dimensions between agent outputs and practitioner outputs. A negative delta indicates agent outputs scored lower than practitioner outputs. Reported per dimension and as an aggregate.
Inter-rater reliability: Krippendorff's alpha computed across all five raters on the origin classification dimension. An alpha above 0.60 indicates substantial agreement and validates the reliability of the indistinguishability metric.
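For reference, a minimal sketch of Krippendorff's alpha for nominal data, the form relevant to the three-way origin classification. Production analyses would normally use a vetted library implementation rather than hand-rolled code.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of rating lists, one per evaluated output, e.g. the
    five raters' origin labels. Units with fewer than two ratings are skipped.
    """
    o = Counter()  # coincidence matrix over ordered pairs within a unit
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            o[(a, b)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in o.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect agreement yields alpha = 1.0; agreement at chance level yields 0.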
Scores were additionally disaggregated by reasoning type tag: deliberative cases and pattern-based cases were analyzed separately to characterize where agent performance is strongest and where it diverges most from the source practitioner.
Results
Indistinguishability
Of the 60 agent outputs evaluated, 73% were rated as human or impossible to determine by the five evaluators (95% CI: 61%–83%). The remaining 27% were rated as AI-generated. Krippendorff's alpha on origin classification was 0.71, indicating substantial inter-rater agreement. The consistency of this result across five evaluators from three different national military traditions — each with distinct doctrinal baselines — strengthens confidence in the finding beyond what a single-tradition evaluation could provide.
Quality delta
The average quality delta between agent outputs and practitioner outputs was 0.31 points on the 5-point scale, with agent outputs scoring lower. Performance varied modestly across dimensions; the agent scored lowest on uncertainty calibration, the dimension most dependent on accumulated operational experience:
| Dimension | Agent avg. | Practitioner avg. | Delta |
|---|---|---|---|
| Situational reasoning quality | 3.91 | 4.18 | −0.27 |
| Decision logic | 3.87 | 4.21 | −0.34 |
| Uncertainty calibration | 3.74 | 4.07 | −0.33 |
| Aggregate | 3.84 | 4.15 | −0.31 |
All three dimensions fell within the pre-specified quality threshold of 0.5 points. The agent's weakest relative performance on uncertainty calibration is consistent with the tacit knowledge boundary discussed in Section 8: uncertainty calibration in operational contexts relies heavily on accumulated pattern recognition that is partially resistant to verbalization.
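The aggregate row follows directly from the per-dimension rows; a quick check of the table arithmetic:

```python
# Per-dimension averages copied from the results table above.
agent = {"situational": 3.91, "decision": 3.87, "uncertainty": 3.74}
practitioner = {"situational": 4.18, "decision": 4.21, "uncertainty": 4.07}

agent_aggregate = round(sum(agent.values()) / 3, 2)                # 3.84
practitioner_aggregate = round(sum(practitioner.values()) / 3, 2)  # 4.15
aggregate_delta = round(agent_aggregate - practitioner_aggregate, 2)
deltas = {k: round(agent[k] - practitioner[k], 2) for k in agent}
```

Every per-dimension delta stays inside the pre-specified 0.5-point threshold, with decision logic showing the largest gap (−0.34) and uncertainty calibration the lowest absolute agent score (3.74).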
Deliberative vs. pattern-based cases
Performance differed substantially across reasoning type. On deliberative cases — those requiring explicit multi-step reasoning through competing hypotheses — the indistinguishability rate was 79%. On pattern-based cases — those where the source practitioner characterized their reasoning as primarily intuitive or rapid recognition — the indistinguishability rate was 58%.
This 21-point divergence is expected and is discussed in Section 8 (Limitations). It does not represent a failure of the methodology — it represents an honest characterization of where fine-tuning on reasoning traces produces strongest results and where it falls short of the full practitioner cognitive profile. The deliberative case performance is the metric most directly relevant to post-training pipeline applications, where multi-step explicit reasoning traces are the primary output required.
Summary
A model fine-tuned on 130 structured reasoning traces from a single military practitioner produced outputs that experienced domain evaluators from three countries, operating under full blind conditions, rated as human or impossible to determine 73% of the time. The quality of those outputs, measured across three reasoning dimensions, was within 0.31 points of the source practitioner's independently produced outputs on held-out evaluation scenarios. Inter-rater reliability (α = 0.71) confirms the robustness of the evaluation.
Implications for Post-Training Data
The post-training pipelines that shape frontier model behavior require three types of input, and this work is directly relevant to each.
Reasoning trace generation
Preference data and supervised fine-tuning datasets for domains requiring multi-step reasoning under uncertainty benefit from reference outputs that reflect genuine practitioner reasoning rather than averaged or synthesized approximations. An agent that produces outputs indistinguishable from a verified practitioner at practitioner-level quality generates a different class of training signal than a prompted frontier model or a crowdsourced annotator.
Reward model calibration
Process Reward Models (PRMs) — models trained to evaluate the quality of reasoning steps rather than final answers — require bootstrapping data that accurately reflects what good reasoning looks like in a specific domain. The reasoning traces produced in this study, and by equivalent agents in other domains, represent high-quality bootstrapping material for PRMs in operational, clinical, legal, and analytical reasoning domains.
Evaluation standard-setting
The blind evaluation protocol used in this study — multiple independent domain experts from different national or institutional backgrounds, evaluating outputs without knowledge of their origin — constitutes a repeatable methodology for defining what practitioner-level reasoning means in a given domain. Applied systematically across multiple practitioners in a domain, it produces benchmark data with documented inter-expert agreement that can serve as an evaluation standard for frontier model assessment.
This is the work Trustwreck is building: not a general-purpose annotation platform, but a systematic infrastructure for capturing, replicating, and measuring expert reasoning across high-stakes domains.
Limitations
The tacit knowledge boundary
This study captures the explicit, deliberative reasoning layer of expert judgment — the part that is articulable when an expert is asked to explain their thinking. It does not fully capture the tacit, automatic pattern recognition layer that develops through years of operational experience and operates below the level of verbal articulation.
This boundary is visible in the performance divergence between deliberative and pattern-based cases documented in Section 6. Agents built on reasoning traces perform better on cases requiring explicit multi-step reasoning than on cases where the source practitioner's primary cognitive mode is rapid pattern recognition. The tacit probe methodology (Section 3) moves more of this layer into articulable range but does not eliminate the boundary.
This is a scope boundary, not a failure of the approach. The deliberative reasoning layer is the primary input labs need for post-training pipelines. Preference data, PRMs, and evaluation rubrics require the kind of reasoning that can be articulated and structured — not the instantaneous gestalt recognition that resists articulation.
Single practitioner
This study involves one source practitioner. The generalizability of findings to other practitioners — even in the same domain — cannot be assumed. Individual variation in reasoning style, verbalization tendency, and case complexity tolerance affects fine-tuning outcomes. Multi-practitioner studies in the same domain, with shared benchmark cases enabling inter-practitioner agreement analysis, are required before domain-level claims can be made.
130 training traces
Production-quality agents — defined here as agents intended for ongoing deployment in lab post-training pipelines — require substantially more training data than was used in this study. The 130-trace fine-tuning used here was calibrated for a proof-of-concept evaluation, not for production deployment. Production agents are built on progressive capture programs of 400 to 800 traces, with quality measured against domain benchmark improvement at each increment.
Domain specificity
The military operational domain has specific properties that may not generalize: cases are scenario-based rather than based on real patient or case data; the evaluation rubric was designed for this specific domain; and evaluators were recruited from a professional community with shared baseline competencies. Equivalent studies in clinical medicine, legal analysis, and intelligence assessment require domain-specific case design, rubric development, and evaluator recruitment protocols.
What Trustwreck Is Building
This study is the first published validation of a systematic approach to expert reasoning capture and replication that Trustwreck is applying across multiple high-stakes domains.
The network includes practitioners across clinical medicine, law and federal enforcement, intelligence analysis, military operations, critical industry safety, financial forensics, and scientific research. Each domain follows the same architecture: structured capture sessions with deliberate case design, progressive fine-tuning measured against domain benchmark improvement, and blind evaluation by independent domain experts as the quality gate before any agent is deployed.
The agents produced by this process are made available to frontier AI labs for post-training pipeline integration — annotation batch processing, preference pair generation, evaluation rubric design, and reward model calibration. A private expert directory, accessible to verified lab teams, documents the practitioner profiles available for equivalent capture programs across all domains.
The benchmarks produced as a byproduct of this process — beginning with MIL-BENCH-ROE from this study — are published with full methodology documentation and live model leaderboards. They are designed to serve as the evaluation standard for what practitioner-level reasoning means in each domain, independent of any specific agent deployment.