The Validation Status Problem: Conceptual, Partially Validated, and Empirically Validated Are Not the Same Thing

Behavioral AI products routinely conflate theoretical inspiration with empirical validation. We argue that a three-level validation status column, attached to every behavioral module, is the minimum discipline a regulated decisioning system owes its reviewers.

Tags: ebc-framework, validation-status, regulated-ai, limitations

1. A conflation that becomes a liability

When a behavioral AI product is pitched to an institutional buyer, a familiar rhetorical move appears. The pitch cites a canonical social-science paper — Tyler on procedural legitimacy, Kahneman and Tversky on prospect theory, Goffman on face-work — and asserts that the product “is built on” or “operationalizes” that theory. The institutional buyer, who is not a behavioral scientist, absorbs the citation as evidence and treats the product as empirically grounded.

The conflation is between two very different claims:

  1. “The canonical source theory has decades of empirical support in its original domain.”
  2. “The specific operationalization in this product — the choice of observable proxies, the threshold values, the constraint logic — has been empirically validated in the deployment context.”

Claim (1) is usually true. Claim (2) is usually not. The gap between the two is the validation status problem, and it becomes a liability the moment a serious supervisory review asks the right question.

2. The question that exposes the gap

Imagine a supervisory review of a recovery decisioning system that uses a “procedural justice” module based on Tyler (1990). The reviewer asks:

“Tyler’s work demonstrates that perceived legitimacy predicts compliance in criminal-justice and tax-enforcement contexts. Your module claims to measure perceived legitimacy from — among other features — silence duration, dispute history, and response refusals. On what empirical basis do you treat those observable proxies as valid measurements of perceived legitimacy in debt-recovery interactions?”

The answer a conventional behavioral-AI product can give is some combination of:

  1. “The proxies are plausible indicators of perceived legitimacy.”
  2. “The module operationalizes Tyler’s theory, which has decades of empirical support.”
  3. “Systems that use the module show improved aggregate recovery outcomes.”

None of those answers is responsive. The first is an assertion of plausibility without a study. The second is a theoretical hand-wave. The third is an aggregate correlation that says nothing about whether the specific operationalization measures the specific latent construct.

A reviewer trained in evaluation methodology — and supervisors are increasingly trained in evaluation methodology — will not accept any of these as validation evidence. What they will accept, at minimum, is an explicit acknowledgment: “The theoretical source is established; the specific operationalization is not yet empirically validated in this domain.” That acknowledgment is only possible if the system has been built with a place to put it.

3. A three-level taxonomy

The Explicit Behavioral Constraint (EBC) framework requires every behavioral module to carry a validation status field V with one of three values:

V: Conceptual
  Meaning: The theoretical source construct has established empirical support in its original domain, but the specific operationalization — proxy choice, threshold values, constraint logic — has not been validated in the deployment context.
  What the reviewer should read into it: “The idea is well-grounded. The translation to this domain is a design hypothesis.”

V: Partially Validated
  Meaning: The specific theoretical principle (proportional sanctions, framing effects) has empirical support in related applied contexts, but the calibration parameters in this system remain unvalidated.
  What the reviewer should read into it: “The principle travels to related settings. The exact numeric thresholds are author choices.”

V: Empirically Validated
  Meaning: Direct evidence exists in the deployment context that f(X) measures C, obtained through a documented validation study with pre-registered hypotheses.
  What the reviewer should read into it: “An external reviewer can reproduce the validation study and reach the same conclusion.”

The three levels are not equivalent. A module marked Empirically Validated carries a stronger epistemic commitment than one marked Conceptual, and the system documentation must make the difference legible. When a reviewer asks “what is the evidence?”, the field V is the first place to look, and the honest answer lives there.
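
To make the field concrete, here is a minimal sketch, in Python, of how a validation status V might be attached to a behavioral module record. The names (ValidationStatus, BehavioralModule) and the fields chosen are illustrative assumptions, not part of the EBC specification.

```python
from dataclasses import dataclass, field
from enum import Enum


class ValidationStatus(Enum):
    """The three-level validation status V carried by every behavioral module."""
    CONCEPTUAL = "conceptual"
    PARTIALLY_VALIDATED = "partially_validated"
    EMPIRICALLY_VALIDATED = "empirically_validated"


@dataclass
class BehavioralModule:
    """A behavioral module together with the evidence trail behind its status."""
    name: str                          # e.g. "procedural_justice"
    source_theory: str                 # e.g. "Tyler (1990), procedural legitimacy"
    status: ValidationStatus
    evidence: list[str] = field(default_factory=list)  # study identifiers, citations


# The procedural-justice module from section 2, honestly labelled:
procedural_justice = BehavioralModule(
    name="procedural_justice",
    source_theory="Tyler (1990), procedural legitimacy",
    status=ValidationStatus.CONCEPTUAL,
    evidence=[],  # no validation study in the debt-recovery context yet
)
```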

4. Why this is not a semantic game

The three-level taxonomy would be a semantic game if it were only ever used as a label. Its value comes from the structural consequences it imposes on the system:

A Conceptual module cannot claim empirical authority. If V is Conceptual, the product documentation, the sales pitch, and the customer-facing explainability surfaces must all say so. “This module operationalizes Tyler’s procedural justice theory. The specific operationalization is conceptual and has not been empirically validated in the debt-recovery context.” That sentence is not a liability. It is an honest disclosure, and honest disclosure is what a serious reviewer rewards.
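
Continuing the sketch above, one way to keep the documentation, the pitch, and the explainability surface consistent is to generate the disclosure sentence from the status field rather than writing it by hand. The wording and the function name below are assumptions for illustration.

```python
def disclosure_text(module: BehavioralModule) -> str:
    """Return the status disclosure a customer-facing surface should display."""
    if module.status is ValidationStatus.CONCEPTUAL:
        return (
            f"This module operationalizes {module.source_theory}. The specific "
            "operationalization is conceptual and has not been empirically "
            "validated in the deployment context."
        )
    if module.status is ValidationStatus.PARTIALLY_VALIDATED:
        return (
            f"This module applies a principle from {module.source_theory} with "
            "empirical support in related applied contexts; its calibration "
            "parameters have not been validated in the deployment context."
        )
    return (
        f"This module's operationalization of {module.source_theory} has been "
        "validated in the deployment context; the validation study is available "
        "for review."
    )
```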

Validation status can degrade. A module marked Partially Validated today can be demoted to Conceptual tomorrow if a new study casts doubt on the operationalization. The reverse is also true: a Conceptual module can be promoted to Partially Validated once the operationalization survives an annotation study in a related context. The field is updated as evidence accumulates, and the history of that field is itself auditable.
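
One way to make that history auditable, continuing the same sketch, is an append-only log of status transitions; the record structure and the rule that every change must cite a justification are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatusTransition:
    """One auditable promotion or demotion of a module's validation status."""
    timestamp: datetime
    old_status: ValidationStatus
    new_status: ValidationStatus
    justification: str       # study identifier, or the doubt that forced a demotion


def change_status(module: BehavioralModule,
                  history: list[StatusTransition],
                  new_status: ValidationStatus,
                  justification: str) -> None:
    """Apply a status change and append it to the history; the log is never rewritten."""
    if not justification:
        raise ValueError("A status change must cite the evidence (or doubt) behind it.")
    history.append(StatusTransition(
        timestamp=datetime.now(timezone.utc),
        old_status=module.status,
        new_status=new_status,
        justification=justification,
    ))
    module.status = new_status
```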

The sum of a system’s module statuses is a reviewability budget. If every module in a decisioning system is marked Conceptual, the system has a weak aggregate validation posture — and the honest documentation of that weakness is still stronger than the dishonest documentation of a fictional strength. A system that ships with a mix — some Conceptual, some Partially Validated — gives the reviewer a map of where to focus validation effort. A system that has no validation status field at all gives the reviewer a blank page.
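
The aggregate posture described here can be computed directly from the module registry; this is a short sketch under the same illustrative assumptions as above.

```python
from collections import Counter


def validation_posture(modules: list[BehavioralModule]) -> Counter:
    """Count how many modules sit at each validation level across the system."""
    return Counter(m.status for m in modules)


# A reviewer who sees {CONCEPTUAL: 7, PARTIALLY_VALIDATED: 2, EMPIRICALLY_VALIDATED: 0}
# has a map of where validation effort should be focused next.
```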

5. Three validation studies every behavioral module eventually needs

For a module to move from Conceptual or Partially Validated to Empirically Validated in the debt-recovery context, three studies are typically required (Chaara, 2026, §9.1):

  1. Annotation study. Domain experts annotate debtor communications and interactions for the presence of the latent construct (legitimacy perception, face threat, behavioral disengagement). The proxy measurements used by f(X) are correlated with those annotations. A module whose proxies are uncorrelated with expert annotations fails the annotation study, and its operationalization is revised.
  2. Sensitivity analysis of threshold values. The module’s constraint logic typically uses numeric thresholds: “if legitimacy score L_T < 0.30, freeze escalation.” A sensitivity analysis varies those thresholds across a realistic range and measures how the system behavior changes. A module whose output is highly sensitive to threshold values that were chosen without data is not robust; a minimal sketch of such an analysis appears after this list.
  3. Discriminant validity testing. The proxy measurements must distinguish the intended construct from related but distinct constructs. “Does our response-latency signal measure behavioral disengagement, or does it simply measure capacity constraints?” A module that cannot answer this question is operationalizing a latent construct that may not be the one it names.
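
As a concrete illustration of the second study, the sketch below varies the escalation-freeze threshold across a realistic range and measures how often the constraint would fire on a set of historical legitimacy scores. The data are simulated, and the threshold range, function names, and score distribution are assumptions, not a prescribed protocol.

```python
import numpy as np


def freeze_rate(legitimacy_scores: np.ndarray, threshold: float) -> float:
    """Fraction of interactions in which escalation would be frozen (L_T < threshold)."""
    return float(np.mean(legitimacy_scores < threshold))


def threshold_sensitivity(legitimacy_scores: np.ndarray,
                          thresholds: np.ndarray) -> dict[float, float]:
    """Map each candidate threshold to the freeze rate it would produce.

    If the rate swings sharply across plausible thresholds, the module's behavior
    depends heavily on a number that was chosen without data.
    """
    return {float(t): freeze_rate(legitimacy_scores, t) for t in thresholds}


# Simulated scores stand in for historical interactions scored by the module.
rng = np.random.default_rng(0)
scores = rng.beta(2.0, 3.0, size=5_000)   # hypothetical legitimacy scores in [0, 1]
grid = np.linspace(0.20, 0.40, 9)         # a band around the shipped value of 0.30
for t, rate in threshold_sensitivity(scores, grid).items():
    print(f"threshold={t:.3f}  freeze_rate={rate:.2%}")
```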

None of these three studies is exotic. All of them are standard practice in evaluation research. Very few commercial behavioral-AI products disclose that they have run any of them.

6. The counterargument we take most seriously

The strongest objection is that the three-level taxonomy imposes an epistemic standard that will kill any commercial behavioral-AI product before it ships. If every module must be Empirically Validated before deployment, and validation requires field studies that take months to years, no product will ever be launchable.

We grant the objection partially. Our response has three parts:

  1. The taxonomy is permissive, not prohibitive. The EBC framework explicitly permits the deployment of modules marked Conceptual, on the condition that the marking is visible to the operator, the supervisor, and — when the decision is challenged — the data subject. Deployment of an unvalidated module is acceptable; silent deployment of an unvalidated module is not.
  2. The alternative is not no-validation; it is invisible validation. Every deployed behavioral-AI system has a validation status. In most systems, that status is implicitly Conceptual for every component, and the invisibility of that fact is the thing the framework corrects.
  3. The cost of honesty is far below the cost of a supervisory finding. A system deployed with honest Conceptual markers, challenged by a supervisor, and defended through an explicit limitations document survives the challenge. A system deployed with implicit “this is empirically grounded” rhetoric, challenged by the same supervisor, and unable to produce the validation studies the rhetoric implied is now in a regulatory dispute. The first cost is a documentation effort. The second is a business cost.

7. What this article does not claim

8. Further reading