Pull-quote: “Federal procurement should require calibration metrics in every AI proposal. Anything less is buying a black box.”

Why this matters

Federal decision support runs on AI now. Risk scoring, fraud detection, predictive maintenance, safety analysis, mission planning — every category has at least one AI vendor pitching the agency. The procurement question is: how does an agency tell the credible vendors from the rest?

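Headline accuracy doesn’t help. Every vendor claims high accuracy, and the number doesn’t translate into operational trust: a model can score well on accuracy while the confidence it reports is systematically too high or too low.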

The right standard is calibration — and conformal prediction for individual uncertainty.

Calibration as a procurement requirement

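Expected Calibration Error (ECE) is the standard metric: the average gap between the probabilities a model reports and the outcome frequencies actually observed, weighted across probability bins. Lower is better. As a rule of thumb, below 0.02 is very good and below 0.01 is excellent. The metric is widely adopted in academic ML evaluation, and reporting it should be the floor for any high-stakes federal AI use.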
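To make the metric concrete, here is a minimal sketch of binned ECE for a binary classifier, using only the Python standard library. The function name, the equal-width binning, and the default of 10 bins are illustrative assumptions, not a prescribed federal standard; other binning schemes exist and give somewhat different numbers, which is one reason an RFP should ask vendors to document how they computed theirs.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for a binary classifier (illustrative sketch).

    probs  : predicted probability of the positive class, each in [0, 1]
    labels : observed outcomes, 0 or 1
    n_bins : number of equal-width probability bins (assumed default)
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    # Per-bin count, sum of predicted probabilities, and sum of outcomes.
    counts = [0] * n_bins
    prob_sums = [0.0] * n_bins
    label_sums = [0.0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        prob_sums[b] += p
        label_sums[b] += y

    # ECE = sum over bins of (bin weight) * |observed rate - mean predicted prob|.
    n = len(probs)
    ece = 0.0
    for c, ps, ls in zip(counts, prob_sums, label_sums):
        if c:
            ece += (c / n) * abs(ls / c - ps / c)
    return ece


# Toy data: the model says 10% and 90%, the outcomes are 0% and 100%.
toy_ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(round(toy_ece, 3))  # 0.1
```

On a real holdout set the estimate depends on sample size and bin count, so a reported ECE is only comparable across vendors when both are stated.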

A procurement RFP for an AI system should require:

  • ECE measured on a documented holdout set that is large enough and representative of production data
  • Reliability diagrams showing calibration across the full probability range
  • Sensitivity analysis on how calibration degrades under common distribution shifts (seasonal, regime change, missing data)
  • A monitoring plan for calibration drift in production

Every vendor that ships calibrated models can produce this. Every vendor that ships only headline accuracy will struggle to.
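A reliability diagram is just a plot of the table sketched below: for each probability bin, the mean predicted probability against the observed outcome rate. This is a minimal illustration with hypothetical names (`ReliabilityBin`, `reliability_table`) and an assumed equal-width binning; a reviewer could ask for this table alongside the plot so the numbers can be checked directly.

```python
from typing import List, NamedTuple, Sequence


class ReliabilityBin(NamedTuple):
    lower: float           # lower edge of the probability bin
    upper: float           # upper edge of the probability bin
    count: int             # number of predictions that fell in the bin
    mean_predicted: float  # average predicted probability in the bin
    observed_rate: float   # fraction of positive outcomes in the bin


def reliability_table(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> List[ReliabilityBin]:
    """Return one row per non-empty equal-width probability bin."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have equal length")

    counts = [0] * n_bins
    prob_sums = [0.0] * n_bins
    label_sums = [0.0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        prob_sums[b] += p
        label_sums[b] += y

    rows = []
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins carry no evidence, so they are skipped
        rows.append(
            ReliabilityBin(
                lower=b / n_bins,
                upper=(b + 1) / n_bins,
                count=counts[b],
                mean_predicted=prob_sums[b] / counts[b],
                observed_rate=label_sums[b] / counts[b],
            )
        )
    return rows


# Toy data: a well-calibrated model has mean_predicted close to observed_rate
# in every row; here the low bin is overconfident that the event will not occur.
for row in reliability_table([0.1, 0.1, 0.9, 0.9], [0, 1, 1, 1]):
    print(row)
```

Bins with very few predictions are noisy, which is why the count column matters as much as the two rates.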

Conformal prediction as the second standard

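Calibration tells you the probabilities are honest on average. Conformal prediction tells you the uncertainty attached to an individual prediction is honest. Locally Adaptive Conformal Prediction (LACP) produces distribution-free prediction intervals whose width adapts to how hard each case is. When the model says “between 18 and 47 minutes with 90% coverage,” the actual answer falls inside such intervals at least 90% of the time across cases, regardless of the shape of the underlying distribution. The guarantee assumes the calibration data and the production data are exchangeable, so it has to be rechecked when the data shifts.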
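The mechanics are simpler than the name suggests. Below is a minimal sketch of one common form of locally adaptive conformal prediction: split conformal with residuals normalized by a per-case spread estimate. It is an illustration under stated assumptions, not any particular vendor's implementation. The function names are hypothetical, and the point predictions (`y_pred`) and positive spread estimates (`spread`) are assumed to come from models fitted on data separate from the calibration set.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank in sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def calibrate(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    spread: Sequence[float],
    alpha: float = 0.1,
) -> float:
    """Compute the conformal multiplier q from a held-out calibration set.

    Scores are absolute residuals divided by the per-case spread estimate,
    so cases the model flags as hard get proportionally wider intervals.
    Every spread value is assumed to be strictly positive.
    """
    scores = [abs(y - m) / s for y, m, s in zip(y_true, y_pred, spread)]
    return conformal_quantile(scores, alpha)


def predict_interval(y_pred: float, spread: float, q: float) -> Tuple[float, float]:
    """Interval for a new case: point prediction plus or minus q * spread."""
    return (y_pred - q * spread, y_pred + q * spread)


# Toy calibration set of 9 cases, target coverage 90% (alpha = 0.1).
q = calibrate(
    y_true=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    y_pred=[0] * 9,
    spread=[1] * 9,
    alpha=0.1,
)
# A new case with a larger spread estimate gets a wider interval.
print(predict_interval(y_pred=10.0, spread=2.0, q=q))
```

The coverage guarantee comes from the quantile step alone; the spread model only affects how wide or narrow individual intervals are. A poor spread model yields valid but uninformative intervals.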

For federal decision support, this is non-negotiable. A point estimate without coverage is operationally meaningless.
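An evaluator does not need access to the vendor's model to check a coverage claim. Given the intervals the system produced on a holdout set and the outcomes that actually occurred, empirical coverage is a simple count. The sketch below uses hypothetical function names and toy numbers; it reports average interval width as well, since an interval can always achieve coverage by being uselessly wide.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def empirical_coverage(intervals: Sequence[Interval], outcomes: Sequence[float]) -> float:
    """Fraction of observed outcomes that fall inside their stated interval."""
    if len(intervals) != len(outcomes) or not outcomes:
        raise ValueError("intervals and outcomes must be non-empty and equal length")
    hits = sum(1 for (lo, hi), y in zip(intervals, outcomes) if lo <= y <= hi)
    return hits / len(outcomes)


def mean_width(intervals: Sequence[Interval]) -> float:
    """Average interval width; narrower is more useful at equal coverage."""
    if not intervals:
        raise ValueError("intervals must be non-empty")
    return sum(hi - lo for lo, hi in intervals) / len(intervals)


# Toy holdout: four stated intervals (in minutes) and what actually happened.
stated = [(18, 47), (10, 20), (0, 5), (30, 40)]
actual = [25, 21, 3, 35]
print(empirical_coverage(stated, actual))  # 0.75
print(mean_width(stated))                  # 13.5
```

On a holdout of realistic size, empirical coverage well below the stated level is a red flag; coverage far above it usually means the intervals are wider than they need to be.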

NIST AI RMF alignment

The NIST AI Risk Management Framework defines four functions: Govern, Map, Measure, and Manage. Calibration and conformal prediction sit squarely in Measure. They are operationally meaningful measurements of model trustworthiness, far more useful than the marketing accuracy a vendor leads with.

What this implies for vendor evaluation

Three concrete recommendations for federal AI procurement:

  1. Require ECE and reliability diagrams in every AI proposal.
  2. Require a stated coverage method (preferably conformal) for any system that produces numerical estimates.
  3. Require a monitoring plan for calibration drift, not just accuracy drift.

A vendor that cannot meet these three requirements is not a credible vendor for high-stakes use.
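The third recommendation can be as simple as recomputing ECE on each production window once outcomes are known and flagging windows that exceed an agreed threshold. The sketch below is one possible shape for such a check, with hypothetical names (`binned_ece`, `drift_alerts`), an assumed 10-bin ECE, and an assumed alert threshold of 0.02 borrowed from the calibration figures earlier in this piece. A real plan would set the window size and threshold per use case.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# (window label, predicted probabilities, observed 0/1 outcomes)
Window = Tuple[str, Sequence[float], Sequence[int]]


def binned_ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Binned ECE for a binary classifier; returns 0.0 for an empty window."""
    n = len(probs)
    bins: Dict[int, Tuple[int, float, float]] = {}
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        c, ps, ys = bins.get(b, (0, 0.0, 0.0))
        bins[b] = (c + 1, ps + p, ys + y)
    return sum(c / n * abs(ys / c - ps / c) for c, ps, ys in bins.values())


def drift_alerts(windows: Iterable[Window], threshold: float = 0.02) -> List[Tuple[str, float]]:
    """Return (window label, ECE) for every window whose ECE exceeds threshold."""
    alerts = []
    for name, probs, outcomes in windows:
        ece = binned_ece(probs, outcomes)
        if ece > threshold:
            alerts.append((name, ece))
    return alerts


# Toy example: the second window is badly overconfident and gets flagged.
production = [
    ("window-1", [0.5, 0.5], [0, 1]),
    ("window-2", [0.9, 0.9], [0, 0]),
]
for name, ece in drift_alerts(production):
    print(name, round(ece, 3))
```

ECE estimated on a small window is noisy and tends to read high, so the window has to be large enough that an alert reflects drift rather than sampling error.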

Closing

Federal decision support is too consequential to run on headline accuracy. Calibration and conformal prediction are the right standards. Procurement should require them. Vendors should ship them. We do, and we think the field should follow.