Pull-quote: “Federal procurement should require calibration metrics in every AI proposal. Anything less is buying a black box.”
Why this matters
Federal decision support runs on AI now. Risk scoring, fraud detection, predictive maintenance, safety analysis, mission planning — every category has at least one AI vendor pitching the agency. The procurement question is: how does an agency tell the credible vendors from the rest?
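Headline accuracy doesn’t help. Every vendor claims high accuracy, and the number says how often a model is right, not whether its stated confidence can be relied on for any given decision. It doesn’t translate into operational trust.
The right standard is calibration for predicted probabilities, plus conformal prediction for the uncertainty around individual predictions.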
Calibration as a procurement requirement
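Expected Calibration Error (ECE) is the standard metric: the sample-weighted average gap between predicted confidence and observed frequency across probability bins. Lower is better. Below 0.01 is excellent. Below 0.02 is very good. The metric is widely adopted in academic ML evaluation and is the right floor for any high-stakes federal AI use.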
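As a rough illustration of what the ECE number measures, here is a minimal sketch for a binary classifier. The function name, the choice of 10 equal-width bins, and the binary setting are assumptions made for this example. They are not a prescribed procurement method, and reported ECE values depend on the binning scheme, so a proposal should state which one was used.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins (illustrative sketch).

    probs  : predicted probability of the positive class, in [0, 1]
    labels : observed outcomes, 0 or 1
    n_bins : number of equal-width bins (an assumption; 10 is a common choice)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each prediction to a bin index using the interior edges only,
    # so that a probability of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between mean predicted probability and observed positive rate.
        gap = abs(probs[mask].mean() - labels[mask].mean())
        # Weight the gap by the share of samples that fall in this bin.
        ece += mask.mean() * gap
    return ece
```

The per-bin pairs of mean predicted probability and observed rate computed inside the loop are the same quantities a reliability diagram plots, which is why the two artifacts are usually requested together.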
A procurement RFP for an AI system should require:
- ECE on a documented holdout slice of representative size
- Reliability diagrams showing calibration across the full probability range
- Sensitivity analysis on how calibration degrades under common distribution shifts (seasonal, regime change, missing data)
- A monitoring plan for calibration drift in production
Every vendor that ships calibrated models can produce this. Every vendor that ships only headline accuracy will struggle to.
Conformal prediction as the second standard
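Calibration tells you the predicted probabilities are honest on average. Conformal prediction attaches an honest uncertainty range to each individual prediction. Locally Adaptive Conformal Prediction (LACP) produces distribution-free prediction intervals whose width adapts to how hard each case is. When the model says “between 18 and 47 minutes with 90% coverage,” the actual answer falls inside such intervals at least 90% of the time on average, regardless of underlying distribution shape, provided the calibration data is exchangeable with the data the system sees in production.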
For federal decision support, this is non-negotiable. A point estimate without coverage is operationally meaningless.
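To make the coverage idea concrete, here is a minimal sketch of split conformal prediction for a regression model. It is the plain, fixed-width variant and not LACP itself, which additionally scales the interval by a per-case difficulty estimate. The function name and arguments are assumptions for illustration, and the guarantee holds only if the calibration data is exchangeable with the test data.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Fixed-width split conformal intervals (illustrative sketch).

    cal_pred, cal_true : model predictions and true values on a held-out
                         calibration set the model was not trained on
    test_pred          : point predictions to wrap in intervals
    alpha              : target miscoverage, e.g. 0.1 for 90% coverage
    """
    cal_pred = np.asarray(cal_pred, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)

    # Nonconformity score: absolute residual on the calibration set.
    scores = np.sort(np.abs(cal_true - cal_pred))
    n = scores.size

    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th
    # smallest score rather than the plain (1 - alpha) quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        raise ValueError("calibration set too small for the requested coverage")
    q = scores[k - 1]

    return test_pred - q, test_pred + q
```

An evaluator can check a vendor's coverage claim the same way: count how often held-out true values fall inside the reported intervals and compare that rate with the stated level.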
NIST AI RMF alignment
The NIST AI Risk Management Framework articulates four functions: Govern, Map, Measure, and Manage. Calibration and conformal prediction sit squarely in Measure. They are the operationally meaningful measurements of model trustworthiness, and far more useful than the marketing accuracy a vendor leads with.
What this implies for vendor evaluation
Three concrete recommendations for federal AI procurement:
- Require ECE and reliability diagrams in every AI proposal.
- Require a stated coverage method (preferably conformal) for any system that produces numerical estimates.
- Require a monitoring plan for calibration drift, not just accuracy drift.
A vendor that cannot meet these requirements is not a credible vendor for high-stakes use.
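A monitoring plan for calibration drift can start very simply. The sketch below checks, window by window, the gap between the mean predicted probability and the observed outcome rate (sometimes called calibration-in-the-large). This is a coarser check than ECE, chosen here for brevity. The function name, window size, and tolerance are hypothetical values for illustration, not recommended thresholds.

```python
import numpy as np

def calibration_drift_alerts(probs, labels, window=500, tolerance=0.05):
    """Flag windows where predictions and outcomes diverge (illustrative sketch).

    probs, labels : predicted probabilities and observed 0/1 outcomes,
                    ordered by time
    window        : number of consecutive predictions per check (assumed value)
    tolerance     : largest acceptable gap before alerting (assumed value)

    Returns a list of (window_start_index, gap) for each flagged window.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    alerts = []
    # Walk through non-overlapping windows; a trailing partial window is skipped.
    for start in range(0, len(probs) - window + 1, window):
        sl = slice(start, start + window)
        gap = abs(probs[sl].mean() - labels[sl].mean())
        if gap > tolerance:
            alerts.append((start, float(gap)))
    return alerts
```

A model can keep its accuracy while this gap widens, which is why calibration drift needs its own monitor rather than piggybacking on accuracy tracking.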
Closing
Federal decision support is too consequential to run on headline accuracy. Calibration and conformal prediction are the right standards. Procurement should require them. Vendors should ship them. We do, and we think the field should follow.


