The 52-Article Charter · 27 of 52 · full text

Article 27: Performance Metrics Benchmarks

Published from the canonical CSOAI Partnership Charter (effective 15 January 2026). Full text below.

Version: 1.0
Effective Date: January 15, 2026, 09:00 GMT
Status: Technical Article - Measurement Standards

Framework Integration: NIST AI RMF MEASURE Function, IEEE 7009 Fail-Safe Standards, Yoshua Bengio's Guaranteed Safe AI Metrics


PREAMBLE

This Article establishes comprehensive performance metrics and benchmarking standards for AI systems. You cannot improve what you do not measure. Rigorous metrics enable accountability, comparison, and continuous improvement.

Core Principle: Measure accurately, benchmark honestly, improve continuously.


27.1 STANDARD PERFORMANCE METRICS

27.1.1 Classification Metrics

Binary Classification:

Confusion Matrix:
```
                      Predicted
                  Positive   Negative
Actual  Positive     TP         FN
        Negative     FP         TN
```

Derived Metrics:

Accuracy:

Precision (Positive Predictive Value):

Recall (Sensitivity, True Positive Rate):

F1-Score:

Specificity (True Negative Rate):

ROC Curve & AUC:

PR Curve (Precision-Recall):
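The derived metrics listed above follow directly from the four confusion-matrix cells. The sketch below assumes the standard textbook definitions. The function name `binary_metrics` and the counts in the usage lines are hypothetical, chosen only to show how a rare positive class can make accuracy look far better than recall.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics derived from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    assumed here; some libraries raise or warn instead).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical imbalanced data: 10 positives among 1,000 cases.
m = binary_metrics(tp=2, fn=8, fp=5, tn=985)
print({name: round(value, 3) for name, value in m.items()})
```

With these hypothetical counts accuracy is 0.987 while recall is only 0.2, which is why accuracy alone should not be reported for imbalanced problems.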

Example:
```
Medical Diagnosis Model:

Conclusion: High accuracy misleading. Model has room for improvement.
```

Multi-Class Classification:

Macro-Averaging:

Micro-Averaging:

Weighted-Averaging:

Per-Class Metrics:
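The three averaging schemes above differ only in how per-class scores are combined. The sketch below computes them for F1 from one-vs-rest counts, assuming single-label classification. The helper names and the three-class counts are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the class never appears."""
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def averaged_f1(per_class):
    """per_class maps label -> (tp, fp, fn), counted one-vs-rest."""
    scores = {c: f1_from_counts(*counts) for c, counts in per_class.items()}
    support = {c: tp + fn for c, (tp, fp, fn) in per_class.items()}
    total = sum(support.values())

    # Macro: unweighted mean, so every class counts equally.
    macro = sum(scores.values()) / len(scores)
    # Weighted: mean weighted by the number of true instances per class.
    weighted = sum(scores[c] * support[c] for c in scores) / total
    # Micro: pool the counts first, then compute a single score.
    tp_sum = sum(tp for tp, _, _ in per_class.values())
    fp_sum = sum(fp for _, fp, _ in per_class.values())
    fn_sum = sum(fn for _, _, fn in per_class.values())
    micro = f1_from_counts(tp_sum, fp_sum, fn_sum)
    return {"per_class": scores, "macro": macro, "micro": micro, "weighted": weighted}


# Hypothetical counts: one frequent class ("a") and two rare ones.
counts = {"a": (90, 8, 10), "b": (5, 9, 5), "c": (2, 6, 8)}
result = averaged_f1(counts)
print({k: round(v, 3) for k, v in result.items() if k != "per_class"})
```

In this hypothetical case the macro average (about 0.52) is much lower than the micro average (about 0.81), because the rare classes perform poorly and macro-averaging does not let the frequent class hide that.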

27.1.2 Regression Metrics

Mean Absolute Error (MAE):

Mean Squared Error (MSE):

Root Mean Squared Error (RMSE):

R² (Coefficient of Determination):

Mean Absolute Percentage Error (MAPE):
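The regression metrics above can all be computed from the same list of residuals. The sketch below assumes the standard definitions. The function name and the price figures (in arbitrary thousands) are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R^2 and MAPE for equal-length sequences.

    MAPE is undefined when any true value is zero and R^2 is undefined
    for a constant target, so each is reported as None in that case
    (a convention assumed for this sketch).
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 = 1 - SS_res / SS_tot, where SS_res = n * MSE.
    r2 = 1 - (mse * n) / ss_tot if ss_tot else None
    if all(t != 0 for t in y_true):
        mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    else:
        mape = None
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2, "mape": mape}


# Hypothetical house prices, in thousands.
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [210.0, 290.0, 420.0, 480.0]
metrics = regression_metrics(actual, predicted)
print({k: round(v, 3) for k, v in metrics.items()})
```

MAE and RMSE are in the units of the target, so they are directly interpretable; RMSE exceeds MAE here because squaring penalizes the two larger errors more heavily.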

Example:
```
House Price Prediction:

```

27.1.3 Ranking Metrics

Precision@K:

Recall@K:

Mean Average Precision (MAP):

Normalized Discounted Cumulative Gain (NDCG):

Example:
```
Search Results (10 shown):
Positions 1,3,5,7,9 relevant (5 relevant)
NDCG@10: 0.86 (binary relevance, 5 relevant items in total)
MAP: 0.68

Interpretation: Good ranking, but some relevant items ranked low
```
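The ranking metrics can be checked by hand for the positions in the example. The sketch below assumes binary relevance, exactly five relevant items in the collection, and the common log2(position + 1) discount. Graded relevance or additional relevant items that were not retrieved would give different values. The function names are hypothetical.

```python
import math


def precision_at_k(relevant_positions, k):
    """Fraction of the top-k results that are relevant (1-based positions)."""
    return sum(1 for p in relevant_positions if p <= k) / k


def average_precision(relevant_positions, total_relevant):
    """Mean of precision@p taken at each relevant position p."""
    hits = sorted(relevant_positions)
    return sum((i + 1) / p for i, p in enumerate(hits)) / total_relevant


def ndcg_at_k(relevant_positions, total_relevant, k):
    """NDCG@k with binary gains and a log2(position + 1) discount."""
    dcg = sum(1 / math.log2(p + 1) for p in relevant_positions if p <= k)
    ideal = sum(1 / math.log2(p + 1) for p in range(1, min(total_relevant, k) + 1))
    return dcg / ideal if ideal else 0.0


positions = [1, 3, 5, 7, 9]  # relevant results from the example
print(round(precision_at_k(positions, 10), 2))                   # 0.5
print(round(average_precision(positions, total_relevant=5), 2))  # 0.68
print(round(ndcg_at_k(positions, total_relevant=5, k=10), 2))    # 0.86
```

MAP is the mean of `average_precision` over a set of queries; for a single query the two coincide.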

27.1.4 Language Model Metrics

Perplexity:

BLEU (Bilingual Evaluation Understudy):

ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

Exact Match (EM):

F1 (token-level):

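Example:
```
Question: "When was the Eiffel Tower built?"
Reference: "1889"
Model Output: "The Eiffel Tower was built in 1889."

Exact Match: 0 (doesn't match exactly)
F1: Low, about 0.29 with SQuAD-style normalization (recall is 1.0 because the answer is present, but precision is 1/6 because of the extra tokens)
```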
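Exact Match and token-level F1 for the example can be reproduced with a few lines. The sketch below assumes SQuAD-style answer normalization (lowercasing, removing punctuation and the articles "a", "an", "the"). Other normalization choices change the numbers slightly. The function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "1889"
output = "The Eiffel Tower was built in 1889."
print(exact_match(output, reference))         # 0
print(round(token_f1(output, reference), 2))  # 0.29
```

The answer is fully recalled, but five of the six normalized output tokens are not in the reference, so token-level F1 rewards concise answers as well as correct ones.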


27.2 FAIRNESS METRICS

27.2.1 Demographic Parity

Framework Reference: IEEE 7000 Value-Based Design, NIST AI RMF MEASURE Function

Definition:

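Measurement:
```python
import numpy as np

def demographic_parity_difference(y_pred, protected_attr):
    """Absolute gap in positive-prediction rate between groups 0 and 1."""
    y_pred = np.asarray(y_pred)                  # binary predictions (0/1)
    protected_attr = np.asarray(protected_attr)  # group membership (0/1)
    rate_group_0 = y_pred[protected_attr == 0].mean()
    rate_group_1 = y_pred[protected_attr == 1].mean()
    return abs(rate_group_0 - rate_group_1)

# Threshold: < 0.05 (a difference under 5 percentage points is acceptable)
```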

Example:
```
Loan Approvals:
Group 0 (Majority): 60% approval
Group 1 (Minority): 55% approval
Difference: 5 percentage points (borderline acceptable)
```

27.2.2 Equalized Odds

Definition:

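Measurement:
```python
# Inputs are NumPy arrays of 0/1 values. TPR and FPR are the mean
# prediction among actual positives and actual negatives, respectively.
def TPR(y_true, y_pred):
    return y_pred[y_true == 1].mean()

def FPR(y_true, y_pred):
    return y_pred[y_true == 0].mean()

def equalized_odds_difference(y_true, y_pred, protected_attr):
    tpr_0 = TPR(y_true[protected_attr == 0], y_pred[protected_attr == 0])
    tpr_1 = TPR(y_true[protected_attr == 1], y_pred[protected_attr == 1])
    fpr_0 = FPR(y_true[protected_attr == 0], y_pred[protected_attr == 0])
    fpr_1 = FPR(y_true[protected_attr == 1], y_pred[protected_attr == 1])
    return max(abs(tpr_0 - tpr_1), abs(fpr_0 - fpr_1))
```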

27.2.3 Calibration

Definition:

Measurement:

Example:
```
Model says "70% probability of success"
Group 0: Actually succeeds 68% of time (well calibrated)
Group 1: Actually succeeds 50% of time (poorly calibrated, overconfident)
```
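The per-group comparison in the example can be expressed as a calibration gap: the mean predicted probability minus the observed success rate, computed separately for each group within one confidence bin. The sketch below is one way to do that. The function names, the bin edges, and the synthetic data (built to match the 68% and 50% figures above) are assumptions.

```python
def calibration_gap(probs, outcomes, lo, hi):
    """Mean predicted probability minus observed success rate in [lo, hi)."""
    selected = [(p, y) for p, y in zip(probs, outcomes) if lo <= p < hi]
    if not selected:
        return None
    mean_pred = sum(p for p, _ in selected) / len(selected)
    observed = sum(y for _, y in selected) / len(selected)
    return mean_pred - observed


def gap_by_group(probs, outcomes, groups, lo=0.65, hi=0.75):
    """Calibration gap for each group within one confidence bin."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        result[g] = calibration_gap(
            [probs[i] for i in idx], [outcomes[i] for i in idx], lo, hi
        )
    return result


# Synthetic data: every prediction is 0.70; group 0 succeeds 34 times
# out of 50 (68%) and group 1 succeeds 25 times out of 50 (50%).
probs = [0.70] * 100
outcomes = [1] * 34 + [0] * 16 + [1] * 25 + [0] * 25
groups = [0] * 50 + [1] * 50
gaps = gap_by_group(probs, outcomes, groups)
print({g: round(v, 2) for g, v in gaps.items()})
```

A positive gap means the model is overconfident for that group. Repeating the calculation across all bins and weighting by bin size gives a per-group expected calibration error.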

27.2.4 Individual Fairness

Definition:

Approaches:


27.3 SAFETY METRICS

Framework Integration: Yoshua Bengio's Guaranteed Safe AI, Max Tegmark's Formal Verification Metrics

27.3.1 Robustness Metrics

Adversarial Robustness:

Out-of-Distribution (OOD) Detection:

Certified Robustness:

27.3.2 Uncertainty Quantification

Confidence Calibration:

Prediction Intervals (Regression):

Epistemic vs. Aleatoric Uncertainty:

27.3.3 Alignment Metrics

Reward Hacking Detection:

Value Alignment Score:

27.3.4 Fail-Safe Metrics (IEEE 7009)

Graceful Degradation:

Override Success Rate:

Safe Stopping:


27.4 BENCHMARK DATASETS

27.4.1 Computer Vision Benchmarks

ImageNet (ILSVRC):

COCO (Common Objects in Context):

KITTI:

Open Images V7:

27.4.2 Natural Language Processing Benchmarks

GLUE & SuperGLUE:

SQuAD (Stanford Question Answering):

MMLU (Massive Multitask Language Understanding):

HELM (Holistic Evaluation of Language Models):

27.4.3 Reinforcement Learning Benchmarks

Atari 2600 Games:

MuJoCo Control Tasks:

DeepMind Control Suite:

OpenAI Gym:

27.4.4 Domain-Specific Benchmarks

Medical AI:

Legal AI:

Code Generation:

Multimodal:


27.5 PERFORMANCE MONITORING

Framework Integration: NIST AI RMF MEASURE Function - Continuous Monitoring

27.5.1 Production Metrics Tracking

Real-Time Dashboards:

Comparison to Development:

Drift Detection:

Data Drift:

Concept Drift:

Detection Methods:

Response to Drift:

27.5.2 A/B Testing

Controlled Experiments:

Setup:

Metrics:

Statistical Significance:

Example:
```
Model A (Control): 10% CTR, n=10,000 users
Model B (Treatment): 11% CTR, n=10,000 users

Hypothesis test: p≈0.02, two-sided two-proportion z-test (statistically significant at α=0.05)
Decision: Deploy Model B
```
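The significance check for the example can be reproduced with a pooled two-proportion z-test, which is one common choice for comparing click-through rates. The sketch below uses only the standard library and assumes a two-sided test. A one-sided or continuity-corrected test would give a somewhat different p-value. The function name is hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal:
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 10% of 10,000 users clicked under Model A, 11% of 10,000 under Model B.
z, p = two_proportion_z_test(1_000, 10_000, 1_100, 10_000)
print(round(z, 2), round(p, 3))  # 2.31 0.021
```

Statistical significance alone does not settle the deployment decision; the size of the lift and any guardrail metrics still need to be weighed.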

27.5.3 Alerting Thresholds

When to Alert:

| Metric | Threshold | Severity |
|--------|-----------|----------|
| Accuracy drop | >5% | Critical |
| Latency increase | p99 > 2x SLA | High |
| Error rate spike | >2x baseline | High |
| Confidence collapse | >20% low confidence | Medium |
| Data drift | KL divergence >0.1 | Medium |
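The data-drift row of the table can be checked by comparing a live feature histogram against a reference histogram. The table does not state the direction of the divergence or the logarithm base, so the sketch below assumes KL(live || reference) in nats over shared bins, with a small epsilon to handle empty bins. The function names and bin counts are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) in nats for two histograms over the same bins.

    Counts are normalized to probabilities; eps guards against empty
    bins (a smoothing choice assumed for this sketch).
    """
    p_total, q_total = sum(p), sum(q)
    kl = 0.0
    for p_i, q_i in zip(p, q):
        p_prob = p_i / p_total + eps
        q_prob = q_i / q_total + eps
        kl += p_prob * math.log(p_prob / q_prob)
    return kl


def drift_alert(reference_hist, live_hist, threshold=0.1):
    """Return (severity or None, score) using the table's Medium threshold."""
    score = kl_divergence(live_hist, reference_hist)
    return ("Medium" if score > threshold else None), score


reference = [100, 300, 400, 200]  # hypothetical training-time bin counts
stable = [105, 290, 410, 195]     # hypothetical live counts, little change
shifted = [400, 300, 200, 100]    # hypothetical live counts, clear shift
print(drift_alert(reference, stable))
print(drift_alert(reference, shifted))
```

Because KL divergence is asymmetric and sensitive to binning, the 0.1 threshold is only meaningful once the direction, base, and bin edges are fixed and documented alongside it.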

Alert Routing:

27.5.4 Continuous Evaluation

Ongoing Testing:

Shadow Mode:


27.6 COMPARATIVE ANALYSIS

27.6.1 Benchmarking Against State-of-the-Art

Requirements:

Example:
```
Our Model: 94.2% ImageNet accuracy
SOTA (EfficientNet-V2): 95.1%
Gap: -0.9 percentage points (acceptable for our use case given 5x faster inference)
```

27.6.2 Competitor Analysis

If Public Benchmarks Available:

If Proprietary:

27.6.3 Longitudinal Performance

Track Improvement Over Time:

Improvement Rate:

27.6.4 Human Parity

When Relevant:

Superhuman Performance:


27.7 EFFICIENCY METRICS

Framework Integration: OECD AI Principles 2024 Update - Environmental Sustainability

27.7.1 Computational Efficiency

Training:

Inference:

Example:
```
Model Training:

Inference:

```

27.7.2 Cost Metrics

Training Cost:

Serving Cost:

Cost Per Prediction:

27.7.3 Environmental Impact

Carbon Footprint:

Energy Efficiency:

CSOAI Requirements:


27.8 CONCLUSION

Performance metrics are the foundation of accountability.

Without metrics:

CSOAI requires:

Measure rigorously. Report honestly. Improve continuously.

Effective Date: January 15, 2026, 09:00 GMT

"What Gets Measured Gets Managed"


REFERENCES

NIST. (2023). AI Risk Management Framework - MEASURE Function. NIST AI 100-1.

IEEE. (2021). IEEE 7009 - Standard for Fail-Safe Design of Autonomous Systems.

Hardt, M., et al. (2016). Equality of Opportunity in Supervised Learning. NeurIPS.

Bengio, Y., et al. (2024). Guaranteed Safe AI: Metrics and Verification. arXiv.

OECD. (2024). AI Principles - Environmental Sustainability Update.

Deng, J., et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database. CVPR.

Wang, A., et al. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR.


END OF ARTICLE 27

Progress: 27 of 52 Articles (52%)

Next: Article 28 - Interoperability Standards (completing Phase 3)
