# Evaluation

### Part XII — Evaluation, Assurance & Impact

Part XII specifies **how Nexus Risk Management (NRM) and the Nexus Ecosystem are evaluated, assured, and held to account** over time. It treats evaluation not as an afterthought, but as a **first-class system function** tightly coupled to the semantic, governance, and operational fabrics described in earlier parts.

Evaluation operates at multiple levels:

* **Artefact-level** (AEPs, models, agents).
* **Configuration-level** (NRM Profiles, Packs, Rails).
* **System-level** (NXHIVE and cross-rail systemic behaviour).
* **Societal-level** (equity, legitimacy, contribution to a safe and just operating space).

The aim is to ensure that NRM is **effective, fair, contestable, and improvable**.

***

#### 12.1 Performance & Impact Metrics

The Nexus Ecosystem adopts a **multi-dimensional, disaggregated metric system**. Rather than collapsing performance into a single score, NRM maintains a **small but coherent set of metric families**, each with explicit interpretation and governance.

**12.1.1 Timeliness: Event → Evidence → Decision → Cash**

Timeliness metrics quantify whether NRM is achieving its core promise: **compressing the time from signal to action under lawful, auditable constraints**. For each NRM Profile and rail, the following latency intervals MUST be instrumented and reported:

1. **Event → Detection latency**
   * Time between the onset of a relevant event (e.g., threshold exceedance in hazard, health, cyber, financial, or social indicators) and its detection by UNOSINT / INT pipelines and NXOBS indices.
2. **Detection → AEP latency**
   * Time from initial detection to publication of an **Assurance & Evidence Pack (AEP)** at a specified EQL (e.g., EQL2 for situational awareness, EQL4 for capital triggers).
3. **AEP → Decision latency**
   * Time until a material decision explicitly referencing the AEP is taken (e.g., activation of a contingency window, invocation of a playbook, regulatory or policy change).
4. **Decision → Cash/Action latency**
   * Time from decision to **verifiable action**, which MAY be:
     * Money-in-motion (cash transfers, facility disbursement, procurement).
     * Operational interventions (infrastructure actions, public health measures, cyber mitigations).

Timeliness metrics MUST be:

* **Profile-specific** (because acceptable latencies differ for floods vs fiscal risk vs supply-chain disruptions).
* **Disaggregated** by geography, population group, and institution where relevant.
* Linked to **SLOs in `rail.yaml`**, with NXHIVE providing cross-rail comparisons and identifying systematically underserviced domains or groups.
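
The four intervals above can be computed from per-episode timestamps and compared against SLO limits. The following is a minimal sketch, not a normative implementation: the stage names, the `latency_intervals` and `slo_breaches` helpers, and the shape of the SLO mapping are all assumptions made for illustration, since this section does not define the `rail.yaml` schema.

```python
from datetime import datetime, timedelta

# Hypothetical stage names mirroring Event -> Detection -> AEP -> Decision -> Cash/Action.
STAGES = ["event", "detection", "aep", "decision", "action"]


def latency_intervals(timestamps: dict[str, datetime]) -> dict[str, timedelta]:
    """Return the latency between each pair of consecutive stages."""
    intervals = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delta = timestamps[later] - timestamps[earlier]
        if delta < timedelta(0):
            raise ValueError(f"{later} is recorded before {earlier}")
        intervals[f"{earlier}_to_{later}"] = delta
    return intervals


def slo_breaches(intervals: dict[str, timedelta],
                 slos: dict[str, timedelta]) -> list[str]:
    """Return the names of intervals that exceed their SLO limit."""
    return sorted(name for name, limit in slos.items() if intervals[name] > limit)


# Illustrative episode; all timestamps are made up.
episode = {
    "event": datetime(2024, 1, 1, 0, 0),
    "detection": datetime(2024, 1, 1, 2, 0),
    "aep": datetime(2024, 1, 1, 8, 0),
    "decision": datetime(2024, 1, 2, 8, 0),
    "action": datetime(2024, 1, 4, 8, 0),
}
# Assumed SLO limits, standing in for values a Profile might declare.
slos = {
    "event_to_detection": timedelta(hours=3),
    "decision_to_action": timedelta(hours=24),
}

intervals = latency_intervals(episode)
breaches = slo_breaches(intervals, slos)
print(breaches)  # ['decision_to_action']
```

In practice the same calculation would be repeated per Profile and per disaggregation key (geography, population group, institution), so that breaches can be attributed to specific underserviced domains or groups.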

**12.1.2 Accuracy and Basis Risk Reduction**

Accuracy and basis risk metrics assess whether NRM **improves alignment** between ex ante representations of risk and ex post realised impacts.

Key components include:

* **Predictive performance**
  * Standard metrics for hazard, exposure, and impact models:
    * Calibration (e.g., Brier scores, calibration plots).
    * Discrimination (e.g., ROC-AUC where appropriate).
    * Error distributions (RMSE/MAE for continuous outcomes).
  * Evaluated within and across population groups, geographies, and time horizons.
* **Basis risk metrics for risk transfer and facilities**
  * For each NRM-linked facility or programme:
    * **Over-payments / under-payments** relative to observed impacts.
    * Spatial and socio-economic distribution of misalignment (who is systematically left underprotected or overprotected).
    * Stability of triggers over time and across scenarios.
* **Structural and epistemic error**
  * Attribution of error to:
    * Model structure (mis-specified relationships or missing mechanisms).
    * Data quality and coverage (measurement error, missingness, bias).
    * Governance and application (misuse of models outside validated domains).

These metrics MUST be attached to the following artefacts and surfaced to GRF and GCRI for methodological revision and standard updates:

* **AEPs** (EQL documentation).
* **NRM Profiles** (performance across episodes).
* **Packs** (domain-specific model assessment).
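
To make the metric families above concrete, the sketch below computes a Brier score, RMSE, MAE, and a simple over-payment / under-payment split. It is illustrative only: the function names and sample numbers are invented, and treating basis risk as the summed gap between payouts and observed losses is a simplifying assumption rather than a definition taken from this specification.

```python
import math


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def rmse(pred: list[float], obs: list[float]) -> float:
    """Root mean squared error for continuous outcomes."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))


def mae(pred: list[float], obs: list[float]) -> float:
    """Mean absolute error for continuous outcomes."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)


def payment_misalignment(payouts: list[float],
                         losses: list[float]) -> tuple[float, float]:
    """Return (total over-payment, total under-payment) relative to observed losses."""
    over = sum(max(p - l, 0.0) for p, l in zip(payouts, losses))
    under = sum(max(l - p, 0.0) for p, l in zip(payouts, losses))
    return over, under


# Made-up sample data for illustration.
brier = brier_score([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
over, under = payment_misalignment([100.0, 0.0, 50.0], [80.0, 40.0, 50.0])
print(round(brier, 4), over, under)  # 0.0375 20.0 40.0
```

Running the same functions separately for each population group or geography gives the disaggregated view the section asks for.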

**12.1.3 Equity and Justice Outcomes**

Equity metrics ensure that NRM is **normatively aligned with a safe and just operating space**, rather than optimised for aggregate efficiency alone. They operate along three dimensions:

1. **Distributive justice**
   * Distribution of:
     * Protection (e.g., early warnings, infrastructure resilience).
     * Resources (e.g., transfers, investments) triggered by NRM.
     * Residual losses and harms.
   * Disaggregated by:
     * Socio-economic status, geography, gender, age.
     * Indigenous and community identities.
     * Other context-relevant vulnerability markers.
2. **Procedural justice**
   * Metrics on community and Indigenous participation in:
     * Rail DAOs and NVM processes.
     * Ontology design and AEP co-authorship.
     * Pack governance committees and evaluation activities.
   * Use of opaque knowledge protections, conditional consent, and veto mechanisms.
3. **Remedial justice**
   * Frequency and outcomes of:
     * Grievance filings and appeals.
     * Corrective and compensatory actions.
   * Time-to-remedy and satisfaction rates from impacted communities.

These metrics are maintained in the **Equity & Community governance fabrics** and MUST inform protocol renewals, pack updates, and funding priorities.
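
As one possible illustration of a distributive-justice indicator, the sketch below computes the share of losses covered by NRM-triggered support for each group, plus the gap between the best- and worst-covered groups. The record layout, the group labels, and the `coverage_by_group` helper are hypothetical; the specification does not prescribe a particular formula.

```python
from collections import defaultdict


def coverage_by_group(records):
    """records: iterable of (group, support_received, loss_incurred).

    Returns support divided by loss for each group with non-zero loss.
    """
    support = defaultdict(float)
    loss = defaultdict(float)
    for group, received, incurred in records:
        support[group] += received
        loss[group] += incurred
    return {g: support[g] / loss[g] for g in loss if loss[g] > 0}


# Made-up records for illustration.
records = [
    ("urban", 80.0, 100.0),
    ("rural", 30.0, 100.0),
    ("urban", 40.0, 50.0),
    ("rural", 20.0, 100.0),
]

coverage = coverage_by_group(records)  # urban: 120/150, rural: 50/200
gap = max(coverage.values()) - min(coverage.values())
print(coverage, gap)
```

A persistent gap for the same groups across episodes is the kind of signal that would feed protocol renewals and funding priorities.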

**12.1.4 Institutional Capacity, Adoption and Trust**

Institutional metrics track **whether NRM strengthens governance capacity and public trust**:

* **Capacity and adoption indicators**
  * Number and diversity of institutions using NRM artefacts in:
    * Planning and budgeting.
    * Regulatory oversight and supervision.
    * Operations and emergency management.
  * Share of major decisions in each domain that explicitly reference NRM Profiles and AEPs.
* **Governance maturity**
  * Implementation quality of:
    * NVM rules (presence of all required constituencies; quorum adherence).
    * Rail DAOs (meeting frequency, issue resolution time, transparency).
  * Compliance with CL/EQL regimes and safety envelopes.
* **Trust and legitimacy**
  * Longitudinal surveys and qualitative studies of:
    * Practitioner trust (CROs, analysts, operators).
    * Public trust (communities, NGOs, Indigenous nations).
  * Indicators of **constructive contestation**:
    * Use of grievance mechanisms and participatory modelling, rather than disengagement or parallel “shadow” systems.

These metrics are synthesised at NXHIVE as a **governance and legitimacy dashboard** that complements technical performance views.

***

#### 12.2 Audit & Evaluation Framework

The audit and evaluation framework defines **who evaluates what, how often, and with which methods**. It distinguishes:

* **Continuous monitoring** (operational).
* **Formative evaluation** (improvement-oriented).
* **Summative evaluation** (judgement of effectiveness and value).

**12.2.1 AEP, Profile, Rail & Agent Evaluation Methods**

**AEP-level evaluation**

For each AEP:

* **EQL verification**
  * Automatic and human checks that the documented evidence quality (EQL1–5) matches:
    * Data sources and coverage.
    * Methodological transparency and reproducibility.
    * Peer review status and co-authorship (including Indigenous/community inputs).
* **Fitness-for-purpose review** (sampled audits)
  * Does the AEP:
    * Address the decision context it claims to support?
    * Clearly state uncertainties, alternatives, and limitations?
    * Provide materially useful content for decision-makers and communities?

**NRM Profile & Pack evaluation**

Profiles and Packs are evaluated periodically (e.g., annually, or after major events):

* **Profile evaluation**
  * For each Profile, across episodes:
    * Timeliness, accuracy, basis risk, equity, and institutional uptake.
    * Evidence of learning (parameter updates, model or ontology revisions).
* **Pack evaluation**
  * Appropriateness of GRIx extensions, model choices, and playbooks for the domain.
  * Ease of localisation by RNCs and NCCs.
  * Safety overlay performance (frequency of agent or automation issues prevented by pack constraints).

**Rail-level evaluation**

Rails undergo **system-level audits**:

* Aggregate SLO attainment and incident histories.
* Governance functioning (NVM, Rail DAO, community governance fabrics).
* Alignment with national/regional policy and regulatory frameworks.
* Contribution to multi-rail systemic stability from NXHIVE’s perspective.

**Agent-level evaluation**

Agents (data, operations, policy) are evaluated on:

* **Task-level performance** (accuracy, consistency, explanation quality).
* **Human–AI interaction quality** (trust, comprehension, override rates).
* **Safety performance** (blocked actions, near misses, policy violations, safe fallbacks).

Results are fed into the **AI & Agent Safety Fabric** and may trigger agent retraining, capability changes, or decommissioning.

**12.2.2 Counterfactuals and Control Groups**

To estimate **causal impact**, evaluation programmes SHOULD employ rigorous comparative methods where feasible:

* **Quasi-experimental designs**
  * Difference-in-differences: comparing outcomes in NRM-adopting vs non-adopting units over time (e.g., regions, sectors).
  * Synthetic controls: constructing counterfactual trajectories for “treated” units using weighted combinations of “control” units.
* **Randomised roll-out at the margin**
  * Where ethically and politically acceptable, staggered adoption or randomised pilot allocation to:
    * Test marginal impact of NRM-guided targeting, timing, or intervention choice.
* **Historical baseline comparisons**
  * Comparisons with pre-NRM episodes, with explicit adjustment for confounders (e.g., climate trends, economic cycles, policy changes).

NXFOUNDRY SHOULD provide standardised templates and tools for designing such evaluations, including **ethics and governance checklists** to ensure non-exploitative use of control groups and respect for local norms.
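
The difference-in-differences comparison listed above reduces, in its simplest two-group, two-period form, to subtracting the control group's change from the treated group's change. The sketch below shows only that arithmetic; the `did_estimate` helper and the outcome values are invented, and the estimate is only interpretable as a causal effect under the usual parallel-trends assumption, which a real evaluation would need to examine.

```python
from statistics import fmean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-period difference-in-differences on group means.

    (change in treated mean) minus (change in control mean).
    """
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Made-up outcome values, e.g. losses per unit before and after NRM adoption.
effect = did_estimate(
    treated_pre=[10.0, 12.0],
    treated_post=[7.0, 9.0],
    control_pre=[10.0, 10.0],
    control_post=[9.0, 9.0],
)
print(effect)  # -2.0
```

Here the treated units improve by 3 while the controls improve by 1, so the estimated effect attributable to adoption is a reduction of 2.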

***

#### 12.3 External Evaluation & Independent Review

External evaluation is necessary to **guard against capture, confirmation bias, and institutional blind spots**.

**12.3.1 Independent Technical Review Bodies**

The Nexus Ecosystem SHOULD charter and resource **Independent Technical Review Bodies (ITRBs)** with the following properties:

* **Structural independence**
  * Distinct governance from GCRI/GRF/GRA and from any single RNC or major vendor.
  * Clear conflict-of-interest policies and transparency of funding sources.
* **Multidisciplinary composition**
  * Expertise spanning risk science, statistics, AI/ML, Earth systems, public policy, ethics, law, Indigenous knowledge, and community organising.
* **Mandate**
  * Periodic reviews of:
    * Selected high-impact AEPs and models.
    * NRM Profiles and rails that underpin major capital facilities or policy decisions.
    * AI & Agent Safety regimes and systemic risk of NRM itself.
  * Publication of **public-facing reports**, with technical annexes for expert audiences.

ITRBs MAY recommend:

* Temporary moratoria on certain applications.
* Model or ontology replacement.
* Governance reforms at rail or NXHIVE level.
* Protocol changes in NXSS / NXSOS.

**12.3.2 Civil Society & Community Oversight**

Civil society and communities provide **essential social oversight**:

* **Oversight forums and assemblies**
  * Rail-level and cross-rail forums where NGOs, unions, community groups, and Indigenous nations can:
    * Review public NRM outputs.
    * Table concerns and recommendations.
    * Request clarifications or re-analyses.
* **Participatory audits**
  * Community-driven audits of:
    * Who receives NRM-triggered protection and support.
    * Whose risks and knowledge are omitted.
  * Co-authored reports that feed into GRF, Rail DAOs, and NXPROG.
* **Escalation rights**
  * Formal mechanisms enabling civil society actors to:
    * Request independent review by ITRBs.
    * Trigger NVM-level reconsideration of contested Profiles, packs, or facilities.

These mechanisms embed **democratic and community accountability** into NRM’s technical core.

***

#### 12.4 AI, Model & Systemic Risk Assurance

Assurance in NRM must address **both micro-level model behaviour and macro-level systemic effects**.

**12.4.1 Model Cards, Robustness & Shift Monitoring**

Every significant model and agent policy MUST have a **governance and documentation bundle** anchored in NXSS:

* **Model Cards**
  * Purpose, intended use, and non-use domains.
  * Data sources and curation processes, including SDZ and lawful-basis constraints.
  * Performance metrics across relevant subgroups.
  * Explanation of methods, hyperparameters, and interpretability tools available.
  * Known risks (e.g., sensitivity to particular features) and mitigation strategies.
* **Robustness assessments**
  * Behaviour under:
    * Extreme events and tail distributions.
    * Data corruption or missingness.
    * Domain shifts (climate change, socio-economic transitions).
  * Use of Monte Carlo stress tests, adversarial inputs where relevant, and comparative model ensembles.
* **Shift monitoring and triggers**
  * Continuous or periodic measurement of:
    * Data distribution drifts (covariate, label/target, concept).
    * Performance degradation indicators.
  * Well-defined thresholds and processes for:
    * Downgrading EQL.
    * Triggering recalibration or retraining.
    * Freezing use for high-stakes applications pending review.

These practices are supported by **ML Fabric**, **AI & Agent Safety Fabric**, and recorded in the **Chronotope & Episodic Memory Fabric**.
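
One common way to quantify covariate drift is the Population Stability Index (PSI), which compares binned distributions of a feature between a reference window and a current window. The sketch below is an assumption-laden illustration: PSI is not named in this specification, the 0.1 and 0.25 cut-offs are widely quoted rules of thumb rather than NRM thresholds, and the action labels returned by `drift_action` are hypothetical stand-ins for the recalibration and freeze processes described above.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; bins are delimited by sorted edges.

    Empty bins are floored at eps so the logarithm in PSI stays defined.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference, current, edges):
    """Population Stability Index between two samples over shared bins."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_action(score, warn=0.1, freeze=0.25):
    """Map a PSI score to a hypothetical action label (thresholds are assumptions)."""
    if score >= freeze:
        return "freeze_pending_review"
    if score >= warn:
        return "recalibrate"
    return "no_action"


# Made-up feature values for illustration.
edges = [2.5, 4.5, 6.5]
reference = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 7, 8, 7, 8]

print(drift_action(psi(reference, reference, edges)))  # no_action
print(drift_action(psi(reference, shifted, edges)))    # freeze_pending_review
```

PSI covers only covariate drift; label and concept drift generally need outcome data and are usually tracked through the performance degradation indicators instead.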

**12.4.2 Systemic Risk of NRM Itself (Procyclicality, Herding, Concentration)**

Because NRM aims for **wide adoption**, it must be treated as a **potential source of systemic risk**:

* **Procyclicality and synchronisation**
  * Metrics such as:
    * Correlation of NRM-driven actions across institutions (e.g., simultaneous risk-off responses).
    * Amplification ratios (how much NRM-linked responses amplify or dampen shocks).
* **Herding and model monoculture**
  * Diversity indicators:
    * Number of independent models and method families used per domain.
    * Range of scenario framings considered in key decisions.
  * Policies encouraging **model and scenario pluralism**, especially in high uncertainty domains.
* **Concentration of influence**
  * Assessment of:
    * Concentration of observatory and pack authorship.
    * Vendor dependence and single points of failure.
  * Mitigation via:
    * Redundancy in observatories.
    * Open-source reference implementations.
    * Competition and diversity safeguards in NXUNIV marketplace governance.

NXHIVE MUST provide **systemic dashboards** and periodic systemic-risk assessments of NRM itself, with recommendations for diversification, guardrails, or protocol changes.
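
Two of the indicators above lend themselves to simple formulas. The sketch below uses a Herfindahl-Hirschman index (HHI) as a concentration measure over method families, its reciprocal as an "effective number" of families, and a Pearson correlation between two institutions' action series as a synchronisation measure. These specific formulas, the helper names, and the sample data are assumptions chosen for illustration; the specification does not mandate them.

```python
import math


def herfindahl(counts):
    """Herfindahl-Hirschman index of shares derived from raw counts (1/n to 1)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


# Made-up usage counts per method family in one domain.
usage = {"family_a": 8, "family_b": 1, "family_c": 1}
hhi = herfindahl(list(usage.values()))
effective_families = 1 / hhi

# Made-up indicators: 1 if an institution took a risk-off action in a period.
inst_1 = [1, 0, 1, 1, 0]
inst_2 = [1, 0, 1, 0, 0]
sync = pearson(inst_1, inst_2)

print(round(hhi, 2), round(effective_families, 2), round(sync, 2))  # 0.66 1.52 0.67
```

An HHI near 1 alongside three nominal families signals a de facto monoculture, which is the situation the pluralism policies above are meant to prevent.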

***

#### 12.5 Public Transparency & Accountability

Public transparency and accountability are treated as **non-negotiable constraints**, not optional extras.

**12.5.1 Public Dashboards & Reporting Requirements**

Rails and NXHIVE SHOULD maintain **public transparency surfaces** that:

* Provide **high-level, non-technical summaries** of:
  * Major risks, resilience trends, and equity indicators.
  * Significant NRM Profiles and packs in force.
  * NRM-linked facilities and programmes, at least in aggregated form.
* Publish regular **NRM performance and impact reports** including:
  * Timeliness and coverage metrics.
  * Basis risk and accuracy summaries (at aggregate levels).
  * Equity and justice indicators.
* Summarise **incidents and postmortems**:
  * Without exposing sensitive individuals or data.
  * With clear accounts of what happened, what was learned, and what changed.

Where NRM is embedded in regulation or public finance, **legal instruments MAY require** such reporting, with GRF providing standard templates and minimum disclosure baselines.

**12.5.2 Civic Education & Engagement Using NRM Outputs**

NRM outputs can improve **societal risk literacy** if deliberately designed for this purpose:

* **Educational modules**
  * Simplified dashboards and narratives for:
    * Schools and universities.
    * Public servants’ training.
    * Media and civil society programmes.
* **Participatory simulation labs**
  * Use of scenario tools in public dialogues:
    * To explore trade-offs (e.g., adaptation pathways, urban resilience designs).
    * To elicit local knowledge and preferences, which then feed back into GRIx and NRM Profiles.
* **Communication principles**
  * Emphasis on:
    * Plain language and multilingual content.
    * Visualisations that respect uncertainty (e.g., fan charts, scenario ranges).
    * Avoiding false precision and determinism.

NXPROG and NXAPP MUST support these functions (e.g., **public explainer modes** that strictly exclude sensitive detail but preserve structure and limitations).

**12.5.3 Mechanisms for Public Feedback and Redress**

Finally, transparency must be accompanied by **mechanisms to challenge and repair**:

* **Feedback channels**
  * Public portals and helpdesks for:
    * Reporting errors, biases, and anomalies.
    * Submitting additional evidence or alternative framings of risk.
* **Grievance and appeal procedures**
  * Clearly described processes that:
    * Allow individuals, communities, and organisations to contest NRM-triggered decisions (e.g., non-payment, misclassification, exclusion).
    * Allocate responsibility for review (e.g., independent panels, Rail DAO committees, ombuds offices).
    * Specify timelines and possible remedies (correction, compensation, policy change).
* **Redress and structural correction**
  * Where NRM-induced decisions are found to be harmful or unjust:
    * Material, not merely symbolic, remedies SHOULD be available (e.g., corrective funding, recalibration of programmes).
    * Structural changes MUST be recorded:
      * Ontology or model updates.
      * Pack and Profile changes.
      * Governance rule adjustments (NVM, Rail DAO).

All such processes are logged as **episodes** in the Chronotope & Episodic Memory Fabric and are inputs to **protocol renewal cycles**.

***

#### 12.6 Meta-Evaluation and Learning Architecture

To avoid treating evaluation itself as static, the Nexus Ecosystem incorporates **meta-evaluation**: evaluation of the evaluation system itself.

Key elements:

* **Meta-indicators**
  * Fraction of major decisions later reviewed with explicit reference to NRM performance.
  * Frequency with which evaluation findings lead to:
    * Updated NXSS standards.
    * New or revised packs and Profiles.
    * Governance reforms at rail or NXHIVE level.
* **Roles and responsibilities**
  * **GCRI**: methodological stewardship of evaluation designs, causal inference practices, and learning analytics.
  * **GRF**: definition of minimum evaluation and transparency standards; conformance and certification of evaluation processes.
  * **GRA**: ensuring evaluation findings are integrated into capital facility design and pricing, including learning clauses.
  * **NXHIVE**: synthesis and cross-rail comparison; identification of systemic blind spots.
* **Review cycles**
  * Periodic (e.g., 3–5 year) **meta-evaluation cycles** tied to:
    * Protocol and Charter renewals.
    * Funding and mandate reviews of observatories and rails.
  * Participation of external reviewers, civil society, and youth/Indigenous voices to safeguard intergenerational and justice perspectives.

Meta-evaluation closes the loop: NRM not only helps societies **learn about their risks**, but also learns about **its own strengths, failures, and biases**, and evolves accordingly. In doing so, it aspires to be a **self-reflexive global digital public good**, worthy of the trust it seeks to underpin.

