Rollbacks
Purpose: Restore a known-safe runtime state quickly and deterministically when a release causes correctness, availability, or safety risk.
Rollback is an operational action on Runtime Pack activation state. It does not modify canonical truth or Kristal artifact schemas.
Non-negotiables
- Fail-closed: if a candidate pack cannot be verified or is incompatible, do not activate it.
- Atomic activation: activation is an all-or-nothing switch; partial activation is forbidden.
- Downgrade prevention: do not activate older versions unless explicitly authorized as a rollback.
- Deterministic rollback: given the same triggers and state history, rollback target selection must be deterministic.
- Offline correctness: rollback/activation must not depend on fetching trust roots over the network.
Rollback modes
At minimum, support one (recommended: both):
1) Pinned rollback
Activate an explicitly pinned known-good pack.
2) Last-known-good (LKG) rollback
Activate the most recent previously active verified pack that is still present and passes preflight.
When rollback is the correct action
Rollback is the correct action if: - A newly activated pack causes elevated errors, regressions, or unsafe outcomes. - Policy enforcement is not functioning as expected. - Determinism or verification guarantees are in question. - Activation is partially complete and must be normalized quickly.
Do not rollback (or rollback only after containment) if: - The issue is purely upstream (inputs) and the active pack is not implicated. - The issue is confined to a single target with environmental failure (prefer repair). - The rollback candidate is known-bad for the same failure class.
Triggers (examples)
Rollback can be triggered by: - Explicit operator action (recommended primary trigger). - Verified distribution control updates (e.g., revocation added, rollback authorization published). - Local runtime health signals (optional; must be policy-defined).
End-to-end shape (mental model)
```mermaid flowchart LR A[Detect issue] --> B[Freeze / stop rollout] B --> C[Select rollback target (Pinned or LKG)] C --> D[Preflight: verify + compatibility] D -->|pass| E[Atomic activate rollback target] D -->|fail| F[Try earlier candidate or Safe Mode] E --> G[Post-check: health + correctness] G --> H[Record evidence + unfreeze (gated)] ````
Procedure: automatic rollback (activation-time)
Automatic rollback is triggered when rollout detects:
- activation preflight failure
- post-activation health check failure
- policy enforcement failure (channel-defined)
Steps:
- Stop further rollout in the same stage.
- Attempt rollback to the last-known-good pack for the failing target(s).
-
If rollback succeeds:
-
mark rollout as failed
- require human approval to proceed further
-
If rollback fails:
-
escalate to emergency procedure
- place targets in safe mode if supported
Procedure: emergency rollback
Use when there is strong evidence of integrity compromise, unsafe behavior, or fast-spreading outage.
-
Immediate freeze
-
Freeze promotions and activations for the entire channel (or all channels if global).
-
Scope containment
-
Identify impacted targets and expand scope conservatively if uncertain.
-
Rollback to LKG
-
Execute rollback with priority on restoring safe operation.
-
If LKG is unavailable or fails preflight, select the most recent prior verified pack that passes preflight.
-
If rollback cannot restore safety
-
Enter safe mode (only if explicitly allowed for the channel):
-
deny/disable unsafe capabilities
- serve minimal safe responses
-
disable optional modules suspected in failures
-
Trust response (if integrity concern)
-
Rotate or revoke trust material as required by policy.
-
Require conformance re-verification before any new promotion.
-
Audit + incident
-
Open/attach an incident.
- Record all decisions and activation events in Release Record + Orgo.
Rollback failure handling
If rollback activation fails on any target:
- Stop further actions for that stage.
- Attempt rollback to an earlier candidate (prior LKG) if policy allows.
-
If no candidate passes preflight:
-
place target into safe mode (if supported)
- escalate to incident response
- require manual remediation
Required observability and audit
During rollback, emit:
- per-target activation attempt events
- preflight outcomes (pass/fail + reason codes)
- post-activation health results
- final active pack ID per target
Dashboards/alerts should include:
- rollback success rate by channel and environment
- time-to-restore by incident
- repeated rollback loops (flapping)
- divergence: reported active pack ID vs expected
Audit linkage requirements:
- every rollback-related metric/event must reference
build_idand/orrelease_idand relevant artifact refs (where applicable)
Konnaxion must also update and emit Konnaxion State on:
- rollback start and rollback completion
- failed verification or failed activation attempt
Operational hygiene
- Maintain at least N prior verified packs per channel (policy-defined) to ensure rollback availability.
- Periodically perform rollback drills in non-prod and canary channels.
- Treat frequent rollbacks as a signal to investigate pipeline quality, conformance tests, and gate strictness.
- Ensure rollback tooling remains compatible across runtime versions (backward compatibility plan).
Orgo integration
Operator-initiated and emergency rollbacks should be expressed as:
- an Orgo Case (incident/regression)
-
one or more Orgo Tasks, such as:
-
freeze-promotions rollback-activationpost-rollback-validationunfreeze-promotions(optional, gated)
Rollback tasks must reference:
- channel/targets
- candidate selection rule or explicit candidate IDs
- required validations and success criteria