Emergency Rollback Procedure
Two rollback layers exist for the post-Voyager mainnet, listed in order of increasing cost / decreasing speed:
- Binary rollback (re-deploy a prior archived binary) — for a bad binary that hasn't corrupted state.
- chain.db restore (frozen-rsync from a healthy validator) — for state divergence; canonical recovery, but slowest.
Always escalate from the cheapest layer first.
Historical note: earlier rollback layers included a SENTRIX_FORCE_PIONEER_MODE=1 env-var override that forced the binary back to Pioneer PoA, used during the 2026-04-25 Voyager activation #1 livelock. It is now obsolete — the multiaddr-advertisement and cold-start-gate fixes (v2.1.26 / v2.1.27) make the activation-time livelock failure mode unreachable, and the override has been removed from all production env files. If a future BFT-class livelock occurs on a steady-state chain, prefer a chain.db rsync from a healthy peer over forcing back to Pioneer.
1. Binary Rollback (no state corruption)
Each validator's deploy archives the previous binary under
<bin_dir>/releases/ (last 3 retained). To roll back, re-run your
deploy with a prior archive instead of building a fresh binary.
For the maintainer fleet, use the private orchestrator with the
SENTRIX_ROLLBACK env var pointing at the archived binary path:
SENTRIX_ROLLBACK=/opt/sentrix/releases/sentrix-vX.Y.Z-<timestamp> \
<orchestrator> mainnet
The orchestrator skips the build step, ships the named binary, and performs the same rolling stop/start sequence with health checks — roughly 2 minutes end-to-end.
For third-party validators (single host), use
scripts/deploy-validator.sh with --binary pointing at the archive:
./scripts/deploy-validator.sh \
--bin-dir /opt/sentrix --rpc-url http://127.0.0.1:8545 \
--binary /opt/sentrix/releases/sentrix-vX.Y.Z-<timestamp>
Manual single-host fallback (any operator):
# 1. Stop the unit
sudo systemctl stop <validator-service>
# 2. List archived versions
ls -lt <bin_dir>/releases/
# 3. Restore (use install or an mv-rename, NOT cp — cp writes into the running binary's inode and fails with ETXTBSY)
sudo install -m 755 <bin_dir>/releases/sentrix-vX.Y.Z-<timestamp> <bin_dir>/sentrix
# 4. Restart
sudo systemctl start <validator-service>
Current production binary at the time of writing: v2.1.39 (mainnet
& testnet, post tokenomics-v2 fork landing). Prior production releases
archived under each validator's <bin_dir>/releases/: v2.1.38, v2.1.37,
v2.1.36, v2.1.35, v2.1.34, v2.1.33, v2.1.32, v2.1.31, v2.1.30, v2.1.29,
v2.1.28, v2.1.27, v2.1.26.
The 2026-04-25 / 2026-04-26 incident hotfix series:
- v2.1.31: BFT signing v2 foundation + Frontier F-2 shadow + libp2p connection-leak fix
- v2.1.32: /p2p/<peer_id> in advert multiaddrs (closes #319 partial-fix gap)
- v2.1.33: voyager_mode_for runtime-aware check + connection_limits Behaviour
- v2.1.34: connection_limits cap loosened 1→2 (production hotfix)
- v2.1.35: Voyager-mode-for migration sweep + claim-rewards tool
- v2.1.36: tx validate exempts staking ops from amount>0 check (ClaimRewards submission fix)
- v2.1.37: libp2p sync cascade-bail filter (P0: 2026-04-26 mainnet stall at h=604547 root cause + fix). Recovered via Treasury-canonical chain.db rsync. See PR #334 (RCA held in operator runbooks).
- v2.1.38: legacy TCP-path deletion (sync.rs + node.rs trimmed) + cumulative skip-counter observability for race re-emergence detection
- v2.1.39: tokenomics v2 fork (consensus, env-gated). 126M-block halving (4-year BTC-parity) + 315M MAX_SUPPLY. Activated on testnet at h=381651 (2026-04-26), armed on mainnet at h=640800 via the TOKENOMICS_V2_HEIGHT env var. Same fork-gate pattern as VOYAGER_REWARD_V2_HEIGHT (zero behavior change pre-fork-height; runtime dispatch in get_block_reward() + max_supply_for(height) + halvings_at(height)). PR #336 + #337 (RPC display fix).
- v2.1.40: fork-aware explorer richlist display (pre-fork hardcoded 210M now tracks /chain/info max_supply_srx). PR #348.
- v2.1.41: jail-cascade observability + fork-gated BFT safety-gate relaxation (BFT_GATE_RELAX_HEIGHT, default DISABLED — operator activates after testnet bake). PRs #350-#352.
- v2.1.42 / v2.1.43: asymmetric-application fixes (record_block_signatures + distribute_reward + epoch_manager now fire from libp2p apply paths, not just validator-loop finalize). PRs #356, #362.
- v2.1.44: Phase A→D consensus-jail full stack (StakingOp::JailEvidenceBundle epoch-boundary system tx, dispatch recompute-and-compare, 4-validator determinism harness). All dormant pre-fork via the JAIL_CONSENSUS_HEIGHT=u64::MAX default. Activation = operator halt-all event. PRs #359/#365/#366/#368/#369/#371/#372 + testnet bootstrap (#374).
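The halving arithmetic behind the v2.1.39 fork can be sketched as plain integer math. This is a hedged illustration, not the real implementation: the 126M-block interval comes from the release note above, while INITIAL_REWARD is a made-up placeholder (the real emission value lives in get_block_reward()):

```shell
#!/bin/sh
# Sketch of the BTC-style halving arithmetic behind halvings_at() / get_block_reward().
# HALVING_INTERVAL is from the v2.1.39 note; INITIAL_REWARD is a hypothetical
# placeholder, NOT the real per-block emission.
HALVING_INTERVAL=126000000
INITIAL_REWARD=50

halvings_at() {   # number of completed halvings at a given height
  echo $(( $1 / HALVING_INTERVAL ))
}

reward_at() {     # reward = initial >> halvings (integer, eventually floors to 0)
  echo $(( INITIAL_REWARD >> $(halvings_at "$1") ))
}

echo "h=0:           $(reward_at 0)"
echo "h=125999999:   $(reward_at 125999999)"
echo "h=126000000:   $(reward_at 126000000)"   # first halving boundary
```

The "zero behavior change pre-fork-height" property in the note follows from the same shape: below TOKENOMICS_V2_HEIGHT the dispatch simply returns the legacy reward path.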
2. State Recovery (chain.db restore)
When state has diverged (different block hash at the same height,
state_root mismatch, etc.), the canonical recovery is a frozen
rsync of chain.db from a healthy peer with all validators
halted. See STATE_EXPORT.md for why
sentrix state export/import is not the right path for a
post-genesis chain.
The full procedure lives in the internal operator runbook. Outline:
- Pick the canonical validator (matches the most peers; longest valid chain at consensus root; for BFT-finalized chains, prefer the one whose justification signer-set matches the majority of healthy peers).
- Stop all validators on the diverged hosts.
- Back up the diverged chain.db to a sibling directory:
  sudo cp -a <data_dir>/chain.db <data_dir>/chain.db.divergent-<height>-<ts>
- Tar-pipe the canonical chain.db to each diverged host while the source is frozen (the canonical node is also stopped during the copy):
  ssh <canonical> "sudo tar -C <canonical_data_dir> -cf - chain.db" | \
    ssh <dest> "sudo tar -C <dest_data_dir> -xf - --no-same-owner --no-same-permissions"
  Why tar-pipe over rsync: chain.db is a directory of MDBX files; tar handles ownership normalization cleanly with --no-same-owner when source/destination users differ.
- Fix ownership on each destination: chown -R sentriscloud:sentriscloud <data_dir>/chain.db (or whichever user owns the running daemon).
- MD5 parity check — verify all destinations have identical mdbx.dat:
  sudo md5sum <data_dir>/chain.db/mdbx.dat
- Start validators in the standard producer order (most-recently-active first to anchor BFT round numbering, then peers).
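The parity check above can be wrapped in a small helper — a sketch with hypothetical hostnames; the only real requirement is that every host reports the same mdbx.dat sum before anyone restarts:

```shell
#!/bin/sh
# parity_check: succeed iff every argument (one md5 string per host) is identical.
parity_check() {
  [ "$(printf '%s\n' "$@" | sort -u | wc -l)" -eq 1 ]
}

# Gathering the sums over ssh would look like (hosts and paths hypothetical):
#   sums=$(for h in validator-a validator-b validator-c; do
#     ssh "$h" "sudo md5sum <data_dir>/chain.db/mdbx.dat" | awk '{print \$1}'
#   done)

# Local illustration:
parity_check aaa aaa aaa && echo "PARITY - safe to restart"
parity_check aaa bbb aaa || echo "DIVERGED - do not restart"
```

If parity_check fails, stop: re-run the tar-pipe to the odd host rather than restarting a mixed fleet.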
The state_root is recomputed from the canonical chain.db on the next block and the divergence is gone.
Worked example (2026-04-26 morning mainnet stall, h=604547). All 4 validators had different block hashes at h=604547. Treasury was picked as canonical (most progressed at h=604548, self-consistent prev-link, justification signer-set matched the majority). Tar-pipe Treasury chain.db → Foundation, Core, Beacon. MD5 parity confirmed (mdbx.dat md5 = 567c7165301fff7e95ded23d03df63cd). Restart Treasury → Foundation → Core → Beacon. Chain advanced past h=604548 within seconds. Per-validator hash parity verified at h=604650. (RCA held in operator runbooks.)
Worked example #2 (2026-04-26 evening mainnet stall, h=633599). A rolling restart used to load the TOKENOMICS_V2_HEIGHT env var into the validator processes triggered auto-jail divergence: the Foundation+Beacon view had 1 validator jailed (auto-jail counted missed proposals during their down-window), while the Treasury+Core view had 0 jailed. The active-set divergence (3 vs 4) tripped the P1 BFT safety gate ("active set ≥ minimum 4") on Foundation+Beacon, which then refused BFT participation. The chain stalled at h=633599. Recovery: halt all 4 → forensic backup of the divergent chain.db → tar-pipe Treasury (frozen canonical) → Foundation/Core/Beacon → MD5 parity confirmed (975f9d67a7c3206dbea346f6b90f4826) → simultaneous start. BFT resumed within seconds. Per-validator hash parity at h=633650 (8e2166e9962da5aa...). Lesson: rolling restart on mainnet has the same jail-cascade pattern as testnet — for env-var changes, prefer halt-all + simultaneous-start over rolling.
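The halt-all + simultaneous-start pattern from the lesson above can be sketched as a small driver. Hostnames, the systemd unit name, and the DRY_RUN guard are all hypothetical placeholders; the point is the shape — stop everywhere first, then start everywhere together, with no per-host down-window for auto-jail to count:

```shell
#!/bin/sh
# Sketch: halt-all + simultaneous-start across the fleet (vs. a rolling restart).
# HOSTS, the unit name, and DRY_RUN are hypothetical placeholders.
HOSTS="treasury foundation core beacon"
UNIT="sentrix-validator"

run() {  # guard so the sketch prints its plan instead of touching real hosts
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "would: $*"; else "$@"; fi
}

halt_all() { for h in $HOSTS; do run ssh "$h" sudo systemctl stop "$UNIT"; done; }
start_all() { for h in $HOSTS; do run ssh "$h" sudo systemctl start "$UNIT" & done; wait; }

DRY_RUN=1
halt_all     # every validator down BEFORE any env-var edit
start_all    # then all back up together
```

A rolling variant would interleave stop/start per host — exactly the down-window that produced the jail-cascade divergence in worked example #2.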
NEVER Do This
- Never git push --force to roll back. The CI/CD deploy job is disabled — a force-push to main does not redeploy. Re-run your deploy with SENTRIX_ROLLBACK=<archived-binary-path> instead. Force-push also rewrites public history; CI test artifacts and PR-comment links start pointing at vanished commits.
- Never build on Windows and SCP to Linux validators. Windows produces PE executables; Linux needs ELF. The binary will fail with "Exec format error". Always build inside a Linux container (e.g. rust:1.95-bullseye for glibc 2.31 compatibility across modern Ubuntu/Debian).
- Never run the admin CLI separately per-VPS during recovery. Run the validator add/remove/toggle on a single chain.db, then rsync it to the rest. The admin_log holds wall-clock timestamps; running the same op three times produces three different timestamps and three different state_roots.
- Never use sentrix state export/import to recover a post-genesis chain. v2.1.5+ refuses to start on a keystore built from import. Use a frozen rsync of chain.db (Section 2 above).
- Never rolling-restart all 4 validators sequentially, even when consensus rules don't change between the old and new state. A rolling restart triggers auto-jail divergence: validators down for their proposing slot get jailed locally on some peers but seen as active on others. The P1 safety gate trips and the chain stalls. For env-var changes, binary upgrades that don't change consensus rules, etc., use halt-all + simultaneous-start. Confirmed pattern on testnet (2026-04-20) and mainnet (2026-04-26 evening; see Worked example #2 above).