.git/core-platform/#1375
Implement zero-downtime deployment fallback for edge nodes
When edge nodes experience partial network partitioning during a rolling deployment, the current orchestrator fails over incorrectly, causing ~2-3s of request drops.
Objective: Implement a local cache + async sync fallback that allows edge nodes to serve stale-but-valid responses during deployment transitions, ensuring <50ms p99 latency even during network blips.
Acceptance Criteria
- Edge nodes cache last-known-good state for max 30s
- Automatic fallback triggers on consecutive 5xx responses from control plane
- Metrics dashboard exposes
edge_fallback_triggeredandcache_miss_rate - Load tested with 10k RPS under synthetic network partition
Technical Notes
if (controlPlane.healthCheck() === FAIL) {
edgeNode.enableFallbackMode();
metrics.increment('fallback_activations');
logger.warn("Switching to local cache mode");
}
Refer to architecture doc: /docs/arch/edge-resilience-v2.md
AC
Pushed initial fallback logic to
feat/edge-resilience. Running integration tests.SK
Can we ensure the fallback state doesn't leak user-specific session data? We should strip auth tokens before caching.
AC
Good catch. Added token scrubbing middleware in the cache layer. Will update the PR description.
SY
All 24 tests passed. Coverage: 94.2% (+1.1%)
edge-fallback-config.yaml
Modified
2.4 KB
cache-layer.ts
Added
8.1 KB
resilience.test.ts
Added
5.3 KB
load-test-results.json
Attached
128 KB
3 commits linked to this issue:
a3f9c21
Implement cache fallback logic
2h ago
8b2e1d4
Add token scrubbing middleware
5h ago
1c7f0a9
Initial scaffolding for edge resilience
Yesterday