.git/core-platform/#1375

Implement zero-downtime deployment fallback for edge nodes

Status: In Progress
Priority: High
Type: Infra
๐Ÿ‘ค Assigned: Alex Chen
๐Ÿ‘ฅ Watchers: 14
๐Ÿ“… Created: Oct 12, 2024

When edge nodes experience partial network partitioning during a rolling deployment, the current orchestrator fails over incorrectly, causing ~2-3s of request drops.

Objective: Implement a local cache + async sync fallback that allows edge nodes to serve stale-but-valid responses during deployment transitions, ensuring <50ms p99 latency even during network blips.

Acceptance Criteria

  • Edge nodes cache last-known-good state for max 30s
  • Automatic fallback triggers on consecutive 5xx responses from control plane
  • Metrics dashboard exposes edge_fallback_triggered and cache_miss_rate
  • Load tested with 10k RPS under synthetic network partition

Technical Notes

if (controlPlane.healthCheck() === FAIL) {
  edgeNode.enableFallbackMode();
  metrics.increment('fallback_activations');
  logger.warn("Switching to local cache mode");
}

Refer to architecture doc: /docs/arch/edge-resilience-v2.md

AC
Alex Chen updated the status to In Progress 2 hours ago
Pushed initial fallback logic to feat/edge-resilience. Running integration tests.
SK
Sarah Kim commented 5 hours ago
Can we ensure the fallback state doesn't leak user-specific session data? We should strip auth tokens before caching.
AC
Alex Chen commented Yesterday, 4:12 PM
Good catch. Added token scrubbing middleware in the cache layer. Will update the PR description.
SY
System automated check passed CI Yesterday, 3:45 PM
All 24 tests passed. Coverage: 94.2% (+1.1%)
๐Ÿ“„
edge-fallback-config.yaml
Modified
2.4 KB
๐Ÿ“ฆ
cache-layer.ts
Added
8.1 KB
๐Ÿงช
resilience.test.ts
Added
5.3 KB
๐Ÿ“Š
load-test-results.json
Attached
128 KB

3 commits linked to this issue:

๐Ÿ”˜
a3f9c21
Implement cache fallback logic
2h ago
๐Ÿ”˜
8b2e1d4
Add token scrubbing middleware
5h ago
๐Ÿ”˜
1c7f0a9
Initial scaffolding for edge resilience
Yesterday

Add a Comment