.git/#1375

When edge nodes experience partial network partitioning during a rolling deployment, the current orchestrator fails over incorrectly, causing ~2-3s of request drops.

Objective: Implement a local cache + async sync fallback that allows edge nodes to serve stale-but-valid responses during deployment transitions, ensuring <50ms p99 latency even during network blips.

Acceptance Criteria

Edge nodes cache last-known-good state for max 30s
Automatic fallback triggers on consecutive 5xx responses from control plane
Metrics dashboard exposes edge_fallback_triggered and cache_miss_rate
Load tested with 10k RPS under synthetic network partition

Technical Notes

if (controlPlane.healthCheck() === FAIL) {
  edgeNode.enableFallbackMode();
  metrics.increment('fallback_activations');
  logger.warn("Switching to local cache mode");
}

Refer to architecture doc: /docs/arch/edge-resilience-v2.md

AC

Alex Chen updated the status to In Progress 2 hours ago

Pushed initial fallback logic to feat/edge-resilience. Running integration tests.

SK

Sarah Kim commented 5 hours ago

Can we ensure the fallback state doesn't leak user-specific session data? We should strip auth tokens before caching.

AC

Alex Chen commented Yesterday, 4:12 PM

Good catch. Added token scrubbing middleware in the cache layer. Will update the PR description.

SY

System automated check passed CI Yesterday, 3:45 PM

All 24 tests passed. Coverage: 94.2% (+1.1%)

📄

edge-fallback-config.yaml

Modified

2.4 KB

📦

cache-layer.ts

Added

8.1 KB

🧪

resilience.test.ts

Added

5.3 KB

📊

load-test-results.json

Attached

128 KB

3 commits linked to this issue:

🔘

a3f9c21

Implement cache fallback logic

2h ago

🔘

8b2e1d4

Add token scrubbing middleware

5h ago

🔘

1c7f0a9

Initial scaffolding for edge resilience

Yesterday

Implement zero-downtime deployment fallback for edge nodes

Acceptance Criteria

Technical Notes

Add a Comment