Race condition in parallel git-fetch on edge nodes #1200 In Progress
Problem Description
We are seeing intermittent failures in the deployment pipeline when multiple edge nodes attempt to git-fetch simultaneously from the same high-churn repository. The lock mechanism appears to be releasing prematurely, causing corrupted index files on ~3% of concurrent pulls.
Steps to Reproduce
- Spin up 10+ edge nodes using the `git-v2.0.4` image.
- Trigger a massive parallel fetch job against the `repo-large` target.
- Observe logs for `index-pack: fatal: index corruption` errors.
Expected Behavior
Parallel fetches should serialize access to the index safely or use separate temp directories without collision.
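For the separate-temp-directory option, a rough sketch of what per-fetch isolation could look like is below. The function name `make_fetch_workdir` and the `base_dir` / `node_id` parameters are hypothetical and only for illustration; this is not the pipeline's actual code.

```python
import tempfile
from pathlib import Path


def make_fetch_workdir(base_dir, node_id):
    """Create a private scratch directory for one fetch (hypothetical helper)."""
    # mkdtemp creates a uniquely named directory, readable and writable
    # only by the creating user, so two concurrent callers never
    # receive the same path.
    return Path(tempfile.mkdtemp(prefix=f"fetch-{node_id}-", dir=base_dir))
```

Each fetch would then write its intermediate files under its own directory, so nothing is shared until the final result is moved into place.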
Stack Trace
Priority elevated to P0 due to impact on production deployments during peak hours.
Reproduced this locally. The issue stems from the file-lock timeout being set too low for high-latency network calls. Increasing the timeout to 5000 ms stabilizes the test, but I think a better approach is to use the O_EXCL flag on Linux to avoid the retry loop entirely.
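To illustrate the O_EXCL idea, here is a minimal sketch of atomic lock-file creation. It assumes the lock is a plain file on a local filesystem; `exclusive_lock` and `LockHeldError` are hypothetical names, not the existing lock code or the planned PR.

```python
import os
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another process already holds the lock file."""


@contextmanager
def exclusive_lock(lock_path):
    # O_CREAT | O_EXCL makes creation atomic: the call fails with
    # FileExistsError if the file already exists, so there is no
    # check-then-create window for two processes to race through.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError as exc:
        raise LockHeldError(f"lock already held: {lock_path}") from exc
    try:
        # Record the owner PID to help diagnose stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        # Release: close the descriptor and remove the lock file.
        os.close(fd)
        os.unlink(lock_path)
```

A caller would wrap the index update in `with exclusive_lock(path):` and treat `LockHeldError` as "someone else is fetching". One caveat with this approach: a process that crashes while holding the lock leaves the file behind, so some form of stale-lock cleanup would still be needed.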
I'll open a PR shortly to refactor the lock acquisition logic.