
Race condition in parallel git-fetch on edge nodes #1200
Status: In Progress

Labels: bug, P0: Critical, backend, performance
Sarah Kim opened this issue 2 hours ago • Updated 15 mins ago

Problem Description

We are seeing intermittent failures in the deployment pipeline when multiple edge nodes attempt to git-fetch simultaneously from the same high-churn repository. The lock mechanism appears to be releasing prematurely, causing corrupted index files on ~3% of concurrent pulls.

Steps to Reproduce

  1. Spin up 10+ edge nodes using the git-v2.0.4 image.
  2. Trigger a parallel fetch job from all nodes at once against the repo-large target.
  3. Watch the logs for "index-pack: fatal: index corruption" errors.

Expected Behavior

Parallel fetches should serialize access to the index safely or use separate temp directories without collision.
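To illustrate the "separate temp directories" option, here is a minimal Rust sketch. It is not taken from git_engine: the function names `worker_scratch_dir` and `publish_atomically`, the `fetch-<pid>-<worker>` directory layout, and the `index.tmp` file name are all hypothetical. It assumes the scratch directory and the destination live on the same filesystem, so the final rename replaces the target in a single step on POSIX systems.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create a scratch directory that is unique per process and per worker.
/// The naming scheme is a hypothetical example, not the project's layout.
pub fn worker_scratch_dir(base: &Path, worker_id: usize) -> io::Result<PathBuf> {
    let dir = base.join(format!("fetch-{}-{}", std::process::id(), worker_id));
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Write the new contents to a private temp file, then rename it over the
/// destination. Readers see either the old file or the new one, never a
/// partially written file (assuming both paths are on the same filesystem).
pub fn publish_atomically(scratch: &Path, dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = scratch.join("index.tmp");
    fs::write(&tmp, bytes)?;
    fs::rename(&tmp, dest)
}
```

With this shape, two workers never write to the same temp path, and the last rename wins. Whether "last writer wins" is acceptable for the index is a separate question that this sketch does not answer.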

Stack Trace

stack-trace.log
thread 'main' panicked at 'lock_file.rs:45: failed to acquire write lock': Error: Resource temporarily unavailable
    at git_engine::index::lock::acquire
    at src/index/lock.rs:45
    at git_engine::fetch::parallel::worker
    at src/fetch/parallel.rs:112
    at std::sys_common::backtrace::__rust_begin_short_backtrace

Priority elevated to P0 due to impact on production deployments during peak hours.

Alex Rivera commented 45 mins ago

Reproduced this locally. The issue stems from the file-lock timeout being set too low for high-latency network calls. Increasing the timeout to 5000 ms stabilizes the test, but I think a better approach is to use the O_EXCL flag on Linux to avoid the retry loop entirely.

I'll open a PR shortly to refactor the lock acquisition logic.
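For reference, a minimal sketch of what O_EXCL-based acquisition could look like in Rust. This is an assumption about the shape of the fix, not the code from the upcoming PR: `IndexLock` and `try_acquire` are hypothetical names. It relies on `OpenOptions::create_new(true)`, which on Unix opens the file with O_CREAT | O_EXCL, so creation fails with `AlreadyExists` if another process already holds the lock file.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard: the lock is held for as long as this value lives,
/// and the lock file is removed when it is dropped.
pub struct IndexLock {
    path: PathBuf,
}

impl IndexLock {
    /// Try to take the lock exactly once.
    /// Returns Ok(Some(lock)) on success, Ok(None) if the lock file already
    /// exists, and Err for any other I/O failure.
    pub fn try_acquire(path: &Path) -> io::Result<Option<IndexLock>> {
        // create_new(true) makes "check for existence and create" one atomic step.
        match OpenOptions::new().write(true).create_new(true).open(path) {
            Ok(_file) => Ok(Some(IndexLock { path: path.to_path_buf() })),
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(None),
            Err(e) => Err(e),
        }
    }
}

impl Drop for IndexLock {
    fn drop(&mut self) {
        // Best-effort cleanup; a crashed holder would leave a stale file behind.
        let _ = fs::remove_file(&self.path);
    }
}
```

The sketch only covers the atomic acquisition. What a worker does when `try_acquire` returns `None` (wait, back off, or fail the fetch) and how stale lock files from crashed processes get cleaned up are still open design choices.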

Sarah Kim added labels P0: Critical, performance
Sarah Kim assigned Alex Rivera