Race condition in concurrent merge resolver on high-load clusters

#1360 Open
alex_kim opened this issue 2 hours ago
alex_kim · Maintainer · 2 hours ago

Description

When processing concurrent merges on nodes with more than 64 GB of RAM, the `merge_resolver` service occasionally throws a deadlock exception. So far this appears to happen only when `binary_blob_size` exceeds 500 MB.

Steps to Reproduce:

  1. Spin up a cluster with 4 nodes.
  2. Push 10 large binary assets simultaneously via webhook.
  3. Monitor logs for DEADLOCK_DETECTED.

Stack Trace:

2025-05-20 14:32:01 ERROR [merge_resolver] Lock contention timeout
  at resolve_conflict(node_id: "us-east-4a")
  at sync_stream(blob_hash: "0x8f9a...")

// Thread dump indicates circular dependency between:
// - lock_write_head
// - lock_index_update
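
A circular wait between two locks usually means two code paths take them in opposite orders. The dump does not confirm that, so the following is only a minimal sketch of the standard remedy, a single global acquisition order. Python's `threading` module stands in for whatever the resolver actually uses; only the two lock names come from the thread dump, and `with_ordered_locks` and `run_concurrently` are hypothetical helpers.

```python
import threading

# Lock names are taken from the thread dump; everything else is hypothetical.
lock_write_head = threading.Lock()
lock_index_update = threading.Lock()

# One global acquisition order shared by every code path:
# write head first, then index update.
LOCK_ORDER = (lock_write_head, lock_index_update)


def with_ordered_locks(action):
    """Run action() while holding both locks, always acquired in LOCK_ORDER."""
    acquired = []
    try:
        for lock in LOCK_ORDER:
            lock.acquire()
            acquired.append(lock)
        return action()
    finally:
        # Release in reverse order, including when action() raises.
        for lock in reversed(acquired):
            lock.release()


def run_concurrently(n_threads=8, iterations=200):
    """Hammer the critical section from several threads and return the count."""
    counter = {"value": 0}

    def bump():
        counter["value"] += 1

    def worker():
        for _ in range(iterations):
            with_ordered_locks(bump)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

Because every caller goes through the same order, no thread can hold `lock_index_update` while waiting on `lock_write_head`, which rules out the cycle reported above.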

This is blocking our v3.2 release. Tagging @backend-team for immediate review.

alex_kim added labels bug performance backend 2 hours ago
system automatically assigned to sarah_ops 2 hours ago
marcus_dev commented
I can repro this locally on `main`. Looks like the new async queue implementation isn't releasing locks on timeout. I'll spin up a PR shortly.
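
For illustration only, and not the actual queue code or the upcoming fix: a minimal sketch of the "release on timeout" behaviour, assuming plain `threading` locks. `acquire_all` and `LockTimeout` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when a lock cannot be acquired within the timeout (hypothetical)."""


@contextmanager
def acquire_all(locks, timeout):
    """Acquire every lock in order; on timeout or error, release what is held."""
    held = []
    try:
        for lock in locks:
            # Lock.acquire returns False if the timeout expires.
            if not lock.acquire(timeout=timeout):
                raise LockTimeout(f"gave up after {timeout}s")
            held.append(lock)
        yield
    finally:
        # Runs on success, timeout, or exception, so no lock is leaked.
        for lock in reversed(held):
            lock.release()
```

The point of the `finally` block is that a timeout on the second lock still releases the first one, so a stalled merge cannot leave a lock held indefinitely.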
45 minutes ago