ISSUE-42: WMS GetMap requests time out under high-concurrency load

Platform / Backend · High Priority · Alex Chen · Created 3 days ago

Description

When the GeoServer WMS endpoint handles more than 200 concurrent requests, GetMap operations begin to time out after approximately 30 seconds. The reverse proxy then returns 504 Gateway Timeout responses, even though the underlying PostgreSQL/PostGIS database connections remain healthy.

The issue appears to be related to the default thread pool configuration in the image rendering pipeline. Under normal load (below 100 concurrent requests), response times stay within the acceptable threshold (<2s). Once concurrency exceeds the maxThreads=200 limit, however, requests accumulate in the queue and worker threads block waiting on image-processing locks.
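A back-of-the-envelope sketch of why latency jumps once concurrency passes the pool size. It assumes all requests arrive at once and each holds one worker thread for a fixed service time, which is a simplification. The function name and the 20-second contended service time are illustrative assumptions, not measurements from this issue.

```python
import math


def completion_time_s(position: int, pool_size: int, service_time_s: float) -> float:
    """Seconds until the request at 1-based arrival `position` finishes.

    Requests are served in "waves" of `pool_size`; a request in wave k
    has to wait for k - 1 full waves before it gets a worker thread.
    """
    wave = math.ceil(position / pool_size)
    return wave * service_time_s


# Uncontended: requests 1-200 finish in one wave, request 201 needs a second one.
first_wave = completion_time_s(200, pool_size=200, service_time_s=2.0)
second_wave = completion_time_s(201, pool_size=200, service_time_s=2.0)

# Hypothetical contended case: if lock contention inflates the service time,
# the second wave finishes well past a 30-second timeout.
contended = completion_time_s(250, pool_size=200, service_time_s=20.0)
```

Under this model the queue alone only doubles latency for the overflow requests; timeouts appear once the per-request service time also grows under contention.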

Steps to Reproduce

  1. Configure GeoServer 2.24.0+ with a PostGIS data store containing a large raster layer (>500MB)
  2. Set up load testing with ab or k6 targeting the WMS GetMap endpoint
  3. Send 250 concurrent requests with identical bounding box parameters
  4. Observe that requests begin timing out at approximately the 180th concurrent request
  5. Check gc.log: note the significant GC pauses (2–4 seconds) that correlate with timeout events
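For anyone without ab or k6 at hand, steps 2 and 3 can be approximated with a standard-library script like the one below. This is a minimal sketch: the base URL, layer name, bounding box, SRS, and image format are placeholders that are not taken from this issue and need to be replaced with real values.

```python
import concurrent.futures
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical endpoint and layer name; substitute your own.
BASE_URL = "http://localhost:8080/geoserver/wms"
LAYER = "workspace:large_raster"


def build_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap URL for a single layer."""
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": layer,
        "styles": "",
        "srs": "EPSG:4326",  # assumed; use the layer's actual SRS
        "bbox": ",".join(str(v) for v in bbox),
        "width": width,
        "height": height,
        "format": "image/png",  # assumed output format
    }
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_status(url, timeout_s=60.0):
    """Return the HTTP status as a string, or the exception name on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            return str(resp.status)
    except urllib.error.HTTPError as exc:  # e.g. 504 from the proxy
        return str(exc.code)
    except Exception as exc:  # connection errors, socket timeouts
        return type(exc).__name__


def run_load(url, concurrency=250):
    """Fire `concurrency` identical requests at once and count the outcomes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch_status, [url] * concurrency))
    counts = {}
    for outcome in results:
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts


# Example (needs a running server):
# print(run_load(build_getmap_url(BASE_URL, LAYER, (-180, -90, 180, 90))))
```

Every request uses the same bounding box, as in step 3. The returned dictionary maps each outcome (for example "200" or "504") to its count.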

Expected Behavior

All GetMap requests should complete within the configured timeout window (60s) without gateway errors, and the thread pool should gracefully handle backpressure.
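One possible reading of "gracefully handle backpressure" is to shed load as soon as the render capacity is full, rather than queueing requests until the proxy times out. The sketch below shows that idea in isolation. `RenderGate` is a hypothetical name, and this is not a description of how GeoServer or its servlet container behaves.

```python
import threading


class RenderGate:
    """Caps in-flight render jobs and rejects new work when the cap is hit."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, render):
        """Run `render()` if a slot is free, else return a 503 immediately."""
        # Non-blocking acquire: a saturated gate answers at once
        # instead of parking the caller in a queue.
        if not self._slots.acquire(blocking=False):
            return 503, None
        try:
            return 200, render()
        finally:
            self._slots.release()
```

A client that receives the 503 can retry with a delay, which keeps the waiting outside the server's worker threads.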

Actual Behavior

Requests beyond the 200-concurrent threshold time out with 504 errors. Thread dumps show multiple threads blocked on ImageIO$ContainsFilter locks, and heap usage spikes to ~92% before full GC cycles trigger.
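To line up GC pauses with the timeout events, the long pauses can be pulled out of gc.log with a small script. This sketch assumes the log uses JDK unified logging with the default uptime decoration, with lines shaped like `[42.500s][info][gc] GC(8) Pause Full (...) 7536M->5120M(8192M) 3120.456ms`. The actual format depends on the -Xlog options in use, so the regular expression may need adjusting.

```python
import re

# Assumed line shape: "[<uptime>s]... Pause <Kind> ... <duration>ms"
PAUSE_RE = re.compile(
    r"^\[(?P<uptime>\d+(?:\.\d+)?)s\].*?Pause (?P<kind>\w+).*?(?P<ms>\d+(?:\.\d+)?)ms\s*$"
)


def long_pauses(lines, threshold_ms=2000.0):
    """Return (uptime_s, pause_kind, pause_ms) for pauses at or above the threshold."""
    found = []
    for line in lines:
        match = PAUSE_RE.match(line.strip())
        if match and float(match["ms"]) >= threshold_ms:
            found.append((float(match["uptime"]), match["kind"], float(match["ms"])))
    return found


# Example:
# with open("gc.log") as fh:
#     for uptime_s, kind, pause_ms in long_pauses(fh):
#         print(uptime_s, kind, pause_ms)
```

The uptime values are seconds since JVM start, so they need to be offset by the JVM start time before comparing them with proxy log timestamps.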

Environment

# Environment Details
GeoServer    : 2.24.3
Java        : OpenJDK 17.0.8
OS          : Ubuntu 22.04 LTS
Memory      : -Xmx8g -Xms4g -XX:MaxMetaspaceSize=512m
DB          : PostgreSQL 15.2 / PostGIS 3.3
Proxy       : NGINX 1.24.0 (upstream_timeout 60s)
Layer Size  : 1,247 MB GeoTIFF (38720 × 25848)
Concurrency : 250 simultaneous requests (k6)

Stack Trace (Excerpt)

Thread "http-nio-8080-exec-47" #1892 daemon running
  at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageReader.read(Unknown Source)
  at java.desktop/javax.imageio.ImageIO.read(ImageIO.java:1422)
  at org.geoserver.wms.map.StreamingImageResponse.encode(StreamingImageResponse.java:127)
  at org.geoserver.wms.map.RenderingResult.write(RenderingResult.java:89)
  at org.geoserver.wms.GetMap.execute(GetMap.java:312)
  at org.geoserver.ows.AbstractDescribeRequest.handle(AbstractDescribeRequest.java:54)
  Locked ownable synchronizers:
    - locked <0x00000007231abcd0> (a org.geotools.image.RasterSymbolizer)

Activity & Comments (5)

Alex Chen changed status from New to In Progress
3 days ago
Alex Chen assigned this issue to Maya Patel
3 days ago
Maya Patel · commented 3 days ago

Thanks for the detailed report, Alex. I've reproduced this on our staging cluster. The root cause seems to be the default ImageIO cache settings. Under high concurrency, the cache locks up because all threads are trying to read from the same raster source simultaneously.

I'm investigating two potential fixes:

Option A: Increase the imageio.cache.limit property in global.xml and add connection pooling for raster readers.

Option B: Switch to a non-blocking image rendering pipeline using GeoTools GridCoverage2D with tile-based processing instead of full-raster loads.
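A rough calculation of what is at stake between full-raster loads and tile-based processing, using the raster dimensions from the environment details. The band count (3), sample size (8-bit), and tile size (512 × 512) are assumptions. The thread does not confirm the raster's pixel layout, or that a full decode happens on every request.

```python
def decoded_bytes(width: int, height: int, bands: int = 3, bytes_per_sample: int = 1) -> int:
    """Uncompressed in-memory size of a raster region, in bytes."""
    return width * height * bands * bytes_per_sample


GIB = 1024 ** 3
MIB = 1024 ** 2

# Whole raster from the environment details: 38720 x 25848 pixels.
full_raster = decoded_bytes(38720, 25848)  # 3,002,503,680 bytes
full_raster_gib = full_raster / GIB        # about 2.8 GiB

# One assumed 512 x 512 tile.
tile = decoded_bytes(512, 512)             # 786,432 bytes
tile_mib = tile / MIB                      # 0.75 MiB
```

Under these assumptions a single fully decoded copy takes roughly a third of the -Xmx8g heap, while a tile stays under 1 MiB, so a few concurrent full decodes could exhaust the heap where thousands of tiles would not.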

James Rodriguez · commented 2 days ago

We're seeing the same behavior in production on v2.24.2. Our monitoring shows the issue starts at around 150 concurrent requests, not 200 as described. Could this be related to JVM memory settings? We're running with -Xmx4g vs the -Xmx8g in the environment notes.

Also, has anyone tried the workaround of enabling UseJAI=false in the WMS config?

Maya Patel added labels: bug, backend, performance
2 days ago
Maya Patel · commented 1 day ago

@James Rodriguez: yes, memory settings definitely play a role. With -Xmx4g, the full GC pauses will be more frequent. I've benchmarked both configs and confirmed that -Xmx8g delays the onset but doesn't eliminate the issue.

The real fix is Option B: implementing tile-based rendering with GridCoverage2D. I've got a PR branch with a prototype that reduces memory pressure by ~60% and pushes the concurrency threshold beyond 500 requests. Still need to add tests and clean up the code.

@Alex Chen: can you test the patch on your environment? I'll share the build artifact via Slack.