Overview

The Data Ingestion module provides a unified interface for connecting external data sources to the GeoServer spatial database. It handles format parsing, coordinate reference system (CRS) transformation, schema inference, and error routing automatically.

💡 Why use the managed pipeline?

Instead of requiring custom ETL scripts, the ingestion engine normalizes data streams, applies GeoPackage/PostGIS schemas, and routes payloads to your storage layer with exactly-once delivery guarantees.

Supported Data Sources

Ingestion connectors are available for batch files, database exports, cloud storage, and real-time streams.

Source Type   | Formats                                    | Delivery Mode     | Latency
--------------|--------------------------------------------|-------------------|----------------
Geospatial    | GeoJSON, Shapefile, KML, GeoTIFF, LAS/LAZ  | Batch / Streaming | ~200 ms to 5 s
Database      | PostGIS, MySQL, SQLite, MongoDB            | Sync / CDC        | ~500 ms
Stream        | Apache Kafka, AWS Kinesis, MQTT, WebSocket | Real-time         | <100 ms
Cloud Storage | S3, GCS, Azure Blob (CSV, Parquet, NDJSON) | Event-triggered   | ~1-3 s

Pipeline Architecture

Data flows through a configurable four-stage pipeline. Each stage can be customized via YAML or the CLI.

  • 1. Connect: Establish secure links to source endpoints. Supports OAuth2, API keys, and IAM roles.
  • 2. Validate: Schema verification, CRS checking, topology validation, and duplicate detection.
  • 3. Transform: Apply coordinate reprojection, attribute mapping, geometry simplification, and enrichment rules.
  • 4. Route & Store: Write validated data to target storage (PostGIS, S3, GeoPackage) and publish layer endpoints.
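As a rough mental model, the stages above can be pictured as a chain of functions that each consume and produce a stream of records. The sketch below is illustrative only: the function names, the record shape, and the in-memory sink are hypothetical and are not part of the module's API. It skips the Connect stage by taking records that are already in memory.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, float]  # hypothetical record shape: {"id", "lon", "lat"}


def validate(records: Iterable[Record]) -> Iterator[Record]:
    """Stage 2 (sketch): drop duplicate ids and out-of-range coordinates."""
    seen = set()
    for rec in records:
        if rec["id"] in seen:
            continue  # duplicate detection
        if not (-180 <= rec["lon"] <= 180 and -90 <= rec["lat"] <= 90):
            continue  # outside the valid longitude/latitude range
        seen.add(rec["id"])
        yield rec


def transform(records: Iterable[Record], precision: int = 5) -> Iterator[dict]:
    """Stage 3 (sketch): map attributes into a GeoJSON-style point."""
    for rec in records:
        yield {
            "id": rec["id"],
            "geometry": {
                "type": "Point",
                # GeoJSON coordinate order is [longitude, latitude]
                "coordinates": [
                    round(rec["lon"], precision),
                    round(rec["lat"], precision),
                ],
            },
        }


def route_and_store(records: Iterable[dict], sink: List[dict]) -> int:
    """Stage 4 (sketch): append to an in-memory sink and return the count."""
    count = 0
    for rec in records:
        sink.append(rec)
        count += 1
    return count


def run_pipeline(source: Iterable[Record], sink: List[dict]) -> int:
    """Chain the stages lazily; nothing runs until the sink pulls records."""
    return route_and_store(transform(validate(source)), sink)
```

Because each stage is a generator, records move through one at a time, which is the same shape whether the source is a batch file or a stream.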

Configuration Example

Define ingestion pipelines using a declarative YAML manifest. The example below configures a real-time Kafka stream ingesting telemetry points.

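pipeline: kafka-telemetry-ingest   # pipeline identifier, used as pipeline_id in API calls
source:
  type: kafka
  topic: vehicle_gps_v2            # Kafka topic to consume
  consumer_group: geoserver-fleet  # Kafka consumer group for this pipeline
transform:
  crs_target: EPSG:4326            # reproject to WGS 84 longitude/latitude
  geometry_field: location         # record field that holds the geometry
  simplify: true
  tolerance: 0.00001               # simplification tolerance
sink:
  type: postgis
  schema: public
  table: telemetry_points
error_policy: route_to_deadletter  # failed records are routed to a dead-letter destination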
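The tolerance value is easier to judge once it is converted to a ground distance. The sketch below assumes the tolerance is expressed in the units of crs_target (degrees for EPSG:4326), which the manifest itself does not state, and uses a spherical approximation based on the WGS 84 equatorial radius. The function name is hypothetical.

```python
import math
from typing import Tuple

WGS84_SEMI_MAJOR_AXIS_M = 6_378_137.0  # equatorial radius of the WGS 84 ellipsoid


def degrees_to_meters(tolerance_deg: float, latitude_deg: float = 0.0) -> Tuple[float, float]:
    """Approximate ground size of a tolerance given in degrees.

    Returns (north_south_m, east_west_m). Spherical approximation only:
    the east-west distance shrinks with the cosine of the latitude.
    """
    meters_per_degree = 2 * math.pi * WGS84_SEMI_MAJOR_AXIS_M / 360
    north_south = tolerance_deg * meters_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


if __name__ == "__main__":
    ns, ew = degrees_to_meters(0.00001, latitude_deg=60.0)
    print(f"north-south: {ns:.2f} m, east-west at 60 deg latitude: {ew:.2f} m")
```

Under these assumptions, 0.00001 degrees is a little over one meter at the equator and about half that in the east-west direction at 60 degrees latitude.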

Best Practices

  • Validate early: Enable strict schema enforcement at the ingestion boundary to prevent downstream corruption.
  • Partition by time/space: Use partitioned tables or spatial indexes (H3, S2, QuadTree) for high-velocity streams.
  • Handle clock skew: Tag every record with an ingestion timestamp stored separately from its event timestamp, so that records from devices with skewed clocks can still be ordered and audited.
  • Monitor throughput: Set up alerts for queue depth, transformation errors, and sink latency spikes.
  • Use idempotent writes: Enable upserts or unique constraints to safely retry failed messages.
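The idempotent-write and timestamp practices can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the module's own sink implementation: the column names and the event_id key are hypothetical, SQLite stands in for PostGIS, and the ON CONFLICT clause requires SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by a unique event id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE telemetry_points (
            event_id    TEXT PRIMARY KEY,  -- unique key makes retries safe
            lon         REAL NOT NULL,
            lat         REAL NOT NULL,
            event_time  TEXT NOT NULL,     -- timestamp reported by the device
            ingested_at TEXT NOT NULL      -- timestamp assigned at ingestion
        )
        """
    )
    return conn


def upsert_point(conn: sqlite3.Connection, event_id: str, lon: float,
                 lat: float, event_time: str) -> None:
    """Insert a point, or overwrite the existing row if event_id was seen before."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        INSERT INTO telemetry_points (event_id, lon, lat, event_time, ingested_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(event_id) DO UPDATE SET
            lon = excluded.lon,
            lat = excluded.lat,
            event_time = excluded.event_time,
            ingested_at = excluded.ingested_at
        """,
        (event_id, lon, lat, event_time, ingested_at),
    )
    conn.commit()
```

Replaying the same message any number of times leaves exactly one row per event_id, which is what makes retries safe.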

API & CLI Integration

Trigger and manage pipelines programmatically via the REST API or the geospatial CLI tool. Requests must be authenticated as a service account with the ingestion:write permission.

# Start a pipeline run
curl -X POST https://api.geoserver.io/v1/pipelines/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "pipeline_id": "kafka-telemetry-ingest", "mode": "streaming" }'

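# Check run status (URL is quoted so the shell does not treat "?" as a glob character)
curl "https://api.geoserver.io/v1/pipelines/status?limit=5" \
  -H "Authorization: Bearer $TOKEN"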
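
The same run request can be built from Python with the standard library. This sketch mirrors the curl call above and nothing more: the endpoint, headers, and JSON fields are copied from that example, the function name is hypothetical, and the response format is not documented here, so the code only constructs the request.

```python
import json
import urllib.request

API_BASE = "https://api.geoserver.io/v1"  # base URL taken from the curl examples


def build_run_request(token: str, pipeline_id: str,
                      mode: str = "streaming") -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a pipeline run."""
    body = json.dumps({"pipeline_id": pipeline_id, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/pipelines/run",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the returned object to urllib.request.urlopen and handle urllib.error.HTTPError for non-2xx responses.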