Data Ingestion Pipeline

Overview

The Data Ingestion module provides a unified interface for connecting external data sources to the GeoServer spatial database. It handles format parsing, coordinate reference system (CRS) transformation, schema inference, and error routing automatically.

💡 Why use the managed pipeline?

Instead of writing custom ETL scripts, the ingestion engine normalizes data streams, applies GeoPackage/PostGIS schemas, and routes payloads to your storage layer with exactly-once delivery guarantees.

Supported Data Sources

Ingestion connectors are available for batch files, database exports, cloud storage, and real-time streams.

Source Type	Formats	Delivery Mode	Typical Latency
Geospatial	GeoJSON, Shapefile, KML, GeoTIFF, LAS/LAZ	Batch / Streaming	~200 ms to 5 s
Database	PostGIS, MySQL, SQLite, MongoDB	Sync / CDC	~500 ms
Stream	Apache Kafka, AWS Kinesis, MQTT, WebSocket	Real-time	< 100 ms
Cloud Storage	S3, GCS, Azure Blob (CSV, Parquet, NDJSON)	Event-triggered	~1 s to 3 s

Pipeline Architecture

Data flows through a configurable four-stage pipeline. Each stage can be customized via YAML or the CLI.

1. Connect: Establish secure links to source endpoints. Supports OAuth2, API keys, and IAM roles.
2. Validate: Verify schemas, check the CRS, validate topology, and detect duplicates.
3. Transform: Apply coordinate reprojection, attribute mapping, geometry simplification, and enrichment rules.
4. Route & Store: Write validated data to target storage (PostGIS, S3, GeoPackage) and publish layer endpoints.
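
As a rough illustration of how the four stages compose, here is a minimal Python sketch. It is not the ingestion engine's implementation: the `Record` type, the stage functions, and the in-memory sink and dead-letter list are all hypothetical stand-ins, and the Transform stage only rounds coordinates and adds one attribute rather than doing real reprojection.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Record:
    """Hypothetical record shape used only for this sketch."""
    record_id: str
    lon: float
    lat: float
    crs: str = "EPSG:4326"
    attrs: dict = field(default_factory=dict)


def connect(raw_rows: Iterable[dict]) -> Iterator[Record]:
    """Stage 1 (Connect): stand-in for a real source connector."""
    for row in raw_rows:
        yield Record(row["id"], row["lon"], row["lat"], row.get("crs", "EPSG:4326"))


def validate(records: Iterable[Record], dead_letter: list) -> Iterator[Record]:
    """Stage 2 (Validate): CRS check, coordinate range check, duplicate detection."""
    seen = set()
    for rec in records:
        if rec.crs != "EPSG:4326":
            dead_letter.append((rec, "unsupported CRS"))
        elif not (-180 <= rec.lon <= 180 and -90 <= rec.lat <= 90):
            dead_letter.append((rec, "coordinates out of range"))
        elif rec.record_id in seen:
            dead_letter.append((rec, "duplicate id"))
        else:
            seen.add(rec.record_id)
            yield rec


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Stage 3 (Transform): round coordinates and add a simple enrichment attribute."""
    for rec in records:
        rec.lon = round(rec.lon, 5)
        rec.lat = round(rec.lat, 5)
        rec.attrs["hemisphere"] = "N" if rec.lat >= 0 else "S"
        yield rec


def route_and_store(records: Iterable[Record], sink: dict) -> int:
    """Stage 4 (Route & Store): write to an in-memory dict standing in for storage."""
    count = 0
    for rec in records:
        sink[rec.record_id] = rec
        count += 1
    return count


def run_pipeline(raw_rows: Iterable[dict]):
    """Chain the four stages; rejected records end up in the dead-letter list."""
    sink, dead_letter = {}, []
    route_and_store(transform(validate(connect(raw_rows), dead_letter)), sink)
    return sink, dead_letter
```

Because each stage is a generator, records stream through one at a time, and a record rejected in Validate never reaches Transform or the sink.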

Configuration Example

Define ingestion pipelines using a declarative YAML manifest. The example below configures a real-time Kafka stream ingesting telemetry points.

pipeline: kafka-telemetry-ingest
source:
  type: kafka
  topic: vehicle_gps_v2
  consumer_group: geoserver-fleet
transform:
  crs_target: EPSG:4326
  geometry_field: location
  simplify: true
  tolerance: 0.00001
sink:
  type: postgis
  schema: public
  table: telemetry_points
error_policy: route_to_deadletter
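
Before submitting a manifest, it can help to sanity-check it locally. The sketch below mirrors the manifest above as a Python dictionary and applies a few checks. The rules themselves (which keys are required, that `crs_target` is an EPSG code, that `tolerance` must be positive when `simplify` is on) are assumptions made for illustration, not the engine's documented validation rules, and `check_manifest` is a hypothetical helper.

```python
import re

# Assumed to be mandatory for this sketch; the real engine may differ.
REQUIRED_KEYS = ("pipeline", "source", "sink")

MANIFEST = {
    "pipeline": "kafka-telemetry-ingest",
    "source": {"type": "kafka", "topic": "vehicle_gps_v2", "consumer_group": "geoserver-fleet"},
    "transform": {
        "crs_target": "EPSG:4326",
        "geometry_field": "location",
        "simplify": True,
        "tolerance": 0.00001,
    },
    "sink": {"type": "postgis", "schema": "public", "table": "telemetry_points"},
    "error_policy": "route_to_deadletter",
}


def check_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in manifest:
            problems.append(f"missing top-level key: {key}")

    transform = manifest.get("transform", {})
    crs = transform.get("crs_target")
    if crs is not None and not re.fullmatch(r"EPSG:\d+", str(crs)):
        problems.append(f"crs_target is not an EPSG code: {crs!r}")

    if transform.get("simplify"):
        tol = transform.get("tolerance")
        is_number = isinstance(tol, (int, float)) and not isinstance(tol, bool)
        if not is_number or tol <= 0:
            problems.append("simplify is enabled but tolerance is not a positive number")
    return problems
```

Calling `check_manifest(MANIFEST)` on the dictionary above returns an empty list; removing `sink` or setting `crs_target` to a non-EPSG string adds one entry per problem.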

Best Practices

Validate early: Enable strict schema enforcement at the ingestion boundary to prevent downstream corruption.
Partition by time/space: Use partitioned tables or spatial indexes (H3, S2, QuadTree) for high-velocity streams.
Handle clock skew: Tag all records with ingestion timestamps separate from event timestamps.
Monitor throughput: Set up alerts for queue depth, transformation errors, and sink latency spikes.
Use idempotent writes: Enable upserts or unique constraints to safely retry failed messages.
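
Two of the practices above, separate ingestion timestamps and idempotent writes, can be sketched together. The example uses Python's built-in `sqlite3` module as a stand-in for the real sink, and the table layout and the `(vehicle_id, event_ts)` key are assumptions chosen for illustration. PostgreSQL, and therefore PostGIS, supports a similar `INSERT ... ON CONFLICT ... DO UPDATE` clause.

```python
import sqlite3
from datetime import datetime, timezone


def open_sink() -> sqlite3.Connection:
    """Create an in-memory table keyed on (vehicle_id, event_ts)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE telemetry_points (
            vehicle_id  TEXT NOT NULL,
            event_ts    TEXT NOT NULL,   -- when the device recorded the point
            ingested_at TEXT NOT NULL,   -- when the pipeline received it
            lon         REAL NOT NULL,
            lat         REAL NOT NULL,
            PRIMARY KEY (vehicle_id, event_ts)
        )
        """
    )
    return conn


def upsert_point(conn, vehicle_id: str, event_ts: str, lon: float, lat: float) -> None:
    """Insert a point, or overwrite the existing row if the key is already present."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        INSERT INTO telemetry_points (vehicle_id, event_ts, ingested_at, lon, lat)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(vehicle_id, event_ts) DO UPDATE SET
            lon = excluded.lon,
            lat = excluded.lat,
            ingested_at = excluded.ingested_at
        """,
        (vehicle_id, event_ts, ingested_at, lon, lat),
    )
    conn.commit()
```

Retrying the same message calls `upsert_point` again with the same key, so the table still holds one row for that event. Keeping `event_ts` and `ingested_at` in separate columns means a device with a skewed clock does not distort arrival-order queries.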

API & CLI Integration

Trigger and manage pipelines programmatically via REST or the geospatial CLI tool. Authenticated requests require a service account with the ingestion:write permission.

# Start a pipeline run
curl -X POST https://api.geoserver.io/v1/pipelines/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "pipeline_id": "kafka-telemetry-ingest", "mode": "streaming" }'

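# Check run status (URL quoted so the shell does not treat "?" as a glob character)
curl "https://api.geoserver.io/v1/pipelines/status?limit=5" \
  -H "Authorization: Bearer $TOKEN"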
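
The same two calls can be prepared from Python with only the standard library. This is a sketch under assumptions: the endpoints and payload are taken from the curl examples above, the helper names are hypothetical, and it assumes the status endpoint accepts the same bearer token. The response format is not documented here, so the sketch only builds the requests and does not parse any reply.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.geoserver.io/v1"  # base URL taken from the curl examples


def build_run_request(token: str, pipeline_id: str, mode: str = "streaming") -> urllib.request.Request:
    """Prepare the POST that starts a pipeline run."""
    body = json.dumps({"pipeline_id": pipeline_id, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/pipelines/run",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def build_status_request(token: str, limit: int = 5) -> urllib.request.Request:
    """Prepare the GET that lists recent run statuses."""
    query = urllib.parse.urlencode({"limit": limit})
    return urllib.request.Request(
        f"{API_BASE}/pipelines/status?{query}",
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )
```

To send a prepared request, pass it to `urllib.request.urlopen(request)` and read the response body; building the request separately keeps the URL and headers easy to inspect or log before anything goes over the network.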