Providence — Data Infrastructure

Overview

Providence is the foundational data layer that powers Dictionary's global language services. It orchestrates the ingestion, verification, normalization, and distribution of over 15 million lexical entries across 100+ languages.

Designed for developers, researchers, and enterprise integrations, Providence guarantees deterministic outputs, versioned datasets, and transparent lineage tracking for every word, definition, and translation.

Core Capabilities

Real-time lexical lookup with sub-50ms latency
Version-controlled dictionary snapshots (monthly)
Multi-dialect pronunciation audio routing
Contextual synonym/antonym graph resolution
GDPR/CCPA-compliant data anonymization pipelines

Data Provenance

Every entry in Providence carries a verifiable lineage trail. We source from licensed academic corpora, peer-reviewed linguistic journals, and validated community contributions. All data undergoes a multi-stage verification process before entering the production index.

Source Type	Volume	Verification Tier	License
Academic Corpora	8.2M entries	Tier 1 (Peer-Reviewed)	Proprietary / Institutional
Verified Lexicographers	4.1M entries	Tier 1 (Expert)	Dictionary Exclusive
Community Contributions	2.7M entries	Tier 2 (AI+Human Review)	CC BY-NC 4.0
Regional Dialects	600K entries	Tier 3 (Linguist-Validated)	Open Linguistics

Lineage metadata is exposed via the ?include=lineage query parameter. Each record contains source IDs, review timestamps, confidence scores, and revision hashes.
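
As a minimal sketch of requesting lineage metadata, the snippet below appends include=lineage to a lookup URL. Only the include=lineage parameter and the /v2/lookup path appear on this page; the host api.example.com, the `word` parameter name, and the pairing of lineage with the lookup endpoint are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical host: the real base URL is not stated on this page.
BASE_URL = "https://api.example.com"


def lookup_url(word: str, include_lineage: bool = False) -> str:
    """Build a /v2/lookup URL, optionally asking for lineage metadata."""
    params = {"word": word}  # "word" is an assumed parameter name
    if include_lineage:
        params["include"] = "lineage"
    return f"{BASE_URL}/v2/lookup?{urlencode(params)}"


print(lookup_url("ephemeral", include_lineage=True))
# https://api.example.com/v2/lookup?word=ephemeral&include=lineage
```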

Processing Pipeline

Providence uses a deterministic, batch-and-stream hybrid architecture. Data is ingested over SFTP, API webhooks, and licensed data feeds. The pipeline enforces strict schema validation, deduplication, and semantic normalization.

Pipeline Stages

Ingestion: Raw XML/JSON parsing with checksum validation
Normalization: Unicode NFKC, diacritic stripping, case folding
Enrichment: POS tagging, morphological parsing, phonetic transcription (IPA)
Verification: Cross-reference against gold-standard corpora
Indexing: Distributed inverted index + vector embeddings for semantic search
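
The Normalization stage can be sketched with Python's standard unicodedata module. This is an illustration of the three named steps (NFKC, diacritic stripping, case folding), not the pipeline's actual code; the function name normalize_headword and the order of the steps are assumptions.

```python
import unicodedata


def normalize_headword(text: str) -> str:
    """Apply NFKC, strip diacritics, then case-fold (assumed order)."""
    # NFKC folds compatibility forms such as ligatures and full-width letters.
    text = unicodedata.normalize("NFKC", text)
    # Decompose so accents become separate combining marks, then drop them.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose what remains and fold case for caseless matching.
    return unicodedata.normalize("NFC", stripped).casefold()


print(normalize_headword("Éphémère"))  # ephemere
print(normalize_headword("Straße"))    # strasse
```

Diacritic stripping is lossy ("résumé" and "resume" collapse to the same key), so a key like this is usually best kept alongside the original spelling rather than in place of it.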

Failed records are quarantined and exposed through the /pipeline/quarantine endpoint for manual review. The SLA requires that 99.2% of records be processed successfully within 4 hours of ingestion.
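
The checksum validation and quarantine behavior described above might look roughly like the following. SHA-256 as the checksum algorithm, the (payload, expected digest) record shape, and the function names are all assumptions. The sketch checks only the 99.2% success ratio and does not model the 4-hour window.

```python
import hashlib
import json


def ingest(records):
    """Split (payload_bytes, expected_sha256_hex) pairs into accepted and quarantined."""
    accepted, quarantined = [], []
    for payload, expected in records:
        # Assumed checksum scheme: hex-encoded SHA-256 of the raw payload.
        if hashlib.sha256(payload).hexdigest() != expected:
            quarantined.append((payload, "checksum mismatch"))
            continue
        try:
            accepted.append(json.loads(payload))
        except ValueError:  # covers JSONDecodeError and bad UTF-8
            quarantined.append((payload, "invalid JSON"))
    return accepted, quarantined


def meets_sla(accepted_count: int, total: int, target: float = 0.992) -> bool:
    """True when the success ratio reaches the 99.2% target."""
    return total > 0 and accepted_count / total >= target
```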

API Specifications

Providence exposes a RESTful API with OpenAPI 3.0 documentation. All endpoints return JSON, support pagination, and include rate-limit headers. Authentication uses Bearer tokens scoped to dataset access levels.
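
A hedged sketch of an authenticated request follows. It only constructs the request object and sends nothing. The host api.example.com and the `word` query parameter are placeholders, not documented values; the Bearer scheme and the /v2/lookup path are taken from this page.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical host: the real base URL is not stated on this page.
BASE_URL = "https://api.example.com"


def build_lookup_request(word: str, token: str) -> urllib.request.Request:
    """Prepare a GET /v2/lookup request carrying a Bearer token."""
    url = f"{BASE_URL}/v2/lookup?{urlencode({'word': word})}"  # assumed parameter
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


req = build_lookup_request("ephemeral", "YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would send it; the names of the rate-limit headers on the response are not specified here, so the sketch does not try to read them.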

Example Response Structure

{
  "word": "ephemeral",
  "phonetic": "/əˈfem.ər.əl/",
  "pos": "adjective",
  "definitions": [
    {
      "text": "Lasting for a very short time",
      "source_id": "ACAD-CORP-8821",
      "confidence": 0.98,
      "verified": true
    }
  ],
  "lineage_hash": "sha256:a3f1c9d...",
  "updated": "2025-03-12T08:14:00Z"
}
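
To show how a client might consume this structure, the sketch below parses the sample response and keeps only verified definitions above a confidence threshold. The 0.9 threshold and the helper name trusted_definitions are illustrative assumptions; the field names are those in the sample.

```python
import json
from datetime import datetime

SAMPLE = """
{
  "word": "ephemeral",
  "phonetic": "/əˈfem.ər.əl/",
  "pos": "adjective",
  "definitions": [
    {
      "text": "Lasting for a very short time",
      "source_id": "ACAD-CORP-8821",
      "confidence": 0.98,
      "verified": true
    }
  ],
  "lineage_hash": "sha256:a3f1c9d...",
  "updated": "2025-03-12T08:14:00Z"
}
"""


def trusted_definitions(entry: dict, min_confidence: float = 0.9) -> list[str]:
    """Return the text of verified definitions at or above the threshold."""
    return [
        d["text"]
        for d in entry.get("definitions", [])
        if d.get("verified") and d.get("confidence", 0.0) >= min_confidence
    ]


entry = json.loads(SAMPLE)
# Swap the trailing "Z" for an explicit offset so older Python versions parse it.
updated = datetime.fromisoformat(entry["updated"].replace("Z", "+00:00"))
print(trusted_definitions(entry), updated.isoformat())
```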

Endpoint	Method	Rate Limit	Auth
`/v2/lookup`	GET	1,000 req/min	Bearer
`/v2/batch`	POST	500 req/min	Bearer
`/v2/datasets`	GET	100 req/min	Public
`/v2/audio`	GET	2,000 req/min	Bearer

Reliability & Scale

Providence is deployed across three geographically distributed regions with active-active replication. All data is served via edge caching with a 99.99% uptime SLA for Pro and Enterprise tiers.

Metric	Target	Current Performance
Availability	99.99%	99.994% (30-day avg)
P95 Latency	< 45ms	32ms
Data Freshness	< 6 hours	4.2 hours
Index Consistency	Strong	Verified via checksum audits

Failover is automatic. If a region exceeds error thresholds, traffic is rerouted via DNS failover and edge proxy rules. Status pages and incident reports are published at status.dictionary.com.

Compliance & Ethics

Providence adheres to international data protection standards and maintains rigorous ethical guidelines for linguistic data curation. All processing activities are logged, auditable, and reversible upon request.

Compliance Framework

GDPR Article 17 (Right to Erasure) supported for user-generated contributions
CCPA/CPRA compliance for California residents
ISO 27001:2022 certified infrastructure
Regular third-party bias audits across dialectal and regional data
Transparent licensing disclosures per entry

We maintain a public Data Ethics Charter outlining our commitment to linguistic diversity, non-commercialization of endangered languages, and equitable representation across global dialects.

Technical FAQ

How often is the production index updated?

Full index snapshots are published monthly. Incremental updates for verified entries are streamed continuously and reach production within 4 hours of ingestion.

Can I access raw source datasets?

Raw datasets are available to Enterprise partners under a separate data licensing agreement. Request access via the Enterprise portal or contact our data partnerships team.

How is data versioning handled?

Every dataset is versioned using semantic versioning (for example, v2.4.1). To get deterministic responses, pin your integration to a specific version with the Accept-Version header.
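
A small sketch of version pinning: it validates a version string and builds request headers that include Accept-Version. Whether the header value carries the leading "v" is not stated on this page, so passing the string through unchanged is an assumption, as are the helper names.

```python
import re

# MAJOR.MINOR.PATCH with an optional leading "v", as in the v2.4.1 example.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(version: str) -> tuple[int, int, int]:
    """Turn 'v2.4.1' into (2, 4, 1) so versions compare numerically."""
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def pinned_headers(version: str, token: str) -> dict[str, str]:
    """Headers that pin a request to one dataset version."""
    parse_version(version)  # reject malformed versions before sending
    return {"Authorization": f"Bearer {token}", "Accept-Version": version}


print(pinned_headers("v2.4.1", "YOUR_TOKEN")["Accept-Version"])  # v2.4.1
```

Comparing parsed tuples rather than raw strings avoids ordering v2.10.0 before v2.4.1.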

What happens during rate limit exhaustion?

When a rate limit is exhausted, the API returns HTTP 429 with a Retry-After header. Excess requests are rejected rather than queued, to preserve system stability. To avoid rejections, upgrade to a higher tier or implement client-side backoff.
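
One possible client-side backoff strategy is sketched below. It honors Retry-After when the header holds a number of seconds and otherwise falls back to exponential backoff with jitter. The `send` callable, its (status, headers, body) return shape, and the base and cap values are assumptions for illustration; an HTTP-date value in Retry-After is not handled.

```python
import random
import time


def retry_delay(headers: dict, attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before the next try; `attempt` is 0-based."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    # No usable header: exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send()` until it stops returning 429 or attempts run out.

    `send` is a caller-supplied function returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(headers, attempt))
    return status, body
```

Injecting `sleep` keeps the helper easy to test without real waiting.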