Overview

Providence is the foundational data layer that powers Dictionary's global language services. It orchestrates the ingestion, verification, normalization, and distribution of over 15 million lexical entries across 100+ languages.

Designed for developers, researchers, and enterprise integrations, Providence guarantees deterministic outputs, versioned datasets, and transparent lineage tracking for every word, definition, and translation.

Core Capabilities

  • Real-time lexical lookup with sub-50ms latency
  • Version-controlled dictionary snapshots (monthly)
  • Multi-dialect pronunciation audio routing
  • Contextual synonym/antonym graph resolution
  • GDPR/CCPA-compliant data anonymization pipelines

Data Provenance

Every entry in Providence carries a verifiable lineage trail. We source from licensed academic corpora, peer-reviewed linguistic journals, and validated community contributions. All data undergoes a multi-stage verification process before entering the production index.

Source Type               Volume         Verification Tier            License
Academic Corpora          8.2M entries   Tier 1 (Peer-Reviewed)       Proprietary / Institutional
Verified Lexicographers   4.1M entries   Tier 1 (Expert)              Dictionary Exclusive
Community Contributions   2.7M entries   Tier 2 (AI+Human Review)     CC BY-NC 4.0
Regional Dialects         600K entries   Tier 3 (Linguist-Validated)  Open Linguistics

Lineage metadata is exposed via the ?include=lineage query parameter. Each record contains source IDs, review timestamps, confidence scores, and revision hashes.
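As a rough illustration of the include=lineage parameter, the sketch below builds a lookup URL with and without it. The host name and the "word" parameter name are assumptions made for the example; only the /v2/lookup path and include=lineage come from this document.

```python
from urllib.parse import urlencode

# Hypothetical host: the documentation does not state the API base URL.
BASE_URL = "https://api.example.com"


def lookup_url(word: str, include_lineage: bool = False) -> str:
    """Build a /v2/lookup URL, optionally requesting lineage metadata."""
    # "word" as the query parameter name is an assumption for illustration.
    params = {"word": word}
    if include_lineage:
        params["include"] = "lineage"
    return f"{BASE_URL}/v2/lookup?{urlencode(params)}"


print(lookup_url("ephemeral", include_lineage=True))
# https://api.example.com/v2/lookup?word=ephemeral&include=lineage
```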

Processing Pipeline

Providence uses a deterministic, batch-and-stream hybrid architecture. Ingestion occurs via SFTP, API webhooks, and licensed data feeds. The pipeline enforces strict schema validation, deduplication, and semantic normalization.

Pipeline Stages

  • Ingestion: Raw XML/JSON parsing with checksum validation
  • Normalization: Unicode NFKC, diacritic stripping, case folding
  • Enrichment: POS tagging, morphological parsing, phonetic transcription (IPA)
  • Verification: Cross-reference against gold-standard corpora
  • Indexing: Distributed inverted index + vector embeddings for semantic search
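The Normalization stage listed above can be approximated with Python's standard unicodedata module. This is a minimal sketch, not the production implementation: the function name is hypothetical, and it assumes the three steps run in the listed order to produce a matching key rather than a display form.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Return a lookup key: NFKC, then diacritics stripped, then case-folded."""
    # Step 1: NFKC folds compatibility characters (e.g. the "ﬁ" ligature).
    composed = unicodedata.normalize("NFKC", text)
    # Step 2: decompose, then drop combining marks to strip diacritics.
    decomposed = unicodedata.normalize("NFD", composed)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Step 3: case folding is more aggressive than lower() (e.g. "ß" -> "ss").
    return unicodedata.normalize("NFC", stripped).casefold()


print(normalize_key("Éphémère"))  # ephemere
print(normalize_key("Straße"))    # strasse
```

Diacritic stripping is lossy, so a key like this suits matching and deduplication; the original spelling would need to be stored separately for display.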

Failed records are quarantined and exposed through the /pipeline/quarantine endpoint for manual review. The SLA requires that 99.2% of records be processed successfully within 4 hours of ingestion.
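To make the checksum validation and quarantine behaviour concrete, here is a small sketch that splits a batch into accepted and quarantined records and compares the success ratio with the 99.2% figure. The record shape, the function names, and the use of SHA-256 for ingestion checksums are assumptions; the document does not specify them. The sketch also ignores the 4-hour time component of the SLA.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RawRecord:
    record_id: str
    payload: bytes
    expected_sha256: str  # hex digest supplied by the feed (assumed format)


def partition_by_checksum(records):
    """Split records into (accepted, quarantined) by checksum match."""
    accepted, quarantined = [], []
    for rec in records:
        actual = hashlib.sha256(rec.payload).hexdigest()
        if actual == rec.expected_sha256:
            accepted.append(rec)
        else:
            quarantined.append(rec)  # held back for manual review
    return accepted, quarantined


def meets_sla(accepted_count: int, total: int, threshold: float = 0.992) -> bool:
    """True if the share of successfully processed records meets the threshold."""
    return total > 0 and accepted_count / total >= threshold
```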

API Specifications

Providence exposes a RESTful API with OpenAPI 3.0 documentation. All endpoints return JSON, support pagination, and include rate-limit headers. Authentication uses Bearer tokens scoped to dataset access levels.
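A Bearer-authenticated request might be assembled as follows. This is a sketch under assumptions: the host name and the "word" parameter are hypothetical, and the request object is only built, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical host: the documentation does not state the API base URL.
BASE_URL = "https://api.example.com"


def build_lookup_request(word: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated GET to /v2/lookup."""
    # "word" as the query parameter name is an assumption for illustration.
    query = urllib.parse.urlencode({"word": word})
    return urllib.request.Request(
        f"{BASE_URL}/v2/lookup?{query}",
        headers={
            "Authorization": f"Bearer {token}",  # token scoped to dataset access
            "Accept": "application/json",
        },
        method="GET",
    )


req = build_lookup_request("ephemeral", "YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would perform the call; any rate-limit headers could then be read from the response headers.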

Example Response Structure

JSON
{
  "word": "ephemeral",
  "phonetic": "/əˈfem.ər.əl/",
  "pos": "adjective",
  "definitions": [
    {
      "text": "Lasting for a very short time",
      "source_id": "ACAD-CORP-8821",
      "confidence": 0.98,
      "verified": true
    }
  ],
  "lineage_hash": "sha256:a3f1c9d...",
  "updated": "2025-03-12T08:14:00Z"
}

Endpoint      Method  Rate Limit     Auth Required
/v2/lookup    GET     1,000 req/min  Bearer
/v2/batch     POST    500 req/min    Bearer
/v2/datasets  GET     100 req/min    Public
/v2/audio     GET     2,000 req/min  Bearer
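A client consuming the example response above could filter definitions on the verified and confidence fields. The sketch below parses that example as given (the lineage hash is truncated in the source and left as-is); the function name and the 0.9 threshold are arbitrary choices for illustration.

```python
import json

# The example response from this section; the hash is truncated in the source.
SAMPLE_RESPONSE = """
{
  "word": "ephemeral",
  "phonetic": "/əˈfem.ər.əl/",
  "pos": "adjective",
  "definitions": [
    {
      "text": "Lasting for a very short time",
      "source_id": "ACAD-CORP-8821",
      "confidence": 0.98,
      "verified": true
    }
  ],
  "lineage_hash": "sha256:a3f1c9d...",
  "updated": "2025-03-12T08:14:00Z"
}
"""


def trusted_definitions(entry: dict, min_confidence: float = 0.9) -> list[str]:
    """Return definition texts that are verified and meet the confidence floor."""
    return [
        d["text"]
        for d in entry.get("definitions", [])
        if d.get("verified") and d.get("confidence", 0.0) >= min_confidence
    ]


entry = json.loads(SAMPLE_RESPONSE)
print(trusted_definitions(entry))  # ['Lasting for a very short time']
```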

Reliability & Scale

Providence is deployed across three geographically distributed regions with active-active replication. Data is served through edge caches, and the Pro and Enterprise tiers carry a 99.99% uptime SLA.

Metric             Target     Current Performance
Availability       99.99%     99.994% (30-day avg)
P95 Latency        < 45ms     32ms
Data Freshness     < 6 hours  4.2 hours
Index Consistency  Strong     Verified via checksum audits
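For a sense of scale, the availability figures above translate into a downtime budget with simple arithmetic. The helper below is illustrative only and assumes the 30-day window used for the reported average.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of permitted downtime in the window at a given availability."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return window_minutes * (1 - availability_pct / 100)


print(round(downtime_budget_minutes(99.99), 3))   # 4.32
print(round(downtime_budget_minutes(99.994), 3))  # 2.592
```

In other words, the 99.99% target allows roughly 4.3 minutes of downtime per 30 days, and the reported 99.994% corresponds to about 2.6 minutes.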

Failover is automatic. If a region exceeds error thresholds, traffic is rerouted via DNS failover and edge proxy rules. Status pages and incident reports are published at status.dictionary.com.

Compliance & Ethics

Providence adheres to international data protection standards and maintains rigorous ethical guidelines for linguistic data curation. All processing activities are logged, auditable, and reversible upon request.

Compliance Framework

  • GDPR Article 17 (Right to Erasure) supported for user-generated contributions
  • CCPA/CPRA compliance for California residents
  • ISO 27001:2022 certified infrastructure
  • Regular third-party bias audits across dialectal and regional data
  • Transparent licensing disclosures per entry

We maintain a public Data Ethics Charter outlining our commitment to linguistic diversity, non-commercialization of endangered languages, and equitable representation across global dialects.

Technical FAQ

How often is the production index updated?

Full index snapshots are published monthly. Between snapshots, incremental updates for verified entries are streamed continuously and become available in production within 4 hours of ingestion.

Can I access raw source datasets?

Raw datasets are available to Enterprise partners under a separate data licensing agreement. Request access via the Enterprise portal or contact our data partnerships team.

How is data versioning handled?

Every dataset release follows semantic versioning (for example, v2.4.1). To get deterministic responses, pin your integration to a specific version with the Accept-Version header.
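Version pinning could look like the sketch below, which validates the version string before adding it to the request headers. Whether the header value carries the leading "v" is an assumption based on the v2.4.1 example, and the helper name is hypothetical.

```python
import re

# Matches strings shaped like "v2.4.1"; the leading "v" is assumed from the example.
_VERSION_RE = re.compile(r"v\d+\.\d+\.\d+")


def pinned_headers(token: str, version: str) -> dict[str, str]:
    """Headers for a request pinned to one dataset version."""
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"not a semantic version: {version!r}")
    return {
        "Authorization": f"Bearer {token}",
        "Accept-Version": version,
    }


print(pinned_headers("YOUR_TOKEN", "v2.4.1")["Accept-Version"])  # v2.4.1
```

Rejecting malformed versions on the client side avoids silently falling back to whatever the server treats as the default.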

What happens during rate limit exhaustion?

When a rate limit is exhausted, the API returns HTTP 429 with a Retry-After header. Requests are rejected rather than queued, which preserves system stability. To avoid throttling, upgrade to a higher tier or implement client-side backoff.
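One possible client-side backoff strategy is sketched below: honour Retry-After when it is present, otherwise fall back to exponential backoff with jitter. Per the HTTP specification, Retry-After may be a number of seconds or an HTTP date, so both are handled. The function name and the base and cap values are illustrative choices, not documented Providence settings.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(retry_after, attempt, base=0.5, cap=60.0, now=None):
    """Seconds to wait before retrying a request that returned HTTP 429."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            return float(value)  # delay-seconds form
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None  # unparseable header: use the fallback below
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # Fallback: exponential backoff with full jitter, capped at `cap` seconds.
    return random.uniform(0, min(cap, base * 2 ** attempt))


print(retry_delay("30", attempt=0))  # 30.0
```

A caller would sleep for the returned delay, increment attempt, and give up after a fixed number of retries.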