Overview
Providence is the foundational data layer that powers Dictionary's global language services. It orchestrates the ingestion, verification, normalization, and distribution of over 15 million lexical entries across 100+ languages.
Designed for developers, researchers, and enterprise integrations, Providence guarantees deterministic outputs, versioned datasets, and transparent lineage tracking for every word, definition, and translation.
Core Capabilities
- Real-time lexical lookup with sub-50ms latency
- Version-controlled dictionary snapshots (monthly)
- Multi-dialect pronunciation audio routing
- Contextual synonym/antonym graph resolution
- GDPR/CCPA-compliant data anonymization pipelines
Data Provenance
Every entry in Providence carries a verifiable lineage trail. We source from licensed academic corpora, peer-reviewed linguistic journals, and validated community contributions. All data undergoes a multi-stage verification process before entering the production index.
| Source Type | Volume | Verification Tier | License |
|---|---|---|---|
| Academic Corpora | 8.2M entries | Tier 1 (Peer-Reviewed) | Proprietary / Institutional |
| Verified Lexicographers | 4.1M entries | Tier 1 (Expert) | Dictionary Exclusive |
| Community Contributions | 2.7M entries | Tier 2 (AI+Human Review) | CC BY-NC 4.0 |
| Regional Dialects | 600K entries | Tier 3 (Linguist-Validated) | Open Linguistics |
Lineage metadata is exposed via the ?include=lineage query parameter. Each record contains source IDs, review timestamps, confidence scores, and revision hashes.
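As a rough illustration of requesting lineage metadata, the sketch below builds a lookup URL with the `include=lineage` parameter. The host `api.example.com`, the `word` parameter name, and the choice of the `/v2/lookup` endpoint are assumptions made for the example; only `?include=lineage` comes from the text above.

```python
from urllib.parse import urlencode

# Hypothetical host: the documentation does not name the API base URL.
BASE_URL = "https://api.example.com"


def build_lookup_url(word: str, include_lineage: bool = False) -> str:
    """Build a lookup URL, optionally asking for lineage metadata."""
    # "word" as the query parameter name is an assumption for illustration.
    params = {"word": word}
    if include_lineage:
        # Documented parameter: ?include=lineage
        params["include"] = "lineage"
    return f"{BASE_URL}/v2/lookup?{urlencode(params)}"
```

Using `urlencode` rather than string concatenation keeps headwords with spaces or non-ASCII characters correctly escaped.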
Processing Pipeline
Providence uses a deterministic hybrid architecture that combines batch and stream processing. Data is ingested over SFTP, API webhooks, and licensed data feeds. The pipeline enforces strict schema validation, deduplication, and semantic normalization.
Pipeline Stages
- Ingestion: Raw XML/JSON parsing with checksum validation
- Normalization: Unicode NFKC, diacritic stripping, case folding
- Enrichment: POS tagging, morphological parsing, phonetic transcription (IPA)
- Verification: Cross-reference against gold-standard corpora
- Indexing: Distributed inverted index + vector embeddings for semantic search
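The Normalization stage listed above can be sketched with Python's standard `unicodedata` module. This is a minimal illustration of NFKC normalization, diacritic stripping, and case folding applied in that order, not the pipeline's actual code; the function name `normalize_headword` is hypothetical.

```python
import unicodedata


def normalize_headword(text: str) -> str:
    """Apply NFKC, strip diacritics, then case-fold (illustrative only)."""
    # 1. NFKC: fold compatibility characters (ligatures, full-width forms).
    compat = unicodedata.normalize("NFKC", text)
    # 2. Diacritic stripping: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", compat)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 3. Recompose what remains and case-fold for caseless matching.
    return unicodedata.normalize("NFC", stripped).casefold()
```

A key produced this way is suited to matching and deduplication. Because stripping diacritics discards distinctions that matter in many languages, a sketch like this would keep the original spelling alongside the normalized key.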
Failed records are quarantined and listed at the /pipeline/quarantine endpoint for manual review. The SLA requires 99.2% of records to be processed successfully within 4 hours of ingestion.
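The quarantine rule and the 99.2% success target can be illustrated with a small bookkeeping sketch. Everything here is hypothetical (`BatchResult`, `run_batch`, and the use of `ValueError` to signal a failed record); it only shows how a batch might separate failures and compare its success rate with the stated threshold. The 4-hour window is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

SLA_SUCCESS_RATE = 0.992  # 99.2% successful processing, per the stated SLA


@dataclass
class BatchResult:
    processed: List[dict] = field(default_factory=list)
    quarantined: List[dict] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        total = len(self.processed) + len(self.quarantined)
        return len(self.processed) / total if total else 1.0

    @property
    def meets_sla(self) -> bool:
        return self.success_rate >= SLA_SUCCESS_RATE


def run_batch(records: Iterable[dict], process: Callable[[dict], dict]) -> BatchResult:
    """Process each record; set failures aside instead of aborting the batch."""
    result = BatchResult()
    for record in records:
        try:
            result.processed.append(process(record))
        except ValueError as exc:
            # Keep the raw record and the reason so a reviewer can act on it.
            result.quarantined.append({"record": record, "reason": str(exc)})
    return result
```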
API Specifications
Providence exposes a RESTful API documented with OpenAPI 3.0. All endpoints return JSON, support pagination, and include rate-limit headers. Authenticated endpoints use Bearer tokens scoped to dataset access levels.
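A minimal sketch of an authenticated request follows, using only the standard library. The host `api.example.com` is a placeholder, and because the rate-limit header names are not specified here, the helper simply picks out any header whose name contains "ratelimit"; treat that naming as an assumption.

```python
import urllib.request
from typing import Dict, Mapping

# Hypothetical host: the documentation does not name the API base URL.
BASE_URL = "https://api.example.com"


def build_request(path: str, token: str) -> urllib.request.Request:
    """Prepare a GET request carrying a Bearer token (nothing is sent yet)."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def rate_limit_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Pick out rate-limit headers; matching on 'ratelimit' is an assumption."""
    return {k: v for k, v in headers.items() if "ratelimit" in k.lower()}
```

The prepared request could then be passed to `urllib.request.urlopen`, with the response headers fed to `rate_limit_headers` to see how much quota remains.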
Example Response Structure
{
  "word": "ephemeral",
  "phonetic": "/əˈfem.ər.əl/",
  "pos": "adjective",
  "definitions": [
    {
      "text": "Lasting for a very short time",
      "source_id": "ACAD-CORP-8821",
      "confidence": 0.98,
      "verified": true
    }
  ],
  "lineage_hash": "sha256:a3f1c9d...",
  "updated": "2025-03-12T08:14:00Z"
}
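One way a client might consume this structure is to parse it into typed records. The sketch below mirrors the fields of the example response; the class names, the `trusted_definitions` helper, and the 0.9 confidence threshold are illustrative choices, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Definition:
    text: str
    source_id: str
    confidence: float
    verified: bool


@dataclass(frozen=True)
class Entry:
    word: str
    phonetic: str
    pos: str
    definitions: Tuple[Definition, ...]
    lineage_hash: str
    updated: str


def parse_entry(payload: str) -> Entry:
    """Parse a lookup response body into an Entry."""
    data = json.loads(payload)
    return Entry(
        word=data["word"],
        phonetic=data["phonetic"],
        pos=data["pos"],
        # Definition(**d) raises TypeError on unknown keys, which surfaces
        # schema drift early instead of silently ignoring new fields.
        definitions=tuple(Definition(**d) for d in data["definitions"]),
        lineage_hash=data["lineage_hash"],
        updated=data["updated"],
    )


def trusted_definitions(entry: Entry, min_confidence: float = 0.9) -> List[Definition]:
    """Keep verified definitions at or above an illustrative threshold."""
    return [d for d in entry.definitions if d.verified and d.confidence >= min_confidence]
```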
| Endpoint | Method | Rate Limit | Authentication |
|---|---|---|---|
| /v2/lookup | GET | 1,000 req/min | Bearer |
| /v2/batch | POST | 500 req/min | Bearer |
| /v2/datasets | GET | 100 req/min | Public |
| /v2/audio | GET | 2,000 req/min | Bearer |
Reliability & Scale
Providence is deployed across three geographically distributed regions with active-active replication. All data is served via edge caching with a 99.99% uptime SLA for Pro and Enterprise tiers.
| Metric | Target | Current Performance |
|---|---|---|
| Availability | 99.99% | 99.994% (30-day avg) |
| P95 Latency | < 45ms | 32ms |
| Data Freshness | < 6 hours | 4.2 hours |
| Index Consistency | Strong | Verified via checksum audits |
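To put the availability figures in concrete terms, the short calculation below converts an availability percentage into a downtime budget. Over a 30-day window, the 99.99% target allows about 4.3 minutes of downtime, and the reported 99.994% corresponds to about 2.6 minutes. The function name is hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day window
    return total_minutes * (1 - availability_pct / 100)
```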
Failover is automatic. If a region exceeds error thresholds, traffic is rerouted via DNS failover and edge proxy rules. Status pages and incident reports are published at status.dictionary.com.
Compliance & Ethics
Providence adheres to international data protection standards and maintains rigorous ethical guidelines for linguistic data curation. All processing activities are logged, auditable, and reversible upon request.
Compliance Framework
- GDPR Article 17 (Right to Erasure) supported for user-generated contributions
- CCPA/CPRA compliance for California residents
- ISO 27001:2022 certified infrastructure
- Regular third-party bias audits across dialectal and regional data
- Transparent licensing disclosures per entry
We maintain a public Data Ethics Charter outlining our commitment to linguistic diversity, non-commercialization of endangered languages, and equitable representation across global dialects.
Technical FAQ
Full index snapshots are published monthly. Incremental updates for verified entries are streamed continuously and reach production within 4 hours of ingestion.
Raw datasets are available to Enterprise partners under a separate data licensing agreement. Request access via the Enterprise portal or contact our data partnerships team.
Every dataset is versioned using semantic versioning (for example, v2.4.1). You can pin an integration to a specific version with the Accept-Version header to ensure deterministic responses.
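Version pinning might look like the sketch below, which validates a version string before placing it in the `Accept-Version` header. Whether the header value keeps the leading `v` is not stated here, so the string is passed through as written; the helper names are hypothetical.

```python
import re
from typing import Dict, Tuple

# MAJOR.MINOR.PATCH with an optional leading "v", as in "v2.4.1".
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a version string into (major, minor, patch)."""
    match = _SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def version_headers(version: str) -> Dict[str, str]:
    """Return headers that pin requests to one dataset version."""
    parse_version(version)  # reject malformed versions before sending
    # Assumption: the header takes the version string exactly as written.
    return {"Accept-Version": version}
```

Comparing the parsed tuples, for example `parse_version("v2.4.1") < parse_version("v2.10.0")`, orders versions numerically rather than alphabetically.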
When a client exceeds its rate limit, the API returns HTTP 429 with a Retry-After header. Excess requests are not queued; they are rejected to preserve system stability. To avoid rejections, upgrade to a higher tier or implement client-side backoff.
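A client-side backoff strategy for the HTTP 429 behaviour described above can be sketched as follows. The `send` callable is a hypothetical stand-in for whatever HTTP client you use; it should return a `(status, headers, body)` tuple. The retry count and base delay are illustrative defaults, and only the delay-in-seconds form of `Retry-After` is handled.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on HTTP 429, honouring Retry-After when it is given in seconds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        # Return on any non-429 status, or when attempts are exhausted.
        if status != 429 or attempt == max_attempts - 1:
            return status, headers, body
        # Header lookup is case-sensitive here; normalize keys if your
        # HTTP client does not already do so.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            # Missing or date-form header: fall back to exponential backoff.
            delay = base_delay * 2 ** attempt
        sleep(delay)
    raise AssertionError("unreachable")
```

Injecting `sleep` as a parameter keeps the function easy to test without real waiting. Adding random jitter to the fallback delay is a common refinement when many clients retry at once.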