📊 Open Language Data

Structured Language Datasets for AI & Research

Access curated, high-quality linguistic data ready for NLP training, academic research, and application development. Updated weekly with verified annotations.

12.4B Tokens Indexed
85 Languages Covered
48 Public Datasets
2.1M Monthly Downloads

Available Datasets

Showing 1-6 of 48 results

Global Definition Corpus v3

Comprehensive definitions across 45 languages with contextual examples, usage notes, and semantic tagging.

JSON
📊 2.4M entries 🌍 45 languages 💾 850 MB
definitions multilingual semantic

Pronunciation & Phonetics Bank

Native-speaker audio recordings, IPA transcriptions, and stress patterns for 1.8M lexical entries.

API
🎧 3.2M clips 🌍 62 languages ⚡ Streaming
audio ipa phonetics

Synonym & Antonym Network

Graph-structured relational data mapping semantic relationships, collocations, and contextual substitutions.

CSV
🔗 45M edges 📈 Graph format 💾 1.2 GB
graph relationships thesaurus
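
As a rough illustration of working with graph-structured CSV data like this, the sketch below loads an edge list into an in-memory adjacency map. The column names (`source`, `target`, `relation`), the relation labels, and the sample rows are assumptions made for this example, not the dataset's documented schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed edge-list layout: source,target,relation.
# The real file's column names and relation labels may differ.
SAMPLE_CSV = """source,target,relation
happy,glad,synonym
happy,sad,antonym
glad,joyful,synonym
"""


def load_edges(handle):
    """Build an adjacency map: word -> relation -> set of neighbouring words."""
    graph = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(handle):
        src, dst, rel = row["source"], row["target"], row["relation"]
        graph[src][rel].add(dst)
        graph[dst][rel].add(src)  # assumption: both relations are symmetric
    return graph


graph = load_edges(io.StringIO(SAMPLE_CSV))
print(sorted(graph["happy"]["synonym"]))
print(sorted(graph["glad"]["synonym"]))
```

Rows are read one at a time, so the same function accepts an open file handle in place of the in-memory sample. For a file with tens of millions of edges, a graph database or on-disk store is likely a better fit than a Python dictionary.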

Historical Etymology Archive

Traces word origins, morphological evolution, and cross-linguistic borrowing across roughly 3,500 years of recorded language, from 1500 BCE to 2024.

JSON
📜 890K lineages 🕰️ 1500 BCE-2024 💾 420 MB
history morphology academic

Real-time Usage Trends

Temporal frequency data, regional dialect markers, and emerging slang tracking across web, social, and academic corpora.

API
📉 Live updates 🌐 12 regions ⚡ WebSocket
temporal frequency sociolinguistics

Academic Word List 2024

Curated subset of 2,848 high-frequency academic terms with discipline-specific definitions and citation contexts.

CSV
🎓 Education 📊 2.8K terms 💾 14 MB
education academic reference

Integrate via REST & GraphQL

Don't want to download massive files? Access all datasets programmatically through our high-availability API with rate limits, caching, and webhook support.

  • 99.99% uptime SLA with global edge caching
  • GraphQL endpoints for flexible schema queries
  • Webhook notifications for dataset updates
  • SDKs available for Python, JavaScript, Go & Rust
# Fetch definitions using our Python SDK
import asyncio

import dictionary_api

# Initialize client
client = dictionary_api.Client("your_api_key")

# Query dataset
async def get_corpus():
    data = await client.datasets.query(
        name="global-definition-corpus",
        languages=["en", "es", "fr"],
        limit=1000,
    )
    return data

# Run the coroutine from synchronous code
entries = asyncio.run(get_corpus())
# Output: 1000 structured JSON entries
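
Because the API enforces rate limits, a client will usually want to retry rather than fail on the first rejected call. The following is a minimal sketch of exponential backoff with jitter. `RateLimitError`, `call_with_backoff`, and the fake request are hypothetical names used for illustration; they are not part of the documented SDK, and the exception the SDK actually raises may differ.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the SDK raises on a rate-limit response."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request(); on RateLimitError, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Demo: a fake request that is rate-limited twice, then succeeds.
calls = {"count": 0}


def fake_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return {"entries": 1000}


# Pass a no-op sleep so the demo finishes instantly.
result = call_with_backoff(fake_request, sleep=lambda seconds: None)
print(result, calls["count"])
```

This version is synchronous for brevity. Wrapping an awaitable SDK call would follow the same pattern with `await asyncio.sleep(...)` in place of `time.sleep`.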

Frequently Asked Questions

What license are the datasets available under?
All public datasets are released under CC BY-SA 4.0 unless otherwise noted. Commercial licenses with attribution waivers are available for Enterprise plans. Audio data follows a separate commercial-use license due to voice recording rights.
How frequently are the datasets updated?
Core lexical datasets are updated every Tuesday. Usage trends and temporal corpora update in real time. You can subscribe to update notifications via our webhook system or email alerts.
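
If you mirror bulk downloads on a schedule instead of subscribing to notifications, the Tuesday cadence can be turned into a date calculation. This is a small sketch only: `next_tuesday` is a hypothetical helper, and the time of day and time zone of the weekly release are not stated here, so treat the result as a calendar date rather than an exact release time.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbers Monday as 0, so Tuesday is 1


def next_tuesday(today):
    """Return the first Tuesday strictly after `today`."""
    days_ahead = (TUESDAY - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


print(next_tuesday(date.today()))
```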
Can I use these datasets for training commercial AI models?
Yes. Our datasets are specifically curated for ML/NLP training. We provide clear attribution guidelines and offer enterprise support for compliance, data pipeline integration, and custom subset generation.
How do I request a custom dataset or missing language?
Submit a request through our Data Lab portal. Our linguistics team reviews all requests weekly. Custom corpus generation typically takes 2-4 weeks depending on scope and language availability.