Data & Datasets

Available Datasets

Global Definition Corpus v3

Comprehensive definitions across 45 languages with contextual examples, usage notes, and semantic tagging.

JSON

📊 2.4M entries 🌍 45 languages 💾 850 MB

definitions multilingual semantic

Pronunciation & Phonetics Bank

Native-speaker audio recordings, IPA transcriptions, and stress patterns for 1.8M lexical entries.

API

🎧 3.2M clips 🌍 62 languages ⚡ Streaming

audio ipa phonetics

Synonym & Antonym Network

Graph-structured relational data mapping semantic relationships, collocations, and contextual substitutions.

CSV

🔗 45M edges 📈 Graph format 💾 1.2 GB

graph relationships thesaurus

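As a rough illustration of working with edge-list data like this, the sketch below loads relationship rows into an adjacency map using only Python's standard library. The column names (`source`, `target`, `relation`) and the sample rows are assumptions made for illustration, not the published schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real column names may differ.
SAMPLE_CSV = """source,target,relation
happy,joyful,synonym
happy,sad,antonym
joyful,cheerful,synonym
"""

def load_edges(csv_file):
    """Build {word: [(neighbor, relation), ...]} from an edge-list CSV."""
    graph = defaultdict(list)
    for row in csv.DictReader(csv_file):
        graph[row["source"]].append((row["target"], row["relation"]))
    return dict(graph)

def neighbors(graph, word, relation):
    """Return the words linked to `word` by the given relation type."""
    return [target for target, rel in graph.get(word, []) if rel == relation]

graph = load_edges(io.StringIO(SAMPLE_CSV))
print(neighbors(graph, "happy", "synonym"))
```

For the full download, the same function could be given a file opened with `open(path, newline="")` instead of the in-memory sample, so rows are read one at a time rather than all at once.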

Historical Etymology Archive

Traces word origins, morphological evolution, and cross-linguistic borrowing across roughly 3,500 years of recorded language.

JSON

📜 890K lineages 🕰️ 1500 BCE-2024 💾 420 MB

history morphology academic

Real-time Usage Trends

Temporal frequency data, regional dialect markers, and emerging slang tracking across web, social, and academic corpora.

API

📉 Live updates 🌐 12 regions ⚡ WebSocket

temporal frequency sociolinguistics

Academic Word List 2024

Curated subset of 2,848 high-frequency academic terms with discipline-specific definitions and citation contexts.

CSV

🎓 Education 📊 2.8K terms 💾 14 MB

education academic reference

Integrate via REST & GraphQL

Don't want to download massive files? Access all datasets programmatically through our high-availability API with rate limits, caching, and webhook support.

99.99% uptime SLA with global edge caching
GraphQL endpoints for flexible schema queries
Webhook notifications for dataset updates
SDKs available for Python, JavaScript, Go & Rust

# Fetch definitions using our Python SDK
import asyncio

import dictionary_api

# Initialize client
client = dictionary_api.Client("your_api_key")

# Query dataset
async def get_corpus():
    data = await client.datasets.query(
        name="global-definition-corpus",
        languages=["en", "es", "fr"],
        limit=1000
    )
    return data

# Run the coroutine; returns up to 1000 structured JSON entries
entries = asyncio.run(get_corpus())
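
The GraphQL endpoints mentioned above are not documented on this page, so the following is only a sketch of how a request might be assembled with Python's standard library. The endpoint URL, the `definitions` field and its arguments, and the bearer-token header are all hypothetical placeholders; the request is built but never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and schema, used only for illustration.
GRAPHQL_URL = "https://api.example.com/graphql"

QUERY = """
query Definitions($languages: [String!], $limit: Int) {
  definitions(languages: $languages, limit: $limit) {
    headword
    language
    text
  }
}
"""

def build_request(api_key, languages, limit):
    """Prepare (but do not send) a GraphQL POST request."""
    body = json.dumps({
        "query": QUERY,
        "variables": {"languages": languages, "limit": limit},
    }).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check the API documentation.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

request = build_request("your_api_key", ["en", "es", "fr"], 10)
print(request.get_method(), request.full_url)
```

Passing values through `variables` rather than formatting them into the query string keeps the query text constant and avoids quoting problems.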

Frequently Asked Questions

What license are the datasets available under?

All public datasets are released under CC BY-SA 4.0 unless otherwise noted. Commercial licenses with attribution waivers are available for Enterprise plans. Audio data follows a separate commercial-use license due to voice recording rights.

How frequently are the datasets updated?

Core lexical datasets are updated every Tuesday. Usage trends and temporal corpora update in real time. You can subscribe to update notifications via our webhook system or email alerts.
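
If you mirror the core datasets locally, the Tuesday cadence can be turned into a simple refresh schedule. The helper below is a minimal sketch using Python's standard library; the function name is illustrative, and it assumes updates land on the calendar Tuesday, with no particular time of day or time zone implied.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0

def next_update_day(today):
    """Return the first Tuesday on or after `today`."""
    days_ahead = (TUESDAY - today.weekday()) % 7
    return today + timedelta(days=days_ahead)

# 2024-01-03 is a Wednesday, so the next update day is 2024-01-09.
print(next_update_day(date(2024, 1, 3)))
```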

Can I use these datasets for training commercial AI models?

Yes, subject to each dataset's license. Our datasets are curated for ML/NLP training, and we provide clear attribution guidelines plus enterprise support for compliance, data pipeline integration, and custom subset generation.

How do I request a custom dataset or missing language?

Submit a request through our Data Lab portal. Our linguistics team reviews all requests weekly. Custom corpus generation typically takes 2-4 weeks depending on scope and language availability.
