Genomics Standards & Reference Framework
Unified specifications for genomic data representation, annotation, exchange, and quality assurance across research and clinical pipelines.
# Overview
The Genomics Standards Framework establishes canonical formats, ontologies, and validation rules for genomic data lifecycle management. It harmonizes specifications from GA4GH, NCBI, EBI, and ISO/TC 276 to ensure interoperability, reproducibility, and regulatory compliance across sequencing, alignment, variant calling, and annotation workflows.
This standard covers reference sequence representation, read format specifications, variant calling conventions, functional annotation schemas, and metadata requirements for public deposition and clinical reporting.
# Core Data Formats
Primary file formats for raw, aligned, and variant genomic data. All implementations must adhere to MIME types, header structures, and compression standards defined below.
| Format | Extension | MIME Type | Specification | Status |
|---|---|---|---|---|
FASTQ |
.fastq / .fq | text/x-fastq | Sanger Illumina 1.8+ | Core |
BAM |
.bam | application/x-bam | SAMv1.6 + BGZIP | Core |
VCF |
.vcf / .bcf | text/vcf | VCFSpec v4.3 | Core |
FASTA |
.fa / .fasta | text/x-fastfa | GenBank v3.0 | Core |
CRAVAT |
.cravat | application/x-cravat | GA4GH 2023 | Extension |
@SEQ_ID instrument:run:lane:tile:x:y filter:i7:i5 @HWUSI-EAS500R:6:FF3KEAXXY:3:1101:1000:2148 1:N:0:ATCACG AGCTTAGCTACGTACGATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
# Annotation Standards
Functional and structural annotation must follow standardized coordinate systems, feature types, and evidence codes. All genomic intervals use 0-based half-open coordinates (BED-style) unless explicitly noted as 1-based (GFF/GTF).
| Schema | Purpose | Coordinate System | Compliance |
|---|---|---|---|
GFF3 |
Generic feature annotation | 1-based inclusive | Core |
GTF |
Gene/transcript structure | 1-based inclusive | Core |
BED |
Interval/peak calls | 0-based half-open | Core |
JBIG2/SAF |
Sequence annotation feature | 1-based | Extension |
# Metadata & Ontologies
Minimum metadata requirements align with ISA-Tab, MIxS, and OBO Foundry principles. All submissions must map controlled vocabularies to canonical IRIs.
- Sample Origin: Must use ENVO or NCBI Taxonomy IDs. Clinical samples require LOINC/SNOMED CT mapping.
- Sequencing Platform: Controlled vocabulary from SO (Sequence Ontology) or GA4GH Platform Codes.
- Reference Genome: Must specify assembly accession (e.g.,
GCA_000001405.15) and patch versions. - Processing Pipeline: CWL/WDL/Nextflow metadata with SHA256 checksums for reproducibility.
- Ethics & Consent: Dublin Core + GDPR/HIPAA compliance flags per
data_useschema.
# Interoperability & Exchange
Aevum mandates GA4GH Beacon v2.0 and DRSC (Distributed Research Support) compatibility for federated queries. All public repositories must expose REST/GraphQL endpoints conforming to the htsget and GA4GH Schemas v1.2.0.
{
"datasetId": "ae_genomics_v4.1",
"filter": {
"location": {
"referenceName": "chr17",
"start": 43044295,
"end": 43044295
},
"alternateBases": "A",
"referenceBases": "T"
},
"requestedFields": ["datasetId", "existence", "alleleFrequency"]
}
# Compliance Checklist
Use this checklist to validate pipelines and submissions against Aevum Genomics Standards v4.1.2.
- Verify all BAM files contain
@PGand@COheaders with tool names & versions - Ensure VCF files are sorted by
POSand indexed (.tbi) - Map all phenotype data to HPO or MedGen UIDs
- Validate JSON metadata against Aevum Schema Validator (
ae-cli validate) - Confirm read groups match
@RGheader specifications - Apply BGZIP + tabix compression for all coordinate-sorted text formats
- Include MD5/SHA256 checksums in submission manifests
# Changelog
v4.1.2 Patch Release
Fixed VCF INFO field validation rules. Added CRiSP v3 extension support. Updated MIxS crosswalk mappings.
v4.1.0 Minor Release
Integrated GA4GH Schema 1.2.0. Deprecated FASTQ v1.6. Added clinical reporting metadata schema (CLSI-AMP aligned).
v4.0.0 Major Release
Complete overhaul of coordinate system documentation. Introduced unified ontology layer. Migrated to JSON-LD metadata standard.
# References & Governing Bodies
- GA4GH (Global Alliance for Genomics & Health) –
ga4gh.org/standards - NCBI GenBank / RefSeq Specifications –
ncbi.nlm.nih.gov/genbank - EBI / ENA Deposit Guidelines –
ena-docs.readthedocs.io - ISO/TC 276 – Biotechnology Standards –
iso.org/committee/276 - Aevum Standards Committee Charter –
aevum.org/governance