# Overview

The Aevum Genomics Standards Framework establishes canonical formats, ontologies, and validation rules for managing genomic data across its lifecycle. It harmonizes specifications from GA4GH, NCBI, EBI, and ISO/TC 276 to ensure interoperability, reproducibility, and regulatory compliance across sequencing, alignment, variant calling, and annotation workflows.

This standard covers reference sequence representation, read format specifications, variant calling conventions, functional annotation schemas, and metadata requirements for public deposition and clinical reporting.

# Core Data Formats

This section lists the primary file formats for raw reads, alignments, and variant calls. All implementations must adhere to the MIME types, header structures, and compression standards defined below.

| Format | Extension | MIME Type | Specification | Status |
|---|---|---|---|---|
| FASTQ | .fastq / .fq | text/x-fastq | Sanger / Illumina 1.8+ | Core |
| BAM | .bam | application/x-bam | SAM v1.6 + BGZF | Core |
| VCF | .vcf / .bcf | text/vcf | VCF Specification v4.3 | Core |
| FASTA | .fa / .fasta | text/x-fasta | GenBank v3.0 | Core |
| CRAVAT | .cravat | application/x-cravat | GA4GH 2023 | Extension |

FASTQ Header Specification Example (Illumina 1.8+)
Header layout: `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`
@HWUSI-EAS500R:6:FF3KEAXXY:3:1101:1000:2148 1:N:0:ATCACG
AGCTTAGCTACGTACGATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
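
The header fields of the example record can be pulled apart programmatically. The following is a minimal sketch, assuming the Illumina 1.8+ layout that the example record follows (seven colon-separated fields, a space, then four more). `IlluminaHeader` and `parse_illumina_header` are illustrative names and are not defined by this standard.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    """Fields of an Illumina 1.8+ FASTQ header line (illustrative type)."""
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: bool  # True when the header says "Y" (read failed the filter)
    control: int
    index: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split a header line into its named fields, raising on bad input."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    # Two space-separated groups, each made of colon-delimited fields.
    location, _, descriptor = line[1:].strip().partition(" ")
    loc = location.split(":")
    desc = descriptor.split(":")
    if len(loc) != 7 or len(desc) != 4:
        raise ValueError(f"unexpected field count in header: {line!r}")
    instrument, run, flowcell, lane, tile, x, y = loc
    read, filtered, control, index = desc
    if filtered not in ("Y", "N"):
        raise ValueError(f"filter flag must be Y or N, got {filtered!r}")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_illumina_header(
    "@HWUSI-EAS500R:6:FF3KEAXXY:3:1101:1000:2148 1:N:0:ATCACG"
)
print(header.flowcell, header.lane, header.index)  # FF3KEAXXY 3 ATCACG
```

A named tuple keeps the parsed header immutable and lets validators refer to fields by name rather than by position.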

# Annotation Standards

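Functional and structural annotation must follow standardized coordinate systems, feature types, and evidence codes. All genomic intervals use 0-based half-open coordinates (BED-style) unless explicitly noted as 1-based (GFF/GTF). For example, the first 100 bases of a chromosome are written as start 0, end 100 in BED and as start 1, end 100 in GFF3.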

| Schema | Purpose | Coordinate System | Compliance |
|---|---|---|---|
| GFF3 | Generic feature annotation | 1-based inclusive | Core |
| GTF | Gene/transcript structure | 1-based inclusive | Core |
| BED | Interval/peak calls | 0-based half-open | Core |
| JBIG2/SAF | Sequence annotation feature | 1-based | Extension |
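
Converting between the two coordinate systems is a common source of off-by-one errors. The sketch below shows the arithmetic, with `bed_to_gff` and `gff_to_bed` as illustrative helper names. It assumes non-empty intervals, since a zero-length BED interval has no 1-based inclusive equivalent.

```python
from typing import Tuple


def bed_to_gff(start: int, end: int) -> Tuple[int, int]:
    """Convert 0-based half-open [start, end) to 1-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # Only the start shifts; the exclusive end equals the inclusive end.
    return start + 1, end


def gff_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive [start, end] to 0-based half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


print(bed_to_gff(0, 100))  # (1, 100)
print(gff_to_bed(1, 100))  # (0, 100)
```

The interval length is `end - start` in BED and `end - start + 1` in GFF/GTF, so both representations above describe the same 100 bases.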

# Metadata & Ontologies

Minimum metadata requirements align with ISA-Tab, MIxS, and OBO Foundry principles. All submissions must map controlled vocabularies to canonical IRIs.

  • Sample Origin: Must use ENVO or NCBI Taxonomy IDs. Clinical samples require LOINC/SNOMED CT mapping.
  • Sequencing Platform: Controlled vocabulary from SO (Sequence Ontology) or GA4GH Platform Codes.
  • Reference Genome: Must specify assembly accession (e.g., GCA_000001405.15) and patch versions.
  • Processing Pipeline: CWL/WDL/Nextflow metadata with SHA256 checksums for reproducibility.
  • Ethics & Consent: Dublin Core + GDPR/HIPAA compliance flags per data_use schema.
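
Two of these requirements lend themselves to automated checks: the versioned assembly accession and the SHA256 checksums. The following is a minimal sketch with illustrative function names. The accession pattern (`GCA_` or `GCF_`, nine digits, a dot, and a version number) is an assumption about what this standard accepts, modeled on the example accession above.

```python
import hashlib
import io
import re
from typing import BinaryIO

# Assumed pattern: GCA_/GCF_ prefix, nine digits, then ".<version>".
ASSEMBLY_ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_versioned_assembly_accession(value: str) -> bool:
    """Return True when the accession includes an explicit version suffix."""
    return ASSEMBLY_ACCESSION.fullmatch(value) is not None


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


print(is_versioned_assembly_accession("GCA_000001405.15"))  # True
print(is_versioned_assembly_accession("GCA_000001405"))     # False

# Stand-in for a workflow file opened with open(path, "rb").
workflow = io.BytesIO(b"cwlVersion: v1.2\n")
print(sha256_of_stream(workflow))
```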

# Interoperability & Exchange

Aevum mandates compatibility with GA4GH Beacon v2.0 and GA4GH DRS (Data Repository Service) for federated queries and data access. All public repositories must expose REST or GraphQL endpoints that conform to htsget and GA4GH Schemas v1.2.0.

GA4GH Beacon v2 Query Example
{
  "datasetId": "ae_genomics_v4.1",
  "filter": {
    "location": {
      "referenceName": "chr17",
      "start": 43044295,
      "end": 43044296
    },
    "alternateBases": "A",
    "referenceBases": "T"
  },
  "requestedFields": ["datasetId", "existence", "alleleFrequency"]
}
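
A query of this shape can be assembled programmatically. The sketch below is illustrative only: `build_variant_query` is a hypothetical helper, the payload layout simply mirrors the example above and has not been checked against the Beacon v2 specification, and it assumes the 0-based half-open convention from the Annotation Standards section, under which a single-base substitution spans `[start, start + 1)`.

```python
import json
from typing import Any, Dict


def build_variant_query(
    dataset_id: str,
    reference_name: str,
    start: int,
    reference_bases: str,
    alternate_bases: str,
) -> Dict[str, Any]:
    """Build a query payload shaped like the example in this section."""
    if start < 0:
        raise ValueError("start must be a 0-based, non-negative position")
    if not reference_bases:
        raise ValueError("reference_bases must not be empty")
    # Half-open interval: the end is exclusive, so it is start + allele length.
    end = start + len(reference_bases)
    return {
        "datasetId": dataset_id,
        "filter": {
            "location": {
                "referenceName": reference_name,
                "start": start,
                "end": end,
            },
            "alternateBases": alternate_bases,
            "referenceBases": reference_bases,
        },
        "requestedFields": ["datasetId", "existence", "alleleFrequency"],
    }


query = build_variant_query("ae_genomics_v4.1", "chr17", 43044295, "T", "A")
print(json.dumps(query, indent=2))
```

Deriving `end` from the reference allele length keeps the interval consistent with the alleles, so the two cannot drift apart when a query is edited by hand.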

# Compliance Checklist

Use this checklist to validate pipelines and submissions against Aevum Genomics Standards v4.1.2.

  • Verify that all BAM files contain @PG and @CO header lines, with tool names and versions recorded in the @PG PN and VN tags
  • Ensure VCF files are sorted by CHROM and then POS, bgzip-compressed, and tabix-indexed (.tbi)
  • Map all phenotype data to HPO or MedGen UIDs
  • Validate JSON metadata against Aevum Schema Validator (ae-cli validate)
  • Confirm that the RG tag of every read matches the ID of an @RG header line
  • Apply bgzip compression and tabix indexing to all coordinate-sorted text formats
  • Include MD5/SHA256 checksums in submission manifests
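
The header-related items in this checklist can be checked from the plain-text SAM header (for example, the output of `samtools view -H`). The following is a minimal sketch with illustrative function names. It covers only the @PG and @RG checks and is not a substitute for the Aevum Schema Validator.

```python
from typing import Dict, Iterable, List


def parse_sam_header(header_text: str) -> Dict[str, List[Dict[str, str]]]:
    """Group header lines by record type, each as a TAG -> value mapping."""
    records: Dict[str, List[Dict[str, str]]] = {}
    for line in header_text.splitlines():
        # @CO lines are free text, so they carry no TAG:VALUE fields.
        if not line.startswith("@") or line.startswith("@CO"):
            continue
        record_type, *fields = line.split("\t")
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        records.setdefault(record_type, []).append(tags)
    return records


def check_header(header_text: str, read_group_ids: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    records = parse_sam_header(header_text)
    programs = records.get("@PG", [])
    if not programs:
        problems.append("no @PG line found")
    for pg in programs:
        if "PN" not in pg or "VN" not in pg:
            problems.append(f"@PG {pg.get('ID', '?')} lacks PN or VN")
    declared = {rg["ID"] for rg in records.get("@RG", []) if "ID" in rg}
    for rg_id in sorted(set(read_group_ids) - declared):
        problems.append(f"read group {rg_id!r} is not declared in any @RG line")
    return problems


header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:sample1\n"
    "@PG\tID:bwa\tPN:bwa\tVN:0.7.17\n"
)
print(check_header(header, ["rg1"]))         # []
print(check_header(header, ["rg1", "rg2"]))  # reports rg2 as undeclared
```

Here `read_group_ids` stands for the RG tag values collected from the alignment records; how they are collected depends on the BAM library in use.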

# Changelog

2025-05-20

v4.1.2 Patch Release

Fixed VCF INFO field validation rules. Added CRiSP v3 extension support. Updated MIxS crosswalk mappings.

2025-02-14

v4.1.0 Minor Release

Integrated GA4GH Schema 1.2.0. Deprecated FASTQ v1.6. Added clinical reporting metadata schema (CLSI-AMP aligned).

2024-09-08

v4.0.0 Major Release

Complete overhaul of coordinate system documentation. Introduced unified ontology layer. Migrated to JSON-LD metadata standard.

# References & Governing Bodies

  • GA4GH (Global Alliance for Genomics & Health) – ga4gh.org/standards
  • NCBI GenBank / RefSeq Specifications – ncbi.nlm.nih.gov/genbank
  • EBI / ENA Deposit Guidelines – ena-docs.readthedocs.io
  • ISO/TC 276 – Biotechnology Standards – iso.org/committee/276
  • Aevum Standards Committee Charter – aevum.org/governance