Bioinformatics and Genomics File Formats Reference

This reference covers file formats used in genomics, transcriptomics, sequence analysis, and related bioinformatics applications.

Sequence Data Formats

.fasta / .fa / .fna - FASTA Format

Description: Text-based format for nucleotide or protein sequences
Typical Data: DNA, RNA, or protein sequences with headers
Use Cases: Sequence storage, BLAST searches, alignments
Python Libraries:
  • Biopython: SeqIO.parse('file.fasta', 'fasta')
  • pyfaidx: Fast indexed FASTA access
  • screed: Fast sequence parsing

EDA Approach:
  • Sequence count and length distribution
  • GC content analysis
  • N content (ambiguous bases)
  • Sequence ID parsing
  • Duplicate detection
  • Quality metrics for assemblies (N50, L50)
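
As a rough illustration of the length, GC, and N50 checks listed above, the sketch below uses only the standard library. The helper names (`read_fasta`, `gc_fraction`, `n50`) and the toy records are hypothetical; for real files, Biopython or pyfaidx is likely the more robust choice.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy input standing in for an open file handle (assumption for illustration).
toy = StringIO(">seq1\nACGTNN\n>seq2\nGGCC\nGG\n>seq3\nAT\n")
records = list(read_fasta(toy))
lengths = [len(seq) for _, seq in records]
n_counts = {name: seq.upper().count("N") for name, seq in records}
```

Here `lengths` feeds the length distribution, `n_counts` gives per-record ambiguous-base counts, and `n50(lengths)` gives the assembly contiguity metric.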

.fastq / .fq - FASTQ Format

Description: Sequence data with base quality scores
Typical Data: Raw sequencing reads with Phred quality scores
Use Cases: NGS data, quality control, read mapping
Python Libraries:
  • Biopython: SeqIO.parse('file.fastq', 'fastq')
  • pysam: Fast FASTQ/BAM operations
  • HTSeq: Sequencing data analysis

EDA Approach:
  • Read count and length distribution
  • Quality score distribution (per-base, per-read)
  • GC content and bias
  • Duplicate rate estimation
  • Adapter contamination detection
  • k-mer frequency analysis
  • Encoding format validation (Phred33/64)
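
The encoding check and per-read quality summary above could be sketched as follows. This is a minimal standard-library example that assumes strict four-line records; the function names and toy reads are hypothetical, and the offset guess is only a heuristic (high-quality Phred+33 data can look like Phred+64).

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield name[1:], seq, qual

def guess_offset(quality_strings):
    """Heuristic: characters below ASCII 59 (';') cannot occur in +64 encodings."""
    lowest = min(ord(c) for q in quality_strings for c in q)
    return 33 if lowest < 59 else 64

def mean_quality(qual, offset=33):
    """Mean Phred score of one read under the given ASCII offset."""
    return sum(ord(c) - offset for c in qual) / len(qual)

# Toy reads (assumption for illustration only).
toy = StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\n!!5I\n")
reads = list(read_fastq(toy))
offset = guess_offset(q for _, _, q in reads)
per_read = {name: mean_quality(q, offset) for name, _, q in reads}
```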

.sam - Sequence Alignment/Map

Description: Tab-delimited text format for alignments
Typical Data: Aligned sequencing reads with mapping quality
Use Cases: Read alignment storage, variant calling
Python Libraries:
  • pysam: pysam.AlignmentFile('file.sam', 'r')
  • HTSeq: HTSeq.SAM_Reader('file.sam')

EDA Approach:
  • Mapping rate and quality distribution
  • Coverage analysis
  • Insert size distribution (paired-end)
  • Alignment flags distribution
  • CIGAR string patterns
  • Mismatch and indel rates
  • Duplicate and supplementary alignment counts
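
To make the flag and mapping-rate checks concrete, here is a small standard-library sketch that decodes a few FLAG bits from SAM text lines. The function name and toy records are hypothetical; on real data, `pysam.AlignmentFile` exposes the same information without manual parsing.

```python
from collections import Counter

# A subset of the FLAG bits defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse",
    0x100: "secondary",
    0x400: "duplicate",
    0x800: "supplementary",
}

def summarize_sam(lines):
    """Count flag categories and collect MAPQ values of mapped records."""
    counts, mapqs = Counter(), []
    for line in lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        counts["records"] += 1
        for bit, name in FLAG_BITS.items():
            if flag & bit:
                counts[name] += 1
        if not flag & 0x4:
            mapqs.append(mapq)
    return counts, mapqs

# Toy alignments (assumption for illustration only).
toy = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
counts, mapqs = summarize_sam(toy)
mapping_rate = 1 - counts["unmapped"] / counts["records"]
```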

.bam - Binary Alignment/Map

Description: Compressed binary version of SAM
Typical Data: Aligned reads in compressed format
Use Cases: Efficient storage and processing of alignments
Python Libraries:
  • pysam: Full BAM support with indexing
  • bamnostic: Pure Python BAM reader

EDA Approach:
  • Same as SAM plus:
  • Compression ratio analysis
  • Index file (.bai) validation
  • Chromosome-wise statistics
  • Strand bias detection
  • Read group analysis

.cram - CRAM Format

Description: Highly compressed alignment format
Typical Data: Reference-compressed aligned reads
Use Cases: Long-term storage, space-efficient archives
Python Libraries:
  • pysam: CRAM support (requires reference)
  • Reference genome must be accessible

EDA Approach:
  • Compression efficiency vs BAM
  • Reference dependency validation
  • Lossy vs lossless compression assessment
  • Decompression performance
  • Similar alignment metrics as BAM

.bed - Browser Extensible Data

Description: Tab-delimited format for genomic features
Typical Data: Genomic intervals (chr, start, end) with annotations
Use Cases: Peak calling, variant annotation, genome browsing
Python Libraries:
  • pybedtools: pybedtools.BedTool('file.bed')
  • pyranges: pyranges.read_bed('file.bed')
  • pandas: Simple BED reading

EDA Approach:
  • Feature count and size distribution
  • Chromosome distribution
  • Strand bias
  • Score distribution (if present)
  • Overlap and proximity analysis
  • Coverage statistics
  • Gap analysis between features
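
The size, chromosome, and gap checks above can be sketched with plain Python. This example assumes the usual BED convention of 0-based, half-open intervals, so a feature's length is `end - start`; the function name and toy intervals are hypothetical.

```python
from collections import defaultdict

def bed_summary(lines):
    """Return per-chromosome feature counts, feature sizes, and gap sizes."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        by_chrom[chrom].append((int(start), int(end)))
    sizes, gaps = [], []
    for intervals in by_chrom.values():
        intervals.sort()
        sizes.extend(end - start for start, end in intervals)
        prev_end = None
        for start, end in intervals:
            # A gap exists only when the next feature starts past everything seen so far.
            if prev_end is not None and start > prev_end:
                gaps.append(start - prev_end)
            prev_end = end if prev_end is None else max(prev_end, end)
    return {c: len(v) for c, v in by_chrom.items()}, sizes, gaps

# Toy intervals (assumption for illustration only).
toy = [
    "chr1\t0\t10",
    "chr1\t15\t30\tfeatB\t5\t+",
    "chr2\t5\t8",
    "chr1\t25\t40",
]
per_chrom, sizes, gaps = bed_summary(toy)
```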

.bedGraph - BED with Graph Data

Description: BED format with per-base signal values
Typical Data: Continuous-valued genomic data (coverage, signals)
Use Cases: Coverage tracks, ChIP-seq signals, methylation
Python Libraries:
  • pyBigWig: Can convert to bigWig
  • pybedtools: BedGraph operations

EDA Approach:
  • Signal distribution statistics
  • Genome coverage percentage
  • Signal dynamics (peaks, valleys)
  • Chromosome-wise signal patterns
  • Quantile analysis
  • Zero-coverage regions
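
Because each bedGraph line covers a span of bases, signal statistics are usually weighted by interval length. The sketch below shows one way to compute covered bases, zero-signal bases, and a length-weighted mean; the function name and toy track are hypothetical, and bases absent from the file are not counted.

```python
def bedgraph_stats(lines):
    """Return (bases with signal, bases with zero signal, length-weighted mean)."""
    covered = zero = 0
    weighted_sum = 0.0
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end, value = line.split()[:4]
        span = int(end) - int(start)  # 0-based, half-open interval
        value = float(value)
        if value == 0:
            zero += span
        else:
            covered += span
        weighted_sum += value * span
    total = covered + zero
    return covered, zero, (weighted_sum / total if total else 0.0)

# Toy track (assumption for illustration only).
toy = [
    "track type=bedGraph",
    "chr1\t0\t10\t2.0",
    "chr1\t10\t20\t0",
    "chr1\t20\t25\t4.0",
]
covered, zero, mean_signal = bedgraph_stats(toy)
```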

.bigWig / .bw - Binary BigWig

Description: Indexed binary format for genome-wide signal data
Typical Data: Continuous genomic signals (compressed and indexed)
Use Cases: Efficient genome browser tracks, large-scale data
Python Libraries:
  • pyBigWig: pyBigWig.open('file.bw')
  • pybbi: BigWig/BigBed interface

EDA Approach:
  • Signal statistics extraction
  • Zoom level analysis
  • Regional signal extraction
  • Efficient genome-wide summaries
  • Compression efficiency
  • Index structure analysis

.bigBed / .bb - Binary BigBed

Description: Indexed binary BED format
Typical Data: Genomic features (compressed and indexed)
Use Cases: Large feature sets, genome browsers
Python Libraries:
  • pybbi: BigBed reading
  • pybigtools: Modern BigBed interface

EDA Approach:
  • Feature density analysis
  • Efficient interval queries
  • Zoom level validation
  • Index performance metrics
  • Feature size statistics

.gff / .gff3 - General Feature Format

Description: Tab-delimited format for genomic annotations
Typical Data: Gene models, transcripts, exons, regulatory elements
Use Cases: Genome annotation, gene prediction
Python Libraries:
  • BCBio.GFF: GFF parser from the separate bcbio-gff package (built on Biopython)
  • gffutils: gffutils.create_db('file.gff3')
  • pyranges: GFF support

EDA Approach:
  • Feature type distribution (gene, exon, CDS, etc.)
  • Gene structure validation
  • Strand balance
  • Hierarchical relationship validation
  • Phase validation for CDS
  • Attribute completeness
  • Gene model statistics (introns, exons per gene)
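
A minimal sketch of the feature-type and parent/child checks above, using only the standard library. It assumes GFF3 column 9 attributes in `key=value;key=value` form with percent-encoded values; the function names and toy records are hypothetical.

```python
from collections import Counter
from urllib.parse import unquote

def parse_gff3_attributes(column9):
    """Parse GFF3 attributes into {key: [values]}; values may be comma-separated."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs

def gff3_summary(lines):
    """Count feature types and the number of children recorded per Parent ID."""
    types, children = Counter(), Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        types[cols[2]] += 1
        for parent in parse_gff3_attributes(cols[8]).get("Parent", []):
            children[parent] += 1
    return types, children

# Toy annotation (assumption for illustration only).
toy = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t1000\t.\t+\t.\tID=gene1;Name=demo",
    "chr1\tsrc\tmRNA\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsrc\texon\t1\t200\t.\t+\t.\tID=ex1;Parent=tx1",
    "chr1\tsrc\texon\t800\t1000\t.\t+\t.\tID=ex2;Parent=tx1",
]
types, children = gff3_summary(toy)
```

A `Parent` value that never appears as an `ID` elsewhere in the file would point to a broken hierarchy, which is one way to approach the relationship validation listed above.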

.gtf - Gene Transfer Format

Description: GFF2-based format for gene annotations
Typical Data: Gene and transcript annotations
Use Cases: RNA-seq analysis, gene quantification
Python Libraries:
  • pyranges: pyranges.read_gtf('file.gtf')
  • gffutils: GTF database creation
  • HTSeq: GTF reading for counts

EDA Approach:
  • Transcript isoform analysis
  • Gene structure completeness
  • Exon number distribution
  • Transcript length distribution
  • TSS and TES analysis
  • Biotype distribution
  • Overlapping gene detection
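
The isoform and exon-number checks above depend on parsing the GTF attribute column, which uses `key "value";` pairs rather than GFF3's `key=value`. The following sketch is one possible approach; the function name and toy records are hypothetical, and the regular expression only captures quoted values (some producers leave numeric attributes unquoted).

```python
import re
from collections import Counter, defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

def gtf_structure(lines):
    """Return {gene_id: set(transcript_ids)} and exon counts per transcript."""
    transcripts = defaultdict(set)
    exons = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene, tx = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene and tx:
            transcripts[gene].add(tx)
            if cols[2] == "exon":
                exons[tx] += 1
    return transcripts, exons

# Toy annotation (assumption for illustration only).
toy = [
    'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";',
    'chr1\tsrc\ttranscript\t1\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1"; exon_number "1";',
    'chr1\tsrc\texon\t500\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1"; exon_number "2";',
    'chr1\tsrc\texon\t1\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t2"; exon_number "1";',
]
transcripts, exons = gtf_structure(toy)
isoforms_per_gene = {gene: len(txs) for gene, txs in transcripts.items()}
```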

.vcf - Variant Call Format

Description: Text format for genetic variants
Typical Data: SNPs, indels, structural variants with annotations
Use Cases: Variant calling, population genetics, GWAS
Python Libraries:
  • pysam: pysam.VariantFile('file.vcf')
  • cyvcf2: Fast VCF parsing
  • PyVCF: Older but comprehensive

EDA Approach:
  • Variant count by type (SNP, indel, SV)
  • Quality score distribution
  • Allele frequency spectrum
  • Transition/transversion ratio
  • Heterozygosity rates
  • Missing genotype analysis
  • Hardy-Weinberg equilibrium
  • Annotation completeness (if annotated)
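
As an illustration of the variant-type and transition/transversion counts above, the sketch below classifies each ALT allele from the REF and ALT columns alone. The function name, the simplified classification rules, and the toy records are assumptions; libraries such as cyvcf2 or pysam handle edge cases that this example ignores.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def classify_vcf(lines):
    """Count SNPs, indels, other alleles, transitions, and transversions."""
    counts = {"snp": 0, "indel": 0, "other": 0, "ts": 0, "tv": 0}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            if alt.startswith("<") or alt in (".", "*") or "[" in alt or "]" in alt:
                counts["other"] += 1  # symbolic, missing, or breakend alleles
            elif len(ref) == 1 and len(alt) == 1:
                counts["snp"] += 1
                counts["ts" if (ref, alt) in TRANSITIONS else "tv"] += 1
            elif len(ref) != len(alt):
                counts["indel"] += 1
            else:
                counts["other"] += 1  # e.g. multi-nucleotide substitutions
    return counts

# Toy records (assumption for illustration only).
toy = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tA\t40\tPASS\t.",
    "chr1\t300\t.\tAT\tA\t60\tPASS\t.",
    "chr1\t400\t.\tG\tA,T\t30\tPASS\t.",
    "chr1\t500\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL",
]
counts = classify_vcf(toy)
ts_tv = counts["ts"] / counts["tv"] if counts["tv"] else float("nan")
```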

.bcf - Binary VCF

Description: Compressed binary variant format
Typical Data: Same as VCF but binary
Use Cases: Efficient variant storage and processing
Python Libraries:
  • pysam: Full BCF support
  • cyvcf2: Optimized BCF reading

EDA Approach:
  • Same as VCF plus:
  • Compression efficiency
  • Indexing validation
  • Read performance metrics

.gvcf - Genomic VCF

Description: VCF with reference confidence blocks
Typical Data: All positions (variant and non-variant)
Use Cases: Joint genotyping workflows, GATK
Python Libraries:
  • pysam: GVCF support
  • Standard VCF parsers

EDA Approach:
  • Reference block analysis
  • Coverage uniformity
  • Variant density
  • Genotype quality across genome
  • Reference confidence distribution

RNA-Seq and Expression Data

.counts - Gene Count Matrix

Description: Tab-delimited gene expression counts
Typical Data: Gene IDs with read counts per sample
Use Cases: RNA-seq quantification, differential expression
Python Libraries:
  • pandas: pd.read_csv('file.counts', sep='\t')
  • scanpy (for single-cell): sc.read_csv()

EDA Approach:
  • Library size distribution
  • Detection rate (genes per sample)
  • Zero-inflation analysis
  • Count distribution (log scale)
  • Outlier sample detection
  • Correlation between replicates
  • PCA for sample relationships
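
The first three checks above (library size, detection rate, zero inflation) reduce to simple column sums. The sketch below assumes a header row of sample names, gene IDs in the first column, and integer counts; real count files often carry extra annotation columns that would need to be dropped first. The function name and toy matrix are hypothetical, and pandas would be the more typical tool.

```python
import csv
from io import StringIO

def count_matrix_qc(handle):
    """Return per-sample library size, genes detected, and overall zero fraction."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    lib_size = dict.fromkeys(samples, 0)
    detected = dict.fromkeys(samples, 0)
    n_genes = 0
    for row in reader:
        n_genes += 1
        for sample, value in zip(samples, row[1:]):
            count = int(value)
            lib_size[sample] += count
            detected[sample] += count > 0  # True counts as 1
    zero_fraction = 1 - sum(detected.values()) / (n_genes * len(samples))
    return lib_size, detected, zero_fraction

# Toy matrix (assumption for illustration only).
toy = StringIO("gene\ts1\ts2\ngeneA\t10\t0\ngeneB\t0\t0\ngeneC\t5\t7\n")
lib_size, detected, zero_fraction = count_matrix_qc(toy)
```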

.tpm / .fpkm - Normalized Expression

Description: Normalized gene expression values
Typical Data: TPM (transcripts per million) or FPKM values
Use Cases: Cross-sample comparison, visualization
Python Libraries:
  • pandas: Standard CSV reading
  • anndata: For integrated analysis

EDA Approach:
  • Expression distribution
  • Highly expressed gene identification
  • Sample clustering
  • Batch effect detection
  • Coefficient of variation analysis
  • Dynamic range assessment

.mtx - Matrix Market Format

Description: Sparse matrix format (common in single-cell)
Typical Data: Sparse count matrices (cells × genes)
Use Cases: Single-cell RNA-seq, large sparse matrices
Python Libraries:
  • scipy.io: scipy.io.mmread('file.mtx')
  • scanpy: sc.read_mtx('file.mtx')

EDA Approach:
  • Sparsity analysis
  • Cell and gene filtering thresholds
  • Doublet detection metrics
  • Mitochondrial fraction
  • UMI count distribution
  • Gene detection per cell
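
For the sparsity check, the dimensions and the number of stored entries can be read from the Matrix Market size line without loading the matrix. The sketch below assumes the coordinate layout with `general` symmetry, where the stored-entry count equals the number of nonzero cells; the function name and toy file content are hypothetical.

```python
def mtx_sparsity(lines):
    """Return (rows, cols, sparsity) from the first non-comment line of an .mtx file."""
    for line in lines:
        if line.startswith("%") or not line.strip():
            continue  # banner and comment lines start with '%'
        rows, cols, nnz = (int(x) for x in line.split()[:3])
        return rows, cols, 1 - nnz / (rows * cols)
    raise ValueError("no size line found")

# Toy file content (assumption for illustration only).
toy = [
    "%%MatrixMarket matrix coordinate integer general",
    "% rows cols stored-entries",
    "4 5 3",
    "1 1 2",
    "2 3 1",
    "4 5 7",
]
rows, cols, sparsity = mtx_sparsity(toy)
```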

.h5ad - AnnData Format

Description: HDF5-based annotated data matrix
Typical Data: Expression matrix with metadata (cells, genes)
Use Cases: Single-cell RNA-seq analysis with Scanpy
Python Libraries:
  • scanpy: sc.read_h5ad('file.h5ad')
  • anndata: Direct AnnData manipulation

EDA Approach:
  • Cell and gene counts
  • Metadata completeness
  • Layer availability (raw, normalized)
  • Embedding presence (PCA, UMAP)
  • QC metrics distribution
  • Batch information
  • Cell type annotation coverage

.loom - Loom Format

Description: HDF5-based format for omics data
Typical Data: Expression matrices with metadata
Use Cases: Single-cell data, RNA velocity analysis
Python Libraries:
  • loompy: loompy.connect('file.loom')
  • scanpy: Can import loom files

EDA Approach:
  • Layer analysis (spliced, unspliced)
  • Row and column attribute exploration
  • Graph connectivity analysis
  • Cluster assignments
  • Velocity-specific metrics

.rds - R Data Serialization

Description: R object storage (often Seurat objects)
Typical Data: R analysis results, especially single-cell
Use Cases: R-Python data exchange
Python Libraries:
  • pyreadr: pyreadr.read_r('file.rds') (data frames only; complex objects such as Seurat need rpy2 or conversion)
  • rpy2: For full R integration
  • Conversion tools to AnnData

EDA Approach:
  • Object type identification
  • Data structure exploration
  • Metadata extraction
  • Conversion validation

Alignment and Assembly Formats

.maf - Multiple Alignment Format

Description: Text format for multiple sequence alignments
Typical Data: Genome-wide or local multiple alignments
Use Cases: Comparative genomics, conservation analysis
Python Libraries:
  • Biopython: AlignIO.parse('file.maf', 'maf')
  • bx-python: MAF-specific tools

EDA Approach:
  • Alignment block statistics
  • Species coverage
  • Gap analysis
  • Conservation scoring
  • Alignment quality metrics
  • Block length distribution

.axt - Pairwise Alignment Format

Description: Pairwise alignment format (UCSC)
Typical Data: Pairwise genomic alignments
Use Cases: Genome comparison, synteny analysis
Python Libraries:
  • Custom parsers (simple format)
  • bx-python: AXT support

EDA Approach:
  • Alignment score distribution
  • Identity percentage
  • Syntenic block identification
  • Gap size analysis
  • Coverage statistics

.chain - Chain Alignment Format

Description: Genome coordinate mapping chains
Typical Data: Coordinate transformations between genome builds
Use Cases: Liftover, coordinate conversion
Python Libraries:
  • pyliftover: Chain file usage
  • Custom parsers for chain format

EDA Approach:
  • Chain score distribution
  • Coverage of source genome
  • Gap analysis
  • Inversion detection
  • Mapping quality assessment

.psl - Pattern Space Layout

Description: Tab-delimited alignment format produced by BLAT
Typical Data: Alignment results from BLAT
Use Cases: Transcript mapping, similarity searches
Python Libraries:
  • Custom parsers (tab-delimited)
  • pybedtools: Can handle PSL

EDA Approach:
  • Match percentage distribution
  • Gap statistics
  • Query coverage
  • Multiple mapping analysis
  • Alignment quality metrics
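
Since PSL is tab-delimited with fixed column positions, match percentage and query coverage can be computed directly from a row. The sketch below assumes the standard 21-column layout (matches, misMatches, repMatches in columns 1 to 3; qName, qSize, qStart, qEnd in columns 10 to 13) and skips header lines by checking for a leading digit. The function name, the identity definition, and the toy row are assumptions for illustration.

```python
from collections import Counter

def psl_metrics(lines):
    """Return per-alignment identity and query coverage, plus hits per query."""
    rows, hits = [], Counter()
    for line in lines:
        if not line[:1].isdigit():  # skip the optional psLayout header block
            continue
        f = line.rstrip("\n").split("\t")
        matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
        q_name, q_size, q_start, q_end = f[9], int(f[10]), int(f[11]), int(f[12])
        aligned = matches + mismatches + rep_matches
        rows.append({
            "query": q_name,
            "identity": (matches + rep_matches) / aligned if aligned else 0.0,
            "query_coverage": (q_end - q_start) / q_size,
        })
        hits[q_name] += 1
    return rows, hits

# Toy alignment row (assumption for illustration only).
toy_row = "\t".join([
    "90", "10", "0", "0", "0", "0", "0", "0", "+",
    "q1", "200", "0", "100", "chr1", "1000", "500", "600",
    "1", "100,", "0,", "500,",
])
rows, hits = psl_metrics(["psLayout version 3", toy_row])
```

Queries with more than one entry in `hits` are candidates for the multiple mapping analysis listed above.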

Genome Assembly and Annotation

.agp - Assembly Golden Path

Description: Assembly structure description
Typical Data: Scaffold composition, gap information
Use Cases: Genome assembly representation
Python Libraries:
  • Custom parsers (simple tab-delimited)
  • Assembly analysis tools

EDA Approach:
  • Scaffold statistics (N50, L50)
  • Gap type and size distribution
  • Component length analysis
  • Assembly contiguity metrics
  • Unplaced contig analysis

.scaffolds / .contigs - Assembly Sequences

Description: Assembled sequences (usually FASTA)
Typical Data: Assembled genomic sequences
Use Cases: Genome assembly output
Python Libraries:
  • Same as FASTA format
  • Assembly-specific tools (QUAST)

EDA Approach:
  • Assembly statistics (N50, N90, etc.)
  • Length distribution
  • Coverage analysis
  • Gap (N) content
  • Duplication assessment
  • BUSCO completeness (if annotations available)

.2bit - Compressed Genome Format

Description: UCSC compact genome format
Typical Data: Reference genomes (highly compressed)
Use Cases: Efficient genome storage and access
Python Libraries:
  • py2bit: py2bit.open('file.2bit')
  • twobitreader: Alternative reader

EDA Approach:
  • Compression efficiency
  • Random access performance
  • Sequence extraction validation
  • Masked region analysis
  • N content and distribution

.sizes - Chromosome Sizes

Description: Simple format with chromosome lengths
Typical Data: Tab-delimited chromosome names and sizes
Use Cases: Genome browsers, coordinate validation
Python Libraries:
  • Simple file reading with pandas
  • Built into many genomic tools

EDA Approach:
  • Genome size calculation
  • Chromosome count
  • Size distribution
  • Karyotype validation
  • Completeness check against reference

Phylogenetics and Evolution

.nwk / .newick - Newick Tree Format

Description: Parenthetical tree representation
Typical Data: Phylogenetic trees with branch lengths
Use Cases: Evolutionary analysis, tree visualization
Python Libraries:
  • Biopython: Phylo.read('file.nwk', 'newick')
  • ete3: ete3.Tree('file.nwk')
  • dendropy: Phylogenetic computing

EDA Approach:
  • Tree structure analysis (tips, internal nodes)
  • Branch length distribution
  • Tree balance metrics
  • Ultrametricity check
  • Bootstrap support analysis
  • Topology validation
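
For a quick first look at a Newick string before loading it into a tree library, a few structural counts can be read straight from the text. The sketch below is a heuristic, not a parser: it assumes a single tree with no quoted labels or bracketed comments, and treats numeric labels after a closing parenthesis as support values. The function name and toy tree are hypothetical.

```python
import re

def newick_summary(text):
    """Heuristic counts for one unquoted, comment-free Newick tree."""
    tree = text.strip()
    return {
        # Each internal node with k children contributes k - 1 commas.
        "tips": tree.count(",") + 1,
        "internal_nodes": tree.count("("),
        "branch_lengths": [float(x) for x in re.findall(r":([0-9.eE+-]+)", tree)],
        "supports": [float(x) for x in re.findall(r"\)(\d+(?:\.\d+)?)", tree)],
    }

# Toy tree (assumption for illustration only).
summary = newick_summary("((A:0.1,B:0.2)95:0.3,C:0.4);")
```

Anything beyond these counts, such as balance metrics or an ultrametricity check, is better done with Biopython, ete3, or dendropy.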

.nexus - Nexus Format

Description: Rich format for phylogenetic data
Typical Data: Alignments, trees, character matrices
Use Cases: Phylogenetic software interchange
Python Libraries:
  • Biopython: Nexus support
  • dendropy: Comprehensive Nexus handling

EDA Approach:
  • Data block analysis
  • Character type distribution
  • Tree block validation
  • Taxa consistency
  • Command block parsing
  • Format compliance checking

.phylip - PHYLIP Format

Description: Sequence alignment format (strict/relaxed)
Typical Data: Multiple sequence alignments
Use Cases: Phylogenetic analysis input
Python Libraries:
  • Biopython: AlignIO.read('file.phy', 'phylip')
  • dendropy: PHYLIP support

EDA Approach:
  • Alignment dimensions
  • Sequence length uniformity
  • Gap position analysis
  • Informative site calculation
  • Format variant detection (strict vs relaxed)

.paml - PAML Output

Description: Output from PAML phylogenetic software
Typical Data: Evolutionary model results, dN/dS ratios
Use Cases: Molecular evolution analysis
Python Libraries:
  • Custom parsers for specific PAML programs
  • Biopython: Basic PAML parsing

EDA Approach:
  • Model parameter extraction
  • Likelihood values
  • dN/dS ratio distribution
  • Branch-specific results
  • Convergence assessment

Protein and Structure Data

.embl - EMBL Format

Description: Rich sequence annotation format
Typical Data: Sequences with extensive annotations
Use Cases: Sequence databases, genome records
Python Libraries:
  • Biopython: SeqIO.read('file.embl', 'embl')

EDA Approach:
  • Feature annotation completeness
  • Sequence length and type
  • Reference information
  • Cross-reference validation
  • Feature overlap analysis

.genbank / .gb / .gbk - GenBank Format

Description: NCBI's sequence annotation format
Typical Data: Annotated sequences with features
Use Cases: Sequence databases, annotation transfer
Python Libraries:
  • Biopython: SeqIO.parse('file.gb', 'genbank')

EDA Approach:
  • Feature type distribution
  • CDS analysis (start codons, stops)
  • Translation validation
  • Annotation completeness
  • Source organism extraction
  • Reference and publication info
  • Locus tag consistency

.sff - Standard Flowgram Format

Description: 454/Roche sequencing data format
Typical Data: Raw pyrosequencing flowgrams
Use Cases: Legacy 454 sequencing data
Python Libraries:
  • Biopython: SeqIO.parse('file.sff', 'sff')
  • Platform-specific tools

EDA Approach:
  • Read count and length
  • Flowgram signal quality
  • Key sequence detection
  • Adapter trimming validation
  • Quality score distribution

.hdf5 (Genomics Specific)

Description: HDF5 for genomics (10X, Hi-C, etc.)
Typical Data: High-throughput genomics data
Use Cases: 10X Genomics, spatial transcriptomics
Python Libraries:
  • h5py: Low-level access
  • scanpy: For 10X data
  • cooler: For Hi-C data

EDA Approach:
  • Dataset structure exploration
  • Barcode statistics
  • UMI counting
  • Feature-barcode matrix analysis
  • Spatial coordinates (if applicable)

.cool / .mcool - Cooler Format

Description: HDF5-based Hi-C contact matrices
Typical Data: Chromatin interaction matrices
Use Cases: 3D genome analysis, Hi-C data
Python Libraries:
  • cooler: cooler.Cooler('file.cool')
  • hicstraw: For .hic format

EDA Approach:
  • Resolution analysis
  • Contact matrix statistics
  • Distance decay curves
  • Compartment analysis
  • TAD boundary detection
  • Balance factor validation

.hic - Hi-C Binary Format

Description: Juicer binary Hi-C format
Typical Data: Multi-resolution Hi-C matrices
Use Cases: Hi-C analysis with Juicer tools
Python Libraries:
  • hicstraw: hicstraw.HiCFile('file.hic')
  • straw: C++ library with Python bindings

EDA Approach:
  • Available resolutions
  • Normalization methods
  • Contact statistics
  • Chromosomal interactions
  • Quality metrics

.bw (ChIP-seq / ATAC-seq specific)

Description: BigWig files for epigenomics
Typical Data: Coverage or enrichment signals
Use Cases: ChIP-seq, ATAC-seq, DNase-seq
Python Libraries:
  • pyBigWig: Standard bigWig access

EDA Approach:
  • Peak enrichment patterns
  • Background signal analysis
  • Sample correlation
  • Signal-to-noise ratio
  • Library complexity metrics

.narrowPeak / .broadPeak - ENCODE Peak Formats

Description: BED-based formats for peaks
Typical Data: Peak calls with scores and p-values
Use Cases: ChIP-seq peak calling output
Python Libraries:
  • pybedtools: BED-compatible
  • Custom parsers for peak-specific fields

EDA Approach:
  • Peak count and width distribution
  • Signal value distribution
  • Q-value and p-value analysis
  • Peak summit analysis
  • Overlap with known features
  • Motif enrichment preparation
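
A custom parser for the peak-specific fields might look like the sketch below. It assumes the ENCODE narrowPeak layout of ten columns, where columns 7 to 9 hold signalValue, pValue, and qValue (the latter two as -log10 values, with -1 meaning not assigned) and column 10 holds the summit offset from the start (again -1 if absent). The function name and toy peaks are hypothetical.

```python
def parse_narrowpeak(line):
    """Parse one narrowPeak line into the fields used for peak-level EDA."""
    f = line.rstrip("\n").split("\t")
    start, end, offset = int(f[1]), int(f[2]), int(f[9])
    return {
        "chrom": f[0],
        "width": end - start,  # 0-based, half-open like BED
        "signal": float(f[6]),
        "neg_log10_p": float(f[7]),
        "neg_log10_q": float(f[8]),
        # Absolute summit position, or None when no summit was called.
        "summit": start + offset if offset >= 0 else None,
    }

# Toy peaks (assumption for illustration only).
toy = [
    "chr1\t100\t300\tpeak1\t500\t.\t12.5\t8.0\t6.0\t80",
    "chr1\t900\t950\tpeak2\t100\t.\t3.0\t2.5\t-1\t-1",
]
peaks = [parse_narrowpeak(line) for line in toy]
widths = [p["width"] for p in peaks]
```

The broadPeak format omits the final summit column, so this parser would need a nine-column variant for those files.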

.wig - Wiggle Format

Description: Dense continuous genomic data
Typical Data: Coverage or signal tracks
Use Cases: Genome browser visualization
Python Libraries:
  • pyBigWig: Can convert to bigWig
  • Custom parsers for wiggle format

EDA Approach:
  • Signal statistics
  • Coverage metrics
  • Format variant (fixedStep vs variableStep)
  • Span parameter analysis
  • Conversion efficiency to bigWig

.ab1 - Sanger Sequencing Trace

Description: Binary chromatogram format
Typical Data: Sanger sequencing traces
Use Cases: Capillary sequencing validation
Python Libraries:
  • Biopython: SeqIO.read('file.ab1', 'abi')
  • tracy tools: For quality assessment

EDA Approach:
  • Base calling quality
  • Trace quality scores
  • Mixed base detection
  • Primer and vector detection
  • Read length and quality region
  • Heterozygosity detection

.scf - Standard Chromatogram Format

Description: Sanger sequencing chromatogram
Typical Data: Base calls and confidence values
Use Cases: Sequencing trace analysis
Python Libraries:
  • Biopython: SCF format support

EDA Approach:
  • Similar to AB1 format
  • Quality score profiles
  • Peak height ratios
  • Signal-to-noise metrics

.idx - Index Files (Generic)

Description: Index files for various formats
Typical Data: Fast random access indices
Use Cases: Efficient data access (BAM, VCF, etc.)
Python Libraries:
  • Format-specific libraries handle indices
  • pysam: Auto-handles BAI, CSI indices

EDA Approach:
  • Index completeness validation
  • Binning strategy analysis
  • Access performance metrics
  • Index size vs data size ratio