Bioinformatics and Genomics File Formats Reference
This reference covers file formats used in genomics, transcriptomics, sequence analysis, and related bioinformatics applications.
Sequence Data Formats
.fasta / .fa / .fna - FASTA Format
Description: Text-based format for nucleotide or protein sequences
Typical Data: DNA, RNA, or protein sequences with headers
Use Cases: Sequence storage, BLAST searches, alignments
Python Libraries:
- Biopython: SeqIO.parse('file.fasta', 'fasta')
- pyfaidx: Fast indexed FASTA access
- screed: Fast sequence parsing
EDA Approach:
- Sequence count and length distribution
- GC content analysis
- N content (ambiguous bases)
- Sequence ID parsing
- Duplicate detection
- Quality metrics for assemblies (N50, L50)
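Several of the checks above (GC content, N handling, N50/L50) need only the standard library. The following is a minimal sketch, not a replacement for Biopython or pyfaidx; the helper names read_fasta, gc_fraction, and n50_l50 are hypothetical, and GC content is computed over unambiguous bases only, which is one of several conventions in use.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """GC fraction over A/C/G/T only (assumption: N and other codes are excluded)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50): the length at which the cumulative sum of
    descending-sorted lengths reaches half the total, and how many
    sequences were needed to get there."""
    total = sum(lengths)
    running = 0
    for rank, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, rank
    return 0, 0
```

With a real file, the parser can be fed an open handle, for example `lengths = [len(seq) for _, seq in read_fasta(open('file.fasta'))]`.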
.fastq / .fq - FASTQ Format
Description: Sequence data with base quality scores
Typical Data: Raw sequencing reads with Phred quality scores
Use Cases: NGS data, quality control, read mapping
Python Libraries:
- Biopython: SeqIO.parse('file.fastq', 'fastq')
- pysam: Fast FASTQ/BAM operations
- HTSeq: Sequencing data analysis
EDA Approach:
- Read count and length distribution
- Quality score distribution (per-base, per-read)
- GC content and bias
- Duplicate rate estimation
- Adapter contamination detection
- k-mer frequency analysis
- Encoding format validation (Phred33/64)
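Encoding validation can be approximated from the range of ASCII codes in the quality strings. The sketch below is a heuristic under stated assumptions: records are strict four-line FASTQ (no wrapped sequences), and the helper names are hypothetical. Characters below ';' (ASCII 59) cannot occur in any +64 encoding, so they imply Phred+33; a file whose lowest character is '@' (ASCII 64) or higher is probably Phred+64, but a small sample of uniformly high-quality Phred+33 reads would look the same.

```python
from typing import Iterable, Iterator, Optional


def fastq_qualities(lines: Iterable[str]) -> Iterator[str]:
    """Yield each record's quality string, assuming strict 4-line records."""
    for i, line in enumerate(lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def guess_phred_offset(qualities: Iterable[str]) -> Optional[int]:
    """Return 33, 64, or None when the observed range is ambiguous."""
    lowest = min((ord(c) for q in qualities for c in q), default=None)
    if lowest is None:
        return None
    if lowest < 59:
        return 33  # impossible under any +64 encoding
    if lowest >= 64:
        return 64  # likely Phred+64; heuristic only
    return None    # 59..63: could be Solexa+64 or Phred+33


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read for the given ASCII offset."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)
```

Sampling the first few thousand reads is usually enough for the offset guess; per-read means can then be histogrammed for the quality distribution.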
.sam - Sequence Alignment/Map
Description: Tab-delimited text format for alignments
Typical Data: Aligned sequencing reads with mapping quality
Use Cases: Read alignment storage, variant calling
Python Libraries:
- pysam: pysam.AlignmentFile('file.sam', 'r')
- HTSeq: HTSeq.SAM_Reader('file.sam')
EDA Approach:
- Mapping rate and quality distribution
- Coverage analysis
- Insert size distribution (paired-end)
- Alignment flags distribution
- CIGAR string patterns
- Mismatch and indel rates
- Duplicate and supplementary alignment counts
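Flag and CIGAR summaries follow directly from the SAM text layout (FLAG is column 2, CIGAR is column 6). The sketch below is a dependency-free illustration with hypothetical helper names; it decodes only a subset of the FLAG bits, and for real files pysam is the more practical route.

```python
import re
from collections import Counter
from typing import Iterable, List

# Subset of SAM FLAG bits relevant to the checks listed above.
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x100: "secondary",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """Return the names of the bits (from FLAG_BITS) set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_op_lengths(cigar: str) -> Counter:
    """Total length per CIGAR operation, e.g. M, I, D, S."""
    totals = Counter()
    if cigar == "*":
        return totals
    for length, op in CIGAR_RE.findall(cigar):
        totals[op] += int(length)
    return totals


def flag_counts(lines: Iterable[str]) -> Counter:
    """Count records per flag name, plus a 'total' key, skipping headers."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        counts["total"] += 1
        for name in decode_flag(int(fields[1])):
            counts[name] += 1
    return counts
```

A mapping rate then follows as `1 - counts['unmapped'] / counts['total']`, and indel rates can be estimated by summing the I and D totals across reads.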
.bam - Binary Alignment/Map
Description: Compressed binary version of SAM
Typical Data: Aligned reads in compressed format
Use Cases: Efficient storage and processing of alignments
Python Libraries:
- pysam: Full BAM support with indexing
- bamnostic: Pure Python BAM reader
EDA Approach:
- Same as SAM plus:
- Compression ratio analysis
- Index file (.bai) validation
- Chromosome-wise statistics
- Strand bias detection
- Read group analysis
.cram - CRAM Format
Description: Highly compressed alignment format
Typical Data: Reference-compressed aligned reads
Use Cases: Long-term storage, space-efficient archives
Python Libraries:
- pysam: CRAM support (requires reference)
- Reference genome must be accessible
EDA Approach:
- Compression efficiency vs BAM
- Reference dependency validation
- Lossy vs lossless compression assessment
- Decompression performance
- Similar alignment metrics as BAM
.bed - Browser Extensible Data
Description: Tab-delimited format for genomic features
Typical Data: Genomic intervals (chr, start, end) with annotations
Use Cases: Peak calling, variant annotation, genome browsing
Python Libraries:
- pybedtools: pybedtools.BedTool('file.bed')
- pyranges: pyranges.read_bed('file.bed')
- pandas: Simple BED reading
EDA Approach:
- Feature count and size distribution
- Chromosome distribution
- Strand bias
- Score distribution (if present)
- Overlap and proximity analysis
- Coverage statistics
- Gap analysis between features
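Feature sizes and gaps depend on BED's coordinate convention: starts are 0-based and ends are exclusive, so a feature's size is end minus start and the gap between two features is the next start minus the previous end. The sketch below illustrates this with hypothetical helper names; it reads only the first three columns and treats overlapping features as contiguous.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Interval = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> Iterator[Interval]:
    """Yield (chrom, start, end), skipping comment, track, and browser lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def feature_sizes(intervals: Iterable[Interval]) -> List[int]:
    """Sizes in bases; half-open coordinates make this simply end - start."""
    return [end - start for _, start, end in intervals]


def gaps_between_features(intervals: Iterable[Interval]) -> Dict[str, List[int]]:
    """Per-chromosome gap lengths between consecutive, non-overlapping features."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    gaps = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        reach = ivs[0][1]  # furthest end coordinate seen so far
        found = []
        for start, end in ivs[1:]:
            if start > reach:
                found.append(start - reach)
            reach = max(reach, end)
        gaps[chrom] = found
    return gaps
```

For large files or set operations such as intersections, pybedtools or pyranges will be considerably faster than this pure-Python pass.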
.bedGraph - BED with Graph Data
Description: BED format with per-base signal values
Typical Data: Continuous-valued genomic data (coverage, signals)
Use Cases: Coverage tracks, ChIP-seq signals, methylation
Python Libraries:
- pyBigWig: Can convert to bigWig
- pybedtools: BedGraph operations
EDA Approach:
- Signal distribution statistics
- Genome coverage percentage
- Signal dynamics (peaks, valleys)
- Chromosome-wise signal patterns
- Quantile analysis
- Zero-coverage regions
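Because each bedGraph row covers an interval rather than a single base, summary statistics should be weighted by interval length. The sketch below assumes non-overlapping intervals, treats unlisted bases as zero, and takes the genome size as a caller-supplied value (for example, the sum of a chrom.sizes file); the function name is hypothetical.

```python
from typing import Dict, Iterable


def bedgraph_summary(lines: Iterable[str], genome_size: int) -> Dict[str, float]:
    """Length-weighted coverage fraction and mean signal for bedGraph rows."""
    covered = 0      # bases with non-zero signal
    weighted = 0.0   # sum of value * interval length
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end, value = line.split()[:4]
        span = int(end) - int(start)
        signal = float(value)
        if signal != 0:
            covered += span
        weighted += signal * span
    return {
        "covered_fraction": covered / genome_size,
        "mean_signal": weighted / genome_size,
    }
```

Zero-coverage regions are then the complement: bases in explicit zero-valued rows plus any bases absent from the file.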
.bigWig / .bw - Binary BigWig
Description: Indexed binary format for genome-wide signal data
Typical Data: Continuous genomic signals (compressed and indexed)
Use Cases: Efficient genome browser tracks, large-scale data
Python Libraries:
- pyBigWig: pyBigWig.open('file.bw')
- pybbi: BigWig/BigBed interface
EDA Approach:
- Signal statistics extraction
- Zoom level analysis
- Regional signal extraction
- Efficient genome-wide summaries
- Compression efficiency
- Index structure analysis
.bigBed / .bb - Binary BigBed
Description: Indexed binary BED format
Typical Data: Genomic features (compressed and indexed)
Use Cases: Large feature sets, genome browsers
Python Libraries:
- pybbi: BigBed reading
- pybigtools: Modern BigBed interface
EDA Approach:
- Feature density analysis
- Efficient interval queries
- Zoom level validation
- Index performance metrics
- Feature size statistics
.gff / .gff3 - General Feature Format
Description: Tab-delimited format for genomic annotations
Typical Data: Gene models, transcripts, exons, regulatory elements
Use Cases: Genome annotation, gene prediction
Python Libraries:
- BCBio.GFF: GFF parser built on Biopython (bcbio-gff package)
- gffutils: gffutils.create_db('file.gff3')
- pyranges: GFF support
EDA Approach:
- Feature type distribution (gene, exon, CDS, etc.)
- Gene structure validation
- Strand balance
- Hierarchical relationship validation
- Phase validation for CDS
- Attribute completeness
- Gene model statistics (introns, exons per gene)
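Feature type counts and attribute completeness both come from the nine-column GFF3 layout: the type is column 3, and column 9 holds semicolon-separated key=value pairs whose values may be comma-separated and percent-encoded. The sketch below is a minimal illustration with hypothetical helper names; gffutils is better suited to hierarchy checks such as Parent relationships.

```python
from collections import Counter
from typing import Dict, Iterable, List
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> Dict[str, List[str]]:
    """Parse GFF3 column 9 into {key: [values]}, decoding %-escapes."""
    attrs = {}
    for item in column9.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


def feature_type_counts(lines: Iterable[str]) -> Counter:
    """Count records per feature type (column 3), stopping at ##FASTA."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts
```

Attribute completeness can be assessed by counting how many records of each type carry keys such as ID or Parent.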
.gtf - Gene Transfer Format
Description: GFF2-based format for gene annotations
Typical Data: Gene and transcript annotations
Use Cases: RNA-seq analysis, gene quantification
Python Libraries:
- pyranges: pyranges.read_gtf('file.gtf')
- gffutils: GTF database creation
- HTSeq: GTF reading for counts
EDA Approach:
- Transcript isoform analysis
- Gene structure completeness
- Exon number distribution
- Transcript length distribution
- TSS and TES analysis
- Biotype distribution
- Overlapping gene detection
.vcf - Variant Call Format
Description: Text format for genetic variants
Typical Data: SNPs, indels, structural variants with annotations
Use Cases: Variant calling, population genetics, GWAS
Python Libraries:
- pysam: pysam.VariantFile('file.vcf')
- cyvcf2: Fast VCF parsing
- PyVCF: Older but comprehensive
EDA Approach:
- Variant count by type (SNP, indel, SV)
- Quality score distribution
- Allele frequency spectrum
- Transition/transversion ratio
- Heterozygosity rates
- Missing genotype analysis
- Hardy-Weinberg equilibrium
- Annotation completeness (if annotated)
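The transition/transversion ratio can be computed from the REF and ALT columns alone. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A<->G, C<->T); every other single-base substitution is a transversion. The sketch below is a simplified illustration with a hypothetical function name: it counts each ALT allele of a multi-allelic site separately and ignores indels, symbolic alleles, and genotypes.

```python
from typing import Iterable, Optional, Tuple

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ts_tv_ratio(lines: Iterable[str]) -> Tuple[int, int, Optional[float]]:
    """Return (transitions, transversions, ratio) over biallelic-style SNVs."""
    ts = tv = 0
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        ref = fields[3].upper()
        if ref not in BASES:
            continue  # skips indels and multi-base REF alleles
        for alt in fields[4].upper().split(","):
            if alt not in BASES or alt == ref:
                continue  # skips indels, '*', '.', and symbolic alleles
            if (ref, alt) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
    return ts, tv, (ts / tv if tv else None)
```

For compressed or indexed files, the same counting logic can be applied to records read through pysam or cyvcf2.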
.bcf - Binary VCF
Description: Compressed binary variant format
Typical Data: Same as VCF but binary
Use Cases: Efficient variant storage and processing
Python Libraries:
- pysam: Full BCF support
- cyvcf2: Optimized BCF reading
EDA Approach:
- Same as VCF plus:
- Compression efficiency
- Indexing validation
- Read performance metrics
.gvcf - Genomic VCF
Description: VCF with reference confidence blocks
Typical Data: All positions (variant and non-variant)
Use Cases: Joint genotyping workflows, GATK
Python Libraries:
- pysam: GVCF support
- Standard VCF parsers
EDA Approach:
- Reference block analysis
- Coverage uniformity
- Variant density
- Genotype quality across genome
- Reference confidence distribution
RNA-Seq and Expression Data
.counts - Gene Count Matrix
Description: Tab-delimited gene expression counts
Typical Data: Gene IDs with read counts per sample
Use Cases: RNA-seq quantification, differential expression
Python Libraries:
- pandas: pd.read_csv('file.counts', sep='\t')
- scanpy (for single-cell): sc.read_csv()
EDA Approach:
- Library size distribution
- Detection rate (genes per sample)
- Zero-inflation analysis
- Count distribution (log scale)
- Outlier sample detection
- Correlation between replicates
- PCA for sample relationships
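Library size and detection rate are simple column-wise aggregates. The sketch below assumes a layout that the format does not guarantee: a header row, gene IDs in the first column, and one column of counts per sample. The function name is hypothetical, and with pandas the same numbers are `df.sum()` and `(df > 0).sum()`.

```python
import csv
from typing import Dict, Iterable, Tuple


def count_matrix_qc(lines: Iterable[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (library_size, detected_genes) per sample from a TSV count matrix.

    Assumes row 1 is a header and column 1 holds gene IDs.
    """
    reader = csv.reader(lines, delimiter="\t")
    samples = next(reader)[1:]
    library_size = {s: 0 for s in samples}
    detected = {s: 0 for s in samples}
    for row in reader:
        if not row:
            continue
        for sample, value in zip(samples, row[1:]):
            count = int(float(value))  # tolerate values written as "12.0"
            library_size[sample] += count
            if count > 0:
                detected[sample] += 1
    return library_size, detected
```

Samples whose library size or detected-gene count falls far from the rest are candidates for the outlier check listed above.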
.tpm / .fpkm - Normalized Expression
Description: Normalized gene expression values
Typical Data: TPM (transcripts per million) or FPKM values
Use Cases: Cross-sample comparison, visualization
Python Libraries:
- pandas: Standard CSV reading
- anndata: For integrated analysis
EDA Approach:
- Expression distribution
- Highly expressed gene identification
- Sample clustering
- Batch effect detection
- Coefficient of variation analysis
- Dynamic range assessment
.mtx - Matrix Market Format
Description: Sparse matrix format (common in single-cell)
Typical Data: Sparse count matrices (cells × genes)
Use Cases: Single-cell RNA-seq, large sparse matrices
Python Libraries:
- scipy.io: scipy.io.mmread('file.mtx')
- scanpy: sc.read_mtx('file.mtx')
EDA Approach:
- Sparsity analysis
- Cell and gene filtering thresholds
- Doublet detection metrics
- Mitochondrial fraction
- UMI count distribution
- Gene detection per cell
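Sparsity can be read from the Matrix Market size line without loading the matrix: after the '%' comment lines, the first data line of a coordinate-format file gives rows, columns, and the number of stored entries. The sketch below uses a hypothetical function name and assumes a 'general' (non-symmetric) coordinate matrix; symmetric files store only one triangle, so the stored-entry count would understate the true number of non-zeros.

```python
from typing import Dict, Iterable, Union


def mtx_sparsity(lines: Iterable[str]) -> Dict[str, Union[int, float]]:
    """Read dimensions and sparsity from a coordinate-format .mtx header."""
    for line in lines:
        if line.startswith("%") or not line.strip():
            continue  # banner and comment lines
        rows, cols, nnz = (int(x) for x in line.split()[:3])
        return {
            "rows": rows,
            "cols": cols,
            "nnz": nnz,
            "sparsity": 1 - nnz / (rows * cols),
        }
    raise ValueError("no size line found")
```

Because the function returns after the size line, it stays fast even on very large matrices.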
.h5ad - AnnData Format
Description: HDF5-based annotated data matrix
Typical Data: Expression matrix with metadata (cells, genes)
Use Cases: Single-cell RNA-seq analysis with Scanpy
Python Libraries:
- scanpy: sc.read_h5ad('file.h5ad')
- anndata: Direct AnnData manipulation
EDA Approach:
- Cell and gene counts
- Metadata completeness
- Layer availability (raw, normalized)
- Embedding presence (PCA, UMAP)
- QC metrics distribution
- Batch information
- Cell type annotation coverage
.loom - Loom Format
Description: HDF5-based format for omics data
Typical Data: Expression matrices with metadata
Use Cases: Single-cell data, RNA velocity analysis
Python Libraries:
- loompy: loompy.connect('file.loom')
- scanpy: Can import loom files
EDA Approach:
- Layer analysis (spliced, unspliced)
- Row and column attribute exploration
- Graph connectivity analysis
- Cluster assignments
- Velocity-specific metrics
.rds - R Data Serialization
Description: R object storage (often Seurat objects)
Typical Data: R analysis results, especially single-cell
Use Cases: R-Python data exchange
Python Libraries:
- pyreadr: pyreadr.read_r('file.rds')
- rpy2: For full R integration
- Conversion tools to AnnData
EDA Approach:
- Object type identification
- Data structure exploration
- Metadata extraction
- Conversion validation
Alignment and Assembly Formats
.maf - Multiple Alignment Format
Description: Text format for multiple sequence alignments
Typical Data: Genome-wide or local multiple alignments
Use Cases: Comparative genomics, conservation analysis
Python Libraries:
- Biopython: AlignIO.parse('file.maf', 'maf')
- bx-python: MAF-specific tools
EDA Approach:
- Alignment block statistics
- Species coverage
- Gap analysis
- Conservation scoring
- Alignment quality metrics
- Block length distribution
.axt - Pairwise Alignment Format
Description: Pairwise alignment format (UCSC)
Typical Data: Pairwise genomic alignments
Use Cases: Genome comparison, synteny analysis
Python Libraries:
- Custom parsers (simple format)
- bx-python: AXT support
EDA Approach:
- Alignment score distribution
- Identity percentage
- Syntenic block identification
- Gap size analysis
- Coverage statistics
.chain - Chain Alignment Format
Description: Genome coordinate mapping chains
Typical Data: Coordinate transformations between genome builds
Use Cases: Liftover, coordinate conversion
Python Libraries:
- pyliftover: Chain file usage
- Custom parsers for chain format
EDA Approach:
- Chain score distribution
- Coverage of source genome
- Gap analysis
- Inversion detection
- Mapping quality assessment
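Most of these checks only need the chain header lines, which have the form `chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`. The sketch below is a minimal custom parser with a hypothetical function name; it ignores the alignment block lines that follow each header.

```python
from typing import Dict, Iterable, Iterator, Union

CHAIN_KEYS = (
    "score", "tName", "tSize", "tStrand", "tStart", "tEnd",
    "qName", "qSize", "qStrand", "qStart", "qEnd", "id",
)
INT_KEYS = ("score", "tSize", "tStart", "tEnd", "qSize", "qStart", "qEnd")


def chain_headers(lines: Iterable[str]) -> Iterator[Dict[str, Union[int, str]]]:
    """Yield one dict per 'chain' header line; block lines are skipped."""
    for line in lines:
        if not line.startswith("chain"):
            continue
        record = dict(zip(CHAIN_KEYS, line.split()[1:]))
        for key in INT_KEYS:
            record[key] = int(record[key])
        yield record
```

Chains with `qStrand == '-'` map to the reverse strand of the query and are the starting point for inversion detection; summing `tEnd - tStart` per chromosome gives a rough upper bound on source-genome coverage, since the span includes gaps within each chain.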
.psl - Pattern Space Layout
Description: BLAT/BLAST alignment format
Typical Data: Alignment results from BLAT
Use Cases: Transcript mapping, similarity searches
Python Libraries:
- Custom parsers (tab-delimited)
- pybedtools: Can handle PSL
EDA Approach:
- Match percentage distribution
- Gap statistics
- Query coverage
- Multiple mapping analysis
- Alignment quality metrics
Genome Assembly and Annotation
.agp - Assembly Golden Path
Description: Assembly structure description
Typical Data: Scaffold composition, gap information
Use Cases: Genome assembly representation
Python Libraries:
- Custom parsers (simple tab-delimited)
- Assembly analysis tools
EDA Approach:
- Scaffold statistics (N50, L50)
- Gap type and size distribution
- Component length analysis
- Assembly contiguity metrics
- Unplaced contig analysis
.scaffolds / .contigs - Assembly Sequences
Description: Assembled sequences (usually FASTA)
Typical Data: Assembled genomic sequences
Use Cases: Genome assembly output
Python Libraries:
- Same as FASTA format
- Assembly-specific tools (QUAST)
EDA Approach:
- Assembly statistics (N50, N90, etc.)
- Length distribution
- Coverage analysis
- Gap (N) content
- Duplication assessment
- BUSCO completeness (if annotations available)
.2bit - Compressed Genome Format
Description: UCSC compact genome format
Typical Data: Reference genomes (highly compressed)
Use Cases: Efficient genome storage and access
Python Libraries:
- py2bit: py2bit.open('file.2bit')
- twobitreader: Alternative reader
EDA Approach:
- Compression efficiency
- Random access performance
- Sequence extraction validation
- Masked region analysis
- N content and distribution
.sizes - Chromosome Sizes
Description: Simple format with chromosome lengths
Typical Data: Tab-delimited chromosome names and sizes
Use Cases: Genome browsers, coordinate validation
Python Libraries:
- Simple file reading with pandas
- Built into many genomic tools
EDA Approach:
- Genome size calculation
- Chromosome count
- Size distribution
- Karyotype validation
- Completeness check against reference
Phylogenetics and Evolution
.nwk / .newick - Newick Tree Format
Description: Parenthetical tree representation
Typical Data: Phylogenetic trees with branch lengths
Use Cases: Evolutionary analysis, tree visualization
Python Libraries:
- Biopython: Phylo.read('file.nwk', 'newick')
- ete3: ete3.Tree('file.nwk')
- dendropy: Phylogenetic computing
EDA Approach:
- Tree structure analysis (tips, internal nodes)
- Branch length distribution
- Tree balance metrics
- Ultrametricity check
- Bootstrap support analysis
- Topology validation
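For a quick structural summary, tip and internal-node counts can be read off the Newick punctuation without building a tree: every internal node opens one parenthesis, and a tree with n tips contains n - 1 commas. The sketch below relies on that counting argument and is only valid under stated assumptions: a single tree per string, no quoted labels, and no bracketed comments. The function name is hypothetical; Biopython, ete3, or dendropy should be used for anything topological.

```python
import re
from typing import Dict, List, Union

# Matches a numeric branch length following a colon, e.g. ":0.05" or ":1e-3".
BRANCH_RE = re.compile(r":(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def newick_summary(newick: str) -> Dict[str, Union[int, List[float]]]:
    """Count tips and internal nodes and collect branch lengths.

    Assumes one tree with unquoted labels and no [comments].
    """
    text = newick.strip()
    return {
        "tips": text.count(",") + 1,
        "internal_nodes": text.count("("),
        "branch_lengths": [float(x) for x in BRANCH_RE.findall(text)],
    }
```

The collected branch lengths are enough for a length distribution; an ultrametricity check needs root-to-tip distances and therefore a real tree parser.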
.nexus - Nexus Format
Description: Rich format for phylogenetic data
Typical Data: Alignments, trees, character matrices
Use Cases: Phylogenetic software interchange
Python Libraries:
- Biopython: Nexus support
- dendropy: Comprehensive Nexus handling
EDA Approach:
- Data block analysis
- Character type distribution
- Tree block validation
- Taxa consistency
- Command block parsing
- Format compliance checking
.phylip - PHYLIP Format
Description: Sequence alignment format (strict/relaxed)
Typical Data: Multiple sequence alignments
Use Cases: Phylogenetic analysis input
Python Libraries:
- Biopython: AlignIO.read('file.phy', 'phylip')
- dendropy: PHYLIP support
EDA Approach:
- Alignment dimensions
- Sequence length uniformity
- Gap position analysis
- Informative site calculation
- Format variant detection (strict vs relaxed)
.paml - PAML Output
Description: Output from PAML phylogenetic software
Typical Data: Evolutionary model results, dN/dS ratios
Use Cases: Molecular evolution analysis
Python Libraries:
- Custom parsers for specific PAML programs
- Biopython: Basic PAML parsing
EDA Approach:
- Model parameter extraction
- Likelihood values
- dN/dS ratio distribution
- Branch-specific results
- Convergence assessment
Protein and Structure Data
.embl - EMBL Format
Description: Rich sequence annotation format
Typical Data: Sequences with extensive annotations
Use Cases: Sequence databases, genome records
Python Libraries:
- Biopython: SeqIO.read('file.embl', 'embl')
EDA Approach:
- Feature annotation completeness
- Sequence length and type
- Reference information
- Cross-reference validation
- Feature overlap analysis
.genbank / .gb / .gbk - GenBank Format
Description: NCBI's sequence annotation format
Typical Data: Annotated sequences with features
Use Cases: Sequence databases, annotation transfer
Python Libraries:
- Biopython: SeqIO.parse('file.gb', 'genbank')
EDA Approach:
- Feature type distribution
- CDS analysis (start codons, stops)
- Translation validation
- Annotation completeness
- Source organism extraction
- Reference and publication info
- Locus tag consistency
.sff - Standard Flowgram Format
Description: 454/Roche sequencing data format
Typical Data: Raw pyrosequencing flowgrams
Use Cases: Legacy 454 sequencing data
Python Libraries:
- Biopython: SeqIO.parse('file.sff', 'sff')
- Platform-specific tools
EDA Approach:
- Read count and length
- Flowgram signal quality
- Key sequence detection
- Adapter trimming validation
- Quality score distribution
.hdf5 (Genomics Specific)
Description: HDF5 for genomics (10X, Hi-C, etc.)
Typical Data: High-throughput genomics data
Use Cases: 10X Genomics, spatial transcriptomics
Python Libraries:
- h5py: Low-level access
- scanpy: For 10X data
- cooler: For Hi-C data
EDA Approach:
- Dataset structure exploration
- Barcode statistics
- UMI counting
- Feature-barcode matrix analysis
- Spatial coordinates (if applicable)
.cool / .mcool - Cooler Format
Description: HDF5-based Hi-C contact matrices
Typical Data: Chromatin interaction matrices
Use Cases: 3D genome analysis, Hi-C data
Python Libraries:
- cooler: cooler.Cooler('file.cool')
- hicstraw: For .hic format
EDA Approach:
- Resolution analysis
- Contact matrix statistics
- Distance decay curves
- Compartment analysis
- TAD boundary detection
- Balance factor validation
.hic - Hi-C Binary Format
Description: Juicer binary Hi-C format
Typical Data: Multi-resolution Hi-C matrices
Use Cases: Hi-C analysis with Juicer tools
Python Libraries:
- hicstraw: hicstraw.HiCFile('file.hic')
- straw: C++ library with Python bindings
EDA Approach:
- Available resolutions
- Normalization methods
- Contact statistics
- Chromosomal interactions
- Quality metrics
.bw (ChIP-seq / ATAC-seq specific)
Description: BigWig files for epigenomics
Typical Data: Coverage or enrichment signals
Use Cases: ChIP-seq, ATAC-seq, DNase-seq
Python Libraries:
- pyBigWig: Standard bigWig access
EDA Approach:
- Peak enrichment patterns
- Background signal analysis
- Sample correlation
- Signal-to-noise ratio
- Library complexity metrics
.narrowPeak / .broadPeak - ENCODE Peak Formats
Description: BED-based formats for peaks
Typical Data: Peak calls with scores and p-values
Use Cases: ChIP-seq peak calling output
Python Libraries:
- pybedtools: BED-compatible
- Custom parsers for peak-specific fields
EDA Approach:
- Peak count and width distribution
- Signal value distribution
- Q-value and p-value analysis
- Peak summit analysis
- Overlap with known features
- Motif enrichment preparation
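The peak-specific fields sit in columns 7 to 10 of a narrowPeak row: signalValue, pValue, qValue, and the summit offset relative to the start coordinate. In the ENCODE layout the p- and q-values are stored as -log10 values, and -1 means "not assigned" for those fields and for the summit. The sketch below is a minimal custom parser with hypothetical names; broadPeak rows lack the tenth column and would need a variant of it.

```python
from typing import Dict, Iterable, Iterator, Optional, Union

PeakRecord = Dict[str, Union[str, int, float, None]]


def parse_narrowpeak(lines: Iterable[str]) -> Iterator[PeakRecord]:
    """Yield per-peak width, signal, -log10 p/q values, and summit position."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        start, end, offset = int(f[1]), int(f[2]), int(f[9])
        summit: Optional[int] = start + offset if offset >= 0 else None
        yield {
            "chrom": f[0],
            "width": end - start,
            "signal": float(f[6]),
            "neg_log10_p": float(f[7]),
            "neg_log10_q": float(f[8]),
            "summit": summit,  # absolute 0-based position, or None
        }
```

Fixed-width windows centred on the summit positions are a common input for motif enrichment tools.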
.wig - Wiggle Format
Description: Dense continuous genomic data
Typical Data: Coverage or signal tracks
Use Cases: Genome browser visualization
Python Libraries:
- pyBigWig: Can convert to bigWig
- Custom parsers for wiggle format
EDA Approach:
- Signal statistics
- Coverage metrics
- Format variant (fixedStep vs variableStep)
- Span parameter analysis
- Conversion efficiency to bigWig
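The format variant and span can be determined from the declaration lines alone, which begin with `fixedStep` or `variableStep` followed by key=value parameters; span defaults to 1 when omitted. The sketch below is a minimal custom parser with a hypothetical function name that ignores the data lines entirely.

```python
from typing import Dict, Iterable, Iterator, Tuple


def wig_declarations(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (variant, parameters) for each fixedStep/variableStep declaration."""
    for line in lines:
        parts = line.split()
        if parts and parts[0] in ("fixedStep", "variableStep"):
            params = dict(p.split("=", 1) for p in parts[1:])
            params.setdefault("span", "1")  # span is optional and defaults to 1
            yield parts[0], params
```

Counting declarations per variant and tabulating the span values shows whether a file mixes both layouts, which is worth knowing before converting to bigWig.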
.ab1 - Sanger Sequencing Trace
Description: Binary chromatogram format
Typical Data: Sanger sequencing traces
Use Cases: Capillary sequencing validation
Python Libraries:
- Biopython: SeqIO.read('file.ab1', 'abi')
- tracytools: For quality assessment
EDA Approach:
- Base calling quality
- Trace quality scores
- Mixed base detection
- Primer and vector detection
- Read length and quality region
- Heterozygosity detection
.scf - Standard Chromatogram Format
Description: Sanger sequencing chromatogram
Typical Data: Base calls and confidence values
Use Cases: Sequencing trace analysis
Python Libraries:
- Biopython: SCF format support
EDA Approach:
- Similar to AB1 format
- Quality score profiles
- Peak height ratios
- Signal-to-noise metrics
.idx - Index Files (Generic)
Description: Index files for various formats
Typical Data: Fast random access indices
Use Cases: Efficient data access (BAM, VCF, etc.)
Python Libraries:
- Format-specific libraries handle indices
- pysam: Auto-handles BAI, CSI indices
EDA Approach:
- Index completeness validation
- Binning strategy analysis
- Access performance metrics
- Index size vs data size ratio