# Bioinformatics and Genomics File Formats Reference This reference covers file formats used in genomics, transcriptomics, sequence analysis, and related bioinformatics applications. ## Sequence Data Formats ### .fasta / .fa / .fna - FASTA Format **Description:** Text-based format for nucleotide or protein sequences **Typical Data:** DNA, RNA, or protein sequences with headers **Use Cases:** Sequence storage, BLAST searches, alignments **Python Libraries:** - `Biopython`: `SeqIO.parse('file.fasta', 'fasta')` - `pyfaidx`: Fast indexed FASTA access - `screed`: Fast sequence parsing **EDA Approach:** - Sequence count and length distribution - GC content analysis - N content (ambiguous bases) - Sequence ID parsing - Duplicate detection - Quality metrics for assemblies (N50, L50) ### .fastq / .fq - FASTQ Format **Description:** Sequence data with base quality scores **Typical Data:** Raw sequencing reads with Phred quality scores **Use Cases:** NGS data, quality control, read mapping **Python Libraries:** - `Biopython`: `SeqIO.parse('file.fastq', 'fastq')` - `pysam`: Fast FASTQ/BAM operations - `HTSeq`: Sequencing data analysis **EDA Approach:** - Read count and length distribution - Quality score distribution (per-base, per-read) - GC content and bias - Duplicate rate estimation - Adapter contamination detection - k-mer frequency analysis - Encoding format validation (Phred33/64) ### .sam - Sequence Alignment/Map **Description:** Tab-delimited text format for alignments **Typical Data:** Aligned sequencing reads with mapping quality **Use Cases:** Read alignment storage, variant calling **Python Libraries:** - `pysam`: `pysam.AlignmentFile('file.sam', 'r')` - `HTSeq`: `HTSeq.SAM_Reader('file.sam')` **EDA Approach:** - Mapping rate and quality distribution - Coverage analysis - Insert size distribution (paired-end) - Alignment flags distribution - CIGAR string patterns - Mismatch and indel rates - Duplicate and supplementary alignment counts ### .bam - Binary Alignment/Map **Description:** Compressed binary version of SAM **Typical Data:** Aligned reads in compressed format **Use Cases:** Efficient storage and processing of alignments **Python Libraries:** - `pysam`: Full BAM support with indexing - `bamnostic`: Pure Python BAM reader **EDA Approach:** - Same as SAM plus: - Compression ratio analysis - Index file (.bai) validation - Chromosome-wise statistics - Strand bias detection - Read group analysis ### .cram - CRAM Format **Description:** Highly compressed alignment format **Typical Data:** Reference-compressed aligned reads **Use Cases:** Long-term storage, space-efficient archives **Python Libraries:** - `pysam`: CRAM support (requires reference) - Reference genome must be accessible **EDA Approach:** - Compression efficiency vs BAM - Reference dependency validation - Lossy vs lossless compression assessment - Decompression performance - Similar alignment metrics as BAM ### .bed - Browser Extensible Data **Description:** Tab-delimited format for genomic features **Typical Data:** Genomic intervals (chr, start, end) with annotations **Use Cases:** Peak calling, variant annotation, genome browsing **Python Libraries:** - `pybedtools`: `pybedtools.BedTool('file.bed')` - `pyranges`: `pyranges.read_bed('file.bed')` - `pandas`: Simple BED reading **EDA Approach:** - Feature count and size distribution - Chromosome distribution - Strand bias - Score distribution (if present) - Overlap and proximity analysis - Coverage statistics - Gap analysis between features ### .bedGraph - BED with Graph Data **Description:** BED format with per-base signal values **Typical Data:** Continuous-valued genomic data (coverage, signals) **Use Cases:** Coverage tracks, ChIP-seq signals, methylation **Python Libraries:** - `pyBigWig`: Can convert to bigWig - `pybedtools`: BedGraph operations **EDA Approach:** - Signal distribution statistics - Genome coverage percentage - Signal dynamics (peaks, valleys) - Chromosome-wise signal patterns - Quantile analysis - Zero-coverage regions ### .bigWig / .bw - Binary BigWig **Description:** Indexed binary format for genome-wide signal data **Typical Data:** Continuous genomic signals (compressed and indexed) **Use Cases:** Efficient genome browser tracks, large-scale data **Python Libraries:** - `pyBigWig`: `pyBigWig.open('file.bw')` - `pybbi`: BigWig/BigBed interface **EDA Approach:** - Signal statistics extraction - Zoom level analysis - Regional signal extraction - Efficient genome-wide summaries - Compression efficiency - Index structure analysis ### .bigBed / .bb - Binary BigBed **Description:** Indexed binary BED format **Typical Data:** Genomic features (compressed and indexed) **Use Cases:** Large feature sets, genome browsers **Python Libraries:** - `pybbi`: BigBed reading - `pybigtools`: Modern BigBed interface **EDA Approach:** - Feature density analysis - Efficient interval queries - Zoom level validation - Index performance metrics - Feature size statistics ### .gff / .gff3 - General Feature Format **Description:** Tab-delimited format for genomic annotations **Typical Data:** Gene models, transcripts, exons, regulatory elements **Use Cases:** Genome annotation, gene prediction **Python Libraries:** - `BCBio.GFF`: Biopython GFF module - `gffutils`: `gffutils.create_db('file.gff3')` - `pyranges`: GFF support **EDA Approach:** - Feature type distribution (gene, exon, CDS, etc.) - Gene structure validation - Strand balance - Hierarchical relationship validation - Phase validation for CDS - Attribute completeness - Gene model statistics (introns, exons per gene) ### .gtf - Gene Transfer Format **Description:** GFF2-based format for gene annotations **Typical Data:** Gene and transcript annotations **Use Cases:** RNA-seq analysis, gene quantification **Python Libraries:** - `pyranges`: `pyranges.read_gtf('file.gtf')` - `gffutils`: GTF database creation - `HTSeq`: GTF reading for counts **EDA Approach:** - Transcript isoform analysis - Gene structure completeness - Exon number distribution - Transcript length distribution - TSS and TES analysis - Biotype distribution - Overlapping gene detection ### .vcf - Variant Call Format **Description:** Text format for genetic variants **Typical Data:** SNPs, indels, structural variants with annotations **Use Cases:** Variant calling, population genetics, GWAS **Python Libraries:** - `pysam`: `pysam.VariantFile('file.vcf')` - `cyvcf2`: Fast VCF parsing - `PyVCF`: Older but comprehensive **EDA Approach:** - Variant count by type (SNP, indel, SV) - Quality score distribution - Allele frequency spectrum - Transition/transversion ratio - Heterozygosity rates - Missing genotype analysis - Hardy-Weinberg equilibrium - Annotation completeness (if annotated) ### .bcf - Binary VCF **Description:** Compressed binary variant format **Typical Data:** Same as VCF but binary **Use Cases:** Efficient variant storage and processing **Python Libraries:** - `pysam`: Full BCF support - `cyvcf2`: Optimized BCF reading **EDA Approach:** - Same as VCF plus: - Compression efficiency - Indexing validation - Read performance metrics ### .gvcf - Genomic VCF **Description:** VCF with reference confidence blocks **Typical Data:** All positions (variant and non-variant) **Use Cases:** Joint genotyping workflows, GATK **Python Libraries:** - `pysam`: GVCF support - Standard VCF parsers **EDA Approach:** - Reference block analysis - Coverage uniformity - Variant density - Genotype quality across genome - Reference confidence distribution ## RNA-Seq and Expression Data ### .counts - Gene Count Matrix **Description:** Tab-delimited gene expression counts **Typical Data:** Gene IDs with read counts per sample **Use Cases:** RNA-seq quantification, differential expression **Python Libraries:** - `pandas`: `pd.read_csv('file.counts', sep='\t')` - `scanpy` (for single-cell): `sc.read_csv()` **EDA Approach:** - Library size distribution - Detection rate (genes per sample) - Zero-inflation analysis - Count distribution (log scale) - Outlier sample detection - Correlation between replicates - PCA for sample relationships ### .tpm / .fpkm - Normalized Expression **Description:** Normalized gene expression values **Typical Data:** TPM (transcripts per million) or FPKM values **Use Cases:** Cross-sample comparison, visualization **Python Libraries:** - `pandas`: Standard CSV reading - `anndata`: For integrated analysis **EDA Approach:** - Expression distribution - Highly expressed gene identification - Sample clustering - Batch effect detection - Coefficient of variation analysis - Dynamic range assessment ### .mtx - Matrix Market Format **Description:** Sparse matrix format (common in single-cell) **Typical Data:** Sparse count matrices (cells × genes) **Use Cases:** Single-cell RNA-seq, large sparse matrices **Python Libraries:** - `scipy.io`: `scipy.io.mmread('file.mtx')` - `scanpy`: `sc.read_mtx('file.mtx')` **EDA Approach:** - Sparsity analysis - Cell and gene filtering thresholds - Doublet detection metrics - Mitochondrial fraction - UMI count distribution - Gene detection per cell ### .h5ad - Anndata Format **Description:** HDF5-based annotated data matrix **Typical Data:** Expression matrix with metadata (cells, genes) **Use Cases:** Single-cell RNA-seq analysis with Scanpy **Python Libraries:** - `scanpy`: `sc.read_h5ad('file.h5ad')` - `anndata`: Direct AnnData manipulation **EDA Approach:** - Cell and gene counts - Metadata completeness - Layer availability (raw, normalized) - Embedding presence (PCA, UMAP) - QC metrics distribution - Batch information - Cell type annotation coverage ### .loom - Loom Format **Description:** HDF5-based format for omics data **Typical Data:** Expression matrices with metadata **Use Cases:** Single-cell data, RNA velocity analysis **Python Libraries:** - `loompy`: `loompy.connect('file.loom')` - `scanpy`: Can import loom files **EDA Approach:** - Layer analysis (spliced, unspliced) - Row and column attribute exploration - Graph connectivity analysis - Cluster assignments - Velocity-specific metrics ### .rds - R Data Serialization **Description:** R object storage (often Seurat objects) **Typical Data:** R analysis results, especially single-cell **Use Cases:** R-Python data exchange **Python Libraries:** - `pyreadr`: `pyreadr.read_r('file.rds')` - `rpy2`: For full R integration - Conversion tools to AnnData **EDA Approach:** - Object type identification - Data structure exploration - Metadata extraction - Conversion validation ## Alignment and Assembly Formats ### .maf - Multiple Alignment Format **Description:** Text format for multiple sequence alignments **Typical Data:** Genome-wide or local multiple alignments **Use Cases:** Comparative genomics, conservation analysis **Python Libraries:** - `Biopython`: `AlignIO.parse('file.maf', 'maf')` - `bx-python`: MAF-specific tools **EDA Approach:** - Alignment block statistics - Species coverage - Gap analysis - Conservation scoring - Alignment quality metrics - Block length distribution ### .axt - Pairwise Alignment Format **Description:** Pairwise alignment format (UCSC) **Typical Data:** Pairwise genomic alignments **Use Cases:** Genome comparison, synteny analysis **Python Libraries:** - Custom parsers (simple format) - `bx-python`: AXT support **EDA Approach:** - Alignment score distribution - Identity percentage - Syntenic block identification - Gap size analysis - Coverage statistics ### .chain - Chain Alignment Format **Description:** Genome coordinate mapping chains **Typical Data:** Coordinate transformations between genome builds **Use Cases:** Liftover, coordinate conversion **Python Libraries:** - `pyliftover`: Chain file usage - Custom parsers for chain format **EDA Approach:** - Chain score distribution - Coverage of source genome - Gap analysis - Inversion detection - Mapping quality assessment ### .psl - Pattern Space Layout **Description:** BLAT/BLAST alignment format **Typical Data:** Alignment results from BLAT **Use Cases:** Transcript mapping, similarity searches **Python Libraries:** - Custom parsers (tab-delimited) - `pybedtools`: Can handle PSL **EDA Approach:** - Match percentage distribution - Gap statistics - Query coverage - Multiple mapping analysis - Alignment quality metrics ## Genome Assembly and Annotation ### .agp - Assembly Golden Path **Description:** Assembly structure description **Typical Data:** Scaffold composition, gap information **Use Cases:** Genome assembly representation **Python Libraries:** - Custom parsers (simple tab-delimited) - Assembly analysis tools **EDA Approach:** - Scaffold statistics (N50, L50) - Gap type and size distribution - Component length analysis - Assembly contiguity metrics - Unplaced contig analysis ### .scaffolds / .contigs - Assembly Sequences **Description:** Assembled sequences (usually FASTA) **Typical Data:** Assembled genomic sequences **Use Cases:** Genome assembly output **Python Libraries:** - Same as FASTA format - Assembly-specific tools (QUAST) **EDA Approach:** - Assembly statistics (N50, N90, etc.) - Length distribution - Coverage analysis - Gap (N) content - Duplication assessment - BUSCO completeness (if annotations available) ### .2bit - Compressed Genome Format **Description:** UCSC compact genome format **Typical Data:** Reference genomes (highly compressed) **Use Cases:** Efficient genome storage and access **Python Libraries:** - `py2bit`: `py2bit.open('file.2bit')` - `twobitreader`: Alternative reader **EDA Approach:** - Compression efficiency - Random access performance - Sequence extraction validation - Masked region analysis - N content and distribution ### .sizes - Chromosome Sizes **Description:** Simple format with chromosome lengths **Typical Data:** Tab-delimited chromosome names and sizes **Use Cases:** Genome browsers, coordinate validation **Python Libraries:** - Simple file reading with pandas - Built into many genomic tools **EDA Approach:** - Genome size calculation - Chromosome count - Size distribution - Karyotype validation - Completeness check against reference ## Phylogenetics and Evolution ### .nwk / .newick - Newick Tree Format **Description:** Parenthetical tree representation **Typical Data:** Phylogenetic trees with branch lengths **Use Cases:** Evolutionary analysis, tree visualization **Python Libraries:** - `Biopython`: `Phylo.read('file.nwk', 'newick')` - `ete3`: `ete3.Tree('file.nwk')` - `dendropy`: Phylogenetic computing **EDA Approach:** - Tree structure analysis (tips, internal nodes) - Branch length distribution - Tree balance metrics - Ultrametricity check - Bootstrap support analysis - Topology validation ### .nexus - Nexus Format **Description:** Rich format for phylogenetic data **Typical Data:** Alignments, trees, character matrices **Use Cases:** Phylogenetic software interchange **Python Libraries:** - `Biopython`: Nexus support - `dendropy`: Comprehensive Nexus handling **EDA Approach:** - Data block analysis - Character type distribution - Tree block validation - Taxa consistency - Command block parsing - Format compliance checking ### .phylip - PHYLIP Format **Description:** Sequence alignment format (strict/relaxed) **Typical Data:** Multiple sequence alignments **Use Cases:** Phylogenetic analysis input **Python Libraries:** - `Biopython`: `AlignIO.read('file.phy', 'phylip')` - `dendropy`: PHYLIP support **EDA Approach:** - Alignment dimensions - Sequence length uniformity - Gap position analysis - Informative site calculation - Format variant detection (strict vs relaxed) ### .paml - PAML Output **Description:** Output from PAML phylogenetic software **Typical Data:** Evolutionary model results, dN/dS ratios **Use Cases:** Molecular evolution analysis **Python Libraries:** - Custom parsers for specific PAML programs - `Biopython`: Basic PAML parsing **EDA Approach:** - Model parameter extraction - Likelihood values - dN/dS ratio distribution - Branch-specific results - Convergence assessment ## Protein and Structure Data ### .embl - EMBL Format **Description:** Rich sequence annotation format **Typical Data:** Sequences with extensive annotations **Use Cases:** Sequence databases, genome records **Python Libraries:** - `Biopython`: `SeqIO.read('file.embl', 'embl')` **EDA Approach:** - Feature annotation completeness - Sequence length and type - Reference information - Cross-reference validation - Feature overlap analysis ### .genbank / .gb / .gbk - GenBank Format **Description:** NCBI's sequence annotation format **Typical Data:** Annotated sequences with features **Use Cases:** Sequence databases, annotation transfer **Python Libraries:** - `Biopython`: `SeqIO.parse('file.gb', 'genbank')` **EDA Approach:** - Feature type distribution - CDS analysis (start codons, stops) - Translation validation - Annotation completeness - Source organism extraction - Reference and publication info - Locus tag consistency ### .sff - Standard Flowgram Format **Description:** 454/Roche sequencing data format **Typical Data:** Raw pyrosequencing flowgrams **Use Cases:** Legacy 454 sequencing data **Python Libraries:** - `Biopython`: `SeqIO.parse('file.sff', 'sff')` - Platform-specific tools **EDA Approach:** - Read count and length - Flowgram signal quality - Key sequence detection - Adapter trimming validation - Quality score distribution ### .hdf5 (Genomics Specific) **Description:** HDF5 for genomics (10X, Hi-C, etc.) **Typical Data:** High-throughput genomics data **Use Cases:** 10X Genomics, spatial transcriptomics **Python Libraries:** - `h5py`: Low-level access - `scanpy`: For 10X data - `cooler`: For Hi-C data **EDA Approach:** - Dataset structure exploration - Barcode statistics - UMI counting - Feature-barcode matrix analysis - Spatial coordinates (if applicable) ### .cool / .mcool - Cooler Format **Description:** HDF5-based Hi-C contact matrices **Typical Data:** Chromatin interaction matrices **Use Cases:** 3D genome analysis, Hi-C data **Python Libraries:** - `cooler`: `cooler.Cooler('file.cool')` - `hicstraw`: For .hic format **EDA Approach:** - Resolution analysis - Contact matrix statistics - Distance decay curves - Compartment analysis - TAD boundary detection - Balance factor validation ### .hic - Hi-C Binary Format **Description:** Juicer binary Hi-C format **Typical Data:** Multi-resolution Hi-C matrices **Use Cases:** Hi-C analysis with Juicer tools **Python Libraries:** - `hicstraw`: `hicstraw.HiCFile('file.hic')` - `straw`: C++ library with Python bindings **EDA Approach:** - Available resolutions - Normalization methods - Contact statistics - Chromosomal interactions - Quality metrics ### .bw (ChIP-seq / ATAC-seq specific) **Description:** BigWig files for epigenomics **Typical Data:** Coverage or enrichment signals **Use Cases:** ChIP-seq, ATAC-seq, DNase-seq **Python Libraries:** - `pyBigWig`: Standard bigWig access **EDA Approach:** - Peak enrichment patterns - Background signal analysis - Sample correlation - Signal-to-noise ratio - Library complexity metrics ### .narrowPeak / .broadPeak - ENCODE Peak Formats **Description:** BED-based formats for peaks **Typical Data:** Peak calls with scores and p-values **Use Cases:** ChIP-seq peak calling output **Python Libraries:** - `pybedtools`: BED-compatible - Custom parsers for peak-specific fields **EDA Approach:** - Peak count and width distribution - Signal value distribution - Q-value and p-value analysis - Peak summit analysis - Overlap with known features - Motif enrichment preparation ### .wig - Wiggle Format **Description:** Dense continuous genomic data **Typical Data:** Coverage or signal tracks **Use Cases:** Genome browser visualization **Python Libraries:** - `pyBigWig`: Can convert to bigWig - Custom parsers for wiggle format **EDA Approach:** - Signal statistics - Coverage metrics - Format variant (fixedStep vs variableStep) - Span parameter analysis - Conversion efficiency to bigWig ### .ab1 - Sanger Sequencing Trace **Description:** Binary chromatogram format **Typical Data:** Sanger sequencing traces **Use Cases:** Capillary sequencing validation **Python Libraries:** - `Biopython`: `SeqIO.read('file.ab1', 'abi')` - `tracy` tools: For quality assessment **EDA Approach:** - Base calling quality - Trace quality scores - Mixed base detection - Primer and vector detection - Read length and quality region - Heterozygosity detection ### .scf - Standard Chromatogram Format **Description:** Sanger sequencing chromatogram **Typical Data:** Base calls and confidence values **Use Cases:** Sequencing trace analysis **Python Libraries:** - `Biopython`: SCF format support **EDA Approach:** - Similar to AB1 format - Quality score profiles - Peak height ratios - Signal-to-noise metrics ### .idx - Index Files (Generic) **Description:** Index files for various formats **Typical Data:** Fast random access indices **Use Cases:** Efficient data access (BAM, VCF, etc.) **Python Libraries:** - Format-specific libraries handle indices - `pysam`: Auto-handles BAI, CSI indices **EDA Approach:** - Index completeness validation - Binning strategy analysis - Access performance metrics - Index size vs data size ratio