665 lines
22 KiB
Markdown
665 lines
22 KiB
Markdown
# Chemistry and Molecular File Formats Reference
|
|
|
|
This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.
|
|
|
|
## Structure File Formats
|
|
|
|
### .pdb - Protein Data Bank
|
|
**Description:** Standard format for 3D structures of biological macromolecules
|
|
**Typical Data:** Atomic coordinates, residue information, secondary structure, crystal structure data
|
|
**Use Cases:** Protein structure analysis, molecular visualization, docking studies
|
|
**Python Libraries:**
|
|
- `Biopython`: `Bio.PDB`
|
|
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
|
|
- `PyMOL`: `pymol.cmd.load('file.pdb')`
|
|
- `ProDy`: `prody.parsePDB('file.pdb')`
|
|
**EDA Approach:**
|
|
- Structure validation (bond lengths, angles, clashes)
|
|
- Secondary structure analysis
|
|
- B-factor distribution
|
|
- Missing residues/atoms detection
|
|
- Ramachandran plots for validation
|
|
- Surface area and volume calculations
|
|
|
|
### .cif - Crystallographic Information File
|
|
**Description:** Structured data format for crystallographic information
|
|
**Typical Data:** Unit cell parameters, atomic coordinates, symmetry operations, experimental data
|
|
**Use Cases:** Crystal structure determination, structural biology, materials science
|
|
**Python Libraries:**
|
|
- `gemmi`: `gemmi.cif.read_file('file.cif')`
|
|
- `PyCifRW`: `CifFile.ReadCif('file.cif')`
|
|
- `Biopython`: `Bio.PDB.MMCIFParser()`
|
|
**EDA Approach:**
|
|
- Data completeness check
|
|
- Resolution and quality metrics
|
|
- Unit cell parameter analysis
|
|
- Symmetry group validation
|
|
- Atomic displacement parameters
|
|
- R-factors and validation metrics
|
|
|
|
### .mol - MDL Molfile
|
|
**Description:** Chemical structure file format by MDL/Accelrys
|
|
**Typical Data:** 2D/3D coordinates, atom types, bond orders, charges
|
|
**Use Cases:** Chemical database storage, cheminformatics, drug design
|
|
**Python Libraries:**
|
|
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
|
|
- `Open Babel`: `pybel.readfile('mol', 'file.mol')`
|
|
- `ChemoPy`: For descriptor calculation
|
|
**EDA Approach:**
|
|
- Molecular property calculation (MW, logP, TPSA)
|
|
- Functional group analysis
|
|
- Ring system detection
|
|
- Stereochemistry validation
|
|
- 2D/3D coordinate consistency
|
|
- Valence and charge validation
|
|
|
|
### .mol2 - Tripos Mol2
|
|
**Description:** Complete 3D molecular structure format with atom typing
|
|
**Typical Data:** Coordinates, SYBYL atom types, bond types, charges, substructures
|
|
**Use Cases:** Molecular docking, QSAR studies, drug discovery
|
|
**Python Libraries:**
|
|
- `RDKit`: `Chem.MolFromMol2File('file.mol2')`
|
|
- `Open Babel`: `pybel.readfile('mol2', 'file.mol2')`
|
|
- `MDAnalysis`: Can parse mol2 topology
|
|
**EDA Approach:**
|
|
- Atom type distribution
|
|
- Partial charge analysis
|
|
- Bond type statistics
|
|
- Substructure identification
|
|
- Conformational analysis
|
|
- Energy minimization status check
|
|
|
|
### .sdf - Structure Data File
|
|
**Description:** Multi-structure file format with associated data
|
|
**Typical Data:** Multiple molecular structures with properties/annotations
|
|
**Use Cases:** Chemical databases, virtual screening, compound libraries
|
|
**Python Libraries:**
|
|
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
|
|
- `Open Babel`: `pybel.readfile('sdf', 'file.sdf')`
|
|
- `PandasTools` (RDKit): For DataFrame integration
|
|
**EDA Approach:**
|
|
- Dataset size and diversity metrics
|
|
- Property distribution analysis (MW, logP, etc.)
|
|
- Structural diversity (Tanimoto similarity)
|
|
- Missing data assessment
|
|
- Outlier detection in properties
|
|
- Scaffold analysis
|
|
|
|
### .xyz - XYZ Coordinates
|
|
**Description:** Simple Cartesian coordinate format
|
|
**Typical Data:** Atom types and 3D coordinates
|
|
**Use Cases:** Quantum chemistry, geometry optimization, molecular dynamics
|
|
**Python Libraries:**
|
|
- `ASE`: `ase.io.read('file.xyz')`
|
|
- `Open Babel`: `pybel.readfile('xyz', 'file.xyz')`
|
|
- `cclib`: For parsing QM outputs with xyz
|
|
**EDA Approach:**
|
|
- Geometry analysis (bond lengths, angles, dihedrals)
|
|
- Center of mass calculation
|
|
- Moment of inertia
|
|
- Molecular size metrics
|
|
- Coordinate validation
|
|
- Symmetry detection
|
|
|
|
### .smi / .smiles - SMILES String
|
|
**Description:** Line notation for chemical structures
|
|
**Typical Data:** Text representation of molecular structure
|
|
**Use Cases:** Chemical databases, literature mining, data exchange
|
|
**Python Libraries:**
|
|
- `RDKit`: `Chem.MolFromSmiles(smiles)`
|
|
- `Open Babel`: Can parse SMILES
|
|
- `DeepChem`: For ML on SMILES
|
|
**EDA Approach:**
|
|
- SMILES syntax validation
|
|
- Descriptor calculation from SMILES
|
|
- Fingerprint generation
|
|
- Substructure searching
|
|
- Tautomer enumeration
|
|
- Stereoisomer handling
|
|
|
|
### .pdbqt - AutoDock PDBQT
|
|
**Description:** Modified PDB format for AutoDock docking
|
|
**Typical Data:** Coordinates, partial charges, atom types for docking
|
|
**Use Cases:** Molecular docking, virtual screening
|
|
**Python Libraries:**
|
|
- `Meeko`: For PDBQT preparation
|
|
- `Open Babel`: Can read PDBQT
|
|
- `ProDy`: Limited PDBQT support
|
|
**EDA Approach:**
|
|
- Charge distribution analysis
|
|
- Rotatable bond identification
|
|
- Atom type validation
|
|
- Coordinate quality check
|
|
- Hydrogen placement validation
|
|
- Torsion definition analysis
|
|
|
|
### .mae - Maestro Format
|
|
**Description:** Schrödinger's proprietary molecular structure format
|
|
**Typical Data:** Structures, properties, annotations from Schrödinger suite
|
|
**Use Cases:** Drug discovery, molecular modeling with Schrödinger tools
|
|
**Python Libraries:**
|
|
- `schrodinger.structure`: Requires Schrödinger installation
|
|
- Custom parsers for basic reading
|
|
**EDA Approach:**
|
|
- Property extraction and analysis
|
|
- Structure quality metrics
|
|
- Conformer analysis
|
|
- Docking score distributions
|
|
- Ligand efficiency metrics
|
|
|
|
### .gro - GROMACS Coordinate File
|
|
**Description:** Molecular structure file for GROMACS MD simulations
|
|
**Typical Data:** Atom positions, velocities, box vectors
|
|
**Use Cases:** Molecular dynamics simulations, GROMACS workflows
|
|
**Python Libraries:**
|
|
- `MDAnalysis`: `Universe('file.gro')`
|
|
- `MDTraj`: `mdtraj.load_gro('file.gro')`
|
|
- `GromacsWrapper`: For GROMACS integration
|
|
**EDA Approach:**
|
|
- System composition analysis
|
|
- Box dimension validation
|
|
- Atom position distribution
|
|
- Velocity distribution (if present)
|
|
- Density calculation
|
|
- Solvation analysis
|
|
|
|
## Computational Chemistry Output Formats
|
|
|
|
### .log - Gaussian Log File
|
|
**Description:** Output from Gaussian quantum chemistry calculations
|
|
**Typical Data:** Energies, geometries, frequencies, orbitals, populations
|
|
**Use Cases:** QM calculations, geometry optimization, frequency analysis
|
|
**Python Libraries:**
|
|
- `cclib`: `cclib.io.ccread('file.log')`
|
|
- `GaussianRunPack`: For Gaussian workflows
|
|
- Custom parsers with regex
|
|
**EDA Approach:**
|
|
- Convergence analysis
|
|
- Energy profile extraction
|
|
- Vibrational frequency analysis
|
|
- Orbital energy levels
|
|
- Population analysis (Mulliken, NBO)
|
|
- Thermochemistry data extraction
|
|
|
|
### .out - Quantum Chemistry Output
|
|
**Description:** Generic output file from various QM packages
|
|
**Typical Data:** Calculation results, energies, properties
|
|
**Use Cases:** QM calculations across different software
|
|
**Python Libraries:**
|
|
- `cclib`: Universal parser for QM outputs
|
|
- `ASE`: Can read some output formats
|
|
**EDA Approach:**
|
|
- Software-specific parsing
|
|
- Convergence criteria check
|
|
- Energy and gradient trends
|
|
- Basis set and method validation
|
|
- Computational cost analysis
|
|
|
|
### .wfn / .wfx - Wavefunction Files
|
|
**Description:** Wavefunction data for quantum chemical analysis
|
|
**Typical Data:** Molecular orbitals, basis sets, density matrices
|
|
**Use Cases:** Electron density analysis, QTAIM analysis
|
|
**Python Libraries:**
|
|
- `Multiwfn`: Interface via Python
|
|
- `Horton`: For wavefunction analysis
|
|
- Custom parsers for specific formats
|
|
**EDA Approach:**
|
|
- Orbital population analysis
|
|
- Electron density distribution
|
|
- Critical point analysis (QTAIM)
|
|
- Molecular orbital visualization
|
|
- Bonding analysis
|
|
|
|
### .fchk - Gaussian Formatted Checkpoint
|
|
**Description:** Formatted checkpoint file from Gaussian
|
|
**Typical Data:** Complete wavefunction data, results, geometry
|
|
**Use Cases:** Post-processing Gaussian calculations
|
|
**Python Libraries:**
|
|
- `cclib`: Can parse fchk files
|
|
- `GaussView` Python API (if available)
|
|
- Custom parsers
|
|
**EDA Approach:**
|
|
- Wavefunction quality assessment
|
|
- Property extraction
|
|
- Basis set information
|
|
- Gradient and Hessian analysis
|
|
- Natural orbital analysis
|
|
|
|
### .cube - Gaussian Cube File
|
|
**Description:** Volumetric data on a 3D grid
|
|
**Typical Data:** Electron density, molecular orbitals, ESP on grid
|
|
**Use Cases:** Visualization of volumetric properties
|
|
**Python Libraries:**
|
|
- `cclib`: `cclib.io.ccread('file.cube')`
|
|
- `ase.io`: `ase.io.read('file.cube')`
|
|
- `pyquante`: For cube file manipulation
|
|
**EDA Approach:**
|
|
- Grid dimension and spacing analysis
|
|
- Value distribution statistics
|
|
- Isosurface value determination
|
|
- Integration over volume
|
|
- Comparison between different cubes
|
|
|
|
## Molecular Dynamics Formats
|
|
|
|
### .dcd - Binary Trajectory
|
|
**Description:** Binary trajectory format (CHARMM, NAMD)
|
|
**Typical Data:** Time series of atomic coordinates
|
|
**Use Cases:** MD trajectory analysis
|
|
**Python Libraries:**
|
|
- `MDAnalysis`: `Universe(topology, 'traj.dcd')`
|
|
- `MDTraj`: `mdtraj.load_dcd('traj.dcd', top='topology.pdb')`
|
|
- `PyTraj` (Amber): Limited support
|
|
**EDA Approach:**
|
|
- RMSD/RMSF analysis
|
|
- Trajectory length and frame count
|
|
- Coordinate range and drift
|
|
- Periodic boundary handling
|
|
- File integrity check
|
|
- Time step validation
|
|
|
|
### .xtc - Compressed Trajectory
|
|
**Description:** GROMACS compressed trajectory format
|
|
**Typical Data:** Compressed coordinates from MD simulations
|
|
**Use Cases:** Space-efficient MD trajectory storage
|
|
**Python Libraries:**
|
|
- `MDAnalysis`: `Universe(topology, 'traj.xtc')`
|
|
- `MDTraj`: `mdtraj.load_xtc('traj.xtc', top='topology.pdb')`
|
|
**EDA Approach:**
|
|
- Compression ratio assessment
|
|
- Precision loss evaluation
|
|
- RMSD over time
|
|
- Structural stability metrics
|
|
- Sampling frequency analysis
|
|
|
|
### .trr - GROMACS Trajectory
|
|
**Description:** Full precision GROMACS trajectory
|
|
**Typical Data:** Coordinates, velocities, forces from MD
|
|
**Use Cases:** High-precision MD analysis
|
|
**Python Libraries:**
|
|
- `MDAnalysis`: Full support
|
|
- `MDTraj`: Can read trr files
|
|
- `GromacsWrapper`
|
|
**EDA Approach:**
|
|
- Full system dynamics analysis
|
|
- Energy conservation check (with velocities)
|
|
- Force analysis
|
|
- Temperature and pressure validation
|
|
- System equilibration assessment
|
|
|
|
### .nc / .netcdf - Amber NetCDF Trajectory
|
|
**Description:** Network Common Data Form trajectory
|
|
**Typical Data:** MD coordinates, velocities, forces
|
|
**Use Cases:** Amber MD simulations, large trajectory storage
|
|
**Python Libraries:**
|
|
- `MDAnalysis`: NetCDF support
|
|
- `PyTraj`: Native Amber analysis
|
|
- `netCDF4`: Low-level access
|
|
**EDA Approach:**
|
|
- Metadata extraction
|
|
- Trajectory statistics
|
|
- Time series analysis
|
|
- Replica exchange analysis
|
|
- Multi-dimensional data extraction
|
|
|
|
### .top - GROMACS Topology
|
|
**Description:** Molecular topology for GROMACS
|
|
**Typical Data:** Atom types, bonds, angles, force field parameters
|
|
**Use Cases:** MD simulation setup and analysis
|
|
**Python Libraries:**
|
|
- `ParmEd`: `parmed.load_file('system.top')`
|
|
- `MDAnalysis`: Can parse topology
|
|
- Custom parsers for specific fields
|
|
**EDA Approach:**
|
|
- Force field parameter validation
|
|
- System composition
|
|
- Bond/angle/dihedral distribution
|
|
- Charge neutrality check
|
|
- Molecule type enumeration
|
|
|
|
### .psf - Protein Structure File (CHARMM)
|
|
**Description:** Topology file for CHARMM/NAMD
|
|
**Typical Data:** Atom connectivity, types, charges
|
|
**Use Cases:** CHARMM/NAMD MD simulations
|
|
**Python Libraries:**
|
|
- `MDAnalysis`: Native PSF support
|
|
- `ParmEd`: Can read PSF files
|
|
**EDA Approach:**
|
|
- Topology validation
|
|
- Connectivity analysis
|
|
- Charge distribution
|
|
- Atom type statistics
|
|
- Segment analysis
|
|
|
|
### .prmtop - Amber Parameter/Topology
|
|
**Description:** Amber topology and parameter file
|
|
**Typical Data:** System topology, force field parameters
|
|
**Use Cases:** Amber MD simulations
|
|
**Python Libraries:**
|
|
- `ParmEd`: `parmed.load_file('system.prmtop')`
|
|
- `PyTraj`: Native Amber support
|
|
**EDA Approach:**
|
|
- Force field completeness
|
|
- Parameter validation
|
|
- System size and composition
|
|
- Periodic box information
|
|
- Atom mask creation for analysis
|
|
|
|
### .inpcrd / .rst7 - Amber Coordinates
|
|
**Description:** Amber coordinate/restart file
|
|
**Typical Data:** Atomic coordinates, velocities, box info
|
|
**Use Cases:** Starting coordinates for Amber MD
|
|
**Python Libraries:**
|
|
- `ParmEd`: Works with prmtop
|
|
- `PyTraj`: Amber coordinate reading
|
|
**EDA Approach:**
|
|
- Coordinate validity
|
|
- System initialization check
|
|
- Box vector validation
|
|
- Velocity distribution (if restart)
|
|
- Energy minimization status
|
|
|
|
## Spectroscopy and Analytical Data
|
|
|
|
### .jcamp / .jdx - JCAMP-DX
|
|
**Description:** Joint Committee on Atomic and Molecular Physical Data eXchange
|
|
**Typical Data:** Spectroscopic data (IR, NMR, MS, UV-Vis)
|
|
**Use Cases:** Spectroscopy data exchange and archiving
|
|
**Python Libraries:**
|
|
- `jcamp`: `jcamp.jcamp_reader('file.jdx')`
|
|
- `nmrglue`: For NMR JCAMP files
|
|
- Custom parsers for specific subtypes
|
|
**EDA Approach:**
|
|
- Peak detection and analysis
|
|
- Baseline correction assessment
|
|
- Signal-to-noise calculation
|
|
- Spectral range validation
|
|
- Integration analysis
|
|
- Comparison with reference spectra
|
|
|
|
### .mzML - Mass Spectrometry Markup Language
|
|
**Description:** Standard XML format for mass spectrometry data
|
|
**Typical Data:** MS/MS spectra, chromatograms, metadata
|
|
**Use Cases:** Proteomics, metabolomics, mass spectrometry workflows
|
|
**Python Libraries:**
|
|
- `pymzml`: `pymzml.run.Reader('file.mzML')`
|
|
- `pyteomics`: `pyteomics.mzml.read('file.mzML')`
|
|
- `MSFileReader` wrappers
|
|
**EDA Approach:**
|
|
- Scan count and types
|
|
- MS level distribution
|
|
- Retention time range
|
|
- m/z range and resolution
|
|
- Peak intensity distribution
|
|
- Data completeness
|
|
- Quality control metrics
|
|
|
|
### .mzXML - Mass Spectrometry XML
|
|
**Description:** Open XML format for MS data
|
|
**Typical Data:** Mass spectra, retention times, peak lists
|
|
**Use Cases:** Legacy MS data, metabolomics
|
|
**Python Libraries:**
|
|
- `pymzml`: Can read mzXML
|
|
- `pyteomics.mzxml`
|
|
- `lxml` for direct XML parsing
|
|
**EDA Approach:**
|
|
- Similar to mzML
|
|
- Version compatibility check
|
|
- Conversion quality assessment
|
|
- Peak picking validation
|
|
|
|
### .raw - Vendor Raw Data
|
|
**Description:** Proprietary instrument data files (Thermo, Bruker, etc.)
|
|
**Typical Data:** Raw instrument signals, unprocessed data
|
|
**Use Cases:** Direct instrument data access
|
|
**Python Libraries:**
|
|
- `pymsfilereader`: For Thermo RAW files
|
|
- `ThermoRawFileParser`: CLI wrapper
|
|
- Vendor-specific APIs (Thermo, Bruker Compass)
|
|
**EDA Approach:**
|
|
- Instrument method extraction
|
|
- Raw signal quality
|
|
- Calibration status
|
|
- Scan function analysis
|
|
- Chromatographic quality metrics
|
|
|
|
### .d - Agilent Data Directory
|
|
**Description:** Agilent's data folder structure
|
|
**Typical Data:** LC-MS, GC-MS data and metadata
|
|
**Use Cases:** Agilent instrument data processing
|
|
**Python Libraries:**
|
|
- `agilent-reader`: Community tools
|
|
- `Chemstation` Python integration
|
|
- Custom directory parsing
|
|
**EDA Approach:**
|
|
- Directory structure validation
|
|
- Method parameter extraction
|
|
- Signal file integrity
|
|
- Calibration curve analysis
|
|
- Sequence information extraction
|
|
|
|
### .fid - NMR Free Induction Decay
|
|
**Description:** Raw NMR time-domain data
|
|
**Typical Data:** Time-domain NMR signal
|
|
**Use Cases:** NMR processing and analysis
|
|
**Python Libraries:**
|
|
- `nmrglue`: `nmrglue.bruker.read_fid('fid')`
|
|
- `nmrstarlib`: For NMR-STAR files
|
|
**EDA Approach:**
|
|
- Signal decay analysis
|
|
- Noise level assessment
|
|
- Acquisition parameter validation
|
|
- Apodization function selection
|
|
- Zero-filling optimization
|
|
- Phasing parameter estimation
|
|
|
|
### .ft - NMR Frequency-Domain Data
|
|
**Description:** Processed NMR spectrum
|
|
**Typical Data:** Frequency-domain NMR data
|
|
**Use Cases:** NMR analysis and interpretation
|
|
**Python Libraries:**
|
|
- `nmrglue`: Comprehensive NMR support
|
|
- `pyNMR`: For processing
|
|
**EDA Approach:**
|
|
- Peak picking and integration
|
|
- Chemical shift calibration
|
|
- Multiplicity analysis
|
|
- Coupling constant extraction
|
|
- Spectral quality metrics
|
|
- Reference compound identification
|
|
|
|
### .spc - Spectroscopy File
|
|
**Description:** Thermo Galactic spectroscopy format
|
|
**Typical Data:** IR, Raman, UV-Vis spectra
|
|
**Use Cases:** Spectroscopic data from various instruments
|
|
**Python Libraries:**
|
|
- `spc`: `spc.File('file.spc')`
|
|
- Custom parsers for binary format
|
|
**EDA Approach:**
|
|
- Spectral resolution
|
|
- Wavelength/wavenumber range
|
|
- Baseline characterization
|
|
- Peak identification
|
|
- Derivative spectra calculation
|
|
|
|
## Chemical Database Formats
|
|
|
|
### .inchi - International Chemical Identifier
|
|
**Description:** Text identifier for chemical substances
|
|
**Typical Data:** Layered chemical structure representation
|
|
**Use Cases:** Chemical database keys, structure searching
|
|
**Python Libraries:**
|
|
- `RDKit`: `Chem.MolFromInchi(inchi)`
|
|
- `Open Babel`: InChI conversion
|
|
**EDA Approach:**
|
|
- InChI validation
|
|
- Layer analysis
|
|
- Stereochemistry verification
|
|
- InChI key generation
|
|
- Structure round-trip validation
|
|
|
|
### .cdx / .cdxml - ChemDraw Exchange
|
|
**Description:** ChemDraw drawing file format
|
|
**Typical Data:** 2D chemical structures with annotations
|
|
**Use Cases:** Chemical drawing, publication figures
|
|
**Python Libraries:**
|
|
- `RDKit`: Can import some CDXML
|
|
- `Open Babel`: Limited support
|
|
- `ChemDraw` Python API (commercial)
|
|
**EDA Approach:**
|
|
- Structure extraction
|
|
- Annotation preservation
|
|
- Style consistency
|
|
- 2D coordinate validation
|
|
|
|
### .cml - Chemical Markup Language
|
|
**Description:** XML-based chemical structure format
|
|
**Typical Data:** Chemical structures, reactions, properties
|
|
**Use Cases:** Semantic chemical data representation
|
|
**Python Libraries:**
|
|
- `RDKit`: CML support
|
|
- `Open Babel`: Good CML support
|
|
- `lxml`: For XML parsing
|
|
**EDA Approach:**
|
|
- XML schema validation
|
|
- Namespace handling
|
|
- Property extraction
|
|
- Reaction scheme analysis
|
|
- Metadata completeness
|
|
|
|
### .rxn - MDL Reaction File
|
|
**Description:** Chemical reaction structure file
|
|
**Typical Data:** Reactants, products, reaction arrows
|
|
**Use Cases:** Reaction databases, synthesis planning
|
|
**Python Libraries:**
|
|
- `RDKit`: `Chem.ReactionFromRxnFile('file.rxn')`
|
|
- `Open Babel`: Reaction support
|
|
**EDA Approach:**
|
|
- Reaction balancing validation
|
|
- Atom mapping analysis
|
|
- Reagent identification
|
|
- Stereochemistry changes
|
|
- Reaction classification
|
|
|
|
### .rdf - Reaction Data File
|
|
**Description:** Multi-reaction file format
|
|
**Typical Data:** Multiple reactions with data
|
|
**Use Cases:** Reaction databases
|
|
**Python Libraries:**
|
|
- `RDKit`: RDF reading capabilities
|
|
- Custom parsers
|
|
**EDA Approach:**
|
|
- Reaction yield statistics
|
|
- Condition analysis
|
|
- Success rate patterns
|
|
- Reagent frequency analysis
|
|
|
|
## Computational Output and Data
|
|
|
|
### .hdf5 / .h5 - Hierarchical Data Format
|
|
**Description:** Container for scientific data arrays
|
|
**Typical Data:** Large arrays, metadata, hierarchical organization
|
|
**Use Cases:** Large dataset storage, computational results
|
|
**Python Libraries:**
|
|
- `h5py`: `h5py.File('file.h5', 'r')`
|
|
- `pytables`: Advanced HDF5 interface
|
|
- `pandas`: Can read HDF5
|
|
**EDA Approach:**
|
|
- Dataset structure exploration
|
|
- Array shape and dtype analysis
|
|
- Metadata extraction
|
|
- Memory-efficient data sampling
|
|
- Chunk optimization analysis
|
|
- Compression ratio assessment
|
|
|
|
### .pkl / .pickle - Python Pickle
|
|
**Description:** Serialized Python objects
|
|
**Typical Data:** Any Python object (molecules, dataframes, models)
|
|
**Use Cases:** Intermediate data storage, model persistence
|
|
**Python Libraries:**
|
|
- `pickle`: Built-in serialization
|
|
- `joblib`: Enhanced pickling for large arrays
|
|
- `dill`: Extended pickle support
|
|
**EDA Approach:**
|
|
- Object type inspection
|
|
- Size and complexity analysis
|
|
- Version compatibility check
|
|
- Security validation (trusted source)
|
|
- Deserialization testing
|
|
|
|
### .npy / .npz - NumPy Arrays
|
|
**Description:** NumPy array binary format
|
|
**Typical Data:** Numerical arrays (coordinates, features, matrices)
|
|
**Use Cases:** Fast numerical data I/O
|
|
**Python Libraries:**
|
|
- `numpy`: `np.load('file.npy')`
|
|
- Direct memory mapping for large files
|
|
**EDA Approach:**
|
|
- Array shape and dimensions
|
|
- Data type and precision
|
|
- Statistical summary (mean, std, range)
|
|
- Missing value detection
|
|
- Outlier identification
|
|
- Memory footprint analysis
|
|
|
|
### .mat - MATLAB Data File
|
|
**Description:** MATLAB workspace data
|
|
**Typical Data:** Arrays, structures from MATLAB
|
|
**Use Cases:** MATLAB-Python data exchange
|
|
**Python Libraries:**
|
|
- `scipy.io`: `scipy.io.loadmat('file.mat')`
|
|
- `h5py`: For v7.3 MAT files
|
|
**EDA Approach:**
|
|
- Variable extraction and types
|
|
- Array dimension analysis
|
|
- Structure field exploration
|
|
- MATLAB version compatibility
|
|
- Data type conversion validation
|
|
|
|
### .csv - Comma-Separated Values
|
|
**Description:** Tabular data in text format
|
|
**Typical Data:** Chemical properties, experimental data, descriptors
|
|
**Use Cases:** Data exchange, analysis, machine learning
|
|
**Python Libraries:**
|
|
- `pandas`: `pd.read_csv('file.csv')`
|
|
- `csv`: Built-in module
|
|
- `polars`: Fast CSV reading
|
|
**EDA Approach:**
|
|
- Data types inference
|
|
- Missing value patterns
|
|
- Statistical summaries
|
|
- Correlation analysis
|
|
- Distribution visualization
|
|
- Outlier detection
|
|
|
|
### .json - JavaScript Object Notation
|
|
**Description:** Structured text data format
|
|
**Typical Data:** Chemical properties, metadata, API responses
|
|
**Use Cases:** Data interchange, configuration, web APIs
|
|
**Python Libraries:**
|
|
- `json`: Built-in JSON support
|
|
- `pandas`: `pd.read_json()`
|
|
- `ujson`: Faster JSON parsing
|
|
**EDA Approach:**
|
|
- Schema validation
|
|
- Nesting depth analysis
|
|
- Key-value distribution
|
|
- Data type consistency
|
|
- Array length statistics
|
|
|
|
### .parquet - Apache Parquet
|
|
**Description:** Columnar storage format
|
|
**Typical Data:** Large tabular datasets efficiently
|
|
**Use Cases:** Big data, efficient columnar analytics
|
|
**Python Libraries:**
|
|
- `pandas`: `pd.read_parquet('file.parquet')`
|
|
- `pyarrow`: Direct parquet access
|
|
- `fastparquet`: Alternative implementation
|
|
**EDA Approach:**
|
|
- Column statistics from metadata
|
|
- Partition analysis
|
|
- Compression efficiency
|
|
- Row group structure
|
|
- Fast sampling for large files
|
|
- Schema evolution tracking
|