Add more scientific skills

This commit is contained in:
Timothy Kassis
2025-10-19 14:12:02 -07:00
parent 78d5ac2b56
commit 660c8574d0
210 changed files with 88957 additions and 1 deletions

View File

@@ -17,7 +17,41 @@
"strict": false,
"skills": [
"./scientific-packages/anndata",
"./scientific-packages/arboreto"
"./scientific-packages/arboreto",
"./scientific-packages/astropy",
"./scientific-packages/biomni",
"./scientific-packages/biopython",
"./scientific-packages/bioservices",
"./scientific-packages/cellxgene-census",
"./scientific-packages/cobrapy",
"./scientific-packages/datamol",
"./scientific-packages/deepchem",
"./scientific-packages/deeptools",
"./scientific-packages/diffdock",
"./scientific-packages/etetoolkit",
"./scientific-packages/flowio",
"./scientific-packages/gget",
"./scientific-packages/matplotlib",
"./scientific-packages/medchem",
"./scientific-packages/molfeat",
"./scientific-packages/polars",
"./scientific-packages/pubchem-database",
"./scientific-packages/pydeseq2",
"./scientific-packages/pymatgen",
"./scientific-packages/pymc",
"./scientific-packages/pymoo",
"./scientific-packages/pytdc",
"./scientific-packages/pytorch-lightning",
"./scientific-packages/rdkit",
"./scientific-packages/reportlab",
"./scientific-packages/scanpy",
"./scientific-packages/scikit-bio",
"./scientific-packages/scikit-learn",
"./scientific-packages/seaborn",
"./scientific-packages/torch_geometric",
"./scientific-packages/transformers",
"./scientific-packages/umap-learn",
"./scientific-packages/zarr-python"
]
},
{

View File

@@ -0,0 +1,790 @@
---
name: astropy
description: Comprehensive toolkit for astronomical data analysis and computation using the astropy Python library. This skill should be used when working with astronomical data including FITS files, coordinate transformations, cosmological calculations, time systems, physical units, data tables, model fitting, WCS transformations, and visualization. Use this skill for tasks involving celestial coordinates, astronomical file formats, photometry, spectroscopy, or any astronomy-specific Python computations.
---
# Astropy
## Overview
Astropy is the community standard Python library for astronomy, providing core functionality for astronomical data analysis and computation. This skill provides comprehensive guidance and tools for working with astropy's extensive capabilities across coordinate systems, file I/O, units and quantities, time systems, cosmology, modeling, and more.
## When to Use This Skill
Use this skill when:
- Working with FITS files (reading, writing, inspecting, modifying)
- Performing coordinate transformations between astronomical reference frames
- Calculating cosmological distances, ages, or other quantities
- Handling astronomical time systems and conversions
- Working with physical units and dimensional analysis
- Processing astronomical data tables with specialized column types
- Fitting models to astronomical data
- Converting between pixel and world coordinates (WCS)
- Performing robust statistical analysis on astronomical data
- Visualizing astronomical images with proper scaling and stretching
## Core Capabilities
### 1. FITS File Operations
FITS (Flexible Image Transport System) is the standard file format in astronomy. Astropy provides comprehensive FITS support.
**Quick FITS Inspection**:
Use the included `scripts/fits_info.py` script for rapid file inspection:
```bash
python scripts/fits_info.py observation.fits
python scripts/fits_info.py observation.fits --detailed
python scripts/fits_info.py observation.fits --ext 1
```
**Common FITS workflows**:
```python
from astropy.io import fits
# Read FITS file
with fits.open('image.fits') as hdul:
hdul.info() # Display structure
data = hdul[0].data
header = hdul[0].header
# Write FITS file
fits.writeto('output.fits', data, header, overwrite=True)
# Quick access (less efficient for multiple operations)
data = fits.getdata('image.fits', ext=0)
header = fits.getheader('image.fits', ext=0)
# Update specific header keyword
fits.setval('image.fits', 'OBJECT', value='M31')
```
**Multi-extension FITS**:
```python
from astropy.io import fits
# Create multi-extension FITS
primary = fits.PrimaryHDU(primary_data)
image_ext = fits.ImageHDU(science_data, name='SCI')
error_ext = fits.ImageHDU(error_data, name='ERR')
hdul = fits.HDUList([primary, image_ext, error_ext])
hdul.writeto('multi_ext.fits', overwrite=True)
```
**Binary tables**:
```python
from astropy.io import fits
# Read binary table
with fits.open('catalog.fits') as hdul:
table_data = hdul[1].data
ra = table_data['RA']
dec = table_data['DEC']
# Better: use astropy.table for table operations (see section 5)
```
### 2. Coordinate Systems and Transformations
Astropy supports ~25 coordinate frames with seamless transformations.
**Quick Coordinate Conversion**:
Use the included `scripts/coord_convert.py` script:
```bash
python scripts/coord_convert.py 10.68 41.27 --from icrs --to galactic
python scripts/coord_convert.py --file coords.txt --from icrs --to galactic --output sexagesimal
```
**Basic coordinate operations**:
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create coordinate (multiple input formats supported)
c = SkyCoord(ra=10.68*u.degree, dec=41.27*u.degree, frame='icrs')
c = SkyCoord('00:42:44.3 +41:16:09', unit=(u.hourangle, u.deg))
c = SkyCoord('00h42m44.3s +41d16m09s')
# Transform between frames
c_galactic = c.galactic
c_fk5 = c.fk5
print(f"Galactic: l={c_galactic.l.deg:.3f}, b={c_galactic.b.deg:.3f}")
```
**Working with coordinate arrays**:
```python
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u
# Arrays of coordinates
ra = np.array([10.1, 10.2, 10.3]) * u.degree
dec = np.array([40.1, 40.2, 40.3]) * u.degree
coords = SkyCoord(ra=ra, dec=dec, frame='icrs')
# Calculate separations
sep = coords[0].separation(coords[1])
print(f"Separation: {sep.to(u.arcmin)}")
# Position angle
pa = coords[0].position_angle(coords[1])
```
**Catalog matching**:
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
catalog1 = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[40, 41, 42]*u.degree)
catalog2 = SkyCoord(ra=[10.01, 11.02, 13]*u.degree, dec=[40.01, 41.01, 43]*u.degree)
# Find nearest neighbors
idx, sep2d, dist3d = catalog1.match_to_catalog_sky(catalog2)
# Filter by separation threshold
max_sep = 1 * u.arcsec
matched = sep2d < max_sep
```
**Horizontal coordinates (Alt/Az)**:
```python
from astropy.coordinates import SkyCoord, EarthLocation, AltAz
from astropy.time import Time
import astropy.units as u
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg, height=300*u.m)
obstime = Time('2023-01-01 03:00:00')
target = SkyCoord(ra=10*u.degree, dec=40*u.degree, frame='icrs')
altaz_frame = AltAz(obstime=obstime, location=location)
target_altaz = target.transform_to(altaz_frame)
print(f"Alt: {target_altaz.alt.deg:.2f}°, Az: {target_altaz.az.deg:.2f}°")
```
**Available coordinate frames**:
- `icrs` - International Celestial Reference System (default, preferred)
- `fk5`, `fk4` - Fifth/Fourth Fundamental Katalog
- `galactic` - Galactic coordinates
- `supergalactic` - Supergalactic coordinates
- `altaz` - Horizontal (altitude-azimuth) coordinates
- `gcrs`, `cirs`, `itrs` - Earth-based systems
- Ecliptic frames: `BarycentricMeanEcliptic`, `HeliocentricMeanEcliptic`, `GeocentricMeanEcliptic`
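Frames that carry attributes (such as an equinox or observation time) can also be instantiated explicitly and passed to `transform_to`. A minimal sketch, assuming the standard FK5 frame with a non-default equinox:
```python
from astropy.coordinates import SkyCoord, FK5
import astropy.units as u
# Precess an ICRS position to the FK5 frame at equinox J1975
c = SkyCoord(ra=10.68*u.deg, dec=41.27*u.deg, frame='icrs')
c_1975 = c.transform_to(FK5(equinox='J1975'))
print(c_1975.to_string('hmsdms'))
```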
### 3. Units and Quantities
Physical units are fundamental to astronomical calculations. Astropy's units system provides dimensional analysis and automatic conversions.
**Basic unit operations**:
```python
import astropy.units as u
# Create quantities
distance = 5.2 * u.parsec
velocity = 300 * u.km / u.s
time = 10 * u.year
# Convert units
distance_ly = distance.to(u.lightyear)
velocity_mps = velocity.to(u.m / u.s)
# Arithmetic with units
wavelength = 500 * u.nm
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
```
**Working with arrays**:
```python
import numpy as np
import astropy.units as u
wavelengths = np.array([400, 500, 600]) * u.nm
frequencies = wavelengths.to(u.THz, equivalencies=u.spectral())
fluxes = np.array([1.2, 2.3, 1.8]) * u.Jy
luminosities = 4 * np.pi * (10*u.pc)**2 * fluxes
```
**Important equivalencies**:
- `u.spectral()` - Convert wavelength ↔ frequency ↔ energy
- `u.doppler_optical(rest)` - Optical Doppler velocity
- `u.doppler_radio(rest)` - Radio Doppler velocity
- `u.doppler_relativistic(rest)` - Relativistic Doppler
- `u.temperature()` - Temperature unit conversions
- `u.brightness_temperature(freq)` - Brightness temperature
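A short illustration of the equivalencies listed above (a minimal sketch; the H-alpha rest wavelength used is the standard 656.281 nm value):
```python
import astropy.units as u
# Wavelength -> frequency, and an observed line shift -> optical Doppler velocity
print((500 * u.nm).to(u.THz, equivalencies=u.spectral()))
rest = 656.281 * u.nm
velocity = (656.5 * u.nm).to(u.km / u.s, equivalencies=u.doppler_optical(rest))
print(velocity)  # roughly +100 km/s (redshifted)
```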
**Physical constants**:
```python
from astropy import constants as const
print(const.c) # Speed of light
print(const.G) # Gravitational constant
print(const.M_sun) # Solar mass
print(const.R_sun) # Solar radius
print(const.L_sun) # Solar luminosity
```
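Constants are `Quantity` objects, so they combine naturally with units in calculations; for example (a sketch, not from the original text):
```python
from astropy import constants as const
import astropy.units as u
# Schwarzschild radius of a 10 solar-mass black hole: r_s = 2GM/c^2
M = 10 * const.M_sun
r_s = (2 * const.G * M / const.c**2).to(u.km)
print(r_s)  # ~29.5 km
```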
**Performance tip**: Use the `<<` operator for fast unit assignment to arrays:
```python
# Fast
result = large_array << u.m
# Slower
result = large_array * u.m
```
### 4. Time Systems
Astronomical time systems require high precision and multiple time scales.
**Creating time objects**:
```python
from astropy.time import Time
import astropy.units as u
# Various input formats
t1 = Time('2023-01-01T00:00:00', format='isot', scale='utc')
t2 = Time(2459945.5, format='jd', scale='utc')
t3 = Time(['2023-01-01', '2023-06-01'], format='iso')
# Convert formats
print(t1.jd) # Julian Date
print(t1.mjd) # Modified Julian Date
print(t1.unix) # Unix timestamp
print(t1.iso) # ISO format
# Convert time scales
print(t1.tai) # International Atomic Time
print(t1.tt) # Terrestrial Time
print(t1.tdb) # Barycentric Dynamical Time
```
**Time arithmetic**:
```python
from astropy.time import Time, TimeDelta
import astropy.units as u
import numpy as np
t1 = Time('2023-01-01T00:00:00')
dt = TimeDelta(1*u.day)
t2 = t1 + dt
diff = t2 - t1
print(diff.to(u.hour))
# Array of times
times = t1 + np.arange(10) * u.day
```
**Astronomical time calculations**:
```python
from astropy.time import Time
from astropy.coordinates import SkyCoord, EarthLocation
import astropy.units as u
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg)
t = Time('2023-01-01T00:00:00')
# Local sidereal time
lst = t.sidereal_time('apparent', longitude=location.lon)
# Barycentric correction
target = SkyCoord(ra=10*u.deg, dec=40*u.deg)
ltt = t.light_travel_time(target, location=location)
t_bary = t.tdb + ltt
```
**Available time scales**:
- `utc` - Coordinated Universal Time
- `tai` - International Atomic Time
- `tt` - Terrestrial Time
- `tcb`, `tcg` - Barycentric/Geocentric Coordinate Time
- `tdb` - Barycentric Dynamical Time
- `ut1` - Universal Time
### 5. Data Tables
Astropy tables provide astronomy-specific capabilities that generic tools such as pandas lack, including unit-aware columns (via `QTable`) and native support for `SkyCoord` and `Time` columns.
**Creating and manipulating tables**:
```python
from astropy.table import QTable
import astropy.units as u
# Create table (QTable stores unit-ful columns as Quantity objects)
t = QTable()
t['name'] = ['Star1', 'Star2', 'Star3']
t['ra'] = [10.5, 11.2, 12.3] * u.degree
t['dec'] = [41.2, 42.1, 43.5] * u.degree
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
# Column metadata
t['flux'].info.description = 'Flux at 1.4 GHz'
t['flux'].info.format = '.2f'
# Add calculated column
t['flux_mJy'] = t['flux'].to(u.mJy)
# Filter and sort (unit-aware because the columns are Quantities)
bright = t[t['flux'] > 1.0 * u.Jy]
t.sort('flux')
```
**Table I/O**:
```python
from astropy.table import Table
# Read (format auto-detected from extension)
t = Table.read('data.fits')
t = Table.read('data.csv', format='ascii.csv')
t = Table.read('data.ecsv', format='ascii.ecsv') # Preserves units!
t = Table.read('data.votable', format='votable')
# Write
t.write('output.fits', overwrite=True)
t.write('output.ecsv', format='ascii.ecsv', overwrite=True)
```
**Advanced operations**:
```python
from astropy.table import Table, join, vstack, hstack
import numpy as np
# Join tables (like SQL)
joined = join(table1, table2, keys='id')
# Stack tables
combined_rows = vstack([t1, t2])
combined_cols = hstack([t1, t2])
# Grouping and aggregation
t.group_by('category').groups.aggregate(np.mean)
```
**Tables with astronomical objects**:
```python
from astropy.table import Table
from astropy.coordinates import SkyCoord
from astropy.time import Time
import astropy.units as u
coords = SkyCoord(ra=[10, 11, 12]*u.deg, dec=[40, 41, 42]*u.deg)
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
t = Table([coords, times], names=['coords', 'obstime'])
print(t['coords'][0].ra) # Access coordinate properties
```
### 6. Cosmological Calculations
Compute cosmological distances, ages, and related quantities with standard or custom cosmological models.
**Using the cosmology calculator**:
```bash
python scripts/cosmo_calc.py 0.5 1.0 1.5
python scripts/cosmo_calc.py --range 0 3 0.5 --cosmology Planck18
python scripts/cosmo_calc.py 0.5 --verbose
python scripts/cosmo_calc.py --convert 1000 --from luminosity_distance
```
**Programmatic usage**:
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
cosmo = Planck18
# Calculate distances
z = 1.5
d_L = cosmo.luminosity_distance(z)
d_A = cosmo.angular_diameter_distance(z)
d_C = cosmo.comoving_distance(z)
# Time calculations
age = cosmo.age(z)
lookback = cosmo.lookback_time(z)
# Hubble parameter
H_z = cosmo.H(z)
print(f"At z={z}:")
print(f" Luminosity distance: {d_L:.2f}")
print(f" Age of universe: {age:.2f}")
```
**Convert observables**:
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
cosmo = Planck18
z = 1.5
# Angular size to physical size
d_A = cosmo.angular_diameter_distance(z)
angular_size = 1 * u.arcsec
physical_size = (angular_size * d_A).to(u.kpc, equivalencies=u.dimensionless_angles())
# Flux to luminosity
flux = 1e-17 * u.erg / u.s / u.cm**2
d_L = cosmo.luminosity_distance(z)
luminosity = flux * 4 * np.pi * d_L**2
# Find redshift for given distance
from astropy.cosmology import z_at_value
z_est = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
```
**Available cosmologies**:
- `Planck18`, `Planck15`, `Planck13` - Planck satellite parameters
- `WMAP9`, `WMAP7`, `WMAP5` - WMAP satellite parameters
- Custom: `FlatLambdaCDM(H0=70*u.km/u.s/u.Mpc, Om0=0.3)`
### 7. Model Fitting
Fit mathematical models to astronomical data.
**1D fitting example**:
```python
from astropy.modeling import models, fitting
import numpy as np
# Generate data
x = np.linspace(0, 10, 100)
y_data = 10 * np.exp(-0.5 * ((x - 5) / 1)**2) + np.random.normal(0, 0.5, x.shape)
# Create and fit model
g_init = models.Gaussian1D(amplitude=8, mean=4.5, stddev=0.8)
fitter = fitting.LevMarLSQFitter()
g_fit = fitter(g_init, x, y_data)
# Results
print(f"Amplitude: {g_fit.amplitude.value:.3f}")
print(f"Mean: {g_fit.mean.value:.3f}")
print(f"Stddev: {g_fit.stddev.value:.3f}")
# Evaluate fitted model
y_fit = g_fit(x)
```
**Common 1D models**:
- `Gaussian1D` - Gaussian profile
- `Lorentz1D` - Lorentzian profile
- `Voigt1D` - Voigt profile
- `Moffat1D` - Moffat profile (PSF modeling)
- `Polynomial1D` - Polynomial
- `PowerLaw1D` - Power law
- `BlackBody` - Blackbody spectrum
**Common 2D models**:
- `Gaussian2D` - 2D Gaussian
- `Moffat2D` - 2D Moffat (stellar PSF)
- `AiryDisk2D` - Airy disk (diffraction pattern)
- `Disk2D` - Circular disk
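Any of these models can be evaluated directly on a pixel grid before (or instead of) fitting; a minimal sketch with `Moffat2D`:
```python
import numpy as np
from astropy.modeling import models
# Evaluate a Moffat PSF on a 51x51 pixel grid centered at (25, 25)
y, x = np.mgrid[0:51, 0:51]
psf = models.Moffat2D(amplitude=1.0, x_0=25, y_0=25, gamma=3.0, alpha=2.5)
image = psf(x, y)
print(image.shape, image.max())
```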
**Fitting with constraints**:
```python
from astropy.modeling import models, fitting
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
# Set bounds
g.amplitude.bounds = (0, None) # Positive only
g.mean.bounds = (4, 6) # Constrain center
# Fix parameters
g.stddev.fixed = True
# Compound models
model = models.Gaussian1D() + models.Polynomial1D(degree=1)
```
**Available fitters**:
- `LinearLSQFitter` - Linear least squares (fast, for linear models)
- `LevMarLSQFitter` - Levenberg-Marquardt (most common)
- `SimplexLSQFitter` - Downhill simplex
- `SLSQPLSQFitter` - Sequential Least Squares with constraints
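For models that are linear in their parameters (e.g., polynomials), `LinearLSQFitter` is both fast and exact; a small sketch on synthetic data:
```python
import numpy as np
from astropy.modeling import models, fitting
# Fit a straight line with linear least squares
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + np.random.normal(0, 0.1, x.size)
line = fitting.LinearLSQFitter()(models.Polynomial1D(degree=1), x, y)
print(line.c0.value, line.c1.value)  # ~2.0 and ~0.5
```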
### 8. World Coordinate System (WCS)
Transform between pixel and world coordinates in images.
**Basic WCS usage**:
```python
from astropy.io import fits
from astropy.wcs import WCS
# Read FITS with WCS
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
# Pixel to world
ra, dec = wcs.pixel_to_world_values(100, 200)
# World to pixel
x, y = wcs.world_to_pixel_values(ra, dec)
# Using SkyCoord (more powerful)
from astropy.coordinates import SkyCoord
import astropy.units as u
coord = SkyCoord(ra=150*u.deg, dec=-30*u.deg)
x, y = wcs.world_to_pixel(coord)
```
**Plotting with WCS**:
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.visualization import ImageNormalize, LogStretch, PercentileInterval
import matplotlib.pyplot as plt
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
data = hdu.data
# Create figure with WCS projection
fig = plt.figure()
ax = fig.add_subplot(111, projection=wcs)
# Plot with coordinate grid
norm = ImageNormalize(data, interval=PercentileInterval(99.5),
stretch=LogStretch())
ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
# Coordinate labels and grid
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
ax.coords.grid(color='white', alpha=0.5)
```
### 9. Statistics and Data Processing
Robust statistical tools for astronomical data.
**Sigma clipping** (remove outliers):
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
# Remove outliers
clipped = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics on cleaned data
mean, median, std = sigma_clipped_stats(data, sigma=3)
# Use clipped data
background = median
signal = data - background
snr = signal / std
```
**Other statistical functions**:
```python
from astropy.stats import mad_std, biweight_location, biweight_scale
# Robust standard deviation
std_robust = mad_std(data)
# Robust central location
center = biweight_location(data)
# Robust scale
scale = biweight_scale(data)
```
### 10. Visualization
Display astronomical images with proper scaling.
**Image normalization and stretching**:
```python
from astropy.visualization import (ImageNormalize, MinMaxInterval,
PercentileInterval, ZScaleInterval,
SqrtStretch, LogStretch, PowerStretch,
AsinhStretch)
import matplotlib.pyplot as plt
# Common combination: percentile interval + sqrt stretch
norm = ImageNormalize(data,
interval=PercentileInterval(99),
stretch=SqrtStretch())
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
plt.colorbar()
```
**Available intervals** (determine min/max):
- `MinMaxInterval()` - Use actual min/max
- `PercentileInterval(percentile)` - Clip to percentile (e.g., 99%)
- `ZScaleInterval()` - IRAF's zscale algorithm
- `ManualInterval(vmin, vmax)` - Specify manually
**Available stretches** (nonlinear scaling):
- `LinearStretch()` - Linear (default)
- `SqrtStretch()` - Square root (common for images)
- `LogStretch()` - Logarithmic (for high dynamic range)
- `PowerStretch(power)` - Power law
- `AsinhStretch()` - Arcsinh (good for wide range)
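Intervals and stretches compose freely; for example, ZScale limits combined with an arcsinh stretch (a sketch using a synthetic image):
```python
import numpy as np
from astropy.visualization import ImageNormalize, ZScaleInterval, AsinhStretch
import matplotlib.pyplot as plt
# Synthetic image just for demonstration
data = np.random.normal(100, 5, (256, 256))
# ZScale limits (as in DS9/IRAF) with an arcsinh stretch
norm = ImageNormalize(data, interval=ZScaleInterval(), stretch=AsinhStretch())
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
plt.colorbar()
```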
## Bundled Resources
### scripts/
**`fits_info.py`** - Comprehensive FITS file inspection tool
```bash
python scripts/fits_info.py observation.fits
python scripts/fits_info.py observation.fits --detailed
python scripts/fits_info.py observation.fits --ext 1
```
**`coord_convert.py`** - Batch coordinate transformation utility
```bash
python scripts/coord_convert.py 10.68 41.27 --from icrs --to galactic
python scripts/coord_convert.py --file coords.txt --from icrs --to galactic
```
**`cosmo_calc.py`** - Cosmological calculator
```bash
python scripts/cosmo_calc.py 0.5 1.0 1.5
python scripts/cosmo_calc.py --range 0 3 0.5 --cosmology Planck18
```
### references/
**`module_overview.md`** - Comprehensive reference of all astropy subpackages, classes, and methods. Consult this for detailed API information, available functions, and module capabilities.
**`common_workflows.md`** - Complete working examples for common astronomical data analysis tasks. Contains full code examples for FITS operations, coordinate transformations, cosmology, modeling, and complete analysis pipelines.
## Best Practices
1. **Use context managers for FITS files**:
```python
with fits.open('file.fits') as hdul:
# Work with file
```
2. **Prefer astropy.table over raw FITS tables** for better unit/metadata support
3. **Use SkyCoord for coordinates** (high-level interface) rather than low-level frame classes
4. **Always attach units** to quantities when possible for dimensional safety
5. **Use ECSV format** for saving tables when you want to preserve units and metadata
6. **Vectorize coordinate operations** rather than looping for performance
7. **Use memmap=True** when opening large FITS files to save memory
8. **Install Bottleneck** package for faster statistics operations
9. **Pre-compute composite units** for repeated operations to improve performance
10. **Consult `references/module_overview.md`** for detailed module information and `references/common_workflows.md` for complete working examples
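As a small illustration of best practice 7 above (a sketch; the file name is hypothetical):
```python
from astropy.io import fits
# With memmap=True only the requested cutout is read from disk
with fits.open('large_mosaic.fits', memmap=True) as hdul:
    cutout = hdul[0].section[1000:1100, 2000:2100]
    print(cutout.shape)
```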
## Common Patterns
### Pattern: FITS → Process → FITS
```python
from astropy.io import fits
from astropy.stats import sigma_clipped_stats
# Read
with fits.open('input.fits') as hdul:
data = hdul[0].data
header = hdul[0].header
# Process
mean, median, std = sigma_clipped_stats(data, sigma=3)
processed = (data - median) / std
# Write
fits.writeto('output.fits', processed, header, overwrite=True)
```
### Pattern: Catalog Matching
```python
from astropy.coordinates import SkyCoord
from astropy.table import Table
import astropy.units as u
# Load catalogs
cat1 = Table.read('catalog1.fits')
cat2 = Table.read('catalog2.fits')
# Create coordinate objects
coords1 = SkyCoord(ra=cat1['RA'], dec=cat1['DEC'], unit=u.degree)
coords2 = SkyCoord(ra=cat2['RA'], dec=cat2['DEC'], unit=u.degree)
# Match
idx, sep2d, dist3d = coords1.match_to_catalog_sky(coords2)
# Filter by separation
max_sep = 1 * u.arcsec
matched_mask = sep2d < max_sep
# Create matched catalog
matched_cat1 = cat1[matched_mask]
matched_cat2 = cat2[idx[matched_mask]]
```
### Pattern: Time Series Analysis
```python
from astropy.time import Time
from astropy.timeseries import TimeSeries
import astropy.units as u
# Create time series
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
flux = [1.2, 2.3, 1.8] * u.Jy
ts = TimeSeries(time=times)
ts['flux'] = flux
# Fold the time series on a known period
period = 1.5 * u.day
folded = ts.fold(period=period)
```
### Pattern: Image Display with WCS
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
data = hdu.data
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection=wcs)
norm = ImageNormalize(data, interval=PercentileInterval(99),
stretch=SqrtStretch())
im = ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
ax.coords.grid(color='white', alpha=0.5, linestyle='solid')
plt.colorbar(im, ax=ax)
```
## Installation Note
Ensure astropy is installed in the Python environment:
```bash
pip install astropy
```
For additional performance and features:
```bash
pip install astropy[all] # Includes optional dependencies
```
## Additional Resources
- Official documentation: https://docs.astropy.org
- Tutorials: https://learn.astropy.org
- API reference: Consult `references/module_overview.md` in this skill
- Working examples: Consult `references/common_workflows.md` in this skill

View File

@@ -0,0 +1,618 @@
# Common Astropy Workflows
This document describes frequently used workflows when working with astronomical data using astropy.
## 1. Working with FITS Files
### Basic FITS Reading
```python
from astropy.io import fits
import numpy as np
# Open and examine structure
with fits.open('observation.fits') as hdul:
hdul.info()
# Access primary HDU
primary_hdr = hdul[0].header
primary_data = hdul[0].data
# Access extension
ext_data = hdul[1].data
ext_hdr = hdul[1].header
# Read specific header keywords
object_name = primary_hdr['OBJECT']
exposure = primary_hdr['EXPTIME']
```
### Writing FITS Files
```python
# Create new FITS file
from astropy.io import fits
import numpy as np
# Create data
data = np.random.random((100, 100))
# Create primary HDU
hdu = fits.PrimaryHDU(data)
hdu.header['OBJECT'] = 'M31'
hdu.header['EXPTIME'] = 300.0
# Write to file
hdu.writeto('output.fits', overwrite=True)
# Multi-extension FITS
hdul = fits.HDUList([
fits.PrimaryHDU(data1),
fits.ImageHDU(data2, name='SCI'),
fits.ImageHDU(data3, name='ERR')
])
hdul.writeto('multi_ext.fits', overwrite=True)
```
### FITS Table Operations
```python
from astropy.io import fits
# Read binary table
with fits.open('catalog.fits') as hdul:
table_data = hdul[1].data
# Access columns
ra = table_data['RA']
dec = table_data['DEC']
mag = table_data['MAG']
# Filter data
bright = table_data[table_data['MAG'] < 15]
# Write binary table
from astropy.table import Table
import astropy.units as u
t = Table([ra, dec, mag], names=['RA', 'DEC', 'MAG'])
t['RA'].unit = u.degree
t['DEC'].unit = u.degree
t.write('output_catalog.fits', format='fits', overwrite=True)
```
## 2. Coordinate Transformations
### Basic Coordinate Creation and Transformation
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create from RA/Dec
c = SkyCoord(ra=10.68458*u.degree, dec=41.26917*u.degree, frame='icrs')
# Alternative creation methods
c = SkyCoord('00:42:44.3 +41:16:09', unit=(u.hourangle, u.deg))
c = SkyCoord('00h42m44.3s +41d16m09s')
# Transform to different frames
c_gal = c.galactic
c_fk5 = c.fk5
print(f"Galactic: l={c_gal.l.deg}, b={c_gal.b.deg}")
```
### Coordinate Arrays and Separations
```python
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create array of coordinates
ra_array = np.array([10.1, 10.2, 10.3]) * u.degree
dec_array = np.array([40.1, 40.2, 40.3]) * u.degree
coords = SkyCoord(ra=ra_array, dec=dec_array, frame='icrs')
# Calculate separations
c1 = SkyCoord(ra=10*u.degree, dec=40*u.degree)
c2 = SkyCoord(ra=11*u.degree, dec=41*u.degree)
sep = c1.separation(c2)
print(f"Separation: {sep.to(u.arcmin)}")
# Position angle
pa = c1.position_angle(c2)
```
### Catalog Matching
```python
from astropy.coordinates import SkyCoord, match_coordinates_sky
import astropy.units as u
# Two catalogs of coordinates
catalog1 = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[40, 41, 42]*u.degree)
catalog2 = SkyCoord(ra=[10.01, 11.02, 13]*u.degree, dec=[40.01, 41.01, 43]*u.degree)
# Find nearest neighbors
idx, sep2d, dist3d = catalog1.match_to_catalog_sky(catalog2)
# Filter by separation threshold
max_sep = 1 * u.arcsec
matched = sep2d < max_sep
matching_indices = idx[matched]
```
### Horizontal Coordinates (Alt/Az)
```python
from astropy.coordinates import SkyCoord, EarthLocation, AltAz
from astropy.time import Time
import astropy.units as u
# Observer location
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg, height=300*u.m)
# Observation time
obstime = Time('2023-01-01 03:00:00')
# Target coordinate
target = SkyCoord(ra=10*u.degree, dec=40*u.degree, frame='icrs')
# Transform to Alt/Az
altaz_frame = AltAz(obstime=obstime, location=location)
target_altaz = target.transform_to(altaz_frame)
print(f"Altitude: {target_altaz.alt.deg}")
print(f"Azimuth: {target_altaz.az.deg}")
```
## 3. Units and Quantities
### Basic Unit Operations
```python
import astropy.units as u
# Create quantities
distance = 5.2 * u.parsec
time = 10 * u.year
velocity = 300 * u.km / u.s
# Unit conversion
distance_ly = distance.to(u.lightyear)
velocity_mps = velocity.to(u.m / u.s)
# Arithmetic with units
wavelength = 500 * u.nm
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
# Compose/decompose units
composite = (1 * u.kg * u.m**2 / u.s**2)
print(composite.decompose()) # Base SI units
print(composite.compose()) # Known compound units (Joule)
```
### Working with Arrays
```python
import numpy as np
import astropy.units as u
# Quantity arrays
wavelengths = np.array([400, 500, 600]) * u.nm
frequencies = wavelengths.to(u.THz, equivalencies=u.spectral())
# Mathematical operations preserve units
fluxes = np.array([1.2, 2.3, 1.8]) * u.Jy
luminosities = 4 * np.pi * (10*u.pc)**2 * fluxes
```
### Custom Units and Equivalencies
```python
import astropy.units as u
# Define custom unit
beam = u.def_unit('beam', 1.5e-10 * u.steradian)
# Register for session
u.add_enabled_units([beam])
# Use in calculations
flux_per_beam = 1.5 * u.Jy / beam
# Doppler equivalencies
rest_wavelength = 656.3 * u.nm # H-alpha
observed = 656.5 * u.nm
velocity = observed.to(u.km/u.s,
equivalencies=u.doppler_optical(rest_wavelength))
```
## 4. Time Handling
### Time Creation and Conversion
```python
from astropy.time import Time
import astropy.units as u
# Create time objects
t1 = Time('2023-01-01T00:00:00', format='isot', scale='utc')
t2 = Time(2459945.5, format='jd', scale='utc')
t3 = Time(['2023-01-01', '2023-06-01'], format='iso')
# Convert formats
print(t1.jd) # Julian Date
print(t1.mjd) # Modified Julian Date
print(t1.unix) # Unix timestamp
print(t1.iso) # ISO format
# Convert time scales
print(t1.tai) # Convert to TAI
print(t1.tt) # Convert to TT
print(t1.tdb) # Convert to TDB
```
### Time Arithmetic
```python
from astropy.time import Time, TimeDelta
import astropy.units as u
import numpy as np
t1 = Time('2023-01-01T00:00:00')
dt = TimeDelta(1*u.day)
# Add time delta
t2 = t1 + dt
# Difference between times
diff = t2 - t1
print(diff.to(u.hour))
# Array of times
times = t1 + np.arange(10) * u.day
```
### Sidereal Time and Astronomical Calculations
```python
from astropy.time import Time
from astropy.coordinates import EarthLocation
import astropy.units as u
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg)
t = Time('2023-01-01T00:00:00')
# Local sidereal time
lst = t.sidereal_time('apparent', longitude=location.lon)
# Light travel time correction
from astropy.coordinates import SkyCoord
target = SkyCoord(ra=10*u.deg, dec=40*u.deg)
ltt_bary = t.light_travel_time(target, location=location)
t_bary = t + ltt_bary
```
## 5. Tables and Data Management
### Creating and Manipulating Tables
```python
from astropy.table import QTable
import astropy.units as u
import numpy as np
# Create table (QTable stores unit-ful columns as Quantity objects,
# so the unit-aware comparison and conversion below work as expected)
t = QTable()
t['name'] = ['Star1', 'Star2', 'Star3']
t['ra'] = [10.5, 11.2, 12.3] * u.degree
t['dec'] = [41.2, 42.1, 43.5] * u.degree
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
# Add column metadata
t['flux'].info.description = 'Flux at 1.4 GHz'
t['flux'].info.format = '.2f'
# Add new column
t['flux_mJy'] = t['flux'].to(u.mJy)
# Filter rows
bright = t[t['flux'] > 1.0 * u.Jy]
# Sort
t.sort('flux')
```
### Table I/O
```python
from astropy.table import Table
# Read various formats
t = Table.read('data.fits')
t = Table.read('data.csv', format='ascii.csv')
t = Table.read('data.ecsv', format='ascii.ecsv') # Preserves units
t = Table.read('data.votable', format='votable')
# Write various formats
t.write('output.fits', overwrite=True)
t.write('output.csv', format='ascii.csv', overwrite=True)
t.write('output.ecsv', format='ascii.ecsv', overwrite=True)
t.write('output.votable', format='votable', overwrite=True)
```
### Advanced Table Operations
```python
from astropy.table import Table, join, vstack, hstack
import numpy as np
# Join tables
t1 = Table([[1, 2], ['a', 'b']], names=['id', 'val1'])
t2 = Table([[1, 2], ['c', 'd']], names=['id', 'val2'])
joined = join(t1, t2, keys='id')
# Stack tables vertically
combined = vstack([t1, t2])
# Stack horizontally
combined = hstack([t1, t2])
# Grouping
t.group_by('category').groups.aggregate(np.mean)
```
### Tables with Astronomical Objects
```python
from astropy.table import Table
from astropy.coordinates import SkyCoord
from astropy.time import Time
import astropy.units as u
# Table with SkyCoord column
coords = SkyCoord(ra=[10, 11, 12]*u.deg, dec=[40, 41, 42]*u.deg)
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
t = Table([coords, times], names=['coords', 'obstime'])
# Access individual coordinates
print(t['coords'][0].ra)
print(t['coords'][0].dec)
```
## 6. Cosmological Calculations
### Distance Calculations
```python
from astropy.cosmology import Planck18, FlatLambdaCDM
import astropy.units as u
import numpy as np
# Use built-in cosmology
cosmo = Planck18
# Redshifts
z = np.linspace(0, 5, 50)
# Calculate distances
comoving_dist = cosmo.comoving_distance(z)
angular_diam_dist = cosmo.angular_diameter_distance(z)
luminosity_dist = cosmo.luminosity_distance(z)
# Age of universe
age_at_z = cosmo.age(z)
lookback_time = cosmo.lookback_time(z)
# Hubble parameter
H_z = cosmo.H(z)
```
### Converting Observables
```python
from astropy.cosmology import Planck18
import astropy.units as u
import numpy as np
cosmo = Planck18
z = 1.5
# Angular diameter distance
d_A = cosmo.angular_diameter_distance(z)
# Convert angular size to physical size
angular_size = 1 * u.arcsec
physical_size = (angular_size * d_A).to(u.kpc, equivalencies=u.dimensionless_angles())
# Convert flux to luminosity
flux = 1e-17 * u.erg / u.s / u.cm**2
d_L = cosmo.luminosity_distance(z)
luminosity = flux * 4 * np.pi * d_L**2
# Find redshift for given distance
from astropy.cosmology import z_at_value
z_result = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
```
### Custom Cosmology
```python
from astropy.cosmology import FlatLambdaCDM
import astropy.units as u
# Define custom cosmology
my_cosmo = FlatLambdaCDM(H0=70 * u.km/u.s/u.Mpc,
Om0=0.3,
Tcmb0=2.725 * u.K)
# Use it for calculations
print(my_cosmo.age(0))
print(my_cosmo.luminosity_distance(1.5))
```
## 7. Model Fitting
### Fitting 1D Models
```python
from astropy.modeling import models, fitting
import numpy as np
import matplotlib.pyplot as plt
# Generate data with noise
x = np.linspace(0, 10, 100)
true_model = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
y = true_model(x) + np.random.normal(0, 0.5, x.shape)
# Create and fit model
g_init = models.Gaussian1D(amplitude=8, mean=4.5, stddev=0.8)
fitter = fitting.LevMarLSQFitter()
g_fit = fitter(g_init, x, y)
# Plot results
plt.plot(x, y, 'o', label='Data')
plt.plot(x, g_fit(x), label='Fit')
plt.legend()
# Get fitted parameters
print(f"Amplitude: {g_fit.amplitude.value}")
print(f"Mean: {g_fit.mean.value}")
print(f"Stddev: {g_fit.stddev.value}")
```
### Fitting with Constraints
```python
from astropy.modeling import models, fitting
# Set parameter bounds
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
g.amplitude.bounds = (0, None) # Positive only
g.mean.bounds = (4, 6) # Constrain center
g.stddev.fixed = True # Fix width
# Tie parameters (for multi-component models): build the compound model first,
# then tie the second component's width to the first (compound-model parameters
# are suffixed _0, _1, ... and the tie callable receives the compound model)
g1 = models.Gaussian1D(amplitude=10, mean=5, stddev=1, name='g1')
g2 = models.Gaussian1D(amplitude=5, mean=6, stddev=1, name='g2')
model = g1 + g2
model.stddev_1.tied = lambda m: m.stddev_0
```
### 2D Image Fitting
```python
from astropy.modeling import models, fitting
import numpy as np
# Create 2D data
y, x = np.mgrid[0:100, 0:100]
z = models.Gaussian2D(amplitude=100, x_mean=50, y_mean=50,
x_stddev=5, y_stddev=5)(x, y)
z += np.random.normal(0, 5, z.shape)
# Fit 2D Gaussian
g_init = models.Gaussian2D(amplitude=90, x_mean=48, y_mean=48,
x_stddev=4, y_stddev=4)
fitter = fitting.LevMarLSQFitter()
g_fit = fitter(g_init, x, y, z)
# Get parameters
print(f"Center: ({g_fit.x_mean.value}, {g_fit.y_mean.value})")
print(f"Width: ({g_fit.x_stddev.value}, {g_fit.y_stddev.value})")
```
## 8. Image Processing and Visualization
### Image Display with Proper Scaling
```python
from astropy.io import fits
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt
# Read FITS image
data = fits.getdata('image.fits')
# Apply normalization
norm = ImageNormalize(data,
interval=PercentileInterval(99),
stretch=SqrtStretch())
# Display
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
plt.colorbar()
```
### WCS Plotting
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.visualization import ImageNormalize, LogStretch, PercentileInterval
import matplotlib.pyplot as plt
# Read FITS with WCS
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
data = hdu.data
# Create figure with WCS projection
fig = plt.figure()
ax = fig.add_subplot(111, projection=wcs)
# Plot with coordinate grid
norm = ImageNormalize(data, interval=PercentileInterval(99.5),
stretch=LogStretch())
im = ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
# Add coordinate labels
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
ax.coords.grid(color='white', alpha=0.5)
plt.colorbar(im)
```
### Sigma Clipping and Statistics
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
import numpy as np
# Data with outliers
data = np.random.normal(100, 15, 1000)
data[0:50] = np.random.normal(200, 10, 50) # Add outliers
# Sigma clipping
clipped = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics on clipped data
mean, median, std = sigma_clipped_stats(data, sigma=3)
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Std: {std:.2f}")
print(f"Clipped {clipped.mask.sum()} values")
```
## 9. Complete Analysis Example
### Photometry Pipeline
```python
from astropy.io import fits
from astropy.wcs import WCS
from astropy.coordinates import SkyCoord
from astropy.stats import sigma_clipped_stats
from astropy.visualization import ImageNormalize, LogStretch
import astropy.units as u
import numpy as np
# Read FITS file
hdu = fits.open('observation.fits')[0]
data = hdu.data
header = hdu.header
wcs = WCS(header)
# Calculate background statistics
mean, median, std = sigma_clipped_stats(data, sigma=3.0)
print(f"Background: {median:.2f} +/- {std:.2f}")
# Subtract background
data_sub = data - median
# Known source coordinates
source_coord = SkyCoord(ra='10:42:30', dec='+41:16:09', unit=(u.hourangle, u.deg))
# Convert to pixel coordinates
x_pix, y_pix = wcs.world_to_pixel(source_coord)
# Simple aperture photometry
aperture_radius = 10 # pixels
y, x = np.ogrid[:data.shape[0], :data.shape[1]]
mask = (x - x_pix)**2 + (y - y_pix)**2 <= aperture_radius**2
aperture_sum = np.sum(data_sub[mask])
npix = np.sum(mask)
print(f"Source position: ({x_pix:.1f}, {y_pix:.1f})")
print(f"Aperture sum: {aperture_sum:.2f}")
print(f"S/N: {aperture_sum / (std * np.sqrt(npix)):.2f}")
```
This workflow document provides practical examples for common astronomical data analysis tasks using astropy.

View File

@@ -0,0 +1,340 @@
# Astropy Module Overview
This document provides a comprehensive reference of all major astropy subpackages and their capabilities.
## Core Data Structures
### astropy.units
**Purpose**: Handle physical units and dimensional analysis in computations.
**Key Classes**:
- `Quantity` - Combines numerical values with units
- `Unit` - Represents physical units
**Common Operations**:
```python
import astropy.units as u
distance = 5 * u.meter
time = 2 * u.second
velocity = distance / time # Returns Quantity in m/s
wavelength = 500 * u.nm
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
```
**Equivalencies**:
- `u.spectral()` - Convert wavelength ↔ frequency
- `u.doppler_optical()`, `u.doppler_radio()` - Velocity conversions
- `u.temperature()` - Temperature unit conversions
- `u.pixel_scale()` - Pixel to physical units
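The `u.pixel_scale()` equivalency is handy for detector work; a minimal sketch assuming a 0.2 arcsec/pixel plate scale:
```python
import astropy.units as u
# Convert a separation measured in pixels to arcseconds
plate_scale = u.pixel_scale(0.2 * u.arcsec / u.pixel)
print((150 * u.pixel).to(u.arcsec, equivalencies=plate_scale))  # 30 arcsec
```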
### astropy.constants
**Purpose**: Provide physical and astronomical constants.
**Common Constants**:
- `c` - Speed of light
- `G` - Gravitational constant
- `h` - Planck constant
- `M_sun`, `R_sun`, `L_sun` - Solar mass, radius, luminosity
- `M_earth`, `R_earth` - Earth mass, radius
- `pc`, `au` - Parsec, astronomical unit
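Constants are returned as `Quantity` objects and can be combined directly with units; for example (a sketch, not from the original text):
```python
import numpy as np
import astropy.units as u
from astropy import constants as const
# Escape velocity at the Earth's surface: v = sqrt(2*G*M/R)
v_esc = np.sqrt(2 * const.G * const.M_earth / const.R_earth).to(u.km / u.s)
print(v_esc)  # ~11.2 km/s
```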
### astropy.time
**Purpose**: Represent and manipulate times and dates with astronomical precision.
**Time Scales**:
- `UTC` - Coordinated Universal Time
- `TAI` - International Atomic Time
- `TT` - Terrestrial Time
- `TCB`, `TCG` - Barycentric/Geocentric Coordinate Time
- `TDB` - Barycentric Dynamical Time
- `UT1` - Universal Time
**Common Formats**:
- `iso`, `isot` - ISO 8601 strings
- `jd`, `mjd` - Julian/Modified Julian Date
- `unix`, `gps` - Unix/GPS timestamps
- `datetime` - Python datetime objects
**Example**:
```python
from astropy.time import Time
t = Time('2023-01-01T00:00:00', format='isot', scale='utc')
print(t.mjd) # Modified Julian Date
print(t.jd) # Julian Date
print(t.tt) # Convert to TT scale
```
### astropy.table
**Purpose**: Work with tabular data optimized for astronomical applications.
**Key Features**:
- Native support for astropy Quantity, Time, and SkyCoord columns
- Multi-dimensional columns
- Metadata preservation (units, descriptions, formats)
- Advanced operations: joins, grouping, binning
- File I/O via unified interface
**Example**:
```python
from astropy.table import Table
import astropy.units as u
t = Table()
t['name'] = ['Star1', 'Star2', 'Star3']
t['ra'] = [10.5, 11.2, 12.3] * u.degree
t['dec'] = [41.2, 42.1, 43.5] * u.degree
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
```
## Coordinates and World Coordinate Systems
### astropy.coordinates
**Purpose**: Represent and transform celestial coordinates.
**Primary Interface**: `SkyCoord` - High-level class for sky positions
**Coordinate Frames**:
- `ICRS` - International Celestial Reference System (default)
- `FK5`, `FK4` - Fifth/Fourth Fundamental Katalog
- `Galactic`, `Supergalactic` - Galactic coordinates
- `AltAz` - Horizontal (altitude-azimuth) coordinates
- `GCRS`, `CIRS`, `ITRS` - Earth-based systems
- `BarycentricMeanEcliptic`, `HeliocentricMeanEcliptic`, `GeocentricMeanEcliptic` - Ecliptic coordinates
**Common Operations**:
```python
from astropy.coordinates import SkyCoord
import astropy.units as u
# Create coordinate
c = SkyCoord(ra=10.625*u.degree, dec=41.2*u.degree, frame='icrs')
# Transform to galactic
c_gal = c.galactic
# Calculate separation
c2 = SkyCoord(ra=11*u.degree, dec=42*u.degree, frame='icrs')
sep = c.separation(c2)
# Match catalogs
idx, sep2d, dist3d = c.match_to_catalog_sky(catalog_coords)
```
### astropy.wcs
**Purpose**: Handle World Coordinate System transformations for astronomical images.
**Key Class**: `WCS` - Maps between pixel and world coordinates
**Common Use Cases**:
- Convert pixel coordinates to RA/Dec
- Convert RA/Dec to pixel coordinates
- Handle distortion corrections (SIP, lookup tables)
**Example**:
```python
from astropy.wcs import WCS
from astropy.io import fits
hdu = fits.open('image.fits')[0]
wcs = WCS(hdu.header)
# Pixel to world
ra, dec = wcs.pixel_to_world_values(100, 200)
# World to pixel
x, y = wcs.world_to_pixel_values(ra, dec)
```
## File I/O
### astropy.io.fits
**Purpose**: Read and write FITS (Flexible Image Transport System) files.
**Key Classes**:
- `HDUList` - Container for all HDUs in a file
- `PrimaryHDU` - Primary header data unit
- `ImageHDU` - Image extension
- `BinTableHDU` - Binary table extension
- `Header` - FITS header keywords
**Common Operations**:
```python
from astropy.io import fits
# Read FITS file
with fits.open('file.fits') as hdul:
hdul.info() # Display structure
header = hdul[0].header
data = hdul[0].data
# Write FITS file
fits.writeto('output.fits', data, header)
# Update header keyword
fits.setval('file.fits', 'OBJECT', value='M31')
```
### astropy.io.ascii
**Purpose**: Read and write ASCII tables in various formats.
**Supported Formats**:
- Basic, CSV, tab-delimited
- CDS/MRT (Machine Readable Tables)
- IPAC, Daophot, SExtractor
- LaTeX tables
- HTML tables
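Reading is done either through `Table.read(..., format='ascii.<name>')` or the `ascii.read()` function; a small sketch (the file name is illustrative):
```python
from astropy.io import ascii
# Read an IPAC-format table; the reader parses column names, types, and units
tbl = ascii.read('photometry.tbl', format='ipac')
print(tbl.colnames)
```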
### astropy.io.votable
**Purpose**: Handle Virtual Observatory (VO) table format.
### astropy.io.misc
**Purpose**: Additional formats including HDF5, Parquet, and YAML.
## Scientific Calculations
### astropy.cosmology
**Purpose**: Perform cosmological calculations.
**Common Models**:
- `FlatLambdaCDM` - Flat universe with cosmological constant (most common)
- `LambdaCDM` - Universe with cosmological constant
- `Planck18`, `Planck15`, `Planck13` - Pre-defined Planck parameters
- `WMAP9`, `WMAP7`, `WMAP5` - Pre-defined WMAP parameters
**Common Methods**:
```python
from astropy.cosmology import FlatLambdaCDM, Planck18
import astropy.units as u
cosmo = FlatLambdaCDM(H0=70, Om0=0.3)
# Or use built-in: cosmo = Planck18
z = 1.5
print(cosmo.age(z)) # Age of universe at z
print(cosmo.luminosity_distance(z)) # Luminosity distance
print(cosmo.angular_diameter_distance(z)) # Angular diameter distance
print(cosmo.comoving_distance(z)) # Comoving distance
print(cosmo.H(z)) # Hubble parameter at z
```
### astropy.modeling
**Purpose**: Framework for model evaluation and fitting.
**Model Categories**:
- 1D models: Gaussian1D, Lorentz1D, Voigt1D, Polynomial1D
- 2D models: Gaussian2D, Disk2D, Moffat2D
- Physical models: BlackBody, Drude1D, NFW
- Polynomial models: Chebyshev1D, Legendre1D (and their 2D counterparts)
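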
**Common Fitters**:
- `LinearLSQFitter` - Linear least squares
- `LevMarLSQFitter` - Levenberg-Marquardt
- `SimplexLSQFitter` - Downhill simplex
**Example**:
```python
from astropy.modeling import models, fitting
# Create model
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
# Fit to data
fitter = fitting.LevMarLSQFitter()
fitted_model = fitter(g, x_data, y_data)
```
### astropy.convolution
**Purpose**: Convolve and filter astronomical data.
**Common Kernels**:
- `Gaussian2DKernel` - 2D Gaussian smoothing
- `Box2DKernel` - 2D boxcar smoothing
- `Tophat2DKernel` - 2D tophat filter
- Custom kernels via arrays
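`astropy.convolution.convolve` interpolates over NaN values by default, which is its main practical advantage for astronomical images; a minimal sketch:
```python
import numpy as np
from astropy.convolution import convolve, Gaussian2DKernel
# Smooth a noisy image with a 2-pixel-sigma Gaussian; the NaN is interpolated over
image = np.random.normal(10, 1, (64, 64))
image[30, 30] = np.nan
smoothed = convolve(image, Gaussian2DKernel(x_stddev=2))
print(np.isnan(smoothed).any())  # False
```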
### astropy.stats
**Purpose**: Statistical tools for astronomical data analysis.
**Key Functions**:
- `sigma_clip()` - Remove outliers via sigma clipping
- `sigma_clipped_stats()` - Compute mean, median, std with clipping
- `mad_std()` - Median Absolute Deviation
- `biweight_location()`, `biweight_scale()` - Robust statistics
- `circmean()`, `circstd()` - Circular statistics
**Example**:
```python
from astropy.stats import sigma_clip, sigma_clipped_stats
# Remove outliers
filtered_data = sigma_clip(data, sigma=3, maxiters=5)
# Get statistics
mean, median, std = sigma_clipped_stats(data, sigma=3)
```
## Data Processing
### astropy.nddata
**Purpose**: Handle N-dimensional datasets with metadata.
**Key Class**: `NDData` - Container for array data with units, uncertainty, mask, and WCS
### astropy.timeseries
**Purpose**: Work with time series data.
**Key Classes**:
- `TimeSeries` - Time-indexed data table
- `BinnedTimeSeries` - Time-binned data
**Common Operations**:
- Period finding (Lomb-Scargle)
- Folding time series
- Binning data
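A brief sketch of a Lomb-Scargle period search on synthetic, unevenly sampled data (not from the original text):
```python
import numpy as np
import astropy.units as u
from astropy.timeseries import LombScargle
# Irregularly sampled sinusoid with a 2.5-day period
t = np.sort(np.random.uniform(0, 30, 200)) * u.day
y = 1.0 + 0.3 * np.sin(2 * np.pi * t / (2.5 * u.day))
frequency, power = LombScargle(t, y).autopower()
print((1 / frequency[np.argmax(power)]).to(u.day))  # ~2.5 d
```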
### astropy.visualization
**Purpose**: Display astronomical data effectively.
**Key Features**:
- Image normalization (LogStretch, PowerStretch, SqrtStretch, etc.)
- Interval scaling (MinMaxInterval, PercentileInterval, ZScaleInterval)
- WCSAxes for plotting with coordinate overlays
- RGB image creation with stretching
- Astronomical colormaps
**Example**:
```python
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt
norm = ImageNormalize(data, interval=PercentileInterval(99),
stretch=SqrtStretch())
plt.imshow(data, norm=norm, origin='lower')
```
## Utilities
### astropy.samp
**Purpose**: Simple Application Messaging Protocol for inter-application communication.
**Use Case**: Connect Python scripts with other astronomical tools (e.g., DS9, TOPCAT).
## Module Import Patterns
**Standard imports**:
```python
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.time import Time
from astropy.io import fits
from astropy.table import Table
from astropy import constants as const
```
## Performance Tips
1. **Pre-compute composite units** for repeated operations
2. **Use `<<` operator** for fast unit assignments: `array << u.m` instead of `array * u.m`
3. **Vectorize operations** rather than looping over coordinates/times
4. **Use memmap=True** when opening large FITS files
5. **Install Bottleneck** for faster stats operations

View File

@@ -0,0 +1,226 @@
#!/usr/bin/env python3
"""
Coordinate conversion utility for astronomical coordinates.
This script provides batch coordinate transformations between different
astronomical coordinate systems using astropy.
"""
import sys
import argparse
from astropy.coordinates import SkyCoord
import astropy.units as u
def convert_coordinates(coords_input, input_frame='icrs', output_frame='galactic',
input_format='decimal', output_format='decimal'):
"""
Convert astronomical coordinates between different frames.
Parameters
----------
coords_input : list of tuples or str
Input coordinates as (lon, lat) pairs or strings
input_frame : str
Input coordinate frame (icrs, fk5, galactic, etc.)
output_frame : str
Output coordinate frame
input_format : str
Format of input coordinates ('decimal', 'sexagesimal', 'hourangle')
output_format : str
Format for output display ('decimal', 'sexagesimal', 'hourangle')
Returns
-------
list
Converted coordinates
"""
results = []
for coord in coords_input:
try:
# Parse input coordinate
if input_format == 'decimal':
if isinstance(coord, str):
parts = coord.split()
lon, lat = float(parts[0]), float(parts[1])
else:
lon, lat = coord
c = SkyCoord(lon*u.degree, lat*u.degree, frame=input_frame)
elif input_format == 'sexagesimal':
c = SkyCoord(coord, frame=input_frame, unit=(u.hourangle, u.deg))
elif input_format == 'hourangle':
if isinstance(coord, str):
parts = coord.split()
lon, lat = parts[0], parts[1]
else:
lon, lat = coord
c = SkyCoord(lon, lat, frame=input_frame, unit=(u.hourangle, u.deg))
# Transform to output frame
if output_frame == 'icrs':
c_out = c.icrs
elif output_frame == 'fk5':
c_out = c.fk5
elif output_frame == 'fk4':
c_out = c.fk4
elif output_frame == 'galactic':
c_out = c.galactic
elif output_frame == 'supergalactic':
c_out = c.supergalactic
else:
c_out = c.transform_to(output_frame)
results.append(c_out)
except Exception as e:
print(f"Error converting coordinate {coord}: {e}", file=sys.stderr)
results.append(None)
return results
def format_output(coords, frame, output_format='decimal'):
"""Format coordinates for display."""
output = []
for c in coords:
if c is None:
output.append("ERROR")
continue
if frame in ['icrs', 'fk5', 'fk4']:
lon_name, lat_name = 'RA', 'Dec'
lon = c.ra
lat = c.dec
elif frame == 'galactic':
lon_name, lat_name = 'l', 'b'
lon = c.l
lat = c.b
elif frame == 'supergalactic':
lon_name, lat_name = 'sgl', 'sgb'
lon = c.sgl
lat = c.sgb
else:
lon_name, lat_name = 'lon', 'lat'
lon = c.spherical.lon
lat = c.spherical.lat
if output_format == 'decimal':
out_str = f"{lon.degree:12.8f} {lat.degree:+12.8f}"
elif output_format == 'sexagesimal':
if frame in ['icrs', 'fk5', 'fk4']:
out_str = f"{lon.to_string(unit=u.hourangle, sep=':', pad=True)} "
out_str += f"{lat.to_string(unit=u.degree, sep=':', pad=True)}"
else:
out_str = f"{lon.to_string(unit=u.degree, sep=':', pad=True)} "
out_str += f"{lat.to_string(unit=u.degree, sep=':', pad=True)}"
elif output_format == 'hourangle':
out_str = f"{lon.to_string(unit=u.hourangle, sep=' ', pad=True)} "
out_str += f"{lat.to_string(unit=u.degree, sep=' ', pad=True)}"
output.append(out_str)
return output
def main():
"""Main function for command-line usage."""
parser = argparse.ArgumentParser(
description='Convert astronomical coordinates between different frames',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Supported frames: icrs, fk5, fk4, galactic, supergalactic
Input formats:
decimal : Degrees (e.g., "10.68 41.27")
sexagesimal : HMS/DMS (e.g., "00:42:44.3 +41:16:09")
hourangle : Hours and degrees (e.g., "10.5h 41.5d")
Examples:
%(prog)s --from icrs --to galactic "10.68 41.27"
%(prog)s --from icrs --to galactic --input decimal --output sexagesimal "150.5 -30.2"
%(prog)s --from galactic --to icrs "120.5 45.3"
%(prog)s --file coords.txt --from icrs --to galactic
"""
)
parser.add_argument('coordinates', nargs='*',
help='Coordinates to convert (lon lat pairs)')
parser.add_argument('-f', '--from', dest='input_frame', default='icrs',
help='Input coordinate frame (default: icrs)')
parser.add_argument('-t', '--to', dest='output_frame', default='galactic',
help='Output coordinate frame (default: galactic)')
parser.add_argument('-i', '--input', dest='input_format', default='decimal',
choices=['decimal', 'sexagesimal', 'hourangle'],
help='Input format (default: decimal)')
parser.add_argument('-o', '--output', dest='output_format', default='decimal',
choices=['decimal', 'sexagesimal', 'hourangle'],
help='Output format (default: decimal)')
parser.add_argument('--file', dest='input_file',
help='Read coordinates from file (one per line)')
parser.add_argument('--header', action='store_true',
help='Print header line with coordinate names')
args = parser.parse_args()
# Get coordinates from file or command line
if args.input_file:
try:
with open(args.input_file, 'r') as f:
coords = [line.strip() for line in f if line.strip()]
except FileNotFoundError:
print(f"Error: File '{args.input_file}' not found.", file=sys.stderr)
sys.exit(1)
else:
if not args.coordinates:
print("Error: No coordinates provided.", file=sys.stderr)
parser.print_help()
sys.exit(1)
# Combine pairs of arguments
if args.input_format == 'decimal':
coords = []
i = 0
while i < len(args.coordinates):
if i + 1 < len(args.coordinates):
coords.append(f"{args.coordinates[i]} {args.coordinates[i+1]}")
i += 2
else:
print(f"Warning: Odd number of coordinates, skipping last value",
file=sys.stderr)
break
else:
coords = args.coordinates
# Convert coordinates
converted = convert_coordinates(coords,
input_frame=args.input_frame,
output_frame=args.output_frame,
input_format=args.input_format,
output_format=args.output_format)
# Format and print output
formatted = format_output(converted, args.output_frame, args.output_format)
# Print header if requested
if args.header:
if args.output_frame in ['icrs', 'fk5', 'fk4']:
if args.output_format == 'decimal':
print(f"{'RA (deg)':>12s} {'Dec (deg)':>13s}")
else:
print(f"{'RA':>25s} {'Dec':>26s}")
elif args.output_frame == 'galactic':
if args.output_format == 'decimal':
print(f"{'l (deg)':>12s} {'b (deg)':>13s}")
else:
print(f"{'l':>25s} {'b':>26s}")
for line in formatted:
print(line)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,250 @@
#!/usr/bin/env python3
"""
Cosmological calculator using astropy.cosmology.
This script provides quick calculations of cosmological distances,
ages, and other quantities for given redshifts.
"""
import sys
import argparse
import numpy as np
from astropy.cosmology import FlatLambdaCDM, Planck18, Planck15, WMAP9
import astropy.units as u
def calculate_cosmology(redshifts, cosmology='Planck18', H0=None, Om0=None):
"""
Calculate cosmological quantities for given redshifts.
Parameters
----------
redshifts : array-like
Redshift values
cosmology : str
Cosmology to use ('Planck18', 'Planck15', 'WMAP9', 'custom')
H0 : float, optional
Hubble constant for custom cosmology (km/s/Mpc)
Om0 : float, optional
Matter density parameter for custom cosmology
Returns
-------
dict
Dictionary containing calculated quantities
"""
# Select cosmology
if cosmology == 'Planck18':
cosmo = Planck18
elif cosmology == 'Planck15':
cosmo = Planck15
elif cosmology == 'WMAP9':
cosmo = WMAP9
elif cosmology == 'custom':
if H0 is None or Om0 is None:
raise ValueError("Must provide H0 and Om0 for custom cosmology")
cosmo = FlatLambdaCDM(H0=H0 * u.km/u.s/u.Mpc, Om0=Om0)
else:
raise ValueError(f"Unknown cosmology: {cosmology}")
z = np.atleast_1d(redshifts)
results = {
'redshift': z,
'cosmology': str(cosmo),
'luminosity_distance': cosmo.luminosity_distance(z),
'angular_diameter_distance': cosmo.angular_diameter_distance(z),
'comoving_distance': cosmo.comoving_distance(z),
'comoving_volume': cosmo.comoving_volume(z),
'age': cosmo.age(z),
'lookback_time': cosmo.lookback_time(z),
'H': cosmo.H(z),
'scale_factor': 1.0 / (1.0 + z)
}
return results, cosmo
def print_results(results, verbose=False, csv=False):
"""Print calculation results."""
z = results['redshift']
if csv:
# CSV output
print("z,D_L(Mpc),D_A(Mpc),D_C(Mpc),Age(Gyr),t_lookback(Gyr),H(km/s/Mpc)")
for i in range(len(z)):
print(f"{z[i]:.6f},"
f"{results['luminosity_distance'][i].value:.6f},"
f"{results['angular_diameter_distance'][i].value:.6f},"
f"{results['comoving_distance'][i].value:.6f},"
f"{results['age'][i].value:.6f},"
f"{results['lookback_time'][i].value:.6f},"
f"{results['H'][i].value:.6f}")
else:
# Formatted table output
if verbose:
print(f"\nCosmology: {results['cosmology']}")
print("-" * 80)
print(f"\n{'z':>8s} {'D_L':>12s} {'D_A':>12s} {'D_C':>12s} "
f"{'Age':>10s} {'t_lb':>10s} {'H(z)':>10s}")
print(f"{'':>8s} {'(Mpc)':>12s} {'(Mpc)':>12s} {'(Mpc)':>12s} "
f"{'(Gyr)':>10s} {'(Gyr)':>10s} {'(km/s/Mpc)':>10s}")
print("-" * 80)
for i in range(len(z)):
print(f"{z[i]:8.4f} "
f"{results['luminosity_distance'][i].value:12.3f} "
f"{results['angular_diameter_distance'][i].value:12.3f} "
f"{results['comoving_distance'][i].value:12.3f} "
f"{results['age'][i].value:10.4f} "
f"{results['lookback_time'][i].value:10.4f} "
f"{results['H'][i].value:10.4f}")
if verbose:
print("\nLegend:")
print(" z : Redshift")
print(" D_L : Luminosity distance")
print(" D_A : Angular diameter distance")
print(" D_C : Comoving distance")
print(" Age : Age of universe at z")
print(" t_lb : Lookback time to z")
print(" H(z) : Hubble parameter at z")
def convert_quantity(value, quantity_type, cosmo, to_redshift=False):
"""
Convert between redshift and cosmological quantity.
Parameters
----------
value : float
Value to convert
quantity_type : str
Type of quantity ('luminosity_distance', 'age', etc.)
cosmo : Cosmology
Cosmology object
to_redshift : bool
If True, convert quantity to redshift; else convert z to quantity
"""
from astropy.cosmology import z_at_value
if to_redshift:
# Convert quantity to redshift
if quantity_type == 'luminosity_distance':
z = z_at_value(cosmo.luminosity_distance, value * u.Mpc)
elif quantity_type == 'age':
z = z_at_value(cosmo.age, value * u.Gyr)
elif quantity_type == 'lookback_time':
z = z_at_value(cosmo.lookback_time, value * u.Gyr)
elif quantity_type == 'comoving_distance':
z = z_at_value(cosmo.comoving_distance, value * u.Mpc)
else:
raise ValueError(f"Unknown quantity type: {quantity_type}")
return z
else:
# Convert redshift to quantity
if quantity_type == 'luminosity_distance':
return cosmo.luminosity_distance(value)
elif quantity_type == 'age':
return cosmo.age(value)
elif quantity_type == 'lookback_time':
return cosmo.lookback_time(value)
elif quantity_type == 'comoving_distance':
return cosmo.comoving_distance(value)
else:
raise ValueError(f"Unknown quantity type: {quantity_type}")
def main():
"""Main function for command-line usage."""
parser = argparse.ArgumentParser(
description='Calculate cosmological quantities for given redshifts',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Available cosmologies: Planck18, Planck15, WMAP9, custom
Examples:
%(prog)s 0.5 1.0 1.5
%(prog)s 0.5 --cosmology Planck15
%(prog)s 0.5 --cosmology custom --H0 70 --Om0 0.3
%(prog)s --range 0 3 0.5
%(prog)s 0.5 --verbose
%(prog)s 0.5 1.0 --csv
%(prog)s --convert 1000 --from luminosity_distance --cosmology Planck18
"""
)
parser.add_argument('redshifts', nargs='*', type=float,
help='Redshift values to calculate')
parser.add_argument('-c', '--cosmology', default='Planck18',
choices=['Planck18', 'Planck15', 'WMAP9', 'custom'],
help='Cosmology to use (default: Planck18)')
parser.add_argument('--H0', type=float,
help='Hubble constant for custom cosmology (km/s/Mpc)')
parser.add_argument('--Om0', type=float,
help='Matter density parameter for custom cosmology')
parser.add_argument('-r', '--range', nargs=3, type=float, metavar=('START', 'STOP', 'STEP'),
help='Generate redshift range (start stop step)')
parser.add_argument('-v', '--verbose', action='store_true',
help='Print verbose output with cosmology details')
parser.add_argument('--csv', action='store_true',
help='Output in CSV format')
parser.add_argument('--convert', type=float,
help='Convert a quantity to redshift')
parser.add_argument('--from', dest='from_quantity',
choices=['luminosity_distance', 'age', 'lookback_time', 'comoving_distance'],
help='Type of quantity to convert from')
args = parser.parse_args()
# Handle conversion mode
if args.convert is not None:
if args.from_quantity is None:
print("Error: Must specify --from when using --convert", file=sys.stderr)
sys.exit(1)
# Get cosmology
if args.cosmology == 'Planck18':
cosmo = Planck18
elif args.cosmology == 'Planck15':
cosmo = Planck15
elif args.cosmology == 'WMAP9':
cosmo = WMAP9
elif args.cosmology == 'custom':
if args.H0 is None or args.Om0 is None:
print("Error: Must provide --H0 and --Om0 for custom cosmology",
file=sys.stderr)
sys.exit(1)
cosmo = FlatLambdaCDM(H0=args.H0 * u.km/u.s/u.Mpc, Om0=args.Om0)
z = convert_quantity(args.convert, args.from_quantity, cosmo, to_redshift=True)
print(f"\n{args.from_quantity.replace('_', ' ').title()} = {args.convert}")
print(f"Redshift z = {z:.6f}")
print(f"(using {args.cosmology} cosmology)")
return
# Get redshifts
if args.range:
start, stop, step = args.range
redshifts = np.arange(start, stop + step/2, step)
elif args.redshifts:
redshifts = np.array(args.redshifts)
else:
print("Error: No redshifts provided.", file=sys.stderr)
parser.print_help()
sys.exit(1)
# Calculate
try:
results, cosmo = calculate_cosmology(redshifts, args.cosmology,
H0=args.H0, Om0=args.Om0)
print_results(results, verbose=args.verbose, csv=args.csv)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,189 @@
#!/usr/bin/env python3
"""
Quick FITS file inspection tool.
This script provides a convenient way to inspect FITS file structure,
headers, and basic statistics without writing custom code each time.
"""
import sys
from pathlib import Path
from astropy.io import fits
import numpy as np
def print_fits_info(filename, detailed=False, ext=None):
"""
Print comprehensive information about a FITS file.
Parameters
----------
filename : str
Path to FITS file
detailed : bool
If True, print detailed statistics for each HDU
ext : int or str, optional
Specific extension to examine in detail
"""
print(f"\n{'='*70}")
print(f"FITS File: {filename}")
print(f"{'='*70}\n")
try:
with fits.open(filename) as hdul:
# Print file structure
print("File Structure:")
print("-" * 70)
hdul.info()
print()
# If specific extension requested
if ext is not None:
print(f"\nDetailed view of extension: {ext}")
print("-" * 70)
hdu = hdul[ext]
print_hdu_details(hdu, detailed=True)
return
# Print header and data info for each HDU
for i, hdu in enumerate(hdul):
print(f"\n{'='*70}")
print(f"HDU {i}: {hdu.name}")
print(f"{'='*70}")
print_hdu_details(hdu, detailed=detailed)
except FileNotFoundError:
print(f"Error: File '{filename}' not found.")
sys.exit(1)
except Exception as e:
print(f"Error reading FITS file: {e}")
sys.exit(1)
def print_hdu_details(hdu, detailed=False):
"""Print details for a single HDU."""
# Header information
print("\nHeader Information:")
print("-" * 70)
# Key header keywords
important_keywords = ['SIMPLE', 'BITPIX', 'NAXIS', 'EXTEND',
'OBJECT', 'TELESCOP', 'INSTRUME', 'OBSERVER',
'DATE-OBS', 'EXPTIME', 'FILTER', 'AIRMASS',
'RA', 'DEC', 'EQUINOX', 'CTYPE1', 'CTYPE2']
header = hdu.header
for key in important_keywords:
if key in header:
value = header[key]
comment = header.comments[key]
print(f" {key:12s} = {str(value):20s} / {comment}")
# NAXIS keywords
if 'NAXIS' in header:
naxis = header['NAXIS']
for i in range(1, naxis + 1):
key = f'NAXIS{i}'
if key in header:
print(f" {key:12s} = {str(header[key]):20s} / {header.comments[key]}")
# Data information
if hdu.data is not None:
print("\nData Information:")
print("-" * 70)
data = hdu.data
print(f" Data type: {data.dtype}")
print(f" Shape: {data.shape}")
# For image data
if hasattr(data, 'ndim') and data.ndim >= 1:
try:
# Calculate statistics
finite_data = data[np.isfinite(data)]
if len(finite_data) > 0:
print(f" Min: {np.min(finite_data):.6g}")
print(f" Max: {np.max(finite_data):.6g}")
print(f" Mean: {np.mean(finite_data):.6g}")
print(f" Median: {np.median(finite_data):.6g}")
print(f" Std: {np.std(finite_data):.6g}")
# Count special values
n_nan = np.sum(np.isnan(data))
n_inf = np.sum(np.isinf(data))
if n_nan > 0:
print(f" NaN values: {n_nan}")
if n_inf > 0:
print(f" Inf values: {n_inf}")
except Exception as e:
print(f" Could not calculate statistics: {e}")
# For table data
if hasattr(data, 'columns'):
print(f"\n Table Columns ({len(data.columns)}):")
for col in data.columns:
print(f" {col.name:20s} {col.format:10s} {col.unit or ''}")
if detailed:
print(f"\n First few rows:")
print(data[:min(5, len(data))])
else:
print("\n No data in this HDU")
# WCS information if present
try:
from astropy.wcs import WCS
wcs = WCS(hdu.header)
if wcs.has_celestial:
print("\nWCS Information:")
print("-" * 70)
print(f" Has celestial WCS: Yes")
print(f" CTYPE: {wcs.wcs.ctype}")
if wcs.wcs.crval is not None:
print(f" CRVAL: {wcs.wcs.crval}")
if wcs.wcs.crpix is not None:
print(f" CRPIX: {wcs.wcs.crpix}")
if wcs.wcs.cdelt is not None:
print(f" CDELT: {wcs.wcs.cdelt}")
except Exception:
pass
def main():
"""Main function for command-line usage."""
import argparse
parser = argparse.ArgumentParser(
description='Inspect FITS file structure and contents',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s image.fits
%(prog)s image.fits --detailed
%(prog)s image.fits --ext 1
%(prog)s image.fits --ext SCI
"""
)
parser.add_argument('filename', help='FITS file to inspect')
parser.add_argument('-d', '--detailed', action='store_true',
help='Show detailed statistics for each HDU')
parser.add_argument('-e', '--ext', type=str, default=None,
help='Show details for specific extension only (number or name)')
args = parser.parse_args()
# Convert extension to int if numeric
ext = args.ext
if ext is not None:
try:
ext = int(ext)
except ValueError:
pass # Keep as string for extension name
print_fits_info(args.filename, detailed=args.detailed, ext=ext)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,375 @@
---
name: biomni
description: General-purpose biomedical AI agent for autonomously executing research tasks across diverse biomedical domains. Use this skill when working with biomedical data analysis, CRISPR screening, single-cell RNA-seq, molecular property prediction, genomics, proteomics, drug discovery, or any computational biology task requiring LLM-powered code generation and retrieval-augmented planning.
---
# Biomni
## Overview
Biomni is a general-purpose biomedical AI agent that autonomously executes research tasks across diverse biomedical subfields. It combines large language model reasoning with retrieval-augmented planning and code-based execution to enhance scientific productivity and hypothesis generation. The system operates with an ~11GB biomedical knowledge base covering molecular, genomic, and clinical domains.
## Quick Start
Initialize and use the Biomni agent with these basic steps:
```python
from biomni.agent import A1
# Initialize agent with data path and LLM model
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
# Execute a biomedical research task
agent.go("Your biomedical task description")
```
The agent will autonomously decompose the task, retrieve relevant biomedical knowledge, generate and execute code, and provide results.
## Installation and Setup
### Environment Preparation
1. **Set up the conda environment:**
- Follow instructions in `biomni_env/README.md` from the repository
- Activate the environment: `conda activate biomni_e1`
2. **Install the package:**
```bash
pip install biomni --upgrade
```
Or install from source:
```bash
git clone https://github.com/snap-stanford/biomni.git
cd biomni
pip install -e .
```
3. **Configure API keys:**
Set up credentials via environment variables or `.env` file:
```bash
export ANTHROPIC_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here" # Optional
```
4. **Data initialization:**
On first use, the agent will automatically download the ~11GB biomedical knowledge base.
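If the automatic download is interrupted or needs refreshing, the knowledge base can also be fetched explicitly with the utility documented in `references/api_reference.md`:
```python
from biomni.utils import download_data
# Pre-fetch or force-refresh the ~11GB knowledge base before the first agent run
download_data(path='./data', force=False)
```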
### LLM Provider Configuration
Biomni supports multiple LLM providers. Configure the default provider using:
```python
from biomni.config import default_config
# Set the default LLM model
default_config.llm = "claude-sonnet-4-20250514" # Anthropic
# default_config.llm = "gpt-4" # OpenAI
# default_config.llm = "azure/gpt-4" # Azure OpenAI
# default_config.llm = "gemini/gemini-pro" # Google Gemini
# Set timeout (optional)
default_config.timeout_seconds = 1200
# Set data path (optional)
default_config.data_path = "./custom/data/path"
```
Refer to `references/llm_providers.md` for detailed configuration options for each provider.
## Core Biomedical Research Tasks
### 1. CRISPR Screening and Design
Execute CRISPR screening tasks including guide RNA design, off-target analysis, and screening experiment planning:
```python
agent.go("Design a CRISPR screening experiment to identify genes involved in cancer cell resistance to drug X")
```
The agent will:
- Retrieve relevant gene databases
- Design guide RNAs with specificity analysis
- Plan experimental controls and readout strategies
- Generate analysis code for screening results
### 2. Single-Cell RNA-seq Analysis
Perform comprehensive scRNA-seq analysis workflows:
```python
agent.go("Analyze this 10X Genomics scRNA-seq dataset, identify cell types, and find differentially expressed genes between clusters")
```
Capabilities include:
- Quality control and preprocessing
- Dimensionality reduction and clustering
- Cell type annotation using marker databases
- Differential expression analysis
- Pathway enrichment analysis
### 3. Molecular Property Prediction (ADMET)
Predict absorption, distribution, metabolism, excretion, and toxicity properties:
```python
agent.go("Predict ADMET properties for these drug candidates: [SMILES strings]")
```
The agent handles:
- Molecular descriptor calculation
- Property prediction using integrated models
- Toxicity screening
- Drug-likeness assessment
### 4. Genomic Analysis
Execute genomic data analysis tasks:
```python
agent.go("Perform GWAS analysis to identify SNPs associated with disease phenotype in this cohort")
```
Supports:
- Genome-wide association studies (GWAS)
- Variant calling and annotation
- Population genetics analysis
- Functional genomics integration
### 5. Protein Structure and Function
Analyze protein sequences and structures:
```python
agent.go("Predict the structure of this protein sequence and identify potential binding sites")
```
Capabilities:
- Sequence analysis and domain identification
- Structure prediction integration
- Binding site prediction
- Protein-protein interaction analysis
### 6. Disease Diagnosis and Classification
Perform disease classification from multi-omics data:
```python
agent.go("Build a classifier to diagnose disease X from patient RNA-seq and clinical data")
```
### 7. Systems Biology and Pathway Analysis
Analyze biological pathways and networks:
```python
agent.go("Identify dysregulated pathways in this differential expression dataset")
```
### 8. Drug Discovery and Repurposing
Support drug discovery workflows:
```python
agent.go("Identify FDA-approved drugs that could be repurposed for treating disease Y based on mechanism of action")
```
## Advanced Features
### Custom Configuration per Agent
Override global configuration for specific agent instances:
```python
agent = A1(
path='./project_data',
llm='gpt-4o',
timeout=1800
)
```
### Conversation History and Reporting
Save execution traces as formatted PDF reports:
```python
# After executing tasks
agent.save_conversation_history(
output_path='./reports/experiment_log.pdf',
format='pdf'
)
```
Requires one of: WeasyPrint, markdown2pdf, or Pandoc.
### Model Context Protocol (MCP) Integration
Extend agent capabilities with external tools:
```python
# Add MCP-compatible tools
agent.add_mcp(config_path='./mcp_config.json')
```
MCP enables integration with:
- Laboratory information management systems (LIMS)
- Specialized bioinformatics databases
- Custom analysis pipelines
- External computational resources
### Using Biomni-R0 (Specialized Reasoning Model)
Deploy the 32B parameter Biomni-R0 model for enhanced biological reasoning:
```bash
# Install SGLang
pip install "sglang[all]"
# Deploy Biomni-R0
python -m sglang.launch_server \
--model-path snap-stanford/biomni-r0 \
--port 30000 \
--trust-remote-code
```
Then configure the agent:
```python
from biomni.config import default_config
default_config.llm = "openai/biomni-r0"
default_config.api_base = "http://localhost:30000/v1"
```
Biomni-R0 provides specialized reasoning for:
- Complex multi-step biological workflows
- Hypothesis generation and evaluation
- Experimental design optimization
- Literature-informed analysis
## Best Practices
### Task Specification
Provide clear, specific task descriptions:
✅ **Good:** "Analyze this scRNA-seq dataset (file: data.h5ad) to identify T cell subtypes, then perform differential expression analysis comparing activated vs. resting T cells"
❌ **Vague:** "Analyze my RNA-seq data"
### Data Organization
Structure data directories for efficient retrieval:
```
project/
├── data/ # Biomni knowledge base
├── raw_data/ # Your experimental data
├── results/ # Analysis outputs
└── reports/ # Generated reports
```
### Iterative Refinement
Use iterative task execution for complex analyses:
```python
# Step 1: Exploratory analysis
agent.go("Load and perform initial QC on the proteomics dataset")
# Step 2: Based on results, refine analysis
agent.go("Based on the QC results, remove low-quality samples and normalize using method X")
# Step 3: Downstream analysis
agent.go("Perform differential abundance analysis with adjusted parameters")
```
### Security Considerations
**CRITICAL:** Biomni executes LLM-generated code with full system privileges. For production use:
1. **Use sandboxed environments:** Deploy in Docker containers or VMs with restricted permissions
2. **Validate sensitive operations:** Review code before execution for file access, network calls, or credential usage
3. **Limit data access:** Restrict agent access to only necessary data directories
4. **Monitor execution:** Log all executed code for audit trails
Never run Biomni with:
- Unrestricted file system access
- Direct access to sensitive credentials
- Network access to production systems
- Elevated system privileges
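As a minimal sketch of the sandboxing advice above (the image name, mount paths, and entry script are placeholders for your own setup; the container still needs outbound access to your LLM provider's API):
```python
import subprocess
# Launch an analysis inside a locked-down, throwaway container (illustrative only)
subprocess.run([
    "docker", "run", "--rm",
    "--user", "1000:1000",                              # run as an unprivileged user
    "--memory", "8g", "--cpus", "4",                    # cap resources
    "-v", "/srv/project/raw_data:/work/raw_data:ro",    # raw data mounted read-only
    "-v", "/srv/project/results:/work/results",         # writable results directory only
    "-e", "ANTHROPIC_API_KEY",                          # pass the key through; do not bake it into the image
    "biomni-sandbox:latest",                            # placeholder image with biomni installed
    "python", "/work/run_analysis.py",                  # placeholder entry script
], check=True)
```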
### Model Selection Guidelines
Choose models based on task complexity:
- **Claude Sonnet 4:** Recommended for most biomedical tasks, excellent biological reasoning
- **GPT-4/GPT-4o:** Strong general capabilities, good for diverse tasks
- **Biomni-R0:** Specialized for complex biological reasoning, multi-step workflows
- **Smaller models:** Use for simple, well-defined tasks to reduce cost
## Evaluation and Benchmarking
The Biomni-Eval1 benchmark contains 433 evaluation instances across 10 biological tasks:
- GWAS analysis
- Disease diagnosis
- Gene detection and classification
- Molecular property prediction
- Pathway analysis
- Protein function prediction
- Drug response prediction
- Variant interpretation
- Cell type annotation
- Biomarker discovery
Use the benchmark to:
- Evaluate custom agent configurations
- Compare LLM providers for specific tasks
- Validate analysis pipelines
## Troubleshooting
### Common Issues
**Issue:** Data download fails or times out
**Solution:** Manually download the knowledge base or increase timeout settings
**Issue:** Package dependency conflicts
**Solution:** Some optional dependencies cannot be installed by default due to conflicts. Install specific packages manually and uncomment relevant code sections as documented in the repository
**Issue:** LLM API errors
**Solution:** Verify API key configuration, check rate limits, ensure sufficient credits
**Issue:** Memory errors with large datasets
**Solution:** Process data in chunks, use data subsampling, or deploy on higher-memory instances
### Getting Help
For detailed troubleshooting:
- Review the Biomni GitHub repository issues
- Check `references/api_reference.md` for detailed API documentation
- Consult `references/task_examples.md` for comprehensive task patterns
## Resources
### references/
Detailed reference documentation for advanced usage:
- **api_reference.md:** Complete API documentation for A1 agent, configuration objects, and utility functions
- **llm_providers.md:** Comprehensive guide for configuring all supported LLM providers (Anthropic, OpenAI, Azure, Gemini, Groq, Ollama, AWS Bedrock)
- **task_examples.md:** Extensive collection of biomedical task examples with code patterns
### scripts/
Helper scripts for common operations:
- **setup_environment.py:** Automated environment setup and validation
- **generate_report.py:** Enhanced PDF report generation with custom formatting
Load reference documentation as needed:
```python
# Claude can read reference files when needed for detailed information
# Example: "Check references/llm_providers.md for Azure OpenAI configuration"
```

View File

@@ -0,0 +1,635 @@
# Biomni API Reference
This document provides comprehensive API documentation for the Biomni biomedical AI agent system.
## Core Classes
### A1 Agent
The primary agent class for executing biomedical research tasks.
#### Initialization
```python
from biomni.agent import A1
agent = A1(
path='./data', # Path to biomedical knowledge base
llm='claude-sonnet-4-20250514', # LLM model identifier
timeout=None, # Optional timeout in seconds
verbose=True # Enable detailed logging
)
```
**Parameters:**
- `path` (str, required): Directory path where the biomedical knowledge base is stored or will be downloaded. First-time initialization will download ~11GB of data.
- `llm` (str, optional): LLM model identifier. Defaults to the value in `default_config.llm`. Supports multiple providers (see LLM Providers section).
- `timeout` (int, optional): Maximum execution time in seconds for agent operations. Overrides `default_config.timeout_seconds`.
- `verbose` (bool, optional): Enable verbose logging for debugging. Default: True.
**Returns:** A1 agent instance ready for task execution.
#### Methods
##### `go(task_description: str) -> None`
Execute a biomedical research task autonomously.
```python
agent.go("Analyze this scRNA-seq dataset and identify cell types")
```
**Parameters:**
- `task_description` (str, required): Natural language description of the biomedical task to execute. Be specific about:
- Data location and format
- Desired analysis or output
- Any specific methods or parameters
- Expected results format
**Behavior:**
1. Decomposes the task into executable steps
2. Retrieves relevant biomedical knowledge from the data lake
3. Generates and executes Python/R code
4. Provides results and visualizations
5. Handles errors and retries with refinement
**Notes:**
- Executes code with system privileges - use in sandboxed environments
- Long-running tasks may require timeout adjustments
- Intermediate results are displayed during execution
##### `save_conversation_history(output_path: str, format: str = 'pdf') -> None`
Export conversation history and execution trace as a formatted report.
```python
agent.save_conversation_history(
output_path='./reports/analysis_log.pdf',
format='pdf'
)
```
**Parameters:**
- `output_path` (str, required): File path for the output report
- `format` (str, optional): Output format. Options: 'pdf', 'markdown'. Default: 'pdf'
**Requirements:**
- For PDF: Install one of: WeasyPrint, markdown2pdf, or Pandoc
```bash
pip install weasyprint # Recommended
# or
pip install markdown2pdf
# or install Pandoc system-wide
```
**Report Contents:**
- Task description and parameters
- Retrieved biomedical knowledge
- Generated code with execution traces
- Results, visualizations, and outputs
- Timestamps and execution metadata
##### `add_mcp(config_path: str) -> None`
Add Model Context Protocol (MCP) tools to extend agent capabilities.
```python
agent.add_mcp(config_path='./mcp_tools_config.json')
```
**Parameters:**
- `config_path` (str, required): Path to MCP configuration JSON file
**MCP Configuration Format:**
```json
{
"tools": [
{
"name": "tool_name",
"endpoint": "http://localhost:8000/tool",
"description": "Tool description for LLM",
"parameters": {
"param1": "string",
"param2": "integer"
}
}
]
}
```
**Use Cases:**
- Connect to laboratory information systems
- Integrate proprietary databases
- Access specialized computational resources
- Link to institutional data repositories
## Configuration
### default_config
Global configuration object for Biomni settings.
```python
from biomni.config import default_config
```
#### Attributes
##### `llm: str`
Default LLM model identifier for all agent instances.
```python
default_config.llm = "claude-sonnet-4-20250514"
```
**Supported Models:**
**Anthropic:**
- `claude-sonnet-4-20250514` (Recommended)
- `claude-opus-4-20250514`
- `claude-3-5-sonnet-20241022`
- `claude-3-opus-20240229`
**OpenAI:**
- `gpt-4o`
- `gpt-4`
- `gpt-4-turbo`
- `gpt-3.5-turbo`
**Azure OpenAI:**
- `azure/gpt-4`
- `azure/<deployment-name>`
**Google Gemini:**
- `gemini/gemini-pro`
- `gemini/gemini-1.5-pro`
**Groq:**
- `groq/llama-3.1-70b-versatile`
- `groq/mixtral-8x7b-32768`
**Ollama (Local):**
- `ollama/llama3`
- `ollama/mistral`
- `ollama/<model-name>`
**AWS Bedrock:**
- `bedrock/anthropic.claude-v2`
- `bedrock/anthropic.claude-3-sonnet`
**Custom/Biomni-R0:**
- `openai/biomni-r0` (requires local SGLang deployment)
##### `timeout_seconds: int`
Default timeout for agent operations in seconds.
```python
default_config.timeout_seconds = 1200 # 20 minutes
```
**Recommended Values:**
- Simple tasks (QC, basic analysis): 300-600 seconds
- Medium tasks (differential expression, clustering): 600-1200 seconds
- Complex tasks (full pipelines, ML models): 1200-3600 seconds
- Very complex tasks: 3600+ seconds
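One way to encode these guidelines in code (the tiers and values are illustrative, not part of the Biomni API):
```python
from biomni.config import default_config
# Illustrative timeout tiers based on the guideline above; tune for your workloads
TIMEOUTS = {"simple": 600, "medium": 1200, "complex": 3600}
default_config.timeout_seconds = TIMEOUTS["medium"]
```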
##### `data_path: str`
Default path to biomedical knowledge base.
```python
default_config.data_path = "/path/to/biomni/data"
```
**Storage Requirements:**
- Initial download: ~11GB
- Extracted size: ~15GB
- Additional working space: ~5-10GB recommended
##### `api_base: str`
Custom API endpoint for LLM providers (advanced usage).
```python
# For local Biomni-R0 deployment
default_config.api_base = "http://localhost:30000/v1"
# For custom OpenAI-compatible endpoints
default_config.api_base = "https://your-endpoint.com/v1"
```
##### `max_retries: int`
Number of retry attempts for failed operations.
```python
default_config.max_retries = 3
```
#### Methods
##### `reset() -> None`
Reset all configuration values to system defaults.
```python
default_config.reset()
```
## Database Query System
Biomni includes a retrieval-augmented generation (RAG) system for querying the biomedical knowledge base.
### Query Functions
#### `query_genes(query: str, top_k: int = 10) -> List[Dict]`
Query gene information from integrated databases.
```python
from biomni.database import query_genes
results = query_genes(
query="genes involved in p53 pathway",
top_k=20
)
```
**Parameters:**
- `query` (str): Natural language or gene identifier query
- `top_k` (int): Number of results to return
**Returns:** List of dictionaries containing:
- `gene_symbol`: Official gene symbol
- `gene_name`: Full gene name
- `description`: Functional description
- `pathways`: Associated biological pathways
- `go_terms`: Gene Ontology annotations
- `diseases`: Associated diseases
- `similarity_score`: Relevance score (0-1)
#### `query_proteins(query: str, top_k: int = 10) -> List[Dict]`
Query protein information from UniProt and other sources.
```python
from biomni.database import query_proteins
results = query_proteins(
query="kinase proteins in cell cycle",
top_k=15
)
```
**Returns:** List of dictionaries with protein metadata:
- `uniprot_id`: UniProt accession
- `protein_name`: Protein name
- `function`: Functional annotation
- `domains`: Protein domains
- `subcellular_location`: Cellular localization
- `similarity_score`: Relevance score
#### `query_drugs(query: str, top_k: int = 10) -> List[Dict]`
Query drug and compound information.
```python
from biomni.database import query_drugs
results = query_drugs(
query="FDA approved cancer drugs targeting EGFR",
top_k=10
)
```
**Returns:** Drug information including:
- `drug_name`: Common name
- `drugbank_id`: DrugBank identifier
- `indication`: Therapeutic indication
- `mechanism`: Mechanism of action
- `targets`: Molecular targets
- `approval_status`: Regulatory status
- `smiles`: Chemical structure (SMILES notation)
#### `query_diseases(query: str, top_k: int = 10) -> List[Dict]`
Query disease information from clinical databases.
```python
from biomni.database import query_diseases
results = query_diseases(
query="autoimmune diseases affecting joints",
top_k=10
)
```
**Returns:** Disease data:
- `disease_name`: Standard disease name
- `disease_id`: Ontology identifier
- `symptoms`: Clinical manifestations
- `associated_genes`: Genetic associations
- `prevalence`: Epidemiological data
#### `query_pathways(query: str, top_k: int = 10) -> List[Dict]`
Query biological pathways from KEGG, Reactome, and other sources.
```python
from biomni.database import query_pathways
results = query_pathways(
query="immune response signaling pathways",
top_k=15
)
```
**Returns:** Pathway information:
- `pathway_name`: Pathway name
- `pathway_id`: Database identifier
- `genes`: Genes in pathway
- `description`: Functional description
- `source`: Database source (KEGG, Reactome, etc.)
## Data Structures
### TaskResult
Result object returned by complex agent operations.
```python
class TaskResult:
success: bool # Whether task completed successfully
output: Any # Task output (varies by task)
code: str # Generated code
execution_time: float # Execution time in seconds
error: Optional[str] # Error message if failed
metadata: Dict # Additional metadata
```
### BiomedicalEntity
Base class for biomedical entities in the knowledge base.
```python
class BiomedicalEntity:
entity_id: str # Unique identifier
entity_type: str # Type (gene, protein, drug, etc.)
name: str # Entity name
description: str # Description
attributes: Dict # Additional attributes
references: List[str] # Literature references
```
## Utility Functions
### `download_data(path: str, force: bool = False) -> None`
Manually download or update the biomedical knowledge base.
```python
from biomni.utils import download_data
download_data(
path='./data',
force=True # Force re-download
)
```
### `validate_environment() -> Dict[str, bool]`
Check if the environment is properly configured.
```python
from biomni.utils import validate_environment
status = validate_environment()
# Returns: {
# 'conda_env': True,
# 'api_keys': True,
# 'data_available': True,
# 'dependencies': True
# }
```
### `list_available_models() -> List[str]`
Get a list of available LLM models based on configured API keys.
```python
from biomni.utils import list_available_models
models = list_available_models()
# Returns: ['claude-sonnet-4-20250514', 'gpt-4o', ...]
```
## Error Handling
### Common Exceptions
#### `BiomniConfigError`
Raised when configuration is invalid or incomplete.
```python
from biomni.exceptions import BiomniConfigError
try:
agent = A1(path='./data')
except BiomniConfigError as e:
print(f"Configuration error: {e}")
```
#### `BiomniExecutionError`
Raised when code generation or execution fails.
```python
from biomni.exceptions import BiomniExecutionError
try:
agent.go("invalid task")
except BiomniExecutionError as e:
print(f"Execution failed: {e}")
# Access failed code: e.code
# Access error details: e.details
```
#### `BiomniDataError`
Raised when knowledge base or data access fails.
```python
from biomni.exceptions import BiomniDataError
try:
results = query_genes("unknown query format")
except BiomniDataError as e:
print(f"Data access error: {e}")
```
#### `BiomniTimeoutError`
Raised when operations exceed timeout limit.
```python
from biomni.exceptions import BiomniTimeoutError
try:
agent.go("very complex long-running task")
except BiomniTimeoutError as e:
print(f"Task timed out after {e.duration} seconds")
# Partial results may be available: e.partial_results
```
## Best Practices
### Efficient Knowledge Retrieval
Pre-query databases for relevant context before complex tasks:
```python
from biomni.database import query_genes, query_pathways
# Gather relevant biological context first
genes = query_genes("cell cycle genes", top_k=50)
pathways = query_pathways("cell cycle regulation", top_k=20)
# Then execute task with enriched context
agent.go(f"""
Analyze the cell cycle progression in this dataset.
Focus on these genes: {[g['gene_symbol'] for g in genes]}
Consider these pathways: {[p['pathway_name'] for p in pathways]}
""")
```
### Error Recovery
Implement robust error handling for production workflows:
```python
from biomni.exceptions import BiomniExecutionError, BiomniTimeoutError
max_attempts = 3
for attempt in range(max_attempts):
try:
agent.go("complex biomedical task")
break
except BiomniTimeoutError:
# Increase timeout and retry
default_config.timeout_seconds *= 2
print(f"Timeout, retrying with {default_config.timeout_seconds}s timeout")
except BiomniExecutionError as e:
# Refine task based on error
print(f"Execution failed: {e}, refining task...")
# Optionally modify task description
else:
print("Task failed after max attempts")
```
### Memory Management
For large-scale analyses, manage memory explicitly:
```python
import gc
# Process datasets in chunks
for chunk_id in range(num_chunks):
agent.go(f"Process data chunk {chunk_id} located at data/chunk_{chunk_id}.h5ad")
# Force garbage collection between chunks
gc.collect()
# Save intermediate results
agent.save_conversation_history(f"./reports/chunk_{chunk_id}.pdf")
```
### Reproducibility
Ensure reproducible analyses by:
1. **Fixing random seeds:**
```python
agent.go("Set random seed to 42 for all analyses, then perform clustering...")
```
2. **Logging configuration:**
```python
import json
from datetime import datetime
config_log = {
'llm': default_config.llm,
'timeout': default_config.timeout_seconds,
'data_path': default_config.data_path,
'timestamp': datetime.now().isoformat()
}
with open('config_log.json', 'w') as f:
json.dump(config_log, f, indent=2)
```
3. **Saving execution traces:**
```python
# Always save detailed reports
agent.save_conversation_history('./reports/full_analysis.pdf')
```
## Performance Optimization
### Model Selection Strategy
Choose models based on task characteristics:
```python
# For exploratory, simple tasks
default_config.llm = "gpt-3.5-turbo" # Fast, cost-effective
# For standard biomedical analyses
default_config.llm = "claude-sonnet-4-20250514" # Recommended
# For complex reasoning and hypothesis generation
default_config.llm = "claude-opus-4-20250514" # Highest quality
# For specialized biological reasoning
default_config.llm = "openai/biomni-r0" # Requires local deployment
```
### Timeout Tuning
Set appropriate timeouts based on task complexity:
```python
# Quick queries and simple analyses
agent = A1(path='./data', timeout=300)
# Standard workflows
agent = A1(path='./data', timeout=1200)
# Full pipelines with ML training
agent = A1(path='./data', timeout=3600)
```
### Caching and Reuse
Reuse agent instances for multiple related tasks:
```python
# Create agent once
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
# Execute multiple related tasks
tasks = [
"Load and QC the scRNA-seq dataset",
"Perform clustering with resolution 0.5",
"Identify marker genes for each cluster",
"Annotate cell types based on markers"
]
for task in tasks:
agent.go(task)
# Save complete workflow
agent.save_conversation_history('./reports/full_workflow.pdf')
```

View File

@@ -0,0 +1,649 @@
# LLM Provider Configuration Guide
This document provides comprehensive configuration instructions for all LLM providers supported by Biomni.
## Overview
Biomni supports multiple LLM providers through a unified interface. Configure providers using:
- Environment variables
- `.env` files
- Runtime configuration via `default_config`
## Quick Reference Table
| Provider | Recommended For | API Key Required | Cost | Setup Complexity |
|----------|----------------|------------------|------|------------------|
| Anthropic Claude | Most biomedical tasks | Yes | Medium | Easy |
| OpenAI | General tasks | Yes | Medium-High | Easy |
| Azure OpenAI | Enterprise deployment | Yes | Varies | Medium |
| Google Gemini | Multimodal tasks | Yes | Medium | Easy |
| Groq | Fast inference | Yes | Low | Easy |
| Ollama | Local/offline use | No | Free | Medium |
| AWS Bedrock | AWS ecosystem | Yes | Varies | Hard |
| Biomni-R0 | Complex biological reasoning | No | Free | Hard |
## Anthropic Claude (Recommended)
### Overview
Claude models from Anthropic provide excellent biological reasoning capabilities and are the recommended choice for most Biomni tasks.
### Setup
1. **Obtain API Key:**
- Sign up at https://console.anthropic.com/
- Navigate to API Keys section
- Generate a new key
2. **Configure Environment:**
**Option A: Environment Variable**
```bash
export ANTHROPIC_API_KEY="sk-ant-api03-..."
```
**Option B: .env File**
```bash
# .env file in project root
ANTHROPIC_API_KEY=sk-ant-api03-...
```
3. **Set Model in Code:**
```python
from biomni.config import default_config
# Claude Sonnet 4 (Recommended)
default_config.llm = "claude-sonnet-4-20250514"
# Claude Opus 4 (Most capable)
default_config.llm = "claude-opus-4-20250514"
# Claude 3.5 Sonnet (Previous version)
default_config.llm = "claude-3-5-sonnet-20241022"
```
### Available Models
| Model | Context Window | Strengths | Best For |
|-------|---------------|-----------|----------|
| `claude-sonnet-4-20250514` | 200K tokens | Balanced performance, cost-effective | Most biomedical tasks |
| `claude-opus-4-20250514` | 200K tokens | Highest capability, complex reasoning | Difficult multi-step analyses |
| `claude-3-5-sonnet-20241022` | 200K tokens | Fast, reliable | Standard workflows |
| `claude-3-opus-20240229` | 200K tokens | Strong reasoning | Legacy support |
### Advanced Configuration
```python
from biomni.config import default_config
# Use Claude with custom parameters
default_config.llm = "claude-sonnet-4-20250514"
default_config.timeout_seconds = 1800
# Optional: Custom API endpoint (for proxy/enterprise)
default_config.api_base = "https://your-proxy.com/v1"
```
### Cost Estimation
Approximate costs per 1M tokens (as of January 2025):
- Input: $3-15 depending on model
- Output: $15-75 depending on model
For a typical biomedical analysis (~50K tokens total): $0.50-$2.00
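As a back-of-the-envelope sketch (token counts and rates below are illustrative, taken from the low end of the ranges above; check current provider pricing):
```python
# Illustrative cost estimate for one analysis run
input_tokens, output_tokens = 40_000, 10_000   # assumed prompt/context vs. generated output split
input_rate, output_rate = 3.0, 15.0            # USD per 1M tokens, low end of the ranges above
cost = input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate
print(f"Estimated cost: ${cost:.2f}")          # ~$0.27 at these rates
```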
## OpenAI
### Overview
OpenAI's GPT models provide strong general capabilities suitable for diverse biomedical tasks.
### Setup
1. **Obtain API Key:**
- Sign up at https://platform.openai.com/
- Navigate to API Keys
- Create new secret key
2. **Configure Environment:**
```bash
export OPENAI_API_KEY="sk-proj-..."
```
Or in `.env`:
```
OPENAI_API_KEY=sk-proj-...
```
3. **Set Model:**
```python
from biomni.config import default_config
default_config.llm = "gpt-4o" # Recommended
# default_config.llm = "gpt-4" # Previous flagship
# default_config.llm = "gpt-4-turbo" # Fast variant
# default_config.llm = "gpt-3.5-turbo" # Budget option
```
### Available Models
| Model | Context Window | Strengths | Cost |
|-------|---------------|-----------|------|
| `gpt-4o` | 128K tokens | Fast, multimodal | Medium |
| `gpt-4-turbo` | 128K tokens | Fast inference | Medium |
| `gpt-4` | 8K tokens | Reliable | High |
| `gpt-3.5-turbo` | 16K tokens | Fast, cheap | Low |
### Cost Optimization
```python
# For exploratory analysis (budget-conscious)
default_config.llm = "gpt-3.5-turbo"
# For production analysis (quality-focused)
default_config.llm = "gpt-4o"
```
## Azure OpenAI
### Overview
Azure-hosted OpenAI models for enterprise users requiring data residency and compliance.
### Setup
1. **Azure Prerequisites:**
- Active Azure subscription
- Azure OpenAI resource created
- Model deployment configured
2. **Environment Variables:**
```bash
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-02-15-preview"
```
3. **Configuration:**
```python
from biomni.config import default_config
# Option 1: Use deployment name
default_config.llm = "azure/your-deployment-name"
# Option 2: Specify endpoint explicitly
default_config.llm = "azure/gpt-4"
default_config.api_base = "https://your-resource.openai.azure.com/"
```
### Deployment Setup
Azure OpenAI requires explicit model deployments:
1. Navigate to Azure OpenAI Studio
2. Create deployment for desired model (e.g., GPT-4)
3. Note the deployment name
4. Use deployment name in Biomni configuration
### Example Configuration
```python
from biomni.config import default_config
import os
# Set Azure credentials
os.environ['AZURE_OPENAI_API_KEY'] = 'your-key'
os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://your-resource.openai.azure.com/'
# Configure Biomni to use Azure deployment
default_config.llm = "azure/gpt-4-biomni" # Your deployment name
default_config.api_base = os.environ['AZURE_OPENAI_ENDPOINT']
```
## Google Gemini
### Overview
Google's Gemini models offer multimodal capabilities and competitive performance.
### Setup
1. **Obtain API Key:**
- Visit https://makersuite.google.com/app/apikey
- Create new API key
2. **Environment Configuration:**
```bash
export GEMINI_API_KEY="your-key"
```
3. **Set Model:**
```python
from biomni.config import default_config
default_config.llm = "gemini/gemini-1.5-pro"
# Or: default_config.llm = "gemini/gemini-pro"
```
### Available Models
| Model | Context Window | Strengths |
|-------|---------------|-----------|
| `gemini/gemini-1.5-pro` | 1M tokens | Very large context, multimodal |
| `gemini/gemini-pro` | 32K tokens | Balanced performance |
### Use Cases
Gemini excels at:
- Tasks requiring very large context windows
- Multimodal analysis (when incorporating images)
- Cost-effective alternative to GPT-4
```python
# For tasks with large context requirements
default_config.llm = "gemini/gemini-1.5-pro"
default_config.timeout_seconds = 2400 # May need longer timeout
```
## Groq
### Overview
Groq provides ultra-fast inference with open-source models, ideal for rapid iteration.
### Setup
1. **Get API Key:**
- Sign up at https://console.groq.com/
- Generate API key
2. **Configure:**
```bash
export GROQ_API_KEY="gsk_..."
```
3. **Set Model:**
```python
from biomni.config import default_config
default_config.llm = "groq/llama-3.1-70b-versatile"
# Or: default_config.llm = "groq/mixtral-8x7b-32768"
```
### Available Models
| Model | Context Window | Speed | Quality |
|-------|---------------|-------|---------|
| `groq/llama-3.1-70b-versatile` | 32K tokens | Very Fast | Good |
| `groq/mixtral-8x7b-32768` | 32K tokens | Very Fast | Good |
| `groq/llama-3-70b-8192` | 8K tokens | Ultra Fast | Moderate |
### Best Practices
```python
# For rapid prototyping and testing
default_config.llm = "groq/llama-3.1-70b-versatile"
default_config.timeout_seconds = 600 # Groq is fast
# Note: Quality may be lower than GPT-4/Claude for complex tasks
# Recommended for: QC, simple analyses, testing workflows
```
## Ollama (Local Deployment)
### Overview
Run LLMs entirely locally for offline use, data privacy, or cost savings.
### Setup
1. **Install Ollama:**
```bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
```
2. **Pull Models:**
```bash
ollama pull llama3 # Meta Llama 3 (8B)
ollama pull mixtral # Mixtral (47B)
ollama pull codellama # Code-specialized
ollama pull medllama # Medical domain (if available)
```
3. **Start Ollama Server:**
```bash
ollama serve # Runs on http://localhost:11434
```
4. **Configure Biomni:**
```python
from biomni.config import default_config
default_config.llm = "ollama/llama3"
default_config.api_base = "http://localhost:11434"
```
### Hardware Requirements
Minimum recommendations:
- **8B models:** 16GB RAM, CPU inference acceptable
- **70B models:** 64GB RAM, GPU highly recommended
- **Storage:** 5-50GB per model
### Model Selection
```python
# Fast, local, good for testing
default_config.llm = "ollama/llama3"
# Better quality (requires more resources)
default_config.llm = "ollama/mixtral"
# Code generation tasks
default_config.llm = "ollama/codellama"
```
### Advantages & Limitations
**Advantages:**
- Complete data privacy
- No API costs
- Offline operation
- Unlimited usage
**Limitations:**
- Lower quality than GPT-4/Claude for complex tasks
- Requires significant hardware
- Slower inference (especially on CPU)
- May struggle with specialized biomedical knowledge
## AWS Bedrock
### Overview
AWS-managed LLM service offering multiple model providers.
### Setup
1. **AWS Prerequisites:**
- AWS account with Bedrock access
- Model access enabled in Bedrock console
- AWS credentials configured
2. **Configure AWS Credentials:**
```bash
# Option 1: AWS CLI
aws configure
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
```
3. **Enable Model Access:**
- Navigate to AWS Bedrock console
- Request access to desired models
- Wait for approval (may take hours/days)
4. **Configure Biomni:**
```python
from biomni.config import default_config
default_config.llm = "bedrock/anthropic.claude-3-sonnet"
# Or: default_config.llm = "bedrock/anthropic.claude-v2"
```
### Available Models
Bedrock provides access to:
- Anthropic Claude models
- Amazon Titan models
- AI21 Jurassic models
- Cohere Command models
- Meta Llama models
### IAM Permissions
Required IAM policy:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "arn:aws:bedrock:*::foundation-model/*"
}
]
}
```
### Example Configuration
```python
from biomni.config import default_config
import boto3
# Verify AWS credentials
session = boto3.Session()
credentials = session.get_credentials()
print(f"AWS Access Key: {credentials.access_key[:8]}...")
# Configure Biomni
default_config.llm = "bedrock/anthropic.claude-3-sonnet"
default_config.timeout_seconds = 1800
```
## Biomni-R0 (Local Specialized Model)
### Overview
Biomni-R0 is a 32B parameter reasoning model specifically trained for biological problem-solving. Provides the highest quality for complex biomedical reasoning but requires local deployment.
### Setup
1. **Hardware Requirements:**
- GPU with 48GB+ VRAM (e.g., A100, H100)
- Or multi-GPU setup (2x 24GB)
- 100GB+ storage for model weights
2. **Install Dependencies:**
```bash
pip install "sglang[all]"
pip install flashinfer # Optional but recommended
```
3. **Deploy Model:**
```bash
python -m sglang.launch_server \
--model-path snap-stanford/biomni-r0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--mem-fraction-static 0.8
```
For multi-GPU:
```bash
python -m sglang.launch_server \
--model-path snap-stanford/biomni-r0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--tp 2 # Tensor parallelism across 2 GPUs
```
4. **Configure Biomni:**
```python
from biomni.config import default_config
default_config.llm = "openai/biomni-r0"
default_config.api_base = "http://localhost:30000/v1"
default_config.timeout_seconds = 2400 # Longer for complex reasoning
```
### When to Use Biomni-R0
Biomni-R0 excels at:
- Multi-step biological reasoning
- Complex experimental design
- Hypothesis generation and evaluation
- Literature-informed analysis
- Tasks requiring deep biological knowledge
```python
# For complex biological reasoning tasks
default_config.llm = "openai/biomni-r0"
agent.go("""
Design a comprehensive CRISPR screening experiment to identify synthetic
lethal interactions with TP53 mutations in cancer cells, including:
1. Rationale and hypothesis
2. Guide RNA library design strategy
3. Experimental controls
4. Statistical analysis plan
5. Expected outcomes and validation approach
""")
```
### Performance Comparison
| Model | Speed | Biological Reasoning | Code Quality | Cost |
|-------|-------|---------------------|--------------|------|
| GPT-4 | Fast | Good | Excellent | Medium |
| Claude Sonnet 4 | Fast | Excellent | Excellent | Medium |
| Biomni-R0 | Moderate | Outstanding | Good | Free (local) |
## Multi-Provider Strategy
### Intelligent Model Selection
Use different models for different task types:
```python
from biomni.agent import A1
from biomni.config import default_config
# Strategy 1: Task-based selection
def get_agent_for_task(task_complexity):
if task_complexity == "simple":
default_config.llm = "gpt-3.5-turbo"
default_config.timeout_seconds = 300
elif task_complexity == "medium":
default_config.llm = "claude-sonnet-4-20250514"
default_config.timeout_seconds = 1200
else: # complex
default_config.llm = "openai/biomni-r0"
default_config.timeout_seconds = 2400
return A1(path='./data')
# Strategy 2: Fallback on failure
def execute_with_fallback(task):
models = [
"claude-sonnet-4-20250514",
"gpt-4o",
"claude-opus-4-20250514"
]
for model in models:
try:
default_config.llm = model
agent = A1(path='./data')
agent.go(task)
return
except Exception as e:
print(f"Failed with {model}: {e}, trying next...")
raise Exception("All models failed")
```
### Cost Optimization Strategy
```python
# Phase 1: Rapid prototyping with cheap models
default_config.llm = "gpt-3.5-turbo"
agent.go("Quick exploratory analysis of dataset structure")
# Phase 2: Detailed analysis with high-quality models
default_config.llm = "claude-sonnet-4-20250514"
agent.go("Comprehensive differential expression analysis with pathway enrichment")
# Phase 3: Complex reasoning with specialized models
default_config.llm = "openai/biomni-r0"
agent.go("Generate biological hypotheses based on multi-omics integration")
```
## Troubleshooting
### Common Issues
**Issue: "API key not found"**
- Verify environment variable is set: `echo $ANTHROPIC_API_KEY`
- Check `.env` file exists and is in correct location
- Try setting key programmatically: `os.environ['ANTHROPIC_API_KEY'] = 'key'`
**Issue: "Rate limit exceeded"**
- Implement exponential backoff and retry
- Upgrade API tier if available
- Switch to alternative provider temporarily
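A minimal retry sketch for the backoff suggestion above (the helper name and delay values are illustrative, not part of the Biomni API):
```python
import time
def go_with_backoff(agent, task, max_attempts=5, base_delay=2.0):
    """Retry agent.go() with exponential backoff on transient failures."""
    for attempt in range(max_attempts):
        try:
            agent.go(task)
            return
        except Exception as exc:  # ideally narrow this to your provider's rate-limit error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)   # 2s, 4s, 8s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```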
**Issue: "Model not found"**
- Verify model identifier is correct
- Check API key has access to requested model
- For Azure: ensure deployment exists with exact name
**Issue: "Timeout errors"**
- Increase `default_config.timeout_seconds`
- Break complex tasks into smaller steps
- Consider using faster model for initial phases
**Issue: "Connection refused (Ollama/Biomni-R0)"**
- Verify local server is running
- Check port is not blocked by firewall
- Confirm `api_base` URL is correct
### Testing Configuration
```python
from biomni.utils import list_available_models, validate_environment
# Check environment setup
status = validate_environment()
print("Environment Status:", status)
# List available models based on configured keys
models = list_available_models()
print("Available Models:", models)
# Test specific model
try:
from biomni.agent import A1
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
agent.go("Print 'Configuration successful!'")
except Exception as e:
print(f"Configuration test failed: {e}")
```
## Best Practices Summary
1. **For most users:** Start with Claude Sonnet 4 or GPT-4o
2. **For cost sensitivity:** Use GPT-3.5-turbo for exploration, Claude Sonnet 4 for production
3. **For privacy/offline:** Deploy Ollama locally
4. **For complex reasoning:** Use Biomni-R0 if hardware available
5. **For enterprise:** Consider Azure OpenAI or AWS Bedrock
6. **For speed:** Use Groq for rapid iteration
7. **Always:**
- Set appropriate timeouts
- Implement error handling and retries
- Log model and configuration for reproducibility
- Test configuration before production use

File diff suppressed because it is too large

View File

@@ -0,0 +1,381 @@
#!/usr/bin/env python3
"""
Enhanced PDF Report Generation for Biomni
This script provides advanced PDF report generation with custom formatting,
styling, and metadata for Biomni analysis results.
"""
import argparse
import sys
from pathlib import Path
from datetime import datetime
from typing import Optional, Dict, Any
def generate_markdown_report(
title: str,
sections: list,
metadata: Optional[Dict[str, Any]] = None,
output_path: str = "report.md"
) -> str:
"""
Generate a formatted markdown report.
Args:
title: Report title
sections: List of dicts with 'heading' and 'content' keys
metadata: Optional metadata dict (author, date, etc.)
output_path: Path to save markdown file
Returns:
Path to generated markdown file
"""
md_content = []
# Title
md_content.append(f"# {title}\n")
# Metadata
if metadata:
md_content.append("---\n")
for key, value in metadata.items():
md_content.append(f"**{key}:** {value} \n")
md_content.append("---\n\n")
# Sections
for section in sections:
heading = section.get('heading', 'Section')
content = section.get('content', '')
level = section.get('level', 2) # Default to h2
md_content.append(f"{'#' * level} {heading}\n\n")
md_content.append(f"{content}\n\n")
# Write to file
output = Path(output_path)
output.write_text('\n'.join(md_content))
return str(output)
def convert_to_pdf_weasyprint(
markdown_path: str,
output_path: str,
css_style: Optional[str] = None
) -> bool:
"""
Convert markdown to PDF using WeasyPrint.
Args:
markdown_path: Path to markdown file
output_path: Path for output PDF
css_style: Optional CSS stylesheet path
Returns:
True if successful, False otherwise
"""
try:
import markdown
from weasyprint import HTML, CSS
# Read markdown
with open(markdown_path, 'r') as f:
md_content = f.read()
# Convert to HTML
html_content = markdown.markdown(
md_content,
extensions=['tables', 'fenced_code', 'codehilite']
)
# Wrap in HTML template
html_template = f"""
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Biomni Report</title>
<style>
body {{
font-family: 'Helvetica', 'Arial', sans-serif;
line-height: 1.6;
color: #333;
max-width: 800px;
margin: 40px auto;
padding: 20px;
}}
h1 {{
color: #2c3e50;
border-bottom: 3px solid #3498db;
padding-bottom: 10px;
}}
h2 {{
color: #34495e;
margin-top: 30px;
border-bottom: 1px solid #bdc3c7;
padding-bottom: 5px;
}}
h3 {{
color: #7f8c8d;
}}
code {{
background-color: #f4f4f4;
padding: 2px 6px;
border-radius: 3px;
font-family: 'Courier New', monospace;
}}
pre {{
background-color: #f4f4f4;
padding: 15px;
border-radius: 5px;
overflow-x: auto;
}}
table {{
border-collapse: collapse;
width: 100%;
margin: 20px 0;
}}
th, td {{
border: 1px solid #ddd;
padding: 12px;
text-align: left;
}}
th {{
background-color: #3498db;
color: white;
}}
tr:nth-child(even) {{
background-color: #f9f9f9;
}}
.metadata {{
background-color: #ecf0f1;
padding: 15px;
border-radius: 5px;
margin: 20px 0;
}}
</style>
</head>
<body>
{html_content}
</body>
</html>
"""
# Generate PDF
pdf = HTML(string=html_template)
# Add custom CSS if provided
stylesheets = []
if css_style and Path(css_style).exists():
stylesheets.append(CSS(filename=css_style))
pdf.write_pdf(output_path, stylesheets=stylesheets)
return True
except ImportError:
print("Error: WeasyPrint not installed. Install with: pip install weasyprint")
return False
except Exception as e:
print(f"Error generating PDF: {e}")
return False
def convert_to_pdf_pandoc(markdown_path: str, output_path: str) -> bool:
"""
Convert markdown to PDF using Pandoc.
Args:
markdown_path: Path to markdown file
output_path: Path for output PDF
Returns:
True if successful, False otherwise
"""
try:
import subprocess
# Check if pandoc is installed
result = subprocess.run(
['pandoc', '--version'],
capture_output=True,
text=True
)
if result.returncode != 0:
print("Error: Pandoc not installed")
return False
# Convert with pandoc
result = subprocess.run(
[
'pandoc',
markdown_path,
'-o', output_path,
'--pdf-engine=pdflatex',
'-V', 'geometry:margin=1in',
'--toc'
],
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Pandoc error: {result.stderr}")
return False
return True
except FileNotFoundError:
print("Error: Pandoc not found. Install from https://pandoc.org/")
return False
except Exception as e:
print(f"Error: {e}")
return False
def create_biomni_report(
conversation_history: list,
output_path: str = "biomni_report.pdf",
method: str = "weasyprint"
) -> bool:
"""
Create a formatted PDF report from Biomni conversation history.
Args:
conversation_history: List of conversation turns
output_path: Output PDF path
method: Conversion method ('weasyprint' or 'pandoc')
Returns:
True if successful
"""
# Prepare report sections
metadata = {
'Date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
'Tool': 'Biomni AI Agent',
'Report Type': 'Analysis Summary'
}
sections = []
# Executive Summary
sections.append({
'heading': 'Executive Summary',
'level': 2,
'content': 'This report contains the complete analysis workflow executed by the Biomni biomedical AI agent.'
})
# Conversation history
for i, turn in enumerate(conversation_history, 1):
sections.append({
'heading': f'Task {i}: {turn.get("task", "Analysis")}',
'level': 2,
'content': f'**Input:**\n```\n{turn.get("input", "")}\n```\n\n**Output:**\n{turn.get("output", "")}'
})
# Generate markdown
md_path = output_path.replace('.pdf', '.md')
generate_markdown_report(
title="Biomni Analysis Report",
sections=sections,
metadata=metadata,
output_path=md_path
)
# Convert to PDF
if method == 'weasyprint':
success = convert_to_pdf_weasyprint(md_path, output_path)
elif method == 'pandoc':
success = convert_to_pdf_pandoc(md_path, output_path)
else:
print(f"Unknown method: {method}")
return False
if success:
print(f"✓ Report generated: {output_path}")
print(f" Markdown: {md_path}")
else:
print("✗ Failed to generate PDF")
print(f" Markdown available: {md_path}")
return success
def main():
"""CLI for report generation."""
parser = argparse.ArgumentParser(
description='Generate formatted PDF reports for Biomni analyses'
)
parser.add_argument(
'input',
type=str,
help='Input markdown file or conversation history'
)
parser.add_argument(
'-o', '--output',
type=str,
default='biomni_report.pdf',
help='Output PDF path (default: biomni_report.pdf)'
)
parser.add_argument(
'-m', '--method',
type=str,
choices=['weasyprint', 'pandoc'],
default='weasyprint',
help='Conversion method (default: weasyprint)'
)
parser.add_argument(
'--css',
type=str,
help='Custom CSS stylesheet path'
)
args = parser.parse_args()
# Check if input is markdown or conversation history
input_path = Path(args.input)
if not input_path.exists():
print(f"Error: Input file not found: {args.input}")
return 1
# If input is markdown, convert directly
if input_path.suffix == '.md':
if args.method == 'weasyprint':
success = convert_to_pdf_weasyprint(
str(input_path),
args.output,
args.css
)
else:
success = convert_to_pdf_pandoc(str(input_path), args.output)
return 0 if success else 1
# Otherwise, assume it's conversation history (JSON)
try:
import json
with open(input_path) as f:
history = json.load(f)
success = create_biomni_report(
history,
args.output,
args.method
)
return 0 if success else 1
except json.JSONDecodeError:
print("Error: Input file is not valid JSON or markdown")
return 1
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,230 @@
#!/usr/bin/env python3
"""
Biomni Environment Setup and Validation Script
This script helps users set up and validate their Biomni environment,
including checking dependencies, API keys, and data availability.
"""
import os
import sys
import subprocess
from pathlib import Path
from typing import Any, Dict, List, Tuple
def check_python_version() -> Tuple[bool, str]:
"""Check if Python version is compatible."""
version = sys.version_info
if version.major == 3 and version.minor >= 8:
return True, f"Python {version.major}.{version.minor}.{version.micro}"
else:
return False, f"Python {version.major}.{version.minor} - requires Python 3.8+"
def check_conda_env() -> Tuple[bool, str]:
"""Check if running in biomni conda environment."""
conda_env = os.environ.get('CONDA_DEFAULT_ENV', None)
if conda_env == 'biomni_e1':
return True, f"Conda environment: {conda_env}"
else:
return False, f"Not in biomni_e1 environment (current: {conda_env})"
def check_package_installed(package: str) -> bool:
"""Check if a Python package is installed."""
try:
__import__(package)
return True
except ImportError:
return False
def check_dependencies() -> Tuple[bool, List[str]]:
"""Check for required and optional dependencies."""
required = ['biomni']
optional = ['weasyprint', 'markdown2pdf']
missing_required = [pkg for pkg in required if not check_package_installed(pkg)]
missing_optional = [pkg for pkg in optional if not check_package_installed(pkg)]
messages = []
success = len(missing_required) == 0
if missing_required:
messages.append(f"Missing required packages: {', '.join(missing_required)}")
messages.append("Install with: pip install biomni --upgrade")
else:
messages.append("Required packages: ✓")
if missing_optional:
messages.append(f"Missing optional packages: {', '.join(missing_optional)}")
messages.append("For PDF reports, install: pip install weasyprint")
return success, messages
def check_api_keys() -> Tuple[bool, Dict[str, bool]]:
"""Check which API keys are configured."""
api_keys = {
'ANTHROPIC_API_KEY': os.environ.get('ANTHROPIC_API_KEY'),
'OPENAI_API_KEY': os.environ.get('OPENAI_API_KEY'),
'GEMINI_API_KEY': os.environ.get('GEMINI_API_KEY'),
'GROQ_API_KEY': os.environ.get('GROQ_API_KEY'),
}
configured = {key: bool(value) for key, value in api_keys.items()}
has_any = any(configured.values())
return has_any, configured
def check_data_directory(data_path: str = './data') -> Tuple[bool, str]:
"""Check if Biomni data directory exists and has content."""
path = Path(data_path)
if not path.exists():
return False, f"Data directory not found at {data_path}"
# Check if directory has files (data has been downloaded)
files = list(path.glob('*'))
if len(files) == 0:
return False, f"Data directory exists but is empty. Run agent once to download."
# Rough size check (should be ~11GB)
total_size = sum(f.stat().st_size for f in path.rglob('*') if f.is_file())
size_gb = total_size / (1024**3)
if size_gb < 1:
return False, f"Data directory exists but seems incomplete ({size_gb:.1f} GB)"
return True, f"Data directory: {data_path} ({size_gb:.1f} GB) ✓"
def check_disk_space(required_gb: float = 20) -> Tuple[bool, str]:
"""Check if sufficient disk space is available."""
try:
import shutil
stat = shutil.disk_usage('.')
free_gb = stat.free / (1024**3)
if free_gb >= required_gb:
return True, f"Disk space: {free_gb:.1f} GB available ✓"
else:
return False, f"Low disk space: {free_gb:.1f} GB (need {required_gb} GB)"
except Exception as e:
return False, f"Could not check disk space: {e}"
def test_biomni_import() -> Tuple[bool, str]:
"""Test if Biomni can be imported and initialized."""
try:
from biomni.agent import A1
from biomni.config import default_config
return True, "Biomni import successful ✓"
except ImportError as e:
return False, f"Cannot import Biomni: {e}"
except Exception as e:
return False, f"Biomni import error: {e}"
def suggest_fixes(results: Dict[str, Tuple[bool, Any]]) -> List[str]:
"""Generate suggestions for fixing issues."""
suggestions = []
if not results['python'][0]:
suggestions.append("➜ Upgrade Python to 3.8 or higher")
if not results['conda'][0]:
suggestions.append("➜ Activate biomni environment: conda activate biomni_e1")
if not results['dependencies'][0]:
suggestions.append("➜ Install Biomni: pip install biomni --upgrade")
if not results['api_keys'][0]:
suggestions.append("➜ Set API key: export ANTHROPIC_API_KEY='your-key'")
suggestions.append(" Or create .env file with API keys")
if not results['data'][0]:
suggestions.append("➜ Data will auto-download on first agent.go() call")
if not results['disk_space'][0]:
suggestions.append("➜ Free up disk space (need ~20GB total)")
return suggestions
def main():
"""Run all environment checks and display results."""
print("=" * 60)
print("Biomni Environment Validation")
print("=" * 60)
print()
# Run all checks
results = {}
print("Checking Python version...")
results['python'] = check_python_version()
print(f" {results['python'][1]}")
print()
print("Checking conda environment...")
results['conda'] = check_conda_env()
print(f" {results['conda'][1]}")
print()
print("Checking dependencies...")
results['dependencies'] = check_dependencies()
for msg in results['dependencies'][1]:
print(f" {msg}")
print()
print("Checking API keys...")
results['api_keys'] = check_api_keys()
has_keys, key_status = results['api_keys']
for key, configured in key_status.items():
status = "" if configured else ""
print(f" {key}: {status}")
print()
print("Checking Biomni data directory...")
results['data'] = check_data_directory()
print(f" {results['data'][1]}")
print()
print("Checking disk space...")
results['disk_space'] = check_disk_space()
print(f" {results['disk_space'][1]}")
print()
print("Testing Biomni import...")
results['biomni_import'] = test_biomni_import()
print(f" {results['biomni_import'][1]}")
print()
# Summary
print("=" * 60)
all_passed = all(result[0] for result in results.values())
if all_passed:
print("✓ All checks passed! Environment is ready.")
print()
print("Quick start:")
print(" from biomni.agent import A1")
print(" agent = A1(path='./data', llm='claude-sonnet-4-20250514')")
print(" agent.go('Your biomedical task')")
else:
print("⚠ Some checks failed. See suggestions below:")
print()
suggestions = suggest_fixes(results)
for suggestion in suggestions:
print(suggestion)
print("=" * 60)
return 0 if all_passed else 1
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,450 @@
---
name: biopython
description: Comprehensive toolkit for computational molecular biology using BioPython. Use this skill when working with biological sequences (DNA, RNA, protein), parsing sequence files (FASTA, GenBank, FASTQ), accessing NCBI databases (Entrez, BLAST), performing sequence alignments, building phylogenetic trees, analyzing protein structures (PDB), or any bioinformatics task requiring BioPython modules.
---
# BioPython
## Overview
BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.
## When to Use This Skill
Use this skill when:
- Working with biological sequences (DNA, RNA, protein)
- Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
- Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
- Running or parsing BLAST searches
- Performing sequence alignments (pairwise or multiple)
- Building or analyzing phylogenetic trees
- Analyzing protein structures (PDB files)
- Calculating sequence properties (GC content, melting temp, molecular weight)
- Converting between sequence file formats
- Performing population genetics analysis
- Any bioinformatics task requiring BioPython
## Core Capabilities
### 1. Sequence Manipulation
Create and manipulate biological sequences using `Bio.Seq`:
```python
from Bio.Seq import Seq
dna_seq = Seq("ATGGTGCATCTGACT")
rna_seq = dna_seq.transcribe() # DNA → RNA
protein = dna_seq.translate() # DNA → Protein
rev_comp = dna_seq.reverse_complement() # Reverse complement
```
**Common operations:**
- Transcription and back-transcription
- Translation with custom genetic codes
- Complement and reverse complement
- Sequence slicing and concatenation
- Pattern searching and counting
**Reference:** See `references/core_modules.md` (section: Bio.Seq) for detailed operations and examples.
### 2. File Input/Output
Read and write sequence files in multiple formats using `Bio.SeqIO`:
```python
from Bio import SeqIO
# Read sequences
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(record.id, len(record.seq))
# Write sequences
SeqIO.write(records, "output.gb", "genbank")
# Convert formats
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")
```
**Supported formats:** FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.
**Common workflows:**
- Format conversion (FASTA ↔ GenBank ↔ FASTQ)
- Filtering sequences by length, ID, or content
- Batch processing large files with iterators
- Random access with `SeqIO.index()` for large files
**Script:** Use `scripts/file_io.py` for file I/O examples and patterns.
**Reference:** See `references/core_modules.md` (section: Bio.SeqIO) for comprehensive format details and workflows.
### 3. NCBI Database Access
Access NCBI databases (GenBank, PubMed, Protein, etc.) using `Bio.Entrez`:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com" # Required!
# Search database
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
record = Entrez.read(handle)
id_list = record["IdList"]
# Fetch sequences
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
```
**Key Entrez functions:**
- `esearch()`: Search databases, retrieve IDs
- `efetch()`: Download full records
- `esummary()`: Get document summaries
- `elink()`: Find related records across databases
- `einfo()`: Get database information
- `epost()`: Upload ID lists for large queries
**Important:** Always set `Entrez.email` before using Entrez functions.
**Script:** Use `scripts/ncbi_entrez.py` for complete Entrez workflows including batch downloads and WebEnv usage.
**Reference:** See `references/database_tools.md` (section: Bio.Entrez) for detailed function documentation and parameters.
### 4. BLAST Searches
Run BLAST searches and parse results using `Bio.Blast`:
```python
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
# Save results
with open("blast_results.xml", "w") as out:
out.write(result_handle.read())
# Parse results
with open("blast_results.xml") as result_handle:
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 0.001:
print(f"Hit: {alignment.title}")
print(f"E-value: {hsp.expect}")
print(f"Identity: {hsp.identities}/{hsp.align_length}")
```
**BLAST programs:** blastn, blastp, blastx, tblastn, tblastx
**Key result attributes:**
- `alignment.title`: Hit description
- `hsp.expect`: E-value
- `hsp.identities`: Number of identical residues
- `hsp.query`, `hsp.match`, `hsp.sbjct`: Aligned sequences
**Script:** Use `scripts/blast_search.py` for complete BLAST workflows including result filtering and extraction.
**Reference:** See `references/database_tools.md` (section: Bio.Blast) for detailed parsing and filtering strategies.
### 5. Sequence Alignment
Perform pairwise and multiple sequence alignments using `Bio.Align`:
**Pairwise alignment:**
```python
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global' # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2
alignments = aligner.align(seq1, seq2)
print(alignments[0])
print(f"Score: {alignments.score}")
```
**Multiple sequence alignment I/O:**
```python
from Bio import AlignIO
# Read alignment
alignment = AlignIO.read("alignment.clustal", "clustal")
# Write alignment
AlignIO.write(alignment, "output.phylip", "phylip")
# Convert formats
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")
```
**Supported formats:** Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF
**Script:** Use `scripts/alignment_phylogeny.py` for alignment examples and workflows.
**Reference:** See `references/core_modules.md` (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.
### 6. Phylogenetic Analysis
Build and analyze phylogenetic trees using `Bio.Phylo`:
```python
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
# Read alignment
alignment = AlignIO.read("sequences.fasta", "fasta")
# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
# Build tree (UPGMA or Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm) # or constructor.nj(dm)
# Visualize tree
Phylo.draw_ascii(tree)
Phylo.draw(tree) # matplotlib visualization
# Save tree
Phylo.write(tree, "tree.nwk", "newick")
```
**Tree manipulation:**
- `tree.ladderize()`: Sort branches
- `tree.root_at_midpoint()`: Root at midpoint
- `tree.prune()`: Remove taxa
- `tree.collapse_all()`: Collapse short branches
- `tree.distance()`: Calculate distances between clades
**Supported formats:** Newick, NEXUS, PhyloXML, NeXML
**Script:** Use `scripts/alignment_phylogeny.py` for tree construction and manipulation examples.
**Reference:** See `references/specialized_modules.md` (section: Bio.Phylo) for comprehensive tree analysis capabilities.
### 7. Structural Bioinformatics
Analyze protein structures using `Bio.PDB`:
```python
from Bio.PDB import PDBParser, PDBList
# Download structure
pdbl = PDBList()
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")
# Parse structure
parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")
# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
for chain in model:
for residue in chain:
for atom in residue:
print(atom.name, atom.coord)
# Secondary structure with DSSP
from Bio.PDB import DSSP
dssp = DSSP(model, "structure.pdb")
# Structural alignment
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)
print(f"RMSD: {sup.rms}")
```
**Key capabilities:**
- Parse PDB, mmCIF, MMTF formats
- Secondary structure analysis (DSSP)
- Solvent accessibility calculations
- Structural superimposition
- Distance and angle calculations
- Structure quality validation
**Reference:** See `references/specialized_modules.md` (section: Bio.PDB) for complete structural analysis capabilities.
### 8. Sequence Analysis Utilities
Calculate sequence properties using `Bio.SeqUtils`:
```python
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis
# DNA analysis
gc = gc_fraction(dna_seq) * 100
tm = mt.Tm_NN(dna_seq) # Melting temperature
# Protein analysis
protein_analysis = ProteinAnalysis(str(protein_seq))
mw = protein_analysis.molecular_weight()
pi = protein_analysis.isoelectric_point()
aromaticity = protein_analysis.aromaticity()
instability = protein_analysis.instability_index()
```
**Available analyses:**
- GC content and GC skew
- Melting temperature (multiple methods)
- Molecular weight
- Isoelectric point
- Aromaticity
- Instability index
- Secondary structure prediction
- Sequence checksums
**Script:** Use `scripts/sequence_operations.py` for sequence analysis examples.
**Reference:** See `references/core_modules.md` (section: Bio.SeqUtils) for all available utilities.
### 9. Specialized Modules
**Restriction enzymes:**
```python
from Bio import Restriction
enzyme = Restriction.EcoRI
sites = enzyme.search(seq)
```
**Motif analysis:**
```python
from Bio import motifs
m = motifs.create([seq1, seq2, seq3])
pwm = m.counts.normalize(pseudocounts=0.5)
```
**Population genetics:**
Use `Bio.PopGen` for allele frequencies, Hardy-Weinberg equilibrium, FST calculations.
**Clustering:**
Use `Bio.Cluster` for hierarchical clustering, k-means, PCA on biological data.
**Reference:** See `references/core_modules.md` and `references/specialized_modules.md` for specialized module documentation.
## Common Workflows
### Workflow 1: Download and Analyze NCBI Sequences
1. Search NCBI database with `Entrez.esearch()`
2. Fetch sequences with `Entrez.efetch()`
3. Parse with `SeqIO.parse()`
4. Analyze sequences (GC content, translation, etc.)
5. Save results to file
**Script:** Use `scripts/ncbi_entrez.py` for complete implementation.
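The steps above can be condensed into a minimal sketch (the search term and output filename are placeholders):
```python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

Entrez.email = "your.email@example.com"  # required by NCBI

# 1-2. Search, then fetch the matching records as GenBank
search = Entrez.read(Entrez.esearch(db="nucleotide", term="human kinase", retmax=10))
handle = Entrez.efetch(db="nucleotide", id=search["IdList"], rettype="gb", retmode="text")

# 3-5. Parse, analyze, and save
records = list(SeqIO.parse(handle, "genbank"))
for rec in records:
    print(rec.id, f"GC={gc_fraction(rec.seq) * 100:.1f}%")
SeqIO.write(records, "kinase_hits.fasta", "fasta")
```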
### Workflow 2: Sequence Similarity Search
1. Run BLAST with `NCBIWWW.qblast()` or parse existing results
2. Parse XML results with `NCBIXML.read()`
3. Filter hits by E-value, identity, coverage
4. Extract and save significant hits
5. Perform downstream analysis
**Script:** Use `scripts/blast_search.py` for complete implementation.
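A minimal sketch of this workflow (the query sequence is a placeholder; online BLAST requires network access and can take minutes):
```python
from Bio.Blast import NCBIWWW, NCBIXML

# 1-2. Run BLAST at NCBI and parse the XML result
result_handle = NCBIWWW.qblast("blastn", "nt", "ACGT" * 25)  # placeholder query
blast_record = NCBIXML.read(result_handle)

# 3-4. Keep hits passing E-value and identity thresholds
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-5 and hsp.identities / hsp.align_length > 0.8:
            print(alignment.title[:60], hsp.expect)
```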
### Workflow 3: Phylogenetic Tree Construction
1. Read multiple sequence alignment with `AlignIO.read()`
2. Calculate distance matrix with `DistanceCalculator`
3. Build tree with `DistanceTreeConstructor` (UPGMA or NJ)
4. Manipulate tree (ladderize, root, prune)
5. Visualize with `Phylo.draw()` or `Phylo.draw_ascii()`
6. Save tree with `Phylo.write()`
**Script:** Use `scripts/alignment_phylogeny.py` for complete implementation.
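A minimal sketch, assuming an aligned FASTA file named `alignment.fasta`:
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("alignment.fasta", "fasta")
dm = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)   # or .upgma(dm)
tree.ladderize()
Phylo.draw_ascii(tree)
Phylo.write(tree, "tree.nwk", "newick")
```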
### Workflow 4: Format Conversion Pipeline
1. Read sequences in original format with `SeqIO.parse()`
2. Filter or modify sequences as needed
3. Write to new format with `SeqIO.write()`
4. Or use `SeqIO.convert()` for direct conversion
**Script:** Use `scripts/file_io.py` for format conversion examples.
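A minimal filter-and-convert sketch (file names and the 200 bp cutoff are placeholders):
```python
from Bio import SeqIO

# Keep only sequences of at least 200 bp, then write them out as FASTA
long_records = (r for r in SeqIO.parse("input.gb", "genbank") if len(r.seq) >= 200)
count = SeqIO.write(long_records, "filtered.fasta", "fasta")
print(f"Wrote {count} sequences")
```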
## Best Practices
### Email Configuration
Always set `Entrez.email` before using NCBI services:
```python
Entrez.email = "your.email@example.com"
```
### Rate Limiting
Be polite to NCBI servers:
- Use `time.sleep()` between requests
- Use WebEnv for large queries
- Batch downloads in reasonable chunks (100-500 sequences)
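A minimal batching sketch following the guidelines above (the accession list is a placeholder; extend it with your own IDs):
```python
import time
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"
ids = ["NM_001301717"]  # placeholder; extend with your own accessions
batch_size = 200

for start in range(0, len(ids), batch_size):
    batch = ids[start:start + batch_size]
    handle = Entrez.efetch(db="nucleotide", id=batch, rettype="fasta", retmode="text")
    SeqIO.write(SeqIO.parse(handle, "fasta"), f"batch_{start}.fasta", "fasta")
    time.sleep(1)  # stay well below NCBI's request-rate limits
```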
### Memory Management
For large files:
- Use iterators (`SeqIO.parse()`) instead of lists
- Use `SeqIO.index()` for random access without loading entire file
- Process in batches when possible
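For example, `SeqIO.index()` gives dictionary-style random access without loading the whole file (file name and record ID are placeholders):
```python
from Bio import SeqIO

index = SeqIO.index("large_file.fasta", "fasta")
record = index["some_sequence_id"]   # hypothetical record ID
print(record.id, len(record.seq))
index.close()
```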
### Error Handling
Always handle potential errors:
```python
try:
record = SeqIO.read(handle, format)
except Exception as e:
print(f"Error: {e}")
```
### File Format Selection
Choose appropriate formats:
- FASTA: Simple sequences, no annotations
- GenBank: Rich annotations, features, references
- FASTQ: Sequences with quality scores
- PDB: 3D structural data
## Resources
### scripts/
Executable Python scripts demonstrating common BioPython workflows:
- `sequence_operations.py`: Basic sequence manipulation (transcription, translation, complement, GC content, melting temp)
- `file_io.py`: Reading, writing, and converting sequence files; filtering; indexing large files
- `ncbi_entrez.py`: Searching and downloading from NCBI databases; batch processing with WebEnv
- `blast_search.py`: Running BLAST searches online; parsing and filtering results
- `alignment_phylogeny.py`: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation
Run any script with `python3 scripts/<script_name>.py` to see examples.
### references/
Comprehensive reference documentation for BioPython modules:
- `core_modules.md`: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
- `database_tools.md`: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
- `specialized_modules.md`: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)
Reference these files when:
- Learning about specific module capabilities
- Looking up function parameters and options
- Understanding supported file formats
- Finding example code patterns
Use `grep` to search references for specific topics:
```bash
grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md
```
## Additional Resources
**Official Documentation:** https://biopython.org/docs/latest/
**Tutorial:** https://biopython.org/docs/latest/Tutorial/index.html
**API Reference:** https://biopython.org/docs/latest/api/index.html
**Cookbook:** https://biopython.org/wiki/Category:Cookbook

View File

@@ -0,0 +1,232 @@
# BioPython Core Modules Reference
This document provides detailed information about BioPython's core modules and their capabilities.
## Sequence Handling
### Bio.Seq - Sequence Objects
Seq objects are BioPython's fundamental data structure for biological sequences, providing biological methods on top of string-like behavior.
**Creation:**
```python
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
```
**Key Operations:**
- String methods: `find()`, `count()`, `count_overlap()` (for overlapping patterns)
- Complement/Reverse complement: Returns complementary sequences
- Transcription: DNA → RNA (T → U)
- Back transcription: RNA → DNA
- Translation: DNA/RNA → protein with customizable genetic codes and stop codon handling
**Use Cases:**
- DNA/RNA sequence manipulation
- Converting between nucleic acid types
- Protein translation from coding sequences
- Sequence searching and pattern counting
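A short sketch of the operations listed above:
```python
from Bio.Seq import Seq

coding = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
print(coding.count("ATG"), coding.find("GCC"))   # pattern counting and searching
print(coding.reverse_complement())
mrna = coding.transcribe()
print(mrna.translate(to_stop=True))              # stop at the first stop codon
```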
### Bio.SeqRecord - Sequence Metadata
SeqRecord wraps Seq objects with metadata like ID, description, and features.
**Attributes:**
- `seq`: The sequence itself (Seq object)
- `id`: Unique identifier
- `name`: Short name
- `description`: Longer description
- `features`: List of SeqFeature objects
- `annotations`: Dictionary of additional information
- `letter_annotations`: Per-letter annotations (e.g., quality scores)
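**Example** (a minimal record; the identifier and description are illustrative):
```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("ATGGTGCATCTGACT"),
    id="demo_001",
    description="example coding fragment",
)
record.annotations["molecule_type"] = "DNA"
print(record.id, len(record.seq), record.description)
```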
### Bio.SeqFeature - Sequence Annotations
Manages sequence annotations and features such as genes, promoters, and coding regions.
**Common Features:**
- Gene locations
- CDS (coding sequences)
- Promoters and regulatory elements
- Exons and introns
- Protein domains
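**Example** (a hypothetical CDS feature; coordinates are 0-based and end-exclusive):
```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

parent = Seq("ATG" * 50)
cds = SeqFeature(FeatureLocation(10, 100, strand=1), type="CDS", qualifiers={"gene": ["demo"]})
print(cds.extract(parent))   # pull the feature's sequence out of the parent
```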
## File Input/Output
### Bio.SeqIO - Sequence File I/O
Unified interface for reading and writing sequence files in multiple formats.
**Supported Formats:**
- FASTA/FASTQ: Standard sequence formats
- GenBank/EMBL: Feature-rich annotation formats
- Clustal/Stockholm/PHYLIP: Alignment formats
- ABI/SFF: Trace and flowgram data
- Swiss-Prot/PIR: Protein databases
- PDB: Protein structure files
**Key Functions:**
**SeqIO.parse()** - Iterator for reading multiple records:
```python
from Bio import SeqIO
for record in SeqIO.parse("file.fasta", "fasta"):
print(record.id, len(record.seq))
```
**SeqIO.read()** - Read single record:
```python
record = SeqIO.read("file.fasta", "fasta")
```
**SeqIO.write()** - Write sequences:
```python
SeqIO.write(sequences, "output.fasta", "fasta")
```
**SeqIO.convert()** - Direct format conversion:
```python
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```
**SeqIO.index()** - Memory-efficient random access for large files:
```python
record_dict = SeqIO.index("large_file.fasta", "fasta")
sequence = record_dict["seq_id"]
```
**SeqIO.to_dict()** - Load all records into dictionary (memory-based):
```python
record_dict = SeqIO.to_dict(SeqIO.parse("file.fasta", "fasta"))
```
**Common Patterns:**
- Format conversion between FASTA, GenBank, FASTQ
- Filtering sequences by length, ID, or content
- Extracting subsequences
- Batch processing large files with iterators
### Bio.AlignIO - Multiple Sequence Alignment I/O
Handles multiple sequence alignment files.
**Key Functions:**
- `write()`: Save alignments
- `parse()`: Read multiple alignments
- `read()`: Read single alignment
- `convert()`: Convert between formats
**Supported Formats:**
- Clustal
- PHYLIP (sequential and interleaved)
- Stockholm
- NEXUS
- FASTA (aligned)
- MAF (Multiple Alignment Format)
## Sequence Alignment
### Bio.Align - Alignment Tools
**PairwiseAligner** - High-performance pairwise alignment:
```python
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global' # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2.5
alignments = aligner.align(seq1, seq2)
```
**CodonAligner** - Codon-aware alignment
**MultipleSeqAlignment** - Container for MSA with column access
### Bio.pairwise2 (Legacy)
Legacy pairwise alignment module with functions like `align.globalxx()`, `align.localxx()`.
## Sequence Analysis Utilities
### Bio.SeqUtils - Sequence Analysis
Collection of utility functions:
**CheckSum** - Calculate sequence checksums (CRC32, CRC64, GCG)
**MeltingTemp** - DNA melting temperature calculations:
- Nearest-neighbor method
- Wallace rule
- GC content method
**IsoelectricPoint** - Protein pI calculation
**ProtParam** - Protein analysis:
- Molecular weight
- Aromaticity
- Instability index
- Secondary structure fractions
**GC/GC_skew** - Calculate GC content and GC skew for sequence windows
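**Example** (a brief sketch of the checksum and GC-skew helpers listed above; the window size is arbitrary):
```python
from Bio.Seq import Seq
from Bio.SeqUtils import GC_skew
from Bio.SeqUtils.CheckSum import crc64, seguid

seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
print(seguid(seq), crc64(seq))   # stable sequence checksums
print(GC_skew(seq, window=12))   # (G - C) / (G + C) per 12-bp window
```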
### Bio.Data.CodonTable - Genetic Codes
Access to NCBI genetic code tables:
```python
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table.forward_table) # codon to amino acid
print(standard_table.back_table) # amino acid to codons
print(standard_table.start_codons)
print(standard_table.stop_codons)
```
**Available codes:**
- Standard code (1)
- Vertebrate mitochondrial (2)
- Yeast mitochondrial (3)
- And many more organism-specific codes
## Sequence Motifs and Patterns
### Bio.motifs - Sequence Motif Analysis
Tools for working with sequence motifs:
**Position Weight Matrices (PWM):**
- Create PWM from aligned sequences
- Calculate information content
- Search sequences for motif matches
- Generate consensus sequences
**Position Specific Scoring Matrices (PSSM):**
- Convert PWM to PSSM
- Score sequences against motifs
- Determine significance thresholds
**Supported Formats:**
- JASPAR
- TRANSFAC
- MEME
- AlignAce
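**Example** (a small motif built from toy instances; the score threshold is arbitrary):
```python
from Bio import motifs
from Bio.Seq import Seq

instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("AACCC")]
m = motifs.create(instances)
print(m.consensus)

pwm = m.counts.normalize(pseudocounts=0.5)
pssm = pwm.log_odds()
for position, score in pssm.search(Seq("TTTACACTGCATACGC"), threshold=3.0):
    print(position, score)
```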
### Bio.Restriction - Restriction Enzymes
Comprehensive restriction enzyme database and analysis:
**Capabilities:**
- Search for restriction sites
- Predict digestion products
- Analyze restriction maps
- Access enzyme properties (recognition site, cut positions, isoschizomers)
**Example usage:**
```python
from Bio import Restriction
from Bio.Seq import Seq
seq = Seq("GAATTC...")
enzyme = Restriction.EcoRI
results = enzyme.search(seq)
```

View File

@@ -0,0 +1,306 @@
# BioPython Database Access and Search Tools
This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.
## NCBI Database Access
### Bio.Entrez - NCBI E-utilities Interface
Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.
**Important:** Always set your email before using Entrez:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com"
```
#### Core Query Functions
**esearch** - Search databases and retrieve IDs:
```python
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
record = Entrez.read(handle)
id_list = record["IdList"]
```
Parameters:
- `db`: Database to search (nucleotide, protein, pubmed, etc.)
- `term`: Search query
- `retmax`: Maximum number of IDs to return
- `sort`: Sort order (relevance, pub_date, etc.)
- `usehistory`: Store results on server (useful for large queries)
**efetch** - Retrieve full records:
```python
handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
```
Parameters:
- `db`: Database name
- `id`: Single ID or comma-separated list
- `rettype`: Return type (gb, fasta, gp, xml, etc.)
- `retmode`: Return mode (text, xml, asn.1)
- Automatically uses POST for >200 IDs
**elink** - Find related records across databases:
```python
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
result = Entrez.read(handle)
```
Parameters:
- `dbfrom`: Source database
- `db`: Target database
- `id`: ID(s) to link from
- Returns LinkOut providers and relevancy scores
**esummary** - Get document summaries:
```python
handle = Entrez.esummary(db="protein", id="15718680")
summary = Entrez.read(handle)
print(summary[0]['Title'])
```
Returns quick overviews without full records.
**einfo** - Get database statistics:
```python
handle = Entrez.einfo(db="nucleotide")
info = Entrez.read(handle)
```
Provides field indices, term counts, update dates, and available links.
**epost** - Upload ID lists to server:
```python
handle = Entrez.epost("nucleotide", id="123456,789012")
result = Entrez.read(handle)
webenv = result["WebEnv"]
query_key = result["QueryKey"]
```
Useful for large queries split across multiple requests.
**espell** - Get spelling suggestions:
```python
handle = Entrez.espell(term="brest cancer")
result = Entrez.read(handle)
print(result["CorrectedQuery"]) # "breast cancer"
```
**ecitmatch** - Convert citations to PubMed IDs:
```python
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
```
#### Data Processing Functions
**Entrez.read()** - Parse XML to Python dictionary:
```python
handle = Entrez.esearch(db="protein", term="insulin")
record = Entrez.read(handle)
```
**Entrez.parse()** - Generator for large XML results:
```python
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
process(record)
```
#### Common Workflows
**Download sequences by accession:**
```python
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
```
**Search and download multiple sequences:**
```python
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax="100")
search_results = Entrez.read(search_handle)
# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
print(record.id)
```
**Use WebEnv for large queries:**
```python
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)
# Fetch in batches
batch_size = 500
for start in range(0, count, batch_size):
fetch_handle = Entrez.efetch(
db="nucleotide",
rettype="fasta",
retmode="text",
retstart=start,
retmax=batch_size,
webenv=post_result["WebEnv"],
query_key=post_result["QueryKey"]
)
# Process batch
```
### Bio.GenBank - GenBank Format Parsing
Low-level GenBank file parser (SeqIO is usually preferred).
### Bio.SwissProt - Swiss-Prot/UniProt Parsing
Parse Swiss-Prot and UniProtKB flat file format:
```python
from Bio import SwissProt
with open("uniprot.dat") as handle:
for record in SwissProt.parse(handle):
print(record.entry_name, record.organism)
```
## Sequence Similarity Searches
### Bio.Blast - BLAST Interface
Tools for running BLAST searches and parsing results.
#### Running BLAST
**NCBI QBLAST (online):**
```python
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
```
Parameters:
- Program: blastn, blastp, blastx, tblastn, tblastx
- Database: nt, nr, refseq_rna, pdb, etc.
- Sequence: string or Seq object
- Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type`
**Local BLAST:**
Run standalone BLAST from command line, then parse results.
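A minimal sketch of the standalone route, assuming the BLAST+ `blastn` binary and a pre-built database called `local_nt` are available on the system:
```python
import subprocess
from Bio.Blast import NCBIXML

subprocess.run(
    ["blastn", "-query", "query.fasta", "-db", "local_nt",
     "-outfmt", "5", "-out", "local_results.xml"],   # -outfmt 5 = XML
    check=True,
)
with open("local_results.xml") as handle:
    record = NCBIXML.read(handle)
print(len(record.alignments), "hits")
```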
#### Parsing BLAST Results
**XML format (recommended):**
```python
from Bio.Blast import NCBIXML
result_handle = open("blast_results.xml")
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 0.001:
print(f"Hit: {alignment.title}")
print(f"Length: {alignment.length}")
print(f"E-value: {hsp.expect}")
print(f"Identities: {hsp.identities}/{hsp.align_length}")
```
**Functions:**
- `NCBIXML.read()`: Single query
- `NCBIXML.parse()`: Multiple queries (generator)
**Key Record Attributes:**
- `alignments`: List of matching sequences
- `query`: Query sequence ID
- `query_length`: Length of query
**Alignment Attributes:**
- `title`: Description of hit
- `length`: Length of hit sequence
- `hsps`: High-scoring segment pairs
**HSP Attributes:**
- `expect`: E-value
- `score`: Bit score
- `identities`: Number of identical residues
- `positives`: Number of positive scoring matches
- `gaps`: Number of gaps
- `align_length`: Length of alignment
- `query`: Aligned query sequence
- `match`: Match indicators
- `sbjct`: Aligned subject sequence
- `query_start`, `query_end`: Query coordinates
- `sbjct_start`, `sbjct_end`: Subject coordinates
#### Common BLAST Workflows
**Find homologs:**
```python
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
out.write(result.read())
```
**Filter results by criteria:**
```python
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 1e-5 and hsp.identities/hsp.align_length > 0.5:
# Process high-quality hits
pass
```
### Bio.SearchIO - Unified Search Results Parser
Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.).
**Key Functions:**
- `read()`: Parse single query
- `parse()`: Parse multiple queries (generator)
- `write()`: Write results to file
- `convert()`: Convert between formats
**Supported Tools:**
- BLAST (XML, tabular, plain text)
- HMMER (hmmscan, hmmsearch, phmmer)
- BLAT
- FASTA
- InterProScan
- Exonerate
**Example:**
```python
from Bio import SearchIO
results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
for hit in result:
if hit.hsps[0].evalue < 0.001:
print(hit.id, hit.hsps[0].evalue)
```
## Local Database Management
### BioSQL - SQL Database Interface
Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).
**Features:**
- Store SeqRecord objects with annotations
- Efficient querying and retrieval
- Cross-reference sequences
- Track relationships between sequences
**Example:**
```python
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb")
db = server["my_db"]
# Store sequences
db.load(SeqIO.parse("sequences.gb", "genbank"))
# Query
seq = db.lookup(accession="NC_005816")
```

View File

@@ -0,0 +1,612 @@
# BioPython Specialized Analysis Modules
This document covers BioPython's specialized modules for structural biology, phylogenetics, population genetics, and other advanced analyses.
## Structural Bioinformatics
### Bio.PDB - Protein Structure Analysis
Comprehensive tools for handling macromolecular crystal structures.
#### Structure Hierarchy
PDB structures are organized hierarchically:
- **Structure** → Models → Chains → Residues → Atoms
```python
from Bio.PDB import PDBParser
parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")
# Navigate hierarchy
for model in structure:
for chain in model:
for residue in chain:
for atom in residue:
print(atom.coord) # xyz coordinates
```
#### Parsing Structure Files
**PDB format:**
```python
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.pdb")
```
**mmCIF format:**
```python
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```
**Fast mmCIF parser:**
```python
from Bio.PDB import FastMMCIFParser
parser = FastMMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```
**MMTF format:**
```python
from Bio.PDB import MMTFParser
parser = MMTFParser()
structure = parser.get_structure("structure.mmtf")
```
**Binary CIF:**
```python
from Bio.PDB.binary_cif import BinaryCIFParser
parser = BinaryCIFParser()
structure = parser.get_structure("structure.bcif")
```
#### Downloading Structures
```python
from Bio.PDB import PDBList
pdbl = PDBList()
# Download specific structure
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir="structures/")
# Download all obsolete entries
pdbl.download_obsolete_entries(pdir="obsolete/")
# Update local PDB mirror
pdbl.update_pdb()
```
#### Structure Selection and Filtering
```python
# Select specific chains
chain_A = structure[0]['A']
# Select specific residues
residue_10 = chain_A[10]
# Select specific atoms
ca_atom = residue_10['CA']
# Iterate over specific atom types
for atom in structure.get_atoms():
if atom.name == 'CA': # Alpha carbons only
print(atom.coord)
```
**Structure selectors:**
```python
from Bio.PDB.Polypeptide import is_aa
# Filter by residue type
for residue in structure.get_residues():
if is_aa(residue):
print(f"Amino acid: {residue.resname}")
```
#### Secondary Structure Analysis
**DSSP integration:**
```python
from Bio.PDB import DSSP
# Requires DSSP program installed
model = structure[0]
dssp = DSSP(model, "structure.pdb")
# Access secondary structure
for key in dssp:
secondary_structure = dssp[key][2]
accessibility = dssp[key][3]
print(f"Residue {key}: {secondary_structure}, accessible: {accessibility}")
```
DSSP codes:
- H: Alpha helix
- B: Beta bridge
- E: Extended strand (beta sheet)
- G: 3-10 helix
- I: Pi helix
- T: Turn
- S: Bend
- -: Coil
#### Solvent Accessibility
**Shrake-Rupley algorithm:**
```python
from Bio.PDB import ShrakeRupley
sr = ShrakeRupley()
sr.compute(structure, level="R") # R=residue, A=atom, C=chain, M=model, S=structure
for residue in structure.get_residues():
print(f"{residue.resname} {residue.id[1]}: {residue.sasa} Ų")
```
**NACCESS wrapper:**
```python
from Bio.PDB import NACCESS
# Requires NACCESS program
naccess = NACCESS("structure.pdb")
for residue_id, data in naccess.items():
print(f"Residue {residue_id}: {data['all_atoms_abs']} Ų")
```
**Half-sphere exposure:**
```python
from Bio.PDB.HSExposure import HSExposureCA
model = structure[0]
HSExposureCA(model)  # computes CA-based half-sphere exposure and stores it in residue.xtra
for chain in model:
    for residue in chain:
        if 'EXP_HSE_A_U' in residue.xtra:
            hse_up = residue.xtra['EXP_HSE_A_U']
            hse_down = residue.xtra['EXP_HSE_A_D']
```
#### Structural Alignment and Superimposition
**Standard superimposition:**
```python
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms) # Lists of atoms to align
sup.apply(structure2.get_atoms()) # Apply transformation
print(f"RMSD: {sup.rms}")
print(f"Rotation matrix: {sup.rotran[0]}")
print(f"Translation vector: {sup.rotran[1]}")
```
**QCP (Quaternion Characteristic Polynomial) method:**
```python
from Bio.PDB import QCPSuperimposer
qcp = QCPSuperimposer()
qcp.set(ref_coords, alt_coords)
qcp.run()
print(f"RMSD: {qcp.get_rms()}")
```
#### Geometric Calculations
**Distances and angles:**
```python
# Distance between atoms
from Bio.PDB import Vector
dist = atom1 - atom2 # Returns distance
# Angle between three atoms
from Bio.PDB import calc_angle
angle = calc_angle(atom1.coord, atom2.coord, atom3.coord)
# Dihedral angle
from Bio.PDB import calc_dihedral
dihedral = calc_dihedral(atom1.coord, atom2.coord, atom3.coord, atom4.coord)
```
**Vector operations:**
```python
from Bio.PDB.Vector import Vector
v1 = Vector(atom1.coord)
v2 = Vector(atom2.coord)
# Vector operations
v3 = v1 + v2
v4 = v1 - v2
dot_product = v1 * v2
cross_product = v1 ** v2
magnitude = v1.norm()
normalized = v1.normalized()
```
#### Internal Coordinates
Advanced residue geometry representation:
```python
from Bio.PDB import internal_coords
# Enable internal coordinates
structure.atom_to_internal_coordinates()
# Access phi, psi angles
for residue in structure.get_residues():
if residue.internal_coord:
print(f"Phi: {residue.internal_coord.get_angle('phi')}")
print(f"Psi: {residue.internal_coord.get_angle('psi')}")
```
#### Writing Structures
```python
from Bio.PDB import PDBIO
io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")
# Save specific selection
io.save("chain_A.pdb", select=ChainSelector("A"))
```
### Bio.SCOP - SCOP Database
Access to Structural Classification of Proteins database.
### Bio.KEGG - Pathway Analysis
Interface to KEGG (Kyoto Encyclopedia of Genes and Genomes) databases:
**Capabilities:**
- Access pathway maps
- Retrieve enzyme data
- Get compound information
- Query orthology relationships
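**Example** (a minimal REST sketch; the organism code and gene identifier are illustrative):
```python
from Bio.KEGG import REST

print(REST.kegg_list("pathway", "hsa").read()[:200])   # human pathway listing
print(REST.kegg_get("hsa:7157").read()[:200])          # one gene entry as flat text
```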
## Phylogenetics
### Bio.Phylo - Phylogenetic Tree Analysis
Comprehensive phylogenetic tree manipulation and analysis.
#### Reading and Writing Trees
**Supported formats:**
- Newick: Simple, widely-used format
- NEXUS: Rich metadata format
- PhyloXML: XML-based with extensive annotations
- NeXML: Modern XML standard
```python
from Bio import Phylo
# Read tree
tree = Phylo.read("tree.nwk", "newick")
# Read multiple trees
trees = list(Phylo.parse("trees.nex", "nexus"))
# Write tree
Phylo.write(tree, "output.nwk", "newick")
```
#### Tree Visualization
**ASCII visualization:**
```python
Phylo.draw_ascii(tree)
```
**Matplotlib plotting:**
```python
import matplotlib.pyplot as plt
Phylo.draw(tree)
plt.show()
# With customization
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("My Phylogenetic Tree")
plt.show()
```
#### Tree Navigation and Manipulation
**Find clades:**
```python
# Get all terminal nodes (leaves)
terminals = tree.get_terminals()
# Get all nonterminal nodes
nonterminals = tree.get_nonterminals()
# Find specific clade
target = tree.find_any(name="Species_A")
# Find all matching clades
matches = tree.find_clades(terminal=True)
```
**Tree properties:**
```python
# Count terminals
num_species = tree.count_terminals()
# Get total branch length
total_length = tree.total_branch_length()
# Check if tree is bifurcating
is_bifurcating = tree.is_bifurcating()
# Get maximum distance from root to any tip
max_dist = max(tree.depths().values())
```
**Tree modification:**
```python
# Prune tree to specific taxa
keep_taxa = ["Species_A", "Species_B", "Species_C"]
tree.prune(keep_taxa)
# Collapse short branches
tree.collapse_all(lambda c: c.branch_length < 0.01)
# Ladderize (sort branches)
tree.ladderize()
# Root tree at midpoint
tree.root_at_midpoint()
# Root at specific clade
outgroup = tree.find_any(name="Outgroup_species")
tree.root_with_outgroup(outgroup)
```
**Calculate distances:**
```python
# Distance between two clades
dist = tree.distance(clade1, clade2)
# Distance from root
root_dist = tree.distance(tree.root, terminal_clade)
```
#### Tree Construction
**Distance-based methods:**
```python
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator
from Bio import AlignIO
# Load alignment
aln = AlignIO.read("alignment.fasta", "fasta")
# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
# Construct tree using UPGMA
constructor = DistanceTreeConstructor()
tree_upgma = constructor.upgma(dm)
# Or using Neighbor-Joining
tree_nj = constructor.nj(dm)
```
**Parsimony method:**
```python
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher
scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)
tree = searcher.search(starting_tree, alignment)
```
**Distance calculators:**
- 'identity': Simple identity scoring
- 'blastn': BLAST nucleotide scoring
- 'blastp': BLAST protein scoring
- 'dnafull': EMBOSS DNA scoring matrix
- 'blosum62': BLOSUM62 protein matrix
- 'pam250': PAM250 protein matrix
#### Consensus Trees
```python
from Bio.Phylo.Consensus import majority_consensus, strict_consensus
# Strict consensus
consensus_strict = strict_consensus(trees)
# Majority rule consensus
consensus_majority = majority_consensus(trees, cutoff=0.5)
# Bootstrap consensus: resample the alignment, rebuild trees, then take a majority consensus
from Bio.Phylo.Consensus import bootstrap_consensus
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
constructor = DistanceTreeConstructor(DistanceCalculator('identity'))
bootstrap_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
```
#### External Tool Wrappers
**PhyML:**
```python
from Bio.Phylo.Applications import PhymlCommandline
cmd = PhymlCommandline(input="alignment.phy", datatype="nt", model="HKY85", alpha="e", bootstrap=100)
stdout, stderr = cmd()
tree = Phylo.read("alignment.phy_phyml_tree.txt", "newick")
```
**RAxML:**
```python
from Bio.Phylo.Applications import RaxmlCommandline
cmd = RaxmlCommandline(
sequences="alignment.phy",
model="GTRGAMMA",
name="mytree",
parsimony_seed=12345
)
stdout, stderr = cmd()
```
**FastTree:**
```python
from Bio.Phylo.Applications import FastTreeCommandline
cmd = FastTreeCommandline(input="alignment.fasta", out="tree.nwk", gtr=True, gamma=True)
stdout, stderr = cmd()
```
### Bio.Phylo.PAML - Evolutionary Analysis
Interface to PAML (Phylogenetic Analysis by Maximum Likelihood):
**CODEML - Codon-based analysis:**
```python
from Bio.Phylo.PAML import codeml
cml = codeml.Codeml()
cml.alignment = "alignment.phy"
cml.tree = "tree.nwk"
cml.out_file = "results.out"
cml.working_dir = "./paml_wd"
# Set parameters
cml.set_options(
seqtype=1, # Codon sequences
model=0, # One omega ratio
NSsites=[0, 1, 2], # Test different models
CodonFreq=2, # F3x4 codon frequencies
)
results = cml.run()
```
**BaseML - Nucleotide-based analysis:**
```python
from Bio.Phylo.PAML import baseml
bml = baseml.Baseml()
bml.alignment = "alignment.phy"
bml.tree = "tree.nwk"
results = bml.run()
```
**YN00 - Yang-Nielsen method:**
```python
from Bio.Phylo.PAML import yn00
yn = yn00.Yn00()
yn.alignment = "alignment.phy"
results = yn.run()
```
## Population Genetics
### Bio.PopGen - Population Genetics Analysis
Tools for population-level genetic analysis.
**Capabilities:**
- Allele frequency calculations
- Hardy-Weinberg equilibrium testing
- Linkage disequilibrium analysis
- F-statistics (FST, FIS, FIT)
- Tajima's D
- Population structure analysis
## Clustering and Machine Learning
### Bio.Cluster - Clustering Algorithms
Statistical clustering for gene expression and other biological data:
**Hierarchical clustering:**
```python
from Bio.Cluster import treecluster
tree = treecluster(data, method='a', dist='e')
# method: 'a'=average, 's'=single, 'm'=maximum, 'c'=centroid
# dist: 'e'=Euclidean, 'c'=correlation, 'a'=absolute correlation
```
**k-means clustering:**
```python
from Bio.Cluster import kcluster
clusterid, error, nfound = kcluster(data, nclusters=5, npass=100)
```
**Self-Organizing Maps (SOM):**
```python
from Bio.Cluster import somcluster
clusterid, celldata = somcluster(data, nx=3, ny=3)
```
**Principal Component Analysis:**
```python
from Bio.Cluster import pca
columnmean, coordinates, components, eigenvalues = pca(data)
```
## Visualization
### Bio.Graphics - Genomic Visualization
Tools for creating publication-quality biological graphics.
**GenomeDiagram - Circular and linear genome maps:**
```python
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read("genome.gb", "genbank")
gd_diagram = GenomeDiagram.Diagram("Genome Map")
gd_track = gd_diagram.new_track(1, greytrack=True)
gd_feature_set = gd_track.new_set()
# Add features
for feature in record.features:
if feature.type == "gene":
gd_feature_set.add_feature(feature, color="blue", label=True)
gd_diagram.draw(format="linear", pagesize='A4', fragments=1)
gd_diagram.write("genome_map.pdf", "PDF")
```
**Chromosomes - Chromosome visualization:**
```python
from Bio.Graphics.BasicChromosome import Chromosome
chr = Chromosome("Chromosome 1")
chr.add("gene1", 1000, 2000, color="red")
chr.add("gene2", 3000, 4500, color="blue")
```
## Phenotype Analysis
### Bio.phenotype - Phenotypic Microarray Analysis
Tools for analyzing phenotypic microarray data (e.g., Biolog plates):
**Capabilities:**
- Parse PM plate data
- Growth curve analysis
- Compare phenotypic profiles
- Calculate similarity metrics

View File

@@ -0,0 +1,370 @@
#!/usr/bin/env python3
"""
Sequence alignment and phylogenetic analysis using BioPython.
This script demonstrates:
- Pairwise sequence alignment
- Multiple sequence alignment I/O
- Distance matrix calculation
- Phylogenetic tree construction
- Tree manipulation and visualization
"""
from Bio import Align, AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher
from Bio.Seq import Seq
import matplotlib.pyplot as plt
def pairwise_alignment_example():
"""Demonstrate pairwise sequence alignment."""
print("Pairwise Sequence Alignment")
print("=" * 60)
# Create aligner
aligner = Align.PairwiseAligner()
# Set parameters
aligner.mode = "global" # or 'local' for local alignment
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5
# Sequences to align
seq1 = "ACGTACGTACGT"
seq2 = "ACGTTACGTGT"
print(f"Sequence 1: {seq1}")
print(f"Sequence 2: {seq2}")
print()
# Perform alignment
alignments = aligner.align(seq1, seq2)
# Show results
print(f"Number of optimal alignments: {len(alignments)}")
print(f"Best alignment score: {alignments.score:.1f}")
print()
# Display best alignment
print("Best alignment:")
print(alignments[0])
print()
def local_alignment_example():
"""Demonstrate local alignment (Smith-Waterman)."""
print("Local Sequence Alignment")
print("=" * 60)
aligner = Align.PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5
seq1 = "AAAAACGTACGTACGTAAAAA"
seq2 = "TTTTTTACGTACGTTTTTTT"
print(f"Sequence 1: {seq1}")
print(f"Sequence 2: {seq2}")
print()
alignments = aligner.align(seq1, seq2)
print(f"Best local alignment score: {alignments.score:.1f}")
print()
print("Best local alignment:")
print(alignments[0])
print()
def read_and_analyze_alignment(alignment_file, format="fasta"):
"""Read and analyze a multiple sequence alignment."""
print(f"Reading alignment from: {alignment_file}")
print("-" * 60)
# Read alignment
alignment = AlignIO.read(alignment_file, format)
print(f"Number of sequences: {len(alignment)}")
print(f"Alignment length: {alignment.get_alignment_length()}")
print()
# Display alignment
print("Alignment preview:")
for record in alignment[:5]: # Show first 5 sequences
print(f"{record.id[:15]:15s} {record.seq[:50]}...")
print()
# Calculate some statistics
analyze_alignment_statistics(alignment)
return alignment
def analyze_alignment_statistics(alignment):
"""Calculate statistics for an alignment."""
print("Alignment Statistics:")
print("-" * 60)
# Get alignment length
length = alignment.get_alignment_length()
# Count gaps
total_gaps = sum(str(record.seq).count("-") for record in alignment)
gap_percentage = (total_gaps / (length * len(alignment))) * 100
print(f"Total positions: {length}")
print(f"Number of sequences: {len(alignment)}")
print(f"Total gaps: {total_gaps} ({gap_percentage:.1f}%)")
print()
# Calculate conservation at each position
conserved_positions = 0
for i in range(length):
column = alignment[:, i]
# Count most common residue
if column.count(max(set(column), key=column.count)) == len(alignment):
conserved_positions += 1
conservation = (conserved_positions / length) * 100
print(f"Fully conserved positions: {conserved_positions} ({conservation:.1f}%)")
print()
def calculate_distance_matrix(alignment):
"""Calculate distance matrix from alignment."""
print("Calculating Distance Matrix")
print("-" * 60)
calculator = DistanceCalculator("identity")
dm = calculator.get_distance(alignment)
print("Distance matrix:")
print(dm)
print()
return dm
def build_upgma_tree(alignment):
"""Build phylogenetic tree using UPGMA."""
print("Building UPGMA Tree")
print("=" * 60)
# Calculate distance matrix
calculator = DistanceCalculator("identity")
dm = calculator.get_distance(alignment)
# Construct tree
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm)
print("UPGMA tree constructed")
print(f"Number of terminals: {tree.count_terminals()}")
print()
return tree
def build_nj_tree(alignment):
"""Build phylogenetic tree using Neighbor-Joining."""
print("Building Neighbor-Joining Tree")
print("=" * 60)
# Calculate distance matrix
calculator = DistanceCalculator("identity")
dm = calculator.get_distance(alignment)
# Construct tree
constructor = DistanceTreeConstructor(calculator)
tree = constructor.nj(dm)
print("Neighbor-Joining tree constructed")
print(f"Number of terminals: {tree.count_terminals()}")
print()
return tree
def visualize_tree(tree, title="Phylogenetic Tree"):
"""Visualize phylogenetic tree."""
print("Visualizing tree...")
print()
# ASCII visualization
print("ASCII tree:")
Phylo.draw_ascii(tree)
print()
# Matplotlib visualization
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title(title)
plt.tight_layout()
plt.savefig("tree_visualization.png", dpi=300, bbox_inches="tight")
print("Tree saved to tree_visualization.png")
print()
def manipulate_tree(tree):
"""Demonstrate tree manipulation operations."""
print("Tree Manipulation")
print("=" * 60)
# Get terminals
terminals = tree.get_terminals()
print(f"Terminal nodes: {[t.name for t in terminals]}")
print()
# Get nonterminals
nonterminals = tree.get_nonterminals()
print(f"Number of internal nodes: {len(nonterminals)}")
print()
# Calculate total branch length
total_length = tree.total_branch_length()
print(f"Total branch length: {total_length:.4f}")
print()
# Find specific clade
if len(terminals) > 0:
target_name = terminals[0].name
found = tree.find_any(name=target_name)
print(f"Found clade: {found.name}")
print()
# Ladderize tree (sort branches)
tree.ladderize()
print("Tree ladderized (branches sorted)")
print()
# Root at midpoint
tree.root_at_midpoint()
print("Tree rooted at midpoint")
print()
return tree
def read_and_analyze_tree(tree_file, format="newick"):
"""Read and analyze a phylogenetic tree."""
print(f"Reading tree from: {tree_file}")
print("-" * 60)
tree = Phylo.read(tree_file, format)
print(f"Tree format: {format}")
print(f"Number of terminals: {tree.count_terminals()}")
print(f"Is bifurcating: {tree.is_bifurcating()}")
print(f"Total branch length: {tree.total_branch_length():.4f}")
print()
# Show tree structure
print("Tree structure:")
Phylo.draw_ascii(tree)
print()
return tree
def compare_trees(tree1, tree2):
"""Compare two phylogenetic trees."""
print("Comparing Trees")
print("=" * 60)
# Get terminal names
terminals1 = {t.name for t in tree1.get_terminals()}
terminals2 = {t.name for t in tree2.get_terminals()}
print(f"Tree 1 terminals: {len(terminals1)}")
print(f"Tree 2 terminals: {len(terminals2)}")
print(f"Shared terminals: {len(terminals1 & terminals2)}")
print(f"Unique to tree 1: {len(terminals1 - terminals2)}")
print(f"Unique to tree 2: {len(terminals2 - terminals1)}")
print()
def create_example_alignment():
"""Create an example alignment for demonstration."""
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
sequences = [
SeqRecord(Seq("ACTGCTAGCTAGCTAG"), id="seq1"),
SeqRecord(Seq("ACTGCTAGCT-GCTAG"), id="seq2"),
SeqRecord(Seq("ACTGCTAGCTAGCTGG"), id="seq3"),
SeqRecord(Seq("ACTGCT-GCTAGCTAG"), id="seq4"),
]
alignment = MultipleSeqAlignment(sequences)
# Save alignment
AlignIO.write(alignment, "example_alignment.fasta", "fasta")
print("Created example alignment: example_alignment.fasta")
print()
return alignment
def example_workflow():
"""Demonstrate complete alignment and phylogeny workflow."""
print("=" * 60)
print("BioPython Alignment & Phylogeny Workflow")
print("=" * 60)
print()
# Pairwise alignment examples
pairwise_alignment_example()
print()
local_alignment_example()
print()
# Create example data
alignment = create_example_alignment()
# Analyze alignment
analyze_alignment_statistics(alignment)
# Calculate distance matrix
dm = calculate_distance_matrix(alignment)
# Build trees
upgma_tree = build_upgma_tree(alignment)
nj_tree = build_nj_tree(alignment)
# Manipulate tree
manipulate_tree(upgma_tree)
# Visualize
visualize_tree(upgma_tree, "UPGMA Tree")
print("Workflow completed!")
print()
if __name__ == "__main__":
example_workflow()
print("Note: For real analyses, use actual alignment files.")
print("Supported alignment formats: clustal, phylip, stockholm, nexus, fasta")
print("Supported tree formats: newick, nexus, phyloxml, nexml")

View File

@@ -0,0 +1,272 @@
#!/usr/bin/env python3
"""
BLAST searches and result parsing using BioPython.
This script demonstrates:
- Running BLAST searches via NCBI (qblast)
- Parsing BLAST XML output
- Filtering and analyzing results
- Working with alignments and HSPs
"""
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
def run_blast_online(sequence, program="blastn", database="nt", expect=0.001):
"""
Run BLAST search via NCBI's qblast.
Parameters:
- sequence: Sequence string or Seq object
- program: blastn, blastp, blastx, tblastn, tblastx
- database: nt (nucleotide), nr (protein), refseq_rna, etc.
- expect: E-value threshold
"""
print(f"Running {program} search against {database} database...")
print(f"E-value threshold: {expect}")
print("-" * 60)
# Run BLAST
result_handle = NCBIWWW.qblast(
program=program,
database=database,
sequence=sequence,
expect=expect,
hitlist_size=50, # Maximum number of hits to return
)
# Save results
output_file = "blast_results.xml"
with open(output_file, "w") as out:
out.write(result_handle.read())
result_handle.close()
print(f"BLAST search complete. Results saved to {output_file}")
print()
return output_file
def parse_blast_results(xml_file, max_hits=10, evalue_threshold=0.001):
"""Parse BLAST XML results."""
print(f"Parsing BLAST results from: {xml_file}")
print(f"E-value threshold: {evalue_threshold}")
print("=" * 60)
with open(xml_file) as result_handle:
blast_record = NCBIXML.read(result_handle)
print(f"Query: {blast_record.query}")
print(f"Query length: {blast_record.query_length} residues")
print(f"Database: {blast_record.database}")
print(f"Number of alignments: {len(blast_record.alignments)}")
print()
hit_count = 0
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect <= evalue_threshold:
hit_count += 1
if hit_count <= max_hits:
print(f"Hit {hit_count}:")
print(f" Sequence: {alignment.title}")
print(f" Length: {alignment.length}")
print(f" E-value: {hsp.expect:.2e}")
print(f" Score: {hsp.score}")
print(f" Identities: {hsp.identities}/{hsp.align_length} ({hsp.identities / hsp.align_length * 100:.1f}%)")
print(f" Positives: {hsp.positives}/{hsp.align_length} ({hsp.positives / hsp.align_length * 100:.1f}%)")
print(f" Gaps: {hsp.gaps}/{hsp.align_length}")
print(f" Query range: {hsp.query_start} - {hsp.query_end}")
print(f" Subject range: {hsp.sbjct_start} - {hsp.sbjct_end}")
print()
# Show alignment (first 100 characters)
print(" Alignment preview:")
print(f" Query: {hsp.query[:100]}")
print(f" Match: {hsp.match[:100]}")
print(f" Sbjct: {hsp.sbjct[:100]}")
print()
print(f"Total significant hits (E-value <= {evalue_threshold}): {hit_count}")
print()
return blast_record
def parse_multiple_queries(xml_file):
"""Parse BLAST results with multiple queries."""
print(f"Parsing multiple queries from: {xml_file}")
print("=" * 60)
with open(xml_file) as result_handle:
blast_records = NCBIXML.parse(result_handle)
for i, blast_record in enumerate(blast_records, 1):
print(f"\nQuery {i}: {blast_record.query}")
print(f" Number of hits: {len(blast_record.alignments)}")
if blast_record.alignments:
best_hit = blast_record.alignments[0]
best_hsp = best_hit.hsps[0]
print(f" Best hit: {best_hit.title[:80]}...")
print(f" Best E-value: {best_hsp.expect:.2e}")
def filter_blast_results(blast_record, min_identity=0.7, min_coverage=0.5):
"""Filter BLAST results by identity and coverage."""
print(f"Filtering results:")
print(f" Minimum identity: {min_identity * 100}%")
print(f" Minimum coverage: {min_coverage * 100}%")
print("-" * 60)
filtered_hits = []
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
identity_fraction = hsp.identities / hsp.align_length
coverage = hsp.align_length / blast_record.query_length
if identity_fraction >= min_identity and coverage >= min_coverage:
filtered_hits.append(
{
"title": alignment.title,
"length": alignment.length,
"evalue": hsp.expect,
"identity": identity_fraction,
"coverage": coverage,
"alignment": alignment,
"hsp": hsp,
}
)
print(f"Found {len(filtered_hits)} hits matching criteria")
print()
# Sort by E-value
filtered_hits.sort(key=lambda x: x["evalue"])
# Display top hits
for i, hit in enumerate(filtered_hits[:5], 1):
print(f"{i}. {hit['title'][:80]}")
print(f" Identity: {hit['identity']*100:.1f}%, Coverage: {hit['coverage']*100:.1f}%, E-value: {hit['evalue']:.2e}")
print()
return filtered_hits
def extract_hit_sequences(blast_record, output_file="blast_hits.fasta"):
"""Extract aligned sequences from BLAST results."""
print(f"Extracting hit sequences to {output_file}...")
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
records = []
for i, alignment in enumerate(blast_record.alignments[:10]): # Top 10 hits
hsp = alignment.hsps[0] # Best HSP for this alignment
# Extract accession from title
accession = alignment.title.split()[0]
# Create SeqRecord from aligned subject sequence
record = SeqRecord(
Seq(hsp.sbjct.replace("-", "")), # Remove gaps
id=accession,
description=f"E-value: {hsp.expect:.2e}, Identity: {hsp.identities}/{hsp.align_length}",
)
records.append(record)
# Write to FASTA
SeqIO.write(records, output_file, "fasta")
print(f"Extracted {len(records)} sequences")
print()
def analyze_blast_statistics(blast_record):
"""Compute statistics from BLAST results."""
print("BLAST Result Statistics:")
print("-" * 60)
if not blast_record.alignments:
print("No hits found")
return
evalues = []
identities = []
scores = []
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
evalues.append(hsp.expect)
identities.append(hsp.identities / hsp.align_length)
scores.append(hsp.score)
import statistics
print(f"Total HSPs: {len(evalues)}")
print(f"\nE-values:")
print(f" Min: {min(evalues):.2e}")
print(f" Max: {max(evalues):.2e}")
print(f" Median: {statistics.median(evalues):.2e}")
print(f"\nIdentity percentages:")
print(f" Min: {min(identities)*100:.1f}%")
print(f" Max: {max(identities)*100:.1f}%")
print(f" Mean: {statistics.mean(identities)*100:.1f}%")
print(f"\nBit scores:")
print(f" Min: {min(scores):.1f}")
print(f" Max: {max(scores):.1f}")
print(f" Mean: {statistics.mean(scores):.1f}")
print()
def example_workflow():
"""Demonstrate BLAST workflow."""
print("=" * 60)
print("BioPython BLAST Example Workflow")
print("=" * 60)
print()
# Example sequence (human beta-globin)
example_sequence = """
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
""".replace("\n", "").replace(" ", "")
print("Example: Human beta-globin sequence")
print(f"Length: {len(example_sequence)} bp")
print()
# Note: Uncomment to run actual BLAST search (takes time)
# xml_file = run_blast_online(example_sequence, program="blastn", database="nt", expect=0.001)
# For demonstration, use a pre-existing results file
print("To run a real BLAST search, uncomment the run_blast_online() line")
print("For now, demonstrating parsing with example results file")
print()
# If you have results, parse them:
# blast_record = parse_blast_results("blast_results.xml", max_hits=5)
# filtered = filter_blast_results(blast_record, min_identity=0.9)
# analyze_blast_statistics(blast_record)
# extract_hit_sequences(blast_record)
if __name__ == "__main__":
example_workflow()
print()
print("Note: BLAST searches can take several minutes.")
print("For production use, consider running local BLAST instead.")

View File

@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""
File I/O operations using BioPython SeqIO.
This script demonstrates:
- Reading sequences from various formats
- Writing sequences to files
- Converting between formats
- Filtering and processing sequences
- Working with large files efficiently
"""
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def read_sequences(filename, format_type):
"""Read and display sequences from a file."""
print(f"Reading {format_type} file: {filename}")
print("-" * 60)
count = 0
for record in SeqIO.parse(filename, format_type):
count += 1
print(f"ID: {record.id}")
print(f"Name: {record.name}")
print(f"Description: {record.description}")
print(f"Sequence length: {len(record.seq)}")
print(f"Sequence: {record.seq[:50]}...")
print()
# Only show first 3 sequences
if count >= 3:
break
# Count total sequences
total = len(list(SeqIO.parse(filename, format_type)))
print(f"Total sequences in file: {total}")
print()
def read_single_sequence(filename, format_type):
"""Read a single sequence from a file."""
record = SeqIO.read(filename, format_type)
print("Single sequence record:")
print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print()
def write_sequences(records, output_filename, format_type):
"""Write sequences to a file."""
count = SeqIO.write(records, output_filename, format_type)
print(f"Wrote {count} sequences to {output_filename} in {format_type} format")
print()
def convert_format(input_file, input_format, output_file, output_format):
"""Convert sequences from one format to another."""
count = SeqIO.convert(input_file, input_format, output_file, output_format)
print(f"Converted {count} sequences from {input_format} to {output_format}")
print()
def filter_sequences(input_file, format_type, min_length=100, max_length=1000):
"""Filter sequences by length."""
filtered = []
for record in SeqIO.parse(input_file, format_type):
if min_length <= len(record.seq) <= max_length:
filtered.append(record)
print(f"Found {len(filtered)} sequences between {min_length} and {max_length} bp")
return filtered
def extract_subsequence(input_file, format_type, seq_id, start, end):
"""Extract a subsequence from a specific record."""
# Index for efficient access
record_dict = SeqIO.index(input_file, format_type)
if seq_id in record_dict:
record = record_dict[seq_id]
subseq = record.seq[start:end]
print(f"Extracted subsequence from {seq_id} ({start}:{end}):")
print(subseq)
return subseq
else:
print(f"Sequence {seq_id} not found")
return None
def create_sequence_records():
"""Create SeqRecord objects from scratch."""
# Simple record
simple_record = SeqRecord(
Seq("ATGCATGCATGC"),
id="seq001",
name="MySequence",
description="Example sequence",
annotations={"molecule_type": "DNA"},  # required when writing GenBank output
)
# Record with annotations
annotated_record = SeqRecord(
Seq("ATGGTGCATCTGACTCCTGAGGAG"),
id="seq002",
name="GeneX",
description="Important gene"
)
annotated_record.annotations["molecule_type"] = "DNA"
annotated_record.annotations["organism"] = "Homo sapiens"
return [simple_record, annotated_record]
def index_large_file(filename, format_type):
"""Index a large file for random access without loading into memory."""
# Create index
record_index = SeqIO.index(filename, format_type)
print(f"Indexed {len(record_index)} sequences")
print(f"Available IDs: {list(record_index.keys())[:10]}...")
print()
# Access specific record by ID
if len(record_index) > 0:
first_id = list(record_index.keys())[0]
record = record_index[first_id]
print(f"Accessed record: {record.id}")
print()
# Close index
record_index.close()
def parse_with_quality_scores(fastq_file):
"""Parse FASTQ files with quality scores."""
print("Parsing FASTQ with quality scores:")
print("-" * 60)
for record in SeqIO.parse(fastq_file, "fastq"):
print(f"ID: {record.id}")
print(f"Sequence: {record.seq[:50]}...")
print(f"Quality scores (first 10): {record.letter_annotations['phred_quality'][:10]}")
# Calculate average quality
avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
print(f"Average quality: {avg_quality:.2f}")
print()
break # Just show first record
def batch_process_large_file(input_file, format_type, batch_size=100):
"""Process large files in batches to manage memory."""
batch = []
count = 0
for record in SeqIO.parse(input_file, format_type):
batch.append(record)
count += 1
if len(batch) == batch_size:
# Process batch
print(f"Processing batch of {len(batch)} sequences...")
# Do something with batch
batch = [] # Clear for next batch
# Process remaining records
if batch:
print(f"Processing final batch of {len(batch)} sequences...")
print(f"Total sequences processed: {count}")
def example_workflow():
"""Demonstrate a complete workflow."""
print("=" * 60)
print("BioPython SeqIO Workflow Example")
print("=" * 60)
print()
# Create example sequences
records = create_sequence_records()
# Write as FASTA
write_sequences(records, "example_output.fasta", "fasta")
# Write as GenBank
write_sequences(records, "example_output.gb", "genbank")
# Convert FASTA to GenBank (would work if file exists)
# convert_format("input.fasta", "fasta", "output.gb", "genbank")
print("Example workflow completed!")
if __name__ == "__main__":
example_workflow()
print()
print("Note: This script demonstrates BioPython SeqIO operations.")
print("Uncomment and adapt the functions for your specific files.")

View File

@@ -0,0 +1,293 @@
#!/usr/bin/env python3
"""
NCBI Entrez database access using BioPython.
This script demonstrates:
- Searching NCBI databases
- Downloading sequences by accession
- Retrieving PubMed articles
- Batch downloading with WebEnv
- Proper error handling and rate limiting
"""
import time
from Bio import Entrez, SeqIO
# IMPORTANT: Always set your email
Entrez.email = "your.email@example.com" # Change this!
def search_nucleotide(query, max_results=10):
"""Search NCBI nucleotide database."""
print(f"Searching nucleotide database for: {query}")
print("-" * 60)
handle = Entrez.esearch(db="nucleotide", term=query, retmax=max_results)
record = Entrez.read(handle)
handle.close()
print(f"Found {record['Count']} total matches")
print(f"Returning top {len(record['IdList'])} IDs:")
print(record["IdList"])
print()
return record["IdList"]
def fetch_sequence_by_accession(accession):
"""Download a sequence by accession number."""
print(f"Fetching sequence: {accession}")
try:
handle = Entrez.efetch(
db="nucleotide", id=accession, rettype="gb", retmode="text"
)
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"Successfully retrieved: {record.id}")
print(f"Description: {record.description}")
print(f"Length: {len(record.seq)} bp")
print(f"Organism: {record.annotations.get('organism', 'Unknown')}")
print()
return record
except Exception as e:
print(f"Error fetching {accession}: {e}")
return None
def fetch_multiple_sequences(id_list, output_file="downloaded_sequences.fasta"):
"""Download multiple sequences and save to file."""
print(f"Fetching {len(id_list)} sequences...")
try:
# For >200 IDs, efetch automatically uses POST
handle = Entrez.efetch(
db="nucleotide", id=id_list, rettype="fasta", retmode="text"
)
# Parse and save
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
SeqIO.write(records, output_file, "fasta")
print(f"Successfully downloaded {len(records)} sequences to {output_file}")
print()
return records
except Exception as e:
print(f"Error fetching sequences: {e}")
return []
def search_and_download(query, output_file, max_results=100):
"""Complete workflow: search and download sequences."""
print(f"Searching and downloading: {query}")
print("=" * 60)
# Search
handle = Entrez.esearch(db="nucleotide", term=query, retmax=max_results)
record = Entrez.read(handle)
handle.close()
id_list = record["IdList"]
print(f"Found {len(id_list)} sequences")
if not id_list:
print("No results found")
return
# Download in batches to be polite
batch_size = 100
all_records = []
for start in range(0, len(id_list), batch_size):
end = min(start + batch_size, len(id_list))
batch_ids = id_list[start:end]
print(f"Downloading batch {start // batch_size + 1} ({len(batch_ids)} sequences)...")
handle = Entrez.efetch(
db="nucleotide", id=batch_ids, rettype="fasta", retmode="text"
)
batch_records = list(SeqIO.parse(handle, "fasta"))
handle.close()
all_records.extend(batch_records)
# Be polite - wait between requests
time.sleep(0.5)
# Save all records
SeqIO.write(all_records, output_file, "fasta")
print(f"Downloaded {len(all_records)} sequences to {output_file}")
print()
def use_history_for_large_queries(query, max_results=1000):
"""Use NCBI History server for large queries."""
print("Using NCBI History server for large query")
print("-" * 60)
# Search with history
search_handle = Entrez.esearch(
db="nucleotide", term=query, retmax=max_results, usehistory="y"
)
search_results = Entrez.read(search_handle)
search_handle.close()
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print(f"Found {count} total sequences")
print(f"WebEnv: {webenv[:20]}...")
print(f"QueryKey: {query_key}")
print()
# Fetch in batches using history
batch_size = 500
all_records = []
for start in range(0, min(count, max_results), batch_size):
end = min(start + batch_size, max_results)
print(f"Downloading records {start + 1} to {end}...")
fetch_handle = Entrez.efetch(
db="nucleotide",
rettype="fasta",
retmode="text",
retstart=start,
retmax=batch_size,
webenv=webenv,
query_key=query_key,
)
batch_records = list(SeqIO.parse(fetch_handle, "fasta"))
fetch_handle.close()
all_records.extend(batch_records)
# Be polite
time.sleep(0.5)
print(f"Downloaded {len(all_records)} sequences total")
return all_records
def search_pubmed(query, max_results=10):
"""Search PubMed for articles."""
print(f"Searching PubMed for: {query}")
print("-" * 60)
handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
record = Entrez.read(handle)
handle.close()
id_list = record["IdList"]
print(f"Found {record['Count']} total articles")
print(f"Returning {len(id_list)} PMIDs:")
print(id_list)
print()
return id_list
def fetch_pubmed_abstracts(pmid_list):
"""Fetch PubMed article summaries."""
print(f"Fetching summaries for {len(pmid_list)} articles...")
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="abstract", retmode="text")
abstracts = handle.read()
handle.close()
print(abstracts[:500]) # Show first 500 characters
print("...")
print()
def get_database_info(database="nucleotide"):
"""Get information about an NCBI database."""
print(f"Getting info for database: {database}")
print("-" * 60)
handle = Entrez.einfo(db=database)
record = Entrez.read(handle)
handle.close()
db_info = record["DbInfo"]
print(f"Name: {db_info['DbName']}")
print(f"Description: {db_info['Description']}")
print(f"Record count: {db_info['Count']}")
print(f"Last update: {db_info['LastUpdate']}")
print()
def link_databases(db_from, db_to, id_):
"""Find related records in other databases."""
print(f"Finding links from {db_from} ID {id_} to {db_to}")
print("-" * 60)
handle = Entrez.elink(dbfrom=db_from, db=db_to, id=id_)
record = Entrez.read(handle)
handle.close()
if record[0]["LinkSetDb"]:
linked_ids = [link["Id"] for link in record[0]["LinkSetDb"][0]["Link"]]
print(f"Found {len(linked_ids)} linked records")
print(f"IDs: {linked_ids[:10]}")
else:
print("No linked records found")
print()
def example_workflow():
"""Demonstrate complete Entrez workflow."""
print("=" * 60)
print("BioPython Entrez Example Workflow")
print("=" * 60)
print()
# Note: These are examples - uncomment to run with your email set
# # Example 1: Search and get IDs
# ids = search_nucleotide("Homo sapiens[Organism] AND COX1[Gene]", max_results=5)
#
# # Example 2: Fetch a specific sequence
# fetch_sequence_by_accession("NM_001301717")
#
# # Example 3: Complete search and download
# search_and_download("Escherichia coli[Organism] AND 16S", "ecoli_16s.fasta", max_results=50)
#
# # Example 4: PubMed search
# pmids = search_pubmed("CRISPR[Title] AND 2023[PDAT]", max_results=5)
# fetch_pubmed_abstracts(pmids[:2])
#
# # Example 5: Get database info
# get_database_info("nucleotide")
print("Examples are commented out. Uncomment and set your email to run.")
if __name__ == "__main__":
example_workflow()
print()
print("IMPORTANT: Always set Entrez.email before using these functions!")
print("NCBI requires an email address for their E-utilities.")

View File

@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""
Common sequence operations using BioPython.
This script demonstrates basic sequence manipulation tasks like:
- Creating and manipulating Seq objects
- Transcription and translation
- Complement and reverse complement
- Calculating GC content and melting temperature
"""
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
def demonstrate_seq_operations():
"""Show common Seq object operations."""
# Create DNA sequence
dna_seq = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTG")
print("Original DNA sequence:")
print(dna_seq)
print()
# Transcription (DNA -> RNA)
rna_seq = dna_seq.transcribe()
print("Transcribed to RNA:")
print(rna_seq)
print()
# Translation (DNA -> Protein)
protein_seq = dna_seq.translate()
print("Translated to protein:")
print(protein_seq)
print()
# Translation with stop codon handling
protein_to_stop = dna_seq.translate(to_stop=True)
print("Translated to first stop codon:")
print(protein_to_stop)
print()
# Complement
complement = dna_seq.complement()
print("Complement:")
print(complement)
print()
# Reverse complement
reverse_complement = dna_seq.reverse_complement()
print("Reverse complement:")
print(reverse_complement)
print()
# GC content
gc = gc_fraction(dna_seq) * 100
print(f"GC content: {gc:.2f}%")
print()
# Melting temperature
tm = mt.Tm_NN(dna_seq)
print(f"Melting temperature (nearest-neighbor): {tm:.2f}°C")
print()
# Sequence searching
codon_start = dna_seq.find("ATG")
print(f"Start codon (ATG) position: {codon_start}")
# Count occurrences
g_count = dna_seq.count("G")
print(f"Number of G nucleotides: {g_count}")
print()
def translate_with_genetic_code():
"""Demonstrate translation with different genetic codes."""
dna_seq = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCT")
# Standard genetic code (table 1)
standard = dna_seq.translate(table=1)
print("Standard genetic code translation:")
print(standard)
# Vertebrate mitochondrial code (table 2)
mito = dna_seq.translate(table=2)
print("Vertebrate mitochondrial code translation:")
print(mito)
print()
def working_with_codons():
"""Access genetic code tables."""
from Bio.Data import CodonTable
# Get standard genetic code
standard_table = CodonTable.unambiguous_dna_by_id[1]
print("Standard genetic code:")
print(f"Start codons: {standard_table.start_codons}")
print(f"Stop codons: {standard_table.stop_codons}")
print()
# Show some codon translations
print("Example codons:")
for codon in ["ATG", "TGG", "TAA", "TAG", "TGA"]:
if codon in standard_table.stop_codons:
print(f"{codon} -> STOP")
else:
aa = standard_table.forward_table.get(codon, "Unknown")
print(f"{codon} -> {aa}")
if __name__ == "__main__":
print("=" * 60)
print("BioPython Sequence Operations Demo")
print("=" * 60)
print()
demonstrate_seq_operations()
print("-" * 60)
translate_with_genetic_code()
print("-" * 60)
working_with_codons()

View File

@@ -0,0 +1,355 @@
---
name: bioservices
description: Toolkit for accessing 40+ biological web services and databases programmatically. Use when working with protein sequences, gene pathways (KEGG), identifier mapping (UniProt), compound databases (ChEBI, ChEMBL), sequence analysis (BLAST), pathway interactions, gene ontology, or any bioinformatics data retrieval tasks requiring integration across multiple biological databases.
---
# BioServices
## Overview
BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Use this skill to retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.
## When to Use This Skill
Apply this skill when tasks involve:
- Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam
- Analyzing metabolic pathways and gene functions via KEGG or Reactome
- Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information
- Converting identifiers between different biological databases (KEGG↔UniProt, compound IDs)
- Running sequence similarity searches (BLAST, MUSCLE alignment)
- Querying gene ontology terms (QuickGO, GO annotations)
- Accessing protein-protein interaction data (PSICQUIC, IntactComplex)
- Mining genomic data (BioMart, ArrayExpress, ENA)
- Integrating data from multiple bioinformatics resources in a single workflow
## Core Capabilities
### 1. Protein Analysis
Retrieve protein information, sequences, and functional annotations:
```python
from bioservices import UniProt
u = UniProt(verbose=False)
# Search for protein by name
results = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")
# Retrieve FASTA sequence
sequence = u.retrieve("P43403", "fasta")
# Map identifiers between databases
kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
```
**Key methods:**
- `search()`: Query UniProt with flexible search terms
- `retrieve()`: Get protein entries in various formats (FASTA, XML, tab)
- `mapping()`: Convert identifiers between databases
Reference: `references/services_reference.md` for complete UniProt API details.
### 2. Pathway Discovery and Analysis
Access KEGG pathway information for genes and organisms:
```python
from bioservices import KEGG
k = KEGG()
k.organism = "hsa" # Set to human
# Search for organisms
k.lookfor_organism("droso") # Find Drosophila species
# Find pathways by name
k.lookfor_pathway("B cell") # Returns matching pathway IDs
# Get pathways containing specific genes
pathways = k.get_pathway_by_gene("7535", "hsa") # ZAP70 gene
# Retrieve and parse pathway data
data = k.get("hsa04660")
parsed = k.parse(data)
# Extract pathway interactions
interactions = k.parse_kgml_pathway("hsa04660")
relations = interactions['relations'] # Protein-protein interactions
# Convert to Simple Interaction Format
sif_data = k.pathway2sif("hsa04660")
```
**Key methods:**
- `lookfor_organism()`, `lookfor_pathway()`: Search by name
- `get_pathway_by_gene()`: Find pathways containing genes
- `parse_kgml_pathway()`: Extract structured pathway data
- `pathway2sif()`: Get protein interaction networks
Reference: `references/workflow_patterns.md` for complete pathway analysis workflows.
### 3. Compound Database Searches
Search and cross-reference compounds across multiple databases:
```python
from bioservices import KEGG, UniChem
k = KEGG()
# Search compounds by name
results = k.find("compound", "Geldanamycin") # Returns cpd:C11222
# Get compound information with database links
compound_info = k.get("cpd:C11222") # Includes ChEBI links
# Cross-reference KEGG → ChEMBL using UniChem
u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C11222") # Returns CHEMBL278315
```
**Common workflow:**
1. Search compound by name in KEGG
2. Extract KEGG compound ID
3. Use UniChem for KEGG → ChEMBL mapping
4. ChEBI IDs are often provided in KEGG entries
Reference: `references/identifier_mapping.md` for complete cross-database mapping guide.
### 4. Sequence Analysis
Run BLAST searches and sequence alignments:
```python
from bioservices import NCBIblast
s = NCBIblast(verbose=False)
# Run BLASTP against UniProtKB
jobid = s.run(
program="blastp",
sequence=protein_sequence,
stype="protein",
database="uniprotkb",
email="your.email@example.com" # Required by NCBI
)
# Check job status and retrieve results
s.getStatus(jobid)
results = s.getResult(jobid, "out")
```
**Note:** BLAST jobs are asynchronous. Check status before retrieving results.
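A minimal polling sketch for this asynchronous pattern, reusing the `protein_sequence` variable from the example above and assuming `getStatus()` returns the "RUNNING"/"FINISHED"/"ERROR" strings described in `references/services_reference.md`:
```python
import time
from bioservices import NCBIblast

s = NCBIblast(verbose=False)
jobid = s.run(
    program="blastp",
    sequence=protein_sequence,
    stype="protein",
    database="uniprotkb",
    email="your.email@example.com",
)

# Poll until the job leaves the RUNNING state, then fetch the plain-text report
status = s.getStatus(jobid)
while status == "RUNNING":
    time.sleep(5)  # BLAST jobs can take minutes; avoid hammering the service
    status = s.getStatus(jobid)

if status == "FINISHED":
    report = s.getResult(jobid, "out")
else:
    print(f"BLAST job ended with status: {status}")
```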
### 5. Identifier Mapping
Convert identifiers between different biological databases:
```python
from bioservices import UniProt, KEGG
# UniProt mapping (many database pairs supported)
u = UniProt()
results = u.mapping(
fr="UniProtKB_AC-ID", # Source database
to="KEGG", # Target database
query="P43403" # Identifier(s) to convert
)
# KEGG gene ID → UniProt
kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535")
# For compounds, use UniChem
from bioservices import UniChem
u = UniChem()
chembl_from_kegg = u.get_compound_id_from_kegg("C11222")
```
**Supported mappings (UniProt):**
- UniProtKB ↔ KEGG
- UniProtKB ↔ Ensembl
- UniProtKB ↔ PDB
- UniProtKB ↔ RefSeq
- And many more (see `references/identifier_mapping.md`)
### 6. Gene Ontology Queries
Access GO terms and annotations:
```python
from bioservices import QuickGO
g = QuickGO(verbose=False)
# Retrieve GO term information
term_info = g.Term("GO:0003824", frmt="obo")
# Search annotations
annotations = g.Annotation(protein="P43403", format="tsv")
```
### 7. Protein-Protein Interactions
Query interaction databases via PSICQUIC:
```python
from bioservices import PSICQUIC
s = PSICQUIC(verbose=False)
# Query specific database (e.g., MINT)
interactions = s.query("mint", "ZAP70 AND species:9606")
# List available interaction databases
databases = s.activeDBs
```
**Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others.
## Multi-Service Integration Workflows
BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns:
### Complete Protein Analysis Pipeline
Execute a full protein characterization workflow:
```bash
python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com
```
This script demonstrates:
1. UniProt search for protein entry
2. FASTA sequence retrieval
3. BLAST similarity search
4. KEGG pathway discovery
5. PSICQUIC interaction mapping
### Pathway Network Analysis
Analyze all pathways for an organism:
```bash
python scripts/pathway_analysis.py hsa output_directory/
```
Extracts and analyzes:
- All pathway IDs for organism
- Protein-protein interactions per pathway
- Interaction type distributions
- Exports to CSV/SIF formats
### Cross-Database Compound Search
Map compound identifiers across databases:
```bash
python scripts/compound_cross_reference.py Geldanamycin
```
Retrieves:
- KEGG compound ID
- ChEBI identifier
- ChEMBL identifier
- Basic compound properties
### Batch Identifier Conversion
Convert multiple identifiers at once:
```bash
python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG
```
## Best Practices
### Output Format Handling
Different services return data in various formats (a short parsing sketch follows this list):
- **XML**: Parse using BeautifulSoup (most SOAP services)
- **Tab-separated (TSV)**: Pandas DataFrames for tabular data
- **Dictionary/JSON**: Direct Python manipulation
- **FASTA**: BioPython integration for sequence analysis
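For example, tab-separated output from the UniProt `search()` call shown earlier can be loaded straight into a DataFrame. A minimal sketch, assuming the call returns a tab-delimited string (empty when there are no hits):
```python
import io
import pandas as pd
from bioservices import UniProt

u = UniProt(verbose=False)
tsv = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")

# Parse the tab-separated text into a DataFrame; fall back to an empty frame on no hits
df = pd.read_csv(io.StringIO(tsv), sep="\t") if tsv else pd.DataFrame()
print(df.head())
```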
### Rate Limiting and Verbosity
Control API request behavior:
```python
from bioservices import KEGG
k = KEGG(verbose=False) # Suppress HTTP request details
k.TIMEOUT = 30 # Adjust timeout for slow connections
```
### Error Handling
Wrap service calls in try-except blocks:
```python
try:
results = u.search("ambiguous_query")
if results:
# Process results
pass
except Exception as e:
print(f"Search failed: {e}")
```
### Organism Codes
Use standard organism abbreviations:
- `hsa`: Homo sapiens (human)
- `mmu`: Mus musculus (mouse)
- `dme`: Drosophila melanogaster
- `sce`: Saccharomyces cerevisiae (yeast)
List all organisms: `k.list("organism")` or `k.organismIds`
### Integration with Other Tools
BioServices works well with the following tools (see the combined sketch after this list):
- **BioPython**: Sequence analysis on retrieved FASTA data
- **Pandas**: Tabular data manipulation
- **PyMOL**: 3D structure visualization (retrieve PDB IDs)
- **NetworkX**: Network analysis of pathway interactions
- **Galaxy**: Custom tool wrappers for workflow platforms
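A small sketch of the BioPython hand-off, assuming `retrieve()` returns plain FASTA text as in the UniProt example earlier in this skill:
```python
import io
from Bio import SeqIO
from bioservices import UniProt

u = UniProt(verbose=False)
fasta_text = u.retrieve("P43403", "fasta")  # FASTA entry as a string

# Wrap the string in a handle so BioPython can parse it
record = SeqIO.read(io.StringIO(fasta_text), "fasta")
print(record.id, len(record.seq))
```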
## Resources
### scripts/
Executable Python scripts demonstrating complete workflows:
- `protein_analysis_workflow.py`: End-to-end protein characterization
- `pathway_analysis.py`: KEGG pathway discovery and network extraction
- `compound_cross_reference.py`: Multi-database compound searching
- `batch_id_converter.py`: Bulk identifier mapping utility
Scripts can be executed directly or adapted for specific use cases.
### references/
Detailed documentation loaded as needed:
- `services_reference.md`: Comprehensive list of all 40+ services with methods
- `workflow_patterns.md`: Detailed multi-step analysis workflows
- `identifier_mapping.md`: Complete guide to cross-database ID conversion
Load references when working with specific services or complex integration tasks.
## Installation
```bash
pip install bioservices
```
Dependencies are installed automatically. The package is tested on Python 3.9-3.12.
## Additional Information
For detailed API documentation and advanced features, refer to:
- Official documentation: https://bioservices.readthedocs.io/
- Source code: https://github.com/cokelaer/bioservices
- Service-specific references in `references/services_reference.md`

View File

@@ -0,0 +1,685 @@
# BioServices: Identifier Mapping Guide
This document provides comprehensive information about converting identifiers between different biological databases using BioServices.
## Table of Contents
1. [Overview](#overview)
2. [UniProt Mapping Service](#uniprot-mapping-service)
3. [UniChem Compound Mapping](#unichem-compound-mapping)
4. [KEGG Identifier Conversions](#kegg-identifier-conversions)
5. [Common Mapping Patterns](#common-mapping-patterns)
6. [Troubleshooting](#troubleshooting)
---
## Overview
Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:
1. **UniProt Mapping**: Comprehensive protein/gene ID conversion
2. **UniChem**: Chemical compound ID mapping
3. **KEGG**: Built-in cross-references in entries
4. **PICR**: Protein identifier cross-reference service
---
## UniProt Mapping Service
The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.
### Basic Usage
```python
from bioservices import UniProt
u = UniProt()
# Map single ID
result = u.mapping(
fr="UniProtKB_AC-ID", # Source database
to="KEGG", # Target database
query="P43403" # Identifier to convert
)
print(result)
# Output: {'P43403': ['hsa:7535']}
```
### Batch Mapping
```python
# Map multiple IDs (comma-separated)
ids = ["P43403", "P04637", "P53779"]
result = u.mapping(
fr="UniProtKB_AC-ID",
to="KEGG",
query=",".join(ids)
)
for uniprot_id, kegg_ids in result.items():
print(f"{uniprot_id}{kegg_ids}")
```
### Supported Database Pairs
UniProt supports mapping between 100+ database pairs. Key ones include:
#### Protein/Gene Databases
| Source Format | Code | Target Format | Code |
|---------------|------|---------------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | KEGG | `KEGG` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl | `Ensembl` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Protein | `Ensembl_Protein` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Transcript | `Ensembl_Transcript` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Protein | `RefSeq_Protein` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Nucleotide | `RefSeq_Nucleotide` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GeneID (Entrez) | `GeneID` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | HGNC | `HGNC` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | MGI | `MGI` |
| KEGG | `KEGG` | UniProtKB | `UniProtKB` |
| Ensembl | `Ensembl` | UniProtKB | `UniProtKB` |
| GeneID | `GeneID` | UniProtKB | `UniProtKB` |
#### Structural Databases
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PDB | `PDB` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Pfam | `Pfam` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | InterPro | `InterPro` |
| PDB | `PDB` | UniProtKB | `UniProtKB` |
#### Expression & Proteomics
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PRIDE | `PRIDE` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ProteomicsDB | `ProteomicsDB` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PaxDb | `PaxDb` |
#### Organism-Specific
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | FlyBase | `FlyBase` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | WormBase | `WormBase` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | SGD | `SGD` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ZFIN | `ZFIN` |
#### Other Useful Mappings
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GO | `GO` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Reactome | `Reactome` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | STRING | `STRING` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | BioGRID | `BioGRID` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | OMA | `OMA` |
### Complete List of Database Codes
To get the complete, up-to-date list:
```python
from bioservices import UniProt
u = UniProt()
# The full, current list is maintained in the UniProt REST API ID-mapping documentation
# Common patterns:
# - UniProtKB uses "UniProtKB_AC-ID" (as source) or "UniProtKB" (as target)
# - Most other databases use their standard abbreviation as the code
```
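A hedged sketch for fetching the current list directly from the UniProt REST API. The endpoint URL and JSON layout below are assumptions based on the public ID-mapping documentation, not part of bioservices, so verify them before relying on this:
```python
import requests

# Assumed endpoint for the UniProt ID-mapping field configuration
URL = "https://rest.uniprot.org/configure/idmapping/fields"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()
data = resp.json()

# Defensive iteration: print whatever group/item names are present
for group in data.get("groups", []):
    for item in group.get("items", []):
        direction = []
        if item.get("from"):
            direction.append("from")
        if item.get("to"):
            direction.append("to")
        print(item.get("name"), "/".join(direction))
```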
### Common Database Codes Reference
**Gene/Protein Identifiers:**
- `UniProtKB_AC-ID`: UniProt accession/ID
- `UniProtKB`: UniProt accession
- `KEGG`: KEGG gene IDs (e.g., hsa:7535)
- `GeneID`: NCBI Gene (Entrez) IDs
- `Ensembl`: Ensembl gene IDs
- `Ensembl_Protein`: Ensembl protein IDs
- `Ensembl_Transcript`: Ensembl transcript IDs
- `RefSeq_Protein`: RefSeq protein IDs (NP_)
- `RefSeq_Nucleotide`: RefSeq nucleotide IDs (NM_)
**Gene Nomenclature:**
- `HGNC`: Human Gene Nomenclature Committee
- `MGI`: Mouse Genome Informatics
- `RGD`: Rat Genome Database
- `SGD`: Saccharomyces Genome Database
- `FlyBase`: Drosophila database
- `WormBase`: C. elegans database
- `ZFIN`: Zebrafish database
**Structure:**
- `PDB`: Protein Data Bank
- `Pfam`: Protein families
- `InterPro`: Protein domains
- `SUPFAM`: Superfamily
- `PROSITE`: Protein motifs
**Pathways & Networks:**
- `Reactome`: Reactome pathways
- `BioCyc`: BioCyc pathways
- `PathwayCommons`: Pathway Commons
- `STRING`: Protein-protein networks
- `BioGRID`: Interaction database
### Mapping Examples
#### UniProt → KEGG
```python
from bioservices import UniProt
u = UniProt()
# Single mapping
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
print(result) # {'P43403': ['hsa:7535']}
```
#### KEGG → UniProt
```python
# Reverse mapping
result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
print(result) # {'hsa:7535': ['P43403']}
```
#### UniProt → Ensembl
```python
# To Ensembl gene IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
print(result) # {'P43403': ['ENSG00000115085']}
# To Ensembl protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
print(result) # {'P43403': ['ENSP00000381359']}
```
#### UniProt → PDB
```python
# Find 3D structures
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
print(result) # {'P04637': ['1A1U', '1AIE', '1C26', ...]}
```
#### UniProt → RefSeq
```python
# Get RefSeq protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
print(result) # {'P43403': ['NP_001070.2']}
```
#### Gene Name → UniProt (via search, then mapping)
```python
# First search for gene
search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")
lines = search_result.strip().split("\n")
if len(lines) > 1:
uniprot_id = lines[1].split("\t")[0]
# Then map to other databases
kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
print(kegg_id)
```
---
## UniChem Compound Mapping
UniChem specializes in mapping chemical compound identifiers across databases.
### Source Database IDs
| Source ID | Database |
|-----------|----------|
| 1 | ChEMBL |
| 2 | DrugBank |
| 3 | PDB |
| 4 | IUPHAR/BPS Guide to Pharmacology |
| 5 | PubChem |
| 6 | KEGG |
| 7 | ChEBI |
| 8 | NIH Clinical Collection |
| 14 | FDA/SRS |
| 22 | PubChem |
### Basic Usage
```python
from bioservices import UniChem
u = UniChem()
# Get ChEMBL ID from KEGG compound ID
chembl_id = u.get_compound_id_from_kegg("C11222")
print(chembl_id) # CHEMBL278315
```
### All Compound IDs
```python
# Get all identifiers for a compound
# src_compound_id: compound ID, src_id: source database ID
all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1) # 1 = ChEMBL
for mapping in all_ids:
src_name = mapping['src_name']
src_compound_id = mapping['src_compound_id']
print(f"{src_name}: {src_compound_id}")
```
### Specific Database Conversion
```python
# Convert between specific databases
# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
print(result)
```
### Common Compound Mappings
#### KEGG → ChEMBL
```python
u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C00031") # D-Glucose
print(f"ChEMBL: {chembl_id}")
```
#### ChEMBL → PubChem
```python
result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
if result:
pubchem_id = result[0]['src_compound_id']
print(f"PubChem: {pubchem_id}")
```
#### ChEBI → DrugBank
```python
result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
if result:
drugbank_id = result[0]['src_compound_id']
print(f"DrugBank: {drugbank_id}")
```
---
## KEGG Identifier Conversions
KEGG entries contain cross-references that can be extracted by parsing.
### Extract Database Links from KEGG Entry
```python
from bioservices import KEGG
k = KEGG()
# Get compound entry
entry = k.get("cpd:C11222")
# Parse for specific database
chebi_id = None
uniprot_ids = []
for line in entry.split("\n"):
if "ChEBI:" in line:
# Extract ChEBI ID
parts = line.split("ChEBI:")
if len(parts) > 1:
chebi_id = parts[1].strip().split()[0]
# For genes/proteins
gene_entry = k.get("hsa:7535")
for line in gene_entry.split("\n"):
if line.startswith(" "): # Database links section
if "UniProt:" in line:
parts = line.split("UniProt:")
if len(parts) > 1:
uniprot_id = parts[1].strip()
uniprot_ids.append(uniprot_id)
```
### KEGG Gene ID Components
KEGG gene IDs have format `organism:gene_id`:
```python
kegg_id = "hsa:7535"
organism, gene_id = kegg_id.split(":")
print(f"Organism: {organism}") # hsa (human)
print(f"Gene ID: {gene_id}") # 7535
```
### KEGG Pathway to Genes
```python
k = KEGG()
# Get pathway entry
pathway = k.get("path:hsa04660")
# Parse for gene list
genes = []
in_gene_section = False
for line in pathway.split("\n"):
if line.startswith("GENE"):
in_gene_section = True
if in_gene_section:
if line.startswith(" " * 12): # Gene line
parts = line.strip().split()
if parts:
gene_id = parts[0]
genes.append(f"hsa:{gene_id}")
elif not line.startswith(" "):
break
print(f"Found {len(genes)} genes")
```
---
## Common Mapping Patterns
### Pattern 1: Gene Symbol → Multiple Database IDs
```python
from bioservices import UniProt
def gene_symbol_to_ids(gene_symbol, organism="9606"):
"""Convert gene symbol to multiple database IDs."""
u = UniProt()
# Search for gene
query = f"gene:{gene_symbol} AND organism:{organism}"
result = u.search(query, frmt="tab", columns="id")
lines = result.strip().split("\n")
if len(lines) < 2:
return None
uniprot_id = lines[1].split("\t")[0]
# Map to multiple databases
ids = {
'uniprot': uniprot_id,
'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
}
return ids
# Usage
ids = gene_symbol_to_ids("ZAP70")
print(ids)
```
### Pattern 2: Compound Name → All Database IDs
```python
from bioservices import KEGG, UniChem, ChEBI
def compound_name_to_ids(compound_name):
"""Search compound and get all database IDs."""
k = KEGG()
# Search KEGG
results = k.find("compound", compound_name)
if not results:
return None
# Extract KEGG ID
kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")
# Get KEGG entry for ChEBI
entry = k.get(f"cpd:{kegg_id}")
chebi_id = None
for line in entry.split("\n"):
if "ChEBI:" in line:
parts = line.split("ChEBI:")
if len(parts) > 1:
chebi_id = parts[1].strip().split()[0]
break
# Get ChEMBL from UniChem
u = UniChem()
try:
chembl_id = u.get_compound_id_from_kegg(kegg_id)
except:
chembl_id = None
return {
'kegg': kegg_id,
'chebi': chebi_id,
'chembl': chembl_id
}
# Usage
ids = compound_name_to_ids("Geldanamycin")
print(ids)
```
### Pattern 3: Batch ID Conversion with Error Handling
```python
from bioservices import UniProt
def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
"""Safely map IDs with error handling and chunking."""
u = UniProt()
all_results = {}
for i in range(0, len(ids), chunk_size):
chunk = ids[i:i+chunk_size]
query = ",".join(chunk)
try:
results = u.mapping(fr=from_db, to=to_db, query=query)
all_results.update(results)
print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
except Exception as e:
print(f"✗ Error at chunk {i}: {e}")
# Try individual IDs in failed chunk
for single_id in chunk:
try:
result = u.mapping(fr=from_db, to=to_db, query=single_id)
all_results.update(result)
except:
all_results[single_id] = None
return all_results
# Usage
uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")
```
### Pattern 4: Multi-Hop Mapping
Sometimes you need to map through intermediate databases:
```python
from bioservices import UniProt
def multi_hop_mapping(gene_symbol, organism="9606"):
"""Gene symbol → UniProt → KEGG → Pathways."""
u = UniProt()
k = KEGG()
# Step 1: Gene symbol → UniProt
query = f"gene:{gene_symbol} AND organism:{organism}"
result = u.search(query, frmt="tab", columns="id")
lines = result.strip().split("\n")
if len(lines) < 2:
return None
uniprot_id = lines[1].split("\t")[0]
# Step 2: UniProt → KEGG
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
if not kegg_mapping or uniprot_id not in kegg_mapping:
return None
kegg_id = kegg_mapping[uniprot_id][0]
# Step 3: KEGG → Pathways
organism_code, gene_id = kegg_id.split(":")
pathways = k.get_pathway_by_gene(gene_id, organism_code)
return {
'gene': gene_symbol,
'uniprot': uniprot_id,
'kegg': kegg_id,
'pathways': pathways
}
# Usage
result = multi_hop_mapping("TP53")
print(result)
```
---
## Troubleshooting
### Issue 1: No Mapping Found
**Symptom:** Mapping returns empty or None
**Solutions:**
1. Verify source ID exists in source database
2. Check database code spelling
3. Try reverse mapping
4. Some IDs may not have mappings in all databases
```python
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
if not result or 'P43403' not in result:
print("No mapping found. Try:")
print("1. Verify ID exists: u.search('P43403')")
print("2. Check if protein has KEGG annotation")
```
### Issue 2: Too Many IDs in Batch
**Symptom:** Batch mapping fails or times out
**Solution:** Split into smaller chunks
```python
def chunked_mapping(ids, from_db, to_db, chunk_size=50):
all_results = {}
for i in range(0, len(ids), chunk_size):
chunk = ids[i:i+chunk_size]
result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
all_results.update(result)
return all_results
```
### Issue 3: Multiple Target IDs
**Symptom:** One source ID maps to multiple target IDs
**Solution:** Handle as list
```python
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}
pdb_ids = result['P04637']
print(f"Found {len(pdb_ids)} PDB structures")
for pdb_id in pdb_ids:
print(f" {pdb_id}")
```
### Issue 4: Organism Ambiguity
**Symptom:** Gene symbol maps to multiple organisms
**Solution:** Always specify organism in searches
```python
# Bad: Ambiguous
result = u.search("gene:TP53") # Many organisms have TP53
# Good: Specific
result = u.search("gene:TP53 AND organism:9606") # Human only
```
### Issue 5: Deprecated IDs
**Symptom:** Old database IDs don't map
**Solution:** Update to current IDs first
```python
# Check if ID is current
entry = u.retrieve("P43403", frmt="txt")
# Look for secondary accessions
for line in entry.split("\n"):
if line.startswith("AC"):
print(line) # Shows primary and secondary accessions
```
---
## Best Practices
1. **Always validate inputs** before batch processing
2. **Handle None/empty results** gracefully
3. **Use chunking** for large ID lists (50-100 per chunk)
4. **Cache results** for repeated queries
5. **Specify organism** when possible to avoid ambiguity
6. **Log failures** in batch processing for later retry
7. **Add delays** between large batches to respect API limits
```python
import time
def polite_batch_mapping(ids, from_db, to_db):
"""Batch mapping with rate limiting."""
results = {}
for i in range(0, len(ids), 50):
chunk = ids[i:i+50]
result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
results.update(result)
time.sleep(0.5) # Be nice to the API
return results
```
---
For complete working examples, see:
- `scripts/batch_id_converter.py`: Command-line batch conversion tool
- `workflow_patterns.md`: Integration into larger workflows

View File

@@ -0,0 +1,634 @@
# BioServices: Complete Services Reference
This document provides a comprehensive reference for all major services available in BioServices, including key methods, parameters, and use cases.
## Protein & Gene Resources
### UniProt
Protein sequence and functional information database.
**Initialization:**
```python
from bioservices import UniProt
u = UniProt(verbose=False)
```
**Key Methods:**
- `search(query, frmt="tab", columns=None, limit=None, sort=None, compress=False, include=False, **kwargs)`
- Search UniProt with flexible query syntax
- `frmt`: "tab", "fasta", "xml", "rdf", "gff", "txt"
- `columns`: Comma-separated list (e.g., "id,genes,organism,length")
- Returns: String in requested format
- `retrieve(uniprot_id, frmt="txt")`
- Retrieve specific UniProt entry
- `frmt`: "txt", "fasta", "xml", "rdf", "gff"
- Returns: Entry data in requested format
- `mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")`
- Convert identifiers between databases
- `fr`/`to`: Database identifiers (see identifier_mapping.md)
- `query`: Single ID or comma-separated list
- Returns: Dictionary mapping input to output IDs
- `searchUniProtId(pattern, columns="entry name,length,organism", limit=100)`
- Convenience method for ID-based searches
- Returns: Tab-separated values
**Common columns:** id, entry name, genes, organism, protein names, length, sequence, go-id, ec, pathway, interactor
**Use cases:**
- Protein sequence retrieval for BLAST
- Functional annotation lookup
- Cross-database identifier mapping
- Batch protein information retrieval
---
### KEGG (Kyoto Encyclopedia of Genes and Genomes)
Metabolic pathways, genes, and organisms database.
**Initialization:**
```python
from bioservices import KEGG
k = KEGG()
k.organism = "hsa" # Set default organism
```
**Key Methods:**
- `list(database)`
- List entries in KEGG database
- `database`: "organism", "pathway", "module", "disease", "drug", "compound"
- Returns: Multi-line string with entries
- `find(database, query)`
- Search database by keywords
- Returns: List of matching entries with IDs
- `get(entry_id)`
- Retrieve entry by ID
- Supports genes, pathways, compounds, etc.
- Returns: Raw entry text
- `parse(data)`
- Parse KEGG entry into dictionary
- Returns: Dict with structured data
- `lookfor_organism(name)`
- Search organisms by name pattern
- Returns: List of matching organism codes
- `lookfor_pathway(name)`
- Search pathways by name
- Returns: List of pathway IDs
- `get_pathway_by_gene(gene_id, organism)`
- Find pathways containing gene
- Returns: List of pathway IDs
- `parse_kgml_pathway(pathway_id)`
- Parse pathway KGML for interactions
- Returns: Dict with "entries" and "relations"
- `pathway2sif(pathway_id)`
- Extract Simple Interaction Format data
- Filters for activation/inhibition
- Returns: List of interaction tuples
**Organism codes:**
- hsa: Homo sapiens
- mmu: Mus musculus
- dme: Drosophila melanogaster
- sce: Saccharomyces cerevisiae
- eco: Escherichia coli
**Use cases:**
- Pathway analysis and visualization
- Gene function annotation
- Metabolic network reconstruction
- Protein-protein interaction extraction
---
### HGNC (Human Gene Nomenclature Committee)
Official human gene naming authority.
**Initialization:**
```python
from bioservices import HGNC
h = HGNC()
```
**Key Methods:**
- `search(query)`: Search gene symbols/names
- `fetch(format, query)`: Retrieve gene information
**Use cases:**
- Standardizing human gene names
- Looking up official gene symbols
---
### MyGeneInfo
Gene annotation and query service.
**Initialization:**
```python
from bioservices import MyGeneInfo
m = MyGeneInfo()
```
**Key Methods:**
- `querymany(ids, scopes, fields, species)`: Batch gene queries
- `getgene(geneid)`: Get gene annotation
**Use cases:**
- Batch gene annotation retrieval
- Gene ID conversion
---
## Chemical Compound Resources
### ChEBI (Chemical Entities of Biological Interest)
Dictionary of molecular entities.
**Initialization:**
```python
from bioservices import ChEBI
c = ChEBI()
```
**Key Methods:**
- `getCompleteEntity(chebi_id)`: Full compound information
- `getLiteEntity(chebi_id)`: Basic information
- `getCompleteEntityByList(chebi_ids)`: Batch retrieval
**Use cases:**
- Small molecule information
- Chemical structure data
- Compound property lookup
---
### ChEMBL
Bioactive drug-like compound database.
**Initialization:**
```python
from bioservices import ChEMBL
c = ChEMBL()
```
**Key Methods:**
- `get_compound_by_chemblId(chembl_id)`: Compound details
- `get_target_by_chemblId(chembl_id)`: Target information
- `get_assays()`: Bioassay data
**Use cases:**
- Drug discovery data
- Bioactivity information
- Target-compound relationships
---
### UniChem
Chemical identifier mapping service.
**Initialization:**
```python
from bioservices import UniChem
u = UniChem()
```
**Key Methods:**
- `get_compound_id_from_kegg(kegg_id)`: KEGG → ChEMBL
- `get_all_compound_ids(src_compound_id, src_id)`: Get all IDs
- `get_src_compound_ids(src_compound_id, from_src_id, to_src_id)`: Convert IDs
**Source IDs:**
- 1: ChEMBL
- 2: DrugBank
- 3: PDB
- 6: KEGG
- 7: ChEBI
- 22: PubChem
**Use cases:**
- Cross-database compound ID mapping
- Linking chemical databases
---
### PubChem
Chemical compound database from NIH.
**Initialization:**
```python
from bioservices import PubChem
p = PubChem()
```
**Key Methods:**
- `get_compounds(identifier, namespace)`: Retrieve compounds
- `get_properties(properties, identifier, namespace)`: Get properties
**Use cases:**
- Chemical structure retrieval
- Compound property information
---
## Sequence Analysis Tools
### NCBIblast
Sequence similarity searching.
**Initialization:**
```python
from bioservices import NCBIblast
s = NCBIblast(verbose=False)
```
**Key Methods:**
- `run(program, sequence, stype, database, email, **params)`
- Submit BLAST job
- `program`: "blastp", "blastn", "blastx", "tblastn", "tblastx"
- `stype`: "protein" or "dna"
- `database`: "uniprotkb", "pdb", "refseq_protein", etc.
- `email`: Required by NCBI
- Returns: Job ID
- `getStatus(jobid)`
- Check job status
- Returns: "RUNNING", "FINISHED", "ERROR"
- `getResult(jobid, result_type)`
- Retrieve results
- `result_type`: "out" (default), "ids", "xml"
**Important:** BLAST jobs are asynchronous. Always check status before retrieving results.
**Use cases:**
- Protein homology searches
- Sequence similarity analysis
- Functional annotation by homology
---
## Pathway & Interaction Resources
### Reactome
Pathway database.
**Initialization:**
```python
from bioservices import Reactome
r = Reactome()
```
**Key Methods:**
- `get_pathway_by_id(pathway_id)`: Pathway details
- `search_pathway(query)`: Search pathways
**Use cases:**
- Human pathway analysis
- Biological process annotation
---
### PSICQUIC
Protein interaction query service (federates 30+ databases).
**Initialization:**
```python
from bioservices import PSICQUIC
s = PSICQUIC()
```
**Key Methods:**
- `query(database, query_string)`
- Query specific interaction database
- Returns: PSI-MI TAB format (a parsing sketch appears at the end of this section)
- `activeDBs`
- Property listing available databases
- Returns: List of database names
**Available databases:** MINT, IntAct, BioGRID, DIP, InnateDB, MatrixDB, MPIDB, UniProt, and 30+ more
**Query syntax:** Supports AND, OR, species filters
- Example: "ZAP70 AND species:9606"
**Use cases:**
- Protein-protein interaction discovery
- Network analysis
- Interactome mapping
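A minimal parsing sketch for the returned interactions. It assumes the result is raw PSI-MI TAB text (one tab-separated line per interaction, with interactor A and B identifiers in the first two columns); some bioservices versions return pre-split rows instead, which the sketch also handles:
```python
from bioservices import PSICQUIC

s = PSICQUIC(verbose=False)
result = s.query("mint", "ZAP70 AND species:9606")

# Normalize to a list of rows regardless of the return type
rows = result.splitlines() if isinstance(result, str) else (result or [])
for row in rows[:10]:
    fields = row.split("\t") if isinstance(row, str) else row
    if len(fields) >= 2:
        print(fields[0], "<->", fields[1])  # interactor A and B identifiers
```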
---
### IntactComplex
Protein complex database.
**Initialization:**
```python
from bioservices import IntactComplex
i = IntactComplex()
```
**Key Methods:**
- `search(query)`: Search complexes
- `details(complex_ac)`: Complex details
**Use cases:**
- Protein complex composition
- Multi-protein assembly analysis
---
### OmniPath
Integrated signaling pathway database.
**Initialization:**
```python
from bioservices import OmniPath
o = OmniPath()
```
**Key Methods:**
- `interactions(datasets, organisms)`: Get interactions
- `ptms(datasets, organisms)`: Post-translational modifications
**Use cases:**
- Cell signaling analysis
- Regulatory network mapping
---
## Gene Ontology
### QuickGO
Gene Ontology annotation service.
**Initialization:**
```python
from bioservices import QuickGO
g = QuickGO()
```
**Key Methods:**
- `Term(go_id, frmt="obo")`
- Retrieve GO term information
- Returns: Term definition and metadata
- `Annotation(protein=None, goid=None, format="tsv")`
- Get GO annotations
- Returns: Annotations in requested format
**GO categories:**
- Biological Process (BP)
- Molecular Function (MF)
- Cellular Component (CC)
**Use cases:**
- Functional annotation
- Enrichment analysis
- GO term lookup
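**Example (sketch):** Fetching a term and the annotations for a protein, mirroring the workflows later in this guide (P43403 is the ZAP70 accession; the GO ID is illustrative).
```python
from bioservices import QuickGO

g = QuickGO()
term = g.Term("GO:0003824", frmt="obo")                     # illustrative GO ID (catalytic activity)
annotations = g.Annotation(protein="P43403", format="tsv")  # TSV table of annotations
```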
---
## Genomic Resources
### BioMart
Data mining tool for genomic data.
**Initialization:**
```python
from bioservices import BioMart
b = BioMart()
```
**Key Methods:**
- `datasets(dataset)`: List available datasets
- `attributes(dataset)`: List attributes
- `query(query_xml)`: Execute BioMart query
**Use cases:**
- Bulk genomic data retrieval
- Custom genome annotations
- SNP information
---
### ArrayExpress
Gene expression database.
**Initialization:**
```python
from bioservices import ArrayExpress
a = ArrayExpress()
```
**Key Methods:**
- `queryExperiments(keywords)`: Search experiments
- `retrieveExperiment(accession)`: Get experiment data
**Use cases:**
- Gene expression data
- Microarray analysis
- RNA-seq data retrieval
---
### ENA (European Nucleotide Archive)
Nucleotide sequence database.
**Initialization:**
```python
from bioservices import ENA
e = ENA()
```
**Key Methods:**
- `search_data(query)`: Search sequences
- `retrieve_data(accession)`: Retrieve sequences
**Use cases:**
- Nucleotide sequence retrieval
- Genome assembly access
---
## Structural Biology
### PDB (Protein Data Bank)
3D protein structure database.
**Initialization:**
```python
from bioservices import PDB
p = PDB()
```
**Key Methods:**
- `get_file(pdb_id, file_format)`: Download structure files
- `search(query)`: Search structures
**File formats:** pdb, cif, xml
**Use cases:**
- 3D structure retrieval
- Structure-based analysis
- PyMOL visualization
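**Example (sketch):** Downloading a structure with `get_file`; the PDB ID is illustrative, and the write mode may need to be `"wb"` if your bioservices version returns bytes.
```python
from bioservices import PDB

p = PDB()
data = p.get_file("1FIN", "pdb")       # illustrative PDB ID
with open("1FIN.pdb", "w") as fh:      # use "wb" if bytes are returned
    fh.write(data)
```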
---
### Pfam
Protein family database.
**Initialization:**
```python
from bioservices import Pfam
p = Pfam()
```
**Key Methods:**
- `searchSequence(sequence)`: Find domains in sequence
- `getPfamEntry(pfam_id)`: Domain information
**Use cases:**
- Protein domain identification
- Family classification
- Functional motif discovery
---
## Specialized Resources
### BioModels
Systems biology model repository.
**Initialization:**
```python
from bioservices import BioModels
b = BioModels()
```
**Key Methods:**
- `get_model_by_id(model_id)`: Retrieve SBML model
**Use cases:**
- Systems biology modeling
- SBML model retrieval
---
### COG (Clusters of Orthologous Genes)
Orthologous gene classification.
**Initialization:**
```python
from bioservices import COG
c = COG()
```
**Use cases:**
- Orthology analysis
- Functional classification
---
### BiGG Models
Metabolic network models.
**Initialization:**
```python
from bioservices import BiGG
b = BiGG()
```
**Key Methods:**
- `list_models()`: Available models
- `get_model(model_id)`: Model details
**Use cases:**
- Metabolic network analysis
- Flux balance analysis
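**Example (sketch):** A sketch following the methods listed above; `e_coli_core` is a commonly used BiGG model identifier.
```python
from bioservices import BiGG

b = BiGG()
models = b.list_models()           # enumerate available models
core = b.get_model("e_coli_core")  # retrieve one model's details
```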
---
## General Patterns
### Error Handling
All services may throw exceptions. Wrap calls in try-except:
```python
try:
result = service.method(params)
if result:
# Process result
pass
except Exception as e:
print(f"Error: {e}")
```
### Verbosity Control
Most services support a `verbose` parameter:
```python
service = Service(verbose=False) # Suppress HTTP logs
```
### Rate Limiting
Services have timeouts and rate limits:
```python
service.TIMEOUT = 30 # Adjust timeout
service.DELAY = 1 # Delay between requests (if supported)
```
### Output Formats
Common format parameters:
- `frmt`: "xml", "json", "tab", "txt", "fasta"
- `format`: Service-specific variants
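For example, the same UniProt entry can be retrieved in several formats (P43403 is the ZAP70 accession used elsewhere in this guide):
```python
from bioservices import UniProt

u = UniProt()
xml_entry = u.retrieve("P43403", frmt="xml")
fasta_entry = u.retrieve("P43403", frmt="fasta")
```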
### Caching
Some services cache results:
```python
service.CACHE = True # Enable caching
service.clear_cache() # Clear cache
```
## Additional Resources
For detailed API documentation:
- Official docs: https://bioservices.readthedocs.io/
- Individual service docs linked from main page
- Source code: https://github.com/cokelaer/bioservices

View File

@@ -0,0 +1,811 @@
# BioServices: Common Workflow Patterns
This document describes detailed multi-step workflows for common bioinformatics tasks using BioServices.
## Table of Contents
1. [Complete Protein Analysis Pipeline](#complete-protein-analysis-pipeline)
2. [Pathway Discovery and Network Analysis](#pathway-discovery-and-network-analysis)
3. [Compound Multi-Database Search](#compound-multi-database-search)
4. [Batch Identifier Conversion](#batch-identifier-conversion)
5. [Gene Functional Annotation](#gene-functional-annotation)
6. [Protein Interaction Network Construction](#protein-interaction-network-construction)
7. [Multi-Organism Comparative Analysis](#multi-organism-comparative-analysis)
---
## Complete Protein Analysis Pipeline
**Goal:** Given a protein name, retrieve sequence, find homologs, identify pathways, and discover interactions.
**Example:** Analyzing human ZAP70 protein
### Step 1: UniProt Search and Identifier Retrieval
```python
from bioservices import UniProt
u = UniProt(verbose=False)
# Search for protein by name
query = "ZAP70_HUMAN"
results = u.search(query, frmt="tab", columns="id,genes,organism,length")
# Parse results
lines = results.strip().split("\n")
if len(lines) > 1:
header = lines[0]
data = lines[1].split("\t")
uniprot_id = data[0] # e.g., P43403
gene_names = data[1] # e.g., ZAP70
print(f"UniProt ID: {uniprot_id}")
print(f"Gene names: {gene_names}")
```
**Output:**
- UniProt accession: P43403
- Gene name: ZAP70
### Step 2: Sequence Retrieval
```python
# Retrieve FASTA sequence
sequence = u.retrieve(uniprot_id, frmt="fasta")
print(sequence)
# Extract just the sequence string (remove header)
seq_lines = sequence.split("\n")
sequence_only = "".join(seq_lines[1:]) # Skip FASTA header
```
**Output:** Complete protein sequence in FASTA format
### Step 3: BLAST Similarity Search
```python
from bioservices import NCBIblast
import time
s = NCBIblast(verbose=False)
# Submit BLAST job
jobid = s.run(
program="blastp",
sequence=sequence_only,
stype="protein",
database="uniprotkb",
email="your.email@example.com"
)
print(f"BLAST Job ID: {jobid}")
# Wait for completion
while True:
status = s.getStatus(jobid)
print(f"Status: {status}")
if status == "FINISHED":
break
elif status == "ERROR":
print("BLAST job failed")
break
time.sleep(5)
# Retrieve results
if status == "FINISHED":
blast_results = s.getResult(jobid, "out")
print(blast_results[:500]) # Print first 500 characters
```
**Output:** BLAST alignment results showing similar proteins
### Step 4: KEGG Pathway Discovery
```python
from bioservices import KEGG
k = KEGG()
# Get KEGG gene ID from UniProt mapping
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
print(f"KEGG mapping: {kegg_mapping}")
# Extract KEGG gene ID (e.g., hsa:7535)
if kegg_mapping:
kegg_gene_id = kegg_mapping[uniprot_id][0] if uniprot_id in kegg_mapping else None
if kegg_gene_id:
# Find pathways containing this gene
organism = kegg_gene_id.split(":")[0] # e.g., "hsa"
gene_id = kegg_gene_id.split(":")[1] # e.g., "7535"
pathways = k.get_pathway_by_gene(gene_id, organism)
print(f"Found {len(pathways)} pathways:")
# Get pathway names
for pathway_id in pathways:
pathway_info = k.get(pathway_id)
# Parse NAME line
for line in pathway_info.split("\n"):
if line.startswith("NAME"):
pathway_name = line.replace("NAME", "").strip()
print(f" {pathway_id}: {pathway_name}")
break
```
**Output:**
- path:hsa04064 - NF-kappa B signaling pathway
- path:hsa04650 - Natural killer cell mediated cytotoxicity
- path:hsa04660 - T cell receptor signaling pathway
- path:hsa04662 - B cell receptor signaling pathway
### Step 5: Protein-Protein Interactions
```python
from bioservices import PSICQUIC
p = PSICQUIC()
# Query MINT database for human (taxid:9606) interactions
query = f"ZAP70 AND species:9606"
interactions = p.query("mint", query)
# Parse PSI-MI TAB format results
if interactions:
interaction_lines = interactions.strip().split("\n")
print(f"Found {len(interaction_lines)} interactions")
# Print first few interactions
for line in interaction_lines[:5]:
fields = line.split("\t")
protein_a = fields[0]
protein_b = fields[1]
interaction_type = fields[11]
print(f" {protein_a} - {protein_b}: {interaction_type}")
```
**Output:** List of proteins that interact with ZAP70
### Step 6: Gene Ontology Annotation
```python
from bioservices import QuickGO
g = QuickGO()
# Get GO annotations for protein
annotations = g.Annotation(protein=uniprot_id, format="tsv")
if annotations:
# Parse TSV results
lines = annotations.strip().split("\n")
print(f"Found {len(lines)-1} GO annotations")
# Display first few annotations
for line in lines[1:6]: # Skip header
fields = line.split("\t")
go_id = fields[6]
go_term = fields[7]
go_aspect = fields[8]
print(f" {go_id}: {go_term} [{go_aspect}]")
```
**Output:** GO terms annotating ZAP70 function, process, and location
### Complete Pipeline Summary
**Inputs:** Protein name (e.g., "ZAP70_HUMAN")
**Outputs:**
1. UniProt accession and gene name
2. Protein sequence (FASTA)
3. Similar proteins (BLAST results)
4. Biological pathways (KEGG)
5. Interaction partners (PSICQUIC)
6. Functional annotations (GO terms)
**Script:** `scripts/protein_analysis_workflow.py` automates this entire pipeline.
---
## Pathway Discovery and Network Analysis
**Goal:** Analyze all pathways for an organism and extract protein interaction networks.
**Example:** Human (hsa) pathway analysis
### Step 1: Get All Pathways for Organism
```python
from bioservices import KEGG
k = KEGG()
k.organism = "hsa"
# Get all pathway IDs
pathway_ids = k.pathwayIds
print(f"Found {len(pathway_ids)} pathways for {k.organism}")
# Display first few
for pid in pathway_ids[:10]:
print(f" {pid}")
```
**Output:** List of ~300 human pathways
### Step 2: Parse Pathway for Interactions
```python
# Analyze specific pathway
pathway_id = "hsa04660" # T cell receptor signaling
# Get KGML data
kgml_data = k.parse_kgml_pathway(pathway_id)
# Extract entries (genes/proteins)
entries = kgml_data['entries']
print(f"Pathway contains {len(entries)} entries")
# Extract relations (interactions)
relations = kgml_data['relations']
print(f"Found {len(relations)} relations")
# Analyze relation types
relation_types = {}
for rel in relations:
rel_type = rel.get('name', 'unknown')
relation_types[rel_type] = relation_types.get(rel_type, 0) + 1
print("\nRelation type distribution:")
for rel_type, count in sorted(relation_types.items()):
print(f" {rel_type}: {count}")
```
**Output:**
- Entry count (genes/proteins in pathway)
- Relation count (interactions)
- Distribution of interaction types (activation, inhibition, binding, etc.)
### Step 3: Extract Protein-Protein Interactions
```python
# Filter for specific interaction types
pprel_interactions = [
rel for rel in relations
if rel.get('link') == 'PPrel' # Protein-protein relation
]
print(f"Found {len(pprel_interactions)} protein-protein interactions")
# Extract interaction details
for rel in pprel_interactions[:10]:
entry1 = rel['entry1']
entry2 = rel['entry2']
interaction_type = rel.get('name', 'unknown')
print(f" {entry1} -> {entry2}: {interaction_type}")
```
**Output:** Directed protein-protein interactions with types
### Step 4: Convert to Network Format (SIF)
```python
# Get Simple Interaction Format (filters for key interactions)
sif_data = k.pathway2sif(pathway_id)
# SIF format: source, interaction_type, target
print("\nSimple Interaction Format:")
for interaction in sif_data[:10]:
print(f" {interaction}")
```
**Output:** Network edges suitable for Cytoscape or NetworkX
### Step 5: Batch Analysis of All Pathways
```python
import pandas as pd
# Analyze all pathways (this takes time!)
all_results = []
for pathway_id in pathway_ids[:50]: # Limit for example
try:
kgml = k.parse_kgml_pathway(pathway_id)
result = {
'pathway_id': pathway_id,
'num_entries': len(kgml.get('entries', [])),
'num_relations': len(kgml.get('relations', []))
}
all_results.append(result)
except Exception as e:
print(f"Error parsing {pathway_id}: {e}")
# Create DataFrame
df = pd.DataFrame(all_results)
print(df.describe())
# Find largest pathways
print("\nLargest pathways:")
print(df.nlargest(10, 'num_entries')[['pathway_id', 'num_entries', 'num_relations']])
```
**Output:** Statistical summary of pathway sizes and interaction densities
**Script:** `scripts/pathway_analysis.py` implements this workflow with export options.
---
## Compound Multi-Database Search
**Goal:** Search for compound by name and retrieve identifiers across KEGG, ChEBI, and ChEMBL.
**Example:** Geldanamycin (antibiotic)
### Step 1: Search KEGG Compound Database
```python
from bioservices import KEGG
k = KEGG()
# Search by compound name
compound_name = "Geldanamycin"
results = k.find("compound", compound_name)
print(f"KEGG search results for '{compound_name}':")
print(results)
# Extract compound ID
if results:
lines = results.strip().split("\n")
if lines:
kegg_id = lines[0].split("\t")[0] # e.g., cpd:C11222
kegg_id_clean = kegg_id.replace("cpd:", "") # C11222
print(f"\nKEGG Compound ID: {kegg_id_clean}")
```
**Output:** KEGG ID (e.g., C11222)
### Step 2: Get KEGG Entry with Database Links
```python
# Retrieve compound entry
compound_entry = k.get(kegg_id)
# Parse entry for database links
chebi_id = None
for line in compound_entry.split("\n"):
if "ChEBI:" in line:
# Extract ChEBI ID
parts = line.split("ChEBI:")
if len(parts) > 1:
chebi_id = parts[1].strip().split()[0]
print(f"ChEBI ID: {chebi_id}")
break
# Display entry snippet
print("\nKEGG Entry (first 500 chars):")
print(compound_entry[:500])
```
**Output:** ChEBI ID (e.g., 5292) and compound information
### Step 3: Cross-Reference to ChEMBL via UniChem
```python
from bioservices import UniChem
u = UniChem()
# Convert KEGG → ChEMBL
try:
chembl_id = u.get_compound_id_from_kegg(kegg_id_clean)
print(f"ChEMBL ID: {chembl_id}")
except Exception as e:
print(f"UniChem lookup failed: {e}")
chembl_id = None
```
**Output:** ChEMBL ID (e.g., CHEMBL278315)
### Step 4: Retrieve Detailed Information
```python
# Get ChEBI information
if chebi_id:
from bioservices import ChEBI
c = ChEBI()
try:
chebi_entity = c.getCompleteEntity(f"CHEBI:{chebi_id}")
print(f"\nChEBI Formula: {chebi_entity.Formulae}")
print(f"ChEBI Name: {chebi_entity.chebiAsciiName}")
except Exception as e:
print(f"ChEBI lookup failed: {e}")
# Get ChEMBL information
if chembl_id:
from bioservices import ChEMBL
chembl = ChEMBL()
try:
chembl_compound = chembl.get_compound_by_chemblId(chembl_id)
print(f"\nChEMBL Molecular Weight: {chembl_compound['molecule_properties']['full_mwt']}")
print(f"ChEMBL SMILES: {chembl_compound['molecule_structures']['canonical_smiles']}")
except Exception as e:
print(f"ChEMBL lookup failed: {e}")
```
**Output:** Chemical properties from multiple databases
### Complete Compound Workflow Summary
**Input:** Compound name (e.g., "Geldanamycin")
**Output:**
- KEGG ID: C11222
- ChEBI ID: 5292
- ChEMBL ID: CHEMBL278315
- Chemical formula
- Molecular weight
- SMILES structure
**Script:** `scripts/compound_cross_reference.py` automates this workflow.
---
## Batch Identifier Conversion
**Goal:** Convert multiple identifiers between databases efficiently.
### Batch UniProt → KEGG Mapping
```python
from bioservices import UniProt
u = UniProt()
# List of UniProt IDs
uniprot_ids = ["P43403", "P04637", "P53779", "Q9Y6K9"]
# Batch mapping (comma-separated)
query_string = ",".join(uniprot_ids)
results = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=query_string)
print("UniProt → KEGG mapping:")
for uniprot_id, kegg_ids in results.items():
print(f" {uniprot_id}{kegg_ids}")
```
**Output:** Dictionary mapping each UniProt ID to KEGG gene IDs
### Batch File Processing
```python
import csv
# Read identifiers from file
def read_ids_from_file(filename):
with open(filename, 'r') as f:
ids = [line.strip() for line in f if line.strip()]
return ids
# Process in chunks (API limits)
def batch_convert(ids, from_db, to_db, chunk_size=100):
u = UniProt()
all_results = {}
for i in range(0, len(ids), chunk_size):
chunk = ids[i:i+chunk_size]
query = ",".join(chunk)
try:
results = u.mapping(fr=from_db, to=to_db, query=query)
all_results.update(results)
print(f"Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
except Exception as e:
print(f"Error processing chunk {i}: {e}")
return all_results
# Write results to CSV
def write_mapping_to_csv(mapping, output_file):
with open(output_file, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['Source_ID', 'Target_IDs'])
for source_id, target_ids in mapping.items():
target_str = ";".join(target_ids) if target_ids else "No mapping"
writer.writerow([source_id, target_str])
# Example usage
input_ids = read_ids_from_file("uniprot_ids.txt")
mapping = batch_convert(input_ids, "UniProtKB_AC-ID", "KEGG", chunk_size=50)
write_mapping_to_csv(mapping, "uniprot_to_kegg_mapping.csv")
```
**Script:** `scripts/batch_id_converter.py` provides command-line batch conversion.
---
## Gene Functional Annotation
**Goal:** Retrieve comprehensive functional information for a gene.
### Workflow
```python
from bioservices import UniProt, KEGG, QuickGO
# Gene of interest
gene_symbol = "TP53"
# 1. Find UniProt entry
u = UniProt()
search_results = u.search(f"gene:{gene_symbol} AND organism:9606",
frmt="tab",
columns="id,genes,protein names")
# Extract UniProt ID
lines = search_results.strip().split("\n")
if len(lines) > 1:
uniprot_id = lines[1].split("\t")[0]
protein_name = lines[1].split("\t")[2]
print(f"Protein: {protein_name}")
print(f"UniProt ID: {uniprot_id}")
# 2. Get KEGG pathways
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
if uniprot_id in kegg_mapping:
kegg_id = kegg_mapping[uniprot_id][0]
k = KEGG()
organism, gene_id = kegg_id.split(":")
pathways = k.get_pathway_by_gene(gene_id, organism)
print(f"\nPathways ({len(pathways)}):")
for pathway_id in pathways[:5]:
print(f" {pathway_id}")
# 3. Get GO annotations
g = QuickGO()
go_annotations = g.Annotation(protein=uniprot_id, format="tsv")
if go_annotations:
lines = go_annotations.strip().split("\n")
print(f"\nGO Annotations ({len(lines)-1} total):")
# Group by aspect
aspects = {"P": [], "F": [], "C": []}
for line in lines[1:]:
fields = line.split("\t")
go_aspect = fields[8] # P, F, or C
go_term = fields[7]
aspects[go_aspect].append(go_term)
print(f" Biological Process: {len(aspects['P'])} terms")
print(f" Molecular Function: {len(aspects['F'])} terms")
print(f" Cellular Component: {len(aspects['C'])} terms")
# 4. Get protein sequence features
full_entry = u.retrieve(uniprot_id, frmt="txt")
print("\nProtein Features:")
for line in full_entry.split("\n"):
if line.startswith("FT DOMAIN"):
print(f" {line}")
```
**Output:** Comprehensive annotation including name, pathways, GO terms, and features.
---
## Protein Interaction Network Construction
**Goal:** Build a protein-protein interaction network for a set of proteins.
### Workflow
```python
from bioservices import PSICQUIC
import networkx as nx
# Proteins of interest
proteins = ["ZAP70", "LCK", "LAT", "SLP76", "PLCg1"]
# Initialize PSICQUIC
p = PSICQUIC()
# Build network
G = nx.Graph()
for protein in proteins:
# Query for human interactions
query = f"{protein} AND species:9606"
try:
results = p.query("intact", query)
if results:
lines = results.strip().split("\n")
for line in lines:
fields = line.split("\t")
# Extract protein names (simplified)
protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]
# Add edge
G.add_edge(protein_a, protein_b)
except Exception as e:
print(f"Error querying {protein}: {e}")
print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
# Analyze network
print("\nNode degrees:")
for node in proteins:
if node in G:
print(f" {node}: {G.degree(node)} interactions")
# Export for visualization
nx.write_gml(G, "protein_network.gml")
print("\nNetwork exported to protein_network.gml")
```
**Output:** NetworkX graph exported in GML format for Cytoscape visualization.
---
## Multi-Organism Comparative Analysis
**Goal:** Compare pathway or gene presence across multiple organisms.
### Workflow
```python
from bioservices import KEGG
k = KEGG()
# Organisms to compare
organisms = ["hsa", "mmu", "dme", "sce"] # Human, mouse, fly, yeast
organism_names = {
"hsa": "Human",
"mmu": "Mouse",
"dme": "Fly",
"sce": "Yeast"
}
# Pathway of interest
pathway_name = "cell cycle"
print(f"Searching for '{pathway_name}' pathway across organisms:\n")
for org in organisms:
k.organism = org
# Search pathways
results = k.lookfor_pathway(pathway_name)
print(f"{organism_names[org]} ({org}):")
if results:
for pathway in results[:3]: # Show first 3
print(f" {pathway}")
else:
print(" No matches found")
print()
```
**Output:** Pathway presence/absence across organisms.
---
## Best Practices for Workflows
### 1. Error Handling
Always wrap service calls:
```python
try:
result = service.method(params)
if result:
# Process
pass
except Exception as e:
print(f"Error: {e}")
```
### 2. Rate Limiting
Add delays for batch processing:
```python
import time
for item in items:
result = service.query(item)
time.sleep(0.5) # 500ms delay
```
### 3. Result Validation
Check for empty or unexpected results:
```python
if result and len(result) > 0:
# Process
pass
else:
print("No results returned")
```
### 4. Progress Reporting
For long workflows:
```python
total = len(items)
for i, item in enumerate(items):
# Process item
if (i + 1) % 10 == 0:
print(f"Processed {i+1}/{total}")
```
### 5. Data Export
Save intermediate results:
```python
import json
with open("results.json", "w") as f:
json.dump(results, f, indent=2)
```
---
## Integration with Other Tools
### BioPython Integration
```python
from bioservices import UniProt
from Bio import SeqIO
from io import StringIO
u = UniProt()
fasta_data = u.retrieve("P43403", "fasta")
# Parse with BioPython
fasta_io = StringIO(fasta_data)
record = SeqIO.read(fasta_io, "fasta")
print(f"Sequence length: {len(record.seq)}")
print(f"Description: {record.description}")
```
### Pandas Integration
```python
from bioservices import UniProt
import pandas as pd
from io import StringIO
u = UniProt()
results = u.search("zap70", frmt="tab", columns="id,genes,length,organism")
# Load into DataFrame
df = pd.read_csv(StringIO(results), sep="\t")
print(df.head())
print(df.describe())
```
### NetworkX Integration
See Protein Interaction Network Construction above.
---
For complete working examples, see the scripts in `scripts/` directory.

View File

@@ -0,0 +1,347 @@
#!/usr/bin/env python3
"""
Batch Identifier Converter
This script converts multiple identifiers between biological databases
using UniProt's mapping service. Supports batch processing with
automatic chunking and error handling.
Usage:
python batch_id_converter.py INPUT_FILE --from DB1 --to DB2 [options]
Examples:
python batch_id_converter.py uniprot_ids.txt --from UniProtKB_AC-ID --to KEGG
python batch_id_converter.py gene_ids.txt --from GeneID --to UniProtKB --output mapping.csv
python batch_id_converter.py ids.txt --from UniProtKB_AC-ID --to Ensembl --chunk-size 50
Input file format:
One identifier per line (plain text)
Common database codes:
UniProtKB_AC-ID - UniProt accession/ID
KEGG - KEGG gene IDs
GeneID - NCBI Gene (Entrez) IDs
Ensembl - Ensembl gene IDs
Ensembl_Protein - Ensembl protein IDs
RefSeq_Protein - RefSeq protein IDs
PDB - Protein Data Bank IDs
HGNC - Human gene symbols
GO - Gene Ontology IDs
"""
import sys
import argparse
import csv
import time
from bioservices import UniProt
# Common database code mappings
DATABASE_CODES = {
'uniprot': 'UniProtKB_AC-ID',
'uniprotkb': 'UniProtKB_AC-ID',
'kegg': 'KEGG',
'geneid': 'GeneID',
'entrez': 'GeneID',
'ensembl': 'Ensembl',
'ensembl_protein': 'Ensembl_Protein',
'ensembl_transcript': 'Ensembl_Transcript',
'refseq': 'RefSeq_Protein',
'refseq_protein': 'RefSeq_Protein',
'pdb': 'PDB',
'hgnc': 'HGNC',
'mgi': 'MGI',
'go': 'GO',
'pfam': 'Pfam',
'interpro': 'InterPro',
'reactome': 'Reactome',
'string': 'STRING',
'biogrid': 'BioGRID'
}
def normalize_database_code(code):
"""Normalize database code to official format."""
# Try exact match first
if code in DATABASE_CODES.values():
return code
# Try lowercase lookup
lowercase = code.lower()
if lowercase in DATABASE_CODES:
return DATABASE_CODES[lowercase]
# Return as-is if not found (may still be valid)
return code
def read_ids_from_file(filename):
"""Read identifiers from file (one per line)."""
print(f"Reading identifiers from {filename}...")
ids = []
with open(filename, 'r') as f:
for line in f:
line = line.strip()
if line and not line.startswith('#'):
ids.append(line)
print(f"✓ Read {len(ids)} identifier(s)")
return ids
def batch_convert(ids, from_db, to_db, chunk_size=100, delay=0.5):
"""Convert IDs with automatic chunking and error handling."""
print(f"\nConverting {len(ids)} IDs:")
print(f" From: {from_db}")
print(f" To: {to_db}")
print(f" Chunk size: {chunk_size}")
print()
u = UniProt(verbose=False)
all_results = {}
failed_ids = []
total_chunks = (len(ids) + chunk_size - 1) // chunk_size
for i in range(0, len(ids), chunk_size):
chunk = ids[i:i+chunk_size]
chunk_num = (i // chunk_size) + 1
query = ",".join(chunk)
try:
print(f" [{chunk_num}/{total_chunks}] Processing {len(chunk)} IDs...", end=" ")
results = u.mapping(fr=from_db, to=to_db, query=query)
if results:
all_results.update(results)
mapped_count = len([v for v in results.values() if v])
print(f"✓ Mapped: {mapped_count}/{len(chunk)}")
else:
print(f"✗ No mappings returned")
failed_ids.extend(chunk)
# Rate limiting
if delay > 0 and i + chunk_size < len(ids):
time.sleep(delay)
except Exception as e:
print(f"✗ Error: {e}")
# Try individual IDs in failed chunk
print(f" Retrying individual IDs...")
for single_id in chunk:
try:
result = u.mapping(fr=from_db, to=to_db, query=single_id)
if result:
all_results.update(result)
print(f"{single_id}")
else:
failed_ids.append(single_id)
print(f"{single_id} - no mapping")
except Exception as e2:
failed_ids.append(single_id)
print(f"{single_id} - {e2}")
time.sleep(0.2)
# Add missing IDs to results (mark as failed)
for id_ in ids:
if id_ not in all_results:
all_results[id_] = None
print(f"\n✓ Conversion complete:")
print(f" Total: {len(ids)}")
print(f" Mapped: {len([v for v in all_results.values() if v])}")
print(f" Failed: {len(failed_ids)}")
return all_results, failed_ids
def save_mapping_csv(mapping, output_file, from_db, to_db):
"""Save mapping results to CSV."""
print(f"\nSaving results to {output_file}...")
with open(output_file, 'w', newline='') as f:
writer = csv.writer(f)
# Header
writer.writerow(['Source_ID', 'Source_DB', 'Target_IDs', 'Target_DB', 'Mapping_Status'])
# Data
for source_id, target_ids in sorted(mapping.items()):
if target_ids:
target_str = ";".join(target_ids)
status = "Success"
else:
target_str = ""
status = "Failed"
writer.writerow([source_id, from_db, target_str, to_db, status])
print(f"✓ Results saved")
def save_failed_ids(failed_ids, output_file):
"""Save failed IDs to file."""
if not failed_ids:
return
print(f"\nSaving failed IDs to {output_file}...")
with open(output_file, 'w') as f:
for id_ in failed_ids:
f.write(f"{id_}\n")
print(f"✓ Saved {len(failed_ids)} failed ID(s)")
def print_mapping_summary(mapping, from_db, to_db):
"""Print summary of mapping results."""
print(f"\n{'='*70}")
print("MAPPING SUMMARY")
print(f"{'='*70}")
total = len(mapping)
mapped = len([v for v in mapping.values() if v])
failed = total - mapped
print(f"\nSource database: {from_db}")
print(f"Target database: {to_db}")
print(f"\nTotal identifiers: {total}")
print(f"Successfully mapped: {mapped} ({mapped/total*100:.1f}%)")
print(f"Failed to map: {failed} ({failed/total*100:.1f}%)")
# Show some examples
if mapped > 0:
print(f"\nExample mappings (first 5):")
count = 0
for source_id, target_ids in mapping.items():
if target_ids:
target_str = ", ".join(target_ids[:3])
if len(target_ids) > 3:
target_str += f" ... +{len(target_ids)-3} more"
print(f" {source_id}{target_str}")
count += 1
if count >= 5:
break
# Show multiple mapping statistics
multiple_mappings = [v for v in mapping.values() if v and len(v) > 1]
if multiple_mappings:
print(f"\nMultiple target mappings: {len(multiple_mappings)} ID(s)")
print(f" (These source IDs map to multiple target IDs)")
print(f"{'='*70}")
def list_common_databases():
"""Print list of common database codes."""
print("\nCommon Database Codes:")
print("-" * 70)
print(f"{'Alias':<20} {'Official Code':<30}")
print("-" * 70)
for alias, code in sorted(DATABASE_CODES.items()):
if alias != code.lower():
print(f"{alias:<20} {code:<30}")
print("-" * 70)
print("\nNote: Many other database codes are supported.")
print("See UniProt documentation for complete list.")
def main():
"""Main conversion workflow."""
parser = argparse.ArgumentParser(
description="Batch convert biological identifiers between databases",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python batch_id_converter.py uniprot_ids.txt --from UniProtKB_AC-ID --to KEGG
python batch_id_converter.py ids.txt --from GeneID --to UniProtKB -o mapping.csv
python batch_id_converter.py ids.txt --from uniprot --to ensembl --chunk-size 50
Common database codes:
UniProtKB_AC-ID, KEGG, GeneID, Ensembl, Ensembl_Protein,
RefSeq_Protein, PDB, HGNC, GO, Pfam, InterPro, Reactome
Use --list-databases to see all supported aliases.
"""
)
parser.add_argument("input_file", help="Input file with IDs (one per line)")
parser.add_argument("--from", dest="from_db", required=True,
help="Source database code")
parser.add_argument("--to", dest="to_db", required=True,
help="Target database code")
parser.add_argument("-o", "--output", default=None,
help="Output CSV file (default: mapping_results.csv)")
parser.add_argument("--chunk-size", type=int, default=100,
help="Number of IDs per batch (default: 100)")
parser.add_argument("--delay", type=float, default=0.5,
help="Delay between batches in seconds (default: 0.5)")
parser.add_argument("--save-failed", action="store_true",
help="Save failed IDs to separate file")
parser.add_argument("--list-databases", action="store_true",
help="List common database codes and exit")
args = parser.parse_args()
# List databases and exit
if args.list_databases:
list_common_databases()
sys.exit(0)
print("=" * 70)
print("BIOSERVICES: Batch Identifier Converter")
print("=" * 70)
# Normalize database codes
from_db = normalize_database_code(args.from_db)
to_db = normalize_database_code(args.to_db)
if from_db != args.from_db:
print(f"\nNote: Normalized '{args.from_db}''{from_db}'")
if to_db != args.to_db:
print(f"Note: Normalized '{args.to_db}''{to_db}'")
# Read input IDs
try:
ids = read_ids_from_file(args.input_file)
except Exception as e:
print(f"\n✗ Error reading input file: {e}")
sys.exit(1)
if not ids:
print("\n✗ No IDs found in input file")
sys.exit(1)
# Perform conversion
mapping, failed_ids = batch_convert(
ids,
from_db,
to_db,
chunk_size=args.chunk_size,
delay=args.delay
)
# Print summary
print_mapping_summary(mapping, from_db, to_db)
# Save results
output_file = args.output or "mapping_results.csv"
save_mapping_csv(mapping, output_file, from_db, to_db)
# Save failed IDs if requested
if args.save_failed and failed_ids:
failed_file = output_file.replace(".csv", "_failed.txt")
save_failed_ids(failed_ids, failed_file)
print(f"\n✓ Done!")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,378 @@
#!/usr/bin/env python3
"""
Compound Cross-Database Search
This script searches for a compound by name and retrieves identifiers
from multiple databases:
- KEGG Compound
- ChEBI
- ChEMBL (via UniChem)
- Basic compound properties
Usage:
python compound_cross_reference.py COMPOUND_NAME [--output FILE]
Examples:
python compound_cross_reference.py Geldanamycin
python compound_cross_reference.py "Adenosine triphosphate"
python compound_cross_reference.py Aspirin --output aspirin_info.txt
"""
import sys
import argparse
from bioservices import KEGG, UniChem, ChEBI, ChEMBL
def search_kegg_compound(compound_name):
"""Search KEGG for compound by name."""
print(f"\n{'='*70}")
print("STEP 1: KEGG Compound Search")
print(f"{'='*70}")
k = KEGG()
print(f"Searching KEGG for: {compound_name}")
try:
results = k.find("compound", compound_name)
if not results or not results.strip():
print(f"✗ No results found in KEGG")
return k, None
# Parse results
lines = results.strip().split("\n")
print(f"✓ Found {len(lines)} result(s):\n")
for i, line in enumerate(lines[:5], 1):
parts = line.split("\t")
kegg_id = parts[0]
description = parts[1] if len(parts) > 1 else "No description"
print(f" {i}. {kegg_id}: {description}")
# Use first result
first_result = lines[0].split("\t")
kegg_id = first_result[0].replace("cpd:", "")
print(f"\nUsing: {kegg_id}")
return k, kegg_id
except Exception as e:
print(f"✗ Error: {e}")
return k, None
def get_kegg_info(kegg, kegg_id):
"""Retrieve detailed KEGG compound information."""
print(f"\n{'='*70}")
print("STEP 2: KEGG Compound Details")
print(f"{'='*70}")
try:
print(f"Retrieving KEGG entry for {kegg_id}...")
entry = kegg.get(f"cpd:{kegg_id}")
if not entry:
print("✗ Failed to retrieve entry")
return None
# Parse entry
compound_info = {
'kegg_id': kegg_id,
'name': None,
'formula': None,
'exact_mass': None,
'mol_weight': None,
'chebi_id': None,
'pathways': []
}
current_section = None
for line in entry.split("\n"):
if line.startswith("NAME"):
compound_info['name'] = line.replace("NAME", "").strip().rstrip(";")
elif line.startswith("FORMULA"):
compound_info['formula'] = line.replace("FORMULA", "").strip()
elif line.startswith("EXACT_MASS"):
compound_info['exact_mass'] = line.replace("EXACT_MASS", "").strip()
elif line.startswith("MOL_WEIGHT"):
compound_info['mol_weight'] = line.replace("MOL_WEIGHT", "").strip()
elif "ChEBI:" in line:
parts = line.split("ChEBI:")
if len(parts) > 1:
compound_info['chebi_id'] = parts[1].strip().split()[0]
elif line.startswith("PATHWAY"):
current_section = "pathway"
pathway = line.replace("PATHWAY", "").strip()
if pathway:
compound_info['pathways'].append(pathway)
elif current_section == "pathway" and line.startswith(" "):
pathway = line.strip()
if pathway:
compound_info['pathways'].append(pathway)
elif line.startswith(" ") and not line.startswith(" "):
current_section = None
# Display information
print(f"\n✓ KEGG Compound Information:")
print(f" ID: {compound_info['kegg_id']}")
print(f" Name: {compound_info['name']}")
print(f" Formula: {compound_info['formula']}")
print(f" Exact Mass: {compound_info['exact_mass']}")
print(f" Molecular Weight: {compound_info['mol_weight']}")
if compound_info['chebi_id']:
print(f" ChEBI ID: {compound_info['chebi_id']}")
if compound_info['pathways']:
print(f" Pathways: {len(compound_info['pathways'])} found")
return compound_info
except Exception as e:
print(f"✗ Error: {e}")
return None
def get_chembl_id(kegg_id):
"""Map KEGG ID to ChEMBL via UniChem."""
print(f"\n{'='*70}")
print("STEP 3: ChEMBL Mapping (via UniChem)")
print(f"{'='*70}")
try:
u = UniChem()
print(f"Mapping KEGG:{kegg_id} to ChEMBL...")
chembl_id = u.get_compound_id_from_kegg(kegg_id)
if chembl_id:
print(f"✓ ChEMBL ID: {chembl_id}")
return chembl_id
else:
print("✗ No ChEMBL mapping found")
return None
except Exception as e:
print(f"✗ Error: {e}")
return None
def get_chebi_info(chebi_id):
"""Retrieve ChEBI compound information."""
print(f"\n{'='*70}")
print("STEP 4: ChEBI Details")
print(f"{'='*70}")
if not chebi_id:
print("⊘ No ChEBI ID available")
return None
try:
c = ChEBI()
print(f"Retrieving ChEBI entry for {chebi_id}...")
# Ensure proper format
if not chebi_id.startswith("CHEBI:"):
chebi_id = f"CHEBI:{chebi_id}"
entity = c.getCompleteEntity(chebi_id)
if entity:
print(f"\n✓ ChEBI Information:")
print(f" ID: {entity.chebiId}")
print(f" Name: {entity.chebiAsciiName}")
if hasattr(entity, 'Formulae') and entity.Formulae:
print(f" Formula: {entity.Formulae}")
if hasattr(entity, 'mass') and entity.mass:
print(f" Mass: {entity.mass}")
if hasattr(entity, 'charge') and entity.charge:
print(f" Charge: {entity.charge}")
return {
'chebi_id': entity.chebiId,
'name': entity.chebiAsciiName,
'formula': entity.Formulae if hasattr(entity, 'Formulae') else None,
'mass': entity.mass if hasattr(entity, 'mass') else None
}
else:
print("✗ Failed to retrieve ChEBI entry")
return None
except Exception as e:
print(f"✗ Error: {e}")
return None
def get_chembl_info(chembl_id):
"""Retrieve ChEMBL compound information."""
print(f"\n{'='*70}")
print("STEP 5: ChEMBL Details")
print(f"{'='*70}")
if not chembl_id:
print("⊘ No ChEMBL ID available")
return None
try:
c = ChEMBL()
print(f"Retrieving ChEMBL entry for {chembl_id}...")
compound = c.get_compound_by_chemblId(chembl_id)
if compound:
print(f"\n✓ ChEMBL Information:")
print(f" ID: {chembl_id}")
if 'pref_name' in compound and compound['pref_name']:
print(f" Preferred Name: {compound['pref_name']}")
if 'molecule_properties' in compound:
props = compound['molecule_properties']
if 'full_mwt' in props:
print(f" Molecular Weight: {props['full_mwt']}")
if 'alogp' in props:
print(f" LogP: {props['alogp']}")
if 'hba' in props:
print(f" H-Bond Acceptors: {props['hba']}")
if 'hbd' in props:
print(f" H-Bond Donors: {props['hbd']}")
if 'molecule_structures' in compound:
structs = compound['molecule_structures']
if 'canonical_smiles' in structs:
smiles = structs['canonical_smiles']
print(f" SMILES: {smiles[:60]}{'...' if len(smiles) > 60 else ''}")
return compound
else:
print("✗ Failed to retrieve ChEMBL entry")
return None
except Exception as e:
print(f"✗ Error: {e}")
return None
def save_results(compound_name, kegg_info, chembl_id, output_file):
"""Save results to file."""
print(f"\n{'='*70}")
print(f"Saving results to {output_file}")
print(f"{'='*70}")
with open(output_file, 'w') as f:
f.write("=" * 70 + "\n")
f.write(f"Compound Cross-Reference Report: {compound_name}\n")
f.write("=" * 70 + "\n\n")
# KEGG information
if kegg_info:
f.write("KEGG Compound\n")
f.write("-" * 70 + "\n")
f.write(f"ID: {kegg_info['kegg_id']}\n")
f.write(f"Name: {kegg_info['name']}\n")
f.write(f"Formula: {kegg_info['formula']}\n")
f.write(f"Exact Mass: {kegg_info['exact_mass']}\n")
f.write(f"Molecular Weight: {kegg_info['mol_weight']}\n")
f.write(f"Pathways: {len(kegg_info['pathways'])} found\n")
f.write("\n")
# Database IDs
f.write("Cross-Database Identifiers\n")
f.write("-" * 70 + "\n")
if kegg_info:
f.write(f"KEGG: {kegg_info['kegg_id']}\n")
if kegg_info['chebi_id']:
f.write(f"ChEBI: {kegg_info['chebi_id']}\n")
if chembl_id:
f.write(f"ChEMBL: {chembl_id}\n")
f.write("\n")
print(f"✓ Results saved")
def main():
"""Main workflow."""
parser = argparse.ArgumentParser(
description="Search compound across multiple databases",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python compound_cross_reference.py Geldanamycin
python compound_cross_reference.py "Adenosine triphosphate"
python compound_cross_reference.py Aspirin --output aspirin_info.txt
"""
)
parser.add_argument("compound", help="Compound name to search")
parser.add_argument("--output", default=None,
help="Output file for results (optional)")
args = parser.parse_args()
print("=" * 70)
print("BIOSERVICES: Compound Cross-Database Search")
print("=" * 70)
# Step 1: Search KEGG
kegg, kegg_id = search_kegg_compound(args.compound)
if not kegg_id:
print("\n✗ Failed to find compound. Exiting.")
sys.exit(1)
# Step 2: Get KEGG details
kegg_info = get_kegg_info(kegg, kegg_id)
# Step 3: Map to ChEMBL
chembl_id = get_chembl_id(kegg_id)
# Step 4: Get ChEBI details
chebi_info = None
if kegg_info and kegg_info['chebi_id']:
chebi_info = get_chebi_info(kegg_info['chebi_id'])
# Step 5: Get ChEMBL details
chembl_info = None
if chembl_id:
chembl_info = get_chembl_info(chembl_id)
# Summary
print(f"\n{'='*70}")
print("SUMMARY")
print(f"{'='*70}")
print(f" Compound: {args.compound}")
if kegg_info:
print(f" KEGG ID: {kegg_info['kegg_id']}")
if kegg_info['chebi_id']:
print(f" ChEBI ID: {kegg_info['chebi_id']}")
if chembl_id:
print(f" ChEMBL ID: {chembl_id}")
print(f"{'='*70}")
# Save to file if requested
if args.output:
save_results(args.compound, kegg_info, chembl_id, args.output)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,309 @@
#!/usr/bin/env python3
"""
KEGG Pathway Network Analysis
This script analyzes all pathways for an organism and extracts:
- Pathway sizes (number of genes)
- Protein-protein interactions
- Interaction type distributions
- Network data in various formats (CSV, SIF)
Usage:
python pathway_analysis.py ORGANISM OUTPUT_DIR [--limit N]
Examples:
python pathway_analysis.py hsa ./human_pathways
python pathway_analysis.py mmu ./mouse_pathways --limit 50
Organism codes:
hsa = Homo sapiens (human)
mmu = Mus musculus (mouse)
dme = Drosophila melanogaster
sce = Saccharomyces cerevisiae (yeast)
eco = Escherichia coli
"""
import sys
import os
import argparse
import csv
from collections import Counter
from bioservices import KEGG
def get_all_pathways(kegg, organism):
"""Get all pathway IDs for organism."""
print(f"\nRetrieving pathways for {organism}...")
kegg.organism = organism
pathway_ids = kegg.pathwayIds
print(f"✓ Found {len(pathway_ids)} pathways")
return pathway_ids
def analyze_pathway(kegg, pathway_id):
"""Analyze single pathway for size and interactions."""
try:
# Parse KGML pathway
kgml = kegg.parse_kgml_pathway(pathway_id)
entries = kgml.get('entries', [])
relations = kgml.get('relations', [])
# Count relation types
relation_types = Counter()
for rel in relations:
rel_type = rel.get('name', 'unknown')
relation_types[rel_type] += 1
# Get pathway name
try:
entry = kegg.get(pathway_id)
pathway_name = "Unknown"
for line in entry.split("\n"):
if line.startswith("NAME"):
pathway_name = line.replace("NAME", "").strip()
break
except:
pathway_name = "Unknown"
result = {
'pathway_id': pathway_id,
'pathway_name': pathway_name,
'num_entries': len(entries),
'num_relations': len(relations),
'relation_types': dict(relation_types),
'entries': entries,
'relations': relations
}
return result
except Exception as e:
print(f" ✗ Error analyzing {pathway_id}: {e}")
return None
def analyze_all_pathways(kegg, pathway_ids, limit=None):
"""Analyze all pathways."""
if limit:
pathway_ids = pathway_ids[:limit]
print(f"\n⚠ Limiting analysis to first {limit} pathways")
print(f"\nAnalyzing {len(pathway_ids)} pathways...")
results = []
for i, pathway_id in enumerate(pathway_ids, 1):
print(f" [{i}/{len(pathway_ids)}] {pathway_id}", end="\r")
result = analyze_pathway(kegg, pathway_id)
if result:
results.append(result)
print(f"\n✓ Successfully analyzed {len(results)}/{len(pathway_ids)} pathways")
return results
def save_pathway_summary(results, output_file):
"""Save pathway summary to CSV."""
print(f"\nSaving pathway summary to {output_file}...")
with open(output_file, 'w', newline='') as f:
writer = csv.writer(f)
# Header
writer.writerow([
'Pathway_ID',
'Pathway_Name',
'Num_Genes',
'Num_Interactions',
'Activation',
'Inhibition',
'Phosphorylation',
'Binding',
'Other'
])
# Data
for result in results:
rel_types = result['relation_types']
writer.writerow([
result['pathway_id'],
result['pathway_name'],
result['num_entries'],
result['num_relations'],
rel_types.get('activation', 0),
rel_types.get('inhibition', 0),
rel_types.get('phosphorylation', 0),
rel_types.get('binding/association', 0),
sum(v for k, v in rel_types.items()
if k not in ['activation', 'inhibition', 'phosphorylation', 'binding/association'])
])
print(f"✓ Summary saved")
def save_interactions_sif(results, output_file):
"""Save all interactions in SIF format."""
print(f"\nSaving interactions to {output_file}...")
with open(output_file, 'w') as f:
for result in results:
pathway_id = result['pathway_id']
for rel in result['relations']:
entry1 = rel.get('entry1', '')
entry2 = rel.get('entry2', '')
interaction_type = rel.get('name', 'interaction')
# Write SIF format: source\tinteraction\ttarget
f.write(f"{entry1}\t{interaction_type}\t{entry2}\n")
print(f"✓ Interactions saved")
def save_detailed_pathway_info(results, output_dir):
"""Save detailed information for each pathway."""
print(f"\nSaving detailed pathway files to {output_dir}/pathways/...")
pathway_dir = os.path.join(output_dir, "pathways")
os.makedirs(pathway_dir, exist_ok=True)
for result in results:
pathway_id = result['pathway_id'].replace(":", "_")
filename = os.path.join(pathway_dir, f"{pathway_id}_interactions.csv")
with open(filename, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['Source', 'Target', 'Interaction_Type', 'Link_Type'])
for rel in result['relations']:
writer.writerow([
rel.get('entry1', ''),
rel.get('entry2', ''),
rel.get('name', 'unknown'),
rel.get('link', 'unknown')
])
print(f"✓ Detailed files saved for {len(results)} pathways")
def print_statistics(results):
"""Print analysis statistics."""
print(f"\n{'='*70}")
print("PATHWAY ANALYSIS STATISTICS")
print(f"{'='*70}")
# Total stats
total_pathways = len(results)
total_interactions = sum(r['num_relations'] for r in results)
total_genes = sum(r['num_entries'] for r in results)
print(f"\nOverall:")
print(f" Total pathways: {total_pathways}")
print(f" Total genes/proteins: {total_genes}")
print(f" Total interactions: {total_interactions}")
# Largest pathways
print(f"\nLargest pathways (by gene count):")
sorted_by_size = sorted(results, key=lambda x: x['num_entries'], reverse=True)
for i, result in enumerate(sorted_by_size[:10], 1):
print(f" {i}. {result['pathway_id']}: {result['num_entries']} genes")
print(f" {result['pathway_name']}")
# Most connected pathways
print(f"\nMost connected pathways (by interactions):")
sorted_by_connections = sorted(results, key=lambda x: x['num_relations'], reverse=True)
for i, result in enumerate(sorted_by_connections[:10], 1):
print(f" {i}. {result['pathway_id']}: {result['num_relations']} interactions")
print(f" {result['pathway_name']}")
# Interaction type distribution
print(f"\nInteraction type distribution:")
all_types = Counter()
for result in results:
for rel_type, count in result['relation_types'].items():
all_types[rel_type] += count
for rel_type, count in all_types.most_common():
percentage = (count / total_interactions) * 100 if total_interactions > 0 else 0
print(f" {rel_type}: {count} ({percentage:.1f}%)")
def main():
"""Main analysis workflow."""
parser = argparse.ArgumentParser(
description="Analyze KEGG pathways for an organism",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python pathway_analysis.py hsa ./human_pathways
python pathway_analysis.py mmu ./mouse_pathways --limit 50
Organism codes:
hsa = Homo sapiens (human)
mmu = Mus musculus (mouse)
dme = Drosophila melanogaster
sce = Saccharomyces cerevisiae (yeast)
eco = Escherichia coli
"""
)
parser.add_argument("organism", help="KEGG organism code (e.g., hsa, mmu)")
parser.add_argument("output_dir", help="Output directory for results")
parser.add_argument("--limit", type=int, default=None,
help="Limit analysis to first N pathways")
args = parser.parse_args()
print("=" * 70)
print("BIOSERVICES: KEGG Pathway Network Analysis")
print("=" * 70)
# Create output directory
os.makedirs(args.output_dir, exist_ok=True)
# Initialize KEGG
kegg = KEGG()
# Get all pathways
pathway_ids = get_all_pathways(kegg, args.organism)
if not pathway_ids:
print(f"\n✗ No pathways found for {args.organism}")
sys.exit(1)
# Analyze pathways
results = analyze_all_pathways(kegg, pathway_ids, args.limit)
if not results:
print("\n✗ No pathways successfully analyzed")
sys.exit(1)
# Print statistics
print_statistics(results)
# Save results
summary_file = os.path.join(args.output_dir, "pathway_summary.csv")
save_pathway_summary(results, summary_file)
sif_file = os.path.join(args.output_dir, "all_interactions.sif")
save_interactions_sif(results, sif_file)
save_detailed_pathway_info(results, args.output_dir)
# Final summary
print(f"\n{'='*70}")
print("OUTPUT FILES")
print(f"{'='*70}")
print(f" Summary: {summary_file}")
print(f" Interactions: {sif_file}")
print(f" Detailed: {args.output_dir}/pathways/")
print(f"{'='*70}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,408 @@
#!/usr/bin/env python3
"""
Complete Protein Analysis Workflow
This script performs a comprehensive protein analysis pipeline:
1. UniProt search and identifier retrieval
2. FASTA sequence retrieval
3. BLAST similarity search
4. KEGG pathway discovery
5. PSICQUIC interaction mapping
6. GO annotation retrieval
Usage:
python protein_analysis_workflow.py PROTEIN_NAME EMAIL [--skip-blast]
Examples:
python protein_analysis_workflow.py ZAP70_HUMAN user@example.com
python protein_analysis_workflow.py P43403 user@example.com --skip-blast
Note: BLAST searches can take several minutes. Use --skip-blast to skip this step.
"""
import sys
import time
import argparse
from bioservices import UniProt, KEGG, NCBIblast, PSICQUIC, QuickGO
def search_protein(query):
"""Search UniProt for protein and retrieve basic information."""
print(f"\n{'='*70}")
print("STEP 1: UniProt Search")
print(f"{'='*70}")
u = UniProt(verbose=False)
print(f"Searching for: {query}")
# Try direct retrieval first (if query looks like accession)
if len(query) == 6 and query[0] in "OPQ":
try:
entry = u.retrieve(query, frmt="tab")
if entry:
uniprot_id = query
print(f"✓ Found UniProt entry: {uniprot_id}")
return u, uniprot_id
except:
pass
# Otherwise search
results = u.search(query, frmt="tab", columns="id,genes,organism,length,protein names", limit=5)
if not results:
print("✗ No results found")
return u, None
lines = results.strip().split("\n")
if len(lines) < 2:
print("✗ No entries found")
return u, None
# Display results
print(f"\n✓ Found {len(lines)-1} result(s):")
for i, line in enumerate(lines[1:], 1):
fields = line.split("\t")
print(f" {i}. {fields[0]} - {fields[1]} ({fields[2]})")
# Use first result
first_entry = lines[1].split("\t")
uniprot_id = first_entry[0]
gene_names = first_entry[1] if len(first_entry) > 1 else "N/A"
organism = first_entry[2] if len(first_entry) > 2 else "N/A"
length = first_entry[3] if len(first_entry) > 3 else "N/A"
protein_name = first_entry[4] if len(first_entry) > 4 else "N/A"
print(f"\nUsing first result:")
print(f" UniProt ID: {uniprot_id}")
print(f" Gene names: {gene_names}")
print(f" Organism: {organism}")
print(f" Length: {length} aa")
print(f" Protein: {protein_name}")
return u, uniprot_id
def retrieve_sequence(uniprot, uniprot_id):
"""Retrieve FASTA sequence for protein."""
print(f"\n{'='*70}")
print("STEP 2: FASTA Sequence Retrieval")
print(f"{'='*70}")
try:
sequence = uniprot.retrieve(uniprot_id, frmt="fasta")
if sequence:
# Extract sequence only (remove header)
lines = sequence.strip().split("\n")
header = lines[0]
seq_only = "".join(lines[1:])
print(f"✓ Retrieved sequence:")
print(f" Header: {header}")
print(f" Length: {len(seq_only)} residues")
print(f" First 60 residues: {seq_only[:60]}...")
return seq_only
else:
print("✗ Failed to retrieve sequence")
return None
except Exception as e:
print(f"✗ Error: {e}")
return None
def run_blast(sequence, email, skip=False):
"""Run BLAST similarity search."""
print(f"\n{'='*70}")
print("STEP 3: BLAST Similarity Search")
print(f"{'='*70}")
if skip:
print("⊘ Skipped (--skip-blast flag)")
return None
if not email or "@" not in email:
print("⊘ Skipped (valid email required for BLAST)")
return None
try:
print(f"Submitting BLASTP job...")
print(f" Database: uniprotkb")
print(f" Sequence length: {len(sequence)} aa")
s = NCBIblast(verbose=False)
jobid = s.run(
program="blastp",
sequence=sequence,
stype="protein",
database="uniprotkb",
email=email
)
print(f"✓ Job submitted: {jobid}")
print(f" Waiting for completion...")
# Poll for completion
max_wait = 300 # 5 minutes
start_time = time.time()
while time.time() - start_time < max_wait:
status = s.getStatus(jobid)
elapsed = int(time.time() - start_time)
print(f" Status: {status} (elapsed: {elapsed}s)", end="\r")
if status == "FINISHED":
print(f"\n✓ BLAST completed in {elapsed}s")
# Retrieve results
results = s.getResult(jobid, "out")
# Parse and display summary
lines = results.split("\n")
print(f"\n Results preview:")
for line in lines[:20]:
if line.strip():
print(f" {line}")
return results
elif status == "ERROR":
print(f"\n✗ BLAST job failed")
return None
time.sleep(5)
print(f"\n✗ Timeout after {max_wait}s")
return None
except Exception as e:
print(f"✗ Error: {e}")
return None
def discover_pathways(uniprot, kegg, uniprot_id):
"""Discover KEGG pathways for protein."""
print(f"\n{'='*70}")
print("STEP 4: KEGG Pathway Discovery")
print(f"{'='*70}")
try:
# Map UniProt → KEGG
print(f"Mapping {uniprot_id} to KEGG...")
kegg_mapping = uniprot.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
if not kegg_mapping or uniprot_id not in kegg_mapping:
print("✗ No KEGG mapping found")
return []
kegg_ids = kegg_mapping[uniprot_id]
print(f"✓ KEGG ID(s): {kegg_ids}")
# Get pathways for first KEGG ID
kegg_id = kegg_ids[0]
organism, gene_id = kegg_id.split(":")
print(f"\nSearching pathways for {kegg_id}...")
pathways = kegg.get_pathway_by_gene(gene_id, organism)
if not pathways:
print("✗ No pathways found")
return []
print(f"✓ Found {len(pathways)} pathway(s):\n")
# Get pathway names
pathway_info = []
for pathway_id in pathways:
try:
entry = kegg.get(pathway_id)
# Extract pathway name
pathway_name = "Unknown"
for line in entry.split("\n"):
if line.startswith("NAME"):
pathway_name = line.replace("NAME", "").strip()
break
pathway_info.append((pathway_id, pathway_name))
print(f"{pathway_id}: {pathway_name}")
except Exception as e:
print(f"{pathway_id}: [Error retrieving name]")
return pathway_info
except Exception as e:
print(f"✗ Error: {e}")
return []
def find_interactions(protein_query):
"""Find protein-protein interactions via PSICQUIC."""
print(f"\n{'='*70}")
print("STEP 5: Protein-Protein Interactions")
print(f"{'='*70}")
try:
p = PSICQUIC()
# Try querying MINT database
query = f"{protein_query} AND species:9606"
print(f"Querying MINT database...")
print(f" Query: {query}")
results = p.query("mint", query)
if not results:
print("✗ No interactions found in MINT")
return []
# Parse PSI-MI TAB format
lines = results.strip().split("\n")
print(f"✓ Found {len(lines)} interaction(s):\n")
# Display first 10 interactions
interactions = []
for i, line in enumerate(lines[:10], 1):
fields = line.split("\t")
if len(fields) >= 12:
protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]
interaction_type = fields[11]
interactions.append((protein_a, protein_b, interaction_type))
print(f" {i}. {protein_a}{protein_b}")
if len(lines) > 10:
print(f" ... and {len(lines)-10} more")
return interactions
except Exception as e:
print(f"✗ Error: {e}")
return []
def get_go_annotations(uniprot_id):
"""Retrieve GO annotations."""
print(f"\n{'='*70}")
print("STEP 6: Gene Ontology Annotations")
print(f"{'='*70}")
try:
g = QuickGO()
print(f"Retrieving GO annotations for {uniprot_id}...")
annotations = g.Annotation(protein=uniprot_id, format="tsv")
if not annotations:
print("✗ No GO annotations found")
return []
lines = annotations.strip().split("\n")
print(f"✓ Found {len(lines)-1} annotation(s)\n")
# Group by aspect
aspects = {"P": [], "F": [], "C": []}
for line in lines[1:]:
fields = line.split("\t")
if len(fields) >= 9:
go_id = fields[6]
go_term = fields[7]
go_aspect = fields[8]
if go_aspect in aspects:
aspects[go_aspect].append((go_id, go_term))
# Display summary
print(f" Biological Process (P): {len(aspects['P'])} terms")
for go_id, go_term in aspects['P'][:5]:
print(f"{go_id}: {go_term}")
if len(aspects['P']) > 5:
print(f" ... and {len(aspects['P'])-5} more")
print(f"\n Molecular Function (F): {len(aspects['F'])} terms")
for go_id, go_term in aspects['F'][:5]:
print(f"{go_id}: {go_term}")
if len(aspects['F']) > 5:
print(f" ... and {len(aspects['F'])-5} more")
print(f"\n Cellular Component (C): {len(aspects['C'])} terms")
for go_id, go_term in aspects['C'][:5]:
print(f"{go_id}: {go_term}")
if len(aspects['C']) > 5:
print(f" ... and {len(aspects['C'])-5} more")
return aspects
except Exception as e:
print(f"✗ Error: {e}")
return {}
def main():
"""Main workflow."""
parser = argparse.ArgumentParser(
description="Complete protein analysis workflow using BioServices",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python protein_analysis_workflow.py ZAP70_HUMAN user@example.com
python protein_analysis_workflow.py P43403 user@example.com --skip-blast
"""
)
parser.add_argument("protein", help="Protein name or UniProt ID")
parser.add_argument("email", help="Email address (required for BLAST)")
parser.add_argument("--skip-blast", action="store_true",
help="Skip BLAST search (faster)")
args = parser.parse_args()
print("=" * 70)
print("BIOSERVICES: Complete Protein Analysis Workflow")
print("=" * 70)
# Step 1: Search protein
uniprot, uniprot_id = search_protein(args.protein)
if not uniprot_id:
print("\n✗ Failed to find protein. Exiting.")
sys.exit(1)
# Step 2: Retrieve sequence
sequence = retrieve_sequence(uniprot, uniprot_id)
if not sequence:
print("\n⚠ Warning: Could not retrieve sequence")
# Step 3: BLAST search
if sequence:
blast_results = run_blast(sequence, args.email, args.skip_blast)
# Step 4: Pathway discovery
kegg = KEGG()
pathways = discover_pathways(uniprot, kegg, uniprot_id)
# Step 5: Interaction mapping
interactions = find_interactions(args.protein)
# Step 6: GO annotations
go_terms = get_go_annotations(uniprot_id)
# Summary
print(f"\n{'='*70}")
print("WORKFLOW SUMMARY")
print(f"{'='*70}")
print(f" Protein: {args.protein}")
print(f" UniProt ID: {uniprot_id}")
print(f" Sequence: {'' if sequence else ''}")
print(f" BLAST: {'' if not args.skip_blast and sequence else ''}")
print(f" Pathways: {len(pathways)} found")
print(f" Interactions: {len(interactions)} found")
print(f" GO annotations: {sum(len(v) for v in go_terms.values())} found")
print(f"{'='*70}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,505 @@
---
name: cellxgene-census
description: Access and analyze single-cell genomics data from the CZ CELLxGENE Census. This skill should be used when working with large-scale single-cell RNA-seq data, querying cell and gene metadata, training machine learning models on Census data, integrating multiple single-cell datasets, or performing cross-dataset analyses. It covers data exploration, expression queries, out-of-core processing, PyTorch integration, and scanpy workflows.
---
# CZ CELLxGENE Census
## Overview
The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
The Census includes:
- **61+ million cells** from human and mouse
- **Standardized metadata** (cell types, tissues, diseases, donors)
- **Raw gene expression** matrices
- **Pre-calculated embeddings** and statistics
- **Integration with PyTorch, scanpy, and other analysis tools**
## When to Use This Skill
Use this skill when tasks involve:
- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions
## Installation and Setup
Install the Census API:
```bash
pip install cellxgene-census
```
For machine learning workflows, install additional dependencies:
```bash
pip install cellxgene-census[experimental]
```
## Core Workflow Patterns
### 1. Opening the Census
Always use the context manager to ensure proper resource cleanup:
```python
import cellxgene_census
# Open latest stable version
with cellxgene_census.open_soma() as census:
# Work with census data
# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
# Work with census data
```
**Key points:**
- Use context manager (`with` statement) for automatic cleanup
- Specify `census_version` for reproducible analyses
- Default opens latest "stable" release
### 2. Exploring Census Information
Before querying expression data, explore available datasets and metadata.
**Access summary information:**
```python
# Get summary statistics
summary = census["census_info"]["summary"].read().concat().to_pandas()
print(f"Total cells: {summary['total_cell_count'][0]}")
# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# Filter datasets by criteria (dataset-level metadata exposes titles, not per-cell fields like disease)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
```
**Query cell metadata to understand available data:**
```python
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
census,
"homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["cell_type"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")
# Count cells by tissue
tissue_counts = cell_metadata.groupby("tissue_general").size()
```
**Important:** Unless specifically analyzing duplicates, always filter for `is_primary_data == True` to avoid counting the same cell more than once.
### 3. Querying Expression Data (Small to Medium Scale)
For queries returning < 100k cells that fit in memory, use `get_anndata()`:
```python
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens", # or "Mus musculus"
obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
obs_column_names=["assay", "disease", "sex", "donor_id"],
)
# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
obs_column_names=["cell_type", "tissue_general", "donor_id"],
)
```
**Filter syntax:**
- Use `obs_value_filter` for cell filtering
- Use `var_value_filter` for gene filtering
- Combine conditions with `and`, `or`
- Use `in` for multiple values: `tissue in ['lung', 'liver']`
- Select only needed columns with `obs_column_names`
**Getting metadata separately:**
```python
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="disease == 'COVID-19' and is_primary_data == True",
column_names=["cell_type", "tissue_general", "donor_id"]
)
# Query gene metadata
gene_metadata = cellxgene_census.get_var(
census, "homo_sapiens",
value_filter="feature_name in ['CD4', 'CD8A']",
column_names=["feature_id", "feature_name", "feature_length"]
)
```
### 4. Large-Scale Queries (Out-of-Core Processing)
For queries exceeding available RAM, use `axis_query()` with iterative processing:
```python
import tiledbsoma as soma
# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
measurement_name="RNA",
obs_query=soma.AxisQuery(
value_filter="tissue_general == 'brain' and is_primary_data == True"
),
var_query=soma.AxisQuery(
value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
)
)
# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
# batch is a pyarrow.Table with columns:
# - soma_data: expression value
# - soma_dim_0: cell (obs) coordinate
# - soma_dim_1: gene (var) coordinate
process_batch(batch)
```
**Computing incremental statistics:**
```python
# Example: Calculate mean expression
n_observations = 0
sum_values = 0.0
iterator = query.X("raw").tables()
for batch in iterator:
values = batch["soma_data"].to_numpy()
n_observations += len(values)
sum_values += values.sum()
mean_expression = sum_values / n_observations
```
### 5. Machine Learning with PyTorch
For training models, use the experimental PyTorch integration:
```python
from cellxgene_census.experimental.ml import experiment_dataloader
with cellxgene_census.open_soma() as census:
# Create dataloader
dataloader = experiment_dataloader(
census["census_data"]["homo_sapiens"],
measurement_name="RNA",
X_name="raw",
obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
obs_column_names=["cell_type"],
batch_size=128,
shuffle=True,
)
# Training loop
for epoch in range(num_epochs):
for batch in dataloader:
X = batch["X"] # Gene expression tensor
labels = batch["obs"]["cell_type"] # Cell type labels
# Forward pass
outputs = model(X)
loss = criterion(outputs, labels)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
**Train/test splitting:**
```python
from cellxgene_census.experimental.ml import ExperimentDataset
# Create dataset from experiment
dataset = ExperimentDataset(
experiment_axis_query,
layer_name="raw",
obs_column_names=["cell_type"],
batch_size=128,
)
# Split into train and test
train_dataset, test_dataset = dataset.random_split(
split=[0.8, 0.2],
seed=42
)
```
### 6. Integration with Scanpy
Seamlessly integrate Census data with scanpy workflows:
```python
import scanpy as sc
# Load data from Census
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
)
# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
```
### 7. Multi-Dataset Integration
Query and integrate multiple datasets:
```python
# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []
for tissue in tissues:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
)
adata.obs["tissue"] = tissue
adatas.append(adata)
# Concatenate
combined = adatas[0].concatenate(adatas[1:])
# Strategy 2: Query multiple datasets directly
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)
```
## Key Concepts and Best Practices
### Always Filter for Primary Data
Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```
### Specify Census Version for Reproducibility
Always specify the Census version in production analyses:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
### Estimate Query Size Before Loading
For large queries, first check the number of cells to avoid memory issues:
```python
# Get cell count
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")
# If too large (>100k), use out-of-core processing
```
### Use tissue_general for Broader Groupings
The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses:
```python
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"
# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```
### Select Only Needed Columns
Minimize data transfer by specifying only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns
```
### Check Dataset Presence for Gene-Specific Queries
When analyzing specific genes, verify which datasets measured them:
```python
presence = cellxgene_census.get_presence_matrix(
census,
"homo_sapiens",
var_value_filter="feature_name in ['CD4', 'CD8A']"
)
```
### Two-Step Workflow: Explore Then Query
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="disease == 'COVID-19' and is_primary_data == True",
column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())
# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```
## Available Metadata Fields
### Cell Metadata (obs)
Key fields for filtering:
- `cell_type`, `cell_type_ontology_term_id`
- `tissue`, `tissue_general`, `tissue_ontology_term_id`
- `disease`, `disease_ontology_term_id`
- `assay`, `assay_ontology_term_id`
- `donor_id`, `sex`, `self_reported_ethnicity`
- `development_stage`, `development_stage_ontology_term_id`
- `dataset_id`
- `is_primary_data` (Boolean: True = unique cell)
### Gene Metadata (var)
- `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
- `feature_name` (Gene symbol, e.g., "FOXP2")
- `feature_length` (Gene length in base pairs)
## Reference Documentation
This skill includes detailed reference documentation:
### references/census_schema.md
Comprehensive documentation of:
- Census data structure and organization
- All available metadata fields
- Value filter syntax and operators
- SOMA object types
- Data inclusion criteria
**When to read:** When you need detailed schema information, full list of metadata fields, or complex filter syntax.
### references/common_patterns.md
Examples and patterns for:
- Exploratory queries (metadata only)
- Small-to-medium queries (AnnData)
- Large queries (out-of-core processing)
- PyTorch integration
- Scanpy integration workflows
- Multi-dataset integration
- Best practices and common pitfalls
**When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues.
## Common Use Cases
### Use Case 1: Explore Cell Types in a Tissue
```python
with cellxgene_census.open_soma() as census:
cells = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="tissue_general == 'lung' and is_primary_data == True",
column_names=["cell_type"]
)
print(cells["cell_type"].value_counts())
```
### Use Case 2: Query Marker Gene Expression
```python
with cellxgene_census.open_soma() as census:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
)
```
### Use Case 3: Train Cell Type Classifier
```python
from cellxgene_census.experimental.ml import experiment_dataloader
with cellxgene_census.open_soma() as census:
dataloader = experiment_dataloader(
census["census_data"]["homo_sapiens"],
measurement_name="RNA",
X_name="raw",
obs_value_filter="is_primary_data == True",
obs_column_names=["cell_type"],
batch_size=128,
shuffle=True,
)
# Train model
for epoch in range(epochs):
for batch in dataloader:
# Training logic
pass
```
### Use Case 4: Cross-Tissue Analysis
```python
with cellxgene_census.open_soma() as census:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
)
# Analyze macrophage differences across tissues
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
```
## Troubleshooting
### Query Returns Too Many Cells
- Add more specific filters to reduce scope
- Use `tissue` instead of `tissue_general` for finer granularity
- Filter by specific `dataset_id` if known
- Switch to out-of-core processing for large queries
### Memory Errors
- Reduce query scope with more restrictive filters
- Select fewer genes with `var_value_filter`
- Use out-of-core processing with `axis_query()`
- Process data in batches
### Duplicate Cells in Results
- Always include `is_primary_data == True` in filters
- Check if intentionally querying across multiple datasets
### Gene Not Found
- Verify gene name spelling (case-sensitive); a lookup sketch follows this list
- Try Ensembl ID with `feature_id` instead of `feature_name`
- Check dataset presence matrix to see if gene was measured
- Some genes may have been filtered during Census construction
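A quick way to rule out naming issues is to look the gene up in the Census gene metadata before filtering on it. A minimal sketch (the symbol variants below are illustrative):
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    genes = cellxgene_census.get_var(
        census, "homo_sapiens",
        value_filter="feature_name in ['FOXP2', 'Foxp2']",  # try symbol variants
        column_names=["feature_id", "feature_name"],
    )
    print(genes)  # an empty result means the symbol is not present under that name
```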
### Version Inconsistencies
- Always specify `census_version` explicitly
- Use same version across all analyses
- Check release notes for version-specific changes

View File

@@ -0,0 +1,182 @@
# CZ CELLxGENE Census Data Schema Reference
## Overview
The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
## High-Level Structure
The Census is organized as a `SOMACollection` with two main components:
### 1. census_info
Summary information including:
- **summary**: Build date, cell counts, dataset statistics
- **datasets**: All datasets from CELLxGENE Discover with metadata
- **summary_cell_counts**: Cell counts stratified by metadata categories
### 2. census_data
Organism-specific `SOMAExperiment` objects:
- **"homo_sapiens"**: Human single-cell data
- **"mus_musculus"**: Mouse single-cell data
## Data Structure Per Organism
Each organism experiment contains:
### obs (Cell Metadata)
Cell-level annotations stored as a `SOMADataFrame`. Access via:
```python
census["census_data"]["homo_sapiens"].obs
```
### ms["RNA"] (Measurement)
RNA measurement data including:
- **X**: Data matrices with layers:
- `raw`: Raw count data
- `normalized`: (if available) Normalized counts
- **var**: Gene metadata
- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset
## Cell Metadata Fields (obs)
### Required/Core Fields
**Identity & Dataset:**
- `soma_joinid`: Unique integer identifier for joins
- `dataset_id`: Source dataset identifier
- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)
**Cell Type:**
- `cell_type`: Human-readable cell type name
- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")
**Tissue:**
- `tissue`: Specific tissue name
- `tissue_general`: Broader tissue category (useful for grouping)
- `tissue_ontology_term_id`: Standardized ontology term
**Assay:**
- `assay`: Sequencing technology used
- `assay_ontology_term_id`: Standardized ontology term
**Disease:**
- `disease`: Disease status or condition
- `disease_ontology_term_id`: Standardized ontology term
**Donor:**
- `donor_id`: Unique donor identifier
- `sex`: Biological sex (male, female, unknown)
- `self_reported_ethnicity`: Ethnicity information
- `development_stage`: Life stage (adult, child, embryonic, etc.)
- `development_stage_ontology_term_id`: Standardized ontology term
**Organism:**
- `organism`: Scientific name (Homo sapiens, Mus musculus)
- `organism_ontology_term_id`: Standardized ontology term
**Technical:**
- `suspension_type`: Sample preparation type (cell, nucleus, na)
## Gene Metadata Fields (var)
Access via:
```python
census["census_data"]["homo_sapiens"].ms["RNA"].var
```
**Available Fields:**
- `soma_joinid`: Unique integer identifier for joins
- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
- `feature_name`: Gene symbol (e.g., "FOXP2")
- `feature_length`: Gene length in base pairs
## Value Filter Syntax
Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.
### Comparison Operators
- `==`: Equal to
- `!=`: Not equal to
- `<`, `>`, `<=`, `>=`: Numeric comparisons
- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)
### Logical Operators
- `and`, `&`: Logical AND
- `or`, `|`: Logical OR
### Examples
**Single condition:**
```python
value_filter="cell_type == 'B cell'"
```
**Multiple conditions with AND:**
```python
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
```
**Using IN for multiple values:**
```python
value_filter="tissue in ['lung', 'liver', 'kidney']"
```
**Complex condition:**
```python
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
```
**Filtering genes:**
```python
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
```
## Data Inclusion Criteria
The Census includes all data from CZ CELLxGENE Discover meeting:
1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
2. **Technology**: Approved sequencing technologies for RNA
3. **Count Type**: Raw counts only (no processed/normalized-only data)
4. **Metadata**: Standardized following CELLxGENE schema
5. **Both spatial and non-spatial data**: Includes traditional and spatial transcriptomics
## Important Data Characteristics
### Duplicate Cells
Cells may appear across multiple datasets. Use `is_primary_data == True` to filter for unique cells in most analyses.
### Count Types
The Census includes:
- **Molecule counts**: From UMI-based methods
- **Full-gene sequencing read counts**: From non-UMI methods
These may need different normalization approaches.
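As a sketch of how this might be handled downstream (assuming the query result `adata` was loaded with `get_anndata()` and the `assay` column was requested; the grouping of assays is an assumption to verify against the data):
```python
import scanpy as sc

# Inspect which assays contributed cells to this query
print(adata.obs["assay"].value_counts())

# UMI-based molecule counts: simple depth normalization is usually sufficient
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Full-gene read counts (e.g. Smart-seq style assays) may additionally need
# gene-length normalization using adata.var["feature_length"] before this step.
```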
### Versioning
Census releases are versioned (e.g., "2023-07-25", "stable"). Always specify version for reproducible analysis:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
## Dataset Presence Matrix
Access which genes were measured in each dataset:
```python
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
```
This sparse boolean matrix helps understand:
- Gene coverage across datasets
- Which datasets to include for specific gene analyses
- Technical batch effects related to gene coverage
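A minimal usage sketch, assuming `get_presence_matrix` returns a datasets × genes boolean matrix whose rows align with `census_info["datasets"]` (verify against the release in use):
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    presence = cellxgene_census.get_presence_matrix(census, "Homo sapiens", "RNA")
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    # Genes measured per dataset (row sums of the boolean matrix)
    genes_per_dataset = presence.sum(axis=1)
    print(datasets["dataset_title"].head())
    print(genes_per_dataset[:5])
```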
## SOMA Object Types
Core TileDB-SOMA objects used:
- **DataFrame**: Tabular data (obs, var)
- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
- **DenseNDArray**: Dense arrays (less common)
- **Collection**: Container for related objects
- **Experiment**: Top-level container for measurements
- **SOMAScene**: Spatial transcriptomics scenes
- **obs_spatial_presence**: Spatial data availability
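As an illustration of working with these objects directly, a `DataFrame` such as `obs` can be read into pandas with a filter; a sketch, assuming an open `census` handle (filter and column values are examples):
```python
human = census["census_data"]["homo_sapiens"]
obs_df = (
    human.obs.read(
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type", "assay"],
    )
    .concat()
    .to_pandas()
)
print(obs_df["cell_type"].value_counts().head())
```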

View File

@@ -0,0 +1,351 @@
# Common Query Patterns and Best Practices
## Query Pattern Categories
### 1. Exploratory Queries (Metadata Only)
Use when exploring available data without loading expression matrices.
**Pattern: Get unique cell types in a tissue**
```python
import cellxgene_census
with cellxgene_census.open_soma() as census:
cell_metadata = cellxgene_census.get_obs(
census,
"homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["cell_type"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} unique cell types")
```
**Pattern: Count cells by condition**
```python
cell_metadata = cellxgene_census.get_obs(
census,
"homo_sapiens",
value_filter="disease != 'normal' and is_primary_data == True",
column_names=["disease", "tissue_general"]
)
counts = cell_metadata.groupby(["disease", "tissue_general"]).size()
```
**Pattern: Explore dataset information**
```python
# Access datasets table
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# Filter for specific criteria (search dataset titles; disease is a cell-level field)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
```
### 2. Small-to-Medium Queries (AnnData)
Use `get_anndata()` when results fit in memory (typically < 100k cells).
**Pattern: Tissue-specific cell type query**
```python
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
obs_column_names=["assay", "disease", "sex", "donor_id"],
)
```
**Pattern: Gene-specific query with multiple genes**
```python
marker_genes = ["CD4", "CD8A", "CD19", "FOXP3"]
# First get gene IDs
gene_metadata = cellxgene_census.get_var(
census, "homo_sapiens",
value_filter=f"feature_name in {marker_genes}",
column_names=["feature_id", "feature_name"]
)
gene_ids = gene_metadata["feature_id"].tolist()
# Query with gene filter
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_value_filter=f"feature_id in {gene_ids}",
obs_value_filter="cell_type == 'T cell' and is_primary_data == True",
)
```
**Pattern: Multi-tissue query**
```python
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
obs_column_names=["cell_type", "tissue_general", "dataset_id"],
)
```
**Pattern: Disease-specific query**
```python
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="disease == 'COVID-19' and tissue_general == 'lung' and is_primary_data == True",
)
```
### 3. Large Queries (Out-of-Core Processing)
Use `axis_query()` for queries that exceed available RAM.
**Pattern: Iterative processing**
```python
import tiledbsoma as soma
# Create query
query = census["census_data"]["homo_sapiens"].axis_query(
measurement_name="RNA",
obs_query=soma.AxisQuery(
value_filter="tissue_general == 'brain' and is_primary_data == True"
),
var_query=soma.AxisQuery(
value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
)
)
# Iterate through X matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
# Process batch (a pyarrow.Table)
# batch has columns: soma_data, soma_dim_0, soma_dim_1
process_batch(batch)
```
**Pattern: Incremental statistics (mean/variance)**
```python
# Using Welford's online algorithm
n = 0
mean = 0
M2 = 0
iterator = query.X("raw").tables()
for batch in iterator:
values = batch["soma_data"].to_numpy()
for x in values:
n += 1
delta = x - mean
mean += delta / n
delta2 = x - mean
M2 += delta * delta2
variance = M2 / (n - 1) if n > 1 else 0
```
### 4. PyTorch Integration (Machine Learning)
Use `experiment_dataloader()` for training models.
**Pattern: Create training dataloader**
```python
from cellxgene_census.experimental.ml import experiment_dataloader
import torch
with cellxgene_census.open_soma() as census:
# Create dataloader
dataloader = experiment_dataloader(
census["census_data"]["homo_sapiens"],
measurement_name="RNA",
X_name="raw",
obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
obs_column_names=["cell_type"],
batch_size=128,
shuffle=True,
)
# Training loop
for epoch in range(num_epochs):
for batch in dataloader:
X = batch["X"] # Gene expression
labels = batch["obs"]["cell_type"] # Cell type labels
# Train model...
```
**Pattern: Train/test split**
```python
from cellxgene_census.experimental.ml import ExperimentDataset
# Create dataset from query
dataset = ExperimentDataset(
experiment_axis_query,
layer_name="raw",
obs_column_names=["cell_type"],
batch_size=128,
)
# Split data
train_dataset, test_dataset = dataset.random_split(
split=[0.8, 0.2],
seed=42
)
# Create loaders
train_loader = experiment_dataloader(train_dataset)
test_loader = experiment_dataloader(test_dataset)
```
### 5. Integration Workflows
**Pattern: Scanpy integration**
```python
import scanpy as sc
# Load data
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'neuron' and is_primary_data == True",
)
# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["cell_type", "tissue_general"])
```
**Pattern: Multi-dataset integration**
```python
# Query multiple datasets separately
datasets_to_integrate = ["dataset_id_1", "dataset_id_2", "dataset_id_3"]
adatas = []
for dataset_id in datasets_to_integrate:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter=f"dataset_id == '{dataset_id}' and is_primary_data == True",
)
adatas.append(adata)
# Integrate using scanorama, harmony, or other tools
# (scanorama operates on one AnnData with a batch key and a precomputed PCA)
import scanpy as sc
import scanpy.external as sce
combined = adatas[0].concatenate(adatas[1:], batch_key="dataset_id")
sc.pp.pca(combined)
sce.pp.scanorama_integrate(combined, key="dataset_id")
```
## Best Practices
### 1. Always Filter for Primary Data
Unless specifically analyzing duplicates, always include `is_primary_data == True`:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```
### 2. Specify Census Version
For reproducible analysis, always specify the Census version:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
### 3. Use Context Manager
Always use the context manager to ensure proper cleanup:
```python
with cellxgene_census.open_soma() as census:
# Your code here
```
### 4. Select Only Needed Columns
Minimize data transfer by selecting only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns
```
### 5. Check Dataset Presence for Gene Queries
When analyzing specific genes, check which datasets measured them:
```python
presence = cellxgene_census.get_presence_matrix(
census,
"homo_sapiens",
var_value_filter="feature_name in ['CD4', 'CD8A']"
)
```
### 6. Use tissue_general for Broader Queries
`tissue_general` provides coarser groupings than `tissue`, useful for cross-tissue analyses:
```python
# Better for broad queries
obs_value_filter="tissue_general == 'immune system'"
# Use specific tissue when needed
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```
### 7. Combine Metadata Exploration with Expression Queries
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="disease == 'COVID-19'",
column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())
# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```
### 8. Memory Management for Large Queries
For large queries, check estimated size before loading:
```python
# Get cell count first
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells} cells")
# If too large, use out-of-core processing or further filtering
```
### 9. Leverage Ontology Terms for Consistency
When possible, use ontology term IDs instead of free text:
```python
# More reliable than cell_type == 'B cell' across datasets
obs_value_filter="cell_type_ontology_term_id == 'CL:0000236'"
```
### 10. Batch Processing Pattern
For systematic analyses across multiple conditions:
```python
tissues = ["lung", "liver", "kidney", "heart"]
results = {}
for tissue in tissues:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
)
# Perform analysis
results[tissue] = analyze(adata)
```
## Common Pitfalls to Avoid
1. **Not filtering for is_primary_data**: Leads to counting duplicate cells
2. **Loading too much data**: Use metadata queries to estimate size first
3. **Not using context manager**: Can cause resource leaks
4. **Inconsistent versioning**: Results not reproducible without specifying version
5. **Overly broad queries**: Start with focused queries, expand as needed
6. **Ignoring dataset presence**: Some genes not measured in all datasets
7. **Wrong count normalization**: Be aware of UMI vs read count differences
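The sketch below ties several of these recommendations together in a single query (the version string and filter values are illustrative):
```python
import cellxgene_census

# Pinned version, context manager, primary-data filter, and minimal columns
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["cell_type", "tissue_general", "assay"],
    )
print(adata)
```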

View File

@@ -0,0 +1,457 @@
---
name: cobrapy
description: Comprehensive toolkit for constraint-based reconstruction and analysis (COBRA) of metabolic models. Use when working with genome-scale metabolic models, performing flux balance analysis (FBA), simulating cellular metabolism, conducting gene/reaction knockout studies, gapfilling metabolic networks, analyzing flux distributions, calculating minimal media requirements, or any systems biology task involving computational modeling of cellular metabolism. Supports SBML, JSON, YAML, and MATLAB formats.
---
# COBRApy - Constraint-Based Reconstruction and Analysis
## Overview
COBRApy is a Python library for constraint-based reconstruction and analysis (COBRA) of metabolic models, essential for systems biology research. Use this skill to work with genome-scale metabolic models, perform computational simulations of cellular metabolism, conduct metabolic engineering analyses, and predict phenotypic behaviors.
## Core Capabilities
COBRApy provides comprehensive tools organized into several key areas:
### 1. Model Management
Load existing models from repositories or files:
```python
from cobra.io import load_model
# Load bundled test models
model = load_model("textbook") # E. coli core model
model = load_model("ecoli") # Full E. coli model
model = load_model("salmonella")
# Load from files
from cobra.io import read_sbml_model, load_json_model, load_yaml_model
model = read_sbml_model("path/to/model.xml")
model = load_json_model("path/to/model.json")
model = load_yaml_model("path/to/model.yml")
```
Save models in various formats:
```python
from cobra.io import write_sbml_model, save_json_model, save_yaml_model
write_sbml_model(model, "output.xml") # Preferred format
save_json_model(model, "output.json") # For Escher compatibility
save_yaml_model(model, "output.yml") # Human-readable
```
### 2. Model Structure and Components
Access and inspect model components:
```python
# Access components
model.reactions # DictList of all reactions
model.metabolites # DictList of all metabolites
model.genes # DictList of all genes
# Get specific items by ID or index
reaction = model.reactions.get_by_id("PFK")
metabolite = model.metabolites[0]
# Inspect properties
print(reaction.reaction) # Stoichiometric equation
print(reaction.bounds) # Flux constraints
print(reaction.gene_reaction_rule) # GPR logic
print(metabolite.formula) # Chemical formula
print(metabolite.compartment) # Cellular location
```
### 3. Flux Balance Analysis (FBA)
Perform standard FBA simulation:
```python
# Basic optimization
solution = model.optimize()
print(f"Objective value: {solution.objective_value}")
print(f"Status: {solution.status}")
# Access fluxes
print(solution.fluxes["PFK"])
print(solution.fluxes.head())
# Fast optimization (objective value only)
objective_value = model.slim_optimize()
# Change objective
model.objective = "ATPM"
solution = model.optimize()
```
Parsimonious FBA (minimize total flux):
```python
from cobra.flux_analysis import pfba
solution = pfba(model)
```
Geometric FBA (find central solution):
```python
from cobra.flux_analysis import geometric_fba
solution = geometric_fba(model)
```
### 4. Flux Variability Analysis (FVA)
Determine flux ranges for all reactions:
```python
from cobra.flux_analysis import flux_variability_analysis
# Standard FVA
fva_result = flux_variability_analysis(model)
# FVA at 90% optimality
fva_result = flux_variability_analysis(model, fraction_of_optimum=0.9)
# Loopless FVA (eliminates thermodynamically infeasible loops)
fva_result = flux_variability_analysis(model, loopless=True)
# FVA for specific reactions
fva_result = flux_variability_analysis(
model,
reaction_list=["PFK", "FBA", "PGI"]
)
```
### 5. Gene and Reaction Deletion Studies
Perform knockout analyses:
```python
from cobra.flux_analysis import (
single_gene_deletion,
single_reaction_deletion,
double_gene_deletion,
double_reaction_deletion
)
# Single deletions
gene_results = single_gene_deletion(model)
reaction_results = single_reaction_deletion(model)
# Double deletions (uses multiprocessing)
double_gene_results = double_gene_deletion(
model,
processes=4 # Number of CPU cores
)
# Manual knockout using context manager
with model:
model.genes.get_by_id("b0008").knock_out()
solution = model.optimize()
print(f"Growth after knockout: {solution.objective_value}")
# Model automatically reverts after context exit
```
### 6. Growth Media and Minimal Media
Manage growth medium:
```python
# View current medium
print(model.medium)
# Modify medium (must reassign entire dict)
medium = model.medium
medium["EX_glc__D_e"] = 10.0 # Set glucose uptake
medium["EX_o2_e"] = 0.0 # Anaerobic conditions
model.medium = medium
# Calculate minimal media
from cobra.medium import minimal_medium
# Minimize total import flux
min_medium = minimal_medium(model, minimize_components=False)
# Minimize number of components (uses MILP, slower)
min_medium = minimal_medium(
model,
minimize_components=True,
open_exchanges=True
)
```
### 7. Flux Sampling
Sample the feasible flux space:
```python
from cobra.sampling import sample
# Sample using OptGP (default, supports parallel processing)
samples = sample(model, n=1000, method="optgp", processes=4)
# Sample using ACHR
samples = sample(model, n=1000, method="achr")
# Validate samples
from cobra.sampling import OptGPSampler
sampler = OptGPSampler(model, processes=4)
sampler.sample(1000)
validation = sampler.validate(sampler.samples)
print(validation.value_counts()) # Should be all 'v' for valid
```
### 8. Production Envelopes
Calculate phenotype phase planes:
```python
from cobra.flux_analysis import production_envelope
# Standard production envelope
envelope = production_envelope(
model,
reactions=["EX_glc__D_e", "EX_o2_e"],
objective="EX_ac_e" # Acetate production
)
# With carbon yield
envelope = production_envelope(
model,
reactions=["EX_glc__D_e", "EX_o2_e"],
carbon_sources="EX_glc__D_e"
)
# Visualize (use matplotlib or pandas plotting)
import matplotlib.pyplot as plt
envelope.plot(x="EX_glc__D_e", y="EX_o2_e", kind="scatter")
plt.show()
```
### 9. Gapfilling
Add reactions to make models feasible:
```python
from cobra.flux_analysis import gapfill
# Prepare universal model with candidate reactions
universal = load_model("universal")
# Perform gapfilling
with model:
# Remove reactions to create gaps for demonstration
model.remove_reactions([model.reactions.PGI])
# Find reactions needed
solution = gapfill(model, universal)
print(f"Reactions to add: {solution}")
```
### 10. Model Building
Build models from scratch:
```python
from cobra import Model, Reaction, Metabolite
# Create model
model = Model("my_model")
# Create metabolites
atp_c = Metabolite("atp_c", formula="C10H12N5O13P3",
name="ATP", compartment="c")
adp_c = Metabolite("adp_c", formula="C10H12N5O10P2",
name="ADP", compartment="c")
pi_c = Metabolite("pi_c", formula="HO4P",
name="Phosphate", compartment="c")
# Create reaction
reaction = Reaction("ATPASE")
reaction.name = "ATP hydrolysis"
reaction.subsystem = "Energy"
reaction.lower_bound = 0.0
reaction.upper_bound = 1000.0
# Add metabolites with stoichiometry
reaction.add_metabolites({
atp_c: -1.0,
adp_c: 1.0,
pi_c: 1.0
})
# Add gene-reaction rule
reaction.gene_reaction_rule = "(gene1 and gene2) or gene3"
# Add to model
model.add_reactions([reaction])
# Add boundary reactions
model.add_boundary(atp_c, type="exchange")
model.add_boundary(adp_c, type="demand")
# Set objective
model.objective = "ATPASE"
```
## Common Workflows
### Workflow 1: Load Model and Predict Growth
```python
from cobra.io import load_model
# Load model
model = load_model("ecoli")
# Run FBA
solution = model.optimize()
print(f"Growth rate: {solution.objective_value:.3f} /h")
# Show active pathways
print(solution.fluxes[solution.fluxes.abs() > 1e-6])
```
### Workflow 2: Gene Knockout Screen
```python
from cobra.io import load_model
from cobra.flux_analysis import single_gene_deletion
# Load model
model = load_model("ecoli")
# Perform single gene deletions
results = single_gene_deletion(model)
# Find essential genes (growth < threshold)
essential_genes = results[results["growth"] < 0.01]
print(f"Found {len(essential_genes)} essential genes")
# Find genes with minimal impact (relative to wild-type growth)
baseline = model.slim_optimize()
neutral_genes = results[results["growth"] > 0.9 * baseline]
```
### Workflow 3: Media Optimization
```python
from cobra.io import load_model
from cobra.medium import minimal_medium
# Load model
model = load_model("ecoli")
# Calculate minimal medium for 50% of max growth
target_growth = model.slim_optimize() * 0.5
min_medium = minimal_medium(
model,
target_growth,
minimize_components=True
)
print(f"Minimal medium components: {len(min_medium)}")
print(min_medium)
```
### Workflow 4: Flux Uncertainty Analysis
```python
from cobra.io import load_model
from cobra.flux_analysis import flux_variability_analysis
from cobra.sampling import sample
# Load model
model = load_model("ecoli")
# First check flux ranges at optimality
fva = flux_variability_analysis(model, fraction_of_optimum=1.0)
# For reactions with large ranges, sample to understand distribution
samples = sample(model, n=1000)
# Analyze specific reaction
reaction_id = "PFK"
import matplotlib.pyplot as plt
samples[reaction_id].hist(bins=50)
plt.xlabel(f"Flux through {reaction_id}")
plt.ylabel("Frequency")
plt.show()
```
### Workflow 5: Context Manager for Temporary Changes
Use context managers to make temporary modifications:
```python
# Model remains unchanged outside context
with model:
# Temporarily change objective
model.objective = "ATPM"
# Temporarily modify bounds
model.reactions.EX_glc__D_e.lower_bound = -5.0
# Temporarily knock out genes
model.genes.b0008.knock_out()
# Optimize with changes
solution = model.optimize()
print(f"Modified growth: {solution.objective_value}")
# All changes automatically reverted
solution = model.optimize()
print(f"Original growth: {solution.objective_value}")
```
## Key Concepts
### DictList Objects
Models use `DictList` objects for reactions, metabolites, and genes - behaving like both lists and dictionaries:
```python
# Access by index
first_reaction = model.reactions[0]
# Access by ID
pfk = model.reactions.get_by_id("PFK")
# Query methods
atp_reactions = model.reactions.query("atp")
```
### Flux Constraints
Reaction bounds define feasible flux ranges:
- **Irreversible**: `lower_bound = 0, upper_bound > 0`
- **Reversible**: `lower_bound < 0, upper_bound > 0`
- Set both bounds simultaneously with `.bounds` to avoid inconsistencies, as in the sketch below
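A minimal sketch using the bundled textbook model:
```python
from cobra.io import load_model

model = load_model("textbook")
pfk = model.reactions.get_by_id("PFK")
print(pfk.bounds)                # (0.0, 1000.0): irreversible

pfk.bounds = (-1000.0, 1000.0)   # make reversible in a single assignment
pfk.bounds = (0.0, 1000.0)       # restore the original constraint
```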
### Gene-Reaction Rules (GPR)
Boolean logic linking genes to reactions:
```python
# AND logic (both required)
reaction.gene_reaction_rule = "gene1 and gene2"
# OR logic (either sufficient)
reaction.gene_reaction_rule = "gene1 or gene2"
# Complex logic
reaction.gene_reaction_rule = "(gene1 and gene2) or (gene3 and gene4)"
```
### Exchange Reactions
Special reactions representing metabolite import/export:
- Named with prefix `EX_` by convention
- Positive flux = secretion, negative flux = uptake
- Managed through the `model.medium` dictionary (a short sketch follows this list)
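A brief sketch of the sign convention on the textbook model (exchange IDs follow that model's naming):
```python
from cobra.io import load_model

model = load_model("textbook")
medium = model.medium
medium["EX_glc__D_e"] = 10.0   # allow up to 10 mmol/gDW/h glucose uptake
model.medium = medium

solution = model.optimize()
print(solution.fluxes["EX_glc__D_e"])  # negative flux = uptake
print(solution.fluxes["EX_co2_e"])     # positive flux = secretion
```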
## Best Practices
1. **Use context managers** for temporary modifications to avoid state management issues
2. **Validate models** before analysis using `model.slim_optimize()` to ensure feasibility
3. **Check solution status** after optimization - `optimal` indicates successful solve
4. **Use loopless FVA** when thermodynamic feasibility matters
5. **Set fraction_of_optimum** appropriately in FVA to explore suboptimal space
6. **Parallelize** computationally expensive operations (sampling, double deletions)
7. **Prefer SBML format** for model exchange and long-term storage
8. **Use slim_optimize()** when only objective value needed for performance
9. **Validate flux samples** to ensure numerical stability
## Troubleshooting
**Infeasible solutions**: Check medium constraints, reaction bounds, and model consistency
**Slow optimization**: Try different solvers (GLPK, CPLEX, Gurobi) via `model.solver`
**Unbounded solutions**: Verify exchange reactions have appropriate upper bounds
**Import errors**: Ensure correct file format and valid SBML identifiers
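For instance, a minimal pattern for checking the status and retrying with another installed solver (solver availability depends on the local installation):
```python
solution = model.optimize()
if solution.status != "optimal":
    print(f"Solver returned status: {solution.status}")
    model.solver = "glpk"   # try another available solver
    solution = model.optimize()
```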
## References
For detailed workflows and API patterns, refer to:
- `references/workflows.md` - Comprehensive step-by-step workflow examples
- `references/api_quick_reference.md` - Common function signatures and patterns
Official documentation: https://cobrapy.readthedocs.io/en/latest/

View File

@@ -0,0 +1,655 @@
# COBRApy API Quick Reference
This document provides quick reference for common COBRApy functions, signatures, and usage patterns.
## Model I/O
### Loading Models
```python
from cobra.io import load_model, read_sbml_model, load_json_model, load_yaml_model, load_matlab_model
# Bundled test models
model = load_model("textbook") # E. coli core metabolism
model = load_model("ecoli") # Full E. coli iJO1366
model = load_model("salmonella") # Salmonella LT2
# From files
model = read_sbml_model(filename, f_replace={}, **kwargs)
model = load_json_model(filename)
model = load_yaml_model(filename)
model = load_matlab_model(filename, variable_name=None)
```
### Saving Models
```python
from cobra.io import write_sbml_model, save_json_model, save_yaml_model, save_matlab_model
write_sbml_model(model, filename, f_replace={}, **kwargs)
save_json_model(model, filename, pretty=False, **kwargs)
save_yaml_model(model, filename, **kwargs)
save_matlab_model(model, filename, **kwargs)
```
## Model Structure
### Core Classes
```python
from cobra import Model, Reaction, Metabolite, Gene
# Create model
model = Model(id_or_model=None, name=None)
# Create metabolite
metabolite = Metabolite(
id=None,
formula=None,
name="",
charge=None,
compartment=None
)
# Create reaction
reaction = Reaction(
id=None,
name="",
subsystem="",
lower_bound=0.0,
upper_bound=None
)
# Create gene
gene = Gene(id=None, name="", functional=True)
```
### Model Attributes
```python
# Component access (DictList objects)
model.reactions # DictList of Reaction objects
model.metabolites # DictList of Metabolite objects
model.genes # DictList of Gene objects
# Special reaction lists
model.exchanges # Exchange reactions (external transport)
model.demands # Demand reactions (metabolite sinks)
model.sinks # Sink reactions
model.boundary # All boundary reactions
# Model properties
model.objective # Current objective (read/write)
model.objective_direction # "max" or "min"
model.medium # Growth medium (dict of exchange: bound)
model.solver # Optimization solver
```
### DictList Methods
```python
# Access by index
item = model.reactions[0]
# Access by ID
item = model.reactions.get_by_id("PFK")
# Query by string (substring match)
items = model.reactions.query("atp") # Case-insensitive search
items = model.reactions.query(lambda x: x.subsystem == "Glycolysis")
# List comprehension
items = [r for r in model.reactions if r.lower_bound < 0]
# Check membership
"PFK" in model.reactions
```
## Optimization
### Basic Optimization
```python
# Full optimization (returns Solution object)
solution = model.optimize()
# Attributes of Solution
solution.objective_value # Objective function value
solution.status # Optimization status ("optimal", "infeasible", etc.)
solution.fluxes # Pandas Series of reaction fluxes
solution.shadow_prices # Pandas Series of metabolite shadow prices
solution.reduced_costs # Pandas Series of reduced costs
# Fast optimization (returns float only)
objective_value = model.slim_optimize()
# Change objective
model.objective = "ATPM"
model.objective = model.reactions.ATPM
model.objective = {model.reactions.ATPM: 1.0}
# Change optimization direction
model.objective_direction = "max" # or "min"
```
### Solver Configuration
```python
# Check available solvers
from cobra.util.solver import solvers
print(solvers)
# Change solver
model.solver = "glpk" # or "cplex", "gurobi", etc.
# Solver-specific configuration
model.solver.configuration.timeout = 60 # seconds
model.solver.configuration.verbosity = 1
model.solver.configuration.tolerances.feasibility = 1e-9
```
## Flux Analysis
### Flux Balance Analysis (FBA)
```python
from cobra.flux_analysis import pfba, geometric_fba
# Parsimonious FBA
solution = pfba(model, fraction_of_optimum=1.0, **kwargs)
# Geometric FBA
solution = geometric_fba(model, epsilon=1e-06, max_tries=200)
```
### Flux Variability Analysis (FVA)
```python
from cobra.flux_analysis import flux_variability_analysis
fva_result = flux_variability_analysis(
model,
reaction_list=None, # List of reaction IDs or None for all
loopless=False, # Eliminate thermodynamically infeasible loops
fraction_of_optimum=1.0, # Optimality fraction (0.0-1.0)
pfba_factor=None, # Optional pFBA constraint
processes=1 # Number of parallel processes
)
# Returns DataFrame with columns: minimum, maximum
```
### Gene and Reaction Deletions
```python
from cobra.flux_analysis import (
single_gene_deletion,
single_reaction_deletion,
double_gene_deletion,
double_reaction_deletion
)
# Single deletions
results = single_gene_deletion(
model,
gene_list=None, # None for all genes
processes=1,
**kwargs
)
results = single_reaction_deletion(
model,
reaction_list=None, # None for all reactions
processes=1,
**kwargs
)
# Double deletions
results = double_gene_deletion(
model,
gene_list1=None,
gene_list2=None,
processes=1,
**kwargs
)
results = double_reaction_deletion(
model,
reaction_list1=None,
reaction_list2=None,
processes=1,
**kwargs
)
# Returns DataFrame with columns: ids, growth, status
# For double deletions, index is MultiIndex of gene/reaction pairs
```
### Flux Sampling
```python
from cobra.sampling import sample, OptGPSampler, ACHRSampler
# Simple interface
samples = sample(
model,
n, # Number of samples
method="optgp", # or "achr"
thinning=100, # Thinning factor (sample every n iterations)
processes=1, # Parallel processes (OptGP only)
seed=None # Random seed
)
# Advanced interface with sampler objects
sampler = OptGPSampler(model, processes=4, thinning=100)
sampler = ACHRSampler(model, thinning=100)
# Generate samples
samples = sampler.sample(n)
# Validate samples
validation = sampler.validate(sampler.samples)
# Returns array of 'v' (valid), 'l' (lower bound violation),
# 'u' (upper bound violation), 'e' (equality violation)
# Batch sampling
sampler.batch(n_samples, n_batches)
```
### Production Envelopes
```python
from cobra.flux_analysis import production_envelope
envelope = production_envelope(
model,
reactions, # List of 1-2 reaction IDs
objective=None, # Objective reaction ID (None uses model objective)
carbon_sources=None, # Carbon source for yield calculation
points=20, # Number of points to calculate
threshold=0.01 # Minimum objective value threshold
)
# Returns DataFrame with columns:
# - First reaction flux
# - Second reaction flux (if provided)
# - objective_minimum, objective_maximum
# - carbon_yield_minimum, carbon_yield_maximum (if carbon source specified)
# - mass_yield_minimum, mass_yield_maximum
```
### Gapfilling
```python
from cobra.flux_analysis import gapfill
# Basic gapfilling
solution = gapfill(
model,
universal=None, # Universal model with candidate reactions
lower_bound=0.05, # Minimum objective flux
penalties=None, # Dict of reaction: penalty
demand_reactions=True, # Add demand reactions if needed
exchange_reactions=False,
iterations=1
)
# Returns list of Reaction objects to add
# Multiple solutions
solutions = []
for i in range(5):
sol = gapfill(model, universal, iterations=1)
solutions.append(sol)
# Prevent finding same solution by increasing penalties
```
### Other Analysis Methods
```python
from cobra.flux_analysis import (
find_blocked_reactions,
find_essential_genes,
find_essential_reactions
)
# Blocked reactions (cannot carry flux)
blocked = find_blocked_reactions(
model,
reaction_list=None,
zero_cutoff=1e-9,
open_exchanges=False
)
# Essential genes/reactions
essential_genes = find_essential_genes(model, threshold=0.01)
essential_reactions = find_essential_reactions(model, threshold=0.01)
```
## Media and Boundary Conditions
### Medium Management
```python
# Get current medium (returns dict)
medium = model.medium
# Set medium (must reassign entire dict)
medium = model.medium
medium["EX_glc__D_e"] = 10.0
medium["EX_o2_e"] = 20.0
model.medium = medium
# Alternative: individual modification
with model:
model.reactions.EX_glc__D_e.lower_bound = -10.0
```
### Minimal Media
```python
from cobra.medium import minimal_medium
min_medium = minimal_medium(
model,
min_objective_value=0.1, # Minimum growth rate
minimize_components=False, # If True, uses MILP (slower)
open_exchanges=False, # Open all exchanges before optimization
exports=False, # Allow metabolite export
penalties=None # Dict of exchange: penalty
)
# Returns Series of exchange reactions with fluxes
```
### Boundary Reactions
```python
# Add boundary reaction
model.add_boundary(
metabolite,
type="exchange", # or "demand", "sink"
reaction_id=None, # Auto-generated if None
lb=None,
ub=None,
sbo_term=None
)
# Access boundary reactions
exchanges = model.exchanges # System boundary
demands = model.demands # Intracellular removal
sinks = model.sinks # Intracellular exchange
boundaries = model.boundary # All boundary reactions
```
## Model Manipulation
### Adding Components
```python
# Add reactions
model.add_reactions([reaction1, reaction2, ...])
model.add_reaction(reaction)
# Add metabolites
reaction.add_metabolites({
metabolite1: -1.0, # Consumed (negative stoichiometry)
metabolite2: 1.0 # Produced (positive stoichiometry)
})
# Add metabolites to model
model.add_metabolites([metabolite1, metabolite2, ...])
# Add genes (usually automatic via gene_reaction_rule)
model.genes += [gene1, gene2, ...]
```
### Removing Components
```python
# Remove reactions
model.remove_reactions([reaction1, reaction2, ...])
model.remove_reactions(["PFK", "FBA"])
# Remove metabolites (removes from reactions too)
model.remove_metabolites([metabolite1, metabolite2, ...])
# Remove genes (usually via gene_reaction_rule)
model.genes.remove(gene)
```
### Modifying Reactions
```python
# Set bounds
reaction.bounds = (lower, upper)
reaction.lower_bound = 0.0
reaction.upper_bound = 1000.0
# Modify stoichiometry
reaction.add_metabolites({metabolite: 1.0})
reaction.subtract_metabolites({metabolite: 1.0})
# Change gene-reaction rule
reaction.gene_reaction_rule = "(gene1 and gene2) or gene3"
# Knock out
reaction.knock_out()
gene.knock_out()
```
### Model Copying
```python
# Deep copy (independent model)
model_copy = model.copy()
# Copy specific reactions
new_model = Model("subset")
reactions_to_copy = [model.reactions.PFK, model.reactions.FBA]
new_model.add_reactions(reactions_to_copy)
```
## Context Management
Use context managers for temporary modifications:
```python
# Changes automatically revert after with block
with model:
model.objective = "ATPM"
model.reactions.EX_glc__D_e.lower_bound = -5.0
model.genes.b0008.knock_out()
solution = model.optimize()
# Model state restored here
# Multiple nested contexts
with model:
model.objective = "ATPM"
with model:
model.genes.b0008.knock_out()
# Both modifications active
# Only objective change active
# Context management with reactions
with model:
model.reactions.PFK.knock_out()
# Equivalent to: reaction.lower_bound = reaction.upper_bound = 0
```
## Reaction and Metabolite Properties
### Reaction Attributes
```python
reaction.id # Unique identifier
reaction.name # Human-readable name
reaction.subsystem # Pathway/subsystem
reaction.bounds # (lower_bound, upper_bound)
reaction.lower_bound
reaction.upper_bound
reaction.reversibility # Boolean (lower_bound < 0)
reaction.gene_reaction_rule # GPR string
reaction.genes # Set of associated Gene objects
reaction.metabolites # Dict of {metabolite: stoichiometry}
# Methods
reaction.reaction # Stoichiometric equation string
reaction.build_reaction_string() # Same as above
reaction.check_mass_balance() # Returns imbalances or empty dict
reaction.get_coefficient(metabolite_id)
reaction.add_metabolites({metabolite: coeff})
reaction.subtract_metabolites({metabolite: coeff})
reaction.knock_out()
```
### Metabolite Attributes
```python
metabolite.id # Unique identifier
metabolite.name # Human-readable name
metabolite.formula # Chemical formula
metabolite.charge # Charge
metabolite.compartment # Compartment ID
metabolite.reactions # FrozenSet of associated reactions
# Methods
metabolite.summary() # Print production/consumption
metabolite.copy()
```
### Gene Attributes
```python
gene.id # Unique identifier
gene.name # Human-readable name
gene.functional # Boolean activity status
gene.reactions # FrozenSet of associated reactions
# Methods
gene.knock_out()
```
## Model Validation
### Consistency Checking
```python
from cobra.manipulation import check_mass_balance, check_metabolite_compartment_formula
# Check all reactions for mass balance
unbalanced = {}
for reaction in model.reactions:
balance = reaction.check_mass_balance()
if balance:
unbalanced[reaction.id] = balance
# Check metabolite formulas are valid
check_metabolite_compartment_formula(model)
```
### Model Statistics
```python
# Basic stats
print(f"Reactions: {len(model.reactions)}")
print(f"Metabolites: {len(model.metabolites)}")
print(f"Genes: {len(model.genes)}")
# Advanced stats
print(f"Exchanges: {len(model.exchanges)}")
print(f"Demands: {len(model.demands)}")
# Blocked reactions
from cobra.flux_analysis import find_blocked_reactions
blocked = find_blocked_reactions(model)
print(f"Blocked reactions: {len(blocked)}")
# Essential genes
from cobra.flux_analysis import find_essential_genes
essential = find_essential_genes(model)
print(f"Essential genes: {len(essential)}")
```
## Summary Methods
```python
# Model summary
model.summary() # Overall model info
# Metabolite summary
model.metabolites.atp_c.summary()
# Reaction summary
model.reactions.PFK.summary()
# Summary with FVA
model.summary(fva=0.95) # Include FVA at 95% optimality
```
## Common Patterns
### Batch Analysis Pattern
```python
results = []
for condition in conditions:
with model:
# Apply condition
setup_condition(model, condition)
# Analyze
solution = model.optimize()
# Store result
results.append({
"condition": condition,
"growth": solution.objective_value,
"status": solution.status
})
df = pd.DataFrame(results)
```
### Systematic Knockout Pattern
```python
import pandas as pd

knockout_results = []
for gene in model.genes:
with model:
gene.knock_out()
solution = model.optimize()
knockout_results.append({
"gene": gene.id,
"growth": solution.objective_value if solution.status == "optimal" else 0,
"status": solution.status
})
df = pd.DataFrame(knockout_results)
```
### Parameter Scan Pattern
```python
import numpy as np
import pandas as pd

parameter_values = np.linspace(0, 20, 21)
results = []
for value in parameter_values:
with model:
model.reactions.EX_glc__D_e.lower_bound = -value
solution = model.optimize()
results.append({
"glucose_uptake": value,
"growth": solution.objective_value,
"acetate_secretion": solution.fluxes["EX_ac_e"]
})
df = pd.DataFrame(results)
```
This quick reference covers the most commonly used COBRApy functions and patterns. For complete API documentation, see https://cobrapy.readthedocs.io/

View File

@@ -0,0 +1,593 @@
# COBRApy Comprehensive Workflows
This document provides detailed step-by-step workflows for common COBRApy tasks in metabolic modeling.
## Workflow 1: Complete Knockout Study with Visualization
This workflow demonstrates how to perform a comprehensive gene knockout study and visualize the results.
```python
import pandas as pd
import matplotlib.pyplot as plt
from cobra.io import load_model
from cobra.flux_analysis import single_gene_deletion, double_gene_deletion
# Step 1: Load model
model = load_model("ecoli")
print(f"Loaded model: {model.id}")
print(f"Model contains {len(model.reactions)} reactions, {len(model.metabolites)} metabolites, {len(model.genes)} genes")
# Step 2: Get baseline growth rate
baseline = model.slim_optimize()
print(f"Baseline growth rate: {baseline:.3f} /h")
# Step 3: Perform single gene deletions
print("Performing single gene deletions...")
single_results = single_gene_deletion(model)
# Step 4: Classify genes by impact
essential_genes = single_results[single_results["growth"] < 0.01]
severely_impaired = single_results[(single_results["growth"] >= 0.01) &
(single_results["growth"] < 0.5 * baseline)]
moderately_impaired = single_results[(single_results["growth"] >= 0.5 * baseline) &
(single_results["growth"] < 0.9 * baseline)]
neutral_genes = single_results[single_results["growth"] >= 0.9 * baseline]
print(f"\nSingle Deletion Results:")
print(f" Essential genes: {len(essential_genes)}")
print(f" Severely impaired: {len(severely_impaired)}")
print(f" Moderately impaired: {len(moderately_impaired)}")
print(f" Neutral genes: {len(neutral_genes)}")
# Step 5: Visualize distribution
fig, ax = plt.subplots(figsize=(10, 6))
single_results["growth"].hist(bins=50, ax=ax)
ax.axvline(baseline, color='r', linestyle='--', label='Baseline')
ax.set_xlabel("Growth rate (/h)")
ax.set_ylabel("Number of genes")
ax.set_title("Distribution of Growth Rates After Single Gene Deletions")
ax.legend()
plt.tight_layout()
plt.savefig("single_deletion_distribution.png", dpi=300)
# Step 6: Identify gene pairs for double deletions
# Focus on non-essential genes to find synthetic lethals
target_genes = single_results[single_results["growth"] >= 0.5 * baseline].index.tolist()
target_genes = [list(gene)[0] for gene in target_genes[:50]] # Limit for performance
print(f"\nPerforming double deletions on {len(target_genes)} genes...")
double_results = double_gene_deletion(
model,
gene_list1=target_genes,
processes=4
)
# Step 7: Find synthetic lethal pairs
synthetic_lethals = double_results[
(double_results["growth"] < 0.01) &
(single_results.loc[double_results.index.get_level_values(0)]["growth"].values >= 0.5 * baseline) &
(single_results.loc[double_results.index.get_level_values(1)]["growth"].values >= 0.5 * baseline)
]
print(f"Found {len(synthetic_lethals)} synthetic lethal gene pairs")
print("\nTop 10 synthetic lethal pairs:")
print(synthetic_lethals.head(10))
# Step 8: Export results
single_results.to_csv("single_gene_deletions.csv")
double_results.to_csv("double_gene_deletions.csv")
synthetic_lethals.to_csv("synthetic_lethals.csv")
```
## Workflow 2: Media Design and Optimization
This workflow shows how to systematically design growth media and find minimal media compositions.
```python
from cobra.io import load_model
from cobra.medium import minimal_medium
import pandas as pd
# Step 1: Load model and check current medium
model = load_model("ecoli")
current_medium = model.medium
print("Current medium composition:")
for exchange, bound in current_medium.items():
metabolite_id = exchange.replace("EX_", "").replace("_e", "")
print(f" {metabolite_id}: {bound:.2f} mmol/gDW/h")
# Step 2: Get baseline growth
baseline_growth = model.slim_optimize()
print(f"\nBaseline growth rate: {baseline_growth:.3f} /h")
# Step 3: Calculate minimal medium for different growth targets
growth_targets = [0.25, 0.5, 0.75, 1.0]
minimal_media = {}
for fraction in growth_targets:
target_growth = baseline_growth * fraction
print(f"\nCalculating minimal medium for {fraction*100:.0f}% growth ({target_growth:.3f} /h)...")
min_medium = minimal_medium(
model,
target_growth,
minimize_components=True,
open_exchanges=True
)
minimal_media[fraction] = min_medium
print(f" Required components: {len(min_medium)}")
print(f" Components: {list(min_medium.index)}")
# Step 4: Compare media compositions
media_df = pd.DataFrame(minimal_media).fillna(0)
media_df.to_csv("minimal_media_comparison.csv")
# Step 5: Test aerobic vs anaerobic conditions
print("\n--- Aerobic vs Anaerobic Comparison ---")
# Aerobic
model_aerobic = model.copy()
aerobic_growth = model_aerobic.slim_optimize()
aerobic_medium = minimal_medium(model_aerobic, aerobic_growth * 0.9, minimize_components=True)
# Anaerobic
model_anaerobic = model.copy()
medium_anaerobic = model_anaerobic.medium
medium_anaerobic["EX_o2_e"] = 0.0
model_anaerobic.medium = medium_anaerobic
anaerobic_growth = model_anaerobic.slim_optimize()
anaerobic_medium = minimal_medium(model_anaerobic, anaerobic_growth * 0.9, minimize_components=True)
print(f"Aerobic growth: {aerobic_growth:.3f} /h (requires {len(aerobic_medium)} components)")
print(f"Anaerobic growth: {anaerobic_growth:.3f} /h (requires {len(anaerobic_medium)} components)")
# Step 6: Identify unique requirements
aerobic_only = set(aerobic_medium.index) - set(anaerobic_medium.index)
anaerobic_only = set(anaerobic_medium.index) - set(aerobic_medium.index)
shared = set(aerobic_medium.index) & set(anaerobic_medium.index)
print(f"\nShared components: {len(shared)}")
print(f"Aerobic-only: {aerobic_only}")
print(f"Anaerobic-only: {anaerobic_only}")
# Step 7: Test custom medium
print("\n--- Testing Custom Medium ---")
custom_medium = {
"EX_glc__D_e": 10.0, # Glucose
"EX_o2_e": 20.0, # Oxygen
"EX_nh4_e": 5.0, # Ammonium
"EX_pi_e": 5.0, # Phosphate
"EX_so4_e": 1.0, # Sulfate
}
with model:
model.medium = custom_medium
custom_growth = model.optimize().objective_value
print(f"Growth on custom medium: {custom_growth:.3f} /h")
# Check which nutrients are limiting
for exchange in custom_medium:
with model:
# Double the uptake rate
medium_test = model.medium
medium_test[exchange] *= 2
model.medium = medium_test
test_growth = model.optimize().objective_value
improvement = (test_growth - custom_growth) / custom_growth * 100
if improvement > 1:
print(f" {exchange}: +{improvement:.1f}% growth when doubled (LIMITING)")
```
## Workflow 3: Flux Space Exploration with Sampling
This workflow demonstrates comprehensive flux space analysis using FVA and sampling.
```python
from cobra.io import load_model
from cobra.flux_analysis import flux_variability_analysis
from cobra.sampling import sample
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Load model
model = load_model("ecoli")
baseline = model.slim_optimize()
print(f"Baseline growth: {baseline:.3f} /h")
# Step 2: Perform FVA at optimal growth
print("\nPerforming FVA at optimal growth...")
fva_optimal = flux_variability_analysis(model, fraction_of_optimum=1.0)
# Step 3: Identify reactions with flexibility
fva_optimal["range"] = fva_optimal["maximum"] - fva_optimal["minimum"]
fva_optimal["relative_range"] = fva_optimal["range"] / (fva_optimal["maximum"].abs() + 1e-9)
flexible_reactions = fva_optimal[fva_optimal["range"] > 1.0].sort_values("range", ascending=False)
print(f"\nFound {len(flexible_reactions)} reactions with >1.0 mmol/gDW/h flexibility")
print("\nTop 10 most flexible reactions:")
print(flexible_reactions.head(10)[["minimum", "maximum", "range"]])
# Step 4: Perform FVA at suboptimal growth (90%)
print("\nPerforming FVA at 90% optimal growth...")
fva_suboptimal = flux_variability_analysis(model, fraction_of_optimum=0.9)
fva_suboptimal["range"] = fva_suboptimal["maximum"] - fva_suboptimal["minimum"]
# Step 5: Compare flexibility at different optimality levels
comparison = pd.DataFrame({
"range_100": fva_optimal["range"],
"range_90": fva_suboptimal["range"]
})
comparison["range_increase"] = comparison["range_90"] - comparison["range_100"]
print("\nReactions with largest increase in flexibility at suboptimality:")
print(comparison.sort_values("range_increase", ascending=False).head(10))
# Step 6: Perform flux sampling
print("\nPerforming flux sampling (1000 samples)...")
samples = sample(model, n=1000, method="optgp", processes=4)
# Step 7: Analyze sampling results for key reactions
key_reactions = ["PFK", "FBA", "TPI", "GAPD", "PGK", "PGM", "ENO", "PYK"]
available_key_reactions = [r for r in key_reactions if r in samples.columns]
if available_key_reactions:
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()
for idx, reaction_id in enumerate(available_key_reactions[:8]):
ax = axes[idx]
samples[reaction_id].hist(bins=30, ax=ax, alpha=0.7)
# Overlay FVA bounds
fva_min = fva_optimal.loc[reaction_id, "minimum"]
fva_max = fva_optimal.loc[reaction_id, "maximum"]
ax.axvline(fva_min, color='r', linestyle='--', label='FVA min')
ax.axvline(fva_max, color='r', linestyle='--', label='FVA max')
ax.set_xlabel("Flux (mmol/gDW/h)")
ax.set_ylabel("Frequency")
ax.set_title(reaction_id)
if idx == 0:
ax.legend()
plt.tight_layout()
plt.savefig("flux_distributions.png", dpi=300)
# Step 8: Calculate correlation between reactions
print("\nCalculating flux correlations...")
correlation_matrix = samples[available_key_reactions].corr()
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm",
center=0, ax=ax, square=True)
ax.set_title("Flux Correlations Between Key Glycolysis Reactions")
plt.tight_layout()
plt.savefig("flux_correlations.png", dpi=300)
# Step 9: Identify reaction modules (highly correlated groups)
print("\nHighly correlated reaction pairs (|r| > 0.9):")
for i in range(len(correlation_matrix)):
for j in range(i+1, len(correlation_matrix)):
corr = correlation_matrix.iloc[i, j]
if abs(corr) > 0.9:
print(f" {correlation_matrix.index[i]} <-> {correlation_matrix.columns[j]}: {corr:.3f}")
# Step 10: Export all results
fva_optimal.to_csv("fva_optimal.csv")
fva_suboptimal.to_csv("fva_suboptimal.csv")
samples.to_csv("flux_samples.csv")
correlation_matrix.to_csv("flux_correlations.csv")
```
## Workflow 4: Production Strain Design
This workflow demonstrates how to design a production strain for a target metabolite.
```python
from cobra.io import load_model
from cobra.flux_analysis import (
production_envelope,
flux_variability_analysis,
single_gene_deletion
)
import pandas as pd
import matplotlib.pyplot as plt
# Step 1: Define production target
TARGET_METABOLITE = "EX_ac_e" # Acetate production
CARBON_SOURCE = "EX_glc__D_e" # Glucose uptake
# Step 2: Load model
model = load_model("ecoli")
print(f"Designing strain for {TARGET_METABOLITE} production")
# Step 3: Calculate baseline production envelope
print("\nCalculating production envelope...")
envelope = production_envelope(
model,
reactions=[CARBON_SOURCE, TARGET_METABOLITE],
carbon_sources=CARBON_SOURCE
)
# Visualize production envelope
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(envelope[CARBON_SOURCE], envelope["mass_yield_maximum"], 'b-', label='Max yield')
ax.plot(envelope[CARBON_SOURCE], envelope["mass_yield_minimum"], 'r-', label='Min yield')
ax.set_xlabel(f"Glucose uptake (mmol/gDW/h)")
ax.set_ylabel(f"Acetate yield")
ax.set_title("Wild-type Production Envelope")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("production_envelope_wildtype.png", dpi=300)
# Step 4: Maximize production while maintaining growth
print("\nOptimizing for production...")
# Set a minimum growth constraint so production is not optimized at zero growth
MIN_GROWTH = 0.1  # Require at least 0.1 /h growth in the production strain
with model:
    # Constrain growth, then switch the objective to product formation
    model.reactions.BIOMASS_Ecoli_core_w_GAM.lower_bound = MIN_GROWTH
    model.objective = TARGET_METABOLITE
    model.objective_direction = "max"
    production_solution = model.optimize()
    max_production = production_solution.objective_value
print(f"Maximum production: {max_production:.3f} mmol/gDW/h")
print(f"Growth rate: {production_solution.fluxes['BIOMASS_Ecoli_core_w_GAM']:.3f} /h")
# Step 5: Identify beneficial gene knockouts
print("\nScreening for beneficial knockouts...")
# Apply the production configuration directly to the model for the screen
# (these changes persist because they are made outside a `with` block)
model.reactions.BIOMASS_Ecoli_core_w_GAM.lower_bound = MIN_GROWTH
model.objective = TARGET_METABOLITE
model.objective_direction = "max"
knockout_results = []
for gene in model.genes:
with model:
gene.knock_out()
try:
solution = model.optimize()
if solution.status == "optimal":
production = solution.objective_value
growth = solution.fluxes["BIOMASS_Ecoli_core_w_GAM"]
if production > max_production * 1.05: # >5% improvement
knockout_results.append({
"gene": gene.id,
"production": production,
"growth": growth,
"improvement": (production / max_production - 1) * 100
})
        except Exception:
            continue
knockout_df = pd.DataFrame(knockout_results)
if len(knockout_df) > 0:
knockout_df = knockout_df.sort_values("improvement", ascending=False)
print(f"\nFound {len(knockout_df)} beneficial knockouts:")
print(knockout_df.head(10))
knockout_df.to_csv("beneficial_knockouts.csv", index=False)
else:
print("No beneficial single knockouts found")
# Step 6: Test combination of best knockouts
if len(knockout_df) > 0:
print("\nTesting knockout combinations...")
top_genes = knockout_df.head(3)["gene"].tolist()
with model:
for gene_id in top_genes:
model.genes.get_by_id(gene_id).knock_out()
solution = model.optimize()
if solution.status == "optimal":
combined_production = solution.objective_value
combined_growth = solution.fluxes["BIOMASS_Ecoli_core_w_GAM"]
combined_improvement = (combined_production / max_production - 1) * 100
print(f"\nCombined knockout results:")
print(f" Genes: {', '.join(top_genes)}")
print(f" Production: {combined_production:.3f} mmol/gDW/h")
print(f" Growth: {combined_growth:.3f} /h")
print(f" Improvement: {combined_improvement:.1f}%")
# Step 7: Analyze flux distribution in production strain
if len(knockout_df) > 0:
best_gene = knockout_df.iloc[0]["gene"]
with model:
model.genes.get_by_id(best_gene).knock_out()
solution = model.optimize()
# Get active pathways
active_fluxes = solution.fluxes[solution.fluxes.abs() > 0.1]
active_fluxes.to_csv(f"production_strain_fluxes_{best_gene}_knockout.csv")
print(f"\nActive reactions in production strain: {len(active_fluxes)}")
```
## Workflow 5: Model Validation and Debugging
This workflow shows systematic approaches to validate and debug metabolic models.
```python
from cobra.io import load_model, read_sbml_model
from cobra.flux_analysis import flux_variability_analysis
import pandas as pd
# Step 1: Load model
model = load_model("ecoli") # Or read_sbml_model("your_model.xml")
print(f"Model: {model.id}")
print(f"Reactions: {len(model.reactions)}")
print(f"Metabolites: {len(model.metabolites)}")
print(f"Genes: {len(model.genes)}")
# Step 2: Check model feasibility
print("\n--- Feasibility Check ---")
try:
    # error_value=None makes slim_optimize raise when the model is infeasible
    # (the default returns NaN, which a try/except would not catch)
    objective_value = model.slim_optimize(error_value=None)
    print(f"Model is feasible (objective: {objective_value:.3f})")
except Exception:
print("Model is INFEASIBLE")
print("Troubleshooting steps:")
# Check for blocked reactions
from cobra.flux_analysis import find_blocked_reactions
blocked = find_blocked_reactions(model)
print(f" Blocked reactions: {len(blocked)}")
if len(blocked) > 0:
print(f" First 10 blocked: {list(blocked)[:10]}")
# Check medium
print(f"\n Current medium: {model.medium}")
# Try opening all exchanges
for reaction in model.exchanges:
reaction.lower_bound = -1000
try:
        objective_value = model.slim_optimize(error_value=None)
print(f"\n Model feasible with open exchanges (objective: {objective_value:.3f})")
print(" Issue: Medium constraints too restrictive")
    except Exception:
print("\n Model still infeasible with open exchanges")
print(" Issue: Structural problem (missing reactions, mass imbalance, etc.)")
# Step 3: Check mass and charge balance
print("\n--- Mass and Charge Balance Check ---")
unbalanced_reactions = []
for reaction in model.reactions:
try:
balance = reaction.check_mass_balance()
if balance:
unbalanced_reactions.append({
"reaction": reaction.id,
"imbalance": balance
})
    except Exception:
        pass
if unbalanced_reactions:
print(f"Found {len(unbalanced_reactions)} unbalanced reactions:")
for item in unbalanced_reactions[:10]:
print(f" {item['reaction']}: {item['imbalance']}")
else:
print("All reactions are mass balanced")
# Step 4: Identify dead-end metabolites
print("\n--- Dead-end Metabolite Check ---")
dead_end_metabolites = []
for metabolite in model.metabolites:
producing_reactions = [r for r in metabolite.reactions
if r.metabolites[metabolite] > 0]
consuming_reactions = [r for r in metabolite.reactions
if r.metabolites[metabolite] < 0]
if len(producing_reactions) == 0 or len(consuming_reactions) == 0:
dead_end_metabolites.append({
"metabolite": metabolite.id,
"producers": len(producing_reactions),
"consumers": len(consuming_reactions)
})
if dead_end_metabolites:
print(f"Found {len(dead_end_metabolites)} dead-end metabolites:")
for item in dead_end_metabolites[:10]:
print(f" {item['metabolite']}: {item['producers']} producers, {item['consumers']} consumers")
else:
print("No dead-end metabolites found")
# Step 5: Check for duplicate reactions
print("\n--- Duplicate Reaction Check ---")
reaction_equations = {}
duplicates = []
for reaction in model.reactions:
equation = reaction.build_reaction_string()
if equation in reaction_equations:
duplicates.append({
"reaction1": reaction_equations[equation],
"reaction2": reaction.id,
"equation": equation
})
else:
reaction_equations[equation] = reaction.id
if duplicates:
print(f"Found {len(duplicates)} duplicate reaction pairs:")
for item in duplicates[:10]:
print(f" {item['reaction1']} == {item['reaction2']}")
else:
print("No duplicate reactions found")
# Step 6: Identify orphan genes
print("\n--- Orphan Gene Check ---")
orphan_genes = [gene for gene in model.genes if len(gene.reactions) == 0]
if orphan_genes:
print(f"Found {len(orphan_genes)} orphan genes (not associated with reactions):")
print(f" First 10: {[g.id for g in orphan_genes[:10]]}")
else:
print("No orphan genes found")
# Step 7: Check for thermodynamically infeasible loops
print("\n--- Thermodynamic Loop Check ---")
fva_loopless = flux_variability_analysis(model, loopless=True)
fva_standard = flux_variability_analysis(model)
loop_reactions = []
for reaction_id in fva_standard.index:
standard_range = fva_standard.loc[reaction_id, "maximum"] - fva_standard.loc[reaction_id, "minimum"]
loopless_range = fva_loopless.loc[reaction_id, "maximum"] - fva_loopless.loc[reaction_id, "minimum"]
if standard_range > loopless_range + 0.1:
loop_reactions.append({
"reaction": reaction_id,
"standard_range": standard_range,
"loopless_range": loopless_range
})
if loop_reactions:
print(f"Found {len(loop_reactions)} reactions potentially involved in loops:")
loop_df = pd.DataFrame(loop_reactions).sort_values("standard_range", ascending=False)
print(loop_df.head(10))
else:
print("No thermodynamically infeasible loops detected")
# Step 8: Generate validation report
print("\n--- Generating Validation Report ---")
validation_report = {
"model_id": model.id,
"feasible": objective_value if 'objective_value' in locals() else None,
"n_reactions": len(model.reactions),
"n_metabolites": len(model.metabolites),
"n_genes": len(model.genes),
"n_unbalanced": len(unbalanced_reactions),
"n_dead_ends": len(dead_end_metabolites),
"n_duplicates": len(duplicates),
"n_orphan_genes": len(orphan_genes),
"n_loop_reactions": len(loop_reactions)
}
validation_df = pd.DataFrame([validation_report])
validation_df.to_csv("model_validation_report.csv", index=False)
print("Validation report saved to model_validation_report.csv")
```
These workflows provide comprehensive templates for common COBRApy tasks. Adapt them as needed for specific research questions and models.

View File

@@ -0,0 +1,704 @@
---
name: datamol
description: Comprehensive toolkit for molecular cheminformatics using datamol, a Pythonic layer built on RDKit. Use this skill when working with molecular structures, SMILES strings, chemical reactions, molecular descriptors, conformer generation, molecular clustering, scaffold analysis, or any cheminformatics tasks. This skill should be applied when users need to process molecules, analyze chemical properties, visualize molecular structures, fragment compounds, or perform molecular similarity calculations.
---
# Datamol Cheminformatics Skill
## Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
**Key capabilities**:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
## Installation and Setup
Guide users to install datamol:
```bash
# Via conda/mamba (recommended)
conda install -c conda-forge datamol
# Via pip
pip install datamol
```
**Import convention**:
```python
import datamol as dm
```
## Core Workflows
### 1. Basic Molecule Handling
**Creating molecules from SMILES**:
```python
import datamol as dm
# Single molecule
mol = dm.to_mol("CCO") # Ethanol
# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
# Error handling
mol = dm.to_mol("invalid_smiles") # Returns None
if mol is None:
print("Failed to parse SMILES")
```
**Converting molecules to SMILES**:
```python
# Canonical SMILES
smiles = dm.to_smiles(mol)
# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)
# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```
**Standardization and sanitization** (always recommend for user-provided molecules):
```python
# Sanitize molecule
mol = dm.sanitize_mol(mol)
# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
mol,
disconnect_metals=True,
normalize=True,
reionize=True
)
# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```
### 2. Reading and Writing Molecular Files
Refer to `references/io_module.md` for comprehensive I/O documentation.
**Reading files**:
```python
# SDF files (most common in chemistry)
df = dm.read_sdf("compounds.sdf", mol_column='mol')
# SMILES files
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
# Universal reader (auto-detects format)
df = dm.open_df("file.sdf") # Works with .sdf, .csv, .xlsx, .parquet, .json
```
**Writing files**:
```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
# Save as SMILES file
dm.to_smi(mols, "output.smi")
# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
```
**Remote file support** (S3, GCS, HTTP):
```python
# Read from cloud storage
df = dm.read_sdf("s3://bucket/compounds.sdf")
df = dm.read_csv("https://example.com/data.csv")
# Write to cloud storage
dm.to_sdf(mols, "s3://bucket/output.sdf")
```
### 3. Molecular Descriptors and Properties
Refer to `references/descriptors_viz.md` for detailed descriptor documentation.
**Computing descriptors for a single molecule**:
```python
# Get standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
# 'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
```
**Batch descriptor computation** (recommended for datasets):
```python
# Compute for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
mols,
n_jobs=-1, # Use all CPU cores
progress=True # Show progress bar
)
```
**Specific descriptors**:
```python
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
```
**Drug-likeness filtering (Lipinski's Rule of Five)**:
```python
# Filter compounds
def is_druglike(mol):
desc = dm.descriptors.compute_many_descriptors(mol)
return (
desc['mw'] <= 500 and
desc['logp'] <= 5 and
desc['hbd'] <= 5 and
desc['hba'] <= 10
)
druglike_mols = [mol for mol in mols if is_druglike(mol)]
```
### 4. Molecular Fingerprints and Similarity
**Generating fingerprints**:
```python
# ECFP (Extended Connectivity Fingerprint, default)
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
# Other fingerprint types
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
```
**Similarity calculations**:
```python
# Pairwise distances within a set
distance_matrix = dm.pdist(mols, n_jobs=-1)
# Distances between two sets
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
# Find most similar molecules
from scipy.spatial.distance import squareform
dist_matrix = squareform(dm.pdist(mols))
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
```
### 5. Clustering and Diversity Selection
Refer to `references/core_api.md` for clustering details.
**Butina clustering**:
```python
# Cluster molecules by structural similarity
clusters = dm.cluster_mols(
mols,
cutoff=0.2, # Tanimoto distance threshold (0=identical, 1=completely different)
n_jobs=-1 # Parallel processing
)
# Each cluster is a list of molecule indices
for i, cluster in enumerate(clusters):
print(f"Cluster {i}: {len(cluster)} molecules")
cluster_mols = [mols[idx] for idx in cluster]
```
**Important**: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+.
**Diversity selection**:
```python
# Pick diverse subset
diverse_mols = dm.pick_diverse(
mols,
npick=100 # Select 100 diverse molecules
)
# Pick cluster centroids
centroids = dm.pick_centroids(
mols,
npick=50 # Select 50 representative molecules
)
```
### 6. Scaffold Analysis
Refer to `references/fragments_scaffolds.md` for complete scaffold documentation.
**Extracting Murcko scaffolds**:
```python
# Get Bemis-Murcko scaffold (core structure)
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
```
**Scaffold-based analysis**:
```python
# Group compounds by scaffold
from collections import Counter
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
# Create scaffold-to-molecules mapping
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
if scaf_smi not in scaffold_groups:
scaffold_groups[scaf_smi] = []
scaffold_groups[scaf_smi].append(mol)
```
**Scaffold-based train/test splitting** (for ML):
```python
# Ensure train and test sets have different scaffolds
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
if scaf not in scaffold_to_mols:
scaffold_to_mols[scaf] = []
scaffold_to_mols[scaf].append(mol)
# Split scaffolds into train/test
import random
scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]
# Get molecules for each split
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
```
### 7. Molecular Fragmentation
Refer to `references/fragments_scaffolds.md` for fragmentation details.
**BRICS fragmentation** (16 bond types):
```python
# Fragment molecule
fragments = dm.fragment.brics(mol)
# Returns: set of fragment SMILES with attachment points like '[1*]CCN'
```
**RECAP fragmentation** (11 bond types):
```python
fragments = dm.fragment.recap(mol)
```
**Fragment analysis**:
```python
# Find common fragments across compound library
from collections import Counter
all_fragments = []
for mol in mols:
frags = dm.fragment.brics(mol)
all_fragments.extend(frags)
fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)
# Fragment-based scoring
def fragment_score(mol, reference_fragments):
mol_frags = dm.fragment.brics(mol)
overlap = mol_frags.intersection(reference_fragments)
return len(overlap) / len(mol_frags) if mol_frags else 0
```
### 8. 3D Conformer Generation
Refer to `references/conformers_module.md` for detailed conformer documentation.
**Generating conformers**:
```python
# Generate 3D conformers
mol_3d = dm.conformers.generate(
mol,
n_confs=50, # Number to generate (auto if None)
rms_cutoff=0.5, # Filter similar conformers (Ångströms)
minimize_energy=True, # Minimize with UFF force field
method='ETKDGv3' # Embedding method (recommended)
)
# Access conformers
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0) # Get first conformer
positions = conf.GetPositions() # Nx3 array of atom coordinates
```
**Conformer clustering**:
```python
# Cluster conformers by RMSD
clusters = dm.conformers.cluster(
mol_3d,
rms_cutoff=1.0,
centroids=False
)
# Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
```
**SASA calculation**:
```python
# Calculate solvent accessible surface area
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
# Access SASA from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```
### 9. Visualization
Refer to `references/descriptors_viz.md` for visualization documentation.
**Basic molecule grid**:
```python
# Visualize molecules
dm.viz.to_image(
mols[:20],
legends=[dm.to_smiles(m) for m in mols[:20]],
n_cols=5,
mol_size=(300, 300)
)
# Save to file
dm.viz.to_image(mols, outfile="molecules.png")
# SVG for publications
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
```
**Aligned visualization** (for SAR analysis):
```python
# Align molecules by common substructure
dm.viz.to_image(
similar_mols,
align=True, # Enable MCS alignment
legends=activity_labels,
n_cols=4
)
```
**Highlighting substructures**:
```python
# Highlight specific atoms and bonds
dm.viz.to_image(
mol,
highlight_atom=[0, 1, 2, 3], # Atom indices
highlight_bond=[0, 1, 2] # Bond indices
)
```
**Conformer visualization**:
```python
# Display multiple conformers
dm.viz.conformers(
mol_3d,
n_confs=10,
align_conf=True,
n_cols=3
)
```
### 10. Chemical Reactions
Refer to `references/reactions_data.md` for reactions documentation.
**Applying reactions**:
```python
from rdkit.Chem import rdChemReactions
# Define reaction from SMARTS
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
# Apply to molecule
reactant = dm.to_mol("CC(=O)O") # Acetic acid
product = dm.reactions.apply_reaction(
rxn,
(reactant,),
sanitize=True
)
# Convert to SMILES
product_smiles = dm.to_smiles(product)
```
**Batch reaction application**:
```python
# Apply reaction to library
products = []
for mol in reactant_mols:
try:
prod = dm.reactions.apply_reaction(rxn, (mol,))
if prod is not None:
products.append(prod)
except Exception as e:
print(f"Reaction failed: {e}")
```
## Parallelization
Datamol includes built-in parallelization for many operations. Use `n_jobs` parameter:
- `n_jobs=1`: Sequential (no parallelization)
- `n_jobs=-1`: Use all available CPU cores
- `n_jobs=4`: Use 4 cores
**Functions supporting parallelization**:
- `dm.read_sdf(..., n_jobs=-1)`
- `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)`
- `dm.cluster_mols(..., n_jobs=-1)`
- `dm.pdist(..., n_jobs=-1)`
- `dm.conformers.sasa(..., n_jobs=-1)`
**Progress bars**: Many batch operations support `progress=True` parameter.
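For instance, the batch descriptor helper shown earlier combines both options (a minimal sketch with a few inline molecules):
```python
import datamol as dm

mols = [dm.to_mol(smi) for smi in ["CCO", "c1ccccc1", "CC(=O)O"]]
# n_jobs=-1 spreads the work across all CPU cores; progress=True shows a progress bar
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1, progress=True)
print(desc_df.shape)
```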
## Common Workflows and Patterns
### Complete Pipeline: Data Loading → Filtering → Analysis
```python
import datamol as dm
import pandas as pd
# 1. Load molecules
df = dm.read_sdf("compounds.sdf")
# 2. Standardize
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m is not None else None)
df = df[df['mol'].notna()] # Remove failed molecules
# 3. Compute descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(
df['mol'].tolist(),
n_jobs=-1,
progress=True
)
# 4. Filter by drug-likeness
druglike = (
(desc_df['mw'] <= 500) &
(desc_df['logp'] <= 5) &
(desc_df['hbd'] <= 5) &
(desc_df['hba'] <= 10)
)
filtered_df = df[druglike]
# 5. Cluster and select diverse subset
diverse_mols = dm.pick_diverse(
filtered_df['mol'].tolist(),
npick=100
)
# 6. Visualize results
dm.viz.to_image(
diverse_mols,
legends=[dm.to_smiles(m) for m in diverse_mols],
outfile="diverse_compounds.png",
n_cols=10
)
```
### Structure-Activity Relationship (SAR) Analysis
```python
# Group by scaffold
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
# Create DataFrame with activities
sar_df = pd.DataFrame({
'mol': mols,
'scaffold': scaffold_smiles,
'activity': activities # User-provided activity data
})
# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
if len(group) >= 3: # Need multiple examples
print(f"\nScaffold: {scaffold}")
print(f"Count: {len(group)}")
print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
# Visualize with activities as legends
dm.viz.to_image(
group['mol'].tolist(),
legends=[f"Activity: {act:.2f}" for act in group['activity']],
align=True # Align by common substructure
)
```
### Virtual Screening Pipeline
```python
# 1. (Optional) Generate fingerprints for query and library.
#    dm.cdist below fingerprints the molecules internally, so this step is
#    only needed if the fingerprints will be reused elsewhere.
query_fps = [dm.to_fp(mol) for mol in query_actives]
library_fps = [dm.to_fp(mol) for mol in library_mols]
# 2. Calculate Tanimoto distances between the two sets
import numpy as np
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
# 3. Find closest matches (min distance to any query)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances # Convert distance to similarity
# 4. Rank and select top hits
top_indices = np.argsort(similarities)[::-1][:100] # Top 100
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]
# 5. Visualize hits
dm.viz.to_image(
top_hits[:20],
legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
outfile="screening_hits.png"
)
```
## Reference Documentation
For detailed API documentation, consult these reference files:
- **`references/core_api.md`**: Core namespace functions (conversions, standardization, fingerprints, clustering)
- **`references/io_module.md`**: File I/O operations (read/write SDF, CSV, Excel, remote files)
- **`references/conformers_module.md`**: 3D conformer generation, clustering, SASA calculations
- **`references/descriptors_viz.md`**: Molecular descriptors and visualization functions
- **`references/fragments_scaffolds.md`**: Scaffold extraction, BRICS/RECAP fragmentation
- **`references/reactions_data.md`**: Chemical reactions and toy datasets
## Best Practices
1. **Always standardize molecules** from external sources:
```python
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
```
2. **Check for None values** after molecule parsing:
```python
mol = dm.to_mol(smiles)
if mol is None:
# Handle invalid SMILES
```
3. **Use parallel processing** for large datasets:
```python
result = dm.operation(..., n_jobs=-1, progress=True)
```
4. **Leverage fsspec** for cloud storage:
```python
df = dm.read_sdf("s3://bucket/compounds.sdf")
```
5. **Use appropriate fingerprints** for similarity:
- ECFP (Morgan): General purpose, structural similarity
- MACCS: Fast, smaller feature space
- Atom pairs: Considers atom pairs and distances
6. **Consider scale limitations**:
- Butina clustering: ~1,000 molecules (full distance matrix)
- For larger datasets: Use diversity selection or hierarchical methods
7. **Scaffold splitting for ML**: Ensure proper train/test separation by scaffold
8. **Align molecules** when visualizing SAR series
## Error Handling
```python
# Safe molecule creation
def safe_to_mol(smiles):
try:
mol = dm.to_mol(smiles)
if mol is not None:
mol = dm.standardize_mol(mol)
return mol
except Exception as e:
print(f"Failed to process {smiles}: {e}")
return None
# Safe batch processing
valid_mols = []
for smiles in smiles_list:
mol = safe_to_mol(smiles)
if mol is not None:
valid_mols.append(mol)
```
## Integration with Machine Learning
```python
# Feature generation
X = np.array([dm.to_fp(mol) for mol in mols])
# Or descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values
# Train model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y_target)
# Predict
predictions = model.predict(X_test)
```
## Troubleshooting
**Issue**: Molecule parsing fails
- **Solution**: Use `dm.standardize_smiles()` first or try `dm.fix_mol()` (see the sketch after this list)
**Issue**: Memory errors with clustering
- **Solution**: Use `dm.pick_diverse()` instead of full clustering for large sets
**Issue**: Slow conformer generation
- **Solution**: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers
**Issue**: Remote file access fails
- **Solution**: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)
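A minimal sketch of the parsing fallback chain for the first issue above (the helper name is illustrative):
```python
import datamol as dm

def robust_to_mol(smiles):
    """Parse a SMILES string, falling back to standardization and repair."""
    mol = dm.to_mol(smiles)
    if mol is None:
        try:
            # Standardize the raw SMILES string, then try parsing again
            mol = dm.to_mol(dm.standardize_smiles(smiles))
        except Exception:
            mol = None
    if mol is not None:
        # Attempt to repair remaining structural issues
        mol = dm.fix_mol(mol)
    return mol
```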
## Additional Resources
- **Datamol Documentation**: https://docs.datamol.io/
- **RDKit Documentation**: https://www.rdkit.org/docs/
- **GitHub Repository**: https://github.com/datamol-io/datamol

View File

@@ -0,0 +1,131 @@
# Datamol Conformers Module Reference
The `datamol.conformers` module provides tools for generating and analyzing 3D molecular conformations.
## Conformer Generation
### `dm.conformers.generate(mol, n_confs=None, rms_cutoff=None, minimize_energy=True, method='ETKDGv3', add_hs=True, ...)`
Generate 3D molecular conformers.
- **Parameters**:
- `mol`: Input molecule
- `n_confs`: Number of conformers to generate (auto-determined based on rotatable bonds if None)
- `rms_cutoff`: RMS threshold in Ångströms for filtering similar conformers (removes duplicates)
- `minimize_energy`: Apply UFF energy minimization (default: True)
- `method`: Embedding method - options:
- `'ETDG'` - Experimental Torsion Distance Geometry
- `'ETKDG'` - ETDG with additional basic knowledge
- `'ETKDGv2'` - Enhanced version 2
- `'ETKDGv3'` - Enhanced version 3 (default, recommended)
- `add_hs`: Add hydrogens before embedding (default: True, critical for quality)
- `random_seed`: Set for reproducibility
- **Returns**: Molecule with embedded conformers
- **Example**:
```python
mol = dm.to_mol("CCO")
mol_3d = dm.conformers.generate(mol, n_confs=10, rms_cutoff=0.5)
conformers = mol_3d.GetConformers() # Access all conformers
```
## Conformer Clustering
### `dm.conformers.cluster(mol, rms_cutoff=1.0, already_aligned=False, centroids=False)`
Group conformers by RMS distance.
- **Parameters**:
- `rms_cutoff`: Clustering threshold in Ångströms (default: 1.0)
- `already_aligned`: Whether conformers are pre-aligned
- `centroids`: Return centroid conformers (True) or cluster groups (False)
- **Returns**: Cluster information or centroid conformers
- **Use case**: Identify distinct conformational families
### `dm.conformers.return_centroids(mol, conf_clusters, centroids=True)`
Extract representative conformers from clusters.
- **Parameters**:
- `conf_clusters`: Sequence of cluster indices from `cluster()`
- `centroids`: Return single molecule (True) or list of molecules (False)
- **Returns**: Centroid conformer(s)
## Conformer Analysis
### `dm.conformers.rmsd(mol)`
Calculate pairwise RMSD matrix across all conformers.
- **Requirements**: Minimum 2 conformers
- **Returns**: NxN matrix of RMSD values
- **Use case**: Quantify conformer diversity
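- **Example** (a brief sketch, assuming `mol_3d` already carries conformers):
```python
rms_matrix = dm.conformers.rmsd(mol_3d)  # NxN matrix of pairwise RMSD values
print(rms_matrix.mean())                 # rough overall measure of conformer diversity
```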
### `dm.conformers.sasa(mol, n_jobs=1, ...)`
Calculate Solvent Accessible Surface Area (SASA) using FreeSASA.
- **Parameters**:
- `n_jobs`: Parallelization for multiple conformers
- **Returns**: Array of SASA values (one per conformer)
- **Storage**: Values stored in each conformer as property `'rdkit_free_sasa'`
- **Example**:
```python
sasa_values = dm.conformers.sasa(mol_3d)
# Or access from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```
## Low-Level Conformer Manipulation
### `dm.conformers.center_of_mass(mol, conf_id=-1, use_atoms=True, round_coord=None)`
Calculate molecular center.
- **Parameters**:
- `conf_id`: Conformer index (-1 for first conformer)
- `use_atoms`: Use atomic masses (True) or geometric center (False)
- `round_coord`: Decimal precision for rounding
- **Returns**: 3D coordinates of center
- **Use case**: Centering molecules for visualization or alignment
### `dm.conformers.get_coords(mol, conf_id=-1)`
Retrieve atomic coordinates from a conformer.
- **Returns**: Nx3 numpy array of atomic positions
- **Example**:
```python
positions = dm.conformers.get_coords(mol_3d, conf_id=0)
# positions.shape: (num_atoms, 3)
```
### `dm.conformers.translate(mol, conf_id=-1, transform_matrix=None)`
Reposition conformer using transformation matrix.
- **Modification**: Operates in-place
- **Use case**: Aligning or repositioning molecules
## Workflow Example
```python
import datamol as dm
# 1. Create molecule and generate conformers
mol = dm.to_mol("CC(C)CCO") # Isopentanol
mol_3d = dm.conformers.generate(
mol,
n_confs=50, # Generate 50 initial conformers
rms_cutoff=0.5, # Filter similar conformers
minimize_energy=True # Minimize energy
)
# 2. Analyze conformers
n_conformers = mol_3d.GetNumConformers()
print(f"Generated {n_conformers} unique conformers")
# 3. Calculate SASA
sasa_values = dm.conformers.sasa(mol_3d)
# 4. Cluster conformers
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0, centroids=False)
# 5. Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
# 6. Access 3D coordinates
coords = dm.conformers.get_coords(mol_3d, conf_id=0)
```
## Key Concepts
- **Distance Geometry**: Method for generating 3D structures from connectivity information
- **ETKDG**: Uses experimental torsion angle preferences and additional chemical knowledge
- **RMS Cutoff**: Lower values = more unique conformers; higher values = fewer, more distinct conformers
- **Energy Minimization**: Relaxes structures to nearest local energy minimum
- **Hydrogens**: Critical for accurate 3D geometry - always include during embedding
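To illustrate the RMS cutoff trade-off noted above, a small sketch (exact conformer counts vary with the molecule and random seed):
```python
import datamol as dm

mol = dm.to_mol("CC(C)CCO")
tight = dm.conformers.generate(mol, n_confs=50, rms_cutoff=0.2)  # retains more, similar conformers
loose = dm.conformers.generate(mol, n_confs=50, rms_cutoff=1.5)  # retains fewer, more distinct conformers
print(tight.GetNumConformers(), loose.GetNumConformers())
```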

View File

@@ -0,0 +1,130 @@
# Datamol Core API Reference
This document covers the main functions available in the datamol namespace.
## Molecule Creation and Conversion
### `to_mol(mol, ...)`
Convert SMILES string or other molecular representations to RDKit molecule objects.
- **Parameters**: Accepts SMILES strings, InChI, or other molecular formats
- **Returns**: `rdkit.Chem.Mol` object
- **Common usage**: `mol = dm.to_mol("CCO")`
### `from_inchi(inchi)`
Convert InChI string to molecule object.
### `from_smarts(smarts)`
Convert SMARTS pattern to molecule object.
### `from_selfies(selfies)`
Convert SELFIES string to molecule object.
### `copy_mol(mol)`
Create a copy of a molecule object to avoid modifying the original.
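A brief combined sketch of these constructors (input strings are illustrative):
```python
import datamol as dm

mol = dm.to_mol("CCO")               # from SMILES
mol2 = dm.from_selfies("[C][C][O]")  # same molecule from SELFIES
query = dm.from_smarts("[OX2H]")     # SMARTS pattern as a query molecule
mol_copy = dm.copy_mol(mol)          # independent copy, safe to modify
```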
## Molecule Export
### `to_smiles(mol, ...)`
Convert molecule object to SMILES string.
- **Common parameters**: `canonical=True`, `isomeric=True`
### `to_inchi(mol, ...)`
Convert molecule to InChI string representation.
### `to_inchikey(mol)`
Convert molecule to InChI key (fixed-length hash).
### `to_smarts(mol)`
Convert molecule to SMARTS pattern.
### `to_selfies(mol)`
Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
## Sanitization and Standardization
### `sanitize_mol(mol, ...)`
Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
- **Purpose**: Fix common molecular structure issues
- **Returns**: Sanitized molecule or None if sanitization fails
### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`
Apply comprehensive standardization procedures including:
- Metal disconnection
- Normalization (charge corrections)
- Reionization
- Fragment handling (largest fragment selection)
### `standardize_smiles(smiles, ...)`
Apply SMILES standardization procedures directly to a SMILES string.
### `fix_mol(mol)`
Attempt to fix molecular structure issues automatically.
### `fix_valence(mol)`
Correct valence errors in molecular structures.
## Molecular Properties
### `reorder_atoms(mol, ...)`
Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
- **Purpose**: Maintain reproducible feature generation
### `remove_hs(mol, ...)`
Remove hydrogen atoms from molecular structure.
### `add_hs(mol, ...)`
Add explicit hydrogen atoms to molecular structure.
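A short usage sketch of these helpers:
```python
mol = dm.to_mol("CCO")
mol_h = dm.add_hs(mol)           # add explicit hydrogens
mol_noh = dm.remove_hs(mol_h)    # strip them again
mol_ord = dm.reorder_atoms(mol)  # canonical, reproducible atom ordering
```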
## Fingerprints and Similarity
### `to_fp(mol, fp_type='ecfp', ...)`
Generate molecular fingerprints for similarity calculations.
- **Fingerprint types**:
- `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
- `'fcfp'` - Functional Connectivity Fingerprints
- `'maccs'` - MACCS keys
- `'topological'` - Topological fingerprints
- `'atompair'` - Atom pair fingerprints
- **Common parameters**: `n_bits`, `radius`
- **Returns**: Numpy array or RDKit fingerprint object
### `pdist(mols, ...)`
Calculate pairwise Tanimoto distances between all molecules in a list.
- **Supports**: Parallel processing via `n_jobs` parameter
- **Returns**: Distance matrix
### `cdist(mols1, mols2, ...)`
Calculate Tanimoto distances between two sets of molecules.
## Clustering and Diversity
### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
Cluster molecules using Butina clustering algorithm.
- **Parameters**:
- `cutoff`: Distance threshold (default 0.2)
- `feature_fn`: Custom function for molecular features
- `n_jobs`: Parallelization (-1 for all cores)
- **Important**: Builds full distance matrix - suitable for ~1000 structures, not for 10,000+
- **Returns**: List of clusters (each cluster is a list of molecule indices)
### `pick_diverse(mols, npick, ...)`
Select diverse subset of molecules based on fingerprint diversity.
### `pick_centroids(mols, npick, ...)`
Select centroid molecules representing clusters.
## Graph Operations
### `to_graph(mol)`
Convert molecule to graph representation for graph-based analysis.
### `get_all_path_between(mol, start, end)`
Find all paths between two atoms in molecular structure.
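A minimal sketch (assuming the returned graph exposes a networkx-style interface):
```python
mol = dm.to_mol("c1ccccc1O")  # phenol
g = dm.to_graph(mol)
print(g.number_of_nodes(), g.number_of_edges())  # 7 heavy atoms, 7 bonds
```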
## DataFrame Integration
### `to_df(mols, smiles_column='smiles', mol_column='mol')`
Convert list of molecules to pandas DataFrame.
### `from_df(df, smiles_column='smiles', mol_column='mol')`
Convert pandas DataFrame to list of molecules.
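A round-trip sketch using the default column names listed above (assuming `from_df` rebuilds molecules from the SMILES column when no mol column is present):
```python
mols = [dm.to_mol(smi) for smi in ["CCO", "c1ccccc1"]]
df = dm.to_df(mols)                                  # DataFrame with a 'smiles' column
mols_back = dm.from_df(df, smiles_column="smiles")   # back to rdkit.Chem.Mol objects
```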

View File

@@ -0,0 +1,195 @@
# Datamol Descriptors and Visualization Reference
## Descriptors Module (`datamol.descriptors`)
The descriptors module provides tools for computing molecular properties and descriptors.
### Specialized Descriptor Functions
#### `dm.descriptors.n_aromatic_atoms(mol)`
Calculate the number of aromatic atoms.
- **Returns**: Integer count
- **Use case**: Aromaticity analysis
#### `dm.descriptors.n_aromatic_atoms_proportion(mol)`
Calculate ratio of aromatic atoms to total heavy atoms.
- **Returns**: Float between 0 and 1
- **Use case**: Quantifying aromatic character
#### `dm.descriptors.n_charged_atoms(mol)`
Count atoms with nonzero formal charge.
- **Returns**: Integer count
- **Use case**: Charge distribution analysis
#### `dm.descriptors.n_rigid_bonds(mol)`
Count non-rotatable bonds (neither single bonds nor ring bonds).
- **Returns**: Integer count
- **Use case**: Molecular flexibility assessment
#### `dm.descriptors.n_stereo_centers(mol)`
Count stereogenic centers (chiral centers).
- **Returns**: Integer count
- **Use case**: Stereochemistry analysis
#### `dm.descriptors.n_stereo_centers_unspecified(mol)`
Count stereocenters lacking stereochemical specification.
- **Returns**: Integer count
- **Use case**: Identifying incomplete stereochemistry
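A combined usage sketch for benzoic acid (counts in the comments are the expected values):
```python
mol = dm.to_mol("c1ccccc1C(=O)O")  # benzoic acid
print(dm.descriptors.n_aromatic_atoms(mol))             # 6 aromatic carbons
print(dm.descriptors.n_aromatic_atoms_proportion(mol))  # 6 of 9 heavy atoms
print(dm.descriptors.n_charged_atoms(mol))              # 0
print(dm.descriptors.n_stereo_centers(mol))             # 0
```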
### Batch Descriptor Computation
#### `dm.descriptors.compute_many_descriptors(mol, properties_fn=None, add_properties=True)`
Compute multiple molecular properties for a single molecule.
- **Parameters**:
- `properties_fn`: Custom list of descriptor functions
- `add_properties`: Include additional computed properties
- **Returns**: Dictionary of descriptor name → value pairs
- **Default descriptors include**:
- Molecular weight, LogP, number of H-bond donors/acceptors
- Aromatic atoms, stereocenters, rotatable bonds
- TPSA (Topological Polar Surface Area)
- Ring count, heteroatom count
- **Example**:
```python
mol = dm.to_mol("CCO")
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1, ...}
```
#### `dm.descriptors.batch_compute_many_descriptors(mols, properties_fn=None, add_properties=True, n_jobs=1, batch_size=None, progress=False)`
Compute descriptors for multiple molecules in parallel.
- **Parameters**:
- `mols`: List of molecules
- `n_jobs`: Number of parallel jobs (-1 for all cores)
- `batch_size`: Chunk size for parallel processing
- `progress`: Show progress bar
- **Returns**: Pandas DataFrame with one row per molecule
- **Example**:
```python
mols = [dm.to_mol(smi) for smi in smiles_list]
df = dm.descriptors.batch_compute_many_descriptors(
mols,
n_jobs=-1,
progress=True
)
```
### RDKit Descriptor Access
#### `dm.descriptors.any_rdkit_descriptor(name)`
Retrieve any descriptor function from RDKit by name.
- **Parameters**: `name` - Descriptor function name (e.g., 'MolWt', 'TPSA')
- **Returns**: RDKit descriptor function
- **Available descriptors**: From `rdkit.Chem.Descriptors` and `rdkit.Chem.rdMolDescriptors`
- **Example**:
```python
tpsa_fn = dm.descriptors.any_rdkit_descriptor('TPSA')
tpsa_value = tpsa_fn(mol)
```
### Common Use Cases
**Drug-likeness Filtering (Lipinski's Rule of Five)**:
```python
descriptors = dm.descriptors.compute_many_descriptors(mol)
is_druglike = (
descriptors['mw'] <= 500 and
descriptors['logp'] <= 5 and
descriptors['hbd'] <= 5 and
descriptors['hba'] <= 10
)
```
**ADME Property Analysis**:
```python
df = dm.descriptors.batch_compute_many_descriptors(compound_library)
# Filter by TPSA for blood-brain barrier penetration
bbb_candidates = df[df['tpsa'] < 90]
```
---
## Visualization Module (`datamol.viz`)
The viz module provides tools for rendering molecules and conformers as images.
### Main Visualization Function
#### `dm.viz.to_image(mols, legends=None, n_cols=4, use_svg=False, mol_size=(200, 200), highlight_atom=None, highlight_bond=None, outfile=None, max_mols=None, copy=True, indices=False, ...)`
Generate image grid from molecules.
- **Parameters**:
- `mols`: Single molecule or list of molecules
- `legends`: String or list of strings as labels (one per molecule)
- `n_cols`: Number of molecules per row (default: 4)
- `use_svg`: Output SVG format (True) or PNG (False, default)
- `mol_size`: Tuple (width, height) or single int for square images
- `highlight_atom`: Atom indices to highlight (list or dict)
- `highlight_bond`: Bond indices to highlight (list or dict)
- `outfile`: Save path (local or remote, supports fsspec)
- `max_mols`: Maximum number of molecules to display
- `indices`: Draw atom indices on structures (default: False)
- `align`: Align molecules using MCS (Maximum Common Substructure)
- **Returns**: Image object (can be displayed in Jupyter) or saves to file
- **Example**:
```python
# Basic grid
dm.viz.to_image(mols[:10], legends=[dm.to_smiles(m) for m in mols[:10]])
# Save to file
dm.viz.to_image(mols, outfile="molecules.png", n_cols=5)
# Highlight substructure
dm.viz.to_image(mol, highlight_atom=[0, 1, 2], highlight_bond=[0, 1])
# Aligned visualization
dm.viz.to_image(mols, align=True, legends=activity_labels)
```
### Conformer Visualization
#### `dm.viz.conformers(mol, n_confs=None, align_conf=True, n_cols=3, sync_views=True, remove_hs=True, ...)`
Display multiple conformers in grid layout.
- **Parameters**:
- `mol`: Molecule with embedded conformers
- `n_confs`: Number or list of conformer indices to display (None = all)
- `align_conf`: Align conformers for comparison (default: True)
- `n_cols`: Grid columns (default: 3)
- `sync_views`: Synchronize 3D views when interactive (default: True)
- `remove_hs`: Remove hydrogens for clarity (default: True)
- **Returns**: Grid of conformer visualizations
- **Use case**: Comparing conformational diversity
- **Example**:
```python
mol_3d = dm.conformers.generate(mol, n_confs=20)
dm.viz.conformers(mol_3d, n_confs=10, align_conf=True)
```
### Circle Grid Visualization
#### `dm.viz.circle_grid(center_mol, circle_mols, mol_size=200, circle_margin=50, act_mapper=None, ...)`
Create concentric ring visualization with central molecule.
- **Parameters**:
- `center_mol`: Molecule at center
- `circle_mols`: List of molecule lists (one list per ring)
- `mol_size`: Image size per molecule
- `circle_margin`: Spacing between rings (default: 50)
- `act_mapper`: Activity mapping dictionary for color-coding
- **Returns**: Circular grid image
- **Use case**: Visualizing molecular neighborhoods, SAR analysis, similarity networks
- **Example**:
```python
# Show a reference molecule surrounded by similar compounds
dm.viz.circle_grid(
center_mol=reference,
circle_mols=[nearest_neighbors, second_tier]
)
```
### Visualization Best Practices
1. **Use legends for clarity**: Always label molecules with SMILES, IDs, or activity values
2. **Align related molecules**: Use `align=True` in `to_image()` for SAR analysis
3. **Adjust grid size**: Set `n_cols` based on molecule count and display width
4. **Use SVG for publications**: Set `use_svg=True` for scalable vector graphics
5. **Highlight substructures**: Use `highlight_atom` and `highlight_bond` to emphasize features
6. **Save large grids**: Use `outfile` parameter to save rather than display in memory

View File

@@ -0,0 +1,174 @@
# Datamol Fragments and Scaffolds Reference
## Scaffolds Module (`datamol.scaffold`)
Scaffolds represent the core structure of molecules, useful for identifying structural families and analyzing structure-activity relationships (SAR).
### Murcko Scaffolds
#### `dm.to_scaffold_murcko(mol)`
Extract Bemis-Murcko scaffold (molecular framework).
- **Method**: Removes side chains, retaining ring systems and linkers
- **Returns**: Molecule object representing the scaffold
- **Use case**: Identify core structures across compound series
- **Example**:
```python
mol = dm.to_mol("c1ccc(cc1)CCN") # Phenethylamine
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
# Returns: 'c1ccccc1CC' (benzene ring + ethyl linker)
```
**Workflow for scaffold analysis**:
```python
# Extract scaffolds from compound library
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
# Count scaffold frequency
from collections import Counter
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
```
### Fuzzy Scaffolds
#### `dm.scaffold.fuzzy_scaffolding(mol, ...)`
Generate fuzzy scaffolds with enforceable groups that must appear in the core.
- **Purpose**: More flexible scaffold definition allowing specified functional groups
- **Use case**: Custom scaffold definitions beyond Murcko rules
### Applications
**Scaffold-based splitting** (for ML model validation):
```python
# Group compounds by scaffold
scaffold_to_mols = {}
for mol, scaffold in zip(mols, scaffolds):
smi = dm.to_smiles(scaffold)
if smi not in scaffold_to_mols:
scaffold_to_mols[smi] = []
scaffold_to_mols[smi].append(mol)
# Ensure train/test sets have different scaffolds
```
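A minimal sketch of finishing the split: whole scaffold groups are assigned to train or test so that no scaffold is shared between the two sets (the 0.8 fraction is an arbitrary choice):
```python
# Greedily assign complete scaffold groups until ~80% of molecules are in train
n_total = sum(len(members) for members in scaffold_to_mols.values())
train_mols, test_mols = [], []
for members in sorted(scaffold_to_mols.values(), key=len, reverse=True):
    if len(train_mols) < 0.8 * n_total:
        train_mols.extend(members)
    else:
        test_mols.extend(members)
```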
**SAR analysis**:
```python
# Group by scaffold and analyze activity
for scaffold_smi, molecules in scaffold_to_mols.items():
activities = [get_activity(mol) for mol in molecules]
print(f"Scaffold: {scaffold_smi}, Mean activity: {np.mean(activities)}")
```
---
## Fragments Module (`datamol.fragment`)
Molecular fragmentation breaks molecules into smaller pieces based on chemical rules, useful for fragment-based drug design and substructure analysis.
### BRICS Fragmentation
#### `dm.fragment.brics(mol, ...)`
Fragment molecule using BRICS (Breaking Retrosynthetically Interesting Chemical Substructures).
- **Method**: Dissects based on 16 chemically meaningful bond types
- **Consideration**: Considers chemical environment and surrounding substructures
- **Returns**: Set of fragment SMILES strings
- **Use case**: Retrosynthetic analysis, fragment-based design
- **Example**:
```python
mol = dm.to_mol("c1ccccc1CCN")
fragments = dm.fragment.brics(mol)
# Returns fragments like: '[1*]CCN', '[1*]c1ccccc1', etc.
# [1*] represents attachment points
```
### RECAP Fragmentation
#### `dm.fragment.recap(mol, ...)`
Fragment molecule using RECAP (Retrosynthetic Combinatorial Analysis Procedure).
- **Method**: Dissects based on 11 predefined bond types
- **Rules**:
- Leaves alkyl groups smaller than 5 carbons intact
- Preserves cyclic bonds
- **Returns**: Set of fragment SMILES strings
- **Use case**: Combinatorial library design
- **Example**:
```python
mol = dm.to_mol("CCCCCc1ccccc1")
fragments = dm.fragment.recap(mol)
```
### MMPA Fragmentation
#### `dm.fragment.mmpa_frag(mol, ...)`
Fragment for Matched Molecular Pair Analysis.
- **Purpose**: Generate fragments suitable for identifying molecular pairs
- **Use case**: Analyzing how small structural changes affect properties
- **Example**:
```python
fragments = dm.fragment.mmpa_frag(mol)
# Used to find pairs of molecules differing by single transformation
```
### Comparison of Methods
| Method | Bond Types | Preserves Cycles | Best For |
|--------|-----------|------------------|----------|
| BRICS | 16 | Yes | Retrosynthetic analysis, fragment recombination |
| RECAP | 11 | Yes | Combinatorial library design |
| MMPA | Variable | Depends | Structure-activity relationship analysis |
### Fragmentation Workflow
```python
import datamol as dm
import re
# 1. Fragment a molecule
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
brics_frags = dm.fragment.brics(mol)
recap_frags = dm.fragment.recap(mol)
# 2. Analyze fragment frequency across library
all_fragments = []
for mol in molecule_library:
frags = dm.fragment.brics(mol)
all_fragments.extend(frags)
# 3. Identify common fragments
from collections import Counter
fragment_counts = Counter(all_fragments)
common_fragments = fragment_counts.most_common(20)
# 4. Convert fragments back to molecules (remove attachment points)
def clean_fragment(frag_smiles):
    # Replace [1*], [2*], ... attachment-point markers with explicit hydrogens
    clean = re.sub(r"\[\d+\*\]", "[H]", frag_smiles)
    return dm.to_mol(clean)
```
### Advanced: Fragment-Based Virtual Screening
```python
# Build fragment library from known actives
active_fragments = set()
for active_mol in active_compounds:
frags = dm.fragment.brics(active_mol)
active_fragments.update(frags)
# Screen compounds for presence of active fragments
def score_by_fragments(mol, fragment_set):
mol_frags = dm.fragment.brics(mol)
overlap = mol_frags.intersection(fragment_set)
return len(overlap) / len(mol_frags)
# Score screening library
scores = [score_by_fragments(mol, active_fragments) for mol in screening_lib]
```
### Key Concepts
- **Attachment Points**: Marked with [1*], [2*], etc. in fragment SMILES
- **Retrosynthetic**: Fragmentation mimics synthetic disconnections
- **Chemically Meaningful**: Breaks occur at typical synthetic bonds
- **Recombination**: Fragments can theoretically be recombined into valid molecules

View File

@@ -0,0 +1,109 @@
# Datamol I/O Module Reference
The `datamol.io` module provides comprehensive file handling for molecular data across multiple formats.
## Reading Molecular Files
### `dm.read_sdf(filename, sanitize=True, remove_hs=True, as_df=True, mol_column='mol', ...)`
Read Structure-Data File (SDF) format.
- **Parameters**:
- `filename`: Path to SDF file (supports local and remote paths via fsspec)
- `sanitize`: Apply sanitization to molecules
- `remove_hs`: Remove explicit hydrogens
- `as_df`: Return as DataFrame (True) or list of molecules (False)
- `mol_column`: Name of molecule column in DataFrame
- `n_jobs`: Enable parallel processing
- **Returns**: DataFrame or list of molecules
- **Example**: `df = dm.read_sdf("compounds.sdf")`
### `dm.read_smi(filename, smiles_column='smiles', mol_column='mol', as_df=True, ...)`
Read SMILES file (space-delimited by default).
- **Common format**: SMILES followed by molecule ID/name
- **Example**: `df = dm.read_smi("molecules.smi")`
### `dm.read_csv(filename, smiles_column='smiles', mol_column=None, ...)`
Read CSV file with optional automatic SMILES-to-molecule conversion.
- **Parameters**:
- `smiles_column`: Column containing SMILES strings
- `mol_column`: If specified, creates molecule objects from SMILES column
- **Example**: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`
### `dm.read_excel(filename, sheet_name=0, smiles_column='smiles', mol_column=None, ...)`
Read Excel files with molecule handling.
- **Parameters**:
- `sheet_name`: Sheet to read (index or name)
- Other parameters similar to `read_csv`
- **Example**: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`
### `dm.read_molblock(molblock, sanitize=True, remove_hs=True)`
Parse MOL block string (molecular structure text representation).
### `dm.read_mol2file(filename, sanitize=True, remove_hs=True, cleanupSubstructures=True)`
Read Mol2 format files.
### `dm.read_pdbfile(filename, sanitize=True, remove_hs=True, proximityBonding=True)`
Read Protein Data Bank (PDB) format files.
### `dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximityBonding=True)`
Parse PDB block string.
### `dm.open_df(filename, ...)`
Universal DataFrame reader - automatically detects format.
- **Supported formats**: CSV, Excel, Parquet, JSON, SDF
- **Example**: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`
## Writing Molecular Files
### `dm.to_sdf(mols, filename, mol_column=None, ...)`
Write molecules to SDF file.
- **Input types**:
- List of molecules
- DataFrame with molecule column
- Sequence of molecules
- **Parameters**:
- `mol_column`: Column name if input is DataFrame
- **Example**:
```python
dm.to_sdf(mols, "output.sdf")
# or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
```
### `dm.to_smi(mols, filename, mol_column=None, ...)`
Write molecules to SMILES file with optional validation.
- **Format**: SMILES strings with optional molecule names/IDs
### `dm.to_xlsx(df, filename, mol_columns=None, ...)`
Export DataFrame to Excel with rendered molecular images.
- **Parameters**:
- `mol_columns`: Columns containing molecules to render as images
- **Special feature**: Automatically renders molecules as images in Excel cells
- **Example**: `dm.to_xlsx(df, "molecules.xlsx", mol_columns=["mol"])`
### `dm.to_molblock(mol, ...)`
Convert molecule to MOL block string.
### `dm.to_pdbblock(mol, ...)`
Convert molecule to PDB block string.
### `dm.save_df(df, filename, ...)`
Save DataFrame in multiple formats (CSV, Excel, Parquet, JSON).
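A short sketch of the block and DataFrame writers, assuming `df` is a DataFrame produced by one of the readers above and that `save_df` infers the output format from the file extension:
```python
import datamol as dm

mol = dm.to_mol("CCO")
molblock = dm.to_molblock(mol)         # serialize one molecule to MOL block text
mol_back = dm.read_molblock(molblock)  # parse the block back into a molecule

dm.save_df(df, "compounds.parquet")    # df from dm.read_sdf / dm.read_csv above
```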
## Remote File Support
All I/O functions support remote file paths through fsspec integration:
- **Supported protocols**: S3 (AWS), GCS (Google Cloud), Azure, HTTP/HTTPS
- **Example**:
```python
dm.read_sdf("s3://bucket/compounds.sdf")
dm.read_csv("https://example.com/data.csv")
```
## Key Parameters Across Functions
- **`sanitize`**: Apply molecule sanitization (default: True)
- **`remove_hs`**: Remove explicit hydrogens (default: True)
- **`as_df`**: Return DataFrame vs list (default: True for most functions)
- **`n_jobs`**: Enable parallel processing (None = all cores, 1 = sequential)
- **`mol_column`**: Name of molecule column in DataFrames
- **`smiles_column`**: Name of SMILES column in DataFrames
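For instance, a minimal sketch combining these options when reading an SDF (the file name is a placeholder):
```python
import datamol as dm

# Keep explicit hydrogens, return a plain list of molecules, and parallelize parsing
mols = dm.read_sdf("compounds.sdf", sanitize=True, remove_hs=False,
                   as_df=False, n_jobs=None)  # n_jobs=None uses all cores
```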

View File

@@ -0,0 +1,218 @@
# Datamol Reactions and Data Modules Reference
## Reactions Module (`datamol.reactions`)
The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.
### Applying Chemical Reactions
#### `dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)`
Apply a chemical reaction to reactant molecules.
- **Parameters**:
- `rxn`: Reaction object (from SMARTS pattern)
- `reactants`: Tuple of reactant molecules
- `as_smiles`: Return SMILES strings (True) or molecule objects (False)
- `sanitize`: Sanitize product molecules
- `single_product_group`: Return single product (True) or all product groups (False)
- `rm_attach`: Remove attachment point markers
- `product_index`: Which product to return from reaction
- **Returns**: Product molecule(s) or SMILES
- **Example**:
```python
from rdkit import Chem
# Define reaction: alcohol + carboxylic acid → ester
rxn = Chem.rdChemReactions.ReactionFromSmarts(
'[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
)
# Apply to reactants
alcohol = dm.to_mol("CCO")
acid = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
```
### Creating Reactions
Reactions are typically created from SMARTS patterns using RDKit:
```python
from rdkit.Chem import rdChemReactions
# Reaction pattern: [reactant1].[reactant2]>>[product]
rxn = rdChemReactions.ReactionFromSmarts(
'[1*][*:1].[1*][*:2]>>[*:1][*:2]'
)
```
### Validation Functions
The module includes functions to:
- **Check if molecule is reactant**: Verify if molecule matches reactant pattern
- **Validate reaction**: Check if reaction is synthetically reasonable
- **Process reaction files**: Load reactions from files or databases
### Common Reaction Patterns
**Amide formation**:
```python
# Amine + carboxylic acid → amide
amide_rxn = rdChemReactions.ReactionFromSmarts(
'[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
)
```
**Suzuki coupling**:
```python
# Aryl halide + boronic acid → biaryl
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
'[c:1][Br].[c:2][B]([OH])[OH]>>[c:1][c:2]'
)
```
**Functional group transformations**:
```python
# Alcohol → ester
esterification = rdChemReactions.ReactionFromSmarts(
'[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
)
```
### Workflow Example
```python
import datamol as dm
from rdkit.Chem import rdChemReactions
# 1. Define reaction
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]' # Acid → acid chloride
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
# 2. Apply to molecule library
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
acid_chlorides = []
for acid in acids:
try:
product = dm.reactions.apply_reaction(
rxn,
(acid,), # Single reactant as tuple
sanitize=True
)
acid_chlorides.append(product)
except Exception as e:
print(f"Reaction failed: {e}")
# 3. Validate products
valid_products = [p for p in acid_chlorides if p is not None]
```
### Key Concepts
- **SMARTS**: SMiles ARbitrary Target Specification - pattern language for reactions
- **Atom Mapping**: Numbers like [C:1] preserve atom identity through reaction
- **Attachment Points**: [1*] represents generic connection points
- **Reaction Validation**: Not all SMARTS reactions are chemically reasonable
---
## Data Module (`datamol.data`)
The data module provides convenient access to curated molecular datasets for testing and learning.
### Available Datasets
#### `dm.data.cdk2(as_df=True, mol_column='mol')`
RDKit CDK2 dataset - kinase inhibitor data.
- **Parameters**:
- `as_df`: Return as DataFrame (True) or list of molecules (False)
- `mol_column`: Name for molecule column
- **Returns**: Dataset with molecular structures and activity data
- **Use case**: Small dataset for algorithm testing
- **Example**:
```python
cdk2_df = dm.data.cdk2(as_df=True)
print(cdk2_df.shape)
print(cdk2_df.columns)
```
#### `dm.data.freesolv()`
FreeSolv dataset - experimental and calculated hydration free energies.
- **Contents**: 642 molecules with:
- IUPAC names
- SMILES strings
- Experimental hydration free energy values
- Calculated values
- **Warning**: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
- **Not suitable for**: Benchmarking or production model training
- **Example**:
```python
freesolv_df = dm.data.freesolv()
# Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
```
#### `dm.data.solubility(as_df=True, mol_column='mol')`
RDKit solubility dataset with train/test splits.
- **Contents**: Aqueous solubility data with pre-defined splits
- **Columns**: Includes 'split' column with 'train' or 'test' values
- **Use case**: Testing ML workflows with proper train/test separation
- **Example**:
```python
sol_df = dm.data.solubility(as_df=True)
# Split into train/test
train_df = sol_df[sol_df['split'] == 'train']
test_df = sol_df[sol_df['split'] == 'test']
# Use for model development
X_train = [dm.to_fp(m) for m in train_df['mol']]
y_train = train_df['solubility']
```
### Usage Guidelines
**For testing and tutorials**:
```python
# Quick dataset for testing code
df = dm.data.cdk2()
mols = df['mol'].tolist()
# Test descriptor calculation
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)
# Test clustering
clusters = dm.cluster_mols(mols, cutoff=0.3)
```
**For learning workflows**:
```python
# Complete ML pipeline example
sol_df = dm.data.solubility()
# Preprocessing
train = sol_df[sol_df['split'] == 'train']
test = sol_df[sol_df['split'] == 'test']
# Featurization
X_train = [dm.to_fp(m) for m in train['mol']]
X_test = [dm.to_fp(m) for m in test['mol']]
# Model training (example)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, train['solubility'])
predictions = model.predict(X_test)
```
### Important Notes
- **Toy Datasets**: Designed for pedagogical purposes, not production use
- **Small Size**: Limited number of compounds suitable for quick tests
- **Pre-processed**: Data already cleaned and formatted
- **Citations**: Check dataset documentation for proper attribution if publishing
### Best Practices
1. **Use for development only**: Don't draw scientific conclusions from toy datasets
2. **Validate on real data**: Always test production code on actual project data
3. **Proper attribution**: Cite original data sources if using in publications
4. **Understand limitations**: Know the scope and quality of each dataset

View File

@@ -0,0 +1,591 @@
---
name: deepchem
description: Comprehensive toolkit for molecular machine learning, drug discovery, and materials science using DeepChem. Use this skill when working with molecular data (SMILES, SDF files), predicting molecular properties (solubility, toxicity, binding affinity), training graph neural networks on molecules, using MoleculeNet benchmarks, performing molecular featurization, or applying transfer learning with pretrained chemical models (ChemBERTa, GROVER). Also applicable for materials science (crystal structures, bandgap prediction) and protein/DNA sequence analysis.
---
# DeepChem
## Overview
DeepChem is a comprehensive Python library for applying machine learning to chemistry, materials science, and biology. It enables molecular property prediction, drug discovery, materials design, and biomolecule analysis through specialized neural networks, molecular featurization methods, and pretrained models.
## When to Use This Skill
Apply this skill when:
- Loading and processing molecular data (SMILES strings, SDF files, protein sequences)
- Predicting molecular properties (solubility, toxicity, binding affinity, ADMET properties)
- Training models on chemical/biological datasets
- Using MoleculeNet benchmark datasets (Tox21, BBBP, Delaney, etc.)
- Converting molecules to ML-ready features (fingerprints, graph representations, descriptors)
- Implementing graph neural networks for molecules (GCN, GAT, MPNN, AttentiveFP)
- Applying transfer learning with pretrained models (ChemBERTa, GROVER, MolFormer)
- Predicting crystal/materials properties (bandgap, formation energy)
- Analyzing protein or DNA sequences
## Core Capabilities
### 1. Molecular Data Loading and Processing
DeepChem provides specialized loaders for various chemical data formats:
```python
import deepchem as dc
# Load CSV with SMILES
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=['solubility', 'toxicity'],
feature_field='smiles',
featurizer=featurizer
)
dataset = loader.create_dataset('molecules.csv')
# Load SDF files
loader = dc.data.SDFLoader(tasks=['activity'], featurizer=featurizer)
dataset = loader.create_dataset('compounds.sdf')
# Load protein sequences
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
```
**Key Loaders**:
- `CSVLoader`: Tabular data with molecular identifiers
- `SDFLoader`: Molecular structure files
- `FASTALoader`: Protein/DNA sequences
- `ImageLoader`: Molecular images
- `JsonLoader`: JSON-formatted datasets
### 2. Molecular Featurization
Convert molecules into numerical representations for ML models.
#### Decision Tree for Featurizer Selection
```
Is the model a graph neural network?
├─ YES → Use graph featurizers
│ ├─ Standard GNN → MolGraphConvFeaturizer
│ ├─ Message passing → DMPNNFeaturizer
│ └─ Pretrained → GroverFeaturizer
└─ NO → What type of model?
├─ Traditional ML (RF, XGBoost, SVM)
│ ├─ Fast baseline → CircularFingerprint (ECFP)
│ ├─ Interpretable → RDKitDescriptors
│ └─ Maximum coverage → MordredDescriptors
├─ Deep learning (non-graph)
│ ├─ Dense networks → CircularFingerprint
│ └─ CNN → SmilesToImage
├─ Sequence models (LSTM, Transformer)
│ └─ SmilesToSeq
└─ 3D structure analysis
└─ CoulombMatrix
```
#### Example Featurization
```python
# Fingerprints (for traditional ML)
fp = dc.feat.CircularFingerprint(radius=2, size=2048)
# Descriptors (for interpretable models)
desc = dc.feat.RDKitDescriptors()
# Graph features (for GNNs)
graph_feat = dc.feat.MolGraphConvFeaturizer()
# Apply featurization
features = fp.featurize(['CCO', 'c1ccccc1'])
```
**Selection Guide**:
- **Small datasets (<1K)**: CircularFingerprint or RDKitDescriptors
- **Medium datasets (1K-100K)**: CircularFingerprint or graph featurizers
- **Large datasets (>100K)**: Graph featurizers (MolGraphConvFeaturizer, DMPNNFeaturizer)
- **Transfer learning**: Pretrained model featurizers (GroverFeaturizer)
See `references/api_reference.md` for complete featurizer documentation.
### 3. Data Splitting
**Critical**: For drug discovery tasks, use `ScaffoldSplitter` to prevent data leakage from similar molecular structures appearing in both training and test sets.
```python
# Scaffold splitting (recommended for molecules)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
# Random splitting (for non-molecular data)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
# Stratified splitting (for imbalanced classification)
splitter = dc.splits.RandomStratifiedSplitter()
train, test = splitter.train_test_split(dataset)
```
**Available Splitters**:
- `ScaffoldSplitter`: Split by molecular scaffolds (prevents leakage)
- `ButinaSplitter`: Clustering-based molecular splitting
- `MaxMinSplitter`: Maximize diversity between sets
- `RandomSplitter`: Random splitting
- `RandomStratifiedSplitter`: Preserves class distributions
### 4. Model Selection and Training
#### Quick Model Selection Guide
| Dataset Size | Task | Recommended Model | Featurizer |
|-------------|------|-------------------|------------|
| < 1K samples | Any | SklearnModel (RandomForest) | CircularFingerprint |
| 1K-100K | Classification/Regression | GBDTModel or MultitaskRegressor | CircularFingerprint |
| > 100K | Molecular properties | GCNModel, AttentiveFPModel, DMPNNModel | MolGraphConvFeaturizer |
| Any (small preferred) | Transfer learning | ChemBERTa, GROVER, MolFormer | Model-specific |
| Crystal structures | Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Protein sequences | Protein properties | ProtBERT | Sequence-based |
#### Example: Traditional ML
```python
from sklearn.ensemble import RandomForestRegressor
# Wrap scikit-learn model
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(model=sklearn_model)
model.fit(train)
```
#### Example: Deep Learning
```python
# Multitask regressor (for fingerprints)
model = dc.models.MultitaskRegressor(
n_tasks=2,
n_features=2048,
layer_sizes=[1000, 500],
dropouts=0.25,
learning_rate=0.001
)
model.fit(train, nb_epoch=50)
```
#### Example: Graph Neural Networks
```python
# Graph Convolutional Network
model = dc.models.GCNModel(
n_tasks=1,
mode='regression',
batch_size=128,
learning_rate=0.001
)
model.fit(train, nb_epoch=50)
# Graph Attention Network
model = dc.models.GATModel(n_tasks=1, mode='classification')
model.fit(train, nb_epoch=50)
# Attentive Fingerprint
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(train, nb_epoch=50)
```
### 5. MoleculeNet Benchmarks
Quick access to 30+ curated benchmark datasets with standardized train/valid/test splits:
```python
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
featurizer='GraphConv', # or 'ECFP', 'Weave', 'Raw'
splitter='scaffold', # or 'random', 'stratified'
reload=False
)
train, valid, test = datasets
# Train and evaluate
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
```
**Common Datasets**:
- **Classification**: `load_tox21()`, `load_bbbp()`, `load_hiv()`, `load_clintox()`
- **Regression**: `load_delaney()`, `load_freesolv()`, `load_lipo()`
- **Quantum properties**: `load_qm7()`, `load_qm8()`, `load_qm9()`
- **Materials**: `load_perovskite()`, `load_bandgap()`, `load_mp_formation_energy()`
See `references/api_reference.md` for complete dataset list.
### 6. Transfer Learning
Leverage pretrained models for improved performance, especially on small datasets:
```python
# ChemBERTa (BERT pretrained on 77M molecules)
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='classification',
n_tasks=1,
learning_rate=2e-5 # Lower LR for fine-tuning
)
model.fit(train, nb_epoch=10)
# GROVER (graph transformer pretrained on 10M molecules)
model = dc.models.GroverModel(
task='regression',
n_tasks=1
)
model.fit(train, nb_epoch=20)
```
**When to use transfer learning**:
- Small datasets (< 1000 samples)
- Novel molecular scaffolds
- Limited computational resources
- Need for rapid prototyping
Use the `scripts/transfer_learning.py` script for guided transfer learning workflows.
### 7. Model Evaluation
```python
# Define metrics
classification_metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
dc.metrics.Metric(dc.metrics.f1_score, name='F1')
]
regression_metrics = [
    dc.metrics.Metric(dc.metrics.r2_score, name='R2'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
dc.metrics.Metric(dc.metrics.root_mean_squared_error, name='RMSE')
]
# Evaluate
train_scores = model.evaluate(train, classification_metrics)
test_scores = model.evaluate(test, classification_metrics)
```
### 8. Making Predictions
```python
# Predict on test set
predictions = model.predict(test)
# Predict on new molecules
new_smiles = ['CCO', 'c1ccccc1', 'CC(C)O']
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)
# Apply same transformations as training
for transformer in transformers:
new_dataset = transformer.transform(new_dataset)
predictions = model.predict(new_dataset)
```
## Typical Workflows
### Workflow A: Quick Benchmark Evaluation
For evaluating a model on standard benchmarks:
```python
import deepchem as dc
# 1. Load benchmark
tasks, datasets, _ = dc.molnet.load_bbbp(
featurizer='GraphConv',
splitter='scaffold'
)
train, valid, test = datasets
# 2. Train model
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
# 3. Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```
### Workflow B: Custom Data Prediction
For training on custom molecular datasets:
```python
import deepchem as dc
# 1. Load and featurize data
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=featurizer
)
dataset = loader.create_dataset('my_molecules.csv')
# 2. Split data (use ScaffoldSplitter for molecules!)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
# 3. Normalize (optional but recommended)
transformers = [dc.trans.NormalizationTransformer(
transform_y=True, dataset=train
)]
for transformer in transformers:
train = transformer.transform(train)
valid = transformer.transform(valid)
test = transformer.transform(test)
# 4. Train model
model = dc.models.MultitaskRegressor(
n_tasks=1,
n_features=2048,
layer_sizes=[1000, 500],
dropouts=0.25
)
model.fit(train, nb_epoch=50)
# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
test_score = model.evaluate(test, [metric])
```
### Workflow C: Transfer Learning on Small Dataset
For leveraging pretrained models:
```python
import deepchem as dc
# 1. Load data (pretrained models often need raw SMILES)
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=dc.feat.DummyFeaturizer() # Model handles featurization
)
dataset = loader.create_dataset('small_dataset.csv')
# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# 3. Load pretrained model
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='classification',
n_tasks=1,
learning_rate=2e-5
)
# 4. Fine-tune
model.fit(train, nb_epoch=10)
# 5. Evaluate
predictions = model.predict(test)
```
See `references/workflows.md` for 8 detailed workflow examples covering molecular generation, materials science, protein analysis, and more.
## Example Scripts
This skill includes three production-ready scripts in the `scripts/` directory:
### 1. `predict_solubility.py`
Train and evaluate solubility prediction models. Works with Delaney benchmark or custom CSV data.
```bash
# Use Delaney benchmark
python scripts/predict_solubility.py
# Use custom data
python scripts/predict_solubility.py \
--data my_data.csv \
--smiles-col smiles \
--target-col solubility \
--predict "CCO" "c1ccccc1"
```
### 2. `graph_neural_network.py`
Train various graph neural network architectures on molecular data.
```bash
# Train GCN on Tox21
python scripts/graph_neural_network.py --model gcn --dataset tox21
# Train AttentiveFP on custom data
python scripts/graph_neural_network.py \
--model attentivefp \
--data molecules.csv \
--task-type regression \
--targets activity \
--epochs 100
```
### 3. `transfer_learning.py`
Fine-tune pretrained models (ChemBERTa, GROVER) on molecular property prediction tasks.
```bash
# Fine-tune ChemBERTa on BBBP
python scripts/transfer_learning.py --model chemberta --dataset bbbp
# Fine-tune GROVER on custom data
python scripts/transfer_learning.py \
--model grover \
--data small_dataset.csv \
--target activity \
--task-type classification \
--epochs 20
```
## Common Patterns and Best Practices
### Pattern 1: Always Use Scaffold Splitting for Molecules
```python
# GOOD: Prevents data leakage
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# BAD: Similar molecules in train and test
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
```
### Pattern 2: Normalize Features and Targets
```python
transformers = [
dc.trans.NormalizationTransformer(
transform_y=True, # Also normalize target values
dataset=train
)
]
for transformer in transformers:
train = transformer.transform(train)
test = transformer.transform(test)
```
### Pattern 3: Start Simple, Then Scale
1. Start with Random Forest + CircularFingerprint for a fast baseline (see the sketch after this list)
2. Try XGBoost/LightGBM if RF works well
3. Move to deep learning (MultitaskRegressor) if you have >5K samples
4. Try GNNs if you have >10K samples
5. Use transfer learning for small datasets or novel scaffolds
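A minimal sketch of that step 1 baseline (the file name and the 'activity'/'smiles' column names are placeholders):
```python
import deepchem as dc
from sklearn.ensemble import RandomForestRegressor

# ECFP fingerprints + Random Forest as a fast baseline
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('my_molecules.csv')

train, test = dc.splits.ScaffoldSplitter().train_test_split(dataset)

model = dc.models.SklearnModel(model=RandomForestRegressor(n_estimators=100))
model.fit(train)
print(model.evaluate(test, [dc.metrics.Metric(dc.metrics.r2_score)]))
```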
### Pattern 4: Handle Imbalanced Data
```python
# Option 1: Balancing transformer
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
# Option 2: Use balanced metrics
metric = dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
```
### Pattern 5: Avoid Memory Issues
```python
# Use DiskDataset for large datasets
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
# Use smaller batch sizes
model = dc.models.GCNModel(batch_size=32) # Instead of 128
```
## Common Pitfalls
### Issue 1: Data Leakage in Drug Discovery
**Problem**: Using random splitting allows similar molecules in train/test sets.
**Solution**: Always use `ScaffoldSplitter` for molecular datasets.
### Issue 2: GNN Underperforming vs Fingerprints
**Problem**: Graph neural networks perform worse than simple fingerprints.
**Solutions**:
- Ensure dataset is large enough (>10K samples typically)
- Increase training epochs (50-100)
- Try different architectures (AttentiveFP, DMPNN instead of GCN)
- Use pretrained models (GROVER)
### Issue 3: Overfitting on Small Datasets
**Problem**: Model memorizes training data.
**Solutions**:
- Use stronger regularization (increase dropout to 0.5)
- Use simpler models (Random Forest instead of deep learning)
- Apply transfer learning (ChemBERTa, GROVER)
- Collect more data
### Issue 4: Import Errors
**Problem**: Module not found errors.
**Solution**: Ensure DeepChem is installed with required dependencies:
```bash
pip install deepchem
# For PyTorch models
pip install deepchem[torch]
# For all features
pip install deepchem[all]
```
## Reference Documentation
This skill includes comprehensive reference documentation:
### `references/api_reference.md`
Complete API documentation including:
- All data loaders and their use cases
- Dataset classes and when to use each
- Complete featurizer catalog with selection guide
- Model catalog organized by category (50+ models)
- MoleculeNet dataset descriptions
- Metrics and evaluation functions
- Common code patterns
**When to reference**: Search this file when you need specific API details, parameter names, or want to explore available options.
### `references/workflows.md`
Eight detailed end-to-end workflows:
1. Molecular property prediction from SMILES
2. Using MoleculeNet benchmarks
3. Hyperparameter optimization
4. Transfer learning with pretrained models
5. Molecular generation with GANs
6. Materials property prediction
7. Protein sequence analysis
8. Custom model integration
**When to reference**: Use these workflows as templates for implementing complete solutions.
## Installation Notes
Basic installation:
```bash
pip install deepchem
```
For PyTorch models (GCN, GAT, etc.):
```bash
pip install deepchem[torch]
```
For all features:
```bash
pip install deepchem[all]
```
If import errors occur, the user may need specific dependencies. Check the DeepChem documentation for detailed installation instructions.
## Additional Resources
- Official documentation: https://deepchem.readthedocs.io/
- GitHub repository: https://github.com/deepchem/deepchem
- Tutorials: https://deepchem.readthedocs.io/en/latest/get_started/tutorials.html
- Paper: "MoleculeNet: A Benchmark for Molecular Machine Learning"

View File

@@ -0,0 +1,303 @@
# DeepChem API Reference
This document provides a comprehensive reference for DeepChem's core APIs, organized by functionality.
## Data Handling
### Data Loaders
#### File Format Loaders
- **CSVLoader**: Load tabular data from CSV files with customizable feature handling
- **UserCSVLoader**: User-defined CSV loading with flexible column specifications
- **SDFLoader**: Process molecular structure files (SDF format)
- **JsonLoader**: Import JSON-structured datasets
- **ImageLoader**: Load image data for computer vision tasks
#### Biological Data Loaders
- **FASTALoader**: Handle protein/DNA sequences in FASTA format
- **FASTQLoader**: Process FASTQ sequencing data with quality scores
- **SAMLoader/BAMLoader/CRAMLoader**: Support sequence alignment formats
#### Specialized Loaders
- **DFTYamlLoader**: Process density functional theory computational data
- **InMemoryLoader**: Load data directly from Python objects
### Dataset Classes
- **NumpyDataset**: Wrap NumPy arrays for in-memory data manipulation
- **DiskDataset**: Manage larger datasets stored on disk, reducing memory overhead (see the construction sketch after this list)
- **ImageDataset**: Specialized container for image-based ML tasks
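A brief construction sketch for the two most common classes (random arrays stand in for real features and labels):
```python
import numpy as np
import deepchem as dc

X = np.random.rand(100, 2048)   # e.g. fingerprint features
y = np.random.rand(100, 1)      # target values

in_memory = dc.data.NumpyDataset(X=X, y=y)          # held entirely in RAM
on_disk = dc.data.DiskDataset.from_numpy(X=X, y=y)  # sharded to disk for larger data
```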
### Data Splitters
#### General Splitters
- **RandomSplitter**: Random dataset partitioning
- **IndexSplitter**: Split by specified indices
- **SpecifiedSplitter**: Use pre-defined splits
- **RandomStratifiedSplitter**: Stratified random splitting
- **SingletaskStratifiedSplitter**: Stratified splitting for single tasks
- **TaskSplitter**: Split for multitask scenarios
#### Molecule-Specific Splitters
- **ScaffoldSplitter**: Divide molecules by structural scaffolds (prevents data leakage)
- **ButinaSplitter**: Clustering-based molecular splitting
- **FingerprintSplitter**: Split based on molecular fingerprint similarity
- **MaxMinSplitter**: Maximize diversity between training/test sets
- **MolecularWeightSplitter**: Split by molecular weight properties
**Best Practice**: For drug discovery tasks, use ScaffoldSplitter to prevent overfitting on similar molecular structures.
### Transformers
#### Normalization
- **NormalizationTransformer**: Standard normalization (mean=0, std=1)
- **MinMaxTransformer**: Scale features to [0,1] range
- **LogTransformer**: Apply log transformation
- **PowerTransformer**: Box-Cox and Yeo-Johnson transformations
- **CDFTransformer**: Cumulative distribution function normalization
#### Task-Specific
- **BalancingTransformer**: Address class imbalance
- **FeaturizationTransformer**: Apply dynamic feature engineering
- **CoulombFitTransformer**: Quantum chemistry specific
- **DAGTransformer**: Directed acyclic graph transformations
- **RxnSplitTransformer**: Chemical reaction preprocessing
## Molecular Featurizers
### Graph-Based Featurizers
Use these with graph neural networks (GCNs, MPNNs, etc.):
- **ConvMolFeaturizer**: Graph representations for graph convolutional networks
- **WeaveFeaturizer**: "Weave" graph embeddings
- **MolGraphConvFeaturizer**: Graph convolution-ready representations
- **EquivariantGraphFeaturizer**: Maintains geometric invariance
- **DMPNNFeaturizer**: Directed message-passing neural network inputs
- **GroverFeaturizer**: Pre-trained molecular embeddings
### Fingerprint-Based Featurizers
Use these with traditional ML (Random Forest, SVM, XGBoost):
- **MACCSKeysFingerprint**: 167-bit structural keys
- **CircularFingerprint**: Extended connectivity fingerprints (Morgan fingerprints)
- Parameters: `radius` (default 2), `size` (default 2048), `useChirality` (default False)
- **PubChemFingerprint**: 881-bit structural descriptors
- **Mol2VecFingerprint**: Learned molecular vector representations
### Descriptor Featurizers
Calculate molecular properties directly:
- **RDKitDescriptors**: ~200 molecular descriptors (MW, LogP, H-donors, H-acceptors, TPSA, etc.)
- **MordredDescriptors**: Comprehensive structural and physicochemical descriptors
- **CoulombMatrix**: Interatomic distance matrices for 3D structures
### Sequence-Based Featurizers
For recurrent networks and transformers:
- **SmilesToSeq**: Convert SMILES strings to sequences
- **SmilesToImage**: Generate 2D image representations from SMILES
- **RawFeaturizer**: Pass through raw molecular data unchanged
### Selection Guide
| Use Case | Recommended Featurizer | Model Type |
|----------|----------------------|------------|
| Graph neural networks | ConvMolFeaturizer, MolGraphConvFeaturizer | GCN, MPNN, GAT |
| Traditional ML | CircularFingerprint, RDKitDescriptors | Random Forest, XGBoost, SVM |
| Deep learning (non-graph) | CircularFingerprint, Mol2VecFingerprint | Dense networks, CNN |
| Sequence models | SmilesToSeq | LSTM, GRU, Transformer |
| 3D molecular structures | CoulombMatrix | Specialized 3D models |
| Quick baseline | RDKitDescriptors | Linear, Ridge, Lasso |
## Models
### Scikit-Learn Integration
- **SklearnModel**: Wrapper for any scikit-learn algorithm
- Usage: `SklearnModel(model=RandomForestRegressor())`
### Gradient Boosting
- **GBDTModel**: Gradient boosting decision trees (XGBoost, LightGBM)
### PyTorch Models
#### Molecular Property Prediction
- **MultitaskRegressor**: Multi-task regression with shared representations
- **MultitaskClassifier**: Multi-task classification
- **MultitaskFitTransformRegressor**: Regression with learned transformations
- **GCNModel**: Graph convolutional networks
- **GATModel**: Graph attention networks
- **AttentiveFPModel**: Attentive fingerprint networks
- **DMPNNModel**: Directed message passing neural networks
- **GroverModel**: GROVER pre-trained transformer
- **MATModel**: Molecule attention transformer
#### Materials Science
- **CGCNNModel**: Crystal graph convolutional networks
- **MEGNetModel**: Materials graph networks
- **LCNNModel**: Lattice CNN for materials
#### Generative Models
- **GANModel**: Generative adversarial networks
- **WGANModel**: Wasserstein GAN
- **BasicMolGANModel**: Molecular GAN
- **LSTMGenerator**: LSTM-based molecule generation
- **SeqToSeqModel**: Sequence-to-sequence models
#### Physics-Informed Models
- **PINNModel**: Physics-informed neural networks
- **HNNModel**: Hamiltonian neural networks
- **LNN**: Lagrangian neural networks
- **FNOModel**: Fourier neural operators
#### Computer Vision
- **CNN**: Convolutional neural networks
- **UNetModel**: U-Net architecture for segmentation
- **InceptionV3Model**: Pre-trained Inception v3
- **MobileNetV2Model**: Lightweight mobile networks
### Hugging Face Models
- **HuggingFaceModel**: General wrapper for HF transformers
- **Chemberta**: Chemical BERT for molecular property prediction
- **MoLFormer**: Molecular transformer architecture
- **ProtBERT**: Protein sequence BERT
- **DeepAbLLM**: Antibody large language models
### Model Selection Guide
| Task | Recommended Model | Featurizer |
|------|------------------|------------|
| Small dataset (<1000 samples) | SklearnModel (Random Forest) | CircularFingerprint |
| Medium dataset (1K-100K) | GBDTModel or MultitaskRegressor | CircularFingerprint or ConvMolFeaturizer |
| Large dataset (>100K) | GCNModel, AttentiveFPModel, or DMPNN | MolGraphConvFeaturizer |
| Transfer learning | GroverModel, Chemberta, MoLFormer | Model-specific |
| Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Molecule generation | BasicMolGANModel, LSTMGenerator | SmilesToSeq |
| Protein sequences | ProtBERT | Sequence-based |
## MoleculeNet Datasets
Quick access to 30+ benchmark datasets via `dc.molnet.load_*()` functions.
### Classification Datasets
- **load_bace()**: BACE-1 inhibitors (binary classification)
- **load_bbbp()**: Blood-brain barrier penetration
- **load_clintox()**: Clinical toxicity
- **load_hiv()**: HIV inhibition activity
- **load_muv()**: PubChem BioAssay (challenging, sparse)
- **load_pcba()**: PubChem screening data
- **load_sider()**: Adverse drug reactions (multi-label)
- **load_tox21()**: 12 toxicity assays (multi-task)
- **load_toxcast()**: EPA ToxCast screening
### Regression Datasets
- **load_delaney()**: Aqueous solubility (ESOL)
- **load_freesolv()**: Solvation free energy
- **load_lipo()**: Lipophilicity (octanol-water partition)
- **load_qm7/qm8/qm9()**: Quantum mechanical properties
- **load_hopv()**: Organic photovoltaic properties
### Protein-Ligand Binding
- **load_pdbbind()**: Binding affinity data
### Materials Science
- **load_perovskite()**: Perovskite stability
- **load_mp_formation_energy()**: Materials Project formation energy
- **load_mp_metallicity()**: Metal vs. non-metal classification
- **load_bandgap()**: Electronic bandgap prediction
### Chemical Reactions
- **load_uspto()**: USPTO reaction dataset
### Usage Pattern
```python
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='GraphConv',  # or 'ECFP', 'Weave', 'Raw', etc.
splitter='scaffold', # or 'random', 'stratified', etc.
    reload=False  # set True to cache featurized datasets and reuse them on later calls
)
train, valid, test = datasets
```
## Metrics
Common evaluation metrics available in `dc.metrics`:
### Classification Metrics
- **roc_auc_score**: Area under ROC curve (binary/multi-class)
- **prc_auc_score**: Area under precision-recall curve
- **accuracy_score**: Classification accuracy
- **balanced_accuracy_score**: Balanced accuracy for imbalanced datasets
- **recall_score**: Sensitivity/recall
- **precision_score**: Precision
- **f1_score**: F1 score
### Regression Metrics
- **mean_absolute_error**: MAE
- **mean_squared_error**: MSE
- **root_mean_squared_error**: RMSE
- **r2_score**: R² coefficient of determination
- **pearson_r2_score**: Pearson correlation
- **spearman_correlation**: Spearman rank correlation
### Multi-Task Metrics
Most metrics support multi-task evaluation by averaging over tasks.
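As a sketch, `Metric` accepts an optional task averager (here `np.mean`) as its second argument; `model` and `test` are assumed from the training pattern below:
```python
import numpy as np
import deepchem as dc

# Average the per-task ROC-AUC scores into a single number
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, name='mean-ROC-AUC')
scores = model.evaluate(test, [metric])
```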
## Training Pattern
Standard DeepChem workflow:
```python
# 1. Load data
loader = dc.data.CSVLoader(tasks=['task1'], feature_field='smiles',
featurizer=dc.feat.CircularFingerprint())
dataset = loader.create_dataset('data.csv')
# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
# 3. Transform data (optional)
transformers = [dc.trans.NormalizationTransformer(dataset=train)]
for transformer in transformers:
train = transformer.transform(train)
valid = transformer.transform(valid)
test = transformer.transform(test)
# 4. Create and train model
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048, layer_sizes=[1000])
model.fit(train, nb_epoch=50)
# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
```
## Common Patterns
### Pattern 1: Quick Baseline with MoleculeNet
```python
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train, valid, test = datasets
model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
model.fit(train)
```
### Pattern 2: Custom Data with Graph Networks
```python
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
featurizer=featurizer)
dataset = loader.create_dataset('my_data.csv')
train, test = dc.splits.RandomSplitter().train_test_split(dataset)
model = dc.models.GCNModel(mode='classification', n_tasks=1)
model.fit(train)
```
### Pattern 3: Transfer Learning with Pretrained Models
```python
model = dc.models.GroverModel(task='classification', n_tasks=1)
model.fit(train_dataset)
predictions = model.predict(test_dataset)
```

View File

@@ -0,0 +1,491 @@
# DeepChem Workflows
This document provides detailed workflows for common DeepChem use cases.
## Workflow 1: Molecular Property Prediction from SMILES
**Goal**: Predict molecular properties (e.g., solubility, toxicity, activity) from SMILES strings.
### Step-by-Step Process
#### 1. Prepare Your Data
Data should be in CSV format with at minimum:
- A column with SMILES strings
- One or more columns with property values (targets)
Example CSV structure:
```csv
smiles,solubility,toxicity
CCO,-0.77,0
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
```
#### 2. Choose Featurizer
Decision tree:
- **Small dataset (<1K)**: Use `CircularFingerprint` or `RDKitDescriptors`
- **Medium dataset (1K-100K)**: Use `CircularFingerprint` or `MolGraphConvFeaturizer`
- **Large dataset (>100K)**: Use graph-based featurizers (`MolGraphConvFeaturizer`, `DMPNNFeaturizer`)
- **Transfer learning**: Use pretrained model featurizers (`GroverFeaturizer`)
#### 3. Load and Featurize Data
```python
import deepchem as dc
# For fingerprint-based
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
# OR for graph-based
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(
tasks=['solubility', 'toxicity'], # column names to predict
feature_field='smiles', # column with SMILES
featurizer=featurizer
)
dataset = loader.create_dataset('data.csv')
```
#### 4. Split Data
**Critical**: Use `ScaffoldSplitter` for drug discovery to prevent data leakage.
```python
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
```
#### 5. Transform Data (Optional but Recommended)
```python
transformers = [
dc.trans.NormalizationTransformer(
transform_y=True,
dataset=train
)
]
for transformer in transformers:
train = transformer.transform(train)
valid = transformer.transform(valid)
test = transformer.transform(test)
```
#### 6. Select and Train Model
```python
# For fingerprints
model = dc.models.MultitaskRegressor(
n_tasks=2, # number of properties to predict
n_features=2048, # fingerprint size
layer_sizes=[1000, 500], # hidden layer sizes
dropouts=0.25,
learning_rate=0.001
)
# OR for graphs
model = dc.models.GCNModel(
n_tasks=2,
mode='regression',
batch_size=128,
learning_rate=0.001
)
# Train
model.fit(train, nb_epoch=50)
```
#### 7. Evaluate
```python
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
valid_score = model.evaluate(valid, [metric])
test_score = model.evaluate(test, [metric])
print(f"Train R²: {train_score}")
print(f"Valid R²: {valid_score}")
print(f"Test R²: {test_score}")
```
#### 8. Make Predictions
```python
# Predict on new molecules
new_smiles = ['CCO', 'CC(C)O', 'c1ccccc1']
new_featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
new_features = new_featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)
# Apply same transformations
for transformer in transformers:
new_dataset = transformer.transform(new_dataset)
predictions = model.predict(new_dataset)
```
---
## Workflow 2: Using MoleculeNet Benchmark Datasets
**Goal**: Quickly train and evaluate models on standard benchmarks.
### Quick Start
```python
import deepchem as dc
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
featurizer='GraphConv',
splitter='scaffold'
)
train, valid, test = datasets
# Train model
model = dc.models.GCNModel(
n_tasks=len(tasks),
mode='classification'
)
model.fit(train, nb_epoch=50)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```
### Available Featurizer Options
When calling `load_*()` functions:
- `'ECFP'`: Extended-connectivity fingerprints (circular fingerprints)
- `'GraphConv'`: Graph convolution features
- `'Weave'`: Weave features
- `'Raw'`: Raw SMILES strings
- `'smiles2img'`: 2D molecular images
### Available Splitter Options
- `'scaffold'`: Scaffold-based splitting (recommended for drug discovery)
- `'random'`: Random splitting
- `'stratified'`: Stratified splitting (preserves class distributions)
- `'butina'`: Butina clustering-based splitting
---
## Workflow 3: Hyperparameter Optimization
**Goal**: Find optimal model hyperparameters systematically.
### Using GridHyperparamOpt
```python
import deepchem as dc
import numpy as np
# Load data
tasks, datasets, transformers = dc.molnet.load_bbbp(
featurizer='ECFP',
splitter='scaffold'
)
train, valid, test = datasets
# Define parameter grid
params_dict = {
'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
'dropouts': [0.0, 0.25, 0.5],
'learning_rate': [0.001, 0.0001]
}
# Define model builder function
def model_builder(model_params, model_dir):
return dc.models.MultitaskClassifier(
n_tasks=len(tasks),
n_features=1024,
**model_params
)
# Setup optimizer
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
optimizer = dc.hyper.GridHyperparamOpt(model_builder)
# Run optimization
best_model, best_params, all_results = optimizer.hyperparam_search(
params_dict,
train,
valid,
metric,
transformers=transformers
)
print(f"Best parameters: {best_params}")
print(f"Best validation score: {all_results['best_validation_score']}")
```
---
## Workflow 4: Transfer Learning with Pretrained Models
**Goal**: Leverage pretrained models for improved performance on small datasets.
### Using ChemBERTa
```python
import deepchem as dc
from transformers import AutoTokenizer
# Load your data
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=dc.feat.DummyFeaturizer() # ChemBERTa handles featurization
)
dataset = loader.create_dataset('data.csv')
# Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# Load pretrained ChemBERTa
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='regression',
n_tasks=1
)
# Fine-tune
model.fit(train, nb_epoch=10)
# Evaluate
predictions = model.predict(test)
```
### Using GROVER
```python
# GROVER: pre-trained on molecular graphs
model = dc.models.GroverModel(
task='classification',
n_tasks=1,
model_dir='./grover_model'
)
# Fine-tune on your data
model.fit(train_dataset, nb_epoch=20)
```
---
## Workflow 5: Molecular Generation with GANs
**Goal**: Generate novel molecules with desired properties.
### Basic MolGAN
```python
import deepchem as dc
# Load training data (molecules for the generator to learn from)
tasks, datasets, _ = dc.molnet.load_qm9(
featurizer='GraphConv',
splitter='random'
)
train, _, _ = datasets
# Create and train MolGAN
gan = dc.models.BasicMolGANModel(
learning_rate=0.001,
vertices=9, # max atoms in molecule
edges=5, # max bonds
nodes=[128, 256, 512]
)
# Train
gan.fit_gan(
train,
nb_epoch=100,
generator_steps=0.2,
checkpoint_interval=10
)
# Generate new molecules
generated_molecules = gan.predict_gan_generator(1000)
```
### Conditional Generation
```python
# For property-targeted generation
from deepchem.models.optimizers import ExponentialDecay
gan = dc.models.BasicMolGANModel(
learning_rate=ExponentialDecay(0.001, 0.9, 1000),
conditional=True # enable conditional generation
)
# Train with properties
gan.fit_gan(train, nb_epoch=100)
# Generate molecules with target properties
target_properties = np.array([[5.0, 300.0]]) # e.g., [logP, MW]
molecules = gan.predict_gan_generator(
1000,
conditional_inputs=target_properties
)
```
---
## Workflow 6: Materials Property Prediction
**Goal**: Predict properties of crystalline materials.
### Using Crystal Graph Convolutional Networks
```python
import deepchem as dc
# Load materials data (structure files in CIF format)
loader = dc.data.CIFLoader()
dataset = loader.create_dataset('materials.csv')
# Split data
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
# Create CGCNN model
model = dc.models.CGCNNModel(
n_tasks=1,
mode='regression',
batch_size=32,
learning_rate=0.001
)
# Train
model.fit(train, nb_epoch=100)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.mae_score)
test_score = model.evaluate(test, [metric])
```
---
## Workflow 7: Protein Sequence Analysis
**Goal**: Predict protein properties from sequences.
### Using ProtBERT
```python
import deepchem as dc
# Load protein sequence data
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
# Use ProtBERT
model = dc.models.HuggingFaceModel(
model='Rostlab/prot_bert',
task='classification',
n_tasks=1
)
# Split and train
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
model.fit(train, nb_epoch=5)
# Predict
predictions = model.predict(test)
```
---
## Workflow 8: Custom Model Integration
**Goal**: Use your own PyTorch/scikit-learn models with DeepChem.
### Wrapping Scikit-Learn Models
```python
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc
# Create scikit-learn model
sklearn_model = RandomForestRegressor(
n_estimators=100,
max_depth=10,
random_state=42
)
# Wrap in DeepChem
model = dc.models.SklearnModel(model=sklearn_model)
# Use with DeepChem datasets
model.fit(train)
predictions = model.predict(test)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
score = model.evaluate(test, [metric])
```
### Creating Custom PyTorch Models
```python
import torch
import torch.nn as nn
import deepchem as dc
class CustomNetwork(nn.Module):
def __init__(self, n_features, n_tasks):
super().__init__()
self.fc1 = nn.Linear(n_features, 512)
self.fc2 = nn.Linear(512, 256)
self.fc3 = nn.Linear(256, n_tasks)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(0.2)
def forward(self, x):
x = self.relu(self.fc1(x))
x = self.dropout(x)
x = self.relu(self.fc2(x))
x = self.dropout(x)
return self.fc3(x)
# Wrap in DeepChem TorchModel
model = dc.models.TorchModel(
model=CustomNetwork(n_features=2048, n_tasks=1),
loss=nn.MSELoss(),
output_types=['prediction']
)
# Train
model.fit(train, nb_epoch=50)
```
---
## Common Pitfalls and Solutions
### Issue 1: Data Leakage in Drug Discovery
**Problem**: Using random splitting allows similar molecules in train and test sets.
**Solution**: Always use `ScaffoldSplitter` for molecular datasets.
### Issue 2: Imbalanced Classification
**Problem**: Poor performance on minority class.
**Solution**: Use `BalancingTransformer` or weighted metrics.
```python
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
```
### Issue 3: Memory Issues with Large Datasets
**Problem**: Dataset doesn't fit in memory.
**Solution**: Use `DiskDataset` instead of `NumpyDataset`.
```python
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
```
### Issue 4: Overfitting on Small Datasets
**Problem**: Model memorizes training data.
**Solutions**:
1. Use stronger regularization (increase dropout; see the sketch after this list)
2. Use simpler models (Random Forest, Ridge)
3. Apply transfer learning (pretrained models)
4. Collect more data
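A minimal sketch of solution 1, raising dropout on a fingerprint-based network (`train` is the training dataset from the workflow above):
```python
import deepchem as dc

# Stronger regularization for a small dataset
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[500],
    dropouts=0.5,                 # up from the usual 0.25
    weight_decay_penalty=0.001
)
model.fit(train, nb_epoch=30)
```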
### Issue 5: Poor Graph Neural Network Performance
**Problem**: GNN performs worse than fingerprints.
**Solutions**:
1. Check if dataset is large enough (GNNs need >10K samples typically)
2. Increase training epochs
3. Try different GNN architectures (AttentiveFP, DMPNN)
4. Use pretrained models (GROVER)

View File

@@ -0,0 +1,338 @@
#!/usr/bin/env python3
"""
Graph Neural Network Training Script
This script demonstrates training Graph Convolutional Networks (GCNs) and other
graph-based models for molecular property prediction.
Usage:
python graph_neural_network.py --dataset tox21 --model gcn
python graph_neural_network.py --dataset bbbp --model attentivefp
python graph_neural_network.py --data custom.csv --task-type regression
"""
import argparse
import deepchem as dc
import sys
AVAILABLE_MODELS = {
'gcn': 'Graph Convolutional Network',
'gat': 'Graph Attention Network',
'attentivefp': 'Attentive Fingerprint',
'mpnn': 'Message Passing Neural Network',
'dmpnn': 'Directed Message Passing Neural Network'
}
MOLNET_DATASETS = {
'tox21': ('classification', 12),
'bbbp': ('classification', 1),
'bace': ('classification', 1),
'hiv': ('classification', 1),
'delaney': ('regression', 1),
'freesolv': ('regression', 1),
'lipo': ('regression', 1)
}
def create_model(model_type, n_tasks, mode='classification'):
"""
Create a graph neural network model.
Args:
model_type: Type of model ('gcn', 'gat', 'attentivefp', etc.)
n_tasks: Number of prediction tasks
mode: 'classification' or 'regression'
Returns:
DeepChem model
"""
if model_type == 'gcn':
return dc.models.GCNModel(
n_tasks=n_tasks,
mode=mode,
batch_size=128,
learning_rate=0.001,
dropout=0.0
)
elif model_type == 'gat':
return dc.models.GATModel(
n_tasks=n_tasks,
mode=mode,
batch_size=128,
learning_rate=0.001
)
elif model_type == 'attentivefp':
return dc.models.AttentiveFPModel(
n_tasks=n_tasks,
mode=mode,
batch_size=128,
learning_rate=0.001
)
elif model_type == 'mpnn':
return dc.models.MPNNModel(
n_tasks=n_tasks,
mode=mode,
batch_size=128,
learning_rate=0.001
)
elif model_type == 'dmpnn':
return dc.models.DMPNNModel(
n_tasks=n_tasks,
mode=mode,
batch_size=128,
learning_rate=0.001
)
else:
raise ValueError(f"Unknown model type: {model_type}")
def train_on_molnet(dataset_name, model_type, n_epochs=50):
"""
Train a graph neural network on a MoleculeNet benchmark dataset.
Args:
dataset_name: Name of MoleculeNet dataset
model_type: Type of model to train
n_epochs: Number of training epochs
Returns:
Trained model and test scores
"""
print("=" * 70)
print(f"Training {AVAILABLE_MODELS[model_type]} on {dataset_name.upper()}")
print("=" * 70)
# Get dataset info
task_type, n_tasks_default = MOLNET_DATASETS[dataset_name]
# Load dataset with graph featurization
print(f"\nLoading {dataset_name} dataset with GraphConv featurizer...")
loader_name = 'bace_classification' if dataset_name == 'bace' else dataset_name  # dc.molnet exposes load_bace_classification, not load_bace
load_func = getattr(dc.molnet, f'load_{loader_name}')
tasks, datasets, transformers = load_func(
featurizer='GraphConv',
splitter='scaffold'
)
train, valid, test = datasets
n_tasks = len(tasks)
print(f"\nDataset Information:")
print(f" Task type: {task_type}")
print(f" Number of tasks: {n_tasks}")
print(f" Training samples: {len(train)}")
print(f" Validation samples: {len(valid)}")
print(f" Test samples: {len(test)}")
# Create model
print(f"\nCreating {AVAILABLE_MODELS[model_type]} model...")
model = create_model(model_type, n_tasks, mode=task_type)
# Train
print(f"\nTraining for {n_epochs} epochs...")
model.fit(train, nb_epoch=n_epochs)
print("Training complete!")
# Evaluate
print("\n" + "=" * 70)
print("Model Evaluation")
print("=" * 70)
if task_type == 'classification':
metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
dc.metrics.Metric(dc.metrics.f1_score, name='F1'),
]
else:
metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R2'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
dc.metrics.Metric(dc.metrics.root_mean_squared_error, name='RMSE'),
]
results = {}
for dataset_name_eval, dataset in [('Train', train), ('Valid', valid), ('Test', test)]:
print(f"\n{dataset_name_eval} Set:")
scores = model.evaluate(dataset, metrics)
results[dataset_name_eval] = scores
for metric_name, score in scores.items():
print(f" {metric_name}: {score:.4f}")
return model, results
def train_on_custom_data(data_path, model_type, task_type, target_cols, smiles_col='smiles', n_epochs=50):
"""
Train a graph neural network on custom CSV data.
Args:
data_path: Path to CSV file
model_type: Type of model to train
task_type: 'classification' or 'regression'
target_cols: List of target column names
smiles_col: Name of SMILES column
n_epochs: Number of training epochs
Returns:
Trained model and test dataset
"""
print("=" * 70)
print(f"Training {AVAILABLE_MODELS[model_type]} on Custom Data")
print("=" * 70)
# Load and featurize data
print(f"\nLoading data from {data_path}...")
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(
tasks=target_cols,
feature_field=smiles_col,
featurizer=featurizer
)
dataset = loader.create_dataset(data_path)
print(f"Loaded {len(dataset)} molecules")
# Split data
print("\nSplitting data with scaffold splitter...")
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
print(f" Training: {len(train)}")
print(f" Validation: {len(valid)}")
print(f" Test: {len(test)}")
# Create model
print(f"\nCreating {AVAILABLE_MODELS[model_type]} model...")
n_tasks = len(target_cols)
model = create_model(model_type, n_tasks, mode=task_type)
# Train
print(f"\nTraining for {n_epochs} epochs...")
model.fit(train, nb_epoch=n_epochs)
print("Training complete!")
# Evaluate
print("\n" + "=" * 70)
print("Model Evaluation")
print("=" * 70)
if task_type == 'classification':
metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
]
else:
metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R2'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
]
for dataset_name, dataset in [('Train', train), ('Valid', valid), ('Test', test)]:
print(f"\n{dataset_name} Set:")
scores = model.evaluate(dataset, metrics)
for metric_name, score in scores.items():
print(f" {metric_name}: {score:.4f}")
return model, test
def main():
parser = argparse.ArgumentParser(
description='Train graph neural networks for molecular property prediction'
)
parser.add_argument(
'--model',
type=str,
choices=list(AVAILABLE_MODELS.keys()),
default='gcn',
help='Type of graph neural network model'
)
parser.add_argument(
'--dataset',
type=str,
choices=list(MOLNET_DATASETS.keys()),
default=None,
help='MoleculeNet dataset to use'
)
parser.add_argument(
'--data',
type=str,
default=None,
help='Path to custom CSV file'
)
parser.add_argument(
'--task-type',
type=str,
choices=['classification', 'regression'],
default='classification',
help='Type of prediction task (for custom data)'
)
parser.add_argument(
'--targets',
nargs='+',
default=['target'],
help='Names of target columns (for custom data)'
)
parser.add_argument(
'--smiles-col',
type=str,
default='smiles',
help='Name of SMILES column'
)
parser.add_argument(
'--epochs',
type=int,
default=50,
help='Number of training epochs'
)
args = parser.parse_args()
# Validate arguments
if args.dataset is None and args.data is None:
print("Error: Must specify either --dataset (MoleculeNet) or --data (custom CSV)",
file=sys.stderr)
return 1
if args.dataset and args.data:
print("Error: Cannot specify both --dataset and --data",
file=sys.stderr)
return 1
# Train model
try:
if args.dataset:
model, results = train_on_molnet(
args.dataset,
args.model,
n_epochs=args.epochs
)
else:
model, test_set = train_on_custom_data(
args.data,
args.model,
args.task_type,
args.targets,
smiles_col=args.smiles_col,
n_epochs=args.epochs
)
print("\n" + "=" * 70)
print("Training Complete!")
print("=" * 70)
return 0
except Exception as e:
print(f"\nError: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
return 1
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,224 @@
#!/usr/bin/env python3
"""
Molecular Solubility Prediction Script
This script trains a model to predict aqueous solubility from SMILES strings
using the Delaney (ESOL) dataset as an example. Can be adapted for custom datasets.
Usage:
python predict_solubility.py --data custom_data.csv --smiles-col smiles --target-col solubility
python predict_solubility.py # Uses Delaney dataset by default
"""
import argparse
import deepchem as dc
import numpy as np
import sys
def train_solubility_model(data_path=None, smiles_col='smiles', target_col='measured log solubility in mols per litre'):
"""
Train a solubility prediction model.
Args:
data_path: Path to CSV file with SMILES and solubility data. If None, uses Delaney dataset.
smiles_col: Name of column containing SMILES strings
target_col: Name of column containing solubility values
Returns:
Trained model, test dataset, and transformers
"""
print("=" * 60)
print("DeepChem Solubility Prediction")
print("=" * 60)
# Load data
if data_path is None:
print("\nUsing Delaney (ESOL) benchmark dataset...")
tasks, datasets, transformers = dc.molnet.load_delaney(
featurizer='ECFP',
splitter='scaffold'
)
train, valid, test = datasets
else:
print(f"\nLoading custom data from {data_path}...")
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=[target_col],
feature_field=smiles_col,
featurizer=featurizer
)
dataset = loader.create_dataset(data_path)
# Split data
print("Splitting data with scaffold splitter...")
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
# Normalize data
print("Normalizing features and targets...")
transformers = [
dc.trans.NormalizationTransformer(
transform_y=True,
dataset=train
)
]
for transformer in transformers:
train = transformer.transform(train)
valid = transformer.transform(valid)
test = transformer.transform(test)
tasks = [target_col]
print(f"\nDataset sizes:")
print(f" Training: {len(train)} molecules")
print(f" Validation: {len(valid)} molecules")
print(f" Test: {len(test)} molecules")
# Create model
print("\nCreating multitask regressor...")
model = dc.models.MultitaskRegressor(
n_tasks=len(tasks),
n_features=2048, # ECFP fingerprint size
layer_sizes=[1000, 500],
dropouts=0.25,
learning_rate=0.001,
batch_size=50
)
# Train model
print("\nTraining model...")
model.fit(train, nb_epoch=50)
print("Training complete!")
# Evaluate model
print("\n" + "=" * 60)
print("Model Evaluation")
print("=" * 60)
metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R2'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
dc.metrics.Metric(dc.metrics.root_mean_squared_error, name='RMSE'),
]
for dataset_name, dataset in [('Train', train), ('Valid', valid), ('Test', test)]:
print(f"\n{dataset_name} Set:")
scores = model.evaluate(dataset, metrics)
for metric_name, score in scores.items():
print(f" {metric_name}: {score:.4f}")
return model, test, transformers
def predict_new_molecules(model, smiles_list, transformers=None):
"""
Predict solubility for new molecules.
Args:
model: Trained DeepChem model
smiles_list: List of SMILES strings
transformers: List of data transformers to apply
Returns:
Array of predictions
"""
print("\n" + "=" * 60)
print("Predicting New Molecules")
print("=" * 60)
# Featurize new molecules
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
features = featurizer.featurize(smiles_list)
# Create dataset
new_dataset = dc.data.NumpyDataset(X=features)
# Predict, passing the y-transformers so outputs are untransformed
# back to the original log(mol/L) scale
predictions = model.predict(new_dataset, transformers=transformers if transformers else [])
# Display results
print("\nPredictions:")
for smiles, pred in zip(smiles_list, predictions):
print(f" {smiles:30s} -> {pred[0]:.3f} log(mol/L)")
return predictions
def main():
parser = argparse.ArgumentParser(
description='Train a molecular solubility prediction model'
)
parser.add_argument(
'--data',
type=str,
default=None,
help='Path to CSV file with molecular data'
)
parser.add_argument(
'--smiles-col',
type=str,
default='smiles',
help='Name of column containing SMILES strings'
)
parser.add_argument(
'--target-col',
type=str,
default='solubility',
help='Name of column containing target values'
)
parser.add_argument(
'--predict',
nargs='+',
default=None,
help='SMILES strings to predict after training'
)
args = parser.parse_args()
# Train model
try:
model, test_set, transformers = train_solubility_model(
data_path=args.data,
smiles_col=args.smiles_col,
target_col=args.target_col
)
except Exception as e:
print(f"\nError during training: {e}", file=sys.stderr)
return 1
# Make predictions on new molecules if provided
if args.predict:
try:
predict_new_molecules(model, args.predict, transformers)
except Exception as e:
print(f"\nError during prediction: {e}", file=sys.stderr)
return 1
else:
# Example predictions
example_smiles = [
'CCO', # Ethanol
'CC(=O)O', # Acetic acid
'c1ccccc1', # Benzene
'CN1C=NC2=C1C(=O)N(C(=O)N2C)C', # Caffeine
]
predict_new_molecules(model, example_smiles, transformers)
print("\n" + "=" * 60)
print("Complete!")
print("=" * 60)
return 0
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,375 @@
#!/usr/bin/env python3
"""
Transfer Learning Script for DeepChem
Use pretrained models (ChemBERTa, GROVER, MolFormer) for molecular property prediction
with transfer learning. Particularly useful for small datasets.
Usage:
python transfer_learning.py --model chemberta --data my_data.csv --target activity
python transfer_learning.py --model grover --dataset bbbp
"""
import argparse
import deepchem as dc
import sys
PRETRAINED_MODELS = {
'chemberta': {
'name': 'ChemBERTa',
'description': 'BERT pretrained on 77M molecules from ZINC15',
'model_id': 'seyonec/ChemBERTa-zinc-base-v1'
},
'grover': {
'name': 'GROVER',
'description': 'Graph transformer pretrained on 10M molecules',
'model_id': None # GROVER uses its own loading mechanism
},
'molformer': {
'name': 'MolFormer',
'description': 'Transformer pretrained on molecular structures',
'model_id': 'ibm/MoLFormer-XL-both-10pct'
}
}
def train_chemberta(train_dataset, valid_dataset, test_dataset, task_type='classification', n_tasks=1, n_epochs=10):
"""
Fine-tune ChemBERTa on a dataset.
Args:
train_dataset: Training dataset
valid_dataset: Validation dataset
test_dataset: Test dataset
task_type: 'classification' or 'regression'
n_tasks: Number of prediction tasks
n_epochs: Number of fine-tuning epochs
Returns:
Trained model and evaluation results
"""
print("=" * 70)
print("Fine-tuning ChemBERTa")
print("=" * 70)
print("\nChemBERTa is a BERT model pretrained on 77M molecules from ZINC15.")
print("It uses SMILES strings as input and has learned rich molecular")
print("representations that transfer well to downstream tasks.")
print(f"\nLoading pretrained ChemBERTa model...")
model = dc.models.HuggingFaceModel(
model=PRETRAINED_MODELS['chemberta']['model_id'],
task=task_type,
n_tasks=n_tasks,
batch_size=32,
learning_rate=2e-5 # Lower LR for fine-tuning
)
print(f"\nFine-tuning for {n_epochs} epochs...")
print("(This may take a while on the first run as the model is downloaded)")
model.fit(train_dataset, nb_epoch=n_epochs)
print("Fine-tuning complete!")
# Evaluate
print("\n" + "=" * 70)
print("Model Evaluation")
print("=" * 70)
if task_type == 'classification':
metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
]
else:
metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R2'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
]
results = {}
for name, dataset in [('Train', train_dataset), ('Valid', valid_dataset), ('Test', test_dataset)]:
print(f"\n{name} Set:")
scores = model.evaluate(dataset, metrics)
results[name] = scores
for metric_name, score in scores.items():
print(f" {metric_name}: {score:.4f}")
return model, results
def train_grover(train_dataset, test_dataset, task_type='classification', n_tasks=1, n_epochs=20):
"""
Fine-tune GROVER on a dataset.
Args:
train_dataset: Training dataset
test_dataset: Test dataset
task_type: 'classification' or 'regression'
n_tasks: Number of prediction tasks
n_epochs: Number of fine-tuning epochs
Returns:
Trained model and evaluation results
"""
print("=" * 70)
print("Fine-tuning GROVER")
print("=" * 70)
print("\nGROVER is a graph transformer pretrained on 10M molecules using")
print("self-supervised learning. It learns both node and graph-level")
print("representations through masked atom/bond prediction tasks.")
print(f"\nCreating GROVER model...")
model = dc.models.GroverModel(
task=task_type,
n_tasks=n_tasks,
model_dir='./grover_pretrained'
)
print(f"\nFine-tuning for {n_epochs} epochs...")
model.fit(train_dataset, nb_epoch=n_epochs)
print("Fine-tuning complete!")
# Evaluate
print("\n" + "=" * 70)
print("Model Evaluation")
print("=" * 70)
if task_type == 'classification':
metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
]
else:
metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R2'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
]
results = {}
for name, dataset in [('Train', train_dataset), ('Test', test_dataset)]:
print(f"\n{name} Set:")
scores = model.evaluate(dataset, metrics)
results[name] = scores
for metric_name, score in scores.items():
print(f" {metric_name}: {score:.4f}")
return model, results
def load_molnet_dataset(dataset_name, model_type):
"""
Load a MoleculeNet dataset with appropriate featurization.
Args:
dataset_name: Name of MoleculeNet dataset
model_type: Type of pretrained model being used
Returns:
tasks, train/valid/test datasets, transformers
"""
# Map of MoleculeNet datasets
molnet_datasets = {
'tox21': dc.molnet.load_tox21,
'bbbp': dc.molnet.load_bbbp,
'bace': dc.molnet.load_bace_classification,
'hiv': dc.molnet.load_hiv,
'delaney': dc.molnet.load_delaney,
'freesolv': dc.molnet.load_freesolv,
'lipo': dc.molnet.load_lipo
}
if dataset_name not in molnet_datasets:
raise ValueError(f"Unknown dataset: {dataset_name}")
# ChemBERTa and MolFormer use raw SMILES
if model_type in ['chemberta', 'molformer']:
featurizer = 'Raw'
# GROVER needs graph features
elif model_type == 'grover':
featurizer = 'GraphConv'
else:
featurizer = 'ECFP'
print(f"\nLoading {dataset_name} dataset...")
load_func = molnet_datasets[dataset_name]
tasks, datasets, transformers = load_func(
featurizer=featurizer,
splitter='scaffold'
)
return tasks, datasets, transformers
def load_custom_dataset(data_path, target_cols, smiles_col, model_type):
"""
Load a custom CSV dataset.
Args:
data_path: Path to CSV file
target_cols: List of target column names
smiles_col: Name of SMILES column
model_type: Type of pretrained model being used
Returns:
train, valid, test datasets
"""
print(f"\nLoading custom data from {data_path}...")
# Choose featurizer based on model
if model_type in ['chemberta', 'molformer']:
featurizer = dc.feat.DummyFeaturizer() # Models handle featurization
elif model_type == 'grover':
featurizer = dc.feat.MolGraphConvFeaturizer()
else:
featurizer = dc.feat.CircularFingerprint()
loader = dc.data.CSVLoader(
tasks=target_cols,
feature_field=smiles_col,
featurizer=featurizer
)
dataset = loader.create_dataset(data_path)
print(f"Loaded {len(dataset)} molecules")
# Split data
print("Splitting data with scaffold splitter...")
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
print(f" Training: {len(train)}")
print(f" Validation: {len(valid)}")
print(f" Test: {len(test)}")
return train, valid, test
def main():
parser = argparse.ArgumentParser(
description='Transfer learning for molecular property prediction'
)
parser.add_argument(
'--model',
type=str,
choices=list(PRETRAINED_MODELS.keys()),
required=True,
help='Pretrained model to use'
)
parser.add_argument(
'--dataset',
type=str,
choices=['tox21', 'bbbp', 'bace', 'hiv', 'delaney', 'freesolv', 'lipo'],
default=None,
help='MoleculeNet dataset to use'
)
parser.add_argument(
'--data',
type=str,
default=None,
help='Path to custom CSV file'
)
parser.add_argument(
'--target',
nargs='+',
default=['target'],
help='Target column name(s) for custom data'
)
parser.add_argument(
'--smiles-col',
type=str,
default='smiles',
help='SMILES column name for custom data'
)
parser.add_argument(
'--task-type',
type=str,
choices=['classification', 'regression'],
default='classification',
help='Type of prediction task'
)
parser.add_argument(
'--epochs',
type=int,
default=10,
help='Number of fine-tuning epochs'
)
args = parser.parse_args()
# Validate arguments
if args.dataset is None and args.data is None:
print("Error: Must specify either --dataset or --data", file=sys.stderr)
return 1
if args.dataset and args.data:
print("Error: Cannot specify both --dataset and --data", file=sys.stderr)
return 1
# Print model info
model_info = PRETRAINED_MODELS[args.model]
print("\n" + "=" * 70)
print(f"Transfer Learning with {model_info['name']}")
print("=" * 70)
print(f"\n{model_info['description']}")
try:
# Load dataset
if args.dataset:
tasks, datasets, transformers = load_molnet_dataset(args.dataset, args.model)
train, valid, test = datasets
task_type = 'classification' if args.dataset in ['tox21', 'bbbp', 'bace', 'hiv'] else 'regression'
n_tasks = len(tasks)
else:
train, valid, test = load_custom_dataset(
args.data,
args.target,
args.smiles_col,
args.model
)
task_type = args.task_type
n_tasks = len(args.target)
# Train model
if args.model == 'chemberta':
model, results = train_chemberta(
train, valid, test,
task_type=task_type,
n_tasks=n_tasks,
n_epochs=args.epochs
)
elif args.model == 'grover':
model, results = train_grover(
train, test,
task_type=task_type,
n_tasks=n_tasks,
n_epochs=args.epochs
)
else:
print(f"Error: Model {args.model} not yet implemented", file=sys.stderr)
return 1
print("\n" + "=" * 70)
print("Transfer Learning Complete!")
print("=" * 70)
print("\nTip: Pretrained models often work best with:")
print(" - Small datasets (< 1000 samples)")
print(" - Lower learning rates (1e-5 to 5e-5)")
print(" - Fewer epochs (5-20)")
print(" - Avoiding overfitting through early stopping")
return 0
except Exception as e:
print(f"\nError: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
return 1
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,537 @@
---
name: deeptools
description: Comprehensive toolkit for analyzing next-generation sequencing (NGS) data including ChIP-seq, RNA-seq, ATAC-seq, and related experiments. Use this skill when working with BAM files, bigWig coverage tracks, or when creating heatmaps, profile plots, and quality control visualizations for genomic data. Applicable for tasks involving read coverage analysis, sample correlation, ChIP enrichment assessment, normalization, and publication-quality visualization generation.
---
# deepTools: NGS Data Analysis Toolkit
## Overview
deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. This skill provides guidance for using deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.
**Core capabilities:**
- Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph)
- Quality control assessment (fingerprint, correlation, coverage)
- Sample comparison and correlation analysis
- Heatmap and profile plot generation around genomic features
- Enrichment analysis and peak region visualization
## When to Use This Skill
Invoke this skill when users request tasks involving:
- **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data"
- **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis"
- **Visualization**: "create heatmap around TSS", "plot ChIP signal", "visualize enrichment", "generate profile plot"
- **Sample comparison**: "compare treatment vs control", "correlate samples", "PCA analysis"
- **Analysis workflows**: "analyze ChIP-seq data", "RNA-seq coverage", "ATAC-seq analysis", "complete workflow"
- **Working with specific file types**: BAM files, bigWig files, BED region files in genomics context
## Quick Start
For users new to deepTools, start with file validation and common workflows:
### 1. Validate Input Files
Before running any analysis, validate BAM, bigWig, and BED files using the validation script:
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam --bed regions.bed
```
This checks file existence, BAM indices, and format correctness.
### 2. Generate Workflow Template
For standard analyses, use the workflow generator to create customized scripts:
```bash
# List available workflows
python scripts/workflow_generator.py --list
# Generate ChIP-seq QC workflow
python scripts/workflow_generator.py chipseq_qc -o qc_workflow.sh \
--input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
--genome-size 2913022398
# Make executable and run
chmod +x qc_workflow.sh
./qc_workflow.sh
```
### 3. Most Common Operations
See `assets/quick_reference.md` for frequently used commands and parameters.
## Installation
Guide users to install deepTools using conda (recommended):
```bash
# Standard installation
conda install -c conda-forge -c bioconda deeptools
# For M1 Macs
CONDA_SUBDIR=osx-64 conda create -c conda-forge -c bioconda -n deeptools deeptools
```
Or using pip:
```bash
pip install deeptools
```
## Core Workflows
deepTools workflows typically follow this pattern: **QC → Normalization → Comparison/Visualization**
### ChIP-seq Quality Control Workflow
When users request ChIP-seq QC or quality assessment:
1. **Generate workflow script** using `scripts/workflow_generator.py chipseq_qc`
2. **Key QC steps**:
- Sample correlation (multiBamSummary + plotCorrelation)
- PCA analysis (plotPCA)
- Coverage assessment (plotCoverage)
- Fragment size validation (bamPEFragmentSize)
- ChIP enrichment strength (plotFingerprint)
**Interpreting results:**
- **Correlation**: Replicates should cluster together with high correlation (>0.9)
- **Fingerprint**: Strong ChIP shows steep rise; flat diagonal indicates poor enrichment
- **Coverage**: Assess if sequencing depth is adequate for analysis
Full workflow details in `references/workflows.md` → "ChIP-seq Quality Control Workflow"
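For the fragment-size and coverage checks listed above (not shown elsewhere in this skill), a minimal sketch with placeholder file names:
```bash
# Fragment-size distribution (paired-end data) and sequencing-depth check
bamPEFragmentSize -b ChIP1.bam ChIP2.bam \
    --histogram fragment_sizes.png --numberOfProcessors 8

plotCoverage -b ChIP1.bam ChIP2.bam Input.bam -o coverage.png \
    --ignoreDuplicates --numberOfProcessors 8
```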
### ChIP-seq Complete Analysis Workflow
For full ChIP-seq analysis from BAM to visualizations:
1. **Generate coverage tracks** with normalization (bamCoverage)
2. **Create comparison tracks** (bamCompare for log2 ratio)
3. **Compute signal matrices** around features (computeMatrix)
4. **Generate visualizations** (plotHeatmap, plotProfile)
5. **Enrichment analysis** at peaks (plotEnrichment)
Use `scripts/workflow_generator.py chipseq_analysis` to generate template.
Complete command sequences in `references/workflows.md` → "ChIP-seq Analysis Workflow"
### RNA-seq Coverage Workflow
For strand-specific RNA-seq coverage tracks:
Use bamCoverage with `--filterRNAstrand` to separate forward and reverse strands.
**Important:** NEVER use `--extendReads` for RNA-seq (would extend over splice junctions).
Use normalization: CPM for fixed bins, RPKM for gene-level analysis.
Template available: `scripts/workflow_generator.py rnaseq_coverage`
Details in `references/workflows.md` → "RNA-seq Coverage Workflow"
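A minimal sketch (the BAM file name is a placeholder):
```bash
# Forward- and reverse-strand coverage, CPM-normalized, no read extension
bamCoverage --bam rnaseq.bam -o rnaseq_fwd.bw \
    --filterRNAstrand forward --normalizeUsing CPM --binSize 10 -p 8
bamCoverage --bam rnaseq.bam -o rnaseq_rev.bw \
    --filterRNAstrand reverse --normalizeUsing CPM --binSize 10 -p 8
```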
### ATAC-seq Analysis Workflow
ATAC-seq requires Tn5 offset correction:
1. **Shift reads** using alignmentSieve with `--ATACshift`
2. **Generate coverage** with bamCoverage
3. **Analyze fragment sizes** (expect nucleosome ladder pattern)
4. **Visualize at peaks** if available
Template: `scripts/workflow_generator.py atacseq`
Full workflow in `references/workflows.md` → "ATAC-seq Workflow"
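A minimal sketch of steps 1-2 for a human sample (file names are placeholders; the shifted BAM must be re-sorted and indexed before computing coverage):
```bash
# Step 1: Tn5 offset correction
alignmentSieve -b atac.bam -o atac_shifted.bam --ATACshift --numberOfProcessors 8
samtools sort -o atac_shifted.sorted.bam atac_shifted.bam
samtools index atac_shifted.sorted.bam

# Step 2: normalized coverage from the shifted reads
bamCoverage --bam atac_shifted.sorted.bam -o atac.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 --binSize 10 -p 8
```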
## Tool Categories and Common Tasks
### BAM/bigWig Processing
**Convert BAM to normalized coverage:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
--binSize 10 --numberOfProcessors 8
```
**Compare two samples (log2 ratio):**
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
--operation log2 --scaleFactorsMethod readCount
```
**Key tools:** bamCoverage, bamCompare, multiBamSummary, multiBigwigSummary, correctGCBias, alignmentSieve
Complete reference: `references/tools_reference.md` → "BAM and bigWig File Processing Tools"
### Quality Control
**Check ChIP enrichment:**
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
```
**Sample correlation:**
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
--whatToPlot heatmap -o correlation.png
```
**Key tools:** plotFingerprint, plotCoverage, plotCorrelation, plotPCA, bamPEFragmentSize
Complete reference: `references/tools_reference.md` → "Quality Control Tools"
### Visualization
**Create heatmap around TSS:**
```bash
# Compute matrix
computeMatrix reference-point -S signal.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz
# Generate heatmap
plotHeatmap -m matrix.gz -o heatmap.png \
--colorMap RdBu --kmeans 3
```
**Create profile plot:**
```bash
plotProfile -m matrix.gz -o profile.png \
--plotType lines --colors blue red
```
**Key tools:** computeMatrix, plotHeatmap, plotProfile, plotEnrichment
Complete reference: `references/tools_reference.md` → "Visualization Tools"
## Normalization Methods
Choosing the correct normalization is critical for valid comparisons. Consult `references/normalization_methods.md` for comprehensive guidance.
**Quick selection guide:**
- **ChIP-seq coverage**: Use RPGC or CPM
- **ChIP-seq comparison**: Use bamCompare with log2 and readCount
- **RNA-seq bins**: Use CPM
- **RNA-seq genes**: Use RPKM (accounts for gene length)
- **ATAC-seq**: Use RPGC or CPM
**Normalization methods:**
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
- **CPM**: Counts per million mapped reads
- **RPKM**: Reads per kb per million (accounts for region length)
- **BPM**: Bins per million
- **None**: Raw counts (not recommended for comparisons)
Full explanation: `references/normalization_methods.md`
## Effective Genome Sizes
RPGC normalization requires effective genome size. Common values:
| Organism | Assembly | Size | Usage |
|----------|----------|------|-------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
Complete table with read-length-specific values: `references/effective_genome_sizes.md`
## Common Parameters Across Tools
Many deepTools commands share these options:
**Performance:**
- `--numberOfProcessors, -p`: Enable parallel processing (always use available cores)
- `--region`: Process specific regions for testing (e.g., `chr1:1-1000000`)
**Read Filtering:**
- `--ignoreDuplicates`: Remove PCR duplicates (recommended for most analyses)
- `--minMappingQuality`: Filter by alignment quality (e.g., `--minMappingQuality 10`)
- `--minFragmentLength` / `--maxFragmentLength`: Fragment length bounds
- `--samFlagInclude` / `--samFlagExclude`: SAM flag filtering
**Read Processing:**
- `--extendReads`: Extend to fragment length (ChIP-seq: YES, RNA-seq: NO)
- `--centerReads`: Center at fragment midpoint for sharper signals
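For illustration, most of these options can be combined on a single call (a sketch; `chip.bam` is a placeholder, and the read-processing flags apply to ChIP-seq, not RNA-seq):
```bash
# Test on a sub-region, filter duplicates and low-quality reads,
# extend and center reads, and use 8 cores
bamCoverage --bam chip.bam -o chip_test.bw \
    --region chr1:1-10000000 \
    --ignoreDuplicates --minMappingQuality 10 \
    --extendReads 200 --centerReads \
    --numberOfProcessors 8
```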
## Best Practices
### File Validation
**Always validate files first** using `scripts/validate_files.py` to check:
- File existence and readability
- BAM indices present (.bai files)
- BED format correctness
- File sizes reasonable
### Analysis Strategy
1. **Start with QC**: Run correlation, coverage, and fingerprint analysis before proceeding
2. **Test on small regions**: Use `--region chr1:1-10000000` for parameter testing
3. **Document commands**: Save full command lines for reproducibility
4. **Use consistent normalization**: Apply same method across samples in comparisons
5. **Verify genome assembly**: Ensure BAM and BED files use matching genome builds
### ChIP-seq Specific
- **Always extend reads** for ChIP-seq: `--extendReads 200`
- **Remove duplicates**: Use `--ignoreDuplicates` in most cases
- **Check enrichment first**: Run plotFingerprint before detailed analysis
- **GC correction**: Only apply if significant bias detected; never use `--ignoreDuplicates` after GC correction
### RNA-seq Specific
- **Never extend reads** for RNA-seq (would span splice junctions)
- **Strand-specific**: Use `--filterRNAstrand forward/reverse` for stranded libraries
- **Normalization**: CPM for bins, RPKM for genes
### ATAC-seq Specific
- **Apply Tn5 correction**: Use alignmentSieve with `--ATACshift`
- **Fragment filtering**: Set appropriate min/max fragment lengths
- **Check nucleosome pattern**: Fragment size plot should show ladder pattern
### Performance Optimization
1. **Use multiple processors**: `--numberOfProcessors 8` (or available cores)
2. **Increase bin size** for faster processing and smaller files
3. **Process chromosomes separately** for memory-limited systems
4. **Pre-filter BAM files** using alignmentSieve to create reusable filtered files
5. **Use bigWig over bedGraph**: Compressed and faster to process
## Troubleshooting
### Common Issues
**BAM index missing:**
```bash
samtools index input.bam
```
**Out of memory:**
Process chromosomes individually using `--region`:
```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```
**Slow processing:**
Increase `--numberOfProcessors` and/or increase `--binSize`
**bigWig files too large:**
Increase bin size: `--binSize 50` or larger
### Validation Errors
Run validation script to identify issues:
```bash
python scripts/validate_files.py --bam *.bam --bed regions.bed
```
Common errors and solutions explained in script output.
## Reference Documentation
This skill includes comprehensive reference documentation:
### references/tools_reference.md
Complete documentation of all deepTools commands organized by category:
- BAM and bigWig processing tools (9 tools)
- Quality control tools (6 tools)
- Visualization tools (3 tools)
- Miscellaneous tools (2 tools)
Each tool includes:
- Purpose and overview
- Key parameters with explanations
- Usage examples
- Important notes and best practices
**Use this reference when:** Users ask about specific tools, parameters, or detailed usage.
### references/workflows.md
Complete workflow examples for common analyses:
- ChIP-seq quality control workflow
- ChIP-seq complete analysis workflow
- RNA-seq coverage workflow
- ATAC-seq analysis workflow
- Multi-sample comparison workflow
- Peak region analysis workflow
- Troubleshooting and performance tips
**Use this reference when:** Users need complete analysis pipelines or workflow examples.
### references/normalization_methods.md
Comprehensive guide to normalization methods:
- Detailed explanation of each method (RPGC, CPM, RPKM, BPM, etc.)
- When to use each method
- Formulas and interpretation
- Selection guide by experiment type
- Common pitfalls and solutions
- Quick reference table
**Use this reference when:** Users ask about normalization, comparing samples, or which method to use.
### references/effective_genome_sizes.md
Effective genome size values and usage:
- Common organism values (human, mouse, fly, worm, zebrafish)
- Read-length-specific values
- Calculation methods
- When and how to use in commands
- Custom genome calculation instructions
**Use this reference when:** Users need genome size for RPGC normalization or GC bias correction.
## Helper Scripts
### scripts/validate_files.py
Validates BAM, bigWig, and BED files for deepTools analysis. Checks file existence, indices, and format.
**Usage:**
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam \
--bed peaks.bed --bigwig signal.bw
```
**When to use:** Before starting any analysis, or when troubleshooting errors.
### scripts/workflow_generator.py
Generates customizable bash script templates for common deepTools workflows.
**Available workflows:**
- `chipseq_qc`: ChIP-seq quality control
- `chipseq_analysis`: Complete ChIP-seq analysis
- `rnaseq_coverage`: Strand-specific RNA-seq coverage
- `atacseq`: ATAC-seq with Tn5 correction
**Usage:**
```bash
# List workflows
python scripts/workflow_generator.py --list
# Generate workflow
python scripts/workflow_generator.py chipseq_qc -o qc.sh \
--input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
--genome-size 2913022398 --threads 8
# Run generated workflow
chmod +x qc.sh
./qc.sh
```
**When to use:** Users request standard workflows or need template scripts to customize.
## Assets
### assets/quick_reference.md
Quick reference card with most common commands, effective genome sizes, and typical workflow pattern.
**When to use:** Users need quick command examples without detailed documentation.
## Handling User Requests
### For New Users
1. Start with installation verification
2. Validate input files using `scripts/validate_files.py`
3. Recommend appropriate workflow based on experiment type
4. Generate workflow template using `scripts/workflow_generator.py`
5. Guide through customization and execution
### For Experienced Users
1. Provide specific tool commands for requested operations
2. Reference appropriate sections in `references/tools_reference.md`
3. Suggest optimizations and best practices
4. Offer troubleshooting for issues
### For Specific Tasks
**"Convert BAM to bigWig":**
- Use bamCoverage with appropriate normalization
- Recommend RPGC or CPM based on use case
- Provide effective genome size for organism
- Suggest relevant parameters (extendReads, ignoreDuplicates, binSize)
**"Check ChIP quality":**
- Run full QC workflow or use plotFingerprint specifically
- Explain interpretation of results
- Suggest follow-up actions based on results
**"Create heatmap":**
- Guide through two-step process: computeMatrix → plotHeatmap
- Help choose appropriate matrix mode (reference-point vs scale-regions)
- Suggest visualization parameters and clustering options
**"Compare samples":**
- Recommend bamCompare for two-sample comparison
- Suggest multiBamSummary + plotCorrelation for multiple samples
- Guide normalization method selection
### Referencing Documentation
When users need detailed information:
- **Tool details**: Direct to specific sections in `references/tools_reference.md`
- **Workflows**: Use `references/workflows.md` for complete analysis pipelines
- **Normalization**: Consult `references/normalization_methods.md` for method selection
- **Genome sizes**: Reference `references/effective_genome_sizes.md`
Search references using grep patterns:
```bash
# Find tool documentation
grep -A 20 "^### toolname" references/tools_reference.md
# Find workflow
grep -A 50 "^## Workflow Name" references/workflows.md
# Find normalization method
grep -A 15 "^### Method Name" references/normalization_methods.md
```
## Example Interactions
**User: "I need to analyze my ChIP-seq data"**
Response approach:
1. Ask about files available (BAM files, peaks, genes)
2. Validate files using validation script
3. Generate chipseq_analysis workflow template
4. Customize for their specific files and organism
5. Explain each step as script runs
**User: "Which normalization should I use?"**
Response approach:
1. Ask about experiment type (ChIP-seq, RNA-seq, etc.)
2. Ask about comparison goal (within-sample or between-sample)
3. Consult `references/normalization_methods.md` selection guide
4. Recommend appropriate method with justification
5. Provide command example with parameters
**User: "Create a heatmap around TSS"**
Response approach:
1. Verify bigWig and gene BED files available
2. Use computeMatrix with reference-point mode at TSS
3. Generate plotHeatmap with appropriate visualization parameters
4. Suggest clustering if dataset is large
5. Offer profile plot as complement
## Key Reminders
- **File validation first**: Always validate input files before analysis
- **Normalization matters**: Choose appropriate method for comparison type
- **Extend reads carefully**: YES for ChIP-seq, NO for RNA-seq
- **Use all cores**: Set `--numberOfProcessors` to available cores
- **Test on regions**: Use `--region` for parameter testing
- **Check QC first**: Run quality control before detailed analysis
- **Document everything**: Save commands for reproducibility
- **Reference documentation**: Use comprehensive references for detailed guidance

View File

@@ -0,0 +1,58 @@
# deepTools Quick Reference
## Most Common Commands
### BAM to bigWig (normalized)
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
--binSize 10 --numberOfProcessors 8
```
### Compare two BAM files
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
--operation log2 --scaleFactorsMethod readCount
```
### Correlation heatmap
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
--whatToPlot heatmap -o correlation.png
```
### Heatmap around TSS
```bash
computeMatrix reference-point -S signal.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz
plotHeatmap -m matrix.gz -o heatmap.png
```
### ChIP enrichment check
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
```
## Effective Genome Sizes
| Organism | Assembly | Size |
|----------|----------|------|
| Human | hg38 | 2913022398 |
| Mouse | mm10 | 2652783500 |
| Fly | dm6 | 142573017 |
## Common Normalization Methods
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
- **CPM**: Counts per million (for fixed bins)
- **RPKM**: Reads per kb per million (for genes)
## Typical Workflow
1. **QC**: plotFingerprint, plotCorrelation
2. **Coverage**: bamCoverage with normalization
3. **Comparison**: bamCompare for treatment vs control
4. **Visualization**: computeMatrix → plotHeatmap/plotProfile

View File

@@ -0,0 +1,116 @@
# Effective Genome Sizes
## Definition
Effective genome size refers to the length of the "mappable" genome - regions that can be uniquely mapped by sequencing reads. This metric is crucial for proper normalization in many deepTools commands.
## Why It Matters
- Required for RPGC normalization (`--normalizeUsing RPGC`)
- Affects accuracy of coverage calculations
- Must match your data processing approach (filtered vs unfiltered reads)
## Calculation Methods
1. **Non-N bases**: Count of non-N nucleotides in genome sequence
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)
## Common Organism Values
### Using Non-N Bases Method
| Organism | Assembly | Effective Size | Full Command |
|----------|----------|----------------|--------------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |
### Human (GRCh38) by Read Length
For quality-filtered reads, values vary by read length:
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.7 billion |
| 75bp | ~2.8 billion |
| 100bp | ~2.8 billion |
| 150bp | ~2.9 billion |
| 250bp | ~2.9 billion |
### Mouse (GRCm38) by Read Length
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.3 billion |
| 75bp | ~2.5 billion |
| 100bp | ~2.6 billion |
## Usage in deepTools
The effective genome size is most commonly used with:
### bamCoverage with RPGC normalization
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
### bamCompare with RPGC normalization
```bash
bamCompare -b1 treatment.bam -b2 control.bam \
--outFileName comparison.bw \
--scaleFactorsMethod None \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
### computeGCBias / correctGCBias
```bash
computeGCBias --bamfile input.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--fragmentLength 200 \
--biasPlot bias.png
```
## Choosing the Right Value
**For most analyses:** Use the non-N bases method value for your reference genome
**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values
**When unsure:** Use the conservative non-N bases value - it's more widely applicable
## Common Shortcuts
deepTools also accepts these shorthand values in some contexts:
- `hs` or `GRCh38`: 2913022398
- `mm` or `GRCm38`: 2652783500
- `dm` or `dm6`: 142573017
- `ce` or `ce10`: 100286401
Check your specific deepTools version documentation for supported shortcuts.
## Calculating Custom Values
For custom genomes or assemblies, calculate the non-N bases count:
```bash
# Using faCount (UCSC tools)
faCount genome.fa | grep "total" | awk '{print $2-$7}'
# Using seqtk
seqtk comp genome.fa | awk '{x+=$3+$4+$5+$6}END{print x}'   # sum the A, C, G, T columns
```
## References
For the most up-to-date effective genome sizes and detailed calculation methods, see:
- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
- ENCODE documentation for reference genome details

View File

@@ -0,0 +1,410 @@
# deepTools Normalization Methods
This document explains the various normalization methods available in deepTools and when to use each one.
## Why Normalize?
Normalization is essential for:
1. **Comparing samples with different sequencing depths**
2. **Accounting for library size differences**
3. **Making coverage values interpretable across experiments**
4. **Enabling fair comparisons between conditions**
Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.
---
## Available Normalization Methods
### 1. RPKM (Reads Per Kilobase per Million mapped reads)
**Formula:** `(Number of reads) / (Length of region in kb × Total mapped reads in millions)`
**When to use:**
- Comparing different genomic regions within the same sample
- Adjusting for both sequencing depth AND region length
- RNA-seq gene expression analysis
**Available in:** `bamCoverage`
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPKM
```
**Interpretation:** RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.
**Pros:**
- Accounts for both region length and library size
- Widely used and understood in genomics
**Cons:**
- Not ideal for comparing between samples if total RNA content differs
- Can be misleading when comparing samples with very different compositions
---
### 2. CPM (Counts Per Million mapped reads)
**Formula:** `(Number of reads) / (Total mapped reads in millions)`
**Also known as:** RPM (Reads Per Million)
**When to use:**
- Comparing the same genomic regions across different samples
- When region length is constant or not relevant
- ChIP-seq, ATAC-seq, DNase-seq analyses
**Available in:** `bamCoverage`, `bamCompare`
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing CPM
```
**Interpretation:** CPM of 5 means 5 reads per million mapped reads in that bin.
**Pros:**
- Simple and intuitive
- Good for comparing samples with different sequencing depths
- Appropriate when comparing fixed-size bins
**Cons:**
- Does not account for region length
- Affected by highly abundant regions (e.g., rRNA in RNA-seq)
---
### 3. BPM (Bins Per Million mapped reads)
**Formula:** `(Number of reads in bin) / (Sum of all reads in bins in millions)`
**Key difference from CPM:** Only considers reads that fall within the analyzed bins, not all mapped reads.
**When to use:**
- Similar to CPM, but when you want to exclude reads outside analyzed regions
- Comparing specific genomic regions while ignoring background
**Available in:** `bamCoverage`, `bamCompare`
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing BPM
```
**Interpretation:** BPM accounts only for reads in the binned regions.
**Pros:**
- Focuses normalization on analyzed regions
- Less affected by reads in unanalyzed areas
**Cons:**
- Less commonly used, may be harder to compare with published data
---
### 4. RPGC (Reads Per Genomic Content)
**Formula:** `(Number of reads × Scaling factor) / Effective genome size`
**Scaling factor:** Calculated to achieve 1× genomic coverage (1 read per base)
**When to use:**
- Want comparable coverage values across samples
- Need interpretable absolute coverage values
- Comparing samples with very different total read counts
- ChIP-seq with spike-in normalization context
**Available in:** `bamCoverage`, `bamCompare`
**Requires:** `--effectiveGenomeSize` parameter
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
**Interpretation:** Signal value approximates the coverage depth (e.g., value of 2 ≈ 2× coverage).
**Pros:**
- Produces 1× normalized coverage
- Interpretable in terms of genomic coverage
- Good for comparing samples with different sequencing depths
**Cons:**
- Requires knowing effective genome size
- Assumes uniform coverage (not true for ChIP-seq with peaks)
---
### 5. None (No Normalization)
**Formula:** Raw read counts
**When to use:**
- Preliminary analysis
- When samples have identical library sizes (rare)
- When downstream tool will perform normalization
- Debugging or quality control
**Available in:** All tools (usually default)
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing None
```
**Interpretation:** Raw read counts per bin.
**Pros:**
- No assumptions made
- Useful for seeing raw data
- Fastest computation
**Cons:**
- Cannot fairly compare samples with different sequencing depths
- Not suitable for publication figures
---
### 6. SES (Selective Enrichment Statistics)
**Method:** Signal Extraction Scaling - more sophisticated method for comparing ChIP to control
**When to use:**
- ChIP-seq analysis with bamCompare
- Want sophisticated background correction
- Alternative to simple readCount scaling
**Available in:** `bamCompare` only
**Example:**
```bash
bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
--scaleFactorsMethod SES
```
**Note:** SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.
---
### 7. readCount (Read Count Scaling)
**Method:** Scale by ratio of total read counts between samples
**When to use:**
- Default for `bamCompare`
- Compensating for sequencing depth differences in comparisons
- When you trust that total read counts reflect library size
**Available in:** `bamCompare`
**Example:**
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
--scaleFactorsMethod readCount
```
**How it works:** If sample1 has 100M reads and sample2 has 50M reads, sample2 is scaled by 2× before comparison.
---
## Normalization Method Selection Guide
### For ChIP-seq Coverage Tracks
**Recommended:** RPGC or CPM
```bash
bamCoverage --bam chip.bam --outFileName chip.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--extendReads 200 \
--ignoreDuplicates
```
**Reasoning:** Accounts for sequencing depth differences; RPGC provides interpretable coverage values.
---
### For ChIP-seq Comparisons (Treatment vs Control)
**Recommended:** log2 ratio with readCount or SES scaling
```bash
bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
--operation log2 \
--scaleFactorsMethod readCount \
--extendReads 200 \
--ignoreDuplicates
```
**Reasoning:** Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.
---
### For RNA-seq Coverage Tracks
**Recommended:** CPM or RPKM
```bash
# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
--normalizeUsing CPM \
--filterRNAstrand forward
# For gene-level: RPKM accounts for gene length
bamCoverage --bam rnaseq.bam --outFileName output.bw \
--normalizeUsing RPKM
```
**Reasoning:** CPM for comparing fixed-width bins; RPKM for genes (accounts for length).
---
### For ATAC-seq
**Recommended:** RPGC or CPM
```bash
bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
**Reasoning:** Similar to ChIP-seq; want comparable coverage across samples.
---
### For Sample Correlation Analysis
**Recommended:** CPM or RPGC
```bash
multiBamSummary bins \
--bamfiles sample1.bam sample2.bam sample3.bam \
-o readCounts.npz
plotCorrelation -in readCounts.npz \
--corMethod pearson \
--whatToShow heatmap \
-o correlation.png
```
**Note:** `multiBamSummary` doesn't explicitly normalize, but correlation analysis is robust to scaling. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with `multiBigwigSummary`.
---
## Advanced Normalization Considerations
### Spike-in Normalization
For experiments with spike-in controls (e.g., *Drosophila* chromatin spike-in for ChIP-seq):
1. Calculate scaling factors from spike-in reads
2. Apply custom scaling factors using `--scaleFactor` parameter
```bash
# Calculate spike-in factor (example: 0.8)
SCALE_FACTOR=0.8
bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
--scaleFactor ${SCALE_FACTOR} \
--extendReads 200
```
---
### Manual Scaling Factors
You can apply custom scaling factors:
```bash
# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
--scaleFactor 2.0
```
---
### Chromosome Exclusion
Exclude specific chromosomes from normalization calculations:
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--ignoreForNormalization chrX chrY chrM
```
**When to use:** Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.
---
## Common Pitfalls
### 1. Using RPKM for bin-based data
**Problem:** RPKM accounts for region length, but all bins are the same size
**Solution:** Use CPM or RPGC instead
### 2. Comparing unnormalized samples
**Problem:** Sample with 2× sequencing depth appears to have 2× signal
**Solution:** Always normalize when comparing samples
### 3. Wrong effective genome size
**Problem:** Using hg19 genome size for hg38 data
**Solution:** Double-check genome assembly and use correct size
### 4. Ignoring duplicates after GC correction
**Problem:** Can introduce bias
**Solution:** Never use `--ignoreDuplicates` after `correctGCBias`
### 5. Using RPGC without effective genome size
**Problem:** Command fails
**Solution:** Always specify `--effectiveGenomeSize` with RPGC
---
## Normalization for Different Comparisons
### Within-sample comparisons (different regions)
**Use:** RPKM (accounts for region length)
### Between-sample comparisons (same regions)
**Use:** CPM, RPGC, or BPM (accounts for library size)
### Treatment vs Control
**Use:** bamCompare with log2 ratio and readCount/SES scaling
### Multiple samples correlation
**Use:** CPM or RPGC normalized bigWig files, then multiBigwigSummary
---
## Quick Reference Table
| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|--------|-------------------|---------------------|----------|---------|
| RPKM | ✓ | ✓ | RNA-seq genes | `--normalizeUsing RPKM` |
| CPM | ✓ | ✗ | Fixed-size bins | `--normalizeUsing CPM` |
| BPM | ✓ | ✗ | Specific regions | `--normalizeUsing BPM` |
| RPGC | ✓ | ✗ | Interpretable coverage | `--normalizeUsing RPGC --effectiveGenomeSize X` |
| None | ✗ | ✗ | Raw data | `--normalizeUsing None` |
| SES | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod SES` |
| readCount | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod readCount` |
---
## Further Reading
For more details on normalization theory and best practices:
- deepTools documentation: https://deeptools.readthedocs.io/
- ENCODE guidelines for ChIP-seq analysis
- RNA-seq normalization papers (DESeq2, TMM methods)

View File

@@ -0,0 +1,533 @@
# deepTools Complete Tool Reference
This document provides a comprehensive reference for all deepTools command-line utilities organized by category.
## BAM and bigWig File Processing Tools
### multiBamSummary
Computes read coverages for genomic regions across multiple BAM files, outputting compressed numpy arrays for downstream correlation and PCA analysis.
**Modes:**
- **bins**: Genome-wide analysis using consecutive equal-sized windows (default 10kb)
- **BED-file**: Restricts analysis to user-specified genomic regions
**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (space-separated, required)
- `--outFileName, -o`: Output coverage matrix file (required)
- `--BED`: Region specification file (BED-file mode only)
- `--binSize`: Window size in bases (default: 10,000)
- `--labels`: Custom sample identifiers
- `--minMappingQuality`: Quality threshold for read inclusion
- `--numberOfProcessors, -p`: Parallel processing cores
- `--extendReads`: Fragment size extension
- `--ignoreDuplicates`: Remove PCR duplicates
- `--outRawCounts`: Export tab-delimited file with coordinate columns and per-sample counts
**Output:** Compressed numpy array (.npz) for plotCorrelation and plotPCA
**Common Usage:**
```bash
# Genome-wide comparison
multiBamSummary bins --bamfiles sample1.bam sample2.bam -o results.npz
# Peak region comparison
multiBamSummary BED-file --BED peaks.bed --bamfiles sample1.bam sample2.bam -o results.npz
```
---
### multiBigwigSummary
Similar to multiBamSummary but operates on bigWig files instead of BAM files. Used for comparing coverage tracks across samples.
**Modes:**
- **bins**: Genome-wide analysis
- **BED-file**: Region-specific analysis
**Key Parameters:** Similar to multiBamSummary but accepts bigWig files
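**Common Usage** (a sketch mirroring the multiBamSummary examples; file names are placeholders):
```bash
# Genome-wide comparison of normalized coverage tracks
multiBigwigSummary bins -b sample1.bw sample2.bw sample3.bw \
    -o bw_counts.npz --binSize 10000 -p 8

# Restrict comparison to peak regions
multiBigwigSummary BED-file --BED peaks.bed -b sample1.bw sample2.bw \
    -o bw_peaks.npz
```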
---
### bamCoverage
Converts BAM alignment files into normalized coverage tracks in bigWig or bedGraph formats. Calculates coverage as number of reads per bin.
**Key Parameters:**
- `--bam, -b`: Input BAM file (required)
- `--outFileName, -o`: Output filename (required)
- `--outFileFormat, -of`: Output type (bigwig or bedgraph)
- `--normalizeUsing`: Normalization method
- **RPKM**: Reads Per Kilobase per Million mapped reads
- **CPM**: Counts Per Million mapped reads
- **BPM**: Bins Per Million mapped reads
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
- **None**: No normalization (default)
- `--effectiveGenomeSize`: Mappable genome size (required for RPGC)
- `--binSize`: Resolution in base pairs (default: 50)
- `--extendReads, -e`: Extend reads to fragment length (recommended for ChIP-seq, NOT for RNA-seq)
- `--centerReads`: Center reads at fragment length for sharper signals
- `--ignoreDuplicates`: Count identical reads only once
- `--minMappingQuality`: Filter reads below quality threshold
- `--minFragmentLength / --maxFragmentLength`: Fragment length filtering
- `--smoothLength`: Window averaging for noise reduction
- `--MNase`: Analyze MNase-seq data for nucleosome positioning
- `--Offset`: Position-specific offsets (useful for RiboSeq, GROseq)
- `--filterRNAstrand`: Separate forward/reverse strand reads
- `--ignoreForNormalization`: Exclude chromosomes from normalization (e.g., sex chromosomes)
- `--numberOfProcessors, -p`: Parallel processing
**Important Notes:**
- For RNA-seq: Do NOT use --extendReads (would extend over splice junctions)
- For ChIP-seq: Use --extendReads with smaller bin sizes
- Never apply --ignoreDuplicates after GC bias correction
**Common Usage:**
```bash
# Basic coverage with RPKM normalization
bamCoverage --bam input.bam --outFileName coverage.bw --normalizeUsing RPKM
# ChIP-seq with extension
bamCoverage --bam chip.bam --outFileName chip_coverage.bw \
--binSize 10 --extendReads 200 --ignoreDuplicates
# Strand-specific RNA-seq
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
--filterRNAstrand forward
```
---
### bamCompare
Compares two BAM files by generating bigWig or bedGraph files, normalizing for sequencing depth differences. Processes genome in equal-sized bins and performs per-bin calculations.
**Comparison Methods:**
- **log2** (default): Log2 ratio of samples
- **ratio**: Direct ratio calculation
- **subtract**: Difference between files
- **add**: Sum of samples
- **mean**: Average across samples
- **reciprocal_ratio**: Ratio, with values below 1 reported as the negative reciprocal (symmetric fold changes)
- **first/second**: Output scaled signal from single file
**Normalization Methods:**
- **readCount** (default): Compensates for sequencing depth
- **SES**: Signal extraction scaling
- **RPKM**: Reads per kilobase per million
- **CPM**: Counts per million
- **BPM**: Bins per million
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
**Key Parameters:**
- `--bamfile1, -b1`: First BAM file (required)
- `--bamfile2, -b2`: Second BAM file (required)
- `--outFileName, -o`: Output filename (required)
- `--outFileFormat`: bigwig or bedgraph
- `--operation`: Comparison method (see above)
- `--scaleFactorsMethod`: Normalization method (see above)
- `--binSize`: Bin width for output (default: 50bp)
- `--pseudocount`: Avoid division by zero (default: 1)
- `--extendReads`: Extend reads to fragment length
- `--ignoreDuplicates`: Count identical reads once
- `--minMappingQuality`: Quality threshold
- `--numberOfProcessors, -p`: Parallelization
**Common Usage:**
```bash
# Log2 ratio of treatment vs control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw
# Subtract control from treatment
bamCompare -b1 treatment.bam -b2 control.bam -o difference.bw \
--operation subtract --scaleFactorsMethod readCount
```
---
### correctGCBias / computeGCBias
**computeGCBias:** Identifies GC-content bias from sequencing and PCR amplification.
**correctGCBias:** Corrects BAM files for GC bias detected by computeGCBias.
**Key Parameters (computeGCBias):**
- `--bamfile, -b`: Input BAM file
- `--effectiveGenomeSize`: Mappable genome size
- `--genome, -g`: Reference genome in 2bit format
- `--fragmentLength, -l`: Fragment length (for single-end)
- `--biasPlot`: Output diagnostic plot
**Key Parameters (correctGCBias):**
- `--bamfile, -b`: Input BAM file
- `--effectiveGenomeSize`: Mappable genome size
- `--genome, -g`: Reference genome in 2bit format
- `--GCbiasFrequenciesFile`: Frequencies from computeGCBias
- `--correctedFile, -o`: Output corrected BAM
**Important:** Never use --ignoreDuplicates after GC bias correction
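**Common Usage** (mirrors the ChIP-seq QC workflow in this skill; the frequencies-file flag spellings follow that example):
```bash
# Measure GC bias
computeGCBias --bamfile sample.bam --effectiveGenomeSize 2913022398 \
  --genome genome.2bit --fragmentLength 200 \
  --biasPlot GCbias.png --frequenciesFile freq.txt
# Correct the BAM file with the measured frequencies
correctGCBias --bamfile sample.bam --effectiveGenomeSize 2913022398 \
  --genome genome.2bit --GCbiasFrequenciesFile freq.txt \
  --correctedFile sample_GCcorrected.bam
```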
---
### alignmentSieve
Filters BAM files by various quality metrics on-the-fly. Useful for creating filtered BAM files for specific analyses.
**Key Parameters:**
- `--bam, -b`: Input BAM file
- `--outFile, -o`: Output BAM file
- `--minMappingQuality`: Minimum mapping quality
- `--ignoreDuplicates`: Remove duplicates
- `--minFragmentLength / --maxFragmentLength`: Fragment length filters
- `--samFlagInclude / --samFlagExclude`: SAM flag filtering
- `--shift`: Shift reads (e.g., for ATACseq Tn5 correction)
- `--ATACshift`: Automatically shift for ATAC-seq data
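**Common Usage** (a sketch based on the ATAC-seq workflow in this skill):
```bash
# Tn5 offset correction plus fragment-length filtering for ATAC-seq
alignmentSieve --bam atacseq.bam --outFile atacseq_shifted.bam \
  --ATACshift --minFragmentLength 38 --maxFragmentLength 2000 \
  --ignoreDuplicates
```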
---
### computeMatrix
Calculates scores per genomic region and prepares matrices for plotHeatmap and plotProfile. Processes bigWig score files and BED/GTF region files.
**Modes:**
- **reference-point**: Signal distribution relative to specific position (TSS, TES, or center)
- **scale-regions**: Signal across regions standardized to uniform lengths
**Key Parameters:**
- `-R`: Region file(s) in BED/GTF format (required)
- `-S`: BigWig score file(s) (required)
- `-o`: Output matrix file (required)
- `-b`: Upstream distance from reference point
- `-a`: Downstream distance from reference point
- `-m`: Region body length (scale-regions only)
- `-bs, --binSize`: Bin size for averaging scores
- `--skipZeros`: Skip regions with all zeros
- `--minThreshold / --maxThreshold`: Filter by signal intensity
- `--sortRegions`: descend, ascend, keep, no
- `--sortUsing`: mean, median, max, min, sum, region_length
- `-p, --numberOfProcessors`: Parallel processing
- `--averageTypeBins`: Statistical method (mean, median, min, max, sum, std)
**Output Options:**
- `--outFileNameMatrix`: Export tab-delimited data
- `--outFileSortedRegions`: Save filtered/sorted BED file
**Common Usage:**
```bash
# TSS analysis
computeMatrix reference-point -S signal.bw -R genes.bed \
-o matrix.gz -b 2000 -a 2000 --referencePoint TSS
# Scaled gene body
computeMatrix scale-regions -S signal.bw -R genes.bed \
-o matrix.gz -b 1000 -a 1000 -m 3000
```
---
## Quality Control Tools
### plotFingerprint
Quality control tool primarily for ChIP-seq experiments. Assesses whether antibody enrichment was successful. Generates cumulative read coverage profiles to distinguish signal from noise.
**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (required)
- `--plotFile, -plot, -o`: Output image filename (required)
- `--extendReads, -e`: Extend reads to fragment length
- `--ignoreDuplicates`: Count identical reads once
- `--minMappingQuality`: Mapping quality filter
- `--centerReads`: Center reads at fragment length
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
- `--outRawCounts`: Save per-bin read counts
- `--outQualityMetrics`: Output QC metrics (Jensen-Shannon distance)
- `--labels`: Custom sample names
- `--numberOfProcessors, -p`: Parallel processing
**Interpretation:**
- Ideal control: Straight diagonal line
- Strong ChIP: Steep rise towards highest rank (concentrated reads in few bins)
- Weak enrichment: Flatter curve approaching diagonal
**Common Usage:**
```bash
plotFingerprint -b input.bam chip1.bam chip2.bam \
--labels Input ChIP1 ChIP2 -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
```
---
### plotCoverage
Visualizes average read distribution across the genome. Shows genome coverage and helps determine if sequencing depth is adequate.
**Key Parameters:**
- `--bamfiles, -b`: BAM files to analyze (required)
- `--plotFile, -o`: Output plot filename (required)
- `--ignoreDuplicates`: Remove PCR duplicates
- `--minMappingQuality`: Quality threshold
- `--outRawCounts`: Save underlying data
- `--labels`: Sample names
- `--numberOfSamples`: Number of positions to sample (default: 1,000,000)
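**Common Usage** (a sketch; labels and file names are placeholders):
```bash
plotCoverage --bamfiles Input.bam ChIP1.bam ChIP2.bam \
  --labels Input ChIP_rep1 ChIP_rep2 \
  --plotFile coverage.png \
  --ignoreDuplicates --numberOfProcessors 8
```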
---
### bamPEFragmentSize
Determines fragment length distribution for paired-end sequencing data. Essential QC to verify expected fragment sizes from library preparation.
**Key Parameters:**
- `--bamfiles, -b`: BAM files (required)
- `--histogram, -hist`: Output histogram filename (required)
- `--plotTitle, -T`: Plot title
- `--maxFragmentLength`: Maximum length to consider (default: 1000)
- `--logScale`: Use logarithmic Y-axis
- `--outRawFragmentLengths`: Save raw fragment lengths
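**Common Usage** (a sketch; file names are placeholders):
```bash
bamPEFragmentSize --bamfiles Input.bam ChIP.bam \
  --histogram fragmentSizes.png \
  --plotTitle "Fragment Size Distribution"
```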
---
### plotCorrelation
Analyzes sample correlations from multiBamSummary or multiBigwigSummary outputs. Shows how similar different samples are.
**Correlation Methods:**
- **Pearson**: Measures metric differences; sensitive to outliers; appropriate for normally distributed data
- **Spearman**: Rank-based; less influenced by outliers; better for non-normal distributions
**Visualization Options:**
- **heatmap**: Color intensity with hierarchical clustering (complete linkage)
- **scatterplot**: Pairwise scatter plots with correlation coefficients
**Key Parameters:**
- `--corData, -in`: Input matrix from multiBamSummary/multiBigwigSummary (required)
- `--corMethod`: pearson or spearman (required)
- `--whatToShow`: heatmap or scatterplot (required)
- `--plotFile, -o`: Output filename (required)
- `--skipZeros`: Exclude zero-value regions
- `--removeOutliers`: Use median absolute deviation (MAD) filtering
- `--outFileCorMatrix`: Export correlation matrix
- `--labels`: Custom sample names
- `--plotTitle`: Plot title
- `--colorMap`: Color scheme (50+ options)
- `--plotNumbers`: Display correlation values on heatmap
**Common Usage:**
```bash
# Heatmap with Pearson correlation
plotCorrelation -in readCounts.npz --corMethod pearson \
--whatToShow heatmap -o correlation_heatmap.png --plotNumbers
# Scatterplot with Spearman correlation
plotCorrelation -in readCounts.npz --corMethod spearman \
--whatToShow scatterplot -o correlation_scatter.png
```
---
### plotPCA
Generates principal component analysis plots from multiBamSummary or multiBigwigSummary output. Displays sample relationships in reduced dimensionality.
**Key Parameters:**
- `--corData, -in`: Coverage file from multiBamSummary/multiBigwigSummary (required)
- `--plotFile, -o`: Output image (png, eps, pdf, svg) (required)
- `--outFileNameData`: Export PCA data (loadings/rotation and eigenvalues)
- `--labels, -l`: Custom sample labels
- `--plotTitle, -T`: Plot title
- `--plotHeight / --plotWidth`: Dimensions in centimeters
- `--colors`: Custom symbol colors
- `--markers`: Symbol shapes
- `--transpose`: Perform PCA on transposed matrix (rows=samples)
- `--ntop`: Use top N variable rows (default: 1000)
- `--PCs`: Components to plot (default: 1 2)
- `--log2`: Log2-transform data before analysis
- `--rowCenter`: Center each row at 0
**Common Usage:**
```bash
plotPCA -in readCounts.npz -o PCA_plot.png \
-T "PCA of read counts" --transpose
```
---
## Visualization Tools
### plotHeatmap
Creates genomic region heatmaps from computeMatrix output. Generates publication-quality visualizations.
**Key Parameters:**
- `--matrixFile, -m`: Matrix from computeMatrix (required)
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
- `--outFileSortedRegions`: Save regions after filtering
- `--outFileNameMatrix`: Export matrix values
- `--interpolationMethod`: auto, nearest, bilinear, bicubic, gaussian
- Default: nearest (≤1000 columns), bilinear (>1000 columns)
- `--dpi`: Figure resolution
**Clustering:**
- `--kmeans`: k-means clustering
- `--hclust`: Hierarchical clustering (slower for >1000 regions)
- `--silhouette`: Calculate cluster quality metrics
**Visual Customization:**
- `--heatmapHeight / --heatmapWidth`: Dimensions (3-100 cm)
- `--whatToShow`: plot, heatmap, colorbar (combinations)
- `--alpha`: Transparency (0-1)
- `--colorMap`: 50+ color schemes
- `--colorList`: Custom gradient colors
- `--zMin / --zMax`: Intensity scale limits
- `--boxAroundHeatmaps`: yes/no (default: yes)
**Labels:**
- `--xAxisLabel / --yAxisLabel`: Axis labels
- `--regionsLabel`: Region set identifiers
- `--samplesLabel`: Sample names
- `--refPointLabel`: Reference point label
- `--startLabel / --endLabel`: Region boundary labels
**Common Usage:**
```bash
# Basic heatmap
plotHeatmap -m matrix.gz -o heatmap.png
# With clustering and custom colors
plotHeatmap -m matrix.gz -o heatmap.png \
--kmeans 3 --colorMap RdBu --zMin -3 --zMax 3
```
---
### plotProfile
Generates profile plots showing scores across genomic regions using computeMatrix output.
**Key Parameters:**
- `--matrixFile, -m`: Matrix from computeMatrix (required)
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
- `--plotType`: lines, fill, se, std, overlapped_lines, heatmap
- `--colors`: Color palette (names or hex codes)
- `--plotHeight / --plotWidth`: Dimensions in centimeters
- `--yMin / --yMax`: Y-axis range
- `--averageType`: mean, median, min, max, std, sum
**Clustering:**
- `--kmeans`: k-means clustering
- `--hclust`: Hierarchical clustering
- `--silhouette`: Cluster quality metrics
**Labels:**
- `--plotTitle`: Main heading
- `--regionsLabel`: Region set identifiers
- `--samplesLabel`: Sample names
- `--startLabel / --endLabel`: Region boundary labels (scale-regions mode)
**Output Options:**
- `--outFileNameData`: Export data as tab-separated values
- `--outFileSortedRegions`: Save filtered/sorted regions as BED
**Common Usage:**
```bash
# Line plot
plotProfile -m matrix.gz -o profile.png --plotType lines
# With standard error shading
plotProfile -m matrix.gz -o profile.png --plotType se \
--colors blue red green
```
---
### plotEnrichment
Calculates and visualizes signal enrichment across genomic regions. Measures the percentage of alignments overlapping region groups. Useful for computing FRiP (fraction of reads in peaks) scores.
**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (required)
- `--BED`: Region files in BED/GTF format (required)
- `--plotFile, -o`: Output visualization (png, pdf, eps, svg)
- `--labels, -l`: Custom sample identifiers
- `--outRawCounts`: Export numerical data
- `--perSample`: Group by sample instead of feature (default)
- `--regionLabels`: Custom region names
**Read Processing:**
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
- `--minMappingQuality`: Quality threshold
- `--samFlagInclude / --samFlagExclude`: SAM flag filters
- `--ignoreDuplicates`: Remove duplicates
- `--centerReads`: Center reads for sharper signal
**Common Usage:**
```bash
plotEnrichment -b Input.bam H3K4me3.bam \
--BED peaks_up.bed peaks_down.bed \
--regionLabels "Up regulated" "Down regulated" \
-o enrichment.png
```
---
## Miscellaneous Tools
### computeMatrixOperations
Advanced matrix manipulation tool for combining or subsetting matrices from computeMatrix. Enables complex multi-sample, multi-region analyses.
**Operations:**
- `cbind`: Combine matrices column-wise
- `rbind`: Combine matrices row-wise
- `subset`: Extract specific samples or regions
- `filterStrand`: Keep only regions on specific strand
- `filterValues`: Apply signal intensity filters
- `sort`: Order regions by various criteria
- `dataRange`: Report min/max values
**Common Usage:**
```bash
# Combine matrices
computeMatrixOperations cbind -m matrix1.gz matrix2.gz -o combined.gz
# Extract specific samples
computeMatrixOperations subset -m matrix.gz --samples 0 2 -o subset.gz
```
---
### estimateReadFiltering
Predicts the impact of various filtering parameters without actually filtering. Helps optimize filtering strategies before running full analyses.
**Key Parameters:**
- `--bamfiles, -b`: BAM files to analyze
- `--sampleSize`: Number of reads to sample (default: 100,000)
- `--binSize`: Bin size for analysis
- `--distanceBetweenBins`: Spacing between sampled bins
**Filtration Options to Test:**
- `--minMappingQuality`: Test quality thresholds
- `--ignoreDuplicates`: Assess duplicate impact
- `--minFragmentLength / --maxFragmentLength`: Test fragment filters
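**Common Usage** (a minimal sketch using only the parameters listed above; file names are placeholders):
```bash
# Preview how many reads the filters would remove before a full run
estimateReadFiltering --bamfiles sample1.bam sample2.bam \
  --minMappingQuality 30 --ignoreDuplicates
```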
---
## Common Parameters Across Tools
Many deepTools commands share these filtering and performance options:
**Read Filtering:**
- `--ignoreDuplicates`: Remove PCR duplicates
- `--minMappingQuality`: Filter by alignment confidence
- `--samFlagInclude / --samFlagExclude`: SAM format filtering
- `--minFragmentLength / --maxFragmentLength`: Fragment length bounds
**Performance:**
- `--numberOfProcessors, -p`: Enable parallel processing
- `--region`: Process specific genomic regions (chr:start-end)
**Read Processing:**
- `--extendReads`: Extend to fragment length
- `--centerReads`: Center at fragment midpoint
- `--ignoreDuplicates`: Count unique reads only
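As a sketch, these shared options compose with any of the tools above; here they are combined with bamCoverage on a small test region (values are illustrative):
```bash
bamCoverage --bam sample.bam --outFileName test.bw \
  --minMappingQuality 30 --ignoreDuplicates \
  --region chr1:1-1000000 --numberOfProcessors 8
```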

View File

@@ -0,0 +1,474 @@
# deepTools Common Workflows
This document provides complete workflow examples for common deepTools analyses.
## ChIP-seq Quality Control Workflow
Complete quality control assessment for ChIP-seq experiments.
### Step 1: Initial Correlation Assessment
Compare replicates and samples to verify experimental quality:
```bash
# Generate coverage matrix across genome
multiBamSummary bins \
--bamfiles Input1.bam Input2.bam ChIP1.bam ChIP2.bam \
--labels Input_rep1 Input_rep2 ChIP_rep1 ChIP_rep2 \
-o readCounts.npz \
--numberOfProcessors 8
# Create correlation heatmap
plotCorrelation \
-in readCounts.npz \
--corMethod pearson \
--whatToShow heatmap \
--plotFile correlation_heatmap.png \
--plotNumbers
# Generate PCA plot
plotPCA \
-in readCounts.npz \
-o PCA_plot.png \
-T "PCA of ChIP-seq samples"
```
**Expected Results:**
- Replicates should cluster together
- Input samples should be distinct from ChIP samples
---
### Step 2: Coverage and Depth Assessment
```bash
# Check sequencing depth and coverage
plotCoverage \
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
--labels Input ChIP_rep1 ChIP_rep2 \
--plotFile coverage.png \
--ignoreDuplicates \
--numberOfProcessors 8
```
**Interpretation:** Assess whether sequencing depth is adequate for downstream analysis.
---
### Step 3: Fragment Size Validation (Paired-end)
```bash
# Verify expected fragment sizes
bamPEFragmentSize \
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
--histogram fragmentSizes.png \
--plotTitle "Fragment Size Distribution"
```
**Expected Results:** Fragment sizes should match library preparation protocols (typically 200-600bp for ChIP-seq).
---
### Step 4: GC Bias Detection and Correction
```bash
# Compute GC bias
computeGCBias \
--bamfile ChIP1.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--fragmentLength 200 \
--biasPlot GCbias.png \
--frequenciesFile freq.txt
# If bias detected, correct it
correctGCBias \
--bamfile ChIP1.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--GCbiasFrequenciesFile freq.txt \
--correctedFile ChIP1_GCcorrected.bam
```
**Note:** Only correct if significant bias is observed. Do NOT use `--ignoreDuplicates` with GC-corrected files.
---
### Step 5: ChIP Signal Strength Assessment
```bash
# Evaluate ChIP enrichment quality
plotFingerprint \
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
--labels Input ChIP_rep1 ChIP_rep2 \
--plotFile fingerprint.png \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8 \
--outQualityMetrics fingerprint_metrics.txt
```
**Interpretation:**
- Strong ChIP: Steep rise in cumulative curve
- Weak enrichment: Curve close to diagonal (input-like)
---
## ChIP-seq Analysis Workflow
Complete workflow from BAM files to publication-quality visualizations.
### Step 1: Generate Normalized Coverage Tracks
```bash
# Input control
bamCoverage \
--bam Input.bam \
--outFileName Input_coverage.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
# ChIP sample
bamCoverage \
--bam ChIP.bam \
--outFileName ChIP_coverage.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
```
---
### Step 2: Create Log2 Ratio Track
```bash
# Compare ChIP to Input
bamCompare \
--bamfile1 ChIP.bam \
--bamfile2 Input.bam \
--outFileName ChIP_vs_Input_log2ratio.bw \
--operation log2 \
--scaleFactorsMethod readCount \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
```
**Result:** Log2 ratio track showing enrichment (positive values) and depletion (negative values).
---
### Step 3: Compute Matrix Around TSS
```bash
# Prepare data for heatmap/profile around transcription start sites
computeMatrix reference-point \
--referencePoint TSS \
--scoreFileName ChIP_coverage.bw \
--regionsFileName genes.bed \
--beforeRegionStartLength 3000 \
--afterRegionStartLength 3000 \
--binSize 10 \
--sortRegions descend \
--sortUsing mean \
--outFileName matrix_TSS.gz \
--outFileNameMatrix matrix_TSS.tab \
--numberOfProcessors 8
```
---
### Step 4: Generate Heatmap
```bash
# Create heatmap around TSS
plotHeatmap \
--matrixFile matrix_TSS.gz \
--outFileName heatmap_TSS.png \
--colorMap RdBu \
--whatToShow 'plot, heatmap and colorbar' \
--zMin -3 --zMax 3 \
--yAxisLabel "Genes" \
--xAxisLabel "Distance from TSS (bp)" \
--refPointLabel "TSS" \
--heatmapHeight 15 \
--kmeans 3
```
---
### Step 5: Generate Profile Plot
```bash
# Create meta-profile around TSS
plotProfile \
--matrixFile matrix_TSS.gz \
--outFileName profile_TSS.png \
--plotType lines \
--perGroup \
--colors blue \
--plotTitle "ChIP-seq signal around TSS" \
--yAxisLabel "Average signal" \
--xAxisLabel "Distance from TSS (bp)" \
--refPointLabel "TSS"
```
---
### Step 6: Enrichment at Peaks
```bash
# Calculate enrichment in peak regions
plotEnrichment \
--bamfiles Input.bam ChIP.bam \
--BED peaks.bed \
--labels Input ChIP \
--plotFile enrichment.png \
--outRawCounts enrichment_counts.tab \
--extendReads 200 \
--ignoreDuplicates
```
---
## RNA-seq Coverage Workflow
Generate strand-specific coverage tracks for RNA-seq data.
### Forward Strand
```bash
bamCoverage \
--bam rnaseq.bam \
--outFileName forward_coverage.bw \
--filterRNAstrand forward \
--normalizeUsing CPM \
--binSize 1 \
--numberOfProcessors 8
```
### Reverse Strand
```bash
bamCoverage \
--bam rnaseq.bam \
--outFileName reverse_coverage.bw \
--filterRNAstrand reverse \
--normalizeUsing CPM \
--binSize 1 \
--numberOfProcessors 8
```
**Important:** Do NOT use `--extendReads` for RNA-seq (would extend over splice junctions).
---
## Multi-Sample Comparison Workflow
Compare multiple ChIP-seq samples (e.g., different conditions or time points).
### Step 1: Generate Coverage Files
```bash
# For each sample
for sample in Control_ChIP Treated_ChIP; do
bamCoverage \
--bam ${sample}.bam \
--outFileName ${sample}.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
done
```
---
### Step 2: Compute Multi-Sample Matrix
```bash
computeMatrix scale-regions \
--scoreFileName Control_ChIP.bw Treated_ChIP.bw \
--regionsFileName genes.bed \
--beforeRegionStartLength 1000 \
--afterRegionStartLength 1000 \
--regionBodyLength 3000 \
--binSize 10 \
--sortRegions descend \
--sortUsing mean \
--outFileName matrix_multi.gz \
--numberOfProcessors 8
```
---
### Step 3: Multi-Sample Heatmap
```bash
plotHeatmap \
--matrixFile matrix_multi.gz \
--outFileName heatmap_comparison.png \
--colorMap Blues \
--whatToShow 'plot, heatmap and colorbar' \
--samplesLabel Control Treated \
--yAxisLabel "Genes" \
--heatmapHeight 15 \
--kmeans 4
```
---
### Step 4: Multi-Sample Profile
```bash
plotProfile \
--matrixFile matrix_multi.gz \
--outFileName profile_comparison.png \
--plotType lines \
--perGroup \
--colors blue red \
--samplesLabel Control Treated \
--plotTitle "ChIP-seq signal comparison" \
--startLabel "TSS" \
--endLabel "TES"
```
---
## ATAC-seq Workflow
Specialized workflow for ATAC-seq data with Tn5 offset correction.
### Step 1: Shift Reads for Tn5 Correction
```bash
alignmentSieve \
--bam atacseq.bam \
--outFile atacseq_shifted.bam \
--ATACshift \
--minFragmentLength 38 \
--maxFragmentLength 2000 \
--ignoreDuplicates
```
---
### Step 2: Generate Coverage Track
```bash
bamCoverage \
--bam atacseq_shifted.bam \
--outFileName atacseq_coverage.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 1 \
--numberOfProcessors 8
```
---
### Step 3: Fragment Size Analysis
```bash
bamPEFragmentSize \
--bamfiles atacseq.bam \
--histogram fragmentSizes_atac.png \
--maxFragmentLength 1000
```
**Expected Pattern:** Nucleosome ladder with peaks at ~50bp (nucleosome-free), ~200bp (mono-nucleosome), ~400bp (di-nucleosome).
---
## Peak Region Analysis Workflow
Analyze ChIP-seq signal specifically at peak regions.
### Step 1: Matrix at Peaks
```bash
computeMatrix reference-point \
--referencePoint center \
--scoreFileName ChIP_coverage.bw \
--regionsFileName peaks.bed \
--beforeRegionStartLength 2000 \
--afterRegionStartLength 2000 \
--binSize 10 \
--outFileName matrix_peaks.gz \
--numberOfProcessors 8
```
---
### Step 2: Heatmap at Peaks
```bash
plotHeatmap \
--matrixFile matrix_peaks.gz \
--outFileName heatmap_peaks.png \
--colorMap YlOrRd \
--refPointLabel "Peak Center" \
--heatmapHeight 15 \
--sortUsing max
```
---
## Troubleshooting Common Issues
### Issue: Out of Memory
**Solution:** Use `--region` parameter to process chromosomes individually:
```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```
### Issue: BAM Index Missing
**Solution:** Index BAM files before running deepTools:
```bash
samtools index input.bam
```
### Issue: Slow Processing
**Solution:** Increase `--numberOfProcessors`:
```bash
# Use 8 cores instead of default
--numberOfProcessors 8
```
### Issue: bigWig Files Too Large
**Solution:** Increase bin size:
```bash
--binSize 50   # or larger (bamCoverage default is 50)
```
---
## Performance Tips
1. **Use multiple processors:** Always set `--numberOfProcessors` to available cores
2. **Process regions:** Use `--region` for testing or memory-limited environments
3. **Adjust bin size:** Larger bins = faster processing and smaller files
4. **Pre-filter BAM files:** Use `alignmentSieve` to create filtered BAM files once, then reuse
5. **Use bigWig over bedGraph:** bigWig format is compressed and faster to process
---
## Best Practices
1. **Always check QC first:** Run correlation, coverage, and fingerprint analysis before proceeding
2. **Document parameters:** Save command lines for reproducibility
3. **Use consistent normalization:** Apply same normalization method across samples in a comparison
4. **Verify reference genome match:** Ensure BAM files and region files use same genome build
5. **Check strand orientation:** For RNA-seq, verify correct strand orientation
6. **Test on small regions first:** Use `--region chr1:1-1000000` for testing parameters
7. **Keep intermediate files:** Save matrices for regenerating plots with different settings

View File

@@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
deepTools File Validation Script
Validates BAM, bigWig, and BED files for deepTools analysis.
Checks for file existence, proper indexing, and basic format requirements.
"""
import os
import sys
import argparse
from pathlib import Path
def check_file_exists(filepath):
"""Check if file exists and is readable."""
if not os.path.exists(filepath):
return False, f"File not found: {filepath}"
if not os.access(filepath, os.R_OK):
return False, f"File not readable: {filepath}"
return True, f"✓ File exists: {filepath}"
def check_bam_index(bam_file):
"""Check if BAM file has an index (.bai or .bam.bai)."""
bai_file1 = bam_file + ".bai"
bai_file2 = bam_file.replace(".bam", ".bai")
if os.path.exists(bai_file1):
return True, f"✓ BAM index found: {bai_file1}"
elif os.path.exists(bai_file2):
return True, f"✓ BAM index found: {bai_file2}"
else:
return False, f"✗ BAM index missing for: {bam_file}\n Run: samtools index {bam_file}"
def check_bigwig_file(bw_file):
"""Basic check for bigWig file."""
# Check file size (bigWig files should have reasonable size)
file_size = os.path.getsize(bw_file)
if file_size < 100:
return False, f"✗ bigWig file suspiciously small: {bw_file} ({file_size} bytes)"
return True, f"✓ bigWig file appears valid: {bw_file} ({file_size} bytes)"
def check_bed_file(bed_file):
"""Basic validation of BED file format."""
try:
with open(bed_file, 'r') as f:
lines = [line.strip() for line in f if line.strip() and not line.startswith('#')]
if len(lines) == 0:
return False, f"✗ BED file is empty: {bed_file}"
# Check first few lines for basic format
for i, line in enumerate(lines[:10], 1):
fields = line.split('\t')
if len(fields) < 3:
return False, f"✗ BED file format error at line {i}: expected at least 3 columns\n Line: {line}"
# Check if start and end are integers
try:
start = int(fields[1])
end = int(fields[2])
if start >= end:
return False, f"✗ BED file error at line {i}: start >= end ({start} >= {end})"
except ValueError:
return False, f"✗ BED file format error at line {i}: start and end must be integers\n Line: {line}"
return True, f"✓ BED file format appears valid: {bed_file} ({len(lines)} regions)"
except Exception as e:
return False, f"✗ Error reading BED file: {bed_file}\n Error: {str(e)}"
def validate_files(bam_files=None, bigwig_files=None, bed_files=None):
"""
Validate all provided files.
Args:
bam_files: List of BAM file paths
bigwig_files: List of bigWig file paths
bed_files: List of BED file paths
Returns:
Tuple of (success: bool, messages: list)
"""
all_success = True
messages = []
# Validate BAM files
if bam_files:
messages.append("\n=== Validating BAM Files ===")
for bam_file in bam_files:
# Check existence
success, msg = check_file_exists(bam_file)
messages.append(msg)
if not success:
all_success = False
continue
# Check index
success, msg = check_bam_index(bam_file)
messages.append(msg)
if not success:
all_success = False
# Validate bigWig files
if bigwig_files:
messages.append("\n=== Validating bigWig Files ===")
for bw_file in bigwig_files:
# Check existence
success, msg = check_file_exists(bw_file)
messages.append(msg)
if not success:
all_success = False
continue
# Basic bigWig check
success, msg = check_bigwig_file(bw_file)
messages.append(msg)
if not success:
all_success = False
# Validate BED files
if bed_files:
messages.append("\n=== Validating BED Files ===")
for bed_file in bed_files:
# Check existence
success, msg = check_file_exists(bed_file)
messages.append(msg)
if not success:
all_success = False
continue
# Check BED format
success, msg = check_bed_file(bed_file)
messages.append(msg)
if not success:
all_success = False
return all_success, messages
def main():
parser = argparse.ArgumentParser(
description="Validate files for deepTools analysis",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Validate BAM files
python validate_files.py --bam sample1.bam sample2.bam
# Validate all file types
python validate_files.py --bam input.bam chip.bam --bed peaks.bed --bigwig signal.bw
# Validate from a directory
python validate_files.py --bam *.bam --bed *.bed
"""
)
parser.add_argument('--bam', nargs='+', help='BAM files to validate')
parser.add_argument('--bigwig', '--bw', nargs='+', help='bigWig files to validate')
parser.add_argument('--bed', nargs='+', help='BED files to validate')
args = parser.parse_args()
# Check if any files were provided
if not any([args.bam, args.bigwig, args.bed]):
parser.print_help()
sys.exit(1)
# Run validation
success, messages = validate_files(
bam_files=args.bam,
bigwig_files=args.bigwig,
bed_files=args.bed
)
# Print results
for msg in messages:
print(msg)
# Summary
print("\n" + "="*50)
if success:
print("✓ All validations passed!")
sys.exit(0)
else:
print("✗ Some validations failed. Please fix the issues above.")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,454 @@
#!/usr/bin/env python3
"""
deepTools Workflow Generator
Generates bash script templates for common deepTools workflows.
"""
import argparse
import sys
WORKFLOWS = {
'chipseq_qc': {
'name': 'ChIP-seq Quality Control',
'description': 'Complete QC workflow for ChIP-seq experiments',
},
'chipseq_analysis': {
'name': 'ChIP-seq Complete Analysis',
'description': 'Full ChIP-seq analysis from BAM to heatmaps',
},
'rnaseq_coverage': {
'name': 'RNA-seq Coverage Tracks',
'description': 'Generate strand-specific RNA-seq coverage',
},
'atacseq': {
'name': 'ATAC-seq Analysis',
'description': 'ATAC-seq workflow with Tn5 correction',
},
}
def generate_chipseq_qc_workflow(output_file, params):
"""Generate ChIP-seq QC workflow script."""
script = f"""#!/bin/bash
# deepTools ChIP-seq Quality Control Workflow
# Generated by deepTools workflow generator
# Configuration
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
CHIP_BAM=("{params.get('chip_bams', 'ChIP1.bam ChIP2.bam')}")
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'deeptools_qc')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting ChIP-seq QC workflow ==="
# Step 1: Correlation analysis
echo "Step 1: Computing correlation matrix..."
multiBamSummary bins \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
-o $OUTPUT_DIR/readCounts.npz \\
--numberOfProcessors $THREADS
echo "Step 2: Generating correlation heatmap..."
plotCorrelation \\
-in $OUTPUT_DIR/readCounts.npz \\
--corMethod pearson \\
--whatToShow heatmap \\
--plotFile $OUTPUT_DIR/correlation_heatmap.png \\
--plotNumbers
echo "Step 3: Generating PCA plot..."
plotPCA \\
-in $OUTPUT_DIR/readCounts.npz \\
-o $OUTPUT_DIR/PCA_plot.png \\
-T "PCA of ChIP-seq samples"
# Step 2: Coverage assessment
echo "Step 4: Assessing coverage..."
plotCoverage \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
--plotFile $OUTPUT_DIR/coverage.png \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
# Step 3: Fragment size (for paired-end data)
echo "Step 5: Analyzing fragment sizes..."
bamPEFragmentSize \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
--histogram $OUTPUT_DIR/fragmentSizes.png \\
--plotTitle "Fragment Size Distribution"
# Step 4: ChIP signal strength
echo "Step 6: Evaluating ChIP enrichment..."
plotFingerprint \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
--plotFile $OUTPUT_DIR/fingerprint.png \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS \\
--outQualityMetrics $OUTPUT_DIR/fingerprint_metrics.txt
echo "=== ChIP-seq QC workflow complete ==="
echo "Results are in: $OUTPUT_DIR"
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated ChIP-seq QC workflow: {output_file}"
def generate_chipseq_analysis_workflow(output_file, params):
"""Generate complete ChIP-seq analysis workflow script."""
script = f"""#!/bin/bash
# deepTools ChIP-seq Complete Analysis Workflow
# Generated by deepTools workflow generator
# Configuration
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
CHIP_BAM="{params.get('chip_bam', 'ChIP.bam')}"
GENES_BED="{params.get('genes_bed', 'genes.bed')}"
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'chipseq_analysis')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting ChIP-seq analysis workflow ==="
# Step 1: Generate normalized coverage tracks
echo "Step 1: Generating coverage tracks..."
bamCoverage \\
--bam $INPUT_BAM \\
--outFileName $OUTPUT_DIR/Input_coverage.bw \\
--normalizeUsing RPGC \\
--effectiveGenomeSize $GENOME_SIZE \\
--binSize 10 \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
bamCoverage \\
--bam $CHIP_BAM \\
--outFileName $OUTPUT_DIR/ChIP_coverage.bw \\
--normalizeUsing RPGC \\
--effectiveGenomeSize $GENOME_SIZE \\
--binSize 10 \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
# Step 2: Create log2 ratio track
echo "Step 2: Creating log2 ratio track..."
bamCompare \\
--bamfile1 $CHIP_BAM \\
--bamfile2 $INPUT_BAM \\
--outFileName $OUTPUT_DIR/ChIP_vs_Input_log2ratio.bw \\
--operation log2 \\
--scaleFactorsMethod readCount \\
--binSize 10 \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
# Step 3: Compute matrix around TSS
echo "Step 3: Computing matrix around TSS..."
computeMatrix reference-point \\
--referencePoint TSS \\
--scoreFileName $OUTPUT_DIR/ChIP_coverage.bw \\
--regionsFileName $GENES_BED \\
--beforeRegionStartLength 3000 \\
--afterRegionStartLength 3000 \\
--binSize 10 \\
--sortRegions descend \\
--sortUsing mean \\
--outFileName $OUTPUT_DIR/matrix_TSS.gz \\
--numberOfProcessors $THREADS
# Step 4: Generate heatmap
echo "Step 4: Generating heatmap..."
plotHeatmap \\
--matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
--outFileName $OUTPUT_DIR/heatmap_TSS.png \\
--colorMap RdBu \\
--whatToShow 'plot, heatmap and colorbar' \\
--yAxisLabel "Genes" \\
--xAxisLabel "Distance from TSS (bp)" \\
--refPointLabel "TSS" \\
--heatmapHeight 15 \\
--kmeans 3
# Step 5: Generate profile plot
echo "Step 5: Generating profile plot..."
plotProfile \\
--matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
--outFileName $OUTPUT_DIR/profile_TSS.png \\
--plotType lines \\
--perGroup \\
--colors blue \\
--plotTitle "ChIP-seq signal around TSS" \\
--yAxisLabel "Average signal" \\
--refPointLabel "TSS"
# Step 6: Enrichment at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
echo "Step 6: Calculating enrichment at peaks..."
plotEnrichment \\
--bamfiles $INPUT_BAM $CHIP_BAM \\
--BED $PEAKS_BED \\
--labels Input ChIP \\
--plotFile $OUTPUT_DIR/enrichment.png \\
--outRawCounts $OUTPUT_DIR/enrichment_counts.tab \\
--extendReads 200 \\
--ignoreDuplicates
fi
echo "=== ChIP-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated ChIP-seq analysis workflow: {output_file}"
def generate_rnaseq_coverage_workflow(output_file, params):
"""Generate RNA-seq coverage workflow script."""
script = f"""#!/bin/bash
# deepTools RNA-seq Coverage Workflow
# Generated by deepTools workflow generator
# Configuration
RNASEQ_BAM="{params.get('rnaseq_bam', 'rnaseq.bam')}"
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'rnaseq_coverage')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting RNA-seq coverage workflow ==="
# Generate strand-specific coverage tracks
echo "Step 1: Generating forward strand coverage..."
bamCoverage \\
--bam $RNASEQ_BAM \\
--outFileName $OUTPUT_DIR/forward_coverage.bw \\
--filterRNAstrand forward \\
--normalizeUsing CPM \\
--binSize 1 \\
--numberOfProcessors $THREADS
echo "Step 2: Generating reverse strand coverage..."
bamCoverage \\
--bam $RNASEQ_BAM \\
--outFileName $OUTPUT_DIR/reverse_coverage.bw \\
--filterRNAstrand reverse \\
--normalizeUsing CPM \\
--binSize 1 \\
--numberOfProcessors $THREADS
echo "=== RNA-seq coverage workflow complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Note: These bigWig files can be loaded into genome browsers"
echo "for strand-specific visualization of RNA-seq data."
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated RNA-seq coverage workflow: {output_file}"
def generate_atacseq_workflow(output_file, params):
"""Generate ATAC-seq workflow script."""
script = f"""#!/bin/bash
# deepTools ATAC-seq Analysis Workflow
# Generated by deepTools workflow generator
# Configuration
ATAC_BAM="{params.get('atac_bam', 'atacseq.bam')}"
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'atacseq_analysis')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting ATAC-seq analysis workflow ==="
# Step 1: Shift reads for Tn5 correction
echo "Step 1: Applying Tn5 offset correction..."
alignmentSieve \\
--bam $ATAC_BAM \\
--outFile $OUTPUT_DIR/atacseq_shifted.bam \\
--ATACshift \\
--minFragmentLength 38 \\
--maxFragmentLength 2000 \\
--ignoreDuplicates
# Index the shifted BAM
samtools index $OUTPUT_DIR/atacseq_shifted.bam
# Step 2: Generate coverage track
echo "Step 2: Generating coverage track..."
bamCoverage \\
--bam $OUTPUT_DIR/atacseq_shifted.bam \\
--outFileName $OUTPUT_DIR/atacseq_coverage.bw \\
--normalizeUsing RPGC \\
--effectiveGenomeSize $GENOME_SIZE \\
--binSize 1 \\
--numberOfProcessors $THREADS
# Step 3: Fragment size analysis
echo "Step 3: Analyzing fragment sizes..."
bamPEFragmentSize \\
--bamfiles $ATAC_BAM \\
--histogram $OUTPUT_DIR/fragmentSizes.png \\
--maxFragmentLength 1000
# Step 4: Compute matrix at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
echo "Step 4: Computing matrix at peaks..."
computeMatrix reference-point \\
--referencePoint center \\
--scoreFileName $OUTPUT_DIR/atacseq_coverage.bw \\
--regionsFileName $PEAKS_BED \\
--beforeRegionStartLength 2000 \\
--afterRegionStartLength 2000 \\
--binSize 10 \\
--outFileName $OUTPUT_DIR/matrix_peaks.gz \\
--numberOfProcessors $THREADS
echo "Step 5: Generating heatmap..."
plotHeatmap \\
--matrixFile $OUTPUT_DIR/matrix_peaks.gz \\
--outFileName $OUTPUT_DIR/heatmap_peaks.png \\
--colorMap YlOrRd \\
--refPointLabel "Peak Center" \\
--heatmapHeight 15
fi
echo "=== ATAC-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Expected fragment size pattern:"
echo " ~50bp: nucleosome-free regions"
echo " ~200bp: mono-nucleosome"
echo " ~400bp: di-nucleosome"
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated ATAC-seq workflow: {output_file}"
def main():
parser = argparse.ArgumentParser(
description="Generate deepTools workflow scripts",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=f"""
Available workflows:
{chr(10).join(f" {key}: {value['name']}" for key, value in WORKFLOWS.items())}
Examples:
# Generate ChIP-seq QC workflow
python workflow_generator.py chipseq_qc -o chipseq_qc.sh
# Generate ChIP-seq analysis with custom parameters
python workflow_generator.py chipseq_analysis -o analysis.sh \\
--chip-bam H3K4me3.bam --input-bam Input.bam
# List all available workflows
python workflow_generator.py --list
"""
)
parser.add_argument('workflow', nargs='?', choices=list(WORKFLOWS.keys()),
help='Workflow type to generate')
parser.add_argument('-o', '--output', default='deeptools_workflow.sh',
help='Output script filename (default: deeptools_workflow.sh)')
parser.add_argument('--list', action='store_true',
help='List all available workflows')
# Common parameters
parser.add_argument('--threads', type=int, default=8,
help='Number of threads (default: 8)')
parser.add_argument('--genome-size', type=int, default=2913022398,
help='Effective genome size (default: 2913022398 for hg38)')
parser.add_argument('--output-dir', default=None,
help='Output directory for results')
# Workflow-specific parameters
parser.add_argument('--input-bam', help='Input/control BAM file')
parser.add_argument('--chip-bam', help='ChIP BAM file')
parser.add_argument('--chip-bams', help='Multiple ChIP BAM files (space-separated)')
parser.add_argument('--rnaseq-bam', help='RNA-seq BAM file')
parser.add_argument('--atac-bam', help='ATAC-seq BAM file')
parser.add_argument('--genes-bed', help='Genes BED file')
parser.add_argument('--peaks-bed', help='Peaks BED file')
args = parser.parse_args()
# List workflows
if args.list:
print("\nAvailable deepTools workflows:\n")
for key, value in WORKFLOWS.items():
print(f" {key}")
print(f" {value['name']}")
print(f" {value['description']}\n")
sys.exit(0)
# Check if workflow was specified
if not args.workflow:
parser.print_help()
sys.exit(1)
    # Prepare parameters; drop options that were not provided so that the
    # per-workflow defaults in params.get(key, default) apply instead of "None"
    params = {
        'threads': args.threads,
        'genome_size': args.genome_size,
        'output_dir': args.output_dir or f"{args.workflow}_output",
        'input_bam': args.input_bam,
        'chip_bam': args.chip_bam,
        'chip_bams': args.chip_bams,
        'rnaseq_bam': args.rnaseq_bam,
        'atac_bam': args.atac_bam,
        'genes_bed': args.genes_bed,
        'peaks_bed': args.peaks_bed,
    }
    params = {key: value for key, value in params.items() if value is not None}
# Generate workflow
if args.workflow == 'chipseq_qc':
message = generate_chipseq_qc_workflow(args.output, params)
elif args.workflow == 'chipseq_analysis':
message = generate_chipseq_analysis_workflow(args.output, params)
elif args.workflow == 'rnaseq_coverage':
message = generate_rnaseq_coverage_workflow(args.output, params)
elif args.workflow == 'atacseq':
message = generate_atacseq_workflow(args.output, params)
print(message)
print(f"\nTo run the workflow:")
print(f" chmod +x {args.output}")
print(f" ./{args.output}")
print(f"\nNote: Edit the script to customize file paths and parameters.")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,477 @@
---
name: diffdock
description: This skill provides comprehensive guidance for using DiffDock, a state-of-the-art diffusion-based molecular docking tool that predicts protein-ligand binding poses. Use this skill when users request molecular docking simulations, protein-ligand binding predictions, virtual screening, structure-based drug design tasks, or need to predict how small molecules bind to protein targets. This skill applies to tasks involving PDB files, SMILES strings, protein sequences, ligand structure files, or batch docking of compound libraries.
---
# DiffDock: Molecular Docking with Diffusion Models
## Overview
DiffDock is a diffusion-based deep learning tool for molecular docking that predicts 3D binding poses of small molecule ligands to protein targets. It represents the state-of-the-art in computational docking, crucial for structure-based drug discovery and chemical biology.
**Core Capabilities:**
- Predict ligand binding poses with high accuracy using deep learning
- Support protein structures (PDB files) or sequences (via ESMFold)
- Process single complexes or batch virtual screening campaigns
- Generate confidence scores to assess prediction reliability
- Handle diverse ligand inputs (SMILES, SDF, MOL2)
**Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.
## When to Use DiffDock
Invoke this skill when users request:
- "Dock this ligand to a protein" or "predict binding pose"
- "Run molecular docking" or "perform protein-ligand docking"
- "Virtual screening" or "screen compound library"
- "Where does this molecule bind?" or "predict binding site"
- Structure-based drug design or lead optimization tasks
- Tasks involving PDB files + SMILES strings or ligand structures
- Batch docking of multiple protein-ligand pairs
## Installation and Environment Setup
### Check Environment Status
Before proceeding with DiffDock tasks, verify the environment setup:
```bash
# Use the provided setup checker
python scripts/setup_check.py
```
This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.
### Installation Options
**Option 1: Conda (Recommended)**
```bash
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
```
**Option 2: Docker**
```bash
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
```
**Important Notes:**
- GPU strongly recommended (10-100x speedup vs CPU)
- First run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
- Model checkpoints (~500MB) download automatically if not present
## Core Workflows
### Workflow 1: Single Protein-Ligand Docking
**Use Case:** Dock one ligand to one protein target
**Input Requirements:**
- Protein: PDB file OR amino acid sequence
- Ligand: SMILES string OR structure file (SDF/MOL2)
**Command:**
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)Oc1ccccc1C(=O)O" \
--out_dir results/single_docking/
```
**Alternative (protein sequence):**
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
--ligand ligand.sdf \
--out_dir results/sequence_docking/
```
**Output Structure:**
```
results/single_docking/
├── rank_1.sdf # Top-ranked pose
├── rank_2.sdf # Second-ranked pose
├── ...
├── rank_10.sdf # 10th pose (default: 10 samples)
└── confidence_scores.txt
```
### Workflow 2: Batch Processing Multiple Complexes
**Use Case:** Dock multiple ligands to proteins, virtual screening campaigns
**Step 1: Prepare Batch CSV**
Use the provided script to create or validate batch input:
```bash
# Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv
# Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
```
**CSV Format:**
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
```
**Required Columns:**
- `complex_name`: Unique identifier
- `protein_path`: PDB file path (leave empty if using sequence)
- `ligand_description`: SMILES string or ligand file path
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)
**Step 2: Run Batch Docking**
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv batch_input.csv \
--out_dir results/batch/ \
--batch_size 10
```
**For Large Virtual Screening (>100 compounds):**
Pre-compute protein embeddings for faster processing:
```bash
# Pre-compute embeddings
python datasets/esm_embedding_preparation.py \
--protein_ligand_csv screening_input.csv \
--out_file protein_embeddings.pt
# Run with pre-computed embeddings
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv screening_input.csv \
--esm_embeddings_path protein_embeddings.pt \
--out_dir results/screening/
```
### Workflow 3: Analyzing Results
After docking completes, analyze confidence scores and rank predictions:
```bash
# Analyze all results
python scripts/analyze_results.py results/batch/
# Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5
# Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0
# Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv
# Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
```
The analysis script:
- Parses confidence scores from all predictions
- Classifies as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
- Ranks predictions within and across complexes
- Generates statistical summaries
- Exports results to CSV for downstream analysis
## Confidence Score Interpretation
**Understanding Scores:**
| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High | Strong prediction, likely accurate |
| **-1.5 to 0** | Moderate | Reasonable prediction, validate carefully |
| **< -1.5** | Low | Uncertain prediction, requires validation |
**Critical Notes:**
1. **Confidence ≠ Affinity**: High confidence means model certainty about structure, NOT strong binding
2. **Context Matters**: Adjust expectations for:
- Large ligands (>500 Da): Lower confidence expected
- Multiple protein chains: May decrease confidence
- Novel protein families: May underperform
3. **Multiple Samples**: Review top 3-5 predictions, look for consensus
**For detailed guidance:** Read `references/confidence_and_limitations.md` using the Read tool
## Parameter Customization
### Using Custom Configuration
Create custom configuration for specific use cases:
```bash
# Copy template
cp assets/custom_inference_config.yaml my_config.yaml
# Edit parameters (see template for presets)
# Then run with custom config
python -m inference \
--config my_config.yaml \
--protein_ligand_csv input.csv \
--out_dir results/
```
### Key Parameters to Adjust
**Sampling Density:**
- `samples_per_complex: 10` → Increase to 20-40 for difficult cases
- More samples = better coverage but longer runtime
**Inference Steps:**
- `inference_steps: 20` → Increase to 25-30 for higher accuracy
- More steps = potentially better quality but slower
**Temperature Parameters (control diversity):**
- `temp_sampling_tor: 7.04` → Increase for flexible ligands (8-10)
- `temp_sampling_tor: 7.04` → Decrease for rigid ligands (5-6)
- Higher temperature = more diverse poses
**Presets Available in Template:**
1. High Accuracy: More samples + steps, lower temperature
2. Fast Screening: Fewer samples, faster
3. Flexible Ligands: Increased torsion temperature
4. Rigid Ligands: Decreased torsion temperature
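As a sketch (not an official preset), individual parameters can also be raised from the command line on top of the default config, in the same way the ensemble docking example below overrides `--samples_per_complex`; no flags beyond those used elsewhere in this document are assumed:
```bash
# Denser sampling for a difficult target, keeping all other defaults
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)Oc1ccccc1C(=O)O" \
    --samples_per_complex 40 \
    --out_dir results/dense_sampling/
```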
**For complete parameter reference:** Read `references/parameters_reference.md` using the Read tool
## Advanced Techniques
### Ensemble Docking (Protein Flexibility)
For proteins with known flexibility, dock to multiple conformations:
```python
# Create ensemble CSV
import pandas as pd
conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"
data = {
"complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
"protein_path": conformations,
"ligand_description": [ligand] * len(conformations),
"protein_sequence": [""] * len(conformations)
}
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```
Run docking with increased sampling:
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv ensemble_input.csv \
--samples_per_complex 20 \
--out_dir results/ensemble/
```
### Integration with Scoring Functions
DiffDock generates poses; combine with other tools for affinity:
**GNINA (Fast neural network scoring):**
```bash
for pose in results/*.sdf; do
gnina -r protein.pdb -l "$pose" --score_only
done
```
**MM/GBSA (More accurate, slower):**
Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization
**Free Energy Calculations (Most accurate):**
Use OpenMM + OpenFE or GROMACS for FEP/TI calculations
**Recommended Workflow:**
1. DiffDock → Generate poses with confidence scores
2. Visual inspection → Check structural plausibility
3. GNINA or MM/GBSA → Rescore and rank by affinity
4. Experimental validation → Biochemical assays
## Limitations and Scope
**DiffDock IS Designed For:**
- Small molecule ligands (typically 100-1000 Da)
- Drug-like organic compounds
- Small peptides (<20 residues)
- Single or multi-chain proteins
**DiffDock IS NOT Designed For:**
- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
- Large peptides (>20 residues) → Use alternative methods
- Covalent docking → Use specialized covalent docking tools
- Binding affinity prediction → Combine with scoring functions
- Membrane proteins → Not specifically trained, use with caution
**For complete limitations:** Read `references/confidence_and_limitations.md` using the Read tool
## Troubleshooting
### Common Issues
**Issue: Low confidence scores across all predictions**
- Cause: Large/unusual ligands, unclear binding site, protein flexibility
- Solution: Increase `samples_per_complex` (20-40), try ensemble docking, validate protein structure
**Issue: Out of memory errors**
- Cause: GPU memory insufficient for batch size
- Solution: Reduce `--batch_size 2` or process fewer complexes at once
**Issue: Slow performance**
- Cause: Running on CPU instead of GPU
- Solution: Verify CUDA with `python -c "import torch; print(torch.cuda.is_available())"`, use GPU
**Issue: Unrealistic binding poses**
- Cause: Poor protein preparation, ligand too large, wrong binding site
- Solution: Check protein for missing residues, remove far waters, consider specifying binding site
**Issue: "Module not found" errors**
- Cause: Missing dependencies or wrong environment
- Solution: Run `python scripts/setup_check.py` to diagnose
### Performance Optimization
**For Best Results:**
1. Use GPU (essential for practical use)
2. Pre-compute ESM embeddings for repeated protein use
3. Batch process multiple complexes together
4. Start with default parameters, then tune if needed
5. Validate protein structures (resolve missing residues)
6. Use canonical SMILES for ligands
## Graphical User Interface
For interactive use, launch the web interface:
```bash
python app/main.py
# Navigate to http://localhost:7860
```
Or use the online demo without installation:
- https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
## Resources
### Helper Scripts (`scripts/`)
**`prepare_batch_csv.py`**: Create and validate batch input CSV files
- Create templates with example entries
- Validate file paths and SMILES strings
- Check for required columns and format issues
**`analyze_results.py`**: Analyze confidence scores and rank predictions
- Parse results from single or batch runs
- Generate statistical summaries
- Export to CSV for downstream analysis
- Identify top predictions across complexes
**`setup_check.py`**: Verify DiffDock environment setup
- Check Python version and dependencies
- Verify PyTorch and CUDA availability
- Test RDKit and PyTorch Geometric installation
- Provide installation instructions if needed
### Reference Documentation (`references/`)
**`parameters_reference.md`**: Complete parameter documentation
- All command-line options and configuration parameters
- Default values and acceptable ranges
- Temperature parameters for controlling diversity
- Model checkpoint locations and version flags
Read this file when users need:
- Detailed parameter explanations
- Fine-tuning guidance for specific systems
- Alternative sampling strategies
**`confidence_and_limitations.md`**: Confidence score interpretation and tool limitations
- Detailed confidence score interpretation
- When to trust predictions
- Scope and limitations of DiffDock
- Integration with complementary tools
- Troubleshooting prediction quality
Read this file when users need:
- Help interpreting confidence scores
- Understanding when NOT to use DiffDock
- Guidance on combining with other tools
- Validation strategies
**`workflows_examples.md`**: Comprehensive workflow examples
- Detailed installation instructions
- Step-by-step examples for all workflows
- Advanced integration patterns
- Troubleshooting common issues
- Best practices and optimization tips
Read this file when users need:
- Complete workflow examples with code
- Integration with GNINA, OpenMM, or other tools
- Virtual screening workflows
- Ensemble docking procedures
### Assets (`assets/`)
**`batch_template.csv`**: Template for batch processing
- Pre-formatted CSV with required columns
- Example entries showing different input types
- Ready to customize with actual data
**`custom_inference_config.yaml`**: Configuration template
- Annotated YAML with all parameters
- Four preset configurations for common use cases
- Detailed comments explaining each parameter
- Ready to customize and use
## Best Practices
1. **Always verify environment** with `setup_check.py` before starting large jobs
2. **Validate batch CSVs** with `prepare_batch_csv.py` to catch errors early
3. **Start with defaults** then tune parameters based on system-specific needs
4. **Generate multiple samples** (10-40) for robust predictions
5. **Visual inspection** of top poses before downstream analysis
6. **Combine with scoring** functions for affinity assessment
7. **Use confidence scores** for initial ranking, not final decisions
8. **Pre-compute embeddings** for virtual screening campaigns
9. **Document parameters** used for reproducibility
10. **Validate results** experimentally when possible
## Citations
When using DiffDock, cite the appropriate papers:
**DiffDock-L (current default model):**
```
Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
arXiv:2402.18396
```
**Original DiffDock:**
```
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
ICLR 2023, arXiv:2210.01776
```
## Additional Resources
- **GitHub Repository**: https://github.com/gcorso/DiffDock
- **Online Demo**: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
- **DiffDock-L Paper**: https://arxiv.org/abs/2402.18396
- **Original Paper**: https://arxiv.org/abs/2210.01776

View File

@@ -0,0 +1,4 @@
complex_name,protein_path,ligand_description,protein_sequence
example_1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
example_2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
example_3,protein3.pdb,ligand3.sdf,

View File

@@ -0,0 +1,90 @@
# DiffDock Custom Inference Configuration Template
# Copy and modify this file to customize inference parameters
# Model paths (usually don't need to change these)
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
ckpt: best_ema_inference_epoch_model.pt
confidence_ckpt: best_model_epoch75.pt
# Model version flags
old_score_model: false # Set to true to use original DiffDock instead of DiffDock-L
old_filtering_model: true
# Inference steps
inference_steps: 20 # Increase for potentially better accuracy (e.g., 25-30)
actual_steps: 19
no_final_step_noise: true
# Sampling parameters
samples_per_complex: 10 # Increase for difficult cases (e.g., 20-40)
sigma_schedule: expbeta
initial_noise_std_proportion: 1.46
# Temperature controls - Adjust these to balance exploration vs accuracy
# Higher values = more diverse predictions, lower values = more focused predictions
# Sampling temperatures
temp_sampling_tr: 1.17 # Translation sampling temperature
temp_sampling_rot: 2.06 # Rotation sampling temperature
temp_sampling_tor: 7.04 # Torsion sampling temperature (increase for flexible ligands)
# Psi angle temperatures
temp_psi_tr: 0.73
temp_psi_rot: 0.90
temp_psi_tor: 0.59
# Sigma data temperatures
temp_sigma_data_tr: 0.93
temp_sigma_data_rot: 0.75
temp_sigma_data_tor: 0.69
# Feature flags
no_model: false
no_random: false
ode: false # Set to true to use ODE solver instead of SDE
different_schedules: false
limit_failures: 5
# Output settings
# save_visualisation: true # Uncomment to save SDF files
# ============================================================================
# Configuration Presets for Common Use Cases
# ============================================================================
# PRESET 1: High Accuracy (slower, more thorough)
# samples_per_complex: 30
# inference_steps: 25
# temp_sampling_tr: 1.0
# temp_sampling_rot: 1.8
# temp_sampling_tor: 6.5
# PRESET 2: Fast Screening (faster, less thorough)
# samples_per_complex: 5
# inference_steps: 15
# temp_sampling_tr: 1.3
# temp_sampling_rot: 2.2
# temp_sampling_tor: 7.5
# PRESET 3: Flexible Ligands (more conformational diversity)
# samples_per_complex: 20
# inference_steps: 20
# temp_sampling_tr: 1.2
# temp_sampling_rot: 2.1
# temp_sampling_tor: 8.5 # Increased torsion temperature
# PRESET 4: Rigid Ligands (more focused predictions)
# samples_per_complex: 10
# inference_steps: 20
# temp_sampling_tr: 1.1
# temp_sampling_rot: 2.0
# temp_sampling_tor: 6.0 # Decreased torsion temperature
# ============================================================================
# Usage Example
# ============================================================================
# python -m inference \
# --config custom_inference_config.yaml \
# --protein_ligand_csv input.csv \
# --out_dir results/

View File

@@ -0,0 +1,182 @@
# DiffDock Confidence Scores and Limitations
This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.
## Confidence Score Interpretation
DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.
### Score Ranges
| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |
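A minimal helper that applies these thresholds when post-processing results (the function name is illustrative; the bundled `scripts/analyze_results.py` uses the same cutoffs):
```python
def classify_confidence(score: float) -> str:
    """Map a DiffDock confidence score to the bands in the table above."""
    if score > 0:
        return "High"
    elif score > -1.5:
        return "Moderate"
    return "Low"

print(classify_confidence(0.3))    # High
print(classify_confidence(-0.7))   # Moderate
print(classify_confidence(-2.1))   # Low
```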
### Important Notes on Confidence Scores
1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
- High confidence = model is confident about the structure
- Does NOT indicate strong/weak binding affinity
2. **Context-Dependent**: Confidence scores should be adjusted based on system complexity:
- **Lower expectations** for:
- Large ligands (>500 Da)
- Protein complexes with many chains
- Unbound protein conformations (may require conformational changes)
- Novel protein families not well-represented in training data
- **Higher expectations** for:
- Drug-like small molecules (150-500 Da)
- Single-chain proteins or well-defined binding sites
- Proteins similar to those in training data (PDBBind, BindingMOAD)
3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
- Review top-ranked predictions (by confidence)
- Consider clustering similar poses
- High-confidence consensus across multiple samples strengthens prediction
## What DiffDock Predicts
### ✅ DiffDock DOES Predict
- **Binding poses**: 3D spatial orientation of ligand in protein binding site
- **Confidence scores**: Model's certainty about predictions
- **Multiple conformations**: Various possible binding modes
### ❌ DiffDock DOES NOT Predict
- **Binding affinity**: Strength of protein-ligand interaction (ΔG, Kd, Ki)
- **Binding kinetics**: On/off rates, residence time
- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
- **Selectivity**: Relative binding to different targets
## Scope and Limitations
### Designed For
- **Small molecule docking**: Organic compounds typically 100-1000 Da
- **Protein targets**: Single or multi-chain proteins
- **Small peptides**: Short peptide ligands (< ~20 residues)
- **Small nucleic acids**: Short oligonucleotides
### NOT Designed For
- **Large biomolecules**: Full protein-protein interactions
- Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
- **Large peptides/proteins**: >20 residues as ligands
- **Covalent docking**: Irreversible covalent bond formation
- **Metalloprotein specifics**: May not accurately handle metal coordination
- **Membrane proteins**: Not specifically trained on membrane-embedded proteins
### Training Data Considerations
DiffDock was trained on:
- **PDBBind**: Diverse protein-ligand complexes
- **BindingMOAD**: Multi-domain protein structures
**Implications**:
- Best performance on proteins/ligands similar to training data
- May underperform on:
- Novel protein families
- Unusual ligand chemotypes
- Allosteric sites not well-represented in training data
## Validation and Complementary Tools
### Recommended Workflow
1. **Generate poses with DiffDock**
- Use confidence scores for initial ranking
- Consider multiple high-confidence predictions
2. **Visual Inspection**
- Examine protein-ligand interactions in molecular viewer
- Check for reasonable:
- Hydrogen bonds
- Hydrophobic interactions
- Steric complementarity
- Electrostatic interactions
3. **Scoring and Refinement** (choose one or more):
- **GNINA**: Deep learning-based scoring function
- **Molecular mechanics**: Energy minimization and refinement
- **MM/GBSA or MM/PBSA**: Binding free energy estimation
- **Free energy calculations**: FEP or TI for accurate affinity prediction
4. **Experimental Validation**
- Biochemical assays (IC50, Kd measurements)
- Structural validation (X-ray crystallography, cryo-EM)
### Tools for Binding Affinity Assessment
DiffDock should be combined with these tools for affinity prediction:
- **GNINA**: Fast, accurate scoring function
- Github: github.com/gnina/gnina
- **AutoDock Vina**: Classical docking and scoring
- Website: vina.scripps.edu
- **Free Energy Calculations**:
- OpenMM + OpenFE
- GROMACS + ABFE/RBFE protocols
- **MM/GBSA Tools**:
- MMPBSA.py (AmberTools)
- gmx_MMPBSA
## Performance Optimization
### For Best Results
1. **Protein Preparation**:
- Remove water molecules far from binding site
- Resolve missing residues if possible
- Consider protonation states at physiological pH
2. **Ligand Input**:
- Provide reasonable 3D conformers when using structure files
- Use canonical SMILES for consistent results
- Pre-process with RDKit if needed
3. **Computational Resources**:
- GPU strongly recommended (10-100x speedup)
- First run pre-computes lookup tables (takes a few minutes)
- Batch processing more efficient than single predictions
4. **Parameter Tuning**:
- Increase `samples_per_complex` for difficult cases (20-40)
- Adjust temperature parameters for diversity/accuracy trade-off
- Use pre-computed ESM embeddings for repeated predictions
## Common Issues and Troubleshooting
### Low Confidence Scores
- **Large/flexible ligands**: Consider splitting into fragments or use alternative methods
- **Multiple binding sites**: May predict multiple locations with distributed confidence
- **Protein flexibility**: Consider using ensemble of protein conformations
### Unrealistic Predictions
- **Clashes**: May indicate need for protein preparation or refinement
- **Surface binding**: Check if true binding site is blocked or unclear
- **Unusual poses**: Consider increasing samples to explore more conformations
### Slow Performance
- **Use GPU**: Essential for reasonable runtime
- **Pre-compute embeddings**: Reuse ESM embeddings for same protein
- **Batch processing**: More efficient than sequential individual predictions
- **Reduce samples**: Lower `samples_per_complex` for quick screening
## Citation and Further Reading
For methodology details and benchmarking results, see:
1. **Original DiffDock Paper** (ICLR 2023):
- "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
- Corso et al., arXiv:2210.01776
2. **DiffDock-L Paper** (2024):
- Enhanced model with improved generalization
   - Corso et al., "Deep Confident Steps to New Pockets: Strategies for Docking Generalization", arXiv:2402.18396
3. **PoseBusters Benchmark**:
- Rigorous docking evaluation framework
- Used for DiffDock validation

View File

@@ -0,0 +1,163 @@
# DiffDock Configuration Parameters Reference
This document provides comprehensive details on all DiffDock configuration parameters and command-line options.
## Model & Checkpoint Settings
### Model Paths
- **`--model_dir`**: Directory containing the score model checkpoint
- Default: `./workdir/v1.1/score_model`
- DiffDock-L model (current default)
- **`--confidence_model_dir`**: Directory containing the confidence model checkpoint
- Default: `./workdir/v1.1/confidence_model`
- **`--ckpt`**: Name of the score model checkpoint file
- Default: `best_ema_inference_epoch_model.pt`
- **`--confidence_ckpt`**: Name of the confidence model checkpoint file
- Default: `best_model_epoch75.pt`
### Model Version Flags
- **`--old_score_model`**: Use original DiffDock model instead of DiffDock-L
- Default: `false` (uses DiffDock-L)
- **`--old_filtering_model`**: Use legacy confidence filtering approach
- Default: `true`
## Input/Output Options
### Input Specification
- **`--protein_path`**: Path to protein PDB file
- Example: `--protein_path protein.pdb`
- Alternative to `--protein_sequence`
- **`--protein_sequence`**: Amino acid sequence for ESMFold folding
- Automatically generates protein structure from sequence
- Alternative to `--protein_path`
- **`--ligand`**: Ligand specification (SMILES string or file path)
- SMILES string: `--ligand "COc(cc1)ccc1C#N"`
- File path: `--ligand ligand.sdf` or `.mol2`
- **`--protein_ligand_csv`**: CSV file for batch processing
- Required columns: `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`
- Example: `--protein_ligand_csv data/protein_ligand_example.csv`
### Output Control
- **`--out_dir`**: Output directory for predictions
- Example: `--out_dir results/user_predictions/`
- **`--save_visualisation`**: Export predicted molecules as SDF files
- Enables visualization of results
## Inference Parameters
### Diffusion Steps
- **`--inference_steps`**: Number of planned inference iterations
- Default: `20`
- Higher values may improve accuracy but increase runtime
- **`--actual_steps`**: Actual diffusion steps executed
- Default: `19`
- **`--no_final_step_noise`**: Omit noise at the final diffusion step
- Default: `true`
### Sampling Settings
- **`--samples_per_complex`**: Number of samples to generate per complex
- Default: `10`
- More samples provide better coverage but increase computation
- **`--sigma_schedule`**: Noise schedule type
- Default: `expbeta` (exponential-beta)
- **`--initial_noise_std_proportion`**: Initial noise standard deviation scaling
- Default: `1.46`
### Temperature Parameters
#### Sampling Temperatures (Controls diversity of predictions)
- **`--temp_sampling_tr`**: Translation sampling temperature
- Default: `1.17`
- **`--temp_sampling_rot`**: Rotation sampling temperature
- Default: `2.06`
- **`--temp_sampling_tor`**: Torsion sampling temperature
- Default: `7.04`
#### Psi Angle Temperatures
- **`--temp_psi_tr`**: Translation psi temperature
- Default: `0.73`
- **`--temp_psi_rot`**: Rotation psi temperature
- Default: `0.90`
- **`--temp_psi_tor`**: Torsion psi temperature
- Default: `0.59`
#### Sigma Data Temperatures
- **`--temp_sigma_data_tr`**: Translation data distribution scaling
- Default: `0.93`
- **`--temp_sigma_data_rot`**: Rotation data distribution scaling
- Default: `0.75`
- **`--temp_sigma_data_tor`**: Torsion data distribution scaling
- Default: `0.69`
## Processing Options
### Performance
- **`--batch_size`**: Processing batch size
- Default: `10`
- Larger values increase throughput but require more memory
- **`--tqdm`**: Enable progress bar visualization
- Useful for monitoring long-running jobs
### Protein Structure
- **`--chain_cutoff`**: Maximum number of protein chains to process
- Example: `--chain_cutoff 10`
- Useful for large multi-chain complexes
- **`--esm_embeddings_path`**: Path to pre-computed ESM2 protein embeddings
- Speeds up inference by reusing embeddings
- Optional optimization
### Dataset Options
- **`--split`**: Dataset split to use (train/test/val)
- Used for evaluation on standard benchmarks
## Advanced Flags
### Debugging & Testing
- **`--no_model`**: Disable model inference (debugging)
- Default: `false`
- **`--no_random`**: Disable randomization
- Default: `false`
- Useful for reproducibility testing
### Alternative Sampling
- **`--ode`**: Use ODE solver instead of SDE
- Default: `false`
- Alternative sampling approach
- **`--different_schedules`**: Use different noise schedules per component
- Default: `false`
### Error Handling
- **`--limit_failures`**: Maximum allowed failures before stopping
- Default: `5`
## Configuration File
All parameters can be specified in a YAML configuration file (typically `default_inference_args.yaml`) or overridden via command line:
```bash
python -m inference --config default_inference_args.yaml --samples_per_complex 20
```
Command-line arguments take precedence over configuration file values.
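A minimal sketch of that precedence (illustrative only, not DiffDock's actual argument parser): the YAML values are loaded first and any explicitly supplied command-line values replace them.
```python
import yaml

# Illustrative only: config-file values first, explicit overrides win
with open("default_inference_args.yaml") as f:
    settings = yaml.safe_load(f)

cli_overrides = {"samples_per_complex": 20}      # e.g. parsed from the command line
settings.update({k: v for k, v in cli_overrides.items() if v is not None})
print(settings["samples_per_complex"])           # 20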

View File

@@ -0,0 +1,392 @@
# DiffDock Workflows and Examples
This document provides practical workflows and usage examples for common DiffDock tasks.
## Installation and Setup
### Conda Installation (Recommended)
```bash
# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
# Create conda environment
conda env create --file environment.yml
conda activate diffdock
```
### Docker Installation
```bash
# Pull Docker image
docker pull rbgcsail/diffdock
# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
# Inside container, activate environment
micromamba activate diffdock
```
### First Run
The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.
## Workflow 1: Single Protein-Ligand Docking
### Using PDB File and SMILES String
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_path examples/protein.pdb \
--ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
--out_dir results/single_docking/
```
**Output Structure**:
```
results/single_docking/
├── index_0_rank_1.sdf # Top-ranked prediction
├── index_0_rank_2.sdf # Second-ranked prediction
├── ...
├── index_0_rank_10.sdf # 10th prediction (if samples_per_complex=10)
└── confidence_scores.txt # Scores for all predictions
```
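The scores file can be parsed directly for a quick ranking; a minimal sketch assuming one score per line in rank order (the same convention the bundled `scripts/analyze_results.py` assumes):
```python
from pathlib import Path

out_dir = Path("results/single_docking")
scores = [float(line) for line in (out_dir / "confidence_scores.txt").read_text().split()]
for rank, score in enumerate(scores, start=1):
    print(f"rank {rank}: confidence {score:+.3f}")
```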
### Using Ligand Structure File
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand ligand.sdf \
--out_dir results/ligand_file/
```
**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit
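If only a SMILES is available but a structure file is preferred, RDKit can generate a 3D conformer and write an SDF; a minimal sketch (SMILES and file name are placeholders):
```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("COc1ccc(C(=O)Nc2ccccc2)cc1")   # placeholder ligand
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=42)                # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)                        # quick geometry cleanup
writer = Chem.SDWriter("ligand.sdf")
writer.write(mol)
writer.close()
```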
## Workflow 2: Protein Sequence to Structure Docking
### Using ESMFold for Protein Folding
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
--ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
--out_dir results/sequence_docking/
```
**Use Cases**:
- Protein structure not available in PDB
- Modeling mutations or variants
- De novo protein design validation
**Note**: ESMFold folding adds computation time (30s-5min depending on sequence length)
## Workflow 3: Batch Processing Multiple Complexes
### Prepare CSV File
Create `complexes.csv` with required columns:
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
```
**Column Descriptions**:
- `complex_name`: Unique identifier for the complex
- `protein_path`: Path to PDB file (leave empty if using sequence)
- `ligand_description`: SMILES string or path to ligand file
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)
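The same CSV can also be assembled programmatically; a minimal pandas sketch with placeholder paths and SMILES:
```python
import pandas as pd

rows = [
    {"complex_name": "complex1", "protein_path": "proteins/protein1.pdb",
     "ligand_description": "CC(=O)Oc1ccccc1C(=O)O", "protein_sequence": ""},
    {"complex_name": "complex2", "protein_path": "",
     "ligand_description": "COc1ccc(C#N)cc1", "protein_sequence": "MSKGEELFTG..."},  # truncated placeholder sequence
]
pd.DataFrame(rows).to_csv("complexes.csv", index=False)
```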
### Run Batch Docking
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv complexes.csv \
--out_dir results/batch_predictions/ \
--batch_size 10
```
**Output Structure**:
```
results/batch_predictions/
├── complex1/
│ ├── rank_1.sdf
│ ├── rank_2.sdf
│ └── ...
├── complex2/
│ ├── rank_1.sdf
│ └── ...
└── complex3/
└── ...
```
## Workflow 4: High-Throughput Virtual Screening
### Setup for Screening Large Ligand Libraries
```python
# generate_screening_csv.py
import pandas as pd
# Load ligand library
ligands = pd.read_csv("ligand_library.csv") # Contains SMILES
# Create DiffDock input
screening_data = {
"complex_name": [f"screen_{i}" for i in range(len(ligands))],
"protein_path": ["target_protein.pdb"] * len(ligands),
"ligand_description": ligands["smiles"].tolist(),
"protein_sequence": [""] * len(ligands)
}
df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)
```
### Run Screening
```bash
# Pre-compute ESM embeddings for faster screening
python datasets/esm_embedding_preparation.py \
--protein_ligand_csv screening_input.csv \
--out_file protein_embeddings.pt
# Run docking with pre-computed embeddings
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv screening_input.csv \
--esm_embeddings_path protein_embeddings.pt \
--out_dir results/virtual_screening/ \
--batch_size 32
```
### Post-Processing: Extract Top Hits
```python
# analyze_screening_results.py
import os
import pandas as pd
results = []
results_dir = "results/virtual_screening/"
for complex_dir in os.listdir(results_dir):
confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
if os.path.exists(confidence_file):
with open(confidence_file) as f:
scores = [float(line.strip()) for line in f]
top_score = max(scores)
results.append({"complex": complex_dir, "top_confidence": top_score})
# Sort by confidence
df = pd.DataFrame(results)
df_sorted = df.sort_values("top_confidence", ascending=False)
# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)
```
## Workflow 5: Ensemble Docking with Protein Flexibility
### Prepare Protein Ensemble
```python
# For proteins with known flexibility, use multiple conformations
# Example: Using MD snapshots or crystal structures
# create_ensemble_csv.py
import pandas as pd
conformations = [
"protein_conf1.pdb",
"protein_conf2.pdb",
"protein_conf3.pdb",
"protein_conf4.pdb"
]
ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"
data = {
"complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
"protein_path": conformations,
"ligand_description": [ligand] * len(conformations),
"protein_sequence": [""] * len(conformations)
}
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```
### Run Ensemble Docking
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv ensemble_input.csv \
--out_dir results/ensemble_docking/ \
--samples_per_complex 20 # More samples per conformation
```
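After the run, the best-scoring pose per conformation can be collected for comparison; a minimal sketch assuming each conformation's subdirectory contains a `confidence_scores.txt` file, as in the screening post-processing example above:
```python
from pathlib import Path

best = {}
for complex_dir in Path("results/ensemble_docking").iterdir():
    score_file = complex_dir / "confidence_scores.txt"
    if score_file.is_file():
        scores = [float(s) for s in score_file.read_text().split()]
        best[complex_dir.name] = max(scores)

for name, score in sorted(best.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: best confidence {score:+.3f}")
```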
## Workflow 6: Integration with Downstream Analysis
### Example: DiffDock + GNINA Rescoring
```bash
# 1. Run DiffDock
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
--out_dir results/diffdock_poses/ \
--save_visualisation
# 2. Rescore with GNINA
for pose in results/diffdock_poses/*.sdf; do
gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done
```
### Example: DiffDock + OpenMM Energy Minimization
```python
# minimize_poses.py
from openmm import app, LangevinIntegrator, Platform
from openmm.app import ForceField, Modeller, PDBFile
from rdkit import Chem
import os
# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
# Process each DiffDock pose
pose_dir = 'results/diffdock_poses/'
for pose_file in os.listdir(pose_dir):
if pose_file.endswith('.sdf'):
# Load ligand
mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]
# Combine protein + ligand
modeller = Modeller(protein.topology, protein.positions)
# ... add ligand to modeller ...
        # Create system and minimize (the ligand must be parameterized, e.g. with a
        # small-molecule force field template, before createSystem will succeed)
        system = forcefield.createSystem(modeller.topology)
        integrator = LangevinIntegrator(300, 1.0, 0.002)
        simulation = app.Simulation(modeller.topology, system, integrator)
        simulation.context.setPositions(modeller.positions)  # set coordinates before minimizing
        simulation.minimizeEnergy(maxIterations=1000)
        # Save minimized structure
        positions = simulation.context.getState(getPositions=True).getPositions()
        out_name = f"minimized_{os.path.splitext(pose_file)[0]}.pdb"
        with open(out_name, 'w') as handle:
            PDBFile.writeFile(simulation.topology, positions, handle)
```
## Workflow 7: Using the Graphical Interface
### Launch Web Interface
```bash
python app/main.py
```
### Access Interface
Navigate to `http://localhost:7860` in web browser
### Features
- Upload protein PDB or enter sequence
- Input ligand SMILES or upload structure
- Adjust inference parameters via GUI
- Visualize results interactively
- Download predictions directly
### Online Alternative
Use the Hugging Face Spaces demo without local installation:
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
## Advanced Configuration
### Custom Inference Settings
Create custom YAML configuration:
```yaml
# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
# Sampling parameters
samples_per_complex: 20 # More samples for better coverage
inference_steps: 25 # More steps for accuracy
# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5
# Output
save_visualisation: true
```
Use custom configuration:
```bash
python -m inference \
--config custom_inference.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
--out_dir results/custom_config/
```
## Troubleshooting Common Issues
### Issue: Out of Memory Errors
**Solution**: Reduce batch size
```bash
python -m inference ... --batch_size 2
```
### Issue: Slow Performance
**Solution**: Ensure GPU usage
```python
import torch
print(torch.cuda.is_available()) # Should return True
```
### Issue: Poor Predictions for Large Ligands
**Solution**: Increase sampling diversity
```bash
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
```
### Issue: Protein with Many Chains
**Solution**: Limit chains or isolate binding site
```bash
python -m inference ... --chain_cutoff 4
```
Or pre-process PDB to include only relevant chains.
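A minimal Biopython sketch for writing out only selected chains (chain IDs and file names are placeholders):
```python
from Bio.PDB import PDBParser, PDBIO, Select

class KeepChains(Select):
    def __init__(self, chain_ids):
        self.chain_ids = set(chain_ids)
    def accept_chain(self, chain):
        return chain.id in self.chain_ids

structure = PDBParser(QUIET=True).get_structure("protein", "protein.pdb")
io = PDBIO()
io.set_structure(structure)
io.save("protein_chains_AB.pdb", select=KeepChains(["A", "B"]))
```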
## Best Practices Summary
1. **Start Simple**: Test with single complex before batch processing
2. **GPU Essential**: Use GPU for reasonable performance
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
4. **Validate Results**: Use molecular visualization and complementary scoring
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
7. **Pre-compute Embeddings**: For repeated use of same protein
8. **Combine Tools**: Integrate with scoring functions and energy minimization

View File

@@ -0,0 +1,334 @@
#!/usr/bin/env python3
"""
DiffDock Results Analysis Script
This script analyzes DiffDock prediction results, extracting confidence scores,
ranking predictions, and generating summary reports.
Usage:
python analyze_results.py results/output_dir/
python analyze_results.py results/ --top 50 --threshold 0.0
python analyze_results.py results/ --export summary.csv
"""
import argparse
import os
import sys
import json
from pathlib import Path
from collections import defaultdict
import re
def parse_confidence_scores(results_dir):
"""
Parse confidence scores from DiffDock output directory.
Args:
results_dir: Path to DiffDock results directory
Returns:
dict: Dictionary mapping complex names to their predictions and scores
"""
results = {}
results_path = Path(results_dir)
# Check if this is a single complex or batch results
sdf_files = list(results_path.glob("*.sdf"))
if sdf_files:
# Single complex output
results['single_complex'] = parse_single_complex(results_path)
else:
# Batch output - multiple subdirectories
for subdir in results_path.iterdir():
if subdir.is_dir():
complex_results = parse_single_complex(subdir)
if complex_results:
results[subdir.name] = complex_results
return results
def parse_single_complex(complex_dir):
"""Parse results for a single complex."""
predictions = []
# Look for SDF files with rank information
for sdf_file in complex_dir.glob("*.sdf"):
filename = sdf_file.name
# Extract rank from filename (e.g., "rank_1.sdf" or "index_0_rank_1.sdf")
rank_match = re.search(r'rank_(\d+)', filename)
if rank_match:
rank = int(rank_match.group(1))
# Try to extract confidence score from filename or separate file
confidence = extract_confidence_score(sdf_file, complex_dir)
predictions.append({
'rank': rank,
'file': sdf_file.name,
'path': str(sdf_file),
'confidence': confidence
})
# Sort by rank
predictions.sort(key=lambda x: x['rank'])
return {'predictions': predictions} if predictions else None
def extract_confidence_score(sdf_file, complex_dir):
"""
Extract confidence score for a prediction.
Tries multiple methods:
1. Read from confidence_scores.txt file
2. Parse from SDF file properties
3. Extract from filename if present
"""
# Method 1: confidence_scores.txt
confidence_file = complex_dir / "confidence_scores.txt"
if confidence_file.exists():
try:
with open(confidence_file) as f:
lines = f.readlines()
# Extract rank from filename
rank_match = re.search(r'rank_(\d+)', sdf_file.name)
if rank_match:
rank = int(rank_match.group(1))
if rank <= len(lines):
return float(lines[rank - 1].strip())
except Exception:
pass
# Method 2: Parse from SDF file
try:
with open(sdf_file) as f:
content = f.read()
# Look for confidence score in SDF properties
conf_match = re.search(r'confidence[:\s]+(-?\d+\.?\d*)', content, re.IGNORECASE)
if conf_match:
return float(conf_match.group(1))
except Exception:
pass
# Method 3: Filename (e.g., "rank_1_conf_0.95.sdf")
conf_match = re.search(r'conf_(-?\d+\.?\d*)', sdf_file.name)
if conf_match:
return float(conf_match.group(1))
return None
def classify_confidence(score):
"""Classify confidence score into categories."""
if score is None:
return "Unknown"
elif score > 0:
return "High"
elif score > -1.5:
return "Moderate"
else:
return "Low"
def print_summary(results, top_n=None, min_confidence=None):
"""Print a formatted summary of results."""
print("\n" + "="*80)
print("DiffDock Results Summary")
print("="*80)
all_predictions = []
for complex_name, data in results.items():
predictions = data.get('predictions', [])
print(f"\n{complex_name}")
print("-" * 80)
if not predictions:
print(" No predictions found")
continue
# Filter by confidence if specified
filtered_predictions = predictions
if min_confidence is not None:
filtered_predictions = [p for p in predictions if p['confidence'] is not None and p['confidence'] >= min_confidence]
# Limit to top N if specified
if top_n is not None:
filtered_predictions = filtered_predictions[:top_n]
for pred in filtered_predictions:
confidence = pred['confidence']
confidence_class = classify_confidence(confidence)
conf_str = f"{confidence:>7.3f}" if confidence is not None else " N/A"
print(f" Rank {pred['rank']:2d}: Confidence = {conf_str} ({confidence_class:8s}) | {pred['file']}")
# Add to all predictions for overall statistics
if confidence is not None:
all_predictions.append((complex_name, pred['rank'], confidence))
# Show statistics for this complex
if filtered_predictions and any(p['confidence'] is not None for p in filtered_predictions):
confidences = [p['confidence'] for p in filtered_predictions if p['confidence'] is not None]
print(f"\n Statistics: {len(filtered_predictions)} predictions")
print(f" Mean confidence: {sum(confidences)/len(confidences):.3f}")
print(f" Max confidence: {max(confidences):.3f}")
print(f" Min confidence: {min(confidences):.3f}")
# Overall statistics
if all_predictions:
print("\n" + "="*80)
print("Overall Statistics")
print("="*80)
confidences = [conf for _, _, conf in all_predictions]
print(f" Total predictions: {len(all_predictions)}")
print(f" Total complexes: {len(results)}")
print(f" Mean confidence: {sum(confidences)/len(confidences):.3f}")
print(f" Max confidence: {max(confidences):.3f}")
print(f" Min confidence: {min(confidences):.3f}")
# Confidence distribution
high = sum(1 for c in confidences if c > 0)
moderate = sum(1 for c in confidences if -1.5 < c <= 0)
low = sum(1 for c in confidences if c <= -1.5)
print(f"\n Confidence distribution:")
print(f" High (> 0): {high:4d} ({100*high/len(confidences):5.1f}%)")
print(f" Moderate (-1.5 to 0): {moderate:4d} ({100*moderate/len(confidences):5.1f}%)")
print(f" Low (< -1.5): {low:4d} ({100*low/len(confidences):5.1f}%)")
print("\n" + "="*80)
def export_to_csv(results, output_path):
"""Export results to CSV file."""
import csv
with open(output_path, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['complex_name', 'rank', 'confidence', 'confidence_class', 'file_path'])
for complex_name, data in results.items():
predictions = data.get('predictions', [])
for pred in predictions:
confidence = pred['confidence']
confidence_class = classify_confidence(confidence)
conf_value = confidence if confidence is not None else ''
writer.writerow([
complex_name,
pred['rank'],
conf_value,
confidence_class,
pred['path']
])
print(f"✓ Exported results to: {output_path}")
def get_top_predictions(results, n=10, sort_by='confidence'):
"""Get top N predictions across all complexes."""
all_predictions = []
for complex_name, data in results.items():
predictions = data.get('predictions', [])
for pred in predictions:
if pred['confidence'] is not None:
all_predictions.append({
'complex': complex_name,
**pred
})
# Sort by confidence (descending)
all_predictions.sort(key=lambda x: x['confidence'], reverse=True)
return all_predictions[:n]
def print_top_predictions(results, n=10):
"""Print top N predictions across all complexes."""
top_preds = get_top_predictions(results, n)
print("\n" + "="*80)
print(f"Top {n} Predictions Across All Complexes")
print("="*80)
for i, pred in enumerate(top_preds, 1):
confidence_class = classify_confidence(pred['confidence'])
print(f"{i:2d}. {pred['complex']:30s} | Rank {pred['rank']:2d} | "
f"Confidence: {pred['confidence']:7.3f} ({confidence_class})")
print("="*80)
def main():
parser = argparse.ArgumentParser(
description='Analyze DiffDock prediction results',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Analyze all results in directory
python analyze_results.py results/output_dir/
# Show only top 5 predictions per complex
python analyze_results.py results/ --top 5
# Filter by confidence threshold
python analyze_results.py results/ --threshold 0.0
# Export to CSV
python analyze_results.py results/ --export summary.csv
# Show top 20 predictions across all complexes
python analyze_results.py results/ --best 20
"""
)
parser.add_argument('results_dir', help='Path to DiffDock results directory')
parser.add_argument('--top', '-t', type=int,
help='Show only top N predictions per complex')
parser.add_argument('--threshold', type=float,
help='Minimum confidence threshold')
parser.add_argument('--export', '-e', metavar='FILE',
help='Export results to CSV file')
parser.add_argument('--best', '-b', type=int, metavar='N',
help='Show top N predictions across all complexes')
args = parser.parse_args()
# Validate results directory
if not os.path.exists(args.results_dir):
print(f"Error: Results directory not found: {args.results_dir}")
return 1
# Parse results
print(f"Analyzing results in: {args.results_dir}")
results = parse_confidence_scores(args.results_dir)
if not results:
print("No DiffDock results found in directory")
return 1
# Print summary
print_summary(results, top_n=args.top, min_confidence=args.threshold)
# Print top predictions across all complexes
if args.best:
print_top_predictions(results, args.best)
# Export to CSV if requested
if args.export:
export_to_csv(results, args.export)
return 0
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
DiffDock Batch CSV Preparation and Validation Script
This script helps prepare and validate CSV files for DiffDock batch processing.
It checks for required columns, validates file paths, and ensures SMILES strings
are properly formatted.
Usage:
python prepare_batch_csv.py input.csv --validate
python prepare_batch_csv.py --create --output batch_input.csv
"""
import argparse
import os
import sys
import pandas as pd
from pathlib import Path
try:
from rdkit import Chem
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
RDKIT_AVAILABLE = True
except ImportError:
RDKIT_AVAILABLE = False
print("Warning: RDKit not available. SMILES validation will be skipped.")
def validate_smiles(smiles_string):
"""Validate a SMILES string using RDKit."""
if not RDKIT_AVAILABLE:
return True, "RDKit not available for validation"
try:
mol = Chem.MolFromSmiles(smiles_string)
if mol is None:
return False, "Invalid SMILES structure"
return True, "Valid SMILES"
except Exception as e:
return False, str(e)
def validate_file_path(file_path, base_dir=None):
"""Validate that a file path exists."""
if pd.isna(file_path) or file_path == "":
return True, "Empty (will use protein_sequence)"
# Handle relative paths
if base_dir:
full_path = Path(base_dir) / file_path
else:
full_path = Path(file_path)
if full_path.exists():
return True, f"File exists: {full_path}"
else:
return False, f"File not found: {full_path}"
def validate_csv(csv_path, base_dir=None):
"""
Validate a DiffDock batch input CSV file.
Args:
csv_path: Path to CSV file
base_dir: Base directory for relative paths (default: CSV directory)
Returns:
bool: True if validation passes
list: List of validation messages
"""
messages = []
valid = True
# Read CSV
try:
df = pd.read_csv(csv_path)
messages.append(f"✓ Successfully read CSV with {len(df)} rows")
except Exception as e:
messages.append(f"✗ Error reading CSV: {e}")
return False, messages
# Check required columns
required_cols = ['complex_name', 'protein_path', 'ligand_description', 'protein_sequence']
missing_cols = [col for col in required_cols if col not in df.columns]
if missing_cols:
messages.append(f"✗ Missing required columns: {', '.join(missing_cols)}")
valid = False
else:
messages.append("✓ All required columns present")
# Set base directory
if base_dir is None:
base_dir = Path(csv_path).parent
# Validate each row
for idx, row in df.iterrows():
row_msgs = []
# Check complex name
if pd.isna(row['complex_name']) or row['complex_name'] == "":
row_msgs.append("Missing complex_name")
valid = False
# Check that either protein_path or protein_sequence is provided
has_protein_path = not pd.isna(row['protein_path']) and row['protein_path'] != ""
has_protein_seq = not pd.isna(row['protein_sequence']) and row['protein_sequence'] != ""
if not has_protein_path and not has_protein_seq:
row_msgs.append("Must provide either protein_path or protein_sequence")
valid = False
elif has_protein_path and has_protein_seq:
row_msgs.append("Warning: Both protein_path and protein_sequence provided, will use protein_path")
# Validate protein path if provided
if has_protein_path:
file_valid, msg = validate_file_path(row['protein_path'], base_dir)
if not file_valid:
row_msgs.append(f"Protein file issue: {msg}")
valid = False
# Validate ligand description
if pd.isna(row['ligand_description']) or row['ligand_description'] == "":
row_msgs.append("Missing ligand_description")
valid = False
else:
ligand_desc = row['ligand_description']
            # Check whether it's a file path or a SMILES string (SMILES may contain "/" or
            # "\" for stereo bonds, so rely on a known structure-file extension or an existing path)
            if os.path.exists(str(ligand_desc)) or str(ligand_desc).lower().endswith(('.sdf', '.mol2', '.mol')):
# Likely a file path
file_valid, msg = validate_file_path(ligand_desc, base_dir)
if not file_valid:
row_msgs.append(f"Ligand file issue: {msg}")
valid = False
else:
# Likely a SMILES string
smiles_valid, msg = validate_smiles(ligand_desc)
if not smiles_valid:
row_msgs.append(f"SMILES issue: {msg}")
valid = False
if row_msgs:
messages.append(f"\nRow {idx + 1} ({row.get('complex_name', 'unnamed')}):")
for msg in row_msgs:
messages.append(f" - {msg}")
# Summary
messages.append(f"\n{'='*60}")
if valid:
messages.append("✓ CSV validation PASSED - ready for DiffDock")
else:
messages.append("✗ CSV validation FAILED - please fix issues above")
return valid, messages
def create_template_csv(output_path, num_examples=3):
"""Create a template CSV file with example entries."""
examples = {
'complex_name': ['example1', 'example2', 'example3'][:num_examples],
'protein_path': ['protein1.pdb', '', 'protein3.pdb'][:num_examples],
'ligand_description': [
'CC(=O)Oc1ccccc1C(=O)O', # Aspirin SMILES
'COc1ccc(C#N)cc1', # Example SMILES
'ligand.sdf' # Example file path
][:num_examples],
'protein_sequence': [
'', # Empty - using PDB file
'MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK', # GFP sequence
'' # Empty - using PDB file
][:num_examples]
}
df = pd.DataFrame(examples)
df.to_csv(output_path, index=False)
return df
def main():
parser = argparse.ArgumentParser(
description='Prepare and validate DiffDock batch CSV files',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Validate existing CSV
python prepare_batch_csv.py input.csv --validate
# Create template CSV
python prepare_batch_csv.py --create --output batch_template.csv
# Create template with 5 example rows
python prepare_batch_csv.py --create --output template.csv --num-examples 5
# Validate with custom base directory for relative paths
python prepare_batch_csv.py input.csv --validate --base-dir /path/to/data/
"""
)
parser.add_argument('csv_file', nargs='?', help='CSV file to validate')
parser.add_argument('--validate', action='store_true',
help='Validate the CSV file')
parser.add_argument('--create', action='store_true',
help='Create a template CSV file')
parser.add_argument('--output', '-o', help='Output path for template CSV')
parser.add_argument('--num-examples', type=int, default=3,
help='Number of example rows in template (default: 3)')
parser.add_argument('--base-dir', help='Base directory for relative file paths')
args = parser.parse_args()
# Create template
if args.create:
output_path = args.output or 'diffdock_batch_template.csv'
df = create_template_csv(output_path, args.num_examples)
print(f"✓ Created template CSV: {output_path}")
print(f"\nTemplate contents:")
print(df.to_string(index=False))
print(f"\nEdit this file with your protein-ligand pairs and run with:")
print(f" python -m inference --config default_inference_args.yaml \\")
print(f" --protein_ligand_csv {output_path} --out_dir results/")
return 0
# Validate CSV
if args.validate or args.csv_file:
if not args.csv_file:
print("Error: CSV file required for validation")
parser.print_help()
return 1
if not os.path.exists(args.csv_file):
print(f"Error: CSV file not found: {args.csv_file}")
return 1
print(f"Validating: {args.csv_file}")
print("="*60)
valid, messages = validate_csv(args.csv_file, args.base_dir)
for msg in messages:
print(msg)
return 0 if valid else 1
# No action specified
parser.print_help()
return 1
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,278 @@
#!/usr/bin/env python3
"""
DiffDock Environment Setup Checker
This script verifies that the DiffDock environment is properly configured
and all dependencies are available.
Usage:
python setup_check.py
python setup_check.py --verbose
"""
import argparse
import sys
import os
from pathlib import Path
def check_python_version():
"""Check Python version."""
import sys
version = sys.version_info
print("Checking Python version...")
if version.major == 3 and version.minor >= 8:
print(f" ✓ Python {version.major}.{version.minor}.{version.micro}")
return True
else:
print(f" ✗ Python {version.major}.{version.minor}.{version.micro} "
f"(requires Python 3.8 or higher)")
return False
def check_package(package_name, import_name=None, version_attr='__version__'):
"""Check if a Python package is installed."""
if import_name is None:
import_name = package_name
try:
module = __import__(import_name)
version = getattr(module, version_attr, 'unknown')
print(f"{package_name:20s} (version: {version})")
return True
except ImportError:
print(f"{package_name:20s} (not installed)")
return False
def check_pytorch():
"""Check PyTorch installation and CUDA availability."""
print("\nChecking PyTorch...")
try:
import torch
print(f" ✓ PyTorch version: {torch.__version__}")
# Check CUDA
if torch.cuda.is_available():
print(f" ✓ CUDA available: {torch.cuda.get_device_name(0)}")
print(f" - CUDA version: {torch.version.cuda}")
print(f" - Number of GPUs: {torch.cuda.device_count()}")
return True, True
else:
print(f" ⚠ CUDA not available (will run on CPU)")
return True, False
except ImportError:
print(f" ✗ PyTorch not installed")
return False, False
def check_pytorch_geometric():
"""Check PyTorch Geometric installation."""
print("\nChecking PyTorch Geometric...")
packages = [
('torch-geometric', 'torch_geometric'),
('torch-scatter', 'torch_scatter'),
('torch-sparse', 'torch_sparse'),
('torch-cluster', 'torch_cluster'),
]
all_ok = True
for pkg_name, import_name in packages:
if not check_package(pkg_name, import_name):
all_ok = False
return all_ok
def check_core_dependencies():
"""Check core DiffDock dependencies."""
print("\nChecking core dependencies...")
dependencies = [
('numpy', 'numpy'),
('scipy', 'scipy'),
('pandas', 'pandas'),
('rdkit', 'rdkit', 'rdBase.__version__'),
('biopython', 'Bio', '__version__'),
('pytorch-lightning', 'pytorch_lightning'),
('PyYAML', 'yaml'),
]
all_ok = True
for dep in dependencies:
pkg_name = dep[0]
import_name = dep[1]
version_attr = dep[2] if len(dep) > 2 else '__version__'
if not check_package(pkg_name, import_name, version_attr):
all_ok = False
return all_ok
def check_esm():
"""Check ESM (protein language model) installation."""
print("\nChecking ESM (for protein sequence folding)...")
try:
import esm
print(f" ✓ ESM installed (version: {esm.__version__ if hasattr(esm, '__version__') else 'unknown'})")
return True
except ImportError:
print(f" ⚠ ESM not installed (needed for protein sequence folding)")
print(f" Install with: pip install fair-esm")
return False
def check_diffdock_installation():
"""Check if DiffDock is properly installed/cloned."""
print("\nChecking DiffDock installation...")
# Look for key files
key_files = [
'inference.py',
'default_inference_args.yaml',
'environment.yml',
]
found_files = []
missing_files = []
for filename in key_files:
if os.path.exists(filename):
found_files.append(filename)
else:
missing_files.append(filename)
if found_files:
print(f" ✓ Found DiffDock files in current directory:")
for f in found_files:
print(f" - {f}")
else:
print(f" ⚠ DiffDock files not found in current directory")
print(f" Current directory: {os.getcwd()}")
print(f" Make sure you're in the DiffDock repository root")
# Check for model checkpoints
model_dir = Path('./workdir/v1.1/score_model')
confidence_dir = Path('./workdir/v1.1/confidence_model')
if model_dir.exists() and confidence_dir.exists():
print(f" ✓ Model checkpoints found")
else:
print(f" ⚠ Model checkpoints not found in ./workdir/v1.1/")
print(f" Models will be downloaded on first run")
return len(found_files) > 0
def print_installation_instructions():
"""Print installation instructions if setup is incomplete."""
print("\n" + "="*80)
print("Installation Instructions")
print("="*80)
print("""
If DiffDock is not installed, follow these steps:
1. Clone the repository:
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
2. Create conda environment:
conda env create --file environment.yml
conda activate diffdock
3. Verify installation:
python setup_check.py
For Docker installation:
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
For more information, visit: https://github.com/gcorso/DiffDock
""")
def print_performance_notes(has_cuda):
"""Print performance notes based on available hardware."""
print("\n" + "="*80)
print("Performance Notes")
print("="*80)
if has_cuda:
print("""
✓ GPU detected - DiffDock will run efficiently
Expected performance:
- First run: ~2-5 minutes (pre-computing SO(2)/SO(3) tables)
- Subsequent runs: ~10-60 seconds per complex (depending on settings)
- Batch processing: Highly efficient with GPU
""")
else:
print("""
⚠ No GPU detected - DiffDock will run on CPU
Expected performance:
- CPU inference is SIGNIFICANTLY slower than GPU
- Single complex: Several minutes to hours
- Batch processing: Not recommended on CPU
Recommendation: Use GPU for practical applications
- Cloud options: Google Colab, AWS, or other cloud GPU services
- Local: Install CUDA-capable GPU
""")
def main():
parser = argparse.ArgumentParser(
description='Check DiffDock environment setup',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('--verbose', '-v', action='store_true',
help='Show detailed version information')
args = parser.parse_args()
print("="*80)
print("DiffDock Environment Setup Checker")
print("="*80)
checks = []
# Run all checks
checks.append(("Python version", check_python_version()))
pytorch_ok, has_cuda = check_pytorch()
checks.append(("PyTorch", pytorch_ok))
checks.append(("PyTorch Geometric", check_pytorch_geometric()))
checks.append(("Core dependencies", check_core_dependencies()))
checks.append(("ESM", check_esm()))
checks.append(("DiffDock files", check_diffdock_installation()))
# Summary
print("\n" + "="*80)
print("Summary")
print("="*80)
all_passed = all(result for _, result in checks)
for check_name, result in checks:
status = "✓ PASS" if result else "✗ FAIL"
print(f" {status:8s} - {check_name}")
if all_passed:
print("\n✓ All checks passed! DiffDock is ready to use.")
print_performance_notes(has_cuda)
return 0
else:
print("\n✗ Some checks failed. Please install missing dependencies.")
print_installation_instructions()
return 1
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,617 @@
---
name: etetoolkit
description: Comprehensive toolkit for phylogenetic and hierarchical tree analysis using the ETE (Environment for Tree Exploration) Python library. This skill should be used when working with phylogenetic trees, gene trees, species trees, clustering dendrograms, or any hierarchical tree structures. Applies to tasks involving tree manipulation (pruning, rerooting, format conversion), evolutionary analysis (orthology detection, duplication/speciation events), tree comparison (Robinson-Foulds distance), NCBI taxonomy integration, tree visualization (PDF, SVG, PNG output), and clustering analysis with heatmaps.
---
# ETE Toolkit Skill
## Overview
Provide comprehensive support for phylogenetic and hierarchical tree analysis using the ETE (Environment for Tree Exploration) toolkit. Enable tree manipulation, evolutionary analysis, visualization, and integration with biological databases for phylogenomic research and clustering analysis.
## Core Capabilities
### 1. Tree Manipulation and Analysis
Load, manipulate, and analyze hierarchical tree structures with support for:
- **Tree I/O**: Read and write Newick, NHX, PhyloXML, and NeXML formats
- **Tree traversal**: Navigate trees using preorder, postorder, or levelorder strategies
- **Topology modification**: Prune, root, collapse nodes, resolve polytomies
- **Distance calculations**: Compute branch lengths and topological distances between nodes
- **Tree comparison**: Calculate Robinson-Foulds distances and identify topological differences
**Common patterns:**
```python
from ete3 import Tree
# Load tree from file
tree = Tree("tree.nw", format=1)
# Basic statistics
print(f"Leaves: {len(tree)}")
print(f"Total nodes: {len(list(tree.traverse()))}")
# Prune to taxa of interest
taxa_to_keep = ["species1", "species2", "species3"]
tree.prune(taxa_to_keep, preserve_branch_length=True)
# Midpoint root
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)
# Save modified tree
tree.write(outfile="rooted_tree.nw")
```
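The traversal strategies listed above can be selected explicitly when walking a tree; a minimal sketch:
```python
from ete3 import Tree

tree = Tree("((A:1,B:1):0.5,(C:1,D:1):0.5);")
for node in tree.traverse("postorder"):   # also "preorder" or "levelorder"
    print(node.name or "<internal>")
```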
Use `scripts/tree_operations.py` for command-line tree manipulation:
```bash
# Display tree statistics
python scripts/tree_operations.py stats tree.nw
# Convert format
python scripts/tree_operations.py convert tree.nw output.nw --in-format 0 --out-format 1
# Reroot tree
python scripts/tree_operations.py reroot tree.nw rooted.nw --midpoint
# Prune to specific taxa
python scripts/tree_operations.py prune tree.nw pruned.nw --keep-taxa "sp1,sp2,sp3"
# Show ASCII visualization
python scripts/tree_operations.py ascii tree.nw
```
### 2. Phylogenetic Analysis
Analyze gene trees with evolutionary event detection:
- **Sequence alignment integration**: Link trees to multiple sequence alignments (FASTA, Phylip)
- **Species naming**: Automatic or custom species extraction from gene names
- **Evolutionary events**: Detect duplication and speciation events using Species Overlap or tree reconciliation
- **Orthology detection**: Identify orthologs and paralogs based on evolutionary events
- **Gene family analysis**: Split trees by duplications, collapse lineage-specific expansions
**Workflow for gene tree analysis:**
```python
from ete3 import PhyloTree
# Load gene tree with alignment
tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")
# Set species naming function
def get_species(gene_name):
return gene_name.split("_")[0]
tree.set_species_naming_function(get_species)
# Detect evolutionary events
events = tree.get_descendant_evol_events()
# Analyze events
for node in tree.traverse():
if hasattr(node, "evoltype"):
if node.evoltype == "D":
print(f"Duplication at {node.name}")
elif node.evoltype == "S":
print(f"Speciation at {node.name}")
# Extract ortholog groups
ntrees, ndups, sptrees = tree.get_speciation_trees()  # returns counts plus an iterator of subtrees
for i, ortho_tree in enumerate(sptrees):
    ortho_tree.write(outfile=f"ortholog_group_{i}.nw")
```
**Finding orthologs and paralogs:**
```python
# Find orthologs to query gene
query = tree & "species1_gene1"
orthologs = []
paralogs = []
for event in events:
if query in event.in_seqs:
if event.etype == "S":
orthologs.extend([s for s in event.out_seqs if s != query])
elif event.etype == "D":
paralogs.extend([s for s in event.out_seqs if s != query])
```
### 3. NCBI Taxonomy Integration
Integrate taxonomic information from NCBI Taxonomy database:
- **Database access**: Automatic download and local caching of NCBI taxonomy (~300MB)
- **Taxid/name translation**: Convert between taxonomic IDs and scientific names
- **Lineage retrieval**: Get complete evolutionary lineages
- **Taxonomy trees**: Build species trees connecting specified taxa
- **Tree annotation**: Automatically annotate trees with taxonomic information
**Building taxonomy-based trees:**
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa()
# Build tree from species names
species = ["Homo sapiens", "Pan troglodytes", "Mus musculus"]
name2taxid = ncbi.get_name_translator(species)
taxids = [name2taxid[sp][0] for sp in species]
# Get minimal tree connecting taxa
tree = ncbi.get_topology(taxids)
# Annotate nodes with taxonomy info
for node in tree.traverse():
if hasattr(node, "sci_name"):
print(f"{node.sci_name} - Rank: {node.rank} - TaxID: {node.taxid}")
```
**Annotating existing trees:**
```python
# Get taxonomy info for tree leaves
for leaf in tree:
species = extract_species_from_name(leaf.name)
taxid = ncbi.get_name_translator([species])[species][0]
# Get lineage
lineage = ncbi.get_lineage(taxid)
ranks = ncbi.get_rank(lineage)
names = ncbi.get_taxid_translator(lineage)
# Add to node
leaf.add_feature("taxid", taxid)
leaf.add_feature("lineage", [names[t] for t in lineage])
```
### 4. Tree Visualization
Create publication-quality tree visualizations:
- **Output formats**: PNG (raster), PDF, and SVG (vector) for publications
- **Layout modes**: Rectangular and circular tree layouts
- **Interactive GUI**: Explore trees interactively with zoom, pan, and search
- **Custom styling**: NodeStyle for node appearance (colors, shapes, sizes)
- **Faces**: Add graphical elements (text, images, charts, heatmaps) to nodes
- **Layout functions**: Dynamic styling based on node properties
**Basic visualization workflow:**
```python
from ete3 import Tree, TreeStyle, NodeStyle
tree = Tree("tree.nw")
# Configure tree style
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = True
ts.scale = 50 # pixels per branch length unit
# Style nodes
for node in tree.traverse():
nstyle = NodeStyle()
if node.is_leaf():
nstyle["fgcolor"] = "blue"
nstyle["size"] = 8
else:
# Color by support
if node.support > 0.9:
nstyle["fgcolor"] = "darkgreen"
else:
nstyle["fgcolor"] = "red"
nstyle["size"] = 5
node.set_style(nstyle)
# Render to file
tree.render("tree.pdf", tree_style=ts)
tree.render("tree.png", w=800, h=600, units="px", dpi=300)
```
Use `scripts/quick_visualize.py` for rapid visualization:
```bash
# Basic visualization
python scripts/quick_visualize.py tree.nw output.pdf
# Circular layout with custom styling
python scripts/quick_visualize.py tree.nw output.pdf --mode c --color-by-support
# High-resolution PNG
python scripts/quick_visualize.py tree.nw output.png --width 1200 --height 800 --units px --dpi 300
# Custom title and styling
python scripts/quick_visualize.py tree.nw output.pdf --title "Species Phylogeny" --show-support
```
**Advanced visualization with faces:**
```python
from ete3 import Tree, TreeStyle, TextFace, CircleFace
tree = Tree("tree.nw")
# Add features to nodes
for leaf in tree:
leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "land")
# Layout function
def layout(node):
if node.is_leaf():
# Add colored circle
color = "blue" if node.habitat == "marine" else "green"
circle = CircleFace(radius=5, color=color)
node.add_face(circle, column=0, position="aligned")
# Add label
label = TextFace(node.name, fsize=10)
node.add_face(label, column=1, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
tree.render("annotated_tree.pdf", tree_style=ts)
```
### 5. Clustering Analysis
Analyze hierarchical clustering results with data integration:
- **ClusterTree**: Specialized class for clustering dendrograms
- **Data matrix linking**: Connect tree leaves to numerical profiles
- **Cluster metrics**: Silhouette coefficient, Dunn index, inter/intra-cluster distances
- **Validation**: Test cluster quality with different distance metrics
- **Heatmap visualization**: Display data matrices alongside trees
**Clustering workflow:**
```python
from ete3 import ClusterTree
# Load tree with data matrix
matrix = """#Names\tSample1\tSample2\tSample3
Gene1\t1.5\t2.3\t0.8
Gene2\t0.9\t1.1\t1.8
Gene3\t2.1\t2.5\t0.5"""
tree = ClusterTree("((Gene1,Gene2),Gene3);", text_array=matrix)
# Evaluate cluster quality
for node in tree.traverse():
if not node.is_leaf():
silhouette = node.get_silhouette()
dunn = node.get_dunn()
print(f"Cluster: {node.name}")
print(f" Silhouette: {silhouette:.3f}")
print(f" Dunn index: {dunn:.3f}")
# Visualize with heatmap
tree.show("heatmap")
```
### 6. Tree Comparison
Quantify topological differences between trees:
- **Robinson-Foulds distance**: Standard metric for tree comparison
- **Normalized RF**: Scale-invariant distance (0.0 to 1.0)
- **Partition analysis**: Identify unique and shared bipartitions
- **Consensus trees**: Analyze support across multiple trees
- **Batch comparison**: Compare multiple trees pairwise
**Compare two trees:**
```python
from ete3 import Tree
tree1 = Tree("tree1.nw")
tree2 = Tree("tree2.nw")
# Calculate RF distance
rf, max_rf, common_leaves, parts_t1, parts_t2 = tree1.robinson_foulds(tree2)
print(f"RF distance: {rf}/{max_rf}")
print(f"Normalized RF: {rf/max_rf:.3f}")
print(f"Common leaves: {len(common_leaves)}")
# Find unique partitions
unique_t1 = parts_t1 - parts_t2
unique_t2 = parts_t2 - parts_t1
print(f"Unique to tree1: {len(unique_t1)}")
print(f"Unique to tree2: {len(unique_t2)}")
```
**Compare multiple trees:**
```python
import numpy as np
trees = [Tree(f"tree{i}.nw") for i in range(4)]
# Create distance matrix
n = len(trees)
dist_matrix = np.zeros((n, n))
for i in range(n):
for j in range(i+1, n):
rf, max_rf, _, _, _ = trees[i].robinson_foulds(trees[j])
norm_rf = rf / max_rf if max_rf > 0 else 0
dist_matrix[i, j] = norm_rf
dist_matrix[j, i] = norm_rf
```
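As a shortcut, the `compare()` method (see `references/api_reference.md`) wraps the same computation and returns a normalized distance directly; a brief sketch:
```python
from ete3 import Tree

tree1 = Tree("tree1.nw")
tree2 = Tree("tree2.nw")

# compare() returns a dictionary of comparison statistics
result = tree1.compare(tree2)
print(f"Normalized RF: {result['norm_rf']:.3f}")  # 0.0 = identical topologies
print(f"Ref edges found in source: {result['ref_edges_in_source']:.3f}")
```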
## Installation and Setup
Install ETE toolkit:
```bash
# Basic installation
pip install ete3
# With external dependencies for rendering (optional but recommended)
# On macOS:
brew install qt@5
# On Ubuntu/Debian:
sudo apt-get install python3-pyqt5 python3-pyqt5.qtsvg
# For full features including GUI
pip install ete3[gui]
```
**First-time NCBI Taxonomy setup:**
The first time NCBITaxa is instantiated, it automatically downloads the NCBI taxonomy database (~300MB) to `~/.etetoolkit/taxa.sqlite`. This happens only once:
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa() # Downloads database on first run
```
Update taxonomy database:
```python
ncbi.update_taxonomy_database() # Download latest NCBI data
```
## Common Use Cases
### Use Case 1: Phylogenomic Pipeline
Complete workflow from gene tree to ortholog identification:
```python
from ete3 import PhyloTree, NCBITaxa
# 1. Load gene tree with alignment
tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")
# 2. Configure species naming
tree.set_species_naming_function(lambda x: x.split("_")[0])
# 3. Detect evolutionary events
tree.get_descendant_evol_events()
# 4. Annotate with taxonomy
ncbi = NCBITaxa()
# species_to_taxid: user-provided mapping from species name to NCBI taxid
for leaf in tree:
if leaf.species in species_to_taxid:
taxid = species_to_taxid[leaf.species]
lineage = ncbi.get_lineage(taxid)
leaf.add_feature("lineage", lineage)
# 5. Extract ortholog groups
ortho_groups = tree.get_speciation_trees()
# 6. Save and visualize
for i, ortho in enumerate(ortho_groups):
ortho.write(outfile=f"ortho_{i}.nw")
```
### Use Case 2: Tree Preprocessing and Formatting
Batch process trees for analysis:
```bash
# Convert format
python scripts/tree_operations.py convert input.nw output.nw --in-format 0 --out-format 1
# Root at midpoint
python scripts/tree_operations.py reroot input.nw rooted.nw --midpoint
# Prune to focal taxa
python scripts/tree_operations.py prune rooted.nw pruned.nw --keep-taxa taxa_list.txt
# Get statistics
python scripts/tree_operations.py stats pruned.nw
```
### Use Case 3: Publication-Quality Figures
Create styled visualizations:
```python
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
tree = Tree("tree.nw")
# Define clade colors
clade_colors = {
"Mammals": "red",
"Birds": "blue",
"Fish": "green"
}
def layout(node):
# Highlight clades
if node.is_leaf():
for clade, color in clade_colors.items():
if clade in node.name:
nstyle = NodeStyle()
nstyle["fgcolor"] = color
nstyle["size"] = 8
node.set_style(nstyle)
else:
# Add support values
if node.support > 0.95:
support = TextFace(f"{node.support:.2f}", fsize=8)
node.add_face(support, column=0, position="branch-top")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_scale = True
# Render for publication
tree.render("figure.pdf", w=200, units="mm", tree_style=ts)
tree.render("figure.svg", tree_style=ts) # Editable vector
```
### Use Case 4: Automated Tree Analysis
Process multiple trees systematically:
```python
from ete3 import Tree
import os
input_dir = "trees"
output_dir = "processed"
for filename in os.listdir(input_dir):
if filename.endswith(".nw"):
tree = Tree(os.path.join(input_dir, filename))
# Standardize: midpoint root, resolve polytomies
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)
tree.resolve_polytomy(recursive=True)
# Filter low support branches
for node in tree.traverse():
if hasattr(node, 'support') and node.support < 0.5:
if not node.is_leaf() and not node.is_root():
node.delete()
# Save processed tree
output_file = os.path.join(output_dir, f"processed_{filename}")
tree.write(outfile=output_file)
```
## Reference Documentation
For comprehensive API documentation, code examples, and detailed guides, refer to the following resources in the `references/` directory:
- **`api_reference.md`**: Complete API documentation for all ETE classes and methods (Tree, PhyloTree, ClusterTree, NCBITaxa), including parameters, return types, and code examples
- **`workflows.md`**: Common workflow patterns organized by task (tree operations, phylogenetic analysis, tree comparison, taxonomy integration, clustering analysis)
- **`visualization.md`**: Comprehensive visualization guide covering TreeStyle, NodeStyle, Faces, layout functions, and advanced visualization techniques
Load these references when detailed information is needed:
```python
# To use API reference
# Read references/api_reference.md for complete method signatures and parameters
# To implement workflows
# Read references/workflows.md for step-by-step workflow examples
# To create visualizations
# Read references/visualization.md for styling and rendering options
```
## Troubleshooting
**Import errors:**
```bash
# If "ModuleNotFoundError: No module named 'ete3'"
pip install ete3
# For GUI and rendering issues
pip install ete3[gui]
```
**Rendering issues:**
If `tree.render()` or `tree.show()` fails with Qt-related errors, install system dependencies:
```bash
# macOS
brew install qt@5
# Ubuntu/Debian
sudo apt-get install python3-pyqt5 python3-pyqt5.qtsvg
```
**NCBI Taxonomy database:**
If database download fails or becomes corrupted:
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database() # Redownload database
```
**Memory issues with large trees:**
For very large trees (>10,000 leaves), use iterators instead of list comprehensions:
```python
# Memory-efficient iteration
for leaf in tree.iter_leaves():
process(leaf)
# Instead of
for leaf in tree.get_leaves(): # Loads all into memory
process(leaf)
```
## Newick Format Reference
ETE supports multiple Newick format specifications (0-100):
- **Format 0**: Flexible with branch lengths (default)
- **Format 1**: With internal node names
- **Format 2**: With bootstrap/support values
- **Format 5**: Internal node names + branch lengths
- **Format 8**: All features (names, distances, support)
- **Format 9**: Leaf names only
- **Format 100**: Topology only
Specify format when reading/writing:
```python
tree = Tree("tree.nw", format=1)
tree.write(outfile="output.nw", format=5)
```
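A short sketch showing how the format number changes what `write()` emits (tree string and node names are illustrative):
```python
from ete3 import Tree

t = Tree("(A:0.5,(B:0.3,(C:0.2,D:0.4)anc1:0.1)anc2:0.2);", format=1)

print(t.write(format=0))  # leaf names + branch lengths (default)
print(t.write(format=1))  # adds internal node names
print(t.write(format=5))  # internal node names + branch lengths
print(t.write(format=9))  # leaf names only
```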
NHX (New Hampshire eXtended) format preserves custom features:
```python
tree.write(outfile="tree.nhx", features=["habitat", "temperature", "depth"])
```
## Best Practices
1. **Preserve branch lengths**: Use `preserve_branch_length=True` when pruning for phylogenetic analysis (practices 1-3 are sketched in code after this list)
2. **Cache content**: Use `get_cached_content()` for repeated access to node contents on large trees
3. **Use iterators**: Employ `iter_*` methods for memory-efficient processing of large trees
4. **Choose appropriate traversal**: Postorder for bottom-up analysis, preorder for top-down
5. **Validate monophyly**: Always check returned clade type (monophyletic/paraphyletic/polyphyletic)
6. **Vector formats for publication**: Use PDF or SVG for publication figures (scalable, editable)
7. **Interactive testing**: Use `tree.show()` to test visualizations before rendering to file
8. **PhyloTree for phylogenetics**: Use PhyloTree class for gene trees and evolutionary analysis
9. **Copy method selection**: "newick" for speed, "cpickle" for full fidelity, "deepcopy" for complex objects
10. **NCBI query caching**: Store NCBI taxonomy query results to avoid repeated database access
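A compact sketch of practices 1-3 on a single tree (file and taxon names are illustrative):
```python
from ete3 import Tree

tree = Tree("large_tree.nw")

# 1. Prune to a subset while keeping original branch lengths
tree.prune(["sp1", "sp2", "sp3", "sp4"], preserve_branch_length=True)

# 2. Cache node contents once, then reuse for repeated lookups
node2leaves = tree.get_cached_content()
for node in tree.traverse():
    print(node.name, len(node2leaves[node]))

# 3. Iterate leaves lazily instead of materializing the full list
for leaf in tree.iter_leaves():
    pass  # process(leaf)
```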

View File

@@ -0,0 +1,583 @@
# ETE Toolkit API Reference
## Overview
ETE (Environment for Tree Exploration) is a Python toolkit for phylogenetic tree manipulation, analysis, and visualization. This reference covers the main classes and methods.
## Core Classes
### TreeNode (alias: Tree)
The fundamental class representing tree structures with hierarchical node organization.
**Constructor:**
```python
from ete3 import Tree
t = Tree(newick=None, format=0, dist=None, support=None, name=None)
```
**Parameters:**
- `newick`: Newick string or file path
- `format`: Newick format (0-100). Common formats:
- `0`: Flexible format with branch lengths and names
- `1`: With internal node names
- `2`: With bootstrap/support values
- `5`: Internal node names and branch lengths
- `8`: All features (names, distances, support)
- `9`: Leaf names only
- `100`: Topology only
- `dist`: Branch length to parent (default: 1.0)
- `support`: Bootstrap/confidence value (default: 1.0)
- `name`: Node identifier
### PhyloTree
Specialized class for phylogenetic analysis, extending TreeNode.
**Constructor:**
```python
from ete3 import PhyloTree
t = PhyloTree(newick=None, alignment=None, alg_format='fasta',
sp_naming_function=None, format=0)
```
**Additional Parameters:**
- `alignment`: Path to alignment file or alignment string
- `alg_format`: 'fasta' or 'phylip'
- `sp_naming_function`: Custom function to extract species from node names
### ClusterTree
Class for hierarchical clustering analysis.
**Constructor:**
```python
from ete3 import ClusterTree
t = ClusterTree(newick, text_array=None)
```
**Parameters:**
- `text_array`: Tab-delimited matrix with column headers and row names
### NCBITaxa
Class for NCBI taxonomy database operations.
**Constructor:**
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa(dbfile=None)
```
First instantiation downloads ~300MB NCBI taxonomy database to `~/.etetoolkit/taxa.sqlite`.
## Node Properties
### Basic Attributes
| Property | Type | Description | Default |
|----------|------|-------------|---------|
| `name` | str | Node identifier | "NoName" |
| `dist` | float | Branch length to parent | 1.0 |
| `support` | float | Bootstrap/confidence value | 1.0 |
| `up` | TreeNode | Parent node reference | None |
| `children` | list | Child nodes | [] |
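A quick sketch of reading these attributes from a freshly parsed tree:
```python
from ete3 import Tree

t = Tree("(A:0.5,(B:0.3,C:0.2)internal:0.1);", format=1)

node = t & "B"
print(node.name)        # "B"
print(node.dist)        # 0.3 (branch length to parent)
print(node.support)     # 1.0 (default when absent from the newick)
print(node.up.name)     # "internal" (parent node)
print(len(t.children))  # 2 (children of the root)
```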
### Custom Features
Add any custom data to nodes:
```python
node.add_feature("custom_name", value)
node.add_features(feature1=value1, feature2=value2)
```
Access features:
```python
value = node.custom_name
# or
value = getattr(node, "custom_name", default_value)
```
## Navigation & Traversal
### Basic Navigation
```python
# Check node type
node.is_leaf() # Returns True if terminal node
node.is_root() # Returns True if root node
len(node) # Number of leaves under node
# Get relatives
parent = node.up
children = node.children
root = node.get_tree_root()
```
### Traversal Strategies
```python
# Three traversal strategies
for node in tree.traverse("preorder"): # Root → Left → Right
print(node.name)
for node in tree.traverse("postorder"): # Left → Right → Root
print(node.name)
for node in tree.traverse("levelorder"): # Level by level
print(node.name)
# Exclude root
for node in tree.iter_descendants("postorder"):
print(node.name)
```
### Getting Nodes
```python
# Get all leaves
leaves = tree.get_leaves()
for leaf in tree: # Shortcut iteration
print(leaf.name)
# Get all descendants
descendants = tree.get_descendants()
# Get ancestors
ancestors = node.get_ancestors()
# Get specific nodes by attribute
nodes = tree.search_nodes(name="NodeA")
node = tree & "NodeA" # Shortcut syntax
# Get leaves by name
leaves = tree.get_leaves_by_name("LeafA")
# Get common ancestor
ancestor = tree.get_common_ancestor("LeafA", "LeafB", "LeafC")
# Custom filtering
filtered = [n for n in tree.traverse() if n.dist > 0.5 and n.is_leaf()]
```
### Iterator Methods (Memory Efficient)
```python
# For large trees, use iterators
for match in tree.iter_search_nodes(name="X"):
if some_condition:
break # Stop early
for leaf in tree.iter_leaves():
process(leaf)
for descendant in node.iter_descendants():
process(descendant)
```
## Tree Construction & Modification
### Creating Trees from Scratch
```python
# Empty tree
t = Tree()
# Add children
child1 = t.add_child(name="A", dist=1.0)
child2 = t.add_child(name="B", dist=2.0)
# Add siblings
sister = child1.add_sister(name="C", dist=1.5)
# Populate with random topology
t.populate(10) # Creates 10 random leaves
t.populate(5, names_library=["A", "B", "C", "D", "E"])
```
### Removing & Deleting Nodes
```python
# Detach: removes entire subtree
node.detach()
# or
parent.remove_child(node)
# Delete: removes node, reconnects children to parent
node.delete()
```
### Pruning
Keep only specified leaves:
```python
# Keep only these leaves, remove all others
tree.prune(["A", "B", "C"])
# Preserve original branch lengths
tree.prune(["A", "B", "C"], preserve_branch_length=True)
```
### Tree Concatenation
```python
# Attach one tree as child of another
t1 = Tree("(A,(B,C));")
t2 = Tree("((D,E),(F,G));")
A = t1 & "A"
A.add_child(t2)
```
### Tree Copying
```python
# Four copy methods
copy1 = tree.copy() # Default: cpickle (preserves types)
copy2 = tree.copy("newick") # Fastest: basic topology
copy3 = tree.copy("newick-extended") # Includes custom features as text
copy4 = tree.copy("deepcopy") # Slowest: handles complex objects
```
## Tree Operations
### Rooting
```python
# Set outgroup (reroot tree)
outgroup_node = tree & "OutgroupLeaf"
tree.set_outgroup(outgroup_node)
# Midpoint rooting
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)
# Unroot tree
tree.unroot()
```
### Resolving Polytomies
```python
# Resolve multifurcations to bifurcations
tree.resolve_polytomy(recursive=False) # Single node only
tree.resolve_polytomy(recursive=True) # Entire tree
```
### Ladderize
```python
# Sort branches by size
tree.ladderize()
tree.ladderize(direction=1) # Ascending order
```
### Convert to Ultrametric
```python
# Make all leaves equidistant from root
tree.convert_to_ultrametric()
tree.convert_to_ultrametric(tree_length=100) # Specific total length
```
## Distance & Comparison
### Distance Calculations
```python
# Branch length distance between nodes
dist = tree.get_distance("A", "B")
dist = nodeA.get_distance(nodeB)
# Topology-only distance (count nodes)
dist = tree.get_distance("A", "B", topology_only=True)
# Farthest node
farthest, distance = node.get_farthest_node()
farthest_leaf, distance = node.get_farthest_leaf()
```
### Monophyly Testing
```python
# Check if values form monophyletic group
is_mono, clade_type, base_node = tree.check_monophyly(
values=["A", "B", "C"],
target_attr="name"
)
# Returns: (bool, "monophyletic"|"paraphyletic"|"polyphyletic", node)
# Get all monophyletic clades
monophyletic_nodes = tree.get_monophyletic(
values=["A", "B", "C"],
target_attr="name"
)
```
### Tree Comparison
```python
# Robinson-Foulds distance
rf, max_rf, common_leaves, parts_t1, parts_t2 = t1.robinson_foulds(t2)
print(f"RF distance: {rf}/{max_rf}")
# Normalized RF distance
result = t1.compare(t2)
norm_rf = result["norm_rf"] # 0.0 to 1.0
ref_edges = result["ref_edges_in_source"]
```
## Input/Output
### Reading Trees
```python
# From string
t = Tree("(A:1,(B:1,(C:1,D:1):0.5):0.5);")
# From file
t = Tree("tree.nw")
# With format
t = Tree("tree.nw", format=1)
```
### Writing Trees
```python
# To string
newick = tree.write()
newick = tree.write(format=1)
newick = tree.write(format=1, features=["support", "custom_feature"])
# To file
tree.write(outfile="output.nw")
tree.write(format=5, outfile="output.nw", features=["name", "dist"])
# Custom leaf function (for collapsing)
def is_leaf(node):
return len(node) <= 3 # Treat small clades as leaves
newick = tree.write(is_leaf_fn=is_leaf)
```
### Tree Rendering
```python
# Show interactive GUI
tree.show()
# Render to file (PNG, PDF, SVG)
tree.render("tree.png")
tree.render("tree.pdf", w=200, units="mm")
tree.render("tree.svg", dpi=300)
# ASCII representation
print(tree)
print(tree.get_ascii(show_internal=True, compact=False))
```
## Performance Optimization
### Caching Content
For frequent access to node contents:
```python
# Cache all node contents
node2content = tree.get_cached_content()
# Fast lookup
for node in tree.traverse():
leaves = node2content[node]
print(f"Node has {len(leaves)} leaves")
```
### Precomputing Distances
```python
# For multiple distance queries
node2dist = {}
for node in tree.traverse():
node2dist[node] = node.get_distance(tree)
```
## PhyloTree-Specific Methods
### Sequence Alignment
```python
# Link alignment
tree.link_to_alignment("alignment.fasta", alg_format="fasta")
# Access sequences
for leaf in tree:
print(f"{leaf.name}: {leaf.sequence}")
```
### Species Naming
```python
# Default naming: species code = first 3 letters of the node name
# Custom function
def get_species(node_name):
return node_name.split("_")[0]
tree.set_species_naming_function(get_species)
# Manual setting
for leaf in tree:
leaf.species = extract_species(leaf.name)
```
### Evolutionary Events
```python
# Detect duplication/speciation events
events = tree.get_descendant_evol_events()
for node in tree.traverse():
if hasattr(node, "evoltype"):
print(f"{node.name}: {node.evoltype}") # "D" or "S"
# With species tree
species_tree = Tree("(human, (chimp, gorilla));")
events = tree.get_descendant_evol_events(species_tree=species_tree)
```
### Gene Tree Operations
```python
# Get species trees from duplicated gene families
species_trees = tree.get_speciation_trees()
# Split by duplication events
subtrees = tree.split_by_dups()
# Collapse lineage-specific expansions
tree.collapse_lineage_specific_expansions()
```
## NCBITaxa Methods
### Database Operations
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa()
# Update database
ncbi.update_taxonomy_database()
```
### Querying Taxonomy
```python
# Get taxid from name
taxid = ncbi.get_name_translator(["Homo sapiens"])
# Returns: {'Homo sapiens': [9606]}
# Get name from taxid
names = ncbi.get_taxid_translator([9606, 9598])
# Returns: {9606: 'Homo sapiens', 9598: 'Pan troglodytes'}
# Get rank
rank = ncbi.get_rank([9606])
# Returns: {9606: 'species'}
# Get lineage
lineage = ncbi.get_lineage(9606)
# Returns: [1, 131567, 2759, ..., 9606]
# Get descendants
descendants = ncbi.get_descendant_taxa("Primates")
descendants = ncbi.get_descendant_taxa("Primates", collapse_subspecies=True)
```
### Building Taxonomy Trees
```python
# Get minimal tree connecting taxa
tree = ncbi.get_topology([9606, 9598, 9593]) # Human, chimp, gorilla
# Annotate tree with taxonomy
tree.annotate_ncbi_taxa()
# Access taxonomy info
for node in tree.traverse():
print(f"{node.sci_name} ({node.taxid}) - Rank: {node.rank}")
```
## ClusterTree Methods
### Linking to Data
```python
# Link matrix to tree
tree.link_to_arraytable(matrix_string)
# Access profiles
for leaf in tree:
print(leaf.profile) # Numerical array
```
### Cluster Metrics
```python
# Get silhouette coefficient
silhouette = tree.get_silhouette()
# Get Dunn index
dunn = tree.get_dunn()
# Inter/intra cluster distances
inter = node.intercluster_dist
intra = node.intracluster_dist
# Standard deviation
dev = node.deviation
```
### Distance Metrics
Supported metrics:
- `"euclidean"`: Euclidean distance
- `"pearson"`: Pearson correlation
- `"spearman"`: Spearman rank correlation
```python
tree.dist_to(node2, metric="pearson")
```
## Common Error Handling
```python
# Check if tree is empty
if tree.children:
print("Tree has children")
# Check if node exists
nodes = tree.search_nodes(name="X")
if nodes:
node = nodes[0]
# Safe feature access
value = getattr(node, "feature_name", default_value)
# Check format compatibility
try:
tree.write(format=1)
except Exception:
print("Tree lacks internal node names")
```
## Best Practices
1. **Use appropriate traversal**: Postorder for bottom-up, preorder for top-down
2. **Cache for repeated access**: Use `get_cached_content()` for frequent queries
3. **Use iterators for large trees**: Memory-efficient processing
4. **Preserve branch lengths**: Use `preserve_branch_length=True` when pruning
5. **Choose copy method wisely**: "newick" for speed, "cpickle" for full fidelity
6. **Validate monophyly**: Check returned clade type (monophyletic/paraphyletic/polyphyletic)
7. **Use PhyloTree for phylogenetics**: Specialized methods for evolutionary analysis
8. **Cache NCBI queries**: Store results to avoid repeated database access

View File

@@ -0,0 +1,783 @@
# ETE Toolkit Visualization Guide
Complete guide to tree visualization with ETE Toolkit.
## Table of Contents
1. [Rendering Basics](#rendering-basics)
2. [TreeStyle Configuration](#treestyle-configuration)
3. [Node Styling](#node-styling)
4. [Faces](#faces)
5. [Layout Functions](#layout-functions)
6. [Advanced Visualization](#advanced-visualization)
---
## Rendering Basics
### Output Formats
ETE supports three main output formats:
```python
from ete3 import Tree
tree = Tree("tree.nw")
# PNG (raster, good for presentations)
tree.render("output.png", w=800, h=600, units="px", dpi=300)
# PDF (vector, good for publications)
tree.render("output.pdf", w=200, units="mm")
# SVG (vector, editable)
tree.render("output.svg")
```
### Units and Dimensions
```python
# Pixels
tree.render("tree.png", w=1200, h=800, units="px")
# Millimeters
tree.render("tree.pdf", w=210, h=297, units="mm") # A4 size
# Inches
tree.render("tree.pdf", w=8.5, h=11, units="in") # US Letter
# Auto-size (aspect ratio preserved)
tree.render("tree.pdf", w=200, units="mm") # Height auto-calculated
```
### Interactive Visualization
```python
from ete3 import Tree
tree = Tree("tree.nw")
# Launch GUI
# - Zoom with mouse wheel
# - Pan by dragging
# - Search with Ctrl+F
# - Export from menu
# - Edit node properties
tree.show()
```
---
## TreeStyle Configuration
### Basic TreeStyle Options
```python
from ete3 import Tree, TreeStyle
tree = Tree("tree.nw")
ts = TreeStyle()
# Display options
ts.show_leaf_name = True # Show leaf names
ts.show_branch_length = True # Show branch lengths
ts.show_branch_support = True # Show support values
ts.show_scale = True # Show scale bar
# Branch length scaling
ts.scale = 50 # Pixels per branch length unit
ts.min_leaf_separation = 10 # Minimum space between leaves (pixels)
# Layout orientation
ts.rotation = 0 # 0=left-to-right, 90=top-to-bottom
ts.branch_vertical_margin = 10 # Vertical spacing between branches
# Tree shape
ts.mode = "r" # "r"=rectangular (default), "c"=circular
tree.render("tree.pdf", tree_style=ts)
```
### Circular Trees
```python
from ete3 import Tree, TreeStyle
tree = Tree("tree.nw")
ts = TreeStyle()
# Circular mode
ts.mode = "c"
ts.arc_start = 0 # Starting angle (degrees)
ts.arc_span = 360 # Angular span (degrees, 360=full circle)
# For semicircle
ts.arc_start = -180
ts.arc_span = 180
tree.render("circular_tree.pdf", tree_style=ts)
```
### Title and Legend
```python
from ete3 import Tree, TreeStyle, TextFace
tree = Tree("tree.nw")
ts = TreeStyle()
# Add title
title = TextFace("Phylogenetic Tree of Species", fsize=20, bold=True)
ts.title.add_face(title, column=0)
# Add legend
ts.legend.add_face(TextFace("Red nodes: High support", fsize=10), column=0)
ts.legend.add_face(TextFace("Blue nodes: Low support", fsize=10), column=0)
# Legend position
ts.legend_position = 1 # 1=top-right, 2=top-left, 3=bottom-left, 4=bottom-right
tree.render("tree_with_legend.pdf", tree_style=ts)
```
### Custom Background
```python
from ete3 import Tree, TreeStyle
tree = Tree("tree.nw")
ts = TreeStyle()
# Background color
ts.bgcolor = "#f0f0f0" # Light gray background
# Tree border
ts.show_border = True
tree.render("tree_background.pdf", tree_style=ts)
```
---
## Node Styling
### NodeStyle Properties
```python
from ete3 import Tree, NodeStyle
tree = Tree("tree.nw")
for node in tree.traverse():
nstyle = NodeStyle()
# Node size and shape
nstyle["size"] = 10 # Node size in pixels
nstyle["shape"] = "circle" # "circle", "square", "sphere"
# Colors
nstyle["fgcolor"] = "blue" # Foreground color (node itself)
    nstyle["bgcolor"] = "lightblue" # Background color of the node's partition (often used to highlight clades)
# Line style for branches
nstyle["hz_line_type"] = 0 # 0=solid, 1=dashed, 2=dotted
nstyle["vt_line_type"] = 0 # Vertical line type
nstyle["hz_line_color"] = "black" # Horizontal line color
nstyle["vt_line_color"] = "black" # Vertical line color
nstyle["hz_line_width"] = 2 # Line width in pixels
nstyle["vt_line_width"] = 2
node.set_style(nstyle)
tree.render("styled_tree.pdf")
```
### Conditional Styling
```python
from ete3 import Tree, NodeStyle
tree = Tree("tree.nw")
# Style based on node properties
for node in tree.traverse():
nstyle = NodeStyle()
if node.is_leaf():
# Leaf node style
nstyle["size"] = 8
nstyle["fgcolor"] = "darkgreen"
nstyle["shape"] = "circle"
else:
# Internal node style based on support
if node.support > 0.9:
nstyle["size"] = 6
nstyle["fgcolor"] = "red"
nstyle["shape"] = "sphere"
else:
nstyle["size"] = 4
nstyle["fgcolor"] = "gray"
nstyle["shape"] = "circle"
# Style branches by length
if node.dist > 1.0:
nstyle["hz_line_width"] = 3
nstyle["hz_line_color"] = "blue"
else:
nstyle["hz_line_width"] = 1
nstyle["hz_line_color"] = "black"
node.set_style(nstyle)
tree.render("conditional_styled_tree.pdf")
```
### Hiding Nodes
```python
from ete3 import Tree, NodeStyle
tree = Tree("tree.nw")
# Hide specific nodes
for node in tree.traverse():
if node.support < 0.5: # Hide low support nodes
nstyle = NodeStyle()
nstyle["draw_descendants"] = False # Don't draw this node's subtree
nstyle["size"] = 0 # Make node invisible
node.set_style(nstyle)
tree.render("filtered_tree.pdf")
```
---
## Faces
Faces are graphical elements attached to nodes. They appear at specific positions around nodes.
### Face Positions
- `"branch-right"`: Right side of branch (after node)
- `"branch-top"`: Above branch
- `"branch-bottom"`: Below branch
- `"aligned"`: Aligned column at tree edge (for leaves)
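A minimal sketch that places the same label at each position to see where it lands (tree and file names are illustrative):
```python
from ete3 import Tree, TreeStyle, TextFace

tree = Tree("((A,B),C);")

def layout(node):
    if node.is_leaf():
        node.add_face(TextFace("right", fsize=8), column=0, position="branch-right")
        node.add_face(TextFace("top", fsize=8), column=0, position="branch-top")
        node.add_face(TextFace("bottom", fsize=8), column=0, position="branch-bottom")
        node.add_face(TextFace("aligned", fsize=8), column=0, position="aligned")

ts = TreeStyle()
ts.layout_fn = layout
tree.render("face_positions.pdf", tree_style=ts)
```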
### TextFace
```python
from ete3 import Tree, TreeStyle, TextFace
tree = Tree("tree.nw")
def layout(node):
if node.is_leaf():
# Add species name
name_face = TextFace(node.name, fsize=12, fgcolor="black")
node.add_face(name_face, column=0, position="branch-right")
# Add additional text
info_face = TextFace(f"Length: {node.dist:.3f}", fsize=8, fgcolor="gray")
node.add_face(info_face, column=1, position="branch-right")
else:
# Add support value
if node.support:
support_face = TextFace(f"{node.support:.2f}", fsize=8, fgcolor="red")
node.add_face(support_face, column=0, position="branch-top")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False # We're adding custom names
tree.render("tree_textfaces.pdf", tree_style=ts)
```
### AttrFace
Display node attributes directly:
```python
from ete3 import Tree, TreeStyle, AttrFace
tree = Tree("tree.nw")
# Add custom attributes
for leaf in tree:
leaf.add_feature("habitat", "aquatic" if "fish" in leaf.name else "terrestrial")
leaf.add_feature("temperature", 20)
def layout(node):
if node.is_leaf():
# Display attribute directly
habitat_face = AttrFace("habitat", fsize=10)
node.add_face(habitat_face, column=0, position="aligned")
temp_face = AttrFace("temperature", fsize=10)
node.add_face(temp_face, column=1, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
tree.render("tree_attrfaces.pdf", tree_style=ts)
```
### CircleFace
```python
from ete3 import Tree, TreeStyle, CircleFace, TextFace
tree = Tree("tree.nw")
# Annotate with habitat
for leaf in tree:
leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "land")
def layout(node):
if node.is_leaf():
# Colored circle based on habitat
color = "blue" if node.habitat == "marine" else "green"
circle = CircleFace(radius=5, color=color, style="circle")
node.add_face(circle, column=0, position="aligned")
# Label
name = TextFace(node.name, fsize=10)
node.add_face(name, column=1, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
tree.render("tree_circles.pdf", tree_style=ts)
```
### ImgFace
Add images to nodes:
```python
from ete3 import Tree, TreeStyle, ImgFace, TextFace
tree = Tree("tree.nw")
def layout(node):
if node.is_leaf():
# Add species image
img_path = f"images/{node.name}.png" # Path to image
try:
img_face = ImgFace(img_path, width=50, height=50)
node.add_face(img_face, column=0, position="aligned")
except:
pass # Skip if image doesn't exist
# Add name
name_face = TextFace(node.name, fsize=10)
node.add_face(name_face, column=1, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
tree.render("tree_images.pdf", tree_style=ts)
```
### BarChartFace
```python
from ete3 import Tree, TreeStyle, BarChartFace, TextFace
tree = Tree("tree.nw")
# Add data for bar charts
for leaf in tree:
leaf.add_feature("values", [1.2, 2.3, 0.5, 1.8]) # Multiple values
def layout(node):
if node.is_leaf():
# Add bar chart
chart = BarChartFace(
node.values,
width=100,
height=40,
colors=["red", "blue", "green", "orange"],
labels=["A", "B", "C", "D"]
)
node.add_face(chart, column=0, position="aligned")
# Add name
name = TextFace(node.name, fsize=10)
node.add_face(name, column=1, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
tree.render("tree_barcharts.pdf", tree_style=ts)
```
### PieChartFace
```python
from ete3 import Tree, TreeStyle, PieChartFace, TextFace
tree = Tree("tree.nw")
# Add data
for leaf in tree:
leaf.add_feature("proportions", [25, 35, 40]) # Percentages
def layout(node):
if node.is_leaf():
# Add pie chart
pie = PieChartFace(
node.proportions,
width=30,
height=30,
colors=["red", "blue", "green"]
)
node.add_face(pie, column=0, position="aligned")
name = TextFace(node.name, fsize=10)
node.add_face(name, column=1, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
tree.render("tree_piecharts.pdf", tree_style=ts)
```
### SequenceFace (for alignments)
```python
from ete3 import PhyloTree, TreeStyle, SeqMotifFace
tree = PhyloTree("tree.nw")
tree.link_to_alignment("alignment.fasta")
def layout(node):
if node.is_leaf():
# Display sequence
seq_face = SeqMotifFace(node.sequence, seq_format="seq")
node.add_face(seq_face, column=0, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = True
tree.render("tree_alignment.pdf", tree_style=ts)
```
---
## Layout Functions
Layout functions are Python functions that modify node appearance during rendering.
### Basic Layout Function
```python
from ete3 import Tree, TreeStyle, TextFace
tree = Tree("tree.nw")
def my_layout(node):
"""Called for every node before rendering"""
if node.is_leaf():
# Add text to leaves
name_face = TextFace(node.name.upper(), fsize=12, fgcolor="blue")
node.add_face(name_face, column=0, position="branch-right")
else:
# Add support to internal nodes
if node.support:
support_face = TextFace(f"BS: {node.support:.0f}", fsize=8)
node.add_face(support_face, column=0, position="branch-top")
# Apply layout function
ts = TreeStyle()
ts.layout_fn = my_layout
ts.show_leaf_name = False
tree.render("tree_custom_layout.pdf", tree_style=ts)
```
### Dynamic Styling in Layout
```python
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
tree = Tree("tree.nw")
def layout(node):
# Modify node style dynamically
nstyle = NodeStyle()
# Color by clade
if "clade_A" in [l.name for l in node.get_leaves()]:
nstyle["bgcolor"] = "lightblue"
elif "clade_B" in [l.name for l in node.get_leaves()]:
nstyle["bgcolor"] = "lightgreen"
node.set_style(nstyle)
# Add faces based on features
if hasattr(node, "annotation"):
text = TextFace(node.annotation, fsize=8)
node.add_face(text, column=0, position="branch-top")
ts = TreeStyle()
ts.layout_fn = layout
tree.render("tree_dynamic.pdf", tree_style=ts)
```
### Multiple Column Layout
```python
from ete3 import Tree, TreeStyle, TextFace, CircleFace
tree = Tree("tree.nw")
# Add features
for leaf in tree:
leaf.add_feature("habitat", "aquatic")
leaf.add_feature("temp", 20)
leaf.add_feature("depth", 100)
def layout(node):
if node.is_leaf():
# Column 0: Name
name = TextFace(node.name, fsize=10)
node.add_face(name, column=0, position="aligned")
# Column 1: Habitat indicator
color = "blue" if node.habitat == "aquatic" else "brown"
circle = CircleFace(radius=5, color=color)
node.add_face(circle, column=1, position="aligned")
# Column 2: Temperature
temp = TextFace(f"{node.temp}°C", fsize=8)
node.add_face(temp, column=2, position="aligned")
# Column 3: Depth
depth = TextFace(f"{node.depth}m", fsize=8)
node.add_face(depth, column=3, position="aligned")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
tree.render("tree_columns.pdf", tree_style=ts)
```
---
## Advanced Visualization
### Highlighting Clades
```python
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
tree = Tree("tree.nw")
# Define clades to highlight
clade_members = {
"Clade_A": ["species1", "species2", "species3"],
"Clade_B": ["species4", "species5"]
}
def layout(node):
# Check if node is ancestor of specific clade
node_leaves = set([l.name for l in node.get_leaves()])
for clade_name, members in clade_members.items():
if set(members).issubset(node_leaves):
# This node is ancestor of the clade
nstyle = NodeStyle()
nstyle["bgcolor"] = "yellow"
nstyle["size"] = 0
# Add label
if set(members) == node_leaves: # Exact match
label = TextFace(clade_name, fsize=14, bold=True, fgcolor="red")
node.add_face(label, column=0, position="branch-top")
node.set_style(nstyle)
break
ts = TreeStyle()
ts.layout_fn = layout
tree.render("tree_highlighted_clades.pdf", tree_style=ts)
```
### Collapsing Clades
```python
from ete3 import Tree, TreeStyle, TextFace, NodeStyle
tree = Tree("tree.nw")
# Define which clades to collapse
clades_to_collapse = ["clade1_species1", "clade1_species2"]
def layout(node):
if not node.is_leaf():
node_leaves = [l.name for l in node.get_leaves()]
# Check if this is a clade we want to collapse
if all(l in clades_to_collapse for l in node_leaves):
# Collapse by hiding descendants
nstyle = NodeStyle()
nstyle["draw_descendants"] = False
nstyle["size"] = 20
nstyle["fgcolor"] = "steelblue"
nstyle["shape"] = "sphere"
node.set_style(nstyle)
# Add label showing what's collapsed
label = TextFace(f"[{len(node_leaves)} species]", fsize=10)
node.add_face(label, column=0, position="branch-right")
ts = TreeStyle()
ts.layout_fn = layout
tree.render("tree_collapsed.pdf", tree_style=ts)
```
### Heat Map Visualization
```python
from ete3 import Tree, TreeStyle, RectFace, TextFace
import numpy as np
tree = Tree("tree.nw")
# Generate random data for heatmap
for leaf in tree:
leaf.add_feature("data", np.random.rand(10)) # 10 data points
def layout(node):
if node.is_leaf():
# Add name
name = TextFace(node.name, fsize=8)
node.add_face(name, column=0, position="aligned")
# Add heatmap cells
for i, value in enumerate(node.data):
# Color based on value
intensity = int(255 * value)
color = f"#{255-intensity:02x}{intensity:02x}00" # Green-red gradient
rect = RectFace(width=20, height=15, fgcolor=color, bgcolor=color)
node.add_face(rect, column=i+1, position="aligned")
# Add column headers
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False
# Add header
for i in range(10):
header = TextFace(f"C{i+1}", fsize=8, fgcolor="gray")
ts.aligned_header.add_face(header, column=i+1)
tree.render("tree_heatmap.pdf", tree_style=ts)
```
### Phylogenetic Events Visualization
```python
from ete3 import PhyloTree, TreeStyle, TextFace, NodeStyle
tree = PhyloTree("gene_tree.nw")
tree.set_species_naming_function(lambda x: x.split("_")[0])
tree.get_descendant_evol_events()
def layout(node):
# Style based on evolutionary event
if hasattr(node, "evoltype"):
nstyle = NodeStyle()
if node.evoltype == "D": # Duplication
nstyle["fgcolor"] = "red"
nstyle["size"] = 10
nstyle["shape"] = "square"
label = TextFace("DUP", fsize=8, fgcolor="red", bold=True)
node.add_face(label, column=0, position="branch-top")
elif node.evoltype == "S": # Speciation
nstyle["fgcolor"] = "blue"
nstyle["size"] = 6
nstyle["shape"] = "circle"
node.set_style(nstyle)
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = True
tree.render("gene_tree_events.pdf", tree_style=ts)
```
### Custom Tree with Legend
```python
from ete3 import Tree, TreeStyle, TextFace, CircleFace, NodeStyle
tree = Tree("tree.nw")
# Categorize species
for leaf in tree:
if "fish" in leaf.name.lower():
leaf.add_feature("category", "fish")
elif "bird" in leaf.name.lower():
leaf.add_feature("category", "bird")
else:
leaf.add_feature("category", "mammal")
category_colors = {
"fish": "blue",
"bird": "green",
"mammal": "red"
}
def layout(node):
if node.is_leaf():
# Color by category
nstyle = NodeStyle()
nstyle["fgcolor"] = category_colors[node.category]
nstyle["size"] = 10
node.set_style(nstyle)
ts = TreeStyle()
ts.layout_fn = layout
# Add legend
ts.legend.add_face(TextFace("Legend:", fsize=12, bold=True), column=0)
for category, color in category_colors.items():
circle = CircleFace(radius=5, color=color)
ts.legend.add_face(circle, column=0)
label = TextFace(f" {category.capitalize()}", fsize=10)
ts.legend.add_face(label, column=1)
ts.legend_position = 1
tree.render("tree_with_legend.pdf", tree_style=ts)
```
---
## Best Practices
1. **Use layout functions** for complex visualizations - they're called during rendering
2. **Set `show_leaf_name = False`** when using custom name faces
3. **Use aligned position** for columnar data at leaf level
4. **Choose appropriate units**: pixels for screen, mm/inches for print
5. **Use vector formats (PDF/SVG)** for publications
6. **Precompute styling** when possible - layout functions should be fast
7. **Test interactively** with `show()` before rendering to file
8. **Use NodeStyle for permanent** changes, layout functions for rendering-time changes
9. **Align faces in columns** for clean, organized appearance
10. **Add legends** to explain colors and symbols used

View File

@@ -0,0 +1,774 @@
# ETE Toolkit Common Workflows
This document provides complete workflows for common tasks using the ETE Toolkit.
## Table of Contents
1. [Basic Tree Operations](#basic-tree-operations)
2. [Phylogenetic Analysis](#phylogenetic-analysis)
3. [Tree Comparison](#tree-comparison)
4. [Taxonomy Integration](#taxonomy-integration)
5. [Clustering Analysis](#clustering-analysis)
6. [Tree Visualization](#tree-visualization)
---
## Basic Tree Operations
### Loading and Exploring a Tree
```python
from ete3 import Tree
# Load tree from file
tree = Tree("my_tree.nw", format=1)
# Display ASCII representation
print(tree.get_ascii(show_internal=True))
# Get basic statistics
print(f"Number of leaves: {len(tree)}")
print(f"Total nodes: {len(list(tree.traverse()))}")
print(f"Tree depth: {tree.get_farthest_leaf()[1]}")
# List all leaf names
for leaf in tree:
print(leaf.name)
```
### Extracting and Saving Subtrees
```python
from ete3 import Tree
tree = Tree("full_tree.nw")
# Get subtree rooted at specific node
node = tree.search_nodes(name="MyNode")[0]
subtree = node.copy()
# Save subtree to file
subtree.write(outfile="subtree.nw", format=1)
# Extract monophyletic clade
species_of_interest = ["species1", "species2", "species3"]
ancestor = tree.get_common_ancestor(species_of_interest)
clade = ancestor.copy()
clade.write(outfile="clade.nw")
```
### Pruning Trees to Specific Taxa
```python
from ete3 import Tree
tree = Tree("large_tree.nw")
# Keep only taxa of interest
taxa_to_keep = ["taxon1", "taxon2", "taxon3", "taxon4"]
tree.prune(taxa_to_keep, preserve_branch_length=True)
# Save pruned tree
tree.write(outfile="pruned_tree.nw")
```
### Rerooting Trees
```python
from ete3 import Tree
tree = Tree("unrooted_tree.nw")
# Method 1: Root by outgroup
outgroup = tree & "Outgroup_species"
tree.set_outgroup(outgroup)
# Method 2: Midpoint rooting
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)
# Save rooted tree
tree.write(outfile="rooted_tree.nw")
```
### Annotating Nodes with Custom Data
```python
from ete3 import Tree
tree = Tree("tree.nw")
# Add features to nodes based on metadata
metadata = {
"species1": {"habitat": "marine", "temperature": 20},
"species2": {"habitat": "freshwater", "temperature": 15},
}
for leaf in tree:
if leaf.name in metadata:
leaf.add_features(**metadata[leaf.name])
# Query annotated features
for leaf in tree:
if hasattr(leaf, "habitat"):
print(f"{leaf.name}: {leaf.habitat}, {leaf.temperature}°C")
# Save with custom features (NHX format)
tree.write(outfile="annotated_tree.nhx", features=["habitat", "temperature"])
```
### Modifying Tree Topology
```python
from ete3 import Tree
tree = Tree("tree.nw")
# Remove a clade
node_to_remove = tree & "unwanted_clade"
node_to_remove.detach()
# Collapse a node (delete but keep children)
node_to_collapse = tree & "low_support_node"
node_to_collapse.delete()
# Add a new species to existing clade
target_clade = tree & "target_node"
new_leaf = target_clade.add_child(name="new_species", dist=0.5)
# Resolve polytomies
tree.resolve_polytomy(recursive=True)
# Save modified tree
tree.write(outfile="modified_tree.nw")
```
---
## Phylogenetic Analysis
### Complete Gene Tree Analysis with Alignment
```python
from ete3 import PhyloTree
# Load gene tree and link alignment
tree = PhyloTree("gene_tree.nw", format=1)
tree.link_to_alignment("alignment.fasta", alg_format="fasta")
# Set species naming function (e.g., gene_species format)
def extract_species(node_name):
return node_name.split("_")[0]
tree.set_species_naming_function(extract_species)
# Access sequences
for leaf in tree:
print(f"{leaf.name} ({leaf.species})")
print(f"Sequence: {leaf.sequence[:50]}...")
```
### Detecting Duplication and Speciation Events
```python
from ete3 import PhyloTree, Tree
# Load gene tree
gene_tree = PhyloTree("gene_tree.nw")
# Set species naming
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
# Option 1: Species Overlap algorithm (no species tree needed)
events = gene_tree.get_descendant_evol_events()
# Option 2: Tree reconciliation (requires species tree)
species_tree = Tree("species_tree.nw")
events = gene_tree.get_descendant_evol_events(species_tree=species_tree)
# Analyze events
duplications = 0
speciations = 0
for node in gene_tree.traverse():
if hasattr(node, "evoltype"):
if node.evoltype == "D":
duplications += 1
print(f"Duplication at node {node.name}")
elif node.evoltype == "S":
speciations += 1
print(f"\nTotal duplications: {duplications}")
print(f"Total speciations: {speciations}")
```
### Extracting Orthologs and Paralogs
```python
from ete3 import PhyloTree
gene_tree = PhyloTree("gene_tree.nw")
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
# Detect evolutionary events
events = gene_tree.get_descendant_evol_events()
# Find all orthologs to a query gene
query_gene = gene_tree & "species1_gene1"
orthologs = []
paralogs = []
for event in events:
if query_gene in event.in_seqs:
if event.etype == "S": # Speciation
orthologs.extend([s for s in event.out_seqs if s != query_gene])
elif event.etype == "D": # Duplication
paralogs.extend([s for s in event.out_seqs if s != query_gene])
print(f"Orthologs of {query_gene.name}:")
for ortholog in set(orthologs):
print(f" {ortholog.name}")
print(f"\nParalogs of {query_gene.name}:")
for paralog in set(paralogs):
print(f" {paralog.name}")
```
### Splitting Gene Families by Duplication Events
```python
from ete3 import PhyloTree
gene_tree = PhyloTree("gene_family.nw")
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
gene_tree.get_descendant_evol_events()
# Split into individual gene families
subfamilies = gene_tree.split_by_dups()
print(f"Gene family split into {len(subfamilies)} subfamilies")
for i, subtree in enumerate(subfamilies):
subtree.write(outfile=f"subfamily_{i}.nw")
species = set([leaf.species for leaf in subtree])
print(f"Subfamily {i}: {len(subtree)} genes from {len(species)} species")
```
### Collapsing Lineage-Specific Expansions
```python
from ete3 import PhyloTree
gene_tree = PhyloTree("expanded_tree.nw")
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
# Collapse lineage-specific duplications
gene_tree.collapse_lineage_specific_expansions()
print("After collapsing expansions:")
print(gene_tree.get_ascii())
gene_tree.write(outfile="collapsed_tree.nw")
```
### Testing Monophyly
```python
from ete3 import Tree
tree = Tree("tree.nw")
# Test if a group is monophyletic
target_species = ["species1", "species2", "species3"]
is_mono, clade_type, base_node = tree.check_monophyly(
values=target_species,
target_attr="name"
)
if is_mono:
print(f"Group is monophyletic")
print(f"MRCA: {base_node.name}")
elif clade_type == "paraphyletic":
print(f"Group is paraphyletic")
elif clade_type == "polyphyletic":
print(f"Group is polyphyletic")
# Get all monophyletic clades of a specific type
# Annotate leaves first
for leaf in tree:
if leaf.name.startswith("species"):
leaf.add_feature("type", "typeA")
else:
leaf.add_feature("type", "typeB")
mono_clades = tree.get_monophyletic(values=["typeA"], target_attr="type")
print(f"Found {len(mono_clades)} monophyletic clades of typeA")
```
---
## Tree Comparison
### Computing Robinson-Foulds Distance
```python
from ete3 import Tree
tree1 = Tree("tree1.nw")
tree2 = Tree("tree2.nw")
# Compute RF distance
rf, max_rf, common_leaves, parts_t1, parts_t2 = tree1.robinson_foulds(tree2)
print(f"Robinson-Foulds distance: {rf}")
print(f"Maximum RF distance: {max_rf}")
print(f"Normalized RF: {rf/max_rf:.3f}")
print(f"Common leaves: {len(common_leaves)}")
# Find unique partitions
unique_in_t1 = parts_t1 - parts_t2
unique_in_t2 = parts_t2 - parts_t1
print(f"\nPartitions unique to tree1: {len(unique_in_t1)}")
print(f"Partitions unique to tree2: {len(unique_in_t2)}")
```
### Comparing Multiple Trees
```python
from ete3 import Tree
import numpy as np
# Load multiple trees
tree_files = ["tree1.nw", "tree2.nw", "tree3.nw", "tree4.nw"]
trees = [Tree(f) for f in tree_files]
# Create distance matrix
n = len(trees)
dist_matrix = np.zeros((n, n))
for i in range(n):
for j in range(i+1, n):
rf, max_rf, _, _, _ = trees[i].robinson_foulds(trees[j])
norm_rf = rf / max_rf if max_rf > 0 else 0
dist_matrix[i, j] = norm_rf
dist_matrix[j, i] = norm_rf
print("Normalized RF distance matrix:")
print(dist_matrix)
# Find most similar pair
min_dist = float('inf')
best_pair = None
for i in range(n):
for j in range(i+1, n):
if dist_matrix[i, j] < min_dist:
min_dist = dist_matrix[i, j]
best_pair = (i, j)
print(f"\nMost similar trees: {tree_files[best_pair[0]]} and {tree_files[best_pair[1]]}")
print(f"Distance: {min_dist:.3f}")
```
### Finding Consensus Topology
```python
from ete3 import Tree
# Load multiple bootstrap trees
bootstrap_trees = [Tree(f"bootstrap_{i}.nw") for i in range(100)]
# Get reference tree (first tree)
ref_tree = bootstrap_trees[0].copy()
# Count bipartitions
bipartition_counts = {}
for tree in bootstrap_trees:
rf, max_rf, common, parts_ref, parts_tree = ref_tree.robinson_foulds(tree)
for partition in parts_tree:
bipartition_counts[partition] = bipartition_counts.get(partition, 0) + 1
# Filter by support threshold
threshold = 70 # 70% support
supported_bipartitions = {
k: v for k, v in bipartition_counts.items()
if (v / len(bootstrap_trees)) * 100 >= threshold
}
print(f"Bipartitions with >{threshold}% support: {len(supported_bipartitions)}")
```
---
## Taxonomy Integration
### Building Species Trees from NCBI Taxonomy
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa()
# Define species of interest
species = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla",
"Mus musculus", "Rattus norvegicus"]
# Get taxids
name2taxid = ncbi.get_name_translator(species)
taxids = [name2taxid[sp][0] for sp in species]
# Build tree
tree = ncbi.get_topology(taxids)
# Annotate with taxonomy info
for node in tree.traverse():
if hasattr(node, "sci_name"):
print(f"{node.sci_name} - Rank: {node.rank} - TaxID: {node.taxid}")
# Save tree
tree.write(outfile="species_tree.nw")
```
### Annotating Existing Tree with NCBI Taxonomy
```python
from ete3 import Tree, NCBITaxa
tree = Tree("species_tree.nw")
ncbi = NCBITaxa()
# Map leaf names to species names (adjust as needed)
leaf_to_species = {
"Hsap_gene1": "Homo sapiens",
"Ptro_gene1": "Pan troglodytes",
"Mmur_gene1": "Microcebus murinus",
}
# Get taxids
all_species = list(set(leaf_to_species.values()))
name2taxid = ncbi.get_name_translator(all_species)
# Annotate leaves
for leaf in tree:
if leaf.name in leaf_to_species:
species_name = leaf_to_species[leaf.name]
taxid = name2taxid[species_name][0]
# Add taxonomy info
leaf.add_feature("species", species_name)
leaf.add_feature("taxid", taxid)
# Get full lineage
lineage = ncbi.get_lineage(taxid)
names = ncbi.get_taxid_translator(lineage)
leaf.add_feature("lineage", [names[t] for t in lineage])
print(f"{leaf.name}: {species_name} (taxid: {taxid})")
```
### Querying NCBI Taxonomy
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa()
# Get all primates
primates_taxid = ncbi.get_name_translator(["Primates"])["Primates"][0]
all_primates = ncbi.get_descendant_taxa(primates_taxid, collapse_subspecies=True)
print(f"Total primate species: {len(all_primates)}")
# Get names for subset
taxid2name = ncbi.get_taxid_translator(all_primates[:10])
for taxid, name in taxid2name.items():
rank = ncbi.get_rank([taxid])[taxid]
print(f"{name} ({rank})")
# Get lineage for specific species
human_taxid = 9606
lineage = ncbi.get_lineage(human_taxid)
ranks = ncbi.get_rank(lineage)
names = ncbi.get_taxid_translator(lineage)
print("\nHuman lineage:")
for taxid in lineage:
print(f"{ranks[taxid]:15s} {names[taxid]}")
```
---
## Clustering Analysis
### Analyzing Hierarchical Clustering Results
```python
from ete3 import ClusterTree
# Load clustering tree with data matrix
matrix = """#Names\tSample1\tSample2\tSample3\tSample4
Gene1\t1.5\t2.3\t0.8\t1.2
Gene2\t0.9\t1.1\t1.8\t2.1
Gene3\t2.1\t2.5\t0.5\t0.9
Gene4\t0.7\t0.9\t2.2\t2.4"""
tree = ClusterTree("((Gene1,Gene2),(Gene3,Gene4));", text_array=matrix)
# Calculate cluster quality metrics
for node in tree.traverse():
if not node.is_leaf():
# Silhouette coefficient
silhouette = node.get_silhouette()
# Dunn index
dunn = node.get_dunn()
# Distances
inter = node.intercluster_dist
intra = node.intracluster_dist
print(f"Node: {node.name}")
print(f" Silhouette: {silhouette:.3f}")
print(f" Dunn index: {dunn:.3f}")
print(f" Intercluster distance: {inter:.3f}")
print(f" Intracluster distance: {intra:.3f}")
```
### Validating Clusters
```python
from ete3 import ClusterTree
matrix = """#Names\tCol1\tCol2\tCol3
ItemA\t1.2\t0.5\t0.8
ItemB\t1.3\t0.6\t0.9
ItemC\t0.1\t2.5\t2.3
ItemD\t0.2\t2.6\t2.4"""
tree = ClusterTree("((ItemA,ItemB),(ItemC,ItemD));", text_array=matrix)
# Test different distance metrics
metrics = ["euclidean", "pearson", "spearman"]
for metric in metrics:
print(f"\nUsing {metric} distance:")
for node in tree.traverse():
if not node.is_leaf():
silhouette = node.get_silhouette(distance=metric)
# Positive silhouette = good clustering
# Negative silhouette = poor clustering
quality = "good" if silhouette > 0 else "poor"
print(f" Cluster {node.name}: {silhouette:.3f} ({quality})")
```
---
## Tree Visualization
### Basic Tree Rendering
```python
from ete3 import Tree, TreeStyle
tree = Tree("tree.nw")
# Create tree style
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_length = True
ts.show_branch_support = True
ts.scale = 50 # pixels per branch length unit
# Render to file
tree.render("tree_output.pdf", tree_style=ts)
tree.render("tree_output.png", tree_style=ts, w=800, h=600, units="px")
tree.render("tree_output.svg", tree_style=ts)
```
### Customizing Node Appearance
```python
from ete3 import Tree, TreeStyle, NodeStyle
tree = Tree("tree.nw")
# Define node styles
for node in tree.traverse():
nstyle = NodeStyle()
if node.is_leaf():
nstyle["fgcolor"] = "blue"
nstyle["size"] = 10
else:
nstyle["fgcolor"] = "red"
nstyle["size"] = 5
if node.support > 0.9:
nstyle["shape"] = "sphere"
else:
nstyle["shape"] = "circle"
node.set_style(nstyle)
# Render
ts = TreeStyle()
tree.render("styled_tree.pdf", tree_style=ts)
```
### Adding Faces to Nodes
```python
from ete3 import Tree, TreeStyle, TextFace, CircleFace, AttrFace
tree = Tree("tree.nw")
# Add features to nodes
for leaf in tree:
leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "terrestrial")
leaf.add_feature("temp", 20)
# Layout function to add faces
def layout(node):
if node.is_leaf():
# Add text face
name_face = TextFace(node.name, fsize=10)
node.add_face(name_face, column=0, position="branch-right")
# Add colored circle based on habitat
color = "blue" if node.habitat == "marine" else "green"
circle_face = CircleFace(radius=5, color=color)
node.add_face(circle_face, column=1, position="branch-right")
# Add attribute face
temp_face = AttrFace("temp", fsize=8)
node.add_face(temp_face, column=2, position="branch-right")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False # We're adding custom names
tree.render("tree_with_faces.pdf", tree_style=ts)
```
### Circular Tree Layout
```python
from ete3 import Tree, TreeStyle
tree = Tree("tree.nw")
ts = TreeStyle()
ts.mode = "c" # Circular mode
ts.arc_start = 0 # Degrees
ts.arc_span = 360 # Full circle
ts.show_leaf_name = True
tree.render("circular_tree.pdf", tree_style=ts)
```
### Interactive Exploration
```python
from ete3 import Tree
tree = Tree("tree.nw")
# Launch GUI (allows zooming, searching, modifying)
# Changes persist after closing
tree.show()
# Can save changes made in GUI
tree.write(outfile="modified_tree.nw")
```
---
## Advanced Workflows
### Complete Phylogenomic Pipeline
```python
from ete3 import PhyloTree, NCBITaxa, TreeStyle
# 1. Load gene tree
gene_tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")
# 2. Set species naming
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
# 3. Detect evolutionary events
gene_tree.get_descendant_evol_events()
# 4. Annotate with NCBI taxonomy
ncbi = NCBITaxa()
species_set = set([leaf.species for leaf in gene_tree])
name2taxid = ncbi.get_name_translator(list(species_set))
for leaf in gene_tree:
if leaf.species in name2taxid:
taxid = name2taxid[leaf.species][0]
lineage = ncbi.get_lineage(taxid)
names = ncbi.get_taxid_translator(lineage)
leaf.add_feature("lineage", [names[t] for t in lineage])
# 5. Identify and save ortholog groups
ntrees, ndups, sptrees = gene_tree.get_speciation_trees()
for i, ortho_tree in enumerate(sptrees):
    ortho_tree.write(outfile=f"ortholog_group_{i}.nw")
# 6. Visualize with evolutionary events marked
def layout(node):
from ete3 import TextFace
if hasattr(node, "evoltype"):
if node.evoltype == "D":
dup_face = TextFace("DUPLICATION", fsize=8, fgcolor="red")
node.add_face(dup_face, column=0, position="branch-top")
ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = True
gene_tree.render("annotated_gene_tree.pdf", tree_style=ts)
print(f"Pipeline complete. Found {ntrees} ortholog groups.")
```
### Batch Processing Multiple Trees
```python
from ete3 import Tree
import os
input_dir = "input_trees"
output_dir = "processed_trees"
os.makedirs(output_dir, exist_ok=True)
for filename in os.listdir(input_dir):
if filename.endswith(".nw"):
# Load tree
tree = Tree(os.path.join(input_dir, filename))
# Process: root, prune, annotate
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)
# Filter by branch length
to_remove = []
for node in tree.traverse():
if node.dist < 0.001 and not node.is_root():
to_remove.append(node)
for node in to_remove:
node.delete()
# Save processed tree
output_file = os.path.join(output_dir, f"processed_{filename}")
tree.write(outfile=output_file)
print(f"Processed {filename}")
```


@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Quick tree visualization script with common customization options.
Provides command-line interface for rapid tree visualization with
customizable styles, layouts, and output formats.
"""
import argparse
import sys
from pathlib import Path
try:
from ete3 import Tree, TreeStyle, NodeStyle
except ImportError:
print("Error: ete3 not installed. Install with: pip install ete3")
sys.exit(1)
def create_tree_style(args):
"""Create TreeStyle based on arguments."""
ts = TreeStyle()
# Basic display options
ts.show_leaf_name = args.show_names
ts.show_branch_length = args.show_lengths
ts.show_branch_support = args.show_support
ts.show_scale = args.show_scale
# Layout
ts.mode = args.mode
ts.rotation = args.rotation
# Circular tree options
if args.mode == "c":
ts.arc_start = args.arc_start
ts.arc_span = args.arc_span
# Spacing
ts.branch_vertical_margin = args.vertical_margin
if args.scale_factor:
ts.scale = args.scale_factor
# Title
if args.title:
from ete3 import TextFace
title_face = TextFace(args.title, fsize=16, bold=True)
ts.title.add_face(title_face, column=0)
return ts
def apply_node_styling(tree, args):
"""Apply styling to tree nodes."""
for node in tree.traverse():
nstyle = NodeStyle()
if node.is_leaf():
# Leaf style
nstyle["fgcolor"] = args.leaf_color
nstyle["size"] = args.leaf_size
else:
# Internal node style
nstyle["fgcolor"] = args.internal_color
nstyle["size"] = args.internal_size
# Color by support if enabled
if args.color_by_support and hasattr(node, 'support') and node.support:
if node.support >= 0.9:
nstyle["fgcolor"] = "darkgreen"
elif node.support >= 0.7:
nstyle["fgcolor"] = "orange"
else:
nstyle["fgcolor"] = "red"
node.set_style(nstyle)
def visualize_tree(tree_file, output, args):
"""Load tree, apply styles, and render."""
try:
tree = Tree(str(tree_file), format=args.format)
except Exception as e:
print(f"Error loading tree: {e}")
sys.exit(1)
# Apply styling
apply_node_styling(tree, args)
# Create tree style
ts = create_tree_style(args)
# Render
try:
# Determine output parameters based on format
output_path = str(output)
render_args = {"tree_style": ts}
if args.width:
render_args["w"] = args.width
if args.height:
render_args["h"] = args.height
if args.units:
render_args["units"] = args.units
if args.dpi:
render_args["dpi"] = args.dpi
tree.render(output_path, **render_args)
print(f"Tree rendered successfully to: {output}")
except Exception as e:
print(f"Error rendering tree: {e}")
sys.exit(1)
def main():
parser = argparse.ArgumentParser(
description="Quick tree visualization with ETE toolkit",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic visualization
%(prog)s tree.nw output.pdf
# Circular tree
%(prog)s tree.nw output.pdf --mode c
# Large tree with custom sizing
%(prog)s tree.nw output.png --width 1200 --height 800 --units px --dpi 300
# Hide names, show support, color by support
%(prog)s tree.nw output.pdf --no-names --show-support --color-by-support
# Custom title
%(prog)s tree.nw output.pdf --title "Phylogenetic Tree of Species"
# Semicircular layout
%(prog)s tree.nw output.pdf --mode c --arc-start -90 --arc-span 180
"""
)
parser.add_argument("input", help="Input tree file (Newick format)")
parser.add_argument("output", help="Output image file (png, pdf, or svg)")
# Tree format
parser.add_argument("--format", type=int, default=0,
help="Newick format number (default: 0)")
# Display options
display = parser.add_argument_group("Display options")
display.add_argument("--no-names", dest="show_names", action="store_false",
help="Don't show leaf names")
display.add_argument("--show-lengths", action="store_true",
help="Show branch lengths")
display.add_argument("--show-support", action="store_true",
help="Show support values")
display.add_argument("--show-scale", action="store_true",
help="Show scale bar")
# Layout options
layout = parser.add_argument_group("Layout options")
layout.add_argument("--mode", choices=["r", "c"], default="r",
help="Tree mode: r=rectangular, c=circular (default: r)")
layout.add_argument("--rotation", type=int, default=0,
help="Tree rotation in degrees (default: 0)")
layout.add_argument("--arc-start", type=int, default=0,
help="Circular tree start angle (default: 0)")
layout.add_argument("--arc-span", type=int, default=360,
help="Circular tree arc span (default: 360)")
# Styling options
styling = parser.add_argument_group("Styling options")
styling.add_argument("--leaf-color", default="blue",
help="Leaf node color (default: blue)")
styling.add_argument("--leaf-size", type=int, default=6,
help="Leaf node size (default: 6)")
styling.add_argument("--internal-color", default="gray",
help="Internal node color (default: gray)")
styling.add_argument("--internal-size", type=int, default=4,
help="Internal node size (default: 4)")
styling.add_argument("--color-by-support", action="store_true",
help="Color internal nodes by support value")
# Size and spacing
size = parser.add_argument_group("Size and spacing")
size.add_argument("--width", type=int, help="Output width")
size.add_argument("--height", type=int, help="Output height")
size.add_argument("--units", choices=["px", "mm", "in"],
help="Size units (px, mm, in)")
size.add_argument("--dpi", type=int, help="DPI for raster output")
size.add_argument("--scale-factor", type=int,
help="Branch length scale factor (pixels per unit)")
size.add_argument("--vertical-margin", type=int, default=10,
help="Vertical margin between branches (default: 10)")
# Other options
parser.add_argument("--title", help="Tree title")
args = parser.parse_args()
# Validate output format
output_path = Path(args.output)
valid_extensions = {".png", ".pdf", ".svg"}
if output_path.suffix.lower() not in valid_extensions:
print(f"Error: Output must be PNG, PDF, or SVG file")
sys.exit(1)
# Visualize
visualize_tree(args.input, args.output, args)
if __name__ == "__main__":
main()


@@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""
Tree operations helper script for common ETE toolkit tasks.
Provides command-line interface for basic tree operations like:
- Format conversion
- Rooting (outgroup, midpoint)
- Pruning
- Basic statistics
- ASCII visualization
"""
import argparse
import sys
from pathlib import Path
try:
from ete3 import Tree
except ImportError:
print("Error: ete3 not installed. Install with: pip install ete3")
sys.exit(1)
def load_tree(tree_file, format_num=0):
"""Load tree from file."""
try:
return Tree(str(tree_file), format=format_num)
except Exception as e:
print(f"Error loading tree: {e}")
sys.exit(1)
def convert_format(tree_file, output, in_format=0, out_format=1):
"""Convert tree between Newick formats."""
tree = load_tree(tree_file, in_format)
tree.write(outfile=str(output), format=out_format)
print(f"Converted {tree_file} (format {in_format}) → {output} (format {out_format})")
def reroot_tree(tree_file, output, outgroup=None, midpoint=False, format_num=0):
"""Reroot tree by outgroup or midpoint."""
tree = load_tree(tree_file, format_num)
if midpoint:
midpoint_node = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint_node)
print(f"Rerooted tree using midpoint method")
elif outgroup:
try:
outgroup_node = tree & outgroup
tree.set_outgroup(outgroup_node)
print(f"Rerooted tree using outgroup: {outgroup}")
except Exception as e:
print(f"Error: Could not find outgroup '{outgroup}': {e}")
sys.exit(1)
else:
print("Error: Must specify either --outgroup or --midpoint")
sys.exit(1)
tree.write(outfile=str(output), format=format_num)
print(f"Saved rerooted tree to: {output}")
def prune_tree(tree_file, output, keep_taxa, preserve_length=True, format_num=0):
"""Prune tree to keep only specified taxa."""
tree = load_tree(tree_file, format_num)
# Read taxa list
taxa_file = Path(keep_taxa)
if taxa_file.exists():
with open(taxa_file) as f:
taxa = [line.strip() for line in f if line.strip()]
else:
taxa = [t.strip() for t in keep_taxa.split(",")]
print(f"Pruning tree to {len(taxa)} taxa")
try:
tree.prune(taxa, preserve_branch_length=preserve_length)
tree.write(outfile=str(output), format=format_num)
print(f"Pruned tree saved to: {output}")
print(f"Retained {len(tree)} leaves")
except Exception as e:
print(f"Error pruning tree: {e}")
sys.exit(1)
def tree_stats(tree_file, format_num=0):
"""Display tree statistics."""
tree = load_tree(tree_file, format_num)
print(f"\n=== Tree Statistics ===")
print(f"File: {tree_file}")
print(f"Number of leaves: {len(tree)}")
print(f"Total nodes: {len(list(tree.traverse()))}")
farthest_leaf, distance = tree.get_farthest_leaf()
print(f"Tree depth: {distance:.4f}")
print(f"Farthest leaf: {farthest_leaf.name}")
# Branch length statistics
branch_lengths = [node.dist for node in tree.traverse() if not node.is_root()]
if branch_lengths:
print(f"\nBranch length statistics:")
print(f" Mean: {sum(branch_lengths)/len(branch_lengths):.4f}")
print(f" Min: {min(branch_lengths):.4f}")
print(f" Max: {max(branch_lengths):.4f}")
# Support values
supports = [node.support for node in tree.traverse() if not node.is_leaf() and hasattr(node, 'support')]
if supports:
print(f"\nSupport value statistics:")
print(f" Mean: {sum(supports)/len(supports):.2f}")
print(f" Min: {min(supports):.2f}")
print(f" Max: {max(supports):.2f}")
print()
def show_ascii(tree_file, format_num=0, show_internal=True):
"""Display tree as ASCII art."""
tree = load_tree(tree_file, format_num)
print(tree.get_ascii(show_internal=show_internal))
def list_leaves(tree_file, format_num=0):
"""List all leaf names."""
tree = load_tree(tree_file, format_num)
for leaf in tree:
print(leaf.name)
def main():
parser = argparse.ArgumentParser(
description="ETE toolkit tree operations helper",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Convert format
%(prog)s convert input.nw output.nw --in-format 0 --out-format 1
# Midpoint root
%(prog)s reroot input.nw output.nw --midpoint
# Reroot with outgroup
%(prog)s reroot input.nw output.nw --outgroup "Outgroup_species"
# Prune tree
%(prog)s prune input.nw output.nw --keep-taxa "speciesA,speciesB,speciesC"
# Show statistics
%(prog)s stats input.nw
# Display as ASCII
%(prog)s ascii input.nw
# List all leaves
%(prog)s leaves input.nw
"""
)
subparsers = parser.add_subparsers(dest="command", help="Command to execute")
# Convert command
convert_parser = subparsers.add_parser("convert", help="Convert tree format")
convert_parser.add_argument("input", help="Input tree file")
convert_parser.add_argument("output", help="Output tree file")
convert_parser.add_argument("--in-format", type=int, default=0, help="Input format (default: 0)")
convert_parser.add_argument("--out-format", type=int, default=1, help="Output format (default: 1)")
# Reroot command
reroot_parser = subparsers.add_parser("reroot", help="Reroot tree")
reroot_parser.add_argument("input", help="Input tree file")
reroot_parser.add_argument("output", help="Output tree file")
reroot_parser.add_argument("--outgroup", help="Outgroup taxon name")
reroot_parser.add_argument("--midpoint", action="store_true", help="Use midpoint rooting")
reroot_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
# Prune command
prune_parser = subparsers.add_parser("prune", help="Prune tree to specified taxa")
prune_parser.add_argument("input", help="Input tree file")
prune_parser.add_argument("output", help="Output tree file")
prune_parser.add_argument("--keep-taxa", required=True,
help="Taxa to keep (comma-separated or file path)")
prune_parser.add_argument("--no-preserve-length", action="store_true",
help="Don't preserve branch lengths")
prune_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
# Stats command
stats_parser = subparsers.add_parser("stats", help="Display tree statistics")
stats_parser.add_argument("input", help="Input tree file")
stats_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
# ASCII command
ascii_parser = subparsers.add_parser("ascii", help="Display tree as ASCII art")
ascii_parser.add_argument("input", help="Input tree file")
ascii_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
ascii_parser.add_argument("--no-internal", action="store_true",
help="Don't show internal node names")
# Leaves command
leaves_parser = subparsers.add_parser("leaves", help="List all leaf names")
leaves_parser.add_argument("input", help="Input tree file")
leaves_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
args = parser.parse_args()
if not args.command:
parser.print_help()
sys.exit(1)
# Execute command
if args.command == "convert":
convert_format(args.input, args.output, args.in_format, args.out_format)
elif args.command == "reroot":
reroot_tree(args.input, args.output, args.outgroup, args.midpoint, args.format)
elif args.command == "prune":
prune_tree(args.input, args.output, args.keep_taxa,
not args.no_preserve_length, args.format)
elif args.command == "stats":
tree_stats(args.input, args.format)
elif args.command == "ascii":
show_ascii(args.input, args.format, not args.no_internal)
elif args.command == "leaves":
list_leaves(args.input, args.format)
if __name__ == "__main__":
main()


@@ -0,0 +1,602 @@
---
name: flowio
description: Toolkit for working with Flow Cytometry Standard (FCS) files in Python. Use this skill when reading, parsing, creating, or exporting FCS files (versions 2.0, 3.0, 3.1), extracting flow cytometry metadata, accessing event data, handling multi-dataset FCS files, or converting between FCS formats. Essential for flow cytometry data processing, channel analysis, and cytometry file manipulation tasks.
---
# FlowIO: Flow Cytometry Standard File Handler
## Overview
FlowIO is a lightweight Python library for reading and writing Flow Cytometry Standard (FCS) files. It excels at parsing FCS metadata, extracting event data, and creating new FCS files with minimal dependencies. The library supports FCS versions 2.0, 3.0, and 3.1, making it ideal for backend services, data pipelines, and basic cytometry file operations.
## When to Use This Skill
Apply this skill when working with:
- FCS files requiring parsing or metadata extraction
- Flow cytometry data needing conversion to NumPy arrays
- Event data requiring export to FCS format
- Multi-dataset FCS files needing separation
- Channel information extraction (scatter, fluorescence, time)
- Cytometry file validation or inspection
- Pre-processing workflows before advanced analysis
**Related Tools:** For advanced flow cytometry analysis including compensation, gating, and FlowJo/GatingML support, recommend FlowKit library as a companion to FlowIO.
## Installation
```bash
pip install flowio
```
Requires Python 3.9 or later.
## Quick Start
### Basic File Reading
```python
from flowio import FlowData
# Read FCS file
flow_data = FlowData('experiment.fcs')
# Access basic information
print(f"FCS Version: {flow_data.version}")
print(f"Events: {flow_data.event_count}")
print(f"Channels: {flow_data.pnn_labels}")
# Get event data as NumPy array
events = flow_data.as_array() # Shape: (events, channels)
```
### Creating FCS Files
```python
import numpy as np
from flowio import create_fcs
# Prepare data
data = np.array([[100, 200, 50], [150, 180, 60]]) # 2 events, 3 channels
channels = ['FSC-A', 'SSC-A', 'FL1-A']
# Create FCS file
create_fcs('output.fcs', data, channels)
```
## Core Workflows
### Reading and Parsing FCS Files
The FlowData class provides the primary interface for reading FCS files.
**Standard Reading:**
```python
from flowio import FlowData
# Basic reading
flow = FlowData('sample.fcs')
# Access attributes
version = flow.version # '3.0', '3.1', etc.
event_count = flow.event_count # Number of events
channel_count = flow.channel_count # Number of channels
pnn_labels = flow.pnn_labels # Short channel names
pns_labels = flow.pns_labels # Descriptive stain names
# Get event data
events = flow.as_array() # Preprocessed (gain, log scaling applied)
raw_events = flow.as_array(preprocess=False) # Raw data
```
**Memory-Efficient Metadata Reading:**
When only metadata is needed (no event data):
```python
# Only parse TEXT segment, skip DATA and ANALYSIS
flow = FlowData('sample.fcs', only_text=True)
# Access metadata
metadata = flow.text # Dictionary of TEXT segment keywords
print(metadata.get('$DATE')) # Acquisition date
print(metadata.get('$CYT')) # Instrument name
```
**Handling Problematic Files:**
Some FCS files have offset discrepancies or errors:
```python
# Ignore offset discrepancies between HEADER and TEXT sections
flow = FlowData('problematic.fcs', ignore_offset_discrepancy=True)
# Use HEADER offsets instead of TEXT offsets
flow = FlowData('problematic.fcs', use_header_offsets=True)
# Ignore offset errors entirely
flow = FlowData('problematic.fcs', ignore_offset_error=True)
```
**Excluding Null Channels:**
```python
# Exclude specific channels during parsing
flow = FlowData('sample.fcs', null_channel_list=['Time', 'Null'])
```
### Extracting Metadata and Channel Information
FCS files contain rich metadata in the TEXT segment.
**Common Metadata Keywords:**
```python
flow = FlowData('sample.fcs')
# File-level metadata
text_dict = flow.text
acquisition_date = text_dict.get('$DATE', 'Unknown')
instrument = text_dict.get('$CYT', 'Unknown')
data_type = flow.data_type # 'I', 'F', 'D', 'A'
# Channel metadata
for i in range(flow.channel_count):
pnn = flow.pnn_labels[i] # Short name (e.g., 'FSC-A')
pns = flow.pns_labels[i] # Descriptive name (e.g., 'Forward Scatter')
pnr = flow.pnr_values[i] # Range/max value
print(f"Channel {i}: {pnn} ({pns}), Range: {pnr}")
```
**Channel Type Identification:**
FlowIO automatically categorizes channels:
```python
# Get indices by channel type
scatter_idx = flow.scatter_indices # [0, 1] for FSC, SSC
fluoro_idx = flow.fluoro_indices # [2, 3, 4] for FL channels
time_idx = flow.time_index # Index of time channel (or None)
# Access specific channel types
events = flow.as_array()
scatter_data = events[:, scatter_idx]
fluorescence_data = events[:, fluoro_idx]
```
**ANALYSIS Segment:**
If present, access processed results:
```python
if flow.analysis:
analysis_keywords = flow.analysis # Dictionary of ANALYSIS keywords
print(analysis_keywords)
```
### Creating New FCS Files
Generate FCS files from NumPy arrays or other data sources.
**Basic Creation:**
```python
import numpy as np
from flowio import create_fcs
# Create event data (rows=events, columns=channels)
events = np.random.rand(10000, 5) * 1000
# Define channel names
channel_names = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
# Create FCS file
create_fcs('output.fcs', events, channel_names)
```
**With Descriptive Channel Names:**
```python
# Add optional descriptive names (PnS)
channel_names = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
descriptive_names = ['Forward Scatter', 'Side Scatter', 'FITC', 'PE', 'Time']
create_fcs('output.fcs',
events,
channel_names,
opt_channel_names=descriptive_names)
```
**With Custom Metadata:**
```python
# Add TEXT segment metadata
metadata = {
'$SRC': 'Python script',
'$DATE': '19-OCT-2025',
'$CYT': 'Synthetic Instrument',
'$INST': 'Laboratory A'
}
create_fcs('output.fcs',
events,
channel_names,
opt_channel_names=descriptive_names,
metadata=metadata)
```
**Note:** FlowIO exports as FCS 3.1 with single-precision floating-point data.
### Exporting Modified Data
Modify existing FCS files and re-export them.
**Approach 1: Using write_fcs() Method:**
```python
from flowio import FlowData
# Read original file
flow = FlowData('original.fcs')
# Write with updated metadata
flow.write_fcs('modified.fcs', metadata={'$SRC': 'Modified data'})
```
**Approach 2: Extract, Modify, and Recreate:**
For modifying event data:
```python
from flowio import FlowData, create_fcs
# Read and extract data
flow = FlowData('original.fcs')
events = flow.as_array(preprocess=False)
# Modify event data
events[:, 0] = events[:, 0] * 1.5 # Scale first channel
# Create new FCS file with modified data
create_fcs('modified.fcs',
events,
flow.pnn_labels,
opt_channel_names=flow.pns_labels,
metadata=flow.text)
```
### Handling Multi-Dataset FCS Files
Some FCS files contain multiple datasets in a single file.
**Detecting Multi-Dataset Files:**
```python
from flowio import FlowData, MultipleDataSetsError
try:
flow = FlowData('sample.fcs')
except MultipleDataSetsError:
print("File contains multiple datasets")
# Use read_multiple_data_sets() instead
```
**Reading All Datasets:**
```python
from flowio import read_multiple_data_sets
# Read all datasets from file
datasets = read_multiple_data_sets('multi_dataset.fcs')
print(f"Found {len(datasets)} datasets")
# Process each dataset
for i, dataset in enumerate(datasets):
print(f"\nDataset {i}:")
print(f" Events: {dataset.event_count}")
print(f" Channels: {dataset.pnn_labels}")
# Get event data for this dataset
events = dataset.as_array()
print(f" Shape: {events.shape}")
print(f" Mean values: {events.mean(axis=0)}")
```
**Reading Specific Dataset:**
```python
from flowio import FlowData
# Read first dataset (nextdata_offset=0)
first_dataset = FlowData('multi.fcs', nextdata_offset=0)
# Read second dataset using NEXTDATA offset from first
next_offset = int(first_dataset.text['$NEXTDATA'])
if next_offset > 0:
second_dataset = FlowData('multi.fcs', nextdata_offset=next_offset)
```
## Data Preprocessing
FlowIO applies standard FCS preprocessing transformations when `preprocess=True`.
**Preprocessing Steps:**
1. **Gain Scaling:** Multiply values by PnG (gain) keyword
2. **Logarithmic Transformation:** Apply PnE exponential transformation if present
- Formula: `value = a * 10^(b * raw_value)` where PnE = "a,b"
3. **Time Scaling:** Convert time values to appropriate units
**Controlling Preprocessing:**
```python
# Preprocessed data (default)
preprocessed = flow.as_array(preprocess=True)
# Raw data (no transformations)
raw = flow.as_array(preprocess=False)
```
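As a worked illustration of the PnE transform listed in step 2 (the keyword value and raw value below are hypothetical, and this shows only the documented formula, not flowio's internal code path):
```python
# Hypothetical PnE keyword "4.0,0.01" parsed into its two components
a, b = 4.0, 0.01
raw_value = 250.0

# Documented transform: value = a * 10^(b * raw_value)
scaled = a * 10 ** (b * raw_value)
print(scaled)  # 4.0 * 10**2.5 ≈ 1264.9
```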
## Error Handling
Handle common FlowIO exceptions appropriately.
```python
from flowio import (
FlowData,
FCSParsingError,
DataOffsetDiscrepancyError,
MultipleDataSetsError
)
try:
flow = FlowData('sample.fcs')
events = flow.as_array()
except FCSParsingError as e:
print(f"Failed to parse FCS file: {e}")
# Try with relaxed parsing
flow = FlowData('sample.fcs', ignore_offset_error=True)
except DataOffsetDiscrepancyError as e:
print(f"Offset discrepancy detected: {e}")
# Use ignore_offset_discrepancy parameter
flow = FlowData('sample.fcs', ignore_offset_discrepancy=True)
except MultipleDataSetsError as e:
print(f"Multiple datasets detected: {e}")
# Use read_multiple_data_sets instead
from flowio import read_multiple_data_sets
datasets = read_multiple_data_sets('sample.fcs')
except Exception as e:
print(f"Unexpected error: {e}")
```
## Common Use Cases
### Inspecting FCS File Contents
Quick exploration of FCS file structure:
```python
from flowio import FlowData
flow = FlowData('unknown.fcs')
print("=" * 50)
print(f"File: {flow.name}")
print(f"Version: {flow.version}")
print(f"Size: {flow.file_size:,} bytes")
print("=" * 50)
print(f"\nEvents: {flow.event_count:,}")
print(f"Channels: {flow.channel_count}")
print("\nChannel Information:")
for i, (pnn, pns) in enumerate(zip(flow.pnn_labels, flow.pns_labels)):
ch_type = "scatter" if i in flow.scatter_indices else \
"fluoro" if i in flow.fluoro_indices else \
"time" if i == flow.time_index else "other"
print(f" [{i}] {pnn:10s} | {pns:30s} | {ch_type}")
print("\nKey Metadata:")
for key in ['$DATE', '$BTIM', '$ETIM', '$CYT', '$INST', '$SRC']:
value = flow.text.get(key, 'N/A')
print(f" {key:15s}: {value}")
```
### Batch Processing Multiple Files
Process a directory of FCS files:
```python
from pathlib import Path
from flowio import FlowData
import pandas as pd
# Find all FCS files
fcs_files = list(Path('data/').glob('*.fcs'))
# Extract summary information
summaries = []
for fcs_path in fcs_files:
try:
flow = FlowData(str(fcs_path), only_text=True)
summaries.append({
'filename': fcs_path.name,
'version': flow.version,
'events': flow.event_count,
'channels': flow.channel_count,
'date': flow.text.get('$DATE', 'N/A')
})
except Exception as e:
print(f"Error processing {fcs_path.name}: {e}")
# Create summary DataFrame
df = pd.DataFrame(summaries)
print(df)
```
### Converting FCS to CSV
Export event data to CSV format:
```python
from flowio import FlowData
import pandas as pd
# Read FCS file
flow = FlowData('sample.fcs')
# Convert to DataFrame
df = pd.DataFrame(
flow.as_array(),
columns=flow.pnn_labels
)
# Add metadata as attributes
df.attrs['fcs_version'] = flow.version
df.attrs['instrument'] = flow.text.get('$CYT', 'Unknown')
# Export to CSV
df.to_csv('output.csv', index=False)
print(f"Exported {len(df)} events to CSV")
```
### Filtering Events and Re-exporting
Apply filters and save filtered data:
```python
from flowio import FlowData, create_fcs
import numpy as np
# Read original file
flow = FlowData('sample.fcs')
events = flow.as_array(preprocess=False)
# Apply filtering (example: threshold on first channel)
fsc_idx = 0
threshold = 500
mask = events[:, fsc_idx] > threshold
filtered_events = events[mask]
print(f"Original events: {len(events)}")
print(f"Filtered events: {len(filtered_events)}")
# Create new FCS file with filtered data
create_fcs('filtered.fcs',
filtered_events,
flow.pnn_labels,
opt_channel_names=flow.pns_labels,
metadata={**flow.text, '$SRC': 'Filtered data'})
```
### Extracting Specific Channels
Extract and process specific channels:
```python
from flowio import FlowData
import numpy as np
flow = FlowData('sample.fcs')
events = flow.as_array()
# Extract fluorescence channels only
fluoro_indices = flow.fluoro_indices
fluoro_data = events[:, fluoro_indices]
fluoro_names = [flow.pnn_labels[i] for i in fluoro_indices]
print(f"Fluorescence channels: {fluoro_names}")
print(f"Shape: {fluoro_data.shape}")
# Calculate statistics per channel
for i, name in enumerate(fluoro_names):
channel_data = fluoro_data[:, i]
print(f"\n{name}:")
print(f" Mean: {channel_data.mean():.2f}")
print(f" Median: {np.median(channel_data):.2f}")
print(f" Std Dev: {channel_data.std():.2f}")
```
## Best Practices
1. **Memory Efficiency:** Use `only_text=True` when event data is not needed
2. **Error Handling:** Wrap file operations in try-except blocks for robust code
3. **Multi-Dataset Detection:** Check for MultipleDataSetsError and use appropriate function
4. **Preprocessing Control:** Explicitly set `preprocess` parameter based on analysis needs
5. **Offset Issues:** If parsing fails, try `ignore_offset_discrepancy=True` parameter
6. **Channel Validation:** Verify channel counts and names match expectations before processing
7. **Metadata Preservation:** When modifying files, preserve original TEXT segment keywords
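A minimal sketch combining practices 1 and 3 with a relaxed-parsing fallback (the file name is illustrative, and the retry strategy mirrors the Error Handling section above rather than a prescribed flowio pattern):
```python
from flowio import FlowData, FCSParsingError, MultipleDataSetsError, read_multiple_data_sets

def load_metadata(path):
    """Load TEXT-segment metadata robustly."""
    try:
        # Practice 1: parse only the TEXT segment when event data is not needed
        return FlowData(path, only_text=True)
    except MultipleDataSetsError:
        # Practice 3: fall back to the multi-dataset reader, keep the first dataset
        return read_multiple_data_sets(path)[0]
    except FCSParsingError:
        # Retry with relaxed offset handling
        return FlowData(path, only_text=True, ignore_offset_error=True)

meta = load_metadata('sample.fcs')
print(meta.text.get('$CYT', 'Unknown'))
```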
## Advanced Topics
### Understanding FCS File Structure
FCS files consist of four segments:
1. **HEADER:** FCS version and byte offsets for other segments
2. **TEXT:** Key-value metadata pairs (delimiter-separated)
3. **DATA:** Raw event data (binary/float/ASCII format)
4. **ANALYSIS** (optional): Results from data processing
Access these segments via FlowData attributes:
- `flow.header` - HEADER segment
- `flow.text` - TEXT segment keywords
- `flow.events` - DATA segment (as bytes)
- `flow.analysis` - ANALYSIS segment keywords (if present)
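A short sketch exercising these attributes (assuming a local `sample.fcs`):
```python
from flowio import FlowData

flow = FlowData('sample.fcs')

print(flow.header)                            # HEADER: version and byte offsets
print(len(flow.text), "TEXT keywords")        # TEXT: keyword/value metadata
print(len(flow.events), "bytes of raw DATA")  # DATA: use as_array() for a usable matrix
if flow.analysis:                             # ANALYSIS: present only in some files
    print(flow.analysis)
```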
### Detailed API Reference
For comprehensive API documentation including all parameters, methods, exceptions, and FCS keyword reference, consult the detailed reference file:
**Read:** `references/api_reference.md`
The reference includes:
- Complete FlowData class documentation
- All utility functions (read_multiple_data_sets, create_fcs)
- Exception classes and handling
- FCS file structure details
- Common TEXT segment keywords
- Extended example workflows
When working with complex FCS operations or encountering unusual file formats, load this reference for detailed guidance.
## Integration Notes
**NumPy Arrays:** All event data is returned as NumPy ndarrays with shape (events, channels)
**Pandas DataFrames:** Easily convert to DataFrames for analysis:
```python
import pandas as pd
df = pd.DataFrame(flow.as_array(), columns=flow.pnn_labels)
```
**FlowKit Integration:** For advanced analysis (compensation, gating, FlowJo support), use FlowKit library which builds on FlowIO's parsing capabilities
**Web Applications:** FlowIO's minimal dependencies make it ideal for web backend services processing FCS uploads
## Troubleshooting
**Problem:** "Offset discrepancy error"
**Solution:** Use `ignore_offset_discrepancy=True` parameter
**Problem:** "Multiple datasets error"
**Solution:** Use `read_multiple_data_sets()` function instead of FlowData constructor
**Problem:** Out of memory with large files
**Solution:** Use `only_text=True` for metadata-only operations, or process events in chunks
**Problem:** Unexpected channel counts
**Solution:** Check for null channels; use `null_channel_list` parameter to exclude them
**Problem:** Cannot modify event data in place
**Solution:** FlowIO doesn't support direct modification; extract data, modify, then use `create_fcs()` to save
## Summary
FlowIO provides essential FCS file handling capabilities for flow cytometry workflows. Use it for parsing, metadata extraction, and file creation. For simple file operations and data extraction, FlowIO is sufficient. For complex analysis including compensation and gating, integrate with FlowKit or other specialized tools.


@@ -0,0 +1,372 @@
# FlowIO API Reference
## Overview
FlowIO is a Python library for reading and writing Flow Cytometry Standard (FCS) files. It supports FCS versions 2.0, 3.0, and 3.1 with minimal dependencies.
## Installation
```bash
pip install flowio
```
Supports Python 3.9 and later.
## Core Classes
### FlowData
The primary class for working with FCS files.
#### Constructor
```python
FlowData(fcs_file,
ignore_offset_error=False,
ignore_offset_discrepancy=False,
use_header_offsets=False,
only_text=False,
nextdata_offset=None,
null_channel_list=None)
```
**Parameters:**
- `fcs_file`: File path (str), Path object, or file handle
- `ignore_offset_error` (bool): Ignore offset errors (default: False)
- `ignore_offset_discrepancy` (bool): Ignore offset discrepancies between HEADER and TEXT sections (default: False)
- `use_header_offsets` (bool): Use HEADER section offsets instead of TEXT section (default: False)
- `only_text` (bool): Only parse the TEXT segment, skip DATA and ANALYSIS (default: False)
- `nextdata_offset` (int): Byte offset for reading multi-dataset files
- `null_channel_list` (list): List of PnN labels for null channels to exclude
#### Attributes
**File Information:**
- `name`: Name of the FCS file
- `file_size`: Size of the file in bytes
- `version`: FCS version (e.g., '3.0', '3.1')
- `header`: Dictionary containing HEADER segment information
- `data_type`: Type of data format ('I', 'F', 'D', 'A')
**Channel Information:**
- `channel_count`: Number of channels in the dataset
- `channels`: Dictionary mapping channel numbers to channel info
- `pnn_labels`: List of PnN (short channel name) labels
- `pns_labels`: List of PnS (descriptive stain name) labels
- `pnr_values`: List of PnR (range) values for each channel
- `fluoro_indices`: List of indices for fluorescence channels
- `scatter_indices`: List of indices for scatter channels
- `time_index`: Index of the time channel (or None)
- `null_channels`: List of null channel indices
**Event Data:**
- `event_count`: Number of events (rows) in the dataset
- `events`: Raw event data as bytes
**Metadata:**
- `text`: Dictionary of TEXT segment key-value pairs
- `analysis`: Dictionary of ANALYSIS segment key-value pairs (if present)
#### Methods
##### as_array()
```python
as_array(preprocess=True)
```
Return event data as a 2-D NumPy array.
**Parameters:**
- `preprocess` (bool): Apply gain, logarithmic, and time scaling transformations (default: True)
**Returns:**
- NumPy ndarray with shape (event_count, channel_count)
**Example:**
```python
flow_data = FlowData('sample.fcs')
events_array = flow_data.as_array() # Preprocessed data
raw_array = flow_data.as_array(preprocess=False) # Raw data
```
##### write_fcs()
```python
write_fcs(filename, metadata=None)
```
Export the FlowData instance as a new FCS file.
**Parameters:**
- `filename` (str): Output file path
- `metadata` (dict): Optional dictionary of TEXT segment keywords to add/update
**Example:**
```python
flow_data = FlowData('sample.fcs')
flow_data.write_fcs('output.fcs', metadata={'$SRC': 'Modified data'})
```
**Note:** Exports as FCS 3.1 with single-precision floating-point data.
## Utility Functions
### read_multiple_data_sets()
```python
read_multiple_data_sets(fcs_file,
ignore_offset_error=False,
ignore_offset_discrepancy=False,
use_header_offsets=False)
```
Read all datasets from an FCS file containing multiple datasets.
**Parameters:**
- Same as FlowData constructor (except `nextdata_offset`)
**Returns:**
- List of FlowData instances, one for each dataset
**Example:**
```python
from flowio import read_multiple_data_sets
datasets = read_multiple_data_sets('multi_dataset.fcs')
print(f"Found {len(datasets)} datasets")
for i, dataset in enumerate(datasets):
print(f"Dataset {i}: {dataset.event_count} events")
```
### create_fcs()
```python
create_fcs(filename,
event_data,
channel_names,
opt_channel_names=None,
metadata=None)
```
Create a new FCS file from event data.
**Parameters:**
- `filename` (str): Output file path
- `event_data` (ndarray): 2-D NumPy array of event data (rows=events, columns=channels)
- `channel_names` (list): List of PnN (short) channel names
- `opt_channel_names` (list): Optional list of PnS (descriptive) channel names
- `metadata` (dict): Optional dictionary of TEXT segment keywords
**Example:**
```python
import numpy as np
from flowio import create_fcs
# Create synthetic data
events = np.random.rand(10000, 5)
channels = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
opt_channels = ['Forward Scatter', 'Side Scatter', 'FITC', 'PE', 'Time']
create_fcs('synthetic.fcs',
events,
channels,
opt_channel_names=opt_channels,
metadata={'$SRC': 'Synthetic data'})
```
## Exception Classes
### FlowIOWarning
Generic warning class for non-critical issues.
### PnEWarning
Warning raised when PnE values are invalid during FCS file creation.
### FlowIOException
Base exception class for FlowIO errors.
### FCSParsingError
Raised when there are issues parsing an FCS file.
### DataOffsetDiscrepancyError
Raised when the HEADER and TEXT sections provide different byte offsets for data segments.
**Workaround:** Use `ignore_offset_discrepancy=True` parameter when creating FlowData instance.
### MultipleDataSetsError
Raised when attempting to read a file with multiple datasets using the standard FlowData constructor.
**Solution:** Use `read_multiple_data_sets()` function instead.
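A compact illustration of catching these exceptions (the file name is illustrative; the fallbacks mirror the workarounds listed above):
```python
from flowio import (FlowData, FCSParsingError, DataOffsetDiscrepancyError,
                    MultipleDataSetsError, read_multiple_data_sets)

try:
    flow = FlowData('sample.fcs')
except DataOffsetDiscrepancyError:
    flow = FlowData('sample.fcs', ignore_offset_discrepancy=True)
except MultipleDataSetsError:
    flow = read_multiple_data_sets('sample.fcs')[0]  # keep the first dataset
except FCSParsingError:
    flow = FlowData('sample.fcs', ignore_offset_error=True)
```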
## FCS File Structure Reference
FCS files consist of four segments:
1. **HEADER**: Contains FCS version and byte locations of other segments
2. **TEXT**: Key-value metadata pairs (delimited format)
3. **DATA**: Raw event data (binary, floating-point, or ASCII)
4. **ANALYSIS** (optional): Results from data processing
### Common TEXT Segment Keywords
- `$BEGINDATA`, `$ENDDATA`: Byte offsets for DATA segment
- `$BEGINANALYSIS`, `$ENDANALYSIS`: Byte offsets for ANALYSIS segment
- `$BYTEORD`: Byte order (1,2,3,4 for little-endian; 4,3,2,1 for big-endian)
- `$DATATYPE`: Data type ('I'=integer, 'F'=float, 'D'=double, 'A'=ASCII)
- `$MODE`: Data mode ('L'=list mode, most common)
- `$NEXTDATA`: Offset to next dataset (0 if single dataset)
- `$PAR`: Number of parameters (channels)
- `$TOT`: Total number of events
- `PnN`: Short name for parameter n
- `PnS`: Descriptive stain name for parameter n
- `PnR`: Range (max value) for parameter n
- `PnE`: Amplification exponent for parameter n (format: "a,b" where value = a * 10^(b*x))
- `PnG`: Amplification gain for parameter n
## Channel Types
FlowIO automatically categorizes channels:
- **Scatter channels**: FSC (forward scatter), SSC (side scatter)
- **Fluorescence channels**: FL1, FL2, FITC, PE, etc.
- **Time channel**: Usually labeled "Time"
Access indices via:
- `flow_data.scatter_indices`
- `flow_data.fluoro_indices`
- `flow_data.time_index`
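For example (same attributes as above, applied to the event matrix; `sample.fcs` is illustrative):
```python
from flowio import FlowData

flow_data = FlowData('sample.fcs')
events = flow_data.as_array()

scatter = events[:, flow_data.scatter_indices]   # forward/side scatter columns
fluoro = events[:, flow_data.fluoro_indices]     # fluorescence columns
if flow_data.time_index is not None:
    time_values = events[:, flow_data.time_index]
```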
## Data Preprocessing
When calling `as_array(preprocess=True)`, FlowIO applies:
1. **Gain scaling**: Multiply by PnG value
2. **Logarithmic transformation**: Apply PnE exponential transformation if present
3. **Time scaling**: Convert time values to appropriate units
To access raw, unprocessed data: `as_array(preprocess=False)`
## Best Practices
1. **Memory efficiency**: Use `only_text=True` when only metadata is needed
2. **Error handling**: Wrap file operations in try-except blocks for FCSParsingError
3. **Multi-dataset files**: Always use `read_multiple_data_sets()` if unsure about dataset count
4. **Offset issues**: If encountering offset errors, try `ignore_offset_discrepancy=True`
5. **Channel selection**: Use null_channel_list to exclude unwanted channels during parsing
## Integration with FlowKit
For advanced flow cytometry analysis including compensation, gating, and GatingML support, consider using FlowKit library alongside FlowIO. FlowKit provides higher-level abstractions built on top of FlowIO's file parsing capabilities.
## Example Workflows
### Basic File Reading
```python
from flowio import FlowData
# Read FCS file
flow = FlowData('experiment.fcs')
# Print basic info
print(f"Version: {flow.version}")
print(f"Events: {flow.event_count}")
print(f"Channels: {flow.channel_count}")
print(f"Channel names: {flow.pnn_labels}")
# Get event data
events = flow.as_array()
print(f"Data shape: {events.shape}")
```
### Metadata Extraction
```python
from flowio import FlowData
flow = FlowData('sample.fcs', only_text=True)
# Access metadata
print(f"Acquisition date: {flow.text.get('$DATE', 'N/A')}")
print(f"Instrument: {flow.text.get('$CYT', 'N/A')}")
# Channel information
for i, (pnn, pns) in enumerate(zip(flow.pnn_labels, flow.pns_labels)):
print(f"Channel {i}: {pnn} ({pns})")
```
### Creating New FCS Files
```python
import numpy as np
from flowio import create_fcs
# Generate or process data
data = np.random.rand(5000, 3) * 1000
# Define channels
channels = ['FSC-A', 'SSC-A', 'FL1-A']
stains = ['Forward Scatter', 'Side Scatter', 'GFP']
# Create FCS file
create_fcs('output.fcs',
data,
channels,
opt_channel_names=stains,
metadata={
'$SRC': 'Python script',
'$DATE': '19-OCT-2025'
})
```
### Processing Multi-Dataset Files
```python
from flowio import read_multiple_data_sets
# Read all datasets
datasets = read_multiple_data_sets('multi.fcs')
# Process each dataset
for i, dataset in enumerate(datasets):
print(f"\nDataset {i}:")
print(f" Events: {dataset.event_count}")
print(f" Channels: {dataset.pnn_labels}")
# Get data array
events = dataset.as_array()
mean_values = events.mean(axis=0)
print(f" Mean values: {mean_values}")
```
### Modifying and Re-exporting
```python
from flowio import FlowData
# Read original file
flow = FlowData('original.fcs')
# Get event data
events = flow.as_array(preprocess=False)
# Modify data (example: apply custom transformation)
events[:, 0] = events[:, 0] * 1.5 # Scale first channel
# Note: FlowData cannot write a modified event array back to the original object,
# so save the changes by creating a new FCS file with create_fcs():
from flowio import create_fcs
create_fcs('modified.fcs',
events,
flow.pnn_labels,
opt_channel_names=flow.pns_labels,
metadata=flow.text)
```


@@ -0,0 +1,870 @@
---
name: gget
description: Toolkit for querying genomic databases and performing bioinformatics analysis. Use this skill when working with gene sequences, protein structures, genomic databases (Ensembl, UniProt, NCBI, PDB, COSMIC, etc.), performing BLAST/BLAT searches, retrieving gene expression data, conducting enrichment analysis, predicting protein structures with AlphaFold, analyzing mutations, or any bioinformatics workflow requiring efficient database queries. This skill applies to tasks involving nucleotide/amino acid sequences, gene names, Ensembl IDs, UniProt accessions, or requests for genomic annotations, orthologs, disease associations, drug information, or single-cell RNA-seq data.
---
# gget
## Overview
gget is a command-line bioinformatics tool and Python package providing unified access to 20+ genomic databases and analysis methods. Execute queries for gene information, sequence analysis, protein structures, expression data, and disease associations through a consistent interface. All gget modules work both as command-line tools and as Python functions.
**Important**: The databases queried by gget are continuously updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary.
## Installation
Install gget in a clean virtual environment to avoid conflicts:
```bash
# Using uv (recommended)
uv pip install gget
# Or using pip
pip install --upgrade gget
# In Python/Jupyter
import gget
```
## Quick Start
Basic usage pattern for all modules:
```bash
# Command-line
gget <module> [arguments] [options]
# Python
gget.module(arguments, options)
```
Most modules return:
- **Command-line**: JSON (default) or CSV with `-csv` flag
- **Python**: DataFrame or dictionary
Common flags across modules:
- `-o/--out`: Save results to file
- `-q/--quiet`: Suppress progress information
- `-csv`: Return CSV format (command-line only)
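In Python, the same results arrive as a DataFrame, so persisting them mirrors the CLI flags (a sketch assuming pandas is installed; `brca2` is just an example search term):
```python
import gget

# Python calls return a pandas DataFrame (or dictionary, depending on the module)
df = gget.search(["brca2"], species="homo_sapiens")

# Equivalent to the CLI's -csv and -o/--out flags
df.to_csv("search_results.csv", index=False)
```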
## Module Categories
### 1. Reference & Gene Information
#### gget ref - Reference Genome Downloads
Retrieve download links and metadata for Ensembl reference genomes.
**Parameters**:
- `species`: Genus_species format (e.g., 'homo_sapiens', 'mus_musculus'). Shortcuts: 'human', 'mouse'
- `-w/--which`: Specify return types (gtf, cdna, dna, cds, cdrna, pep). Default: all
- `-r/--release`: Ensembl release number (default: latest)
- `-l/--list_species`: List available vertebrate species
- `-liv/--list_iv_species`: List available invertebrate species
- `-ftp`: Return only FTP links
- `-d/--download`: Download files (requires curl)
**Examples**:
```bash
# List available species
gget ref --list_species
# Get all reference files for human
gget ref homo_sapiens
# Download only GTF annotation for mouse
gget ref -w gtf -d mouse
```
```python
# Python
gget.ref("homo_sapiens")
gget.ref("mus_musculus", which="gtf", download=True)
```
#### gget search - Gene Search
Locate genes by name or description across species.
**Parameters**:
- `searchwords`: One or more search terms (case-insensitive)
- `-s/--species`: Target species (e.g., 'homo_sapiens', 'mouse')
- `-r/--release`: Ensembl release number
- `-t/--id_type`: Return 'gene' (default) or 'transcript'
- `-ao/--andor`: 'or' (default) finds ANY searchword; 'and' requires ALL
- `-l/--limit`: Maximum results to return
**Returns**: ensembl_id, gene_name, ensembl_description, ext_ref_description, biotype, URL
**Examples**:
```bash
# Search for GABA-related genes in human
gget search -s human gaba gamma-aminobutyric
# Find specific gene, require all terms
gget search -s mouse -ao and pax7 transcription
```
```python
# Python
gget.search(["gaba", "gamma-aminobutyric"], species="homo_sapiens")
```
#### gget info - Gene/Transcript Information
Retrieve comprehensive gene and transcript metadata from Ensembl, UniProt, and NCBI.
**Parameters**:
- `ens_ids`: One or more Ensembl IDs (also supports WormBase, Flybase IDs). Limit: ~1000 IDs
- `-n/--ncbi`: Disable NCBI data retrieval
- `-u/--uniprot`: Disable UniProt data retrieval
- `-pdb`: Include PDB identifiers (increases runtime)
**Returns**: UniProt ID, NCBI gene ID, primary gene name, synonyms, protein names, descriptions, biotype, canonical transcript
**Examples**:
```bash
# Get info for multiple genes
gget info ENSG00000034713 ENSG00000104853 ENSG00000170296
# Include PDB IDs
gget info ENSG00000034713 -pdb
```
```python
# Python
gget.info(["ENSG00000034713", "ENSG00000104853"], pdb=True)
```
#### gget seq - Sequence Retrieval
Fetch nucleotide or amino acid sequences for genes and transcripts.
**Parameters**:
- `ens_ids`: One or more Ensembl identifiers
- `-t/--translate`: Fetch amino acid sequences instead of nucleotide
- `-iso/--isoforms`: Return all transcript variants (gene IDs only)
**Returns**: FASTA format sequences
**Examples**:
```bash
# Get nucleotide sequences
gget seq ENSG00000034713 ENSG00000104853
# Get all protein isoforms
gget seq -t -iso ENSG00000034713
```
```python
# Python
gget.seq(["ENSG00000034713"], translate=True, isoforms=True)
```
### 2. Sequence Analysis & Alignment
#### gget blast - BLAST Searches
BLAST nucleotide or amino acid sequences against standard databases.
**Parameters**:
- `sequence`: Sequence string or path to FASTA/.txt file
- `-p/--program`: blastn, blastp, blastx, tblastn, tblastx (auto-detected)
- `-db/--database`:
- Nucleotide: nt, refseq_rna, pdbnt
- Protein: nr, swissprot, pdbaa, refseq_protein
- `-l/--limit`: Max hits (default: 50)
- `-e/--expect`: E-value cutoff (default: 10.0)
- `-lcf/--low_comp_filt`: Enable low complexity filtering
- `-mbo/--megablast_off`: Disable MegaBLAST (blastn only)
**Examples**:
```bash
# BLAST protein sequence
gget blast MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
# BLAST from file with specific database
gget blast sequence.fasta -db swissprot -l 10
```
```python
# Python
gget.blast("MKWMFK...", database="swissprot", limit=10)
```
#### gget blat - BLAT Searches
Locate genomic positions of sequences using UCSC BLAT.
**Parameters**:
- `sequence`: Sequence string or path to FASTA/.txt file
- `-st/--seqtype`: 'DNA', 'protein', 'translated%20RNA', 'translated%20DNA' (auto-detected)
- `-a/--assembly`: Target assembly (default: 'human'/hg38; options: 'mouse'/mm39, 'zebrafinch'/taeGut2, etc.)
**Returns**: genome, query size, alignment positions, matches, mismatches, alignment percentage
**Examples**:
```bash
# Find genomic location in human
gget blat ATCGATCGATCGATCG
# Search in different assembly
gget blat -a mm39 ATCGATCGATCGATCG
```
```python
# Python
gget.blat("ATCGATCGATCGATCG", assembly="mouse")
```
#### gget muscle - Multiple Sequence Alignment
Align multiple nucleotide or amino acid sequences using Muscle5.
**Parameters**:
- `fasta`: Sequences or path to FASTA/.txt file
- `-s5/--super5`: Use Super5 algorithm for faster processing (large datasets)
**Returns**: Aligned sequences in ClustalW format or aligned FASTA (.afa)
**Examples**:
```bash
# Align sequences from file
gget muscle sequences.fasta -o aligned.afa
# Use Super5 for large dataset
gget muscle large_dataset.fasta -s5
```
```python
# Python
gget.muscle("sequences.fasta", save=True)
```
#### gget diamond - Local Sequence Alignment
Perform fast local protein or translated DNA alignment using DIAMOND.
**Parameters**:
- Query: Sequences (string/list) or FASTA file path
- `--reference`: Reference sequences (string/list) or FASTA file path (required)
- `--sensitivity`: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive (default), ultra-sensitive
- `--threads`: CPU threads (default: 1)
- `--diamond_db`: Save database for reuse
- `--translated`: Enable nucleotide-to-amino acid alignment
**Returns**: Identity percentage, sequence lengths, match positions, gap openings, E-values, bit scores
**Examples**:
```bash
# Align against reference
gget diamond GGETISAWESQME -ref reference.fasta --threads 4
# Save database for reuse
gget diamond query.fasta -ref ref.fasta --diamond_db my_db.dmnd
```
```python
# Python
gget.diamond("GGETISAWESQME", reference="reference.fasta", threads=4)
```
### 3. Structural & Protein Analysis
#### gget pdb - Protein Structures
Query RCSB Protein Data Bank for structure and metadata.
**Parameters**:
- `pdb_id`: PDB identifier (e.g., '7S7U')
- `-r/--resource`: Data type (pdb, entry, pubmed, assembly, entity types)
- `-i/--identifier`: Assembly, entity, or chain ID
**Returns**: PDB format (structures) or JSON (metadata)
**Examples**:
```bash
# Download PDB structure
gget pdb 7S7U -o 7S7U.pdb
# Get metadata
gget pdb 7S7U -r entry
```
```python
# Python
gget.pdb("7S7U", save=True)
```
#### gget alphafold - Protein Structure Prediction
Predict 3D protein structures using simplified AlphaFold2.
**Setup Required**:
```bash
# Install OpenMM first (version depends on Python version)
# Python < 3.10:
conda install -qy conda==4.13.0 && conda install -qy -c conda-forge openmm=7.5.1
# Python 3.10:
conda install -qy conda==24.1.2 && conda install -qy -c conda-forge openmm=7.7.0
# Python 3.11:
conda install -qy conda==24.11.1 && conda install -qy -c conda-forge openmm=8.0.0
# Then setup AlphaFold
gget setup alphafold
```
**Parameters**:
- `sequence`: Amino acid sequence (string), multiple sequences (list), or FASTA file. Multiple sequences trigger multimer modeling
- `-mr/--multimer_recycles`: Recycling iterations (default: 3; recommend 20 for accuracy)
- `-mfm/--multimer_for_monomer`: Apply multimer model to single proteins
- `-r/--relax`: AMBER relaxation for top-ranked model
- `plot`: Python-only; generate interactive 3D visualization (default: True)
- `show_sidechains`: Python-only; include side chains (default: True)
**Returns**: PDB structure file, JSON alignment error data, optional 3D visualization
**Examples**:
```bash
# Predict single protein structure
gget alphafold MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
# Predict multimer with higher accuracy
gget alphafold sequence1.fasta -mr 20 -r
```
```python
# Python with visualization
gget.alphafold("MKWMFK...", plot=True, show_sidechains=True)
# Multimer prediction
gget.alphafold(["sequence1", "sequence2"], multimer_recycles=20)
```
#### gget elm - Eukaryotic Linear Motifs
Predict Eukaryotic Linear Motifs in protein sequences.
**Setup Required**:
```bash
gget setup elm
```
**Parameters**:
- `sequence`: Amino acid sequence or UniProt Acc
- `-u/--uniprot`: Indicates sequence is UniProt Acc
- `-e/--expand`: Include protein names, organisms, references
- `-s/--sensitivity`: DIAMOND alignment sensitivity (default: "very-sensitive")
- `-t/--threads`: Number of threads (default: 1)
**Returns**: Two outputs:
1. **ortholog_df**: Linear motifs from orthologous proteins
2. **regex_df**: Motifs directly matched in input sequence
**Examples**:
```bash
# Predict motifs from sequence
gget elm LIAQSIGQASFV -o results
# Use UniProt accession with expanded info
gget elm --uniprot Q02410 -e
```
```python
# Python
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
```
### 4. Expression & Disease Data
#### gget archs4 - Gene Correlation & Tissue Expression
Query ARCHS4 database for correlated genes or tissue expression data.
**Parameters**:
- `gene`: Gene symbol or Ensembl ID (with `--ensembl` flag)
- `-w/--which`: 'correlation' (default, returns 100 most correlated genes) or 'tissue' (expression atlas)
- `-s/--species`: 'human' (default) or 'mouse' (tissue data only)
- `-e/--ensembl`: Input is Ensembl ID
**Returns**:
- **Correlation mode**: Gene symbols, Pearson correlation coefficients
- **Tissue mode**: Tissue identifiers, min/Q1/median/Q3/max expression values
**Examples**:
```bash
# Get correlated genes
gget archs4 ACE2
# Get tissue expression
gget archs4 -w tissue ACE2
```
```python
# Python
gget.archs4("ACE2", which="tissue")
```
#### gget cellxgene - Single-Cell RNA-seq Data
Query CZ CELLxGENE Discover Census for single-cell data.
**Setup Required**:
```bash
gget setup cellxgene
```
**Parameters**:
- `--gene` (-g): Gene names or Ensembl IDs (case-sensitive! 'PAX7' for human, 'Pax7' for mouse)
- `--tissue`: Tissue type(s)
- `--cell_type`: Specific cell type(s)
- `--species` (-s): 'homo_sapiens' (default) or 'mus_musculus'
- `--census_version` (-cv): Version ("stable", "latest", or dated)
- `--ensembl` (-e): Use Ensembl IDs
- `--meta_only` (-mo): Return metadata only
- Additional filters: disease, development_stage, sex, assay, dataset_id, donor_id, ethnicity, suspension_type
**Returns**: AnnData object with count matrices and metadata (or metadata-only dataframes)
**Examples**:
```bash
# Get single-cell data for specific genes and cell types
gget cellxgene --gene ACE2 ABCA1 --tissue lung --cell_type "mucus secreting cell" -o lung_data.h5ad
# Metadata only
gget cellxgene --gene PAX7 --tissue muscle --meta_only -o metadata.csv
```
```python
# Python
adata = gget.cellxgene(gene=["ACE2", "ABCA1"], tissue="lung", cell_type="mucus secreting cell")
```
#### gget enrichr - Enrichment Analysis
Perform ontology enrichment analysis on gene lists using Enrichr.
**Parameters**:
- `genes`: Gene symbols or Ensembl IDs
- `-db/--database`: Reference database (supports shortcuts: 'pathway', 'transcription', 'ontology', 'diseases_drugs', 'celltypes')
- `-s/--species`: human (default), mouse, fly, yeast, worm, fish
- `-bkg_l/--background_list`: Background genes for comparison
- `-ko/--kegg_out`: Save KEGG pathway images with highlighted genes
- `plot`: Python-only; generate graphical results
**Database Shortcuts**:
- 'pathway' → KEGG_2021_Human
- 'transcription' → ChEA_2016
- 'ontology' → GO_Biological_Process_2021
- 'diseases_drugs' → GWAS_Catalog_2019
- 'celltypes' → PanglaoDB_Augmented_2021
**Examples**:
```bash
# Enrichment analysis for ontology
gget enrichr -db ontology ACE2 AGT AGTR1
# Save KEGG pathways
gget enrichr -db pathway ACE2 AGT AGTR1 -ko ./kegg_images/
```
```python
# Python with plot
gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology", plot=True)
```
#### gget bgee - Orthology & Expression
Retrieve orthology and gene expression data from Bgee database.
**Parameters**:
- `ens_id`: Ensembl gene ID or NCBI gene ID (for non-Ensembl species). Multiple IDs supported when `type=expression`
- `-t/--type`: 'orthologs' (default) or 'expression'
**Returns**:
- **Orthologs mode**: Matching genes across species with IDs, names, taxonomic info
- **Expression mode**: Anatomical entities, confidence scores, expression status
**Examples**:
```bash
# Get orthologs
gget bgee ENSG00000169194
# Get expression data
gget bgee ENSG00000169194 -t expression
# Multiple genes
gget bgee ENSBTAG00000047356 ENSBTAG00000018317 -t expression
```
```python
# Python
gget.bgee("ENSG00000169194", type="orthologs")
```
#### gget opentargets - Disease & Drug Associations
Retrieve disease and drug associations from OpenTargets.
**Parameters**:
- Ensembl gene ID (required)
- `-r/--resource`: diseases (default), drugs, tractability, pharmacogenetics, expression, depmap, interactions
- `-l/--limit`: Cap results count
- Filter arguments (vary by resource):
- drugs: `--filter_disease`
- pharmacogenetics: `--filter_drug`
- expression/depmap: `--filter_tissue`, `--filter_anat_sys`, `--filter_organ`
- interactions: `--filter_protein_a`, `--filter_protein_b`, `--filter_gene_b`
**Examples**:
```bash
# Get associated diseases
gget opentargets ENSG00000169194 -r diseases -l 5
# Get associated drugs
gget opentargets ENSG00000169194 -r drugs -l 10
# Get tissue expression
gget opentargets ENSG00000169194 -r expression --filter_tissue brain
```
```python
# Python
gget.opentargets("ENSG00000169194", resource="diseases", limit=5)
```
#### gget cbio - cBioPortal Cancer Genomics
Plot cancer genomics heatmaps using cBioPortal data.
**Two subcommands**:
**search** - Find study IDs:
```bash
gget cbio search breast lung
```
**plot** - Generate heatmaps:
**Parameters**:
- `-s/--study_ids`: Space-separated cBioPortal study IDs (required)
- `-g/--genes`: Space-separated gene names or Ensembl IDs (required)
- `-st/--stratification`: Column to organize data (tissue, cancer_type, cancer_type_detailed, study_id, sample)
- `-vt/--variation_type`: Data type (mutation_occurrences, cna_nonbinary, sv_occurrences, cna_occurrences, Consequence)
- `-f/--filter`: Filter by column value (e.g., 'study_id:msk_impact_2017')
- `-dd/--data_dir`: Cache directory (default: ./gget_cbio_cache)
- `-fd/--figure_dir`: Output directory (default: ./gget_cbio_figures)
- `-dpi`: Resolution (default: 100)
- `-sh/--show`: Display plot in window
- `-nc/--no_confirm`: Skip download confirmations
**Examples**:
```bash
# Search for studies
gget cbio search esophag ovary
# Create heatmap
gget cbio plot -s msk_impact_2017 -g AKT1 ALK BRAF -st tissue -vt mutation_occurrences
```
```python
# Python
gget.cbio_search(["esophag", "ovary"])
gget.cbio_plot(["msk_impact_2017"], ["AKT1", "ALK"], stratification="tissue")
```
#### gget cosmic - COSMIC Database
Search COSMIC (Catalogue Of Somatic Mutations In Cancer) database.
**Important**: License fees apply for commercial use. Requires COSMIC account credentials.
**Parameters**:
- `searchterm`: Gene name, Ensembl ID, mutation notation, or sample ID
- `-ctp/--cosmic_tsv_path`: Path to downloaded COSMIC TSV file (required for querying)
- `-l/--limit`: Maximum results (default: 100)
**Database download flags**:
- `-d/--download_cosmic`: Activate download mode
- `-gm/--gget_mutate`: Create version for gget mutate
- `-cp/--cosmic_project`: Database type (cancer, census, cell_line, resistance, genome_screen, targeted_screen)
- `-cv/--cosmic_version`: COSMIC version
- `-gv/--grch_version`: Human reference genome (37 or 38)
- `--email`, `--password`: COSMIC credentials
**Examples**:
```bash
# First download database
gget cosmic -d --email user@example.com --password xxx -cp cancer
# Then query
gget cosmic EGFR -ctp cosmic_data.tsv -l 10
```
```python
# Python
gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
```
### 5. Additional Tools
#### gget mutate - Generate Mutated Sequences
Generate mutated nucleotide sequences from mutation annotations.
**Parameters**:
- `sequences`: FASTA file path or direct sequence input (string/list)
- `-m/--mutations`: CSV/TSV file or DataFrame with mutation data (required)
- `-mc/--mut_column`: Mutation column name (default: 'mutation')
- `-sic/--seq_id_column`: Sequence ID column (default: 'seq_ID')
- `-mic/--mut_id_column`: Mutation ID column
- `-k/--k`: Length of flanking sequences (default: 30 nucleotides)
**Returns**: Mutated sequences in FASTA format
**Examples**:
```bash
# Single mutation
gget mutate ATCGCTAAGCT -m "c.4G>T"
# Multiple sequences with mutations from file
gget mutate sequences.fasta -m mutations.csv -o mutated.fasta
```
```python
# Python
import pandas as pd
mutations_df = pd.DataFrame({"seq_ID": ["seq1"], "mutation": ["c.4G>T"]})
gget.mutate(["ATCGCTAAGCT"], mutations=mutations_df)
```
#### gget gpt - OpenAI Text Generation
Generate natural language text using OpenAI's API.
**Setup Required**:
```bash
gget setup gpt
```
**Important**: Free tier limited to 3 months after account creation. Set monthly billing limits.
**Parameters**:
- `prompt`: Text input for generation (required)
- `api_key`: OpenAI authentication (required)
- Model configuration: temperature, top_p, max_tokens, frequency_penalty, presence_penalty
- Default model: gpt-3.5-turbo (configurable)
**Examples**:
```bash
gget gpt "Explain CRISPR" --api_key your_key_here
```
```python
# Python
gget.gpt("Explain CRISPR", api_key="your_key_here")
```
#### gget setup - Install Dependencies
Install/download third-party dependencies for specific modules.
**Parameters**:
- `module`: Module name requiring dependency installation
- `-o/--out`: Output folder path (elm module only)
**Modules requiring setup**:
- `alphafold` - Downloads ~4GB of model parameters
- `cellxgene` - Installs cellxgene-census (may not support latest Python)
- `elm` - Downloads local ELM database
- `gpt` - Configures OpenAI integration
**Examples**:
```bash
# Setup AlphaFold
gget setup alphafold
# Setup ELM with custom directory
gget setup elm -o /path/to/elm_data
```
```python
# Python
gget.setup("alphafold")
```
## Common Workflows
### Workflow 1: Gene Discovery to Sequence Analysis
Find and analyze genes of interest:
```python
# 1. Search for genes
results = gget.search(["GABA", "receptor"], species="homo_sapiens")
# 2. Get detailed information
gene_ids = results["ensembl_id"].tolist()
info = gget.info(gene_ids[:5])
# 3. Retrieve sequences
sequences = gget.seq(gene_ids[:5], translate=True)
```
### Workflow 2: Sequence Alignment and Structure
Align sequences and predict structures:
```python
# 1. Align multiple sequences
alignment = gget.muscle("sequences.fasta")
# 2. Find similar sequences
blast_results = gget.blast(my_sequence, database="swissprot", limit=10)
# 3. Predict structure
structure = gget.alphafold(my_sequence, plot=True)
# 4. Find linear motifs
ortholog_df, regex_df = gget.elm(my_sequence)
```
### Workflow 3: Gene Expression and Enrichment
Analyze expression patterns and functional enrichment:
```python
# 1. Get tissue expression
tissue_expr = gget.archs4("ACE2", which="tissue")
# 2. Find correlated genes
correlated = gget.archs4("ACE2", which="correlation")
# 3. Get single-cell data
adata = gget.cellxgene(gene=["ACE2"], tissue="lung", cell_type="epithelial cell")
# 4. Perform enrichment analysis
gene_list = correlated["gene_symbol"].tolist()[:50]
enrichment = gget.enrichr(gene_list, database="ontology", plot=True)
```
### Workflow 4: Disease and Drug Analysis
Investigate disease associations and therapeutic targets:
```python
# 1. Search for genes
genes = gget.search(["breast cancer"], species="homo_sapiens")
# 2. Get disease associations
diseases = gget.opentargets("ENSG00000169194", resource="diseases")
# 3. Get drug associations
drugs = gget.opentargets("ENSG00000169194", resource="drugs")
# 4. Query cancer genomics data
study_ids = gget.cbio_search(["breast"])
gget.cbio_plot(study_ids[:2], ["BRCA1", "BRCA2"], stratification="cancer_type")
# 5. Search COSMIC for mutations
cosmic_results = gget.cosmic("BRCA1", cosmic_tsv_path="cosmic.tsv")
```
### Workflow 5: Comparative Genomics
Compare proteins across species:
```python
# 1. Get orthologs
orthologs = gget.bgee("ENSG00000169194", type="orthologs")
# 2. Get sequences for comparison
human_seq = gget.seq("ENSG00000169194", translate=True)
mouse_seq = gget.seq("ENSMUSG00000026091", translate=True)
# 3. Align sequences
alignment = gget.muscle([human_seq, mouse_seq])
# 4. Compare structures
human_structure = gget.pdb("7S7U")
mouse_structure = gget.alphafold(mouse_seq)
```
### Workflow 6: Building Reference Indices
Prepare reference data for downstream analysis (e.g., kallisto|bustools):
```bash
# 1. List available species
gget ref --list_species
# 2. Download reference files
gget ref -w gtf -w cdna -d homo_sapiens
# 3. Build kallisto index
kallisto index -i transcriptome.idx transcriptome.fasta
# 4. Download genome for alignment
gget ref -w dna -d homo_sapiens
```
## Best Practices
### Data Retrieval
- Use `--limit` to control result sizes for large queries
- Save results with `-o/--out` for reproducibility
- Check database versions/releases for consistency across analyses
- Use `--quiet` in production scripts to reduce output
### Sequence Analysis
- For BLAST/BLAT, start with default parameters, then adjust sensitivity
- Use `gget diamond` with `--threads` for faster local alignment
- Save DIAMOND databases with `--diamond_db` for repeated queries (see the sketch below)
- For multiple sequence alignment, use `-s5/--super5` for large datasets
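A minimal sketch of the DIAMOND tips above (the Python keyword names `threads` and `diamond_db`, mirroring the CLI flags, are assumptions):
```python
import gget

# Run a local protein alignment once, keeping the DIAMOND database on disk
# so later queries against the same reference skip the indexing step.
hits = gget.diamond(
    "query_proteins.fasta",            # query sequences (or a list of strings)
    reference="reference_proteins.fasta",
    sensitivity="very-sensitive",
    threads=4,                         # assumed keyword: parallelize the alignment
    diamond_db="reference.dmnd",       # assumed keyword: reuse this database later
)
print(hits.head())
```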
### Expression and Disease Data
- Gene symbols are case-sensitive in cellxgene (e.g., 'PAX7' vs 'Pax7')
- Run `gget setup` before first use of alphafold, cellxgene, elm, gpt
- For enrichment analysis, use database shortcuts for convenience
- Cache cBioPortal data with `-dd` to avoid repeated downloads
### Structure Prediction
- AlphaFold multimer predictions: use `-mr 20` for higher accuracy
- Use `-r` flag for AMBER relaxation of final structures
- Visualize results in Python with `plot=True`
- Check the PDB for an existing experimental structure before running AlphaFold predictions (see the sketch below)
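The check-PDB-first pattern can be sketched as follows (the `relax` keyword mirrors the `-r` CLI flag and is an assumption; `gget setup alphafold` must have been run for the fallback):
```python
import gget
import pandas as pd

gene_id = "ENSG00000169194"  # example gene used elsewhere in this document

# Look for an experimental structure before spending compute on a prediction
info = gget.info([gene_id], pdb=True)
if "pdb_id" in info.columns and pd.notna(info["pdb_id"].iloc[0]):
    pdb_id = info["pdb_id"].iloc[0].split(";")[0].strip()
    gget.pdb(pdb_id, save=True)                     # download the PDB structure
else:
    sequence = gget.seq(gene_id, translate=True)    # amino acid FASTA
    gget.alphafold(sequence, relax=True, plot=True)
```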
### Error Handling
- Database structures change; update gget regularly: `pip install --upgrade gget`
- Pass at most ~1000 Ensembl IDs per `gget info` call to avoid server errors
- For large-scale analyses, batch queries and rate-limit API calls (see the sketch below)
- Use virtual environments to avoid dependency conflicts
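For large ID lists, the batching and rate-limiting advice above can look like this (chunk size and delay are illustrative):
```python
import time

import gget
import pandas as pd

def info_in_batches(ens_ids, batch_size=500, delay_s=1.0):
    """Query gget.info in chunks to stay under the ~1000-ID limit
    and avoid hammering the upstream APIs."""
    frames = []
    for start in range(0, len(ens_ids), batch_size):
        frames.append(gget.info(ens_ids[start:start + batch_size]))
        time.sleep(delay_s)  # simple rate limiting between requests
    return pd.concat(frames, ignore_index=True)
```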
## Output Formats
### Command-line
- Default: JSON
- CSV: Add `-csv` flag
- FASTA: gget seq, gget mutate
- PDB: gget pdb, gget alphafold
- PNG: gget cbio plot
### Python
- Default: DataFrame or dictionary
- JSON: Add `json=True` parameter
- Save to file: Add `save=True` or specify `out="filename"`
- AnnData: gget cellxgene
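For example, following the conventions above, the same query can come back as a DataFrame, as JSON, or be written to disk:
```python
import gget

# Default in Python: a pandas DataFrame
df = gget.search(["ACE2"], species="homo_sapiens")

# JSON output instead of a DataFrame
as_json = gget.search(["ACE2"], species="homo_sapiens", json=True)

# Write the results to a file as well
gget.search(["ACE2"], species="homo_sapiens", out="ace2_search.csv")
```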
## Resources
This skill includes reference documentation for detailed module information:
### references/
- `module_reference.md` - Comprehensive parameter reference for all modules
- `database_info.md` - Information about queried databases and their update frequencies
- `workflows.md` - Extended workflow examples and use cases
For additional help:
- Official documentation: https://pachterlab.github.io/gget/
- GitHub issues: https://github.com/pachterlab/gget/issues
- Citation: Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836

View File

@@ -0,0 +1,300 @@
# gget Database Information
Overview of databases queried by gget modules, including update frequencies and important considerations.
## Important Note
The databases queried by gget are continuously being updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary. Always keep gget updated:
```bash
pip install --upgrade gget
```
## Database Directory
### Genomic Reference Databases
#### Ensembl
- **Used by:** gget ref, gget search, gget info, gget seq
- **Description:** Comprehensive genome database with annotations for vertebrate and invertebrate species
- **Update frequency:** Regular releases (numbered); new releases approximately every 3 months
- **Access:** FTP downloads, REST API
- **Website:** https://www.ensembl.org/
- **Notes:**
- Supports both vertebrate and invertebrate genomes
- Can specify release number for reproducibility
- Shortcuts available for common species ('human', 'mouse')
#### UCSC Genome Browser
- **Used by:** gget blat
- **Description:** Genome browser database with BLAT alignment tool
- **Update frequency:** Regular updates with new assemblies
- **Access:** Web service API
- **Website:** https://genome.ucsc.edu/
- **Notes:**
- Multiple genome assemblies available (hg38, mm39, etc.)
- BLAT optimized for vertebrate genomes
### Protein & Structure Databases
#### UniProt
- **Used by:** gget info, gget seq (amino acid sequences), gget elm
- **Description:** Universal Protein Resource, comprehensive protein sequence and functional information
- **Update frequency:** Regular releases (weekly for Swiss-Prot, monthly for TrEMBL)
- **Access:** REST API
- **Website:** https://www.uniprot.org/
- **Notes:**
- Swiss-Prot: manually annotated and reviewed
- TrEMBL: automatically annotated
#### NCBI (National Center for Biotechnology Information)
- **Used by:** gget info, gget bgee (for non-Ensembl species)
- **Description:** Gene and protein databases with extensive cross-references
- **Update frequency:** Continuous updates
- **Access:** E-utilities API
- **Website:** https://www.ncbi.nlm.nih.gov/
- **Databases:** Gene, Protein, RefSeq
#### RCSB PDB (Protein Data Bank)
- **Used by:** gget pdb
- **Description:** Repository of 3D structural data for proteins and nucleic acids
- **Update frequency:** Weekly updates
- **Access:** REST API
- **Website:** https://www.rcsb.org/
- **Notes:**
- Experimentally determined structures (X-ray, NMR, cryo-EM)
- Includes metadata about experiments and publications
#### ELM (Eukaryotic Linear Motif)
- **Used by:** gget elm
- **Description:** Database of functional sites in eukaryotic proteins
- **Update frequency:** Periodic updates
- **Access:** Downloaded database (via gget setup elm)
- **Website:** http://elm.eu.org/
- **Notes:**
- Requires local download before first use
- Contains validated motifs and patterns
### Sequence Similarity Databases
#### BLAST Databases (NCBI)
- **Used by:** gget blast
- **Description:** Pre-formatted databases for BLAST searches
- **Update frequency:** Regular updates
- **Access:** NCBI BLAST API
- **Databases:**
- **Nucleotide:** nt (all GenBank), refseq_rna, pdbnt
- **Protein:** nr (non-redundant), swissprot, pdbaa, refseq_protein
- **Notes:**
- nt and nr are very large databases
- Consider specialized databases for faster, more focused searches
### Expression & Correlation Databases
#### ARCHS4
- **Used by:** gget archs4
- **Description:** Massive mining of publicly available RNA-seq data
- **Update frequency:** Periodic updates with new samples
- **Access:** HTTP API
- **Website:** https://maayanlab.cloud/archs4/
- **Data:**
- Human and mouse RNA-seq data
- Correlation matrices
- Tissue expression atlases
- **Citation:** Lachmann et al., Nature Communications, 2018
#### CZ CELLxGENE Discover
- **Used by:** gget cellxgene
- **Description:** Single-cell RNA-seq data from multiple studies
- **Update frequency:** Continuous additions of new datasets
- **Access:** Census API (via cellxgene-census package)
- **Website:** https://cellxgene.cziscience.com/
- **Data:**
- Single-cell RNA-seq count matrices
- Cell type annotations
- Tissue and disease metadata
- **Notes:**
- Requires gget setup cellxgene
- Gene symbols are case-sensitive
- May not support latest Python versions
#### Bgee
- **Used by:** gget bgee
- **Description:** Gene expression and orthology database
- **Update frequency:** Regular releases
- **Access:** REST API
- **Website:** https://www.bgee.org/
- **Data:**
- Gene expression across tissues and developmental stages
- Orthology relationships across species
- **Citation:** Bastian et al., 2021
### Functional & Pathway Databases
#### Enrichr / modEnrichr
- **Used by:** gget enrichr
- **Description:** Gene set enrichment analysis web service
- **Update frequency:** Regular updates to underlying databases
- **Access:** REST API
- **Website:** https://maayanlab.cloud/Enrichr/
- **Databases included:**
- KEGG pathways
- Gene Ontology (GO)
- Transcription factor targets (ChEA)
- Disease associations (GWAS Catalog)
- Cell type markers (PanglaoDB)
- **Notes:**
- Supports multiple model organisms
- Background gene lists can be provided for custom enrichment
### Disease & Drug Databases
#### Open Targets
- **Used by:** gget opentargets
- **Description:** Integrative platform for disease-target associations
- **Update frequency:** Regular releases (quarterly)
- **Access:** GraphQL API
- **Website:** https://www.opentargets.org/
- **Data:**
- Disease associations
- Drug information and clinical trials
- Target tractability
- Pharmacogenetics
- Gene expression
- DepMap gene-disease effects
- Protein-protein interactions
#### cBioPortal
- **Used by:** gget cbio
- **Description:** Cancer genomics data portal
- **Update frequency:** Continuous addition of new studies
- **Access:** Web API, downloadable datasets
- **Website:** https://www.cbioportal.org/
- **Data:**
- Mutations, copy number alterations, structural variants
- Gene expression
- Clinical data
- **Notes:**
- Large datasets; caching recommended
- Multiple cancer types and studies available
#### COSMIC (Catalogue Of Somatic Mutations In Cancer)
- **Used by:** gget cosmic
- **Description:** Comprehensive cancer mutation database
- **Update frequency:** Regular releases
- **Access:** Download (requires account and license for commercial use)
- **Website:** https://cancer.sanger.ac.uk/cosmic
- **Data:**
- Somatic mutations in cancer
- Gene census
- Cell line data
- Drug resistance mutations
- **Important:**
- Free for academic use
- License fees apply for commercial use
- Requires COSMIC account credentials
- Must download database before querying
### AI & Prediction Services
#### AlphaFold2 (DeepMind)
- **Used by:** gget alphafold
- **Description:** Deep learning model for protein structure prediction
- **Model version:** Simplified version for local execution
- **Access:** Local computation (requires model download via gget setup)
- **Website:** https://alphafold.ebi.ac.uk/
- **Notes:**
- Requires ~4GB model parameters download
- Requires OpenMM installation
- Computationally intensive
- Python version-specific requirements
#### OpenAI API
- **Used by:** gget gpt
- **Description:** Large language model API
- **Update frequency:** New models released periodically
- **Access:** REST API (requires API key)
- **Website:** https://openai.com/
- **Notes:**
- Default model: gpt-3.5-turbo
- Free tier limited to 3 months after account creation
- Set billing limits to control costs
## Data Consistency & Reproducibility
### Version Control
To ensure reproducibility in analyses:
1. **Specify database versions/releases:**
```python
# Use specific Ensembl release
gget.ref("homo_sapiens", release=110)
# Use specific Census version
gget.cellxgene(gene=["PAX7"], census_version="2023-07-25")
```
2. **Document gget version:**
```python
import gget
print(gget.__version__)
```
3. **Save raw data:**
```python
# Always save results for reproducibility
results = gget.search(["ACE2"], species="homo_sapiens")
results.to_csv("search_results_2025-01-15.csv", index=False)
```
### Handling Database Updates
1. **Regular gget updates:**
- Update gget biweekly to match database structure changes
- Check release notes for breaking changes
2. **Error handling:**
- Database structure changes may cause temporary failures
- Check GitHub issues: https://github.com/pachterlab/gget/issues
- Update gget if errors occur
3. **API rate limiting:**
- Implement delays for large-scale queries
- Use local databases (DIAMOND, COSMIC) when possible
- Cache results to avoid repeated queries (see the sketch below)
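A minimal on-disk caching pattern for the last point (the cache file name is illustrative):
```python
import os

import gget
import pandas as pd

def cached_search(terms, species, cache_path):
    """Return cached results if present; otherwise query once and cache."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    results = gget.search(terms, species=species)
    results.to_csv(cache_path, index=False)
    return results

ace2_hits = cached_search(["ACE2"], "homo_sapiens", "ace2_search_cache.csv")
```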
## Database-Specific Best Practices
### Ensembl
- Use species shortcuts ('human', 'mouse') for convenience
- Specify release numbers for reproducibility
- Check available species with `gget ref --list_species`
### UniProt
- UniProt IDs are more stable than gene names
- Swiss-Prot annotations are manually curated and more reliable
- Use PDB flag in gget info only when needed (increases runtime)
### BLAST/BLAT
- Start with default parameters, then optimize
- Use specialized databases (swissprot, refseq_protein) for focused searches
- Consider E-value cutoffs based on query length
### Expression Databases
- Gene symbols are case-sensitive in CELLxGENE
- ARCHS4 correlation data is based on co-expression patterns
- Consider tissue-specificity when interpreting results
### Cancer Databases
- cBioPortal: cache data locally for repeated analyses (see the sketch below)
- COSMIC: download appropriate database subset for your needs
- Respect license agreements for commercial use
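As one illustration of the cBioPortal caching advice above, the data and figure directories can be pinned so repeated runs reuse earlier downloads (the `data_dir` and `figure_dir` keywords mirror the `-dd`/`-fd` CLI flags and are assumptions for the Python API):
```python
import gget

studies = gget.cbio_search(["breast"])

gget.cbio_plot(
    studies[:2],
    ["BRCA1", "BRCA2"],
    stratification="cancer_type",
    variation_type="mutation_occurrences",
    data_dir="./gget_cbio_cache",      # assumed keyword: downloads reused on re-runs
    figure_dir="./gget_cbio_figures",  # assumed keyword: heatmaps written here
)
```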
## Citations
When using gget, cite both the gget publication and the underlying databases:
**gget:**
Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836
**Database-specific citations:** Check references/ directory or database websites for appropriate citations.

View File

@@ -0,0 +1,467 @@
# gget Module Reference
Comprehensive parameter reference for all gget modules.
## Reference & Gene Information Modules
### gget ref
Retrieve Ensembl reference genome FTPs and metadata.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `species` | str | Species in Genus_species format or shortcuts ('human', 'mouse') | Required |
| `-w/--which` | str | File types to return: gtf, cdna, dna, cds, cdrna, pep | All |
| `-r/--release` | int | Ensembl release number | Latest |
| `-od/--out_dir` | str | Output directory path | None |
| `-o/--out` | str | JSON file path for results | None |
| `-l/--list_species` | flag | List available vertebrate species | False |
| `-liv/--list_iv_species` | flag | List available invertebrate species | False |
| `-ftp` | flag | Return only FTP links | False |
| `-d/--download` | flag | Download files (requires curl) | False |
| `-q/--quiet` | flag | Suppress progress information | False |
**Returns:** JSON containing FTP links, Ensembl release numbers, release dates, file sizes
---
### gget search
Search for genes by name or description in Ensembl.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `searchwords` | str/list | Search terms (case-insensitive) | Required |
| `-s/--species` | str | Target species or core database name | Required |
| `-r/--release` | int | Ensembl release number | Latest |
| `-t/--id_type` | str | Return 'gene' or 'transcript' | 'gene' |
| `-ao/--andor` | str | 'or' (ANY term) or 'and' (ALL terms) | 'or' |
| `-l/--limit` | int | Maximum results to return | None |
| `-o/--out` | str | Output file path (CSV/JSON) | None |
**Returns:** ensembl_id, gene_name, ensembl_description, ext_ref_description, biotype, URL
---
### gget info
Get comprehensive gene/transcript metadata from Ensembl, UniProt, and NCBI.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `ens_ids` | str/list | Ensembl IDs (WormBase, Flybase also supported) | Required |
| `-o/--out` | str | Output file path (CSV/JSON) | None |
| `-n/--ncbi` | bool | Disable NCBI data retrieval | False |
| `-u/--uniprot` | bool | Disable UniProt data retrieval | False |
| `-pdb` | bool | Include PDB identifiers | False |
| `-csv` | flag | Return CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress display | False |
**Python-specific:**
- `save=True`: Save output to current directory
- `wrap_text=True`: Format dataframe with wrapped text
**Note:** Processing >1000 IDs simultaneously may cause server errors.
**Returns:** UniProt ID, NCBI gene ID, gene name, synonyms, protein names, descriptions, biotype, canonical transcript
---
### gget seq
Retrieve nucleotide or amino acid sequences in FASTA format.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `ens_ids` | str/list | Ensembl identifiers | Required |
| `-o/--out` | str | Output file path | stdout |
| `-t/--translate` | flag | Fetch amino acid sequences | False |
| `-iso/--isoforms` | flag | Return all transcript variants | False |
| `-q/--quiet` | flag | Suppress progress information | False |
**Data sources:** Ensembl (nucleotide), UniProt (amino acid)
**Returns:** FASTA format sequences
---
## Sequence Analysis & Alignment Modules
### gget blast
BLAST sequences against standard databases.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `sequence` | str | Sequence or path to FASTA/.txt | Required |
| `-p/--program` | str | blastn, blastp, blastx, tblastn, tblastx | Auto-detect |
| `-db/--database` | str | nt, refseq_rna, pdbnt, nr, swissprot, pdbaa, refseq_protein | nt or nr |
| `-l/--limit` | int | Max hits returned | 50 |
| `-e/--expect` | float | E-value cutoff | 10.0 |
| `-lcf/--low_comp_filt` | flag | Enable low complexity filtering | False |
| `-mbo/--megablast_off` | flag | Disable MegaBLAST (blastn only) | False |
| `-o/--out` | str | Output file path | None |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:** Description, Scientific Name, Common Name, Taxid, Max Score, Total Score, Query Coverage
---
### gget blat
Find genomic positions using UCSC BLAT.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `sequence` | str | Sequence or path to FASTA/.txt | Required |
| `-st/--seqtype` | str | 'DNA', 'protein', 'translated%20RNA', 'translated%20DNA' | Auto-detect |
| `-a/--assembly` | str | Target assembly (hg38, mm39, taeGut2, etc.) | 'human'/hg38 |
| `-o/--out` | str | Output file path | None |
| `-csv` | flag | Return CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:** genome, query size, alignment start/end, matches, mismatches, alignment percentage
---
### gget muscle
Align multiple sequences using Muscle5.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `fasta` | str/list | Sequences or FASTA file path | Required |
| `-o/--out` | str | Output file path | stdout |
| `-s5/--super5` | flag | Use Super5 algorithm (faster, large datasets) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:** ClustalW format alignment or aligned FASTA (.afa)
---
### gget diamond
Fast local protein/translated DNA alignment.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `query` | str/list | Query sequences or FASTA file | Required |
| `--reference` | str/list | Reference sequences or FASTA file | Required |
| `--sensitivity` | str | fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, ultra-sensitive | very-sensitive |
| `--threads` | int | CPU threads | 1 |
| `--diamond_binary` | str | Path to DIAMOND installation | Auto-detect |
| `--diamond_db` | str | Save database for reuse | None |
| `--translated` | flag | Enable nucleotide-to-amino acid alignment | False |
| `-o/--out` | str | Output file path | None |
| `-csv` | flag | CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:** Identity %, sequence lengths, match positions, gap openings, E-values, bit scores
---
## Structural & Protein Analysis Modules
### gget pdb
Query RCSB Protein Data Bank.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `pdb_id` | str | PDB identifier (e.g., '7S7U') | Required |
| `-r/--resource` | str | pdb, entry, pubmed, assembly, entity types | 'pdb' |
| `-i/--identifier` | str | Assembly, entity, or chain ID | None |
| `-o/--out` | str | Output file path | stdout |
**Returns:** PDB format (structures) or JSON (metadata)
---
### gget alphafold
Predict 3D protein structures using AlphaFold2.
**Setup:** Requires OpenMM and `gget setup alphafold` (~4GB download)
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `sequence` | str/list | Amino acid sequence(s) or FASTA file | Required |
| `-mr/--multimer_recycles` | int | Recycling iterations for multimers | 3 |
| `-o/--out` | str | Output folder path | timestamped |
| `-mfm/--multimer_for_monomer` | flag | Apply multimer model to monomers | False |
| `-r/--relax` | flag | AMBER relaxation for top model | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Python-only:**
- `plot` (bool): Generate 3D visualization (default: True)
- `show_sidechains` (bool): Include side chains (default: True)
**Note:** Multiple sequences automatically trigger multimer modeling
**Returns:** PDB structure file, JSON alignment error data, optional 3D plot
---
### gget elm
Predict Eukaryotic Linear Motifs.
**Setup:** Requires `gget setup elm`
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `sequence` | str | Amino acid sequence or UniProt Acc | Required |
| `-s/--sensitivity` | str | DIAMOND alignment sensitivity | very-sensitive |
| `-t/--threads` | int | Number of threads | 1 |
| `-bin/--diamond_binary` | str | Path to DIAMOND binary | Auto-detect |
| `-o/--out` | str | Output directory path | None |
| `-u/--uniprot` | flag | Input is UniProt Acc | False |
| `-e/--expand` | flag | Include protein names, organisms, references | False |
| `-csv` | flag | CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:** Two outputs:
1. **ortholog_df**: Motifs from orthologous proteins
2. **regex_df**: Motifs matched in input sequence
---
## Expression & Disease Data Modules
### gget archs4
Query ARCHS4 for gene correlation or tissue expression.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `gene` | str | Gene symbol or Ensembl ID | Required |
| `-w/--which` | str | 'correlation' or 'tissue' | 'correlation' |
| `-s/--species` | str | 'human' or 'mouse' (tissue only) | 'human' |
| `-o/--out` | str | Output file path | None |
| `-e/--ensembl` | flag | Input is Ensembl ID | False |
| `-csv` | flag | CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:**
- **correlation**: Gene symbols, Pearson correlation coefficients (top 100)
- **tissue**: Tissue IDs, min/Q1/median/Q3/max expression
---
### gget cellxgene
Query CZ CELLxGENE Discover Census for single-cell data.
**Setup:** Requires `gget setup cellxgene`
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `--gene` (-g) | list | Gene names or Ensembl IDs (case-sensitive!) | Required |
| `--tissue` | list | Tissue type(s) | None |
| `--cell_type` | list | Cell type(s) | None |
| `--species` (-s) | str | 'homo_sapiens' or 'mus_musculus' | 'homo_sapiens' |
| `--census_version` (-cv) | str | "stable", "latest", or dated version | "stable" |
| `-o/--out` | str | Output file path (required for CLI) | Required |
| `--ensembl` (-e) | flag | Use Ensembl IDs | False |
| `--meta_only` (-mo) | flag | Return metadata only | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Additional filters:** disease, development_stage, sex, assay, dataset_id, donor_id, ethnicity, suspension_type
**Important:** Gene symbols are case-sensitive ('PAX7' for human, 'Pax7' for mouse)
**Returns:** AnnData object with count matrices and metadata
---
### gget enrichr
Perform enrichment analysis using Enrichr/modEnrichr.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `genes` | list | Gene symbols or Ensembl IDs | Required |
| `-db/--database` | str | Reference database or shortcut | Required |
| `-s/--species` | str | human, mouse, fly, yeast, worm, fish | 'human' |
| `-bkg_l/--background_list` | list | Background genes | None |
| `-o/--out` | str | Output file path | None |
| `-ko/--kegg_out` | str | KEGG pathway images directory | None |
**Python-only:**
- `plot` (bool): Generate graphical results
**Database shortcuts:**
- 'pathway' → KEGG_2021_Human
- 'transcription' → ChEA_2016
- 'ontology' → GO_Biological_Process_2021
- 'diseases_drugs' → GWAS_Catalog_2019
- 'celltypes' → PanglaoDB_Augmented_2021
**Returns:** Pathway/function associations with adjusted p-values, overlapping gene counts
---
### gget bgee
Retrieve orthology and expression from Bgee.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `ens_id` | str/list | Ensembl or NCBI gene ID | Required |
| `-t/--type` | str | 'orthologs' or 'expression' | 'orthologs' |
| `-o/--out` | str | Output file path | None |
| `-csv` | flag | CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Note:** Multiple IDs supported when `type='expression'`
**Returns:**
- **orthologs**: Genes across species with IDs, names, taxonomic info
- **expression**: Anatomical entities, confidence scores, expression status
---
### gget opentargets
Retrieve disease/drug associations from OpenTargets.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `ens_id` | str | Ensembl gene ID | Required |
| `-r/--resource` | str | diseases, drugs, tractability, pharmacogenetics, expression, depmap, interactions | 'diseases' |
| `-l/--limit` | int | Maximum results | None |
| `-o/--out` | str | Output file path | None |
| `-csv` | flag | CSV format (CLI) | False |
| `-q/--quiet` | flag | Suppress progress | False |
**Resource-specific filters:**
- drugs: `--filter_disease`
- pharmacogenetics: `--filter_drug`
- expression/depmap: `--filter_tissue`, `--filter_anat_sys`, `--filter_organ`
- interactions: `--filter_protein_a`, `--filter_protein_b`, `--filter_gene_b`
**Returns:** Disease/drug associations, tractability, pharmacogenetics, expression, DepMap, interactions
---
### gget cbio
Plot cancer genomics heatmaps from cBioPortal.
**Subcommands:** search, plot
**search parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `keywords` | list | Search terms | Required |
**plot parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `-s/--study_ids` | list | cBioPortal study IDs | Required |
| `-g/--genes` | list | Gene names or Ensembl IDs | Required |
| `-st/--stratification` | str | tissue, cancer_type, cancer_type_detailed, study_id, sample | None |
| `-vt/--variation_type` | str | mutation_occurrences, cna_nonbinary, sv_occurrences, cna_occurrences, Consequence | None |
| `-f/--filter` | str | Filter by column value (e.g., 'study_id:msk_impact_2017') | None |
| `-dd/--data_dir` | str | Cache directory | ./gget_cbio_cache |
| `-fd/--figure_dir` | str | Output directory | ./gget_cbio_figures |
| `-t/--title` | str | Custom figure title | None |
| `-dpi` | int | Resolution | 100 |
| `-q/--quiet` | flag | Suppress progress | False |
| `-nc/--no_confirm` | flag | Skip download confirmations | False |
| `-sh/--show` | flag | Display plot in window | False |
**Returns:** PNG heatmap figure
---
### gget cosmic
Search COSMIC database for cancer mutations.
**Important:** License fees for commercial use. Requires COSMIC account.
**Query parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `searchterm` | str | Gene name, Ensembl ID, mutation, sample ID | Required |
| `-ctp/--cosmic_tsv_path` | str | Path to COSMIC TSV file | Required |
| `-l/--limit` | int | Maximum results | 100 |
| `-csv` | flag | CSV format (CLI) | False |
**Download parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `-d/--download_cosmic` | flag | Activate download mode | False |
| `-gm/--gget_mutate` | flag | Create version for gget mutate | False |
| `-cp/--cosmic_project` | str | cancer, census, cell_line, resistance, genome_screen, targeted_screen | None |
| `-cv/--cosmic_version` | str | COSMIC version | Latest |
| `-gv/--grch_version` | int | Human reference genome (37 or 38) | None |
| `--email` | str | COSMIC account email | Required |
| `--password` | str | COSMIC account password | Required |
**Note:** First-time users must download database
**Returns:** Mutation data from COSMIC
---
## Additional Tools
### gget mutate
Generate mutated nucleotide sequences.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `sequences` | str/list | FASTA file or sequences | Required |
| `-m/--mutations` | str/df | CSV/TSV file or DataFrame | Required |
| `-mc/--mut_column` | str | Mutation column name | 'mutation' |
| `-sic/--seq_id_column` | str | Sequence ID column | 'seq_ID' |
| `-mic/--mut_id_column` | str | Mutation ID column | None |
| `-k/--k` | int | Length of flanking sequences | 30 |
| `-o/--out` | str | Output FASTA file path | stdout |
| `-q/--quiet` | flag | Suppress progress | False |
**Returns:** Mutated sequences in FASTA format
---
### gget gpt
Generate text using OpenAI's API.
**Setup:** Requires `gget setup gpt` and OpenAI API key
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `prompt` | str | Text input for generation | Required |
| `api_key` | str | OpenAI API key | Required |
| `model` | str | OpenAI model name | gpt-3.5-turbo |
| `temperature` | float | Sampling temperature (0-2) | 1.0 |
| `top_p` | float | Nucleus sampling | 1.0 |
| `max_tokens` | int | Maximum tokens to generate | None |
| `frequency_penalty` | float | Frequency penalty (0-2) | 0 |
| `presence_penalty` | float | Presence penalty (0-2) | 0 |
**Important:** Free tier limited to 3 months. Set billing limits.
**Returns:** Generated text string
---
### gget setup
Install/download dependencies for modules.
**Parameters:**
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `module` | str | Module name | Required |
| `-o/--out` | str | Output folder (elm only) | Package install folder |
| `-q/--quiet` | flag | Suppress progress | False |
**Modules requiring setup:**
- `alphafold` - Downloads ~4GB model parameters
- `cellxgene` - Installs cellxgene-census
- `elm` - Downloads local ELM database
- `gpt` - Configures OpenAI integration
**Returns:** None (installs dependencies)

View File

@@ -0,0 +1,814 @@
# gget Workflow Examples
Extended workflow examples demonstrating how to combine multiple gget modules for common bioinformatics tasks.
## Table of Contents
1. [Complete Gene Analysis Pipeline](#complete-gene-analysis-pipeline)
2. [Comparative Structural Biology](#comparative-structural-biology)
3. [Cancer Genomics Analysis](#cancer-genomics-analysis)
4. [Single-Cell Expression Analysis](#single-cell-expression-analysis)
5. [Building Reference Transcriptomes](#building-reference-transcriptomes)
6. [Mutation Impact Assessment](#mutation-impact-assessment)
7. [Drug Target Discovery](#drug-target-discovery)
---
## Complete Gene Analysis Pipeline
Comprehensive analysis of a gene from discovery to functional annotation.
```python
import gget
import pandas as pd
# Step 1: Search for genes of interest
print("Step 1: Searching for GABA receptor genes...")
search_results = gget.search(["GABA", "receptor", "alpha"],
species="homo_sapiens",
andor="and")
print(f"Found {len(search_results)} genes")
# Step 2: Get detailed information
print("\nStep 2: Getting detailed information...")
gene_ids = search_results["ensembl_id"].tolist()[:5] # Top 5 genes
gene_info = gget.info(gene_ids, pdb=True)
print(gene_info[["ensembl_id", "gene_name", "uniprot_id", "description"]])
# Step 3: Retrieve sequences
print("\nStep 3: Retrieving sequences...")
nucleotide_seqs = gget.seq(gene_ids)
protein_seqs = gget.seq(gene_ids, translate=True)
# Save sequences
with open("gaba_receptors_nt.fasta", "w") as f:
f.write(nucleotide_seqs)
with open("gaba_receptors_aa.fasta", "w") as f:
f.write(protein_seqs)
# Step 4: Get expression data
print("\nStep 4: Getting tissue expression...")
for gene_id, gene_name in zip(gene_ids, gene_info["gene_name"]):
expr_data = gget.archs4(gene_name, which="tissue")
print(f"\n{gene_name} expression:")
print(expr_data.head())
# Step 5: Find correlated genes
print("\nStep 5: Finding correlated genes...")
correlated = gget.archs4(gene_info["gene_name"].iloc[0], which="correlation")
correlated_top = correlated.head(20)
print(correlated_top)
# Step 6: Enrichment analysis on correlated genes
print("\nStep 6: Performing enrichment analysis...")
gene_list = correlated_top["gene_symbol"].tolist()
enrichment = gget.enrichr(gene_list, database="ontology", plot=True)
print(enrichment.head(10))
# Step 7: Get disease associations
print("\nStep 7: Getting disease associations...")
for gene_id, gene_name in zip(gene_ids[:3], gene_info["gene_name"][:3]):
diseases = gget.opentargets(gene_id, resource="diseases", limit=5)
print(f"\n{gene_name} disease associations:")
print(diseases)
# Step 8: Check for orthologs
print("\nStep 8: Finding orthologs...")
orthologs = gget.bgee(gene_ids[0], type="orthologs")
print(orthologs)
print("\nComplete gene analysis pipeline finished!")
```
---
## Comparative Structural Biology
Compare protein structures across species and analyze functional motifs.
```python
import gget
# Define genes for comparison
human_gene = "ENSG00000169174" # PCSK9
mouse_gene = "ENSMUSG00000044254" # Pcsk9
print("Comparative Structural Biology Workflow")
print("=" * 50)
# Step 1: Get gene information
print("\n1. Getting gene information...")
human_info = gget.info([human_gene])
mouse_info = gget.info([mouse_gene])
print(f"Human: {human_info['gene_name'].iloc[0]}")
print(f"Mouse: {mouse_info['gene_name'].iloc[0]}")
# Step 2: Retrieve protein sequences
print("\n2. Retrieving protein sequences...")
human_seq = gget.seq(human_gene, translate=True)
mouse_seq = gget.seq(mouse_gene, translate=True)
# Save to file for alignment
with open("pcsk9_sequences.fasta", "w") as f:
f.write(human_seq)
f.write("\n")
f.write(mouse_seq)
# Step 3: Align sequences
print("\n3. Aligning sequences...")
alignment = gget.muscle("pcsk9_sequences.fasta")
print("Alignment completed. Visualizing in ClustalW format:")
print(alignment)
# Step 4: Get existing structures from PDB
print("\n4. Searching PDB for existing structures...")
# Search by sequence using BLAST
pdb_results = gget.blast(human_seq, database="pdbaa", limit=5)
print("Top PDB matches:")
print(pdb_results[["Description", "Max Score", "Query Coverage"]])
# Download top structure
if len(pdb_results) > 0:
# Extract PDB ID from description (usually format: "PDB|XXXX|...")
pdb_id = pdb_results.iloc[0]["Description"].split("|")[1]
print(f"\nDownloading PDB structure: {pdb_id}")
gget.pdb(pdb_id, save=True)
# Step 5: Predict AlphaFold structures
print("\n5. Predicting structures with AlphaFold...")
# Note: This requires gget setup alphafold and is computationally intensive
# Uncomment to run:
# human_structure = gget.alphafold(human_seq, plot=True)
# mouse_structure = gget.alphafold(mouse_seq, plot=True)
print("(AlphaFold prediction skipped - uncomment to run)")
# Step 6: Identify functional motifs
print("\n6. Identifying functional motifs with ELM...")
# Note: Requires gget setup elm
# Uncomment to run:
# human_ortholog_df, human_regex_df = gget.elm(human_seq)
# print("Human PCSK9 functional motifs:")
# print(human_regex_df)
print("(ELM analysis skipped - uncomment to run)")
# Step 7: Get orthology information
print("\n7. Getting orthology information from Bgee...")
orthologs = gget.bgee(human_gene, type="orthologs")
print("PCSK9 orthologs:")
print(orthologs)
print("\nComparative structural biology workflow completed!")
```
---
## Cancer Genomics Analysis
Analyze cancer-associated genes and their mutations.
```python
import gget
import pandas as pd
import matplotlib.pyplot as plt
print("Cancer Genomics Analysis Workflow")
print("=" * 50)
# Step 1: Search for cancer-related genes
print("\n1. Searching for breast cancer genes...")
genes = gget.search(["breast", "cancer", "BRCA"],
species="homo_sapiens",
andor="or",
limit=20)
print(f"Found {len(genes)} genes")
# Focus on specific genes
target_genes = ["BRCA1", "BRCA2", "TP53", "PIK3CA", "ESR1"]
print(f"\nAnalyzing: {', '.join(target_genes)}")
# Step 2: Get gene information
print("\n2. Getting gene information...")
gene_search = []
for gene in target_genes:
result = gget.search([gene], species="homo_sapiens", limit=1)
if len(result) > 0:
gene_search.append(result.iloc[0])
gene_df = pd.DataFrame(gene_search)
gene_ids = gene_df["ensembl_id"].tolist()
# Step 3: Get disease associations
print("\n3. Getting disease associations from OpenTargets...")
for gene_id, gene_name in zip(gene_ids, target_genes):
print(f"\n{gene_name} disease associations:")
diseases = gget.opentargets(gene_id, resource="diseases", limit=3)
print(diseases[["disease_name", "overall_score"]])
# Step 4: Get drug associations
print("\n4. Getting drug associations...")
for gene_id, gene_name in zip(gene_ids[:3], target_genes[:3]):
print(f"\n{gene_name} drug associations:")
drugs = gget.opentargets(gene_id, resource="drugs", limit=3)
if len(drugs) > 0:
print(drugs[["drug_name", "drug_type", "max_phase_for_all_diseases"]])
# Step 5: Search cBioPortal for studies
print("\n5. Searching cBioPortal for breast cancer studies...")
studies = gget.cbio_search(["breast", "cancer"])
print(f"Found {len(studies)} studies")
print(studies[:5])
# Step 6: Create cancer genomics heatmap
print("\n6. Creating cancer genomics heatmap...")
if len(studies) > 0:
# Select relevant studies
selected_studies = studies[:2] # Top 2 studies
gget.cbio_plot(
selected_studies,
target_genes,
stratification="cancer_type",
variation_type="mutation_occurrences",
show=False
)
print("Heatmap saved to ./gget_cbio_figures/")
# Step 7: Query COSMIC database (requires setup)
print("\n7. Querying COSMIC database...")
# Note: Requires COSMIC account and database download
# Uncomment to run:
# for gene in target_genes[:2]:
# cosmic_results = gget.cosmic(
# gene,
# cosmic_tsv_path="cosmic_cancer.tsv",
# limit=10
# )
# print(f"\n{gene} mutations in COSMIC:")
# print(cosmic_results)
print("(COSMIC query skipped - requires database download)")
# Step 8: Enrichment analysis
print("\n8. Performing pathway enrichment...")
enrichment = gget.enrichr(target_genes, database="pathway", plot=True)
print("\nTop enriched pathways:")
print(enrichment.head(10))
print("\nCancer genomics analysis completed!")
```
---
## Single-Cell Expression Analysis
Analyze single-cell RNA-seq data for specific cell types and tissues.
```python
import gget
import numpy as np
import scanpy as sc
print("Single-Cell Expression Analysis Workflow")
print("=" * 50)
# Note: Requires gget setup cellxgene
# Step 1: Define genes and cell types of interest
genes_of_interest = ["ACE2", "TMPRSS2", "CD4", "CD8A"]
tissue = "lung"
cell_types = ["type ii pneumocyte", "macrophage", "t cell"]
print(f"\nAnalyzing genes: {', '.join(genes_of_interest)}")
print(f"Tissue: {tissue}")
print(f"Cell types: {', '.join(cell_types)}")
# Step 2: Get metadata first
print("\n1. Retrieving metadata...")
metadata = gget.cellxgene(
gene=genes_of_interest,
tissue=tissue,
species="homo_sapiens",
meta_only=True
)
print(f"Found {len(metadata)} datasets")
print(metadata.head())
# Step 3: Download count matrices
print("\n2. Downloading single-cell data...")
# Note: This can be a large download
adata = gget.cellxgene(
    gene=genes_of_interest,
    tissue=tissue,
    cell_type=cell_types,
    species="homo_sapiens",
    census_version="stable"
)
print(f"AnnData shape: {adata.shape}")
print(f"Genes: {adata.n_vars}")
print(f"Cells: {adata.n_obs}")
# Step 4: Basic QC and filtering with scanpy
print("\n3. Performing quality control...")
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
print(f"After QC - Cells: {adata.n_obs}, Genes: {adata.n_vars}")
# Step 5: Normalize and log-transform
print("\n4. Normalizing data...")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Step 6: Calculate gene expression statistics
print("\n5. Calculating expression statistics...")
for gene in genes_of_interest:
if gene in adata.var_names:
expr = adata[:, gene].X.toarray().flatten()
print(f"\n{gene} expression:")
print(f" Mean: {expr.mean():.3f}")
print(f" Median: {np.median(expr):.3f}")
print(f" % expressing: {(expr > 0).sum() / len(expr) * 100:.1f}%")
# Step 7: Get tissue expression from ARCHS4 for comparison
print("\n6. Getting bulk tissue expression from ARCHS4...")
for gene in genes_of_interest:
tissue_expr = gget.archs4(gene, which="tissue")
lung_expr = tissue_expr[tissue_expr["tissue"].str.contains("lung", case=False, na=False)]
if len(lung_expr) > 0:
print(f"\n{gene} in lung (ARCHS4):")
print(f" Median: {lung_expr['median'].iloc[0]:.3f}")
# Step 8: Enrichment analysis
print("\n7. Performing enrichment analysis...")
enrichment = gget.enrichr(genes_of_interest, database="celltypes", plot=True)
print("\nTop cell type associations:")
print(enrichment.head(10))
# Step 9: Get disease associations
print("\n8. Getting disease associations...")
for gene in genes_of_interest:
gene_search = gget.search([gene], species="homo_sapiens", limit=1)
if len(gene_search) > 0:
gene_id = gene_search["ensembl_id"].iloc[0]
diseases = gget.opentargets(gene_id, resource="diseases", limit=3)
print(f"\n{gene} disease associations:")
print(diseases[["disease_name", "overall_score"]])
print("\nSingle-cell expression analysis completed!")
```
---
## Building Reference Transcriptomes
Prepare reference data for RNA-seq analysis pipelines.
```bash
#!/bin/bash
# Reference transcriptome building workflow
echo "Reference Transcriptome Building Workflow"
echo "=========================================="
# Step 1: List available species
echo -e "\n1. Listing available species..."
gget ref --list_species > available_species.txt
echo "Available species saved to available_species.txt"
# Step 2: Download reference files for human
echo -e "\n2. Downloading human reference files..."
SPECIES="homo_sapiens"
RELEASE=110 # Specify release for reproducibility
# Download GTF annotation
echo "Downloading GTF annotation..."
gget ref -w gtf -r $RELEASE -d $SPECIES -o human_ref_gtf.json
# Download cDNA sequences
echo "Downloading cDNA sequences..."
gget ref -w cdna -r $RELEASE -d $SPECIES -o human_ref_cdna.json
# Download protein sequences
echo "Downloading protein sequences..."
gget ref -w pep -r $RELEASE -d $SPECIES -o human_ref_pep.json
# Step 3: Build kallisto index (if kallisto is installed)
echo -e "\n3. Building kallisto index..."
if command -v kallisto &> /dev/null; then
# Get cDNA FASTA file from download
CDNA_FILE=$(ls *.cdna.all.fa.gz)
if [ -f "$CDNA_FILE" ]; then
kallisto index -i transcriptome.idx $CDNA_FILE
echo "Kallisto index created: transcriptome.idx"
else
echo "cDNA FASTA file not found"
fi
else
echo "kallisto not installed, skipping index building"
fi
# Step 4: Download genome for alignment-based methods
echo -e "\n4. Downloading genome sequence..."
gget ref -w dna -r $RELEASE -d $SPECIES -o human_ref_dna.json
# Step 5: Get gene information for genes of interest
echo -e "\n5. Getting information for specific genes..."
gget search -s $SPECIES TP53 BRCA1 BRCA2 -o key_genes.csv
echo -e "\nReference transcriptome building completed!"
```
```python
# Python version
import gget
import json
print("Reference Transcriptome Building Workflow")
print("=" * 50)
# Configuration
species = "homo_sapiens"
release = 110
genes_of_interest = ["TP53", "BRCA1", "BRCA2", "MYC", "EGFR"]
# Step 1: Get reference information
print("\n1. Getting reference information...")
ref_info = gget.ref(species, release=release)
# Save reference information
with open("reference_info.json", "w") as f:
json.dump(ref_info, f, indent=2)
print("Reference information saved to reference_info.json")
# Step 2: Download specific files
print("\n2. Downloading reference files...")
# GTF annotation
gget.ref(species, which="gtf", release=release, download=True)
# cDNA sequences
gget.ref(species, which="cdna", release=release, download=True)
# Step 3: Get information for genes of interest
print(f"\n3. Getting information for {len(genes_of_interest)} genes...")
gene_data = []
for gene in genes_of_interest:
result = gget.search([gene], species=species, limit=1)
if len(result) > 0:
gene_data.append(result.iloc[0])
# Get detailed info
if gene_data:
gene_ids = [g["ensembl_id"] for g in gene_data]
detailed_info = gget.info(gene_ids)
detailed_info.to_csv("genes_of_interest_info.csv", index=False)
print("Gene information saved to genes_of_interest_info.csv")
# Step 4: Get sequences
print("\n4. Retrieving sequences...")
sequences_nt = gget.seq(gene_ids)
sequences_aa = gget.seq(gene_ids, translate=True)
with open("key_genes_nucleotide.fasta", "w") as f:
f.write(sequences_nt)
with open("key_genes_protein.fasta", "w") as f:
f.write(sequences_aa)
print("\nReference transcriptome building completed!")
print(f"Files created:")
print(" - reference_info.json")
print(" - genes_of_interest_info.csv")
print(" - key_genes_nucleotide.fasta")
print(" - key_genes_protein.fasta")
```
---
## Mutation Impact Assessment
Analyze the impact of genetic mutations on protein structure and function.
```python
import gget
import pandas as pd
print("Mutation Impact Assessment Workflow")
print("=" * 50)
# Define mutations to analyze
mutations = [
{"gene": "TP53", "mutation": "c.818G>A", "description": "R273H hotspot"},
{"gene": "EGFR", "mutation": "c.2573T>G", "description": "L858R activating"},
]
# Step 1: Get gene information
print("\n1. Getting gene information...")
for mut in mutations:
results = gget.search([mut["gene"]], species="homo_sapiens", limit=1)
if len(results) > 0:
mut["ensembl_id"] = results["ensembl_id"].iloc[0]
print(f"{mut['gene']}: {mut['ensembl_id']}")
# Step 2: Get sequences
print("\n2. Retrieving wild-type sequences...")
for mut in mutations:
# Get nucleotide sequence
nt_seq = gget.seq(mut["ensembl_id"])
mut["wt_sequence"] = nt_seq
# Get protein sequence
aa_seq = gget.seq(mut["ensembl_id"], translate=True)
mut["wt_protein"] = aa_seq
# Step 3: Generate mutated sequences
print("\n3. Generating mutated sequences...")
# Generate a mutated sequence for each gene
# (gget.mutate also accepts one combined DataFrame covering all sequences)
for mut in mutations:
# Extract sequence from FASTA
lines = mut["wt_sequence"].split("\n")
seq = "".join(lines[1:])
# Create single mutation df
single_mut = pd.DataFrame({
"seq_ID": [mut["gene"]],
"mutation": [mut["mutation"]]
})
# Generate mutated sequence
mutated = gget.mutate([seq], mutations=single_mut)
mut["mutated_sequence"] = mutated
print("Mutated sequences generated")
# Step 4: Get existing structure information
print("\n4. Getting structure information...")
for mut in mutations:
# Get info with PDB IDs
info = gget.info([mut["ensembl_id"]], pdb=True)
if "pdb_id" in info.columns and pd.notna(info["pdb_id"].iloc[0]):
pdb_ids = info["pdb_id"].iloc[0].split(";")
print(f"\n{mut['gene']} PDB structures: {', '.join(pdb_ids[:3])}")
# Download first structure
if len(pdb_ids) > 0:
pdb_id = pdb_ids[0].strip()
mut["pdb_id"] = pdb_id
gget.pdb(pdb_id, save=True)
else:
print(f"\n{mut['gene']}: No PDB structure available")
mut["pdb_id"] = None
# Step 5: Predict structures with AlphaFold (optional)
print("\n5. Predicting structures with AlphaFold...")
# Note: Requires gget setup alphafold and is computationally intensive
# Uncomment to run:
# for mut in mutations:
# print(f"Predicting {mut['gene']} wild-type structure...")
# wt_structure = gget.alphafold(mut["wt_protein"])
#
# print(f"Predicting {mut['gene']} mutant structure...")
# # Would need to translate mutated sequence first
# # mutant_structure = gget.alphafold(mutated_protein)
print("(AlphaFold prediction skipped - uncomment to run)")
# Step 6: Find functional motifs
print("\n6. Identifying functional motifs...")
# Note: Requires gget setup elm
# Uncomment to run:
# for mut in mutations:
# ortholog_df, regex_df = gget.elm(mut["wt_protein"])
# print(f"\n{mut['gene']} functional motifs:")
# print(regex_df)
print("(ELM analysis skipped - uncomment to run)")
# Step 7: Get disease associations
print("\n7. Getting disease associations...")
for mut in mutations:
diseases = gget.opentargets(
mut["ensembl_id"],
resource="diseases",
limit=5
)
print(f"\n{mut['gene']} ({mut['description']}) disease associations:")
print(diseases[["disease_name", "overall_score"]])
# Step 8: Query COSMIC for mutation frequency
print("\n8. Querying COSMIC database...")
# Note: Requires COSMIC database download
# Uncomment to run:
# for mut in mutations:
# cosmic_results = gget.cosmic(
# mut["mutation"],
# cosmic_tsv_path="cosmic_cancer.tsv",
# limit=10
# )
# print(f"\n{mut['gene']} {mut['mutation']} in COSMIC:")
# print(cosmic_results)
print("(COSMIC query skipped - requires database download)")
print("\nMutation impact assessment completed!")
```
---
## Drug Target Discovery
Identify and validate potential drug targets for specific diseases.
```python
import gget
import pandas as pd
print("Drug Target Discovery Workflow")
print("=" * 50)
# Step 1: Search for disease-related genes
disease = "alzheimer"
print(f"\n1. Searching for {disease} disease genes...")
genes = gget.search([disease], species="homo_sapiens", limit=50)
print(f"Found {len(genes)} potential genes")
# Step 2: Get detailed information
print("\n2. Getting detailed gene information...")
gene_ids = genes["ensembl_id"].tolist()[:20] # Top 20
gene_info = gget.info(gene_ids[:10]) # Limit to avoid timeout
# Step 3: Get disease associations from OpenTargets
print("\n3. Getting disease associations...")
disease_scores = []
for gene_id, gene_name in zip(gene_info["ensembl_id"], gene_info["gene_name"]):
diseases = gget.opentargets(gene_id, resource="diseases", limit=10)
# Filter for Alzheimer's disease
alzheimer = diseases[diseases["disease_name"].str.contains("Alzheimer", case=False, na=False)]
if len(alzheimer) > 0:
disease_scores.append({
"ensembl_id": gene_id,
"gene_name": gene_name,
"disease_score": alzheimer["overall_score"].max()
})
disease_df = pd.DataFrame(disease_scores).sort_values("disease_score", ascending=False)
print("\nTop disease-associated genes:")
print(disease_df.head(10))
# Step 4: Get tractability information
print("\n4. Assessing target tractability...")
top_targets = disease_df.head(5)
for _, row in top_targets.iterrows():
tractability = gget.opentargets(
row["ensembl_id"],
resource="tractability"
)
print(f"\n{row['gene_name']} tractability:")
print(tractability)
# Step 5: Get expression data
print("\n5. Getting tissue expression data...")
for _, row in top_targets.iterrows():
# Brain expression from OpenTargets
expression = gget.opentargets(
row["ensembl_id"],
resource="expression",
filter_tissue="brain"
)
print(f"\n{row['gene_name']} brain expression:")
print(expression)
# Tissue expression from ARCHS4
tissue_expr = gget.archs4(row["gene_name"], which="tissue")
brain_expr = tissue_expr[tissue_expr["tissue"].str.contains("brain", case=False, na=False)]
print(f"ARCHS4 brain expression:")
print(brain_expr)
# Step 6: Check for existing drugs
print("\n6. Checking for existing drugs...")
for _, row in top_targets.iterrows():
drugs = gget.opentargets(row["ensembl_id"], resource="drugs", limit=5)
print(f"\n{row['gene_name']} drug associations:")
if len(drugs) > 0:
print(drugs[["drug_name", "drug_type", "max_phase_for_all_diseases"]])
else:
print("No drugs found")
# Step 7: Get protein-protein interactions
print("\n7. Getting protein-protein interactions...")
for _, row in top_targets.iterrows():
interactions = gget.opentargets(
row["ensembl_id"],
resource="interactions",
limit=10
)
print(f"\n{row['gene_name']} interacts with:")
if len(interactions) > 0:
print(interactions[["gene_b_symbol", "interaction_score"]])
# Step 8: Enrichment analysis
print("\n8. Performing pathway enrichment...")
gene_list = top_targets["gene_name"].tolist()
enrichment = gget.enrichr(gene_list, database="pathway", plot=True)
print("\nTop enriched pathways:")
print(enrichment.head(10))
# Step 9: Get structure information
print("\n9. Getting structure information...")
for _, row in top_targets.iterrows():
info = gget.info([row["ensembl_id"]], pdb=True)
if "pdb_id" in info.columns and pd.notna(info["pdb_id"].iloc[0]):
pdb_ids = info["pdb_id"].iloc[0].split(";")
print(f"\n{row['gene_name']} PDB structures: {', '.join(pdb_ids[:3])}")
else:
print(f"\n{row['gene_name']}: No PDB structure available")
# Could predict with AlphaFold
print(f" Consider AlphaFold prediction")
# Step 10: Generate target summary report
print("\n10. Generating target summary report...")
report = []
for _, row in top_targets.iterrows():
report.append({
"Gene": row["gene_name"],
"Ensembl ID": row["ensembl_id"],
"Disease Score": row["disease_score"],
"Target Status": "High Priority"
})
report_df = pd.DataFrame(report)
report_df.to_csv("drug_targets_report.csv", index=False)
print("\nTarget report saved to drug_targets_report.csv")
print("\nDrug target discovery workflow completed!")
```
---
## Tips for Workflow Development
### Error Handling
```python
import gget
def safe_gget_call(func, *args, **kwargs):
"""Wrapper for gget calls with error handling"""
try:
result = func(*args, **kwargs)
return result
except Exception as e:
print(f"Error in {func.__name__}: {str(e)}")
return None
# Usage
result = safe_gget_call(gget.search, ["ACE2"], species="homo_sapiens")
if result is not None:
print(result)
```
### Rate Limiting
```python
import time
import gget
import pandas as pd
def rate_limited_queries(gene_ids, delay=1):
"""Query multiple genes with rate limiting"""
results = []
for i, gene_id in enumerate(gene_ids):
print(f"Querying {i+1}/{len(gene_ids)}: {gene_id}")
result = gget.info([gene_id])
results.append(result)
if i < len(gene_ids) - 1: # Don't sleep after last query
time.sleep(delay)
return pd.concat(results, ignore_index=True)
```
### Caching Results
```python
import os
import pickle
import gget
def cached_gget(cache_file, func, *args, **kwargs):
"""Cache gget results to avoid repeated queries"""
if os.path.exists(cache_file):
print(f"Loading from cache: {cache_file}")
with open(cache_file, "rb") as f:
return pickle.load(f)
result = func(*args, **kwargs)
with open(cache_file, "wb") as f:
pickle.dump(result, f)
print(f"Saved to cache: {cache_file}")
return result
# Usage
result = cached_gget("ace2_info.pkl", gget.info, ["ENSG00000130234"])
```
---
These workflows demonstrate how to combine multiple gget modules for comprehensive bioinformatics analyses. Adapt them to your specific research questions and data types.

View File

@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""
Batch Sequence Analysis Script
Analyze multiple sequences: BLAST, alignment, and structure prediction
"""
import argparse
import sys
from pathlib import Path
import gget
def read_fasta(fasta_file):
"""Read sequences from FASTA file."""
sequences = []
current_id = None
current_seq = []
with open(fasta_file, "r") as f:
for line in f:
line = line.strip()
if line.startswith(">"):
if current_id:
sequences.append({"id": current_id, "seq": "".join(current_seq)})
current_id = line[1:]
current_seq = []
else:
current_seq.append(line)
if current_id:
sequences.append({"id": current_id, "seq": "".join(current_seq)})
return sequences
def analyze_sequences(
fasta_file,
blast_db="nr",
align=True,
predict_structure=False,
output_dir="output",
):
"""
Perform batch sequence analysis.
Args:
fasta_file: Path to FASTA file with sequences
blast_db: BLAST database to search (default: nr)
align: Whether to perform multiple sequence alignment
predict_structure: Whether to predict structures with AlphaFold
output_dir: Output directory for results
"""
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
print(f"Batch Sequence Analysis")
print("=" * 60)
print(f"Input file: {fasta_file}")
print(f"Output directory: {output_dir}")
print("")
# Read sequences
print("Reading sequences...")
sequences = read_fasta(fasta_file)
print(f"Found {len(sequences)} sequences\n")
# Step 1: BLAST each sequence
print("Step 1: Running BLAST searches...")
print("-" * 60)
for i, seq_data in enumerate(sequences):
print(f"\n{i+1}. BLASTing {seq_data['id']}...")
try:
blast_results = gget.blast(
seq_data["seq"], database=blast_db, limit=10, save=False
)
output_file = output_path / f"{seq_data['id']}_blast.csv"
blast_results.to_csv(output_file, index=False)
print(f" Results saved to: {output_file}")
if len(blast_results) > 0:
print(f" Top hit: {blast_results.iloc[0]['Description']}")
print(
f" Max Score: {blast_results.iloc[0]['Max Score']}, "
f"Query Coverage: {blast_results.iloc[0]['Query Coverage']}"
)
except Exception as e:
print(f" Error: {e}")
# Step 2: Multiple sequence alignment
if align and len(sequences) > 1:
print("\n\nStep 2: Multiple sequence alignment...")
print("-" * 60)
try:
alignment = gget.muscle(fasta_file)
alignment_file = output_path / "alignment.afa"
with open(alignment_file, "w") as f:
f.write(alignment)
print(f"Alignment saved to: {alignment_file}")
except Exception as e:
print(f"Error in alignment: {e}")
else:
print("\n\nStep 2: Skipping alignment (only 1 sequence or disabled)")
# Step 3: Structure prediction (optional)
if predict_structure:
print("\n\nStep 3: Predicting structures with AlphaFold...")
print("-" * 60)
print(
"Note: This requires 'gget setup alphafold' and is computationally intensive"
)
for i, seq_data in enumerate(sequences):
print(f"\n{i+1}. Predicting structure for {seq_data['id']}...")
try:
structure_dir = output_path / f"structure_{seq_data['id']}"
# Uncomment to run AlphaFold prediction:
# gget.alphafold(seq_data['seq'], out=str(structure_dir))
# print(f" Structure saved to: {structure_dir}")
print(
" (Prediction skipped - uncomment code to run AlphaFold prediction)"
)
except Exception as e:
print(f" Error: {e}")
else:
print("\n\nStep 3: Structure prediction disabled")
# Summary
print("\n" + "=" * 60)
print("Batch analysis complete!")
print(f"\nResults saved to: {output_dir}/")
print(f" - BLAST results: *_blast.csv")
if align and len(sequences) > 1:
print(f" - Alignment: alignment.afa")
if predict_structure:
print(f" - Structures: structure_*/")
return True
def main():
parser = argparse.ArgumentParser(
description="Perform batch sequence analysis using gget"
)
parser.add_argument("fasta", help="Input FASTA file with sequences")
parser.add_argument(
"-db",
"--database",
default="nr",
help="BLAST database (default: nr for proteins, nt for nucleotides)",
)
parser.add_argument(
"--no-align", action="store_true", help="Skip multiple sequence alignment"
)
parser.add_argument(
"--predict-structure",
action="store_true",
help="Predict structures with AlphaFold (requires setup)",
)
parser.add_argument(
"-o", "--output", default="output", help="Output directory (default: output)"
)
args = parser.parse_args()
if not Path(args.fasta).exists():
print(f"Error: File not found: {args.fasta}")
sys.exit(1)
try:
success = analyze_sequences(
args.fasta,
blast_db=args.database,
align=not args.no_align,
predict_structure=args.predict_structure,
output_dir=args.output,
)
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\nAnalysis interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n\nError: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,235 @@
#!/usr/bin/env python3
"""
Enrichment Analysis Pipeline
Perform comprehensive enrichment analysis on a gene list
"""
import argparse
import sys
from pathlib import Path
import gget
import pandas as pd
def read_gene_list(file_path):
"""Read gene list from file (one gene per line or CSV)."""
file_path = Path(file_path)
if file_path.suffix == ".csv":
df = pd.read_csv(file_path)
# Assume first column contains gene names
genes = df.iloc[:, 0].tolist()
else:
# Plain text file
with open(file_path, "r") as f:
genes = [line.strip() for line in f if line.strip()]
return genes
def enrichment_pipeline(
gene_list,
species="human",
background=None,
output_prefix="enrichment",
plot=True,
):
"""
Perform comprehensive enrichment analysis.
Args:
gene_list: List of gene symbols
species: Species for analysis
background: Background gene list (optional)
output_prefix: Prefix for output files
plot: Whether to generate plots
"""
print("Enrichment Analysis Pipeline")
print("=" * 60)
print(f"Analyzing {len(gene_list)} genes")
print(f"Species: {species}\n")
# Database categories to analyze
databases = {
"pathway": "KEGG Pathways",
"ontology": "Gene Ontology (Biological Process)",
"transcription": "Transcription Factors (ChEA)",
"diseases_drugs": "Disease Associations (GWAS)",
"celltypes": "Cell Type Markers (PanglaoDB)",
}
results = {}
for db_key, db_name in databases.items():
print(f"\nAnalyzing: {db_name}")
print("-" * 60)
try:
enrichment = gget.enrichr(
gene_list,
database=db_key,
species=species,
background_list=background,
plot=plot,
)
if enrichment is not None and len(enrichment) > 0:
# Save results
output_file = f"{output_prefix}_{db_key}.csv"
enrichment.to_csv(output_file, index=False)
print(f"Results saved to: {output_file}")
# Show top 5 results
print(f"\nTop 5 enriched terms:")
for i, row in enrichment.head(5).iterrows():
term = row.get("name", row.get("term", "Unknown"))
p_val = row.get(
"adjusted_p_value",
row.get("p_value", row.get("Adjusted P-value", 1)),
)
print(f" {i+1}. {term}")
print(f" P-value: {p_val:.2e}")
results[db_key] = enrichment
else:
print("No significant results found")
except Exception as e:
print(f"Error: {e}")
# Generate summary report
print("\n" + "=" * 60)
print("Generating summary report...")
summary = []
for db_key, db_name in databases.items():
if db_key in results and len(results[db_key]) > 0:
summary.append(
{
"Database": db_name,
"Total Terms": len(results[db_key]),
"Top Term": results[db_key].iloc[0].get(
"name", results[db_key].iloc[0].get("term", "N/A")
),
}
)
if summary:
summary_df = pd.DataFrame(summary)
summary_file = f"{output_prefix}_summary.csv"
summary_df.to_csv(summary_file, index=False)
print(f"\nSummary saved to: {summary_file}")
print("\n" + summary_df.to_string(index=False))
else:
print("\nNo enrichment results to summarize")
# Get expression data for genes
print("\n" + "=" * 60)
print("Getting expression data for input genes...")
try:
# Get tissue expression for first few genes
expr_data = []
for gene in gene_list[:5]: # Limit to first 5
print(f" Getting expression for {gene}...")
try:
tissue_expr = gget.archs4(gene, which="tissue")
top_tissue = tissue_expr.nlargest(1, "median").iloc[0]
expr_data.append(
{
"Gene": gene,
"Top Tissue": top_tissue["tissue"],
"Median Expression": top_tissue["median"],
}
)
except Exception as e:
print(f" Warning: {e}")
if expr_data:
expr_df = pd.DataFrame(expr_data)
expr_file = f"{output_prefix}_expression.csv"
expr_df.to_csv(expr_file, index=False)
print(f"\nExpression data saved to: {expr_file}")
except Exception as e:
print(f"Error getting expression data: {e}")
print("\n" + "=" * 60)
print("Enrichment analysis complete!")
print(f"\nOutput files (prefix: {output_prefix}):")
for db_key in databases.keys():
if db_key in results:
print(f" - {output_prefix}_{db_key}.csv")
print(f" - {output_prefix}_summary.csv")
print(f" - {output_prefix}_expression.csv")
return True
def main():
parser = argparse.ArgumentParser(
description="Perform comprehensive enrichment analysis using gget"
)
parser.add_argument(
"genes",
help="Gene list file (one gene per line or CSV with genes in first column)",
)
parser.add_argument(
"-s",
"--species",
default="human",
help="Species (human, mouse, fly, yeast, worm, fish)",
)
parser.add_argument(
"-b", "--background", help="Background gene list file (optional)"
)
parser.add_argument(
"-o", "--output", default="enrichment", help="Output prefix (default: enrichment)"
)
parser.add_argument(
"--no-plot", action="store_true", help="Disable plotting"
)
args = parser.parse_args()
# Read gene list
if not Path(args.genes).exists():
print(f"Error: File not found: {args.genes}")
sys.exit(1)
try:
gene_list = read_gene_list(args.genes)
print(f"Read {len(gene_list)} genes from {args.genes}")
# Read background if provided
background = None
if args.background:
if Path(args.background).exists():
background = read_gene_list(args.background)
print(f"Read {len(background)} background genes from {args.background}")
else:
print(f"Warning: Background file not found: {args.background}")
success = enrichment_pipeline(
gene_list,
species=args.species,
background=background,
output_prefix=args.output,
plot=not args.no_plot,
)
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\nAnalysis interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n\nError: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Gene Analysis Script
Quick analysis of a gene: search, info, sequences, expression, and enrichment
"""
import argparse
import sys
import gget
def analyze_gene(gene_name, species="homo_sapiens", output_prefix=None):
"""
Perform comprehensive analysis of a gene.
Args:
gene_name: Gene symbol to analyze
species: Species name (default: homo_sapiens)
output_prefix: Prefix for output files (default: gene_name)
"""
if output_prefix is None:
output_prefix = gene_name.lower()
print(f"Analyzing gene: {gene_name}")
print("=" * 60)
# Step 1: Search for the gene
print("\n1. Searching for gene...")
search_results = gget.search([gene_name], species=species, limit=1)
if len(search_results) == 0:
print(f"Error: Gene '{gene_name}' not found in {species}")
return False
gene_id = search_results["ensembl_id"].iloc[0]
print(f" Found: {gene_id}")
print(f" Description: {search_results['ensembl_description'].iloc[0]}")
# Step 2: Get detailed information
print("\n2. Getting detailed information...")
gene_info = gget.info([gene_id], pdb=True)
gene_info.to_csv(f"{output_prefix}_info.csv", index=False)
print(f" Saved to: {output_prefix}_info.csv")
if "uniprot_id" in gene_info.columns and gene_info["uniprot_id"].iloc[0]:
print(f" UniProt ID: {gene_info['uniprot_id'].iloc[0]}")
if "pdb_id" in gene_info.columns and gene_info["pdb_id"].iloc[0]:
print(f" PDB IDs: {gene_info['pdb_id'].iloc[0]}")
# Step 3: Get sequences
print("\n3. Retrieving sequences...")
nucleotide_seq = gget.seq([gene_id])
protein_seq = gget.seq([gene_id], translate=True)
with open(f"{output_prefix}_nucleotide.fasta", "w") as f:
f.write(nucleotide_seq)
print(f" Nucleotide sequence saved to: {output_prefix}_nucleotide.fasta")
with open(f"{output_prefix}_protein.fasta", "w") as f:
f.write(protein_seq)
print(f" Protein sequence saved to: {output_prefix}_protein.fasta")
# Step 4: Get tissue expression
print("\n4. Getting tissue expression...")
try:
tissue_expr = gget.archs4(gene_name, which="tissue")
tissue_expr.to_csv(f"{output_prefix}_tissue_expression.csv", index=False)
print(f" Saved to: {output_prefix}_tissue_expression.csv")
# Show top tissues
top_tissues = tissue_expr.nlargest(5, "median")
print("\n Top expressing tissues:")
for _, row in top_tissues.iterrows():
print(f" {row['tissue']}: median = {row['median']:.2f}")
except Exception as e:
print(f" Warning: Could not retrieve ARCHS4 data: {e}")
# Step 5: Find correlated genes
print("\n5. Finding correlated genes...")
try:
correlated = gget.archs4(gene_name, which="correlation")
correlated.to_csv(f"{output_prefix}_correlated_genes.csv", index=False)
print(f" Saved to: {output_prefix}_correlated_genes.csv")
# Show top correlated
print("\n Top 10 correlated genes:")
for _, row in correlated.head(10).iterrows():
print(f" {row['gene_symbol']}: r = {row['correlation']:.3f}")
except Exception as e:
print(f" Warning: Could not retrieve correlation data: {e}")
# Step 6: Get disease associations
print("\n6. Getting disease associations...")
try:
diseases = gget.opentargets(gene_id, resource="diseases", limit=10)
diseases.to_csv(f"{output_prefix}_diseases.csv", index=False)
print(f" Saved to: {output_prefix}_diseases.csv")
print("\n Top 5 disease associations:")
for _, row in diseases.head(5).iterrows():
print(f" {row['disease_name']}: score = {row['overall_score']:.3f}")
except Exception as e:
print(f" Warning: Could not retrieve disease data: {e}")
# Step 7: Get drug associations
print("\n7. Getting drug associations...")
try:
drugs = gget.opentargets(gene_id, resource="drugs", limit=10)
if len(drugs) > 0:
drugs.to_csv(f"{output_prefix}_drugs.csv", index=False)
print(f" Saved to: {output_prefix}_drugs.csv")
print(f"\n Found {len(drugs)} drug associations")
else:
print(" No drug associations found")
except Exception as e:
print(f" Warning: Could not retrieve drug data: {e}")
print("\n" + "=" * 60)
print("Analysis complete!")
print(f"\nOutput files (prefix: {output_prefix}):")
print(f" - {output_prefix}_info.csv")
print(f" - {output_prefix}_nucleotide.fasta")
print(f" - {output_prefix}_protein.fasta")
print(f" - {output_prefix}_tissue_expression.csv")
print(f" - {output_prefix}_correlated_genes.csv")
print(f" - {output_prefix}_diseases.csv")
print(f" - {output_prefix}_drugs.csv (if available)")
return True
def main():
parser = argparse.ArgumentParser(
description="Perform comprehensive analysis of a gene using gget"
)
parser.add_argument("gene", help="Gene symbol to analyze")
parser.add_argument(
"-s",
"--species",
default="homo_sapiens",
help="Species (default: homo_sapiens)",
)
parser.add_argument(
"-o", "--output", help="Output prefix for files (default: gene name)"
)
args = parser.parse_args()
try:
success = analyze_gene(args.gene, args.species, args.output)
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\nAnalysis interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n\nError: {e}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,355 @@
---
name: matplotlib
description: Comprehensive toolkit for creating publication-quality data visualizations in Python. Use this skill when creating plots, charts, or any scientific/statistical visualizations including line plots, scatter plots, bar charts, histograms, heatmaps, 3D plots, and more. Applies to tasks involving data visualization, figure generation, plot customization, or exporting graphics to various formats.
---
# Matplotlib
## Overview
Matplotlib is Python's foundational visualization library for creating static, animated, and interactive plots. This skill provides guidance on using matplotlib effectively, covering both the pyplot interface (MATLAB-style) and the object-oriented API (Figure/Axes), along with best practices for creating publication-quality visualizations.
## When to Use This Skill
Apply this skill when:
- Creating any type of plot or chart (line, scatter, bar, histogram, heatmap, contour, etc.)
- Generating scientific or statistical visualizations
- Customizing plot appearance (colors, styles, labels, legends)
- Creating multi-panel figures with subplots
- Exporting visualizations to various formats (PNG, PDF, SVG, etc.)
- Building interactive plots or animations
- Working with 3D visualizations
- Integrating plots into Jupyter notebooks or GUI applications
## Core Concepts
### The Matplotlib Hierarchy
Matplotlib uses a hierarchical structure of objects (a short navigation sketch follows this list):
1. **Figure** - The top-level container for all plot elements
2. **Axes** - The actual plotting area where data is displayed (one Figure can contain multiple Axes)
3. **Artist** - Everything visible on the figure (lines, text, ticks, etc.)
4. **Axis** - The number line objects (x-axis, y-axis) that handle ticks and labels
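A minimal sketch of walking this hierarchy from a freshly created figure (all attributes used, `fig.axes`, `ax.lines`, `ax.xaxis`, are standard matplotlib API):
```python
import matplotlib.pyplot as plt

# Figure -> Axes -> Artists / Axis
fig, ax = plt.subplots()                  # Figure containing one Axes
line, = ax.plot([1, 2, 3], [1, 4, 9])     # the returned Line2D is an Artist

print(fig.axes)                           # Figure -> list of its Axes
print(ax.figure is fig)                   # Axes -> back-reference to the Figure
print(ax.lines)                           # Axes -> Artists drawn on it
print(ax.xaxis, ax.yaxis)                 # Axes -> Axis objects (ticks, labels)
plt.close(fig)
```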
### Two Interfaces
**1. pyplot Interface (Implicit, MATLAB-style)**
```python
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()
```
- Convenient for quick, simple plots
- Maintains state automatically
- Good for interactive work and simple scripts
**2. Object-Oriented Interface (Explicit)**
```python
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4])
ax.set_ylabel('some numbers')
plt.show()
```
- **Recommended for most use cases**
- More explicit control over figure and axes
- Better for complex figures with multiple subplots
- Easier to maintain and debug
## Common Workflows
### 1. Basic Plot Creation
**Single plot workflow:**
```python
import matplotlib.pyplot as plt
import numpy as np
# Create figure and axes (OO interface - RECOMMENDED)
fig, ax = plt.subplots(figsize=(10, 6))
# Generate and plot data
x = np.linspace(0, 2*np.pi, 100)
ax.plot(x, np.sin(x), label='sin(x)')
ax.plot(x, np.cos(x), label='cos(x)')
# Customize
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Trigonometric Functions')
ax.legend()
ax.grid(True, alpha=0.3)
# Save and/or display
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.show()
```
### 2. Multiple Subplots
**Creating subplot layouts:**
```python
# Method 1: Regular grid
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(x, y1)
axes[0, 1].scatter(x, y2)
axes[1, 0].bar(categories, values)
axes[1, 1].hist(data, bins=30)
# Method 2: Mosaic layout (more flexible)
fig, axes = plt.subplot_mosaic([['left', 'right_top'],
['left', 'right_bottom']],
figsize=(10, 8))
axes['left'].plot(x, y)
axes['right_top'].scatter(x, y)
axes['right_bottom'].hist(data)
# Method 3: GridSpec (maximum control)
from matplotlib.gridspec import GridSpec
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(3, 3, figure=fig)
ax1 = fig.add_subplot(gs[0, :]) # Top row, all columns
ax2 = fig.add_subplot(gs[1:, 0]) # Bottom two rows, first column
ax3 = fig.add_subplot(gs[1:, 1:]) # Bottom two rows, last two columns
```
### 3. Plot Types and Use Cases
**Line plots** - Time series, continuous data, trends
```python
ax.plot(x, y, linewidth=2, linestyle='--', marker='o', color='blue')
```
**Scatter plots** - Relationships between variables, correlations
```python
ax.scatter(x, y, s=sizes, c=colors, alpha=0.6, cmap='viridis')
```
**Bar charts** - Categorical comparisons
```python
ax.bar(categories, values, color='steelblue', edgecolor='black')
# For horizontal bars:
ax.barh(categories, values)
```
**Histograms** - Distributions
```python
ax.hist(data, bins=30, edgecolor='black', alpha=0.7)
```
**Heatmaps** - Matrix data, correlations
```python
im = ax.imshow(matrix, cmap='coolwarm', aspect='auto')
plt.colorbar(im, ax=ax)
```
**Contour plots** - 3D data on 2D plane
```python
contour = ax.contour(X, Y, Z, levels=10)
ax.clabel(contour, inline=True, fontsize=8)
```
**Box plots** - Statistical distributions
```python
ax.boxplot([data1, data2, data3], labels=['A', 'B', 'C'])
```
**Violin plots** - Distribution densities
```python
ax.violinplot([data1, data2, data3], positions=[1, 2, 3])
```
For comprehensive plot type examples and variations, refer to `references/plot_types.md`.
### 4. Styling and Customization
**Color specification methods:**
- Named colors: `'red'`, `'blue'`, `'steelblue'`
- Hex codes: `'#FF5733'`
- RGB tuples: `(0.1, 0.2, 0.3)`
- Colormaps: `cmap='viridis'`, `cmap='plasma'`, `cmap='coolwarm'`
**Using style sheets:**
```python
plt.style.use('seaborn-v0_8-darkgrid') # Apply predefined style
# Available styles: 'ggplot', 'bmh', 'fivethirtyeight', etc.
print(plt.style.available) # List all available styles
```
**Customizing with rcParams:**
```python
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['figure.titlesize'] = 18
```
**Text and annotations:**
```python
ax.text(x, y, 'annotation', fontsize=12, ha='center')
ax.annotate('important point', xy=(x, y), xytext=(x+1, y+1),
arrowprops=dict(arrowstyle='->', color='red'))
```
For detailed styling options and colormap guidelines, see `references/styling_guide.md`.
### 5. Saving Figures
**Export to various formats:**
```python
# High-resolution PNG for presentations/papers
plt.savefig('figure.png', dpi=300, bbox_inches='tight', facecolor='white')
# Vector format for publications (scalable)
plt.savefig('figure.pdf', bbox_inches='tight')
plt.savefig('figure.svg', bbox_inches='tight')
# Transparent background
plt.savefig('figure.png', dpi=300, bbox_inches='tight', transparent=True)
```
**Important parameters:**
- `dpi`: Resolution (300 for publications, 150 for web, 72 for screen)
- `bbox_inches='tight'`: Removes excess whitespace
- `facecolor='white'`: Ensures a white background (useful when the active style or notebook theme would otherwise leave it transparent or dark)
- `transparent=True`: Transparent background
### 6. Working with 3D Plots
```python
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Surface plot
ax.plot_surface(X, Y, Z, cmap='viridis')
# 3D scatter
ax.scatter(x, y, z, c=colors, marker='o')
# 3D line plot
ax.plot(x, y, z, linewidth=2)
# Labels
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
```
## Best Practices
### 1. Interface Selection
- **Use the object-oriented interface** (fig, ax = plt.subplots()) for production code
- Reserve pyplot interface for quick interactive exploration only
- Always create figures explicitly rather than relying on implicit state
### 2. Figure Size and DPI
- Set figsize at creation: `fig, ax = plt.subplots(figsize=(10, 6))`
- Use appropriate DPI for output medium (see the sketch after this list):
- Screen/notebook: 72-100 dpi
- Web: 150 dpi
- Print/publications: 300 dpi
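For example, a quick sanity check of the pixel dimensions a given figsize/DPI combination produces (file names are illustrative):
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))   # size is specified in inches
fig.savefig('web.png', dpi=150)           # 1500 x 900 pixels
fig.savefig('print.png', dpi=300)         # 3000 x 1800 pixels

width_in, height_in = fig.get_size_inches()
print(width_in * 300, height_in * 300)    # pixel size at 300 dpi
plt.close(fig)
```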
### 3. Layout Management
- Use `constrained_layout=True` or `tight_layout()` to prevent overlapping elements
- `fig, ax = plt.subplots(constrained_layout=True)` is recommended for automatic spacing
### 4. Colormap Selection
- **Sequential** (viridis, plasma, inferno): Ordered data with consistent progression
- **Diverging** (coolwarm, RdBu): Data with meaningful center point (e.g., zero)
- **Qualitative** (tab10, Set3): Categorical/nominal data
- Avoid rainbow colormaps (jet) - they are not perceptually uniform; a comparison sketch follows below
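A small comparison sketch applying each colormap family to the kind of data it suits (synthetic data, illustration only):
```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, figsize=(14, 4), constrained_layout=True)

# Sequential: ordered, all-positive values
im0 = axes[0].imshow(rng.random((20, 20)), cmap='viridis')
fig.colorbar(im0, ax=axes[0])
axes[0].set_title('Sequential (viridis)')

# Diverging: values with a meaningful zero
im1 = axes[1].imshow(rng.normal(size=(20, 20)), cmap='coolwarm', vmin=-3, vmax=3)
fig.colorbar(im1, ax=axes[1])
axes[1].set_title('Diverging (coolwarm)')

# Qualitative: discrete categories
for k in range(4):
    x, y = rng.normal(size=(2, 30)) + 3 * k
    axes[2].scatter(x, y, color=plt.cm.tab10(k), label=f'class {k}')
axes[2].set_title('Qualitative (tab10)')
axes[2].legend()
plt.close(fig)
```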
### 5. Accessibility
- Use colorblind-friendly colormaps (viridis, cividis)
- Add patterns/hatching for bar charts in addition to colors
- Ensure sufficient contrast between elements
- Include descriptive labels and legends (see the bar-chart sketch below)
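A minimal sketch of a bar chart that encodes groups redundantly, colorblind-safe fills (Okabe-Ito palette values) plus hatching, so it still reads in grayscale:
```python
import matplotlib.pyplot as plt
import numpy as np

categories = ['A', 'B', 'C']
x = np.arange(len(categories))
width = 0.35

fig, ax = plt.subplots(constrained_layout=True)
ax.bar(x - width/2, [3, 5, 2], width, label='Control',
       color='#0072B2', edgecolor='black', hatch='//')
ax.bar(x + width/2, [4, 6, 3], width, label='Treatment',
       color='#E69F00', edgecolor='black', hatch='..')
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel('Count')
ax.legend()
plt.close(fig)
```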
### 6. Performance
- For large datasets, use `rasterized=True` in plot calls to reduce file size
- Use appropriate data reduction before plotting (e.g., downsample dense time series)
- For animations, use blitting for better performance (a rasterization/downsampling sketch follows below)
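A sketch of the two cheapest wins for a dense series, rasterizing the heavy artist and plotting a downsampled overview (plain slicing is used for downsampling here, which assumes that is acceptable for the data):
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 100, 1_000_000)
y = np.sin(x) + 0.1 * np.random.randn(x.size)

fig, ax = plt.subplots(figsize=(10, 4), constrained_layout=True)

# Rasterize the dense line so vector output (PDF/SVG) stays small
ax.plot(x, y, linewidth=0.5, alpha=0.5, rasterized=True, label='full data')

# Crude downsample: keep every 1000th point for the overview trace
ax.plot(x[::1000], y[::1000], linewidth=2, label='downsampled')

ax.legend()
fig.savefig('large_series.pdf')   # rasterized artists keep the PDF compact
plt.close(fig)
```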
### 7. Code Organization
```python
# Good practice: Clear structure
def create_analysis_plot(data, title):
"""Create standardized analysis plot."""
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
# Plot data
ax.plot(data['x'], data['y'], linewidth=2)
# Customize
ax.set_xlabel('X Axis Label', fontsize=12)
ax.set_ylabel('Y Axis Label', fontsize=12)
ax.set_title(title, fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
return fig, ax
# Use the function
fig, ax = create_analysis_plot(my_data, 'My Analysis')
plt.savefig('analysis.png', dpi=300, bbox_inches='tight')
```
## Quick Reference Scripts
This skill includes helper scripts in the `scripts/` directory:
### `plot_template.py`
Template script demonstrating various plot types with best practices. Use this as a starting point for creating new visualizations.
**Usage:**
```bash
python scripts/plot_template.py
```
### `style_configurator.py`
Interactive utility to configure matplotlib style preferences and generate custom style sheets.
**Usage:**
```bash
python scripts/style_configurator.py
```
## Detailed References
For comprehensive information, consult the reference documents:
- **`references/plot_types.md`** - Complete catalog of plot types with code examples and use cases
- **`references/styling_guide.md`** - Detailed styling options, colormaps, and customization
- **`references/api_reference.md`** - Core classes and methods reference
- **`references/common_issues.md`** - Troubleshooting guide for common problems
## Integration with Other Tools
Matplotlib integrates well with:
- **NumPy/Pandas** - Direct plotting from arrays and DataFrames (see the sketch after this list)
- **Seaborn** - High-level statistical visualizations built on matplotlib
- **Jupyter** - Interactive plotting with `%matplotlib inline` or `%matplotlib widget`
- **GUI frameworks** - Embedding in Tkinter, Qt, wxPython applications
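As an example of the pandas integration, a DataFrame can draw directly onto an existing Axes via its `plot(ax=...)` method (the DataFrame below is made up for illustration):
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical time series with two columns
df = pd.DataFrame(
    {"signal": np.random.randn(100).cumsum(),
     "baseline": np.zeros(100)},
    index=pd.date_range("2024-01-01", periods=100),
)

fig, ax = plt.subplots(figsize=(10, 4), constrained_layout=True)
df.plot(ax=ax)                    # pandas draws onto the matplotlib Axes
ax.set_ylabel("value")
fig.savefig("dataframe_plot.png", dpi=150)
plt.close(fig)
```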
## Common Gotchas
1. **Overlapping elements**: Use `constrained_layout=True` or `tight_layout()`
2. **State confusion**: Use OO interface to avoid pyplot state machine issues
3. **Memory issues with many figures**: Close figures explicitly with `plt.close(fig)` (loop sketch after this list)
4. **Font warnings**: Install fonts or suppress warnings with `plt.rcParams['font.sans-serif']`
5. **DPI confusion**: Remember that figsize is in inches, not pixels: `pixels = dpi * inches`
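For gotcha 3, the usual pattern when producing many figures in a loop is to save and then close each one (a minimal sketch; file names are illustrative):
```python
import matplotlib.pyplot as plt
import numpy as np

for i in range(20):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(np.random.randn(100).cumsum())
    fig.savefig(f"run_{i:02d}.png", dpi=150, bbox_inches="tight")
    plt.close(fig)   # free the figure's memory before the next iteration
```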
## Additional Resources
- Official documentation: https://matplotlib.org/
- Gallery: https://matplotlib.org/stable/gallery/index.html
- Cheatsheets: https://matplotlib.org/cheatsheets/
- Tutorials: https://matplotlib.org/stable/tutorials/index.html

View File

@@ -0,0 +1,412 @@
# Matplotlib API Reference
This document provides a quick reference for the most commonly used matplotlib classes and methods.
## Core Classes
### Figure
The top-level container for all plot elements.
**Creation:**
```python
fig = plt.figure(figsize=(10, 6), dpi=100, facecolor='white')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
```
**Key Methods:**
- `fig.add_subplot(nrows, ncols, index)` - Add a subplot
- `fig.add_axes([left, bottom, width, height])` - Add axes at specific position
- `fig.savefig(filename, dpi=300, bbox_inches='tight')` - Save figure
- `fig.tight_layout()` - Adjust spacing to prevent overlaps
- `fig.suptitle(title)` - Set figure title
- `fig.legend()` - Create figure-level legend
- `fig.colorbar(mappable)` - Add colorbar to figure
- `plt.close(fig)` - Close figure to free memory
**Key Attributes:**
- `fig.axes` - List of all axes in the figure
- `fig.dpi` - Resolution in dots per inch
- `fig.get_size_inches()` - Figure dimensions in inches (width, height)
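A short sketch combining several of the Figure methods and attributes listed above (output file name is illustrative):
```python
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(10, 4), constrained_layout=True)
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

ax1.plot(np.linspace(0, 1, 50) ** 2)
im = ax2.imshow(np.random.rand(10, 10), cmap='viridis')
fig.colorbar(im, ax=ax2)

fig.suptitle('Figure-level title')
print(fig.axes, fig.dpi, fig.get_size_inches())
fig.savefig('figure_demo.png', dpi=150, bbox_inches='tight')
plt.close(fig)
```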
### Axes
The actual plotting area where data is visualized.
**Creation:**
```python
fig, ax = plt.subplots() # Single axes
ax = fig.add_subplot(111) # Alternative method
```
**Plotting Methods:**
**Line plots:**
- `ax.plot(x, y, **kwargs)` - Line plot
- `ax.step(x, y, where='pre'/'mid'/'post')` - Step plot
- `ax.errorbar(x, y, yerr, xerr)` - Error bars
**Scatter plots:**
- `ax.scatter(x, y, s=size, c=color, marker='o', alpha=0.5)` - Scatter plot
**Bar charts:**
- `ax.bar(x, height, width=0.8, align='center')` - Vertical bar chart
- `ax.barh(y, width)` - Horizontal bar chart
**Statistical plots:**
- `ax.hist(data, bins=10, density=False)` - Histogram
- `ax.boxplot(data, labels=None)` - Box plot
- `ax.violinplot(data)` - Violin plot
**2D plots:**
- `ax.imshow(array, cmap='viridis', aspect='auto')` - Display image/matrix
- `ax.contour(X, Y, Z, levels=10)` - Contour lines
- `ax.contourf(X, Y, Z, levels=10)` - Filled contours
- `ax.pcolormesh(X, Y, Z)` - Pseudocolor plot
**Filling:**
- `ax.fill_between(x, y1, y2, alpha=0.3)` - Fill between curves
- `ax.fill_betweenx(y, x1, x2)` - Fill between vertical curves
**Text and annotations:**
- `ax.text(x, y, text, fontsize=12)` - Add text
- `ax.annotate(text, xy=(x, y), xytext=(x2, y2), arrowprops={})` - Annotate with arrow
**Customization Methods:**
**Labels and titles:**
- `ax.set_xlabel(label, fontsize=12)` - Set x-axis label
- `ax.set_ylabel(label, fontsize=12)` - Set y-axis label
- `ax.set_title(title, fontsize=14)` - Set axes title
**Limits and scales:**
- `ax.set_xlim(left, right)` - Set x-axis limits
- `ax.set_ylim(bottom, top)` - Set y-axis limits
- `ax.set_xscale('linear'/'log'/'symlog')` - Set x-axis scale
- `ax.set_yscale('linear'/'log'/'symlog')` - Set y-axis scale
**Ticks:**
- `ax.set_xticks(positions)` - Set x-tick positions
- `ax.set_xticklabels(labels)` - Set x-tick labels
- `ax.tick_params(axis='both', labelsize=10)` - Customize tick appearance
**Grid and spines:**
- `ax.grid(True, alpha=0.3, linestyle='--')` - Add grid
- `ax.spines['top'].set_visible(False)` - Hide top spine
- `ax.spines['right'].set_visible(False)` - Hide right spine
**Legend:**
- `ax.legend(loc='best', fontsize=10, frameon=True)` - Add legend
- `ax.legend(handles, labels)` - Custom legend
**Aspect and layout:**
- `ax.set_aspect('equal'/'auto'/ratio)` - Set aspect ratio
- `ax.invert_xaxis()` - Invert x-axis
- `ax.invert_yaxis()` - Invert y-axis
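A compact sketch exercising the main customization groups above on a single Axes:
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.1, 10, 200)
fig, ax = plt.subplots(constrained_layout=True)
ax.plot(x, np.exp(x) * 1e-3, label='growth')

ax.set_xlabel('time (s)', fontsize=12)          # labels and title
ax.set_ylabel('signal (a.u.)', fontsize=12)
ax.set_title('Customization example', fontsize=14)

ax.set_xlim(0, 10)                              # limits and scales
ax.set_yscale('log')

ax.set_xticks(range(0, 11, 2))                  # ticks
ax.tick_params(axis='both', labelsize=10)

ax.grid(True, alpha=0.3, linestyle='--')        # grid and spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.legend(loc='best', fontsize=10)              # legend
plt.close(fig)
```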
### pyplot Module
High-level interface for quick plotting.
**Figure creation:**
- `plt.figure()` - Create new figure
- `plt.subplots()` - Create figure and axes
- `plt.subplot()` - Add subplot to current figure
**Plotting (uses current axes):**
- `plt.plot()` - Line plot
- `plt.scatter()` - Scatter plot
- `plt.bar()` - Bar chart
- `plt.hist()` - Histogram
- (All axes methods available)
**Display and save:**
- `plt.show()` - Display figure
- `plt.savefig()` - Save figure
- `plt.close()` - Close figure
**Style:**
- `plt.style.use(style_name)` - Apply style sheet
- `plt.style.available` - List available styles
**State management:**
- `plt.gca()` - Get current axes
- `plt.gcf()` - Get current figure
- `plt.sca(ax)` - Set current axes
- `plt.clf()` - Clear current figure
- `plt.cla()` - Clear current axes
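A brief sketch of the state-management helpers, mainly useful for debugging or for bridging from pyplot-style code to the object-oriented interface:
```python
import matplotlib.pyplot as plt

plt.figure()                 # becomes the "current" figure
plt.plot([1, 2, 3])          # draws on the current axes

fig = plt.gcf()              # grab the current figure...
ax = plt.gca()               # ...and the current axes, then switch to the OO API
ax.set_title('via gca()')

plt.cla()                    # clear the current axes
plt.clf()                    # clear the whole current figure
plt.close(fig)               # close it entirely
```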
## Line and Marker Styles
### Line Styles
- `'-'` or `'solid'` - Solid line
- `'--'` or `'dashed'` - Dashed line
- `'-.'` or `'dashdot'` - Dash-dot line
- `':'` or `'dotted'` - Dotted line
- `''` or `' '` or `'None'` - No line
### Marker Styles
- `'.'` - Point marker
- `'o'` - Circle marker
- `'v'`, `'^'`, `'<'`, `'>'` - Triangle markers
- `'s'` - Square marker
- `'p'` - Pentagon marker
- `'*'` - Star marker
- `'h'`, `'H'` - Hexagon markers
- `'+'` - Plus marker
- `'x'` - X marker
- `'D'`, `'d'` - Diamond markers
### Color Specifications
**Single character shortcuts:**
- `'b'` - Blue
- `'g'` - Green
- `'r'` - Red
- `'c'` - Cyan
- `'m'` - Magenta
- `'y'` - Yellow
- `'k'` - Black
- `'w'` - White
**Named colors:**
- `'steelblue'`, `'coral'`, `'teal'`, etc.
- See full list: https://matplotlib.org/stable/gallery/color/named_colors.html
**Other formats:**
- Hex: `'#FF5733'`
- RGB tuple: `(0.1, 0.2, 0.3)`
- RGBA tuple: `(0.1, 0.2, 0.3, 0.5)`
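A quick sketch showing equivalent color spellings, plus the compact format-string shorthand (`'[color][marker][linestyle]'`) that `plot` also accepts:
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 30)
fig, ax = plt.subplots()

ax.plot(x, np.sin(x), color='r')                        # single-character shortcut
ax.plot(x, np.sin(x) + 1, color='steelblue')            # named color
ax.plot(x, np.sin(x) + 2, color='#FF5733')              # hex code
ax.plot(x, np.sin(x) + 3, color=(0.1, 0.2, 0.3, 0.8))   # RGBA tuple
ax.plot(x, np.sin(x) + 4, 'go--')                       # format string: green, circles, dashed
plt.close(fig)
```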
## Common Parameters
### Plot Function Parameters
```python
ax.plot(x, y,
color='blue', # Line color
linewidth=2, # Line width
linestyle='--', # Line style
marker='o', # Marker style
markersize=8, # Marker size
markerfacecolor='red', # Marker fill color
markeredgecolor='black',# Marker edge color
markeredgewidth=1, # Marker edge width
alpha=0.7, # Transparency (0-1)
label='data', # Legend label
zorder=2, # Drawing order
rasterized=True # Rasterize for smaller file size
)
```
### Scatter Function Parameters
```python
ax.scatter(x, y,
s=50, # Size (scalar or array)
c='blue', # Color (scalar, array, or sequence)
marker='o', # Marker style
cmap='viridis', # Colormap (if c is numeric)
alpha=0.5, # Transparency
edgecolors='black', # Edge color
linewidths=1, # Edge width
vmin=0, vmax=1, # Color scale limits
label='data' # Legend label
)
```
### Text Parameters
```python
ax.text(x, y, text,
fontsize=12, # Font size
fontweight='normal', # 'normal', 'bold', 'heavy', 'light'
fontstyle='normal', # 'normal', 'italic', 'oblique'
fontfamily='sans-serif',# Font family
color='black', # Text color
alpha=1.0, # Transparency
ha='center', # Horizontal alignment: 'left', 'center', 'right'
va='center', # Vertical alignment: 'top', 'center', 'bottom', 'baseline'
rotation=0, # Rotation angle in degrees
bbox=dict( # Background box
facecolor='white',
edgecolor='black',
boxstyle='round'
)
)
```
## rcParams Configuration
Common rcParams settings for global customization:
```python
# Font settings
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica']
plt.rcParams['font.size'] = 12
# Figure settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.bbox'] = 'tight'
# Axes settings
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.grid'] = True
# Line settings
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.markersize'] = 8
# Tick settings
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['xtick.direction'] = 'in' # 'in', 'out', 'inout'
plt.rcParams['ytick.direction'] = 'in'
# Legend settings
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['legend.frameon'] = True
plt.rcParams['legend.framealpha'] = 0.8
# Grid settings
plt.rcParams['grid.alpha'] = 0.3
plt.rcParams['grid.linestyle'] = '--'
```
## GridSpec for Complex Layouts
```python
from matplotlib.gridspec import GridSpec
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(3, 3, figure=fig, hspace=0.3, wspace=0.3)
# Span multiple cells
ax1 = fig.add_subplot(gs[0, :]) # Top row, all columns
ax2 = fig.add_subplot(gs[1:, 0]) # Bottom two rows, first column
ax3 = fig.add_subplot(gs[1, 1:]) # Middle row, last two columns
ax4 = fig.add_subplot(gs[2, 1]) # Bottom row, middle column
ax5 = fig.add_subplot(gs[2, 2]) # Bottom row, right column
```
## 3D Plotting
```python
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot types
ax.plot(x, y, z) # 3D line
ax.scatter(x, y, z) # 3D scatter
ax.plot_surface(X, Y, Z) # 3D surface
ax.plot_wireframe(X, Y, Z) # 3D wireframe
ax.contour(X, Y, Z) # 3D contour
ax.bar3d(x, y, z, dx, dy, dz) # 3D bar
# Customization
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.view_init(elev=30, azim=45) # Set viewing angle
```
## Animation
```python
from matplotlib.animation import FuncAnimation
fig, ax = plt.subplots()
line, = ax.plot([], [])
def init():
ax.set_xlim(0, 2*np.pi)
ax.set_ylim(-1, 1)
return line,
def update(frame):
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x + frame/10)
line.set_data(x, y)
return line,
anim = FuncAnimation(fig, update, init_func=init,
frames=100, interval=50, blit=True)
# Save animation
anim.save('animation.gif', writer='pillow', fps=20)
anim.save('animation.mp4', writer='ffmpeg', fps=20)
```
## Image Operations
```python
# Read and display image
img = plt.imread('image.png')
ax.imshow(img)
# Display matrix as image
ax.imshow(matrix, cmap='viridis', aspect='auto',
interpolation='nearest', origin='lower')
# Colorbar
cbar = plt.colorbar(im, ax=ax)
cbar.set_label('Values')
# Image extent (set coordinates)
ax.imshow(img, extent=[x_min, x_max, y_min, y_max])
```
## Event Handling
```python
# Mouse click event
def on_click(event):
if event.inaxes:
print(f'Clicked at x={event.xdata:.2f}, y={event.ydata:.2f}')
fig.canvas.mpl_connect('button_press_event', on_click)
# Key press event
def on_key(event):
print(f'Key pressed: {event.key}')
fig.canvas.mpl_connect('key_press_event', on_key)
```
## Useful Utilities
```python
# Get current axis limits
xlims = ax.get_xlim()
ylims = ax.get_ylim()
# Set equal aspect ratio
ax.set_aspect('equal', adjustable='box')
# Share axes between subplots
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
# Twin axes (two y-axes)
ax2 = ax1.twinx()
# Remove tick labels
ax.set_xticklabels([])
ax.set_yticklabels([])
# Scientific notation
ax.ticklabel_format(style='scientific', axis='y', scilimits=(0,0))
# Date formatting
import matplotlib.dates as mdates
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
```

View File

@@ -0,0 +1,563 @@
# Matplotlib Common Issues and Solutions
Troubleshooting guide for frequently encountered matplotlib problems.
## Display and Backend Issues
### Issue: Plots Not Showing
**Problem:** `plt.show()` doesn't display anything
**Solutions:**
```python
# 1. Check if backend is properly set (for interactive use)
import matplotlib
print(matplotlib.get_backend())
# 2. Try different backends
matplotlib.use('TkAgg') # or 'Qt5Agg', 'MacOSX'
import matplotlib.pyplot as plt
# 3. In Jupyter notebooks, use magic command
%matplotlib inline # Static images
# or
%matplotlib widget # Interactive plots
# 4. Ensure plt.show() is called
plt.plot([1, 2, 3])
plt.show()
```
### Issue: "RuntimeError: main thread is not in main loop"
**Problem:** Interactive mode issues with threading
**Solution:**
```python
# Switch to non-interactive backend
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
# Or turn off interactive mode
plt.ioff()
```
### Issue: Figures Not Updating Interactively
**Problem:** Changes not reflected in interactive windows
**Solution:**
```python
# Enable interactive mode
plt.ion()
# Draw after each change
plt.plot(x, y)
plt.draw()
plt.pause(0.001) # Brief pause to update display
```
## Layout and Spacing Issues
### Issue: Overlapping Labels and Titles
**Problem:** Labels, titles, or tick labels overlap or get cut off
**Solutions:**
```python
# Solution 1: Constrained layout (RECOMMENDED)
fig, ax = plt.subplots(constrained_layout=True)
# Solution 2: Tight layout
fig, ax = plt.subplots()
plt.tight_layout()
# Solution 3: Adjust margins manually
plt.subplots_adjust(left=0.15, right=0.95, top=0.95, bottom=0.15)
# Solution 4: Save with bbox_inches='tight'
plt.savefig('figure.png', bbox_inches='tight')
# Solution 5: Rotate long tick labels
ax.set_xticklabels(labels, rotation=45, ha='right')
```
### Issue: Colorbar Affects Subplot Size
**Problem:** Adding colorbar shrinks the plot
**Solution:**
```python
# Solution 1: Use constrained layout
fig, ax = plt.subplots(constrained_layout=True)
im = ax.imshow(data)
plt.colorbar(im, ax=ax)
# Solution 2: Manually specify colorbar dimensions
from mpl_toolkits.axes_grid1 import make_axes_locatable
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax)
# Solution 3: For multiple subplots, share colorbar
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax in axes:
im = ax.imshow(data)
fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.95)
```
### Issue: Subplots Too Close Together
**Problem:** Multiple subplots overlapping
**Solution:**
```python
# Solution 1: Use constrained_layout
fig, axes = plt.subplots(2, 2, constrained_layout=True)
# Solution 2: Adjust spacing with subplots_adjust
fig, axes = plt.subplots(2, 2)
plt.subplots_adjust(hspace=0.4, wspace=0.4)
# Solution 3: Specify spacing in tight_layout
plt.tight_layout(h_pad=2.0, w_pad=2.0)
```
## Memory and Performance Issues
### Issue: Memory Leak with Multiple Figures
**Problem:** Memory usage grows when creating many figures
**Solution:**
```python
# Close figures explicitly
fig, ax = plt.subplots()
ax.plot(x, y)
plt.savefig('plot.png')
plt.close(fig) # or plt.close('all')
# Clear current figure without closing
plt.clf()
# Clear current axes
plt.cla()
```
### Issue: Large File Sizes
**Problem:** Saved figures are too large
**Solutions:**
```python
# Solution 1: Reduce DPI
plt.savefig('figure.png', dpi=150) # Instead of 300
# Solution 2: Use rasterization for complex plots
ax.plot(x, y, rasterized=True)
# Solution 3: Use vector format for simple plots
plt.savefig('figure.pdf') # or .svg
# Solution 4: Compress PNG via Pillow's optimizer
plt.savefig('figure.png', dpi=300, pil_kwargs={'optimize': True})
```
### Issue: Slow Plotting with Large Datasets
**Problem:** Plotting takes too long with many points
**Solutions:**
```python
# Solution 1: Downsample data
from scipy.signal import decimate
y_downsampled = decimate(y, 10)  # low-pass filter, then keep every 10th sample
# Solution 2: Use rasterization
ax.plot(x, y, rasterized=True)
# Solution 3: Use line simplification
ax.plot(x, y)
for line in ax.get_lines():
line.set_rasterized(True)
# Solution 4: For scatter plots, consider hexbin or 2d histogram
ax.hexbin(x, y, gridsize=50, cmap='viridis')
```
## Font and Text Issues
### Issue: Font Warnings
**Problem:** "findfont: Font family [...] not found"
**Solutions:**
```python
# Solution 1: Use available fonts
from matplotlib.font_manager import findfont, FontProperties
print(findfont(FontProperties(family='sans-serif')))
# Solution 2: Delete the font cache so it is rebuilt on the next import
# (font_manager._rebuild() was removed in recent matplotlib releases)
import matplotlib, shutil
shutil.rmtree(matplotlib.get_cachedir(), ignore_errors=True)
# Solution 3: Suppress warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
# Solution 4: Specify fallback fonts
plt.rcParams['font.sans-serif'] = ['Arial', 'DejaVu Sans', 'sans-serif']
```
### Issue: LaTeX Rendering Errors
**Problem:** Math text not rendering correctly
**Solutions:**
```python
# Solution 1: Use raw strings with r prefix
ax.set_xlabel(r'$\alpha$') # Not '\alpha'
# Solution 2: Escape backslashes in regular strings
ax.set_xlabel('$\\alpha$')
# Solution 3: Disable LaTeX if not installed
plt.rcParams['text.usetex'] = False
# Solution 4: Use mathtext instead of full LaTeX
# Mathtext is always available, no LaTeX installation needed
ax.text(x, y, r'$\int_0^\infty e^{-x} dx$')
```
### Issue: Text Cut Off or Outside Figure
**Problem:** Labels or annotations appear outside figure bounds
**Solutions:**
```python
# Solution 1: Use bbox_inches='tight'
plt.savefig('figure.png', bbox_inches='tight')
# Solution 2: Adjust figure bounds
plt.subplots_adjust(left=0.15, right=0.85, top=0.85, bottom=0.15)
# Solution 3: Clip text to axes
ax.text(x, y, 'text', clip_on=True)
# Solution 4: Use constrained_layout
fig, ax = plt.subplots(constrained_layout=True)
```
## Color and Colormap Issues
### Issue: Colorbar Not Matching Plot
**Problem:** Colorbar shows different range than data
**Solution:**
```python
# Explicitly set vmin and vmax
im = ax.imshow(data, vmin=0, vmax=1, cmap='viridis')
plt.colorbar(im, ax=ax)
# Or use the same norm for multiple plots
import matplotlib.colors as mcolors
norm = mcolors.Normalize(vmin=data.min(), vmax=data.max())
im1 = ax1.imshow(data1, norm=norm, cmap='viridis')
im2 = ax2.imshow(data2, norm=norm, cmap='viridis')
```
### Issue: Colors Look Wrong
**Problem:** Unexpected colors in plots
**Solutions:**
```python
# Solution 1: Check color specification format
ax.plot(x, y, color='blue') # Correct
ax.plot(x, y, color=(0, 0, 1)) # Correct RGB
ax.plot(x, y, color='#0000FF') # Correct hex
# Solution 2: Verify colormap exists
print(plt.colormaps()) # List available colormaps
# Solution 3: For scatter plots, ensure c shape matches
ax.scatter(x, y, c=colors) # colors should have same length as x, y
# Solution 4: Check if alpha is set correctly
ax.plot(x, y, alpha=1.0) # 0=transparent, 1=opaque
```
### Issue: Reversed Colormap
**Problem:** Colormap direction is backwards
**Solution:**
```python
# Add _r suffix to reverse any colormap
ax.imshow(data, cmap='viridis_r')
```
## Axis and Scale Issues
### Issue: Axis Limits Not Working
**Problem:** `set_xlim` or `set_ylim` not taking effect
**Solutions:**
```python
# Solution 1: Set after plotting
ax.plot(x, y)
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)
# Solution 2: Disable autoscaling
ax.autoscale(False)
ax.set_xlim(0, 10)
# Solution 3: Use axis method
ax.axis([xmin, xmax, ymin, ymax])
```
### Issue: Log Scale with Zero or Negative Values
**Problem:** ValueError when using log scale with data ≤ 0
**Solutions:**
```python
# Solution 1: Filter out non-positive values
mask = (data > 0)
ax.plot(x[mask], data[mask])
ax.set_yscale('log')
# Solution 2: Use symlog for data with positive and negative values
ax.set_yscale('symlog')
# Solution 3: Add small offset
ax.plot(x, data + 1e-10)
ax.set_yscale('log')
```
### Issue: Dates Not Displaying Correctly
**Problem:** Date axis shows numbers instead of dates
**Solution:**
```python
import matplotlib.dates as mdates
import pandas as pd
# Convert to datetime if needed
dates = pd.to_datetime(date_strings)
ax.plot(dates, values)
# Format date axis
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
plt.xticks(rotation=45)
```
## Legend Issues
### Issue: Legend Covers Data
**Problem:** Legend obscures important parts of plot
**Solutions:**
```python
# Solution 1: Use 'best' location
ax.legend(loc='best')
# Solution 2: Place outside plot area
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# Solution 3: Make legend semi-transparent
ax.legend(framealpha=0.7)
# Solution 4: Put legend below plot
ax.legend(bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=3)
```
### Issue: Too Many Items in Legend
**Problem:** Legend is cluttered with many entries
**Solutions:**
```python
# Solution 1: Only label selected items
for i, (x, y) in enumerate(data):
label = f'Data {i}' if i % 5 == 0 else None
ax.plot(x, y, label=label)
# Solution 2: Use multiple columns
ax.legend(ncol=3)
# Solution 3: Create custom legend with fewer entries
from matplotlib.lines import Line2D
custom_lines = [Line2D([0], [0], color='r'),
Line2D([0], [0], color='b')]
ax.legend(custom_lines, ['Category A', 'Category B'])
# Solution 4: Use separate legend figure
fig_leg = plt.figure(figsize=(3, 2))
ax_leg = fig_leg.add_subplot(111)
ax_leg.legend(*ax.get_legend_handles_labels(), loc='center')
ax_leg.axis('off')
```
## 3D Plot Issues
### Issue: 3D Plots Look Flat
**Problem:** Difficult to perceive depth in 3D plots
**Solutions:**
```python
# Solution 1: Adjust viewing angle
ax.view_init(elev=30, azim=45)
# Solution 2: Add gridlines
ax.grid(True)
# Solution 3: Use color for depth
scatter = ax.scatter(x, y, z, c=z, cmap='viridis')
# Solution 4: Rotate interactively (if using interactive backend)
# User can click and drag to rotate
```
### Issue: 3D Axis Labels Cut Off
**Problem:** 3D axis labels appear outside figure
**Solution:**
```python
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z)
# Add padding
fig.tight_layout(pad=3.0)
# Or save with tight bounding box
plt.savefig('3d_plot.png', bbox_inches='tight', pad_inches=0.5)
```
## Image and Colorbar Issues
### Issue: Images Appear Flipped
**Problem:** Image orientation is wrong
**Solution:**
```python
# Set origin parameter
ax.imshow(img, origin='lower') # or 'upper' (default)
# Or flip array
ax.imshow(np.flipud(img))
```
### Issue: Images Look Pixelated
**Problem:** Image appears blocky when zoomed
**Solutions:**
```python
# Solution 1: Use interpolation
ax.imshow(img, interpolation='bilinear')
# Options: 'nearest', 'bilinear', 'bicubic', 'spline16', 'spline36', etc.
# Solution 2: Increase DPI when saving
plt.savefig('figure.png', dpi=300)
# Solution 3: Use vector format if appropriate
plt.savefig('figure.pdf')
```
## Common Errors and Fixes
### "TypeError: 'AxesSubplot' object is not subscriptable"
**Problem:** Trying to index single axes
```python
# Wrong
fig, ax = plt.subplots()
ax[0].plot(x, y) # Error!
# Correct
fig, ax = plt.subplots()
ax.plot(x, y)
```
### "ValueError: x and y must have same first dimension"
**Problem:** Data arrays have mismatched lengths
```python
# Check shapes
print(f"x shape: {x.shape}, y shape: {y.shape}")
# Ensure they match
assert len(x) == len(y), "x and y must have same length"
```
### "AttributeError: 'numpy.ndarray' object has no attribute 'plot'"
**Problem:** Calling plot on array instead of axes
```python
# Wrong
data.plot(x, y)
# Correct
ax.plot(x, y)
# or for pandas
data.plot(ax=ax)
```
## Best Practices to Avoid Issues
1. **Always use the OO interface** - Avoid pyplot state machine
```python
fig, ax = plt.subplots() # Good
ax.plot(x, y)
```
2. **Use constrained_layout** - Prevents overlap issues
```python
fig, ax = plt.subplots(constrained_layout=True)
```
3. **Close figures explicitly** - Prevents memory leaks
```python
plt.close(fig)
```
4. **Set figure size at creation** - Better than resizing later
```python
fig, ax = plt.subplots(figsize=(10, 6))
```
5. **Use raw strings for math text** - Avoids escape issues
```python
ax.set_xlabel(r'$\alpha$')
```
6. **Check data shapes before plotting** - Catch size mismatches early
```python
assert len(x) == len(y)
```
7. **Use appropriate DPI** - 300 for print, 150 for web
```python
plt.savefig('figure.png', dpi=300)
```
8. **Test with different backends** - If display issues occur
```python
import matplotlib
matplotlib.use('TkAgg')
```

View File

@@ -0,0 +1,476 @@
# Matplotlib Plot Types Guide
Comprehensive guide to different plot types in matplotlib with examples and use cases.
## 1. Line Plots
**Use cases:** Time series, continuous data, trends, function visualization
### Basic Line Plot
```python
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, linewidth=2, label='Data')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.legend()
```
### Multiple Lines
```python
ax.plot(x, y1, label='Dataset 1', linewidth=2)
ax.plot(x, y2, label='Dataset 2', linewidth=2, linestyle='--')
ax.plot(x, y3, label='Dataset 3', linewidth=2, linestyle=':')
ax.legend()
```
### Line with Markers
```python
ax.plot(x, y, marker='o', markersize=8, linestyle='-',
linewidth=2, markerfacecolor='red', markeredgecolor='black')
```
### Step Plot
```python
ax.step(x, y, where='mid', linewidth=2, label='Step function')
# where options: 'pre', 'post', 'mid'
```
### Error Bars
```python
ax.errorbar(x, y, yerr=error, fmt='o-', linewidth=2,
capsize=5, capthick=2, label='With uncertainty')
```
## 2. Scatter Plots
**Use cases:** Correlations, relationships between variables, clusters, outliers
### Basic Scatter
```python
ax.scatter(x, y, s=50, alpha=0.6)
```
### Sized and Colored Scatter
```python
scatter = ax.scatter(x, y, s=sizes*100, c=colors,
cmap='viridis', alpha=0.6, edgecolors='black')
plt.colorbar(scatter, ax=ax, label='Color variable')
```
### Categorical Scatter
```python
for category in categories:
mask = data['category'] == category
ax.scatter(data[mask]['x'], data[mask]['y'],
label=category, s=50, alpha=0.7)
ax.legend()
```
## 3. Bar Charts
**Use cases:** Categorical comparisons, discrete data, counts
### Vertical Bar Chart
```python
ax.bar(categories, values, color='steelblue',
edgecolor='black', linewidth=1.5)
ax.set_ylabel('Values')
```
### Horizontal Bar Chart
```python
ax.barh(categories, values, color='coral',
edgecolor='black', linewidth=1.5)
ax.set_xlabel('Values')
```
### Grouped Bar Chart
```python
x = np.arange(len(categories))
width = 0.35
ax.bar(x - width/2, values1, width, label='Group 1')
ax.bar(x + width/2, values2, width, label='Group 2')
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.legend()
```
### Stacked Bar Chart
```python
ax.bar(categories, values1, label='Part 1')
ax.bar(categories, values2, bottom=values1, label='Part 2')
ax.bar(categories, values3, bottom=values1+values2, label='Part 3')
ax.legend()
```
### Bar Chart with Error Bars
```python
ax.bar(categories, values, yerr=errors, capsize=5,
color='steelblue', edgecolor='black')
```
### Bar Chart with Patterns
```python
bars1 = ax.bar(x - width/2, values1, width, label='Group 1',
color='white', edgecolor='black', hatch='//')
bars2 = ax.bar(x + width/2, values2, width, label='Group 2',
color='white', edgecolor='black', hatch='\\\\')
```
## 4. Histograms
**Use cases:** Distributions, frequency analysis
### Basic Histogram
```python
ax.hist(data, bins=30, edgecolor='black', alpha=0.7)
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
```
### Multiple Overlapping Histograms
```python
ax.hist(data1, bins=30, alpha=0.5, label='Dataset 1')
ax.hist(data2, bins=30, alpha=0.5, label='Dataset 2')
ax.legend()
```
### Normalized Histogram (Density)
```python
ax.hist(data, bins=30, density=True, alpha=0.7,
edgecolor='black', label='Empirical')
# Overlay theoretical distribution
from scipy.stats import norm
x = np.linspace(data.min(), data.max(), 100)
ax.plot(x, norm.pdf(x, data.mean(), data.std()),
'r-', linewidth=2, label='Normal fit')
ax.legend()
```
### 2D Histogram (Hexbin)
```python
hexbin = ax.hexbin(x, y, gridsize=30, cmap='Blues')
plt.colorbar(hexbin, ax=ax, label='Counts')
```
### 2D Histogram (hist2d)
```python
h = ax.hist2d(x, y, bins=30, cmap='Blues')
plt.colorbar(h[3], ax=ax, label='Counts')
```
## 5. Box and Violin Plots
**Use cases:** Statistical distributions, outlier detection, comparing distributions
### Box Plot
```python
ax.boxplot([data1, data2, data3],
labels=['Group A', 'Group B', 'Group C'],
showmeans=True, meanline=True)
ax.set_ylabel('Values')
```
### Horizontal Box Plot
```python
ax.boxplot([data1, data2, data3], vert=False,
labels=['Group A', 'Group B', 'Group C'])
ax.set_xlabel('Values')
```
### Violin Plot
```python
parts = ax.violinplot([data1, data2, data3],
positions=[1, 2, 3],
showmeans=True, showmedians=True)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(['Group A', 'Group B', 'Group C'])
```
## 6. Heatmaps
**Use cases:** Matrix data, correlations, intensity maps
### Basic Heatmap
```python
im = ax.imshow(matrix, cmap='coolwarm', aspect='auto')
plt.colorbar(im, ax=ax, label='Values')
ax.set_xlabel('X')
ax.set_ylabel('Y')
```
### Heatmap with Annotations
```python
im = ax.imshow(matrix, cmap='coolwarm')
plt.colorbar(im, ax=ax)
# Add text annotations
for i in range(matrix.shape[0]):
for j in range(matrix.shape[1]):
text = ax.text(j, i, f'{matrix[i, j]:.2f}',
ha='center', va='center', color='black')
```
### Correlation Matrix
```python
corr = data.corr()
im = ax.imshow(corr, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(im, ax=ax, label='Correlation')
# Set tick labels
ax.set_xticks(range(len(corr)))
ax.set_yticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticklabels(corr.columns)
```
## 7. Contour Plots
**Use cases:** 3D data on 2D plane, topography, function visualization
### Contour Lines
```python
contour = ax.contour(X, Y, Z, levels=10, cmap='viridis')
ax.clabel(contour, inline=True, fontsize=8)
plt.colorbar(contour, ax=ax)
```
### Filled Contours
```python
contourf = ax.contourf(X, Y, Z, levels=20, cmap='viridis')
plt.colorbar(contourf, ax=ax)
```
### Combined Contours
```python
contourf = ax.contourf(X, Y, Z, levels=20, cmap='viridis', alpha=0.8)
contour = ax.contour(X, Y, Z, levels=10, colors='black',
linewidths=0.5, alpha=0.4)
ax.clabel(contour, inline=True, fontsize=8)
plt.colorbar(contourf, ax=ax)
```
## 8. Pie Charts
**Use cases:** Proportions, percentages (use sparingly)
### Basic Pie Chart
```python
ax.pie(sizes, labels=labels, autopct='%1.1f%%',
startangle=90, colors=colors)
ax.axis('equal') # Equal aspect ratio ensures circular pie
```
### Exploded Pie Chart
```python
explode = (0.1, 0, 0, 0) # Explode first slice
ax.pie(sizes, explode=explode, labels=labels,
autopct='%1.1f%%', shadow=True, startangle=90)
ax.axis('equal')
```
### Donut Chart
```python
ax.pie(sizes, labels=labels, autopct='%1.1f%%',
wedgeprops=dict(width=0.5), startangle=90)
ax.axis('equal')
```
## 9. Polar Plots
**Use cases:** Cyclic data, directional data, radar charts
### Basic Polar Plot
```python
theta = np.linspace(0, 2*np.pi, 100)
r = np.abs(np.sin(2*theta))
ax = plt.subplot(111, projection='polar')
ax.plot(theta, r, linewidth=2)
```
### Radar Chart
```python
categories = ['A', 'B', 'C', 'D', 'E']
values = [4, 3, 5, 2, 4]
# Add first value to the end to close the polygon
angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False)
values_closed = np.concatenate((values, [values[0]]))
angles_closed = np.concatenate((angles, [angles[0]]))
ax = plt.subplot(111, projection='polar')
ax.plot(angles_closed, values_closed, 'o-', linewidth=2)
ax.fill(angles_closed, values_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(categories)
```
## 10. Stream and Quiver Plots
**Use cases:** Vector fields, flow visualization
### Quiver Plot (Vector Field)
```python
ax.quiver(X, Y, U, V, alpha=0.8)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_aspect('equal')
```
### Stream Plot
```python
ax.streamplot(X, Y, U, V, density=1.5, color='k', linewidth=1)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_aspect('equal')
```
## 11. Fill Between
**Use cases:** Uncertainty bounds, confidence intervals, areas under curves
### Fill Between Two Curves
```python
ax.plot(x, y, 'k-', linewidth=2, label='Mean')
ax.fill_between(x, y - std, y + std, alpha=0.3,
label='±1 std dev')
ax.legend()
```
### Fill Between with Condition
```python
ax.plot(x, y1, label='Line 1')
ax.plot(x, y2, label='Line 2')
ax.fill_between(x, y1, y2, where=(y2 >= y1),
alpha=0.3, label='y2 > y1', interpolate=True)
ax.legend()
```
## 12. 3D Plots
**Use cases:** Three-dimensional data visualization
### 3D Scatter
```python
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(x, y, z, c=colors, cmap='viridis',
marker='o', s=50)
plt.colorbar(scatter, ax=ax)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
```
### 3D Surface Plot
```python
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
edgecolor='none', alpha=0.9)
plt.colorbar(surf, ax=ax)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
```
### 3D Wireframe
```python
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X, Y, Z, color='black', linewidth=0.5)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
```
### 3D Contour
```python
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.contour(X, Y, Z, levels=15, cmap='viridis')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
```
## 13. Specialized Plots
### Stem Plot
```python
ax.stem(x, y, linefmt='C0-', markerfmt='C0o', basefmt='k-')
ax.set_xlabel('X')
ax.set_ylabel('Y')
```
### Filled Polygon
```python
vertices = [(0, 0), (1, 0), (1, 1), (0, 1)]
from matplotlib.patches import Polygon
polygon = Polygon(vertices, closed=True, edgecolor='black',
facecolor='lightblue', alpha=0.5)
ax.add_patch(polygon)
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
```
### Staircase Plot
```python
ax.stairs(values, edges, fill=True, alpha=0.5)
```
### Broken Barh (Gantt-style)
```python
ax.broken_barh([(10, 50), (100, 20), (130, 10)], (10, 9),
facecolors='tab:blue')
ax.broken_barh([(10, 20), (50, 50), (120, 30)], (20, 9),
facecolors='tab:orange')
ax.set_ylim(5, 35)
ax.set_xlim(0, 200)
ax.set_xlabel('Time')
ax.set_yticks([15, 25])
ax.set_yticklabels(['Task 1', 'Task 2'])
```
## 14. Time Series Plots
### Basic Time Series
```python
import pandas as pd
import matplotlib.dates as mdates
ax.plot(dates, values, linewidth=2)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
plt.xticks(rotation=45)
ax.set_xlabel('Date')
ax.set_ylabel('Value')
```
### Time Series with Shaded Regions
```python
ax.plot(dates, values, linewidth=2)
# Shade weekends or specific periods
ax.axvspan(start_date, end_date, alpha=0.2, color='gray')
```
## Plot Selection Guide
| Data Type | Recommended Plot | Alternative Options |
|-----------|-----------------|---------------------|
| Single continuous variable | Histogram, KDE | Box plot, Violin plot |
| Two continuous variables | Scatter plot | Hexbin, 2D histogram |
| Time series | Line plot | Area plot, Step plot |
| Categorical vs continuous | Bar chart, Box plot | Violin plot, Strip plot |
| Two categorical variables | Heatmap | Grouped bar chart |
| Three continuous variables | 3D scatter, Contour | Color-coded scatter |
| Proportions | Bar chart | Pie chart (use sparingly) |
| Distributions comparison | Box plot, Violin plot | Overlaid histograms |
| Correlation matrix | Heatmap | Clustered heatmap |
| Vector field | Quiver plot, Stream plot | - |
| Function visualization | Line plot, Contour | 3D surface |
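As a quick illustration of the first two rows of this table, a small helper can fall back from a scatter plot to a hexbin once the sample is large enough that overplotting hides structure. This is only a sketch; the 5,000-point threshold is an arbitrary choice, not a matplotlib default.
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_xy(x, y, dense_threshold=5_000):
    """Scatter for small samples, hexbin once overplotting becomes likely."""
    fig, ax = plt.subplots(figsize=(8, 6), constrained_layout=True)
    if len(x) <= dense_threshold:
        ax.scatter(x, y, s=20, alpha=0.6)
    else:
        hb = ax.hexbin(x, y, gridsize=40, cmap='Blues')
        fig.colorbar(hb, ax=ax, label='Counts')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    return fig, ax

# Example: 50,000 correlated points trigger the hexbin branch
rng = np.random.default_rng(0)
plot_xy(rng.normal(size=50_000), rng.normal(size=50_000))
```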

View File

@@ -0,0 +1,589 @@
# Matplotlib Styling Guide
Comprehensive guide for styling and customizing matplotlib visualizations.
## Colormaps
### Colormap Categories
**1. Perceptually Uniform Sequential**
Best for ordered data that progresses from low to high values.
- `viridis` (default, colorblind-friendly)
- `plasma`
- `inferno`
- `magma`
- `cividis` (optimized for colorblind viewers)
**Usage:**
```python
im = ax.imshow(data, cmap='viridis')
scatter = ax.scatter(x, y, c=values, cmap='plasma')
```
**2. Sequential**
Traditional colormaps for ordered data.
- `Blues`, `Greens`, `Reds`, `Oranges`, `Purples`
- `YlOrBr`, `YlOrRd`, `OrRd`, `PuRd`
- `BuPu`, `GnBu`, `PuBu`, `YlGnBu`
**3. Diverging**
Best for data with a meaningful center point (e.g., zero, mean).
- `coolwarm` (blue to red)
- `RdBu` (red-blue)
- `RdYlBu` (red-yellow-blue)
- `RdYlGn` (red-yellow-green)
- `PiYG`, `PRGn`, `BrBG`, `PuOr`, `RdGy`
**Usage:**
```python
# Center colormap at zero
im = ax.imshow(data, cmap='coolwarm', vmin=-1, vmax=1)
```
**4. Qualitative**
Best for categorical/nominal data without inherent ordering.
- `tab10` (10 distinct colors)
- `tab20` (20 distinct colors)
- `Set1`, `Set2`, `Set3`
- `Pastel1`, `Pastel2`
- `Dark2`, `Accent`, `Paired`
**Usage:**
```python
colors = plt.cm.tab10(np.linspace(0, 1, n_categories))
for i, category in enumerate(categories):
ax.plot(x, y[i], color=colors[i], label=category)
```
**5. Cyclic**
Best for cyclic data (e.g., phase, angle).
- `twilight`
- `twilight_shifted`
- `hsv`
### Colormap Best Practices
1. **Avoid `jet` colormap** - Not perceptually uniform, misleading (see the comparison sketch after this list)
2. **Use perceptually uniform colormaps** - `viridis`, `plasma`, `cividis`
3. **Consider colorblind users** - Use `viridis`, `cividis`, or test with colorblind simulators
4. **Match colormap to data type**:
- Sequential: increasing/decreasing data
- Diverging: data with meaningful center
- Qualitative: categories
5. **Reverse colormaps** - Add `_r` suffix: `viridis_r`, `coolwarm_r`
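To see why practices 1 and 2 matter, a side-by-side render of the same smooth field with `jet` and `viridis` is usually convincing; the banding in `jet` suggests structure that is not in the data. The sketch below is illustrative only:
```python
import numpy as np
import matplotlib.pyplot as plt

# Smooth Gaussian bump: any apparent bands come from the colormap, not the data
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-(x**2 + y**2) / 2)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), constrained_layout=True)
for ax, cmap in zip(axes, ['jet', 'viridis']):
    im = ax.imshow(z, cmap=cmap, origin='lower', extent=(-3, 3, -3, 3))
    ax.set_title(cmap)
    fig.colorbar(im, ax=ax)
```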
### Creating Custom Colormaps
```python
from matplotlib.colors import LinearSegmentedColormap
# From color list
colors = ['blue', 'white', 'red']
n_bins = 100
cmap = LinearSegmentedColormap.from_list('custom', colors, N=n_bins)
# From RGB values
colors = [(0, 0, 1), (1, 1, 1), (1, 0, 0)] # RGB tuples
cmap = LinearSegmentedColormap.from_list('custom', colors)
# Use the custom colormap
ax.imshow(data, cmap=cmap)
```
### Discrete Colormaps
```python
import matplotlib.colors as mcolors
# Create discrete colormap from continuous
cmap = plt.cm.viridis
bounds = np.linspace(0, 10, 11)
norm = mcolors.BoundaryNorm(bounds, cmap.N)
im = ax.imshow(data, cmap=cmap, norm=norm)
```
## Style Sheets
### Using Built-in Styles
```python
# List available styles
print(plt.style.available)
# Apply a style
plt.style.use('seaborn-v0_8-darkgrid')
# Apply multiple styles (later styles override earlier ones)
plt.style.use(['seaborn-v0_8-whitegrid', 'seaborn-v0_8-poster'])
# Temporarily use a style
with plt.style.context('ggplot'):
fig, ax = plt.subplots()
ax.plot(x, y)
```
### Popular Built-in Styles
- `default` - Matplotlib's default style
- `classic` - Classic matplotlib look (pre-2.0)
- `seaborn-v0_8-*` - Seaborn-inspired styles
- `seaborn-v0_8-darkgrid`, `seaborn-v0_8-whitegrid`
- `seaborn-v0_8-dark`, `seaborn-v0_8-white`
- `seaborn-v0_8-ticks`, `seaborn-v0_8-poster`, `seaborn-v0_8-talk`
- `ggplot` - ggplot2-inspired style
- `bmh` - Bayesian Methods for Hackers style
- `fivethirtyeight` - FiveThirtyEight style
- `grayscale` - Grayscale style
### Creating Custom Style Sheets
Create a file named `custom_style.mplstyle`:
```
# custom_style.mplstyle
# Figure
figure.figsize: 10, 6
figure.dpi: 100
figure.facecolor: white
# Font
font.family: sans-serif
font.sans-serif: Arial, Helvetica
font.size: 12
# Axes
axes.labelsize: 14
axes.titlesize: 16
axes.facecolor: white
axes.edgecolor: black
axes.linewidth: 1.5
axes.grid: True
axes.axisbelow: True
# Grid
grid.color: gray
grid.linestyle: --
grid.linewidth: 0.5
grid.alpha: 0.3
# Lines
lines.linewidth: 2
lines.markersize: 8
# Ticks
xtick.labelsize: 10
ytick.labelsize: 10
xtick.direction: in
ytick.direction: in
xtick.major.size: 6
ytick.major.size: 6
xtick.minor.size: 3
ytick.minor.size: 3
# Legend
legend.fontsize: 12
legend.frameon: True
legend.framealpha: 0.8
legend.fancybox: True
# Savefig
savefig.dpi: 300
savefig.bbox: tight
savefig.facecolor: white
```
Load and use:
```python
plt.style.use('path/to/custom_style.mplstyle')
```
## rcParams Configuration
### Global Configuration
```python
import matplotlib.pyplot as plt
# Configure globally
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 14
# Or update multiple at once
plt.rcParams.update({
'figure.figsize': (10, 6),
'font.size': 12,
'axes.labelsize': 14,
'axes.titlesize': 16,
'lines.linewidth': 2
})
```
### Temporary Configuration
```python
# Context manager for temporary changes
with plt.rc_context({'font.size': 14, 'lines.linewidth': 2.5}):
fig, ax = plt.subplots()
ax.plot(x, y)
```
### Common rcParams
**Figure settings:**
```python
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['figure.edgecolor'] = 'white'
plt.rcParams['figure.autolayout'] = False
plt.rcParams['figure.constrained_layout.use'] = True
```
**Font settings:**
```python
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica', 'DejaVu Sans']
plt.rcParams['font.size'] = 12
plt.rcParams['font.weight'] = 'normal'
```
**Axes settings:**
```python
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.edgecolor'] = 'black'
plt.rcParams['axes.linewidth'] = 1.5
plt.rcParams['axes.grid'] = True
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelweight'] = 'normal'
plt.rcParams['axes.spines.top'] = True
plt.rcParams['axes.spines.right'] = True
```
**Line settings:**
```python
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.linestyle'] = '-'
plt.rcParams['lines.marker'] = 'None'
plt.rcParams['lines.markersize'] = 6
```
**Save settings:**
```python
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.format'] = 'png'
plt.rcParams['savefig.bbox'] = 'tight'
plt.rcParams['savefig.pad_inches'] = 0.1
plt.rcParams['savefig.transparent'] = False
```
## Color Palettes
### Named Color Sets
```python
# Tableau colors
tableau_colors = plt.cm.tab10.colors
# CSS4 colors (subset)
css_colors = ['steelblue', 'coral', 'teal', 'goldenrod', 'crimson']
# Manual definition
custom_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
```
### Color Cycles
```python
# Set default color cycle
from cycler import cycler
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
plt.rcParams['axes.prop_cycle'] = cycler(color=colors)
# Or combine color and line style
plt.rcParams['axes.prop_cycle'] = cycler(color=colors) + cycler(linestyle=['-', '--', ':', '-.'])
```
### Palette Generation
```python
# Evenly spaced colors from colormap
n_colors = 5
colors = plt.cm.viridis(np.linspace(0, 1, n_colors))
# Use in plot
for i, (x, y) in enumerate(data):
ax.plot(x, y, color=colors[i])
```
## Typography
### Font Configuration
```python
# Set font family
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman', 'DejaVu Serif']
# Or sans-serif
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica']
# Or monospace
plt.rcParams['font.family'] = 'monospace'
plt.rcParams['font.monospace'] = ['Courier New', 'DejaVu Sans Mono']
```
### Font Properties in Text
```python
from matplotlib import font_manager
# Specify font properties
ax.text(x, y, 'Text',
fontsize=14,
fontweight='bold', # 'normal', 'bold', 'heavy', 'light'
fontstyle='italic', # 'normal', 'italic', 'oblique'
fontfamily='serif')
# Use specific font file
prop = font_manager.FontProperties(fname='path/to/font.ttf')
ax.text(x, y, 'Text', fontproperties=prop)
```
### Mathematical Text
```python
# LaTeX-style math
ax.set_title(r'$\alpha > \beta$')
ax.set_xlabel(r'$\mu \pm \sigma$')
ax.text(x, y, r'$\int_0^\infty e^{-x} dx = 1$')
# Subscripts and superscripts
ax.set_ylabel(r'$y = x^2 + 2x + 1$')
ax.text(x, y, r'$x_1, x_2, \ldots, x_n$')
# Greek letters
ax.text(x, y, r'$\alpha, \beta, \gamma, \delta, \epsilon$')
```
### Using Full LaTeX
```python
# Enable full LaTeX rendering (requires LaTeX installation)
plt.rcParams['text.usetex'] = True
plt.rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'
ax.set_title(r'\textbf{Bold Title}')
ax.set_xlabel(r'Time $t$ (s)')
```
## Spines and Grids
### Spine Customization
```python
# Hide specific spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Move spine position
ax.spines['left'].set_position(('outward', 10))
ax.spines['bottom'].set_position(('data', 0))
# Change spine color and width
ax.spines['left'].set_color('red')
ax.spines['bottom'].set_linewidth(2)
```
### Grid Customization
```python
# Basic grid
ax.grid(True)
# Customized grid
ax.grid(True, which='major', linestyle='--', linewidth=0.8, alpha=0.3)
ax.grid(True, which='minor', linestyle=':', linewidth=0.5, alpha=0.2)
# Grid for specific axis
ax.grid(True, axis='x') # Only vertical lines
ax.grid(True, axis='y') # Only horizontal lines
# Grid behind or in front of data
ax.set_axisbelow(True) # Grid behind data
```
## Legend Customization
### Legend Positioning
```python
# Location strings
ax.legend(loc='best') # Automatic best position
ax.legend(loc='upper right')
ax.legend(loc='upper left')
ax.legend(loc='lower right')
ax.legend(loc='lower left')
ax.legend(loc='center')
ax.legend(loc='upper center')
ax.legend(loc='lower center')
ax.legend(loc='center left')
ax.legend(loc='center right')
# Precise positioning (bbox_to_anchor)
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Outside plot area
ax.legend(bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=3) # Below plot
```
### Legend Styling
```python
ax.legend(
fontsize=12,
frameon=True, # Show frame
framealpha=0.9, # Frame transparency
fancybox=True, # Rounded corners
shadow=True, # Shadow effect
ncol=2, # Number of columns
title='Legend Title', # Legend title
title_fontsize=14, # Title font size
edgecolor='black', # Frame edge color
facecolor='white' # Frame background color
)
```
### Custom Legend Entries
```python
from matplotlib.lines import Line2D
# Create custom legend handles
custom_lines = [Line2D([0], [0], color='red', lw=2),
Line2D([0], [0], color='blue', lw=2, linestyle='--'),
Line2D([0], [0], marker='o', color='w', markerfacecolor='green', markersize=10)]
ax.legend(custom_lines, ['Label 1', 'Label 2', 'Label 3'])
```
## Layout and Spacing
### Constrained Layout
```python
# Preferred method (automatic adjustment)
fig, axes = plt.subplots(2, 2, constrained_layout=True)
```
### Tight Layout
```python
# Alternative method
fig, axes = plt.subplots(2, 2)
plt.tight_layout(pad=1.5, h_pad=2.0, w_pad=2.0)
```
### Manual Adjustment
```python
# Fine-grained control
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1,
hspace=0.3, wspace=0.4)
```
## Professional Publication Style
Example configuration for publication-quality figures:
```python
# Publication style configuration
plt.rcParams.update({
# Figure
'figure.figsize': (8, 6),
'figure.dpi': 100,
'savefig.dpi': 300,
'savefig.bbox': 'tight',
'savefig.pad_inches': 0.1,
# Font
'font.family': 'sans-serif',
'font.sans-serif': ['Arial', 'Helvetica'],
'font.size': 11,
# Axes
'axes.labelsize': 12,
'axes.titlesize': 14,
'axes.linewidth': 1.5,
'axes.grid': False,
'axes.spines.top': False,
'axes.spines.right': False,
# Lines
'lines.linewidth': 2,
'lines.markersize': 8,
# Ticks
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'xtick.major.size': 6,
'ytick.major.size': 6,
'xtick.major.width': 1.5,
'ytick.major.width': 1.5,
'xtick.direction': 'in',
'ytick.direction': 'in',
# Legend
'legend.fontsize': 10,
'legend.frameon': True,
'legend.framealpha': 1.0,
'legend.edgecolor': 'black'
})
```
## Dark Theme
```python
# Dark background style
plt.style.use('dark_background')
# Or manual configuration
plt.rcParams.update({
'figure.facecolor': '#1e1e1e',
'axes.facecolor': '#1e1e1e',
'axes.edgecolor': 'white',
'axes.labelcolor': 'white',
'text.color': 'white',
'xtick.color': 'white',
'ytick.color': 'white',
'grid.color': 'gray',
'legend.facecolor': '#1e1e1e',
'legend.edgecolor': 'white'
})
```
## Color Accessibility
### Colorblind-Friendly Palettes
```python
# Use colorblind-friendly colormaps
colorblind_friendly = ['viridis', 'plasma', 'cividis']
# Colorblind-friendly discrete colors
cb_colors = ['#0173B2', '#DE8F05', '#029E73', '#CC78BC',
'#CA9161', '#949494', '#ECE133', '#56B4E9']
# Test with simulation tools or use these validated palettes
```
### High Contrast
```python
# Ensure sufficient contrast
plt.rcParams['axes.edgecolor'] = 'black'
plt.rcParams['axes.linewidth'] = 2
plt.rcParams['xtick.major.width'] = 2
plt.rcParams['ytick.major.width'] = 2
```

View File

@@ -0,0 +1,401 @@
#!/usr/bin/env python3
"""
Matplotlib Plot Template
Comprehensive template demonstrating various plot types and best practices.
Use this as a starting point for creating publication-quality visualizations.
Usage:
python plot_template.py [--plot-type TYPE] [--style STYLE] [--output FILE]
Plot types:
line, scatter, bar, histogram, heatmap, contour, box, violin, 3d, all
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import argparse
def set_publication_style():
"""Configure matplotlib for publication-quality figures."""
plt.rcParams.update({
'figure.figsize': (10, 6),
'figure.dpi': 100,
'savefig.dpi': 300,
'savefig.bbox': 'tight',
'font.size': 11,
'axes.labelsize': 12,
'axes.titlesize': 14,
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'legend.fontsize': 10,
'lines.linewidth': 2,
'axes.linewidth': 1.5,
})
def generate_sample_data():
"""Generate sample data for demonstrations."""
np.random.seed(42)
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
scatter_x = np.random.randn(200)
scatter_y = np.random.randn(200)
categories = ['A', 'B', 'C', 'D', 'E']
bar_values = np.random.randint(10, 100, len(categories))
hist_data = np.random.normal(0, 1, 1000)
matrix = np.random.rand(10, 10)
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = np.sin(np.sqrt(X**2 + Y**2))
return {
'x': x, 'y1': y1, 'y2': y2,
'scatter_x': scatter_x, 'scatter_y': scatter_y,
'categories': categories, 'bar_values': bar_values,
'hist_data': hist_data, 'matrix': matrix,
'X': X, 'Y': Y, 'Z': Z
}
def create_line_plot(data, ax=None):
"""Create line plot with best practices."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
ax.plot(data['x'], data['y1'], label='sin(x)', linewidth=2, marker='o',
markevery=10, markersize=6)
ax.plot(data['x'], data['y2'], label='cos(x)', linewidth=2, linestyle='--')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Line Plot Example')
ax.legend(loc='best', framealpha=0.9)
ax.grid(True, alpha=0.3, linestyle='--')
# Remove top and right spines for cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_scatter_plot(data, ax=None):
"""Create scatter plot with color and size variations."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
# Color based on distance from origin
colors = np.sqrt(data['scatter_x']**2 + data['scatter_y']**2)
sizes = 50 * (1 + np.abs(data['scatter_x']))
scatter = ax.scatter(data['scatter_x'], data['scatter_y'],
c=colors, s=sizes, alpha=0.6,
cmap='viridis', edgecolors='black', linewidth=0.5)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Scatter Plot Example')
ax.grid(True, alpha=0.3, linestyle='--')
# Add colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Distance from origin')
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_bar_chart(data, ax=None):
"""Create bar chart with error bars and styling."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
x_pos = np.arange(len(data['categories']))
errors = np.random.randint(5, 15, len(data['categories']))
bars = ax.bar(x_pos, data['bar_values'], yerr=errors,
color='steelblue', edgecolor='black', linewidth=1.5,
capsize=5, alpha=0.8)
# Color bars by value
colors = plt.cm.viridis(data['bar_values'] / data['bar_values'].max())
for bar, color in zip(bars, colors):
bar.set_facecolor(color)
ax.set_xlabel('Category')
ax.set_ylabel('Values')
ax.set_title('Bar Chart Example')
ax.set_xticks(x_pos)
ax.set_xticklabels(data['categories'])
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
# Remove top and right spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_histogram(data, ax=None):
"""Create histogram with density overlay."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
n, bins, patches = ax.hist(data['hist_data'], bins=30, density=True,
alpha=0.7, edgecolor='black', color='steelblue')
# Overlay theoretical normal distribution
from scipy.stats import norm
mu, std = norm.fit(data['hist_data'])
x_theory = np.linspace(data['hist_data'].min(), data['hist_data'].max(), 100)
ax.plot(x_theory, norm.pdf(x_theory, mu, std), 'r-', linewidth=2,
label=f'Normal fit (μ={mu:.2f}, σ={std:.2f})')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Histogram with Normal Fit')
ax.legend()
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_heatmap(data, ax=None):
"""Create heatmap with colorbar and annotations."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 8), constrained_layout=True)
im = ax.imshow(data['matrix'], cmap='coolwarm', aspect='auto',
vmin=0, vmax=1)
# Add colorbar
cbar = plt.colorbar(im, ax=ax)
cbar.set_label('Value')
# Optional: Add text annotations
# for i in range(data['matrix'].shape[0]):
# for j in range(data['matrix'].shape[1]):
# text = ax.text(j, i, f'{data["matrix"][i, j]:.2f}',
# ha='center', va='center', color='black', fontsize=8)
ax.set_xlabel('X Index')
ax.set_ylabel('Y Index')
ax.set_title('Heatmap Example')
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_contour_plot(data, ax=None):
"""Create contour plot with filled contours and labels."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 8), constrained_layout=True)
# Filled contours
contourf = ax.contourf(data['X'], data['Y'], data['Z'],
levels=20, cmap='viridis', alpha=0.8)
# Contour lines
contour = ax.contour(data['X'], data['Y'], data['Z'],
levels=10, colors='black', linewidths=0.5, alpha=0.4)
# Add labels to contour lines
ax.clabel(contour, inline=True, fontsize=8)
# Add colorbar
cbar = plt.colorbar(contourf, ax=ax)
cbar.set_label('Z value')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Contour Plot Example')
ax.set_aspect('equal')
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_box_plot(data, ax=None):
"""Create box plot comparing distributions."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
# Generate multiple distributions
box_data = [np.random.normal(0, std, 100) for std in range(1, 5)]
bp = ax.boxplot(box_data, labels=['Group 1', 'Group 2', 'Group 3', 'Group 4'],
patch_artist=True, showmeans=True,
boxprops=dict(facecolor='lightblue', edgecolor='black'),
medianprops=dict(color='red', linewidth=2),
meanprops=dict(marker='D', markerfacecolor='green', markersize=8))
ax.set_xlabel('Groups')
ax.set_ylabel('Values')
ax.set_title('Box Plot Example')
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_violin_plot(data, ax=None):
"""Create violin plot showing distribution shapes."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
# Generate multiple distributions
violin_data = [np.random.normal(0, std, 100) for std in range(1, 5)]
parts = ax.violinplot(violin_data, positions=range(1, 5),
showmeans=True, showmedians=True)
# Customize colors
for pc in parts['bodies']:
pc.set_facecolor('lightblue')
pc.set_alpha(0.7)
pc.set_edgecolor('black')
ax.set_xlabel('Groups')
ax.set_ylabel('Values')
ax.set_title('Violin Plot Example')
ax.set_xticks(range(1, 5))
ax.set_xticklabels(['Group 1', 'Group 2', 'Group 3', 'Group 4'])
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
    # Return the parent figure (covers both a newly created and a passed-in ax)
    return ax.figure
def create_3d_plot():
"""Create 3D surface plot."""
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
# Generate data
X = np.linspace(-5, 5, 50)
Y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))
# Create surface plot
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
edgecolor='none', alpha=0.9)
# Add colorbar
fig.colorbar(surf, ax=ax, shrink=0.5)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('3D Surface Plot Example')
# Set viewing angle
ax.view_init(elev=30, azim=45)
plt.tight_layout()
return fig
def create_comprehensive_figure():
"""Create a comprehensive figure with multiple subplots."""
data = generate_sample_data()
fig = plt.figure(figsize=(16, 12), constrained_layout=True)
gs = GridSpec(3, 3, figure=fig)
# Create subplots
ax1 = fig.add_subplot(gs[0, :2]) # Line plot - top left, spans 2 columns
create_line_plot(data, ax1)
ax2 = fig.add_subplot(gs[0, 2]) # Bar chart - top right
create_bar_chart(data, ax2)
ax3 = fig.add_subplot(gs[1, 0]) # Scatter plot - middle left
create_scatter_plot(data, ax3)
ax4 = fig.add_subplot(gs[1, 1]) # Histogram - middle center
create_histogram(data, ax4)
ax5 = fig.add_subplot(gs[1, 2]) # Box plot - middle right
create_box_plot(data, ax5)
ax6 = fig.add_subplot(gs[2, :2]) # Contour plot - bottom left, spans 2 columns
create_contour_plot(data, ax6)
ax7 = fig.add_subplot(gs[2, 2]) # Heatmap - bottom right
create_heatmap(data, ax7)
fig.suptitle('Comprehensive Matplotlib Template', fontsize=18, fontweight='bold')
return fig
def main():
"""Main function to run the template."""
parser = argparse.ArgumentParser(description='Matplotlib plot template')
parser.add_argument('--plot-type', type=str, default='all',
choices=['line', 'scatter', 'bar', 'histogram', 'heatmap',
'contour', 'box', 'violin', '3d', 'all'],
help='Type of plot to create')
parser.add_argument('--style', type=str, default='default',
help='Matplotlib style to use')
parser.add_argument('--output', type=str, default='plot.png',
help='Output filename')
args = parser.parse_args()
# Set style
if args.style != 'default':
plt.style.use(args.style)
else:
set_publication_style()
# Generate data
data = generate_sample_data()
# Create plot based on type
plot_functions = {
'line': create_line_plot,
'scatter': create_scatter_plot,
'bar': create_bar_chart,
'histogram': create_histogram,
'heatmap': create_heatmap,
'contour': create_contour_plot,
'box': create_box_plot,
'violin': create_violin_plot,
}
if args.plot_type == '3d':
fig = create_3d_plot()
elif args.plot_type == 'all':
fig = create_comprehensive_figure()
else:
fig = plot_functions[args.plot_type](data)
# Save figure
plt.savefig(args.output, dpi=300, bbox_inches='tight')
print(f"Plot saved to {args.output}")
# Display
plt.show()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,409 @@
#!/usr/bin/env python3
"""
Matplotlib Style Configurator
Interactive utility to configure matplotlib style preferences and generate
custom style sheets. Creates a preview of the style and optionally saves
it as a .mplstyle file.
Usage:
python style_configurator.py [--preset PRESET] [--output FILE] [--preview]
Presets:
publication, presentation, web, dark, minimal
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import argparse
import os
# Predefined style presets
STYLE_PRESETS = {
'publication': {
'figure.figsize': (8, 6),
'figure.dpi': 100,
'savefig.dpi': 300,
'savefig.bbox': 'tight',
'font.family': 'sans-serif',
'font.sans-serif': ['Arial', 'Helvetica'],
'font.size': 11,
'axes.labelsize': 12,
'axes.titlesize': 14,
'axes.linewidth': 1.5,
'axes.grid': False,
'axes.spines.top': False,
'axes.spines.right': False,
'lines.linewidth': 2,
'lines.markersize': 8,
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'xtick.direction': 'in',
'ytick.direction': 'in',
'xtick.major.size': 6,
'ytick.major.size': 6,
'xtick.major.width': 1.5,
'ytick.major.width': 1.5,
'legend.fontsize': 10,
'legend.frameon': True,
'legend.framealpha': 1.0,
'legend.edgecolor': 'black',
},
'presentation': {
'figure.figsize': (12, 8),
'figure.dpi': 100,
'savefig.dpi': 150,
'font.size': 16,
'axes.labelsize': 20,
'axes.titlesize': 24,
'axes.linewidth': 2,
'lines.linewidth': 3,
'lines.markersize': 12,
'xtick.labelsize': 16,
'ytick.labelsize': 16,
'legend.fontsize': 16,
'axes.grid': True,
'grid.alpha': 0.3,
},
'web': {
'figure.figsize': (10, 6),
'figure.dpi': 96,
'savefig.dpi': 150,
'font.size': 11,
'axes.labelsize': 12,
'axes.titlesize': 14,
'lines.linewidth': 2,
'axes.grid': True,
'grid.alpha': 0.2,
'grid.linestyle': '--',
},
'dark': {
'figure.facecolor': '#1e1e1e',
'figure.edgecolor': '#1e1e1e',
'axes.facecolor': '#1e1e1e',
'axes.edgecolor': 'white',
'axes.labelcolor': 'white',
'text.color': 'white',
'xtick.color': 'white',
'ytick.color': 'white',
'grid.color': 'gray',
'grid.alpha': 0.3,
'axes.grid': True,
'legend.facecolor': '#1e1e1e',
'legend.edgecolor': 'white',
'savefig.facecolor': '#1e1e1e',
},
'minimal': {
'figure.figsize': (10, 6),
'axes.spines.top': False,
'axes.spines.right': False,
'axes.spines.left': False,
'axes.spines.bottom': False,
'axes.grid': False,
'xtick.bottom': True,
'ytick.left': True,
'axes.axisbelow': True,
'lines.linewidth': 2.5,
'font.size': 12,
}
}
def generate_preview_data():
"""Generate sample data for style preview."""
np.random.seed(42)
x = np.linspace(0, 10, 100)
y1 = np.sin(x) + 0.1 * np.random.randn(100)
y2 = np.cos(x) + 0.1 * np.random.randn(100)
scatter_x = np.random.randn(100)
scatter_y = 2 * scatter_x + np.random.randn(100)
categories = ['A', 'B', 'C', 'D', 'E']
bar_values = [25, 40, 30, 55, 45]
return {
'x': x, 'y1': y1, 'y2': y2,
'scatter_x': scatter_x, 'scatter_y': scatter_y,
'categories': categories, 'bar_values': bar_values
}
def create_style_preview(style_dict=None):
"""Create a preview figure demonstrating the style."""
if style_dict:
plt.rcParams.update(style_dict)
data = generate_preview_data()
fig = plt.figure(figsize=(14, 10))
gs = GridSpec(2, 2, figure=fig, hspace=0.3, wspace=0.3)
# Line plot
ax1 = fig.add_subplot(gs[0, 0])
ax1.plot(data['x'], data['y1'], label='sin(x)', marker='o', markevery=10)
ax1.plot(data['x'], data['y2'], label='cos(x)', linestyle='--')
ax1.set_xlabel('X axis')
ax1.set_ylabel('Y axis')
ax1.set_title('Line Plot')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Scatter plot
ax2 = fig.add_subplot(gs[0, 1])
colors = np.sqrt(data['scatter_x']**2 + data['scatter_y']**2)
scatter = ax2.scatter(data['scatter_x'], data['scatter_y'],
c=colors, cmap='viridis', alpha=0.6, s=50)
ax2.set_xlabel('X axis')
ax2.set_ylabel('Y axis')
ax2.set_title('Scatter Plot')
cbar = plt.colorbar(scatter, ax=ax2)
cbar.set_label('Distance')
ax2.grid(True, alpha=0.3)
# Bar chart
ax3 = fig.add_subplot(gs[1, 0])
bars = ax3.bar(data['categories'], data['bar_values'],
edgecolor='black', linewidth=1)
# Color bars with gradient
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(bars)))
for bar, color in zip(bars, colors):
bar.set_facecolor(color)
ax3.set_xlabel('Categories')
ax3.set_ylabel('Values')
ax3.set_title('Bar Chart')
ax3.grid(True, axis='y', alpha=0.3)
# Multiple line plot with fills
ax4 = fig.add_subplot(gs[1, 1])
ax4.plot(data['x'], data['y1'], label='Signal 1', linewidth=2)
ax4.fill_between(data['x'], data['y1'] - 0.2, data['y1'] + 0.2,
alpha=0.3, label='±1 std')
ax4.plot(data['x'], data['y2'], label='Signal 2', linewidth=2)
ax4.fill_between(data['x'], data['y2'] - 0.2, data['y2'] + 0.2,
alpha=0.3)
ax4.set_xlabel('X axis')
ax4.set_ylabel('Y axis')
ax4.set_title('Time Series with Uncertainty')
ax4.legend()
ax4.grid(True, alpha=0.3)
fig.suptitle('Style Preview', fontsize=16, fontweight='bold')
return fig
def save_style_file(style_dict, filename):
"""Save style dictionary as .mplstyle file."""
with open(filename, 'w') as f:
f.write("# Custom matplotlib style\n")
f.write("# Generated by style_configurator.py\n\n")
# Group settings by category
categories = {
'Figure': ['figure.'],
'Font': ['font.'],
'Axes': ['axes.'],
'Lines': ['lines.'],
'Markers': ['markers.'],
'Ticks': ['tick.', 'xtick.', 'ytick.'],
'Grid': ['grid.'],
'Legend': ['legend.'],
'Savefig': ['savefig.'],
'Text': ['text.'],
}
for category, prefixes in categories.items():
category_items = {k: v for k, v in style_dict.items()
if any(k.startswith(p) for p in prefixes)}
if category_items:
f.write(f"# {category}\n")
for key, value in sorted(category_items.items()):
# Format value appropriately
if isinstance(value, (list, tuple)):
value_str = ', '.join(str(v) for v in value)
elif isinstance(value, bool):
value_str = str(value)
else:
value_str = str(value)
f.write(f"{key}: {value_str}\n")
f.write("\n")
print(f"Style saved to {filename}")
def print_style_info(style_dict):
"""Print information about the style."""
print("\n" + "="*60)
print("STYLE CONFIGURATION")
print("="*60)
categories = {
'Figure Settings': ['figure.'],
'Font Settings': ['font.'],
'Axes Settings': ['axes.'],
'Line Settings': ['lines.'],
'Grid Settings': ['grid.'],
'Legend Settings': ['legend.'],
}
for category, prefixes in categories.items():
category_items = {k: v for k, v in style_dict.items()
if any(k.startswith(p) for p in prefixes)}
if category_items:
print(f"\n{category}:")
for key, value in sorted(category_items.items()):
print(f" {key}: {value}")
print("\n" + "="*60 + "\n")
def list_available_presets():
"""Print available style presets."""
print("\nAvailable style presets:")
print("-" * 40)
descriptions = {
'publication': 'Optimized for academic publications',
'presentation': 'Large fonts for presentations',
'web': 'Optimized for web display',
'dark': 'Dark background theme',
'minimal': 'Minimal, clean style',
}
for preset, desc in descriptions.items():
print(f" {preset:15s} - {desc}")
print("-" * 40 + "\n")
def interactive_mode():
"""Run interactive mode to customize style settings."""
print("\n" + "="*60)
print("MATPLOTLIB STYLE CONFIGURATOR - Interactive Mode")
print("="*60)
list_available_presets()
preset = input("Choose a preset to start from (or 'custom' for default): ").strip().lower()
if preset in STYLE_PRESETS:
style_dict = STYLE_PRESETS[preset].copy()
print(f"\nStarting from '{preset}' preset")
else:
style_dict = {}
print("\nStarting from default matplotlib style")
print("\nCommon settings you might want to customize:")
print(" 1. Figure size")
print(" 2. Font sizes")
print(" 3. Line widths")
print(" 4. Grid settings")
print(" 5. Color scheme")
print(" 6. Done, show preview")
while True:
choice = input("\nSelect option (1-6): ").strip()
if choice == '1':
width = input(" Figure width (inches, default 10): ").strip() or '10'
height = input(" Figure height (inches, default 6): ").strip() or '6'
style_dict['figure.figsize'] = (float(width), float(height))
elif choice == '2':
base = input(" Base font size (default 12): ").strip() or '12'
style_dict['font.size'] = float(base)
style_dict['axes.labelsize'] = float(base) + 2
style_dict['axes.titlesize'] = float(base) + 4
elif choice == '3':
lw = input(" Line width (default 2): ").strip() or '2'
style_dict['lines.linewidth'] = float(lw)
elif choice == '4':
grid = input(" Enable grid? (y/n): ").strip().lower()
style_dict['axes.grid'] = grid == 'y'
if style_dict['axes.grid']:
alpha = input(" Grid transparency (0-1, default 0.3): ").strip() or '0.3'
style_dict['grid.alpha'] = float(alpha)
elif choice == '5':
print(" Theme options: 1=Light, 2=Dark")
theme = input(" Select theme (1-2): ").strip()
if theme == '2':
style_dict.update(STYLE_PRESETS['dark'])
elif choice == '6':
break
return style_dict
def main():
"""Main function."""
parser = argparse.ArgumentParser(
description='Matplotlib style configurator',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Show available presets
python style_configurator.py --list
# Preview a preset
python style_configurator.py --preset publication --preview
# Save a preset as .mplstyle file
python style_configurator.py --preset publication --output my_style.mplstyle
# Interactive mode
python style_configurator.py --interactive
"""
)
parser.add_argument('--preset', type=str, choices=list(STYLE_PRESETS.keys()),
help='Use a predefined style preset')
parser.add_argument('--output', type=str,
help='Save style to .mplstyle file')
parser.add_argument('--preview', action='store_true',
help='Show style preview')
parser.add_argument('--list', action='store_true',
help='List available presets')
parser.add_argument('--interactive', action='store_true',
help='Run in interactive mode')
args = parser.parse_args()
if args.list:
list_available_presets()
# Also show currently available matplotlib styles
print("\nBuilt-in matplotlib styles:")
print("-" * 40)
for style in sorted(plt.style.available):
print(f" {style}")
return
if args.interactive:
style_dict = interactive_mode()
elif args.preset:
style_dict = STYLE_PRESETS[args.preset].copy()
print(f"Using '{args.preset}' preset")
else:
print("No preset or interactive mode specified. Showing default preview.")
style_dict = {}
if style_dict:
print_style_info(style_dict)
if args.output:
save_style_file(style_dict, args.output)
if args.preview or args.interactive:
print("Creating style preview...")
fig = create_style_preview(style_dict if style_dict else None)
if args.output:
preview_filename = args.output.replace('.mplstyle', '_preview.png')
plt.savefig(preview_filename, dpi=150, bbox_inches='tight')
print(f"Preview saved to {preview_filename}")
plt.show()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,398 @@
---
name: medchem
description: Python library for molecular filtering and prioritization in drug discovery. Use when applying medicinal chemistry rules (Rule of Five, CNS, leadlike), detecting structural alerts (PAINS, NIBR, Lilly demerits), analyzing chemical groups, calculating molecular complexity, or filtering compound libraries. Works with SMILES strings and RDKit mol objects, with built-in parallelization for large datasets.
---
# Medchem
## Overview
Medchem is a Python library for molecular filtering and prioritization in drug discovery workflows. It provides hundreds of well-established and novel molecular filters, structural alerts, and medicinal chemistry rules to efficiently triage and prioritize compound libraries at scale.
**Key Principle:** Rules and filters are always context-specific. Avoid blindly applying filters—marketed drugs often don't pass standard medchem filters, and prodrugs may intentionally violate rules. Use these tools as guidelines combined with domain expertise.
## Installation
Install medchem via conda or pip:
```bash
# Via conda
micromamba install -c conda-forge medchem
# Via pip
pip install medchem
```
## Core Capabilities
### 1. Medicinal Chemistry Rules
Apply established drug-likeness rules to molecules using the `medchem.rules` module.
**Available Rules:**
- Rule of Five (Lipinski)
- Rule of Oprea
- Rule of CNS
- Rule of leadlike (soft and strict)
- Rule of three
- Rule of Reos
- Rule of drug
- Rule of Veber
- Golden triangle
- PAINS filters
**Single Rule Application:**
```python
import medchem as mc
# Apply Rule of Five to a SMILES string
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O" # Aspirin
passes = mc.rules.basic_rules.rule_of_five(smiles)
# Returns: True
# Check specific rules
passes_oprea = mc.rules.basic_rules.rule_of_oprea(smiles)
passes_cns = mc.rules.basic_rules.rule_of_cns(smiles)
```
**Multiple Rules with RuleFilters:**
```python
import datamol as dm
import medchem as mc
# Load molecules
mols = [dm.to_mol(smiles) for smiles in smiles_list]
# Create filter with multiple rules
rfilter = mc.rules.RuleFilters(
rule_list=[
"rule_of_five",
"rule_of_oprea",
"rule_of_cns",
"rule_of_leadlike_soft"
]
)
# Apply filters with parallelization
results = rfilter(
mols=mols,
n_jobs=-1, # Use all CPU cores
progress=True
)
```
**Result Format:**
Results are returned as dictionaries with pass/fail status and detailed information for each rule.
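For a quick overview, the per-rule results can be tabulated. The snippet below assumes `results` from the call above behaves like a mapping of rule name to per-molecule values; adapt it if your medchem version already returns a DataFrame.
```python
import pandas as pd

# Tabulate pass/fail per rule (assumes dict-like results as described above)
summary = pd.DataFrame(results)
print(summary.head())
# Fraction of molecules passing each rule (boolean columns average to a rate)
print(summary.mean(numeric_only=True))
```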
### 2. Structural Alert Filters
Detect potentially problematic structural patterns using the `medchem.structural` module.
**Available Filters:**
1. **Common Alerts** - General structural alerts derived from ChEMBL curation and literature
2. **NIBR Filters** - Novartis Institutes for BioMedical Research filter set
3. **Lilly Demerits** - Eli Lilly's demerit-based system (275 rules, molecules rejected at >100 demerits)
**Common Alerts:**
```python
import medchem as mc
# Create filter
alert_filter = mc.structural.CommonAlertsFilters()
# Check single molecule
mol = dm.to_mol("c1ccccc1")
has_alerts, details = alert_filter.check_mol(mol)
# Batch filtering with parallelization
results = alert_filter(
mols=mol_list,
n_jobs=-1,
progress=True
)
```
**NIBR Filters:**
```python
import medchem as mc
# Apply NIBR filters
nibr_filter = mc.structural.NIBRFilters()
results = nibr_filter(mols=mol_list, n_jobs=-1)
```
**Lilly Demerits:**
```python
import medchem as mc
# Calculate Lilly demerits
lilly = mc.structural.LillyDemeritsFilters()
results = lilly(mols=mol_list, n_jobs=-1)
# Each result includes demerit score and whether it passes (≤100 demerits)
```
### 3. Functional API for High-Level Operations
The `medchem.functional` module provides convenient functions for common workflows.
**Quick Filtering:**
```python
import medchem as mc
# Apply NIBR filters to a list
filter_ok = mc.functional.nibr_filter(
mols=mol_list,
n_jobs=-1
)
# Apply common alerts
alert_results = mc.functional.common_alerts_filter(
mols=mol_list,
n_jobs=-1
)
```
### 4. Chemical Groups Detection
Identify specific chemical groups and functional groups using `medchem.groups`.
**Available Groups:**
- Hinge binders
- Phosphate binders
- Michael acceptors
- Reactive groups
- Custom SMARTS patterns
**Usage:**
```python
import medchem as mc
# Create group detector
group = mc.groups.ChemicalGroup(groups=["hinge_binders"])
# Check for matches
has_matches = group.has_match(mol_list)
# Get detailed match information
matches = group.get_matches(mol)
```
### 5. Named Catalogs
Access curated collections of chemical structures through `medchem.catalogs`.
**Available Catalogs:**
- Functional groups
- Protecting groups
- Common reagents
- Standard fragments
**Usage:**
```python
import medchem as mc
# Access named catalogs
catalogs = mc.catalogs.NamedCatalogs
# Use catalog for matching
catalog = catalogs.get("functional_groups")
matches = catalog.get_matches(mol)
```
### 6. Molecular Complexity
Calculate complexity metrics that approximate synthetic accessibility using `medchem.complexity`.
**Common Metrics:**
- Bertz complexity
- Whitlock complexity
- Barone complexity
**Usage:**
```python
import medchem as mc
# Calculate complexity
complexity_score = mc.complexity.calculate_complexity(mol)
# Filter by complexity threshold
complex_filter = mc.complexity.ComplexityFilter(max_complexity=500)
results = complex_filter(mols=mol_list)
```
### 7. Constraints Filtering
Apply custom property-based constraints using `medchem.constraints`.
**Example Constraints:**
- Molecular weight ranges
- LogP bounds
- TPSA limits
- Rotatable bond counts
**Usage:**
```python
import medchem as mc
# Define constraints
constraints = mc.constraints.Constraints(
mw_range=(200, 500),
logp_range=(-2, 5),
tpsa_max=140,
rotatable_bonds_max=10
)
# Apply constraints
results = constraints(mols=mol_list, n_jobs=-1)
```
### 8. Medchem Query Language
Use a specialized query language for complex filtering criteria.
**Query Examples:**
```
# Molecules passing Ro5 AND not having common alerts
"rule_of_five AND NOT common_alerts"
# CNS-like molecules with low complexity
"rule_of_cns AND complexity < 400"
# Leadlike molecules without Lilly demerits
"rule_of_leadlike AND lilly_demerits == 0"
```
**Usage:**
```python
import medchem as mc
# Parse and apply query
query = mc.query.parse("rule_of_five AND NOT common_alerts")
results = query.apply(mols=mol_list, n_jobs=-1)
```
## Workflow Patterns
### Pattern 1: Initial Triage of Compound Library
Filter a large compound collection to identify drug-like candidates.
```python
import datamol as dm
import medchem as mc
import pandas as pd
# Load compound library
df = pd.read_csv("compounds.csv")
mols = [dm.to_mol(smi) for smi in df["smiles"]]
# Apply primary filters
rule_filter = mc.rules.RuleFilters(rule_list=["rule_of_five", "rule_of_veber"])
rule_results = rule_filter(mols=mols, n_jobs=-1, progress=True)
# Apply structural alerts
alert_filter = mc.structural.CommonAlertsFilters()
alert_results = alert_filter(mols=mols, n_jobs=-1, progress=True)
# Combine results
df["passes_rules"] = rule_results["pass"]
df["has_alerts"] = alert_results["has_alerts"]
df["drug_like"] = df["passes_rules"] & ~df["has_alerts"]
# Save filtered compounds
filtered_df = df[df["drug_like"]]
filtered_df.to_csv("filtered_compounds.csv", index=False)
```
### Pattern 2: Lead Optimization Filtering
Apply stricter criteria during lead optimization.
```python
import medchem as mc
# Create comprehensive filter
filters = {
"rules": mc.rules.RuleFilters(rule_list=["rule_of_leadlike_strict"]),
"alerts": mc.structural.NIBRFilters(),
"lilly": mc.structural.LillyDemeritsFilters(),
"complexity": mc.complexity.ComplexityFilter(max_complexity=400)
}
# Apply all filters
results = {}
for name, filt in filters.items():
results[name] = filt(mols=candidate_mols, n_jobs=-1)
# Identify compounds that pass every filter. Assumes each filter result exposes
# a per-molecule "pass" sequence; adapt the lookup to each filter's actual format.
passes_all = [
    all(results[name]["pass"][i] for name in filters)
    for i in range(len(candidate_mols))
]
```
### Pattern 3: Identify Specific Chemical Groups
Find molecules containing specific functional groups or scaffolds.
```python
import medchem as mc
# Create group detector for multiple groups
group_detector = mc.groups.ChemicalGroup(
groups=["hinge_binders", "phosphate_binders"]
)
# Screen library
matches = group_detector.get_all_matches(mol_list)
# Filter molecules with desired groups
mol_with_groups = [mol for mol, match in zip(mol_list, matches) if match]
```
## Best Practices
1. **Context Matters**: Don't blindly apply filters. Understand the biological target and chemical space.
2. **Combine Multiple Filters**: Use rules, structural alerts, and domain knowledge together for better decisions.
3. **Use Parallelization**: For large datasets (>1000 molecules), always use `n_jobs=-1` for parallel processing.
4. **Iterative Refinement**: Start with broad filters (Ro5), then apply more specific criteria (CNS, leadlike) as needed (see the staged-funnel sketch after this list).
5. **Document Filtering Decisions**: Track which molecules were filtered out and why for reproducibility.
6. **Validate Results**: Remember that marketed drugs often fail standard filters—use these as guidelines, not absolute rules.
7. **Consider Prodrugs**: Molecules designed as prodrugs may intentionally violate standard medicinal chemistry rules.
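The staged funnel referenced in practice 4 could look like the sketch below. It chains the calls shown earlier and logs the surviving count at each stage; the result lookups follow the formats documented in this skill's API reference and may need adapting to your medchem version.
```python
import datamol as dm
import medchem as mc

def staged_funnel(smiles_list):
    """Filter broad-to-strict and log how many molecules survive each stage."""
    mols = [dm.to_mol(s) for s in smiles_list]

    # Stage 1: broad drug-likeness rules (dict-like result with a "pass" column)
    rule_results = mc.rules.RuleFilters(rule_list=["rule_of_five"])(mols=mols, n_jobs=-1)
    mols = [m for m, ok in zip(mols, rule_results["pass"]) if ok]
    print(f"after rule_of_five: {len(mols)}")

    # Stage 2: common structural alerts (list of dicts with "has_alerts")
    alert_results = mc.structural.CommonAlertsFilters()(mols=mols, n_jobs=-1)
    mols = [m for m, r in zip(mols, alert_results) if not r["has_alerts"]]
    print(f"after common alerts: {len(mols)}")

    # Stage 3: Lilly demerits, the strictest gate (list of dicts with "passes")
    lilly_results = mc.structural.LillyDemeritsFilters()(mols=mols, n_jobs=-1)
    mols = [m for m, r in zip(mols, lilly_results) if r["passes"]]
    print(f"after Lilly demerits: {len(mols)}")
    return mols
```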
## Resources
### references/api_guide.md
Comprehensive API reference covering all medchem modules with detailed function signatures, parameters, and return types.
### references/rules_catalog.md
Complete catalog of available rules, filters, and alerts with descriptions, thresholds, and literature references.
### scripts/filter_molecules.py
Production-ready script for batch filtering workflows. Supports multiple input formats (CSV, SDF, SMILES), configurable filter combinations, and detailed reporting.
**Usage:**
```bash
python scripts/filter_molecules.py input.csv --rules rule_of_five,rule_of_cns --alerts nibr --output filtered.csv
```
## Documentation
Official documentation: https://medchem-docs.datamol.io/
GitHub repository: https://github.com/datamol-io/medchem

View File

@@ -0,0 +1,600 @@
# Medchem API Reference
Comprehensive reference for all medchem modules and functions.
## Module: medchem.rules
### Class: RuleFilters
Filter molecules based on multiple medicinal chemistry rules.
**Constructor:**
```python
RuleFilters(rule_list: List[str])
```
**Parameters:**
- `rule_list`: List of rule names to apply. See available rules below.
**Methods:**
```python
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> Dict
```
- `mols`: List of RDKit molecule objects
- `n_jobs`: Number of parallel jobs (-1 uses all cores)
- `progress`: Show progress bar
- **Returns**: Dictionary with results for each rule
**Example:**
```python
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_five", "rule_of_cns"])
results = rfilter(mols=mol_list, n_jobs=-1, progress=True)
```
### Module: medchem.rules.basic_rules
Individual rule functions that can be applied to single molecules.
#### rule_of_five()
```python
rule_of_five(mol: Union[str, Chem.Mol]) -> bool
```
Lipinski's Rule of Five for oral bioavailability.
**Criteria:**
- Molecular weight ≤ 500 Da
- LogP ≤ 5
- H-bond donors ≤ 5
- H-bond acceptors ≤ 10
**Parameters:**
- `mol`: SMILES string or RDKit molecule object
**Returns:** True if molecule passes all criteria
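For orientation, the four thresholds above can be reproduced with plain RDKit descriptors. This is a sketch of the criteria, not medchem's internal implementation (the exact descriptor choices, e.g. for LogP, may differ).
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def ro5_sketch(smiles: str) -> bool:
    """Check Lipinski's four thresholds with standard RDKit descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return (
        Descriptors.MolWt(mol) <= 500
        and Descriptors.MolLogP(mol) <= 5
        and Lipinski.NumHDonors(mol) <= 5
        and Lipinski.NumHAcceptors(mol) <= 10
    )

print(ro5_sketch("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin -> True
```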
#### rule_of_three()
```python
rule_of_three(mol: Union[str, Chem.Mol]) -> bool
```
Rule of Three for fragment screening libraries.
**Criteria:**
- Molecular weight ≤ 300 Da
- LogP ≤ 3
- H-bond donors ≤ 3
- H-bond acceptors ≤ 3
- Rotatable bonds ≤ 3
- Polar surface area ≤ 60 Ų
#### rule_of_oprea()
```python
rule_of_oprea(mol: Union[str, Chem.Mol]) -> bool
```
Oprea's lead-like criteria for hit-to-lead optimization.
**Criteria:**
- Molecular weight: 200-350 Da
- LogP: -2 to 4
- Rotatable bonds ≤ 7
- Rings ≤ 4
#### rule_of_cns()
```python
rule_of_cns(mol: Union[str, Chem.Mol]) -> bool
```
CNS drug-likeness rules.
**Criteria:**
- Molecular weight ≤ 450 Da
- LogP: -1 to 5
- H-bond donors ≤ 2
- TPSA ≤ 90 Ų
#### rule_of_leadlike_soft()
```python
rule_of_leadlike_soft(mol: Union[str, Chem.Mol]) -> bool
```
Soft lead-like criteria (more permissive).
**Criteria:**
- Molecular weight: 250-450 Da
- LogP: -3 to 4
- Rotatable bonds ≤ 10
#### rule_of_leadlike_strict()
```python
rule_of_leadlike_strict(mol: Union[str, Chem.Mol]) -> bool
```
Strict lead-like criteria (more restrictive).
**Criteria:**
- Molecular weight: 200-350 Da
- LogP: -2 to 3.5
- Rotatable bonds ≤ 7
- Rings: 1-3
#### rule_of_veber()
```python
rule_of_veber(mol: Union[str, Chem.Mol]) -> bool
```
Veber's rules for oral bioavailability.
**Criteria:**
- Rotatable bonds ≤ 10
- TPSA ≤ 140 Ų
#### rule_of_reos()
```python
rule_of_reos(mol: Union[str, Chem.Mol]) -> bool
```
Rapid Elimination Of Swill (REOS) filter.
**Criteria:**
- Molecular weight: 200-500 Da
- LogP: -5 to 5
- H-bond donors: 0-5
- H-bond acceptors: 0-10
#### rule_of_drug()
```python
rule_of_drug(mol: Union[str, Chem.Mol]) -> bool
```
Combined drug-likeness criteria.
**Criteria:**
- Passes Rule of Five
- Passes Veber rules
- No PAINS substructures
#### golden_triangle()
```python
golden_triangle(mol: Union[str, Chem.Mol]) -> bool
```
Golden Triangle for drug-likeness balance.
**Criteria:**
- 200 ≤ MW ≤ 50×LogP + 400
- LogP: -2 to 5
#### pains_filter()
```python
pains_filter(mol: Union[str, Chem.Mol]) -> bool
```
Pan Assay INterference compoundS (PAINS) filter.
**Returns:** True if molecule does NOT contain PAINS substructures
---
## Module: medchem.structural
### Class: CommonAlertsFilters
Filter for common structural alerts derived from ChEMBL and literature.
**Constructor:**
```python
CommonAlertsFilters()
```
**Methods:**
```python
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> List[Dict]
```
Apply common alerts filter to a list of molecules.
**Returns:** List of dictionaries with keys:
- `has_alerts`: Boolean indicating if molecule has alerts
- `alert_details`: List of matched alert patterns
- `num_alerts`: Number of alerts found
```python
check_mol(mol: Chem.Mol) -> Tuple[bool, List[str]]
```
Check a single molecule for structural alerts.
**Returns:** Tuple of (has_alerts, list_of_alert_names)
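A short sketch combining both methods above; the nitrobenzene SMILES and `mol_list` are illustrative placeholders:
```python
import datamol as dm
import medchem as mc

alert_filter = mc.structural.CommonAlertsFilters()

# Single-molecule check
has_alerts, alert_names = alert_filter.check_mol(dm.to_mol("O=[N+]([O-])c1ccccc1"))

# Batch check across all cores, keeping only alert-free molecules
results = alert_filter(mols=mol_list, n_jobs=-1, progress=True)
clean_mols = [m for m, r in zip(mol_list, results) if not r["has_alerts"]]
```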
### Class: NIBRFilters
Novartis NIBR medicinal chemistry filters.
**Constructor:**
```python
NIBRFilters()
```
**Methods:**
```python
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> List[bool]
```
Apply NIBR filters to molecules.
**Returns:** List of booleans (True if molecule passes)
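For example (a sketch following the call signature above, with `mol_list` an assumed list of RDKit molecules):
```python
nibr = mc.structural.NIBRFilters()
keep = nibr(mols=mol_list, n_jobs=-1, progress=True)
passing_mols = [m for m, ok in zip(mol_list, keep) if ok]
```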
### Class: LillyDemeritsFilters
Eli Lilly's demerit-based structural alert system (275 rules).
**Constructor:**
```python
LillyDemeritsFilters()
```
**Methods:**
```python
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> List[Dict]
```
Calculate Lilly demerits for molecules.
**Returns:** List of dictionaries with keys:
- `demerits`: Total demerit score
- `passes`: Boolean (True if demerits ≤ 100)
- `matched_patterns`: List of matched patterns with scores
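A usage sketch based on the return keys listed above:
```python
lilly = mc.structural.LillyDemeritsFilters()
results = lilly(mols=mol_list, n_jobs=-1, progress=True)

# Keep molecules at or below the 100-demerit threshold
low_demerit_mols = [m for m, r in zip(mol_list, results) if r["passes"]]
```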
---
## Module: medchem.functional
High-level functional API for common operations.
### nibr_filter()
```python
nibr_filter(mols: List[Chem.Mol], n_jobs: int = 1) -> List[bool]
```
Apply NIBR filters using functional API.
**Parameters:**
- `mols`: List of molecules
- `n_jobs`: Parallelization level
**Returns:** List of pass/fail booleans
### common_alerts_filter()
```python
common_alerts_filter(mols: List[Chem.Mol], n_jobs: int = 1) -> List[Dict]
```
Apply common alerts filter using functional API.
**Returns:** List of results dictionaries
### lilly_demerits_filter()
```python
lilly_demerits_filter(mols: List[Chem.Mol], n_jobs: int = 1) -> List[Dict]
```
Calculate Lilly demerits using functional API.
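A combined sketch of the three functional calls, assuming their results mirror the class-based filters above:
```python
import medchem as mc

nibr_pass = mc.functional.nibr_filter(mols=mol_list, n_jobs=-1)
alert_results = mc.functional.common_alerts_filter(mols=mol_list, n_jobs=-1)
lilly_results = mc.functional.lilly_demerits_filter(mols=mol_list, n_jobs=-1)

clean_mols = [
    m for m, ok, alert, lil in zip(mol_list, nibr_pass, alert_results, lilly_results)
    if ok and not alert["has_alerts"] and lil["passes"]
]
```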
---
## Module: medchem.groups
### Class: ChemicalGroup
Detect specific chemical groups in molecules.
**Constructor:**
```python
ChemicalGroup(groups: List[str], custom_smarts: Optional[Dict[str, str]] = None)
```
**Parameters:**
- `groups`: List of predefined group names
- `custom_smarts`: Dictionary mapping custom group names to SMARTS patterns
**Predefined Groups:**
- `"hinge_binders"`: Kinase hinge binding motifs
- `"phosphate_binders"`: Phosphate binding groups
- `"michael_acceptors"`: Michael acceptor electrophiles
- `"reactive_groups"`: General reactive functionalities
**Methods:**
```python
has_match(mols: List[Chem.Mol]) -> List[bool]
```
Check if molecules contain any of the specified groups.
```python
get_matches(mol: Chem.Mol) -> Dict[str, List[Tuple]]
```
Get detailed match information for a single molecule.
**Returns:** Dictionary mapping group names to lists of atom indices
```python
get_all_matches(mols: List[Chem.Mol]) -> List[Dict]
```
Get match information for all molecules.
**Example:**
```python
group = mc.groups.ChemicalGroup(groups=["hinge_binders", "phosphate_binders"])
matches = group.get_all_matches(mol_list)
```
---
## Module: medchem.catalogs
### Class: NamedCatalogs
Access to curated chemical catalogs.
**Available Catalogs:**
- `"functional_groups"`: Common functional groups
- `"protecting_groups"`: Protecting group structures
- `"reagents"`: Common reagents
- `"fragments"`: Standard fragments
**Usage:**
```python
catalog = mc.catalogs.NamedCatalogs.get("functional_groups")
matches = catalog.get_matches(mol)
```
---
## Module: medchem.complexity
Calculate molecular complexity metrics.
### calculate_complexity()
```python
calculate_complexity(mol: Chem.Mol, method: str = "bertz") -> float
```
Calculate complexity score for a molecule.
**Parameters:**
- `mol`: RDKit molecule
- `method`: Complexity metric ("bertz", "whitlock", "barone")
**Returns:** Complexity score (higher = more complex)
### Class: ComplexityFilter
Filter molecules by complexity threshold.
**Constructor:**
```python
ComplexityFilter(max_complexity: float, method: str = "bertz")
```
**Methods:**
```python
__call__(mols: List[Chem.Mol], n_jobs: int = 1) -> List[bool]
```
Filter molecules exceeding complexity threshold.
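For example (a sketch using the function and class documented above; `mol` and `mol_list` are assumed inputs):
```python
import medchem as mc

# Score a single molecule
score = mc.complexity.calculate_complexity(mol, method="bertz")

# Filter a library against a complexity ceiling
cfilter = mc.complexity.ComplexityFilter(max_complexity=400, method="bertz")
keep = cfilter(mols=mol_list, n_jobs=-1)
simple_mols = [m for m, ok in zip(mol_list, keep) if ok]
```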
---
## Module: medchem.constraints
### Class: Constraints
Apply custom property-based constraints.
**Constructor:**
```python
Constraints(
mw_range: Optional[Tuple[float, float]] = None,
logp_range: Optional[Tuple[float, float]] = None,
tpsa_max: Optional[float] = None,
tpsa_range: Optional[Tuple[float, float]] = None,
hbd_max: Optional[int] = None,
hba_max: Optional[int] = None,
rotatable_bonds_max: Optional[int] = None,
rings_range: Optional[Tuple[int, int]] = None,
aromatic_rings_max: Optional[int] = None,
)
```
**Parameters:** All parameters are optional. Specify only the constraints needed.
**Methods:**
```python
__call__(mols: List[Chem.Mol], n_jobs: int = 1) -> List[Dict]
```
Apply constraints to molecules.
**Returns:** List of dictionaries with keys:
- `passes`: Boolean indicating if all constraints pass
- `violations`: List of constraint names that failed
**Example:**
```python
constraints = mc.constraints.Constraints(
mw_range=(200, 500),
logp_range=(-2, 5),
tpsa_max=140
)
results = constraints(mols=mol_list, n_jobs=-1)
```
---
## Module: medchem.query
Query language for complex filtering.
### parse()
```python
parse(query: str) -> Query
```
Parse a medchem query string into a Query object.
**Query Syntax:**
- Operators: `AND`, `OR`, `NOT`
- Comparisons: `<`, `>`, `<=`, `>=`, `==`, `!=`
- Properties: `complexity`, `lilly_demerits`, `mw`, `logp`, `tpsa`
- Rules: `rule_of_five`, `rule_of_cns`, etc.
- Filters: `common_alerts`, `nibr_filter`, `pains_filter`
**Example Queries:**
```python
"rule_of_five AND NOT common_alerts"
"rule_of_cns AND complexity < 400"
"mw > 200 AND mw < 500 AND logp < 5"
"(rule_of_five OR rule_of_oprea) AND NOT pains_filter"
```
### Class: Query
**Methods:**
```python
apply(mols: List[Chem.Mol], n_jobs: int = 1) -> List[bool]
```
Apply parsed query to molecules.
**Example:**
```python
query = mc.query.parse("rule_of_five AND NOT common_alerts")
results = query.apply(mols=mol_list, n_jobs=-1)
passing_mols = [mol for mol, passes in zip(mol_list, results) if passes]
```
---
## Module: medchem.utils
Utility functions for working with molecules.
### batch_process()
```python
batch_process(
mols: List[Chem.Mol],
func: Callable,
n_jobs: int = 1,
progress: bool = False,
batch_size: Optional[int] = None
) -> List
```
Process molecules in parallel batches.
**Parameters:**
- `mols`: List of molecules
- `func`: Function to apply to each molecule
- `n_jobs`: Number of parallel workers
- `progress`: Show progress bar
- `batch_size`: Size of processing batches
### standardize_mol()
```python
standardize_mol(mol: Chem.Mol) -> Chem.Mol
```
Standardize molecule representation (sanitize, neutralize charges, etc.).
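A short sketch pairing the two utilities above to standardize a library in parallel:
```python
import medchem as mc

standardized = mc.utils.batch_process(
    mols=mol_list,
    func=mc.utils.standardize_mol,
    n_jobs=-1,
    progress=True,
)
```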
---
## Common Patterns
### Pattern: Parallel Processing
All filters support parallelization:
```python
# Use all CPU cores
results = filter_object(mols=mol_list, n_jobs=-1, progress=True)
# Use specific number of cores
results = filter_object(mols=mol_list, n_jobs=4, progress=True)
```
### Pattern: Combining Multiple Filters
```python
import medchem as mc
# Apply multiple filters
rule_filter = mc.rules.RuleFilters(rule_list=["rule_of_five"])
alert_filter = mc.structural.CommonAlertsFilters()
lilly_filter = mc.structural.LillyDemeritsFilters()
# Get results
rule_results = rule_filter(mols=mol_list, n_jobs=-1)
alert_results = alert_filter(mols=mol_list, n_jobs=-1)
lilly_results = lilly_filter(mols=mol_list, n_jobs=-1)
# Combine criteria
passing_mols = [
mol for i, mol in enumerate(mol_list)
if rule_results[i]["passes"]
and not alert_results[i]["has_alerts"]
and lilly_results[i]["passes"]
]
```
### Pattern: Working with DataFrames
```python
import pandas as pd
import datamol as dm
import medchem as mc
# Load data
df = pd.read_csv("molecules.csv")
df["mol"] = df["smiles"].apply(dm.to_mol)
# Apply filters
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_five", "rule_of_cns"])
results = rfilter(mols=df["mol"].tolist(), n_jobs=-1)
# Add results to dataframe
df["passes_ro5"] = [r["rule_of_five"] for r in results]
df["passes_cns"] = [r["rule_of_cns"] for r in results]
# Filter dataframe
filtered_df = df[df["passes_ro5"] & df["passes_cns"]]
```

View File

@@ -0,0 +1,604 @@
# Medchem Rules and Filters Catalog
Comprehensive catalog of all available medicinal chemistry rules, structural alerts, and filters in medchem.
## Table of Contents
1. [Drug-Likeness Rules](#drug-likeness-rules)
2. [Lead-Likeness Rules](#lead-likeness-rules)
3. [Fragment Rules](#fragment-rules)
4. [CNS Rules](#cns-rules)
5. [Structural Alert Filters](#structural-alert-filters)
6. [Chemical Group Patterns](#chemical-group-patterns)
---
## Drug-Likeness Rules
### Rule of Five (Lipinski)
**Reference:** Lipinski et al., Adv Drug Deliv Rev (1997) 23:3-25
**Purpose:** Predict oral bioavailability
**Criteria:**
- Molecular Weight ≤ 500 Da
- LogP ≤ 5
- Hydrogen Bond Donors ≤ 5
- Hydrogen Bond Acceptors ≤ 10
**Usage:**
```python
mc.rules.basic_rules.rule_of_five(mol)
```
**Notes:**
- One of the most widely used filters in drug discovery
- About 90% of orally active drugs comply with these rules
- Exceptions exist, especially for natural products and antibiotics
---
### Rule of Veber
**Reference:** Veber et al., J Med Chem (2002) 45:2615-2623
**Purpose:** Additional criteria for oral bioavailability
**Criteria:**
- Rotatable Bonds ≤ 10
- Topological Polar Surface Area (TPSA) ≤ 140 Å²
**Usage:**
```python
mc.rules.basic_rules.rule_of_veber(mol)
```
**Notes:**
- Complements Rule of Five
- TPSA correlates with cell permeability
- Rotatable bonds affect molecular flexibility
---
### Rule of Drug
**Purpose:** Combined drug-likeness assessment
**Criteria:**
- Passes Rule of Five
- Passes Veber rules
- Does not contain PAINS substructures
**Usage:**
```python
mc.rules.basic_rules.rule_of_drug(mol)
```
---
### REOS (Rapid Elimination Of Swill)
**Reference:** Walters & Murcko, Adv Drug Deliv Rev (2002) 54:255-271
**Purpose:** Filter out compounds unlikely to be drugs
**Criteria:**
- Molecular Weight: 200-500 Da
- LogP: -5 to 5
- Hydrogen Bond Donors: 0-5
- Hydrogen Bond Acceptors: 0-10
**Usage:**
```python
mc.rules.basic_rules.rule_of_reos(mol)
```
---
### Golden Triangle
**Reference:** Johnson et al., J Med Chem (2009) 52:5487-5500
**Purpose:** Balance lipophilicity and molecular weight
**Criteria:**
- 200 ≤ MW ≤ 50 × LogP + 400
- LogP: -2 to 5
**Usage:**
```python
mc.rules.basic_rules.golden_triangle(mol)
```
**Notes:**
- Defines optimal physicochemical space
- Visual representation resembles a triangle on MW vs LogP plot
---
## Lead-Likeness Rules
### Rule of Oprea
**Reference:** Oprea et al., J Chem Inf Comput Sci (2001) 41:1308-1315
**Purpose:** Identify lead-like compounds for optimization
**Criteria:**
- Molecular Weight: 200-350 Da
- LogP: -2 to 4
- Rotatable Bonds ≤ 7
- Number of Rings ≤ 4
**Usage:**
```python
mc.rules.basic_rules.rule_of_oprea(mol)
```
**Rationale:** Lead compounds should have "room to grow" during optimization
---
### Rule of Leadlike (Soft)
**Purpose:** Permissive lead-like criteria
**Criteria:**
- Molecular Weight: 250-450 Da
- LogP: -3 to 4
- Rotatable Bonds ≤ 10
**Usage:**
```python
mc.rules.basic_rules.rule_of_leadlike_soft(mol)
```
---
### Rule of Leadlike (Strict)
**Purpose:** Restrictive lead-like criteria
**Criteria:**
- Molecular Weight: 200-350 Da
- LogP: -2 to 3.5
- Rotatable Bonds ≤ 7
- Number of Rings: 1-3
**Usage:**
```python
mc.rules.basic_rules.rule_of_leadlike_strict(mol)
```
---
## Fragment Rules
### Rule of Three
**Reference:** Congreve et al., Drug Discov Today (2003) 8:876-877
**Purpose:** Screen fragment libraries for fragment-based drug discovery
**Criteria:**
- Molecular Weight ≤ 300 Da
- LogP ≤ 3
- Hydrogen Bond Donors ≤ 3
- Hydrogen Bond Acceptors ≤ 3
- Rotatable Bonds ≤ 3
- Polar Surface Area ≤ 60 Å²
**Usage:**
```python
mc.rules.basic_rules.rule_of_three(mol)
```
**Notes:**
- Fragments are grown into leads during optimization
- Lower complexity allows more starting points
---
## CNS Rules
### Rule of CNS
**Purpose:** Central nervous system drug-likeness
**Criteria:**
- Molecular Weight ≤ 450 Da
- LogP: -1 to 5
- Hydrogen Bond Donors ≤ 2
- TPSA ≤ 90 Å²
**Usage:**
```python
mc.rules.basic_rules.rule_of_cns(mol)
```
**Rationale:**
- Blood-brain barrier penetration requires specific properties
- Lower TPSA and HBD count improve BBB permeability
- Tight constraints reflect CNS challenges
---
## Structural Alert Filters
### PAINS (Pan Assay INterference compoundS)
**Reference:** Baell & Holloway, J Med Chem (2010) 53:2719-2740
**Purpose:** Identify compounds that interfere with assays
**Categories:**
- Catechols
- Quinones
- Rhodanines
- Hydroxyphenylhydrazones
- Alkyl/aryl aldehydes
- Michael acceptors (specific patterns)
**Usage:**
```python
mc.rules.basic_rules.pains_filter(mol)
# Returns True if NO PAINS found
```
**Notes:**
- PAINS compounds show activity in multiple assays through non-specific mechanisms
- Common false positives in screening campaigns
- Should be deprioritized in lead selection
---
### Common Alerts Filters
**Source:** Derived from ChEMBL curation and medicinal chemistry literature
**Purpose:** Flag common problematic structural patterns
**Alert Categories:**
1. **Reactive Groups**
- Epoxides
- Aziridines
- Acid halides
- Isocyanates
2. **Metabolic Liabilities**
- Hydrazines
- Thioureas
- Anilines (certain patterns)
3. **Aggregators**
- Polyaromatic systems
- Long aliphatic chains
4. **Toxicophores**
- Nitro aromatics
- Aromatic N-oxides
- Certain heterocycles
**Usage:**
```python
alert_filter = mc.structural.CommonAlertsFilters()
has_alerts, details = alert_filter.check_mol(mol)
```
**Return Format:**
```python
{
"has_alerts": True,
"alert_details": ["reactive_epoxide", "metabolic_hydrazine"],
"num_alerts": 2
}
```
---
### NIBR Filters
**Source:** Novartis Institutes for BioMedical Research
**Purpose:** Industrial medicinal chemistry filtering rules
**Features:**
- Proprietary filter set developed from Novartis experience
- Balances drug-likeness with practical medicinal chemistry
- Includes both structural alerts and property filters
**Usage:**
```python
nibr_filter = mc.structural.NIBRFilters()
results = nibr_filter(mols=mol_list, n_jobs=-1)
```
**Return Format:** Boolean list (True = passes)
---
### Lilly Demerits Filter
**Reference:** Based on Eli Lilly medicinal chemistry rules
**Source:** 275 structural patterns accumulated over 18 years
**Purpose:** Identify assay interference and problematic functionalities
**Mechanism:**
- Each matched pattern adds demerits
- Molecules with >100 demerits are rejected
- Some patterns add 10-50 demerits, others add 100+ (instant rejection)
**Demerit Categories:**
1. **High Demerits (>50):**
- Known toxic groups
- Highly reactive functionalities
- Strong metal chelators
2. **Medium Demerits (20-50):**
- Metabolic liabilities
- Aggregation-prone structures
- Frequent hitters
3. **Low Demerits (5-20):**
- Minor concerns
- Context-dependent issues
**Usage:**
```python
lilly_filter = mc.structural.LillyDemeritsFilters()
results = lilly_filter(mols=mol_list, n_jobs=-1)
```
**Return Format:**
```python
{
"demerits": 35,
"passes": True, # (demerits ≤ 100)
"matched_patterns": [
{"pattern": "phenolic_ester", "demerits": 20},
{"pattern": "aniline_derivative", "demerits": 15}
]
}
```
---
## Chemical Group Patterns
### Hinge Binders
**Purpose:** Identify kinase hinge-binding motifs
**Common Patterns:**
- Aminopyridines
- Aminopyrimidines
- Indazoles
- Benzimidazoles
**Usage:**
```python
group = mc.groups.ChemicalGroup(groups=["hinge_binders"])
has_hinge = group.has_match(mol_list)
```
**Application:** Kinase inhibitor design
---
### Phosphate Binders
**Purpose:** Identify phosphate-binding groups
**Common Patterns:**
- Basic amines in specific geometries
- Guanidinium groups
- Arginine mimetics
**Usage:**
```python
group = mc.groups.ChemicalGroup(groups=["phosphate_binders"])
```
**Application:** Kinase inhibitors, phosphatase inhibitors
---
### Michael Acceptors
**Purpose:** Identify electrophilic Michael acceptor groups
**Common Patterns:**
- α,β-Unsaturated carbonyls
- α,β-Unsaturated nitriles
- Vinyl sulfones
- Acrylamides
**Usage:**
```python
group = mc.groups.ChemicalGroup(groups=["michael_acceptors"])
```
**Notes:**
- Can be desirable for covalent inhibitors
- Often flagged as reactive alerts in screening
---
### Reactive Groups
**Purpose:** Identify generally reactive functionalities
**Common Patterns:**
- Epoxides
- Aziridines
- Acyl halides
- Isocyanates
- Sulfonyl chlorides
**Usage:**
```python
group = mc.groups.ChemicalGroup(groups=["reactive_groups"])
```
---
## Custom SMARTS Patterns
Define custom structural patterns using SMARTS:
```python
custom_patterns = {
"my_warhead": "[C;H0](=O)C(F)(F)F", # Trifluoromethyl ketone
"my_scaffold": "c1ccc2c(c1)ncc(n2)N", # Aminobenzimidazole
}
group = mc.groups.ChemicalGroup(
groups=["hinge_binders"],
custom_smarts=custom_patterns
)
```
---
## Filter Selection Guidelines
### Initial Screening (High-Throughput)
Recommended filters:
- Rule of Five
- PAINS filter
- Common Alerts (permissive settings)
```python
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_five", "pains_filter"])
alert_filter = mc.structural.CommonAlertsFilters()
```
---
### Hit-to-Lead
Recommended filters:
- Rule of Oprea or Leadlike (soft)
- NIBR filters
- Lilly Demerits
```python
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_oprea"])
nibr_filter = mc.structural.NIBRFilters()
lilly_filter = mc.structural.LillyDemeritsFilters()
```
---
### Lead Optimization
Recommended filters:
- Rule of Drug
- Leadlike (strict)
- Full structural alert analysis
- Complexity filters
```python
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_drug", "rule_of_leadlike_strict"])
alert_filter = mc.structural.CommonAlertsFilters()
complexity_filter = mc.complexity.ComplexityFilter(max_complexity=400)
```
---
### CNS Targets
Recommended filters:
- Rule of CNS
- Reduced PAINS criteria (CNS-focused)
- BBB permeability constraints
```python
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_cns"])
constraints = mc.constraints.Constraints(
tpsa_max=90,
hbd_max=2,
mw_range=(300, 450)
)
```
---
### Fragment-Based Drug Discovery
Recommended filters:
- Rule of Three
- Minimal complexity
- Basic reactive group check
```python
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_three"])
complexity_filter = mc.complexity.ComplexityFilter(max_complexity=250)
```
---
## Important Considerations
### False Positives and False Negatives
**Filters are guidelines, not absolutes:**
1. **False Positives** (good drugs flagged):
- ~10% of marketed drugs fail Rule of Five
- Natural products often violate standard rules
- Prodrugs intentionally break rules
- Antibiotics and antivirals frequently non-compliant
2. **False Negatives** (bad compounds passing):
- Passing filters doesn't guarantee success
- Target-specific issues not captured
- In vivo properties not fully predicted
### Context-Specific Application
**Different contexts require different criteria:**
- **Target Class:** Kinases vs GPCRs vs ion channels have different optimal spaces
- **Modality:** Small molecules vs PROTACs vs molecular glues
- **Administration Route:** Oral vs IV vs topical
- **Disease Area:** CNS vs oncology vs infectious disease
- **Stage:** Screening vs hit-to-lead vs lead optimization
### Complementing with Machine Learning
Modern approaches combine rules with ML:
```python
# Rule-based pre-filtering
rule_results = mc.rules.RuleFilters(rule_list=["rule_of_five"])(mols)
filtered_mols = [mol for mol, r in zip(mols, rule_results) if r["passes"]]
# ML model scoring on filtered set
ml_scores = ml_model.predict(filtered_mols)
# Combined decision
final_candidates = [
mol for mol, score in zip(filtered_mols, ml_scores)
if score > threshold
]
```
---
## References
1. Lipinski CA et al. Adv Drug Deliv Rev (1997) 23:3-25
2. Veber DF et al. J Med Chem (2002) 45:2615-2623
3. Oprea TI et al. J Chem Inf Comput Sci (2001) 41:1308-1315
4. Congreve M et al. Drug Discov Today (2003) 8:876-877
5. Baell JB & Holloway GA. J Med Chem (2010) 53:2719-2740
6. Johnson TW et al. J Med Chem (2009) 52:5487-5500
7. Walters WP & Murcko MA. Adv Drug Deliv Rev (2002) 54:255-271
8. Hann MM & Oprea TI. Curr Opin Chem Biol (2004) 8:255-263
9. Rishton GM. Drug Discov Today (1997) 2:382-384

View File

@@ -0,0 +1,418 @@
#!/usr/bin/env python3
"""
Batch molecular filtering using medchem library.
This script provides a production-ready workflow for filtering compound libraries
using medchem rules, structural alerts, and custom constraints.
Usage:
python filter_molecules.py input.csv --rules rule_of_five,rule_of_cns --alerts nibr --output filtered.csv
python filter_molecules.py input.sdf --rules rule_of_drug --lilly --complexity 400 --output results.csv
python filter_molecules.py smiles.txt --nibr --pains --n-jobs -1 --output clean.csv
"""
import argparse
import sys
from pathlib import Path
from typing import List, Dict, Optional, Tuple
import json
try:
import pandas as pd
import datamol as dm
import medchem as mc
from rdkit import Chem
from tqdm import tqdm
except ImportError as e:
print(f"Error: Missing required package: {e}")
print("Install dependencies: pip install medchem datamol pandas tqdm")
sys.exit(1)
def load_molecules(input_file: Path, smiles_column: str = "smiles") -> Tuple[pd.DataFrame, List[Chem.Mol]]:
"""
Load molecules from various file formats.
Supports:
- CSV/TSV with SMILES column
- SDF files
- Plain text files with one SMILES per line
Returns:
Tuple of (DataFrame with metadata, list of RDKit molecules)
"""
suffix = input_file.suffix.lower()
if suffix == ".sdf":
print(f"Loading SDF file: {input_file}")
supplier = Chem.SDMolSupplier(str(input_file))
mols = [mol for mol in supplier if mol is not None]
# Create DataFrame from SDF properties
data = []
for mol in mols:
props = mol.GetPropsAsDict()
props["smiles"] = Chem.MolToSmiles(mol)
data.append(props)
df = pd.DataFrame(data)
elif suffix in [".csv", ".tsv"]:
print(f"Loading CSV/TSV file: {input_file}")
sep = "\t" if suffix == ".tsv" else ","
df = pd.read_csv(input_file, sep=sep)
if smiles_column not in df.columns:
print(f"Error: Column '{smiles_column}' not found in file")
print(f"Available columns: {', '.join(df.columns)}")
sys.exit(1)
print(f"Converting SMILES to molecules...")
mols = [dm.to_mol(smi) for smi in tqdm(df[smiles_column], desc="Parsing")]
elif suffix == ".txt":
print(f"Loading text file: {input_file}")
with open(input_file) as f:
smiles_list = [line.strip() for line in f if line.strip()]
df = pd.DataFrame({"smiles": smiles_list})
print(f"Converting SMILES to molecules...")
mols = [dm.to_mol(smi) for smi in tqdm(smiles_list, desc="Parsing")]
else:
print(f"Error: Unsupported file format: {suffix}")
print("Supported formats: .csv, .tsv, .sdf, .txt")
sys.exit(1)
# Filter out invalid molecules
valid_indices = [i for i, mol in enumerate(mols) if mol is not None]
if len(valid_indices) < len(mols):
n_invalid = len(mols) - len(valid_indices)
print(f"Warning: {n_invalid} invalid molecules removed")
df = df.iloc[valid_indices].reset_index(drop=True)
mols = [mols[i] for i in valid_indices]
print(f"Loaded {len(mols)} valid molecules")
return df, mols
def apply_rule_filters(mols: List[Chem.Mol], rules: List[str], n_jobs: int) -> pd.DataFrame:
"""Apply medicinal chemistry rule filters."""
print(f"\nApplying rule filters: {', '.join(rules)}")
rfilter = mc.rules.RuleFilters(rule_list=rules)
results = rfilter(mols=mols, n_jobs=n_jobs, progress=True)
# Convert to DataFrame
df_results = pd.DataFrame(results)
# Add summary column
df_results["passes_all_rules"] = df_results.all(axis=1)
return df_results
def apply_structural_alerts(mols: List[Chem.Mol], alert_type: str, n_jobs: int) -> pd.DataFrame:
"""Apply structural alert filters."""
print(f"\nApplying {alert_type} structural alerts...")
if alert_type == "common":
alert_filter = mc.structural.CommonAlertsFilters()
results = alert_filter(mols=mols, n_jobs=n_jobs, progress=True)
df_results = pd.DataFrame({
"has_common_alerts": [r["has_alerts"] for r in results],
"num_common_alerts": [r["num_alerts"] for r in results],
"common_alert_details": [", ".join(r["alert_details"]) if r["alert_details"] else "" for r in results]
})
elif alert_type == "nibr":
nibr_filter = mc.structural.NIBRFilters()
results = nibr_filter(mols=mols, n_jobs=n_jobs, progress=True)
df_results = pd.DataFrame({
"passes_nibr": results
})
elif alert_type == "lilly":
lilly_filter = mc.structural.LillyDemeritsFilters()
results = lilly_filter(mols=mols, n_jobs=n_jobs, progress=True)
df_results = pd.DataFrame({
"lilly_demerits": [r["demerits"] for r in results],
"passes_lilly": [r["passes"] for r in results],
"lilly_patterns": [", ".join([p["pattern"] for p in r["matched_patterns"]]) for r in results]
})
elif alert_type == "pains":
results = [mc.rules.basic_rules.pains_filter(mol) for mol in tqdm(mols, desc="PAINS")]
df_results = pd.DataFrame({
"passes_pains": results
})
else:
raise ValueError(f"Unknown alert type: {alert_type}")
return df_results
def apply_complexity_filter(mols: List[Chem.Mol], max_complexity: float, method: str = "bertz") -> pd.DataFrame:
"""Calculate molecular complexity."""
print(f"\nCalculating molecular complexity (method={method}, max={max_complexity})...")
complexity_scores = [
mc.complexity.calculate_complexity(mol, method=method)
for mol in tqdm(mols, desc="Complexity")
]
df_results = pd.DataFrame({
"complexity_score": complexity_scores,
"passes_complexity": [score <= max_complexity for score in complexity_scores]
})
return df_results
def apply_constraints(mols: List[Chem.Mol], constraints: Dict, n_jobs: int) -> pd.DataFrame:
"""Apply custom property constraints."""
print(f"\nApplying constraints: {constraints}")
constraint_filter = mc.constraints.Constraints(**constraints)
results = constraint_filter(mols=mols, n_jobs=n_jobs, progress=True)
df_results = pd.DataFrame({
"passes_constraints": [r["passes"] for r in results],
"constraint_violations": [", ".join(r["violations"]) if r["violations"] else "" for r in results]
})
return df_results
def apply_chemical_groups(mols: List[Chem.Mol], groups: List[str]) -> pd.DataFrame:
"""Detect chemical groups."""
print(f"\nDetecting chemical groups: {', '.join(groups)}")
group_detector = mc.groups.ChemicalGroup(groups=groups)
results = group_detector.get_all_matches(mols)
df_results = pd.DataFrame()
for group in groups:
df_results[f"has_{group}"] = [bool(r.get(group)) for r in results]
return df_results
def generate_summary(df: pd.DataFrame, output_file: Path):
"""Generate filtering summary report."""
summary_file = output_file.parent / f"{output_file.stem}_summary.txt"
with open(summary_file, "w") as f:
f.write("=" * 80 + "\n")
f.write("MEDCHEM FILTERING SUMMARY\n")
f.write("=" * 80 + "\n\n")
f.write(f"Total molecules processed: {len(df)}\n\n")
# Rule results
rule_cols = [col for col in df.columns if col.startswith("rule_") or col == "passes_all_rules"]
if rule_cols:
f.write("RULE FILTERS:\n")
f.write("-" * 40 + "\n")
for col in rule_cols:
if col in df.columns and df[col].dtype == bool:
n_pass = df[col].sum()
pct = 100 * n_pass / len(df)
f.write(f" {col}: {n_pass} passed ({pct:.1f}%)\n")
f.write("\n")
# Structural alerts
alert_cols = [col for col in df.columns if "alert" in col.lower() or "nibr" in col.lower() or "lilly" in col.lower() or "pains" in col.lower()]
if alert_cols:
f.write("STRUCTURAL ALERTS:\n")
f.write("-" * 40 + "\n")
if "has_common_alerts" in df.columns:
n_clean = (~df["has_common_alerts"]).sum()
pct = 100 * n_clean / len(df)
f.write(f" No common alerts: {n_clean} ({pct:.1f}%)\n")
if "passes_nibr" in df.columns:
n_pass = df["passes_nibr"].sum()
pct = 100 * n_pass / len(df)
f.write(f" Passes NIBR: {n_pass} ({pct:.1f}%)\n")
if "passes_lilly" in df.columns:
n_pass = df["passes_lilly"].sum()
pct = 100 * n_pass / len(df)
f.write(f" Passes Lilly: {n_pass} ({pct:.1f}%)\n")
avg_demerits = df["lilly_demerits"].mean()
f.write(f" Average Lilly demerits: {avg_demerits:.1f}\n")
if "passes_pains" in df.columns:
n_pass = df["passes_pains"].sum()
pct = 100 * n_pass / len(df)
f.write(f" Passes PAINS: {n_pass} ({pct:.1f}%)\n")
f.write("\n")
# Complexity
if "complexity_score" in df.columns:
f.write("COMPLEXITY:\n")
f.write("-" * 40 + "\n")
avg_complexity = df["complexity_score"].mean()
f.write(f" Average complexity: {avg_complexity:.1f}\n")
if "passes_complexity" in df.columns:
n_pass = df["passes_complexity"].sum()
pct = 100 * n_pass / len(df)
f.write(f" Within threshold: {n_pass} ({pct:.1f}%)\n")
f.write("\n")
# Constraints
if "passes_constraints" in df.columns:
f.write("CONSTRAINTS:\n")
f.write("-" * 40 + "\n")
n_pass = df["passes_constraints"].sum()
pct = 100 * n_pass / len(df)
f.write(f" Passes all constraints: {n_pass} ({pct:.1f}%)\n")
f.write("\n")
# Overall pass rate
pass_cols = [col for col in df.columns if col.startswith("passes_")]
if pass_cols:
df["passes_all_filters"] = df[pass_cols].all(axis=1)
n_pass = df["passes_all_filters"].sum()
pct = 100 * n_pass / len(df)
f.write("OVERALL:\n")
f.write("-" * 40 + "\n")
f.write(f" Molecules passing all filters: {n_pass} ({pct:.1f}%)\n")
f.write("\n" + "=" * 80 + "\n")
print(f"\nSummary report saved to: {summary_file}")
def main():
parser = argparse.ArgumentParser(
description="Batch molecular filtering using medchem",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__
)
# Input/Output
parser.add_argument("input", type=Path, help="Input file (CSV, TSV, SDF, or TXT)")
parser.add_argument("--output", "-o", type=Path, required=True, help="Output CSV file")
parser.add_argument("--smiles-column", default="smiles", help="Name of SMILES column (default: smiles)")
# Rule filters
parser.add_argument("--rules", help="Comma-separated list of rules (e.g., rule_of_five,rule_of_cns)")
# Structural alerts
parser.add_argument("--common-alerts", action="store_true", help="Apply common structural alerts")
parser.add_argument("--nibr", action="store_true", help="Apply NIBR filters")
parser.add_argument("--lilly", action="store_true", help="Apply Lilly demerits filter")
parser.add_argument("--pains", action="store_true", help="Apply PAINS filter")
# Complexity
parser.add_argument("--complexity", type=float, help="Maximum complexity threshold")
parser.add_argument("--complexity-method", default="bertz", choices=["bertz", "whitlock", "barone"],
help="Complexity calculation method")
# Constraints
parser.add_argument("--mw-range", help="Molecular weight range (e.g., 200,500)")
parser.add_argument("--logp-range", help="LogP range (e.g., -2,5)")
parser.add_argument("--tpsa-max", type=float, help="Maximum TPSA")
parser.add_argument("--hbd-max", type=int, help="Maximum H-bond donors")
parser.add_argument("--hba-max", type=int, help="Maximum H-bond acceptors")
parser.add_argument("--rotatable-bonds-max", type=int, help="Maximum rotatable bonds")
# Chemical groups
parser.add_argument("--groups", help="Comma-separated chemical groups to detect")
# Processing options
parser.add_argument("--n-jobs", type=int, default=-1, help="Number of parallel jobs (-1 = all cores)")
parser.add_argument("--no-summary", action="store_true", help="Don't generate summary report")
parser.add_argument("--filter-output", action="store_true", help="Only output molecules passing all filters")
args = parser.parse_args()
# Load molecules
df, mols = load_molecules(args.input, args.smiles_column)
# Apply filters
result_dfs = [df]
# Rules
if args.rules:
rule_list = [r.strip() for r in args.rules.split(",")]
df_rules = apply_rule_filters(mols, rule_list, args.n_jobs)
result_dfs.append(df_rules)
# Structural alerts
if args.common_alerts:
df_alerts = apply_structural_alerts(mols, "common", args.n_jobs)
result_dfs.append(df_alerts)
if args.nibr:
df_nibr = apply_structural_alerts(mols, "nibr", args.n_jobs)
result_dfs.append(df_nibr)
if args.lilly:
df_lilly = apply_structural_alerts(mols, "lilly", args.n_jobs)
result_dfs.append(df_lilly)
if args.pains:
df_pains = apply_structural_alerts(mols, "pains", args.n_jobs)
result_dfs.append(df_pains)
# Complexity
if args.complexity:
df_complexity = apply_complexity_filter(mols, args.complexity, args.complexity_method)
result_dfs.append(df_complexity)
# Constraints
constraints = {}
if args.mw_range:
mw_min, mw_max = map(float, args.mw_range.split(","))
constraints["mw_range"] = (mw_min, mw_max)
if args.logp_range:
logp_min, logp_max = map(float, args.logp_range.split(","))
constraints["logp_range"] = (logp_min, logp_max)
    # Compare against None so explicit zero values (e.g., --hbd-max 0) are not silently dropped
    if args.tpsa_max is not None:
        constraints["tpsa_max"] = args.tpsa_max
    if args.hbd_max is not None:
        constraints["hbd_max"] = args.hbd_max
    if args.hba_max is not None:
        constraints["hba_max"] = args.hba_max
    if args.rotatable_bonds_max is not None:
        constraints["rotatable_bonds_max"] = args.rotatable_bonds_max
if constraints:
df_constraints = apply_constraints(mols, constraints, args.n_jobs)
result_dfs.append(df_constraints)
# Chemical groups
if args.groups:
group_list = [g.strip() for g in args.groups.split(",")]
df_groups = apply_chemical_groups(mols, group_list)
result_dfs.append(df_groups)
# Combine results
df_final = pd.concat(result_dfs, axis=1)
# Filter output if requested
if args.filter_output:
pass_cols = [col for col in df_final.columns if col.startswith("passes_")]
if pass_cols:
df_final["passes_all"] = df_final[pass_cols].all(axis=1)
df_final = df_final[df_final["passes_all"]]
print(f"\nFiltered to {len(df_final)} molecules passing all filters")
# Save results
args.output.parent.mkdir(parents=True, exist_ok=True)
df_final.to_csv(args.output, index=False)
print(f"\nResults saved to: {args.output}")
# Generate summary
if not args.no_summary:
generate_summary(df_final, args.output)
print("\nDone!")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,516 @@
---
name: molfeat
description: Comprehensive molecular featurization toolkit for converting chemical structures into numerical representations for machine learning. Use this skill when working with molecular data, SMILES strings, chemical fingerprints, molecular descriptors, or building QSAR/QSPR models. Provides access to 100+ featurizers including traditional fingerprints (ECFP, MACCS), molecular descriptors (RDKit, Mordred), and pretrained deep learning models (ChemBERTa, ChemGPT, GNN models) for cheminformatics and drug discovery tasks.
---
# Molfeat - Molecular Featurization Hub
## Overview
Molfeat is a comprehensive Python library for molecular featurization that unifies pre-trained embeddings and hand-crafted featurizers into a single, fast, and user-friendly package. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations suitable for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications.
**Key Capabilities:**
- 100+ featurizers including fingerprints, descriptors, and pretrained models
- Fast parallel processing with simple API
- Scikit-learn compatible transformers
- Built-in caching and state persistence
- Integration with PyTorch, TensorFlow, and graph neural networks
## When to Use This Skill
Apply molfeat when working with:
- **Molecular machine learning**: Building QSAR/QSPR models, property prediction
- **Virtual screening**: Ranking compound libraries for biological activity
- **Similarity searching**: Finding structurally similar molecules
- **Chemical space analysis**: Clustering, visualization, dimensionality reduction
- **Deep learning**: Training neural networks on molecular data
- **Featurization pipelines**: Converting SMILES to ML-ready representations
- **Cheminformatics**: Any task requiring molecular feature extraction
## Installation
```bash
# Recommended: Using conda/mamba
mamba install -c conda-forge molfeat
# Alternative: Using pip
pip install molfeat
# With all optional dependencies
pip install "molfeat[all]"
```
**Optional dependencies for specific featurizers:**
- `molfeat[dgl]` - GNN models (GIN variants)
- `molfeat[graphormer]` - Graphormer models
- `molfeat[transformer]` - ChemBERTa, ChemGPT, MolT5
- `molfeat[fcd]` - FCD descriptors
- `molfeat[map4]` - MAP4 fingerprints
## Core Concepts
Molfeat organizes featurization into three hierarchical classes:
### 1. Calculators (`molfeat.calc`)
Callable objects that convert individual molecules into feature vectors. Accept RDKit `Chem.Mol` objects or SMILES strings.
**Use calculators for:**
- Single molecule featurization
- Custom processing loops
- Direct feature computation
**Example:**
```python
from molfeat.calc import FPCalculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO") # Returns numpy array (2048,)
```
### 2. Transformers (`molfeat.trans`)
Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
**Use transformers for:**
- Batch featurization of molecular datasets
- Integration with scikit-learn pipelines
- Parallel processing (automatic CPU utilization)
**Example:**
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list) # Parallel processing
```
### 3. Pretrained Transformers (`molfeat.trans.pretrained`)
Specialized transformers for deep learning models with batched inference and caching.
**Use pretrained transformers for:**
- State-of-the-art molecular embeddings
- Transfer learning from large chemical datasets
- Deep learning feature extraction
**Example:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list) # Deep learning embeddings
```
## Quick Start Workflow
### Basic Featurization
```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]
# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}") # (4, 2048)
```
### Save and Load Configuration
```python
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")
# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
```
### Handle Errors Gracefully
```python
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
calc,
n_jobs=-1,
ignore_errors=True, # Continue on failures
verbose=True # Log error details
)
features = transformer(smiles_with_errors)
# Returns None for failed molecules
```
## Choosing the Right Featurizer
### For Traditional Machine Learning (RF, SVM, XGBoost)
**Start with fingerprints:**
```python
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)
# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")
# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
```
**For interpretable models:**
```python
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()
# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
```
**Combine multiple featurizers:**
```python
from molfeat.trans import FeatConcat
concat = FeatConcat([
FPCalculator("maccs"), # 167 dimensions
FPCalculator("ecfp") # 2048 dimensions
]) # Result: 2215-dimensional combined features
```
### For Deep Learning
**Transformer-based embeddings:**
```python
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")
# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
```
**Graph neural networks:**
```python
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")
# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
```
### For Similarity Searching
```python
# ECFP - General purpose, most widely used
FPCalculator("ecfp")
# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")
# MAP4 - Efficient for large databases
FPCalculator("map4")
# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
```
### For Pharmacophore-Based Approaches
```python
# FCFP - Functional group based
FPCalculator("fcfp")
# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")
# Gobbi - Explicit pharmacophore features
FPCalculator("gobbi2D")
```
## Common Workflows
### Building a QSAR Model
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)
# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")
# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
```
### Virtual Screening Pipeline
```python
from sklearn.ensemble import RandomForestClassifier
# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)
# Screen large library
X_screen = transformer(screening_library) # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]
# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
```
### Similarity Search
```python
from sklearn.metrics.pairwise import cosine_similarity
# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
```
### Scikit-learn Pipeline Integration
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Create end-to-end pipeline
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### Comparing Multiple Featurizers
```python
featurizers = {
    'ECFP': MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1),
    'MACCS': MoleculeTransformer(FPCalculator("maccs"), n_jobs=-1),
    'Descriptors': MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1),
    'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1),
}
results = {}
for name, transformer in featurizers.items():
    X = transformer(smiles)
    # Evaluate with your ML model
    score = evaluate_model(X, y)
    results[name] = score
```
## Discovering Available Featurizers
Use the ModelStore to explore all available featurizers:
```python
from molfeat.store.modelstore import ModelStore
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
print(f"- {model.name}: {model.description}")
# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage() # Display usage examples
# Load model
transformer = store.load("ChemBERTa-77M-MLM")
```
## Advanced Features
### Custom Preprocessing
```python
class CustomTransformer(MoleculeTransformer):
def preprocess(self, mol):
"""Custom preprocessing pipeline"""
if isinstance(mol, str):
mol = dm.to_mol(mol)
mol = dm.standardize_mol(mol)
mol = dm.remove_salts(mol)
return mol
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
```
### Batch Processing Large Datasets
```python
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
"""Process large datasets in chunks to manage memory"""
all_features = []
for i in range(0, len(smiles_list), chunk_size):
chunk = smiles_list[i:i+chunk_size]
features = transformer(chunk)
all_features.append(features)
return np.vstack(all_features)
```
### Caching Expensive Embeddings
```python
import pickle
cache_file = "embeddings_cache.pkl"
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
try:
with open(cache_file, "rb") as f:
embeddings = pickle.load(f)
except FileNotFoundError:
embeddings = transformer(smiles_list)
with open(cache_file, "wb") as f:
pickle.dump(embeddings, f)
```
## Performance Tips
1. **Use parallelization**: Set `n_jobs=-1` to utilize all CPU cores
2. **Batch processing**: Process multiple molecules at once instead of loops
3. **Choose appropriate featurizers**: Fingerprints are faster than deep learning models
4. **Cache pretrained models**: Leverage built-in caching for repeated use
5. **Use float32**: Set `dtype=np.float32` when precision allows (see the combined sketch after this list)
6. **Handle errors efficiently**: Use `ignore_errors=True` for large datasets
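A combined sketch of tips 1, 2, 5, and 6, using the `MoleculeTransformer` options documented in the API reference (`smiles_list` is an assumed input):
```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,           # tip 1: use all CPU cores
    dtype=np.float32,    # tip 5: half the memory of float64
    ignore_errors=True,  # tip 6: return None for invalid SMILES instead of failing
)
features = transformer(smiles_list)  # tip 2: featurize the whole batch at once
```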
## Common Featurizers Reference
**Quick reference for frequently used featurizers:**
| Featurizer | Type | Dimensions | Speed | Use Case |
|------------|------|------------|-------|----------|
| `ecfp` | Fingerprint | 2048 | Fast | General purpose |
| `maccs` | Fingerprint | 167 | Very fast | Scaffold similarity |
| `desc2D` | Descriptors | 200+ | Fast | Interpretable models |
| `mordred` | Descriptors | 1800+ | Medium | Comprehensive features |
| `map4` | Fingerprint | 1024 | Fast | Large-scale screening |
| `ChemBERTa-77M-MLM` | Deep learning | 768 | Slow* | Transfer learning |
| `gin-supervised-masking` | GNN | Variable | Slow* | Graph-based models |
*First run is slow; subsequent runs benefit from caching
## Resources
This skill includes comprehensive reference documentation:
### references/api_reference.md
Complete API documentation covering:
- `molfeat.calc` - All calculator classes and parameters
- `molfeat.trans` - Transformer classes and methods
- `molfeat.store` - ModelStore usage
- Common patterns and integration examples
- Performance optimization tips
**When to load:** Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.
### references/available_featurizers.md
Comprehensive catalog of all 100+ featurizers organized by category:
- Transformer-based language models (ChemBERTa, ChemGPT)
- Graph neural networks (GIN, Graphormer)
- Molecular descriptors (RDKit, Mordred)
- Fingerprints (ECFP, MACCS, MAP4, and 15+ others)
- Pharmacophore descriptors (CATS, Gobbi)
- Shape descriptors (USR, ElectroShape)
- Scaffold-based descriptors
**When to load:** Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.
**Search tip:** Use grep to find specific featurizer types:
```bash
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
```
### references/examples.md
Practical code examples for common scenarios:
- Installation and quick start
- Calculator and transformer examples
- Pretrained model usage
- Scikit-learn and PyTorch integration
- Virtual screening workflows
- QSAR model building
- Similarity searching
- Troubleshooting and best practices
**When to load:** Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.
## Troubleshooting
### Invalid Molecules
Enable error handling to skip invalid SMILES:
```python
transformer = MoleculeTransformer(
calc,
ignore_errors=True,
verbose=True
)
```
### Memory Issues with Large Datasets
Process in chunks or use streaming approaches for datasets > 100K molecules.
### Pretrained Model Dependencies
Some models require additional packages. Install specific extras:
```bash
pip install "molfeat[transformer]" # For ChemBERTa/ChemGPT
pip install "molfeat[dgl]" # For GIN models
```
### Reproducibility
Save exact configurations and document versions:
```python
transformer.to_state_yaml_file("config.yml")
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```
## Additional Resources
- **Official Documentation**: https://molfeat-docs.datamol.io/
- **GitHub Repository**: https://github.com/datamol-io/molfeat
- **PyPI Package**: https://pypi.org/project/molfeat/
- **Tutorial**: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6

View File

@@ -0,0 +1,428 @@
# Molfeat API Reference
## Core Modules
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features
---
## molfeat.calc - Calculators
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
### SerializableCalculator (Base Class)
Base abstract class for all calculators. When subclassing, must implement:
- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing
**State Management Methods:**
- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from state dictionary
- `to_state_dict()` - Export calculator state as dictionary
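A minimal subclassing sketch following the contract above; the class name and the single heavy-atom-count feature are illustrative, not part of molfeat:
```python
import numpy as np
import datamol as dm
from molfeat.calc import SerializableCalculator

class HeavyAtomCountCalculator(SerializableCalculator):
    """Toy calculator emitting one feature: the heavy-atom count."""

    def __call__(self, mol):
        mol = dm.to_mol(mol) if isinstance(mol, str) else mol
        return np.array([mol.GetNumHeavyAtoms()], dtype=float)

    def __len__(self):
        return 1

    @property
    def columns(self):
        return ["num_heavy_atoms"]
```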
### FPCalculator
Computes molecular fingerprints. Supports 15+ fingerprint methods.
**Supported Fingerprint Types:**
**Structural Fingerprints:**
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166-bit structural keys)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints
**Atom-based Fingerprints:**
- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions
**Specialized Fingerprints:**
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices
**Parameters:**
- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary
**Usage:**
```python
from molfeat.calc import FPCalculator
# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
# Compute fingerprint for single molecule
fp = calc("CCO") # Returns numpy array
# Get fingerprint length
length = len(calc) # 2048
# Get feature names
names = calc.columns
```
**Common Fingerprint Dimensions:**
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
### Descriptor Calculators
**RDKitDescriptors2D**
Computes 2D molecular descriptors using RDKit.
```python
from molfeat.calc import RDKitDescriptors2D
calc = RDKitDescriptors2D()
descriptors = calc("CCO") # Returns 200+ descriptors
```
**RDKitDescriptors3D**
Computes 3D molecular descriptors (requires conformer generation).
**MordredDescriptors**
Calculates over 1800 molecular descriptors using Mordred.
```python
from molfeat.calc import MordredDescriptors
calc = MordredDescriptors()
descriptors = calc("CCO")
```
### Pharmacophore Calculators
**Pharmacophore2D**
RDKit's 2D pharmacophore fingerprint generation.
**Pharmacophore3D**
Consensus pharmacophore fingerprints from multiple conformers.
**CATSCalculator**
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
**Parameters:**
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"
```python
from molfeat.calc import CATSCalculator
calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO") # Returns 21 descriptors by default
```
### Shape Descriptors
**USRDescriptors**
Ultrafast shape recognition descriptors (multiple variants).
**ElectroShapeDescriptors**
Electrostatic shape descriptors combining shape, chirality, and electrostatics.
### Graph-Based Calculators
**ScaffoldKeyCalculator**
Computes 40+ scaffold-based molecular properties.
**AtomCalculator**
Atom-level featurization for graph neural networks.
**BondCalculator**
Bond-level featurization for graph neural networks.
### Utility Function
**get_calculator()**
Factory function to instantiate calculators by name.
```python
from molfeat.calc import get_calculator
# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```
Raises `ValueError` for unsupported featurizers.
---
## molfeat.trans - Transformers
Transformers wrap calculators into complete featurization pipelines for batch processing.
### MoleculeTransformer
Scikit-learn compatible transformer for batch molecular featurization.
**Key Parameters:**
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)
**Essential Methods:**
- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration
**Usage:**
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm
# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch
features = transformer(smiles) # Returns numpy array (100, 2048)
# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")
# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```
**Performance:** Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
### FeatConcat
Concatenates multiple featurizers into unified representations.
```python
from molfeat.trans import FeatConcat
from molfeat.calc import FPCalculator
# Combine multiple fingerprints
concat = FeatConcat([
FPCalculator("maccs"), # 167 dimensions
FPCalculator("ecfp") # 2048 dimensions
])
# Result: 2215-dimensional combined features
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
```
### PretrainedMolTransformer
Subclass of `MoleculeTransformer` for pre-trained deep learning models.
**Unique Features:**
- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats
- SELFIES strings for language models
- DGL graphs for graph neural networks
- Integrated caching system for efficient storage
**Usage:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
# Load pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Generate embeddings
embeddings = transformer(smiles)
```
### PrecomputedMolTransformer
Transformer for cached/precomputed features.
---
## molfeat.store - Model Store
Manages featurizer discovery, loading, and registration.
### ModelStore
Central hub for accessing available featurizers.
**Key Methods:**
- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register custom featurizer
**Usage:**
```python
from molfeat.store.modelstore import ModelStore
# Initialize store
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")
# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
model_card = results[0]
# View usage information
model_card.usage()
# Load the model
transformer = model_card.load()
# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
**ModelCard Attributes:**
- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model
---
## Common Patterns
### Error Handling
```python
# Enable error tolerance
featurizer = MoleculeTransformer(
calc,
n_jobs=-1,
verbose=True,
ignore_errors=True
)
# Failed molecules return None
features = featurizer(smiles_with_errors)
```
### Data Type Control
```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)
# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
### Persistence and Reproducibility
```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
### Preprocessing
```python
# Manual preprocessing
mol = transformer.preprocess("CCO")
# Note: transform() does not apply preprocess() automatically
features = transformer.transform(smiles_list)
```
---
## Integration Examples
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create pipeline
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
('classifier', RandomForestClassifier())
])
# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### PyTorch Integration
```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
class MoleculeDataset(Dataset):
def __init__(self, smiles, labels, transformer):
self.smiles = smiles
self.labels = labels
self.transformer = transformer
def __len__(self):
return len(self.smiles)
    def __getitem__(self, idx):
        # transform() expects a batch, so wrap the single SMILES in a list
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])
# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
---
## Performance Tips
1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Process multiple molecules at once instead of loops
3. **Caching**: Leverage built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potential invalid molecules
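A short sketch combining tips 1, 4, and 5, using only parameters documented above:
```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,            # tip 1: use all CPU cores
    dtype=np.float32,     # tip 4: float32 halves memory versus float64
    ignore_errors=True,   # tip 5: invalid molecules yield None instead of raising
)
features = transformer(["CCO", "not-a-smiles", "c1ccccc1"])
```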

View File

@@ -0,0 +1,333 @@
# Available Featurizers in Molfeat
This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.
## Transformer-Based Language Models
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
## Graph Neural Networks (GNNs)
Pre-trained graph neural network models operating on molecular graph structures.
### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules with different objectives:
- **gin-supervised-masking** - Supervised with node masking objective
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
- **gin-supervised-edgepred** - Supervised with edge prediction objective
- **gin-supervised-contextpred** - Supervised with context prediction objective
### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
## Molecular Descriptors
Calculators for physico-chemical properties and molecular characteristics.
### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
- Molecular weight, logP, TPSA
- H-bond donors/acceptors
- Rotatable bonds
- Ring counts and aromaticity
- Molecular complexity metrics
### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
- Inertial moments
- PMI (Principal Moments of Inertia) ratios
- Asphericity, eccentricity
- Radius of gyration
### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
- Constitutional descriptors
- Topological indices
- Connectivity indices
- Information content
- 2D/3D autocorrelations
- WHIM descriptors
- GETAWAY descriptors
- And many more
### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
- Atomic environment information
- Electronic and topological properties
- Heteroatom contributions
## Molecular Fingerprints
Binary or count-based fixed-length vectors representing molecular substructures.
### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
  - The numeric suffixes (2, 4, 6) denote the fingerprint diameter (twice the radius)
- Default: radius=3, 2048 bits
- Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
- Similar to ECFP but uses functional groups
- Better for pharmacophore-based similarity
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers
### Key-Based Fingerprints
- **maccs** - MACCS keys (166-bit structural keys)
- Fixed set of predefined substructures
- Good for scaffold hopping
- Fast computation
- **avalon** - Avalon fingerprints
- Similar to MACCS but more features
- Optimized for similarity searching
### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
- Encodes pairs of atoms and distance between them
- Good for 3D similarity
- **atompair-count** - Count version of atom pairs
### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
- Encodes sequences of 4 connected atoms
- Captures local topology
- **topological-count** - Count version of topological torsions
### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
- Combines atom-pair and ECFP concepts
- Default: 1024 dimensions
- Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
- Operates directly on SMILES strings
- Captures both substructure and atom-pair information
### Extended Reduced Graph
- **erg** - Extended Reduced Graph
- Uses pharmacophoric points instead of atoms
- Reduces graph complexity while preserving key features
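Most of the names above can be passed directly to `FPCalculator` to compare dimensionalities in practice (a small sketch; `map4` needs the optional dependency listed under Model Dependencies):
```python
from molfeat.calc import FPCalculator

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
for name in ["ecfp", "maccs", "atompair", "topological", "map4"]:
    fp = FPCalculator(name)(smiles)
    print(f"{name}: {fp.shape[0]} dimensions")
```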
## Pharmacophore Descriptors
Features based on pharmacologically relevant functional groups and their spatial relationships.
### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
- Pharmacophore point pair distributions
- Distance based on shortest path
- 21 descriptors by default
- **cats3D** - 3D CATS descriptors
- Euclidean distance based
- Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
- 8 pharmacophore feature types:
- Hydrophobic
- Aromatic
- H-bond acceptor
- H-bond donor
- Positive ionizable
- Negative ionizable
- Lumped hydrophobe
- Good for virtual screening
### Pmapper Pharmacophores
- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
- High-dimensional pharmacophore descriptors
- Useful for QSAR and similarity searching
## Shape Descriptors
Descriptors capturing 3D molecular shape and electrostatic properties.
### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
- 12 dimensions encoding shape distribution
- Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
- 60 dimensions (12 per feature type)
- Combines shape and pharmacophore information
### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
- Combines molecular shape, chirality, and electrostatics
- Useful for protein-ligand docking predictions
## Scaffold-Based Descriptors
Descriptors based on molecular scaffolds and core structures.
### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
- 40+ scaffold-based properties
- Bioisosteric scaffold representation
- Captures core structural features
## Graph Featurizers for GNN Input
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
- Atomic number
- Degree, formal charge
- Hybridization
- Aromaticity
- Number of hydrogen atoms
### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
- Bond type (single, double, triple, aromatic)
- Conjugation
- Ring membership
- Stereochemistry
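These atom/bond featurizers are listed here by name; whether they resolve through `get_calculator` is an assumption in the sketch below, so the lookup is guarded:
```python
import datamol as dm
from molfeat.calc import get_calculator

mol = dm.to_mol("CCO")
for name in ["atom-default", "bond-default"]:  # names from the catalog above
    try:
        calc = get_calculator(name)  # assumption: resolvable by name
        print(name, calc(mol))
    except ValueError:
        print(f"{name}: not resolvable via get_calculator in this version")
```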
## Integrated Pretrained Model Collections
Molfeat integrates models from various sources:
### HuggingFace Models
Access to transformer models through HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models
### DGL-LifeSci Models
Pre-trained GNN models from DGL-Life:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models
### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation
### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
## Usage Notes
### Choosing a Featurizer
**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints
**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions
**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity
**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features
**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features
### Model Dependencies
Some featurizers require optional dependencies:
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`
### Accessing All Available Models
```python
from molfeat.store.modelstore import ModelStore
store = ModelStore()
all_models = store.available_models
# Print all available featurizers
for model in all_models:
print(f"{model.name}: {model.description}")
# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```
## Performance Characteristics
### Computational Speed (relative)
**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr
**Medium:**
- desc2D
- cats2D
- Most fingerprints
**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general
**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching
### Dimensionality
**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)
**Medium (200-2000 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)
**High (> 2000 dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings
**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)

View File

@@ -0,0 +1,723 @@
# Molfeat Usage Examples
This document provides practical examples for common molfeat use cases.
## Installation
```bash
# Recommended: Using conda/mamba
mamba install -c conda-forge molfeat
# Alternative: Using pip
pip install molfeat
# With all optional dependencies
pip install "molfeat[all]"
# With specific dependencies
pip install "molfeat[dgl]" # For GNN models
pip install "molfeat[graphormer]" # For Graphormer
pip install "molfeat[transformer]" # For ChemBERTa, ChemGPT
```
---
## Quick Start
### Basic Featurization Workflow
```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load sample data
data = dm.data.freesolv().sample(100).smiles.values
# Single molecule featurization
calc = FPCalculator("ecfp")
features_single = calc(data[0])
print(f"Single molecule features shape: {features_single.shape}")
# Output: (2048,)
# Batch featurization with parallelization
transformer = MoleculeTransformer(calc, n_jobs=-1)
features_batch = transformer(data)
print(f"Batch features shape: {features_batch.shape}")
# Output: (100, 2048)
```
---
## Calculator Examples
### Fingerprint Calculators
```python
from molfeat.calc import FPCalculator
# ECFP (Extended-Connectivity Fingerprints)
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO") # Ethanol
print(f"ECFP shape: {fp.shape}") # (2048,)
# MACCS keys
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1") # Benzene
print(f"MACCS shape: {fp.shape}") # (167,)
# Count-based fingerprints
ecfp_count = FPCalculator("ecfp-count", radius=3)
fp_count = ecfp_count("CC(C)CC(C)C") # Non-binary counts
# MAP4 fingerprints
map4 = FPCalculator("map4")
fp = map4("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
```
### Descriptor Calculators
```python
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors
# RDKit 2D descriptors (200+ properties)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"Number of 2D descriptors: {len(descriptors)}")
# Get descriptor names
names = desc2d.columns
print(f"First 5 descriptors: {names[:5]}")
# Mordred descriptors (1800+ properties)
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O") # Phenol
print(f"Mordred descriptors: {len(descriptors)}")
```
### Pharmacophore Calculators
```python
from molfeat.calc import CATSCalculator
# 2D CATS descriptors
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C") # Cymene
print(f"CATS descriptors: {descriptors.shape}") # (21,)
# 3D CATS descriptors (requires conformer)
cats3d = CATSCalculator(mode="3D", scale="num")
```
---
## Transformer Examples
### Basic Transformer Usage
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm
# Prepare data
smiles_list = [
"CCO",
"CC(=O)O",
"c1ccccc1",
"CC(C)O",
"CCCC"
]
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Transform molecules
features = transformer(smiles_list)
print(f"Features shape: {features.shape}") # (5, 2048)
```
### Error Handling
```python
# Handle invalid SMILES gracefully
smiles_with_errors = [
"CCO", # Valid
"invalid", # Invalid
"CC(=O)O", # Valid
"xyz123", # Invalid
]
transformer = MoleculeTransformer(
FPCalculator("ecfp"),
n_jobs=-1,
verbose=True, # Log errors
ignore_errors=True # Continue on failure
)
features = transformer(smiles_with_errors)
# Returns: array with None for failed molecules
print(features) # [array(...), None, array(...), None]
```
### Concatenating Multiple Featurizers
```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator
# Combine MACCS (167) + ECFP (2048) = 2215 dimensions
concat_calc = FeatConcat([
FPCalculator("maccs"),
FPCalculator("ecfp", radius=3, fpSize=2048)
])
transformer = MoleculeTransformer(concat_calc, n_jobs=-1)
features = transformer(smiles_list)
print(f"Combined features shape: {features.shape}") # (n, 2215)
# Triple combination
triple_concat = FeatConcat([
FPCalculator("maccs"),
FPCalculator("ecfp"),
FPCalculator("rdkit")
])
```
### Saving and Loading Configurations
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create and save transformer
transformer = MoleculeTransformer(
FPCalculator("ecfp", radius=3, fpSize=2048),
n_jobs=-1
)
# Save to YAML
transformer.to_state_yaml_file("my_featurizer.yml")
# Save to JSON
transformer.to_state_json_file("my_featurizer.json")
# Load from saved state
loaded_transformer = MoleculeTransformer.from_state_yaml_file("my_featurizer.yml")
# Use loaded transformer
features = loaded_transformer(smiles_list)
```
---
## Pretrained Model Examples
### Using the ModelStore
```python
from molfeat.store.modelstore import ModelStore
# Initialize model store
store = ModelStore()
# List all available models
print(f"Total available models: {len(store.available_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
print(f"- {model.name}: {model.description}")
# Get model information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
print(f"Model: {model_card.name}")
print(f"Version: {model_card.version}")
print(f"Authors: {model_card.authors}")
# View usage instructions
model_card.usage()
# Load model directly
transformer = store.load("ChemBERTa-77M-MLM")
```
### ChemBERTa Embeddings
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
# Load ChemBERTa model
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Generate embeddings
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
embeddings = chemberta(smiles)
print(f"ChemBERTa embeddings shape: {embeddings.shape}")
# Output: (3, 768) - 768-dimensional embeddings
# Use in ML pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
embeddings, labels, test_size=0.2
)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```
### ChemGPT Models
```python
# Small model (4.7M parameters)
chemgpt_small = PretrainedMolTransformer("ChemGPT-4.7M", n_jobs=-1)
# Medium model (19M parameters)
chemgpt_medium = PretrainedMolTransformer("ChemGPT-19M", n_jobs=-1)
# Large model (1.2B parameters)
chemgpt_large = PretrainedMolTransformer("ChemGPT-1.2B", n_jobs=-1)
# Generate embeddings
embeddings = chemgpt_small(smiles)
```
### Graph Neural Network Models
```python
# GIN models with different pre-training objectives
gin_masking = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
gin_infomax = PretrainedMolTransformer("gin-supervised-infomax", n_jobs=-1)
gin_edgepred = PretrainedMolTransformer("gin-supervised-edgepred", n_jobs=-1)
# Generate graph embeddings
embeddings = gin_masking(smiles)
print(f"GIN embeddings shape: {embeddings.shape}")
# Graphormer (for quantum chemistry)
graphormer = PretrainedMolTransformer("Graphormer-pcqm4mv2", n_jobs=-1)
embeddings = graphormer(smiles)
```
---
## Machine Learning Integration
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create ML pipeline
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and evaluate
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
# Cross-validation
scores = cross_val_score(pipeline, smiles_all, y_all, cv=5)
print(f"CV scores: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Grid Search for Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Define pipeline
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
('classifier', SVC())
])
# Define parameter grid
param_grid = {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['rbf', 'linear'],
'classifier__gamma': ['scale', 'auto']
}
# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(smiles_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
### Multiple Featurizer Comparison
```python
from sklearn.metrics import roc_auc_score
# Test different featurizers
featurizers = {
'ECFP': FPCalculator("ecfp"),
'MACCS': FPCalculator("maccs"),
'RDKit': FPCalculator("rdkit"),
'Descriptors': RDKitDescriptors2D(),
'Combined': FeatConcat([
FPCalculator("maccs"),
FPCalculator("ecfp")
])
}
results = {}
for name, calc in featurizers.items():
transformer = MoleculeTransformer(calc, n_jobs=-1)
X_train = transformer(smiles_train)
X_test = transformer(smiles_test)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)
results[name] = auc
print(f"{name}: AUC = {auc:.3f}")
```
### PyTorch Deep Learning
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Custom dataset
class MoleculeDataset(Dataset):
def __init__(self, smiles, labels, transformer):
self.features = transformer(smiles)
self.labels = torch.tensor(labels, dtype=torch.float32)
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
return (
torch.tensor(self.features[idx], dtype=torch.float32),
self.labels[idx]
)
# Prepare data
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
train_dataset = MoleculeDataset(smiles_train, y_train, transformer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Simple neural network
class MoleculeClassifier(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.network = nn.Sequential(
nn.Linear(input_dim, 512),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(512, 256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, 1),
nn.Sigmoid()
)
def forward(self, x):
return self.network(x)
# Train model
model = MoleculeClassifier(input_dim=2048)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
for epoch in range(10):
for batch_features, batch_labels in train_loader:
optimizer.zero_grad()
outputs = model(batch_features).squeeze()
loss = criterion(outputs, batch_labels)
loss.backward()
optimizer.step()
```
---
## Advanced Usage Patterns
### Custom Preprocessing
```python
from molfeat.trans import MoleculeTransformer
import datamol as dm
class CustomTransformer(MoleculeTransformer):
def preprocess(self, mol):
"""Custom preprocessing: standardize molecule"""
if isinstance(mol, str):
mol = dm.to_mol(mol)
# Standardize
mol = dm.standardize_mol(mol)
# Remove salts
mol = dm.remove_salts(mol)
return mol
# Use custom transformer
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)
```
### Featurization with Conformers
```python
import datamol as dm
from molfeat.calc import RDKitDescriptors3D
# Generate conformers
def prepare_3d_mol(smiles):
mol = dm.to_mol(smiles)
mol = dm.add_hs(mol)
    mol = dm.conformers.generate(mol, n_confs=1)
return mol
# 3D descriptors
calc_3d = RDKitDescriptors3D()
smiles = "CC(C)Cc1ccc(C)cc1C"
mol_3d = prepare_3d_mol(smiles)
descriptors_3d = calc_3d(mol_3d)
```
### Parallel Batch Processing
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import time
# Large dataset
smiles_large = load_large_dataset() # e.g., 100,000 molecules
# Test different parallelization levels
for n_jobs in [1, 2, 4, -1]:
transformer = MoleculeTransformer(
FPCalculator("ecfp"),
n_jobs=n_jobs
)
start = time.time()
features = transformer(smiles_large)
elapsed = time.time() - start
print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
```
### Caching for Expensive Operations
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
import pickle
# Load expensive pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Cache embeddings for reuse
cache_file = "embeddings_cache.pkl"
try:
# Try loading cached embeddings
with open(cache_file, "rb") as f:
embeddings = pickle.load(f)
print("Loaded cached embeddings")
except FileNotFoundError:
# Compute and cache
embeddings = transformer(smiles_list)
with open(cache_file, "wb") as f:
pickle.dump(embeddings, f)
print("Computed and cached embeddings")
```
---
## Common Workflows
### Virtual Screening Workflow
```python
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestClassifier
import datamol as dm
# 1. Prepare training data (known actives/inactives)
train_smiles = load_training_data()
train_labels = load_training_labels() # 1=active, 0=inactive
# 2. Featurize training set
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
# 3. Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)
# 4. Featurize screening library
screening_smiles = load_screening_library() # e.g., 1M compounds
X_screen = transformer(screening_smiles)
# 5. Predict and rank
predictions = clf.predict_proba(X_screen)[:, 1]
ranked_indices = predictions.argsort()[::-1]
# 6. Get top hits
top_n = 1000
top_hits = [screening_smiles[i] for i in ranked_indices[:top_n]]
```
### QSAR Model Building
```python
from molfeat.calc import RDKitDescriptors2D
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np
# Load QSAR dataset
smiles = load_molecules()
y = load_activity_values() # e.g., IC50, logP
# Featurize with interpretable descriptors
transformer = MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1)
X = transformer(smiles)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Build linear model
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
# Fit final model
model.fit(X_scaled, y)
# Interpret feature importance
feature_names = transformer.featurizer.columns
importance = np.abs(model.coef_)
top_features_idx = importance.argsort()[-10:][::-1]
print("Top 10 important features:")
for idx in top_features_idx:
print(f" {feature_names[idx]}: {model.coef_[idx]:.3f}")
```
### Similarity Search
```python
from molfeat.calc import FPCalculator
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Query molecule
query_smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
# Database of molecules
database_smiles = load_molecule_database() # Large collection
# Compute fingerprints
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
# Find most similar
top_k = 10
top_indices = similarities.argsort()[-top_k:][::-1]
print(f"Top {top_k} similar molecules:")
for i, idx in enumerate(top_indices, 1):
print(f"{i}. {database_smiles[idx]} (similarity: {similarities[idx]:.3f})")
```
---
## Troubleshooting
### Handling Invalid Molecules
```python
# Use ignore_errors to skip invalid molecules
transformer = MoleculeTransformer(
FPCalculator("ecfp"),
ignore_errors=True,
verbose=True
)
# Filter out None values after transformation
features = transformer(smiles_list)
valid_mask = [f is not None for f in features]
valid_features = [f for f in features if f is not None]
valid_smiles = [s for s, m in zip(smiles_list, valid_mask) if m]
```
### Memory Management for Large Datasets
```python
# Process in chunks for very large datasets
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
all_features = []
for i in range(0, len(smiles_list), chunk_size):
chunk = smiles_list[i:i+chunk_size]
features = transformer(chunk)
all_features.append(features)
print(f"Processed {i+len(chunk)}/{len(smiles_list)}")
return np.vstack(all_features)
# Use with large dataset
features = featurize_in_chunks(large_smiles_list, transformer)
```
### Reproducibility
```python
import random
import numpy as np
import torch
# Set all random seeds
def set_seed(seed=42):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
set_seed(42)
# Save exact configuration
transformer.to_state_yaml_file("config.yml")
# Document version
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```

View File

@@ -0,0 +1,381 @@
---
name: polars
description: This skill should be used when working with the Polars DataFrame library for high-performance data manipulation in Python. Use when users ask about Polars operations, migrating from pandas, optimizing data processing pipelines, or working with large datasets that benefit from lazy evaluation and parallel processing.
---
# Polars
## Overview
Polars is a lightning-fast DataFrame library for Python (and Rust) built on Apache Arrow. This skill provides guidance for working with Polars, including its expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities. Use this skill when helping users write efficient data processing code, migrate from pandas, or optimize data pipelines.
## Quick Start
### Installation and Basic Usage
Install Polars:
```bash
pip install polars
```
Basic DataFrame creation and operations:
```python
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
```
## Core Concepts
### Expressions
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
**Key principles:**
- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)
**Example:**
```python
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
```
### Lazy vs Eager Evaluation
**Eager (DataFrame):** Operations execute immediately
```python
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
```
**Lazy (LazyFrame):** Operations build a query plan, optimized before execution
```python
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
```
**When to use lazy:**
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- Performance is critical
**Benefits of lazy evaluation:**
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
For detailed concepts, load `references/core_concepts.md`.
## Common Operations
### Select
Select and manipulate columns:
```python
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```
### Filter
Filter rows by conditions:
```python
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
```
### With Columns
Add or modify columns while preserving existing ones:
```python
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
```
### Group By and Aggregations
Group data and compute aggregations:
```python
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
```
For detailed operation patterns, load `references/operations.md`.
## Aggregations and Window Functions
### Aggregation Functions
Common aggregations within `group_by` context:
- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values
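For example, several of these aggregations can be combined in a single `agg` call (self-contained sketch):
```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "NY", "LA"], "age": [25, 30, 35]})

df.group_by("city").agg(
    pl.len().alias("count"),
    pl.col("age").mean().alias("avg_age"),
    pl.col("age").min().alias("youngest"),
    pl.col("age").max().alias("oldest"),
    pl.first("age").alias("first_age"),
)
```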
### Window Functions with `over()`
Apply aggregations while preserving row count:
```python
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
```
**Mapping strategies:**
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster but groups rows together
- `join`: Creates list columns
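A brief sketch of the default strategy versus `join` (the `mapping_strategy` keyword is assumed to be available, as in recent Polars releases):
```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "NY", "LA"], "age": [25, 30, 35]})

# Default ("group_to_rows"): one value per original row, order preserved
df.with_columns(avg_age=pl.col("age").mean().over("city"))

# "join": each row receives the full list of its group's values
df.with_columns(city_ages=pl.col("age").over("city", mapping_strategy="join"))
```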
## Data I/O
### Supported Formats
Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
### Common I/O Operations
**CSV:**
```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```
**Parquet (recommended for performance):**
```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```
**JSON:**
```python
df = pl.read_json("file.json")
df.write_json("output.json")
```
For comprehensive I/O documentation, load `references/io_guide.md`.
## Transformations
### Joins
Combine DataFrames:
```python
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```
### Concatenation
Stack DataFrames:
```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```
### Pivot and Unpivot
Reshape data:
```python
# Pivot (wide format)
df.pivot(values="sales", index="date", columns="product")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
```
For detailed transformation examples, load `references/transformations.md`.
## Pandas Migration
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
### Conceptual Differences
- **No index**: Polars uses integer positions only
- **Strict typing**: No silent type conversions
- **Lazy evaluation**: Available via LazyFrame
- **Parallel by default**: Operations parallelized automatically
### Common Operation Mappings
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col").transform(...)` | `df.with_columns(expr.over("col"))` |
### Key Syntax Patterns
**Pandas sequential (slow):**
```python
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
```
**Polars parallel (fast):**
```python
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
```
For comprehensive migration guide, load `references/pandas_migration.md`.
## Best Practices
### Performance Optimization
1. **Use lazy evaluation for large datasets:**
```python
lf = pl.scan_csv("large.csv") # Don't use read_csv
result = lf.filter(...).select(...).collect()
```
2. **Avoid Python functions in hot paths:**
- Stay within expression API for parallelization
- Use `.map_elements()` only when necessary
- Prefer native Polars operations
3. **Use streaming for very large data:**
```python
lf.collect(streaming=True)
```
4. **Select only needed columns early:**
```python
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
```
5. **Use appropriate data types:**
- Categorical for low-cardinality strings
- Appropriate integer sizes (i32 vs i64)
- Date types for temporal data
### Expression Patterns
**Conditional operations:**
```python
pl.when(condition).then(value).otherwise(other_value)
```
**Column operations across multiple columns:**
```python
df.select(pl.col("^.*_value$") * 2) # Regex pattern
```
**Null handling:**
```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```
For additional best practices and patterns, load `references/best_practices.md`.
## Resources
This skill includes comprehensive reference documentation:
### references/
- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and type system
- `operations.md` - Comprehensive guide to all common operations with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns
Load these references as needed when users require detailed information about specific topics.

View File

@@ -0,0 +1,649 @@
# Polars Best Practices and Performance Guide
Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.
## Performance Optimization
### 1. Use Lazy Evaluation
**Always prefer lazy mode for large datasets:**
```python
# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
**Benefits of lazy evaluation:**
- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning
### 2. Filter and Select Early
Push filters and column selection as early as possible in the pipeline:
```python
# Bad: Process all data, then filter and select
result = (
lf.group_by("category")
.agg(pl.col("value").mean())
.join(other, on="category")
.filter(pl.col("value") > 100)
.select("category", "value")
)
# Good: Filter and select early
result = (
lf.select("category", "value") # Only needed columns
.filter(pl.col("value") > 100) # Filter early
.group_by("category")
.agg(pl.col("value").mean())
.join(other.select("category", "other_col"), on="category")
)
```
### 3. Avoid Python Functions
Stay within the expression API to maintain parallelization:
```python
# Bad: Python function disables parallelization
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```
**When you must use custom functions:**
```python
# If truly needed, be explicit
df = df.with_columns(
result=pl.col("value").map_elements(
custom_function,
return_dtype=pl.Float64,
skip_nulls=True # Optimize null handling
)
)
```
### 4. Use Streaming for Very Large Data
Enable streaming for datasets larger than RAM:
```python
# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```
### 5. Optimize Data Types
Choose appropriate data types to reduce memory and improve performance:
```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")
# Good: Specify optimal types
df = pl.read_csv(
"data.csv",
dtypes={
"id": pl.UInt32, # Instead of Int64 if values fit
"category": pl.Categorical, # For low-cardinality strings
"date": pl.Date, # Instead of String
"small_int": pl.Int16, # Instead of Int64
}
)
```
**Type optimization guidelines:**
- Use smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags
### 6. Parallel Operations
Structure code to maximize parallelization:
```python
# Bad: Sequential pipe operations disable parallelization
df = (
df.pipe(operation1)
.pipe(operation2)
.pipe(operation3)
)
# Good: Combined operations enable parallelization
df = df.with_columns(
result1=operation1_expr(),
result2=operation2_expr(),
result3=operation3_expr()
)
```
### 7. Rechunk After Concatenation
```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])
# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```
## Expression Patterns
### Conditional Logic
**Simple conditions:**
```python
df.with_columns(
status=pl.when(pl.col("age") >= 18)
.then("adult")
.otherwise("minor")
)
```
**Multiple conditions:**
```python
df.with_columns(
grade=pl.when(pl.col("score") >= 90)
.then("A")
.when(pl.col("score") >= 80)
.then("B")
.when(pl.col("score") >= 70)
.then("C")
.when(pl.col("score") >= 60)
.then("D")
.otherwise("F")
)
```
**Complex conditions:**
```python
df.with_columns(
category=pl.when(
(pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
)
.then("enterprise")
.when(
(pl.col("revenue") > 100000) | (pl.col("customers") > 50)
)
.then("business")
.otherwise("starter")
)
```
### Null Handling
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
# Constant value
df.with_columns(pl.col("value").fill_null(0))
# Forward fill
df.with_columns(pl.col("value").fill_null(strategy="forward"))
# Backward fill
df.with_columns(pl.col("value").fill_null(strategy="backward"))
# Mean
df.with_columns(pl.col("value").fill_null(strategy="mean"))
# Per-group fill
df.with_columns(
pl.col("value").fill_null(pl.col("value").mean()).over("group")
)
```
**Coalesce (first non-null):**
```python
df.with_columns(
combined=pl.coalesce(["col1", "col2", "col3"])
)
```
### Column Selection Patterns
**By name:**
```python
df.select("col1", "col2", "col3")
```
**By pattern:**
```python
# Regex
df.select(pl.col("^sales_.*$"))
# Starts with (patterns must be wrapped in ^...$ to be treated as regex)
df.select(pl.col("^sales.*$"))
# Ends with
df.select(pl.col("^.*_total$"))
# Contains
df.select(pl.col("^.*revenue.*$"))
```
**By type:**
```python
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All string columns
df.select(pl.col(pl.Utf8))
# Multiple types
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
```
**Exclude columns:**
```python
df.select(pl.all().exclude("id", "timestamp"))
```
**Transform multiple columns:**
```python
# Apply same operation to multiple columns
df.select(
pl.col("^sales_.*$") * 1.1 # 10% increase to all sales columns
)
```
### Aggregation Patterns
**Multiple aggregations:**
```python
df.group_by("category").agg(
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").std().alias("std_dev"),
pl.col("id").count().alias("count"),
pl.col("id").n_unique().alias("unique_count"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").quantile(0.5).alias("median"),
pl.col("value").quantile(0.95).alias("p95")
)
```
**Conditional aggregations:**
```python
df.group_by("category").agg(
# Count high values
(pl.col("value") > 100).sum().alias("high_count"),
# Average of filtered values
pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
# Conditional sum
pl.when(pl.col("status") == "completed")
.then(pl.col("amount"))
.otherwise(0)
.sum()
.alias("completed_total")
)
```
**Grouped transformations:**
```python
df.with_columns(
# Group statistics
group_mean=pl.col("value").mean().over("category"),
group_std=pl.col("value").std().over("category"),
# Rank within groups
rank=pl.col("value").rank().over("category"),
# Percentage of group total
pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
)
```
## Common Pitfalls and Anti-Patterns
### Pitfall 1: Row Iteration
```python
# Bad: Never iterate rows
for row in df.iter_rows():
# Process row
result = row[0] * 2
# Good: Use vectorized operations
df = df.with_columns(result=pl.col("value") * 2)
```
### Pitfall 2: Modifying in Place
```python
# Bad: Polars DataFrames are immutable; item assignment raises a TypeError
df["new_col"] = df["old_col"] * 2  # not supported
# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)
```
### Pitfall 3: Not Using Expressions
```python
# Bad: String-based operations
df.select("value * 2") # Won't work
# Good: Expression-based
df.select(pl.col("value") * 2)
```
### Pitfall 4: Inefficient Joins
```python
# Bad: Join large tables without filtering
result = large_df1.join(large_df2, on="id")
# Good: Filter before joining
result = (
large_df1.filter(pl.col("active"))
.join(
large_df2.filter(pl.col("status") == "valid"),
on="id"
)
)
```
### Pitfall 5: Not Specifying Types
```python
# Bad: Let Polars infer everything
df = pl.read_csv("data.csv")
# Good: Specify types for correctness and performance
df = pl.read_csv(
"data.csv",
dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
)
```
### Pitfall 6: Creating Many Small DataFrames
```python
# Bad: Many operations creating intermediate DataFrames
df1 = df.filter(pl.col("age") > 25)
df2 = df1.select("name", "age")
df3 = df2.sort("age")
result = df3.head(10)
# Good: Chain operations
result = (
df.filter(pl.col("age") > 25)
.select("name", "age")
.sort("age")
.head(10)
)
# Better: Use lazy mode
result = (
df.lazy()
.filter(pl.col("age") > 25)
.select("name", "age")
.sort("age")
.head(10)
.collect()
)
```
## Memory Management
### Monitor Memory Usage
```python
# Check DataFrame size
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
# Profile memory during operations
lf = pl.scan_csv("large.csv")
print(lf.explain()) # See query plan
```
### Reduce Memory Footprint
```python
# 1. Use lazy mode
lf = pl.scan_parquet("data.parquet")
# 2. Stream results
result = lf.collect(streaming=True)
# 3. Select only needed columns
lf = lf.select("col1", "col2")
# 4. Optimize data types
df = df.with_columns(
pl.col("int_col").cast(pl.Int32), # Downcast if possible
pl.col("category").cast(pl.Categorical) # For low cardinality
)
# 5. Drop columns not needed
df = df.drop("large_text_col", "unused_col")
```
## Testing and Debugging
### Inspect Query Plans
```python
lf = pl.scan_csv("data.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
# View the optimized query plan
print(query.explain())
# View detailed query plan
print(query.explain(optimized=True))
```
### Sample Data for Development
```python
# Use n_rows for testing
df = pl.read_csv("large.csv", n_rows=1000)
# Or sample after reading
df_sample = df.sample(n=1000, seed=42)
```
### Validate Schemas
```python
# Check schema
print(df.schema)
# Ensure schema matches expectation
expected_schema = {
"id": pl.Int64,
"name": pl.Utf8,
"date": pl.Date
}
assert df.schema == expected_schema
```
### Profile Performance
```python
import time
# Time operations
start = time.time()
result = lf.collect()
print(f"Execution time: {time.time() - start:.2f}s")
# Compare eager vs lazy
start = time.time()
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
eager_time = time.time() - start
start = time.time()
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
lazy_time = time.time() - start
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
```
## File Format Best Practices
### Choose the Right Format
**Parquet:**
- Best for: Large datasets, archival, data lakes
- Pros: Excellent compression, columnar, fast reads
- Cons: Not human-readable
**CSV:**
- Best for: Small datasets, human inspection, legacy systems
- Pros: Universal, human-readable
- Cons: Slow, large file size, no type preservation
**Arrow IPC:**
- Best for: Inter-process communication, temporary storage
- Pros: Fastest, zero-copy, preserves all types
- Cons: Less compression than Parquet
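Arrow IPC is not shown elsewhere in this guide; reading and writing mirror the other formats (a minimal sketch):
```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write and read back with full type preservation
df.write_ipc("data.arrow")
df2 = pl.read_ipc("data.arrow")

# Lazy scanning works for IPC as well
result = pl.scan_ipc("data.arrow").filter(pl.col("value") > 0.15).collect()
```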
### File Reading Best Practices
```python
# 1. Use lazy reading
lf = pl.scan_parquet("data.parquet") # Not read_parquet
# 2. Read multiple files efficiently
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
# 3. Specify schema when known
lf = pl.scan_csv(
"data.csv",
dtypes={"id": pl.Int64, "date": pl.Date}
)
# 4. Use predicate pushdown
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
```
### File Writing Best Practices
```python
# 1. Use Parquet for large data
df.write_parquet("output.parquet", compression="zstd")
# 2. Partition large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Use streaming for very large writes
lf.sink_parquet("output.parquet") # Streaming write
# 4. Optimize compression
df.write_parquet(
"output.parquet",
compression="snappy", # Fast compression
statistics=True # Enable predicate pushdown on read
)
```
## Code Organization
### Reusable Expressions
```python
# Define reusable expressions
age_group = (
pl.when(pl.col("age") < 18)
.then("minor")
.when(pl.col("age") < 65)
.then("adult")
.otherwise("senior")
)
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
# Use in multiple contexts
df = df.with_columns(
age_group=age_group,
rpc=revenue_per_customer
)
# Reuse in filtering
df = df.filter(revenue_per_customer > 100)
```
### Pipeline Functions
```python
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
"""Clean and standardize data."""
return lf.with_columns(
pl.col("name").str.to_uppercase(),
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("amount").fill_null(0)
)
def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
"""Add computed features."""
return lf.with_columns(
month=pl.col("date").dt.month(),
year=pl.col("date").dt.year(),
amount_log=pl.col("amount").log()
)
# Compose pipeline
result = (
pl.scan_csv("data.csv")
.pipe(clean_data)
.pipe(add_features)
.filter(pl.col("year") == 2023)
.collect()
)
```
## Documentation
Always document complex expressions and transformations:
```python
# Good: Document intent
df = df.with_columns(
# Calculate customer lifetime value as sum of purchases
# divided by months since first purchase
clv=(
pl.col("total_purchases") /
((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
.dt.total_days() / 30)
)
)
```
## Version Compatibility
```python
# Check Polars version
import polars as pl
print(pl.__version__)
# Feature availability varies by version
# Document version requirements for production code
```

View File

@@ -0,0 +1,378 @@
# Polars Core Concepts
## Expressions
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
### What are Expressions?
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
- `select()` - Select and transform columns
- `with_columns()` - Add or modify columns
- `filter()` - Filter rows
- `group_by().agg()` - Aggregate data
### Expression Syntax
**Basic column reference:**
```python
pl.col("column_name")
```
**Computed expressions:**
```python
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")
# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)
```
### Expression Contexts
**Select context:**
```python
df.select(
"name", # Simple column name
pl.col("age"), # Expression
(pl.col("age") * 12).alias("age_in_months") # Computed expression
)
```
**With_columns context:**
```python
df.with_columns(
age_doubled=pl.col("age") * 2,
name_upper=pl.col("name").str.to_uppercase()
)
```
**Filter context:**
```python
df.filter(
pl.col("age") > 25,
pl.col("city").is_in(["NY", "LA", "SF"])
)
```
**Group_by context:**
```python
df.group_by("department").agg(
pl.col("salary").mean(),
pl.col("employee_id").count()
)
```
### Expression Expansion
Apply operations to multiple columns at once:
**All columns:**
```python
df.select(pl.all() * 2)
```
**Pattern matching:**
```python
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES) + 1)
```
**Exclude patterns:**
```python
df.select(pl.all().exclude("id", "name"))
```
### Expression Composition
Expressions can be stored and reused:
```python
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()
# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)
```
## Data Types
Polars has a strict type system based on Apache Arrow.
### Core Data Types
**Numeric:**
- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
- `Float32`, `Float64` - Floating point numbers
**Text:**
- `Utf8` / `String` - UTF-8 encoded strings
- `Categorical` - Categorized strings (low cardinality)
- `Enum` - Fixed set of string values
**Temporal:**
- `Date` - Calendar date (no time)
- `Datetime` - Date and time with optional timezone
- `Time` - Time of day
- `Duration` - Time duration/difference
**Boolean:**
- `Boolean` - True/False values
**Nested:**
- `List` - Variable-length lists
- `Array` - Fixed-length arrays
- `Struct` - Nested record structures
**Other:**
- `Binary` - Binary data
- `Object` - Python objects (avoid in production)
- `Null` - Null type
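The nested types behave differently from flat columns; a small sketch of constructing and querying them:
```python
import polars as pl

df = pl.DataFrame({
    "tags": [["a", "b"], ["c"]],                    # inferred as List(String)
    "point": [{"x": 1, "y": 2}, {"x": 3, "y": 4}],  # inferred as Struct
})

df.select(
    pl.col("tags").list.len().alias("n_tags"),      # operate on list elements
    pl.col("point").struct.field("x"),              # extract a struct field
)
```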
### Type Casting
Convert between types explicitly:
```python
# Cast to different type
df.select(
pl.col("age").cast(pl.Float64),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("id").cast(pl.Utf8)
)
```
### Null Handling
Polars uses consistent null handling across all types:
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")
```
**Drop nulls:**
```python
df.drop_nulls() # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
```
### Categorical Data
Use categorical types for string columns with low cardinality (repeated values):
```python
# Cast to categorical
df.with_columns(
pl.col("category").cast(pl.Categorical)
)
# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Maintains order information
```
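To make the memory benefit concrete, here is a small sketch with synthetic low-cardinality data (exact numbers vary by Polars version and data):
```python
import polars as pl

# Synthetic column with only three distinct values repeated many times
df = pl.DataFrame({"category": ["red", "green", "blue"] * 100_000})
df_cat = df.with_columns(pl.col("category").cast(pl.Categorical))

# Compare approximate in-memory sizes
print(df.estimated_size("mb"), df_cat.estimated_size("mb"))
```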
## Lazy vs Eager Evaluation
Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).
### Eager Evaluation (DataFrame)
Operations execute immediately:
```python
import polars as pl
# DataFrame operations execute right away
df = pl.read_csv("data.csv") # Reads file immediately
result = df.filter(pl.col("age") > 25) # Filters immediately
final = result.select("name", "age") # Selects immediately
```
**When to use eager:**
- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback needed
### Lazy Evaluation (LazyFrame)
Operations build a query plan that is optimized before execution:
```python
import polars as pl
# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv") # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25) # Adds to plan
lf3 = lf2.select("name", "age") # Adds to plan
df = lf3.collect() # NOW executes optimized plan
```
**When to use lazy:**
- Large datasets
- Complex query pipelines
- Only need subset of data
- Performance is critical
- Streaming required
### Query Optimization
Polars automatically optimizes lazy queries:
**Predicate Pushdown:**
Filter operations are pushed down to the data source when possible:
```python
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
```
**Projection Pushdown:**
Only the columns that are needed are read from the data source:
```python
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
```
**Query Plan Inspection:**
```python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain()) # Shows optimized plan
```
### Streaming Mode
Process data larger than memory:
```python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
```
**Streaming benefits:**
- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management
**Streaming limitations:**
- Not all operations support streaming
- May be slower for small data
- Some operations require materializing entire dataset
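One way to see which parts of a plan can stream is to print the streaming query plan; this sketch assumes a Polars version that exposes the `streaming` flag on `explain` (the same versions that accept `collect(streaming=True)`). Sections outside the streaming block fall back to the in-memory engine.
```python
import polars as pl

lf = (
    pl.scan_csv("very_large.csv")  # hypothetical file
    .filter(pl.col("age") > 25)
    .group_by("city")
    .agg(pl.col("age").mean())
)

# Operations that can run in streaming mode appear inside a STREAMING
# section of the printed plan
print(lf.explain(streaming=True))
```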
### Converting Between Eager and Lazy
**Eager to Lazy:**
```python
df = pl.read_csv("data.csv")
lf = df.lazy() # Convert to LazyFrame
```
**Lazy to Eager:**
```python
lf = pl.scan_csv("data.csv")
df = lf.collect() # Execute and return DataFrame
```
## Memory Format
Polars uses the Apache Arrow columnar memory format:
**Benefits:**
- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization
**Implications:**
- Data stored column-wise, not row-wise
- Column operations very fast
- Random row access slower than pandas
- Best for analytical workloads
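A brief sketch of what zero-copy sharing looks like in practice (requires `pyarrow` to be installed): converting to an Arrow table and back shares the underlying column buffers rather than copying them, assuming compatible dtypes.
```python
import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# To Arrow: the resulting pyarrow.Table shares memory with the DataFrame
arrow_table = df.to_arrow()

# Back to Polars, again without copying the underlying buffers
df_again = pl.from_arrow(arrow_table)
print(df_again)
```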
## Parallelization
Polars parallelizes operations automatically using Rust's concurrency:
**What gets parallelized:**
- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations
**What prevents parallelization (avoid where possible):**
- Python user-defined functions (UDFs)
- Lambda functions in `.map_elements()`
- Sequential `.pipe()` chains
**Best practice:**
```python
# Good: Stays in expression API (parallelized)
df.with_columns(
pl.col("value") * 10,
pl.col("value").log(),
pl.col("value").sqrt()
)
# Bad: Uses Python function (sequential)
df.with_columns(
pl.col("value").map_elements(lambda x: x * 10)
)
```
## Strict Type System
Polars enforces strict typing:
**No silent conversions:**
```python
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")
# Must cast explicitly
df.with_columns(
pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
```
**Benefits:**
- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent
**Integer nulls:**
Unlike in pandas, integer columns can hold nulls without being converted to float:
```python
# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)
```
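A quick way to verify this behaviour (a small sketch; the comments show the kind of output to expect):
```python
import polars as pl

df = pl.DataFrame({"int_col": [1, 2, None, 4]})

print(df.schema)                           # int_col remains Int64
print(df["int_col"].null_count())          # 1
print(df["int_col"].is_null().to_list())   # [False, False, True, False]
```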

View File

@@ -0,0 +1,557 @@
# Polars Data I/O Guide
Comprehensive guide to reading and writing data in various formats with Polars.
## CSV Files
### Reading CSV
**Eager mode (loads into memory):**
```python
import polars as pl
# Basic read
df = pl.read_csv("data.csv")
# With options
df = pl.read_csv(
"data.csv",
separator=",",
has_header=True,
columns=["col1", "col2"], # Select specific columns
n_rows=1000, # Read only first 1000 rows
skip_rows=10, # Skip first 10 rows
dtypes={"col1": pl.Int64, "col2": pl.Utf8}, # Specify types
null_values=["NA", "null", ""], # Define null values
encoding="utf-8",
ignore_errors=False
)
```
**Lazy mode (scans without loading - recommended for large files):**
```python
# Scan CSV (builds query plan)
lf = pl.scan_csv("data.csv")
# Apply operations
result = lf.filter(pl.col("age") > 25).select("name", "age")
# Execute and load
df = result.collect()
```
### Writing CSV
```python
# Basic write
df.write_csv("output.csv")
# With options
df.write_csv(
"output.csv",
separator=",",
include_header=True,
null_value="", # How to represent nulls
quote_char='"',
line_terminator="\n"
)
```
### Multiple CSV Files
**Read multiple files:**
```python
# Read all CSVs in directory
lf = pl.scan_csv("data/*.csv")
# Read specific files
lf = pl.scan_csv(["file1.csv", "file2.csv", "file3.csv"])
```
## Parquet Files
Parquet is the recommended format for performance and compression.
### Reading Parquet
**Eager:**
```python
df = pl.read_parquet("data.parquet")
# With options
df = pl.read_parquet(
"data.parquet",
columns=["col1", "col2"], # Select specific columns
n_rows=1000, # Read first N rows
parallel="auto" # Control parallelization
)
```
**Lazy (recommended):**
```python
lf = pl.scan_parquet("data.parquet")
# Automatic predicate and projection pushdown
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Writing Parquet
```python
# Basic write
df.write_parquet("output.parquet")
# With compression
df.write_parquet(
"output.parquet",
compression="snappy", # Options: "snappy", "gzip", "brotli", "lz4", "zstd"
statistics=True, # Write statistics (enables predicate pushdown)
use_pyarrow=False # Use Rust writer (faster)
)
```
### Partitioned Parquet (Hive-style)
**Write partitioned:**
```python
# Write with partitioning
df.write_parquet(
"output_dir",
partition_by=["year", "month"] # Creates directory structure
)
# Creates: output_dir/year=2023/month=01/data.parquet
```
**Read partitioned:**
```python
lf = pl.scan_parquet("output_dir/**/*.parquet")
# Hive partitioning columns are automatically added
result = lf.filter(pl.col("year") == 2023).collect()
```
## JSON Files
### Reading JSON
**NDJSON (newline-delimited JSON) - recommended:**
```python
df = pl.read_ndjson("data.ndjson")
# Lazy
lf = pl.scan_ndjson("data.ndjson")
```
**Standard JSON:**
```python
df = pl.read_json("data.json")
# From a JSON string (read_json expects a path, file-like object, or bytes)
import io
df = pl.read_json(io.StringIO('{"col1": [1, 2], "col2": ["a", "b"]}'))
```
### Writing JSON
```python
# Write NDJSON
df.write_ndjson("output.ndjson")
# Write standard JSON
df.write_json("output.json")
# Pretty printed
df.write_json("output.json", pretty=True, row_oriented=False)
```
## Excel Files
### Reading Excel
```python
# Read first sheet
df = pl.read_excel("data.xlsx")
# Specific sheet
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
# Or by 1-based index (sheet_id=0 loads all sheets into a dict of DataFrames)
df = pl.read_excel("data.xlsx", sheet_id=1)
# With options
df = pl.read_excel(
"data.xlsx",
sheet_name="Sheet1",
columns=["A", "B", "C"], # Excel columns
n_rows=100,
skip_rows=5,
has_header=True
)
```
### Writing Excel
```python
# Write to Excel
df.write_excel("output.xlsx")
# Multiple sheets: write_excel accepts an xlsxwriter Workbook
from xlsxwriter import Workbook

with Workbook("output.xlsx") as workbook:
    df1.write_excel(workbook, worksheet="Sheet1")
    df2.write_excel(workbook, worksheet="Sheet2")
```
## Database Connectivity
### Read from Database
```python
import polars as pl
from sqlalchemy import create_engine

# Read using an existing connection object (e.g. a SQLAlchemy engine)
engine = create_engine("postgresql://user:pass@localhost/db")
df = pl.read_database("SELECT * FROM users", connection=engine)

# Read using a connection URI (backed by connectorx for better performance)
df = pl.read_database_uri(
    "SELECT * FROM users WHERE age > 25",
    uri="postgresql://user:pass@localhost/db"
)
```
### Write to Database
```python
# Using SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/db")
df.write_database("table_name", connection=engine)
# With options (older releases name this parameter if_exists)
df.write_database(
    "table_name",
    connection=engine,
    if_table_exists="replace",  # or "append", "fail"
)
```
### Common Database Connectors
**PostgreSQL:**
```python
uri = "postgresql://username:password@localhost:5432/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
**MySQL:**
```python
uri = "mysql://username:password@localhost:3306/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
**SQLite:**
```python
uri = "sqlite:///path/to/database.db"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
## Cloud Storage
### AWS S3
```python
# Read from S3
df = pl.read_parquet("s3://bucket/path/to/file.parquet")
lf = pl.scan_parquet("s3://bucket/path/*.parquet")
# Write to S3
df.write_parquet("s3://bucket/path/output.parquet")
# With credentials
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
os.environ["AWS_REGION"] = "us-west-2"
df = pl.read_parquet("s3://bucket/file.parquet")
```
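Credentials can also be passed per call through `storage_options` instead of environment variables. The exact key names depend on the underlying object-store configuration and Polars version, so treat the keys below as an assumption to verify against the docs for the version in use.
```python
import polars as pl

# Hypothetical credentials passed explicitly (key names may vary by version)
storage_options = {
    "aws_access_key_id": "your_key",
    "aws_secret_access_key": "your_secret",
    "aws_region": "us-west-2",
}

lf = pl.scan_parquet("s3://bucket/path/*.parquet", storage_options=storage_options)
df = lf.filter(pl.col("year") == 2023).collect()
```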
### Azure Blob Storage
```python
# Read from Azure
df = pl.read_parquet("az://container/path/file.parquet")
# Write to Azure
df.write_parquet("az://container/path/output.parquet")
# With credentials
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "account"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "key"
```
### Google Cloud Storage (GCS)
```python
# Read from GCS
df = pl.read_parquet("gs://bucket/path/file.parquet")
# Write to GCS
df.write_parquet("gs://bucket/path/output.parquet")
# With credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"
```
## Google BigQuery
```python
# Read from BigQuery via a connection URI (connectorx)
df = pl.read_database_uri(
    "SELECT * FROM project.dataset.table",
    uri="bigquery://project"
)
# Or using Google Cloud SDK
from google.cloud import bigquery
client = bigquery.Client()
query = "SELECT * FROM project.dataset.table WHERE date > '2023-01-01'"
df = pl.from_pandas(client.query(query).to_dataframe())
```
## Apache Arrow
### IPC/Feather Format
**Read:**
```python
df = pl.read_ipc("data.arrow")
lf = pl.scan_ipc("data.arrow")
```
**Write:**
```python
df.write_ipc("output.arrow")
# Compressed
df.write_ipc("output.arrow", compression="zstd")
```
### Arrow Streaming
```python
# Write the Arrow IPC streaming format
df.write_ipc_stream("output.arrows", compression="zstd")
# Read the streaming format
df = pl.read_ipc_stream("output.arrows")
```
### From/To Arrow
```python
import pyarrow as pa
# From Arrow Table
arrow_table = pa.table({"col": [1, 2, 3]})
df = pl.from_arrow(arrow_table)
# To Arrow Table
arrow_table = df.to_arrow()
```
## In-Memory Formats
### Python Dictionaries
```python
# From dict
df = pl.DataFrame({
"col1": [1, 2, 3],
"col2": ["a", "b", "c"]
})
# To dict
data_dict = df.to_dict() # Column-oriented
data_dict = df.to_dict(as_series=False) # Lists instead of Series
```
### NumPy Arrays
```python
import numpy as np
# From NumPy
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pl.DataFrame(arr, schema=["col1", "col2"])
# To NumPy
arr = df.to_numpy()
```
### Pandas DataFrames
```python
import pandas as pd
# From Pandas
pd_df = pd.DataFrame({"col": [1, 2, 3]})
pl_df = pl.from_pandas(pd_df)
# To Pandas
pd_df = pl_df.to_pandas()
# Via Arrow (zero-copy when possible)
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
```
### Lists of Rows
```python
# From list of dicts
data = [
{"name": "Alice", "age": 25},
{"name": "Bob", "age": 30}
]
df = pl.DataFrame(data)
# To list of dicts
rows = df.to_dicts()
# From list of tuples
data = [("Alice", 25), ("Bob", 30)]
df = pl.DataFrame(data, schema=["name", "age"])
```
## Streaming Large Files
For datasets larger than memory, use lazy mode with streaming:
```python
# Streaming mode
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Streaming with multiple files
lf = pl.scan_parquet("data/*.parquet")
result = lf.group_by("category").agg(pl.col("value").sum()).collect(streaming=True)
```
## Best Practices
### Format Selection
**Use Parquet when:**
- Need compression (up to 10x smaller than CSV)
- Want fast reads/writes
- Need to preserve data types
- Working with large datasets
- Need predicate pushdown
**Use CSV when:**
- Need human-readable format
- Interfacing with legacy systems
- Data is small
- Need universal compatibility
**Use JSON when:**
- Working with nested/hierarchical data
- Need web API compatibility
- Data has flexible schema
**Use Arrow IPC when:**
- Need zero-copy data sharing
- Fastest serialization required
- Working between Arrow-compatible systems
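A common consequence of these trade-offs is a one-off conversion from CSV to compressed Parquet. A minimal sketch (file names are placeholders) that performs the conversion as a streaming write, without materializing the whole dataset in memory:
```python
import polars as pl

# Convert a large CSV to compressed Parquet via a streaming sink
pl.scan_csv("large_input.csv").sink_parquet(
    "large_output.parquet",
    compression="zstd",
)
```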
### Reading Large Files
```python
# 1. Always use lazy mode
lf = pl.scan_csv("large.csv") # NOT read_csv
# 2. Filter and select early (pushdown optimization)
result = (
lf
.select("col1", "col2", "col3") # Only needed columns
.filter(pl.col("date") > "2023-01-01") # Filter early
.collect()
)
# 3. Use streaming for very large data
result = lf.filter(...).select(...).collect(streaming=True)
# 4. Read only needed rows during development
df = pl.read_csv("large.csv", n_rows=10000) # Sample for testing
```
### Writing Large Files
```python
# 1. Use Parquet with compression
df.write_parquet("output.parquet", compression="zstd")
# 2. Use partitioning for very large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Write streaming
lf = pl.scan_csv("input.csv")
lf.sink_parquet("output.parquet") # Streaming write
```
### Performance Tips
```python
# 1. Specify dtypes when reading CSV
df = pl.read_csv(
"data.csv",
dtypes={"id": pl.Int64, "name": pl.Utf8} # Avoids inference
)
# 2. Use appropriate compression
df.write_parquet("output.parquet", compression="snappy") # Fast
df.write_parquet("output.parquet", compression="zstd") # Better compression
# 3. Control reader threads (read_csv uses n_threads; parallel="auto" is a read_parquet option)
df = pl.read_csv("data.csv", n_threads=4)
# 4. Read multiple files in parallel
lf = pl.scan_parquet("data/*.parquet") # Automatic parallel read
```
## Error Handling
```python
try:
df = pl.read_csv("data.csv")
except pl.exceptions.ComputeError as e:
print(f"Error reading CSV: {e}")
# Ignore errors during parsing
df = pl.read_csv("messy.csv", ignore_errors=True)
# Handle missing files
from pathlib import Path
if Path("data.csv").exists():
df = pl.read_csv("data.csv")
else:
print("File not found")
```
## Schema Management
```python
# Infer schema from sample
schema = pl.read_csv("data.csv", n_rows=1000).schema
# Use inferred schema for full read
df = pl.read_csv("data.csv", dtypes=schema)
# Define schema explicitly
schema = {
"id": pl.Int64,
"name": pl.Utf8,
"date": pl.Date,
"value": pl.Float64
}
df = pl.read_csv("data.csv", dtypes=schema)
```
