mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
Add more scientific skills
@@ -17,7 +17,41 @@
     "strict": false,
     "skills": [
       "./scientific-packages/anndata",
-      "./scientific-packages/arboreto"
+      "./scientific-packages/arboreto",
+      "./scientific-packages/astropy",
+      "./scientific-packages/biomni",
+      "./scientific-packages/biopython",
+      "./scientific-packages/bioservices",
+      "./scientific-packages/cellxgene-census",
+      "./scientific-packages/cobrapy",
+      "./scientific-packages/datamol",
+      "./scientific-packages/deepchem",
+      "./scientific-packages/deeptools",
+      "./scientific-packages/diffdock",
+      "./scientific-packages/etetoolkit",
+      "./scientific-packages/flowio",
+      "./scientific-packages/gget",
+      "./scientific-packages/matplotlib",
+      "./scientific-packages/medchem",
+      "./scientific-packages/molfeat",
+      "./scientific-packages/polars",
+      "./scientific-packages/pubchem-database",
+      "./scientific-packages/pydeseq2",
+      "./scientific-packages/pymatgen",
+      "./scientific-packages/pymc",
+      "./scientific-packages/pymoo",
+      "./scientific-packages/pytdc",
+      "./scientific-packages/pytorch-lightning",
+      "./scientific-packages/rdkit",
+      "./scientific-packages/reportlab",
+      "./scientific-packages/scanpy",
+      "./scientific-packages/scikit-bio",
+      "./scientific-packages/scikit-learn",
+      "./scientific-packages/seaborn",
+      "./scientific-packages/torch_geometric",
+      "./scientific-packages/transformers",
+      "./scientific-packages/umap-learn",
+      "./scientific-packages/zarr-python"
     ]
   },
   {
scientific-packages/astropy/SKILL.md (new file, 790 lines)
@@ -0,0 +1,790 @@
|
||||
---
|
||||
name: astropy
|
||||
description: Comprehensive toolkit for astronomical data analysis and computation using the astropy Python library. This skill should be used when working with astronomical data including FITS files, coordinate transformations, cosmological calculations, time systems, physical units, data tables, model fitting, WCS transformations, and visualization. Use this skill for tasks involving celestial coordinates, astronomical file formats, photometry, spectroscopy, or any astronomy-specific Python computations.
|
||||
---
|
||||
|
||||
# Astropy
|
||||
|
||||
## Overview
|
||||
|
||||
Astropy is the community standard Python library for astronomy, providing core functionality for astronomical data analysis and computation. This skill provides comprehensive guidance and tools for working with astropy's extensive capabilities across coordinate systems, file I/O, units and quantities, time systems, cosmology, modeling, and more.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Working with FITS files (reading, writing, inspecting, modifying)
|
||||
- Performing coordinate transformations between astronomical reference frames
|
||||
- Calculating cosmological distances, ages, or other quantities
|
||||
- Handling astronomical time systems and conversions
|
||||
- Working with physical units and dimensional analysis
|
||||
- Processing astronomical data tables with specialized column types
|
||||
- Fitting models to astronomical data
|
||||
- Converting between pixel and world coordinates (WCS)
|
||||
- Performing robust statistical analysis on astronomical data
|
||||
- Visualizing astronomical images with proper scaling and stretching
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. FITS File Operations
|
||||
|
||||
FITS (Flexible Image Transport System) is the standard file format in astronomy. Astropy provides comprehensive FITS support.
|
||||
|
||||
**Quick FITS Inspection**:
|
||||
Use the included `scripts/fits_info.py` script for rapid file inspection:
|
||||
```bash
|
||||
python scripts/fits_info.py observation.fits
|
||||
python scripts/fits_info.py observation.fits --detailed
|
||||
python scripts/fits_info.py observation.fits --ext 1
|
||||
```
|
||||
|
||||
**Common FITS workflows**:
|
||||
```python
|
||||
from astropy.io import fits
|
||||
|
||||
# Read FITS file
|
||||
with fits.open('image.fits') as hdul:
|
||||
hdul.info() # Display structure
|
||||
data = hdul[0].data
|
||||
header = hdul[0].header
|
||||
|
||||
# Write FITS file
|
||||
fits.writeto('output.fits', data, header, overwrite=True)
|
||||
|
||||
# Quick access (less efficient for multiple operations)
|
||||
data = fits.getdata('image.fits', ext=0)
|
||||
header = fits.getheader('image.fits', ext=0)
|
||||
|
||||
# Update specific header keyword
|
||||
fits.setval('image.fits', 'OBJECT', value='M31')
|
||||
```
|
||||
|
||||
**Multi-extension FITS**:
|
||||
```python
|
||||
from astropy.io import fits
|
||||
|
||||
# Create multi-extension FITS
|
||||
primary = fits.PrimaryHDU(primary_data)
|
||||
image_ext = fits.ImageHDU(science_data, name='SCI')
|
||||
error_ext = fits.ImageHDU(error_data, name='ERR')
|
||||
|
||||
hdul = fits.HDUList([primary, image_ext, error_ext])
|
||||
hdul.writeto('multi_ext.fits', overwrite=True)
|
||||
```
|
||||
|
||||
**Binary tables**:
|
||||
```python
|
||||
from astropy.io import fits
|
||||
|
||||
# Read binary table
|
||||
with fits.open('catalog.fits') as hdul:
|
||||
table_data = hdul[1].data
|
||||
ra = table_data['RA']
|
||||
dec = table_data['DEC']
|
||||
|
||||
# Better: use astropy.table for table operations (see section 5)
|
||||
```
|
||||
|
||||
### 2. Coordinate Systems and Transformations
|
||||
|
||||
Astropy supports ~25 coordinate frames with seamless transformations.
|
||||
|
||||
**Quick Coordinate Conversion**:
|
||||
Use the included `scripts/coord_convert.py` script:
|
||||
```bash
|
||||
python scripts/coord_convert.py 10.68 41.27 --from icrs --to galactic
|
||||
python scripts/coord_convert.py --file coords.txt --from icrs --to galactic --output sexagesimal
|
||||
```
|
||||
|
||||
**Basic coordinate operations**:
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
# Create coordinate (multiple input formats supported)
|
||||
c = SkyCoord(ra=10.68*u.degree, dec=41.27*u.degree, frame='icrs')
|
||||
c = SkyCoord('00:42:44.3 +41:16:09', unit=(u.hourangle, u.deg))
|
||||
c = SkyCoord('00h42m44.3s +41d16m09s')
|
||||
|
||||
# Transform between frames
|
||||
c_galactic = c.galactic
|
||||
c_fk5 = c.fk5
|
||||
|
||||
print(f"Galactic: l={c_galactic.l.deg:.3f}, b={c_galactic.b.deg:.3f}")
|
||||
```
|
||||
|
||||
**Working with coordinate arrays**:
|
||||
```python
|
||||
import numpy as np
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
# Arrays of coordinates
|
||||
ra = np.array([10.1, 10.2, 10.3]) * u.degree
|
||||
dec = np.array([40.1, 40.2, 40.3]) * u.degree
|
||||
coords = SkyCoord(ra=ra, dec=dec, frame='icrs')
|
||||
|
||||
# Calculate separations
|
||||
sep = coords[0].separation(coords[1])
|
||||
print(f"Separation: {sep.to(u.arcmin)}")
|
||||
|
||||
# Position angle
|
||||
pa = coords[0].position_angle(coords[1])
|
||||
```
|
||||
|
||||
**Catalog matching**:
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
catalog1 = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[40, 41, 42]*u.degree)
|
||||
catalog2 = SkyCoord(ra=[10.01, 11.02, 13]*u.degree, dec=[40.01, 41.01, 43]*u.degree)
|
||||
|
||||
# Find nearest neighbors
|
||||
idx, sep2d, dist3d = catalog1.match_to_catalog_sky(catalog2)
|
||||
|
||||
# Filter by separation threshold
|
||||
max_sep = 1 * u.arcsec
|
||||
matched = sep2d < max_sep
|
||||
```
|
||||
|
||||
**Horizontal coordinates (Alt/Az)**:
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord, EarthLocation, AltAz
|
||||
from astropy.time import Time
|
||||
import astropy.units as u
|
||||
|
||||
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg, height=300*u.m)
|
||||
obstime = Time('2023-01-01 03:00:00')
|
||||
target = SkyCoord(ra=10*u.degree, dec=40*u.degree, frame='icrs')
|
||||
|
||||
altaz_frame = AltAz(obstime=obstime, location=location)
|
||||
target_altaz = target.transform_to(altaz_frame)
|
||||
|
||||
print(f"Alt: {target_altaz.alt.deg:.2f}°, Az: {target_altaz.az.deg:.2f}°")
|
||||
```
|
||||
|
||||
**Available coordinate frames**:
|
||||
- `icrs` - International Celestial Reference System (default, preferred)
|
||||
- `fk5`, `fk4` - Fifth/Fourth Fundamental Katalog
|
||||
- `galactic` - Galactic coordinates
|
||||
- `supergalactic` - Supergalactic coordinates
|
||||
- `altaz` - Horizontal (altitude-azimuth) coordinates
|
||||
- `gcrs`, `cirs`, `itrs` - Earth-based systems
|
||||
- Ecliptic frames: `BarycentricMeanEcliptic`, `HeliocentricMeanEcliptic`, `GeocentricMeanEcliptic`
|
||||
|
||||
### 3. Units and Quantities
|
||||
|
||||
Physical units are fundamental to astronomical calculations. Astropy's units system provides dimensional analysis and automatic conversions.
|
||||
|
||||
**Basic unit operations**:
|
||||
```python
|
||||
import astropy.units as u
|
||||
|
||||
# Create quantities
|
||||
distance = 5.2 * u.parsec
|
||||
velocity = 300 * u.km / u.s
|
||||
time = 10 * u.year
|
||||
|
||||
# Convert units
|
||||
distance_ly = distance.to(u.lightyear)
|
||||
velocity_mps = velocity.to(u.m / u.s)
|
||||
|
||||
# Arithmetic with units
|
||||
wavelength = 500 * u.nm
|
||||
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
|
||||
```
|
||||
|
||||
**Working with arrays**:
|
||||
```python
|
||||
import numpy as np
|
||||
import astropy.units as u
|
||||
|
||||
wavelengths = np.array([400, 500, 600]) * u.nm
|
||||
frequencies = wavelengths.to(u.THz, equivalencies=u.spectral())
|
||||
|
||||
fluxes = np.array([1.2, 2.3, 1.8]) * u.Jy
|
||||
luminosities = 4 * np.pi * (10*u.pc)**2 * fluxes
|
||||
```
|
||||
|
||||
**Important equivalencies**:
|
||||
- `u.spectral()` - Convert wavelength ↔ frequency ↔ energy
|
||||
- `u.doppler_optical(rest)` - Optical Doppler velocity
|
||||
- `u.doppler_radio(rest)` - Radio Doppler velocity
|
||||
- `u.doppler_relativistic(rest)` - Relativistic Doppler
|
||||
- `u.temperature()` - Temperature unit conversions
|
||||
- `u.brightness_temperature(freq)` - Brightness temperature
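
A minimal sketch of applying one of these equivalencies (the rest wavelength below is just an illustrative H-alpha value):
```python
import astropy.units as u

rest = 656.3 * u.nm       # illustrative rest wavelength (H-alpha)
observed = 656.5 * u.nm   # hypothetical observed wavelength

# Optical-convention radial velocity implied by the wavelength shift
velocity = observed.to(u.km / u.s, equivalencies=u.doppler_optical(rest))
print(velocity)
```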
|
||||
|
||||
**Physical constants**:
|
||||
```python
|
||||
from astropy import constants as const
|
||||
|
||||
print(const.c) # Speed of light
|
||||
print(const.G) # Gravitational constant
|
||||
print(const.M_sun) # Solar mass
|
||||
print(const.R_sun) # Solar radius
|
||||
print(const.L_sun) # Solar luminosity
|
||||
```
|
||||
|
||||
**Performance tip**: Use the `<<` operator for fast unit assignment to arrays:
|
||||
```python
|
||||
# Fast
|
||||
result = large_array << u.m
|
||||
|
||||
# Slower
|
||||
result = large_array * u.m
|
||||
```
|
||||
|
||||
### 4. Time Systems
|
||||
|
||||
Astronomical time systems require high precision and multiple time scales.
|
||||
|
||||
**Creating time objects**:
|
||||
```python
|
||||
from astropy.time import Time
|
||||
import astropy.units as u
|
||||
|
||||
# Various input formats
|
||||
t1 = Time('2023-01-01T00:00:00', format='isot', scale='utc')
|
||||
t2 = Time(2459945.5, format='jd', scale='utc')
|
||||
t3 = Time(['2023-01-01', '2023-06-01'], format='iso')
|
||||
|
||||
# Convert formats
|
||||
print(t1.jd) # Julian Date
|
||||
print(t1.mjd) # Modified Julian Date
|
||||
print(t1.unix) # Unix timestamp
|
||||
print(t1.iso) # ISO format
|
||||
|
||||
# Convert time scales
|
||||
print(t1.tai) # International Atomic Time
|
||||
print(t1.tt) # Terrestrial Time
|
||||
print(t1.tdb) # Barycentric Dynamical Time
|
||||
```
|
||||
|
||||
**Time arithmetic**:
|
||||
```python
|
||||
from astropy.time import Time, TimeDelta
|
||||
import astropy.units as u
import numpy as np
|
||||
|
||||
t1 = Time('2023-01-01T00:00:00')
|
||||
dt = TimeDelta(1*u.day)
|
||||
|
||||
t2 = t1 + dt
|
||||
diff = t2 - t1
|
||||
print(diff.to(u.hour))
|
||||
|
||||
# Array of times
|
||||
times = t1 + np.arange(10) * u.day
|
||||
```
|
||||
|
||||
**Astronomical time calculations**:
|
||||
```python
|
||||
from astropy.time import Time
|
||||
from astropy.coordinates import SkyCoord, EarthLocation
|
||||
import astropy.units as u
|
||||
|
||||
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg)
|
||||
t = Time('2023-01-01T00:00:00')
|
||||
|
||||
# Local sidereal time
|
||||
lst = t.sidereal_time('apparent', longitude=location.lon)
|
||||
|
||||
# Barycentric correction
|
||||
target = SkyCoord(ra=10*u.deg, dec=40*u.deg)
|
||||
ltt = t.light_travel_time(target, location=location)
|
||||
t_bary = t.tdb + ltt
|
||||
```
|
||||
|
||||
**Available time scales**:
|
||||
- `utc` - Coordinated Universal Time
|
||||
- `tai` - International Atomic Time
|
||||
- `tt` - Terrestrial Time
|
||||
- `tcb`, `tcg` - Barycentric/Geocentric Coordinate Time
|
||||
- `tdb` - Barycentric Dynamical Time
|
||||
- `ut1` - Universal Time
|
||||
|
||||
### 5. Data Tables
|
||||
|
||||
Astropy tables offer astronomy-specific features beyond pandas DataFrames, such as unit-aware columns and native support for `SkyCoord` and `Time` objects.
|
||||
|
||||
**Creating and manipulating tables**:
|
||||
```python
|
||||
from astropy.table import Table
|
||||
import astropy.units as u
|
||||
|
||||
# Create table
|
||||
t = Table()
|
||||
t['name'] = ['Star1', 'Star2', 'Star3']
|
||||
t['ra'] = [10.5, 11.2, 12.3] * u.degree
|
||||
t['dec'] = [41.2, 42.1, 43.5] * u.degree
|
||||
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
|
||||
|
||||
# Column metadata
|
||||
t['flux'].description = 'Flux at 1.4 GHz'
|
||||
t['flux'].format = '.2f'
|
||||
|
||||
# Add calculated column
|
||||
t['flux_mJy'] = t['flux'].to(u.mJy)
|
||||
|
||||
# Filter and sort
|
||||
bright = t[t['flux'] > 1.0 * u.Jy]
|
||||
t.sort('flux')
|
||||
```
|
||||
|
||||
**Table I/O**:
|
||||
```python
|
||||
from astropy.table import Table
|
||||
|
||||
# Read (format auto-detected from extension)
|
||||
t = Table.read('data.fits')
|
||||
t = Table.read('data.csv', format='ascii.csv')
|
||||
t = Table.read('data.ecsv', format='ascii.ecsv') # Preserves units!
|
||||
t = Table.read('data.votable', format='votable')
|
||||
|
||||
# Write
|
||||
t.write('output.fits', overwrite=True)
|
||||
t.write('output.ecsv', format='ascii.ecsv', overwrite=True)
|
||||
```
|
||||
|
||||
**Advanced operations**:
|
||||
```python
|
||||
from astropy.table import Table, join, vstack, hstack
import numpy as np
|
||||
|
||||
# Join two tables t1 and t2 on a shared key (like SQL)
joined = join(t1, t2, keys='id')
|
||||
|
||||
# Stack tables
|
||||
combined_rows = vstack([t1, t2])
|
||||
combined_cols = hstack([t1, t2])
|
||||
|
||||
# Grouping and aggregation (assumes a table t with a 'category' column)
t.group_by('category').groups.aggregate(np.mean)
|
||||
```
|
||||
|
||||
**Tables with astronomical objects**:
|
||||
```python
|
||||
from astropy.table import Table
|
||||
from astropy.coordinates import SkyCoord
|
||||
from astropy.time import Time
|
||||
import astropy.units as u
|
||||
|
||||
coords = SkyCoord(ra=[10, 11, 12]*u.deg, dec=[40, 41, 42]*u.deg)
|
||||
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
|
||||
|
||||
t = Table([coords, times], names=['coords', 'obstime'])
|
||||
print(t['coords'][0].ra) # Access coordinate properties
|
||||
```
|
||||
|
||||
### 6. Cosmological Calculations
|
||||
|
||||
Quick cosmology calculations using standard models.
|
||||
|
||||
**Using the cosmology calculator**:
|
||||
```bash
|
||||
python scripts/cosmo_calc.py 0.5 1.0 1.5
|
||||
python scripts/cosmo_calc.py --range 0 3 0.5 --cosmology Planck18
|
||||
python scripts/cosmo_calc.py 0.5 --verbose
|
||||
python scripts/cosmo_calc.py --convert 1000 --from luminosity_distance
|
||||
```
|
||||
|
||||
**Programmatic usage**:
|
||||
```python
|
||||
from astropy.cosmology import Planck18
|
||||
import astropy.units as u
|
||||
import numpy as np
|
||||
|
||||
cosmo = Planck18
|
||||
|
||||
# Calculate distances
|
||||
z = 1.5
|
||||
d_L = cosmo.luminosity_distance(z)
|
||||
d_A = cosmo.angular_diameter_distance(z)
|
||||
d_C = cosmo.comoving_distance(z)
|
||||
|
||||
# Time calculations
|
||||
age = cosmo.age(z)
|
||||
lookback = cosmo.lookback_time(z)
|
||||
|
||||
# Hubble parameter
|
||||
H_z = cosmo.H(z)
|
||||
|
||||
print(f"At z={z}:")
|
||||
print(f" Luminosity distance: {d_L:.2f}")
|
||||
print(f" Age of universe: {age:.2f}")
|
||||
```
|
||||
|
||||
**Convert observables**:
|
||||
```python
|
||||
from astropy.cosmology import Planck18
|
||||
import astropy.units as u
import numpy as np
|
||||
|
||||
cosmo = Planck18
|
||||
z = 1.5
|
||||
|
||||
# Angular size to physical size
|
||||
d_A = cosmo.angular_diameter_distance(z)
|
||||
angular_size = 1 * u.arcsec
|
||||
physical_size = (angular_size.to(u.radian) * d_A).to(u.kpc)
|
||||
|
||||
# Flux to luminosity
|
||||
flux = 1e-17 * u.erg / u.s / u.cm**2
|
||||
d_L = cosmo.luminosity_distance(z)
|
||||
luminosity = flux * 4 * np.pi * d_L**2
|
||||
|
||||
# Find redshift for given distance
|
||||
from astropy.cosmology import z_at_value
|
||||
z = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
|
||||
```
|
||||
|
||||
**Available cosmologies**:
|
||||
- `Planck18`, `Planck15`, `Planck13` - Planck satellite parameters
|
||||
- `WMAP9`, `WMAP7`, `WMAP5` - WMAP satellite parameters
|
||||
- Custom: `FlatLambdaCDM(H0=70*u.km/u.s/u.Mpc, Om0=0.3)`
|
||||
|
||||
### 7. Model Fitting
|
||||
|
||||
Fit mathematical models to astronomical data.
|
||||
|
||||
**1D fitting example**:
|
||||
```python
|
||||
from astropy.modeling import models, fitting
|
||||
import numpy as np
|
||||
|
||||
# Generate data
|
||||
x = np.linspace(0, 10, 100)
|
||||
y_data = 10 * np.exp(-0.5 * ((x - 5) / 1)**2) + np.random.normal(0, 0.5, x.shape)
|
||||
|
||||
# Create and fit model
|
||||
g_init = models.Gaussian1D(amplitude=8, mean=4.5, stddev=0.8)
|
||||
fitter = fitting.LevMarLSQFitter()
|
||||
g_fit = fitter(g_init, x, y_data)
|
||||
|
||||
# Results
|
||||
print(f"Amplitude: {g_fit.amplitude.value:.3f}")
|
||||
print(f"Mean: {g_fit.mean.value:.3f}")
|
||||
print(f"Stddev: {g_fit.stddev.value:.3f}")
|
||||
|
||||
# Evaluate fitted model
|
||||
y_fit = g_fit(x)
|
||||
```
|
||||
|
||||
**Common 1D models**:
|
||||
- `Gaussian1D` - Gaussian profile
|
||||
- `Lorentz1D` - Lorentzian profile
|
||||
- `Voigt1D` - Voigt profile
|
||||
- `Moffat1D` - Moffat profile (PSF modeling)
|
||||
- `Polynomial1D` - Polynomial
|
||||
- `PowerLaw1D` - Power law
|
||||
- `BlackBody` - Blackbody spectrum
|
||||
|
||||
**Common 2D models**:
|
||||
- `Gaussian2D` - 2D Gaussian
|
||||
- `Moffat2D` - 2D Moffat (stellar PSF)
|
||||
- `AiryDisk2D` - Airy disk (diffraction pattern)
|
||||
- `Disk2D` - Circular disk
|
||||
|
||||
**Fitting with constraints**:
|
||||
```python
|
||||
from astropy.modeling import models, fitting
|
||||
|
||||
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
|
||||
|
||||
# Set bounds
|
||||
g.amplitude.bounds = (0, None) # Positive only
|
||||
g.mean.bounds = (4, 6) # Constrain center
|
||||
|
||||
# Fix parameters
|
||||
g.stddev.fixed = True
|
||||
|
||||
# Compound models
|
||||
model = models.Gaussian1D() + models.Polynomial1D(degree=1)
|
||||
```
|
||||
|
||||
**Available fitters**:
|
||||
- `LinearLSQFitter` - Linear least squares (fast, for linear models)
|
||||
- `LevMarLSQFitter` - Levenberg-Marquardt (most common)
|
||||
- `SimplexLSQFitter` - Downhill simplex
|
||||
- `SLSQPLSQFitter` - Sequential Least Squares with constraints
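
For strictly linear models, `LinearLSQFitter` solves the fit directly without iteration. A minimal sketch on synthetic data:
```python
from astropy.modeling import models, fitting
import numpy as np

# Synthetic linear data (slope 2, intercept 1) with noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.3, x.shape)

p_init = models.Polynomial1D(degree=1)
fitter = fitting.LinearLSQFitter()   # exact linear least squares
p_fit = fitter(p_init, x, y)

print(f"Intercept: {p_fit.c0.value:.3f}, Slope: {p_fit.c1.value:.3f}")
```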
|
||||
|
||||
### 8. World Coordinate System (WCS)
|
||||
|
||||
Transform between pixel and world coordinates in images.
|
||||
|
||||
**Basic WCS usage**:
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.wcs import WCS
|
||||
|
||||
# Read FITS with WCS
|
||||
hdu = fits.open('image.fits')[0]
|
||||
wcs = WCS(hdu.header)
|
||||
|
||||
# Pixel to world
|
||||
ra, dec = wcs.pixel_to_world_values(100, 200)
|
||||
|
||||
# World to pixel
|
||||
x, y = wcs.world_to_pixel_values(ra, dec)
|
||||
|
||||
# Using SkyCoord (more powerful)
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
coord = SkyCoord(ra=150*u.deg, dec=-30*u.deg)
|
||||
x, y = wcs.world_to_pixel(coord)
|
||||
```
|
||||
|
||||
**Plotting with WCS**:
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.wcs import WCS
|
||||
from astropy.visualization import ImageNormalize, LogStretch, PercentileInterval
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
hdu = fits.open('image.fits')[0]
|
||||
wcs = WCS(hdu.header)
|
||||
data = hdu.data
|
||||
|
||||
# Create figure with WCS projection
|
||||
fig = plt.figure()
|
||||
ax = fig.add_subplot(111, projection=wcs)
|
||||
|
||||
# Plot with coordinate grid
|
||||
norm = ImageNormalize(data, interval=PercentileInterval(99.5),
|
||||
stretch=LogStretch())
|
||||
ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
|
||||
|
||||
# Coordinate labels and grid
|
||||
ax.set_xlabel('RA')
|
||||
ax.set_ylabel('Dec')
|
||||
ax.coords.grid(color='white', alpha=0.5)
|
||||
```
|
||||
|
||||
### 9. Statistics and Data Processing
|
||||
|
||||
Robust statistical tools for astronomical data.
|
||||
|
||||
**Sigma clipping** (remove outliers):
|
||||
```python
|
||||
from astropy.stats import sigma_clip, sigma_clipped_stats
|
||||
|
||||
# Remove outliers
|
||||
clipped = sigma_clip(data, sigma=3, maxiters=5)
|
||||
|
||||
# Get statistics on cleaned data
|
||||
mean, median, std = sigma_clipped_stats(data, sigma=3)
|
||||
|
||||
# Use clipped data
|
||||
background = median
|
||||
signal = data - background
|
||||
snr = signal / std
|
||||
```
|
||||
|
||||
**Other statistical functions**:
|
||||
```python
|
||||
from astropy.stats import mad_std, biweight_location, biweight_scale
|
||||
|
||||
# Robust standard deviation
|
||||
std_robust = mad_std(data)
|
||||
|
||||
# Robust central location
|
||||
center = biweight_location(data)
|
||||
|
||||
# Robust scale
|
||||
scale = biweight_scale(data)
|
||||
```
|
||||
|
||||
### 10. Visualization
|
||||
|
||||
Display astronomical images with proper scaling.
|
||||
|
||||
**Image normalization and stretching**:
|
||||
```python
|
||||
from astropy.visualization import (ImageNormalize, MinMaxInterval,
|
||||
PercentileInterval, ZScaleInterval,
|
||||
SqrtStretch, LogStretch, PowerStretch,
|
||||
AsinhStretch)
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Common combination: percentile interval + sqrt stretch
|
||||
norm = ImageNormalize(data,
|
||||
interval=PercentileInterval(99),
|
||||
stretch=SqrtStretch())
|
||||
|
||||
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
|
||||
plt.colorbar()
|
||||
```
|
||||
|
||||
**Available intervals** (determine min/max):
|
||||
- `MinMaxInterval()` - Use actual min/max
|
||||
- `PercentileInterval(percentile)` - Clip to percentile (e.g., 99%)
|
||||
- `ZScaleInterval()` - IRAF's zscale algorithm
|
||||
- `ManualInterval(vmin, vmax)` - Specify manually
|
||||
|
||||
**Available stretches** (nonlinear scaling):
|
||||
- `LinearStretch()` - Linear (default)
|
||||
- `SqrtStretch()` - Square root (common for images)
|
||||
- `LogStretch()` - Logarithmic (for high dynamic range)
|
||||
- `PowerStretch(power)` - Power law
|
||||
- `AsinhStretch()` - Arcsinh (good for wide range)
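
Intervals and stretches can be combined freely. A sketch using zscale limits with an arcsinh stretch (synthetic data stands in for a real image):
```python
from astropy.visualization import ImageNormalize, ZScaleInterval, AsinhStretch
import matplotlib.pyplot as plt
import numpy as np

data = np.random.lognormal(mean=1.0, sigma=1.0, size=(100, 100))  # stand-in image

norm = ImageNormalize(data, interval=ZScaleInterval(), stretch=AsinhStretch())
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
plt.colorbar()
```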
|
||||
|
||||
## Bundled Resources
|
||||
|
||||
### scripts/
|
||||
|
||||
**`fits_info.py`** - Comprehensive FITS file inspection tool
|
||||
```bash
|
||||
python scripts/fits_info.py observation.fits
|
||||
python scripts/fits_info.py observation.fits --detailed
|
||||
python scripts/fits_info.py observation.fits --ext 1
|
||||
```
|
||||
|
||||
**`coord_convert.py`** - Batch coordinate transformation utility
|
||||
```bash
|
||||
python scripts/coord_convert.py 10.68 41.27 --from icrs --to galactic
|
||||
python scripts/coord_convert.py --file coords.txt --from icrs --to galactic
|
||||
```
|
||||
|
||||
**`cosmo_calc.py`** - Cosmological calculator
|
||||
```bash
|
||||
python scripts/cosmo_calc.py 0.5 1.0 1.5
|
||||
python scripts/cosmo_calc.py --range 0 3 0.5 --cosmology Planck18
|
||||
```
|
||||
|
||||
### references/
|
||||
|
||||
**`module_overview.md`** - Comprehensive reference of all astropy subpackages, classes, and methods. Consult this for detailed API information, available functions, and module capabilities.
|
||||
|
||||
**`common_workflows.md`** - Complete working examples for common astronomical data analysis tasks. Contains full code examples for FITS operations, coordinate transformations, cosmology, modeling, and complete analysis pipelines.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use context managers for FITS files**:
|
||||
```python
|
||||
with fits.open('file.fits') as hdul:
|
||||
# Work with file
|
||||
```
|
||||
|
||||
2. **Prefer astropy.table over raw FITS tables** for better unit/metadata support
|
||||
|
||||
3. **Use SkyCoord for coordinates** (high-level interface) rather than low-level frame classes
|
||||
|
||||
4. **Always attach units** to quantities when possible for dimensional safety
|
||||
|
||||
5. **Use ECSV format** for saving tables when you want to preserve units and metadata
|
||||
|
||||
6. **Vectorize coordinate operations** rather than looping for performance
|
||||
|
||||
7. **Use memmap=True** when opening large FITS files to save memory (see the sketch after this list)
|
||||
|
||||
8. **Install Bottleneck** package for faster statistics operations
|
||||
|
||||
9. **Pre-compute composite units** for repeated operations to improve performance
|
||||
|
||||
10. **Consult `references/module_overview.md`** for detailed module information and `references/common_workflows.md` for complete working examples
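
As referenced in practice 7, a minimal memory-mapping sketch (the filename is illustrative):
```python
from astropy.io import fits

# With memmap=True the array data stay on disk until sliced
with fits.open('large_image.fits', memmap=True) as hdul:
    cutout = hdul[0].data[1000:1100, 1000:1100]  # only this region is read
    print(cutout.mean())
```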
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Pattern: FITS → Process → FITS
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.stats import sigma_clipped_stats
|
||||
|
||||
# Read
|
||||
with fits.open('input.fits') as hdul:
|
||||
data = hdul[0].data
|
||||
header = hdul[0].header
|
||||
|
||||
# Process
|
||||
mean, median, std = sigma_clipped_stats(data, sigma=3)
|
||||
processed = (data - median) / std
|
||||
|
||||
# Write
|
||||
fits.writeto('output.fits', processed, header, overwrite=True)
|
||||
```
|
||||
|
||||
### Pattern: Catalog Matching
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord
|
||||
from astropy.table import Table
|
||||
import astropy.units as u
|
||||
|
||||
# Load catalogs
|
||||
cat1 = Table.read('catalog1.fits')
|
||||
cat2 = Table.read('catalog2.fits')
|
||||
|
||||
# Create coordinate objects
|
||||
coords1 = SkyCoord(ra=cat1['RA'], dec=cat1['DEC'], unit=u.degree)
|
||||
coords2 = SkyCoord(ra=cat2['RA'], dec=cat2['DEC'], unit=u.degree)
|
||||
|
||||
# Match
|
||||
idx, sep2d, dist3d = coords1.match_to_catalog_sky(coords2)
|
||||
|
||||
# Filter by separation
|
||||
max_sep = 1 * u.arcsec
|
||||
matched_mask = sep2d < max_sep
|
||||
|
||||
# Create matched catalog
|
||||
matched_cat1 = cat1[matched_mask]
|
||||
matched_cat2 = cat2[idx[matched_mask]]
|
||||
```
|
||||
|
||||
### Pattern: Time Series Analysis
|
||||
```python
|
||||
from astropy.time import Time
|
||||
from astropy.timeseries import TimeSeries
|
||||
import astropy.units as u
|
||||
|
||||
# Create time series
|
||||
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
|
||||
flux = [1.2, 2.3, 1.8] * u.Jy
|
||||
|
||||
ts = TimeSeries(time=times)
|
||||
ts['flux'] = flux
|
||||
|
||||
# Fold the series on a trial period
period = 1.5 * u.day
folded = ts.fold(period=period)
|
||||
```
|
||||
|
||||
### Pattern: Image Display with WCS
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.wcs import WCS
|
||||
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
hdu = fits.open('image.fits')[0]
|
||||
wcs = WCS(hdu.header)
|
||||
data = hdu.data
|
||||
|
||||
fig = plt.figure(figsize=(10, 10))
|
||||
ax = fig.add_subplot(111, projection=wcs)
|
||||
|
||||
norm = ImageNormalize(data, interval=PercentileInterval(99),
|
||||
stretch=SqrtStretch())
|
||||
im = ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
|
||||
|
||||
ax.set_xlabel('RA')
|
||||
ax.set_ylabel('Dec')
|
||||
ax.coords.grid(color='white', alpha=0.5, linestyle='solid')
|
||||
plt.colorbar(im, ax=ax)
|
||||
```
|
||||
|
||||
## Installation Note
|
||||
|
||||
Ensure astropy is installed in the Python environment:
|
||||
```bash
|
||||
pip install astropy
|
||||
```
|
||||
|
||||
For additional performance and features:
|
||||
```bash
|
||||
pip install astropy[all] # Includes optional dependencies
|
||||
```
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- Official documentation: https://docs.astropy.org
|
||||
- Tutorials: https://learn.astropy.org
|
||||
- API reference: Consult `references/module_overview.md` in this skill
|
||||
- Working examples: Consult `references/common_workflows.md` in this skill
|
||||
scientific-packages/astropy/references/common_workflows.md (new file, 618 lines)
@@ -0,0 +1,618 @@
|
||||
# Common Astropy Workflows
|
||||
|
||||
This document describes frequently used workflows when working with astronomical data using astropy.
|
||||
|
||||
## 1. Working with FITS Files
|
||||
|
||||
### Basic FITS Reading
|
||||
```python
|
||||
from astropy.io import fits
|
||||
import numpy as np
|
||||
|
||||
# Open and examine structure
|
||||
with fits.open('observation.fits') as hdul:
|
||||
hdul.info()
|
||||
|
||||
# Access primary HDU
|
||||
primary_hdr = hdul[0].header
|
||||
primary_data = hdul[0].data
|
||||
|
||||
# Access extension
|
||||
ext_data = hdul[1].data
|
||||
ext_hdr = hdul[1].header
|
||||
|
||||
# Read specific header keywords
|
||||
object_name = primary_hdr['OBJECT']
|
||||
exposure = primary_hdr['EXPTIME']
|
||||
```
|
||||
|
||||
### Writing FITS Files
|
||||
```python
|
||||
# Create new FITS file
|
||||
from astropy.io import fits
|
||||
import numpy as np
|
||||
|
||||
# Create data
|
||||
data = np.random.random((100, 100))
|
||||
|
||||
# Create primary HDU
|
||||
hdu = fits.PrimaryHDU(data)
|
||||
hdu.header['OBJECT'] = 'M31'
|
||||
hdu.header['EXPTIME'] = 300.0
|
||||
|
||||
# Write to file
|
||||
hdu.writeto('output.fits', overwrite=True)
|
||||
|
||||
# Multi-extension FITS
|
||||
hdul = fits.HDUList([
|
||||
fits.PrimaryHDU(data1),
|
||||
fits.ImageHDU(data2, name='SCI'),
|
||||
fits.ImageHDU(data3, name='ERR')
|
||||
])
|
||||
hdul.writeto('multi_ext.fits', overwrite=True)
|
||||
```
|
||||
|
||||
### FITS Table Operations
|
||||
```python
|
||||
from astropy.io import fits
|
||||
|
||||
# Read binary table
|
||||
with fits.open('catalog.fits') as hdul:
|
||||
table_data = hdul[1].data
|
||||
|
||||
# Access columns
|
||||
ra = table_data['RA']
|
||||
dec = table_data['DEC']
|
||||
mag = table_data['MAG']
|
||||
|
||||
# Filter data
|
||||
bright = table_data[table_data['MAG'] < 15]
|
||||
|
||||
# Write binary table
|
||||
from astropy.table import Table
|
||||
import astropy.units as u
|
||||
|
||||
t = Table([ra, dec, mag], names=['RA', 'DEC', 'MAG'])
|
||||
t['RA'].unit = u.degree
|
||||
t['DEC'].unit = u.degree
|
||||
t.write('output_catalog.fits', format='fits', overwrite=True)
|
||||
```
|
||||
|
||||
## 2. Coordinate Transformations
|
||||
|
||||
### Basic Coordinate Creation and Transformation
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
# Create from RA/Dec
|
||||
c = SkyCoord(ra=10.68458*u.degree, dec=41.26917*u.degree, frame='icrs')
|
||||
|
||||
# Alternative creation methods
|
||||
c = SkyCoord('00:42:44.3 +41:16:09', unit=(u.hourangle, u.deg))
|
||||
c = SkyCoord('00h42m44.3s +41d16m09s')
|
||||
|
||||
# Transform to different frames
|
||||
c_gal = c.galactic
|
||||
c_fk5 = c.fk5
|
||||
print(f"Galactic: l={c_gal.l.deg}, b={c_gal.b.deg}")
|
||||
```
|
||||
|
||||
### Coordinate Arrays and Separations
|
||||
```python
|
||||
import numpy as np
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
# Create array of coordinates
|
||||
ra_array = np.array([10.1, 10.2, 10.3]) * u.degree
|
||||
dec_array = np.array([40.1, 40.2, 40.3]) * u.degree
|
||||
coords = SkyCoord(ra=ra_array, dec=dec_array, frame='icrs')
|
||||
|
||||
# Calculate separations
|
||||
c1 = SkyCoord(ra=10*u.degree, dec=40*u.degree)
|
||||
c2 = SkyCoord(ra=11*u.degree, dec=41*u.degree)
|
||||
sep = c1.separation(c2)
|
||||
print(f"Separation: {sep.to(u.arcmin)}")
|
||||
|
||||
# Position angle
|
||||
pa = c1.position_angle(c2)
|
||||
```
|
||||
|
||||
### Catalog Matching
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord, match_coordinates_sky
|
||||
import astropy.units as u
|
||||
|
||||
# Two catalogs of coordinates
|
||||
catalog1 = SkyCoord(ra=[10, 11, 12]*u.degree, dec=[40, 41, 42]*u.degree)
|
||||
catalog2 = SkyCoord(ra=[10.01, 11.02, 13]*u.degree, dec=[40.01, 41.01, 43]*u.degree)
|
||||
|
||||
# Find nearest neighbors
|
||||
idx, sep2d, dist3d = catalog1.match_to_catalog_sky(catalog2)
|
||||
|
||||
# Filter by separation threshold
|
||||
max_sep = 1 * u.arcsec
|
||||
matched = sep2d < max_sep
|
||||
matching_indices = idx[matched]
|
||||
```
|
||||
|
||||
### Horizontal Coordinates (Alt/Az)
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord, EarthLocation, AltAz
|
||||
from astropy.time import Time
|
||||
import astropy.units as u
|
||||
|
||||
# Observer location
|
||||
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg, height=300*u.m)
|
||||
|
||||
# Observation time
|
||||
obstime = Time('2023-01-01 03:00:00')
|
||||
|
||||
# Target coordinate
|
||||
target = SkyCoord(ra=10*u.degree, dec=40*u.degree, frame='icrs')
|
||||
|
||||
# Transform to Alt/Az
|
||||
altaz_frame = AltAz(obstime=obstime, location=location)
|
||||
target_altaz = target.transform_to(altaz_frame)
|
||||
|
||||
print(f"Altitude: {target_altaz.alt.deg}")
|
||||
print(f"Azimuth: {target_altaz.az.deg}")
|
||||
```
|
||||
|
||||
## 3. Units and Quantities
|
||||
|
||||
### Basic Unit Operations
|
||||
```python
|
||||
import astropy.units as u
|
||||
|
||||
# Create quantities
|
||||
distance = 5.2 * u.parsec
|
||||
time = 10 * u.year
|
||||
velocity = 300 * u.km / u.s
|
||||
|
||||
# Unit conversion
|
||||
distance_ly = distance.to(u.lightyear)
|
||||
velocity_mps = velocity.to(u.m / u.s)
|
||||
|
||||
# Arithmetic with units
|
||||
wavelength = 500 * u.nm
|
||||
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
|
||||
|
||||
# Compose/decompose units
|
||||
composite = (1 * u.kg * u.m**2 / u.s**2)
|
||||
print(composite.decompose()) # Base SI units
|
||||
print(composite.compose()) # Known compound units (Joule)
|
||||
```
|
||||
|
||||
### Working with Arrays
|
||||
```python
|
||||
import numpy as np
|
||||
import astropy.units as u
|
||||
|
||||
# Quantity arrays
|
||||
wavelengths = np.array([400, 500, 600]) * u.nm
|
||||
frequencies = wavelengths.to(u.THz, equivalencies=u.spectral())
|
||||
|
||||
# Mathematical operations preserve units
|
||||
fluxes = np.array([1.2, 2.3, 1.8]) * u.Jy
|
||||
luminosities = 4 * np.pi * (10*u.pc)**2 * fluxes
|
||||
```
|
||||
|
||||
### Custom Units and Equivalencies
|
||||
```python
|
||||
import astropy.units as u
|
||||
|
||||
# Define custom unit
|
||||
beam = u.def_unit('beam', 1.5e-10 * u.steradian)
|
||||
|
||||
# Register for session
|
||||
u.add_enabled_units([beam])
|
||||
|
||||
# Use in calculations
|
||||
flux_per_beam = 1.5 * u.Jy / beam
|
||||
|
||||
# Doppler equivalencies
|
||||
rest_wavelength = 656.3 * u.nm # H-alpha
|
||||
observed = 656.5 * u.nm
|
||||
velocity = observed.to(u.km/u.s,
|
||||
equivalencies=u.doppler_optical(rest_wavelength))
|
||||
```
|
||||
|
||||
## 4. Time Handling
|
||||
|
||||
### Time Creation and Conversion
|
||||
```python
|
||||
from astropy.time import Time
|
||||
import astropy.units as u
|
||||
|
||||
# Create time objects
|
||||
t1 = Time('2023-01-01T00:00:00', format='isot', scale='utc')
|
||||
t2 = Time(2459945.5, format='jd', scale='utc')
|
||||
t3 = Time(['2023-01-01', '2023-06-01'], format='iso')
|
||||
|
||||
# Convert formats
|
||||
print(t1.jd) # Julian Date
|
||||
print(t1.mjd) # Modified Julian Date
|
||||
print(t1.unix) # Unix timestamp
|
||||
print(t1.iso) # ISO format
|
||||
|
||||
# Convert time scales
|
||||
print(t1.tai) # Convert to TAI
|
||||
print(t1.tt) # Convert to TT
|
||||
print(t1.tdb) # Convert to TDB
|
||||
```
|
||||
|
||||
### Time Arithmetic
|
||||
```python
|
||||
from astropy.time import Time, TimeDelta
|
||||
import astropy.units as u
import numpy as np
|
||||
|
||||
t1 = Time('2023-01-01T00:00:00')
|
||||
dt = TimeDelta(1*u.day)
|
||||
|
||||
# Add time delta
|
||||
t2 = t1 + dt
|
||||
|
||||
# Difference between times
|
||||
diff = t2 - t1
|
||||
print(diff.to(u.hour))
|
||||
|
||||
# Array of times
|
||||
times = t1 + np.arange(10) * u.day
|
||||
```
|
||||
|
||||
### Sidereal Time and Astronomical Calculations
|
||||
```python
|
||||
from astropy.time import Time
|
||||
from astropy.coordinates import EarthLocation
|
||||
import astropy.units as u
|
||||
|
||||
location = EarthLocation(lat=40*u.deg, lon=-70*u.deg)
|
||||
t = Time('2023-01-01T00:00:00')
|
||||
|
||||
# Local sidereal time
|
||||
lst = t.sidereal_time('apparent', longitude=location.lon)
|
||||
|
||||
# Light travel time correction
|
||||
from astropy.coordinates import SkyCoord
|
||||
target = SkyCoord(ra=10*u.deg, dec=40*u.deg)
|
||||
ltt_bary = t.light_travel_time(target, location=location)
|
||||
t_bary = t.tdb + ltt_bary  # apply the correction in the TDB scale
|
||||
```
|
||||
|
||||
## 5. Tables and Data Management
|
||||
|
||||
### Creating and Manipulating Tables
|
||||
```python
|
||||
from astropy.table import Table, Column
|
||||
import astropy.units as u
|
||||
import numpy as np
|
||||
|
||||
# Create table
|
||||
t = Table()
|
||||
t['name'] = ['Star1', 'Star2', 'Star3']
|
||||
t['ra'] = [10.5, 11.2, 12.3] * u.degree
|
||||
t['dec'] = [41.2, 42.1, 43.5] * u.degree
|
||||
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
|
||||
|
||||
# Add column metadata
|
||||
t['flux'].description = 'Flux at 1.4 GHz'
|
||||
t['flux'].format = '.2f'
|
||||
|
||||
# Add new column
|
||||
t['flux_mJy'] = t['flux'].to(u.mJy)
|
||||
|
||||
# Filter rows
|
||||
bright = t[t['flux'] > 1.0 * u.Jy]
|
||||
|
||||
# Sort
|
||||
t.sort('flux')
|
||||
```
|
||||
|
||||
### Table I/O
|
||||
```python
|
||||
from astropy.table import Table
|
||||
|
||||
# Read various formats
|
||||
t = Table.read('data.fits')
|
||||
t = Table.read('data.csv', format='ascii.csv')
|
||||
t = Table.read('data.ecsv', format='ascii.ecsv') # Preserves units
|
||||
t = Table.read('data.votable', format='votable')
|
||||
|
||||
# Write various formats
|
||||
t.write('output.fits', overwrite=True)
|
||||
t.write('output.csv', format='ascii.csv', overwrite=True)
|
||||
t.write('output.ecsv', format='ascii.ecsv', overwrite=True)
|
||||
t.write('output.votable', format='votable', overwrite=True)
|
||||
```
|
||||
|
||||
### Advanced Table Operations
|
||||
```python
|
||||
from astropy.table import Table, join, vstack, hstack
import numpy as np
|
||||
|
||||
# Join tables
|
||||
t1 = Table([[1, 2], ['a', 'b']], names=['id', 'val1'])
|
||||
t2 = Table([[1, 2], ['c', 'd']], names=['id', 'val2'])
|
||||
joined = join(t1, t2, keys='id')
|
||||
|
||||
# Stack tables vertically
|
||||
combined = vstack([t1, t2])
|
||||
|
||||
# Stack horizontally
|
||||
combined = hstack([t1, t2])
|
||||
|
||||
# Grouping and aggregation (assumes a table t with a 'category' column)
t.group_by('category').groups.aggregate(np.mean)
|
||||
```
|
||||
|
||||
### Tables with Astronomical Objects
|
||||
```python
|
||||
from astropy.table import Table
|
||||
from astropy.coordinates import SkyCoord
|
||||
from astropy.time import Time
|
||||
import astropy.units as u
|
||||
|
||||
# Table with SkyCoord column
|
||||
coords = SkyCoord(ra=[10, 11, 12]*u.deg, dec=[40, 41, 42]*u.deg)
|
||||
times = Time(['2023-01-01', '2023-01-02', '2023-01-03'])
|
||||
|
||||
t = Table([coords, times], names=['coords', 'obstime'])
|
||||
|
||||
# Access individual coordinates
|
||||
print(t['coords'][0].ra)
|
||||
print(t['coords'][0].dec)
|
||||
```
|
||||
|
||||
## 6. Cosmological Calculations
|
||||
|
||||
### Distance Calculations
|
||||
```python
|
||||
from astropy.cosmology import Planck18, FlatLambdaCDM
|
||||
import astropy.units as u
|
||||
import numpy as np
|
||||
|
||||
# Use built-in cosmology
|
||||
cosmo = Planck18
|
||||
|
||||
# Redshifts
|
||||
z = np.linspace(0, 5, 50)
|
||||
|
||||
# Calculate distances
|
||||
comoving_dist = cosmo.comoving_distance(z)
|
||||
angular_diam_dist = cosmo.angular_diameter_distance(z)
|
||||
luminosity_dist = cosmo.luminosity_distance(z)
|
||||
|
||||
# Age of universe
|
||||
age_at_z = cosmo.age(z)
|
||||
lookback_time = cosmo.lookback_time(z)
|
||||
|
||||
# Hubble parameter
|
||||
H_z = cosmo.H(z)
|
||||
```
|
||||
|
||||
### Converting Observables
|
||||
```python
|
||||
from astropy.cosmology import Planck18
|
||||
import astropy.units as u
import numpy as np
|
||||
|
||||
cosmo = Planck18
|
||||
z = 1.5
|
||||
|
||||
# Angular diameter distance
|
||||
d_A = cosmo.angular_diameter_distance(z)
|
||||
|
||||
# Convert angular size to physical size
|
||||
angular_size = 1 * u.arcsec
|
||||
physical_size = (angular_size.to(u.radian) * d_A).to(u.kpc)
|
||||
|
||||
# Convert flux to luminosity
|
||||
flux = 1e-17 * u.erg / u.s / u.cm**2
|
||||
d_L = cosmo.luminosity_distance(z)
|
||||
luminosity = flux * 4 * np.pi * d_L**2
|
||||
|
||||
# Find redshift for given distance
|
||||
from astropy.cosmology import z_at_value
|
||||
z_result = z_at_value(cosmo.luminosity_distance, 1000*u.Mpc)
|
||||
```
|
||||
|
||||
### Custom Cosmology
|
||||
```python
|
||||
from astropy.cosmology import FlatLambdaCDM
|
||||
import astropy.units as u
|
||||
|
||||
# Define custom cosmology
|
||||
my_cosmo = FlatLambdaCDM(H0=70 * u.km/u.s/u.Mpc,
|
||||
Om0=0.3,
|
||||
Tcmb0=2.725 * u.K)
|
||||
|
||||
# Use it for calculations
|
||||
print(my_cosmo.age(0))
|
||||
print(my_cosmo.luminosity_distance(1.5))
|
||||
```
|
||||
|
||||
## 7. Model Fitting
|
||||
|
||||
### Fitting 1D Models
|
||||
```python
|
||||
from astropy.modeling import models, fitting
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Generate data with noise
|
||||
x = np.linspace(0, 10, 100)
|
||||
true_model = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
|
||||
y = true_model(x) + np.random.normal(0, 0.5, x.shape)
|
||||
|
||||
# Create and fit model
|
||||
g_init = models.Gaussian1D(amplitude=8, mean=4.5, stddev=0.8)
|
||||
fitter = fitting.LevMarLSQFitter()
|
||||
g_fit = fitter(g_init, x, y)
|
||||
|
||||
# Plot results
|
||||
plt.plot(x, y, 'o', label='Data')
|
||||
plt.plot(x, g_fit(x), label='Fit')
|
||||
plt.legend()
|
||||
|
||||
# Get fitted parameters
|
||||
print(f"Amplitude: {g_fit.amplitude.value}")
|
||||
print(f"Mean: {g_fit.mean.value}")
|
||||
print(f"Stddev: {g_fit.stddev.value}")
|
||||
```
|
||||
|
||||
### Fitting with Constraints
|
||||
```python
|
||||
from astropy.modeling import models, fitting
|
||||
|
||||
# Set parameter bounds
|
||||
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)
|
||||
g.amplitude.bounds = (0, None) # Positive only
|
||||
g.mean.bounds = (4, 6) # Constrain center
|
||||
g.stddev.fixed = True # Fix width
|
||||
|
||||
# Tie parameters (for multi-component models)
|
||||
g1 = models.Gaussian1D(amplitude=10, mean=5, stddev=1, name='g1')
|
||||
g2 = models.Gaussian1D(amplitude=5, mean=6, stddev=1, name='g2')
|
||||
g2.stddev.tied = lambda model: model.g1.stddev
|
||||
|
||||
# Compound model
|
||||
model = g1 + g2
|
||||
```
|
||||
|
||||
### 2D Image Fitting
|
||||
```python
|
||||
from astropy.modeling import models, fitting
|
||||
import numpy as np
|
||||
|
||||
# Create 2D data
|
||||
y, x = np.mgrid[0:100, 0:100]
|
||||
z = models.Gaussian2D(amplitude=100, x_mean=50, y_mean=50,
|
||||
x_stddev=5, y_stddev=5)(x, y)
|
||||
z += np.random.normal(0, 5, z.shape)
|
||||
|
||||
# Fit 2D Gaussian
|
||||
g_init = models.Gaussian2D(amplitude=90, x_mean=48, y_mean=48,
|
||||
x_stddev=4, y_stddev=4)
|
||||
fitter = fitting.LevMarLSQFitter()
|
||||
g_fit = fitter(g_init, x, y, z)
|
||||
|
||||
# Get parameters
|
||||
print(f"Center: ({g_fit.x_mean.value}, {g_fit.y_mean.value})")
|
||||
print(f"Width: ({g_fit.x_stddev.value}, {g_fit.y_stddev.value})")
|
||||
```
|
||||
|
||||
## 8. Image Processing and Visualization
|
||||
|
||||
### Image Display with Proper Scaling
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Read FITS image
|
||||
data = fits.getdata('image.fits')
|
||||
|
||||
# Apply normalization
|
||||
norm = ImageNormalize(data,
|
||||
interval=PercentileInterval(99),
|
||||
stretch=SqrtStretch())
|
||||
|
||||
# Display
|
||||
plt.imshow(data, norm=norm, origin='lower', cmap='gray')
|
||||
plt.colorbar()
|
||||
```
|
||||
|
||||
### WCS Plotting
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.wcs import WCS
|
||||
from astropy.visualization import ImageNormalize, LogStretch, PercentileInterval
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Read FITS with WCS
|
||||
hdu = fits.open('image.fits')[0]
|
||||
wcs = WCS(hdu.header)
|
||||
data = hdu.data
|
||||
|
||||
# Create figure with WCS projection
|
||||
fig = plt.figure()
|
||||
ax = fig.add_subplot(111, projection=wcs)
|
||||
|
||||
# Plot with coordinate grid
|
||||
norm = ImageNormalize(data, interval=PercentileInterval(99.5),
|
||||
stretch=LogStretch())
|
||||
im = ax.imshow(data, norm=norm, origin='lower', cmap='viridis')
|
||||
|
||||
# Add coordinate labels
|
||||
ax.set_xlabel('RA')
|
||||
ax.set_ylabel('Dec')
|
||||
ax.coords.grid(color='white', alpha=0.5)
|
||||
plt.colorbar(im)
|
||||
```
|
||||
|
||||
### Sigma Clipping and Statistics
|
||||
```python
|
||||
from astropy.stats import sigma_clip, sigma_clipped_stats
|
||||
import numpy as np
|
||||
|
||||
# Data with outliers
|
||||
data = np.random.normal(100, 15, 1000)
|
||||
data[0:50] = np.random.normal(200, 10, 50) # Add outliers
|
||||
|
||||
# Sigma clipping
|
||||
clipped = sigma_clip(data, sigma=3, maxiters=5)
|
||||
|
||||
# Get statistics on clipped data
|
||||
mean, median, std = sigma_clipped_stats(data, sigma=3)
|
||||
|
||||
print(f"Mean: {mean:.2f}")
|
||||
print(f"Median: {median:.2f}")
|
||||
print(f"Std: {std:.2f}")
|
||||
print(f"Clipped {clipped.mask.sum()} values")
|
||||
```
|
||||
|
||||
## 9. Complete Analysis Example
|
||||
|
||||
### Photometry Pipeline
|
||||
```python
|
||||
from astropy.io import fits
|
||||
from astropy.wcs import WCS
|
||||
from astropy.coordinates import SkyCoord
|
||||
from astropy.stats import sigma_clipped_stats
|
||||
from astropy.visualization import ImageNormalize, LogStretch
|
||||
import astropy.units as u
|
||||
import numpy as np
|
||||
|
||||
# Read FITS file
|
||||
hdu = fits.open('observation.fits')[0]
|
||||
data = hdu.data
|
||||
header = hdu.header
|
||||
wcs = WCS(header)
|
||||
|
||||
# Calculate background statistics
|
||||
mean, median, std = sigma_clipped_stats(data, sigma=3.0)
|
||||
print(f"Background: {median:.2f} +/- {std:.2f}")
|
||||
|
||||
# Subtract background
|
||||
data_sub = data - median
|
||||
|
||||
# Known source coordinates
|
||||
source_coord = SkyCoord(ra='10:42:30', dec='+41:16:09', unit=(u.hourangle, u.deg))
|
||||
|
||||
# Convert to pixel coordinates
|
||||
x_pix, y_pix = wcs.world_to_pixel(source_coord)
|
||||
|
||||
# Simple aperture photometry
|
||||
aperture_radius = 10 # pixels
|
||||
y, x = np.ogrid[:data.shape[0], :data.shape[1]]
|
||||
mask = (x - x_pix)**2 + (y - y_pix)**2 <= aperture_radius**2
|
||||
|
||||
aperture_sum = np.sum(data_sub[mask])
|
||||
npix = np.sum(mask)
|
||||
|
||||
print(f"Source position: ({x_pix:.1f}, {y_pix:.1f})")
|
||||
print(f"Aperture sum: {aperture_sum:.2f}")
|
||||
print(f"S/N: {aperture_sum / (std * np.sqrt(npix)):.2f}")
|
||||
```
|
||||
|
||||
This workflow document provides practical examples for common astronomical data analysis tasks using astropy.
|
||||
scientific-packages/astropy/references/module_overview.md (new file, 340 lines)
@@ -0,0 +1,340 @@
|
||||
# Astropy Module Overview
|
||||
|
||||
This document provides a comprehensive reference of all major astropy subpackages and their capabilities.
|
||||
|
||||
## Core Data Structures
|
||||
|
||||
### astropy.units
|
||||
**Purpose**: Handle physical units and dimensional analysis in computations.
|
||||
|
||||
**Key Classes**:
|
||||
- `Quantity` - Combines numerical values with units
|
||||
- `Unit` - Represents physical units
|
||||
|
||||
**Common Operations**:
|
||||
```python
|
||||
import astropy.units as u
|
||||
distance = 5 * u.meter
|
||||
time = 2 * u.second
|
||||
velocity = distance / time # Returns Quantity in m/s
|
||||
wavelength = 500 * u.nm
|
||||
frequency = wavelength.to(u.Hz, equivalencies=u.spectral())
|
||||
```
|
||||
|
||||
**Equivalencies**:
|
||||
- `u.spectral()` - Convert wavelength ↔ frequency
|
||||
- `u.doppler_optical()`, `u.doppler_radio()` - Velocity conversions
|
||||
- `u.temperature()` - Temperature unit conversions
|
||||
- `u.pixel_scale()` - Pixel to physical units
|
||||
|
||||
### astropy.constants
|
||||
**Purpose**: Provide physical and astronomical constants.
|
||||
|
||||
**Common Constants**:
|
||||
- `c` - Speed of light
|
||||
- `G` - Gravitational constant
|
||||
- `h` - Planck constant
|
||||
- `M_sun`, `R_sun`, `L_sun` - Solar mass, radius, luminosity
|
||||
- `M_earth`, `R_earth` - Earth mass, radius
|
||||
- `pc`, `au` - Parsec, astronomical unit
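
**Example** (a minimal sketch combining constants with units):
```python
from astropy import constants as const
import astropy.units as u

# Solar escape velocity from G, M_sun and R_sun (~617 km/s)
v_esc = (2 * const.G * const.M_sun / const.R_sun) ** 0.5
print(v_esc.to(u.km / u.s))
```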
|
||||
|
||||
### astropy.time
|
||||
**Purpose**: Represent and manipulate times and dates with astronomical precision.
|
||||
|
||||
**Time Scales**:
|
||||
- `UTC` - Coordinated Universal Time
|
||||
- `TAI` - International Atomic Time
|
||||
- `TT` - Terrestrial Time
|
||||
- `TCB`, `TCG` - Barycentric/Geocentric Coordinate Time
|
||||
- `TDB` - Barycentric Dynamical Time
|
||||
- `UT1` - Universal Time
|
||||
|
||||
**Common Formats**:
|
||||
- `iso`, `isot` - ISO 8601 strings
|
||||
- `jd`, `mjd` - Julian/Modified Julian Date
|
||||
- `unix`, `gps` - Unix/GPS timestamps
|
||||
- `datetime` - Python datetime objects
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from astropy.time import Time
|
||||
t = Time('2023-01-01T00:00:00', format='isot', scale='utc')
|
||||
print(t.mjd) # Modified Julian Date
|
||||
print(t.jd) # Julian Date
|
||||
print(t.tt) # Convert to TT scale
|
||||
```
|
||||
|
||||
### astropy.table
|
||||
**Purpose**: Work with tabular data optimized for astronomical applications.
|
||||
|
||||
**Key Features**:
|
||||
- Native support for astropy Quantity, Time, and SkyCoord columns
|
||||
- Multi-dimensional columns
|
||||
- Metadata preservation (units, descriptions, formats)
|
||||
- Advanced operations: joins, grouping, binning
|
||||
- File I/O via unified interface
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from astropy.table import Table
|
||||
import astropy.units as u
|
||||
|
||||
t = Table()
|
||||
t['name'] = ['Star1', 'Star2', 'Star3']
|
||||
t['ra'] = [10.5, 11.2, 12.3] * u.degree
|
||||
t['dec'] = [41.2, 42.1, 43.5] * u.degree
|
||||
t['flux'] = [1.2, 2.3, 0.8] * u.Jy
|
||||
```
|
||||
|
||||
## Coordinates and World Coordinate Systems
|
||||
|
||||
### astropy.coordinates
|
||||
**Purpose**: Represent and transform celestial coordinates.
|
||||
|
||||
**Primary Interface**: `SkyCoord` - High-level class for sky positions
|
||||
|
||||
**Coordinate Frames**:
|
||||
- `ICRS` - International Celestial Reference System (default)
|
||||
- `FK5`, `FK4` - Fifth/Fourth Fundamental Katalog
|
||||
- `Galactic`, `Supergalactic` - Galactic coordinates
|
||||
- `AltAz` - Horizontal (altitude-azimuth) coordinates
|
||||
- `GCRS`, `CIRS`, `ITRS` - Earth-based systems
|
||||
- `BarycentricMeanEcliptic`, `HeliocentricMeanEcliptic`, `GeocentricMeanEcliptic` - Ecliptic coordinates
|
||||
|
||||
**Common Operations**:
|
||||
```python
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
# Create coordinate
|
||||
c = SkyCoord(ra=10.625*u.degree, dec=41.2*u.degree, frame='icrs')
|
||||
|
||||
# Transform to galactic
|
||||
c_gal = c.galactic
|
||||
|
||||
# Calculate separation
|
||||
c2 = SkyCoord(ra=11*u.degree, dec=42*u.degree, frame='icrs')
|
||||
sep = c.separation(c2)
|
||||
|
||||
# Match catalogs
|
||||
idx, sep2d, dist3d = c.match_to_catalog_sky(catalog_coords)
|
||||
```
|
||||
|
||||
### astropy.wcs
|
||||
**Purpose**: Handle World Coordinate System transformations for astronomical images.
|
||||
|
||||
**Key Class**: `WCS` - Maps between pixel and world coordinates
|
||||
|
||||
**Common Use Cases**:
|
||||
- Convert pixel coordinates to RA/Dec
|
||||
- Convert RA/Dec to pixel coordinates
|
||||
- Handle distortion corrections (SIP, lookup tables)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from astropy.wcs import WCS
|
||||
from astropy.io import fits
|
||||
|
||||
hdu = fits.open('image.fits')[0]
|
||||
wcs = WCS(hdu.header)
|
||||
|
||||
# Pixel to world
|
||||
ra, dec = wcs.pixel_to_world_values(100, 200)
|
||||
|
||||
# World to pixel
|
||||
x, y = wcs.world_to_pixel_values(ra, dec)
|
||||
```
|
||||
|
||||
## File I/O
|
||||
|
||||
### astropy.io.fits
|
||||
**Purpose**: Read and write FITS (Flexible Image Transport System) files.
|
||||
|
||||
**Key Classes**:
|
||||
- `HDUList` - Container for all HDUs in a file
|
||||
- `PrimaryHDU` - Primary header data unit
|
||||
- `ImageHDU` - Image extension
|
||||
- `BinTableHDU` - Binary table extension
|
||||
- `Header` - FITS header keywords
|
||||
|
||||
**Common Operations**:
|
||||
```python
|
||||
from astropy.io import fits
|
||||
|
||||
# Read FITS file
|
||||
with fits.open('file.fits') as hdul:
|
||||
hdul.info() # Display structure
|
||||
header = hdul[0].header
|
||||
data = hdul[0].data
|
||||
|
||||
# Write FITS file
|
||||
fits.writeto('output.fits', data, header)
|
||||
|
||||
# Update header keyword
|
||||
fits.setval('file.fits', 'OBJECT', value='M31')
|
||||
```
|
||||
|
||||
### astropy.io.ascii
|
||||
**Purpose**: Read and write ASCII tables in various formats.
|
||||
|
||||
**Supported Formats**:
|
||||
- Basic, CSV, tab-delimited
|
||||
- CDS/MRT (Machine Readable Tables)
|
||||
- IPAC, Daophot, SExtractor
|
||||
- LaTeX tables
|
||||
- HTML tables
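
**Example** (a minimal sketch; filenames are illustrative):
```python
from astropy.io import ascii

# Read a CSV table and write it back out as a LaTeX table
table = ascii.read('photometry.csv', format='csv')
ascii.write(table, 'photometry.tex', format='latex', overwrite=True)
```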
|
||||
|
||||
### astropy.io.votable
|
||||
**Purpose**: Handle Virtual Observatory (VO) table format.
|
||||
|
||||
### astropy.io.misc
**Purpose**: Additional formats including HDF5, Parquet, and YAML.

## Scientific Calculations

### astropy.cosmology
**Purpose**: Perform cosmological calculations.

**Common Models**:
- `FlatLambdaCDM` - Flat universe with cosmological constant (most common)
- `LambdaCDM` - Universe with cosmological constant
- `Planck18`, `Planck15`, `Planck13` - Pre-defined Planck parameters
- `WMAP9`, `WMAP7`, `WMAP5` - Pre-defined WMAP parameters

**Common Methods**:
```python
from astropy.cosmology import FlatLambdaCDM, Planck18
import astropy.units as u

cosmo = FlatLambdaCDM(H0=70, Om0=0.3)
# Or use built-in: cosmo = Planck18

z = 1.5
print(cosmo.age(z))                        # Age of universe at z
print(cosmo.luminosity_distance(z))        # Luminosity distance
print(cosmo.angular_diameter_distance(z))  # Angular diameter distance
print(cosmo.comoving_distance(z))          # Comoving distance
print(cosmo.H(z))                          # Hubble parameter at z
```

### astropy.modeling
**Purpose**: Framework for model evaluation and fitting.

**Model Categories**:
- 1D models: Gaussian1D, Lorentz1D, Voigt1D, Polynomial1D
- 2D models: Gaussian2D, Disk2D, Moffat2D
- Physical models: BlackBody, Drude1D, NFW
- Polynomial models: Chebyshev, Legendre

**Common Fitters**:
- `LinearLSQFitter` - Linear least squares
- `LevMarLSQFitter` - Levenberg-Marquardt
- `SimplexLSQFitter` - Downhill simplex

**Example**:
```python
from astropy.modeling import models, fitting

# Create model
g = models.Gaussian1D(amplitude=10, mean=5, stddev=1)

# Fit to data
fitter = fitting.LevMarLSQFitter()
fitted_model = fitter(g, x_data, y_data)
```

### astropy.convolution
**Purpose**: Convolve and filter astronomical data.

**Common Kernels**:
- `Gaussian2DKernel` - 2D Gaussian smoothing
- `Box2DKernel` - 2D boxcar smoothing
- `Tophat2DKernel` - 2D tophat filter
- Custom kernels via arrays

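**Example** (a minimal smoothing sketch; the random image and kernel width are made up for illustration):
```python
import numpy as np
from astropy.convolution import Gaussian2DKernel, convolve

image = np.random.randn(100, 100)       # placeholder image data
kernel = Gaussian2DKernel(x_stddev=2)   # 2-pixel-wide Gaussian smoothing kernel
smoothed = convolve(image, kernel)      # convolution that interpolates over NaNs
```
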
### astropy.stats
**Purpose**: Statistical tools for astronomical data analysis.

**Key Functions**:
- `sigma_clip()` - Remove outliers via sigma clipping
- `sigma_clipped_stats()` - Compute mean, median, std with clipping
- `mad_std()` - Robust standard deviation estimated from the Median Absolute Deviation
- `biweight_location()`, `biweight_scale()` - Robust statistics
- `circmean()`, `circstd()` - Circular statistics

**Example**:
```python
from astropy.stats import sigma_clip, sigma_clipped_stats

# Remove outliers
filtered_data = sigma_clip(data, sigma=3, maxiters=5)

# Get statistics
mean, median, std = sigma_clipped_stats(data, sigma=3)
```

## Data Processing

### astropy.nddata
**Purpose**: Handle N-dimensional datasets with metadata.

**Key Class**: `NDData` - Container for array data with units, uncertainty, mask, and WCS

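**Example** (a minimal sketch; the array, uncertainty model, and mask threshold are arbitrary):
```python
import numpy as np
import astropy.units as u
from astropy.nddata import NDData, StdDevUncertainty

data = np.random.random((50, 50))            # placeholder image
ndd = NDData(data,
             unit=u.adu,                      # units attached to the data
             uncertainty=StdDevUncertainty(np.sqrt(data)),  # per-pixel std-dev
             mask=data < 0.01)                # True marks invalid pixels
print(ndd.unit, ndd.data.shape)
```
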
### astropy.timeseries
**Purpose**: Work with time series data.

**Key Classes**:
- `TimeSeries` - Time-indexed data table
- `BinnedTimeSeries` - Time-binned data

**Common Operations**:
- Period finding (Lomb-Scargle)
- Folding time series
- Binning data

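**Example** (a sketch of period finding and folding on a synthetic light curve; the 0.5-day signal and noise level are invented for illustration):
```python
import numpy as np
import astropy.units as u
from astropy.time import Time
from astropy.timeseries import TimeSeries, LombScargle

# Synthetic light curve: 0.5-day sinusoid plus noise
times = Time('2024-01-01 00:00:00') + np.linspace(0, 10, 500) * u.day
flux = 1.0 + 0.1 * np.sin(2 * np.pi * times.jd / 0.5) + 0.02 * np.random.randn(500)
ts = TimeSeries(time=times, data={'flux': flux})

# Lomb-Scargle periodogram and best period
frequency, power = LombScargle(ts.time, ts['flux']).autopower()
best_period = 1 / frequency[np.argmax(power)]

# Fold the time series at the recovered period
folded = ts.fold(period=best_period)
```
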
### astropy.visualization
**Purpose**: Display astronomical data effectively.

**Key Features**:
- Image normalization (LogStretch, PowerStretch, SqrtStretch, etc.)
- Interval scaling (MinMaxInterval, PercentileInterval, ZScaleInterval)
- WCSAxes for plotting with coordinate overlays
- RGB image creation with stretching
- Astronomical colormaps

**Example**:
```python
from astropy.visualization import ImageNormalize, SqrtStretch, PercentileInterval
import matplotlib.pyplot as plt

norm = ImageNormalize(data, interval=PercentileInterval(99),
                      stretch=SqrtStretch())
plt.imshow(data, norm=norm, origin='lower')
```

## Utilities

### astropy.samp
**Purpose**: Simple Application Messaging Protocol for inter-application communication.

**Use Case**: Connect Python scripts with other astronomical tools (e.g., DS9, TOPCAT).

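**Example** (a sketch of broadcasting a table to connected SAMP clients such as TOPCAT; it assumes a SAMP hub is already running, and the file URL and table name are placeholders):
```python
from astropy.samp import SAMPIntegratedClient

client = SAMPIntegratedClient()
client.connect()  # requires a running SAMP hub (e.g. started by TOPCAT)

message = {
    'samp.mtype': 'table.load.votable',
    'samp.params': {
        'url': 'file:///tmp/catalog.xml',   # placeholder local file URL
        'name': 'my catalog',               # placeholder display name
    },
}
client.notify_all(message)  # broadcast to all connected clients
client.disconnect()
```
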
## Module Import Patterns

**Standard imports**:
```python
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.time import Time
from astropy.io import fits
from astropy.table import Table
from astropy import constants as const
```

## Performance Tips

1. **Pre-compute composite units** for repeated operations
2. **Use the `<<` operator** for fast unit assignments: `array << u.m` instead of `array * u.m`
3. **Vectorize operations** rather than looping over coordinates/times
4. **Use `memmap=True`** when opening large FITS files
5. **Install Bottleneck** for faster stats operations

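**Example** (illustrating tips 2 and 4; the array size and file name are arbitrary):
```python
import numpy as np
import astropy.units as u
from astropy.io import fits

# Tip 2: attach units without copying or scaling the underlying array
values = np.random.random(1_000_000)
quantity = values << u.m   # fast, view-based unit assignment

# Tip 4: memory-map a large FITS file instead of loading it all at once
with fits.open('large_image.fits', memmap=True) as hdul:   # placeholder filename
    cutout = hdul[0].data[0:100, 0:100]   # only this slice is read from disk
```
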
226
scientific-packages/astropy/scripts/coord_convert.py
Normal file
@@ -0,0 +1,226 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Coordinate conversion utility for astronomical coordinates.
|
||||
|
||||
This script provides batch coordinate transformations between different
|
||||
astronomical coordinate systems using astropy.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
from astropy.coordinates import SkyCoord
|
||||
import astropy.units as u
|
||||
|
||||
|
||||
def convert_coordinates(coords_input, input_frame='icrs', output_frame='galactic',
|
||||
input_format='decimal', output_format='decimal'):
|
||||
"""
|
||||
Convert astronomical coordinates between different frames.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
coords_input : list of tuples or str
|
||||
Input coordinates as (lon, lat) pairs or strings
|
||||
input_frame : str
|
||||
Input coordinate frame (icrs, fk5, galactic, etc.)
|
||||
output_frame : str
|
||||
Output coordinate frame
|
||||
input_format : str
|
||||
Format of input coordinates ('decimal', 'sexagesimal', 'hourangle')
|
||||
output_format : str
|
||||
Format for output display ('decimal', 'sexagesimal', 'hourangle')
|
||||
|
||||
Returns
|
||||
-------
|
||||
list
|
||||
Converted coordinates
|
||||
"""
|
||||
results = []
|
||||
|
||||
for coord in coords_input:
|
||||
try:
|
||||
# Parse input coordinate
|
||||
if input_format == 'decimal':
|
||||
if isinstance(coord, str):
|
||||
parts = coord.split()
|
||||
lon, lat = float(parts[0]), float(parts[1])
|
||||
else:
|
||||
lon, lat = coord
|
||||
c = SkyCoord(lon*u.degree, lat*u.degree, frame=input_frame)
|
||||
|
||||
elif input_format == 'sexagesimal':
|
||||
c = SkyCoord(coord, frame=input_frame, unit=(u.hourangle, u.deg))
|
||||
|
||||
elif input_format == 'hourangle':
|
||||
if isinstance(coord, str):
|
||||
parts = coord.split()
|
||||
lon, lat = parts[0], parts[1]
|
||||
else:
|
||||
lon, lat = coord
|
||||
c = SkyCoord(lon, lat, frame=input_frame, unit=(u.hourangle, u.deg))
|
||||
|
||||
# Transform to output frame
|
||||
if output_frame == 'icrs':
|
||||
c_out = c.icrs
|
||||
elif output_frame == 'fk5':
|
||||
c_out = c.fk5
|
||||
elif output_frame == 'fk4':
|
||||
c_out = c.fk4
|
||||
elif output_frame == 'galactic':
|
||||
c_out = c.galactic
|
||||
elif output_frame == 'supergalactic':
|
||||
c_out = c.supergalactic
|
||||
else:
|
||||
c_out = c.transform_to(output_frame)
|
||||
|
||||
results.append(c_out)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error converting coordinate {coord}: {e}", file=sys.stderr)
|
||||
results.append(None)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def format_output(coords, frame, output_format='decimal'):
|
||||
"""Format coordinates for display."""
|
||||
output = []
|
||||
|
||||
for c in coords:
|
||||
if c is None:
|
||||
output.append("ERROR")
|
||||
continue
|
||||
|
||||
if frame in ['icrs', 'fk5', 'fk4']:
|
||||
lon_name, lat_name = 'RA', 'Dec'
|
||||
lon = c.ra
|
||||
lat = c.dec
|
||||
elif frame == 'galactic':
|
||||
lon_name, lat_name = 'l', 'b'
|
||||
lon = c.l
|
||||
lat = c.b
|
||||
elif frame == 'supergalactic':
|
||||
lon_name, lat_name = 'sgl', 'sgb'
|
||||
lon = c.sgl
|
||||
lat = c.sgb
|
||||
else:
|
||||
lon_name, lat_name = 'lon', 'lat'
|
||||
lon = c.spherical.lon
|
||||
lat = c.spherical.lat
|
||||
|
||||
if output_format == 'decimal':
|
||||
out_str = f"{lon.degree:12.8f} {lat.degree:+12.8f}"
|
||||
elif output_format == 'sexagesimal':
|
||||
if frame in ['icrs', 'fk5', 'fk4']:
|
||||
out_str = f"{lon.to_string(unit=u.hourangle, sep=':', pad=True)} "
|
||||
out_str += f"{lat.to_string(unit=u.degree, sep=':', pad=True)}"
|
||||
else:
|
||||
out_str = f"{lon.to_string(unit=u.degree, sep=':', pad=True)} "
|
||||
out_str += f"{lat.to_string(unit=u.degree, sep=':', pad=True)}"
|
||||
elif output_format == 'hourangle':
|
||||
out_str = f"{lon.to_string(unit=u.hourangle, sep=' ', pad=True)} "
|
||||
out_str += f"{lat.to_string(unit=u.degree, sep=' ', pad=True)}"
|
||||
|
||||
output.append(out_str)
|
||||
|
||||
return output
|
||||
|
||||
|
||||
def main():
|
||||
"""Main function for command-line usage."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Convert astronomical coordinates between different frames',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Supported frames: icrs, fk5, fk4, galactic, supergalactic
|
||||
|
||||
Input formats:
|
||||
decimal : Degrees (e.g., "10.68 41.27")
|
||||
sexagesimal : HMS/DMS (e.g., "00:42:44.3 +41:16:09")
|
||||
hourangle : Hours and degrees (e.g., "10.5h 41.5d")
|
||||
|
||||
Examples:
|
||||
%(prog)s --from icrs --to galactic "10.68 41.27"
|
||||
%(prog)s --from icrs --to galactic --input decimal --output sexagesimal "150.5 -30.2"
|
||||
%(prog)s --from galactic --to icrs "120.5 45.3"
|
||||
%(prog)s --file coords.txt --from icrs --to galactic
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('coordinates', nargs='*',
|
||||
help='Coordinates to convert (lon lat pairs)')
|
||||
parser.add_argument('-f', '--from', dest='input_frame', default='icrs',
|
||||
help='Input coordinate frame (default: icrs)')
|
||||
parser.add_argument('-t', '--to', dest='output_frame', default='galactic',
|
||||
help='Output coordinate frame (default: galactic)')
|
||||
parser.add_argument('-i', '--input', dest='input_format', default='decimal',
|
||||
choices=['decimal', 'sexagesimal', 'hourangle'],
|
||||
help='Input format (default: decimal)')
|
||||
parser.add_argument('-o', '--output', dest='output_format', default='decimal',
|
||||
choices=['decimal', 'sexagesimal', 'hourangle'],
|
||||
help='Output format (default: decimal)')
|
||||
parser.add_argument('--file', dest='input_file',
|
||||
help='Read coordinates from file (one per line)')
|
||||
parser.add_argument('--header', action='store_true',
|
||||
help='Print header line with coordinate names')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Get coordinates from file or command line
|
||||
if args.input_file:
|
||||
try:
|
||||
with open(args.input_file, 'r') as f:
|
||||
coords = [line.strip() for line in f if line.strip()]
|
||||
except FileNotFoundError:
|
||||
print(f"Error: File '{args.input_file}' not found.", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
else:
|
||||
if not args.coordinates:
|
||||
print("Error: No coordinates provided.", file=sys.stderr)
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
# Combine pairs of arguments
|
||||
if args.input_format == 'decimal':
|
||||
coords = []
|
||||
i = 0
|
||||
while i < len(args.coordinates):
|
||||
if i + 1 < len(args.coordinates):
|
||||
coords.append(f"{args.coordinates[i]} {args.coordinates[i+1]}")
|
||||
i += 2
|
||||
else:
|
||||
print(f"Warning: Odd number of coordinates, skipping last value",
|
||||
file=sys.stderr)
|
||||
break
|
||||
else:
|
||||
coords = args.coordinates
|
||||
|
||||
# Convert coordinates
|
||||
converted = convert_coordinates(coords,
|
||||
input_frame=args.input_frame,
|
||||
output_frame=args.output_frame,
|
||||
input_format=args.input_format,
|
||||
output_format=args.output_format)
|
||||
|
||||
# Format and print output
|
||||
formatted = format_output(converted, args.output_frame, args.output_format)
|
||||
|
||||
# Print header if requested
|
||||
if args.header:
|
||||
if args.output_frame in ['icrs', 'fk5', 'fk4']:
|
||||
if args.output_format == 'decimal':
|
||||
print(f"{'RA (deg)':>12s} {'Dec (deg)':>13s}")
|
||||
else:
|
||||
print(f"{'RA':>25s} {'Dec':>26s}")
|
||||
elif args.output_frame == 'galactic':
|
||||
if args.output_format == 'decimal':
|
||||
print(f"{'l (deg)':>12s} {'b (deg)':>13s}")
|
||||
else:
|
||||
print(f"{'l':>25s} {'b':>26s}")
|
||||
|
||||
for line in formatted:
|
||||
print(line)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
250
scientific-packages/astropy/scripts/cosmo_calc.py
Normal file
@@ -0,0 +1,250 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Cosmological calculator using astropy.cosmology.
|
||||
|
||||
This script provides quick calculations of cosmological distances,
|
||||
ages, and other quantities for given redshifts.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
import numpy as np
|
||||
from astropy.cosmology import FlatLambdaCDM, Planck18, Planck15, WMAP9
|
||||
import astropy.units as u
|
||||
|
||||
|
||||
def calculate_cosmology(redshifts, cosmology='Planck18', H0=None, Om0=None):
|
||||
"""
|
||||
Calculate cosmological quantities for given redshifts.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
redshifts : array-like
|
||||
Redshift values
|
||||
cosmology : str
|
||||
Cosmology to use ('Planck18', 'Planck15', 'WMAP9', 'custom')
|
||||
H0 : float, optional
|
||||
Hubble constant for custom cosmology (km/s/Mpc)
|
||||
Om0 : float, optional
|
||||
Matter density parameter for custom cosmology
|
||||
|
||||
Returns
|
||||
-------
|
||||
dict
|
||||
Dictionary containing calculated quantities
|
||||
"""
|
||||
# Select cosmology
|
||||
if cosmology == 'Planck18':
|
||||
cosmo = Planck18
|
||||
elif cosmology == 'Planck15':
|
||||
cosmo = Planck15
|
||||
elif cosmology == 'WMAP9':
|
||||
cosmo = WMAP9
|
||||
elif cosmology == 'custom':
|
||||
if H0 is None or Om0 is None:
|
||||
raise ValueError("Must provide H0 and Om0 for custom cosmology")
|
||||
cosmo = FlatLambdaCDM(H0=H0 * u.km/u.s/u.Mpc, Om0=Om0)
|
||||
else:
|
||||
raise ValueError(f"Unknown cosmology: {cosmology}")
|
||||
|
||||
z = np.atleast_1d(redshifts)
|
||||
|
||||
results = {
|
||||
'redshift': z,
|
||||
'cosmology': str(cosmo),
|
||||
'luminosity_distance': cosmo.luminosity_distance(z),
|
||||
'angular_diameter_distance': cosmo.angular_diameter_distance(z),
|
||||
'comoving_distance': cosmo.comoving_distance(z),
|
||||
'comoving_volume': cosmo.comoving_volume(z),
|
||||
'age': cosmo.age(z),
|
||||
'lookback_time': cosmo.lookback_time(z),
|
||||
'H': cosmo.H(z),
|
||||
'scale_factor': 1.0 / (1.0 + z)
|
||||
}
|
||||
|
||||
return results, cosmo
|
||||
|
||||
|
||||
def print_results(results, verbose=False, csv=False):
|
||||
"""Print calculation results."""
|
||||
|
||||
z = results['redshift']
|
||||
|
||||
if csv:
|
||||
# CSV output
|
||||
print("z,D_L(Mpc),D_A(Mpc),D_C(Mpc),Age(Gyr),t_lookback(Gyr),H(km/s/Mpc)")
|
||||
for i in range(len(z)):
|
||||
print(f"{z[i]:.6f},"
|
||||
f"{results['luminosity_distance'][i].value:.6f},"
|
||||
f"{results['angular_diameter_distance'][i].value:.6f},"
|
||||
f"{results['comoving_distance'][i].value:.6f},"
|
||||
f"{results['age'][i].value:.6f},"
|
||||
f"{results['lookback_time'][i].value:.6f},"
|
||||
f"{results['H'][i].value:.6f}")
|
||||
else:
|
||||
# Formatted table output
|
||||
if verbose:
|
||||
print(f"\nCosmology: {results['cosmology']}")
|
||||
print("-" * 80)
|
||||
|
||||
print(f"\n{'z':>8s} {'D_L':>12s} {'D_A':>12s} {'D_C':>12s} "
|
||||
f"{'Age':>10s} {'t_lb':>10s} {'H(z)':>10s}")
|
||||
print(f"{'':>8s} {'(Mpc)':>12s} {'(Mpc)':>12s} {'(Mpc)':>12s} "
|
||||
f"{'(Gyr)':>10s} {'(Gyr)':>10s} {'(km/s/Mpc)':>10s}")
|
||||
print("-" * 80)
|
||||
|
||||
for i in range(len(z)):
|
||||
print(f"{z[i]:8.4f} "
|
||||
f"{results['luminosity_distance'][i].value:12.3f} "
|
||||
f"{results['angular_diameter_distance'][i].value:12.3f} "
|
||||
f"{results['comoving_distance'][i].value:12.3f} "
|
||||
f"{results['age'][i].value:10.4f} "
|
||||
f"{results['lookback_time'][i].value:10.4f} "
|
||||
f"{results['H'][i].value:10.4f}")
|
||||
|
||||
if verbose:
|
||||
print("\nLegend:")
|
||||
print(" z : Redshift")
|
||||
print(" D_L : Luminosity distance")
|
||||
print(" D_A : Angular diameter distance")
|
||||
print(" D_C : Comoving distance")
|
||||
print(" Age : Age of universe at z")
|
||||
print(" t_lb : Lookback time to z")
|
||||
print(" H(z) : Hubble parameter at z")
|
||||
|
||||
|
||||
def convert_quantity(value, quantity_type, cosmo, to_redshift=False):
|
||||
"""
|
||||
Convert between redshift and cosmological quantity.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
value : float
|
||||
Value to convert
|
||||
quantity_type : str
|
||||
Type of quantity ('luminosity_distance', 'age', etc.)
|
||||
cosmo : Cosmology
|
||||
Cosmology object
|
||||
to_redshift : bool
|
||||
If True, convert quantity to redshift; else convert z to quantity
|
||||
"""
|
||||
from astropy.cosmology import z_at_value
|
||||
|
||||
if to_redshift:
|
||||
# Convert quantity to redshift
|
||||
if quantity_type == 'luminosity_distance':
|
||||
z = z_at_value(cosmo.luminosity_distance, value * u.Mpc)
|
||||
elif quantity_type == 'age':
|
||||
z = z_at_value(cosmo.age, value * u.Gyr)
|
||||
elif quantity_type == 'lookback_time':
|
||||
z = z_at_value(cosmo.lookback_time, value * u.Gyr)
|
||||
elif quantity_type == 'comoving_distance':
|
||||
z = z_at_value(cosmo.comoving_distance, value * u.Mpc)
|
||||
else:
|
||||
raise ValueError(f"Unknown quantity type: {quantity_type}")
|
||||
return z
|
||||
else:
|
||||
# Convert redshift to quantity
|
||||
if quantity_type == 'luminosity_distance':
|
||||
return cosmo.luminosity_distance(value)
|
||||
elif quantity_type == 'age':
|
||||
return cosmo.age(value)
|
||||
elif quantity_type == 'lookback_time':
|
||||
return cosmo.lookback_time(value)
|
||||
elif quantity_type == 'comoving_distance':
|
||||
return cosmo.comoving_distance(value)
|
||||
else:
|
||||
raise ValueError(f"Unknown quantity type: {quantity_type}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main function for command-line usage."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Calculate cosmological quantities for given redshifts',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Available cosmologies: Planck18, Planck15, WMAP9, custom
|
||||
|
||||
Examples:
|
||||
%(prog)s 0.5 1.0 1.5
|
||||
%(prog)s 0.5 --cosmology Planck15
|
||||
%(prog)s 0.5 --cosmology custom --H0 70 --Om0 0.3
|
||||
%(prog)s --range 0 3 0.5
|
||||
%(prog)s 0.5 --verbose
|
||||
%(prog)s 0.5 1.0 --csv
|
||||
%(prog)s --convert 1000 --from luminosity_distance --cosmology Planck18
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('redshifts', nargs='*', type=float,
|
||||
help='Redshift values to calculate')
|
||||
parser.add_argument('-c', '--cosmology', default='Planck18',
|
||||
choices=['Planck18', 'Planck15', 'WMAP9', 'custom'],
|
||||
help='Cosmology to use (default: Planck18)')
|
||||
parser.add_argument('--H0', type=float,
|
||||
help='Hubble constant for custom cosmology (km/s/Mpc)')
|
||||
parser.add_argument('--Om0', type=float,
|
||||
help='Matter density parameter for custom cosmology')
|
||||
parser.add_argument('-r', '--range', nargs=3, type=float, metavar=('START', 'STOP', 'STEP'),
|
||||
help='Generate redshift range (start stop step)')
|
||||
parser.add_argument('-v', '--verbose', action='store_true',
|
||||
help='Print verbose output with cosmology details')
|
||||
parser.add_argument('--csv', action='store_true',
|
||||
help='Output in CSV format')
|
||||
parser.add_argument('--convert', type=float,
|
||||
help='Convert a quantity to redshift')
|
||||
parser.add_argument('--from', dest='from_quantity',
|
||||
choices=['luminosity_distance', 'age', 'lookback_time', 'comoving_distance'],
|
||||
help='Type of quantity to convert from')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Handle conversion mode
|
||||
if args.convert is not None:
|
||||
if args.from_quantity is None:
|
||||
print("Error: Must specify --from when using --convert", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Get cosmology
|
||||
if args.cosmology == 'Planck18':
|
||||
cosmo = Planck18
|
||||
elif args.cosmology == 'Planck15':
|
||||
cosmo = Planck15
|
||||
elif args.cosmology == 'WMAP9':
|
||||
cosmo = WMAP9
|
||||
elif args.cosmology == 'custom':
|
||||
if args.H0 is None or args.Om0 is None:
|
||||
print("Error: Must provide --H0 and --Om0 for custom cosmology",
|
||||
file=sys.stderr)
|
||||
sys.exit(1)
|
||||
cosmo = FlatLambdaCDM(H0=args.H0 * u.km/u.s/u.Mpc, Om0=args.Om0)
|
||||
|
||||
z = convert_quantity(args.convert, args.from_quantity, cosmo, to_redshift=True)
|
||||
print(f"\n{args.from_quantity.replace('_', ' ').title()} = {args.convert}")
|
||||
print(f"Redshift z = {z:.6f}")
|
||||
print(f"(using {args.cosmology} cosmology)")
|
||||
return
|
||||
|
||||
# Get redshifts
|
||||
if args.range:
|
||||
start, stop, step = args.range
|
||||
redshifts = np.arange(start, stop + step/2, step)
|
||||
elif args.redshifts:
|
||||
redshifts = np.array(args.redshifts)
|
||||
else:
|
||||
print("Error: No redshifts provided.", file=sys.stderr)
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
# Calculate
|
||||
try:
|
||||
results, cosmo = calculate_cosmology(redshifts, args.cosmology,
|
||||
H0=args.H0, Om0=args.Om0)
|
||||
print_results(results, verbose=args.verbose, csv=args.csv)
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
189
scientific-packages/astropy/scripts/fits_info.py
Normal file
@@ -0,0 +1,189 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Quick FITS file inspection tool.
|
||||
|
||||
This script provides a convenient way to inspect FITS file structure,
|
||||
headers, and basic statistics without writing custom code each time.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from astropy.io import fits
|
||||
import numpy as np
|
||||
|
||||
|
||||
def print_fits_info(filename, detailed=False, ext=None):
|
||||
"""
|
||||
Print comprehensive information about a FITS file.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
filename : str
|
||||
Path to FITS file
|
||||
detailed : bool
|
||||
If True, print detailed statistics for each HDU
|
||||
ext : int or str, optional
|
||||
Specific extension to examine in detail
|
||||
"""
|
||||
print(f"\n{'='*70}")
|
||||
print(f"FITS File: {filename}")
|
||||
print(f"{'='*70}\n")
|
||||
|
||||
try:
|
||||
with fits.open(filename) as hdul:
|
||||
# Print file structure
|
||||
print("File Structure:")
|
||||
print("-" * 70)
|
||||
hdul.info()
|
||||
print()
|
||||
|
||||
# If specific extension requested
|
||||
if ext is not None:
|
||||
print(f"\nDetailed view of extension: {ext}")
|
||||
print("-" * 70)
|
||||
hdu = hdul[ext]
|
||||
print_hdu_details(hdu, detailed=True)
|
||||
return
|
||||
|
||||
# Print header and data info for each HDU
|
||||
for i, hdu in enumerate(hdul):
|
||||
print(f"\n{'='*70}")
|
||||
print(f"HDU {i}: {hdu.name}")
|
||||
print(f"{'='*70}")
|
||||
print_hdu_details(hdu, detailed=detailed)
|
||||
|
||||
except FileNotFoundError:
|
||||
print(f"Error: File '{filename}' not found.")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"Error reading FITS file: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def print_hdu_details(hdu, detailed=False):
|
||||
"""Print details for a single HDU."""
|
||||
|
||||
# Header information
|
||||
print("\nHeader Information:")
|
||||
print("-" * 70)
|
||||
|
||||
# Key header keywords
|
||||
important_keywords = ['SIMPLE', 'BITPIX', 'NAXIS', 'EXTEND',
|
||||
'OBJECT', 'TELESCOP', 'INSTRUME', 'OBSERVER',
|
||||
'DATE-OBS', 'EXPTIME', 'FILTER', 'AIRMASS',
|
||||
'RA', 'DEC', 'EQUINOX', 'CTYPE1', 'CTYPE2']
|
||||
|
||||
header = hdu.header
|
||||
for key in important_keywords:
|
||||
if key in header:
|
||||
value = header[key]
|
||||
comment = header.comments[key]
|
||||
print(f" {key:12s} = {str(value):20s} / {comment}")
|
||||
|
||||
# NAXIS keywords
|
||||
if 'NAXIS' in header:
|
||||
naxis = header['NAXIS']
|
||||
for i in range(1, naxis + 1):
|
||||
key = f'NAXIS{i}'
|
||||
if key in header:
|
||||
print(f" {key:12s} = {str(header[key]):20s} / {header.comments[key]}")
|
||||
|
||||
# Data information
|
||||
if hdu.data is not None:
|
||||
print("\nData Information:")
|
||||
print("-" * 70)
|
||||
|
||||
data = hdu.data
|
||||
print(f" Data type: {data.dtype}")
|
||||
print(f" Shape: {data.shape}")
|
||||
|
||||
# For image data
|
||||
if hasattr(data, 'ndim') and data.ndim >= 1:
|
||||
try:
|
||||
# Calculate statistics
|
||||
finite_data = data[np.isfinite(data)]
|
||||
if len(finite_data) > 0:
|
||||
print(f" Min: {np.min(finite_data):.6g}")
|
||||
print(f" Max: {np.max(finite_data):.6g}")
|
||||
print(f" Mean: {np.mean(finite_data):.6g}")
|
||||
print(f" Median: {np.median(finite_data):.6g}")
|
||||
print(f" Std: {np.std(finite_data):.6g}")
|
||||
|
||||
# Count special values
|
||||
n_nan = np.sum(np.isnan(data))
|
||||
n_inf = np.sum(np.isinf(data))
|
||||
if n_nan > 0:
|
||||
print(f" NaN values: {n_nan}")
|
||||
if n_inf > 0:
|
||||
print(f" Inf values: {n_inf}")
|
||||
except Exception as e:
|
||||
print(f" Could not calculate statistics: {e}")
|
||||
|
||||
# For table data
|
||||
if hasattr(data, 'columns'):
|
||||
print(f"\n Table Columns ({len(data.columns)}):")
|
||||
for col in data.columns:
|
||||
print(f" {col.name:20s} {col.format:10s} {col.unit or ''}")
|
||||
|
||||
if detailed:
|
||||
print(f"\n First few rows:")
|
||||
print(data[:min(5, len(data))])
|
||||
else:
|
||||
print("\n No data in this HDU")
|
||||
|
||||
# WCS information if present
|
||||
try:
|
||||
from astropy.wcs import WCS
|
||||
wcs = WCS(hdu.header)
|
||||
if wcs.has_celestial:
|
||||
print("\nWCS Information:")
|
||||
print("-" * 70)
|
||||
print(f" Has celestial WCS: Yes")
|
||||
print(f" CTYPE: {wcs.wcs.ctype}")
|
||||
if wcs.wcs.crval is not None:
|
||||
print(f" CRVAL: {wcs.wcs.crval}")
|
||||
if wcs.wcs.crpix is not None:
|
||||
print(f" CRPIX: {wcs.wcs.crpix}")
|
||||
if wcs.wcs.cdelt is not None:
|
||||
print(f" CDELT: {wcs.wcs.cdelt}")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def main():
|
||||
"""Main function for command-line usage."""
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Inspect FITS file structure and contents',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
%(prog)s image.fits
|
||||
%(prog)s image.fits --detailed
|
||||
%(prog)s image.fits --ext 1
|
||||
%(prog)s image.fits --ext SCI
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('filename', help='FITS file to inspect')
|
||||
parser.add_argument('-d', '--detailed', action='store_true',
|
||||
help='Show detailed statistics for each HDU')
|
||||
parser.add_argument('-e', '--ext', type=str, default=None,
|
||||
help='Show details for specific extension only (number or name)')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Convert extension to int if numeric
|
||||
ext = args.ext
|
||||
if ext is not None:
|
||||
try:
|
||||
ext = int(ext)
|
||||
except ValueError:
|
||||
pass # Keep as string for extension name
|
||||
|
||||
print_fits_info(args.filename, detailed=args.detailed, ext=ext)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
375
scientific-packages/biomni/SKILL.md
Normal file
@@ -0,0 +1,375 @@
|
||||
---
|
||||
name: biomni
|
||||
description: General-purpose biomedical AI agent for autonomously executing research tasks across diverse biomedical domains. Use this skill when working with biomedical data analysis, CRISPR screening, single-cell RNA-seq, molecular property prediction, genomics, proteomics, drug discovery, or any computational biology task requiring LLM-powered code generation and retrieval-augmented planning.
|
||||
---
|
||||
|
||||
# Biomni
|
||||
|
||||
## Overview
|
||||
|
||||
Biomni is a general-purpose biomedical AI agent that autonomously executes research tasks across diverse biomedical subfields. It combines large language model reasoning with retrieval-augmented planning and code-based execution to enhance scientific productivity and hypothesis generation. The system operates with an ~11GB biomedical knowledge base covering molecular, genomic, and clinical domains.
|
||||
|
||||
## Quick Start
|
||||
|
||||
Initialize and use the Biomni agent with these basic steps:
|
||||
|
||||
```python
|
||||
from biomni.agent import A1
|
||||
|
||||
# Initialize agent with data path and LLM model
|
||||
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
|
||||
|
||||
# Execute a biomedical research task
|
||||
agent.go("Your biomedical task description")
|
||||
```
|
||||
|
||||
The agent will autonomously decompose the task, retrieve relevant biomedical knowledge, generate and execute code, and provide results.
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
### Environment Preparation
|
||||
|
||||
1. **Set up the conda environment:**
|
||||
- Follow instructions in `biomni_env/README.md` from the repository
|
||||
- Activate the environment: `conda activate biomni_e1`
|
||||
|
||||
2. **Install the package:**
|
||||
```bash
|
||||
pip install biomni --upgrade
|
||||
```
|
||||
|
||||
Or install from source:
|
||||
```bash
|
||||
git clone https://github.com/snap-stanford/biomni.git
|
||||
cd biomni
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
3. **Configure API keys:**
|
||||
|
||||
Set up credentials via environment variables or `.env` file:
|
||||
```bash
|
||||
export ANTHROPIC_API_KEY="your-key-here"
|
||||
export OPENAI_API_KEY="your-key-here" # Optional
|
||||
```
|
||||
|
||||
4. **Data initialization:**
|
||||
|
||||
On first use, the agent will automatically download the ~11GB biomedical knowledge base.
|
||||
|
||||
### LLM Provider Configuration
|
||||
|
||||
Biomni supports multiple LLM providers. Configure the default provider using:
|
||||
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
# Set the default LLM model
|
||||
default_config.llm = "claude-sonnet-4-20250514" # Anthropic
|
||||
# default_config.llm = "gpt-4" # OpenAI
|
||||
# default_config.llm = "azure/gpt-4" # Azure OpenAI
|
||||
# default_config.llm = "gemini/gemini-pro" # Google Gemini
|
||||
|
||||
# Set timeout (optional)
|
||||
default_config.timeout_seconds = 1200
|
||||
|
||||
# Set data path (optional)
|
||||
default_config.data_path = "./custom/data/path"
|
||||
```
|
||||
|
||||
Refer to `references/llm_providers.md` for detailed configuration options for each provider.
|
||||
|
||||
## Core Biomedical Research Tasks
|
||||
|
||||
### 1. CRISPR Screening and Design
|
||||
|
||||
Execute CRISPR screening tasks including guide RNA design, off-target analysis, and screening experiment planning:
|
||||
|
||||
```python
|
||||
agent.go("Design a CRISPR screening experiment to identify genes involved in cancer cell resistance to drug X")
|
||||
```
|
||||
|
||||
The agent will:
|
||||
- Retrieve relevant gene databases
|
||||
- Design guide RNAs with specificity analysis
|
||||
- Plan experimental controls and readout strategies
|
||||
- Generate analysis code for screening results
|
||||
|
||||
### 2. Single-Cell RNA-seq Analysis
|
||||
|
||||
Perform comprehensive scRNA-seq analysis workflows:
|
||||
|
||||
```python
|
||||
agent.go("Analyze this 10X Genomics scRNA-seq dataset, identify cell types, and find differentially expressed genes between clusters")
|
||||
```
|
||||
|
||||
Capabilities include:
|
||||
- Quality control and preprocessing
|
||||
- Dimensionality reduction and clustering
|
||||
- Cell type annotation using marker databases
|
||||
- Differential expression analysis
|
||||
- Pathway enrichment analysis
|
||||
|
||||
### 3. Molecular Property Prediction (ADMET)
|
||||
|
||||
Predict absorption, distribution, metabolism, excretion, and toxicity properties:
|
||||
|
||||
```python
|
||||
agent.go("Predict ADMET properties for these drug candidates: [SMILES strings]")
|
||||
```
|
||||
|
||||
The agent handles:
|
||||
- Molecular descriptor calculation
|
||||
- Property prediction using integrated models
|
||||
- Toxicity screening
|
||||
- Drug-likeness assessment
|
||||
|
||||
### 4. Genomic Analysis
|
||||
|
||||
Execute genomic data analysis tasks:
|
||||
|
||||
```python
|
||||
agent.go("Perform GWAS analysis to identify SNPs associated with disease phenotype in this cohort")
|
||||
```
|
||||
|
||||
Supports:
|
||||
- Genome-wide association studies (GWAS)
|
||||
- Variant calling and annotation
|
||||
- Population genetics analysis
|
||||
- Functional genomics integration
|
||||
|
||||
### 5. Protein Structure and Function
|
||||
|
||||
Analyze protein sequences and structures:
|
||||
|
||||
```python
|
||||
agent.go("Predict the structure of this protein sequence and identify potential binding sites")
|
||||
```
|
||||
|
||||
Capabilities:
|
||||
- Sequence analysis and domain identification
|
||||
- Structure prediction integration
|
||||
- Binding site prediction
|
||||
- Protein-protein interaction analysis
|
||||
|
||||
### 6. Disease Diagnosis and Classification
|
||||
|
||||
Perform disease classification from multi-omics data:
|
||||
|
||||
```python
|
||||
agent.go("Build a classifier to diagnose disease X from patient RNA-seq and clinical data")
|
||||
```
|
||||
|
||||
### 7. Systems Biology and Pathway Analysis
|
||||
|
||||
Analyze biological pathways and networks:
|
||||
|
||||
```python
|
||||
agent.go("Identify dysregulated pathways in this differential expression dataset")
|
||||
```
|
||||
|
||||
### 8. Drug Discovery and Repurposing
|
||||
|
||||
Support drug discovery workflows:
|
||||
|
||||
```python
|
||||
agent.go("Identify FDA-approved drugs that could be repurposed for treating disease Y based on mechanism of action")
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Custom Configuration per Agent
|
||||
|
||||
Override global configuration for specific agent instances:
|
||||
|
||||
```python
|
||||
agent = A1(
|
||||
path='./project_data',
|
||||
llm='gpt-4o',
|
||||
timeout=1800
|
||||
)
|
||||
```
|
||||
|
||||
### Conversation History and Reporting
|
||||
|
||||
Save execution traces as formatted PDF reports:
|
||||
|
||||
```python
|
||||
# After executing tasks
|
||||
agent.save_conversation_history(
|
||||
output_path='./reports/experiment_log.pdf',
|
||||
format='pdf'
|
||||
)
|
||||
```
|
||||
|
||||
Requires one of: WeasyPrint, markdown2pdf, or Pandoc.
|
||||
|
||||
### Model Context Protocol (MCP) Integration
|
||||
|
||||
Extend agent capabilities with external tools:
|
||||
|
||||
```python
|
||||
# Add MCP-compatible tools
|
||||
agent.add_mcp(config_path='./mcp_config.json')
|
||||
```
|
||||
|
||||
MCP enables integration with:
|
||||
- Laboratory information management systems (LIMS)
|
||||
- Specialized bioinformatics databases
|
||||
- Custom analysis pipelines
|
||||
- External computational resources
|
||||
|
||||
### Using Biomni-R0 (Specialized Reasoning Model)
|
||||
|
||||
Deploy the 32B parameter Biomni-R0 model for enhanced biological reasoning:
|
||||
|
||||
```bash
|
||||
# Install SGLang
|
||||
pip install "sglang[all]"
|
||||
|
||||
# Deploy Biomni-R0
|
||||
python -m sglang.launch_server \
|
||||
--model-path snap-stanford/biomni-r0 \
|
||||
--port 30000 \
|
||||
--trust-remote-code
|
||||
```
|
||||
|
||||
Then configure the agent:
|
||||
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "openai/biomni-r0"
|
||||
default_config.api_base = "http://localhost:30000/v1"
|
||||
```
|
||||
|
||||
Biomni-R0 provides specialized reasoning for:
|
||||
- Complex multi-step biological workflows
|
||||
- Hypothesis generation and evaluation
|
||||
- Experimental design optimization
|
||||
- Literature-informed analysis
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Task Specification
|
||||
|
||||
Provide clear, specific task descriptions:
|
||||
|
||||
✅ **Good:** "Analyze this scRNA-seq dataset (file: data.h5ad) to identify T cell subtypes, then perform differential expression analysis comparing activated vs. resting T cells"
|
||||
|
||||
❌ **Vague:** "Analyze my RNA-seq data"
|
||||
|
||||
### Data Organization
|
||||
|
||||
Structure data directories for efficient retrieval:
|
||||
|
||||
```
|
||||
project/
|
||||
├── data/ # Biomni knowledge base
|
||||
├── raw_data/ # Your experimental data
|
||||
├── results/ # Analysis outputs
|
||||
└── reports/ # Generated reports
|
||||
```
|
||||
|
||||
### Iterative Refinement
|
||||
|
||||
Use iterative task execution for complex analyses:
|
||||
|
||||
```python
|
||||
# Step 1: Exploratory analysis
|
||||
agent.go("Load and perform initial QC on the proteomics dataset")
|
||||
|
||||
# Step 2: Based on results, refine analysis
|
||||
agent.go("Based on the QC results, remove low-quality samples and normalize using method X")
|
||||
|
||||
# Step 3: Downstream analysis
|
||||
agent.go("Perform differential abundance analysis with adjusted parameters")
|
||||
```
|
||||
|
||||
### Security Considerations
|
||||
|
||||
**CRITICAL:** Biomni executes LLM-generated code with full system privileges. For production use:
|
||||
|
||||
1. **Use sandboxed environments:** Deploy in Docker containers or VMs with restricted permissions
|
||||
2. **Validate sensitive operations:** Review code before execution for file access, network calls, or credential usage
|
||||
3. **Limit data access:** Restrict agent access to only necessary data directories
|
||||
4. **Monitor execution:** Log all executed code for audit trails
|
||||
|
||||
Never run Biomni with:
|
||||
- Unrestricted file system access
|
||||
- Direct access to sensitive credentials
|
||||
- Network access to production systems
|
||||
- Elevated system privileges
|
||||
|
||||
### Model Selection Guidelines
|
||||
|
||||
Choose models based on task complexity:
|
||||
|
||||
- **Claude Sonnet 4:** Recommended for most biomedical tasks, excellent biological reasoning
|
||||
- **GPT-4/GPT-4o:** Strong general capabilities, good for diverse tasks
|
||||
- **Biomni-R0:** Specialized for complex biological reasoning, multi-step workflows
|
||||
- **Smaller models:** Use for simple, well-defined tasks to reduce cost
|
||||
|
||||
## Evaluation and Benchmarking
|
||||
|
||||
Biomni-Eval1 benchmark contains 433 evaluation instances across 10 biological tasks:
|
||||
|
||||
- GWAS analysis
|
||||
- Disease diagnosis
|
||||
- Gene detection and classification
|
||||
- Molecular property prediction
|
||||
- Pathway analysis
|
||||
- Protein function prediction
|
||||
- Drug response prediction
|
||||
- Variant interpretation
|
||||
- Cell type annotation
|
||||
- Biomarker discovery
|
||||
|
||||
Use the benchmark to:
|
||||
- Evaluate custom agent configurations
|
||||
- Compare LLM providers for specific tasks
|
||||
- Validate analysis pipelines
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Issue:** Data download fails or times out
|
||||
**Solution:** Manually download the knowledge base or increase timeout settings
|
||||
|
||||
**Issue:** Package dependency conflicts
|
||||
**Solution:** Some optional dependencies cannot be installed by default due to conflicts. Install specific packages manually and uncomment relevant code sections as documented in the repository
|
||||
|
||||
**Issue:** LLM API errors
|
||||
**Solution:** Verify API key configuration, check rate limits, ensure sufficient credits
|
||||
|
||||
**Issue:** Memory errors with large datasets
|
||||
**Solution:** Process data in chunks, use data subsampling, or deploy on higher-memory instances
|
||||
|
||||
### Getting Help
|
||||
|
||||
For detailed troubleshooting:
|
||||
- Review the Biomni GitHub repository issues
|
||||
- Check `references/api_reference.md` for detailed API documentation
|
||||
- Consult `references/task_examples.md` for comprehensive task patterns
|
||||
|
||||
## Resources
|
||||
|
||||
### references/
|
||||
Detailed reference documentation for advanced usage:
|
||||
|
||||
- **api_reference.md:** Complete API documentation for A1 agent, configuration objects, and utility functions
|
||||
- **llm_providers.md:** Comprehensive guide for configuring all supported LLM providers (Anthropic, OpenAI, Azure, Gemini, Groq, Ollama, AWS Bedrock)
|
||||
- **task_examples.md:** Extensive collection of biomedical task examples with code patterns
|
||||
|
||||
### scripts/
|
||||
Helper scripts for common operations:
|
||||
|
||||
- **setup_environment.py:** Automated environment setup and validation
|
||||
- **generate_report.py:** Enhanced PDF report generation with custom formatting
|
||||
|
||||
Load reference documentation as needed:
|
||||
```python
|
||||
# Claude can read reference files when needed for detailed information
|
||||
# Example: "Check references/llm_providers.md for Azure OpenAI configuration"
|
||||
```
|
||||
635
scientific-packages/biomni/references/api_reference.md
Normal file
@@ -0,0 +1,635 @@
|
||||
# Biomni API Reference
|
||||
|
||||
This document provides comprehensive API documentation for the Biomni biomedical AI agent system.
|
||||
|
||||
## Core Classes
|
||||
|
||||
### A1 Agent
|
||||
|
||||
The primary agent class for executing biomedical research tasks.
|
||||
|
||||
#### Initialization
|
||||
|
||||
```python
|
||||
from biomni.agent import A1
|
||||
|
||||
agent = A1(
|
||||
path='./data', # Path to biomedical knowledge base
|
||||
llm='claude-sonnet-4-20250514', # LLM model identifier
|
||||
timeout=None, # Optional timeout in seconds
|
||||
verbose=True # Enable detailed logging
|
||||
)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
|
||||
- `path` (str, required): Directory path where the biomedical knowledge base is stored or will be downloaded. First-time initialization will download ~11GB of data.
|
||||
- `llm` (str, optional): LLM model identifier. Defaults to the value in `default_config.llm`. Supports multiple providers (see LLM Providers section).
|
||||
- `timeout` (int, optional): Maximum execution time in seconds for agent operations. Overrides `default_config.timeout_seconds`.
|
||||
- `verbose` (bool, optional): Enable verbose logging for debugging. Default: True.
|
||||
|
||||
**Returns:** A1 agent instance ready for task execution.
|
||||
|
||||
#### Methods
|
||||
|
||||
##### `go(task_description: str) -> None`
|
||||
|
||||
Execute a biomedical research task autonomously.
|
||||
|
||||
```python
|
||||
agent.go("Analyze this scRNA-seq dataset and identify cell types")
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `task_description` (str, required): Natural language description of the biomedical task to execute. Be specific about:
|
||||
- Data location and format
|
||||
- Desired analysis or output
|
||||
- Any specific methods or parameters
|
||||
- Expected results format
|
||||
|
||||
**Behavior:**
|
||||
1. Decomposes the task into executable steps
|
||||
2. Retrieves relevant biomedical knowledge from the data lake
|
||||
3. Generates and executes Python/R code
|
||||
4. Provides results and visualizations
|
||||
5. Handles errors and retries with refinement
|
||||
|
||||
**Notes:**
|
||||
- Executes code with system privileges - use in sandboxed environments
|
||||
- Long-running tasks may require timeout adjustments
|
||||
- Intermediate results are displayed during execution
|
||||
|
||||
##### `save_conversation_history(output_path: str, format: str = 'pdf') -> None`
|
||||
|
||||
Export conversation history and execution trace as a formatted report.
|
||||
|
||||
```python
|
||||
agent.save_conversation_history(
|
||||
output_path='./reports/analysis_log.pdf',
|
||||
format='pdf'
|
||||
)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `output_path` (str, required): File path for the output report
|
||||
- `format` (str, optional): Output format. Options: 'pdf', 'markdown'. Default: 'pdf'
|
||||
|
||||
**Requirements:**
|
||||
- For PDF: Install one of: WeasyPrint, markdown2pdf, or Pandoc
|
||||
```bash
|
||||
pip install weasyprint # Recommended
|
||||
# or
|
||||
pip install markdown2pdf
|
||||
# or install Pandoc system-wide
|
||||
```
|
||||
|
||||
**Report Contents:**
|
||||
- Task description and parameters
|
||||
- Retrieved biomedical knowledge
|
||||
- Generated code with execution traces
|
||||
- Results, visualizations, and outputs
|
||||
- Timestamps and execution metadata
|
||||
|
||||
##### `add_mcp(config_path: str) -> None`
|
||||
|
||||
Add Model Context Protocol (MCP) tools to extend agent capabilities.
|
||||
|
||||
```python
|
||||
agent.add_mcp(config_path='./mcp_tools_config.json')
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `config_path` (str, required): Path to MCP configuration JSON file
|
||||
|
||||
**MCP Configuration Format:**
|
||||
```json
|
||||
{
|
||||
"tools": [
|
||||
{
|
||||
"name": "tool_name",
|
||||
"endpoint": "http://localhost:8000/tool",
|
||||
"description": "Tool description for LLM",
|
||||
"parameters": {
|
||||
"param1": "string",
|
||||
"param2": "integer"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Use Cases:**
|
||||
- Connect to laboratory information systems
|
||||
- Integrate proprietary databases
|
||||
- Access specialized computational resources
|
||||
- Link to institutional data repositories
|
||||
|
||||
## Configuration
|
||||
|
||||
### default_config
|
||||
|
||||
Global configuration object for Biomni settings.
|
||||
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
```
|
||||
|
||||
#### Attributes
|
||||
|
||||
##### `llm: str`
|
||||
|
||||
Default LLM model identifier for all agent instances.
|
||||
|
||||
```python
|
||||
default_config.llm = "claude-sonnet-4-20250514"
|
||||
```
|
||||
|
||||
**Supported Models:**
|
||||
|
||||
**Anthropic:**
|
||||
- `claude-sonnet-4-20250514` (Recommended)
|
||||
- `claude-opus-4-20250514`
|
||||
- `claude-3-5-sonnet-20241022`
|
||||
- `claude-3-opus-20240229`
|
||||
|
||||
**OpenAI:**
|
||||
- `gpt-4o`
|
||||
- `gpt-4`
|
||||
- `gpt-4-turbo`
|
||||
- `gpt-3.5-turbo`
|
||||
|
||||
**Azure OpenAI:**
|
||||
- `azure/gpt-4`
|
||||
- `azure/<deployment-name>`
|
||||
|
||||
**Google Gemini:**
|
||||
- `gemini/gemini-pro`
|
||||
- `gemini/gemini-1.5-pro`
|
||||
|
||||
**Groq:**
|
||||
- `groq/llama-3.1-70b-versatile`
|
||||
- `groq/mixtral-8x7b-32768`
|
||||
|
||||
**Ollama (Local):**
|
||||
- `ollama/llama3`
|
||||
- `ollama/mistral`
|
||||
- `ollama/<model-name>`
|
||||
|
||||
**AWS Bedrock:**
|
||||
- `bedrock/anthropic.claude-v2`
|
||||
- `bedrock/anthropic.claude-3-sonnet`
|
||||
|
||||
**Custom/Biomni-R0:**
|
||||
- `openai/biomni-r0` (requires local SGLang deployment)
|
||||
|
||||
##### `timeout_seconds: int`
|
||||
|
||||
Default timeout for agent operations in seconds.
|
||||
|
||||
```python
|
||||
default_config.timeout_seconds = 1200 # 20 minutes
|
||||
```
|
||||
|
||||
**Recommended Values:**
|
||||
- Simple tasks (QC, basic analysis): 300-600 seconds
|
||||
- Medium tasks (differential expression, clustering): 600-1200 seconds
|
||||
- Complex tasks (full pipelines, ML models): 1200-3600 seconds
|
||||
- Very complex tasks: 3600+ seconds
|
||||
|
||||
##### `data_path: str`
|
||||
|
||||
Default path to biomedical knowledge base.
|
||||
|
||||
```python
|
||||
default_config.data_path = "/path/to/biomni/data"
|
||||
```
|
||||
|
||||
**Storage Requirements:**
|
||||
- Initial download: ~11GB
|
||||
- Extracted size: ~15GB
|
||||
- Additional working space: ~5-10GB recommended
|
||||
|
||||
##### `api_base: str`
|
||||
|
||||
Custom API endpoint for LLM providers (advanced usage).
|
||||
|
||||
```python
|
||||
# For local Biomni-R0 deployment
|
||||
default_config.api_base = "http://localhost:30000/v1"
|
||||
|
||||
# For custom OpenAI-compatible endpoints
|
||||
default_config.api_base = "https://your-endpoint.com/v1"
|
||||
```
|
||||
|
||||
##### `max_retries: int`
|
||||
|
||||
Number of retry attempts for failed operations.
|
||||
|
||||
```python
|
||||
default_config.max_retries = 3
|
||||
```
|
||||
|
||||
#### Methods
|
||||
|
||||
##### `reset() -> None`
|
||||
|
||||
Reset all configuration values to system defaults.
|
||||
|
||||
```python
|
||||
default_config.reset()
|
||||
```
|
||||
|
||||
## Database Query System
|
||||
|
||||
Biomni includes a retrieval-augmented generation (RAG) system for querying the biomedical knowledge base.
|
||||
|
||||
### Query Functions
|
||||
|
||||
#### `query_genes(query: str, top_k: int = 10) -> List[Dict]`
|
||||
|
||||
Query gene information from integrated databases.
|
||||
|
||||
```python
|
||||
from biomni.database import query_genes
|
||||
|
||||
results = query_genes(
|
||||
query="genes involved in p53 pathway",
|
||||
top_k=20
|
||||
)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `query` (str): Natural language or gene identifier query
|
||||
- `top_k` (int): Number of results to return
|
||||
|
||||
**Returns:** List of dictionaries containing:
|
||||
- `gene_symbol`: Official gene symbol
|
||||
- `gene_name`: Full gene name
|
||||
- `description`: Functional description
|
||||
- `pathways`: Associated biological pathways
|
||||
- `go_terms`: Gene Ontology annotations
|
||||
- `diseases`: Associated diseases
|
||||
- `similarity_score`: Relevance score (0-1)
|
||||
|
||||
#### `query_proteins(query: str, top_k: int = 10) -> List[Dict]`
|
||||
|
||||
Query protein information from UniProt and other sources.
|
||||
|
||||
```python
|
||||
from biomni.database import query_proteins
|
||||
|
||||
results = query_proteins(
|
||||
query="kinase proteins in cell cycle",
|
||||
top_k=15
|
||||
)
|
||||
```
|
||||
|
||||
**Returns:** List of dictionaries with protein metadata:
|
||||
- `uniprot_id`: UniProt accession
|
||||
- `protein_name`: Protein name
|
||||
- `function`: Functional annotation
|
||||
- `domains`: Protein domains
|
||||
- `subcellular_location`: Cellular localization
|
||||
- `similarity_score`: Relevance score
|
||||
|
||||
#### `query_drugs(query: str, top_k: int = 10) -> List[Dict]`
|
||||
|
||||
Query drug and compound information.
|
||||
|
||||
```python
|
||||
from biomni.database import query_drugs
|
||||
|
||||
results = query_drugs(
|
||||
query="FDA approved cancer drugs targeting EGFR",
|
||||
top_k=10
|
||||
)
|
||||
```
|
||||
|
||||
**Returns:** Drug information including:
|
||||
- `drug_name`: Common name
|
||||
- `drugbank_id`: DrugBank identifier
|
||||
- `indication`: Therapeutic indication
|
||||
- `mechanism`: Mechanism of action
|
||||
- `targets`: Molecular targets
|
||||
- `approval_status`: Regulatory status
|
||||
- `smiles`: Chemical structure (SMILES notation)
|
||||
|
||||
#### `query_diseases(query: str, top_k: int = 10) -> List[Dict]`
|
||||
|
||||
Query disease information from clinical databases.
|
||||
|
||||
```python
|
||||
from biomni.database import query_diseases
|
||||
|
||||
results = query_diseases(
|
||||
query="autoimmune diseases affecting joints",
|
||||
top_k=10
|
||||
)
|
||||
```
|
||||
|
||||
**Returns:** Disease data:
|
||||
- `disease_name`: Standard disease name
|
||||
- `disease_id`: Ontology identifier
|
||||
- `symptoms`: Clinical manifestations
|
||||
- `associated_genes`: Genetic associations
|
||||
- `prevalence`: Epidemiological data
|
||||
|
||||
#### `query_pathways(query: str, top_k: int = 10) -> List[Dict]`
|
||||
|
||||
Query biological pathways from KEGG, Reactome, and other sources.
|
||||
|
||||
```python
|
||||
from biomni.database import query_pathways
|
||||
|
||||
results = query_pathways(
|
||||
query="immune response signaling pathways",
|
||||
top_k=15
|
||||
)
|
||||
```
|
||||
|
||||
**Returns:** Pathway information:
|
||||
- `pathway_name`: Pathway name
|
||||
- `pathway_id`: Database identifier
|
||||
- `genes`: Genes in pathway
|
||||
- `description`: Functional description
|
||||
- `source`: Database source (KEGG, Reactome, etc.)
|
||||
|
||||
## Data Structures
|
||||
|
||||
### TaskResult
|
||||
|
||||
Result object returned by complex agent operations.
|
||||
|
||||
```python
|
||||
class TaskResult:
|
||||
success: bool # Whether task completed successfully
|
||||
output: Any # Task output (varies by task)
|
||||
code: str # Generated code
|
||||
execution_time: float # Execution time in seconds
|
||||
error: Optional[str] # Error message if failed
|
||||
metadata: Dict # Additional metadata
|
||||
```
|
||||
|
||||
### BiomedicalEntity
|
||||
|
||||
Base class for biomedical entities in the knowledge base.
|
||||
|
||||
```python
|
||||
class BiomedicalEntity:
|
||||
entity_id: str # Unique identifier
|
||||
entity_type: str # Type (gene, protein, drug, etc.)
|
||||
name: str # Entity name
|
||||
description: str # Description
|
||||
attributes: Dict # Additional attributes
|
||||
references: List[str] # Literature references
|
||||
```
|
||||
|
||||
## Utility Functions
|
||||
|
||||
### `download_data(path: str, force: bool = False) -> None`
|
||||
|
||||
Manually download or update the biomedical knowledge base.
|
||||
|
||||
```python
|
||||
from biomni.utils import download_data
|
||||
|
||||
download_data(
|
||||
path='./data',
|
||||
force=True # Force re-download
|
||||
)
|
||||
```
|
||||
|
||||
### `validate_environment() -> Dict[str, bool]`
|
||||
|
||||
Check if the environment is properly configured.
|
||||
|
||||
```python
|
||||
from biomni.utils import validate_environment
|
||||
|
||||
status = validate_environment()
|
||||
# Returns: {
|
||||
# 'conda_env': True,
|
||||
# 'api_keys': True,
|
||||
# 'data_available': True,
|
||||
# 'dependencies': True
|
||||
# }
|
||||
```
|
||||
|
||||
### `list_available_models() -> List[str]`
|
||||
|
||||
Get a list of available LLM models based on configured API keys.
|
||||
|
||||
```python
|
||||
from biomni.utils import list_available_models
|
||||
|
||||
models = list_available_models()
|
||||
# Returns: ['claude-sonnet-4-20250514', 'gpt-4o', ...]
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Common Exceptions
|
||||
|
||||
#### `BiomniConfigError`
|
||||
|
||||
Raised when configuration is invalid or incomplete.
|
||||
|
||||
```python
|
||||
from biomni.exceptions import BiomniConfigError
|
||||
|
||||
try:
|
||||
agent = A1(path='./data')
|
||||
except BiomniConfigError as e:
|
||||
print(f"Configuration error: {e}")
|
||||
```
|
||||
|
||||
#### `BiomniExecutionError`
|
||||
|
||||
Raised when code generation or execution fails.
|
||||
|
||||
```python
|
||||
from biomni.exceptions import BiomniExecutionError
|
||||
|
||||
try:
|
||||
agent.go("invalid task")
|
||||
except BiomniExecutionError as e:
|
||||
print(f"Execution failed: {e}")
|
||||
# Access failed code: e.code
|
||||
# Access error details: e.details
|
||||
```
|
||||
|
||||
#### `BiomniDataError`
|
||||
|
||||
Raised when knowledge base or data access fails.
|
||||
|
||||
```python
|
||||
from biomni.exceptions import BiomniDataError
|
||||
|
||||
try:
|
||||
results = query_genes("unknown query format")
|
||||
except BiomniDataError as e:
|
||||
print(f"Data access error: {e}")
|
||||
```
|
||||
|
||||
#### `BiomniTimeoutError`
|
||||
|
||||
Raised when operations exceed timeout limit.
|
||||
|
||||
```python
|
||||
from biomni.exceptions import BiomniTimeoutError
|
||||
|
||||
try:
|
||||
agent.go("very complex long-running task")
|
||||
except BiomniTimeoutError as e:
|
||||
print(f"Task timed out after {e.duration} seconds")
|
||||
# Partial results may be available: e.partial_results
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Efficient Knowledge Retrieval
|
||||
|
||||
Pre-query databases for relevant context before complex tasks:
|
||||
|
||||
```python
|
||||
from biomni.database import query_genes, query_pathways
|
||||
|
||||
# Gather relevant biological context first
|
||||
genes = query_genes("cell cycle genes", top_k=50)
|
||||
pathways = query_pathways("cell cycle regulation", top_k=20)
|
||||
|
||||
# Then execute task with enriched context
|
||||
agent.go(f"""
|
||||
Analyze the cell cycle progression in this dataset.
|
||||
Focus on these genes: {[g['gene_symbol'] for g in genes]}
|
||||
Consider these pathways: {[p['pathway_name'] for p in pathways]}
|
||||
""")
|
||||
```
|
||||
|
||||
### Error Recovery
|
||||
|
||||
Implement robust error handling for production workflows:
|
||||
|
||||
```python
|
||||
from biomni.exceptions import BiomniExecutionError, BiomniTimeoutError
|
||||
|
||||
max_attempts = 3
|
||||
for attempt in range(max_attempts):
|
||||
try:
|
||||
agent.go("complex biomedical task")
|
||||
break
|
||||
except BiomniTimeoutError:
|
||||
# Increase timeout and retry
|
||||
default_config.timeout_seconds *= 2
|
||||
print(f"Timeout, retrying with {default_config.timeout_seconds}s timeout")
|
||||
except BiomniExecutionError as e:
|
||||
# Refine task based on error
|
||||
print(f"Execution failed: {e}, refining task...")
|
||||
# Optionally modify task description
|
||||
else:
|
||||
print("Task failed after max attempts")
|
||||
```
|
||||
|
||||
### Memory Management
|
||||
|
||||
For large-scale analyses, manage memory explicitly:
|
||||
|
||||
```python
|
||||
import gc
|
||||
|
||||
# Process datasets in chunks
|
||||
for chunk_id in range(num_chunks):
|
||||
agent.go(f"Process data chunk {chunk_id} located at data/chunk_{chunk_id}.h5ad")
|
||||
|
||||
# Force garbage collection between chunks
|
||||
gc.collect()
|
||||
|
||||
# Save intermediate results
|
||||
agent.save_conversation_history(f"./reports/chunk_{chunk_id}.pdf")
|
||||
```
|
||||
|
||||
### Reproducibility
|
||||
|
||||
Ensure reproducible analyses by:
|
||||
|
||||
1. **Fixing random seeds:**
|
||||
```python
|
||||
agent.go("Set random seed to 42 for all analyses, then perform clustering...")
|
||||
```
|
||||
|
||||
2. **Logging configuration:**
|
||||
```python
import json
from datetime import datetime

config_log = {
    'llm': default_config.llm,
    'timeout': default_config.timeout_seconds,
    'data_path': default_config.data_path,
    'timestamp': datetime.now().isoformat()
}
with open('config_log.json', 'w') as f:
    json.dump(config_log, f, indent=2)
```
|
||||
|
||||
3. **Saving execution traces:**
|
||||
```python
|
||||
# Always save detailed reports
|
||||
agent.save_conversation_history('./reports/full_analysis.pdf')
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Model Selection Strategy
|
||||
|
||||
Choose models based on task characteristics:
|
||||
|
||||
```python
|
||||
# For exploratory, simple tasks
|
||||
default_config.llm = "gpt-3.5-turbo" # Fast, cost-effective
|
||||
|
||||
# For standard biomedical analyses
|
||||
default_config.llm = "claude-sonnet-4-20250514" # Recommended
|
||||
|
||||
# For complex reasoning and hypothesis generation
|
||||
default_config.llm = "claude-opus-4-20250514" # Highest quality
|
||||
|
||||
# For specialized biological reasoning
|
||||
default_config.llm = "openai/biomni-r0" # Requires local deployment
|
||||
```
|
||||
|
||||
### Timeout Tuning
|
||||
|
||||
Set appropriate timeouts based on task complexity:
|
||||
|
||||
```python
|
||||
# Quick queries and simple analyses
|
||||
agent = A1(path='./data', timeout=300)
|
||||
|
||||
# Standard workflows
|
||||
agent = A1(path='./data', timeout=1200)
|
||||
|
||||
# Full pipelines with ML training
|
||||
agent = A1(path='./data', timeout=3600)
|
||||
```
|
||||
|
||||
### Caching and Reuse
|
||||
|
||||
Reuse agent instances for multiple related tasks:
|
||||
|
||||
```python
# Create agent once
agent = A1(path='./data', llm='claude-sonnet-4-20250514')

# Execute multiple related tasks
tasks = [
    "Load and QC the scRNA-seq dataset",
    "Perform clustering with resolution 0.5",
    "Identify marker genes for each cluster",
    "Annotate cell types based on markers"
]

for task in tasks:
    agent.go(task)

# Save complete workflow
agent.save_conversation_history('./reports/full_workflow.pdf')
```
|
||||
649
scientific-packages/biomni/references/llm_providers.md
Normal file
@@ -0,0 +1,649 @@
|
||||
# LLM Provider Configuration Guide
|
||||
|
||||
This document provides comprehensive configuration instructions for all LLM providers supported by Biomni.
|
||||
|
||||
## Overview
|
||||
|
||||
Biomni supports multiple LLM providers through a unified interface. Configure providers using:
|
||||
- Environment variables
|
||||
- `.env` files
|
||||
- Runtime configuration via `default_config`
|
||||
|
||||
## Quick Reference Table
|
||||
|
||||
| Provider | Recommended For | API Key Required | Cost | Setup Complexity |
|
||||
|----------|----------------|------------------|------|------------------|
|
||||
| Anthropic Claude | Most biomedical tasks | Yes | Medium | Easy |
|
||||
| OpenAI | General tasks | Yes | Medium-High | Easy |
|
||||
| Azure OpenAI | Enterprise deployment | Yes | Varies | Medium |
|
||||
| Google Gemini | Multimodal tasks | Yes | Medium | Easy |
|
||||
| Groq | Fast inference | Yes | Low | Easy |
|
||||
| Ollama | Local/offline use | No | Free | Medium |
|
||||
| AWS Bedrock | AWS ecosystem | Yes | Varies | Hard |
|
||||
| Biomni-R0 | Complex biological reasoning | No | Free | Hard |
|
||||
|
||||
## Anthropic Claude (Recommended)
|
||||
|
||||
### Overview
|
||||
|
||||
Claude models from Anthropic provide excellent biological reasoning capabilities and are the recommended choice for most Biomni tasks.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Obtain API Key:**
|
||||
- Sign up at https://console.anthropic.com/
|
||||
- Navigate to API Keys section
|
||||
- Generate a new key
|
||||
|
||||
2. **Configure Environment:**
|
||||
|
||||
**Option A: Environment Variable**
|
||||
```bash
|
||||
export ANTHROPIC_API_KEY="sk-ant-api03-..."
|
||||
```
|
||||
|
||||
**Option B: .env File**
|
||||
```bash
|
||||
# .env file in project root
|
||||
ANTHROPIC_API_KEY=sk-ant-api03-...
|
||||
```
|
||||
|
||||
3. **Set Model in Code:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
# Claude Sonnet 4 (Recommended)
|
||||
default_config.llm = "claude-sonnet-4-20250514"
|
||||
|
||||
# Claude Opus 4 (Most capable)
|
||||
default_config.llm = "claude-opus-4-20250514"
|
||||
|
||||
# Claude 3.5 Sonnet (Previous version)
|
||||
default_config.llm = "claude-3-5-sonnet-20241022"
|
||||
```
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Context Window | Strengths | Best For |
|
||||
|-------|---------------|-----------|----------|
|
||||
| `claude-sonnet-4-20250514` | 200K tokens | Balanced performance, cost-effective | Most biomedical tasks |
|
||||
| `claude-opus-4-20250514` | 200K tokens | Highest capability, complex reasoning | Difficult multi-step analyses |
|
||||
| `claude-3-5-sonnet-20241022` | 200K tokens | Fast, reliable | Standard workflows |
|
||||
| `claude-3-opus-20240229` | 200K tokens | Strong reasoning | Legacy support |
|
||||
|
||||
### Advanced Configuration
|
||||
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
# Use Claude with custom parameters
|
||||
default_config.llm = "claude-sonnet-4-20250514"
|
||||
default_config.timeout_seconds = 1800
|
||||
|
||||
# Optional: Custom API endpoint (for proxy/enterprise)
|
||||
default_config.api_base = "https://your-proxy.com/v1"
|
||||
```
|
||||
|
||||
### Cost Estimation
|
||||
|
||||
Approximate costs per 1M tokens (as of January 2025):
|
||||
- Input: $3-15 depending on model
|
||||
- Output: $15-75 depending on model
|
||||
|
||||
For a typical biomedical analysis (~50K tokens total): $0.50-$2.00
|
||||
|
||||
## OpenAI
|
||||
|
||||
### Overview
|
||||
|
||||
OpenAI's GPT models provide strong general capabilities suitable for diverse biomedical tasks.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Obtain API Key:**
|
||||
- Sign up at https://platform.openai.com/
|
||||
- Navigate to API Keys
|
||||
- Create new secret key
|
||||
|
||||
2. **Configure Environment:**
|
||||
|
||||
```bash
|
||||
export OPENAI_API_KEY="sk-proj-..."
|
||||
```
|
||||
|
||||
Or in `.env`:
|
||||
```
|
||||
OPENAI_API_KEY=sk-proj-...
|
||||
```
|
||||
|
||||
3. **Set Model:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "gpt-4o" # Recommended
|
||||
# default_config.llm = "gpt-4" # Previous flagship
|
||||
# default_config.llm = "gpt-4-turbo" # Fast variant
|
||||
# default_config.llm = "gpt-3.5-turbo" # Budget option
|
||||
```
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Context Window | Strengths | Cost |
|
||||
|-------|---------------|-----------|------|
|
||||
| `gpt-4o` | 128K tokens | Fast, multimodal | Medium |
|
||||
| `gpt-4-turbo` | 128K tokens | Fast inference | Medium |
|
||||
| `gpt-4` | 8K tokens | Reliable | High |
|
||||
| `gpt-3.5-turbo` | 16K tokens | Fast, cheap | Low |
|
||||
|
||||
### Cost Optimization
|
||||
|
||||
```python
|
||||
# For exploratory analysis (budget-conscious)
|
||||
default_config.llm = "gpt-3.5-turbo"
|
||||
|
||||
# For production analysis (quality-focused)
|
||||
default_config.llm = "gpt-4o"
|
||||
```
|
||||
|
||||
## Azure OpenAI
|
||||
|
||||
### Overview
|
||||
|
||||
Azure-hosted OpenAI models for enterprise users requiring data residency and compliance.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Azure Prerequisites:**
|
||||
- Active Azure subscription
|
||||
- Azure OpenAI resource created
|
||||
- Model deployment configured
|
||||
|
||||
2. **Environment Variables:**
|
||||
```bash
|
||||
export AZURE_OPENAI_API_KEY="your-key"
|
||||
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
|
||||
export AZURE_OPENAI_API_VERSION="2024-02-15-preview"
|
||||
```
|
||||
|
||||
3. **Configuration:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
# Option 1: Use deployment name
|
||||
default_config.llm = "azure/your-deployment-name"
|
||||
|
||||
# Option 2: Specify endpoint explicitly
|
||||
default_config.llm = "azure/gpt-4"
|
||||
default_config.api_base = "https://your-resource.openai.azure.com/"
|
||||
```
|
||||
|
||||
### Deployment Setup
|
||||
|
||||
Azure OpenAI requires explicit model deployments:
|
||||
|
||||
1. Navigate to Azure OpenAI Studio
|
||||
2. Create deployment for desired model (e.g., GPT-4)
|
||||
3. Note the deployment name
|
||||
4. Use deployment name in Biomni configuration
|
||||
|
||||
### Example Configuration
|
||||
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
import os
|
||||
|
||||
# Set Azure credentials
|
||||
os.environ['AZURE_OPENAI_API_KEY'] = 'your-key'
|
||||
os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://your-resource.openai.azure.com/'
|
||||
|
||||
# Configure Biomni to use Azure deployment
|
||||
default_config.llm = "azure/gpt-4-biomni" # Your deployment name
|
||||
default_config.api_base = os.environ['AZURE_OPENAI_ENDPOINT']
|
||||
```
|
||||
|
||||
## Google Gemini
|
||||
|
||||
### Overview
|
||||
|
||||
Google's Gemini models offer multimodal capabilities and competitive performance.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Obtain API Key:**
|
||||
- Visit https://makersuite.google.com/app/apikey
|
||||
- Create new API key
|
||||
|
||||
2. **Environment Configuration:**
|
||||
```bash
|
||||
export GEMINI_API_KEY="your-key"
|
||||
```
|
||||
|
||||
3. **Set Model:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "gemini/gemini-1.5-pro"
|
||||
# Or: default_config.llm = "gemini/gemini-pro"
|
||||
```
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Context Window | Strengths |
|
||||
|-------|---------------|-----------|
|
||||
| `gemini/gemini-1.5-pro` | 1M tokens | Very large context, multimodal |
|
||||
| `gemini/gemini-pro` | 32K tokens | Balanced performance |
|
||||
|
||||
### Use Cases
|
||||
|
||||
Gemini excels at:
|
||||
- Tasks requiring very large context windows
|
||||
- Multimodal analysis (when incorporating images)
|
||||
- Cost-effective alternative to GPT-4
|
||||
|
||||
```python
|
||||
# For tasks with large context requirements
|
||||
default_config.llm = "gemini/gemini-1.5-pro"
|
||||
default_config.timeout_seconds = 2400 # May need longer timeout
|
||||
```
|
||||
|
||||
## Groq
|
||||
|
||||
### Overview
|
||||
|
||||
Groq provides ultra-fast inference with open-source models, ideal for rapid iteration.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Get API Key:**
|
||||
- Sign up at https://console.groq.com/
|
||||
- Generate API key
|
||||
|
||||
2. **Configure:**
|
||||
```bash
|
||||
export GROQ_API_KEY="gsk_..."
|
||||
```
|
||||
|
||||
3. **Set Model:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "groq/llama-3.1-70b-versatile"
|
||||
# Or: default_config.llm = "groq/mixtral-8x7b-32768"
|
||||
```
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Context Window | Speed | Quality |
|
||||
|-------|---------------|-------|---------|
|
||||
| `groq/llama-3.1-70b-versatile` | 32K tokens | Very Fast | Good |
|
||||
| `groq/mixtral-8x7b-32768` | 32K tokens | Very Fast | Good |
|
||||
| `groq/llama-3-70b-8192` | 8K tokens | Ultra Fast | Moderate |
|
||||
|
||||
### Best Practices
|
||||
|
||||
```python
|
||||
# For rapid prototyping and testing
|
||||
default_config.llm = "groq/llama-3.1-70b-versatile"
|
||||
default_config.timeout_seconds = 600 # Groq is fast
|
||||
|
||||
# Note: Quality may be lower than GPT-4/Claude for complex tasks
|
||||
# Recommended for: QC, simple analyses, testing workflows
|
||||
```
|
||||
|
||||
## Ollama (Local Deployment)
|
||||
|
||||
### Overview
|
||||
|
||||
Run LLMs entirely locally for offline use, data privacy, or cost savings.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Install Ollama:**
|
||||
```bash
|
||||
# macOS/Linux
|
||||
curl -fsSL https://ollama.com/install.sh | sh
|
||||
|
||||
# Or download from https://ollama.com/download
|
||||
```
|
||||
|
||||
2. **Pull Models:**
|
||||
```bash
|
||||
ollama pull llama3 # Meta Llama 3 (8B)
|
||||
ollama pull mixtral # Mixtral (47B)
|
||||
ollama pull codellama # Code-specialized
|
||||
ollama pull medllama # Medical domain (if available)
|
||||
```
|
||||
|
||||
3. **Start Ollama Server:**
|
||||
```bash
|
||||
ollama serve # Runs on http://localhost:11434
|
||||
```
|
||||
|
||||
4. **Configure Biomni:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "ollama/llama3"
|
||||
default_config.api_base = "http://localhost:11434"
|
||||
```
|
||||
|
||||
### Hardware Requirements
|
||||
|
||||
Minimum recommendations:
|
||||
- **8B models:** 16GB RAM, CPU inference acceptable
|
||||
- **70B models:** 64GB RAM, GPU highly recommended
|
||||
- **Storage:** 5-50GB per model
|
||||
|
||||
### Model Selection
|
||||
|
||||
```python
|
||||
# Fast, local, good for testing
|
||||
default_config.llm = "ollama/llama3"
|
||||
|
||||
# Better quality (requires more resources)
|
||||
default_config.llm = "ollama/mixtral"
|
||||
|
||||
# Code generation tasks
|
||||
default_config.llm = "ollama/codellama"
|
||||
```
|
||||
|
||||
### Advantages & Limitations
|
||||
|
||||
**Advantages:**
|
||||
- Complete data privacy
|
||||
- No API costs
|
||||
- Offline operation
|
||||
- Unlimited usage
|
||||
|
||||
**Limitations:**
|
||||
- Lower quality than GPT-4/Claude for complex tasks
|
||||
- Requires significant hardware
|
||||
- Slower inference (especially on CPU)
|
||||
- May struggle with specialized biomedical knowledge
|
||||
|
||||
## AWS Bedrock
|
||||
|
||||
### Overview
|
||||
|
||||
AWS-managed LLM service offering multiple model providers.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **AWS Prerequisites:**
|
||||
- AWS account with Bedrock access
|
||||
- Model access enabled in Bedrock console
|
||||
- AWS credentials configured
|
||||
|
||||
2. **Configure AWS Credentials:**
|
||||
```bash
|
||||
# Option 1: AWS CLI
|
||||
aws configure
|
||||
|
||||
# Option 2: Environment variables
|
||||
export AWS_ACCESS_KEY_ID="your-key"
|
||||
export AWS_SECRET_ACCESS_KEY="your-secret"
|
||||
export AWS_REGION="us-east-1"
|
||||
```
|
||||
|
||||
3. **Enable Model Access:**
|
||||
- Navigate to AWS Bedrock console
|
||||
- Request access to desired models
|
||||
- Wait for approval (may take hours/days)
|
||||
|
||||
4. **Configure Biomni:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "bedrock/anthropic.claude-3-sonnet"
|
||||
# Or: default_config.llm = "bedrock/anthropic.claude-v2"
|
||||
```
|
||||
|
||||
### Available Models
|
||||
|
||||
Bedrock provides access to:
|
||||
- Anthropic Claude models
|
||||
- Amazon Titan models
|
||||
- AI21 Jurassic models
|
||||
- Cohere Command models
|
||||
- Meta Llama models
|
||||
|
||||
### IAM Permissions
|
||||
|
||||
Required IAM policy:
|
||||
```json
|
||||
{
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"bedrock:InvokeModel",
|
||||
"bedrock:InvokeModelWithResponseStream"
|
||||
],
|
||||
"Resource": "arn:aws:bedrock:*::foundation-model/*"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Example Configuration
|
||||
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
import boto3
|
||||
|
||||
# Verify AWS credentials
|
||||
session = boto3.Session()
|
||||
credentials = session.get_credentials()
|
||||
print(f"AWS Access Key: {credentials.access_key[:8]}...")
|
||||
|
||||
# Configure Biomni
|
||||
default_config.llm = "bedrock/anthropic.claude-3-sonnet"
|
||||
default_config.timeout_seconds = 1800
|
||||
```
|
||||
|
||||
## Biomni-R0 (Local Specialized Model)
|
||||
|
||||
### Overview
|
||||
|
||||
Biomni-R0 is a 32B parameter reasoning model specifically trained for biological problem-solving. Provides the highest quality for complex biomedical reasoning but requires local deployment.
|
||||
|
||||
### Setup
|
||||
|
||||
1. **Hardware Requirements:**
|
||||
- GPU with 48GB+ VRAM (e.g., A100, H100)
|
||||
- Or multi-GPU setup (2x 24GB)
|
||||
- 100GB+ storage for model weights
|
||||
|
||||
2. **Install Dependencies:**
|
||||
```bash
|
||||
pip install "sglang[all]"
|
||||
pip install flashinfer # Optional but recommended
|
||||
```
|
||||
|
||||
3. **Deploy Model:**
|
||||
```bash
|
||||
python -m sglang.launch_server \
|
||||
--model-path snap-stanford/biomni-r0 \
|
||||
--host 0.0.0.0 \
|
||||
--port 30000 \
|
||||
--trust-remote-code \
|
||||
--mem-fraction-static 0.8
|
||||
```
|
||||
|
||||
For multi-GPU:
|
||||
```bash
|
||||
python -m sglang.launch_server \
|
||||
--model-path snap-stanford/biomni-r0 \
|
||||
--host 0.0.0.0 \
|
||||
--port 30000 \
|
||||
--trust-remote-code \
|
||||
--tp 2 # Tensor parallelism across 2 GPUs
|
||||
```
|
||||
|
||||
4. **Configure Biomni:**
|
||||
```python
|
||||
from biomni.config import default_config
|
||||
|
||||
default_config.llm = "openai/biomni-r0"
|
||||
default_config.api_base = "http://localhost:30000/v1"
|
||||
default_config.timeout_seconds = 2400 # Longer for complex reasoning
|
||||
```
|
||||
|
||||
### When to Use Biomni-R0
|
||||
|
||||
Biomni-R0 excels at:
|
||||
- Multi-step biological reasoning
|
||||
- Complex experimental design
|
||||
- Hypothesis generation and evaluation
|
||||
- Literature-informed analysis
|
||||
- Tasks requiring deep biological knowledge
|
||||
|
||||
```python
|
||||
# For complex biological reasoning tasks
|
||||
default_config.llm = "openai/biomni-r0"
|
||||
|
||||
agent.go("""
|
||||
Design a comprehensive CRISPR screening experiment to identify synthetic
|
||||
lethal interactions with TP53 mutations in cancer cells, including:
|
||||
1. Rationale and hypothesis
|
||||
2. Guide RNA library design strategy
|
||||
3. Experimental controls
|
||||
4. Statistical analysis plan
|
||||
5. Expected outcomes and validation approach
|
||||
""")
|
||||
```
|
||||
|
||||
### Performance Comparison
|
||||
|
||||
| Model | Speed | Biological Reasoning | Code Quality | Cost |
|
||||
|-------|-------|---------------------|--------------|------|
|
||||
| GPT-4 | Fast | Good | Excellent | Medium |
|
||||
| Claude Sonnet 4 | Fast | Excellent | Excellent | Medium |
|
||||
| Biomni-R0 | Moderate | Outstanding | Good | Free (local) |
|
||||
|
||||
## Multi-Provider Strategy
|
||||
|
||||
### Intelligent Model Selection
|
||||
|
||||
Use different models for different task types:
|
||||
|
||||
```python
from biomni.agent import A1
from biomni.config import default_config

# Strategy 1: Task-based selection
def get_agent_for_task(task_complexity):
    if task_complexity == "simple":
        default_config.llm = "gpt-3.5-turbo"
        default_config.timeout_seconds = 300
    elif task_complexity == "medium":
        default_config.llm = "claude-sonnet-4-20250514"
        default_config.timeout_seconds = 1200
    else:  # complex
        default_config.llm = "openai/biomni-r0"
        default_config.timeout_seconds = 2400

    return A1(path='./data')

# Strategy 2: Fallback on failure
def execute_with_fallback(task):
    models = [
        "claude-sonnet-4-20250514",
        "gpt-4o",
        "claude-opus-4-20250514"
    ]

    for model in models:
        try:
            default_config.llm = model
            agent = A1(path='./data')
            agent.go(task)
            return
        except Exception as e:
            print(f"Failed with {model}: {e}, trying next...")

    raise Exception("All models failed")
```
|
||||
|
||||
### Cost Optimization Strategy
|
||||
|
||||
```python
|
||||
# Phase 1: Rapid prototyping with cheap models
|
||||
default_config.llm = "gpt-3.5-turbo"
|
||||
agent.go("Quick exploratory analysis of dataset structure")
|
||||
|
||||
# Phase 2: Detailed analysis with high-quality models
|
||||
default_config.llm = "claude-sonnet-4-20250514"
|
||||
agent.go("Comprehensive differential expression analysis with pathway enrichment")
|
||||
|
||||
# Phase 3: Complex reasoning with specialized models
|
||||
default_config.llm = "openai/biomni-r0"
|
||||
agent.go("Generate biological hypotheses based on multi-omics integration")
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Issue: "API key not found"**
|
||||
- Verify environment variable is set: `echo $ANTHROPIC_API_KEY`
|
||||
- Check `.env` file exists and is in correct location
|
||||
- Try setting key programmatically: `os.environ['ANTHROPIC_API_KEY'] = 'key'`
|
||||
|
||||
**Issue: "Rate limit exceeded"**
|
||||
- Implement exponential backoff and retry (a sketch follows this list)
|
||||
- Upgrade API tier if available
|
||||
- Switch to alternative provider temporarily
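
A minimal retry helper illustrating the backoff idea; the wait times, attempt count, and the generic `Exception` catch are illustrative assumptions rather than Biomni API:

```python
import time

def run_with_backoff(agent, task, max_attempts=4, base_delay=5):
    """Retry a task with exponentially increasing waits between attempts (sketch)."""
    for attempt in range(max_attempts):
        try:
            return agent.go(task)
        except Exception as exc:  # in practice, catch the provider's rate-limit error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
```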
|
||||
|
||||
**Issue: "Model not found"**
|
||||
- Verify model identifier is correct
|
||||
- Check API key has access to requested model
|
||||
- For Azure: ensure deployment exists with exact name
|
||||
|
||||
**Issue: "Timeout errors"**
|
||||
- Increase `default_config.timeout_seconds`
|
||||
- Break complex tasks into smaller steps
|
||||
- Consider using faster model for initial phases
|
||||
|
||||
**Issue: "Connection refused (Ollama/Biomni-R0)"**
|
||||
- Verify local server is running
|
||||
- Check port is not blocked by firewall
|
||||
- Confirm `api_base` URL is correct
|
||||
|
||||
### Testing Configuration
|
||||
|
||||
```python
from biomni.utils import list_available_models, validate_environment

# Check environment setup
status = validate_environment()
print("Environment Status:", status)

# List available models based on configured keys
models = list_available_models()
print("Available Models:", models)

# Test specific model
try:
    from biomni.agent import A1
    agent = A1(path='./data', llm='claude-sonnet-4-20250514')
    agent.go("Print 'Configuration successful!'")
except Exception as e:
    print(f"Configuration test failed: {e}")
```
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
1. **For most users:** Start with Claude Sonnet 4 or GPT-4o
|
||||
2. **For cost sensitivity:** Use GPT-3.5-turbo for exploration, Claude Sonnet 4 for production
|
||||
3. **For privacy/offline:** Deploy Ollama locally
|
||||
4. **For complex reasoning:** Use Biomni-R0 if hardware available
|
||||
5. **For enterprise:** Consider Azure OpenAI or AWS Bedrock
|
||||
6. **For speed:** Use Groq for rapid iteration
|
||||
|
||||
7. **Always:**
|
||||
- Set appropriate timeouts
|
||||
- Implement error handling and retries
|
||||
- Log model and configuration for reproducibility
|
||||
- Test configuration before production use
|
||||
1472
scientific-packages/biomni/references/task_examples.md
Normal file
File diff suppressed because it is too large
381
scientific-packages/biomni/scripts/generate_report.py
Normal file
@@ -0,0 +1,381 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Enhanced PDF Report Generation for Biomni
|
||||
|
||||
This script provides advanced PDF report generation with custom formatting,
|
||||
styling, and metadata for Biomni analysis results.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from typing import Optional, Dict, Any
|
||||
|
||||
|
||||
def generate_markdown_report(
|
||||
title: str,
|
||||
sections: list,
|
||||
metadata: Optional[Dict[str, Any]] = None,
|
||||
output_path: str = "report.md"
|
||||
) -> str:
|
||||
"""
|
||||
Generate a formatted markdown report.
|
||||
|
||||
Args:
|
||||
title: Report title
|
||||
sections: List of dicts with 'heading' and 'content' keys
|
||||
metadata: Optional metadata dict (author, date, etc.)
|
||||
output_path: Path to save markdown file
|
||||
|
||||
Returns:
|
||||
Path to generated markdown file
|
||||
"""
|
||||
md_content = []
|
||||
|
||||
# Title
|
||||
md_content.append(f"# {title}\n")
|
||||
|
||||
# Metadata
|
||||
if metadata:
|
||||
md_content.append("---\n")
|
||||
for key, value in metadata.items():
|
||||
md_content.append(f"**{key}:** {value} \n")
|
||||
md_content.append("---\n\n")
|
||||
|
||||
# Sections
|
||||
for section in sections:
|
||||
heading = section.get('heading', 'Section')
|
||||
content = section.get('content', '')
|
||||
level = section.get('level', 2) # Default to h2
|
||||
|
||||
md_content.append(f"{'#' * level} {heading}\n\n")
|
||||
md_content.append(f"{content}\n\n")
|
||||
|
||||
# Write to file
|
||||
output = Path(output_path)
|
||||
output.write_text('\n'.join(md_content))
|
||||
|
||||
return str(output)
|
||||
|
||||
|
||||
def convert_to_pdf_weasyprint(
|
||||
markdown_path: str,
|
||||
output_path: str,
|
||||
css_style: Optional[str] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Convert markdown to PDF using WeasyPrint.
|
||||
|
||||
Args:
|
||||
markdown_path: Path to markdown file
|
||||
output_path: Path for output PDF
|
||||
css_style: Optional CSS stylesheet path
|
||||
|
||||
Returns:
|
||||
True if successful, False otherwise
|
||||
"""
|
||||
try:
|
||||
import markdown
|
||||
from weasyprint import HTML, CSS
|
||||
|
||||
# Read markdown
|
||||
with open(markdown_path, 'r') as f:
|
||||
md_content = f.read()
|
||||
|
||||
# Convert to HTML
|
||||
html_content = markdown.markdown(
|
||||
md_content,
|
||||
extensions=['tables', 'fenced_code', 'codehilite']
|
||||
)
|
||||
|
||||
# Wrap in HTML template
|
||||
html_template = f"""
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>Biomni Report</title>
|
||||
<style>
|
||||
body {{
|
||||
font-family: 'Helvetica', 'Arial', sans-serif;
|
||||
line-height: 1.6;
|
||||
color: #333;
|
||||
max-width: 800px;
|
||||
margin: 40px auto;
|
||||
padding: 20px;
|
||||
}}
|
||||
h1 {{
|
||||
color: #2c3e50;
|
||||
border-bottom: 3px solid #3498db;
|
||||
padding-bottom: 10px;
|
||||
}}
|
||||
h2 {{
|
||||
color: #34495e;
|
||||
margin-top: 30px;
|
||||
border-bottom: 1px solid #bdc3c7;
|
||||
padding-bottom: 5px;
|
||||
}}
|
||||
h3 {{
|
||||
color: #7f8c8d;
|
||||
}}
|
||||
code {{
|
||||
background-color: #f4f4f4;
|
||||
padding: 2px 6px;
|
||||
border-radius: 3px;
|
||||
font-family: 'Courier New', monospace;
|
||||
}}
|
||||
pre {{
|
||||
background-color: #f4f4f4;
|
||||
padding: 15px;
|
||||
border-radius: 5px;
|
||||
overflow-x: auto;
|
||||
}}
|
||||
table {{
|
||||
border-collapse: collapse;
|
||||
width: 100%;
|
||||
margin: 20px 0;
|
||||
}}
|
||||
th, td {{
|
||||
border: 1px solid #ddd;
|
||||
padding: 12px;
|
||||
text-align: left;
|
||||
}}
|
||||
th {{
|
||||
background-color: #3498db;
|
||||
color: white;
|
||||
}}
|
||||
tr:nth-child(even) {{
|
||||
background-color: #f9f9f9;
|
||||
}}
|
||||
.metadata {{
|
||||
background-color: #ecf0f1;
|
||||
padding: 15px;
|
||||
border-radius: 5px;
|
||||
margin: 20px 0;
|
||||
}}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
{html_content}
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
# Generate PDF
|
||||
pdf = HTML(string=html_template)
|
||||
|
||||
# Add custom CSS if provided
|
||||
stylesheets = []
|
||||
if css_style and Path(css_style).exists():
|
||||
stylesheets.append(CSS(filename=css_style))
|
||||
|
||||
pdf.write_pdf(output_path, stylesheets=stylesheets)
|
||||
|
||||
return True
|
||||
|
||||
except ImportError:
|
||||
print("Error: WeasyPrint not installed. Install with: pip install weasyprint")
|
||||
return False
|
||||
except Exception as e:
|
||||
print(f"Error generating PDF: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def convert_to_pdf_pandoc(markdown_path: str, output_path: str) -> bool:
|
||||
"""
|
||||
Convert markdown to PDF using Pandoc.
|
||||
|
||||
Args:
|
||||
markdown_path: Path to markdown file
|
||||
output_path: Path for output PDF
|
||||
|
||||
Returns:
|
||||
True if successful, False otherwise
|
||||
"""
|
||||
try:
|
||||
import subprocess
|
||||
|
||||
# Check if pandoc is installed
|
||||
result = subprocess.run(
|
||||
['pandoc', '--version'],
|
||||
capture_output=True,
|
||||
text=True
|
||||
)
|
||||
|
||||
if result.returncode != 0:
|
||||
print("Error: Pandoc not installed")
|
||||
return False
|
||||
|
||||
# Convert with pandoc
|
||||
result = subprocess.run(
|
||||
[
|
||||
'pandoc',
|
||||
markdown_path,
|
||||
'-o', output_path,
|
||||
'--pdf-engine=pdflatex',
|
||||
'-V', 'geometry:margin=1in',
|
||||
'--toc'
|
||||
],
|
||||
capture_output=True,
|
||||
text=True
|
||||
)
|
||||
|
||||
if result.returncode != 0:
|
||||
print(f"Pandoc error: {result.stderr}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
except FileNotFoundError:
|
||||
print("Error: Pandoc not found. Install from https://pandoc.org/")
|
||||
return False
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def create_biomni_report(
|
||||
conversation_history: list,
|
||||
output_path: str = "biomni_report.pdf",
|
||||
method: str = "weasyprint"
|
||||
) -> bool:
|
||||
"""
|
||||
Create a formatted PDF report from Biomni conversation history.
|
||||
|
||||
Args:
|
||||
conversation_history: List of conversation turns
|
||||
output_path: Output PDF path
|
||||
method: Conversion method ('weasyprint' or 'pandoc')
|
||||
|
||||
Returns:
|
||||
True if successful
|
||||
"""
|
||||
# Prepare report sections
|
||||
metadata = {
|
||||
'Date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
|
||||
'Tool': 'Biomni AI Agent',
|
||||
'Report Type': 'Analysis Summary'
|
||||
}
|
||||
|
||||
sections = []
|
||||
|
||||
# Executive Summary
|
||||
sections.append({
|
||||
'heading': 'Executive Summary',
|
||||
'level': 2,
|
||||
'content': 'This report contains the complete analysis workflow executed by the Biomni biomedical AI agent.'
|
||||
})
|
||||
|
||||
# Conversation history
|
||||
for i, turn in enumerate(conversation_history, 1):
|
||||
sections.append({
|
||||
'heading': f'Task {i}: {turn.get("task", "Analysis")}',
|
||||
'level': 2,
|
||||
'content': f'**Input:**\n```\n{turn.get("input", "")}\n```\n\n**Output:**\n{turn.get("output", "")}'
|
||||
})
|
||||
|
||||
# Generate markdown
|
||||
md_path = output_path.replace('.pdf', '.md')
|
||||
generate_markdown_report(
|
||||
title="Biomni Analysis Report",
|
||||
sections=sections,
|
||||
metadata=metadata,
|
||||
output_path=md_path
|
||||
)
|
||||
|
||||
# Convert to PDF
|
||||
if method == 'weasyprint':
|
||||
success = convert_to_pdf_weasyprint(md_path, output_path)
|
||||
elif method == 'pandoc':
|
||||
success = convert_to_pdf_pandoc(md_path, output_path)
|
||||
else:
|
||||
print(f"Unknown method: {method}")
|
||||
return False
|
||||
|
||||
if success:
|
||||
print(f"✓ Report generated: {output_path}")
|
||||
print(f" Markdown: {md_path}")
|
||||
else:
|
||||
print("✗ Failed to generate PDF")
|
||||
print(f" Markdown available: {md_path}")
|
||||
|
||||
return success
|
||||
|
||||
|
||||
def main():
|
||||
"""CLI for report generation."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Generate formatted PDF reports for Biomni analyses'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'input',
|
||||
type=str,
|
||||
help='Input markdown file or conversation history'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-o', '--output',
|
||||
type=str,
|
||||
default='biomni_report.pdf',
|
||||
help='Output PDF path (default: biomni_report.pdf)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-m', '--method',
|
||||
type=str,
|
||||
choices=['weasyprint', 'pandoc'],
|
||||
default='weasyprint',
|
||||
help='Conversion method (default: weasyprint)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--css',
|
||||
type=str,
|
||||
help='Custom CSS stylesheet path'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Check if input is markdown or conversation history
|
||||
input_path = Path(args.input)
|
||||
|
||||
if not input_path.exists():
|
||||
print(f"Error: Input file not found: {args.input}")
|
||||
return 1
|
||||
|
||||
# If input is markdown, convert directly
|
||||
if input_path.suffix == '.md':
|
||||
if args.method == 'weasyprint':
|
||||
success = convert_to_pdf_weasyprint(
|
||||
str(input_path),
|
||||
args.output,
|
||||
args.css
|
||||
)
|
||||
else:
|
||||
success = convert_to_pdf_pandoc(str(input_path), args.output)
|
||||
|
||||
return 0 if success else 1
|
||||
|
||||
# Otherwise, assume it's conversation history (JSON)
|
||||
try:
|
||||
import json
|
||||
with open(input_path) as f:
|
||||
history = json.load(f)
|
||||
|
||||
success = create_biomni_report(
|
||||
history,
|
||||
args.output,
|
||||
args.method
|
||||
)
|
||||
|
||||
return 0 if success else 1
|
||||
|
||||
except json.JSONDecodeError:
|
||||
print("Error: Input file is not valid JSON or markdown")
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
230
scientific-packages/biomni/scripts/setup_environment.py
Normal file
@@ -0,0 +1,230 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Biomni Environment Setup and Validation Script
|
||||
|
||||
This script helps users set up and validate their Biomni environment,
|
||||
including checking dependencies, API keys, and data availability.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Tuple
|
||||
|
||||
|
||||
def check_python_version() -> Tuple[bool, str]:
|
||||
"""Check if Python version is compatible."""
|
||||
version = sys.version_info
|
||||
if version.major == 3 and version.minor >= 8:
|
||||
return True, f"Python {version.major}.{version.minor}.{version.micro} ✓"
|
||||
else:
|
||||
return False, f"Python {version.major}.{version.minor} - requires Python 3.8+"
|
||||
|
||||
|
||||
def check_conda_env() -> Tuple[bool, str]:
|
||||
"""Check if running in biomni conda environment."""
|
||||
conda_env = os.environ.get('CONDA_DEFAULT_ENV', None)
|
||||
if conda_env == 'biomni_e1':
|
||||
return True, f"Conda environment: {conda_env} ✓"
|
||||
else:
|
||||
return False, f"Not in biomni_e1 environment (current: {conda_env})"
|
||||
|
||||
|
||||
def check_package_installed(package: str) -> bool:
|
||||
"""Check if a Python package is installed."""
|
||||
try:
|
||||
__import__(package)
|
||||
return True
|
||||
except ImportError:
|
||||
return False
|
||||
|
||||
|
||||
def check_dependencies() -> Tuple[bool, List[str]]:
|
||||
"""Check for required and optional dependencies."""
|
||||
required = ['biomni']
|
||||
optional = ['weasyprint', 'markdown2pdf']
|
||||
|
||||
missing_required = [pkg for pkg in required if not check_package_installed(pkg)]
|
||||
missing_optional = [pkg for pkg in optional if not check_package_installed(pkg)]
|
||||
|
||||
messages = []
|
||||
success = len(missing_required) == 0
|
||||
|
||||
if missing_required:
|
||||
messages.append(f"Missing required packages: {', '.join(missing_required)}")
|
||||
messages.append("Install with: pip install biomni --upgrade")
|
||||
else:
|
||||
messages.append("Required packages: ✓")
|
||||
|
||||
if missing_optional:
|
||||
messages.append(f"Missing optional packages: {', '.join(missing_optional)}")
|
||||
messages.append("For PDF reports, install: pip install weasyprint")
|
||||
|
||||
return success, messages
|
||||
|
||||
|
||||
def check_api_keys() -> Tuple[bool, Dict[str, bool]]:
|
||||
"""Check which API keys are configured."""
|
||||
api_keys = {
|
||||
'ANTHROPIC_API_KEY': os.environ.get('ANTHROPIC_API_KEY'),
|
||||
'OPENAI_API_KEY': os.environ.get('OPENAI_API_KEY'),
|
||||
'GEMINI_API_KEY': os.environ.get('GEMINI_API_KEY'),
|
||||
'GROQ_API_KEY': os.environ.get('GROQ_API_KEY'),
|
||||
}
|
||||
|
||||
configured = {key: bool(value) for key, value in api_keys.items()}
|
||||
has_any = any(configured.values())
|
||||
|
||||
return has_any, configured
|
||||
|
||||
|
||||
def check_data_directory(data_path: str = './data') -> Tuple[bool, str]:
|
||||
"""Check if Biomni data directory exists and has content."""
|
||||
path = Path(data_path)
|
||||
|
||||
if not path.exists():
|
||||
return False, f"Data directory not found at {data_path}"
|
||||
|
||||
# Check if directory has files (data has been downloaded)
|
||||
files = list(path.glob('*'))
|
||||
if len(files) == 0:
|
||||
return False, f"Data directory exists but is empty. Run agent once to download."
|
||||
|
||||
# Rough size check (should be ~11GB)
|
||||
total_size = sum(f.stat().st_size for f in path.rglob('*') if f.is_file())
|
||||
size_gb = total_size / (1024**3)
|
||||
|
||||
if size_gb < 1:
|
||||
return False, f"Data directory exists but seems incomplete ({size_gb:.1f} GB)"
|
||||
|
||||
return True, f"Data directory: {data_path} ({size_gb:.1f} GB) ✓"
|
||||
|
||||
|
||||
def check_disk_space(required_gb: float = 20) -> Tuple[bool, str]:
|
||||
"""Check if sufficient disk space is available."""
|
||||
try:
|
||||
import shutil
|
||||
stat = shutil.disk_usage('.')
|
||||
free_gb = stat.free / (1024**3)
|
||||
|
||||
if free_gb >= required_gb:
|
||||
return True, f"Disk space: {free_gb:.1f} GB available ✓"
|
||||
else:
|
||||
return False, f"Low disk space: {free_gb:.1f} GB (need {required_gb} GB)"
|
||||
except Exception as e:
|
||||
return False, f"Could not check disk space: {e}"
|
||||
|
||||
|
||||
def test_biomni_import() -> Tuple[bool, str]:
|
||||
"""Test if Biomni can be imported and initialized."""
|
||||
try:
|
||||
from biomni.agent import A1
|
||||
from biomni.config import default_config
|
||||
return True, "Biomni import successful ✓"
|
||||
except ImportError as e:
|
||||
return False, f"Cannot import Biomni: {e}"
|
||||
except Exception as e:
|
||||
return False, f"Biomni import error: {e}"
|
||||
|
||||
|
||||
def suggest_fixes(results: Dict[str, Tuple[bool, any]]) -> List[str]:
|
||||
"""Generate suggestions for fixing issues."""
|
||||
suggestions = []
|
||||
|
||||
if not results['python'][0]:
|
||||
suggestions.append("➜ Upgrade Python to 3.8 or higher")
|
||||
|
||||
if not results['conda'][0]:
|
||||
suggestions.append("➜ Activate biomni environment: conda activate biomni_e1")
|
||||
|
||||
if not results['dependencies'][0]:
|
||||
suggestions.append("➜ Install Biomni: pip install biomni --upgrade")
|
||||
|
||||
if not results['api_keys'][0]:
|
||||
suggestions.append("➜ Set API key: export ANTHROPIC_API_KEY='your-key'")
|
||||
suggestions.append(" Or create .env file with API keys")
|
||||
|
||||
if not results['data'][0]:
|
||||
suggestions.append("➜ Data will auto-download on first agent.go() call")
|
||||
|
||||
if not results['disk_space'][0]:
|
||||
suggestions.append("➜ Free up disk space (need ~20GB total)")
|
||||
|
||||
return suggestions
|
||||
|
||||
|
||||
def main():
|
||||
"""Run all environment checks and display results."""
|
||||
print("=" * 60)
|
||||
print("Biomni Environment Validation")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
# Run all checks
|
||||
results = {}
|
||||
|
||||
print("Checking Python version...")
|
||||
results['python'] = check_python_version()
|
||||
print(f" {results['python'][1]}")
|
||||
print()
|
||||
|
||||
print("Checking conda environment...")
|
||||
results['conda'] = check_conda_env()
|
||||
print(f" {results['conda'][1]}")
|
||||
print()
|
||||
|
||||
print("Checking dependencies...")
|
||||
results['dependencies'] = check_dependencies()
|
||||
for msg in results['dependencies'][1]:
|
||||
print(f" {msg}")
|
||||
print()
|
||||
|
||||
print("Checking API keys...")
|
||||
results['api_keys'] = check_api_keys()
|
||||
has_keys, key_status = results['api_keys']
|
||||
for key, configured in key_status.items():
|
||||
status = "✓" if configured else "✗"
|
||||
print(f" {key}: {status}")
|
||||
print()
|
||||
|
||||
print("Checking Biomni data directory...")
|
||||
results['data'] = check_data_directory()
|
||||
print(f" {results['data'][1]}")
|
||||
print()
|
||||
|
||||
print("Checking disk space...")
|
||||
results['disk_space'] = check_disk_space()
|
||||
print(f" {results['disk_space'][1]}")
|
||||
print()
|
||||
|
||||
print("Testing Biomni import...")
|
||||
results['biomni_import'] = test_biomni_import()
|
||||
print(f" {results['biomni_import'][1]}")
|
||||
print()
|
||||
|
||||
# Summary
|
||||
print("=" * 60)
|
||||
all_passed = all(result[0] for result in results.values())
|
||||
|
||||
if all_passed:
|
||||
print("✓ All checks passed! Environment is ready.")
|
||||
print()
|
||||
print("Quick start:")
|
||||
print(" from biomni.agent import A1")
|
||||
print(" agent = A1(path='./data', llm='claude-sonnet-4-20250514')")
|
||||
print(" agent.go('Your biomedical task')")
|
||||
else:
|
||||
print("⚠ Some checks failed. See suggestions below:")
|
||||
print()
|
||||
suggestions = suggest_fixes(results)
|
||||
for suggestion in suggestions:
|
||||
print(suggestion)
|
||||
|
||||
print("=" * 60)
|
||||
|
||||
return 0 if all_passed else 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
450
scientific-packages/biopython/SKILL.md
Normal file
@@ -0,0 +1,450 @@
|
||||
---
|
||||
name: biopython
|
||||
description: Comprehensive toolkit for computational molecular biology using BioPython. Use this skill when working with biological sequences (DNA, RNA, protein), parsing sequence files (FASTA, GenBank, FASTQ), accessing NCBI databases (Entrez, BLAST), performing sequence alignments, building phylogenetic trees, analyzing protein structures (PDB), or any bioinformatics task requiring BioPython modules.
|
||||
---
|
||||
|
||||
# BioPython
|
||||
|
||||
## Overview
|
||||
|
||||
BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Working with biological sequences (DNA, RNA, protein)
|
||||
- Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
|
||||
- Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
|
||||
- Running or parsing BLAST searches
|
||||
- Performing sequence alignments (pairwise or multiple)
|
||||
- Building or analyzing phylogenetic trees
|
||||
- Analyzing protein structures (PDB files)
|
||||
- Calculating sequence properties (GC content, melting temp, molecular weight)
|
||||
- Converting between sequence file formats
|
||||
- Performing population genetics analysis
|
||||
- Any bioinformatics task requiring BioPython
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. Sequence Manipulation
|
||||
|
||||
Create and manipulate biological sequences using `Bio.Seq`:
|
||||
|
||||
```python
|
||||
from Bio.Seq import Seq
|
||||
|
||||
dna_seq = Seq("ATGGTGCATCTGACT")
|
||||
rna_seq = dna_seq.transcribe() # DNA → RNA
|
||||
protein = dna_seq.translate() # DNA → Protein
|
||||
rev_comp = dna_seq.reverse_complement() # Reverse complement
|
||||
```
|
||||
|
||||
**Common operations:**
|
||||
- Transcription and back-transcription
|
||||
- Translation with custom genetic codes
|
||||
- Complement and reverse complement
|
||||
- Sequence slicing and concatenation
|
||||
- Pattern searching and counting
|
||||
|
||||
**Reference:** See `references/core_modules.md` (section: Bio.Seq) for detailed operations and examples.
|
||||
|
||||
### 2. File Input/Output
|
||||
|
||||
Read and write sequence files in multiple formats using `Bio.SeqIO`:
|
||||
|
||||
```python
|
||||
from Bio import SeqIO
|
||||
|
||||
# Read sequences
|
||||
for record in SeqIO.parse("sequences.fasta", "fasta"):
|
||||
print(record.id, len(record.seq))
|
||||
|
||||
# Write sequences
|
||||
SeqIO.write(records, "output.gb", "genbank")
|
||||
|
||||
# Convert formats
|
||||
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")
|
||||
```
|
||||
|
||||
**Supported formats:** FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.
|
||||
|
||||
**Common workflows:**
|
||||
- Format conversion (FASTA ↔ GenBank ↔ FASTQ)
|
||||
- Filtering sequences by length, ID, or content
|
||||
- Batch processing large files with iterators
|
||||
- Random access with `SeqIO.index()` for large files
|
||||
|
||||
**Script:** Use `scripts/file_io.py` for file I/O examples and patterns.
|
||||
|
||||
**Reference:** See `references/core_modules.md` (section: Bio.SeqIO) for comprehensive format details and workflows.
|
||||
|
||||
### 3. NCBI Database Access
|
||||
|
||||
Access NCBI databases (GenBank, PubMed, Protein, etc.) using `Bio.Entrez`:
|
||||
|
||||
```python
|
||||
from Bio import Entrez
|
||||
|
||||
Entrez.email = "your.email@example.com" # Required!
|
||||
|
||||
# Search database
|
||||
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
|
||||
record = Entrez.read(handle)
|
||||
id_list = record["IdList"]
|
||||
|
||||
# Fetch sequences
|
||||
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
|
||||
records = SeqIO.parse(handle, "fasta")
|
||||
```
|
||||
|
||||
**Key Entrez functions:**
|
||||
- `esearch()`: Search databases, retrieve IDs
|
||||
- `efetch()`: Download full records
|
||||
- `esummary()`: Get document summaries
|
||||
- `elink()`: Find related records across databases
|
||||
- `einfo()`: Get database information
|
||||
- `epost()`: Upload ID lists for large queries
|
||||
|
||||
**Important:** Always set `Entrez.email` before using Entrez functions.
|
||||
|
||||
**Script:** Use `scripts/ncbi_entrez.py` for complete Entrez workflows including batch downloads and WebEnv usage.
|
||||
|
||||
**Reference:** See `references/database_tools.md` (section: Bio.Entrez) for detailed function documentation and parameters.
|
||||
|
||||
### 4. BLAST Searches
|
||||
|
||||
Run BLAST searches and parse results using `Bio.Blast`:
|
||||
|
||||
```python
from Bio.Blast import NCBIWWW, NCBIXML

# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)

# Save results
with open("blast_results.xml", "w") as out:
    out.write(result_handle.read())

# Parse results
with open("blast_results.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 0.001:
            print(f"Hit: {alignment.title}")
            print(f"E-value: {hsp.expect}")
            print(f"Identity: {hsp.identities}/{hsp.align_length}")
```
|
||||
|
||||
**BLAST programs:** blastn, blastp, blastx, tblastn, tblastx
|
||||
|
||||
**Key result attributes:**
|
||||
- `alignment.title`: Hit description
|
||||
- `hsp.expect`: E-value
|
||||
- `hsp.identities`: Number of identical residues
|
||||
- `hsp.query`, `hsp.match`, `hsp.sbjct`: Aligned sequences
|
||||
|
||||
**Script:** Use `scripts/blast_search.py` for complete BLAST workflows including result filtering and extraction.
|
||||
|
||||
**Reference:** See `references/database_tools.md` (section: Bio.Blast) for detailed parsing and filtering strategies.
|
||||
|
||||
### 5. Sequence Alignment
|
||||
|
||||
Perform pairwise and multiple sequence alignments using `Bio.Align`:
|
||||
|
||||
**Pairwise alignment:**
|
||||
```python
|
||||
from Bio import Align
|
||||
|
||||
aligner = Align.PairwiseAligner()
|
||||
aligner.mode = 'global' # or 'local'
|
||||
aligner.match_score = 2
|
||||
aligner.mismatch_score = -1
|
||||
aligner.gap_score = -2
|
||||
|
||||
alignments = aligner.align(seq1, seq2)
|
||||
print(alignments[0])
|
||||
print(f"Score: {alignments.score}")
|
||||
```
|
||||
|
||||
**Multiple sequence alignment I/O:**
|
||||
```python
|
||||
from Bio import AlignIO
|
||||
|
||||
# Read alignment
|
||||
alignment = AlignIO.read("alignment.clustal", "clustal")
|
||||
|
||||
# Write alignment
|
||||
AlignIO.write(alignment, "output.phylip", "phylip")
|
||||
|
||||
# Convert formats
|
||||
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")
|
||||
```
|
||||
|
||||
**Supported formats:** Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF
|
||||
|
||||
**Script:** Use `scripts/alignment_phylogeny.py` for alignment examples and workflows.
|
||||
|
||||
**Reference:** See `references/core_modules.md` (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.
|
||||
|
||||
### 6. Phylogenetic Analysis
|
||||
|
||||
Build and analyze phylogenetic trees using `Bio.Phylo`:
|
||||
|
||||
```python
|
||||
from Bio import Phylo
|
||||
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
|
||||
|
||||
# Read alignment
|
||||
alignment = AlignIO.read("sequences.fasta", "fasta")
|
||||
|
||||
# Calculate distance matrix
|
||||
calculator = DistanceCalculator('identity')
|
||||
dm = calculator.get_distance(alignment)
|
||||
|
||||
# Build tree (UPGMA or Neighbor-Joining)
|
||||
constructor = DistanceTreeConstructor(calculator)
|
||||
tree = constructor.upgma(dm) # or constructor.nj(dm)
|
||||
|
||||
# Visualize tree
|
||||
Phylo.draw_ascii(tree)
|
||||
Phylo.draw(tree) # matplotlib visualization
|
||||
|
||||
# Save tree
|
||||
Phylo.write(tree, "tree.nwk", "newick")
|
||||
```
|
||||
|
||||
**Tree manipulation** (short example below):
|
||||
- `tree.ladderize()`: Sort branches
|
||||
- `tree.root_at_midpoint()`: Root at midpoint
|
||||
- `tree.prune()`: Remove taxa
|
||||
- `tree.collapse_all()`: Collapse short branches
|
||||
- `tree.distance()`: Calculate distances between clades
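
Continuing from the tree built above (the clade names here are hypothetical placeholders for the IDs in your alignment):

```python
tree.ladderize()           # sort branches for cleaner plotting
tree.root_at_midpoint()    # re-root at the midpoint of the longest path

alpha = tree.find_any(name="Alpha")
beta = tree.find_any(name="Beta")
print(tree.distance(alpha, beta))        # branch-length distance between two terminals

tree.prune(tree.find_any(name="Gamma"))  # remove one taxon
```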
|
||||
|
||||
**Supported formats:** Newick, NEXUS, PhyloXML, NeXML
|
||||
|
||||
**Script:** Use `scripts/alignment_phylogeny.py` for tree construction and manipulation examples.
|
||||
|
||||
**Reference:** See `references/specialized_modules.md` (section: Bio.Phylo) for comprehensive tree analysis capabilities.
|
||||
|
||||
### 7. Structural Bioinformatics
|
||||
|
||||
Analyze protein structures using `Bio.PDB`:
|
||||
|
||||
```python
from Bio.PDB import PDBParser, PDBList

# Download structure
pdbl = PDBList()
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")

# Parse structure
parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")

# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.name, atom.coord)

# Secondary structure with DSSP (run on the first model, using the same PDB file)
from Bio.PDB import DSSP
model = structure[0]
dssp = DSSP(model, "1abc.pdb")

# Structural alignment
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)
print(f"RMSD: {sup.rms}")
```
|
||||
|
||||
**Key capabilities:**
|
||||
- Parse PDB, mmCIF, MMTF formats
|
||||
- Secondary structure analysis (DSSP)
|
||||
- Solvent accessibility calculations
|
||||
- Structural superimposition
|
||||
- Distance and angle calculations
|
||||
- Structure quality validation
|
||||
|
||||
**Reference:** See `references/specialized_modules.md` (section: Bio.PDB) for complete structural analysis capabilities.
|
||||
|
||||
### 8. Sequence Analysis Utilities
|
||||
|
||||
Calculate sequence properties using `Bio.SeqUtils`:
|
||||
|
||||
```python
|
||||
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
|
||||
from Bio.SeqUtils.ProtParam import ProteinAnalysis
|
||||
|
||||
# DNA analysis
|
||||
gc = gc_fraction(dna_seq) * 100
|
||||
tm = mt.Tm_NN(dna_seq) # Melting temperature
|
||||
|
||||
# Protein analysis
|
||||
protein_analysis = ProteinAnalysis(str(protein_seq))
|
||||
mw = protein_analysis.molecular_weight()
|
||||
pi = protein_analysis.isoelectric_point()
|
||||
aromaticity = protein_analysis.aromaticity()
|
||||
instability = protein_analysis.instability_index()
|
||||
```
|
||||
|
||||
**Available analyses:**
|
||||
- GC content and GC skew
|
||||
- Melting temperature (multiple methods)
|
||||
- Molecular weight
|
||||
- Isoelectric point
|
||||
- Aromaticity
|
||||
- Instability index
|
||||
- Secondary structure prediction
|
||||
- Sequence checksums
|
||||
|
||||
**Script:** Use `scripts/sequence_operations.py` for sequence analysis examples.
|
||||
|
||||
**Reference:** See `references/core_modules.md` (section: Bio.SeqUtils) for all available utilities.
|
||||
|
||||
### 9. Specialized Modules
|
||||
|
||||
**Restriction enzymes:**
|
||||
```python
|
||||
from Bio import Restriction
|
||||
enzyme = Restriction.EcoRI
|
||||
sites = enzyme.search(seq)
|
||||
```
|
||||
|
||||
**Motif analysis:**
|
||||
```python
|
||||
from Bio import motifs
|
||||
m = motifs.create([seq1, seq2, seq3])
|
||||
pwm = m.counts.normalize(pseudocounts=0.5)
|
||||
```
|
||||
|
||||
**Population genetics:**
|
||||
Use `Bio.PopGen` for allele frequencies, Hardy-Weinberg equilibrium, FST calculations.
|
||||
|
||||
**Clustering:**
|
||||
Use `Bio.Cluster` for hierarchical clustering, k-means, PCA on biological data.
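
For instance, a minimal k-means sketch (the random expression matrix and cluster count are purely illustrative):

```python
import numpy as np
from Bio.Cluster import kcluster

# Illustrative data: 10 "genes" x 5 "conditions"
data = np.random.rand(10, 5)

# k-means into 3 clusters; returns assignments, the within-cluster error,
# and how often the best solution was found across the passes
clusterid, error, nfound = kcluster(data, nclusters=3, npass=10)
print(clusterid)
```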
|
||||
|
||||
**Reference:** See `references/core_modules.md` and `references/specialized_modules.md` for specialized module documentation.
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Workflow 1: Download and Analyze NCBI Sequences
|
||||
|
||||
1. Search NCBI database with `Entrez.esearch()`
|
||||
2. Fetch sequences with `Entrez.efetch()`
|
||||
3. Parse with `SeqIO.parse()`
|
||||
4. Analyze sequences (GC content, translation, etc.)
|
||||
5. Save results to file
|
||||
|
||||
**Script:** Use `scripts/ncbi_entrez.py` for complete implementation.
|
||||
|
||||
### Workflow 2: Sequence Similarity Search
|
||||
|
||||
1. Run BLAST with `NCBIWWW.qblast()` or parse existing results
|
||||
2. Parse XML results with `NCBIXML.read()`
|
||||
3. Filter hits by E-value, identity, coverage
|
||||
4. Extract and save significant hits
|
||||
5. Perform downstream analysis
|
||||
|
||||
**Script:** Use `scripts/blast_search.py` for complete implementation.
|
||||
|
||||
### Workflow 3: Phylogenetic Tree Construction
|
||||
|
||||
1. Read multiple sequence alignment with `AlignIO.read()`
|
||||
2. Calculate distance matrix with `DistanceCalculator`
|
||||
3. Build tree with `DistanceTreeConstructor` (UPGMA or NJ)
|
||||
4. Manipulate tree (ladderize, root, prune)
|
||||
5. Visualize with `Phylo.draw()` or `Phylo.draw_ascii()`
|
||||
6. Save tree with `Phylo.write()`
|
||||
|
||||
**Script:** Use `scripts/alignment_phylogeny.py` for complete implementation.
|
||||
|
||||
### Workflow 4: Format Conversion Pipeline
|
||||
|
||||
1. Read sequences in original format with `SeqIO.parse()`
|
||||
2. Filter or modify sequences as needed
|
||||
3. Write to new format with `SeqIO.write()`
|
||||
4. Or use `SeqIO.convert()` for direct conversion
|
||||
|
||||
**Script:** Use `scripts/file_io.py` for format conversion examples.
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Email Configuration
|
||||
Always set `Entrez.email` before using NCBI services:
|
||||
```python
|
||||
Entrez.email = "your.email@example.com"
|
||||
```
|
||||
|
||||
### Rate Limiting
|
||||
Be polite to NCBI servers (a batching sketch follows this list):
|
||||
- Use `time.sleep()` between requests
|
||||
- Use WebEnv for large queries
|
||||
- Batch downloads in reasonable chunks (100-500 sequences)
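
A minimal batched-download pattern reflecting these points; the query, chunk size, and sleep interval are illustrative:

```python
import time
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"  # always identify yourself to NCBI

# IDs from a previous search (query shown for illustration)
search = Entrez.read(Entrez.esearch(db="nucleotide", term="human kinase", retmax=1000))
ids = search["IdList"]

batch_size = 200  # keep chunks modest (100-500)
records = []
for start in range(0, len(ids), batch_size):
    batch = ids[start:start + batch_size]
    handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                           rettype="fasta", retmode="text")
    records.extend(SeqIO.parse(handle, "fasta"))
    handle.close()
    time.sleep(1)  # pause between requests
```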
|
||||
|
||||
### Memory Management
|
||||
For large files:
|
||||
- Use iterators (`SeqIO.parse()`) instead of lists
|
||||
- Use `SeqIO.index()` for random access without loading entire file (sketched below)
|
||||
- Process in batches when possible
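
A short indexing sketch (file name and record ID are placeholders):

```python
from Bio import SeqIO

# Build a lightweight index once; records are then fetched lazily by ID
index = SeqIO.index("large_sequences.fasta", "fasta")
record = index["some_sequence_id"]  # loads only this record into memory
print(record.id, len(record.seq))
index.close()
```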
|
||||
|
||||
### Error Handling
Always handle potential errors:
```python
try:
    record = SeqIO.read(handle, format)
except Exception as e:
    print(f"Error: {e}")
```

### File Format Selection
Choose appropriate formats:
- FASTA: Simple sequences, no annotations
- GenBank: Rich annotations, features, references
- FASTQ: Sequences with quality scores
- PDB: 3D structural data

## Resources

### scripts/
Executable Python scripts demonstrating common BioPython workflows:

- `sequence_operations.py`: Basic sequence manipulation (transcription, translation, complement, GC content, melting temp)
- `file_io.py`: Reading, writing, and converting sequence files; filtering; indexing large files
- `ncbi_entrez.py`: Searching and downloading from NCBI databases; batch processing with WebEnv
- `blast_search.py`: Running BLAST searches online; parsing and filtering results
- `alignment_phylogeny.py`: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation

Run any script with `python3 scripts/<script_name>.py` to see examples.

### references/
Comprehensive reference documentation for BioPython modules:

- `core_modules.md`: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
- `database_tools.md`: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
- `specialized_modules.md`: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)

Reference these files when:
- Learning about specific module capabilities
- Looking up function parameters and options
- Understanding supported file formats
- Finding example code patterns

Use `grep` to search references for specific topics:
```bash
grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md
```

## Additional Resources

**Official Documentation:** https://biopython.org/docs/latest/

**Tutorial:** https://biopython.org/docs/latest/Tutorial/index.html

**API Reference:** https://biopython.org/docs/latest/api/index.html

**Cookbook:** https://biopython.org/wiki/Category:Cookbook
232
scientific-packages/biopython/references/core_modules.md
Normal file
232
scientific-packages/biopython/references/core_modules.md
Normal file
@@ -0,0 +1,232 @@
# BioPython Core Modules Reference

This document provides detailed information about BioPython's core modules and their capabilities.

## Sequence Handling

### Bio.Seq - Sequence Objects

Seq objects are BioPython's fundamental data structure for biological sequences, providing biological methods on top of string-like behavior.

**Creation:**
```python
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
```

**Key Operations:**
- String methods: `find()`, `count()`, `count_overlap()` (for overlapping patterns)
- Complement/Reverse complement: Returns complementary sequences
- Transcription: DNA → RNA (T → U)
- Back transcription: RNA → DNA
- Translation: DNA/RNA → protein with customizable genetic codes and stop codon handling

**Use Cases:**
- DNA/RNA sequence manipulation
- Converting between nucleic acid types
- Protein translation from coding sequences
- Sequence searching and pattern counting

### Bio.SeqRecord - Sequence Metadata

SeqRecord wraps Seq objects with metadata like ID, description, and features.

**Attributes:**
- `seq`: The sequence itself (Seq object)
- `id`: Unique identifier
- `name`: Short name
- `description`: Longer description
- `features`: List of SeqFeature objects
- `annotations`: Dictionary of additional information
- `letter_annotations`: Per-letter annotations (e.g., quality scores)
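**Example usage (illustrative sketch):** a minimal construction; the sequence, ID, and annotation values below are made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"),
    id="example_001",
    name="example",
    description="Synthetic example record",
)
record.annotations["molecule_type"] = "DNA"
print(record.id, len(record.seq), record.annotations)
```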
### Bio.SeqFeature - Sequence Annotations

Manages sequence annotations and features such as genes, promoters, and coding regions.

**Common Features:**
- Gene locations
- CDS (coding sequences)
- Promoters and regulatory elements
- Exons and introns
- Protein domains
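**Example usage (illustrative sketch):** defining a feature and extracting its sequence; the coordinates and qualifier values are made up.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

parent = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Feature on the forward strand covering positions 0-38 (the end coordinate is exclusive)
cds = SeqFeature(FeatureLocation(0, 39, strand=+1), type="CDS")
cds.qualifiers["gene"] = ["exampleA"]

sub_seq = cds.extract(parent)  # pull the feature's sequence out of the parent
print(cds.type, cds.location, len(sub_seq))
```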
## File Input/Output

### Bio.SeqIO - Sequence File I/O

Unified interface for reading and writing sequence files in multiple formats.

**Supported Formats:**
- FASTA/FASTQ: Standard sequence formats
- GenBank/EMBL: Feature-rich annotation formats
- Clustal/Stockholm/PHYLIP: Alignment formats
- ABI/SFF: Trace and flowgram data
- Swiss-Prot/PIR: Protein databases
- PDB: Protein structure files

**Key Functions:**

**SeqIO.parse()** - Iterator for reading multiple records:
```python
from Bio import SeqIO
for record in SeqIO.parse("file.fasta", "fasta"):
    print(record.id, len(record.seq))
```

**SeqIO.read()** - Read single record:
```python
record = SeqIO.read("file.fasta", "fasta")
```

**SeqIO.write()** - Write sequences:
```python
SeqIO.write(sequences, "output.fasta", "fasta")
```

**SeqIO.convert()** - Direct format conversion:
```python
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```

**SeqIO.index()** - Memory-efficient random access for large files:
```python
record_dict = SeqIO.index("large_file.fasta", "fasta")
sequence = record_dict["seq_id"]
```

**SeqIO.to_dict()** - Load all records into dictionary (memory-based):
```python
record_dict = SeqIO.to_dict(SeqIO.parse("file.fasta", "fasta"))
```

**Common Patterns:**
- Format conversion between FASTA, GenBank, FASTQ
- Filtering sequences by length, ID, or content
- Extracting subsequences
- Batch processing large files with iterators

### Bio.AlignIO - Multiple Sequence Alignment I/O

Handles multiple sequence alignment files.

**Key Functions:**
- `write()`: Save alignments
- `parse()`: Read multiple alignments
- `read()`: Read single alignment
- `convert()`: Convert between formats

**Supported Formats:**
- Clustal
- PHYLIP (sequential and interleaved)
- Stockholm
- NEXUS
- FASTA (aligned)
- MAF (Multiple Alignment Format)

## Sequence Alignment

### Bio.Align - Alignment Tools

**PairwiseAligner** - High-performance pairwise alignment:
```python
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global'  # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2.5
alignments = aligner.align(seq1, seq2)
```

**CodonAligner** - Codon-aware alignment

**MultipleSeqAlignment** - Container for MSA with column access

### Bio.pairwise2 (Legacy)

Legacy pairwise alignment module with functions like `align.globalxx()`, `align.localxx()`.

## Sequence Analysis Utilities

### Bio.SeqUtils - Sequence Analysis

Collection of utility functions:

**CheckSum** - Calculate sequence checksums (CRC32, CRC64, GCG)

**MeltingTemp** - DNA melting temperature calculations:
- Nearest-neighbor method
- Wallace rule
- GC content method

**IsoelectricPoint** - Protein pI calculation

**ProtParam** - Protein analysis:
- Molecular weight
- Aromaticity
- Instability index
- Secondary structure fractions

**GC/GC_skew** - Calculate GC content and GC skew for sequence windows
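**Example usage (illustrative sketch):** the sequences below are made up; `gc_fraction` is the modern accessor (older Biopython releases expose `GC()` instead).

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
print(f"GC content: {gc_fraction(dna) * 100:.1f}%")
print(f"Tm (nearest-neighbor): {mt.Tm_NN(dna):.1f}")
print(f"Tm (Wallace rule): {mt.Tm_Wallace(dna):.1f}")

protein = ProteinAnalysis("MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNGGHFLRILPDGTVDGT")
print(f"Molecular weight: {protein.molecular_weight():.1f}")
print(f"Isoelectric point: {protein.isoelectric_point():.2f}")
print(f"Instability index: {protein.instability_index():.1f}")
```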
### Bio.Data.CodonTable - Genetic Codes

Access to NCBI genetic code tables:
```python
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table.forward_table)  # codon to amino acid
print(standard_table.back_table)     # amino acid to codons
print(standard_table.start_codons)
print(standard_table.stop_codons)
```

**Available codes:**
- Standard code (1)
- Vertebrate mitochondrial (2)
- Yeast mitochondrial (3)
- And many more organism-specific codes

## Sequence Motifs and Patterns

### Bio.motifs - Sequence Motif Analysis

Tools for working with sequence motifs:

**Position Weight Matrices (PWM):**
- Create PWM from aligned sequences
- Calculate information content
- Search sequences for motif matches
- Generate consensus sequences

**Position Specific Scoring Matrices (PSSM):**
- Convert PWM to PSSM
- Score sequences against motifs
- Determine significance thresholds

**Supported Formats:**
- JASPAR
- TRANSFAC
- MEME
- AlignAce
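**Example usage (illustrative sketch):** the motif instances and score threshold are arbitrary illustrations.

```python
from Bio import motifs
from Bio.Seq import Seq

instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC")]
m = motifs.create(instances)
print(m.consensus)

pwm = m.counts.normalize(pseudocounts=0.5)
pssm = pwm.log_odds()  # PWM -> PSSM

# Scan a target sequence for matches above an arbitrary score threshold
target = Seq("TTTACACTAGCTACCCTT")
for position, score in pssm.search(target, threshold=3.0):
    print(f"Match at {position} (negative positions = reverse strand), score {score:.2f}")
```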
### Bio.Restriction - Restriction Enzymes

Comprehensive restriction enzyme database and analysis:

**Capabilities:**
- Search for restriction sites
- Predict digestion products
- Analyze restriction maps
- Access enzyme properties (recognition site, cut positions, isoschizomers)

**Example usage:**
```python
from Bio import Restriction
from Bio.Seq import Seq

seq = Seq("GAATTC...")
enzyme = Restriction.EcoRI
results = enzyme.search(seq)
```
306
scientific-packages/biopython/references/database_tools.md
Normal file
306
scientific-packages/biopython/references/database_tools.md
Normal file
@@ -0,0 +1,306 @@
# BioPython Database Access and Search Tools

This document covers BioPython's capabilities for accessing biological databases and performing sequence searches.

## NCBI Database Access

### Bio.Entrez - NCBI E-utilities Interface

Provides programmatic access to NCBI databases including PubMed, GenBank, Protein, Nucleotide, and more.

**Important:** Always set your email before using Entrez:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com"
```

#### Core Query Functions

**esearch** - Search databases and retrieve IDs:
```python
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Organism] AND COX1")
record = Entrez.read(handle)
id_list = record["IdList"]
```

Parameters:
- `db`: Database to search (nucleotide, protein, pubmed, etc.)
- `term`: Search query
- `retmax`: Maximum number of IDs to return
- `sort`: Sort order (relevance, pub_date, etc.)
- `usehistory`: Store results on server (useful for large queries)

**efetch** - Retrieve full records:
```python
handle = Entrez.efetch(db="nucleotide", id="123456", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
```

Parameters:
- `db`: Database name
- `id`: Single ID or comma-separated list
- `rettype`: Return type (gb, fasta, gp, xml, etc.)
- `retmode`: Return mode (text, xml, asn.1)
- Automatically uses POST for >200 IDs

**elink** - Find related records across databases:
```python
handle = Entrez.elink(dbfrom="protein", db="gene", id="15718680")
result = Entrez.read(handle)
```

Parameters:
- `dbfrom`: Source database
- `db`: Target database
- `id`: ID(s) to link from
- Returns LinkOut providers and relevancy scores

**esummary** - Get document summaries:
```python
handle = Entrez.esummary(db="protein", id="15718680")
summary = Entrez.read(handle)
print(summary[0]['Title'])
```

Returns quick overviews without full records.

**einfo** - Get database statistics:
```python
handle = Entrez.einfo(db="nucleotide")
info = Entrez.read(handle)
```

Provides field indices, term counts, update dates, and available links.

**epost** - Upload ID lists to server:
```python
handle = Entrez.epost("nucleotide", id="123456,789012")
result = Entrez.read(handle)
webenv = result["WebEnv"]
query_key = result["QueryKey"]
```

Useful for large queries split across multiple requests.

**espell** - Get spelling suggestions:
```python
handle = Entrez.espell(term="brest cancer")
result = Entrez.read(handle)
print(result["CorrectedQuery"])  # "breast cancer"
```

**ecitmatch** - Convert citations to PubMed IDs:
```python
citation = "proc natl acad sci u s a|1991|88|3248|mann bj|"
handle = Entrez.ecitmatch(db="pubmed", bdata=citation)
```

#### Data Processing Functions

**Entrez.read()** - Parse XML to Python dictionary:
```python
handle = Entrez.esearch(db="protein", term="insulin")
record = Entrez.read(handle)
```

**Entrez.parse()** - Generator for large XML results:
```python
handle = Entrez.efetch(db="protein", id=id_list, rettype="gp", retmode="xml")
for record in Entrez.parse(handle):
    process(record)
```

#### Common Workflows

**Download sequences by accession:**
```python
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
```

**Search and download multiple sequences:**
```python
# Search
search_handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax="100")
search_results = Entrez.read(search_handle)

# Download
fetch_handle = Entrez.efetch(db="nucleotide", id=search_results["IdList"], rettype="gb", retmode="text")
for record in SeqIO.parse(fetch_handle, "genbank"):
    print(record.id)
```

**Use WebEnv for large queries:**
```python
# Post IDs
post_handle = Entrez.epost(db="nucleotide", id=",".join(large_id_list))
post_result = Entrez.read(post_handle)

# Fetch in batches
batch_size = 500
for start in range(0, len(large_id_list), batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=post_result["WebEnv"],
        query_key=post_result["QueryKey"]
    )
    # Process batch
```

### Bio.GenBank - GenBank Format Parsing

Low-level GenBank file parser (SeqIO is usually preferred).

### Bio.SwissProt - Swiss-Prot/UniProt Parsing

Parse Swiss-Prot and UniProtKB flat file format:
```python
from Bio import SwissProt
with open("uniprot.dat") as handle:
    for record in SwissProt.parse(handle):
        print(record.entry_name, record.organism)
```

## Sequence Similarity Searches

### Bio.Blast - BLAST Interface

Tools for running BLAST searches and parsing results.

#### Running BLAST

**NCBI QBLAST (online):**
```python
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
```

Parameters:
- Program: blastn, blastp, blastx, tblastn, tblastx
- Database: nt, nr, refseq_rna, pdb, etc.
- Sequence: string or Seq object
- Additional parameters: `expect`, `word_size`, `hitlist_size`, `format_type`

**Local BLAST:**
Run standalone BLAST from the command line, then parse the results.
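A sketch of the local route, assuming the BLAST+ command-line tools are installed and a database has been built with `makeblastdb` (the file and database names are placeholders):

```python
import subprocess
from Bio.Blast import NCBIXML

# Run standalone blastn, writing XML (-outfmt 5) so NCBIXML can parse it
subprocess.run(
    [
        "blastn",
        "-query", "query.fasta",
        "-db", "my_db",
        "-out", "local_results.xml",
        "-outfmt", "5",
        "-evalue", "1e-5",
    ],
    check=True,
)

# Parse the XML exactly as for online results
with open("local_results.xml") as handle:
    for blast_record in NCBIXML.parse(handle):
        for alignment in blast_record.alignments:
            print(alignment.title, alignment.hsps[0].expect)
```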
#### Parsing BLAST Results

**XML format (recommended):**
```python
from Bio.Blast import NCBIXML

result_handle = open("blast_results.xml")
blast_records = NCBIXML.parse(result_handle)

for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.001:
                print(f"Hit: {alignment.title}")
                print(f"Length: {alignment.length}")
                print(f"E-value: {hsp.expect}")
                print(f"Identities: {hsp.identities}/{hsp.align_length}")
```

**Functions:**
- `NCBIXML.read()`: Single query
- `NCBIXML.parse()`: Multiple queries (generator)

**Key Record Attributes:**
- `alignments`: List of matching sequences
- `query`: Query sequence ID
- `query_length`: Length of query

**Alignment Attributes:**
- `title`: Description of hit
- `length`: Length of hit sequence
- `hsps`: High-scoring segment pairs

**HSP Attributes:**
- `expect`: E-value
- `score`: Bit score
- `identities`: Number of identical residues
- `positives`: Number of positive scoring matches
- `gaps`: Number of gaps
- `align_length`: Length of alignment
- `query`: Aligned query sequence
- `match`: Match indicators
- `sbjct`: Aligned subject sequence
- `query_start`, `query_end`: Query coordinates
- `sbjct_start`, `sbjct_end`: Subject coordinates

#### Common BLAST Workflows

**Find homologs:**
```python
result = NCBIWWW.qblast("blastp", "nr", protein_sequence, expect=1e-10)
with open("results.xml", "w") as out:
    out.write(result.read())
```

**Filter results by criteria:**
```python
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-5 and hsp.identities / hsp.align_length > 0.5:
            # Process high-quality hits
            pass
```

### Bio.SearchIO - Unified Search Results Parser

Modern interface for parsing various search tool outputs (BLAST, HMMER, BLAT, etc.).

**Key Functions:**
- `read()`: Parse single query
- `parse()`: Parse multiple queries (generator)
- `write()`: Write results to file
- `convert()`: Convert between formats

**Supported Tools:**
- BLAST (XML, tabular, plain text)
- HMMER (hmmscan, hmmsearch, phmmer)
- BLAT
- FASTA
- InterProScan
- Exonerate

**Example:**
```python
from Bio import SearchIO
results = SearchIO.parse("blast_output.xml", "blast-xml")
for result in results:
    for hit in result:
        if hit.hsps[0].evalue < 0.001:
            print(hit.id, hit.hsps[0].evalue)
```

## Local Database Management

### BioSQL - SQL Database Interface

Store and manage biological sequences in SQL databases (PostgreSQL, MySQL, SQLite).

**Features:**
- Store SeqRecord objects with annotations
- Efficient querying and retrieval
- Cross-reference sequences
- Track relationships between sequences

**Example:**
```python
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver="MySQLdb", user="user", passwd="pass", host="localhost", db="bioseqdb")
db = server["my_db"]

# Store sequences
db.load(SeqIO.parse("sequences.gb", "genbank"))

# Query
seq = db.lookup(accession="NC_005816")
```
612
scientific-packages/biopython/references/specialized_modules.md
Normal file
612
scientific-packages/biopython/references/specialized_modules.md
Normal file
@@ -0,0 +1,612 @@
# BioPython Specialized Analysis Modules

This document covers BioPython's specialized modules for structural biology, phylogenetics, population genetics, and other advanced analyses.

## Structural Bioinformatics

### Bio.PDB - Protein Structure Analysis

Comprehensive tools for handling macromolecular crystal structures.

#### Structure Hierarchy

PDB structures are organized hierarchically:
- **Structure** → Models → Chains → Residues → Atoms

```python
from Bio.PDB import PDBParser

parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")

# Navigate hierarchy
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.coord)  # xyz coordinates
```

#### Parsing Structure Files

**PDB format:**
```python
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.pdb")
```

**mmCIF format:**
```python
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```

**Fast mmCIF parser:**
```python
from Bio.PDB import FastMMCIFParser
parser = FastMMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```

**MMTF format:**
```python
from Bio.PDB import MMTFParser
parser = MMTFParser()
structure = parser.get_structure("structure.mmtf")
```

**Binary CIF:**
```python
from Bio.PDB.binary_cif import BinaryCIFParser
parser = BinaryCIFParser()
structure = parser.get_structure("structure.bcif")
```

#### Downloading Structures

```python
from Bio.PDB import PDBList
pdbl = PDBList()

# Download a specific structure
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir="structures/")

# Download all obsolete entries
pdbl.download_obsolete_entries(pdir="obsolete/")

# Update a local PDB mirror
pdbl.update_pdb()
```

#### Structure Selection and Filtering

```python
# Select specific chains
chain_A = structure[0]['A']

# Select specific residues
residue_10 = chain_A[10]

# Select specific atoms
ca_atom = residue_10['CA']

# Iterate over specific atom types
for atom in structure.get_atoms():
    if atom.name == 'CA':  # Alpha carbons only
        print(atom.coord)
```

**Structure selectors:**
```python
from Bio.PDB.Polypeptide import is_aa

# Filter by residue type
for residue in structure.get_residues():
    if is_aa(residue):
        print(f"Amino acid: {residue.resname}")
```

#### Secondary Structure Analysis

**DSSP integration:**
```python
from Bio.PDB import DSSP

# Requires the DSSP program to be installed
model = structure[0]
dssp = DSSP(model, "structure.pdb")

# Access secondary structure
for key in dssp:
    secondary_structure = dssp[key][2]
    accessibility = dssp[key][3]
    print(f"Residue {key}: {secondary_structure}, accessible: {accessibility}")
```

DSSP codes:
- H: Alpha helix
- B: Beta bridge
- E: Extended strand (beta sheet)
- G: 3-10 helix
- I: Pi helix
- T: Turn
- S: Bend
- -: Coil

#### Solvent Accessibility

**Shrake-Rupley algorithm:**
```python
from Bio.PDB import ShrakeRupley

sr = ShrakeRupley()
sr.compute(structure, level="R")  # R=residue, A=atom, C=chain, M=model, S=structure

for residue in structure.get_residues():
    print(f"{residue.resname} {residue.id[1]}: {residue.sasa} Å²")
```

**NACCESS wrapper:**
```python
from Bio.PDB import NACCESS

# Requires the NACCESS program
naccess = NACCESS("structure.pdb")
for residue_id, data in naccess.items():
    print(f"Residue {residue_id}: {data['all_atoms_abs']} Å²")
```

**Half-sphere exposure:**
```python
from Bio.PDB import HSExposureCA

# Computed directly from the atom coordinates (no external program needed)
model = structure[0]
HSExposureCA(model)  # annotates each residue's xtra dictionary

for chain in model:
    for residue in chain:
        if 'EXP_HSE_A_U' in residue.xtra:
            hse_up = residue.xtra['EXP_HSE_A_U']
            hse_down = residue.xtra['EXP_HSE_A_D']
```

#### Structural Alignment and Superimposition

**Standard superimposition:**
```python
from Bio.PDB import Superimposer

sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)  # Lists of atoms to align
sup.apply(structure2.get_atoms())    # Apply transformation

print(f"RMSD: {sup.rms}")
print(f"Rotation matrix: {sup.rotran[0]}")
print(f"Translation vector: {sup.rotran[1]}")
```

**QCP (Quaternion Characteristic Polynomial) method:**
```python
from Bio.PDB import QCPSuperimposer

qcp = QCPSuperimposer()
qcp.set(ref_coords, alt_coords)
qcp.run()
print(f"RMSD: {qcp.get_rms()}")
```

#### Geometric Calculations

**Distances and angles:**
```python
# Distance between atoms (subtracting two Atom objects returns the distance)
dist = atom1 - atom2

# Angle between three atoms (calc_angle expects Vector objects)
from Bio.PDB import calc_angle
angle = calc_angle(atom1.get_vector(), atom2.get_vector(), atom3.get_vector())

# Dihedral angle
from Bio.PDB import calc_dihedral
dihedral = calc_dihedral(atom1.get_vector(), atom2.get_vector(),
                         atom3.get_vector(), atom4.get_vector())
```

**Vector operations:**
```python
from Bio.PDB import Vector

v1 = Vector(atom1.coord)
v2 = Vector(atom2.coord)

# Vector operations
v3 = v1 + v2
v4 = v1 - v2
dot_product = v1 * v2
cross_product = v1 ** v2
magnitude = v1.norm()
normalized = v1.normalized()
```

#### Internal Coordinates

Advanced residue geometry representation:
```python
# Enable internal coordinates
structure.atom_to_internal_coordinates()

# Access phi, psi angles
for residue in structure.get_residues():
    if residue.internal_coord:
        print(f"Phi: {residue.internal_coord.get_angle('phi')}")
        print(f"Psi: {residue.internal_coord.get_angle('psi')}")
```

#### Writing Structures

```python
from Bio.PDB import PDBIO

io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")

# Save a specific selection (ChainSelector here stands for a user-defined Select subclass)
io.save("chain_A.pdb", select=ChainSelector("A"))
```

### Bio.SCOP - SCOP Database

Access to the Structural Classification of Proteins database.

### Bio.KEGG - Pathway Analysis

Interface to KEGG (Kyoto Encyclopedia of Genes and Genomes) databases:

**Capabilities:**
- Access pathway maps
- Retrieve enzyme data
- Get compound information
- Query orthology relationships
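A brief sketch using the `Bio.KEGG.REST` helpers (the organism and pathway identifiers follow KEGG conventions and are only examples):

```python
from Bio.KEGG import REST

# List human (hsa) pathways as tab-separated flat text
pathways = REST.kegg_list("pathway", "hsa").read()
print(pathways.splitlines()[:3])

# Retrieve one pathway entry (glycolysis / gluconeogenesis) as flat text
entry = REST.kegg_get("path:hsa00010").read()
print(entry[:200])
```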
## Phylogenetics

### Bio.Phylo - Phylogenetic Tree Analysis

Comprehensive phylogenetic tree manipulation and analysis.

#### Reading and Writing Trees

**Supported formats:**
- Newick: Simple, widely-used format
- NEXUS: Rich metadata format
- PhyloXML: XML-based with extensive annotations
- NeXML: Modern XML standard

```python
from Bio import Phylo

# Read tree
tree = Phylo.read("tree.nwk", "newick")

# Read multiple trees
trees = list(Phylo.parse("trees.nex", "nexus"))

# Write tree
Phylo.write(tree, "output.nwk", "newick")
```

#### Tree Visualization

**ASCII visualization:**
```python
Phylo.draw_ascii(tree)
```

**Matplotlib plotting:**
```python
import matplotlib.pyplot as plt
Phylo.draw(tree)
plt.show()

# With customization
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("My Phylogenetic Tree")
plt.show()
```

#### Tree Navigation and Manipulation

**Find clades:**
```python
# Get all terminal nodes (leaves)
terminals = tree.get_terminals()

# Get all nonterminal nodes
nonterminals = tree.get_nonterminals()

# Find specific clade
target = tree.find_any(name="Species_A")

# Find all matching clades
matches = tree.find_clades(terminal=True)
```

**Tree properties:**
```python
# Count terminals
num_species = tree.count_terminals()

# Get total branch length
total_length = tree.total_branch_length()

# Check if tree is bifurcating
is_bifurcating = tree.is_bifurcating()

# Get maximum distance from the root to any tip
max_dist = max(tree.depths().values())
```

**Tree modification:**
```python
# Prune the tree down to specific taxa (prune() removes one leaf at a time)
keep_taxa = {"Species_A", "Species_B", "Species_C"}
for leaf in tree.get_terminals():
    if leaf.name not in keep_taxa:
        tree.prune(leaf)

# Collapse short branches
tree.collapse_all(lambda c: c.branch_length is not None and c.branch_length < 0.01)

# Ladderize (sort branches)
tree.ladderize()

# Root tree at midpoint
tree.root_at_midpoint()

# Root at specific clade
outgroup = tree.find_any(name="Outgroup_species")
tree.root_with_outgroup(outgroup)
```

**Calculate distances:**
```python
# Distance between two clades
dist = tree.distance(clade1, clade2)

# Distance from root
root_dist = tree.distance(tree.root, terminal_clade)
```

#### Tree Construction

**Distance-based methods:**
```python
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator
from Bio import AlignIO

# Load alignment
aln = AlignIO.read("alignment.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)

# Construct tree using UPGMA
constructor = DistanceTreeConstructor()
tree_upgma = constructor.upgma(dm)

# Or using Neighbor-Joining
tree_nj = constructor.nj(dm)
```

**Parsimony method:**
```python
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher

scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)
tree = searcher.search(starting_tree, alignment)
```

**Distance calculators:**
- 'identity': Simple identity scoring
- 'blastn': BLAST nucleotide scoring
- 'blastp': BLAST protein scoring
- 'dnafull': EMBOSS DNA scoring matrix
- 'blosum62': BLOSUM62 protein matrix
- 'pam250': PAM250 protein matrix

#### Consensus Trees

```python
from Bio.Phylo.Consensus import majority_consensus, strict_consensus

# Strict consensus
consensus_strict = strict_consensus(trees)

# Majority rule consensus
consensus_majority = majority_consensus(trees, cutoff=0.5)

# Bootstrap consensus: resample the alignment, rebuild a tree per replicate, then summarize
from Bio.Phylo.Consensus import bootstrap_consensus
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

constructor = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")
bootstrap_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
```

#### External Tool Wrappers

**PhyML:**
```python
from Bio.Phylo.Applications import PhymlCommandline

cmd = PhymlCommandline(input="alignment.phy", datatype="nt", model="HKY85", alpha="e", bootstrap=100)
stdout, stderr = cmd()
tree = Phylo.read("alignment.phy_phyml_tree.txt", "newick")
```

**RAxML:**
```python
from Bio.Phylo.Applications import RaxmlCommandline

cmd = RaxmlCommandline(
    sequences="alignment.phy",
    model="GTRGAMMA",
    name="mytree",
    parsimony_seed=12345
)
stdout, stderr = cmd()
```

**FastTree:**
```python
from Bio.Phylo.Applications import FastTreeCommandline

cmd = FastTreeCommandline(input="alignment.fasta", out="tree.nwk", gtr=True, gamma=True)
stdout, stderr = cmd()
```

### Bio.Phylo.PAML - Evolutionary Analysis

Interface to PAML (Phylogenetic Analysis by Maximum Likelihood):

**CODEML - Codon-based analysis:**
```python
from Bio.Phylo.PAML import codeml

cml = codeml.Codeml()
cml.alignment = "alignment.phy"
cml.tree = "tree.nwk"
cml.out_file = "results.out"
cml.working_dir = "./paml_wd"

# Set parameters
cml.set_options(
    seqtype=1,          # Codon sequences
    model=0,            # One omega ratio
    NSsites=[0, 1, 2],  # Test different models
    CodonFreq=2,        # F3x4 codon frequencies
)

results = cml.run()
```

**BaseML - Nucleotide-based analysis:**
```python
from Bio.Phylo.PAML import baseml

bml = baseml.Baseml()
bml.alignment = "alignment.phy"
bml.tree = "tree.nwk"
results = bml.run()
```

**YN00 - Yang-Nielsen method:**
```python
from Bio.Phylo.PAML import yn00

yn = yn00.Yn00()
yn.alignment = "alignment.phy"
results = yn.run()
```

## Population Genetics

### Bio.PopGen - Population Genetics Analysis

Tools for population-level genetic analysis.

**Capabilities:**
- Allele frequency calculations
- Hardy-Weinberg equilibrium testing
- Linkage disequilibrium analysis
- F-statistics (FST, FIS, FIT)
- Tajima's D
- Population structure analysis
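The first two capabilities reduce to simple arithmetic, sketched here in plain Python with made-up genotype counts (Bio.PopGen itself is mainly used through its file-format parsers, e.g. for GenePop data):

```python
# Observed genotype counts at one biallelic locus (made-up numbers)
n_AA, n_Aa, n_aa = 42, 46, 12
n = n_AA + n_Aa + n_aa

# Allele frequencies
p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
q = 1 - p                        # frequency of allele a

# Hardy-Weinberg expected genotype counts
expected = (p * p * n, 2 * p * q * n, q * q * n)

# Chi-square statistic comparing observed and expected counts (1 df for a biallelic locus)
chi2 = sum((obs - exp) ** 2 / exp for obs, exp in zip((n_AA, n_Aa, n_aa), expected))
print(f"p={p:.3f}, q={q:.3f}, chi-square={chi2:.3f}")
```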
## Clustering and Machine Learning

### Bio.Cluster - Clustering Algorithms

Statistical clustering for gene expression and other biological data:

**Hierarchical clustering:**
```python
from Bio.Cluster import treecluster

tree = treecluster(data, method='a', dist='e')
# method: 'a'=average, 's'=single, 'm'=maximum, 'c'=centroid
# dist: 'e'=Euclidean, 'c'=correlation, 'a'=absolute correlation
```

**k-means clustering:**
```python
from Bio.Cluster import kcluster

clusterid, error, nfound = kcluster(data, nclusters=5, npass=100)
```

**Self-Organizing Maps (SOM):**
```python
from Bio.Cluster import somcluster

clusterid, celldata = somcluster(data, nxgrid=3, nygrid=3)
```

**Principal Component Analysis:**
```python
from Bio.Cluster import pca

columnmean, coordinates, components, eigenvalues = pca(data)
```

## Visualization

### Bio.Graphics - Genomic Visualization

Tools for creating publication-quality biological graphics.

**GenomeDiagram - Circular and linear genome maps:**
```python
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO

record = SeqIO.read("genome.gb", "genbank")

gd_diagram = GenomeDiagram.Diagram("Genome Map")
gd_track = gd_diagram.new_track(1, greytrack=True)
gd_feature_set = gd_track.new_set()

# Add features
for feature in record.features:
    if feature.type == "gene":
        gd_feature_set.add_feature(feature, color="blue", label=True)

gd_diagram.draw(format="linear", pagesize='A4', fragments=1)
gd_diagram.write("genome_map.pdf", "PDF")
```

**Chromosomes - Chromosome visualization:**
```python
from Bio.Graphics.BasicChromosome import Chromosome

chromosome = Chromosome("Chromosome 1")
chromosome.add("gene1", 1000, 2000, color="red")
chromosome.add("gene2", 3000, 4500, color="blue")
```

## Phenotype Analysis

### Bio.phenotype - Phenotypic Microarray Analysis

Tools for analyzing phenotypic microarray data (e.g., Biolog plates):

**Capabilities:**
- Parse PM plate data
- Growth curve analysis
- Compare phenotypic profiles
- Calculate similarity metrics
370
scientific-packages/biopython/scripts/alignment_phylogeny.py
Normal file
370
scientific-packages/biopython/scripts/alignment_phylogeny.py
Normal file
@@ -0,0 +1,370 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Sequence alignment and phylogenetic analysis using BioPython.
|
||||
|
||||
This script demonstrates:
|
||||
- Pairwise sequence alignment
|
||||
- Multiple sequence alignment I/O
|
||||
- Distance matrix calculation
|
||||
- Phylogenetic tree construction
|
||||
- Tree manipulation and visualization
|
||||
"""
|
||||
|
||||
from Bio import Align, AlignIO, Phylo
|
||||
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
|
||||
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher
|
||||
from Bio.Seq import Seq
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
|
||||
def pairwise_alignment_example():
|
||||
"""Demonstrate pairwise sequence alignment."""
|
||||
|
||||
print("Pairwise Sequence Alignment")
|
||||
print("=" * 60)
|
||||
|
||||
# Create aligner
|
||||
aligner = Align.PairwiseAligner()
|
||||
|
||||
# Set parameters
|
||||
aligner.mode = "global" # or 'local' for local alignment
|
||||
aligner.match_score = 2
|
||||
aligner.mismatch_score = -1
|
||||
aligner.open_gap_score = -2
|
||||
aligner.extend_gap_score = -0.5
|
||||
|
||||
# Sequences to align
|
||||
seq1 = "ACGTACGTACGT"
|
||||
seq2 = "ACGTTACGTGT"
|
||||
|
||||
print(f"Sequence 1: {seq1}")
|
||||
print(f"Sequence 2: {seq2}")
|
||||
print()
|
||||
|
||||
# Perform alignment
|
||||
alignments = aligner.align(seq1, seq2)
|
||||
|
||||
# Show results
|
||||
print(f"Number of optimal alignments: {len(alignments)}")
|
||||
print(f"Best alignment score: {alignments.score:.1f}")
|
||||
print()
|
||||
|
||||
# Display best alignment
|
||||
print("Best alignment:")
|
||||
print(alignments[0])
|
||||
print()
|
||||
|
||||
|
||||
def local_alignment_example():
|
||||
"""Demonstrate local alignment (Smith-Waterman)."""
|
||||
|
||||
print("Local Sequence Alignment")
|
||||
print("=" * 60)
|
||||
|
||||
aligner = Align.PairwiseAligner()
|
||||
aligner.mode = "local"
|
||||
aligner.match_score = 2
|
||||
aligner.mismatch_score = -1
|
||||
aligner.open_gap_score = -2
|
||||
aligner.extend_gap_score = -0.5
|
||||
|
||||
seq1 = "AAAAACGTACGTACGTAAAAA"
|
||||
seq2 = "TTTTTTACGTACGTTTTTTT"
|
||||
|
||||
print(f"Sequence 1: {seq1}")
|
||||
print(f"Sequence 2: {seq2}")
|
||||
print()
|
||||
|
||||
alignments = aligner.align(seq1, seq2)
|
||||
|
||||
print(f"Best local alignment score: {alignments.score:.1f}")
|
||||
print()
|
||||
print("Best local alignment:")
|
||||
print(alignments[0])
|
||||
print()
|
||||
|
||||
|
||||
def read_and_analyze_alignment(alignment_file, format="fasta"):
|
||||
"""Read and analyze a multiple sequence alignment."""
|
||||
|
||||
print(f"Reading alignment from: {alignment_file}")
|
||||
print("-" * 60)
|
||||
|
||||
# Read alignment
|
||||
alignment = AlignIO.read(alignment_file, format)
|
||||
|
||||
print(f"Number of sequences: {len(alignment)}")
|
||||
print(f"Alignment length: {alignment.get_alignment_length()}")
|
||||
print()
|
||||
|
||||
# Display alignment
|
||||
print("Alignment preview:")
|
||||
for record in alignment[:5]: # Show first 5 sequences
|
||||
print(f"{record.id[:15]:15s} {record.seq[:50]}...")
|
||||
|
||||
print()
|
||||
|
||||
# Calculate some statistics
|
||||
analyze_alignment_statistics(alignment)
|
||||
|
||||
return alignment
|
||||
|
||||
|
||||
def analyze_alignment_statistics(alignment):
|
||||
"""Calculate statistics for an alignment."""
|
||||
|
||||
print("Alignment Statistics:")
|
||||
print("-" * 60)
|
||||
|
||||
# Get alignment length
|
||||
length = alignment.get_alignment_length()
|
||||
|
||||
# Count gaps
|
||||
total_gaps = sum(str(record.seq).count("-") for record in alignment)
|
||||
gap_percentage = (total_gaps / (length * len(alignment))) * 100
|
||||
|
||||
print(f"Total positions: {length}")
|
||||
print(f"Number of sequences: {len(alignment)}")
|
||||
print(f"Total gaps: {total_gaps} ({gap_percentage:.1f}%)")
|
||||
print()
|
||||
|
||||
# Calculate conservation at each position
|
||||
conserved_positions = 0
|
||||
for i in range(length):
|
||||
column = alignment[:, i]
|
||||
# Count most common residue
|
||||
if column.count(max(set(column), key=column.count)) == len(alignment):
|
||||
conserved_positions += 1
|
||||
|
||||
conservation = (conserved_positions / length) * 100
|
||||
print(f"Fully conserved positions: {conserved_positions} ({conservation:.1f}%)")
|
||||
print()
|
||||
|
||||
|
||||
def calculate_distance_matrix(alignment):
|
||||
"""Calculate distance matrix from alignment."""
|
||||
|
||||
print("Calculating Distance Matrix")
|
||||
print("-" * 60)
|
||||
|
||||
calculator = DistanceCalculator("identity")
|
||||
dm = calculator.get_distance(alignment)
|
||||
|
||||
print("Distance matrix:")
|
||||
print(dm)
|
||||
print()
|
||||
|
||||
return dm
|
||||
|
||||
|
||||
def build_upgma_tree(alignment):
|
||||
"""Build phylogenetic tree using UPGMA."""
|
||||
|
||||
print("Building UPGMA Tree")
|
||||
print("=" * 60)
|
||||
|
||||
# Calculate distance matrix
|
||||
calculator = DistanceCalculator("identity")
|
||||
dm = calculator.get_distance(alignment)
|
||||
|
||||
# Construct tree
|
||||
constructor = DistanceTreeConstructor(calculator)
|
||||
tree = constructor.upgma(dm)
|
||||
|
||||
print("UPGMA tree constructed")
|
||||
print(f"Number of terminals: {tree.count_terminals()}")
|
||||
print()
|
||||
|
||||
return tree
|
||||
|
||||
|
||||
def build_nj_tree(alignment):
|
||||
"""Build phylogenetic tree using Neighbor-Joining."""
|
||||
|
||||
print("Building Neighbor-Joining Tree")
|
||||
print("=" * 60)
|
||||
|
||||
# Calculate distance matrix
|
||||
calculator = DistanceCalculator("identity")
|
||||
dm = calculator.get_distance(alignment)
|
||||
|
||||
# Construct tree
|
||||
constructor = DistanceTreeConstructor(calculator)
|
||||
tree = constructor.nj(dm)
|
||||
|
||||
print("Neighbor-Joining tree constructed")
|
||||
print(f"Number of terminals: {tree.count_terminals()}")
|
||||
print()
|
||||
|
||||
return tree
|
||||
|
||||
|
||||
def visualize_tree(tree, title="Phylogenetic Tree"):
|
||||
"""Visualize phylogenetic tree."""
|
||||
|
||||
print("Visualizing tree...")
|
||||
print()
|
||||
|
||||
# ASCII visualization
|
||||
print("ASCII tree:")
|
||||
Phylo.draw_ascii(tree)
|
||||
print()
|
||||
|
||||
# Matplotlib visualization
|
||||
fig, ax = plt.subplots(figsize=(10, 8))
|
||||
Phylo.draw(tree, axes=ax, do_show=False)
|
||||
ax.set_title(title)
|
||||
plt.tight_layout()
|
||||
plt.savefig("tree_visualization.png", dpi=300, bbox_inches="tight")
|
||||
print("Tree saved to tree_visualization.png")
|
||||
print()
|
||||
|
||||
|
||||
def manipulate_tree(tree):
|
||||
"""Demonstrate tree manipulation operations."""
|
||||
|
||||
print("Tree Manipulation")
|
||||
print("=" * 60)
|
||||
|
||||
# Get terminals
|
||||
terminals = tree.get_terminals()
|
||||
print(f"Terminal nodes: {[t.name for t in terminals]}")
|
||||
print()
|
||||
|
||||
# Get nonterminals
|
||||
nonterminals = tree.get_nonterminals()
|
||||
print(f"Number of internal nodes: {len(nonterminals)}")
|
||||
print()
|
||||
|
||||
# Calculate total branch length
|
||||
total_length = tree.total_branch_length()
|
||||
print(f"Total branch length: {total_length:.4f}")
|
||||
print()
|
||||
|
||||
# Find specific clade
|
||||
if len(terminals) > 0:
|
||||
target_name = terminals[0].name
|
||||
found = tree.find_any(name=target_name)
|
||||
print(f"Found clade: {found.name}")
|
||||
print()
|
||||
|
||||
# Ladderize tree (sort branches)
|
||||
tree.ladderize()
|
||||
print("Tree ladderized (branches sorted)")
|
||||
print()
|
||||
|
||||
# Root at midpoint
|
||||
tree.root_at_midpoint()
|
||||
print("Tree rooted at midpoint")
|
||||
print()
|
||||
|
||||
return tree
|
||||
|
||||
|
||||
def read_and_analyze_tree(tree_file, format="newick"):
|
||||
"""Read and analyze a phylogenetic tree."""
|
||||
|
||||
print(f"Reading tree from: {tree_file}")
|
||||
print("-" * 60)
|
||||
|
||||
tree = Phylo.read(tree_file, format)
|
||||
|
||||
print(f"Tree format: {format}")
|
||||
print(f"Number of terminals: {tree.count_terminals()}")
|
||||
print(f"Is bifurcating: {tree.is_bifurcating()}")
|
||||
print(f"Total branch length: {tree.total_branch_length():.4f}")
|
||||
print()
|
||||
|
||||
# Show tree structure
|
||||
print("Tree structure:")
|
||||
Phylo.draw_ascii(tree)
|
||||
print()
|
||||
|
||||
return tree
|
||||
|
||||
|
||||
def compare_trees(tree1, tree2):
|
||||
"""Compare two phylogenetic trees."""
|
||||
|
||||
print("Comparing Trees")
|
||||
print("=" * 60)
|
||||
|
||||
# Get terminal names
|
||||
terminals1 = {t.name for t in tree1.get_terminals()}
|
||||
terminals2 = {t.name for t in tree2.get_terminals()}
|
||||
|
||||
print(f"Tree 1 terminals: {len(terminals1)}")
|
||||
print(f"Tree 2 terminals: {len(terminals2)}")
|
||||
print(f"Shared terminals: {len(terminals1 & terminals2)}")
|
||||
print(f"Unique to tree 1: {len(terminals1 - terminals2)}")
|
||||
print(f"Unique to tree 2: {len(terminals2 - terminals1)}")
|
||||
print()
|
||||
|
||||
|
||||
def create_example_alignment():
|
||||
"""Create an example alignment for demonstration."""
|
||||
|
||||
from Bio.Seq import Seq
|
||||
from Bio.SeqRecord import SeqRecord
|
||||
from Bio.Align import MultipleSeqAlignment
|
||||
|
||||
sequences = [
|
||||
SeqRecord(Seq("ACTGCTAGCTAGCTAG"), id="seq1"),
|
||||
SeqRecord(Seq("ACTGCTAGCT-GCTAG"), id="seq2"),
|
||||
SeqRecord(Seq("ACTGCTAGCTAGCTGG"), id="seq3"),
|
||||
SeqRecord(Seq("ACTGCT-GCTAGCTAG"), id="seq4"),
|
||||
]
|
||||
|
||||
alignment = MultipleSeqAlignment(sequences)
|
||||
|
||||
# Save alignment
|
||||
AlignIO.write(alignment, "example_alignment.fasta", "fasta")
|
||||
print("Created example alignment: example_alignment.fasta")
|
||||
print()
|
||||
|
||||
return alignment
|
||||
|
||||
|
||||
def example_workflow():
|
||||
"""Demonstrate complete alignment and phylogeny workflow."""
|
||||
|
||||
print("=" * 60)
|
||||
print("BioPython Alignment & Phylogeny Workflow")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
# Pairwise alignment examples
|
||||
pairwise_alignment_example()
|
||||
print()
|
||||
local_alignment_example()
|
||||
print()
|
||||
|
||||
# Create example data
|
||||
alignment = create_example_alignment()
|
||||
|
||||
# Analyze alignment
|
||||
analyze_alignment_statistics(alignment)
|
||||
|
||||
# Calculate distance matrix
|
||||
dm = calculate_distance_matrix(alignment)
|
||||
|
||||
# Build trees
|
||||
upgma_tree = build_upgma_tree(alignment)
|
||||
nj_tree = build_nj_tree(alignment)
|
||||
|
||||
# Manipulate tree
|
||||
manipulate_tree(upgma_tree)
|
||||
|
||||
# Visualize
|
||||
visualize_tree(upgma_tree, "UPGMA Tree")
|
||||
|
||||
print("Workflow completed!")
|
||||
print()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
example_workflow()
|
||||
|
||||
print("Note: For real analyses, use actual alignment files.")
|
||||
print("Supported alignment formats: clustal, phylip, stockholm, nexus, fasta")
|
||||
print("Supported tree formats: newick, nexus, phyloxml, nexml")
|
||||
272
scientific-packages/biopython/scripts/blast_search.py
Normal file
272
scientific-packages/biopython/scripts/blast_search.py
Normal file
@@ -0,0 +1,272 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
BLAST searches and result parsing using BioPython.
|
||||
|
||||
This script demonstrates:
|
||||
- Running BLAST searches via NCBI (qblast)
|
||||
- Parsing BLAST XML output
|
||||
- Filtering and analyzing results
|
||||
- Working with alignments and HSPs
|
||||
"""
|
||||
|
||||
from Bio.Blast import NCBIWWW, NCBIXML
|
||||
from Bio import SeqIO
|
||||
|
||||
|
||||
def run_blast_online(sequence, program="blastn", database="nt", expect=0.001):
|
||||
"""
|
||||
Run BLAST search via NCBI's qblast.
|
||||
|
||||
Parameters:
|
||||
- sequence: Sequence string or Seq object
|
||||
- program: blastn, blastp, blastx, tblastn, tblastx
|
||||
- database: nt (nucleotide), nr (protein), refseq_rna, etc.
|
||||
- expect: E-value threshold
|
||||
"""
|
||||
|
||||
print(f"Running {program} search against {database} database...")
|
||||
print(f"E-value threshold: {expect}")
|
||||
print("-" * 60)
|
||||
|
||||
# Run BLAST
|
||||
result_handle = NCBIWWW.qblast(
|
||||
program=program,
|
||||
database=database,
|
||||
sequence=sequence,
|
||||
expect=expect,
|
||||
hitlist_size=50, # Number of sequences to show alignments for
|
||||
)
|
||||
|
||||
# Save results
|
||||
output_file = "blast_results.xml"
|
||||
with open(output_file, "w") as out:
|
||||
out.write(result_handle.read())
|
||||
|
||||
result_handle.close()
|
||||
|
||||
print(f"BLAST search complete. Results saved to {output_file}")
|
||||
print()
|
||||
|
||||
return output_file
|
||||
|
||||
|
||||
def parse_blast_results(xml_file, max_hits=10, evalue_threshold=0.001):
|
||||
"""Parse BLAST XML results."""
|
||||
|
||||
print(f"Parsing BLAST results from: {xml_file}")
|
||||
print(f"E-value threshold: {evalue_threshold}")
|
||||
print("=" * 60)
|
||||
|
||||
with open(xml_file) as result_handle:
|
||||
blast_record = NCBIXML.read(result_handle)
|
||||
|
||||
print(f"Query: {blast_record.query}")
|
||||
print(f"Query length: {blast_record.query_length} residues")
|
||||
print(f"Database: {blast_record.database}")
|
||||
print(f"Number of alignments: {len(blast_record.alignments)}")
|
||||
print()
|
||||
|
||||
hit_count = 0
|
||||
|
||||
for alignment in blast_record.alignments:
|
||||
for hsp in alignment.hsps:
|
||||
if hsp.expect <= evalue_threshold:
|
||||
hit_count += 1
|
||||
|
||||
if hit_count <= max_hits:
|
||||
print(f"Hit {hit_count}:")
|
||||
print(f" Sequence: {alignment.title}")
|
||||
print(f" Length: {alignment.length}")
|
||||
print(f" E-value: {hsp.expect:.2e}")
|
||||
print(f" Score: {hsp.score}")
|
||||
print(f" Identities: {hsp.identities}/{hsp.align_length} ({hsp.identities / hsp.align_length * 100:.1f}%)")
|
||||
print(f" Positives: {hsp.positives}/{hsp.align_length} ({hsp.positives / hsp.align_length * 100:.1f}%)")
|
||||
print(f" Gaps: {hsp.gaps}/{hsp.align_length}")
|
||||
print(f" Query range: {hsp.query_start} - {hsp.query_end}")
|
||||
print(f" Subject range: {hsp.sbjct_start} - {hsp.sbjct_end}")
|
||||
print()
|
||||
|
||||
# Show alignment (first 100 characters)
|
||||
print(" Alignment preview:")
|
||||
print(f" Query: {hsp.query[:100]}")
|
||||
print(f" Match: {hsp.match[:100]}")
|
||||
print(f" Sbjct: {hsp.sbjct[:100]}")
|
||||
print()
|
||||
|
||||
print(f"Total significant hits (E-value <= {evalue_threshold}): {hit_count}")
|
||||
print()
|
||||
|
||||
return blast_record
|
||||
|
||||
|
||||
def parse_multiple_queries(xml_file):
|
||||
"""Parse BLAST results with multiple queries."""
|
||||
|
||||
print(f"Parsing multiple queries from: {xml_file}")
|
||||
print("=" * 60)
|
||||
|
||||
with open(xml_file) as result_handle:
|
||||
blast_records = NCBIXML.parse(result_handle)
|
||||
|
||||
for i, blast_record in enumerate(blast_records, 1):
|
||||
print(f"\nQuery {i}: {blast_record.query}")
|
||||
print(f" Number of hits: {len(blast_record.alignments)}")
|
||||
|
||||
if blast_record.alignments:
|
||||
best_hit = blast_record.alignments[0]
|
||||
best_hsp = best_hit.hsps[0]
|
||||
print(f" Best hit: {best_hit.title[:80]}...")
|
||||
print(f" Best E-value: {best_hsp.expect:.2e}")
|
||||
|
||||
|
||||
def filter_blast_results(blast_record, min_identity=0.7, min_coverage=0.5):
|
||||
"""Filter BLAST results by identity and coverage."""
|
||||
|
||||
print(f"Filtering results:")
|
||||
print(f" Minimum identity: {min_identity * 100}%")
|
||||
print(f" Minimum coverage: {min_coverage * 100}%")
|
||||
print("-" * 60)
|
||||
|
||||
filtered_hits = []
|
||||
|
||||
for alignment in blast_record.alignments:
|
||||
for hsp in alignment.hsps:
|
||||
identity_fraction = hsp.identities / hsp.align_length
|
||||
coverage = hsp.align_length / blast_record.query_length
|
||||
|
||||
if identity_fraction >= min_identity and coverage >= min_coverage:
|
||||
filtered_hits.append(
|
||||
{
|
||||
"title": alignment.title,
|
||||
"length": alignment.length,
|
||||
"evalue": hsp.expect,
|
||||
"identity": identity_fraction,
|
||||
"coverage": coverage,
|
||||
"alignment": alignment,
|
||||
"hsp": hsp,
|
||||
}
|
||||
)
|
||||
|
||||
print(f"Found {len(filtered_hits)} hits matching criteria")
|
||||
print()
|
||||
|
||||
# Sort by E-value
|
||||
filtered_hits.sort(key=lambda x: x["evalue"])
|
||||
|
||||
# Display top hits
|
||||
for i, hit in enumerate(filtered_hits[:5], 1):
|
||||
print(f"{i}. {hit['title'][:80]}")
|
||||
print(f" Identity: {hit['identity']*100:.1f}%, Coverage: {hit['coverage']*100:.1f}%, E-value: {hit['evalue']:.2e}")
|
||||
print()
|
||||
|
||||
return filtered_hits
|
||||
|
||||
|
||||
def extract_hit_sequences(blast_record, output_file="blast_hits.fasta"):
|
||||
"""Extract aligned sequences from BLAST results."""
|
||||
|
||||
print(f"Extracting hit sequences to {output_file}...")
|
||||
|
||||
from Bio.Seq import Seq
|
||||
from Bio.SeqRecord import SeqRecord
|
||||
|
||||
records = []
|
||||
|
||||
for i, alignment in enumerate(blast_record.alignments[:10]): # Top 10 hits
|
||||
hsp = alignment.hsps[0] # Best HSP for this alignment
|
||||
|
||||
# Extract accession from title
|
||||
accession = alignment.title.split()[0]
|
||||
|
||||
# Create SeqRecord from aligned subject sequence
|
||||
record = SeqRecord(
|
||||
Seq(hsp.sbjct.replace("-", "")), # Remove gaps
|
||||
id=accession,
|
||||
description=f"E-value: {hsp.expect:.2e}, Identity: {hsp.identities}/{hsp.align_length}",
|
||||
)
|
||||
|
||||
records.append(record)
|
||||
|
||||
# Write to FASTA
|
||||
SeqIO.write(records, output_file, "fasta")
|
||||
|
||||
print(f"Extracted {len(records)} sequences")
|
||||
print()
|
||||
|
||||
|
||||
def analyze_blast_statistics(blast_record):
|
||||
"""Compute statistics from BLAST results."""
|
||||
|
||||
print("BLAST Result Statistics:")
|
||||
print("-" * 60)
|
||||
|
||||
if not blast_record.alignments:
|
||||
print("No hits found")
|
||||
return
|
||||
|
||||
evalues = []
|
||||
identities = []
|
||||
scores = []
|
||||
|
||||
for alignment in blast_record.alignments:
|
||||
for hsp in alignment.hsps:
|
||||
evalues.append(hsp.expect)
|
||||
identities.append(hsp.identities / hsp.align_length)
|
||||
scores.append(hsp.score)
|
||||
|
||||
import statistics
|
||||
|
||||
print(f"Total HSPs: {len(evalues)}")
|
||||
print(f"\nE-values:")
|
||||
print(f" Min: {min(evalues):.2e}")
|
||||
print(f" Max: {max(evalues):.2e}")
|
||||
print(f" Median: {statistics.median(evalues):.2e}")
|
||||
print(f"\nIdentity percentages:")
|
||||
print(f" Min: {min(identities)*100:.1f}%")
|
||||
print(f" Max: {max(identities)*100:.1f}%")
|
||||
print(f" Mean: {statistics.mean(identities)*100:.1f}%")
|
||||
print(f"\nBit scores:")
|
||||
print(f" Min: {min(scores):.1f}")
|
||||
print(f" Max: {max(scores):.1f}")
|
||||
print(f" Mean: {statistics.mean(scores):.1f}")
|
||||
print()
|
||||
|
||||
|
||||
def example_workflow():
|
||||
"""Demonstrate BLAST workflow."""
|
||||
|
||||
print("=" * 60)
|
||||
print("BioPython BLAST Example Workflow")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
# Example sequence (human beta-globin)
|
||||
example_sequence = """
|
||||
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
|
||||
""".replace("\n", "").replace(" ", "")
|
||||
|
||||
print("Example: Human beta-globin sequence")
|
||||
print(f"Length: {len(example_sequence)} bp")
|
||||
print()
|
||||
|
||||
# Note: Uncomment to run actual BLAST search (takes time)
|
||||
# xml_file = run_blast_online(example_sequence, program="blastn", database="nt", expect=0.001)
|
||||
|
||||
# For demonstration, use a pre-existing results file
|
||||
print("To run a real BLAST search, uncomment the run_blast_online() line")
|
||||
print("For now, demonstrating parsing with example results file")
|
||||
print()
|
||||
|
||||
# If you have results, parse them:
|
||||
# blast_record = parse_blast_results("blast_results.xml", max_hits=5)
|
||||
# filtered = filter_blast_results(blast_record, min_identity=0.9)
|
||||
# analyze_blast_statistics(blast_record)
|
||||
# extract_hit_sequences(blast_record)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
example_workflow()
|
||||
|
||||
print()
|
||||
print("Note: BLAST searches can take several minutes.")
|
||||
print("For production use, consider running local BLAST instead.")
|
||||
scientific-packages/biopython/scripts/file_io.py (new file, 215 lines)
@@ -0,0 +1,215 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
File I/O operations using BioPython SeqIO.
|
||||
|
||||
This script demonstrates:
|
||||
- Reading sequences from various formats
|
||||
- Writing sequences to files
|
||||
- Converting between formats
|
||||
- Filtering and processing sequences
|
||||
- Working with large files efficiently
|
||||
"""
|
||||
|
||||
from Bio import SeqIO
|
||||
from Bio.Seq import Seq
|
||||
from Bio.SeqRecord import SeqRecord
|
||||
|
||||
|
||||
def read_sequences(filename, format_type):
|
||||
"""Read and display sequences from a file."""
|
||||
|
||||
print(f"Reading {format_type} file: {filename}")
|
||||
print("-" * 60)
|
||||
|
||||
count = 0
|
||||
for record in SeqIO.parse(filename, format_type):
|
||||
count += 1
|
||||
print(f"ID: {record.id}")
|
||||
print(f"Name: {record.name}")
|
||||
print(f"Description: {record.description}")
|
||||
print(f"Sequence length: {len(record.seq)}")
|
||||
print(f"Sequence: {record.seq[:50]}...")
|
||||
print()
|
||||
|
||||
# Only show first 3 sequences
|
||||
if count >= 3:
|
||||
break
|
||||
|
||||
    # Count total sequences (re-parses the file; fine for modest file sizes)
|
||||
total = len(list(SeqIO.parse(filename, format_type)))
|
||||
print(f"Total sequences in file: {total}")
|
||||
print()
|
||||
|
||||
|
||||
def read_single_sequence(filename, format_type):
|
||||
"""Read a single sequence from a file."""
|
||||
|
||||
record = SeqIO.read(filename, format_type)
|
||||
|
||||
print("Single sequence record:")
|
||||
print(f"ID: {record.id}")
|
||||
print(f"Sequence: {record.seq}")
|
||||
print()
|
||||
|
||||
|
||||
def write_sequences(records, output_filename, format_type):
|
||||
"""Write sequences to a file."""
|
||||
|
||||
count = SeqIO.write(records, output_filename, format_type)
|
||||
print(f"Wrote {count} sequences to {output_filename} in {format_type} format")
|
||||
print()
|
||||
|
||||
|
||||
def convert_format(input_file, input_format, output_file, output_format):
|
||||
"""Convert sequences from one format to another."""
|
||||
|
||||
count = SeqIO.convert(input_file, input_format, output_file, output_format)
|
||||
print(f"Converted {count} sequences from {input_format} to {output_format}")
|
||||
print()
|
||||
|
||||
|
||||
def filter_sequences(input_file, format_type, min_length=100, max_length=1000):
|
||||
"""Filter sequences by length."""
|
||||
|
||||
filtered = []
|
||||
|
||||
for record in SeqIO.parse(input_file, format_type):
|
||||
if min_length <= len(record.seq) <= max_length:
|
||||
filtered.append(record)
|
||||
|
||||
print(f"Found {len(filtered)} sequences between {min_length} and {max_length} bp")
|
||||
return filtered
|
||||
|
||||
|
||||
def extract_subsequence(input_file, format_type, seq_id, start, end):
|
||||
"""Extract a subsequence from a specific record."""
|
||||
|
||||
# Index for efficient access
|
||||
record_dict = SeqIO.index(input_file, format_type)
|
||||
|
||||
if seq_id in record_dict:
|
||||
record = record_dict[seq_id]
|
||||
subseq = record.seq[start:end]
|
||||
print(f"Extracted subsequence from {seq_id} ({start}:{end}):")
|
||||
print(subseq)
|
||||
return subseq
|
||||
else:
|
||||
print(f"Sequence {seq_id} not found")
|
||||
return None
|
||||
|
||||
|
||||
def create_sequence_records():
|
||||
"""Create SeqRecord objects from scratch."""
|
||||
|
||||
# Simple record
|
||||
simple_record = SeqRecord(
|
||||
Seq("ATGCATGCATGC"),
|
||||
id="seq001",
|
||||
name="MySequence",
|
||||
description="Example sequence"
|
||||
    )
    simple_record.annotations["molecule_type"] = "DNA"  # required for GenBank output
|
||||
|
||||
# Record with annotations
|
||||
annotated_record = SeqRecord(
|
||||
Seq("ATGGTGCATCTGACTCCTGAGGAG"),
|
||||
id="seq002",
|
||||
name="GeneX",
|
||||
description="Important gene"
|
||||
)
|
||||
annotated_record.annotations["molecule_type"] = "DNA"
|
||||
annotated_record.annotations["organism"] = "Homo sapiens"
|
||||
|
||||
return [simple_record, annotated_record]
|
||||
|
||||
|
||||
def index_large_file(filename, format_type):
|
||||
"""Index a large file for random access without loading into memory."""
|
||||
|
||||
# Create index
|
||||
record_index = SeqIO.index(filename, format_type)
|
||||
|
||||
print(f"Indexed {len(record_index)} sequences")
|
||||
print(f"Available IDs: {list(record_index.keys())[:10]}...")
|
||||
print()
|
||||
|
||||
# Access specific record by ID
|
||||
if len(record_index) > 0:
|
||||
first_id = list(record_index.keys())[0]
|
||||
record = record_index[first_id]
|
||||
print(f"Accessed record: {record.id}")
|
||||
print()
|
||||
|
||||
# Close index
|
||||
record_index.close()
|
||||
|
||||
|
||||
def parse_with_quality_scores(fastq_file):
|
||||
"""Parse FASTQ files with quality scores."""
|
||||
|
||||
print("Parsing FASTQ with quality scores:")
|
||||
print("-" * 60)
|
||||
|
||||
for record in SeqIO.parse(fastq_file, "fastq"):
|
||||
print(f"ID: {record.id}")
|
||||
print(f"Sequence: {record.seq[:50]}...")
|
||||
print(f"Quality scores (first 10): {record.letter_annotations['phred_quality'][:10]}")
|
||||
|
||||
# Calculate average quality
|
||||
avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
|
||||
print(f"Average quality: {avg_quality:.2f}")
|
||||
print()
|
||||
break # Just show first record
|
||||
|
||||
|
||||
def batch_process_large_file(input_file, format_type, batch_size=100):
|
||||
"""Process large files in batches to manage memory."""
|
||||
|
||||
batch = []
|
||||
count = 0
|
||||
|
||||
for record in SeqIO.parse(input_file, format_type):
|
||||
batch.append(record)
|
||||
count += 1
|
||||
|
||||
if len(batch) == batch_size:
|
||||
# Process batch
|
||||
print(f"Processing batch of {len(batch)} sequences...")
|
||||
# Do something with batch
|
||||
batch = [] # Clear for next batch
|
||||
|
||||
# Process remaining records
|
||||
if batch:
|
||||
print(f"Processing final batch of {len(batch)} sequences...")
|
||||
|
||||
print(f"Total sequences processed: {count}")
|
||||
|
||||
|
||||
def example_workflow():
|
||||
"""Demonstrate a complete workflow."""
|
||||
|
||||
print("=" * 60)
|
||||
print("BioPython SeqIO Workflow Example")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
# Create example sequences
|
||||
records = create_sequence_records()
|
||||
|
||||
# Write as FASTA
|
||||
write_sequences(records, "example_output.fasta", "fasta")
|
||||
|
||||
# Write as GenBank
|
||||
write_sequences(records, "example_output.gb", "genbank")
|
||||
|
||||
# Convert FASTA to GenBank (would work if file exists)
|
||||
# convert_format("input.fasta", "fasta", "output.gb", "genbank")
|
||||
|
||||
print("Example workflow completed!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
example_workflow()
|
||||
|
||||
print()
|
||||
print("Note: This script demonstrates BioPython SeqIO operations.")
|
||||
print("Uncomment and adapt the functions for your specific files.")
|
||||
scientific-packages/biopython/scripts/ncbi_entrez.py (new file, 293 lines)
@@ -0,0 +1,293 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
NCBI Entrez database access using BioPython.
|
||||
|
||||
This script demonstrates:
|
||||
- Searching NCBI databases
|
||||
- Downloading sequences by accession
|
||||
- Retrieving PubMed articles
|
||||
- Batch downloading with WebEnv
|
||||
- Proper error handling and rate limiting
|
||||
"""
|
||||
|
||||
import time
|
||||
from Bio import Entrez, SeqIO
|
||||
|
||||
# IMPORTANT: Always set your email
|
||||
Entrez.email = "your.email@example.com" # Change this!
|
||||
|
||||
|
||||
def search_nucleotide(query, max_results=10):
|
||||
"""Search NCBI nucleotide database."""
|
||||
|
||||
print(f"Searching nucleotide database for: {query}")
|
||||
print("-" * 60)
|
||||
|
||||
handle = Entrez.esearch(db="nucleotide", term=query, retmax=max_results)
|
||||
record = Entrez.read(handle)
|
||||
handle.close()
|
||||
|
||||
print(f"Found {record['Count']} total matches")
|
||||
print(f"Returning top {len(record['IdList'])} IDs:")
|
||||
print(record["IdList"])
|
||||
print()
|
||||
|
||||
return record["IdList"]
|
||||
|
||||
|
||||
def fetch_sequence_by_accession(accession):
|
||||
"""Download a sequence by accession number."""
|
||||
|
||||
print(f"Fetching sequence: {accession}")
|
||||
|
||||
try:
|
||||
handle = Entrez.efetch(
|
||||
db="nucleotide", id=accession, rettype="gb", retmode="text"
|
||||
)
|
||||
record = SeqIO.read(handle, "genbank")
|
||||
handle.close()
|
||||
|
||||
print(f"Successfully retrieved: {record.id}")
|
||||
print(f"Description: {record.description}")
|
||||
print(f"Length: {len(record.seq)} bp")
|
||||
print(f"Organism: {record.annotations.get('organism', 'Unknown')}")
|
||||
print()
|
||||
|
||||
return record
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error fetching {accession}: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def fetch_multiple_sequences(id_list, output_file="downloaded_sequences.fasta"):
|
||||
"""Download multiple sequences and save to file."""
|
||||
|
||||
print(f"Fetching {len(id_list)} sequences...")
|
||||
|
||||
try:
|
||||
# For >200 IDs, efetch automatically uses POST
|
||||
handle = Entrez.efetch(
|
||||
db="nucleotide", id=id_list, rettype="fasta", retmode="text"
|
||||
)
|
||||
|
||||
# Parse and save
|
||||
records = list(SeqIO.parse(handle, "fasta"))
|
||||
handle.close()
|
||||
|
||||
SeqIO.write(records, output_file, "fasta")
|
||||
|
||||
print(f"Successfully downloaded {len(records)} sequences to {output_file}")
|
||||
print()
|
||||
|
||||
return records
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error fetching sequences: {e}")
|
||||
return []
|
||||
|
||||
|
||||
def search_and_download(query, output_file, max_results=100):
|
||||
"""Complete workflow: search and download sequences."""
|
||||
|
||||
print(f"Searching and downloading: {query}")
|
||||
print("=" * 60)
|
||||
|
||||
# Search
|
||||
handle = Entrez.esearch(db="nucleotide", term=query, retmax=max_results)
|
||||
record = Entrez.read(handle)
|
||||
handle.close()
|
||||
|
||||
id_list = record["IdList"]
|
||||
print(f"Found {len(id_list)} sequences")
|
||||
|
||||
if not id_list:
|
||||
print("No results found")
|
||||
return
|
||||
|
||||
# Download in batches to be polite
|
||||
batch_size = 100
|
||||
all_records = []
|
||||
|
||||
for start in range(0, len(id_list), batch_size):
|
||||
end = min(start + batch_size, len(id_list))
|
||||
batch_ids = id_list[start:end]
|
||||
|
||||
print(f"Downloading batch {start // batch_size + 1} ({len(batch_ids)} sequences)...")
|
||||
|
||||
handle = Entrez.efetch(
|
||||
db="nucleotide", id=batch_ids, rettype="fasta", retmode="text"
|
||||
)
|
||||
batch_records = list(SeqIO.parse(handle, "fasta"))
|
||||
handle.close()
|
||||
|
||||
all_records.extend(batch_records)
|
||||
|
||||
# Be polite - wait between requests
|
||||
time.sleep(0.5)
|
||||
|
||||
# Save all records
|
||||
SeqIO.write(all_records, output_file, "fasta")
|
||||
print(f"Downloaded {len(all_records)} sequences to {output_file}")
|
||||
print()
|
||||
|
||||
|
||||
def use_history_for_large_queries(query, max_results=1000):
|
||||
"""Use NCBI History server for large queries."""
|
||||
|
||||
print("Using NCBI History server for large query")
|
||||
print("-" * 60)
|
||||
|
||||
# Search with history
|
||||
search_handle = Entrez.esearch(
|
||||
db="nucleotide", term=query, retmax=max_results, usehistory="y"
|
||||
)
|
||||
search_results = Entrez.read(search_handle)
|
||||
search_handle.close()
|
||||
|
||||
count = int(search_results["Count"])
|
||||
webenv = search_results["WebEnv"]
|
||||
query_key = search_results["QueryKey"]
|
||||
|
||||
print(f"Found {count} total sequences")
|
||||
print(f"WebEnv: {webenv[:20]}...")
|
||||
print(f"QueryKey: {query_key}")
|
||||
print()
|
||||
|
||||
# Fetch in batches using history
|
||||
batch_size = 500
|
||||
all_records = []
|
||||
|
||||
for start in range(0, min(count, max_results), batch_size):
|
||||
end = min(start + batch_size, max_results)
|
||||
|
||||
print(f"Downloading records {start + 1} to {end}...")
|
||||
|
||||
fetch_handle = Entrez.efetch(
|
||||
db="nucleotide",
|
||||
rettype="fasta",
|
||||
retmode="text",
|
||||
retstart=start,
|
||||
retmax=batch_size,
|
||||
webenv=webenv,
|
||||
query_key=query_key,
|
||||
)
|
||||
|
||||
batch_records = list(SeqIO.parse(fetch_handle, "fasta"))
|
||||
fetch_handle.close()
|
||||
|
||||
all_records.extend(batch_records)
|
||||
|
||||
# Be polite
|
||||
time.sleep(0.5)
|
||||
|
||||
print(f"Downloaded {len(all_records)} sequences total")
|
||||
return all_records
|
||||
|
||||
|
||||
def search_pubmed(query, max_results=10):
|
||||
"""Search PubMed for articles."""
|
||||
|
||||
print(f"Searching PubMed for: {query}")
|
||||
print("-" * 60)
|
||||
|
||||
handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
|
||||
record = Entrez.read(handle)
|
||||
handle.close()
|
||||
|
||||
id_list = record["IdList"]
|
||||
print(f"Found {record['Count']} total articles")
|
||||
print(f"Returning {len(id_list)} PMIDs:")
|
||||
print(id_list)
|
||||
print()
|
||||
|
||||
return id_list
|
||||
|
||||
|
||||
def fetch_pubmed_abstracts(pmid_list):
|
||||
"""Fetch PubMed article summaries."""
|
||||
|
||||
print(f"Fetching summaries for {len(pmid_list)} articles...")
|
||||
|
||||
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="abstract", retmode="text")
|
||||
abstracts = handle.read()
|
||||
handle.close()
|
||||
|
||||
print(abstracts[:500]) # Show first 500 characters
|
||||
print("...")
|
||||
print()
|
||||
|
||||
|
||||
def get_database_info(database="nucleotide"):
|
||||
"""Get information about an NCBI database."""
|
||||
|
||||
print(f"Getting info for database: {database}")
|
||||
print("-" * 60)
|
||||
|
||||
handle = Entrez.einfo(db=database)
|
||||
record = Entrez.read(handle)
|
||||
handle.close()
|
||||
|
||||
db_info = record["DbInfo"]
|
||||
print(f"Name: {db_info['DbName']}")
|
||||
print(f"Description: {db_info['Description']}")
|
||||
print(f"Record count: {db_info['Count']}")
|
||||
print(f"Last update: {db_info['LastUpdate']}")
|
||||
print()
|
||||
|
||||
|
||||
def link_databases(db_from, db_to, id_):
|
||||
"""Find related records in other databases."""
|
||||
|
||||
print(f"Finding links from {db_from} ID {id_} to {db_to}")
|
||||
print("-" * 60)
|
||||
|
||||
handle = Entrez.elink(dbfrom=db_from, db=db_to, id=id_)
|
||||
record = Entrez.read(handle)
|
||||
handle.close()
|
||||
|
||||
if record[0]["LinkSetDb"]:
|
||||
linked_ids = [link["Id"] for link in record[0]["LinkSetDb"][0]["Link"]]
|
||||
print(f"Found {len(linked_ids)} linked records")
|
||||
print(f"IDs: {linked_ids[:10]}")
|
||||
else:
|
||||
print("No linked records found")
|
||||
|
||||
print()
|
||||
|
||||
|
||||
def example_workflow():
|
||||
"""Demonstrate complete Entrez workflow."""
|
||||
|
||||
print("=" * 60)
|
||||
print("BioPython Entrez Example Workflow")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
# Note: These are examples - uncomment to run with your email set
|
||||
|
||||
# # Example 1: Search and get IDs
|
||||
# ids = search_nucleotide("Homo sapiens[Organism] AND COX1[Gene]", max_results=5)
|
||||
#
|
||||
# # Example 2: Fetch a specific sequence
|
||||
# fetch_sequence_by_accession("NM_001301717")
|
||||
#
|
||||
# # Example 3: Complete search and download
|
||||
# search_and_download("Escherichia coli[Organism] AND 16S", "ecoli_16s.fasta", max_results=50)
|
||||
#
|
||||
# # Example 4: PubMed search
|
||||
# pmids = search_pubmed("CRISPR[Title] AND 2023[PDAT]", max_results=5)
|
||||
# fetch_pubmed_abstracts(pmids[:2])
|
||||
#
|
||||
# # Example 5: Get database info
|
||||
# get_database_info("nucleotide")
|
||||
|
||||
print("Examples are commented out. Uncomment and set your email to run.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
example_workflow()
|
||||
|
||||
print()
|
||||
print("IMPORTANT: Always set Entrez.email before using these functions!")
|
||||
print("NCBI requires an email address for their E-utilities.")
|
||||
scientific-packages/biopython/scripts/sequence_operations.py (new file, 125 lines)
@@ -0,0 +1,125 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Common sequence operations using BioPython.
|
||||
|
||||
This script demonstrates basic sequence manipulation tasks like:
|
||||
- Creating and manipulating Seq objects
|
||||
- Transcription and translation
|
||||
- Complement and reverse complement
|
||||
- Calculating GC content and melting temperature
|
||||
"""
|
||||
|
||||
from Bio.Seq import Seq
|
||||
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
|
||||
|
||||
|
||||
def demonstrate_seq_operations():
|
||||
"""Show common Seq object operations."""
|
||||
|
||||
# Create DNA sequence
|
||||
dna_seq = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTG")
|
||||
|
||||
print("Original DNA sequence:")
|
||||
print(dna_seq)
|
||||
print()
|
||||
|
||||
# Transcription (DNA -> RNA)
|
||||
rna_seq = dna_seq.transcribe()
|
||||
print("Transcribed to RNA:")
|
||||
print(rna_seq)
|
||||
print()
|
||||
|
||||
# Translation (DNA -> Protein)
|
||||
protein_seq = dna_seq.translate()
|
||||
print("Translated to protein:")
|
||||
print(protein_seq)
|
||||
print()
|
||||
|
||||
# Translation with stop codon handling
|
||||
protein_to_stop = dna_seq.translate(to_stop=True)
|
||||
print("Translated to first stop codon:")
|
||||
print(protein_to_stop)
|
||||
print()
|
||||
|
||||
# Complement
|
||||
complement = dna_seq.complement()
|
||||
print("Complement:")
|
||||
print(complement)
|
||||
print()
|
||||
|
||||
# Reverse complement
|
||||
reverse_complement = dna_seq.reverse_complement()
|
||||
print("Reverse complement:")
|
||||
print(reverse_complement)
|
||||
print()
|
||||
|
||||
# GC content
|
||||
gc = gc_fraction(dna_seq) * 100
|
||||
print(f"GC content: {gc:.2f}%")
|
||||
print()
|
||||
|
||||
# Melting temperature
|
||||
tm = mt.Tm_NN(dna_seq)
|
||||
print(f"Melting temperature (nearest-neighbor): {tm:.2f}°C")
|
||||
print()
|
||||
|
||||
# Sequence searching
|
||||
codon_start = dna_seq.find("ATG")
|
||||
print(f"Start codon (ATG) position: {codon_start}")
|
||||
|
||||
# Count occurrences
|
||||
g_count = dna_seq.count("G")
|
||||
print(f"Number of G nucleotides: {g_count}")
|
||||
print()
|
||||
|
||||
|
||||
def translate_with_genetic_code():
|
||||
"""Demonstrate translation with different genetic codes."""
|
||||
|
||||
dna_seq = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCT")
|
||||
|
||||
# Standard genetic code (table 1)
|
||||
standard = dna_seq.translate(table=1)
|
||||
print("Standard genetic code translation:")
|
||||
print(standard)
|
||||
|
||||
# Vertebrate mitochondrial code (table 2)
|
||||
mito = dna_seq.translate(table=2)
|
||||
print("Vertebrate mitochondrial code translation:")
|
||||
print(mito)
|
||||
print()
|
||||
|
||||
|
||||
def working_with_codons():
|
||||
"""Access genetic code tables."""
|
||||
from Bio.Data import CodonTable
|
||||
|
||||
# Get standard genetic code
|
||||
standard_table = CodonTable.unambiguous_dna_by_id[1]
|
||||
|
||||
print("Standard genetic code:")
|
||||
print(f"Start codons: {standard_table.start_codons}")
|
||||
print(f"Stop codons: {standard_table.stop_codons}")
|
||||
print()
|
||||
|
||||
# Show some codon translations
|
||||
print("Example codons:")
|
||||
for codon in ["ATG", "TGG", "TAA", "TAG", "TGA"]:
|
||||
if codon in standard_table.stop_codons:
|
||||
print(f"{codon} -> STOP")
|
||||
else:
|
||||
aa = standard_table.forward_table.get(codon, "Unknown")
|
||||
print(f"{codon} -> {aa}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("=" * 60)
|
||||
print("BioPython Sequence Operations Demo")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
demonstrate_seq_operations()
|
||||
print("-" * 60)
|
||||
translate_with_genetic_code()
|
||||
print("-" * 60)
|
||||
working_with_codons()
|
||||
scientific-packages/bioservices/SKILL.md (new file, 355 lines)
@@ -0,0 +1,355 @@
|
||||
---
|
||||
name: bioservices
|
||||
description: Toolkit for accessing 40+ biological web services and databases programmatically. Use when working with protein sequences, gene pathways (KEGG), identifier mapping (UniProt), compound databases (ChEBI, ChEMBL), sequence analysis (BLAST), pathway interactions, gene ontology, or any bioinformatics data retrieval tasks requiring integration across multiple biological databases.
|
||||
---
|
||||
|
||||
# BioServices
|
||||
|
||||
## Overview
|
||||
|
||||
BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Use this skill to retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Apply this skill when tasks involve:
|
||||
- Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam
|
||||
- Analyzing metabolic pathways and gene functions via KEGG or Reactome
|
||||
- Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information
|
||||
- Converting identifiers between different biological databases (KEGG↔UniProt, compound IDs)
|
||||
- Running sequence similarity searches (BLAST, MUSCLE alignment)
|
||||
- Querying gene ontology terms (QuickGO, GO annotations)
|
||||
- Accessing protein-protein interaction data (PSICQUIC, IntactComplex)
|
||||
- Mining genomic data (BioMart, ArrayExpress, ENA)
|
||||
- Integrating data from multiple bioinformatics resources in a single workflow
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. Protein Analysis
|
||||
|
||||
Retrieve protein information, sequences, and functional annotations:
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt(verbose=False)
|
||||
|
||||
# Search for protein by name
|
||||
results = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")
|
||||
|
||||
# Retrieve FASTA sequence
|
||||
sequence = u.retrieve("P43403", "fasta")
|
||||
|
||||
# Map identifiers between databases
|
||||
kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
|
||||
```
|
||||
|
||||
**Key methods:**
|
||||
- `search()`: Query UniProt with flexible search terms
|
||||
- `retrieve()`: Get protein entries in various formats (FASTA, XML, tab)
|
||||
- `mapping()`: Convert identifiers between databases
|
||||
|
||||
Reference: `references/services_reference.md` for complete UniProt API details.
|
||||
|
||||
### 2. Pathway Discovery and Analysis
|
||||
|
||||
Access KEGG pathway information for genes and organisms:
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
k.organism = "hsa" # Set to human
|
||||
|
||||
# Search for organisms
|
||||
k.lookfor_organism("droso") # Find Drosophila species
|
||||
|
||||
# Find pathways by name
|
||||
k.lookfor_pathway("B cell") # Returns matching pathway IDs
|
||||
|
||||
# Get pathways containing specific genes
|
||||
pathways = k.get_pathway_by_gene("7535", "hsa") # ZAP70 gene
|
||||
|
||||
# Retrieve and parse pathway data
|
||||
data = k.get("hsa04660")
|
||||
parsed = k.parse(data)
|
||||
|
||||
# Extract pathway interactions
|
||||
interactions = k.parse_kgml_pathway("hsa04660")
|
||||
relations = interactions['relations'] # Protein-protein interactions
|
||||
|
||||
# Convert to Simple Interaction Format
|
||||
sif_data = k.pathway2sif("hsa04660")
|
||||
```
|
||||
|
||||
**Key methods:**
|
||||
- `lookfor_organism()`, `lookfor_pathway()`: Search by name
|
||||
- `get_pathway_by_gene()`: Find pathways containing genes
|
||||
- `parse_kgml_pathway()`: Extract structured pathway data
|
||||
- `pathway2sif()`: Get protein interaction networks
|
||||
|
||||
Reference: `references/workflow_patterns.md` for complete pathway analysis workflows.
|
||||
|
||||
### 3. Compound Database Searches
|
||||
|
||||
Search and cross-reference compounds across multiple databases:
|
||||
|
||||
```python
|
||||
from bioservices import KEGG, UniChem
|
||||
|
||||
k = KEGG()
|
||||
|
||||
# Search compounds by name
|
||||
results = k.find("compound", "Geldanamycin") # Returns cpd:C11222
|
||||
|
||||
# Get compound information with database links
|
||||
compound_info = k.get("cpd:C11222") # Includes ChEBI links
|
||||
|
||||
# Cross-reference KEGG → ChEMBL using UniChem
|
||||
u = UniChem()
|
||||
chembl_id = u.get_compound_id_from_kegg("C11222") # Returns CHEMBL278315
|
||||
```
|
||||
|
||||
**Common workflow:**
|
||||
1. Search compound by name in KEGG
|
||||
2. Extract KEGG compound ID
|
||||
3. Use UniChem for KEGG → ChEMBL mapping
|
||||
4. ChEBI IDs are often provided in KEGG entries
|
||||
|
||||
Reference: `references/identifier_mapping.md` for complete cross-database mapping guide.
|
||||
|
||||
### 4. Sequence Analysis
|
||||
|
||||
Run BLAST searches and sequence alignments:
|
||||
|
||||
```python
|
||||
from bioservices import NCBIblast
|
||||
|
||||
s = NCBIblast(verbose=False)
|
||||
|
||||
# Run BLASTP against UniProtKB
|
||||
jobid = s.run(
|
||||
program="blastp",
|
||||
sequence=protein_sequence,
|
||||
stype="protein",
|
||||
database="uniprotkb",
|
||||
email="your.email@example.com" # Required by NCBI
|
||||
)
|
||||
|
||||
# Check job status and retrieve results
|
||||
s.getStatus(jobid)
|
||||
results = s.getResult(jobid, "out")
|
||||
```
|
||||
|
||||
**Note:** BLAST jobs are asynchronous. Check status before retrieving results.
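
A minimal polling sketch built from the `getStatus()`/`getResult()` calls above (status strings are assumed to follow the EBI job dispatcher convention, e.g. `RUNNING`/`FINISHED`):

```python
import time

# Poll the job until the dispatcher reports a terminal state
while s.getStatus(jobid) == "RUNNING":
    time.sleep(5)

if s.getStatus(jobid) == "FINISHED":
    results = s.getResult(jobid, "out")
else:
    raise RuntimeError(f"BLAST job {jobid} ended with status {s.getStatus(jobid)}")
```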
|
||||
|
||||
### 5. Identifier Mapping
|
||||
|
||||
Convert identifiers between different biological databases:
|
||||
|
||||
```python
|
||||
from bioservices import UniProt, KEGG
|
||||
|
||||
# UniProt mapping (many database pairs supported)
|
||||
u = UniProt()
|
||||
results = u.mapping(
|
||||
fr="UniProtKB_AC-ID", # Source database
|
||||
to="KEGG", # Target database
|
||||
query="P43403" # Identifier(s) to convert
|
||||
)
|
||||
|
||||
# KEGG gene ID → UniProt
|
||||
kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535")
|
||||
|
||||
# For compounds, use UniChem
|
||||
from bioservices import UniChem
|
||||
u = UniChem()
|
||||
chembl_from_kegg = u.get_compound_id_from_kegg("C11222")
|
||||
```
|
||||
|
||||
**Supported mappings (UniProt):**
|
||||
- UniProtKB ↔ KEGG
|
||||
- UniProtKB ↔ Ensembl
|
||||
- UniProtKB ↔ PDB
|
||||
- UniProtKB ↔ RefSeq
|
||||
- And many more (see `references/identifier_mapping.md`)
|
||||
|
||||
### 6. Gene Ontology Queries
|
||||
|
||||
Access GO terms and annotations:
|
||||
|
||||
```python
|
||||
from bioservices import QuickGO
|
||||
|
||||
g = QuickGO(verbose=False)
|
||||
|
||||
# Retrieve GO term information
|
||||
term_info = g.Term("GO:0003824", frmt="obo")
|
||||
|
||||
# Search annotations
|
||||
annotations = g.Annotation(protein="P43403", format="tsv")
|
||||
```
|
||||
|
||||
### 7. Protein-Protein Interactions
|
||||
|
||||
Query interaction databases via PSICQUIC:
|
||||
|
||||
```python
|
||||
from bioservices import PSICQUIC
|
||||
|
||||
s = PSICQUIC(verbose=False)
|
||||
|
||||
# Query specific database (e.g., MINT)
|
||||
interactions = s.query("mint", "ZAP70 AND species:9606")
|
||||
|
||||
# List available interaction databases
|
||||
databases = s.activeDBs
|
||||
```
|
||||
|
||||
**Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others.
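
As a quick sketch reusing `activeDBs` and `query()` from above (results are MITAB-style rows; the exact row format can differ between providers):

```python
# Survey a few active providers for human ZAP70 interactions
for db in s.activeDBs[:3]:
    hits = s.query(db, "ZAP70 AND species:9606")
    print(db, len(hits) if hits else 0, "interactions")
```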
|
||||
|
||||
## Multi-Service Integration Workflows
|
||||
|
||||
BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns:
|
||||
|
||||
### Complete Protein Analysis Pipeline
|
||||
|
||||
Execute a full protein characterization workflow:
|
||||
|
||||
```bash
|
||||
python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com
|
||||
```
|
||||
|
||||
This script demonstrates the following steps (a condensed inline sketch follows the list):
|
||||
1. UniProt search for protein entry
|
||||
2. FASTA sequence retrieval
|
||||
3. BLAST similarity search
|
||||
4. KEGG pathway discovery
|
||||
5. PSICQUIC interaction mapping
|
||||
|
||||
### Pathway Network Analysis
|
||||
|
||||
Analyze all pathways for an organism:
|
||||
|
||||
```bash
|
||||
python scripts/pathway_analysis.py hsa output_directory/
|
||||
```
|
||||
|
||||
Extracts and analyzes:
|
||||
- All pathway IDs for organism
|
||||
- Protein-protein interactions per pathway
|
||||
- Interaction type distributions
|
||||
- Exports to CSV/SIF formats
|
||||
|
||||
### Cross-Database Compound Search
|
||||
|
||||
Map compound identifiers across databases:
|
||||
|
||||
```bash
|
||||
python scripts/compound_cross_reference.py Geldanamycin
|
||||
```
|
||||
|
||||
Retrieves:
|
||||
- KEGG compound ID
|
||||
- ChEBI identifier
|
||||
- ChEMBL identifier
|
||||
- Basic compound properties
|
||||
|
||||
### Batch Identifier Conversion
|
||||
|
||||
Convert multiple identifiers at once:
|
||||
|
||||
```bash
|
||||
python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Output Format Handling
|
||||
|
||||
Different services return data in various formats:
|
||||
- **XML**: Parse using BeautifulSoup (most SOAP services)
|
||||
- **Tab-separated (TSV)**: Pandas DataFrames for tabular data (see the sketch after this list)
|
||||
- **Dictionary/JSON**: Direct Python manipulation
|
||||
- **FASTA**: BioPython integration for sequence analysis
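
A hedged sketch of the TSV case (column names depend on the `columns` argument passed to `search`):

```python
import io
import pandas as pd
from bioservices import UniProt

u = UniProt(verbose=False)
tsv = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism,length")
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df.head())
```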
|
||||
|
||||
### Rate Limiting and Verbosity
|
||||
|
||||
Control API request behavior:
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG(verbose=False) # Suppress HTTP request details
|
||||
k.TIMEOUT = 30 # Adjust timeout for slow connections
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
Wrap service calls in try-except blocks:
|
||||
|
||||
```python
|
||||
try:
|
||||
results = u.search("ambiguous_query")
|
||||
if results:
|
||||
# Process results
|
||||
pass
|
||||
except Exception as e:
|
||||
print(f"Search failed: {e}")
|
||||
```
|
||||
|
||||
### Organism Codes
|
||||
|
||||
Use standard organism abbreviations:
|
||||
- `hsa`: Homo sapiens (human)
|
||||
- `mmu`: Mus musculus (mouse)
|
||||
- `dme`: Drosophila melanogaster
|
||||
- `sce`: Saccharomyces cerevisiae (yeast)
|
||||
|
||||
List all organisms: `k.list("organism")` or `k.organismIds`
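
For example (a sketch; the organism listing is assumed to be tab-separated, with the code in the second column and the name in the third):

```python
from bioservices import KEGG

k = KEGG(verbose=False)

# Show the first few organism codes and names
for line in k.list("organism").strip().split("\n")[:5]:
    fields = line.split("\t")
    print(fields[1], "-", fields[2])
```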
|
||||
|
||||
### Integration with Other Tools
|
||||
|
||||
BioServices works well with:
|
||||
- **BioPython**: Sequence analysis on retrieved FASTA data
|
||||
- **Pandas**: Tabular data manipulation
|
||||
- **PyMOL**: 3D structure visualization (retrieve PDB IDs)
|
||||
- **NetworkX**: Network analysis of pathway interactions
|
||||
- **Galaxy**: Custom tool wrappers for workflow platforms
|
||||
|
||||
## Resources
|
||||
|
||||
### scripts/
|
||||
|
||||
Executable Python scripts demonstrating complete workflows:
|
||||
|
||||
- `protein_analysis_workflow.py`: End-to-end protein characterization
|
||||
- `pathway_analysis.py`: KEGG pathway discovery and network extraction
|
||||
- `compound_cross_reference.py`: Multi-database compound searching
|
||||
- `batch_id_converter.py`: Bulk identifier mapping utility
|
||||
|
||||
Scripts can be executed directly or adapted for specific use cases.
|
||||
|
||||
### references/
|
||||
|
||||
Detailed documentation loaded as needed:
|
||||
|
||||
- `services_reference.md`: Comprehensive list of all 40+ services with methods
|
||||
- `workflow_patterns.md`: Detailed multi-step analysis workflows
|
||||
- `identifier_mapping.md`: Complete guide to cross-database ID conversion
|
||||
|
||||
Load references when working with specific services or complex integration tasks.
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install bioservices
|
||||
```
|
||||
|
||||
Dependencies are automatically managed. Package is tested on Python 3.9-3.12.
|
||||
|
||||
## Additional Information
|
||||
|
||||
For detailed API documentation and advanced features, refer to:
|
||||
- Official documentation: https://bioservices.readthedocs.io/
|
||||
- Source code: https://github.com/cokelaer/bioservices
|
||||
- Service-specific references in `references/services_reference.md`
|
||||
scientific-packages/bioservices/references/identifier_mapping.md (new file, 685 lines)
@@ -0,0 +1,685 @@
|
||||
# BioServices: Identifier Mapping Guide
|
||||
|
||||
This document provides comprehensive information about converting identifiers between different biological databases using BioServices.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Overview](#overview)
|
||||
2. [UniProt Mapping Service](#uniprot-mapping-service)
|
||||
3. [UniChem Compound Mapping](#unichem-compound-mapping)
|
||||
4. [KEGG Identifier Conversions](#kegg-identifier-conversions)
|
||||
5. [Common Mapping Patterns](#common-mapping-patterns)
|
||||
6. [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:
|
||||
|
||||
1. **UniProt Mapping**: Comprehensive protein/gene ID conversion
|
||||
2. **UniChem**: Chemical compound ID mapping
|
||||
3. **KEGG**: Built-in cross-references in entries
|
||||
4. **PICR**: Protein identifier cross-reference service
|
||||
|
||||
---
|
||||
|
||||
## UniProt Mapping Service
|
||||
|
||||
The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# Map single ID
|
||||
result = u.mapping(
|
||||
fr="UniProtKB_AC-ID", # Source database
|
||||
to="KEGG", # Target database
|
||||
query="P43403" # Identifier to convert
|
||||
)
|
||||
|
||||
print(result)
|
||||
# Output: {'P43403': ['hsa:7535']}
|
||||
```
|
||||
|
||||
### Batch Mapping
|
||||
|
||||
```python
|
||||
# Map multiple IDs (comma-separated)
|
||||
ids = ["P43403", "P04637", "P53779"]
|
||||
result = u.mapping(
|
||||
fr="UniProtKB_AC-ID",
|
||||
to="KEGG",
|
||||
query=",".join(ids)
|
||||
)
|
||||
|
||||
for uniprot_id, kegg_ids in result.items():
|
||||
print(f"{uniprot_id} → {kegg_ids}")
|
||||
```
|
||||
|
||||
### Supported Database Pairs
|
||||
|
||||
UniProt supports mapping between 100+ database pairs. Key ones include:
|
||||
|
||||
#### Protein/Gene Databases
|
||||
|
||||
| Source Format | Code | Target Format | Code |
|
||||
|---------------|------|---------------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | KEGG | `KEGG` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl | `Ensembl` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Protein | `Ensembl_Protein` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Transcript | `Ensembl_Transcript` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Protein | `RefSeq_Protein` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Nucleotide | `RefSeq_Nucleotide` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GeneID (Entrez) | `GeneID` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | HGNC | `HGNC` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | MGI | `MGI` |
|
||||
| KEGG | `KEGG` | UniProtKB | `UniProtKB` |
|
||||
| Ensembl | `Ensembl` | UniProtKB | `UniProtKB` |
|
||||
| GeneID | `GeneID` | UniProtKB | `UniProtKB` |
|
||||
|
||||
#### Structural Databases
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PDB | `PDB` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Pfam | `Pfam` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | InterPro | `InterPro` |
|
||||
| PDB | `PDB` | UniProtKB | `UniProtKB` |
|
||||
|
||||
#### Expression & Proteomics
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PRIDE | `PRIDE` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ProteomicsDB | `ProteomicsDB` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PaxDb | `PaxDb` |
|
||||
|
||||
#### Organism-Specific
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | FlyBase | `FlyBase` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | WormBase | `WormBase` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | SGD | `SGD` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ZFIN | `ZFIN` |
|
||||
|
||||
#### Other Useful Mappings
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GO | `GO` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Reactome | `Reactome` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | STRING | `STRING` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | BioGRID | `BioGRID` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | OMA | `OMA` |
|
||||
|
||||
### Complete List of Database Codes
|
||||
|
||||
To get the complete, up-to-date list:
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# This information is in the UniProt REST API documentation
|
||||
# Common patterns:
|
||||
# - Source databases typically end in source database name
|
||||
# - UniProtKB uses "UniProtKB_AC-ID" or "UniProtKB"
|
||||
# - Most other databases use their standard abbreviation
|
||||
```
|
||||
|
||||
### Common Database Codes Reference
|
||||
|
||||
**Gene/Protein Identifiers:**
|
||||
- `UniProtKB_AC-ID`: UniProt accession/ID
|
||||
- `UniProtKB`: UniProt accession
|
||||
- `KEGG`: KEGG gene IDs (e.g., hsa:7535)
|
||||
- `GeneID`: NCBI Gene (Entrez) IDs
|
||||
- `Ensembl`: Ensembl gene IDs
|
||||
- `Ensembl_Protein`: Ensembl protein IDs
|
||||
- `Ensembl_Transcript`: Ensembl transcript IDs
|
||||
- `RefSeq_Protein`: RefSeq protein IDs (NP_)
|
||||
- `RefSeq_Nucleotide`: RefSeq nucleotide IDs (NM_)
|
||||
|
||||
**Gene Nomenclature:**
|
||||
- `HGNC`: Human Gene Nomenclature Committee
|
||||
- `MGI`: Mouse Genome Informatics
|
||||
- `RGD`: Rat Genome Database
|
||||
- `SGD`: Saccharomyces Genome Database
|
||||
- `FlyBase`: Drosophila database
|
||||
- `WormBase`: C. elegans database
|
||||
- `ZFIN`: Zebrafish database
|
||||
|
||||
**Structure:**
|
||||
- `PDB`: Protein Data Bank
|
||||
- `Pfam`: Protein families
|
||||
- `InterPro`: Protein domains
|
||||
- `SUPFAM`: Superfamily
|
||||
- `PROSITE`: Protein motifs
|
||||
|
||||
**Pathways & Networks:**
|
||||
- `Reactome`: Reactome pathways
|
||||
- `BioCyc`: BioCyc pathways
|
||||
- `PathwayCommons`: Pathway Commons
|
||||
- `STRING`: Protein-protein networks
|
||||
- `BioGRID`: Interaction database
|
||||
|
||||
### Mapping Examples
|
||||
|
||||
#### UniProt → KEGG
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# Single mapping
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
|
||||
print(result) # {'P43403': ['hsa:7535']}
|
||||
```
|
||||
|
||||
#### KEGG → UniProt
|
||||
|
||||
```python
|
||||
# Reverse mapping
|
||||
result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
|
||||
print(result) # {'hsa:7535': ['P43403']}
|
||||
```
|
||||
|
||||
#### UniProt → Ensembl
|
||||
|
||||
```python
|
||||
# To Ensembl gene IDs
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
|
||||
print(result) # {'P43403': ['ENSG00000115085']}
|
||||
|
||||
# To Ensembl protein IDs
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
|
||||
print(result) # {'P43403': ['ENSP00000381359']}
|
||||
```
|
||||
|
||||
#### UniProt → PDB
|
||||
|
||||
```python
|
||||
# Find 3D structures
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
|
||||
print(result) # {'P04637': ['1A1U', '1AIE', '1C26', ...]}
|
||||
```
|
||||
|
||||
#### UniProt → RefSeq
|
||||
|
||||
```python
|
||||
# Get RefSeq protein IDs
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
|
||||
print(result) # {'P43403': ['NP_001070.2']}
|
||||
```
|
||||
|
||||
#### Gene Name → UniProt (via search, then mapping)
|
||||
|
||||
```python
|
||||
# First search for gene
|
||||
search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")
|
||||
lines = search_result.strip().split("\n")
|
||||
if len(lines) > 1:
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
|
||||
# Then map to other databases
|
||||
kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
print(kegg_id)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## UniChem Compound Mapping
|
||||
|
||||
UniChem specializes in mapping chemical compound identifiers across databases.
|
||||
|
||||
### Source Database IDs
|
||||
|
||||
| Source ID | Database |
|
||||
|-----------|----------|
|
||||
| 1 | ChEMBL |
|
||||
| 2 | DrugBank |
|
||||
| 3 | PDB |
|
||||
| 4 | IUPHAR/BPS Guide to Pharmacology |
|
||||
| 5 | PubChem |
|
||||
| 6 | KEGG |
|
||||
| 7 | ChEBI |
|
||||
| 8 | NIH Clinical Collection |
|
||||
| 14 | FDA/SRS |
|
||||
| 22 | PubChem |
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from bioservices import UniChem
|
||||
|
||||
u = UniChem()
|
||||
|
||||
# Get ChEMBL ID from KEGG compound ID
|
||||
chembl_id = u.get_compound_id_from_kegg("C11222")
|
||||
print(chembl_id) # CHEMBL278315
|
||||
```
|
||||
|
||||
### All Compound IDs
|
||||
|
||||
```python
|
||||
# Get all identifiers for a compound
|
||||
# src_compound_id: compound ID, src_id: source database ID
|
||||
all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1) # 1 = ChEMBL
|
||||
|
||||
for mapping in all_ids:
|
||||
src_name = mapping['src_name']
|
||||
src_compound_id = mapping['src_compound_id']
|
||||
print(f"{src_name}: {src_compound_id}")
|
||||
```
|
||||
|
||||
### Specific Database Conversion
|
||||
|
||||
```python
|
||||
# Convert between specific databases
|
||||
# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
|
||||
result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
|
||||
print(result)
|
||||
```
|
||||
|
||||
### Common Compound Mappings
|
||||
|
||||
#### KEGG → ChEMBL
|
||||
|
||||
```python
|
||||
u = UniChem()
|
||||
chembl_id = u.get_compound_id_from_kegg("C00031") # D-Glucose
|
||||
print(f"ChEMBL: {chembl_id}")
|
||||
```
|
||||
|
||||
#### ChEMBL → PubChem
|
||||
|
||||
```python
|
||||
result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
|
||||
if result:
|
||||
pubchem_id = result[0]['src_compound_id']
|
||||
print(f"PubChem: {pubchem_id}")
|
||||
```
|
||||
|
||||
#### ChEBI → DrugBank
|
||||
|
||||
```python
|
||||
result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
|
||||
if result:
|
||||
drugbank_id = result[0]['src_compound_id']
|
||||
print(f"DrugBank: {drugbank_id}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## KEGG Identifier Conversions
|
||||
|
||||
KEGG entries contain cross-references that can be extracted by parsing.
|
||||
|
||||
### Extract Database Links from KEGG Entry
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
|
||||
# Get compound entry
|
||||
entry = k.get("cpd:C11222")
|
||||
|
||||
# Parse for specific database
|
||||
chebi_id = None
|
||||
uniprot_ids = []
|
||||
|
||||
for line in entry.split("\n"):
|
||||
if "ChEBI:" in line:
|
||||
# Extract ChEBI ID
|
||||
parts = line.split("ChEBI:")
|
||||
if len(parts) > 1:
|
||||
chebi_id = parts[1].strip().split()[0]
|
||||
|
||||
# For genes/proteins
|
||||
gene_entry = k.get("hsa:7535")
|
||||
for line in gene_entry.split("\n"):
|
||||
if line.startswith(" "): # Database links section
|
||||
if "UniProt:" in line:
|
||||
parts = line.split("UniProt:")
|
||||
if len(parts) > 1:
|
||||
uniprot_id = parts[1].strip()
|
||||
uniprot_ids.append(uniprot_id)
|
||||
```
|
||||
|
||||
### KEGG Gene ID Components
|
||||
|
||||
KEGG gene IDs have format `organism:gene_id`:
|
||||
|
||||
```python
|
||||
kegg_id = "hsa:7535"
|
||||
organism, gene_id = kegg_id.split(":")
|
||||
|
||||
print(f"Organism: {organism}") # hsa (human)
|
||||
print(f"Gene ID: {gene_id}") # 7535
|
||||
```
|
||||
|
||||
### KEGG Pathway to Genes
|
||||
|
||||
```python
|
||||
k = KEGG()
|
||||
|
||||
# Get pathway entry
|
||||
pathway = k.get("path:hsa04660")
|
||||
|
||||
# Parse for gene list
|
||||
genes = []
|
||||
in_gene_section = False
|
||||
|
||||
for line in pathway.split("\n"):
|
||||
if line.startswith("GENE"):
|
||||
in_gene_section = True
|
||||
|
||||
if in_gene_section:
|
||||
if line.startswith(" " * 12): # Gene line
|
||||
parts = line.strip().split()
|
||||
if parts:
|
||||
gene_id = parts[0]
|
||||
genes.append(f"hsa:{gene_id}")
|
||||
elif not line.startswith(" "):
|
||||
break
|
||||
|
||||
print(f"Found {len(genes)} genes")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Mapping Patterns
|
||||
|
||||
### Pattern 1: Gene Symbol → Multiple Database IDs
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
def gene_symbol_to_ids(gene_symbol, organism="9606"):
|
||||
"""Convert gene symbol to multiple database IDs."""
|
||||
u = UniProt()
|
||||
|
||||
# Search for gene
|
||||
query = f"gene:{gene_symbol} AND organism:{organism}"
|
||||
result = u.search(query, frmt="tab", columns="id")
|
||||
|
||||
lines = result.strip().split("\n")
|
||||
if len(lines) < 2:
|
||||
return None
|
||||
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
|
||||
# Map to multiple databases
|
||||
ids = {
|
||||
'uniprot': uniprot_id,
|
||||
'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
|
||||
'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
|
||||
'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
|
||||
'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
|
||||
}
|
||||
|
||||
return ids
|
||||
|
||||
# Usage
|
||||
ids = gene_symbol_to_ids("ZAP70")
|
||||
print(ids)
|
||||
```
|
||||
|
||||
### Pattern 2: Compound Name → All Database IDs
|
||||
|
||||
```python
|
||||
from bioservices import KEGG, UniChem, ChEBI
|
||||
|
||||
def compound_name_to_ids(compound_name):
|
||||
"""Search compound and get all database IDs."""
|
||||
k = KEGG()
|
||||
|
||||
# Search KEGG
|
||||
results = k.find("compound", compound_name)
|
||||
if not results:
|
||||
return None
|
||||
|
||||
# Extract KEGG ID
|
||||
kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")
|
||||
|
||||
# Get KEGG entry for ChEBI
|
||||
entry = k.get(f"cpd:{kegg_id}")
|
||||
chebi_id = None
|
||||
for line in entry.split("\n"):
|
||||
if "ChEBI:" in line:
|
||||
parts = line.split("ChEBI:")
|
||||
if len(parts) > 1:
|
||||
chebi_id = parts[1].strip().split()[0]
|
||||
break
|
||||
|
||||
# Get ChEMBL from UniChem
|
||||
u = UniChem()
|
||||
try:
|
||||
chembl_id = u.get_compound_id_from_kegg(kegg_id)
|
||||
except:
|
||||
chembl_id = None
|
||||
|
||||
return {
|
||||
'kegg': kegg_id,
|
||||
'chebi': chebi_id,
|
||||
'chembl': chembl_id
|
||||
}
|
||||
|
||||
# Usage
|
||||
ids = compound_name_to_ids("Geldanamycin")
|
||||
print(ids)
|
||||
```
|
||||
|
||||
### Pattern 3: Batch ID Conversion with Error Handling
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
|
||||
"""Safely map IDs with error handling and chunking."""
|
||||
u = UniProt()
|
||||
all_results = {}
|
||||
|
||||
for i in range(0, len(ids), chunk_size):
|
||||
chunk = ids[i:i+chunk_size]
|
||||
query = ",".join(chunk)
|
||||
|
||||
try:
|
||||
results = u.mapping(fr=from_db, to=to_db, query=query)
|
||||
all_results.update(results)
|
||||
print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error at chunk {i}: {e}")
|
||||
|
||||
# Try individual IDs in failed chunk
|
||||
for single_id in chunk:
|
||||
try:
|
||||
result = u.mapping(fr=from_db, to=to_db, query=single_id)
|
||||
all_results.update(result)
|
||||
except:
|
||||
all_results[single_id] = None
|
||||
|
||||
return all_results
|
||||
|
||||
# Usage
|
||||
uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
|
||||
mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")
|
||||
```
|
||||
|
||||
### Pattern 4: Multi-Hop Mapping
|
||||
|
||||
Sometimes you need to map through intermediate databases:
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
def multi_hop_mapping(gene_symbol, organism="9606"):
|
||||
"""Gene symbol → UniProt → KEGG → Pathways."""
|
||||
u = UniProt()
|
||||
k = KEGG()
|
||||
|
||||
# Step 1: Gene symbol → UniProt
|
||||
query = f"gene:{gene_symbol} AND organism:{organism}"
|
||||
result = u.search(query, frmt="tab", columns="id")
|
||||
|
||||
lines = result.strip().split("\n")
|
||||
if len(lines) < 2:
|
||||
return None
|
||||
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
|
||||
# Step 2: UniProt → KEGG
|
||||
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
if not kegg_mapping or uniprot_id not in kegg_mapping:
|
||||
return None
|
||||
|
||||
kegg_id = kegg_mapping[uniprot_id][0]
|
||||
|
||||
# Step 3: KEGG → Pathways
|
||||
organism_code, gene_id = kegg_id.split(":")
|
||||
pathways = k.get_pathway_by_gene(gene_id, organism_code)
|
||||
|
||||
return {
|
||||
'gene': gene_symbol,
|
||||
'uniprot': uniprot_id,
|
||||
'kegg': kegg_id,
|
||||
'pathways': pathways
|
||||
}
|
||||
|
||||
# Usage
|
||||
result = multi_hop_mapping("TP53")
|
||||
print(result)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue 1: No Mapping Found
|
||||
|
||||
**Symptom:** Mapping returns empty or None
|
||||
|
||||
**Solutions:**
|
||||
1. Verify source ID exists in source database
|
||||
2. Check database code spelling
|
||||
3. Try reverse mapping
|
||||
4. Some IDs may not have mappings in all databases
|
||||
|
||||
```python
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
|
||||
|
||||
if not result or 'P43403' not in result:
|
||||
print("No mapping found. Try:")
|
||||
print("1. Verify ID exists: u.search('P43403')")
|
||||
print("2. Check if protein has KEGG annotation")
|
||||
```
|
||||
|
||||
### Issue 2: Too Many IDs in Batch
|
||||
|
||||
**Symptom:** Batch mapping fails or times out
|
||||
|
||||
**Solution:** Split into smaller chunks
|
||||
|
||||
```python
|
||||
def chunked_mapping(ids, from_db, to_db, chunk_size=50):
|
||||
all_results = {}
|
||||
|
||||
for i in range(0, len(ids), chunk_size):
|
||||
chunk = ids[i:i+chunk_size]
|
||||
result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
|
||||
all_results.update(result)
|
||||
|
||||
return all_results
|
||||
```
|
||||
|
||||
### Issue 3: Multiple Target IDs
|
||||
|
||||
**Symptom:** One source ID maps to multiple target IDs
|
||||
|
||||
**Solution:** Handle as list
|
||||
|
||||
```python
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
|
||||
# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}
|
||||
|
||||
pdb_ids = result['P04637']
|
||||
print(f"Found {len(pdb_ids)} PDB structures")
|
||||
|
||||
for pdb_id in pdb_ids:
|
||||
print(f" {pdb_id}")
|
||||
```
|
||||
|
||||
### Issue 4: Organism Ambiguity
|
||||
|
||||
**Symptom:** Gene symbol maps to multiple organisms
|
||||
|
||||
**Solution:** Always specify organism in searches
|
||||
|
||||
```python
|
||||
# Bad: Ambiguous
|
||||
result = u.search("gene:TP53") # Many organisms have TP53
|
||||
|
||||
# Good: Specific
|
||||
result = u.search("gene:TP53 AND organism:9606") # Human only
|
||||
```
|
||||
|
||||
### Issue 5: Deprecated IDs
|
||||
|
||||
**Symptom:** Old database IDs don't map
|
||||
|
||||
**Solution:** Update to current IDs first
|
||||
|
||||
```python
|
||||
# Check if ID is current
|
||||
entry = u.retrieve("P43403", frmt="txt")
|
||||
|
||||
# Look for secondary accessions
|
||||
for line in entry.split("\n"):
|
||||
if line.startswith("AC"):
|
||||
print(line) # Shows primary and secondary accessions
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always validate inputs** before batch processing
|
||||
2. **Handle None/empty results** gracefully
|
||||
3. **Use chunking** for large ID lists (50-100 per chunk)
|
||||
4. **Cache results** for repeated queries
|
||||
5. **Specify organism** when possible to avoid ambiguity
|
||||
6. **Log failures** in batch processing for later retry
|
||||
7. **Add delays** between large batches to respect API limits
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def polite_batch_mapping(ids, from_db, to_db):
|
||||
"""Batch mapping with rate limiting."""
|
||||
results = {}
|
||||
|
||||
for i in range(0, len(ids), 50):
|
||||
chunk = ids[i:i+50]
|
||||
result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
|
||||
results.update(result)
|
||||
|
||||
time.sleep(0.5) # Be nice to the API
|
||||
|
||||
return results
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
For complete working examples, see:
|
||||
- `scripts/batch_id_converter.py`: Command-line batch conversion tool
|
||||
- `workflow_patterns.md`: Integration into larger workflows
|
||||
scientific-packages/bioservices/references/services_reference.md (new file, 634 lines)
@@ -0,0 +1,634 @@
|
||||
# BioServices: Complete Services Reference
|
||||
|
||||
This document provides a comprehensive reference for all major services available in BioServices, including key methods, parameters, and use cases.
|
||||
|
||||
## Protein & Gene Resources
|
||||
|
||||
### UniProt
|
||||
|
||||
Protein sequence and functional information database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
u = UniProt(verbose=False)
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
|
||||
- `search(query, frmt="tab", columns=None, limit=None, sort=None, compress=False, include=False, **kwargs)`
|
||||
- Search UniProt with flexible query syntax
|
||||
- `frmt`: "tab", "fasta", "xml", "rdf", "gff", "txt"
|
||||
- `columns`: Comma-separated list (e.g., "id,genes,organism,length")
|
||||
- Returns: String in requested format
|
||||
|
||||
- `retrieve(uniprot_id, frmt="txt")`
|
||||
- Retrieve specific UniProt entry
|
||||
- `frmt`: "txt", "fasta", "xml", "rdf", "gff"
|
||||
- Returns: Entry data in requested format
|
||||
|
||||
- `mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")`
|
||||
- Convert identifiers between databases
|
||||
- `fr`/`to`: Database identifiers (see identifier_mapping.md)
|
||||
- `query`: Single ID or comma-separated list
|
||||
- Returns: Dictionary mapping input to output IDs
|
||||
|
||||
- `searchUniProtId(pattern, columns="entry name,length,organism", limit=100)`
|
||||
- Convenience method for ID-based searches
|
||||
- Returns: Tab-separated values
|
||||
|
||||
**Common columns:** id, entry name, genes, organism, protein names, length, sequence, go-id, ec, pathway, interactor
|
||||
|
||||
**Use cases:**
|
||||
- Protein sequence retrieval for BLAST
|
||||
- Functional annotation lookup
|
||||
- Cross-database identifier mapping
|
||||
- Batch protein information retrieval
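
A minimal sketch chaining the methods listed above (assuming the UniProt web service is reachable; P43403/ZAP70 is just an illustrative query, and accepted format and column names can vary between bioservices releases):

```python
from bioservices import UniProt

u = UniProt(verbose=False)

# Search by gene and organism, returning selected columns as tab-separated text
hits = u.search("gene:ZAP70 AND organism:9606", frmt="tab",
                columns="id,genes,organism,length")
if hits:
    print(hits.splitlines()[0])   # header row

# Retrieve one entry as FASTA
fasta = u.retrieve("P43403", frmt="fasta")

# Map the same accession to KEGG gene identifiers
print(u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403"))
```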
|
||||
|
||||
---
|
||||
|
||||
### KEGG (Kyoto Encyclopedia of Genes and Genomes)
|
||||
|
||||
Metabolic pathways, genes, and organisms database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
k = KEGG()
|
||||
k.organism = "hsa" # Set default organism
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
|
||||
- `list(database)`
|
||||
- List entries in KEGG database
|
||||
- `database`: "organism", "pathway", "module", "disease", "drug", "compound"
|
||||
- Returns: Multi-line string with entries
|
||||
|
||||
- `find(database, query)`
|
||||
- Search database by keywords
|
||||
- Returns: List of matching entries with IDs
|
||||
|
||||
- `get(entry_id)`
|
||||
- Retrieve entry by ID
|
||||
- Supports genes, pathways, compounds, etc.
|
||||
- Returns: Raw entry text
|
||||
|
||||
- `parse(data)`
|
||||
- Parse KEGG entry into dictionary
|
||||
- Returns: Dict with structured data
|
||||
|
||||
- `lookfor_organism(name)`
|
||||
- Search organisms by name pattern
|
||||
- Returns: List of matching organism codes
|
||||
|
||||
- `lookfor_pathway(name)`
|
||||
- Search pathways by name
|
||||
- Returns: List of pathway IDs
|
||||
|
||||
- `get_pathway_by_gene(gene_id, organism)`
|
||||
- Find pathways containing gene
|
||||
- Returns: List of pathway IDs
|
||||
|
||||
- `parse_kgml_pathway(pathway_id)`
|
||||
- Parse pathway KGML for interactions
|
||||
- Returns: Dict with "entries" and "relations"
|
||||
|
||||
- `pathway2sif(pathway_id)`
|
||||
- Extract Simple Interaction Format data
|
||||
- Filters for activation/inhibition
|
||||
- Returns: List of interaction tuples
|
||||
|
||||
**Organism codes:**
|
||||
- hsa: Homo sapiens
|
||||
- mmu: Mus musculus
|
||||
- dme: Drosophila melanogaster
|
||||
- sce: Saccharomyces cerevisiae
|
||||
- eco: Escherichia coli
|
||||
|
||||
**Use cases:**
|
||||
- Pathway analysis and visualization
|
||||
- Gene function annotation
|
||||
- Metabolic network reconstruction
|
||||
- Protein-protein interaction extraction
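
A minimal sketch of these helpers, using pathway hsa04660 (T cell receptor signaling) and gene 7535 (ZAP70) as illustrative inputs; the parsed field names follow the raw KEGG entry format:

```python
from bioservices import KEGG

k = KEGG()
k.organism = "hsa"

# Keyword search, then retrieve and parse one entry
print(k.find("pathway", "T cell receptor"))
entry = k.get("hsa04660")                      # T cell receptor signaling pathway
parsed = k.parse(entry)
print(parsed.get("NAME"))

# Pathways containing a gene, and a simple interaction table
print(k.get_pathway_by_gene("7535", "hsa"))    # 7535 = ZAP70
print(k.pathway2sif("hsa04660")[:5])
```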
|
||||
|
||||
---
|
||||
|
||||
### HGNC (Human Gene Nomenclature Committee)
|
||||
|
||||
Official human gene naming authority.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import HGNC
|
||||
h = HGNC()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `search(query)`: Search gene symbols/names
|
||||
- `fetch(format, query)`: Retrieve gene information
|
||||
|
||||
**Use cases:**
|
||||
- Standardizing human gene names
|
||||
- Looking up official gene symbols
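
A minimal sketch of the two calls above; `"symbol"` as the fetched field and TP53 as the query are illustrative assumptions:

```python
from bioservices import HGNC

h = HGNC()

# Free-text search on symbols/names, then fetch the record for one official symbol
print(h.search("TP53"))
print(h.fetch("symbol", "TP53"))
```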
|
||||
|
||||
---
|
||||
|
||||
### MyGeneInfo
|
||||
|
||||
Gene annotation and query service.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import MyGeneInfo
|
||||
m = MyGeneInfo()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `querymany(ids, scopes, fields, species)`: Batch gene queries
|
||||
- `getgene(geneid)`: Get gene annotation
|
||||
|
||||
**Use cases:**
|
||||
- Batch gene annotation retrieval
|
||||
- Gene ID conversion
|
||||
|
||||
---
|
||||
|
||||
## Chemical Compound Resources
|
||||
|
||||
### ChEBI (Chemical Entities of Biological Interest)
|
||||
|
||||
Dictionary of molecular entities.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import ChEBI
|
||||
c = ChEBI()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `getCompleteEntity(chebi_id)`: Full compound information
|
||||
- `getLiteEntity(chebi_id)`: Basic information
|
||||
- `getCompleteEntityByList(chebi_ids)`: Batch retrieval
|
||||
|
||||
**Use cases:**
|
||||
- Small molecule information
|
||||
- Chemical structure data
|
||||
- Compound property lookup
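
A minimal sketch retrieving one record (CHEBI:5292, geldanamycin, which also appears in the compound workflow reference); attribute names follow the ChEBI web service response:

```python
from bioservices import ChEBI

c = ChEBI()
entity = c.getCompleteEntity("CHEBI:5292")

# Basic descriptors from the returned entity
print(entity.chebiAsciiName)
print(entity.Formulae)
```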
|
||||
|
||||
---
|
||||
|
||||
### ChEMBL
|
||||
|
||||
Bioactive drug-like compound database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import ChEMBL
|
||||
c = ChEMBL()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `get_compound_by_chemblId(chembl_id)`: Compound details
|
||||
- `get_target_by_chemblId(chembl_id)`: Target information
|
||||
- `get_assays()`: Bioassay data
|
||||
|
||||
**Use cases:**
|
||||
- Drug discovery data
|
||||
- Bioactivity information
|
||||
- Target-compound relationships
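
A minimal sketch retrieving one compound record (CHEMBL278315 is used as an example elsewhere in this reference); the dictionary keys mirror those shown in the workflow reference and may differ between bioservices releases:

```python
from bioservices import ChEMBL

c = ChEMBL()
compound = c.get_compound_by_chemblId("CHEMBL278315")

if compound and "molecule_properties" in compound:
    print(compound["molecule_properties"].get("full_mwt"))
```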
|
||||
|
||||
---
|
||||
|
||||
### UniChem
|
||||
|
||||
Chemical identifier mapping service.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import UniChem
|
||||
u = UniChem()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `get_compound_id_from_kegg(kegg_id)`: KEGG → ChEMBL
|
||||
- `get_all_compound_ids(src_compound_id, src_id)`: Get all IDs
|
||||
- `get_src_compound_ids(src_compound_id, from_src_id, to_src_id)`: Convert IDs
|
||||
|
||||
**Source IDs:**
|
||||
- 1: ChEMBL
|
||||
- 2: DrugBank
|
||||
- 3: PDB
|
||||
- 6: KEGG
|
||||
- 7: ChEBI
|
||||
- 22: PubChem
|
||||
|
||||
**Use cases:**
|
||||
- Cross-database compound ID mapping
|
||||
- Linking chemical databases
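
A minimal sketch cross-referencing one compound (C11222/CHEMBL278315, the geldanamycin example used later in this reference):

```python
from bioservices import UniChem

u = UniChem()

# KEGG C11222 -> ChEMBL
print(u.get_compound_id_from_kegg("C11222"))

# All identifiers known for a ChEMBL compound (source 1 = ChEMBL)
print(u.get_all_compound_ids("CHEMBL278315", 1))
```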
|
||||
|
||||
---
|
||||
|
||||
### PubChem
|
||||
|
||||
Chemical compound database from NIH.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import PubChem
|
||||
p = PubChem()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `get_compounds(identifier, namespace)`: Retrieve compounds
|
||||
- `get_properties(properties, identifier, namespace)`: Get properties
|
||||
|
||||
**Use cases:**
|
||||
- Chemical structure retrieval
|
||||
- Compound property information
|
||||
|
||||
---
|
||||
|
||||
## Sequence Analysis Tools
|
||||
|
||||
### NCBIblast
|
||||
|
||||
Sequence similarity searching.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import NCBIblast
|
||||
s = NCBIblast(verbose=False)
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `run(program, sequence, stype, database, email, **params)`
|
||||
- Submit BLAST job
|
||||
- `program`: "blastp", "blastn", "blastx", "tblastn", "tblastx"
|
||||
- `stype`: "protein" or "dna"
|
||||
- `database`: "uniprotkb", "pdb", "refseq_protein", etc.
|
||||
- `email`: Required by NCBI
|
||||
- Returns: Job ID
|
||||
|
||||
- `getStatus(jobid)`
|
||||
- Check job status
|
||||
- Returns: "RUNNING", "FINISHED", "ERROR"
|
||||
|
||||
- `getResult(jobid, result_type)`
|
||||
- Retrieve results
|
||||
- `result_type`: "out" (default), "ids", "xml"
|
||||
|
||||
**Important:** BLAST jobs are asynchronous. Always check status before retrieving results.
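
A minimal polling sketch of that pattern; the short peptide and the email address are placeholders only:

```python
import time
from bioservices import NCBIblast

s = NCBIblast(verbose=False)

# Submit a job; replace the placeholder sequence and email with real values
jobid = s.run(program="blastp", sequence="MDPAAHLPFFYGSISRAEAE",
              stype="protein", database="uniprotkb",
              email="your.email@example.com")

# Poll until the asynchronous job completes
status = s.getStatus(jobid)
while status == "RUNNING":
    time.sleep(5)
    status = s.getStatus(jobid)

if status == "FINISHED":
    print(s.getResult(jobid, "out")[:500])
```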
|
||||
|
||||
**Use cases:**
|
||||
- Protein homology searches
|
||||
- Sequence similarity analysis
|
||||
- Functional annotation by homology
|
||||
|
||||
---
|
||||
|
||||
## Pathway & Interaction Resources
|
||||
|
||||
### Reactome
|
||||
|
||||
Pathway database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import Reactome
|
||||
r = Reactome()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `get_pathway_by_id(pathway_id)`: Pathway details
|
||||
- `search_pathway(query)`: Search pathways
|
||||
|
||||
**Use cases:**
|
||||
- Human pathway analysis
|
||||
- Biological process annotation
|
||||
|
||||
---
|
||||
|
||||
### PSICQUIC
|
||||
|
||||
Protein interaction query service (federates 30+ databases).
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import PSICQUIC
|
||||
s = PSICQUIC()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `query(database, query_string)`
|
||||
- Query specific interaction database
|
||||
- Returns: PSI-MI TAB format
|
||||
|
||||
- `activeDBs`
|
||||
- Property listing available databases
|
||||
- Returns: List of database names
|
||||
|
||||
**Available databases:** MINT, IntAct, BioGRID, DIP, InnateDB, MatrixDB, MPIDB, UniProt, and 30+ more
|
||||
|
||||
**Query syntax:** Supports AND, OR, species filters
|
||||
- Example: "ZAP70 AND species:9606"
|
||||
|
||||
**Use cases:**
|
||||
- Protein-protein interaction discovery
|
||||
- Network analysis
|
||||
- Interactome mapping
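
A minimal sketch that lists active resources and queries one of them, parsing the result as tab-separated text in the same way as the workflow reference:

```python
from bioservices import PSICQUIC

p = PSICQUIC()
print(p.activeDBs[:5])          # a few currently responding databases

tab = p.query("intact", "ZAP70 AND species:9606")
if tab:
    for line in tab.strip().split("\n")[:3]:
        fields = line.split("\t")
        print(fields[0], fields[1])   # interactor A and B identifiers
```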
|
||||
|
||||
---
|
||||
|
||||
### IntactComplex
|
||||
|
||||
Protein complex database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import IntactComplex
|
||||
i = IntactComplex()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `search(query)`: Search complexes
|
||||
- `details(complex_ac)`: Complex details
|
||||
|
||||
**Use cases:**
|
||||
- Protein complex composition
|
||||
- Multi-protein assembly analysis
|
||||
|
||||
---
|
||||
|
||||
### OmniPath
|
||||
|
||||
Integrated signaling pathway database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import OmniPath
|
||||
o = OmniPath()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `interactions(datasets, organisms)`: Get interactions
|
||||
- `ptms(datasets, organisms)`: Post-translational modifications
|
||||
|
||||
**Use cases:**
|
||||
- Cell signaling analysis
|
||||
- Regulatory network mapping
|
||||
|
||||
---
|
||||
|
||||
## Gene Ontology
|
||||
|
||||
### QuickGO
|
||||
|
||||
Gene Ontology annotation service.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import QuickGO
|
||||
g = QuickGO()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `Term(go_id, frmt="obo")`
|
||||
- Retrieve GO term information
|
||||
- Returns: Term definition and metadata
|
||||
|
||||
- `Annotation(protein=None, goid=None, format="tsv")`
|
||||
- Get GO annotations
|
||||
- Returns: Annotations in requested format
|
||||
|
||||
**GO categories:**
|
||||
- Biological Process (BP)
|
||||
- Molecular Function (MF)
|
||||
- Cellular Component (CC)
|
||||
|
||||
**Use cases:**
|
||||
- Functional annotation
|
||||
- Enrichment analysis
|
||||
- GO term lookup
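
A minimal sketch of the two calls above (GO:0008150 is the biological_process root term; P43403 is an illustrative protein):

```python
from bioservices import QuickGO

g = QuickGO()

# One GO term definition
print(g.Term("GO:0008150", frmt="obo"))

# Annotations for one protein, as tab-separated values
annotations = g.Annotation(protein="P43403", format="tsv")
if annotations:
    print(annotations.splitlines()[:3])
```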
|
||||
|
||||
---
|
||||
|
||||
## Genomic Resources
|
||||
|
||||
### BioMart
|
||||
|
||||
Data mining tool for genomic data.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import BioMart
|
||||
b = BioMart()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `datasets(dataset)`: List available datasets
|
||||
- `attributes(dataset)`: List attributes
|
||||
- `query(query_xml)`: Execute BioMart query
|
||||
|
||||
**Use cases:**
|
||||
- Bulk genomic data retrieval
|
||||
- Custom genome annotations
|
||||
- SNP information
|
||||
|
||||
---
|
||||
|
||||
### ArrayExpress
|
||||
|
||||
Gene expression database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import ArrayExpress
|
||||
a = ArrayExpress()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `queryExperiments(keywords)`: Search experiments
|
||||
- `retrieveExperiment(accession)`: Get experiment data
|
||||
|
||||
**Use cases:**
|
||||
- Gene expression data
|
||||
- Microarray analysis
|
||||
- RNA-seq data retrieval
|
||||
|
||||
---
|
||||
|
||||
### ENA (European Nucleotide Archive)
|
||||
|
||||
Nucleotide sequence database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import ENA
|
||||
e = ENA()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `search_data(query)`: Search sequences
|
||||
- `retrieve_data(accession)`: Retrieve sequences
|
||||
|
||||
**Use cases:**
|
||||
- Nucleotide sequence retrieval
|
||||
- Genome assembly access
|
||||
|
||||
---
|
||||
|
||||
## Structural Biology
|
||||
|
||||
### PDB (Protein Data Bank)
|
||||
|
||||
3D protein structure database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import PDB
|
||||
p = PDB()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `get_file(pdb_id, file_format)`: Download structure files
|
||||
- `search(query)`: Search structures
|
||||
|
||||
**File formats:** pdb, cif, xml
|
||||
|
||||
**Use cases:**
|
||||
- 3D structure retrieval
|
||||
- Structure-based analysis
|
||||
- PyMOL visualization
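
A minimal sketch using `get_file` as listed above (1A1U is one of the structures listed for P04637 earlier in this document); the exact return type of the download may vary between bioservices releases:

```python
from bioservices import PDB

p = PDB()
data = p.get_file("1A1U", "pdb")

# Show the start of the downloaded structure record
print(str(data)[:200])
```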
|
||||
|
||||
---
|
||||
|
||||
### Pfam
|
||||
|
||||
Protein family database.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import Pfam
|
||||
p = Pfam()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `searchSequence(sequence)`: Find domains in sequence
|
||||
- `getPfamEntry(pfam_id)`: Domain information
|
||||
|
||||
**Use cases:**
|
||||
- Protein domain identification
|
||||
- Family classification
|
||||
- Functional motif discovery
|
||||
|
||||
---
|
||||
|
||||
## Specialized Resources
|
||||
|
||||
### BioModels
|
||||
|
||||
Systems biology model repository.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import BioModels
|
||||
b = BioModels()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `get_model_by_id(model_id)`: Retrieve SBML model
|
||||
|
||||
**Use cases:**
|
||||
- Systems biology modeling
|
||||
- SBML model retrieval
|
||||
|
||||
---
|
||||
|
||||
### COG (Clusters of Orthologous Genes)
|
||||
|
||||
Orthologous gene classification.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import COG
|
||||
c = COG()
|
||||
```
|
||||
|
||||
**Use cases:**
|
||||
- Orthology analysis
|
||||
- Functional classification
|
||||
|
||||
---
|
||||
|
||||
### BiGG Models
|
||||
|
||||
Metabolic network models.
|
||||
|
||||
**Initialization:**
|
||||
```python
|
||||
from bioservices import BiGG
|
||||
b = BiGG()
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `list_models()`: Available models
|
||||
- `get_model(model_id)`: Model details
|
||||
|
||||
**Use cases:**
|
||||
- Metabolic network analysis
|
||||
- Flux balance analysis
|
||||
|
||||
---
|
||||
|
||||
## General Patterns
|
||||
|
||||
### Error Handling
|
||||
|
||||
All services may throw exceptions. Wrap calls in try-except:
|
||||
|
||||
```python
|
||||
try:
|
||||
result = service.method(params)
|
||||
if result:
|
||||
# Process result
|
||||
pass
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
```
|
||||
|
||||
### Verbosity Control
|
||||
|
||||
Most services support `verbose` parameter:
|
||||
```python
|
||||
service = Service(verbose=False) # Suppress HTTP logs
|
||||
```
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
Services have timeouts and rate limits:
|
||||
```python
|
||||
service.TIMEOUT = 30 # Adjust timeout
|
||||
service.DELAY = 1 # Delay between requests (if supported)
|
||||
```
|
||||
|
||||
### Output Formats
|
||||
|
||||
Common format parameters:
|
||||
- `frmt`: "xml", "json", "tab", "txt", "fasta"
|
||||
- `format`: Service-specific variants
|
||||
|
||||
### Caching
|
||||
|
||||
Some services cache results:
|
||||
```python
|
||||
service.CACHE = True # Enable caching
|
||||
service.clear_cache() # Clear cache
|
||||
```
|
||||
|
||||
## Additional Resources
|
||||
|
||||
For detailed API documentation:
|
||||
- Official docs: https://bioservices.readthedocs.io/
|
||||
- Individual service docs linked from main page
|
||||
- Source code: https://github.com/cokelaer/bioservices
|
||||
811
scientific-packages/bioservices/references/workflow_patterns.md
Normal file
@@ -0,0 +1,811 @@
|
||||
# BioServices: Common Workflow Patterns
|
||||
|
||||
This document describes detailed multi-step workflows for common bioinformatics tasks using BioServices.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Complete Protein Analysis Pipeline](#complete-protein-analysis-pipeline)
|
||||
2. [Pathway Discovery and Network Analysis](#pathway-discovery-and-network-analysis)
|
||||
3. [Compound Multi-Database Search](#compound-multi-database-search)
|
||||
4. [Batch Identifier Conversion](#batch-identifier-conversion)
|
||||
5. [Gene Functional Annotation](#gene-functional-annotation)
|
||||
6. [Protein Interaction Network Construction](#protein-interaction-network-construction)
|
||||
7. [Multi-Organism Comparative Analysis](#multi-organism-comparative-analysis)
|
||||
|
||||
---
|
||||
|
||||
## Complete Protein Analysis Pipeline
|
||||
|
||||
**Goal:** Given a protein name, retrieve sequence, find homologs, identify pathways, and discover interactions.
|
||||
|
||||
**Example:** Analyzing human ZAP70 protein
|
||||
|
||||
### Step 1: UniProt Search and Identifier Retrieval
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt(verbose=False)
|
||||
|
||||
# Search for protein by name
|
||||
query = "ZAP70_HUMAN"
|
||||
results = u.search(query, frmt="tab", columns="id,genes,organism,length")
|
||||
|
||||
# Parse results
|
||||
lines = results.strip().split("\n")
|
||||
if len(lines) > 1:
|
||||
header = lines[0]
|
||||
data = lines[1].split("\t")
|
||||
uniprot_id = data[0] # e.g., P43403
|
||||
gene_names = data[1] # e.g., ZAP70
|
||||
|
||||
print(f"UniProt ID: {uniprot_id}")
|
||||
print(f"Gene names: {gene_names}")
|
||||
```
|
||||
|
||||
**Output:**
|
||||
- UniProt accession: P43403
|
||||
- Gene name: ZAP70
|
||||
|
||||
### Step 2: Sequence Retrieval
|
||||
|
||||
```python
|
||||
# Retrieve FASTA sequence
|
||||
sequence = u.retrieve(uniprot_id, frmt="fasta")
|
||||
print(sequence)
|
||||
|
||||
# Extract just the sequence string (remove header)
|
||||
seq_lines = sequence.split("\n")
|
||||
sequence_only = "".join(seq_lines[1:]) # Skip FASTA header
|
||||
```
|
||||
|
||||
**Output:** Complete protein sequence in FASTA format
|
||||
|
||||
### Step 3: BLAST Similarity Search
|
||||
|
||||
```python
|
||||
from bioservices import NCBIblast
|
||||
import time
|
||||
|
||||
s = NCBIblast(verbose=False)
|
||||
|
||||
# Submit BLAST job
|
||||
jobid = s.run(
|
||||
program="blastp",
|
||||
sequence=sequence_only,
|
||||
stype="protein",
|
||||
database="uniprotkb",
|
||||
email="your.email@example.com"
|
||||
)
|
||||
|
||||
print(f"BLAST Job ID: {jobid}")
|
||||
|
||||
# Wait for completion
|
||||
while True:
|
||||
status = s.getStatus(jobid)
|
||||
print(f"Status: {status}")
|
||||
if status == "FINISHED":
|
||||
break
|
||||
elif status == "ERROR":
|
||||
print("BLAST job failed")
|
||||
break
|
||||
time.sleep(5)
|
||||
|
||||
# Retrieve results
|
||||
if status == "FINISHED":
|
||||
blast_results = s.getResult(jobid, "out")
|
||||
print(blast_results[:500]) # Print first 500 characters
|
||||
```
|
||||
|
||||
**Output:** BLAST alignment results showing similar proteins
|
||||
|
||||
### Step 4: KEGG Pathway Discovery
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
|
||||
# Get KEGG gene ID from UniProt mapping
|
||||
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
print(f"KEGG mapping: {kegg_mapping}")
|
||||
|
||||
# Extract KEGG gene ID (e.g., hsa:7535)
|
||||
if kegg_mapping:
|
||||
kegg_gene_id = kegg_mapping[uniprot_id][0] if uniprot_id in kegg_mapping else None
|
||||
|
||||
if kegg_gene_id:
|
||||
# Find pathways containing this gene
|
||||
organism = kegg_gene_id.split(":")[0] # e.g., "hsa"
|
||||
gene_id = kegg_gene_id.split(":")[1] # e.g., "7535"
|
||||
|
||||
pathways = k.get_pathway_by_gene(gene_id, organism)
|
||||
print(f"Found {len(pathways)} pathways:")
|
||||
|
||||
# Get pathway names
|
||||
for pathway_id in pathways:
|
||||
pathway_info = k.get(pathway_id)
|
||||
# Parse NAME line
|
||||
for line in pathway_info.split("\n"):
|
||||
if line.startswith("NAME"):
|
||||
pathway_name = line.replace("NAME", "").strip()
|
||||
print(f" {pathway_id}: {pathway_name}")
|
||||
break
|
||||
```
|
||||
|
||||
**Output:**
|
||||
- path:hsa04064 - NF-kappa B signaling pathway
|
||||
- path:hsa04650 - Natural killer cell mediated cytotoxicity
|
||||
- path:hsa04660 - T cell receptor signaling pathway
|
||||
- path:hsa04662 - B cell receptor signaling pathway
|
||||
|
||||
### Step 5: Protein-Protein Interactions
|
||||
|
||||
```python
|
||||
from bioservices import PSICQUIC
|
||||
|
||||
p = PSICQUIC()
|
||||
|
||||
# Query MINT database for human (taxid:9606) interactions
|
||||
query = f"ZAP70 AND species:9606"
|
||||
interactions = p.query("mint", query)
|
||||
|
||||
# Parse PSI-MI TAB format results
|
||||
if interactions:
|
||||
interaction_lines = interactions.strip().split("\n")
|
||||
print(f"Found {len(interaction_lines)} interactions")
|
||||
|
||||
# Print first few interactions
|
||||
for line in interaction_lines[:5]:
|
||||
fields = line.split("\t")
|
||||
protein_a = fields[0]
|
||||
protein_b = fields[1]
|
||||
interaction_type = fields[11]
|
||||
print(f" {protein_a} - {protein_b}: {interaction_type}")
|
||||
```
|
||||
|
||||
**Output:** List of proteins that interact with ZAP70
|
||||
|
||||
### Step 6: Gene Ontology Annotation
|
||||
|
||||
```python
|
||||
from bioservices import QuickGO
|
||||
|
||||
g = QuickGO()
|
||||
|
||||
# Get GO annotations for protein
|
||||
annotations = g.Annotation(protein=uniprot_id, format="tsv")
|
||||
|
||||
if annotations:
|
||||
# Parse TSV results
|
||||
lines = annotations.strip().split("\n")
|
||||
print(f"Found {len(lines)-1} GO annotations")
|
||||
|
||||
# Display first few annotations
|
||||
for line in lines[1:6]: # Skip header
|
||||
fields = line.split("\t")
|
||||
go_id = fields[6]
|
||||
go_term = fields[7]
|
||||
go_aspect = fields[8]
|
||||
print(f" {go_id}: {go_term} [{go_aspect}]")
|
||||
```
|
||||
|
||||
**Output:** GO terms annotating ZAP70 function, process, and location
|
||||
|
||||
### Complete Pipeline Summary
|
||||
|
||||
**Inputs:** Protein name (e.g., "ZAP70_HUMAN")
|
||||
|
||||
**Outputs:**
|
||||
1. UniProt accession and gene name
|
||||
2. Protein sequence (FASTA)
|
||||
3. Similar proteins (BLAST results)
|
||||
4. Biological pathways (KEGG)
|
||||
5. Interaction partners (PSICQUIC)
|
||||
6. Functional annotations (GO terms)
|
||||
|
||||
**Script:** `scripts/protein_analysis_workflow.py` automates this entire pipeline.
|
||||
|
||||
---
|
||||
|
||||
## Pathway Discovery and Network Analysis
|
||||
|
||||
**Goal:** Analyze all pathways for an organism and extract protein interaction networks.
|
||||
|
||||
**Example:** Human (hsa) pathway analysis
|
||||
|
||||
### Step 1: Get All Pathways for Organism
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
k.organism = "hsa"
|
||||
|
||||
# Get all pathway IDs
|
||||
pathway_ids = k.pathwayIds
|
||||
print(f"Found {len(pathway_ids)} pathways for {k.organism}")
|
||||
|
||||
# Display first few
|
||||
for pid in pathway_ids[:10]:
|
||||
print(f" {pid}")
|
||||
```
|
||||
|
||||
**Output:** List of ~300 human pathways
|
||||
|
||||
### Step 2: Parse Pathway for Interactions
|
||||
|
||||
```python
|
||||
# Analyze specific pathway
|
||||
pathway_id = "hsa04660" # T cell receptor signaling
|
||||
|
||||
# Get KGML data
|
||||
kgml_data = k.parse_kgml_pathway(pathway_id)
|
||||
|
||||
# Extract entries (genes/proteins)
|
||||
entries = kgml_data['entries']
|
||||
print(f"Pathway contains {len(entries)} entries")
|
||||
|
||||
# Extract relations (interactions)
|
||||
relations = kgml_data['relations']
|
||||
print(f"Found {len(relations)} relations")
|
||||
|
||||
# Analyze relation types
|
||||
relation_types = {}
|
||||
for rel in relations:
|
||||
rel_type = rel.get('name', 'unknown')
|
||||
relation_types[rel_type] = relation_types.get(rel_type, 0) + 1
|
||||
|
||||
print("\nRelation type distribution:")
|
||||
for rel_type, count in sorted(relation_types.items()):
|
||||
print(f" {rel_type}: {count}")
|
||||
```
|
||||
|
||||
**Output:**
|
||||
- Entry count (genes/proteins in pathway)
|
||||
- Relation count (interactions)
|
||||
- Distribution of interaction types (activation, inhibition, binding, etc.)
|
||||
|
||||
### Step 3: Extract Protein-Protein Interactions
|
||||
|
||||
```python
|
||||
# Filter for specific interaction types
|
||||
pprel_interactions = [
|
||||
rel for rel in relations
|
||||
if rel.get('link') == 'PPrel' # Protein-protein relation
|
||||
]
|
||||
|
||||
print(f"Found {len(pprel_interactions)} protein-protein interactions")
|
||||
|
||||
# Extract interaction details
|
||||
for rel in pprel_interactions[:10]:
|
||||
entry1 = rel['entry1']
|
||||
entry2 = rel['entry2']
|
||||
interaction_type = rel.get('name', 'unknown')
|
||||
|
||||
print(f" {entry1} -> {entry2}: {interaction_type}")
|
||||
```
|
||||
|
||||
**Output:** Directed protein-protein interactions with types
|
||||
|
||||
### Step 4: Convert to Network Format (SIF)
|
||||
|
||||
```python
|
||||
# Get Simple Interaction Format (filters for key interactions)
|
||||
sif_data = k.pathway2sif(pathway_id)
|
||||
|
||||
# SIF format: source, interaction_type, target
|
||||
print("\nSimple Interaction Format:")
|
||||
for interaction in sif_data[:10]:
|
||||
print(f" {interaction}")
|
||||
```
|
||||
|
||||
**Output:** Network edges suitable for Cytoscape or NetworkX
|
||||
|
||||
### Step 5: Batch Analysis of All Pathways
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Analyze all pathways (this takes time!)
|
||||
all_results = []
|
||||
|
||||
for pathway_id in pathway_ids[:50]: # Limit for example
|
||||
try:
|
||||
kgml = k.parse_kgml_pathway(pathway_id)
|
||||
|
||||
result = {
|
||||
'pathway_id': pathway_id,
|
||||
'num_entries': len(kgml.get('entries', [])),
|
||||
'num_relations': len(kgml.get('relations', []))
|
||||
}
|
||||
|
||||
all_results.append(result)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error parsing {pathway_id}: {e}")
|
||||
|
||||
# Create DataFrame
|
||||
df = pd.DataFrame(all_results)
|
||||
print(df.describe())
|
||||
|
||||
# Find largest pathways
|
||||
print("\nLargest pathways:")
|
||||
print(df.nlargest(10, 'num_entries')[['pathway_id', 'num_entries', 'num_relations']])
|
||||
```
|
||||
|
||||
**Output:** Statistical summary of pathway sizes and interaction densities
|
||||
|
||||
**Script:** `scripts/pathway_analysis.py` implements this workflow with export options.
|
||||
|
||||
---
|
||||
|
||||
## Compound Multi-Database Search
|
||||
|
||||
**Goal:** Search for compound by name and retrieve identifiers across KEGG, ChEBI, and ChEMBL.
|
||||
|
||||
**Example:** Geldanamycin (antibiotic)
|
||||
|
||||
### Step 1: Search KEGG Compound Database
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
|
||||
# Search by compound name
|
||||
compound_name = "Geldanamycin"
|
||||
results = k.find("compound", compound_name)
|
||||
|
||||
print(f"KEGG search results for '{compound_name}':")
|
||||
print(results)
|
||||
|
||||
# Extract compound ID
|
||||
if results:
|
||||
lines = results.strip().split("\n")
|
||||
if lines:
|
||||
kegg_id = lines[0].split("\t")[0] # e.g., cpd:C11222
|
||||
kegg_id_clean = kegg_id.replace("cpd:", "") # C11222
|
||||
print(f"\nKEGG Compound ID: {kegg_id_clean}")
|
||||
```
|
||||
|
||||
**Output:** KEGG ID (e.g., C11222)
|
||||
|
||||
### Step 2: Get KEGG Entry with Database Links
|
||||
|
||||
```python
|
||||
# Retrieve compound entry
|
||||
compound_entry = k.get(kegg_id)
|
||||
|
||||
# Parse entry for database links
|
||||
chebi_id = None
|
||||
for line in compound_entry.split("\n"):
|
||||
if "ChEBI:" in line:
|
||||
# Extract ChEBI ID
|
||||
parts = line.split("ChEBI:")
|
||||
if len(parts) > 1:
|
||||
chebi_id = parts[1].strip().split()[0]
|
||||
print(f"ChEBI ID: {chebi_id}")
|
||||
break
|
||||
|
||||
# Display entry snippet
|
||||
print("\nKEGG Entry (first 500 chars):")
|
||||
print(compound_entry[:500])
|
||||
```
|
||||
|
||||
**Output:** ChEBI ID (e.g., 5292) and compound information
|
||||
|
||||
### Step 3: Cross-Reference to ChEMBL via UniChem
|
||||
|
||||
```python
|
||||
from bioservices import UniChem
|
||||
|
||||
u = UniChem()
|
||||
|
||||
# Convert KEGG → ChEMBL
|
||||
try:
|
||||
chembl_id = u.get_compound_id_from_kegg(kegg_id_clean)
|
||||
print(f"ChEMBL ID: {chembl_id}")
|
||||
except Exception as e:
|
||||
print(f"UniChem lookup failed: {e}")
|
||||
chembl_id = None
|
||||
```
|
||||
|
||||
**Output:** ChEMBL ID (e.g., CHEMBL278315)
|
||||
|
||||
### Step 4: Retrieve Detailed Information
|
||||
|
||||
```python
|
||||
# Get ChEBI information
|
||||
if chebi_id:
|
||||
from bioservices import ChEBI
|
||||
c = ChEBI()
|
||||
|
||||
try:
|
||||
chebi_entity = c.getCompleteEntity(f"CHEBI:{chebi_id}")
|
||||
print(f"\nChEBI Formula: {chebi_entity.Formulae}")
|
||||
print(f"ChEBI Name: {chebi_entity.chebiAsciiName}")
|
||||
except Exception as e:
|
||||
print(f"ChEBI lookup failed: {e}")
|
||||
|
||||
# Get ChEMBL information
|
||||
if chembl_id:
|
||||
from bioservices import ChEMBL
|
||||
chembl = ChEMBL()
|
||||
|
||||
try:
|
||||
chembl_compound = chembl.get_compound_by_chemblId(chembl_id)
|
||||
print(f"\nChEMBL Molecular Weight: {chembl_compound['molecule_properties']['full_mwt']}")
|
||||
print(f"ChEMBL SMILES: {chembl_compound['molecule_structures']['canonical_smiles']}")
|
||||
except Exception as e:
|
||||
print(f"ChEMBL lookup failed: {e}")
|
||||
```
|
||||
|
||||
**Output:** Chemical properties from multiple databases
|
||||
|
||||
### Complete Compound Workflow Summary
|
||||
|
||||
**Input:** Compound name (e.g., "Geldanamycin")
|
||||
|
||||
**Output:**
|
||||
- KEGG ID: C11222
|
||||
- ChEBI ID: 5292
|
||||
- ChEMBL ID: CHEMBL278315
|
||||
- Chemical formula
|
||||
- Molecular weight
|
||||
- SMILES structure
|
||||
|
||||
**Script:** `scripts/compound_cross_reference.py` automates this workflow.
|
||||
|
||||
---
|
||||
|
||||
## Batch Identifier Conversion
|
||||
|
||||
**Goal:** Convert multiple identifiers between databases efficiently.
|
||||
|
||||
### Batch UniProt → KEGG Mapping
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# List of UniProt IDs
|
||||
uniprot_ids = ["P43403", "P04637", "P53779", "Q9Y6K9"]
|
||||
|
||||
# Batch mapping (comma-separated)
|
||||
query_string = ",".join(uniprot_ids)
|
||||
results = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=query_string)
|
||||
|
||||
print("UniProt → KEGG mapping:")
|
||||
for uniprot_id, kegg_ids in results.items():
|
||||
print(f" {uniprot_id} → {kegg_ids}")
|
||||
```
|
||||
|
||||
**Output:** Dictionary mapping each UniProt ID to KEGG gene IDs
|
||||
|
||||
### Batch File Processing
|
||||
|
||||
```python
|
||||
import csv
|
||||
|
||||
# Read identifiers from file
|
||||
def read_ids_from_file(filename):
|
||||
with open(filename, 'r') as f:
|
||||
ids = [line.strip() for line in f if line.strip()]
|
||||
return ids
|
||||
|
||||
# Process in chunks (API limits)
|
||||
def batch_convert(ids, from_db, to_db, chunk_size=100):
|
||||
u = UniProt()
|
||||
all_results = {}
|
||||
|
||||
for i in range(0, len(ids), chunk_size):
|
||||
chunk = ids[i:i+chunk_size]
|
||||
query = ",".join(chunk)
|
||||
|
||||
try:
|
||||
results = u.mapping(fr=from_db, to=to_db, query=query)
|
||||
all_results.update(results)
|
||||
print(f"Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
|
||||
except Exception as e:
|
||||
print(f"Error processing chunk {i}: {e}")
|
||||
|
||||
return all_results
|
||||
|
||||
# Write results to CSV
|
||||
def write_mapping_to_csv(mapping, output_file):
|
||||
with open(output_file, 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
writer.writerow(['Source_ID', 'Target_IDs'])
|
||||
|
||||
for source_id, target_ids in mapping.items():
|
||||
target_str = ";".join(target_ids) if target_ids else "No mapping"
|
||||
writer.writerow([source_id, target_str])
|
||||
|
||||
# Example usage
|
||||
input_ids = read_ids_from_file("uniprot_ids.txt")
|
||||
mapping = batch_convert(input_ids, "UniProtKB_AC-ID", "KEGG", chunk_size=50)
|
||||
write_mapping_to_csv(mapping, "uniprot_to_kegg_mapping.csv")
|
||||
```
|
||||
|
||||
**Script:** `scripts/batch_id_converter.py` provides command-line batch conversion.
|
||||
|
||||
---
|
||||
|
||||
## Gene Functional Annotation
|
||||
|
||||
**Goal:** Retrieve comprehensive functional information for a gene.
|
||||
|
||||
### Workflow
|
||||
|
||||
```python
|
||||
from bioservices import UniProt, KEGG, QuickGO
|
||||
|
||||
# Gene of interest
|
||||
gene_symbol = "TP53"
|
||||
|
||||
# 1. Find UniProt entry
|
||||
u = UniProt()
|
||||
search_results = u.search(f"gene:{gene_symbol} AND organism:9606",
|
||||
frmt="tab",
|
||||
columns="id,genes,protein names")
|
||||
|
||||
# Extract UniProt ID
|
||||
lines = search_results.strip().split("\n")
|
||||
if len(lines) > 1:
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
protein_name = lines[1].split("\t")[2]
|
||||
print(f"Protein: {protein_name}")
|
||||
print(f"UniProt ID: {uniprot_id}")
|
||||
|
||||
# 2. Get KEGG pathways
|
||||
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
if uniprot_id in kegg_mapping:
|
||||
kegg_id = kegg_mapping[uniprot_id][0]
|
||||
|
||||
k = KEGG()
|
||||
organism, gene_id = kegg_id.split(":")
|
||||
pathways = k.get_pathway_by_gene(gene_id, organism)
|
||||
|
||||
print(f"\nPathways ({len(pathways)}):")
|
||||
for pathway_id in pathways[:5]:
|
||||
print(f" {pathway_id}")
|
||||
|
||||
# 3. Get GO annotations
|
||||
g = QuickGO()
|
||||
go_annotations = g.Annotation(protein=uniprot_id, format="tsv")
|
||||
|
||||
if go_annotations:
|
||||
lines = go_annotations.strip().split("\n")
|
||||
print(f"\nGO Annotations ({len(lines)-1} total):")
|
||||
|
||||
# Group by aspect
|
||||
aspects = {"P": [], "F": [], "C": []}
|
||||
for line in lines[1:]:
|
||||
fields = line.split("\t")
|
||||
go_aspect = fields[8] # P, F, or C
|
||||
go_term = fields[7]
|
||||
aspects[go_aspect].append(go_term)
|
||||
|
||||
print(f" Biological Process: {len(aspects['P'])} terms")
|
||||
print(f" Molecular Function: {len(aspects['F'])} terms")
|
||||
print(f" Cellular Component: {len(aspects['C'])} terms")
|
||||
|
||||
# 4. Get protein sequence features
|
||||
full_entry = u.retrieve(uniprot_id, frmt="txt")
|
||||
print("\nProtein Features:")
|
||||
for line in full_entry.split("\n"):
|
||||
if line.startswith("FT DOMAIN"):
|
||||
print(f" {line}")
|
||||
```
|
||||
|
||||
**Output:** Comprehensive annotation including name, pathways, GO terms, and features.
|
||||
|
||||
---
|
||||
|
||||
## Protein Interaction Network Construction
|
||||
|
||||
**Goal:** Build a protein-protein interaction network for a set of proteins.
|
||||
|
||||
### Workflow
|
||||
|
||||
```python
|
||||
from bioservices import PSICQUIC
|
||||
import networkx as nx
|
||||
|
||||
# Proteins of interest
|
||||
proteins = ["ZAP70", "LCK", "LAT", "SLP76", "PLCg1"]
|
||||
|
||||
# Initialize PSICQUIC
|
||||
p = PSICQUIC()
|
||||
|
||||
# Build network
|
||||
G = nx.Graph()
|
||||
|
||||
for protein in proteins:
|
||||
# Query for human interactions
|
||||
query = f"{protein} AND species:9606"
|
||||
|
||||
try:
|
||||
results = p.query("intact", query)
|
||||
|
||||
if results:
|
||||
lines = results.strip().split("\n")
|
||||
|
||||
for line in lines:
|
||||
fields = line.split("\t")
|
||||
# Extract protein names (simplified)
|
||||
protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
|
||||
protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]
|
||||
|
||||
# Add edge
|
||||
G.add_edge(protein_a, protein_b)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error querying {protein}: {e}")
|
||||
|
||||
print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
|
||||
|
||||
# Analyze network
|
||||
print("\nNode degrees:")
|
||||
for node in proteins:
|
||||
if node in G:
|
||||
print(f" {node}: {G.degree(node)} interactions")
|
||||
|
||||
# Export for visualization
|
||||
nx.write_gml(G, "protein_network.gml")
|
||||
print("\nNetwork exported to protein_network.gml")
|
||||
```
|
||||
|
||||
**Output:** NetworkX graph exported in GML format for Cytoscape visualization.
|
||||
|
||||
---
|
||||
|
||||
## Multi-Organism Comparative Analysis
|
||||
|
||||
**Goal:** Compare pathway or gene presence across multiple organisms.
|
||||
|
||||
### Workflow
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
|
||||
# Organisms to compare
|
||||
organisms = ["hsa", "mmu", "dme", "sce"] # Human, mouse, fly, yeast
|
||||
organism_names = {
|
||||
"hsa": "Human",
|
||||
"mmu": "Mouse",
|
||||
"dme": "Fly",
|
||||
"sce": "Yeast"
|
||||
}
|
||||
|
||||
# Pathway of interest
|
||||
pathway_name = "cell cycle"
|
||||
|
||||
print(f"Searching for '{pathway_name}' pathway across organisms:\n")
|
||||
|
||||
for org in organisms:
|
||||
k.organism = org
|
||||
|
||||
# Search pathways
|
||||
results = k.lookfor_pathway(pathway_name)
|
||||
|
||||
print(f"{organism_names[org]} ({org}):")
|
||||
if results:
|
||||
for pathway in results[:3]: # Show first 3
|
||||
print(f" {pathway}")
|
||||
else:
|
||||
print(" No matches found")
|
||||
print()
|
||||
```
|
||||
|
||||
**Output:** Pathway presence/absence across organisms.
|
||||
|
||||
---
|
||||
|
||||
## Best Practices for Workflows
|
||||
|
||||
### 1. Error Handling
|
||||
|
||||
Always wrap service calls:
|
||||
```python
|
||||
try:
|
||||
result = service.method(params)
|
||||
if result:
|
||||
# Process
|
||||
pass
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
```
|
||||
|
||||
### 2. Rate Limiting
|
||||
|
||||
Add delays for batch processing:
|
||||
```python
|
||||
import time
|
||||
|
||||
for item in items:
|
||||
result = service.query(item)
|
||||
time.sleep(0.5) # 500ms delay
|
||||
```
|
||||
|
||||
### 3. Result Validation
|
||||
|
||||
Check for empty or unexpected results:
|
||||
```python
|
||||
if result and len(result) > 0:
|
||||
# Process
|
||||
pass
|
||||
else:
|
||||
print("No results returned")
|
||||
```
|
||||
|
||||
### 4. Progress Reporting
|
||||
|
||||
For long workflows:
|
||||
```python
|
||||
total = len(items)
|
||||
for i, item in enumerate(items):
|
||||
# Process item
|
||||
if (i + 1) % 10 == 0:
|
||||
print(f"Processed {i+1}/{total}")
|
||||
```
|
||||
|
||||
### 5. Data Export
|
||||
|
||||
Save intermediate results:
|
||||
```python
|
||||
import json
|
||||
|
||||
with open("results.json", "w") as f:
|
||||
json.dump(results, f, indent=2)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
### BioPython Integration
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
from Bio import SeqIO
|
||||
from io import StringIO
|
||||
|
||||
u = UniProt()
|
||||
fasta_data = u.retrieve("P43403", "fasta")
|
||||
|
||||
# Parse with BioPython
|
||||
fasta_io = StringIO(fasta_data)
|
||||
record = SeqIO.read(fasta_io, "fasta")
|
||||
|
||||
print(f"Sequence length: {len(record.seq)}")
|
||||
print(f"Description: {record.description}")
|
||||
```
|
||||
|
||||
### Pandas Integration
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
import pandas as pd
|
||||
from io import StringIO
|
||||
|
||||
u = UniProt()
|
||||
results = u.search("zap70", frmt="tab", columns="id,genes,length,organism")
|
||||
|
||||
# Load into DataFrame
|
||||
df = pd.read_csv(StringIO(results), sep="\t")
|
||||
print(df.head())
|
||||
print(df.describe())
|
||||
```
|
||||
|
||||
### NetworkX Integration
|
||||
|
||||
See Protein Interaction Network Construction above.
|
||||
|
||||
---
|
||||
|
||||
For complete working examples, see the scripts in `scripts/` directory.
|
||||
347
scientific-packages/bioservices/scripts/batch_id_converter.py
Executable file
@@ -0,0 +1,347 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Batch Identifier Converter
|
||||
|
||||
This script converts multiple identifiers between biological databases
|
||||
using UniProt's mapping service. Supports batch processing with
|
||||
automatic chunking and error handling.
|
||||
|
||||
Usage:
|
||||
python batch_id_converter.py INPUT_FILE --from DB1 --to DB2 [options]
|
||||
|
||||
Examples:
|
||||
python batch_id_converter.py uniprot_ids.txt --from UniProtKB_AC-ID --to KEGG
|
||||
python batch_id_converter.py gene_ids.txt --from GeneID --to UniProtKB --output mapping.csv
|
||||
python batch_id_converter.py ids.txt --from UniProtKB_AC-ID --to Ensembl --chunk-size 50
|
||||
|
||||
Input file format:
|
||||
One identifier per line (plain text)
|
||||
|
||||
Common database codes:
|
||||
UniProtKB_AC-ID - UniProt accession/ID
|
||||
KEGG - KEGG gene IDs
|
||||
GeneID - NCBI Gene (Entrez) IDs
|
||||
Ensembl - Ensembl gene IDs
|
||||
Ensembl_Protein - Ensembl protein IDs
|
||||
RefSeq_Protein - RefSeq protein IDs
|
||||
PDB - Protein Data Bank IDs
|
||||
HGNC - Human gene symbols
|
||||
GO - Gene Ontology IDs
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
import csv
|
||||
import time
|
||||
from bioservices import UniProt
|
||||
|
||||
|
||||
# Common database code mappings
|
||||
DATABASE_CODES = {
|
||||
'uniprot': 'UniProtKB_AC-ID',
|
||||
'uniprotkb': 'UniProtKB_AC-ID',
|
||||
'kegg': 'KEGG',
|
||||
'geneid': 'GeneID',
|
||||
'entrez': 'GeneID',
|
||||
'ensembl': 'Ensembl',
|
||||
'ensembl_protein': 'Ensembl_Protein',
|
||||
'ensembl_transcript': 'Ensembl_Transcript',
|
||||
'refseq': 'RefSeq_Protein',
|
||||
'refseq_protein': 'RefSeq_Protein',
|
||||
'pdb': 'PDB',
|
||||
'hgnc': 'HGNC',
|
||||
'mgi': 'MGI',
|
||||
'go': 'GO',
|
||||
'pfam': 'Pfam',
|
||||
'interpro': 'InterPro',
|
||||
'reactome': 'Reactome',
|
||||
'string': 'STRING',
|
||||
'biogrid': 'BioGRID'
|
||||
}
|
||||
|
||||
|
||||
def normalize_database_code(code):
|
||||
"""Normalize database code to official format."""
|
||||
# Try exact match first
|
||||
if code in DATABASE_CODES.values():
|
||||
return code
|
||||
|
||||
# Try lowercase lookup
|
||||
lowercase = code.lower()
|
||||
if lowercase in DATABASE_CODES:
|
||||
return DATABASE_CODES[lowercase]
|
||||
|
||||
# Return as-is if not found (may still be valid)
|
||||
return code
|
||||
|
||||
|
||||
def read_ids_from_file(filename):
|
||||
"""Read identifiers from file (one per line)."""
|
||||
print(f"Reading identifiers from {filename}...")
|
||||
|
||||
ids = []
|
||||
with open(filename, 'r') as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#'):
|
||||
ids.append(line)
|
||||
|
||||
print(f"✓ Read {len(ids)} identifier(s)")
|
||||
|
||||
return ids
|
||||
|
||||
|
||||
def batch_convert(ids, from_db, to_db, chunk_size=100, delay=0.5):
|
||||
"""Convert IDs with automatic chunking and error handling."""
|
||||
print(f"\nConverting {len(ids)} IDs:")
|
||||
print(f" From: {from_db}")
|
||||
print(f" To: {to_db}")
|
||||
print(f" Chunk size: {chunk_size}")
|
||||
print()
|
||||
|
||||
u = UniProt(verbose=False)
|
||||
all_results = {}
|
||||
failed_ids = []
|
||||
|
||||
total_chunks = (len(ids) + chunk_size - 1) // chunk_size
|
||||
|
||||
for i in range(0, len(ids), chunk_size):
|
||||
chunk = ids[i:i+chunk_size]
|
||||
chunk_num = (i // chunk_size) + 1
|
||||
|
||||
query = ",".join(chunk)
|
||||
|
||||
try:
|
||||
print(f" [{chunk_num}/{total_chunks}] Processing {len(chunk)} IDs...", end=" ")
|
||||
|
||||
results = u.mapping(fr=from_db, to=to_db, query=query)
|
||||
|
||||
if results:
|
||||
all_results.update(results)
|
||||
mapped_count = len([v for v in results.values() if v])
|
||||
print(f"✓ Mapped: {mapped_count}/{len(chunk)}")
|
||||
else:
|
||||
print(f"✗ No mappings returned")
|
||||
failed_ids.extend(chunk)
|
||||
|
||||
# Rate limiting
|
||||
if delay > 0 and i + chunk_size < len(ids):
|
||||
time.sleep(delay)
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
|
||||
# Try individual IDs in failed chunk
|
||||
print(f" Retrying individual IDs...")
|
||||
for single_id in chunk:
|
||||
try:
|
||||
result = u.mapping(fr=from_db, to=to_db, query=single_id)
|
||||
if result:
|
||||
all_results.update(result)
|
||||
print(f" ✓ {single_id}")
|
||||
else:
|
||||
failed_ids.append(single_id)
|
||||
print(f" ✗ {single_id} - no mapping")
|
||||
except Exception as e2:
|
||||
failed_ids.append(single_id)
|
||||
print(f" ✗ {single_id} - {e2}")
|
||||
|
||||
time.sleep(0.2)
|
||||
|
||||
# Add missing IDs to results (mark as failed)
|
||||
for id_ in ids:
|
||||
if id_ not in all_results:
|
||||
all_results[id_] = None
|
||||
|
||||
print(f"\n✓ Conversion complete:")
|
||||
print(f" Total: {len(ids)}")
|
||||
print(f" Mapped: {len([v for v in all_results.values() if v])}")
|
||||
print(f" Failed: {len(failed_ids)}")
|
||||
|
||||
return all_results, failed_ids
|
||||
|
||||
|
||||
def save_mapping_csv(mapping, output_file, from_db, to_db):
|
||||
"""Save mapping results to CSV."""
|
||||
print(f"\nSaving results to {output_file}...")
|
||||
|
||||
with open(output_file, 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
|
||||
# Header
|
||||
writer.writerow(['Source_ID', 'Source_DB', 'Target_IDs', 'Target_DB', 'Mapping_Status'])
|
||||
|
||||
# Data
|
||||
for source_id, target_ids in sorted(mapping.items()):
|
||||
if target_ids:
|
||||
target_str = ";".join(target_ids)
|
||||
status = "Success"
|
||||
else:
|
||||
target_str = ""
|
||||
status = "Failed"
|
||||
|
||||
writer.writerow([source_id, from_db, target_str, to_db, status])
|
||||
|
||||
print(f"✓ Results saved")
|
||||
|
||||
|
||||
def save_failed_ids(failed_ids, output_file):
|
||||
"""Save failed IDs to file."""
|
||||
if not failed_ids:
|
||||
return
|
||||
|
||||
print(f"\nSaving failed IDs to {output_file}...")
|
||||
|
||||
with open(output_file, 'w') as f:
|
||||
for id_ in failed_ids:
|
||||
f.write(f"{id_}\n")
|
||||
|
||||
print(f"✓ Saved {len(failed_ids)} failed ID(s)")
|
||||
|
||||
|
||||
def print_mapping_summary(mapping, from_db, to_db):
|
||||
"""Print summary of mapping results."""
|
||||
print(f"\n{'='*70}")
|
||||
print("MAPPING SUMMARY")
|
||||
print(f"{'='*70}")
|
||||
|
||||
total = len(mapping)
|
||||
mapped = len([v for v in mapping.values() if v])
|
||||
failed = total - mapped
|
||||
|
||||
print(f"\nSource database: {from_db}")
|
||||
print(f"Target database: {to_db}")
|
||||
print(f"\nTotal identifiers: {total}")
|
||||
print(f"Successfully mapped: {mapped} ({mapped/total*100:.1f}%)")
|
||||
print(f"Failed to map: {failed} ({failed/total*100:.1f}%)")
|
||||
|
||||
# Show some examples
|
||||
if mapped > 0:
|
||||
print(f"\nExample mappings (first 5):")
|
||||
count = 0
|
||||
for source_id, target_ids in mapping.items():
|
||||
if target_ids:
|
||||
target_str = ", ".join(target_ids[:3])
|
||||
if len(target_ids) > 3:
|
||||
target_str += f" ... +{len(target_ids)-3} more"
|
||||
print(f" {source_id} → {target_str}")
|
||||
count += 1
|
||||
if count >= 5:
|
||||
break
|
||||
|
||||
# Show multiple mapping statistics
|
||||
multiple_mappings = [v for v in mapping.values() if v and len(v) > 1]
|
||||
if multiple_mappings:
|
||||
print(f"\nMultiple target mappings: {len(multiple_mappings)} ID(s)")
|
||||
print(f" (These source IDs map to multiple target IDs)")
|
||||
|
||||
print(f"{'='*70}")
|
||||
|
||||
|
||||
def list_common_databases():
|
||||
"""Print list of common database codes."""
|
||||
print("\nCommon Database Codes:")
|
||||
print("-" * 70)
|
||||
print(f"{'Alias':<20} {'Official Code':<30}")
|
||||
print("-" * 70)
|
||||
|
||||
for alias, code in sorted(DATABASE_CODES.items()):
|
||||
if alias != code.lower():
|
||||
print(f"{alias:<20} {code:<30}")
|
||||
|
||||
print("-" * 70)
|
||||
print("\nNote: Many other database codes are supported.")
|
||||
print("See UniProt documentation for complete list.")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main conversion workflow."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Batch convert biological identifiers between databases",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
python batch_id_converter.py uniprot_ids.txt --from UniProtKB_AC-ID --to KEGG
|
||||
python batch_id_converter.py ids.txt --from GeneID --to UniProtKB -o mapping.csv
|
||||
python batch_id_converter.py ids.txt --from uniprot --to ensembl --chunk-size 50
|
||||
|
||||
Common database codes:
|
||||
UniProtKB_AC-ID, KEGG, GeneID, Ensembl, Ensembl_Protein,
|
||||
RefSeq_Protein, PDB, HGNC, GO, Pfam, InterPro, Reactome
|
||||
|
||||
Use --list-databases to see all supported aliases.
|
||||
"""
|
||||
)
|
||||
parser.add_argument("input_file", help="Input file with IDs (one per line)")
|
||||
parser.add_argument("--from", dest="from_db", required=True,
|
||||
help="Source database code")
|
||||
parser.add_argument("--to", dest="to_db", required=True,
|
||||
help="Target database code")
|
||||
parser.add_argument("-o", "--output", default=None,
|
||||
help="Output CSV file (default: mapping_results.csv)")
|
||||
parser.add_argument("--chunk-size", type=int, default=100,
|
||||
help="Number of IDs per batch (default: 100)")
|
||||
parser.add_argument("--delay", type=float, default=0.5,
|
||||
help="Delay between batches in seconds (default: 0.5)")
|
||||
parser.add_argument("--save-failed", action="store_true",
|
||||
help="Save failed IDs to separate file")
|
||||
parser.add_argument("--list-databases", action="store_true",
|
||||
help="List common database codes and exit")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# List databases and exit
|
||||
if args.list_databases:
|
||||
list_common_databases()
|
||||
sys.exit(0)
|
||||
|
||||
print("=" * 70)
|
||||
print("BIOSERVICES: Batch Identifier Converter")
|
||||
print("=" * 70)
|
||||
|
||||
# Normalize database codes
|
||||
from_db = normalize_database_code(args.from_db)
|
||||
to_db = normalize_database_code(args.to_db)
|
||||
|
||||
if from_db != args.from_db:
|
||||
print(f"\nNote: Normalized '{args.from_db}' → '{from_db}'")
|
||||
if to_db != args.to_db:
|
||||
print(f"Note: Normalized '{args.to_db}' → '{to_db}'")
|
||||
|
||||
# Read input IDs
|
||||
try:
|
||||
ids = read_ids_from_file(args.input_file)
|
||||
except Exception as e:
|
||||
print(f"\n✗ Error reading input file: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
if not ids:
|
||||
print("\n✗ No IDs found in input file")
|
||||
sys.exit(1)
|
||||
|
||||
# Perform conversion
|
||||
mapping, failed_ids = batch_convert(
|
||||
ids,
|
||||
from_db,
|
||||
to_db,
|
||||
chunk_size=args.chunk_size,
|
||||
delay=args.delay
|
||||
)
|
||||
|
||||
# Print summary
|
||||
print_mapping_summary(mapping, from_db, to_db)
|
||||
|
||||
# Save results
|
||||
output_file = args.output or "mapping_results.csv"
|
||||
save_mapping_csv(mapping, output_file, from_db, to_db)
|
||||
|
||||
# Save failed IDs if requested
|
||||
if args.save_failed and failed_ids:
|
||||
failed_file = output_file.replace(".csv", "_failed.txt")
|
||||
save_failed_ids(failed_ids, failed_file)
|
||||
|
||||
print(f"\n✓ Done!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
378
scientific-packages/bioservices/scripts/compound_cross_reference.py
Executable file
@@ -0,0 +1,378 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Compound Cross-Database Search
|
||||
|
||||
This script searches for a compound by name and retrieves identifiers
|
||||
from multiple databases:
|
||||
- KEGG Compound
|
||||
- ChEBI
|
||||
- ChEMBL (via UniChem)
|
||||
- Basic compound properties
|
||||
|
||||
Usage:
|
||||
python compound_cross_reference.py COMPOUND_NAME [--output FILE]
|
||||
|
||||
Examples:
|
||||
python compound_cross_reference.py Geldanamycin
|
||||
python compound_cross_reference.py "Adenosine triphosphate"
|
||||
python compound_cross_reference.py Aspirin --output aspirin_info.txt
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
from bioservices import KEGG, UniChem, ChEBI, ChEMBL
|
||||
|
||||
|
||||
def search_kegg_compound(compound_name):
|
||||
"""Search KEGG for compound by name."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 1: KEGG Compound Search")
|
||||
print(f"{'='*70}")
|
||||
|
||||
k = KEGG()
|
||||
|
||||
print(f"Searching KEGG for: {compound_name}")
|
||||
|
||||
try:
|
||||
results = k.find("compound", compound_name)
|
||||
|
||||
if not results or not results.strip():
|
||||
print(f"✗ No results found in KEGG")
|
||||
return k, None
|
||||
|
||||
# Parse results
|
||||
lines = results.strip().split("\n")
|
||||
print(f"✓ Found {len(lines)} result(s):\n")
|
||||
|
||||
for i, line in enumerate(lines[:5], 1):
|
||||
parts = line.split("\t")
|
||||
kegg_id = parts[0]
|
||||
description = parts[1] if len(parts) > 1 else "No description"
|
||||
print(f" {i}. {kegg_id}: {description}")
|
||||
|
||||
# Use first result
|
||||
first_result = lines[0].split("\t")
|
||||
kegg_id = first_result[0].replace("cpd:", "")
|
||||
|
||||
print(f"\nUsing: {kegg_id}")
|
||||
|
||||
return k, kegg_id
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return k, None
|
||||
|
||||
|
||||
def get_kegg_info(kegg, kegg_id):
|
||||
"""Retrieve detailed KEGG compound information."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 2: KEGG Compound Details")
|
||||
print(f"{'='*70}")
|
||||
|
||||
try:
|
||||
print(f"Retrieving KEGG entry for {kegg_id}...")
|
||||
|
||||
entry = kegg.get(f"cpd:{kegg_id}")
|
||||
|
||||
if not entry:
|
||||
print("✗ Failed to retrieve entry")
|
||||
return None
|
||||
|
||||
# Parse entry
|
||||
compound_info = {
|
||||
'kegg_id': kegg_id,
|
||||
'name': None,
|
||||
'formula': None,
|
||||
'exact_mass': None,
|
||||
'mol_weight': None,
|
||||
'chebi_id': None,
|
||||
'pathways': []
|
||||
}
|
||||
|
||||
current_section = None
|
||||
|
||||
for line in entry.split("\n"):
|
||||
if line.startswith("NAME"):
|
||||
compound_info['name'] = line.replace("NAME", "").strip().rstrip(";")
|
||||
|
||||
elif line.startswith("FORMULA"):
|
||||
compound_info['formula'] = line.replace("FORMULA", "").strip()
|
||||
|
||||
elif line.startswith("EXACT_MASS"):
|
||||
compound_info['exact_mass'] = line.replace("EXACT_MASS", "").strip()
|
||||
|
||||
elif line.startswith("MOL_WEIGHT"):
|
||||
compound_info['mol_weight'] = line.replace("MOL_WEIGHT", "").strip()
|
||||
|
||||
elif "ChEBI:" in line:
|
||||
parts = line.split("ChEBI:")
|
||||
if len(parts) > 1:
|
||||
compound_info['chebi_id'] = parts[1].strip().split()[0]
|
||||
|
||||
elif line.startswith("PATHWAY"):
|
||||
current_section = "pathway"
|
||||
pathway = line.replace("PATHWAY", "").strip()
|
||||
if pathway:
|
||||
compound_info['pathways'].append(pathway)
|
||||
|
||||
elif current_section == "pathway" and line.startswith(" "):
|
||||
pathway = line.strip()
|
||||
if pathway:
|
||||
compound_info['pathways'].append(pathway)
|
||||
|
||||
elif line.startswith(" ") and not line.startswith(" "):
|
||||
current_section = None
|
||||
|
||||
# Display information
|
||||
print(f"\n✓ KEGG Compound Information:")
|
||||
print(f" ID: {compound_info['kegg_id']}")
|
||||
print(f" Name: {compound_info['name']}")
|
||||
print(f" Formula: {compound_info['formula']}")
|
||||
print(f" Exact Mass: {compound_info['exact_mass']}")
|
||||
print(f" Molecular Weight: {compound_info['mol_weight']}")
|
||||
|
||||
if compound_info['chebi_id']:
|
||||
print(f" ChEBI ID: {compound_info['chebi_id']}")
|
||||
|
||||
if compound_info['pathways']:
|
||||
print(f" Pathways: {len(compound_info['pathways'])} found")
|
||||
|
||||
return compound_info
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def get_chembl_id(kegg_id):
|
||||
"""Map KEGG ID to ChEMBL via UniChem."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 3: ChEMBL Mapping (via UniChem)")
|
||||
print(f"{'='*70}")
|
||||
|
||||
try:
|
||||
u = UniChem()
|
||||
|
||||
print(f"Mapping KEGG:{kegg_id} to ChEMBL...")
|
||||
|
||||
chembl_id = u.get_compound_id_from_kegg(kegg_id)
|
||||
|
||||
if chembl_id:
|
||||
print(f"✓ ChEMBL ID: {chembl_id}")
|
||||
return chembl_id
|
||||
else:
|
||||
print("✗ No ChEMBL mapping found")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def get_chebi_info(chebi_id):
|
||||
"""Retrieve ChEBI compound information."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 4: ChEBI Details")
|
||||
print(f"{'='*70}")
|
||||
|
||||
if not chebi_id:
|
||||
print("⊘ No ChEBI ID available")
|
||||
return None
|
||||
|
||||
try:
|
||||
c = ChEBI()
|
||||
|
||||
print(f"Retrieving ChEBI entry for {chebi_id}...")
|
||||
|
||||
# Ensure proper format
|
||||
if not chebi_id.startswith("CHEBI:"):
|
||||
chebi_id = f"CHEBI:{chebi_id}"
|
||||
|
||||
entity = c.getCompleteEntity(chebi_id)
|
||||
|
||||
if entity:
|
||||
print(f"\n✓ ChEBI Information:")
|
||||
print(f" ID: {entity.chebiId}")
|
||||
print(f" Name: {entity.chebiAsciiName}")
|
||||
|
||||
if hasattr(entity, 'Formulae') and entity.Formulae:
|
||||
print(f" Formula: {entity.Formulae}")
|
||||
|
||||
if hasattr(entity, 'mass') and entity.mass:
|
||||
print(f" Mass: {entity.mass}")
|
||||
|
||||
if hasattr(entity, 'charge') and entity.charge:
|
||||
print(f" Charge: {entity.charge}")
|
||||
|
||||
return {
|
||||
'chebi_id': entity.chebiId,
|
||||
'name': entity.chebiAsciiName,
|
||||
'formula': entity.Formulae if hasattr(entity, 'Formulae') else None,
|
||||
'mass': entity.mass if hasattr(entity, 'mass') else None
|
||||
}
|
||||
else:
|
||||
print("✗ Failed to retrieve ChEBI entry")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def get_chembl_info(chembl_id):
|
||||
"""Retrieve ChEMBL compound information."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 5: ChEMBL Details")
|
||||
print(f"{'='*70}")
|
||||
|
||||
if not chembl_id:
|
||||
print("⊘ No ChEMBL ID available")
|
||||
return None
|
||||
|
||||
try:
|
||||
c = ChEMBL()
|
||||
|
||||
print(f"Retrieving ChEMBL entry for {chembl_id}...")
|
||||
|
||||
compound = c.get_compound_by_chemblId(chembl_id)
|
||||
|
||||
if compound:
|
||||
print(f"\n✓ ChEMBL Information:")
|
||||
print(f" ID: {chembl_id}")
|
||||
|
||||
if 'pref_name' in compound and compound['pref_name']:
|
||||
print(f" Preferred Name: {compound['pref_name']}")
|
||||
|
||||
if 'molecule_properties' in compound:
|
||||
props = compound['molecule_properties']
|
||||
|
||||
if 'full_mwt' in props:
|
||||
print(f" Molecular Weight: {props['full_mwt']}")
|
||||
|
||||
if 'alogp' in props:
|
||||
print(f" LogP: {props['alogp']}")
|
||||
|
||||
if 'hba' in props:
|
||||
print(f" H-Bond Acceptors: {props['hba']}")
|
||||
|
||||
if 'hbd' in props:
|
||||
print(f" H-Bond Donors: {props['hbd']}")
|
||||
|
||||
if 'molecule_structures' in compound:
|
||||
structs = compound['molecule_structures']
|
||||
|
||||
if 'canonical_smiles' in structs:
|
||||
smiles = structs['canonical_smiles']
|
||||
print(f" SMILES: {smiles[:60]}{'...' if len(smiles) > 60 else ''}")
|
||||
|
||||
return compound
|
||||
else:
|
||||
print("✗ Failed to retrieve ChEMBL entry")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def save_results(compound_name, kegg_info, chembl_id, output_file):
|
||||
"""Save results to file."""
|
||||
print(f"\n{'='*70}")
|
||||
print(f"Saving results to {output_file}")
|
||||
print(f"{'='*70}")
|
||||
|
||||
with open(output_file, 'w') as f:
|
||||
f.write("=" * 70 + "\n")
|
||||
f.write(f"Compound Cross-Reference Report: {compound_name}\n")
|
||||
f.write("=" * 70 + "\n\n")
|
||||
|
||||
# KEGG information
|
||||
if kegg_info:
|
||||
f.write("KEGG Compound\n")
|
||||
f.write("-" * 70 + "\n")
|
||||
f.write(f"ID: {kegg_info['kegg_id']}\n")
|
||||
f.write(f"Name: {kegg_info['name']}\n")
|
||||
f.write(f"Formula: {kegg_info['formula']}\n")
|
||||
f.write(f"Exact Mass: {kegg_info['exact_mass']}\n")
|
||||
f.write(f"Molecular Weight: {kegg_info['mol_weight']}\n")
|
||||
f.write(f"Pathways: {len(kegg_info['pathways'])} found\n")
|
||||
f.write("\n")
|
||||
|
||||
# Database IDs
|
||||
f.write("Cross-Database Identifiers\n")
|
||||
f.write("-" * 70 + "\n")
|
||||
if kegg_info:
|
||||
f.write(f"KEGG: {kegg_info['kegg_id']}\n")
|
||||
if kegg_info['chebi_id']:
|
||||
f.write(f"ChEBI: {kegg_info['chebi_id']}\n")
|
||||
if chembl_id:
|
||||
f.write(f"ChEMBL: {chembl_id}\n")
|
||||
f.write("\n")
|
||||
|
||||
print(f"✓ Results saved")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main workflow."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Search compound across multiple databases",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
python compound_cross_reference.py Geldanamycin
|
||||
python compound_cross_reference.py "Adenosine triphosphate"
|
||||
python compound_cross_reference.py Aspirin --output aspirin_info.txt
|
||||
"""
|
||||
)
|
||||
parser.add_argument("compound", help="Compound name to search")
|
||||
parser.add_argument("--output", default=None,
|
||||
help="Output file for results (optional)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print("=" * 70)
|
||||
print("BIOSERVICES: Compound Cross-Database Search")
|
||||
print("=" * 70)
|
||||
|
||||
# Step 1: Search KEGG
|
||||
kegg, kegg_id = search_kegg_compound(args.compound)
|
||||
if not kegg_id:
|
||||
print("\n✗ Failed to find compound. Exiting.")
|
||||
sys.exit(1)
|
||||
|
||||
# Step 2: Get KEGG details
|
||||
kegg_info = get_kegg_info(kegg, kegg_id)
|
||||
|
||||
# Step 3: Map to ChEMBL
|
||||
chembl_id = get_chembl_id(kegg_id)
|
||||
|
||||
# Step 4: Get ChEBI details
|
||||
chebi_info = None
|
||||
if kegg_info and kegg_info['chebi_id']:
|
||||
chebi_info = get_chebi_info(kegg_info['chebi_id'])
|
||||
|
||||
# Step 5: Get ChEMBL details
|
||||
chembl_info = None
|
||||
if chembl_id:
|
||||
chembl_info = get_chembl_info(chembl_id)
|
||||
|
||||
# Summary
|
||||
print(f"\n{'='*70}")
|
||||
print("SUMMARY")
|
||||
print(f"{'='*70}")
|
||||
print(f" Compound: {args.compound}")
|
||||
if kegg_info:
|
||||
print(f" KEGG ID: {kegg_info['kegg_id']}")
|
||||
if kegg_info['chebi_id']:
|
||||
print(f" ChEBI ID: {kegg_info['chebi_id']}")
|
||||
if chembl_id:
|
||||
print(f" ChEMBL ID: {chembl_id}")
|
||||
print(f"{'='*70}")
|
||||
|
||||
# Save to file if requested
|
||||
if args.output:
|
||||
save_results(args.compound, kegg_info, chembl_id, args.output)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
309
scientific-packages/bioservices/scripts/pathway_analysis.py
Executable file
309
scientific-packages/bioservices/scripts/pathway_analysis.py
Executable file
@@ -0,0 +1,309 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
KEGG Pathway Network Analysis
|
||||
|
||||
This script analyzes all pathways for an organism and extracts:
|
||||
- Pathway sizes (number of genes)
|
||||
- Protein-protein interactions
|
||||
- Interaction type distributions
|
||||
- Network data in various formats (CSV, SIF)
|
||||
|
||||
Usage:
|
||||
python pathway_analysis.py ORGANISM OUTPUT_DIR [--limit N]
|
||||
|
||||
Examples:
|
||||
python pathway_analysis.py hsa ./human_pathways
|
||||
python pathway_analysis.py mmu ./mouse_pathways --limit 50
|
||||
|
||||
Organism codes:
|
||||
hsa = Homo sapiens (human)
|
||||
mmu = Mus musculus (mouse)
|
||||
dme = Drosophila melanogaster
|
||||
sce = Saccharomyces cerevisiae (yeast)
|
||||
eco = Escherichia coli
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import argparse
|
||||
import csv
|
||||
from collections import Counter
|
||||
from bioservices import KEGG
|
||||
|
||||
|
||||
def get_all_pathways(kegg, organism):
|
||||
"""Get all pathway IDs for organism."""
|
||||
print(f"\nRetrieving pathways for {organism}...")
|
||||
|
||||
kegg.organism = organism
|
||||
pathway_ids = kegg.pathwayIds
|
||||
|
||||
print(f"✓ Found {len(pathway_ids)} pathways")
|
||||
|
||||
return pathway_ids
|
||||
|
||||
|
||||
def analyze_pathway(kegg, pathway_id):
|
||||
"""Analyze single pathway for size and interactions."""
|
||||
try:
|
||||
# Parse KGML pathway
|
||||
kgml = kegg.parse_kgml_pathway(pathway_id)
|
||||
|
||||
entries = kgml.get('entries', [])
|
||||
relations = kgml.get('relations', [])
|
||||
|
||||
# Count relation types
|
||||
relation_types = Counter()
|
||||
for rel in relations:
|
||||
rel_type = rel.get('name', 'unknown')
|
||||
relation_types[rel_type] += 1
|
||||
|
||||
# Get pathway name
|
||||
try:
|
||||
entry = kegg.get(pathway_id)
|
||||
pathway_name = "Unknown"
|
||||
for line in entry.split("\n"):
|
||||
if line.startswith("NAME"):
|
||||
pathway_name = line.replace("NAME", "").strip()
|
||||
break
|
||||
except:
|
||||
pathway_name = "Unknown"
|
||||
|
||||
result = {
|
||||
'pathway_id': pathway_id,
|
||||
'pathway_name': pathway_name,
|
||||
'num_entries': len(entries),
|
||||
'num_relations': len(relations),
|
||||
'relation_types': dict(relation_types),
|
||||
'entries': entries,
|
||||
'relations': relations
|
||||
}
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
print(f" ✗ Error analyzing {pathway_id}: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def analyze_all_pathways(kegg, pathway_ids, limit=None):
|
||||
"""Analyze all pathways."""
|
||||
if limit:
|
||||
pathway_ids = pathway_ids[:limit]
|
||||
print(f"\n⚠ Limiting analysis to first {limit} pathways")
|
||||
|
||||
print(f"\nAnalyzing {len(pathway_ids)} pathways...")
|
||||
|
||||
results = []
|
||||
for i, pathway_id in enumerate(pathway_ids, 1):
|
||||
print(f" [{i}/{len(pathway_ids)}] {pathway_id}", end="\r")
|
||||
|
||||
result = analyze_pathway(kegg, pathway_id)
|
||||
if result:
|
||||
results.append(result)
|
||||
|
||||
print(f"\n✓ Successfully analyzed {len(results)}/{len(pathway_ids)} pathways")
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def save_pathway_summary(results, output_file):
|
||||
"""Save pathway summary to CSV."""
|
||||
print(f"\nSaving pathway summary to {output_file}...")
|
||||
|
||||
with open(output_file, 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
|
||||
# Header
|
||||
writer.writerow([
|
||||
'Pathway_ID',
|
||||
'Pathway_Name',
|
||||
'Num_Genes',
|
||||
'Num_Interactions',
|
||||
'Activation',
|
||||
'Inhibition',
|
||||
'Phosphorylation',
|
||||
'Binding',
|
||||
'Other'
|
||||
])
|
||||
|
||||
# Data
|
||||
for result in results:
|
||||
rel_types = result['relation_types']
|
||||
|
||||
writer.writerow([
|
||||
result['pathway_id'],
|
||||
result['pathway_name'],
|
||||
result['num_entries'],
|
||||
result['num_relations'],
|
||||
rel_types.get('activation', 0),
|
||||
rel_types.get('inhibition', 0),
|
||||
rel_types.get('phosphorylation', 0),
|
||||
rel_types.get('binding/association', 0),
|
||||
sum(v for k, v in rel_types.items()
|
||||
if k not in ['activation', 'inhibition', 'phosphorylation', 'binding/association'])
|
||||
])
|
||||
|
||||
print(f"✓ Summary saved")
|
||||
|
||||
|
||||
def save_interactions_sif(results, output_file):
|
||||
"""Save all interactions in SIF format."""
|
||||
print(f"\nSaving interactions to {output_file}...")
|
||||
|
||||
with open(output_file, 'w') as f:
|
||||
for result in results:
|
||||
pathway_id = result['pathway_id']
|
||||
|
||||
for rel in result['relations']:
|
||||
entry1 = rel.get('entry1', '')
|
||||
entry2 = rel.get('entry2', '')
|
||||
interaction_type = rel.get('name', 'interaction')
|
||||
|
||||
# Write SIF format: source\tinteraction\ttarget
|
||||
f.write(f"{entry1}\t{interaction_type}\t{entry2}\n")
|
||||
|
||||
print(f"✓ Interactions saved")
|
||||
|
||||
|
||||
def save_detailed_pathway_info(results, output_dir):
|
||||
"""Save detailed information for each pathway."""
|
||||
print(f"\nSaving detailed pathway files to {output_dir}/pathways/...")
|
||||
|
||||
pathway_dir = os.path.join(output_dir, "pathways")
|
||||
os.makedirs(pathway_dir, exist_ok=True)
|
||||
|
||||
for result in results:
|
||||
pathway_id = result['pathway_id'].replace(":", "_")
|
||||
filename = os.path.join(pathway_dir, f"{pathway_id}_interactions.csv")
|
||||
|
||||
with open(filename, 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
writer.writerow(['Source', 'Target', 'Interaction_Type', 'Link_Type'])
|
||||
|
||||
for rel in result['relations']:
|
||||
writer.writerow([
|
||||
rel.get('entry1', ''),
|
||||
rel.get('entry2', ''),
|
||||
rel.get('name', 'unknown'),
|
||||
rel.get('link', 'unknown')
|
||||
])
|
||||
|
||||
print(f"✓ Detailed files saved for {len(results)} pathways")
|
||||
|
||||
|
||||
def print_statistics(results):
|
||||
"""Print analysis statistics."""
|
||||
print(f"\n{'='*70}")
|
||||
print("PATHWAY ANALYSIS STATISTICS")
|
||||
print(f"{'='*70}")
|
||||
|
||||
# Total stats
|
||||
total_pathways = len(results)
|
||||
total_interactions = sum(r['num_relations'] for r in results)
|
||||
total_genes = sum(r['num_entries'] for r in results)
|
||||
|
||||
print(f"\nOverall:")
|
||||
print(f" Total pathways: {total_pathways}")
|
||||
print(f" Total genes/proteins: {total_genes}")
|
||||
print(f" Total interactions: {total_interactions}")
|
||||
|
||||
# Largest pathways
|
||||
print(f"\nLargest pathways (by gene count):")
|
||||
sorted_by_size = sorted(results, key=lambda x: x['num_entries'], reverse=True)
|
||||
for i, result in enumerate(sorted_by_size[:10], 1):
|
||||
print(f" {i}. {result['pathway_id']}: {result['num_entries']} genes")
|
||||
print(f" {result['pathway_name']}")
|
||||
|
||||
# Most connected pathways
|
||||
print(f"\nMost connected pathways (by interactions):")
|
||||
sorted_by_connections = sorted(results, key=lambda x: x['num_relations'], reverse=True)
|
||||
for i, result in enumerate(sorted_by_connections[:10], 1):
|
||||
print(f" {i}. {result['pathway_id']}: {result['num_relations']} interactions")
|
||||
print(f" {result['pathway_name']}")
|
||||
|
||||
# Interaction type distribution
|
||||
print(f"\nInteraction type distribution:")
|
||||
all_types = Counter()
|
||||
for result in results:
|
||||
for rel_type, count in result['relation_types'].items():
|
||||
all_types[rel_type] += count
|
||||
|
||||
for rel_type, count in all_types.most_common():
|
||||
percentage = (count / total_interactions) * 100 if total_interactions > 0 else 0
|
||||
print(f" {rel_type}: {count} ({percentage:.1f}%)")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main analysis workflow."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze KEGG pathways for an organism",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
python pathway_analysis.py hsa ./human_pathways
|
||||
python pathway_analysis.py mmu ./mouse_pathways --limit 50
|
||||
|
||||
Organism codes:
|
||||
hsa = Homo sapiens (human)
|
||||
mmu = Mus musculus (mouse)
|
||||
dme = Drosophila melanogaster
|
||||
sce = Saccharomyces cerevisiae (yeast)
|
||||
eco = Escherichia coli
|
||||
"""
|
||||
)
|
||||
parser.add_argument("organism", help="KEGG organism code (e.g., hsa, mmu)")
|
||||
parser.add_argument("output_dir", help="Output directory for results")
|
||||
parser.add_argument("--limit", type=int, default=None,
|
||||
help="Limit analysis to first N pathways")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print("=" * 70)
|
||||
print("BIOSERVICES: KEGG Pathway Network Analysis")
|
||||
print("=" * 70)
|
||||
|
||||
# Create output directory
|
||||
os.makedirs(args.output_dir, exist_ok=True)
|
||||
|
||||
# Initialize KEGG
|
||||
kegg = KEGG()
|
||||
|
||||
# Get all pathways
|
||||
pathway_ids = get_all_pathways(kegg, args.organism)
|
||||
|
||||
if not pathway_ids:
|
||||
print(f"\n✗ No pathways found for {args.organism}")
|
||||
sys.exit(1)
|
||||
|
||||
# Analyze pathways
|
||||
results = analyze_all_pathways(kegg, pathway_ids, args.limit)
|
||||
|
||||
if not results:
|
||||
print("\n✗ No pathways successfully analyzed")
|
||||
sys.exit(1)
|
||||
|
||||
# Print statistics
|
||||
print_statistics(results)
|
||||
|
||||
# Save results
|
||||
summary_file = os.path.join(args.output_dir, "pathway_summary.csv")
|
||||
save_pathway_summary(results, summary_file)
|
||||
|
||||
sif_file = os.path.join(args.output_dir, "all_interactions.sif")
|
||||
save_interactions_sif(results, sif_file)
|
||||
|
||||
save_detailed_pathway_info(results, args.output_dir)
|
||||
|
||||
# Final summary
|
||||
print(f"\n{'='*70}")
|
||||
print("OUTPUT FILES")
|
||||
print(f"{'='*70}")
|
||||
print(f" Summary: {summary_file}")
|
||||
print(f" Interactions: {sif_file}")
|
||||
print(f" Detailed: {args.output_dir}/pathways/")
|
||||
print(f"{'='*70}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
408
scientific-packages/bioservices/scripts/protein_analysis_workflow.py
Executable file
408
scientific-packages/bioservices/scripts/protein_analysis_workflow.py
Executable file
@@ -0,0 +1,408 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Complete Protein Analysis Workflow
|
||||
|
||||
This script performs a comprehensive protein analysis pipeline:
|
||||
1. UniProt search and identifier retrieval
|
||||
2. FASTA sequence retrieval
|
||||
3. BLAST similarity search
|
||||
4. KEGG pathway discovery
|
||||
5. PSICQUIC interaction mapping
|
||||
6. GO annotation retrieval
|
||||
|
||||
Usage:
|
||||
python protein_analysis_workflow.py PROTEIN_NAME EMAIL [--skip-blast]
|
||||
|
||||
Examples:
|
||||
python protein_analysis_workflow.py ZAP70_HUMAN user@example.com
|
||||
python protein_analysis_workflow.py P43403 user@example.com --skip-blast
|
||||
|
||||
Note: BLAST searches can take several minutes. Use --skip-blast to skip this step.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import time
|
||||
import argparse
|
||||
from bioservices import UniProt, KEGG, NCBIblast, PSICQUIC, QuickGO
|
||||
|
||||
|
||||
def search_protein(query):
|
||||
"""Search UniProt for protein and retrieve basic information."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 1: UniProt Search")
|
||||
print(f"{'='*70}")
|
||||
|
||||
u = UniProt(verbose=False)
|
||||
|
||||
print(f"Searching for: {query}")
|
||||
|
||||
# Try direct retrieval first (if query looks like accession)
|
||||
if len(query) == 6 and query[0] in "OPQ":
|
||||
try:
|
||||
entry = u.retrieve(query, frmt="tab")
|
||||
if entry:
|
||||
uniprot_id = query
|
||||
print(f"✓ Found UniProt entry: {uniprot_id}")
|
||||
return u, uniprot_id
|
||||
except:
|
||||
pass
|
||||
|
||||
# Otherwise search
|
||||
results = u.search(query, frmt="tab", columns="id,genes,organism,length,protein names", limit=5)
|
||||
|
||||
if not results:
|
||||
print("✗ No results found")
|
||||
return u, None
|
||||
|
||||
lines = results.strip().split("\n")
|
||||
if len(lines) < 2:
|
||||
print("✗ No entries found")
|
||||
return u, None
|
||||
|
||||
# Display results
|
||||
print(f"\n✓ Found {len(lines)-1} result(s):")
|
||||
for i, line in enumerate(lines[1:], 1):
|
||||
fields = line.split("\t")
|
||||
print(f" {i}. {fields[0]} - {fields[1]} ({fields[2]})")
|
||||
|
||||
# Use first result
|
||||
first_entry = lines[1].split("\t")
|
||||
uniprot_id = first_entry[0]
|
||||
gene_names = first_entry[1] if len(first_entry) > 1 else "N/A"
|
||||
organism = first_entry[2] if len(first_entry) > 2 else "N/A"
|
||||
length = first_entry[3] if len(first_entry) > 3 else "N/A"
|
||||
protein_name = first_entry[4] if len(first_entry) > 4 else "N/A"
|
||||
|
||||
print(f"\nUsing first result:")
|
||||
print(f" UniProt ID: {uniprot_id}")
|
||||
print(f" Gene names: {gene_names}")
|
||||
print(f" Organism: {organism}")
|
||||
print(f" Length: {length} aa")
|
||||
print(f" Protein: {protein_name}")
|
||||
|
||||
return u, uniprot_id
|
||||
|
||||
|
||||
def retrieve_sequence(uniprot, uniprot_id):
|
||||
"""Retrieve FASTA sequence for protein."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 2: FASTA Sequence Retrieval")
|
||||
print(f"{'='*70}")
|
||||
|
||||
try:
|
||||
sequence = uniprot.retrieve(uniprot_id, frmt="fasta")
|
||||
|
||||
if sequence:
|
||||
# Extract sequence only (remove header)
|
||||
lines = sequence.strip().split("\n")
|
||||
header = lines[0]
|
||||
seq_only = "".join(lines[1:])
|
||||
|
||||
print(f"✓ Retrieved sequence:")
|
||||
print(f" Header: {header}")
|
||||
print(f" Length: {len(seq_only)} residues")
|
||||
print(f" First 60 residues: {seq_only[:60]}...")
|
||||
|
||||
return seq_only
|
||||
else:
|
||||
print("✗ Failed to retrieve sequence")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def run_blast(sequence, email, skip=False):
|
||||
"""Run BLAST similarity search."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 3: BLAST Similarity Search")
|
||||
print(f"{'='*70}")
|
||||
|
||||
if skip:
|
||||
print("⊘ Skipped (--skip-blast flag)")
|
||||
return None
|
||||
|
||||
if not email or "@" not in email:
|
||||
print("⊘ Skipped (valid email required for BLAST)")
|
||||
return None
|
||||
|
||||
try:
|
||||
print(f"Submitting BLASTP job...")
|
||||
print(f" Database: uniprotkb")
|
||||
print(f" Sequence length: {len(sequence)} aa")
|
||||
|
||||
s = NCBIblast(verbose=False)
|
||||
|
||||
jobid = s.run(
|
||||
program="blastp",
|
||||
sequence=sequence,
|
||||
stype="protein",
|
||||
database="uniprotkb",
|
||||
email=email
|
||||
)
|
||||
|
||||
print(f"✓ Job submitted: {jobid}")
|
||||
print(f" Waiting for completion...")
|
||||
|
||||
# Poll for completion
|
||||
max_wait = 300 # 5 minutes
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < max_wait:
|
||||
status = s.getStatus(jobid)
|
||||
elapsed = int(time.time() - start_time)
|
||||
print(f" Status: {status} (elapsed: {elapsed}s)", end="\r")
|
||||
|
||||
if status == "FINISHED":
|
||||
print(f"\n✓ BLAST completed in {elapsed}s")
|
||||
|
||||
# Retrieve results
|
||||
results = s.getResult(jobid, "out")
|
||||
|
||||
# Parse and display summary
|
||||
lines = results.split("\n")
|
||||
print(f"\n Results preview:")
|
||||
for line in lines[:20]:
|
||||
if line.strip():
|
||||
print(f" {line}")
|
||||
|
||||
return results
|
||||
|
||||
elif status == "ERROR":
|
||||
print(f"\n✗ BLAST job failed")
|
||||
return None
|
||||
|
||||
time.sleep(5)
|
||||
|
||||
print(f"\n✗ Timeout after {max_wait}s")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def discover_pathways(uniprot, kegg, uniprot_id):
|
||||
"""Discover KEGG pathways for protein."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 4: KEGG Pathway Discovery")
|
||||
print(f"{'='*70}")
|
||||
|
||||
try:
|
||||
# Map UniProt → KEGG
|
||||
print(f"Mapping {uniprot_id} to KEGG...")
|
||||
kegg_mapping = uniprot.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
|
||||
if not kegg_mapping or uniprot_id not in kegg_mapping:
|
||||
print("✗ No KEGG mapping found")
|
||||
return []
|
||||
|
||||
kegg_ids = kegg_mapping[uniprot_id]
|
||||
print(f"✓ KEGG ID(s): {kegg_ids}")
|
||||
|
||||
# Get pathways for first KEGG ID
|
||||
kegg_id = kegg_ids[0]
|
||||
organism, gene_id = kegg_id.split(":")
|
||||
|
||||
print(f"\nSearching pathways for {kegg_id}...")
|
||||
pathways = kegg.get_pathway_by_gene(gene_id, organism)
|
||||
|
||||
if not pathways:
|
||||
print("✗ No pathways found")
|
||||
return []
|
||||
|
||||
print(f"✓ Found {len(pathways)} pathway(s):\n")
|
||||
|
||||
# Get pathway names
|
||||
pathway_info = []
|
||||
for pathway_id in pathways:
|
||||
try:
|
||||
entry = kegg.get(pathway_id)
|
||||
|
||||
# Extract pathway name
|
||||
pathway_name = "Unknown"
|
||||
for line in entry.split("\n"):
|
||||
if line.startswith("NAME"):
|
||||
pathway_name = line.replace("NAME", "").strip()
|
||||
break
|
||||
|
||||
pathway_info.append((pathway_id, pathway_name))
|
||||
print(f" • {pathway_id}: {pathway_name}")
|
||||
|
||||
except Exception as e:
|
||||
print(f" • {pathway_id}: [Error retrieving name]")
|
||||
|
||||
return pathway_info
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return []
|
||||
|
||||
|
||||
def find_interactions(protein_query):
|
||||
"""Find protein-protein interactions via PSICQUIC."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 5: Protein-Protein Interactions")
|
||||
print(f"{'='*70}")
|
||||
|
||||
try:
|
||||
p = PSICQUIC()
|
||||
|
||||
# Try querying MINT database
|
||||
query = f"{protein_query} AND species:9606"
|
||||
print(f"Querying MINT database...")
|
||||
print(f" Query: {query}")
|
||||
|
||||
results = p.query("mint", query)
|
||||
|
||||
if not results:
|
||||
print("✗ No interactions found in MINT")
|
||||
return []
|
||||
|
||||
# Parse PSI-MI TAB format
|
||||
lines = results.strip().split("\n")
|
||||
print(f"✓ Found {len(lines)} interaction(s):\n")
|
||||
|
||||
# Display first 10 interactions
|
||||
interactions = []
|
||||
for i, line in enumerate(lines[:10], 1):
|
||||
fields = line.split("\t")
|
||||
if len(fields) >= 12:
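                    # MITAB columns (0-indexed): fields[4]/fields[5] hold the alias
                    # identifiers for interactors A and B; fields[11] is the interaction type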
|
||||
protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
|
||||
protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]
|
||||
interaction_type = fields[11]
|
||||
|
||||
interactions.append((protein_a, protein_b, interaction_type))
|
||||
print(f" {i}. {protein_a} ↔ {protein_b}")
|
||||
|
||||
if len(lines) > 10:
|
||||
print(f" ... and {len(lines)-10} more")
|
||||
|
||||
return interactions
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return []
|
||||
|
||||
|
||||
def get_go_annotations(uniprot_id):
|
||||
"""Retrieve GO annotations."""
|
||||
print(f"\n{'='*70}")
|
||||
print("STEP 6: Gene Ontology Annotations")
|
||||
print(f"{'='*70}")
|
||||
|
||||
try:
|
||||
g = QuickGO()
|
||||
|
||||
print(f"Retrieving GO annotations for {uniprot_id}...")
|
||||
annotations = g.Annotation(protein=uniprot_id, format="tsv")
|
||||
|
||||
if not annotations:
|
||||
print("✗ No GO annotations found")
|
||||
return []
|
||||
|
||||
lines = annotations.strip().split("\n")
|
||||
print(f"✓ Found {len(lines)-1} annotation(s)\n")
|
||||
|
||||
# Group by aspect
|
||||
aspects = {"P": [], "F": [], "C": []}
|
||||
for line in lines[1:]:
|
||||
fields = line.split("\t")
|
||||
if len(fields) >= 9:
|
||||
go_id = fields[6]
|
||||
go_term = fields[7]
|
||||
go_aspect = fields[8]
|
||||
|
||||
if go_aspect in aspects:
|
||||
aspects[go_aspect].append((go_id, go_term))
|
||||
|
||||
# Display summary
|
||||
print(f" Biological Process (P): {len(aspects['P'])} terms")
|
||||
for go_id, go_term in aspects['P'][:5]:
|
||||
print(f" • {go_id}: {go_term}")
|
||||
if len(aspects['P']) > 5:
|
||||
print(f" ... and {len(aspects['P'])-5} more")
|
||||
|
||||
print(f"\n Molecular Function (F): {len(aspects['F'])} terms")
|
||||
for go_id, go_term in aspects['F'][:5]:
|
||||
print(f" • {go_id}: {go_term}")
|
||||
if len(aspects['F']) > 5:
|
||||
print(f" ... and {len(aspects['F'])-5} more")
|
||||
|
||||
print(f"\n Cellular Component (C): {len(aspects['C'])} terms")
|
||||
for go_id, go_term in aspects['C'][:5]:
|
||||
print(f" • {go_id}: {go_term}")
|
||||
if len(aspects['C']) > 5:
|
||||
print(f" ... and {len(aspects['C'])-5} more")
|
||||
|
||||
return aspects
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
return {}
|
||||
|
||||
|
||||
def main():
|
||||
"""Main workflow."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Complete protein analysis workflow using BioServices",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
python protein_analysis_workflow.py ZAP70_HUMAN user@example.com
|
||||
python protein_analysis_workflow.py P43403 user@example.com --skip-blast
|
||||
"""
|
||||
)
|
||||
parser.add_argument("protein", help="Protein name or UniProt ID")
|
||||
parser.add_argument("email", help="Email address (required for BLAST)")
|
||||
parser.add_argument("--skip-blast", action="store_true",
|
||||
help="Skip BLAST search (faster)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print("=" * 70)
|
||||
print("BIOSERVICES: Complete Protein Analysis Workflow")
|
||||
print("=" * 70)
|
||||
|
||||
# Step 1: Search protein
|
||||
uniprot, uniprot_id = search_protein(args.protein)
|
||||
if not uniprot_id:
|
||||
print("\n✗ Failed to find protein. Exiting.")
|
||||
sys.exit(1)
|
||||
|
||||
# Step 2: Retrieve sequence
|
||||
sequence = retrieve_sequence(uniprot, uniprot_id)
|
||||
if not sequence:
|
||||
print("\n⚠ Warning: Could not retrieve sequence")
|
||||
|
||||
# Step 3: BLAST search
|
||||
if sequence:
|
||||
blast_results = run_blast(sequence, args.email, args.skip_blast)
|
||||
|
||||
# Step 4: Pathway discovery
|
||||
kegg = KEGG()
|
||||
pathways = discover_pathways(uniprot, kegg, uniprot_id)
|
||||
|
||||
# Step 5: Interaction mapping
|
||||
interactions = find_interactions(args.protein)
|
||||
|
||||
# Step 6: GO annotations
|
||||
go_terms = get_go_annotations(uniprot_id)
|
||||
|
||||
# Summary
|
||||
print(f"\n{'='*70}")
|
||||
print("WORKFLOW SUMMARY")
|
||||
print(f"{'='*70}")
|
||||
print(f" Protein: {args.protein}")
|
||||
print(f" UniProt ID: {uniprot_id}")
|
||||
print(f" Sequence: {'✓' if sequence else '✗'}")
|
||||
print(f" BLAST: {'✓' if not args.skip_blast and sequence else '⊘'}")
|
||||
print(f" Pathways: {len(pathways)} found")
|
||||
print(f" Interactions: {len(interactions)} found")
|
||||
print(f" GO annotations: {sum(len(v) for v in go_terms.values())} found")
|
||||
print(f"{'='*70}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
505
scientific-packages/cellxgene-census/SKILL.md
Normal file
505
scientific-packages/cellxgene-census/SKILL.md
Normal file
@@ -0,0 +1,505 @@
|
||||
---
|
||||
name: cellxgene-census
|
||||
description: Access and analyze single-cell genomics data from the CZ CELLxGENE Census. This skill should be used when working with large-scale single-cell RNA-seq data, querying cell and gene metadata, training machine learning models on Census data, integrating multiple single-cell datasets, or performing cross-dataset analyses. It covers data exploration, expression queries, out-of-core processing, PyTorch integration, and scanpy workflows.
|
||||
---
|
||||
|
||||
# CZ CELLxGENE Census
|
||||
|
||||
## Overview
|
||||
|
||||
The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
|
||||
|
||||
The Census includes:
|
||||
- **61+ million cells** from human and mouse
|
||||
- **Standardized metadata** (cell types, tissues, diseases, donors)
|
||||
- **Raw gene expression** matrices
|
||||
- **Pre-calculated embeddings** and statistics
|
||||
- **Integration with PyTorch, scanpy, and other analysis tools**
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when tasks involve:
|
||||
- Querying single-cell expression data by cell type, tissue, or disease
|
||||
- Exploring available single-cell datasets and metadata
|
||||
- Training machine learning models on single-cell data
|
||||
- Performing large-scale cross-dataset analyses
|
||||
- Integrating Census data with scanpy or other analysis frameworks
|
||||
- Computing statistics across millions of cells
|
||||
- Accessing pre-calculated embeddings or model predictions
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
Install the Census API:
|
||||
```bash
|
||||
pip install cellxgene-census
|
||||
```
|
||||
|
||||
For machine learning workflows, install additional dependencies:
|
||||
```bash
|
||||
pip install cellxgene-census[experimental]
|
||||
```
|
||||
|
||||
## Core Workflow Patterns
|
||||
|
||||
### 1. Opening the Census
|
||||
|
||||
Always use the context manager to ensure proper resource cleanup:
|
||||
|
||||
```python
|
||||
import cellxgene_census
|
||||
|
||||
# Open latest stable version
|
||||
with cellxgene_census.open_soma() as census:
|
||||
    ...  # Work with census data
|
||||
|
||||
# Open specific version for reproducibility
|
||||
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
|
||||
    ...  # Work with census data
|
||||
```
|
||||
|
||||
**Key points:**
|
||||
- Use context manager (`with` statement) for automatic cleanup
|
||||
- Specify `census_version` for reproducible analyses
|
||||
- Default opens latest "stable" release
|
||||
|
||||
### 2. Exploring Census Information
|
||||
|
||||
Before querying expression data, explore available datasets and metadata.
|
||||
|
||||
**Access summary information:**
|
||||
```python
|
||||
# Get summary statistics
|
||||
summary = census["census_info"]["summary"].read().concat().to_pandas()
|
||||
print(f"Total cells: {summary['total_cell_count'][0]}")
|
||||
|
||||
# Get all datasets
|
||||
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
|
||||
|
||||
# Filter datasets by criteria
|
||||
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
|
||||
```
|
||||
|
||||
**Query cell metadata to understand available data:**
|
||||
```python
|
||||
# Get unique cell types in a tissue
|
||||
cell_metadata = cellxgene_census.get_obs(
|
||||
census,
|
||||
"homo_sapiens",
|
||||
value_filter="tissue_general == 'brain' and is_primary_data == True",
|
||||
column_names=["cell_type"]
|
||||
)
|
||||
unique_cell_types = cell_metadata["cell_type"].unique()
|
||||
print(f"Found {len(unique_cell_types)} cell types in brain")
|
||||
|
||||
# Count cells by cell type (only the cell_type column was requested above)
cell_type_counts = cell_metadata["cell_type"].value_counts()
|
||||
```
|
||||
|
||||
**Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells unless specifically analyzing duplicates.
|
||||
|
||||
### 3. Querying Expression Data (Small to Medium Scale)
|
||||
|
||||
For queries returning < 100k cells that fit in memory, use `get_anndata()`:
|
||||
|
||||
```python
|
||||
# Basic query with cell type and tissue filters
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens", # or "Mus musculus"
|
||||
obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
|
||||
obs_column_names=["assay", "disease", "sex", "donor_id"],
|
||||
)
|
||||
|
||||
# Query specific genes with multiple filters
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
|
||||
obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
|
||||
obs_column_names=["cell_type", "tissue_general", "donor_id"],
|
||||
)
|
||||
```
|
||||
|
||||
**Filter syntax:**
|
||||
- Use `obs_value_filter` for cell filtering
|
||||
- Use `var_value_filter` for gene filtering
|
||||
- Combine conditions with `and`, `or`
|
||||
- Use `in` for multiple values: `tissue in ['lung', 'liver']`
|
||||
- Select only needed columns with `obs_column_names`
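
For example, the operators above can be combined in one filter string (values are illustrative):

```python
# Illustrative combined filter: `in`, `==`, and `and` in a single expression
combined_filter = (
    "cell_type in ['B cell', 'T cell'] "
    "and tissue_general == 'lung' "
    "and is_primary_data == True"
)
```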
|
||||
|
||||
**Getting metadata separately:**
|
||||
```python
|
||||
# Query cell metadata
|
||||
cell_metadata = cellxgene_census.get_obs(
|
||||
census, "homo_sapiens",
|
||||
value_filter="disease == 'COVID-19' and is_primary_data == True",
|
||||
column_names=["cell_type", "tissue_general", "donor_id"]
|
||||
)
|
||||
|
||||
# Query gene metadata
|
||||
gene_metadata = cellxgene_census.get_var(
|
||||
census, "homo_sapiens",
|
||||
value_filter="feature_name in ['CD4', 'CD8A']",
|
||||
column_names=["feature_id", "feature_name", "feature_length"]
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Large-Scale Queries (Out-of-Core Processing)
|
||||
|
||||
For queries exceeding available RAM, use `axis_query()` with iterative processing:
|
||||
|
||||
```python
|
||||
import tiledbsoma as soma
|
||||
|
||||
# Create axis query
|
||||
query = census["census_data"]["homo_sapiens"].axis_query(
|
||||
measurement_name="RNA",
|
||||
obs_query=soma.AxisQuery(
|
||||
value_filter="tissue_general == 'brain' and is_primary_data == True"
|
||||
),
|
||||
var_query=soma.AxisQuery(
|
||||
value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
|
||||
)
|
||||
)
|
||||
|
||||
# Iterate through expression matrix in chunks
|
||||
iterator = query.X("raw").tables()
|
||||
for batch in iterator:
|
||||
# batch is a pyarrow.Table with columns:
|
||||
# - soma_data: expression value
|
||||
# - soma_dim_0: cell (obs) coordinate
|
||||
# - soma_dim_1: gene (var) coordinate
|
||||
process_batch(batch)
|
||||
```
|
||||
|
||||
**Computing incremental statistics:**
|
||||
```python
|
||||
# Example: Calculate mean expression
|
||||
n_observations = 0
|
||||
sum_values = 0.0
|
||||
|
||||
iterator = query.X("raw").tables()
|
||||
for batch in iterator:
|
||||
values = batch["soma_data"].to_numpy()
|
||||
n_observations += len(values)
|
||||
sum_values += values.sum()
|
||||
|
||||
mean_expression = sum_values / n_observations
|
||||
```
|
||||
|
||||
### 5. Machine Learning with PyTorch
|
||||
|
||||
For training models, use the experimental PyTorch integration:
|
||||
|
||||
```python
|
||||
from cellxgene_census.experimental.ml import experiment_dataloader
|
||||
|
||||
with cellxgene_census.open_soma() as census:
|
||||
# Create dataloader
|
||||
dataloader = experiment_dataloader(
|
||||
census["census_data"]["homo_sapiens"],
|
||||
measurement_name="RNA",
|
||||
X_name="raw",
|
||||
obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
|
||||
obs_column_names=["cell_type"],
|
||||
batch_size=128,
|
||||
shuffle=True,
|
||||
)
|
||||
|
||||
# Training loop
|
||||
for epoch in range(num_epochs):
|
||||
for batch in dataloader:
|
||||
X = batch["X"] # Gene expression tensor
|
||||
labels = batch["obs"]["cell_type"] # Cell type labels
|
||||
|
||||
# Forward pass
|
||||
outputs = model(X)
|
||||
loss = criterion(outputs, labels)
|
||||
|
||||
# Backward pass
|
||||
optimizer.zero_grad()
|
||||
loss.backward()
|
||||
optimizer.step()
|
||||
```
|
||||
|
||||
**Train/test splitting:**
|
||||
```python
|
||||
from cellxgene_census.experimental.ml import ExperimentDataset
|
||||
|
||||
# Create dataset from experiment
|
||||
dataset = ExperimentDataset(
|
||||
experiment_axis_query,
|
||||
layer_name="raw",
|
||||
obs_column_names=["cell_type"],
|
||||
batch_size=128,
|
||||
)
|
||||
|
||||
# Split into train and test
|
||||
train_dataset, test_dataset = dataset.random_split(
|
||||
split=[0.8, 0.2],
|
||||
seed=42
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Integration with Scanpy
|
||||
|
||||
Seamlessly integrate Census data with scanpy workflows:
|
||||
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# Load data from Census
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
|
||||
)
|
||||
|
||||
# Standard scanpy workflow
|
||||
sc.pp.normalize_total(adata, target_sum=1e4)
|
||||
sc.pp.log1p(adata)
|
||||
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
|
||||
|
||||
# Dimensionality reduction
|
||||
sc.pp.pca(adata, n_comps=50)
|
||||
sc.pp.neighbors(adata)
|
||||
sc.tl.umap(adata)
|
||||
|
||||
# Visualization
|
||||
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
|
||||
```
|
||||
|
||||
### 7. Multi-Dataset Integration
|
||||
|
||||
Query and integrate multiple datasets:
|
||||
|
||||
```python
|
||||
# Strategy 1: Query multiple tissues separately
|
||||
tissues = ["lung", "liver", "kidney"]
|
||||
adatas = []
|
||||
|
||||
for tissue in tissues:
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
|
||||
)
|
||||
adata.obs["tissue"] = tissue
|
||||
adatas.append(adata)
|
||||
|
||||
# Concatenate
|
||||
combined = adatas[0].concatenate(adatas[1:])
|
||||
|
||||
# Strategy 2: Query multiple datasets directly
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
|
||||
)
|
||||
```
|
||||
|
||||
## Key Concepts and Best Practices
|
||||
|
||||
### Always Filter for Primary Data
|
||||
Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
|
||||
```python
|
||||
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
|
||||
```
|
||||
|
||||
### Specify Census Version for Reproducibility
|
||||
Always specify the Census version in production analyses:
|
||||
```python
|
||||
census = cellxgene_census.open_soma(census_version="2023-07-25")
|
||||
```
|
||||
|
||||
### Estimate Query Size Before Loading
|
||||
For large queries, first check the number of cells to avoid memory issues:
|
||||
```python
|
||||
# Get cell count
|
||||
metadata = cellxgene_census.get_obs(
|
||||
census, "homo_sapiens",
|
||||
value_filter="tissue_general == 'brain' and is_primary_data == True",
|
||||
column_names=["soma_joinid"]
|
||||
)
|
||||
n_cells = len(metadata)
|
||||
print(f"Query will return {n_cells:,} cells")
|
||||
|
||||
# If too large (>100k), use out-of-core processing
|
||||
```
|
||||
|
||||
### Use tissue_general for Broader Groupings
|
||||
The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses:
|
||||
```python
|
||||
# Broader grouping
|
||||
obs_value_filter="tissue_general == 'immune system'"
|
||||
|
||||
# Specific tissue
|
||||
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
|
||||
```
|
||||
|
||||
### Select Only Needed Columns
|
||||
Minimize data transfer by specifying only required metadata columns:
|
||||
```python
|
||||
obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns
|
||||
```
|
||||
|
||||
### Check Dataset Presence for Gene-Specific Queries
|
||||
When analyzing specific genes, verify which datasets measured them:
|
||||
```python
|
||||
# Rows correspond to datasets, columns to genes (in var order)
presence = cellxgene_census.get_presence_matrix(census, "homo_sapiens", "RNA")
|
||||
```
|
||||
|
||||
### Two-Step Workflow: Explore Then Query
|
||||
First explore metadata to understand available data, then query expression:
|
||||
```python
|
||||
# Step 1: Explore what's available
|
||||
metadata = cellxgene_census.get_obs(
|
||||
census, "homo_sapiens",
|
||||
value_filter="disease == 'COVID-19' and is_primary_data == True",
|
||||
column_names=["cell_type", "tissue_general"]
|
||||
)
|
||||
print(metadata.value_counts())
|
||||
|
||||
# Step 2: Query based on findings
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
|
||||
)
|
||||
```
|
||||
|
||||
## Available Metadata Fields
|
||||
|
||||
### Cell Metadata (obs)
|
||||
Key fields for filtering:
|
||||
- `cell_type`, `cell_type_ontology_term_id`
|
||||
- `tissue`, `tissue_general`, `tissue_ontology_term_id`
|
||||
- `disease`, `disease_ontology_term_id`
|
||||
- `assay`, `assay_ontology_term_id`
|
||||
- `donor_id`, `sex`, `self_reported_ethnicity`
|
||||
- `development_stage`, `development_stage_ontology_term_id`
|
||||
- `dataset_id`
|
||||
- `is_primary_data` (Boolean: True = unique cell)
|
||||
|
||||
### Gene Metadata (var)
|
||||
- `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
|
||||
- `feature_name` (Gene symbol, e.g., "FOXP2")
|
||||
- `feature_length` (Gene length in base pairs)
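
A quick sketch combining several of these fields in one call (field names come from the lists above; the specific values are only examples):

```python
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['FOXP2']",
    obs_value_filter=(
        "sex == 'female' and disease == 'normal' "
        "and tissue_general == 'brain' and is_primary_data == True"
    ),
    obs_column_names=["cell_type", "donor_id", "development_stage", "assay"],
)
```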
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
This skill includes detailed reference documentation:
|
||||
|
||||
### references/census_schema.md
|
||||
Comprehensive documentation of:
|
||||
- Census data structure and organization
|
||||
- All available metadata fields
|
||||
- Value filter syntax and operators
|
||||
- SOMA object types
|
||||
- Data inclusion criteria
|
||||
|
||||
**When to read:** When you need detailed schema information, full list of metadata fields, or complex filter syntax.
|
||||
|
||||
### references/common_patterns.md
|
||||
Examples and patterns for:
|
||||
- Exploratory queries (metadata only)
|
||||
- Small-to-medium queries (AnnData)
|
||||
- Large queries (out-of-core processing)
|
||||
- PyTorch integration
|
||||
- Scanpy integration workflows
|
||||
- Multi-dataset integration
|
||||
- Best practices and common pitfalls
|
||||
|
||||
**When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues.
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Use Case 1: Explore Cell Types in a Tissue
|
||||
```python
|
||||
with cellxgene_census.open_soma() as census:
|
||||
cells = cellxgene_census.get_obs(
|
||||
census, "homo_sapiens",
|
||||
value_filter="tissue_general == 'lung' and is_primary_data == True",
|
||||
column_names=["cell_type"]
|
||||
)
|
||||
print(cells["cell_type"].value_counts())
|
||||
```
|
||||
|
||||
### Use Case 2: Query Marker Gene Expression
|
||||
```python
|
||||
with cellxgene_census.open_soma() as census:
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
|
||||
obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
|
||||
)
|
||||
```
|
||||
|
||||
### Use Case 3: Train Cell Type Classifier
|
||||
```python
|
||||
from cellxgene_census.experimental.ml import experiment_dataloader
|
||||
|
||||
with cellxgene_census.open_soma() as census:
|
||||
dataloader = experiment_dataloader(
|
||||
census["census_data"]["homo_sapiens"],
|
||||
measurement_name="RNA",
|
||||
X_name="raw",
|
||||
obs_value_filter="is_primary_data == True",
|
||||
obs_column_names=["cell_type"],
|
||||
batch_size=128,
|
||||
shuffle=True,
|
||||
)
|
||||
|
||||
# Train model
|
||||
for epoch in range(epochs):
|
||||
for batch in dataloader:
|
||||
# Training logic
|
||||
pass
|
||||
```
|
||||
|
||||
### Use Case 4: Cross-Tissue Analysis
|
||||
```python
|
||||
import scanpy as sc

with cellxgene_census.open_soma() as census:
|
||||
adata = cellxgene_census.get_anndata(
|
||||
census=census,
|
||||
organism="Homo sapiens",
|
||||
obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
|
||||
)
|
||||
|
||||
# Analyze macrophage differences across tissues
|
||||
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Query Returns Too Many Cells
|
||||
- Add more specific filters to reduce scope
|
||||
- Use `tissue` instead of `tissue_general` for finer granularity
|
||||
- Filter by specific `dataset_id` if known
|
||||
- Switch to out-of-core processing for large queries
|
||||
|
||||
### Memory Errors
|
||||
- Reduce query scope with more restrictive filters
|
||||
- Select fewer genes with `var_value_filter`
|
||||
- Use out-of-core processing with `axis_query()`
|
||||
- Process data in batches
|
||||
|
||||
### Duplicate Cells in Results
|
||||
- Always include `is_primary_data == True` in filters
|
||||
- Check if intentionally querying across multiple datasets
|
||||
|
||||
### Gene Not Found
|
||||
- Verify gene name spelling (case-sensitive)
|
||||
- Try Ensembl ID with `feature_id` instead of `feature_name`
|
||||
- Check dataset presence matrix to see if gene was measured
|
||||
- Some genes may have been filtered during Census construction
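
For example, if a symbol lookup returns nothing, query by the stable Ensembl ID instead (ENSG00000010610 is assumed here to be the ID for CD4):

```python
gene_meta = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_id == 'ENSG00000010610'",  # assumed Ensembl ID for CD4
    column_names=["feature_id", "feature_name"],
)
```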
|
||||
|
||||
### Version Inconsistencies
|
||||
- Always specify `census_version` explicitly
|
||||
- Use same version across all analyses
|
||||
- Check release notes for version-specific changes
|
||||
182
scientific-packages/cellxgene-census/references/census_schema.md
Normal file
182
scientific-packages/cellxgene-census/references/census_schema.md
Normal file
@@ -0,0 +1,182 @@
|
||||
# CZ CELLxGENE Census Data Schema Reference
|
||||
|
||||
## Overview
|
||||
|
||||
The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
|
||||
|
||||
## High-Level Structure
|
||||
|
||||
The Census is organized as a `SOMACollection` with two main components:
|
||||
|
||||
### 1. census_info
|
||||
Summary information including:
|
||||
- **summary**: Build date, cell counts, dataset statistics
|
||||
- **datasets**: All datasets from CELLxGENE Discover with metadata
|
||||
- **summary_cell_counts**: Cell counts stratified by metadata categories
|
||||
|
||||
### 2. census_data
|
||||
Organism-specific `SOMAExperiment` objects:
|
||||
- **"homo_sapiens"**: Human single-cell data
|
||||
- **"mus_musculus"**: Mouse single-cell data
|
||||
|
||||
## Data Structure Per Organism
|
||||
|
||||
Each organism experiment contains:
|
||||
|
||||
### obs (Cell Metadata)
|
||||
Cell-level annotations stored as a `SOMADataFrame`. Access via:
|
||||
```python
|
||||
census["census_data"]["homo_sapiens"].obs
|
||||
```
|
||||
|
||||
### ms["RNA"] (Measurement)
|
||||
RNA measurement data including:
|
||||
- **X**: Data matrices with layers:
|
||||
- `raw`: Raw count data
|
||||
- `normalized`: (if available) Normalized counts
|
||||
- **var**: Gene metadata
|
||||
- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset
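
For example, the gene table and the raw layer can be reached like this (a sketch; the `X["raw"]` object is a handle that is read lazily, not loaded into memory):

```python
rna = census["census_data"]["homo_sapiens"].ms["RNA"]

var_df = rna.var.read().concat().to_pandas()        # gene metadata as pandas
raw_X = rna.X["raw"]                                 # handle to the raw count matrix
presence = rna["feature_dataset_presence_matrix"]   # datasets x genes boolean array
```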
|
||||
|
||||
## Cell Metadata Fields (obs)
|
||||
|
||||
### Required/Core Fields
|
||||
|
||||
**Identity & Dataset:**
|
||||
- `soma_joinid`: Unique integer identifier for joins
|
||||
- `dataset_id`: Source dataset identifier
|
||||
- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)
|
||||
|
||||
**Cell Type:**
|
||||
- `cell_type`: Human-readable cell type name
|
||||
- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")
|
||||
|
||||
**Tissue:**
|
||||
- `tissue`: Specific tissue name
|
||||
- `tissue_general`: Broader tissue category (useful for grouping)
|
||||
- `tissue_ontology_term_id`: Standardized ontology term
|
||||
|
||||
**Assay:**
|
||||
- `assay`: Sequencing technology used
|
||||
- `assay_ontology_term_id`: Standardized ontology term
|
||||
|
||||
**Disease:**
|
||||
- `disease`: Disease status or condition
|
||||
- `disease_ontology_term_id`: Standardized ontology term
|
||||
|
||||
**Donor:**
|
||||
- `donor_id`: Unique donor identifier
|
||||
- `sex`: Biological sex (male, female, unknown)
|
||||
- `self_reported_ethnicity`: Ethnicity information
|
||||
- `development_stage`: Life stage (adult, child, embryonic, etc.)
|
||||
- `development_stage_ontology_term_id`: Standardized ontology term
|
||||
|
||||
**Organism:**
|
||||
- `organism`: Scientific name (Homo sapiens, Mus musculus)
|
||||
- `organism_ontology_term_id`: Standardized ontology term
|
||||
|
||||
**Technical:**
|
||||
- `suspension_type`: Sample preparation type (cell, nucleus, na)
|
||||
|
||||
## Gene Metadata Fields (var)
|
||||
|
||||
Access via:
|
||||
```python
|
||||
census["census_data"]["homo_sapiens"].ms["RNA"].var
|
||||
```
|
||||
|
||||
**Available Fields:**
|
||||
- `soma_joinid`: Unique integer identifier for joins
|
||||
- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
|
||||
- `feature_name`: Gene symbol (e.g., "FOXP2")
|
||||
- `feature_length`: Gene length in base pairs
|
||||
|
||||
## Value Filter Syntax
|
||||
|
||||
Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.
|
||||
|
||||
### Comparison Operators
|
||||
- `==`: Equal to
|
||||
- `!=`: Not equal to
|
||||
- `<`, `>`, `<=`, `>=`: Numeric comparisons
|
||||
- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)
|
||||
|
||||
### Logical Operators
|
||||
- `and`, `&`: Logical AND
|
||||
- `or`, `|`: Logical OR
|
||||
|
||||
### Examples
|
||||
|
||||
**Single condition:**
|
||||
```python
|
||||
value_filter="cell_type == 'B cell'"
|
||||
```
|
||||
|
||||
**Multiple conditions with AND:**
|
||||
```python
|
||||
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
|
||||
```
|
||||
|
||||
**Using IN for multiple values:**
|
||||
```python
|
||||
value_filter="tissue in ['lung', 'liver', 'kidney']"
|
||||
```
|
||||
|
||||
**Complex condition:**
|
||||
```python
|
||||
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
|
||||
```
|
||||
|
||||
**Filtering genes:**
|
||||
```python
|
||||
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
|
||||
```
|
||||
|
||||
## Data Inclusion Criteria
|
||||
|
||||
The Census includes all data from CZ CELLxGENE Discover meeting:
|
||||
|
||||
1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
|
||||
2. **Technology**: Approved sequencing technologies for RNA
|
||||
3. **Count Type**: Raw counts only (no processed/normalized-only data)
|
||||
4. **Metadata**: Standardized following CELLxGENE schema
|
||||
5. **Both spatial and non-spatial data**: Includes traditional and spatial transcriptomics
|
||||
|
||||
## Important Data Characteristics
|
||||
|
||||
### Duplicate Cells
|
||||
Cells may appear across multiple datasets. Use `is_primary_data == True` to filter for unique cells in most analyses.
|
||||
|
||||
### Count Types
|
||||
The Census includes:
|
||||
- **Molecule counts**: From UMI-based methods
|
||||
- **Full-gene sequencing read counts**: From non-UMI methods
|
||||
These may need different normalization approaches.
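
One way to see which count types a query would mix is to inspect the `assay` column before choosing a normalization strategy (a sketch using the package's `get_obs` helper; the tissue value is only an example):

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    obs = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["assay"],
    )
    print(obs["assay"].value_counts())
```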
|
||||
|
||||
### Versioning
|
||||
Census releases are versioned (e.g., "2023-07-25", "stable"). Always specify version for reproducible analysis:
|
||||
```python
|
||||
census = cellxgene_census.open_soma(census_version="2023-07-25")
|
||||
```
|
||||
|
||||
## Dataset Presence Matrix
|
||||
|
||||
Access which genes were measured in each dataset:
|
||||
```python
|
||||
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
|
||||
```
|
||||
|
||||
This sparse boolean matrix helps understand:
|
||||
- Gene coverage across datasets
|
||||
- Which datasets to include for specific gene analyses
|
||||
- Technical batch effects related to gene coverage
|
||||
|
||||
## SOMA Object Types
|
||||
|
||||
Core TileDB-SOMA objects used:
|
||||
- **DataFrame**: Tabular data (obs, var)
|
||||
- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
|
||||
- **DenseNDArray**: Dense arrays (less common)
|
||||
- **Collection**: Container for related objects
|
||||
- **Experiment**: Top-level container for measurements
|
||||
- **SOMAScene**: Spatial transcriptomics scenes
|
||||
- **obs_spatial_presence**: Spatial data availability
|
||||
351
scientific-packages/cellxgene-census/references/common_patterns.md
Normal file
@@ -0,0 +1,351 @@
|
||||
# Common Query Patterns and Best Practices
|
||||
|
||||
## Query Pattern Categories
|
||||
|
||||
### 1. Exploratory Queries (Metadata Only)
|
||||
|
||||
Use when exploring available data without loading expression matrices.
|
||||
|
||||
**Pattern: Get unique cell types in a tissue**
|
||||
```python
|
||||
import cellxgene_census
|
||||
|
||||
with cellxgene_census.open_soma() as census:
|
||||
cell_metadata = cellxgene_census.get_obs(
|
||||
census,
|
||||
"homo_sapiens",
|
||||
value_filter="tissue_general == 'brain' and is_primary_data == True",
|
||||
column_names=["cell_type"]
|
||||
)
|
||||
unique_cell_types = cell_metadata["cell_type"].unique()
|
||||
print(f"Found {len(unique_cell_types)} unique cell types")
|
||||
```
|
||||
|
||||
**Pattern: Count cells by condition**
|
||||
```python
|
||||
cell_metadata = cellxgene_census.get_obs(
|
||||
census,
|
||||
"homo_sapiens",
|
||||
value_filter="disease != 'normal' and is_primary_data == True",
|
||||
column_names=["disease", "tissue_general"]
|
||||
)
|
||||
counts = cell_metadata.groupby(["disease", "tissue_general"]).size()
|
||||
```
|
||||
|
||||
**Pattern: Explore dataset information**
|
||||
```python
|
||||
# Access datasets table
|
||||
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
|
||||
|
||||
# Filter for specific criteria
|
||||
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
|
||||
```
|
||||
|
||||

### 2. Small-to-Medium Queries (AnnData)

Use `get_anndata()` when results fit in memory (typically < 100k cells).

**Pattern: Tissue-specific cell type query**
```python
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)
```

**Pattern: Gene-specific query with multiple genes**
```python
marker_genes = ["CD4", "CD8A", "CD19", "FOXP3"]

# First get gene IDs
gene_metadata = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter=f"feature_name in {marker_genes}",
    column_names=["feature_id", "feature_name"]
)
gene_ids = gene_metadata["feature_id"].tolist()

# Query with gene filter
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter=f"feature_id in {gene_ids}",
    obs_value_filter="cell_type == 'T cell' and is_primary_data == True",
)
```

**Pattern: Multi-tissue query**
```python
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "dataset_id"],
)
```

**Pattern: Disease-specific query**
```python
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and tissue_general == 'lung' and is_primary_data == True",
)
```

### 3. Large Queries (Out-of-Core Processing)

Use `axis_query()` for queries that exceed available RAM.

**Pattern: Iterative processing**
```python
import pyarrow as pa
import tiledbsoma as soma  # provides AxisQuery

# Create query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    )
)

# Iterate through X matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # Process batch (a pyarrow.Table)
    # batch has columns: soma_data, soma_dim_0, soma_dim_1
    process_batch(batch)
```

**Pattern: Incremental statistics (mean/variance)**
```python
# Using Welford's online algorithm
n = 0
mean = 0
M2 = 0

iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        delta2 = x - mean
        M2 += delta * delta2

variance = M2 / (n - 1) if n > 1 else 0
```
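
The per-element loop above is easy to follow but slow in pure Python. A hedged, vectorized variant that combines per-batch summaries with Chan's pairwise update yields the same mean and variance (sketch; assumes each batch's `soma_data` column fits in memory):

```python
import numpy as np

n, mean, M2 = 0, 0.0, 0.0
for batch in query.X("raw").tables():
    x = batch["soma_data"].to_numpy()
    if x.size == 0:
        continue
    b_mean = x.mean()
    b_M2 = ((x - b_mean) ** 2).sum()
    delta = b_mean - mean
    total = n + x.size
    mean += delta * x.size / total               # combined mean
    M2 += b_M2 + delta**2 * n * x.size / total   # combined sum of squared deviations
    n = total

variance = M2 / (n - 1) if n > 1 else 0.0
```
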

### 4. PyTorch Integration (Machine Learning)

Use `experiment_dataloader()` for training models.

**Pattern: Create training dataloader**
```python
from cellxgene_census.experimental.ml import experiment_dataloader
import torch

with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression
            labels = batch["obs"]["cell_type"]  # Cell type labels
            # Train model...
```

**Pattern: Train/test split**
```python
from cellxgene_census.experimental.ml import ExperimentDataset

# Create dataset from query
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)

# Split data
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42
)

# Create loaders
train_loader = experiment_dataloader(train_dataset)
test_loader = experiment_dataloader(test_dataset)
```

### 5. Integration Workflows

**Pattern: Scanpy integration**
```python
import scanpy as sc

# Load data
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and is_primary_data == True",
)

# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["cell_type", "tissue_general"])
```

**Pattern: Multi-dataset integration**
```python
# Query multiple datasets separately
datasets_to_integrate = ["dataset_id_1", "dataset_id_2", "dataset_id_3"]

adatas = []
for dataset_id in datasets_to_integrate:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"dataset_id == '{dataset_id}' and is_primary_data == True",
    )
    adatas.append(adata)

# Integrate using scanorama, harmony, or other tools. For Scanorama via scanpy,
# concatenate first: scanorama_integrate expects a single AnnData with a batch
# key and a precomputed PCA basis.
import anndata as ad
import scanpy as sc
import scanpy.external as sce

adata_all = ad.concat(adatas, label="batch")
sc.pp.normalize_total(adata_all, target_sum=1e4)
sc.pp.log1p(adata_all)
sc.pp.pca(adata_all)
sce.pp.scanorama_integrate(adata_all, key="batch")
```

## Best Practices

### 1. Always Filter for Primary Data
Unless specifically analyzing duplicates, always include `is_primary_data == True`:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```

### 2. Specify Census Version
For reproducible analysis, always specify the Census version:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```

### 3. Use Context Manager
Always use the context manager to ensure proper cleanup:
```python
with cellxgene_census.open_soma() as census:
    # Your code here
```

### 4. Select Only Needed Columns
Minimize data transfer by selecting only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns
```

### 5. Check Dataset Presence for Gene Queries
When analyzing specific genes, check which datasets measured them:
```python
presence = cellxgene_census.get_presence_matrix(
    census,
    "homo_sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A']"
)
```

### 6. Use tissue_general for Broader Queries
`tissue_general` provides coarser groupings than `tissue`, useful for cross-tissue analyses:
```python
# Better for broad queries
obs_value_filter="tissue_general == 'immune system'"

# Use specific tissue when needed
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```

### 7. Combine Metadata Exploration with Expression Queries
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19'",
    column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```

### 8. Memory Management for Large Queries
For large queries, check estimated size before loading:
```python
# Get cell count first
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells} cells")

# If too large, use out-of-core processing or further filtering
```
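
A rough, hedged sizing helper to turn that cell count into a memory estimate before calling `get_anndata()` (the 60k-gene default and 4-bytes-per-value figure are assumptions for a dense upper bound, not Census guarantees; real sparse data is far smaller):

```python
def estimate_dense_gib(n_cells: int, n_genes: int = 60_000, bytes_per_value: int = 4) -> float:
    """Worst-case dense footprint of an expression query, in GiB."""
    return n_cells * n_genes * bytes_per_value / 1024**3

print(f"~{estimate_dense_gib(n_cells):.1f} GiB dense upper bound for {n_cells} cells")
# If this is large relative to available RAM, switch to axis_query() streaming
```
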

### 9. Leverage Ontology Terms for Consistency
When possible, use ontology term IDs instead of free text:
```python
# More reliable than cell_type == 'B cell' across datasets
obs_value_filter="cell_type_ontology_term_id == 'CL:0000236'"
```
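
When the term ID for a label is not known up front, it can be looked up from the metadata itself; a small hedged sketch:

```python
# Map a free-text label to the ontology term ID(s) observed in the Census
obs = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="cell_type == 'B cell'",
    column_names=["cell_type", "cell_type_ontology_term_id"],
)
print(obs.drop_duplicates())
```
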

### 10. Batch Processing Pattern
For systematic analyses across multiple conditions:
```python
tissues = ["lung", "liver", "kidney", "heart"]
results = {}

for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    # Perform analysis
    results[tissue] = analyze(adata)
```

## Common Pitfalls to Avoid

1. **Not filtering for is_primary_data**: Leads to counting duplicate cells
2. **Loading too much data**: Use metadata queries to estimate size first
3. **Not using context manager**: Can cause resource leaks
4. **Inconsistent versioning**: Results not reproducible without specifying version
5. **Overly broad queries**: Start with focused queries, expand as needed
6. **Ignoring dataset presence**: Some genes not measured in all datasets
7. **Wrong count normalization**: Be aware of UMI vs read count differences
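
As a concrete check for the first pitfall, a brief sketch that measures how much double counting a filter would include if `is_primary_data` were omitted:

```python
obs = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="tissue_general == 'lung'",
    column_names=["is_primary_data"],
)
dup_fraction = 1.0 - obs["is_primary_data"].mean()
print(f"{dup_fraction:.1%} of matching cells are non-primary (duplicate) entries")
```
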

scientific-packages/cobrapy/SKILL.md

---
name: cobrapy
description: Comprehensive toolkit for constraint-based reconstruction and analysis (COBRA) of metabolic models. Use when working with genome-scale metabolic models, performing flux balance analysis (FBA), simulating cellular metabolism, conducting gene/reaction knockout studies, gapfilling metabolic networks, analyzing flux distributions, calculating minimal media requirements, or any systems biology task involving computational modeling of cellular metabolism. Supports SBML, JSON, YAML, and MATLAB formats.
---

# COBRApy - Constraint-Based Reconstruction and Analysis

## Overview

COBRApy is a Python library for constraint-based reconstruction and analysis (COBRA) of metabolic models, essential for systems biology research. Use this skill to work with genome-scale metabolic models, perform computational simulations of cellular metabolism, conduct metabolic engineering analyses, and predict phenotypic behaviors.

## Core Capabilities

COBRApy provides comprehensive tools organized into several key areas:

### 1. Model Management

Load existing models from repositories or files:
```python
from cobra.io import load_model

# Load bundled test models
model = load_model("textbook")  # E. coli core model
model = load_model("ecoli")  # Full E. coli model
model = load_model("salmonella")

# Load from files
from cobra.io import read_sbml_model, load_json_model, load_yaml_model
model = read_sbml_model("path/to/model.xml")
model = load_json_model("path/to/model.json")
model = load_yaml_model("path/to/model.yml")
```

Save models in various formats:
```python
from cobra.io import write_sbml_model, save_json_model, save_yaml_model
write_sbml_model(model, "output.xml")  # Preferred format
save_json_model(model, "output.json")  # For Escher compatibility
save_yaml_model(model, "output.yml")  # Human-readable
```

### 2. Model Structure and Components

Access and inspect model components:
```python
# Access components
model.reactions  # DictList of all reactions
model.metabolites  # DictList of all metabolites
model.genes  # DictList of all genes

# Get specific items by ID or index
reaction = model.reactions.get_by_id("PFK")
metabolite = model.metabolites[0]

# Inspect properties
print(reaction.reaction)  # Stoichiometric equation
print(reaction.bounds)  # Flux constraints
print(reaction.gene_reaction_rule)  # GPR logic
print(metabolite.formula)  # Chemical formula
print(metabolite.compartment)  # Cellular location
```

### 3. Flux Balance Analysis (FBA)

Perform standard FBA simulation:
```python
# Basic optimization
solution = model.optimize()
print(f"Objective value: {solution.objective_value}")
print(f"Status: {solution.status}")

# Access fluxes
print(solution.fluxes["PFK"])
print(solution.fluxes.head())

# Fast optimization (objective value only)
objective_value = model.slim_optimize()

# Change objective
model.objective = "ATPM"
solution = model.optimize()
```

Parsimonious FBA (minimize total flux):
```python
from cobra.flux_analysis import pfba
solution = pfba(model)
```

Geometric FBA (find central solution):
```python
from cobra.flux_analysis import geometric_fba
solution = geometric_fba(model)
```

### 4. Flux Variability Analysis (FVA)

Determine flux ranges for all reactions:
```python
from cobra.flux_analysis import flux_variability_analysis

# Standard FVA
fva_result = flux_variability_analysis(model)

# FVA at 90% optimality
fva_result = flux_variability_analysis(model, fraction_of_optimum=0.9)

# Loopless FVA (eliminates thermodynamically infeasible loops)
fva_result = flux_variability_analysis(model, loopless=True)

# FVA for specific reactions
fva_result = flux_variability_analysis(
    model,
    reaction_list=["PFK", "FBA", "PGI"]
)
```

### 5. Gene and Reaction Deletion Studies

Perform knockout analyses:
```python
from cobra.flux_analysis import (
    single_gene_deletion,
    single_reaction_deletion,
    double_gene_deletion,
    double_reaction_deletion
)

# Single deletions
gene_results = single_gene_deletion(model)
reaction_results = single_reaction_deletion(model)

# Double deletions (uses multiprocessing)
double_gene_results = double_gene_deletion(
    model,
    processes=4  # Number of CPU cores
)

# Manual knockout using context manager
with model:
    model.genes.get_by_id("b0008").knock_out()
    solution = model.optimize()
    print(f"Growth after knockout: {solution.objective_value}")
# Model automatically reverts after context exit
```

### 6. Growth Media and Minimal Media

Manage growth medium:
```python
# View current medium
print(model.medium)

# Modify medium (must reassign entire dict)
medium = model.medium
medium["EX_glc__D_e"] = 10.0  # Set glucose uptake
medium["EX_o2_e"] = 0.0  # Anaerobic conditions
model.medium = medium

# Calculate minimal media
from cobra.medium import minimal_medium

# Minimize total import flux
min_medium = minimal_medium(model, minimize_components=False)

# Minimize number of components (uses MILP, slower)
min_medium = minimal_medium(
    model,
    minimize_components=True,
    open_exchanges=True
)
```

### 7. Flux Sampling

Sample the feasible flux space:
```python
from cobra.sampling import sample

# Sample using OptGP (default, supports parallel processing)
samples = sample(model, n=1000, method="optgp", processes=4)

# Sample using ACHR
samples = sample(model, n=1000, method="achr")

# Validate samples
from cobra.sampling import OptGPSampler
sampler = OptGPSampler(model, processes=4)
samples = sampler.sample(1000)
validation = sampler.validate(samples)
print((validation == "v").all())  # True when every sample is valid
```

### 8. Production Envelopes

Calculate phenotype phase planes:
```python
from cobra.flux_analysis import production_envelope

# Standard production envelope
envelope = production_envelope(
    model,
    reactions=["EX_glc__D_e", "EX_o2_e"],
    objective="EX_ac_e"  # Acetate production
)

# With carbon yield
envelope = production_envelope(
    model,
    reactions=["EX_glc__D_e", "EX_o2_e"],
    carbon_sources="EX_glc__D_e"
)

# Visualize (use matplotlib or pandas plotting)
import matplotlib.pyplot as plt
envelope.plot(x="EX_glc__D_e", y="EX_o2_e", kind="scatter")
plt.show()
```

### 9. Gapfilling

Add reactions to make models feasible:
```python
from cobra.flux_analysis import gapfill

# Prepare universal model with candidate reactions
# ("universal" is a placeholder -- supply your own universal reaction database)
universal = load_model("universal")

# Perform gapfilling
with model:
    # Remove reactions to create gaps for demonstration
    model.remove_reactions([model.reactions.PGI])

    # Find reactions needed
    solution = gapfill(model, universal)
    print(f"Reactions to add: {solution}")
```

### 10. Model Building

Build models from scratch:
```python
from cobra import Model, Reaction, Metabolite

# Create model
model = Model("my_model")

# Create metabolites
atp_c = Metabolite("atp_c", formula="C10H12N5O13P3",
                   name="ATP", compartment="c")
adp_c = Metabolite("adp_c", formula="C10H12N5O10P2",
                   name="ADP", compartment="c")
pi_c = Metabolite("pi_c", formula="HO4P",
                  name="Phosphate", compartment="c")

# Create reaction
reaction = Reaction("ATPASE")
reaction.name = "ATP hydrolysis"
reaction.subsystem = "Energy"
reaction.lower_bound = 0.0
reaction.upper_bound = 1000.0

# Add metabolites with stoichiometry
reaction.add_metabolites({
    atp_c: -1.0,
    adp_c: 1.0,
    pi_c: 1.0
})

# Add gene-reaction rule
reaction.gene_reaction_rule = "(gene1 and gene2) or gene3"

# Add to model
model.add_reactions([reaction])

# Add boundary reactions
model.add_boundary(atp_c, type="exchange")
model.add_boundary(adp_c, type="demand")

# Set objective
model.objective = "ATPASE"
```

## Common Workflows

### Workflow 1: Load Model and Predict Growth

```python
from cobra.io import load_model

# Load model
model = load_model("ecoli")

# Run FBA
solution = model.optimize()
print(f"Growth rate: {solution.objective_value:.3f} /h")

# Show active pathways
print(solution.fluxes[solution.fluxes.abs() > 1e-6])
```

### Workflow 2: Gene Knockout Screen

```python
from cobra.io import load_model
from cobra.flux_analysis import single_gene_deletion

# Load model
model = load_model("ecoli")

# Record the wild-type growth rate for comparison
wild_type_growth = model.slim_optimize()

# Perform single gene deletions
results = single_gene_deletion(model)

# Find essential genes (growth < threshold)
essential_genes = results[results["growth"] < 0.01]
print(f"Found {len(essential_genes)} essential genes")

# Find genes with minimal impact
neutral_genes = results[results["growth"] > 0.9 * wild_type_growth]
```

### Workflow 3: Media Optimization

```python
from cobra.io import load_model
from cobra.medium import minimal_medium

# Load model
model = load_model("ecoli")

# Calculate minimal medium for 50% of max growth
target_growth = model.slim_optimize() * 0.5
min_medium = minimal_medium(
    model,
    target_growth,
    minimize_components=True
)

print(f"Minimal medium components: {len(min_medium)}")
print(min_medium)
```

### Workflow 4: Flux Uncertainty Analysis

```python
from cobra.io import load_model
from cobra.flux_analysis import flux_variability_analysis
from cobra.sampling import sample

# Load model
model = load_model("ecoli")

# First check flux ranges at optimality
fva = flux_variability_analysis(model, fraction_of_optimum=1.0)

# For reactions with large ranges, sample to understand distribution
samples = sample(model, n=1000)

# Analyze specific reaction
reaction_id = "PFK"
import matplotlib.pyplot as plt
samples[reaction_id].hist(bins=50)
plt.xlabel(f"Flux through {reaction_id}")
plt.ylabel("Frequency")
plt.show()
```

### Workflow 5: Context Manager for Temporary Changes

Use context managers to make temporary modifications:
```python
# Model remains unchanged outside context
with model:
    # Temporarily change objective
    model.objective = "ATPM"

    # Temporarily modify bounds
    model.reactions.EX_glc__D_e.lower_bound = -5.0

    # Temporarily knock out genes
    model.genes.b0008.knock_out()

    # Optimize with changes
    solution = model.optimize()
    print(f"Modified growth: {solution.objective_value}")

# All changes automatically reverted
solution = model.optimize()
print(f"Original growth: {solution.objective_value}")
```

## Key Concepts

### DictList Objects
Models use `DictList` objects for reactions, metabolites, and genes - behaving like both lists and dictionaries:
```python
# Access by index
first_reaction = model.reactions[0]

# Access by ID
pfk = model.reactions.get_by_id("PFK")

# Query methods
atp_reactions = model.reactions.query("atp")
```

### Flux Constraints
Reaction bounds define feasible flux ranges:
- **Irreversible**: `lower_bound = 0, upper_bound > 0`
- **Reversible**: `lower_bound < 0, upper_bound > 0`
- Set both bounds simultaneously with `.bounds` to avoid inconsistencies
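
A minimal sketch of these bound idioms on the bundled textbook model:

```python
from cobra.io import load_model

model = load_model("textbook")
pfk = model.reactions.get_by_id("PFK")

# Assign both bounds at once to avoid a transient inconsistent state
pfk.bounds = (0.0, 1000.0)                                    # irreversible: forward flux only
model.reactions.get_by_id("PGI").bounds = (-1000.0, 1000.0)   # reversible

print(pfk.reversibility)  # False, because the lower bound is not negative
```
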

### Gene-Reaction Rules (GPR)
Boolean logic linking genes to reactions:
```python
# AND logic (both required)
reaction.gene_reaction_rule = "gene1 and gene2"

# OR logic (either sufficient)
reaction.gene_reaction_rule = "gene1 or gene2"

# Complex logic
reaction.gene_reaction_rule = "(gene1 and gene2) or (gene3 and gene4)"
```

### Exchange Reactions
Special reactions representing metabolite import/export:
- Named with prefix `EX_` by convention
- Positive flux = secretion, negative flux = uptake
- Managed through `model.medium` dictionary
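
These conventions can be seen directly on the textbook model (the exact medium values depend on the model used):

```python
from cobra.io import load_model

model = load_model("textbook")

# Exchange reactions carry the EX_ prefix and sit on the system boundary
print([r.id for r in model.exchanges][:5])

# model.medium reports open uptake bounds as positive numbers
print(model.medium)

# In a solution, negative exchange flux means uptake and positive means secretion
solution = model.optimize()
print(solution.fluxes["EX_glc__D_e"], solution.fluxes["EX_o2_e"])
```
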

## Best Practices

1. **Use context managers** for temporary modifications to avoid state management issues
2. **Validate models** before analysis using `model.slim_optimize()` to ensure feasibility
3. **Check solution status** after optimization - `optimal` indicates successful solve
4. **Use loopless FVA** when thermodynamic feasibility matters
5. **Set fraction_of_optimum** appropriately in FVA to explore suboptimal space
6. **Parallelize** computationally expensive operations (sampling, double deletions)
7. **Prefer SBML format** for model exchange and long-term storage
8. **Use slim_optimize()** when only objective value needed for performance
9. **Validate flux samples** to ensure numerical stability

## Troubleshooting

- **Infeasible solutions**: Check medium constraints, reaction bounds, and model consistency
- **Slow optimization**: Try different solvers (GLPK, CPLEX, Gurobi) via `model.solver`
- **Unbounded solutions**: Verify exchange reactions have appropriate upper bounds
- **Import errors**: Ensure correct file format and valid SBML identifiers
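
A small hedged sketch tying these tips together - gate expensive analyses on a quick feasibility check, and switch solvers only if the backend is installed:

```python
import math
from cobra.io import load_model

model = load_model("textbook")
model.solver = "glpk"  # swap for "gurobi" or "cplex" if those solvers are installed

growth = model.slim_optimize()
if math.isnan(growth):
    print("Model is infeasible - check medium constraints and reaction bounds")
else:
    solution = model.optimize()
    print(solution.status, solution.objective_value)
```
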

## References

For detailed workflows and API patterns, refer to:
- `references/workflows.md` - Comprehensive step-by-step workflow examples
- `references/api_quick_reference.md` - Common function signatures and patterns

Official documentation: https://cobrapy.readthedocs.io/en/latest/

scientific-packages/cobrapy/references/api_quick_reference.md

# COBRApy API Quick Reference

This document provides quick reference for common COBRApy functions, signatures, and usage patterns.

## Model I/O

### Loading Models

```python
from cobra.io import load_model, read_sbml_model, load_json_model, load_yaml_model, load_matlab_model

# Bundled test models
model = load_model("textbook")  # E. coli core metabolism
model = load_model("ecoli")  # Full E. coli iJO1366
model = load_model("salmonella")  # Salmonella LT2

# From files
model = read_sbml_model(filename, f_replace={}, **kwargs)
model = load_json_model(filename)
model = load_yaml_model(filename)
model = load_matlab_model(filename, variable_name=None)
```

### Saving Models

```python
from cobra.io import write_sbml_model, save_json_model, save_yaml_model, save_matlab_model

write_sbml_model(model, filename, f_replace={}, **kwargs)
save_json_model(model, filename, pretty=False, **kwargs)
save_yaml_model(model, filename, **kwargs)
save_matlab_model(model, filename, **kwargs)
```

## Model Structure

### Core Classes

```python
from cobra import Model, Reaction, Metabolite, Gene

# Create model
model = Model(id_or_model=None, name=None)

# Create metabolite
metabolite = Metabolite(
    id=None,
    formula=None,
    name="",
    charge=None,
    compartment=None
)

# Create reaction
reaction = Reaction(
    id=None,
    name="",
    subsystem="",
    lower_bound=0.0,
    upper_bound=None
)

# Create gene
gene = Gene(id=None, name="", functional=True)
```

### Model Attributes

```python
# Component access (DictList objects)
model.reactions  # DictList of Reaction objects
model.metabolites  # DictList of Metabolite objects
model.genes  # DictList of Gene objects

# Special reaction lists
model.exchanges  # Exchange reactions (external transport)
model.demands  # Demand reactions (metabolite sinks)
model.sinks  # Sink reactions
model.boundary  # All boundary reactions

# Model properties
model.objective  # Current objective (read/write)
model.objective_direction  # "max" or "min"
model.medium  # Growth medium (dict of exchange: bound)
model.solver  # Optimization solver
```

### DictList Methods

```python
# Access by index
item = model.reactions[0]

# Access by ID
item = model.reactions.get_by_id("PFK")

# Query by string (substring match)
items = model.reactions.query("atp")  # Case-insensitive search
items = model.reactions.query(lambda x: x.subsystem == "Glycolysis")

# List comprehension
items = [r for r in model.reactions if r.lower_bound < 0]

# Check membership
"PFK" in model.reactions
```

## Optimization

### Basic Optimization

```python
# Full optimization (returns Solution object)
solution = model.optimize()

# Attributes of Solution
solution.objective_value  # Objective function value
solution.status  # Optimization status ("optimal", "infeasible", etc.)
solution.fluxes  # Pandas Series of reaction fluxes
solution.shadow_prices  # Pandas Series of metabolite shadow prices
solution.reduced_costs  # Pandas Series of reduced costs

# Fast optimization (returns float only)
objective_value = model.slim_optimize()

# Change objective
model.objective = "ATPM"
model.objective = model.reactions.ATPM
model.objective = {model.reactions.ATPM: 1.0}

# Change optimization direction
model.objective_direction = "max"  # or "min"
```

### Solver Configuration

```python
# Check available solvers
from cobra.util.solver import solvers
print(solvers)

# Change solver
model.solver = "glpk"  # or "cplex", "gurobi", etc.

# Solver-specific configuration
model.solver.configuration.timeout = 60  # seconds
model.solver.configuration.verbosity = 1
model.solver.configuration.tolerances.feasibility = 1e-9
```

## Flux Analysis

### Flux Balance Analysis (FBA)

```python
from cobra.flux_analysis import pfba, geometric_fba

# Parsimonious FBA
solution = pfba(model, fraction_of_optimum=1.0, **kwargs)

# Geometric FBA
solution = geometric_fba(model, epsilon=1e-06, max_tries=200)
```

### Flux Variability Analysis (FVA)

```python
from cobra.flux_analysis import flux_variability_analysis

fva_result = flux_variability_analysis(
    model,
    reaction_list=None,  # List of reaction IDs or None for all
    loopless=False,  # Eliminate thermodynamically infeasible loops
    fraction_of_optimum=1.0,  # Optimality fraction (0.0-1.0)
    pfba_factor=None,  # Optional pFBA constraint
    processes=1  # Number of parallel processes
)

# Returns DataFrame with columns: minimum, maximum
```

### Gene and Reaction Deletions

```python
from cobra.flux_analysis import (
    single_gene_deletion,
    single_reaction_deletion,
    double_gene_deletion,
    double_reaction_deletion
)

# Single deletions
results = single_gene_deletion(
    model,
    gene_list=None,  # None for all genes
    processes=1,
    **kwargs
)

results = single_reaction_deletion(
    model,
    reaction_list=None,  # None for all reactions
    processes=1,
    **kwargs
)

# Double deletions
results = double_gene_deletion(
    model,
    gene_list1=None,
    gene_list2=None,
    processes=1,
    **kwargs
)

results = double_reaction_deletion(
    model,
    reaction_list1=None,
    reaction_list2=None,
    processes=1,
    **kwargs
)

# Returns DataFrame with columns: ids, growth, status
# For double deletions, index is MultiIndex of gene/reaction pairs
```

### Flux Sampling

```python
from cobra.sampling import sample, OptGPSampler, ACHRSampler

# Simple interface
samples = sample(
    model,
    n,  # Number of samples
    method="optgp",  # or "achr"
    thinning=100,  # Thinning factor (sample every n iterations)
    processes=1,  # Parallel processes (OptGP only)
    seed=None  # Random seed
)

# Advanced interface with sampler objects
sampler = OptGPSampler(model, processes=4, thinning=100)
sampler = ACHRSampler(model, thinning=100)

# Generate samples
samples = sampler.sample(n)

# Validate samples
validation = sampler.validate(samples)
# Returns array of 'v' (valid), 'l' (lower bound violation),
# 'u' (upper bound violation), 'e' (equality violation)

# Batch sampling
sampler.batch(n_samples, n_batches)
```

### Production Envelopes

```python
from cobra.flux_analysis import production_envelope

envelope = production_envelope(
    model,
    reactions,  # List of 1-2 reaction IDs
    objective=None,  # Objective reaction ID (None uses model objective)
    carbon_sources=None,  # Carbon source for yield calculation
    points=20,  # Number of points to calculate
    threshold=0.01  # Minimum objective value threshold
)

# Returns DataFrame with columns:
# - First reaction flux
# - Second reaction flux (if provided)
# - objective_minimum, objective_maximum
# - carbon_yield_minimum, carbon_yield_maximum (if carbon source specified)
# - mass_yield_minimum, mass_yield_maximum
```

### Gapfilling

```python
from cobra.flux_analysis import gapfill

# Basic gapfilling
solution = gapfill(
    model,
    universal=None,  # Universal model with candidate reactions
    lower_bound=0.05,  # Minimum objective flux
    penalties=None,  # Dict of reaction: penalty
    demand_reactions=True,  # Add demand reactions if needed
    exchange_reactions=False,
    iterations=1
)

# Returns list of Reaction objects to add

# Multiple solutions
solutions = []
for i in range(5):
    sol = gapfill(model, universal, iterations=1)
    solutions.append(sol)
    # Prevent finding same solution by increasing penalties
```

### Other Analysis Methods

```python
from cobra.flux_analysis import (
    find_blocked_reactions,
    find_essential_genes,
    find_essential_reactions
)

# Blocked reactions (cannot carry flux)
blocked = find_blocked_reactions(
    model,
    reaction_list=None,
    zero_cutoff=1e-9,
    open_exchanges=False
)

# Essential genes/reactions
essential_genes = find_essential_genes(model, threshold=0.01)
essential_reactions = find_essential_reactions(model, threshold=0.01)
```

## Media and Boundary Conditions

### Medium Management

```python
# Get current medium (returns dict)
medium = model.medium

# Set medium (must reassign entire dict)
medium = model.medium
medium["EX_glc__D_e"] = 10.0
medium["EX_o2_e"] = 20.0
model.medium = medium

# Alternative: individual modification
with model:
    model.reactions.EX_glc__D_e.lower_bound = -10.0
```

### Minimal Media

```python
from cobra.medium import minimal_medium

min_medium = minimal_medium(
    model,
    min_objective_value=0.1,  # Minimum growth rate
    minimize_components=False,  # If True, uses MILP (slower)
    open_exchanges=False,  # Open all exchanges before optimization
    exports=False,  # Allow metabolite export
    penalties=None  # Dict of exchange: penalty
)

# Returns Series of exchange reactions with fluxes
```

### Boundary Reactions

```python
# Add boundary reaction
model.add_boundary(
    metabolite,
    type="exchange",  # or "demand", "sink"
    reaction_id=None,  # Auto-generated if None
    lb=None,
    ub=None,
    sbo_term=None
)

# Access boundary reactions
exchanges = model.exchanges  # System boundary
demands = model.demands  # Intracellular removal
sinks = model.sinks  # Intracellular exchange
boundaries = model.boundary  # All boundary reactions
```

## Model Manipulation

### Adding Components

```python
# Add reactions
model.add_reactions([reaction1, reaction2, ...])
model.add_reaction(reaction)

# Add metabolites
reaction.add_metabolites({
    metabolite1: -1.0,  # Consumed (negative stoichiometry)
    metabolite2: 1.0  # Produced (positive stoichiometry)
})

# Add metabolites to model
model.add_metabolites([metabolite1, metabolite2, ...])

# Add genes (usually automatic via gene_reaction_rule)
model.genes += [gene1, gene2, ...]
```

### Removing Components

```python
# Remove reactions
model.remove_reactions([reaction1, reaction2, ...])
model.remove_reactions(["PFK", "FBA"])

# Remove metabolites (removes from reactions too)
model.remove_metabolites([metabolite1, metabolite2, ...])

# Remove genes (usually via gene_reaction_rule)
model.genes.remove(gene)
```

### Modifying Reactions

```python
# Set bounds
reaction.bounds = (lower, upper)
reaction.lower_bound = 0.0
reaction.upper_bound = 1000.0

# Modify stoichiometry
reaction.add_metabolites({metabolite: 1.0})
reaction.subtract_metabolites({metabolite: 1.0})

# Change gene-reaction rule
reaction.gene_reaction_rule = "(gene1 and gene2) or gene3"

# Knock out
reaction.knock_out()
gene.knock_out()
```

### Model Copying

```python
# Deep copy (independent model)
model_copy = model.copy()

# Copy specific reactions
new_model = Model("subset")
reactions_to_copy = [model.reactions.PFK, model.reactions.FBA]
new_model.add_reactions(reactions_to_copy)
```

## Context Management

Use context managers for temporary modifications:

```python
# Changes automatically revert after with block
with model:
    model.objective = "ATPM"
    model.reactions.EX_glc__D_e.lower_bound = -5.0
    model.genes.b0008.knock_out()
    solution = model.optimize()

# Model state restored here

# Multiple nested contexts
with model:
    model.objective = "ATPM"
    with model:
        model.genes.b0008.knock_out()
        # Both modifications active
    # Only objective change active

# Context management with reactions
with model:
    model.reactions.PFK.knock_out()
    # Equivalent to: reaction.lower_bound = reaction.upper_bound = 0
```

## Reaction and Metabolite Properties

### Reaction Attributes

```python
reaction.id  # Unique identifier
reaction.name  # Human-readable name
reaction.subsystem  # Pathway/subsystem
reaction.bounds  # (lower_bound, upper_bound)
reaction.lower_bound
reaction.upper_bound
reaction.reversibility  # Boolean (lower_bound < 0)
reaction.gene_reaction_rule  # GPR string
reaction.genes  # Set of associated Gene objects
reaction.metabolites  # Dict of {metabolite: stoichiometry}

# Methods
reaction.reaction  # Stoichiometric equation string
reaction.build_reaction_string()  # Same as above
reaction.check_mass_balance()  # Returns imbalances or empty dict
reaction.get_coefficient(metabolite_id)
reaction.add_metabolites({metabolite: coeff})
reaction.subtract_metabolites({metabolite: coeff})
reaction.knock_out()
```

### Metabolite Attributes

```python
metabolite.id  # Unique identifier
metabolite.name  # Human-readable name
metabolite.formula  # Chemical formula
metabolite.charge  # Charge
metabolite.compartment  # Compartment ID
metabolite.reactions  # FrozenSet of associated reactions

# Methods
metabolite.summary()  # Print production/consumption
metabolite.copy()
```

### Gene Attributes

```python
gene.id  # Unique identifier
gene.name  # Human-readable name
gene.functional  # Boolean activity status
gene.reactions  # FrozenSet of associated reactions

# Methods
gene.knock_out()
```

## Model Validation

### Consistency Checking

```python
from cobra.manipulation import check_mass_balance, check_metabolite_compartment_formula

# Check all reactions for mass balance
unbalanced = {}
for reaction in model.reactions:
    balance = reaction.check_mass_balance()
    if balance:
        unbalanced[reaction.id] = balance

# Check metabolite formulas are valid
check_metabolite_compartment_formula(model)
```

### Model Statistics

```python
# Basic stats
print(f"Reactions: {len(model.reactions)}")
print(f"Metabolites: {len(model.metabolites)}")
print(f"Genes: {len(model.genes)}")

# Advanced stats
print(f"Exchanges: {len(model.exchanges)}")
print(f"Demands: {len(model.demands)}")

# Blocked reactions
from cobra.flux_analysis import find_blocked_reactions
blocked = find_blocked_reactions(model)
print(f"Blocked reactions: {len(blocked)}")

# Essential genes
from cobra.flux_analysis import find_essential_genes
essential = find_essential_genes(model)
print(f"Essential genes: {len(essential)}")
```

## Summary Methods

```python
# Model summary
model.summary()  # Overall model info

# Metabolite summary
model.metabolites.atp_c.summary()

# Reaction summary
model.reactions.PFK.summary()

# Summary with FVA
model.summary(fva=0.95)  # Include FVA at 95% optimality
```

## Common Patterns

### Batch Analysis Pattern

```python
results = []
for condition in conditions:
    with model:
        # Apply condition
        setup_condition(model, condition)

        # Analyze
        solution = model.optimize()

        # Store result
        results.append({
            "condition": condition,
            "growth": solution.objective_value,
            "status": solution.status
        })

df = pd.DataFrame(results)
```

### Systematic Knockout Pattern

```python
knockout_results = []
for gene in model.genes:
    with model:
        gene.knock_out()

        solution = model.optimize()

        knockout_results.append({
            "gene": gene.id,
            "growth": solution.objective_value if solution.status == "optimal" else 0,
            "status": solution.status
        })

df = pd.DataFrame(knockout_results)
```

### Parameter Scan Pattern

```python
parameter_values = np.linspace(0, 20, 21)
results = []

for value in parameter_values:
    with model:
        model.reactions.EX_glc__D_e.lower_bound = -value

        solution = model.optimize()

        results.append({
            "glucose_uptake": value,
            "growth": solution.objective_value,
            "acetate_secretion": solution.fluxes["EX_ac_e"]
        })

df = pd.DataFrame(results)
```

This quick reference covers the most commonly used COBRApy functions and patterns. For complete API documentation, see https://cobrapy.readthedocs.io/

scientific-packages/cobrapy/references/workflows.md

# COBRApy Comprehensive Workflows

This document provides detailed step-by-step workflows for common COBRApy tasks in metabolic modeling.

## Workflow 1: Complete Knockout Study with Visualization

This workflow demonstrates how to perform a comprehensive gene knockout study and visualize the results.

```python
import pandas as pd
import matplotlib.pyplot as plt
from cobra.io import load_model
from cobra.flux_analysis import single_gene_deletion, double_gene_deletion

# Step 1: Load model
model = load_model("ecoli")
print(f"Loaded model: {model.id}")
print(f"Model contains {len(model.reactions)} reactions, {len(model.metabolites)} metabolites, {len(model.genes)} genes")

# Step 2: Get baseline growth rate
baseline = model.slim_optimize()
print(f"Baseline growth rate: {baseline:.3f} /h")

# Step 3: Perform single gene deletions
print("Performing single gene deletions...")
single_results = single_gene_deletion(model)

# Step 4: Classify genes by impact
essential_genes = single_results[single_results["growth"] < 0.01]
severely_impaired = single_results[(single_results["growth"] >= 0.01) &
                                   (single_results["growth"] < 0.5 * baseline)]
moderately_impaired = single_results[(single_results["growth"] >= 0.5 * baseline) &
                                     (single_results["growth"] < 0.9 * baseline)]
neutral_genes = single_results[single_results["growth"] >= 0.9 * baseline]

print(f"\nSingle Deletion Results:")
print(f"  Essential genes: {len(essential_genes)}")
print(f"  Severely impaired: {len(severely_impaired)}")
print(f"  Moderately impaired: {len(moderately_impaired)}")
print(f"  Neutral genes: {len(neutral_genes)}")

# Step 5: Visualize distribution
fig, ax = plt.subplots(figsize=(10, 6))
single_results["growth"].hist(bins=50, ax=ax)
ax.axvline(baseline, color='r', linestyle='--', label='Baseline')
ax.set_xlabel("Growth rate (/h)")
ax.set_ylabel("Number of genes")
ax.set_title("Distribution of Growth Rates After Single Gene Deletions")
ax.legend()
plt.tight_layout()
plt.savefig("single_deletion_distribution.png", dpi=300)

# Step 6: Identify gene pairs for double deletions
# Focus on non-essential genes to find synthetic lethals
target_genes = single_results[single_results["growth"] >= 0.5 * baseline].index.tolist()
target_genes = [list(gene)[0] for gene in target_genes[:50]]  # Limit for performance

print(f"\nPerforming double deletions on {len(target_genes)} genes...")
double_results = double_gene_deletion(
    model,
    gene_list1=target_genes,
    processes=4
)

# Step 7: Find synthetic lethal pairs
synthetic_lethals = double_results[
    (double_results["growth"] < 0.01) &
    (single_results.loc[double_results.index.get_level_values(0)]["growth"].values >= 0.5 * baseline) &
    (single_results.loc[double_results.index.get_level_values(1)]["growth"].values >= 0.5 * baseline)
]

print(f"Found {len(synthetic_lethals)} synthetic lethal gene pairs")
print("\nTop 10 synthetic lethal pairs:")
print(synthetic_lethals.head(10))

# Step 8: Export results
single_results.to_csv("single_gene_deletions.csv")
double_results.to_csv("double_gene_deletions.csv")
synthetic_lethals.to_csv("synthetic_lethals.csv")
```

## Workflow 2: Media Design and Optimization

This workflow shows how to systematically design growth media and find minimal media compositions.

```python
from cobra.io import load_model
from cobra.medium import minimal_medium
import pandas as pd

# Step 1: Load model and check current medium
model = load_model("ecoli")
current_medium = model.medium
print("Current medium composition:")
for exchange, bound in current_medium.items():
    metabolite_id = exchange.replace("EX_", "").replace("_e", "")
    print(f"  {metabolite_id}: {bound:.2f} mmol/gDW/h")

# Step 2: Get baseline growth
baseline_growth = model.slim_optimize()
print(f"\nBaseline growth rate: {baseline_growth:.3f} /h")

# Step 3: Calculate minimal medium for different growth targets
growth_targets = [0.25, 0.5, 0.75, 1.0]
minimal_media = {}

for fraction in growth_targets:
    target_growth = baseline_growth * fraction
    print(f"\nCalculating minimal medium for {fraction*100:.0f}% growth ({target_growth:.3f} /h)...")

    min_medium = minimal_medium(
        model,
        target_growth,
        minimize_components=True,
        open_exchanges=True
    )

    minimal_media[fraction] = min_medium
    print(f"  Required components: {len(min_medium)}")
    print(f"  Components: {list(min_medium.index)}")

# Step 4: Compare media compositions
media_df = pd.DataFrame(minimal_media).fillna(0)
media_df.to_csv("minimal_media_comparison.csv")

# Step 5: Test aerobic vs anaerobic conditions
print("\n--- Aerobic vs Anaerobic Comparison ---")

# Aerobic
model_aerobic = model.copy()
aerobic_growth = model_aerobic.slim_optimize()
aerobic_medium = minimal_medium(model_aerobic, aerobic_growth * 0.9, minimize_components=True)

# Anaerobic
model_anaerobic = model.copy()
medium_anaerobic = model_anaerobic.medium
medium_anaerobic["EX_o2_e"] = 0.0
model_anaerobic.medium = medium_anaerobic
anaerobic_growth = model_anaerobic.slim_optimize()
anaerobic_medium = minimal_medium(model_anaerobic, anaerobic_growth * 0.9, minimize_components=True)

print(f"Aerobic growth: {aerobic_growth:.3f} /h (requires {len(aerobic_medium)} components)")
print(f"Anaerobic growth: {anaerobic_growth:.3f} /h (requires {len(anaerobic_medium)} components)")

# Step 6: Identify unique requirements
aerobic_only = set(aerobic_medium.index) - set(anaerobic_medium.index)
anaerobic_only = set(anaerobic_medium.index) - set(aerobic_medium.index)
shared = set(aerobic_medium.index) & set(anaerobic_medium.index)

print(f"\nShared components: {len(shared)}")
print(f"Aerobic-only: {aerobic_only}")
print(f"Anaerobic-only: {anaerobic_only}")

# Step 7: Test custom medium
print("\n--- Testing Custom Medium ---")
custom_medium = {
    "EX_glc__D_e": 10.0,  # Glucose
    "EX_o2_e": 20.0,  # Oxygen
    "EX_nh4_e": 5.0,  # Ammonium
    "EX_pi_e": 5.0,  # Phosphate
    "EX_so4_e": 1.0,  # Sulfate
}

with model:
    model.medium = custom_medium
    custom_growth = model.optimize().objective_value
    print(f"Growth on custom medium: {custom_growth:.3f} /h")

    # Check which nutrients are limiting
    for exchange in custom_medium:
        with model:
            # Double the uptake rate
            medium_test = model.medium
            medium_test[exchange] *= 2
            model.medium = medium_test
            test_growth = model.optimize().objective_value
            improvement = (test_growth - custom_growth) / custom_growth * 100
            if improvement > 1:
                print(f"  {exchange}: +{improvement:.1f}% growth when doubled (LIMITING)")
```

## Workflow 3: Flux Space Exploration with Sampling

This workflow demonstrates comprehensive flux space analysis using FVA and sampling.

```python
from cobra.io import load_model
from cobra.flux_analysis import flux_variability_analysis
from cobra.sampling import sample
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load model
model = load_model("ecoli")
baseline = model.slim_optimize()
print(f"Baseline growth: {baseline:.3f} /h")

# Step 2: Perform FVA at optimal growth
print("\nPerforming FVA at optimal growth...")
fva_optimal = flux_variability_analysis(model, fraction_of_optimum=1.0)

# Step 3: Identify reactions with flexibility
fva_optimal["range"] = fva_optimal["maximum"] - fva_optimal["minimum"]
fva_optimal["relative_range"] = fva_optimal["range"] / (fva_optimal["maximum"].abs() + 1e-9)

flexible_reactions = fva_optimal[fva_optimal["range"] > 1.0].sort_values("range", ascending=False)
print(f"\nFound {len(flexible_reactions)} reactions with >1.0 mmol/gDW/h flexibility")
print("\nTop 10 most flexible reactions:")
print(flexible_reactions.head(10)[["minimum", "maximum", "range"]])

# Step 4: Perform FVA at suboptimal growth (90%)
print("\nPerforming FVA at 90% optimal growth...")
fva_suboptimal = flux_variability_analysis(model, fraction_of_optimum=0.9)
fva_suboptimal["range"] = fva_suboptimal["maximum"] - fva_suboptimal["minimum"]

# Step 5: Compare flexibility at different optimality levels
comparison = pd.DataFrame({
    "range_100": fva_optimal["range"],
    "range_90": fva_suboptimal["range"]
})
comparison["range_increase"] = comparison["range_90"] - comparison["range_100"]

print("\nReactions with largest increase in flexibility at suboptimality:")
print(comparison.sort_values("range_increase", ascending=False).head(10))

# Step 6: Perform flux sampling
print("\nPerforming flux sampling (1000 samples)...")
samples = sample(model, n=1000, method="optgp", processes=4)

# Step 7: Analyze sampling results for key reactions
key_reactions = ["PFK", "FBA", "TPI", "GAPD", "PGK", "PGM", "ENO", "PYK"]
available_key_reactions = [r for r in key_reactions if r in samples.columns]

if available_key_reactions:
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    axes = axes.flatten()

    for idx, reaction_id in enumerate(available_key_reactions[:8]):
        ax = axes[idx]
        samples[reaction_id].hist(bins=30, ax=ax, alpha=0.7)

        # Overlay FVA bounds
        fva_min = fva_optimal.loc[reaction_id, "minimum"]
        fva_max = fva_optimal.loc[reaction_id, "maximum"]
        ax.axvline(fva_min, color='r', linestyle='--', label='FVA min')
        ax.axvline(fva_max, color='r', linestyle='--', label='FVA max')

        ax.set_xlabel("Flux (mmol/gDW/h)")
        ax.set_ylabel("Frequency")
        ax.set_title(reaction_id)
        if idx == 0:
            ax.legend()

    plt.tight_layout()
    plt.savefig("flux_distributions.png", dpi=300)

# Step 8: Calculate correlation between reactions
print("\nCalculating flux correlations...")
correlation_matrix = samples[available_key_reactions].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, ax=ax, square=True)
ax.set_title("Flux Correlations Between Key Glycolysis Reactions")
plt.tight_layout()
plt.savefig("flux_correlations.png", dpi=300)

# Step 9: Identify reaction modules (highly correlated groups)
print("\nHighly correlated reaction pairs (|r| > 0.9):")
for i in range(len(correlation_matrix)):
    for j in range(i+1, len(correlation_matrix)):
        corr = correlation_matrix.iloc[i, j]
        if abs(corr) > 0.9:
            print(f"  {correlation_matrix.index[i]} <-> {correlation_matrix.columns[j]}: {corr:.3f}")

# Step 10: Export all results
fva_optimal.to_csv("fva_optimal.csv")
fva_suboptimal.to_csv("fva_suboptimal.csv")
samples.to_csv("flux_samples.csv")
correlation_matrix.to_csv("flux_correlations.csv")
```
## Workflow 4: Production Strain Design
|
||||
|
||||
This workflow demonstrates how to design a production strain for a target metabolite.
|
||||
|
||||
```python
|
||||
from cobra.io import load_model
|
||||
from cobra.flux_analysis import (
|
||||
production_envelope,
|
||||
flux_variability_analysis,
|
||||
single_gene_deletion
|
||||
)
|
||||
import pandas as pd
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Step 1: Define production target
|
||||
TARGET_METABOLITE = "EX_ac_e" # Acetate production
|
||||
CARBON_SOURCE = "EX_glc__D_e" # Glucose uptake
|
||||
|
||||
# Step 2: Load model
|
||||
model = load_model("ecoli")
|
||||
print(f"Designing strain for {TARGET_METABOLITE} production")
|
||||
|
||||
# Step 3: Calculate baseline production envelope
|
||||
print("\nCalculating production envelope...")
|
||||
envelope = production_envelope(
    model,
    reactions=[CARBON_SOURCE],
    objective=TARGET_METABOLITE,
    carbon_sources=CARBON_SOURCE
)

# Visualize production envelope
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(envelope[CARBON_SOURCE], envelope["mass_yield_maximum"], 'b-', label='Max yield')
ax.plot(envelope[CARBON_SOURCE], envelope["mass_yield_minimum"], 'r-', label='Min yield')
ax.set_xlabel("Glucose uptake (mmol/gDW/h)")
ax.set_ylabel("Acetate mass yield")
|
||||
ax.set_title("Wild-type Production Envelope")
|
||||
ax.legend()
|
||||
ax.grid(True, alpha=0.3)
|
||||
plt.tight_layout()
|
||||
plt.savefig("production_envelope_wildtype.png", dpi=300)
|
||||
|
||||
# Step 4: Maximize production while maintaining growth
|
||||
print("\nOptimizing for production...")
|
||||
|
||||
# Set minimum growth constraint
MIN_GROWTH = 0.1  # required growth rate in /h (roughly 10% of the wild-type optimum)

# NOTE: the biomass reaction ID below must match the loaded model
with model:
    # Record the wild-type maximum growth rate for reference
    max_growth = model.slim_optimize()
    print(f"Wild-type maximum growth: {max_growth:.3f} /h")

    # Constrain growth, then switch the objective to product formation
    model.reactions.BIOMASS_Ecoli_core_w_GAM.lower_bound = MIN_GROWTH
    model.objective = TARGET_METABOLITE
    model.objective_direction = "max"
    production_solution = model.optimize()
|
||||
|
||||
max_production = production_solution.objective_value
|
||||
print(f"Maximum production: {max_production:.3f} mmol/gDW/h")
|
||||
print(f"Growth rate: {production_solution.fluxes['BIOMASS_Ecoli_core_w_GAM']:.3f} /h")
|
||||
|
||||
# Step 5: Identify beneficial gene knockouts
|
||||
print("\nScreening for beneficial knockouts...")
|
||||
|
||||
# Apply the production objective and growth constraint for the knockout screen
model.reactions.BIOMASS_Ecoli_core_w_GAM.lower_bound = MIN_GROWTH
model.objective = TARGET_METABOLITE
model.objective_direction = "max"

knockout_results = []
for gene in model.genes:
    with model:
        gene.knock_out()
        try:
            solution = model.optimize()
            if solution.status == "optimal":
                production = solution.objective_value
                growth = solution.fluxes["BIOMASS_Ecoli_core_w_GAM"]

                if production > max_production * 1.05:  # >5% improvement
                    knockout_results.append({
                        "gene": gene.id,
                        "production": production,
                        "growth": growth,
                        "improvement": (production / max_production - 1) * 100
                    })
        except Exception:
            continue
|
||||
|
||||
knockout_df = pd.DataFrame(knockout_results)
|
||||
if len(knockout_df) > 0:
|
||||
knockout_df = knockout_df.sort_values("improvement", ascending=False)
|
||||
print(f"\nFound {len(knockout_df)} beneficial knockouts:")
|
||||
print(knockout_df.head(10))
|
||||
knockout_df.to_csv("beneficial_knockouts.csv", index=False)
|
||||
else:
|
||||
print("No beneficial single knockouts found")
|
||||
|
||||
# Step 6: Test combination of best knockouts
|
||||
if len(knockout_df) > 0:
|
||||
print("\nTesting knockout combinations...")
|
||||
top_genes = knockout_df.head(3)["gene"].tolist()
|
||||
|
||||
with model:
|
||||
for gene_id in top_genes:
|
||||
model.genes.get_by_id(gene_id).knock_out()
|
||||
|
||||
solution = model.optimize()
|
||||
if solution.status == "optimal":
|
||||
combined_production = solution.objective_value
|
||||
combined_growth = solution.fluxes["BIOMASS_Ecoli_core_w_GAM"]
|
||||
combined_improvement = (combined_production / max_production - 1) * 100
|
||||
|
||||
print(f"\nCombined knockout results:")
|
||||
print(f" Genes: {', '.join(top_genes)}")
|
||||
print(f" Production: {combined_production:.3f} mmol/gDW/h")
|
||||
print(f" Growth: {combined_growth:.3f} /h")
|
||||
print(f" Improvement: {combined_improvement:.1f}%")
|
||||
|
||||
# Step 7: Analyze flux distribution in production strain
|
||||
if len(knockout_df) > 0:
|
||||
best_gene = knockout_df.iloc[0]["gene"]
|
||||
|
||||
with model:
|
||||
model.genes.get_by_id(best_gene).knock_out()
|
||||
solution = model.optimize()
|
||||
|
||||
# Get active pathways
|
||||
active_fluxes = solution.fluxes[solution.fluxes.abs() > 0.1]
|
||||
active_fluxes.to_csv(f"production_strain_fluxes_{best_gene}_knockout.csv")
|
||||
|
||||
print(f"\nActive reactions in production strain: {len(active_fluxes)}")
|
||||
```
|
||||
|
||||
## Workflow 5: Model Validation and Debugging
|
||||
|
||||
This workflow shows systematic approaches to validate and debug metabolic models.
|
||||
|
||||
```python
|
||||
from cobra.io import load_model, read_sbml_model
|
||||
from cobra.flux_analysis import flux_variability_analysis
|
||||
import pandas as pd
|
||||
|
||||
# Step 1: Load model
|
||||
model = load_model("ecoli") # Or read_sbml_model("your_model.xml")
|
||||
print(f"Model: {model.id}")
|
||||
print(f"Reactions: {len(model.reactions)}")
|
||||
print(f"Metabolites: {len(model.metabolites)}")
|
||||
print(f"Genes: {len(model.genes)}")
|
||||
|
||||
# Step 2: Check model feasibility
|
||||
print("\n--- Feasibility Check ---")
|
||||
try:
    # error_value=None makes slim_optimize raise on an infeasible model
    # instead of silently returning NaN
    objective_value = model.slim_optimize(error_value=None)
    print(f"Model is feasible (objective: {objective_value:.3f})")
except Exception:
    print("Model is INFEASIBLE")
|
||||
print("Troubleshooting steps:")
|
||||
|
||||
# Check for blocked reactions
|
||||
from cobra.flux_analysis import find_blocked_reactions
|
||||
blocked = find_blocked_reactions(model)
|
||||
print(f" Blocked reactions: {len(blocked)}")
|
||||
if len(blocked) > 0:
|
||||
print(f" First 10 blocked: {list(blocked)[:10]}")
|
||||
|
||||
# Check medium
|
||||
print(f"\n Current medium: {model.medium}")
|
||||
|
||||
# Try opening all exchanges
|
||||
for reaction in model.exchanges:
|
||||
reaction.lower_bound = -1000
|
||||
|
||||
    try:
        objective_value = model.slim_optimize(error_value=None)
        print(f"\n Model feasible with open exchanges (objective: {objective_value:.3f})")
        print(" Issue: Medium constraints too restrictive")
    except Exception:
        print("\n Model still infeasible with open exchanges")
        print(" Issue: Structural problem (missing reactions, mass imbalance, etc.)")
|
||||
|
||||
# Step 3: Check mass and charge balance
|
||||
print("\n--- Mass and Charge Balance Check ---")
|
||||
unbalanced_reactions = []
|
||||
for reaction in model.reactions:
|
||||
try:
|
||||
balance = reaction.check_mass_balance()
|
||||
if balance:
|
||||
unbalanced_reactions.append({
|
||||
"reaction": reaction.id,
|
||||
"imbalance": balance
|
||||
})
|
||||
    except Exception:
        pass
|
||||
|
||||
if unbalanced_reactions:
|
||||
print(f"Found {len(unbalanced_reactions)} unbalanced reactions:")
|
||||
for item in unbalanced_reactions[:10]:
|
||||
print(f" {item['reaction']}: {item['imbalance']}")
|
||||
else:
|
||||
print("All reactions are mass balanced")
|
||||
|
||||
# Step 4: Identify dead-end metabolites
|
||||
print("\n--- Dead-end Metabolite Check ---")
|
||||
dead_end_metabolites = []
|
||||
for metabolite in model.metabolites:
|
||||
producing_reactions = [r for r in metabolite.reactions
|
||||
if r.metabolites[metabolite] > 0]
|
||||
consuming_reactions = [r for r in metabolite.reactions
|
||||
if r.metabolites[metabolite] < 0]
|
||||
|
||||
if len(producing_reactions) == 0 or len(consuming_reactions) == 0:
|
||||
dead_end_metabolites.append({
|
||||
"metabolite": metabolite.id,
|
||||
"producers": len(producing_reactions),
|
||||
"consumers": len(consuming_reactions)
|
||||
})
|
||||
|
||||
if dead_end_metabolites:
|
||||
print(f"Found {len(dead_end_metabolites)} dead-end metabolites:")
|
||||
for item in dead_end_metabolites[:10]:
|
||||
print(f" {item['metabolite']}: {item['producers']} producers, {item['consumers']} consumers")
|
||||
else:
|
||||
print("No dead-end metabolites found")
|
||||
|
||||
# Step 5: Check for duplicate reactions
|
||||
print("\n--- Duplicate Reaction Check ---")
|
||||
reaction_equations = {}
|
||||
duplicates = []
|
||||
|
||||
for reaction in model.reactions:
|
||||
equation = reaction.build_reaction_string()
|
||||
if equation in reaction_equations:
|
||||
duplicates.append({
|
||||
"reaction1": reaction_equations[equation],
|
||||
"reaction2": reaction.id,
|
||||
"equation": equation
|
||||
})
|
||||
else:
|
||||
reaction_equations[equation] = reaction.id
|
||||
|
||||
if duplicates:
|
||||
print(f"Found {len(duplicates)} duplicate reaction pairs:")
|
||||
for item in duplicates[:10]:
|
||||
print(f" {item['reaction1']} == {item['reaction2']}")
|
||||
else:
|
||||
print("No duplicate reactions found")
|
||||
|
||||
# Step 6: Identify orphan genes
|
||||
print("\n--- Orphan Gene Check ---")
|
||||
orphan_genes = [gene for gene in model.genes if len(gene.reactions) == 0]
|
||||
|
||||
if orphan_genes:
|
||||
print(f"Found {len(orphan_genes)} orphan genes (not associated with reactions):")
|
||||
print(f" First 10: {[g.id for g in orphan_genes[:10]]}")
|
||||
else:
|
||||
print("No orphan genes found")
|
||||
|
||||
# Step 7: Check for thermodynamically infeasible loops
|
||||
print("\n--- Thermodynamic Loop Check ---")
|
||||
fva_loopless = flux_variability_analysis(model, loopless=True)
|
||||
fva_standard = flux_variability_analysis(model)
|
||||
|
||||
loop_reactions = []
|
||||
for reaction_id in fva_standard.index:
|
||||
standard_range = fva_standard.loc[reaction_id, "maximum"] - fva_standard.loc[reaction_id, "minimum"]
|
||||
loopless_range = fva_loopless.loc[reaction_id, "maximum"] - fva_loopless.loc[reaction_id, "minimum"]
|
||||
|
||||
if standard_range > loopless_range + 0.1:
|
||||
loop_reactions.append({
|
||||
"reaction": reaction_id,
|
||||
"standard_range": standard_range,
|
||||
"loopless_range": loopless_range
|
||||
})
|
||||
|
||||
if loop_reactions:
|
||||
print(f"Found {len(loop_reactions)} reactions potentially involved in loops:")
|
||||
loop_df = pd.DataFrame(loop_reactions).sort_values("standard_range", ascending=False)
|
||||
print(loop_df.head(10))
|
||||
else:
|
||||
print("No thermodynamically infeasible loops detected")
|
||||
|
||||
# Step 8: Generate validation report
|
||||
print("\n--- Generating Validation Report ---")
|
||||
validation_report = {
|
||||
"model_id": model.id,
|
||||
"feasible": objective_value if 'objective_value' in locals() else None,
|
||||
"n_reactions": len(model.reactions),
|
||||
"n_metabolites": len(model.metabolites),
|
||||
"n_genes": len(model.genes),
|
||||
"n_unbalanced": len(unbalanced_reactions),
|
||||
"n_dead_ends": len(dead_end_metabolites),
|
||||
"n_duplicates": len(duplicates),
|
||||
"n_orphan_genes": len(orphan_genes),
|
||||
"n_loop_reactions": len(loop_reactions)
|
||||
}
|
||||
|
||||
validation_df = pd.DataFrame([validation_report])
|
||||
validation_df.to_csv("model_validation_report.csv", index=False)
|
||||
print("Validation report saved to model_validation_report.csv")
|
||||
```
|
||||
|
||||
These workflows provide comprehensive templates for common COBRApy tasks. Adapt them as needed for specific research questions and models.
|
||||
704
scientific-packages/datamol/SKILL.md
Normal file
@@ -0,0 +1,704 @@
|
||||
---
|
||||
name: datamol
|
||||
description: Comprehensive toolkit for molecular cheminformatics using datamol, a Pythonic layer built on RDKit. Use this skill when working with molecular structures, SMILES strings, chemical reactions, molecular descriptors, conformer generation, molecular clustering, scaffold analysis, or any cheminformatics tasks. This skill should be applied when users need to process molecules, analyze chemical properties, visualize molecular structures, fragment compounds, or perform molecular similarity calculations.
|
||||
---
|
||||
|
||||
# Datamol Cheminformatics Skill
|
||||
|
||||
## Overview
|
||||
|
||||
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
|
||||
|
||||
**Key capabilities**:
|
||||
- Molecular format conversion (SMILES, SELFIES, InChI)
|
||||
- Structure standardization and sanitization
|
||||
- Molecular descriptors and fingerprints
|
||||
- 3D conformer generation and analysis
|
||||
- Clustering and diversity selection
|
||||
- Scaffold and fragment analysis
|
||||
- Chemical reaction application
|
||||
- Visualization and alignment
|
||||
- Batch processing with parallelization
|
||||
- Cloud storage support via fsspec
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
Guide users to install datamol:
|
||||
|
||||
```bash
|
||||
# Via conda/mamba (recommended)
|
||||
conda install -c conda-forge datamol
|
||||
|
||||
# Via pip
|
||||
pip install datamol
|
||||
```
|
||||
|
||||
**Import convention**:
|
||||
```python
|
||||
import datamol as dm
|
||||
```
|
||||
|
||||
## Core Workflows
|
||||
|
||||
### 1. Basic Molecule Handling
|
||||
|
||||
**Creating molecules from SMILES**:
|
||||
```python
|
||||
import datamol as dm
|
||||
|
||||
# Single molecule
|
||||
mol = dm.to_mol("CCO") # Ethanol
|
||||
|
||||
# From list of SMILES
|
||||
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
|
||||
mols = [dm.to_mol(smi) for smi in smiles_list]
|
||||
|
||||
# Error handling
|
||||
mol = dm.to_mol("invalid_smiles") # Returns None
|
||||
if mol is None:
|
||||
print("Failed to parse SMILES")
|
||||
```
|
||||
|
||||
**Converting molecules to SMILES**:
|
||||
```python
|
||||
# Canonical SMILES
|
||||
smiles = dm.to_smiles(mol)
|
||||
|
||||
# Isomeric SMILES (includes stereochemistry)
|
||||
smiles = dm.to_smiles(mol, isomeric=True)
|
||||
|
||||
# Other formats
|
||||
inchi = dm.to_inchi(mol)
|
||||
inchikey = dm.to_inchikey(mol)
|
||||
selfies = dm.to_selfies(mol)
|
||||
```
|
||||
|
||||
**Standardization and sanitization** (always recommend for user-provided molecules):
|
||||
```python
|
||||
# Sanitize molecule
|
||||
mol = dm.sanitize_mol(mol)
|
||||
|
||||
# Full standardization (recommended for datasets)
|
||||
mol = dm.standardize_mol(
|
||||
mol,
|
||||
disconnect_metals=True,
|
||||
normalize=True,
|
||||
reionize=True
|
||||
)
|
||||
|
||||
# For SMILES strings directly
|
||||
clean_smiles = dm.standardize_smiles(smiles)
|
||||
```
|
||||
|
||||
### 2. Reading and Writing Molecular Files
|
||||
|
||||
Refer to `references/io_module.md` for comprehensive I/O documentation.
|
||||
|
||||
**Reading files**:
|
||||
```python
|
||||
# SDF files (most common in chemistry)
|
||||
df = dm.read_sdf("compounds.sdf", mol_column='mol')
|
||||
|
||||
# SMILES files
|
||||
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
|
||||
|
||||
# CSV with SMILES column
|
||||
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
|
||||
|
||||
# Excel files
|
||||
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
|
||||
|
||||
# Universal reader (auto-detects format)
|
||||
df = dm.open_df("file.sdf") # Works with .sdf, .csv, .xlsx, .parquet, .json
|
||||
```
|
||||
|
||||
**Writing files**:
|
||||
```python
|
||||
# Save as SDF
|
||||
dm.to_sdf(mols, "output.sdf")
|
||||
# Or from DataFrame
|
||||
dm.to_sdf(df, "output.sdf", mol_column="mol")
|
||||
|
||||
# Save as SMILES file
|
||||
dm.to_smi(mols, "output.smi")
|
||||
|
||||
# Excel with rendered molecule images
|
||||
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
|
||||
```
|
||||
|
||||
**Remote file support** (S3, GCS, HTTP):
|
||||
```python
|
||||
# Read from cloud storage
|
||||
df = dm.read_sdf("s3://bucket/compounds.sdf")
|
||||
df = dm.read_csv("https://example.com/data.csv")
|
||||
|
||||
# Write to cloud storage
|
||||
dm.to_sdf(mols, "s3://bucket/output.sdf")
|
||||
```
|
||||
|
||||
### 3. Molecular Descriptors and Properties
|
||||
|
||||
Refer to `references/descriptors_viz.md` for detailed descriptor documentation.
|
||||
|
||||
**Computing descriptors for a single molecule**:
|
||||
```python
|
||||
# Get standard descriptor set
|
||||
descriptors = dm.descriptors.compute_many_descriptors(mol)
|
||||
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
|
||||
# 'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
|
||||
```
|
||||
|
||||
**Batch descriptor computation** (recommended for datasets):
|
||||
```python
|
||||
# Compute for all molecules in parallel
|
||||
desc_df = dm.descriptors.batch_compute_many_descriptors(
|
||||
mols,
|
||||
n_jobs=-1, # Use all CPU cores
|
||||
progress=True # Show progress bar
|
||||
)
|
||||
```
|
||||
|
||||
**Specific descriptors**:
|
||||
```python
|
||||
# Aromaticity
|
||||
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
|
||||
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
|
||||
|
||||
# Stereochemistry
|
||||
n_stereo = dm.descriptors.n_stereo_centers(mol)
|
||||
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
|
||||
|
||||
# Flexibility
|
||||
n_rigid = dm.descriptors.n_rigid_bonds(mol)
|
||||
```
|
||||
|
||||
**Drug-likeness filtering (Lipinski's Rule of Five)**:
|
||||
```python
|
||||
# Filter compounds
|
||||
def is_druglike(mol):
|
||||
desc = dm.descriptors.compute_many_descriptors(mol)
|
||||
return (
|
||||
desc['mw'] <= 500 and
|
||||
desc['logp'] <= 5 and
|
||||
desc['hbd'] <= 5 and
|
||||
desc['hba'] <= 10
|
||||
)
|
||||
|
||||
druglike_mols = [mol for mol in mols if is_druglike(mol)]
|
||||
```
|
||||
|
||||
### 4. Molecular Fingerprints and Similarity
|
||||
|
||||
**Generating fingerprints**:
|
||||
```python
|
||||
# ECFP (Extended Connectivity Fingerprint, default)
|
||||
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
|
||||
|
||||
# Other fingerprint types
|
||||
fp_maccs = dm.to_fp(mol, fp_type='maccs')
|
||||
fp_topological = dm.to_fp(mol, fp_type='topological')
|
||||
fp_atompair = dm.to_fp(mol, fp_type='atompair')
|
||||
```
|
||||
|
||||
**Similarity calculations**:
|
||||
```python
|
||||
# Pairwise distances within a set
|
||||
distance_matrix = dm.pdist(mols, n_jobs=-1)
|
||||
|
||||
# Distances between two sets
|
||||
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
|
||||
|
||||
# Find most similar molecules
|
||||
from scipy.spatial.distance import squareform
|
||||
dist_matrix = squareform(dm.pdist(mols))
|
||||
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
|
||||
```
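
To pull the nearest neighbour of a particular molecule out of the square matrix built above (a small sketch; `query_idx` is an illustrative index):

```python
import numpy as np

query_idx = 0                  # illustrative: first molecule in `mols`
row = dist_matrix[query_idx].copy()
row[query_idx] = np.inf        # ignore the self-comparison
nearest_idx = int(np.argmin(row))
print(f"Closest to molecule {query_idx}: molecule {nearest_idx} "
      f"(Tanimoto similarity ~ {1 - row[nearest_idx]:.2f})")
```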
|
||||
|
||||
### 5. Clustering and Diversity Selection
|
||||
|
||||
Refer to `references/core_api.md` for clustering details.
|
||||
|
||||
**Butina clustering**:
|
||||
```python
|
||||
# Cluster molecules by structural similarity
|
||||
clusters = dm.cluster_mols(
|
||||
mols,
|
||||
cutoff=0.2, # Tanimoto distance threshold (0=identical, 1=completely different)
|
||||
n_jobs=-1 # Parallel processing
|
||||
)
|
||||
|
||||
# Each cluster is a list of molecule indices
|
||||
for i, cluster in enumerate(clusters):
|
||||
print(f"Cluster {i}: {len(cluster)} molecules")
|
||||
cluster_mols = [mols[idx] for idx in cluster]
|
||||
```
|
||||
|
||||
**Important**: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+.
|
||||
|
||||
**Diversity selection**:
|
||||
```python
|
||||
# Pick diverse subset
|
||||
diverse_mols = dm.pick_diverse(
|
||||
mols,
|
||||
npick=100 # Select 100 diverse molecules
|
||||
)
|
||||
|
||||
# Pick cluster centroids
|
||||
centroids = dm.pick_centroids(
|
||||
mols,
|
||||
npick=50 # Select 50 representative molecules
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Scaffold Analysis
|
||||
|
||||
Refer to `references/fragments_scaffolds.md` for complete scaffold documentation.
|
||||
|
||||
**Extracting Murcko scaffolds**:
|
||||
```python
|
||||
# Get Bemis-Murcko scaffold (core structure)
|
||||
scaffold = dm.to_scaffold_murcko(mol)
|
||||
scaffold_smiles = dm.to_smiles(scaffold)
|
||||
```
|
||||
|
||||
**Scaffold-based analysis**:
|
||||
```python
|
||||
# Group compounds by scaffold
|
||||
from collections import Counter
|
||||
|
||||
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
|
||||
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
|
||||
|
||||
# Count scaffold frequency
|
||||
scaffold_counts = Counter(scaffold_smiles)
|
||||
most_common = scaffold_counts.most_common(10)
|
||||
|
||||
# Create scaffold-to-molecules mapping
|
||||
scaffold_groups = {}
|
||||
for mol, scaf_smi in zip(mols, scaffold_smiles):
|
||||
if scaf_smi not in scaffold_groups:
|
||||
scaffold_groups[scaf_smi] = []
|
||||
scaffold_groups[scaf_smi].append(mol)
|
||||
```
|
||||
|
||||
**Scaffold-based train/test splitting** (for ML):
|
||||
```python
|
||||
# Ensure train and test sets have different scaffolds
|
||||
scaffold_to_mols = {}
|
||||
for mol, scaf in zip(mols, scaffold_smiles):
|
||||
if scaf not in scaffold_to_mols:
|
||||
scaffold_to_mols[scaf] = []
|
||||
scaffold_to_mols[scaf].append(mol)
|
||||
|
||||
# Split scaffolds into train/test
|
||||
import random
|
||||
scaffolds = list(scaffold_to_mols.keys())
|
||||
random.shuffle(scaffolds)
|
||||
split_idx = int(0.8 * len(scaffolds))
|
||||
train_scaffolds = scaffolds[:split_idx]
|
||||
test_scaffolds = scaffolds[split_idx:]
|
||||
|
||||
# Get molecules for each split
|
||||
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
|
||||
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
|
||||
```
|
||||
|
||||
### 7. Molecular Fragmentation
|
||||
|
||||
Refer to `references/fragments_scaffolds.md` for fragmentation details.
|
||||
|
||||
**BRICS fragmentation** (16 bond types):
|
||||
```python
|
||||
# Fragment molecule
|
||||
fragments = dm.fragment.brics(mol)
|
||||
# Returns: set of fragment SMILES with attachment points like '[1*]CCN'
|
||||
```
|
||||
|
||||
**RECAP fragmentation** (11 bond types):
|
||||
```python
|
||||
fragments = dm.fragment.recap(mol)
|
||||
```
|
||||
|
||||
**Fragment analysis**:
|
||||
```python
|
||||
# Find common fragments across compound library
|
||||
from collections import Counter
|
||||
|
||||
all_fragments = []
|
||||
for mol in mols:
|
||||
frags = dm.fragment.brics(mol)
|
||||
all_fragments.extend(frags)
|
||||
|
||||
fragment_counts = Counter(all_fragments)
|
||||
common_frags = fragment_counts.most_common(20)
|
||||
|
||||
# Fragment-based scoring
|
||||
def fragment_score(mol, reference_fragments):
|
||||
mol_frags = dm.fragment.brics(mol)
|
||||
overlap = mol_frags.intersection(reference_fragments)
|
||||
return len(overlap) / len(mol_frags) if mol_frags else 0
|
||||
```
|
||||
|
||||
### 8. 3D Conformer Generation
|
||||
|
||||
Refer to `references/conformers_module.md` for detailed conformer documentation.
|
||||
|
||||
**Generating conformers**:
|
||||
```python
|
||||
# Generate 3D conformers
|
||||
mol_3d = dm.conformers.generate(
|
||||
mol,
|
||||
n_confs=50, # Number to generate (auto if None)
|
||||
rms_cutoff=0.5, # Filter similar conformers (Ångströms)
|
||||
minimize_energy=True, # Minimize with UFF force field
|
||||
method='ETKDGv3' # Embedding method (recommended)
|
||||
)
|
||||
|
||||
# Access conformers
|
||||
n_conformers = mol_3d.GetNumConformers()
|
||||
conf = mol_3d.GetConformer(0) # Get first conformer
|
||||
positions = conf.GetPositions() # Nx3 array of atom coordinates
|
||||
```
|
||||
|
||||
**Conformer clustering**:
|
||||
```python
|
||||
# Cluster conformers by RMSD
|
||||
clusters = dm.conformers.cluster(
|
||||
mol_3d,
|
||||
rms_cutoff=1.0,
|
||||
centroids=False
|
||||
)
|
||||
|
||||
# Get representative conformers
|
||||
centroids = dm.conformers.return_centroids(mol_3d, clusters)
|
||||
```
|
||||
|
||||
**SASA calculation**:
|
||||
```python
|
||||
# Calculate solvent accessible surface area
|
||||
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
|
||||
|
||||
# Access SASA from conformer properties
|
||||
conf = mol_3d.GetConformer(0)
|
||||
sasa = conf.GetDoubleProp('rdkit_free_sasa')
|
||||
```
|
||||
|
||||
### 9. Visualization
|
||||
|
||||
Refer to `references/descriptors_viz.md` for visualization documentation.
|
||||
|
||||
**Basic molecule grid**:
|
||||
```python
|
||||
# Visualize molecules
|
||||
dm.viz.to_image(
|
||||
mols[:20],
|
||||
legends=[dm.to_smiles(m) for m in mols[:20]],
|
||||
n_cols=5,
|
||||
mol_size=(300, 300)
|
||||
)
|
||||
|
||||
# Save to file
|
||||
dm.viz.to_image(mols, outfile="molecules.png")
|
||||
|
||||
# SVG for publications
|
||||
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
|
||||
```
|
||||
|
||||
**Aligned visualization** (for SAR analysis):
|
||||
```python
|
||||
# Align molecules by common substructure
|
||||
dm.viz.to_image(
|
||||
similar_mols,
|
||||
align=True, # Enable MCS alignment
|
||||
legends=activity_labels,
|
||||
n_cols=4
|
||||
)
|
||||
```
|
||||
|
||||
**Highlighting substructures**:
|
||||
```python
|
||||
# Highlight specific atoms and bonds
|
||||
dm.viz.to_image(
|
||||
mol,
|
||||
highlight_atom=[0, 1, 2, 3], # Atom indices
|
||||
highlight_bond=[0, 1, 2] # Bond indices
|
||||
)
|
||||
```
|
||||
|
||||
**Conformer visualization**:
|
||||
```python
|
||||
# Display multiple conformers
|
||||
dm.viz.conformers(
|
||||
mol_3d,
|
||||
n_confs=10,
|
||||
align_conf=True,
|
||||
n_cols=3
|
||||
)
|
||||
```
|
||||
|
||||
### 10. Chemical Reactions
|
||||
|
||||
Refer to `references/reactions_data.md` for reactions documentation.
|
||||
|
||||
**Applying reactions**:
|
||||
```python
|
||||
from rdkit.Chem import rdChemReactions
|
||||
|
||||
# Define reaction from SMARTS
|
||||
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
|
||||
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
|
||||
|
||||
# Apply to molecule
|
||||
reactant = dm.to_mol("CC(=O)O") # Acetic acid
|
||||
product = dm.reactions.apply_reaction(
|
||||
rxn,
|
||||
(reactant,),
|
||||
sanitize=True
|
||||
)
|
||||
|
||||
# Convert to SMILES
|
||||
product_smiles = dm.to_smiles(product)
|
||||
```
|
||||
|
||||
**Batch reaction application**:
|
||||
```python
|
||||
# Apply reaction to library
|
||||
products = []
|
||||
for mol in reactant_mols:
|
||||
try:
|
||||
prod = dm.reactions.apply_reaction(rxn, (mol,))
|
||||
if prod is not None:
|
||||
products.append(prod)
|
||||
except Exception as e:
|
||||
print(f"Reaction failed: {e}")
|
||||
```
|
||||
|
||||
## Parallelization
|
||||
|
||||
Datamol includes built-in parallelization for many operations. Use `n_jobs` parameter:
|
||||
- `n_jobs=1`: Sequential (no parallelization)
|
||||
- `n_jobs=-1`: Use all available CPU cores
|
||||
- `n_jobs=4`: Use 4 cores
|
||||
|
||||
**Functions supporting parallelization**:
|
||||
- `dm.read_sdf(..., n_jobs=-1)`
|
||||
- `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)`
|
||||
- `dm.cluster_mols(..., n_jobs=-1)`
|
||||
- `dm.pdist(..., n_jobs=-1)`
|
||||
- `dm.conformers.sasa(..., n_jobs=-1)`
|
||||
|
||||
**Progress bars**: Many batch operations support `progress=True` parameter.
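
For example, a typical batch run combining `n_jobs` and `progress` (a small sketch; the file name is illustrative):

```python
import datamol as dm

# Read an SDF in parallel, then compute descriptors with a progress bar
df = dm.read_sdf("compounds.sdf", n_jobs=-1)
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df["mol"].tolist(),
    n_jobs=-1,      # use all CPU cores
    progress=True,  # show a progress bar
)
```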
|
||||
|
||||
## Common Workflows and Patterns
|
||||
|
||||
### Complete Pipeline: Data Loading → Filtering → Analysis
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
import pandas as pd
|
||||
|
||||
# 1. Load molecules
|
||||
df = dm.read_sdf("compounds.sdf")
|
||||
|
||||
# 2. Standardize
|
||||
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m is not None else None)
|
||||
df = df[df['mol'].notna()] # Remove failed molecules
|
||||
|
||||
# 3. Compute descriptors
|
||||
desc_df = dm.descriptors.batch_compute_many_descriptors(
|
||||
df['mol'].tolist(),
|
||||
n_jobs=-1,
|
||||
progress=True
|
||||
)
|
||||
|
||||
# 4. Filter by drug-likeness
|
||||
druglike = (
|
||||
(desc_df['mw'] <= 500) &
|
||||
(desc_df['logp'] <= 5) &
|
||||
(desc_df['hbd'] <= 5) &
|
||||
(desc_df['hba'] <= 10)
|
||||
)
|
||||
filtered_df = df[druglike]
|
||||
|
||||
# 5. Cluster and select diverse subset
|
||||
diverse_mols = dm.pick_diverse(
|
||||
filtered_df['mol'].tolist(),
|
||||
npick=100
|
||||
)
|
||||
|
||||
# 6. Visualize results
|
||||
dm.viz.to_image(
|
||||
diverse_mols,
|
||||
legends=[dm.to_smiles(m) for m in diverse_mols],
|
||||
outfile="diverse_compounds.png",
|
||||
n_cols=10
|
||||
)
|
||||
```
|
||||
|
||||
### Structure-Activity Relationship (SAR) Analysis
|
||||
|
||||
```python
|
||||
# Group by scaffold
|
||||
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
|
||||
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
|
||||
|
||||
# Create DataFrame with activities
|
||||
sar_df = pd.DataFrame({
|
||||
'mol': mols,
|
||||
'scaffold': scaffold_smiles,
|
||||
'activity': activities # User-provided activity data
|
||||
})
|
||||
|
||||
# Analyze each scaffold series
|
||||
for scaffold, group in sar_df.groupby('scaffold'):
|
||||
if len(group) >= 3: # Need multiple examples
|
||||
print(f"\nScaffold: {scaffold}")
|
||||
print(f"Count: {len(group)}")
|
||||
print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
|
||||
|
||||
# Visualize with activities as legends
|
||||
dm.viz.to_image(
|
||||
group['mol'].tolist(),
|
||||
legends=[f"Activity: {act:.2f}" for act in group['activity']],
|
||||
align=True # Align by common substructure
|
||||
)
|
||||
```
|
||||
|
||||
### Virtual Screening Pipeline
|
||||
|
||||
```python
|
||||
# 1. Calculate Tanimoto distances between the query actives and the library
#    (dm.cdist fingerprints the molecules internally, so no explicit dm.to_fp step is needed)
import numpy as np

distances = dm.cdist(query_actives, library_mols, n_jobs=-1)

# 2. For each library molecule, take the minimum distance to any query active
min_distances = distances.min(axis=0)

# 3. Convert distance to similarity (Tanimoto distance = 1 - Tanimoto similarity)
similarities = 1 - min_distances
|
||||
|
||||
# 4. Rank and select top hits
|
||||
top_indices = np.argsort(similarities)[::-1][:100] # Top 100
|
||||
top_hits = [library_mols[i] for i in top_indices]
|
||||
top_scores = [similarities[i] for i in top_indices]
|
||||
|
||||
# 5. Visualize hits
|
||||
dm.viz.to_image(
|
||||
top_hits[:20],
|
||||
legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
|
||||
outfile="screening_hits.png"
|
||||
)
|
||||
```
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
For detailed API documentation, consult these reference files:
|
||||
|
||||
- **`references/core_api.md`**: Core namespace functions (conversions, standardization, fingerprints, clustering)
|
||||
- **`references/io_module.md`**: File I/O operations (read/write SDF, CSV, Excel, remote files)
|
||||
- **`references/conformers_module.md`**: 3D conformer generation, clustering, SASA calculations
|
||||
- **`references/descriptors_viz.md`**: Molecular descriptors and visualization functions
|
||||
- **`references/fragments_scaffolds.md`**: Scaffold extraction, BRICS/RECAP fragmentation
|
||||
- **`references/reactions_data.md`**: Chemical reactions and toy datasets
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always standardize molecules** from external sources:
|
||||
```python
|
||||
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
|
||||
```
|
||||
|
||||
2. **Check for None values** after molecule parsing:
|
||||
```python
|
||||
mol = dm.to_mol(smiles)
|
||||
if mol is None:
|
||||
# Handle invalid SMILES
|
||||
```
|
||||
|
||||
3. **Use parallel processing** for large datasets:
|
||||
```python
|
||||
result = dm.operation(..., n_jobs=-1, progress=True)
|
||||
```
|
||||
|
||||
4. **Leverage fsspec** for cloud storage:
|
||||
```python
|
||||
df = dm.read_sdf("s3://bucket/compounds.sdf")
|
||||
```
|
||||
|
||||
5. **Use appropriate fingerprints** for similarity:
|
||||
- ECFP (Morgan): General purpose, structural similarity
|
||||
- MACCS: Fast, smaller feature space
|
||||
- Atom pairs: Considers atom pairs and distances
|
||||
|
||||
6. **Consider scale limitations**:
|
||||
- Butina clustering: ~1,000 molecules (full distance matrix)
|
||||
- For larger datasets: Use diversity selection or hierarchical methods
|
||||
|
||||
7. **Scaffold splitting for ML**: Ensure proper train/test separation by scaffold
|
||||
|
||||
8. **Align molecules** when visualizing SAR series
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
# Safe molecule creation
|
||||
def safe_to_mol(smiles):
|
||||
try:
|
||||
mol = dm.to_mol(smiles)
|
||||
if mol is not None:
|
||||
mol = dm.standardize_mol(mol)
|
||||
return mol
|
||||
except Exception as e:
|
||||
print(f"Failed to process {smiles}: {e}")
|
||||
return None
|
||||
|
||||
# Safe batch processing
|
||||
valid_mols = []
|
||||
for smiles in smiles_list:
|
||||
mol = safe_to_mol(smiles)
|
||||
if mol is not None:
|
||||
valid_mols.append(mol)
|
||||
```
|
||||
|
||||
## Integration with Machine Learning
|
||||
|
||||
```python
|
||||
# Feature generation
|
||||
X = np.array([dm.to_fp(mol) for mol in mols])
|
||||
|
||||
# Or descriptors
|
||||
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
|
||||
X = desc_df.values
|
||||
|
||||
# Train model
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
model = RandomForestRegressor()
|
||||
model.fit(X, y_target)
|
||||
|
||||
# Predict
|
||||
predictions = model.predict(X_test)
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Issue**: Molecule parsing fails
|
||||
- **Solution**: Use `dm.standardize_smiles()` first or try `dm.fix_mol()`
|
||||
|
||||
**Issue**: Memory errors with clustering
|
||||
- **Solution**: Use `dm.pick_diverse()` instead of full clustering for large sets
|
||||
|
||||
**Issue**: Slow conformer generation
|
||||
- **Solution**: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers
|
||||
|
||||
**Issue**: Remote file access fails
|
||||
- **Solution**: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)
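
The parsing fallback suggested above can be wrapped into a small helper (a hedged sketch; it assumes `dm.fix_mol` returns a molecule, as described in the core API reference):

```python
import datamol as dm

def robust_to_mol(smiles):
    """Parse a SMILES, falling back to standardization and fix_mol on failure."""
    mol = dm.to_mol(smiles)
    if mol is None:
        try:
            mol = dm.to_mol(dm.standardize_smiles(smiles))
        except Exception:
            return None
    return dm.fix_mol(mol) if mol is not None else None
```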
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Datamol Documentation**: https://docs.datamol.io/
|
||||
- **RDKit Documentation**: https://www.rdkit.org/docs/
|
||||
- **GitHub Repository**: https://github.com/datamol-io/datamol
|
||||
131
scientific-packages/datamol/references/conformers_module.md
Normal file
@@ -0,0 +1,131 @@
|
||||
# Datamol Conformers Module Reference
|
||||
|
||||
The `datamol.conformers` module provides tools for generating and analyzing 3D molecular conformations.
|
||||
|
||||
## Conformer Generation
|
||||
|
||||
### `dm.conformers.generate(mol, n_confs=None, rms_cutoff=None, minimize_energy=True, method='ETKDGv3', add_hs=True, ...)`
|
||||
Generate 3D molecular conformers.
|
||||
- **Parameters**:
|
||||
- `mol`: Input molecule
|
||||
- `n_confs`: Number of conformers to generate (auto-determined based on rotatable bonds if None)
|
||||
- `rms_cutoff`: RMS threshold in Ångströms for filtering similar conformers (removes duplicates)
|
||||
- `minimize_energy`: Apply UFF energy minimization (default: True)
|
||||
- `method`: Embedding method - options:
|
||||
- `'ETDG'` - Experimental Torsion Distance Geometry
|
||||
- `'ETKDG'` - ETDG with additional basic knowledge
|
||||
- `'ETKDGv2'` - Enhanced version 2
|
||||
- `'ETKDGv3'` - Enhanced version 3 (default, recommended)
|
||||
- `add_hs`: Add hydrogens before embedding (default: True, critical for quality)
|
||||
- `random_seed`: Set for reproducibility
|
||||
- **Returns**: Molecule with embedded conformers
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("CCO")
|
||||
mol_3d = dm.conformers.generate(mol, n_confs=10, rms_cutoff=0.5)
|
||||
conformers = mol_3d.GetConformers() # Access all conformers
|
||||
```
|
||||
|
||||
## Conformer Clustering
|
||||
|
||||
### `dm.conformers.cluster(mol, rms_cutoff=1.0, already_aligned=False, centroids=False)`
|
||||
Group conformers by RMS distance.
|
||||
- **Parameters**:
|
||||
- `rms_cutoff`: Clustering threshold in Ångströms (default: 1.0)
|
||||
- `already_aligned`: Whether conformers are pre-aligned
|
||||
- `centroids`: Return centroid conformers (True) or cluster groups (False)
|
||||
- **Returns**: Cluster information or centroid conformers
|
||||
- **Use case**: Identify distinct conformational families
|
||||
|
||||
### `dm.conformers.return_centroids(mol, conf_clusters, centroids=True)`
|
||||
Extract representative conformers from clusters.
|
||||
- **Parameters**:
|
||||
- `conf_clusters`: Sequence of cluster indices from `cluster()`
|
||||
- `centroids`: Return single molecule (True) or list of molecules (False)
|
||||
- **Returns**: Centroid conformer(s)
|
||||
|
||||
## Conformer Analysis
|
||||
|
||||
### `dm.conformers.rmsd(mol)`
|
||||
Calculate pairwise RMSD matrix across all conformers.
|
||||
- **Requirements**: Minimum 2 conformers
|
||||
- **Returns**: NxN matrix of RMSD values
|
||||
- **Use case**: Quantify conformer diversity
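- **Example** (a minimal sketch; the molecule and conformer count are illustrative):
```python
import numpy as np
import datamol as dm

mol_3d = dm.conformers.generate(dm.to_mol("CCCCO"), n_confs=20)
rmsd_matrix = dm.conformers.rmsd(mol_3d)  # NxN matrix, as described above
print("Mean pairwise RMSD:", float(np.mean(rmsd_matrix)))
```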
|
||||
|
||||
### `dm.conformers.sasa(mol, n_jobs=1, ...)`
|
||||
Calculate Solvent Accessible Surface Area (SASA) using FreeSASA.
|
||||
- **Parameters**:
|
||||
- `n_jobs`: Parallelization for multiple conformers
|
||||
- **Returns**: Array of SASA values (one per conformer)
|
||||
- **Storage**: Values stored in each conformer as property `'rdkit_free_sasa'`
|
||||
- **Example**:
|
||||
```python
|
||||
sasa_values = dm.conformers.sasa(mol_3d)
|
||||
# Or access from conformer properties
|
||||
conf = mol_3d.GetConformer(0)
|
||||
sasa = conf.GetDoubleProp('rdkit_free_sasa')
|
||||
```
|
||||
|
||||
## Low-Level Conformer Manipulation
|
||||
|
||||
### `dm.conformers.center_of_mass(mol, conf_id=-1, use_atoms=True, round_coord=None)`
|
||||
Calculate molecular center.
|
||||
- **Parameters**:
|
||||
- `conf_id`: Conformer index (-1 for first conformer)
|
||||
- `use_atoms`: Use atomic masses (True) or geometric center (False)
|
||||
- `round_coord`: Decimal precision for rounding
|
||||
- **Returns**: 3D coordinates of center
|
||||
- **Use case**: Centering molecules for visualization or alignment
|
||||
|
||||
### `dm.conformers.get_coords(mol, conf_id=-1)`
|
||||
Retrieve atomic coordinates from a conformer.
|
||||
- **Returns**: Nx3 numpy array of atomic positions
|
||||
- **Example**:
|
||||
```python
|
||||
positions = dm.conformers.get_coords(mol_3d, conf_id=0)
|
||||
# positions.shape: (num_atoms, 3)
|
||||
```
|
||||
|
||||
### `dm.conformers.translate(mol, conf_id=-1, transform_matrix=None)`
|
||||
Reposition conformer using transformation matrix.
|
||||
- **Modification**: Operates in-place
|
||||
- **Use case**: Aligning or repositioning molecules
|
||||
|
||||
## Workflow Example
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
|
||||
# 1. Create molecule and generate conformers
|
||||
mol = dm.to_mol("CC(C)CCO") # Isopentanol
|
||||
mol_3d = dm.conformers.generate(
|
||||
mol,
|
||||
n_confs=50, # Generate 50 initial conformers
|
||||
rms_cutoff=0.5, # Filter similar conformers
|
||||
minimize_energy=True # Minimize energy
|
||||
)
|
||||
|
||||
# 2. Analyze conformers
|
||||
n_conformers = mol_3d.GetNumConformers()
|
||||
print(f"Generated {n_conformers} unique conformers")
|
||||
|
||||
# 3. Calculate SASA
|
||||
sasa_values = dm.conformers.sasa(mol_3d)
|
||||
|
||||
# 4. Cluster conformers
|
||||
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0, centroids=False)
|
||||
|
||||
# 5. Get representative conformers
|
||||
centroids = dm.conformers.return_centroids(mol_3d, clusters)
|
||||
|
||||
# 6. Access 3D coordinates
|
||||
coords = dm.conformers.get_coords(mol_3d, conf_id=0)
|
||||
```
|
||||
|
||||
## Key Concepts
|
||||
|
||||
- **Distance Geometry**: Method for generating 3D structures from connectivity information
|
||||
- **ETKDG**: Uses experimental torsion angle preferences and additional chemical knowledge
|
||||
- **RMS Cutoff**: Lower values = more unique conformers; higher values = fewer, more distinct conformers
|
||||
- **Energy Minimization**: Relaxes structures to nearest local energy minimum
|
||||
- **Hydrogens**: Critical for accurate 3D geometry - always include during embedding
|
||||
130
scientific-packages/datamol/references/core_api.md
Normal file
@@ -0,0 +1,130 @@
|
||||
# Datamol Core API Reference
|
||||
|
||||
This document covers the main functions available in the datamol namespace.
|
||||
|
||||
## Molecule Creation and Conversion
|
||||
|
||||
### `to_mol(mol, ...)`
|
||||
Convert SMILES string or other molecular representations to RDKit molecule objects.
|
||||
- **Parameters**: Accepts SMILES strings, InChI, or other molecular formats
|
||||
- **Returns**: `rdkit.Chem.Mol` object
|
||||
- **Common usage**: `mol = dm.to_mol("CCO")`
|
||||
|
||||
### `from_inchi(inchi)`
|
||||
Convert InChI string to molecule object.
|
||||
|
||||
### `from_smarts(smarts)`
|
||||
Convert SMARTS pattern to molecule object.
|
||||
|
||||
### `from_selfies(selfies)`
|
||||
Convert SELFIES string to molecule object.
|
||||
|
||||
### `copy_mol(mol)`
|
||||
Create a copy of a molecule object to avoid modifying the original.
|
||||
|
||||
## Molecule Export
|
||||
|
||||
### `to_smiles(mol, ...)`
|
||||
Convert molecule object to SMILES string.
|
||||
- **Common parameters**: `canonical=True`, `isomeric=True`
|
||||
|
||||
### `to_inchi(mol, ...)`
|
||||
Convert molecule to InChI string representation.
|
||||
|
||||
### `to_inchikey(mol)`
|
||||
Convert molecule to InChI key (fixed-length hash).
|
||||
|
||||
### `to_smarts(mol)`
|
||||
Convert molecule to SMARTS pattern.
|
||||
|
||||
### `to_selfies(mol)`
|
||||
Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
|
||||
|
||||
## Sanitization and Standardization
|
||||
|
||||
### `sanitize_mol(mol, ...)`
|
||||
Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
|
||||
- **Purpose**: Fix common molecular structure issues
|
||||
- **Returns**: Sanitized molecule or None if sanitization fails
|
||||
|
||||
### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`
|
||||
Apply comprehensive standardization procedures including:
|
||||
- Metal disconnection
|
||||
- Normalization (charge corrections)
|
||||
- Reionization
|
||||
- Fragment handling (largest fragment selection)
|
||||
|
||||
### `standardize_smiles(smiles, ...)`
|
||||
Apply SMILES standardization procedures directly to a SMILES string.
|
||||
|
||||
### `fix_mol(mol)`
|
||||
Attempt to fix molecular structure issues automatically.
|
||||
|
||||
### `fix_valence(mol)`
|
||||
Correct valence errors in molecular structures.
|
||||
|
||||
## Molecular Properties
|
||||
|
||||
### `reorder_atoms(mol, ...)`
|
||||
Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
|
||||
- **Purpose**: Maintain reproducible feature generation
|
||||
|
||||
### `remove_hs(mol, ...)`
|
||||
Remove hydrogen atoms from molecular structure.
|
||||
|
||||
### `add_hs(mol, ...)`
|
||||
Add explicit hydrogen atoms to molecular structure.
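- **Example** (a short sketch chaining the three helpers above):
```python
mol = dm.to_mol("CCO")
mol_h = dm.add_hs(mol)           # add explicit hydrogens
mol_heavy = dm.remove_hs(mol_h)  # strip them again
mol_ordered = dm.reorder_atoms(mol_heavy)  # consistent atom ordering
```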
|
||||
|
||||
## Fingerprints and Similarity
|
||||
|
||||
### `to_fp(mol, fp_type='ecfp', ...)`
|
||||
Generate molecular fingerprints for similarity calculations.
|
||||
- **Fingerprint types**:
|
||||
- `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
|
||||
- `'fcfp'` - Functional Connectivity Fingerprints
|
||||
- `'maccs'` - MACCS keys
|
||||
- `'topological'` - Topological fingerprints
|
||||
- `'atompair'` - Atom pair fingerprints
|
||||
- **Common parameters**: `n_bits`, `radius`
|
||||
- **Returns**: Numpy array or RDKit fingerprint object
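- **Example** (a short sketch; assumes the default array-like return described above):
```python
mol = dm.to_mol("CCO")
fp = dm.to_fp(mol, fp_type="ecfp", radius=2, n_bits=2048)
print(len(fp))  # 2048 bits
```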
|
||||
|
||||
### `pdist(mols, ...)`
|
||||
Calculate pairwise Tanimoto distances between all molecules in a list.
|
||||
- **Supports**: Parallel processing via `n_jobs` parameter
|
||||
- **Returns**: Distance matrix
|
||||
|
||||
### `cdist(mols1, mols2, ...)`
|
||||
Calculate Tanimoto distances between two sets of molecules.
|
||||
|
||||
## Clustering and Diversity
|
||||
|
||||
### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
|
||||
Cluster molecules using Butina clustering algorithm.
|
||||
- **Parameters**:
|
||||
- `cutoff`: Distance threshold (default 0.2)
|
||||
- `feature_fn`: Custom function for molecular features
|
||||
- `n_jobs`: Parallelization (-1 for all cores)
|
||||
- **Important**: Builds full distance matrix - suitable for ~1000 structures, not for 10,000+
|
||||
- **Returns**: List of clusters (each cluster is a list of molecule indices)
|
||||
|
||||
### `pick_diverse(mols, npick, ...)`
|
||||
Select diverse subset of molecules based on fingerprint diversity.
|
||||
|
||||
### `pick_centroids(mols, npick, ...)`
|
||||
Select centroid molecules representing clusters.
|
||||
|
||||
## Graph Operations
|
||||
|
||||
### `to_graph(mol)`
|
||||
Convert molecule to graph representation for graph-based analysis.
|
||||
|
||||
### `get_all_path_between(mol, start, end)`
|
||||
Find all paths between two atoms in molecular structure.
|
||||
|
||||
## DataFrame Integration
|
||||
|
||||
### `to_df(mols, smiles_column='smiles', mol_column='mol')`
|
||||
Convert list of molecules to pandas DataFrame.
|
||||
|
||||
### `from_df(df, smiles_column='smiles', mol_column='mol')`
|
||||
Convert pandas DataFrame to list of molecules.
|
||||
195
scientific-packages/datamol/references/descriptors_viz.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Datamol Descriptors and Visualization Reference
|
||||
|
||||
## Descriptors Module (`datamol.descriptors`)
|
||||
|
||||
The descriptors module provides tools for computing molecular properties and descriptors.
|
||||
|
||||
### Specialized Descriptor Functions
|
||||
|
||||
#### `dm.descriptors.n_aromatic_atoms(mol)`
|
||||
Calculate the number of aromatic atoms.
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Aromaticity analysis
|
||||
|
||||
#### `dm.descriptors.n_aromatic_atoms_proportion(mol)`
|
||||
Calculate ratio of aromatic atoms to total heavy atoms.
|
||||
- **Returns**: Float between 0 and 1
|
||||
- **Use case**: Quantifying aromatic character
|
||||
|
||||
#### `dm.descriptors.n_charged_atoms(mol)`
|
||||
Count atoms with nonzero formal charge.
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Charge distribution analysis
|
||||
|
||||
#### `dm.descriptors.n_rigid_bonds(mol)`
|
||||
Count non-rotatable bonds (neither single bonds nor ring bonds).
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Molecular flexibility assessment
|
||||
|
||||
#### `dm.descriptors.n_stereo_centers(mol)`
|
||||
Count stereogenic centers (chiral centers).
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Stereochemistry analysis
|
||||
|
||||
#### `dm.descriptors.n_stereo_centers_unspecified(mol)`
|
||||
Count stereocenters lacking stereochemical specification.
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Identifying incomplete stereochemistry
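
A quick sketch calling several of the counts above on one molecule (the SMILES is illustrative):

```python
import datamol as dm

mol = dm.to_mol("NC(Cc1ccccc1)C(=O)O")  # phenylalanine
print(dm.descriptors.n_aromatic_atoms(mol))             # 6 (benzene ring)
print(dm.descriptors.n_aromatic_atoms_proportion(mol))
print(dm.descriptors.n_charged_atoms(mol))
print(dm.descriptors.n_stereo_centers(mol))
```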
|
||||
|
||||
### Batch Descriptor Computation
|
||||
|
||||
#### `dm.descriptors.compute_many_descriptors(mol, properties_fn=None, add_properties=True)`
|
||||
Compute multiple molecular properties for a single molecule.
|
||||
- **Parameters**:
|
||||
- `properties_fn`: Custom list of descriptor functions
|
||||
- `add_properties`: Include additional computed properties
|
||||
- **Returns**: Dictionary of descriptor name → value pairs
|
||||
- **Default descriptors include**:
|
||||
- Molecular weight, LogP, number of H-bond donors/acceptors
|
||||
- Aromatic atoms, stereocenters, rotatable bonds
|
||||
- TPSA (Topological Polar Surface Area)
|
||||
- Ring count, heteroatom count
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("CCO")
|
||||
descriptors = dm.descriptors.compute_many_descriptors(mol)
|
||||
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1, ...}
|
||||
```
|
||||
|
||||
#### `dm.descriptors.batch_compute_many_descriptors(mols, properties_fn=None, add_properties=True, n_jobs=1, batch_size=None, progress=False)`
|
||||
Compute descriptors for multiple molecules in parallel.
|
||||
- **Parameters**:
|
||||
- `mols`: List of molecules
|
||||
- `n_jobs`: Number of parallel jobs (-1 for all cores)
|
||||
- `batch_size`: Chunk size for parallel processing
|
||||
- `progress`: Show progress bar
|
||||
- **Returns**: Pandas DataFrame with one row per molecule
|
||||
- **Example**:
|
||||
```python
|
||||
mols = [dm.to_mol(smi) for smi in smiles_list]
|
||||
df = dm.descriptors.batch_compute_many_descriptors(
|
||||
mols,
|
||||
n_jobs=-1,
|
||||
progress=True
|
||||
)
|
||||
```
|
||||
|
||||
### RDKit Descriptor Access
|
||||
|
||||
#### `dm.descriptors.any_rdkit_descriptor(name)`
|
||||
Retrieve any descriptor function from RDKit by name.
|
||||
- **Parameters**: `name` - Descriptor function name (e.g., 'MolWt', 'TPSA')
|
||||
- **Returns**: RDKit descriptor function
|
||||
- **Available descriptors**: From `rdkit.Chem.Descriptors` and `rdkit.Chem.rdMolDescriptors`
|
||||
- **Example**:
|
||||
```python
|
||||
tpsa_fn = dm.descriptors.any_rdkit_descriptor('TPSA')
|
||||
tpsa_value = tpsa_fn(mol)
|
||||
```
|
||||
|
||||
### Common Use Cases
|
||||
|
||||
**Drug-likeness Filtering (Lipinski's Rule of Five)**:
|
||||
```python
|
||||
descriptors = dm.descriptors.compute_many_descriptors(mol)
|
||||
is_druglike = (
|
||||
descriptors['mw'] <= 500 and
|
||||
descriptors['logp'] <= 5 and
|
||||
descriptors['hbd'] <= 5 and
|
||||
descriptors['hba'] <= 10
|
||||
)
|
||||
```
|
||||
|
||||
**ADME Property Analysis**:
|
||||
```python
|
||||
df = dm.descriptors.batch_compute_many_descriptors(compound_library)
|
||||
# Filter by TPSA for blood-brain barrier penetration
|
||||
bbb_candidates = df[df['tpsa'] < 90]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Visualization Module (`datamol.viz`)
|
||||
|
||||
The viz module provides tools for rendering molecules and conformers as images.
|
||||
|
||||
### Main Visualization Function
|
||||
|
||||
#### `dm.viz.to_image(mols, legends=None, n_cols=4, use_svg=False, mol_size=(200, 200), highlight_atom=None, highlight_bond=None, outfile=None, max_mols=None, copy=True, indices=False, ...)`
|
||||
Generate image grid from molecules.
|
||||
- **Parameters**:
|
||||
- `mols`: Single molecule or list of molecules
|
||||
- `legends`: String or list of strings as labels (one per molecule)
|
||||
- `n_cols`: Number of molecules per row (default: 4)
|
||||
- `use_svg`: Output SVG format (True) or PNG (False, default)
|
||||
- `mol_size`: Tuple (width, height) or single int for square images
|
||||
- `highlight_atom`: Atom indices to highlight (list or dict)
|
||||
- `highlight_bond`: Bond indices to highlight (list or dict)
|
||||
- `outfile`: Save path (local or remote, supports fsspec)
|
||||
- `max_mols`: Maximum number of molecules to display
|
||||
- `indices`: Draw atom indices on structures (default: False)
|
||||
- `align`: Align molecules using MCS (Maximum Common Substructure)
|
||||
- **Returns**: Image object (can be displayed in Jupyter) or saves to file
|
||||
- **Example**:
|
||||
```python
|
||||
# Basic grid
|
||||
dm.viz.to_image(mols[:10], legends=[dm.to_smiles(m) for m in mols[:10]])
|
||||
|
||||
# Save to file
|
||||
dm.viz.to_image(mols, outfile="molecules.png", n_cols=5)
|
||||
|
||||
# Highlight substructure
|
||||
dm.viz.to_image(mol, highlight_atom=[0, 1, 2], highlight_bond=[0, 1])
|
||||
|
||||
# Aligned visualization
|
||||
dm.viz.to_image(mols, align=True, legends=activity_labels)
|
||||
```
|
||||
|
||||
### Conformer Visualization
|
||||
|
||||
#### `dm.viz.conformers(mol, n_confs=None, align_conf=True, n_cols=3, sync_views=True, remove_hs=True, ...)`
|
||||
Display multiple conformers in grid layout.
|
||||
- **Parameters**:
|
||||
- `mol`: Molecule with embedded conformers
|
||||
- `n_confs`: Number or list of conformer indices to display (None = all)
|
||||
- `align_conf`: Align conformers for comparison (default: True)
|
||||
- `n_cols`: Grid columns (default: 3)
|
||||
- `sync_views`: Synchronize 3D views when interactive (default: True)
|
||||
- `remove_hs`: Remove hydrogens for clarity (default: True)
|
||||
- **Returns**: Grid of conformer visualizations
|
||||
- **Use case**: Comparing conformational diversity
|
||||
- **Example**:
|
||||
```python
|
||||
mol_3d = dm.conformers.generate(mol, n_confs=20)
|
||||
dm.viz.conformers(mol_3d, n_confs=10, align_conf=True)
|
||||
```
|
||||
|
||||
### Circle Grid Visualization
|
||||
|
||||
#### `dm.viz.circle_grid(center_mol, circle_mols, mol_size=200, circle_margin=50, act_mapper=None, ...)`
|
||||
Create concentric ring visualization with central molecule.
|
||||
- **Parameters**:
|
||||
- `center_mol`: Molecule at center
|
||||
- `circle_mols`: List of molecule lists (one list per ring)
|
||||
- `mol_size`: Image size per molecule
|
||||
- `circle_margin`: Spacing between rings (default: 50)
|
||||
- `act_mapper`: Activity mapping dictionary for color-coding
|
||||
- **Returns**: Circular grid image
|
||||
- **Use case**: Visualizing molecular neighborhoods, SAR analysis, similarity networks
|
||||
- **Example**:
|
||||
```python
|
||||
# Show a reference molecule surrounded by similar compounds
|
||||
dm.viz.circle_grid(
|
||||
center_mol=reference,
|
||||
circle_mols=[nearest_neighbors, second_tier]
|
||||
)
|
||||
```
|
||||
|
||||
### Visualization Best Practices
|
||||
|
||||
1. **Use legends for clarity**: Always label molecules with SMILES, IDs, or activity values
|
||||
2. **Align related molecules**: Use `align=True` in `to_image()` for SAR analysis
|
||||
3. **Adjust grid size**: Set `n_cols` based on molecule count and display width
|
||||
4. **Use SVG for publications**: Set `use_svg=True` for scalable vector graphics
|
||||
5. **Highlight substructures**: Use `highlight_atom` and `highlight_bond` to emphasize features
|
||||
6. **Save large grids**: Use `outfile` parameter to save rather than display in memory
|
||||
174
scientific-packages/datamol/references/fragments_scaffolds.md
Normal file
@@ -0,0 +1,174 @@
# Datamol Fragments and Scaffolds Reference

## Scaffolds Module (`datamol.scaffold`)

Scaffolds represent the core structure of molecules, useful for identifying structural families and analyzing structure-activity relationships (SAR).

### Murcko Scaffolds

#### `dm.to_scaffold_murcko(mol)`
Extract Bemis-Murcko scaffold (molecular framework).
- **Method**: Removes side chains, retaining ring systems and linkers
- **Returns**: Molecule object representing the scaffold
- **Use case**: Identify core structures across compound series
- **Example**:

```python
mol = dm.to_mol("c1ccc(cc1)CCN")  # Phenethylamine
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
# Returns: 'c1ccccc1' (the benzene ring; the ethylamine side chain is removed)
```

**Workflow for scaffold analysis**:
```python
# Extract scaffolds from compound library
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Count scaffold frequency
from collections import Counter
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
```

### Fuzzy Scaffolds

#### `dm.scaffold.fuzzy_scaffolding(mol, ...)`
Generate fuzzy scaffolds with enforceable groups that must appear in the core.
- **Purpose**: More flexible scaffold definition allowing specified functional groups
- **Use case**: Custom scaffold definitions beyond Murcko rules

### Applications

**Scaffold-based splitting** (for ML model validation):
```python
# Group compounds by scaffold
scaffold_to_mols = {}
for mol, scaffold in zip(mols, scaffolds):
    smi = dm.to_smiles(scaffold)
    if smi not in scaffold_to_mols:
        scaffold_to_mols[smi] = []
    scaffold_to_mols[smi].append(mol)

# Ensure train/test sets have different scaffolds
```
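
One way to finish the split hinted at by the final comment above is a simple greedy assignment of whole scaffold groups, so that no scaffold appears in both sets. This is plain Python over the `scaffold_to_mols` dictionary built above, not a datamol API; the 20% test fraction is an arbitrary choice.

```python
# Assign entire scaffold groups to train or test (greedy sketch)
target_test_fraction = 0.2
n_total = sum(len(group) for group in scaffold_to_mols.values())

train_mols, test_mols = [], []
# Fill the test set with the smallest scaffold groups first
for smi, group in sorted(scaffold_to_mols.items(), key=lambda kv: len(kv[1])):
    if len(test_mols) < target_test_fraction * n_total:
        test_mols.extend(group)
    else:
        train_mols.extend(group)
```
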

**SAR analysis**:
```python
import numpy as np

# Group by scaffold and analyze activity
for scaffold_smi, molecules in scaffold_to_mols.items():
    activities = [get_activity(mol) for mol in molecules]
    print(f"Scaffold: {scaffold_smi}, Mean activity: {np.mean(activities)}")
```

---

## Fragments Module (`datamol.fragment`)

Molecular fragmentation breaks molecules into smaller pieces based on chemical rules, useful for fragment-based drug design and substructure analysis.

### BRICS Fragmentation

#### `dm.fragment.brics(mol, ...)`
Fragment molecule using BRICS (Breaking Retrosynthetically Interesting Chemical Substructures).
- **Method**: Dissects based on 16 chemically meaningful bond types
- **Consideration**: Considers chemical environment and surrounding substructures
- **Returns**: Set of fragment SMILES strings
- **Use case**: Retrosynthetic analysis, fragment-based design
- **Example**:

```python
mol = dm.to_mol("c1ccccc1CCN")
fragments = dm.fragment.brics(mol)
# Returns fragments like: '[1*]CCN', '[1*]c1ccccc1', etc.
# [1*] represents attachment points
```

### RECAP Fragmentation

#### `dm.fragment.recap(mol, ...)`
Fragment molecule using RECAP (Retrosynthetic Combinatorial Analysis Procedure).
- **Method**: Dissects based on 11 predefined bond types
- **Rules**:
  - Leaves alkyl groups smaller than 5 carbons intact
  - Preserves cyclic bonds
- **Returns**: Set of fragment SMILES strings
- **Use case**: Combinatorial library design
- **Example**:

```python
mol = dm.to_mol("CCCCCc1ccccc1")
fragments = dm.fragment.recap(mol)
```

### MMPA Fragmentation

#### `dm.fragment.mmpa_frag(mol, ...)`
Fragment for Matched Molecular Pair Analysis.
- **Purpose**: Generate fragments suitable for identifying molecular pairs
- **Use case**: Analyzing how small structural changes affect properties
- **Example**:

```python
fragments = dm.fragment.mmpa_frag(mol)
# Used to find pairs of molecules differing by single transformation
```

### Comparison of Methods

| Method | Bond Types | Preserves Cycles | Best For |
|--------|-----------|------------------|----------|
| BRICS | 16 | Yes | Retrosynthetic analysis, fragment recombination |
| RECAP | 11 | Yes | Combinatorial library design |
| MMPA | Variable | Depends | Structure-activity relationship analysis |

### Fragmentation Workflow

```python
import re
import datamol as dm

# 1. Fragment a molecule
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
brics_frags = dm.fragment.brics(mol)
recap_frags = dm.fragment.recap(mol)

# 2. Analyze fragment frequency across library
all_fragments = []
for mol in molecule_library:
    frags = dm.fragment.brics(mol)
    all_fragments.extend(frags)

# 3. Identify common fragments
from collections import Counter
fragment_counts = Counter(all_fragments)
common_fragments = fragment_counts.most_common(20)

# 4. Convert fragments back to molecules (remove attachment points)
def clean_fragment(frag_smiles):
    # Replace [1*], [2*], etc. attachment point markers with explicit hydrogens
    clean = re.sub(r"\[\d+\*\]", "[H]", frag_smiles)
    return dm.to_mol(clean)
```

### Advanced: Fragment-Based Virtual Screening

```python
# Build fragment library from known actives
active_fragments = set()
for active_mol in active_compounds:
    frags = dm.fragment.brics(active_mol)
    active_fragments.update(frags)

# Screen compounds for presence of active fragments
def score_by_fragments(mol, fragment_set):
    mol_frags = dm.fragment.brics(mol)
    overlap = mol_frags.intersection(fragment_set)
    return len(overlap) / len(mol_frags) if mol_frags else 0.0

# Score screening library
scores = [score_by_fragments(mol, active_fragments) for mol in screening_lib]
```

### Key Concepts

- **Attachment Points**: Marked with [1*], [2*], etc. in fragment SMILES
- **Retrosynthetic**: Fragmentation mimics synthetic disconnections
- **Chemically Meaningful**: Breaks occur at typical synthetic bonds
- **Recombination**: Fragments can theoretically be recombined into valid molecules
109
scientific-packages/datamol/references/io_module.md
Normal file
@@ -0,0 +1,109 @@
# Datamol I/O Module Reference

The `datamol.io` module provides comprehensive file handling for molecular data across multiple formats.

## Reading Molecular Files

### `dm.read_sdf(filename, sanitize=True, remove_hs=True, as_df=True, mol_column='mol', ...)`
Read Structure-Data File (SDF) format.
- **Parameters**:
  - `filename`: Path to SDF file (supports local and remote paths via fsspec)
  - `sanitize`: Apply sanitization to molecules
  - `remove_hs`: Remove explicit hydrogens
  - `as_df`: Return as DataFrame (True) or list of molecules (False)
  - `mol_column`: Name of molecule column in DataFrame
  - `n_jobs`: Enable parallel processing
- **Returns**: DataFrame or list of molecules
- **Example**: `df = dm.read_sdf("compounds.sdf")`

### `dm.read_smi(filename, smiles_column='smiles', mol_column='mol', as_df=True, ...)`
Read SMILES file (space-delimited by default).
- **Common format**: SMILES followed by molecule ID/name
- **Example**: `df = dm.read_smi("molecules.smi")`

### `dm.read_csv(filename, smiles_column='smiles', mol_column=None, ...)`
Read CSV file with optional automatic SMILES-to-molecule conversion.
- **Parameters**:
  - `smiles_column`: Column containing SMILES strings
  - `mol_column`: If specified, creates molecule objects from SMILES column
- **Example**: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`

### `dm.read_excel(filename, sheet_name=0, smiles_column='smiles', mol_column=None, ...)`
Read Excel files with molecule handling.
- **Parameters**:
  - `sheet_name`: Sheet to read (index or name)
  - Other parameters similar to `read_csv`
- **Example**: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`

### `dm.read_molblock(molblock, sanitize=True, remove_hs=True)`
Parse MOL block string (molecular structure text representation).

### `dm.read_mol2file(filename, sanitize=True, remove_hs=True, cleanupSubstructures=True)`
Read Mol2 format files.

### `dm.read_pdbfile(filename, sanitize=True, remove_hs=True, proximityBonding=True)`
Read Protein Data Bank (PDB) format files.

### `dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximityBonding=True)`
Parse PDB block string.
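
A quick sketch of the block-based readers and writers working together: converting a molecule to a MOL block string and parsing it back. The aspirin SMILES is just an illustrative input.

```python
import datamol as dm

mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
molblock = dm.to_molblock(mol)            # structure as MOL block text
mol_back = dm.read_molblock(molblock, sanitize=True, remove_hs=True)

# The round trip should preserve the canonical structure
print(dm.to_smiles(mol_back) == dm.to_smiles(mol))
```
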

### `dm.open_df(filename, ...)`
Universal DataFrame reader - automatically detects format.
- **Supported formats**: CSV, Excel, Parquet, JSON, SDF
- **Example**: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`

## Writing Molecular Files

### `dm.to_sdf(mols, filename, mol_column=None, ...)`
Write molecules to SDF file.
- **Input types**:
  - List of molecules
  - DataFrame with molecule column
  - Sequence of molecules
- **Parameters**:
  - `mol_column`: Column name if input is DataFrame
- **Example**:

```python
dm.to_sdf(mols, "output.sdf")
# or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
```

### `dm.to_smi(mols, filename, mol_column=None, ...)`
Write molecules to SMILES file with optional validation.
- **Format**: SMILES strings with optional molecule names/IDs

### `dm.to_xlsx(df, filename, mol_columns=None, ...)`
Export DataFrame to Excel with rendered molecular images.
- **Parameters**:
  - `mol_columns`: Columns containing molecules to render as images
- **Special feature**: Automatically renders molecules as images in Excel cells
- **Example**: `dm.to_xlsx(df, "molecules.xlsx", mol_columns=["mol"])`

### `dm.to_molblock(mol, ...)`
Convert molecule to MOL block string.

### `dm.to_pdbblock(mol, ...)`
Convert molecule to PDB block string.

### `dm.save_df(df, filename, ...)`
Save DataFrame in multiple formats (CSV, Excel, Parquet, JSON).
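
A small sketch pairing `dm.save_df` with `dm.open_df` from the reading section. The file names and the `"mol"` column name are assumptions; the molecule column is dropped before saving because RDKit molecule objects are not directly serializable to tabular formats.

```python
import datamol as dm

df = dm.read_sdf("compounds.sdf")                       # DataFrame with an RDKit mol column
dm.save_df(df.drop(columns=["mol"]), "compounds.csv")   # keep only serializable columns
df_back = dm.open_df("compounds.csv")                   # universal reader (see dm.open_df above)
```
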

## Remote File Support

All I/O functions support remote file paths through fsspec integration:
- **Supported protocols**: S3 (AWS), GCS (Google Cloud), Azure, HTTP/HTTPS
- **Example**:

```python
dm.read_sdf("s3://bucket/compounds.sdf")
dm.read_csv("https://example.com/data.csv")
```

## Key Parameters Across Functions

- **`sanitize`**: Apply molecule sanitization (default: True)
- **`remove_hs`**: Remove explicit hydrogens (default: True)
- **`as_df`**: Return DataFrame vs list (default: True for most functions)
- **`n_jobs`**: Enable parallel processing (None = all cores, 1 = sequential)
- **`mol_column`**: Name of molecule column in DataFrames
- **`smiles_column`**: Name of SMILES column in DataFrames
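
A sketch showing how these shared parameters combine on a single reader; the input file is a placeholder and the exact defaults should be checked against the installed datamol version.

```python
import datamol as dm

# Parallel SDF parsing that keeps explicit hydrogens and returns a plain list
mols = dm.read_sdf(
    "compounds.sdf",   # hypothetical input file
    sanitize=True,
    remove_hs=False,
    as_df=False,
    n_jobs=4,          # parallel parsing
)
```
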
218
scientific-packages/datamol/references/reactions_data.md
Normal file
@@ -0,0 +1,218 @@
# Datamol Reactions and Data Modules Reference

## Reactions Module (`datamol.reactions`)

The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.

### Applying Chemical Reactions

#### `dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)`
Apply a chemical reaction to reactant molecules.
- **Parameters**:
  - `rxn`: Reaction object (from SMARTS pattern)
  - `reactants`: Tuple of reactant molecules
  - `as_smiles`: Return SMILES strings (True) or molecule objects (False)
  - `sanitize`: Sanitize product molecules
  - `single_product_group`: Return single product (True) or all product groups (False)
  - `rm_attach`: Remove attachment point markers
  - `product_index`: Which product to return from reaction
- **Returns**: Product molecule(s) or SMILES
- **Example**:

```python
import datamol as dm
from rdkit import Chem

# Define reaction: alcohol + carboxylic acid → ester
rxn = Chem.rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
)

# Apply to reactants
alcohol = dm.to_mol("CCO")
acid = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
```

### Creating Reactions

Reactions are typically created from SMARTS patterns using RDKit:
```python
from rdkit.Chem import rdChemReactions

# Reaction pattern: [reactant1].[reactant2]>>[product]
rxn = rdChemReactions.ReactionFromSmarts(
    '[1*][*:1].[1*][*:2]>>[*:1][*:2]'
)
```

### Validation Functions

The module includes functions to:
- **Check if molecule is reactant**: Verify if molecule matches reactant pattern
- **Validate reaction**: Check if reaction is synthetically reasonable
- **Process reaction files**: Load reactions from files or databases

### Common Reaction Patterns

**Amide formation**:
```python
# Amine + carboxylic acid → amide
amide_rxn = rdChemReactions.ReactionFromSmarts(
    '[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
)
```

**Suzuki coupling**:
```python
# Aryl halide + boronic acid → biaryl
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
    '[c:1][Br].[c:2][B]([OH])[OH]>>[c:1][c:2]'
)
```

**Functional group transformations**:
```python
# Alcohol + acid chloride → ester
esterification = rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
)
```

### Workflow Example

```python
import datamol as dm
from rdkit.Chem import rdChemReactions

# 1. Define reaction
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'  # Acid → acid chloride
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)

# 2. Apply to molecule library
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
acid_chlorides = []

for acid in acids:
    try:
        product = dm.reactions.apply_reaction(
            rxn,
            (acid,),  # Single reactant as tuple
            sanitize=True
        )
        acid_chlorides.append(product)
    except Exception as e:
        print(f"Reaction failed: {e}")

# 3. Validate products
valid_products = [p for p in acid_chlorides if p is not None]
```

### Key Concepts

- **SMARTS**: SMILES Arbitrary Target Specification - pattern language for substructures and reactions
- **Atom Mapping**: Numbers like [C:1] preserve atom identity through reaction
- **Attachment Points**: [1*] represents generic connection points
- **Reaction Validation**: Not all SMARTS reactions are chemically reasonable

---

## Data Module (`datamol.data`)

The data module provides convenient access to curated molecular datasets for testing and learning.

### Available Datasets

#### `dm.data.cdk2(as_df=True, mol_column='mol')`
RDKit CDK2 dataset - kinase inhibitor data.
- **Parameters**:
  - `as_df`: Return as DataFrame (True) or list of molecules (False)
  - `mol_column`: Name for molecule column
- **Returns**: Dataset with molecular structures and activity data
- **Use case**: Small dataset for algorithm testing
- **Example**:

```python
cdk2_df = dm.data.cdk2(as_df=True)
print(cdk2_df.shape)
print(cdk2_df.columns)
```

#### `dm.data.freesolv()`
FreeSolv dataset - experimental and calculated hydration free energies.
- **Contents**: 642 molecules with:
  - IUPAC names
  - SMILES strings
  - Experimental hydration free energy values
  - Calculated values
- **Warning**: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
- **Not suitable for**: Benchmarking or production model training
- **Example**:

```python
freesolv_df = dm.data.freesolv()
# Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
```

#### `dm.data.solubility(as_df=True, mol_column='mol')`
RDKit solubility dataset with train/test splits.
- **Contents**: Aqueous solubility data with pre-defined splits
- **Columns**: Includes 'split' column with 'train' or 'test' values
- **Use case**: Testing ML workflows with proper train/test separation
- **Example**:

```python
sol_df = dm.data.solubility(as_df=True)

# Split into train/test
train_df = sol_df[sol_df['split'] == 'train']
test_df = sol_df[sol_df['split'] == 'test']

# Use for model development
X_train = dm.to_fp(train_df['mol'])
y_train = train_df['solubility']
```

### Usage Guidelines

**For testing and tutorials**:
```python
# Quick dataset for testing code
df = dm.data.cdk2()
mols = df['mol'].tolist()

# Test descriptor calculation
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)

# Test clustering
clusters = dm.cluster_mols(mols, cutoff=0.3)
```

**For learning workflows**:
```python
# Complete ML pipeline example
sol_df = dm.data.solubility()

# Preprocessing
train = sol_df[sol_df['split'] == 'train']
test = sol_df[sol_df['split'] == 'test']

# Featurization
X_train = dm.to_fp(train['mol'])
X_test = dm.to_fp(test['mol'])

# Model training (example)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, train['solubility'])
predictions = model.predict(X_test)
```

### Important Notes

- **Toy Datasets**: Designed for pedagogical purposes, not production use
- **Small Size**: Limited number of compounds suitable for quick tests
- **Pre-processed**: Data already cleaned and formatted
- **Citations**: Check dataset documentation for proper attribution if publishing

### Best Practices

1. **Use for development only**: Don't draw scientific conclusions from toy datasets
2. **Validate on real data**: Always test production code on actual project data
3. **Proper attribution**: Cite original data sources if using in publications
4. **Understand limitations**: Know the scope and quality of each dataset
591
scientific-packages/deepchem/SKILL.md
Normal file
@@ -0,0 +1,591 @@
---
name: deepchem
description: Comprehensive toolkit for molecular machine learning, drug discovery, and materials science using DeepChem. Use this skill when working with molecular data (SMILES, SDF files), predicting molecular properties (solubility, toxicity, binding affinity), training graph neural networks on molecules, using MoleculeNet benchmarks, performing molecular featurization, or applying transfer learning with pretrained chemical models (ChemBERTa, GROVER). Also applicable for materials science (crystal structures, bandgap prediction) and protein/DNA sequence analysis.
---

# DeepChem

## Overview

DeepChem is a comprehensive Python library for applying machine learning to chemistry, materials science, and biology. It enables molecular property prediction, drug discovery, materials design, and biomolecule analysis through specialized neural networks, molecular featurization methods, and pretrained models.

## When to Use This Skill

Apply this skill when:
- Loading and processing molecular data (SMILES strings, SDF files, protein sequences)
- Predicting molecular properties (solubility, toxicity, binding affinity, ADMET properties)
- Training models on chemical/biological datasets
- Using MoleculeNet benchmark datasets (Tox21, BBBP, Delaney, etc.)
- Converting molecules to ML-ready features (fingerprints, graph representations, descriptors)
- Implementing graph neural networks for molecules (GCN, GAT, MPNN, AttentiveFP)
- Applying transfer learning with pretrained models (ChemBERTa, GROVER, MolFormer)
- Predicting crystal/materials properties (bandgap, formation energy)
- Analyzing protein or DNA sequences

## Core Capabilities

### 1. Molecular Data Loading and Processing

DeepChem provides specialized loaders for various chemical data formats:

```python
import deepchem as dc

# Load CSV with SMILES
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],
    feature_field='smiles',
    featurizer=featurizer
)
dataset = loader.create_dataset('molecules.csv')

# Load SDF files
loader = dc.data.SDFLoader(tasks=['activity'], featurizer=featurizer)
dataset = loader.create_dataset('compounds.sdf')

# Load protein sequences
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
```

**Key Loaders**:
- `CSVLoader`: Tabular data with molecular identifiers
- `SDFLoader`: Molecular structure files
- `FASTALoader`: Protein/DNA sequences
- `ImageLoader`: Molecular images
- `JsonLoader`: JSON-formatted datasets

### 2. Molecular Featurization

Convert molecules into numerical representations for ML models.

#### Decision Tree for Featurizer Selection

```
Is the model a graph neural network?
├─ YES → Use graph featurizers
│   ├─ Standard GNN → MolGraphConvFeaturizer
│   ├─ Message passing → DMPNNFeaturizer
│   └─ Pretrained → GroverFeaturizer
│
└─ NO → What type of model?
    ├─ Traditional ML (RF, XGBoost, SVM)
    │   ├─ Fast baseline → CircularFingerprint (ECFP)
    │   ├─ Interpretable → RDKitDescriptors
    │   └─ Maximum coverage → MordredDescriptors
    │
    ├─ Deep learning (non-graph)
    │   ├─ Dense networks → CircularFingerprint
    │   └─ CNN → SmilesToImage
    │
    ├─ Sequence models (LSTM, Transformer)
    │   └─ SmilesToSeq
    │
    └─ 3D structure analysis
        └─ CoulombMatrix
```

#### Example Featurization

```python
# Fingerprints (for traditional ML)
fp = dc.feat.CircularFingerprint(radius=2, size=2048)

# Descriptors (for interpretable models)
desc = dc.feat.RDKitDescriptors()

# Graph features (for GNNs)
graph_feat = dc.feat.MolGraphConvFeaturizer()

# Apply featurization
features = fp.featurize(['CCO', 'c1ccccc1'])
```

**Selection Guide**:
- **Small datasets (<1K)**: CircularFingerprint or RDKitDescriptors
- **Medium datasets (1K-100K)**: CircularFingerprint or graph featurizers
- **Large datasets (>100K)**: Graph featurizers (MolGraphConvFeaturizer, DMPNNFeaturizer)
- **Transfer learning**: Pretrained model featurizers (GroverFeaturizer)

See `references/api_reference.md` for complete featurizer documentation.

### 3. Data Splitting

**Critical**: For drug discovery tasks, use `ScaffoldSplitter` to prevent data leakage from similar molecular structures appearing in both training and test sets.

```python
# Scaffold splitting (recommended for molecules)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)

# Random splitting (for non-molecular data)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)

# Stratified splitting (for imbalanced classification)
splitter = dc.splits.RandomStratifiedSplitter()
train, test = splitter.train_test_split(dataset)
```

**Available Splitters**:
- `ScaffoldSplitter`: Split by molecular scaffolds (prevents leakage)
- `ButinaSplitter`: Clustering-based molecular splitting
- `MaxMinSplitter`: Maximize diversity between sets
- `RandomSplitter`: Random splitting
- `RandomStratifiedSplitter`: Preserves class distributions

### 4. Model Selection and Training

#### Quick Model Selection Guide

| Dataset Size | Task | Recommended Model | Featurizer |
|-------------|------|-------------------|------------|
| < 1K samples | Any | SklearnModel (RandomForest) | CircularFingerprint |
| 1K-100K | Classification/Regression | GBDTModel or MultitaskRegressor | CircularFingerprint |
| > 100K | Molecular properties | GCNModel, AttentiveFPModel, DMPNNModel | MolGraphConvFeaturizer |
| Any (small preferred) | Transfer learning | ChemBERTa, GROVER, MolFormer | Model-specific |
| Crystal structures | Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Protein sequences | Protein properties | ProtBERT | Sequence-based |

#### Example: Traditional ML
```python
from sklearn.ensemble import RandomForestRegressor

# Wrap scikit-learn model
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(model=sklearn_model)
model.fit(train)
```

#### Example: Deep Learning
```python
# Multitask regressor (for fingerprints)
model = dc.models.MultitaskRegressor(
    n_tasks=2,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001
)
model.fit(train, nb_epoch=50)
```

#### Example: Graph Neural Networks
```python
# Graph Convolutional Network
model = dc.models.GCNModel(
    n_tasks=1,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)
model.fit(train, nb_epoch=50)

# Graph Attention Network
model = dc.models.GATModel(n_tasks=1, mode='classification')
model.fit(train, nb_epoch=50)

# Attentive Fingerprint
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(train, nb_epoch=50)
```

### 5. MoleculeNet Benchmarks

Quick access to 30+ curated benchmark datasets with standardized train/valid/test splits:

```python
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='GraphConv',  # or 'ECFP', 'Weave', 'Raw'
    splitter='scaffold',     # or 'random', 'stratified'
    reload=False
)
train, valid, test = datasets

# Train and evaluate
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
```

**Common Datasets**:
- **Classification**: `load_tox21()`, `load_bbbp()`, `load_hiv()`, `load_clintox()`
- **Regression**: `load_delaney()`, `load_freesolv()`, `load_lipo()`
- **Quantum properties**: `load_qm7()`, `load_qm8()`, `load_qm9()`
- **Materials**: `load_perovskite()`, `load_bandgap()`, `load_mp_formation_energy()`

See `references/api_reference.md` for complete dataset list.

### 6. Transfer Learning

Leverage pretrained models for improved performance, especially on small datasets:

```python
# ChemBERTa (BERT pretrained on 77M molecules)
model = dc.models.HuggingFaceModel(
    model='seyonec/ChemBERTa-zinc-base-v1',
    task='classification',
    n_tasks=1,
    learning_rate=2e-5  # Lower LR for fine-tuning
)
model.fit(train, nb_epoch=10)

# GROVER (graph transformer pretrained on 10M molecules)
model = dc.models.GroverModel(
    task='regression',
    n_tasks=1
)
model.fit(train, nb_epoch=20)
```

**When to use transfer learning**:
- Small datasets (< 1000 samples)
- Novel molecular scaffolds
- Limited computational resources
- Need for rapid prototyping

Use the `scripts/transfer_learning.py` script for guided transfer learning workflows.

### 7. Model Evaluation

```python
# Define metrics
classification_metrics = [
    dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
    dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
    dc.metrics.Metric(dc.metrics.f1_score, name='F1')
]

regression_metrics = [
    dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
    dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
    dc.metrics.Metric(dc.metrics.root_mean_squared_error, name='RMSE')
]

# Evaluate
train_scores = model.evaluate(train, classification_metrics)
test_scores = model.evaluate(test, classification_metrics)
```

### 8. Making Predictions

```python
# Predict on test set
predictions = model.predict(test)

# Predict on new molecules
new_smiles = ['CCO', 'c1ccccc1', 'CC(C)O']
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)

# Apply same transformations as training
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)

predictions = model.predict(new_dataset)
```

## Typical Workflows

### Workflow A: Quick Benchmark Evaluation

For evaluating a model on standard benchmarks:

```python
import deepchem as dc

# 1. Load benchmark
tasks, datasets, _ = dc.molnet.load_bbbp(
    featurizer='GraphConv',
    splitter='scaffold'
)
train, valid, test = datasets

# 2. Train model
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)

# 3. Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```

### Workflow B: Custom Data Prediction

For training on custom molecular datasets:

```python
import deepchem as dc

# 1. Load and featurize data
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=featurizer
)
dataset = loader.create_dataset('my_molecules.csv')

# 2. Split data (use ScaffoldSplitter for molecules!)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)

# 3. Normalize (optional but recommended)
transformers = [dc.trans.NormalizationTransformer(
    transform_y=True, dataset=train
)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)

# 4. Train model
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25
)
model.fit(train, nb_epoch=50)

# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
test_score = model.evaluate(test, [metric])
```

### Workflow C: Transfer Learning on Small Dataset

For leveraging pretrained models:

```python
import deepchem as dc

# 1. Load data (pretrained models often need raw SMILES)
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()  # Model handles featurization
)
dataset = loader.create_dataset('small_dataset.csv')

# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# 3. Load pretrained model
model = dc.models.HuggingFaceModel(
    model='seyonec/ChemBERTa-zinc-base-v1',
    task='classification',
    n_tasks=1,
    learning_rate=2e-5
)

# 4. Fine-tune
model.fit(train, nb_epoch=10)

# 5. Evaluate
predictions = model.predict(test)
```

See `references/workflows.md` for 8 detailed workflow examples covering molecular generation, materials science, protein analysis, and more.

## Example Scripts

This skill includes three production-ready scripts in the `scripts/` directory:

### 1. `predict_solubility.py`
Train and evaluate solubility prediction models. Works with Delaney benchmark or custom CSV data.

```bash
# Use Delaney benchmark
python scripts/predict_solubility.py

# Use custom data
python scripts/predict_solubility.py \
    --data my_data.csv \
    --smiles-col smiles \
    --target-col solubility \
    --predict "CCO" "c1ccccc1"
```

### 2. `graph_neural_network.py`
Train various graph neural network architectures on molecular data.

```bash
# Train GCN on Tox21
python scripts/graph_neural_network.py --model gcn --dataset tox21

# Train AttentiveFP on custom data
python scripts/graph_neural_network.py \
    --model attentivefp \
    --data molecules.csv \
    --task-type regression \
    --targets activity \
    --epochs 100
```

### 3. `transfer_learning.py`
Fine-tune pretrained models (ChemBERTa, GROVER) on molecular property prediction tasks.

```bash
# Fine-tune ChemBERTa on BBBP
python scripts/transfer_learning.py --model chemberta --dataset bbbp

# Fine-tune GROVER on custom data
python scripts/transfer_learning.py \
    --model grover \
    --data small_dataset.csv \
    --target activity \
    --task-type classification \
    --epochs 20
```

## Common Patterns and Best Practices

### Pattern 1: Always Use Scaffold Splitting for Molecules
```python
# GOOD: Prevents data leakage
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# BAD: Similar molecules in train and test
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
```

### Pattern 2: Normalize Features and Targets
```python
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,  # Also normalize target values
        dataset=train
    )
]
for transformer in transformers:
    train = transformer.transform(train)
    test = transformer.transform(test)
```

### Pattern 3: Start Simple, Then Scale
1. Start with Random Forest + CircularFingerprint (fast baseline)
2. Try XGBoost/LightGBM if RF works well
3. Move to deep learning (MultitaskRegressor) if you have >5K samples
4. Try GNNs if you have >10K samples
5. Use transfer learning for small datasets or novel scaffolds
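
A sketch of step 1 of this progression (Random Forest on circular fingerprints), reusing the loading, splitting, and evaluation patterns shown earlier in this skill; the CSV file name and column names are placeholders.

```python
import deepchem as dc
from sklearn.ensemble import RandomForestRegressor

# Step 1 baseline: fingerprints + Random Forest
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('my_molecules.csv')
train, test = dc.splits.ScaffoldSplitter().train_test_split(dataset)

baseline = dc.models.SklearnModel(model=RandomForestRegressor(n_estimators=100))
baseline.fit(train)
metric = dc.metrics.Metric(dc.metrics.r2_score)
print(baseline.evaluate(test, [metric]))  # compare later models against this score
```
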

### Pattern 4: Handle Imbalanced Data
```python
# Option 1: Balancing transformer
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)

# Option 2: Use balanced metrics
metric = dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
```

### Pattern 5: Avoid Memory Issues
```python
# Use DiskDataset for large datasets
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)

# Use smaller batch sizes
model = dc.models.GCNModel(batch_size=32)  # Instead of 128
```

## Common Pitfalls

### Issue 1: Data Leakage in Drug Discovery
**Problem**: Using random splitting allows similar molecules in train/test sets.
**Solution**: Always use `ScaffoldSplitter` for molecular datasets.

### Issue 2: GNN Underperforming vs Fingerprints
**Problem**: Graph neural networks perform worse than simple fingerprints.
**Solutions**:
- Ensure dataset is large enough (>10K samples typically)
- Increase training epochs (50-100)
- Try different architectures (AttentiveFP, DMPNN instead of GCN)
- Use pretrained models (GROVER)

### Issue 3: Overfitting on Small Datasets
**Problem**: Model memorizes training data.
**Solutions**:
- Use stronger regularization (increase dropout to 0.5)
- Use simpler models (Random Forest instead of deep learning)
- Apply transfer learning (ChemBERTa, GROVER)
- Collect more data

### Issue 4: Import Errors
**Problem**: Module not found errors.
**Solution**: Ensure DeepChem is installed with required dependencies:
```bash
pip install deepchem
# For PyTorch models
pip install deepchem[torch]
# For all features
pip install deepchem[all]
```

## Reference Documentation

This skill includes comprehensive reference documentation:

### `references/api_reference.md`
Complete API documentation including:
- All data loaders and their use cases
- Dataset classes and when to use each
- Complete featurizer catalog with selection guide
- Model catalog organized by category (50+ models)
- MoleculeNet dataset descriptions
- Metrics and evaluation functions
- Common code patterns

**When to reference**: Search this file when you need specific API details, parameter names, or want to explore available options.

### `references/workflows.md`
Eight detailed end-to-end workflows:
1. Molecular property prediction from SMILES
2. Using MoleculeNet benchmarks
3. Hyperparameter optimization
4. Transfer learning with pretrained models
5. Molecular generation with GANs
6. Materials property prediction
7. Protein sequence analysis
8. Custom model integration

**When to reference**: Use these workflows as templates for implementing complete solutions.

## Installation Notes

Basic installation:
```bash
pip install deepchem
```

For PyTorch models (GCN, GAT, etc.):
```bash
pip install deepchem[torch]
```

For all features:
```bash
pip install deepchem[all]
```

If import errors occur, the user may need specific dependencies. Check the DeepChem documentation for detailed installation instructions.

## Additional Resources

- Official documentation: https://deepchem.readthedocs.io/
- GitHub repository: https://github.com/deepchem/deepchem
- Tutorials: https://deepchem.readthedocs.io/en/latest/get_started/tutorials.html
- Paper: "MoleculeNet: A Benchmark for Molecular Machine Learning"
303
scientific-packages/deepchem/references/api_reference.md
Normal file
@@ -0,0 +1,303 @@
# DeepChem API Reference

This document provides a comprehensive reference for DeepChem's core APIs, organized by functionality.

## Data Handling

### Data Loaders

#### File Format Loaders
- **CSVLoader**: Load tabular data from CSV files with customizable feature handling
- **UserCSVLoader**: User-defined CSV loading with flexible column specifications
- **SDFLoader**: Process molecular structure files (SDF format)
- **JsonLoader**: Import JSON-structured datasets
- **ImageLoader**: Load image data for computer vision tasks

#### Biological Data Loaders
- **FASTALoader**: Handle protein/DNA sequences in FASTA format
- **FASTQLoader**: Process FASTQ sequencing data with quality scores
- **SAMLoader/BAMLoader/CRAMLoader**: Support sequence alignment formats

#### Specialized Loaders
- **DFTYamlLoader**: Process density functional theory computational data
- **InMemoryLoader**: Load data directly from Python objects

### Dataset Classes

- **NumpyDataset**: Wrap NumPy arrays for in-memory data manipulation
- **DiskDataset**: Manage larger datasets stored on disk, reducing memory overhead
- **ImageDataset**: Specialized container for image-based ML tasks
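
A brief sketch of the two most common dataset classes; the random arrays stand in for real features and labels.

```python
import numpy as np
import deepchem as dc

# Small in-memory dataset: X = features, y = labels (toy arrays)
X = np.random.rand(100, 1024)
y = np.random.rand(100, 1)
small = dc.data.NumpyDataset(X=X, y=y)

# The same arrays backed by disk storage for larger workloads
large = dc.data.DiskDataset.from_numpy(X, y)
```
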

### Data Splitters

#### General Splitters
- **RandomSplitter**: Random dataset partitioning
- **IndexSplitter**: Split by specified indices
- **SpecifiedSplitter**: Use pre-defined splits
- **RandomStratifiedSplitter**: Stratified random splitting
- **SingletaskStratifiedSplitter**: Stratified splitting for single tasks
- **TaskSplitter**: Split for multitask scenarios

#### Molecule-Specific Splitters
- **ScaffoldSplitter**: Divide molecules by structural scaffolds (prevents data leakage)
- **ButinaSplitter**: Clustering-based molecular splitting
- **FingerprintSplitter**: Split based on molecular fingerprint similarity
- **MaxMinSplitter**: Maximize diversity between training/test sets
- **MolecularWeightSplitter**: Split by molecular weight properties

**Best Practice**: For drug discovery tasks, use ScaffoldSplitter to prevent overfitting on similar molecular structures.
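
A minimal sketch of that best practice; `dataset` is assumed to come from one of the loaders above.

```python
import deepchem as dc

splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,                # any featurized molecular dataset
    frac_train=0.8, frac_valid=0.1, frac_test=0.1,
)
```
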

### Transformers

#### Normalization
- **NormalizationTransformer**: Standard normalization (mean=0, std=1)
- **MinMaxTransformer**: Scale features to [0,1] range
- **LogTransformer**: Apply log transformation
- **PowerTransformer**: Box-Cox and Yeo-Johnson transformations
- **CDFTransformer**: Cumulative distribution function normalization

#### Task-Specific
- **BalancingTransformer**: Address class imbalance
- **FeaturizationTransformer**: Apply dynamic feature engineering
- **CoulombFitTransformer**: Quantum chemistry specific
- **DAGTransformer**: Directed acyclic graph transformations
- **RxnSplitTransformer**: Chemical reaction preprocessing
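
A short sketch of chaining two of the transformers listed above on a classification dataset; it assumes `train`, `valid`, and `test` splits already exist, and transformers are always fit on the training split only.

```python
import deepchem as dc

transformers = [
    dc.trans.NormalizationTransformer(transform_X=True, dataset=train),  # scale features
    dc.trans.BalancingTransformer(dataset=train),                        # reweight imbalanced classes
]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
```
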

## Molecular Featurizers

### Graph-Based Featurizers
Use these with graph neural networks (GCNs, MPNNs, etc.):

- **ConvMolFeaturizer**: Graph representations for graph convolutional networks
- **WeaveFeaturizer**: "Weave" graph embeddings
- **MolGraphConvFeaturizer**: Graph convolution-ready representations
- **EquivariantGraphFeaturizer**: Maintains geometric invariance
- **DMPNNFeaturizer**: Directed message-passing neural network inputs
- **GroverFeaturizer**: Pre-trained molecular embeddings

### Fingerprint-Based Featurizers
Use these with traditional ML (Random Forest, SVM, XGBoost):

- **MACCSKeysFingerprint**: 167-bit structural keys
- **CircularFingerprint**: Extended connectivity fingerprints (Morgan fingerprints)
  - Parameters: `radius` (default 2), `size` (default 2048), `useChirality` (default False)
- **PubChemFingerprint**: 881-bit structural descriptors
- **Mol2VecFingerprint**: Learned molecular vector representations

### Descriptor Featurizers
Calculate molecular properties directly:

- **RDKitDescriptors**: ~200 molecular descriptors (MW, LogP, H-donors, H-acceptors, TPSA, etc.)
- **MordredDescriptors**: Comprehensive structural and physicochemical descriptors
- **CoulombMatrix**: Interatomic distance matrices for 3D structures

### Sequence-Based Featurizers
For recurrent networks and transformers:

- **SmilesToSeq**: Convert SMILES strings to sequences
- **SmilesToImage**: Generate 2D image representations from SMILES
- **RawFeaturizer**: Pass through raw molecular data unchanged

### Selection Guide

| Use Case | Recommended Featurizer | Model Type |
|----------|----------------------|------------|
| Graph neural networks | ConvMolFeaturizer, MolGraphConvFeaturizer | GCN, MPNN, GAT |
| Traditional ML | CircularFingerprint, RDKitDescriptors | Random Forest, XGBoost, SVM |
| Deep learning (non-graph) | CircularFingerprint, Mol2VecFingerprint | Dense networks, CNN |
| Sequence models | SmilesToSeq | LSTM, GRU, Transformer |
| 3D molecular structures | CoulombMatrix | Specialized 3D models |
| Quick baseline | RDKitDescriptors | Linear, Ridge, Lasso |

## Models

### Scikit-Learn Integration
- **SklearnModel**: Wrapper for any scikit-learn algorithm
  - Usage: `SklearnModel(model=RandomForestRegressor())`

### Gradient Boosting
- **GBDTModel**: Gradient boosting decision trees (XGBoost, LightGBM)

### PyTorch Models

#### Molecular Property Prediction
- **MultitaskRegressor**: Multi-task regression with shared representations
- **MultitaskClassifier**: Multi-task classification
- **MultitaskFitTransformRegressor**: Regression with learned transformations
- **GCNModel**: Graph convolutional networks
- **GATModel**: Graph attention networks
- **AttentiveFPModel**: Attentive fingerprint networks
- **DMPNNModel**: Directed message passing neural networks
- **GroverModel**: GROVER pre-trained transformer
- **MATModel**: Molecule attention transformer

#### Materials Science
- **CGCNNModel**: Crystal graph convolutional networks
- **MEGNetModel**: Materials graph networks
- **LCNNModel**: Lattice CNN for materials

#### Generative Models
- **GANModel**: Generative adversarial networks
- **WGANModel**: Wasserstein GAN
- **BasicMolGANModel**: Molecular GAN
- **LSTMGenerator**: LSTM-based molecule generation
- **SeqToSeqModel**: Sequence-to-sequence models

#### Physics-Informed Models
- **PINNModel**: Physics-informed neural networks
- **HNNModel**: Hamiltonian neural networks
- **LNN**: Lagrangian neural networks
- **FNOModel**: Fourier neural operators

#### Computer Vision
- **CNN**: Convolutional neural networks
- **UNetModel**: U-Net architecture for segmentation
- **InceptionV3Model**: Pre-trained Inception v3
- **MobileNetV2Model**: Lightweight mobile networks

### Hugging Face Models

- **HuggingFaceModel**: General wrapper for HF transformers
- **Chemberta**: Chemical BERT for molecular property prediction
- **MoLFormer**: Molecular transformer architecture
- **ProtBERT**: Protein sequence BERT
- **DeepAbLLM**: Antibody large language models

### Model Selection Guide

| Task | Recommended Model | Featurizer |
|------|------------------|------------|
| Small dataset (<1000 samples) | SklearnModel (Random Forest) | CircularFingerprint |
| Medium dataset (1K-100K) | GBDTModel or MultitaskRegressor | CircularFingerprint or ConvMolFeaturizer |
| Large dataset (>100K) | GCNModel, AttentiveFPModel, or DMPNN | MolGraphConvFeaturizer |
| Transfer learning | GroverModel, Chemberta, MoLFormer | Model-specific |
| Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Molecule generation | BasicMolGANModel, LSTMGenerator | SmilesToSeq |
| Protein sequences | ProtBERT | Sequence-based |

## MoleculeNet Datasets

Quick access to 30+ benchmark datasets via `dc.molnet.load_*()` functions.

### Classification Datasets
- **load_bace()**: BACE-1 inhibitors (binary classification)
- **load_bbbp()**: Blood-brain barrier penetration
- **load_clintox()**: Clinical toxicity
- **load_hiv()**: HIV inhibition activity
- **load_muv()**: PubChem BioAssay (challenging, sparse)
- **load_pcba()**: PubChem screening data
- **load_sider()**: Adverse drug reactions (multi-label)
- **load_tox21()**: 12 toxicity assays (multi-task)
- **load_toxcast()**: EPA ToxCast screening

### Regression Datasets
- **load_delaney()**: Aqueous solubility (ESOL)
- **load_freesolv()**: Solvation free energy
- **load_lipo()**: Lipophilicity (octanol-water partition)
- **load_qm7/qm8/qm9()**: Quantum mechanical properties
- **load_hopv()**: Organic photovoltaic properties

### Protein-Ligand Binding
- **load_pdbbind()**: Binding affinity data

### Materials Science
- **load_perovskite()**: Perovskite stability
- **load_mp_formation_energy()**: Materials Project formation energy
- **load_mp_metallicity()**: Metal vs. non-metal classification
- **load_bandgap()**: Electronic bandgap prediction

### Chemical Reactions
- **load_uspto()**: USPTO reaction dataset

### Usage Pattern
```python
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='GraphConv',  # or 'ECFP', 'Weave', 'Raw', etc.
    splitter='scaffold',     # or 'random', 'stratified', etc.
    reload=False             # reload=True reuses a previously cached featurized copy if available
)
train, valid, test = datasets
```

## Metrics

Common evaluation metrics available in `dc.metrics`:

### Classification Metrics
- **roc_auc_score**: Area under ROC curve (binary/multi-class)
- **prc_auc_score**: Area under precision-recall curve
- **accuracy_score**: Classification accuracy
- **balanced_accuracy_score**: Balanced accuracy for imbalanced datasets
- **recall_score**: Sensitivity/recall
- **precision_score**: Precision
- **f1_score**: F1 score

### Regression Metrics
- **mean_absolute_error**: MAE
- **mean_squared_error**: MSE
- **root_mean_squared_error**: RMSE
- **r2_score**: R² coefficient of determination
- **pearson_r2_score**: Pearson correlation
- **spearman_correlation**: Spearman rank correlation

### Multi-Task Metrics
Most metrics support multi-task evaluation by averaging over tasks.

## Training Pattern

Standard DeepChem workflow:

```python
# 1. Load data
loader = dc.data.CSVLoader(tasks=['task1'], feature_field='smiles',
                           featurizer=dc.feat.CircularFingerprint())
dataset = loader.create_dataset('data.csv')

# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)

# 3. Transform data (optional)
transformers = [dc.trans.NormalizationTransformer(dataset=train)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)

# 4. Create and train model
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048, layer_sizes=[1000])
model.fit(train, nb_epoch=50)

# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
```

## Common Patterns

### Pattern 1: Quick Baseline with MoleculeNet
```python
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train, valid, test = datasets
model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
model.fit(train)
```

### Pattern 2: Custom Data with Graph Networks
```python
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('my_data.csv')
train, test = dc.splits.RandomSplitter().train_test_split(dataset)
model = dc.models.GCNModel(mode='classification', n_tasks=1)
model.fit(train)
```

### Pattern 3: Transfer Learning with Pretrained Models
```python
model = dc.models.GroverModel(task='classification', n_tasks=1)
model.fit(train_dataset)
predictions = model.predict(test_dataset)
```
491
scientific-packages/deepchem/references/workflows.md
Normal file
@@ -0,0 +1,491 @@
|
||||
# DeepChem Workflows
|
||||
|
||||
This document provides detailed workflows for common DeepChem use cases.
|
||||
|
||||
## Workflow 1: Molecular Property Prediction from SMILES
|
||||
|
||||
**Goal**: Predict molecular properties (e.g., solubility, toxicity, activity) from SMILES strings.
|
||||
|
||||
### Step-by-Step Process
|
||||
|
||||
#### 1. Prepare Your Data
|
||||
Data should be in CSV format with at minimum:
|
||||
- A column with SMILES strings
|
||||
- One or more columns with property values (targets)
|
||||
|
||||
Example CSV structure:
|
||||
```csv
|
||||
smiles,solubility,toxicity
|
||||
CCO,-0.77,0
|
||||
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
|
||||
```
|
||||
|
||||
#### 2. Choose Featurizer
|
||||
Decision tree:
|
||||
- **Small dataset (<1K)**: Use `CircularFingerprint` or `RDKitDescriptors`
|
||||
- **Medium dataset (1K-100K)**: Use `CircularFingerprint` or `MolGraphConvFeaturizer`
|
||||
- **Large dataset (>100K)**: Use graph-based featurizers (`MolGraphConvFeaturizer`, `DMPNNFeaturizer`)
|
||||
- **Transfer learning**: Use pretrained model featurizers (`GroverFeaturizer`)
|
||||
|
||||
#### 3. Load and Featurize Data
|
||||
```python
|
||||
import deepchem as dc
|
||||
|
||||
# For fingerprint-based
|
||||
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
|
||||
# OR for graph-based
|
||||
featurizer = dc.feat.MolGraphConvFeaturizer()
|
||||
|
||||
loader = dc.data.CSVLoader(
|
||||
tasks=['solubility', 'toxicity'], # column names to predict
|
||||
feature_field='smiles', # column with SMILES
|
||||
featurizer=featurizer
|
||||
)
|
||||
dataset = loader.create_dataset('data.csv')
|
||||
```
|
||||
|
||||
#### 4. Split Data
|
||||
**Critical**: Use `ScaffoldSplitter` for drug discovery to prevent data leakage.
|
||||
|
||||
```python
|
||||
splitter = dc.splits.ScaffoldSplitter()
|
||||
train, valid, test = splitter.train_valid_test_split(
|
||||
dataset,
|
||||
frac_train=0.8,
|
||||
frac_valid=0.1,
|
||||
frac_test=0.1
|
||||
)
|
||||
```
|
||||
|
||||
#### 5. Transform Data (Optional but Recommended)
|
||||
```python
|
||||
transformers = [
|
||||
dc.trans.NormalizationTransformer(
|
||||
transform_y=True,
|
||||
dataset=train
|
||||
)
|
||||
]
|
||||
|
||||
for transformer in transformers:
|
||||
train = transformer.transform(train)
|
||||
valid = transformer.transform(valid)
|
||||
test = transformer.transform(test)
|
||||
```
|
||||
|
||||
#### 6. Select and Train Model
|
||||
```python
|
||||
# For fingerprints
|
||||
model = dc.models.MultitaskRegressor(
|
||||
n_tasks=2, # number of properties to predict
|
||||
n_features=2048, # fingerprint size
|
||||
layer_sizes=[1000, 500], # hidden layer sizes
|
||||
dropouts=0.25,
|
||||
learning_rate=0.001
|
||||
)
|
||||
|
||||
# OR for graphs
|
||||
model = dc.models.GCNModel(
|
||||
n_tasks=2,
|
||||
mode='regression',
|
||||
batch_size=128,
|
||||
learning_rate=0.001
|
||||
)
|
||||
|
||||
# Train
|
||||
model.fit(train, nb_epoch=50)
|
||||
```
|
||||
|
||||
#### 7. Evaluate
|
||||
```python
|
||||
metric = dc.metrics.Metric(dc.metrics.r2_score)
|
||||
train_score = model.evaluate(train, [metric])
|
||||
valid_score = model.evaluate(valid, [metric])
|
||||
test_score = model.evaluate(test, [metric])
|
||||
|
||||
print(f"Train R²: {train_score}")
|
||||
print(f"Valid R²: {valid_score}")
|
||||
print(f"Test R²: {test_score}")
|
||||
```
|
||||
|
||||
#### 8. Make Predictions
|
||||
```python
|
||||
# Predict on new molecules
|
||||
new_smiles = ['CCO', 'CC(C)O', 'c1ccccc1']
|
||||
new_featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
|
||||
new_features = new_featurizer.featurize(new_smiles)
|
||||
new_dataset = dc.data.NumpyDataset(X=new_features)
|
||||
|
||||
# Apply same transformations
|
||||
for transformer in transformers:
|
||||
new_dataset = transformer.transform(new_dataset)
|
||||
|
||||
predictions = model.predict(new_dataset)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 2: Using MoleculeNet Benchmark Datasets
|
||||
|
||||
**Goal**: Quickly train and evaluate models on standard benchmarks.
|
||||
|
||||
### Quick Start
|
||||
```python
|
||||
import deepchem as dc
|
||||
|
||||
# Load benchmark dataset
|
||||
tasks, datasets, transformers = dc.molnet.load_tox21(
|
||||
featurizer='GraphConv',
|
||||
splitter='scaffold'
|
||||
)
|
||||
train, valid, test = datasets
|
||||
|
||||
# Train model
|
||||
model = dc.models.GCNModel(
|
||||
n_tasks=len(tasks),
|
||||
mode='classification'
|
||||
)
|
||||
model.fit(train, nb_epoch=50)
|
||||
|
||||
# Evaluate
|
||||
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
|
||||
test_score = model.evaluate(test, [metric])
|
||||
print(f"Test ROC-AUC: {test_score}")
|
||||
```
|
||||
|
||||
### Available Featurizer Options
|
||||
When calling `load_*()` functions (a combined example follows the splitter options below):
|
||||
- `'ECFP'`: Extended-connectivity fingerprints (circular fingerprints)
|
||||
- `'GraphConv'`: Graph convolution features
|
||||
- `'Weave'`: Weave features
|
||||
- `'Raw'`: Raw SMILES strings
|
||||
- `'smiles2img'`: 2D molecular images
|
||||
|
||||
### Available Splitter Options
|
||||
- `'scaffold'`: Scaffold-based splitting (recommended for drug discovery)
|
||||
- `'random'`: Random splitting
|
||||
- `'stratified'`: Stratified splitting (preserves class distributions)
|
||||
- `'butina'`: Butina clustering-based splitting
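
As a quick illustration of swapping these options, the hedged sketch below loads Tox21 with fingerprint features and a random split instead of the graph/scaffold combination used in the quick start above:

```python
import deepchem as dc

# Same MoleculeNet loader, different featurizer/splitter combination
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='ECFP',
    splitter='random'
)
train, valid, test = datasets
print(len(tasks), len(train), len(valid), len(test))
```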
|
||||
|
||||
---
|
||||
|
||||
## Workflow 3: Hyperparameter Optimization
|
||||
|
||||
**Goal**: Find optimal model hyperparameters systematically.
|
||||
|
||||
### Using GridHyperparamOpt
|
||||
```python
|
||||
import deepchem as dc
|
||||
import numpy as np
|
||||
|
||||
# Load data
|
||||
tasks, datasets, transformers = dc.molnet.load_bbbp(
|
||||
featurizer='ECFP',
|
||||
splitter='scaffold'
|
||||
)
|
||||
train, valid, test = datasets
|
||||
|
||||
# Define parameter grid
|
||||
params_dict = {
|
||||
'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
|
||||
'dropouts': [0.0, 0.25, 0.5],
|
||||
'learning_rate': [0.001, 0.0001]
|
||||
}
|
||||
|
||||
# Define model builder function
|
||||
def model_builder(model_params, model_dir):
|
||||
return dc.models.MultitaskClassifier(
|
||||
n_tasks=len(tasks),
|
||||
n_features=1024,
|
||||
**model_params
|
||||
)
|
||||
|
||||
# Setup optimizer
|
||||
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
|
||||
optimizer = dc.hyper.GridHyperparamOpt(model_builder)
|
||||
|
||||
# Run optimization
|
||||
best_model, best_params, all_results = optimizer.hyperparam_search(
|
||||
params_dict,
|
||||
train,
|
||||
valid,
|
||||
metric,
|
||||
transformers=transformers
|
||||
)
|
||||
|
||||
print(f"Best parameters: {best_params}")
|
||||
print(f"Best validation score: {all_results['best_validation_score']}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 4: Transfer Learning with Pretrained Models
|
||||
|
||||
**Goal**: Leverage pretrained models for improved performance on small datasets.
|
||||
|
||||
### Using ChemBERTa
|
||||
```python
|
||||
import deepchem as dc
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
# Load your data
|
||||
loader = dc.data.CSVLoader(
|
||||
tasks=['activity'],
|
||||
feature_field='smiles',
|
||||
featurizer=dc.feat.DummyFeaturizer() # ChemBERTa handles featurization
|
||||
)
|
||||
dataset = loader.create_dataset('data.csv')
|
||||
|
||||
# Split data
|
||||
splitter = dc.splits.ScaffoldSplitter()
|
||||
train, test = splitter.train_test_split(dataset)
|
||||
|
||||
# Load pretrained ChemBERTa
|
||||
model = dc.models.HuggingFaceModel(
|
||||
model='seyonec/ChemBERTa-zinc-base-v1',
|
||||
task='regression',
|
||||
n_tasks=1
|
||||
)
|
||||
|
||||
# Fine-tune
|
||||
model.fit(train, nb_epoch=10)
|
||||
|
||||
# Evaluate
|
||||
predictions = model.predict(test)
|
||||
```
|
||||
|
||||
### Using GROVER
|
||||
```python
|
||||
# GROVER: pre-trained on molecular graphs
|
||||
model = dc.models.GroverModel(
|
||||
task='classification',
|
||||
n_tasks=1,
|
||||
model_dir='./grover_model'
|
||||
)
|
||||
|
||||
# Fine-tune on your data
|
||||
model.fit(train_dataset, nb_epoch=20)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 5: Molecular Generation with GANs
|
||||
|
||||
**Goal**: Generate novel molecules with desired properties.
|
||||
|
||||
### Basic MolGAN
|
||||
```python
|
||||
import deepchem as dc
|
||||
|
||||
# Load training data (molecules for the generator to learn from)
|
||||
tasks, datasets, _ = dc.molnet.load_qm9(
|
||||
featurizer='GraphConv',
|
||||
splitter='random'
|
||||
)
|
||||
train, _, _ = datasets
|
||||
|
||||
# Create and train MolGAN
|
||||
gan = dc.models.BasicMolGANModel(
|
||||
learning_rate=0.001,
|
||||
vertices=9, # max atoms in molecule
|
||||
edges=5, # max bonds
|
||||
nodes=[128, 256, 512]
|
||||
)
|
||||
|
||||
# Train
|
||||
gan.fit_gan(
|
||||
train,
|
||||
nb_epoch=100,
|
||||
generator_steps=0.2,
|
||||
checkpoint_interval=10
|
||||
)
|
||||
|
||||
# Generate new molecules
|
||||
generated_molecules = gan.predict_gan_generator(1000)
|
||||
```
|
||||
|
||||
### Conditional Generation
|
||||
```python
|
||||
# For property-targeted generation
|
||||
import numpy as np
from deepchem.models.optimizers import ExponentialDecay
|
||||
|
||||
gan = dc.models.BasicMolGANModel(
|
||||
learning_rate=ExponentialDecay(0.001, 0.9, 1000),
|
||||
conditional=True # enable conditional generation
|
||||
)
|
||||
|
||||
# Train with properties
|
||||
gan.fit_gan(train, nb_epoch=100)
|
||||
|
||||
# Generate molecules with target properties
|
||||
target_properties = np.array([[5.0, 300.0]]) # e.g., [logP, MW]
|
||||
molecules = gan.predict_gan_generator(
|
||||
1000,
|
||||
conditional_inputs=target_properties
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 6: Materials Property Prediction
|
||||
|
||||
**Goal**: Predict properties of crystalline materials.
|
||||
|
||||
### Using Crystal Graph Convolutional Networks
|
||||
```python
|
||||
import deepchem as dc
|
||||
|
||||
# Load materials data (structure files in CIF format)
|
||||
loader = dc.data.CIFLoader()
|
||||
dataset = loader.create_dataset('materials.csv')
|
||||
|
||||
# Split data
|
||||
splitter = dc.splits.RandomSplitter()
|
||||
train, test = splitter.train_test_split(dataset)
|
||||
|
||||
# Create CGCNN model
|
||||
model = dc.models.CGCNNModel(
|
||||
n_tasks=1,
|
||||
mode='regression',
|
||||
batch_size=32,
|
||||
learning_rate=0.001
|
||||
)
|
||||
|
||||
# Train
|
||||
model.fit(train, nb_epoch=100)
|
||||
|
||||
# Evaluate
|
||||
metric = dc.metrics.Metric(dc.metrics.mae_score)
|
||||
test_score = model.evaluate(test, [metric])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 7: Protein Sequence Analysis
|
||||
|
||||
**Goal**: Predict protein properties from sequences.
|
||||
|
||||
### Using ProtBERT
|
||||
```python
|
||||
import deepchem as dc
|
||||
|
||||
# Load protein sequence data
|
||||
loader = dc.data.FASTALoader()
|
||||
dataset = loader.create_dataset('proteins.fasta')
|
||||
|
||||
# Use ProtBERT
|
||||
model = dc.models.HuggingFaceModel(
|
||||
model='Rostlab/prot_bert',
|
||||
task='classification',
|
||||
n_tasks=1
|
||||
)
|
||||
|
||||
# Split and train
|
||||
splitter = dc.splits.RandomSplitter()
|
||||
train, test = splitter.train_test_split(dataset)
|
||||
model.fit(train, nb_epoch=5)
|
||||
|
||||
# Predict
|
||||
predictions = model.predict(test)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 8: Custom Model Integration
|
||||
|
||||
**Goal**: Use your own PyTorch/scikit-learn models with DeepChem.
|
||||
|
||||
### Wrapping Scikit-Learn Models
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
import deepchem as dc
|
||||
|
||||
# Create scikit-learn model
|
||||
sklearn_model = RandomForestRegressor(
|
||||
n_estimators=100,
|
||||
max_depth=10,
|
||||
random_state=42
|
||||
)
|
||||
|
||||
# Wrap in DeepChem
|
||||
model = dc.models.SklearnModel(model=sklearn_model)
|
||||
|
||||
# Use with DeepChem datasets
|
||||
model.fit(train)
|
||||
predictions = model.predict(test)
|
||||
|
||||
# Evaluate
|
||||
metric = dc.metrics.Metric(dc.metrics.r2_score)
|
||||
score = model.evaluate(test, [metric])
|
||||
```
|
||||
|
||||
### Creating Custom PyTorch Models
|
||||
```python
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import deepchem as dc
|
||||
|
||||
class CustomNetwork(nn.Module):
|
||||
def __init__(self, n_features, n_tasks):
|
||||
super().__init__()
|
||||
self.fc1 = nn.Linear(n_features, 512)
|
||||
self.fc2 = nn.Linear(512, 256)
|
||||
self.fc3 = nn.Linear(256, n_tasks)
|
||||
self.relu = nn.ReLU()
|
||||
self.dropout = nn.Dropout(0.2)
|
||||
|
||||
def forward(self, x):
|
||||
x = self.relu(self.fc1(x))
|
||||
x = self.dropout(x)
|
||||
x = self.relu(self.fc2(x))
|
||||
x = self.dropout(x)
|
||||
return self.fc3(x)
|
||||
|
||||
# Wrap in DeepChem TorchModel
|
||||
model = dc.models.TorchModel(
|
||||
model=CustomNetwork(n_features=2048, n_tasks=1),
|
||||
loss=nn.MSELoss(),
|
||||
output_types=['prediction']
|
||||
)
|
||||
|
||||
# Train
|
||||
model.fit(train, nb_epoch=50)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls and Solutions
|
||||
|
||||
### Issue 1: Data Leakage in Drug Discovery
|
||||
**Problem**: Using random splitting allows similar molecules in train and test sets.
|
||||
**Solution**: Always use `ScaffoldSplitter` for molecular datasets.
|
||||
|
||||
### Issue 2: Imbalanced Classification
|
||||
**Problem**: Poor performance on minority class.
|
||||
**Solution**: Use `BalancingTransformer` or weighted metrics.
|
||||
```python
|
||||
transformer = dc.trans.BalancingTransformer(dataset=train)
|
||||
train = transformer.transform(train)
|
||||
```
|
||||
|
||||
### Issue 3: Memory Issues with Large Datasets
|
||||
**Problem**: Dataset doesn't fit in memory.
|
||||
**Solution**: Use `DiskDataset` instead of `NumpyDataset`.
|
||||
```python
|
||||
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
|
||||
```
|
||||
|
||||
### Issue 4: Overfitting on Small Datasets
|
||||
**Problem**: Model memorizes training data.
|
||||
**Solutions**:
|
||||
1. Use stronger regularization (increase dropout)
|
||||
2. Use simpler models (Random Forest, Ridge; see the sketch below)
|
||||
3. Apply transfer learning (pretrained models)
|
||||
4. Collect more data
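
For point 2, a minimal sketch of a random-forest baseline wrapped in DeepChem; it assumes a fingerprint-featurized `train`/`valid` split as produced in Workflow 1 (names are placeholders, not part of this document's scripts):

```python
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier

# Heavily regularized, low-variance baseline for a small fingerprint dataset
rf = dc.models.SklearnModel(
    model=RandomForestClassifier(n_estimators=500, max_depth=8, class_weight='balanced')
)
rf.fit(train)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(rf.evaluate(valid, [metric]))
```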
|
||||
|
||||
### Issue 5: Poor Graph Neural Network Performance
|
||||
**Problem**: GNN performs worse than fingerprints.
|
||||
**Solutions**:
|
||||
1. Check if dataset is large enough (GNNs need >10K samples typically)
|
||||
2. Increase training epochs
|
||||
3. Try different GNN architectures (AttentiveFP, DMPNN)
|
||||
4. Use pretrained models (GROVER)
|
||||
338
scientific-packages/deepchem/scripts/graph_neural_network.py
Normal file
@@ -0,0 +1,338 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Graph Neural Network Training Script
|
||||
|
||||
This script demonstrates training Graph Convolutional Networks (GCNs) and other
|
||||
graph-based models for molecular property prediction.
|
||||
|
||||
Usage:
|
||||
python graph_neural_network.py --dataset tox21 --model gcn
|
||||
python graph_neural_network.py --dataset bbbp --model attentivefp
|
||||
python graph_neural_network.py --data custom.csv --task-type regression
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import deepchem as dc
|
||||
import sys
|
||||
|
||||
|
||||
AVAILABLE_MODELS = {
|
||||
'gcn': 'Graph Convolutional Network',
|
||||
'gat': 'Graph Attention Network',
|
||||
'attentivefp': 'Attentive Fingerprint',
|
||||
'mpnn': 'Message Passing Neural Network',
|
||||
'dmpnn': 'Directed Message Passing Neural Network'
|
||||
}
|
||||
|
||||
MOLNET_DATASETS = {
|
||||
'tox21': ('classification', 12),
|
||||
'bbbp': ('classification', 1),
|
||||
'bace': ('classification', 1),
|
||||
'hiv': ('classification', 1),
|
||||
'delaney': ('regression', 1),
|
||||
'freesolv': ('regression', 1),
|
||||
'lipo': ('regression', 1)
|
||||
}
|
||||
|
||||
|
||||
def create_model(model_type, n_tasks, mode='classification'):
|
||||
"""
|
||||
Create a graph neural network model.
|
||||
|
||||
Args:
|
||||
model_type: Type of model ('gcn', 'gat', 'attentivefp', etc.)
|
||||
n_tasks: Number of prediction tasks
|
||||
mode: 'classification' or 'regression'
|
||||
|
||||
Returns:
|
||||
DeepChem model
|
||||
"""
|
||||
if model_type == 'gcn':
|
||||
return dc.models.GCNModel(
|
||||
n_tasks=n_tasks,
|
||||
mode=mode,
|
||||
batch_size=128,
|
||||
learning_rate=0.001,
|
||||
dropout=0.0
|
||||
)
|
||||
elif model_type == 'gat':
|
||||
return dc.models.GATModel(
|
||||
n_tasks=n_tasks,
|
||||
mode=mode,
|
||||
batch_size=128,
|
||||
learning_rate=0.001
|
||||
)
|
||||
elif model_type == 'attentivefp':
|
||||
return dc.models.AttentiveFPModel(
|
||||
n_tasks=n_tasks,
|
||||
mode=mode,
|
||||
batch_size=128,
|
||||
learning_rate=0.001
|
||||
)
|
||||
elif model_type == 'mpnn':
|
||||
return dc.models.MPNNModel(
|
||||
n_tasks=n_tasks,
|
||||
mode=mode,
|
||||
batch_size=128,
|
||||
learning_rate=0.001
|
||||
)
|
||||
elif model_type == 'dmpnn':
|
||||
return dc.models.DMPNNModel(
|
||||
n_tasks=n_tasks,
|
||||
mode=mode,
|
||||
batch_size=128,
|
||||
learning_rate=0.001
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Unknown model type: {model_type}")
|
||||
|
||||
|
||||
def train_on_molnet(dataset_name, model_type, n_epochs=50):
|
||||
"""
|
||||
Train a graph neural network on a MoleculeNet benchmark dataset.
|
||||
|
||||
Args:
|
||||
dataset_name: Name of MoleculeNet dataset
|
||||
model_type: Type of model to train
|
||||
n_epochs: Number of training epochs
|
||||
|
||||
Returns:
|
||||
Trained model and test scores
|
||||
"""
|
||||
print("=" * 70)
|
||||
print(f"Training {AVAILABLE_MODELS[model_type]} on {dataset_name.upper()}")
|
||||
print("=" * 70)
|
||||
|
||||
# Get dataset info
|
||||
task_type, n_tasks_default = MOLNET_DATASETS[dataset_name]
|
||||
|
||||
# Load dataset with graph featurization
|
||||
print(f"\nLoading {dataset_name} dataset with GraphConv featurizer...")
|
||||
load_func = getattr(dc.molnet, f'load_{dataset_name}')
|
||||
tasks, datasets, transformers = load_func(
|
||||
featurizer='GraphConv',
|
||||
splitter='scaffold'
|
||||
)
|
||||
train, valid, test = datasets
|
||||
|
||||
n_tasks = len(tasks)
|
||||
print(f"\nDataset Information:")
|
||||
print(f" Task type: {task_type}")
|
||||
print(f" Number of tasks: {n_tasks}")
|
||||
print(f" Training samples: {len(train)}")
|
||||
print(f" Validation samples: {len(valid)}")
|
||||
print(f" Test samples: {len(test)}")
|
||||
|
||||
# Create model
|
||||
print(f"\nCreating {AVAILABLE_MODELS[model_type]} model...")
|
||||
model = create_model(model_type, n_tasks, mode=task_type)
|
||||
|
||||
# Train
|
||||
print(f"\nTraining for {n_epochs} epochs...")
|
||||
model.fit(train, nb_epoch=n_epochs)
|
||||
print("Training complete!")
|
||||
|
||||
# Evaluate
|
||||
print("\n" + "=" * 70)
|
||||
print("Model Evaluation")
|
||||
print("=" * 70)
|
||||
|
||||
if task_type == 'classification':
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
|
||||
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
|
||||
dc.metrics.Metric(dc.metrics.f1_score, name='F1'),
|
||||
]
|
||||
else:
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
|
||||
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
|
||||
dc.metrics.Metric(dc.metrics.root_mean_squared_error, name='RMSE'),
|
||||
]
|
||||
|
||||
results = {}
|
||||
for dataset_name_eval, dataset in [('Train', train), ('Valid', valid), ('Test', test)]:
|
||||
print(f"\n{dataset_name_eval} Set:")
|
||||
scores = model.evaluate(dataset, metrics)
|
||||
results[dataset_name_eval] = scores
|
||||
for metric_name, score in scores.items():
|
||||
print(f" {metric_name}: {score:.4f}")
|
||||
|
||||
return model, results
|
||||
|
||||
|
||||
def train_on_custom_data(data_path, model_type, task_type, target_cols, smiles_col='smiles', n_epochs=50):
|
||||
"""
|
||||
Train a graph neural network on custom CSV data.
|
||||
|
||||
Args:
|
||||
data_path: Path to CSV file
|
||||
model_type: Type of model to train
|
||||
task_type: 'classification' or 'regression'
|
||||
target_cols: List of target column names
|
||||
smiles_col: Name of SMILES column
|
||||
n_epochs: Number of training epochs
|
||||
|
||||
Returns:
|
||||
Trained model and test dataset
|
||||
"""
|
||||
print("=" * 70)
|
||||
print(f"Training {AVAILABLE_MODELS[model_type]} on Custom Data")
|
||||
print("=" * 70)
|
||||
|
||||
# Load and featurize data
|
||||
print(f"\nLoading data from {data_path}...")
|
||||
featurizer = dc.feat.MolGraphConvFeaturizer()
|
||||
loader = dc.data.CSVLoader(
|
||||
tasks=target_cols,
|
||||
feature_field=smiles_col,
|
||||
featurizer=featurizer
|
||||
)
|
||||
dataset = loader.create_dataset(data_path)
|
||||
|
||||
print(f"Loaded {len(dataset)} molecules")
|
||||
|
||||
# Split data
|
||||
print("\nSplitting data with scaffold splitter...")
|
||||
splitter = dc.splits.ScaffoldSplitter()
|
||||
train, valid, test = splitter.train_valid_test_split(
|
||||
dataset,
|
||||
frac_train=0.8,
|
||||
frac_valid=0.1,
|
||||
frac_test=0.1
|
||||
)
|
||||
|
||||
print(f" Training: {len(train)}")
|
||||
print(f" Validation: {len(valid)}")
|
||||
print(f" Test: {len(test)}")
|
||||
|
||||
# Create model
|
||||
print(f"\nCreating {AVAILABLE_MODELS[model_type]} model...")
|
||||
n_tasks = len(target_cols)
|
||||
model = create_model(model_type, n_tasks, mode=task_type)
|
||||
|
||||
# Train
|
||||
print(f"\nTraining for {n_epochs} epochs...")
|
||||
model.fit(train, nb_epoch=n_epochs)
|
||||
print("Training complete!")
|
||||
|
||||
# Evaluate
|
||||
print("\n" + "=" * 70)
|
||||
print("Model Evaluation")
|
||||
print("=" * 70)
|
||||
|
||||
if task_type == 'classification':
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
|
||||
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
|
||||
]
|
||||
else:
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
|
||||
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
|
||||
]
|
||||
|
||||
for dataset_name, dataset in [('Train', train), ('Valid', valid), ('Test', test)]:
|
||||
print(f"\n{dataset_name} Set:")
|
||||
scores = model.evaluate(dataset, metrics)
|
||||
for metric_name, score in scores.items():
|
||||
print(f" {metric_name}: {score:.4f}")
|
||||
|
||||
return model, test
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Train graph neural networks for molecular property prediction'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--model',
|
||||
type=str,
|
||||
choices=list(AVAILABLE_MODELS.keys()),
|
||||
default='gcn',
|
||||
help='Type of graph neural network model'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--dataset',
|
||||
type=str,
|
||||
choices=list(MOLNET_DATASETS.keys()),
|
||||
default=None,
|
||||
help='MoleculeNet dataset to use'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--data',
|
||||
type=str,
|
||||
default=None,
|
||||
help='Path to custom CSV file'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--task-type',
|
||||
type=str,
|
||||
choices=['classification', 'regression'],
|
||||
default='classification',
|
||||
help='Type of prediction task (for custom data)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--targets',
|
||||
nargs='+',
|
||||
default=['target'],
|
||||
help='Names of target columns (for custom data)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--smiles-col',
|
||||
type=str,
|
||||
default='smiles',
|
||||
help='Name of SMILES column'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--epochs',
|
||||
type=int,
|
||||
default=50,
|
||||
help='Number of training epochs'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate arguments
|
||||
if args.dataset is None and args.data is None:
|
||||
print("Error: Must specify either --dataset (MoleculeNet) or --data (custom CSV)",
|
||||
file=sys.stderr)
|
||||
return 1
|
||||
|
||||
if args.dataset and args.data:
|
||||
print("Error: Cannot specify both --dataset and --data",
|
||||
file=sys.stderr)
|
||||
return 1
|
||||
|
||||
# Train model
|
||||
try:
|
||||
if args.dataset:
|
||||
model, results = train_on_molnet(
|
||||
args.dataset,
|
||||
args.model,
|
||||
n_epochs=args.epochs
|
||||
)
|
||||
else:
|
||||
model, test_set = train_on_custom_data(
|
||||
args.data,
|
||||
args.model,
|
||||
args.task_type,
|
||||
args.targets,
|
||||
smiles_col=args.smiles_col,
|
||||
n_epochs=args.epochs
|
||||
)
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("Training Complete!")
|
||||
print("=" * 70)
|
||||
return 0
|
||||
|
||||
except Exception as e:
|
||||
print(f"\nError: {e}", file=sys.stderr)
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
224
scientific-packages/deepchem/scripts/predict_solubility.py
Normal file
@@ -0,0 +1,224 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Molecular Solubility Prediction Script
|
||||
|
||||
This script trains a model to predict aqueous solubility from SMILES strings
|
||||
using the Delaney (ESOL) dataset as an example. Can be adapted for custom datasets.
|
||||
|
||||
Usage:
|
||||
python predict_solubility.py --data custom_data.csv --smiles-col smiles --target-col solubility
|
||||
python predict_solubility.py # Uses Delaney dataset by default
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import deepchem as dc
|
||||
import numpy as np
|
||||
import sys
|
||||
|
||||
|
||||
def train_solubility_model(data_path=None, smiles_col='smiles', target_col='measured log solubility in mols per litre'):
|
||||
"""
|
||||
Train a solubility prediction model.
|
||||
|
||||
Args:
|
||||
data_path: Path to CSV file with SMILES and solubility data. If None, uses Delaney dataset.
|
||||
smiles_col: Name of column containing SMILES strings
|
||||
target_col: Name of column containing solubility values
|
||||
|
||||
Returns:
|
||||
Trained model, test dataset, and transformers
|
||||
"""
|
||||
print("=" * 60)
|
||||
print("DeepChem Solubility Prediction")
|
||||
print("=" * 60)
|
||||
|
||||
# Load data
|
||||
if data_path is None:
|
||||
print("\nUsing Delaney (ESOL) benchmark dataset...")
|
||||
tasks, datasets, transformers = dc.molnet.load_delaney(
|
||||
featurizer='ECFP',
|
||||
splitter='scaffold'
|
||||
)
|
||||
train, valid, test = datasets
|
||||
else:
|
||||
print(f"\nLoading custom data from {data_path}...")
|
||||
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
|
||||
loader = dc.data.CSVLoader(
|
||||
tasks=[target_col],
|
||||
feature_field=smiles_col,
|
||||
featurizer=featurizer
|
||||
)
|
||||
dataset = loader.create_dataset(data_path)
|
||||
|
||||
# Split data
|
||||
print("Splitting data with scaffold splitter...")
|
||||
splitter = dc.splits.ScaffoldSplitter()
|
||||
train, valid, test = splitter.train_valid_test_split(
|
||||
dataset,
|
||||
frac_train=0.8,
|
||||
frac_valid=0.1,
|
||||
frac_test=0.1
|
||||
)
|
||||
|
||||
# Normalize data
|
||||
print("Normalizing features and targets...")
|
||||
transformers = [
|
||||
dc.trans.NormalizationTransformer(
|
||||
transform_y=True,
|
||||
dataset=train
|
||||
)
|
||||
]
|
||||
for transformer in transformers:
|
||||
train = transformer.transform(train)
|
||||
valid = transformer.transform(valid)
|
||||
test = transformer.transform(test)
|
||||
|
||||
tasks = [target_col]
|
||||
|
||||
print(f"\nDataset sizes:")
|
||||
print(f" Training: {len(train)} molecules")
|
||||
print(f" Validation: {len(valid)} molecules")
|
||||
print(f" Test: {len(test)} molecules")
|
||||
|
||||
# Create model
|
||||
print("\nCreating multitask regressor...")
|
||||
model = dc.models.MultitaskRegressor(
|
||||
n_tasks=len(tasks),
|
||||
n_features=2048, # ECFP fingerprint size
|
||||
layer_sizes=[1000, 500],
|
||||
dropouts=0.25,
|
||||
learning_rate=0.001,
|
||||
batch_size=50
|
||||
)
|
||||
|
||||
# Train model
|
||||
print("\nTraining model...")
|
||||
model.fit(train, nb_epoch=50)
|
||||
print("Training complete!")
|
||||
|
||||
# Evaluate model
|
||||
print("\n" + "=" * 60)
|
||||
print("Model Evaluation")
|
||||
print("=" * 60)
|
||||
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
|
||||
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
|
||||
dc.metrics.Metric(dc.metrics.root_mean_squared_error, name='RMSE'),
|
||||
]
|
||||
|
||||
for dataset_name, dataset in [('Train', train), ('Valid', valid), ('Test', test)]:
|
||||
print(f"\n{dataset_name} Set:")
|
||||
scores = model.evaluate(dataset, metrics)
|
||||
for metric_name, score in scores.items():
|
||||
print(f" {metric_name}: {score:.4f}")
|
||||
|
||||
return model, test, transformers
|
||||
|
||||
|
||||
def predict_new_molecules(model, smiles_list, transformers=None):
|
||||
"""
|
||||
Predict solubility for new molecules.
|
||||
|
||||
Args:
|
||||
model: Trained DeepChem model
|
||||
smiles_list: List of SMILES strings
|
||||
transformers: List of data transformers to apply
|
||||
|
||||
Returns:
|
||||
Array of predictions
|
||||
"""
|
||||
print("\n" + "=" * 60)
|
||||
print("Predicting New Molecules")
|
||||
print("=" * 60)
|
||||
|
||||
# Featurize new molecules
|
||||
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
|
||||
features = featurizer.featurize(smiles_list)
|
||||
|
||||
# Create dataset
|
||||
new_dataset = dc.data.NumpyDataset(X=features)
|
||||
|
||||
# Apply transformers (if any)
|
||||
if transformers:
|
||||
for transformer in transformers:
|
||||
new_dataset = transformer.transform(new_dataset)
|
||||
|
||||
# Predict
|
||||
predictions = model.predict(new_dataset)
|
||||
|
||||
# Display results
|
||||
print("\nPredictions:")
|
||||
for smiles, pred in zip(smiles_list, predictions):
|
||||
print(f" {smiles:30s} -> {pred[0]:.3f} log(mol/L)")
|
||||
|
||||
return predictions
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Train a molecular solubility prediction model'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--data',
|
||||
type=str,
|
||||
default=None,
|
||||
help='Path to CSV file with molecular data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--smiles-col',
|
||||
type=str,
|
||||
default='smiles',
|
||||
help='Name of column containing SMILES strings'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--target-col',
|
||||
type=str,
|
||||
default='solubility',
|
||||
help='Name of column containing target values'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--predict',
|
||||
nargs='+',
|
||||
default=None,
|
||||
help='SMILES strings to predict after training'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Train model
|
||||
try:
|
||||
model, test_set, transformers = train_solubility_model(
|
||||
data_path=args.data,
|
||||
smiles_col=args.smiles_col,
|
||||
target_col=args.target_col
|
||||
)
|
||||
except Exception as e:
|
||||
print(f"\nError during training: {e}", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
# Make predictions on new molecules if provided
|
||||
if args.predict:
|
||||
try:
|
||||
predict_new_molecules(model, args.predict, transformers)
|
||||
except Exception as e:
|
||||
print(f"\nError during prediction: {e}", file=sys.stderr)
|
||||
return 1
|
||||
else:
|
||||
# Example predictions
|
||||
example_smiles = [
|
||||
'CCO', # Ethanol
|
||||
'CC(=O)O', # Acetic acid
|
||||
'c1ccccc1', # Benzene
|
||||
'CN1C=NC2=C1C(=O)N(C(=O)N2C)C', # Caffeine
|
||||
]
|
||||
predict_new_molecules(model, example_smiles, transformers)
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("Complete!")
|
||||
print("=" * 60)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
375
scientific-packages/deepchem/scripts/transfer_learning.py
Normal file
@@ -0,0 +1,375 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Transfer Learning Script for DeepChem
|
||||
|
||||
Use pretrained models (ChemBERTa, GROVER, MolFormer) for molecular property prediction
|
||||
with transfer learning. Particularly useful for small datasets.
|
||||
|
||||
Usage:
|
||||
python transfer_learning.py --model chemberta --data my_data.csv --target activity
|
||||
python transfer_learning.py --model grover --dataset bbbp
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import deepchem as dc
|
||||
import sys
|
||||
|
||||
|
||||
PRETRAINED_MODELS = {
|
||||
'chemberta': {
|
||||
'name': 'ChemBERTa',
|
||||
'description': 'BERT pretrained on 77M molecules from ZINC15',
|
||||
'model_id': 'seyonec/ChemBERTa-zinc-base-v1'
|
||||
},
|
||||
'grover': {
|
||||
'name': 'GROVER',
|
||||
'description': 'Graph transformer pretrained on 10M molecules',
|
||||
'model_id': None # GROVER uses its own loading mechanism
|
||||
},
|
||||
'molformer': {
|
||||
'name': 'MolFormer',
|
||||
'description': 'Transformer pretrained on molecular structures',
|
||||
'model_id': 'ibm/MoLFormer-XL-both-10pct'
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
def train_chemberta(train_dataset, valid_dataset, test_dataset, task_type='classification', n_tasks=1, n_epochs=10):
|
||||
"""
|
||||
Fine-tune ChemBERTa on a dataset.
|
||||
|
||||
Args:
|
||||
train_dataset: Training dataset
|
||||
valid_dataset: Validation dataset
|
||||
test_dataset: Test dataset
|
||||
task_type: 'classification' or 'regression'
|
||||
n_tasks: Number of prediction tasks
|
||||
n_epochs: Number of fine-tuning epochs
|
||||
|
||||
Returns:
|
||||
Trained model and evaluation results
|
||||
"""
|
||||
print("=" * 70)
|
||||
print("Fine-tuning ChemBERTa")
|
||||
print("=" * 70)
|
||||
print("\nChemBERTa is a BERT model pretrained on 77M molecules from ZINC15.")
|
||||
print("It uses SMILES strings as input and has learned rich molecular")
|
||||
print("representations that transfer well to downstream tasks.")
|
||||
|
||||
print(f"\nLoading pretrained ChemBERTa model...")
|
||||
model = dc.models.HuggingFaceModel(
|
||||
model=PRETRAINED_MODELS['chemberta']['model_id'],
|
||||
task=task_type,
|
||||
n_tasks=n_tasks,
|
||||
batch_size=32,
|
||||
learning_rate=2e-5 # Lower LR for fine-tuning
|
||||
)
|
||||
|
||||
print(f"\nFine-tuning for {n_epochs} epochs...")
|
||||
print("(This may take a while on the first run as the model is downloaded)")
|
||||
model.fit(train_dataset, nb_epoch=n_epochs)
|
||||
print("Fine-tuning complete!")
|
||||
|
||||
# Evaluate
|
||||
print("\n" + "=" * 70)
|
||||
print("Model Evaluation")
|
||||
print("=" * 70)
|
||||
|
||||
if task_type == 'classification':
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
|
||||
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
|
||||
]
|
||||
else:
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
|
||||
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
|
||||
]
|
||||
|
||||
results = {}
|
||||
for name, dataset in [('Train', train_dataset), ('Valid', valid_dataset), ('Test', test_dataset)]:
|
||||
print(f"\n{name} Set:")
|
||||
scores = model.evaluate(dataset, metrics)
|
||||
results[name] = scores
|
||||
for metric_name, score in scores.items():
|
||||
print(f" {metric_name}: {score:.4f}")
|
||||
|
||||
return model, results
|
||||
|
||||
|
||||
def train_grover(train_dataset, test_dataset, task_type='classification', n_tasks=1, n_epochs=20):
|
||||
"""
|
||||
Fine-tune GROVER on a dataset.
|
||||
|
||||
Args:
|
||||
train_dataset: Training dataset
|
||||
test_dataset: Test dataset
|
||||
task_type: 'classification' or 'regression'
|
||||
n_tasks: Number of prediction tasks
|
||||
n_epochs: Number of fine-tuning epochs
|
||||
|
||||
Returns:
|
||||
Trained model and evaluation results
|
||||
"""
|
||||
print("=" * 70)
|
||||
print("Fine-tuning GROVER")
|
||||
print("=" * 70)
|
||||
print("\nGROVER is a graph transformer pretrained on 10M molecules using")
|
||||
print("self-supervised learning. It learns both node and graph-level")
|
||||
print("representations through masked atom/bond prediction tasks.")
|
||||
|
||||
print(f"\nCreating GROVER model...")
|
||||
model = dc.models.GroverModel(
|
||||
task=task_type,
|
||||
n_tasks=n_tasks,
|
||||
model_dir='./grover_pretrained'
|
||||
)
|
||||
|
||||
print(f"\nFine-tuning for {n_epochs} epochs...")
|
||||
model.fit(train_dataset, nb_epoch=n_epochs)
|
||||
print("Fine-tuning complete!")
|
||||
|
||||
# Evaluate
|
||||
print("\n" + "=" * 70)
|
||||
print("Model Evaluation")
|
||||
print("=" * 70)
|
||||
|
||||
if task_type == 'classification':
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
|
||||
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
|
||||
]
|
||||
else:
|
||||
metrics = [
|
||||
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
|
||||
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
|
||||
]
|
||||
|
||||
results = {}
|
||||
for name, dataset in [('Train', train_dataset), ('Test', test_dataset)]:
|
||||
print(f"\n{name} Set:")
|
||||
scores = model.evaluate(dataset, metrics)
|
||||
results[name] = scores
|
||||
for metric_name, score in scores.items():
|
||||
print(f" {metric_name}: {score:.4f}")
|
||||
|
||||
return model, results
|
||||
|
||||
|
||||
def load_molnet_dataset(dataset_name, model_type):
|
||||
"""
|
||||
Load a MoleculeNet dataset with appropriate featurization.
|
||||
|
||||
Args:
|
||||
dataset_name: Name of MoleculeNet dataset
|
||||
model_type: Type of pretrained model being used
|
||||
|
||||
Returns:
|
||||
tasks, train/valid/test datasets, transformers
|
||||
"""
|
||||
# Map of MoleculeNet datasets
|
||||
molnet_datasets = {
|
||||
'tox21': dc.molnet.load_tox21,
|
||||
'bbbp': dc.molnet.load_bbbp,
|
||||
'bace': dc.molnet.load_bace_classification,
|
||||
'hiv': dc.molnet.load_hiv,
|
||||
'delaney': dc.molnet.load_delaney,
|
||||
'freesolv': dc.molnet.load_freesolv,
|
||||
'lipo': dc.molnet.load_lipo
|
||||
}
|
||||
|
||||
if dataset_name not in molnet_datasets:
|
||||
raise ValueError(f"Unknown dataset: {dataset_name}")
|
||||
|
||||
# ChemBERTa and MolFormer use raw SMILES
|
||||
if model_type in ['chemberta', 'molformer']:
|
||||
featurizer = 'Raw'
|
||||
# GROVER needs graph features
|
||||
elif model_type == 'grover':
|
||||
featurizer = 'GraphConv'
|
||||
else:
|
||||
featurizer = 'ECFP'
|
||||
|
||||
print(f"\nLoading {dataset_name} dataset...")
|
||||
load_func = molnet_datasets[dataset_name]
|
||||
tasks, datasets, transformers = load_func(
|
||||
featurizer=featurizer,
|
||||
splitter='scaffold'
|
||||
)
|
||||
|
||||
return tasks, datasets, transformers
|
||||
|
||||
|
||||
def load_custom_dataset(data_path, target_cols, smiles_col, model_type):
|
||||
"""
|
||||
Load a custom CSV dataset.
|
||||
|
||||
Args:
|
||||
data_path: Path to CSV file
|
||||
target_cols: List of target column names
|
||||
smiles_col: Name of SMILES column
|
||||
model_type: Type of pretrained model being used
|
||||
|
||||
Returns:
|
||||
train, valid, test datasets
|
||||
"""
|
||||
print(f"\nLoading custom data from {data_path}...")
|
||||
|
||||
# Choose featurizer based on model
|
||||
if model_type in ['chemberta', 'molformer']:
|
||||
featurizer = dc.feat.DummyFeaturizer() # Models handle featurization
|
||||
elif model_type == 'grover':
|
||||
featurizer = dc.feat.MolGraphConvFeaturizer()
|
||||
else:
|
||||
featurizer = dc.feat.CircularFingerprint()
|
||||
|
||||
loader = dc.data.CSVLoader(
|
||||
tasks=target_cols,
|
||||
feature_field=smiles_col,
|
||||
featurizer=featurizer
|
||||
)
|
||||
dataset = loader.create_dataset(data_path)
|
||||
|
||||
print(f"Loaded {len(dataset)} molecules")
|
||||
|
||||
# Split data
|
||||
print("Splitting data with scaffold splitter...")
|
||||
splitter = dc.splits.ScaffoldSplitter()
|
||||
train, valid, test = splitter.train_valid_test_split(
|
||||
dataset,
|
||||
frac_train=0.8,
|
||||
frac_valid=0.1,
|
||||
frac_test=0.1
|
||||
)
|
||||
|
||||
print(f" Training: {len(train)}")
|
||||
print(f" Validation: {len(valid)}")
|
||||
print(f" Test: {len(test)}")
|
||||
|
||||
return train, valid, test
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Transfer learning for molecular property prediction'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--model',
|
||||
type=str,
|
||||
choices=list(PRETRAINED_MODELS.keys()),
|
||||
required=True,
|
||||
help='Pretrained model to use'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--dataset',
|
||||
type=str,
|
||||
choices=['tox21', 'bbbp', 'bace', 'hiv', 'delaney', 'freesolv', 'lipo'],
|
||||
default=None,
|
||||
help='MoleculeNet dataset to use'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--data',
|
||||
type=str,
|
||||
default=None,
|
||||
help='Path to custom CSV file'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--target',
|
||||
nargs='+',
|
||||
default=['target'],
|
||||
help='Target column name(s) for custom data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--smiles-col',
|
||||
type=str,
|
||||
default='smiles',
|
||||
help='SMILES column name for custom data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--task-type',
|
||||
type=str,
|
||||
choices=['classification', 'regression'],
|
||||
default='classification',
|
||||
help='Type of prediction task'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--epochs',
|
||||
type=int,
|
||||
default=10,
|
||||
help='Number of fine-tuning epochs'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate arguments
|
||||
if args.dataset is None and args.data is None:
|
||||
print("Error: Must specify either --dataset or --data", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
if args.dataset and args.data:
|
||||
print("Error: Cannot specify both --dataset and --data", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
# Print model info
|
||||
model_info = PRETRAINED_MODELS[args.model]
|
||||
print("\n" + "=" * 70)
|
||||
print(f"Transfer Learning with {model_info['name']}")
|
||||
print("=" * 70)
|
||||
print(f"\n{model_info['description']}")
|
||||
|
||||
try:
|
||||
# Load dataset
|
||||
if args.dataset:
|
||||
tasks, datasets, transformers = load_molnet_dataset(args.dataset, args.model)
|
||||
train, valid, test = datasets
|
||||
task_type = 'classification' if args.dataset in ['tox21', 'bbbp', 'bace', 'hiv'] else 'regression'
|
||||
n_tasks = len(tasks)
|
||||
else:
|
||||
train, valid, test = load_custom_dataset(
|
||||
args.data,
|
||||
args.target,
|
||||
args.smiles_col,
|
||||
args.model
|
||||
)
|
||||
task_type = args.task_type
|
||||
n_tasks = len(args.target)
|
||||
|
||||
# Train model
|
||||
if args.model == 'chemberta':
|
||||
model, results = train_chemberta(
|
||||
train, valid, test,
|
||||
task_type=task_type,
|
||||
n_tasks=n_tasks,
|
||||
n_epochs=args.epochs
|
||||
)
|
||||
elif args.model == 'grover':
|
||||
model, results = train_grover(
|
||||
train, test,
|
||||
task_type=task_type,
|
||||
n_tasks=n_tasks,
|
||||
n_epochs=args.epochs
|
||||
)
|
||||
else:
|
||||
print(f"Error: Model {args.model} not yet implemented", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("Transfer Learning Complete!")
|
||||
print("=" * 70)
|
||||
print("\nTip: Pretrained models often work best with:")
|
||||
print(" - Small datasets (< 1000 samples)")
|
||||
print(" - Lower learning rates (1e-5 to 5e-5)")
|
||||
print(" - Fewer epochs (5-20)")
|
||||
print(" - Avoiding overfitting through early stopping")
|
||||
|
||||
return 0
|
||||
|
||||
except Exception as e:
|
||||
print(f"\nError: {e}", file=sys.stderr)
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
537
scientific-packages/deeptools/SKILL.md
Normal file
@@ -0,0 +1,537 @@
|
||||
---
|
||||
name: deeptools
|
||||
description: Comprehensive toolkit for analyzing next-generation sequencing (NGS) data including ChIP-seq, RNA-seq, ATAC-seq, and related experiments. Use this skill when working with BAM files, bigWig coverage tracks, or when creating heatmaps, profile plots, and quality control visualizations for genomic data. Applicable for tasks involving read coverage analysis, sample correlation, ChIP enrichment assessment, normalization, and publication-quality visualization generation.
|
||||
---
|
||||
|
||||
# deepTools: NGS Data Analysis Toolkit
|
||||
|
||||
## Overview
|
||||
|
||||
deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. This skill provides guidance for using deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.
|
||||
|
||||
**Core capabilities:**
|
||||
- Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph)
|
||||
- Quality control assessment (fingerprint, correlation, coverage)
|
||||
- Sample comparison and correlation analysis
|
||||
- Heatmap and profile plot generation around genomic features
|
||||
- Enrichment analysis and peak region visualization
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Invoke this skill when users request tasks involving:
|
||||
|
||||
- **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data"
|
||||
- **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis"
|
||||
- **Visualization**: "create heatmap around TSS", "plot ChIP signal", "visualize enrichment", "generate profile plot"
|
||||
- **Sample comparison**: "compare treatment vs control", "correlate samples", "PCA analysis"
|
||||
- **Analysis workflows**: "analyze ChIP-seq data", "RNA-seq coverage", "ATAC-seq analysis", "complete workflow"
|
||||
- **Working with specific file types**: BAM files, bigWig files, BED region files in genomics context
|
||||
|
||||
## Quick Start
|
||||
|
||||
For users new to deepTools, start with file validation and common workflows:
|
||||
|
||||
### 1. Validate Input Files
|
||||
|
||||
Before running any analysis, validate BAM, bigWig, and BED files using the validation script:
|
||||
|
||||
```bash
|
||||
python scripts/validate_files.py --bam sample1.bam sample2.bam --bed regions.bed
|
||||
```
|
||||
|
||||
This checks file existence, BAM indices, and format correctness.
|
||||
|
||||
### 2. Generate Workflow Template
|
||||
|
||||
For standard analyses, use the workflow generator to create customized scripts:
|
||||
|
||||
```bash
|
||||
# List available workflows
|
||||
python scripts/workflow_generator.py --list
|
||||
|
||||
# Generate ChIP-seq QC workflow
|
||||
python scripts/workflow_generator.py chipseq_qc -o qc_workflow.sh \
|
||||
--input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
|
||||
--genome-size 2913022398
|
||||
|
||||
# Make executable and run
|
||||
chmod +x qc_workflow.sh
|
||||
./qc_workflow.sh
|
||||
```
|
||||
|
||||
### 3. Most Common Operations
|
||||
|
||||
See `assets/quick_reference.md` for frequently used commands and parameters.
|
||||
|
||||
## Installation
|
||||
|
||||
Guide users to install deepTools using conda (recommended):
|
||||
|
||||
```bash
|
||||
# Standard installation
|
||||
conda install -c conda-forge -c bioconda deeptools
|
||||
|
||||
# For M1 Macs
|
||||
CONDA_SUBDIR=osx-64 conda create -c conda-forge -c bioconda -n deeptools deeptools
|
||||
```
|
||||
|
||||
Or using pip:
|
||||
|
||||
```bash
|
||||
pip install deeptools
|
||||
```
|
||||
|
||||
## Core Workflows
|
||||
|
||||
deepTools workflows typically follow this pattern: **QC → Normalization → Comparison/Visualization**
|
||||
|
||||
### ChIP-seq Quality Control Workflow
|
||||
|
||||
When users request ChIP-seq QC or quality assessment:
|
||||
|
||||
1. **Generate workflow script** using `scripts/workflow_generator.py chipseq_qc`
|
||||
2. **Key QC steps**:
|
||||
- Sample correlation (multiBamSummary + plotCorrelation)
|
||||
- PCA analysis (plotPCA)
|
||||
- Coverage assessment (plotCoverage)
|
||||
- Fragment size validation (bamPEFragmentSize)
|
||||
- ChIP enrichment strength (plotFingerprint)
|
||||
|
||||
**Interpreting results:**
|
||||
- **Correlation**: Replicates should cluster together with high correlation (>0.9)
|
||||
- **Fingerprint**: Strong ChIP shows steep rise; flat diagonal indicates poor enrichment
|
||||
- **Coverage**: Assess if sequencing depth is adequate for analysis
|
||||
|
||||
Full workflow details in `references/workflows.md` → "ChIP-seq Quality Control Workflow"
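
If only the correlation/PCA part of the QC is needed, a minimal hedged sketch is shown below (file names are placeholders; BAM files must be coordinate-sorted and indexed):

```bash
# Count reads in genome-wide bins across all samples
multiBamSummary bins --bamfiles Input.bam ChIP1.bam ChIP2.bam \
    -o readCounts.npz --numberOfProcessors 8

# Pairwise correlation heatmap and PCA of the samples
plotCorrelation -in readCounts.npz --corMethod spearman \
    --whatToPlot heatmap --plotNumbers -o replicate_correlation.png
plotPCA -in readCounts.npz -o replicate_pca.png
```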
|
||||
|
||||
### ChIP-seq Complete Analysis Workflow
|
||||
|
||||
For full ChIP-seq analysis from BAM to visualizations:
|
||||
|
||||
1. **Generate coverage tracks** with normalization (bamCoverage)
|
||||
2. **Create comparison tracks** (bamCompare for log2 ratio)
|
||||
3. **Compute signal matrices** around features (computeMatrix)
|
||||
4. **Generate visualizations** (plotHeatmap, plotProfile)
|
||||
5. **Enrichment analysis** at peaks (plotEnrichment)
|
||||
|
||||
Use `scripts/workflow_generator.py chipseq_analysis` to generate template.
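
Steps 1-4 are illustrated in the tool sections below; for step 5, a minimal hedged sketch (peak and BAM file names are placeholders):

```bash
# Fraction of reads falling into peak regions per sample
plotEnrichment --bamfiles ChIP1.bam ChIP2.bam --BED peaks.bed \
    -o peak_enrichment.png --extendReads 200 --ignoreDuplicates \
    --numberOfProcessors 8
```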
|
||||
|
||||
Complete command sequences in `references/workflows.md` → "ChIP-seq Analysis Workflow"
|
||||
|
||||
### RNA-seq Coverage Workflow
|
||||
|
||||
For strand-specific RNA-seq coverage tracks:
|
||||
|
||||
Use bamCoverage with `--filterRNAstrand` to separate forward and reverse strands.
|
||||
|
||||
**Important:** NEVER use `--extendReads` for RNA-seq (would extend over splice junctions).
|
||||
|
||||
Use normalization: CPM for fixed bins, RPKM for gene-level analysis.
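
A minimal sketch for a stranded library (file names are placeholders; swap forward/reverse to match the library protocol):

```bash
# Forward-strand transcripts (note: no --extendReads for RNA-seq)
bamCoverage --bam rnaseq.bam -o rnaseq_fwd.bw \
    --filterRNAstrand forward --normalizeUsing CPM \
    --binSize 10 --numberOfProcessors 8

# Reverse-strand transcripts
bamCoverage --bam rnaseq.bam -o rnaseq_rev.bw \
    --filterRNAstrand reverse --normalizeUsing CPM \
    --binSize 10 --numberOfProcessors 8
```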
|
||||
|
||||
Template available: `scripts/workflow_generator.py rnaseq_coverage`
|
||||
|
||||
Details in `references/workflows.md` → "RNA-seq Coverage Workflow"
|
||||
|
||||
### ATAC-seq Analysis Workflow
|
||||
|
||||
ATAC-seq requires Tn5 offset correction:
|
||||
|
||||
1. **Shift reads** using alignmentSieve with `--ATACshift`
|
||||
2. **Generate coverage** with bamCoverage
|
||||
3. **Analyze fragment sizes** (expect nucleosome ladder pattern)
|
||||
4. **Visualize at peaks** if available
|
||||
|
||||
Template: `scripts/workflow_generator.py atacseq`
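
A hedged sketch of steps 1-2 (file names are placeholders, hg38 effective genome size is assumed, and the shifted BAM is re-sorted with samtools, which is assumed to be installed):

```bash
# 1. Tn5 offset correction
alignmentSieve --bam atac.bam --outFile atac_shifted.bam \
    --ATACshift --numberOfProcessors 8

# Re-sort and index the shifted BAM before downstream use
samtools sort -o atac_shifted.sorted.bam atac_shifted.bam
samtools index atac_shifted.sorted.bam

# 2. Normalized coverage track
bamCoverage --bam atac_shifted.sorted.bam -o atac.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8
```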
|
||||
|
||||
Full workflow in `references/workflows.md` → "ATAC-seq Workflow"
|
||||
|
||||
## Tool Categories and Common Tasks
|
||||
|
||||
### BAM/bigWig Processing
|
||||
|
||||
**Convert BAM to normalized coverage:**
|
||||
```bash
|
||||
bamCoverage --bam input.bam --outFileName output.bw \
|
||||
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
|
||||
--binSize 10 --numberOfProcessors 8
|
||||
```
|
||||
|
||||
**Compare two samples (log2 ratio):**
|
||||
```bash
|
||||
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
|
||||
--operation log2 --scaleFactorsMethod readCount
|
||||
```
|
||||
|
||||
**Key tools:** bamCoverage, bamCompare, multiBamSummary, multiBigwigSummary, correctGCBias, alignmentSieve
|
||||
|
||||
Complete reference: `references/tools_reference.md` → "BAM and bigWig File Processing Tools"
|
||||
|
||||
### Quality Control
|
||||
|
||||
**Check ChIP enrichment:**
|
||||
```bash
|
||||
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
|
||||
--extendReads 200 --ignoreDuplicates
|
||||
```
|
||||
|
||||
**Sample correlation:**
|
||||
```bash
|
||||
multiBamSummary bins --bamfiles *.bam -o counts.npz
|
||||
plotCorrelation -in counts.npz --corMethod pearson \
    --whatToPlot heatmap -o correlation.png
|
||||
```
|
||||
|
||||
**Key tools:** plotFingerprint, plotCoverage, plotCorrelation, plotPCA, bamPEFragmentSize
|
||||
|
||||
Complete reference: `references/tools_reference.md` → "Quality Control Tools"
|
||||
|
||||
### Visualization
|
||||
|
||||
**Create heatmap around TSS:**
|
||||
```bash
|
||||
# Compute matrix
|
||||
computeMatrix reference-point -S signal.bw -R genes.bed \
|
||||
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz
|
||||
|
||||
# Generate heatmap
|
||||
plotHeatmap -m matrix.gz -o heatmap.png \
|
||||
--colorMap RdBu --kmeans 3
|
||||
```
|
||||
|
||||
**Create profile plot:**
|
||||
```bash
|
||||
plotProfile -m matrix.gz -o profile.png \
|
||||
--plotType lines --colors blue red
|
||||
```
|
||||
|
||||
**Key tools:** computeMatrix, plotHeatmap, plotProfile, plotEnrichment
|
||||
|
||||
Complete reference: `references/tools_reference.md` → "Visualization Tools"
|
||||
|
||||
## Normalization Methods
|
||||
|
||||
Choosing the correct normalization is critical for valid comparisons. Consult `references/normalization_methods.md` for comprehensive guidance.
|
||||
|
||||
**Quick selection guide:**
|
||||
|
||||
- **ChIP-seq coverage**: Use RPGC or CPM
|
||||
- **ChIP-seq comparison**: Use bamCompare with log2 and readCount
|
||||
- **RNA-seq bins**: Use CPM
|
||||
- **RNA-seq genes**: Use RPKM (accounts for gene length)
|
||||
- **ATAC-seq**: Use RPGC or CPM
|
||||
|
||||
**Normalization methods:**
|
||||
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
|
||||
- **CPM**: Counts per million mapped reads
|
||||
- **RPKM**: Reads per kb per million (accounts for region length)
|
||||
- **BPM**: Bins per million
|
||||
- **None**: Raw counts (not recommended for comparisons)
|
||||
|
||||
Full explanation: `references/normalization_methods.md`
|
||||
|
||||
## Effective Genome Sizes
|
||||
|
||||
RPGC normalization requires effective genome size. Common values:
|
||||
|
||||
| Organism | Assembly | Size | Usage |
|
||||
|----------|----------|------|-------|
|
||||
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
|
||||
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
|
||||
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
|
||||
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
|
||||
| *C. elegans* | ce10/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
|
||||
|
||||
Complete table with read-length-specific values: `references/effective_genome_sizes.md`
|
||||
|
||||
## Common Parameters Across Tools
|
||||
|
||||
Many deepTools commands share these options (a combined example follows at the end of this list):
|
||||
|
||||
**Performance:**
|
||||
- `--numberOfProcessors, -p`: Enable parallel processing (always use available cores)
|
||||
- `--region`: Process specific regions for testing (e.g., `chr1:1-1000000`)
|
||||
|
||||
**Read Filtering:**
|
||||
- `--ignoreDuplicates`: Remove PCR duplicates (recommended for most analyses)
|
||||
- `--minMappingQuality`: Filter by alignment quality (e.g., `--minMappingQuality 10`)
|
||||
- `--minFragmentLength` / `--maxFragmentLength`: Fragment length bounds
|
||||
- `--samFlagInclude` / `--samFlagExclude`: SAM flag filtering
|
||||
|
||||
**Read Processing:**
|
||||
- `--extendReads`: Extend to fragment length (ChIP-seq: YES, RNA-seq: NO)
|
||||
- `--centerReads`: Center at fragment midpoint for sharper signals
|
||||
|
||||
## Best Practices
|
||||
|
||||
### File Validation
|
||||
**Always validate files first** using `scripts/validate_files.py` to check:
|
||||
- File existence and readability
|
||||
- BAM indices present (.bai files)
|
||||
- BED format correctness
|
||||
- File sizes reasonable
|
||||
|
||||
### Analysis Strategy
|
||||
|
||||
1. **Start with QC**: Run correlation, coverage, and fingerprint analysis before proceeding
|
||||
2. **Test on small regions**: Use `--region chr1:1-10000000` for parameter testing
|
||||
3. **Document commands**: Save full command lines for reproducibility
|
||||
4. **Use consistent normalization**: Apply same method across samples in comparisons
|
||||
5. **Verify genome assembly**: Ensure BAM and BED files use matching genome builds
|
||||
|
||||
### ChIP-seq Specific
|
||||
|
||||
- **Always extend reads** for ChIP-seq: `--extendReads 200`
|
||||
- **Remove duplicates**: Use `--ignoreDuplicates` in most cases
|
||||
- **Check enrichment first**: Run plotFingerprint before detailed analysis
|
||||
- **GC correction**: Only apply if significant bias detected; never use `--ignoreDuplicates` after GC correction
|
||||
|
||||
### RNA-seq Specific
|
||||
|
||||
- **Never extend reads** for RNA-seq (would span splice junctions)
|
||||
- **Strand-specific**: Use `--filterRNAstrand forward/reverse` for stranded libraries
|
||||
- **Normalization**: CPM for bins, RPKM for genes
|
||||
|
||||
### ATAC-seq Specific
|
||||
|
||||
- **Apply Tn5 correction**: Use alignmentSieve with `--ATACshift`
|
||||
- **Fragment filtering**: Set appropriate min/max fragment lengths
|
||||
- **Check nucleosome pattern**: Fragment size plot should show ladder pattern
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
1. **Use multiple processors**: `--numberOfProcessors 8` (or available cores)
|
||||
2. **Increase bin size** for faster processing and smaller files
|
||||
3. **Process chromosomes separately** for memory-limited systems
|
||||
4. **Pre-filter BAM files** using alignmentSieve to create reusable filtered files
|
||||
5. **Use bigWig over bedGraph**: Compressed and faster to process

## Troubleshooting

### Common Issues

**BAM index missing:**
```bash
samtools index input.bam
```

**Out of memory:**
Process chromosomes individually using `--region`:
```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```

**Slow processing:**
Increase `--numberOfProcessors` and/or increase `--binSize`

**bigWig files too large:**
Increase bin size: `--binSize 50` or larger

### Validation Errors

Run validation script to identify issues:
```bash
python scripts/validate_files.py --bam *.bam --bed regions.bed
```

Common errors and solutions are explained in the script output.

## Reference Documentation

This skill includes comprehensive reference documentation:

### references/tools_reference.md
Complete documentation of all deepTools commands organized by category:
- BAM and bigWig processing tools (9 tools)
- Quality control tools (6 tools)
- Visualization tools (3 tools)
- Miscellaneous tools (2 tools)

Each tool includes:
- Purpose and overview
- Key parameters with explanations
- Usage examples
- Important notes and best practices

**Use this reference when:** Users ask about specific tools, parameters, or detailed usage.

### references/workflows.md
Complete workflow examples for common analyses:
- ChIP-seq quality control workflow
- ChIP-seq complete analysis workflow
- RNA-seq coverage workflow
- ATAC-seq analysis workflow
- Multi-sample comparison workflow
- Peak region analysis workflow
- Troubleshooting and performance tips

**Use this reference when:** Users need complete analysis pipelines or workflow examples.

### references/normalization_methods.md
Comprehensive guide to normalization methods:
- Detailed explanation of each method (RPGC, CPM, RPKM, BPM, etc.)
- When to use each method
- Formulas and interpretation
- Selection guide by experiment type
- Common pitfalls and solutions
- Quick reference table

**Use this reference when:** Users ask about normalization, comparing samples, or which method to use.

### references/effective_genome_sizes.md
Effective genome size values and usage:
- Common organism values (human, mouse, fly, worm, zebrafish)
- Read-length-specific values
- Calculation methods
- When and how to use in commands
- Custom genome calculation instructions

**Use this reference when:** Users need genome size for RPGC normalization or GC bias correction.

## Helper Scripts

### scripts/validate_files.py

Validates BAM, bigWig, and BED files for deepTools analysis. Checks file existence, indices, and format.

**Usage:**
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam \
    --bed peaks.bed --bigwig signal.bw
```

**When to use:** Before starting any analysis, or when troubleshooting errors.

### scripts/workflow_generator.py

Generates customizable bash script templates for common deepTools workflows.

**Available workflows:**
- `chipseq_qc`: ChIP-seq quality control
- `chipseq_analysis`: Complete ChIP-seq analysis
- `rnaseq_coverage`: Strand-specific RNA-seq coverage
- `atacseq`: ATAC-seq with Tn5 correction

**Usage:**
```bash
# List workflows
python scripts/workflow_generator.py --list

# Generate workflow
python scripts/workflow_generator.py chipseq_qc -o qc.sh \
    --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
    --genome-size 2913022398 --threads 8

# Run generated workflow
chmod +x qc.sh
./qc.sh
```

**When to use:** Users request standard workflows or need template scripts to customize.

## Assets

### assets/quick_reference.md

Quick reference card with most common commands, effective genome sizes, and typical workflow pattern.

**When to use:** Users need quick command examples without detailed documentation.

## Handling User Requests

### For New Users

1. Start with installation verification
2. Validate input files using `scripts/validate_files.py`
3. Recommend appropriate workflow based on experiment type
4. Generate workflow template using `scripts/workflow_generator.py`
5. Guide through customization and execution

### For Experienced Users

1. Provide specific tool commands for requested operations
2. Reference appropriate sections in `references/tools_reference.md`
3. Suggest optimizations and best practices
4. Offer troubleshooting for issues

### For Specific Tasks

**"Convert BAM to bigWig":**
- Use bamCoverage with appropriate normalization
- Recommend RPGC or CPM based on use case
- Provide effective genome size for organism
- Suggest relevant parameters (extendReads, ignoreDuplicates, binSize); see the sketch below
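
For example, a typical ChIP-seq conversion might look like this (hypothetical file names; hg38 effective genome size):

```bash
bamCoverage --bam chip.bam --outFileName chip.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --extendReads 200 --ignoreDuplicates \
    --numberOfProcessors 8
```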

**"Check ChIP quality":**
- Run full QC workflow or use plotFingerprint specifically
- Explain interpretation of results
- Suggest follow-up actions based on results

**"Create heatmap":**
- Guide through two-step process: computeMatrix → plotHeatmap (see the sketch below)
- Help choose appropriate matrix mode (reference-point vs scale-regions)
- Suggest visualization parameters and clustering options
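
A minimal two-step sketch (placeholder bigWig and BED files):

```bash
computeMatrix reference-point -S signal.bw -R genes.bed \
    --referencePoint TSS -b 3000 -a 3000 -o matrix.gz \
    --numberOfProcessors 8

plotHeatmap -m matrix.gz -o heatmap.png --colorMap RdBu --kmeans 3
```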

**"Compare samples":**
- Recommend bamCompare for two-sample comparison (see the sketch below)
- Suggest multiBamSummary + plotCorrelation for multiple samples
- Guide normalization method selection
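
A sketch covering both cases (placeholder BAM files):

```bash
# Two samples: log2 ratio of treatment over control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw \
    --operation log2 --scaleFactorsMethod readCount

# Several samples: genome-wide counts, then a correlation heatmap
multiBamSummary bins --bamfiles s1.bam s2.bam s3.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod spearman \
    --whatToPlot heatmap -o correlation.png --plotNumbers
```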

### Referencing Documentation

When users need detailed information:
- **Tool details**: Direct to specific sections in `references/tools_reference.md`
- **Workflows**: Use `references/workflows.md` for complete analysis pipelines
- **Normalization**: Consult `references/normalization_methods.md` for method selection
- **Genome sizes**: Reference `references/effective_genome_sizes.md`

Search references using grep patterns:
```bash
# Find tool documentation
grep -A 20 "^### toolname" references/tools_reference.md

# Find workflow
grep -A 50 "^## Workflow Name" references/workflows.md

# Find normalization method
grep -A 15 "^### Method Name" references/normalization_methods.md
```

## Example Interactions

**User: "I need to analyze my ChIP-seq data"**

Response approach:
1. Ask about files available (BAM files, peaks, genes)
2. Validate files using validation script
3. Generate chipseq_analysis workflow template
4. Customize for their specific files and organism
5. Explain each step as the script runs

**User: "Which normalization should I use?"**

Response approach:
1. Ask about experiment type (ChIP-seq, RNA-seq, etc.)
2. Ask about comparison goal (within-sample or between-sample)
3. Consult `references/normalization_methods.md` selection guide
4. Recommend appropriate method with justification
5. Provide command example with parameters

**User: "Create a heatmap around TSS"**

Response approach:
1. Verify bigWig and gene BED files are available
2. Use computeMatrix with reference-point mode at TSS
3. Generate plotHeatmap with appropriate visualization parameters
4. Suggest clustering if dataset is large
5. Offer profile plot as complement

## Key Reminders

- **File validation first**: Always validate input files before analysis
- **Normalization matters**: Choose appropriate method for comparison type
- **Extend reads carefully**: YES for ChIP-seq, NO for RNA-seq
- **Use all cores**: Set `--numberOfProcessors` to available cores
- **Test on regions**: Use `--region` for parameter testing
- **Check QC first**: Run quality control before detailed analysis
- **Document everything**: Save commands for reproducibility
- **Reference documentation**: Use comprehensive references for detailed guidance

58
scientific-packages/deeptools/assets/quick_reference.md
Normal file
@@ -0,0 +1,58 @@
# deepTools Quick Reference

## Most Common Commands

### BAM to bigWig (normalized)
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8
```

### Compare two BAM files
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
    --operation log2 --scaleFactorsMethod readCount
```

### Correlation heatmap
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
    --whatToPlot heatmap -o correlation.png
```

### Heatmap around TSS
```bash
computeMatrix reference-point -S signal.bw -R genes.bed \
    -b 3000 -a 3000 --referencePoint TSS -o matrix.gz

plotHeatmap -m matrix.gz -o heatmap.png
```

### ChIP enrichment check
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
    --extendReads 200 --ignoreDuplicates
```

## Effective Genome Sizes

| Organism | Assembly | Size |
|----------|----------|------|
| Human | hg38 | 2913022398 |
| Mouse | mm10 | 2652783500 |
| Fly | dm6 | 142573017 |

## Common Normalization Methods

- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
- **CPM**: Counts per million (for fixed bins)
- **RPKM**: Reads per kb per million (for genes)

## Typical Workflow

1. **QC**: plotFingerprint, plotCorrelation
2. **Coverage**: bamCoverage with normalization
3. **Comparison**: bamCompare for treatment vs control
4. **Visualization**: computeMatrix → plotHeatmap/plotProfile

116
scientific-packages/deeptools/references/effective_genome_sizes.md
Normal file
@@ -0,0 +1,116 @@
# Effective Genome Sizes

## Definition

Effective genome size refers to the length of the "mappable" genome - regions that can be uniquely mapped by sequencing reads. This metric is crucial for proper normalization in many deepTools commands.

## Why It Matters

- Required for RPGC normalization (`--normalizeUsing RPGC`)
- Affects accuracy of coverage calculations
- Must match your data processing approach (filtered vs unfiltered reads)

## Calculation Methods

1. **Non-N bases**: Count of non-N nucleotides in genome sequence
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)

## Common Organism Values

### Using Non-N Bases Method

| Organism | Assembly | Effective Size | Full Command |
|----------|----------|----------------|--------------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |

### Human (GRCh38) by Read Length

For quality-filtered reads, values vary by read length:

| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.7 billion |
| 75bp | ~2.8 billion |
| 100bp | ~2.8 billion |
| 150bp | ~2.9 billion |
| 250bp | ~2.9 billion |

### Mouse (GRCm38) by Read Length

| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.3 billion |
| 75bp | ~2.5 billion |
| 100bp | ~2.6 billion |

## Usage in deepTools

The effective genome size is most commonly used with:

### bamCoverage with RPGC normalization
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

### bamCompare with RPGC normalization
```bash
bamCompare -b1 treatment.bam -b2 control.bam \
    --outFileName comparison.bw \
    --scaleFactorsMethod None \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

### computeGCBias / correctGCBias
```bash
computeGCBias --bamfile input.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --fragmentLength 200 \
    --biasPlot bias.png
```

## Choosing the Right Value

**For most analyses:** Use the non-N bases method value for your reference genome

**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values

**When unsure:** Use the conservative non-N bases value - it's more widely applicable

## Common Shortcuts

deepTools also accepts these shorthand values in some contexts:

- `hs` or `GRCh38`: 2913022398
- `mm` or `GRCm38`: 2652783500
- `dm` or `dm6`: 142573017
- `ce` or `ce10`: 100286401

Check your specific deepTools version documentation for supported shortcuts.

## Calculating Custom Values

For custom genomes or assemblies, calculate the non-N bases count:

```bash
# Using faCount (UCSC tools): total length minus N count
faCount genome.fa | grep "total" | awk '{print $2-$7}'

# Using seqtk: sum the A, C, G, and T counts
seqtk comp genome.fa | awk '{x+=$3+$4+$5+$6} END{print x}'
```

## References

For the most up-to-date effective genome sizes and detailed calculation methods, see:
- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
- ENCODE documentation for reference genome details

410
scientific-packages/deeptools/references/normalization_methods.md
Normal file
@@ -0,0 +1,410 @@
# deepTools Normalization Methods

This document explains the various normalization methods available in deepTools and when to use each one.

## Why Normalize?

Normalization is essential for:
1. **Comparing samples with different sequencing depths**
2. **Accounting for library size differences**
3. **Making coverage values interpretable across experiments**
4. **Enabling fair comparisons between conditions**

Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.

---

## Available Normalization Methods

### 1. RPKM (Reads Per Kilobase per Million mapped reads)

**Formula:** `(Number of reads) / (Length of region in kb × Total mapped reads in millions)`

**When to use:**
- Comparing different genomic regions within the same sample
- Adjusting for both sequencing depth AND region length
- RNA-seq gene expression analysis

**Available in:** `bamCoverage`

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPKM
```

**Interpretation:** RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.

**Pros:**
- Accounts for both region length and library size
- Widely used and understood in genomics

**Cons:**
- Not ideal for comparing between samples if total RNA content differs
- Can be misleading when comparing samples with very different compositions

---

### 2. CPM (Counts Per Million mapped reads)

**Formula:** `(Number of reads) / (Total mapped reads in millions)`

**Also known as:** RPM (Reads Per Million)

**When to use:**
- Comparing the same genomic regions across different samples
- When region length is constant or not relevant
- ChIP-seq, ATAC-seq, DNase-seq analyses

**Available in:** `bamCoverage`, `bamCompare`

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing CPM
```

**Interpretation:** CPM of 5 means 5 reads per million mapped reads in that bin.

**Pros:**
- Simple and intuitive
- Good for comparing samples with different sequencing depths
- Appropriate when comparing fixed-size bins

**Cons:**
- Does not account for region length
- Affected by highly abundant regions (e.g., rRNA in RNA-seq)

---

### 3. BPM (Bins Per Million mapped reads)

**Formula:** `(Number of reads in bin) / (Sum of all reads in bins in millions)`

**Key difference from CPM:** Only considers reads that fall within the analyzed bins, not all mapped reads.

**When to use:**
- Similar to CPM, but when you want to exclude reads outside analyzed regions
- Comparing specific genomic regions while ignoring background

**Available in:** `bamCoverage`, `bamCompare`

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing BPM
```

**Interpretation:** BPM accounts only for reads in the binned regions.

**Pros:**
- Focuses normalization on analyzed regions
- Less affected by reads in unanalyzed areas

**Cons:**
- Less commonly used, may be harder to compare with published data

---

### 4. RPGC (Reads Per Genomic Content)

**Formula:** `(Number of reads × Scaling factor) / Effective genome size`

**Scaling factor:** Calculated to achieve 1× genomic coverage (1 read per base)

**When to use:**
- Want comparable coverage values across samples
- Need interpretable absolute coverage values
- Comparing samples with very different total read counts
- ChIP-seq with spike-in normalization context

**Available in:** `bamCoverage`, `bamCompare`

**Requires:** `--effectiveGenomeSize` parameter

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

**Interpretation:** Signal value approximates the coverage depth (e.g., value of 2 ≈ 2× coverage).

**Pros:**
- Produces 1× normalized coverage
- Interpretable in terms of genomic coverage
- Good for comparing samples with different sequencing depths

**Cons:**
- Requires knowing effective genome size
- Assumes uniform coverage (not true for ChIP-seq with peaks)

---

### 5. None (No Normalization)

**Formula:** Raw read counts

**When to use:**
- Preliminary analysis
- When samples have identical library sizes (rare)
- When downstream tool will perform normalization
- Debugging or quality control

**Available in:** All tools (usually default)

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing None
```

**Interpretation:** Raw read counts per bin.

**Pros:**
- No assumptions made
- Useful for seeing raw data
- Fastest computation

**Cons:**
- Cannot fairly compare samples with different sequencing depths
- Not suitable for publication figures

---

### 6. SES (Signal Extraction Scaling)

**Method:** Signal Extraction Scaling, a more sophisticated method for comparing ChIP to control

**When to use:**
- ChIP-seq analysis with bamCompare
- Want sophisticated background correction
- Alternative to simple readCount scaling

**Available in:** `bamCompare` only

**Example:**
```bash
bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
    --scaleFactorsMethod SES
```

**Note:** SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.

---

### 7. readCount (Read Count Scaling)

**Method:** Scale by the ratio of total read counts between samples

**When to use:**
- Default for `bamCompare`
- Compensating for sequencing depth differences in comparisons
- When you trust that total read counts reflect library size

**Available in:** `bamCompare`

**Example:**
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
    --scaleFactorsMethod readCount
```

**How it works:** If sample1 has 100M reads and sample2 has 50M reads, sample2 is scaled by 2× before comparison.
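
As an illustration, the equivalent scaling can be computed by hand and passed to bamCompare through `--scaleFactors` (a sketch; assumes `samtools` and `bc` are available, and file names are placeholders):

```bash
# Total mapped reads per sample (column 3 of samtools idxstats)
R1=$(samtools idxstats treatment.bam | awk '{s+=$3} END{print s}')
R2=$(samtools idxstats control.bam  | awk '{s+=$3} END{print s}')

# With 100M vs 50M mapped reads this yields factors 1:2, matching readCount scaling
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
    --scaleFactors 1:$(echo "$R1 / $R2" | bc -l)
```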

---

## Normalization Method Selection Guide

### For ChIP-seq Coverage Tracks

**Recommended:** RPGC or CPM

```bash
bamCoverage --bam chip.bam --outFileName chip.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --extendReads 200 \
    --ignoreDuplicates
```

**Reasoning:** Accounts for sequencing depth differences; RPGC provides interpretable coverage values.

---

### For ChIP-seq Comparisons (Treatment vs Control)

**Recommended:** log2 ratio with readCount or SES scaling

```bash
bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
    --operation log2 \
    --scaleFactorsMethod readCount \
    --extendReads 200 \
    --ignoreDuplicates
```

**Reasoning:** Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.

---

### For RNA-seq Coverage Tracks

**Recommended:** CPM or RPKM

```bash
# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --normalizeUsing CPM \
    --filterRNAstrand forward

# For gene-level: RPKM accounts for gene length
bamCoverage --bam rnaseq.bam --outFileName output.bw \
    --normalizeUsing RPKM
```

**Reasoning:** CPM for comparing fixed-width bins; RPKM for genes (accounts for length).

---

### For ATAC-seq

**Recommended:** RPGC or CPM

```bash
bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

**Reasoning:** Similar to ChIP-seq; want comparable coverage across samples.

---

### For Sample Correlation Analysis

**Recommended:** CPM or RPGC

```bash
multiBamSummary bins \
    --bamfiles sample1.bam sample2.bam sample3.bam \
    -o readCounts.npz

plotCorrelation -in readCounts.npz \
    --corMethod pearson \
    --whatToPlot heatmap \
    -o correlation.png
```

**Note:** `multiBamSummary` doesn't explicitly normalize, but correlation analysis is robust to scaling. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with `multiBigwigSummary`.

---

## Advanced Normalization Considerations

### Spike-in Normalization

For experiments with spike-in controls (e.g., *Drosophila* chromatin spike-in for ChIP-seq):

1. Calculate scaling factors from spike-in reads
2. Apply custom scaling factors using the `--scaleFactor` parameter

```bash
# Calculate spike-in factor (example: 0.8)
SCALE_FACTOR=0.8

bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
    --scaleFactor ${SCALE_FACTOR} \
    --extendReads 200
```

---

### Manual Scaling Factors

You can apply custom scaling factors:

```bash
# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
    --scaleFactor 2.0
```

---

### Chromosome Exclusion

Exclude specific chromosomes from normalization calculations:

```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --ignoreForNormalization chrX chrY chrM
```

**When to use:** Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.

---

## Common Pitfalls

### 1. Using RPKM for bin-based data
**Problem:** RPKM accounts for region length, but all bins are the same size
**Solution:** Use CPM or RPGC instead

### 2. Comparing unnormalized samples
**Problem:** Sample with 2× sequencing depth appears to have 2× signal
**Solution:** Always normalize when comparing samples

### 3. Wrong effective genome size
**Problem:** Using hg19 genome size for hg38 data
**Solution:** Double-check genome assembly and use the correct size

### 4. Ignoring duplicates after GC correction
**Problem:** Can introduce bias
**Solution:** Never use `--ignoreDuplicates` after `correctGCBias`

### 5. Using RPGC without effective genome size
**Problem:** Command fails
**Solution:** Always specify `--effectiveGenomeSize` with RPGC

---

## Normalization for Different Comparisons

### Within-sample comparisons (different regions)
**Use:** RPKM (accounts for region length)

### Between-sample comparisons (same regions)
**Use:** CPM, RPGC, or BPM (accounts for library size)

### Treatment vs Control
**Use:** bamCompare with log2 ratio and readCount/SES scaling

### Multiple samples correlation
**Use:** CPM or RPGC normalized bigWig files, then multiBigwigSummary

---

## Quick Reference Table

| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|--------|-------------------|---------------------|----------|---------|
| RPKM | ✓ | ✓ | RNA-seq genes | `--normalizeUsing RPKM` |
| CPM | ✓ | ✗ | Fixed-size bins | `--normalizeUsing CPM` |
| BPM | ✓ | ✗ | Specific regions | `--normalizeUsing BPM` |
| RPGC | ✓ | ✗ | Interpretable coverage | `--normalizeUsing RPGC --effectiveGenomeSize X` |
| None | ✗ | ✗ | Raw data | `--normalizeUsing None` |
| SES | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod SES` |
| readCount | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod readCount` |

---

## Further Reading

For more details on normalization theory and best practices:
- deepTools documentation: https://deeptools.readthedocs.io/
- ENCODE guidelines for ChIP-seq analysis
- RNA-seq normalization papers (DESeq2, TMM methods)

533
scientific-packages/deeptools/references/tools_reference.md
Normal file
@@ -0,0 +1,533 @@
# deepTools Complete Tool Reference

This document provides a comprehensive reference for all deepTools command-line utilities organized by category.

## BAM and bigWig File Processing Tools

### multiBamSummary

Computes read coverages for genomic regions across multiple BAM files, outputting compressed numpy arrays for downstream correlation and PCA analysis.

**Modes:**
- **bins**: Genome-wide analysis using consecutive equal-sized windows (default 10kb)
- **BED-file**: Restricts analysis to user-specified genomic regions

**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (space-separated, required)
- `--outFileName, -o`: Output coverage matrix file (required)
- `--BED`: Region specification file (BED-file mode only)
- `--binSize`: Window size in bases (default: 10,000)
- `--labels`: Custom sample identifiers
- `--minMappingQuality`: Quality threshold for read inclusion
- `--numberOfProcessors, -p`: Parallel processing cores
- `--extendReads`: Fragment size extension
- `--ignoreDuplicates`: Remove PCR duplicates
- `--outRawCounts`: Export tab-delimited file with coordinate columns and per-sample counts

**Output:** Compressed numpy array (.npz) for plotCorrelation and plotPCA

**Common Usage:**
```bash
# Genome-wide comparison
multiBamSummary bins --bamfiles sample1.bam sample2.bam -o results.npz

# Peak region comparison
multiBamSummary BED-file --BED peaks.bed --bamfiles sample1.bam sample2.bam -o results.npz
```

---

### multiBigwigSummary

Similar to multiBamSummary but operates on bigWig files instead of BAM files. Used for comparing coverage tracks across samples.

**Modes:**
- **bins**: Genome-wide analysis
- **BED-file**: Region-specific analysis

**Key Parameters:** Similar to multiBamSummary but accepts bigWig files
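
**Common Usage** (a minimal sketch; file names are placeholders):
```bash
# Genome-wide bins across several coverage tracks
multiBigwigSummary bins -b sample1.bw sample2.bw sample3.bw \
    --labels S1 S2 S3 -o bw_scores.npz

# Downstream analysis works exactly as with multiBamSummary output
plotPCA -in bw_scores.npz -o bw_PCA.png
```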

---

### bamCoverage

Converts BAM alignment files into normalized coverage tracks in bigWig or bedGraph formats. Calculates coverage as the number of reads per bin.

**Key Parameters:**
- `--bam, -b`: Input BAM file (required)
- `--outFileName, -o`: Output filename (required)
- `--outFileFormat, -of`: Output type (bigwig or bedgraph)
- `--normalizeUsing`: Normalization method
  - **RPKM**: Reads Per Kilobase per Million mapped reads
  - **CPM**: Counts Per Million mapped reads
  - **BPM**: Bins Per Million mapped reads
  - **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
  - **None**: No normalization (default)
- `--effectiveGenomeSize`: Mappable genome size (required for RPGC)
- `--binSize`: Resolution in base pairs (default: 50)
- `--extendReads, -e`: Extend reads to fragment length (recommended for ChIP-seq, NOT for RNA-seq)
- `--centerReads`: Center reads on the fragment midpoint for sharper signals
- `--ignoreDuplicates`: Count identical reads only once
- `--minMappingQuality`: Filter reads below quality threshold
- `--minFragmentLength / --maxFragmentLength`: Fragment length filtering
- `--smoothLength`: Window averaging for noise reduction
- `--MNase`: Analyze MNase-seq data for nucleosome positioning
- `--Offset`: Position-specific offsets (useful for RiboSeq, GROseq)
- `--filterRNAstrand`: Separate forward/reverse strand reads
- `--ignoreForNormalization`: Exclude chromosomes from normalization (e.g., sex chromosomes)
- `--numberOfProcessors, -p`: Parallel processing

**Important Notes:**
- For RNA-seq: Do NOT use --extendReads (would extend over splice junctions)
- For ChIP-seq: Use --extendReads with smaller bin sizes
- Never apply --ignoreDuplicates after GC bias correction

**Common Usage:**
```bash
# Basic coverage with RPKM normalization
bamCoverage --bam input.bam --outFileName coverage.bw --normalizeUsing RPKM

# ChIP-seq with extension
bamCoverage --bam chip.bam --outFileName chip_coverage.bw \
    --binSize 10 --extendReads 200 --ignoreDuplicates

# Strand-specific RNA-seq
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --filterRNAstrand forward
```

---

### bamCompare

Compares two BAM files by generating bigWig or bedGraph files, normalizing for sequencing depth differences. Processes the genome in equal-sized bins and performs per-bin calculations.

**Comparison Methods:**
- **log2** (default): Log2 ratio of samples
- **ratio**: Direct ratio calculation
- **subtract**: Difference between files
- **add**: Sum of samples
- **mean**: Average across samples
- **reciprocal_ratio**: Negative inverse reported for ratios < 1
- **first/second**: Output scaled signal from a single file

**Scaling and Normalization Methods:**
- **readCount** (default): Compensates for sequencing depth
- **SES**: Signal Extraction Scaling
- **RPKM**: Reads per kilobase per million
- **CPM**: Counts per million
- **BPM**: Bins per million
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)

(readCount and SES are selected with `--scaleFactorsMethod`; RPKM/CPM/BPM/RPGC are selected with `--normalizeUsing`.)

**Key Parameters:**
- `--bamfile1, -b1`: First BAM file (required)
- `--bamfile2, -b2`: Second BAM file (required)
- `--outFileName, -o`: Output filename (required)
- `--outFileFormat`: bigwig or bedgraph
- `--operation`: Comparison method (see above)
- `--scaleFactorsMethod`: Normalization method (see above)
- `--binSize`: Bin width for output (default: 50bp)
- `--pseudocount`: Avoid division by zero (default: 1)
- `--extendReads`: Extend reads to fragment length
- `--ignoreDuplicates`: Count identical reads once
- `--minMappingQuality`: Quality threshold
- `--numberOfProcessors, -p`: Parallelization

**Common Usage:**
```bash
# Log2 ratio of treatment vs control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw

# Subtract control from treatment
bamCompare -b1 treatment.bam -b2 control.bam -o difference.bw \
    --operation subtract --scaleFactorsMethod readCount
```

---

### correctGCBias / computeGCBias

**computeGCBias:** Identifies GC-content bias from sequencing and PCR amplification.

**correctGCBias:** Corrects BAM files for GC bias detected by computeGCBias.

**Key Parameters (computeGCBias):**
- `--bamfile, -b`: Input BAM file
- `--effectiveGenomeSize`: Mappable genome size
- `--genome, -g`: Reference genome in 2bit format
- `--fragmentLength, -l`: Fragment length (for single-end)
- `--biasPlot`: Output diagnostic plot
- `--GCbiasFrequenciesFile`: Output file with observed and expected read frequencies

**Key Parameters (correctGCBias):**
- `--bamfile, -b`: Input BAM file
- `--effectiveGenomeSize`: Mappable genome size
- `--genome, -g`: Reference genome in 2bit format
- `--GCbiasFrequenciesFile`: Frequencies from computeGCBias
- `--correctedFile, -o`: Output corrected BAM

**Important:** Never use --ignoreDuplicates after GC bias correction

---

### alignmentSieve

Filters BAM files by various quality metrics on-the-fly. Useful for creating filtered BAM files for specific analyses.

**Key Parameters:**
- `--bam, -b`: Input BAM file
- `--outFile, -o`: Output BAM file
- `--minMappingQuality`: Minimum mapping quality
- `--ignoreDuplicates`: Remove duplicates
- `--minFragmentLength / --maxFragmentLength`: Fragment length filters
- `--samFlagInclude / --samFlagExclude`: SAM flag filtering
- `--shift`: Shift reads (e.g., for ATAC-seq Tn5 correction)
- `--ATACshift`: Automatically shift for ATAC-seq data
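
**Common Usage** (a sketch with placeholder file names, mirroring the ATAC-seq workflow in references/workflows.md):
```bash
# Tn5-shift and fragment-filter an ATAC-seq BAM in one pass
alignmentSieve --bam atac.bam --outFile atac_shifted.bam \
    --ATACshift --minFragmentLength 38 --maxFragmentLength 2000 \
    --ignoreDuplicates
```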

---

### computeMatrix

Calculates scores per genomic region and prepares matrices for plotHeatmap and plotProfile. Processes bigWig score files and BED/GTF region files.

**Modes:**
- **reference-point**: Signal distribution relative to a specific position (TSS, TES, or center)
- **scale-regions**: Signal across regions standardized to uniform lengths

**Key Parameters:**
- `-R`: Region file(s) in BED/GTF format (required)
- `-S`: BigWig score file(s) (required)
- `-o`: Output matrix file (required)
- `-b`: Upstream distance from reference point
- `-a`: Downstream distance from reference point
- `-m`: Region body length (scale-regions only)
- `-bs, --binSize`: Bin size for averaging scores
- `--skipZeros`: Skip regions with all zeros
- `--minThreshold / --maxThreshold`: Filter by signal intensity
- `--sortRegions`: descend, ascend, keep, no
- `--sortUsing`: mean, median, max, min, sum, region_length
- `-p, --numberOfProcessors`: Parallel processing
- `--averageTypeBins`: Statistical method (mean, median, min, max, sum, std)

**Output Options:**
- `--outFileNameMatrix`: Export tab-delimited data
- `--outFileSortedRegions`: Save filtered/sorted BED file

**Common Usage:**
```bash
# TSS analysis
computeMatrix reference-point -S signal.bw -R genes.bed \
    -o matrix.gz -b 2000 -a 2000 --referencePoint TSS

# Scaled gene body
computeMatrix scale-regions -S signal.bw -R genes.bed \
    -o matrix.gz -b 1000 -a 1000 -m 3000
```

---

## Quality Control Tools

### plotFingerprint

Quality control tool primarily for ChIP-seq experiments. Assesses whether antibody enrichment was successful. Generates cumulative read coverage profiles to distinguish signal from noise.

**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (required)
- `--plotFile, -plot, -o`: Output image filename (required)
- `--extendReads, -e`: Extend reads to fragment length
- `--ignoreDuplicates`: Count identical reads once
- `--minMappingQuality`: Mapping quality filter
- `--centerReads`: Center reads on the fragment midpoint
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
- `--outRawCounts`: Save per-bin read counts
- `--outQualityMetrics`: Output QC metrics (Jensen-Shannon distance)
- `--labels`: Custom sample names
- `--numberOfProcessors, -p`: Parallel processing

**Interpretation:**
- Ideal control: Straight diagonal line
- Strong ChIP: Steep rise towards highest rank (concentrated reads in few bins)
- Weak enrichment: Flatter curve approaching the diagonal

**Common Usage:**
```bash
plotFingerprint -b input.bam chip1.bam chip2.bam \
    --labels Input ChIP1 ChIP2 -o fingerprint.png \
    --extendReads 200 --ignoreDuplicates
```

---

### plotCoverage

Visualizes average read distribution across the genome. Shows genome coverage and helps determine if sequencing depth is adequate.

**Key Parameters:**
- `--bamfiles, -b`: BAM files to analyze (required)
- `--plotFile, -o`: Output plot filename (required)
- `--ignoreDuplicates`: Remove PCR duplicates
- `--minMappingQuality`: Quality threshold
- `--outRawCounts`: Save underlying data
- `--labels`: Sample names
- `--numberOfSamples`: Number of positions to sample (default: 1,000,000)
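
**Common Usage** (a minimal sketch; file names are placeholders):
```bash
plotCoverage --bamfiles input.bam chip1.bam chip2.bam \
    --labels Input ChIP1 ChIP2 \
    --ignoreDuplicates --minMappingQuality 10 \
    -o coverage.png
```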

---

### bamPEFragmentSize

Determines fragment length distribution for paired-end sequencing data. Essential QC to verify expected fragment sizes from library preparation.

**Key Parameters:**
- `--bamfiles, -b`: BAM files (required)
- `--histogram, -hist`: Output histogram filename (required)
- `--plotTitle, -T`: Plot title
- `--maxFragmentLength`: Maximum length to consider (default: 1000)
- `--logScale`: Use logarithmic Y-axis
- `--outRawFragmentLengths`: Save raw fragment lengths
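
**Common Usage** (a minimal sketch; assumes paired-end BAM files):
```bash
bamPEFragmentSize --bamfiles chip1.bam chip2.bam \
    --histogram fragment_sizes.png \
    --plotTitle "Fragment size distribution" \
    --maxFragmentLength 1000
```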

---

### plotCorrelation

Analyzes sample correlations from multiBamSummary or multiBigwigSummary outputs. Shows how similar different samples are.

**Correlation Methods:**
- **Pearson**: Measures metric differences; sensitive to outliers; appropriate for normally distributed data
- **Spearman**: Rank-based; less influenced by outliers; better for non-normal distributions

**Visualization Options:**
- **heatmap**: Color intensity with hierarchical clustering (complete linkage)
- **scatterplot**: Pairwise scatter plots with correlation coefficients

**Key Parameters:**
- `--corData, -in`: Input matrix from multiBamSummary/multiBigwigSummary (required)
- `--corMethod`: pearson or spearman (required)
- `--whatToPlot`: heatmap or scatterplot (required)
- `--plotFile, -o`: Output filename (required)
- `--skipZeros`: Exclude zero-value regions
- `--removeOutliers`: Use median absolute deviation (MAD) filtering
- `--outFileCorMatrix`: Export correlation matrix
- `--labels`: Custom sample names
- `--plotTitle`: Plot title
- `--colorMap`: Color scheme (50+ options)
- `--plotNumbers`: Display correlation values on heatmap

**Common Usage:**
```bash
# Heatmap with Pearson correlation
plotCorrelation -in readCounts.npz --corMethod pearson \
    --whatToPlot heatmap -o correlation_heatmap.png --plotNumbers

# Scatterplot with Spearman correlation
plotCorrelation -in readCounts.npz --corMethod spearman \
    --whatToPlot scatterplot -o correlation_scatter.png
```

---

### plotPCA

Generates principal component analysis plots from multiBamSummary or multiBigwigSummary output. Displays sample relationships in reduced dimensionality.

**Key Parameters:**
- `--corData, -in`: Coverage file from multiBamSummary/multiBigwigSummary (required)
- `--plotFile, -o`: Output image (png, eps, pdf, svg) (required)
- `--outFileNameData`: Export PCA data (loadings/rotation and eigenvalues)
- `--labels, -l`: Custom sample labels
- `--plotTitle, -T`: Plot title
- `--plotHeight / --plotWidth`: Dimensions in centimeters
- `--colors`: Custom symbol colors
- `--markers`: Symbol shapes
- `--transpose`: Perform PCA on transposed matrix (rows=samples)
- `--ntop`: Use top N variable rows (default: 1000)
- `--PCs`: Components to plot (default: 1 2)
- `--log2`: Log2-transform data before analysis
- `--rowCenter`: Center each row at 0

**Common Usage:**
```bash
plotPCA -in readCounts.npz -o PCA_plot.png \
    -T "PCA of read counts" --transpose
```

---

## Visualization Tools

### plotHeatmap

Creates genomic region heatmaps from computeMatrix output. Generates publication-quality visualizations.

**Key Parameters:**
- `--matrixFile, -m`: Matrix from computeMatrix (required)
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
- `--outFileSortedRegions`: Save regions after filtering
- `--outFileNameMatrix`: Export matrix values
- `--interpolationMethod`: auto, nearest, bilinear, bicubic, gaussian
  - Default: nearest (≤1000 columns), bilinear (>1000 columns)
- `--dpi`: Figure resolution

**Clustering:**
- `--kmeans`: k-means clustering
- `--hclust`: Hierarchical clustering (slower for >1000 regions)
- `--silhouette`: Calculate cluster quality metrics

**Visual Customization:**
- `--heatmapHeight / --heatmapWidth`: Dimensions (3-100 cm)
- `--whatToShow`: plot, heatmap, colorbar (combinations)
- `--alpha`: Transparency (0-1)
- `--colorMap`: 50+ color schemes
- `--colorList`: Custom gradient colors
- `--zMin / --zMax`: Intensity scale limits
- `--boxAroundHeatmaps`: yes/no (default: yes)

**Labels:**
- `--xAxisLabel / --yAxisLabel`: Axis labels
- `--regionsLabel`: Region set identifiers
- `--samplesLabel`: Sample names
- `--refPointLabel`: Reference point label
- `--startLabel / --endLabel`: Region boundary labels

**Common Usage:**
```bash
# Basic heatmap
plotHeatmap -m matrix.gz -o heatmap.png

# With clustering and custom colors
plotHeatmap -m matrix.gz -o heatmap.png \
    --kmeans 3 --colorMap RdBu --zMin -3 --zMax 3
```

---

### plotProfile

Generates profile plots showing scores across genomic regions using computeMatrix output.

**Key Parameters:**
- `--matrixFile, -m`: Matrix from computeMatrix (required)
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
- `--plotType`: lines, fill, se, std, overlapped_lines, heatmap
- `--colors`: Color palette (names or hex codes)
- `--plotHeight / --plotWidth`: Dimensions in centimeters
- `--yMin / --yMax`: Y-axis range
- `--averageType`: mean, median, min, max, std, sum

**Clustering:**
- `--kmeans`: k-means clustering
- `--hclust`: Hierarchical clustering
- `--silhouette`: Cluster quality metrics

**Labels:**
- `--plotTitle`: Main heading
- `--regionsLabel`: Region set identifiers
- `--samplesLabel`: Sample names
- `--startLabel / --endLabel`: Region boundary labels (scale-regions mode)

**Output Options:**
- `--outFileNameData`: Export data as tab-separated values
- `--outFileSortedRegions`: Save filtered/sorted regions as BED

**Common Usage:**
```bash
# Line plot
plotProfile -m matrix.gz -o profile.png --plotType lines

# With standard error shading
plotProfile -m matrix.gz -o profile.png --plotType se \
    --colors blue red green
```

---

### plotEnrichment

Calculates and visualizes signal enrichment across genomic regions. Measures the percentage of alignments overlapping region groups. Useful for FRiP (Fraction of Reads in Peaks) scores.

**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (required)
- `--BED`: Region files in BED/GTF format (required)
- `--plotFile, -o`: Output visualization (png, pdf, eps, svg)
- `--labels, -l`: Custom sample identifiers
- `--outRawCounts`: Export numerical data
- `--perSample`: Group by sample instead of feature (default)
- `--regionLabels`: Custom region names

**Read Processing:**
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
- `--minMappingQuality`: Quality threshold
- `--samFlagInclude / --samFlagExclude`: SAM flag filters
- `--ignoreDuplicates`: Remove duplicates
- `--centerReads`: Center reads for sharper signal

**Common Usage:**
```bash
plotEnrichment -b Input.bam H3K4me3.bam \
    --BED peaks_up.bed peaks_down.bed \
    --regionLabels "Up regulated" "Down regulated" \
    -o enrichment.png
```

---

## Miscellaneous Tools

### computeMatrixOperations

Advanced matrix manipulation tool for combining or subsetting matrices from computeMatrix. Enables complex multi-sample, multi-region analyses.

**Operations:**
- `cbind`: Combine matrices column-wise
- `rbind`: Combine matrices row-wise
- `subset`: Extract specific samples or regions
- `filterStrand`: Keep only regions on a specific strand
- `filterValues`: Apply signal intensity filters
- `sort`: Order regions by various criteria
- `dataRange`: Report min/max values

**Common Usage:**
```bash
# Combine matrices
computeMatrixOperations cbind -m matrix1.gz matrix2.gz -o combined.gz

# Extract specific samples
computeMatrixOperations subset -m matrix.gz --samples 0 2 -o subset.gz
```

---

### estimateReadFiltering

Predicts the impact of various filtering parameters without actually filtering. Helps optimize filtering strategies before running full analyses.

**Key Parameters:**
- `--bamfiles, -b`: BAM files to analyze
- `--sampleSize`: Number of reads to sample (default: 100,000)
- `--binSize`: Bin size for analysis
- `--distanceBetweenBins`: Spacing between sampled bins

**Filtration Options to Test:**
- `--minMappingQuality`: Test quality thresholds
- `--ignoreDuplicates`: Assess duplicate impact
- `--minFragmentLength / --maxFragmentLength`: Test fragment filters
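
**Common Usage** (a minimal sketch; the command reports, for each BAM, the estimated fraction of reads each filter would remove):
```bash
estimateReadFiltering -b chip1.bam chip2.bam \
    --minMappingQuality 30 --ignoreDuplicates
```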

---

## Common Parameters Across Tools

Many deepTools commands share these filtering and performance options:

**Read Filtering:**
- `--ignoreDuplicates`: Remove PCR duplicates
- `--minMappingQuality`: Filter by alignment confidence
- `--samFlagInclude / --samFlagExclude`: SAM format filtering
- `--minFragmentLength / --maxFragmentLength`: Fragment length bounds

**Performance:**
- `--numberOfProcessors, -p`: Enable parallel processing
- `--region`: Process specific genomic regions (chr:start-end)

**Read Processing:**
- `--extendReads`: Extend to fragment length
- `--centerReads`: Center at fragment midpoint
- `--ignoreDuplicates`: Count unique reads only

474
scientific-packages/deeptools/references/workflows.md
Normal file
@@ -0,0 +1,474 @@
# deepTools Common Workflows

This document provides complete workflow examples for common deepTools analyses.

## ChIP-seq Quality Control Workflow

Complete quality control assessment for ChIP-seq experiments.

### Step 1: Initial Correlation Assessment

Compare replicates and samples to verify experimental quality:

```bash
# Generate coverage matrix across the genome
multiBamSummary bins \
    --bamfiles Input1.bam Input2.bam ChIP1.bam ChIP2.bam \
    --labels Input_rep1 Input_rep2 ChIP_rep1 ChIP_rep2 \
    -o readCounts.npz \
    --numberOfProcessors 8

# Create correlation heatmap
plotCorrelation \
    -in readCounts.npz \
    --corMethod pearson \
    --whatToPlot heatmap \
    --plotFile correlation_heatmap.png \
    --plotNumbers

# Generate PCA plot
plotPCA \
    -in readCounts.npz \
    -o PCA_plot.png \
    -T "PCA of ChIP-seq samples"
```

**Expected Results:**
- Replicates should cluster together
- Input samples should be distinct from ChIP samples

---

### Step 2: Coverage and Depth Assessment

```bash
# Check sequencing depth and coverage
plotCoverage \
    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
    --labels Input ChIP_rep1 ChIP_rep2 \
    --plotFile coverage.png \
    --ignoreDuplicates \
    --numberOfProcessors 8
```

**Interpretation:** Assess whether sequencing depth is adequate for downstream analysis.

---

### Step 3: Fragment Size Validation (Paired-end)

```bash
# Verify expected fragment sizes
bamPEFragmentSize \
    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
    --histogram fragmentSizes.png \
    --plotTitle "Fragment Size Distribution"
```

**Expected Results:** Fragment sizes should match library preparation protocols (typically 200-600bp for ChIP-seq).

---

### Step 4: GC Bias Detection and Correction

```bash
# Compute GC bias
computeGCBias \
    --bamfile ChIP1.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --fragmentLength 200 \
    --biasPlot GCbias.png \
    --GCbiasFrequenciesFile freq.txt

# If bias detected, correct it
correctGCBias \
    --bamfile ChIP1.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --GCbiasFrequenciesFile freq.txt \
    --correctedFile ChIP1_GCcorrected.bam
```

**Note:** Only correct if significant bias is observed. Do NOT use `--ignoreDuplicates` with GC-corrected files.

---

### Step 5: ChIP Signal Strength Assessment

```bash
# Evaluate ChIP enrichment quality
plotFingerprint \
    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
    --labels Input ChIP_rep1 ChIP_rep2 \
    --plotFile fingerprint.png \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8 \
    --outQualityMetrics fingerprint_metrics.txt
```

**Interpretation:**
- Strong ChIP: Steep rise in cumulative curve
- Weak enrichment: Curve close to diagonal (input-like)

---

## ChIP-seq Analysis Workflow

Complete workflow from BAM files to publication-quality visualizations.

### Step 1: Generate Normalized Coverage Tracks

```bash
# Input control
bamCoverage \
    --bam Input.bam \
    --outFileName Input_coverage.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --binSize 10 \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8

# ChIP sample
bamCoverage \
    --bam ChIP.bam \
    --outFileName ChIP_coverage.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --binSize 10 \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8
```

---

### Step 2: Create Log2 Ratio Track

```bash
# Compare ChIP to Input
bamCompare \
    --bamfile1 ChIP.bam \
    --bamfile2 Input.bam \
    --outFileName ChIP_vs_Input_log2ratio.bw \
    --operation log2 \
    --scaleFactorsMethod readCount \
    --binSize 10 \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8
```

**Result:** Log2 ratio track showing enrichment (positive values) and depletion (negative values).

---

### Step 3: Compute Matrix Around TSS

```bash
# Prepare data for heatmap/profile around transcription start sites
computeMatrix reference-point \
    --referencePoint TSS \
    --scoreFileName ChIP_coverage.bw \
    --regionsFileName genes.bed \
    --beforeRegionStartLength 3000 \
    --afterRegionStartLength 3000 \
    --binSize 10 \
    --sortRegions descend \
    --sortUsing mean \
    --outFileName matrix_TSS.gz \
    --outFileNameMatrix matrix_TSS.tab \
    --numberOfProcessors 8
```

---

### Step 4: Generate Heatmap

```bash
# Create heatmap around TSS
plotHeatmap \
    --matrixFile matrix_TSS.gz \
    --outFileName heatmap_TSS.png \
    --colorMap RdBu \
    --whatToShow 'plot, heatmap and colorbar' \
    --zMin -3 --zMax 3 \
    --yAxisLabel "Genes" \
    --xAxisLabel "Distance from TSS (bp)" \
    --refPointLabel "TSS" \
    --heatmapHeight 15 \
    --kmeans 3
```

---

### Step 5: Generate Profile Plot

```bash
# Create meta-profile around TSS
plotProfile \
    --matrixFile matrix_TSS.gz \
    --outFileName profile_TSS.png \
    --plotType lines \
    --perGroup \
    --colors blue \
    --plotTitle "ChIP-seq signal around TSS" \
    --yAxisLabel "Average signal" \
    --xAxisLabel "Distance from TSS (bp)" \
    --refPointLabel "TSS"
```

---

### Step 6: Enrichment at Peaks

```bash
# Calculate enrichment in peak regions
plotEnrichment \
    --bamfiles Input.bam ChIP.bam \
    --BED peaks.bed \
    --labels Input ChIP \
    --plotFile enrichment.png \
    --outRawCounts enrichment_counts.tab \
    --extendReads 200 \
    --ignoreDuplicates
```

---

## RNA-seq Coverage Workflow

Generate strand-specific coverage tracks for RNA-seq data.

### Forward Strand

```bash
bamCoverage \
    --bam rnaseq.bam \
    --outFileName forward_coverage.bw \
    --filterRNAstrand forward \
    --normalizeUsing CPM \
    --binSize 1 \
    --numberOfProcessors 8
```

### Reverse Strand

```bash
bamCoverage \
    --bam rnaseq.bam \
    --outFileName reverse_coverage.bw \
    --filterRNAstrand reverse \
    --normalizeUsing CPM \
    --binSize 1 \
    --numberOfProcessors 8
```

**Important:** Do NOT use `--extendReads` for RNA-seq (would extend over splice junctions).

---

## Multi-Sample Comparison Workflow

Compare multiple ChIP-seq samples (e.g., different conditions or time points).

### Step 1: Generate Coverage Files

```bash
# For each sample
for sample in Control_ChIP Treated_ChIP; do
    bamCoverage \
        --bam ${sample}.bam \
        --outFileName ${sample}.bw \
        --normalizeUsing RPGC \
        --effectiveGenomeSize 2913022398 \
        --binSize 10 \
        --extendReads 200 \
        --ignoreDuplicates \
        --numberOfProcessors 8
done
```

---

### Step 2: Compute Multi-Sample Matrix

```bash
computeMatrix scale-regions \
    --scoreFileName Control_ChIP.bw Treated_ChIP.bw \
    --regionsFileName genes.bed \
    --beforeRegionStartLength 1000 \
    --afterRegionStartLength 1000 \
    --regionBodyLength 3000 \
    --binSize 10 \
    --sortRegions descend \
    --sortUsing mean \
    --outFileName matrix_multi.gz \
    --numberOfProcessors 8
```

---

### Step 3: Multi-Sample Heatmap

```bash
plotHeatmap \
    --matrixFile matrix_multi.gz \
    --outFileName heatmap_comparison.png \
    --colorMap Blues \
    --whatToShow 'plot, heatmap and colorbar' \
    --samplesLabel Control Treated \
    --yAxisLabel "Genes" \
    --heatmapHeight 15 \
    --kmeans 4
```

---

### Step 4: Multi-Sample Profile

```bash
plotProfile \
    --matrixFile matrix_multi.gz \
    --outFileName profile_comparison.png \
    --plotType lines \
    --perGroup \
    --colors blue red \
    --samplesLabel Control Treated \
    --plotTitle "ChIP-seq signal comparison" \
    --startLabel "TSS" \
    --endLabel "TES"
```

---

## ATAC-seq Workflow

Specialized workflow for ATAC-seq data with Tn5 offset correction.

### Step 1: Shift Reads for Tn5 Correction

```bash
alignmentSieve \
    --bam atacseq.bam \
    --outFile atacseq_shifted.bam \
    --ATACshift \
    --minFragmentLength 38 \
    --maxFragmentLength 2000 \
    --ignoreDuplicates
```

---

### Step 2: Generate Coverage Track

```bash
bamCoverage \
    --bam atacseq_shifted.bam \
    --outFileName atacseq_coverage.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --binSize 1 \
    --numberOfProcessors 8
```

---

### Step 3: Fragment Size Analysis

```bash
bamPEFragmentSize \
    --bamfiles atacseq.bam \
    --histogram fragmentSizes_atac.png \
    --maxFragmentLength 1000
```

**Expected Pattern:** Nucleosome ladder with peaks at ~50bp (nucleosome-free), ~200bp (mono-nucleosome), ~400bp (di-nucleosome).
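To quantify this pattern rather than judge it only from the plot, fragment sizes can be pulled straight from the paired-end BAM with pysam. This is a minimal sketch, assuming pysam is installed and the BAM is coordinate-sorted and indexed; the size bins are approximate and only for illustration.

```python
import pysam
from collections import Counter

# Count absolute template lengths for properly paired, non-duplicate read-1 alignments
sizes = Counter()
with pysam.AlignmentFile("atacseq.bam", "rb") as bam:
    for read in bam.fetch():
        if read.is_proper_pair and read.is_read1 and not read.is_duplicate:
            sizes[abs(read.template_length)] += 1

total = sum(sizes.values()) or 1  # avoid division by zero on empty input

# Approximate ATAC-seq size classes (illustrative boundaries, not a standard)
bins = {
    "nucleosome-free (<100 bp)": (0, 100),
    "mono-nucleosome (~180-250 bp)": (180, 250),
    "di-nucleosome (~315-475 bp)": (315, 475),
}
for label, (lo, hi) in bins.items():
    frac = sum(n for size, n in sizes.items() if lo <= size < hi) / total
    print(f"{label}: {frac:.1%} of fragments")
```

In a successful library the sub-100 bp fraction usually dominates, with visible mono- and di-nucleosome shoulders.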
---

## Peak Region Analysis Workflow

Analyze ChIP-seq signal specifically at peak regions.

### Step 1: Matrix at Peaks

```bash
computeMatrix reference-point \
    --referencePoint center \
    --scoreFileName ChIP_coverage.bw \
    --regionsFileName peaks.bed \
    --beforeRegionStartLength 2000 \
    --afterRegionStartLength 2000 \
    --binSize 10 \
    --outFileName matrix_peaks.gz \
    --numberOfProcessors 8
```

---

### Step 2: Heatmap at Peaks

```bash
plotHeatmap \
    --matrixFile matrix_peaks.gz \
    --outFileName heatmap_peaks.png \
    --colorMap YlOrRd \
    --refPointLabel "Peak Center" \
    --heatmapHeight 15 \
    --sortUsing max
```

---

## Troubleshooting Common Issues

### Issue: Out of Memory
**Solution:** Use the `--region` parameter to process chromosomes individually:

```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```

### Issue: BAM Index Missing
**Solution:** Index BAM files before running deepTools:

```bash
samtools index input.bam
```

### Issue: Slow Processing
**Solution:** Increase `--numberOfProcessors`:

```bash
# Use 8 cores instead of default
--numberOfProcessors 8
```

### Issue: bigWig Files Too Large
**Solution:** Increase bin size:

```bash
--binSize 50  # or larger (default is 10-50)
```

---
## Performance Tips

1. **Use multiple processors:** Always set `--numberOfProcessors` to available cores
2. **Process regions:** Use `--region` for testing or memory-limited environments
3. **Adjust bin size:** Larger bins = faster processing and smaller files
4. **Pre-filter BAM files:** Use `alignmentSieve` to create filtered BAM files once, then reuse
5. **Use bigWig over bedGraph:** bigWig format is compressed and faster to process

---

## Best Practices

1. **Always check QC first:** Run correlation, coverage, and fingerprint analysis before proceeding
2. **Document parameters:** Save command lines for reproducibility
3. **Use consistent normalization:** Apply the same normalization method across samples in a comparison
4. **Verify reference genome match:** Ensure BAM files and region files use the same genome build
5. **Check strand orientation:** For RNA-seq, verify correct strand orientation
6. **Test on small regions first:** Use `--region chr1:1-1000000` for testing parameters
7. **Keep intermediate files:** Save matrices for regenerating plots with different settings

195
scientific-packages/deeptools/scripts/validate_files.py
Normal file
@@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
deepTools File Validation Script

Validates BAM, bigWig, and BED files for deepTools analysis.
Checks for file existence, proper indexing, and basic format requirements.
"""

import os
import sys
import argparse
from pathlib import Path


def check_file_exists(filepath):
    """Check if file exists and is readable."""
    if not os.path.exists(filepath):
        return False, f"File not found: {filepath}"
    if not os.access(filepath, os.R_OK):
        return False, f"File not readable: {filepath}"
    return True, f"✓ File exists: {filepath}"


def check_bam_index(bam_file):
    """Check if BAM file has an index (.bai or .bam.bai)."""
    bai_file1 = bam_file + ".bai"
    bai_file2 = bam_file.replace(".bam", ".bai")

    if os.path.exists(bai_file1):
        return True, f"✓ BAM index found: {bai_file1}"
    elif os.path.exists(bai_file2):
        return True, f"✓ BAM index found: {bai_file2}"
    else:
        return False, f"✗ BAM index missing for: {bam_file}\n Run: samtools index {bam_file}"


def check_bigwig_file(bw_file):
    """Basic check for bigWig file."""
    # Check file size (bigWig files should have reasonable size)
    file_size = os.path.getsize(bw_file)
    if file_size < 100:
        return False, f"✗ bigWig file suspiciously small: {bw_file} ({file_size} bytes)"
    return True, f"✓ bigWig file appears valid: {bw_file} ({file_size} bytes)"


def check_bed_file(bed_file):
    """Basic validation of BED file format."""
    try:
        with open(bed_file, 'r') as f:
            lines = [line.strip() for line in f if line.strip() and not line.startswith('#')]

        if len(lines) == 0:
            return False, f"✗ BED file is empty: {bed_file}"

        # Check first few lines for basic format
        for i, line in enumerate(lines[:10], 1):
            fields = line.split('\t')
            if len(fields) < 3:
                return False, f"✗ BED file format error at line {i}: expected at least 3 columns\n Line: {line}"

            # Check if start and end are integers
            try:
                start = int(fields[1])
                end = int(fields[2])
                if start >= end:
                    return False, f"✗ BED file error at line {i}: start >= end ({start} >= {end})"
            except ValueError:
                return False, f"✗ BED file format error at line {i}: start and end must be integers\n Line: {line}"

        return True, f"✓ BED file format appears valid: {bed_file} ({len(lines)} regions)"

    except Exception as e:
        return False, f"✗ Error reading BED file: {bed_file}\n Error: {str(e)}"


def validate_files(bam_files=None, bigwig_files=None, bed_files=None):
    """
    Validate all provided files.

    Args:
        bam_files: List of BAM file paths
        bigwig_files: List of bigWig file paths
        bed_files: List of BED file paths

    Returns:
        Tuple of (success: bool, messages: list)
    """
    all_success = True
    messages = []

    # Validate BAM files
    if bam_files:
        messages.append("\n=== Validating BAM Files ===")
        for bam_file in bam_files:
            # Check existence
            success, msg = check_file_exists(bam_file)
            messages.append(msg)
            if not success:
                all_success = False
                continue

            # Check index
            success, msg = check_bam_index(bam_file)
            messages.append(msg)
            if not success:
                all_success = False

    # Validate bigWig files
    if bigwig_files:
        messages.append("\n=== Validating bigWig Files ===")
        for bw_file in bigwig_files:
            # Check existence
            success, msg = check_file_exists(bw_file)
            messages.append(msg)
            if not success:
                all_success = False
                continue

            # Basic bigWig check
            success, msg = check_bigwig_file(bw_file)
            messages.append(msg)
            if not success:
                all_success = False

    # Validate BED files
    if bed_files:
        messages.append("\n=== Validating BED Files ===")
        for bed_file in bed_files:
            # Check existence
            success, msg = check_file_exists(bed_file)
            messages.append(msg)
            if not success:
                all_success = False
                continue

            # Check BED format
            success, msg = check_bed_file(bed_file)
            messages.append(msg)
            if not success:
                all_success = False

    return all_success, messages


def main():
    parser = argparse.ArgumentParser(
        description="Validate files for deepTools analysis",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Validate BAM files
  python validate_files.py --bam sample1.bam sample2.bam

  # Validate all file types
  python validate_files.py --bam input.bam chip.bam --bed peaks.bed --bigwig signal.bw

  # Validate from a directory
  python validate_files.py --bam *.bam --bed *.bed
"""
    )

    parser.add_argument('--bam', nargs='+', help='BAM files to validate')
    parser.add_argument('--bigwig', '--bw', nargs='+', help='bigWig files to validate')
    parser.add_argument('--bed', nargs='+', help='BED files to validate')

    args = parser.parse_args()

    # Check if any files were provided
    if not any([args.bam, args.bigwig, args.bed]):
        parser.print_help()
        sys.exit(1)

    # Run validation
    success, messages = validate_files(
        bam_files=args.bam,
        bigwig_files=args.bigwig,
        bed_files=args.bed
    )

    # Print results
    for msg in messages:
        print(msg)

    # Summary
    print("\n" + "=" * 50)
    if success:
        print("✓ All validations passed!")
        sys.exit(0)
    else:
        print("✗ Some validations failed. Please fix the issues above.")
        sys.exit(1)


if __name__ == "__main__":
    main()
454
scientific-packages/deeptools/scripts/workflow_generator.py
Normal file
@@ -0,0 +1,454 @@
#!/usr/bin/env python3
"""
deepTools Workflow Generator

Generates bash script templates for common deepTools workflows.
"""

import argparse
import sys


WORKFLOWS = {
    'chipseq_qc': {
        'name': 'ChIP-seq Quality Control',
        'description': 'Complete QC workflow for ChIP-seq experiments',
    },
    'chipseq_analysis': {
        'name': 'ChIP-seq Complete Analysis',
        'description': 'Full ChIP-seq analysis from BAM to heatmaps',
    },
    'rnaseq_coverage': {
        'name': 'RNA-seq Coverage Tracks',
        'description': 'Generate strand-specific RNA-seq coverage',
    },
    'atacseq': {
        'name': 'ATAC-seq Analysis',
        'description': 'ATAC-seq workflow with Tn5 correction',
    },
}


def generate_chipseq_qc_workflow(output_file, params):
    """Generate ChIP-seq QC workflow script."""

    script = f"""#!/bin/bash
# deepTools ChIP-seq Quality Control Workflow
# Generated by deepTools workflow generator

# Configuration
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
CHIP_BAM=("{params.get('chip_bams', 'ChIP1.bam ChIP2.bam')}")
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'deeptools_qc')}"

# Create output directory
mkdir -p $OUTPUT_DIR

echo "=== Starting ChIP-seq QC workflow ==="

# Step 1: Correlation analysis
echo "Step 1: Computing correlation matrix..."
multiBamSummary bins \\
    --bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
    -o $OUTPUT_DIR/readCounts.npz \\
    --numberOfProcessors $THREADS

echo "Step 2: Generating correlation heatmap..."
plotCorrelation \\
    -in $OUTPUT_DIR/readCounts.npz \\
    --corMethod pearson \\
    --whatToPlot heatmap \\
    --plotFile $OUTPUT_DIR/correlation_heatmap.png \\
    --plotNumbers

echo "Step 3: Generating PCA plot..."
plotPCA \\
    -in $OUTPUT_DIR/readCounts.npz \\
    -o $OUTPUT_DIR/PCA_plot.png \\
    -T "PCA of ChIP-seq samples"

# Step 2: Coverage assessment
echo "Step 4: Assessing coverage..."
plotCoverage \\
    --bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
    --plotFile $OUTPUT_DIR/coverage.png \\
    --ignoreDuplicates \\
    --numberOfProcessors $THREADS

# Step 3: Fragment size (for paired-end data)
echo "Step 5: Analyzing fragment sizes..."
bamPEFragmentSize \\
    --bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
    --histogram $OUTPUT_DIR/fragmentSizes.png \\
    --plotTitle "Fragment Size Distribution"

# Step 4: ChIP signal strength
echo "Step 6: Evaluating ChIP enrichment..."
plotFingerprint \\
    --bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
    --plotFile $OUTPUT_DIR/fingerprint.png \\
    --extendReads 200 \\
    --ignoreDuplicates \\
    --numberOfProcessors $THREADS \\
    --outQualityMetrics $OUTPUT_DIR/fingerprint_metrics.txt

echo "=== ChIP-seq QC workflow complete ==="
echo "Results are in: $OUTPUT_DIR"
"""

    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated ChIP-seq QC workflow: {output_file}"


def generate_chipseq_analysis_workflow(output_file, params):
    """Generate complete ChIP-seq analysis workflow script."""

    script = f"""#!/bin/bash
# deepTools ChIP-seq Complete Analysis Workflow
# Generated by deepTools workflow generator

# Configuration
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
CHIP_BAM="{params.get('chip_bam', 'ChIP.bam')}"
GENES_BED="{params.get('genes_bed', 'genes.bed')}"
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'chipseq_analysis')}"

# Create output directory
mkdir -p $OUTPUT_DIR

echo "=== Starting ChIP-seq analysis workflow ==="

# Step 1: Generate normalized coverage tracks
echo "Step 1: Generating coverage tracks..."

bamCoverage \\
    --bam $INPUT_BAM \\
    --outFileName $OUTPUT_DIR/Input_coverage.bw \\
    --normalizeUsing RPGC \\
    --effectiveGenomeSize $GENOME_SIZE \\
    --binSize 10 \\
    --extendReads 200 \\
    --ignoreDuplicates \\
    --numberOfProcessors $THREADS

bamCoverage \\
    --bam $CHIP_BAM \\
    --outFileName $OUTPUT_DIR/ChIP_coverage.bw \\
    --normalizeUsing RPGC \\
    --effectiveGenomeSize $GENOME_SIZE \\
    --binSize 10 \\
    --extendReads 200 \\
    --ignoreDuplicates \\
    --numberOfProcessors $THREADS

# Step 2: Create log2 ratio track
echo "Step 2: Creating log2 ratio track..."
bamCompare \\
    --bamfile1 $CHIP_BAM \\
    --bamfile2 $INPUT_BAM \\
    --outFileName $OUTPUT_DIR/ChIP_vs_Input_log2ratio.bw \\
    --operation log2 \\
    --scaleFactorsMethod readCount \\
    --binSize 10 \\
    --extendReads 200 \\
    --ignoreDuplicates \\
    --numberOfProcessors $THREADS

# Step 3: Compute matrix around TSS
echo "Step 3: Computing matrix around TSS..."
computeMatrix reference-point \\
    --referencePoint TSS \\
    --scoreFileName $OUTPUT_DIR/ChIP_coverage.bw \\
    --regionsFileName $GENES_BED \\
    --beforeRegionStartLength 3000 \\
    --afterRegionStartLength 3000 \\
    --binSize 10 \\
    --sortRegions descend \\
    --sortUsing mean \\
    --outFileName $OUTPUT_DIR/matrix_TSS.gz \\
    --numberOfProcessors $THREADS

# Step 4: Generate heatmap
echo "Step 4: Generating heatmap..."
plotHeatmap \\
    --matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
    --outFileName $OUTPUT_DIR/heatmap_TSS.png \\
    --colorMap RdBu \\
    --whatToShow 'plot, heatmap and colorbar' \\
    --yAxisLabel "Genes" \\
    --xAxisLabel "Distance from TSS (bp)" \\
    --refPointLabel "TSS" \\
    --heatmapHeight 15 \\
    --kmeans 3

# Step 5: Generate profile plot
echo "Step 5: Generating profile plot..."
plotProfile \\
    --matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
    --outFileName $OUTPUT_DIR/profile_TSS.png \\
    --plotType lines \\
    --perGroup \\
    --colors blue \\
    --plotTitle "ChIP-seq signal around TSS" \\
    --yAxisLabel "Average signal" \\
    --refPointLabel "TSS"

# Step 6: Enrichment at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
    echo "Step 6: Calculating enrichment at peaks..."
    plotEnrichment \\
        --bamfiles $INPUT_BAM $CHIP_BAM \\
        --BED $PEAKS_BED \\
        --labels Input ChIP \\
        --plotFile $OUTPUT_DIR/enrichment.png \\
        --outRawCounts $OUTPUT_DIR/enrichment_counts.tab \\
        --extendReads 200 \\
        --ignoreDuplicates
fi

echo "=== ChIP-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
"""

    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated ChIP-seq analysis workflow: {output_file}"


def generate_rnaseq_coverage_workflow(output_file, params):
    """Generate RNA-seq coverage workflow script."""

    script = f"""#!/bin/bash
# deepTools RNA-seq Coverage Workflow
# Generated by deepTools workflow generator

# Configuration
RNASEQ_BAM="{params.get('rnaseq_bam', 'rnaseq.bam')}"
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'rnaseq_coverage')}"

# Create output directory
mkdir -p $OUTPUT_DIR

echo "=== Starting RNA-seq coverage workflow ==="

# Generate strand-specific coverage tracks
echo "Step 1: Generating forward strand coverage..."
bamCoverage \\
    --bam $RNASEQ_BAM \\
    --outFileName $OUTPUT_DIR/forward_coverage.bw \\
    --filterRNAstrand forward \\
    --normalizeUsing CPM \\
    --binSize 1 \\
    --numberOfProcessors $THREADS

echo "Step 2: Generating reverse strand coverage..."
bamCoverage \\
    --bam $RNASEQ_BAM \\
    --outFileName $OUTPUT_DIR/reverse_coverage.bw \\
    --filterRNAstrand reverse \\
    --normalizeUsing CPM \\
    --binSize 1 \\
    --numberOfProcessors $THREADS

echo "=== RNA-seq coverage workflow complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Note: These bigWig files can be loaded into genome browsers"
echo "for strand-specific visualization of RNA-seq data."
"""

    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated RNA-seq coverage workflow: {output_file}"


def generate_atacseq_workflow(output_file, params):
    """Generate ATAC-seq workflow script."""

    script = f"""#!/bin/bash
# deepTools ATAC-seq Analysis Workflow
# Generated by deepTools workflow generator

# Configuration
ATAC_BAM="{params.get('atac_bam', 'atacseq.bam')}"
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'atacseq_analysis')}"

# Create output directory
mkdir -p $OUTPUT_DIR

echo "=== Starting ATAC-seq analysis workflow ==="

# Step 1: Shift reads for Tn5 correction
echo "Step 1: Applying Tn5 offset correction..."
alignmentSieve \\
    --bam $ATAC_BAM \\
    --outFile $OUTPUT_DIR/atacseq_shifted.bam \\
    --ATACshift \\
    --minFragmentLength 38 \\
    --maxFragmentLength 2000 \\
    --ignoreDuplicates

# Index the shifted BAM
samtools index $OUTPUT_DIR/atacseq_shifted.bam

# Step 2: Generate coverage track
echo "Step 2: Generating coverage track..."
bamCoverage \\
    --bam $OUTPUT_DIR/atacseq_shifted.bam \\
    --outFileName $OUTPUT_DIR/atacseq_coverage.bw \\
    --normalizeUsing RPGC \\
    --effectiveGenomeSize $GENOME_SIZE \\
    --binSize 1 \\
    --numberOfProcessors $THREADS

# Step 3: Fragment size analysis
echo "Step 3: Analyzing fragment sizes..."
bamPEFragmentSize \\
    --bamfiles $ATAC_BAM \\
    --histogram $OUTPUT_DIR/fragmentSizes.png \\
    --maxFragmentLength 1000

# Step 4: Compute matrix at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
    echo "Step 4: Computing matrix at peaks..."
    computeMatrix reference-point \\
        --referencePoint center \\
        --scoreFileName $OUTPUT_DIR/atacseq_coverage.bw \\
        --regionsFileName $PEAKS_BED \\
        --beforeRegionStartLength 2000 \\
        --afterRegionStartLength 2000 \\
        --binSize 10 \\
        --outFileName $OUTPUT_DIR/matrix_peaks.gz \\
        --numberOfProcessors $THREADS

    echo "Step 5: Generating heatmap..."
    plotHeatmap \\
        --matrixFile $OUTPUT_DIR/matrix_peaks.gz \\
        --outFileName $OUTPUT_DIR/heatmap_peaks.png \\
        --colorMap YlOrRd \\
        --refPointLabel "Peak Center" \\
        --heatmapHeight 15
fi

echo "=== ATAC-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Expected fragment size pattern:"
echo " ~50bp: nucleosome-free regions"
echo " ~200bp: mono-nucleosome"
echo " ~400bp: di-nucleosome"
"""

    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated ATAC-seq workflow: {output_file}"


def main():
    parser = argparse.ArgumentParser(
        description="Generate deepTools workflow scripts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=f"""
Available workflows:
{chr(10).join(f"  {key}: {value['name']}" for key, value in WORKFLOWS.items())}

Examples:
  # Generate ChIP-seq QC workflow
  python workflow_generator.py chipseq_qc -o chipseq_qc.sh

  # Generate ChIP-seq analysis with custom parameters
  python workflow_generator.py chipseq_analysis -o analysis.sh \\
      --chip-bam H3K4me3.bam --input-bam Input.bam

  # List all available workflows
  python workflow_generator.py --list
"""
    )

    parser.add_argument('workflow', nargs='?', choices=list(WORKFLOWS.keys()),
                        help='Workflow type to generate')
    parser.add_argument('-o', '--output', default='deeptools_workflow.sh',
                        help='Output script filename (default: deeptools_workflow.sh)')
    parser.add_argument('--list', action='store_true',
                        help='List all available workflows')

    # Common parameters
    parser.add_argument('--threads', type=int, default=8,
                        help='Number of threads (default: 8)')
    parser.add_argument('--genome-size', type=int, default=2913022398,
                        help='Effective genome size (default: 2913022398 for hg38)')
    parser.add_argument('--output-dir', default=None,
                        help='Output directory for results')

    # Workflow-specific parameters
    parser.add_argument('--input-bam', help='Input/control BAM file')
    parser.add_argument('--chip-bam', help='ChIP BAM file')
    parser.add_argument('--chip-bams', help='Multiple ChIP BAM files (space-separated)')
    parser.add_argument('--rnaseq-bam', help='RNA-seq BAM file')
    parser.add_argument('--atac-bam', help='ATAC-seq BAM file')
    parser.add_argument('--genes-bed', help='Genes BED file')
    parser.add_argument('--peaks-bed', help='Peaks BED file')

    args = parser.parse_args()

    # List workflows
    if args.list:
        print("\nAvailable deepTools workflows:\n")
        for key, value in WORKFLOWS.items():
            print(f"  {key}")
            print(f"    {value['name']}")
            print(f"    {value['description']}\n")
        sys.exit(0)

    # Check if workflow was specified
    if not args.workflow:
        parser.print_help()
        sys.exit(1)

    # Prepare parameters
    params = {
        'threads': args.threads,
        'genome_size': args.genome_size,
        'output_dir': args.output_dir or f"{args.workflow}_output",
        'input_bam': args.input_bam,
        'chip_bam': args.chip_bam,
        'chip_bams': args.chip_bams,
        'rnaseq_bam': args.rnaseq_bam,
        'atac_bam': args.atac_bam,
        'genes_bed': args.genes_bed,
        'peaks_bed': args.peaks_bed,
    }
    # Drop options that were not supplied so that params.get(key, default)
    # falls back to the per-workflow defaults instead of the string "None"
    params = {key: value for key, value in params.items() if value is not None}

    # Generate workflow
    if args.workflow == 'chipseq_qc':
        message = generate_chipseq_qc_workflow(args.output, params)
    elif args.workflow == 'chipseq_analysis':
        message = generate_chipseq_analysis_workflow(args.output, params)
    elif args.workflow == 'rnaseq_coverage':
        message = generate_rnaseq_coverage_workflow(args.output, params)
    elif args.workflow == 'atacseq':
        message = generate_atacseq_workflow(args.output, params)

    print(message)
    print(f"\nTo run the workflow:")
    print(f"  chmod +x {args.output}")
    print(f"  ./{args.output}")
    print(f"\nNote: Edit the script to customize file paths and parameters.")


if __name__ == "__main__":
    main()

477
scientific-packages/diffdock/SKILL.md
Normal file
@@ -0,0 +1,477 @@
---
name: diffdock
description: This skill provides comprehensive guidance for using DiffDock, a state-of-the-art diffusion-based molecular docking tool that predicts protein-ligand binding poses. Use this skill when users request molecular docking simulations, protein-ligand binding predictions, virtual screening, structure-based drug design tasks, or need to predict how small molecules bind to protein targets. This skill applies to tasks involving PDB files, SMILES strings, protein sequences, ligand structure files, or batch docking of compound libraries.
---

# DiffDock: Molecular Docking with Diffusion Models

## Overview

DiffDock is a diffusion-based deep learning tool for molecular docking that predicts 3D binding poses of small molecule ligands to protein targets. It represents the state-of-the-art in computational docking, crucial for structure-based drug discovery and chemical biology.

**Core Capabilities:**
- Predict ligand binding poses with high accuracy using deep learning
- Support protein structures (PDB files) or sequences (via ESMFold)
- Process single complexes or batch virtual screening campaigns
- Generate confidence scores to assess prediction reliability
- Handle diverse ligand inputs (SMILES, SDF, MOL2)

**Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.

## When to Use DiffDock

Invoke this skill when users request:

- "Dock this ligand to a protein" or "predict binding pose"
- "Run molecular docking" or "perform protein-ligand docking"
- "Virtual screening" or "screen compound library"
- "Where does this molecule bind?" or "predict binding site"
- Structure-based drug design or lead optimization tasks
- Tasks involving PDB files + SMILES strings or ligand structures
- Batch docking of multiple protein-ligand pairs

## Installation and Environment Setup

### Check Environment Status

Before proceeding with DiffDock tasks, verify the environment setup:

```bash
# Use the provided setup checker
python scripts/setup_check.py
```

This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.

### Installation Options

**Option 1: Conda (Recommended)**
```bash
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
```

**Option 2: Docker**
```bash
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
```

**Important Notes:**
- GPU strongly recommended (10-100x speedup vs CPU)
- First run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
- Model checkpoints (~500MB) download automatically if not present
## Core Workflows

### Workflow 1: Single Protein-Ligand Docking

**Use Case:** Dock one ligand to one protein target

**Input Requirements:**
- Protein: PDB file OR amino acid sequence
- Ligand: SMILES string OR structure file (SDF/MOL2)

**Command:**
```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)Oc1ccccc1C(=O)O" \
    --out_dir results/single_docking/
```

**Alternative (protein sequence):**
```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
    --ligand ligand.sdf \
    --out_dir results/sequence_docking/
```

**Output Structure:**
```
results/single_docking/
├── rank_1.sdf            # Top-ranked pose
├── rank_2.sdf            # Second-ranked pose
├── ...
├── rank_10.sdf           # 10th pose (default: 10 samples)
└── confidence_scores.txt
```

### Workflow 2: Batch Processing Multiple Complexes

**Use Case:** Dock multiple ligands to proteins, virtual screening campaigns

**Step 1: Prepare Batch CSV**

Use the provided script to create or validate batch input:

```bash
# Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv

# Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
```

**CSV Format:**
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
```

**Required Columns:**
- `complex_name`: Unique identifier
- `protein_path`: PDB file path (leave empty if using sequence)
- `ligand_description`: SMILES string or ligand file path
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)

**Step 2: Run Batch Docking**

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv batch_input.csv \
    --out_dir results/batch/ \
    --batch_size 10
```

**For Large Virtual Screening (>100 compounds):**

Pre-compute protein embeddings for faster processing:
```bash
# Pre-compute embeddings
python datasets/esm_embedding_preparation.py \
    --protein_ligand_csv screening_input.csv \
    --out_file protein_embeddings.pt

# Run with pre-computed embeddings
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv screening_input.csv \
    --esm_embeddings_path protein_embeddings.pt \
    --out_dir results/screening/
```
### Workflow 3: Analyzing Results

After docking completes, analyze confidence scores and rank predictions:

```bash
# Analyze all results
python scripts/analyze_results.py results/batch/

# Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5

# Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0

# Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv

# Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
```

The analysis script:
- Parses confidence scores from all predictions
- Classifies as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
- Ranks predictions within and across complexes
- Generates statistical summaries
- Exports results to CSV for downstream analysis

## Confidence Score Interpretation

**Understanding Scores:**

| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High | Strong prediction, likely accurate |
| **-1.5 to 0** | Moderate | Reasonable prediction, validate carefully |
| **< -1.5** | Low | Uncertain prediction, requires validation |

**Critical Notes:**
1. **Confidence ≠ Affinity**: High confidence means model certainty about structure, NOT strong binding
2. **Context Matters**: Adjust expectations for:
   - Large ligands (>500 Da): Lower confidence expected
   - Multiple protein chains: May decrease confidence
   - Novel protein families: May underperform
3. **Multiple Samples**: Review top 3-5 predictions, look for consensus

**For detailed guidance:** Read `references/confidence_and_limitations.md` using the Read tool
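For a quick, self-contained look at a single run without the helper script, the thresholds above can be applied directly to the scores file. This is a minimal sketch, assuming `confidence_scores.txt` lists one confidence value per line in rank order; the exact file layout can differ between DiffDock versions, so adjust the parsing if needed.

```python
from pathlib import Path

def classify(score: float) -> str:
    # Thresholds follow the table above
    if score > 0:
        return "High"
    if score >= -1.5:
        return "Moderate"
    return "Low"

scores_file = Path("results/single_docking/confidence_scores.txt")
# Assumption: one numeric confidence per line (last whitespace-separated field), ordered by rank
scores = [float(line.split()[-1])
          for line in scores_file.read_text().splitlines() if line.strip()]

for rank, score in enumerate(scores, start=1):
    print(f"rank_{rank}.sdf  confidence={score:+.2f}  ({classify(score)})")
```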
## Parameter Customization

### Using Custom Configuration

Create custom configuration for specific use cases:

```bash
# Copy template
cp assets/custom_inference_config.yaml my_config.yaml

# Edit parameters (see template for presets)
# Then run with custom config
python -m inference \
    --config my_config.yaml \
    --protein_ligand_csv input.csv \
    --out_dir results/
```

### Key Parameters to Adjust

**Sampling Density:**
- `samples_per_complex: 10` → Increase to 20-40 for difficult cases
- More samples = better coverage but longer runtime

**Inference Steps:**
- `inference_steps: 20` → Increase to 25-30 for higher accuracy
- More steps = potentially better quality but slower

**Temperature Parameters (control diversity):**
- `temp_sampling_tor: 7.04` → Increase for flexible ligands (8-10), decrease for rigid ligands (5-6)
- Higher temperature = more diverse poses

**Presets Available in Template:**
1. High Accuracy: More samples + steps, lower temperature
2. Fast Screening: Fewer samples, faster
3. Flexible Ligands: Increased torsion temperature
4. Rigid Ligands: Decreased torsion temperature

**For complete parameter reference:** Read `references/parameters_reference.md` using the Read tool
## Advanced Techniques

### Ensemble Docking (Protein Flexibility)

For proteins with known flexibility, dock to multiple conformations:

```python
# Create ensemble CSV
import pandas as pd

conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

Run docking with increased sampling:
```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv ensemble_input.csv \
    --samples_per_complex 20 \
    --out_dir results/ensemble/
```

### Integration with Scoring Functions

DiffDock generates poses; combine with other tools for affinity:

**GNINA (Fast neural network scoring):**
```bash
for pose in results/*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only
done
```

**MM/GBSA (More accurate, slower):**
Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization

**Free Energy Calculations (Most accurate):**
Use OpenMM + OpenFE or GROMACS for FEP/TI calculations

**Recommended Workflow:**
1. DiffDock → Generate poses with confidence scores
2. Visual inspection → Check structural plausibility
3. GNINA or MM/GBSA → Rescore and rank by affinity
4. Experimental validation → Biochemical assays
## Limitations and Scope

**DiffDock IS Designed For:**
- Small molecule ligands (typically 100-1000 Da)
- Drug-like organic compounds
- Small peptides (<20 residues)
- Single or multi-chain proteins

**DiffDock IS NOT Designed For:**
- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
- Large peptides (>20 residues) → Use alternative methods
- Covalent docking → Use specialized covalent docking tools
- Binding affinity prediction → Combine with scoring functions
- Membrane proteins → Not specifically trained, use with caution

**For complete limitations:** Read `references/confidence_and_limitations.md` using the Read tool

## Troubleshooting

### Common Issues

**Issue: Low confidence scores across all predictions**
- Cause: Large/unusual ligands, unclear binding site, protein flexibility
- Solution: Increase `samples_per_complex` (20-40), try ensemble docking, validate protein structure

**Issue: Out of memory errors**
- Cause: GPU memory insufficient for batch size
- Solution: Reduce the batch size (e.g., `--batch_size 2`) or process fewer complexes at once

**Issue: Slow performance**
- Cause: Running on CPU instead of GPU
- Solution: Verify CUDA with `python -c "import torch; print(torch.cuda.is_available())"`, use GPU

**Issue: Unrealistic binding poses**
- Cause: Poor protein preparation, ligand too large, wrong binding site
- Solution: Check protein for missing residues, remove far waters, consider specifying binding site

**Issue: "Module not found" errors**
- Cause: Missing dependencies or wrong environment
- Solution: Run `python scripts/setup_check.py` to diagnose

### Performance Optimization

**For Best Results:**
1. Use GPU (essential for practical use)
2. Pre-compute ESM embeddings for repeated protein use
3. Batch process multiple complexes together
4. Start with default parameters, then tune if needed
5. Validate protein structures (resolve missing residues)
6. Use canonical SMILES for ligands
## Graphical User Interface

For interactive use, launch the web interface:

```bash
python app/main.py
# Navigate to http://localhost:7860
```

Or use the online demo without installation:
- https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Resources

### Helper Scripts (`scripts/`)

**`prepare_batch_csv.py`**: Create and validate batch input CSV files
- Create templates with example entries
- Validate file paths and SMILES strings
- Check for required columns and format issues

**`analyze_results.py`**: Analyze confidence scores and rank predictions
- Parse results from single or batch runs
- Generate statistical summaries
- Export to CSV for downstream analysis
- Identify top predictions across complexes

**`setup_check.py`**: Verify DiffDock environment setup
- Check Python version and dependencies
- Verify PyTorch and CUDA availability
- Test RDKit and PyTorch Geometric installation
- Provide installation instructions if needed

### Reference Documentation (`references/`)

**`parameters_reference.md`**: Complete parameter documentation
- All command-line options and configuration parameters
- Default values and acceptable ranges
- Temperature parameters for controlling diversity
- Model checkpoint locations and version flags

Read this file when users need:
- Detailed parameter explanations
- Fine-tuning guidance for specific systems
- Alternative sampling strategies

**`confidence_and_limitations.md`**: Confidence score interpretation and tool limitations
- Detailed confidence score interpretation
- When to trust predictions
- Scope and limitations of DiffDock
- Integration with complementary tools
- Troubleshooting prediction quality

Read this file when users need:
- Help interpreting confidence scores
- Understanding when NOT to use DiffDock
- Guidance on combining with other tools
- Validation strategies

**`workflows_examples.md`**: Comprehensive workflow examples
- Detailed installation instructions
- Step-by-step examples for all workflows
- Advanced integration patterns
- Troubleshooting common issues
- Best practices and optimization tips

Read this file when users need:
- Complete workflow examples with code
- Integration with GNINA, OpenMM, or other tools
- Virtual screening workflows
- Ensemble docking procedures

### Assets (`assets/`)

**`batch_template.csv`**: Template for batch processing
- Pre-formatted CSV with required columns
- Example entries showing different input types
- Ready to customize with actual data

**`custom_inference_config.yaml`**: Configuration template
- Annotated YAML with all parameters
- Four preset configurations for common use cases
- Detailed comments explaining each parameter
- Ready to customize and use
## Best Practices

1. **Always verify environment** with `setup_check.py` before starting large jobs
2. **Validate batch CSVs** with `prepare_batch_csv.py` to catch errors early
3. **Start with defaults** then tune parameters based on system-specific needs
4. **Generate multiple samples** (10-40) for robust predictions
5. **Visual inspection** of top poses before downstream analysis
6. **Combine with scoring** functions for affinity assessment
7. **Use confidence scores** for initial ranking, not final decisions
8. **Pre-compute embeddings** for virtual screening campaigns
9. **Document parameters** used for reproducibility
10. **Validate results** experimentally when possible

## Citations

When using DiffDock, cite the appropriate papers:

**DiffDock-L (current default model):**
```
Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
ICLR 2024, arXiv:2402.18396
```

**Original DiffDock:**
```
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
ICLR 2023, arXiv:2210.01776
```

## Additional Resources

- **GitHub Repository**: https://github.com/gcorso/DiffDock
- **Online Demo**: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
- **DiffDock-L Paper**: https://arxiv.org/abs/2402.18396
- **Original Paper**: https://arxiv.org/abs/2210.01776

4
scientific-packages/diffdock/assets/batch_template.csv
Normal file
@@ -0,0 +1,4 @@
complex_name,protein_path,ligand_description,protein_sequence
example_1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
example_2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
example_3,protein3.pdb,ligand3.sdf,
@@ -0,0 +1,90 @@
# DiffDock Custom Inference Configuration Template
# Copy and modify this file to customize inference parameters

# Model paths (usually don't need to change these)
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
ckpt: best_ema_inference_epoch_model.pt
confidence_ckpt: best_model_epoch75.pt

# Model version flags
old_score_model: false  # Set to true to use original DiffDock instead of DiffDock-L
old_filtering_model: true

# Inference steps
inference_steps: 20  # Increase for potentially better accuracy (e.g., 25-30)
actual_steps: 19
no_final_step_noise: true

# Sampling parameters
samples_per_complex: 10  # Increase for difficult cases (e.g., 20-40)
sigma_schedule: expbeta
initial_noise_std_proportion: 1.46

# Temperature controls - Adjust these to balance exploration vs accuracy
# Higher values = more diverse predictions, lower values = more focused predictions

# Sampling temperatures
temp_sampling_tr: 1.17   # Translation sampling temperature
temp_sampling_rot: 2.06  # Rotation sampling temperature
temp_sampling_tor: 7.04  # Torsion sampling temperature (increase for flexible ligands)

# Psi angle temperatures
temp_psi_tr: 0.73
temp_psi_rot: 0.90
temp_psi_tor: 0.59

# Sigma data temperatures
temp_sigma_data_tr: 0.93
temp_sigma_data_rot: 0.75
temp_sigma_data_tor: 0.69

# Feature flags
no_model: false
no_random: false
ode: false  # Set to true to use ODE solver instead of SDE
different_schedules: false
limit_failures: 5

# Output settings
# save_visualisation: true  # Uncomment to save SDF files

# ============================================================================
# Configuration Presets for Common Use Cases
# ============================================================================

# PRESET 1: High Accuracy (slower, more thorough)
# samples_per_complex: 30
# inference_steps: 25
# temp_sampling_tr: 1.0
# temp_sampling_rot: 1.8
# temp_sampling_tor: 6.5

# PRESET 2: Fast Screening (faster, less thorough)
# samples_per_complex: 5
# inference_steps: 15
# temp_sampling_tr: 1.3
# temp_sampling_rot: 2.2
# temp_sampling_tor: 7.5

# PRESET 3: Flexible Ligands (more conformational diversity)
# samples_per_complex: 20
# inference_steps: 20
# temp_sampling_tr: 1.2
# temp_sampling_rot: 2.1
# temp_sampling_tor: 8.5  # Increased torsion temperature

# PRESET 4: Rigid Ligands (more focused predictions)
# samples_per_complex: 10
# inference_steps: 20
# temp_sampling_tr: 1.1
# temp_sampling_rot: 2.0
# temp_sampling_tor: 6.0  # Decreased torsion temperature

# ============================================================================
# Usage Example
# ============================================================================
# python -m inference \
#     --config custom_inference_config.yaml \
#     --protein_ligand_csv input.csv \
#     --out_dir results/
@@ -0,0 +1,182 @@
# DiffDock Confidence Scores and Limitations

This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.

## Confidence Score Interpretation

DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.

### Score Ranges

| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |

### Important Notes on Confidence Scores

1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
   - High confidence = model is confident about the structure
   - Does NOT indicate strong/weak binding affinity

2. **Context-Dependent**: Confidence scores should be adjusted based on system complexity:
   - **Lower expectations** for:
     - Large ligands (>500 Da)
     - Protein complexes with many chains
     - Unbound protein conformations (may require conformational changes)
     - Novel protein families not well-represented in training data

   - **Higher expectations** for:
     - Drug-like small molecules (150-500 Da)
     - Single-chain proteins or well-defined binding sites
     - Proteins similar to those in training data (PDBBind, BindingMOAD)

3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
   - Review top-ranked predictions (by confidence)
   - Consider clustering similar poses (see the sketch below)
   - High-confidence consensus across multiple samples strengthens prediction
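One way to check for pose consensus is to compute symmetry-aware, in-place RMSDs between the ranked poses and group those that fall within a cutoff. The sketch below is not part of DiffDock itself; it assumes RDKit is installed, that the poses share the protein reference frame, and uses an illustrative 2 Å cutoff and placeholder output path.

```python
import glob
from rdkit import Chem
from rdkit.Chem import rdMolAlign

# Load ranked poses from a DiffDock output directory (placeholder path)
paths = sorted(glob.glob("results/single_docking/rank_*.sdf"))
poses = [Chem.MolFromMolFile(p, removeHs=False) for p in paths]

# Greedy clustering on in-place, symmetry-corrected RMSD (no re-alignment)
cutoff = 2.0  # Angstrom, illustrative
clusters = []  # each cluster is a list of pose indices
for i, mol in enumerate(poses):
    if mol is None:
        continue
    for cluster in clusters:
        ref = poses[cluster[0]]
        if rdMolAlign.CalcRMS(mol, ref) < cutoff:
            cluster.append(i)
            break
    else:
        clusters.append([i])

for k, cluster in enumerate(clusters, start=1):
    members = ", ".join(paths[i].split("/")[-1] for i in cluster)
    print(f"Cluster {k}: {members}")
```

A large cluster that also carries the highest confidence scores is the kind of consensus described above.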
## What DiffDock Predicts

### ✅ DiffDock DOES Predict
- **Binding poses**: 3D spatial orientation of ligand in protein binding site
- **Confidence scores**: Model's certainty about predictions
- **Multiple conformations**: Various possible binding modes

### ❌ DiffDock DOES NOT Predict
- **Binding affinity**: Strength of protein-ligand interaction (ΔG, Kd, Ki)
- **Binding kinetics**: On/off rates, residence time
- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
- **Selectivity**: Relative binding to different targets

## Scope and Limitations

### Designed For
- **Small molecule docking**: Organic compounds typically 100-1000 Da
- **Protein targets**: Single or multi-chain proteins
- **Small peptides**: Short peptide ligands (< ~20 residues)
- **Small nucleic acids**: Short oligonucleotides

### NOT Designed For
- **Large biomolecules**: Full protein-protein interactions
  - Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
- **Large peptides/proteins**: >20 residues as ligands
- **Covalent docking**: Irreversible covalent bond formation
- **Metalloprotein specifics**: May not accurately handle metal coordination
- **Membrane proteins**: Not specifically trained on membrane-embedded proteins

### Training Data Considerations

DiffDock was trained on:
- **PDBBind**: Diverse protein-ligand complexes
- **Binding MOAD**: Curated protein-ligand complexes from the PDB

**Implications**:
- Best performance on proteins/ligands similar to training data
- May underperform on:
  - Novel protein families
  - Unusual ligand chemotypes
  - Allosteric sites not well-represented in training data

## Validation and Complementary Tools

### Recommended Workflow

1. **Generate poses with DiffDock**
   - Use confidence scores for initial ranking
   - Consider multiple high-confidence predictions

2. **Visual Inspection**
   - Examine protein-ligand interactions in molecular viewer
   - Check for reasonable:
     - Hydrogen bonds
     - Hydrophobic interactions
     - Steric complementarity
     - Electrostatic interactions

3. **Scoring and Refinement** (choose one or more):
   - **GNINA**: Deep learning-based scoring function
   - **Molecular mechanics**: Energy minimization and refinement
   - **MM/GBSA or MM/PBSA**: Binding free energy estimation
   - **Free energy calculations**: FEP or TI for accurate affinity prediction

4. **Experimental Validation**
   - Biochemical assays (IC50, Kd measurements)
   - Structural validation (X-ray crystallography, cryo-EM)

### Tools for Binding Affinity Assessment

DiffDock should be combined with these tools for affinity prediction:

- **GNINA**: Fast, accurate scoring function (see the rescoring sketch below)
  - Github: github.com/gnina/gnina

- **AutoDock Vina**: Classical docking and scoring
  - Website: vina.scripps.edu

- **Free Energy Calculations**:
  - OpenMM + OpenFE
  - GROMACS + ABFE/RBFE protocols

- **MM/GBSA Tools**:
  - MMPBSA.py (AmberTools)
  - gmx_MMPBSA
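Rescoring a directory of DiffDock poses with GNINA can be scripted by shelling out to the `gnina` binary in the `--score_only` mode shown in the SKILL document and saving each report. This is a minimal sketch: it assumes `gnina` is on the PATH, the receptor and output paths are placeholders, and the saved logs are parsed afterwards with whatever convention your GNINA version prints.

```python
import glob
import subprocess
from pathlib import Path

receptor = "protein.pdb"  # placeholder receptor file
poses = sorted(glob.glob("results/single_docking/rank_*.sdf"))
out_dir = Path("gnina_rescoring")
out_dir.mkdir(exist_ok=True)

for pose in poses:
    log_file = out_dir / (Path(pose).stem + ".log")
    # Score the existing pose without re-docking it
    result = subprocess.run(
        ["gnina", "-r", receptor, "-l", pose, "--score_only"],
        capture_output=True, text=True, check=True,
    )
    log_file.write_text(result.stdout)
    print(f"Scored {pose} -> {log_file}")
```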
## Performance Optimization
|
||||
|
||||
### For Best Results
|
||||
|
||||
1. **Protein Preparation**:
|
||||
- Remove water molecules far from binding site
|
||||
- Resolve missing residues if possible
|
||||
- Consider protonation states at physiological pH
|
||||
|
||||
2. **Ligand Input**:
|
||||
- Provide reasonable 3D conformers when using structure files
|
||||
- Use canonical SMILES for consistent results
|
||||
- Pre-process with RDKit if needed
|
||||
|
||||
3. **Computational Resources**:
|
||||
- GPU strongly recommended (10-100x speedup)
|
||||
- First run pre-computes lookup tables (takes a few minutes)
|
||||
- Batch processing more efficient than single predictions
|
||||
|
||||
4. **Parameter Tuning**:
|
||||
- Increase `samples_per_complex` for difficult cases (20-40)
|
||||
- Adjust temperature parameters for diversity/accuracy trade-off
|
||||
- Use pre-computed ESM embeddings for repeated predictions
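
As a concrete illustration of the ligand-preparation step (item 2 above), the following is a minimal RDKit sketch; the input SMILES and output filename are placeholders. It canonicalizes a SMILES string and writes a relaxed 3D conformer to an SDF file suitable for `--ligand`:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"          # example ligand (aspirin)
mol = Chem.MolFromSmiles(smiles)
canonical_smiles = Chem.MolToSmiles(mol)   # canonical SMILES for --ligand

# Generate and relax a 3D conformer for SDF input
mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=42)
AllChem.MMFFOptimizeMolecule(mol3d)

writer = Chem.SDWriter("ligand_prepared.sdf")
writer.write(mol3d)
writer.close()
```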

## Common Issues and Troubleshooting

### Low Confidence Scores
- **Large/flexible ligands**: Consider splitting into fragments or use alternative methods
- **Multiple binding sites**: May predict multiple locations with distributed confidence
- **Protein flexibility**: Consider using an ensemble of protein conformations

### Unrealistic Predictions
- **Clashes**: May indicate need for protein preparation or refinement
- **Surface binding**: Check whether the true binding site is blocked or unclear
- **Unusual poses**: Consider increasing samples to explore more conformations

### Slow Performance
- **Use GPU**: Essential for reasonable runtime
- **Pre-compute embeddings**: Reuse ESM embeddings for the same protein
- **Batch processing**: More efficient than sequential individual predictions
- **Reduce samples**: Lower `samples_per_complex` for quick screening

## Citation and Further Reading

For methodology details and benchmarking results, see:

1. **Original DiffDock Paper** (ICLR 2023):
   - "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
   - Corso et al., arXiv:2210.01776

2. **DiffDock-L Paper** (2024):
   - Enhanced model with improved generalization
   - Stärk et al., arXiv:2402.18396

3. **PoseBusters Benchmark**:
   - Rigorous docking evaluation framework
   - Used for DiffDock validation
163
scientific-packages/diffdock/references/parameters_reference.md
Normal file
@@ -0,0 +1,163 @@
# DiffDock Configuration Parameters Reference

This document provides comprehensive details on all DiffDock configuration parameters and command-line options.

## Model & Checkpoint Settings

### Model Paths
- **`--model_dir`**: Directory containing the score model checkpoint
  - Default: `./workdir/v1.1/score_model`
  - DiffDock-L model (current default)

- **`--confidence_model_dir`**: Directory containing the confidence model checkpoint
  - Default: `./workdir/v1.1/confidence_model`

- **`--ckpt`**: Name of the score model checkpoint file
  - Default: `best_ema_inference_epoch_model.pt`

- **`--confidence_ckpt`**: Name of the confidence model checkpoint file
  - Default: `best_model_epoch75.pt`

### Model Version Flags
- **`--old_score_model`**: Use original DiffDock model instead of DiffDock-L
  - Default: `false` (uses DiffDock-L)

- **`--old_filtering_model`**: Use legacy confidence filtering approach
  - Default: `true`

## Input/Output Options

### Input Specification
- **`--protein_path`**: Path to protein PDB file
  - Example: `--protein_path protein.pdb`
  - Alternative to `--protein_sequence`

- **`--protein_sequence`**: Amino acid sequence for ESMFold folding
  - Automatically generates protein structure from sequence
  - Alternative to `--protein_path`

- **`--ligand`**: Ligand specification (SMILES string or file path)
  - SMILES string: `--ligand "COc(cc1)ccc1C#N"`
  - File path: `--ligand ligand.sdf` or `.mol2`

- **`--protein_ligand_csv`**: CSV file for batch processing
  - Required columns: `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`
  - Example: `--protein_ligand_csv data/protein_ligand_example.csv`

### Output Control
- **`--out_dir`**: Output directory for predictions
  - Example: `--out_dir results/user_predictions/`

- **`--save_visualisation`**: Export predicted molecules as SDF files
  - Enables visualization of results

## Inference Parameters

### Diffusion Steps
- **`--inference_steps`**: Number of planned inference iterations
  - Default: `20`
  - Higher values may improve accuracy but increase runtime

- **`--actual_steps`**: Actual diffusion steps executed
  - Default: `19`

- **`--no_final_step_noise`**: Omit noise at the final diffusion step
  - Default: `true`

### Sampling Settings
- **`--samples_per_complex`**: Number of samples to generate per complex
  - Default: `10`
  - More samples provide better coverage but increase computation

- **`--sigma_schedule`**: Noise schedule type
  - Default: `expbeta` (exponential-beta)

- **`--initial_noise_std_proportion`**: Initial noise standard deviation scaling
  - Default: `1.46`

### Temperature Parameters

#### Sampling Temperatures (controls diversity of predictions)
- **`--temp_sampling_tr`**: Translation sampling temperature
  - Default: `1.17`

- **`--temp_sampling_rot`**: Rotation sampling temperature
  - Default: `2.06`

- **`--temp_sampling_tor`**: Torsion sampling temperature
  - Default: `7.04`

#### Psi Angle Temperatures
- **`--temp_psi_tr`**: Translation psi temperature
  - Default: `0.73`

- **`--temp_psi_rot`**: Rotation psi temperature
  - Default: `0.90`

- **`--temp_psi_tor`**: Torsion psi temperature
  - Default: `0.59`

#### Sigma Data Temperatures
- **`--temp_sigma_data_tr`**: Translation data distribution scaling
  - Default: `0.93`

- **`--temp_sigma_data_rot`**: Rotation data distribution scaling
  - Default: `0.75`

- **`--temp_sigma_data_tor`**: Torsion data distribution scaling
  - Default: `0.69`

## Processing Options

### Performance
- **`--batch_size`**: Processing batch size
  - Default: `10`
  - Larger values increase throughput but require more memory

- **`--tqdm`**: Enable progress bar visualization
  - Useful for monitoring long-running jobs

### Protein Structure
- **`--chain_cutoff`**: Maximum number of protein chains to process
  - Example: `--chain_cutoff 10`
  - Useful for large multi-chain complexes

- **`--esm_embeddings_path`**: Path to pre-computed ESM2 protein embeddings
  - Speeds up inference by reusing embeddings
  - Optional optimization

### Dataset Options
- **`--split`**: Dataset split to use (train/test/val)
  - Used for evaluation on standard benchmarks

## Advanced Flags

### Debugging & Testing
- **`--no_model`**: Disable model inference (debugging)
  - Default: `false`

- **`--no_random`**: Disable randomization
  - Default: `false`
  - Useful for reproducibility testing

### Alternative Sampling
- **`--ode`**: Use ODE solver instead of SDE
  - Default: `false`
  - Alternative sampling approach

- **`--different_schedules`**: Use different noise schedules per component
  - Default: `false`

### Error Handling
- **`--limit_failures`**: Maximum allowed failures before stopping
  - Default: `5`

## Configuration File

All parameters can be specified in a YAML configuration file (typically `default_inference_args.yaml`) or overridden via the command line:

```bash
python -m inference --config default_inference_args.yaml --samples_per_complex 20
```

Command-line arguments take precedence over configuration file values.
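
To make the precedence rule concrete, here is a small, hypothetical Python sketch (not part of DiffDock itself) of the usual pattern: values are read from `default_inference_args.yaml`, and any flags supplied on the command line overwrite them before inference is launched. The key names shown are the documented flags above.

```python
import argparse
import yaml  # PyYAML, a core DiffDock dependency

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="default_inference_args.yaml")
parser.add_argument("--samples_per_complex", type=int)
parser.add_argument("--inference_steps", type=int)
cli_args, _ = parser.parse_known_args()

with open(cli_args.config) as fh:
    settings = yaml.safe_load(fh)        # configuration file values first

for key in ("samples_per_complex", "inference_steps"):
    value = getattr(cli_args, key)
    if value is not None:                # a CLI flag wins when provided
        settings[key] = value

print(settings)                          # effective configuration
```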
392
scientific-packages/diffdock/references/workflows_examples.md
Normal file
@@ -0,0 +1,392 @@
# DiffDock Workflows and Examples

This document provides practical workflows and usage examples for common DiffDock tasks.

## Installation and Setup

### Conda Installation (Recommended)

```bash
# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock

# Create conda environment
conda env create --file environment.yml
conda activate diffdock
```

### Docker Installation

```bash
# Pull Docker image
docker pull rbgcsail/diffdock

# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock

# Inside container, activate environment
micromamba activate diffdock
```

### First Run
The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.

## Workflow 1: Single Protein-Ligand Docking

### Using PDB File and SMILES String

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path examples/protein.pdb \
    --ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
    --out_dir results/single_docking/
```

**Output Structure**:
```
results/single_docking/
├── index_0_rank_1.sdf       # Top-ranked prediction
├── index_0_rank_2.sdf       # Second-ranked prediction
├── ...
├── index_0_rank_10.sdf      # 10th prediction (if samples_per_complex=10)
└── confidence_scores.txt    # Scores for all predictions
```
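
A minimal sketch of pulling the top-ranked pose and its confidence score into Python for downstream analysis. It assumes the output layout shown above; file names may differ between DiffDock versions, so adjust the paths accordingly.

```python
from rdkit import Chem

out_dir = "results/single_docking/"

# Confidence scores are written one per line, in rank order (assumed layout)
with open(out_dir + "confidence_scores.txt") as fh:
    scores = [float(line.strip()) for line in fh if line.strip()]

# Load the rank-1 pose as an RDKit molecule
top_pose = Chem.SDMolSupplier(out_dir + "index_0_rank_1.sdf")[0]
print(f"Top pose: {top_pose.GetNumAtoms()} atoms, confidence {scores[0]:.3f}")
```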

### Using Ligand Structure File

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand ligand.sdf \
    --out_dir results/ligand_file/
```

**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit

## Workflow 2: Protein Sequence to Structure Docking

### Using ESMFold for Protein Folding

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
    --ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
    --out_dir results/sequence_docking/
```

**Use Cases**:
- Protein structure not available in PDB
- Modeling mutations or variants
- De novo protein design validation

**Note**: ESMFold folding adds computation time (30s-5min depending on sequence length)

## Workflow 3: Batch Processing Multiple Complexes

### Prepare CSV File

Create `complexes.csv` with required columns:

```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
```

**Column Descriptions**:
- `complex_name`: Unique identifier for the complex
- `protein_path`: Path to PDB file (leave empty if using sequence)
- `ligand_description`: SMILES string or path to ligand file
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)

### Run Batch Docking

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv complexes.csv \
    --out_dir results/batch_predictions/ \
    --batch_size 10
```

**Output Structure**:
```
results/batch_predictions/
├── complex1/
│   ├── rank_1.sdf
│   ├── rank_2.sdf
│   └── ...
├── complex2/
│   ├── rank_1.sdf
│   └── ...
└── complex3/
    └── ...
```

## Workflow 4: High-Throughput Virtual Screening

### Setup for Screening Large Ligand Libraries

```python
# generate_screening_csv.py
import pandas as pd

# Load ligand library
ligands = pd.read_csv("ligand_library.csv")  # Contains SMILES

# Create DiffDock input
screening_data = {
    "complex_name": [f"screen_{i}" for i in range(len(ligands))],
    "protein_path": ["target_protein.pdb"] * len(ligands),
    "ligand_description": ligands["smiles"].tolist(),
    "protein_sequence": [""] * len(ligands)
}

df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)
```

### Run Screening

```bash
# Pre-compute ESM embeddings for faster screening
python datasets/esm_embedding_preparation.py \
    --protein_ligand_csv screening_input.csv \
    --out_file protein_embeddings.pt

# Run docking with pre-computed embeddings
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv screening_input.csv \
    --esm_embeddings_path protein_embeddings.pt \
    --out_dir results/virtual_screening/ \
    --batch_size 32
```

### Post-Processing: Extract Top Hits

```python
# analyze_screening_results.py
import os

import pandas as pd

results = []
results_dir = "results/virtual_screening/"

for complex_dir in os.listdir(results_dir):
    confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
    if os.path.exists(confidence_file):
        with open(confidence_file) as f:
            scores = [float(line.strip()) for line in f]
        top_score = max(scores)
        results.append({"complex": complex_dir, "top_confidence": top_score})

# Sort by confidence
df = pd.DataFrame(results)
df_sorted = df.sort_values("top_confidence", ascending=False)

# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)
```

## Workflow 5: Ensemble Docking with Protein Flexibility

### Prepare Protein Ensemble

```python
# For proteins with known flexibility, use multiple conformations
# Example: Using MD snapshots or crystal structures

# create_ensemble_csv.py
import pandas as pd

conformations = [
    "protein_conf1.pdb",
    "protein_conf2.pdb",
    "protein_conf3.pdb",
    "protein_conf4.pdb"
]

ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

### Run Ensemble Docking

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv ensemble_input.csv \
    --out_dir results/ensemble_docking/ \
    --samples_per_complex 20  # More samples per conformation
```

## Workflow 6: Integration with Downstream Analysis

### Example: DiffDock + GNINA Rescoring

```bash
# 1. Run DiffDock
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
    --out_dir results/diffdock_poses/ \
    --save_visualisation

# 2. Rescore with GNINA
for pose in results/diffdock_poses/*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done
```

### Example: DiffDock + OpenMM Energy Minimization

```python
# minimize_poses.py
from openmm import app, LangevinIntegrator, Platform
from openmm.app import ForceField, Modeller, PDBFile
from rdkit import Chem
import os

# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Process each DiffDock pose
pose_dir = 'results/diffdock_poses/'
for pose_file in os.listdir(pose_dir):
    if pose_file.endswith('.sdf'):
        # Load ligand
        mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]

        # Combine protein + ligand
        modeller = Modeller(protein.topology, protein.positions)
        # ... add ligand to modeller (the ligand also needs parameters,
        #     e.g. a GAFF template via openmmforcefields) ...

        # Create system and minimize
        system = forcefield.createSystem(modeller.topology)
        integrator = LangevinIntegrator(300, 1.0, 0.002)
        simulation = app.Simulation(modeller.topology, system, integrator)
        simulation.minimizeEnergy(maxIterations=1000)

        # Save minimized structure
        positions = simulation.context.getState(getPositions=True).getPositions()
        PDBFile.writeFile(simulation.topology, positions,
                          open(f"minimized_{pose_file}.pdb", 'w'))
```

## Workflow 7: Using the Graphical Interface

### Launch Web Interface

```bash
python app/main.py
```

### Access Interface
Navigate to `http://localhost:7860` in a web browser

### Features
- Upload protein PDB or enter sequence
- Input ligand SMILES or upload structure
- Adjust inference parameters via GUI
- Visualize results interactively
- Download predictions directly

### Online Alternative
Use the Hugging Face Spaces demo without local installation:
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Advanced Configuration

### Custom Inference Settings

Create a custom YAML configuration:

```yaml
# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model

# Sampling parameters
samples_per_complex: 20   # More samples for better coverage
inference_steps: 25       # More steps for accuracy

# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5

# Output
save_visualisation: true
```

Use the custom configuration:

```bash
python -m inference \
    --config custom_inference.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
    --out_dir results/custom_config/
```

## Troubleshooting Common Issues

### Issue: Out of Memory Errors

**Solution**: Reduce batch size
```bash
python -m inference ... --batch_size 2
```

### Issue: Slow Performance

**Solution**: Ensure GPU usage
```python
import torch
print(torch.cuda.is_available())  # Should return True
```

### Issue: Poor Predictions for Large Ligands

**Solution**: Increase sampling diversity
```bash
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
```

### Issue: Protein with Many Chains

**Solution**: Limit chains or isolate the binding site
```bash
python -m inference ... --chain_cutoff 4
```

Or pre-process the PDB to include only relevant chains.

## Best Practices Summary

1. **Start Simple**: Test with a single complex before batch processing
2. **GPU Essential**: Use a GPU for reasonable performance
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
4. **Validate Results**: Use molecular visualization and complementary scoring
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
7. **Pre-compute Embeddings**: For repeated use of the same protein
8. **Combine Tools**: Integrate with scoring functions and energy minimization
334
scientific-packages/diffdock/scripts/analyze_results.py
Executable file
@@ -0,0 +1,334 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
DiffDock Results Analysis Script
|
||||
|
||||
This script analyzes DiffDock prediction results, extracting confidence scores,
|
||||
ranking predictions, and generating summary reports.
|
||||
|
||||
Usage:
|
||||
python analyze_results.py results/output_dir/
|
||||
python analyze_results.py results/ --top 50 --threshold 0.0
|
||||
python analyze_results.py results/ --export summary.csv
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
import re
|
||||
|
||||
|
||||
def parse_confidence_scores(results_dir):
|
||||
"""
|
||||
Parse confidence scores from DiffDock output directory.
|
||||
|
||||
Args:
|
||||
results_dir: Path to DiffDock results directory
|
||||
|
||||
Returns:
|
||||
dict: Dictionary mapping complex names to their predictions and scores
|
||||
"""
|
||||
results = {}
|
||||
results_path = Path(results_dir)
|
||||
|
||||
# Check if this is a single complex or batch results
|
||||
sdf_files = list(results_path.glob("*.sdf"))
|
||||
|
||||
if sdf_files:
|
||||
# Single complex output
|
||||
results['single_complex'] = parse_single_complex(results_path)
|
||||
else:
|
||||
# Batch output - multiple subdirectories
|
||||
for subdir in results_path.iterdir():
|
||||
if subdir.is_dir():
|
||||
complex_results = parse_single_complex(subdir)
|
||||
if complex_results:
|
||||
results[subdir.name] = complex_results
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def parse_single_complex(complex_dir):
|
||||
"""Parse results for a single complex."""
|
||||
predictions = []
|
||||
|
||||
# Look for SDF files with rank information
|
||||
for sdf_file in complex_dir.glob("*.sdf"):
|
||||
filename = sdf_file.name
|
||||
|
||||
# Extract rank from filename (e.g., "rank_1.sdf" or "index_0_rank_1.sdf")
|
||||
rank_match = re.search(r'rank_(\d+)', filename)
|
||||
if rank_match:
|
||||
rank = int(rank_match.group(1))
|
||||
|
||||
# Try to extract confidence score from filename or separate file
|
||||
confidence = extract_confidence_score(sdf_file, complex_dir)
|
||||
|
||||
predictions.append({
|
||||
'rank': rank,
|
||||
'file': sdf_file.name,
|
||||
'path': str(sdf_file),
|
||||
'confidence': confidence
|
||||
})
|
||||
|
||||
# Sort by rank
|
||||
predictions.sort(key=lambda x: x['rank'])
|
||||
|
||||
return {'predictions': predictions} if predictions else None
|
||||
|
||||
|
||||
def extract_confidence_score(sdf_file, complex_dir):
|
||||
"""
|
||||
Extract confidence score for a prediction.
|
||||
|
||||
Tries multiple methods:
|
||||
1. Read from confidence_scores.txt file
|
||||
2. Parse from SDF file properties
|
||||
3. Extract from filename if present
|
||||
"""
|
||||
# Method 1: confidence_scores.txt
|
||||
confidence_file = complex_dir / "confidence_scores.txt"
|
||||
if confidence_file.exists():
|
||||
try:
|
||||
with open(confidence_file) as f:
|
||||
lines = f.readlines()
|
||||
# Extract rank from filename
|
||||
rank_match = re.search(r'rank_(\d+)', sdf_file.name)
|
||||
if rank_match:
|
||||
rank = int(rank_match.group(1))
|
||||
if rank <= len(lines):
|
||||
return float(lines[rank - 1].strip())
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Method 2: Parse from SDF file
|
||||
try:
|
||||
with open(sdf_file) as f:
|
||||
content = f.read()
|
||||
# Look for confidence score in SDF properties
|
||||
conf_match = re.search(r'confidence[:\s]+(-?\d+\.?\d*)', content, re.IGNORECASE)
|
||||
if conf_match:
|
||||
return float(conf_match.group(1))
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Method 3: Filename (e.g., "rank_1_conf_0.95.sdf")
|
||||
conf_match = re.search(r'conf_(-?\d+\.?\d*)', sdf_file.name)
|
||||
if conf_match:
|
||||
return float(conf_match.group(1))
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def classify_confidence(score):
|
||||
"""Classify confidence score into categories."""
|
||||
if score is None:
|
||||
return "Unknown"
|
||||
elif score > 0:
|
||||
return "High"
|
||||
elif score > -1.5:
|
||||
return "Moderate"
|
||||
else:
|
||||
return "Low"
|
||||
|
||||
|
||||
def print_summary(results, top_n=None, min_confidence=None):
|
||||
"""Print a formatted summary of results."""
|
||||
|
||||
print("\n" + "="*80)
|
||||
print("DiffDock Results Summary")
|
||||
print("="*80)
|
||||
|
||||
all_predictions = []
|
||||
|
||||
for complex_name, data in results.items():
|
||||
predictions = data.get('predictions', [])
|
||||
|
||||
print(f"\n{complex_name}")
|
||||
print("-" * 80)
|
||||
|
||||
if not predictions:
|
||||
print(" No predictions found")
|
||||
continue
|
||||
|
||||
# Filter by confidence if specified
|
||||
filtered_predictions = predictions
|
||||
if min_confidence is not None:
|
||||
filtered_predictions = [p for p in predictions if p['confidence'] is not None and p['confidence'] >= min_confidence]
|
||||
|
||||
# Limit to top N if specified
|
||||
if top_n is not None:
|
||||
filtered_predictions = filtered_predictions[:top_n]
|
||||
|
||||
for pred in filtered_predictions:
|
||||
confidence = pred['confidence']
|
||||
confidence_class = classify_confidence(confidence)
|
||||
|
||||
conf_str = f"{confidence:>7.3f}" if confidence is not None else " N/A"
|
||||
print(f" Rank {pred['rank']:2d}: Confidence = {conf_str} ({confidence_class:8s}) | {pred['file']}")
|
||||
|
||||
# Add to all predictions for overall statistics
|
||||
if confidence is not None:
|
||||
all_predictions.append((complex_name, pred['rank'], confidence))
|
||||
|
||||
# Show statistics for this complex
|
||||
if filtered_predictions and any(p['confidence'] is not None for p in filtered_predictions):
|
||||
confidences = [p['confidence'] for p in filtered_predictions if p['confidence'] is not None]
|
||||
print(f"\n Statistics: {len(filtered_predictions)} predictions")
|
||||
print(f" Mean confidence: {sum(confidences)/len(confidences):.3f}")
|
||||
print(f" Max confidence: {max(confidences):.3f}")
|
||||
print(f" Min confidence: {min(confidences):.3f}")
|
||||
|
||||
# Overall statistics
|
||||
if all_predictions:
|
||||
print("\n" + "="*80)
|
||||
print("Overall Statistics")
|
||||
print("="*80)
|
||||
|
||||
confidences = [conf for _, _, conf in all_predictions]
|
||||
print(f" Total predictions: {len(all_predictions)}")
|
||||
print(f" Total complexes: {len(results)}")
|
||||
print(f" Mean confidence: {sum(confidences)/len(confidences):.3f}")
|
||||
print(f" Max confidence: {max(confidences):.3f}")
|
||||
print(f" Min confidence: {min(confidences):.3f}")
|
||||
|
||||
# Confidence distribution
|
||||
high = sum(1 for c in confidences if c > 0)
|
||||
moderate = sum(1 for c in confidences if -1.5 < c <= 0)
|
||||
low = sum(1 for c in confidences if c <= -1.5)
|
||||
|
||||
print(f"\n Confidence distribution:")
|
||||
print(f" High (> 0): {high:4d} ({100*high/len(confidences):5.1f}%)")
|
||||
print(f" Moderate (-1.5 to 0): {moderate:4d} ({100*moderate/len(confidences):5.1f}%)")
|
||||
print(f" Low (< -1.5): {low:4d} ({100*low/len(confidences):5.1f}%)")
|
||||
|
||||
print("\n" + "="*80)
|
||||
|
||||
|
||||
def export_to_csv(results, output_path):
|
||||
"""Export results to CSV file."""
|
||||
import csv
|
||||
|
||||
with open(output_path, 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
writer.writerow(['complex_name', 'rank', 'confidence', 'confidence_class', 'file_path'])
|
||||
|
||||
for complex_name, data in results.items():
|
||||
predictions = data.get('predictions', [])
|
||||
for pred in predictions:
|
||||
confidence = pred['confidence']
|
||||
confidence_class = classify_confidence(confidence)
|
||||
conf_value = confidence if confidence is not None else ''
|
||||
|
||||
writer.writerow([
|
||||
complex_name,
|
||||
pred['rank'],
|
||||
conf_value,
|
||||
confidence_class,
|
||||
pred['path']
|
||||
])
|
||||
|
||||
print(f"✓ Exported results to: {output_path}")
|
||||
|
||||
|
||||
def get_top_predictions(results, n=10, sort_by='confidence'):
|
||||
"""Get top N predictions across all complexes."""
|
||||
all_predictions = []
|
||||
|
||||
for complex_name, data in results.items():
|
||||
predictions = data.get('predictions', [])
|
||||
for pred in predictions:
|
||||
if pred['confidence'] is not None:
|
||||
all_predictions.append({
|
||||
'complex': complex_name,
|
||||
**pred
|
||||
})
|
||||
|
||||
# Sort by confidence (descending)
|
||||
all_predictions.sort(key=lambda x: x['confidence'], reverse=True)
|
||||
|
||||
return all_predictions[:n]
|
||||
|
||||
|
||||
def print_top_predictions(results, n=10):
|
||||
"""Print top N predictions across all complexes."""
|
||||
top_preds = get_top_predictions(results, n)
|
||||
|
||||
print("\n" + "="*80)
|
||||
print(f"Top {n} Predictions Across All Complexes")
|
||||
print("="*80)
|
||||
|
||||
for i, pred in enumerate(top_preds, 1):
|
||||
confidence_class = classify_confidence(pred['confidence'])
|
||||
print(f"{i:2d}. {pred['complex']:30s} | Rank {pred['rank']:2d} | "
|
||||
f"Confidence: {pred['confidence']:7.3f} ({confidence_class})")
|
||||
|
||||
print("="*80)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Analyze DiffDock prediction results',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Analyze all results in directory
|
||||
python analyze_results.py results/output_dir/
|
||||
|
||||
# Show only top 5 predictions per complex
|
||||
python analyze_results.py results/ --top 5
|
||||
|
||||
# Filter by confidence threshold
|
||||
python analyze_results.py results/ --threshold 0.0
|
||||
|
||||
# Export to CSV
|
||||
python analyze_results.py results/ --export summary.csv
|
||||
|
||||
# Show top 20 predictions across all complexes
|
||||
python analyze_results.py results/ --best 20
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('results_dir', help='Path to DiffDock results directory')
|
||||
parser.add_argument('--top', '-t', type=int,
|
||||
help='Show only top N predictions per complex')
|
||||
parser.add_argument('--threshold', type=float,
|
||||
help='Minimum confidence threshold')
|
||||
parser.add_argument('--export', '-e', metavar='FILE',
|
||||
help='Export results to CSV file')
|
||||
parser.add_argument('--best', '-b', type=int, metavar='N',
|
||||
help='Show top N predictions across all complexes')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate results directory
|
||||
if not os.path.exists(args.results_dir):
|
||||
print(f"Error: Results directory not found: {args.results_dir}")
|
||||
return 1
|
||||
|
||||
# Parse results
|
||||
print(f"Analyzing results in: {args.results_dir}")
|
||||
results = parse_confidence_scores(args.results_dir)
|
||||
|
||||
if not results:
|
||||
print("No DiffDock results found in directory")
|
||||
return 1
|
||||
|
||||
# Print summary
|
||||
print_summary(results, top_n=args.top, min_confidence=args.threshold)
|
||||
|
||||
# Print top predictions across all complexes
|
||||
if args.best:
|
||||
print_top_predictions(results, args.best)
|
||||
|
||||
# Export to CSV if requested
|
||||
if args.export:
|
||||
export_to_csv(results, args.export)
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
254
scientific-packages/diffdock/scripts/prepare_batch_csv.py
Executable file
@@ -0,0 +1,254 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
DiffDock Batch CSV Preparation and Validation Script
|
||||
|
||||
This script helps prepare and validate CSV files for DiffDock batch processing.
|
||||
It checks for required columns, validates file paths, and ensures SMILES strings
|
||||
are properly formatted.
|
||||
|
||||
Usage:
|
||||
python prepare_batch_csv.py input.csv --validate
|
||||
python prepare_batch_csv.py --create --output batch_input.csv
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import pandas as pd
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
from rdkit import Chem
|
||||
from rdkit import RDLogger
|
||||
RDLogger.DisableLog('rdApp.*')
|
||||
RDKIT_AVAILABLE = True
|
||||
except ImportError:
|
||||
RDKIT_AVAILABLE = False
|
||||
print("Warning: RDKit not available. SMILES validation will be skipped.")
|
||||
|
||||
|
||||
def validate_smiles(smiles_string):
|
||||
"""Validate a SMILES string using RDKit."""
|
||||
if not RDKIT_AVAILABLE:
|
||||
return True, "RDKit not available for validation"
|
||||
|
||||
try:
|
||||
mol = Chem.MolFromSmiles(smiles_string)
|
||||
if mol is None:
|
||||
return False, "Invalid SMILES structure"
|
||||
return True, "Valid SMILES"
|
||||
except Exception as e:
|
||||
return False, str(e)
|
||||
|
||||
|
||||
def validate_file_path(file_path, base_dir=None):
|
||||
"""Validate that a file path exists."""
|
||||
if pd.isna(file_path) or file_path == "":
|
||||
return True, "Empty (will use protein_sequence)"
|
||||
|
||||
# Handle relative paths
|
||||
if base_dir:
|
||||
full_path = Path(base_dir) / file_path
|
||||
else:
|
||||
full_path = Path(file_path)
|
||||
|
||||
if full_path.exists():
|
||||
return True, f"File exists: {full_path}"
|
||||
else:
|
||||
return False, f"File not found: {full_path}"
|
||||
|
||||
|
||||
def validate_csv(csv_path, base_dir=None):
|
||||
"""
|
||||
Validate a DiffDock batch input CSV file.
|
||||
|
||||
Args:
|
||||
csv_path: Path to CSV file
|
||||
base_dir: Base directory for relative paths (default: CSV directory)
|
||||
|
||||
Returns:
|
||||
bool: True if validation passes
|
||||
list: List of validation messages
|
||||
"""
|
||||
messages = []
|
||||
valid = True
|
||||
|
||||
# Read CSV
|
||||
try:
|
||||
df = pd.read_csv(csv_path)
|
||||
messages.append(f"✓ Successfully read CSV with {len(df)} rows")
|
||||
except Exception as e:
|
||||
messages.append(f"✗ Error reading CSV: {e}")
|
||||
return False, messages
|
||||
|
||||
# Check required columns
|
||||
required_cols = ['complex_name', 'protein_path', 'ligand_description', 'protein_sequence']
|
||||
missing_cols = [col for col in required_cols if col not in df.columns]
|
||||
|
||||
if missing_cols:
|
||||
messages.append(f"✗ Missing required columns: {', '.join(missing_cols)}")
|
||||
valid = False
|
||||
else:
|
||||
messages.append("✓ All required columns present")
|
||||
|
||||
# Set base directory
|
||||
if base_dir is None:
|
||||
base_dir = Path(csv_path).parent
|
||||
|
||||
# Validate each row
|
||||
for idx, row in df.iterrows():
|
||||
row_msgs = []
|
||||
|
||||
# Check complex name
|
||||
if pd.isna(row['complex_name']) or row['complex_name'] == "":
|
||||
row_msgs.append("Missing complex_name")
|
||||
valid = False
|
||||
|
||||
# Check that either protein_path or protein_sequence is provided
|
||||
has_protein_path = not pd.isna(row['protein_path']) and row['protein_path'] != ""
|
||||
has_protein_seq = not pd.isna(row['protein_sequence']) and row['protein_sequence'] != ""
|
||||
|
||||
if not has_protein_path and not has_protein_seq:
|
||||
row_msgs.append("Must provide either protein_path or protein_sequence")
|
||||
valid = False
|
||||
elif has_protein_path and has_protein_seq:
|
||||
row_msgs.append("Warning: Both protein_path and protein_sequence provided, will use protein_path")
|
||||
|
||||
# Validate protein path if provided
|
||||
if has_protein_path:
|
||||
file_valid, msg = validate_file_path(row['protein_path'], base_dir)
|
||||
if not file_valid:
|
||||
row_msgs.append(f"Protein file issue: {msg}")
|
||||
valid = False
|
||||
|
||||
# Validate ligand description
|
||||
if pd.isna(row['ligand_description']) or row['ligand_description'] == "":
|
||||
row_msgs.append("Missing ligand_description")
|
||||
valid = False
|
||||
else:
|
||||
ligand_desc = row['ligand_description']
|
||||
# Check if it's a file path or SMILES
|
||||
if os.path.exists(ligand_desc) or "/" in ligand_desc or "\\" in ligand_desc:
|
||||
# Likely a file path
|
||||
file_valid, msg = validate_file_path(ligand_desc, base_dir)
|
||||
if not file_valid:
|
||||
row_msgs.append(f"Ligand file issue: {msg}")
|
||||
valid = False
|
||||
else:
|
||||
# Likely a SMILES string
|
||||
smiles_valid, msg = validate_smiles(ligand_desc)
|
||||
if not smiles_valid:
|
||||
row_msgs.append(f"SMILES issue: {msg}")
|
||||
valid = False
|
||||
|
||||
if row_msgs:
|
||||
messages.append(f"\nRow {idx + 1} ({row.get('complex_name', 'unnamed')}):")
|
||||
for msg in row_msgs:
|
||||
messages.append(f" - {msg}")
|
||||
|
||||
# Summary
|
||||
messages.append(f"\n{'='*60}")
|
||||
if valid:
|
||||
messages.append("✓ CSV validation PASSED - ready for DiffDock")
|
||||
else:
|
||||
messages.append("✗ CSV validation FAILED - please fix issues above")
|
||||
|
||||
return valid, messages
|
||||
|
||||
|
||||
def create_template_csv(output_path, num_examples=3):
|
||||
"""Create a template CSV file with example entries."""
|
||||
|
||||
examples = {
|
||||
'complex_name': ['example1', 'example2', 'example3'][:num_examples],
|
||||
'protein_path': ['protein1.pdb', '', 'protein3.pdb'][:num_examples],
|
||||
'ligand_description': [
|
||||
'CC(=O)Oc1ccccc1C(=O)O', # Aspirin SMILES
|
||||
'COc1ccc(C#N)cc1', # Example SMILES
|
||||
'ligand.sdf' # Example file path
|
||||
][:num_examples],
|
||||
'protein_sequence': [
|
||||
'', # Empty - using PDB file
|
||||
'MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK', # GFP sequence
|
||||
'' # Empty - using PDB file
|
||||
][:num_examples]
|
||||
}
|
||||
|
||||
df = pd.DataFrame(examples)
|
||||
df.to_csv(output_path, index=False)
|
||||
|
||||
return df
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Prepare and validate DiffDock batch CSV files',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Validate existing CSV
|
||||
python prepare_batch_csv.py input.csv --validate
|
||||
|
||||
# Create template CSV
|
||||
python prepare_batch_csv.py --create --output batch_template.csv
|
||||
|
||||
# Create template with 5 example rows
|
||||
python prepare_batch_csv.py --create --output template.csv --num-examples 5
|
||||
|
||||
# Validate with custom base directory for relative paths
|
||||
python prepare_batch_csv.py input.csv --validate --base-dir /path/to/data/
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('csv_file', nargs='?', help='CSV file to validate')
|
||||
parser.add_argument('--validate', action='store_true',
|
||||
help='Validate the CSV file')
|
||||
parser.add_argument('--create', action='store_true',
|
||||
help='Create a template CSV file')
|
||||
parser.add_argument('--output', '-o', help='Output path for template CSV')
|
||||
parser.add_argument('--num-examples', type=int, default=3,
|
||||
help='Number of example rows in template (default: 3)')
|
||||
parser.add_argument('--base-dir', help='Base directory for relative file paths')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Create template
|
||||
if args.create:
|
||||
output_path = args.output or 'diffdock_batch_template.csv'
|
||||
df = create_template_csv(output_path, args.num_examples)
|
||||
print(f"✓ Created template CSV: {output_path}")
|
||||
print(f"\nTemplate contents:")
|
||||
print(df.to_string(index=False))
|
||||
print(f"\nEdit this file with your protein-ligand pairs and run with:")
|
||||
print(f" python -m inference --config default_inference_args.yaml \\")
|
||||
print(f" --protein_ligand_csv {output_path} --out_dir results/")
|
||||
return 0
|
||||
|
||||
# Validate CSV
|
||||
if args.validate or args.csv_file:
|
||||
if not args.csv_file:
|
||||
print("Error: CSV file required for validation")
|
||||
parser.print_help()
|
||||
return 1
|
||||
|
||||
if not os.path.exists(args.csv_file):
|
||||
print(f"Error: CSV file not found: {args.csv_file}")
|
||||
return 1
|
||||
|
||||
print(f"Validating: {args.csv_file}")
|
||||
print("="*60)
|
||||
|
||||
valid, messages = validate_csv(args.csv_file, args.base_dir)
|
||||
|
||||
for msg in messages:
|
||||
print(msg)
|
||||
|
||||
return 0 if valid else 1
|
||||
|
||||
# No action specified
|
||||
parser.print_help()
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
278
scientific-packages/diffdock/scripts/setup_check.py
Executable file
@@ -0,0 +1,278 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
DiffDock Environment Setup Checker
|
||||
|
||||
This script verifies that the DiffDock environment is properly configured
|
||||
and all dependencies are available.
|
||||
|
||||
Usage:
|
||||
python setup_check.py
|
||||
python setup_check.py --verbose
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def check_python_version():
|
||||
"""Check Python version."""
|
||||
import sys
|
||||
version = sys.version_info
|
||||
|
||||
print("Checking Python version...")
|
||||
if version.major == 3 and version.minor >= 8:
|
||||
print(f" ✓ Python {version.major}.{version.minor}.{version.micro}")
|
||||
return True
|
||||
else:
|
||||
print(f" ✗ Python {version.major}.{version.minor}.{version.micro} "
|
||||
f"(requires Python 3.8 or higher)")
|
||||
return False
|
||||
|
||||
|
||||
def check_package(package_name, import_name=None, version_attr='__version__'):
|
||||
"""Check if a Python package is installed."""
|
||||
if import_name is None:
|
||||
import_name = package_name
|
||||
|
||||
try:
|
||||
module = __import__(import_name)
|
||||
version = getattr(module, version_attr, 'unknown')
|
||||
print(f" ✓ {package_name:20s} (version: {version})")
|
||||
return True
|
||||
except ImportError:
|
||||
print(f" ✗ {package_name:20s} (not installed)")
|
||||
return False
|
||||
|
||||
|
||||
def check_pytorch():
|
||||
"""Check PyTorch installation and CUDA availability."""
|
||||
print("\nChecking PyTorch...")
|
||||
try:
|
||||
import torch
|
||||
print(f" ✓ PyTorch version: {torch.__version__}")
|
||||
|
||||
# Check CUDA
|
||||
if torch.cuda.is_available():
|
||||
print(f" ✓ CUDA available: {torch.cuda.get_device_name(0)}")
|
||||
print(f" - CUDA version: {torch.version.cuda}")
|
||||
print(f" - Number of GPUs: {torch.cuda.device_count()}")
|
||||
return True, True
|
||||
else:
|
||||
print(f" ⚠ CUDA not available (will run on CPU)")
|
||||
return True, False
|
||||
except ImportError:
|
||||
print(f" ✗ PyTorch not installed")
|
||||
return False, False
|
||||
|
||||
|
||||
def check_pytorch_geometric():
|
||||
"""Check PyTorch Geometric installation."""
|
||||
print("\nChecking PyTorch Geometric...")
|
||||
packages = [
|
||||
('torch-geometric', 'torch_geometric'),
|
||||
('torch-scatter', 'torch_scatter'),
|
||||
('torch-sparse', 'torch_sparse'),
|
||||
('torch-cluster', 'torch_cluster'),
|
||||
]
|
||||
|
||||
all_ok = True
|
||||
for pkg_name, import_name in packages:
|
||||
if not check_package(pkg_name, import_name):
|
||||
all_ok = False
|
||||
|
||||
return all_ok
|
||||
|
||||
|
||||
def check_core_dependencies():
|
||||
"""Check core DiffDock dependencies."""
|
||||
print("\nChecking core dependencies...")
|
||||
|
||||
dependencies = [
|
||||
('numpy', 'numpy'),
|
||||
('scipy', 'scipy'),
|
||||
('pandas', 'pandas'),
|
||||
('rdkit', 'rdkit', 'rdBase.__version__'),
|
||||
('biopython', 'Bio', '__version__'),
|
||||
('pytorch-lightning', 'pytorch_lightning'),
|
||||
('PyYAML', 'yaml'),
|
||||
]
|
||||
|
||||
all_ok = True
|
||||
for dep in dependencies:
|
||||
pkg_name = dep[0]
|
||||
import_name = dep[1]
|
||||
version_attr = dep[2] if len(dep) > 2 else '__version__'
|
||||
|
||||
if not check_package(pkg_name, import_name, version_attr):
|
||||
all_ok = False
|
||||
|
||||
return all_ok
|
||||
|
||||
|
||||
def check_esm():
|
||||
"""Check ESM (protein language model) installation."""
|
||||
print("\nChecking ESM (for protein sequence folding)...")
|
||||
try:
|
||||
import esm
|
||||
print(f" ✓ ESM installed (version: {esm.__version__ if hasattr(esm, '__version__') else 'unknown'})")
|
||||
return True
|
||||
except ImportError:
|
||||
print(f" ⚠ ESM not installed (needed for protein sequence folding)")
|
||||
print(f" Install with: pip install fair-esm")
|
||||
return False
|
||||
|
||||
|
||||
def check_diffdock_installation():
|
||||
"""Check if DiffDock is properly installed/cloned."""
|
||||
print("\nChecking DiffDock installation...")
|
||||
|
||||
# Look for key files
|
||||
key_files = [
|
||||
'inference.py',
|
||||
'default_inference_args.yaml',
|
||||
'environment.yml',
|
||||
]
|
||||
|
||||
found_files = []
|
||||
missing_files = []
|
||||
|
||||
for filename in key_files:
|
||||
if os.path.exists(filename):
|
||||
found_files.append(filename)
|
||||
else:
|
||||
missing_files.append(filename)
|
||||
|
||||
if found_files:
|
||||
print(f" ✓ Found DiffDock files in current directory:")
|
||||
for f in found_files:
|
||||
print(f" - {f}")
|
||||
else:
|
||||
print(f" ⚠ DiffDock files not found in current directory")
|
||||
print(f" Current directory: {os.getcwd()}")
|
||||
print(f" Make sure you're in the DiffDock repository root")
|
||||
|
||||
# Check for model checkpoints
|
||||
model_dir = Path('./workdir/v1.1/score_model')
|
||||
confidence_dir = Path('./workdir/v1.1/confidence_model')
|
||||
|
||||
if model_dir.exists() and confidence_dir.exists():
|
||||
print(f" ✓ Model checkpoints found")
|
||||
else:
|
||||
print(f" ⚠ Model checkpoints not found in ./workdir/v1.1/")
|
||||
print(f" Models will be downloaded on first run")
|
||||
|
||||
return len(found_files) > 0
|
||||
|
||||
|
||||
def print_installation_instructions():
|
||||
"""Print installation instructions if setup is incomplete."""
|
||||
print("\n" + "="*80)
|
||||
print("Installation Instructions")
|
||||
print("="*80)
|
||||
|
||||
print("""
|
||||
If DiffDock is not installed, follow these steps:
|
||||
|
||||
1. Clone the repository:
|
||||
git clone https://github.com/gcorso/DiffDock.git
|
||||
cd DiffDock
|
||||
|
||||
2. Create conda environment:
|
||||
conda env create --file environment.yml
|
||||
conda activate diffdock
|
||||
|
||||
3. Verify installation:
|
||||
python setup_check.py
|
||||
|
||||
For Docker installation:
|
||||
docker pull rbgcsail/diffdock
|
||||
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
|
||||
micromamba activate diffdock
|
||||
|
||||
For more information, visit: https://github.com/gcorso/DiffDock
|
||||
""")
|
||||
|
||||
|
||||
def print_performance_notes(has_cuda):
|
||||
"""Print performance notes based on available hardware."""
|
||||
print("\n" + "="*80)
|
||||
print("Performance Notes")
|
||||
print("="*80)
|
||||
|
||||
if has_cuda:
|
||||
print("""
|
||||
✓ GPU detected - DiffDock will run efficiently
|
||||
|
||||
Expected performance:
|
||||
- First run: ~2-5 minutes (pre-computing SO(2)/SO(3) tables)
|
||||
- Subsequent runs: ~10-60 seconds per complex (depending on settings)
|
||||
- Batch processing: Highly efficient with GPU
|
||||
""")
|
||||
else:
|
||||
print("""
|
||||
⚠ No GPU detected - DiffDock will run on CPU
|
||||
|
||||
Expected performance:
|
||||
- CPU inference is SIGNIFICANTLY slower than GPU
|
||||
- Single complex: Several minutes to hours
|
||||
- Batch processing: Not recommended on CPU
|
||||
|
||||
Recommendation: Use GPU for practical applications
|
||||
- Cloud options: Google Colab, AWS, or other cloud GPU services
|
||||
- Local: Install CUDA-capable GPU
|
||||
""")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Check DiffDock environment setup',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter
|
||||
)
|
||||
|
||||
parser.add_argument('--verbose', '-v', action='store_true',
|
||||
help='Show detailed version information')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print("="*80)
|
||||
print("DiffDock Environment Setup Checker")
|
||||
print("="*80)
|
||||
|
||||
checks = []
|
||||
|
||||
# Run all checks
|
||||
checks.append(("Python version", check_python_version()))
|
||||
|
||||
pytorch_ok, has_cuda = check_pytorch()
|
||||
checks.append(("PyTorch", pytorch_ok))
|
||||
|
||||
checks.append(("PyTorch Geometric", check_pytorch_geometric()))
|
||||
checks.append(("Core dependencies", check_core_dependencies()))
|
||||
checks.append(("ESM", check_esm()))
|
||||
checks.append(("DiffDock files", check_diffdock_installation()))
|
||||
|
||||
# Summary
|
||||
print("\n" + "="*80)
|
||||
print("Summary")
|
||||
print("="*80)
|
||||
|
||||
all_passed = all(result for _, result in checks)
|
||||
|
||||
for check_name, result in checks:
|
||||
status = "✓ PASS" if result else "✗ FAIL"
|
||||
print(f" {status:8s} - {check_name}")
|
||||
|
||||
if all_passed:
|
||||
print("\n✓ All checks passed! DiffDock is ready to use.")
|
||||
print_performance_notes(has_cuda)
|
||||
return 0
|
||||
else:
|
||||
print("\n✗ Some checks failed. Please install missing dependencies.")
|
||||
print_installation_instructions()
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
617
scientific-packages/etetoolkit/SKILL.md
Normal file
@@ -0,0 +1,617 @@
---
name: etetoolkit
description: Comprehensive toolkit for phylogenetic and hierarchical tree analysis using the ETE (Environment for Tree Exploration) Python library. This skill should be used when working with phylogenetic trees, gene trees, species trees, clustering dendrograms, or any hierarchical tree structures. Applies to tasks involving tree manipulation (pruning, rerooting, format conversion), evolutionary analysis (orthology detection, duplication/speciation events), tree comparison (Robinson-Foulds distance), NCBI taxonomy integration, tree visualization (PDF, SVG, PNG output), and clustering analysis with heatmaps.
---

# ETE Toolkit Skill

## Overview

Provide comprehensive support for phylogenetic and hierarchical tree analysis using the ETE (Environment for Tree Exploration) toolkit. Enable tree manipulation, evolutionary analysis, visualization, and integration with biological databases for phylogenomic research and clustering analysis.

## Core Capabilities

### 1. Tree Manipulation and Analysis

Load, manipulate, and analyze hierarchical tree structures with support for:

- **Tree I/O**: Read and write Newick, NHX, PhyloXML, and NeXML formats
- **Tree traversal**: Navigate trees using preorder, postorder, or levelorder strategies
- **Topology modification**: Prune, root, collapse nodes, resolve polytomies
- **Distance calculations**: Compute branch lengths and topological distances between nodes
- **Tree comparison**: Calculate Robinson-Foulds distances and identify topological differences (see the sketch after the command-line examples below)

**Common patterns:**

```python
from ete3 import Tree

# Load tree from file
tree = Tree("tree.nw", format=1)

# Basic statistics
print(f"Leaves: {len(tree)}")
print(f"Total nodes: {len(list(tree.traverse()))}")

# Prune to taxa of interest
taxa_to_keep = ["species1", "species2", "species3"]
tree.prune(taxa_to_keep, preserve_branch_length=True)

# Midpoint root
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)

# Save modified tree
tree.write(outfile="rooted_tree.nw")
```

Use `scripts/tree_operations.py` for command-line tree manipulation:

```bash
# Display tree statistics
python scripts/tree_operations.py stats tree.nw

# Convert format
python scripts/tree_operations.py convert tree.nw output.nw --in-format 0 --out-format 1

# Reroot tree
python scripts/tree_operations.py reroot tree.nw rooted.nw --midpoint

# Prune to specific taxa
python scripts/tree_operations.py prune tree.nw pruned.nw --keep-taxa "sp1,sp2,sp3"

# Show ASCII visualization
python scripts/tree_operations.py ascii tree.nw
```
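
The tree-comparison capability listed above can also be exercised directly from Python. Below is a minimal sketch (the Newick file names are placeholders, and the two trees are assumed to share the same leaf set) using ete3's Robinson-Foulds support:

```python
from ete3 import Tree

t1 = Tree("tree1.nw")
t2 = Tree("tree2.nw")

# robinson_foulds returns the RF distance, its maximum possible value,
# the shared leaf names, and the partition sets behind the comparison
rf, max_rf, common_leaves, *parts = t1.robinson_foulds(t2)
print(f"RF distance: {rf}/{max_rf} over {len(common_leaves)} shared leaves")

# compare() reports the same information as a dictionary, including a
# normalized RF value between 0 (identical) and 1 (maximally different)
result = t1.compare(t2)
print(f"Normalized RF: {result['norm_rf']:.2f}")
```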
|
||||
|
||||
### 2. Phylogenetic Analysis
|
||||
|
||||
Analyze gene trees with evolutionary event detection:
|
||||
|
||||
- **Sequence alignment integration**: Link trees to multiple sequence alignments (FASTA, Phylip)
|
||||
- **Species naming**: Automatic or custom species extraction from gene names
|
||||
- **Evolutionary events**: Detect duplication and speciation events using Species Overlap or tree reconciliation
|
||||
- **Orthology detection**: Identify orthologs and paralogs based on evolutionary events
|
||||
- **Gene family analysis**: Split trees by duplications, collapse lineage-specific expansions
|
||||
|
||||
**Workflow for gene tree analysis:**
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree
|
||||
|
||||
# Load gene tree with alignment
|
||||
tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")
|
||||
|
||||
# Set species naming function
|
||||
def get_species(gene_name):
|
||||
return gene_name.split("_")[0]
|
||||
|
||||
tree.set_species_naming_function(get_species)
|
||||
|
||||
# Detect evolutionary events
|
||||
events = tree.get_descendant_evol_events()
|
||||
|
||||
# Analyze events
|
||||
for node in tree.traverse():
|
||||
if hasattr(node, "evoltype"):
|
||||
if node.evoltype == "D":
|
||||
print(f"Duplication at {node.name}")
|
||||
elif node.evoltype == "S":
|
||||
print(f"Speciation at {node.name}")
|
||||
|
||||
# Extract ortholog groups
|
||||
ortho_groups = tree.get_speciation_trees()
|
||||
for i, ortho_tree in enumerate(ortho_groups):
|
||||
ortho_tree.write(outfile=f"ortholog_group_{i}.nw")
|
||||
```
|
||||
|
||||
**Finding orthologs and paralogs:**
|
||||
|
||||
```python
|
||||
# Find orthologs to query gene
|
||||
query = tree & "species1_gene1"
|
||||
|
||||
orthologs = []
|
||||
paralogs = []
|
||||
|
||||
for event in events:
|
||||
if query in event.in_seqs:
|
||||
if event.etype == "S":
|
||||
orthologs.extend([s for s in event.out_seqs if s != query])
|
||||
elif event.etype == "D":
|
||||
paralogs.extend([s for s in event.out_seqs if s != query])
|
||||
```
|
||||
|
||||
### 3. NCBI Taxonomy Integration

Integrate taxonomic information from the NCBI Taxonomy database:

- **Database access**: Automatic download and local caching of NCBI taxonomy (~300MB)
- **Taxid/name translation**: Convert between taxonomic IDs and scientific names
- **Lineage retrieval**: Get complete evolutionary lineages
- **Taxonomy trees**: Build species trees connecting specified taxa
- **Tree annotation**: Automatically annotate trees with taxonomic information

**Building taxonomy-based trees:**

```python
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Build tree from species names
species = ["Homo sapiens", "Pan troglodytes", "Mus musculus"]
name2taxid = ncbi.get_name_translator(species)
taxids = [name2taxid[sp][0] for sp in species]

# Get minimal tree connecting taxa
tree = ncbi.get_topology(taxids)

# Annotate nodes with taxonomy info
for node in tree.traverse():
    if hasattr(node, "sci_name"):
        print(f"{node.sci_name} - Rank: {node.rank} - TaxID: {node.taxid}")
```

**Annotating existing trees:**

```python
# Get taxonomy info for tree leaves
for leaf in tree:
    # extract_species_from_name is a user-supplied helper that maps a leaf
    # label to an NCBI scientific name (e.g. "Hsap_BRCA2" -> "Homo sapiens")
    species = extract_species_from_name(leaf.name)
    taxid = ncbi.get_name_translator([species])[species][0]

    # Get lineage
    lineage = ncbi.get_lineage(taxid)
    ranks = ncbi.get_rank(lineage)
    names = ncbi.get_taxid_translator(lineage)

    # Add to node
    leaf.add_feature("taxid", taxid)
    leaf.add_feature("lineage", [names[t] for t in lineage])
```

### 4. Tree Visualization
|
||||
|
||||
Create publication-quality tree visualizations:
|
||||
|
||||
- **Output formats**: PNG (raster), PDF, and SVG (vector) for publications
|
||||
- **Layout modes**: Rectangular and circular tree layouts
|
||||
- **Interactive GUI**: Explore trees interactively with zoom, pan, and search
|
||||
- **Custom styling**: NodeStyle for node appearance (colors, shapes, sizes)
|
||||
- **Faces**: Add graphical elements (text, images, charts, heatmaps) to nodes
|
||||
- **Layout functions**: Dynamic styling based on node properties
|
||||
|
||||
**Basic visualization workflow:**
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Configure tree style
|
||||
ts = TreeStyle()
|
||||
ts.show_leaf_name = True
|
||||
ts.show_branch_support = True
|
||||
ts.scale = 50 # pixels per branch length unit
|
||||
|
||||
# Style nodes
|
||||
for node in tree.traverse():
|
||||
nstyle = NodeStyle()
|
||||
|
||||
if node.is_leaf():
|
||||
nstyle["fgcolor"] = "blue"
|
||||
nstyle["size"] = 8
|
||||
else:
|
||||
# Color by support
|
||||
if node.support > 0.9:
|
||||
nstyle["fgcolor"] = "darkgreen"
|
||||
else:
|
||||
nstyle["fgcolor"] = "red"
|
||||
nstyle["size"] = 5
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
# Render to file
|
||||
tree.render("tree.pdf", tree_style=ts)
|
||||
tree.render("tree.png", w=800, h=600, units="px", dpi=300)
|
||||
```
|
||||
|
||||
Use `scripts/quick_visualize.py` for rapid visualization:
|
||||
|
||||
```bash
|
||||
# Basic visualization
|
||||
python scripts/quick_visualize.py tree.nw output.pdf
|
||||
|
||||
# Circular layout with custom styling
|
||||
python scripts/quick_visualize.py tree.nw output.pdf --mode c --color-by-support
|
||||
|
||||
# High-resolution PNG
|
||||
python scripts/quick_visualize.py tree.nw output.png --width 1200 --height 800 --units px --dpi 300
|
||||
|
||||
# Custom title and styling
|
||||
python scripts/quick_visualize.py tree.nw output.pdf --title "Species Phylogeny" --show-support
|
||||
```
|
||||
|
||||
**Advanced visualization with faces:**
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace, CircleFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add features to nodes
|
||||
for leaf in tree:
|
||||
leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "land")
|
||||
|
||||
# Layout function
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add colored circle
|
||||
color = "blue" if node.habitat == "marine" else "green"
|
||||
circle = CircleFace(radius=5, color=color)
|
||||
node.add_face(circle, column=0, position="aligned")
|
||||
|
||||
# Add label
|
||||
label = TextFace(node.name, fsize=10)
|
||||
node.add_face(label, column=1, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("annotated_tree.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### 5. Clustering Analysis
|
||||
|
||||
Analyze hierarchical clustering results with data integration:
|
||||
|
||||
- **ClusterTree**: Specialized class for clustering dendrograms
|
||||
- **Data matrix linking**: Connect tree leaves to numerical profiles
|
||||
- **Cluster metrics**: Silhouette coefficient, Dunn index, inter/intra-cluster distances
|
||||
- **Validation**: Test cluster quality with different distance metrics
|
||||
- **Heatmap visualization**: Display data matrices alongside trees
|
||||
|
||||
**Clustering workflow:**
|
||||
|
||||
```python
|
||||
from ete3 import ClusterTree
|
||||
|
||||
# Load tree with data matrix
|
||||
matrix = """#Names\tSample1\tSample2\tSample3
|
||||
Gene1\t1.5\t2.3\t0.8
|
||||
Gene2\t0.9\t1.1\t1.8
|
||||
Gene3\t2.1\t2.5\t0.5"""
|
||||
|
||||
tree = ClusterTree("((Gene1,Gene2),Gene3);", text_array=matrix)
|
||||
|
||||
# Evaluate cluster quality
|
||||
for node in tree.traverse():
|
||||
if not node.is_leaf():
|
||||
silhouette = node.get_silhouette()
|
||||
dunn = node.get_dunn()
|
||||
|
||||
print(f"Cluster: {node.name}")
|
||||
print(f" Silhouette: {silhouette:.3f}")
|
||||
print(f" Dunn index: {dunn:.3f}")
|
||||
|
||||
# Visualize with heatmap
|
||||
tree.show("heatmap")
|
||||
```
|
||||
|
||||
### 6. Tree Comparison
|
||||
|
||||
Quantify topological differences between trees:
|
||||
|
||||
- **Robinson-Foulds distance**: Standard metric for tree comparison
|
||||
- **Normalized RF**: Scale-invariant distance (0.0 to 1.0)
|
||||
- **Partition analysis**: Identify unique and shared bipartitions
|
||||
- **Consensus trees**: Analyze support across multiple trees
|
||||
- **Batch comparison**: Compare multiple trees pairwise
|
||||
|
||||
**Compare two trees:**
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree1 = Tree("tree1.nw")
|
||||
tree2 = Tree("tree2.nw")
|
||||
|
||||
# Calculate RF distance
|
||||
rf, max_rf, common_leaves, parts_t1, parts_t2 = tree1.robinson_foulds(tree2)
|
||||
|
||||
print(f"RF distance: {rf}/{max_rf}")
|
||||
print(f"Normalized RF: {rf/max_rf:.3f}")
|
||||
print(f"Common leaves: {len(common_leaves)}")
|
||||
|
||||
# Find unique partitions
|
||||
unique_t1 = parts_t1 - parts_t2
|
||||
unique_t2 = parts_t2 - parts_t1
|
||||
|
||||
print(f"Unique to tree1: {len(unique_t1)}")
|
||||
print(f"Unique to tree2: {len(unique_t2)}")
|
||||
```
|
||||
|
||||
**Compare multiple trees:**

```python
import numpy as np

trees = [Tree(f"tree{i}.nw") for i in range(4)]

# Create distance matrix
n = len(trees)
dist_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        rf, max_rf, _, _, _ = trees[i].robinson_foulds(trees[j])
        norm_rf = rf / max_rf if max_rf > 0 else 0
        dist_matrix[i, j] = norm_rf
        dist_matrix[j, i] = norm_rf
```

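The pairwise matrix can then be handed to standard clustering tools to group trees with similar topologies. A minimal sketch (not part of ETE; assumes SciPy is installed and reuses `dist_matrix` from above):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Condense the symmetric matrix and cluster trees by topological similarity
condensed = squareform(dist_matrix)
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # one cluster label per input tree
```
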
## Installation and Setup
|
||||
|
||||
Install ETE toolkit:
|
||||
|
||||
```bash
|
||||
# Basic installation
|
||||
pip install ete3
|
||||
|
||||
# With external dependencies for rendering (optional but recommended)
|
||||
# On macOS:
|
||||
brew install qt@5
|
||||
|
||||
# On Ubuntu/Debian:
|
||||
sudo apt-get install python3-pyqt5 python3-pyqt5.qtsvg
|
||||
|
||||
# For full features including GUI
|
||||
pip install ete3[gui]
|
||||
```
|
||||
|
||||
**First-time NCBI Taxonomy setup:**
|
||||
|
||||
The first time NCBITaxa is instantiated, it automatically downloads the NCBI taxonomy database (~300MB) to `~/.etetoolkit/taxa.sqlite`. This happens only once:
|
||||
|
||||
```python
|
||||
from ete3 import NCBITaxa
|
||||
ncbi = NCBITaxa() # Downloads database on first run
|
||||
```
|
||||
|
||||
Update taxonomy database:
|
||||
|
||||
```python
|
||||
ncbi.update_taxonomy_database() # Download latest NCBI data
|
||||
```
|
||||
|
||||
## Common Use Cases

### Use Case 1: Phylogenomic Pipeline

Complete workflow from gene tree to ortholog identification:

```python
from ete3 import PhyloTree, NCBITaxa

# 1. Load gene tree with alignment
tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")

# 2. Configure species naming
tree.set_species_naming_function(lambda x: x.split("_")[0])

# 3. Detect evolutionary events
tree.get_descendant_evol_events()

# 4. Annotate with taxonomy
ncbi = NCBITaxa()
# species_to_taxid is a user-provided {species label: NCBI taxid} mapping
# (one way to build it is sketched after this code block)
for leaf in tree:
    if leaf.species in species_to_taxid:
        taxid = species_to_taxid[leaf.species]
        lineage = ncbi.get_lineage(taxid)
        leaf.add_feature("lineage", lineage)

# 5. Extract ortholog groups (tree count, duplication count, tree iterator)
ntrees, ndups, sptrees = tree.get_speciation_trees()

# 6. Save and visualize
for i, ortho in enumerate(sptrees):
    ortho.write(outfile=f"ortho_{i}.nw")
```

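Building `species_to_taxid` is left to the user; a minimal sketch, assuming the species labels extracted from the gene names are valid NCBI scientific names:

```python
# Hypothetical helper: map each species label on the tree to an NCBI taxid
species_labels = {leaf.species for leaf in tree}
name2taxid = ncbi.get_name_translator(list(species_labels))
species_to_taxid = {name: ids[0] for name, ids in name2taxid.items() if ids}
```
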
### Use Case 2: Tree Preprocessing and Formatting
|
||||
|
||||
Batch process trees for analysis:
|
||||
|
||||
```bash
|
||||
# Convert format
|
||||
python scripts/tree_operations.py convert input.nw output.nw --in-format 0 --out-format 1
|
||||
|
||||
# Root at midpoint
|
||||
python scripts/tree_operations.py reroot input.nw rooted.nw --midpoint
|
||||
|
||||
# Prune to focal taxa
|
||||
python scripts/tree_operations.py prune rooted.nw pruned.nw --keep-taxa taxa_list.txt
|
||||
|
||||
# Get statistics
|
||||
python scripts/tree_operations.py stats pruned.nw
|
||||
```
|
||||
|
||||
### Use Case 3: Publication-Quality Figures
|
||||
|
||||
Create styled visualizations:
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Define clade colors
|
||||
clade_colors = {
|
||||
"Mammals": "red",
|
||||
"Birds": "blue",
|
||||
"Fish": "green"
|
||||
}
|
||||
|
||||
def layout(node):
|
||||
# Highlight clades
|
||||
if node.is_leaf():
|
||||
for clade, color in clade_colors.items():
|
||||
if clade in node.name:
|
||||
nstyle = NodeStyle()
|
||||
nstyle["fgcolor"] = color
|
||||
nstyle["size"] = 8
|
||||
node.set_style(nstyle)
|
||||
else:
|
||||
# Add support values
|
||||
if node.support > 0.95:
|
||||
support = TextFace(f"{node.support:.2f}", fsize=8)
|
||||
node.add_face(support, column=0, position="branch-top")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_scale = True
|
||||
|
||||
# Render for publication
|
||||
tree.render("figure.pdf", w=200, units="mm", tree_style=ts)
|
||||
tree.render("figure.svg", tree_style=ts) # Editable vector
|
||||
```
|
||||
|
||||
### Use Case 4: Automated Tree Analysis
|
||||
|
||||
Process multiple trees systematically:
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
import os
|
||||
|
||||
input_dir = "trees"
|
||||
output_dir = "processed"
|
||||
|
||||
for filename in os.listdir(input_dir):
|
||||
if filename.endswith(".nw"):
|
||||
tree = Tree(os.path.join(input_dir, filename))
|
||||
|
||||
# Standardize: midpoint root, resolve polytomies
|
||||
midpoint = tree.get_midpoint_outgroup()
|
||||
tree.set_outgroup(midpoint)
|
||||
tree.resolve_polytomy(recursive=True)
|
||||
|
||||
# Filter low support branches
|
||||
for node in tree.traverse():
|
||||
if hasattr(node, 'support') and node.support < 0.5:
|
||||
if not node.is_leaf() and not node.is_root():
|
||||
node.delete()
|
||||
|
||||
# Save processed tree
|
||||
output_file = os.path.join(output_dir, f"processed_{filename}")
|
||||
tree.write(outfile=output_file)
|
||||
```
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
For comprehensive API documentation, code examples, and detailed guides, refer to the following resources in the `references/` directory:
|
||||
|
||||
- **`api_reference.md`**: Complete API documentation for all ETE classes and methods (Tree, PhyloTree, ClusterTree, NCBITaxa), including parameters, return types, and code examples
|
||||
- **`workflows.md`**: Common workflow patterns organized by task (tree operations, phylogenetic analysis, tree comparison, taxonomy integration, clustering analysis)
|
||||
- **`visualization.md`**: Comprehensive visualization guide covering TreeStyle, NodeStyle, Faces, layout functions, and advanced visualization techniques
|
||||
|
||||
Load these references when detailed information is needed:
|
||||
|
||||
```python
|
||||
# To use API reference
|
||||
# Read references/api_reference.md for complete method signatures and parameters
|
||||
|
||||
# To implement workflows
|
||||
# Read references/workflows.md for step-by-step workflow examples
|
||||
|
||||
# To create visualizations
|
||||
# Read references/visualization.md for styling and rendering options
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Import errors:**
|
||||
|
||||
```bash
|
||||
# If "ModuleNotFoundError: No module named 'ete3'"
|
||||
pip install ete3
|
||||
|
||||
# For GUI and rendering issues
|
||||
pip install ete3[gui]
|
||||
```
|
||||
|
||||
**Rendering issues:**
|
||||
|
||||
If `tree.render()` or `tree.show()` fails with Qt-related errors, install system dependencies:
|
||||
|
||||
```bash
|
||||
# macOS
|
||||
brew install qt@5
|
||||
|
||||
# Ubuntu/Debian
|
||||
sudo apt-get install python3-pyqt5 python3-pyqt5.qtsvg
|
||||
```
|
||||
|
||||
**NCBI Taxonomy database:**
|
||||
|
||||
If database download fails or becomes corrupted:
|
||||
|
||||
```python
|
||||
from ete3 import NCBITaxa
|
||||
ncbi = NCBITaxa()
|
||||
ncbi.update_taxonomy_database() # Redownload database
|
||||
```
|
||||
|
||||
**Memory issues with large trees:**
|
||||
|
||||
For very large trees (>10,000 leaves), use iterators instead of list comprehensions:
|
||||
|
||||
```python
|
||||
# Memory-efficient iteration
|
||||
for leaf in tree.iter_leaves():
|
||||
process(leaf)
|
||||
|
||||
# Instead of
|
||||
for leaf in tree.get_leaves(): # Loads all into memory
|
||||
process(leaf)
|
||||
```
|
||||
|
||||
## Newick Format Reference
|
||||
|
||||
ETE supports multiple Newick format specifications (0-100):
|
||||
|
||||
- **Format 0**: Flexible with branch lengths (default)
|
||||
- **Format 1**: With internal node names
|
||||
- **Format 2**: With bootstrap/support values
|
||||
- **Format 5**: Internal node names + branch lengths
|
||||
- **Format 8**: All features (names, distances, support)
|
||||
- **Format 9**: Leaf names only
|
||||
- **Format 100**: Topology only
|
||||
|
||||
Specify format when reading/writing:
|
||||
|
||||
```python
|
||||
tree = Tree("tree.nw", format=1)
|
||||
tree.write(outfile="output.nw", format=5)
|
||||
```

NHX (New Hampshire eXtended) format preserves custom features:

```python
tree.write(outfile="tree.nhx", features=["habitat", "temperature", "depth"])
```

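Reading the NHX file back should restore those annotations as node features (a small sketch; the feature names above are illustrative, and values are read back as strings):

```python
from ete3 import Tree

# NHX tags such as [&&NHX:habitat=marine] are parsed into node features
tree = Tree("tree.nhx")
for leaf in tree:
    print(leaf.name, getattr(leaf, "habitat", "NA"), getattr(leaf, "temperature", "NA"))
```
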
## Best Practices

1. **Preserve branch lengths**: Use `preserve_branch_length=True` when pruning for phylogenetic analysis
2. **Cache content**: Use `get_cached_content()` for repeated access to node contents on large trees (see the sketch after this list)
3. **Use iterators**: Employ `iter_*` methods for memory-efficient processing of large trees
4. **Choose appropriate traversal**: Postorder for bottom-up analysis, preorder for top-down
5. **Validate monophyly**: Always check the returned clade type (monophyletic/paraphyletic/polyphyletic)
6. **Vector formats for publication**: Use PDF or SVG for publication figures (scalable, editable)
7. **Interactive testing**: Use `tree.show()` to test visualizations before rendering to file
8. **PhyloTree for phylogenetics**: Use the PhyloTree class for gene trees and evolutionary analysis
9. **Copy method selection**: "newick" for speed, "cpickle" for full fidelity, "deepcopy" for complex objects
10. **NCBI query caching**: Store NCBI taxonomy query results to avoid repeated database access
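
A minimal sketch of practices 2 and 3 combined (the input filename is illustrative):

```python
from ete3 import Tree

tree = Tree("big_tree.nw")  # hypothetical large tree

# Practice 2: compute the node -> leaf-set mapping once, then reuse it
node2leaves = tree.get_cached_content()
for node in tree.traverse("postorder"):
    if len(node2leaves[node]) > 100:  # O(1) lookup instead of node.get_leaves()
        print(node.name or "(internal)", len(node2leaves[node]))

# Practice 3: iterate leaves lazily instead of materialising a full list
for leaf in tree.iter_leaves():
    pass  # process(leaf)
```
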
583
scientific-packages/etetoolkit/references/api_reference.md
Normal file
@@ -0,0 +1,583 @@
|
||||
# ETE Toolkit API Reference
|
||||
|
||||
## Overview
|
||||
|
||||
ETE (Environment for Tree Exploration) is a Python toolkit for phylogenetic tree manipulation, analysis, and visualization. This reference covers the main classes and methods.
|
||||
|
||||
## Core Classes
|
||||
|
||||
### TreeNode (alias: Tree)
|
||||
|
||||
The fundamental class representing tree structures with hierarchical node organization.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
from ete3 import Tree
|
||||
t = Tree(newick=None, format=0, dist=None, support=None, name=None)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `newick`: Newick string or file path
|
||||
- `format`: Newick format (0-100). Common formats:
|
||||
- `0`: Flexible format with branch lengths and names
|
||||
- `1`: With internal node names
|
||||
- `2`: With bootstrap/support values
|
||||
- `5`: Internal node names and branch lengths
|
||||
- `8`: All features (names, distances, support)
|
||||
- `9`: Leaf names only
|
||||
- `100`: Topology only
|
||||
- `dist`: Branch length to parent (default: 1.0)
|
||||
- `support`: Bootstrap/confidence value (default: 1.0)
|
||||
- `name`: Node identifier
|
||||
|
||||
### PhyloTree
|
||||
|
||||
Specialized class for phylogenetic analysis, extending TreeNode.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
from ete3 import PhyloTree
|
||||
t = PhyloTree(newick=None, alignment=None, alg_format='fasta',
|
||||
sp_naming_function=None, format=0)
|
||||
```
|
||||
|
||||
**Additional Parameters:**
|
||||
- `alignment`: Path to alignment file or alignment string
|
||||
- `alg_format`: 'fasta' or 'phylip'
|
||||
- `sp_naming_function`: Custom function to extract species from node names
|
||||
|
||||
### ClusterTree
|
||||
|
||||
Class for hierarchical clustering analysis.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
from ete3 import ClusterTree
|
||||
t = ClusterTree(newick, text_array=None)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `text_array`: Tab-delimited matrix with column headers and row names
|
||||
|
||||
### NCBITaxa
|
||||
|
||||
Class for NCBI taxonomy database operations.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
from ete3 import NCBITaxa
|
||||
ncbi = NCBITaxa(dbfile=None)
|
||||
```
|
||||
|
||||
First instantiation downloads ~300MB NCBI taxonomy database to `~/.etetoolkit/taxa.sqlite`.
|
||||
|
||||
## Node Properties
|
||||
|
||||
### Basic Attributes
|
||||
|
||||
| Property | Type | Description | Default |
|
||||
|----------|------|-------------|---------|
|
||||
| `name` | str | Node identifier | "NoName" |
|
||||
| `dist` | float | Branch length to parent | 1.0 |
|
||||
| `support` | float | Bootstrap/confidence value | 1.0 |
|
||||
| `up` | TreeNode | Parent node reference | None |
|
||||
| `children` | list | Child nodes | [] |
|
||||
|
||||
### Custom Features
|
||||
|
||||
Add any custom data to nodes:
|
||||
```python
|
||||
node.add_feature("custom_name", value)
|
||||
node.add_features(feature1=value1, feature2=value2)
|
||||
```
|
||||
|
||||
Access features:
|
||||
```python
|
||||
value = node.custom_name
|
||||
# or
|
||||
value = getattr(node, "custom_name", default_value)
|
||||
```
|
||||
|
||||
## Navigation & Traversal
|
||||
|
||||
### Basic Navigation
|
||||
|
||||
```python
|
||||
# Check node type
|
||||
node.is_leaf() # Returns True if terminal node
|
||||
node.is_root() # Returns True if root node
|
||||
len(node) # Number of leaves under node
|
||||
|
||||
# Get relatives
|
||||
parent = node.up
|
||||
children = node.children
|
||||
root = node.get_tree_root()
|
||||
```
|
||||
|
||||
### Traversal Strategies
|
||||
|
||||
```python
|
||||
# Three traversal strategies
|
||||
for node in tree.traverse("preorder"): # Root → Left → Right
|
||||
print(node.name)
|
||||
|
||||
for node in tree.traverse("postorder"): # Left → Right → Root
|
||||
print(node.name)
|
||||
|
||||
for node in tree.traverse("levelorder"): # Level by level
|
||||
print(node.name)
|
||||
|
||||
# Exclude root
|
||||
for node in tree.iter_descendants("postorder"):
|
||||
print(node.name)
|
||||
```
|
||||
|
||||
### Getting Nodes
|
||||
|
||||
```python
|
||||
# Get all leaves
|
||||
leaves = tree.get_leaves()
|
||||
for leaf in tree: # Shortcut iteration
|
||||
print(leaf.name)
|
||||
|
||||
# Get all descendants
|
||||
descendants = tree.get_descendants()
|
||||
|
||||
# Get ancestors
|
||||
ancestors = node.get_ancestors()
|
||||
|
||||
# Get specific nodes by attribute
|
||||
nodes = tree.search_nodes(name="NodeA")
|
||||
node = tree & "NodeA" # Shortcut syntax
|
||||
|
||||
# Get leaves by name
|
||||
leaves = tree.get_leaves_by_name("LeafA")
|
||||
|
||||
# Get common ancestor
|
||||
ancestor = tree.get_common_ancestor("LeafA", "LeafB", "LeafC")
|
||||
|
||||
# Custom filtering
|
||||
filtered = [n for n in tree.traverse() if n.dist > 0.5 and n.is_leaf()]
|
||||
```
|
||||
|
||||
### Iterator Methods (Memory Efficient)
|
||||
|
||||
```python
|
||||
# For large trees, use iterators
|
||||
for match in tree.iter_search_nodes(name="X"):
|
||||
if some_condition:
|
||||
break # Stop early
|
||||
|
||||
for leaf in tree.iter_leaves():
|
||||
process(leaf)
|
||||
|
||||
for descendant in node.iter_descendants():
|
||||
process(descendant)
|
||||
```
|
||||
|
||||
## Tree Construction & Modification
|
||||
|
||||
### Creating Trees from Scratch
|
||||
|
||||
```python
|
||||
# Empty tree
|
||||
t = Tree()
|
||||
|
||||
# Add children
|
||||
child1 = t.add_child(name="A", dist=1.0)
|
||||
child2 = t.add_child(name="B", dist=2.0)
|
||||
|
||||
# Add siblings
|
||||
sister = child1.add_sister(name="C", dist=1.5)
|
||||
|
||||
# Populate with random topology
|
||||
t.populate(10) # Creates 10 random leaves
|
||||
t.populate(5, names_library=["A", "B", "C", "D", "E"])
|
||||
```
|
||||
|
||||
### Removing & Deleting Nodes

```python
# Detach: removes the entire subtree rooted at this node
node.detach()
# or
parent.remove_child(node)

# Delete: removes only this node and reconnects its children to its parent
node.delete()
```

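To make the distinction concrete, a small self-contained example (node names are illustrative):

```python
from ete3 import Tree

t = Tree("((A,B)int1,(C,D)int2)root;", format=1)

(t & "int1").delete()        # int1 vanishes; A and B reattach to the root
print(t.write(format=9))     # e.g. (A,B,(C,D));

(t & "int2").detach()        # the whole (C,D) subtree is removed
print(t.write(format=9))     # e.g. (A,B);
```
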
### Pruning
|
||||
|
||||
Keep only specified leaves:
|
||||
```python
|
||||
# Keep only these leaves, remove all others
|
||||
tree.prune(["A", "B", "C"])
|
||||
|
||||
# Preserve original branch lengths
|
||||
tree.prune(["A", "B", "C"], preserve_branch_length=True)
|
||||
```
|
||||
|
||||
### Tree Concatenation
|
||||
|
||||
```python
|
||||
# Attach one tree as child of another
|
||||
t1 = Tree("(A,(B,C));")
|
||||
t2 = Tree("((D,E),(F,G));")
|
||||
A = t1 & "A"
|
||||
A.add_child(t2)
|
||||
```
|
||||
|
||||
### Tree Copying
|
||||
|
||||
```python
|
||||
# Four copy methods
|
||||
copy1 = tree.copy() # Default: cpickle (preserves types)
|
||||
copy2 = tree.copy("newick") # Fastest: basic topology
|
||||
copy3 = tree.copy("newick-extended") # Includes custom features as text
|
||||
copy4 = tree.copy("deepcopy") # Slowest: handles complex objects
|
||||
```
|
||||
|
||||
## Tree Operations
|
||||
|
||||
### Rooting
|
||||
|
||||
```python
|
||||
# Set outgroup (reroot tree)
|
||||
outgroup_node = tree & "OutgroupLeaf"
|
||||
tree.set_outgroup(outgroup_node)
|
||||
|
||||
# Midpoint rooting
|
||||
midpoint = tree.get_midpoint_outgroup()
|
||||
tree.set_outgroup(midpoint)
|
||||
|
||||
# Unroot tree
|
||||
tree.unroot()
|
||||
```
|
||||
|
||||
### Resolving Polytomies
|
||||
|
||||
```python
|
||||
# Resolve multifurcations to bifurcations
|
||||
tree.resolve_polytomy(recursive=False) # Single node only
|
||||
tree.resolve_polytomy(recursive=True) # Entire tree
|
||||
```
|
||||
|
||||
### Ladderize
|
||||
|
||||
```python
|
||||
# Sort branches by size
|
||||
tree.ladderize()
|
||||
tree.ladderize(direction=1) # Ascending order
|
||||
```
|
||||
|
||||
### Convert to Ultrametric
|
||||
|
||||
```python
|
||||
# Make all leaves equidistant from root
|
||||
tree.convert_to_ultrametric()
|
||||
tree.convert_to_ultrametric(tree_length=100) # Specific total length
|
||||
```
|
||||
|
||||
## Distance & Comparison
|
||||
|
||||
### Distance Calculations
|
||||
|
||||
```python
|
||||
# Branch length distance between nodes
|
||||
dist = tree.get_distance("A", "B")
|
||||
dist = nodeA.get_distance(nodeB)
|
||||
|
||||
# Topology-only distance (count nodes)
|
||||
dist = tree.get_distance("A", "B", topology_only=True)
|
||||
|
||||
# Farthest node
|
||||
farthest, distance = node.get_farthest_node()
|
||||
farthest_leaf, distance = node.get_farthest_leaf()
|
||||
```
|
||||
|
||||
### Monophyly Testing
|
||||
|
||||
```python
|
||||
# Check if values form monophyletic group
|
||||
is_mono, clade_type, base_node = tree.check_monophyly(
|
||||
values=["A", "B", "C"],
|
||||
target_attr="name"
|
||||
)
|
||||
# Returns: (bool, "monophyletic"|"paraphyletic"|"polyphyletic", node)
|
||||
|
||||
# Get all monophyletic clades
|
||||
monophyletic_nodes = tree.get_monophyletic(
|
||||
values=["A", "B", "C"],
|
||||
target_attr="name"
|
||||
)
|
||||
```
|
||||
|
||||
### Tree Comparison
|
||||
|
||||
```python
|
||||
# Robinson-Foulds distance
|
||||
rf, max_rf, common_leaves, parts_t1, parts_t2 = t1.robinson_foulds(t2)
|
||||
print(f"RF distance: {rf}/{max_rf}")
|
||||
|
||||
# Normalized RF distance
|
||||
result = t1.compare(t2)
|
||||
norm_rf = result["norm_rf"] # 0.0 to 1.0
|
||||
ref_edges = result["ref_edges_in_source"]
|
||||
```
|
||||
|
||||
## Input/Output
|
||||
|
||||
### Reading Trees
|
||||
|
||||
```python
|
||||
# From string
|
||||
t = Tree("(A:1,(B:1,(C:1,D:1):0.5):0.5);")
|
||||
|
||||
# From file
|
||||
t = Tree("tree.nw")
|
||||
|
||||
# With format
|
||||
t = Tree("tree.nw", format=1)
|
||||
```
|
||||
|
||||
### Writing Trees
|
||||
|
||||
```python
|
||||
# To string
|
||||
newick = tree.write()
|
||||
newick = tree.write(format=1)
|
||||
newick = tree.write(format=1, features=["support", "custom_feature"])
|
||||
|
||||
# To file
|
||||
tree.write(outfile="output.nw")
|
||||
tree.write(format=5, outfile="output.nw", features=["name", "dist"])
|
||||
|
||||
# Custom leaf function (for collapsing)
|
||||
def is_leaf(node):
|
||||
return len(node) <= 3 # Treat small clades as leaves
|
||||
|
||||
newick = tree.write(is_leaf_fn=is_leaf)
|
||||
```
|
||||
|
||||
### Tree Rendering
|
||||
|
||||
```python
|
||||
# Show interactive GUI
|
||||
tree.show()
|
||||
|
||||
# Render to file (PNG, PDF, SVG)
|
||||
tree.render("tree.png")
|
||||
tree.render("tree.pdf", w=200, units="mm")
|
||||
tree.render("tree.svg", dpi=300)
|
||||
|
||||
# ASCII representation
|
||||
print(tree)
|
||||
print(tree.get_ascii(show_internal=True, compact=False))
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Caching Content
|
||||
|
||||
For frequent access to node contents:
|
||||
```python
|
||||
# Cache all node contents
|
||||
node2content = tree.get_cached_content()
|
||||
|
||||
# Fast lookup
|
||||
for node in tree.traverse():
|
||||
leaves = node2content[node]
|
||||
print(f"Node has {len(leaves)} leaves")
|
||||
```
|
||||
|
||||
### Precomputing Distances

```python
# For multiple distance queries
node2dist = {}
for node in tree.traverse():
    node2dist[node] = node.get_distance(tree)
```

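For large trees, the same table can be filled in a single pass by accumulating branch lengths from the root instead of calling `get_distance()` for every node (a sketch, not from the ETE docs):

```python
# Root-to-node distance = parent's distance + this node's branch length
node2dist = {tree: 0.0}
for node in tree.iter_descendants("preorder"):
    node2dist[node] = node2dist[node.up] + node.dist
```
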
## PhyloTree-Specific Methods
|
||||
|
||||
### Sequence Alignment
|
||||
|
||||
```python
|
||||
# Link alignment
|
||||
tree.link_to_alignment("alignment.fasta", alg_format="fasta")
|
||||
|
||||
# Access sequences
|
||||
for leaf in tree:
|
||||
print(f"{leaf.name}: {leaf.sequence}")
|
||||
```
|
||||
|
||||
### Species Naming
|
||||
|
||||
```python
|
||||
# Default: first 3 letters
|
||||
# Custom function
|
||||
def get_species(node_name):
|
||||
return node_name.split("_")[0]
|
||||
|
||||
tree.set_species_naming_function(get_species)
|
||||
|
||||
# Manual setting
|
||||
for leaf in tree:
|
||||
leaf.species = extract_species(leaf.name)
|
||||
```
|
||||
|
||||
### Evolutionary Events
|
||||
|
||||
```python
|
||||
# Detect duplication/speciation events
|
||||
events = tree.get_descendant_evol_events()
|
||||
|
||||
for node in tree.traverse():
|
||||
if hasattr(node, "evoltype"):
|
||||
print(f"{node.name}: {node.evoltype}") # "D" or "S"
|
||||
|
||||
# With species tree
|
||||
species_tree = Tree("(human, (chimp, gorilla));")
|
||||
events = tree.get_descendant_evol_events(species_tree=species_tree)
|
||||
```
|
||||
|
||||
### Gene Tree Operations
|
||||
|
||||
```python
|
||||
# Get species trees from duplicated gene families
|
||||
species_trees = tree.get_speciation_trees()
|
||||
|
||||
# Split by duplication events
|
||||
subtrees = tree.split_by_dups()
|
||||
|
||||
# Collapse lineage-specific expansions
|
||||
tree.collapse_lineage_specific_expansions()
|
||||
```
|
||||
|
||||
## NCBITaxa Methods
|
||||
|
||||
### Database Operations
|
||||
|
||||
```python
|
||||
from ete3 import NCBITaxa
|
||||
ncbi = NCBITaxa()
|
||||
|
||||
# Update database
|
||||
ncbi.update_taxonomy_database()
|
||||
```
|
||||
|
||||
### Querying Taxonomy
|
||||
|
||||
```python
|
||||
# Get taxid from name
|
||||
taxid = ncbi.get_name_translator(["Homo sapiens"])
|
||||
# Returns: {'Homo sapiens': [9606]}
|
||||
|
||||
# Get name from taxid
|
||||
names = ncbi.get_taxid_translator([9606, 9598])
|
||||
# Returns: {9606: 'Homo sapiens', 9598: 'Pan troglodytes'}
|
||||
|
||||
# Get rank
|
||||
rank = ncbi.get_rank([9606])
|
||||
# Returns: {9606: 'species'}
|
||||
|
||||
# Get lineage
|
||||
lineage = ncbi.get_lineage(9606)
|
||||
# Returns: [1, 131567, 2759, ..., 9606]
|
||||
|
||||
# Get descendants
|
||||
descendants = ncbi.get_descendant_taxa("Primates")
|
||||
descendants = ncbi.get_descendant_taxa("Primates", collapse_subspecies=True)
|
||||
```
|
||||
|
||||
### Building Taxonomy Trees
|
||||
|
||||
```python
|
||||
# Get minimal tree connecting taxa
|
||||
tree = ncbi.get_topology([9606, 9598, 9593]) # Human, chimp, gorilla
|
||||
|
||||
# Annotate tree with taxonomy
|
||||
tree.annotate_ncbi_taxa()
|
||||
|
||||
# Access taxonomy info
|
||||
for node in tree.traverse():
|
||||
print(f"{node.sci_name} ({node.taxid}) - Rank: {node.rank}")
|
||||
```
|
||||
|
||||
## ClusterTree Methods
|
||||
|
||||
### Linking to Data
|
||||
|
||||
```python
|
||||
# Link matrix to tree
|
||||
tree.link_to_arraytable(matrix_string)
|
||||
|
||||
# Access profiles
|
||||
for leaf in tree:
|
||||
print(leaf.profile) # Numerical array
|
||||
```
|
||||
|
||||
### Cluster Metrics
|
||||
|
||||
```python
|
||||
# Get silhouette coefficient
|
||||
silhouette = tree.get_silhouette()
|
||||
|
||||
# Get Dunn index
|
||||
dunn = tree.get_dunn()
|
||||
|
||||
# Inter/intra cluster distances
|
||||
inter = node.intercluster_dist
|
||||
intra = node.intracluster_dist
|
||||
|
||||
# Standard deviation
|
||||
dev = node.deviation
|
||||
```
|
||||
|
||||
### Distance Metrics
|
||||
|
||||
Supported metrics:
|
||||
- `"euclidean"`: Euclidean distance
|
||||
- `"pearson"`: Pearson correlation
|
||||
- `"spearman"`: Spearman rank correlation
|
||||
|
||||
```python
|
||||
tree.dist_to(node2, metric="pearson")
|
||||
```
|
||||
|
||||
## Common Error Handling
|
||||
|
||||
```python
|
||||
# Check if tree is empty
|
||||
if tree.children:
|
||||
print("Tree has children")
|
||||
|
||||
# Check if node exists
|
||||
nodes = tree.search_nodes(name="X")
|
||||
if nodes:
|
||||
node = nodes[0]
|
||||
|
||||
# Safe feature access
|
||||
value = getattr(node, "feature_name", default_value)
|
||||
|
||||
# Check format compatibility
|
||||
try:
|
||||
tree.write(format=1)
|
||||
except:
|
||||
print("Tree lacks internal node names")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use appropriate traversal**: Postorder for bottom-up, preorder for top-down
|
||||
2. **Cache for repeated access**: Use `get_cached_content()` for frequent queries
|
||||
3. **Use iterators for large trees**: Memory-efficient processing
|
||||
4. **Preserve branch lengths**: Use `preserve_branch_length=True` when pruning
|
||||
5. **Choose copy method wisely**: "newick" for speed, "cpickle" for full fidelity
|
||||
6. **Validate monophyly**: Check returned clade type (monophyletic/paraphyletic/polyphyletic)
|
||||
7. **Use PhyloTree for phylogenetics**: Specialized methods for evolutionary analysis
|
||||
8. **Cache NCBI queries**: Store results to avoid repeated database access
|
||||
783
scientific-packages/etetoolkit/references/visualization.md
Normal file
@@ -0,0 +1,783 @@
|
||||
# ETE Toolkit Visualization Guide
|
||||
|
||||
Complete guide to tree visualization with ETE Toolkit.
|
||||
|
||||
## Table of Contents
|
||||
1. [Rendering Basics](#rendering-basics)
|
||||
2. [TreeStyle Configuration](#treestyle-configuration)
|
||||
3. [Node Styling](#node-styling)
|
||||
4. [Faces](#faces)
|
||||
5. [Layout Functions](#layout-functions)
|
||||
6. [Advanced Visualization](#advanced-visualization)
|
||||
|
||||
---
|
||||
|
||||
## Rendering Basics
|
||||
|
||||
### Output Formats
|
||||
|
||||
ETE supports three main output formats:
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# PNG (raster, good for presentations)
|
||||
tree.render("output.png", w=800, h=600, units="px", dpi=300)
|
||||
|
||||
# PDF (vector, good for publications)
|
||||
tree.render("output.pdf", w=200, units="mm")
|
||||
|
||||
# SVG (vector, editable)
|
||||
tree.render("output.svg")
|
||||
```
|
||||
|
||||
### Units and Dimensions
|
||||
|
||||
```python
|
||||
# Pixels
|
||||
tree.render("tree.png", w=1200, h=800, units="px")
|
||||
|
||||
# Millimeters
|
||||
tree.render("tree.pdf", w=210, h=297, units="mm") # A4 size
|
||||
|
||||
# Inches
|
||||
tree.render("tree.pdf", w=8.5, h=11, units="in") # US Letter
|
||||
|
||||
# Auto-size (aspect ratio preserved)
|
||||
tree.render("tree.pdf", w=200, units="mm") # Height auto-calculated
|
||||
```
|
||||
|
||||
### Interactive Visualization
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Launch GUI
|
||||
# - Zoom with mouse wheel
|
||||
# - Pan by dragging
|
||||
# - Search with Ctrl+F
|
||||
# - Export from menu
|
||||
# - Edit node properties
|
||||
tree.show()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## TreeStyle Configuration
|
||||
|
||||
### Basic TreeStyle Options
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
ts = TreeStyle()
|
||||
|
||||
# Display options
|
||||
ts.show_leaf_name = True # Show leaf names
|
||||
ts.show_branch_length = True # Show branch lengths
|
||||
ts.show_branch_support = True # Show support values
|
||||
ts.show_scale = True # Show scale bar
|
||||
|
||||
# Branch length scaling
|
||||
ts.scale = 50 # Pixels per branch length unit
|
||||
ts.min_leaf_separation = 10 # Minimum space between leaves (pixels)
|
||||
|
||||
# Layout orientation
|
||||
ts.rotation = 0 # 0=left-to-right, 90=top-to-bottom
|
||||
ts.branch_vertical_margin = 10 # Vertical spacing between branches
|
||||
|
||||
# Tree shape
|
||||
ts.mode = "r" # "r"=rectangular (default), "c"=circular
|
||||
|
||||
tree.render("tree.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Circular Trees
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
ts = TreeStyle()
|
||||
|
||||
# Circular mode
|
||||
ts.mode = "c"
|
||||
ts.arc_start = 0 # Starting angle (degrees)
|
||||
ts.arc_span = 360 # Angular span (degrees, 360=full circle)
|
||||
|
||||
# For semicircle
|
||||
ts.arc_start = -180
|
||||
ts.arc_span = 180
|
||||
|
||||
tree.render("circular_tree.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Title and Legend
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
ts = TreeStyle()
|
||||
|
||||
# Add title
|
||||
title = TextFace("Phylogenetic Tree of Species", fsize=20, bold=True)
|
||||
ts.title.add_face(title, column=0)
|
||||
|
||||
# Add legend
|
||||
ts.legend.add_face(TextFace("Red nodes: High support", fsize=10), column=0)
|
||||
ts.legend.add_face(TextFace("Blue nodes: Low support", fsize=10), column=0)
|
||||
|
||||
# Legend position
|
||||
ts.legend_position = 1 # 1=top-right, 2=top-left, 3=bottom-left, 4=bottom-right
|
||||
|
||||
tree.render("tree_with_legend.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Custom Background
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
ts = TreeStyle()
|
||||
|
||||
# Background color
|
||||
ts.bgcolor = "#f0f0f0" # Light gray background
|
||||
|
||||
# Tree border
|
||||
ts.show_border = True
|
||||
|
||||
tree.render("tree_background.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Node Styling
|
||||
|
||||
### NodeStyle Properties
|
||||
|
||||
```python
|
||||
from ete3 import Tree, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
for node in tree.traverse():
|
||||
nstyle = NodeStyle()
|
||||
|
||||
# Node size and shape
|
||||
nstyle["size"] = 10 # Node size in pixels
|
||||
nstyle["shape"] = "circle" # "circle", "square", "sphere"
|
||||
|
||||
# Colors
|
||||
nstyle["fgcolor"] = "blue" # Foreground color (node itself)
|
||||
nstyle["bgcolor"] = "lightblue" # Background color (only for sphere)
|
||||
|
||||
# Line style for branches
|
||||
nstyle["hz_line_type"] = 0 # 0=solid, 1=dashed, 2=dotted
|
||||
nstyle["vt_line_type"] = 0 # Vertical line type
|
||||
nstyle["hz_line_color"] = "black" # Horizontal line color
|
||||
nstyle["vt_line_color"] = "black" # Vertical line color
|
||||
nstyle["hz_line_width"] = 2 # Line width in pixels
|
||||
nstyle["vt_line_width"] = 2
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
tree.render("styled_tree.pdf")
|
||||
```
|
||||
|
||||
### Conditional Styling
|
||||
|
||||
```python
|
||||
from ete3 import Tree, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Style based on node properties
|
||||
for node in tree.traverse():
|
||||
nstyle = NodeStyle()
|
||||
|
||||
if node.is_leaf():
|
||||
# Leaf node style
|
||||
nstyle["size"] = 8
|
||||
nstyle["fgcolor"] = "darkgreen"
|
||||
nstyle["shape"] = "circle"
|
||||
else:
|
||||
# Internal node style based on support
|
||||
if node.support > 0.9:
|
||||
nstyle["size"] = 6
|
||||
nstyle["fgcolor"] = "red"
|
||||
nstyle["shape"] = "sphere"
|
||||
else:
|
||||
nstyle["size"] = 4
|
||||
nstyle["fgcolor"] = "gray"
|
||||
nstyle["shape"] = "circle"
|
||||
|
||||
# Style branches by length
|
||||
if node.dist > 1.0:
|
||||
nstyle["hz_line_width"] = 3
|
||||
nstyle["hz_line_color"] = "blue"
|
||||
else:
|
||||
nstyle["hz_line_width"] = 1
|
||||
nstyle["hz_line_color"] = "black"
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
tree.render("conditional_styled_tree.pdf")
|
||||
```
|
||||
|
||||
### Hiding Nodes
|
||||
|
||||
```python
|
||||
from ete3 import Tree, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Hide specific nodes
|
||||
for node in tree.traverse():
|
||||
if node.support < 0.5: # Hide low support nodes
|
||||
nstyle = NodeStyle()
|
||||
nstyle["draw_descendants"] = False # Don't draw this node's subtree
|
||||
nstyle["size"] = 0 # Make node invisible
|
||||
node.set_style(nstyle)
|
||||
|
||||
tree.render("filtered_tree.pdf")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Faces
|
||||
|
||||
Faces are graphical elements attached to nodes. They appear at specific positions around nodes.
|
||||
|
||||
### Face Positions
|
||||
|
||||
- `"branch-right"`: Right side of branch (after node)
|
||||
- `"branch-top"`: Above branch
|
||||
- `"branch-bottom"`: Below branch
|
||||
- `"aligned"`: Aligned column at tree edge (for leaves)
|
||||
|
||||
### TextFace
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add species name
|
||||
name_face = TextFace(node.name, fsize=12, fgcolor="black")
|
||||
node.add_face(name_face, column=0, position="branch-right")
|
||||
|
||||
# Add additional text
|
||||
info_face = TextFace(f"Length: {node.dist:.3f}", fsize=8, fgcolor="gray")
|
||||
node.add_face(info_face, column=1, position="branch-right")
|
||||
else:
|
||||
# Add support value
|
||||
if node.support:
|
||||
support_face = TextFace(f"{node.support:.2f}", fsize=8, fgcolor="red")
|
||||
node.add_face(support_face, column=0, position="branch-top")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False # We're adding custom names
|
||||
|
||||
tree.render("tree_textfaces.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### AttrFace
|
||||
|
||||
Display node attributes directly:
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, AttrFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add custom attributes
|
||||
for leaf in tree:
|
||||
leaf.add_feature("habitat", "aquatic" if "fish" in leaf.name else "terrestrial")
|
||||
leaf.add_feature("temperature", 20)
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Display attribute directly
|
||||
habitat_face = AttrFace("habitat", fsize=10)
|
||||
node.add_face(habitat_face, column=0, position="aligned")
|
||||
|
||||
temp_face = AttrFace("temperature", fsize=10)
|
||||
node.add_face(temp_face, column=1, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
|
||||
tree.render("tree_attrfaces.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### CircleFace
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, CircleFace, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Annotate with habitat
|
||||
for leaf in tree:
|
||||
leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "land")
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Colored circle based on habitat
|
||||
color = "blue" if node.habitat == "marine" else "green"
|
||||
circle = CircleFace(radius=5, color=color, style="circle")
|
||||
node.add_face(circle, column=0, position="aligned")
|
||||
|
||||
# Label
|
||||
name = TextFace(node.name, fsize=10)
|
||||
node.add_face(name, column=1, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("tree_circles.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### ImgFace
|
||||
|
||||
Add images to nodes:
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, ImgFace, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add species image
|
||||
img_path = f"images/{node.name}.png" # Path to image
|
||||
try:
|
||||
img_face = ImgFace(img_path, width=50, height=50)
|
||||
node.add_face(img_face, column=0, position="aligned")
|
||||
except:
|
||||
pass # Skip if image doesn't exist
|
||||
|
||||
# Add name
|
||||
name_face = TextFace(node.name, fsize=10)
|
||||
node.add_face(name_face, column=1, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("tree_images.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### BarChartFace
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, BarChartFace, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add data for bar charts
|
||||
for leaf in tree:
|
||||
leaf.add_feature("values", [1.2, 2.3, 0.5, 1.8]) # Multiple values
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add bar chart
|
||||
chart = BarChartFace(
|
||||
node.values,
|
||||
width=100,
|
||||
height=40,
|
||||
colors=["red", "blue", "green", "orange"],
|
||||
labels=["A", "B", "C", "D"]
|
||||
)
|
||||
node.add_face(chart, column=0, position="aligned")
|
||||
|
||||
# Add name
|
||||
name = TextFace(node.name, fsize=10)
|
||||
node.add_face(name, column=1, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("tree_barcharts.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### PieChartFace
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, PieChartFace, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add data
|
||||
for leaf in tree:
|
||||
leaf.add_feature("proportions", [25, 35, 40]) # Percentages
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add pie chart
|
||||
pie = PieChartFace(
|
||||
node.proportions,
|
||||
width=30,
|
||||
height=30,
|
||||
colors=["red", "blue", "green"]
|
||||
)
|
||||
node.add_face(pie, column=0, position="aligned")
|
||||
|
||||
name = TextFace(node.name, fsize=10)
|
||||
node.add_face(name, column=1, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("tree_piecharts.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### SequenceFace (for alignments)
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree, TreeStyle, SeqMotifFace
|
||||
|
||||
tree = PhyloTree("tree.nw")
|
||||
tree.link_to_alignment("alignment.fasta")
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Display sequence
|
||||
seq_face = SeqMotifFace(node.sequence, seq_format="seq")
|
||||
node.add_face(seq_face, column=0, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = True
|
||||
|
||||
tree.render("tree_alignment.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Layout Functions
|
||||
|
||||
Layout functions are Python functions that modify node appearance during rendering.
|
||||
|
||||
### Basic Layout Function
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
def my_layout(node):
|
||||
"""Called for every node before rendering"""
|
||||
|
||||
if node.is_leaf():
|
||||
# Add text to leaves
|
||||
name_face = TextFace(node.name.upper(), fsize=12, fgcolor="blue")
|
||||
node.add_face(name_face, column=0, position="branch-right")
|
||||
else:
|
||||
# Add support to internal nodes
|
||||
if node.support:
|
||||
support_face = TextFace(f"BS: {node.support:.0f}", fsize=8)
|
||||
node.add_face(support_face, column=0, position="branch-top")
|
||||
|
||||
# Apply layout function
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = my_layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("tree_custom_layout.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Dynamic Styling in Layout
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
def layout(node):
|
||||
# Modify node style dynamically
|
||||
nstyle = NodeStyle()
|
||||
|
||||
# Color by clade
|
||||
if "clade_A" in [l.name for l in node.get_leaves()]:
|
||||
nstyle["bgcolor"] = "lightblue"
|
||||
elif "clade_B" in [l.name for l in node.get_leaves()]:
|
||||
nstyle["bgcolor"] = "lightgreen"
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
# Add faces based on features
|
||||
if hasattr(node, "annotation"):
|
||||
text = TextFace(node.annotation, fsize=8)
|
||||
node.add_face(text, column=0, position="branch-top")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
|
||||
tree.render("tree_dynamic.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Multiple Column Layout
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace, CircleFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add features
|
||||
for leaf in tree:
|
||||
leaf.add_feature("habitat", "aquatic")
|
||||
leaf.add_feature("temp", 20)
|
||||
leaf.add_feature("depth", 100)
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Column 0: Name
|
||||
name = TextFace(node.name, fsize=10)
|
||||
node.add_face(name, column=0, position="aligned")
|
||||
|
||||
# Column 1: Habitat indicator
|
||||
color = "blue" if node.habitat == "aquatic" else "brown"
|
||||
circle = CircleFace(radius=5, color=color)
|
||||
node.add_face(circle, column=1, position="aligned")
|
||||
|
||||
# Column 2: Temperature
|
||||
temp = TextFace(f"{node.temp}°C", fsize=8)
|
||||
node.add_face(temp, column=2, position="aligned")
|
||||
|
||||
# Column 3: Depth
|
||||
depth = TextFace(f"{node.depth}m", fsize=8)
|
||||
node.add_face(depth, column=3, position="aligned")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
tree.render("tree_columns.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Advanced Visualization
|
||||
|
||||
### Highlighting Clades
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Define clades to highlight
|
||||
clade_members = {
|
||||
"Clade_A": ["species1", "species2", "species3"],
|
||||
"Clade_B": ["species4", "species5"]
|
||||
}
|
||||
|
||||
def layout(node):
|
||||
# Check if node is ancestor of specific clade
|
||||
node_leaves = set([l.name for l in node.get_leaves()])
|
||||
|
||||
for clade_name, members in clade_members.items():
|
||||
if set(members).issubset(node_leaves):
|
||||
# This node is ancestor of the clade
|
||||
nstyle = NodeStyle()
|
||||
nstyle["bgcolor"] = "yellow"
|
||||
nstyle["size"] = 0
|
||||
|
||||
# Add label
|
||||
if set(members) == node_leaves: # Exact match
|
||||
label = TextFace(clade_name, fsize=14, bold=True, fgcolor="red")
|
||||
node.add_face(label, column=0, position="branch-top")
|
||||
|
||||
node.set_style(nstyle)
|
||||
break
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
|
||||
tree.render("tree_highlighted_clades.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Collapsing Clades
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Define which clades to collapse
|
||||
clades_to_collapse = ["clade1_species1", "clade1_species2"]
|
||||
|
||||
def layout(node):
|
||||
if not node.is_leaf():
|
||||
node_leaves = [l.name for l in node.get_leaves()]
|
||||
|
||||
# Check if this is a clade we want to collapse
|
||||
if all(l in clades_to_collapse for l in node_leaves):
|
||||
# Collapse by hiding descendants
|
||||
nstyle = NodeStyle()
|
||||
nstyle["draw_descendants"] = False
|
||||
nstyle["size"] = 20
|
||||
nstyle["fgcolor"] = "steelblue"
|
||||
nstyle["shape"] = "sphere"
|
||||
node.set_style(nstyle)
|
||||
|
||||
# Add label showing what's collapsed
|
||||
label = TextFace(f"[{len(node_leaves)} species]", fsize=10)
|
||||
node.add_face(label, column=0, position="branch-right")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
|
||||
tree.render("tree_collapsed.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Heat Map Visualization
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, RectFace, TextFace
|
||||
import numpy as np
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Generate random data for heatmap
|
||||
for leaf in tree:
|
||||
leaf.add_feature("data", np.random.rand(10)) # 10 data points
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add name
|
||||
name = TextFace(node.name, fsize=8)
|
||||
node.add_face(name, column=0, position="aligned")
|
||||
|
||||
# Add heatmap cells
|
||||
for i, value in enumerate(node.data):
|
||||
# Color based on value
|
||||
intensity = int(255 * value)
|
||||
color = f"#{255-intensity:02x}{intensity:02x}00" # Green-red gradient
|
||||
|
||||
rect = RectFace(width=20, height=15, fgcolor=color, bgcolor=color)
|
||||
node.add_face(rect, column=i+1, position="aligned")
|
||||
|
||||
# Add column headers
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False
|
||||
|
||||
# Add header
|
||||
for i in range(10):
|
||||
header = TextFace(f"C{i+1}", fsize=8, fgcolor="gray")
|
||||
ts.aligned_header.add_face(header, column=i+1)
|
||||
|
||||
tree.render("tree_heatmap.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Phylogenetic Events Visualization
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree, TreeStyle, TextFace, NodeStyle
|
||||
|
||||
tree = PhyloTree("gene_tree.nw")
|
||||
tree.set_species_naming_function(lambda x: x.split("_")[0])
|
||||
tree.get_descendant_evol_events()
|
||||
|
||||
def layout(node):
|
||||
# Style based on evolutionary event
|
||||
if hasattr(node, "evoltype"):
|
||||
nstyle = NodeStyle()
|
||||
|
||||
if node.evoltype == "D": # Duplication
|
||||
nstyle["fgcolor"] = "red"
|
||||
nstyle["size"] = 10
|
||||
nstyle["shape"] = "square"
|
||||
|
||||
label = TextFace("DUP", fsize=8, fgcolor="red", bold=True)
|
||||
node.add_face(label, column=0, position="branch-top")
|
||||
|
||||
elif node.evoltype == "S": # Speciation
|
||||
nstyle["fgcolor"] = "blue"
|
||||
nstyle["size"] = 6
|
||||
nstyle["shape"] = "circle"
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = True
|
||||
|
||||
tree.render("gene_tree_events.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Custom Tree with Legend
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace, CircleFace, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Categorize species
|
||||
for leaf in tree:
|
||||
if "fish" in leaf.name.lower():
|
||||
leaf.add_feature("category", "fish")
|
||||
elif "bird" in leaf.name.lower():
|
||||
leaf.add_feature("category", "bird")
|
||||
else:
|
||||
leaf.add_feature("category", "mammal")
|
||||
|
||||
category_colors = {
|
||||
"fish": "blue",
|
||||
"bird": "green",
|
||||
"mammal": "red"
|
||||
}
|
||||
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Color by category
|
||||
nstyle = NodeStyle()
|
||||
nstyle["fgcolor"] = category_colors[node.category]
|
||||
nstyle["size"] = 10
|
||||
node.set_style(nstyle)
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
|
||||
# Add legend
|
||||
ts.legend.add_face(TextFace("Legend:", fsize=12, bold=True), column=0)
|
||||
for category, color in category_colors.items():
|
||||
circle = CircleFace(radius=5, color=color)
|
||||
ts.legend.add_face(circle, column=0)
|
||||
label = TextFace(f" {category.capitalize()}", fsize=10)
|
||||
ts.legend.add_face(label, column=1)
|
||||
|
||||
ts.legend_position = 1
|
||||
|
||||
tree.render("tree_with_legend.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use layout functions** for complex visualizations - they're called during rendering
|
||||
2. **Set `show_leaf_name = False`** when using custom name faces
|
||||
3. **Use aligned position** for columnar data at leaf level
|
||||
4. **Choose appropriate units**: pixels for screen, mm/inches for print
|
||||
5. **Use vector formats (PDF/SVG)** for publications
|
||||
6. **Precompute styling** when possible - layout functions should be fast
|
||||
7. **Test interactively** with `show()` before rendering to file
|
||||
8. **Use NodeStyle for permanent** changes, layout functions for rendering-time changes
|
||||
9. **Align faces in columns** for clean, organized appearance
|
||||
10. **Add legends** to explain colors and symbols used
|
||||
774
scientific-packages/etetoolkit/references/workflows.md
Normal file
@@ -0,0 +1,774 @@
|
||||
# ETE Toolkit Common Workflows
|
||||
|
||||
This document provides complete workflows for common tasks using the ETE Toolkit.
|
||||
|
||||
## Table of Contents
|
||||
1. [Basic Tree Operations](#basic-tree-operations)
|
||||
2. [Phylogenetic Analysis](#phylogenetic-analysis)
|
||||
3. [Tree Comparison](#tree-comparison)
|
||||
4. [Taxonomy Integration](#taxonomy-integration)
|
||||
5. [Clustering Analysis](#clustering-analysis)
|
||||
6. [Tree Visualization](#tree-visualization)
|
||||
|
||||
---
|
||||
|
||||
## Basic Tree Operations
|
||||
|
||||
### Loading and Exploring a Tree
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
# Load tree from file
|
||||
tree = Tree("my_tree.nw", format=1)
|
||||
|
||||
# Display ASCII representation
|
||||
print(tree.get_ascii(show_internal=True))
|
||||
|
||||
# Get basic statistics
|
||||
print(f"Number of leaves: {len(tree)}")
|
||||
print(f"Total nodes: {len(list(tree.traverse()))}")
|
||||
print(f"Tree depth: {tree.get_farthest_leaf()[1]}")
|
||||
|
||||
# List all leaf names
|
||||
for leaf in tree:
|
||||
print(leaf.name)
|
||||
```
|
||||
|
||||
### Extracting and Saving Subtrees
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("full_tree.nw")
|
||||
|
||||
# Get subtree rooted at specific node
|
||||
node = tree.search_nodes(name="MyNode")[0]
|
||||
subtree = node.copy()
|
||||
|
||||
# Save subtree to file
|
||||
subtree.write(outfile="subtree.nw", format=1)
|
||||
|
||||
# Extract monophyletic clade
|
||||
species_of_interest = ["species1", "species2", "species3"]
|
||||
ancestor = tree.get_common_ancestor(species_of_interest)
|
||||
clade = ancestor.copy()
|
||||
clade.write(outfile="clade.nw")
|
||||
```
|
||||
|
||||
### Pruning Trees to Specific Taxa
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("large_tree.nw")
|
||||
|
||||
# Keep only taxa of interest
|
||||
taxa_to_keep = ["taxon1", "taxon2", "taxon3", "taxon4"]
|
||||
tree.prune(taxa_to_keep, preserve_branch_length=True)
|
||||
|
||||
# Save pruned tree
|
||||
tree.write(outfile="pruned_tree.nw")
|
||||
```
|
||||
|
||||
### Rerooting Trees
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("unrooted_tree.nw")
|
||||
|
||||
# Method 1: Root by outgroup
|
||||
outgroup = tree & "Outgroup_species"
|
||||
tree.set_outgroup(outgroup)
|
||||
|
||||
# Method 2: Midpoint rooting
|
||||
midpoint = tree.get_midpoint_outgroup()
|
||||
tree.set_outgroup(midpoint)
|
||||
|
||||
# Save rooted tree
|
||||
tree.write(outfile="rooted_tree.nw")
|
||||
```
|
||||
|
||||
### Annotating Nodes with Custom Data
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add features to nodes based on metadata
|
||||
metadata = {
|
||||
"species1": {"habitat": "marine", "temperature": 20},
|
||||
"species2": {"habitat": "freshwater", "temperature": 15},
|
||||
}
|
||||
|
||||
for leaf in tree:
|
||||
if leaf.name in metadata:
|
||||
leaf.add_features(**metadata[leaf.name])
|
||||
|
||||
# Query annotated features
|
||||
for leaf in tree:
|
||||
if hasattr(leaf, "habitat"):
|
||||
print(f"{leaf.name}: {leaf.habitat}, {leaf.temperature}°C")
|
||||
|
||||
# Save with custom features (NHX format)
|
||||
tree.write(outfile="annotated_tree.nhx", features=["habitat", "temperature"])
|
||||
```
|
||||
|
||||
### Modifying Tree Topology
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Remove a clade
|
||||
node_to_remove = tree & "unwanted_clade"
|
||||
node_to_remove.detach()
|
||||
|
||||
# Collapse a node (delete but keep children)
|
||||
node_to_collapse = tree & "low_support_node"
|
||||
node_to_collapse.delete()
|
||||
|
||||
# Add a new species to existing clade
|
||||
target_clade = tree & "target_node"
|
||||
new_leaf = target_clade.add_child(name="new_species", dist=0.5)
|
||||
|
||||
# Resolve polytomies
|
||||
tree.resolve_polytomy(recursive=True)
|
||||
|
||||
# Save modified tree
|
||||
tree.write(outfile="modified_tree.nw")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phylogenetic Analysis
|
||||
|
||||
### Complete Gene Tree Analysis with Alignment
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree
|
||||
|
||||
# Load gene tree and link alignment
|
||||
tree = PhyloTree("gene_tree.nw", format=1)
|
||||
tree.link_to_alignment("alignment.fasta", alg_format="fasta")
|
||||
|
||||
# Set species naming function (e.g., gene_species format)
|
||||
def extract_species(node_name):
|
||||
return node_name.split("_")[0]
|
||||
|
||||
tree.set_species_naming_function(extract_species)
|
||||
|
||||
# Access sequences
|
||||
for leaf in tree:
|
||||
print(f"{leaf.name} ({leaf.species})")
|
||||
print(f"Sequence: {leaf.sequence[:50]}...")
|
||||
```
|
||||
|
||||
### Detecting Duplication and Speciation Events
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree, Tree
|
||||
|
||||
# Load gene tree
|
||||
gene_tree = PhyloTree("gene_tree.nw")
|
||||
|
||||
# Set species naming
|
||||
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
|
||||
|
||||
# Option 1: Species Overlap algorithm (no species tree needed)
|
||||
events = gene_tree.get_descendant_evol_events()
|
||||
|
||||
# Option 2: Tree reconciliation (requires a species tree)
species_tree = Tree("species_tree.nw")
recon_tree, events = gene_tree.reconcile(species_tree)
|
||||
|
||||
# Analyze events
|
||||
duplications = 0
|
||||
speciations = 0
|
||||
|
||||
for node in gene_tree.traverse():
|
||||
if hasattr(node, "evoltype"):
|
||||
if node.evoltype == "D":
|
||||
duplications += 1
|
||||
print(f"Duplication at node {node.name}")
|
||||
elif node.evoltype == "S":
|
||||
speciations += 1
|
||||
|
||||
print(f"\nTotal duplications: {duplications}")
|
||||
print(f"Total speciations: {speciations}")
|
||||
```
|
||||
|
||||
### Extracting Orthologs and Paralogs
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree
|
||||
|
||||
gene_tree = PhyloTree("gene_tree.nw")
|
||||
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
|
||||
|
||||
# Detect evolutionary events
|
||||
events = gene_tree.get_descendant_evol_events()
|
||||
|
||||
# Find all orthologs to a query gene
# Note: event.in_seqs / event.out_seqs contain leaf (sequence) names, so compare by name
query_name = "species1_gene1"

orthologs = []
paralogs = []

for event in events:
    if query_name in event.in_seqs:
        if event.etype == "S":  # Speciation
            orthologs.extend(s for s in event.out_seqs if s != query_name)
        elif event.etype == "D":  # Duplication
            paralogs.extend(s for s in event.out_seqs if s != query_name)

print(f"Orthologs of {query_name}:")
for ortholog in sorted(set(orthologs)):
    print(f"  {ortholog}")

print(f"\nParalogs of {query_name}:")
for paralog in sorted(set(paralogs)):
    print(f"  {paralog}")
|
||||
```
|
||||
|
||||
### Splitting Gene Families by Duplication Events
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree
|
||||
|
||||
gene_tree = PhyloTree("gene_family.nw")
|
||||
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
|
||||
gene_tree.get_descendant_evol_events()
|
||||
|
||||
# Split into individual gene families
|
||||
subfamilies = gene_tree.split_by_dups()
|
||||
|
||||
print(f"Gene family split into {len(subfamilies)} subfamilies")
|
||||
|
||||
for i, subtree in enumerate(subfamilies):
|
||||
subtree.write(outfile=f"subfamily_{i}.nw")
|
||||
species = set([leaf.species for leaf in subtree])
|
||||
print(f"Subfamily {i}: {len(subtree)} genes from {len(species)} species")
|
||||
```
|
||||
|
||||
### Collapsing Lineage-Specific Expansions
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree
|
||||
|
||||
gene_tree = PhyloTree("expanded_tree.nw")
|
||||
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
|
||||
|
||||
# Collapse lineage-specific duplications
|
||||
gene_tree.collapse_lineage_specific_expansions()
|
||||
|
||||
print("After collapsing expansions:")
|
||||
print(gene_tree.get_ascii())
|
||||
|
||||
gene_tree.write(outfile="collapsed_tree.nw")
|
||||
```
|
||||
|
||||
### Testing Monophyly
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Test if a group is monophyletic
|
||||
target_species = ["species1", "species2", "species3"]
|
||||
# check_monophyly() returns (is_monophyletic, clade_type, leaves_breaking_monophyly)
is_mono, clade_type, breaking_leaves = tree.check_monophyly(
    values=target_species,
    target_attr="name"
)

if is_mono:
    print("Group is monophyletic")
    mrca = tree.get_common_ancestor(target_species)
    print(f"MRCA: {mrca.name}")
|
||||
elif clade_type == "paraphyletic":
|
||||
print(f"Group is paraphyletic")
|
||||
elif clade_type == "polyphyletic":
|
||||
print(f"Group is polyphyletic")
|
||||
|
||||
# Get all monophyletic clades of a specific type
|
||||
# Annotate leaves first
|
||||
for leaf in tree:
|
||||
if leaf.name.startswith("species"):
|
||||
leaf.add_feature("type", "typeA")
|
||||
else:
|
||||
leaf.add_feature("type", "typeB")
|
||||
|
||||
# get_monophyletic() returns an iterator, so materialize it before calling len()
mono_clades = list(tree.get_monophyletic(values=["typeA"], target_attr="type"))
print(f"Found {len(mono_clades)} monophyletic clades of typeA")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tree Comparison
|
||||
|
||||
### Computing Robinson-Foulds Distance
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree1 = Tree("tree1.nw")
|
||||
tree2 = Tree("tree2.nw")
|
||||
|
||||
# Compute RF distance
|
||||
rf, max_rf, common_leaves, parts_t1, parts_t2 = tree1.robinson_foulds(tree2)
|
||||
|
||||
print(f"Robinson-Foulds distance: {rf}")
|
||||
print(f"Maximum RF distance: {max_rf}")
|
||||
print(f"Normalized RF: {rf/max_rf:.3f}")
|
||||
print(f"Common leaves: {len(common_leaves)}")
|
||||
|
||||
# Find unique partitions
|
||||
unique_in_t1 = parts_t1 - parts_t2
|
||||
unique_in_t2 = parts_t2 - parts_t1
|
||||
|
||||
print(f"\nPartitions unique to tree1: {len(unique_in_t1)}")
|
||||
print(f"Partitions unique to tree2: {len(unique_in_t2)}")
|
||||
```
|
||||
|
||||
### Comparing Multiple Trees
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
import numpy as np
|
||||
|
||||
# Load multiple trees
|
||||
tree_files = ["tree1.nw", "tree2.nw", "tree3.nw", "tree4.nw"]
|
||||
trees = [Tree(f) for f in tree_files]
|
||||
|
||||
# Create distance matrix
|
||||
n = len(trees)
|
||||
dist_matrix = np.zeros((n, n))
|
||||
|
||||
for i in range(n):
|
||||
for j in range(i+1, n):
|
||||
rf, max_rf, _, _, _ = trees[i].robinson_foulds(trees[j])
|
||||
norm_rf = rf / max_rf if max_rf > 0 else 0
|
||||
dist_matrix[i, j] = norm_rf
|
||||
dist_matrix[j, i] = norm_rf
|
||||
|
||||
print("Normalized RF distance matrix:")
|
||||
print(dist_matrix)
|
||||
|
||||
# Find most similar pair
|
||||
min_dist = float('inf')
|
||||
best_pair = None
|
||||
|
||||
for i in range(n):
|
||||
for j in range(i+1, n):
|
||||
if dist_matrix[i, j] < min_dist:
|
||||
min_dist = dist_matrix[i, j]
|
||||
best_pair = (i, j)
|
||||
|
||||
print(f"\nMost similar trees: {tree_files[best_pair[0]]} and {tree_files[best_pair[1]]}")
|
||||
print(f"Distance: {min_dist:.3f}")
|
||||
```
|
||||
|
||||
### Finding Consensus Topology
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
# Load multiple bootstrap trees
|
||||
bootstrap_trees = [Tree(f"bootstrap_{i}.nw") for i in range(100)]
|
||||
|
||||
# Get reference tree (first tree)
|
||||
ref_tree = bootstrap_trees[0].copy()
|
||||
|
||||
# Count bipartitions
|
||||
bipartition_counts = {}
|
||||
|
||||
for tree in bootstrap_trees:
|
||||
rf, max_rf, common, parts_ref, parts_tree = ref_tree.robinson_foulds(tree)
|
||||
for partition in parts_tree:
|
||||
bipartition_counts[partition] = bipartition_counts.get(partition, 0) + 1
|
||||
|
||||
# Filter by support threshold
|
||||
threshold = 70 # 70% support
|
||||
supported_bipartitions = {
|
||||
k: v for k, v in bipartition_counts.items()
|
||||
if (v / len(bootstrap_trees)) * 100 >= threshold
|
||||
}
|
||||
|
||||
print(f"Bipartitions with >{threshold}% support: {len(supported_bipartitions)}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Taxonomy Integration
|
||||
|
||||
### Building Species Trees from NCBI Taxonomy
|
||||
|
||||
```python
|
||||
from ete3 import NCBITaxa
|
||||
|
||||
ncbi = NCBITaxa()
|
||||
|
||||
# Define species of interest
|
||||
species = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla",
|
||||
"Mus musculus", "Rattus norvegicus"]
|
||||
|
||||
# Get taxids
|
||||
name2taxid = ncbi.get_name_translator(species)
|
||||
taxids = [name2taxid[sp][0] for sp in species]
|
||||
|
||||
# Build tree
|
||||
tree = ncbi.get_topology(taxids)
|
||||
|
||||
# Annotate with taxonomy info
|
||||
for node in tree.traverse():
|
||||
if hasattr(node, "sci_name"):
|
||||
print(f"{node.sci_name} - Rank: {node.rank} - TaxID: {node.taxid}")
|
||||
|
||||
# Save tree
|
||||
tree.write(outfile="species_tree.nw")
|
||||
```
|
||||
|
||||
### Annotating Existing Tree with NCBI Taxonomy
|
||||
|
||||
```python
|
||||
from ete3 import Tree, NCBITaxa
|
||||
|
||||
tree = Tree("species_tree.nw")
|
||||
ncbi = NCBITaxa()
|
||||
|
||||
# Map leaf names to species names (adjust as needed)
|
||||
leaf_to_species = {
|
||||
"Hsap_gene1": "Homo sapiens",
|
||||
"Ptro_gene1": "Pan troglodytes",
|
||||
"Mmur_gene1": "Microcebus murinus",
|
||||
}
|
||||
|
||||
# Get taxids
|
||||
all_species = list(set(leaf_to_species.values()))
|
||||
name2taxid = ncbi.get_name_translator(all_species)
|
||||
|
||||
# Annotate leaves
|
||||
for leaf in tree:
|
||||
if leaf.name in leaf_to_species:
|
||||
species_name = leaf_to_species[leaf.name]
|
||||
taxid = name2taxid[species_name][0]
|
||||
|
||||
# Add taxonomy info
|
||||
leaf.add_feature("species", species_name)
|
||||
leaf.add_feature("taxid", taxid)
|
||||
|
||||
# Get full lineage
|
||||
lineage = ncbi.get_lineage(taxid)
|
||||
names = ncbi.get_taxid_translator(lineage)
|
||||
leaf.add_feature("lineage", [names[t] for t in lineage])
|
||||
|
||||
print(f"{leaf.name}: {species_name} (taxid: {taxid})")
|
||||
```
|
||||
|
||||
### Querying NCBI Taxonomy
|
||||
|
||||
```python
|
||||
from ete3 import NCBITaxa
|
||||
|
||||
ncbi = NCBITaxa()
|
||||
|
||||
# Get all primates
|
||||
primates_taxid = ncbi.get_name_translator(["Primates"])["Primates"][0]
|
||||
all_primates = ncbi.get_descendant_taxa(primates_taxid, collapse_subspecies=True)
|
||||
|
||||
print(f"Total primate species: {len(all_primates)}")
|
||||
|
||||
# Get names for subset
|
||||
taxid2name = ncbi.get_taxid_translator(all_primates[:10])
|
||||
for taxid, name in taxid2name.items():
|
||||
rank = ncbi.get_rank([taxid])[taxid]
|
||||
print(f"{name} ({rank})")
|
||||
|
||||
# Get lineage for specific species
|
||||
human_taxid = 9606
|
||||
lineage = ncbi.get_lineage(human_taxid)
|
||||
ranks = ncbi.get_rank(lineage)
|
||||
names = ncbi.get_taxid_translator(lineage)
|
||||
|
||||
print("\nHuman lineage:")
|
||||
for taxid in lineage:
|
||||
print(f"{ranks[taxid]:15s} {names[taxid]}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Clustering Analysis
|
||||
|
||||
### Analyzing Hierarchical Clustering Results
|
||||
|
||||
```python
|
||||
from ete3 import ClusterTree
|
||||
|
||||
# Load clustering tree with data matrix
|
||||
matrix = """#Names\tSample1\tSample2\tSample3\tSample4
|
||||
Gene1\t1.5\t2.3\t0.8\t1.2
|
||||
Gene2\t0.9\t1.1\t1.8\t2.1
|
||||
Gene3\t2.1\t2.5\t0.5\t0.9
|
||||
Gene4\t0.7\t0.9\t2.2\t2.4"""
|
||||
|
||||
tree = ClusterTree("((Gene1,Gene2),(Gene3,Gene4));", text_array=matrix)
|
||||
|
||||
# Calculate cluster quality metrics
|
||||
for node in tree.traverse():
|
||||
if not node.is_leaf():
|
||||
# Silhouette coefficient
|
||||
silhouette = node.get_silhouette()
|
||||
|
||||
# Dunn index
|
||||
dunn = node.get_dunn()
|
||||
|
||||
# Distances
|
||||
inter = node.intercluster_dist
|
||||
intra = node.intracluster_dist
|
||||
|
||||
print(f"Node: {node.name}")
|
||||
print(f" Silhouette: {silhouette:.3f}")
|
||||
print(f" Dunn index: {dunn:.3f}")
|
||||
print(f" Intercluster distance: {inter:.3f}")
|
||||
print(f" Intracluster distance: {intra:.3f}")
|
||||
```
|
||||
|
||||
### Validating Clusters
|
||||
|
||||
```python
|
||||
from ete3 import ClusterTree
|
||||
|
||||
matrix = """#Names\tCol1\tCol2\tCol3
|
||||
ItemA\t1.2\t0.5\t0.8
|
||||
ItemB\t1.3\t0.6\t0.9
|
||||
ItemC\t0.1\t2.5\t2.3
|
||||
ItemD\t0.2\t2.6\t2.4"""
|
||||
|
||||
tree = ClusterTree("((ItemA,ItemB),(ItemC,ItemD));", text_array=matrix)
|
||||
|
||||
# Test different distance metrics
|
||||
metrics = ["euclidean", "pearson", "spearman"]
|
||||
|
||||
for metric in metrics:
|
||||
print(f"\nUsing {metric} distance:")
|
||||
|
||||
for node in tree.traverse():
|
||||
if not node.is_leaf():
|
||||
silhouette = node.get_silhouette(distance=metric)
|
||||
|
||||
# Positive silhouette = good clustering
|
||||
# Negative silhouette = poor clustering
|
||||
quality = "good" if silhouette > 0 else "poor"
|
||||
|
||||
print(f" Cluster {node.name}: {silhouette:.3f} ({quality})")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tree Visualization
|
||||
|
||||
### Basic Tree Rendering
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Create tree style
|
||||
ts = TreeStyle()
|
||||
ts.show_leaf_name = True
|
||||
ts.show_branch_length = True
|
||||
ts.show_branch_support = True
|
||||
ts.scale = 50 # pixels per branch length unit
|
||||
|
||||
# Render to file
|
||||
tree.render("tree_output.pdf", tree_style=ts)
|
||||
tree.render("tree_output.png", tree_style=ts, w=800, h=600, units="px")
|
||||
tree.render("tree_output.svg", tree_style=ts)
|
||||
```
|
||||
|
||||
### Customizing Node Appearance
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, NodeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Define node styles
|
||||
for node in tree.traverse():
|
||||
nstyle = NodeStyle()
|
||||
|
||||
if node.is_leaf():
|
||||
nstyle["fgcolor"] = "blue"
|
||||
nstyle["size"] = 10
|
||||
else:
|
||||
nstyle["fgcolor"] = "red"
|
||||
nstyle["size"] = 5
|
||||
|
||||
if node.support > 0.9:
|
||||
nstyle["shape"] = "sphere"
|
||||
else:
|
||||
nstyle["shape"] = "circle"
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
# Render
|
||||
ts = TreeStyle()
|
||||
tree.render("styled_tree.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Adding Faces to Nodes
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle, TextFace, CircleFace, AttrFace
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Add features to nodes
|
||||
for leaf in tree:
|
||||
leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "terrestrial")
|
||||
leaf.add_feature("temp", 20)
|
||||
|
||||
# Layout function to add faces
|
||||
def layout(node):
|
||||
if node.is_leaf():
|
||||
# Add text face
|
||||
name_face = TextFace(node.name, fsize=10)
|
||||
node.add_face(name_face, column=0, position="branch-right")
|
||||
|
||||
# Add colored circle based on habitat
|
||||
color = "blue" if node.habitat == "marine" else "green"
|
||||
circle_face = CircleFace(radius=5, color=color)
|
||||
node.add_face(circle_face, column=1, position="branch-right")
|
||||
|
||||
# Add attribute face
|
||||
temp_face = AttrFace("temp", fsize=8)
|
||||
node.add_face(temp_face, column=2, position="branch-right")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = False # We're adding custom names
|
||||
|
||||
tree.render("tree_with_faces.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Circular Tree Layout
|
||||
|
||||
```python
|
||||
from ete3 import Tree, TreeStyle
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.mode = "c" # Circular mode
|
||||
ts.arc_start = 0 # Degrees
|
||||
ts.arc_span = 360 # Full circle
|
||||
ts.show_leaf_name = True
|
||||
|
||||
tree.render("circular_tree.pdf", tree_style=ts)
|
||||
```
|
||||
|
||||
### Interactive Exploration
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
|
||||
tree = Tree("tree.nw")
|
||||
|
||||
# Launch GUI (allows zooming, searching, modifying)
|
||||
# Changes persist after closing
|
||||
tree.show()
|
||||
|
||||
# Can save changes made in GUI
|
||||
tree.write(outfile="modified_tree.nw")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Advanced Workflows
|
||||
|
||||
### Complete Phylogenomic Pipeline
|
||||
|
||||
```python
|
||||
from ete3 import PhyloTree, NCBITaxa, TreeStyle
|
||||
|
||||
# 1. Load gene tree
|
||||
gene_tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")
|
||||
|
||||
# 2. Set species naming
|
||||
gene_tree.set_species_naming_function(lambda x: x.split("_")[0])
|
||||
|
||||
# 3. Detect evolutionary events
|
||||
gene_tree.get_descendant_evol_events()
|
||||
|
||||
# 4. Annotate with NCBI taxonomy
|
||||
ncbi = NCBITaxa()
|
||||
species_set = set([leaf.species for leaf in gene_tree])
|
||||
name2taxid = ncbi.get_name_translator(list(species_set))
|
||||
|
||||
for leaf in gene_tree:
|
||||
if leaf.species in name2taxid:
|
||||
taxid = name2taxid[leaf.species][0]
|
||||
lineage = ncbi.get_lineage(taxid)
|
||||
names = ncbi.get_taxid_translator(lineage)
|
||||
leaf.add_feature("lineage", [names[t] for t in lineage])
|
||||
|
||||
# 5. Identify and save ortholog groups
# get_speciation_trees() returns (number of trees, number of duplications, iterator of trees)
ntrees, ndups, sp_trees = gene_tree.get_speciation_trees()

for i, ortho_tree in enumerate(sp_trees):
    ortho_tree.write(outfile=f"ortholog_group_{i}.nw")
|
||||
|
||||
# 6. Visualize with evolutionary events marked
|
||||
def layout(node):
|
||||
from ete3 import TextFace
|
||||
if hasattr(node, "evoltype"):
|
||||
if node.evoltype == "D":
|
||||
dup_face = TextFace("DUPLICATION", fsize=8, fgcolor="red")
|
||||
node.add_face(dup_face, column=0, position="branch-top")
|
||||
|
||||
ts = TreeStyle()
|
||||
ts.layout_fn = layout
|
||||
ts.show_leaf_name = True
|
||||
gene_tree.render("annotated_gene_tree.pdf", tree_style=ts)
|
||||
|
||||
print(f"Pipeline complete. Found {ntrees} ortholog groups.")
|
||||
```
|
||||
|
||||
### Batch Processing Multiple Trees
|
||||
|
||||
```python
|
||||
from ete3 import Tree
|
||||
import os
|
||||
|
||||
input_dir = "input_trees"
|
||||
output_dir = "processed_trees"
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
for filename in os.listdir(input_dir):
|
||||
if filename.endswith(".nw"):
|
||||
# Load tree
|
||||
tree = Tree(os.path.join(input_dir, filename))
|
||||
|
||||
# Process: root, prune, annotate
|
||||
midpoint = tree.get_midpoint_outgroup()
|
||||
tree.set_outgroup(midpoint)
|
||||
|
||||
# Filter by branch length
|
||||
to_remove = []
|
||||
for node in tree.traverse():
|
||||
if node.dist < 0.001 and not node.is_root():
|
||||
to_remove.append(node)
|
||||
|
||||
for node in to_remove:
|
||||
node.delete()
|
||||
|
||||
# Save processed tree
|
||||
output_file = os.path.join(output_dir, f"processed_{filename}")
|
||||
tree.write(outfile=output_file)
|
||||
|
||||
print(f"Processed {filename}")
|
||||
```
|
||||
214
scientific-packages/etetoolkit/scripts/quick_visualize.py
Executable file
@@ -0,0 +1,214 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Quick tree visualization script with common customization options.
|
||||
|
||||
Provides command-line interface for rapid tree visualization with
|
||||
customizable styles, layouts, and output formats.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
from ete3 import Tree, TreeStyle, NodeStyle
|
||||
except ImportError:
|
||||
print("Error: ete3 not installed. Install with: pip install ete3")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def create_tree_style(args):
|
||||
"""Create TreeStyle based on arguments."""
|
||||
ts = TreeStyle()
|
||||
|
||||
# Basic display options
|
||||
ts.show_leaf_name = args.show_names
|
||||
ts.show_branch_length = args.show_lengths
|
||||
ts.show_branch_support = args.show_support
|
||||
ts.show_scale = args.show_scale
|
||||
|
||||
# Layout
|
||||
ts.mode = args.mode
|
||||
ts.rotation = args.rotation
|
||||
|
||||
# Circular tree options
|
||||
if args.mode == "c":
|
||||
ts.arc_start = args.arc_start
|
||||
ts.arc_span = args.arc_span
|
||||
|
||||
# Spacing
|
||||
ts.branch_vertical_margin = args.vertical_margin
|
||||
if args.scale_factor:
|
||||
ts.scale = args.scale_factor
|
||||
|
||||
# Title
|
||||
if args.title:
|
||||
from ete3 import TextFace
|
||||
title_face = TextFace(args.title, fsize=16, bold=True)
|
||||
ts.title.add_face(title_face, column=0)
|
||||
|
||||
return ts
|
||||
|
||||
|
||||
def apply_node_styling(tree, args):
|
||||
"""Apply styling to tree nodes."""
|
||||
for node in tree.traverse():
|
||||
nstyle = NodeStyle()
|
||||
|
||||
if node.is_leaf():
|
||||
# Leaf style
|
||||
nstyle["fgcolor"] = args.leaf_color
|
||||
nstyle["size"] = args.leaf_size
|
||||
else:
|
||||
# Internal node style
|
||||
nstyle["fgcolor"] = args.internal_color
|
||||
nstyle["size"] = args.internal_size
|
||||
|
||||
# Color by support if enabled
|
||||
if args.color_by_support and hasattr(node, 'support') and node.support:
|
||||
if node.support >= 0.9:
|
||||
nstyle["fgcolor"] = "darkgreen"
|
||||
elif node.support >= 0.7:
|
||||
nstyle["fgcolor"] = "orange"
|
||||
else:
|
||||
nstyle["fgcolor"] = "red"
|
||||
|
||||
node.set_style(nstyle)
|
||||
|
||||
|
||||
def visualize_tree(tree_file, output, args):
|
||||
"""Load tree, apply styles, and render."""
|
||||
try:
|
||||
tree = Tree(str(tree_file), format=args.format)
|
||||
except Exception as e:
|
||||
print(f"Error loading tree: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
# Apply styling
|
||||
apply_node_styling(tree, args)
|
||||
|
||||
# Create tree style
|
||||
ts = create_tree_style(args)
|
||||
|
||||
# Render
|
||||
try:
|
||||
# Determine output parameters based on format
|
||||
output_path = str(output)
|
||||
|
||||
render_args = {"tree_style": ts}
|
||||
|
||||
if args.width:
|
||||
render_args["w"] = args.width
|
||||
if args.height:
|
||||
render_args["h"] = args.height
|
||||
if args.units:
|
||||
render_args["units"] = args.units
|
||||
if args.dpi:
|
||||
render_args["dpi"] = args.dpi
|
||||
|
||||
tree.render(output_path, **render_args)
|
||||
print(f"Tree rendered successfully to: {output}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error rendering tree: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Quick tree visualization with ETE toolkit",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Basic visualization
|
||||
%(prog)s tree.nw output.pdf
|
||||
|
||||
# Circular tree
|
||||
%(prog)s tree.nw output.pdf --mode c
|
||||
|
||||
# Large tree with custom sizing
|
||||
%(prog)s tree.nw output.png --width 1200 --height 800 --units px --dpi 300
|
||||
|
||||
# Hide names, show support, color by support
|
||||
%(prog)s tree.nw output.pdf --no-names --show-support --color-by-support
|
||||
|
||||
# Custom title
|
||||
%(prog)s tree.nw output.pdf --title "Phylogenetic Tree of Species"
|
||||
|
||||
# Semicircular layout
|
||||
%(prog)s tree.nw output.pdf --mode c --arc-start -90 --arc-span 180
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument("input", help="Input tree file (Newick format)")
|
||||
parser.add_argument("output", help="Output image file (png, pdf, or svg)")
|
||||
|
||||
# Tree format
|
||||
parser.add_argument("--format", type=int, default=0,
|
||||
help="Newick format number (default: 0)")
|
||||
|
||||
# Display options
|
||||
display = parser.add_argument_group("Display options")
|
||||
display.add_argument("--no-names", dest="show_names", action="store_false",
|
||||
help="Don't show leaf names")
|
||||
display.add_argument("--show-lengths", action="store_true",
|
||||
help="Show branch lengths")
|
||||
display.add_argument("--show-support", action="store_true",
|
||||
help="Show support values")
|
||||
display.add_argument("--show-scale", action="store_true",
|
||||
help="Show scale bar")
|
||||
|
||||
# Layout options
|
||||
layout = parser.add_argument_group("Layout options")
|
||||
layout.add_argument("--mode", choices=["r", "c"], default="r",
|
||||
help="Tree mode: r=rectangular, c=circular (default: r)")
|
||||
layout.add_argument("--rotation", type=int, default=0,
|
||||
help="Tree rotation in degrees (default: 0)")
|
||||
layout.add_argument("--arc-start", type=int, default=0,
|
||||
help="Circular tree start angle (default: 0)")
|
||||
layout.add_argument("--arc-span", type=int, default=360,
|
||||
help="Circular tree arc span (default: 360)")
|
||||
|
||||
# Styling options
|
||||
styling = parser.add_argument_group("Styling options")
|
||||
styling.add_argument("--leaf-color", default="blue",
|
||||
help="Leaf node color (default: blue)")
|
||||
styling.add_argument("--leaf-size", type=int, default=6,
|
||||
help="Leaf node size (default: 6)")
|
||||
styling.add_argument("--internal-color", default="gray",
|
||||
help="Internal node color (default: gray)")
|
||||
styling.add_argument("--internal-size", type=int, default=4,
|
||||
help="Internal node size (default: 4)")
|
||||
styling.add_argument("--color-by-support", action="store_true",
|
||||
help="Color internal nodes by support value")
|
||||
|
||||
# Size and spacing
|
||||
size = parser.add_argument_group("Size and spacing")
|
||||
size.add_argument("--width", type=int, help="Output width")
|
||||
size.add_argument("--height", type=int, help="Output height")
|
||||
size.add_argument("--units", choices=["px", "mm", "in"],
|
||||
help="Size units (px, mm, in)")
|
||||
size.add_argument("--dpi", type=int, help="DPI for raster output")
|
||||
size.add_argument("--scale-factor", type=int,
|
||||
help="Branch length scale factor (pixels per unit)")
|
||||
size.add_argument("--vertical-margin", type=int, default=10,
|
||||
help="Vertical margin between branches (default: 10)")
|
||||
|
||||
# Other options
|
||||
parser.add_argument("--title", help="Tree title")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate output format
|
||||
output_path = Path(args.output)
|
||||
valid_extensions = {".png", ".pdf", ".svg"}
|
||||
if output_path.suffix.lower() not in valid_extensions:
|
||||
print(f"Error: Output must be PNG, PDF, or SVG file")
|
||||
sys.exit(1)
|
||||
|
||||
# Visualize
|
||||
visualize_tree(args.input, args.output, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
229
scientific-packages/etetoolkit/scripts/tree_operations.py
Executable file
@@ -0,0 +1,229 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Tree operations helper script for common ETE toolkit tasks.
|
||||
|
||||
Provides command-line interface for basic tree operations like:
|
||||
- Format conversion
|
||||
- Rooting (outgroup, midpoint)
|
||||
- Pruning
|
||||
- Basic statistics
|
||||
- ASCII visualization
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
from ete3 import Tree
|
||||
except ImportError:
|
||||
print("Error: ete3 not installed. Install with: pip install ete3")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def load_tree(tree_file, format_num=0):
|
||||
"""Load tree from file."""
|
||||
try:
|
||||
return Tree(str(tree_file), format=format_num)
|
||||
except Exception as e:
|
||||
print(f"Error loading tree: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def convert_format(tree_file, output, in_format=0, out_format=1):
|
||||
"""Convert tree between Newick formats."""
|
||||
tree = load_tree(tree_file, in_format)
|
||||
tree.write(outfile=str(output), format=out_format)
|
||||
print(f"Converted {tree_file} (format {in_format}) → {output} (format {out_format})")
|
||||
|
||||
|
||||
def reroot_tree(tree_file, output, outgroup=None, midpoint=False, format_num=0):
|
||||
"""Reroot tree by outgroup or midpoint."""
|
||||
tree = load_tree(tree_file, format_num)
|
||||
|
||||
if midpoint:
|
||||
midpoint_node = tree.get_midpoint_outgroup()
|
||||
tree.set_outgroup(midpoint_node)
|
||||
print(f"Rerooted tree using midpoint method")
|
||||
elif outgroup:
|
||||
try:
|
||||
outgroup_node = tree & outgroup
|
||||
tree.set_outgroup(outgroup_node)
|
||||
print(f"Rerooted tree using outgroup: {outgroup}")
|
||||
except Exception as e:
|
||||
print(f"Error: Could not find outgroup '{outgroup}': {e}")
|
||||
sys.exit(1)
|
||||
else:
|
||||
print("Error: Must specify either --outgroup or --midpoint")
|
||||
sys.exit(1)
|
||||
|
||||
tree.write(outfile=str(output), format=format_num)
|
||||
print(f"Saved rerooted tree to: {output}")
|
||||
|
||||
|
||||
def prune_tree(tree_file, output, keep_taxa, preserve_length=True, format_num=0):
|
||||
"""Prune tree to keep only specified taxa."""
|
||||
tree = load_tree(tree_file, format_num)
|
||||
|
||||
# Read taxa list
|
||||
taxa_file = Path(keep_taxa)
|
||||
if taxa_file.exists():
|
||||
with open(taxa_file) as f:
|
||||
taxa = [line.strip() for line in f if line.strip()]
|
||||
else:
|
||||
taxa = [t.strip() for t in keep_taxa.split(",")]
|
||||
|
||||
print(f"Pruning tree to {len(taxa)} taxa")
|
||||
|
||||
try:
|
||||
tree.prune(taxa, preserve_branch_length=preserve_length)
|
||||
tree.write(outfile=str(output), format=format_num)
|
||||
print(f"Pruned tree saved to: {output}")
|
||||
print(f"Retained {len(tree)} leaves")
|
||||
except Exception as e:
|
||||
print(f"Error pruning tree: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def tree_stats(tree_file, format_num=0):
|
||||
"""Display tree statistics."""
|
||||
tree = load_tree(tree_file, format_num)
|
||||
|
||||
print(f"\n=== Tree Statistics ===")
|
||||
print(f"File: {tree_file}")
|
||||
print(f"Number of leaves: {len(tree)}")
|
||||
print(f"Total nodes: {len(list(tree.traverse()))}")
|
||||
|
||||
farthest_leaf, distance = tree.get_farthest_leaf()
|
||||
print(f"Tree depth: {distance:.4f}")
|
||||
print(f"Farthest leaf: {farthest_leaf.name}")
|
||||
|
||||
# Branch length statistics
|
||||
branch_lengths = [node.dist for node in tree.traverse() if not node.is_root()]
|
||||
if branch_lengths:
|
||||
print(f"\nBranch length statistics:")
|
||||
print(f" Mean: {sum(branch_lengths)/len(branch_lengths):.4f}")
|
||||
print(f" Min: {min(branch_lengths):.4f}")
|
||||
print(f" Max: {max(branch_lengths):.4f}")
|
||||
|
||||
# Support values
|
||||
supports = [node.support for node in tree.traverse() if not node.is_leaf() and hasattr(node, 'support')]
|
||||
if supports:
|
||||
print(f"\nSupport value statistics:")
|
||||
print(f" Mean: {sum(supports)/len(supports):.2f}")
|
||||
print(f" Min: {min(supports):.2f}")
|
||||
print(f" Max: {max(supports):.2f}")
|
||||
|
||||
print()
|
||||
|
||||
|
||||
def show_ascii(tree_file, format_num=0, show_internal=True):
|
||||
"""Display tree as ASCII art."""
|
||||
tree = load_tree(tree_file, format_num)
|
||||
print(tree.get_ascii(show_internal=show_internal))
|
||||
|
||||
|
||||
def list_leaves(tree_file, format_num=0):
|
||||
"""List all leaf names."""
|
||||
tree = load_tree(tree_file, format_num)
|
||||
for leaf in tree:
|
||||
print(leaf.name)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="ETE toolkit tree operations helper",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Convert format
|
||||
%(prog)s convert input.nw output.nw --in-format 0 --out-format 1
|
||||
|
||||
# Midpoint root
|
||||
%(prog)s reroot input.nw output.nw --midpoint
|
||||
|
||||
# Reroot with outgroup
|
||||
%(prog)s reroot input.nw output.nw --outgroup "Outgroup_species"
|
||||
|
||||
# Prune tree
|
||||
%(prog)s prune input.nw output.nw --keep-taxa "speciesA,speciesB,speciesC"
|
||||
|
||||
# Show statistics
|
||||
%(prog)s stats input.nw
|
||||
|
||||
# Display as ASCII
|
||||
%(prog)s ascii input.nw
|
||||
|
||||
# List all leaves
|
||||
%(prog)s leaves input.nw
|
||||
"""
|
||||
)
|
||||
|
||||
subparsers = parser.add_subparsers(dest="command", help="Command to execute")
|
||||
|
||||
# Convert command
|
||||
convert_parser = subparsers.add_parser("convert", help="Convert tree format")
|
||||
convert_parser.add_argument("input", help="Input tree file")
|
||||
convert_parser.add_argument("output", help="Output tree file")
|
||||
convert_parser.add_argument("--in-format", type=int, default=0, help="Input format (default: 0)")
|
||||
convert_parser.add_argument("--out-format", type=int, default=1, help="Output format (default: 1)")
|
||||
|
||||
# Reroot command
|
||||
reroot_parser = subparsers.add_parser("reroot", help="Reroot tree")
|
||||
reroot_parser.add_argument("input", help="Input tree file")
|
||||
reroot_parser.add_argument("output", help="Output tree file")
|
||||
reroot_parser.add_argument("--outgroup", help="Outgroup taxon name")
|
||||
reroot_parser.add_argument("--midpoint", action="store_true", help="Use midpoint rooting")
|
||||
reroot_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
|
||||
|
||||
# Prune command
|
||||
prune_parser = subparsers.add_parser("prune", help="Prune tree to specified taxa")
|
||||
prune_parser.add_argument("input", help="Input tree file")
|
||||
prune_parser.add_argument("output", help="Output tree file")
|
||||
prune_parser.add_argument("--keep-taxa", required=True,
|
||||
help="Taxa to keep (comma-separated or file path)")
|
||||
prune_parser.add_argument("--no-preserve-length", action="store_true",
|
||||
help="Don't preserve branch lengths")
|
||||
prune_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
|
||||
|
||||
# Stats command
|
||||
stats_parser = subparsers.add_parser("stats", help="Display tree statistics")
|
||||
stats_parser.add_argument("input", help="Input tree file")
|
||||
stats_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
|
||||
|
||||
# ASCII command
|
||||
ascii_parser = subparsers.add_parser("ascii", help="Display tree as ASCII art")
|
||||
ascii_parser.add_argument("input", help="Input tree file")
|
||||
ascii_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
|
||||
ascii_parser.add_argument("--no-internal", action="store_true",
|
||||
help="Don't show internal node names")
|
||||
|
||||
# Leaves command
|
||||
leaves_parser = subparsers.add_parser("leaves", help="List all leaf names")
|
||||
leaves_parser.add_argument("input", help="Input tree file")
|
||||
leaves_parser.add_argument("--format", type=int, default=0, help="Newick format (default: 0)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.command:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
# Execute command
|
||||
if args.command == "convert":
|
||||
convert_format(args.input, args.output, args.in_format, args.out_format)
|
||||
elif args.command == "reroot":
|
||||
reroot_tree(args.input, args.output, args.outgroup, args.midpoint, args.format)
|
||||
elif args.command == "prune":
|
||||
prune_tree(args.input, args.output, args.keep_taxa,
|
||||
not args.no_preserve_length, args.format)
|
||||
elif args.command == "stats":
|
||||
tree_stats(args.input, args.format)
|
||||
elif args.command == "ascii":
|
||||
show_ascii(args.input, args.format, not args.no_internal)
|
||||
elif args.command == "leaves":
|
||||
list_leaves(args.input, args.format)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
602
scientific-packages/flowio/SKILL.md
Normal file
@@ -0,0 +1,602 @@
|
||||
---
|
||||
name: flowio
|
||||
description: Toolkit for working with Flow Cytometry Standard (FCS) files in Python. Use this skill when reading, parsing, creating, or exporting FCS files (versions 2.0, 3.0, 3.1), extracting flow cytometry metadata, accessing event data, handling multi-dataset FCS files, or converting between FCS formats. Essential for flow cytometry data processing, channel analysis, and cytometry file manipulation tasks.
|
||||
---
|
||||
|
||||
# FlowIO: Flow Cytometry Standard File Handler
|
||||
|
||||
## Overview
|
||||
|
||||
FlowIO is a lightweight Python library for reading and writing Flow Cytometry Standard (FCS) files. It excels at parsing FCS metadata, extracting event data, and creating new FCS files with minimal dependencies. The library supports FCS versions 2.0, 3.0, and 3.1, making it ideal for backend services, data pipelines, and basic cytometry file operations.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Apply this skill when working with:
|
||||
|
||||
- FCS files requiring parsing or metadata extraction
|
||||
- Flow cytometry data needing conversion to NumPy arrays
|
||||
- Event data requiring export to FCS format
|
||||
- Multi-dataset FCS files needing separation
|
||||
- Channel information extraction (scatter, fluorescence, time)
|
||||
- Cytometry file validation or inspection
|
||||
- Pre-processing workflows before advanced analysis
|
||||
|
||||
**Related Tools:** For advanced flow cytometry analysis including compensation, gating, and FlowJo/GatingML support, recommend FlowKit library as a companion to FlowIO.
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install flowio
|
||||
```
|
||||
|
||||
Requires Python 3.9 or later.
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Basic File Reading
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
# Read FCS file
|
||||
flow_data = FlowData('experiment.fcs')
|
||||
|
||||
# Access basic information
|
||||
print(f"FCS Version: {flow_data.version}")
|
||||
print(f"Events: {flow_data.event_count}")
|
||||
print(f"Channels: {flow_data.pnn_labels}")
|
||||
|
||||
# Get event data as NumPy array
|
||||
events = flow_data.as_array() # Shape: (events, channels)
|
||||
```
|
||||
|
||||
### Creating FCS Files
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from flowio import create_fcs
|
||||
|
||||
# Prepare data
|
||||
data = np.array([[100, 200, 50], [150, 180, 60]]) # 2 events, 3 channels
|
||||
channels = ['FSC-A', 'SSC-A', 'FL1-A']
|
||||
|
||||
# Create FCS file
|
||||
create_fcs('output.fcs', data, channels)
|
||||
```
|
||||
|
||||
## Core Workflows
|
||||
|
||||
### Reading and Parsing FCS Files
|
||||
|
||||
The FlowData class provides the primary interface for reading FCS files.
|
||||
|
||||
**Standard Reading:**
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
# Basic reading
|
||||
flow = FlowData('sample.fcs')
|
||||
|
||||
# Access attributes
|
||||
version = flow.version # '3.0', '3.1', etc.
|
||||
event_count = flow.event_count # Number of events
|
||||
channel_count = flow.channel_count # Number of channels
|
||||
pnn_labels = flow.pnn_labels # Short channel names
|
||||
pns_labels = flow.pns_labels # Descriptive stain names
|
||||
|
||||
# Get event data
|
||||
events = flow.as_array() # Preprocessed (gain, log scaling applied)
|
||||
raw_events = flow.as_array(preprocess=False) # Raw data
|
||||
```
|
||||
|
||||
**Memory-Efficient Metadata Reading:**
|
||||
|
||||
When only metadata is needed (no event data):
|
||||
|
||||
```python
|
||||
# Only parse TEXT segment, skip DATA and ANALYSIS
|
||||
flow = FlowData('sample.fcs', only_text=True)
|
||||
|
||||
# Access metadata
|
||||
metadata = flow.text # Dictionary of TEXT segment keywords
|
||||
print(metadata.get('$DATE')) # Acquisition date
|
||||
print(metadata.get('$CYT')) # Instrument name
|
||||
```
|
||||
|
||||
**Handling Problematic Files:**
|
||||
|
||||
Some FCS files have offset discrepancies or errors:
|
||||
|
||||
```python
|
||||
# Ignore offset discrepancies between HEADER and TEXT sections
|
||||
flow = FlowData('problematic.fcs', ignore_offset_discrepancy=True)
|
||||
|
||||
# Use HEADER offsets instead of TEXT offsets
|
||||
flow = FlowData('problematic.fcs', use_header_offsets=True)
|
||||
|
||||
# Ignore offset errors entirely
|
||||
flow = FlowData('problematic.fcs', ignore_offset_error=True)
|
||||
```
|
||||
|
||||
**Excluding Null Channels:**
|
||||
|
||||
```python
|
||||
# Exclude specific channels during parsing
|
||||
flow = FlowData('sample.fcs', null_channel_list=['Time', 'Null'])
|
||||
```
|
||||
|
||||
### Extracting Metadata and Channel Information
|
||||
|
||||
FCS files contain rich metadata in the TEXT segment.
|
||||
|
||||
**Common Metadata Keywords:**
|
||||
|
||||
```python
|
||||
flow = FlowData('sample.fcs')
|
||||
|
||||
# File-level metadata
|
||||
text_dict = flow.text
|
||||
acquisition_date = text_dict.get('$DATE', 'Unknown')
|
||||
instrument = text_dict.get('$CYT', 'Unknown')
|
||||
data_type = flow.data_type # 'I', 'F', 'D', 'A'
|
||||
|
||||
# Channel metadata
|
||||
for i in range(flow.channel_count):
|
||||
pnn = flow.pnn_labels[i] # Short name (e.g., 'FSC-A')
|
||||
pns = flow.pns_labels[i] # Descriptive name (e.g., 'Forward Scatter')
|
||||
pnr = flow.pnr_values[i] # Range/max value
|
||||
print(f"Channel {i}: {pnn} ({pns}), Range: {pnr}")
|
||||
```
|
||||
|
||||
**Channel Type Identification:**
|
||||
|
||||
FlowIO automatically categorizes channels:
|
||||
|
||||
```python
|
||||
# Get indices by channel type
|
||||
scatter_idx = flow.scatter_indices # [0, 1] for FSC, SSC
|
||||
fluoro_idx = flow.fluoro_indices # [2, 3, 4] for FL channels
|
||||
time_idx = flow.time_index # Index of time channel (or None)
|
||||
|
||||
# Access specific channel types
|
||||
events = flow.as_array()
|
||||
scatter_data = events[:, scatter_idx]
|
||||
fluorescence_data = events[:, fluoro_idx]
|
||||
```
|
||||
|
||||
**ANALYSIS Segment:**
|
||||
|
||||
If present, access processed results:
|
||||
|
||||
```python
|
||||
if flow.analysis:
|
||||
analysis_keywords = flow.analysis # Dictionary of ANALYSIS keywords
|
||||
print(analysis_keywords)
|
||||
```
|
||||
|
||||
### Creating New FCS Files
|
||||
|
||||
Generate FCS files from NumPy arrays or other data sources.
|
||||
|
||||
**Basic Creation:**
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from flowio import create_fcs
|
||||
|
||||
# Create event data (rows=events, columns=channels)
|
||||
events = np.random.rand(10000, 5) * 1000
|
||||
|
||||
# Define channel names
|
||||
channel_names = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
|
||||
|
||||
# Create FCS file
|
||||
create_fcs('output.fcs', events, channel_names)
|
||||
```
|
||||
|
||||
**With Descriptive Channel Names:**
|
||||
|
||||
```python
|
||||
# Add optional descriptive names (PnS)
|
||||
channel_names = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
|
||||
descriptive_names = ['Forward Scatter', 'Side Scatter', 'FITC', 'PE', 'Time']
|
||||
|
||||
create_fcs('output.fcs',
|
||||
events,
|
||||
channel_names,
|
||||
opt_channel_names=descriptive_names)
|
||||
```
|
||||
|
||||
**With Custom Metadata:**
|
||||
|
||||
```python
|
||||
# Add TEXT segment metadata
|
||||
metadata = {
|
||||
'$SRC': 'Python script',
|
||||
'$DATE': '19-OCT-2025',
|
||||
'$CYT': 'Synthetic Instrument',
|
||||
'$INST': 'Laboratory A'
|
||||
}
|
||||
|
||||
create_fcs('output.fcs',
|
||||
events,
|
||||
channel_names,
|
||||
opt_channel_names=descriptive_names,
|
||||
metadata=metadata)
|
||||
```
|
||||
|
||||
**Note:** FlowIO exports as FCS 3.1 with single-precision floating-point data.
|
||||
|
||||
### Exporting Modified Data
|
||||
|
||||
Modify existing FCS files and re-export them.
|
||||
|
||||
**Approach 1: Using write_fcs() Method:**
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
# Read original file
|
||||
flow = FlowData('original.fcs')
|
||||
|
||||
# Write with updated metadata
|
||||
flow.write_fcs('modified.fcs', metadata={'$SRC': 'Modified data'})
|
||||
```
|
||||
|
||||
**Approach 2: Extract, Modify, and Recreate:**
|
||||
|
||||
For modifying event data:
|
||||
|
||||
```python
|
||||
from flowio import FlowData, create_fcs
|
||||
|
||||
# Read and extract data
|
||||
flow = FlowData('original.fcs')
|
||||
events = flow.as_array(preprocess=False)
|
||||
|
||||
# Modify event data
|
||||
events[:, 0] = events[:, 0] * 1.5 # Scale first channel
|
||||
|
||||
# Create new FCS file with modified data
|
||||
create_fcs('modified.fcs',
|
||||
events,
|
||||
flow.pnn_labels,
|
||||
opt_channel_names=flow.pns_labels,
|
||||
metadata=flow.text)
|
||||
```
|
||||
|
||||
### Handling Multi-Dataset FCS Files
|
||||
|
||||
Some FCS files contain multiple datasets in a single file.
|
||||
|
||||
**Detecting Multi-Dataset Files:**
|
||||
|
||||
```python
|
||||
from flowio import FlowData, MultipleDataSetsError
|
||||
|
||||
try:
|
||||
flow = FlowData('sample.fcs')
|
||||
except MultipleDataSetsError:
|
||||
print("File contains multiple datasets")
|
||||
# Use read_multiple_data_sets() instead
|
||||
```
|
||||
|
||||
**Reading All Datasets:**
|
||||
|
||||
```python
|
||||
from flowio import read_multiple_data_sets
|
||||
|
||||
# Read all datasets from file
|
||||
datasets = read_multiple_data_sets('multi_dataset.fcs')
|
||||
|
||||
print(f"Found {len(datasets)} datasets")
|
||||
|
||||
# Process each dataset
|
||||
for i, dataset in enumerate(datasets):
|
||||
print(f"\nDataset {i}:")
|
||||
print(f" Events: {dataset.event_count}")
|
||||
print(f" Channels: {dataset.pnn_labels}")
|
||||
|
||||
# Get event data for this dataset
|
||||
events = dataset.as_array()
|
||||
print(f" Shape: {events.shape}")
|
||||
print(f" Mean values: {events.mean(axis=0)}")
|
||||
```
|
||||
|
||||
**Reading Specific Dataset:**
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
# Read first dataset (nextdata_offset=0)
|
||||
first_dataset = FlowData('multi.fcs', nextdata_offset=0)
|
||||
|
||||
# Read second dataset using NEXTDATA offset from first
|
||||
next_offset = int(first_dataset.text['$NEXTDATA'])
|
||||
if next_offset > 0:
|
||||
second_dataset = FlowData('multi.fcs', nextdata_offset=next_offset)
|
||||
```
|
||||
|
||||
## Data Preprocessing
|
||||
|
||||
FlowIO applies standard FCS preprocessing transformations when `preprocess=True`.
|
||||
|
||||
**Preprocessing Steps:**
|
||||
|
||||
1. **Gain Scaling:** Multiply values by PnG (gain) keyword
|
||||
2. **Logarithmic Transformation:** Apply PnE exponential transformation if present
|
||||
- Formula: `value = a * 10^(b * raw_value)` where PnE = "a,b"
|
||||
3. **Time Scaling:** Convert time values to appropriate units
|
||||
|
||||
**Controlling Preprocessing:**
|
||||
|
||||
```python
|
||||
# Preprocessed data (default)
|
||||
preprocessed = flow.as_array(preprocess=True)
|
||||
|
||||
# Raw data (no transformations)
|
||||
raw = flow.as_array(preprocess=False)
|
||||
```
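
For reference, the rough sketch below applies the two transformations by hand, following the formula stated above. It assumes the `$PnG`/`$PnE` keywords are exposed in `flow.text` under their `$`-prefixed names (as in the earlier metadata examples) and is illustrative only; it is not guaranteed to reproduce FlowIO's internal preprocessing exactly.

```python
import numpy as np
from flowio import FlowData

flow = FlowData('sample.fcs')
values = flow.as_array(preprocess=False).astype(float)

for i in range(flow.channel_count):
    n = i + 1  # FCS channel keywords are 1-indexed ($P1G, $P1E, ...)

    # 1. Gain scaling: multiply by $PnG (treated as 1.0 when absent)
    values[:, i] *= float(flow.text.get(f'$P{n}G', 1.0))

    # 2. Log transform: $PnE = "a,b" gives value = a * 10^(b * raw); "0,0" means linear
    a, b = (float(x) for x in flow.text.get(f'$P{n}E', '0,0').split(','))
    if a != 0:
        values[:, i] = a * 10 ** (b * values[:, i])

print(values[:5])
```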
|
||||
|
||||
## Error Handling
|
||||
|
||||
Handle common FlowIO exceptions appropriately.
|
||||
|
||||
```python
|
||||
from flowio import (
|
||||
FlowData,
|
||||
FCSParsingError,
|
||||
DataOffsetDiscrepancyError,
|
||||
MultipleDataSetsError
|
||||
)
|
||||
|
||||
try:
|
||||
flow = FlowData('sample.fcs')
|
||||
events = flow.as_array()
|
||||
|
||||
except FCSParsingError as e:
|
||||
print(f"Failed to parse FCS file: {e}")
|
||||
# Try with relaxed parsing
|
||||
flow = FlowData('sample.fcs', ignore_offset_error=True)
|
||||
|
||||
except DataOffsetDiscrepancyError as e:
|
||||
print(f"Offset discrepancy detected: {e}")
|
||||
# Use ignore_offset_discrepancy parameter
|
||||
flow = FlowData('sample.fcs', ignore_offset_discrepancy=True)
|
||||
|
||||
except MultipleDataSetsError as e:
|
||||
print(f"Multiple datasets detected: {e}")
|
||||
# Use read_multiple_data_sets instead
|
||||
from flowio import read_multiple_data_sets
|
||||
datasets = read_multiple_data_sets('sample.fcs')
|
||||
|
||||
except Exception as e:
|
||||
print(f"Unexpected error: {e}")
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Inspecting FCS File Contents
|
||||
|
||||
Quick exploration of FCS file structure:
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
flow = FlowData('unknown.fcs')
|
||||
|
||||
print("=" * 50)
|
||||
print(f"File: {flow.name}")
|
||||
print(f"Version: {flow.version}")
|
||||
print(f"Size: {flow.file_size:,} bytes")
|
||||
print("=" * 50)
|
||||
|
||||
print(f"\nEvents: {flow.event_count:,}")
|
||||
print(f"Channels: {flow.channel_count}")
|
||||
|
||||
print("\nChannel Information:")
|
||||
for i, (pnn, pns) in enumerate(zip(flow.pnn_labels, flow.pns_labels)):
|
||||
ch_type = "scatter" if i in flow.scatter_indices else \
|
||||
"fluoro" if i in flow.fluoro_indices else \
|
||||
"time" if i == flow.time_index else "other"
|
||||
print(f" [{i}] {pnn:10s} | {pns:30s} | {ch_type}")
|
||||
|
||||
print("\nKey Metadata:")
|
||||
for key in ['$DATE', '$BTIM', '$ETIM', '$CYT', '$INST', '$SRC']:
|
||||
value = flow.text.get(key, 'N/A')
|
||||
print(f" {key:15s}: {value}")
|
||||
```
|
||||
|
||||
### Batch Processing Multiple Files
|
||||
|
||||
Process a directory of FCS files:
|
||||
|
||||
```python
|
||||
from pathlib import Path
|
||||
from flowio import FlowData
|
||||
import pandas as pd
|
||||
|
||||
# Find all FCS files
|
||||
fcs_files = list(Path('data/').glob('*.fcs'))
|
||||
|
||||
# Extract summary information
|
||||
summaries = []
|
||||
for fcs_path in fcs_files:
|
||||
try:
|
||||
flow = FlowData(str(fcs_path), only_text=True)
|
||||
summaries.append({
|
||||
'filename': fcs_path.name,
|
||||
'version': flow.version,
|
||||
'events': flow.event_count,
|
||||
'channels': flow.channel_count,
|
||||
'date': flow.text.get('$DATE', 'N/A')
|
||||
})
|
||||
except Exception as e:
|
||||
print(f"Error processing {fcs_path.name}: {e}")
|
||||
|
||||
# Create summary DataFrame
|
||||
df = pd.DataFrame(summaries)
|
||||
print(df)
|
||||
```
|
||||
|
||||
### Converting FCS to CSV
|
||||
|
||||
Export event data to CSV format:
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
import pandas as pd
|
||||
|
||||
# Read FCS file
|
||||
flow = FlowData('sample.fcs')
|
||||
|
||||
# Convert to DataFrame
|
||||
df = pd.DataFrame(
|
||||
flow.as_array(),
|
||||
columns=flow.pnn_labels
|
||||
)
|
||||
|
||||
# Add metadata as attributes
|
||||
df.attrs['fcs_version'] = flow.version
|
||||
df.attrs['instrument'] = flow.text.get('$CYT', 'Unknown')
|
||||
|
||||
# Export to CSV
|
||||
df.to_csv('output.csv', index=False)
|
||||
print(f"Exported {len(df)} events to CSV")
|
||||
```
|
||||
|
||||
### Filtering Events and Re-exporting
|
||||
|
||||
Apply filters and save filtered data:
|
||||
|
||||
```python
|
||||
from flowio import FlowData, create_fcs
|
||||
import numpy as np
|
||||
|
||||
# Read original file
|
||||
flow = FlowData('sample.fcs')
|
||||
events = flow.as_array(preprocess=False)
|
||||
|
||||
# Apply filtering (example: threshold on first channel)
|
||||
fsc_idx = 0
|
||||
threshold = 500
|
||||
mask = events[:, fsc_idx] > threshold
|
||||
filtered_events = events[mask]
|
||||
|
||||
print(f"Original events: {len(events)}")
|
||||
print(f"Filtered events: {len(filtered_events)}")
|
||||
|
||||
# Create new FCS file with filtered data
|
||||
create_fcs('filtered.fcs',
|
||||
filtered_events,
|
||||
flow.pnn_labels,
|
||||
opt_channel_names=flow.pns_labels,
|
||||
metadata={**flow.text, '$SRC': 'Filtered data'})
|
||||
```
|
||||
|
||||
### Extracting Specific Channels
|
||||
|
||||
Extract and process specific channels:
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
import numpy as np
|
||||
|
||||
flow = FlowData('sample.fcs')
|
||||
events = flow.as_array()
|
||||
|
||||
# Extract fluorescence channels only
|
||||
fluoro_indices = flow.fluoro_indices
|
||||
fluoro_data = events[:, fluoro_indices]
|
||||
fluoro_names = [flow.pnn_labels[i] for i in fluoro_indices]
|
||||
|
||||
print(f"Fluorescence channels: {fluoro_names}")
|
||||
print(f"Shape: {fluoro_data.shape}")
|
||||
|
||||
# Calculate statistics per channel
|
||||
for i, name in enumerate(fluoro_names):
|
||||
channel_data = fluoro_data[:, i]
|
||||
print(f"\n{name}:")
|
||||
print(f" Mean: {channel_data.mean():.2f}")
|
||||
print(f" Median: {np.median(channel_data):.2f}")
|
||||
print(f" Std Dev: {channel_data.std():.2f}")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Memory Efficiency:** Use `only_text=True` when event data is not needed
|
||||
2. **Error Handling:** Wrap file operations in try-except blocks for robust code
|
||||
3. **Multi-Dataset Detection:** Check for MultipleDataSetsError and use appropriate function
|
||||
4. **Preprocessing Control:** Explicitly set `preprocess` parameter based on analysis needs
|
||||
5. **Offset Issues:** If parsing fails, try `ignore_offset_discrepancy=True` parameter
|
||||
6. **Channel Validation:** Verify channel counts and names match expectations before processing
|
||||
7. **Metadata Preservation:** When modifying files, preserve original TEXT segment keywords
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Understanding FCS File Structure
|
||||
|
||||
FCS files consist of four segments:
|
||||
|
||||
1. **HEADER:** FCS version and byte offsets for other segments
|
||||
2. **TEXT:** Key-value metadata pairs (delimiter-separated)
|
||||
3. **DATA:** Raw event data (binary/float/ASCII format)
|
||||
4. **ANALYSIS** (optional): Results from data processing
|
||||
|
||||
Access these segments via FlowData attributes:
|
||||
- `flow.header` - HEADER segment
|
||||
- `flow.text` - TEXT segment keywords
|
||||
- `flow.events` - DATA segment (as bytes)
|
||||
- `flow.analysis` - ANALYSIS segment keywords (if present)
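
As a quick illustration, the segment-level information can be summarized directly from these attributes (a small sketch; `sample.fcs` is a placeholder):

```python
from flowio import FlowData

# A quick look at each segment using the attributes above
flow = FlowData('sample.fcs')

print(f"HEADER  : FCS version {flow.version}")
print(f"TEXT    : {len(flow.text)} keywords")
print(f"DATA    : {flow.event_count} events x {flow.channel_count} channels")
print(f"ANALYSIS: {'present' if flow.analysis else 'absent'}")
```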
|
||||
|
||||
### Detailed API Reference
|
||||
|
||||
For comprehensive API documentation including all parameters, methods, exceptions, and FCS keyword reference, consult the detailed reference file:
|
||||
|
||||
**Read:** `references/api_reference.md`
|
||||
|
||||
The reference includes:
|
||||
- Complete FlowData class documentation
|
||||
- All utility functions (read_multiple_data_sets, create_fcs)
|
||||
- Exception classes and handling
|
||||
- FCS file structure details
|
||||
- Common TEXT segment keywords
|
||||
- Extended example workflows
|
||||
|
||||
When working with complex FCS operations or encountering unusual file formats, load this reference for detailed guidance.
|
||||
|
||||
## Integration Notes
|
||||
|
||||
**NumPy Arrays:** All event data is returned as NumPy ndarrays with shape (events, channels)
|
||||
|
||||
**Pandas DataFrames:** Easily convert to DataFrames for analysis:
|
||||
```python
|
||||
import pandas as pd
|
||||
df = pd.DataFrame(flow.as_array(), columns=flow.pnn_labels)
|
||||
```
|
||||
|
||||
**FlowKit Integration:** For advanced analysis (compensation, gating, FlowJo support), use FlowKit library which builds on FlowIO's parsing capabilities
|
||||
|
||||
**Web Applications:** FlowIO's minimal dependencies make it ideal for web backend services processing FCS uploads
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Problem:** "Offset discrepancy error"
|
||||
**Solution:** Use `ignore_offset_discrepancy=True` parameter
|
||||
|
||||
**Problem:** "Multiple datasets error"
|
||||
**Solution:** Use `read_multiple_data_sets()` function instead of FlowData constructor
|
||||
|
||||
**Problem:** Out of memory with large files
|
||||
**Solution:** Use `only_text=True` for metadata-only operations, or process events in chunks
|
||||
|
||||
**Problem:** Unexpected channel counts
|
||||
**Solution:** Check for null channels; use `null_channel_list` parameter to exclude them
|
||||
|
||||
**Problem:** Cannot modify event data in place
|
||||
**Solution:** FlowIO doesn't support direct modification; extract data, modify, then use `create_fcs()` to save
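
The offset and multi-dataset workarounds above can be combined into a single defensive read. This is a hedged sketch — the path is a placeholder and the exception import assumes a `flowio.exceptions` module:

```python
from flowio import FlowData, read_multiple_data_sets
from flowio.exceptions import DataOffsetDiscrepancyError, MultipleDataSetsError

path = 'problematic.fcs'  # placeholder path

try:
    datasets = [FlowData(path)]
except MultipleDataSetsError:
    # File contains more than one dataset
    datasets = read_multiple_data_sets(path)
except DataOffsetDiscrepancyError:
    # HEADER and TEXT disagree on byte offsets
    datasets = [FlowData(path, ignore_offset_discrepancy=True)]

for ds in datasets:
    print(ds.name, ds.event_count)
```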
|
||||
|
||||
## Summary
|
||||
|
||||
FlowIO provides essential FCS file handling capabilities for flow cytometry workflows. Use it for parsing, metadata extraction, and file creation. For simple file operations and data extraction, FlowIO is sufficient. For complex analysis including compensation and gating, integrate with FlowKit or other specialized tools.
|
||||
372
scientific-packages/flowio/references/api_reference.md
Normal file
@@ -0,0 +1,372 @@
|
||||
# FlowIO API Reference
|
||||
|
||||
## Overview
|
||||
|
||||
FlowIO is a Python library for reading and writing Flow Cytometry Standard (FCS) files. It supports FCS versions 2.0, 3.0, and 3.1 with minimal dependencies.
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install flowio
|
||||
```
|
||||
|
||||
Supports Python 3.9 and later.
|
||||
|
||||
## Core Classes
|
||||
|
||||
### FlowData
|
||||
|
||||
The primary class for working with FCS files.
|
||||
|
||||
#### Constructor
|
||||
|
||||
```python
|
||||
FlowData(fcs_file,
|
||||
ignore_offset_error=False,
|
||||
ignore_offset_discrepancy=False,
|
||||
use_header_offsets=False,
|
||||
only_text=False,
|
||||
nextdata_offset=None,
|
||||
null_channel_list=None)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `fcs_file`: File path (str), Path object, or file handle
|
||||
- `ignore_offset_error` (bool): Ignore offset errors (default: False)
|
||||
- `ignore_offset_discrepancy` (bool): Ignore offset discrepancies between HEADER and TEXT sections (default: False)
|
||||
- `use_header_offsets` (bool): Use HEADER section offsets instead of TEXT section (default: False)
|
||||
- `only_text` (bool): Only parse the TEXT segment, skip DATA and ANALYSIS (default: False)
|
||||
- `nextdata_offset` (int): Byte offset for reading multi-dataset files
|
||||
- `null_channel_list` (list): List of PnN labels for null channels to exclude
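
A brief constructor sketch (the file path and null-channel labels are placeholders):

```python
from flowio import FlowData

# Tolerate HEADER/TEXT offset discrepancies and exclude two hypothetical
# null channels, identified by their PnN labels, while parsing.
flow = FlowData('sample.fcs',
                ignore_offset_discrepancy=True,
                null_channel_list=['Null1', 'Null2'])
print(flow.channel_count, flow.pnn_labels)
```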
|
||||
|
||||
#### Attributes
|
||||
|
||||
**File Information:**
|
||||
- `name`: Name of the FCS file
|
||||
- `file_size`: Size of the file in bytes
|
||||
- `version`: FCS version (e.g., '3.0', '3.1')
|
||||
- `header`: Dictionary containing HEADER segment information
|
||||
- `data_type`: Type of data format ('I', 'F', 'D', 'A')
|
||||
|
||||
**Channel Information:**
|
||||
- `channel_count`: Number of channels in the dataset
|
||||
- `channels`: Dictionary mapping channel numbers to channel info
|
||||
- `pnn_labels`: List of PnN (short channel name) labels
|
||||
- `pns_labels`: List of PnS (descriptive stain name) labels
|
||||
- `pnr_values`: List of PnR (range) values for each channel
|
||||
- `fluoro_indices`: List of indices for fluorescence channels
|
||||
- `scatter_indices`: List of indices for scatter channels
|
||||
- `time_index`: Index of the time channel (or None)
|
||||
- `null_channels`: List of null channel indices
|
||||
|
||||
**Event Data:**
|
||||
- `event_count`: Number of events (rows) in the dataset
|
||||
- `events`: Raw event data as bytes
|
||||
|
||||
**Metadata:**
|
||||
- `text`: Dictionary of TEXT segment key-value pairs
|
||||
- `analysis`: Dictionary of ANALYSIS segment key-value pairs (if present)
|
||||
|
||||
#### Methods
|
||||
|
||||
##### as_array()
|
||||
|
||||
```python
|
||||
as_array(preprocess=True)
|
||||
```
|
||||
|
||||
Return event data as a 2-D NumPy array.
|
||||
|
||||
**Parameters:**
|
||||
- `preprocess` (bool): Apply gain, logarithmic, and time scaling transformations (default: True)
|
||||
|
||||
**Returns:**
|
||||
- NumPy ndarray with shape (event_count, channel_count)
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
flow_data = FlowData('sample.fcs')
|
||||
events_array = flow_data.as_array() # Preprocessed data
|
||||
raw_array = flow_data.as_array(preprocess=False) # Raw data
|
||||
```
|
||||
|
||||
##### write_fcs()
|
||||
|
||||
```python
|
||||
write_fcs(filename, metadata=None)
|
||||
```
|
||||
|
||||
Export the FlowData instance as a new FCS file.
|
||||
|
||||
**Parameters:**
|
||||
- `filename` (str): Output file path
|
||||
- `metadata` (dict): Optional dictionary of TEXT segment keywords to add/update
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
flow_data = FlowData('sample.fcs')
|
||||
flow_data.write_fcs('output.fcs', metadata={'$SRC': 'Modified data'})
|
||||
```
|
||||
|
||||
**Note:** Exports as FCS 3.1 with single-precision floating-point data.
|
||||
|
||||
## Utility Functions
|
||||
|
||||
### read_multiple_data_sets()
|
||||
|
||||
```python
|
||||
read_multiple_data_sets(fcs_file,
|
||||
ignore_offset_error=False,
|
||||
ignore_offset_discrepancy=False,
|
||||
use_header_offsets=False)
|
||||
```
|
||||
|
||||
Read all datasets from an FCS file containing multiple datasets.
|
||||
|
||||
**Parameters:**
|
||||
- Same as FlowData constructor (except `nextdata_offset`)
|
||||
|
||||
**Returns:**
|
||||
- List of FlowData instances, one for each dataset
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from flowio import read_multiple_data_sets
|
||||
|
||||
datasets = read_multiple_data_sets('multi_dataset.fcs')
|
||||
print(f"Found {len(datasets)} datasets")
|
||||
for i, dataset in enumerate(datasets):
|
||||
print(f"Dataset {i}: {dataset.event_count} events")
|
||||
```
|
||||
|
||||
### create_fcs()
|
||||
|
||||
```python
|
||||
create_fcs(filename,
|
||||
event_data,
|
||||
channel_names,
|
||||
opt_channel_names=None,
|
||||
metadata=None)
|
||||
```
|
||||
|
||||
Create a new FCS file from event data.
|
||||
|
||||
**Parameters:**
|
||||
- `filename` (str): Output file path
|
||||
- `event_data` (ndarray): 2-D NumPy array of event data (rows=events, columns=channels)
|
||||
- `channel_names` (list): List of PnN (short) channel names
|
||||
- `opt_channel_names` (list): Optional list of PnS (descriptive) channel names
|
||||
- `metadata` (dict): Optional dictionary of TEXT segment keywords
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
import numpy as np
|
||||
from flowio import create_fcs
|
||||
|
||||
# Create synthetic data
|
||||
events = np.random.rand(10000, 5)
|
||||
channels = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
|
||||
opt_channels = ['Forward Scatter', 'Side Scatter', 'FITC', 'PE', 'Time']
|
||||
|
||||
create_fcs('synthetic.fcs',
|
||||
events,
|
||||
channels,
|
||||
opt_channel_names=opt_channels,
|
||||
metadata={'$SRC': 'Synthetic data'})
|
||||
```
|
||||
|
||||
## Exception Classes
|
||||
|
||||
### FlowIOWarning
|
||||
|
||||
Generic warning class for non-critical issues.
|
||||
|
||||
### PnEWarning
|
||||
|
||||
Warning raised when PnE values are invalid during FCS file creation.
|
||||
|
||||
### FlowIOException
|
||||
|
||||
Base exception class for FlowIO errors.
|
||||
|
||||
### FCSParsingError
|
||||
|
||||
Raised when there are issues parsing an FCS file.
|
||||
|
||||
### DataOffsetDiscrepancyError
|
||||
|
||||
Raised when the HEADER and TEXT sections provide different byte offsets for data segments.
|
||||
|
||||
**Workaround:** Use `ignore_offset_discrepancy=True` parameter when creating FlowData instance.
|
||||
|
||||
### MultipleDataSetsError
|
||||
|
||||
Raised when attempting to read a file with multiple datasets using the standard FlowData constructor.
|
||||
|
||||
**Solution:** Use `read_multiple_data_sets()` function instead.
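
A hedged sketch of a loader that maps these exceptions to the documented workarounds (the import path assumes a `flowio.exceptions` module):

```python
from flowio import FlowData, read_multiple_data_sets
from flowio.exceptions import (
    FCSParsingError,
    DataOffsetDiscrepancyError,
    MultipleDataSetsError,
)

def load_datasets(path):
    """Return a list of FlowData instances, applying the documented workarounds."""
    try:
        return [FlowData(path)]
    except MultipleDataSetsError:
        return read_multiple_data_sets(path)
    except DataOffsetDiscrepancyError:
        return [FlowData(path, ignore_offset_discrepancy=True)]
    except FCSParsingError as exc:
        raise RuntimeError(f"Unrecoverable parsing error for {path}") from exc
```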
|
||||
|
||||
## FCS File Structure Reference
|
||||
|
||||
FCS files consist of four segments:
|
||||
|
||||
1. **HEADER**: Contains FCS version and byte locations of other segments
|
||||
2. **TEXT**: Key-value metadata pairs (delimited format)
|
||||
3. **DATA**: Raw event data (binary, floating-point, or ASCII)
|
||||
4. **ANALYSIS** (optional): Results from data processing
|
||||
|
||||
### Common TEXT Segment Keywords
|
||||
|
||||
- `$BEGINDATA`, `$ENDDATA`: Byte offsets for DATA segment
|
||||
- `$BEGINANALYSIS`, `$ENDANALYSIS`: Byte offsets for ANALYSIS segment
|
||||
- `$BYTEORD`: Byte order (1,2,3,4 for little-endian; 4,3,2,1 for big-endian)
|
||||
- `$DATATYPE`: Data type ('I'=integer, 'F'=float, 'D'=double, 'A'=ASCII)
|
||||
- `$MODE`: Data mode ('L'=list mode, most common)
|
||||
- `$NEXTDATA`: Offset to next dataset (0 if single dataset)
|
||||
- `$PAR`: Number of parameters (channels)
|
||||
- `$TOT`: Total number of events
|
||||
- `PnN`: Short name for parameter n
|
||||
- `PnS`: Descriptive stain name for parameter n
|
||||
- `PnR`: Range (max value) for parameter n
|
||||
- `PnE`: Amplification exponent for parameter n (format: "a,b" where value = a * 10^(b*x))
|
||||
- `PnG`: Amplification gain for parameter n
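
A small sketch that reads a few of these keywords from `FlowData.text`; it assumes keys are exposed with their `$`-prefixed names (as in the metadata example later in this reference) and uses `.get()` so missing or differently cased keys fail softly:

```python
from flowio import FlowData

flow = FlowData('sample.fcs')  # placeholder path
text = flow.text

n_par = int(text.get('$PAR', flow.channel_count))
n_tot = int(text.get('$TOT', flow.event_count))
print(f"{n_tot} events x {n_par} channels, datatype {text.get('$DATATYPE', '?')}")

# Per-parameter keywords are numbered: $P1N, $P2N, ... up to $PAR
for n in range(1, n_par + 1):
    print(text.get(f'$P{n}N', ''), text.get(f'$P{n}S', ''))
```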
|
||||
|
||||
## Channel Types
|
||||
|
||||
FlowIO automatically categorizes channels:
|
||||
|
||||
- **Scatter channels**: FSC (forward scatter), SSC (side scatter)
|
||||
- **Fluorescence channels**: FL1, FL2, FITC, PE, etc.
|
||||
- **Time channel**: Usually labeled "Time"
|
||||
|
||||
Access indices via:
|
||||
- `flow_data.scatter_indices`
|
||||
- `flow_data.fluoro_indices`
|
||||
- `flow_data.time_index`
|
||||
|
||||
## Data Preprocessing
|
||||
|
||||
When calling `as_array(preprocess=True)`, FlowIO applies:
|
||||
|
||||
1. **Gain scaling**: Multiply by PnG value
|
||||
2. **Logarithmic transformation**: Apply PnE exponential transformation if present
|
||||
3. **Time scaling**: Convert time values to appropriate units
|
||||
|
||||
To access raw, unprocessed data: `as_array(preprocess=False)`
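
A quick way to see which channels preprocessing actually touches (placeholder file path):

```python
import numpy as np
from flowio import FlowData

flow = FlowData('sample.fcs')  # placeholder path

raw = flow.as_array(preprocess=False)
scaled = flow.as_array(preprocess=True)

# Channels with gain (PnG) or log (PnE) settings will differ between the two
changed = [name for i, name in enumerate(flow.pnn_labels)
           if not np.allclose(raw[:, i], scaled[:, i])]
print("Channels affected by preprocessing:", changed)
```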
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Memory efficiency**: Use `only_text=True` when only metadata is needed
|
||||
2. **Error handling**: Wrap file operations in try-except blocks for FCSParsingError
|
||||
3. **Multi-dataset files**: Always use `read_multiple_data_sets()` if unsure about dataset count
|
||||
4. **Offset issues**: If encountering offset errors, try `ignore_offset_discrepancy=True`
|
||||
5. **Channel selection**: Use null_channel_list to exclude unwanted channels during parsing
|
||||
|
||||
## Integration with FlowKit
|
||||
|
||||
For advanced flow cytometry analysis including compensation, gating, and GatingML support, consider using FlowKit library alongside FlowIO. FlowKit provides higher-level abstractions built on top of FlowIO's file parsing capabilities.
|
||||
|
||||
## Example Workflows
|
||||
|
||||
### Basic File Reading
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
# Read FCS file
|
||||
flow = FlowData('experiment.fcs')
|
||||
|
||||
# Print basic info
|
||||
print(f"Version: {flow.version}")
|
||||
print(f"Events: {flow.event_count}")
|
||||
print(f"Channels: {flow.channel_count}")
|
||||
print(f"Channel names: {flow.pnn_labels}")
|
||||
|
||||
# Get event data
|
||||
events = flow.as_array()
|
||||
print(f"Data shape: {events.shape}")
|
||||
```
|
||||
|
||||
### Metadata Extraction
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
flow = FlowData('sample.fcs', only_text=True)
|
||||
|
||||
# Access metadata
|
||||
print(f"Acquisition date: {flow.text.get('$DATE', 'N/A')}")
|
||||
print(f"Instrument: {flow.text.get('$CYT', 'N/A')}")
|
||||
|
||||
# Channel information
|
||||
for i, (pnn, pns) in enumerate(zip(flow.pnn_labels, flow.pns_labels)):
|
||||
print(f"Channel {i}: {pnn} ({pns})")
|
||||
```
|
||||
|
||||
### Creating New FCS Files
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from flowio import create_fcs
|
||||
|
||||
# Generate or process data
|
||||
data = np.random.rand(5000, 3) * 1000
|
||||
|
||||
# Define channels
|
||||
channels = ['FSC-A', 'SSC-A', 'FL1-A']
|
||||
stains = ['Forward Scatter', 'Side Scatter', 'GFP']
|
||||
|
||||
# Create FCS file
|
||||
create_fcs('output.fcs',
|
||||
data,
|
||||
channels,
|
||||
opt_channel_names=stains,
|
||||
metadata={
|
||||
'$SRC': 'Python script',
|
||||
'$DATE': '19-OCT-2025'
|
||||
})
|
||||
```
|
||||
|
||||
### Processing Multi-Dataset Files
|
||||
|
||||
```python
|
||||
from flowio import read_multiple_data_sets
|
||||
|
||||
# Read all datasets
|
||||
datasets = read_multiple_data_sets('multi.fcs')
|
||||
|
||||
# Process each dataset
|
||||
for i, dataset in enumerate(datasets):
|
||||
print(f"\nDataset {i}:")
|
||||
print(f" Events: {dataset.event_count}")
|
||||
print(f" Channels: {dataset.pnn_labels}")
|
||||
|
||||
# Get data array
|
||||
events = dataset.as_array()
|
||||
mean_values = events.mean(axis=0)
|
||||
print(f" Mean values: {mean_values}")
|
||||
```
|
||||
|
||||
### Modifying and Re-exporting
|
||||
|
||||
```python
|
||||
from flowio import FlowData
|
||||
|
||||
# Read original file
|
||||
flow = FlowData('original.fcs')
|
||||
|
||||
# Get event data
|
||||
events = flow.as_array(preprocess=False)
|
||||
|
||||
# Modify data (example: apply custom transformation)
|
||||
events[:, 0] = events[:, 0] * 1.5 # Scale first channel
|
||||
|
||||
# FlowData objects cannot be updated in place; write the modified
# array out as a new file with create_fcs():
|
||||
from flowio import create_fcs
|
||||
|
||||
create_fcs('modified.fcs',
|
||||
events,
|
||||
flow.pnn_labels,
|
||||
opt_channel_names=flow.pns_labels,
|
||||
metadata=flow.text)
|
||||
```
|
||||
870
scientific-packages/gget/SKILL.md
Normal file
@@ -0,0 +1,870 @@
|
||||
---
|
||||
name: gget
|
||||
description: Toolkit for querying genomic databases and performing bioinformatics analysis. Use this skill when working with gene sequences, protein structures, genomic databases (Ensembl, UniProt, NCBI, PDB, COSMIC, etc.), performing BLAST/BLAT searches, retrieving gene expression data, conducting enrichment analysis, predicting protein structures with AlphaFold, analyzing mutations, or any bioinformatics workflow requiring efficient database queries. This skill applies to tasks involving nucleotide/amino acid sequences, gene names, Ensembl IDs, UniProt accessions, or requests for genomic annotations, orthologs, disease associations, drug information, or single-cell RNA-seq data.
|
||||
---
|
||||
|
||||
# gget
|
||||
|
||||
## Overview
|
||||
|
||||
gget is a command-line bioinformatics tool and Python package providing unified access to 20+ genomic databases and analysis methods. Execute queries for gene information, sequence analysis, protein structures, expression data, and disease associations through a consistent interface. All gget modules work both as command-line tools and as Python functions.
|
||||
|
||||
**Important**: The databases queried by gget are continuously updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary.
|
||||
|
||||
## Installation
|
||||
|
||||
Install gget in a clean virtual environment to avoid conflicts:
|
||||
|
||||
```bash
|
||||
# Using uv (recommended)
|
||||
uv pip install gget
|
||||
|
||||
# Or using pip
|
||||
pip install --upgrade gget
|
||||
|
||||
# In Python/Jupyter
|
||||
import gget
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
Basic usage pattern for all modules:
|
||||
|
||||
```bash
|
||||
# Command-line
|
||||
gget <module> [arguments] [options]
|
||||
|
||||
# Python
|
||||
gget.module(arguments, options)
|
||||
```
|
||||
|
||||
Most modules return:
|
||||
- **Command-line**: JSON (default) or CSV with `-csv` flag
|
||||
- **Python**: DataFrame or dictionary
|
||||
|
||||
Common flags across modules:
|
||||
- `-o/--out`: Save results to file
|
||||
- `-q/--quiet`: Suppress progress information
|
||||
- `-csv`: Return CSV format (command-line only)
|
||||
|
||||
## Module Categories
|
||||
|
||||
### 1. Reference & Gene Information
|
||||
|
||||
#### gget ref - Reference Genome Downloads
|
||||
|
||||
Retrieve download links and metadata for Ensembl reference genomes.
|
||||
|
||||
**Parameters**:
|
||||
- `species`: Genus_species format (e.g., 'homo_sapiens', 'mus_musculus'). Shortcuts: 'human', 'mouse'
|
||||
- `-w/--which`: Specify return types (gtf, cdna, dna, cds, cdrna, pep). Default: all
|
||||
- `-r/--release`: Ensembl release number (default: latest)
|
||||
- `-l/--list_species`: List available vertebrate species
|
||||
- `-liv/--list_iv_species`: List available invertebrate species
|
||||
- `-ftp`: Return only FTP links
|
||||
- `-d/--download`: Download files (requires curl)
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# List available species
|
||||
gget ref --list_species
|
||||
|
||||
# Get all reference files for human
|
||||
gget ref homo_sapiens
|
||||
|
||||
# Download only GTF annotation for mouse
|
||||
gget ref -w gtf -d mouse
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.ref("homo_sapiens")
|
||||
gget.ref("mus_musculus", which="gtf", download=True)
|
||||
```
|
||||
|
||||
#### gget search - Gene Search
|
||||
|
||||
Locate genes by name or description across species.
|
||||
|
||||
**Parameters**:
|
||||
- `searchwords`: One or more search terms (case-insensitive)
|
||||
- `-s/--species`: Target species (e.g., 'homo_sapiens', 'mouse')
|
||||
- `-r/--release`: Ensembl release number
|
||||
- `-t/--id_type`: Return 'gene' (default) or 'transcript'
|
||||
- `-ao/--andor`: 'or' (default) finds ANY searchword; 'and' requires ALL
|
||||
- `-l/--limit`: Maximum results to return
|
||||
|
||||
**Returns**: ensembl_id, gene_name, ensembl_description, ext_ref_description, biotype, URL
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Search for GABA-related genes in human
|
||||
gget search -s human gaba gamma-aminobutyric
|
||||
|
||||
# Find specific gene, require all terms
|
||||
gget search -s mouse -ao and pax7 transcription
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.search(["gaba", "gamma-aminobutyric"], species="homo_sapiens")
|
||||
```
|
||||
|
||||
#### gget info - Gene/Transcript Information
|
||||
|
||||
Retrieve comprehensive gene and transcript metadata from Ensembl, UniProt, and NCBI.
|
||||
|
||||
**Parameters**:
|
||||
- `ens_ids`: One or more Ensembl IDs (also supports WormBase, Flybase IDs). Limit: ~1000 IDs
|
||||
- `-n/--ncbi`: Disable NCBI data retrieval
|
||||
- `-u/--uniprot`: Disable UniProt data retrieval
|
||||
- `-pdb`: Include PDB identifiers (increases runtime)
|
||||
|
||||
**Returns**: UniProt ID, NCBI gene ID, primary gene name, synonyms, protein names, descriptions, biotype, canonical transcript
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Get info for multiple genes
|
||||
gget info ENSG00000034713 ENSG00000104853 ENSG00000170296
|
||||
|
||||
# Include PDB IDs
|
||||
gget info ENSG00000034713 -pdb
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.info(["ENSG00000034713", "ENSG00000104853"], pdb=True)
|
||||
```
|
||||
|
||||
#### gget seq - Sequence Retrieval
|
||||
|
||||
Fetch nucleotide or amino acid sequences for genes and transcripts.
|
||||
|
||||
**Parameters**:
|
||||
- `ens_ids`: One or more Ensembl identifiers
|
||||
- `-t/--translate`: Fetch amino acid sequences instead of nucleotide
|
||||
- `-iso/--isoforms`: Return all transcript variants (gene IDs only)
|
||||
|
||||
**Returns**: FASTA format sequences
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Get nucleotide sequences
|
||||
gget seq ENSG00000034713 ENSG00000104853
|
||||
|
||||
# Get all protein isoforms
|
||||
gget seq -t -iso ENSG00000034713
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.seq(["ENSG00000034713"], translate=True, isoforms=True)
|
||||
```
|
||||
|
||||
### 2. Sequence Analysis & Alignment
|
||||
|
||||
#### gget blast - BLAST Searches
|
||||
|
||||
BLAST nucleotide or amino acid sequences against standard databases.
|
||||
|
||||
**Parameters**:
|
||||
- `sequence`: Sequence string or path to FASTA/.txt file
|
||||
- `-p/--program`: blastn, blastp, blastx, tblastn, tblastx (auto-detected)
|
||||
- `-db/--database`:
|
||||
- Nucleotide: nt, refseq_rna, pdbnt
|
||||
- Protein: nr, swissprot, pdbaa, refseq_protein
|
||||
- `-l/--limit`: Max hits (default: 50)
|
||||
- `-e/--expect`: E-value cutoff (default: 10.0)
|
||||
- `-lcf/--low_comp_filt`: Enable low complexity filtering
|
||||
- `-mbo/--megablast_off`: Disable MegaBLAST (blastn only)
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# BLAST protein sequence
|
||||
gget blast MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
|
||||
|
||||
# BLAST from file with specific database
|
||||
gget blast sequence.fasta -db swissprot -l 10
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.blast("MKWMFK...", database="swissprot", limit=10)
|
||||
```
|
||||
|
||||
#### gget blat - BLAT Searches
|
||||
|
||||
Locate genomic positions of sequences using UCSC BLAT.
|
||||
|
||||
**Parameters**:
|
||||
- `sequence`: Sequence string or path to FASTA/.txt file
|
||||
- `-st/--seqtype`: 'DNA', 'protein', 'translated%20RNA', 'translated%20DNA' (auto-detected)
|
||||
- `-a/--assembly`: Target assembly (default: 'human'/hg38; options: 'mouse'/mm39, 'zebrafinch'/taeGut2, etc.)
|
||||
|
||||
**Returns**: genome, query size, alignment positions, matches, mismatches, alignment percentage
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Find genomic location in human
|
||||
gget blat ATCGATCGATCGATCG
|
||||
|
||||
# Search in different assembly
|
||||
gget blat -a mm39 ATCGATCGATCGATCG
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.blat("ATCGATCGATCGATCG", assembly="mouse")
|
||||
```
|
||||
|
||||
#### gget muscle - Multiple Sequence Alignment
|
||||
|
||||
Align multiple nucleotide or amino acid sequences using Muscle5.
|
||||
|
||||
**Parameters**:
|
||||
- `fasta`: Sequences or path to FASTA/.txt file
|
||||
- `-s5/--super5`: Use Super5 algorithm for faster processing (large datasets)
|
||||
|
||||
**Returns**: Aligned sequences in ClustalW format or aligned FASTA (.afa)
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Align sequences from file
|
||||
gget muscle sequences.fasta -o aligned.afa
|
||||
|
||||
# Use Super5 for large dataset
|
||||
gget muscle large_dataset.fasta -s5
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.muscle("sequences.fasta", save=True)
|
||||
```
|
||||
|
||||
#### gget diamond - Local Sequence Alignment
|
||||
|
||||
Perform fast local protein or translated DNA alignment using DIAMOND.
|
||||
|
||||
**Parameters**:
|
||||
- Query: Sequences (string/list) or FASTA file path
|
||||
- `--reference`: Reference sequences (string/list) or FASTA file path (required)
|
||||
- `--sensitivity`: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive (default), ultra-sensitive
|
||||
- `--threads`: CPU threads (default: 1)
|
||||
- `--diamond_db`: Save database for reuse
|
||||
- `--translated`: Enable nucleotide-to-amino acid alignment
|
||||
|
||||
**Returns**: Identity percentage, sequence lengths, match positions, gap openings, E-values, bit scores
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Align against reference
|
||||
gget diamond GGETISAWESQME -ref reference.fasta --threads 4
|
||||
|
||||
# Save database for reuse
|
||||
gget diamond query.fasta -ref ref.fasta --diamond_db my_db.dmnd
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.diamond("GGETISAWESQME", reference="reference.fasta", threads=4)
|
||||
```
|
||||
|
||||
### 3. Structural & Protein Analysis
|
||||
|
||||
#### gget pdb - Protein Structures
|
||||
|
||||
Query RCSB Protein Data Bank for structure and metadata.
|
||||
|
||||
**Parameters**:
|
||||
- `pdb_id`: PDB identifier (e.g., '7S7U')
|
||||
- `-r/--resource`: Data type (pdb, entry, pubmed, assembly, entity types)
|
||||
- `-i/--identifier`: Assembly, entity, or chain ID
|
||||
|
||||
**Returns**: PDB format (structures) or JSON (metadata)
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Download PDB structure
|
||||
gget pdb 7S7U -o 7S7U.pdb
|
||||
|
||||
# Get metadata
|
||||
gget pdb 7S7U -r entry
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.pdb("7S7U", save=True)
|
||||
```
|
||||
|
||||
#### gget alphafold - Protein Structure Prediction
|
||||
|
||||
Predict 3D protein structures using simplified AlphaFold2.
|
||||
|
||||
**Setup Required**:
|
||||
```bash
|
||||
# Install OpenMM first (version depends on Python version)
|
||||
# Python < 3.10:
|
||||
conda install -qy conda==4.13.0 && conda install -qy -c conda-forge openmm=7.5.1
|
||||
# Python 3.10:
|
||||
conda install -qy conda==24.1.2 && conda install -qy -c conda-forge openmm=7.7.0
|
||||
# Python 3.11:
|
||||
conda install -qy conda==24.11.1 && conda install -qy -c conda-forge openmm=8.0.0
|
||||
|
||||
# Then setup AlphaFold
|
||||
gget setup alphafold
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
- `sequence`: Amino acid sequence (string), multiple sequences (list), or FASTA file. Multiple sequences trigger multimer modeling
|
||||
- `-mr/--multimer_recycles`: Recycling iterations (default: 3; recommend 20 for accuracy)
|
||||
- `-mfm/--multimer_for_monomer`: Apply multimer model to single proteins
|
||||
- `-r/--relax`: AMBER relaxation for top-ranked model
|
||||
- `plot`: Python-only; generate interactive 3D visualization (default: True)
|
||||
- `show_sidechains`: Python-only; include side chains (default: True)
|
||||
|
||||
**Returns**: PDB structure file, JSON alignment error data, optional 3D visualization
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Predict single protein structure
|
||||
gget alphafold MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
|
||||
|
||||
# Predict multimer with higher accuracy
|
||||
gget alphafold sequence1.fasta -mr 20 -r
|
||||
```
|
||||
|
||||
```python
|
||||
# Python with visualization
|
||||
gget.alphafold("MKWMFK...", plot=True, show_sidechains=True)
|
||||
|
||||
# Multimer prediction
|
||||
gget.alphafold(["sequence1", "sequence2"], multimer_recycles=20)
|
||||
```
|
||||
|
||||
#### gget elm - Eukaryotic Linear Motifs
|
||||
|
||||
Predict Eukaryotic Linear Motifs in protein sequences.
|
||||
|
||||
**Setup Required**:
|
||||
```bash
|
||||
gget setup elm
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
- `sequence`: Amino acid sequence or UniProt accession
- `-u/--uniprot`: Indicates the input is a UniProt accession
|
||||
- `-e/--expand`: Include protein names, organisms, references
|
||||
- `-s/--sensitivity`: DIAMOND alignment sensitivity (default: "very-sensitive")
|
||||
- `-t/--threads`: Number of threads (default: 1)
|
||||
|
||||
**Returns**: Two outputs:
|
||||
1. **ortholog_df**: Linear motifs from orthologous proteins
|
||||
2. **regex_df**: Motifs directly matched in input sequence
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Predict motifs from sequence
|
||||
gget elm LIAQSIGQASFV -o results
|
||||
|
||||
# Use UniProt accession with expanded info
|
||||
gget elm --uniprot Q02410 -e
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
|
||||
```
|
||||
|
||||
### 4. Expression & Disease Data
|
||||
|
||||
#### gget archs4 - Gene Correlation & Tissue Expression
|
||||
|
||||
Query ARCHS4 database for correlated genes or tissue expression data.
|
||||
|
||||
**Parameters**:
|
||||
- `gene`: Gene symbol or Ensembl ID (with `--ensembl` flag)
|
||||
- `-w/--which`: 'correlation' (default, returns 100 most correlated genes) or 'tissue' (expression atlas)
|
||||
- `-s/--species`: 'human' (default) or 'mouse' (tissue data only)
|
||||
- `-e/--ensembl`: Input is Ensembl ID
|
||||
|
||||
**Returns**:
|
||||
- **Correlation mode**: Gene symbols, Pearson correlation coefficients
|
||||
- **Tissue mode**: Tissue identifiers, min/Q1/median/Q3/max expression values
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Get correlated genes
|
||||
gget archs4 ACE2
|
||||
|
||||
# Get tissue expression
|
||||
gget archs4 -w tissue ACE2
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.archs4("ACE2", which="tissue")
|
||||
```
|
||||
|
||||
#### gget cellxgene - Single-Cell RNA-seq Data
|
||||
|
||||
Query CZ CELLxGENE Discover Census for single-cell data.
|
||||
|
||||
**Setup Required**:
|
||||
```bash
|
||||
gget setup cellxgene
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
- `--gene` (-g): Gene names or Ensembl IDs (case-sensitive! 'PAX7' for human, 'Pax7' for mouse)
|
||||
- `--tissue`: Tissue type(s)
|
||||
- `--cell_type`: Specific cell type(s)
|
||||
- `--species` (-s): 'homo_sapiens' (default) or 'mus_musculus'
|
||||
- `--census_version` (-cv): Version ("stable", "latest", or dated)
|
||||
- `--ensembl` (-e): Use Ensembl IDs
|
||||
- `--meta_only` (-mo): Return metadata only
|
||||
- Additional filters: disease, development_stage, sex, assay, dataset_id, donor_id, ethnicity, suspension_type
|
||||
|
||||
**Returns**: AnnData object with count matrices and metadata (or metadata-only dataframes)
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Get single-cell data for specific genes and cell types
|
||||
gget cellxgene --gene ACE2 ABCA1 --tissue lung --cell_type "mucus secreting cell" -o lung_data.h5ad
|
||||
|
||||
# Metadata only
|
||||
gget cellxgene --gene PAX7 --tissue muscle --meta_only -o metadata.csv
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
adata = gget.cellxgene(gene=["ACE2", "ABCA1"], tissue="lung", cell_type="mucus secreting cell")
|
||||
```
|
||||
|
||||
#### gget enrichr - Enrichment Analysis
|
||||
|
||||
Perform ontology enrichment analysis on gene lists using Enrichr.
|
||||
|
||||
**Parameters**:
|
||||
- `genes`: Gene symbols or Ensembl IDs
|
||||
- `-db/--database`: Reference database (supports shortcuts: 'pathway', 'transcription', 'ontology', 'diseases_drugs', 'celltypes')
|
||||
- `-s/--species`: human (default), mouse, fly, yeast, worm, fish
|
||||
- `-bkg_l/--background_list`: Background genes for comparison
|
||||
- `-ko/--kegg_out`: Save KEGG pathway images with highlighted genes
|
||||
- `plot`: Python-only; generate graphical results
|
||||
|
||||
**Database Shortcuts**:
|
||||
- 'pathway' → KEGG_2021_Human
|
||||
- 'transcription' → ChEA_2016
|
||||
- 'ontology' → GO_Biological_Process_2021
|
||||
- 'diseases_drugs' → GWAS_Catalog_2019
|
||||
- 'celltypes' → PanglaoDB_Augmented_2021
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Enrichment analysis for ontology
|
||||
gget enrichr -db ontology ACE2 AGT AGTR1
|
||||
|
||||
# Save KEGG pathways
|
||||
gget enrichr -db pathway ACE2 AGT AGTR1 -ko ./kegg_images/
|
||||
```
|
||||
|
||||
```python
|
||||
# Python with plot
|
||||
gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology", plot=True)
|
||||
```
|
||||
|
||||
#### gget bgee - Orthology & Expression
|
||||
|
||||
Retrieve orthology and gene expression data from Bgee database.
|
||||
|
||||
**Parameters**:
|
||||
- `ens_id`: Ensembl gene ID or NCBI gene ID (for non-Ensembl species). Multiple IDs supported when `type=expression`
|
||||
- `-t/--type`: 'orthologs' (default) or 'expression'
|
||||
|
||||
**Returns**:
|
||||
- **Orthologs mode**: Matching genes across species with IDs, names, taxonomic info
|
||||
- **Expression mode**: Anatomical entities, confidence scores, expression status
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Get orthologs
|
||||
gget bgee ENSG00000169194
|
||||
|
||||
# Get expression data
|
||||
gget bgee ENSG00000169194 -t expression
|
||||
|
||||
# Multiple genes
|
||||
gget bgee ENSBTAG00000047356 ENSBTAG00000018317 -t expression
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.bgee("ENSG00000169194", type="orthologs")
|
||||
```
|
||||
|
||||
#### gget opentargets - Disease & Drug Associations
|
||||
|
||||
Retrieve disease and drug associations from OpenTargets.
|
||||
|
||||
**Parameters**:
|
||||
- Ensembl gene ID (required)
|
||||
- `-r/--resource`: diseases (default), drugs, tractability, pharmacogenetics, expression, depmap, interactions
|
||||
- `-l/--limit`: Cap results count
|
||||
- Filter arguments (vary by resource):
|
||||
- drugs: `--filter_disease`
|
||||
- pharmacogenetics: `--filter_drug`
|
||||
- expression/depmap: `--filter_tissue`, `--filter_anat_sys`, `--filter_organ`
|
||||
- interactions: `--filter_protein_a`, `--filter_protein_b`, `--filter_gene_b`
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Get associated diseases
|
||||
gget opentargets ENSG00000169194 -r diseases -l 5
|
||||
|
||||
# Get associated drugs
|
||||
gget opentargets ENSG00000169194 -r drugs -l 10
|
||||
|
||||
# Get tissue expression
|
||||
gget opentargets ENSG00000169194 -r expression --filter_tissue brain
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.opentargets("ENSG00000169194", resource="diseases", limit=5)
|
||||
```
|
||||
|
||||
#### gget cbio - cBioPortal Cancer Genomics
|
||||
|
||||
Plot cancer genomics heatmaps using cBioPortal data.
|
||||
|
||||
**Two subcommands**:
|
||||
|
||||
**search** - Find study IDs:
|
||||
```bash
|
||||
gget cbio search breast lung
|
||||
```
|
||||
|
||||
**plot** - Generate heatmaps:
|
||||
|
||||
**Parameters**:
|
||||
- `-s/--study_ids`: Space-separated cBioPortal study IDs (required)
|
||||
- `-g/--genes`: Space-separated gene names or Ensembl IDs (required)
|
||||
- `-st/--stratification`: Column to organize data (tissue, cancer_type, cancer_type_detailed, study_id, sample)
|
||||
- `-vt/--variation_type`: Data type (mutation_occurrences, cna_nonbinary, sv_occurrences, cna_occurrences, Consequence)
|
||||
- `-f/--filter`: Filter by column value (e.g., 'study_id:msk_impact_2017')
|
||||
- `-dd/--data_dir`: Cache directory (default: ./gget_cbio_cache)
|
||||
- `-fd/--figure_dir`: Output directory (default: ./gget_cbio_figures)
|
||||
- `-dpi`: Resolution (default: 100)
|
||||
- `-sh/--show`: Display plot in window
|
||||
- `-nc/--no_confirm`: Skip download confirmations
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Search for studies
|
||||
gget cbio search esophag ovary
|
||||
|
||||
# Create heatmap
|
||||
gget cbio plot -s msk_impact_2017 -g AKT1 ALK BRAF -st tissue -vt mutation_occurrences
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.cbio_search(["esophag", "ovary"])
|
||||
gget.cbio_plot(["msk_impact_2017"], ["AKT1", "ALK"], stratification="tissue")
|
||||
```
|
||||
|
||||
#### gget cosmic - COSMIC Database
|
||||
|
||||
Search COSMIC (Catalogue Of Somatic Mutations In Cancer) database.
|
||||
|
||||
**Important**: License fees apply for commercial use. Requires COSMIC account credentials.
|
||||
|
||||
**Parameters**:
|
||||
- `searchterm`: Gene name, Ensembl ID, mutation notation, or sample ID
|
||||
- `-ctp/--cosmic_tsv_path`: Path to downloaded COSMIC TSV file (required for querying)
|
||||
- `-l/--limit`: Maximum results (default: 100)
|
||||
|
||||
**Database download flags**:
|
||||
- `-d/--download_cosmic`: Activate download mode
|
||||
- `-gm/--gget_mutate`: Create version for gget mutate
|
||||
- `-cp/--cosmic_project`: Database type (cancer, census, cell_line, resistance, genome_screen, targeted_screen)
|
||||
- `-cv/--cosmic_version`: COSMIC version
|
||||
- `-gv/--grch_version`: Human reference genome (37 or 38)
|
||||
- `--email`, `--password`: COSMIC credentials
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# First download database
|
||||
gget cosmic -d --email user@example.com --password xxx -cp cancer
|
||||
|
||||
# Then query
|
||||
gget cosmic EGFR -ctp cosmic_data.tsv -l 10
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
|
||||
```
|
||||
|
||||
### 5. Additional Tools
|
||||
|
||||
#### gget mutate - Generate Mutated Sequences
|
||||
|
||||
Generate mutated nucleotide sequences from mutation annotations.
|
||||
|
||||
**Parameters**:
|
||||
- `sequences`: FASTA file path or direct sequence input (string/list)
|
||||
- `-m/--mutations`: CSV/TSV file or DataFrame with mutation data (required)
|
||||
- `-mc/--mut_column`: Mutation column name (default: 'mutation')
|
||||
- `-sic/--seq_id_column`: Sequence ID column (default: 'seq_ID')
|
||||
- `-mic/--mut_id_column`: Mutation ID column
|
||||
- `-k/--k`: Length of flanking sequences (default: 30 nucleotides)
|
||||
|
||||
**Returns**: Mutated sequences in FASTA format
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Single mutation
|
||||
gget mutate ATCGCTAAGCT -m "c.4G>T"
|
||||
|
||||
# Multiple sequences with mutations from file
|
||||
gget mutate sequences.fasta -m mutations.csv -o mutated.fasta
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
import pandas as pd
|
||||
mutations_df = pd.DataFrame({"seq_ID": ["seq1"], "mutation": ["c.4G>T"]})
|
||||
gget.mutate(["ATCGCTAAGCT"], mutations=mutations_df)
|
||||
```
|
||||
|
||||
#### gget gpt - OpenAI Text Generation
|
||||
|
||||
Generate natural language text using OpenAI's API.
|
||||
|
||||
**Setup Required**:
|
||||
```bash
|
||||
gget setup gpt
|
||||
```
|
||||
|
||||
**Important**: Free tier limited to 3 months after account creation. Set monthly billing limits.
|
||||
|
||||
**Parameters**:
|
||||
- `prompt`: Text input for generation (required)
|
||||
- `api_key`: OpenAI authentication (required)
|
||||
- Model configuration: temperature, top_p, max_tokens, frequency_penalty, presence_penalty
|
||||
- Default model: gpt-3.5-turbo (configurable)
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
gget gpt "Explain CRISPR" --api_key your_key_here
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.gpt("Explain CRISPR", api_key="your_key_here")
|
||||
```
|
||||
|
||||
#### gget setup - Install Dependencies
|
||||
|
||||
Install/download third-party dependencies for specific modules.
|
||||
|
||||
**Parameters**:
|
||||
- `module`: Module name requiring dependency installation
|
||||
- `-o/--out`: Output folder path (elm module only)
|
||||
|
||||
**Modules requiring setup**:
|
||||
- `alphafold` - Downloads ~4GB of model parameters
|
||||
- `cellxgene` - Installs cellxgene-census (may not support latest Python)
|
||||
- `elm` - Downloads local ELM database
|
||||
- `gpt` - Configures OpenAI integration
|
||||
|
||||
**Examples**:
|
||||
```bash
|
||||
# Setup AlphaFold
|
||||
gget setup alphafold
|
||||
|
||||
# Setup ELM with custom directory
|
||||
gget setup elm -o /path/to/elm_data
|
||||
```
|
||||
|
||||
```python
|
||||
# Python
|
||||
gget.setup("alphafold")
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Workflow 1: Gene Discovery to Sequence Analysis
|
||||
|
||||
Find and analyze genes of interest:
|
||||
|
||||
```python
|
||||
# 1. Search for genes
|
||||
results = gget.search(["GABA", "receptor"], species="homo_sapiens")
|
||||
|
||||
# 2. Get detailed information
|
||||
gene_ids = results["ensembl_id"].tolist()
|
||||
info = gget.info(gene_ids[:5])
|
||||
|
||||
# 3. Retrieve sequences
|
||||
sequences = gget.seq(gene_ids[:5], translate=True)
|
||||
```
|
||||
|
||||
### Workflow 2: Sequence Alignment and Structure
|
||||
|
||||
Align sequences and predict structures:
|
||||
|
||||
```python
|
||||
# 1. Align multiple sequences
|
||||
alignment = gget.muscle("sequences.fasta")
|
||||
|
||||
# 2. Find similar sequences
|
||||
blast_results = gget.blast(my_sequence, database="swissprot", limit=10)
|
||||
|
||||
# 3. Predict structure
|
||||
structure = gget.alphafold(my_sequence, plot=True)
|
||||
|
||||
# 4. Find linear motifs
|
||||
ortholog_df, regex_df = gget.elm(my_sequence)
|
||||
```
|
||||
|
||||
### Workflow 3: Gene Expression and Enrichment
|
||||
|
||||
Analyze expression patterns and functional enrichment:
|
||||
|
||||
```python
|
||||
# 1. Get tissue expression
|
||||
tissue_expr = gget.archs4("ACE2", which="tissue")
|
||||
|
||||
# 2. Find correlated genes
|
||||
correlated = gget.archs4("ACE2", which="correlation")
|
||||
|
||||
# 3. Get single-cell data
|
||||
adata = gget.cellxgene(gene=["ACE2"], tissue="lung", cell_type="epithelial cell")
|
||||
|
||||
# 4. Perform enrichment analysis
|
||||
gene_list = correlated["gene_symbol"].tolist()[:50]
|
||||
enrichment = gget.enrichr(gene_list, database="ontology", plot=True)
|
||||
```
|
||||
|
||||
### Workflow 4: Disease and Drug Analysis
|
||||
|
||||
Investigate disease associations and therapeutic targets:
|
||||
|
||||
```python
|
||||
# 1. Search for genes
|
||||
genes = gget.search(["breast cancer"], species="homo_sapiens")
|
||||
|
||||
# 2. Get disease associations
|
||||
diseases = gget.opentargets("ENSG00000169194", resource="diseases")
|
||||
|
||||
# 3. Get drug associations
|
||||
drugs = gget.opentargets("ENSG00000169194", resource="drugs")
|
||||
|
||||
# 4. Query cancer genomics data
|
||||
study_ids = gget.cbio_search(["breast"])
|
||||
gget.cbio_plot(study_ids[:2], ["BRCA1", "BRCA2"], stratification="cancer_type")
|
||||
|
||||
# 5. Search COSMIC for mutations
|
||||
cosmic_results = gget.cosmic("BRCA1", cosmic_tsv_path="cosmic.tsv")
|
||||
```
|
||||
|
||||
### Workflow 5: Comparative Genomics
|
||||
|
||||
Compare proteins across species:
|
||||
|
||||
```python
|
||||
# 1. Get orthologs
|
||||
orthologs = gget.bgee("ENSG00000169194", type="orthologs")
|
||||
|
||||
# 2. Get sequences for comparison
|
||||
human_seq = gget.seq("ENSG00000169194", translate=True)
|
||||
mouse_seq = gget.seq("ENSMUSG00000026091", translate=True)
|
||||
|
||||
# 3. Align sequences
|
||||
alignment = gget.muscle([human_seq, mouse_seq])
|
||||
|
||||
# 4. Compare structures
|
||||
human_structure = gget.pdb("7S7U")
|
||||
mouse_structure = gget.alphafold(mouse_seq)
|
||||
```
|
||||
|
||||
### Workflow 6: Building Reference Indices
|
||||
|
||||
Prepare reference data for downstream analysis (e.g., kallisto|bustools):
|
||||
|
||||
```bash
|
||||
# 1. List available species
|
||||
gget ref --list_species
|
||||
|
||||
# 2. Download reference files
|
||||
gget ref -w gtf -w cdna -d homo_sapiens
|
||||
|
||||
# 3. Build kallisto index
|
||||
kallisto index -i transcriptome.idx transcriptome.fasta
|
||||
|
||||
# 4. Download genome for alignment
|
||||
gget ref -w dna -d homo_sapiens
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Data Retrieval
|
||||
- Use `--limit` to control result sizes for large queries
|
||||
- Save results with `-o/--out` for reproducibility
|
||||
- Check database versions/releases for consistency across analyses
|
||||
- Use `--quiet` in production scripts to reduce output
|
||||
|
||||
### Sequence Analysis
|
||||
- For BLAST/BLAT, start with default parameters, then adjust sensitivity
|
||||
- Use `gget diamond` with `--threads` for faster local alignment
|
||||
- Save DIAMOND databases with `--diamond_db` for repeated queries
|
||||
- For multiple sequence alignment, use `-s5/--super5` for large datasets
|
||||
|
||||
### Expression and Disease Data
|
||||
- Gene symbols are case-sensitive in cellxgene (e.g., 'PAX7' vs 'Pax7')
|
||||
- Run `gget setup` before first use of alphafold, cellxgene, elm, gpt
|
||||
- For enrichment analysis, use database shortcuts for convenience
|
||||
- Cache cBioPortal data with `-dd` to avoid repeated downloads
|
||||
|
||||
### Structure Prediction
|
||||
- AlphaFold multimer predictions: use `-mr 20` for higher accuracy
|
||||
- Use `-r` flag for AMBER relaxation of final structures
|
||||
- Visualize results in Python with `plot=True`
|
||||
- Check PDB database first before running AlphaFold predictions
|
||||
|
||||
### Error Handling
|
||||
- Database structures change; update gget regularly: `pip install --upgrade gget`
|
||||
- Process max ~1000 Ensembl IDs at once with gget info
|
||||
- For large-scale analyses, implement rate limiting for API queries
|
||||
- Use virtual environments to avoid dependency conflicts
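
A hedged sketch of the batching and rate-limiting points above, applied to `gget info` (the ID list, batch size, and delay are illustrative):

```python
import time
import pandas as pd
import gget

# Placeholder list; in practice this could hold thousands of Ensembl IDs
ens_ids = ["ENSG00000034713", "ENSG00000104853", "ENSG00000170296"]

batch_size = 500  # stay well below the ~1000-ID limit of gget info
frames = []
for start in range(0, len(ens_ids), batch_size):
    frames.append(gget.info(ens_ids[start:start + batch_size]))
    time.sleep(5)  # crude rate limiting between API-heavy calls

info_df = pd.concat(frames, ignore_index=True)
print(info_df.shape)
```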
|
||||
|
||||
## Output Formats
|
||||
|
||||
### Command-line
|
||||
- Default: JSON
|
||||
- CSV: Add `-csv` flag
|
||||
- FASTA: gget seq, gget mutate
|
||||
- PDB: gget pdb, gget alphafold
|
||||
- PNG: gget cbio plot
|
||||
|
||||
### Python
|
||||
- Default: DataFrame or dictionary
|
||||
- JSON: Add `json=True` parameter
|
||||
- Save to file: Add `save=True` or specify `out="filename"`
|
||||
- AnnData: gget cellxgene
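
For example, in Python (exact parameter support can vary slightly between modules — check the module reference):

```python
import gget

# Default: a pandas DataFrame
df = gget.search(["ace2"], species="homo_sapiens")

# json=True returns JSON-style output instead; save=True writes the result to disk
raw = gget.search(["ace2"], species="homo_sapiens", json=True, save=True)
```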
|
||||
|
||||
## Resources
|
||||
|
||||
This skill includes reference documentation for detailed module information:
|
||||
|
||||
### references/
|
||||
- `module_reference.md` - Comprehensive parameter reference for all modules
|
||||
- `database_info.md` - Information about queried databases and their update frequencies
|
||||
- `workflows.md` - Extended workflow examples and use cases
|
||||
|
||||
For additional help:
|
||||
- Official documentation: https://pachterlab.github.io/gget/
|
||||
- GitHub issues: https://github.com/pachterlab/gget/issues
|
||||
- Citation: Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836
|
||||
300
scientific-packages/gget/references/database_info.md
Normal file
@@ -0,0 +1,300 @@
|
||||
# gget Database Information
|
||||
|
||||
Overview of databases queried by gget modules, including update frequencies and important considerations.
|
||||
|
||||
## Important Note
|
||||
|
||||
The databases queried by gget are continuously being updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary. Always keep gget updated:
|
||||
|
||||
```bash
|
||||
pip install --upgrade gget
|
||||
```
|
||||
|
||||
## Database Directory
|
||||
|
||||
### Genomic Reference Databases
|
||||
|
||||
#### Ensembl
|
||||
- **Used by:** gget ref, gget search, gget info, gget seq
|
||||
- **Description:** Comprehensive genome database with annotations for vertebrate and invertebrate species
|
||||
- **Update frequency:** Regular releases (numbered); new releases approximately every 3 months
|
||||
- **Access:** FTP downloads, REST API
|
||||
- **Website:** https://www.ensembl.org/
|
||||
- **Notes:**
|
||||
- Supports both vertebrate and invertebrate genomes
|
||||
- Can specify release number for reproducibility
|
||||
- Shortcuts available for common species ('human', 'mouse')
|
||||
|
||||
#### UCSC Genome Browser
|
||||
- **Used by:** gget blat
|
||||
- **Description:** Genome browser database with BLAT alignment tool
|
||||
- **Update frequency:** Regular updates with new assemblies
|
||||
- **Access:** Web service API
|
||||
- **Website:** https://genome.ucsc.edu/
|
||||
- **Notes:**
|
||||
- Multiple genome assemblies available (hg38, mm39, etc.)
|
||||
- BLAT optimized for vertebrate genomes
|
||||
|
||||
### Protein & Structure Databases
|
||||
|
||||
#### UniProt
|
||||
- **Used by:** gget info, gget seq (amino acid sequences), gget elm
|
||||
- **Description:** Universal Protein Resource, comprehensive protein sequence and functional information
|
||||
- **Update frequency:** Regular releases (weekly for Swiss-Prot, monthly for TrEMBL)
|
||||
- **Access:** REST API
|
||||
- **Website:** https://www.uniprot.org/
|
||||
- **Notes:**
|
||||
- Swiss-Prot: manually annotated and reviewed
|
||||
- TrEMBL: automatically annotated
|
||||
|
||||
#### NCBI (National Center for Biotechnology Information)
|
||||
- **Used by:** gget info, gget bgee (for non-Ensembl species)
|
||||
- **Description:** Gene and protein databases with extensive cross-references
|
||||
- **Update frequency:** Continuous updates
|
||||
- **Access:** E-utilities API
|
||||
- **Website:** https://www.ncbi.nlm.nih.gov/
|
||||
- **Databases:** Gene, Protein, RefSeq
|
||||
|
||||
#### RCSB PDB (Protein Data Bank)
|
||||
- **Used by:** gget pdb
|
||||
- **Description:** Repository of 3D structural data for proteins and nucleic acids
|
||||
- **Update frequency:** Weekly updates
|
||||
- **Access:** REST API
|
||||
- **Website:** https://www.rcsb.org/
|
||||
- **Notes:**
|
||||
- Experimentally determined structures (X-ray, NMR, cryo-EM)
|
||||
- Includes metadata about experiments and publications
|
||||
|
||||
#### ELM (Eukaryotic Linear Motif)
|
||||
- **Used by:** gget elm
|
||||
- **Description:** Database of functional sites in eukaryotic proteins
|
||||
- **Update frequency:** Periodic updates
|
||||
- **Access:** Downloaded database (via gget setup elm)
|
||||
- **Website:** http://elm.eu.org/
|
||||
- **Notes:**
|
||||
- Requires local download before first use
|
||||
- Contains validated motifs and patterns
|
||||
|
||||
### Sequence Similarity Databases
|
||||
|
||||
#### BLAST Databases (NCBI)
|
||||
- **Used by:** gget blast
|
||||
- **Description:** Pre-formatted databases for BLAST searches
|
||||
- **Update frequency:** Regular updates
|
||||
- **Access:** NCBI BLAST API
|
||||
- **Databases:**
|
||||
- **Nucleotide:** nt (all GenBank), refseq_rna, pdbnt
|
||||
- **Protein:** nr (non-redundant), swissprot, pdbaa, refseq_protein
|
||||
- **Notes:**
|
||||
- nt and nr are very large databases
|
||||
- Consider specialized databases for faster, more focused searches
|
||||
|
||||
### Expression & Correlation Databases
|
||||
|
||||
#### ARCHS4
|
||||
- **Used by:** gget archs4
|
||||
- **Description:** Massive mining of publicly available RNA-seq data
|
||||
- **Update frequency:** Periodic updates with new samples
|
||||
- **Access:** HTTP API
|
||||
- **Website:** https://maayanlab.cloud/archs4/
|
||||
- **Data:**
|
||||
- Human and mouse RNA-seq data
|
||||
- Correlation matrices
|
||||
- Tissue expression atlases
|
||||
- **Citation:** Lachmann et al., Nature Communications, 2018
|
||||
|
||||
#### CZ CELLxGENE Discover
|
||||
- **Used by:** gget cellxgene
|
||||
- **Description:** Single-cell RNA-seq data from multiple studies
|
||||
- **Update frequency:** Continuous additions of new datasets
|
||||
- **Access:** Census API (via cellxgene-census package)
|
||||
- **Website:** https://cellxgene.cziscience.com/
|
||||
- **Data:**
|
||||
- Single-cell RNA-seq count matrices
|
||||
- Cell type annotations
|
||||
- Tissue and disease metadata
|
||||
- **Notes:**
|
||||
- Requires gget setup cellxgene
|
||||
- Gene symbols are case-sensitive
|
||||
- May not support latest Python versions
|
||||
|
||||
#### Bgee
|
||||
- **Used by:** gget bgee
|
||||
- **Description:** Gene expression and orthology database
|
||||
- **Update frequency:** Regular releases
|
||||
- **Access:** REST API
|
||||
- **Website:** https://www.bgee.org/
|
||||
- **Data:**
|
||||
- Gene expression across tissues and developmental stages
|
||||
- Orthology relationships across species
|
||||
- **Citation:** Bastian et al., 2021
|
||||
|
||||
### Functional & Pathway Databases
|
||||
|
||||
#### Enrichr / modEnrichr
|
||||
- **Used by:** gget enrichr
|
||||
- **Description:** Gene set enrichment analysis web service
|
||||
- **Update frequency:** Regular updates to underlying databases
|
||||
- **Access:** REST API
|
||||
- **Website:** https://maayanlab.cloud/Enrichr/
|
||||
- **Databases included:**
|
||||
- KEGG pathways
|
||||
- Gene Ontology (GO)
|
||||
- Transcription factor targets (ChEA)
|
||||
- Disease associations (GWAS Catalog)
|
||||
- Cell type markers (PanglaoDB)
|
||||
- **Notes:**
|
||||
- Supports multiple model organisms
|
||||
- Background gene lists can be provided for custom enrichment
|
||||
|
||||
### Disease & Drug Databases
|
||||
|
||||
#### Open Targets
|
||||
- **Used by:** gget opentargets
|
||||
- **Description:** Integrative platform for disease-target associations
|
||||
- **Update frequency:** Regular releases (quarterly)
|
||||
- **Access:** GraphQL API
|
||||
- **Website:** https://www.opentargets.org/
|
||||
- **Data:**
|
||||
- Disease associations
|
||||
- Drug information and clinical trials
|
||||
- Target tractability
|
||||
- Pharmacogenetics
|
||||
- Gene expression
|
||||
- DepMap gene-disease effects
|
||||
- Protein-protein interactions
|
||||
|
||||
#### cBioPortal
|
||||
- **Used by:** gget cbio
|
||||
- **Description:** Cancer genomics data portal
|
||||
- **Update frequency:** Continuous addition of new studies
|
||||
- **Access:** Web API, downloadable datasets
|
||||
- **Website:** https://www.cbioportal.org/
|
||||
- **Data:**
|
||||
- Mutations, copy number alterations, structural variants
|
||||
- Gene expression
|
||||
- Clinical data
|
||||
- **Notes:**
|
||||
- Large datasets; caching recommended
|
||||
- Multiple cancer types and studies available
|
||||
|
||||
#### COSMIC (Catalogue Of Somatic Mutations In Cancer)
|
||||
- **Used by:** gget cosmic
|
||||
- **Description:** Comprehensive cancer mutation database
|
||||
- **Update frequency:** Regular releases
|
||||
- **Access:** Download (requires account and license for commercial use)
|
||||
- **Website:** https://cancer.sanger.ac.uk/cosmic
|
||||
- **Data:**
|
||||
- Somatic mutations in cancer
|
||||
- Gene census
|
||||
- Cell line data
|
||||
- Drug resistance mutations
|
||||
- **Important:**
|
||||
- Free for academic use
|
||||
- License fees apply for commercial use
|
||||
- Requires COSMIC account credentials
|
||||
- Must download database before querying
|
||||
|
||||
### AI & Prediction Services
|
||||
|
||||
#### AlphaFold2 (DeepMind)
|
||||
- **Used by:** gget alphafold
|
||||
- **Description:** Deep learning model for protein structure prediction
|
||||
- **Model version:** Simplified version for local execution
|
||||
- **Access:** Local computation (requires model download via gget setup)
|
||||
- **Website:** https://alphafold.ebi.ac.uk/
|
||||
- **Notes:**
|
||||
- Requires ~4GB model parameters download
|
||||
- Requires OpenMM installation
|
||||
- Computationally intensive
|
||||
- Python version-specific requirements
|
||||
|
||||
#### OpenAI API
|
||||
- **Used by:** gget gpt
|
||||
- **Description:** Large language model API
|
||||
- **Update frequency:** New models released periodically
|
||||
- **Access:** REST API (requires API key)
|
||||
- **Website:** https://openai.com/
|
||||
- **Notes:**
|
||||
- Default model: gpt-3.5-turbo
|
||||
- Free tier limited to 3 months after account creation
|
||||
- Set billing limits to control costs
|
||||
|
||||
## Data Consistency & Reproducibility
|
||||
|
||||
### Version Control
|
||||
To ensure reproducibility in analyses:
|
||||
|
||||
1. **Specify database versions/releases:**
|
||||
```python
|
||||
# Use specific Ensembl release
|
||||
gget.ref("homo_sapiens", release=110)
|
||||
|
||||
# Use specific Census version
|
||||
gget.cellxgene(gene=["PAX7"], census_version="2023-07-25")
|
||||
```
|
||||
|
||||
2. **Document gget version:**
|
||||
```python
|
||||
import gget
|
||||
print(gget.__version__)
|
||||
```
|
||||
|
||||
3. **Save raw data:**
|
||||
```python
|
||||
# Always save results for reproducibility
|
||||
results = gget.search(["ACE2"], species="homo_sapiens")
|
||||
results.to_csv("search_results_2025-01-15.csv", index=False)
|
||||
```
|
||||
|
||||
### Handling Database Updates
|
||||
|
||||
1. **Regular gget updates:**
|
||||
- Update gget biweekly to match database structure changes
|
||||
- Check release notes for breaking changes
|
||||
|
||||
2. **Error handling:**
|
||||
- Database structure changes may cause temporary failures
|
||||
- Check GitHub issues: https://github.com/pachterlab/gget/issues
|
||||
- Update gget if errors occur
|
||||
|
||||
3. **API rate limiting:**
|
||||
- Implement delays for large-scale queries
|
||||
- Use local databases (DIAMOND, COSMIC) when possible
|
||||
- Cache results to avoid repeated queries
|
||||
|
||||
## Database-Specific Best Practices
|
||||
|
||||
### Ensembl
|
||||
- Use species shortcuts ('human', 'mouse') for convenience
|
||||
- Specify release numbers for reproducibility
|
||||
- Check available species with `gget ref --list_species`
|
||||
|
||||
### UniProt
|
||||
- UniProt IDs are more stable than gene names
|
||||
- Swiss-Prot annotations are manually curated and more reliable
|
||||
- Use PDB flag in gget info only when needed (increases runtime)
|
||||
|
||||
### BLAST/BLAT
|
||||
- Start with default parameters, then optimize
|
||||
- Use specialized databases (swissprot, refseq_protein) for focused searches
|
||||
- Consider E-value cutoffs based on query length
|
||||
|
||||
### Expression Databases
|
||||
- Gene symbols are case-sensitive in CELLxGENE
|
||||
- ARCHS4 correlation data is based on co-expression patterns
|
||||
- Consider tissue-specificity when interpreting results
|
||||
|
||||
### Cancer Databases
|
||||
- cBioPortal: cache data locally for repeated analyses
|
||||
- COSMIC: download appropriate database subset for your needs
|
||||
- Respect license agreements for commercial use
|
||||
|
||||
## Citations
|
||||
|
||||
When using gget, cite both the gget publication and the underlying databases:
|
||||
|
||||
**gget:**
|
||||
Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836
|
||||
|
||||
**Database-specific citations:** Check references/ directory or database websites for appropriate citations.
|
||||
467
scientific-packages/gget/references/module_reference.md
Normal file
@@ -0,0 +1,467 @@
|
||||
# gget Module Reference
|
||||
|
||||
Comprehensive parameter reference for all gget modules.
|
||||
|
||||
## Reference & Gene Information Modules
|
||||
|
||||
### gget ref
|
||||
Retrieve Ensembl reference genome FTPs and metadata.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `species` | str | Species in Genus_species format or shortcuts ('human', 'mouse') | Required |
|
||||
| `-w/--which` | str | File types to return: gtf, cdna, dna, cds, cdrna, pep | All |
|
||||
| `-r/--release` | int | Ensembl release number | Latest |
|
||||
| `-od/--out_dir` | str | Output directory path | None |
|
||||
| `-o/--out` | str | JSON file path for results | None |
|
||||
| `-l/--list_species` | flag | List available vertebrate species | False |
|
||||
| `-liv/--list_iv_species` | flag | List available invertebrate species | False |
|
||||
| `-ftp` | flag | Return only FTP links | False |
|
||||
| `-d/--download` | flag | Download files (requires curl) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress information | False |
|
||||
|
||||
**Returns:** JSON containing FTP links, Ensembl release numbers, release dates, file sizes
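**Example (Python, illustrative):** a minimal sketch; the release number is a placeholder, not a value required by gget.

```python
import gget

# FTP link and metadata for the human GTF annotation, pinned to a specific release
ref_info = gget.ref("homo_sapiens", which="gtf", release=110)
```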
|
||||
|
||||
---
|
||||
|
||||
### gget search
|
||||
Search for genes by name or description in Ensembl.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `searchwords` | str/list | Search terms (case-insensitive) | Required |
|
||||
| `-s/--species` | str | Target species or core database name | Required |
|
||||
| `-r/--release` | int | Ensembl release number | Latest |
|
||||
| `-t/--id_type` | str | Return 'gene' or 'transcript' | 'gene' |
|
||||
| `-ao/--andor` | str | 'or' (ANY term) or 'and' (ALL terms) | 'or' |
|
||||
| `-l/--limit` | int | Maximum results to return | None |
|
||||
| `-o/--out` | str | Output file path (CSV/JSON) | None |
|
||||
|
||||
**Returns:** ensembl_id, gene_name, ensembl_description, ext_ref_description, biotype, URL
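**Example (Python, illustrative):** the search terms below are only examples.

```python
import gget

# Genes whose Ensembl names/descriptions mention GABA receptors
hits = gget.search(["gaba", "gamma-aminobutyric"], species="homo_sapiens", limit=10)
```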
|
||||
|
||||
---
|
||||
|
||||
### gget info
|
||||
Get comprehensive gene/transcript metadata from Ensembl, UniProt, and NCBI.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `ens_ids` | str/list | Ensembl IDs (WormBase, Flybase also supported) | Required |
|
||||
| `-o/--out` | str | Output file path (CSV/JSON) | None |
|
||||
| `-n/--ncbi` | bool | Disable NCBI data retrieval | False |
|
||||
| `-u/--uniprot` | bool | Disable UniProt data retrieval | False |
|
||||
| `-pdb` | bool | Include PDB identifiers | False |
|
||||
| `-csv` | flag | Return CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress display | False |
|
||||
|
||||
**Python-specific:**
|
||||
- `save=True`: Save output to current directory
|
||||
- `wrap_text=True`: Format dataframe with wrapped text
|
||||
|
||||
**Note:** Processing >1000 IDs simultaneously may cause server errors.
|
||||
|
||||
**Returns:** UniProt ID, NCBI gene ID, gene name, synonyms, protein names, descriptions, biotype, canonical transcript
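**Example (Python, illustrative):** reuses the ACE2 Ensembl ID that appears in the workflow examples of this skill.

```python
import gget

# Combined Ensembl/UniProt/NCBI metadata for ACE2; set pdb=True only if PDB IDs are needed
info_df = gget.info(["ENSG00000130234"], pdb=False)
```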
|
||||
|
||||
---
|
||||
|
||||
### gget seq
|
||||
Retrieve nucleotide or amino acid sequences in FASTA format.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `ens_ids` | str/list | Ensembl identifiers | Required |
|
||||
| `-o/--out` | str | Output file path | stdout |
|
||||
| `-t/--translate` | flag | Fetch amino acid sequences | False |
|
||||
| `-iso/--isoforms` | flag | Return all transcript variants | False |
|
||||
| `-q/--quiet` | flag | Suppress progress information | False |
|
||||
|
||||
**Data sources:** Ensembl (nucleotide), UniProt (amino acid)
|
||||
|
||||
**Returns:** FASTA format sequences
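**Example (Python, illustrative):** the Ensembl ID is reused from the workflow examples in this skill.

```python
import gget

# Nucleotide FASTA, then the translated amino acid FASTA, for the same gene
nt_fasta = gget.seq("ENSG00000130234")
aa_fasta = gget.seq("ENSG00000130234", translate=True)
```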
|
||||
|
||||
---
|
||||
|
||||
## Sequence Analysis & Alignment Modules
|
||||
|
||||
### gget blast
|
||||
BLAST sequences against standard databases.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `sequence` | str | Sequence or path to FASTA/.txt | Required |
|
||||
| `-p/--program` | str | blastn, blastp, blastx, tblastn, tblastx | Auto-detect |
|
||||
| `-db/--database` | str | nt, refseq_rna, pdbnt, nr, swissprot, pdbaa, refseq_protein | nt or nr |
|
||||
| `-l/--limit` | int | Max hits returned | 50 |
|
||||
| `-e/--expect` | float | E-value cutoff | 10.0 |
|
||||
| `-lcf/--low_comp_filt` | flag | Enable low complexity filtering | False |
|
||||
| `-mbo/--megablast_off` | flag | Disable MegaBLAST (blastn only) | False |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:** Description, Scientific Name, Common Name, Taxid, Max Score, Total Score, Query Coverage
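**Example (Python, illustrative):** the peptide below is a toy query; the program is auto-detected from the sequence.

```python
import gget

# BLAST a short peptide against SwissProt and keep only the top 5 hits
blast_df = gget.blast("MKWVTFISLLLLFSSAYSRGV", database="swissprot", limit=5)
```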
|
||||
|
||||
---
|
||||
|
||||
### gget blat
|
||||
Find genomic positions using UCSC BLAT.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `sequence` | str | Sequence or path to FASTA/.txt | Required |
|
||||
| `-st/--seqtype` | str | 'DNA', 'protein', 'translated%20RNA', 'translated%20DNA' | Auto-detect |
|
||||
| `-a/--assembly` | str | Target assembly (hg38, mm39, taeGut2, etc.) | 'human'/hg38 |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-csv` | flag | Return CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:** genome, query size, alignment start/end, matches, mismatches, alignment percentage
|
||||
|
||||
---
|
||||
|
||||
### gget muscle
|
||||
Align multiple sequences using Muscle5.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `fasta` | str/list | Sequences or FASTA file path | Required |
|
||||
| `-o/--out` | str | Output file path | stdout |
|
||||
| `-s5/--super5` | flag | Use Super5 algorithm (faster, large datasets) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:** ClustalW format alignment or aligned FASTA (.afa)
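**Example (Python, illustrative):** the sequences and output path are placeholders.

```python
import gget

# Align two short peptides and write the result to an aligned FASTA file
gget.muscle(["MSSPTPPGGQRTLQKRK", "MSSPTIPGGQRTLQKRK"], out="aligned.afa")
```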
|
||||
|
||||
---
|
||||
|
||||
### gget diamond
|
||||
Fast local protein/translated DNA alignment.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `query` | str/list | Query sequences or FASTA file | Required |
|
||||
| `--reference` | str/list | Reference sequences or FASTA file | Required |
|
||||
| `--sensitivity` | str | fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, ultra-sensitive | very-sensitive |
|
||||
| `--threads` | int | CPU threads | 1 |
|
||||
| `--diamond_binary` | str | Path to DIAMOND installation | Auto-detect |
|
||||
| `--diamond_db` | str | Save database for reuse | None |
|
||||
| `--translated` | flag | Enable nucleotide-to-amino acid alignment | False |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-csv` | flag | CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:** Identity %, sequence lengths, match positions, gap openings, E-values, bit scores
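**Example (Python, illustrative):** query and reference below are toy peptides passed directly as strings rather than FASTA files.

```python
import gget

# Local DIAMOND alignment of one query peptide against one reference peptide
dm_df = gget.diamond("GGETISAWESQME", reference="GGETISAWESQMEAGPK", threads=2)
```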
|
||||
|
||||
---
|
||||
|
||||
## Structural & Protein Analysis Modules
|
||||
|
||||
### gget pdb
|
||||
Query RCSB Protein Data Bank.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `pdb_id` | str | PDB identifier (e.g., '7S7U') | Required |
|
||||
| `-r/--resource` | str | pdb, entry, pubmed, assembly, entity types | 'pdb' |
|
||||
| `-i/--identifier` | str | Assembly, entity, or chain ID | None |
|
||||
| `-o/--out` | str | Output file path | stdout |
|
||||
|
||||
**Returns:** PDB format (structures) or JSON (metadata)
|
||||
|
||||
---
|
||||
|
||||
### gget alphafold
|
||||
Predict 3D protein structures using AlphaFold2.
|
||||
|
||||
**Setup:** Requires OpenMM and `gget setup alphafold` (~4GB download)
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `sequence` | str/list | Amino acid sequence(s) or FASTA file | Required |
|
||||
| `-mr/--multimer_recycles` | int | Recycling iterations for multimers | 3 |
|
||||
| `-o/--out` | str | Output folder path | timestamped |
|
||||
| `-mfm/--multimer_for_monomer` | flag | Apply multimer model to monomers | False |
|
||||
| `-r/--relax` | flag | AMBER relaxation for top model | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Python-only:**
|
||||
- `plot` (bool): Generate 3D visualization (default: True)
|
||||
- `show_sidechains` (bool): Include side chains (default: True)
|
||||
|
||||
**Note:** Multiple sequences automatically trigger multimer modeling
|
||||
|
||||
**Returns:** PDB structure file, JSON alignment error data, optional 3D plot
|
||||
|
||||
---
|
||||
|
||||
### gget elm
|
||||
Predict Eukaryotic Linear Motifs.
|
||||
|
||||
**Setup:** Requires `gget setup elm`
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `sequence` | str | Amino acid sequence or UniProt Acc | Required |
|
||||
| `-s/--sensitivity` | str | DIAMOND alignment sensitivity | very-sensitive |
|
||||
| `-t/--threads` | int | Number of threads | 1 |
|
||||
| `-bin/--diamond_binary` | str | Path to DIAMOND binary | Auto-detect |
|
||||
| `-o/--out` | str | Output directory path | None |
|
||||
| `-u/--uniprot` | flag | Input is UniProt Acc | False |
|
||||
| `-e/--expand` | flag | Include protein names, organisms, references | False |
|
||||
| `-csv` | flag | CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:** Two outputs:
|
||||
1. **ortholog_df**: Motifs from orthologous proteins
|
||||
2. **regex_df**: Motifs matched in input sequence
|
||||
|
||||
---
|
||||
|
||||
## Expression & Disease Data Modules
|
||||
|
||||
### gget archs4
|
||||
Query ARCHS4 for gene correlation or tissue expression.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `gene` | str | Gene symbol or Ensembl ID | Required |
|
||||
| `-w/--which` | str | 'correlation' or 'tissue' | 'correlation' |
|
||||
| `-s/--species` | str | 'human' or 'mouse' (tissue only) | 'human' |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-e/--ensembl` | flag | Input is Ensembl ID | False |
|
||||
| `-csv` | flag | CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:**
|
||||
- **correlation**: Gene symbols, Pearson correlation coefficients (top 100)
|
||||
- **tissue**: Tissue IDs, min/Q1/median/Q3/max expression
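**Example (Python, illustrative):** ACE2 is used here only as a familiar example gene.

```python
import gget

# Most correlated genes, then tissue expression, for ACE2
corr_df = gget.archs4("ACE2", which="correlation")
tissue_df = gget.archs4("ACE2", which="tissue", species="human")
```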
|
||||
|
||||
---
|
||||
|
||||
### gget cellxgene
|
||||
Query CZ CELLxGENE Discover Census for single-cell data.
|
||||
|
||||
**Setup:** Requires `gget setup cellxgene`
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `--gene` (-g) | list | Gene names or Ensembl IDs (case-sensitive!) | Required |
|
||||
| `--tissue` | list | Tissue type(s) | None |
|
||||
| `--cell_type` | list | Cell type(s) | None |
|
||||
| `--species` (-s) | str | 'homo_sapiens' or 'mus_musculus' | 'homo_sapiens' |
|
||||
| `--census_version` (-cv) | str | "stable", "latest", or dated version | "stable" |
|
||||
| `-o/--out` | str | Output file path (required for CLI) | Required |
|
||||
| `--ensembl` (-e) | flag | Use Ensembl IDs | False |
|
||||
| `--meta_only` (-mo) | flag | Return metadata only | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Additional filters:** disease, development_stage, sex, assay, dataset_id, donor_id, ethnicity, suspension_type
|
||||
|
||||
**Important:** Gene symbols are case-sensitive ('PAX7' for human, 'Pax7' for mouse)
|
||||
|
||||
**Returns:** AnnData object with count matrices and metadata
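**Example (Python, illustrative):** requires `gget setup cellxgene`; the gene/tissue filters are placeholders and the download can be large.

```python
import gget

# AnnData object of ACE2 counts restricted to human lung
adata = gget.cellxgene(gene=["ACE2"], tissue="lung",
                       species="homo_sapiens", census_version="stable")
```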
|
||||
|
||||
---
|
||||
|
||||
### gget enrichr
|
||||
Perform enrichment analysis using Enrichr/modEnrichr.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `genes` | list | Gene symbols or Ensembl IDs | Required |
|
||||
| `-db/--database` | str | Reference database or shortcut | Required |
|
||||
| `-s/--species` | str | human, mouse, fly, yeast, worm, fish | 'human' |
|
||||
| `-bkg_l/--background_list` | list | Background genes | None |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-ko/--kegg_out` | str | KEGG pathway images directory | None |
|
||||
|
||||
**Python-only:**
|
||||
- `plot` (bool): Generate graphical results
|
||||
|
||||
**Database shortcuts:**
|
||||
- 'pathway' → KEGG_2021_Human
|
||||
- 'transcription' → ChEA_2016
|
||||
- 'ontology' → GO_Biological_Process_2021
|
||||
- 'diseases_drugs' → GWAS_Catalog_2019
|
||||
- 'celltypes' → PanglaoDB_Augmented_2021
|
||||
|
||||
**Returns:** Pathway/function associations with adjusted p-values, overlapping gene counts
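**Example (Python, illustrative):** the gene list is arbitrary; 'ontology' resolves to GO_Biological_Process_2021 as listed above.

```python
import gget

# GO Biological Process enrichment with a bar-plot summary
enr_df = gget.enrichr(["PHF14", "RBM3", "MSL1", "PHF21A"],
                      database="ontology", plot=True)
```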
|
||||
|
||||
---
|
||||
|
||||
### gget bgee
|
||||
Retrieve orthology and expression from Bgee.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `ens_id` | str/list | Ensembl or NCBI gene ID | Required |
|
||||
| `-t/--type` | str | 'orthologs' or 'expression' | 'orthologs' |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-csv` | flag | CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Note:** Multiple IDs supported when `type='expression'`
|
||||
|
||||
**Returns:**
|
||||
- **orthologs**: Genes across species with IDs, names, taxonomic info
|
||||
- **expression**: Anatomical entities, confidence scores, expression status
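**Example (Python, illustrative):** the PCSK9 Ensembl ID is reused from the workflow examples in this skill.

```python
import gget

# Orthologs, then expression records, for PCSK9
orthologs_df = gget.bgee("ENSG00000169174", type="orthologs")
expression_df = gget.bgee("ENSG00000169174", type="expression")
```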
|
||||
|
||||
---
|
||||
|
||||
### gget opentargets
|
||||
Retrieve disease/drug associations from OpenTargets.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `ens_id` | str | Ensembl gene ID | Required |
|
||||
| `-r/--resource` | str | diseases, drugs, tractability, pharmacogenetics, expression, depmap, interactions | 'diseases' |
|
||||
| `-l/--limit` | int | Maximum results | None |
|
||||
| `-o/--out` | str | Output file path | None |
|
||||
| `-csv` | flag | CSV format (CLI) | False |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Resource-specific filters:**
|
||||
- drugs: `--filter_disease`
|
||||
- pharmacogenetics: `--filter_drug`
|
||||
- expression/depmap: `--filter_tissue`, `--filter_anat_sys`, `--filter_organ`
|
||||
- interactions: `--filter_protein_a`, `--filter_protein_b`, `--filter_gene_b`
|
||||
|
||||
**Returns:** Disease/drug associations, tractability, pharmacogenetics, expression, DepMap, interactions
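**Example (Python, illustrative):** the PCSK9 Ensembl ID is reused from the workflow examples in this skill.

```python
import gget

# Top disease associations, then known drugs, for PCSK9
diseases_df = gget.opentargets("ENSG00000169174", resource="diseases", limit=5)
drugs_df = gget.opentargets("ENSG00000169174", resource="drugs", limit=5)
```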
|
||||
|
||||
---
|
||||
|
||||
### gget cbio
|
||||
Plot cancer genomics heatmaps from cBioPortal.
|
||||
|
||||
**Subcommands:** search, plot
|
||||
|
||||
**search parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `keywords` | list | Search terms | Required |
|
||||
|
||||
**plot parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `-s/--study_ids` | list | cBioPortal study IDs | Required |
|
||||
| `-g/--genes` | list | Gene names or Ensembl IDs | Required |
|
||||
| `-st/--stratification` | str | tissue, cancer_type, cancer_type_detailed, study_id, sample | None |
|
||||
| `-vt/--variation_type` | str | mutation_occurrences, cna_nonbinary, sv_occurrences, cna_occurrences, Consequence | None |
|
||||
| `-f/--filter` | str | Filter by column value (e.g., 'study_id:msk_impact_2017') | None |
|
||||
| `-dd/--data_dir` | str | Cache directory | ./gget_cbio_cache |
|
||||
| `-fd/--figure_dir` | str | Output directory | ./gget_cbio_figures |
|
||||
| `-t/--title` | str | Custom figure title | None |
|
||||
| `-dpi` | int | Resolution | 100 |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
| `-nc/--no_confirm` | flag | Skip download confirmations | False |
|
||||
| `-sh/--show` | flag | Display plot in window | False |
|
||||
|
||||
**Returns:** PNG heatmap figure
|
||||
|
||||
---
|
||||
|
||||
### gget cosmic
|
||||
Search COSMIC database for cancer mutations.
|
||||
|
||||
**Important:** License fees for commercial use. Requires COSMIC account.
|
||||
|
||||
**Query parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `searchterm` | str | Gene name, Ensembl ID, mutation, sample ID | Required |
|
||||
| `-ctp/--cosmic_tsv_path` | str | Path to COSMIC TSV file | Required |
|
||||
| `-l/--limit` | int | Maximum results | 100 |
|
||||
| `-csv` | flag | CSV format (CLI) | False |
|
||||
|
||||
**Download parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `-d/--download_cosmic` | flag | Activate download mode | False |
|
||||
| `-gm/--gget_mutate` | flag | Create version for gget mutate | False |
|
||||
| `-cp/--cosmic_project` | str | cancer, census, cell_line, resistance, genome_screen, targeted_screen | None |
|
||||
| `-cv/--cosmic_version` | str | COSMIC version | Latest |
|
||||
| `-gv/--grch_version` | int | Human reference genome (37 or 38) | None |
|
||||
| `--email` | str | COSMIC account email | Required |
|
||||
| `--password` | str | COSMIC account password | Required |
|
||||
|
||||
**Note:** First-time users must download database
|
||||
|
||||
**Returns:** Mutation data from COSMIC
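**Example (Python, illustrative):** assumes the database was already downloaded with `-d/--download_cosmic` as described above; the TSV path below is a placeholder for your local copy.

```python
import gget

# Query a previously downloaded COSMIC TSV for EGFR mutations
hits = gget.cosmic("EGFR", cosmic_tsv_path="path/to/cosmic_cancer.tsv", limit=10)
```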
|
||||
|
||||
---
|
||||
|
||||
## Additional Tools
|
||||
|
||||
### gget mutate
|
||||
Generate mutated nucleotide sequences.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `sequences` | str/list | FASTA file or sequences | Required |
|
||||
| `-m/--mutations` | str/df | CSV/TSV file or DataFrame | Required |
|
||||
| `-mc/--mut_column` | str | Mutation column name | 'mutation' |
|
||||
| `-sic/--seq_id_column` | str | Sequence ID column | 'seq_ID' |
|
||||
| `-mic/--mut_id_column` | str | Mutation ID column | None |
|
||||
| `-k/--k` | int | Length of flanking sequences | 30 |
|
||||
| `-o/--out` | str | Output FASTA file path | stdout |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Returns:** Mutated sequences in FASTA format
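**Example (Python, illustrative):** a toy sequence and mutation; sequences passed directly are identified as seq1, seq2, ... in the mutation table.

```python
import gget
import pandas as pd

# Apply a single point mutation (standard c. notation) to a toy sequence
muts = pd.DataFrame({"seq_ID": ["seq1"], "mutation": ["c.4G>T"]})
mutated_fasta = gget.mutate(["ATCGGCTAA"], mutations=muts)
```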
|
||||
|
||||
---
|
||||
|
||||
### gget gpt
|
||||
Generate text using OpenAI's API.
|
||||
|
||||
**Setup:** Requires `gget setup gpt` and OpenAI API key
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `prompt` | str | Text input for generation | Required |
|
||||
| `api_key` | str | OpenAI API key | Required |
|
||||
| `model` | str | OpenAI model name | gpt-3.5-turbo |
|
||||
| `temperature` | float | Sampling temperature (0-2) | 1.0 |
|
||||
| `top_p` | float | Nucleus sampling | 1.0 |
|
||||
| `max_tokens` | int | Maximum tokens to generate | None |
|
||||
| `frequency_penalty` | float | Frequency penalty (0-2) | 0 |
|
||||
| `presence_penalty` | float | Presence penalty (0-2) | 0 |
|
||||
|
||||
**Important:** Free tier limited to 3 months. Set billing limits.
|
||||
|
||||
**Returns:** Generated text string
|
||||
|
||||
---
|
||||
|
||||
### gget setup
|
||||
Install/download dependencies for modules.
|
||||
|
||||
**Parameters:**
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `module` | str | Module name | Required |
|
||||
| `-o/--out` | str | Output folder (elm only) | Package install folder |
|
||||
| `-q/--quiet` | flag | Suppress progress | False |
|
||||
|
||||
**Modules requiring setup:**
|
||||
- `alphafold` - Downloads ~4GB model parameters
|
||||
- `cellxgene` - Installs cellxgene-census
|
||||
- `elm` - Downloads local ELM database
|
||||
- `gpt` - Configures OpenAI integration
|
||||
|
||||
**Returns:** None (installs dependencies)
|
||||
814
scientific-packages/gget/references/workflows.md
Normal file
@@ -0,0 +1,814 @@
|
||||
# gget Workflow Examples
|
||||
|
||||
Extended workflow examples demonstrating how to combine multiple gget modules for common bioinformatics tasks.
|
||||
|
||||
## Table of Contents
|
||||
1. [Complete Gene Analysis Pipeline](#complete-gene-analysis-pipeline)
|
||||
2. [Comparative Structural Biology](#comparative-structural-biology)
|
||||
3. [Cancer Genomics Analysis](#cancer-genomics-analysis)
|
||||
4. [Single-Cell Expression Analysis](#single-cell-expression-analysis)
|
||||
5. [Building Reference Transcriptomes](#building-reference-transcriptomes)
|
||||
6. [Mutation Impact Assessment](#mutation-impact-assessment)
|
||||
7. [Drug Target Discovery](#drug-target-discovery)
|
||||
|
||||
---
|
||||
|
||||
## Complete Gene Analysis Pipeline
|
||||
|
||||
Comprehensive analysis of a gene from discovery to functional annotation.
|
||||
|
||||
```python
|
||||
import gget
|
||||
import pandas as pd
|
||||
|
||||
# Step 1: Search for genes of interest
|
||||
print("Step 1: Searching for GABA receptor genes...")
|
||||
search_results = gget.search(["GABA", "receptor", "alpha"],
|
||||
species="homo_sapiens",
|
||||
andor="and")
|
||||
print(f"Found {len(search_results)} genes")
|
||||
|
||||
# Step 2: Get detailed information
|
||||
print("\nStep 2: Getting detailed information...")
|
||||
gene_ids = search_results["ensembl_id"].tolist()[:5] # Top 5 genes
|
||||
gene_info = gget.info(gene_ids, pdb=True)
|
||||
print(gene_info[["ensembl_id", "gene_name", "uniprot_id", "description"]])
|
||||
|
||||
# Step 3: Retrieve sequences
|
||||
print("\nStep 3: Retrieving sequences...")
|
||||
nucleotide_seqs = gget.seq(gene_ids)
|
||||
protein_seqs = gget.seq(gene_ids, translate=True)
|
||||
|
||||
# Save sequences
|
||||
with open("gaba_receptors_nt.fasta", "w") as f:
|
||||
f.write(nucleotide_seqs)
|
||||
with open("gaba_receptors_aa.fasta", "w") as f:
|
||||
f.write(protein_seqs)
|
||||
|
||||
# Step 4: Get expression data
|
||||
print("\nStep 4: Getting tissue expression...")
|
||||
for gene_id, gene_name in zip(gene_ids, gene_info["gene_name"]):
|
||||
expr_data = gget.archs4(gene_name, which="tissue")
|
||||
print(f"\n{gene_name} expression:")
|
||||
print(expr_data.head())
|
||||
|
||||
# Step 5: Find correlated genes
|
||||
print("\nStep 5: Finding correlated genes...")
|
||||
correlated = gget.archs4(gene_info["gene_name"].iloc[0], which="correlation")
|
||||
correlated_top = correlated.head(20)
|
||||
print(correlated_top)
|
||||
|
||||
# Step 6: Enrichment analysis on correlated genes
|
||||
print("\nStep 6: Performing enrichment analysis...")
|
||||
gene_list = correlated_top["gene_symbol"].tolist()
|
||||
enrichment = gget.enrichr(gene_list, database="ontology", plot=True)
|
||||
print(enrichment.head(10))
|
||||
|
||||
# Step 7: Get disease associations
|
||||
print("\nStep 7: Getting disease associations...")
|
||||
for gene_id, gene_name in zip(gene_ids[:3], gene_info["gene_name"][:3]):
|
||||
diseases = gget.opentargets(gene_id, resource="diseases", limit=5)
|
||||
print(f"\n{gene_name} disease associations:")
|
||||
print(diseases)
|
||||
|
||||
# Step 8: Check for orthologs
|
||||
print("\nStep 8: Finding orthologs...")
|
||||
orthologs = gget.bgee(gene_ids[0], type="orthologs")
|
||||
print(orthologs)
|
||||
|
||||
print("\nComplete gene analysis pipeline finished!")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Comparative Structural Biology
|
||||
|
||||
Compare protein structures across species and analyze functional motifs.
|
||||
|
||||
```python
|
||||
import gget
|
||||
|
||||
# Define genes for comparison
|
||||
human_gene = "ENSG00000169174" # PCSK9
|
||||
mouse_gene = "ENSMUSG00000044254" # Pcsk9
|
||||
|
||||
print("Comparative Structural Biology Workflow")
|
||||
print("=" * 50)
|
||||
|
||||
# Step 1: Get gene information
|
||||
print("\n1. Getting gene information...")
|
||||
human_info = gget.info([human_gene])
|
||||
mouse_info = gget.info([mouse_gene])
|
||||
|
||||
print(f"Human: {human_info['gene_name'].iloc[0]}")
|
||||
print(f"Mouse: {mouse_info['gene_name'].iloc[0]}")
|
||||
|
||||
# Step 2: Retrieve protein sequences
|
||||
print("\n2. Retrieving protein sequences...")
|
||||
human_seq = gget.seq(human_gene, translate=True)
|
||||
mouse_seq = gget.seq(mouse_gene, translate=True)
|
||||
|
||||
# Save to file for alignment
|
||||
with open("pcsk9_sequences.fasta", "w") as f:
|
||||
f.write(human_seq)
|
||||
f.write("\n")
|
||||
f.write(mouse_seq)
|
||||
|
||||
# Step 3: Align sequences
|
||||
print("\n3. Aligning sequences...")
|
||||
alignment = gget.muscle("pcsk9_sequences.fasta")
|
||||
print("Alignment completed. Visualizing in ClustalW format:")
|
||||
print(alignment)
|
||||
|
||||
# Step 4: Get existing structures from PDB
|
||||
print("\n4. Searching PDB for existing structures...")
|
||||
# Search by sequence using BLAST
|
||||
pdb_results = gget.blast(human_seq, database="pdbaa", limit=5)
|
||||
print("Top PDB matches:")
|
||||
print(pdb_results[["Description", "Max Score", "Query Coverage"]])
|
||||
|
||||
# Download top structure
|
||||
if len(pdb_results) > 0:
|
||||
# Extract PDB ID from description (usually format: "PDB|XXXX|...")
|
||||
pdb_id = pdb_results.iloc[0]["Description"].split("|")[1]
|
||||
print(f"\nDownloading PDB structure: {pdb_id}")
|
||||
gget.pdb(pdb_id, save=True)
|
||||
|
||||
# Step 5: Predict AlphaFold structures
|
||||
print("\n5. Predicting structures with AlphaFold...")
|
||||
# Note: This requires gget setup alphafold and is computationally intensive
|
||||
# Uncomment to run:
|
||||
# human_structure = gget.alphafold(human_seq, plot=True)
|
||||
# mouse_structure = gget.alphafold(mouse_seq, plot=True)
|
||||
print("(AlphaFold prediction skipped - uncomment to run)")
|
||||
|
||||
# Step 6: Identify functional motifs
|
||||
print("\n6. Identifying functional motifs with ELM...")
|
||||
# Note: Requires gget setup elm
|
||||
# Uncomment to run:
|
||||
# human_ortholog_df, human_regex_df = gget.elm(human_seq)
|
||||
# print("Human PCSK9 functional motifs:")
|
||||
# print(human_regex_df)
|
||||
print("(ELM analysis skipped - uncomment to run)")
|
||||
|
||||
# Step 7: Get orthology information
|
||||
print("\n7. Getting orthology information from Bgee...")
|
||||
orthologs = gget.bgee(human_gene, type="orthologs")
|
||||
print("PCSK9 orthologs:")
|
||||
print(orthologs)
|
||||
|
||||
print("\nComparative structural biology workflow completed!")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cancer Genomics Analysis
|
||||
|
||||
Analyze cancer-associated genes and their mutations.
|
||||
|
||||
```python
|
||||
import gget
import pandas as pd
import matplotlib.pyplot as plt
|
||||
|
||||
print("Cancer Genomics Analysis Workflow")
|
||||
print("=" * 50)
|
||||
|
||||
# Step 1: Search for cancer-related genes
|
||||
print("\n1. Searching for breast cancer genes...")
|
||||
genes = gget.search(["breast", "cancer", "BRCA"],
|
||||
species="homo_sapiens",
|
||||
andor="or",
|
||||
limit=20)
|
||||
print(f"Found {len(genes)} genes")
|
||||
|
||||
# Focus on specific genes
|
||||
target_genes = ["BRCA1", "BRCA2", "TP53", "PIK3CA", "ESR1"]
|
||||
print(f"\nAnalyzing: {', '.join(target_genes)}")
|
||||
|
||||
# Step 2: Get gene information
|
||||
print("\n2. Getting gene information...")
|
||||
gene_search = []
|
||||
for gene in target_genes:
|
||||
result = gget.search([gene], species="homo_sapiens", limit=1)
|
||||
if len(result) > 0:
|
||||
gene_search.append(result.iloc[0])
|
||||
|
||||
gene_df = pd.DataFrame(gene_search)
|
||||
gene_ids = gene_df["ensembl_id"].tolist()
|
||||
|
||||
# Step 3: Get disease associations
|
||||
print("\n3. Getting disease associations from OpenTargets...")
|
||||
for gene_id, gene_name in zip(gene_ids, target_genes):
|
||||
print(f"\n{gene_name} disease associations:")
|
||||
diseases = gget.opentargets(gene_id, resource="diseases", limit=3)
|
||||
print(diseases[["disease_name", "overall_score"]])
|
||||
|
||||
# Step 4: Get drug associations
|
||||
print("\n4. Getting drug associations...")
|
||||
for gene_id, gene_name in zip(gene_ids[:3], target_genes[:3]):
|
||||
print(f"\n{gene_name} drug associations:")
|
||||
drugs = gget.opentargets(gene_id, resource="drugs", limit=3)
|
||||
if len(drugs) > 0:
|
||||
print(drugs[["drug_name", "drug_type", "max_phase_for_all_diseases"]])
|
||||
|
||||
# Step 5: Search cBioPortal for studies
|
||||
print("\n5. Searching cBioPortal for breast cancer studies...")
|
||||
studies = gget.cbio_search(["breast", "cancer"])
|
||||
print(f"Found {len(studies)} studies")
|
||||
print(studies[:5])
|
||||
|
||||
# Step 6: Create cancer genomics heatmap
|
||||
print("\n6. Creating cancer genomics heatmap...")
|
||||
if len(studies) > 0:
|
||||
# Select relevant studies
|
||||
selected_studies = studies[:2] # Top 2 studies
|
||||
|
||||
gget.cbio_plot(
|
||||
selected_studies,
|
||||
target_genes,
|
||||
stratification="cancer_type",
|
||||
variation_type="mutation_occurrences",
|
||||
show=False
|
||||
)
|
||||
print("Heatmap saved to ./gget_cbio_figures/")
|
||||
|
||||
# Step 7: Query COSMIC database (requires setup)
|
||||
print("\n7. Querying COSMIC database...")
|
||||
# Note: Requires COSMIC account and database download
|
||||
# Uncomment to run:
|
||||
# for gene in target_genes[:2]:
|
||||
# cosmic_results = gget.cosmic(
|
||||
# gene,
|
||||
# cosmic_tsv_path="cosmic_cancer.tsv",
|
||||
# limit=10
|
||||
# )
|
||||
# print(f"\n{gene} mutations in COSMIC:")
|
||||
# print(cosmic_results)
|
||||
print("(COSMIC query skipped - requires database download)")
|
||||
|
||||
# Step 8: Enrichment analysis
|
||||
print("\n8. Performing pathway enrichment...")
|
||||
enrichment = gget.enrichr(target_genes, database="pathway", plot=True)
|
||||
print("\nTop enriched pathways:")
|
||||
print(enrichment.head(10))
|
||||
|
||||
print("\nCancer genomics analysis completed!")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Single-Cell Expression Analysis
|
||||
|
||||
Analyze single-cell RNA-seq data for specific cell types and tissues.
|
||||
|
||||
```python
|
||||
import gget
import numpy as np
import scanpy as sc
|
||||
|
||||
print("Single-Cell Expression Analysis Workflow")
|
||||
print("=" * 50)
|
||||
|
||||
# Note: Requires gget setup cellxgene
|
||||
|
||||
# Step 1: Define genes and cell types of interest
|
||||
genes_of_interest = ["ACE2", "TMPRSS2", "CD4", "CD8A"]
|
||||
tissue = "lung"
|
||||
cell_types = ["type ii pneumocyte", "macrophage", "t cell"]
|
||||
|
||||
print(f"\nAnalyzing genes: {', '.join(genes_of_interest)}")
|
||||
print(f"Tissue: {tissue}")
|
||||
print(f"Cell types: {', '.join(cell_types)}")
|
||||
|
||||
# Step 2: Get metadata first
|
||||
print("\n1. Retrieving metadata...")
|
||||
metadata = gget.cellxgene(
|
||||
gene=genes_of_interest,
|
||||
tissue=tissue,
|
||||
species="homo_sapiens",
|
||||
meta_only=True
|
||||
)
|
||||
print(f"Found {len(metadata)} datasets")
|
||||
print(metadata.head())
|
||||
|
||||
# Step 3: Download count matrices
|
||||
print("\n2. Downloading single-cell data...")
|
||||
# Note: This can be a large download
|
||||
adata = gget.cellxgene(
|
||||
gene=genes_of_interest,
|
||||
tissue=tissue,
|
||||
species="homo_sapiens",
|
||||
census_version="stable"
|
||||
)
|
||||
print(f"AnnData shape: {adata.shape}")
|
||||
print(f"Genes: {adata.n_vars}")
|
||||
print(f"Cells: {adata.n_obs}")
|
||||
|
||||
# Step 4: Basic QC and filtering with scanpy
|
||||
print("\n3. Performing quality control...")
|
||||
sc.pp.filter_cells(adata, min_genes=200)
|
||||
sc.pp.filter_genes(adata, min_cells=3)
|
||||
print(f"After QC - Cells: {adata.n_obs}, Genes: {adata.n_vars}")
|
||||
|
||||
# Step 5: Normalize and log-transform
|
||||
print("\n4. Normalizing data...")
|
||||
sc.pp.normalize_total(adata, target_sum=1e4)
|
||||
sc.pp.log1p(adata)
|
||||
|
||||
# Step 6: Calculate gene expression statistics
|
||||
print("\n5. Calculating expression statistics...")
|
||||
for gene in genes_of_interest:
|
||||
if gene in adata.var_names:
|
||||
expr = adata[:, gene].X.toarray().flatten()
|
||||
print(f"\n{gene} expression:")
|
||||
print(f" Mean: {expr.mean():.3f}")
|
||||
print(f" Median: {np.median(expr):.3f}")
|
||||
print(f" % expressing: {(expr > 0).sum() / len(expr) * 100:.1f}%")
|
||||
|
||||
# Step 7: Get tissue expression from ARCHS4 for comparison
|
||||
print("\n6. Getting bulk tissue expression from ARCHS4...")
|
||||
for gene in genes_of_interest:
|
||||
tissue_expr = gget.archs4(gene, which="tissue")
|
||||
lung_expr = tissue_expr[tissue_expr["tissue"] == "lung"]
|
||||
if len(lung_expr) > 0:
|
||||
print(f"\n{gene} in lung (ARCHS4):")
|
||||
print(f" Median: {lung_expr['median'].iloc[0]:.3f}")
|
||||
|
||||
# Step 8: Enrichment analysis
|
||||
print("\n7. Performing enrichment analysis...")
|
||||
enrichment = gget.enrichr(genes_of_interest, database="celltypes", plot=True)
|
||||
print("\nTop cell type associations:")
|
||||
print(enrichment.head(10))
|
||||
|
||||
# Step 9: Get disease associations
|
||||
print("\n8. Getting disease associations...")
|
||||
for gene in genes_of_interest:
|
||||
gene_search = gget.search([gene], species="homo_sapiens", limit=1)
|
||||
if len(gene_search) > 0:
|
||||
gene_id = gene_search["ensembl_id"].iloc[0]
|
||||
diseases = gget.opentargets(gene_id, resource="diseases", limit=3)
|
||||
print(f"\n{gene} disease associations:")
|
||||
print(diseases[["disease_name", "overall_score"]])
|
||||
|
||||
print("\nSingle-cell expression analysis completed!")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Building Reference Transcriptomes
|
||||
|
||||
Prepare reference data for RNA-seq analysis pipelines.
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Reference transcriptome building workflow
|
||||
|
||||
echo "Reference Transcriptome Building Workflow"
|
||||
echo "=========================================="
|
||||
|
||||
# Step 1: List available species
|
||||
echo -e "\n1. Listing available species..."
|
||||
gget ref --list_species > available_species.txt
|
||||
echo "Available species saved to available_species.txt"
|
||||
|
||||
# Step 2: Download reference files for human
|
||||
echo -e "\n2. Downloading human reference files..."
|
||||
SPECIES="homo_sapiens"
|
||||
RELEASE=110 # Specify release for reproducibility
|
||||
|
||||
# Download GTF annotation
|
||||
echo "Downloading GTF annotation..."
|
||||
gget ref -w gtf -r $RELEASE -d $SPECIES -o human_ref_gtf.json
|
||||
|
||||
# Download cDNA sequences
|
||||
echo "Downloading cDNA sequences..."
|
||||
gget ref -w cdna -r $RELEASE -d $SPECIES -o human_ref_cdna.json
|
||||
|
||||
# Download protein sequences
|
||||
echo "Downloading protein sequences..."
|
||||
gget ref -w pep -r $RELEASE -d $SPECIES -o human_ref_pep.json
|
||||
|
||||
# Step 3: Build kallisto index (if kallisto is installed)
|
||||
echo -e "\n3. Building kallisto index..."
|
||||
if command -v kallisto &> /dev/null; then
|
||||
# Get cDNA FASTA file from download
|
||||
CDNA_FILE=$(ls *.cdna.all.fa.gz)
|
||||
if [ -f "$CDNA_FILE" ]; then
|
||||
kallisto index -i transcriptome.idx $CDNA_FILE
|
||||
echo "Kallisto index created: transcriptome.idx"
|
||||
else
|
||||
echo "cDNA FASTA file not found"
|
||||
fi
|
||||
else
|
||||
echo "kallisto not installed, skipping index building"
|
||||
fi
|
||||
|
||||
# Step 4: Download genome for alignment-based methods
|
||||
echo -e "\n4. Downloading genome sequence..."
|
||||
gget ref -w dna -r $RELEASE -d $SPECIES -o human_ref_dna.json
|
||||
|
||||
# Step 5: Get gene information for genes of interest
|
||||
echo -e "\n5. Getting information for specific genes..."
|
||||
gget search -s $SPECIES TP53 BRCA1 BRCA2 -o key_genes.csv
|
||||
|
||||
echo -e "\nReference transcriptome building completed!"
|
||||
```
|
||||
|
||||
```python
|
||||
# Python version
|
||||
import gget
|
||||
import json
|
||||
|
||||
print("Reference Transcriptome Building Workflow")
|
||||
print("=" * 50)
|
||||
|
||||
# Configuration
|
||||
species = "homo_sapiens"
|
||||
release = 110
|
||||
genes_of_interest = ["TP53", "BRCA1", "BRCA2", "MYC", "EGFR"]
|
||||
|
||||
# Step 1: Get reference information
|
||||
print("\n1. Getting reference information...")
|
||||
ref_info = gget.ref(species, release=release)
|
||||
|
||||
# Save reference information
|
||||
with open("reference_info.json", "w") as f:
|
||||
json.dump(ref_info, f, indent=2)
|
||||
print("Reference information saved to reference_info.json")
|
||||
|
||||
# Step 2: Download specific files
|
||||
print("\n2. Downloading reference files...")
|
||||
# GTF annotation
|
||||
gget.ref(species, which="gtf", release=release, download=True)
|
||||
# cDNA sequences
|
||||
gget.ref(species, which="cdna", release=release, download=True)
|
||||
|
||||
# Step 3: Get information for genes of interest
|
||||
print(f"\n3. Getting information for {len(genes_of_interest)} genes...")
|
||||
gene_data = []
|
||||
for gene in genes_of_interest:
|
||||
result = gget.search([gene], species=species, limit=1)
|
||||
if len(result) > 0:
|
||||
gene_data.append(result.iloc[0])
|
||||
|
||||
# Get detailed info
|
||||
if gene_data:
|
||||
gene_ids = [g["ensembl_id"] for g in gene_data]
|
||||
detailed_info = gget.info(gene_ids)
|
||||
detailed_info.to_csv("genes_of_interest_info.csv", index=False)
|
||||
print("Gene information saved to genes_of_interest_info.csv")
|
||||
|
||||
# Step 4: Get sequences
|
||||
print("\n4. Retrieving sequences...")
|
||||
sequences_nt = gget.seq(gene_ids)
|
||||
sequences_aa = gget.seq(gene_ids, translate=True)
|
||||
|
||||
with open("key_genes_nucleotide.fasta", "w") as f:
|
||||
f.write(sequences_nt)
|
||||
with open("key_genes_protein.fasta", "w") as f:
|
||||
f.write(sequences_aa)
|
||||
|
||||
print("\nReference transcriptome building completed!")
|
||||
print(f"Files created:")
|
||||
print(" - reference_info.json")
|
||||
print(" - genes_of_interest_info.csv")
|
||||
print(" - key_genes_nucleotide.fasta")
|
||||
print(" - key_genes_protein.fasta")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Mutation Impact Assessment
|
||||
|
||||
Analyze the impact of genetic mutations on protein structure and function.
|
||||
|
||||
```python
|
||||
import gget
|
||||
import pandas as pd
|
||||
|
||||
print("Mutation Impact Assessment Workflow")
|
||||
print("=" * 50)
|
||||
|
||||
# Define mutations to analyze
|
||||
mutations = [
|
||||
{"gene": "TP53", "mutation": "c.818G>A", "description": "R273H hotspot"},
|
||||
{"gene": "EGFR", "mutation": "c.2573T>G", "description": "L858R activating"},
|
||||
]
|
||||
|
||||
# Step 1: Get gene information
|
||||
print("\n1. Getting gene information...")
|
||||
for mut in mutations:
|
||||
results = gget.search([mut["gene"]], species="homo_sapiens", limit=1)
|
||||
if len(results) > 0:
|
||||
mut["ensembl_id"] = results["ensembl_id"].iloc[0]
|
||||
print(f"{mut['gene']}: {mut['ensembl_id']}")
|
||||
|
||||
# Step 2: Get sequences
|
||||
print("\n2. Retrieving wild-type sequences...")
|
||||
for mut in mutations:
|
||||
# Get nucleotide sequence
|
||||
nt_seq = gget.seq(mut["ensembl_id"])
|
||||
mut["wt_sequence"] = nt_seq
|
||||
|
||||
# Get protein sequence
|
||||
aa_seq = gget.seq(mut["ensembl_id"], translate=True)
|
||||
mut["wt_protein"] = aa_seq
|
||||
|
||||
# Step 3: Generate mutated sequences
|
||||
print("\n3. Generating mutated sequences...")
|
||||
# Create mutation dataframe for gget mutate
|
||||
mut_df = pd.DataFrame({
|
||||
"seq_ID": [m["gene"] for m in mutations],
|
||||
"mutation": [m["mutation"] for m in mutations]
|
||||
})
|
||||
|
||||
# For each mutation
|
||||
for mut in mutations:
|
||||
# Extract sequence from FASTA
|
||||
lines = mut["wt_sequence"].split("\n")
|
||||
seq = "".join(lines[1:])
|
||||
|
||||
# Create single mutation df
|
||||
single_mut = pd.DataFrame({
|
||||
"seq_ID": [mut["gene"]],
|
||||
"mutation": [mut["mutation"]]
|
||||
})
|
||||
|
||||
# Generate mutated sequence
|
||||
mutated = gget.mutate([seq], mutations=single_mut)
|
||||
mut["mutated_sequence"] = mutated
|
||||
|
||||
print("Mutated sequences generated")
|
||||
|
||||
# Step 4: Get existing structure information
|
||||
print("\n4. Getting structure information...")
|
||||
for mut in mutations:
|
||||
# Get info with PDB IDs
|
||||
info = gget.info([mut["ensembl_id"]], pdb=True)
|
||||
|
||||
if "pdb_id" in info.columns and pd.notna(info["pdb_id"].iloc[0]):
|
||||
pdb_ids = info["pdb_id"].iloc[0].split(";")
|
||||
print(f"\n{mut['gene']} PDB structures: {', '.join(pdb_ids[:3])}")
|
||||
|
||||
# Download first structure
|
||||
if len(pdb_ids) > 0:
|
||||
pdb_id = pdb_ids[0].strip()
|
||||
mut["pdb_id"] = pdb_id
|
||||
gget.pdb(pdb_id, save=True)
|
||||
else:
|
||||
print(f"\n{mut['gene']}: No PDB structure available")
|
||||
mut["pdb_id"] = None
|
||||
|
||||
# Step 5: Predict structures with AlphaFold (optional)
|
||||
print("\n5. Predicting structures with AlphaFold...")
|
||||
# Note: Requires gget setup alphafold and is computationally intensive
|
||||
# Uncomment to run:
|
||||
# for mut in mutations:
|
||||
# print(f"Predicting {mut['gene']} wild-type structure...")
|
||||
# wt_structure = gget.alphafold(mut["wt_protein"])
|
||||
#
|
||||
# print(f"Predicting {mut['gene']} mutant structure...")
|
||||
# # Would need to translate mutated sequence first
|
||||
# # mutant_structure = gget.alphafold(mutated_protein)
|
||||
print("(AlphaFold prediction skipped - uncomment to run)")
|
||||
|
||||
# Step 6: Find functional motifs
|
||||
print("\n6. Identifying functional motifs...")
|
||||
# Note: Requires gget setup elm
|
||||
# Uncomment to run:
|
||||
# for mut in mutations:
|
||||
# ortholog_df, regex_df = gget.elm(mut["wt_protein"])
|
||||
# print(f"\n{mut['gene']} functional motifs:")
|
||||
# print(regex_df)
|
||||
print("(ELM analysis skipped - uncomment to run)")
|
||||
|
||||
# Step 7: Get disease associations
|
||||
print("\n7. Getting disease associations...")
|
||||
for mut in mutations:
|
||||
diseases = gget.opentargets(
|
||||
mut["ensembl_id"],
|
||||
resource="diseases",
|
||||
limit=5
|
||||
)
|
||||
print(f"\n{mut['gene']} ({mut['description']}) disease associations:")
|
||||
print(diseases[["disease_name", "overall_score"]])
|
||||
|
||||
# Step 8: Query COSMIC for mutation frequency
|
||||
print("\n8. Querying COSMIC database...")
|
||||
# Note: Requires COSMIC database download
|
||||
# Uncomment to run:
|
||||
# for mut in mutations:
|
||||
# cosmic_results = gget.cosmic(
|
||||
# mut["mutation"],
|
||||
# cosmic_tsv_path="cosmic_cancer.tsv",
|
||||
# limit=10
|
||||
# )
|
||||
# print(f"\n{mut['gene']} {mut['mutation']} in COSMIC:")
|
||||
# print(cosmic_results)
|
||||
print("(COSMIC query skipped - requires database download)")
|
||||
|
||||
print("\nMutation impact assessment completed!")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Drug Target Discovery
|
||||
|
||||
Identify and validate potential drug targets for specific diseases.
|
||||
|
||||
```python
|
||||
import gget
|
||||
import pandas as pd
|
||||
|
||||
print("Drug Target Discovery Workflow")
|
||||
print("=" * 50)
|
||||
|
||||
# Step 1: Search for disease-related genes
|
||||
disease = "alzheimer"
|
||||
print(f"\n1. Searching for {disease} disease genes...")
|
||||
genes = gget.search([disease], species="homo_sapiens", limit=50)
|
||||
print(f"Found {len(genes)} potential genes")
|
||||
|
||||
# Step 2: Get detailed information
|
||||
print("\n2. Getting detailed gene information...")
|
||||
gene_ids = genes["ensembl_id"].tolist()[:20] # Top 20
|
||||
gene_info = gget.info(gene_ids[:10]) # Limit to avoid timeout
|
||||
|
||||
# Step 3: Get disease associations from OpenTargets
|
||||
print("\n3. Getting disease associations...")
|
||||
disease_scores = []
|
||||
for gene_id, gene_name in zip(gene_info["ensembl_id"], gene_info["gene_name"]):
|
||||
diseases = gget.opentargets(gene_id, resource="diseases", limit=10)
|
||||
|
||||
# Filter for Alzheimer's disease
|
||||
alzheimer = diseases[diseases["disease_name"].str.contains("Alzheimer", case=False, na=False)]
|
||||
|
||||
if len(alzheimer) > 0:
|
||||
disease_scores.append({
|
||||
"ensembl_id": gene_id,
|
||||
"gene_name": gene_name,
|
||||
"disease_score": alzheimer["overall_score"].max()
|
||||
})
|
||||
|
||||
disease_df = pd.DataFrame(disease_scores).sort_values("disease_score", ascending=False)
|
||||
print("\nTop disease-associated genes:")
|
||||
print(disease_df.head(10))
|
||||
|
||||
# Step 4: Get tractability information
|
||||
print("\n4. Assessing target tractability...")
|
||||
top_targets = disease_df.head(5)
|
||||
for _, row in top_targets.iterrows():
|
||||
tractability = gget.opentargets(
|
||||
row["ensembl_id"],
|
||||
resource="tractability"
|
||||
)
|
||||
print(f"\n{row['gene_name']} tractability:")
|
||||
print(tractability)
|
||||
|
||||
# Step 5: Get expression data
|
||||
print("\n5. Getting tissue expression data...")
|
||||
for _, row in top_targets.iterrows():
|
||||
# Brain expression from OpenTargets
|
||||
expression = gget.opentargets(
|
||||
row["ensembl_id"],
|
||||
resource="expression",
|
||||
filter_tissue="brain"
|
||||
)
|
||||
print(f"\n{row['gene_name']} brain expression:")
|
||||
print(expression)
|
||||
|
||||
# Tissue expression from ARCHS4
|
||||
tissue_expr = gget.archs4(row["gene_name"], which="tissue")
|
||||
brain_expr = tissue_expr[tissue_expr["tissue"].str.contains("brain", case=False, na=False)]
|
||||
print(f"ARCHS4 brain expression:")
|
||||
print(brain_expr)
|
||||
|
||||
# Step 6: Check for existing drugs
|
||||
print("\n6. Checking for existing drugs...")
|
||||
for _, row in top_targets.iterrows():
|
||||
drugs = gget.opentargets(row["ensembl_id"], resource="drugs", limit=5)
|
||||
print(f"\n{row['gene_name']} drug associations:")
|
||||
if len(drugs) > 0:
|
||||
print(drugs[["drug_name", "drug_type", "max_phase_for_all_diseases"]])
|
||||
else:
|
||||
print("No drugs found")
|
||||
|
||||
# Step 7: Get protein-protein interactions
|
||||
print("\n7. Getting protein-protein interactions...")
|
||||
for _, row in top_targets.iterrows():
|
||||
interactions = gget.opentargets(
|
||||
row["ensembl_id"],
|
||||
resource="interactions",
|
||||
limit=10
|
||||
)
|
||||
print(f"\n{row['gene_name']} interacts with:")
|
||||
if len(interactions) > 0:
|
||||
print(interactions[["gene_b_symbol", "interaction_score"]])
|
||||
|
||||
# Step 8: Enrichment analysis
|
||||
print("\n8. Performing pathway enrichment...")
|
||||
gene_list = top_targets["gene_name"].tolist()
|
||||
enrichment = gget.enrichr(gene_list, database="pathway", plot=True)
|
||||
print("\nTop enriched pathways:")
|
||||
print(enrichment.head(10))
|
||||
|
||||
# Step 9: Get structure information
|
||||
print("\n9. Getting structure information...")
|
||||
for _, row in top_targets.iterrows():
|
||||
info = gget.info([row["ensembl_id"]], pdb=True)
|
||||
|
||||
if "pdb_id" in info.columns and pd.notna(info["pdb_id"].iloc[0]):
|
||||
pdb_ids = info["pdb_id"].iloc[0].split(";")
|
||||
print(f"\n{row['gene_name']} PDB structures: {', '.join(pdb_ids[:3])}")
|
||||
else:
|
||||
print(f"\n{row['gene_name']}: No PDB structure available")
|
||||
# Could predict with AlphaFold
|
||||
print(f" Consider AlphaFold prediction")
|
||||
|
||||
# Step 10: Generate target summary report
|
||||
print("\n10. Generating target summary report...")
|
||||
report = []
|
||||
for _, row in top_targets.iterrows():
|
||||
report.append({
|
||||
"Gene": row["gene_name"],
|
||||
"Ensembl ID": row["ensembl_id"],
|
||||
"Disease Score": row["disease_score"],
|
||||
"Target Status": "High Priority"
|
||||
})
|
||||
|
||||
report_df = pd.DataFrame(report)
|
||||
report_df.to_csv("drug_targets_report.csv", index=False)
|
||||
print("\nTarget report saved to drug_targets_report.csv")
|
||||
|
||||
print("\nDrug target discovery workflow completed!")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tips for Workflow Development
|
||||
|
||||
### Error Handling
|
||||
```python
|
||||
import gget
|
||||
|
||||
def safe_gget_call(func, *args, **kwargs):
|
||||
"""Wrapper for gget calls with error handling"""
|
||||
try:
|
||||
result = func(*args, **kwargs)
|
||||
return result
|
||||
except Exception as e:
|
||||
print(f"Error in {func.__name__}: {str(e)}")
|
||||
return None
|
||||
|
||||
# Usage
|
||||
result = safe_gget_call(gget.search, ["ACE2"], species="homo_sapiens")
|
||||
if result is not None:
|
||||
print(result)
|
||||
```
|
||||
|
||||
### Rate Limiting
|
||||
```python
|
||||
import time
import gget
import pandas as pd
|
||||
|
||||
def rate_limited_queries(gene_ids, delay=1):
|
||||
"""Query multiple genes with rate limiting"""
|
||||
results = []
|
||||
for i, gene_id in enumerate(gene_ids):
|
||||
print(f"Querying {i+1}/{len(gene_ids)}: {gene_id}")
|
||||
result = gget.info([gene_id])
|
||||
results.append(result)
|
||||
|
||||
if i < len(gene_ids) - 1: # Don't sleep after last query
|
||||
time.sleep(delay)
|
||||
|
||||
return pd.concat(results, ignore_index=True)
|
||||
```
|
||||
|
||||
### Caching Results
|
||||
```python
|
||||
import os
|
||||
import pickle
|
||||
import gget
|
||||
|
||||
def cached_gget(cache_file, func, *args, **kwargs):
|
||||
"""Cache gget results to avoid repeated queries"""
|
||||
if os.path.exists(cache_file):
|
||||
print(f"Loading from cache: {cache_file}")
|
||||
with open(cache_file, "rb") as f:
|
||||
return pickle.load(f)
|
||||
|
||||
result = func(*args, **kwargs)
|
||||
|
||||
with open(cache_file, "wb") as f:
|
||||
pickle.dump(result, f)
|
||||
print(f"Saved to cache: {cache_file}")
|
||||
|
||||
return result
|
||||
|
||||
# Usage
|
||||
result = cached_gget("ace2_info.pkl", gget.info, ["ENSG00000130234"])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
These workflows demonstrate how to combine multiple gget modules for comprehensive bioinformatics analyses. Adapt them to your specific research questions and data types.
|
||||
191
scientific-packages/gget/scripts/batch_sequence_analysis.py
Executable file
@@ -0,0 +1,191 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Batch Sequence Analysis Script
|
||||
Analyze multiple sequences: BLAST, alignment, and structure prediction
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
import gget
|
||||
|
||||
|
||||
def read_fasta(fasta_file):
|
||||
"""Read sequences from FASTA file."""
|
||||
sequences = []
|
||||
current_id = None
|
||||
current_seq = []
|
||||
|
||||
with open(fasta_file, "r") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line.startswith(">"):
|
||||
if current_id:
|
||||
sequences.append({"id": current_id, "seq": "".join(current_seq)})
|
||||
current_id = line[1:]
|
||||
current_seq = []
|
||||
else:
|
||||
current_seq.append(line)
|
||||
|
||||
if current_id:
|
||||
sequences.append({"id": current_id, "seq": "".join(current_seq)})
|
||||
|
||||
return sequences
|
||||
|
||||
|
||||
def analyze_sequences(
|
||||
fasta_file,
|
||||
blast_db="nr",
|
||||
align=True,
|
||||
predict_structure=False,
|
||||
output_dir="output",
|
||||
):
|
||||
"""
|
||||
Perform batch sequence analysis.
|
||||
|
||||
Args:
|
||||
fasta_file: Path to FASTA file with sequences
|
||||
blast_db: BLAST database to search (default: nr)
|
||||
align: Whether to perform multiple sequence alignment
|
||||
predict_structure: Whether to predict structures with AlphaFold
|
||||
output_dir: Output directory for results
|
||||
"""
|
||||
output_path = Path(output_dir)
|
||||
output_path.mkdir(exist_ok=True)
|
||||
|
||||
print(f"Batch Sequence Analysis")
|
||||
print("=" * 60)
|
||||
print(f"Input file: {fasta_file}")
|
||||
print(f"Output directory: {output_dir}")
|
||||
print("")
|
||||
|
||||
# Read sequences
|
||||
print("Reading sequences...")
|
||||
sequences = read_fasta(fasta_file)
|
||||
print(f"Found {len(sequences)} sequences\n")
|
||||
|
||||
# Step 1: BLAST each sequence
|
||||
print("Step 1: Running BLAST searches...")
|
||||
print("-" * 60)
|
||||
for i, seq_data in enumerate(sequences):
|
||||
print(f"\n{i+1}. BLASTing {seq_data['id']}...")
|
||||
try:
|
||||
blast_results = gget.blast(
|
||||
seq_data["seq"], database=blast_db, limit=10, save=False
|
||||
)
|
||||
|
||||
output_file = output_path / f"{seq_data['id']}_blast.csv"
|
||||
blast_results.to_csv(output_file, index=False)
|
||||
print(f" Results saved to: {output_file}")
|
||||
|
||||
if len(blast_results) > 0:
|
||||
print(f" Top hit: {blast_results.iloc[0]['Description']}")
|
||||
print(
|
||||
f" Max Score: {blast_results.iloc[0]['Max Score']}, "
|
||||
f"Query Coverage: {blast_results.iloc[0]['Query Coverage']}"
|
||||
)
|
||||
except Exception as e:
|
||||
print(f" Error: {e}")
|
||||
|
||||
# Step 2: Multiple sequence alignment
|
||||
if align and len(sequences) > 1:
|
||||
print("\n\nStep 2: Multiple sequence alignment...")
|
||||
print("-" * 60)
|
||||
try:
|
||||
alignment = gget.muscle(fasta_file)
|
||||
alignment_file = output_path / "alignment.afa"
|
||||
with open(alignment_file, "w") as f:
|
||||
f.write(alignment)
|
||||
print(f"Alignment saved to: {alignment_file}")
|
||||
except Exception as e:
|
||||
print(f"Error in alignment: {e}")
|
||||
else:
|
||||
print("\n\nStep 2: Skipping alignment (only 1 sequence or disabled)")
|
||||
|
||||
# Step 3: Structure prediction (optional)
|
||||
if predict_structure:
|
||||
print("\n\nStep 3: Predicting structures with AlphaFold...")
|
||||
print("-" * 60)
|
||||
print(
|
||||
"Note: This requires 'gget setup alphafold' and is computationally intensive"
|
||||
)
|
||||
|
||||
for i, seq_data in enumerate(sequences):
|
||||
print(f"\n{i+1}. Predicting structure for {seq_data['id']}...")
|
||||
try:
|
||||
structure_dir = output_path / f"structure_{seq_data['id']}"
|
||||
# Uncomment to run AlphaFold prediction:
|
||||
# gget.alphafold(seq_data['seq'], out=str(structure_dir))
|
||||
# print(f" Structure saved to: {structure_dir}")
|
||||
print(
|
||||
" (Prediction skipped - uncomment code to run AlphaFold prediction)"
|
||||
)
|
||||
except Exception as e:
|
||||
print(f" Error: {e}")
|
||||
else:
|
||||
print("\n\nStep 3: Structure prediction disabled")
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("Batch analysis complete!")
|
||||
print(f"\nResults saved to: {output_dir}/")
|
||||
print(f" - BLAST results: *_blast.csv")
|
||||
if align and len(sequences) > 1:
|
||||
print(f" - Alignment: alignment.afa")
|
||||
if predict_structure:
|
||||
print(f" - Structures: structure_*/")
|
||||
|
||||
return True
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Perform batch sequence analysis using gget"
|
||||
)
|
||||
parser.add_argument("fasta", help="Input FASTA file with sequences")
|
||||
parser.add_argument(
|
||||
"-db",
|
||||
"--database",
|
||||
default="nr",
|
||||
help="BLAST database (default: nr for proteins, nt for nucleotides)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--no-align", action="store_true", help="Skip multiple sequence alignment"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--predict-structure",
|
||||
action="store_true",
|
||||
help="Predict structures with AlphaFold (requires setup)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-o", "--output", default="output", help="Output directory (default: output)"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not Path(args.fasta).exists():
|
||||
print(f"Error: File not found: {args.fasta}")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
success = analyze_sequences(
|
||||
args.fasta,
|
||||
blast_db=args.database,
|
||||
align=not args.no_align,
|
||||
predict_structure=args.predict_structure,
|
||||
output_dir=args.output,
|
||||
)
|
||||
sys.exit(0 if success else 1)
|
||||
except KeyboardInterrupt:
|
||||
print("\n\nAnalysis interrupted by user")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"\n\nError: {e}")
|
||||
import traceback
|
||||
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
235
scientific-packages/gget/scripts/enrichment_pipeline.py
Executable file
@@ -0,0 +1,235 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Enrichment Analysis Pipeline
|
||||
Perform comprehensive enrichment analysis on a gene list
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
import gget
|
||||
import pandas as pd
|
||||
|
||||
|
||||
def read_gene_list(file_path):
|
||||
"""Read gene list from file (one gene per line or CSV)."""
|
||||
file_path = Path(file_path)
|
||||
|
||||
if file_path.suffix == ".csv":
|
||||
df = pd.read_csv(file_path)
|
||||
# Assume first column contains gene names
|
||||
genes = df.iloc[:, 0].tolist()
|
||||
else:
|
||||
# Plain text file
|
||||
with open(file_path, "r") as f:
|
||||
genes = [line.strip() for line in f if line.strip()]
|
||||
|
||||
return genes
|
||||
|
||||
|
||||
def enrichment_pipeline(
|
||||
gene_list,
|
||||
species="human",
|
||||
background=None,
|
||||
output_prefix="enrichment",
|
||||
plot=True,
|
||||
):
|
||||
"""
|
||||
Perform comprehensive enrichment analysis.
|
||||
|
||||
Args:
|
||||
gene_list: List of gene symbols
|
||||
species: Species for analysis
|
||||
background: Background gene list (optional)
|
||||
output_prefix: Prefix for output files
|
||||
plot: Whether to generate plots
|
||||
"""
|
||||
print("Enrichment Analysis Pipeline")
|
||||
print("=" * 60)
|
||||
print(f"Analyzing {len(gene_list)} genes")
|
||||
print(f"Species: {species}\n")
|
||||
|
||||
# Database categories to analyze
|
||||
databases = {
|
||||
"pathway": "KEGG Pathways",
|
||||
"ontology": "Gene Ontology (Biological Process)",
|
||||
"transcription": "Transcription Factors (ChEA)",
|
||||
"diseases_drugs": "Disease Associations (GWAS)",
|
||||
"celltypes": "Cell Type Markers (PanglaoDB)",
|
||||
}
|
||||
|
||||
results = {}
|
||||
|
||||
for db_key, db_name in databases.items():
|
||||
print(f"\nAnalyzing: {db_name}")
|
||||
print("-" * 60)
|
||||
|
||||
try:
|
||||
enrichment = gget.enrichr(
|
||||
gene_list,
|
||||
database=db_key,
|
||||
species=species,
|
||||
background_list=background,
|
||||
plot=plot,
|
||||
)
|
||||
|
||||
if enrichment is not None and len(enrichment) > 0:
|
||||
# Save results
|
||||
output_file = f"{output_prefix}_{db_key}.csv"
|
||||
enrichment.to_csv(output_file, index=False)
|
||||
print(f"Results saved to: {output_file}")
|
||||
|
||||
# Show top 5 results
|
||||
print(f"\nTop 5 enriched terms:")
|
||||
for i, row in enrichment.head(5).iterrows():
|
||||
term = row.get("name", row.get("term", "Unknown"))
|
||||
p_val = row.get(
|
||||
"adjusted_p_value",
|
||||
row.get("p_value", row.get("Adjusted P-value", 1)),
|
||||
)
|
||||
print(f" {i+1}. {term}")
|
||||
print(f" P-value: {p_val:.2e}")
|
||||
|
||||
results[db_key] = enrichment
|
||||
else:
|
||||
print("No significant results found")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
|
||||
# Generate summary report
|
||||
print("\n" + "=" * 60)
|
||||
print("Generating summary report...")
|
||||
|
||||
summary = []
|
||||
for db_key, db_name in databases.items():
|
||||
if db_key in results and len(results[db_key]) > 0:
|
||||
summary.append(
|
||||
{
|
||||
"Database": db_name,
|
||||
"Total Terms": len(results[db_key]),
|
||||
"Top Term": results[db_key].iloc[0].get(
|
||||
"name", results[db_key].iloc[0].get("term", "N/A")
|
||||
),
|
||||
}
|
||||
)
|
||||
|
||||
if summary:
|
||||
summary_df = pd.DataFrame(summary)
|
||||
summary_file = f"{output_prefix}_summary.csv"
|
||||
summary_df.to_csv(summary_file, index=False)
|
||||
print(f"\nSummary saved to: {summary_file}")
|
||||
print("\n" + summary_df.to_string(index=False))
|
||||
else:
|
||||
print("\nNo enrichment results to summarize")
|
||||
|
||||
# Get expression data for genes
|
||||
print("\n" + "=" * 60)
|
||||
print("Getting expression data for input genes...")
|
||||
|
||||
try:
|
||||
# Get tissue expression for first few genes
|
||||
expr_data = []
|
||||
for gene in gene_list[:5]: # Limit to first 5
|
||||
print(f" Getting expression for {gene}...")
|
||||
try:
|
||||
tissue_expr = gget.archs4(gene, which="tissue")
|
||||
top_tissue = tissue_expr.nlargest(1, "median").iloc[0]
|
||||
expr_data.append(
|
||||
{
|
||||
"Gene": gene,
|
||||
"Top Tissue": top_tissue["tissue"],
|
||||
"Median Expression": top_tissue["median"],
|
||||
}
|
||||
)
|
||||
except Exception as e:
|
||||
print(f" Warning: {e}")
|
||||
|
||||
if expr_data:
|
||||
expr_df = pd.DataFrame(expr_data)
|
||||
expr_file = f"{output_prefix}_expression.csv"
|
||||
expr_df.to_csv(expr_file, index=False)
|
||||
print(f"\nExpression data saved to: {expr_file}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error getting expression data: {e}")
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("Enrichment analysis complete!")
|
||||
print(f"\nOutput files (prefix: {output_prefix}):")
|
||||
for db_key in databases.keys():
|
||||
if db_key in results:
|
||||
print(f" - {output_prefix}_{db_key}.csv")
|
||||
print(f" - {output_prefix}_summary.csv")
|
||||
print(f" - {output_prefix}_expression.csv")
|
||||
|
||||
return True
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Perform comprehensive enrichment analysis using gget"
|
||||
)
|
||||
parser.add_argument(
|
||||
"genes",
|
||||
help="Gene list file (one gene per line or CSV with genes in first column)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-s",
|
||||
"--species",
|
||||
default="human",
|
||||
help="Species (human, mouse, fly, yeast, worm, fish)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-b", "--background", help="Background gene list file (optional)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"-o", "--output", default="enrichment", help="Output prefix (default: enrichment)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--no-plot", action="store_true", help="Disable plotting"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Read gene list
|
||||
if not Path(args.genes).exists():
|
||||
print(f"Error: File not found: {args.genes}")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
gene_list = read_gene_list(args.genes)
|
||||
print(f"Read {len(gene_list)} genes from {args.genes}")
|
||||
|
||||
# Read background if provided
|
||||
background = None
|
||||
if args.background:
|
||||
if Path(args.background).exists():
|
||||
background = read_gene_list(args.background)
|
||||
print(f"Read {len(background)} background genes from {args.background}")
|
||||
else:
|
||||
print(f"Warning: Background file not found: {args.background}")
|
||||
|
||||
success = enrichment_pipeline(
|
||||
gene_list,
|
||||
species=args.species,
|
||||
background=background,
|
||||
output_prefix=args.output,
|
||||
plot=not args.no_plot,
|
||||
)
|
||||
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n\nAnalysis interrupted by user")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"\n\nError: {e}")
|
||||
import traceback
|
||||
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
161
scientific-packages/gget/scripts/gene_analysis.py
Executable file
@@ -0,0 +1,161 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Gene Analysis Script
|
||||
Quick analysis of a gene: search, info, sequences, expression, and enrichment
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import gget
|
||||
|
||||
|
||||
def analyze_gene(gene_name, species="homo_sapiens", output_prefix=None):
|
||||
"""
|
||||
Perform comprehensive analysis of a gene.
|
||||
|
||||
Args:
|
||||
gene_name: Gene symbol to analyze
|
||||
species: Species name (default: homo_sapiens)
|
||||
output_prefix: Prefix for output files (default: gene_name)
|
||||
"""
|
||||
if output_prefix is None:
|
||||
output_prefix = gene_name.lower()
|
||||
|
||||
print(f"Analyzing gene: {gene_name}")
|
||||
print("=" * 60)
|
||||
|
||||
# Step 1: Search for the gene
|
||||
print("\n1. Searching for gene...")
|
||||
search_results = gget.search([gene_name], species=species, limit=1)
|
||||
|
||||
if len(search_results) == 0:
|
||||
print(f"Error: Gene '{gene_name}' not found in {species}")
|
||||
return False
|
||||
|
||||
gene_id = search_results["ensembl_id"].iloc[0]
|
||||
print(f" Found: {gene_id}")
|
||||
print(f" Description: {search_results['ensembl_description'].iloc[0]}")
|
||||
|
||||
# Step 2: Get detailed information
|
||||
print("\n2. Getting detailed information...")
|
||||
gene_info = gget.info([gene_id], pdb=True)
|
||||
gene_info.to_csv(f"{output_prefix}_info.csv", index=False)
|
||||
print(f" Saved to: {output_prefix}_info.csv")
|
||||
|
||||
if "uniprot_id" in gene_info.columns and gene_info["uniprot_id"].iloc[0]:
|
||||
print(f" UniProt ID: {gene_info['uniprot_id'].iloc[0]}")
|
||||
if "pdb_id" in gene_info.columns and gene_info["pdb_id"].iloc[0]:
|
||||
print(f" PDB IDs: {gene_info['pdb_id'].iloc[0]}")
|
||||
|
||||
# Step 3: Get sequences
|
||||
print("\n3. Retrieving sequences...")
|
||||
nucleotide_seq = gget.seq([gene_id])
|
||||
protein_seq = gget.seq([gene_id], translate=True)
|
||||
|
||||
with open(f"{output_prefix}_nucleotide.fasta", "w") as f:
|
||||
f.write(nucleotide_seq)
|
||||
print(f" Nucleotide sequence saved to: {output_prefix}_nucleotide.fasta")
|
||||
|
||||
with open(f"{output_prefix}_protein.fasta", "w") as f:
|
||||
f.write(protein_seq)
|
||||
print(f" Protein sequence saved to: {output_prefix}_protein.fasta")
|
||||
|
||||
# Step 4: Get tissue expression
|
||||
print("\n4. Getting tissue expression...")
|
||||
try:
|
||||
tissue_expr = gget.archs4(gene_name, which="tissue")
|
||||
tissue_expr.to_csv(f"{output_prefix}_tissue_expression.csv", index=False)
|
||||
print(f" Saved to: {output_prefix}_tissue_expression.csv")
|
||||
|
||||
# Show top tissues
|
||||
top_tissues = tissue_expr.nlargest(5, "median")
|
||||
print("\n Top expressing tissues:")
|
||||
for _, row in top_tissues.iterrows():
|
||||
print(f" {row['tissue']}: median = {row['median']:.2f}")
|
||||
except Exception as e:
|
||||
print(f" Warning: Could not retrieve ARCHS4 data: {e}")
|
||||
|
||||
# Step 5: Find correlated genes
|
||||
print("\n5. Finding correlated genes...")
|
||||
try:
|
||||
correlated = gget.archs4(gene_name, which="correlation")
|
||||
correlated.to_csv(f"{output_prefix}_correlated_genes.csv", index=False)
|
||||
print(f" Saved to: {output_prefix}_correlated_genes.csv")
|
||||
|
||||
# Show top correlated
|
||||
print("\n Top 10 correlated genes:")
|
||||
for _, row in correlated.head(10).iterrows():
|
||||
print(f" {row['gene_symbol']}: r = {row['correlation']:.3f}")
|
||||
except Exception as e:
|
||||
print(f" Warning: Could not retrieve correlation data: {e}")
|
||||
|
||||
# Step 6: Get disease associations
|
||||
print("\n6. Getting disease associations...")
|
||||
try:
|
||||
diseases = gget.opentargets(gene_id, resource="diseases", limit=10)
|
||||
diseases.to_csv(f"{output_prefix}_diseases.csv", index=False)
|
||||
print(f" Saved to: {output_prefix}_diseases.csv")
|
||||
|
||||
print("\n Top 5 disease associations:")
|
||||
for _, row in diseases.head(5).iterrows():
|
||||
print(f" {row['disease_name']}: score = {row['overall_score']:.3f}")
|
||||
except Exception as e:
|
||||
print(f" Warning: Could not retrieve disease data: {e}")
|
||||
|
||||
# Step 7: Get drug associations
|
||||
print("\n7. Getting drug associations...")
|
||||
try:
|
||||
drugs = gget.opentargets(gene_id, resource="drugs", limit=10)
|
||||
if len(drugs) > 0:
|
||||
drugs.to_csv(f"{output_prefix}_drugs.csv", index=False)
|
||||
print(f" Saved to: {output_prefix}_drugs.csv")
|
||||
print(f"\n Found {len(drugs)} drug associations")
|
||||
else:
|
||||
print(" No drug associations found")
|
||||
except Exception as e:
|
||||
print(f" Warning: Could not retrieve drug data: {e}")
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("Analysis complete!")
|
||||
print(f"\nOutput files (prefix: {output_prefix}):")
|
||||
print(f" - {output_prefix}_info.csv")
|
||||
print(f" - {output_prefix}_nucleotide.fasta")
|
||||
print(f" - {output_prefix}_protein.fasta")
|
||||
print(f" - {output_prefix}_tissue_expression.csv")
|
||||
print(f" - {output_prefix}_correlated_genes.csv")
|
||||
print(f" - {output_prefix}_diseases.csv")
|
||||
print(f" - {output_prefix}_drugs.csv (if available)")
|
||||
|
||||
return True
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Perform comprehensive analysis of a gene using gget"
|
||||
)
|
||||
parser.add_argument("gene", help="Gene symbol to analyze")
|
||||
parser.add_argument(
|
||||
"-s",
|
||||
"--species",
|
||||
default="homo_sapiens",
|
||||
help="Species (default: homo_sapiens)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-o", "--output", help="Output prefix for files (default: gene name)"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
try:
|
||||
success = analyze_gene(args.gene, args.species, args.output)
|
||||
sys.exit(0 if success else 1)
|
||||
except KeyboardInterrupt:
|
||||
print("\n\nAnalysis interrupted by user")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"\n\nError: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
355
scientific-packages/matplotlib/SKILL.md
Normal file
@@ -0,0 +1,355 @@
|
||||
---
|
||||
name: matplotlib
|
||||
description: Comprehensive toolkit for creating publication-quality data visualizations in Python. Use this skill when creating plots, charts, or any scientific/statistical visualizations including line plots, scatter plots, bar charts, histograms, heatmaps, 3D plots, and more. Applies to tasks involving data visualization, figure generation, plot customization, or exporting graphics to various formats.
|
||||
---
|
||||
|
||||
# Matplotlib
|
||||
|
||||
## Overview
|
||||
|
||||
Matplotlib is Python's foundational visualization library for creating static, animated, and interactive plots. This skill provides guidance on using matplotlib effectively, covering both the pyplot interface (MATLAB-style) and the object-oriented API (Figure/Axes), along with best practices for creating publication-quality visualizations.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Apply this skill when:
|
||||
- Creating any type of plot or chart (line, scatter, bar, histogram, heatmap, contour, etc.)
|
||||
- Generating scientific or statistical visualizations
|
||||
- Customizing plot appearance (colors, styles, labels, legends)
|
||||
- Creating multi-panel figures with subplots
|
||||
- Exporting visualizations to various formats (PNG, PDF, SVG, etc.)
|
||||
- Building interactive plots or animations
|
||||
- Working with 3D visualizations
|
||||
- Integrating plots into Jupyter notebooks or GUI applications
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### The Matplotlib Hierarchy
|
||||
|
||||
Matplotlib uses a hierarchical structure of objects:
|
||||
|
||||
1. **Figure** - The top-level container for all plot elements
|
||||
2. **Axes** - The actual plotting area where data is displayed (one Figure can contain multiple Axes)
|
||||
3. **Artist** - Everything visible on the figure (lines, text, ticks, etc.)
|
||||
4. **Axis** - The number line objects (x-axis, y-axis) that handle ticks and labels
|
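As a minimal sketch of this hierarchy in action (the data and variable names are just illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()                             # Figure containing one Axes
(line,) = ax.plot(np.arange(5), np.arange(5) ** 2)   # the Line2D is an Artist

print(type(fig).__name__)     # Figure
print(fig.axes)               # list holding the Axes above
print(type(line).__name__)    # Line2D, owned by the Axes
print(ax.xaxis)               # XAxis object that manages ticks and labels
plt.close(fig)
```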
||||
|
||||
### Two Interfaces
|
||||
|
||||
**1. pyplot Interface (Implicit, MATLAB-style)**
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
plt.plot([1, 2, 3, 4])
|
||||
plt.ylabel('some numbers')
|
||||
plt.show()
|
||||
```
|
||||
- Convenient for quick, simple plots
|
||||
- Maintains state automatically
|
||||
- Good for interactive work and simple scripts
|
||||
|
||||
**2. Object-Oriented Interface (Explicit)**
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
ax.plot([1, 2, 3, 4])
|
||||
ax.set_ylabel('some numbers')
|
||||
plt.show()
|
||||
```
|
||||
- **Recommended for most use cases**
|
||||
- More explicit control over figure and axes
|
||||
- Better for complex figures with multiple subplots
|
||||
- Easier to maintain and debug
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### 1. Basic Plot Creation
|
||||
|
||||
**Single plot workflow:**
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
# Create figure and axes (OO interface - RECOMMENDED)
|
||||
fig, ax = plt.subplots(figsize=(10, 6))
|
||||
|
||||
# Generate and plot data
|
||||
x = np.linspace(0, 2*np.pi, 100)
|
||||
ax.plot(x, np.sin(x), label='sin(x)')
|
||||
ax.plot(x, np.cos(x), label='cos(x)')
|
||||
|
||||
# Customize
|
||||
ax.set_xlabel('x')
|
||||
ax.set_ylabel('y')
|
||||
ax.set_title('Trigonometric Functions')
|
||||
ax.legend()
|
||||
ax.grid(True, alpha=0.3)
|
||||
|
||||
# Save and/or display
|
||||
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### 2. Multiple Subplots
|
||||
|
||||
**Creating subplot layouts:**
|
||||
```python
|
||||
# Method 1: Regular grid
|
||||
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
|
||||
axes[0, 0].plot(x, y1)
|
||||
axes[0, 1].scatter(x, y2)
|
||||
axes[1, 0].bar(categories, values)
|
||||
axes[1, 1].hist(data, bins=30)
|
||||
|
||||
# Method 2: Mosaic layout (more flexible)
|
||||
fig, axes = plt.subplot_mosaic([['left', 'right_top'],
|
||||
['left', 'right_bottom']],
|
||||
figsize=(10, 8))
|
||||
axes['left'].plot(x, y)
|
||||
axes['right_top'].scatter(x, y)
|
||||
axes['right_bottom'].hist(data)
|
||||
|
||||
# Method 3: GridSpec (maximum control)
|
||||
from matplotlib.gridspec import GridSpec
|
||||
fig = plt.figure(figsize=(12, 8))
|
||||
gs = GridSpec(3, 3, figure=fig)
|
||||
ax1 = fig.add_subplot(gs[0, :]) # Top row, all columns
|
||||
ax2 = fig.add_subplot(gs[1:, 0]) # Bottom two rows, first column
|
||||
ax3 = fig.add_subplot(gs[1:, 1:]) # Bottom two rows, last two columns
|
||||
```
|
||||
|
||||
### 3. Plot Types and Use Cases
|
||||
|
||||
**Line plots** - Time series, continuous data, trends
|
||||
```python
|
||||
ax.plot(x, y, linewidth=2, linestyle='--', marker='o', color='blue')
|
||||
```
|
||||
|
||||
**Scatter plots** - Relationships between variables, correlations
|
||||
```python
|
||||
ax.scatter(x, y, s=sizes, c=colors, alpha=0.6, cmap='viridis')
|
||||
```
|
||||
|
||||
**Bar charts** - Categorical comparisons
|
||||
```python
|
||||
ax.bar(categories, values, color='steelblue', edgecolor='black')
|
||||
# For horizontal bars:
|
||||
ax.barh(categories, values)
|
||||
```
|
||||
|
||||
**Histograms** - Distributions
|
||||
```python
|
||||
ax.hist(data, bins=30, edgecolor='black', alpha=0.7)
|
||||
```
|
||||
|
||||
**Heatmaps** - Matrix data, correlations
|
||||
```python
|
||||
im = ax.imshow(matrix, cmap='coolwarm', aspect='auto')
|
||||
plt.colorbar(im, ax=ax)
|
||||
```
|
||||
|
||||
**Contour plots** - 3D data on 2D plane
|
||||
```python
|
||||
contour = ax.contour(X, Y, Z, levels=10)
|
||||
ax.clabel(contour, inline=True, fontsize=8)
|
||||
```
|
||||
|
||||
**Box plots** - Statistical distributions
|
||||
```python
|
||||
ax.boxplot([data1, data2, data3], labels=['A', 'B', 'C'])
|
||||
```
|
||||
|
||||
**Violin plots** - Distribution densities
|
||||
```python
|
||||
ax.violinplot([data1, data2, data3], positions=[1, 2, 3])
|
||||
```
|
||||
|
||||
For comprehensive plot type examples and variations, refer to `references/plot_types.md`.
|
||||
|
||||
### 4. Styling and Customization
|
||||
|
||||
**Color specification methods:**
|
||||
- Named colors: `'red'`, `'blue'`, `'steelblue'`
|
||||
- Hex codes: `'#FF5733'`
|
||||
- RGB tuples: `(0.1, 0.2, 0.3)`
|
||||
- Colormaps: `cmap='viridis'`, `cmap='plasma'`, `cmap='coolwarm'`
|
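A minimal sketch showing each specification style in a single plot (the data is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 50)
fig, ax = plt.subplots()
ax.plot(x, x, color='steelblue')                      # named color
ax.plot(x, x**2, color='#FF5733')                     # hex code
ax.plot(x, x**3, color=(0.1, 0.2, 0.3))               # RGB tuple
sc = ax.scatter(x, np.sqrt(x), c=x, cmap='viridis')   # colormap for numeric values
fig.colorbar(sc, ax=ax)
plt.show()
```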
||||
|
||||
**Using style sheets:**
|
||||
```python
|
||||
plt.style.use('seaborn-v0_8-darkgrid') # Apply predefined style
|
||||
# Available styles: 'ggplot', 'bmh', 'fivethirtyeight', etc.
|
||||
print(plt.style.available) # List all available styles
|
||||
```
|
||||
|
||||
**Customizing with rcParams:**
|
||||
```python
|
||||
plt.rcParams['font.size'] = 12
|
||||
plt.rcParams['axes.labelsize'] = 14
|
||||
plt.rcParams['axes.titlesize'] = 16
|
||||
plt.rcParams['xtick.labelsize'] = 10
|
||||
plt.rcParams['ytick.labelsize'] = 10
|
||||
plt.rcParams['legend.fontsize'] = 12
|
||||
plt.rcParams['figure.titlesize'] = 18
|
||||
```
|
||||
|
||||
**Text and annotations:**
|
||||
```python
|
||||
ax.text(x, y, 'annotation', fontsize=12, ha='center')
|
||||
ax.annotate('important point', xy=(x, y), xytext=(x+1, y+1),
|
||||
arrowprops=dict(arrowstyle='->', color='red'))
|
||||
```
|
||||
|
||||
For detailed styling options and colormap guidelines, see `references/styling_guide.md`.
|
||||
|
||||
### 5. Saving Figures
|
||||
|
||||
**Export to various formats:**
|
||||
```python
|
||||
# High-resolution PNG for presentations/papers
|
||||
plt.savefig('figure.png', dpi=300, bbox_inches='tight', facecolor='white')
|
||||
|
||||
# Vector format for publications (scalable)
|
||||
plt.savefig('figure.pdf', bbox_inches='tight')
|
||||
plt.savefig('figure.svg', bbox_inches='tight')
|
||||
|
||||
# Transparent background
|
||||
plt.savefig('figure.png', dpi=300, bbox_inches='tight', transparent=True)
|
||||
```
|
||||
|
||||
**Important parameters:**
|
||||
- `dpi`: Resolution (300 for publications, 150 for web, 72 for screen)
|
||||
- `bbox_inches='tight'`: Removes excess whitespace
|
||||
- `facecolor='white'`: Ensures white background (useful for transparent themes)
|
||||
- `transparent=True`: Transparent background
|
||||
|
||||
### 6. Working with 3D Plots
|
||||
|
||||
```python
|
||||
from mpl_toolkits.mplot3d import Axes3D
|
||||
|
||||
fig = plt.figure(figsize=(10, 8))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
|
||||
# Surface plot
|
||||
ax.plot_surface(X, Y, Z, cmap='viridis')
|
||||
|
||||
# 3D scatter
|
||||
ax.scatter(x, y, z, c=colors, marker='o')
|
||||
|
||||
# 3D line plot
|
||||
ax.plot(x, y, z, linewidth=2)
|
||||
|
||||
# Labels
|
||||
ax.set_xlabel('X Label')
|
||||
ax.set_ylabel('Y Label')
|
||||
ax.set_zlabel('Z Label')
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Interface Selection
|
||||
- **Use the object-oriented interface** (fig, ax = plt.subplots()) for production code
|
||||
- Reserve pyplot interface for quick interactive exploration only
|
||||
- Always create figures explicitly rather than relying on implicit state
|
||||
|
||||
### 2. Figure Size and DPI
|
||||
- Set figsize at creation: `fig, ax = plt.subplots(figsize=(10, 6))`
|
||||
- Use appropriate DPI for output medium:
|
||||
- Screen/notebook: 72-100 dpi
|
||||
- Web: 150 dpi
|
||||
- Print/publications: 300 dpi
|
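Because `figsize` is given in inches, the saved image is `figsize × dpi` pixels; a quick sketch (file names are illustrative):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))   # 10 x 6 inches
fig.savefig('web.png', dpi=150)           # 1500 x 900 pixels
fig.savefig('print.png', dpi=300)         # 3000 x 1800 pixels

w_in, h_in = fig.get_size_inches()
print(w_in * 300, h_in * 300)             # pixel dimensions at 300 dpi
plt.close(fig)
```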
||||
|
||||
### 3. Layout Management
|
||||
- Use `constrained_layout=True` or `tight_layout()` to prevent overlapping elements
|
||||
- `fig, ax = plt.subplots(constrained_layout=True)` is recommended for automatic spacing
|
||||
|
||||
### 4. Colormap Selection
|
||||
- **Sequential** (viridis, plasma, inferno): Ordered data with consistent progression
|
||||
- **Diverging** (coolwarm, RdBu): Data with meaningful center point (e.g., zero)
|
||||
- **Qualitative** (tab10, Set3): Categorical/nominal data
|
||||
- Avoid rainbow colormaps (jet) - they are not perceptually uniform
|
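A small sketch applying each colormap family to the kind of data it suits (random data, purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=(20, 20))
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14, 4), constrained_layout=True)

ax1.imshow(np.abs(data), cmap='viridis')              # sequential: magnitudes
ax2.imshow(data, cmap='coolwarm', vmin=-3, vmax=3)    # diverging: centered on zero
ax3.bar(['A', 'B', 'C'], [3, 5, 2],
        color=plt.cm.tab10.colors[:3])                # qualitative: categories
plt.show()
```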
||||
|
||||
### 5. Accessibility
|
||||
- Use colorblind-friendly colormaps (viridis, cividis)
|
||||
- Add patterns/hatching for bar charts in addition to colors
|
||||
- Ensure sufficient contrast between elements
|
||||
- Include descriptive labels and legends
|
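For example, a sketch combining a colorblind-friendly colormap with hatching (categories and values are made up):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(constrained_layout=True)
categories = ['Control', 'Treatment A', 'Treatment B']
values = [4.2, 6.8, 5.1]
hatches = ['//', 'xx', '..']            # patterns distinguish bars without relying on color

bars = ax.bar(categories, values,
              color=plt.cm.cividis([0.2, 0.5, 0.8]), edgecolor='black')
for bar, hatch in zip(bars, hatches):
    bar.set_hatch(hatch)

ax.set_ylabel('Response (arbitrary units)')
plt.show()
```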
||||
|
||||
### 6. Performance
|
||||
- For large datasets, use `rasterized=True` in plot calls to reduce file size
|
||||
- Use appropriate data reduction before plotting (e.g., downsample dense time series)
|
||||
- For animations, use blitting for better performance
|
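A sketch combining simple decimation with rasterized line output (the factor of 100 is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 100, 1_000_000)
y = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

step = 100                                       # keep every 100th sample
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(t[::step], y[::step], rasterized=True)   # rasterize the dense line in vector output
fig.savefig('timeseries.pdf', dpi=200)
plt.close(fig)
```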
||||
|
||||
### 7. Code Organization
|
||||
```python
|
||||
# Good practice: Clear structure
|
||||
def create_analysis_plot(data, title):
|
||||
"""Create standardized analysis plot."""
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
# Plot data
|
||||
ax.plot(data['x'], data['y'], linewidth=2)
|
||||
|
||||
# Customize
|
||||
ax.set_xlabel('X Axis Label', fontsize=12)
|
||||
ax.set_ylabel('Y Axis Label', fontsize=12)
|
||||
ax.set_title(title, fontsize=14, fontweight='bold')
|
||||
ax.grid(True, alpha=0.3)
|
||||
|
||||
return fig, ax
|
||||
|
||||
# Use the function
|
||||
fig, ax = create_analysis_plot(my_data, 'My Analysis')
|
||||
plt.savefig('analysis.png', dpi=300, bbox_inches='tight')
|
||||
```
|
||||
|
||||
## Quick Reference Scripts
|
||||
|
||||
This skill includes helper scripts in the `scripts/` directory:
|
||||
|
||||
### `plot_template.py`
|
||||
Template script demonstrating various plot types with best practices. Use this as a starting point for creating new visualizations.
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
python scripts/plot_template.py
|
||||
```
|
||||
|
||||
### `style_configurator.py`
|
||||
Interactive utility to configure matplotlib style preferences and generate custom style sheets.
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
python scripts/style_configurator.py
|
||||
```
|
||||
|
||||
## Detailed References
|
||||
|
||||
For comprehensive information, consult the reference documents:
|
||||
|
||||
- **`references/plot_types.md`** - Complete catalog of plot types with code examples and use cases
|
||||
- **`references/styling_guide.md`** - Detailed styling options, colormaps, and customization
|
||||
- **`references/api_reference.md`** - Core classes and methods reference
|
||||
- **`references/common_issues.md`** - Troubleshooting guide for common problems
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
Matplotlib integrates well with:
|
||||
- **NumPy/Pandas** - Direct plotting from arrays and DataFrames
|
||||
- **Seaborn** - High-level statistical visualizations built on matplotlib
|
||||
- **Jupyter** - Interactive plotting with `%matplotlib inline` or `%matplotlib widget`
|
||||
- **GUI frameworks** - Embedding in Tkinter, Qt, wxPython applications
|
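For instance, a pandas DataFrame can draw directly onto an existing Axes (the column names here are made up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': np.arange(100),
                   'y': np.random.default_rng(0).normal(size=100).cumsum()})

fig, ax = plt.subplots(figsize=(8, 4))
df.plot(x='x', y='y', ax=ax, legend=False)   # pandas plots onto the matplotlib Axes
ax.plot(df['x'], df['y'].rolling(10).mean(), linewidth=2, label='10-point rolling mean')
ax.legend()
plt.show()
```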
||||
|
||||
## Common Gotchas
|
||||
|
||||
1. **Overlapping elements**: Use `constrained_layout=True` or `tight_layout()`
|
||||
2. **State confusion**: Use OO interface to avoid pyplot state machine issues
|
||||
3. **Memory issues with many figures**: Close figures explicitly with `plt.close(fig)`
|
||||
4. **Font warnings**: Install the missing fonts, or set `plt.rcParams['font.sans-serif']` to fonts that actually exist on the system so matplotlib finds a match
|
||||
5. **DPI confusion**: Remember that figsize is in inches, not pixels: `pixels = dpi * inches`
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- Official documentation: https://matplotlib.org/
|
||||
- Gallery: https://matplotlib.org/stable/gallery/index.html
|
||||
- Cheatsheets: https://matplotlib.org/cheatsheets/
|
||||
- Tutorials: https://matplotlib.org/stable/tutorials/index.html
|
||||
412
scientific-packages/matplotlib/references/api_reference.md
Normal file
@@ -0,0 +1,412 @@
|
||||
# Matplotlib API Reference
|
||||
|
||||
This document provides a quick reference for the most commonly used matplotlib classes and methods.
|
||||
|
||||
## Core Classes
|
||||
|
||||
### Figure
|
||||
|
||||
The top-level container for all plot elements.
|
||||
|
||||
**Creation:**
|
||||
```python
|
||||
fig = plt.figure(figsize=(10, 6), dpi=100, facecolor='white')
|
||||
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
|
||||
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
- `fig.add_subplot(nrows, ncols, index)` - Add a subplot
|
||||
- `fig.add_axes([left, bottom, width, height])` - Add axes at specific position
|
||||
- `fig.savefig(filename, dpi=300, bbox_inches='tight')` - Save figure
|
||||
- `fig.tight_layout()` - Adjust spacing to prevent overlaps
|
||||
- `fig.suptitle(title)` - Set figure title
|
||||
- `fig.legend()` - Create figure-level legend
|
||||
- `fig.colorbar(mappable)` - Add colorbar to figure
|
||||
- `plt.close(fig)` - Close figure to free memory
|
||||
|
||||
**Key Attributes:**
|
||||
- `fig.axes` - List of all axes in the figure
|
||||
- `fig.dpi` - Resolution in dots per inch
|
||||
- `fig.get_size_inches()` / `fig.set_size_inches(w, h)` - Figure dimensions in inches (width, height); note there is no `fig.figsize` attribute
|
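A quick sketch inspecting these attributes:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
print(fig.axes)                # [<Axes: >, <Axes: >]
print(fig.dpi)                 # e.g. 100.0
print(fig.get_size_inches())   # [10.  4.]
plt.close(fig)
```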
||||
|
||||
### Axes
|
||||
|
||||
The actual plotting area where data is visualized.
|
||||
|
||||
**Creation:**
|
||||
```python
|
||||
fig, ax = plt.subplots() # Single axes
|
||||
ax = fig.add_subplot(111) # Alternative method
|
||||
```
|
||||
|
||||
**Plotting Methods:**
|
||||
|
||||
**Line plots:**
|
||||
- `ax.plot(x, y, **kwargs)` - Line plot
|
||||
- `ax.step(x, y, where='pre'/'mid'/'post')` - Step plot
|
||||
- `ax.errorbar(x, y, yerr, xerr)` - Error bars
|
||||
|
||||
**Scatter plots:**
|
||||
- `ax.scatter(x, y, s=size, c=color, marker='o', alpha=0.5)` - Scatter plot
|
||||
|
||||
**Bar charts:**
|
||||
- `ax.bar(x, height, width=0.8, align='center')` - Vertical bar chart
|
||||
- `ax.barh(y, width)` - Horizontal bar chart
|
||||
|
||||
**Statistical plots:**
|
||||
- `ax.hist(data, bins=10, density=False)` - Histogram
|
||||
- `ax.boxplot(data, labels=None)` - Box plot
|
||||
- `ax.violinplot(data)` - Violin plot
|
||||
|
||||
**2D plots:**
|
||||
- `ax.imshow(array, cmap='viridis', aspect='auto')` - Display image/matrix
|
||||
- `ax.contour(X, Y, Z, levels=10)` - Contour lines
|
||||
- `ax.contourf(X, Y, Z, levels=10)` - Filled contours
|
||||
- `ax.pcolormesh(X, Y, Z)` - Pseudocolor plot
|
||||
|
||||
**Filling:**
|
||||
- `ax.fill_between(x, y1, y2, alpha=0.3)` - Fill between curves
|
||||
- `ax.fill_betweenx(y, x1, x2)` - Fill between vertical curves
|
||||
|
||||
**Text and annotations:**
|
||||
- `ax.text(x, y, text, fontsize=12)` - Add text
|
||||
- `ax.annotate(text, xy=(x, y), xytext=(x2, y2), arrowprops={})` - Annotate with arrow
|
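A short sketch combining several of the methods above on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
y = np.sin(x)
err = 0.2 * np.ones_like(x)

fig, ax = plt.subplots()
ax.errorbar(x, y, yerr=err, fmt='o', markersize=4, capsize=3, label='measurements')
ax.fill_between(x, y - err, y + err, alpha=0.3, label='uncertainty band')
ax.annotate('peak', xy=(np.pi / 2, 1.0), xytext=(4, 0.5),
            arrowprops=dict(arrowstyle='->'))
ax.legend()
plt.show()
```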
||||
|
||||
**Customization Methods:**
|
||||
|
||||
**Labels and titles:**
|
||||
- `ax.set_xlabel(label, fontsize=12)` - Set x-axis label
|
||||
- `ax.set_ylabel(label, fontsize=12)` - Set y-axis label
|
||||
- `ax.set_title(title, fontsize=14)` - Set axes title
|
||||
|
||||
**Limits and scales:**
|
||||
- `ax.set_xlim(left, right)` - Set x-axis limits
|
||||
- `ax.set_ylim(bottom, top)` - Set y-axis limits
|
||||
- `ax.set_xscale('linear'/'log'/'symlog')` - Set x-axis scale
|
||||
- `ax.set_yscale('linear'/'log'/'symlog')` - Set y-axis scale
|
||||
|
||||
**Ticks:**
|
||||
- `ax.set_xticks(positions)` - Set x-tick positions
|
||||
- `ax.set_xticklabels(labels)` - Set x-tick labels
|
||||
- `ax.tick_params(axis='both', labelsize=10)` - Customize tick appearance
|
||||
|
||||
**Grid and spines:**
|
||||
- `ax.grid(True, alpha=0.3, linestyle='--')` - Add grid
|
||||
- `ax.spines['top'].set_visible(False)` - Hide top spine
|
||||
- `ax.spines['right'].set_visible(False)` - Hide right spine
|
||||
|
||||
**Legend:**
|
||||
- `ax.legend(loc='best', fontsize=10, frameon=True)` - Add legend
|
||||
- `ax.legend(handles, labels)` - Custom legend
|
||||
|
||||
**Aspect and layout:**
|
||||
- `ax.set_aspect('equal'/'auto'/ratio)` - Set aspect ratio
|
||||
- `ax.invert_xaxis()` - Invert x-axis
|
||||
- `ax.invert_yaxis()` - Invert y-axis
|
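A compact sketch touching most of these customization calls at once (values are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 100, 200)
fig, ax = plt.subplots()
ax.plot(x, x**2, label='x squared')

ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Customization example', fontsize=14)
ax.set_xscale('log')
ax.set_ylim(1, 1e4)
ax.tick_params(axis='both', labelsize=10)
ax.grid(True, alpha=0.3, linestyle='--')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(loc='best', fontsize=10)
plt.show()
```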
||||
|
||||
### pyplot Module
|
||||
|
||||
High-level interface for quick plotting.
|
||||
|
||||
**Figure creation:**
|
||||
- `plt.figure()` - Create new figure
|
||||
- `plt.subplots()` - Create figure and axes
|
||||
- `plt.subplot()` - Add subplot to current figure
|
||||
|
||||
**Plotting (uses current axes):**
|
||||
- `plt.plot()` - Line plot
|
||||
- `plt.scatter()` - Scatter plot
|
||||
- `plt.bar()` - Bar chart
|
||||
- `plt.hist()` - Histogram
|
||||
- (All axes methods available)
|
||||
|
||||
**Display and save:**
|
||||
- `plt.show()` - Display figure
|
||||
- `plt.savefig()` - Save figure
|
||||
- `plt.close()` - Close figure
|
||||
|
||||
**Style:**
|
||||
- `plt.style.use(style_name)` - Apply style sheet
|
||||
- `plt.style.available` - List available styles
|
||||
|
||||
**State management:**
|
||||
- `plt.gca()` - Get current axes
|
||||
- `plt.gcf()` - Get current figure
|
||||
- `plt.sca(ax)` - Set current axes
|
||||
- `plt.clf()` - Clear current figure
|
||||
- `plt.cla()` - Clear current axes
|
||||
|
||||
## Line and Marker Styles
|
||||
|
||||
### Line Styles
|
||||
- `'-'` or `'solid'` - Solid line
|
||||
- `'--'` or `'dashed'` - Dashed line
|
||||
- `'-.'` or `'dashdot'` - Dash-dot line
|
||||
- `':'` or `'dotted'` - Dotted line
|
||||
- `''` or `' '` or `'None'` - No line
|
||||
|
||||
### Marker Styles
|
||||
- `'.'` - Point marker
|
||||
- `'o'` - Circle marker
|
||||
- `'v'`, `'^'`, `'<'`, `'>'` - Triangle markers
|
||||
- `'s'` - Square marker
|
||||
- `'p'` - Pentagon marker
|
||||
- `'*'` - Star marker
|
||||
- `'h'`, `'H'` - Hexagon markers
|
||||
- `'+'` - Plus marker
|
||||
- `'x'` - X marker
|
||||
- `'D'`, `'d'` - Diamond markers
|
||||
|
||||
### Color Specifications
|
||||
|
||||
**Single character shortcuts:**
|
||||
- `'b'` - Blue
|
||||
- `'g'` - Green
|
||||
- `'r'` - Red
|
||||
- `'c'` - Cyan
|
||||
- `'m'` - Magenta
|
||||
- `'y'` - Yellow
|
||||
- `'k'` - Black
|
||||
- `'w'` - White
|
||||
|
||||
**Named colors:**
|
||||
- `'steelblue'`, `'coral'`, `'teal'`, etc.
|
||||
- See full list: https://matplotlib.org/stable/gallery/color/named_colors.html
|
||||
|
||||
**Other formats:**
|
||||
- Hex: `'#FF5733'`
|
||||
- RGB tuple: `(0.1, 0.2, 0.3)`
|
||||
- RGBA tuple: `(0.1, 0.2, 0.3, 0.5)`
|
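These single-character colors, line styles, and markers can also be combined into a compact format string passed as the third positional argument to `plot`; a small sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
fig, ax = plt.subplots()
ax.plot(x, x, 'r--o')        # red, dashed, circle markers
ax.plot(x, x + 2, 'g:^')     # green, dotted, triangle markers
ax.plot(x, x + 4, 'k-.s')    # black, dash-dot, square markers
plt.show()
```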
||||
|
||||
## Common Parameters
|
||||
|
||||
### Plot Function Parameters
|
||||
|
||||
```python
|
||||
ax.plot(x, y,
|
||||
color='blue', # Line color
|
||||
linewidth=2, # Line width
|
||||
linestyle='--', # Line style
|
||||
marker='o', # Marker style
|
||||
markersize=8, # Marker size
|
||||
markerfacecolor='red', # Marker fill color
|
||||
markeredgecolor='black',# Marker edge color
|
||||
markeredgewidth=1, # Marker edge width
|
||||
alpha=0.7, # Transparency (0-1)
|
||||
label='data', # Legend label
|
||||
zorder=2, # Drawing order
|
||||
rasterized=True # Rasterize for smaller file size
|
||||
)
|
||||
```
|
||||
|
||||
### Scatter Function Parameters
|
||||
|
||||
```python
|
||||
ax.scatter(x, y,
|
||||
s=50, # Size (scalar or array)
|
||||
c='blue', # Color (scalar, array, or sequence)
|
||||
marker='o', # Marker style
|
||||
cmap='viridis', # Colormap (if c is numeric)
|
||||
alpha=0.5, # Transparency
|
||||
edgecolors='black', # Edge color
|
||||
linewidths=1, # Edge width
|
||||
vmin=0, vmax=1, # Color scale limits
|
||||
label='data' # Legend label
|
||||
)
|
||||
```
|
||||
|
||||
### Text Parameters
|
||||
|
||||
```python
|
||||
ax.text(x, y, text,
|
||||
fontsize=12, # Font size
|
||||
fontweight='normal', # 'normal', 'bold', 'heavy', 'light'
|
||||
fontstyle='normal', # 'normal', 'italic', 'oblique'
|
||||
fontfamily='sans-serif',# Font family
|
||||
color='black', # Text color
|
||||
alpha=1.0, # Transparency
|
||||
ha='center', # Horizontal alignment: 'left', 'center', 'right'
|
||||
va='center', # Vertical alignment: 'top', 'center', 'bottom', 'baseline'
|
||||
rotation=0, # Rotation angle in degrees
|
||||
bbox=dict( # Background box
|
||||
facecolor='white',
|
||||
edgecolor='black',
|
||||
boxstyle='round'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## rcParams Configuration
|
||||
|
||||
Common rcParams settings for global customization:
|
||||
|
||||
```python
|
||||
# Font settings
|
||||
plt.rcParams['font.family'] = 'sans-serif'
|
||||
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica']
|
||||
plt.rcParams['font.size'] = 12
|
||||
|
||||
# Figure settings
|
||||
plt.rcParams['figure.figsize'] = (10, 6)
|
||||
plt.rcParams['figure.dpi'] = 100
|
||||
plt.rcParams['figure.facecolor'] = 'white'
|
||||
plt.rcParams['savefig.dpi'] = 300
|
||||
plt.rcParams['savefig.bbox'] = 'tight'
|
||||
|
||||
# Axes settings
|
||||
plt.rcParams['axes.labelsize'] = 14
|
||||
plt.rcParams['axes.titlesize'] = 16
|
||||
plt.rcParams['axes.grid'] = True
|
||||
plt.rcParams['grid.alpha'] = 0.3   # grid transparency (the key is 'grid.alpha', not 'axes.grid.alpha')
|
||||
|
||||
# Line settings
|
||||
plt.rcParams['lines.linewidth'] = 2
|
||||
plt.rcParams['lines.markersize'] = 8
|
||||
|
||||
# Tick settings
|
||||
plt.rcParams['xtick.labelsize'] = 10
|
||||
plt.rcParams['ytick.labelsize'] = 10
|
||||
plt.rcParams['xtick.direction'] = 'in' # 'in', 'out', 'inout'
|
||||
plt.rcParams['ytick.direction'] = 'in'
|
||||
|
||||
# Legend settings
|
||||
plt.rcParams['legend.fontsize'] = 12
|
||||
plt.rcParams['legend.frameon'] = True
|
||||
plt.rcParams['legend.framealpha'] = 0.8
|
||||
|
||||
# Grid settings
|
||||
plt.rcParams['grid.alpha'] = 0.3
|
||||
plt.rcParams['grid.linestyle'] = '--'
|
||||
```
|
||||
|
||||
## GridSpec for Complex Layouts
|
||||
|
||||
```python
|
||||
from matplotlib.gridspec import GridSpec
|
||||
|
||||
fig = plt.figure(figsize=(12, 8))
|
||||
gs = GridSpec(3, 3, figure=fig, hspace=0.3, wspace=0.3)
|
||||
|
||||
# Span multiple cells
|
||||
ax1 = fig.add_subplot(gs[0, :]) # Top row, all columns
|
||||
ax2 = fig.add_subplot(gs[1:, 0]) # Bottom two rows, first column
|
||||
ax3 = fig.add_subplot(gs[1, 1:]) # Middle row, last two columns
|
||||
ax4 = fig.add_subplot(gs[2, 1]) # Bottom row, middle column
|
||||
ax5 = fig.add_subplot(gs[2, 2]) # Bottom row, right column
|
||||
```
|
||||
|
||||
## 3D Plotting
|
||||
|
||||
```python
|
||||
from mpl_toolkits.mplot3d import Axes3D
|
||||
|
||||
fig = plt.figure()
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
|
||||
# Plot types
|
||||
ax.plot(x, y, z) # 3D line
|
||||
ax.scatter(x, y, z) # 3D scatter
|
||||
ax.plot_surface(X, Y, Z) # 3D surface
|
||||
ax.plot_wireframe(X, Y, Z) # 3D wireframe
|
||||
ax.contour(X, Y, Z) # 3D contour
|
||||
ax.bar3d(x, y, z, dx, dy, dz) # 3D bar
|
||||
|
||||
# Customization
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_zlabel('Z')
|
||||
ax.view_init(elev=30, azim=45) # Set viewing angle
|
||||
```
|
||||
|
||||
## Animation
|
||||
|
||||
```python
|
||||
from matplotlib.animation import FuncAnimation
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
line, = ax.plot([], [])
|
||||
|
||||
def init():
|
||||
ax.set_xlim(0, 2*np.pi)
|
||||
ax.set_ylim(-1, 1)
|
||||
return line,
|
||||
|
||||
def update(frame):
|
||||
x = np.linspace(0, 2*np.pi, 100)
|
||||
y = np.sin(x + frame/10)
|
||||
line.set_data(x, y)
|
||||
return line,
|
||||
|
||||
anim = FuncAnimation(fig, update, init_func=init,
|
||||
frames=100, interval=50, blit=True)
|
||||
|
||||
# Save animation
|
||||
anim.save('animation.gif', writer='pillow', fps=20)
|
||||
anim.save('animation.mp4', writer='ffmpeg', fps=20)
|
||||
```
|
||||
|
||||
## Image Operations
|
||||
|
||||
```python
|
||||
# Read and display image
|
||||
img = plt.imread('image.png')
|
||||
ax.imshow(img)
|
||||
|
||||
# Display matrix as image
|
||||
im = ax.imshow(matrix, cmap='viridis', aspect='auto',
|
||||
interpolation='nearest', origin='lower')
|
||||
|
||||
# Colorbar
|
||||
cbar = plt.colorbar(im, ax=ax)
|
||||
cbar.set_label('Values')
|
||||
|
||||
# Image extent (set coordinates)
|
||||
ax.imshow(img, extent=[x_min, x_max, y_min, y_max])
|
||||
```
|
||||
|
||||
## Event Handling
|
||||
|
||||
```python
|
||||
# Mouse click event
|
||||
def on_click(event):
|
||||
if event.inaxes:
|
||||
print(f'Clicked at x={event.xdata:.2f}, y={event.ydata:.2f}')
|
||||
|
||||
fig.canvas.mpl_connect('button_press_event', on_click)
|
||||
|
||||
# Key press event
|
||||
def on_key(event):
|
||||
print(f'Key pressed: {event.key}')
|
||||
|
||||
fig.canvas.mpl_connect('key_press_event', on_key)
|
||||
```
|
||||
|
||||
## Useful Utilities
|
||||
|
||||
```python
|
||||
# Get current axis limits
|
||||
xlims = ax.get_xlim()
|
||||
ylims = ax.get_ylim()
|
||||
|
||||
# Set equal aspect ratio
|
||||
ax.set_aspect('equal', adjustable='box')
|
||||
|
||||
# Share axes between subplots
|
||||
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
|
||||
|
||||
# Twin axes (two y-axes)
|
||||
ax2 = ax1.twinx()
|
||||
|
||||
# Remove tick labels
|
||||
ax.set_xticklabels([])
|
||||
ax.set_yticklabels([])
|
||||
|
||||
# Scientific notation
|
||||
ax.ticklabel_format(style='scientific', axis='y', scilimits=(0,0))
|
||||
|
||||
# Date formatting
|
||||
import matplotlib.dates as mdates
|
||||
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
|
||||
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
|
||||
```
|
||||
563
scientific-packages/matplotlib/references/common_issues.md
Normal file
@@ -0,0 +1,563 @@
|
||||
# Matplotlib Common Issues and Solutions
|
||||
|
||||
Troubleshooting guide for frequently encountered matplotlib problems.
|
||||
|
||||
## Display and Backend Issues
|
||||
|
||||
### Issue: Plots Not Showing
|
||||
|
||||
**Problem:** `plt.show()` doesn't display anything
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# 1. Check if backend is properly set (for interactive use)
|
||||
import matplotlib
|
||||
print(matplotlib.get_backend())
|
||||
|
||||
# 2. Try different backends
|
||||
matplotlib.use('TkAgg') # or 'Qt5Agg', 'MacOSX'
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# 3. In Jupyter notebooks, use magic command
|
||||
%matplotlib inline # Static images
|
||||
# or
|
||||
%matplotlib widget # Interactive plots
|
||||
|
||||
# 4. Ensure plt.show() is called
|
||||
plt.plot([1, 2, 3])
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Issue: "RuntimeError: main thread is not in main loop"
|
||||
|
||||
**Problem:** Interactive mode issues with threading
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Switch to non-interactive backend
|
||||
import matplotlib
|
||||
matplotlib.use('Agg')
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Or turn off interactive mode
|
||||
plt.ioff()
|
||||
```
|
||||
|
||||
### Issue: Figures Not Updating Interactively
|
||||
|
||||
**Problem:** Changes not reflected in interactive windows
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Enable interactive mode
|
||||
plt.ion()
|
||||
|
||||
# Draw after each change
|
||||
plt.plot(x, y)
|
||||
plt.draw()
|
||||
plt.pause(0.001) # Brief pause to update display
|
||||
```
|
||||
|
||||
## Layout and Spacing Issues
|
||||
|
||||
### Issue: Overlapping Labels and Titles
|
||||
|
||||
**Problem:** Labels, titles, or tick labels overlap or get cut off
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Constrained layout (RECOMMENDED)
|
||||
fig, ax = plt.subplots(constrained_layout=True)
|
||||
|
||||
# Solution 2: Tight layout
|
||||
fig, ax = plt.subplots()
|
||||
plt.tight_layout()
|
||||
|
||||
# Solution 3: Adjust margins manually
|
||||
plt.subplots_adjust(left=0.15, right=0.95, top=0.95, bottom=0.15)
|
||||
|
||||
# Solution 4: Save with bbox_inches='tight'
|
||||
plt.savefig('figure.png', bbox_inches='tight')
|
||||
|
||||
# Solution 5: Rotate long tick labels
|
||||
ax.set_xticklabels(labels, rotation=45, ha='right')
|
||||
```
|
||||
|
||||
### Issue: Colorbar Affects Subplot Size
|
||||
|
||||
**Problem:** Adding colorbar shrinks the plot
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Solution 1: Use constrained layout
|
||||
fig, ax = plt.subplots(constrained_layout=True)
|
||||
im = ax.imshow(data)
|
||||
plt.colorbar(im, ax=ax)
|
||||
|
||||
# Solution 2: Manually specify colorbar dimensions
|
||||
from mpl_toolkits.axes_grid1 import make_axes_locatable
|
||||
divider = make_axes_locatable(ax)
|
||||
cax = divider.append_axes("right", size="5%", pad=0.05)
|
||||
plt.colorbar(im, cax=cax)
|
||||
|
||||
# Solution 3: For multiple subplots, share colorbar
|
||||
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
|
||||
for ax in axes:
|
||||
im = ax.imshow(data)
|
||||
fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.95)
|
||||
```
|
||||
|
||||
### Issue: Subplots Too Close Together
|
||||
|
||||
**Problem:** Multiple subplots overlapping
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Solution 1: Use constrained_layout
|
||||
fig, axes = plt.subplots(2, 2, constrained_layout=True)
|
||||
|
||||
# Solution 2: Adjust spacing with subplots_adjust
|
||||
fig, axes = plt.subplots(2, 2)
|
||||
plt.subplots_adjust(hspace=0.4, wspace=0.4)
|
||||
|
||||
# Solution 3: Specify spacing in tight_layout
|
||||
plt.tight_layout(h_pad=2.0, w_pad=2.0)
|
||||
```
|
||||
|
||||
## Memory and Performance Issues
|
||||
|
||||
### Issue: Memory Leak with Multiple Figures
|
||||
|
||||
**Problem:** Memory usage grows when creating many figures
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Close figures explicitly
|
||||
fig, ax = plt.subplots()
|
||||
ax.plot(x, y)
|
||||
plt.savefig('plot.png')
|
||||
plt.close(fig) # or plt.close('all')
|
||||
|
||||
# Clear current figure without closing
|
||||
plt.clf()
|
||||
|
||||
# Clear current axes
|
||||
plt.cla()
|
||||
```
|
||||
|
||||
### Issue: Large File Sizes
|
||||
|
||||
**Problem:** Saved figures are too large
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Reduce DPI
|
||||
plt.savefig('figure.png', dpi=150) # Instead of 300
|
||||
|
||||
# Solution 2: Use rasterization for complex plots
|
||||
ax.plot(x, y, rasterized=True)
|
||||
|
||||
# Solution 3: Use vector format for simple plots
|
||||
plt.savefig('figure.pdf') # or .svg
|
||||
|
||||
# Solution 4: Compress PNG
|
||||
plt.savefig('figure.png', dpi=300, pil_kwargs={'optimize': True})  # PNG optimization is passed through Pillow
|
||||
```
|
||||
|
||||
### Issue: Slow Plotting with Large Datasets
|
||||
|
||||
**Problem:** Plotting takes too long with many points
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Downsample data
|
||||
from scipy.signal import decimate
|
||||
y_downsampled = decimate(y, 10)  # low-pass filter, then keep every 10th sample
|
||||
|
||||
# Solution 2: Use rasterization
|
||||
ax.plot(x, y, rasterized=True)
|
||||
|
||||
# Solution 3: Use line simplification
|
||||
ax.plot(x, y)
|
||||
for line in ax.get_lines():
|
||||
line.set_rasterized(True)
|
||||
|
||||
# Solution 4: For scatter plots, consider hexbin or 2d histogram
|
||||
ax.hexbin(x, y, gridsize=50, cmap='viridis')
|
||||
```
|
||||
|
||||
## Font and Text Issues
|
||||
|
||||
### Issue: Font Warnings
|
||||
|
||||
**Problem:** "findfont: Font family [...] not found"
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Use available fonts
|
||||
from matplotlib.font_manager import findfont, FontProperties
|
||||
print(findfont(FontProperties(family='sans-serif')))
|
||||
|
||||
# Solution 2: Register the font file directly (the private
# font_manager._rebuild() was removed in recent matplotlib releases;
# alternatively, delete the cache directory returned by matplotlib.get_cachedir())
import matplotlib.font_manager
matplotlib.font_manager.fontManager.addfont('/path/to/font.ttf')  # illustrative path
|
||||
|
||||
# Solution 3: Suppress warnings
|
||||
import warnings
|
||||
warnings.filterwarnings("ignore", category=UserWarning)
|
||||
|
||||
# Solution 4: Specify fallback fonts
|
||||
plt.rcParams['font.sans-serif'] = ['Arial', 'DejaVu Sans', 'sans-serif']
|
||||
```
|
||||
|
||||
### Issue: LaTeX Rendering Errors
|
||||
|
||||
**Problem:** Math text not rendering correctly
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Use raw strings with r prefix
|
||||
ax.set_xlabel(r'$\alpha$') # Not '\alpha'
|
||||
|
||||
# Solution 2: Escape backslashes in regular strings
|
||||
ax.set_xlabel('$\\alpha$')
|
||||
|
||||
# Solution 3: Disable LaTeX if not installed
|
||||
plt.rcParams['text.usetex'] = False
|
||||
|
||||
# Solution 4: Use mathtext instead of full LaTeX
|
||||
# Mathtext is always available, no LaTeX installation needed
|
||||
ax.text(x, y, r'$\int_0^\infty e^{-x} dx$')
|
||||
```
|
||||
|
||||
### Issue: Text Cut Off or Outside Figure
|
||||
|
||||
**Problem:** Labels or annotations appear outside figure bounds
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Use bbox_inches='tight'
|
||||
plt.savefig('figure.png', bbox_inches='tight')
|
||||
|
||||
# Solution 2: Adjust figure bounds
|
||||
plt.subplots_adjust(left=0.15, right=0.85, top=0.85, bottom=0.15)
|
||||
|
||||
# Solution 3: Clip text to axes
|
||||
ax.text(x, y, 'text', clip_on=True)
|
||||
|
||||
# Solution 4: Use constrained_layout
|
||||
fig, ax = plt.subplots(constrained_layout=True)
|
||||
```
|
||||
|
||||
## Color and Colormap Issues
|
||||
|
||||
### Issue: Colorbar Not Matching Plot
|
||||
|
||||
**Problem:** Colorbar shows different range than data
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Explicitly set vmin and vmax
|
||||
im = ax.imshow(data, vmin=0, vmax=1, cmap='viridis')
|
||||
plt.colorbar(im, ax=ax)
|
||||
|
||||
# Or use the same norm for multiple plots
|
||||
import matplotlib.colors as mcolors
|
||||
norm = mcolors.Normalize(vmin=data.min(), vmax=data.max())
|
||||
im1 = ax1.imshow(data1, norm=norm, cmap='viridis')
|
||||
im2 = ax2.imshow(data2, norm=norm, cmap='viridis')
|
||||
```
|
||||
|
||||
### Issue: Colors Look Wrong
|
||||
|
||||
**Problem:** Unexpected colors in plots
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Check color specification format
|
||||
ax.plot(x, y, color='blue') # Correct
|
||||
ax.plot(x, y, color=(0, 0, 1)) # Correct RGB
|
||||
ax.plot(x, y, color='#0000FF') # Correct hex
|
||||
|
||||
# Solution 2: Verify colormap exists
|
||||
print(plt.colormaps()) # List available colormaps
|
||||
|
||||
# Solution 3: For scatter plots, ensure c shape matches
|
||||
ax.scatter(x, y, c=colors) # colors should have same length as x, y
|
||||
|
||||
# Solution 4: Check if alpha is set correctly
|
||||
ax.plot(x, y, alpha=1.0) # 0=transparent, 1=opaque
|
||||
```
|
||||
|
||||
### Issue: Reversed Colormap
|
||||
|
||||
**Problem:** Colormap direction is backwards
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Add _r suffix to reverse any colormap
|
||||
ax.imshow(data, cmap='viridis_r')
|
||||
```
|
||||
|
||||
## Axis and Scale Issues
|
||||
|
||||
### Issue: Axis Limits Not Working
|
||||
|
||||
**Problem:** `set_xlim` or `set_ylim` not taking effect
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Set after plotting
|
||||
ax.plot(x, y)
|
||||
ax.set_xlim(0, 10)
|
||||
ax.set_ylim(-1, 1)
|
||||
|
||||
# Solution 2: Disable autoscaling
|
||||
ax.autoscale(False)
|
||||
ax.set_xlim(0, 10)
|
||||
|
||||
# Solution 3: Use axis method
|
||||
ax.axis([xmin, xmax, ymin, ymax])
|
||||
```
|
||||
|
||||
### Issue: Log Scale with Zero or Negative Values
|
||||
|
||||
**Problem:** ValueError when using log scale with data ≤ 0
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Filter out non-positive values
|
||||
mask = (data > 0)
|
||||
ax.plot(x[mask], data[mask])
|
||||
ax.set_yscale('log')
|
||||
|
||||
# Solution 2: Use symlog for data with positive and negative values
|
||||
ax.set_yscale('symlog')
|
||||
|
||||
# Solution 3: Add small offset
|
||||
ax.plot(x, data + 1e-10)
|
||||
ax.set_yscale('log')
|
||||
```
|
||||
|
||||
### Issue: Dates Not Displaying Correctly
|
||||
|
||||
**Problem:** Date axis shows numbers instead of dates
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
import matplotlib.dates as mdates
|
||||
import pandas as pd
|
||||
|
||||
# Convert to datetime if needed
|
||||
dates = pd.to_datetime(date_strings)
|
||||
|
||||
ax.plot(dates, values)
|
||||
|
||||
# Format date axis
|
||||
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
|
||||
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
|
||||
plt.xticks(rotation=45)
|
||||
```
|
||||
|
||||
## Legend Issues
|
||||
|
||||
### Issue: Legend Covers Data
|
||||
|
||||
**Problem:** Legend obscures important parts of plot
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Use 'best' location
|
||||
ax.legend(loc='best')
|
||||
|
||||
# Solution 2: Place outside plot area
|
||||
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
|
||||
|
||||
# Solution 3: Make legend semi-transparent
|
||||
ax.legend(framealpha=0.7)
|
||||
|
||||
# Solution 4: Put legend below plot
|
||||
ax.legend(bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=3)
|
||||
```
|
||||
|
||||
### Issue: Too Many Items in Legend
|
||||
|
||||
**Problem:** Legend is cluttered with many entries
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Only label selected items
|
||||
for i, (x, y) in enumerate(data):
|
||||
label = f'Data {i}' if i % 5 == 0 else None
|
||||
ax.plot(x, y, label=label)
|
||||
|
||||
# Solution 2: Use multiple columns
|
||||
ax.legend(ncol=3)
|
||||
|
||||
# Solution 3: Create custom legend with fewer entries
|
||||
from matplotlib.lines import Line2D
|
||||
custom_lines = [Line2D([0], [0], color='r'),
|
||||
Line2D([0], [0], color='b')]
|
||||
ax.legend(custom_lines, ['Category A', 'Category B'])
|
||||
|
||||
# Solution 4: Use separate legend figure
|
||||
fig_leg = plt.figure(figsize=(3, 2))
|
||||
ax_leg = fig_leg.add_subplot(111)
|
||||
ax_leg.legend(*ax.get_legend_handles_labels(), loc='center')
|
||||
ax_leg.axis('off')
|
||||
```
|
||||
|
||||
## 3D Plot Issues
|
||||
|
||||
### Issue: 3D Plots Look Flat
|
||||
|
||||
**Problem:** Difficult to perceive depth in 3D plots
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Adjust viewing angle
|
||||
ax.view_init(elev=30, azim=45)
|
||||
|
||||
# Solution 2: Add gridlines
|
||||
ax.grid(True)
|
||||
|
||||
# Solution 3: Use color for depth
|
||||
scatter = ax.scatter(x, y, z, c=z, cmap='viridis')
|
||||
|
||||
# Solution 4: Rotate interactively (if using interactive backend)
|
||||
# User can click and drag to rotate
|
||||
```
|
||||
|
||||
### Issue: 3D Axis Labels Cut Off
|
||||
|
||||
**Problem:** 3D axis labels appear outside figure
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
from mpl_toolkits.mplot3d import Axes3D
|
||||
|
||||
fig = plt.figure(figsize=(10, 8))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
ax.plot_surface(X, Y, Z)
|
||||
|
||||
# Add padding
|
||||
fig.tight_layout(pad=3.0)
|
||||
|
||||
# Or save with tight bounding box
|
||||
plt.savefig('3d_plot.png', bbox_inches='tight', pad_inches=0.5)
|
||||
```
|
||||
|
||||
## Image and Colorbar Issues
|
||||
|
||||
### Issue: Images Appear Flipped
|
||||
|
||||
**Problem:** Image orientation is wrong
|
||||
|
||||
**Solution:**
|
||||
```python
|
||||
# Set origin parameter
|
||||
ax.imshow(img, origin='lower') # or 'upper' (default)
|
||||
|
||||
# Or flip array
|
||||
ax.imshow(np.flipud(img))
|
||||
```
|
||||
|
||||
### Issue: Images Look Pixelated
|
||||
|
||||
**Problem:** Image appears blocky when zoomed
|
||||
|
||||
**Solutions:**
|
||||
```python
|
||||
# Solution 1: Use interpolation
|
||||
ax.imshow(img, interpolation='bilinear')
|
||||
# Options: 'nearest', 'bilinear', 'bicubic', 'spline16', 'spline36', etc.
|
||||
|
||||
# Solution 2: Increase DPI when saving
|
||||
plt.savefig('figure.png', dpi=300)
|
||||
|
||||
# Solution 3: Use vector format if appropriate
|
||||
plt.savefig('figure.pdf')
|
||||
```
|
||||
|
||||
## Common Errors and Fixes
|
||||
|
||||
### "TypeError: 'AxesSubplot' object is not subscriptable"
|
||||
|
||||
**Problem:** Trying to index single axes
|
||||
```python
|
||||
# Wrong
|
||||
fig, ax = plt.subplots()
|
||||
ax[0].plot(x, y) # Error!
|
||||
|
||||
# Correct
|
||||
fig, ax = plt.subplots()
|
||||
ax.plot(x, y)
|
||||
```
|
||||
|
||||
### "ValueError: x and y must have same first dimension"
|
||||
|
||||
**Problem:** Data arrays have mismatched lengths
|
||||
```python
|
||||
# Check shapes
|
||||
print(f"x shape: {x.shape}, y shape: {y.shape}")
|
||||
|
||||
# Ensure they match
|
||||
assert len(x) == len(y), "x and y must have same length"
|
||||
```
|
||||
|
||||
### "AttributeError: 'numpy.ndarray' object has no attribute 'plot'"
|
||||
|
||||
**Problem:** Calling plot on array instead of axes
|
||||
```python
|
||||
# Wrong
|
||||
data.plot(x, y)
|
||||
|
||||
# Correct
|
||||
ax.plot(x, y)
|
||||
# or for pandas
|
||||
data.plot(ax=ax)
|
||||
```
|
||||
|
||||
## Best Practices to Avoid Issues
|
||||
|
||||
1. **Always use the OO interface** - Avoid pyplot state machine
|
||||
```python
|
||||
fig, ax = plt.subplots() # Good
|
||||
ax.plot(x, y)
|
||||
```
|
||||
|
||||
2. **Use constrained_layout** - Prevents overlap issues
|
||||
```python
|
||||
fig, ax = plt.subplots(constrained_layout=True)
|
||||
```
|
||||
|
||||
3. **Close figures explicitly** - Prevents memory leaks
|
||||
```python
|
||||
plt.close(fig)
|
||||
```
|
||||
|
||||
4. **Set figure size at creation** - Better than resizing later
|
||||
```python
|
||||
fig, ax = plt.subplots(figsize=(10, 6))
|
||||
```
|
||||
|
||||
5. **Use raw strings for math text** - Avoids escape issues
|
||||
```python
|
||||
ax.set_xlabel(r'$\alpha$')
|
||||
```
|
||||
|
||||
6. **Check data shapes before plotting** - Catch size mismatches early
|
||||
```python
|
||||
assert len(x) == len(y)
|
||||
```
|
||||
|
||||
7. **Use appropriate DPI** - 300 for print, 150 for web
|
||||
```python
|
||||
plt.savefig('figure.png', dpi=300)
|
||||
```
|
||||
|
||||
8. **Test with different backends** - If display issues occur
|
||||
```python
|
||||
import matplotlib
|
||||
matplotlib.use('TkAgg')
|
||||
```
|
||||
476
scientific-packages/matplotlib/references/plot_types.md
Normal file
476
scientific-packages/matplotlib/references/plot_types.md
Normal file
@@ -0,0 +1,476 @@
|
||||
# Matplotlib Plot Types Guide
|
||||
|
||||
Comprehensive guide to different plot types in matplotlib with examples and use cases.
|
||||
|
||||
## 1. Line Plots
|
||||
|
||||
**Use cases:** Time series, continuous data, trends, function visualization
|
||||
|
||||
### Basic Line Plot
|
||||
```python
|
||||
fig, ax = plt.subplots(figsize=(10, 6))
|
||||
ax.plot(x, y, linewidth=2, label='Data')
|
||||
ax.set_xlabel('X axis')
|
||||
ax.set_ylabel('Y axis')
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### Multiple Lines
|
||||
```python
|
||||
ax.plot(x, y1, label='Dataset 1', linewidth=2)
|
||||
ax.plot(x, y2, label='Dataset 2', linewidth=2, linestyle='--')
|
||||
ax.plot(x, y3, label='Dataset 3', linewidth=2, linestyle=':')
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### Line with Markers
|
||||
```python
|
||||
ax.plot(x, y, marker='o', markersize=8, linestyle='-',
|
||||
linewidth=2, markerfacecolor='red', markeredgecolor='black')
|
||||
```
|
||||
|
||||
### Step Plot
|
||||
```python
|
||||
ax.step(x, y, where='mid', linewidth=2, label='Step function')
|
||||
# where options: 'pre', 'post', 'mid'
|
||||
```
|
||||
|
||||
### Error Bars
|
||||
```python
|
||||
ax.errorbar(x, y, yerr=error, fmt='o-', linewidth=2,
|
||||
capsize=5, capthick=2, label='With uncertainty')
|
||||
```
|
||||
|
||||
## 2. Scatter Plots
|
||||
|
||||
**Use cases:** Correlations, relationships between variables, clusters, outliers
|
||||
|
||||
### Basic Scatter
|
||||
```python
|
||||
ax.scatter(x, y, s=50, alpha=0.6)
|
||||
```
|
||||
|
||||
### Sized and Colored Scatter
|
||||
```python
|
||||
scatter = ax.scatter(x, y, s=sizes*100, c=colors,
|
||||
cmap='viridis', alpha=0.6, edgecolors='black')
|
||||
plt.colorbar(scatter, ax=ax, label='Color variable')
|
||||
```
|
||||
|
||||
### Categorical Scatter
|
||||
```python
|
||||
for category in categories:
|
||||
mask = data['category'] == category
|
||||
ax.scatter(data[mask]['x'], data[mask]['y'],
|
||||
label=category, s=50, alpha=0.7)
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
## 3. Bar Charts
|
||||
|
||||
**Use cases:** Categorical comparisons, discrete data, counts
|
||||
|
||||
### Vertical Bar Chart
|
||||
```python
|
||||
ax.bar(categories, values, color='steelblue',
|
||||
edgecolor='black', linewidth=1.5)
|
||||
ax.set_ylabel('Values')
|
||||
```
|
||||
|
||||
### Horizontal Bar Chart
|
||||
```python
|
||||
ax.barh(categories, values, color='coral',
|
||||
edgecolor='black', linewidth=1.5)
|
||||
ax.set_xlabel('Values')
|
||||
```
|
||||
|
||||
### Grouped Bar Chart
|
||||
```python
|
||||
x = np.arange(len(categories))
|
||||
width = 0.35
|
||||
|
||||
ax.bar(x - width/2, values1, width, label='Group 1')
|
||||
ax.bar(x + width/2, values2, width, label='Group 2')
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(categories)
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### Stacked Bar Chart
|
||||
```python
|
||||
ax.bar(categories, values1, label='Part 1')
|
||||
ax.bar(categories, values2, bottom=values1, label='Part 2')
|
||||
ax.bar(categories, values3, bottom=values1+values2, label='Part 3')
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### Bar Chart with Error Bars
|
||||
```python
|
||||
ax.bar(categories, values, yerr=errors, capsize=5,
|
||||
color='steelblue', edgecolor='black')
|
||||
```
|
||||
|
||||
### Bar Chart with Patterns
|
||||
```python
|
||||
bars1 = ax.bar(x - width/2, values1, width, label='Group 1',
|
||||
color='white', edgecolor='black', hatch='//')
|
||||
bars2 = ax.bar(x + width/2, values2, width, label='Group 2',
|
||||
color='white', edgecolor='black', hatch='\\\\')
|
||||
```
|
||||
|
||||
## 4. Histograms
|
||||
|
||||
**Use cases:** Distributions, frequency analysis
|
||||
|
||||
### Basic Histogram
|
||||
```python
|
||||
ax.hist(data, bins=30, edgecolor='black', alpha=0.7)
|
||||
ax.set_xlabel('Value')
|
||||
ax.set_ylabel('Frequency')
|
||||
```
|
||||
|
||||
### Multiple Overlapping Histograms
|
||||
```python
|
||||
ax.hist(data1, bins=30, alpha=0.5, label='Dataset 1')
|
||||
ax.hist(data2, bins=30, alpha=0.5, label='Dataset 2')
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### Normalized Histogram (Density)
|
||||
```python
|
||||
ax.hist(data, bins=30, density=True, alpha=0.7,
|
||||
edgecolor='black', label='Empirical')
|
||||
|
||||
# Overlay theoretical distribution
|
||||
from scipy.stats import norm
|
||||
x = np.linspace(data.min(), data.max(), 100)
|
||||
ax.plot(x, norm.pdf(x, data.mean(), data.std()),
|
||||
'r-', linewidth=2, label='Normal fit')
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### 2D Histogram (Hexbin)
|
||||
```python
|
||||
hexbin = ax.hexbin(x, y, gridsize=30, cmap='Blues')
|
||||
plt.colorbar(hexbin, ax=ax, label='Counts')
|
||||
```
|
||||
|
||||
### 2D Histogram (hist2d)
|
||||
```python
|
||||
h = ax.hist2d(x, y, bins=30, cmap='Blues')
|
||||
plt.colorbar(h[3], ax=ax, label='Counts')
|
||||
```
|
||||
|
||||
## 5. Box and Violin Plots
|
||||
|
||||
**Use cases:** Statistical distributions, outlier detection, comparing distributions
|
||||
|
||||
### Box Plot
|
||||
```python
|
||||
ax.boxplot([data1, data2, data3],
|
||||
labels=['Group A', 'Group B', 'Group C'],
|
||||
showmeans=True, meanline=True)
|
||||
ax.set_ylabel('Values')
|
||||
```
|
||||
|
||||
### Horizontal Box Plot
|
||||
```python
|
||||
ax.boxplot([data1, data2, data3], vert=False,
|
||||
labels=['Group A', 'Group B', 'Group C'])
|
||||
ax.set_xlabel('Values')
|
||||
```
|
||||
|
||||
### Violin Plot
|
||||
```python
|
||||
parts = ax.violinplot([data1, data2, data3],
|
||||
positions=[1, 2, 3],
|
||||
showmeans=True, showmedians=True)
|
||||
ax.set_xticks([1, 2, 3])
|
||||
ax.set_xticklabels(['Group A', 'Group B', 'Group C'])
|
||||
```
|
||||
|
||||
## 6. Heatmaps
|
||||
|
||||
**Use cases:** Matrix data, correlations, intensity maps
|
||||
|
||||
### Basic Heatmap
|
||||
```python
|
||||
im = ax.imshow(matrix, cmap='coolwarm', aspect='auto')
|
||||
plt.colorbar(im, ax=ax, label='Values')
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
```
|
||||
|
||||
### Heatmap with Annotations
|
||||
```python
|
||||
im = ax.imshow(matrix, cmap='coolwarm')
|
||||
plt.colorbar(im, ax=ax)
|
||||
|
||||
# Add text annotations
|
||||
for i in range(matrix.shape[0]):
|
||||
for j in range(matrix.shape[1]):
|
||||
text = ax.text(j, i, f'{matrix[i, j]:.2f}',
|
||||
ha='center', va='center', color='black')
|
||||
```
|
||||
|
||||
### Correlation Matrix
|
||||
```python
|
||||
corr = data.corr()
|
||||
im = ax.imshow(corr, cmap='RdBu_r', vmin=-1, vmax=1)
|
||||
plt.colorbar(im, ax=ax, label='Correlation')
|
||||
|
||||
# Set tick labels
|
||||
ax.set_xticks(range(len(corr)))
|
||||
ax.set_yticks(range(len(corr)))
|
||||
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
|
||||
ax.set_yticklabels(corr.columns)
|
||||
```
|
||||
|
||||
## 7. Contour Plots
|
||||
|
||||
**Use cases:** 3D data on 2D plane, topography, function visualization
|
||||
|
||||
### Contour Lines
|
||||
```python
|
||||
contour = ax.contour(X, Y, Z, levels=10, cmap='viridis')
|
||||
ax.clabel(contour, inline=True, fontsize=8)
|
||||
plt.colorbar(contour, ax=ax)
|
||||
```
|
||||
|
||||
### Filled Contours
|
||||
```python
|
||||
contourf = ax.contourf(X, Y, Z, levels=20, cmap='viridis')
|
||||
plt.colorbar(contourf, ax=ax)
|
||||
```
|
||||
|
||||
### Combined Contours
|
||||
```python
|
||||
contourf = ax.contourf(X, Y, Z, levels=20, cmap='viridis', alpha=0.8)
|
||||
contour = ax.contour(X, Y, Z, levels=10, colors='black',
|
||||
linewidths=0.5, alpha=0.4)
|
||||
ax.clabel(contour, inline=True, fontsize=8)
|
||||
plt.colorbar(contourf, ax=ax)
|
||||
```
|
||||
|
||||
## 8. Pie Charts
|
||||
|
||||
**Use cases:** Proportions, percentages (use sparingly)
|
||||
|
||||
### Basic Pie Chart
|
||||
```python
|
||||
ax.pie(sizes, labels=labels, autopct='%1.1f%%',
|
||||
startangle=90, colors=colors)
|
||||
ax.axis('equal') # Equal aspect ratio ensures circular pie
|
||||
```
|
||||
|
||||
### Exploded Pie Chart
|
||||
```python
|
||||
explode = (0.1, 0, 0, 0) # Explode first slice
|
||||
ax.pie(sizes, explode=explode, labels=labels,
|
||||
autopct='%1.1f%%', shadow=True, startangle=90)
|
||||
ax.axis('equal')
|
||||
```
|
||||
|
||||
### Donut Chart
|
||||
```python
|
||||
ax.pie(sizes, labels=labels, autopct='%1.1f%%',
|
||||
wedgeprops=dict(width=0.5), startangle=90)
|
||||
ax.axis('equal')
|
||||
```
|
||||
|
||||
## 9. Polar Plots
|
||||
|
||||
**Use cases:** Cyclic data, directional data, radar charts
|
||||
|
||||
### Basic Polar Plot
|
||||
```python
|
||||
theta = np.linspace(0, 2*np.pi, 100)
|
||||
r = np.abs(np.sin(2*theta))
|
||||
|
||||
ax = plt.subplot(111, projection='polar')
|
||||
ax.plot(theta, r, linewidth=2)
|
||||
```
|
||||
|
||||
### Radar Chart
|
||||
```python
|
||||
categories = ['A', 'B', 'C', 'D', 'E']
|
||||
values = [4, 3, 5, 2, 4]
|
||||
|
||||
# Add first value to the end to close the polygon
|
||||
angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False)
|
||||
values_closed = np.concatenate((values, [values[0]]))
|
||||
angles_closed = np.concatenate((angles, [angles[0]]))
|
||||
|
||||
ax = plt.subplot(111, projection='polar')
|
||||
ax.plot(angles_closed, values_closed, 'o-', linewidth=2)
|
||||
ax.fill(angles_closed, values_closed, alpha=0.25)
|
||||
ax.set_xticks(angles)
|
||||
ax.set_xticklabels(categories)
|
||||
```
|
||||
|
||||
## 10. Stream and Quiver Plots
|
||||
|
||||
**Use cases:** Vector fields, flow visualization
|
||||
|
||||
### Quiver Plot (Vector Field)
|
||||
```python
|
||||
ax.quiver(X, Y, U, V, alpha=0.8)
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_aspect('equal')
|
||||
```
|
||||
|
||||
### Stream Plot
|
||||
```python
|
||||
ax.streamplot(X, Y, U, V, density=1.5, color='k', linewidth=1)
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_aspect('equal')
|
||||
```
|
||||
|
||||
## 11. Fill Between
|
||||
|
||||
**Use cases:** Uncertainty bounds, confidence intervals, areas under curves
|
||||
|
||||
### Fill Between Two Curves
|
||||
```python
|
||||
ax.plot(x, y, 'k-', linewidth=2, label='Mean')
|
||||
ax.fill_between(x, y - std, y + std, alpha=0.3,
|
||||
label='±1 std dev')
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
### Fill Between with Condition
|
||||
```python
|
||||
ax.plot(x, y1, label='Line 1')
|
||||
ax.plot(x, y2, label='Line 2')
|
||||
ax.fill_between(x, y1, y2, where=(y2 >= y1),
|
||||
alpha=0.3, label='y2 > y1', interpolate=True)
|
||||
ax.legend()
|
||||
```
|
||||
|
||||
## 12. 3D Plots
|
||||
|
||||
**Use cases:** Three-dimensional data visualization
|
||||
|
||||
### 3D Scatter
|
||||
```python
|
||||
from mpl_toolkits.mplot3d import Axes3D
|
||||
|
||||
fig = plt.figure(figsize=(10, 8))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
scatter = ax.scatter(x, y, z, c=colors, cmap='viridis',
|
||||
marker='o', s=50)
|
||||
plt.colorbar(scatter, ax=ax)
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_zlabel('Z')
|
||||
```
|
||||
|
||||
### 3D Surface Plot
|
||||
```python
|
||||
fig = plt.figure(figsize=(10, 8))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
|
||||
edgecolor='none', alpha=0.9)
|
||||
plt.colorbar(surf, ax=ax)
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_zlabel('Z')
|
||||
```
|
||||
|
||||
### 3D Wireframe
|
||||
```python
|
||||
fig = plt.figure(figsize=(10, 8))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
ax.plot_wireframe(X, Y, Z, color='black', linewidth=0.5)
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_zlabel('Z')
|
||||
```
|
||||
|
||||
### 3D Contour
|
||||
```python
|
||||
fig = plt.figure(figsize=(10, 8))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
ax.contour(X, Y, Z, levels=15, cmap='viridis')
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_zlabel('Z')
|
||||
```
|
||||
|
||||
## 13. Specialized Plots
|
||||
|
||||
### Stem Plot
|
||||
```python
|
||||
ax.stem(x, y, linefmt='C0-', markerfmt='C0o', basefmt='k-')
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
```
|
||||
|
||||
### Filled Polygon
|
||||
```python
|
||||
vertices = [(0, 0), (1, 0), (1, 1), (0, 1)]
|
||||
from matplotlib.patches import Polygon
|
||||
polygon = Polygon(vertices, closed=True, edgecolor='black',
|
||||
facecolor='lightblue', alpha=0.5)
|
||||
ax.add_patch(polygon)
|
||||
ax.set_xlim(-0.5, 1.5)
|
||||
ax.set_ylim(-0.5, 1.5)
|
||||
```
|
||||
|
||||
### Staircase Plot
|
||||
```python
|
||||
ax.stairs(values, edges, fill=True, alpha=0.5)
|
||||
```
|
||||
|
||||
### Broken Barh (Gantt-style)
|
||||
```python
|
||||
ax.broken_barh([(10, 50), (100, 20), (130, 10)], (10, 9),
|
||||
facecolors='tab:blue')
|
||||
ax.broken_barh([(10, 20), (50, 50), (120, 30)], (20, 9),
|
||||
facecolors='tab:orange')
|
||||
ax.set_ylim(5, 35)
|
||||
ax.set_xlim(0, 200)
|
||||
ax.set_xlabel('Time')
|
||||
ax.set_yticks([15, 25])
|
||||
ax.set_yticklabels(['Task 1', 'Task 2'])
|
||||
```
|
||||
|
||||
## 14. Time Series Plots
|
||||
|
||||
### Basic Time Series
|
||||
```python
|
||||
import pandas as pd
|
||||
import matplotlib.dates as mdates
|
||||
|
||||
ax.plot(dates, values, linewidth=2)
|
||||
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
|
||||
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
|
||||
plt.xticks(rotation=45)
|
||||
ax.set_xlabel('Date')
|
||||
ax.set_ylabel('Value')
|
||||
```
|
||||
|
||||
### Time Series with Shaded Regions
|
||||
```python
|
||||
ax.plot(dates, values, linewidth=2)
|
||||
# Shade weekends or specific periods
|
||||
ax.axvspan(start_date, end_date, alpha=0.2, color='gray')
|
||||
```
|
||||
|
||||
## Plot Selection Guide
|
||||
|
||||
| Data Type | Recommended Plot | Alternative Options |
|
||||
|-----------|-----------------|---------------------|
|
||||
| Single continuous variable | Histogram, KDE | Box plot, Violin plot |
|
||||
| Two continuous variables | Scatter plot | Hexbin, 2D histogram |
|
||||
| Time series | Line plot | Area plot, Step plot |
|
||||
| Categorical vs continuous | Bar chart, Box plot | Violin plot, Strip plot |
|
||||
| Two categorical variables | Heatmap | Grouped bar chart |
|
||||
| Three continuous variables | 3D scatter, Contour | Color-coded scatter |
|
||||
| Proportions | Bar chart | Pie chart (use sparingly) |
|
||||
| Distributions comparison | Box plot, Violin plot | Overlaid histograms |
|
||||
| Correlation matrix | Heatmap | Clustered heatmap |
|
||||
| Vector field | Quiver plot, Stream plot | - |
|
||||
| Function visualization | Line plot, Contour | 3D surface |
|
||||
589
scientific-packages/matplotlib/references/styling_guide.md
Normal file
589
scientific-packages/matplotlib/references/styling_guide.md
Normal file
@@ -0,0 +1,589 @@
|
||||
# Matplotlib Styling Guide
|
||||
|
||||
Comprehensive guide for styling and customizing matplotlib visualizations.
|
||||
|
||||
## Colormaps
|
||||
|
||||
### Colormap Categories
|
||||
|
||||
**1. Perceptually Uniform Sequential**
|
||||
Best for ordered data that progresses from low to high values.
|
||||
- `viridis` (default, colorblind-friendly)
|
||||
- `plasma`
|
||||
- `inferno`
|
||||
- `magma`
|
||||
- `cividis` (optimized for colorblind viewers)
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
im = ax.imshow(data, cmap='viridis')
|
||||
scatter = ax.scatter(x, y, c=values, cmap='plasma')
|
||||
```
|
||||
|
||||
**2. Sequential**
|
||||
Traditional colormaps for ordered data.
|
||||
- `Blues`, `Greens`, `Reds`, `Oranges`, `Purples`
|
||||
- `YlOrBr`, `YlOrRd`, `OrRd`, `PuRd`
|
||||
- `BuPu`, `GnBu`, `PuBu`, `YlGnBu`
|
||||
|
||||
**3. Diverging**
|
||||
Best for data with a meaningful center point (e.g., zero, mean).
|
||||
- `coolwarm` (blue to red)
|
||||
- `RdBu` (red-blue)
|
||||
- `RdYlBu` (red-yellow-blue)
|
||||
- `RdYlGn` (red-yellow-green)
|
||||
- `PiYG`, `PRGn`, `BrBG`, `PuOr`, `RdGy`
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Center colormap at zero
|
||||
im = ax.imshow(data, cmap='coolwarm', vmin=-1, vmax=1)
|
||||
```
|
||||
|
||||
**4. Qualitative**
|
||||
Best for categorical/nominal data without inherent ordering.
|
||||
- `tab10` (10 distinct colors)
|
||||
- `tab20` (20 distinct colors)
|
||||
- `Set1`, `Set2`, `Set3`
|
||||
- `Pastel1`, `Pastel2`
|
||||
- `Dark2`, `Accent`, `Paired`
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
colors = plt.cm.tab10(np.linspace(0, 1, n_categories))
|
||||
for i, category in enumerate(categories):
|
||||
ax.plot(x, y[i], color=colors[i], label=category)
|
||||
```
|
||||
|
||||
**5. Cyclic**
|
||||
Best for cyclic data (e.g., phase, angle).
|
||||
- `twilight`
|
||||
- `twilight_shifted`
|
||||
- `hsv`
|
||||
|
||||
### Colormap Best Practices
|
||||
|
||||
1. **Avoid `jet` colormap** - Not perceptually uniform, misleading
|
||||
2. **Use perceptually uniform colormaps** - `viridis`, `plasma`, `cividis`
|
||||
3. **Consider colorblind users** - Use `viridis`, `cividis`, or test with colorblind simulators
|
||||
4. **Match colormap to data type**:
|
||||
- Sequential: increasing/decreasing data
|
||||
- Diverging: data with meaningful center
|
||||
- Qualitative: categories
|
||||
5. **Reverse colormaps** - Add `_r` suffix: `viridis_r`, `coolwarm_r`
|
||||
|
||||
### Creating Custom Colormaps
|
||||
|
||||
```python
|
||||
from matplotlib.colors import LinearSegmentedColormap
|
||||
|
||||
# From color list
|
||||
colors = ['blue', 'white', 'red']
|
||||
n_bins = 100
|
||||
cmap = LinearSegmentedColormap.from_list('custom', colors, N=n_bins)
|
||||
|
||||
# From RGB values
|
||||
colors = [(0, 0, 1), (1, 1, 1), (1, 0, 0)] # RGB tuples
|
||||
cmap = LinearSegmentedColormap.from_list('custom', colors)
|
||||
|
||||
# Use the custom colormap
|
||||
ax.imshow(data, cmap=cmap)
|
||||
```
|
||||
|
||||
### Discrete Colormaps
|
||||
|
||||
```python
|
||||
import matplotlib.colors as mcolors
|
||||
|
||||
# Create discrete colormap from continuous
|
||||
cmap = plt.cm.viridis
|
||||
bounds = np.linspace(0, 10, 11)
|
||||
norm = mcolors.BoundaryNorm(bounds, cmap.N)
|
||||
im = ax.imshow(data, cmap=cmap, norm=norm)
|
||||
```
|
||||
|
||||
## Style Sheets
|
||||
|
||||
### Using Built-in Styles
|
||||
|
||||
```python
|
||||
# List available styles
|
||||
print(plt.style.available)
|
||||
|
||||
# Apply a style
|
||||
plt.style.use('seaborn-v0_8-darkgrid')
|
||||
|
||||
# Apply multiple styles (later styles override earlier ones)
|
||||
plt.style.use(['seaborn-v0_8-whitegrid', 'seaborn-v0_8-poster'])
|
||||
|
||||
# Temporarily use a style
|
||||
with plt.style.context('ggplot'):
|
||||
fig, ax = plt.subplots()
|
||||
ax.plot(x, y)
|
||||
```
|
||||
|
||||
### Popular Built-in Styles
|
||||
|
||||
- `default` - Matplotlib's default style
|
||||
- `classic` - Classic matplotlib look (pre-2.0)
|
||||
- `seaborn-v0_8-*` - Seaborn-inspired styles
|
||||
- `seaborn-v0_8-darkgrid`, `seaborn-v0_8-whitegrid`
|
||||
- `seaborn-v0_8-dark`, `seaborn-v0_8-white`
|
||||
- `seaborn-v0_8-ticks`, `seaborn-v0_8-poster`, `seaborn-v0_8-talk`
|
||||
- `ggplot` - ggplot2-inspired style
|
||||
- `bmh` - Bayesian Methods for Hackers style
|
||||
- `fivethirtyeight` - FiveThirtyEight style
|
||||
- `grayscale` - Grayscale style
|
||||
|
||||
### Creating Custom Style Sheets
|
||||
|
||||
Create a file named `custom_style.mplstyle`:
|
||||
|
||||
```
|
||||
# custom_style.mplstyle
|
||||
|
||||
# Figure
|
||||
figure.figsize: 10, 6
|
||||
figure.dpi: 100
|
||||
figure.facecolor: white
|
||||
|
||||
# Font
|
||||
font.family: sans-serif
|
||||
font.sans-serif: Arial, Helvetica
|
||||
font.size: 12
|
||||
|
||||
# Axes
|
||||
axes.labelsize: 14
|
||||
axes.titlesize: 16
|
||||
axes.facecolor: white
|
||||
axes.edgecolor: black
|
||||
axes.linewidth: 1.5
|
||||
axes.grid: True
|
||||
axes.axisbelow: True
|
||||
|
||||
# Grid
|
||||
grid.color: gray
|
||||
grid.linestyle: --
|
||||
grid.linewidth: 0.5
|
||||
grid.alpha: 0.3
|
||||
|
||||
# Lines
|
||||
lines.linewidth: 2
|
||||
lines.markersize: 8
|
||||
|
||||
# Ticks
|
||||
xtick.labelsize: 10
|
||||
ytick.labelsize: 10
|
||||
xtick.direction: in
|
||||
ytick.direction: in
|
||||
xtick.major.size: 6
|
||||
ytick.major.size: 6
|
||||
xtick.minor.size: 3
|
||||
ytick.minor.size: 3
|
||||
|
||||
# Legend
|
||||
legend.fontsize: 12
|
||||
legend.frameon: True
|
||||
legend.framealpha: 0.8
|
||||
legend.fancybox: True
|
||||
|
||||
# Savefig
|
||||
savefig.dpi: 300
|
||||
savefig.bbox: tight
|
||||
savefig.facecolor: white
|
||||
```
|
||||
|
||||
Load and use:
|
||||
```python
|
||||
plt.style.use('path/to/custom_style.mplstyle')
|
||||
```
|
||||
|
||||
## rcParams Configuration
|
||||
|
||||
### Global Configuration
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Configure globally
|
||||
plt.rcParams['figure.figsize'] = (10, 6)
|
||||
plt.rcParams['font.size'] = 12
|
||||
plt.rcParams['axes.labelsize'] = 14
|
||||
|
||||
# Or update multiple at once
|
||||
plt.rcParams.update({
|
||||
'figure.figsize': (10, 6),
|
||||
'font.size': 12,
|
||||
'axes.labelsize': 14,
|
||||
'axes.titlesize': 16,
|
||||
'lines.linewidth': 2
|
||||
})
|
||||
```
|
||||
|
||||
### Temporary Configuration
|
||||
|
||||
```python
|
||||
# Context manager for temporary changes
|
||||
with plt.rc_context({'font.size': 14, 'lines.linewidth': 2.5}):
|
||||
fig, ax = plt.subplots()
|
||||
ax.plot(x, y)
|
||||
```
|
||||
|
||||
### Common rcParams
|
||||
|
||||
**Figure settings:**
|
||||
```python
|
||||
plt.rcParams['figure.figsize'] = (10, 6)
|
||||
plt.rcParams['figure.dpi'] = 100
|
||||
plt.rcParams['figure.facecolor'] = 'white'
|
||||
plt.rcParams['figure.edgecolor'] = 'white'
|
||||
plt.rcParams['figure.autolayout'] = False
|
||||
plt.rcParams['figure.constrained_layout.use'] = True
|
||||
```
|
||||
|
||||
**Font settings:**
|
||||
```python
|
||||
plt.rcParams['font.family'] = 'sans-serif'
|
||||
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica', 'DejaVu Sans']
|
||||
plt.rcParams['font.size'] = 12
|
||||
plt.rcParams['font.weight'] = 'normal'
|
||||
```
|
||||
|
||||
**Axes settings:**
|
||||
```python
|
||||
plt.rcParams['axes.facecolor'] = 'white'
|
||||
plt.rcParams['axes.edgecolor'] = 'black'
|
||||
plt.rcParams['axes.linewidth'] = 1.5
|
||||
plt.rcParams['axes.grid'] = True
|
||||
plt.rcParams['axes.labelsize'] = 14
|
||||
plt.rcParams['axes.titlesize'] = 16
|
||||
plt.rcParams['axes.labelweight'] = 'normal'
|
||||
plt.rcParams['axes.spines.top'] = True
|
||||
plt.rcParams['axes.spines.right'] = True
|
||||
```
|
||||
|
||||
**Line settings:**
|
||||
```python
|
||||
plt.rcParams['lines.linewidth'] = 2
|
||||
plt.rcParams['lines.linestyle'] = '-'
|
||||
plt.rcParams['lines.marker'] = 'None'
|
||||
plt.rcParams['lines.markersize'] = 6
|
||||
```
|
||||
|
||||
**Save settings:**
|
||||
```python
|
||||
plt.rcParams['savefig.dpi'] = 300
|
||||
plt.rcParams['savefig.format'] = 'png'
|
||||
plt.rcParams['savefig.bbox'] = 'tight'
|
||||
plt.rcParams['savefig.pad_inches'] = 0.1
|
||||
plt.rcParams['savefig.transparent'] = False
|
||||
```
|
||||
|
||||
## Color Palettes
|
||||
|
||||
### Named Color Sets
|
||||
|
||||
```python
|
||||
# Tableau colors
|
||||
tableau_colors = plt.cm.tab10.colors
|
||||
|
||||
# CSS4 colors (subset)
|
||||
css_colors = ['steelblue', 'coral', 'teal', 'goldenrod', 'crimson']
|
||||
|
||||
# Manual definition
|
||||
custom_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
|
||||
```
|
||||
|
||||
### Color Cycles
|
||||
|
||||
```python
|
||||
# Set default color cycle
|
||||
from cycler import cycler
|
||||
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
|
||||
plt.rcParams['axes.prop_cycle'] = cycler(color=colors)
|
||||
|
||||
# Or combine color and line style
|
||||
plt.rcParams['axes.prop_cycle'] = cycler(color=colors) + cycler(linestyle=['-', '--', ':', '-.'])
|
||||
```
|
||||
|
||||
### Palette Generation
|
||||
|
||||
```python
|
||||
# Evenly spaced colors from colormap
|
||||
n_colors = 5
|
||||
colors = plt.cm.viridis(np.linspace(0, 1, n_colors))
|
||||
|
||||
# Use in plot
|
||||
for i, (x, y) in enumerate(data):
|
||||
ax.plot(x, y, color=colors[i])
|
||||
```
|
||||
|
||||
## Typography
|
||||
|
||||
### Font Configuration
|
||||
|
||||
```python
|
||||
# Set font family
|
||||
plt.rcParams['font.family'] = 'serif'
|
||||
plt.rcParams['font.serif'] = ['Times New Roman', 'DejaVu Serif']
|
||||
|
||||
# Or sans-serif
|
||||
plt.rcParams['font.family'] = 'sans-serif'
|
||||
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica']
|
||||
|
||||
# Or monospace
|
||||
plt.rcParams['font.family'] = 'monospace'
|
||||
plt.rcParams['font.monospace'] = ['Courier New', 'DejaVu Sans Mono']
|
||||
```
|
||||
|
||||
### Font Properties in Text
|
||||
|
||||
```python
|
||||
from matplotlib import font_manager
|
||||
|
||||
# Specify font properties
|
||||
ax.text(x, y, 'Text',
|
||||
fontsize=14,
|
||||
fontweight='bold', # 'normal', 'bold', 'heavy', 'light'
|
||||
fontstyle='italic', # 'normal', 'italic', 'oblique'
|
||||
fontfamily='serif')
|
||||
|
||||
# Use specific font file
|
||||
prop = font_manager.FontProperties(fname='path/to/font.ttf')
|
||||
ax.text(x, y, 'Text', fontproperties=prop)
|
||||
```
|
||||
|
||||
### Mathematical Text
|
||||
|
||||
```python
|
||||
# LaTeX-style math
|
||||
ax.set_title(r'$\alpha > \beta$')
|
||||
ax.set_xlabel(r'$\mu \pm \sigma$')
|
||||
ax.text(x, y, r'$\int_0^\infty e^{-x} dx = 1$')
|
||||
|
||||
# Subscripts and superscripts
|
||||
ax.set_ylabel(r'$y = x^2 + 2x + 1$')
|
||||
ax.text(x, y, r'$x_1, x_2, \ldots, x_n$')
|
||||
|
||||
# Greek letters
|
||||
ax.text(x, y, r'$\alpha, \beta, \gamma, \delta, \epsilon$')
|
||||
```
|
||||
|
||||
### Using Full LaTeX
|
||||
|
||||
```python
|
||||
# Enable full LaTeX rendering (requires LaTeX installation)
|
||||
plt.rcParams['text.usetex'] = True
|
||||
plt.rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'
|
||||
|
||||
ax.set_title(r'\textbf{Bold Title}')
|
||||
ax.set_xlabel(r'Time $t$ (s)')
|
||||
```
|
||||
|
||||
## Spines and Grids
|
||||
|
||||
### Spine Customization
|
||||
|
||||
```python
|
||||
# Hide specific spines
|
||||
ax.spines['top'].set_visible(False)
|
||||
ax.spines['right'].set_visible(False)
|
||||
|
||||
# Move spine position
|
||||
ax.spines['left'].set_position(('outward', 10))
|
||||
ax.spines['bottom'].set_position(('data', 0))
|
||||
|
||||
# Change spine color and width
|
||||
ax.spines['left'].set_color('red')
|
||||
ax.spines['bottom'].set_linewidth(2)
|
||||
```
|
||||
|
||||
### Grid Customization
|
||||
|
||||
```python
|
||||
# Basic grid
|
||||
ax.grid(True)
|
||||
|
||||
# Customized grid
|
||||
ax.grid(True, which='major', linestyle='--', linewidth=0.8, alpha=0.3)
|
||||
ax.grid(True, which='minor', linestyle=':', linewidth=0.5, alpha=0.2)
|
||||
|
||||
# Grid for specific axis
|
||||
ax.grid(True, axis='x') # Only vertical lines
|
||||
ax.grid(True, axis='y') # Only horizontal lines
|
||||
|
||||
# Grid behind or in front of data
|
||||
ax.set_axisbelow(True) # Grid behind data
|
||||
```
|
||||
|
||||
## Legend Customization
|
||||
|
||||
### Legend Positioning
|
||||
|
||||
```python
|
||||
# Location strings
|
||||
ax.legend(loc='best') # Automatic best position
|
||||
ax.legend(loc='upper right')
|
||||
ax.legend(loc='upper left')
|
||||
ax.legend(loc='lower right')
|
||||
ax.legend(loc='lower left')
|
||||
ax.legend(loc='center')
|
||||
ax.legend(loc='upper center')
|
||||
ax.legend(loc='lower center')
|
||||
ax.legend(loc='center left')
|
||||
ax.legend(loc='center right')
|
||||
|
||||
# Precise positioning (bbox_to_anchor)
|
||||
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Outside plot area
|
||||
ax.legend(bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=3) # Below plot
|
||||
```
|
||||
|
||||
### Legend Styling
|
||||
|
||||
```python
|
||||
ax.legend(
|
||||
fontsize=12,
|
||||
frameon=True, # Show frame
|
||||
framealpha=0.9, # Frame transparency
|
||||
fancybox=True, # Rounded corners
|
||||
shadow=True, # Shadow effect
|
||||
ncol=2, # Number of columns
|
||||
title='Legend Title', # Legend title
|
||||
title_fontsize=14, # Title font size
|
||||
edgecolor='black', # Frame edge color
|
||||
facecolor='white' # Frame background color
|
||||
)
|
||||
```
|
||||
|
||||
### Custom Legend Entries
|
||||
|
||||
```python
|
||||
from matplotlib.lines import Line2D
|
||||
|
||||
# Create custom legend handles
|
||||
custom_lines = [Line2D([0], [0], color='red', lw=2),
|
||||
Line2D([0], [0], color='blue', lw=2, linestyle='--'),
|
||||
Line2D([0], [0], marker='o', color='w', markerfacecolor='green', markersize=10)]
|
||||
|
||||
ax.legend(custom_lines, ['Label 1', 'Label 2', 'Label 3'])
|
||||
```
|
||||
|
||||
## Layout and Spacing
|
||||
|
||||
### Constrained Layout
|
||||
|
||||
```python
|
||||
# Preferred method (automatic adjustment)
|
||||
fig, axes = plt.subplots(2, 2, constrained_layout=True)
|
||||
```
|
||||
|
||||
### Tight Layout
|
||||
|
||||
```python
|
||||
# Alternative method
|
||||
fig, axes = plt.subplots(2, 2)
|
||||
plt.tight_layout(pad=1.5, h_pad=2.0, w_pad=2.0)
|
||||
```
|
||||
|
||||
### Manual Adjustment
|
||||
|
||||
```python
|
||||
# Fine-grained control
|
||||
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1,
|
||||
hspace=0.3, wspace=0.4)
|
||||
```
|
||||
|
||||
## Professional Publication Style
|
||||
|
||||
Example configuration for publication-quality figures:
|
||||
|
||||
```python
|
||||
# Publication style configuration
|
||||
plt.rcParams.update({
|
||||
# Figure
|
||||
'figure.figsize': (8, 6),
|
||||
'figure.dpi': 100,
|
||||
'savefig.dpi': 300,
|
||||
'savefig.bbox': 'tight',
|
||||
'savefig.pad_inches': 0.1,
|
||||
|
||||
# Font
|
||||
'font.family': 'sans-serif',
|
||||
'font.sans-serif': ['Arial', 'Helvetica'],
|
||||
'font.size': 11,
|
||||
|
||||
# Axes
|
||||
'axes.labelsize': 12,
|
||||
'axes.titlesize': 14,
|
||||
'axes.linewidth': 1.5,
|
||||
'axes.grid': False,
|
||||
'axes.spines.top': False,
|
||||
'axes.spines.right': False,
|
||||
|
||||
# Lines
|
||||
'lines.linewidth': 2,
|
||||
'lines.markersize': 8,
|
||||
|
||||
# Ticks
|
||||
'xtick.labelsize': 10,
|
||||
'ytick.labelsize': 10,
|
||||
'xtick.major.size': 6,
|
||||
'ytick.major.size': 6,
|
||||
'xtick.major.width': 1.5,
|
||||
'ytick.major.width': 1.5,
|
||||
'xtick.direction': 'in',
|
||||
'ytick.direction': 'in',
|
||||
|
||||
# Legend
|
||||
'legend.fontsize': 10,
|
||||
'legend.frameon': True,
|
||||
'legend.framealpha': 1.0,
|
||||
'legend.edgecolor': 'black'
|
||||
})
|
||||
```
|
||||
|
||||
## Dark Theme
|
||||
|
||||
```python
|
||||
# Dark background style
|
||||
plt.style.use('dark_background')
|
||||
|
||||
# Or manual configuration
|
||||
plt.rcParams.update({
|
||||
'figure.facecolor': '#1e1e1e',
|
||||
'axes.facecolor': '#1e1e1e',
|
||||
'axes.edgecolor': 'white',
|
||||
'axes.labelcolor': 'white',
|
||||
'text.color': 'white',
|
||||
'xtick.color': 'white',
|
||||
'ytick.color': 'white',
|
||||
'grid.color': 'gray',
|
||||
'legend.facecolor': '#1e1e1e',
|
||||
'legend.edgecolor': 'white'
|
||||
})
|
||||
```
|
||||
|
||||
## Color Accessibility
|
||||
|
||||
### Colorblind-Friendly Palettes
|
||||
|
||||
```python
|
||||
# Use colorblind-friendly colormaps
|
||||
colorblind_friendly = ['viridis', 'plasma', 'cividis']
|
||||
|
||||
# Colorblind-friendly discrete colors
|
||||
cb_colors = ['#0173B2', '#DE8F05', '#029E73', '#CC78BC',
|
||||
'#CA9161', '#949494', '#ECE133', '#56B4E9']
|
||||
|
||||
# Test with simulation tools or use these validated palettes
|
||||
```
|
||||
|
||||
### High Contrast
|
||||
|
||||
```python
|
||||
# Ensure sufficient contrast
|
||||
plt.rcParams['axes.edgecolor'] = 'black'
|
||||
plt.rcParams['axes.linewidth'] = 2
|
||||
plt.rcParams['xtick.major.width'] = 2
|
||||
plt.rcParams['ytick.major.width'] = 2
|
||||
```
|
||||
401
scientific-packages/matplotlib/scripts/plot_template.py
Normal file
401
scientific-packages/matplotlib/scripts/plot_template.py
Normal file
@@ -0,0 +1,401 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Matplotlib Plot Template
|
||||
|
||||
Comprehensive template demonstrating various plot types and best practices.
|
||||
Use this as a starting point for creating publication-quality visualizations.
|
||||
|
||||
Usage:
|
||||
python plot_template.py [--plot-type TYPE] [--style STYLE] [--output FILE]
|
||||
|
||||
Plot types:
|
||||
line, scatter, bar, histogram, heatmap, contour, box, violin, 3d, all
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.gridspec import GridSpec
|
||||
import argparse
|
||||
|
||||
|
||||
def set_publication_style():
|
||||
"""Configure matplotlib for publication-quality figures."""
|
||||
plt.rcParams.update({
|
||||
'figure.figsize': (10, 6),
|
||||
'figure.dpi': 100,
|
||||
'savefig.dpi': 300,
|
||||
'savefig.bbox': 'tight',
|
||||
'font.size': 11,
|
||||
'axes.labelsize': 12,
|
||||
'axes.titlesize': 14,
|
||||
'xtick.labelsize': 10,
|
||||
'ytick.labelsize': 10,
|
||||
'legend.fontsize': 10,
|
||||
'lines.linewidth': 2,
|
||||
'axes.linewidth': 1.5,
|
||||
})
|
||||
|
||||
|
||||
def generate_sample_data():
|
||||
"""Generate sample data for demonstrations."""
|
||||
np.random.seed(42)
|
||||
x = np.linspace(0, 10, 100)
|
||||
y1 = np.sin(x)
|
||||
y2 = np.cos(x)
|
||||
scatter_x = np.random.randn(200)
|
||||
scatter_y = np.random.randn(200)
|
||||
categories = ['A', 'B', 'C', 'D', 'E']
|
||||
bar_values = np.random.randint(10, 100, len(categories))
|
||||
hist_data = np.random.normal(0, 1, 1000)
|
||||
matrix = np.random.rand(10, 10)
|
||||
|
||||
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
|
||||
Z = np.sin(np.sqrt(X**2 + Y**2))
|
||||
|
||||
return {
|
||||
'x': x, 'y1': y1, 'y2': y2,
|
||||
'scatter_x': scatter_x, 'scatter_y': scatter_y,
|
||||
'categories': categories, 'bar_values': bar_values,
|
||||
'hist_data': hist_data, 'matrix': matrix,
|
||||
'X': X, 'Y': Y, 'Z': Z
|
||||
}
|
||||
|
||||
|
||||
def create_line_plot(data, ax=None):
|
||||
"""Create line plot with best practices."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
ax.plot(data['x'], data['y1'], label='sin(x)', linewidth=2, marker='o',
|
||||
markevery=10, markersize=6)
|
||||
ax.plot(data['x'], data['y2'], label='cos(x)', linewidth=2, linestyle='--')
|
||||
|
||||
ax.set_xlabel('x')
|
||||
ax.set_ylabel('y')
|
||||
ax.set_title('Line Plot Example')
|
||||
ax.legend(loc='best', framealpha=0.9)
|
||||
ax.grid(True, alpha=0.3, linestyle='--')
|
||||
|
||||
# Remove top and right spines for cleaner look
|
||||
ax.spines['top'].set_visible(False)
|
||||
ax.spines['right'].set_visible(False)
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_scatter_plot(data, ax=None):
|
||||
"""Create scatter plot with color and size variations."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
# Color based on distance from origin
|
||||
colors = np.sqrt(data['scatter_x']**2 + data['scatter_y']**2)
|
||||
sizes = 50 * (1 + np.abs(data['scatter_x']))
|
||||
|
||||
scatter = ax.scatter(data['scatter_x'], data['scatter_y'],
|
||||
c=colors, s=sizes, alpha=0.6,
|
||||
cmap='viridis', edgecolors='black', linewidth=0.5)
|
||||
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_title('Scatter Plot Example')
|
||||
ax.grid(True, alpha=0.3, linestyle='--')
|
||||
|
||||
# Add colorbar
|
||||
cbar = plt.colorbar(scatter, ax=ax)
|
||||
cbar.set_label('Distance from origin')
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_bar_chart(data, ax=None):
|
||||
"""Create bar chart with error bars and styling."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
x_pos = np.arange(len(data['categories']))
|
||||
errors = np.random.randint(5, 15, len(data['categories']))
|
||||
|
||||
bars = ax.bar(x_pos, data['bar_values'], yerr=errors,
|
||||
color='steelblue', edgecolor='black', linewidth=1.5,
|
||||
capsize=5, alpha=0.8)
|
||||
|
||||
# Color bars by value
|
||||
colors = plt.cm.viridis(data['bar_values'] / data['bar_values'].max())
|
||||
for bar, color in zip(bars, colors):
|
||||
bar.set_facecolor(color)
|
||||
|
||||
ax.set_xlabel('Category')
|
||||
ax.set_ylabel('Values')
|
||||
ax.set_title('Bar Chart Example')
|
||||
ax.set_xticks(x_pos)
|
||||
ax.set_xticklabels(data['categories'])
|
||||
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
|
||||
|
||||
# Remove top and right spines
|
||||
ax.spines['top'].set_visible(False)
|
||||
ax.spines['right'].set_visible(False)
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_histogram(data, ax=None):
|
||||
"""Create histogram with density overlay."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
n, bins, patches = ax.hist(data['hist_data'], bins=30, density=True,
|
||||
alpha=0.7, edgecolor='black', color='steelblue')
|
||||
|
||||
# Overlay theoretical normal distribution
|
||||
from scipy.stats import norm
|
||||
mu, std = norm.fit(data['hist_data'])
|
||||
x_theory = np.linspace(data['hist_data'].min(), data['hist_data'].max(), 100)
|
||||
ax.plot(x_theory, norm.pdf(x_theory, mu, std), 'r-', linewidth=2,
|
||||
label=f'Normal fit (μ={mu:.2f}, σ={std:.2f})')
|
||||
|
||||
ax.set_xlabel('Value')
|
||||
ax.set_ylabel('Density')
|
||||
ax.set_title('Histogram with Normal Fit')
|
||||
ax.legend()
|
||||
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_heatmap(data, ax=None):
|
||||
"""Create heatmap with colorbar and annotations."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 8), constrained_layout=True)
|
||||
|
||||
im = ax.imshow(data['matrix'], cmap='coolwarm', aspect='auto',
|
||||
vmin=0, vmax=1)
|
||||
|
||||
# Add colorbar
|
||||
cbar = plt.colorbar(im, ax=ax)
|
||||
cbar.set_label('Value')
|
||||
|
||||
# Optional: Add text annotations
|
||||
# for i in range(data['matrix'].shape[0]):
|
||||
# for j in range(data['matrix'].shape[1]):
|
||||
# text = ax.text(j, i, f'{data["matrix"][i, j]:.2f}',
|
||||
# ha='center', va='center', color='black', fontsize=8)
|
||||
|
||||
ax.set_xlabel('X Index')
|
||||
ax.set_ylabel('Y Index')
|
||||
ax.set_title('Heatmap Example')
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_contour_plot(data, ax=None):
|
||||
"""Create contour plot with filled contours and labels."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 8), constrained_layout=True)
|
||||
|
||||
# Filled contours
|
||||
contourf = ax.contourf(data['X'], data['Y'], data['Z'],
|
||||
levels=20, cmap='viridis', alpha=0.8)
|
||||
|
||||
# Contour lines
|
||||
contour = ax.contour(data['X'], data['Y'], data['Z'],
|
||||
levels=10, colors='black', linewidths=0.5, alpha=0.4)
|
||||
|
||||
# Add labels to contour lines
|
||||
ax.clabel(contour, inline=True, fontsize=8)
|
||||
|
||||
# Add colorbar
|
||||
cbar = plt.colorbar(contourf, ax=ax)
|
||||
cbar.set_label('Z value')
|
||||
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_title('Contour Plot Example')
|
||||
ax.set_aspect('equal')
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_box_plot(data, ax=None):
|
||||
"""Create box plot comparing distributions."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
# Generate multiple distributions
|
||||
box_data = [np.random.normal(0, std, 100) for std in range(1, 5)]
|
||||
|
||||
bp = ax.boxplot(box_data, labels=['Group 1', 'Group 2', 'Group 3', 'Group 4'],
|
||||
patch_artist=True, showmeans=True,
|
||||
boxprops=dict(facecolor='lightblue', edgecolor='black'),
|
||||
medianprops=dict(color='red', linewidth=2),
|
||||
meanprops=dict(marker='D', markerfacecolor='green', markersize=8))
|
||||
|
||||
ax.set_xlabel('Groups')
|
||||
ax.set_ylabel('Values')
|
||||
ax.set_title('Box Plot Example')
|
||||
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_violin_plot(data, ax=None):
|
||||
"""Create violin plot showing distribution shapes."""
|
||||
if ax is None:
|
||||
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)
|
||||
|
||||
# Generate multiple distributions
|
||||
violin_data = [np.random.normal(0, std, 100) for std in range(1, 5)]
|
||||
|
||||
parts = ax.violinplot(violin_data, positions=range(1, 5),
|
||||
showmeans=True, showmedians=True)
|
||||
|
||||
# Customize colors
|
||||
for pc in parts['bodies']:
|
||||
pc.set_facecolor('lightblue')
|
||||
pc.set_alpha(0.7)
|
||||
pc.set_edgecolor('black')
|
||||
|
||||
ax.set_xlabel('Groups')
|
||||
ax.set_ylabel('Values')
|
||||
ax.set_title('Violin Plot Example')
|
||||
ax.set_xticks(range(1, 5))
|
||||
ax.set_xticklabels(['Group 1', 'Group 2', 'Group 3', 'Group 4'])
|
||||
ax.grid(True, axis='y', alpha=0.3, linestyle='--')
|
||||
|
||||
if ax is None:
|
||||
return fig
|
||||
return ax
|
||||
|
||||
|
||||
def create_3d_plot():
|
||||
"""Create 3D surface plot."""
|
||||
from mpl_toolkits.mplot3d import Axes3D
|
||||
|
||||
fig = plt.figure(figsize=(12, 9))
|
||||
ax = fig.add_subplot(111, projection='3d')
|
||||
|
||||
# Generate data
|
||||
X = np.linspace(-5, 5, 50)
|
||||
Y = np.linspace(-5, 5, 50)
|
||||
X, Y = np.meshgrid(X, Y)
|
||||
Z = np.sin(np.sqrt(X**2 + Y**2))
|
||||
|
||||
# Create surface plot
|
||||
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
|
||||
edgecolor='none', alpha=0.9)
|
||||
|
||||
# Add colorbar
|
||||
fig.colorbar(surf, ax=ax, shrink=0.5)
|
||||
|
||||
ax.set_xlabel('X')
|
||||
ax.set_ylabel('Y')
|
||||
ax.set_zlabel('Z')
|
||||
ax.set_title('3D Surface Plot Example')
|
||||
|
||||
# Set viewing angle
|
||||
ax.view_init(elev=30, azim=45)
|
||||
|
||||
plt.tight_layout()
|
||||
return fig
|
||||
|
||||
|
||||
def create_comprehensive_figure():
|
||||
"""Create a comprehensive figure with multiple subplots."""
|
||||
data = generate_sample_data()
|
||||
|
||||
fig = plt.figure(figsize=(16, 12), constrained_layout=True)
|
||||
gs = GridSpec(3, 3, figure=fig)
|
||||
|
||||
# Create subplots
|
||||
ax1 = fig.add_subplot(gs[0, :2]) # Line plot - top left, spans 2 columns
|
||||
create_line_plot(data, ax1)
|
||||
|
||||
ax2 = fig.add_subplot(gs[0, 2]) # Bar chart - top right
|
||||
create_bar_chart(data, ax2)
|
||||
|
||||
ax3 = fig.add_subplot(gs[1, 0]) # Scatter plot - middle left
|
||||
create_scatter_plot(data, ax3)
|
||||
|
||||
ax4 = fig.add_subplot(gs[1, 1]) # Histogram - middle center
|
||||
create_histogram(data, ax4)
|
||||
|
||||
ax5 = fig.add_subplot(gs[1, 2]) # Box plot - middle right
|
||||
create_box_plot(data, ax5)
|
||||
|
||||
ax6 = fig.add_subplot(gs[2, :2]) # Contour plot - bottom left, spans 2 columns
|
||||
create_contour_plot(data, ax6)
|
||||
|
||||
ax7 = fig.add_subplot(gs[2, 2]) # Heatmap - bottom right
|
||||
create_heatmap(data, ax7)
|
||||
|
||||
fig.suptitle('Comprehensive Matplotlib Template', fontsize=18, fontweight='bold')
|
||||
|
||||
return fig
|
||||
|
||||
|
||||
def main():
|
||||
"""Main function to run the template."""
|
||||
parser = argparse.ArgumentParser(description='Matplotlib plot template')
|
||||
parser.add_argument('--plot-type', type=str, default='all',
|
||||
choices=['line', 'scatter', 'bar', 'histogram', 'heatmap',
|
||||
'contour', 'box', 'violin', '3d', 'all'],
|
||||
help='Type of plot to create')
|
||||
parser.add_argument('--style', type=str, default='default',
|
||||
help='Matplotlib style to use')
|
||||
parser.add_argument('--output', type=str, default='plot.png',
|
||||
help='Output filename')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Set style
|
||||
if args.style != 'default':
|
||||
plt.style.use(args.style)
|
||||
else:
|
||||
set_publication_style()
|
||||
|
||||
# Generate data
|
||||
data = generate_sample_data()
|
||||
|
||||
# Create plot based on type
|
||||
plot_functions = {
|
||||
'line': create_line_plot,
|
||||
'scatter': create_scatter_plot,
|
||||
'bar': create_bar_chart,
|
||||
'histogram': create_histogram,
|
||||
'heatmap': create_heatmap,
|
||||
'contour': create_contour_plot,
|
||||
'box': create_box_plot,
|
||||
'violin': create_violin_plot,
|
||||
}
|
||||
|
||||
if args.plot_type == '3d':
|
||||
fig = create_3d_plot()
|
||||
elif args.plot_type == 'all':
|
||||
fig = create_comprehensive_figure()
|
||||
else:
|
||||
fig = plot_functions[args.plot_type](data)
|
||||
|
||||
# Save figure
|
||||
plt.savefig(args.output, dpi=300, bbox_inches='tight')
|
||||
print(f"Plot saved to {args.output}")
|
||||
|
||||
# Display
|
||||
plt.show()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
409
scientific-packages/matplotlib/scripts/style_configurator.py
Normal file
409
scientific-packages/matplotlib/scripts/style_configurator.py
Normal file
@@ -0,0 +1,409 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Matplotlib Style Configurator
|
||||
|
||||
Interactive utility to configure matplotlib style preferences and generate
|
||||
custom style sheets. Creates a preview of the style and optionally saves
|
||||
it as a .mplstyle file.
|
||||
|
||||
Usage:
|
||||
python style_configurator.py [--preset PRESET] [--output FILE] [--preview]
|
||||
|
||||
Presets:
|
||||
publication, presentation, web, dark, minimal
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.gridspec import GridSpec
|
||||
import argparse
|
||||
import os
|
||||
|
||||
|
||||
# Predefined style presets
|
||||
STYLE_PRESETS = {
|
||||
'publication': {
|
||||
'figure.figsize': (8, 6),
|
||||
'figure.dpi': 100,
|
||||
'savefig.dpi': 300,
|
||||
'savefig.bbox': 'tight',
|
||||
'font.family': 'sans-serif',
|
||||
'font.sans-serif': ['Arial', 'Helvetica'],
|
||||
'font.size': 11,
|
||||
'axes.labelsize': 12,
|
||||
'axes.titlesize': 14,
|
||||
'axes.linewidth': 1.5,
|
||||
'axes.grid': False,
|
||||
'axes.spines.top': False,
|
||||
'axes.spines.right': False,
|
||||
'lines.linewidth': 2,
|
||||
'lines.markersize': 8,
|
||||
'xtick.labelsize': 10,
|
||||
'ytick.labelsize': 10,
|
||||
'xtick.direction': 'in',
|
||||
'ytick.direction': 'in',
|
||||
'xtick.major.size': 6,
|
||||
'ytick.major.size': 6,
|
||||
'xtick.major.width': 1.5,
|
||||
'ytick.major.width': 1.5,
|
||||
'legend.fontsize': 10,
|
||||
'legend.frameon': True,
|
||||
'legend.framealpha': 1.0,
|
||||
'legend.edgecolor': 'black',
|
||||
},
|
||||
'presentation': {
|
||||
'figure.figsize': (12, 8),
|
||||
'figure.dpi': 100,
|
||||
'savefig.dpi': 150,
|
||||
'font.size': 16,
|
||||
'axes.labelsize': 20,
|
||||
'axes.titlesize': 24,
|
||||
'axes.linewidth': 2,
|
||||
'lines.linewidth': 3,
|
||||
'lines.markersize': 12,
|
||||
'xtick.labelsize': 16,
|
||||
'ytick.labelsize': 16,
|
||||
'legend.fontsize': 16,
|
||||
'axes.grid': True,
|
||||
'grid.alpha': 0.3,
|
||||
},
|
||||
'web': {
|
||||
'figure.figsize': (10, 6),
|
||||
'figure.dpi': 96,
|
||||
'savefig.dpi': 150,
|
||||
'font.size': 11,
|
||||
'axes.labelsize': 12,
|
||||
'axes.titlesize': 14,
|
||||
'lines.linewidth': 2,
|
||||
'axes.grid': True,
|
||||
'grid.alpha': 0.2,
|
||||
'grid.linestyle': '--',
|
||||
},
|
||||
'dark': {
|
||||
'figure.facecolor': '#1e1e1e',
|
||||
'figure.edgecolor': '#1e1e1e',
|
||||
'axes.facecolor': '#1e1e1e',
|
||||
'axes.edgecolor': 'white',
|
||||
'axes.labelcolor': 'white',
|
||||
'text.color': 'white',
|
||||
'xtick.color': 'white',
|
||||
'ytick.color': 'white',
|
||||
'grid.color': 'gray',
|
||||
'grid.alpha': 0.3,
|
||||
'axes.grid': True,
|
||||
'legend.facecolor': '#1e1e1e',
|
||||
'legend.edgecolor': 'white',
|
||||
'savefig.facecolor': '#1e1e1e',
|
||||
},
|
||||
'minimal': {
|
||||
'figure.figsize': (10, 6),
|
||||
'axes.spines.top': False,
|
||||
'axes.spines.right': False,
|
||||
'axes.spines.left': False,
|
||||
'axes.spines.bottom': False,
|
||||
'axes.grid': False,
|
||||
'xtick.bottom': True,
|
||||
'ytick.left': True,
|
||||
'axes.axisbelow': True,
|
||||
'lines.linewidth': 2.5,
|
||||
'font.size': 12,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
def generate_preview_data():
|
||||
"""Generate sample data for style preview."""
|
||||
np.random.seed(42)
|
||||
x = np.linspace(0, 10, 100)
|
||||
y1 = np.sin(x) + 0.1 * np.random.randn(100)
|
||||
y2 = np.cos(x) + 0.1 * np.random.randn(100)
|
||||
scatter_x = np.random.randn(100)
|
||||
scatter_y = 2 * scatter_x + np.random.randn(100)
|
||||
categories = ['A', 'B', 'C', 'D', 'E']
|
||||
bar_values = [25, 40, 30, 55, 45]
|
||||
|
||||
return {
|
||||
'x': x, 'y1': y1, 'y2': y2,
|
||||
'scatter_x': scatter_x, 'scatter_y': scatter_y,
|
||||
'categories': categories, 'bar_values': bar_values
|
||||
}
|
||||
|
||||
|
||||
def create_style_preview(style_dict=None):
|
||||
"""Create a preview figure demonstrating the style."""
|
||||
if style_dict:
|
||||
plt.rcParams.update(style_dict)
|
||||
|
||||
data = generate_preview_data()
|
||||
|
||||
fig = plt.figure(figsize=(14, 10))
|
||||
gs = GridSpec(2, 2, figure=fig, hspace=0.3, wspace=0.3)
|
||||
|
||||
# Line plot
|
||||
ax1 = fig.add_subplot(gs[0, 0])
|
||||
ax1.plot(data['x'], data['y1'], label='sin(x)', marker='o', markevery=10)
|
||||
ax1.plot(data['x'], data['y2'], label='cos(x)', linestyle='--')
|
||||
ax1.set_xlabel('X axis')
|
||||
ax1.set_ylabel('Y axis')
|
||||
ax1.set_title('Line Plot')
|
||||
ax1.legend()
|
||||
ax1.grid(True, alpha=0.3)
|
||||
|
||||
# Scatter plot
|
||||
ax2 = fig.add_subplot(gs[0, 1])
|
||||
colors = np.sqrt(data['scatter_x']**2 + data['scatter_y']**2)
|
||||
scatter = ax2.scatter(data['scatter_x'], data['scatter_y'],
|
||||
c=colors, cmap='viridis', alpha=0.6, s=50)
|
||||
ax2.set_xlabel('X axis')
|
||||
ax2.set_ylabel('Y axis')
|
||||
ax2.set_title('Scatter Plot')
|
||||
cbar = plt.colorbar(scatter, ax=ax2)
|
||||
cbar.set_label('Distance')
|
||||
ax2.grid(True, alpha=0.3)
|
||||
|
||||
# Bar chart
|
||||
ax3 = fig.add_subplot(gs[1, 0])
|
||||
bars = ax3.bar(data['categories'], data['bar_values'],
|
||||
                   edgecolor='black', linewidth=1)
    # Color bars with gradient
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(bars)))
    for bar, color in zip(bars, colors):
        bar.set_facecolor(color)
    ax3.set_xlabel('Categories')
    ax3.set_ylabel('Values')
    ax3.set_title('Bar Chart')
    ax3.grid(True, axis='y', alpha=0.3)

    # Multiple line plot with fills
    ax4 = fig.add_subplot(gs[1, 1])
    ax4.plot(data['x'], data['y1'], label='Signal 1', linewidth=2)
    ax4.fill_between(data['x'], data['y1'] - 0.2, data['y1'] + 0.2,
                     alpha=0.3, label='±1 std')
    ax4.plot(data['x'], data['y2'], label='Signal 2', linewidth=2)
    ax4.fill_between(data['x'], data['y2'] - 0.2, data['y2'] + 0.2,
                     alpha=0.3)
    ax4.set_xlabel('X axis')
    ax4.set_ylabel('Y axis')
    ax4.set_title('Time Series with Uncertainty')
    ax4.legend()
    ax4.grid(True, alpha=0.3)

    fig.suptitle('Style Preview', fontsize=16, fontweight='bold')

    return fig


def save_style_file(style_dict, filename):
    """Save style dictionary as .mplstyle file."""
    with open(filename, 'w') as f:
        f.write("# Custom matplotlib style\n")
        f.write("# Generated by style_configurator.py\n\n")

        # Group settings by category
        categories = {
            'Figure': ['figure.'],
            'Font': ['font.'],
            'Axes': ['axes.'],
            'Lines': ['lines.'],
            'Markers': ['markers.'],
            'Ticks': ['tick.', 'xtick.', 'ytick.'],
            'Grid': ['grid.'],
            'Legend': ['legend.'],
            'Savefig': ['savefig.'],
            'Text': ['text.'],
        }

        for category, prefixes in categories.items():
            category_items = {k: v for k, v in style_dict.items()
                              if any(k.startswith(p) for p in prefixes)}
            if category_items:
                f.write(f"# {category}\n")
                for key, value in sorted(category_items.items()):
                    # Format value appropriately
                    if isinstance(value, (list, tuple)):
                        value_str = ', '.join(str(v) for v in value)
                    elif isinstance(value, bool):
                        value_str = str(value)
                    else:
                        value_str = str(value)
                    f.write(f"{key}: {value_str}\n")
                f.write("\n")

    print(f"Style saved to {filename}")


def print_style_info(style_dict):
    """Print information about the style."""
    print("\n" + "="*60)
    print("STYLE CONFIGURATION")
    print("="*60)

    categories = {
        'Figure Settings': ['figure.'],
        'Font Settings': ['font.'],
        'Axes Settings': ['axes.'],
        'Line Settings': ['lines.'],
        'Grid Settings': ['grid.'],
        'Legend Settings': ['legend.'],
    }

    for category, prefixes in categories.items():
        category_items = {k: v for k, v in style_dict.items()
                          if any(k.startswith(p) for p in prefixes)}
        if category_items:
            print(f"\n{category}:")
            for key, value in sorted(category_items.items()):
                print(f" {key}: {value}")

    print("\n" + "="*60 + "\n")


def list_available_presets():
    """Print available style presets."""
    print("\nAvailable style presets:")
    print("-" * 40)
    descriptions = {
        'publication': 'Optimized for academic publications',
        'presentation': 'Large fonts for presentations',
        'web': 'Optimized for web display',
        'dark': 'Dark background theme',
        'minimal': 'Minimal, clean style',
    }
    for preset, desc in descriptions.items():
        print(f" {preset:15s} - {desc}")
    print("-" * 40 + "\n")


def interactive_mode():
    """Run interactive mode to customize style settings."""
    print("\n" + "="*60)
    print("MATPLOTLIB STYLE CONFIGURATOR - Interactive Mode")
    print("="*60)

    list_available_presets()

    preset = input("Choose a preset to start from (or 'custom' for default): ").strip().lower()

    if preset in STYLE_PRESETS:
        style_dict = STYLE_PRESETS[preset].copy()
        print(f"\nStarting from '{preset}' preset")
    else:
        style_dict = {}
        print("\nStarting from default matplotlib style")

    print("\nCommon settings you might want to customize:")
    print(" 1. Figure size")
    print(" 2. Font sizes")
    print(" 3. Line widths")
    print(" 4. Grid settings")
    print(" 5. Color scheme")
    print(" 6. Done, show preview")

    while True:
        choice = input("\nSelect option (1-6): ").strip()

        if choice == '1':
            width = input(" Figure width (inches, default 10): ").strip() or '10'
            height = input(" Figure height (inches, default 6): ").strip() or '6'
            style_dict['figure.figsize'] = (float(width), float(height))

        elif choice == '2':
            base = input(" Base font size (default 12): ").strip() or '12'
            style_dict['font.size'] = float(base)
            style_dict['axes.labelsize'] = float(base) + 2
            style_dict['axes.titlesize'] = float(base) + 4

        elif choice == '3':
            lw = input(" Line width (default 2): ").strip() or '2'
            style_dict['lines.linewidth'] = float(lw)

        elif choice == '4':
            grid = input(" Enable grid? (y/n): ").strip().lower()
            style_dict['axes.grid'] = grid == 'y'
            if style_dict['axes.grid']:
                alpha = input(" Grid transparency (0-1, default 0.3): ").strip() or '0.3'
                style_dict['grid.alpha'] = float(alpha)

        elif choice == '5':
            print(" Theme options: 1=Light, 2=Dark")
            theme = input(" Select theme (1-2): ").strip()
            if theme == '2':
                style_dict.update(STYLE_PRESETS['dark'])

        elif choice == '6':
            break

    return style_dict


def main():
    """Main function."""
    parser = argparse.ArgumentParser(
        description='Matplotlib style configurator',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Show available presets
  python style_configurator.py --list

  # Preview a preset
  python style_configurator.py --preset publication --preview

  # Save a preset as .mplstyle file
  python style_configurator.py --preset publication --output my_style.mplstyle

  # Interactive mode
  python style_configurator.py --interactive
"""
    )
    parser.add_argument('--preset', type=str, choices=list(STYLE_PRESETS.keys()),
                        help='Use a predefined style preset')
    parser.add_argument('--output', type=str,
                        help='Save style to .mplstyle file')
    parser.add_argument('--preview', action='store_true',
                        help='Show style preview')
    parser.add_argument('--list', action='store_true',
                        help='List available presets')
    parser.add_argument('--interactive', action='store_true',
                        help='Run in interactive mode')

    args = parser.parse_args()

    if args.list:
        list_available_presets()
        # Also show currently available matplotlib styles
        print("\nBuilt-in matplotlib styles:")
        print("-" * 40)
        for style in sorted(plt.style.available):
            print(f" {style}")
        return

    if args.interactive:
        style_dict = interactive_mode()
    elif args.preset:
        style_dict = STYLE_PRESETS[args.preset].copy()
        print(f"Using '{args.preset}' preset")
    else:
        print("No preset or interactive mode specified. Showing default preview.")
        style_dict = {}

    if style_dict:
        print_style_info(style_dict)

    if args.output:
        save_style_file(style_dict, args.output)

    if args.preview or args.interactive:
        print("Creating style preview...")
        fig = create_style_preview(style_dict if style_dict else None)

        if args.output:
            preview_filename = args.output.replace('.mplstyle', '_preview.png')
            plt.savefig(preview_filename, dpi=150, bbox_inches='tight')
            print(f"Preview saved to {preview_filename}")

        plt.show()


if __name__ == "__main__":
    main()
398
scientific-packages/medchem/SKILL.md
Normal file
@@ -0,0 +1,398 @@
---
name: medchem
description: Python library for molecular filtering and prioritization in drug discovery. Use when applying medicinal chemistry rules (Rule of Five, CNS, leadlike), detecting structural alerts (PAINS, NIBR, Lilly demerits), analyzing chemical groups, calculating molecular complexity, or filtering compound libraries. Works with SMILES strings and RDKit mol objects, with built-in parallelization for large datasets.
---

# Medchem

## Overview

Medchem is a Python library for molecular filtering and prioritization in drug discovery workflows. It provides hundreds of well-established and novel molecular filters, structural alerts, and medicinal chemistry rules to efficiently triage and prioritize compound libraries at scale.

**Key Principle:** Rules and filters are always context-specific. Avoid blindly applying filters—marketed drugs often don't pass standard medchem filters, and prodrugs may intentionally violate rules. Use these tools as guidelines combined with domain expertise.

## Installation

Install medchem via conda or pip:

```bash
# Via conda
micromamba install -c conda-forge medchem

# Via pip
pip install medchem
```

## Core Capabilities

### 1. Medicinal Chemistry Rules

Apply established drug-likeness rules to molecules using the `medchem.rules` module.

**Available Rules:**
- Rule of Five (Lipinski)
- Rule of Oprea
- Rule of CNS
- Rule of leadlike (soft and strict)
- Rule of three
- Rule of Reos
- Rule of drug
- Rule of Veber
- Golden triangle
- PAINS filters

**Single Rule Application:**

```python
import medchem as mc

# Apply Rule of Five to a SMILES string
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # Aspirin
passes = mc.rules.basic_rules.rule_of_five(smiles)
# Returns: True

# Check specific rules
passes_oprea = mc.rules.basic_rules.rule_of_oprea(smiles)
passes_cns = mc.rules.basic_rules.rule_of_cns(smiles)
```

**Multiple Rules with RuleFilters:**

```python
import datamol as dm
import medchem as mc

# Load molecules
mols = [dm.to_mol(smiles) for smiles in smiles_list]

# Create filter with multiple rules
rfilter = mc.rules.RuleFilters(
    rule_list=[
        "rule_of_five",
        "rule_of_oprea",
        "rule_of_cns",
        "rule_of_leadlike_soft"
    ]
)

# Apply filters with parallelization
results = rfilter(
    mols=mols,
    n_jobs=-1,  # Use all CPU cores
    progress=True
)
```

**Result Format:**
Results are returned as dictionaries with pass/fail status and detailed information for each rule.
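
For quick triage, the per-rule results can be tabulated, for example with pandas. A minimal sketch, assuming each entry carries one boolean per requested rule (check the exact field names returned by your medchem version):

```python
import pandas as pd

# Tabulate per-rule outcomes; the column layout (one boolean column per rule name)
# is an assumption about the RuleFilters output, not a documented guarantee.
summary = pd.DataFrame(results)
print(summary.mean(numeric_only=True))  # fraction of molecules passing each rule
```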

### 2. Structural Alert Filters

Detect potentially problematic structural patterns using the `medchem.structural` module.

**Available Filters:**

1. **Common Alerts** - General structural alerts derived from ChEMBL curation and literature
2. **NIBR Filters** - Novartis Institutes for BioMedical Research filter set
3. **Lilly Demerits** - Eli Lilly's demerit-based system (275 rules, molecules rejected at >100 demerits)

**Common Alerts:**

```python
import medchem as mc

# Create filter
alert_filter = mc.structural.CommonAlertsFilters()

# Check single molecule
mol = dm.to_mol("c1ccccc1")
has_alerts, details = alert_filter.check_mol(mol)

# Batch filtering with parallelization
results = alert_filter(
    mols=mol_list,
    n_jobs=-1,
    progress=True
)
```

**NIBR Filters:**

```python
import medchem as mc

# Apply NIBR filters
nibr_filter = mc.structural.NIBRFilters()
results = nibr_filter(mols=mol_list, n_jobs=-1)
```

**Lilly Demerits:**

```python
import medchem as mc

# Calculate Lilly demerits
lilly = mc.structural.LillyDemeritsFilters()
results = lilly(mols=mol_list, n_jobs=-1)

# Each result includes demerit score and whether it passes (≤100 demerits)
```

### 3. Functional API for High-Level Operations

The `medchem.functional` module provides convenient functions for common workflows.

**Quick Filtering:**

```python
import medchem as mc

# Apply NIBR filters to a list
filter_ok = mc.functional.nibr_filter(
    mols=mol_list,
    n_jobs=-1
)

# Apply common alerts
alert_results = mc.functional.common_alerts_filter(
    mols=mol_list,
    n_jobs=-1
)
```

### 4. Chemical Groups Detection

Identify specific chemical groups and functional groups using `medchem.groups`.

**Available Groups:**
- Hinge binders
- Phosphate binders
- Michael acceptors
- Reactive groups
- Custom SMARTS patterns

**Usage:**

```python
import medchem as mc

# Create group detector
group = mc.groups.ChemicalGroup(groups=["hinge_binders"])

# Check for matches
has_matches = group.has_match(mol_list)

# Get detailed match information
matches = group.get_matches(mol)
```

### 5. Named Catalogs

Access curated collections of chemical structures through `medchem.catalogs`.

**Available Catalogs:**
- Functional groups
- Protecting groups
- Common reagents
- Standard fragments

**Usage:**

```python
import medchem as mc

# Access named catalogs
catalogs = mc.catalogs.NamedCatalogs

# Use catalog for matching
catalog = catalogs.get("functional_groups")
matches = catalog.get_matches(mol)
```

### 6. Molecular Complexity

Calculate complexity metrics that approximate synthetic accessibility using `medchem.complexity`.

**Common Metrics:**
- Bertz complexity
- Whitlock complexity
- Barone complexity

**Usage:**

```python
import medchem as mc

# Calculate complexity
complexity_score = mc.complexity.calculate_complexity(mol)

# Filter by complexity threshold
complex_filter = mc.complexity.ComplexityFilter(max_complexity=500)
results = complex_filter(mols=mol_list)
```

### 7. Constraints Filtering

Apply custom property-based constraints using `medchem.constraints`.

**Example Constraints:**
- Molecular weight ranges
- LogP bounds
- TPSA limits
- Rotatable bond counts

**Usage:**

```python
import medchem as mc

# Define constraints
constraints = mc.constraints.Constraints(
    mw_range=(200, 500),
    logp_range=(-2, 5),
    tpsa_max=140,
    rotatable_bonds_max=10
)

# Apply constraints
results = constraints(mols=mol_list, n_jobs=-1)
```

### 8. Medchem Query Language

Use a specialized query language for complex filtering criteria.

**Query Examples:**
```
# Molecules passing Ro5 AND not having common alerts
"rule_of_five AND NOT common_alerts"

# CNS-like molecules with low complexity
"rule_of_cns AND complexity < 400"

# Leadlike molecules without Lilly demerits
"rule_of_leadlike AND lilly_demerits == 0"
```

**Usage:**

```python
import medchem as mc

# Parse and apply query
query = mc.query.parse("rule_of_five AND NOT common_alerts")
results = query.apply(mols=mol_list, n_jobs=-1)
```

## Workflow Patterns

### Pattern 1: Initial Triage of Compound Library

Filter a large compound collection to identify drug-like candidates.

```python
import datamol as dm
import medchem as mc
import pandas as pd

# Load compound library
df = pd.read_csv("compounds.csv")
mols = [dm.to_mol(smi) for smi in df["smiles"]]

# Apply primary filters
rule_filter = mc.rules.RuleFilters(rule_list=["rule_of_five", "rule_of_veber"])
rule_results = rule_filter(mols=mols, n_jobs=-1, progress=True)

# Apply structural alerts
alert_filter = mc.structural.CommonAlertsFilters()
alert_results = alert_filter(mols=mols, n_jobs=-1, progress=True)

# Combine results
df["passes_rules"] = rule_results["pass"]
df["has_alerts"] = alert_results["has_alerts"]
df["drug_like"] = df["passes_rules"] & ~df["has_alerts"]

# Save filtered compounds
filtered_df = df[df["drug_like"]]
filtered_df.to_csv("filtered_compounds.csv", index=False)
```

### Pattern 2: Lead Optimization Filtering

Apply stricter criteria during lead optimization.

```python
import medchem as mc

# Create comprehensive filter
filters = {
    "rules": mc.rules.RuleFilters(rule_list=["rule_of_leadlike_strict"]),
    "alerts": mc.structural.NIBRFilters(),
    "lilly": mc.structural.LillyDemeritsFilters(),
    "complexity": mc.complexity.ComplexityFilter(max_complexity=400)
}

# Apply all filters
results = {}
for name, filt in filters.items():
    results[name] = filt(mols=candidate_mols, n_jobs=-1)

# Identify compounds passing all filters (combine per-molecule flags; the key
# names below follow the result formats described in references/api_guide.md)
passes_all = [
    rule_res["pass"] and nibr_ok and lilly_res["passes"] and complexity_ok
    for rule_res, nibr_ok, lilly_res, complexity_ok in zip(
        results["rules"], results["alerts"], results["lilly"], results["complexity"]
    )
]
```

### Pattern 3: Identify Specific Chemical Groups

Find molecules containing specific functional groups or scaffolds.

```python
import medchem as mc

# Create group detector for multiple groups
group_detector = mc.groups.ChemicalGroup(
    groups=["hinge_binders", "phosphate_binders"]
)

# Screen library
matches = group_detector.get_all_matches(mol_list)

# Filter molecules with desired groups
mol_with_groups = [mol for mol, match in zip(mol_list, matches) if match]
```

## Best Practices

1. **Context Matters**: Don't blindly apply filters. Understand the biological target and chemical space.

2. **Combine Multiple Filters**: Use rules, structural alerts, and domain knowledge together for better decisions.

3. **Use Parallelization**: For large datasets (>1000 molecules), always use `n_jobs=-1` for parallel processing.

4. **Iterative Refinement**: Start with broad filters (Ro5), then apply more specific criteria (CNS, leadlike) as needed.

5. **Document Filtering Decisions**: Track which molecules were filtered out and why for reproducibility (see the sketch after this list).

6. **Validate Results**: Remember that marketed drugs often fail standard filters—use these as guidelines, not absolute rules.

7. **Consider Prodrugs**: Molecules designed as prodrugs may intentionally violate standard medicinal chemistry rules.
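
A minimal sketch of the audit trail from point 5, assuming a pandas workflow; the column names and the `per_molecule_checks` structure are illustrative, not a medchem convention:

```python
import pandas as pd

# per_molecule_checks is assumed to hold one {filter_name: passed_bool} dict per
# compound, collected from whichever filters were applied upstream.
audit = pd.DataFrame({"smiles": smiles_list})
audit["failed_filters"] = [
    ";".join(name for name, ok in checks.items() if not ok)
    for checks in per_molecule_checks
]
audit.to_csv("filter_audit.csv", index=False)
```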

## Resources

### references/api_guide.md
Comprehensive API reference covering all medchem modules with detailed function signatures, parameters, and return types.

### references/rules_catalog.md
Complete catalog of available rules, filters, and alerts with descriptions, thresholds, and literature references.

### scripts/filter_molecules.py
Production-ready script for batch filtering workflows. Supports multiple input formats (CSV, SDF, SMILES), configurable filter combinations, and detailed reporting.

**Usage:**
```bash
python scripts/filter_molecules.py input.csv --rules rule_of_five,rule_of_cns --alerts nibr --output filtered.csv
```

## Documentation

Official documentation: https://medchem-docs.datamol.io/
GitHub repository: https://github.com/datamol-io/medchem
600
scientific-packages/medchem/references/api_guide.md
Normal file
@@ -0,0 +1,600 @@
|
||||
# Medchem API Reference
|
||||
|
||||
Comprehensive reference for all medchem modules and functions.
|
||||
|
||||
## Module: medchem.rules
|
||||
|
||||
### Class: RuleFilters
|
||||
|
||||
Filter molecules based on multiple medicinal chemistry rules.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
RuleFilters(rule_list: List[str])
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `rule_list`: List of rule names to apply. See available rules below.
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> Dict
|
||||
```
|
||||
- `mols`: List of RDKit molecule objects
|
||||
- `n_jobs`: Number of parallel jobs (-1 uses all cores)
|
||||
- `progress`: Show progress bar
|
||||
- **Returns**: Dictionary with results for each rule
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_five", "rule_of_cns"])
|
||||
results = rfilter(mols=mol_list, n_jobs=-1, progress=True)
|
||||
```
|
||||
|
||||
### Module: medchem.rules.basic_rules
|
||||
|
||||
Individual rule functions that can be applied to single molecules.
|
||||
|
||||
#### rule_of_five()
|
||||
|
||||
```python
|
||||
rule_of_five(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Lipinski's Rule of Five for oral bioavailability.
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight ≤ 500 Da
|
||||
- LogP ≤ 5
|
||||
- H-bond donors ≤ 5
|
||||
- H-bond acceptors ≤ 10
|
||||
|
||||
**Parameters:**
|
||||
- `mol`: SMILES string or RDKit molecule object
|
||||
|
||||
**Returns:** True if molecule passes all criteria
|
||||
|
||||
#### rule_of_three()
|
||||
|
||||
```python
|
||||
rule_of_three(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Rule of Three for fragment screening libraries.
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight ≤ 300 Da
|
||||
- LogP ≤ 3
|
||||
- H-bond donors ≤ 3
|
||||
- H-bond acceptors ≤ 3
|
||||
- Rotatable bonds ≤ 3
|
||||
- Polar surface area ≤ 60 Ų
|
||||
|
||||
#### rule_of_oprea()
|
||||
|
||||
```python
|
||||
rule_of_oprea(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Oprea's lead-like criteria for hit-to-lead optimization.
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight: 200-350 Da
|
||||
- LogP: -2 to 4
|
||||
- Rotatable bonds ≤ 7
|
||||
- Rings ≤ 4
|
||||
|
||||
#### rule_of_cns()
|
||||
|
||||
```python
|
||||
rule_of_cns(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
CNS drug-likeness rules.
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight ≤ 450 Da
|
||||
- LogP: -1 to 5
|
||||
- H-bond donors ≤ 2
|
||||
- TPSA ≤ 90 Ų
|
||||
|
||||
#### rule_of_leadlike_soft()
|
||||
|
||||
```python
|
||||
rule_of_leadlike_soft(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Soft lead-like criteria (more permissive).
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight: 250-450 Da
|
||||
- LogP: -3 to 4
|
||||
- Rotatable bonds ≤ 10
|
||||
|
||||
#### rule_of_leadlike_strict()
|
||||
|
||||
```python
|
||||
rule_of_leadlike_strict(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Strict lead-like criteria (more restrictive).
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight: 200-350 Da
|
||||
- LogP: -2 to 3.5
|
||||
- Rotatable bonds ≤ 7
|
||||
- Rings: 1-3
|
||||
|
||||
#### rule_of_veber()
|
||||
|
||||
```python
|
||||
rule_of_veber(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Veber's rules for oral bioavailability.
|
||||
|
||||
**Criteria:**
|
||||
- Rotatable bonds ≤ 10
|
||||
- TPSA ≤ 140 Ų
|
||||
|
||||
#### rule_of_reos()
|
||||
|
||||
```python
|
||||
rule_of_reos(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Rapid Elimination Of Swill (REOS) filter.
|
||||
|
||||
**Criteria:**
|
||||
- Molecular weight: 200-500 Da
|
||||
- LogP: -5 to 5
|
||||
- H-bond donors: 0-5
|
||||
- H-bond acceptors: 0-10
|
||||
|
||||
#### rule_of_drug()
|
||||
|
||||
```python
|
||||
rule_of_drug(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Combined drug-likeness criteria.
|
||||
|
||||
**Criteria:**
|
||||
- Passes Rule of Five
|
||||
- Passes Veber rules
|
||||
- No PAINS substructures
|
||||
|
||||
#### golden_triangle()
|
||||
|
||||
```python
|
||||
golden_triangle(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Golden Triangle for drug-likeness balance.
|
||||
|
||||
**Criteria:**
|
||||
- 200 ≤ MW ≤ 50×LogP + 400
|
||||
- LogP: -2 to 5
|
||||
|
||||
#### pains_filter()
|
||||
|
||||
```python
|
||||
pains_filter(mol: Union[str, Chem.Mol]) -> bool
|
||||
```
|
||||
|
||||
Pan Assay INterference compoundS (PAINS) filter.
|
||||
|
||||
**Returns:** True if molecule does NOT contain PAINS substructures
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.structural
|
||||
|
||||
### Class: CommonAlertsFilters
|
||||
|
||||
Filter for common structural alerts derived from ChEMBL and literature.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
CommonAlertsFilters()
|
||||
```
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> List[Dict]
|
||||
```
|
||||
|
||||
Apply common alerts filter to a list of molecules.
|
||||
|
||||
**Returns:** List of dictionaries with keys:
|
||||
- `has_alerts`: Boolean indicating if molecule has alerts
|
||||
- `alert_details`: List of matched alert patterns
|
||||
- `num_alerts`: Number of alerts found
|
||||
|
||||
```python
|
||||
check_mol(mol: Chem.Mol) -> Tuple[bool, List[str]]
|
||||
```
|
||||
|
||||
Check a single molecule for structural alerts.
|
||||
|
||||
**Returns:** Tuple of (has_alerts, list_of_alert_names)
|
||||
|
||||
### Class: NIBRFilters
|
||||
|
||||
Novartis NIBR medicinal chemistry filters.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
NIBRFilters()
|
||||
```
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> List[bool]
|
||||
```
|
||||
|
||||
Apply NIBR filters to molecules.
|
||||
|
||||
**Returns:** List of booleans (True if molecule passes)
|
||||
|
||||
### Class: LillyDemeritsFilters
|
||||
|
||||
Eli Lilly's demerit-based structural alert system (275 rules).
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
LillyDemeritsFilters()
|
||||
```
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
__call__(mols: List[Chem.Mol], n_jobs: int = 1, progress: bool = False) -> List[Dict]
|
||||
```
|
||||
|
||||
Calculate Lilly demerits for molecules.
|
||||
|
||||
**Returns:** List of dictionaries with keys:
|
||||
- `demerits`: Total demerit score
|
||||
- `passes`: Boolean (True if demerits ≤ 100)
|
||||
- `matched_patterns`: List of matched patterns with scores
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.functional
|
||||
|
||||
High-level functional API for common operations.
|
||||
|
||||
### nibr_filter()
|
||||
|
||||
```python
|
||||
nibr_filter(mols: List[Chem.Mol], n_jobs: int = 1) -> List[bool]
|
||||
```
|
||||
|
||||
Apply NIBR filters using functional API.
|
||||
|
||||
**Parameters:**
|
||||
- `mols`: List of molecules
|
||||
- `n_jobs`: Parallelization level
|
||||
|
||||
**Returns:** List of pass/fail booleans
|
||||
|
||||
### common_alerts_filter()
|
||||
|
||||
```python
|
||||
common_alerts_filter(mols: List[Chem.Mol], n_jobs: int = 1) -> List[Dict]
|
||||
```
|
||||
|
||||
Apply common alerts filter using functional API.
|
||||
|
||||
**Returns:** List of results dictionaries
|
||||
|
||||
### lilly_demerits_filter()
|
||||
|
||||
```python
|
||||
lilly_demerits_filter(mols: List[Chem.Mol], n_jobs: int = 1) -> List[Dict]
|
||||
```
|
||||
|
||||
Calculate Lilly demerits using functional API.
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.groups
|
||||
|
||||
### Class: ChemicalGroup
|
||||
|
||||
Detect specific chemical groups in molecules.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
ChemicalGroup(groups: List[str], custom_smarts: Optional[Dict[str, str]] = None)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `groups`: List of predefined group names
|
||||
- `custom_smarts`: Dictionary mapping custom group names to SMARTS patterns
|
||||
|
||||
**Predefined Groups:**
|
||||
- `"hinge_binders"`: Kinase hinge binding motifs
|
||||
- `"phosphate_binders"`: Phosphate binding groups
|
||||
- `"michael_acceptors"`: Michael acceptor electrophiles
|
||||
- `"reactive_groups"`: General reactive functionalities
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
has_match(mols: List[Chem.Mol]) -> List[bool]
|
||||
```
|
||||
|
||||
Check if molecules contain any of the specified groups.
|
||||
|
||||
```python
|
||||
get_matches(mol: Chem.Mol) -> Dict[str, List[Tuple]]
|
||||
```
|
||||
|
||||
Get detailed match information for a single molecule.
|
||||
|
||||
**Returns:** Dictionary mapping group names to lists of atom indices
|
||||
|
||||
```python
|
||||
get_all_matches(mols: List[Chem.Mol]) -> List[Dict]
|
||||
```
|
||||
|
||||
Get match information for all molecules.
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
group = mc.groups.ChemicalGroup(groups=["hinge_binders", "phosphate_binders"])
|
||||
matches = group.get_all_matches(mol_list)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.catalogs
|
||||
|
||||
### Class: NamedCatalogs
|
||||
|
||||
Access to curated chemical catalogs.
|
||||
|
||||
**Available Catalogs:**
|
||||
- `"functional_groups"`: Common functional groups
|
||||
- `"protecting_groups"`: Protecting group structures
|
||||
- `"reagents"`: Common reagents
|
||||
- `"fragments"`: Standard fragments
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
catalog = mc.catalogs.NamedCatalogs.get("functional_groups")
|
||||
matches = catalog.get_matches(mol)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.complexity
|
||||
|
||||
Calculate molecular complexity metrics.
|
||||
|
||||
### calculate_complexity()
|
||||
|
||||
```python
|
||||
calculate_complexity(mol: Chem.Mol, method: str = "bertz") -> float
|
||||
```
|
||||
|
||||
Calculate complexity score for a molecule.
|
||||
|
||||
**Parameters:**
|
||||
- `mol`: RDKit molecule
|
||||
- `method`: Complexity metric ("bertz", "whitlock", "barone")
|
||||
|
||||
**Returns:** Complexity score (higher = more complex)
|
||||
|
||||
### Class: ComplexityFilter
|
||||
|
||||
Filter molecules by complexity threshold.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
ComplexityFilter(max_complexity: float, method: str = "bertz")
|
||||
```
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
__call__(mols: List[Chem.Mol], n_jobs: int = 1) -> List[bool]
|
||||
```
|
||||
|
||||
Filter molecules exceeding complexity threshold.
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.constraints
|
||||
|
||||
### Class: Constraints
|
||||
|
||||
Apply custom property-based constraints.
|
||||
|
||||
**Constructor:**
|
||||
```python
|
||||
Constraints(
|
||||
mw_range: Optional[Tuple[float, float]] = None,
|
||||
logp_range: Optional[Tuple[float, float]] = None,
|
||||
tpsa_max: Optional[float] = None,
|
||||
tpsa_range: Optional[Tuple[float, float]] = None,
|
||||
hbd_max: Optional[int] = None,
|
||||
hba_max: Optional[int] = None,
|
||||
rotatable_bonds_max: Optional[int] = None,
|
||||
rings_range: Optional[Tuple[int, int]] = None,
|
||||
aromatic_rings_max: Optional[int] = None,
|
||||
)
|
||||
```
|
||||
|
||||
**Parameters:** All parameters are optional. Specify only the constraints needed.
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
__call__(mols: List[Chem.Mol], n_jobs: int = 1) -> List[Dict]
|
||||
```
|
||||
|
||||
Apply constraints to molecules.
|
||||
|
||||
**Returns:** List of dictionaries with keys:
|
||||
- `passes`: Boolean indicating if all constraints pass
|
||||
- `violations`: List of constraint names that failed
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
constraints = mc.constraints.Constraints(
|
||||
mw_range=(200, 500),
|
||||
logp_range=(-2, 5),
|
||||
tpsa_max=140
|
||||
)
|
||||
results = constraints(mols=mol_list, n_jobs=-1)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.query
|
||||
|
||||
Query language for complex filtering.
|
||||
|
||||
### parse()
|
||||
|
||||
```python
|
||||
parse(query: str) -> Query
|
||||
```
|
||||
|
||||
Parse a medchem query string into a Query object.
|
||||
|
||||
**Query Syntax:**
|
||||
- Operators: `AND`, `OR`, `NOT`
|
||||
- Comparisons: `<`, `>`, `<=`, `>=`, `==`, `!=`
|
||||
- Properties: `complexity`, `lilly_demerits`, `mw`, `logp`, `tpsa`
|
||||
- Rules: `rule_of_five`, `rule_of_cns`, etc.
|
||||
- Filters: `common_alerts`, `nibr_filter`, `pains_filter`
|
||||
|
||||
**Example Queries:**
|
||||
```python
|
||||
"rule_of_five AND NOT common_alerts"
|
||||
"rule_of_cns AND complexity < 400"
|
||||
"mw > 200 AND mw < 500 AND logp < 5"
|
||||
"(rule_of_five OR rule_of_oprea) AND NOT pains_filter"
|
||||
```
|
||||
|
||||
### Class: Query
|
||||
|
||||
**Methods:**
|
||||
|
||||
```python
|
||||
apply(mols: List[Chem.Mol], n_jobs: int = 1) -> List[bool]
|
||||
```
|
||||
|
||||
Apply parsed query to molecules.
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
query = mc.query.parse("rule_of_five AND NOT common_alerts")
|
||||
results = query.apply(mols=mol_list, n_jobs=-1)
|
||||
passing_mols = [mol for mol, passes in zip(mol_list, results) if passes]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module: medchem.utils
|
||||
|
||||
Utility functions for working with molecules.
|
||||
|
||||
### batch_process()
|
||||
|
||||
```python
|
||||
batch_process(
|
||||
mols: List[Chem.Mol],
|
||||
func: Callable,
|
||||
n_jobs: int = 1,
|
||||
progress: bool = False,
|
||||
batch_size: Optional[int] = None
|
||||
) -> List
|
||||
```
|
||||
|
||||
Process molecules in parallel batches.
|
||||
|
||||
**Parameters:**
|
||||
- `mols`: List of molecules
|
||||
- `func`: Function to apply to each molecule
|
||||
- `n_jobs`: Number of parallel workers
|
||||
- `progress`: Show progress bar
|
||||
- `batch_size`: Size of processing batches
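
A usage sketch under the signature documented above (the per-molecule callable, RDKit's `MolWt`, is just an illustrative choice, and `mol_list` is assumed to be a list of RDKit molecules):

```python
import medchem as mc
from rdkit.Chem import Descriptors

# Compute molecular weights in parallel batches with the helper described above.
mw_values = mc.utils.batch_process(
    mols=mol_list,
    func=Descriptors.MolWt,
    n_jobs=-1,
    progress=True,
)
```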
|
||||
|
||||
### standardize_mol()
|
||||
|
||||
```python
|
||||
standardize_mol(mol: Chem.Mol) -> Chem.Mol
|
||||
```
|
||||
|
||||
Standardize molecule representation (sanitize, neutralize charges, etc.).
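
A short sketch of where this fits in a filtering pipeline, assuming the helper behaves as described above:

```python
import datamol as dm
import medchem as mc

# Standardize before filtering so rules and alerts see a consistent representation.
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)[O-]")  # deprotonated aspirin, illustrative input
mol = mc.utils.standardize_mol(mol)
```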
|
||||
|
||||
---
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Pattern: Parallel Processing
|
||||
|
||||
All filters support parallelization:
|
||||
|
||||
```python
|
||||
# Use all CPU cores
|
||||
results = filter_object(mols=mol_list, n_jobs=-1, progress=True)
|
||||
|
||||
# Use specific number of cores
|
||||
results = filter_object(mols=mol_list, n_jobs=4, progress=True)
|
||||
```
|
||||
|
||||
### Pattern: Combining Multiple Filters
|
||||
|
||||
```python
|
||||
import medchem as mc
|
||||
|
||||
# Apply multiple filters
|
||||
rule_filter = mc.rules.RuleFilters(rule_list=["rule_of_five"])
|
||||
alert_filter = mc.structural.CommonAlertsFilters()
|
||||
lilly_filter = mc.structural.LillyDemeritsFilters()
|
||||
|
||||
# Get results
|
||||
rule_results = rule_filter(mols=mol_list, n_jobs=-1)
|
||||
alert_results = alert_filter(mols=mol_list, n_jobs=-1)
|
||||
lilly_results = lilly_filter(mols=mol_list, n_jobs=-1)
|
||||
|
||||
# Combine criteria
|
||||
passing_mols = [
|
||||
mol for i, mol in enumerate(mol_list)
|
||||
if rule_results[i]["passes"]
|
||||
and not alert_results[i]["has_alerts"]
|
||||
and lilly_results[i]["passes"]
|
||||
]
|
||||
```
|
||||
|
||||
### Pattern: Working with DataFrames
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
import datamol as dm
|
||||
import medchem as mc
|
||||
|
||||
# Load data
|
||||
df = pd.read_csv("molecules.csv")
|
||||
df["mol"] = df["smiles"].apply(dm.to_mol)
|
||||
|
||||
# Apply filters
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_five", "rule_of_cns"])
|
||||
results = rfilter(mols=df["mol"].tolist(), n_jobs=-1)
|
||||
|
||||
# Add results to dataframe
|
||||
df["passes_ro5"] = [r["rule_of_five"] for r in results]
|
||||
df["passes_cns"] = [r["rule_of_cns"] for r in results]
|
||||
|
||||
# Filter dataframe
|
||||
filtered_df = df[df["passes_ro5"] & df["passes_cns"]]
|
||||
```
|
||||
604
scientific-packages/medchem/references/rules_catalog.md
Normal file
@@ -0,0 +1,604 @@
|
||||
# Medchem Rules and Filters Catalog
|
||||
|
||||
Comprehensive catalog of all available medicinal chemistry rules, structural alerts, and filters in medchem.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Drug-Likeness Rules](#drug-likeness-rules)
|
||||
2. [Lead-Likeness Rules](#lead-likeness-rules)
|
||||
3. [Fragment Rules](#fragment-rules)
|
||||
4. [CNS Rules](#cns-rules)
|
||||
5. [Structural Alert Filters](#structural-alert-filters)
|
||||
6. [Chemical Group Patterns](#chemical-group-patterns)
|
||||
|
||||
---
|
||||
|
||||
## Drug-Likeness Rules
|
||||
|
||||
### Rule of Five (Lipinski)
|
||||
|
||||
**Reference:** Lipinski et al., Adv Drug Deliv Rev (1997) 23:3-25
|
||||
|
||||
**Purpose:** Predict oral bioavailability
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight ≤ 500 Da
|
||||
- LogP ≤ 5
|
||||
- Hydrogen Bond Donors ≤ 5
|
||||
- Hydrogen Bond Acceptors ≤ 10
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_five(mol)
|
||||
```
|
||||
|
||||
**Notes:**
|
||||
- One of the most widely used filters in drug discovery
|
||||
- About 90% of orally active drugs comply with these rules
|
||||
- Exceptions exist, especially for natural products and antibiotics
|
||||
|
||||
---
|
||||
|
||||
### Rule of Veber
|
||||
|
||||
**Reference:** Veber et al., J Med Chem (2002) 45:2615-2623
|
||||
|
||||
**Purpose:** Additional criteria for oral bioavailability
|
||||
|
||||
**Criteria:**
|
||||
- Rotatable Bonds ≤ 10
|
||||
- Topological Polar Surface Area (TPSA) ≤ 140 Ų
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_veber(mol)
|
||||
```
|
||||
|
||||
**Notes:**
|
||||
- Complements Rule of Five
|
||||
- TPSA correlates with cell permeability
|
||||
- Rotatable bonds affect molecular flexibility
|
||||
|
||||
---
|
||||
|
||||
### Rule of Drug
|
||||
|
||||
**Purpose:** Combined drug-likeness assessment
|
||||
|
||||
**Criteria:**
|
||||
- Passes Rule of Five
|
||||
- Passes Veber rules
|
||||
- Does not contain PAINS substructures
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_drug(mol)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### REOS (Rapid Elimination Of Swill)
|
||||
|
||||
**Reference:** Walters & Murcko, Adv Drug Deliv Rev (2002) 54:255-271
|
||||
|
||||
**Purpose:** Filter out compounds unlikely to be drugs
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight: 200-500 Da
|
||||
- LogP: -5 to 5
|
||||
- Hydrogen Bond Donors: 0-5
|
||||
- Hydrogen Bond Acceptors: 0-10
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_reos(mol)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Golden Triangle
|
||||
|
||||
**Reference:** Johnson et al., J Med Chem (2009) 52:5487-5500
|
||||
|
||||
**Purpose:** Balance lipophilicity and molecular weight
|
||||
|
||||
**Criteria:**
|
||||
- 200 ≤ MW ≤ 50 × LogP + 400
|
||||
- LogP: -2 to 5
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.golden_triangle(mol)
|
||||
```
|
||||
|
||||
**Notes:**
|
||||
- Defines optimal physicochemical space
|
||||
- Visual representation resembles a triangle on MW vs LogP plot
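
A quick worked example of the molecular-weight bound (plain arithmetic, not medchem code):

```python
# At LogP = 2 the Golden Triangle ceiling is 50 * 2 + 400 = 500 Da, so a
# 480 Da molecule with LogP 2 falls inside the triangle.
logp, mw = 2.0, 480.0
inside = (-2 <= logp <= 5) and (200 <= mw <= 50 * logp + 400)
print(inside)  # True
```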
|
||||
|
||||
---
|
||||
|
||||
## Lead-Likeness Rules
|
||||
|
||||
### Rule of Oprea
|
||||
|
||||
**Reference:** Oprea et al., J Chem Inf Comput Sci (2001) 41:1308-1315
|
||||
|
||||
**Purpose:** Identify lead-like compounds for optimization
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight: 200-350 Da
|
||||
- LogP: -2 to 4
|
||||
- Rotatable Bonds ≤ 7
|
||||
- Number of Rings ≤ 4
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_oprea(mol)
|
||||
```
|
||||
|
||||
**Rationale:** Lead compounds should have "room to grow" during optimization
|
||||
|
||||
---
|
||||
|
||||
### Rule of Leadlike (Soft)
|
||||
|
||||
**Purpose:** Permissive lead-like criteria
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight: 250-450 Da
|
||||
- LogP: -3 to 4
|
||||
- Rotatable Bonds ≤ 10
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_leadlike_soft(mol)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Rule of Leadlike (Strict)
|
||||
|
||||
**Purpose:** Restrictive lead-like criteria
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight: 200-350 Da
|
||||
- LogP: -2 to 3.5
|
||||
- Rotatable Bonds ≤ 7
|
||||
- Number of Rings: 1-3
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_leadlike_strict(mol)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Fragment Rules
|
||||
|
||||
### Rule of Three
|
||||
|
||||
**Reference:** Congreve et al., Drug Discov Today (2003) 8:876-877
|
||||
|
||||
**Purpose:** Screen fragment libraries for fragment-based drug discovery
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight ≤ 300 Da
|
||||
- LogP ≤ 3
|
||||
- Hydrogen Bond Donors ≤ 3
|
||||
- Hydrogen Bond Acceptors ≤ 3
|
||||
- Rotatable Bonds ≤ 3
|
||||
- Polar Surface Area ≤ 60 Ų
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_three(mol)
|
||||
```
|
||||
|
||||
**Notes:**
|
||||
- Fragments are grown into leads during optimization
|
||||
- Lower complexity allows more starting points
|
||||
|
||||
---
|
||||
|
||||
## CNS Rules
|
||||
|
||||
### Rule of CNS
|
||||
|
||||
**Purpose:** Central nervous system drug-likeness
|
||||
|
||||
**Criteria:**
|
||||
- Molecular Weight ≤ 450 Da
|
||||
- LogP: -1 to 5
|
||||
- Hydrogen Bond Donors ≤ 2
|
||||
- TPSA ≤ 90 Ų
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.rule_of_cns(mol)
|
||||
```
|
||||
|
||||
**Rationale:**
|
||||
- Blood-brain barrier penetration requires specific properties
|
||||
- Lower TPSA and HBD count improve BBB permeability
|
||||
- Tight constraints reflect CNS challenges
|
||||
|
||||
---
|
||||
|
||||
## Structural Alert Filters
|
||||
|
||||
### PAINS (Pan Assay INterference compoundS)
|
||||
|
||||
**Reference:** Baell & Holloway, J Med Chem (2010) 53:2719-2740
|
||||
|
||||
**Purpose:** Identify compounds that interfere with assays
|
||||
|
||||
**Categories:**
|
||||
- Catechols
|
||||
- Quinones
|
||||
- Rhodanines
|
||||
- Hydroxyphenylhydrazones
|
||||
- Alkyl/aryl aldehydes
|
||||
- Michael acceptors (specific patterns)
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
mc.rules.basic_rules.pains_filter(mol)
|
||||
# Returns True if NO PAINS found
|
||||
```
|
||||
|
||||
**Notes:**
|
||||
- PAINS compounds show activity in multiple assays through non-specific mechanisms
|
||||
- Common false positives in screening campaigns
|
||||
- Should be deprioritized in lead selection
|
||||
|
||||
---
|
||||
|
||||
### Common Alerts Filters
|
||||
|
||||
**Source:** Derived from ChEMBL curation and medicinal chemistry literature
|
||||
|
||||
**Purpose:** Flag common problematic structural patterns
|
||||
|
||||
**Alert Categories:**
|
||||
1. **Reactive Groups**
|
||||
- Epoxides
|
||||
- Aziridines
|
||||
- Acid halides
|
||||
- Isocyanates
|
||||
|
||||
2. **Metabolic Liabilities**
|
||||
- Hydrazines
|
||||
- Thioureas
|
||||
- Anilines (certain patterns)
|
||||
|
||||
3. **Aggregators**
|
||||
- Polyaromatic systems
|
||||
- Long aliphatic chains
|
||||
|
||||
4. **Toxicophores**
|
||||
- Nitro aromatics
|
||||
- Aromatic N-oxides
|
||||
- Certain heterocycles
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
alert_filter = mc.structural.CommonAlertsFilters()
|
||||
has_alerts, details = alert_filter.check_mol(mol)
|
||||
```
|
||||
|
||||
**Return Format:**
|
||||
```python
|
||||
{
|
||||
"has_alerts": True,
|
||||
"alert_details": ["reactive_epoxide", "metabolic_hydrazine"],
|
||||
"num_alerts": 2
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### NIBR Filters
|
||||
|
||||
**Source:** Novartis Institutes for BioMedical Research
|
||||
|
||||
**Purpose:** Industrial medicinal chemistry filtering rules
|
||||
|
||||
**Features:**
|
||||
- Proprietary filter set developed from Novartis experience
|
||||
- Balances drug-likeness with practical medicinal chemistry
|
||||
- Includes both structural alerts and property filters
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
nibr_filter = mc.structural.NIBRFilters()
|
||||
results = nibr_filter(mols=mol_list, n_jobs=-1)
|
||||
```
|
||||
|
||||
**Return Format:** Boolean list (True = passes)
|
||||
|
||||
---
|
||||
|
||||
### Lilly Demerits Filter
|
||||
|
||||
**Reference:** Based on Eli Lilly medicinal chemistry rules
|
||||
|
||||
**Source:** 275 structural patterns accumulated over 18 years
|
||||
|
||||
**Purpose:** Identify assay interference and problematic functionalities
|
||||
|
||||
**Mechanism:**
|
||||
- Each matched pattern adds demerits
|
||||
- Molecules with >100 demerits are rejected
|
||||
- Some patterns add 10-50 demerits, others add 100+ (instant rejection)
|
||||
|
||||
**Demerit Categories:**
|
||||
|
||||
1. **High Demerits (>50):**
|
||||
- Known toxic groups
|
||||
- Highly reactive functionalities
|
||||
- Strong metal chelators
|
||||
|
||||
2. **Medium Demerits (20-50):**
|
||||
- Metabolic liabilities
|
||||
- Aggregation-prone structures
|
||||
- Frequent hitters
|
||||
|
||||
3. **Low Demerits (5-20):**
|
||||
- Minor concerns
|
||||
- Context-dependent issues
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
lilly_filter = mc.structural.LillyDemeritsFilters()
|
||||
results = lilly_filter(mols=mol_list, n_jobs=-1)
|
||||
```
|
||||
|
||||
**Return Format:**
|
||||
```python
|
||||
{
|
||||
"demerits": 35,
|
||||
"passes": True, # (demerits ≤ 100)
|
||||
"matched_patterns": [
|
||||
{"pattern": "phenolic_ester", "demerits": 20},
|
||||
{"pattern": "aniline_derivative", "demerits": 15}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Chemical Group Patterns
|
||||
|
||||
### Hinge Binders
|
||||
|
||||
**Purpose:** Identify kinase hinge-binding motifs
|
||||
|
||||
**Common Patterns:**
|
||||
- Aminopyridines
|
||||
- Aminopyrimidines
|
||||
- Indazoles
|
||||
- Benzimidazoles
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
group = mc.groups.ChemicalGroup(groups=["hinge_binders"])
|
||||
has_hinge = group.has_match(mol_list)
|
||||
```
|
||||
|
||||
**Application:** Kinase inhibitor design
|
||||
|
||||
---
|
||||
|
||||
### Phosphate Binders
|
||||
|
||||
**Purpose:** Identify phosphate-binding groups
|
||||
|
||||
**Common Patterns:**
|
||||
- Basic amines in specific geometries
|
||||
- Guanidinium groups
|
||||
- Arginine mimetics
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
group = mc.groups.ChemicalGroup(groups=["phosphate_binders"])
|
||||
```
|
||||
|
||||
**Application:** Kinase inhibitors, phosphatase inhibitors
|
||||
|
||||
---
|
||||
|
||||
### Michael Acceptors
|
||||
|
||||
**Purpose:** Identify electrophilic Michael acceptor groups
|
||||
|
||||
**Common Patterns:**
|
||||
- α,β-Unsaturated carbonyls
|
||||
- α,β-Unsaturated nitriles
|
||||
- Vinyl sulfones
|
||||
- Acrylamides
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
group = mc.groups.ChemicalGroup(groups=["michael_acceptors"])
|
||||
```
|
||||
|
||||
**Notes:**
|
||||
- Can be desirable for covalent inhibitors
|
||||
- Often flagged as reactive alerts in screening
|
||||
|
||||
---
|
||||
|
||||
### Reactive Groups
|
||||
|
||||
**Purpose:** Identify generally reactive functionalities
|
||||
|
||||
**Common Patterns:**
|
||||
- Epoxides
|
||||
- Aziridines
|
||||
- Acyl halides
|
||||
- Isocyanates
|
||||
- Sulfonyl chlorides
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
group = mc.groups.ChemicalGroup(groups=["reactive_groups"])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Custom SMARTS Patterns
|
||||
|
||||
Define custom structural patterns using SMARTS:
|
||||
|
||||
```python
|
||||
custom_patterns = {
|
||||
"my_warhead": "[C;H0](=O)C(F)(F)F", # Trifluoromethyl ketone
|
||||
"my_scaffold": "c1ccc2c(c1)ncc(n2)N", # Aminobenzimidazole
|
||||
}
|
||||
|
||||
group = mc.groups.ChemicalGroup(
|
||||
groups=["hinge_binders"],
|
||||
custom_smarts=custom_patterns
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Filter Selection Guidelines
|
||||
|
||||
### Initial Screening (High-Throughput)
|
||||
|
||||
Recommended filters:
|
||||
- Rule of Five
|
||||
- PAINS filter
|
||||
- Common Alerts (permissive settings)
|
||||
|
||||
```python
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_five", "pains_filter"])
|
||||
alert_filter = mc.structural.CommonAlertsFilters()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Hit-to-Lead
|
||||
|
||||
Recommended filters:
|
||||
- Rule of Oprea or Leadlike (soft)
|
||||
- NIBR filters
|
||||
- Lilly Demerits
|
||||
|
||||
```python
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_oprea"])
|
||||
nibr_filter = mc.structural.NIBRFilters()
|
||||
lilly_filter = mc.structural.LillyDemeritsFilters()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Lead Optimization
|
||||
|
||||
Recommended filters:
|
||||
- Rule of Drug
|
||||
- Leadlike (strict)
|
||||
- Full structural alert analysis
|
||||
- Complexity filters
|
||||
|
||||
```python
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_drug", "rule_of_leadlike_strict"])
|
||||
alert_filter = mc.structural.CommonAlertsFilters()
|
||||
complexity_filter = mc.complexity.ComplexityFilter(max_complexity=400)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### CNS Targets
|
||||
|
||||
Recommended filters:
|
||||
- Rule of CNS
|
||||
- Reduced PAINS criteria (CNS-focused)
|
||||
- BBB permeability constraints
|
||||
|
||||
```python
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_cns"])
|
||||
constraints = mc.constraints.Constraints(
|
||||
tpsa_max=90,
|
||||
hbd_max=2,
|
||||
mw_range=(300, 450)
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fragment-Based Drug Discovery
|
||||
|
||||
Recommended filters:
|
||||
- Rule of Three
|
||||
- Minimal complexity
|
||||
- Basic reactive group check
|
||||
|
||||
```python
|
||||
rfilter = mc.rules.RuleFilters(rule_list=["rule_of_three"])
|
||||
complexity_filter = mc.complexity.ComplexityFilter(max_complexity=250)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Important Considerations
|
||||
|
||||
### False Positives and False Negatives
|
||||
|
||||
**Filters are guidelines, not absolutes:**
|
||||
|
||||
1. **False Positives** (good drugs flagged):
|
||||
- ~10% of marketed drugs fail Rule of Five
|
||||
- Natural products often violate standard rules
|
||||
- Prodrugs intentionally break rules
|
||||
- Antibiotics and antivirals frequently non-compliant
|
||||
|
||||
2. **False Negatives** (bad compounds passing):
|
||||
- Passing filters doesn't guarantee success
|
||||
- Target-specific issues not captured
|
||||
- In vivo properties not fully predicted
|
||||
|
||||
### Context-Specific Application
|
||||
|
||||
**Different contexts require different criteria:**
|
||||
|
||||
- **Target Class:** Kinases vs GPCRs vs ion channels have different optimal spaces
|
||||
- **Modality:** Small molecules vs PROTACs vs molecular glues
|
||||
- **Administration Route:** Oral vs IV vs topical
|
||||
- **Disease Area:** CNS vs oncology vs infectious disease
|
||||
- **Stage:** Screening vs hit-to-lead vs lead optimization
|
||||
|
||||
### Complementing with Machine Learning
|
||||
|
||||
Modern approaches combine rules with ML:
|
||||
|
||||
```python
|
||||
# Rule-based pre-filtering
|
||||
rule_results = mc.rules.RuleFilters(rule_list=["rule_of_five"])(mols)
|
||||
filtered_mols = [mol for mol, r in zip(mols, rule_results) if r["passes"]]
|
||||
|
||||
# ML model scoring on filtered set
|
||||
ml_scores = ml_model.predict(filtered_mols)
|
||||
|
||||
# Combined decision
|
||||
final_candidates = [
|
||||
mol for mol, score in zip(filtered_mols, ml_scores)
|
||||
if score > threshold
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
1. Lipinski CA et al. Adv Drug Deliv Rev (1997) 23:3-25
|
||||
2. Veber DF et al. J Med Chem (2002) 45:2615-2623
|
||||
3. Oprea TI et al. J Chem Inf Comput Sci (2001) 41:1308-1315
|
||||
4. Congreve M et al. Drug Discov Today (2003) 8:876-877
|
||||
5. Baell JB & Holloway GA. J Med Chem (2010) 53:2719-2740
|
||||
6. Johnson TW et al. J Med Chem (2009) 52:5487-5500
|
||||
7. Walters WP & Murcko MA. Adv Drug Deliv Rev (2002) 54:255-271
|
||||
8. Hann MM & Oprea TI. Curr Opin Chem Biol (2004) 8:255-263
|
||||
9. Rishton GM. Drug Discov Today (1997) 2:382-384
|
||||
418
scientific-packages/medchem/scripts/filter_molecules.py
Normal file
@@ -0,0 +1,418 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Batch molecular filtering using medchem library.
|
||||
|
||||
This script provides a production-ready workflow for filtering compound libraries
|
||||
using medchem rules, structural alerts, and custom constraints.
|
||||
|
||||
Usage:
|
||||
python filter_molecules.py input.csv --rules rule_of_five,rule_of_cns --alerts nibr --output filtered.csv
|
||||
python filter_molecules.py input.sdf --rules rule_of_drug --lilly --complexity 400 --output results.csv
|
||||
python filter_molecules.py smiles.txt --nibr --pains --n-jobs -1 --output clean.csv
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Optional, Tuple
|
||||
import json
|
||||
|
||||
try:
|
||||
import pandas as pd
|
||||
import datamol as dm
|
||||
import medchem as mc
|
||||
from rdkit import Chem
|
||||
from tqdm import tqdm
|
||||
except ImportError as e:
|
||||
print(f"Error: Missing required package: {e}")
|
||||
print("Install dependencies: pip install medchem datamol pandas tqdm")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def load_molecules(input_file: Path, smiles_column: str = "smiles") -> Tuple[pd.DataFrame, List[Chem.Mol]]:
|
||||
"""
|
||||
Load molecules from various file formats.
|
||||
|
||||
Supports:
|
||||
- CSV/TSV with SMILES column
|
||||
- SDF files
|
||||
- Plain text files with one SMILES per line
|
||||
|
||||
Returns:
|
||||
Tuple of (DataFrame with metadata, list of RDKit molecules)
|
||||
"""
|
||||
suffix = input_file.suffix.lower()
|
||||
|
||||
if suffix == ".sdf":
|
||||
print(f"Loading SDF file: {input_file}")
|
||||
supplier = Chem.SDMolSupplier(str(input_file))
|
||||
mols = [mol for mol in supplier if mol is not None]
|
||||
|
||||
# Create DataFrame from SDF properties
|
||||
data = []
|
||||
for mol in mols:
|
||||
props = mol.GetPropsAsDict()
|
||||
props["smiles"] = Chem.MolToSmiles(mol)
|
||||
data.append(props)
|
||||
df = pd.DataFrame(data)
|
||||
|
||||
elif suffix in [".csv", ".tsv"]:
|
||||
print(f"Loading CSV/TSV file: {input_file}")
|
||||
sep = "\t" if suffix == ".tsv" else ","
|
||||
df = pd.read_csv(input_file, sep=sep)
|
||||
|
||||
if smiles_column not in df.columns:
|
||||
print(f"Error: Column '{smiles_column}' not found in file")
|
||||
print(f"Available columns: {', '.join(df.columns)}")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"Converting SMILES to molecules...")
|
||||
mols = [dm.to_mol(smi) for smi in tqdm(df[smiles_column], desc="Parsing")]
|
||||
|
||||
elif suffix == ".txt":
|
||||
print(f"Loading text file: {input_file}")
|
||||
with open(input_file) as f:
|
||||
smiles_list = [line.strip() for line in f if line.strip()]
|
||||
|
||||
df = pd.DataFrame({"smiles": smiles_list})
|
||||
print(f"Converting SMILES to molecules...")
|
||||
mols = [dm.to_mol(smi) for smi in tqdm(smiles_list, desc="Parsing")]
|
||||
|
||||
else:
|
||||
print(f"Error: Unsupported file format: {suffix}")
|
||||
print("Supported formats: .csv, .tsv, .sdf, .txt")
|
||||
sys.exit(1)
|
||||
|
||||
# Filter out invalid molecules
|
||||
valid_indices = [i for i, mol in enumerate(mols) if mol is not None]
|
||||
if len(valid_indices) < len(mols):
|
||||
n_invalid = len(mols) - len(valid_indices)
|
||||
print(f"Warning: {n_invalid} invalid molecules removed")
|
||||
df = df.iloc[valid_indices].reset_index(drop=True)
|
||||
mols = [mols[i] for i in valid_indices]
|
||||
|
||||
print(f"Loaded {len(mols)} valid molecules")
|
||||
return df, mols
|
||||
|
||||
|
||||
def apply_rule_filters(mols: List[Chem.Mol], rules: List[str], n_jobs: int) -> pd.DataFrame:
|
||||
"""Apply medicinal chemistry rule filters."""
|
||||
print(f"\nApplying rule filters: {', '.join(rules)}")
|
||||
|
||||
rfilter = mc.rules.RuleFilters(rule_list=rules)
|
||||
results = rfilter(mols=mols, n_jobs=n_jobs, progress=True)
|
||||
|
||||
# Convert to DataFrame
|
||||
df_results = pd.DataFrame(results)
|
||||
|
||||
# Add summary column
|
||||
df_results["passes_all_rules"] = df_results.all(axis=1)
|
||||
|
||||
return df_results
|
||||
|
||||
|
||||
def apply_structural_alerts(mols: List[Chem.Mol], alert_type: str, n_jobs: int) -> pd.DataFrame:
|
||||
"""Apply structural alert filters."""
|
||||
print(f"\nApplying {alert_type} structural alerts...")
|
||||
|
||||
if alert_type == "common":
|
||||
alert_filter = mc.structural.CommonAlertsFilters()
|
||||
results = alert_filter(mols=mols, n_jobs=n_jobs, progress=True)
|
||||
|
||||
df_results = pd.DataFrame({
|
||||
"has_common_alerts": [r["has_alerts"] for r in results],
|
||||
"num_common_alerts": [r["num_alerts"] for r in results],
|
||||
"common_alert_details": [", ".join(r["alert_details"]) if r["alert_details"] else "" for r in results]
|
||||
})
|
||||
|
||||
elif alert_type == "nibr":
|
||||
nibr_filter = mc.structural.NIBRFilters()
|
||||
results = nibr_filter(mols=mols, n_jobs=n_jobs, progress=True)
|
||||
|
||||
df_results = pd.DataFrame({
|
||||
"passes_nibr": results
|
||||
})
|
||||
|
||||
elif alert_type == "lilly":
|
||||
lilly_filter = mc.structural.LillyDemeritsFilters()
|
||||
results = lilly_filter(mols=mols, n_jobs=n_jobs, progress=True)
|
||||
|
||||
df_results = pd.DataFrame({
|
||||
"lilly_demerits": [r["demerits"] for r in results],
|
||||
"passes_lilly": [r["passes"] for r in results],
|
||||
"lilly_patterns": [", ".join([p["pattern"] for p in r["matched_patterns"]]) for r in results]
|
||||
})
|
||||
|
||||
elif alert_type == "pains":
|
||||
results = [mc.rules.basic_rules.pains_filter(mol) for mol in tqdm(mols, desc="PAINS")]
|
||||
|
||||
df_results = pd.DataFrame({
|
||||
"passes_pains": results
|
||||
})
|
||||
|
||||
else:
|
||||
raise ValueError(f"Unknown alert type: {alert_type}")
|
||||
|
||||
return df_results
|
||||
|
||||
|
||||
def apply_complexity_filter(mols: List[Chem.Mol], max_complexity: float, method: str = "bertz") -> pd.DataFrame:
|
||||
"""Calculate molecular complexity."""
|
||||
print(f"\nCalculating molecular complexity (method={method}, max={max_complexity})...")
|
||||
|
||||
complexity_scores = [
|
||||
mc.complexity.calculate_complexity(mol, method=method)
|
||||
for mol in tqdm(mols, desc="Complexity")
|
||||
]
|
||||
|
||||
df_results = pd.DataFrame({
|
||||
"complexity_score": complexity_scores,
|
||||
"passes_complexity": [score <= max_complexity for score in complexity_scores]
|
||||
})
|
||||
|
||||
return df_results
|
||||
|
||||
|
||||
def apply_constraints(mols: List[Chem.Mol], constraints: Dict, n_jobs: int) -> pd.DataFrame:
|
||||
"""Apply custom property constraints."""
|
||||
print(f"\nApplying constraints: {constraints}")
|
||||
|
||||
constraint_filter = mc.constraints.Constraints(**constraints)
|
||||
results = constraint_filter(mols=mols, n_jobs=n_jobs, progress=True)
|
||||
|
||||
df_results = pd.DataFrame({
|
||||
"passes_constraints": [r["passes"] for r in results],
|
||||
"constraint_violations": [", ".join(r["violations"]) if r["violations"] else "" for r in results]
|
||||
})
|
||||
|
||||
return df_results
|
||||
|
||||
|
||||
def apply_chemical_groups(mols: List[Chem.Mol], groups: List[str]) -> pd.DataFrame:
|
||||
"""Detect chemical groups."""
|
||||
print(f"\nDetecting chemical groups: {', '.join(groups)}")
|
||||
|
||||
group_detector = mc.groups.ChemicalGroup(groups=groups)
|
||||
results = group_detector.get_all_matches(mols)
|
||||
|
||||
df_results = pd.DataFrame()
|
||||
for group in groups:
|
||||
df_results[f"has_{group}"] = [bool(r.get(group)) for r in results]
|
||||
|
||||
return df_results
|
||||
|
||||
|
||||
def generate_summary(df: pd.DataFrame, output_file: Path):
|
||||
"""Generate filtering summary report."""
|
||||
summary_file = output_file.parent / f"{output_file.stem}_summary.txt"
|
||||
|
||||
with open(summary_file, "w") as f:
|
||||
f.write("=" * 80 + "\n")
|
||||
f.write("MEDCHEM FILTERING SUMMARY\n")
|
||||
f.write("=" * 80 + "\n\n")
|
||||
|
||||
f.write(f"Total molecules processed: {len(df)}\n\n")
|
||||
|
||||
# Rule results
|
||||
rule_cols = [col for col in df.columns if col.startswith("rule_") or col == "passes_all_rules"]
|
||||
if rule_cols:
|
||||
f.write("RULE FILTERS:\n")
|
||||
f.write("-" * 40 + "\n")
|
||||
for col in rule_cols:
|
||||
if col in df.columns and df[col].dtype == bool:
|
||||
n_pass = df[col].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write(f" {col}: {n_pass} passed ({pct:.1f}%)\n")
|
||||
f.write("\n")
|
||||
|
||||
# Structural alerts
|
||||
alert_cols = [col for col in df.columns if "alert" in col.lower() or "nibr" in col.lower() or "lilly" in col.lower() or "pains" in col.lower()]
|
||||
if alert_cols:
|
||||
f.write("STRUCTURAL ALERTS:\n")
|
||||
f.write("-" * 40 + "\n")
|
||||
if "has_common_alerts" in df.columns:
|
||||
n_clean = (~df["has_common_alerts"]).sum()
|
||||
pct = 100 * n_clean / len(df)
|
||||
f.write(f" No common alerts: {n_clean} ({pct:.1f}%)\n")
|
||||
if "passes_nibr" in df.columns:
|
||||
n_pass = df["passes_nibr"].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write(f" Passes NIBR: {n_pass} ({pct:.1f}%)\n")
|
||||
if "passes_lilly" in df.columns:
|
||||
n_pass = df["passes_lilly"].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write(f" Passes Lilly: {n_pass} ({pct:.1f}%)\n")
|
||||
avg_demerits = df["lilly_demerits"].mean()
|
||||
f.write(f" Average Lilly demerits: {avg_demerits:.1f}\n")
|
||||
if "passes_pains" in df.columns:
|
||||
n_pass = df["passes_pains"].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write(f" Passes PAINS: {n_pass} ({pct:.1f}%)\n")
|
||||
f.write("\n")
|
||||
|
||||
# Complexity
|
||||
if "complexity_score" in df.columns:
|
||||
f.write("COMPLEXITY:\n")
|
||||
f.write("-" * 40 + "\n")
|
||||
avg_complexity = df["complexity_score"].mean()
|
||||
f.write(f" Average complexity: {avg_complexity:.1f}\n")
|
||||
if "passes_complexity" in df.columns:
|
||||
n_pass = df["passes_complexity"].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write(f" Within threshold: {n_pass} ({pct:.1f}%)\n")
|
||||
f.write("\n")
|
||||
|
||||
# Constraints
|
||||
if "passes_constraints" in df.columns:
|
||||
f.write("CONSTRAINTS:\n")
|
||||
f.write("-" * 40 + "\n")
|
||||
n_pass = df["passes_constraints"].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write(f" Passes all constraints: {n_pass} ({pct:.1f}%)\n")
|
||||
f.write("\n")
|
||||
|
||||
# Overall pass rate
|
||||
pass_cols = [col for col in df.columns if col.startswith("passes_")]
|
||||
if pass_cols:
|
||||
df["passes_all_filters"] = df[pass_cols].all(axis=1)
|
||||
n_pass = df["passes_all_filters"].sum()
|
||||
pct = 100 * n_pass / len(df)
|
||||
f.write("OVERALL:\n")
|
||||
f.write("-" * 40 + "\n")
|
||||
f.write(f" Molecules passing all filters: {n_pass} ({pct:.1f}%)\n")
|
||||
|
||||
f.write("\n" + "=" * 80 + "\n")
|
||||
|
||||
print(f"\nSummary report saved to: {summary_file}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Batch molecular filtering using medchem",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog=__doc__
|
||||
)
|
||||
|
||||
# Input/Output
|
||||
parser.add_argument("input", type=Path, help="Input file (CSV, TSV, SDF, or TXT)")
|
||||
parser.add_argument("--output", "-o", type=Path, required=True, help="Output CSV file")
|
||||
parser.add_argument("--smiles-column", default="smiles", help="Name of SMILES column (default: smiles)")
|
||||
|
||||
# Rule filters
|
||||
parser.add_argument("--rules", help="Comma-separated list of rules (e.g., rule_of_five,rule_of_cns)")
|
||||
|
||||
# Structural alerts
|
||||
parser.add_argument("--common-alerts", action="store_true", help="Apply common structural alerts")
|
||||
parser.add_argument("--nibr", action="store_true", help="Apply NIBR filters")
|
||||
parser.add_argument("--lilly", action="store_true", help="Apply Lilly demerits filter")
|
||||
parser.add_argument("--pains", action="store_true", help="Apply PAINS filter")
|
||||
|
||||
# Complexity
|
||||
parser.add_argument("--complexity", type=float, help="Maximum complexity threshold")
|
||||
parser.add_argument("--complexity-method", default="bertz", choices=["bertz", "whitlock", "barone"],
|
||||
help="Complexity calculation method")
|
||||
|
||||
# Constraints
|
||||
parser.add_argument("--mw-range", help="Molecular weight range (e.g., 200,500)")
|
||||
parser.add_argument("--logp-range", help="LogP range (e.g., -2,5)")
|
||||
parser.add_argument("--tpsa-max", type=float, help="Maximum TPSA")
|
||||
parser.add_argument("--hbd-max", type=int, help="Maximum H-bond donors")
|
||||
parser.add_argument("--hba-max", type=int, help="Maximum H-bond acceptors")
|
||||
parser.add_argument("--rotatable-bonds-max", type=int, help="Maximum rotatable bonds")
|
||||
|
||||
# Chemical groups
|
||||
parser.add_argument("--groups", help="Comma-separated chemical groups to detect")
|
||||
|
||||
# Processing options
|
||||
parser.add_argument("--n-jobs", type=int, default=-1, help="Number of parallel jobs (-1 = all cores)")
|
||||
parser.add_argument("--no-summary", action="store_true", help="Don't generate summary report")
|
||||
parser.add_argument("--filter-output", action="store_true", help="Only output molecules passing all filters")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load molecules
|
||||
df, mols = load_molecules(args.input, args.smiles_column)
|
||||
|
||||
# Apply filters
|
||||
result_dfs = [df]
|
||||
|
||||
# Rules
|
||||
if args.rules:
|
||||
rule_list = [r.strip() for r in args.rules.split(",")]
|
||||
df_rules = apply_rule_filters(mols, rule_list, args.n_jobs)
|
||||
result_dfs.append(df_rules)
|
||||
|
||||
# Structural alerts
|
||||
if args.common_alerts:
|
||||
df_alerts = apply_structural_alerts(mols, "common", args.n_jobs)
|
||||
result_dfs.append(df_alerts)
|
||||
|
||||
if args.nibr:
|
||||
df_nibr = apply_structural_alerts(mols, "nibr", args.n_jobs)
|
||||
result_dfs.append(df_nibr)
|
||||
|
||||
if args.lilly:
|
||||
df_lilly = apply_structural_alerts(mols, "lilly", args.n_jobs)
|
||||
result_dfs.append(df_lilly)
|
||||
|
||||
if args.pains:
|
||||
df_pains = apply_structural_alerts(mols, "pains", args.n_jobs)
|
||||
result_dfs.append(df_pains)
|
||||
|
||||
# Complexity
|
||||
if args.complexity is not None:
|
||||
df_complexity = apply_complexity_filter(mols, args.complexity, args.complexity_method)
|
||||
result_dfs.append(df_complexity)
|
||||
|
||||
# Constraints
|
||||
constraints = {}
|
||||
if args.mw_range:
|
||||
mw_min, mw_max = map(float, args.mw_range.split(","))
|
||||
constraints["mw_range"] = (mw_min, mw_max)
|
||||
if args.logp_range:
|
||||
logp_min, logp_max = map(float, args.logp_range.split(","))
|
||||
constraints["logp_range"] = (logp_min, logp_max)
|
||||
if args.tpsa_max is not None:
|
||||
constraints["tpsa_max"] = args.tpsa_max
|
||||
if args.hbd_max is not None:
|
||||
constraints["hbd_max"] = args.hbd_max
|
||||
if args.hba_max is not None:
|
||||
constraints["hba_max"] = args.hba_max
|
||||
if args.rotatable_bonds_max is not None:
|
||||
constraints["rotatable_bonds_max"] = args.rotatable_bonds_max
|
||||
|
||||
if constraints:
|
||||
df_constraints = apply_constraints(mols, constraints, args.n_jobs)
|
||||
result_dfs.append(df_constraints)
|
||||
|
||||
# Chemical groups
|
||||
if args.groups:
|
||||
group_list = [g.strip() for g in args.groups.split(",")]
|
||||
df_groups = apply_chemical_groups(mols, group_list)
|
||||
result_dfs.append(df_groups)
|
||||
|
||||
# Combine results
|
||||
df_final = pd.concat(result_dfs, axis=1)
|
||||
|
||||
# Filter output if requested
|
||||
if args.filter_output:
|
||||
pass_cols = [col for col in df_final.columns if col.startswith("passes_")]
|
||||
if pass_cols:
|
||||
df_final["passes_all"] = df_final[pass_cols].all(axis=1)
|
||||
df_final = df_final[df_final["passes_all"]]
|
||||
print(f"\nFiltered to {len(df_final)} molecules passing all filters")
|
||||
|
||||
# Save results
|
||||
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||
df_final.to_csv(args.output, index=False)
|
||||
print(f"\nResults saved to: {args.output}")
|
||||
|
||||
# Generate summary
|
||||
if not args.no_summary:
|
||||
generate_summary(df_final, args.output)
|
||||
|
||||
print("\nDone!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
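# Example invocation (illustrative only: the script filename and file paths are
# placeholders, but every flag below corresponds to an option defined in main() above):
#
#   python medchem_batch_filter.py compounds.csv \
#       --output results/compounds_filtered.csv \
#       --rules rule_of_five,rule_of_cns \
#       --common-alerts --nibr \
#       --complexity 1500 --complexity-method bertz \
#       --mw-range 200,500 --logp-range -2,5 \
#       --filter-output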
|
||||
516
scientific-packages/molfeat/SKILL.md
Normal file
@@ -0,0 +1,516 @@
|
||||
---
|
||||
name: molfeat
|
||||
description: Comprehensive molecular featurization toolkit for converting chemical structures into numerical representations for machine learning. Use this skill when working with molecular data, SMILES strings, chemical fingerprints, molecular descriptors, or building QSAR/QSPR models. Provides access to 100+ featurizers including traditional fingerprints (ECFP, MACCS), molecular descriptors (RDKit, Mordred), and pretrained deep learning models (ChemBERTa, ChemGPT, GNN models) for cheminformatics and drug discovery tasks.
|
||||
---
|
||||
|
||||
# Molfeat - Molecular Featurization Hub
|
||||
|
||||
## Overview
|
||||
|
||||
Molfeat is a comprehensive Python library for molecular featurization that unifies pre-trained embeddings and hand-crafted featurizers into a single, fast, and user-friendly package. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations suitable for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications.
|
||||
|
||||
**Key Capabilities:**
|
||||
- 100+ featurizers including fingerprints, descriptors, and pretrained models
|
||||
- Fast parallel processing with simple API
|
||||
- Scikit-learn compatible transformers
|
||||
- Built-in caching and state persistence
|
||||
- Integration with PyTorch, TensorFlow, and graph neural networks
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Apply molfeat when working with:
|
||||
- **Molecular machine learning**: Building QSAR/QSPR models, property prediction
|
||||
- **Virtual screening**: Ranking compound libraries for biological activity
|
||||
- **Similarity searching**: Finding structurally similar molecules
|
||||
- **Chemical space analysis**: Clustering, visualization, dimensionality reduction
|
||||
- **Deep learning**: Training neural networks on molecular data
|
||||
- **Featurization pipelines**: Converting SMILES to ML-ready representations
|
||||
- **Cheminformatics**: Any task requiring molecular feature extraction
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
# Recommended: Using conda/mamba
|
||||
mamba install -c conda-forge molfeat
|
||||
|
||||
# Alternative: Using pip
|
||||
pip install molfeat
|
||||
|
||||
# With all optional dependencies
|
||||
pip install "molfeat[all]"
|
||||
```
|
||||
|
||||
**Optional dependencies for specific featurizers:**
|
||||
- `molfeat[dgl]` - GNN models (GIN variants)
|
||||
- `molfeat[graphormer]` - Graphormer models
|
||||
- `molfeat[transformer]` - ChemBERTa, ChemGPT, MolT5
|
||||
- `molfeat[fcd]` - FCD descriptors
|
||||
- `molfeat[map4]` - MAP4 fingerprints
|
||||
|
||||
## Core Concepts
|
||||
|
||||
Molfeat organizes featurization into three hierarchical classes:
|
||||
|
||||
### 1. Calculators (`molfeat.calc`)
|
||||
|
||||
Callable objects that convert individual molecules into feature vectors. Accept RDKit `Chem.Mol` objects or SMILES strings.
|
||||
|
||||
**Use calculators for:**
|
||||
- Single molecule featurization
|
||||
- Custom processing loops
|
||||
- Direct feature computation
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
|
||||
features = calc("CCO") # Returns numpy array (2048,)
|
||||
```
|
||||
|
||||
### 2. Transformers (`molfeat.trans`)
|
||||
|
||||
Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
|
||||
|
||||
**Use transformers for:**
|
||||
- Batch featurization of molecular datasets
|
||||
- Integration with scikit-learn pipelines
|
||||
- Parallel processing (automatic CPU utilization)
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
features = transformer(smiles_list) # Parallel processing
|
||||
```
|
||||
|
||||
### 3. Pretrained Transformers (`molfeat.trans.pretrained`)
|
||||
|
||||
Specialized transformers for deep learning models with batched inference and caching.
|
||||
|
||||
**Use pretrained transformers for:**
|
||||
- State-of-the-art molecular embeddings
|
||||
- Transfer learning from large chemical datasets
|
||||
- Deep learning feature extraction
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from molfeat.trans.pretrained import PretrainedMolTransformer
|
||||
|
||||
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
|
||||
embeddings = transformer(smiles_list) # Deep learning embeddings
|
||||
```
|
||||
|
||||
## Quick Start Workflow
|
||||
|
||||
### Basic Featurization
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
from molfeat.calc import FPCalculator
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
|
||||
# Load molecular data
|
||||
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]
|
||||
|
||||
# Create calculator and transformer
|
||||
calc = FPCalculator("ecfp", radius=3)
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
|
||||
# Featurize molecules
|
||||
features = transformer(smiles)
|
||||
print(f"Shape: {features.shape}") # (4, 2048)
|
||||
```
|
||||
|
||||
### Save and Load Configuration
|
||||
|
||||
```python
|
||||
# Save featurizer configuration for reproducibility
|
||||
transformer.to_state_yaml_file("featurizer_config.yml")
|
||||
|
||||
# Reload exact configuration
|
||||
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
|
||||
```
|
||||
|
||||
### Handle Errors Gracefully
|
||||
|
||||
```python
|
||||
# Process dataset with potentially invalid SMILES
|
||||
transformer = MoleculeTransformer(
|
||||
calc,
|
||||
n_jobs=-1,
|
||||
ignore_errors=True, # Continue on failures
|
||||
verbose=True # Log error details
|
||||
)
|
||||
|
||||
features = transformer(smiles_with_errors)
|
||||
# Returns None for failed molecules
|
||||
```
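
Because failed molecules come back as `None`, keep the SMILES and feature arrays aligned before model training. A minimal sketch, assuming the list-like output described above:

```python
# Drop failed molecules while keeping SMILES and features in sync
valid = [(smi, feat) for smi, feat in zip(smiles_with_errors, features) if feat is not None]
valid_smiles = [smi for smi, _ in valid]
valid_features = [feat for _, feat in valid]
```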
|
||||
|
||||
## Choosing the Right Featurizer
|
||||
|
||||
### For Traditional Machine Learning (RF, SVM, XGBoost)
|
||||
|
||||
**Start with fingerprints:**
|
||||
```python
|
||||
# ECFP - Most popular, general-purpose
|
||||
FPCalculator("ecfp", radius=3, fpSize=2048)
|
||||
|
||||
# MACCS - Fast, good for scaffold hopping
|
||||
FPCalculator("maccs")
|
||||
|
||||
# MAP4 - Efficient for large-scale screening
|
||||
FPCalculator("map4")
|
||||
```
|
||||
|
||||
**For interpretable models:**
|
||||
```python
|
||||
# RDKit 2D descriptors (200+ named properties)
|
||||
from molfeat.calc import RDKitDescriptors2D
|
||||
RDKitDescriptors2D()
|
||||
|
||||
# Mordred (1800+ comprehensive descriptors)
|
||||
from molfeat.calc import MordredDescriptors
|
||||
MordredDescriptors()
|
||||
```
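
Because these descriptor calculators expose named features through their `columns` property, they pair naturally with model feature importances. A short sketch (assuming `train_smiles` and `y_train` are already defined):

```python
import numpy as np
from molfeat.calc import RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestRegressor

calc = RDKitDescriptors2D()
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Guard against NaN/inf values that some descriptors can produce
X = np.nan_to_num(np.asarray(transformer(train_smiles), dtype=float))
model = RandomForestRegressor(n_estimators=100).fit(X, y_train)

# Rank named descriptors by importance for a quick interpretability check
ranked = sorted(zip(calc.columns, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
print(ranked[:10])
```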
|
||||
|
||||
**Combine multiple featurizers:**
|
||||
```python
|
||||
from molfeat.trans import FeatConcat
|
||||
|
||||
concat = FeatConcat([
|
||||
FPCalculator("maccs"), # 167 dimensions
|
||||
FPCalculator("ecfp") # 2048 dimensions
|
||||
]) # Result: 2215-dimensional combined features
|
||||
```
|
||||
|
||||
### For Deep Learning
|
||||
|
||||
**Transformer-based embeddings:**
|
||||
```python
|
||||
# ChemBERTa - Pre-trained on 77M PubChem compounds
|
||||
PretrainedMolTransformer("ChemBERTa-77M-MLM")
|
||||
|
||||
# ChemGPT - Autoregressive language model
|
||||
PretrainedMolTransformer("ChemGPT-1.2B")
|
||||
```
|
||||
|
||||
**Graph neural networks:**
|
||||
```python
|
||||
# GIN models with different pre-training objectives
|
||||
PretrainedMolTransformer("gin-supervised-masking")
|
||||
PretrainedMolTransformer("gin-supervised-infomax")
|
||||
|
||||
# Graphormer for quantum chemistry
|
||||
PretrainedMolTransformer("Graphormer-pcqm4mv2")
|
||||
```
|
||||
|
||||
### For Similarity Searching
|
||||
|
||||
```python
|
||||
# ECFP - General purpose, most widely used
|
||||
FPCalculator("ecfp")
|
||||
|
||||
# MACCS - Fast, scaffold-based similarity
|
||||
FPCalculator("maccs")
|
||||
|
||||
# MAP4 - Efficient for large databases
|
||||
FPCalculator("map4")
|
||||
|
||||
# USR/USRCAT - 3D shape similarity
|
||||
from molfeat.calc import USRDescriptors
|
||||
USRDescriptors()
|
||||
```
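
For binary fingerprints such as ECFP or MACCS, Tanimoto similarity is the customary metric. A minimal sketch on the numpy arrays returned by `FPCalculator` (the two molecules are arbitrary examples):

```python
import numpy as np
from molfeat.calc import FPCalculator

calc = FPCalculator("ecfp")
fp_a = calc("CC(=O)Oc1ccccc1C(=O)O").astype(bool)  # aspirin
fp_b = calc("Cc1ccccc1C(=O)O").astype(bool)        # 2-methylbenzoic acid

# Tanimoto = |intersection| / |union| of the "on" bits
tanimoto = np.logical_and(fp_a, fp_b).sum() / np.logical_or(fp_a, fp_b).sum()
print(f"Tanimoto similarity: {tanimoto:.3f}")
```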
|
||||
|
||||
### For Pharmacophore-Based Approaches
|
||||
|
||||
```python
|
||||
# FCFP - Functional group based
|
||||
FPCalculator("fcfp")
|
||||
|
||||
# CATS - Pharmacophore pair distributions
|
||||
from molfeat.calc import CATSCalculator
|
||||
CATSCalculator(mode="2D")
|
||||
|
||||
# Gobbi - Explicit pharmacophore features
|
||||
FPCalculator("gobbi2D")
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Building a QSAR Model
|
||||
|
||||
```python
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
from sklearn.model_selection import cross_val_score
|
||||
|
||||
# Featurize molecules
|
||||
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
X = transformer(smiles_train)
|
||||
|
||||
# Train model
|
||||
model = RandomForestRegressor(n_estimators=100)
|
||||
scores = cross_val_score(model, X, y_train, cv=5)
|
||||
print(f"R² = {scores.mean():.3f}")
|
||||
|
||||
# Save configuration for deployment
|
||||
transformer.to_state_yaml_file("production_featurizer.yml")
|
||||
```
|
||||
|
||||
### Virtual Screening Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Train on known actives/inactives
|
||||
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
X_train = transformer(train_smiles)
|
||||
clf = RandomForestClassifier(n_estimators=500)
|
||||
clf.fit(X_train, train_labels)
|
||||
|
||||
# Screen large library
|
||||
X_screen = transformer(screening_library) # e.g., 1M compounds
|
||||
predictions = clf.predict_proba(X_screen)[:, 1]
|
||||
|
||||
# Rank and select top hits
|
||||
top_indices = predictions.argsort()[::-1][:1000]
|
||||
top_hits = [screening_library[i] for i in top_indices]
|
||||
```
|
||||
|
||||
### Similarity Search
|
||||
|
||||
```python
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
|
||||
# Query molecule
|
||||
calc = FPCalculator("ecfp")
|
||||
query_fp = calc(query_smiles).reshape(1, -1)
|
||||
|
||||
# Database fingerprints
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
database_fps = transformer(database_smiles)
|
||||
|
||||
# Compute similarity
|
||||
similarities = cosine_similarity(query_fp, database_fps)[0]
|
||||
top_similar = similarities.argsort()[-10:][::-1]
|
||||
```
|
||||
|
||||
### Scikit-learn Pipeline Integration
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Create end-to-end pipeline
|
||||
pipeline = Pipeline([
|
||||
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
|
||||
('classifier', RandomForestClassifier(n_estimators=100))
|
||||
])
|
||||
|
||||
# Train and predict directly on SMILES
|
||||
pipeline.fit(smiles_train, y_train)
|
||||
predictions = pipeline.predict(smiles_test)
|
||||
```
|
||||
|
||||
### Comparing Multiple Featurizers
|
||||
|
||||
```python
|
||||
featurizers = {
|
||||
'ECFP': FPCalculator("ecfp"),
|
||||
'MACCS': FPCalculator("maccs"),
|
||||
'Descriptors': RDKitDescriptors2D(),
|
||||
'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")
|
||||
}
|
||||
|
||||
results = {}
|
||||
for name, feat in featurizers.items():
|
||||
transformer = MoleculeTransformer(feat, n_jobs=-1)
|
||||
X = transformer(smiles)
|
||||
# Evaluate with your ML model
|
||||
score = evaluate_model(X, y)
|
||||
results[name] = score
|
||||
```
|
||||
|
||||
## Discovering Available Featurizers
|
||||
|
||||
Use the ModelStore to explore all available featurizers:
|
||||
|
||||
```python
|
||||
from molfeat.store.modelstore import ModelStore
|
||||
|
||||
store = ModelStore()
|
||||
|
||||
# List all available models
|
||||
all_models = store.available_models
|
||||
print(f"Total featurizers: {len(all_models)}")
|
||||
|
||||
# Search for specific models
|
||||
chemberta_models = store.search(name="ChemBERTa")
|
||||
for model in chemberta_models:
|
||||
print(f"- {model.name}: {model.description}")
|
||||
|
||||
# Get usage information
|
||||
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
|
||||
model_card.usage() # Display usage examples
|
||||
|
||||
# Load model
|
||||
transformer = store.load("ChemBERTa-77M-MLM")
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Custom Preprocessing
|
||||
|
||||
```python
|
||||
class CustomTransformer(MoleculeTransformer):
|
||||
def preprocess(self, mol):
|
||||
"""Custom preprocessing pipeline"""
|
||||
if isinstance(mol, str):
|
||||
mol = dm.to_mol(mol)
|
||||
mol = dm.standardize_mol(mol)
|
||||
mol = dm.remove_salts(mol)
|
||||
return mol
|
||||
|
||||
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
```
|
||||
|
||||
### Batch Processing Large Datasets
|
||||
|
||||
```python
|
||||
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
|
||||
"""Process large datasets in chunks to manage memory"""
|
||||
all_features = []
|
||||
for i in range(0, len(smiles_list), chunk_size):
|
||||
chunk = smiles_list[i:i+chunk_size]
|
||||
features = transformer(chunk)
|
||||
all_features.append(features)
|
||||
return np.vstack(all_features)
|
||||
```
|
||||
|
||||
### Caching Expensive Embeddings
|
||||
|
||||
```python
|
||||
import pickle
|
||||
|
||||
cache_file = "embeddings_cache.pkl"
|
||||
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
|
||||
|
||||
try:
|
||||
with open(cache_file, "rb") as f:
|
||||
embeddings = pickle.load(f)
|
||||
except FileNotFoundError:
|
||||
embeddings = transformer(smiles_list)
|
||||
with open(cache_file, "wb") as f:
|
||||
pickle.dump(embeddings, f)
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Use parallelization**: Set `n_jobs=-1` to utilize all CPU cores
|
||||
2. **Batch processing**: Process multiple molecules at once instead of loops
|
||||
3. **Choose appropriate featurizers**: Fingerprints are faster than deep learning models
|
||||
4. **Cache pretrained models**: Leverage built-in caching for repeated use
|
||||
5. **Use float32**: Set `dtype=np.float32` when precision allows
|
||||
6. **Handle errors efficiently**: Use `ignore_errors=True` for large datasets
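
A minimal sketch combining several of these tips (parallel jobs, float32 output, and error tolerance); `smiles_list` is assumed to already be an in-memory list of SMILES:

```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,            # tip 1: use all CPU cores
    dtype=np.float32,     # tip 5: halve memory relative to float64
    ignore_errors=True,   # tip 6: skip unparsable SMILES instead of raising
)
features = transformer(smiles_list)  # tip 2: one batched call instead of a Python loop
```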
|
||||
|
||||
## Common Featurizers Reference
|
||||
|
||||
**Quick reference for frequently used featurizers:**
|
||||
|
||||
| Featurizer | Type | Dimensions | Speed | Use Case |
|
||||
|------------|------|------------|-------|----------|
|
||||
| `ecfp` | Fingerprint | 2048 | Fast | General purpose |
|
||||
| `maccs` | Fingerprint | 167 | Very fast | Scaffold similarity |
|
||||
| `desc2D` | Descriptors | 200+ | Fast | Interpretable models |
|
||||
| `mordred` | Descriptors | 1800+ | Medium | Comprehensive features |
|
||||
| `map4` | Fingerprint | 1024 | Fast | Large-scale screening |
|
||||
| `ChemBERTa-77M-MLM` | Deep learning | 768 | Slow* | Transfer learning |
|
||||
| `gin-supervised-masking` | GNN | Variable | Slow* | Graph-based models |
|
||||
|
||||
*First run is slow; subsequent runs benefit from caching
|
||||
|
||||
## Resources
|
||||
|
||||
This skill includes comprehensive reference documentation:
|
||||
|
||||
### references/api_reference.md
|
||||
Complete API documentation covering:
|
||||
- `molfeat.calc` - All calculator classes and parameters
|
||||
- `molfeat.trans` - Transformer classes and methods
|
||||
- `molfeat.store` - ModelStore usage
|
||||
- Common patterns and integration examples
|
||||
- Performance optimization tips
|
||||
|
||||
**When to load:** Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.
|
||||
|
||||
### references/available_featurizers.md
|
||||
Comprehensive catalog of all 100+ featurizers organized by category:
|
||||
- Transformer-based language models (ChemBERTa, ChemGPT)
|
||||
- Graph neural networks (GIN, Graphormer)
|
||||
- Molecular descriptors (RDKit, Mordred)
|
||||
- Fingerprints (ECFP, MACCS, MAP4, and 15+ others)
|
||||
- Pharmacophore descriptors (CATS, Gobbi)
|
||||
- Shape descriptors (USR, ElectroShape)
|
||||
- Scaffold-based descriptors
|
||||
|
||||
**When to load:** Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.
|
||||
|
||||
**Search tip:** Use grep to find specific featurizer types:
|
||||
```bash
|
||||
grep -i "chembert" references/available_featurizers.md
|
||||
grep -i "pharmacophore" references/available_featurizers.md
|
||||
```
|
||||
|
||||
### references/examples.md
|
||||
Practical code examples for common scenarios:
|
||||
- Installation and quick start
|
||||
- Calculator and transformer examples
|
||||
- Pretrained model usage
|
||||
- Scikit-learn and PyTorch integration
|
||||
- Virtual screening workflows
|
||||
- QSAR model building
|
||||
- Similarity searching
|
||||
- Troubleshooting and best practices
|
||||
|
||||
**When to load:** Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Invalid Molecules
|
||||
Enable error handling to skip invalid SMILES:
|
||||
```python
|
||||
transformer = MoleculeTransformer(
|
||||
calc,
|
||||
ignore_errors=True,
|
||||
verbose=True
|
||||
)
|
||||
```
|
||||
|
||||
### Memory Issues with Large Datasets
|
||||
Process in chunks or use streaming approaches for datasets > 100K molecules.
|
||||
|
||||
### Pretrained Model Dependencies
|
||||
Some models require additional packages. Install specific extras:
|
||||
```bash
|
||||
pip install "molfeat[transformer]" # For ChemBERTa/ChemGPT
|
||||
pip install "molfeat[dgl]" # For GIN models
|
||||
```
|
||||
|
||||
### Reproducibility
|
||||
Save exact configurations and document versions:
|
||||
```python
|
||||
transformer.to_state_yaml_file("config.yml")
|
||||
import molfeat
|
||||
print(f"molfeat version: {molfeat.__version__}")
|
||||
```
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Official Documentation**: https://molfeat-docs.datamol.io/
|
||||
- **GitHub Repository**: https://github.com/datamol-io/molfeat
|
||||
- **PyPI Package**: https://pypi.org/project/molfeat/
|
||||
- **Tutorial**: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6
|
||||
428
scientific-packages/molfeat/references/api_reference.md
Normal file
@@ -0,0 +1,428 @@
|
||||
# Molfeat API Reference
|
||||
|
||||
## Core Modules
|
||||
|
||||
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
|
||||
|
||||
- **`molfeat.store`** - Manages model loading, listing, and registration
|
||||
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
|
||||
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
|
||||
- **`molfeat.utils`** - Utility functions for data handling
|
||||
- **`molfeat.viz`** - Visualization tools for molecular features
|
||||
|
||||
---
|
||||
|
||||
## molfeat.calc - Calculators
|
||||
|
||||
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
|
||||
|
||||
### SerializableCalculator (Base Class)
|
||||
|
||||
Base abstract class for all calculators. When subclassing, must implement:
|
||||
- `__call__()` - Required method for featurization
|
||||
- `__len__()` - Optional, returns output length
|
||||
- `columns` - Optional property, returns feature names
|
||||
- `batch_compute()` - Optional, for efficient batch processing
|
||||
|
||||
**State Management Methods:**
|
||||
- `to_state_json()` - Save calculator state as JSON
|
||||
- `to_state_yaml()` - Save calculator state as YAML
|
||||
- `from_state_dict()` - Load calculator from state dictionary
|
||||
- `to_state_dict()` - Export calculator state as dictionary
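
As an illustration only (the calculator below is a hypothetical example, not part of molfeat), a minimal subclass needs little more than `__call__`:

```python
import numpy as np
import datamol as dm
from molfeat.calc import SerializableCalculator

class HeavyAtomCountCalculator(SerializableCalculator):
    """Toy calculator returning a single feature: the heavy-atom count."""

    def __call__(self, mol, **kwargs):
        # Accept SMILES strings or RDKit molecules, as molfeat calculators do
        mol = dm.to_mol(mol) if isinstance(mol, str) else mol
        return np.array([mol.GetNumHeavyAtoms()], dtype=float)

    def __len__(self):
        return 1

    @property
    def columns(self):
        return ["num_heavy_atoms"]

calc = HeavyAtomCountCalculator()
print(calc("CCO"))  # [3.]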
|
||||
|
||||
### FPCalculator
|
||||
|
||||
Computes molecular fingerprints. Supports 15+ fingerprint methods.
|
||||
|
||||
**Supported Fingerprint Types:**
|
||||
|
||||
**Structural Fingerprints:**
|
||||
- `ecfp` - Extended-connectivity fingerprints (circular)
|
||||
- `fcfp` - Functional-class fingerprints
|
||||
- `rdkit` - RDKit topological fingerprints
|
||||
- `maccs` - MACCS keys (166-bit structural keys)
|
||||
- `avalon` - Avalon fingerprints
|
||||
- `pattern` - Pattern fingerprints
|
||||
- `layered` - Layered fingerprints
|
||||
|
||||
**Atom-based Fingerprints:**
|
||||
- `atompair` - Atom pair fingerprints
|
||||
- `atompair-count` - Counted atom pairs
|
||||
- `topological` - Topological torsion fingerprints
|
||||
- `topological-count` - Counted topological torsions
|
||||
|
||||
**Specialized Fingerprints:**
|
||||
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
|
||||
- `secfp` - SMILES extended connectivity fingerprint
|
||||
- `erg` - Extended reduced graphs
|
||||
- `estate` - Electrotopological state indices
|
||||
|
||||
**Parameters:**
|
||||
- `method` (str) - Fingerprint type name
|
||||
- `radius` (int) - Radius for circular fingerprints (default: 3)
|
||||
- `fpSize` (int) - Fingerprint size (default: 2048)
|
||||
- `includeChirality` (bool) - Include chirality information
|
||||
- `counting` (bool) - Use count vectors instead of binary
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Create fingerprint calculator
|
||||
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
|
||||
|
||||
# Compute fingerprint for single molecule
|
||||
fp = calc("CCO") # Returns numpy array
|
||||
|
||||
# Get fingerprint length
|
||||
length = len(calc) # 2048
|
||||
|
||||
# Get feature names
|
||||
names = calc.columns
|
||||
```
|
||||
|
||||
**Common Fingerprint Dimensions:**
|
||||
- MACCS: 167 dimensions
|
||||
- ECFP (default): 2048 dimensions
|
||||
- MAP4 (default): 1024 dimensions
|
||||
|
||||
### Descriptor Calculators
|
||||
|
||||
**RDKitDescriptors2D**
|
||||
Computes 2D molecular descriptors using RDKit.
|
||||
|
||||
```python
|
||||
from molfeat.calc import RDKitDescriptors2D
|
||||
|
||||
calc = RDKitDescriptors2D()
|
||||
descriptors = calc("CCO") # Returns 200+ descriptors
|
||||
```
|
||||
|
||||
**RDKitDescriptors3D**
|
||||
Computes 3D molecular descriptors (requires conformer generation).
|
||||
|
||||
**MordredDescriptors**
|
||||
Calculates over 1800 molecular descriptors using Mordred.
|
||||
|
||||
```python
|
||||
from molfeat.calc import MordredDescriptors
|
||||
|
||||
calc = MordredDescriptors()
|
||||
descriptors = calc("CCO")
|
||||
```
|
||||
|
||||
### Pharmacophore Calculators
|
||||
|
||||
**Pharmacophore2D**
|
||||
RDKit's 2D pharmacophore fingerprint generation.
|
||||
|
||||
**Pharmacophore3D**
|
||||
Consensus pharmacophore fingerprints from multiple conformers.
|
||||
|
||||
**CATSCalculator**
|
||||
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
|
||||
|
||||
**Parameters:**
|
||||
- `mode` - "2D" or "3D" distance calculations
|
||||
- `dist_bins` - Distance bins for pair distributions
|
||||
- `scale` - Scaling mode: "raw", "num", or "count"
|
||||
|
||||
```python
|
||||
from molfeat.calc import CATSCalculator
|
||||
|
||||
calc = CATSCalculator(mode="2D", scale="raw")
|
||||
cats = calc("CCO") # Returns 21 descriptors by default
|
||||
```
|
||||
|
||||
### Shape Descriptors
|
||||
|
||||
**USRDescriptors**
|
||||
Ultrafast shape recognition descriptors (multiple variants).
|
||||
|
||||
**ElectroShapeDescriptors**
|
||||
Electrostatic shape descriptors combining shape, chirality, and electrostatics.
|
||||
|
||||
### Graph-Based Calculators
|
||||
|
||||
**ScaffoldKeyCalculator**
|
||||
Computes 40+ scaffold-based molecular properties.
|
||||
|
||||
**AtomCalculator**
|
||||
Atom-level featurization for graph neural networks.
|
||||
|
||||
**BondCalculator**
|
||||
Bond-level featurization for graph neural networks.
|
||||
|
||||
### Utility Function
|
||||
|
||||
**get_calculator()**
|
||||
Factory function to instantiate calculators by name.
|
||||
|
||||
```python
|
||||
from molfeat.calc import get_calculator
|
||||
|
||||
# Instantiate any calculator by name
|
||||
calc = get_calculator("ecfp", radius=3)
|
||||
calc = get_calculator("maccs")
|
||||
calc = get_calculator("desc2D")
|
||||
```
|
||||
|
||||
Raises `ValueError` for unsupported featurizers.
|
||||
|
||||
---
|
||||
|
||||
## molfeat.trans - Transformers
|
||||
|
||||
Transformers wrap calculators into complete featurization pipelines for batch processing.
|
||||
|
||||
### MoleculeTransformer
|
||||
|
||||
Scikit-learn compatible transformer for batch molecular featurization.
|
||||
|
||||
**Key Parameters:**
|
||||
- `featurizer` - Calculator or featurizer to use
|
||||
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
|
||||
- `dtype` - Output data type (numpy float32/64, torch tensors)
|
||||
- `verbose` (bool) - Enable verbose logging
|
||||
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)
|
||||
|
||||
**Essential Methods:**
|
||||
- `transform(mols)` - Processes batches and returns representations
|
||||
- `_transform(mol)` - Handles individual molecule featurization
|
||||
- `__call__(mols)` - Convenience wrapper around transform()
|
||||
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
|
||||
- `to_state_yaml_file(path)` - Save transformer configuration
|
||||
- `from_state_yaml_file(path)` - Load transformer configuration
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from molfeat.calc import FPCalculator
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
import datamol as dm
|
||||
|
||||
# Load molecules
|
||||
smiles = dm.data.freesolv().sample(100).smiles.values
|
||||
|
||||
# Create transformer
|
||||
calc = FPCalculator("ecfp")
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
|
||||
# Featurize batch
|
||||
features = transformer(smiles) # Returns numpy array (100, 2048)
|
||||
|
||||
# Save configuration
|
||||
transformer.to_state_yaml_file("ecfp_config.yml")
|
||||
|
||||
# Reload
|
||||
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
|
||||
```
|
||||
|
||||
**Performance:** Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
|
||||
|
||||
### FeatConcat
|
||||
|
||||
Concatenates multiple featurizers into unified representations.
|
||||
|
||||
```python
|
||||
from molfeat.trans import FeatConcat
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Combine multiple fingerprints
|
||||
concat = FeatConcat([
|
||||
FPCalculator("maccs"), # 167 dimensions
|
||||
FPCalculator("ecfp") # 2048 dimensions
|
||||
])
|
||||
|
||||
# Result: 2215-dimensional features
|
||||
transformer = MoleculeTransformer(concat, n_jobs=-1)
|
||||
features = transformer(smiles)
|
||||
```
|
||||
|
||||
### PretrainedMolTransformer
|
||||
|
||||
Subclass of `MoleculeTransformer` for pre-trained deep learning models.
|
||||
|
||||
**Unique Features:**
|
||||
- `_embed()` - Batched inference for neural networks
|
||||
- `_convert()` - Transforms SMILES/molecules into model-compatible formats
|
||||
- SELFIES strings for language models
|
||||
- DGL graphs for graph neural networks
|
||||
- Integrated caching system for efficient storage
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from molfeat.trans.pretrained import PretrainedMolTransformer
|
||||
|
||||
# Load pretrained model
|
||||
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
|
||||
|
||||
# Generate embeddings
|
||||
embeddings = transformer(smiles)
|
||||
```
|
||||
|
||||
### PrecomputedMolTransformer
|
||||
|
||||
Transformer for cached/precomputed features.
|
||||
|
||||
---
|
||||
|
||||
## molfeat.store - Model Store
|
||||
|
||||
Manages featurizer discovery, loading, and registration.
|
||||
|
||||
### ModelStore
|
||||
|
||||
Central hub for accessing available featurizers.
|
||||
|
||||
**Key Methods:**
|
||||
- `available_models` - Property listing all available featurizers
|
||||
- `search(name=None, **kwargs)` - Search for specific featurizers
|
||||
- `load(name, **kwargs)` - Load a featurizer by name
|
||||
- `register(name, card)` - Register custom featurizer
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from molfeat.store.modelstore import ModelStore
|
||||
|
||||
# Initialize store
|
||||
store = ModelStore()
|
||||
|
||||
# List all available models
|
||||
all_models = store.available_models
|
||||
print(f"Found {len(all_models)} featurizers")
|
||||
|
||||
# Search for specific model
|
||||
results = store.search(name="ChemBERTa-77M-MLM")
|
||||
if results:
|
||||
model_card = results[0]
|
||||
|
||||
# View usage information
|
||||
model_card.usage()
|
||||
|
||||
# Load the model
|
||||
transformer = model_card.load()
|
||||
|
||||
# Direct loading
|
||||
transformer = store.load("ChemBERTa-77M-MLM")
|
||||
```
|
||||
|
||||
**ModelCard Attributes:**
|
||||
- `name` - Model identifier
|
||||
- `description` - Model description
|
||||
- `version` - Model version
|
||||
- `authors` - Model authors
|
||||
- `tags` - Categorization tags
|
||||
- `usage()` - Display usage examples
|
||||
- `load(**kwargs)` - Load the model
|
||||
|
||||
---
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
# Enable error tolerance
|
||||
featurizer = MoleculeTransformer(
|
||||
calc,
|
||||
n_jobs=-1,
|
||||
verbose=True,
|
||||
ignore_errors=True
|
||||
)
|
||||
|
||||
# Failed molecules return None
|
||||
features = featurizer(smiles_with_errors)
|
||||
```
|
||||
|
||||
### Data Type Control
|
||||
|
||||
```python
|
||||
# NumPy float32 (default)
|
||||
features = transformer(smiles, enforce_dtype=True)
|
||||
|
||||
# PyTorch tensors
|
||||
import torch
|
||||
transformer = MoleculeTransformer(calc, dtype=torch.float32)
|
||||
features = transformer(smiles)
|
||||
```
|
||||
|
||||
### Persistence and Reproducibility
|
||||
|
||||
```python
|
||||
# Save transformer state
|
||||
transformer.to_state_yaml_file("config.yml")
|
||||
transformer.to_state_json_file("config.json")
|
||||
|
||||
# Load from saved state
|
||||
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
|
||||
transformer = MoleculeTransformer.from_state_json_file("config.json")
|
||||
```
|
||||
|
||||
### Preprocessing
|
||||
|
||||
```python
|
||||
# Manual preprocessing
|
||||
mol = transformer.preprocess("CCO")
|
||||
|
||||
# Transform with preprocessing
|
||||
features = transformer.transform(smiles_list)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Integration Examples
|
||||
|
||||
### Scikit-learn Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Create pipeline
|
||||
pipeline = Pipeline([
|
||||
('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
|
||||
('classifier', RandomForestClassifier())
|
||||
])
|
||||
|
||||
# Fit and predict
|
||||
pipeline.fit(smiles_train, y_train)
|
||||
predictions = pipeline.predict(smiles_test)
|
||||
```
|
||||
|
||||
### PyTorch Integration
|
||||
|
||||
```python
|
||||
import torch
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
|
||||
class MoleculeDataset(Dataset):
|
||||
def __init__(self, smiles, labels, transformer):
|
||||
self.smiles = smiles
|
||||
self.labels = labels
|
||||
self.transformer = transformer
|
||||
|
||||
def __len__(self):
|
||||
return len(self.smiles)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
features = self.transformer(self.smiles[idx])
|
||||
return torch.tensor(features), torch.tensor(self.labels[idx])
|
||||
|
||||
# Create dataset and dataloader
|
||||
transformer = MoleculeTransformer(FPCalculator("ecfp"))
|
||||
dataset = MoleculeDataset(smiles, labels, transformer)
|
||||
loader = DataLoader(dataset, batch_size=32)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
|
||||
2. **Batch Processing**: Process multiple molecules at once instead of loops
|
||||
3. **Caching**: Leverage built-in caching for pretrained models
|
||||
4. **Data Types**: Use float32 instead of float64 when precision allows
|
||||
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potential invalid molecules
|
||||
333
scientific-packages/molfeat/references/available_featurizers.md
Normal file
@@ -0,0 +1,333 @@
|
||||
# Available Featurizers in Molfeat
|
||||
|
||||
This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.
|
||||
|
||||
## Transformer-Based Language Models
|
||||
|
||||
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
|
||||
|
||||
### RoBERTa-style Models
|
||||
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from ZINC database
|
||||
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
|
||||
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
|
||||
|
||||
### GPT-style Autoregressive Models
|
||||
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
|
||||
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
|
||||
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
|
||||
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
|
||||
|
||||
### Specialized Transformer Models
|
||||
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
|
||||
|
||||
## Graph Neural Networks (GNNs)
|
||||
|
||||
Pre-trained graph neural network models operating on molecular graph structures.
|
||||
|
||||
### GIN (Graph Isomorphism Network) Variants
|
||||
All pre-trained on ChEMBL molecules with different objectives:
|
||||
- **gin-supervised-masking** - Supervised with node masking objective
|
||||
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
|
||||
- **gin-supervised-edgepred** - Supervised with edge prediction objective
|
||||
- **gin-supervised-contextpred** - Supervised with context prediction objective
|
||||
|
||||
### Other Graph-Based Models
|
||||
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
|
||||
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
|
||||
|
||||
## Molecular Descriptors
|
||||
|
||||
Calculators for physico-chemical properties and molecular characteristics.
|
||||
|
||||
### 2D Descriptors
|
||||
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
|
||||
- Molecular weight, logP, TPSA
|
||||
- H-bond donors/acceptors
|
||||
- Rotatable bonds
|
||||
- Ring counts and aromaticity
|
||||
- Molecular complexity metrics
|
||||
|
||||
### 3D Descriptors
|
||||
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
|
||||
- Inertial moments
|
||||
- PMI (Principal Moments of Inertia) ratios
|
||||
- Asphericity, eccentricity
|
||||
- Radius of gyration
|
||||
|
||||
### Comprehensive Descriptor Sets
|
||||
- **mordred** - Over 1800 molecular descriptors covering:
|
||||
- Constitutional descriptors
|
||||
- Topological indices
|
||||
- Connectivity indices
|
||||
- Information content
|
||||
- 2D/3D autocorrelations
|
||||
- WHIM descriptors
|
||||
- GETAWAY descriptors
|
||||
- And many more
|
||||
|
||||
### Electrotopological Descriptors
|
||||
- **estate** - Electrotopological state (E-State) indices encoding:
|
||||
- Atomic environment information
|
||||
- Electronic and topological properties
|
||||
- Heteroatom contributions
|
||||
|
||||
## Molecular Fingerprints
|
||||
|
||||
Binary or count-based fixed-length vectors representing molecular substructures.
|
||||
|
||||
### Circular Fingerprints (ECFP-style)
|
||||
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
|
||||
- The suffix denotes the diameter (e.g., ecfp:4 corresponds to radius 2)
|
||||
- Default: radius=3, 2048 bits
|
||||
- Most popular for similarity searching
|
||||
- **ecfp-count** - Count version of ECFP (non-binary)
|
||||
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
|
||||
- Similar to ECFP but uses functional groups
|
||||
- Better for pharmacophore-based similarity
|
||||
|
||||
### Path-Based Fingerprints
|
||||
- **rdkit** - RDKit topological fingerprints based on linear paths
|
||||
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
|
||||
- **layered** - Layered fingerprints with multiple substructure layers
|
||||
|
||||
### Key-Based Fingerprints
|
||||
- **maccs** - MACCS keys (166-bit structural keys)
|
||||
- Fixed set of predefined substructures
|
||||
- Good for scaffold hopping
|
||||
- Fast computation
|
||||
- **avalon** - Avalon fingerprints
|
||||
- Similar to MACCS but more features
|
||||
- Optimized for similarity searching
|
||||
|
||||
### Atom-Pair Fingerprints
|
||||
- **atompair** - Atom pair fingerprints
|
||||
- Encodes pairs of atoms and distance between them
|
||||
- Good for 3D similarity
|
||||
- **atompair-count** - Count version of atom pairs
|
||||
|
||||
### Topological Torsion Fingerprints
|
||||
- **topological** - Topological torsion fingerprints
|
||||
- Encodes sequences of 4 connected atoms
|
||||
- Captures local topology
|
||||
- **topological-count** - Count version of topological torsions
|
||||
|
||||
### MinHashed Fingerprints
|
||||
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
|
||||
- Combines atom-pair and ECFP concepts
|
||||
- Default: 1024 dimensions
|
||||
- Fast and efficient for large datasets
|
||||
- **secfp** - SMILES Extended Connectivity Fingerprint
|
||||
- Operates directly on SMILES strings
|
||||
- Captures both substructure and atom-pair information
|
||||
|
||||
### Extended Reduced Graph
|
||||
- **erg** - Extended Reduced Graph
|
||||
- Uses pharmacophoric points instead of atoms
|
||||
- Reduces graph complexity while preserving key features
|
||||
|
||||
## Pharmacophore Descriptors
|
||||
|
||||
Features based on pharmacologically relevant functional groups and their spatial relationships.
|
||||
|
||||
### CATS (Chemically Advanced Template Search)
|
||||
- **cats2D** - 2D CATS descriptors
|
||||
- Pharmacophore point pair distributions
|
||||
- Distance based on shortest path
|
||||
- 21 descriptors by default
|
||||
- **cats3D** - 3D CATS descriptors
|
||||
- Euclidean distance based
|
||||
- Requires conformer generation
|
||||
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
|
||||
|
||||
### Gobbi Pharmacophores
|
||||
- **gobbi2D** - 2D pharmacophore fingerprints
|
||||
- 8 pharmacophore feature types:
|
||||
- Hydrophobic
|
||||
- Aromatic
|
||||
- H-bond acceptor
|
||||
- H-bond donor
|
||||
- Positive ionizable
|
||||
- Negative ionizable
|
||||
- Lumped hydrophobe
|
||||
- Good for virtual screening
|
||||
|
||||
### Pmapper Pharmacophores
|
||||
- **pmapper2D** - 2D pharmacophore signatures
|
||||
- **pmapper3D** - 3D pharmacophore signatures
|
||||
- High-dimensional pharmacophore descriptors
|
||||
- Useful for QSAR and similarity searching
|
||||
|
||||
## Shape Descriptors
|
||||
|
||||
Descriptors capturing 3D molecular shape and electrostatic properties.
|
||||
|
||||
### USR (Ultrafast Shape Recognition)
|
||||
- **usr** - Basic USR descriptors
|
||||
- 12 dimensions encoding shape distribution
|
||||
- Extremely fast computation
|
||||
- **usrcat** - USR with pharmacophoric constraints
|
||||
- 60 dimensions (12 per feature type)
|
||||
- Combines shape and pharmacophore information
|
||||
|
||||
### Electrostatic Shape
|
||||
- **electroshape** - ElectroShape descriptors
|
||||
- Combines molecular shape, chirality, and electrostatics
|
||||
- Useful for protein-ligand docking predictions
|
||||
|
||||
## Scaffold-Based Descriptors
|
||||
|
||||
Descriptors based on molecular scaffolds and core structures.
|
||||
|
||||
### Scaffold Keys
|
||||
- **scaffoldkeys** - Scaffold key calculator
|
||||
- 40+ scaffold-based properties
|
||||
- Bioisosteric scaffold representation
|
||||
- Captures core structural features
|
||||
|
||||
## Graph Featurizers for GNN Input
|
||||
|
||||
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
|
||||
|
||||
### Atom-Level Features
|
||||
- **atom-onehot** - One-hot encoded atom features
|
||||
- **atom-default** - Default atom featurization including:
|
||||
- Atomic number
|
||||
- Degree, formal charge
|
||||
- Hybridization
|
||||
- Aromaticity
|
||||
- Number of hydrogen atoms
|
||||
|
||||
### Bond-Level Features
|
||||
- **bond-onehot** - One-hot encoded bond features
|
||||
- **bond-default** - Default bond featurization including:
|
||||
- Bond type (single, double, triple, aromatic)
|
||||
- Conjugation
|
||||
- Ring membership
|
||||
- Stereochemistry
|
||||
|
||||
## Integrated Pretrained Model Collections
|
||||
|
||||
Molfeat integrates models from various sources:
|
||||
|
||||
### HuggingFace Models
|
||||
Access to transformer models through HuggingFace hub:
|
||||
- ChemBERTa variants
|
||||
- ChemGPT variants
|
||||
- MolT5
|
||||
- Custom uploaded models
|
||||
|
||||
### DGL-LifeSci Models
|
||||
Pre-trained GNN models from DGL-Life:
|
||||
- GIN variants with different pre-training tasks
|
||||
- AttentiveFP models
|
||||
- MPNN models
|
||||
|
||||
### FCD (Fréchet ChemNet Distance)
|
||||
- **fcd** - Pre-trained CNN for molecular generation evaluation
|
||||
|
||||
### Graphormer Models
|
||||
- Graph transformers from Microsoft Research
|
||||
- Pre-trained on quantum chemistry datasets
|
||||
|
||||
## Usage Notes
|
||||
|
||||
### Choosing a Featurizer
|
||||
|
||||
**For traditional ML (Random Forest, SVM, etc.):**
|
||||
- Start with **ecfp** or **maccs** fingerprints
|
||||
- Try **desc2D** for interpretable models
|
||||
- Use **FeatConcat** to combine multiple fingerprints
|
||||
|
||||
**For deep learning:**
|
||||
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
|
||||
- Use **gin-supervised-*** for graph neural network embeddings
|
||||
- Consider **Graphormer** for quantum property predictions
|
||||
|
||||
**For similarity searching:**
|
||||
- **ecfp** - General purpose, most popular
|
||||
- **maccs** - Fast, good for scaffold hopping
|
||||
- **map4** - Efficient for large-scale searches
|
||||
- **usr** / **usrcat** - 3D shape similarity
|
||||
|
||||
**For pharmacophore-based approaches:**
|
||||
- **fcfp** - Functional group based
|
||||
- **cats2D/3D** - Pharmacophore pair distributions
|
||||
- **gobbi2D** - Explicit pharmacophore features
|
||||
|
||||
**For interpretability:**
|
||||
- **desc2D** / **mordred** - Named descriptors
|
||||
- **maccs** - Interpretable substructure keys
|
||||
- **scaffoldkeys** - Scaffold-based features
|
||||
|
||||
### Model Dependencies
|
||||
|
||||
Some featurizers require optional dependencies:
|
||||
|
||||
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
|
||||
- **Graphormer**: `pip install "molfeat[graphormer]"`
|
||||
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
|
||||
- **FCD**: `pip install "molfeat[fcd]"`
|
||||
- **MAP4**: `pip install "molfeat[map4]"`
|
||||
- **All dependencies**: `pip install "molfeat[all]"`
|
||||
|
||||
### Accessing All Available Models
|
||||
|
||||
```python
|
||||
from molfeat.store.modelstore import ModelStore
|
||||
|
||||
store = ModelStore()
|
||||
all_models = store.available_models
|
||||
|
||||
# Print all available featurizers
|
||||
for model in all_models:
|
||||
print(f"{model.name}: {model.description}")
|
||||
|
||||
# Search for specific types
|
||||
transformers = [m for m in all_models if "transformer" in m.tags]
|
||||
gnn_models = [m for m in all_models if "gnn" in m.tags]
|
||||
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
|
||||
```
|
||||
|
||||
## Performance Characteristics
|
||||
|
||||
### Computational Speed (relative)
|
||||
**Fastest:**
|
||||
- maccs
|
||||
- ecfp
|
||||
- rdkit fingerprints
|
||||
- usr
|
||||
|
||||
**Medium:**
|
||||
- desc2D
|
||||
- cats2D
|
||||
- Most fingerprints
|
||||
|
||||
**Slower:**
|
||||
- mordred (1800+ descriptors)
|
||||
- desc3D (requires conformer generation)
|
||||
- 3D descriptors in general
|
||||
|
||||
**Slowest (first run):**
|
||||
- Pretrained models (ChemBERTa, ChemGPT, GIN)
|
||||
- Note: Subsequent runs benefit from caching
|
||||
|
||||
### Dimensionality
|
||||
|
||||
**Low (< 200 dims):**
|
||||
- maccs (167)
|
||||
- usr (12)
|
||||
- usrcat (60)
|
||||
|
||||
**Medium (200-2000 dims):**
|
||||
- desc2D (~200)
|
||||
- ecfp (2048 default, configurable)
|
||||
- map4 (1024 default)
|
||||
|
||||
**High (> 2000 dims):**
|
||||
- mordred (1800+)
|
||||
- Concatenated fingerprints
|
||||
- Some transformer embeddings
|
||||
|
||||
**Variable:**
|
||||
- Transformer models (typically 768-1024)
|
||||
- GNN models (depends on architecture)
|
||||
723
scientific-packages/molfeat/references/examples.md
Normal file
@@ -0,0 +1,723 @@
|
||||
# Molfeat Usage Examples
|
||||
|
||||
This document provides practical examples for common molfeat use cases.
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
# Recommended: Using conda/mamba
|
||||
mamba install -c conda-forge molfeat
|
||||
|
||||
# Alternative: Using pip
|
||||
pip install molfeat
|
||||
|
||||
# With all optional dependencies
|
||||
pip install "molfeat[all]"
|
||||
|
||||
# With specific dependencies
|
||||
pip install "molfeat[dgl]" # For GNN models
|
||||
pip install "molfeat[graphormer]" # For Graphormer
|
||||
pip install "molfeat[transformer]" # For ChemBERTa, ChemGPT
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Basic Featurization Workflow
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
from molfeat.calc import FPCalculator
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
|
||||
# Load sample data
|
||||
data = dm.data.freesolv().sample(100).smiles.values
|
||||
|
||||
# Single molecule featurization
|
||||
calc = FPCalculator("ecfp")
|
||||
features_single = calc(data[0])
|
||||
print(f"Single molecule features shape: {features_single.shape}")
|
||||
# Output: (2048,)
|
||||
|
||||
# Batch featurization with parallelization
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
features_batch = transformer(data)
|
||||
print(f"Batch features shape: {features_batch.shape}")
|
||||
# Output: (100, 2048)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Calculator Examples
|
||||
|
||||
### Fingerprint Calculators
|
||||
|
||||
```python
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# ECFP (Extended-Connectivity Fingerprints)
|
||||
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
|
||||
fp = ecfp("CCO") # Ethanol
|
||||
print(f"ECFP shape: {fp.shape}") # (2048,)
|
||||
|
||||
# MACCS keys
|
||||
maccs = FPCalculator("maccs")
|
||||
fp = maccs("c1ccccc1") # Benzene
|
||||
print(f"MACCS shape: {fp.shape}") # (167,)
|
||||
|
||||
# Count-based fingerprints
|
||||
ecfp_count = FPCalculator("ecfp-count", radius=3)
|
||||
fp_count = ecfp_count("CC(C)CC(C)C") # Non-binary counts
|
||||
|
||||
# MAP4 fingerprints
|
||||
map4 = FPCalculator("map4")
|
||||
fp = map4("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
|
||||
```
|
||||
|
||||
### Descriptor Calculators
|
||||
|
||||
```python
|
||||
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors
|
||||
|
||||
# RDKit 2D descriptors (200+ properties)
|
||||
desc2d = RDKitDescriptors2D()
|
||||
descriptors = desc2d("CCO")
|
||||
print(f"Number of 2D descriptors: {len(descriptors)}")
|
||||
|
||||
# Get descriptor names
|
||||
names = desc2d.columns
|
||||
print(f"First 5 descriptors: {names[:5]}")
|
||||
|
||||
# Mordred descriptors (1800+ properties)
|
||||
mordred = MordredDescriptors()
|
||||
descriptors = mordred("c1ccccc1O") # Phenol
|
||||
print(f"Mordred descriptors: {len(descriptors)}")
|
||||
```
|
||||
|
||||
### Pharmacophore Calculators
|
||||
|
||||
```python
|
||||
from molfeat.calc import CATSCalculator
|
||||
|
||||
# 2D CATS descriptors
|
||||
cats = CATSCalculator(mode="2D", scale="raw")
|
||||
descriptors = cats("CC(C)Cc1ccc(C)cc1C") # Cymene
|
||||
print(f"CATS descriptors: {descriptors.shape}") # (21,)
|
||||
|
||||
# 3D CATS descriptors (requires conformer)
|
||||
cats3d = CATSCalculator(mode="3D", scale="num")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Transformer Examples
|
||||
|
||||
### Basic Transformer Usage
|
||||
|
||||
```python
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
import datamol as dm
|
||||
|
||||
# Prepare data
|
||||
smiles_list = [
|
||||
"CCO",
|
||||
"CC(=O)O",
|
||||
"c1ccccc1",
|
||||
"CC(C)O",
|
||||
"CCCC"
|
||||
]
|
||||
|
||||
# Create transformer
|
||||
calc = FPCalculator("ecfp")
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
|
||||
# Transform molecules
|
||||
features = transformer(smiles_list)
|
||||
print(f"Features shape: {features.shape}") # (5, 2048)
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
# Handle invalid SMILES gracefully
|
||||
smiles_with_errors = [
|
||||
"CCO", # Valid
|
||||
"invalid", # Invalid
|
||||
"CC(=O)O", # Valid
|
||||
"xyz123", # Invalid
|
||||
]
|
||||
|
||||
transformer = MoleculeTransformer(
|
||||
FPCalculator("ecfp"),
|
||||
n_jobs=-1,
|
||||
verbose=True, # Log errors
|
||||
ignore_errors=True # Continue on failure
|
||||
)
|
||||
|
||||
features = transformer(smiles_with_errors)
|
||||
# Returns: array with None for failed molecules
|
||||
print(features) # [array(...), None, array(...), None]
|
||||
```
|
||||
|
||||
### Concatenating Multiple Featurizers
|
||||
|
||||
```python
|
||||
from molfeat.trans import FeatConcat, MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Combine MACCS (167) + ECFP (2048) = 2215 dimensions
|
||||
concat_calc = FeatConcat([
|
||||
FPCalculator("maccs"),
|
||||
FPCalculator("ecfp", radius=3, fpSize=2048)
|
||||
])
|
||||
|
||||
transformer = MoleculeTransformer(concat_calc, n_jobs=-1)
|
||||
features = transformer(smiles_list)
|
||||
print(f"Combined features shape: {features.shape}") # (n, 2215)
|
||||
|
||||
# Triple combination
|
||||
triple_concat = FeatConcat([
|
||||
FPCalculator("maccs"),
|
||||
FPCalculator("ecfp"),
|
||||
FPCalculator("rdkit")
|
||||
])
|
||||
```
|
||||
|
||||
### Saving and Loading Configurations
|
||||
|
||||
```python
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Create and save transformer
|
||||
transformer = MoleculeTransformer(
|
||||
FPCalculator("ecfp", radius=3, fpSize=2048),
|
||||
n_jobs=-1
|
||||
)
|
||||
|
||||
# Save to YAML
|
||||
transformer.to_state_yaml_file("my_featurizer.yml")
|
||||
|
||||
# Save to JSON
|
||||
transformer.to_state_json_file("my_featurizer.json")
|
||||
|
||||
# Load from saved state
|
||||
loaded_transformer = MoleculeTransformer.from_state_yaml_file("my_featurizer.yml")
|
||||
|
||||
# Use loaded transformer
|
||||
features = loaded_transformer(smiles_list)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Pretrained Model Examples
|
||||
|
||||
### Using the ModelStore
|
||||
|
||||
```python
|
||||
from molfeat.store.modelstore import ModelStore
|
||||
|
||||
# Initialize model store
|
||||
store = ModelStore()
|
||||
|
||||
# List all available models
|
||||
print(f"Total available models: {len(store.available_models)}")
|
||||
|
||||
# Search for specific models
|
||||
chemberta_models = store.search(name="ChemBERTa")
|
||||
for model in chemberta_models:
|
||||
print(f"- {model.name}: {model.description}")
|
||||
|
||||
# Get model information
|
||||
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
|
||||
print(f"Model: {model_card.name}")
|
||||
print(f"Version: {model_card.version}")
|
||||
print(f"Authors: {model_card.authors}")
|
||||
|
||||
# View usage instructions
|
||||
model_card.usage()
|
||||
|
||||
# Load model directly
|
||||
transformer = store.load("ChemBERTa-77M-MLM")
|
||||
```
|
||||
|
||||
### ChemBERTa Embeddings
|
||||
|
||||
```python
|
||||
from molfeat.trans.pretrained import PretrainedMolTransformer
|
||||
|
||||
# Load ChemBERTa model
|
||||
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
|
||||
|
||||
# Generate embeddings
|
||||
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
|
||||
embeddings = chemberta(smiles)
|
||||
print(f"ChemBERTa embeddings shape: {embeddings.shape}")
|
||||
# Output: (3, 768) - 768-dimensional embeddings
|
||||
|
||||
# Use in ML pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
embeddings, labels, test_size=0.2
|
||||
)
|
||||
|
||||
clf = RandomForestClassifier()
|
||||
clf.fit(X_train, y_train)
|
||||
predictions = clf.predict(X_test)
|
||||
```
|
||||
|
||||
### ChemGPT Models
|
||||
|
||||
```python
|
||||
# Small model (4.7M parameters)
|
||||
chemgpt_small = PretrainedMolTransformer("ChemGPT-4.7M", n_jobs=-1)
|
||||
|
||||
# Medium model (19M parameters)
|
||||
chemgpt_medium = PretrainedMolTransformer("ChemGPT-19M", n_jobs=-1)
|
||||
|
||||
# Large model (1.2B parameters)
|
||||
chemgpt_large = PretrainedMolTransformer("ChemGPT-1.2B", n_jobs=-1)
|
||||
|
||||
# Generate embeddings
|
||||
embeddings = chemgpt_small(smiles)
|
||||
```
|
||||
|
||||
### Graph Neural Network Models
|
||||
|
||||
```python
|
||||
# GIN models with different pre-training objectives
|
||||
gin_masking = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
|
||||
gin_infomax = PretrainedMolTransformer("gin-supervised-infomax", n_jobs=-1)
|
||||
gin_edgepred = PretrainedMolTransformer("gin-supervised-edgepred", n_jobs=-1)
|
||||
|
||||
# Generate graph embeddings
|
||||
embeddings = gin_masking(smiles)
|
||||
print(f"GIN embeddings shape: {embeddings.shape}")
|
||||
|
||||
# Graphormer (for quantum chemistry)
|
||||
graphormer = PretrainedMolTransformer("Graphormer-pcqm4mv2", n_jobs=-1)
|
||||
embeddings = graphormer(smiles)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Machine Learning Integration
|
||||
|
||||
### Scikit-learn Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.model_selection import cross_val_score
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Create ML pipeline
|
||||
pipeline = Pipeline([
|
||||
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
|
||||
('classifier', RandomForestClassifier(n_estimators=100))
|
||||
])
|
||||
|
||||
# Train and evaluate
|
||||
pipeline.fit(smiles_train, y_train)
|
||||
predictions = pipeline.predict(smiles_test)
|
||||
|
||||
# Cross-validation
|
||||
scores = cross_val_score(pipeline, smiles_all, y_all, cv=5)
|
||||
print(f"CV scores: {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
```
|
||||
|
||||
### Grid Search for Hyperparameter Tuning
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sklearn.svm import SVC
|
||||
|
||||
# Define pipeline
|
||||
pipeline = Pipeline([
|
||||
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
|
||||
('classifier', SVC())
|
||||
])
|
||||
|
||||
# Define parameter grid
|
||||
param_grid = {
|
||||
'classifier__C': [0.1, 1, 10],
|
||||
'classifier__kernel': ['rbf', 'linear'],
|
||||
'classifier__gamma': ['scale', 'auto']
|
||||
}
|
||||
|
||||
# Grid search
|
||||
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
|
||||
grid_search.fit(smiles_train, y_train)
|
||||
|
||||
print(f"Best parameters: {grid_search.best_params_}")
|
||||
print(f"Best score: {grid_search.best_score_:.3f}")
|
||||
```
|
||||
|
||||
### Multiple Featurizer Comparison
|
||||
|
||||
```python
|
||||
from sklearn.metrics import roc_auc_score
|
||||
|
||||
# Test different featurizers
|
||||
featurizers = {
|
||||
'ECFP': FPCalculator("ecfp"),
|
||||
'MACCS': FPCalculator("maccs"),
|
||||
'RDKit': FPCalculator("rdkit"),
|
||||
'Descriptors': RDKitDescriptors2D(),
|
||||
'Combined': FeatConcat([
|
||||
FPCalculator("maccs"),
|
||||
FPCalculator("ecfp")
|
||||
])
|
||||
}
|
||||
|
||||
results = {}
|
||||
for name, calc in featurizers.items():
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
X_train = transformer(smiles_train)
|
||||
X_test = transformer(smiles_test)
|
||||
|
||||
clf = RandomForestClassifier(n_estimators=100)
|
||||
clf.fit(X_train, y_train)
|
||||
|
||||
y_pred = clf.predict_proba(X_test)[:, 1]
|
||||
auc = roc_auc_score(y_test, y_pred)
|
||||
results[name] = auc
|
||||
|
||||
print(f"{name}: AUC = {auc:.3f}")
|
||||
```
|
||||
|
||||
### PyTorch Deep Learning
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
|
||||
# Custom dataset
|
||||
class MoleculeDataset(Dataset):
|
||||
def __init__(self, smiles, labels, transformer):
|
||||
self.features = transformer(smiles)
|
||||
self.labels = torch.tensor(labels, dtype=torch.float32)
|
||||
|
||||
def __len__(self):
|
||||
return len(self.labels)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
return (
|
||||
torch.tensor(self.features[idx], dtype=torch.float32),
|
||||
self.labels[idx]
|
||||
)
|
||||
|
||||
# Prepare data
|
||||
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
train_dataset = MoleculeDataset(smiles_train, y_train, transformer)
|
||||
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
|
||||
|
||||
# Simple neural network
|
||||
class MoleculeClassifier(nn.Module):
|
||||
def __init__(self, input_dim):
|
||||
super().__init__()
|
||||
self.network = nn.Sequential(
|
||||
nn.Linear(input_dim, 512),
|
||||
nn.ReLU(),
|
||||
nn.Dropout(0.3),
|
||||
nn.Linear(512, 256),
|
||||
nn.ReLU(),
|
||||
nn.Dropout(0.3),
|
||||
nn.Linear(256, 1),
|
||||
nn.Sigmoid()
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.network(x)
|
||||
|
||||
# Train model
|
||||
model = MoleculeClassifier(input_dim=2048)
|
||||
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
|
||||
criterion = nn.BCELoss()
|
||||
|
||||
for epoch in range(10):
|
||||
for batch_features, batch_labels in train_loader:
|
||||
optimizer.zero_grad()
|
||||
outputs = model(batch_features).squeeze()
|
||||
loss = criterion(outputs, batch_labels)
|
||||
loss.backward()
|
||||
optimizer.step()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Advanced Usage Patterns
|
||||
|
||||
### Custom Preprocessing
|
||||
|
||||
```python
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
import datamol as dm
|
||||
|
||||
class CustomTransformer(MoleculeTransformer):
|
||||
def preprocess(self, mol):
|
||||
"""Custom preprocessing: standardize molecule"""
|
||||
if isinstance(mol, str):
|
||||
mol = dm.to_mol(mol)
|
||||
|
||||
# Standardize
|
||||
mol = dm.standardize_mol(mol)
|
||||
|
||||
# Remove salts
|
||||
mol = dm.remove_salts(mol)
|
||||
|
||||
return mol
|
||||
|
||||
# Use custom transformer
|
||||
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
features = transformer(smiles_list)
|
||||
```
|
||||
|
||||
### Featurization with Conformers
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
from molfeat.calc import RDKitDescriptors3D
|
||||
|
||||
# Generate conformers
|
||||
def prepare_3d_mol(smiles):
|
||||
mol = dm.to_mol(smiles)
|
||||
mol = dm.add_hs(mol)
|
||||
mol = dm.conform.generate_conformers(mol, n_confs=1)
|
||||
return mol
|
||||
|
||||
# 3D descriptors
|
||||
calc_3d = RDKitDescriptors3D()
|
||||
|
||||
smiles = "CC(C)Cc1ccc(C)cc1C"
|
||||
mol_3d = prepare_3d_mol(smiles)
|
||||
descriptors_3d = calc_3d(mol_3d)
|
||||
```
|
||||
|
||||
### Parallel Batch Processing
|
||||
|
||||
```python
|
||||
from molfeat.trans import MoleculeTransformer
|
||||
from molfeat.calc import FPCalculator
|
||||
import time
|
||||
|
||||
# Large dataset
|
||||
smiles_large = load_large_dataset() # e.g., 100,000 molecules
|
||||
|
||||
# Test different parallelization levels
|
||||
for n_jobs in [1, 2, 4, -1]:
|
||||
transformer = MoleculeTransformer(
|
||||
FPCalculator("ecfp"),
|
||||
n_jobs=n_jobs
|
||||
)
|
||||
|
||||
start = time.time()
|
||||
features = transformer(smiles_large)
|
||||
elapsed = time.time() - start
|
||||
|
||||
print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
|
||||
```
|
||||
|
||||
### Caching for Expensive Operations
|
||||
|
||||
```python
|
||||
from molfeat.trans.pretrained import PretrainedMolTransformer
|
||||
import pickle
|
||||
|
||||
# Load expensive pretrained model
|
||||
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
|
||||
|
||||
# Cache embeddings for reuse
|
||||
cache_file = "embeddings_cache.pkl"
|
||||
|
||||
try:
|
||||
# Try loading cached embeddings
|
||||
with open(cache_file, "rb") as f:
|
||||
embeddings = pickle.load(f)
|
||||
print("Loaded cached embeddings")
|
||||
except FileNotFoundError:
|
||||
# Compute and cache
|
||||
embeddings = transformer(smiles_list)
|
||||
with open(cache_file, "wb") as f:
|
||||
pickle.dump(embeddings, f)
|
||||
print("Computed and cached embeddings")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Virtual Screening Workflow
|
||||
|
||||
```python
|
||||
from molfeat.calc import FPCalculator
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
import datamol as dm
|
||||
|
||||
# 1. Prepare training data (known actives/inactives)
|
||||
train_smiles = load_training_data()
|
||||
train_labels = load_training_labels() # 1=active, 0=inactive
|
||||
|
||||
# 2. Featurize training set
|
||||
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
|
||||
X_train = transformer(train_smiles)
|
||||
|
||||
# 3. Train classifier
|
||||
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
|
||||
clf.fit(X_train, train_labels)
|
||||
|
||||
# 4. Featurize screening library
|
||||
screening_smiles = load_screening_library() # e.g., 1M compounds
|
||||
X_screen = transformer(screening_smiles)
|
||||
|
||||
# 5. Predict and rank
|
||||
predictions = clf.predict_proba(X_screen)[:, 1]
|
||||
ranked_indices = predictions.argsort()[::-1]
|
||||
|
||||
# 6. Get top hits
|
||||
top_n = 1000
|
||||
top_hits = [screening_smiles[i] for i in ranked_indices[:top_n]]
|
||||
```
|
||||
|
||||
### QSAR Model Building
|
||||
|
||||
```python
|
||||
from molfeat.calc import RDKitDescriptors2D
|
||||
from sklearn.linear_model import Ridge
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.model_selection import cross_val_score
|
||||
import numpy as np
|
||||
|
||||
# Load QSAR dataset
|
||||
smiles = load_molecules()
|
||||
y = load_activity_values() # e.g., IC50, logP
|
||||
|
||||
# Featurize with interpretable descriptors
|
||||
transformer = MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1)
|
||||
X = transformer(smiles)
|
||||
|
||||
# Standardize features
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Build linear model
|
||||
model = Ridge(alpha=1.0)
|
||||
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
|
||||
print(f"R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
|
||||
|
||||
# Fit final model
|
||||
model.fit(X_scaled, y)
|
||||
|
||||
# Interpret feature importance
|
||||
feature_names = transformer.featurizer.columns
|
||||
importance = np.abs(model.coef_)
|
||||
top_features_idx = importance.argsort()[-10:][::-1]
|
||||
|
||||
print("Top 10 important features:")
|
||||
for idx in top_features_idx:
|
||||
print(f" {feature_names[idx]}: {model.coef_[idx]:.3f}")
|
||||
```
|
||||
|
||||
### Similarity Search
|
||||
|
||||
```python
|
||||
from molfeat.calc import FPCalculator
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
import numpy as np
|
||||
|
||||
# Query molecule
|
||||
query_smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
|
||||
|
||||
# Database of molecules
|
||||
database_smiles = load_molecule_database() # Large collection
|
||||
|
||||
# Compute fingerprints
|
||||
calc = FPCalculator("ecfp")
|
||||
query_fp = calc(query_smiles).reshape(1, -1)
|
||||
|
||||
transformer = MoleculeTransformer(calc, n_jobs=-1)
|
||||
database_fps = transformer(database_smiles)
|
||||
|
||||
# Compute similarity
|
||||
similarities = cosine_similarity(query_fp, database_fps)[0]
|
||||
|
||||
# Find most similar
|
||||
top_k = 10
|
||||
top_indices = similarities.argsort()[-top_k:][::-1]
|
||||
|
||||
print(f"Top {top_k} similar molecules:")
|
||||
for i, idx in enumerate(top_indices, 1):
|
||||
print(f"{i}. {database_smiles[idx]} (similarity: {similarities[idx]:.3f})")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Handling Invalid Molecules
|
||||
|
||||
```python
|
||||
# Use ignore_errors to skip invalid molecules
|
||||
transformer = MoleculeTransformer(
|
||||
FPCalculator("ecfp"),
|
||||
ignore_errors=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
# Filter out None values after transformation
|
||||
features = transformer(smiles_list)
|
||||
valid_mask = [f is not None for f in features]
|
||||
valid_features = [f for f in features if f is not None]
|
||||
valid_smiles = [s for s, m in zip(smiles_list, valid_mask) if m]
|
||||
```
|
||||
|
||||
### Memory Management for Large Datasets
|
||||
|
||||
```python
|
||||
import numpy as np

# Process in chunks for very large datasets
|
||||
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
|
||||
all_features = []
|
||||
|
||||
for i in range(0, len(smiles_list), chunk_size):
|
||||
chunk = smiles_list[i:i+chunk_size]
|
||||
features = transformer(chunk)
|
||||
all_features.append(features)
|
||||
print(f"Processed {i+len(chunk)}/{len(smiles_list)}")
|
||||
|
||||
return np.vstack(all_features)
|
||||
|
||||
# Use with large dataset
|
||||
features = featurize_in_chunks(large_smiles_list, transformer)
|
||||
```
|
||||
|
||||
### Reproducibility
|
||||
|
||||
```python
|
||||
import random
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
# Set all random seeds
|
||||
def set_seed(seed=42):
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
torch.manual_seed(seed)
|
||||
torch.cuda.manual_seed_all(seed)
|
||||
|
||||
set_seed(42)
|
||||
|
||||
# Save exact configuration
|
||||
transformer.to_state_yaml_file("config.yml")
|
||||
|
||||
# Document version
|
||||
import molfeat
|
||||
print(f"molfeat version: {molfeat.__version__}")
|
||||
```
|
||||
381
scientific-packages/polars/SKILL.md
Normal file
381
scientific-packages/polars/SKILL.md
Normal file
@@ -0,0 +1,381 @@
|
||||
---
|
||||
name: polars
|
||||
description: This skill should be used when working with the Polars DataFrame library for high-performance data manipulation in Python. Use when users ask about Polars operations, migrating from pandas, optimizing data processing pipelines, or working with large datasets that benefit from lazy evaluation and parallel processing.
|
||||
---
|
||||
|
||||
# Polars
|
||||
|
||||
## Overview
|
||||
|
||||
Polars is a lightning-fast DataFrame library for Python (and Rust) built on Apache Arrow. This skill provides guidance for working with Polars, including its expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities. Use this skill when helping users write efficient data processing code, migrate from pandas, or optimize data pipelines.
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Installation and Basic Usage
|
||||
|
||||
Install Polars:
|
||||
```python
|
||||
pip install polars
|
||||
```
|
||||
|
||||
Basic DataFrame creation and operations:
|
||||
```python
|
||||
import polars as pl
|
||||
|
||||
# Create DataFrame
|
||||
df = pl.DataFrame({
|
||||
"name": ["Alice", "Bob", "Charlie"],
|
||||
"age": [25, 30, 35],
|
||||
"city": ["NY", "LA", "SF"]
|
||||
})
|
||||
|
||||
# Select columns
|
||||
df.select("name", "age")
|
||||
|
||||
# Filter rows
|
||||
df.filter(pl.col("age") > 25)
|
||||
|
||||
# Add computed columns
|
||||
df.with_columns(
|
||||
age_plus_10=pl.col("age") + 10
|
||||
)
|
||||
```
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### Expressions
|
||||
|
||||
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
|
||||
|
||||
**Key principles:**
|
||||
- Use `pl.col("column_name")` to reference columns
|
||||
- Chain methods to build complex transformations
|
||||
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
# Expression-based computation
|
||||
df.select(
|
||||
pl.col("name"),
|
||||
(pl.col("age") * 12).alias("age_in_months")
|
||||
)
|
||||
```
|
||||
|
||||
### Lazy vs Eager Evaluation
|
||||
|
||||
**Eager (DataFrame):** Operations execute immediately
|
||||
```python
|
||||
df = pl.read_csv("file.csv") # Reads immediately
|
||||
result = df.filter(pl.col("age") > 25) # Executes immediately
|
||||
```
|
||||
|
||||
**Lazy (LazyFrame):** Operations build a query plan, optimized before execution
|
||||
```python
|
||||
lf = pl.scan_csv("file.csv") # Doesn't read yet
|
||||
result = lf.filter(pl.col("age") > 25).select("name", "age")
|
||||
df = result.collect() # Now executes optimized query
|
||||
```
|
||||
|
||||
**When to use lazy:**
|
||||
- Working with large datasets
|
||||
- Complex query pipelines
|
||||
- When only some columns/rows are needed
|
||||
- Performance is critical
|
||||
|
||||
**Benefits of lazy evaluation:**
|
||||
- Automatic query optimization
|
||||
- Predicate pushdown
|
||||
- Projection pushdown
|
||||
- Parallel execution
|
||||
|
||||
For detailed concepts, load `references/core_concepts.md`.
|
||||
|
||||
## Common Operations
|
||||
|
||||
### Select
|
||||
Select and manipulate columns:
|
||||
```python
|
||||
# Select specific columns
|
||||
df.select("name", "age")
|
||||
|
||||
# Select with expressions
|
||||
df.select(
|
||||
pl.col("name"),
|
||||
(pl.col("age") * 2).alias("double_age")
|
||||
)
|
||||
|
||||
# Select all columns matching a pattern
|
||||
df.select(pl.col("^.*_id$"))
|
||||
```
|
||||
|
||||
### Filter
|
||||
Filter rows by conditions:
|
||||
```python
|
||||
# Single condition
|
||||
df.filter(pl.col("age") > 25)
|
||||
|
||||
# Multiple conditions (cleaner than using &)
|
||||
df.filter(
|
||||
pl.col("age") > 25,
|
||||
pl.col("city") == "NY"
|
||||
)
|
||||
|
||||
# Complex conditions
|
||||
df.filter(
|
||||
(pl.col("age") > 25) | (pl.col("city") == "LA")
|
||||
)
|
||||
```
|
||||
|
||||
### With Columns
|
||||
Add or modify columns while preserving existing ones:
|
||||
```python
|
||||
# Add new columns
|
||||
df.with_columns(
|
||||
age_plus_10=pl.col("age") + 10,
|
||||
name_upper=pl.col("name").str.to_uppercase()
|
||||
)
|
||||
|
||||
# Parallel computation (all columns computed in parallel)
|
||||
df.with_columns(
|
||||
pl.col("value") * 10,
|
||||
pl.col("value") * 100,
|
||||
)
|
||||
```
|
||||
|
||||
### Group By and Aggregations
|
||||
Group data and compute aggregations:
|
||||
```python
|
||||
# Basic grouping
|
||||
df.group_by("city").agg(
|
||||
pl.col("age").mean().alias("avg_age"),
|
||||
pl.len().alias("count")
|
||||
)
|
||||
|
||||
# Multiple group keys
|
||||
df.group_by("city", "department").agg(
|
||||
pl.col("salary").sum()
|
||||
)
|
||||
|
||||
# Conditional aggregations
|
||||
df.group_by("city").agg(
|
||||
(pl.col("age") > 30).sum().alias("over_30")
|
||||
)
|
||||
```
|
||||
|
||||
For detailed operation patterns, load `references/operations.md`.
|
||||
|
||||
## Aggregations and Window Functions
|
||||
|
||||
### Aggregation Functions
|
||||
Common aggregations within `group_by` context:
|
||||
- `pl.len()` - count rows
|
||||
- `pl.col("x").sum()` - sum values
|
||||
- `pl.col("x").mean()` - average
|
||||
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
|
||||
- `pl.first()` / `pl.last()` - first/last values
|
||||
|
||||
### Window Functions with `over()`
|
||||
Apply aggregations while preserving row count:
|
||||
```python
|
||||
# Add group statistics to each row
|
||||
df.with_columns(
|
||||
avg_age_by_city=pl.col("age").mean().over("city"),
|
||||
rank_in_city=pl.col("salary").rank().over("city")
|
||||
)
|
||||
|
||||
# Multiple grouping columns
|
||||
df.with_columns(
|
||||
group_avg=pl.col("value").mean().over("category", "region")
|
||||
)
|
||||
```
|
||||
|
||||
**Mapping strategies** (see the sketch after this list):
|
||||
- `group_to_rows` (default): Preserves original row order
|
||||
- `explode`: Faster but groups rows together
|
||||
- `join`: Creates list columns
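A small sketch of the default and `join` strategies on a toy frame (the `mapping_strategy` keyword of `over()` is assumed here):

```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "LA", "NY"], "age": [25, 35, 30]})

# group_to_rows (default): one value per original row, original order preserved
print(df.with_columns(rank_in_city=pl.col("age").rank().over("city")))

# join: the full per-group result is attached to every row as a list column
print(df.with_columns(city_ages=pl.col("age").sort().over("city", mapping_strategy="join")))
```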
|
||||
|
||||
## Data I/O
|
||||
|
||||
### Supported Formats
|
||||
Polars supports reading and writing:
|
||||
- CSV, Parquet, JSON, Excel
|
||||
- Databases (via connectors)
|
||||
- Cloud storage (S3, Azure, GCS)
|
||||
- Google BigQuery
|
||||
- Multiple/partitioned files (see the sketch after this list)
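A brief sketch of the multi-file case (the glob path is an assumption about local file layout):

```python
import polars as pl

# A glob pattern scans many Parquet parts as a single lazy table
lf = pl.scan_parquet("data/part-*.parquet")
print(lf.select(pl.len()).collect())  # total row count across all files
```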
|
||||
|
||||
### Common I/O Operations
|
||||
|
||||
**CSV:**
|
||||
```python
|
||||
# Eager
|
||||
df = pl.read_csv("file.csv")
|
||||
df.write_csv("output.csv")
|
||||
|
||||
# Lazy (preferred for large files)
|
||||
lf = pl.scan_csv("file.csv")
|
||||
result = lf.filter(...).select(...).collect()
|
||||
```
|
||||
|
||||
**Parquet (recommended for performance):**
|
||||
```python
|
||||
df = pl.read_parquet("file.parquet")
|
||||
df.write_parquet("output.parquet")
|
||||
```
|
||||
|
||||
**JSON:**
|
||||
```python
|
||||
df = pl.read_json("file.json")
|
||||
df.write_json("output.json")
|
||||
```
|
||||
|
||||
For comprehensive I/O documentation, load `references/io_guide.md`.
|
||||
|
||||
## Transformations
|
||||
|
||||
### Joins
|
||||
Combine DataFrames:
|
||||
```python
|
||||
# Inner join
|
||||
df1.join(df2, on="id", how="inner")
|
||||
|
||||
# Left join
|
||||
df1.join(df2, on="id", how="left")
|
||||
|
||||
# Join on different column names
|
||||
df1.join(df2, left_on="user_id", right_on="id")
|
||||
```
|
||||
|
||||
### Concatenation
|
||||
Stack DataFrames:
|
||||
```python
|
||||
# Vertical (stack rows)
|
||||
pl.concat([df1, df2], how="vertical")
|
||||
|
||||
# Horizontal (add columns)
|
||||
pl.concat([df1, df2], how="horizontal")
|
||||
|
||||
# Diagonal (union with different schemas)
|
||||
pl.concat([df1, df2], how="diagonal")
|
||||
```
|
||||
|
||||
### Pivot and Unpivot
|
||||
Reshape data:
|
||||
```python
|
||||
# Pivot (wide format)
|
||||
df.pivot(values="sales", index="date", columns="product")
|
||||
|
||||
# Unpivot (long format)
|
||||
df.unpivot(index="id", on=["col1", "col2"])
|
||||
```
|
||||
|
||||
For detailed transformation examples, load `references/transformations.md`.
|
||||
|
||||
## Pandas Migration
|
||||
|
||||
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
|
||||
|
||||
### Conceptual Differences
|
||||
- **No index**: Polars uses integer positions only
|
||||
- **Strict typing**: No silent type conversions
|
||||
- **Lazy evaluation**: Available via LazyFrame
|
||||
- **Parallel by default**: Operations parallelized automatically
|
||||
|
||||
### Common Operation Mappings
|
||||
|
||||
| Operation | Pandas | Polars |
|
||||
|-----------|--------|--------|
|
||||
| Select column | `df["col"]` | `df.select("col")` |
|
||||
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
|
||||
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
|
||||
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
|
||||
| Window | `df.groupby("col").transform(...)` | `df.with_columns(pl.col(...).over("col"))` |
|
||||
|
||||
### Key Syntax Patterns
|
||||
|
||||
**Pandas sequential (slow):**
|
||||
```python
|
||||
df.assign(
|
||||
col_a=lambda df_: df_.value * 10,
|
||||
col_b=lambda df_: df_.value * 100
|
||||
)
|
||||
```
|
||||
|
||||
**Polars parallel (fast):**
|
||||
```python
|
||||
df.with_columns(
|
||||
col_a=pl.col("value") * 10,
|
||||
col_b=pl.col("value") * 100,
|
||||
)
|
||||
```
|
||||
|
||||
For comprehensive migration guide, load `references/pandas_migration.md`.
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
1. **Use lazy evaluation for large datasets:**
|
||||
```python
|
||||
lf = pl.scan_csv("large.csv") # Don't use read_csv
|
||||
result = lf.filter(...).select(...).collect()
|
||||
```
|
||||
|
||||
2. **Avoid Python functions in hot paths:**
|
||||
- Stay within expression API for parallelization
|
||||
- Use `.map_elements()` only when necessary
|
||||
- Prefer native Polars operations
|
||||
|
||||
3. **Use streaming for very large data:**
|
||||
```python
|
||||
lf.collect(streaming=True)
|
||||
```
|
||||
|
||||
4. **Select only needed columns early:**
|
||||
```python
|
||||
# Good: Select columns early
|
||||
lf.select("col1", "col2").filter(...)
|
||||
|
||||
# Bad: Filter on all columns first
|
||||
lf.filter(...).select("col1", "col2")
|
||||
```
|
||||
|
||||
5. **Use appropriate data types:**
|
||||
- Categorical for low-cardinality strings
|
||||
- Appropriate integer sizes (i32 vs i64)
|
||||
- Date types for temporal data
|
||||
|
||||
### Expression Patterns
|
||||
|
||||
**Conditional operations:**
|
||||
```python
|
||||
pl.when(condition).then(value).otherwise(other_value)
|
||||
```
|
||||
|
||||
**Column operations across multiple columns:**
|
||||
```python
|
||||
df.select(pl.col("^.*_value$") * 2) # Regex pattern
|
||||
```
|
||||
|
||||
**Null handling:**
|
||||
```python
|
||||
pl.col("x").fill_null(0)
|
||||
pl.col("x").is_null()
|
||||
pl.col("x").drop_nulls()
|
||||
```
|
||||
|
||||
For additional best practices and patterns, load `references/best_practices.md`.
|
||||
|
||||
## Resources
|
||||
|
||||
This skill includes comprehensive reference documentation:
|
||||
|
||||
### references/
|
||||
- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and type system
|
||||
- `operations.md` - Comprehensive guide to all common operations with examples
|
||||
- `pandas_migration.md` - Complete migration guide from pandas to Polars
|
||||
- `io_guide.md` - Data I/O operations for all supported formats
|
||||
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
|
||||
- `best_practices.md` - Performance optimization tips and common patterns
|
||||
|
||||
Load these references as needed when users require detailed information about specific topics.
|
||||
649
scientific-packages/polars/references/best_practices.md
Normal file
649
scientific-packages/polars/references/best_practices.md
Normal file
@@ -0,0 +1,649 @@
|
||||
# Polars Best Practices and Performance Guide
|
||||
|
||||
Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### 1. Use Lazy Evaluation
|
||||
|
||||
**Always prefer lazy mode for large datasets:**
|
||||
|
||||
```python
|
||||
# Bad: Eager mode loads everything immediately
|
||||
df = pl.read_csv("large_file.csv")
|
||||
result = df.filter(pl.col("age") > 25).select("name", "age")
|
||||
|
||||
# Good: Lazy mode optimizes before execution
|
||||
lf = pl.scan_csv("large_file.csv")
|
||||
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
|
||||
```
|
||||
|
||||
**Benefits of lazy evaluation:**
|
||||
- Predicate pushdown (filter at source)
|
||||
- Projection pushdown (read only needed columns)
|
||||
- Query optimization
|
||||
- Parallel execution planning
|
||||
|
||||
### 2. Filter and Select Early
|
||||
|
||||
Push filters and column selection as early as possible in the pipeline:
|
||||
|
||||
```python
|
||||
# Bad: Process all data, then filter and select
|
||||
result = (
|
||||
lf.group_by("category")
|
||||
.agg(pl.col("value").mean())
|
||||
.join(other, on="category")
|
||||
.filter(pl.col("value") > 100)
|
||||
.select("category", "value")
|
||||
)
|
||||
|
||||
# Good: Filter and select early
|
||||
result = (
|
||||
lf.select("category", "value") # Only needed columns
|
||||
.filter(pl.col("value") > 100) # Filter early
|
||||
.group_by("category")
|
||||
.agg(pl.col("value").mean())
|
||||
.join(other.select("category", "other_col"), on="category")
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Avoid Python Functions
|
||||
|
||||
Stay within the expression API to maintain parallelization:
|
||||
|
||||
```python
|
||||
# Bad: Python function disables parallelization
|
||||
df = df.with_columns(
|
||||
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
|
||||
)
|
||||
|
||||
# Good: Use native expressions (parallelized)
|
||||
df = df.with_columns(result=pl.col("value") * 2)
|
||||
```
|
||||
|
||||
**When you must use custom functions:**
|
||||
```python
|
||||
# If truly needed, be explicit
|
||||
df = df.with_columns(
|
||||
result=pl.col("value").map_elements(
|
||||
custom_function,
|
||||
return_dtype=pl.Float64,
|
||||
skip_nulls=True # Optimize null handling
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Use Streaming for Very Large Data
|
||||
|
||||
Enable streaming for datasets larger than RAM:
|
||||
|
||||
```python
|
||||
# Streaming mode processes data in chunks
|
||||
lf = pl.scan_parquet("very_large.parquet")
|
||||
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
|
||||
|
||||
# Or use sink for direct streaming writes
|
||||
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
|
||||
```
|
||||
|
||||
### 5. Optimize Data Types
|
||||
|
||||
Choose appropriate data types to reduce memory and improve performance:
|
||||
|
||||
```python
|
||||
# Bad: Default types may be wasteful
|
||||
df = pl.read_csv("data.csv")
|
||||
|
||||
# Good: Specify optimal types
|
||||
df = pl.read_csv(
|
||||
"data.csv",
|
||||
dtypes={
|
||||
"id": pl.UInt32, # Instead of Int64 if values fit
|
||||
"category": pl.Categorical, # For low-cardinality strings
|
||||
"date": pl.Date, # Instead of String
|
||||
"small_int": pl.Int16, # Instead of Int64
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
**Type optimization guidelines:**
|
||||
- Use smallest integer type that fits your data
|
||||
- Use `Categorical` for strings with low cardinality (<50% unique)
|
||||
- Use `Date` instead of `Datetime` when time isn't needed
|
||||
- Use `Boolean` instead of integers for binary flags
|
||||
|
||||
### 6. Parallel Operations
|
||||
|
||||
Structure code to maximize parallelization:
|
||||
|
||||
```python
|
||||
# Bad: Sequential pipe operations disable parallelization
|
||||
df = (
|
||||
df.pipe(operation1)
|
||||
.pipe(operation2)
|
||||
.pipe(operation3)
|
||||
)
|
||||
|
||||
# Good: Combined operations enable parallelization
|
||||
df = df.with_columns(
|
||||
result1=operation1_expr(),
|
||||
result2=operation2_expr(),
|
||||
result3=operation3_expr()
|
||||
)
|
||||
```
|
||||
|
||||
### 7. Rechunk After Concatenation
|
||||
|
||||
```python
|
||||
# Concatenation can fragment data
|
||||
combined = pl.concat([df1, df2, df3])
|
||||
|
||||
# Rechunk for better performance in subsequent operations
|
||||
combined = pl.concat([df1, df2, df3], rechunk=True)
|
||||
```
|
||||
|
||||
## Expression Patterns
|
||||
|
||||
### Conditional Logic
|
||||
|
||||
**Simple conditions:**
|
||||
```python
|
||||
df.with_columns(
|
||||
status=pl.when(pl.col("age") >= 18)
|
||||
.then("adult")
|
||||
.otherwise("minor")
|
||||
)
|
||||
```
|
||||
|
||||
**Multiple conditions:**
|
||||
```python
|
||||
df.with_columns(
|
||||
grade=pl.when(pl.col("score") >= 90)
|
||||
.then("A")
|
||||
.when(pl.col("score") >= 80)
|
||||
.then("B")
|
||||
.when(pl.col("score") >= 70)
|
||||
.then("C")
|
||||
.when(pl.col("score") >= 60)
|
||||
.then("D")
|
||||
.otherwise("F")
|
||||
)
|
||||
```
|
||||
|
||||
**Complex conditions:**
|
||||
```python
|
||||
df.with_columns(
|
||||
category=pl.when(
|
||||
(pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
|
||||
)
|
||||
.then("enterprise")
|
||||
.when(
|
||||
(pl.col("revenue") > 100000) | (pl.col("customers") > 50)
|
||||
)
|
||||
.then("business")
|
||||
.otherwise("starter")
|
||||
)
|
||||
```
|
||||
|
||||
### Null Handling
|
||||
|
||||
**Check for nulls:**
|
||||
```python
|
||||
df.filter(pl.col("value").is_null())
|
||||
df.filter(pl.col("value").is_not_null())
|
||||
```
|
||||
|
||||
**Fill nulls:**
|
||||
```python
|
||||
# Constant value
|
||||
df.with_columns(pl.col("value").fill_null(0))
|
||||
|
||||
# Forward fill
|
||||
df.with_columns(pl.col("value").fill_null(strategy="forward"))
|
||||
|
||||
# Backward fill
|
||||
df.with_columns(pl.col("value").fill_null(strategy="backward"))
|
||||
|
||||
# Mean
|
||||
df.with_columns(pl.col("value").fill_null(strategy="mean"))
|
||||
|
||||
# Per-group fill
|
||||
df.with_columns(
|
||||
pl.col("value").fill_null(pl.col("value").mean()).over("group")
|
||||
)
|
||||
```
|
||||
|
||||
**Coalesce (first non-null):**
|
||||
```python
|
||||
df.with_columns(
|
||||
combined=pl.coalesce(["col1", "col2", "col3"])
|
||||
)
|
||||
```
|
||||
|
||||
### Column Selection Patterns
|
||||
|
||||
**By name:**
|
||||
```python
|
||||
df.select("col1", "col2", "col3")
|
||||
```
|
||||
|
||||
**By pattern:**
|
||||
```python
|
||||
# Regex
|
||||
df.select(pl.col("^sales_.*$"))
|
||||
|
||||
# Starts with
|
||||
df.select(pl.col("^sales"))
|
||||
|
||||
# Ends with
|
||||
df.select(pl.col("_total$"))
|
||||
|
||||
# Contains
|
||||
df.select(pl.col(".*revenue.*"))
|
||||
```
|
||||
|
||||
**By type:**
|
||||
```python
|
||||
# All numeric columns
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES))
|
||||
|
||||
# All string columns
|
||||
df.select(pl.col(pl.Utf8))
|
||||
|
||||
# Multiple types
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
|
||||
```
|
||||
|
||||
**Exclude columns:**
|
||||
```python
|
||||
df.select(pl.all().exclude("id", "timestamp"))
|
||||
```
|
||||
|
||||
**Transform multiple columns:**
|
||||
```python
|
||||
# Apply same operation to multiple columns
|
||||
df.select(
|
||||
pl.col("^sales_.*$") * 1.1 # 10% increase to all sales columns
|
||||
)
|
||||
```
|
||||
|
||||
### Aggregation Patterns
|
||||
|
||||
**Multiple aggregations:**
|
||||
```python
|
||||
df.group_by("category").agg(
|
||||
pl.col("value").sum().alias("total"),
|
||||
pl.col("value").mean().alias("average"),
|
||||
pl.col("value").std().alias("std_dev"),
|
||||
pl.col("id").count().alias("count"),
|
||||
pl.col("id").n_unique().alias("unique_count"),
|
||||
pl.col("value").min().alias("minimum"),
|
||||
pl.col("value").max().alias("maximum"),
|
||||
pl.col("value").quantile(0.5).alias("median"),
|
||||
pl.col("value").quantile(0.95).alias("p95")
|
||||
)
|
||||
```
|
||||
|
||||
**Conditional aggregations:**
|
||||
```python
|
||||
df.group_by("category").agg(
|
||||
# Count high values
|
||||
(pl.col("value") > 100).sum().alias("high_count"),
|
||||
|
||||
# Average of filtered values
|
||||
pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
|
||||
|
||||
# Conditional sum
|
||||
pl.when(pl.col("status") == "completed")
|
||||
.then(pl.col("amount"))
|
||||
.otherwise(0)
|
||||
.sum()
|
||||
.alias("completed_total")
|
||||
)
|
||||
```
|
||||
|
||||
**Grouped transformations:**
|
||||
```python
|
||||
df.with_columns(
|
||||
# Group statistics
|
||||
group_mean=pl.col("value").mean().over("category"),
|
||||
group_std=pl.col("value").std().over("category"),
|
||||
|
||||
# Rank within groups
|
||||
rank=pl.col("value").rank().over("category"),
|
||||
|
||||
# Percentage of group total
|
||||
pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
|
||||
)
|
||||
```
|
||||
|
||||
## Common Pitfalls and Anti-Patterns
|
||||
|
||||
### Pitfall 1: Row Iteration
|
||||
|
||||
```python
|
||||
# Bad: Never iterate rows
|
||||
for row in df.iter_rows():
|
||||
# Process row
|
||||
result = row[0] * 2
|
||||
|
||||
# Good: Use vectorized operations
|
||||
df = df.with_columns(result=pl.col("value") * 2)
|
||||
```
|
||||
|
||||
### Pitfall 2: Modifying in Place
|
||||
|
||||
```python
|
||||
# Bad: Polars is immutable, this doesn't work as expected
|
||||
df["new_col"] = df["old_col"] * 2 # May work but not recommended
|
||||
|
||||
# Good: Functional style
|
||||
df = df.with_columns(new_col=pl.col("old_col") * 2)
|
||||
```
|
||||
|
||||
### Pitfall 3: Not Using Expressions
|
||||
|
||||
```python
|
||||
# Bad: String-based operations
|
||||
df.select("value * 2") # Won't work
|
||||
|
||||
# Good: Expression-based
|
||||
df.select(pl.col("value") * 2)
|
||||
```
|
||||
|
||||
### Pitfall 4: Inefficient Joins
|
||||
|
||||
```python
|
||||
# Bad: Join large tables without filtering
|
||||
result = large_df1.join(large_df2, on="id")
|
||||
|
||||
# Good: Filter before joining
|
||||
result = (
|
||||
large_df1.filter(pl.col("active"))
|
||||
.join(
|
||||
large_df2.filter(pl.col("status") == "valid"),
|
||||
on="id"
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Pitfall 5: Not Specifying Types
|
||||
|
||||
```python
|
||||
# Bad: Let Polars infer everything
|
||||
df = pl.read_csv("data.csv")
|
||||
|
||||
# Good: Specify types for correctness and performance
|
||||
df = pl.read_csv(
|
||||
"data.csv",
|
||||
dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
|
||||
)
|
||||
```
|
||||
|
||||
### Pitfall 6: Creating Many Small DataFrames
|
||||
|
||||
```python
|
||||
# Bad: Many operations creating intermediate DataFrames
|
||||
df1 = df.filter(pl.col("age") > 25)
|
||||
df2 = df1.select("name", "age")
|
||||
df3 = df2.sort("age")
|
||||
result = df3.head(10)
|
||||
|
||||
# Good: Chain operations
|
||||
result = (
|
||||
df.filter(pl.col("age") > 25)
|
||||
.select("name", "age")
|
||||
.sort("age")
|
||||
.head(10)
|
||||
)
|
||||
|
||||
# Better: Use lazy mode
|
||||
result = (
|
||||
df.lazy()
|
||||
.filter(pl.col("age") > 25)
|
||||
.select("name", "age")
|
||||
.sort("age")
|
||||
.head(10)
|
||||
.collect()
|
||||
)
|
||||
```
|
||||
|
||||
## Memory Management
|
||||
|
||||
### Monitor Memory Usage
|
||||
|
||||
```python
|
||||
# Check DataFrame size
|
||||
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
|
||||
|
||||
# Profile memory during operations
|
||||
lf = pl.scan_csv("large.csv")
|
||||
print(lf.explain()) # See query plan
|
||||
```
|
||||
|
||||
### Reduce Memory Footprint
|
||||
|
||||
```python
|
||||
# 1. Use lazy mode
|
||||
lf = pl.scan_parquet("data.parquet")
|
||||
|
||||
# 2. Stream results
|
||||
result = lf.collect(streaming=True)
|
||||
|
||||
# 3. Select only needed columns
|
||||
lf = lf.select("col1", "col2")
|
||||
|
||||
# 4. Optimize data types
|
||||
df = df.with_columns(
|
||||
pl.col("int_col").cast(pl.Int32), # Downcast if possible
|
||||
pl.col("category").cast(pl.Categorical) # For low cardinality
|
||||
)
|
||||
|
||||
# 5. Drop columns not needed
|
||||
df = df.drop("large_text_col", "unused_col")
|
||||
```
|
||||
|
||||
## Testing and Debugging
|
||||
|
||||
### Inspect Query Plans
|
||||
|
||||
```python
|
||||
lf = pl.scan_csv("data.csv")
|
||||
query = lf.filter(pl.col("age") > 25).select("name", "age")
|
||||
|
||||
# View the optimized query plan
|
||||
print(query.explain())
|
||||
|
||||
# View detailed query plan
|
||||
print(query.explain(optimized=True))
|
||||
```
|
||||
|
||||
### Sample Data for Development
|
||||
|
||||
```python
|
||||
# Use n_rows for testing
|
||||
df = pl.read_csv("large.csv", n_rows=1000)
|
||||
|
||||
# Or sample after reading
|
||||
df_sample = df.sample(n=1000, seed=42)
|
||||
```
|
||||
|
||||
### Validate Schemas
|
||||
|
||||
```python
|
||||
# Check schema
|
||||
print(df.schema)
|
||||
|
||||
# Ensure schema matches expectation
|
||||
expected_schema = {
|
||||
"id": pl.Int64,
|
||||
"name": pl.Utf8,
|
||||
"date": pl.Date
|
||||
}
|
||||
|
||||
assert df.schema == expected_schema
|
||||
```
|
||||
|
||||
### Profile Performance
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
# Time operations
|
||||
start = time.time()
|
||||
result = lf.collect()
|
||||
print(f"Execution time: {time.time() - start:.2f}s")
|
||||
|
||||
# Compare eager vs lazy
|
||||
start = time.time()
|
||||
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
|
||||
eager_time = time.time() - start
|
||||
|
||||
start = time.time()
|
||||
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
|
||||
lazy_time = time.time() - start
|
||||
|
||||
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
|
||||
```
|
||||
|
||||
## File Format Best Practices
|
||||
|
||||
### Choose the Right Format
|
||||
|
||||
**Parquet:**
|
||||
- Best for: Large datasets, archival, data lakes
|
||||
- Pros: Excellent compression, columnar, fast reads
|
||||
- Cons: Not human-readable
|
||||
|
||||
**CSV:**
|
||||
- Best for: Small datasets, human inspection, legacy systems
|
||||
- Pros: Universal, human-readable
|
||||
- Cons: Slow, large file size, no type preservation
|
||||
|
||||
**Arrow IPC:**
|
||||
- Best for: Inter-process communication, temporary storage
|
||||
- Pros: Fastest, zero-copy, preserves all types
|
||||
- Cons: Less compression than Parquet (a quick round-trip is sketched after this list)
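Parquet and CSV round-trips are covered in the sections below; for completeness, a quick Arrow IPC sketch (the file name is illustrative):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# IPC preserves the full Arrow schema and reads back very quickly
df.write_ipc("snapshot.arrow")
lf = pl.scan_ipc("snapshot.arrow")  # lazy scan, analogous to scan_parquet/scan_csv
print(lf.filter(pl.col("value") > 0.15).collect())
```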
|
||||
|
||||
### File Reading Best Practices
|
||||
|
||||
```python
|
||||
# 1. Use lazy reading
|
||||
lf = pl.scan_parquet("data.parquet") # Not read_parquet
|
||||
|
||||
# 2. Read multiple files efficiently
|
||||
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
|
||||
|
||||
# 3. Specify schema when known
|
||||
lf = pl.scan_csv(
|
||||
"data.csv",
|
||||
dtypes={"id": pl.Int64, "date": pl.Date}
|
||||
)
|
||||
|
||||
# 4. Use predicate pushdown
|
||||
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
|
||||
```
|
||||
|
||||
### File Writing Best Practices
|
||||
|
||||
```python
|
||||
# 1. Use Parquet for large data
|
||||
df.write_parquet("output.parquet", compression="zstd")
|
||||
|
||||
# 2. Partition large datasets
|
||||
df.write_parquet("output", partition_by=["year", "month"])
|
||||
|
||||
# 3. Use streaming for very large writes
|
||||
lf.sink_parquet("output.parquet") # Streaming write
|
||||
|
||||
# 4. Optimize compression
|
||||
df.write_parquet(
|
||||
"output.parquet",
|
||||
compression="snappy", # Fast compression
|
||||
statistics=True # Enable predicate pushdown on read
|
||||
)
|
||||
```
|
||||
|
||||
## Code Organization
|
||||
|
||||
### Reusable Expressions
|
||||
|
||||
```python
|
||||
# Define reusable expressions
|
||||
age_group = (
|
||||
pl.when(pl.col("age") < 18)
|
||||
.then("minor")
|
||||
.when(pl.col("age") < 65)
|
||||
.then("adult")
|
||||
.otherwise("senior")
|
||||
)
|
||||
|
||||
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
|
||||
|
||||
# Use in multiple contexts
|
||||
df = df.with_columns(
|
||||
age_group=age_group,
|
||||
rpc=revenue_per_customer
|
||||
)
|
||||
|
||||
# Reuse in filtering
|
||||
df = df.filter(revenue_per_customer > 100)
|
||||
```
|
||||
|
||||
### Pipeline Functions
|
||||
|
||||
```python
|
||||
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
|
||||
"""Clean and standardize data."""
|
||||
return lf.with_columns(
|
||||
pl.col("name").str.to_uppercase(),
|
||||
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
|
||||
pl.col("amount").fill_null(0)
|
||||
)
|
||||
|
||||
def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
|
||||
"""Add computed features."""
|
||||
return lf.with_columns(
|
||||
month=pl.col("date").dt.month(),
|
||||
year=pl.col("date").dt.year(),
|
||||
amount_log=pl.col("amount").log()
|
||||
)
|
||||
|
||||
# Compose pipeline
|
||||
result = (
|
||||
pl.scan_csv("data.csv")
|
||||
.pipe(clean_data)
|
||||
.pipe(add_features)
|
||||
.filter(pl.col("year") == 2023)
|
||||
.collect()
|
||||
)
|
||||
```
|
||||
|
||||
## Documentation
|
||||
|
||||
Always document complex expressions and transformations:
|
||||
|
||||
```python
|
||||
# Good: Document intent
|
||||
df = df.with_columns(
|
||||
# Calculate customer lifetime value as sum of purchases
|
||||
# divided by months since first purchase
|
||||
clv=(
|
||||
pl.col("total_purchases") /
|
||||
((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
|
||||
.dt.total_days() / 30)
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Version Compatibility
|
||||
|
||||
```python
|
||||
# Check Polars version
|
||||
import polars as pl
|
||||
print(pl.__version__)
|
||||
|
||||
# Feature availability varies by version
|
||||
# Document version requirements for production code
|
||||
```
|
||||
378
scientific-packages/polars/references/core_concepts.md
Normal file
378
scientific-packages/polars/references/core_concepts.md
Normal file
@@ -0,0 +1,378 @@
|
||||
# Polars Core Concepts
|
||||
|
||||
## Expressions
|
||||
|
||||
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
|
||||
|
||||
### What are Expressions?
|
||||
|
||||
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
|
||||
- `select()` - Select and transform columns
|
||||
- `with_columns()` - Add or modify columns
|
||||
- `filter()` - Filter rows
|
||||
- `group_by().agg()` - Aggregate data
|
||||
|
||||
### Expression Syntax
|
||||
|
||||
**Basic column reference:**
|
||||
```python
|
||||
pl.col("column_name")
|
||||
```
|
||||
|
||||
**Computed expressions:**
|
||||
```python
|
||||
# Arithmetic
|
||||
pl.col("height") * 2
|
||||
pl.col("price") + pl.col("tax")
|
||||
|
||||
# With alias
|
||||
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
|
||||
|
||||
# Method chaining
|
||||
pl.col("name").str.to_uppercase().str.slice(0, 3)
|
||||
```
|
||||
|
||||
### Expression Contexts
|
||||
|
||||
**Select context:**
|
||||
```python
|
||||
df.select(
|
||||
"name", # Simple column name
|
||||
pl.col("age"), # Expression
|
||||
(pl.col("age") * 12).alias("age_in_months") # Computed expression
|
||||
)
|
||||
```
|
||||
|
||||
**With_columns context:**
|
||||
```python
|
||||
df.with_columns(
|
||||
age_doubled=pl.col("age") * 2,
|
||||
name_upper=pl.col("name").str.to_uppercase()
|
||||
)
|
||||
```
|
||||
|
||||
**Filter context:**
|
||||
```python
|
||||
df.filter(
|
||||
pl.col("age") > 25,
|
||||
pl.col("city").is_in(["NY", "LA", "SF"])
|
||||
)
|
||||
```
|
||||
|
||||
**Group_by context:**
|
||||
```python
|
||||
df.group_by("department").agg(
|
||||
pl.col("salary").mean(),
|
||||
pl.col("employee_id").count()
|
||||
)
|
||||
```
|
||||
|
||||
### Expression Expansion
|
||||
|
||||
Apply operations to multiple columns at once:
|
||||
|
||||
**All columns:**
|
||||
```python
|
||||
df.select(pl.all() * 2)
|
||||
```
|
||||
|
||||
**Pattern matching:**
|
||||
```python
|
||||
# All columns ending with "_value"
|
||||
df.select(pl.col("^.*_value$") * 100)
|
||||
|
||||
# All numeric columns
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES) + 1)
|
||||
```
|
||||
|
||||
**Exclude patterns:**
|
||||
```python
|
||||
df.select(pl.all().exclude("id", "name"))
|
||||
```
|
||||
|
||||
### Expression Composition
|
||||
|
||||
Expressions can be stored and reused:
|
||||
|
||||
```python
|
||||
# Define reusable expressions
|
||||
age_expression = pl.col("age") * 12
|
||||
name_expression = pl.col("name").str.to_uppercase()
|
||||
|
||||
# Use in multiple contexts
|
||||
df.select(age_expression, name_expression)
|
||||
df.with_columns(age_months=age_expression)
|
||||
```
|
||||
|
||||
## Data Types
|
||||
|
||||
Polars has a strict type system based on Apache Arrow. The example after the type list below shows a schema built from several of these types.
|
||||
|
||||
### Core Data Types
|
||||
|
||||
**Numeric:**
|
||||
- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
|
||||
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
|
||||
- `Float32`, `Float64` - Floating point numbers
|
||||
|
||||
**Text:**
|
||||
- `Utf8` / `String` - UTF-8 encoded strings
|
||||
- `Categorical` - Categorized strings (low cardinality)
|
||||
- `Enum` - Fixed set of string values
|
||||
|
||||
**Temporal:**
|
||||
- `Date` - Calendar date (no time)
|
||||
- `Datetime` - Date and time with optional timezone
|
||||
- `Time` - Time of day
|
||||
- `Duration` - Time duration/difference
|
||||
|
||||
**Boolean:**
|
||||
- `Boolean` - True/False values
|
||||
|
||||
**Nested:**
|
||||
- `List` - Variable-length lists
|
||||
- `Array` - Fixed-length arrays
|
||||
- `Struct` - Nested record structures
|
||||
|
||||
**Other:**
|
||||
- `Binary` - Binary data
|
||||
- `Object` - Python objects (avoid in production)
|
||||
- `Null` - Null type
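A small sketch constructing a frame with an explicit schema drawn from the types above (column names and values are illustrative):

```python
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": [1, 2],
        "category": ["a", "b"],
        "signed_up": [date(2023, 1, 1), date(2023, 6, 1)],
        "active": [True, False],
    },
    schema={
        "id": pl.UInt32,
        "category": pl.Categorical,
        "signed_up": pl.Date,
        "active": pl.Boolean,
    },
)
print(df.schema)
```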
|
||||
|
||||
### Type Casting
|
||||
|
||||
Convert between types explicitly:
|
||||
|
||||
```python
|
||||
# Cast to different type
|
||||
df.select(
|
||||
pl.col("age").cast(pl.Float64),
|
||||
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
|
||||
pl.col("id").cast(pl.Utf8)
|
||||
)
|
||||
```
|
||||
|
||||
### Null Handling
|
||||
|
||||
Polars uses consistent null handling across all types:
|
||||
|
||||
**Check for nulls:**
|
||||
```python
|
||||
df.filter(pl.col("value").is_null())
|
||||
df.filter(pl.col("value").is_not_null())
|
||||
```
|
||||
|
||||
**Fill nulls:**
|
||||
```python
|
||||
pl.col("value").fill_null(0)
|
||||
pl.col("value").fill_null(strategy="forward")
|
||||
pl.col("value").fill_null(strategy="backward")
|
||||
pl.col("value").fill_null(strategy="mean")
|
||||
```
|
||||
|
||||
**Drop nulls:**
|
||||
```python
|
||||
df.drop_nulls() # Drop any row with nulls
|
||||
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
|
||||
```
|
||||
|
||||
### Categorical Data
|
||||
|
||||
Use categorical types for string columns with low cardinality (repeated values):
|
||||
|
||||
```python
|
||||
# Cast to categorical
|
||||
df.with_columns(
|
||||
pl.col("category").cast(pl.Categorical)
|
||||
)
|
||||
|
||||
# Benefits:
|
||||
# - Reduced memory usage
|
||||
# - Faster grouping and joining
|
||||
# - Maintains order information
|
||||
```
|
||||
|
||||
## Lazy vs Eager Evaluation

Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).

### Eager Evaluation (DataFrame)

Operations execute immediately:

```python
import polars as pl

# DataFrame operations execute right away
df = pl.read_csv("data.csv")             # Reads file immediately
result = df.filter(pl.col("age") > 25)   # Filters immediately
final = result.select("name", "age")     # Selects immediately
```

**When to use eager:**
- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback needed

### Lazy Evaluation (LazyFrame)

Operations build a query plan, which is optimized before execution:

```python
import polars as pl

# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv")           # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25)    # Adds to plan
lf3 = lf2.select("name", "age")        # Adds to plan
df = lf3.collect()                     # NOW executes optimized plan
```

**When to use lazy:**
- Large datasets
- Complex query pipelines
- Only need a subset of data
- Performance is critical
- Streaming required

### Query Optimization

Polars automatically optimizes lazy queries:

**Predicate Pushdown:**
Filter operations are pushed down to the data source when possible:
```python
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
```

**Projection Pushdown:**
Only the needed columns are read from the data source:
```python
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
```

**Query Plan Inspection:**
```python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain())  # Shows optimized plan
```

### Streaming Mode

Process data larger than memory:

```python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
```

**Streaming benefits:**
- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management

**Streaming limitations:**
- Not all operations support streaming
- May be slower for small data
- Some operations require materializing the entire dataset

### Converting Between Eager and Lazy

**Eager to Lazy:**
```python
df = pl.read_csv("data.csv")
lf = df.lazy()  # Convert to LazyFrame
```

**Lazy to Eager:**
```python
lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute and return DataFrame
```

## Memory Format

Polars uses the Apache Arrow columnar memory format:

**Benefits:**
- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization

**Implications:**
- Data is stored column-wise, not row-wise
- Column operations are very fast
- Random row access is slower than in pandas
- Best for analytical workloads

The short sketch below makes the Arrow interoperability point concrete.

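A minimal sketch of a round trip through PyArrow, assuming `pyarrow` is installed (the column names are illustrative):

```python
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"x": [1, 2, 3], "label": ["a", "b", "c"]})

# DataFrame -> Arrow table (shares the columnar buffers where possible)
table = df.to_arrow()
print(type(table))  # pyarrow Table

# Arrow table -> DataFrame
df2 = pl.from_arrow(table)

# Column-wise operations stay fast because each column is a contiguous buffer
print(df2.select(pl.col("x").sum()))
```
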
## Parallelization

Polars parallelizes operations automatically using Rust's concurrency:

**What gets parallelized:**
- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations

**What to avoid for parallelization:**
- Python user-defined functions (UDFs)
- Lambda functions in `.map_elements()`
- Sequential `.pipe()` chains

**Best practice:**
```python
# Good: Stays in expression API (parallelized)
df.with_columns(
    pl.col("value") * 10,
    pl.col("value").log(),
    pl.col("value").sqrt()
)

# Bad: Uses Python function (sequential)
df.with_columns(
    pl.col("value").map_elements(lambda x: x * 10)
)
```

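When element-wise logic seems to call for a Python lambda, it can often be rewritten with native expressions such as `when/then/otherwise`, which keeps the work in the parallelized engine. The example below is an illustrative sketch with made-up column names:

```python
import polars as pl

df = pl.DataFrame({"value": [5, 15, 25]})

# Instead of: pl.col("value").map_elements(lambda x: "high" if x > 10 else "low")
df.with_columns(
    pl.when(pl.col("value") > 10)
    .then(pl.lit("high"))
    .otherwise(pl.lit("low"))
    .alias("bucket")
)
```
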
## Strict Type System

Polars enforces strict typing:

**No silent conversions:**
```python
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")

# Must cast explicitly
df.with_columns(
    pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
```

**Benefits:**
- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent

**Integer nulls:**
Unlike pandas, integer columns can have nulls without converting to float:
```python
# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)
```

557
scientific-packages/polars/references/io_guide.md
Normal file
@@ -0,0 +1,557 @@

# Polars Data I/O Guide

Comprehensive guide to reading and writing data in various formats with Polars.

## CSV Files

### Reading CSV

**Eager mode (loads into memory):**
```python
import polars as pl

# Basic read
df = pl.read_csv("data.csv")

# With options
df = pl.read_csv(
    "data.csv",
    separator=",",
    has_header=True,
    columns=["col1", "col2"],                     # Select specific columns
    n_rows=1000,                                  # Read only first 1000 rows
    skip_rows=10,                                 # Skip first 10 rows
    dtypes={"col1": pl.Int64, "col2": pl.Utf8},   # Specify types
    null_values=["NA", "null", ""],               # Define null values
    encoding="utf8",
    ignore_errors=False
)
```

**Lazy mode (scans without loading - recommended for large files):**
```python
# Scan CSV (builds query plan)
lf = pl.scan_csv("data.csv")

# Apply operations
result = lf.filter(pl.col("age") > 25).select("name", "age")

# Execute and load
df = result.collect()
```

### Writing CSV

```python
# Basic write
df.write_csv("output.csv")

# With options
df.write_csv(
    "output.csv",
    separator=",",
    include_header=True,
    null_value="",        # How to represent nulls
    quote_char='"',
    line_terminator="\n"
)
```

### Multiple CSV Files

**Read multiple files:**
```python
# Read all CSVs in a directory
lf = pl.scan_csv("data/*.csv")

# Read specific files
lf = pl.scan_csv(["file1.csv", "file2.csv", "file3.csv"])
```

## Parquet Files

Parquet is the recommended format for performance and compression.

### Reading Parquet

**Eager:**
```python
df = pl.read_parquet("data.parquet")

# With options
df = pl.read_parquet(
    "data.parquet",
    columns=["col1", "col2"],  # Select specific columns
    n_rows=1000,               # Read first N rows
    parallel="auto"            # Control parallelization
)
```

**Lazy (recommended):**
```python
lf = pl.scan_parquet("data.parquet")

# Automatic predicate and projection pushdown
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```

### Writing Parquet

```python
# Basic write
df.write_parquet("output.parquet")

# With compression
df.write_parquet(
    "output.parquet",
    compression="snappy",   # Options: "snappy", "gzip", "brotli", "lz4", "zstd"
    statistics=True,        # Write statistics (enables predicate pushdown)
    use_pyarrow=False       # Use Rust writer (faster)
)
```

### Partitioned Parquet (Hive-style)

**Write partitioned:**
```python
# Write with partitioning
df.write_parquet(
    "output_dir",
    partition_by=["year", "month"]  # Creates directory structure
)
# Creates: output_dir/year=2023/month=01/data.parquet
```

**Read partitioned:**
```python
lf = pl.scan_parquet("output_dir/**/*.parquet")

# Hive partitioning columns are automatically added
result = lf.filter(pl.col("year") == 2023).collect()
```

## JSON Files

### Reading JSON

**NDJSON (newline-delimited JSON) - recommended:**
```python
df = pl.read_ndjson("data.ndjson")

# Lazy
lf = pl.scan_ndjson("data.ndjson")
```

**Standard JSON:**
```python
df = pl.read_json("data.json")

# From an in-memory JSON string, pass a file-like object
import io
df = pl.read_json(io.StringIO('{"col1": [1, 2], "col2": ["a", "b"]}'))
```

### Writing JSON

```python
# Write NDJSON (one JSON object per line)
df.write_ndjson("output.ndjson")

# Write standard JSON
df.write_json("output.json")
```

## Excel Files

### Reading Excel

```python
# Read first sheet
df = pl.read_excel("data.xlsx")

# Specific sheet by name
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
# Or by 1-based sheet number
df = pl.read_excel("data.xlsx", sheet_id=1)

# With options
df = pl.read_excel(
    "data.xlsx",
    sheet_name="Sheet1",
    columns=["A", "B", "C"],  # Column names or indices
    n_rows=100,
    skip_rows=5,
    has_header=True
)
```

### Writing Excel

```python
# Write to Excel
df.write_excel("output.xlsx")

# Multiple sheets via an xlsxwriter Workbook
from xlsxwriter import Workbook

with Workbook("output.xlsx") as workbook:
    df1.write_excel(workbook, worksheet="Sheet1")
    df2.write_excel(workbook, worksheet="Sheet2")
```

## Database Connectivity

### Read from Database

```python
import polars as pl
from sqlalchemy import create_engine

# Read using an existing connection or engine object
engine = create_engine("postgresql://user:pass@localhost/db")
df = pl.read_database("SELECT * FROM users", connection=engine)

# Read via a connection URI (uses connectorx for better performance)
df = pl.read_database_uri(
    "SELECT * FROM users WHERE age > 25",
    uri="postgresql://user:pass@localhost/db"
)
```

### Write to Database

```python
# Using SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/db")
df.write_database("table_name", connection=engine)

# With options
df.write_database(
    "table_name",
    connection=engine,
    if_table_exists="replace",  # or "append", "fail"
)
```

### Common Database Connectors

**PostgreSQL:**
```python
uri = "postgresql://username:password@localhost:5432/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```

**MySQL:**
```python
uri = "mysql://username:password@localhost:3306/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```

**SQLite:**
```python
uri = "sqlite:///path/to/database.db"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```

## Cloud Storage

### AWS S3

```python
# Read from S3
df = pl.read_parquet("s3://bucket/path/to/file.parquet")
lf = pl.scan_parquet("s3://bucket/path/*.parquet")

# Write to S3
df.write_parquet("s3://bucket/path/output.parquet")

# With credentials
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
os.environ["AWS_REGION"] = "us-west-2"

df = pl.read_parquet("s3://bucket/file.parquet")
```

### Azure Blob Storage

```python
# Read from Azure
df = pl.read_parquet("az://container/path/file.parquet")

# Write to Azure
df.write_parquet("az://container/path/output.parquet")

# With credentials
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "account"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "key"
```

### Google Cloud Storage (GCS)

```python
# Read from GCS
df = pl.read_parquet("gs://bucket/path/file.parquet")

# Write to GCS
df.write_parquet("gs://bucket/path/output.parquet")

# With credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"
```

## Google BigQuery

```python
# Read from BigQuery via a connection URI (connectorx)
df = pl.read_database_uri(
    "SELECT * FROM project.dataset.table",
    uri="bigquery://project"
)

# Or using the Google Cloud SDK
from google.cloud import bigquery
client = bigquery.Client()

query = "SELECT * FROM project.dataset.table WHERE date > '2023-01-01'"
df = pl.from_pandas(client.query(query).to_dataframe())
```

## Apache Arrow

### IPC/Feather Format

**Read:**
```python
df = pl.read_ipc("data.arrow")
lf = pl.scan_ipc("data.arrow")
```

**Write:**
```python
df.write_ipc("output.arrow")

# Compressed
df.write_ipc("output.arrow", compression="zstd")
```

### Arrow Streaming

```python
# Write the streaming IPC format
df.write_ipc_stream("output.arrows", compression="zstd")

# Read the streaming IPC format
df = pl.read_ipc_stream("output.arrows")
```

### From/To Arrow

```python
import pyarrow as pa

# From Arrow Table
arrow_table = pa.table({"col": [1, 2, 3]})
df = pl.from_arrow(arrow_table)

# To Arrow Table
arrow_table = df.to_arrow()
```

## In-Memory Formats

### Python Dictionaries

```python
# From dict
df = pl.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"]
})

# To dict
data_dict = df.to_dict()                 # Column-oriented (dict of Series)
data_dict = df.to_dict(as_series=False)  # Lists instead of Series
```

### NumPy Arrays

```python
import numpy as np

# From NumPy
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pl.DataFrame(arr, schema=["col1", "col2"])

# To NumPy
arr = df.to_numpy()
```

### Pandas DataFrames

```python
import pandas as pd

# From Pandas
pd_df = pd.DataFrame({"col": [1, 2, 3]})
pl_df = pl.from_pandas(pd_df)

# To Pandas
pd_df = pl_df.to_pandas()

# Both conversions go through Apache Arrow and are zero-copy when possible
```

### Lists of Rows

```python
# From list of dicts
data = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
]
df = pl.DataFrame(data)

# To list of dicts
rows = df.to_dicts()

# From list of tuples
data = [("Alice", 25), ("Bob", 30)]
df = pl.DataFrame(data, schema=["name", "age"])
```

## Streaming Large Files

For datasets larger than memory, use lazy mode with streaming:

```python
# Streaming mode
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)

# Streaming with multiple files
lf = pl.scan_parquet("data/*.parquet")
result = lf.group_by("category").agg(pl.col("value").sum()).collect(streaming=True)
```

## Best Practices

### Format Selection

**Use Parquet when:**
- Need compression (up to 10x smaller than CSV)
- Want fast reads/writes
- Need to preserve data types
- Working with large datasets
- Need predicate pushdown

**Use CSV when:**
- Need human-readable format
- Interfacing with legacy systems
- Data is small
- Need universal compatibility

**Use JSON when:**
- Working with nested/hierarchical data
- Need web API compatibility
- Data has flexible schema

**Use Arrow IPC when:**
- Need zero-copy data sharing
- Fastest serialization required
- Working between Arrow-compatible systems

A small CSV-versus-Parquet size comparison follows these lists.

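As a rough illustration of the size difference, the sketch below writes the same synthetic DataFrame to CSV and to zstd-compressed Parquet and compares the files on disk. The data and numbers are illustrative; exact ratios depend on your data.

```python
import os
import tempfile
import polars as pl

df = pl.DataFrame({
    "id": list(range(100_000)),
    "category": ["red", "green", "blue", "blue"] * 25_000,
    "value": [float(i) * 0.5 for i in range(100_000)],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    parquet_path = os.path.join(tmp, "data.parquet")

    df.write_csv(csv_path)
    df.write_parquet(parquet_path, compression="zstd")

    print("csv bytes:    ", os.path.getsize(csv_path))
    print("parquet bytes:", os.path.getsize(parquet_path))
```
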
### Reading Large Files

```python
# 1. Always use lazy mode
lf = pl.scan_csv("large.csv")  # NOT read_csv

# 2. Filter and select early (pushdown optimization)
result = (
    lf
    .select("col1", "col2", "col3")          # Only needed columns
    .filter(pl.col("date") > "2023-01-01")   # Filter early
    .collect()
)

# 3. Use streaming for very large data
result = lf.filter(...).select(...).collect(streaming=True)

# 4. Read only needed rows during development
df = pl.read_csv("large.csv", n_rows=10000)  # Sample for testing
```

### Writing Large Files

```python
# 1. Use Parquet with compression
df.write_parquet("output.parquet", compression="zstd")

# 2. Use partitioning for very large datasets
df.write_parquet("output", partition_by=["year", "month"])

# 3. Write streaming
lf = pl.scan_csv("input.csv")
lf.sink_parquet("output.parquet")  # Streaming write
```

### Performance Tips

```python
# 1. Specify dtypes when reading CSV
df = pl.read_csv(
    "data.csv",
    dtypes={"id": pl.Int64, "name": pl.Utf8}  # Avoids inference
)

# 2. Use appropriate compression
df.write_parquet("output.parquet", compression="snappy")  # Fast
df.write_parquet("output.parquet", compression="zstd")    # Better compression

# 3. CSV parsing is multithreaded by default; n_threads can tune it
df = pl.read_csv("data.csv", n_threads=4)

# 4. Read multiple files in parallel
lf = pl.scan_parquet("data/*.parquet")  # Automatic parallel read
```

## Error Handling

```python
try:
    df = pl.read_csv("data.csv")
except pl.exceptions.ComputeError as e:
    print(f"Error reading CSV: {e}")

# Ignore errors during parsing
df = pl.read_csv("messy.csv", ignore_errors=True)

# Handle missing files
from pathlib import Path
if Path("data.csv").exists():
    df = pl.read_csv("data.csv")
else:
    print("File not found")
```

## Schema Management

```python
# Infer schema from a sample
schema = pl.read_csv("data.csv", n_rows=1000).schema

# Use inferred schema for full read
df = pl.read_csv("data.csv", dtypes=schema)

# Define schema explicitly
schema = {
    "id": pl.Int64,
    "name": pl.Utf8,
    "date": pl.Date,
    "value": pl.Float64
}
df = pl.read_csv("data.csv", dtypes=schema)
```