From 6ddea4786e78eb6a993587e0c83ac86ead0e82b3 Mon Sep 17 00:00:00 2001 From: Timothy Kassis Date: Mon, 3 Nov 2025 16:09:01 -0800 Subject: [PATCH] Improve the anndata skill --- scientific-packages/anndata/SKILL.md | 719 +++++++----------- .../anndata/references/api_reference.md | 218 ------ .../anndata/references/best_practices.md | 525 +++++++++++++ .../anndata/references/concatenation.md | 396 ++++++++++ .../anndata/references/concatenation_guide.md | 478 ------------ .../anndata/references/data_structure.md | 314 ++++++++ .../anndata/references/io_operations.md | 404 ++++++++++ .../anndata/references/manipulation.md | 516 +++++++++++++ .../references/workflows_best_practices.md | 438 ----------- 9 files changed, 2448 insertions(+), 1560 deletions(-) delete mode 100644 scientific-packages/anndata/references/api_reference.md create mode 100644 scientific-packages/anndata/references/best_practices.md create mode 100644 scientific-packages/anndata/references/concatenation.md delete mode 100644 scientific-packages/anndata/references/concatenation_guide.md create mode 100644 scientific-packages/anndata/references/data_structure.md create mode 100644 scientific-packages/anndata/references/io_operations.md create mode 100644 scientific-packages/anndata/references/manipulation.md delete mode 100644 scientific-packages/anndata/references/workflows_best_practices.md diff --git a/scientific-packages/anndata/SKILL.md b/scientific-packages/anndata/SKILL.md index ebe32fe..2e40999 100644 --- a/scientific-packages/anndata/SKILL.md +++ b/scientific-packages/anndata/SKILL.md @@ -1,527 +1,394 @@ --- name: anndata -description: "Manipulate AnnData objects for single-cell genomics. Load/save .h5ad files, manage obs/var metadata, layers, embeddings (PCA/UMAP), concatenate datasets, for scRNA-seq workflows." +description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools. --- # AnnData ## Overview -AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. Work with AnnData objects for data creation, manipulation, file I/O, concatenation, and memory-efficient workflows. +AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis. -## Core Capabilities +## When to Use This Skill -### 1. Creating and Structuring AnnData Objects +Use this skill when: +- Creating, reading, or writing AnnData objects +- Working with h5ad, zarr, or other genomics data formats +- Performing single-cell RNA-seq analysis +- Managing large datasets with sparse matrices or backed mode +- Concatenating multiple datasets or experimental batches +- Subsetting, filtering, or transforming annotated data +- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools -Create AnnData objects from various data sources and organize multi-dimensional annotations. 
+## Installation -**Basic creation:** +```bash +pip install anndata + +# With optional dependencies +pip install anndata[dev,test,doc] +``` + +## Quick Start + +### Creating an AnnData object ```python import anndata as ad import numpy as np -from scipy.sparse import csr_matrix - -# From dense or sparse arrays -counts = np.random.poisson(1, size=(100, 2000)) -adata = ad.AnnData(counts) - -# With sparse matrix (memory-efficient) -counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32) -adata = ad.AnnData(counts) -``` - -**With metadata:** -```python import pandas as pd -obs_meta = pd.DataFrame({ - 'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']), - 'batch': ['batch1'] * 50 + ['batch2'] * 50 -}) -var_meta = pd.DataFrame({ - 'gene_name': [f'Gene_{i}' for i in range(2000)], - 'highly_variable': np.random.choice([True, False], 2000) -}) +# Minimal creation +X = np.random.rand(100, 2000) # 100 cells × 2000 genes +adata = ad.AnnData(X) -adata = ad.AnnData(counts, obs=obs_meta, var=var_meta) +# With metadata +obs = pd.DataFrame({ + 'cell_type': ['T cell', 'B cell'] * 50, + 'sample': ['A', 'B'] * 50 +}, index=[f'cell_{i}' for i in range(100)]) + +var = pd.DataFrame({ + 'gene_name': [f'Gene_{i}' for i in range(2000)] +}, index=[f'ENSG{i:05d}' for i in range(2000)]) + +adata = ad.AnnData(X=X, obs=obs, var=var) ``` -**Understanding the structure:** -- **X**: Primary data matrix (observations × variables) -- **obs**: Row (observation) annotations as DataFrame -- **var**: Column (variable) annotations as DataFrame -- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates) -- **varm**: Multi-dimensional variable annotations (e.g., gene loadings) -- **layers**: Alternative data matrices with same dimensions as X -- **uns**: Unstructured metadata dictionary -- **obsp/varp**: Pairwise relationship matrices (graphs) - -### 2. Adding Annotations and Layers - -Organize different data representations and metadata within a single object. - -**Cell-level metadata (obs):** +### Reading data ```python -adata.obs['n_genes'] = (adata.X > 0).sum(axis=1) -adata.obs['total_counts'] = adata.X.sum(axis=1) -adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50) -``` +# Read h5ad file +adata = ad.read_h5ad('data.h5ad') -**Gene-level metadata (var):** -```python -adata.var['highly_variable'] = gene_variance > threshold -adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...]) -``` +# Read with backed mode (for large files) +adata = ad.read_h5ad('large_data.h5ad', backed='r') -**Embeddings (obsm/varm):** -```python -# Dimensionality reduction results -adata.obsm['X_pca'] = pca_coordinates # Shape: (n_obs, n_components) -adata.obsm['X_umap'] = umap_coordinates # Shape: (n_obs, 2) -adata.obsm['X_tsne'] = tsne_coordinates - -# Gene loadings -adata.varm['PCs'] = principal_components # Shape: (n_vars, n_components) -``` - -**Alternative data representations (layers):** -```python -# Store multiple versions -adata.layers['counts'] = raw_counts -adata.layers['log1p'] = np.log1p(adata.X) -adata.layers['scaled'] = (adata.X - mean) / std -``` - -**Unstructured metadata (uns):** -```python -# Analysis parameters -adata.uns['preprocessing'] = { - 'normalization': 'TPM', - 'min_genes': 200, - 'date': '2024-01-15' -} - -# Results -adata.uns['pca'] = {'variance_ratio': variance_explained} -``` - -### 3. Subsetting and Views - -Efficiently subset data while managing memory through views and copies. 
- -**Subsetting operations:** -```python -# By observation/variable names -subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']] - -# By boolean masks -b_cells = adata[adata.obs.cell_type == 'B'] -high_quality = adata[adata.obs.n_genes > 200] - -# By position -first_cells = adata[:100, :] -top_genes = adata[:, :500] - -# Combined conditions -filtered = adata[ - (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200), - adata.var.highly_variable -] -``` - -**Understanding views:** -- Subsetting returns **views** by default (memory-efficient, shares data with original) -- Modifying a view affects the original object -- Check with `adata.is_view` -- Convert to independent copy with `.copy()` - -```python -# View (memory-efficient) -subset = adata[adata.obs.condition == 'treated'] -print(subset.is_view) # True - -# Independent copy -subset_copy = adata[adata.obs.condition == 'treated'].copy() -print(subset_copy.is_view) # False -``` - -### 4. File I/O and Backed Mode - -Read and write data efficiently, with options for memory-limited environments. - -**Writing data:** -```python -# Standard format with compression -adata.write('results.h5ad', compression='gzip') - -# Alternative formats -adata.write_zarr('results.zarr') # For cloud storage -adata.write_loom('results.loom') # For compatibility -adata.write_csvs('results/') # As CSV files -``` - -**Reading data:** -```python -# Load into memory -adata = ad.read_h5ad('results.h5ad') - -# Backed mode (disk-backed, memory-efficient) -adata = ad.read_h5ad('large_file.h5ad', backed='r') -print(adata.isbacked) # True -print(adata.filename) # Path to file - -# Close file connection when done -adata.file.close() -``` - -**Reading from other formats:** -```python -# 10X format -adata = ad.read_mtx('matrix.mtx') - -# CSV +# Read other formats adata = ad.read_csv('data.csv') - -# Loom adata = ad.read_loom('data.loom') +adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') ``` -**Working with backed mode:** +### Writing data ```python -# Read in backed mode for large files -adata = ad.read_h5ad('large_dataset.h5ad', backed='r') +# Write h5ad file +adata.write_h5ad('output.h5ad') -# Process in chunks -for chunk in adata.chunk_X(chunk_size=1000): - result = process_chunk(chunk) +# Write with compression +adata.write_h5ad('output.h5ad', compression='gzip') -# Load to memory if needed -adata_memory = adata.to_memory() +# Write other formats +adata.write_zarr('output.zarr') +adata.write_csvs('output_dir/') ``` -### 5. Concatenating Multiple Datasets - -Combine multiple AnnData objects with control over how data is merged. - -**Basic concatenation:** +### Basic operations ```python -# Concatenate observations (most common) -combined = ad.concat([adata1, adata2, adata3], axis=0) +# Subset by conditions +t_cells = adata[adata.obs['cell_type'] == 'T cell'] -# Concatenate variables (rare) -combined = ad.concat([adata1, adata2], axis=1) +# Subset by indices +subset = adata[0:50, 0:100] + +# Add metadata +adata.obs['quality_score'] = np.random.rand(adata.n_obs) +adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8 + +# Access dimensions +print(f"{adata.n_obs} observations × {adata.n_vars} variables") ``` -**Join strategies:** -```python -# Inner join: only shared variables (no missing data) -combined = ad.concat([adata1, adata2], join='inner') +## Core Capabilities -# Outer join: all variables (fills missing with 0) -combined = ad.concat([adata1, adata2], join='outer') +### 1. 
Data Structure + +Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components. + +**See**: `references/data_structure.md` for comprehensive information on: +- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw) +- Creating AnnData objects from various sources +- Accessing and manipulating data components +- Memory-efficient practices + +### 2. Input/Output Operations + +Read and write data in various formats with support for compression, backed mode, and cloud storage. + +**See**: `references/io_operations.md` for details on: +- Native formats (h5ad, zarr) +- Alternative formats (CSV, MTX, Loom, 10X, Excel) +- Backed mode for large datasets +- Remote data access +- Format conversion +- Performance optimization + +Common commands: +```python +# Read/write h5ad +adata = ad.read_h5ad('data.h5ad', backed='r') +adata.write_h5ad('output.h5ad', compression='gzip') + +# Read 10X data +adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') + +# Read MTX format +adata = ad.read_mtx('matrix.mtx').T ``` -**Tracking data sources:** +### 3. Concatenation + +Combine multiple AnnData objects along observations or variables with flexible join strategies. + +**See**: `references/concatenation.md` for comprehensive coverage of: +- Basic concatenation (axis=0 for observations, axis=1 for variables) +- Join types (inner, outer) +- Merge strategies (same, unique, first, only) +- Tracking data sources with labels +- Lazy concatenation (AnnCollection) +- On-disk concatenation for large datasets + +Common commands: ```python -# Add source labels -combined = ad.concat( +# Concatenate observations (combine samples) +adata = ad.concat( [adata1, adata2, adata3], - label='dataset', - keys=['exp1', 'exp2', 'exp3'] -) -# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3' - -# Make duplicate indices unique -combined = ad.concat( - [adata1, adata2], - keys=['batch1', 'batch2'], - index_unique='-' -) -# Cell names become: Cell_0-batch1, Cell_0-batch2, etc. -``` - -**Merge strategies for metadata:** -```python -# merge=None: exclude variable annotations (default) -combined = ad.concat([adata1, adata2], merge=None) - -# merge='same': keep only identical annotations -combined = ad.concat([adata1, adata2], merge='same') - -# merge='first': use first occurrence -combined = ad.concat([adata1, adata2], merge='first') - -# merge='unique': keep annotations with single value -combined = ad.concat([adata1, adata2], merge='unique') -``` - -**Complete example:** -```python -# Load batches -batch1 = ad.read_h5ad('batch1.h5ad') -batch2 = ad.read_h5ad('batch2.h5ad') -batch3 = ad.read_h5ad('batch3.h5ad') - -# Concatenate with full tracking -combined = ad.concat( - [batch1, batch2, batch3], axis=0, - join='outer', # Keep all genes - merge='first', # Use first batch's annotations - label='batch_id', # Track source - keys=['b1', 'b2', 'b3'], # Custom labels - index_unique='-' # Make cell names unique + join='inner', + label='batch', + keys=['batch1', 'batch2', 'batch3'] +) + +# Concatenate variables (combine modalities) +adata = ad.concat([adata_rna, adata_protein], axis=1) + +# Lazy concatenation +from anndata.experimental import AnnCollection +collection = AnnCollection( + ['data1.h5ad', 'data2.h5ad'], + join_obs='outer', + label='dataset' ) ``` -### 6. Data Conversion and Extraction +### 4. Data Manipulation -Convert between AnnData and other formats for interoperability. +Transform, subset, filter, and reorganize data efficiently. 
-**To DataFrame:** +**See**: `references/manipulation.md` for detailed guidance on: +- Subsetting (by indices, names, boolean masks, metadata conditions) +- Transposition +- Copying (full copies vs views) +- Renaming (observations, variables, categories) +- Type conversions (strings to categoricals, sparse/dense) +- Adding/removing data components +- Reordering +- Quality control filtering + +Common commands: ```python -# Convert X to DataFrame -df = adata.to_df() +# Subset by metadata +filtered = adata[adata.obs['quality_score'] > 0.8] +hv_genes = adata[:, adata.var['highly_variable']] -# Convert specific layer -df = adata.to_df(layer='log1p') +# Transpose +adata_T = adata.T + +# Copy vs view +view = adata[0:100, :] # View (lightweight reference) +copy = adata[0:100, :].copy() # Independent copy + +# Convert strings to categoricals +adata.strings_to_categoricals() ``` -**Extract vectors:** -```python -# Get 1D arrays from data or annotations -gene_expression = adata.obs_vector('Gene_100') -cell_metadata = adata.obs_vector('n_genes') -``` - -**Transpose:** -```python -# Swap observations and variables -transposed = adata.T -``` - -### 7. Memory Optimization - -Strategies for working with large datasets efficiently. - -**Use sparse matrices:** +### 5. Best Practices + +Follow recommended patterns for memory efficiency, performance, and reproducibility. + +**See**: `references/best_practices.md` for guidelines on: +- Memory management (sparse matrices, categoricals, backed mode) +- Views vs copies +- Data storage optimization +- Performance optimization +- Working with raw data +- Metadata management +- Reproducibility +- Error handling +- Integration with other tools +- Common pitfalls and solutions + +Key recommendations: ```python +# Use sparse matrices for sparse data from scipy.sparse import csr_matrix +adata.X = csr_matrix(adata.X) -# Check sparsity -density = (adata.X != 0).sum() / adata.X.size -if density < 0.3: # Less than 30% non-zero - adata.X = csr_matrix(adata.X) -``` - -**Convert strings to categoricals:** -```python -# Automatic conversion +# Convert strings to categoricals adata.strings_to_categoricals() -# Manual conversion (more control) -adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type']) +# Use backed mode for large files +adata = ad.read_h5ad('large.h5ad', backed='r') + +# Store raw before filtering +adata.raw = adata.copy() +adata = adata[:, adata.var['highly_variable']] ``` -**Use backed mode:** -```python -# Read without loading into memory -adata = ad.read_h5ad('large_file.h5ad', backed='r') +## Integration with Scverse Ecosystem -# Work with subsets -subset = adata[:1000, :500].copy() # Only this subset in memory +AnnData serves as the foundational data structure for the scverse ecosystem: + +### Scanpy (Single-cell analysis) +```python +import scanpy as sc + +# Preprocessing +sc.pp.filter_cells(adata, min_genes=200) +sc.pp.normalize_total(adata, target_sum=1e4) +sc.pp.log1p(adata) +sc.pp.highly_variable_genes(adata, n_top_genes=2000) + +# Dimensionality reduction +sc.pp.pca(adata, n_comps=50) +sc.pp.neighbors(adata, n_neighbors=15) +sc.tl.umap(adata) +sc.tl.leiden(adata) + +# Visualization +sc.pl.umap(adata, color=['cell_type', 'leiden']) ``` -**Chunked processing:** +### Muon (Multimodal data) ```python -# Process data in chunks -results = [] -for chunk in adata.chunk_X(chunk_size=1000): - result = expensive_computation(chunk) - results.append(result) +import muon as mu + +# Combine RNA and protein data +mdata = mu.MuData({'rna': adata_rna, 'protein': 
adata_protein}) +``` + +### PyTorch integration +```python +from anndata.experimental import AnnLoader + +# Create DataLoader for deep learning +dataloader = AnnLoader(adata, batch_size=128, shuffle=True) + +for batch in dataloader: + X = batch.X + # Train model ``` ## Common Workflows -### Single-Cell RNA-seq Analysis - -Complete workflow from loading to analysis: - +### Single-cell RNA-seq analysis ```python import anndata as ad -import numpy as np -import pandas as pd +import scanpy as sc # 1. Load data -adata = ad.read_mtx('matrix.mtx') -adata.obs_names = pd.read_csv('barcodes.tsv', header=None)[0] -adata.var_names = pd.read_csv('genes.tsv', header=None)[0] +adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') # 2. Quality control adata.obs['n_genes'] = (adata.X > 0).sum(axis=1) -adata.obs['total_counts'] = adata.X.sum(axis=1) -adata = adata[adata.obs.n_genes > 200] -adata = adata[adata.obs.total_counts < 10000] +adata.obs['n_counts'] = adata.X.sum(axis=1) +adata = adata[adata.obs['n_genes'] > 200] +adata = adata[adata.obs['n_counts'] < 50000] -# 3. Filter genes -min_cells = 3 -adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells] +# 3. Store raw +adata.raw = adata.copy() -# 4. Store raw counts -adata.layers['counts'] = adata.X.copy() +# 4. Normalize and filter +sc.pp.normalize_total(adata, target_sum=1e4) +sc.pp.log1p(adata) +sc.pp.highly_variable_genes(adata, n_top_genes=2000) +adata = adata[:, adata.var['highly_variable']] -# 5. Normalize -adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4 -adata.X = np.log1p(adata.X) - -# 6. Feature selection -gene_var = adata.X.var(axis=0) -adata.var['highly_variable'] = gene_var > np.percentile(gene_var, 90) - -# 7. Dimensionality reduction (example with external tools) -# adata.obsm['X_pca'] = compute_pca(adata.X) -# adata.obsm['X_umap'] = compute_umap(adata.obsm['X_pca']) - -# 8. Save results -adata.write('analyzed.h5ad', compression='gzip') +# 5. 
Save processed data +adata.write_h5ad('processed.h5ad') ``` -### Batch Integration - -Combining multiple experimental batches: - +### Batch integration ```python -# Load batches -batches = [ad.read_h5ad(f'batch_{i}.h5ad') for i in range(3)] +# Load multiple batches +adata1 = ad.read_h5ad('batch1.h5ad') +adata2 = ad.read_h5ad('batch2.h5ad') +adata3 = ad.read_h5ad('batch3.h5ad') -# Concatenate with tracking -combined = ad.concat( - batches, - axis=0, - join='outer', +# Concatenate with batch labels +adata = ad.concat( + [adata1, adata2, adata3], label='batch', - keys=['batch_0', 'batch_1', 'batch_2'], - index_unique='-' + keys=['batch1', 'batch2', 'batch3'], + join='inner' ) -# Add batch as numeric for correction algorithms -combined.obs['batch_numeric'] = combined.obs['batch'].cat.codes +# Apply batch correction +import scanpy as sc +sc.pp.combat(adata, key='batch') -# Perform batch correction (with external tools) -# corrected_pca = run_harmony(combined.obsm['X_pca'], combined.obs['batch']) -# combined.obsm['X_pca_corrected'] = corrected_pca - -# Save integrated data -combined.write('integrated.h5ad', compression='gzip') +# Continue analysis +sc.pp.pca(adata) +sc.pp.neighbors(adata) +sc.tl.umap(adata) ``` -### Memory-Efficient Large Dataset Processing - -Working with datasets too large for memory: - +### Working with large datasets ```python -# Read in backed mode -adata = ad.read_h5ad('huge_dataset.h5ad', backed='r') +# Open in backed mode +adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r') -# Compute statistics in chunks -total = 0 -for chunk in adata.chunk_X(chunk_size=1000): - total += chunk.sum() +# Filter based on metadata (no data loading) +high_quality = adata[adata.obs['quality_score'] > 0.8] -mean_expression = total / (adata.n_obs * adata.n_vars) +# Load filtered subset +adata_subset = high_quality.to_memory() -# Work with subset -high_quality_cells = adata.obs.n_genes > 1000 -subset = adata[high_quality_cells, :].copy() +# Process subset +process(adata_subset) -# Close file -adata.file.close() +# Or process in chunks +chunk_size = 1000 +for i in range(0, adata.n_obs, chunk_size): + chunk = adata[i:i+chunk_size, :].to_memory() + process(chunk) ``` -## Best Practices +## Troubleshooting -### Data Organization - -1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers -2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations -3. **Use uns for metadata**: Analysis parameters, dates, version information -4. **Use categoricals for efficiency**: Convert repeated strings to categorical types - -### Subsetting - -1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence -2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation -3. **Validate after subsetting**: Check dimensions and data integrity - -### File I/O - -1. **Use compression**: Always use `compression='gzip'` when writing h5ad files -2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility -3. **Close backed files**: Always close file connections when done -4. **Use backed mode for large files**: Don't load everything into memory if not needed - -### Concatenation - -1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features -2. **Track sources**: Use `label` and `keys` to track data origin -3. 
**Handle duplicates**: Use `index_unique` to make observation names unique -4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations - -### Memory Management - -1. **Use sparse matrices**: For data with <30% non-zero values -2. **Convert to categoricals**: For repeated string values -3. **Process in chunks**: For operations on very large matrices -4. **Use backed mode**: Read large files with `backed='r'` - -### Naming Conventions - -Follow these conventions for consistency: - -- **Embeddings**: `X_pca`, `X_umap`, `X_tsne` -- **Layers**: Descriptive names like `counts`, `log1p`, `scaled` -- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts` -- **Variables**: Use snake_case like `highly_variable`, `gene_name` - -## Reference Documentation - -For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the `references/` directory: - -1. **api_reference.md**: Complete API documentation including all classes, methods, and functions with usage examples. Use `grep -r "pattern" references/api_reference.md` to search for specific functions or parameters. - -2. **workflows_best_practices.md**: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use `grep -r "pattern" references/workflows_best_practices.md` to search for specific workflows. - -3. **concatenation_guide.md**: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use `grep -r "pattern" references/concatenation_guide.md` to search for concatenation patterns. - -## When to Load References - -Load reference files into context when: -- Implementing complex concatenation with specific merge strategies -- Troubleshooting errors or unexpected behavior -- Optimizing memory usage for large datasets -- Implementing complete analysis workflows -- Understanding nuances of specific API methods - -To search within references without loading them: +### Out of memory errors +Use backed mode or convert to sparse matrices: ```python -# Example: Search for information about backed mode -grep -r "backed mode" references/ +# Backed mode +adata = ad.read_h5ad('file.h5ad', backed='r') + +# Sparse matrices +from scipy.sparse import csr_matrix +adata.X = csr_matrix(adata.X) ``` -## Common Error Patterns +### Slow file reading +Use compression and appropriate formats: +```python +# Optimize for storage +adata.strings_to_categoricals() +adata.write_h5ad('file.h5ad', compression='gzip') -### Memory Errors -**Problem**: "MemoryError: Unable to allocate array" -**Solution**: Use backed mode, sparse matrices, or process in chunks +# Use Zarr for cloud storage +adata.write_zarr('file.zarr', chunks=(1000, 1000)) +``` -### Dimension Mismatch -**Problem**: "ValueError: operands could not be broadcast together" -**Solution**: Use outer join in concatenation or align indices before operations +### Index alignment issues +Always align external data on index: +```python +# Wrong +adata.obs['new_col'] = external_data['values'] -### View Modification -**Problem**: "ValueError: assignment destination is read-only" -**Solution**: Convert view to copy with `.copy()` before modification +# Correct +adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values'] +``` -### File Already Open -**Problem**: "OSError: Unable to open file (file 
is already open)" -**Solution**: Close previous file connection with `adata.file.close()` +## Additional Resources + +- **Official documentation**: https://anndata.readthedocs.io/ +- **Scanpy tutorials**: https://scanpy.readthedocs.io/ +- **Scverse ecosystem**: https://scverse.org/ +- **GitHub repository**: https://github.com/scverse/anndata diff --git a/scientific-packages/anndata/references/api_reference.md b/scientific-packages/anndata/references/api_reference.md deleted file mode 100644 index ee828b0..0000000 --- a/scientific-packages/anndata/references/api_reference.md +++ /dev/null @@ -1,218 +0,0 @@ -# AnnData API Reference - -## Core AnnData Class - -The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains. - -### Core Attributes - -| Attribute | Type | Description | -|-----------|------|-------------| -| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays | -| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count | -| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count | -| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata | -| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to observation axis) | -| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to variable axis) | -| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) | -| **varp** | dict-like | Pairwise variable annotations (graphs between features) | -| **layers** | dict-like | Additional data matrices matching X's dimensions | -| **raw** | AnnData | Stores original versions of X and var before transformations | - -### Dimensional Properties - -- **n_obs**: Number of observations (sample count) -- **n_vars**: Number of variables/features -- **shape**: Tuple returning (n_obs, n_vars) -- **T**: Transposed view of the entire object - -### State Properties - -- **isbacked**: Boolean indicating disk-backed storage status -- **is_view**: Boolean identifying whether object is a view of another AnnData -- **filename**: Path to backing .h5ad file; setting this enables disk-backed mode - -### Key Methods - -#### Construction and Copying -- **`AnnData(X=None, obs=None, var=None, ...)`**: Create new AnnData object -- **`copy(filename=None)`**: Create full copy, optionally stored on disk - -#### Subsetting and Views -- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns view by default) -- **`.copy()`**: Convert view to independent object - -#### Data Access -- **`to_df(layer=None)`**: Generate pandas DataFrame representation -- **`obs_vector(k, layer=None)`**: Extract 1D array from X, layers, or annotations -- **`var_vector(k, layer=None)`**: Extract 1D array for a variable -- **`chunk_X(chunk_size)`**: Iterate over data matrix in chunks -- **`chunked_X(chunk_size)`**: Context manager for chunked iteration - -#### Transformation -- **`transpose()`**: Return transposed object -- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along observation axis -- **`to_memory(copy=False)`**: Load all backed arrays into RAM - -#### File I/O -- **`write_h5ad(filename, compression='gzip')`**: Save as .h5ad HDF5 format -- **`write_zarr(store, ...)`**: Export 
hierarchical Zarr store -- **`write_loom(filename, ...)`**: Output .loom format file -- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files - -#### Data Management -- **`strings_to_categoricals()`**: Convert string annotations to categorical types -- **`rename_categories(key, categories)`**: Update category labels in annotations -- **`obs_names_make_unique(sep='-')`**: Append numeric suffixes to duplicate observation names -- **`var_names_make_unique(sep='-')`**: Append numeric suffixes to duplicate variable names - -## Module-Level Functions - -### Reading Functions - -#### Native Formats -- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files -- **`read_zarr(store)`**: Access hierarchical Zarr array stores - -#### Alternative Formats -- **`read_csv(filename, ...)`**: Import from CSV files -- **`read_excel(filename, ...)`**: Import from Excel files -- **`read_hdf(filename, key)`**: Read from HDF5 files -- **`read_loom(filename, ...)`**: Import from .loom files -- **`read_mtx(filename, ...)`**: Import from Matrix Market format -- **`read_text(filename, ...)`**: Import from text files -- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format - -#### Element-Level Access -- **`read_elem(elem)`**: Retrieve specific components from storage -- **`sparse_dataset(group)`**: Generate backed sparse matrix classes - -### Combining Operations -- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects - - **axis**: 0 (observations) or 1 (variables) - - **join**: 'inner' (intersection) or 'outer' (union) - - **merge**: Strategy for non-concatenation axis ('same', 'unique', 'first', 'only', or None) - - **label**: Column name for source tracking - - **keys**: Dataset identifiers for source annotation - - **index_unique**: Separator for making duplicate indices unique - -### Writing Functions -- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format -- **`write_zarr(store, adata, ...)`**: Save as Zarr hierarchical arrays -- **`write_elem(elem, ...)`**: Write individual components - -### Experimental Features -- **`AnnCollection`**: Batch processing for large collections -- **`AnnLoader`**: PyTorch DataLoader integration -- **`concat_on_disk(*adatas, filename, ...)`**: Memory-efficient out-of-core concatenation -- **`read_lazy(filename)`**: Lazy loading with deferred computation -- **`read_dispatched(filename, ...)`**: Custom I/O with callbacks -- **`write_dispatched(filename, ...)`**: Custom writing with callbacks - -### Configuration -- **`settings`**: Package-wide configuration object -- **`settings.override(**kwargs)`**: Context manager for temporary settings changes - -## Common Usage Patterns - -### Creating AnnData Objects - -```python -import anndata as ad -import numpy as np -from scipy.sparse import csr_matrix - -# From dense array -counts = np.random.poisson(1, size=(100, 2000)) -adata = ad.AnnData(counts) - -# From sparse matrix -counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32) -adata = ad.AnnData(counts) - -# With metadata -import pandas as pd -obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']}) -var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}) -adata = ad.AnnData(counts, obs=obs_meta, var=var_meta) -``` - -### Subsetting - -```python -# By names -subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']] - -# By boolean mask -b_cells = adata[adata.obs.cell_type == 'B'] - -# By position 
-first_five = adata[:5, :100] - -# Convert view to copy -adata_copy = adata[:5].copy() -``` - -### Adding Annotations - -```python -# Cell-level metadata -adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50) - -# Gene-level metadata -adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars) - -# Embeddings -adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50)) -adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2)) - -# Alternative data representations -adata.layers['log_transformed'] = np.log1p(adata.X) -adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0) - -# Unstructured metadata -adata.uns['experiment_date'] = '2024-01-15' -adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3} -``` - -### File I/O - -```python -# Write to disk -adata.write('my_results.h5ad', compression='gzip') - -# Read into memory -adata = ad.read_h5ad('my_results.h5ad') - -# Read in backed mode (memory-efficient) -adata = ad.read_h5ad('my_results.h5ad', backed='r') - -# Close file connection -adata.file.close() -``` - -### Concatenation - -```python -# Combine multiple datasets -adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000))) -adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000))) -adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000))) - -# Simple concatenation -combined = ad.concat([adata1, adata2, adata3], axis=0) - -# With source labels -combined = ad.concat( - [adata1, adata2, adata3], - axis=0, - label='dataset', - keys=['exp1', 'exp2', 'exp3'] -) - -# Inner join (only shared variables) -combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner') - -# Outer join (all variables, pad with zeros) -combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer') -``` diff --git a/scientific-packages/anndata/references/best_practices.md b/scientific-packages/anndata/references/best_practices.md new file mode 100644 index 0000000..5074740 --- /dev/null +++ b/scientific-packages/anndata/references/best_practices.md @@ -0,0 +1,525 @@ +# Best Practices + +Guidelines for efficient and effective use of AnnData. 
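+Before applying these guidelines, it helps to know how large `X` actually is and how sparse it is. A minimal sketch (the helper name `x_footprint` is just illustrative; it assumes `X` is a NumPy array or a SciPy CSR/CSC matrix):
+
+```python
+import numpy as np
+from scipy.sparse import issparse
+
+def x_footprint(adata):
+    """Print the approximate size of .X and its fraction of non-zero entries."""
+    X = adata.X
+    if issparse(X):
+        nbytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
+        nonzero_frac = X.nnz / (X.shape[0] * X.shape[1])
+    else:
+        nbytes = X.nbytes
+        nonzero_frac = np.count_nonzero(X) / X.size
+    print(f"X: {nbytes / 1e6:.1f} MB, {nonzero_frac:.1%} non-zero")
+
+x_footprint(adata)
+```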
+ +## Memory Management + +### Use sparse matrices for sparse data +```python +import numpy as np +from scipy.sparse import csr_matrix +import anndata as ad + +# Check data sparsity +data = np.random.rand(1000, 2000) +sparsity = 1 - np.count_nonzero(data) / data.size +print(f"Sparsity: {sparsity:.2%}") + +# Convert to sparse if >50% zeros +if sparsity > 0.5: + adata = ad.AnnData(X=csr_matrix(data)) +else: + adata = ad.AnnData(X=data) + +# Benefits: 10-100x memory reduction for sparse genomics data +``` + +### Convert strings to categoricals +```python +# Inefficient: string columns use lots of memory +adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A'] + +# Efficient: convert to categorical +adata.obs['cell_type'] = adata.obs['cell_type'].astype('category') + +# Convert all string columns +adata.strings_to_categoricals() + +# Benefits: 10-50x memory reduction for repeated strings +``` + +### Use backed mode for large datasets +```python +# Don't load entire dataset into memory +adata = ad.read_h5ad('large_dataset.h5ad', backed='r') + +# Work with metadata +filtered = adata[adata.obs['quality'] > 0.8] + +# Load only filtered subset +adata_subset = filtered.to_memory() + +# Benefits: Work with datasets larger than RAM +``` + +## Views vs Copies + +### Understanding views +```python +# Subsetting creates a view by default +subset = adata[0:100, :] +print(subset.is_view) # True + +# Views don't copy data (memory efficient) +# But modifications can affect original + +# Check if object is a view +if adata.is_view: + adata = adata.copy() # Make independent +``` + +### When to use views +```python +# Good: Read-only operations on subsets +mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean() + +# Good: Temporary analysis +temp_subset = adata[:100, :] +result = analyze(temp_subset.X) +``` + +### When to use copies +```python +# Create independent copy for modifications +adata_filtered = adata[keep_cells, :].copy() + +# Safe to modify without affecting original +adata_filtered.obs['new_column'] = values + +# Always copy when: +# - Storing subset for later use +# - Modifying subset data +# - Passing to function that modifies data +``` + +## Data Storage Best Practices + +### Choose the right format + +**H5AD (HDF5) - Default choice** +```python +adata.write_h5ad('data.h5ad', compression='gzip') +``` +- Fast random access +- Supports backed mode +- Good compression +- Best for: Most use cases + +**Zarr - Cloud and parallel access** +```python +adata.write_zarr('data.zarr', chunks=(100, 100)) +``` +- Excellent for cloud storage (S3, GCS) +- Supports parallel I/O +- Good compression +- Best for: Large datasets, cloud workflows, parallel processing + +**CSV - Interoperability** +```python +adata.write_csvs('output_dir/') +``` +- Human readable +- Compatible with all tools +- Large file sizes, slow +- Best for: Sharing with non-Python tools, small datasets + +### Optimize file size +```python +# Before saving, optimize: + +# 1. Convert to sparse if appropriate +from scipy.sparse import csr_matrix, issparse +if not issparse(adata.X): + density = np.count_nonzero(adata.X) / adata.X.size + if density < 0.5: + adata.X = csr_matrix(adata.X) + +# 2. Convert strings to categoricals +adata.strings_to_categoricals() + +# 3. 
Use compression +adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9) + +# Typical results: 5-20x file size reduction +``` + +## Backed Mode Strategies + +### Read-only analysis +```python +# Open in read-only backed mode +adata = ad.read_h5ad('data.h5ad', backed='r') + +# Perform filtering without loading data +high_quality = adata[adata.obs['quality_score'] > 0.8] + +# Load only filtered data +adata_filtered = high_quality.to_memory() +``` + +### Read-write modifications +```python +# Open in read-write backed mode +adata = ad.read_h5ad('data.h5ad', backed='r+') + +# Modify metadata (written to disk) +adata.obs['new_annotation'] = values + +# X remains on disk, modifications saved immediately +``` + +### Chunked processing +```python +# Process large dataset in chunks +adata = ad.read_h5ad('huge_dataset.h5ad', backed='r') + +results = [] +chunk_size = 1000 + +for i in range(0, adata.n_obs, chunk_size): + chunk = adata[i:i+chunk_size, :].to_memory() + result = process(chunk) + results.append(result) + +final_result = combine(results) +``` + +## Performance Optimization + +### Subsetting performance +```python +# Fast: Boolean indexing with arrays +mask = np.array(adata.obs['quality'] > 0.5) +subset = adata[mask, :] + +# Slow: Boolean indexing with Series (creates view chain) +subset = adata[adata.obs['quality'] > 0.5, :] + +# Fastest: Integer indices +indices = np.where(adata.obs['quality'] > 0.5)[0] +subset = adata[indices, :] +``` + +### Avoid repeated subsetting +```python +# Inefficient: Multiple subset operations +for cell_type in ['A', 'B', 'C']: + subset = adata[adata.obs['cell_type'] == cell_type] + process(subset) + +# Efficient: Group and process +groups = adata.obs.groupby('cell_type').groups +for cell_type, indices in groups.items(): + subset = adata[indices, :] + process(subset) +``` + +### Use chunked operations for large matrices +```python +# Process X in chunks +for chunk in adata.chunked_X(chunk_size=1000): + result = compute(chunk) + +# More memory efficient than loading full X +``` + +## Working with Raw Data + +### Store raw before filtering +```python +# Original data with all genes +adata = ad.AnnData(X=counts) + +# Store raw before filtering +adata.raw = adata.copy() + +# Filter to highly variable genes +adata = adata[:, adata.var['highly_variable']] + +# Later: access original data +original_expression = adata.raw.X +all_genes = adata.raw.var_names +``` + +### When to use raw +```python +# Use raw for: +# - Differential expression on filtered genes +# - Visualization of specific genes not in filtered set +# - Accessing original counts after normalization + +# Access raw data +if adata.raw is not None: + gene_expr = adata.raw[:, 'GENE_NAME'].X +else: + gene_expr = adata[:, 'GENE_NAME'].X +``` + +## Metadata Management + +### Naming conventions +```python +# Consistent naming improves usability + +# Observation metadata (obs): +# - cell_id, sample_id +# - cell_type, tissue, condition +# - n_genes, n_counts, percent_mito +# - cluster, leiden, louvain + +# Variable metadata (var): +# - gene_id, gene_name +# - highly_variable, n_cells +# - mean_expression, dispersion + +# Embeddings (obsm): +# - X_pca, X_umap, X_tsne +# - X_diffmap, X_draw_graph_fr + +# Follow conventions from scanpy/scverse ecosystem +``` + +### Document metadata +```python +# Store metadata descriptions in uns +adata.uns['metadata_descriptions'] = { + 'cell_type': 'Cell type annotation from automated clustering', + 'quality_score': 'QC score from scrublet (0-1, higher is better)', + 
'batch': 'Experimental batch identifier' +} + +# Store processing history +adata.uns['processing_steps'] = [ + 'Raw counts loaded from 10X', + 'Filtered: n_genes > 200, n_counts < 50000', + 'Normalized to 10000 counts per cell', + 'Log transformed' +] +``` + +## Reproducibility + +### Set random seeds +```python +import numpy as np + +# Set seed for reproducible results +np.random.seed(42) + +# Document in uns +adata.uns['random_seed'] = 42 +``` + +### Store parameters +```python +# Store analysis parameters in uns +adata.uns['pca'] = { + 'n_comps': 50, + 'svd_solver': 'arpack', + 'random_state': 42 +} + +adata.uns['neighbors'] = { + 'n_neighbors': 15, + 'n_pcs': 50, + 'metric': 'euclidean', + 'method': 'umap' +} +``` + +### Version tracking +```python +import anndata +import scanpy +import numpy + +# Store versions +adata.uns['versions'] = { + 'anndata': anndata.__version__, + 'scanpy': scanpy.__version__, + 'numpy': numpy.__version__, + 'python': sys.version +} +``` + +## Error Handling + +### Check data validity +```python +# Verify dimensions +assert adata.n_obs == len(adata.obs) +assert adata.n_vars == len(adata.var) +assert adata.X.shape == (adata.n_obs, adata.n_vars) + +# Check for NaN values +has_nan = np.isnan(adata.X.data).any() if issparse(adata.X) else np.isnan(adata.X).any() +if has_nan: + print("Warning: Data contains NaN values") + +# Check for negative values (if counts expected) +has_negative = (adata.X.data < 0).any() if issparse(adata.X) else (adata.X < 0).any() +if has_negative: + print("Warning: Data contains negative values") +``` + +### Validate metadata +```python +# Check for missing values +missing_obs = adata.obs.isnull().sum() +if missing_obs.any(): + print("Missing values in obs:") + print(missing_obs[missing_obs > 0]) + +# Verify indices are unique +assert adata.obs_names.is_unique, "Observation names not unique" +assert adata.var_names.is_unique, "Variable names not unique" + +# Check metadata alignment +assert len(adata.obs) == adata.n_obs +assert len(adata.var) == adata.n_vars +``` + +## Integration with Other Tools + +### Scanpy integration +```python +import scanpy as sc + +# AnnData is native format for scanpy +sc.pp.filter_cells(adata, min_genes=200) +sc.pp.filter_genes(adata, min_cells=3) +sc.pp.normalize_total(adata, target_sum=1e4) +sc.pp.log1p(adata) +sc.pp.highly_variable_genes(adata) +sc.pp.pca(adata) +sc.pp.neighbors(adata) +sc.tl.umap(adata) +``` + +### Pandas integration +```python +import pandas as pd + +# Convert to DataFrame +df = adata.to_df() + +# Create from DataFrame +adata = ad.AnnData(df) + +# Work with metadata as DataFrames +adata.obs = adata.obs.merge(external_metadata, left_index=True, right_index=True) +``` + +### PyTorch integration +```python +from anndata.experimental import AnnLoader + +# Create PyTorch DataLoader +dataloader = AnnLoader(adata, batch_size=128, shuffle=True) + +# Iterate in training loop +for batch in dataloader: + X = batch.X + # Train model on batch +``` + +## Common Pitfalls + +### Pitfall 1: Modifying views +```python +# Wrong: Modifying view can affect original +subset = adata[:100, :] +subset.X = new_data # May modify adata.X! + +# Correct: Copy before modifying +subset = adata[:100, :].copy() +subset.X = new_data # Independent copy +``` + +### Pitfall 2: Index misalignment +```python +# Wrong: Assuming order matches +external_data = pd.read_csv('data.csv') +adata.obs['new_col'] = external_data['values'] # May misalign! 
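+# (a Series is matched on its index — labels that don't match adata.obs_names
+#  become NaN — while a bare array is taken in whatever order the file had,
+#  so nothing guarantees the values line up with the cells)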
+ +# Correct: Align on index +adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values'] +``` + +### Pitfall 3: Mixing sparse and dense +```python +# Wrong: Converting sparse to dense uses huge memory +result = adata.X + 1 # Converts sparse to dense! + +# Correct: Use sparse operations +from scipy.sparse import issparse +if issparse(adata.X): + result = adata.X.copy() + result.data += 1 +``` + +### Pitfall 4: Not handling views +```python +# Wrong: Assuming subset is independent +subset = adata[mask, :] +del adata # subset may become invalid! + +# Correct: Copy when needed +subset = adata[mask, :].copy() +del adata # subset remains valid +``` + +### Pitfall 5: Ignoring memory constraints +```python +# Wrong: Loading huge dataset into memory +adata = ad.read_h5ad('100GB_file.h5ad') # OOM error! + +# Correct: Use backed mode +adata = ad.read_h5ad('100GB_file.h5ad', backed='r') +subset = adata[adata.obs['keep']].to_memory() +``` + +## Workflow Example + +Complete best-practices workflow: + +```python +import anndata as ad +import numpy as np +from scipy.sparse import csr_matrix + +# 1. Load with backed mode if large +adata = ad.read_h5ad('data.h5ad', backed='r') + +# 2. Quick metadata check without loading data +print(f"Dataset: {adata.n_obs} cells × {adata.n_vars} genes") + +# 3. Filter based on metadata +high_quality = adata[adata.obs['quality_score'] > 0.8] + +# 4. Load filtered subset to memory +adata = high_quality.to_memory() + +# 5. Convert to optimal storage types +adata.strings_to_categoricals() +if not issparse(adata.X): + density = np.count_nonzero(adata.X) / adata.X.size + if density < 0.5: + adata.X = csr_matrix(adata.X) + +# 6. Store raw before filtering genes +adata.raw = adata.copy() + +# 7. Filter to highly variable genes +adata = adata[:, adata.var['highly_variable']].copy() + +# 8. Document processing +adata.uns['processing'] = { + 'filtered': 'quality_score > 0.8', + 'n_hvg': adata.n_vars, + 'date': '2025-11-03' +} + +# 9. Save optimized +adata.write_h5ad('processed.h5ad', compression='gzip') +``` diff --git a/scientific-packages/anndata/references/concatenation.md b/scientific-packages/anndata/references/concatenation.md new file mode 100644 index 0000000..6cb307c --- /dev/null +++ b/scientific-packages/anndata/references/concatenation.md @@ -0,0 +1,396 @@ +# Concatenating AnnData Objects + +Combine multiple AnnData objects along either observations or variables axis. + +## Basic Concatenation + +### Concatenate along observations (stack cells/samples) +```python +import anndata as ad +import numpy as np + +# Create multiple AnnData objects +adata1 = ad.AnnData(X=np.random.rand(100, 50)) +adata2 = ad.AnnData(X=np.random.rand(150, 50)) +adata3 = ad.AnnData(X=np.random.rand(200, 50)) + +# Concatenate along observations (axis=0, default) +adata_combined = ad.concat([adata1, adata2, adata3], axis=0) + +print(adata_combined.shape) # (450, 50) +``` + +### Concatenate along variables (stack genes/features) +```python +# Create objects with same observations, different variables +adata1 = ad.AnnData(X=np.random.rand(100, 50)) +adata2 = ad.AnnData(X=np.random.rand(100, 30)) +adata3 = ad.AnnData(X=np.random.rand(100, 70)) + +# Concatenate along variables (axis=1) +adata_combined = ad.concat([adata1, adata2, adata3], axis=1) + +print(adata_combined.shape) # (100, 150) +``` + +## Join Types + +### Inner join (intersection) +Keep only variables/observations present in all objects. 
+ +```python +import pandas as pd + +# Create objects with different variables +adata1 = ad.AnnData( + X=np.random.rand(100, 50), + var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)]) +) +adata2 = ad.AnnData( + X=np.random.rand(150, 60), + var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)]) +) + +# Inner join: only genes 10-49 are kept (overlap) +adata_inner = ad.concat([adata1, adata2], join='inner') +print(adata_inner.n_vars) # 40 genes (overlap) +``` + +### Outer join (union) +Keep all variables/observations, filling missing values. + +```python +# Outer join: all genes are kept +adata_outer = ad.concat([adata1, adata2], join='outer') +print(adata_outer.n_vars) # 70 genes (union) + +# Missing values are filled with appropriate defaults: +# - 0 for sparse matrices +# - NaN for dense matrices +``` + +### Fill values for outer joins +```python +# Specify fill value for missing data +adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0) +``` + +## Tracking Data Sources + +### Add batch labels +```python +# Label which object each observation came from +adata_combined = ad.concat( + [adata1, adata2, adata3], + label='batch', # Column name for labels + keys=['batch1', 'batch2', 'batch3'] # Labels for each object +) + +print(adata_combined.obs['batch'].value_counts()) +# batch1 100 +# batch2 150 +# batch3 200 +``` + +### Automatic batch labels +```python +# If keys not provided, uses integer indices +adata_combined = ad.concat( + [adata1, adata2, adata3], + label='dataset' +) +# dataset column contains: 0, 1, 2 +``` + +## Merge Strategies + +Control how metadata from different objects is combined using the `merge` parameter. + +### merge=None (default for observations) +Exclude metadata on non-concatenation axis. + +```python +# When concatenating observations, var metadata must match +adata1.var['gene_type'] = 'protein_coding' +adata2.var['gene_type'] = 'protein_coding' + +# var is kept only if identical across all objects +adata_combined = ad.concat([adata1, adata2], merge=None) +``` + +### merge='same' +Keep metadata that is identical across all objects. + +```python +adata1.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25 +adata2.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25 +adata1.var['type'] = 'protein_coding' +adata2.var['type'] = 'lncRNA' # Different + +# 'chromosome' is kept (same), 'type' is excluded (different) +adata_combined = ad.concat([adata1, adata2], merge='same') +``` + +### merge='unique' +Keep metadata columns where each key has exactly one value. + +```python +adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)] +adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)] + +# gene_id is kept (unique values for each key) +adata_combined = ad.concat([adata1, adata2], merge='unique') +``` + +### merge='first' +Take values from the first object containing each key. + +```python +adata1.var['description'] = ['Desc1'] * 50 +adata2.var['description'] = ['Desc2'] * 50 + +# Uses descriptions from adata1 +adata_combined = ad.concat([adata1, adata2], merge='first') +``` + +### merge='only' +Keep metadata that appears in only one object. 
+ +```python +adata1.var['adata1_specific'] = [1] * 50 +adata2.var['adata2_specific'] = [2] * 50 + +# Both metadata columns are kept +adata_combined = ad.concat([adata1, adata2], merge='only') +``` + +## Handling Index Conflicts + +### Make indices unique +```python +import pandas as pd + +# Create objects with overlapping observation names +adata1 = ad.AnnData( + X=np.random.rand(3, 10), + obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3']) +) +adata2 = ad.AnnData( + X=np.random.rand(3, 10), + obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3']) +) + +# Make indices unique by appending batch keys +adata_combined = ad.concat( + [adata1, adata2], + label='batch', + keys=['batch1', 'batch2'], + index_unique='_' # Separator for making indices unique +) + +print(adata_combined.obs_names) +# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1', +# 'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2'] +``` + +## Concatenating Layers + +```python +# Objects with layers +adata1 = ad.AnnData(X=np.random.rand(100, 50)) +adata1.layers['normalized'] = np.random.rand(100, 50) +adata1.layers['scaled'] = np.random.rand(100, 50) + +adata2 = ad.AnnData(X=np.random.rand(150, 50)) +adata2.layers['normalized'] = np.random.rand(150, 50) +adata2.layers['scaled'] = np.random.rand(150, 50) + +# Layers are concatenated automatically if present in all objects +adata_combined = ad.concat([adata1, adata2]) + +print(adata_combined.layers.keys()) +# dict_keys(['normalized', 'scaled']) +``` + +## Concatenating Multi-dimensional Annotations + +### obsm/varm +```python +# Objects with embeddings +adata1.obsm['X_pca'] = np.random.rand(100, 50) +adata2.obsm['X_pca'] = np.random.rand(150, 50) + +# obsm is concatenated along observation axis +adata_combined = ad.concat([adata1, adata2]) +print(adata_combined.obsm['X_pca'].shape) # (250, 50) +``` + +### obsp/varp (pairwise annotations) +```python +from scipy.sparse import csr_matrix + +# Pairwise matrices +adata1.obsp['connectivities'] = csr_matrix((100, 100)) +adata2.obsp['connectivities'] = csr_matrix((150, 150)) + +# By default, obsp is NOT concatenated (set pairwise=True to include) +adata_combined = ad.concat([adata1, adata2]) +# adata_combined.obsp is empty + +# Include pairwise data (creates block diagonal matrix) +adata_combined = ad.concat([adata1, adata2], pairwise=True) +print(adata_combined.obsp['connectivities'].shape) # (250, 250) +``` + +## Concatenating uns (unstructured) + +Unstructured metadata is merged recursively: + +```python +adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'} +adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'} + +# Using merge='unique' for uns +adata_combined = ad.concat([adata1, adata2], uns_merge='unique') +# 'date' is kept (same value), 'batch' might be excluded (different values) +``` + +## Lazy Concatenation (AnnCollection) + +For very large datasets, use lazy concatenation that doesn't load all data: + +```python +from anndata.experimental import AnnCollection + +# Create collection from file paths (doesn't load data) +files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad'] +collection = AnnCollection( + files, + join_obs='outer', + join_vars='inner', + label='dataset', + keys=['dataset1', 'dataset2', 'dataset3'] +) + +# Access data lazily +print(collection.n_obs) # Total observations +print(collection.obs.head()) # Metadata loaded, not X + +# Convert to regular AnnData when needed (loads all data) +adata = collection.to_adata() +``` + +### Working with AnnCollection +```python +# Subset without loading data 
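+# (returns a lazy view; expression values stay on disk until you
+#  materialize it, e.g. with .to_adata())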
+subset = collection[collection.obs['cell_type'] == 'T cell'] + +# Iterate through datasets +for adata in collection: + print(adata.shape) + +# Access specific dataset +first_dataset = collection[0] +``` + +## Concatenation on Disk + +For datasets too large for memory, concatenate directly on disk: + +```python +from anndata.experimental import concat_on_disk + +# Concatenate without loading into memory +concat_on_disk( + ['data1.h5ad', 'data2.h5ad', 'data3.h5ad'], + 'combined.h5ad', + join='outer' +) + +# Load result in backed mode +adata = ad.read_h5ad('combined.h5ad', backed='r') +``` + +## Common Concatenation Patterns + +### Combine technical replicates +```python +# Multiple runs of the same samples +replicates = [adata_run1, adata_run2, adata_run3] +adata_combined = ad.concat( + replicates, + label='technical_replicate', + keys=['rep1', 'rep2', 'rep3'], + join='inner' # Keep only genes measured in all runs +) +``` + +### Combine batches from experiment +```python +# Different experimental batches +batches = [adata_batch1, adata_batch2, adata_batch3] +adata_combined = ad.concat( + batches, + label='batch', + keys=['batch1', 'batch2', 'batch3'], + join='outer' # Keep all genes +) + +# Later: apply batch correction +``` + +### Merge multi-modal data +```python +# Different measurement modalities (e.g., RNA + protein) +adata_rna = ad.AnnData(X=np.random.rand(100, 2000)) +adata_protein = ad.AnnData(X=np.random.rand(100, 50)) + +# Concatenate along variables to combine modalities +adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1) + +# Add labels to distinguish modalities +adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50 +``` + +## Best Practices + +1. **Check compatibility before concatenating** +```python +# Verify shapes are compatible +print([adata.n_vars for adata in [adata1, adata2, adata3]]) + +# Check variable names match +print([set(adata.var_names) for adata in [adata1, adata2, adata3]]) +``` + +2. **Use appropriate join type** +- `inner`: When you need the same features across all samples (most stringent) +- `outer`: When you want to preserve all features (most inclusive) + +3. **Track data sources** +Always use `label` and `keys` to track which observations came from which dataset. + +4. **Consider memory usage** +- For large datasets, use `AnnCollection` or `concat_on_disk` +- Consider backed mode for the result + +5. **Handle batch effects** +Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation: +```python +# After concatenation, apply batch correction +import scanpy as sc +sc.pp.combat(adata_combined, key='batch') +``` + +6. **Validate results** +```python +# Check dimensions +print(adata_combined.shape) + +# Check batch distribution +print(adata_combined.obs['batch'].value_counts()) + +# Verify metadata integrity +print(adata_combined.var.head()) +print(adata_combined.obs.head()) +``` diff --git a/scientific-packages/anndata/references/concatenation_guide.md b/scientific-packages/anndata/references/concatenation_guide.md deleted file mode 100644 index 9c1379b..0000000 --- a/scientific-packages/anndata/references/concatenation_guide.md +++ /dev/null @@ -1,478 +0,0 @@ -# AnnData Concatenation Guide - -## Overview - -The `concat()` function combines multiple AnnData objects through two fundamental operations: -1. **Concatenation**: Stacking sub-elements in order -2. 
**Merging**: Combining collections into one result - -## Basic Concatenation - -### Syntax -```python -import anndata as ad - -combined = ad.concat( - adatas, # List of AnnData objects - axis=0, # 0=observations, 1=variables - join='inner', # 'inner' or 'outer' - merge=None, # Merge strategy for non-concat axis - label=None, # Column name for source tracking - keys=None, # Dataset identifiers - index_unique=None, # Separator for unique indices - fill_value=None, # Fill value for missing data - pairwise=False # Include pairwise matrices -) -``` - -### Concatenating Observations (Cells) -```python -# Most common: combining multiple samples/batches -adata1 = ad.AnnData(np.random.rand(100, 2000)) -adata2 = ad.AnnData(np.random.rand(150, 2000)) -adata3 = ad.AnnData(np.random.rand(80, 2000)) - -combined = ad.concat([adata1, adata2, adata3], axis=0) -# Result: (330 observations, 2000 variables) -``` - -### Concatenating Variables (Genes) -```python -# Less common: combining different feature sets -adata1 = ad.AnnData(np.random.rand(100, 1000)) -adata2 = ad.AnnData(np.random.rand(100, 500)) - -combined = ad.concat([adata1, adata2], axis=1) -# Result: (100 observations, 1500 variables) -``` - -## Join Strategies - -### Inner Join (Intersection) - -Keeps only shared features across all objects. - -```python -# Datasets with different genes -adata1 = ad.AnnData( - np.random.rand(100, 2000), - var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)]) -) -adata2 = ad.AnnData( - np.random.rand(150, 1800), - var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)]) -) - -# Inner join: only genes present in both -combined = ad.concat([adata1, adata2], join='inner') -# Result: (250 observations, 1800 variables) -# Only Gene_200 through Gene_1999 -``` - -**Use when:** -- You want to analyze only features measured in all datasets -- Missing features would compromise analysis -- You need a complete case analysis - -**Trade-offs:** -- May lose many features -- Ensures no missing data -- Smaller result size - -### Outer Join (Union) - -Keeps all features from all objects, padding with fill values (default 0). - -```python -# Outer join: all genes from both datasets -combined = ad.concat([adata1, adata2], join='outer') -# Result: (250 observations, 2000 variables) -# Missing values filled with 0 - -# Custom fill value -combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan) -``` - -**Use when:** -- You want to preserve all features -- Sparse data is acceptable -- Features are independent - -**Trade-offs:** -- Introduces zeros/missing values -- Larger result size -- May need imputation - -## Merge Strategies - -Merge strategies control how elements on the non-concatenation axis are combined. - -### merge=None (Default) - -Excludes all non-concatenation axis elements. - -```python -# Both datasets have var annotations -adata1.var['gene_type'] = ['protein_coding'] * 2000 -adata2.var['gene_type'] = ['protein_coding'] * 1800 - -# merge=None: var annotations excluded -combined = ad.concat([adata1, adata2], merge=None) -assert 'gene_type' not in combined.var.columns -``` - -**Use when:** -- Annotations are dataset-specific -- You'll add new annotations after merging - -### merge='same' - -Keeps only annotations with identical values across datasets. 
- -```python -# Same annotation values -adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000 -adata2.var['chromosome'] = ['chr1'] * 900 + ['chr2'] * 900 - -# merge='same': keeps chromosome annotation -combined = ad.concat([adata1, adata2], merge='same') -assert 'chromosome' in combined.var.columns -``` - -**Use when:** -- Annotations should be consistent -- You want to validate consistency -- Shared metadata is important - -**Note:** Comparison occurs after index alignment - only shared indices need to match. - -### merge='unique' - -Includes annotations with a single possible value. - -```python -# Unique values per gene -adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)] -adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)] - -# merge='unique': keeps ensembl_id -combined = ad.concat([adata1, adata2], merge='unique') -``` - -**Use when:** -- Each feature has a unique identifier -- Annotations are feature-specific - -### merge='first' - -Takes the first occurrence of each annotation. - -```python -# Different annotation versions -adata1.var['description'] = ['desc1'] * 2000 -adata2.var['description'] = ['desc2'] * 2000 - -# merge='first': uses adata1's descriptions -combined = ad.concat([adata1, adata2], merge='first') -# Uses descriptions from adata1 -``` - -**Use when:** -- One dataset has authoritative annotations -- Order matters -- You need a simple resolution strategy - -### merge='only' - -Retains annotations appearing in exactly one object. - -```python -# Dataset-specific annotations -adata1.var['dataset1_specific'] = ['value'] * 2000 -adata2.var['dataset2_specific'] = ['value'] * 2000 - -# merge='only': keeps both (no conflicts) -combined = ad.concat([adata1, adata2], merge='only') -``` - -**Use when:** -- Datasets have non-overlapping annotations -- You want to preserve all unique metadata - -## Source Tracking - -### Using label - -Add a categorical column to track data origin. - -```python -combined = ad.concat( - [adata1, adata2, adata3], - label='batch' -) - -# Creates obs['batch'] with values 0, 1, 2 -print(combined.obs['batch'].cat.categories) # ['0', '1', '2'] -``` - -### Using keys - -Provide custom names for source tracking. - -```python -combined = ad.concat( - [adata1, adata2, adata3], - label='study', - keys=['control', 'treatment_a', 'treatment_b'] -) - -# Creates obs['study'] with custom names -print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b'] -``` - -### Making Indices Unique - -Append source identifiers to duplicate observation names. - -```python -# Both datasets have cells named "Cell_0", "Cell_1", etc. -adata1.obs_names = [f'Cell_{i}' for i in range(100)] -adata2.obs_names = [f'Cell_{i}' for i in range(150)] - -# index_unique adds suffix -combined = ad.concat( - [adata1, adata2], - keys=['batch1', 'batch2'], - index_unique='-' -) - -# Results in: Cell_0-batch1, Cell_0-batch2, etc. -print(combined.obs_names[:5]) -``` - -## Handling Different Attributes - -### X Matrix and Layers - -Follows join strategy. Missing values filled according to `fill_value`. 
- -```python -# Both have layers -adata1.layers['counts'] = adata1.X.copy() -adata2.layers['counts'] = adata2.X.copy() - -# Concatenates both X and layers -combined = ad.concat([adata1, adata2]) -assert 'counts' in combined.layers -``` - -### obs and var DataFrames - -- **obs**: Concatenated along concatenation axis -- **var**: Handled by merge strategy - -```python -adata1.obs['cell_type'] = ['B cell'] * 100 -adata2.obs['cell_type'] = ['T cell'] * 150 - -combined = ad.concat([adata1, adata2]) -# obs['cell_type'] preserved for all cells -``` - -### obsm and varm - -Multi-dimensional annotations follow same rules as layers. - -```python -adata1.obsm['X_pca'] = np.random.rand(100, 50) -adata2.obsm['X_pca'] = np.random.rand(150, 50) - -combined = ad.concat([adata1, adata2]) -# obsm['X_pca'] concatenated: shape (250, 50) -``` - -### obsp and varp - -Pairwise matrices excluded by default. Enable with `pairwise=True`. - -```python -# Distance matrices -adata1.obsp['distances'] = np.random.rand(100, 100) -adata2.obsp['distances'] = np.random.rand(150, 150) - -# Excluded by default -combined = ad.concat([adata1, adata2]) -assert 'distances' not in combined.obsp - -# Include if needed -combined = ad.concat([adata1, adata2], pairwise=True) -# Results in padded block diagonal matrix -``` - -### uns Dictionary - -Merged recursively, applying merge strategy at any nesting depth. - -```python -adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'} -adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'} - -# merge='same' keeps 'lab', excludes 'date' -combined = ad.concat([adata1, adata2], merge='same') -# combined.uns['experiment'] = {'lab': 'A'} -``` - -## Advanced Patterns - -### Batch Integration Pipeline - -```python -import anndata as ad - -# Load batches -batches = [ - ad.read_h5ad(f'batch_{i}.h5ad') - for i in range(5) -] - -# Concatenate with tracking -combined = ad.concat( - batches, - axis=0, - join='outer', - merge='first', - label='batch_id', - keys=[f'batch_{i}' for i in range(5)], - index_unique='-' -) - -# Add batch effects -combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes -``` - -### Multi-Study Meta-Analysis - -```python -# Different studies with varying gene coverage -studies = { - 'study_a': ad.read_h5ad('study_a.h5ad'), - 'study_b': ad.read_h5ad('study_b.h5ad'), - 'study_c': ad.read_h5ad('study_c.h5ad') -} - -# Outer join to keep all genes -combined = ad.concat( - list(studies.values()), - axis=0, - join='outer', - label='study', - keys=list(studies.keys()), - merge='unique', - fill_value=0 -) - -# Track coverage -for study in studies: - n_genes = studies[study].n_vars - combined.uns[f'{study}_n_genes'] = n_genes -``` - -### Incremental Concatenation - -```python -# For many datasets, concatenate in batches -chunk_size = 10 -all_files = [f'dataset_{i}.h5ad' for i in range(100)] - -# Process in chunks -result = None -for i in range(0, len(all_files), chunk_size): - chunk_files = all_files[i:i+chunk_size] - chunk_adatas = [ad.read_h5ad(f) for f in chunk_files] - chunk_combined = ad.concat(chunk_adatas) - - if result is None: - result = chunk_combined - else: - result = ad.concat([result, chunk_combined]) -``` - -### Memory-Efficient On-Disk Concatenation - -```python -# Experimental feature for large datasets -from anndata.experimental import concat_on_disk - -files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad'] -concat_on_disk( - files, - 'combined.h5ad', - join='outer' -) - -# Read result in backed mode -combined = ad.read_h5ad('combined.h5ad', 
backed='r') -``` - -## Troubleshooting - -### Issue: Dimension Mismatch - -```python -# Error: shapes don't match -adata1 = ad.AnnData(np.random.rand(100, 2000)) -adata2 = ad.AnnData(np.random.rand(150, 1500)) - -# Solution: use outer join -combined = ad.concat([adata1, adata2], join='outer') -``` - -### Issue: Memory Error - -```python -# Problem: too many large objects in memory -large_adatas = [ad.read_h5ad(f) for f in many_files] - -# Solution: read and concatenate incrementally -result = None -for file in many_files: - adata = ad.read_h5ad(file) - if result is None: - result = adata - else: - result = ad.concat([result, adata]) - del adata # Free memory -``` - -### Issue: Duplicate Indices - -```python -# Problem: same cell names in different batches -# Solution: use index_unique -combined = ad.concat( - [adata1, adata2], - keys=['batch1', 'batch2'], - index_unique='-' -) -``` - -### Issue: Lost Annotations - -```python -# Problem: annotations disappear -adata1.var['important'] = values1 -adata2.var['important'] = values2 - -combined = ad.concat([adata1, adata2]) # merge=None by default -# Solution: use appropriate merge strategy -combined = ad.concat([adata1, adata2], merge='first') -``` - -## Performance Tips - -1. **Pre-align indices**: Ensure consistent naming before concatenation -2. **Use sparse matrices**: Convert to sparse before concatenating -3. **Batch operations**: Concatenate in groups for many datasets -4. **Choose inner join**: When possible, to reduce result size -5. **Use categoricals**: Convert string annotations before concatenating -6. **Consider on-disk**: For very large datasets, use `concat_on_disk` diff --git a/scientific-packages/anndata/references/data_structure.md b/scientific-packages/anndata/references/data_structure.md new file mode 100644 index 0000000..c419660 --- /dev/null +++ b/scientific-packages/anndata/references/data_structure.md @@ -0,0 +1,314 @@ +# AnnData Object Structure + +The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata. + +## Core Components + +### X (Data Matrix) +The primary data matrix with shape (n_obs, n_vars) storing experimental measurements. + +```python +import anndata as ad +import numpy as np + +# Create with dense array +adata = ad.AnnData(X=np.random.rand(100, 2000)) + +# Create with sparse matrix (recommended for large, sparse data) +from scipy.sparse import csr_matrix +sparse_data = csr_matrix(np.random.rand(100, 2000)) +adata = ad.AnnData(X=sparse_data) +``` + +Access data: +```python +# Full matrix (caution with large datasets) +full_data = adata.X + +# Single observation +obs_data = adata.X[0, :] + +# Single variable across all observations +var_data = adata.X[:, 0] +``` + +### obs (Observation Annotations) +DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X. + +```python +import pandas as pd + +# Create AnnData with observation metadata +obs_df = pd.DataFrame({ + 'cell_type': ['T cell', 'B cell', 'Monocyte'], + 'treatment': ['control', 'treated', 'control'], + 'timepoint': [0, 24, 24] +}, index=['cell_1', 'cell_2', 'cell_3']) + +adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df) + +# Access observation metadata +print(adata.obs['cell_type']) +print(adata.obs.loc['cell_1']) +``` + +### var (Variable Annotations) +DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X. 
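+
+Keeping `var_names` unique is worth doing early, since name-based column lookups rely on it; a minimal sketch using the AnnData/pandas API (the gene ID below is illustrative):
+
+```python
+# Duplicate gene symbols are common in public data; de-duplicate before name-based subsetting
+adata.var_names_make_unique()
+
+# Look up one gene's values by name (assumes the ID exists in var_names)
+gene_idx = adata.var_names.get_loc('ENSG00001')
+gene_values = adata.X[:, gene_idx]
+```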
+ +```python +# Create AnnData with variable metadata +var_df = pd.DataFrame({ + 'gene_name': ['ACTB', 'GAPDH', 'TP53'], + 'chromosome': ['7', '12', '17'], + 'highly_variable': [True, False, True] +}, index=['ENSG00001', 'ENSG00002', 'ENSG00003']) + +adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df) + +# Access variable metadata +print(adata.var['gene_name']) +print(adata.var.loc['ENSG00001']) +``` + +### layers (Alternative Data Representations) +Dictionary storing alternative matrices with the same dimensions as X. + +```python +# Store raw counts, normalized data, and scaled data +adata = ad.AnnData(X=np.random.rand(100, 2000)) +adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000)) +adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True) +adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std() + +# Access layers +raw_data = adata.layers['raw_counts'] +normalized_data = adata.layers['normalized'] +``` + +Common layer uses: +- `raw_counts`: Original count data before normalization +- `normalized`: Log-normalized or TPM values +- `scaled`: Z-scored values for analysis +- `imputed`: Data after imputation + +### obsm (Multi-dimensional Observation Annotations) +Dictionary storing multi-dimensional arrays aligned to observations. + +```python +# Store PCA coordinates and UMAP embeddings +adata.obsm['X_pca'] = np.random.rand(100, 50) # 50 principal components +adata.obsm['X_umap'] = np.random.rand(100, 2) # 2D UMAP coordinates +adata.obsm['X_tsne'] = np.random.rand(100, 2) # 2D t-SNE coordinates + +# Access embeddings +pca_coords = adata.obsm['X_pca'] +umap_coords = adata.obsm['X_umap'] +``` + +Common obsm uses: +- `X_pca`: Principal component coordinates +- `X_umap`: UMAP embedding coordinates +- `X_tsne`: t-SNE embedding coordinates +- `X_diffmap`: Diffusion map coordinates +- `protein_expression`: Protein abundance measurements (CITE-seq) + +### varm (Multi-dimensional Variable Annotations) +Dictionary storing multi-dimensional arrays aligned to variables. + +```python +# Store PCA loadings +adata.varm['PCs'] = np.random.rand(2000, 50) # Loadings for 50 components +adata.varm['gene_modules'] = np.random.rand(2000, 10) # Gene module scores + +# Access loadings +pc_loadings = adata.varm['PCs'] +``` + +Common varm uses: +- `PCs`: Principal component loadings +- `gene_modules`: Gene co-expression module assignments + +### obsp (Pairwise Observation Relationships) +Dictionary storing sparse matrices representing relationships between observations. + +```python +from scipy.sparse import csr_matrix + +# Store k-nearest neighbor graph +n_obs = 100 +knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95) +adata.obsp['connectivities'] = knn_graph +adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs)) + +# Access graphs +knn_connections = adata.obsp['connectivities'] +distances = adata.obsp['distances'] +``` + +Common obsp uses: +- `connectivities`: Cell-cell neighborhood graph +- `distances`: Pairwise distances between cells + +### varp (Pairwise Variable Relationships) +Dictionary storing sparse matrices representing relationships between variables. + +```python +# Store gene-gene correlation matrix +n_vars = 2000 +gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99) +adata.varp['correlations'] = gene_corr + +# Access correlations +gene_correlations = adata.varp['correlations'] +``` + +### uns (Unstructured Annotations) +Dictionary storing arbitrary unstructured metadata. 
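+
+Values stored here should be serializable for h5ad/zarr round-trips — scalars, strings, NumPy arrays, DataFrames, and nested dicts of these generally work, while arbitrary Python objects generally do not. A quick sketch of the distinction (the names are illustrative):
+
+```python
+adata.uns['n_hvgs'] = 2000                    # scalar: round-trips fine
+adata.uns['qc'] = {'min_genes': 200}          # nested dict of scalars: fine
+# adata.uns['model'] = trained_sklearn_model  # arbitrary object: likely fails on write
+```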
+ +```python +# Store analysis parameters and results +adata.uns['experiment_date'] = '2025-11-03' +adata.uns['pca'] = { + 'variance_ratio': [0.15, 0.10, 0.08], + 'params': {'n_comps': 50} +} +adata.uns['neighbors'] = { + 'params': {'n_neighbors': 15, 'method': 'umap'}, + 'connectivities_key': 'connectivities' +} + +# Access unstructured data +exp_date = adata.uns['experiment_date'] +pca_params = adata.uns['pca']['params'] +``` + +Common uns uses: +- Analysis parameters and settings +- Color palettes for plotting +- Cluster information +- Tool-specific metadata + +### raw (Original Data Snapshot) +Optional attribute preserving the original data matrix and variable annotations before filtering. + +```python +# Create AnnData and store raw state +adata = ad.AnnData(X=np.random.rand(100, 5000)) +adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)] + +# Store raw state before filtering +adata.raw = adata.copy() + +# Filter to highly variable genes +highly_variable_mask = np.random.rand(5000) > 0.5 +adata = adata[:, highly_variable_mask] + +# Access original data +original_matrix = adata.raw.X +original_var = adata.raw.var +``` + +## Object Properties + +```python +# Dimensions +n_observations = adata.n_obs +n_variables = adata.n_vars +shape = adata.shape # (n_obs, n_vars) + +# Index information +obs_names = adata.obs_names # Observation identifiers +var_names = adata.var_names # Variable identifiers + +# Storage mode +is_view = adata.is_view # True if this is a view of another object +is_backed = adata.isbacked # True if backed by on-disk storage +filename = adata.filename # Path to backing file (if backed) +``` + +## Creating AnnData Objects + +### From arrays and DataFrames +```python +import anndata as ad +import numpy as np +import pandas as pd + +# Minimal creation +X = np.random.rand(100, 2000) +adata = ad.AnnData(X) + +# With metadata +obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)]) +var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)]) +adata = ad.AnnData(X=X, obs=obs, var=var) + +# With all components +adata = ad.AnnData( + X=X, + obs=obs, + var=var, + layers={'raw': np.random.randint(0, 100, (100, 2000))}, + obsm={'X_pca': np.random.rand(100, 50)}, + uns={'experiment': 'test'} +) +``` + +### From DataFrame +```python +# Create from pandas DataFrame (genes as columns, cells as rows) +df = pd.DataFrame( + np.random.rand(100, 50), + columns=[f'Gene_{i}' for i in range(50)], + index=[f'Cell_{i}' for i in range(100)] +) +adata = ad.AnnData(df) +``` + +## Data Access Patterns + +### Vector extraction +```python +# Get observation annotation as array +cell_types = adata.obs_vector('cell_type') + +# Get variable values across observations +gene_expression = adata.obs_vector('ACTB') # If ACTB is in var_names + +# Get variable annotation as array +gene_names = adata.var_vector('gene_name') +``` + +### Subsetting +```python +# By index +subset = adata[0:10, 0:100] # First 10 obs, first 100 vars + +# By name +subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']] + +# By boolean mask +high_count_cells = adata.obs['total_counts'] > 1000 +subset = adata[high_count_cells, :] + +# By observation metadata +t_cells = adata[adata.obs['cell_type'] == 'T cell'] +``` + +## Memory Considerations + +The AnnData structure is designed for memory efficiency: +- Sparse matrices reduce memory for sparse data +- Views avoid copying data when possible +- Backed mode enables working with data larger than 
RAM +- Categorical annotations reduce memory for discrete values + +```python +# Convert strings to categoricals (more memory efficient) +adata.obs['cell_type'] = adata.obs['cell_type'].astype('category') +adata.strings_to_categoricals() + +# Check if object is a view (doesn't own data) +if adata.is_view: + adata = adata.copy() # Create independent copy +``` diff --git a/scientific-packages/anndata/references/io_operations.md b/scientific-packages/anndata/references/io_operations.md new file mode 100644 index 0000000..9f4892a --- /dev/null +++ b/scientific-packages/anndata/references/io_operations.md @@ -0,0 +1,404 @@ +# Input/Output Operations + +AnnData provides comprehensive I/O functionality for reading and writing data in various formats. + +## Native Formats + +### H5AD (HDF5-based) +The recommended native format for AnnData objects, providing efficient storage and fast access. + +#### Writing H5AD files +```python +import anndata as ad + +# Write to file +adata.write_h5ad('data.h5ad') + +# Write with compression +adata.write_h5ad('data.h5ad', compression='gzip') + +# Write with specific compression level (0-9, higher = more compression) +adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9) +``` + +#### Reading H5AD files +```python +# Read entire file into memory +adata = ad.read_h5ad('data.h5ad') + +# Read in backed mode (lazy loading for large files) +adata = ad.read_h5ad('data.h5ad', backed='r') # Read-only +adata = ad.read_h5ad('data.h5ad', backed='r+') # Read-write + +# Backed mode enables working with datasets larger than RAM +# Only accessed data is loaded into memory +``` + +#### Backed mode operations +```python +# Open in backed mode +adata = ad.read_h5ad('large_dataset.h5ad', backed='r') + +# Access metadata without loading X into memory +print(adata.obs.head()) +print(adata.var.head()) + +# Subset operations create views +subset = adata[:100, :500] # View, no data loaded + +# Load specific data into memory +X_subset = subset.X[:] # Now loads this subset + +# Convert entire backed object to memory +adata_memory = adata.to_memory() +``` + +### Zarr +Hierarchical array storage format, optimized for cloud storage and parallel I/O. + +#### Writing Zarr +```python +# Write to Zarr store +adata.write_zarr('data.zarr') + +# Write with specific chunks (important for performance) +adata.write_zarr('data.zarr', chunks=(100, 100)) +``` + +#### Reading Zarr +```python +# Read Zarr store +adata = ad.read_zarr('data.zarr') +``` + +#### Remote Zarr access +```python +import fsspec + +# Access Zarr from S3 +store = fsspec.get_mapper('s3://bucket-name/data.zarr') +adata = ad.read_zarr(store) + +# Access Zarr from URL +store = fsspec.get_mapper('https://example.com/data.zarr') +adata = ad.read_zarr(store) +``` + +## Alternative Input Formats + +### CSV/TSV +```python +# Read CSV (genes as columns, cells as rows) +adata = ad.read_csv('data.csv') + +# Read with custom delimiter +adata = ad.read_csv('data.tsv', delimiter='\t') + +# Specify that first column is row names +adata = ad.read_csv('data.csv', first_column_names=True) +``` + +### Excel +```python +# Read Excel file +adata = ad.read_excel('data.xlsx') + +# Read specific sheet +adata = ad.read_excel('data.xlsx', sheet='Sheet1') +``` + +### Matrix Market (MTX) +Common format for sparse matrices in genomics. 
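+
+Note that `anndata.read_mtx` reads only the matrix itself; gene and barcode names usually have to be attached from the sidecar TSV files (the 10x convenience readers shown further down are typically provided by scanpy rather than anndata). A common pattern, assuming standard CellRanger file names:
+
+```python
+import anndata as ad
+import pandas as pd
+
+adata = ad.read_mtx('matrix.mtx').T  # transpose: MTX commonly stores genes as rows
+adata.obs_names = pd.read_csv('barcodes.tsv', header=None, sep='\t')[0].values
+adata.var_names = pd.read_csv('genes.tsv', header=None, sep='\t')[0].values
+```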
+ +```python +# Read MTX with associated files +# Requires: matrix.mtx, genes.tsv, barcodes.tsv +adata = ad.read_mtx('matrix.mtx') + +# Read with custom gene and barcode files +adata = ad.read_mtx( + 'matrix.mtx', + var_names='genes.tsv', + obs_names='barcodes.tsv' +) + +# Transpose if needed (MTX often has genes as rows) +adata = adata.T +``` + +### 10X Genomics formats +```python +# Read 10X h5 format +adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') + +# Read 10X MTX directory +adata = ad.read_10x_mtx('filtered_feature_bc_matrix/') + +# Specify genome if multiple present +adata = ad.read_10x_h5('data.h5', genome='GRCh38') +``` + +### Loom +```python +# Read Loom file +adata = ad.read_loom('data.loom') + +# Read with specific observation and variable annotations +adata = ad.read_loom( + 'data.loom', + obs_names='CellID', + var_names='Gene' +) +``` + +### Text files +```python +# Read generic text file +adata = ad.read_text('data.txt', delimiter='\t') + +# Read with custom parameters +adata = ad.read_text( + 'data.txt', + delimiter=',', + first_column_names=True, + dtype='float32' +) +``` + +### UMI tools +```python +# Read UMI tools format +adata = ad.read_umi_tools('counts.tsv') +``` + +### HDF5 (generic) +```python +# Read from HDF5 file (not h5ad format) +adata = ad.read_hdf('data.h5', key='dataset') +``` + +## Alternative Output Formats + +### CSV +```python +# Write to CSV files (creates multiple files) +adata.write_csvs('output_dir/') + +# This creates: +# - output_dir/X.csv (expression matrix) +# - output_dir/obs.csv (observation annotations) +# - output_dir/var.csv (variable annotations) +# - output_dir/uns.csv (unstructured annotations, if possible) + +# Skip certain components +adata.write_csvs('output_dir/', skip_data=True) # Skip X matrix +``` + +### Loom +```python +# Write to Loom format +adata.write_loom('output.loom') +``` + +## Reading Specific Elements + +For fine-grained control, read specific elements from storage: + +```python +from anndata import read_elem + +# Read just observation annotations +obs = read_elem('data.h5ad/obs') + +# Read specific layer +layer = read_elem('data.h5ad/layers/normalized') + +# Read unstructured data element +params = read_elem('data.h5ad/uns/pca_params') +``` + +## Writing Specific Elements + +```python +from anndata import write_elem +import h5py + +# Write element to existing file +with h5py.File('data.h5ad', 'a') as f: + write_elem(f, 'new_layer', adata.X.copy()) +``` + +## Lazy Operations + +For very large datasets, use lazy reading to avoid loading entire datasets: + +```python +from anndata.experimental import read_elem_lazy + +# Lazy read (returns dask array or similar) +X_lazy = read_elem_lazy('large_data.h5ad/X') + +# Compute only when needed +subset = X_lazy[:100, :100].compute() +``` + +## Common I/O Patterns + +### Convert between formats +```python +# MTX to H5AD +adata = ad.read_mtx('matrix.mtx').T +adata.write_h5ad('data.h5ad') + +# CSV to H5AD +adata = ad.read_csv('data.csv') +adata.write_h5ad('data.h5ad') + +# H5AD to Zarr +adata = ad.read_h5ad('data.h5ad') +adata.write_zarr('data.zarr') +``` + +### Load metadata without data +```python +# Backed mode allows inspecting metadata without loading X +adata = ad.read_h5ad('large_file.h5ad', backed='r') +print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables") +print(adata.obs.columns) +print(adata.var.columns) +# X is not loaded into memory +``` + +### Append to existing file +```python +# Open in read-write mode +adata = 
ad.read_h5ad('data.h5ad', backed='r+') + +# Modify metadata +adata.obs['new_column'] = values + +# Changes are written to disk +``` + +### Download from URL +```python +import anndata as ad + +# Read directly from URL (for h5ad files) +url = 'https://example.com/data.h5ad' +adata = ad.read_h5ad(url, backed='r') # Streaming access + +# For other formats, download first +import urllib.request +urllib.request.urlretrieve(url, 'local_file.h5ad') +adata = ad.read_h5ad('local_file.h5ad') +``` + +## Performance Tips + +### Reading +- Use `backed='r'` for large files you only need to query +- Use `backed='r+'` if you need to modify metadata without loading all data +- H5AD format is generally fastest for random access +- Zarr is better for cloud storage and parallel access +- Consider compression for storage, but note it may slow down reading + +### Writing +- Use compression for long-term storage: `compression='gzip'` or `compression='lzf'` +- LZF compression is faster but compresses less than GZIP +- For Zarr, tune chunk sizes based on access patterns: + - Larger chunks for sequential reads + - Smaller chunks for random access +- Convert string columns to categorical before writing (smaller files) + +### Memory management +```python +# Convert strings to categoricals (reduces file size and memory) +adata.strings_to_categoricals() +adata.write_h5ad('data.h5ad') + +# Use sparse matrices for sparse data +from scipy.sparse import csr_matrix +if isinstance(adata.X, np.ndarray): + density = np.count_nonzero(adata.X) / adata.X.size + if density < 0.5: # If more than 50% zeros + adata.X = csr_matrix(adata.X) +``` + +## Handling Large Datasets + +### Strategy 1: Backed mode +```python +# Work with dataset larger than RAM +adata = ad.read_h5ad('100GB_file.h5ad', backed='r') + +# Filter based on metadata (fast, no data loading) +filtered = adata[adata.obs['quality_score'] > 0.8] + +# Load filtered subset into memory +adata_memory = filtered.to_memory() +``` + +### Strategy 2: Chunked processing +```python +# Process data in chunks +adata = ad.read_h5ad('large_file.h5ad', backed='r') + +chunk_size = 1000 +results = [] + +for i in range(0, adata.n_obs, chunk_size): + chunk = adata[i:i+chunk_size, :].to_memory() + # Process chunk + result = process(chunk) + results.append(result) +``` + +### Strategy 3: Use AnnCollection +```python +from anndata.experimental import AnnCollection + +# Create collection without loading data +adatas = [f'dataset_{i}.h5ad' for i in range(10)] +collection = AnnCollection( + adatas, + join_obs='inner', + join_vars='inner' +) + +# Process collection lazily +# Data is loaded only when accessed +``` + +## Common Issues and Solutions + +### Issue: Out of memory when reading +**Solution**: Use backed mode or read in chunks +```python +adata = ad.read_h5ad('file.h5ad', backed='r') +``` + +### Issue: Slow reading from cloud storage +**Solution**: Use Zarr format with appropriate chunking +```python +adata.write_zarr('data.zarr', chunks=(1000, 1000)) +``` + +### Issue: Large file sizes +**Solution**: Use compression and convert to sparse/categorical +```python +adata.strings_to_categoricals() +from scipy.sparse import csr_matrix +adata.X = csr_matrix(adata.X) +adata.write_h5ad('compressed.h5ad', compression='gzip') +``` + +### Issue: Cannot modify backed object +**Solution**: Either load to memory or open in 'r+' mode +```python +# Option 1: Load to memory +adata = adata.to_memory() + +# Option 2: Open in read-write mode +adata = ad.read_h5ad('file.h5ad', backed='r+') +``` diff --git 
a/scientific-packages/anndata/references/manipulation.md b/scientific-packages/anndata/references/manipulation.md new file mode 100644 index 0000000..de0afb0 --- /dev/null +++ b/scientific-packages/anndata/references/manipulation.md @@ -0,0 +1,516 @@ +# Data Manipulation + +Operations for transforming, subsetting, and manipulating AnnData objects. + +## Subsetting + +### By indices +```python +import anndata as ad +import numpy as np + +adata = ad.AnnData(X=np.random.rand(1000, 2000)) + +# Integer indices +subset = adata[0:100, 0:500] # First 100 obs, first 500 vars + +# List of indices +obs_indices = [0, 10, 20, 30, 40] +var_indices = [0, 1, 2, 3, 4] +subset = adata[obs_indices, var_indices] + +# Single observation or variable +single_obs = adata[0, :] +single_var = adata[:, 0] +``` + +### By names +```python +import pandas as pd + +# Create with named indices +obs_names = [f'cell_{i}' for i in range(1000)] +var_names = [f'gene_{i}' for i in range(2000)] +adata = ad.AnnData( + X=np.random.rand(1000, 2000), + obs=pd.DataFrame(index=obs_names), + var=pd.DataFrame(index=var_names) +) + +# Subset by observation names +subset = adata[['cell_0', 'cell_1', 'cell_2'], :] + +# Subset by variable names +subset = adata[:, ['gene_0', 'gene_10', 'gene_20']] + +# Both axes +subset = adata[['cell_0', 'cell_1'], ['gene_0', 'gene_1']] +``` + +### By boolean masks +```python +# Create boolean masks +high_count_obs = np.random.rand(1000) > 0.5 +high_var_genes = np.random.rand(2000) > 0.7 + +# Subset using masks +subset = adata[high_count_obs, :] +subset = adata[:, high_var_genes] +subset = adata[high_count_obs, high_var_genes] +``` + +### By metadata conditions +```python +# Add metadata +adata.obs['cell_type'] = np.random.choice(['A', 'B', 'C'], 1000) +adata.obs['quality_score'] = np.random.rand(1000) +adata.var['highly_variable'] = np.random.rand(2000) > 0.8 + +# Filter by cell type +t_cells = adata[adata.obs['cell_type'] == 'A'] + +# Filter by multiple conditions +high_quality_a_cells = adata[ + (adata.obs['cell_type'] == 'A') & + (adata.obs['quality_score'] > 0.7) +] + +# Filter by variable metadata +hv_genes = adata[:, adata.var['highly_variable']] + +# Complex conditions +filtered = adata[ + (adata.obs['quality_score'] > 0.5) & + (adata.obs['cell_type'].isin(['A', 'B'])), + adata.var['highly_variable'] +] +``` + +## Transposition + +```python +# Transpose AnnData object (swap observations and variables) +adata_T = adata.T + +# Shape changes +print(adata.shape) # (1000, 2000) +print(adata_T.shape) # (2000, 1000) + +# obs and var are swapped +print(adata.obs.head()) # Observation metadata +print(adata_T.var.head()) # Same data, now as variable metadata + +# Useful when data is in opposite orientation +# Common with some file formats where genes are rows +``` + +## Copying + +### Full copy +```python +# Create independent copy +adata_copy = adata.copy() + +# Modifications to copy don't affect original +adata_copy.obs['new_column'] = 1 +print('new_column' in adata.obs.columns) # False +``` + +### Shallow copy +```python +# View (doesn't copy data, modifications affect original) +adata_view = adata[0:100, :] + +# Check if object is a view +print(adata_view.is_view) # True + +# Convert view to independent copy +adata_independent = adata_view.copy() +print(adata_independent.is_view) # False +``` + +## Renaming + +### Rename observations and variables +```python +# Rename all observations +adata.obs_names = [f'new_cell_{i}' for i in range(adata.n_obs)] + +# Rename all variables +adata.var_names = 
[f'new_gene_{i}' for i in range(adata.n_vars)] + +# Make names unique (add suffix to duplicates) +adata.obs_names_make_unique() +adata.var_names_make_unique() +``` + +### Rename categories +```python +# Create categorical column +adata.obs['cell_type'] = pd.Categorical(['A', 'B', 'C'] * 333 + ['A']) + +# Rename categories +adata.rename_categories('cell_type', ['Type_A', 'Type_B', 'Type_C']) + +# Or using dictionary +adata.rename_categories('cell_type', { + 'Type_A': 'T_cell', + 'Type_B': 'B_cell', + 'Type_C': 'Monocyte' +}) +``` + +## Type Conversions + +### Strings to categoricals +```python +# Convert string columns to categorical (more memory efficient) +adata.obs['cell_type'] = ['TypeA', 'TypeB'] * 500 +adata.obs['tissue'] = ['brain', 'liver'] * 500 + +# Convert all string columns to categorical +adata.strings_to_categoricals() + +print(adata.obs['cell_type'].dtype) # category +print(adata.obs['tissue'].dtype) # category +``` + +### Sparse to dense and vice versa +```python +from scipy.sparse import csr_matrix + +# Dense to sparse +if not isinstance(adata.X, csr_matrix): + adata.X = csr_matrix(adata.X) + +# Sparse to dense +if isinstance(adata.X, csr_matrix): + adata.X = adata.X.toarray() + +# Convert layer +adata.layers['normalized'] = csr_matrix(adata.layers['normalized']) +``` + +## Chunked Operations + +Process large datasets in chunks: + +```python +# Iterate through data in chunks +chunk_size = 100 +for chunk in adata.chunked_X(chunk_size): + # Process chunk + result = process_chunk(chunk) +``` + +## Extracting Vectors + +### Get observation vectors +```python +# Get observation metadata as array +cell_types = adata.obs_vector('cell_type') + +# Get gene expression across observations +actb_expression = adata.obs_vector('ACTB') # If ACTB in var_names +``` + +### Get variable vectors +```python +# Get variable metadata as array +gene_names = adata.var_vector('gene_name') +``` + +## Adding/Modifying Data + +### Add observations +```python +# Create new observations +new_obs = ad.AnnData(X=np.random.rand(100, adata.n_vars)) +new_obs.var_names = adata.var_names + +# Concatenate with existing +adata_extended = ad.concat([adata, new_obs], axis=0) +``` + +### Add variables +```python +# Create new variables +new_vars = ad.AnnData(X=np.random.rand(adata.n_obs, 100)) +new_vars.obs_names = adata.obs_names + +# Concatenate with existing +adata_extended = ad.concat([adata, new_vars], axis=1) +``` + +### Add metadata columns +```python +# Add observation annotation +adata.obs['new_score'] = np.random.rand(adata.n_obs) + +# Add variable annotation +adata.var['new_label'] = ['label'] * adata.n_vars + +# Add from external data +external_data = pd.read_csv('metadata.csv', index_col=0) +adata.obs['external_info'] = external_data.loc[adata.obs_names, 'column'] +``` + +### Add layers +```python +# Add new layer +adata.layers['raw_counts'] = np.random.randint(0, 100, adata.shape) +adata.layers['log_transformed'] = np.log1p(adata.X) + +# Replace layer +adata.layers['normalized'] = new_normalized_data +``` + +### Add embeddings +```python +# Add PCA +adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50) + +# Add UMAP +adata.obsm['X_umap'] = np.random.rand(adata.n_obs, 2) + +# Add multiple embeddings +adata.obsm['X_tsne'] = np.random.rand(adata.n_obs, 2) +adata.obsm['X_diffmap'] = np.random.rand(adata.n_obs, 10) +``` + +### Add pairwise relationships +```python +from scipy.sparse import csr_matrix + +# Add nearest neighbor graph +n_obs = adata.n_obs +knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 
0.95) +adata.obsp['connectivities'] = knn_graph + +# Add distance matrix +adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs)) +``` + +### Add unstructured data +```python +# Add analysis parameters +adata.uns['pca'] = { + 'variance': [0.2, 0.15, 0.1], + 'variance_ratio': [0.4, 0.3, 0.2], + 'params': {'n_comps': 50} +} + +# Add color schemes +adata.uns['cell_type_colors'] = ['#FF0000', '#00FF00', '#0000FF'] +``` + +## Removing Data + +### Remove observations or variables +```python +# Keep only specific observations +keep_obs = adata.obs['quality_score'] > 0.5 +adata = adata[keep_obs, :] + +# Remove specific variables +remove_vars = adata.var['low_count'] +adata = adata[:, ~remove_vars] +``` + +### Remove metadata columns +```python +# Remove observation column +adata.obs.drop('unwanted_column', axis=1, inplace=True) + +# Remove variable column +adata.var.drop('unwanted_column', axis=1, inplace=True) +``` + +### Remove layers +```python +# Remove specific layer +del adata.layers['unwanted_layer'] + +# Remove all layers +adata.layers = {} +``` + +### Remove embeddings +```python +# Remove specific embedding +del adata.obsm['X_tsne'] + +# Remove all embeddings +adata.obsm = {} +``` + +### Remove unstructured data +```python +# Remove specific key +del adata.uns['unwanted_key'] + +# Remove all unstructured data +adata.uns = {} +``` + +## Reordering + +### Sort observations +```python +# Sort by observation metadata +adata = adata[adata.obs.sort_values('quality_score').index, :] + +# Sort by observation names +adata = adata[sorted(adata.obs_names), :] +``` + +### Sort variables +```python +# Sort by variable metadata +adata = adata[:, adata.var.sort_values('gene_name').index] + +# Sort by variable names +adata = adata[:, sorted(adata.var_names)] +``` + +### Reorder to match external list +```python +# Reorder observations to match external list +desired_order = ['cell_10', 'cell_5', 'cell_20', ...] +adata = adata[desired_order, :] + +# Reorder variables +desired_genes = ['TP53', 'ACTB', 'GAPDH', ...] +adata = adata[:, desired_genes] +``` + +## Data Transformations + +### Normalize +```python +# Total count normalization (CPM/TPM-like) +total_counts = adata.X.sum(axis=1) +adata.layers['normalized'] = adata.X / total_counts[:, np.newaxis] * 1e6 + +# Log transformation +adata.layers['log1p'] = np.log1p(adata.X) + +# Z-score normalization +mean = adata.X.mean(axis=0) +std = adata.X.std(axis=0) +adata.layers['scaled'] = (adata.X - mean) / std +``` + +### Filter +```python +# Filter cells by total counts +total_counts = np.array(adata.X.sum(axis=1)).flatten() +adata.obs['total_counts'] = total_counts +adata = adata[adata.obs['total_counts'] > 1000, :] + +# Filter genes by detection rate +detection_rate = (adata.X > 0).sum(axis=0) / adata.n_obs +adata.var['detection_rate'] = np.array(detection_rate).flatten() +adata = adata[:, adata.var['detection_rate'] > 0.01] +``` + +## Working with Views + +Views are lightweight references to subsets of data that don't copy the underlying matrix: + +```python +# Create view +view = adata[0:100, 0:500] +print(view.is_view) # True + +# Views allow read access +data = view.X + +# Modifying view data affects original +# (Be careful!) 
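+#
+# In recent anndata releases, however, element assignment on a view usually
+# triggers copy-on-write instead: the view is silently turned into an actual
+# copy (with an ImplicitModificationWarning) and the parent stays unchanged.
+# Rough sketch of that behavior (exact semantics may vary by version):
+#   view.X[0, 0] = 5.0       # view becomes a real AnnData object
+#   print(view.is_view)      # typically False afterwards; adata is untouched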
+ +# Convert view to independent copy +independent = view.copy() + +# Force AnnData to be a copy, not a view +adata = adata.copy() +``` + +## Merging Metadata + +```python +# Merge external metadata +external_metadata = pd.read_csv('additional_metadata.csv', index_col=0) + +# Join metadata (inner join on index) +adata.obs = adata.obs.join(external_metadata) + +# Left join (keep all adata observations) +adata.obs = adata.obs.merge( + external_metadata, + left_index=True, + right_index=True, + how='left' +) +``` + +## Common Manipulation Patterns + +### Quality control filtering +```python +# Calculate QC metrics +adata.obs['n_genes'] = (adata.X > 0).sum(axis=1) +adata.obs['total_counts'] = adata.X.sum(axis=1) +adata.var['n_cells'] = (adata.X > 0).sum(axis=0) + +# Filter low-quality cells +adata = adata[adata.obs['n_genes'] > 200, :] +adata = adata[adata.obs['total_counts'] < 50000, :] + +# Filter rarely detected genes +adata = adata[:, adata.var['n_cells'] >= 3] +``` + +### Select highly variable genes +```python +# Mark highly variable genes +gene_variance = np.var(adata.X, axis=0) +adata.var['variance'] = np.array(gene_variance).flatten() +adata.var['highly_variable'] = adata.var['variance'] > np.percentile(gene_variance, 90) + +# Subset to highly variable genes +adata_hvg = adata[:, adata.var['highly_variable']].copy() +``` + +### Downsample +```python +# Random sampling of observations +np.random.seed(42) +n_sample = 500 +sample_indices = np.random.choice(adata.n_obs, n_sample, replace=False) +adata_downsampled = adata[sample_indices, :].copy() + +# Stratified sampling by cell type +from sklearn.model_selection import train_test_split +train_idx, test_idx = train_test_split( + range(adata.n_obs), + test_size=0.2, + stratify=adata.obs['cell_type'] +) +adata_train = adata[train_idx, :].copy() +adata_test = adata[test_idx, :].copy() +``` + +### Split train/test +```python +# Random train/test split +np.random.seed(42) +n_obs = adata.n_obs +train_size = int(0.8 * n_obs) +indices = np.random.permutation(n_obs) +train_indices = indices[:train_size] +test_indices = indices[train_size:] + +adata_train = adata[train_indices, :].copy() +adata_test = adata[test_indices, :].copy() +``` diff --git a/scientific-packages/anndata/references/workflows_best_practices.md b/scientific-packages/anndata/references/workflows_best_practices.md deleted file mode 100644 index 41f91c5..0000000 --- a/scientific-packages/anndata/references/workflows_best_practices.md +++ /dev/null @@ -1,438 +0,0 @@ -# AnnData Workflows and Best Practices - -## Common Workflows - -### 1. 
Single-Cell RNA-seq Analysis Workflow - -#### Loading Data -```python -import anndata as ad -import numpy as np -import pandas as pd - -# Load from 10X format -adata = ad.read_mtx('matrix.mtx') -adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0] -adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0] - -# Or load from pre-processed h5ad -adata = ad.read_h5ad('preprocessed_data.h5ad') -``` - -#### Quality Control -```python -# Calculate QC metrics -adata.obs['n_genes'] = (adata.X > 0).sum(axis=1) -adata.obs['total_counts'] = adata.X.sum(axis=1) - -# Filter cells -adata = adata[adata.obs.n_genes > 200] -adata = adata[adata.obs.total_counts < 10000] - -# Filter genes -min_cells = 3 -adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells] -``` - -#### Normalization and Preprocessing -```python -# Store raw counts -adata.layers['counts'] = adata.X.copy() - -# Normalize -adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4 - -# Log transform -adata.layers['log1p'] = np.log1p(adata.X) -adata.X = adata.layers['log1p'] - -# Identify highly variable genes -gene_variance = adata.X.var(axis=0) -adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90) -``` - -#### Dimensionality Reduction -```python -# PCA -from sklearn.decomposition import PCA -pca = PCA(n_components=50) -adata.obsm['X_pca'] = pca.fit_transform(adata.X) - -# Store PCA variance -adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_} - -# UMAP -from umap import UMAP -umap = UMAP(n_components=2) -adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca']) -``` - -#### Clustering -```python -# Store cluster assignments -adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...]) - -# Store cluster centroids -centroids = np.array([...]) -adata.varm['cluster_centroids'] = centroids -``` - -#### Save Results -```python -# Save complete analysis -adata.write('analyzed_data.h5ad', compression='gzip') -``` - -### 2. Batch Integration Workflow - -```python -import anndata as ad - -# Load multiple batches -batch1 = ad.read_h5ad('batch1.h5ad') -batch2 = ad.read_h5ad('batch2.h5ad') -batch3 = ad.read_h5ad('batch3.h5ad') - -# Concatenate with batch labels -adata = ad.concat( - [batch1, batch2, batch3], - axis=0, - label='batch', - keys=['batch1', 'batch2', 'batch3'], - index_unique='-' -) - -# Batch effect correction would go here -# (using external tools like Harmony, Scanorama, etc.) - -# Store corrected embeddings -adata.obsm['X_pca_corrected'] = corrected_pca -adata.obsm['X_umap_corrected'] = corrected_umap -``` - -### 3. Memory-Efficient Large Dataset Workflow - -```python -import anndata as ad - -# Read in backed mode -adata = ad.read_h5ad('large_dataset.h5ad', backed='r') - -# Check backing status -print(f"Is backed: {adata.isbacked}") -print(f"File: {adata.filename}") - -# Work with chunks -for chunk in adata.chunk_X(chunk_size=1000): - # Process chunk - result = process_chunk(chunk) - -# Close file when done -adata.file.close() -``` - -### 4. 
Multi-Dataset Comparison Workflow - -```python -import anndata as ad - -# Load datasets -datasets = { - 'study1': ad.read_h5ad('study1.h5ad'), - 'study2': ad.read_h5ad('study2.h5ad'), - 'study3': ad.read_h5ad('study3.h5ad') -} - -# Outer join to keep all genes -combined = ad.concat( - list(datasets.values()), - axis=0, - join='outer', - label='study', - keys=list(datasets.keys()), - merge='first' -) - -# Handle missing data -combined.X[np.isnan(combined.X)] = 0 - -# Add dataset-specific metadata -combined.uns['datasets'] = { - 'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs}, - 'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs}, - 'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs} -} -``` - -## Best Practices - -### Memory Management - -#### Use Sparse Matrices -```python -from scipy.sparse import csr_matrix - -# Convert to sparse if data is sparse -if density < 0.3: # Less than 30% non-zero - adata.X = csr_matrix(adata.X) -``` - -#### Use Backed Mode for Large Files -```python -# Read with backing -adata = ad.read_h5ad('large_file.h5ad', backed='r') - -# Only load what you need -subset = adata[:1000, :500].copy() # Now in memory -``` - -#### Convert Strings to Categoricals -```python -# Efficient storage for repeated strings -adata.strings_to_categoricals() - -# Or manually -adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type']) -``` - -### Data Organization - -#### Use Layers for Different Representations -```python -# Store multiple versions of the data -adata.layers['counts'] = raw_counts -adata.layers['normalized'] = normalized_data -adata.layers['log1p'] = log_transformed_data -adata.layers['scaled'] = scaled_data -``` - -#### Use obsm/varm for Multi-Dimensional Annotations -```python -# Embeddings -adata.obsm['X_pca'] = pca_coordinates -adata.obsm['X_umap'] = umap_coordinates -adata.obsm['X_tsne'] = tsne_coordinates - -# Gene loadings -adata.varm['PCs'] = principal_components -``` - -#### Use uns for Analysis Metadata -```python -# Store parameters -adata.uns['preprocessing'] = { - 'normalization': 'TPM', - 'min_genes': 200, - 'min_cells': 3, - 'date': '2024-01-15' -} - -# Store analysis results -adata.uns['differential_expression'] = { - 'method': 't-test', - 'p_value_threshold': 0.05 -} -``` - -### Subsetting and Views - -#### Understand View vs Copy -```python -# Subsetting returns a view -subset = adata[adata.obs.cell_type == 'B cell'] # View -print(subset.is_view) # True - -# Views are memory efficient but modifications affect original -subset.obs['new_column'] = value # Modifies original adata - -# Create independent copy when needed -subset_copy = adata[adata.obs.cell_type == 'B cell'].copy() -``` - -#### Chain Operations Efficiently -```python -# Bad - creates multiple intermediate views -temp1 = adata[adata.obs.batch == 'batch1'] -temp2 = temp1[temp1.obs.n_genes > 200] -result = temp2[:, temp2.var.highly_variable].copy() - -# Good - chain operations -result = adata[ - (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200), - adata.var.highly_variable -].copy() -``` - -### File I/O - -#### Use Compression -```python -# Save with compression -adata.write('data.h5ad', compression='gzip') -``` - -#### Choose the Right Format -```python -# H5AD for general use (good compression, fast) -adata.write_h5ad('data.h5ad') - -# Zarr for cloud storage and parallel access -adata.write_zarr('data.zarr') - -# Loom for compatibility with other tools -adata.write_loom('data.loom') -``` - -#### Close File Connections 
-```python -# Use context manager pattern -adata = ad.read_h5ad('file.h5ad', backed='r') -try: - # Work with data - process(adata) -finally: - adata.file.close() -``` - -### Concatenation - -#### Choose Appropriate Join Strategy -```python -# Inner join - only common features (safe, may lose data) -combined = ad.concat([adata1, adata2], join='inner') - -# Outer join - all features (keeps all data, may introduce zeros) -combined = ad.concat([adata1, adata2], join='outer') -``` - -#### Track Data Sources -```python -# Add source labels -combined = ad.concat( - [adata1, adata2, adata3], - label='dataset', - keys=['exp1', 'exp2', 'exp3'] -) - -# Make indices unique -combined = ad.concat( - [adata1, adata2, adata3], - index_unique='-' -) -``` - -#### Handle Variable-Specific Metadata -```python -# Use merge strategy for var annotations -combined = ad.concat( - [adata1, adata2], - merge='same', # Keep only identical annotations - join='outer' -) -``` - -### Naming Conventions - -#### Use Consistent Naming -```python -# Embeddings: X_ -adata.obsm['X_pca'] -adata.obsm['X_umap'] -adata.obsm['X_tsne'] - -# Layers: descriptive names -adata.layers['counts'] -adata.layers['log1p'] -adata.layers['scaled'] - -# Observations: snake_case -adata.obs['cell_type'] -adata.obs['n_genes'] -adata.obs['total_counts'] -``` - -#### Make Indices Unique -```python -# Ensure unique names -adata.obs_names_make_unique() -adata.var_names_make_unique() -``` - -### Error Handling - -#### Validate Data Structure -```python -# Check dimensions -assert adata.n_obs > 0, "No observations in data" -assert adata.n_vars > 0, "No variables in data" - -# Check for NaN values -if np.isnan(adata.X).any(): - print("Warning: NaN values detected") - -# Check for negative values in count data -if (adata.X < 0).any(): - print("Warning: Negative values in count data") -``` - -#### Handle Missing Data -```python -# Check for missing annotations -if adata.obs['cell_type'].isna().any(): - print("Warning: Missing cell type annotations") - # Fill or remove - adata = adata[~adata.obs['cell_type'].isna()] -``` - -## Common Pitfalls - -### 1. Forgetting to Copy Views -```python -# BAD - modifies original -subset = adata[adata.obs.condition == 'treated'] -subset.X = transformed_data # Changes original adata! - -# GOOD -subset = adata[adata.obs.condition == 'treated'].copy() -subset.X = transformed_data # Only changes subset -``` - -### 2. Mixing Backed and In-Memory Operations -```python -# BAD - trying to modify backed data -adata = ad.read_h5ad('file.h5ad', backed='r') -adata.X[0, 0] = 100 # Error: can't modify backed data - -# GOOD - load to memory first -adata = ad.read_h5ad('file.h5ad', backed='r') -adata = adata.to_memory() -adata.X[0, 0] = 100 # Works -``` - -### 3. Not Using Categoricals for Metadata -```python -# BAD - stores as strings (memory inefficient) -adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000 - -# GOOD - use categorical -adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000) -``` - -### 4. Incorrect Concatenation Axis -```python -# Concatenating observations (cells) -combined = ad.concat([adata1, adata2], axis=0) # Correct - -# Concatenating variables (genes) - rare -combined = ad.concat([adata1, adata2], axis=1) # Less common -``` - -### 5. Not Preserving Raw Data -```python -# BAD - loses original data -adata.X = normalized_data - -# GOOD - preserve original -adata.layers['counts'] = adata.X.copy() -adata.X = normalized_data -```