Initial commit

2026-03-28 07:33:45 +08:00 · 2025-10-19 14:01:29 -07:00
parent d85386c32b
commit 152d0d54de
15 changed files with 4569 additions and 0 deletions
--- a/scientific-packages/anndata/SKILL.md
+++ b/scientific-packages/anndata/SKILL.md
@@ -0,0 +1,527 @@
+---
+name: anndata
+description: Work with AnnData objects for annotated data matrices, commonly used in single-cell genomics and other scientific domains. This skill should be used when working with .h5ad files, performing single-cell RNA-seq analysis, managing annotated datasets, concatenating multiple datasets, or working with sparse matrices and embeddings in a structured format.
+---
+
+# AnnData
+
+## Overview
+
+AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. This skill provides comprehensive guidance for working with AnnData objects, including data creation, manipulation, file I/O, concatenation, and best practices for memory-efficient workflows.
+
+## Core Capabilities
+
+### 1. Creating and Structuring AnnData Objects
+
+Create AnnData objects from various data sources and organize multi-dimensional annotations.
+
+**Basic creation:**
+```python
+import anndata as ad
+import numpy as np
+from scipy.sparse import csr_matrix
+
+# From dense or sparse arrays
+counts = np.random.poisson(1, size=(100, 2000))
+adata = ad.AnnData(counts)
+
+# With sparse matrix (memory-efficient)
+counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
+adata = ad.AnnData(counts)
+```
+
+**With metadata:**
+```python
+import pandas as pd
+
+obs_meta = pd.DataFrame({
+    'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']),
+    'batch': ['batch1'] * 50 + ['batch2'] * 50
+})
+var_meta = pd.DataFrame({
+    'gene_name': [f'Gene_{i}' for i in range(2000)],
+    'highly_variable': np.random.choice([True, False], 2000)
+})
+
+adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
+```
+
+**Understanding the structure:**
+- **X**: Primary data matrix (observations × variables)
+- **obs**: Row (observation) annotations as DataFrame
+- **var**: Column (variable) annotations as DataFrame
+- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
+- **varm**: Multi-dimensional variable annotations (e.g., gene loadings)
+- **layers**: Alternative data matrices with same dimensions as X
+- **uns**: Unstructured metadata dictionary
+- **obsp/varp**: Pairwise relationship matrices (graphs)
+
+### 2. Adding Annotations and Layers
+
+Organize different data representations and metadata within a single object.
+
+**Cell-level metadata (obs):**
+```python
+adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
+adata.obs['total_counts'] = adata.X.sum(axis=1)
+adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)
+```
+
+**Gene-level metadata (var):**
+```python
+adata.var['highly_variable'] = gene_variance > threshold
+adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...])
+```
+
+**Embeddings (obsm/varm):**
+```python
+# Dimensionality reduction results
+adata.obsm['X_pca'] = pca_coordinates  # Shape: (n_obs, n_components)
+adata.obsm['X_umap'] = umap_coordinates  # Shape: (n_obs, 2)
+adata.obsm['X_tsne'] = tsne_coordinates
+
+# Gene loadings
+adata.varm['PCs'] = principal_components  # Shape: (n_vars, n_components)
+```
+
+**Alternative data representations (layers):**
+```python
+# Store multiple versions
+adata.layers['counts'] = raw_counts
+adata.layers['log1p'] = np.log1p(adata.X)
+adata.layers['scaled'] = (adata.X - mean) / std
+```
+
+**Unstructured metadata (uns):**
+```python
+# Analysis parameters
+adata.uns['preprocessing'] = {
+    'normalization': 'TPM',
+    'min_genes': 200,
+    'date': '2024-01-15'
+}
+
+# Results
+adata.uns['pca'] = {'variance_ratio': variance_explained}
+```
+
+### 3. Subsetting and Views
+
+Efficiently subset data while managing memory through views and copies.
+
+**Subsetting operations:**
+```python
+# By observation/variable names
+subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
+
+# By boolean masks
+b_cells = adata[adata.obs.cell_type == 'B']
+high_quality = adata[adata.obs.n_genes > 200]
+
+# By position
+first_cells = adata[:100, :]
+top_genes = adata[:, :500]
+
+# Combined conditions
+filtered = adata[
+    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
+    adata.var.highly_variable
+]
+```
+
+**Understanding views:**
+- Subsetting returns **views** by default (memory-efficient, shares data with original)
+- Modifying a view affects the original object
+- Check with `adata.is_view`
+- Convert to independent copy with `.copy()`
+
+```python
+# View (memory-efficient)
+subset = adata[adata.obs.condition == 'treated']
+print(subset.is_view)  # True
+
+# Independent copy
+subset_copy = adata[adata.obs.condition == 'treated'].copy()
+print(subset_copy.is_view)  # False
+```
+
+### 4. File I/O and Backed Mode
+
+Read and write data efficiently, with options for memory-limited environments.
+
+**Writing data:**
+```python
+# Standard format with compression
+adata.write('results.h5ad', compression='gzip')
+
+# Alternative formats
+adata.write_zarr('results.zarr')  # For cloud storage
+adata.write_loom('results.loom')  # For compatibility
+adata.write_csvs('results/')      # As CSV files
+```
+
+**Reading data:**
+```python
+# Load into memory
+adata = ad.read_h5ad('results.h5ad')
+
+# Backed mode (disk-backed, memory-efficient)
+adata = ad.read_h5ad('large_file.h5ad', backed='r')
+print(adata.isbacked)  # True
+print(adata.filename)  # Path to file
+
+# Close file connection when done
+adata.file.close()
+```
+
+**Reading from other formats:**
+```python
+# 10X format
+adata = ad.read_mtx('matrix.mtx')
+
+# CSV
+adata = ad.read_csv('data.csv')
+
+# Loom
+adata = ad.read_loom('data.loom')
+```
+
+**Working with backed mode:**
+```python
+# Read in backed mode for large files
+adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
+
+# Process in chunks
+for chunk in adata.chunk_X(chunk_size=1000):
+    result = process_chunk(chunk)
+
+# Load to memory if needed
+adata_memory = adata.to_memory()
+```
+
+### 5. Concatenating Multiple Datasets
+
+Combine multiple AnnData objects with control over how data is merged.
+
+**Basic concatenation:**
+```python
+# Concatenate observations (most common)
+combined = ad.concat([adata1, adata2, adata3], axis=0)
+
+# Concatenate variables (rare)
+combined = ad.concat([adata1, adata2], axis=1)
+```
+
+**Join strategies:**
+```python
+# Inner join: only shared variables (no missing data)
+combined = ad.concat([adata1, adata2], join='inner')
+
+# Outer join: all variables (fills missing with 0)
+combined = ad.concat([adata1, adata2], join='outer')
+```
+
+**Tracking data sources:**
+```python
+# Add source labels
+combined = ad.concat(
+    [adata1, adata2, adata3],
+    label='dataset',
+    keys=['exp1', 'exp2', 'exp3']
+)
+# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3'
+
+# Make duplicate indices unique
+combined = ad.concat(
+    [adata1, adata2],
+    keys=['batch1', 'batch2'],
+    index_unique='-'
+)
+# Cell names become: Cell_0-batch1, Cell_0-batch2, etc.
+```
+
+**Merge strategies for metadata:**
+```python
+# merge=None: exclude variable annotations (default)
+combined = ad.concat([adata1, adata2], merge=None)
+
+# merge='same': keep only identical annotations
+combined = ad.concat([adata1, adata2], merge='same')
+
+# merge='first': use first occurrence
+combined = ad.concat([adata1, adata2], merge='first')
+
+# merge='unique': keep annotations with single value
+combined = ad.concat([adata1, adata2], merge='unique')
+```
+
+**Complete example:**
+```python
+# Load batches
+batch1 = ad.read_h5ad('batch1.h5ad')
+batch2 = ad.read_h5ad('batch2.h5ad')
+batch3 = ad.read_h5ad('batch3.h5ad')
+
+# Concatenate with full tracking
+combined = ad.concat(
+    [batch1, batch2, batch3],
+    axis=0,
+    join='outer',              # Keep all genes
+    merge='first',             # Use first batch's annotations
+    label='batch_id',          # Track source
+    keys=['b1', 'b2', 'b3'],  # Custom labels
+    index_unique='-'           # Make cell names unique
+)
+```
+
+### 6. Data Conversion and Extraction
+
+Convert between AnnData and other formats for interoperability.
+
+**To DataFrame:**
+```python
+# Convert X to DataFrame
+df = adata.to_df()
+
+# Convert specific layer
+df = adata.to_df(layer='log1p')
+```
+
+**Extract vectors:**
+```python
+# Get 1D arrays from data or annotations
+gene_expression = adata.obs_vector('Gene_100')
+cell_metadata = adata.obs_vector('n_genes')
+```
+
+**Transpose:**
+```python
+# Swap observations and variables
+transposed = adata.T
+```
+
+### 7. Memory Optimization
+
+Strategies for working with large datasets efficiently.
+
+**Use sparse matrices:**
+```python
+from scipy.sparse import csr_matrix
+
+# Check sparsity
+density = (adata.X != 0).sum() / adata.X.size
+if density < 0.3:  # Less than 30% non-zero
+    adata.X = csr_matrix(adata.X)
+```
+
+**Convert strings to categoricals:**
+```python
+# Automatic conversion
+adata.strings_to_categoricals()
+
+# Manual conversion (more control)
+adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
+```
+
+**Use backed mode:**
+```python
+# Read without loading into memory
+adata = ad.read_h5ad('large_file.h5ad', backed='r')
+
+# Work with subsets
+subset = adata[:1000, :500].copy()  # Only this subset in memory
+```
+
+**Chunked processing:**
+```python
+# Process data in chunks
+results = []
+for chunk in adata.chunk_X(chunk_size=1000):
+    result = expensive_computation(chunk)
+    results.append(result)
+```
+
+## Common Workflows
+
+### Single-Cell RNA-seq Analysis
+
+Complete workflow from loading to analysis:
+
+```python
+import anndata as ad
+import numpy as np
+import pandas as pd
+
+# 1. Load data
+adata = ad.read_mtx('matrix.mtx')
+adata.obs_names = pd.read_csv('barcodes.tsv', header=None)[0]
+adata.var_names = pd.read_csv('genes.tsv', header=None)[0]
+
+# 2. Quality control
+adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
+adata.obs['total_counts'] = adata.X.sum(axis=1)
+adata = adata[adata.obs.n_genes > 200]
+adata = adata[adata.obs.total_counts < 10000]
+
+# 3. Filter genes
+min_cells = 3
+adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells]
+
+# 4. Store raw counts
+adata.layers['counts'] = adata.X.copy()
+
+# 5. Normalize
+adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
+adata.X = np.log1p(adata.X)
+
+# 6. Feature selection
+gene_var = adata.X.var(axis=0)
+adata.var['highly_variable'] = gene_var > np.percentile(gene_var, 90)
+
+# 7. Dimensionality reduction (example with external tools)
+# adata.obsm['X_pca'] = compute_pca(adata.X)
+# adata.obsm['X_umap'] = compute_umap(adata.obsm['X_pca'])
+
+# 8. Save results
+adata.write('analyzed.h5ad', compression='gzip')
+```
+
+### Batch Integration
+
+Combining multiple experimental batches:
+
+```python
+# Load batches
+batches = [ad.read_h5ad(f'batch_{i}.h5ad') for i in range(3)]
+
+# Concatenate with tracking
+combined = ad.concat(
+    batches,
+    axis=0,
+    join='outer',
+    label='batch',
+    keys=['batch_0', 'batch_1', 'batch_2'],
+    index_unique='-'
+)
+
+# Add batch as numeric for correction algorithms
+combined.obs['batch_numeric'] = combined.obs['batch'].cat.codes
+
+# Perform batch correction (with external tools)
+# corrected_pca = run_harmony(combined.obsm['X_pca'], combined.obs['batch'])
+# combined.obsm['X_pca_corrected'] = corrected_pca
+
+# Save integrated data
+combined.write('integrated.h5ad', compression='gzip')
+```
+
+### Memory-Efficient Large Dataset Processing
+
+Working with datasets too large for memory:
+
+```python
+# Read in backed mode
+adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
+
+# Compute statistics in chunks
+total = 0
+for chunk in adata.chunk_X(chunk_size=1000):
+    total += chunk.sum()
+
+mean_expression = total / (adata.n_obs * adata.n_vars)
+
+# Work with subset
+high_quality_cells = adata.obs.n_genes > 1000
+subset = adata[high_quality_cells, :].copy()
+
+# Close file
+adata.file.close()
+```
+
+## Best Practices
+
+### Data Organization
+
+1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers
+2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations
+3. **Use uns for metadata**: Analysis parameters, dates, version information
+4. **Use categoricals for efficiency**: Convert repeated strings to categorical types
+
+### Subsetting
+
+1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence
+2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation
+3. **Validate after subsetting**: Check dimensions and data integrity
+
+### File I/O
+
+1. **Use compression**: Always use `compression='gzip'` when writing h5ad files
+2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility
+3. **Close backed files**: Always close file connections when done
+4. **Use backed mode for large files**: Don't load everything into memory if not needed
+
+### Concatenation
+
+1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features
+2. **Track sources**: Use `label` and `keys` to track data origin
+3. **Handle duplicates**: Use `index_unique` to make observation names unique
+4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations
+
+### Memory Management
+
+1. **Use sparse matrices**: For data with <30% non-zero values
+2. **Convert to categoricals**: For repeated string values
+3. **Process in chunks**: For operations on very large matrices
+4. **Use backed mode**: Read large files with `backed='r'`
+
+### Naming Conventions
+
+Follow these conventions for consistency:
+
+- **Embeddings**: `X_pca`, `X_umap`, `X_tsne`
+- **Layers**: Descriptive names like `counts`, `log1p`, `scaled`
+- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts`
+- **Variables**: Use snake_case like `highly_variable`, `gene_name`
+
+## Reference Documentation
+
+For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the `references/` directory:
+
+1. **api_reference.md**: Complete API documentation including all classes, methods, and functions with usage examples. Use `grep -r "pattern" references/api_reference.md` to search for specific functions or parameters.
+
+2. **workflows_best_practices.md**: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use `grep -r "pattern" references/workflows_best_practices.md` to search for specific workflows.
+
+3. **concatenation_guide.md**: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use `grep -r "pattern" references/concatenation_guide.md` to search for concatenation patterns.
+
+## When to Load References
+
+Load reference files into context when:
+- Implementing complex concatenation with specific merge strategies
+- Troubleshooting errors or unexpected behavior
+- Optimizing memory usage for large datasets
+- Implementing complete analysis workflows
+- Understanding nuances of specific API methods
+
+To search within references without loading them:
+```python
+# Example: Search for information about backed mode
+grep -r "backed mode" references/
+```
+
+## Common Error Patterns
+
+### Memory Errors
+**Problem**: "MemoryError: Unable to allocate array"
+**Solution**: Use backed mode, sparse matrices, or process in chunks
+
+### Dimension Mismatch
+**Problem**: "ValueError: operands could not be broadcast together"
+**Solution**: Use outer join in concatenation or align indices before operations
+
+### View Modification
+**Problem**: "ValueError: assignment destination is read-only"
+**Solution**: Convert view to copy with `.copy()` before modification
+
+### File Already Open
+**Problem**: "OSError: Unable to open file (file is already open)"
+**Solution**: Close previous file connection with `adata.file.close()`