name: anndata
description: Work with AnnData objects for annotated data matrices, commonly used in single-cell genomics and other scientific domains. This skill should be used when working with .h5ad files, performing single-cell RNA-seq analysis, managing annotated datasets, concatenating multiple datasets, or working with sparse matrices and embeddings in a structured format.

AnnData

Overview

AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. This skill provides comprehensive guidance for working with AnnData objects, including data creation, manipulation, file I/O, concatenation, and best practices for memory-efficient workflows.

Core Capabilities

1. Creating and Structuring AnnData Objects

Create AnnData objects from various data sources and organize multi-dimensional annotations.

Basic creation:

import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix

# From dense or sparse arrays
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)

# With sparse matrix (memory-efficient)
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)

With metadata:

import pandas as pd

obs_meta = pd.DataFrame({
    'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']),
    'batch': ['batch1'] * 50 + ['batch2'] * 50
})
var_meta = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)],
    'highly_variable': np.random.choice([True, False], 2000)
})

adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)

Understanding the structure:

  • X: Primary data matrix (observations × variables)
  • obs: Row (observation) annotations as DataFrame
  • var: Column (variable) annotations as DataFrame
  • obsm: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
  • varm: Multi-dimensional variable annotations (e.g., gene loadings)
  • layers: Alternative data matrices with same dimensions as X
  • uns: Unstructured metadata dictionary
  • obsp/varp: Pairwise relationship matrices (graphs)
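
To see how these slots fit together, it can help to print the object and inspect each attribute directly. The sketch below assumes the adata object with obs and var metadata created above:

# The repr lists every populated slot, e.g.
# AnnData object with n_obs × n_vars = 100 × 2000
#     obs: 'cell_type', 'batch'
#     var: 'gene_name', 'highly_variable'
print(adata)

print(adata.shape)           # (n_obs, n_vars)
print(adata.obs.head())      # observation annotations (DataFrame)
print(adata.var.head())      # variable annotations (DataFrame)
print(adata.obs_names[:5])   # observation index
print(adata.var_names[:5])   # variable index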

2. Adding Annotations and Layers

Organize different data representations and metadata within a single object.

Cell-level metadata (obs):

# np.asarray(...).ravel() keeps these 1-D for both dense and sparse X
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)

Gene-level metadata (var):

adata.var['highly_variable'] = gene_variance > threshold
adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...])

Embeddings (obsm/varm):

# Dimensionality reduction results
adata.obsm['X_pca'] = pca_coordinates  # Shape: (n_obs, n_components)
adata.obsm['X_umap'] = umap_coordinates  # Shape: (n_obs, 2)
adata.obsm['X_tsne'] = tsne_coordinates

# Gene loadings
adata.varm['PCs'] = principal_components  # Shape: (n_vars, n_components)

Alternative data representations (layers):

# Store multiple versions
adata.layers['counts'] = raw_counts
adata.layers['log1p'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - mean) / std

Unstructured metadata (uns):

# Analysis parameters
adata.uns['preprocessing'] = {
    'normalization': 'TPM',
    'min_genes': 200,
    'date': '2024-01-15'
}

# Results
adata.uns['pca'] = {'variance_ratio': variance_explained}

3. Subsetting and Views

Efficiently subset data while managing memory through views and copies.

Subsetting operations:

# By observation/variable names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]

# By boolean masks
b_cells = adata[adata.obs.cell_type == 'B']
high_quality = adata[adata.obs.n_genes > 200]

# By position
first_cells = adata[:100, :]
top_genes = adata[:, :500]

# Combined conditions
filtered = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
]

Understanding views:

  • Subsetting returns views by default (memory-efficient, shares data with original)
  • Modifying a view affects the original object
  • Check with adata.is_view
  • Convert to independent copy with .copy()

# View (memory-efficient)
subset = adata[adata.obs.condition == 'treated']
print(subset.is_view)  # True

# Independent copy
subset_copy = adata[adata.obs.condition == 'treated'].copy()
print(subset_copy.is_view)  # False

4. File I/O and Backed Mode

Read and write data efficiently, with options for memory-limited environments.

Writing data:

# Standard format with compression
adata.write('results.h5ad', compression='gzip')

# Alternative formats
adata.write_zarr('results.zarr')  # For cloud storage
adata.write_loom('results.loom')  # For compatibility
adata.write_csvs('results/')      # As CSV files

Reading data:

# Load into memory
adata = ad.read_h5ad('results.h5ad')

# Backed mode (disk-backed, memory-efficient)
adata = ad.read_h5ad('large_file.h5ad', backed='r')
print(adata.isbacked)  # True
print(adata.filename)  # Path to file

# Close file connection when done
adata.file.close()

Reading from other formats:

# Matrix Market format (10x .mtx files store genes x cells; transpose with .T if cells should be observations)
adata = ad.read_mtx('matrix.mtx')

# CSV
adata = ad.read_csv('data.csv')

# Loom
adata = ad.read_loom('data.loom')

Working with backed mode:

# Read in backed mode for large files
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Process in chunks
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    result = process_chunk(chunk)

# Load to memory if needed
adata_memory = adata.to_memory()

5. Concatenating Multiple Datasets

Combine multiple AnnData objects with control over how data is merged.

Basic concatenation:

# Concatenate observations (most common)
combined = ad.concat([adata1, adata2, adata3], axis=0)

# Concatenate variables (rare)
combined = ad.concat([adata1, adata2], axis=1)

Join strategies:

# Inner join: only shared variables (no missing data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join: all variables (fills missing with 0)
combined = ad.concat([adata1, adata2], join='outer')

Tracking data sources:

# Add source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)
# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3'

# Make duplicate indices unique
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)
# Cell names become: Cell_0-batch1, Cell_0-batch2, etc.

Merge strategies for metadata:

# merge=None: exclude variable annotations (default)
combined = ad.concat([adata1, adata2], merge=None)

# merge='same': keep only identical annotations
combined = ad.concat([adata1, adata2], merge='same')

# merge='first': use first occurrence
combined = ad.concat([adata1, adata2], merge='first')

# merge='unique': keep annotations with single value
combined = ad.concat([adata1, adata2], merge='unique')

Complete example:

# Load batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with full tracking
combined = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    join='outer',              # Keep all genes
    merge='first',             # Use first batch's annotations
    label='batch_id',          # Track source
    keys=['b1', 'b2', 'b3'],  # Custom labels
    index_unique='-'           # Make cell names unique
)

6. Data Conversion and Extraction

Convert between AnnData and other formats for interoperability.

To DataFrame:

# Convert X to DataFrame
df = adata.to_df()

# Convert specific layer
df = adata.to_df(layer='log1p')

Extract vectors:

# Get 1D arrays from data or annotations
gene_expression = adata.obs_vector('Gene_100')
cell_metadata = adata.obs_vector('n_genes')

Transpose:

# Swap observations and variables
transposed = adata.T

7. Memory Optimization

Strategies for working with large datasets efficiently.

Use sparse matrices:

from scipy.sparse import csr_matrix

# Check sparsity
density = (adata.X != 0).sum() / adata.X.size
if density < 0.3:  # Less than 30% non-zero
    adata.X = csr_matrix(adata.X)

Convert strings to categoricals:

# Automatic conversion
adata.strings_to_categoricals()

# Manual conversion (more control)
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])

Use backed mode:

# Read without loading into memory
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Work with subsets
subset = adata[:1000, :500].copy()  # Only this subset in memory

Chunked processing:

# Process data in chunks
results = []
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    result = expensive_computation(chunk)
    results.append(result)

Common Workflows

Single-Cell RNA-seq Analysis

Complete workflow from loading to analysis:

import anndata as ad
import numpy as np
import pandas as pd

# 1. Load data (10x-style matrix.mtx stores genes x cells, so transpose to cells x genes)
adata = ad.read_mtx('matrix.mtx').T
adata.obs_names = pd.read_csv('barcodes.tsv', header=None, sep='\t')[0].values
adata.var_names = pd.read_csv('genes.tsv', header=None, sep='\t')[0].values

# 2. Quality control (np.asarray(...).ravel() handles the sparse X returned by read_mtx)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]

# 3. Filter genes detected in too few cells
min_cells = 3
adata = adata[:, np.asarray((adata.X > 0).sum(axis=0)).ravel() >= min_cells]

# 4. Store raw counts
adata.layers['counts'] = adata.X.copy()

# 5. Normalize (element-wise math like this assumes a dense X;
#    densify first with adata.X = adata.X.toarray() if X is sparse)
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
adata.X = np.log1p(adata.X)

# 6. Feature selection
gene_var = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_var > np.percentile(gene_var, 90)

# 7. Dimensionality reduction (example with external tools)
# adata.obsm['X_pca'] = compute_pca(adata.X)
# adata.obsm['X_umap'] = compute_umap(adata.obsm['X_pca'])

# 8. Save results
adata.write('analyzed.h5ad', compression='gzip')

Batch Integration

Combining multiple experimental batches:

# Load batches
batches = [ad.read_h5ad(f'batch_{i}.h5ad') for i in range(3)]

# Concatenate with tracking
combined = ad.concat(
    batches,
    axis=0,
    join='outer',
    label='batch',
    keys=['batch_0', 'batch_1', 'batch_2'],
    index_unique='-'
)

# Add batch as numeric for correction algorithms
combined.obs['batch_numeric'] = combined.obs['batch'].cat.codes

# Perform batch correction (with external tools)
# corrected_pca = run_harmony(combined.obsm['X_pca'], combined.obs['batch'])
# combined.obsm['X_pca_corrected'] = corrected_pca

# Save integrated data
combined.write('integrated.h5ad', compression='gzip')

Memory-Efficient Large Dataset Processing

Working with datasets too large for memory:

# Read in backed mode
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')

# Compute statistics in chunks
total = 0
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    total += chunk.sum()

mean_expression = total / (adata.n_obs * adata.n_vars)

# Work with subset
high_quality_cells = adata.obs.n_genes > 1000
subset = adata[high_quality_cells, :].copy()

# Close file
adata.file.close()

Best Practices

Data Organization

  1. Use layers for different representations: Store raw counts, normalized, log-transformed, and scaled data in separate layers
  2. Use obsm/varm for multi-dimensional data: Embeddings, loadings, and other matrix-like annotations
  3. Use uns for metadata: Analysis parameters, dates, version information
  4. Use categoricals for efficiency: Convert repeated strings to categorical types
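
As a rough, self-contained sketch of these four conventions applied to a small synthetic object (all values are placeholders):

import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(np.random.poisson(1.0, size=(50, 100)).astype(np.float32))
adata.layers['counts'] = adata.X.copy()                          # 1. raw counts kept in a layer
adata.X = np.log1p(adata.X)                                      #    working matrix is log1p-transformed
adata.obsm['X_pca'] = np.zeros((adata.n_obs, 10))                # 2. embeddings go in obsm
adata.uns['preprocessing'] = {'normalization': 'log1p',          # 3. parameters go in uns
                              'min_genes': 200}
adata.obs['condition'] = pd.Categorical(['ctrl', 'stim'] * 25)   # 4. repeated strings as categoricals
print(adata)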

Subsetting

  1. Understand views vs copies: Subsetting returns views by default; use .copy() when you need independence
  2. Chain conditions efficiently: Combine boolean masks in a single subsetting operation
  3. Validate after subsetting: Check dimensions and data integrity
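
A brief sketch of points 2 and 3 on a small synthetic object (the column names here are placeholders):

import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(
    np.random.poisson(1.0, size=(100, 50)).astype(np.float32),
    obs=pd.DataFrame({'batch': ['b1'] * 50 + ['b2'] * 50,
                      'n_genes': np.random.randint(100, 500, size=100)}),
)
adata.var['highly_variable'] = np.random.choice([True, False], size=50)

# Combine conditions in one subsetting operation, then take an explicit copy
filtered = adata[(adata.obs.batch == 'b1') & (adata.obs.n_genes > 200),
                 adata.var.highly_variable].copy()

# Validate the result
print(filtered.is_view)   # False
print(filtered.shape)     # (n_cells_kept, n_genes_kept)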

File I/O

  1. Use compression: Always use compression='gzip' when writing h5ad files
  2. Choose the right format: H5AD for general use, Zarr for cloud storage, Loom for compatibility
  3. Close backed files: Always close file connections when done
  4. Use backed mode for large files: Don't load everything into memory if not needed
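
For point 3, one way to guarantee the handle is released is a try/finally block (a sketch assuming 'large_file.h5ad' already exists on disk):

import anndata as ad

adata = ad.read_h5ad('large_file.h5ad', backed='r')
try:
    # Work only with the piece that actually needs to be in memory
    subset = adata[:1000, :].to_memory()
finally:
    adata.file.close()   # release the HDF5 handle even if an error occurred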

Concatenation

  1. Choose appropriate join: Inner join for complete cases, outer join to preserve all features
  2. Track sources: Use label and keys to track data origin
  3. Handle duplicates: Use index_unique to make observation names unique
  4. Select merge strategy: Choose appropriate merge strategy for variable annotations

Memory Management

  1. Use sparse matrices: For data with <30% non-zero values
  2. Convert to categoricals: For repeated string values
  3. Process in chunks: For operations on very large matrices
  4. Use backed mode: Read large files with backed='r'

Naming Conventions

Follow these conventions for consistency:

  • Embeddings: X_pca, X_umap, X_tsne
  • Layers: Descriptive names like counts, log1p, scaled
  • Observations: Use snake_case like cell_type, n_genes, total_counts
  • Variables: Use snake_case like highly_variable, gene_name

Reference Documentation

For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the references/ directory:

  1. api_reference.md: Complete API documentation including all classes, methods, and functions with usage examples. Use grep -r "pattern" references/api_reference.md to search for specific functions or parameters.

  2. workflows_best_practices.md: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use grep -r "pattern" references/workflows_best_practices.md to search for specific workflows.

  3. concatenation_guide.md: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use grep -r "pattern" references/concatenation_guide.md to search for concatenation patterns.

When to Load References

Load reference files into context when:

  • Implementing complex concatenation with specific merge strategies
  • Troubleshooting errors or unexpected behavior
  • Optimizing memory usage for large datasets
  • Implementing complete analysis workflows
  • Understanding nuances of specific API methods

To search within references without loading them:

# Example: Search for information about backed mode
grep -r "backed mode" references/

Common Error Patterns

Memory Errors

Problem: "MemoryError: Unable to allocate array" Solution: Use backed mode, sparse matrices, or process in chunks

Dimension Mismatch

Problem: "ValueError: operands could not be broadcast together" Solution: Use outer join in concatenation or align indices before operations

View Modification

Problem: "ValueError: assignment destination is read-only" Solution: Convert view to copy with .copy() before modification

File Already Open

Problem: "OSError: Unable to open file (file is already open)" Solution: Close previous file connection with adata.file.close()