Improve the anndata skill

Timothy Kassis
2025-11-03 16:09:01 -08:00
parent 094d5aa9f1
commit 6ddea4786e
9 changed files with 2448 additions and 1560 deletions


@@ -1,527 +1,394 @@
---
name: anndata
description: "Manipulate AnnData objects for single-cell genomics. Load/save .h5ad files, manage obs/var metadata, layers, embeddings (PCA/UMAP), concatenate datasets, for scRNA-seq workflows."
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---
# AnnData
## Overview
AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. It stores experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.
## When to Use This Skill
Use this skill when:
- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
## Installation
```bash
pip install anndata
# With optional dependencies
pip install anndata[dev,test,doc]
```
## Quick Start
### Creating an AnnData object
```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# Sparse matrices are memory-efficient for count data
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)
```
**Understanding the structure:**
- **X**: Primary data matrix (observations × variables)
- **obs**: Row (observation) annotations as DataFrame
- **var**: Column (variable) annotations as DataFrame
- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
- **varm**: Multi-dimensional variable annotations (e.g., gene loadings)
- **layers**: Alternative data matrices with same dimensions as X
- **uns**: Unstructured metadata dictionary
- **obsp/varp**: Pairwise relationship matrices (graphs)
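To see which components are populated at a glance, print the object; the repr lists each component's keys (output abbreviated here, for the object created above):
```python
print(adata)
# AnnData object with n_obs × n_vars = 100 × 2000
#     obs: 'cell_type', 'sample'
#     var: 'gene_name'
```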
### Adding annotations and layers
```python
# Cell-level metadata (obs)
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)

# Gene-level metadata (var)
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.median(gene_variance)

# Embeddings and loadings (obsm/varm)
adata.obsm['X_pca'] = pca_coordinates     # shape: (n_obs, n_components)
adata.obsm['X_umap'] = umap_coordinates   # shape: (n_obs, 2)
adata.varm['PCs'] = principal_components  # shape: (n_vars, n_components)

# Alternative data representations (layers)
adata.layers['counts'] = raw_counts
adata.layers['log1p'] = np.log1p(adata.X)

# Unstructured metadata (uns)
adata.uns['preprocessing'] = {
    'normalization': 'TPM',
    'min_genes': 200,
    'date': '2024-01-15'
}
```
### Reading data
```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')
print(adata.isbacked)  # True
adata.file.close()     # close the connection when done
```
### Subsetting and views
Efficiently subset data while managing memory through views and copies.
**Subsetting operations:**
```python
# By observation/variable names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
# By boolean masks
b_cells = adata[adata.obs.cell_type == 'B']
high_quality = adata[adata.obs.n_genes > 200]
# By position
first_cells = adata[:100, :]
top_genes = adata[:, :500]
# Combined conditions
filtered = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
]
```
**Understanding views:**
- Subsetting returns **views** by default (memory-efficient, shares data with original)
- Writing through a view triggers copy-on-write: anndata materializes the view as an actual copy (emitting an `ImplicitModificationWarning`) rather than silently mutating the original
- Check with `adata.is_view`
- Convert to independent copy with `.copy()`
```python
# View (memory-efficient)
subset = adata[adata.obs.condition == 'treated']
print(subset.is_view) # True
# Independent copy
subset_copy = adata[adata.obs.condition == 'treated'].copy()
print(subset_copy.is_view) # False
```
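The copy-on-write behavior can be verified directly; a self-contained sketch:
```python
import warnings
import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(np.ones((4, 2), dtype=np.float32),
                   obs=pd.DataFrame(index=['c1', 'c2', 'c3', 'c4']))
view = adata[:2]
print(view.is_view)         # True

# Assigning through the view materializes it as a copy
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silences ImplicitModificationWarning
    view.obs['flag'] = [True, False]

print(view.is_view)         # False: now an actual AnnData
print('flag' in adata.obs)  # False: the original was not modified
```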
### Writing data
```python
# Standard format, with compression recommended
adata.write_h5ad('output.h5ad', compression='gzip')

# Alternative formats
adata.write_zarr('output.zarr')   # for cloud storage
adata.write_loom('output.loom')   # for compatibility
adata.write_csvs('output_dir/')   # as CSV files
```
### Reading other formats
```python
import scanpy as sc

# 10X Genomics HDF5 (read_10x_h5 lives in scanpy, not anndata)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Matrix Market, CSV, Loom
adata = ad.read_mtx('matrix.mtx')
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')
```
### Basic operations
```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```
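After loading external data, duplicate names are a common gotcha; AnnData provides helpers for this (a short sketch):
```python
# Ensure unique observation/variable names after loading
adata.obs_names_make_unique()
adata.var_names_make_unique()
```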
## Core Capabilities
### 1. Data Structure
Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.
**See**: `references/data_structure.md` for comprehensive information on:
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices
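For orientation, every component is a plain attribute; a minimal access sketch (the `counts` and `X_pca` keys assume the Quick Start examples above):
```python
adata.X                  # primary matrix (n_obs × n_vars)
adata.obs                # observation metadata (pandas DataFrame)
adata.var                # variable metadata (pandas DataFrame)
adata.layers['counts']   # alternative matrices, same shape as X
adata.obsm['X_pca']      # per-observation arrays (e.g. embeddings)
adata.uns                # unstructured metadata (dict-like)
```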
### 2. Input/Output Operations
Read and write data in various formats with support for compression, backed mode, and cloud storage.
**See**: `references/io_operations.md` for details on:
- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization
Common commands:
```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')
# Read 10X data (via scanpy; read_10x_h5 is not part of anndata)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# Read MTX format
adata = ad.read_mtx('matrix.mtx').T
```
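Zarr stores round-trip cleanly, which makes them the usual route to cloud or object storage; a minimal sketch:
```python
# Zarr round trip
adata.write_zarr('store.zarr')
adata_back = ad.read_zarr('store.zarr')
```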
### 3. Concatenation
Combine multiple AnnData objects along observations or variables with flexible join strategies.
**See**: `references/concatenation.md` for comprehensive coverage of:
- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets
Common commands:
```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    join='inner',                         # or 'outer' to keep all variables
    label='batch',                        # records the source in adata.obs['batch']
    keys=['batch1', 'batch2', 'batch3'],
    index_unique='-'                      # obs names become e.g. Cell_0-batch1
)

# Merge strategies for variable annotations:
# merge=None (default: drop), 'same', 'unique', 'first', or 'only'
adata = ad.concat([adata1, adata2], merge='first')

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation over backed objects
from anndata.experimental import AnnCollection
adatas = [ad.read_h5ad(f, backed='r') for f in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(adatas, join_obs='outer', label='dataset')
```
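The effect of the join type is easiest to see on a toy pair; a self-contained sketch (dense arrays are padded with NaN on an outer join, sparse arrays with zeros):
```python
import anndata as ad
import numpy as np
import pandas as pd

a = ad.AnnData(np.ones((2, 3)),
               obs=pd.DataFrame(index=['c1', 'c2']),
               var=pd.DataFrame(index=['g1', 'g2', 'g3']))
b = ad.AnnData(np.zeros((2, 2)),
               obs=pd.DataFrame(index=['c3', 'c4']),
               var=pd.DataFrame(index=['g2', 'g3']))

print(ad.concat([a, b], join='inner').shape)  # (4, 2): shared variables only
print(ad.concat([a, b], join='outer').shape)  # (4, 3): all variables kept
```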
### 4. Data Manipulation
Transform, subset, filter, and reorganize data efficiently.
**See**: `references/manipulation.md` for detailed guidance on:
- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering
Common commands:
```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Convert X (or any layer) to a DataFrame
df = adata.to_df()
df_log = adata.to_df(layer='log1p')

# Extract a 1D vector from data or annotations
gene_expression = adata.obs_vector('Gene_100')

# Transpose (swap observations and variables)
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # view (lightweight reference)
copy = adata[0:100, :].copy()  # independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```
### 5. Best Practices
Follow recommended patterns for memory efficiency, performance, and reproducibility.
**See**: `references/best_practices.md` for guidelines on:
- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions
Key recommendations:
```python
# Use sparse matrices for sparse data (< ~30% non-zero)
from scipy.sparse import csr_matrix
density = (adata.X != 0).sum() / adata.X.size
if density < 0.3:
    adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()
# Or manually, for finer control
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
## Integration with Scverse Ecosystem
AnnData serves as the foundational data structure for the scverse ecosystem:
### Scanpy (Single-cell analysis)
```python
import scanpy as sc
# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)
# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```
### Muon (Multimodal data)
```python
import muon as mu
# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
### PyTorch integration
```python
from anndata.experimental import AnnLoader
# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
for batch in dataloader:
    X = batch.X
    # ... train the model on this mini-batch
```
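A slightly fuller sketch of the loop, assuming AnnLoader's default converter (which yields torch tensors) and a hypothetical linear model:
```python
import torch

# Hypothetical model: logistic regression over genes
model = torch.nn.Linear(adata.n_vars, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch in dataloader:
    X = batch.X.float()  # dense tensor for this mini-batch
    logits = model(X)
    # ... compute a loss against labels in batch.obs and step the optimizer
```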
## Common Workflows
### Single-cell RNA-seq analysis
```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10X HDF5 via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata.var_names_make_unique()

# 2. Quality control (single combined mask avoids chained views)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[(adata.obs['n_genes'] > 200) &
              (adata.obs['n_counts'] < 50000)].copy()

# 3. Store raw
adata.raw = adata.copy()
adata.layers['counts'] = adata.X.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad', compression='gzip')
```
### Batch integration
```python
import scanpy as sc

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner',
    index_unique='-'
)

# Apply batch correction
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```
### Working with large datasets
```python
# Open in backed mode (X stays on disk)
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load only the filtered subset into memory
adata_subset = high_quality.to_memory()
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

# Close the file connection when done
adata.file.close()
```
## Best Practices
### Data Organization
1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers
2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations
3. **Use uns for metadata**: Analysis parameters, dates, version information
4. **Use categoricals for efficiency**: Convert repeated strings to categorical types
### Subsetting
1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence
2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation
3. **Validate after subsetting**: Check dimensions and data integrity
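For example, chaining conditions in a single operation looks like this (a sketch using obs/var columns introduced earlier):
```python
mask = (adata.obs['batch'] == 'batch1') & (adata.obs['n_genes'] > 200)
filtered = adata[mask, adata.var['highly_variable']].copy()
```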
### File I/O
1. **Use compression**: Always use `compression='gzip'` when writing h5ad files
2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility
3. **Close backed files**: Always close file connections when done
4. **Use backed mode for large files**: Don't load everything into memory if not needed
### Concatenation
1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features
2. **Track sources**: Use `label` and `keys` to track data origin
3. **Handle duplicates**: Use `index_unique` to make observation names unique
4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations
### Memory Management
1. **Use sparse matrices**: For data with <30% non-zero values
2. **Convert to categoricals**: For repeated string values
3. **Process in chunks**: For operations on very large matrices
4. **Use backed mode**: Read large files with `backed='r'`
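For chunk-wise computation over X, `chunked_X` provides a generator; note it yields `(chunk, start, end)` tuples (a sketch; check the signature against your anndata version):
```python
# Stream over X in row blocks without loading everything
total = 0.0
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    total += chunk.sum()
mean_expression = total / (adata.n_obs * adata.n_vars)
```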
### Naming Conventions
Follow these conventions for consistency:
- **Embeddings**: `X_pca`, `X_umap`, `X_tsne`
- **Layers**: Descriptive names like `counts`, `log1p`, `scaled`
- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts`
- **Variables**: Use snake_case like `highly_variable`, `gene_name`
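In code, the conventions look like this (the right-hand names are illustrative placeholders):
```python
adata.obsm['X_pca'] = pca_coords           # embeddings: X_<method>
adata.layers['log1p'] = np.log1p(adata.X)  # layers: descriptive names
adata.obs['cell_type'] = cell_labels       # obs/var columns: snake_case
```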
## Reference Documentation
For detailed API information, usage patterns, and troubleshooting, refer to the reference files in the `references/` directory: `data_structure.md`, `io_operations.md`, `concatenation.md`, `manipulation.md`, and `best_practices.md`.
Load reference files into context when:
- Implementing complex concatenation with specific merge strategies
- Troubleshooting errors or unexpected behavior
- Optimizing memory usage for large datasets
- Implementing complete analysis workflows
- Understanding nuances of specific API methods
To search within references without loading them:
```bash
grep -r "backed mode" references/
```
## Troubleshooting
### Out of memory errors
Use backed mode or convert to sparse matrices:
```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')
# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```
### Slow file reading
Use compression and appropriate formats:
```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')
# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```
### Dimension mismatch
"ValueError: operands could not be broadcast together" usually means misaligned objects: use `join='outer'` when concatenating, or align indices before operating.
### Index alignment issues
Always align external data on index:
```python
# Wrong
adata.obs['new_col'] = external_data['values']
# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```
### View modification
"ValueError: assignment destination is read-only" means you are writing into a view (or a file opened with `backed='r'`); convert to a copy with `.copy()` before modifying.
### File already open
"OSError: Unable to open file (file is already open)" means a previous backed handle is still active; close it with `adata.file.close()` before reopening.
## Additional Resources
- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata