mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Initial commit

scientific-packages/anndata/references/api_reference.md

# AnnData API Reference

## Core AnnData Class

The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.

### Core Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to the observation axis) |
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to the variable axis) |
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
| **layers** | dict-like | Additional data matrices matching X's dimensions |
| **raw** | AnnData | Stores original versions of X and var before transformations |
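
The alignment contract in this table can be sketched without anndata itself: every component's leading dimension must match either the observation count or the variable count. A minimal NumPy/pandas sketch of those invariants (all names here are illustrative):

```python
import numpy as np
import pandas as pd

n_obs, n_vars = 100, 2000

# Components shaped the way AnnData expects them
X = np.zeros((n_obs, n_vars))                                  # primary matrix
obs = pd.DataFrame(index=[f"Cell_{i}" for i in range(n_obs)])  # per-observation
var = pd.DataFrame(index=[f"Gene_{i}" for i in range(n_vars)]) # per-variable
obsm_pca = np.zeros((n_obs, 50))    # obs-aligned, arbitrary width
obsp_dist = np.zeros((n_obs, n_obs))  # pairwise: square in n_obs
layer = np.zeros_like(X)            # layers match X exactly

# The invariants AnnData enforces at construction time
assert X.shape == (len(obs), len(var))
assert obsm_pca.shape[0] == n_obs
assert obsp_dist.shape == (n_obs, n_obs)
assert layer.shape == X.shape
```

Passing components that violate any of these shape checks raises at `AnnData(...)` construction time rather than later in an analysis.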

### Dimensional Properties

- **n_obs**: Number of observations (sample count)
- **n_vars**: Number of variables/features
- **shape**: Tuple of (n_obs, n_vars)
- **T**: Transposed view of the entire object

### State Properties

- **isbacked**: Boolean indicating disk-backed storage status
- **is_view**: Boolean identifying whether the object is a view of another AnnData
- **filename**: Path to the backing .h5ad file; setting this enables disk-backed mode

### Key Methods

#### Construction and Copying
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create a new AnnData object
- **`copy(filename=None)`**: Create a full copy, optionally stored on disk

#### Subsetting and Views
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns a view by default)
- **`.copy()`**: Convert a view to an independent object

#### Data Access
- **`to_df(layer=None)`**: Generate a pandas DataFrame representation
- **`obs_vector(k, layer=None)`**: Extract a 1D observation-aligned array from X, layers, or annotations
- **`var_vector(k, layer=None)`**: Extract a 1D variable-aligned array
- **`chunk_X(select, replace=True)`**: Return a chunk of the data matrix with random or specified row indices
- **`chunked_X(chunk_size)`**: Iterate over the data matrix in row-wise chunks

#### Transformation
- **`transpose()`**: Return the transposed object
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along the observation axis (deprecated; prefer the module-level `concat`)
- **`to_memory(copy=False)`**: Load all backed arrays into RAM

#### File I/O
- **`write_h5ad(filename, compression=None)`**: Save in .h5ad HDF5 format (pass `compression='gzip'` to compress)
- **`write_zarr(store, ...)`**: Export a hierarchical Zarr store
- **`write_loom(filename, ...)`**: Output a .loom format file
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files

#### Data Management
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
- **`rename_categories(key, categories)`**: Update category labels in annotations
- **`obs_names_make_unique(join='-')`**: Append numeric suffixes to duplicate observation names
- **`var_names_make_unique(join='-')`**: Append numeric suffixes to duplicate variable names

## Module-Level Functions

### Reading Functions

#### Native Formats
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
- **`read_zarr(store)`**: Access hierarchical Zarr array stores

#### Alternative Formats
- **`read_csv(filename, ...)`**: Import from CSV files
- **`read_excel(filename, ...)`**: Import from Excel files
- **`read_hdf(filename, key)`**: Read from HDF5 files
- **`read_loom(filename, ...)`**: Import from .loom files
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
- **`read_text(filename, ...)`**: Import from text files
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format

#### Element-Level Access
- **`read_elem(elem)`**: Retrieve a specific component from storage
- **`sparse_dataset(group)`**: Create a backed sparse matrix from an on-disk group

### Combining Operations
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
  - **axis**: 0 (observations) or 1 (variables)
  - **join**: 'inner' (intersection) or 'outer' (union)
  - **merge**: Strategy for the non-concatenation axis ('same', 'unique', 'first', 'only', or None)
  - **label**: Column name for source tracking
  - **keys**: Dataset identifiers for source annotation
  - **index_unique**: Separator for making duplicate indices unique

### Writing Functions
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
- **`write_zarr(store, adata, ...)`**: Save as hierarchical Zarr arrays
- **`write_elem(store, k, elem)`**: Write an individual component to a storage group

### Experimental Features
- **`AnnCollection`**: Batch processing for large collections
- **`AnnLoader`**: PyTorch DataLoader integration
- **`concat_on_disk(in_files, out_file, ...)`**: Memory-efficient out-of-core concatenation
- **`read_lazy(filename)`**: Lazy loading with deferred computation
- **`read_dispatched(store, callback)`**: Custom reading with callbacks
- **`write_dispatched(store, key, elem, callback)`**: Custom writing with callbacks

### Configuration
- **`settings`**: Package-wide configuration object
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes

## Common Usage Patterns

### Creating AnnData Objects

```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix

# From dense array
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)

# From sparse matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)

# With metadata
import pandas as pd
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```

### Subsetting

```python
# By names (assumes obs_names/var_names are set accordingly)
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]

# By boolean mask
b_cells = adata[adata.obs.cell_type == 'B']

# By position
first_five = adata[:5, :100]

# Convert view to copy
adata_copy = adata[:5].copy()
```

### Adding Annotations

```python
# Cell-level metadata
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)

# Gene-level metadata
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)

# Embeddings
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))

# Alternative data representations (dense X assumed)
adata.layers['log_transformed'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)

# Unstructured metadata
adata.uns['experiment_date'] = '2024-01-15'
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
```

### File I/O

```python
# Write to disk
adata.write('my_results.h5ad', compression='gzip')

# Read into memory
adata = ad.read_h5ad('my_results.h5ad')

# Read in backed mode (memory-efficient)
adata = ad.read_h5ad('my_results.h5ad', backed='r')

# Close the file connection
adata.file.close()
```

### Concatenation

```python
# Combine multiple datasets
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))

# Simple concatenation
combined = ad.concat([adata1, adata2, adata3], axis=0)

# With source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)

# Inner join (only shared variables)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')

# Outer join (all variables, missing entries padded with zeros)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
```

scientific-packages/anndata/references/concatenation_guide.md
# AnnData Concatenation Guide

## Overview

The `concat()` function combines multiple AnnData objects through two fundamental operations:
1. **Concatenation**: stacking the elements along the chosen axis, in the order given
2. **Merging**: combining the elements aligned to the alternative (non-concatenation) axis into a single result

## Basic Concatenation

### Syntax
```python
import anndata as ad

combined = ad.concat(
    adatas,              # List of AnnData objects
    axis=0,              # 0=observations, 1=variables
    join='inner',        # 'inner' or 'outer'
    merge=None,          # Merge strategy for non-concat axis
    label=None,          # Column name for source tracking
    keys=None,           # Dataset identifiers
    index_unique=None,   # Separator for unique indices
    fill_value=None,     # Fill value for missing data
    pairwise=False       # Include pairwise matrices
)
```

### Concatenating Observations (Cells)
```python
import numpy as np

# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))

combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)
```

### Concatenating Variables (Genes)
```python
# Less common: combining different feature sets
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))

combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)
```

## Join Strategies

### Inner Join (Intersection)

Keeps only features shared across all objects.

```python
import pandas as pd

# Datasets with different genes
adata1 = ad.AnnData(
    np.random.rand(100, 2000),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
    np.random.rand(150, 1800),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)

# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999
```

**Use when:**
- You want to analyze only features measured in all datasets
- Missing features would compromise the analysis
- You need a complete-case analysis

**Trade-offs:**
- May lose many features
- Ensures no missing data
- Smaller result size

### Outer Join (Union)

Keeps all features from all objects, padding missing entries with fill values (default 0).

```python
# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# Missing values filled with 0

# Custom fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
```

**Use when:**
- You want to preserve all features
- Sparse data is acceptable
- Features are independent

**Trade-offs:**
- Introduces zeros/missing values
- Larger result size
- May need imputation
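
The index arithmetic behind the two join strategies can be checked with plain pandas, independent of anndata (gene names as in the examples above):

```python
import pandas as pd

genes_a = pd.Index([f"Gene_{i}" for i in range(2000)])       # first dataset's vars
genes_b = pd.Index([f"Gene_{i}" for i in range(200, 2000)])  # second dataset's vars

inner = genes_a.intersection(genes_b)  # features kept by join='inner'
outer = genes_a.union(genes_b)         # features kept by join='outer'

print(len(inner))  # 1800 shared genes
print(len(outer))  # 2000 genes total
```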

## Merge Strategies

Merge strategies control how elements on the non-concatenation axis are combined.

### merge=None (Default)

Excludes all non-concatenation axis elements.

```python
# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800

# merge=None: var annotations excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns
```

**Use when:**
- Annotations are dataset-specific
- You'll add new annotations after merging

### merge='same'

Keeps only annotations with identical values across datasets.

```python
# Annotation values agree on the shared genes: in both objects,
# Gene_200 through Gene_999 are chr1 and Gene_1000 through Gene_1999 are chr2
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000

# merge='same': keeps the chromosome annotation
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```

**Use when:**
- Annotations should be consistent
- You want to validate consistency
- Shared metadata is important

**Note:** Comparison occurs after index alignment, so only the values at shared indices need to match.
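
The alignment rule can be reproduced with pandas alone: `merge='same'` keeps an annotation only if the per-feature Series agree at every shared index (gene names and values here are illustrative):

```python
import pandas as pd

chrom_a = pd.Series(
    ['chr1'] * 1000 + ['chr2'] * 1000,
    index=[f'Gene_{i}' for i in range(2000)],
)
chrom_b = pd.Series(
    ['chr1'] * 800 + ['chr2'] * 1000,
    index=[f'Gene_{i}' for i in range(200, 2000)],
)

# Align on the shared index first, then compare values
shared = chrom_a.index.intersection(chrom_b.index)
agrees = chrom_a.loc[shared].equals(chrom_b.loc[shared])
print(agrees)  # True: the annotation would survive merge='same'
```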

### merge='unique'

Includes annotations for which each index has a single possible value.

```python
# Unique, consistent identifiers per gene
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(200, 2000)]

# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')
```

**Use when:**
- Each feature has a unique identifier
- Annotations are feature-specific

### merge='first'

Takes the first occurrence of each annotation.

```python
# Different annotation versions
adata1.var['description'] = ['desc1'] * 2000
adata2.var['description'] = ['desc2'] * 1800

# merge='first': uses adata1's descriptions
combined = ad.concat([adata1, adata2], merge='first')
```

**Use when:**
- One dataset has authoritative annotations
- Order matters
- You need a simple resolution strategy

### merge='only'

Retains annotations appearing in exactly one object.

```python
# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * 2000
adata2.var['dataset2_specific'] = ['value'] * 1800

# merge='only': keeps both (no conflicts)
combined = ad.concat([adata1, adata2], merge='only')
```

**Use when:**
- Datasets have non-overlapping annotations
- You want to preserve all unique metadata

## Source Tracking

### Using label

Add a categorical column to track data origin.

```python
combined = ad.concat(
    [adata1, adata2, adata3],
    label='batch'
)

# Creates obs['batch'] with values 0, 1, 2
print(combined.obs['batch'].cat.categories)  # ['0', '1', '2']
```

### Using keys

Provide custom names for source tracking.

```python
combined = ad.concat(
    [adata1, adata2, adata3],
    label='study',
    keys=['control', 'treatment_a', 'treatment_b']
)

# Creates obs['study'] with custom names
print(combined.obs['study'].unique())  # ['control', 'treatment_a', 'treatment_b']
```

### Making Indices Unique

Append source identifiers to duplicate observation names.

```python
# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]

# index_unique adds a suffix
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)

# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])
```

## Handling Different Attributes

### X Matrix and Layers

These follow the join strategy; missing values are filled according to `fill_value`.

```python
# Both have layers
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()

# Concatenates both X and layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers
```

### obs and var DataFrames

- **obs**: Concatenated along the concatenation axis
- **var**: Handled by the merge strategy

```python
adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150

combined = ad.concat([adata1, adata2])
# obs['cell_type'] preserved for all cells
```

### obsm and varm

Multi-dimensional annotations follow the same rules as layers.

```python
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)

combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)
```

### obsp and varp

Pairwise matrices are excluded by default. Enable with `pairwise=True`.

```python
# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)

# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp

# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in a padded block-diagonal matrix
```
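
The block-diagonal shape that `pairwise=True` produces can be sketched directly with scipy, without anndata: each dataset's pairwise matrix lands on the diagonal, and cross-dataset entries stay zero.

```python
import numpy as np
from scipy.sparse import block_diag

rng = np.random.default_rng(0)
d1 = rng.random((100, 100))   # first dataset's pairwise matrix
d2 = rng.random((150, 150))   # second dataset's pairwise matrix

# What concat(..., pairwise=True) conceptually builds
combined = block_diag((d1, d2), format="csr")

print(combined.shape)  # (250, 250)
# The cross-dataset blocks are all zeros
assert combined[:100, 100:].nnz == 0
assert combined[100:, :100].nnz == 0
```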

### uns Dictionary

Merged recursively, applying the merge strategy at every nesting depth.

```python
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}

# merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], merge='same')
# combined.uns['experiment'] == {'lab': 'A'}
```
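
The recursive 'same' rule amounts to a nested dictionary intersection; a minimal pure-Python sketch (the helper name `same_merge` is illustrative, not anndata's API):

```python
def same_merge(dicts):
    """Keep keys whose values are identical in every dict, recursing into
    nested dicts (a simplified sketch of uns merging with merge='same')."""
    keys = set(dicts[0]).intersection(*[set(d) for d in dicts[1:]])
    out = {}
    for k in sorted(keys):
        vals = [d[k] for d in dicts]
        if all(isinstance(v, dict) for v in vals):
            nested = same_merge(vals)   # recurse into nested mappings
            if nested:
                out[k] = nested
        elif all(v == vals[0] for v in vals[1:]):
            out[k] = vals[0]            # identical everywhere: keep
    return out

u1 = {'experiment': {'date': '2024-01', 'lab': 'A'}}
u2 = {'experiment': {'date': '2024-02', 'lab': 'A'}}
print(same_merge([u1, u2]))  # {'experiment': {'lab': 'A'}}
```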

## Advanced Patterns

### Batch Integration Pipeline

```python
import anndata as ad

# Load batches
batches = [
    ad.read_h5ad(f'batch_{i}.h5ad')
    for i in range(5)
]

# Concatenate with tracking
combined = ad.concat(
    batches,
    axis=0,
    join='outer',
    merge='first',
    label='batch_id',
    keys=[f'batch_{i}' for i in range(5)],
    index_unique='-'
)

# Encode batches numerically
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
```

### Multi-Study Meta-Analysis

```python
# Different studies with varying gene coverage
studies = {
    'study_a': ad.read_h5ad('study_a.h5ad'),
    'study_b': ad.read_h5ad('study_b.h5ad'),
    'study_c': ad.read_h5ad('study_c.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(studies.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(studies.keys()),
    merge='unique',
    fill_value=0
)

# Track coverage
for study in studies:
    n_genes = studies[study].n_vars
    combined.uns[f'{study}_n_genes'] = n_genes
```

### Incremental Concatenation

```python
# For many datasets, concatenate in batches
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]

# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
    chunk_files = all_files[i:i + chunk_size]
    chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
    chunk_combined = ad.concat(chunk_adatas)

    if result is None:
        result = chunk_combined
    else:
        result = ad.concat([result, chunk_combined])
```

### Memory-Efficient On-Disk Concatenation

```python
# Experimental feature for large datasets
from anndata.experimental import concat_on_disk

files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
    files,
    'combined.h5ad',
    join='outer'
)

# Read the result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')
```

## Troubleshooting

### Issue: Dimension Mismatch

```python
# Problem: datasets with different variable dimensions
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))

# Solution: use an outer join to keep all variables
combined = ad.concat([adata1, adata2], join='outer')
```

### Issue: Memory Error

```python
# Problem: too many large objects in memory
large_adatas = [ad.read_h5ad(f) for f in many_files]

# Solution: read and concatenate incrementally
result = None
for file in many_files:
    adata = ad.read_h5ad(file)
    if result is None:
        result = adata
    else:
        result = ad.concat([result, adata])
        del adata  # Free memory
```

### Issue: Duplicate Indices

```python
# Problem: same cell names in different batches
# Solution: use index_unique
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)
```

### Issue: Lost Annotations

```python
# Problem: annotations disappear
adata1.var['important'] = values1
adata2.var['important'] = values2

combined = ad.concat([adata1, adata2])  # merge=None by default drops them
# Solution: use an appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')
```

## Performance Tips

1. **Pre-align indices**: Ensure consistent naming before concatenation
2. **Use sparse matrices**: Convert to sparse before concatenating
3. **Batch operations**: Concatenate in groups for many datasets
4. **Choose inner join**: When possible, to reduce result size
5. **Use categoricals**: Convert string annotations before concatenating
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
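
Tip 1 in practice often means normalizing feature names before concatenating, e.g. stripping Ensembl version suffixes so the same gene matches across datasets (a pandas sketch; the suffix pattern is an assumption about your data):

```python
import pandas as pd

# Two studies naming the same genes with different version suffixes
vars_a = pd.Index(['ENSG00000000003.14', 'ENSG00000000005.6'])
vars_b = pd.Index(['ENSG00000000003.15', 'ENSG00000000005.5'])

# Without normalization nothing is shared; an inner join would be empty
assert len(vars_a.intersection(vars_b)) == 0

# Strip the version suffix before concatenating
strip = lambda idx: idx.str.replace(r'\.\d+$', '', regex=True)
vars_a, vars_b = strip(vars_a), strip(vars_b)
print(list(vars_a.intersection(vars_b)))
# ['ENSG00000000003', 'ENSG00000000005']
```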

# AnnData Workflows and Best Practices

## Common Workflows

### 1. Single-Cell RNA-seq Analysis Workflow

#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd

# Load from 10X format (the .mtx matrix is genes × cells, so transpose)
adata = ad.read_mtx('matrix.mtx').T
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]

# Or load from a pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```

#### Quality Control
```python
# Calculate QC metrics (dense X assumed)
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)

# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]

# Filter genes
min_cells = 3
adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells]
```
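
If X is a scipy sparse matrix rather than a dense array, the sums above return `np.matrix` objects that don't drop into `obs` cleanly; the same QC metrics can be computed sparse-safely like this (toy data for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
X = csr_matrix(rng.poisson(0.1, size=(5, 8)))  # toy cells × genes counts

# .sum on sparse returns a np.matrix; flatten it to a 1D array
n_genes = np.asarray((X > 0).sum(axis=1)).ravel()        # genes per cell
total_counts = np.asarray(X.sum(axis=1)).ravel()         # counts per cell
cells_per_gene = np.asarray((X > 0).sum(axis=0)).ravel() # cells per gene

assert n_genes.shape == (5,) and cells_per_gene.shape == (8,)
```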

#### Normalization and Preprocessing
```python
# Store raw counts
adata.layers['counts'] = adata.X.copy()

# Normalize to counts per 10,000 per cell
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4

# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']

# Identify highly variable genes (simple variance cutoff)
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```
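
The normalization step above rescales each cell so that its counts sum to 10,000 (CP10K); that invariant is easy to verify on a toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(4, 6)).astype(float) + 1  # avoid all-zero cells
total = counts.sum(axis=1)

# Same arithmetic as the workflow: divide by per-cell total, scale by 1e4
normalized = counts / total[:, None] * 1e4

# Every cell now sums to exactly 10,000
assert np.allclose(normalized.sum(axis=1), 1e4)
```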

#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)

# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}

# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```

#### Clustering
```python
# Store cluster assignments (one label per cell)
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])

# Store cluster centroids; varm rows must align with variables,
# so store the (n_clusters × n_vars) centroid matrix transposed
centroids = np.array([...])
adata.varm['cluster_centroids'] = centroids.T
```

#### Save Results
```python
# Save the complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```

### 2. Batch Integration Workflow

```python
import anndata as ad

# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    index_unique='-'
)

# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)

# Store corrected embeddings
adata.obsm['X_pca_corrected'] = corrected_pca
adata.obsm['X_umap_corrected'] = corrected_umap
```

### 3. Memory-Efficient Large Dataset Workflow

```python
import anndata as ad

# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")

# Work with row-wise chunks
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    # Process chunk (rows start..end)
    result = process_chunk(chunk)

# Close the file when done
adata.file.close()
```

### 4. Multi-Dataset Comparison Workflow

```python
import anndata as ad

# Load datasets
datasets = {
    'study1': ad.read_h5ad('study1.h5ad'),
    'study2': ad.read_h5ad('study2.h5ad'),
    'study3': ad.read_h5ad('study3.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(datasets.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(datasets.keys()),
    merge='first'
)

# Handle missing data (dense X assumed)
combined.X[np.isnan(combined.X)] = 0

# Add dataset-specific metadata
combined.uns['datasets'] = {
    'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
    'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
    'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```

## Best Practices

### Memory Management

#### Use Sparse Matrices
```python
from scipy.sparse import csr_matrix

# Convert to sparse if the data is mostly zeros
density = (adata.X != 0).mean()  # fraction of non-zero entries
if density < 0.3:  # Less than 30% non-zero
    adata.X = csr_matrix(adata.X)
```
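
The memory payoff is easy to quantify: a CSR matrix costs roughly one value plus one column index per non-zero entry, instead of one value per cell of the full matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.random((1000, 1000))
dense[dense > 0.05] = 0.0        # keep ~5% non-zero, typical for scRNA-seq

sparse = csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"dense: {dense_bytes:,} B, sparse: {sparse_bytes:,} B")
assert sparse_bytes < dense_bytes / 5   # large savings at ~5% density
```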

#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Only load what you need
subset = adata[:1000, :500].copy()  # Now in memory
```

#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()

# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
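
Why this helps: a categorical stores each distinct string once plus a small integer code per row, so columns with few distinct values shrink dramatically. A quick pandas check:

```python
import pandas as pd

# ~100,000 rows but only three distinct labels
labels = pd.Series(['B cell', 'T cell', 'Monocyte'] * 33_000 + ['B cell'])

as_object = labels.memory_usage(deep=True)
as_categorical = labels.astype('category').memory_usage(deep=True)

print(f"object: {as_object:,} B, categorical: {as_categorical:,} B")
assert as_categorical < as_object / 10
```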

### Data Organization

#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```

#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates

# Gene loadings
adata.varm['PCs'] = principal_components
```

#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
    'normalization': 'TPM',
    'min_genes': 200,
    'min_cells': 3,
    'date': '2024-01-15'
}

# Store analysis results
adata.uns['differential_expression'] = {
    'method': 't-test',
    'p_value_threshold': 0.05
}
```

### Subsetting and Views

#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell']  # View
print(subset.is_view)  # True

# Views are memory-efficient and copy-on-write: assigning to a view
# raises an ImplicitModificationWarning and silently converts it to a
# full copy, leaving the original adata unchanged
subset.obs['new_column'] = value
print(subset.is_view)  # False after modification

# Create an independent copy up front when you intend to modify
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```

#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()

# Good - combine the masks and subset once
result = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
].copy()
```
|
||||
|
||||
### File I/O

#### Use Compression

```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```

#### Choose the Right Format

```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')

# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')

# Loom for compatibility with other tools
adata.write_loom('data.loom')
```

#### Close File Connections

```python
# Use a try/finally pattern so the file handle is always released
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
    # Work with data
    process(adata)
finally:
    adata.file.close()
```

### Concatenation

#### Choose Appropriate Join Strategy

```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join - all features (keeps all data, may introduce zeros)
combined = ad.concat([adata1, adata2], join='outer')
```

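The inner/outer semantics follow the same logic as pandas concatenation, so the trade-off can be previewed on plain DataFrames (a standalone sketch; note that pandas fills missing entries with NaN, while `ad.concat` may introduce NaN or zeros depending on the matrix type):

```python
import pandas as pd

# Two "datasets" with partially overlapping features
df1 = pd.DataFrame({'geneA': [1, 2], 'geneB': [3, 4]})
df2 = pd.DataFrame({'geneB': [5], 'geneC': [6]})

inner = pd.concat([df1, df2], join='inner', ignore_index=True)
outer = pd.concat([df1, df2], join='outer', ignore_index=True)

print(list(inner.columns))  # ['geneB'] - only shared features survive
print(list(outer.columns))  # ['geneA', 'geneB', 'geneC'] - missing entries become NaN
```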
#### Track Data Sources

```python
# Add source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)

# Make indices unique
combined = ad.concat(
    [adata1, adata2, adata3],
    index_unique='-'
)
```

#### Handle Variable-Specific Metadata

```python
# Use a merge strategy for var annotations
combined = ad.concat(
    [adata1, adata2],
    merge='same', # Keep only identical annotations
    join='outer'
)
```

### Naming Conventions

#### Use Consistent Naming

```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']

# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']

# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```

#### Make Indices Unique

```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```

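These methods de-duplicate by appending a numeric suffix to repeated names after the first occurrence. A minimal re-implementation of that scheme for illustration — `make_unique` is a hypothetical helper, not part of the anndata API, and the real methods handle additional edge cases (such as suffixed names that already exist):

```python
def make_unique(names, join='-'):
    """Append '-1', '-2', ... to repeated names, keeping the first as-is."""
    seen = {}
    out = []
    for name in names:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}{join}{count}")
        seen[name] = count + 1
    return out

print(make_unique(['MT-CO1', 'ACTB', 'ACTB', 'ACTB']))
# ['MT-CO1', 'ACTB', 'ACTB-1', 'ACTB-2']
```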
### Error Handling

#### Validate Data Structure

```python
import numpy as np
from scipy import sparse

# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"

# np.isnan() raises a TypeError on sparse matrices, so inspect the
# stored values directly (implicit zeros are never NaN or negative)
values = adata.X.data if sparse.issparse(adata.X) else adata.X

# Check for NaN values
if np.isnan(values).any():
    print("Warning: NaN values detected")

# Check for negative values in count data
if (values < 0).any():
    print("Warning: Negative values in count data")
```

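The same checks can be wrapped into a reusable helper that accepts either dense or sparse input (a sketch; `validate_matrix` is an illustrative name, not an anndata API):

```python
import numpy as np
from scipy import sparse

def validate_matrix(X):
    """Return a list of warnings for NaNs or negative entries in X."""
    values = X.data if sparse.issparse(X) else np.asarray(X)
    warnings = []
    if np.isnan(values).any():
        warnings.append("NaN values detected")
    if (values < 0).any():
        warnings.append("Negative values in count data")
    return warnings

dense = np.array([[1.0, -2.0], [np.nan, 3.0]])
print(validate_matrix(dense))
# ['NaN values detected', 'Negative values in count data']
print(validate_matrix(sparse.csr_matrix([[0.0, 1.0], [2.0, 0.0]])))
# []
```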
#### Handle Missing Data

```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
    print("Warning: Missing cell type annotations")
    # Fill or remove
    adata = adata[~adata.obs['cell_type'].isna()]
```

## Common Pitfalls

### 1. Forgetting to Copy Views

```python
# RISKY - relies on implicit view behavior
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data # copy-on-write silently turns subset into a
                            # copy (with an ImplicitModificationWarning),
                            # and in-place edits to arrays taken from a
                            # view can still touch the original adata

# GOOD - make the copy explicit
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data # Only changes subset
```

### 2. Mixing Backed and In-Memory Operations

```python
# BAD - trying to modify backed data
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100 # Error: can't modify data backed in read-only mode

# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100 # Works
```

### 3. Not Using Categoricals for Metadata

```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000

# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```

### 4. Incorrect Concatenation Axis

```python
# Concatenating observations (cells) - the common case
combined = ad.concat([adata1, adata2], axis=0) # Correct

# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1) # Less common
```

### 5. Not Preserving Raw Data

```python
# BAD - loses original data
adata.X = normalized_data

# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```