mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
Improve the anndata skill
@@ -1,527 +1,394 @@
---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---
|
||||
# AnnData
|
||||
|
||||
## Overview
|
||||
|
||||
AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. It stores experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

## When to Use This Skill
|
||||
Use this skill when:
|
||||
- Creating, reading, or writing AnnData objects
|
||||
- Working with h5ad, zarr, or other genomics data formats
|
||||
- Performing single-cell RNA-seq analysis
|
||||
- Managing large datasets with sparse matrices or backed mode
|
||||
- Concatenating multiple datasets or experimental batches
|
||||
- Subsetting, filtering, or transforming annotated data
|
||||
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
|
||||
|
||||
## Installation
|
||||
```bash
|
||||
pip install anndata
|
||||
|
||||
# With optional dependencies
|
||||
pip install anndata[dev,test,doc]
|
||||
```
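To confirm the installation, a quick check (assumes a standard Python environment):

```python
import anndata as ad

print(ad.__version__)  # e.g. '0.10.x'
```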
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Creating an AnnData object
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# From dense or sparse arrays
|
||||
counts = np.random.poisson(1, size=(100, 2000))
|
||||
adata = ad.AnnData(counts)
|
||||
|
||||
# With sparse matrix (memory-efficient)
|
||||
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
|
||||
adata = ad.AnnData(counts)
|
||||
```
|
||||
|
||||
**With metadata:**
```python
import pandas as pd

# Observation (cell) annotations: one row per cell
obs = pd.DataFrame({
    'cell_type': pd.Categorical(['B', 'T', 'Monocyte'] * 33 + ['B']),
    'batch': ['batch1'] * 50 + ['batch2'] * 50
}, index=[f'cell_{i}' for i in range(100)])

# Variable (gene) annotations: one row per gene
var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)],
    'highly_variable': np.random.choice([True, False], 2000)
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=counts, obs=obs, var=var)
```
|
||||
|
||||
**Understanding the structure:**
|
||||
- **X**: Primary data matrix (observations × variables)
|
||||
- **obs**: Row (observation) annotations as DataFrame
|
||||
- **var**: Column (variable) annotations as DataFrame
|
||||
- **obsm**: Multi-dimensional observation annotations (e.g., PCA, UMAP coordinates)
|
||||
- **varm**: Multi-dimensional variable annotations (e.g., gene loadings)
|
||||
- **layers**: Alternative data matrices with same dimensions as X
|
||||
- **uns**: Unstructured metadata dictionary
|
||||
- **obsp/varp**: Pairwise relationship matrices (graphs)
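To make this layout concrete, here is a minimal sketch (shapes and key names are illustrative):

```python
import anndata as ad
import numpy as np

adata = ad.AnnData(np.random.rand(100, 2000))

adata.layers['log1p'] = np.log1p(adata.X)          # same shape as X
adata.obsm['X_pca'] = np.zeros((adata.n_obs, 50))  # one row per observation
adata.varm['PCs'] = np.zeros((adata.n_vars, 50))   # one row per variable
adata.uns['note'] = 'free-form metadata'

print(adata)  # the summary lists obs, var, uns, obsm, varm, layers
```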
|
||||
|
||||
### 2. Adding Annotations and Layers
|
||||
|
||||
Organize different data representations and metadata within a single object.
|
||||
|
||||
**Cell-level metadata (obs):**
```python
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.obs['condition'] = pd.Categorical(['control', 'treated'] * 50)
```

**Gene-level metadata (var):**
```python
adata.var['highly_variable'] = gene_variance > threshold
adata.var['chromosome'] = pd.Categorical(['chr1', 'chr2', ...])
```
|
||||
|
||||
**Embeddings (obsm/varm):**
|
||||
```python
|
||||
# Dimensionality reduction results
|
||||
adata.obsm['X_pca'] = pca_coordinates # Shape: (n_obs, n_components)
|
||||
adata.obsm['X_umap'] = umap_coordinates # Shape: (n_obs, 2)
|
||||
adata.obsm['X_tsne'] = tsne_coordinates
|
||||
|
||||
# Gene loadings
|
||||
adata.varm['PCs'] = principal_components # Shape: (n_vars, n_components)
|
||||
```
|
||||
|
||||
**Alternative data representations (layers):**
|
||||
```python
|
||||
# Store multiple versions
|
||||
adata.layers['counts'] = raw_counts
|
||||
adata.layers['log1p'] = np.log1p(adata.X)
|
||||
adata.layers['scaled'] = (adata.X - mean) / std
|
||||
```
|
||||
|
||||
**Unstructured metadata (uns):**
|
||||
```python
|
||||
# Analysis parameters
|
||||
adata.uns['preprocessing'] = {
|
||||
'normalization': 'TPM',
|
||||
'min_genes': 200,
|
||||
'date': '2024-01-15'
|
||||
}
|
||||
|
||||
# Results
|
||||
adata.uns['pca'] = {'variance_ratio': variance_explained}
|
||||
```
|
||||
|
||||
### 3. Subsetting and Views
|
||||
|
||||
Efficiently subset data while managing memory through views and copies.
|
||||
|
||||
**Subsetting operations:**
|
||||
```python
|
||||
# By observation/variable names
|
||||
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
|
||||
|
||||
# By boolean masks
|
||||
b_cells = adata[adata.obs.cell_type == 'B']
|
||||
high_quality = adata[adata.obs.n_genes > 200]
|
||||
|
||||
# By position
|
||||
first_cells = adata[:100, :]
|
||||
top_genes = adata[:, :500]
|
||||
|
||||
# Combined conditions
|
||||
filtered = adata[
|
||||
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
|
||||
adata.var.highly_variable
|
||||
]
|
||||
```
|
||||
|
||||
**Understanding views:**
|
||||
- Subsetting returns **views** by default (memory-efficient, shares data with original)
|
||||
- Modifying a view affects the original object
|
||||
- Check with `adata.is_view`
|
||||
- Convert to independent copy with `.copy()`
|
||||
|
||||
```python
|
||||
# View (memory-efficient)
|
||||
subset = adata[adata.obs.condition == 'treated']
|
||||
print(subset.is_view) # True
|
||||
|
||||
# Independent copy
|
||||
subset_copy = adata[adata.obs.condition == 'treated'].copy()
|
||||
print(subset_copy.is_view) # False
|
||||
```
|
||||
|
||||
### 4. File I/O and Backed Mode
|
||||
|
||||
Read and write data efficiently, with options for memory-limited environments.
|
||||
|
||||
**Writing data:**
|
||||
```python
|
||||
# Standard format with compression
|
||||
adata.write('results.h5ad', compression='gzip')
|
||||
|
||||
# Alternative formats
|
||||
adata.write_zarr('results.zarr') # For cloud storage
|
||||
adata.write_loom('results.loom') # For compatibility
|
||||
adata.write_csvs('results/') # As CSV files
|
||||
```
|
||||
|
||||
**Reading data:**
|
||||
```python
|
||||
# Load into memory
|
||||
adata = ad.read_h5ad('results.h5ad')
|
||||
|
||||
# Backed mode (disk-backed, memory-efficient)
|
||||
adata = ad.read_h5ad('large_file.h5ad', backed='r')
|
||||
print(adata.isbacked) # True
|
||||
print(adata.filename) # Path to file
|
||||
|
||||
# Close file connection when done
|
||||
adata.file.close()
|
||||
```
|
||||
|
||||
**Reading from other formats:**
|
||||
```python
|
||||
# 10X / Matrix Market format
adata = ad.read_mtx('matrix.mtx')

# CSV
adata = ad.read_csv('data.csv')

# Loom
adata = ad.read_loom('data.loom')

# 10X HDF5 (reading Cell Ranger .h5 files is provided by scanpy)
# import scanpy as sc
# adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
|
||||
```
|
||||
|
||||
**Working with backed mode:**
```python
# Read in backed mode for large files
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Process in chunks
for chunk in adata.chunk_X(chunk_size=1000):
    result = process_chunk(chunk)

# Load to memory if needed
adata_memory = adata.to_memory()
```
|
||||
|
||||
### 5. Concatenating Multiple Datasets
|
||||
|
||||
Combine multiple AnnData objects with control over how data is merged.
|
||||
|
||||
**Basic concatenation:**
```python
# Concatenate observations (most common)
combined = ad.concat([adata1, adata2, adata3], axis=0)

# Concatenate variables (rare)
combined = ad.concat([adata1, adata2], axis=1)
```
|
||||
|
||||
**Join strategies:**
```python
# Inner join: only shared variables (no missing data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join: all variables (fills missing with 0)
combined = ad.concat([adata1, adata2], join='outer')
```

## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

**See**: `references/data_structure.md` for comprehensive information on:
|
||||
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
|
||||
- Creating AnnData objects from various sources
|
||||
- Accessing and manipulating data components
|
||||
- Memory-efficient practices
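By analogy with the other capability sections, a few common commands for inspecting the structure (a hedged sketch; an existing `adata` object is assumed):

```python
# Dimensions and components
print(adata.shape)           # (n_obs, n_vars)
print(adata.obs.columns)     # observation annotations
print(adata.var.columns)     # variable annotations
print(list(adata.layers))    # alternative matrices
print(list(adata.obsm))      # multi-dimensional annotations, e.g. 'X_pca'

# Remove a component that is no longer needed
del adata.layers['scaled']
```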
|
||||
|
||||
### 2. Input/Output Operations
|
||||
|
||||
Read and write data in various formats with support for compression, backed mode, and cloud storage.
|
||||
|
||||
**See**: `references/io_operations.md` for details on:
|
||||
- Native formats (h5ad, zarr)
|
||||
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
|
||||
- Backed mode for large datasets
|
||||
- Remote data access
|
||||
- Format conversion
|
||||
- Performance optimization
|
||||
|
||||
Common commands:
|
||||
```python
|
||||
# Read/write h5ad
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r')
|
||||
adata.write_h5ad('output.h5ad', compression='gzip')
|
||||
|
||||
# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
|
||||
|
||||
# Read MTX format
|
||||
adata = ad.read_mtx('matrix.mtx').T
|
||||
```
|
||||
|
||||
### 3. Concatenation
|
||||
|
||||
Combine multiple AnnData objects along observations or variables with flexible join strategies.
|
||||
|
||||
**See**: `references/concatenation.md` for comprehensive coverage of:
|
||||
- Basic concatenation (axis=0 for observations, axis=1 for variables)
|
||||
- Join types (inner, outer)
|
||||
- Merge strategies (same, unique, first, only)
|
||||
- Tracking data sources with labels
|
||||
- Lazy concatenation (AnnCollection)
|
||||
- On-disk concatenation for large datasets
|
||||
|
||||
Common commands:
|
||||
```python
|
||||
# Concatenate observations (combine samples), with source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)
# Creates combined.obs['dataset'] with values 'exp1', 'exp2', 'exp3'

# Make duplicate indices unique
combined = ad.concat(
    [adata1, adata2],
    keys=['batch1', 'batch2'],
    index_unique='-'
)
# Cell names become: Cell_0-batch1, Cell_0-batch2, etc.
|
||||
```
|
||||
|
||||
**Merge strategies for metadata:**
|
||||
```python
|
||||
# merge=None: exclude variable annotations (default)
|
||||
combined = ad.concat([adata1, adata2], merge=None)
|
||||
|
||||
# merge='same': keep only identical annotations
|
||||
combined = ad.concat([adata1, adata2], merge='same')
|
||||
|
||||
# merge='first': use first occurrence
|
||||
combined = ad.concat([adata1, adata2], merge='first')
|
||||
|
||||
# merge='unique': keep annotations with single value
|
||||
combined = ad.concat([adata1, adata2], merge='unique')
|
||||
```
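To make the merge behaviour concrete, a brief hedged sketch of what ends up in `combined.var` (it reuses the `adata1`/`adata2` objects from the examples above; column names are illustrative):

```python
adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(adata1.n_vars)]
adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(adata2.n_vars)]
adata1.var['source'] = 'reference_v1'
adata2.var['source'] = 'reference_v2'   # differs between the objects

same = ad.concat([adata1, adata2], merge='same')
print(same.var.columns)  # 'gene_id' kept, 'source' dropped (values differ)

first = ad.concat([adata1, adata2], merge='first')
print(first.var['source'].iloc[0])  # 'reference_v1' (first occurrence wins)
```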
|
||||
|
||||
**Complete example:**
```python
# Load batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with full tracking
combined = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    join='outer',             # Keep all genes
    merge='first',            # Use first batch's annotations
    label='batch_id',         # Track source
    keys=['b1', 'b2', 'b3'],  # Custom labels
    index_unique='-'          # Make cell names unique
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation
from anndata.experimental import AnnCollection
collection = AnnCollection(
    [adata1, adata2],
    join_obs='outer',
    label='dataset'
)
```
|
||||
|
||||
### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently, and convert between AnnData and other formats for interoperability.
|
||||
**See**: `references/manipulation.md` for detailed guidance on:
|
||||
- Subsetting (by indices, names, boolean masks, metadata conditions)
|
||||
- Transposition
|
||||
- Copying (full copies vs views)
|
||||
- Renaming (observations, variables, categories)
|
||||
- Type conversions (strings to categoricals, sparse/dense)
|
||||
- Adding/removing data components
|
||||
- Reordering
|
||||
- Quality control filtering
|
||||
|
||||
Common commands:
|
||||
```python
|
||||
# Convert X to DataFrame
|
||||
df = adata.to_df()
|
||||
# Subset by metadata
|
||||
filtered = adata[adata.obs['quality_score'] > 0.8]
|
||||
hv_genes = adata[:, adata.var['highly_variable']]
|
||||
|
||||
# Convert specific layer
|
||||
df = adata.to_df(layer='log1p')
|
||||
# Transpose
|
||||
adata_T = adata.T
|
||||
|
||||
# Copy vs view
|
||||
view = adata[0:100, :] # View (lightweight reference)
|
||||
copy = adata[0:100, :].copy() # Independent copy
|
||||
|
||||
# Convert strings to categoricals
|
||||
adata.strings_to_categoricals()
|
||||
```
|
||||
|
||||
**Extract vectors:**
|
||||
```python
|
||||
# Get 1D arrays from data or annotations
|
||||
gene_expression = adata.obs_vector('Gene_100')
|
||||
cell_metadata = adata.obs_vector('n_genes')
|
||||
```
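`var_vector` is the variable-axis counterpart. A hedged sketch (the observation name and column name are illustrative):

```python
# Values of one observation across all variables (a row of X)
one_cell_profile = adata.var_vector('Cell_5')

# A column taken from adata.var
per_gene_flag = adata.var_vector('highly_variable')
```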
|
||||
|
||||
**Transpose:**
|
||||
```python
|
||||
# Swap observations and variables
|
||||
transposed = adata.T
|
||||
```
|
||||
|
||||
### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.
|
||||
|
||||
**See**: `references/best_practices.md` for guidelines on:
|
||||
- Memory management (sparse matrices, categoricals, backed mode)
|
||||
- Views vs copies
|
||||
- Data storage optimization
|
||||
- Performance optimization
|
||||
- Working with raw data
|
||||
- Metadata management
|
||||
- Reproducibility
|
||||
- Error handling
|
||||
- Integration with other tools
|
||||
- Common pitfalls and solutions
|
||||
|
||||
Key recommendations:
|
||||
```python
|
||||
# Use sparse matrices for sparse data
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Check sparsity
|
||||
density = (adata.X != 0).sum() / adata.X.size
|
||||
if density < 0.3: # Less than 30% non-zero
|
||||
adata.X = csr_matrix(adata.X)
|
||||
```
|
||||
|
||||
**Convert strings to categoricals:**
```python
# Automatic conversion
adata.strings_to_categoricals()

# Manual conversion (more control)
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```

**Store raw before filtering:**
```python
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
|
||||
|
||||
**Use backed mode:**
```python
# Read without loading into memory
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Work with subsets
subset = adata[:1000, :500].copy()  # Only this subset in memory
```

## Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

### Scanpy (Single-cell analysis)
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# Preprocessing
|
||||
sc.pp.filter_cells(adata, min_genes=200)
|
||||
sc.pp.normalize_total(adata, target_sum=1e4)
|
||||
sc.pp.log1p(adata)
|
||||
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
|
||||
|
||||
# Dimensionality reduction
|
||||
sc.pp.pca(adata, n_comps=50)
|
||||
sc.pp.neighbors(adata, n_neighbors=15)
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata)
|
||||
|
||||
# Visualization
|
||||
sc.pl.umap(adata, color=['cell_type', 'leiden'])
|
||||
```
|
||||
|
||||
**Chunked processing:**
```python
# Process data in chunks
results = []
for chunk in adata.chunk_X(chunk_size=1000):
    result = expensive_computation(chunk)
    results.append(result)
```

### Muon (Multimodal data)
```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
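A brief usage note (hedged): the individual modalities inside a `MuData` container remain ordinary AnnData objects.

```python
rna = mdata.mod['rna']          # access a modality by name
print(mdata.n_obs, rna.n_vars)  # observations are shared across modalities
```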
|
||||
|
||||
### PyTorch integration
|
||||
```python
|
||||
from anndata.experimental import AnnLoader
|
||||
|
||||
# Create DataLoader for deep learning
|
||||
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
|
||||
|
||||
for batch in dataloader:
|
||||
X = batch.X
|
||||
# Train model
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Single-Cell RNA-seq Analysis

Complete workflow from loading to analysis:

```python
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc

# 1. Load data (10X HDF5 via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw data and counts
adata.raw = adata.copy()
adata.layers['counts'] = adata.X.copy()

# 4. Normalize and select highly variable genes
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad', compression='gzip')
```
|
||||
|
||||
### Batch Integration

Combining multiple experimental batches:

```python
import scanpy as sc

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner',
    index_unique='-'
)

# Add batch as numeric codes for correction algorithms that need them
adata.obs['batch_numeric'] = adata.obs['batch'].cat.codes

# Apply batch correction
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Save integrated data
adata.write_h5ad('integrated.h5ad', compression='gzip')
```
|
||||
|
||||
### Memory-Efficient Large Dataset Processing

Working with datasets too large for memory:

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load only the filtered subset into memory
adata_subset = high_quality.to_memory()
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

# Close the file connection when done
adata.file.close()
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
|
||||
### Data Organization
|
||||
|
||||
1. **Use layers for different representations**: Store raw counts, normalized, log-transformed, and scaled data in separate layers
|
||||
2. **Use obsm/varm for multi-dimensional data**: Embeddings, loadings, and other matrix-like annotations
|
||||
3. **Use uns for metadata**: Analysis parameters, dates, version information
|
||||
4. **Use categoricals for efficiency**: Convert repeated strings to categorical types
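A combined sketch of these organization recommendations (hedged; `raw_counts`, `umap_coordinates`, and the parameter values reuse illustrative names from the examples above):

```python
adata.layers['counts'] = raw_counts                                        # alternative representation
adata.obsm['X_umap'] = umap_coordinates                                    # multi-dimensional annotation
adata.uns['preprocessing'] = {'normalization': 'TPM', 'min_genes': 200}    # analysis parameters
adata.obs['condition'] = adata.obs['condition'].astype('category')         # categorical metadata
```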
|
||||
|
||||
### Subsetting
|
||||
|
||||
1. **Understand views vs copies**: Subsetting returns views by default; use `.copy()` when you need independence
|
||||
2. **Chain conditions efficiently**: Combine boolean masks in a single subsetting operation
|
||||
3. **Validate after subsetting**: Check dimensions and data integrity
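For example, a hedged sketch of combining conditions in one call and then validating the result:

```python
mask = (adata.obs['batch'] == 'batch1') & (adata.obs['n_genes'] > 200)
subset = adata[mask, adata.var['highly_variable']].copy()

assert subset.n_obs == int(mask.sum())
assert subset.X.shape == (subset.n_obs, subset.n_vars)
```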
|
||||
|
||||
### File I/O
|
||||
|
||||
1. **Use compression**: Always use `compression='gzip'` when writing h5ad files
|
||||
2. **Choose the right format**: H5AD for general use, Zarr for cloud storage, Loom for compatibility
|
||||
3. **Close backed files**: Always close file connections when done
|
||||
4. **Use backed mode for large files**: Don't load everything into memory if not needed
|
||||
|
||||
### Concatenation
|
||||
|
||||
1. **Choose appropriate join**: Inner join for complete cases, outer join to preserve all features
|
||||
2. **Track sources**: Use `label` and `keys` to track data origin
|
||||
3. **Handle duplicates**: Use `index_unique` to make observation names unique
|
||||
4. **Select merge strategy**: Choose appropriate merge strategy for variable annotations
|
||||
|
||||
### Memory Management
|
||||
|
||||
1. **Use sparse matrices**: For data with <30% non-zero values
|
||||
2. **Convert to categoricals**: For repeated string values
|
||||
3. **Process in chunks**: For operations on very large matrices
|
||||
4. **Use backed mode**: Read large files with `backed='r'`
|
||||
|
||||
### Naming Conventions
|
||||
|
||||
Follow these conventions for consistency:
|
||||
|
||||
- **Embeddings**: `X_pca`, `X_umap`, `X_tsne`
|
||||
- **Layers**: Descriptive names like `counts`, `log1p`, `scaled`
|
||||
- **Observations**: Use snake_case like `cell_type`, `n_genes`, `total_counts`
|
||||
- **Variables**: Use snake_case like `highly_variable`, `gene_name`
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
For detailed API information, usage patterns, and troubleshooting, refer to the comprehensive reference files in the `references/` directory:
|
||||
|
||||
1. **api_reference.md**: Complete API documentation including all classes, methods, and functions with usage examples. Use `grep -r "pattern" references/api_reference.md` to search for specific functions or parameters.
|
||||
|
||||
2. **workflows_best_practices.md**: Detailed workflows for common tasks (single-cell analysis, batch integration, large datasets), best practices for memory management, data organization, and common pitfalls to avoid. Use `grep -r "pattern" references/workflows_best_practices.md` to search for specific workflows.
|
||||
|
||||
3. **concatenation_guide.md**: Comprehensive guide to concatenation strategies, join types, merge strategies, source tracking, and troubleshooting concatenation issues. Use `grep -r "pattern" references/concatenation_guide.md` to search for concatenation patterns.
|
||||
|
||||
## When to Load References
|
||||
|
||||
Load reference files into context when:
|
||||
- Implementing complex concatenation with specific merge strategies
|
||||
- Troubleshooting errors or unexpected behavior
|
||||
- Optimizing memory usage for large datasets
|
||||
- Implementing complete analysis workflows
|
||||
- Understanding nuances of specific API methods
|
||||
|
||||
To search within references without loading them:

```bash
# Example: Search for information about backed mode
grep -r "backed mode" references/
```

## Troubleshooting

### Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```
|
||||
|
||||
### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

## Common Error Patterns

### Memory Errors
**Problem**: "MemoryError: Unable to allocate array"
**Solution**: Use backed mode, sparse matrices, or process in chunks
|
||||
|
||||
### Dimension Mismatch
**Problem**: "ValueError: operands could not be broadcast together"
**Solution**: Use outer join in concatenation or align indices before operations

### Index alignment issues

Always align external data on index:

```python
# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

### View Modification
**Problem**: "ValueError: assignment destination is read-only"
**Solution**: Convert view to copy with `.copy()` before modification

### File Already Open
**Problem**: "OSError: Unable to open file (file is already open)"
**Solution**: Close previous file connection with `adata.file.close()`
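A short hedged sketch of the last two fixes:

```python
# View Modification: copy before writing
subset = adata[adata.obs['condition'] == 'treated'].copy()
subset.obs['flag'] = True  # safe, does not touch the original

# File Already Open: close the backed file before re-opening
adata.file.close()
adata = ad.read_h5ad('file.h5ad', backed='r')
```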
|
||||
## Additional Resources
|
||||
|
||||
- **Official documentation**: https://anndata.readthedocs.io/
|
||||
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
|
||||
- **Scverse ecosystem**: https://scverse.org/
|
||||
- **GitHub repository**: https://github.com/scverse/anndata
|
||||
|
||||
@@ -1,218 +0,0 @@
|
||||
# AnnData API Reference
|
||||
|
||||
## Core AnnData Class
|
||||
|
||||
The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.
|
||||
|
||||
### Core Attributes
|
||||
|
||||
| Attribute | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
|
||||
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
|
||||
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
|
||||
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
|
||||
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to observation axis) |
|
||||
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to variable axis) |
|
||||
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
|
||||
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
|
||||
| **layers** | dict-like | Additional data matrices matching X's dimensions |
|
||||
| **raw** | AnnData | Stores original versions of X and var before transformations |
|
||||
|
||||
### Dimensional Properties
|
||||
|
||||
- **n_obs**: Number of observations (sample count)
|
||||
- **n_vars**: Number of variables/features
|
||||
- **shape**: Tuple returning (n_obs, n_vars)
|
||||
- **T**: Transposed view of the entire object
|
||||
|
||||
### State Properties
|
||||
|
||||
- **isbacked**: Boolean indicating disk-backed storage status
|
||||
- **is_view**: Boolean identifying whether object is a view of another AnnData
|
||||
- **filename**: Path to backing .h5ad file; setting this enables disk-backed mode
|
||||
|
||||
### Key Methods
|
||||
|
||||
#### Construction and Copying
|
||||
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create new AnnData object
|
||||
- **`copy(filename=None)`**: Create full copy, optionally stored on disk
|
||||
|
||||
#### Subsetting and Views
|
||||
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns view by default)
|
||||
- **`.copy()`**: Convert view to independent object
|
||||
|
||||
#### Data Access
|
||||
- **`to_df(layer=None)`**: Generate pandas DataFrame representation
|
||||
- **`obs_vector(k, layer=None)`**: Extract 1D array from X, layers, or annotations
|
||||
- **`var_vector(k, layer=None)`**: Extract 1D array for a variable
|
||||
- **`chunk_X(chunk_size)`**: Iterate over data matrix in chunks
|
||||
- **`chunked_X(chunk_size)`**: Context manager for chunked iteration
|
||||
|
||||
#### Transformation
|
||||
- **`transpose()`**: Return transposed object
|
||||
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along observation axis
|
||||
- **`to_memory(copy=False)`**: Load all backed arrays into RAM
|
||||
|
||||
#### File I/O
|
||||
- **`write_h5ad(filename, compression='gzip')`**: Save as .h5ad HDF5 format
|
||||
- **`write_zarr(store, ...)`**: Export hierarchical Zarr store
|
||||
- **`write_loom(filename, ...)`**: Output .loom format file
|
||||
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files
|
||||
|
||||
#### Data Management
|
||||
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
|
||||
- **`rename_categories(key, categories)`**: Update category labels in annotations
|
||||
- **`obs_names_make_unique(sep='-')`**: Append numeric suffixes to duplicate observation names
|
||||
- **`var_names_make_unique(sep='-')`**: Append numeric suffixes to duplicate variable names
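A brief hedged sketch of these data-management helpers in use (category labels are illustrative):

```python
# Convert repeated strings in obs/var to categoricals
adata.strings_to_categoricals()

# Rename the categories of a categorical annotation (new labels must match the existing count)
adata.rename_categories('cell_type', ['B cell', 'T cell', 'Monocyte'])

# De-duplicate names by appending numeric suffixes
adata.obs_names_make_unique(sep='-')
adata.var_names_make_unique(sep='-')
```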
|
||||
|
||||
## Module-Level Functions
|
||||
|
||||
### Reading Functions
|
||||
|
||||
#### Native Formats
|
||||
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
|
||||
- **`read_zarr(store)`**: Access hierarchical Zarr array stores
|
||||
|
||||
#### Alternative Formats
|
||||
- **`read_csv(filename, ...)`**: Import from CSV files
|
||||
- **`read_excel(filename, ...)`**: Import from Excel files
|
||||
- **`read_hdf(filename, key)`**: Read from HDF5 files
|
||||
- **`read_loom(filename, ...)`**: Import from .loom files
|
||||
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
|
||||
- **`read_text(filename, ...)`**: Import from text files
|
||||
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format
|
||||
|
||||
#### Element-Level Access
|
||||
- **`read_elem(elem)`**: Retrieve specific components from storage
|
||||
- **`sparse_dataset(group)`**: Generate backed sparse matrix classes
|
||||
|
||||
### Combining Operations
|
||||
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
|
||||
- **axis**: 0 (observations) or 1 (variables)
|
||||
- **join**: 'inner' (intersection) or 'outer' (union)
|
||||
- **merge**: Strategy for non-concatenation axis ('same', 'unique', 'first', 'only', or None)
|
||||
- **label**: Column name for source tracking
|
||||
- **keys**: Dataset identifiers for source annotation
|
||||
- **index_unique**: Separator for making duplicate indices unique
|
||||
|
||||
### Writing Functions
|
||||
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
|
||||
- **`write_zarr(store, adata, ...)`**: Save as Zarr hierarchical arrays
|
||||
- **`write_elem(elem, ...)`**: Write individual components
|
||||
|
||||
### Experimental Features
|
||||
- **`AnnCollection`**: Batch processing for large collections
|
||||
- **`AnnLoader`**: PyTorch DataLoader integration
|
||||
- **`concat_on_disk(*adatas, filename, ...)`**: Memory-efficient out-of-core concatenation
|
||||
- **`read_lazy(filename)`**: Lazy loading with deferred computation
|
||||
- **`read_dispatched(filename, ...)`**: Custom I/O with callbacks
|
||||
- **`write_dispatched(filename, ...)`**: Custom writing with callbacks
|
||||
|
||||
### Configuration
|
||||
- **`settings`**: Package-wide configuration object
|
||||
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes
|
||||
|
||||
## Common Usage Patterns
|
||||
|
||||
### Creating AnnData Objects
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# From dense array
|
||||
counts = np.random.poisson(1, size=(100, 2000))
|
||||
adata = ad.AnnData(counts)
|
||||
|
||||
# From sparse matrix
|
||||
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
|
||||
adata = ad.AnnData(counts)
|
||||
|
||||
# With metadata
|
||||
import pandas as pd
|
||||
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
|
||||
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
|
||||
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
|
||||
```
|
||||
|
||||
### Subsetting
|
||||
|
||||
```python
|
||||
# By names
|
||||
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
|
||||
|
||||
# By boolean mask
|
||||
b_cells = adata[adata.obs.cell_type == 'B']
|
||||
|
||||
# By position
|
||||
first_five = adata[:5, :100]
|
||||
|
||||
# Convert view to copy
|
||||
adata_copy = adata[:5].copy()
|
||||
```
|
||||
|
||||
### Adding Annotations
|
||||
|
||||
```python
|
||||
# Cell-level metadata
|
||||
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)
|
||||
|
||||
# Gene-level metadata
|
||||
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)
|
||||
|
||||
# Embeddings
|
||||
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
|
||||
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))
|
||||
|
||||
# Alternative data representations
|
||||
adata.layers['log_transformed'] = np.log1p(adata.X)
|
||||
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)
|
||||
|
||||
# Unstructured metadata
|
||||
adata.uns['experiment_date'] = '2024-01-15'
|
||||
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
|
||||
```
|
||||
|
||||
### File I/O
|
||||
|
||||
```python
|
||||
# Write to disk
|
||||
adata.write('my_results.h5ad', compression='gzip')
|
||||
|
||||
# Read into memory
|
||||
adata = ad.read_h5ad('my_results.h5ad')
|
||||
|
||||
# Read in backed mode (memory-efficient)
|
||||
adata = ad.read_h5ad('my_results.h5ad', backed='r')
|
||||
|
||||
# Close file connection
|
||||
adata.file.close()
|
||||
```
|
||||
|
||||
### Concatenation
|
||||
|
||||
```python
|
||||
# Combine multiple datasets
|
||||
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
|
||||
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
|
||||
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))
|
||||
|
||||
# Simple concatenation
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0)
|
||||
|
||||
# With source labels
|
||||
combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
axis=0,
|
||||
label='dataset',
|
||||
keys=['exp1', 'exp2', 'exp3']
|
||||
)
|
||||
|
||||
# Inner join (only shared variables)
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')
|
||||
|
||||
# Outer join (all variables, pad with zeros)
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
|
||||
```
|
||||
525 scientific-packages/anndata/references/best_practices.md (new file)
@@ -0,0 +1,525 @@
|
||||
# Best Practices
|
||||
|
||||
Guidelines for efficient and effective use of AnnData.
|
||||
|
||||
## Memory Management
|
||||
|
||||
### Use sparse matrices for sparse data
|
||||
```python
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix
|
||||
import anndata as ad
|
||||
|
||||
# Check data sparsity
|
||||
data = np.random.rand(1000, 2000)
|
||||
sparsity = 1 - np.count_nonzero(data) / data.size
|
||||
print(f"Sparsity: {sparsity:.2%}")
|
||||
|
||||
# Convert to sparse if >50% zeros
|
||||
if sparsity > 0.5:
|
||||
adata = ad.AnnData(X=csr_matrix(data))
|
||||
else:
|
||||
adata = ad.AnnData(X=data)
|
||||
|
||||
# Benefits: 10-100x memory reduction for sparse genomics data
|
||||
```
|
||||
|
||||
### Convert strings to categoricals
|
||||
```python
|
||||
# Inefficient: string columns use lots of memory
|
||||
adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A']
|
||||
|
||||
# Efficient: convert to categorical
|
||||
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
|
||||
|
||||
# Convert all string columns
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
# Benefits: 10-50x memory reduction for repeated strings
|
||||
```
|
||||
|
||||
### Use backed mode for large datasets
|
||||
```python
|
||||
# Don't load entire dataset into memory
|
||||
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
|
||||
|
||||
# Work with metadata
|
||||
filtered = adata[adata.obs['quality'] > 0.8]
|
||||
|
||||
# Load only filtered subset
|
||||
adata_subset = filtered.to_memory()
|
||||
|
||||
# Benefits: Work with datasets larger than RAM
|
||||
```
|
||||
|
||||
## Views vs Copies
|
||||
|
||||
### Understanding views
|
||||
```python
|
||||
# Subsetting creates a view by default
|
||||
subset = adata[0:100, :]
|
||||
print(subset.is_view) # True
|
||||
|
||||
# Views don't copy data (memory efficient)
|
||||
# But modifications can affect original
|
||||
|
||||
# Check if object is a view
|
||||
if adata.is_view:
|
||||
adata = adata.copy() # Make independent
|
||||
```
|
||||
|
||||
### When to use views
|
||||
```python
|
||||
# Good: Read-only operations on subsets
|
||||
mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean()
|
||||
|
||||
# Good: Temporary analysis
|
||||
temp_subset = adata[:100, :]
|
||||
result = analyze(temp_subset.X)
|
||||
```
|
||||
|
||||
### When to use copies
|
||||
```python
|
||||
# Create independent copy for modifications
|
||||
adata_filtered = adata[keep_cells, :].copy()
|
||||
|
||||
# Safe to modify without affecting original
|
||||
adata_filtered.obs['new_column'] = values
|
||||
|
||||
# Always copy when:
|
||||
# - Storing subset for later use
|
||||
# - Modifying subset data
|
||||
# - Passing to function that modifies data
|
||||
```
|
||||
|
||||
## Data Storage Best Practices
|
||||
|
||||
### Choose the right format
|
||||
|
||||
**H5AD (HDF5) - Default choice**
|
||||
```python
|
||||
adata.write_h5ad('data.h5ad', compression='gzip')
|
||||
```
|
||||
- Fast random access
|
||||
- Supports backed mode
|
||||
- Good compression
|
||||
- Best for: Most use cases
|
||||
|
||||
**Zarr - Cloud and parallel access**
|
||||
```python
|
||||
adata.write_zarr('data.zarr', chunks=(100, 100))
|
||||
```
|
||||
- Excellent for cloud storage (S3, GCS)
|
||||
- Supports parallel I/O
|
||||
- Good compression
|
||||
- Best for: Large datasets, cloud workflows, parallel processing
|
||||
|
||||
**CSV - Interoperability**
|
||||
```python
|
||||
adata.write_csvs('output_dir/')
|
||||
```
|
||||
- Human readable
|
||||
- Compatible with all tools
|
||||
- Large file sizes, slow
|
||||
- Best for: Sharing with non-Python tools, small datasets
|
||||
|
||||
### Optimize file size
|
||||
```python
|
||||
# Before saving, optimize:
|
||||
|
||||
# 1. Convert to sparse if appropriate
|
||||
from scipy.sparse import csr_matrix, issparse
|
||||
if not issparse(adata.X):
|
||||
density = np.count_nonzero(adata.X) / adata.X.size
|
||||
if density < 0.5:
|
||||
adata.X = csr_matrix(adata.X)
|
||||
|
||||
# 2. Convert strings to categoricals
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
# 3. Use compression
|
||||
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
|
||||
|
||||
# Typical results: 5-20x file size reduction
|
||||
```
|
||||
|
||||
## Backed Mode Strategies
|
||||
|
||||
### Read-only analysis
|
||||
```python
|
||||
# Open in read-only backed mode
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r')
|
||||
|
||||
# Perform filtering without loading data
|
||||
high_quality = adata[adata.obs['quality_score'] > 0.8]
|
||||
|
||||
# Load only filtered data
|
||||
adata_filtered = high_quality.to_memory()
|
||||
```
|
||||
|
||||
### Read-write modifications
|
||||
```python
|
||||
# Open in read-write backed mode
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r+')
|
||||
|
||||
# Modify metadata (written to disk)
|
||||
adata.obs['new_annotation'] = values
|
||||
|
||||
# X remains on disk, modifications saved immediately
|
||||
```
|
||||
|
||||
### Chunked processing
|
||||
```python
|
||||
# Process large dataset in chunks
|
||||
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
|
||||
|
||||
results = []
|
||||
chunk_size = 1000
|
||||
|
||||
for i in range(0, adata.n_obs, chunk_size):
|
||||
chunk = adata[i:i+chunk_size, :].to_memory()
|
||||
result = process(chunk)
|
||||
results.append(result)
|
||||
|
||||
final_result = combine(results)
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Subsetting performance
|
||||
```python
|
||||
# Fast: Boolean indexing with arrays
|
||||
mask = np.array(adata.obs['quality'] > 0.5)
|
||||
subset = adata[mask, :]
|
||||
|
||||
# Slow: Boolean indexing with Series (creates view chain)
|
||||
subset = adata[adata.obs['quality'] > 0.5, :]
|
||||
|
||||
# Fastest: Integer indices
|
||||
indices = np.where(adata.obs['quality'] > 0.5)[0]
|
||||
subset = adata[indices, :]
|
||||
```
|
||||
|
||||
### Avoid repeated subsetting
|
||||
```python
|
||||
# Inefficient: Multiple subset operations
|
||||
for cell_type in ['A', 'B', 'C']:
|
||||
subset = adata[adata.obs['cell_type'] == cell_type]
|
||||
process(subset)
|
||||
|
||||
# Efficient: Group and process
|
||||
groups = adata.obs.groupby('cell_type').groups
|
||||
for cell_type, indices in groups.items():
|
||||
subset = adata[indices, :]
|
||||
process(subset)
|
||||
```
|
||||
|
||||
### Use chunked operations for large matrices
|
||||
```python
|
||||
# Process X in chunks
|
||||
for chunk in adata.chunked_X(chunk_size=1000):
|
||||
result = compute(chunk)
|
||||
|
||||
# More memory efficient than loading full X
|
||||
```
|
||||
|
||||
## Working with Raw Data
|
||||
|
||||
### Store raw before filtering
|
||||
```python
|
||||
# Original data with all genes
|
||||
adata = ad.AnnData(X=counts)
|
||||
|
||||
# Store raw before filtering
|
||||
adata.raw = adata.copy()
|
||||
|
||||
# Filter to highly variable genes
|
||||
adata = adata[:, adata.var['highly_variable']]
|
||||
|
||||
# Later: access original data
|
||||
original_expression = adata.raw.X
|
||||
all_genes = adata.raw.var_names
|
||||
```
|
||||
|
||||
### When to use raw
|
||||
```python
|
||||
# Use raw for:
|
||||
# - Differential expression on filtered genes
|
||||
# - Visualization of specific genes not in filtered set
|
||||
# - Accessing original counts after normalization
|
||||
|
||||
# Access raw data
|
||||
if adata.raw is not None:
|
||||
gene_expr = adata.raw[:, 'GENE_NAME'].X
|
||||
else:
|
||||
gene_expr = adata[:, 'GENE_NAME'].X
|
||||
```
|
||||
|
||||
## Metadata Management
|
||||
|
||||
### Naming conventions
|
||||
```python
|
||||
# Consistent naming improves usability
|
||||
|
||||
# Observation metadata (obs):
|
||||
# - cell_id, sample_id
|
||||
# - cell_type, tissue, condition
|
||||
# - n_genes, n_counts, percent_mito
|
||||
# - cluster, leiden, louvain
|
||||
|
||||
# Variable metadata (var):
|
||||
# - gene_id, gene_name
|
||||
# - highly_variable, n_cells
|
||||
# - mean_expression, dispersion
|
||||
|
||||
# Embeddings (obsm):
|
||||
# - X_pca, X_umap, X_tsne
|
||||
# - X_diffmap, X_draw_graph_fr
|
||||
|
||||
# Follow conventions from scanpy/scverse ecosystem
|
||||
```
|
||||
|
||||
### Document metadata
|
||||
```python
|
||||
# Store metadata descriptions in uns
|
||||
adata.uns['metadata_descriptions'] = {
|
||||
'cell_type': 'Cell type annotation from automated clustering',
|
||||
'quality_score': 'QC score from scrublet (0-1, higher is better)',
|
||||
'batch': 'Experimental batch identifier'
|
||||
}
|
||||
|
||||
# Store processing history
|
||||
adata.uns['processing_steps'] = [
|
||||
'Raw counts loaded from 10X',
|
||||
'Filtered: n_genes > 200, n_counts < 50000',
|
||||
'Normalized to 10000 counts per cell',
|
||||
'Log transformed'
|
||||
]
|
||||
```
|
||||
|
||||
## Reproducibility
|
||||
|
||||
### Set random seeds
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# Set seed for reproducible results
|
||||
np.random.seed(42)
|
||||
|
||||
# Document in uns
|
||||
adata.uns['random_seed'] = 42
|
||||
```
|
||||
|
||||
### Store parameters
|
||||
```python
|
||||
# Store analysis parameters in uns
|
||||
adata.uns['pca'] = {
|
||||
'n_comps': 50,
|
||||
'svd_solver': 'arpack',
|
||||
'random_state': 42
|
||||
}
|
||||
|
||||
adata.uns['neighbors'] = {
|
||||
'n_neighbors': 15,
|
||||
'n_pcs': 50,
|
||||
'metric': 'euclidean',
|
||||
'method': 'umap'
|
||||
}
|
||||
```
|
||||
|
||||
### Version tracking
|
||||
```python
|
||||
import anndata
|
||||
import scanpy
|
||||
import numpy
import sys
|
||||
|
||||
# Store versions
|
||||
adata.uns['versions'] = {
|
||||
'anndata': anndata.__version__,
|
||||
'scanpy': scanpy.__version__,
|
||||
'numpy': numpy.__version__,
|
||||
'python': sys.version
|
||||
}
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Check data validity
|
||||
```python
|
||||
# Verify dimensions
|
||||
assert adata.n_obs == len(adata.obs)
|
||||
assert adata.n_vars == len(adata.var)
|
||||
assert adata.X.shape == (adata.n_obs, adata.n_vars)
|
||||
|
||||
# Check for NaN values
|
||||
has_nan = np.isnan(adata.X.data).any() if issparse(adata.X) else np.isnan(adata.X).any()
|
||||
if has_nan:
|
||||
print("Warning: Data contains NaN values")
|
||||
|
||||
# Check for negative values (if counts expected)
|
||||
has_negative = (adata.X.data < 0).any() if issparse(adata.X) else (adata.X < 0).any()
|
||||
if has_negative:
|
||||
print("Warning: Data contains negative values")
|
||||
```
|
||||
|
||||
### Validate metadata
|
||||
```python
|
||||
# Check for missing values
|
||||
missing_obs = adata.obs.isnull().sum()
|
||||
if missing_obs.any():
|
||||
print("Missing values in obs:")
|
||||
print(missing_obs[missing_obs > 0])
|
||||
|
||||
# Verify indices are unique
|
||||
assert adata.obs_names.is_unique, "Observation names not unique"
|
||||
assert adata.var_names.is_unique, "Variable names not unique"
|
||||
|
||||
# Check metadata alignment
|
||||
assert len(adata.obs) == adata.n_obs
|
||||
assert len(adata.var) == adata.n_vars
|
||||
```
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
### Scanpy integration
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# AnnData is native format for scanpy
|
||||
sc.pp.filter_cells(adata, min_genes=200)
|
||||
sc.pp.filter_genes(adata, min_cells=3)
|
||||
sc.pp.normalize_total(adata, target_sum=1e4)
|
||||
sc.pp.log1p(adata)
|
||||
sc.pp.highly_variable_genes(adata)
|
||||
sc.pp.pca(adata)
|
||||
sc.pp.neighbors(adata)
|
||||
sc.tl.umap(adata)
|
||||
```
|
||||
|
||||
### Pandas integration
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Convert to DataFrame
|
||||
df = adata.to_df()
|
||||
|
||||
# Create from DataFrame
|
||||
adata = ad.AnnData(df)
|
||||
|
||||
# Work with metadata as DataFrames
|
||||
adata.obs = adata.obs.merge(external_metadata, left_index=True, right_index=True)
|
||||
```
|
||||
|
||||
### PyTorch integration
|
||||
```python
|
||||
from anndata.experimental import AnnLoader
|
||||
|
||||
# Create PyTorch DataLoader
|
||||
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
|
||||
|
||||
# Iterate in training loop
|
||||
for batch in dataloader:
|
||||
X = batch.X
|
||||
# Train model on batch
|
||||
```
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### Pitfall 1: Modifying views
|
||||
```python
|
||||
# Wrong: Modifying view can affect original
|
||||
subset = adata[:100, :]
|
||||
subset.X = new_data # May modify adata.X!
|
||||
|
||||
# Correct: Copy before modifying
|
||||
subset = adata[:100, :].copy()
|
||||
subset.X = new_data # Independent copy
|
||||
```
|
||||
|
||||
### Pitfall 2: Index misalignment
|
||||
```python
|
||||
# Wrong: Assuming order matches
|
||||
external_data = pd.read_csv('data.csv')
|
||||
adata.obs['new_col'] = external_data['values'] # May misalign!
|
||||
|
||||
# Correct: Align on index
|
||||
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
|
||||
```
|
||||
|
||||
### Pitfall 3: Mixing sparse and dense
|
||||
```python
|
||||
# Wrong: Converting sparse to dense uses huge memory
|
||||
result = adata.X + 1 # Converts sparse to dense!
|
||||
|
||||
# Correct: Use sparse operations
|
||||
from scipy.sparse import issparse
|
||||
if issparse(adata.X):
|
||||
result = adata.X.copy()
|
||||
result.data += 1
|
||||
```
|
||||
|
||||
### Pitfall 4: Not handling views
|
||||
```python
|
||||
# Wrong: Assuming subset is independent
|
||||
subset = adata[mask, :]
|
||||
del adata # subset may become invalid!
|
||||
|
||||
# Correct: Copy when needed
|
||||
subset = adata[mask, :].copy()
|
||||
del adata # subset remains valid
|
||||
```
|
||||
|
||||
### Pitfall 5: Ignoring memory constraints
|
||||
```python
|
||||
# Wrong: Loading huge dataset into memory
|
||||
adata = ad.read_h5ad('100GB_file.h5ad') # OOM error!
|
||||
|
||||
# Correct: Use backed mode
|
||||
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
|
||||
subset = adata[adata.obs['keep']].to_memory()
|
||||
```
|
||||
|
||||
## Workflow Example
|
||||
|
||||
Complete best-practices workflow:
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
from scipy.sparse import csr_matrix, issparse
|
||||
|
||||
# 1. Load with backed mode if large
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r')
|
||||
|
||||
# 2. Quick metadata check without loading data
|
||||
print(f"Dataset: {adata.n_obs} cells × {adata.n_vars} genes")
|
||||
|
||||
# 3. Filter based on metadata
|
||||
high_quality = adata[adata.obs['quality_score'] > 0.8]
|
||||
|
||||
# 4. Load filtered subset to memory
|
||||
adata = high_quality.to_memory()
|
||||
|
||||
# 5. Convert to optimal storage types
|
||||
adata.strings_to_categoricals()
|
||||
if not issparse(adata.X):
|
||||
density = np.count_nonzero(adata.X) / adata.X.size
|
||||
if density < 0.5:
|
||||
adata.X = csr_matrix(adata.X)
|
||||
|
||||
# 6. Store raw before filtering genes
|
||||
adata.raw = adata.copy()
|
||||
|
||||
# 7. Filter to highly variable genes
|
||||
adata = adata[:, adata.var['highly_variable']].copy()
|
||||
|
||||
# 8. Document processing
|
||||
adata.uns['processing'] = {
|
||||
'filtered': 'quality_score > 0.8',
|
||||
'n_hvg': adata.n_vars,
|
||||
'date': '2025-11-03'
|
||||
}
|
||||
|
||||
# 9. Save optimized
|
||||
adata.write_h5ad('processed.h5ad', compression='gzip')
|
||||
```
|
||||
396 scientific-packages/anndata/references/concatenation.md (new file)
@@ -0,0 +1,396 @@
|
||||
# Concatenating AnnData Objects
|
||||
|
||||
Combine multiple AnnData objects along either observations or variables axis.
|
||||
|
||||
## Basic Concatenation
|
||||
|
||||
### Concatenate along observations (stack cells/samples)
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
|
||||
# Create multiple AnnData objects
|
||||
adata1 = ad.AnnData(X=np.random.rand(100, 50))
|
||||
adata2 = ad.AnnData(X=np.random.rand(150, 50))
|
||||
adata3 = ad.AnnData(X=np.random.rand(200, 50))
|
||||
|
||||
# Concatenate along observations (axis=0, default)
|
||||
adata_combined = ad.concat([adata1, adata2, adata3], axis=0)
|
||||
|
||||
print(adata_combined.shape) # (450, 50)
|
||||
```
|
||||
|
||||
### Concatenate along variables (stack genes/features)
|
||||
```python
|
||||
# Create objects with same observations, different variables
|
||||
adata1 = ad.AnnData(X=np.random.rand(100, 50))
|
||||
adata2 = ad.AnnData(X=np.random.rand(100, 30))
|
||||
adata3 = ad.AnnData(X=np.random.rand(100, 70))
|
||||
|
||||
# Concatenate along variables (axis=1)
|
||||
adata_combined = ad.concat([adata1, adata2, adata3], axis=1)
|
||||
|
||||
print(adata_combined.shape) # (100, 150)
|
||||
```
|
||||
|
||||
## Join Types
|
||||
|
||||
### Inner join (intersection)
|
||||
Keep only variables/observations present in all objects.
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create objects with different variables
|
||||
adata1 = ad.AnnData(
|
||||
X=np.random.rand(100, 50),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)])
|
||||
)
|
||||
adata2 = ad.AnnData(
|
||||
X=np.random.rand(150, 60),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)])
|
||||
)
|
||||
|
||||
# Inner join: only genes 10-49 are kept (overlap)
|
||||
adata_inner = ad.concat([adata1, adata2], join='inner')
|
||||
print(adata_inner.n_vars) # 40 genes (overlap)
|
||||
```
|
||||
|
||||
### Outer join (union)
|
||||
Keep all variables/observations, filling missing values.
|
||||
|
||||
```python
|
||||
# Outer join: all genes are kept
|
||||
adata_outer = ad.concat([adata1, adata2], join='outer')
|
||||
print(adata_outer.n_vars) # 70 genes (union)
|
||||
|
||||
# Missing values are filled with appropriate defaults:
|
||||
# - 0 for sparse matrices
|
||||
# - NaN for dense matrices
|
||||
```
|
||||
|
||||
### Fill values for outer joins
|
||||
```python
|
||||
# Specify fill value for missing data
|
||||
adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0)
|
||||
```
|
||||
|
||||
## Tracking Data Sources
|
||||
|
||||
### Add batch labels
|
||||
```python
|
||||
# Label which object each observation came from
|
||||
adata_combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='batch', # Column name for labels
|
||||
keys=['batch1', 'batch2', 'batch3'] # Labels for each object
|
||||
)
|
||||
|
||||
print(adata_combined.obs['batch'].value_counts())
|
||||
# batch1 100
|
||||
# batch2 150
|
||||
# batch3 200
|
||||
```
|
||||
|
||||
### Automatic batch labels
|
||||
```python
|
||||
# If keys not provided, uses integer indices
|
||||
adata_combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='dataset'
|
||||
)
|
||||
# dataset column contains: 0, 1, 2
|
||||
```
|
||||
|
||||
## Merge Strategies
|
||||
|
||||
Control how metadata from different objects is combined using the `merge` parameter.
|
||||
|
||||
### merge=None (default for observations)
|
||||
Exclude metadata on non-concatenation axis.
|
||||
|
||||
```python
|
||||
# When concatenating observations, var metadata must match
|
||||
adata1.var['gene_type'] = 'protein_coding'
|
||||
adata2.var['gene_type'] = 'protein_coding'
|
||||
|
||||
# var is kept only if identical across all objects
|
||||
adata_combined = ad.concat([adata1, adata2], merge=None)
|
||||
```
|
||||
|
||||
### merge='same'
|
||||
Keep metadata that is identical across all objects.
|
||||
|
||||
```python
|
||||
adata1.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 25 else 'chr2' for g in adata1.var_names]
adata2.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 25 else 'chr2' for g in adata2.var_names]
adata1.var['type'] = 'protein_coding'
adata2.var['type'] = 'lncRNA'  # Different

# 'chromosome' is kept (values agree on the shared genes), 'type' is excluded (values differ)
adata_combined = ad.concat([adata1, adata2], merge='same')
|
||||
```
|
||||
|
||||
### merge='unique'
|
||||
Keep metadata columns where each key has exactly one value.
|
||||
|
||||
```python
|
||||
adata1.var['gene_id'] = [f'ENSG{int(g.split("_")[1]):05d}' for g in adata1.var_names]
adata2.var['gene_id'] = [f'ENSG{int(g.split("_")[1]):05d}' for g in adata2.var_names]

# gene_id is kept (each gene has exactly one value across the objects)
adata_combined = ad.concat([adata1, adata2], merge='unique')
|
||||
```
|
||||
|
||||
### merge='first'
|
||||
Take values from the first object containing each key.
|
||||
|
||||
```python
|
||||
adata1.var['description'] = ['Desc1'] * adata1.n_vars
adata2.var['description'] = ['Desc2'] * adata2.n_vars
|
||||
|
||||
# Uses descriptions from adata1
|
||||
adata_combined = ad.concat([adata1, adata2], merge='first')
|
||||
```
|
||||
|
||||
### merge='only'
|
||||
Keep metadata that appears in only one object.
|
||||
|
||||
```python
|
||||
adata1.var['adata1_specific'] = [1] * adata1.n_vars
adata2.var['adata2_specific'] = [2] * adata2.n_vars
|
||||
|
||||
# Both metadata columns are kept
|
||||
adata_combined = ad.concat([adata1, adata2], merge='only')
|
||||
```
|
||||
|
||||
## Handling Index Conflicts
|
||||
|
||||
### Make indices unique
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create objects with overlapping observation names
|
||||
adata1 = ad.AnnData(
|
||||
X=np.random.rand(3, 10),
|
||||
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
|
||||
)
|
||||
adata2 = ad.AnnData(
|
||||
X=np.random.rand(3, 10),
|
||||
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
|
||||
)
|
||||
|
||||
# Make indices unique by appending batch keys
|
||||
adata_combined = ad.concat(
|
||||
[adata1, adata2],
|
||||
label='batch',
|
||||
keys=['batch1', 'batch2'],
|
||||
index_unique='_' # Separator for making indices unique
|
||||
)
|
||||
|
||||
print(adata_combined.obs_names)
|
||||
# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1',
|
||||
# 'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2']
|
||||
```
|
||||
|
||||
## Concatenating Layers
|
||||
|
||||
```python
|
||||
# Objects with layers
|
||||
adata1 = ad.AnnData(X=np.random.rand(100, 50))
|
||||
adata1.layers['normalized'] = np.random.rand(100, 50)
|
||||
adata1.layers['scaled'] = np.random.rand(100, 50)
|
||||
|
||||
adata2 = ad.AnnData(X=np.random.rand(150, 50))
|
||||
adata2.layers['normalized'] = np.random.rand(150, 50)
|
||||
adata2.layers['scaled'] = np.random.rand(150, 50)
|
||||
|
||||
# Layers are concatenated automatically if present in all objects
|
||||
adata_combined = ad.concat([adata1, adata2])
|
||||
|
||||
print(adata_combined.layers.keys())
|
||||
# dict_keys(['normalized', 'scaled'])
|
||||
```
|
||||
|
||||
## Concatenating Multi-dimensional Annotations
|
||||
|
||||
### obsm/varm
|
||||
```python
|
||||
# Objects with embeddings
|
||||
adata1.obsm['X_pca'] = np.random.rand(100, 50)
|
||||
adata2.obsm['X_pca'] = np.random.rand(150, 50)
|
||||
|
||||
# obsm is concatenated along observation axis
|
||||
adata_combined = ad.concat([adata1, adata2])
|
||||
print(adata_combined.obsm['X_pca'].shape) # (250, 50)
|
||||
```
|
||||
|
||||
### obsp/varp (pairwise annotations)
|
||||
```python
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Pairwise matrices
|
||||
adata1.obsp['connectivities'] = csr_matrix((100, 100))
|
||||
adata2.obsp['connectivities'] = csr_matrix((150, 150))
|
||||
|
||||
# By default, obsp is NOT concatenated (set pairwise=True to include)
|
||||
adata_combined = ad.concat([adata1, adata2])
|
||||
# adata_combined.obsp is empty
|
||||
|
||||
# Include pairwise data (creates block diagonal matrix)
|
||||
adata_combined = ad.concat([adata1, adata2], pairwise=True)
|
||||
print(adata_combined.obsp['connectivities'].shape) # (250, 250)
|
||||
```
|
||||
|
||||
## Concatenating uns (unstructured)
|
||||
|
||||
Unstructured metadata is merged recursively:
|
||||
|
||||
```python
|
||||
adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'}
|
||||
adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'}
|
||||
|
||||
# Using merge='unique' for uns
|
||||
adata_combined = ad.concat([adata1, adata2], uns_merge='unique')
|
||||
# 'date' is kept (same value), 'batch' might be excluded (different values)
|
||||
```
|
||||
|
||||
## Lazy Concatenation (AnnCollection)
|
||||
|
||||
For very large datasets, use lazy concatenation that doesn't load all data:
|
||||
|
||||
```python
|
||||
import anndata as ad
from anndata.experimental import AnnCollection

# Open files in backed mode (metadata only; X stays on disk)
files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad']
adatas = [ad.read_h5ad(f, backed='r') for f in files]

# Build the collection without loading the data matrices
collection = AnnCollection(
    adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset',
    keys=['dataset1', 'dataset2', 'dataset3']
)
|
||||
|
||||
# Access data lazily
|
||||
print(collection.n_obs) # Total observations
|
||||
print(collection.obs.head()) # Metadata loaded, not X
|
||||
|
||||
# Convert to regular AnnData when needed (loads all data)
|
||||
adata = collection.to_adata()
|
||||
```
|
||||
|
||||
### Working with AnnCollection
|
||||
```python
# Subset without loading data
subset = collection[collection.obs['cell_type'] == 'T cell']

# The backed objects opened above remain available for per-dataset access
for adata in adatas:
    print(adata.shape)

# Access a specific dataset
first_dataset = adatas[0]
```
|
||||
|
||||
## Concatenation on Disk
|
||||
|
||||
For datasets too large for memory, concatenate directly on disk:
|
||||
|
||||
```python
|
||||
from anndata.experimental import concat_on_disk
|
||||
|
||||
# Concatenate without loading into memory
|
||||
concat_on_disk(
|
||||
['data1.h5ad', 'data2.h5ad', 'data3.h5ad'],
|
||||
'combined.h5ad',
|
||||
join='outer'
|
||||
)
|
||||
|
||||
# Load result in backed mode
|
||||
adata = ad.read_h5ad('combined.h5ad', backed='r')
|
||||
```
|
||||
|
||||
## Common Concatenation Patterns
|
||||
|
||||
### Combine technical replicates
|
||||
```python
|
||||
# Multiple runs of the same samples
|
||||
replicates = [adata_run1, adata_run2, adata_run3]
|
||||
adata_combined = ad.concat(
|
||||
replicates,
|
||||
label='technical_replicate',
|
||||
keys=['rep1', 'rep2', 'rep3'],
|
||||
join='inner' # Keep only genes measured in all runs
|
||||
)
|
||||
```
|
||||
|
||||
### Combine batches from experiment
|
||||
```python
|
||||
# Different experimental batches
|
||||
batches = [adata_batch1, adata_batch2, adata_batch3]
|
||||
adata_combined = ad.concat(
|
||||
batches,
|
||||
label='batch',
|
||||
keys=['batch1', 'batch2', 'batch3'],
|
||||
join='outer' # Keep all genes
|
||||
)
|
||||
|
||||
# Later: apply batch correction
|
||||
```
|
||||
|
||||
### Merge multi-modal data
|
||||
```python
|
||||
# Different measurement modalities (e.g., RNA + protein)
|
||||
adata_rna = ad.AnnData(X=np.random.rand(100, 2000))
|
||||
adata_protein = ad.AnnData(X=np.random.rand(100, 50))
|
||||
|
||||
# Concatenate along variables to combine modalities
|
||||
adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1)
|
||||
|
||||
# Add labels to distinguish modalities
|
||||
adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Check compatibility before concatenating**
|
||||
```python
|
||||
# Verify shapes are compatible
|
||||
print([adata.n_vars for adata in [adata1, adata2, adata3]])
|
||||
|
||||
# Check variable names match
|
||||
print([set(adata.var_names) for adata in [adata1, adata2, adata3]])
|
||||
```
|
||||
|
||||
2. **Use appropriate join type**
|
||||
- `inner`: When you need the same features across all samples (most stringent)
|
||||
- `outer`: When you want to preserve all features (most inclusive)
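
A quick way to gauge which join fits (a small sketch, assuming two in-memory objects named `adata1` and `adata2`):

```python
# Check how much the gene sets overlap before choosing the join
shared = set(adata1.var_names) & set(adata2.var_names)
print(f"{len(shared)} shared genes of {adata1.n_vars} and {adata2.n_vars}")
# Large overlap: 'inner' loses little; small overlap: 'outer' preserves more information
```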
|
||||
|
||||
3. **Track data sources**
|
||||
Always use `label` and `keys` to track which observations came from which dataset.
|
||||
|
||||
4. **Consider memory usage**
|
||||
- For large datasets, use `AnnCollection` or `concat_on_disk`
|
||||
- Consider backed mode for the result
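
A minimal sketch of the backed-result idea, assuming the combined object already exists in memory:

```python
# Write the combined object once, then reopen it in backed mode for downstream queries
adata_combined.write_h5ad('combined.h5ad', compression='gzip')
adata_combined = ad.read_h5ad('combined.h5ad', backed='r')
```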
|
||||
|
||||
5. **Handle batch effects**
|
||||
Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation:
|
||||
```python
|
||||
# After concatenation, apply batch correction
|
||||
import scanpy as sc
|
||||
sc.pp.combat(adata_combined, key='batch')
|
||||
```
|
||||
|
||||
6. **Validate results**
|
||||
```python
|
||||
# Check dimensions
|
||||
print(adata_combined.shape)
|
||||
|
||||
# Check batch distribution
|
||||
print(adata_combined.obs['batch'].value_counts())
|
||||
|
||||
# Verify metadata integrity
|
||||
print(adata_combined.var.head())
|
||||
print(adata_combined.obs.head())
|
||||
```
|
||||
@@ -1,478 +0,0 @@
|
||||
# AnnData Concatenation Guide
|
||||
|
||||
## Overview
|
||||
|
||||
The `concat()` function combines multiple AnnData objects through two fundamental operations:
|
||||
1. **Concatenation**: Stacking sub-elements in order
|
||||
2. **Merging**: Combining collections into one result
|
||||
|
||||
## Basic Concatenation
|
||||
|
||||
### Syntax
|
||||
```python
|
||||
import anndata as ad
|
||||
|
||||
combined = ad.concat(
|
||||
adatas, # List of AnnData objects
|
||||
axis=0, # 0=observations, 1=variables
|
||||
join='inner', # 'inner' or 'outer'
|
||||
merge=None, # Merge strategy for non-concat axis
|
||||
label=None, # Column name for source tracking
|
||||
keys=None, # Dataset identifiers
|
||||
index_unique=None, # Separator for unique indices
|
||||
fill_value=None, # Fill value for missing data
|
||||
pairwise=False # Include pairwise matrices
|
||||
)
|
||||
```
|
||||
|
||||
### Concatenating Observations (Cells)
|
||||
```python
|
||||
# Most common: combining multiple samples/batches
|
||||
adata1 = ad.AnnData(np.random.rand(100, 2000))
|
||||
adata2 = ad.AnnData(np.random.rand(150, 2000))
|
||||
adata3 = ad.AnnData(np.random.rand(80, 2000))
|
||||
|
||||
combined = ad.concat([adata1, adata2, adata3], axis=0)
|
||||
# Result: (330 observations, 2000 variables)
|
||||
```
|
||||
|
||||
### Concatenating Variables (Genes)
|
||||
```python
|
||||
# Less common: combining different feature sets
|
||||
adata1 = ad.AnnData(np.random.rand(100, 1000))
|
||||
adata2 = ad.AnnData(np.random.rand(100, 500))
|
||||
|
||||
combined = ad.concat([adata1, adata2], axis=1)
|
||||
# Result: (100 observations, 1500 variables)
|
||||
```
|
||||
|
||||
## Join Strategies
|
||||
|
||||
### Inner Join (Intersection)
|
||||
|
||||
Keeps only shared features across all objects.
|
||||
|
||||
```python
|
||||
# Datasets with different genes
|
||||
adata1 = ad.AnnData(
|
||||
np.random.rand(100, 2000),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
|
||||
)
|
||||
adata2 = ad.AnnData(
|
||||
np.random.rand(150, 1800),
|
||||
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
|
||||
)
|
||||
|
||||
# Inner join: only genes present in both
|
||||
combined = ad.concat([adata1, adata2], join='inner')
|
||||
# Result: (250 observations, 1800 variables)
|
||||
# Only Gene_200 through Gene_1999
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- You want to analyze only features measured in all datasets
|
||||
- Missing features would compromise analysis
|
||||
- You need a complete case analysis
|
||||
|
||||
**Trade-offs:**
|
||||
- May lose many features
|
||||
- Ensures no missing data
|
||||
- Smaller result size
|
||||
|
||||
### Outer Join (Union)
|
||||
|
||||
Keeps all features from all objects, padding with fill values (default 0).
|
||||
|
||||
```python
|
||||
# Outer join: all genes from both datasets
|
||||
combined = ad.concat([adata1, adata2], join='outer')
|
||||
# Result: (250 observations, 2000 variables)
|
||||
# Missing values filled with 0
|
||||
|
||||
# Custom fill value
|
||||
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- You want to preserve all features
|
||||
- Sparse data is acceptable
|
||||
- Features are independent
|
||||
|
||||
**Trade-offs:**
|
||||
- Introduces zeros/missing values
|
||||
- Larger result size
|
||||
- May need imputation
|
||||
|
||||
## Merge Strategies
|
||||
|
||||
Merge strategies control how elements on the non-concatenation axis are combined.
|
||||
|
||||
### merge=None (Default)
|
||||
|
||||
Excludes all non-concatenation axis elements.
|
||||
|
||||
```python
|
||||
# Both datasets have var annotations
|
||||
adata1.var['gene_type'] = ['protein_coding'] * 2000
|
||||
adata2.var['gene_type'] = ['protein_coding'] * 1800
|
||||
|
||||
# merge=None: var annotations excluded
|
||||
combined = ad.concat([adata1, adata2], merge=None)
|
||||
assert 'gene_type' not in combined.var.columns
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- Annotations are dataset-specific
|
||||
- You'll add new annotations after merging
|
||||
|
||||
### merge='same'
|
||||
|
||||
Keeps only annotations with identical values across datasets.
|
||||
|
||||
```python
# Same annotation values for each gene, derived from the gene name
adata1.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 1000 else 'chr2' for g in adata1.var_names]
adata2.var['chromosome'] = ['chr1' if int(g.split('_')[1]) < 1000 else 'chr2' for g in adata2.var_names]

# merge='same': keeps chromosome annotation because the values agree on the shared genes
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```
|
||||
|
||||
**Use when:**
|
||||
- Annotations should be consistent
|
||||
- You want to validate consistency
|
||||
- Shared metadata is important
|
||||
|
||||
**Note:** Comparison occurs after index alignment - only shared indices need to match.
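
A small illustration of this note, using hypothetical gene names:

```python
import anndata as ad
import numpy as np
import pandas as pd

a1 = ad.AnnData(np.random.rand(5, 3), var=pd.DataFrame({'type': ['a', 'b', 'c']}, index=['g1', 'g2', 'g3']))
a2 = ad.AnnData(np.random.rand(5, 3), var=pd.DataFrame({'type': ['b', 'c', 'd']}, index=['g2', 'g3', 'g4']))

# Inner join keeps only g2 and g3; 'type' agrees on those shared genes, so it survives merge='same'
combined = ad.concat([a1, a2], join='inner', merge='same')
print(combined.var)
```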
|
||||
|
||||
### merge='unique'
|
||||
|
||||
Includes annotations with a single possible value.
|
||||
|
||||
```python
|
||||
# Unique values per gene
|
||||
adata1.var['ensembl_id'] = [f'ENSG{int(g.split("_")[1]):08d}' for g in adata1.var_names]
adata2.var['ensembl_id'] = [f'ENSG{int(g.split("_")[1]):08d}' for g in adata2.var_names]
|
||||
|
||||
# merge='unique': keeps ensembl_id
|
||||
combined = ad.concat([adata1, adata2], merge='unique')
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- Each feature has a unique identifier
|
||||
- Annotations are feature-specific
|
||||
|
||||
### merge='first'
|
||||
|
||||
Takes the first occurrence of each annotation.
|
||||
|
||||
```python
|
||||
# Different annotation versions
|
||||
adata1.var['description'] = ['desc1'] * adata1.n_vars
adata2.var['description'] = ['desc2'] * adata2.n_vars
|
||||
|
||||
# merge='first': uses adata1's descriptions
|
||||
combined = ad.concat([adata1, adata2], merge='first')
|
||||
# Uses descriptions from adata1
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- One dataset has authoritative annotations
|
||||
- Order matters
|
||||
- You need a simple resolution strategy
|
||||
|
||||
### merge='only'
|
||||
|
||||
Retains annotations appearing in exactly one object.
|
||||
|
||||
```python
|
||||
# Dataset-specific annotations
|
||||
adata1.var['dataset1_specific'] = ['value'] * adata1.n_vars
adata2.var['dataset2_specific'] = ['value'] * adata2.n_vars
|
||||
|
||||
# merge='only': keeps both (no conflicts)
|
||||
combined = ad.concat([adata1, adata2], merge='only')
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
- Datasets have non-overlapping annotations
|
||||
- You want to preserve all unique metadata
|
||||
|
||||
## Source Tracking
|
||||
|
||||
### Using label
|
||||
|
||||
Add a categorical column to track data origin.
|
||||
|
||||
```python
|
||||
combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='batch'
|
||||
)
|
||||
|
||||
# Creates obs['batch'] with values 0, 1, 2
|
||||
print(combined.obs['batch'].cat.categories) # ['0', '1', '2']
|
||||
```
|
||||
|
||||
### Using keys
|
||||
|
||||
Provide custom names for source tracking.
|
||||
|
||||
```python
|
||||
combined = ad.concat(
|
||||
[adata1, adata2, adata3],
|
||||
label='study',
|
||||
keys=['control', 'treatment_a', 'treatment_b']
|
||||
)
|
||||
|
||||
# Creates obs['study'] with custom names
|
||||
print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b']
|
||||
```
|
||||
|
||||
### Making Indices Unique
|
||||
|
||||
Append source identifiers to duplicate observation names.
|
||||
|
||||
```python
|
||||
# Both datasets have cells named "Cell_0", "Cell_1", etc.
|
||||
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
|
||||
adata2.obs_names = [f'Cell_{i}' for i in range(150)]
|
||||
|
||||
# index_unique adds suffix
|
||||
combined = ad.concat(
|
||||
[adata1, adata2],
|
||||
keys=['batch1', 'batch2'],
|
||||
index_unique='-'
|
||||
)
|
||||
|
||||
# Results in: Cell_0-batch1, Cell_0-batch2, etc.
|
||||
print(combined.obs_names[:5])
|
||||
```
|
||||
|
||||
## Handling Different Attributes
|
||||
|
||||
### X Matrix and Layers
|
||||
|
||||
Follows join strategy. Missing values filled according to `fill_value`.
|
||||
|
||||
```python
|
||||
# Both have layers
|
||||
adata1.layers['counts'] = adata1.X.copy()
|
||||
adata2.layers['counts'] = adata2.X.copy()
|
||||
|
||||
# Concatenates both X and layers
|
||||
combined = ad.concat([adata1, adata2])
|
||||
assert 'counts' in combined.layers
|
||||
```
|
||||
|
||||
### obs and var DataFrames
|
||||
|
||||
- **obs**: Concatenated along concatenation axis
|
||||
- **var**: Handled by merge strategy
|
||||
|
||||
```python
|
||||
adata1.obs['cell_type'] = ['B cell'] * 100
|
||||
adata2.obs['cell_type'] = ['T cell'] * 150
|
||||
|
||||
combined = ad.concat([adata1, adata2])
|
||||
# obs['cell_type'] preserved for all cells
|
||||
```
|
||||
|
||||
### obsm and varm
|
||||
|
||||
Multi-dimensional annotations follow same rules as layers.
|
||||
|
||||
```python
|
||||
adata1.obsm['X_pca'] = np.random.rand(100, 50)
|
||||
adata2.obsm['X_pca'] = np.random.rand(150, 50)
|
||||
|
||||
combined = ad.concat([adata1, adata2])
|
||||
# obsm['X_pca'] concatenated: shape (250, 50)
|
||||
```
|
||||
|
||||
### obsp and varp
|
||||
|
||||
Pairwise matrices excluded by default. Enable with `pairwise=True`.
|
||||
|
||||
```python
|
||||
# Distance matrices
|
||||
adata1.obsp['distances'] = np.random.rand(100, 100)
|
||||
adata2.obsp['distances'] = np.random.rand(150, 150)
|
||||
|
||||
# Excluded by default
|
||||
combined = ad.concat([adata1, adata2])
|
||||
assert 'distances' not in combined.obsp
|
||||
|
||||
# Include if needed
|
||||
combined = ad.concat([adata1, adata2], pairwise=True)
|
||||
# Results in padded block diagonal matrix
|
||||
```
|
||||
|
||||
### uns Dictionary
|
||||
|
||||
Merged recursively, applying merge strategy at any nesting depth.
|
||||
|
||||
```python
|
||||
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
|
||||
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}
|
||||
|
||||
# merge='same' keeps 'lab', excludes 'date'
|
||||
combined = ad.concat([adata1, adata2], merge='same')
|
||||
# combined.uns['experiment'] = {'lab': 'A'}
|
||||
```
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Batch Integration Pipeline
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
|
||||
# Load batches
|
||||
batches = [
|
||||
ad.read_h5ad(f'batch_{i}.h5ad')
|
||||
for i in range(5)
|
||||
]
|
||||
|
||||
# Concatenate with tracking
|
||||
combined = ad.concat(
|
||||
batches,
|
||||
axis=0,
|
||||
join='outer',
|
||||
merge='first',
|
||||
label='batch_id',
|
||||
keys=[f'batch_{i}' for i in range(5)],
|
||||
index_unique='-'
|
||||
)
|
||||
|
||||
# Add batch effects
|
||||
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
|
||||
```
|
||||
|
||||
### Multi-Study Meta-Analysis
|
||||
|
||||
```python
|
||||
# Different studies with varying gene coverage
|
||||
studies = {
|
||||
'study_a': ad.read_h5ad('study_a.h5ad'),
|
||||
'study_b': ad.read_h5ad('study_b.h5ad'),
|
||||
'study_c': ad.read_h5ad('study_c.h5ad')
|
||||
}
|
||||
|
||||
# Outer join to keep all genes
|
||||
combined = ad.concat(
|
||||
list(studies.values()),
|
||||
axis=0,
|
||||
join='outer',
|
||||
label='study',
|
||||
keys=list(studies.keys()),
|
||||
merge='unique',
|
||||
fill_value=0
|
||||
)
|
||||
|
||||
# Track coverage
|
||||
for study in studies:
|
||||
n_genes = studies[study].n_vars
|
||||
combined.uns[f'{study}_n_genes'] = n_genes
|
||||
```
|
||||
|
||||
### Incremental Concatenation
|
||||
|
||||
```python
|
||||
# For many datasets, concatenate in batches
|
||||
chunk_size = 10
|
||||
all_files = [f'dataset_{i}.h5ad' for i in range(100)]
|
||||
|
||||
# Process in chunks
|
||||
result = None
|
||||
for i in range(0, len(all_files), chunk_size):
|
||||
chunk_files = all_files[i:i+chunk_size]
|
||||
chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
|
||||
chunk_combined = ad.concat(chunk_adatas)
|
||||
|
||||
if result is None:
|
||||
result = chunk_combined
|
||||
else:
|
||||
result = ad.concat([result, chunk_combined])
|
||||
```
|
||||
|
||||
### Memory-Efficient On-Disk Concatenation
|
||||
|
||||
```python
|
||||
# Experimental feature for large datasets
|
||||
from anndata.experimental import concat_on_disk
|
||||
|
||||
files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
|
||||
concat_on_disk(
|
||||
files,
|
||||
'combined.h5ad',
|
||||
join='outer'
|
||||
)
|
||||
|
||||
# Read result in backed mode
|
||||
combined = ad.read_h5ad('combined.h5ad', backed='r')
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Dimension Mismatch
|
||||
|
||||
```python
|
||||
# Error: shapes don't match
|
||||
adata1 = ad.AnnData(np.random.rand(100, 2000))
|
||||
adata2 = ad.AnnData(np.random.rand(150, 1500))
|
||||
|
||||
# Solution: use outer join
|
||||
combined = ad.concat([adata1, adata2], join='outer')
|
||||
```
|
||||
|
||||
### Issue: Memory Error
|
||||
|
||||
```python
|
||||
# Problem: too many large objects in memory
|
||||
large_adatas = [ad.read_h5ad(f) for f in many_files]
|
||||
|
||||
# Solution: read and concatenate incrementally
|
||||
result = None
|
||||
for file in many_files:
|
||||
adata = ad.read_h5ad(file)
|
||||
if result is None:
|
||||
result = adata
|
||||
else:
|
||||
result = ad.concat([result, adata])
|
||||
del adata # Free memory
|
||||
```
|
||||
|
||||
### Issue: Duplicate Indices
|
||||
|
||||
```python
|
||||
# Problem: same cell names in different batches
|
||||
# Solution: use index_unique
|
||||
combined = ad.concat(
|
||||
[adata1, adata2],
|
||||
keys=['batch1', 'batch2'],
|
||||
index_unique='-'
|
||||
)
|
||||
```
|
||||
|
||||
### Issue: Lost Annotations
|
||||
|
||||
```python
|
||||
# Problem: annotations disappear
|
||||
adata1.var['important'] = values1
|
||||
adata2.var['important'] = values2
|
||||
|
||||
combined = ad.concat([adata1, adata2]) # merge=None by default
|
||||
# Solution: use appropriate merge strategy
|
||||
combined = ad.concat([adata1, adata2], merge='first')
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Pre-align indices**: Ensure consistent naming before concatenation
|
||||
2. **Use sparse matrices**: Convert to sparse before concatenating
|
||||
3. **Batch operations**: Concatenate in groups for many datasets
|
||||
4. **Choose inner join**: When possible, to reduce result size
|
||||
5. **Use categoricals**: Convert string annotations before concatenating
|
||||
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
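
A short sketch combining tips 2 and 5, assuming the objects to concatenate are already in memory:

```python
from scipy.sparse import csr_matrix, issparse

# Convert to sparse and categoricals before concatenating
for a in (adata1, adata2):
    if not issparse(a.X):
        a.X = csr_matrix(a.X)
    a.strings_to_categoricals()

combined = ad.concat([adata1, adata2], join='inner')
```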
|
||||
scientific-packages/anndata/references/data_structure.md (new file, 314 lines)
|
||||
# AnnData Object Structure
|
||||
|
||||
The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata.
|
||||
|
||||
## Core Components
|
||||
|
||||
### X (Data Matrix)
|
||||
The primary data matrix with shape (n_obs, n_vars) storing experimental measurements.
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
|
||||
# Create with dense array
|
||||
adata = ad.AnnData(X=np.random.rand(100, 2000))
|
||||
|
||||
# Create with sparse matrix (recommended for large, sparse data)
|
||||
from scipy.sparse import csr_matrix
|
||||
sparse_data = csr_matrix(np.random.rand(100, 2000))
|
||||
adata = ad.AnnData(X=sparse_data)
|
||||
```
|
||||
|
||||
Access data:
|
||||
```python
|
||||
# Full matrix (caution with large datasets)
|
||||
full_data = adata.X
|
||||
|
||||
# Single observation
|
||||
obs_data = adata.X[0, :]
|
||||
|
||||
# Single variable across all observations
|
||||
var_data = adata.X[:, 0]
|
||||
```
|
||||
|
||||
### obs (Observation Annotations)
|
||||
DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X.
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create AnnData with observation metadata
|
||||
obs_df = pd.DataFrame({
|
||||
'cell_type': ['T cell', 'B cell', 'Monocyte'],
|
||||
'treatment': ['control', 'treated', 'control'],
|
||||
'timepoint': [0, 24, 24]
|
||||
}, index=['cell_1', 'cell_2', 'cell_3'])
|
||||
|
||||
adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)
|
||||
|
||||
# Access observation metadata
|
||||
print(adata.obs['cell_type'])
|
||||
print(adata.obs.loc['cell_1'])
|
||||
```
|
||||
|
||||
### var (Variable Annotations)
|
||||
DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X.
|
||||
|
||||
```python
|
||||
# Create AnnData with variable metadata
|
||||
var_df = pd.DataFrame({
|
||||
'gene_name': ['ACTB', 'GAPDH', 'TP53'],
|
||||
'chromosome': ['7', '12', '17'],
|
||||
'highly_variable': [True, False, True]
|
||||
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])
|
||||
|
||||
adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)
|
||||
|
||||
# Access variable metadata
|
||||
print(adata.var['gene_name'])
|
||||
print(adata.var.loc['ENSG00001'])
|
||||
```
|
||||
|
||||
### layers (Alternative Data Representations)
|
||||
Dictionary storing alternative matrices with the same dimensions as X.
|
||||
|
||||
```python
|
||||
# Store raw counts, normalized data, and scaled data
|
||||
adata = ad.AnnData(X=np.random.rand(100, 2000))
|
||||
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
|
||||
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
|
||||
adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std()
|
||||
|
||||
# Access layers
|
||||
raw_data = adata.layers['raw_counts']
|
||||
normalized_data = adata.layers['normalized']
|
||||
```
|
||||
|
||||
Common layer uses:
|
||||
- `raw_counts`: Original count data before normalization
|
||||
- `normalized`: Log-normalized or TPM values
|
||||
- `scaled`: Z-scored values for analysis
|
||||
- `imputed`: Data after imputation
|
||||
|
||||
### obsm (Multi-dimensional Observation Annotations)
|
||||
Dictionary storing multi-dimensional arrays aligned to observations.
|
||||
|
||||
```python
|
||||
# Store PCA coordinates and UMAP embeddings
|
||||
adata.obsm['X_pca'] = np.random.rand(100, 50) # 50 principal components
|
||||
adata.obsm['X_umap'] = np.random.rand(100, 2) # 2D UMAP coordinates
|
||||
adata.obsm['X_tsne'] = np.random.rand(100, 2) # 2D t-SNE coordinates
|
||||
|
||||
# Access embeddings
|
||||
pca_coords = adata.obsm['X_pca']
|
||||
umap_coords = adata.obsm['X_umap']
|
||||
```
|
||||
|
||||
Common obsm uses:
|
||||
- `X_pca`: Principal component coordinates
|
||||
- `X_umap`: UMAP embedding coordinates
|
||||
- `X_tsne`: t-SNE embedding coordinates
|
||||
- `X_diffmap`: Diffusion map coordinates
|
||||
- `protein_expression`: Protein abundance measurements (CITE-seq)
|
||||
|
||||
### varm (Multi-dimensional Variable Annotations)
|
||||
Dictionary storing multi-dimensional arrays aligned to variables.
|
||||
|
||||
```python
|
||||
# Store PCA loadings
|
||||
adata.varm['PCs'] = np.random.rand(2000, 50) # Loadings for 50 components
|
||||
adata.varm['gene_modules'] = np.random.rand(2000, 10) # Gene module scores
|
||||
|
||||
# Access loadings
|
||||
pc_loadings = adata.varm['PCs']
|
||||
```
|
||||
|
||||
Common varm uses:
|
||||
- `PCs`: Principal component loadings
|
||||
- `gene_modules`: Gene co-expression module assignments
|
||||
|
||||
### obsp (Pairwise Observation Relationships)
|
||||
Dictionary storing sparse matrices representing relationships between observations.
|
||||
|
||||
```python
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Store k-nearest neighbor graph
|
||||
n_obs = 100
|
||||
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
|
||||
adata.obsp['connectivities'] = knn_graph
|
||||
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
|
||||
|
||||
# Access graphs
|
||||
knn_connections = adata.obsp['connectivities']
|
||||
distances = adata.obsp['distances']
|
||||
```
|
||||
|
||||
Common obsp uses:
|
||||
- `connectivities`: Cell-cell neighborhood graph
|
||||
- `distances`: Pairwise distances between cells
|
||||
|
||||
### varp (Pairwise Variable Relationships)
|
||||
Dictionary storing sparse matrices representing relationships between variables.
|
||||
|
||||
```python
|
||||
# Store gene-gene correlation matrix
|
||||
n_vars = 2000
|
||||
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
|
||||
adata.varp['correlations'] = gene_corr
|
||||
|
||||
# Access correlations
|
||||
gene_correlations = adata.varp['correlations']
|
||||
```
|
||||
|
||||
### uns (Unstructured Annotations)
|
||||
Dictionary storing arbitrary unstructured metadata.
|
||||
|
||||
```python
|
||||
# Store analysis parameters and results
|
||||
adata.uns['experiment_date'] = '2025-11-03'
|
||||
adata.uns['pca'] = {
|
||||
'variance_ratio': [0.15, 0.10, 0.08],
|
||||
'params': {'n_comps': 50}
|
||||
}
|
||||
adata.uns['neighbors'] = {
|
||||
'params': {'n_neighbors': 15, 'method': 'umap'},
|
||||
'connectivities_key': 'connectivities'
|
||||
}
|
||||
|
||||
# Access unstructured data
|
||||
exp_date = adata.uns['experiment_date']
|
||||
pca_params = adata.uns['pca']['params']
|
||||
```
|
||||
|
||||
Common uns uses:
|
||||
- Analysis parameters and settings
|
||||
- Color palettes for plotting
|
||||
- Cluster information
|
||||
- Tool-specific metadata
|
||||
|
||||
### raw (Original Data Snapshot)
|
||||
Optional attribute preserving the original data matrix and variable annotations before filtering.
|
||||
|
||||
```python
|
||||
# Create AnnData and store raw state
|
||||
adata = ad.AnnData(X=np.random.rand(100, 5000))
|
||||
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]
|
||||
|
||||
# Store raw state before filtering
|
||||
adata.raw = adata.copy()
|
||||
|
||||
# Filter to highly variable genes
|
||||
highly_variable_mask = np.random.rand(5000) > 0.5
|
||||
adata = adata[:, highly_variable_mask]
|
||||
|
||||
# Access original data
|
||||
original_matrix = adata.raw.X
|
||||
original_var = adata.raw.var
|
||||
```
|
||||
|
||||
## Object Properties
|
||||
|
||||
```python
|
||||
# Dimensions
|
||||
n_observations = adata.n_obs
|
||||
n_variables = adata.n_vars
|
||||
shape = adata.shape # (n_obs, n_vars)
|
||||
|
||||
# Index information
|
||||
obs_names = adata.obs_names # Observation identifiers
|
||||
var_names = adata.var_names # Variable identifiers
|
||||
|
||||
# Storage mode
|
||||
is_view = adata.is_view # True if this is a view of another object
|
||||
is_backed = adata.isbacked # True if backed by on-disk storage
|
||||
filename = adata.filename # Path to backing file (if backed)
|
||||
```
|
||||
|
||||
## Creating AnnData Objects
|
||||
|
||||
### From arrays and DataFrames
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
# Minimal creation
|
||||
X = np.random.rand(100, 2000)
|
||||
adata = ad.AnnData(X)
|
||||
|
||||
# With metadata
|
||||
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
|
||||
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
|
||||
adata = ad.AnnData(X=X, obs=obs, var=var)
|
||||
|
||||
# With all components
|
||||
adata = ad.AnnData(
|
||||
X=X,
|
||||
obs=obs,
|
||||
var=var,
|
||||
layers={'raw': np.random.randint(0, 100, (100, 2000))},
|
||||
obsm={'X_pca': np.random.rand(100, 50)},
|
||||
uns={'experiment': 'test'}
|
||||
)
|
||||
```
|
||||
|
||||
### From DataFrame
|
||||
```python
|
||||
# Create from pandas DataFrame (genes as columns, cells as rows)
|
||||
df = pd.DataFrame(
|
||||
np.random.rand(100, 50),
|
||||
columns=[f'Gene_{i}' for i in range(50)],
|
||||
index=[f'Cell_{i}' for i in range(100)]
|
||||
)
|
||||
adata = ad.AnnData(df)
|
||||
```
|
||||
|
||||
## Data Access Patterns
|
||||
|
||||
### Vector extraction
|
||||
```python
|
||||
# Get observation annotation as array
|
||||
cell_types = adata.obs_vector('cell_type')
|
||||
|
||||
# Get variable values across observations
|
||||
gene_expression = adata.obs_vector('ACTB') # If ACTB is in var_names
|
||||
|
||||
# Get variable annotation as array
|
||||
gene_names = adata.var_vector('gene_name')
|
||||
```
|
||||
|
||||
### Subsetting
|
||||
```python
|
||||
# By index
|
||||
subset = adata[0:10, 0:100] # First 10 obs, first 100 vars
|
||||
|
||||
# By name
|
||||
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]
|
||||
|
||||
# By boolean mask
|
||||
high_count_cells = adata.obs['total_counts'] > 1000
|
||||
subset = adata[high_count_cells, :]
|
||||
|
||||
# By observation metadata
|
||||
t_cells = adata[adata.obs['cell_type'] == 'T cell']
|
||||
```
|
||||
|
||||
## Memory Considerations
|
||||
|
||||
The AnnData structure is designed for memory efficiency:
|
||||
- Sparse matrices reduce memory for sparse data
|
||||
- Views avoid copying data when possible
|
||||
- Backed mode enables working with data larger than RAM
|
||||
- Categorical annotations reduce memory for discrete values
|
||||
|
||||
```python
|
||||
# Convert strings to categoricals (more memory efficient)
|
||||
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
# Check if object is a view (doesn't own data)
|
||||
if adata.is_view:
|
||||
adata = adata.copy() # Create independent copy
|
||||
```
|
||||
scientific-packages/anndata/references/io_operations.md (new file, 404 lines)
|
||||
# Input/Output Operations
|
||||
|
||||
AnnData provides comprehensive I/O functionality for reading and writing data in various formats.
|
||||
|
||||
## Native Formats
|
||||
|
||||
### H5AD (HDF5-based)
|
||||
The recommended native format for AnnData objects, providing efficient storage and fast access.
|
||||
|
||||
#### Writing H5AD files
|
||||
```python
|
||||
import anndata as ad
|
||||
|
||||
# Write to file
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# Write with compression
|
||||
adata.write_h5ad('data.h5ad', compression='gzip')
|
||||
|
||||
# Write with specific compression level (0-9, higher = more compression)
|
||||
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
|
||||
```
|
||||
|
||||
#### Reading H5AD files
|
||||
```python
|
||||
# Read entire file into memory
|
||||
adata = ad.read_h5ad('data.h5ad')
|
||||
|
||||
# Read in backed mode (lazy loading for large files)
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r') # Read-only
|
||||
adata = ad.read_h5ad('data.h5ad', backed='r+') # Read-write
|
||||
|
||||
# Backed mode enables working with datasets larger than RAM
|
||||
# Only accessed data is loaded into memory
|
||||
```
|
||||
|
||||
#### Backed mode operations
|
||||
```python
|
||||
# Open in backed mode
|
||||
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
|
||||
|
||||
# Access metadata without loading X into memory
|
||||
print(adata.obs.head())
|
||||
print(adata.var.head())
|
||||
|
||||
# Subset operations create views
|
||||
subset = adata[:100, :500] # View, no data loaded
|
||||
|
||||
# Load specific data into memory
|
||||
X_subset = subset.X[:] # Now loads this subset
|
||||
|
||||
# Convert entire backed object to memory
|
||||
adata_memory = adata.to_memory()
|
||||
```
|
||||
|
||||
### Zarr
|
||||
Hierarchical array storage format, optimized for cloud storage and parallel I/O.
|
||||
|
||||
#### Writing Zarr
|
||||
```python
|
||||
# Write to Zarr store
|
||||
adata.write_zarr('data.zarr')
|
||||
|
||||
# Write with specific chunks (important for performance)
|
||||
adata.write_zarr('data.zarr', chunks=(100, 100))
|
||||
```
|
||||
|
||||
#### Reading Zarr
|
||||
```python
|
||||
# Read Zarr store
|
||||
adata = ad.read_zarr('data.zarr')
|
||||
```
|
||||
|
||||
#### Remote Zarr access
|
||||
```python
|
||||
import fsspec
|
||||
|
||||
# Access Zarr from S3
|
||||
store = fsspec.get_mapper('s3://bucket-name/data.zarr')
|
||||
adata = ad.read_zarr(store)
|
||||
|
||||
# Access Zarr from URL
|
||||
store = fsspec.get_mapper('https://example.com/data.zarr')
|
||||
adata = ad.read_zarr(store)
|
||||
```
|
||||
|
||||
## Alternative Input Formats
|
||||
|
||||
### CSV/TSV
|
||||
```python
|
||||
# Read CSV (genes as columns, cells as rows)
|
||||
adata = ad.read_csv('data.csv')
|
||||
|
||||
# Read with custom delimiter
|
||||
adata = ad.read_csv('data.tsv', delimiter='\t')
|
||||
|
||||
# Specify that first column is row names
|
||||
adata = ad.read_csv('data.csv', first_column_names=True)
|
||||
```
|
||||
|
||||
### Excel
|
||||
```python
# Read an Excel sheet (the sheet name or index is required)
adata = ad.read_excel('data.xlsx', sheet='Sheet1')
```
|
||||
|
||||
### Matrix Market (MTX)
|
||||
Common format for sparse matrices in genomics.
|
||||
|
||||
```python
import pandas as pd

# Read MTX (loads only the matrix; genes.tsv and barcodes.tsv must be attached manually)
adata = ad.read_mtx('matrix.mtx')

# Transpose if needed (MTX often has genes as rows)
adata = adata.T

# Attach gene and barcode names from the companion files
adata.var_names = pd.read_csv('genes.tsv', header=None, sep='\t')[0]
adata.obs_names = pd.read_csv('barcodes.tsv', header=None, sep='\t')[0]
```
|
||||
|
||||
### 10X Genomics formats

These convenience readers are provided by Scanpy rather than by AnnData itself:

```python
import scanpy as sc

# Read 10X h5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read 10X MTX directory
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# Specify genome if multiple present
adata = sc.read_10x_h5('data.h5', genome='GRCh38')
```
|
||||
|
||||
### Loom
|
||||
```python
|
||||
# Read Loom file
|
||||
adata = ad.read_loom('data.loom')
|
||||
|
||||
# Read with specific observation and variable annotations
|
||||
adata = ad.read_loom(
|
||||
'data.loom',
|
||||
obs_names='CellID',
|
||||
var_names='Gene'
|
||||
)
|
||||
```
|
||||
|
||||
### Text files
|
||||
```python
|
||||
# Read generic text file
|
||||
adata = ad.read_text('data.txt', delimiter='\t')
|
||||
|
||||
# Read with custom parameters
|
||||
adata = ad.read_text(
|
||||
'data.txt',
|
||||
delimiter=',',
|
||||
first_column_names=True,
|
||||
dtype='float32'
|
||||
)
|
||||
```
|
||||
|
||||
### UMI tools
|
||||
```python
|
||||
# Read UMI tools format
|
||||
adata = ad.read_umi_tools('counts.tsv')
|
||||
```
|
||||
|
||||
### HDF5 (generic)
|
||||
```python
|
||||
# Read from HDF5 file (not h5ad format)
|
||||
adata = ad.read_hdf('data.h5', key='dataset')
|
||||
```
|
||||
|
||||
## Alternative Output Formats
|
||||
|
||||
### CSV
|
||||
```python
|
||||
# Write to CSV files (creates multiple files)
|
||||
adata.write_csvs('output_dir/')
|
||||
|
||||
# This creates:
|
||||
# - output_dir/X.csv (expression matrix)
|
||||
# - output_dir/obs.csv (observation annotations)
|
||||
# - output_dir/var.csv (variable annotations)
|
||||
# - output_dir/uns.csv (unstructured annotations, if possible)
|
||||
|
||||
# Skip certain components
|
||||
adata.write_csvs('output_dir/', skip_data=True) # Skip X matrix
|
||||
```
|
||||
|
||||
### Loom
|
||||
```python
|
||||
# Write to Loom format
|
||||
adata.write_loom('output.loom')
|
||||
```
|
||||
|
||||
## Reading Specific Elements
|
||||
|
||||
For fine-grained control, read specific elements from storage:
|
||||
|
||||
```python
import h5py
from anndata.experimental import read_elem  # exposed as anndata.io.read_elem in newer releases

# read_elem takes an open HDF5/Zarr group or dataset, not a path string
with h5py.File('data.h5ad', 'r') as f:
    # Read just observation annotations
    obs = read_elem(f['obs'])

    # Read specific layer
    layer = read_elem(f['layers/normalized'])

    # Read unstructured data element
    params = read_elem(f['uns/pca_params'])
```
|
||||
|
||||
## Writing Specific Elements
|
||||
|
||||
```python
|
||||
from anndata import write_elem
|
||||
import h5py
|
||||
|
||||
# Write element to existing file
|
||||
with h5py.File('data.h5ad', 'a') as f:
|
||||
write_elem(f, 'new_layer', adata.X.copy())
|
||||
```
|
||||
|
||||
## Lazy Operations
|
||||
|
||||
For very large datasets, use lazy reading to avoid loading entire datasets:
|
||||
|
||||
```python
import h5py
from anndata.experimental import read_elem_lazy

# Lazy read of an on-disk element (returns a dask array)
f = h5py.File('large_data.h5ad', 'r')
X_lazy = read_elem_lazy(f['X'])

# Compute only when needed
subset = X_lazy[:100, :100].compute()
```
|
||||
|
||||
## Common I/O Patterns
|
||||
|
||||
### Convert between formats
|
||||
```python
|
||||
# MTX to H5AD
|
||||
adata = ad.read_mtx('matrix.mtx').T
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# CSV to H5AD
|
||||
adata = ad.read_csv('data.csv')
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# H5AD to Zarr
|
||||
adata = ad.read_h5ad('data.h5ad')
|
||||
adata.write_zarr('data.zarr')
|
||||
```
|
||||
|
||||
### Load metadata without data
|
||||
```python
|
||||
# Backed mode allows inspecting metadata without loading X
|
||||
adata = ad.read_h5ad('large_file.h5ad', backed='r')
|
||||
print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables")
|
||||
print(adata.obs.columns)
|
||||
print(adata.var.columns)
|
||||
# X is not loaded into memory
|
||||
```
|
||||
|
||||
### Append to existing file
|
||||
```python
# Open in read-write mode
adata = ad.read_h5ad('data.h5ad', backed='r+')

# Modify metadata in memory
adata.obs['new_column'] = ['annotated'] * adata.n_obs

# Persist the changes back to the backing file
adata.write()
```
|
||||
|
||||
### Download from URL
|
||||
```python
import urllib.request
import anndata as ad

# read_h5ad expects a local path, so download remote .h5ad files first
url = 'https://example.com/data.h5ad'
urllib.request.urlretrieve(url, 'local_file.h5ad')
adata = ad.read_h5ad('local_file.h5ad')

# For streaming access to remote data, use a Zarr store (see "Remote Zarr access" above)
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
### Reading
|
||||
- Use `backed='r'` for large files you only need to query
|
||||
- Use `backed='r+'` if you need to modify metadata without loading all data
|
||||
- H5AD format is generally fastest for random access
|
||||
- Zarr is better for cloud storage and parallel access
|
||||
- Consider compression for storage, but note it may slow down reading
|
||||
|
||||
### Writing
|
||||
- Use compression for long-term storage: `compression='gzip'` or `compression='lzf'`
|
||||
- LZF compression is faster but compresses less than GZIP
|
||||
- For Zarr, tune chunk sizes based on access patterns:
|
||||
- Larger chunks for sequential reads
|
||||
- Smaller chunks for random access
|
||||
- Convert string columns to categorical before writing (smaller files)
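
The compression trade-off in practice (a small sketch; the file names are illustrative):

```python
# Heavier compression for archival copies
adata.write_h5ad('archive.h5ad', compression='gzip', compression_opts=9)

# Faster, lighter compression for working files
adata.write_h5ad('working.h5ad', compression='lzf')
```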
|
||||
|
||||
### Memory management
|
||||
```python
|
||||
# Convert strings to categoricals (reduces file size and memory)
|
||||
adata.strings_to_categoricals()
|
||||
adata.write_h5ad('data.h5ad')
|
||||
|
||||
# Use sparse matrices for sparse data
|
||||
from scipy.sparse import csr_matrix
|
||||
if isinstance(adata.X, np.ndarray):
|
||||
density = np.count_nonzero(adata.X) / adata.X.size
|
||||
if density < 0.5: # If more than 50% zeros
|
||||
adata.X = csr_matrix(adata.X)
|
||||
```
|
||||
|
||||
## Handling Large Datasets
|
||||
|
||||
### Strategy 1: Backed mode
|
||||
```python
|
||||
# Work with dataset larger than RAM
|
||||
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
|
||||
|
||||
# Filter based on metadata (fast, no data loading)
|
||||
filtered = adata[adata.obs['quality_score'] > 0.8]
|
||||
|
||||
# Load filtered subset into memory
|
||||
adata_memory = filtered.to_memory()
|
||||
```
|
||||
|
||||
### Strategy 2: Chunked processing
|
||||
```python
|
||||
# Process data in chunks
|
||||
adata = ad.read_h5ad('large_file.h5ad', backed='r')
|
||||
|
||||
chunk_size = 1000
|
||||
results = []
|
||||
|
||||
for i in range(0, adata.n_obs, chunk_size):
|
||||
chunk = adata[i:i+chunk_size, :].to_memory()
|
||||
# Process chunk
|
||||
result = process(chunk)
|
||||
results.append(result)
|
||||
```
|
||||
|
||||
### Strategy 3: Use AnnCollection
|
||||
```python
import anndata as ad
from anndata.experimental import AnnCollection

# Open each file in backed mode, then build the collection (X stays on disk)
adatas = [ad.read_h5ad(f'dataset_{i}.h5ad', backed='r') for i in range(10)]
collection = AnnCollection(
    adatas,
    join_obs='inner',
    join_vars='inner'
)

# Process collection lazily
# Data is loaded only when accessed
```
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Issue: Out of memory when reading
|
||||
**Solution**: Use backed mode or read in chunks
|
||||
```python
|
||||
adata = ad.read_h5ad('file.h5ad', backed='r')
|
||||
```
|
||||
|
||||
### Issue: Slow reading from cloud storage
|
||||
**Solution**: Use Zarr format with appropriate chunking
|
||||
```python
|
||||
adata.write_zarr('data.zarr', chunks=(1000, 1000))
|
||||
```
|
||||
|
||||
### Issue: Large file sizes
|
||||
**Solution**: Use compression and convert to sparse/categorical
|
||||
```python
|
||||
adata.strings_to_categoricals()
|
||||
from scipy.sparse import csr_matrix
|
||||
adata.X = csr_matrix(adata.X)
|
||||
adata.write_h5ad('compressed.h5ad', compression='gzip')
|
||||
```
|
||||
|
||||
### Issue: Cannot modify backed object
|
||||
**Solution**: Either load to memory or open in 'r+' mode
|
||||
```python
|
||||
# Option 1: Load to memory
|
||||
adata = adata.to_memory()
|
||||
|
||||
# Option 2: Open in read-write mode
|
||||
adata = ad.read_h5ad('file.h5ad', backed='r+')
|
||||
```
|
||||
scientific-packages/anndata/references/manipulation.md (new file, 516 lines)
|
||||
# Data Manipulation
|
||||
|
||||
Operations for transforming, subsetting, and manipulating AnnData objects.
|
||||
|
||||
## Subsetting
|
||||
|
||||
### By indices
|
||||
```python
|
||||
import anndata as ad
|
||||
import numpy as np
|
||||
|
||||
adata = ad.AnnData(X=np.random.rand(1000, 2000))
|
||||
|
||||
# Integer indices
|
||||
subset = adata[0:100, 0:500] # First 100 obs, first 500 vars
|
||||
|
||||
# List of indices
|
||||
obs_indices = [0, 10, 20, 30, 40]
|
||||
var_indices = [0, 1, 2, 3, 4]
|
||||
subset = adata[obs_indices, var_indices]
|
||||
|
||||
# Single observation or variable
|
||||
single_obs = adata[0, :]
|
||||
single_var = adata[:, 0]
|
||||
```
|
||||
|
||||
### By names
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create with named indices
|
||||
obs_names = [f'cell_{i}' for i in range(1000)]
|
||||
var_names = [f'gene_{i}' for i in range(2000)]
|
||||
adata = ad.AnnData(
|
||||
X=np.random.rand(1000, 2000),
|
||||
obs=pd.DataFrame(index=obs_names),
|
||||
var=pd.DataFrame(index=var_names)
|
||||
)
|
||||
|
||||
# Subset by observation names
|
||||
subset = adata[['cell_0', 'cell_1', 'cell_2'], :]
|
||||
|
||||
# Subset by variable names
|
||||
subset = adata[:, ['gene_0', 'gene_10', 'gene_20']]
|
||||
|
||||
# Both axes
|
||||
subset = adata[['cell_0', 'cell_1'], ['gene_0', 'gene_1']]
|
||||
```
|
||||
|
||||
### By boolean masks
|
||||
```python
|
||||
# Create boolean masks
|
||||
high_count_obs = np.random.rand(1000) > 0.5
|
||||
high_var_genes = np.random.rand(2000) > 0.7
|
||||
|
||||
# Subset using masks
|
||||
subset = adata[high_count_obs, :]
|
||||
subset = adata[:, high_var_genes]
|
||||
subset = adata[high_count_obs, high_var_genes]
|
||||
```
|
||||
|
||||
### By metadata conditions
|
||||
```python
|
||||
# Add metadata
|
||||
adata.obs['cell_type'] = np.random.choice(['A', 'B', 'C'], 1000)
|
||||
adata.obs['quality_score'] = np.random.rand(1000)
|
||||
adata.var['highly_variable'] = np.random.rand(2000) > 0.8
|
||||
|
||||
# Filter by cell type
|
||||
type_a_cells = adata[adata.obs['cell_type'] == 'A']
|
||||
|
||||
# Filter by multiple conditions
|
||||
high_quality_a_cells = adata[
|
||||
(adata.obs['cell_type'] == 'A') &
|
||||
(adata.obs['quality_score'] > 0.7)
|
||||
]
|
||||
|
||||
# Filter by variable metadata
|
||||
hv_genes = adata[:, adata.var['highly_variable']]
|
||||
|
||||
# Complex conditions
|
||||
filtered = adata[
|
||||
(adata.obs['quality_score'] > 0.5) &
|
||||
(adata.obs['cell_type'].isin(['A', 'B'])),
|
||||
adata.var['highly_variable']
|
||||
]
|
||||
```
|
||||
|
||||
## Transposition
|
||||
|
||||
```python
|
||||
# Transpose AnnData object (swap observations and variables)
|
||||
adata_T = adata.T
|
||||
|
||||
# Shape changes
|
||||
print(adata.shape) # (1000, 2000)
|
||||
print(adata_T.shape) # (2000, 1000)
|
||||
|
||||
# obs and var are swapped
|
||||
print(adata.obs.head()) # Observation metadata
|
||||
print(adata_T.var.head()) # Same data, now as variable metadata
|
||||
|
||||
# Useful when data is in opposite orientation
|
||||
# Common with some file formats where genes are rows
|
||||
```
|
||||
|
||||
## Copying
|
||||
|
||||
### Full copy
|
||||
```python
|
||||
# Create independent copy
|
||||
adata_copy = adata.copy()
|
||||
|
||||
# Modifications to copy don't affect original
|
||||
adata_copy.obs['new_column'] = 1
|
||||
print('new_column' in adata.obs.columns) # False
|
||||
```
|
||||
|
||||
### Shallow copy
|
||||
```python
|
||||
# View (doesn't copy data, modifications affect original)
|
||||
adata_view = adata[0:100, :]
|
||||
|
||||
# Check if object is a view
|
||||
print(adata_view.is_view) # True
|
||||
|
||||
# Convert view to independent copy
|
||||
adata_independent = adata_view.copy()
|
||||
print(adata_independent.is_view) # False
|
||||
```
|
||||
|
||||
## Renaming
|
||||
|
||||
### Rename observations and variables
|
||||
```python
|
||||
# Rename all observations
|
||||
adata.obs_names = [f'new_cell_{i}' for i in range(adata.n_obs)]
|
||||
|
||||
# Rename all variables
|
||||
adata.var_names = [f'new_gene_{i}' for i in range(adata.n_vars)]
|
||||
|
||||
# Make names unique (add suffix to duplicates)
|
||||
adata.obs_names_make_unique()
|
||||
adata.var_names_make_unique()
|
||||
```
|
||||
|
||||
### Rename categories
|
||||
```python
|
||||
# Create categorical column
|
||||
adata.obs['cell_type'] = pd.Categorical(['A', 'B', 'C'] * 333 + ['A'])
|
||||
|
||||
# Rename categories positionally (same order as the existing categories)
adata.rename_categories('cell_type', ['Type_A', 'Type_B', 'Type_C'])

# Or map old names to new names via pandas
adata.obs['cell_type'] = adata.obs['cell_type'].cat.rename_categories({
    'Type_A': 'T_cell',
    'Type_B': 'B_cell',
    'Type_C': 'Monocyte'
})
|
||||
```
|
||||
|
||||
## Type Conversions
|
||||
|
||||
### Strings to categoricals
|
||||
```python
|
||||
# Convert string columns to categorical (more memory efficient)
|
||||
adata.obs['cell_type'] = ['TypeA', 'TypeB'] * 500
|
||||
adata.obs['tissue'] = ['brain', 'liver'] * 500
|
||||
|
||||
# Convert all string columns to categorical
|
||||
adata.strings_to_categoricals()
|
||||
|
||||
print(adata.obs['cell_type'].dtype) # category
|
||||
print(adata.obs['tissue'].dtype) # category
|
||||
```
|
||||
|
||||
### Sparse to dense and vice versa
|
||||
```python
|
||||
from scipy.sparse import csr_matrix
|
||||
|
||||
# Dense to sparse
|
||||
if not isinstance(adata.X, csr_matrix):
|
||||
adata.X = csr_matrix(adata.X)
|
||||
|
||||
# Sparse to dense
|
||||
if isinstance(adata.X, csr_matrix):
|
||||
adata.X = adata.X.toarray()
|
||||
|
||||
# Convert layer
|
||||
adata.layers['normalized'] = csr_matrix(adata.layers['normalized'])
|
||||
```
|
||||
|
||||
## Chunked Operations
|
||||
|
||||
Process large datasets in chunks:
|
||||
|
||||
```python
# Iterate through X in chunks of observations
chunk_size = 100
for chunk, start, end in adata.chunked_X(chunk_size):
    # chunk holds rows start:end of X
    result = process_chunk(chunk)
```
|
||||
|
||||
## Extracting Vectors
|
||||
|
||||
### Get observation vectors
|
||||
```python
|
||||
# Get observation metadata as array
|
||||
cell_types = adata.obs_vector('cell_type')
|
||||
|
||||
# Get gene expression across observations
|
||||
actb_expression = adata.obs_vector('ACTB') # If ACTB in var_names
|
||||
```
|
||||
|
||||
### Get variable vectors
|
||||
```python
|
||||
# Get variable metadata as array
|
||||
gene_names = adata.var_vector('gene_name')
|
||||
```

## Adding/Modifying Data

### Add observations
```python
# Create new observations
new_obs = ad.AnnData(X=np.random.rand(100, adata.n_vars))
new_obs.var_names = adata.var_names

# Concatenate with existing
adata_extended = ad.concat([adata, new_obs], axis=0)
```

### Add variables
```python
# Create new variables
new_vars = ad.AnnData(X=np.random.rand(adata.n_obs, 100))
new_vars.obs_names = adata.obs_names

# Concatenate with existing
adata_extended = ad.concat([adata, new_vars], axis=1)
```

### Add metadata columns
```python
# Add observation annotation
adata.obs['new_score'] = np.random.rand(adata.n_obs)

# Add variable annotation
adata.var['new_label'] = ['label'] * adata.n_vars

# Add from external data
external_data = pd.read_csv('metadata.csv', index_col=0)
adata.obs['external_info'] = external_data.loc[adata.obs_names, 'column']
```

### Add layers
```python
# Add new layers
adata.layers['raw_counts'] = np.random.randint(0, 100, adata.shape)
adata.layers['log_transformed'] = np.log1p(adata.X)

# Replace a layer
adata.layers['normalized'] = new_normalized_data
```

### Add embeddings
```python
# Add PCA
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50)

# Add UMAP
adata.obsm['X_umap'] = np.random.rand(adata.n_obs, 2)

# Add multiple embeddings
adata.obsm['X_tsne'] = np.random.rand(adata.n_obs, 2)
adata.obsm['X_diffmap'] = np.random.rand(adata.n_obs, 10)
```

### Add pairwise relationships
```python
from scipy.sparse import csr_matrix

# Add nearest neighbor graph
n_obs = adata.n_obs
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph

# Add distance matrix
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
```
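
The random graphs above are only placeholders; in practice these matrices usually come from a neighbour search on an embedding. A sketch using scikit-learn, assuming `adata.obsm['X_pca']` has already been computed:

```python
from sklearn.neighbors import kneighbors_graph

# Sparse kNN connectivity and distance graphs built from the PCA embedding
adata.obsp['connectivities'] = kneighbors_graph(
    adata.obsm['X_pca'], n_neighbors=15, mode='connectivity'
)
adata.obsp['distances'] = kneighbors_graph(
    adata.obsm['X_pca'], n_neighbors=15, mode='distance'
)
```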

### Add unstructured data
```python
# Add analysis parameters
adata.uns['pca'] = {
    'variance': [0.2, 0.15, 0.1],
    'variance_ratio': [0.4, 0.3, 0.2],
    'params': {'n_comps': 50}
}

# Add color schemes
adata.uns['cell_type_colors'] = ['#FF0000', '#00FF00', '#0000FF']
```

## Removing Data

### Remove observations or variables
```python
# Keep only specific observations
keep_obs = adata.obs['quality_score'] > 0.5
adata = adata[keep_obs, :]

# Remove specific variables
remove_vars = adata.var['low_count']
adata = adata[:, ~remove_vars]
```

### Remove metadata columns
```python
# Remove observation column
adata.obs.drop('unwanted_column', axis=1, inplace=True)

# Remove variable column
adata.var.drop('unwanted_column', axis=1, inplace=True)
```

### Remove layers
```python
# Remove specific layer
del adata.layers['unwanted_layer']

# Remove all layers
adata.layers = {}
```

### Remove embeddings
```python
# Remove specific embedding
del adata.obsm['X_tsne']

# Remove all embeddings
adata.obsm = {}
```

### Remove unstructured data
```python
# Remove specific key
del adata.uns['unwanted_key']

# Remove all unstructured data
adata.uns = {}
```

## Reordering

### Sort observations
```python
# Sort by observation metadata
adata = adata[adata.obs.sort_values('quality_score').index, :]

# Sort by observation names
adata = adata[sorted(adata.obs_names), :]
```

### Sort variables
```python
# Sort by variable metadata
adata = adata[:, adata.var.sort_values('gene_name').index]

# Sort by variable names
adata = adata[:, sorted(adata.var_names)]
```

### Reorder to match external list
```python
# Reorder observations to match external list
desired_order = ['cell_10', 'cell_5', 'cell_20', ...]
adata = adata[desired_order, :]

# Reorder variables
desired_genes = ['TP53', 'ACTB', 'GAPDH', ...]
adata = adata[:, desired_genes]
```

## Data Transformations

### Normalize
```python
# These examples assume a dense adata.X (see the sparse-aware sketch below)

# Total count normalization (CPM/TPM-like)
total_counts = adata.X.sum(axis=1)
adata.layers['normalized'] = adata.X / total_counts[:, np.newaxis] * 1e6

# Log transformation
adata.layers['log1p'] = np.log1p(adata.X)

# Z-score normalization
mean = adata.X.mean(axis=0)
std = adata.X.std(axis=0)
adata.layers['scaled'] = (adata.X - mean) / std
```
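
With a sparse `X`, the broadcasting above either densifies the matrix or fails outright. A minimal sparse-aware sketch (the `cpm_normalize` helper name is hypothetical, not part of the AnnData API):

```python
import numpy as np
from scipy.sparse import issparse


def cpm_normalize(adata, target_sum=1e6, out_layer='normalized'):
    """Scale each cell to target_sum total counts, keeping sparse input sparse."""
    counts = np.asarray(adata.X.sum(axis=1)).ravel()
    scale = target_sum / np.maximum(counts, 1)  # guard against empty cells
    if issparse(adata.X):
        adata.layers[out_layer] = adata.X.multiply(scale[:, None]).tocsr()
    else:
        adata.layers[out_layer] = adata.X * scale[:, None]


cpm_normalize(adata)
```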

### Filter
```python
# Filter cells by total counts
total_counts = np.array(adata.X.sum(axis=1)).flatten()
adata.obs['total_counts'] = total_counts
adata = adata[adata.obs['total_counts'] > 1000, :]

# Filter genes by detection rate
detection_rate = (adata.X > 0).sum(axis=0) / adata.n_obs
adata.var['detection_rate'] = np.array(detection_rate).flatten()
adata = adata[:, adata.var['detection_rate'] > 0.01]
```

## Working with Views

Views are lightweight references to subsets of data that don't copy the underlying matrix:

```python
# Create view
view = adata[0:100, 0:500]
print(view.is_view)  # True

# Views allow read access
data = view.X

# Modifying a view triggers copy-on-write: AnnData materializes the view as a
# copy (raising ImplicitModificationWarning) instead of writing into the original,
# so make the copy explicit when you intend to modify a subset

# Convert view to independent copy
independent = view.copy()

# Force AnnData to be a copy, not a view
adata = adata.copy()
```

## Merging Metadata

```python
# Merge external metadata
external_metadata = pd.read_csv('additional_metadata.csv', index_col=0)

# Join metadata on the index (DataFrame.join defaults to a left join,
# so all adata observations are kept)
adata.obs = adata.obs.join(external_metadata)

# Equivalent explicit left join via merge
adata.obs = adata.obs.merge(
    external_metadata,
    left_index=True,
    right_index=True,
    how='left'
)
```

## Common Manipulation Patterns

### Quality control filtering
```python
# Calculate QC metrics (np.asarray(...).ravel() keeps this working for sparse X)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata.var['n_cells'] = np.asarray((adata.X > 0).sum(axis=0)).ravel()

# Filter low-quality cells
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['total_counts'] < 50000, :]

# Filter rarely detected genes
adata = adata[:, adata.var['n_cells'] >= 3]
```

### Select highly variable genes
```python
# Mark highly variable genes (simple variance cutoff; assumes dense X)
gene_variance = np.var(adata.X, axis=0)
adata.var['variance'] = np.array(gene_variance).flatten()
adata.var['highly_variable'] = adata.var['variance'] > np.percentile(gene_variance, 90)

# Subset to highly variable genes
adata_hvg = adata[:, adata.var['highly_variable']].copy()
```

### Downsample
```python
# Random sampling of observations
np.random.seed(42)
n_sample = 500
sample_indices = np.random.choice(adata.n_obs, n_sample, replace=False)
adata_downsampled = adata[sample_indices, :].copy()

# Stratified sampling by cell type
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
    range(adata.n_obs),
    test_size=0.2,
    stratify=adata.obs['cell_type']
)
adata_train = adata[train_idx, :].copy()
adata_test = adata[test_idx, :].copy()
```

### Split train/test
```python
# Random train/test split
np.random.seed(42)
n_obs = adata.n_obs
train_size = int(0.8 * n_obs)
indices = np.random.permutation(n_obs)
train_indices = indices[:train_size]
test_indices = indices[train_size:]

adata_train = adata[train_indices, :].copy()
adata_test = adata[test_indices, :].copy()
```

# AnnData Workflows and Best Practices

## Common Workflows

### 1. Single-Cell RNA-seq Analysis Workflow

#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd

# Load from 10X format (matrix.mtx is genes x cells, so transpose to cells x genes)
adata = ad.read_mtx('matrix.mtx').T
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0]
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]

# Or load from pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```

#### Quality Control
```python
# Calculate QC metrics (sparse-safe)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()

# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]

# Filter genes
min_cells = 3
adata = adata[:, np.asarray((adata.X > 0).sum(axis=0)).ravel() >= min_cells]
```

#### Normalization and Preprocessing
```python
# (dense adata.X assumed throughout this snippet)

# Store raw counts
adata.layers['counts'] = adata.X.copy()

# Normalize to 10,000 counts per cell
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4

# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']

# Identify highly variable genes
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```

#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)

# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}

# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```

#### Clustering
```python
# Store cluster assignments
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])

# Store cluster centroids (varm entries must have n_vars rows, i.e. genes x clusters)
centroids = np.array([...])
adata.varm['cluster_centroids'] = centroids
```

#### Save Results
```python
# Save complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```

### 2. Batch Integration Workflow

```python
import anndata as ad

# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [batch1, batch2, batch3],
    axis=0,
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    index_unique='-'
)

# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)

# Store corrected embeddings
adata.obsm['X_pca_corrected'] = corrected_pca
adata.obsm['X_umap_corrected'] = corrected_umap
```
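
Any correction method's output can be stored the same way as the `corrected_pca` placeholder above. Purely as an illustration (not a substitute for Harmony or Scanorama), here is a sketch that centres each batch's PCA coordinates, assuming `adata.obsm['X_pca']` already exists:

```python
import numpy as np

# Crude per-batch centering of the PCA embedding, stored under a new key
X_pca = np.asarray(adata.obsm['X_pca'])
corrected_pca = X_pca.copy()
for batch in adata.obs['batch'].unique():
    mask = (adata.obs['batch'] == batch).to_numpy()
    corrected_pca[mask] -= corrected_pca[mask].mean(axis=0)
adata.obsm['X_pca_corrected'] = corrected_pca
```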

### 3. Memory-Efficient Large Dataset Workflow

```python
import anndata as ad

# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")

# Work with chunks (chunked_X yields (chunk, start, end) tuples)
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    # Process chunk (process_chunk is a placeholder for your own function)
    result = process_chunk(chunk)

# Close file when done
adata.file.close()
```

### 4. Multi-Dataset Comparison Workflow

```python
import anndata as ad
import numpy as np

# Load datasets
datasets = {
    'study1': ad.read_h5ad('study1.h5ad'),
    'study2': ad.read_h5ad('study2.h5ad'),
    'study3': ad.read_h5ad('study3.h5ad')
}

# Outer join to keep all genes
combined = ad.concat(
    list(datasets.values()),
    axis=0,
    join='outer',
    label='study',
    keys=list(datasets.keys()),
    merge='first'
)

# Handle missing data (outer joins pad dense matrices with NaN; assumes dense X)
combined.X[np.isnan(combined.X)] = 0

# Add dataset-specific metadata
combined.uns['datasets'] = {
    'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
    'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
    'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```

## Best Practices

### Memory Management

#### Use Sparse Matrices
```python
import numpy as np
from scipy.sparse import csr_matrix

# Convert to sparse if the data is mostly zeros
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.3:  # less than 30% non-zero
    adata.X = csr_matrix(adata.X)
```

#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')

# Only load what you need
subset = adata[:1000, :500].copy()  # now in memory
```

#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()

# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```

### Data Organization

#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```

#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates

# Gene loadings
adata.varm['PCs'] = principal_components
```

#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
    'normalization': 'TPM',
    'min_genes': 200,
    'min_cells': 3,
    'date': '2024-01-15'
}

# Store analysis results
adata.uns['differential_expression'] = {
    'method': 't-test',
    'p_value_threshold': 0.05
}
```

### Subsetting and Views

#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell']  # view
print(subset.is_view)  # True

# Views are memory efficient, but modifying one triggers copy-on-write:
# the view is silently materialized as a copy (ImplicitModificationWarning)
subset.obs['new_column'] = value  # 'value' is a placeholder

# Create an independent copy explicitly when you intend to modify the subset
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```

#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()

# Good - combine the masks in a single subsetting step
result = adata[
    (adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
    adata.var.highly_variable
].copy()
```

### File I/O

#### Use Compression
```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```

#### Choose the Right Format
```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')

# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')

# Loom for compatibility with other tools
adata.write_loom('data.loom')
```

#### Close File Connections
```python
# Use try/finally so the backing file is always closed
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
    # Work with data
    process(adata)
finally:
    adata.file.close()
```
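
If this pattern recurs, a small wrapper makes it reusable. A sketch with a hypothetical `backed_h5ad` helper (not part of the AnnData API):

```python
from contextlib import contextmanager

import anndata as ad


@contextmanager
def backed_h5ad(path, mode='r'):
    """Open an .h5ad file in backed mode and close the backing file on exit."""
    adata = ad.read_h5ad(path, backed=mode)
    try:
        yield adata
    finally:
        adata.file.close()


with backed_h5ad('file.h5ad') as adata:
    print(adata.shape)
```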

### Concatenation

#### Choose Appropriate Join Strategy
```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')

# Outer join - all features (keeps all data, pads missing entries
# with zeros for sparse matrices and NaN for dense arrays)
combined = ad.concat([adata1, adata2], join='outer')
```
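
A toy sketch of the difference, using throwaway objects so the resulting shapes are easy to see:

```python
import anndata as ad
import numpy as np

a = ad.AnnData(np.ones((2, 3)))
a.obs_names = ['c1', 'c2']
a.var_names = ['g1', 'g2', 'g3']

b = ad.AnnData(np.ones((2, 2)))
b.obs_names = ['c3', 'c4']
b.var_names = ['g2', 'g3']

print(ad.concat([a, b], join='inner').shape)  # (4, 2) - only shared genes
print(ad.concat([a, b], join='outer').shape)  # (4, 3) - union of genes
```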

#### Track Data Sources
```python
# Add source labels
combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset',
    keys=['exp1', 'exp2', 'exp3']
)

# Make indices unique
combined = ad.concat(
    [adata1, adata2, adata3],
    index_unique='-'
)
```

#### Handle Variable-Specific Metadata
```python
# Use a merge strategy for var annotations
combined = ad.concat(
    [adata1, adata2],
    merge='same',  # keep only annotations that are identical across datasets
    join='outer'
)
```

### Naming Conventions

#### Use Consistent Naming
```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']

# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']

# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```

#### Make Indices Unique
```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```

### Error Handling

#### Validate Data Structure
```python
# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"

# Check for NaN values (assumes dense X; see the sparse-aware sketch below)
if np.isnan(adata.X).any():
    print("Warning: NaN values detected")

# Check for negative values in count data
if (adata.X < 0).any():
    print("Warning: Negative values in count data")
```
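
For sparse matrices, `np.isnan` cannot be applied to the matrix object directly. A sketch of checks that work either way (the `basic_checks` helper name is hypothetical):

```python
import numpy as np
from scipy.sparse import issparse


def basic_checks(adata):
    """Sanity checks on X that work for dense or sparse storage."""
    assert adata.n_obs > 0 and adata.n_vars > 0, "Empty AnnData object"
    values = adata.X.data if issparse(adata.X) else np.asarray(adata.X)
    if np.isnan(values).any():
        print("Warning: NaN values detected")
    if (values < 0).any():
        print("Warning: negative values in count data")


basic_checks(adata)
```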

#### Handle Missing Data
```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
    print("Warning: Missing cell type annotations")
    # Fill or remove
    adata = adata[~adata.obs['cell_type'].isna()]
```

## Common Pitfalls

### 1. Forgetting to Copy Views
```python
# RISKY - assigning to a view relies on implicit copy-on-write
# (ImplicitModificationWarning; easy to misread which object actually changed)
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data

# GOOD - make the copy explicit, then modify
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data  # only changes the copy
```

### 2. Mixing Backed and In-Memory Operations
```python
# BAD - trying to modify data opened in read-only backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100  # Error: can't modify backed data

# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100  # Works
```

### 3. Not Using Categoricals for Metadata
```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000

# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```

### 4. Incorrect Concatenation Axis
```python
# Concatenating observations (cells) - the usual case
combined = ad.concat([adata1, adata2], axis=0)

# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1)
```

### 5. Not Preserving Raw Data
```python
# BAD - loses original data
adata.X = normalized_data

# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```