Improve the anndata skill

This commit is contained in:
Timothy Kassis
2025-11-03 16:09:01 -08:00
parent 094d5aa9f1
commit 6ddea4786e
9 changed files with 2448 additions and 1560 deletions


@@ -1,527 +1,394 @@
---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---

# AnnData

## Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

## When to Use This Skill

Use this skill when:
- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

## Installation

```bash
pip install anndata

# With optional dependencies
pip install anndata[dev,test,doc]
```

## Quick Start

### Creating an AnnData object

```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)
```

### Reading data

```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10x Genomics HDF5 matrices are read via scanpy
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```

### Writing data

```python
# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```

### Basic operations

```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```

## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

**See**: `references/data_structure.md` for comprehensive information on:
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices

### 2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

**See**: `references/io_operations.md` for details on:
- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization

Common commands:

```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data (via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format
adata = ad.read_mtx('matrix.mtx').T
```

### 3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

**See**: `references/concatenation.md` for comprehensive coverage of:
- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets

Common commands:

```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation of backed objects
from anndata.experimental import AnnCollection
adatas = [ad.read_h5ad(f, backed='r') for f in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(
    adatas,
    join_obs='outer',
    label='dataset'
)
```

### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

**See**: `references/manipulation.md` for detailed guidance on:
- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering

Common commands:

```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]          # View (lightweight reference)
copy = adata[0:100, :].copy()   # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```

### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

**See**: `references/best_practices.md` for guidelines on:
- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions

Key recommendations:

```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```

## Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

### Scanpy (Single-cell analysis)

```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```

### Muon (Multimodal data)

```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```

### PyTorch integration

```python
from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
for batch in dataloader:
    X = batch.X
    # Train model
```

## Common Workflows

### Single-cell RNA-seq analysis

```python
import anndata as ad
import scanpy as sc

# 1. Load data
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```

### Batch integration

```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```

### Working with large datasets

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset
adata_subset = high_quality.to_memory()

# Process subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)
```

## Troubleshooting

### Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

### Index alignment issues

Always align external data on index:

```python
# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

## Additional Resources

- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata


@@ -1,218 +0,0 @@
# AnnData API Reference
## Core AnnData Class
The `AnnData` class is the central data structure for storing and manipulating annotated datasets in single-cell genomics and other domains.
### Core Attributes
| Attribute | Type | Description |
|-----------|------|-------------|
| **X** | array-like | Primary data matrix (#observations × #variables). Supports NumPy arrays, sparse matrices (CSR/CSC), HDF5 datasets, Zarr arrays, and Dask arrays |
| **obs** | DataFrame | One-dimensional annotation of observations (rows). Length equals observation count |
| **var** | DataFrame | One-dimensional annotation of variables/features (columns). Length equals variable count |
| **uns** | OrderedDict | Unstructured annotation for miscellaneous metadata |
| **obsm** | dict-like | Multi-dimensional observation annotations (structured arrays aligned to observation axis) |
| **varm** | dict-like | Multi-dimensional variable annotations (structured arrays aligned to variable axis) |
| **obsp** | dict-like | Pairwise observation annotations (square matrices representing graphs) |
| **varp** | dict-like | Pairwise variable annotations (graphs between features) |
| **layers** | dict-like | Additional data matrices matching X's dimensions |
| **raw** | AnnData | Stores original versions of X and var before transformations |
### Dimensional Properties
- **n_obs**: Number of observations (sample count)
- **n_vars**: Number of variables/features
- **shape**: Tuple returning (n_obs, n_vars)
- **T**: Transposed view of the entire object
### State Properties
- **isbacked**: Boolean indicating disk-backed storage status
- **is_view**: Boolean identifying whether object is a view of another AnnData
- **filename**: Path to backing .h5ad file; setting this enables disk-backed mode
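A minimal sketch of inspecting the dimensional and state properties listed above, assuming an existing `results.h5ad` file:

```python
import anndata as ad

adata = ad.read_h5ad('results.h5ad', backed='r')

print(adata.shape)            # (n_obs, n_vars)
print(adata.n_obs, adata.n_vars)
print(adata.isbacked)         # True: X still lives in the .h5ad file
print(adata.filename)         # Path to the backing file

subset = adata[:10]
print(subset.is_view)         # True: subsetting returns a view
```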
### Key Methods
#### Construction and Copying
- **`AnnData(X=None, obs=None, var=None, ...)`**: Create new AnnData object
- **`copy(filename=None)`**: Create full copy, optionally stored on disk
#### Subsetting and Views
- **`adata[obs_subset, var_subset]`**: Subset observations and variables (returns view by default)
- **`.copy()`**: Convert view to independent object
#### Data Access
- **`to_df(layer=None)`**: Generate pandas DataFrame representation
- **`obs_vector(k, layer=None)`**: Extract 1D array from X, layers, or annotations
- **`var_vector(k, layer=None)`**: Extract 1D array for a variable
- **`chunk_X(select=1000, replace=True)`**: Return a chunk of the data matrix for randomly (or explicitly) selected observations
- **`chunked_X(chunk_size)`**: Iterate over the data matrix in row chunks, yielding `(chunk, start, end)` tuples
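A short sketch of these accessors, assuming a gene named `Gene_5`, an observation named `Cell_0`, an obs column `n_genes`, and a `log1p` layer exist in the object:

```python
# Convert the main matrix (or a layer) to a pandas DataFrame
df = adata.to_df()
df_log = adata.to_df(layer='log1p')

# 1D vectors aligned to observations or variables
gene_expr = adata.obs_vector('Gene_5')   # one gene across all cells
per_cell = adata.obs_vector('n_genes')   # an obs column as an array
per_gene = adata.var_vector('Cell_0')    # one observation across all genes

# Iterate over X in row chunks; yields (chunk, start, end) tuples
for chunk, start, end in adata.chunked_X(1000):
    print(start, end, chunk.shape)
```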
#### Transformation
- **`transpose()`**: Return transposed object
- **`concatenate(*adatas, ...)`**: Combine multiple AnnData objects along observation axis
- **`to_memory(copy=False)`**: Load all backed arrays into RAM
#### File I/O
- **`write_h5ad(filename, compression='gzip')`**: Save as .h5ad HDF5 format
- **`write_zarr(store, ...)`**: Export hierarchical Zarr store
- **`write_loom(filename, ...)`**: Output .loom format file
- **`write_csvs(dirname, ...)`**: Write annotations as separate CSV files
#### Data Management
- **`strings_to_categoricals()`**: Convert string annotations to categorical types
- **`rename_categories(key, categories)`**: Update category labels in annotations
- **`obs_names_make_unique(join='-')`**: Append numeric suffixes to duplicate observation names
- **`var_names_make_unique(join='-')`**: Append numeric suffixes to duplicate variable names
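A small example of these housekeeping methods; the `condition` column and its category labels are illustrative and assume an even number of observations:

```python
import pandas as pd

# Convert repeated string columns to memory-efficient categoricals
adata.strings_to_categoricals()

# Rename the categories of an existing categorical obs column
adata.obs['condition'] = pd.Categorical(['ctrl', 'trt'] * (adata.n_obs // 2))
adata.rename_categories('condition', ['control', 'treated'])

# Append numeric suffixes to any duplicated names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```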
## Module-Level Functions
### Reading Functions
#### Native Formats
- **`read_h5ad(filename, backed=None, as_sparse=None)`**: Load HDF5-based .h5ad files
- **`read_zarr(store)`**: Access hierarchical Zarr array stores
#### Alternative Formats
- **`read_csv(filename, ...)`**: Import from CSV files
- **`read_excel(filename, ...)`**: Import from Excel files
- **`read_hdf(filename, key)`**: Read from HDF5 files
- **`read_loom(filename, ...)`**: Import from .loom files
- **`read_mtx(filename, ...)`**: Import from Matrix Market format
- **`read_text(filename, ...)`**: Import from text files
- **`read_umi_tools(filename, ...)`**: Import from UMI-tools format
#### Element-Level Access
- **`read_elem(elem)`**: Retrieve specific components from storage
- **`sparse_dataset(group)`**: Generate backed sparse matrix classes
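A brief sketch of element-level reading; depending on the anndata version, `read_elem` is exposed under `anndata.experimental` or `anndata.io`:

```python
import h5py
from anndata.experimental import read_elem

# Load only the obs table from an .h5ad file without reading X
with h5py.File('data.h5ad', 'r') as f:
    obs = read_elem(f['obs'])

print(obs.head())
```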
### Combining Operations
- **`concat(adatas, axis=0, join='inner', merge=None, ...)`**: Merge multiple AnnData objects
- **axis**: 0 (observations) or 1 (variables)
- **join**: 'inner' (intersection) or 'outer' (union)
- **merge**: Strategy for non-concatenation axis ('same', 'unique', 'first', 'only', or None)
- **label**: Column name for source tracking
- **keys**: Dataset identifiers for source annotation
- **index_unique**: Separator for making duplicate indices unique
### Writing Functions
- **`write_h5ad(filename, adata, compression='gzip')`**: Export to HDF5 format
- **`write_zarr(store, adata, ...)`**: Save as Zarr hierarchical arrays
- **`write_elem(elem, ...)`**: Write individual components
### Experimental Features
- **`AnnCollection`**: Batch processing for large collections
- **`AnnLoader`**: PyTorch DataLoader integration
- **`concat_on_disk(*adatas, filename, ...)`**: Memory-efficient out-of-core concatenation
- **`read_lazy(filename)`**: Lazy loading with deferred computation
- **`read_dispatched(filename, ...)`**: Custom I/O with callbacks
- **`write_dispatched(filename, ...)`**: Custom writing with callbacks
### Configuration
- **`settings`**: Package-wide configuration object
- **`settings.override(**kwargs)`**: Context manager for temporary settings changes
## Common Usage Patterns
### Creating AnnData Objects
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix
# From dense array
counts = np.random.poisson(1, size=(100, 2000))
adata = ad.AnnData(counts)
# From sparse matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
# With metadata
import pandas as pd
obs_meta = pd.DataFrame({'cell_type': ['B', 'T', 'Monocyte'] * 33 + ['B']})
var_meta = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]})
adata = ad.AnnData(counts, obs=obs_meta, var=var_meta)
```
### Subsetting
```python
# By names
subset = adata[['Cell_1', 'Cell_10'], ['Gene_5', 'Gene_1900']]
# By boolean mask
b_cells = adata[adata.obs.cell_type == 'B']
# By position
first_five = adata[:5, :100]
# Convert view to copy
adata_copy = adata[:5].copy()
```
### Adding Annotations
```python
# Cell-level metadata
adata.obs['batch'] = pd.Categorical(['batch1', 'batch2'] * 50)
# Gene-level metadata
adata.var['highly_variable'] = np.random.choice([True, False], size=adata.n_vars)
# Embeddings
adata.obsm['X_pca'] = np.random.normal(size=(adata.n_obs, 50))
adata.obsm['X_umap'] = np.random.normal(size=(adata.n_obs, 2))
# Alternative data representations
adata.layers['log_transformed'] = np.log1p(adata.X)
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)
# Unstructured metadata
adata.uns['experiment_date'] = '2024-01-15'
adata.uns['parameters'] = {'min_genes': 200, 'min_cells': 3}
```
### File I/O
```python
# Write to disk
adata.write('my_results.h5ad', compression='gzip')
# Read into memory
adata = ad.read_h5ad('my_results.h5ad')
# Read in backed mode (memory-efficient)
adata = ad.read_h5ad('my_results.h5ad', backed='r')
# Close file connection
adata.file.close()
```
### Concatenation
```python
# Combine multiple datasets
adata1 = ad.AnnData(np.random.poisson(1, size=(100, 2000)))
adata2 = ad.AnnData(np.random.poisson(1, size=(150, 2000)))
adata3 = ad.AnnData(np.random.poisson(1, size=(80, 2000)))
# Simple concatenation
combined = ad.concat([adata1, adata2, adata3], axis=0)
# With source labels
combined = ad.concat(
[adata1, adata2, adata3],
axis=0,
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Inner join (only shared variables)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='inner')
# Outer join (all variables, pad with zeros)
combined = ad.concat([adata1, adata2, adata3], axis=0, join='outer')
```


@@ -0,0 +1,525 @@
# Best Practices
Guidelines for efficient and effective use of AnnData.
## Memory Management
### Use sparse matrices for sparse data
```python
import numpy as np
from scipy.sparse import csr_matrix
import anndata as ad
# Check data sparsity
data = np.random.rand(1000, 2000)
sparsity = 1 - np.count_nonzero(data) / data.size
print(f"Sparsity: {sparsity:.2%}")
# Convert to sparse if >50% zeros
if sparsity > 0.5:
    adata = ad.AnnData(X=csr_matrix(data))
else:
    adata = ad.AnnData(X=data)
# Benefits: 10-100x memory reduction for sparse genomics data
```
### Convert strings to categoricals
```python
# Inefficient: string columns use lots of memory
adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A']
# Efficient: convert to categorical
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
# Convert all string columns
adata.strings_to_categoricals()
# Benefits: 10-50x memory reduction for repeated strings
```
### Use backed mode for large datasets
```python
# Don't load entire dataset into memory
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Work with metadata
filtered = adata[adata.obs['quality'] > 0.8]
# Load only filtered subset
adata_subset = filtered.to_memory()
# Benefits: Work with datasets larger than RAM
```
## Views vs Copies
### Understanding views
```python
# Subsetting creates a view by default
subset = adata[0:100, :]
print(subset.is_view) # True
# Views don't copy data (memory efficient)
# But modifications can affect original
# Check if object is a view
if adata.is_view:
    adata = adata.copy()  # Make independent
```
### When to use views
```python
# Good: Read-only operations on subsets
mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean()
# Good: Temporary analysis
temp_subset = adata[:100, :]
result = analyze(temp_subset.X)
```
### When to use copies
```python
# Create independent copy for modifications
adata_filtered = adata[keep_cells, :].copy()
# Safe to modify without affecting original
adata_filtered.obs['new_column'] = values
# Always copy when:
# - Storing subset for later use
# - Modifying subset data
# - Passing to function that modifies data
```
## Data Storage Best Practices
### Choose the right format
**H5AD (HDF5) - Default choice**
```python
adata.write_h5ad('data.h5ad', compression='gzip')
```
- Fast random access
- Supports backed mode
- Good compression
- Best for: Most use cases
**Zarr - Cloud and parallel access**
```python
adata.write_zarr('data.zarr', chunks=(100, 100))
```
- Excellent for cloud storage (S3, GCS)
- Supports parallel I/O
- Good compression
- Best for: Large datasets, cloud workflows, parallel processing
**CSV - Interoperability**
```python
adata.write_csvs('output_dir/')
```
- Human readable
- Compatible with all tools
- Large file sizes, slow
- Best for: Sharing with non-Python tools, small datasets
### Optimize file size
```python
# Before saving, optimize:
# 1. Convert to sparse if appropriate
import numpy as np
from scipy.sparse import csr_matrix, issparse

if not issparse(adata.X):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:
        adata.X = csr_matrix(adata.X)
# 2. Convert strings to categoricals
adata.strings_to_categoricals()
# 3. Use compression
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
# Typical results: 5-20x file size reduction
```
## Backed Mode Strategies
### Read-only analysis
```python
# Open in read-only backed mode
adata = ad.read_h5ad('data.h5ad', backed='r')
# Perform filtering without loading data
high_quality = adata[adata.obs['quality_score'] > 0.8]
# Load only filtered data
adata_filtered = high_quality.to_memory()
```
### Read-write modifications
```python
# Open in read-write backed mode
adata = ad.read_h5ad('data.h5ad', backed='r+')
# Modify metadata (held in memory; X stays on disk)
adata.obs['new_annotation'] = values

# Call adata.write() to persist metadata changes back to the file
```
### Chunked processing
```python
# Process large dataset in chunks
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')
results = []
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    result = process(chunk)
    results.append(result)
final_result = combine(results)
```
## Performance Optimization
### Subsetting performance
```python
# Fast: Boolean indexing with arrays
mask = np.array(adata.obs['quality'] > 0.5)
subset = adata[mask, :]
# Slow: Boolean indexing with Series (creates view chain)
subset = adata[adata.obs['quality'] > 0.5, :]
# Fastest: Integer indices
indices = np.where(adata.obs['quality'] > 0.5)[0]
subset = adata[indices, :]
```
### Avoid repeated subsetting
```python
# Inefficient: Multiple subset operations
for cell_type in ['A', 'B', 'C']:
    subset = adata[adata.obs['cell_type'] == cell_type]
    process(subset)

# Efficient: Group and process
groups = adata.obs.groupby('cell_type').groups
for cell_type, indices in groups.items():
    subset = adata[indices, :]
    process(subset)
```
### Use chunked operations for large matrices
```python
# Process X in chunks
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    result = compute(chunk)
# More memory efficient than loading full X
```
## Working with Raw Data
### Store raw before filtering
```python
# Original data with all genes
adata = ad.AnnData(X=counts)
# Store raw before filtering
adata.raw = adata.copy()
# Filter to highly variable genes
adata = adata[:, adata.var['highly_variable']]
# Later: access original data
original_expression = adata.raw.X
all_genes = adata.raw.var_names
```
### When to use raw
```python
# Use raw for:
# - Differential expression on filtered genes
# - Visualization of specific genes not in filtered set
# - Accessing original counts after normalization
# Access raw data
if adata.raw is not None:
    gene_expr = adata.raw[:, 'GENE_NAME'].X
else:
    gene_expr = adata[:, 'GENE_NAME'].X
```
## Metadata Management
### Naming conventions
```python
# Consistent naming improves usability
# Observation metadata (obs):
# - cell_id, sample_id
# - cell_type, tissue, condition
# - n_genes, n_counts, percent_mito
# - cluster, leiden, louvain
# Variable metadata (var):
# - gene_id, gene_name
# - highly_variable, n_cells
# - mean_expression, dispersion
# Embeddings (obsm):
# - X_pca, X_umap, X_tsne
# - X_diffmap, X_draw_graph_fr
# Follow conventions from scanpy/scverse ecosystem
```
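A small sketch of applying these conventions to an object with non-standard names; the original column names and the `pca_coords`/`umap_coords` arrays are placeholders:

```python
# Rename metadata columns to the conventional snake_case keys
adata.obs.rename(columns={'CellType': 'cell_type', 'nGene': 'n_genes'}, inplace=True)
adata.var.rename(columns={'HVG': 'highly_variable'}, inplace=True)

# Store embeddings under the conventional X_* keys
adata.obsm['X_pca'] = pca_coords    # shape (n_obs, n_components)
adata.obsm['X_umap'] = umap_coords  # shape (n_obs, 2)
```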
### Document metadata
```python
# Store metadata descriptions in uns
adata.uns['metadata_descriptions'] = {
'cell_type': 'Cell type annotation from automated clustering',
'quality_score': 'QC score from scrublet (0-1, higher is better)',
'batch': 'Experimental batch identifier'
}
# Store processing history
adata.uns['processing_steps'] = [
'Raw counts loaded from 10X',
'Filtered: n_genes > 200, n_counts < 50000',
'Normalized to 10000 counts per cell',
'Log transformed'
]
```
## Reproducibility
### Set random seeds
```python
import numpy as np
# Set seed for reproducible results
np.random.seed(42)
# Document in uns
adata.uns['random_seed'] = 42
```
### Store parameters
```python
# Store analysis parameters in uns
adata.uns['pca'] = {
'n_comps': 50,
'svd_solver': 'arpack',
'random_state': 42
}
adata.uns['neighbors'] = {
'n_neighbors': 15,
'n_pcs': 50,
'metric': 'euclidean',
'method': 'umap'
}
```
### Version tracking
```python
import sys

import anndata
import scanpy
import numpy
# Store versions
adata.uns['versions'] = {
'anndata': anndata.__version__,
'scanpy': scanpy.__version__,
'numpy': numpy.__version__,
'python': sys.version
}
```
## Error Handling
### Check data validity
```python
import numpy as np
from scipy.sparse import issparse

# Verify dimensions
assert adata.n_obs == len(adata.obs)
assert adata.n_vars == len(adata.var)
assert adata.X.shape == (adata.n_obs, adata.n_vars)

# Check for NaN values
has_nan = np.isnan(adata.X.data).any() if issparse(adata.X) else np.isnan(adata.X).any()
if has_nan:
    print("Warning: Data contains NaN values")

# Check for negative values (if counts expected)
has_negative = (adata.X.data < 0).any() if issparse(adata.X) else (adata.X < 0).any()
if has_negative:
    print("Warning: Data contains negative values")
```
### Validate metadata
```python
# Check for missing values
missing_obs = adata.obs.isnull().sum()
if missing_obs.any():
    print("Missing values in obs:")
    print(missing_obs[missing_obs > 0])
# Verify indices are unique
assert adata.obs_names.is_unique, "Observation names not unique"
assert adata.var_names.is_unique, "Variable names not unique"
# Check metadata alignment
assert len(adata.obs) == adata.n_obs
assert len(adata.var) == adata.n_vars
```
## Integration with Other Tools
### Scanpy integration
```python
import scanpy as sc
# AnnData is native format for scanpy
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```
### Pandas integration
```python
import pandas as pd
# Convert to DataFrame
df = adata.to_df()
# Create from DataFrame
adata = ad.AnnData(df)
# Work with metadata as DataFrames
adata.obs = adata.obs.merge(external_metadata, left_index=True, right_index=True)
```
### PyTorch integration
```python
from anndata.experimental import AnnLoader
# Create PyTorch DataLoader
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
# Iterate in training loop
for batch in dataloader:
    X = batch.X
    # Train model on batch
```
## Common Pitfalls
### Pitfall 1: Modifying views
```python
# Wrong: Modifying view can affect original
subset = adata[:100, :]
subset.X = new_data # May modify adata.X!
# Correct: Copy before modifying
subset = adata[:100, :].copy()
subset.X = new_data # Independent copy
```
### Pitfall 2: Index misalignment
```python
# Wrong: Assuming order matches
external_data = pd.read_csv('data.csv')
adata.obs['new_col'] = external_data['values'] # May misalign!
# Correct: Align on index
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```
### Pitfall 3: Mixing sparse and dense
```python
# Wrong: densifying the matrix uses huge memory
result = adata.X.toarray() + 1

# Correct: use sparse operations
from scipy.sparse import issparse
if issparse(adata.X):
    result = adata.X.copy()
    result.data += 1  # updates only the stored (non-zero) entries
```
### Pitfall 4: Not handling views
```python
# Risky: a view keeps a reference to the full parent object
subset = adata[mask, :]
del adata  # The parent's data stays alive in memory through the view

# Correct: Copy when needed
subset = adata[mask, :].copy()
del adata  # Parent memory can now be released
```
### Pitfall 5: Ignoring memory constraints
```python
# Wrong: Loading huge dataset into memory
adata = ad.read_h5ad('100GB_file.h5ad') # OOM error!
# Correct: Use backed mode
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
subset = adata[adata.obs['keep']].to_memory()
```
## Workflow Example
Complete best-practices workflow:
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix, issparse
# 1. Load with backed mode if large
adata = ad.read_h5ad('data.h5ad', backed='r')
# 2. Quick metadata check without loading data
print(f"Dataset: {adata.n_obs} cells × {adata.n_vars} genes")
# 3. Filter based on metadata
high_quality = adata[adata.obs['quality_score'] > 0.8]
# 4. Load filtered subset to memory
adata = high_quality.to_memory()
# 5. Convert to optimal storage types
adata.strings_to_categoricals()
if not issparse(adata.X):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:
        adata.X = csr_matrix(adata.X)
# 6. Store raw before filtering genes
adata.raw = adata.copy()
# 7. Filter to highly variable genes
adata = adata[:, adata.var['highly_variable']].copy()
# 8. Document processing
adata.uns['processing'] = {
'filtered': 'quality_score > 0.8',
'n_hvg': adata.n_vars,
'date': '2025-11-03'
}
# 9. Save optimized
adata.write_h5ad('processed.h5ad', compression='gzip')
```


@@ -0,0 +1,396 @@
# Concatenating AnnData Objects
Combine multiple AnnData objects along either observations or variables axis.
## Basic Concatenation
### Concatenate along observations (stack cells/samples)
```python
import anndata as ad
import numpy as np
# Create multiple AnnData objects
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata3 = ad.AnnData(X=np.random.rand(200, 50))
# Concatenate along observations (axis=0, default)
adata_combined = ad.concat([adata1, adata2, adata3], axis=0)
print(adata_combined.shape) # (450, 50)
```
### Concatenate along variables (stack genes/features)
```python
# Create objects with same observations, different variables
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(100, 30))
adata3 = ad.AnnData(X=np.random.rand(100, 70))
# Concatenate along variables (axis=1)
adata_combined = ad.concat([adata1, adata2, adata3], axis=1)
print(adata_combined.shape) # (100, 150)
```
## Join Types
### Inner join (intersection)
Keep only variables/observations present in all objects.
```python
import pandas as pd
# Create objects with different variables
adata1 = ad.AnnData(
X=np.random.rand(100, 50),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)])
)
adata2 = ad.AnnData(
X=np.random.rand(150, 60),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)])
)
# Inner join: only genes 10-49 are kept (overlap)
adata_inner = ad.concat([adata1, adata2], join='inner')
print(adata_inner.n_vars) # 40 genes (overlap)
```
### Outer join (union)
Keep all variables/observations, filling missing values.
```python
# Outer join: all genes are kept
adata_outer = ad.concat([adata1, adata2], join='outer')
print(adata_outer.n_vars) # 70 genes (union)
# Missing values are filled with appropriate defaults:
# - 0 for sparse matrices
# - NaN for dense matrices
```
### Fill values for outer joins
```python
# Specify fill value for missing data
adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0)
```
## Tracking Data Sources
### Add batch labels
```python
# Label which object each observation came from
adata_combined = ad.concat(
[adata1, adata2, adata3],
label='batch', # Column name for labels
keys=['batch1', 'batch2', 'batch3'] # Labels for each object
)
print(adata_combined.obs['batch'].value_counts())
# batch1 100
# batch2 150
# batch3 200
```
### Automatic batch labels
```python
# If keys not provided, uses integer indices
adata_combined = ad.concat(
[adata1, adata2, adata3],
label='dataset'
)
# dataset column contains: 0, 1, 2
```
## Merge Strategies
Control how metadata from different objects is combined using the `merge` parameter.
### merge=None (default for observations)
Exclude metadata on non-concatenation axis.
```python
# When concatenating observations, per-dataset var columns are dropped by default
adata1.var['gene_type'] = 'protein_coding'
adata2.var['gene_type'] = 'protein_coding'

# With merge=None, 'gene_type' does not appear in the result
adata_combined = ad.concat([adata1, adata2], merge=None)
```
### merge='same'
Keep metadata that is identical across all objects.
```python
adata1.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata2.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata1.var['type'] = 'protein_coding'
adata2.var['type'] = 'lncRNA' # Different
# 'chromosome' is kept (same), 'type' is excluded (different)
adata_combined = ad.concat([adata1, adata2], merge='same')
```
### merge='unique'
Keep metadata columns where each key has exactly one value.
```python
adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]
adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]
# gene_id is kept (unique values for each key)
adata_combined = ad.concat([adata1, adata2], merge='unique')
```
### merge='first'
Take values from the first object containing each key.
```python
adata1.var['description'] = ['Desc1'] * 50
adata2.var['description'] = ['Desc2'] * 50
# Uses descriptions from adata1
adata_combined = ad.concat([adata1, adata2], merge='first')
```
### merge='only'
Keep metadata that appears in only one object.
```python
adata1.var['adata1_specific'] = [1] * 50
adata2.var['adata2_specific'] = [2] * 50
# Both metadata columns are kept
adata_combined = ad.concat([adata1, adata2], merge='only')
```
## Handling Index Conflicts
### Make indices unique
```python
import pandas as pd
# Create objects with overlapping observation names
adata1 = ad.AnnData(
X=np.random.rand(3, 10),
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)
adata2 = ad.AnnData(
X=np.random.rand(3, 10),
obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)
# Make indices unique by appending batch keys
adata_combined = ad.concat(
[adata1, adata2],
label='batch',
keys=['batch1', 'batch2'],
index_unique='_' # Separator for making indices unique
)
print(adata_combined.obs_names)
# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1',
# 'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2']
```
## Concatenating Layers
```python
# Objects with layers
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata1.layers['normalized'] = np.random.rand(100, 50)
adata1.layers['scaled'] = np.random.rand(100, 50)
adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata2.layers['normalized'] = np.random.rand(150, 50)
adata2.layers['scaled'] = np.random.rand(150, 50)
# Layers are concatenated automatically if present in all objects
adata_combined = ad.concat([adata1, adata2])
print(adata_combined.layers.keys())
# dict_keys(['normalized', 'scaled'])
```
## Concatenating Multi-dimensional Annotations
### obsm/varm
```python
# Objects with embeddings
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)
# obsm is concatenated along observation axis
adata_combined = ad.concat([adata1, adata2])
print(adata_combined.obsm['X_pca'].shape) # (250, 50)
```
### obsp/varp (pairwise annotations)
```python
from scipy.sparse import csr_matrix
# Pairwise matrices
adata1.obsp['connectivities'] = csr_matrix((100, 100))
adata2.obsp['connectivities'] = csr_matrix((150, 150))
# By default, obsp is NOT concatenated (set pairwise=True to include)
adata_combined = ad.concat([adata1, adata2])
# adata_combined.obsp is empty
# Include pairwise data (creates block diagonal matrix)
adata_combined = ad.concat([adata1, adata2], pairwise=True)
print(adata_combined.obsp['connectivities'].shape) # (250, 250)
```
## Concatenating uns (unstructured)
Unstructured metadata is merged recursively:
```python
adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'}
adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'}
# Using merge='unique' for uns
adata_combined = ad.concat([adata1, adata2], uns_merge='unique')
# 'date' is kept (same value), 'batch' might be excluded (different values)
```
## Lazy Concatenation (AnnCollection)
For very large datasets, use lazy concatenation that doesn't load all data:
```python
from anndata.experimental import AnnCollection
# Open files in backed mode (metadata in memory, X stays on disk)
files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad']
adatas = [ad.read_h5ad(f, backed='r') for f in files]
collection = AnnCollection(
    adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset',
    keys=['dataset1', 'dataset2', 'dataset3']
)
# Access data lazily
print(collection.n_obs) # Total observations
print(collection.obs.head()) # Metadata loaded, not X
# Convert to regular AnnData when needed (loads all data)
adata = collection.to_adata()
```
### Working with AnnCollection
```python
# Subset without loading data
subset = collection[collection.obs['cell_type'] == 'T cell']
# Iterate through datasets
for adata in collection:
    print(adata.shape)
# Access specific dataset
first_dataset = collection[0]
```
## Concatenation on Disk
For datasets too large for memory, concatenate directly on disk:
```python
from anndata.experimental import concat_on_disk
# Concatenate without loading into memory
concat_on_disk(
['data1.h5ad', 'data2.h5ad', 'data3.h5ad'],
'combined.h5ad',
join='outer'
)
# Load result in backed mode
adata = ad.read_h5ad('combined.h5ad', backed='r')
```
## Common Concatenation Patterns
### Combine technical replicates
```python
# Multiple runs of the same samples
replicates = [adata_run1, adata_run2, adata_run3]
adata_combined = ad.concat(
replicates,
label='technical_replicate',
keys=['rep1', 'rep2', 'rep3'],
join='inner' # Keep only genes measured in all runs
)
```
### Combine batches from experiment
```python
# Different experimental batches
batches = [adata_batch1, adata_batch2, adata_batch3]
adata_combined = ad.concat(
batches,
label='batch',
keys=['batch1', 'batch2', 'batch3'],
join='outer' # Keep all genes
)
# Later: apply batch correction
```
### Merge multi-modal data
```python
# Different measurement modalities (e.g., RNA + protein)
adata_rna = ad.AnnData(X=np.random.rand(100, 2000))
adata_protein = ad.AnnData(X=np.random.rand(100, 50))
# Concatenate along variables to combine modalities
adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1)
# Add labels to distinguish modalities
adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50
```
## Best Practices
1. **Check compatibility before concatenating**
```python
# Verify shapes are compatible
print([adata.n_vars for adata in [adata1, adata2, adata3]])
# Check variable names match
print([set(adata.var_names) for adata in [adata1, adata2, adata3]])
```

2. **Use appropriate join type**
2. **Use appropriate join type**
- `inner`: When you need the same features across all samples (most stringent)
- `outer`: When you want to preserve all features (most inclusive)
3. **Track data sources**
Always use `label` and `keys` to track which observations came from which dataset.
4. **Consider memory usage**
- For large datasets, use `AnnCollection` or `concat_on_disk`
- Consider backed mode for the result
5. **Handle batch effects**
Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation:
```python
# After concatenation, apply batch correction
import scanpy as sc
sc.pp.combat(adata_combined, key='batch')
```
6. **Validate results**
```python
# Check dimensions
print(adata_combined.shape)
# Check batch distribution
print(adata_combined.obs['batch'].value_counts())
# Verify metadata integrity
print(adata_combined.var.head())
print(adata_combined.obs.head())
```


@@ -1,478 +0,0 @@
# AnnData Concatenation Guide
## Overview
The `concat()` function combines multiple AnnData objects through two fundamental operations:
1. **Concatenation**: Stacking sub-elements in order
2. **Merging**: Combining collections into one result
## Basic Concatenation
### Syntax
```python
import anndata as ad
combined = ad.concat(
adatas, # List of AnnData objects
axis=0, # 0=observations, 1=variables
join='inner', # 'inner' or 'outer'
merge=None, # Merge strategy for non-concat axis
label=None, # Column name for source tracking
keys=None, # Dataset identifiers
index_unique=None, # Separator for unique indices
fill_value=None, # Fill value for missing data
pairwise=False # Include pairwise matrices
)
```
### Concatenating Observations (Cells)
```python
# Most common: combining multiple samples/batches
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 2000))
adata3 = ad.AnnData(np.random.rand(80, 2000))
combined = ad.concat([adata1, adata2, adata3], axis=0)
# Result: (330 observations, 2000 variables)
```
### Concatenating Variables (Genes)
```python
# Less common: combining different feature sets
adata1 = ad.AnnData(np.random.rand(100, 1000))
adata2 = ad.AnnData(np.random.rand(100, 500))
combined = ad.concat([adata1, adata2], axis=1)
# Result: (100 observations, 1500 variables)
```
## Join Strategies
### Inner Join (Intersection)
Keeps only shared features across all objects.
```python
# Datasets with different genes
adata1 = ad.AnnData(
np.random.rand(100, 2000),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(2000)])
)
adata2 = ad.AnnData(
np.random.rand(150, 1800),
var=pd.DataFrame(index=[f'Gene_{i}' for i in range(200, 2000)])
)
# Inner join: only genes present in both
combined = ad.concat([adata1, adata2], join='inner')
# Result: (250 observations, 1800 variables)
# Only Gene_200 through Gene_1999
```
**Use when:**
- You want to analyze only features measured in all datasets
- Missing features would compromise analysis
- You need a complete case analysis
**Trade-offs:**
- May lose many features
- Ensures no missing data
- Smaller result size
### Outer Join (Union)
Keeps all features from all objects, padding with fill values (default 0).
```python
# Outer join: all genes from both datasets
combined = ad.concat([adata1, adata2], join='outer')
# Result: (250 observations, 2000 variables)
# Missing values filled with 0
# Custom fill value
combined = ad.concat([adata1, adata2], join='outer', fill_value=np.nan)
```
**Use when:**
- You want to preserve all features
- Sparse data is acceptable
- Features are independent
**Trade-offs:**
- Introduces zeros/missing values
- Larger result size
- May need imputation
## Merge Strategies
Merge strategies control how elements on the non-concatenation axis are combined.
### merge=None (Default)
Excludes all non-concatenation axis elements.
```python
# Both datasets have var annotations
adata1.var['gene_type'] = ['protein_coding'] * 2000
adata2.var['gene_type'] = ['protein_coding'] * 1800
# merge=None: var annotations excluded
combined = ad.concat([adata1, adata2], merge=None)
assert 'gene_type' not in combined.var.columns
```
**Use when:**
- Annotations are dataset-specific
- You'll add new annotations after merging
### merge='same'
Keeps only annotations with identical values across datasets.
```python
# Same annotation values
adata1.var['chromosome'] = ['chr1'] * 1000 + ['chr2'] * 1000
adata2.var['chromosome'] = ['chr1'] * 800 + ['chr2'] * 1000  # matches adata1 on the shared genes
# merge='same': keeps chromosome annotation
combined = ad.concat([adata1, adata2], merge='same')
assert 'chromosome' in combined.var.columns
```
**Use when:**
- Annotations should be consistent
- You want to validate consistency
- Shared metadata is important
**Note:** Comparison occurs after index alignment - only shared indices need to match.
### merge='unique'
Includes annotations with a single possible value.
```python
# Unique values per gene
adata1.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(2000)]
adata2.var['ensembl_id'] = [f'ENSG{i:08d}' for i in range(200, 2000)]  # adata2 has 1800 genes
# merge='unique': keeps ensembl_id
combined = ad.concat([adata1, adata2], merge='unique')
```
**Use when:**
- Each feature has a unique identifier
- Annotations are feature-specific
### merge='first'
Takes the first occurrence of each annotation.
```python
# Different annotation versions
adata1.var['description'] = ['desc1'] * 2000
adata2.var['description'] = ['desc2'] * 1800
# merge='first': uses adata1's descriptions
combined = ad.concat([adata1, adata2], merge='first')
# Uses descriptions from adata1
```
**Use when:**
- One dataset has authoritative annotations
- Order matters
- You need a simple resolution strategy
### merge='only'
Retains annotations appearing in exactly one object.
```python
# Dataset-specific annotations
adata1.var['dataset1_specific'] = ['value'] * 2000
adata2.var['dataset2_specific'] = ['value'] * 1800
# merge='only': keeps both (no conflicts)
combined = ad.concat([adata1, adata2], merge='only')
```
**Use when:**
- Datasets have non-overlapping annotations
- You want to preserve all unique metadata
## Source Tracking
### Using label
Add a categorical column to track data origin.
```python
combined = ad.concat(
[adata1, adata2, adata3],
label='batch'
)
# Creates obs['batch'] with values 0, 1, 2
print(combined.obs['batch'].cat.categories) # ['0', '1', '2']
```
### Using keys
Provide custom names for source tracking.
```python
combined = ad.concat(
[adata1, adata2, adata3],
label='study',
keys=['control', 'treatment_a', 'treatment_b']
)
# Creates obs['study'] with custom names
print(combined.obs['study'].unique()) # ['control', 'treatment_a', 'treatment_b']
```
### Making Indices Unique
Append source identifiers to duplicate observation names.
```python
# Both datasets have cells named "Cell_0", "Cell_1", etc.
adata1.obs_names = [f'Cell_{i}' for i in range(100)]
adata2.obs_names = [f'Cell_{i}' for i in range(150)]
# index_unique adds suffix
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
# Results in: Cell_0-batch1, Cell_0-batch2, etc.
print(combined.obs_names[:5])
```
## Handling Different Attributes
### X Matrix and Layers
Follows join strategy. Missing values filled according to `fill_value`.
```python
# Both have layers
adata1.layers['counts'] = adata1.X.copy()
adata2.layers['counts'] = adata2.X.copy()
# Concatenates both X and layers
combined = ad.concat([adata1, adata2])
assert 'counts' in combined.layers
```
### obs and var DataFrames
- **obs**: Concatenated along concatenation axis
- **var**: Handled by merge strategy
```python
adata1.obs['cell_type'] = ['B cell'] * 100
adata2.obs['cell_type'] = ['T cell'] * 150
combined = ad.concat([adata1, adata2])
# obs['cell_type'] preserved for all cells
```
### obsm and varm
Multi-dimensional annotations follow same rules as layers.
```python
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)
combined = ad.concat([adata1, adata2])
# obsm['X_pca'] concatenated: shape (250, 50)
```
### obsp and varp
Pairwise matrices excluded by default. Enable with `pairwise=True`.
```python
# Distance matrices
adata1.obsp['distances'] = np.random.rand(100, 100)
adata2.obsp['distances'] = np.random.rand(150, 150)
# Excluded by default
combined = ad.concat([adata1, adata2])
assert 'distances' not in combined.obsp
# Include if needed
combined = ad.concat([adata1, adata2], pairwise=True)
# Results in padded block diagonal matrix
```
### uns Dictionary
Merged recursively, applying merge strategy at any nesting depth.
```python
adata1.uns['experiment'] = {'date': '2024-01', 'lab': 'A'}
adata2.uns['experiment'] = {'date': '2024-02', 'lab': 'A'}
# merge='same' keeps 'lab', excludes 'date'
combined = ad.concat([adata1, adata2], merge='same')
# combined.uns['experiment'] = {'lab': 'A'}
```
## Advanced Patterns
### Batch Integration Pipeline
```python
import anndata as ad
# Load batches
batches = [
ad.read_h5ad(f'batch_{i}.h5ad')
for i in range(5)
]
# Concatenate with tracking
combined = ad.concat(
batches,
axis=0,
join='outer',
merge='first',
label='batch_id',
keys=[f'batch_{i}' for i in range(5)],
index_unique='-'
)
# Add batch effects
combined.obs['batch_numeric'] = combined.obs['batch_id'].cat.codes
```
### Multi-Study Meta-Analysis
```python
# Different studies with varying gene coverage
studies = {
'study_a': ad.read_h5ad('study_a.h5ad'),
'study_b': ad.read_h5ad('study_b.h5ad'),
'study_c': ad.read_h5ad('study_c.h5ad')
}
# Outer join to keep all genes
combined = ad.concat(
list(studies.values()),
axis=0,
join='outer',
label='study',
keys=list(studies.keys()),
merge='unique',
fill_value=0
)
# Track coverage
for study in studies:
n_genes = studies[study].n_vars
combined.uns[f'{study}_n_genes'] = n_genes
```
### Incremental Concatenation
```python
# For many datasets, concatenate in batches
chunk_size = 10
all_files = [f'dataset_{i}.h5ad' for i in range(100)]
# Process in chunks
result = None
for i in range(0, len(all_files), chunk_size):
chunk_files = all_files[i:i+chunk_size]
chunk_adatas = [ad.read_h5ad(f) for f in chunk_files]
chunk_combined = ad.concat(chunk_adatas)
if result is None:
result = chunk_combined
else:
result = ad.concat([result, chunk_combined])
```
### Memory-Efficient On-Disk Concatenation
```python
# Experimental feature for large datasets
from anndata.experimental import concat_on_disk
files = ['dataset1.h5ad', 'dataset2.h5ad', 'dataset3.h5ad']
concat_on_disk(
files,
'combined.h5ad',
join='outer'
)
# Read result in backed mode
combined = ad.read_h5ad('combined.h5ad', backed='r')
```
## Troubleshooting
### Issue: Dimension Mismatch
```python
# Error: shapes don't match
adata1 = ad.AnnData(np.random.rand(100, 2000))
adata2 = ad.AnnData(np.random.rand(150, 1500))
# Solution: use outer join
combined = ad.concat([adata1, adata2], join='outer')
```
### Issue: Memory Error
```python
# Problem: too many large objects in memory
large_adatas = [ad.read_h5ad(f) for f in many_files]
# Solution: read and concatenate incrementally
result = None
for file in many_files:
adata = ad.read_h5ad(file)
if result is None:
result = adata
else:
result = ad.concat([result, adata])
del adata # Free memory
```
### Issue: Duplicate Indices
```python
# Problem: same cell names in different batches
# Solution: use index_unique
combined = ad.concat(
[adata1, adata2],
keys=['batch1', 'batch2'],
index_unique='-'
)
```
### Issue: Lost Annotations
```python
# Problem: annotations disappear
adata1.var['important'] = values1
adata2.var['important'] = values2
combined = ad.concat([adata1, adata2]) # merge=None by default
# Solution: use appropriate merge strategy
combined = ad.concat([adata1, adata2], merge='first')
```
## Performance Tips
1. **Pre-align indices**: Ensure consistent naming before concatenation
2. **Use sparse matrices**: Convert to sparse before concatenating
3. **Batch operations**: Concatenate in groups for many datasets
4. **Choose inner join**: When possible, to reduce result size
5. **Use categoricals**: Convert string annotations before concatenating
6. **Consider on-disk**: For very large datasets, use `concat_on_disk`
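A minimal sketch putting several of these tips together before a large concatenation (the file names are placeholders, not part of any API):
```python
import anndata as ad
from scipy.sparse import csr_matrix

# Hypothetical inputs
adatas = [ad.read_h5ad(f'sample_{i}.h5ad') for i in range(3)]

for a in adatas:
    a.var_names_make_unique()           # tip 1: clean, unique indices
    if not isinstance(a.X, csr_matrix):
        a.X = csr_matrix(a.X)           # tip 2: sparse storage
    a.strings_to_categoricals()         # tip 5: categorical annotations

# tips 3-4: one batched call, inner join keeps the result small
combined = ad.concat(adatas, join='inner', label='sample')
```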

# AnnData Object Structure
The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata.
## Core Components
### X (Data Matrix)
The primary data matrix with shape (n_obs, n_vars) storing experimental measurements.
```python
import anndata as ad
import numpy as np
# Create with dense array
adata = ad.AnnData(X=np.random.rand(100, 2000))
# Create with sparse matrix (recommended for large, sparse data)
from scipy.sparse import csr_matrix
sparse_data = csr_matrix(np.random.rand(100, 2000))
adata = ad.AnnData(X=sparse_data)
```
Access data:
```python
# Full matrix (caution with large datasets)
full_data = adata.X
# Single observation
obs_data = adata.X[0, :]
# Single variable across all observations
var_data = adata.X[:, 0]
```
### obs (Observation Annotations)
DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X.
```python
import pandas as pd
# Create AnnData with observation metadata
obs_df = pd.DataFrame({
'cell_type': ['T cell', 'B cell', 'Monocyte'],
'treatment': ['control', 'treated', 'control'],
'timepoint': [0, 24, 24]
}, index=['cell_1', 'cell_2', 'cell_3'])
adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)
# Access observation metadata
print(adata.obs['cell_type'])
print(adata.obs.loc['cell_1'])
```
### var (Variable Annotations)
DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X.
```python
# Create AnnData with variable metadata
var_df = pd.DataFrame({
'gene_name': ['ACTB', 'GAPDH', 'TP53'],
'chromosome': ['7', '12', '17'],
'highly_variable': [True, False, True]
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])
adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)
# Access variable metadata
print(adata.var['gene_name'])
print(adata.var.loc['ENSG00001'])
```
### layers (Alternative Data Representations)
Dictionary storing alternative matrices with the same dimensions as X.
```python
# Store raw counts, normalized data, and scaled data
adata = ad.AnnData(X=np.random.rand(100, 2000))
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std()
# Access layers
raw_data = adata.layers['raw_counts']
normalized_data = adata.layers['normalized']
```
Common layer uses:
- `raw_counts`: Original count data before normalization
- `normalized`: Log-normalized or TPM values
- `scaled`: Z-scored values for analysis
- `imputed`: Data after imputation
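For example, a common pattern keeps the raw counts in a layer while `X` holds the working representation; `to_df(layer=...)` pulls any layer out as a labeled DataFrame:
```python
import numpy as np

# Preserve raw counts, then work with log1p-transformed values in X
adata.layers['counts'] = adata.X.copy()
adata.X = np.log1p(adata.X)

# Extract a layer as a DataFrame indexed by obs_names / var_names
counts_df = adata.to_df(layer='counts')
```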
### obsm (Multi-dimensional Observation Annotations)
Dictionary storing multi-dimensional arrays aligned to observations.
```python
# Store PCA coordinates and UMAP embeddings
adata.obsm['X_pca'] = np.random.rand(100, 50) # 50 principal components
adata.obsm['X_umap'] = np.random.rand(100, 2) # 2D UMAP coordinates
adata.obsm['X_tsne'] = np.random.rand(100, 2) # 2D t-SNE coordinates
# Access embeddings
pca_coords = adata.obsm['X_pca']
umap_coords = adata.obsm['X_umap']
```
Common obsm uses:
- `X_pca`: Principal component coordinates
- `X_umap`: UMAP embedding coordinates
- `X_tsne`: t-SNE embedding coordinates
- `X_diffmap`: Diffusion map coordinates
- `protein_expression`: Protein abundance measurements (CITE-seq)
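As a sketch of how these embeddings are typically consumed (assuming matplotlib is installed; it is not an anndata dependency), the coordinates can be pulled straight out of `obsm` for plotting:
```python
import matplotlib.pyplot as plt

umap_coords = adata.obsm['X_umap']
plt.scatter(umap_coords[:, 0], umap_coords[:, 1], s=5)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.show()
```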
### varm (Multi-dimensional Variable Annotations)
Dictionary storing multi-dimensional arrays aligned to variables.
```python
# Store PCA loadings
adata.varm['PCs'] = np.random.rand(2000, 50) # Loadings for 50 components
adata.varm['gene_modules'] = np.random.rand(2000, 10) # Gene module scores
# Access loadings
pc_loadings = adata.varm['PCs']
```
Common varm uses:
- `PCs`: Principal component loadings
- `gene_modules`: Gene co-expression module assignments
### obsp (Pairwise Observation Relationships)
Dictionary storing sparse matrices representing relationships between observations.
```python
from scipy.sparse import csr_matrix
# Store k-nearest neighbor graph
n_obs = 100
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
# Access graphs
knn_connections = adata.obsp['connectivities']
distances = adata.obsp['distances']
```
Common obsp uses:
- `connectivities`: Cell-cell neighborhood graph
- `distances`: Pairwise distances between cells
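For example, the neighbors of a single cell can be read directly off the sparse connectivity matrix stored above:
```python
# Column indices of cells connected to cell 0 in the kNN graph
neighbors_of_cell0 = adata.obsp['connectivities'][0].nonzero()[1]
print(neighbors_of_cell0)
```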
### varp (Pairwise Variable Relationships)
Dictionary storing sparse matrices representing relationships between variables.
```python
# Store gene-gene correlation matrix
n_vars = 2000
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
adata.varp['correlations'] = gene_corr
# Access correlations
gene_correlations = adata.varp['correlations']
```
### uns (Unstructured Annotations)
Dictionary storing arbitrary unstructured metadata.
```python
# Store analysis parameters and results
adata.uns['experiment_date'] = '2025-11-03'
adata.uns['pca'] = {
'variance_ratio': [0.15, 0.10, 0.08],
'params': {'n_comps': 50}
}
adata.uns['neighbors'] = {
'params': {'n_neighbors': 15, 'method': 'umap'},
'connectivities_key': 'connectivities'
}
# Access unstructured data
exp_date = adata.uns['experiment_date']
pca_params = adata.uns['pca']['params']
```
Common uns uses:
- Analysis parameters and settings
- Color palettes for plotting
- Cluster information
- Tool-specific metadata
### raw (Original Data Snapshot)
Optional attribute preserving the original data matrix and variable annotations before filtering.
```python
# Create AnnData and store raw state
adata = ad.AnnData(X=np.random.rand(100, 5000))
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]
# Store raw state before filtering
adata.raw = adata.copy()
# Filter to highly variable genes
highly_variable_mask = np.random.rand(5000) > 0.5
adata = adata[:, highly_variable_mask]
# Access original data
original_matrix = adata.raw.X
original_var = adata.raw.var
```
## Object Properties
```python
# Dimensions
n_observations = adata.n_obs
n_variables = adata.n_vars
shape = adata.shape # (n_obs, n_vars)
# Index information
obs_names = adata.obs_names # Observation identifiers
var_names = adata.var_names # Variable identifiers
# Storage mode
is_view = adata.is_view # True if this is a view of another object
is_backed = adata.isbacked # True if backed by on-disk storage
filename = adata.filename # Path to backing file (if backed)
```
## Creating AnnData Objects
### From arrays and DataFrames
```python
import anndata as ad
import numpy as np
import pandas as pd
# Minimal creation
X = np.random.rand(100, 2000)
adata = ad.AnnData(X)
# With metadata
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)
# With all components
adata = ad.AnnData(
X=X,
obs=obs,
var=var,
layers={'raw': np.random.randint(0, 100, (100, 2000))},
obsm={'X_pca': np.random.rand(100, 50)},
uns={'experiment': 'test'}
)
```
### From DataFrame
```python
# Create from pandas DataFrame (genes as columns, cells as rows)
df = pd.DataFrame(
np.random.rand(100, 50),
columns=[f'Gene_{i}' for i in range(50)],
index=[f'Cell_{i}' for i in range(100)]
)
adata = ad.AnnData(df)
```
## Data Access Patterns
### Vector extraction
```python
# Get observation annotation as array
cell_types = adata.obs_vector('cell_type')
# Get variable values across observations
gene_expression = adata.obs_vector('ACTB') # If ACTB is in var_names
# Get variable annotation as array
gene_names = adata.var_vector('gene_name')
```
### Subsetting
```python
# By index
subset = adata[0:10, 0:100] # First 10 obs, first 100 vars
# By name
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]
# By boolean mask
high_count_cells = adata.obs['total_counts'] > 1000
subset = adata[high_count_cells, :]
# By observation metadata
t_cells = adata[adata.obs['cell_type'] == 'T cell']
```
## Memory Considerations
The AnnData structure is designed for memory efficiency:
- Sparse matrices reduce memory for sparse data
- Views avoid copying data when possible
- Backed mode enables working with data larger than RAM
- Categorical annotations reduce memory for discrete values
```python
# Convert strings to categoricals (more memory efficient)
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
adata.strings_to_categoricals()
# Check if object is a view (doesn't own data)
if adata.is_view:
adata = adata.copy() # Create independent copy
```

# Input/Output Operations
AnnData provides comprehensive I/O functionality for reading and writing data in various formats.
## Native Formats
### H5AD (HDF5-based)
The recommended native format for AnnData objects, providing efficient storage and fast access.
#### Writing H5AD files
```python
import anndata as ad
# Write to file
adata.write_h5ad('data.h5ad')
# Write with compression
adata.write_h5ad('data.h5ad', compression='gzip')
# Write with specific compression level (0-9, higher = more compression)
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
```
#### Reading H5AD files
```python
# Read entire file into memory
adata = ad.read_h5ad('data.h5ad')
# Read in backed mode (lazy loading for large files)
adata = ad.read_h5ad('data.h5ad', backed='r') # Read-only
adata = ad.read_h5ad('data.h5ad', backed='r+') # Read-write
# Backed mode enables working with datasets larger than RAM
# Only accessed data is loaded into memory
```
#### Backed mode operations
```python
# Open in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Access metadata without loading X into memory
print(adata.obs.head())
print(adata.var.head())
# Subset operations create views
subset = adata[:100, :500] # View, no data loaded
# Load specific data into memory
X_subset = subset.X[:] # Now loads this subset
# Convert entire backed object to memory
adata_memory = adata.to_memory()
```
### Zarr
Hierarchical array storage format, optimized for cloud storage and parallel I/O.
#### Writing Zarr
```python
# Write to Zarr store
adata.write_zarr('data.zarr')
# Write with specific chunks (important for performance)
adata.write_zarr('data.zarr', chunks=(100, 100))
```
#### Reading Zarr
```python
# Read Zarr store
adata = ad.read_zarr('data.zarr')
```
#### Remote Zarr access
```python
import fsspec
# Access Zarr from S3
store = fsspec.get_mapper('s3://bucket-name/data.zarr')
adata = ad.read_zarr(store)
# Access Zarr from URL
store = fsspec.get_mapper('https://example.com/data.zarr')
adata = ad.read_zarr(store)
```
## Alternative Input Formats
### CSV/TSV
```python
# Read CSV (genes as columns, cells as rows)
adata = ad.read_csv('data.csv')
# Read with custom delimiter
adata = ad.read_csv('data.tsv', delimiter='\t')
# Specify that first column is row names
adata = ad.read_csv('data.csv', first_column_names=True)
```
### Excel
```python
# Read Excel file
adata = ad.read_excel('data.xlsx')
# Read specific sheet
adata = ad.read_excel('data.xlsx', sheet='Sheet1')
```
### Matrix Market (MTX)
Common format for sparse matrices in genomics.
```python
# Read MTX with associated files
# Requires: matrix.mtx, genes.tsv, barcodes.tsv
adata = ad.read_mtx('matrix.mtx')
# read_mtx loads only the matrix; transpose if genes are stored as rows (common for MTX)
adata = adata.T
# Attach gene and barcode names from the companion files
import pandas as pd
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0].values
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0].values
```
### 10X Genomics formats
```python
# The 10X readers live in scanpy rather than anndata
import scanpy as sc
# Read 10X h5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# Read 10X MTX directory
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
# Specify genome if multiple are present
adata = sc.read_10x_h5('data.h5', genome='GRCh38')
```
### Loom
```python
# Read Loom file
adata = ad.read_loom('data.loom')
# Read with specific observation and variable annotations
adata = ad.read_loom(
'data.loom',
obs_names='CellID',
var_names='Gene'
)
```
### Text files
```python
# Read generic text file
adata = ad.read_text('data.txt', delimiter='\t')
# Read with custom parameters
adata = ad.read_text(
'data.txt',
delimiter=',',
first_column_names=True,
dtype='float32'
)
```
### UMI tools
```python
# Read UMI tools format
adata = ad.read_umi_tools('counts.tsv')
```
### HDF5 (generic)
```python
# Read from HDF5 file (not h5ad format)
adata = ad.read_hdf('data.h5', key='dataset')
```
## Alternative Output Formats
### CSV
```python
# Write to CSV files (creates multiple files)
adata.write_csvs('output_dir/')
# This creates (among others):
# - output_dir/obs.csv (observation annotations)
# - output_dir/var.csv (variable annotations)
# - output_dir/uns/... (unstructured annotations, where representable)
# X is skipped by default (skip_data=True); write it explicitly if needed
adata.write_csvs('output_dir/', skip_data=False)  # Also writes X.csv
```
### Loom
```python
# Write to Loom format
adata.write_loom('output.loom')
```
## Reading Specific Elements
For fine-grained control, read specific elements from storage:
```python
import h5py
from anndata import read_elem

# read_elem takes an open HDF5/Zarr group or dataset, not a path string
with h5py.File('data.h5ad', 'r') as f:
    # Read just the observation annotations
    obs = read_elem(f['obs'])
    # Read a specific layer
    layer = read_elem(f['layers/normalized'])
    # Read an unstructured data element
    params = read_elem(f['uns/pca_params'])
```
## Writing Specific Elements
```python
from anndata import write_elem
import h5py
# Write an element into an existing file, e.g. add a copy of X as a new layer
with h5py.File('data.h5ad', 'a') as f:
    write_elem(f['layers'], 'X_copy', adata.X.copy())
```
## Lazy Operations
For very large datasets, use lazy reading to avoid loading entire datasets:
```python
import h5py
from anndata.experimental import read_elem_lazy  # named read_elem_as_dask in older releases

# Keep the file handle open while the lazy array is in use
f = h5py.File('large_data.h5ad', 'r')
X_lazy = read_elem_lazy(f['X'])  # returns a dask array
# Compute only when needed
subset = X_lazy[:100, :100].compute()
f.close()
```
## Common I/O Patterns
### Convert between formats
```python
# MTX to H5AD
adata = ad.read_mtx('matrix.mtx').T
adata.write_h5ad('data.h5ad')
# CSV to H5AD
adata = ad.read_csv('data.csv')
adata.write_h5ad('data.h5ad')
# H5AD to Zarr
adata = ad.read_h5ad('data.h5ad')
adata.write_zarr('data.zarr')
```
### Load metadata without data
```python
# Backed mode allows inspecting metadata without loading X
adata = ad.read_h5ad('large_file.h5ad', backed='r')
print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables")
print(adata.obs.columns)
print(adata.var.columns)
# X is not loaded into memory
```
### Append to existing file
```python
# Open in read-write mode
adata = ad.read_h5ad('data.h5ad', backed='r+')
# Modify metadata in memory
adata.obs['new_column'] = values
# Persist the changes back to the backing file
adata.write()
```
### Download from URL
```python
import anndata as ad
# read_h5ad expects a local path, so download the file first
import urllib.request
url = 'https://example.com/data.h5ad'
urllib.request.urlretrieve(url, 'local_file.h5ad')
adata = ad.read_h5ad('local_file.h5ad')
# For streaming remote access, prefer a Zarr store (see the remote Zarr example above)
```
## Performance Tips
### Reading
- Use `backed='r'` for large files you only need to query
- Use `backed='r+'` if you need to modify metadata without loading all data
- H5AD format is generally fastest for random access
- Zarr is better for cloud storage and parallel access
- Consider compression for storage, but note it may slow down reading
### Writing
- Use compression for long-term storage: `compression='gzip'` or `compression='lzf'`
- LZF compression is faster but compresses less than GZIP
- For Zarr, tune chunk sizes based on access patterns:
- Larger chunks for sequential reads
- Smaller chunks for random access
- Convert string columns to categorical before writing (smaller files)
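For example, Zarr chunk shapes can be matched to the expected access pattern (the output paths and chunk sizes below are illustrative):
```python
# Cell-oriented access (iterating over observations): chunk along obs
adata.write_zarr('by_cells.zarr', chunks=(1000, adata.n_vars))

# Gene-oriented access (pulling individual variables): chunk along vars
adata.write_zarr('by_genes.zarr', chunks=(adata.n_obs, 200))
```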
### Memory management
```python
# Convert strings to categoricals (reduces file size and memory)
adata.strings_to_categoricals()
adata.write_h5ad('data.h5ad')
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
if isinstance(adata.X, np.ndarray):
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.5: # If more than 50% zeros
adata.X = csr_matrix(adata.X)
```
## Handling Large Datasets
### Strategy 1: Backed mode
```python
# Work with dataset larger than RAM
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
# Filter based on metadata (fast, no data loading)
filtered = adata[adata.obs['quality_score'] > 0.8]
# Load filtered subset into memory
adata_memory = filtered.to_memory()
```
### Strategy 2: Chunked processing
```python
# Process data in chunks
adata = ad.read_h5ad('large_file.h5ad', backed='r')
chunk_size = 1000
results = []
for i in range(0, adata.n_obs, chunk_size):
chunk = adata[i:i+chunk_size, :].to_memory()
# Process chunk
result = process(chunk)
results.append(result)
```
### Strategy 3: Use AnnCollection
```python
from anndata.experimental import AnnCollection
# Create a collection from backed AnnData objects (matrices stay on disk)
adatas = [ad.read_h5ad(f'dataset_{i}.h5ad', backed='r') for i in range(10)]
collection = AnnCollection(
    adatas,
    join_obs='inner',
    join_vars='inner'
)
# Process collection lazily
# Data is loaded only when accessed
```
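A sketch of consuming the collection lazily in mini-batches (the batch size is arbitrary; each indexed view materializes only the rows it touches):
```python
batch_size = 128
for start in range(0, collection.n_obs, batch_size):
    view = collection[start:start + batch_size]
    X_batch = view.X  # loaded from disk only for this slice
    # ... process X_batch ...
```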
## Common Issues and Solutions
### Issue: Out of memory when reading
**Solution**: Use backed mode or read in chunks
```python
adata = ad.read_h5ad('file.h5ad', backed='r')
```
### Issue: Slow reading from cloud storage
**Solution**: Use Zarr format with appropriate chunking
```python
adata.write_zarr('data.zarr', chunks=(1000, 1000))
```
### Issue: Large file sizes
**Solution**: Use compression and convert to sparse/categorical
```python
adata.strings_to_categoricals()
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
adata.write_h5ad('compressed.h5ad', compression='gzip')
```
### Issue: Cannot modify backed object
**Solution**: Either load to memory or open in 'r+' mode
```python
# Option 1: Load to memory
adata = adata.to_memory()
# Option 2: Open in read-write mode
adata = ad.read_h5ad('file.h5ad', backed='r+')
```

# Data Manipulation
Operations for transforming, subsetting, and manipulating AnnData objects.
## Subsetting
### By indices
```python
import anndata as ad
import numpy as np
adata = ad.AnnData(X=np.random.rand(1000, 2000))
# Integer indices
subset = adata[0:100, 0:500] # First 100 obs, first 500 vars
# List of indices
obs_indices = [0, 10, 20, 30, 40]
var_indices = [0, 1, 2, 3, 4]
subset = adata[obs_indices, var_indices]
# Single observation or variable
single_obs = adata[0, :]
single_var = adata[:, 0]
```
### By names
```python
import pandas as pd
# Create with named indices
obs_names = [f'cell_{i}' for i in range(1000)]
var_names = [f'gene_{i}' for i in range(2000)]
adata = ad.AnnData(
X=np.random.rand(1000, 2000),
obs=pd.DataFrame(index=obs_names),
var=pd.DataFrame(index=var_names)
)
# Subset by observation names
subset = adata[['cell_0', 'cell_1', 'cell_2'], :]
# Subset by variable names
subset = adata[:, ['gene_0', 'gene_10', 'gene_20']]
# Both axes
subset = adata[['cell_0', 'cell_1'], ['gene_0', 'gene_1']]
```
### By boolean masks
```python
# Create boolean masks
high_count_obs = np.random.rand(1000) > 0.5
high_var_genes = np.random.rand(2000) > 0.7
# Subset using masks
subset = adata[high_count_obs, :]
subset = adata[:, high_var_genes]
subset = adata[high_count_obs, high_var_genes]
```
### By metadata conditions
```python
# Add metadata
adata.obs['cell_type'] = np.random.choice(['A', 'B', 'C'], 1000)
adata.obs['quality_score'] = np.random.rand(1000)
adata.var['highly_variable'] = np.random.rand(2000) > 0.8
# Filter by cell type
t_cells = adata[adata.obs['cell_type'] == 'A']
# Filter by multiple conditions
high_quality_a_cells = adata[
(adata.obs['cell_type'] == 'A') &
(adata.obs['quality_score'] > 0.7)
]
# Filter by variable metadata
hv_genes = adata[:, adata.var['highly_variable']]
# Complex conditions
filtered = adata[
(adata.obs['quality_score'] > 0.5) &
(adata.obs['cell_type'].isin(['A', 'B'])),
adata.var['highly_variable']
]
```
## Transposition
```python
# Transpose AnnData object (swap observations and variables)
adata_T = adata.T
# Shape changes
print(adata.shape) # (1000, 2000)
print(adata_T.shape) # (2000, 1000)
# obs and var are swapped
print(adata.obs.head()) # Observation metadata
print(adata_T.var.head()) # Same data, now as variable metadata
# Useful when data is in opposite orientation
# Common with some file formats where genes are rows
```
## Copying
### Full copy
```python
# Create independent copy
adata_copy = adata.copy()
# Modifications to copy don't affect original
adata_copy.obs['new_column'] = 1
print('new_column' in adata.obs.columns) # False
```
### Shallow copy
```python
# View (doesn't copy data, modifications affect original)
adata_view = adata[0:100, :]
# Check if object is a view
print(adata_view.is_view) # True
# Convert view to independent copy
adata_independent = adata_view.copy()
print(adata_independent.is_view) # False
```
## Renaming
### Rename observations and variables
```python
# Rename all observations
adata.obs_names = [f'new_cell_{i}' for i in range(adata.n_obs)]
# Rename all variables
adata.var_names = [f'new_gene_{i}' for i in range(adata.n_vars)]
# Make names unique (add suffix to duplicates)
adata.obs_names_make_unique()
adata.var_names_make_unique()
```
### Rename categories
```python
# Create categorical column
adata.obs['cell_type'] = pd.Categorical(['A', 'B', 'C'] * 333 + ['A'])
# Rename categories (pass a list of new names, one per existing category, in order)
adata.rename_categories('cell_type', ['Type_A', 'Type_B', 'Type_C'])
# rename_categories only accepts list-likes; for dictionary-style renaming use pandas
adata.obs['cell_type'] = adata.obs['cell_type'].cat.rename_categories({
    'Type_A': 'T_cell',
    'Type_B': 'B_cell',
    'Type_C': 'Monocyte'
})
```
## Type Conversions
### Strings to categoricals
```python
# Convert string columns to categorical (more memory efficient)
adata.obs['cell_type'] = ['TypeA', 'TypeB'] * 500
adata.obs['tissue'] = ['brain', 'liver'] * 500
# Convert all string columns to categorical
adata.strings_to_categoricals()
print(adata.obs['cell_type'].dtype) # category
print(adata.obs['tissue'].dtype) # category
```
### Sparse to dense and vice versa
```python
from scipy.sparse import csr_matrix
# Dense to sparse
if not isinstance(adata.X, csr_matrix):
adata.X = csr_matrix(adata.X)
# Sparse to dense
if isinstance(adata.X, csr_matrix):
adata.X = adata.X.toarray()
# Convert layer
adata.layers['normalized'] = csr_matrix(adata.layers['normalized'])
```
## Chunked Operations
Process large datasets in chunks:
```python
# Iterate through X in chunks; chunked_X yields (chunk, start, end) tuples
chunk_size = 100
for chunk, start, end in adata.chunked_X(chunk_size):
    # Process rows [start, end) of the matrix
    result = process_chunk(chunk)
```
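For example, per-gene totals can be accumulated chunk by chunk without ever holding the full matrix in memory (a small sketch building on the loop above):
```python
import numpy as np

gene_totals = np.zeros(adata.n_vars)
for chunk, start, end in adata.chunked_X(1000):
    # chunk holds rows [start, end); works for dense or sparse chunks
    gene_totals += np.asarray(chunk.sum(axis=0)).ravel()
adata.var['total_counts'] = gene_totals
```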
## Extracting Vectors
### Get observation vectors
```python
# Get observation metadata as array
cell_types = adata.obs_vector('cell_type')
# Get gene expression across observations
actb_expression = adata.obs_vector('ACTB') # If ACTB in var_names
```
### Get variable vectors
```python
# Get variable metadata as array
gene_names = adata.var_vector('gene_name')
```
## Adding/Modifying Data
### Add observations
```python
# Create new observations
new_obs = ad.AnnData(X=np.random.rand(100, adata.n_vars))
new_obs.var_names = adata.var_names
# Concatenate with existing
adata_extended = ad.concat([adata, new_obs], axis=0)
```
### Add variables
```python
# Create new variables
new_vars = ad.AnnData(X=np.random.rand(adata.n_obs, 100))
new_vars.obs_names = adata.obs_names
# Concatenate with existing
adata_extended = ad.concat([adata, new_vars], axis=1)
```
### Add metadata columns
```python
# Add observation annotation
adata.obs['new_score'] = np.random.rand(adata.n_obs)
# Add variable annotation
adata.var['new_label'] = ['label'] * adata.n_vars
# Add from external data
external_data = pd.read_csv('metadata.csv', index_col=0)
adata.obs['external_info'] = external_data.loc[adata.obs_names, 'column']
```
### Add layers
```python
# Add new layer
adata.layers['raw_counts'] = np.random.randint(0, 100, adata.shape)
adata.layers['log_transformed'] = np.log1p(adata.X)
# Replace layer
adata.layers['normalized'] = new_normalized_data
```
### Add embeddings
```python
# Add PCA
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50)
# Add UMAP
adata.obsm['X_umap'] = np.random.rand(adata.n_obs, 2)
# Add multiple embeddings
adata.obsm['X_tsne'] = np.random.rand(adata.n_obs, 2)
adata.obsm['X_diffmap'] = np.random.rand(adata.n_obs, 10)
```
### Add pairwise relationships
```python
from scipy.sparse import csr_matrix
# Add nearest neighbor graph
n_obs = adata.n_obs
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
# Add distance matrix
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))
```
### Add unstructured data
```python
# Add analysis parameters
adata.uns['pca'] = {
'variance': [0.2, 0.15, 0.1],
'variance_ratio': [0.4, 0.3, 0.2],
'params': {'n_comps': 50}
}
# Add color schemes
adata.uns['cell_type_colors'] = ['#FF0000', '#00FF00', '#0000FF']
```
## Removing Data
### Remove observations or variables
```python
# Keep only specific observations
keep_obs = adata.obs['quality_score'] > 0.5
adata = adata[keep_obs, :]
# Remove specific variables
remove_vars = adata.var['low_count']
adata = adata[:, ~remove_vars]
```
### Remove metadata columns
```python
# Remove observation column
adata.obs.drop('unwanted_column', axis=1, inplace=True)
# Remove variable column
adata.var.drop('unwanted_column', axis=1, inplace=True)
```
### Remove layers
```python
# Remove specific layer
del adata.layers['unwanted_layer']
# Remove all layers
adata.layers = {}
```
### Remove embeddings
```python
# Remove specific embedding
del adata.obsm['X_tsne']
# Remove all embeddings
adata.obsm = {}
```
### Remove unstructured data
```python
# Remove specific key
del adata.uns['unwanted_key']
# Remove all unstructured data
adata.uns = {}
```
## Reordering
### Sort observations
```python
# Sort by observation metadata
adata = adata[adata.obs.sort_values('quality_score').index, :]
# Sort by observation names
adata = adata[sorted(adata.obs_names), :]
```
### Sort variables
```python
# Sort by variable metadata
adata = adata[:, adata.var.sort_values('gene_name').index]
# Sort by variable names
adata = adata[:, sorted(adata.var_names)]
```
### Reorder to match external list
```python
# Reorder observations to match external list
desired_order = ['cell_10', 'cell_5', 'cell_20', ...]
adata = adata[desired_order, :]
# Reorder variables
desired_genes = ['TP53', 'ACTB', 'GAPDH', ...]
adata = adata[:, desired_genes]
```
## Data Transformations
### Normalize
```python
# Total count normalization (CPM/TPM-like)
total_counts = adata.X.sum(axis=1)
adata.layers['normalized'] = adata.X / total_counts[:, np.newaxis] * 1e6
# Log transformation
adata.layers['log1p'] = np.log1p(adata.X)
# Z-score normalization
mean = adata.X.mean(axis=0)
std = adata.X.std(axis=0)
adata.layers['scaled'] = (adata.X - mean) / std
```
### Filter
```python
# Filter cells by total counts
total_counts = np.array(adata.X.sum(axis=1)).flatten()
adata.obs['total_counts'] = total_counts
adata = adata[adata.obs['total_counts'] > 1000, :]
# Filter genes by detection rate
detection_rate = (adata.X > 0).sum(axis=0) / adata.n_obs
adata.var['detection_rate'] = np.array(detection_rate).flatten()
adata = adata[:, adata.var['detection_rate'] > 0.01]
```
## Working with Views
Views are lightweight references to subsets of data that don't copy the underlying matrix:
```python
# Create view
view = adata[0:100, 0:500]
print(view.is_view) # True
# Views allow read access
data = view.X
# Modifying a view triggers copy-on-write: the view becomes an in-memory
# copy (with an ImplicitModificationWarning) and the original is unchanged
# Convert view to independent copy
independent = view.copy()
# Force AnnData to be a copy, not a view
adata = adata.copy()
```
## Merging Metadata
```python
# Merge external metadata
external_metadata = pd.read_csv('additional_metadata.csv', index_col=0)
# Join metadata on index (pandas .join is a left join by default,
# keeping all adata observations)
adata.obs = adata.obs.join(external_metadata)
# Equivalent explicit merge
adata.obs = adata.obs.merge(
external_metadata,
left_index=True,
right_index=True,
how='left'
)
```
## Common Manipulation Patterns
### Quality control filtering
```python
# Calculate QC metrics
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.var['n_cells'] = (adata.X > 0).sum(axis=0)
# Filter low-quality cells
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['total_counts'] < 50000, :]
# Filter rarely detected genes
adata = adata[:, adata.var['n_cells'] >= 3]
```
### Select highly variable genes
```python
# Mark highly variable genes
gene_variance = np.var(adata.X, axis=0)
adata.var['variance'] = np.array(gene_variance).flatten()
adata.var['highly_variable'] = adata.var['variance'] > np.percentile(gene_variance, 90)
# Subset to highly variable genes
adata_hvg = adata[:, adata.var['highly_variable']].copy()
```
### Downsample
```python
# Random sampling of observations
np.random.seed(42)
n_sample = 500
sample_indices = np.random.choice(adata.n_obs, n_sample, replace=False)
adata_downsampled = adata[sample_indices, :].copy()
# Stratified sampling by cell type
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
range(adata.n_obs),
test_size=0.2,
stratify=adata.obs['cell_type']
)
adata_train = adata[train_idx, :].copy()
adata_test = adata[test_idx, :].copy()
```
### Split train/test
```python
# Random train/test split
np.random.seed(42)
n_obs = adata.n_obs
train_size = int(0.8 * n_obs)
indices = np.random.permutation(n_obs)
train_indices = indices[:train_size]
test_indices = indices[train_size:]
adata_train = adata[train_indices, :].copy()
adata_test = adata[test_indices, :].copy()
```

# AnnData Workflows and Best Practices
## Common Workflows
### 1. Single-Cell RNA-seq Analysis Workflow
#### Loading Data
```python
import anndata as ad
import numpy as np
import pandas as pd
# Load from 10X format
adata = ad.read_mtx('matrix.mtx').T  # 10X MTX stores genes as rows; transpose to cells x genes
adata.var_names = pd.read_csv('genes.tsv', sep='\t', header=None)[0].values
adata.obs_names = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0].values
# Or load from pre-processed h5ad
adata = ad.read_h5ad('preprocessed_data.h5ad')
```
#### Quality Control
```python
# Calculate QC metrics
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
# Filter cells
adata = adata[adata.obs.n_genes > 200]
adata = adata[adata.obs.total_counts < 10000]
# Filter genes
min_cells = 3
adata = adata[:, (adata.X > 0).sum(axis=0) >= min_cells]
```
#### Normalization and Preprocessing
```python
# Store raw counts
adata.layers['counts'] = adata.X.copy()
# Normalize
adata.X = adata.X / adata.obs.total_counts.values[:, None] * 1e4
# Log transform
adata.layers['log1p'] = np.log1p(adata.X)
adata.X = adata.layers['log1p']
# Identify highly variable genes
gene_variance = adata.X.var(axis=0)
adata.var['highly_variable'] = gene_variance > np.percentile(gene_variance, 90)
```
#### Dimensionality Reduction
```python
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
adata.obsm['X_pca'] = pca.fit_transform(adata.X)
# Store PCA variance
adata.uns['pca'] = {'variance_ratio': pca.explained_variance_ratio_}
# UMAP
from umap import UMAP
umap = UMAP(n_components=2)
adata.obsm['X_umap'] = umap.fit_transform(adata.obsm['X_pca'])
```
#### Clustering
```python
# Store cluster assignments
adata.obs['clusters'] = pd.Categorical(['cluster_0', 'cluster_1', ...])
# Store cluster centroids as a genes x clusters array (varm rows must align with var)
centroids = np.array([...])
adata.varm['cluster_centroids'] = centroids
```
#### Save Results
```python
# Save complete analysis
adata.write('analyzed_data.h5ad', compression='gzip')
```
### 2. Batch Integration Workflow
```python
import anndata as ad
# Load multiple batches
batch1 = ad.read_h5ad('batch1.h5ad')
batch2 = ad.read_h5ad('batch2.h5ad')
batch3 = ad.read_h5ad('batch3.h5ad')
# Concatenate with batch labels
adata = ad.concat(
[batch1, batch2, batch3],
axis=0,
label='batch',
keys=['batch1', 'batch2', 'batch3'],
index_unique='-'
)
# Batch effect correction would go here
# (using external tools like Harmony, Scanorama, etc.)
# Store corrected embeddings
adata.obsm['X_pca_corrected'] = corrected_pca
adata.obsm['X_umap_corrected'] = corrected_umap
```
### 3. Memory-Efficient Large Dataset Workflow
```python
import anndata as ad
# Read in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
# Check backing status
print(f"Is backed: {adata.isbacked}")
print(f"File: {adata.filename}")
# Work with chunks; chunked_X yields (chunk, start, end) tuples
for chunk, start, end in adata.chunked_X(chunk_size=1000):
    # Process rows [start, end) of the matrix
    result = process_chunk(chunk)
# Close file when done
adata.file.close()
```
### 4. Multi-Dataset Comparison Workflow
```python
import anndata as ad
# Load datasets
datasets = {
'study1': ad.read_h5ad('study1.h5ad'),
'study2': ad.read_h5ad('study2.h5ad'),
'study3': ad.read_h5ad('study3.h5ad')
}
# Outer join to keep all genes
combined = ad.concat(
list(datasets.values()),
axis=0,
join='outer',
label='study',
keys=list(datasets.keys()),
merge='first'
)
# Handle missing data
combined.X[np.isnan(combined.X)] = 0
# Add dataset-specific metadata
combined.uns['datasets'] = {
'study1': {'date': '2023-01', 'n_samples': datasets['study1'].n_obs},
'study2': {'date': '2023-06', 'n_samples': datasets['study2'].n_obs},
'study3': {'date': '2024-01', 'n_samples': datasets['study3'].n_obs}
}
```
## Best Practices
### Memory Management
#### Use Sparse Matrices
```python
from scipy.sparse import csr_matrix
# Convert to sparse if the matrix is mostly zeros
density = np.count_nonzero(adata.X) / adata.X.size
if density < 0.3:  # Less than 30% non-zero
adata.X = csr_matrix(adata.X)
```
#### Use Backed Mode for Large Files
```python
# Read with backing
adata = ad.read_h5ad('large_file.h5ad', backed='r')
# Only load what you need
subset = adata[:1000, :500].copy() # Now in memory
```
#### Convert Strings to Categoricals
```python
# Efficient storage for repeated strings
adata.strings_to_categoricals()
# Or manually
adata.obs['cell_type'] = pd.Categorical(adata.obs['cell_type'])
```
### Data Organization
#### Use Layers for Different Representations
```python
# Store multiple versions of the data
adata.layers['counts'] = raw_counts
adata.layers['normalized'] = normalized_data
adata.layers['log1p'] = log_transformed_data
adata.layers['scaled'] = scaled_data
```
#### Use obsm/varm for Multi-Dimensional Annotations
```python
# Embeddings
adata.obsm['X_pca'] = pca_coordinates
adata.obsm['X_umap'] = umap_coordinates
adata.obsm['X_tsne'] = tsne_coordinates
# Gene loadings
adata.varm['PCs'] = principal_components
```
#### Use uns for Analysis Metadata
```python
# Store parameters
adata.uns['preprocessing'] = {
'normalization': 'TPM',
'min_genes': 200,
'min_cells': 3,
'date': '2024-01-15'
}
# Store analysis results
adata.uns['differential_expression'] = {
'method': 't-test',
'p_value_threshold': 0.05
}
```
### Subsetting and Views
#### Understand View vs Copy
```python
# Subsetting returns a view
subset = adata[adata.obs.cell_type == 'B cell'] # View
print(subset.is_view) # True
# Views are memory efficient; modifying one triggers copy-on-write,
# so it silently becomes a full copy (the original adata is unchanged)
subset.obs['new_column'] = value  # ImplicitModificationWarning, subset is now a copy
# Create independent copy when needed
subset_copy = adata[adata.obs.cell_type == 'B cell'].copy()
```
#### Chain Operations Efficiently
```python
# Bad - creates multiple intermediate views
temp1 = adata[adata.obs.batch == 'batch1']
temp2 = temp1[temp1.obs.n_genes > 200]
result = temp2[:, temp2.var.highly_variable].copy()
# Good - chain operations
result = adata[
(adata.obs.batch == 'batch1') & (adata.obs.n_genes > 200),
adata.var.highly_variable
].copy()
```
### File I/O
#### Use Compression
```python
# Save with compression
adata.write('data.h5ad', compression='gzip')
```
#### Choose the Right Format
```python
# H5AD for general use (good compression, fast)
adata.write_h5ad('data.h5ad')
# Zarr for cloud storage and parallel access
adata.write_zarr('data.zarr')
# Loom for compatibility with other tools
adata.write_loom('data.loom')
```
#### Close File Connections
```python
# Make sure the backing file gets closed when done
adata = ad.read_h5ad('file.h5ad', backed='r')
try:
# Work with data
process(adata)
finally:
adata.file.close()
```
### Concatenation
#### Choose Appropriate Join Strategy
```python
# Inner join - only common features (safe, may lose data)
combined = ad.concat([adata1, adata2], join='inner')
# Outer join - all features (keeps all data, may introduce zeros)
combined = ad.concat([adata1, adata2], join='outer')
```
#### Track Data Sources
```python
# Add source labels
combined = ad.concat(
[adata1, adata2, adata3],
label='dataset',
keys=['exp1', 'exp2', 'exp3']
)
# Make indices unique
combined = ad.concat(
[adata1, adata2, adata3],
index_unique='-'
)
```
#### Handle Variable-Specific Metadata
```python
# Use merge strategy for var annotations
combined = ad.concat(
[adata1, adata2],
merge='same', # Keep only identical annotations
join='outer'
)
```
### Naming Conventions
#### Use Consistent Naming
```python
# Embeddings: X_<method>
adata.obsm['X_pca']
adata.obsm['X_umap']
adata.obsm['X_tsne']
# Layers: descriptive names
adata.layers['counts']
adata.layers['log1p']
adata.layers['scaled']
# Observations: snake_case
adata.obs['cell_type']
adata.obs['n_genes']
adata.obs['total_counts']
```
#### Make Indices Unique
```python
# Ensure unique names
adata.obs_names_make_unique()
adata.var_names_make_unique()
```
### Error Handling
#### Validate Data Structure
```python
# Check dimensions
assert adata.n_obs > 0, "No observations in data"
assert adata.n_vars > 0, "No variables in data"
# Check for NaN values
if np.isnan(adata.X).any():
print("Warning: NaN values detected")
# Check for negative values in count data
if (adata.X < 0).any():
print("Warning: Negative values in count data")
```
#### Handle Missing Data
```python
# Check for missing annotations
if adata.obs['cell_type'].isna().any():
print("Warning: Missing cell type annotations")
# Fill or remove
adata = adata[~adata.obs['cell_type'].isna()]
```
## Common Pitfalls
### 1. Forgetting to Copy Views
```python
# UNCLEAR - assigning to a view triggers copy-on-write: anndata raises an
# ImplicitModificationWarning and silently turns the view into a copy
subset = adata[adata.obs.condition == 'treated']
subset.X = transformed_data  # Original adata is unchanged, but the copy is hidden
# GOOD - make the copy explicit
subset = adata[adata.obs.condition == 'treated'].copy()
subset.X = transformed_data  # Only changes the explicit copy
```
### 2. Mixing Backed and In-Memory Operations
```python
# BAD - trying to modify backed data
adata = ad.read_h5ad('file.h5ad', backed='r')
adata.X[0, 0] = 100 # Error: can't modify backed data
# GOOD - load to memory first
adata = ad.read_h5ad('file.h5ad', backed='r')
adata = adata.to_memory()
adata.X[0, 0] = 100 # Works
```
### 3. Not Using Categoricals for Metadata
```python
# BAD - stores as strings (memory inefficient)
adata.obs['cell_type'] = ['B cell', 'T cell', ...] * 1000
# GOOD - use categorical
adata.obs['cell_type'] = pd.Categorical(['B cell', 'T cell', ...] * 1000)
```
### 4. Incorrect Concatenation Axis
```python
# Concatenating observations (cells)
combined = ad.concat([adata1, adata2], axis=0) # Correct
# Concatenating variables (genes) - rare
combined = ad.concat([adata1, adata2], axis=1) # Less common
```
### 5. Not Preserving Raw Data
```python
# BAD - loses original data
adata.X = normalized_data
# GOOD - preserve original
adata.layers['counts'] = adata.X.copy()
adata.X = normalized_data
```