From 4080bf907b5c68d13a8ceb30dedbadbd9160745a Mon Sep 17 00:00:00 2001 From: Timothy Kassis Date: Sun, 19 Oct 2025 17:14:11 -0700 Subject: [PATCH] Add GEO --- .claude-plugin/marketplace.json | 3 +- README.md | 4 +- scientific-databases/geo-database/SKILL.md | 805 +++++++++++++++++ .../geo-database/references/geo_reference.md | 829 ++++++++++++++++++ 4 files changed, 1638 insertions(+), 3 deletions(-) create mode 100644 scientific-databases/geo-database/SKILL.md create mode 100644 scientific-databases/geo-database/references/geo_reference.md diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index ca4ba61..b4f240d 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -7,7 +7,7 @@ }, "metadata": { "description": "Claude scientific skills from K-Dense Inc", - "version": "1.4.0" + "version": "1.5.0" }, "plugins": [ { @@ -62,6 +62,7 @@ "./scientific-databases/alphafold-database", "./scientific-databases/chembl-database", "./scientific-databases/gene-database", + "./scientific-databases/geo-database", "./scientific-databases/pdb-database", "./scientific-databases/pubchem-database", "./scientific-databases/pubmed-database", diff --git a/README.md b/README.md index 52dfb3e..aa83a16 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,7 @@ A comprehensive collection of ready-to-use scientific skills for Claude, curated - **AlphaFold DB** - AI-predicted protein structure database with 200M+ predictions, confidence metrics (pLDDT, PAE), and Google Cloud bulk access - **ChEMBL** - Bioactive molecule database with drug-like properties (2M+ compounds, 19M+ activities, 13K+ targets) +- **GEO (Gene Expression Omnibus)** - High-throughput gene expression and functional genomics data repository (264K+ studies, 8M+ samples) with microarray, RNA-seq, and expression profile access - **NCBI Gene** - Work with NCBI Gene database to search, retrieve, and analyze gene information including nomenclature, sequences, variations, 
phenotypes, and pathways using E-utilities and Datasets API - **Protein Data Bank (PDB)** - Access 3D structural data of proteins, nucleic acids, and biological macromolecules (200K+ structures) with search, retrieval, and analysis capabilities - **PubChem** - Access chemical compound data from the world's largest free chemical database (110M+ compounds, 270M+ bioactivities) @@ -102,7 +103,7 @@ To use any skill from this repository or upload custom skills, follow the instru ### Claude API You can use Anthropic's pre-built skills, and upload custom skills, via the Claude API. See the [Skills API Quickstart](https://docs.anthropic.com/claude/skills-api-quickstart) for more. -## TODO: Future Scientific Capabilities (Availble currently in K-Dense) +## TODO: Future Scientific Capabilities ### Scientific Databases - **UniProt** - Protein sequence and functional information database @@ -110,7 +111,6 @@ You can use Anthropic's pre-built skills, and upload custom skills, via the Clau - **COSMIC** - Catalogue of Somatic Mutations in Cancer - **ClinVar** - Clinical significance of genomic variants - **STRING** - Protein-protein interaction networks -- **GEO (Gene Expression Omnibus)** - Functional genomics data repository - **European Nucleotide Archive (ENA)** - Comprehensive nucleotide sequence database ### Bioinformatics & Genomics diff --git a/scientific-databases/geo-database/SKILL.md b/scientific-databases/geo-database/SKILL.md new file mode 100644 index 0000000..baf99ba --- /dev/null +++ b/scientific-databases/geo-database/SKILL.md @@ -0,0 +1,805 @@ +--- +name: geo-database +description: Work with the Gene Expression Omnibus (GEO) database to search, retrieve, and analyze high-throughput gene expression and functional genomics data. 
Use this skill when working with microarray data, RNA-seq datasets, gene expression profiles, GEO accessions (GSE, GSM, GPL, GDS), downloading SOFT/MINiML files, querying expression experiments, performing differential expression analysis, accessing GEO metadata, or when needing programmatic access to functional genomics data repositories. +--- + +# GEO Database + +## Overview + +This skill provides tools and guidance for working with the Gene Expression Omnibus (GEO), NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments. Use this skill to search for gene expression datasets, retrieve experimental data, download raw and processed files, query expression profiles, and integrate GEO data into computational analysis workflows. + +## Core Capabilities + +### 1. Understanding GEO Data Organization + +GEO organizes data hierarchically using different accession types: + +**Series (GSE):** A complete experiment with a set of related samples +- Example: GSE123456 +- Contains experimental design, samples, and overall study information +- Largest organizational unit in GEO +- Current count: 264,928+ series + +**Sample (GSM):** A single experimental sample or biological replicate +- Example: GSM987654 +- Contains individual sample data, protocols, and metadata +- Linked to platforms and series +- Current count: 8,068,632+ samples + +**Platform (GPL):** The microarray or sequencing platform used +- Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array) +- Describes the technology and probe/feature annotations +- Shared across multiple experiments +- Current count: 27,739+ platforms + +**DataSet (GDS):** Curated collections with consistent formatting +- Example: GDS5678 +- Experimentally-comparable samples organized by study design +- Processed for differential analysis +- Subset of GEO data (4,348 curated datasets) 
+- Ideal for quick comparative analyses + +**Profiles:** Gene-specific expression data linked to sequence features +- Queryable by gene name or annotation +- Cross-references to Entrez Gene +- Enables gene-centric searches across all studies + +### 2. Searching GEO Data + +**GEO DataSets Search:** + +Search for studies by keywords, organism, or experimental conditions: + +```python +from Bio import Entrez + +# Configure Entrez (required) +Entrez.email = "your.email@example.com" + +# Search for datasets +def search_geo_datasets(query, retmax=20): + """Search GEO DataSets database""" + handle = Entrez.esearch( + db="gds", + term=query, + retmax=retmax, + usehistory="y" + ) + results = Entrez.read(handle) + handle.close() + return results + +# Example searches +results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]") +print(f"Found {results['Count']} datasets") + +# Search by specific platform +results = search_geo_datasets("GPL570[Accession]") + +# Search by study type +results = search_geo_datasets("expression profiling by array[DataSet Type]") +``` + +**GEO Profiles Search:** + +Find gene-specific expression patterns: + +```python +# Search for gene expression profiles +def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100): + """Search GEO Profiles for a specific gene""" + query = f"{gene_name}[Gene Name] AND {organism}[Organism]" + handle = Entrez.esearch( + db="geoprofiles", + term=query, + retmax=retmax + ) + results = Entrez.read(handle) + handle.close() + return results + +# Find TP53 expression across studies +tp53_results = search_geo_profiles("TP53", organism="Homo sapiens") +print(f"Found {tp53_results['Count']} expression profiles for TP53") +``` + +**Advanced Search Patterns:** + +```python +# Combine multiple search terms +def advanced_geo_search(terms, operator="AND"): + """Build complex search queries""" + query = f" {operator} ".join(terms) + return search_geo_datasets(query) + +# Find recent high-throughput 
studies +search_terms = [ + "RNA-seq[DataSet Type]", + "Homo sapiens[Organism]", + "2024[Publication Date]" +] +results = advanced_geo_search(search_terms) + +# Search by author and condition +search_terms = [ + "Smith[Author]", + "diabetes[Disease]" +] +results = advanced_geo_search(search_terms) +``` + +### 3. Retrieving GEO Data with GEOparse (Recommended) + +**GEOparse** is the primary Python library for accessing GEO data: + +**Installation:** +```bash +pip install GEOparse +``` + +**Basic Usage:** + +```python +import GEOparse + +# Download and parse a GEO Series +gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data") + +# Access series metadata +print(gse.metadata['title']) +print(gse.metadata['summary']) +print(gse.metadata['overall_design']) + +# Access sample information +for gsm_name, gsm in gse.gsms.items(): + print(f"Sample: {gsm_name}") + print(f" Title: {gsm.metadata['title'][0]}") + print(f" Source: {gsm.metadata['source_name_ch1'][0]}") + print(f" Characteristics: {gsm.metadata.get('characteristics_ch1', [])}") + +# Access platform information +for gpl_name, gpl in gse.gpls.items(): + print(f"Platform: {gpl_name}") + print(f" Title: {gpl.metadata['title'][0]}") + print(f" Organism: {gpl.metadata['organism'][0]}") +``` + +**Working with Expression Data:** + +```python +import GEOparse +import pandas as pd + +# Get expression data from series +gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data") + +# Extract expression matrix +# Method 1: From series matrix file (fastest) +if hasattr(gse, 'pivot_samples'): + expression_df = gse.pivot_samples('VALUE') + print(expression_df.shape) # genes x samples + +# Method 2: From individual samples +expression_data = {} +for gsm_name, gsm in gse.gsms.items(): + if hasattr(gsm, 'table'): + expression_data[gsm_name] = gsm.table['VALUE'] + +expression_df = pd.DataFrame(expression_data) +print(f"Expression matrix: {expression_df.shape}") +``` + +**Accessing Supplementary Files:** + +```python +import GEOparse 
+ +gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data") + +# Download supplementary files +gse.download_supplementary_files( + directory="./data/GSE123456_suppl", + download_sra=False # Set to True to download SRA files +) + +# List available supplementary files +for gsm_name, gsm in gse.gsms.items(): + if hasattr(gsm, 'supplementary_files'): + print(f"Sample {gsm_name}:") + for file_url in gsm.metadata.get('supplementary_file', []): + print(f" {file_url}") +``` + +**Filtering and Subsetting Data:** + +```python +import GEOparse + +gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data") + +# Filter samples by metadata +control_samples = [ + gsm_name for gsm_name, gsm in gse.gsms.items() + if 'control' in gsm.metadata.get('title', [''])[0].lower() +] + +treatment_samples = [ + gsm_name for gsm_name, gsm in gse.gsms.items() + if 'treatment' in gsm.metadata.get('title', [''])[0].lower() +] + +print(f"Control samples: {len(control_samples)}") +print(f"Treatment samples: {len(treatment_samples)}") + +# Extract subset expression matrix +expression_df = gse.pivot_samples('VALUE') +control_expr = expression_df[control_samples] +treatment_expr = expression_df[treatment_samples] +``` + +### 4. 
Using NCBI E-utilities for GEO Access + +**E-utilities** provide lower-level programmatic access to GEO metadata: + +**Basic E-utilities Workflow:** + +```python +from Bio import Entrez +import time + +Entrez.email = "your.email@example.com" + +# Step 1: Search for GEO entries +def search_geo(query, db="gds", retmax=100): + """Search GEO using E-utilities""" + handle = Entrez.esearch( + db=db, + term=query, + retmax=retmax, + usehistory="y" + ) + results = Entrez.read(handle) + handle.close() + return results + +# Step 2: Fetch summaries +def fetch_geo_summaries(id_list, db="gds"): + """Fetch document summaries for GEO entries""" + ids = ",".join(id_list) + handle = Entrez.esummary(db=db, id=ids) + summaries = Entrez.read(handle) + handle.close() + return summaries + +# Step 3: Fetch full records +def fetch_geo_records(id_list, db="gds"): + """Fetch full GEO records""" + ids = ",".join(id_list) + handle = Entrez.efetch(db=db, id=ids, retmode="xml") + records = Entrez.read(handle) + handle.close() + return records + +# Example workflow +search_results = search_geo("breast cancer AND Homo sapiens") +id_list = search_results['IdList'][:5] + +summaries = fetch_geo_summaries(id_list) +for summary in summaries: + print(f"GDS: {summary.get('Accession', 'N/A')}") + print(f"Title: {summary.get('title', 'N/A')}") + print(f"Samples: {summary.get('n_samples', 'N/A')}") + print() +``` + +**Batch Processing with E-utilities:** + +```python +from Bio import Entrez +import time + +Entrez.email = "your.email@example.com" + +def batch_fetch_geo_metadata(accessions, batch_size=100): + """Fetch metadata for multiple GEO accessions""" + results = {} + + for i in range(0, len(accessions), batch_size): + batch = accessions[i:i + batch_size] + + # Search for each accession + for accession in batch: + try: + query = f"{accession}[Accession]" + search_handle = Entrez.esearch(db="gds", term=query) + search_results = Entrez.read(search_handle) + search_handle.close() + + if 
search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]

                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second

            except Exception as e:
                print(f"Error fetching {accession}: {e}")

    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
```

### 5. Direct FTP Access for Data Files

**FTP URLs for GEO Data:**

GEO data can be downloaded directly via FTP:

```python
import ftplib
import os

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO series files via FTP"""
    if not accession.startswith("GSE"):
        raise ValueError("Only GSE series accessions are handled here")

    # Series directories are grouped by accession range, e.g.
    # /geo/series/GSE123nnn/GSE123456/matrix/
    gse_num = accession[3:]
    base_num = gse_num[:-3] + "nnn"

    if file_type == "matrix":
        subdir, filename = "matrix", f"{accession}_series_matrix.txt.gz"
    elif file_type == "soft":
        subdir, filename = "soft", f"{accession}_family.soft.gz"
    elif file_type == "miniml":
        subdir, filename = "miniml", f"{accession}_family.xml.tgz"
    else:
        raise ValueError(f"Unknown file_type: {file_type}")

    ftp_path = f"/geo/series/GSE{base_num}/{accession}/{subdir}/"

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)

    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)

    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
```

**Using wget or curl for Downloads:**

```bash
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all
supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
```

### 6. Analyzing GEO Data

**Quality Control and Preprocessing:**

```python
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log2-transform if values look like linear-scale intensities:
# all-positive data with a large maximum is usually not yet log-transformed
if expression_df.min().min() > 0 and expression_df.max().max() > 100:
    expression_df = np.log2(expression_df + 1)
    print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")

plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
```

**Differential Expression Analysis:**

```python
import GEOparse
import pandas as pd
import numpy as np
from scipy import stats

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]

    # Difference of means equals log2 fold change on log2-scaled data
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr,
control_expr) + + results.append({ + 'gene': gene, + 'log2_fold_change': fold_change, + 'p_value': p_value, + 'control_mean': control_expr.mean(), + 'treatment_mean': treatment_expr.mean() + }) + +# Create results DataFrame +de_results = pd.DataFrame(results) + +# Multiple testing correction (Benjamini-Hochberg) +from statsmodels.stats.multitest import multipletests +_, de_results['q_value'], _, _ = multipletests( + de_results['p_value'], + method='fdr_bh' +) + +# Filter significant genes +significant_genes = de_results[ + (de_results['q_value'] < 0.05) & + (abs(de_results['log2_fold_change']) > 1) +] + +print(f"Significant genes: {len(significant_genes)}") +significant_genes.to_csv("de_results.csv", index=False) +``` + +**Correlation and Clustering Analysis:** + +```python +import GEOparse +import seaborn as sns +import matplotlib.pyplot as plt +from scipy.cluster import hierarchy +from scipy.spatial.distance import pdist + +gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data") +expression_df = gse.pivot_samples('VALUE') + +# Sample correlation heatmap +sample_corr = expression_df.corr() + +plt.figure(figsize=(10, 8)) +sns.heatmap(sample_corr, cmap='coolwarm', center=0, + square=True, linewidths=0.5) +plt.title("Sample Correlation Matrix") +plt.tight_layout() +plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight') + +# Hierarchical clustering +distances = pdist(expression_df.T, metric='correlation') +linkage = hierarchy.linkage(distances, method='average') + +plt.figure(figsize=(12, 6)) +hierarchy.dendrogram(linkage, labels=expression_df.columns) +plt.title("Hierarchical Clustering of Samples") +plt.xlabel("Samples") +plt.ylabel("Distance") +plt.xticks(rotation=90) +plt.tight_layout() +plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight') +``` + +### 7. 
Batch Processing Multiple Datasets + +**Download and Process Multiple Series:** + +```python +import GEOparse +import pandas as pd +import os + +def batch_download_geo(gse_list, destdir="./geo_data"): + """Download multiple GEO series""" + results = {} + + for gse_id in gse_list: + try: + print(f"Processing {gse_id}...") + gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir) + + # Extract key information + results[gse_id] = { + 'title': gse.metadata.get('title', ['N/A'])[0], + 'organism': gse.metadata.get('organism', ['N/A'])[0], + 'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A', + 'num_samples': len(gse.gsms), + 'submission_date': gse.metadata.get('submission_date', ['N/A'])[0] + } + + # Save expression data + if hasattr(gse, 'pivot_samples'): + expr_df = gse.pivot_samples('VALUE') + expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv") + results[gse_id]['num_genes'] = len(expr_df) + + except Exception as e: + print(f"Error processing {gse_id}: {e}") + results[gse_id] = {'error': str(e)} + + # Save summary + summary_df = pd.DataFrame(results).T + summary_df.to_csv(f"{destdir}/batch_summary.csv") + + return results + +# Process multiple datasets +gse_list = ["GSE100001", "GSE100002", "GSE100003"] +results = batch_download_geo(gse_list) +``` + +**Meta-Analysis Across Studies:** + +```python +import GEOparse +import pandas as pd +import numpy as np + +def meta_analysis_geo(gse_list, gene_of_interest): + """Perform meta-analysis of gene expression across studies""" + results = [] + + for gse_id in gse_list: + try: + gse = GEOparse.get_GEO(geo=gse_id, destdir="./data") + + # Get platform annotation + gpl = list(gse.gpls.values())[0] + + # Find gene in platform + if hasattr(gpl, 'table'): + gene_probes = gpl.table[ + gpl.table['Gene Symbol'].str.contains( + gene_of_interest, + case=False, + na=False + ) + ] + + if not gene_probes.empty: + expr_df = gse.pivot_samples('VALUE') + + for probe_id in gene_probes['ID']: + if probe_id in expr_df.index: + expr_values = 
expr_df.loc[probe_id] + + results.append({ + 'study': gse_id, + 'probe': probe_id, + 'mean_expression': expr_values.mean(), + 'std_expression': expr_values.std(), + 'num_samples': len(expr_values) + }) + + except Exception as e: + print(f"Error in {gse_id}: {e}") + + return pd.DataFrame(results) + +# Meta-analysis for TP53 +gse_studies = ["GSE100001", "GSE100002", "GSE100003"] +meta_results = meta_analysis_geo(gse_studies, "TP53") +print(meta_results) +``` + +## Installation and Setup + +### Python Libraries + +```bash +# Primary GEO access library (recommended) +pip install GEOparse + +# For E-utilities and programmatic NCBI access +pip install biopython + +# For data analysis +pip install pandas numpy scipy + +# For visualization +pip install matplotlib seaborn + +# For statistical analysis +pip install statsmodels scikit-learn +``` + +### Configuration + +Set up NCBI E-utilities access: + +```python +from Bio import Entrez + +# Always set your email (required by NCBI) +Entrez.email = "your.email@example.com" + +# Optional: Set API key for increased rate limits +# Get your API key from: https://www.ncbi.nlm.nih.gov/account/ +Entrez.api_key = "your_api_key_here" + +# With API key: 10 requests/second +# Without API key: 3 requests/second +``` + +## Common Use Cases + +### Transcriptomics Research +- Download gene expression data for specific conditions +- Compare expression profiles across studies +- Identify differentially expressed genes +- Perform meta-analyses across multiple datasets + +### Drug Response Studies +- Analyze gene expression changes after drug treatment +- Identify biomarkers for drug response +- Compare drug effects across cell lines or patients +- Build predictive models for drug sensitivity + +### Disease Biology +- Study gene expression in disease vs. 
normal tissues +- Identify disease-associated expression signatures +- Compare patient subgroups and disease stages +- Correlate expression with clinical outcomes + +### Biomarker Discovery +- Screen for diagnostic or prognostic markers +- Validate biomarkers across independent cohorts +- Compare marker performance across platforms +- Integrate expression with clinical data + +## Key Concepts + +**SOFT (Simple Omnibus Format in Text):** GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse. + +**MINiML (MIAME Notation in Markup Language):** XML format for GEO data, used for programmatic access and data exchange. + +**Series Matrix:** Tab-delimited expression matrix with samples as columns and genes/probes as rows. Fastest format for getting expression data. + +**MIAME Compliance:** Minimum Information About a Microarray Experiment - standardized annotation that GEO enforces for all submissions. + +**Expression Value Types:** Different types of expression measurements (raw signal, normalized, log-transformed). Always check platform and processing methods. + +**Platform Annotation:** Maps probe/feature IDs to genes. Essential for biological interpretation of expression data. 
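Series matrix files can also be read without GEOparse. The sketch below pulls the expression table out of a tiny synthetic series matrix file with pandas; the `read_series_matrix` helper and the demo file are illustrative only, not part of any GEO library.

```python
import gzip
import io
import pandas as pd

def read_series_matrix(path):
    """Parse the data table from a GEO series matrix file.

    Metadata lines start with '!'; the expression table sits between
    the '!series_matrix_table_begin' and '!series_matrix_table_end' markers.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        lines = fh.readlines()
    start = next(i for i, l in enumerate(lines) if l.startswith("!series_matrix_table_begin"))
    end = next(i for i, l in enumerate(lines) if l.startswith("!series_matrix_table_end"))
    table = "".join(lines[start + 1:end])
    # Fields are tab-separated; identifiers are double-quoted in real files
    return pd.read_csv(io.StringIO(table), sep="\t", index_col=0)

# Demo with a tiny synthetic file (real files come from the GEO FTP site)
demo = (
    '!Series_title\t"Example study"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\tGSM1\tGSM2\n'
    '"1007_s_at"\t7.1\t7.3\n'
    '"1053_at"\t5.2\t5.0\n'
    "!series_matrix_table_end\n"
)
with open("demo_series_matrix.txt", "w") as fh:
    fh.write(demo)

expr = read_series_matrix("demo_series_matrix.txt")
print(expr.shape)  # (2, 2): probes x samples
```

The same function works on a gzipped file straight from the FTP site, since real series matrix files follow the same begin/end markers.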
+ +## GEO2R Web Tool + +For quick analysis without coding, use GEO2R: + +- Web-based statistical analysis tool integrated into GEO +- Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx +- Performs differential expression analysis +- Generates R scripts for reproducibility +- Useful for exploratory analysis before downloading data + +## Rate Limiting and Best Practices + +**NCBI E-utilities Rate Limits:** +- Without API key: 3 requests per second +- With API key: 10 requests per second +- Implement delays between requests: `time.sleep(0.34)` (no API key) or `time.sleep(0.1)` (with API key) + +**FTP Access:** +- No rate limits for FTP downloads +- Preferred method for bulk downloads +- Can download entire directories with wget -r + +**GEOparse Caching:** +- GEOparse automatically caches downloaded files in destdir +- Subsequent calls use cached data +- Clean cache periodically to save disk space + +**Optimal Practices:** +- Use GEOparse for series-level access (easiest) +- Use E-utilities for metadata searching and batch queries +- Use FTP for direct file downloads and bulk operations +- Cache data locally to avoid repeated downloads +- Always set Entrez.email when using Biopython + +## Resources + +### references/geo_reference.md + +Comprehensive reference documentation covering: +- Detailed E-utilities API specifications and endpoints +- Complete SOFT and MINiML file format documentation +- Advanced GEOparse usage patterns and examples +- FTP directory structure and file naming conventions +- Data processing pipelines and normalization methods +- Troubleshooting common issues and error handling +- Platform-specific considerations and quirks + +Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats. 
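The E-utilities rate limits noted earlier can be enforced client-side rather than with ad-hoc `time.sleep` calls. A minimal sketch follows; `EntrezThrottle` is a hypothetical helper (not part of Biopython) that guarantees a minimum gap between consecutive requests.

```python
import time

class EntrezThrottle:
    """Enforce a minimum interval between E-utilities requests.

    NCBI allows roughly 3 requests/second without an API key and
    10 requests/second with one.
    """
    def __init__(self, api_key=None):
        self.min_interval = 0.1 if api_key else 0.34
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = EntrezThrottle()  # no API key: >= 0.34 s between calls
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # ...issue an Entrez.esearch / esummary call here...
print(f"3 throttled calls took {time.monotonic() - start:.2f} s")
```

Sleeping only for the remaining interval (rather than a fixed delay) keeps throughput at the limit when the surrounding work is slow.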
+ +## Important Notes + +### Data Quality Considerations + +- GEO accepts user-submitted data with varying quality standards +- Always check platform annotation and processing methods +- Verify sample metadata and experimental design +- Be cautious with batch effects across studies +- Consider reprocessing raw data for consistency + +### File Size Warnings + +- Series matrix files can be large (>1 GB for large studies) +- Supplementary files (e.g., CEL files) can be very large +- Plan for adequate disk space before downloading +- Consider downloading samples incrementally + +### Data Usage and Citation + +- GEO data is freely available for research use +- Always cite original studies when using GEO data +- Cite GEO database: Barrett et al. (2013) Nucleic Acids Research +- Check individual dataset usage restrictions (if any) +- Follow NCBI guidelines for programmatic access + +### Common Pitfalls + +- Different platforms use different probe IDs (requires annotation mapping) +- Expression values may be raw, normalized, or log-transformed (check metadata) +- Sample metadata can be inconsistently formatted across studies +- Not all series have series matrix files (older submissions) +- Platform annotations may be outdated (genes renamed, IDs deprecated) + +## Additional Resources + +- **GEO Website:** https://www.ncbi.nlm.nih.gov/geo/ +- **GEO Submission Guidelines:** https://www.ncbi.nlm.nih.gov/geo/info/submission.html +- **GEOparse Documentation:** https://geoparse.readthedocs.io/ +- **E-utilities Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/ +- **GEO FTP Site:** ftp://ftp.ncbi.nlm.nih.gov/geo/ +- **GEO2R Tool:** https://www.ncbi.nlm.nih.gov/geo/geo2r/ +- **NCBI API Keys:** https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/ +- **Biopython Tutorial:** https://biopython.org/DIST/docs/tutorial/Tutorial.html diff --git a/scientific-databases/geo-database/references/geo_reference.md 
b/scientific-databases/geo-database/references/geo_reference.md new file mode 100644 index 0000000..913330a --- /dev/null +++ b/scientific-databases/geo-database/references/geo_reference.md @@ -0,0 +1,829 @@ +# GEO Database Reference Documentation + +## Complete E-utilities API Specifications + +### Overview + +The NCBI Entrez Programming Utilities (E-utilities) provide programmatic access to GEO metadata through a set of nine server-side programs. All E-utilities return results in XML format by default. + +### Base URL + +``` +https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ +``` + +### Core E-utility Programs + +#### eSearch - Text Query to ID List + +**Purpose:** Search a database and return a list of UIDs matching the query. + +**URL Pattern:** +``` +https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi +``` + +**Parameters:** +- `db` (required): Database to search (e.g., "gds", "geoprofiles") +- `term` (required): Search query string +- `retmax`: Maximum number of UIDs to return (default: 20, max: 10000) +- `retstart`: Starting position in result set (for pagination) +- `usehistory`: Set to "y" to store results on history server +- `sort`: Sort order (e.g., "relevance", "pub_date") +- `field`: Limit search to specific field +- `datetype`: Type of date to limit by +- `reldate`: Limit to items within N days of today +- `mindate`, `maxdate`: Date range limits (YYYY/MM/DD) + +**Example:** +```python +from Bio import Entrez +Entrez.email = "your@email.com" + +# Basic search +handle = Entrez.esearch( + db="gds", + term="breast cancer AND Homo sapiens", + retmax=100, + usehistory="y" +) +results = Entrez.read(handle) +handle.close() + +# Results contain: +# - Count: Total number of matches +# - RetMax: Number of UIDs returned +# - RetStart: Starting position +# - IdList: List of UIDs +# - QueryKey: Key for history server (if usehistory="y") +# - WebEnv: Web environment string (if usehistory="y") +``` + +#### eSummary - Document Summaries + +**Purpose:** Retrieve 
document summaries for a list of UIDs. + +**URL Pattern:** +``` +https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi +``` + +**Parameters:** +- `db` (required): Database +- `id` (required): Comma-separated list of UIDs or query_key+WebEnv +- `retmode`: Return format ("xml" or "json") +- `version`: Summary version ("2.0" recommended) + +**Example:** +```python +from Bio import Entrez +Entrez.email = "your@email.com" + +# Get summaries for multiple IDs +handle = Entrez.esummary( + db="gds", + id="200000001,200000002", + retmode="xml", + version="2.0" +) +summaries = Entrez.read(handle) +handle.close() + +# Summary fields for GEO DataSets: +# - Accession: GDS accession +# - title: Dataset title +# - summary: Dataset description +# - PDAT: Publication date +# - n_samples: Number of samples +# - Organism: Source organism +# - PubMedIds: Associated PubMed IDs +``` + +#### eFetch - Full Records + +**Purpose:** Retrieve full records for a list of UIDs. + +**URL Pattern:** +``` +https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi +``` + +**Parameters:** +- `db` (required): Database +- `id` (required): Comma-separated list of UIDs +- `retmode`: Return format ("xml", "text") +- `rettype`: Record type (database-specific) + +**Example:** +```python +from Bio import Entrez +Entrez.email = "your@email.com" + +# Fetch full records +handle = Entrez.efetch( + db="gds", + id="200000001", + retmode="xml" +) +records = Entrez.read(handle) +handle.close() +``` + +#### eLink - Cross-Database Linking + +**Purpose:** Find related records in same or different databases. 
+ +**URL Pattern:** +``` +https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi +``` + +**Parameters:** +- `dbfrom` (required): Source database +- `db` (required): Target database +- `id` (required): UID from source database +- `cmd`: Link command type + - "neighbor": Return linked UIDs (default) + - "neighbor_score": Return scored links + - "acheck": Check for links + - "ncheck": Count links + - "llinks": Return URLs to LinkOut resources + +**Example:** +```python +from Bio import Entrez +Entrez.email = "your@email.com" + +# Find PubMed articles linked to a GEO dataset +handle = Entrez.elink( + dbfrom="gds", + db="pubmed", + id="200000001" +) +links = Entrez.read(handle) +handle.close() +``` + +#### ePost - Upload UID List + +**Purpose:** Upload a list of UIDs to the history server for use in subsequent requests. + +**URL Pattern:** +``` +https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi +``` + +**Parameters:** +- `db` (required): Database +- `id` (required): Comma-separated list of UIDs + +**Example:** +```python +from Bio import Entrez +Entrez.email = "your@email.com" + +# Post large list of IDs +large_id_list = [str(i) for i in range(200000001, 200000101)] +handle = Entrez.epost(db="gds", id=",".join(large_id_list)) +result = Entrez.read(handle) +handle.close() + +# Use returned QueryKey and WebEnv in subsequent calls +query_key = result["QueryKey"] +webenv = result["WebEnv"] +``` + +#### eInfo - Database Information + +**Purpose:** Get information about available databases and their fields. 

**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
```

**Parameters:**
- `db`: Database name (omit to get a list of all databases)
- `version`: Set to "2.0" for detailed field information

**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"

# Get information about the gds database
handle = Entrez.einfo(db="gds", version="2.0")
info = Entrez.read(handle)
handle.close()

# Returns:
# - Database description
# - Last update date
# - Record count
# - Available search fields
# - Link information
```

### Search Field Qualifiers for GEO

Common search fields for building targeted queries:

**General Fields:**
- `[Accession]`: GEO accession number
- `[Title]`: Dataset title
- `[Author]`: Author name
- `[Organism]`: Source organism
- `[Entry Type]`: Type of entry (e.g., "Expression profiling by array")
- `[Platform]`: Platform accession or name
- `[PubMed ID]`: Associated PubMed ID

**Date Fields:**
- `[Publication Date]`: Publication date (YYYY or YYYY/MM/DD)
- `[Submission Date]`: Submission date
- `[Modification Date]`: Last modification date

**MeSH Terms:**
- `[MeSH Terms]`: Medical Subject Headings
- `[MeSH Major Topic]`: Major MeSH topics

**Study Type Fields:**
- `[DataSet Type]`: Type of study (e.g., "RNA-seq", "ChIP-seq")
- `[Sample Type]`: Sample type

**Example Complex Query:**
```python
query = """
    (breast cancer[MeSH] OR breast neoplasms[Title]) AND
    Homo sapiens[Organism] AND
    expression profiling by array[Entry Type] AND
    2020:2024[Publication Date] AND
    GPL570[Platform]
"""

# Collapse the embedded newlines before submitting the term
# (assumes Bio.Entrez is imported and Entrez.email is set, as above)
handle = Entrez.esearch(db="gds", term=" ".join(query.split()))
```

## SOFT File Format Specification

### Overview

SOFT (Simple Omnibus Format in Text) is GEO's primary data exchange format. Files are structured as key-value pairs with embedded data tables.
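Concretely, the first character of each line signals its role — a convention the full example and the parsers below rely on (values here are illustrative):

```
^SERIES = GSE123456        entity line: opens a DATABASE, SERIES, PLATFORM, or SAMPLE section
!Series_title = My study   attribute line: metadata for the current entity
#ID = probe identifier     column-definition line for the entity's data table
1007_s_at    8.456         unprefixed lines between table_begin/table_end are table rows
```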
+ +### File Types + +**Family SOFT Files:** +- Filename: `GSExxxxx_family.soft.gz` +- Contains: Complete series with all samples and platforms +- Size: Can be very large (100s of MB compressed) +- Use: Complete data extraction + +**Series Matrix Files:** +- Filename: `GSExxxxx_series_matrix.txt.gz` +- Contains: Expression matrix with minimal metadata +- Size: Smaller than family files +- Use: Quick access to expression data + +**Platform SOFT Files:** +- Filename: `GPLxxxxx.soft` +- Contains: Platform annotation and probe information +- Use: Mapping probes to genes + +### SOFT File Structure + +``` +^DATABASE = GeoMiame +!Database_name = Gene Expression Omnibus (GEO) +!Database_institute = NCBI NLM NIH +!Database_web_link = http://www.ncbi.nlm.nih.gov/geo +!Database_email = geo@ncbi.nlm.nih.gov + +^SERIES = GSExxxxx +!Series_title = Study Title Here +!Series_summary = Study description and background... +!Series_overall_design = Experimental design... +!Series_type = Expression profiling by array +!Series_pubmed_id = 12345678 +!Series_submission_date = Jan 01 2024 +!Series_last_update_date = Jan 15 2024 +!Series_contributor = John,Doe +!Series_contributor = Jane,Smith +!Series_sample_id = GSMxxxxxx +!Series_sample_id = GSMxxxxxx + +^PLATFORM = GPLxxxxx +!Platform_title = Platform Name +!Platform_distribution = commercial or custom +!Platform_organism = Homo sapiens +!Platform_manufacturer = Affymetrix +!Platform_technology = in situ oligonucleotide +!Platform_data_row_count = 54675 +#ID = Probe ID +#GB_ACC = GenBank accession +#SPOT_ID = Spot identifier +#Gene Symbol = Gene symbol +#Gene Title = Gene title +!platform_table_begin +ID GB_ACC SPOT_ID Gene Symbol Gene Title +1007_s_at U48705 - DDR1 discoidin domain receptor... +1053_at M87338 - RFC2 replication factor C... 
!platform_table_end

^SAMPLE = GSMxxxxxx
!Sample_title = Sample name
!Sample_source_name_ch1 = cell line XYZ
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = cell type: epithelial
!Sample_characteristics_ch1 = treatment: control
!Sample_molecule_ch1 = total RNA
!Sample_label_ch1 = biotin
!Sample_platform_id = GPLxxxxx
!Sample_data_processing = normalization method
#ID_REF = Probe identifier
#VALUE = Expression value
!sample_table_begin
ID_REF	VALUE
1007_s_at	8.456
1053_at	7.234
!sample_table_end
```

### Parsing SOFT Files

**With GEOparse:**
```python
import GEOparse

# Parse series
gse = GEOparse.get_GEO(filepath="GSE123456_family.soft.gz")

# Access metadata
metadata = gse.metadata
phenotype_data = gse.phenotype_data

# Access samples
for gsm_name, gsm in gse.gsms.items():
    sample_data = gsm.table
    sample_metadata = gsm.metadata

# Access platforms
for gpl_name, gpl in gse.gpls.items():
    platform_table = gpl.table
    platform_metadata = gpl.metadata
```

**Manual Parsing:**
```python
import gzip

def parse_soft_file(filename):
    """Basic SOFT file parser"""
    sections = {}
    current_section = None
    current_metadata = {}
    current_table = []
    in_table = False

    with gzip.open(filename, 'rt') as f:
        for line in f:
            line = line.strip()

            # New section
            if line.startswith('^'):
                if current_section:
                    sections[current_section] = {
                        'metadata': current_metadata,
                        'table': current_table
                    }
                parts = line[1:].split(' = ')
                current_section = parts[1] if len(parts) > 1 else parts[0]
                current_metadata = {}
                current_table = []
                in_table = False

            # Metadata; the table_begin/table_end markers delimit the data table
            elif line.startswith('!'):
                if line.endswith('_table_begin'):
                    in_table = True
                    continue
                if line.endswith('_table_end'):
                    in_table = False
                    continue
                key_value = line[1:].split(' = ', 1)
                if len(key_value) == 2:
                    key, value = key_value
                    if key in current_metadata:
                        if isinstance(current_metadata[key], list):
                            current_metadata[key].append(value)
                        else:
                            current_metadata[key] = [current_metadata[key], value]
                    else:
                        current_metadata[key] = value

            # Table data: column definitions ('#') and table rows
            elif line.startswith('#') or in_table:
                in_table = True
                current_table.append(line)

    # Save the final section; the loop only flushes on seeing the next '^'
    if current_section:
        sections[current_section] = {
            'metadata': current_metadata,
            'table': current_table
        }

    return sections
```

## MINiML File Format

### Overview

MINiML (MIAME Notation in Markup Language) is GEO's XML-based format for data exchange.

### File Structure

```xml
<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML" version="0.5.0">

  <Series iid="GSExxxxx">
    <Status>
      <Submission-Date>2024-01-01</Submission-Date>
      <Release-Date>2024-01-15</Release-Date>
      <Last-Update-Date>2024-01-15</Last-Update-Date>
    </Status>
    <Title>Study Title</Title>
    <Summary>Study description...</Summary>
    <Overall-Design>Experimental design...</Overall-Design>
    <Type>Expression profiling by array</Type>
    <Contributor iid="contrib1">
      <Person>
        <First>John</First>
        <Last>Doe</Last>
      </Person>
    </Contributor>
  </Series>

  <Platform iid="GPLxxxxx">
    <Title>Platform Name</Title>
    <Distribution>commercial</Distribution>
    <Technology>in situ oligonucleotide</Technology>
    <Organism>Homo sapiens</Organism>
    <Data-Table>
      <Column position="1">
        <Name>ID</Name>
        <Description>Probe identifier</Description>
      </Column>
      <Internal-Data rows="54675">
1007_s_at	U48705
      </Internal-Data>
    </Data-Table>
  </Platform>

  <Sample iid="GSMxxxxxx">
    <Title>Sample name</Title>
    <Channel position="1">
      <Source>cell line XYZ</Source>
      <Organism>Homo sapiens</Organism>
      <Characteristics tag="cell type">epithelial</Characteristics>
      <Characteristics tag="treatment">control</Characteristics>
    </Channel>
    <Data-Table>
      <Column position="1">
        <Name>ID_REF</Name>
      </Column>
      <Column position="2">
        <Name>VALUE</Name>
      </Column>
      <Internal-Data rows="54675">
1007_s_at	8.456
      </Internal-Data>
    </Data-Table>
  </Sample>

</MINiML>
```

## FTP Directory Structure

### Series Files

**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE{nnn}nnn/GSE{xxxxx}/
```

Where `{nnn}nnn` is the accession number with its last three digits replaced by the literal string "nnn", and `{xxxxx}` is the full accession.
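This path rule can be sketched as a small helper (`geo_ftp_dir` is a hypothetical name, not part of any GEO or GEOparse API):

```python
def geo_ftp_dir(accession: str) -> str:
    """Build the GEO FTP directory for a series/sample/platform accession.

    The parent directory replaces the last three digits of the accession
    number with the literal string "nnn"; accessions with three or fewer
    digits map to e.g. "GSEnnn".
    """
    prefix, number = accession[:3], accession[3:]
    folder = {"GSE": "series", "GSM": "samples", "GPL": "platforms"}[prefix]
    stem = number[:-3] if len(number) > 3 else ""
    return (f"ftp://ftp.ncbi.nlm.nih.gov/geo/{folder}/"
            f"{prefix}{stem}nnn/{accession}/")

print(geo_ftp_dir("GSE123456"))
# ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/
```

The same rule covers the sample and platform trees described below.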
+ +**Example:** +- GSE123456 → `/geo/series/GSE123nnn/GSE123456/` +- GSE1234 → `/geo/series/GSE1nnn/GSE1234/` +- GSE100001 → `/geo/series/GSE100nnn/GSE100001/` + +**Subdirectories:** +- `/matrix/` - Series matrix files +- `/soft/` - Family SOFT files +- `/miniml/` - MINiML XML files +- `/suppl/` - Supplementary files + +**File Types:** +``` +matrix/ + └── GSE123456_series_matrix.txt.gz + +soft/ + └── GSE123456_family.soft.gz + +miniml/ + └── GSE123456_family.xml.tgz + +suppl/ + ├── GSE123456_RAW.tar + ├── filelist.txt + └── [various supplementary files] +``` + +### Sample Files + +**Pattern:** +``` +ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM{nnn}nnn/GSM{xxxxx}/ +``` + +**Subdirectories:** +- `/suppl/` - Sample-specific supplementary files + +### Platform Files + +**Pattern:** +``` +ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL{nnn}nnn/GPL{xxxxx}/ +``` + +**File Types:** +``` +soft/ + └── GPL570.soft.gz + +miniml/ + └── GPL570.xml + +annot/ + └── GPL570.annot.gz # Enhanced annotation (if available) +``` + +## Advanced GEOparse Usage + +### Custom Parsing Options + +```python +import GEOparse + +# Parse with custom options +gse = GEOparse.get_GEO( + geo="GSE123456", + destdir="./data", + silent=False, # Show progress + how="full", # Parse mode: "full", "quick", "brief" + annotate_gpl=True, # Include platform annotation + geotype="GSE" # Explicit type +) + +# Access specific sample +gsm = gse.gsms['GSM1234567'] + +# Get expression values for specific probe +probe_id = "1007_s_at" +if hasattr(gsm, 'table'): + probe_data = gsm.table[gsm.table['ID_REF'] == probe_id] + +# Get all characteristics +characteristics = {} +for key, values in gsm.metadata.items(): + if key.startswith('characteristics'): + for value in (values if isinstance(values, list) else [values]): + if ':' in value: + char_key, char_value = value.split(':', 1) + characteristics[char_key.strip()] = char_value.strip() +``` + +### Working with Platform Annotations + +```python +import GEOparse +import pandas 
as pd

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Get platform
gpl = list(gse.gpls.values())[0]

# Extract annotation table
if hasattr(gpl, 'table'):
    annotation = gpl.table

    # Common annotation columns:
    # - ID: Probe identifier
    # - Gene Symbol: Gene symbol
    # - Gene Title: Gene description
    # - GB_ACC: GenBank accession
    # - Gene ID: Entrez Gene ID
    # - RefSeq: RefSeq accession
    # - UniGene: UniGene cluster

    # Map probes to genes
    probe_to_gene = dict(zip(
        annotation['ID'],
        annotation['Gene Symbol']
    ))

    # Handle multiple probes per gene
    gene_to_probes = {}
    for probe, gene in probe_to_gene.items():
        if gene and gene != '---':
            if gene not in gene_to_probes:
                gene_to_probes[gene] = []
            gene_to_probes[gene].append(probe)
```

### Handling Large Datasets

```python
import GEOparse
import pandas as pd

def process_large_gse(gse_id, chunk_size=1000):
    """Process large GEO series in chunks"""
    gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

    # Get sample list
    sample_list = list(gse.gsms.keys())

    # Process in chunks
    for i in range(0, len(sample_list), chunk_size):
        chunk_samples = sample_list[i:i+chunk_size]

        # Extract data for the chunk, indexed by probe so samples align
        chunk_data = {}
        for gsm_id in chunk_samples:
            gsm = gse.gsms[gsm_id]
            if hasattr(gsm, 'table'):
                chunk_data[gsm_id] = gsm.table.set_index('ID_REF')['VALUE']

        # Process chunk
        chunk_df = pd.DataFrame(chunk_data)

        # Save chunk results
        chunk_df.to_csv(f"chunk_{i//chunk_size}.csv")

        print(f"Processed {i+len(chunk_samples)}/{len(sample_list)} samples")
```

## Troubleshooting Common Issues

### Issue: GEOparse Fails to Download

**Symptoms:** Timeout errors, connection failures

**Solutions:**
1. Check internet connection
2. Try downloading directly via FTP first
3. Parse local files:
```python
gse = GEOparse.get_GEO(filepath="./local/GSE123456_family.soft.gz")
```
4.
Increase the timeout (modify GEOparse source if needed)

### Issue: Missing Expression Data

**Symptoms:** `pivot_samples()` fails or returns an empty matrix

**Cause:** Not all series have series matrix files (common for older submissions)

**Solution:** Parse individual sample tables:
```python
import pandas as pd

expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table') and 'VALUE' in gsm.table.columns:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
```

### Issue: Inconsistent Probe IDs

**Symptoms:** Probe IDs don't match between samples

**Cause:** Different platform versions or sample processing

**Solution:** Standardize using platform annotation:
```python
import pandas as pd

# Get common probe set
all_probes = set()
for gsm in gse.gsms.values():
    if hasattr(gsm, 'table'):
        all_probes.update(gsm.table['ID_REF'].values)

# Create standardized matrix (sort probes for a deterministic row order)
standardized_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        sample_data = gsm.table.set_index('ID_REF')['VALUE']
        standardized_data[gsm_name] = sample_data.reindex(sorted(all_probes))

expression_df = pd.DataFrame(standardized_data)
```

### Issue: E-utilities Rate Limiting

**Symptoms:** HTTP 429 errors, slow responses

**Solution:**
1. Get an API key from NCBI
2.
Implement rate limiting:
```python
import time
from functools import wraps

from Bio import Entrez

def rate_limit(calls_per_second=3):
    min_interval = 1.0 / calls_per_second

    def decorator(func):
        last_called = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_second=3)
def safe_esearch(query):
    handle = Entrez.esearch(db="gds", term=query)
    results = Entrez.read(handle)
    handle.close()
    return results
```

### Issue: Memory Errors with Large Datasets

**Symptoms:** MemoryError, system slowdown

**Solution:**
1. Process data in chunks
2. Use sparse matrices for expression data
3. Load only necessary columns
4. Use memory-efficient data types:
```python
import numpy as np
import pandas as pd

# Read with specific dtypes
expression_df = pd.read_csv(
    "expression_matrix.csv",
    dtype={'ID': str, 'GSM1': np.float32}  # Use float32 instead of float64
)

# Or use a sparse format for mostly-zero data (numeric columns only)
import scipy.sparse as sp
sparse_matrix = sp.csr_matrix(expression_df.select_dtypes('number').values)
```

## Platform-Specific Considerations

### Affymetrix Arrays

- Probe ID format: `1007_s_at`, `1053_at`
- Multiple probe sets per gene are common
- Check for `_at`, `_s_at`, `_x_at` suffixes
- May need RMA or MAS5 normalization

### Illumina Arrays

- Probe ID format: `ILMN_1234567`
- Watch for duplicate probes
- BeadChip-specific processing may be needed

### RNA-seq

- May not have traditional "probes"
- Check for gene IDs (Ensembl, Entrez)
- Counts vs. FPKM/TPM values
- May need separate count files

### Two-Channel Arrays

- Look for `_ch1` and `_ch2` suffixes in metadata
- VALUE_ch1, VALUE_ch2 columns
- May need ratio or intensity values
- Check dye-swap experiments

## Best Practices Summary

1.
**Always set Entrez.email** before using E-utilities
2. **Use an API key** for better rate limits
3. **Cache downloaded files** locally
4. **Check data quality** before analysis
5. **Verify platform annotations** are current
6. **Document data processing** steps
7. **Cite original studies** when using data
8. **Check for batch effects** in meta-analyses
9. **Validate results** with independent datasets
10. **Follow NCBI usage guidelines**
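Practices 6 and 7 can be made concrete with a small provenance log (`record_step` is a hypothetical helper, stdlib only, not part of GEOparse or Biopython):

```python
import datetime
import json

def record_step(log, action, **params):
    """Append a timestamped processing step to an analysis log."""
    log.append({
        "action": action,
        "params": params,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return log

log = record_step([], "download", accession="GSE123456", source="GEO FTP")
record_step(log, "normalize", method="quantile")
record_step(log, "cite", pubmed_id="12345678")

# Persist alongside the results so the processing chain is reproducible
print(json.dumps(log, indent=2))
```

Saving such a log next to each derived matrix keeps the download source, normalization choices, and citations attached to the data they produced.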