---
name: cbioportal-database
description: Query cBioPortal for cancer genomics data including somatic mutations, copy number alterations, gene expression, and survival data across hundreds of cancer studies. Essential for cancer target validation, oncogene/tumor suppressor analysis, and patient-level genomic profiling.
license: LGPL-3.0
metadata:
  skill-author: Kuan-lin Huang
---

# cBioPortal Database

## Overview

cBioPortal for Cancer Genomics (https://www.cbioportal.org/) is an open-access resource for exploring, visualizing, and analyzing multidimensional cancer genomics data. It hosts data from The Cancer Genome Atlas (TCGA), AACR Project GENIE, MSK-IMPACT, and hundreds of other cancer studies, covering mutations, copy number alterations (CNA), structural variants, mRNA/protein expression, methylation, and clinical data for thousands of cancer samples.

**Key resources:**
- cBioPortal website: https://www.cbioportal.org/
- REST API: https://www.cbioportal.org/api/
- API docs (Swagger): https://www.cbioportal.org/api/swagger-ui/index.html
- Python client: `bravado` or `requests`
- GitHub: https://github.com/cBioPortal/cbioportal

## When to Use This Skill

Use cBioPortal for:

- **Mutation landscape**: What fraction of a cancer type has mutations in a specific gene?
- **Oncogene/TSG validation**: Is a gene frequently mutated, amplified, or deleted in cancer?
- **Co-mutation patterns**: Are mutations in gene A and gene B mutually exclusive or co-occurring?
- **Survival analysis**: Do mutations in a gene associate with better or worse patient outcomes?
- **Alteration profiles**: What types of alterations (missense, truncating, amplification, deletion) affect a gene?
- **Pan-cancer analysis**: How do alteration frequencies compare across cancer types?
- **Clinical associations**: How do genomic alterations relate to clinical variables (stage, grade, treatment response)?
- **TCGA/GENIE exploration**: Systematic access to TCGA and clinical sequencing datasets

## Core Capabilities

### 1. cBioPortal REST API

Base URL: `https://www.cbioportal.org/api`

The API is RESTful, returns JSON, and requires no API key for public data.

```python
import requests

BASE_URL = "https://www.cbioportal.org/api"
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

def cbioportal_get(endpoint, params=None):
    """Send a GET request to the API and return the parsed JSON response."""
    url = f"{BASE_URL}/{endpoint}"
    response = requests.get(url, params=params, headers=HEADERS)
    response.raise_for_status()  # Raise on 4xx/5xx instead of returning an error body
    return response.json()

def cbioportal_post(endpoint, body):
    """Send a POST request with a JSON body and return the parsed JSON response."""
    url = f"{BASE_URL}/{endpoint}"
    response = requests.post(url, json=body, headers=HEADERS)
    response.raise_for_status()
    return response.json()
```

### 2. Browse Studies

```python
def get_all_studies():
    """List all available cancer studies."""
    # pageSize caps the number of studies returned; raise it if the portal hosts more
    return cbioportal_get("studies", {"pageSize": 500})

# Each study has:
#   studyId: unique identifier (e.g., "brca_tcga")
#   name: human-readable name
#   description: dataset description
#   cancerTypeId: cancer type abbreviation
#   referenceGenome: reference genome build (e.g., hg19 or hg38)
#   pmid: associated publication

studies = get_all_studies()
print(f"Total studies: {len(studies)}")

# Common TCGA study IDs:
#   brca_tcga, luad_tcga, coadread_tcga, gbm_tcga, prad_tcga,
#   skcm_tcga, blca_tcga, hnsc_tcga, lihc_tcga, stad_tcga

# Filter for TCGA studies
tcga_studies = [s for s in studies if "tcga" in s["studyId"]]
print([s["studyId"] for s in tcga_studies[:10]])
```

### 3. Molecular Profiles

Each study has multiple molecular profiles (mutation, CNA, expression, etc.):

```python
def get_molecular_profiles(study_id):
    """Get all molecular profiles for a study."""
    return cbioportal_get(f"studies/{study_id}/molecular-profiles")

profiles = get_molecular_profiles("brca_tcga")
for p in profiles:
    print(f"  {p['molecularProfileId']}: {p['name']} ({p['molecularAlterationType']})")

# Alteration types:
#   MUTATION_EXTENDED: somatic mutations
#   COPY_NUMBER_ALTERATION: CNA (GISTIC)
#   MRNA_EXPRESSION: mRNA expression
#   PROTEIN_LEVEL: RPPA protein expression
#   STRUCTURAL_VARIANT: fusions/rearrangements
```

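Profile IDs such as `brca_tcga_mutations` follow a common naming pattern, but not every study has every profile. A more defensive approach is to look the ID up from the profile list by alteration type. The sketch below assumes each profile record carries `molecularAlterationType` and `datatype` fields; `find_profile_id` is a hypothetical helper, and the sample records are illustrative rather than a real API response.

```python
def find_profile_id(profiles, alteration_type, datatype=None):
    """Return the first molecularProfileId matching the alteration type, or None.

    `profiles` is a list of dicts shaped like the output of
    get_molecular_profiles(); `datatype` optionally narrows the match
    (e.g., "DISCRETE" for GISTIC calls).
    """
    for p in profiles:
        if p.get("molecularAlterationType") != alteration_type:
            continue
        if datatype is not None and p.get("datatype") != datatype:
            continue
        return p["molecularProfileId"]
    return None


# Illustrative records only (not fetched from the API).
sample_profiles = [
    {"molecularProfileId": "brca_tcga_mutations",
     "molecularAlterationType": "MUTATION_EXTENDED", "datatype": "MAF"},
    {"molecularProfileId": "brca_tcga_linear_CNA",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION", "datatype": "CONTINUOUS"},
    {"molecularProfileId": "brca_tcga_gistic",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION", "datatype": "DISCRETE"},
]

print(find_profile_id(sample_profiles, "MUTATION_EXTENDED"))
print(find_profile_id(sample_profiles, "COPY_NUMBER_ALTERATION", datatype="DISCRETE"))
print(find_profile_id(sample_profiles, "STRUCTURAL_VARIANT"))
```

Returning `None` for a missing profile lets the caller skip a study before issuing a request that would fail.
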
### 4. Mutation Data

```python
import pandas as pd

def get_mutations(molecular_profile_id, entrez_gene_ids, sample_list_id=None):
    """Get mutations for specified genes in a molecular profile."""
    body = {
        "entrezGeneIds": entrez_gene_ids,
        # Default to the study's all-samples list, e.g. brca_tcga_mutations -> brca_tcga_all
        "sampleListId": sample_list_id or molecular_profile_id.replace("_mutations", "_all")
    }
    return cbioportal_post(
        f"molecular-profiles/{molecular_profile_id}/mutations/fetch",
        body
    )

# BRCA1 Entrez ID is 672, TP53 is 7157, PTEN is 5728
mutations = get_mutations("brca_tcga_mutations", entrez_gene_ids=[7157])  # TP53

# Each mutation record contains:
#   patientId, sampleId, entrezGeneId
#   mutationType (Missense_Mutation, Nonsense_Mutation, Frame_Shift_Del, etc.)
#   proteinChange (e.g., "R175H")
#   variantType
#   ncbiBuild, chr, startPosition, endPosition, referenceAllele, variantAllele
#   mutationStatus (Somatic/Germline)
#   tumorAltCount, tumorRefCount (read counts; tumor VAF = alt / (alt + ref))

df = pd.DataFrame(mutations)
# Derive tumor VAF from read counts; counts may be missing for some records
df["tumor_vaf"] = df["tumorAltCount"] / (df["tumorAltCount"] + df["tumorRefCount"])
print(df[["patientId", "mutationType", "proteinChange", "tumor_vaf"]].head())
print(f"\nMutation types:\n{df['mutationType'].value_counts()}")
```

### 5. Copy Number Alteration Data

```python
def get_cna(molecular_profile_id, entrez_gene_ids):
    """Get discrete CNA data (GISTIC: -2, -1, 0, 1, 2)."""
    body = {
        "entrezGeneIds": entrez_gene_ids,
        "sampleListId": molecular_profile_id.replace("_gistic", "_all").replace("_cna", "_all")
    }
    # Request every event type; without this parameter the endpoint
    # returns only deep deletions and amplifications.
    return cbioportal_post(
        f"molecular-profiles/{molecular_profile_id}/discrete-copy-number/fetch"
        "?discreteCopyNumberEventType=ALL",
        body
    )

# GISTIC values:
#   -2 = Deep deletion (homozygous loss)
#   -1 = Shallow deletion (heterozygous loss)
#    0 = Diploid (neutral)
#    1 = Low-level gain
#    2 = High-level amplification

cna_data = get_cna("brca_tcga_gistic", entrez_gene_ids=[1956])  # EGFR
df_cna = pd.DataFrame(cna_data)
# Each record stores its GISTIC call in the "alteration" field
print(df_cna["alteration"].value_counts())
```

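The GISTIC codes listed above are easier to read in a summary when mapped to labels. The sketch below tallies discrete CNA records by category using only the standard library. It assumes each record stores its call under `alteration`; pass a different `value_key` if your records use another field name. `GISTIC_LABELS` and `summarize_cna` are hypothetical names, and the input records are made up.

```python
from collections import Counter

# Labels follow the GISTIC value table above.
GISTIC_LABELS = {
    -2: "deep_deletion",
    -1: "shallow_deletion",
    0: "diploid",
    1: "gain",
    2: "amplification",
}

def summarize_cna(records, value_key="alteration"):
    """Count CNA records per GISTIC category; unexpected values map to 'unknown'."""
    counts = Counter()
    for record in records:
        label = GISTIC_LABELS.get(record.get(value_key), "unknown")
        counts[label] += 1
    return dict(counts)


# Made-up records for illustration.
example_records = [
    {"sampleId": "S1", "alteration": 2},
    {"sampleId": "S2", "alteration": 2},
    {"sampleId": "S3", "alteration": -2},
    {"sampleId": "S4", "alteration": 0},
    {"sampleId": "S5", "alteration": 5},  # not a valid GISTIC code
]
cna_summary = summarize_cna(example_records)
print(cna_summary)
```
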
### 6. Alteration Frequency (OncoPrint-style)

```python
def get_alteration_frequency(study_id, gene_symbols, alteration_types=None):
    """Compute the percentage of samples with a mutation in each gene.

    Only mutations are counted; `alteration_types` is currently unused.
    """
    # Get the sample list that covers all cases in the study
    sample_lists = cbioportal_get(f"studies/{study_id}/sample-lists")
    all_samples_id = next(
        (s["sampleListId"] for s in sample_lists if s["category"] == "all_cases_in_study"), None
    )
    total_samples = len(cbioportal_get(f"sample-lists/{all_samples_id}/sample-ids"))

    # Map gene symbols to Entrez IDs (the request body is a plain list of symbols)
    response = requests.post(
        f"{BASE_URL}/genes/fetch",
        params={"geneIdType": "HUGO_GENE_SYMBOL"},
        json=gene_symbols,
        headers=HEADERS
    )
    response.raise_for_status()
    # Key by symbol: the response order is not guaranteed to match the request
    symbol_to_entrez = {g["hugoGeneSymbol"]: g["entrezGeneId"] for g in response.json()}

    # Get mutations
    mutation_profile = f"{study_id}_mutations"
    mutations = get_mutations(mutation_profile, list(symbol_to_entrez.values()), all_samples_id)

    freq = {}
    for g_symbol, e_id in symbol_to_entrez.items():
        # Count samples (not patients) so the numerator matches the denominator
        mutated = len(set(m["sampleId"] for m in mutations if m["entrezGeneId"] == e_id))
        freq[g_symbol] = mutated / total_samples * 100

    return freq

# Example
freq = get_alteration_frequency("brca_tcga", ["TP53", "PIK3CA", "BRCA1", "BRCA2"])
for gene, pct in sorted(freq.items(), key=lambda x: -x[1]):
    print(f"  {gene}: {pct:.1f}%")
```

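The same mutation sets can address the co-mutation question listed under "When to Use This Skill". The following is a rough sketch rather than a statistical test: it builds the 2x2 table for two genes and reports a log2 odds ratio, where positive values suggest co-occurrence and negative values suggest mutual exclusivity. The 0.5 added to each cell (the Haldane-Anscombe correction) is a choice made here to avoid division by zero. `co_occurrence` is a hypothetical helper and the sample IDs are made up.

```python
import math

def co_occurrence(all_samples, mutated_a, mutated_b):
    """Build a 2x2 co-mutation table and a continuity-corrected log2 odds ratio."""
    all_samples = set(all_samples)
    a = set(mutated_a) & all_samples
    b = set(mutated_b) & all_samples

    both = len(a & b)
    a_only = len(a - b)
    b_only = len(b - a)
    neither = len(all_samples - a - b)

    # Haldane-Anscombe correction: add 0.5 to every cell so zero counts are safe.
    odds_ratio = ((both + 0.5) * (neither + 0.5)) / ((a_only + 0.5) * (b_only + 0.5))
    return {
        "both": both,
        "a_only": a_only,
        "b_only": b_only,
        "neither": neither,
        "log2_odds_ratio": math.log2(odds_ratio),
    }


# Made-up example: 10 samples, gene A mutated in 4, gene B in 3, overlap of 2.
samples = range(10)
table = co_occurrence(samples, mutated_a={0, 1, 2, 3}, mutated_b={2, 3, 4})
print(table)
```

In practice the sets would come from `sampleId` values in the mutation records for each gene. A significance test such as Fisher's exact test would be needed before drawing conclusions from the odds ratio.
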
### 7. Clinical Data

```python
def get_clinical_data(study_id, attribute_ids=None):
    """Get patient-level clinical data, optionally limited to some attributes."""
    all_clinical = cbioportal_get(
        f"studies/{study_id}/clinical-data",
        {"clinicalDataType": "PATIENT"}
    )
    # Returns list of {patientId, studyId, clinicalAttributeId, value}
    if attribute_ids:
        all_clinical = [c for c in all_clinical if c["clinicalAttributeId"] in attribute_ids]
    return all_clinical

# Clinical attributes include:
#   OS_STATUS, OS_MONTHS, DFS_STATUS, DFS_MONTHS (survival)
#   TUMOR_STAGE, GRADE, AGE, SEX, RACE
#   Study-specific attributes vary

def get_clinical_attributes(study_id):
    """List all available clinical attributes for a study."""
    return cbioportal_get(f"studies/{study_id}/clinical-attributes")
```

## Query Workflows

### Workflow 1: Gene Alteration Profile in a Cancer Type

```python
import requests, pandas as pd

def alteration_profile(study_id, gene_symbol):
    """Full alteration profile for a gene in a cancer study."""

    # 1. Get gene Entrez ID (the request body is a plain list of symbols)
    gene_info = requests.post(
        f"{BASE_URL}/genes/fetch",
        params={"geneIdType": "HUGO_GENE_SYMBOL"},
        json=[gene_symbol],
        headers=HEADERS
    ).json()[0]
    entrez_id = gene_info["entrezGeneId"]

    # 2. Get mutations
    mutations = get_mutations(f"{study_id}_mutations", [entrez_id])
    mut_df = pd.DataFrame(mutations) if mutations else pd.DataFrame()

    # 3. Get CNAs
    cna = get_cna(f"{study_id}_gistic", [entrez_id])
    cna_df = pd.DataFrame(cna) if cna else pd.DataFrame()

    # 4. Summary (CNA records store the GISTIC call in "alteration")
    n_mut = len(set(mut_df["patientId"])) if not mut_df.empty else 0
    n_amp = len(cna_df[cna_df["alteration"] == 2]) if not cna_df.empty else 0
    n_del = len(cna_df[cna_df["alteration"] == -2]) if not cna_df.empty else 0

    return {"mutations": n_mut, "amplifications": n_amp, "deep_deletions": n_del}

result = alteration_profile("brca_tcga", "PIK3CA")
print(result)
```

### Workflow 2: Pan-Cancer Gene Mutation Frequency

```python
import requests, pandas as pd

def pan_cancer_mutation_freq(gene_symbol, cancer_study_ids=None):
    """Mutation frequency of a gene across multiple cancer types."""
    studies = get_all_studies()
    if cancer_study_ids:
        studies = [s for s in studies if s["studyId"] in cancer_study_ids]

    results = []
    for study in studies[:20]:  # Limit for demo
        try:
            freq = get_alteration_frequency(study["studyId"], [gene_symbol])
            results.append({
                "study": study["studyId"],
                "cancer": study.get("cancerTypeId", ""),
                "mutation_pct": freq.get(gene_symbol, 0)
            })
        except Exception:
            # Skip studies without a mutation profile or with failed requests
            continue

    if not results:
        return pd.DataFrame(columns=["study", "cancer", "mutation_pct"])
    df = pd.DataFrame(results).sort_values("mutation_pct", ascending=False)
    return df
```

### Workflow 3: Survival Analysis by Mutation Status

```python
import requests, pandas as pd

def survival_by_mutation(study_id, gene_symbol):
    """Get survival data split by mutation status."""
    # This workflow fetches clinical and mutation data for downstream analysis

    # Look up the Entrez ID (the request body is a plain list of symbols)
    gene_info = requests.post(
        f"{BASE_URL}/genes/fetch",
        params={"geneIdType": "HUGO_GENE_SYMBOL"},
        json=[gene_symbol],
        headers=HEADERS
    ).json()[0]
    entrez_id = gene_info["entrezGeneId"]

    mutations = get_mutations(f"{study_id}_mutations", [entrez_id])
    mutated_patients = set(m["patientId"] for m in mutations)

    # Survival attributes are patient-level clinical data
    clinical = cbioportal_get(
        f"studies/{study_id}/clinical-data",
        {"clinicalDataType": "PATIENT"}
    )
    clinical_df = pd.DataFrame(clinical)

    # Reshape from long (one row per attribute) to wide (one row per patient)
    os_data = clinical_df[clinical_df["clinicalAttributeId"].isin(["OS_MONTHS", "OS_STATUS"])]
    os_wide = os_data.pivot(index="patientId", columns="clinicalAttributeId", values="value")
    os_wide["mutated"] = os_wide.index.isin(mutated_patients)

    return os_wide
```

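The table returned by `survival_by_mutation` holds `OS_STATUS` and `OS_MONTHS` as strings, so a downstream analysis first has to parse them. The sketch below shows one possible next step in plain Python: parse each row into a `(time, event)` pair and compute a Kaplan-Meier estimate per group. It assumes `OS_STATUS` is encoded as `"1:DECEASED"` / `"0:LIVING"` or as plain `"DECEASED"` / `"LIVING"`; check the values in your study, since encodings vary. `parse_os` and `kaplan_meier` are hypothetical helpers, and a dedicated survival library would be the better choice for real analyses.

```python
import math

def parse_os(status, months):
    """Convert raw OS_STATUS / OS_MONTHS strings to (time, event), or None if unusable."""
    try:
        time_months = float(months)
    except (TypeError, ValueError):
        return None
    if math.isnan(time_months) or not isinstance(status, str):
        return None
    s = status.strip().upper()
    if s.startswith("1:") or s == "DECEASED":
        return time_months, True
    if s.startswith("0:") or s == "LIVING":
        return time_months, False
    return None

def kaplan_meier(observations):
    """Kaplan-Meier estimate from (time, event) pairs.

    Returns a list of (time, survival_probability), one entry per
    distinct time at which at least one event occurred.
    """
    obs = sorted(observations)
    n_at_risk = len(obs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = 0
        n_at_t = 0
        # Group all observations that share the same time point.
        while i < len(obs) and obs[i][0] == t:
            deaths += 1 if obs[i][1] else 0
            n_at_t += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_at_t  # Events and censored cases both leave the risk set.
    return curve


# Made-up example: events at 1, 3, and 4 months; one patient censored at 2 months.
km_curve = kaplan_meier([(1, True), (2, False), (3, True), (4, True)])
print(km_curve)
```

To apply this to the workflow output, parse each row of `os_wide` with `parse_os`, drop the `None` results, and call `kaplan_meier` separately for the mutated and non-mutated groups.
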
## Key API Endpoints Summary

| Endpoint | Description |
|----------|-------------|
| `GET /studies` | List all studies |
| `GET /studies/{studyId}/molecular-profiles` | Molecular profiles for a study |
| `POST /molecular-profiles/{profileId}/mutations/fetch` | Get mutation data |
| `POST /molecular-profiles/{profileId}/discrete-copy-number/fetch` | Get CNA data |
| `POST /molecular-profiles/{profileId}/molecular-data/fetch` | Get expression data |
| `GET /studies/{studyId}/clinical-attributes` | Available clinical variables |
| `GET /studies/{studyId}/clinical-data` | Clinical data for a study |
| `POST /genes/fetch` | Gene metadata by symbol or Entrez ID |
| `GET /studies/{studyId}/sample-lists` | Sample lists |

## Best Practices

- **Know your study IDs**: Use the Swagger UI or `GET /studies` to find the correct study ID
- **Use sample lists**: Each study has an all-samples list (for example, `brca_tcga_all`) and subsets; always specify the appropriate one
- **TCGA vs. GENIE**: TCGA data is comprehensive but older; GENIE has more recent clinical sequencing data
- **Entrez gene IDs**: The API uses Entrez IDs; use `/genes/fetch` to convert from symbols
- **Handle 404s**: Some molecular profiles may not exist for all studies
- **Rate limiting**: Add delays for bulk queries; consider downloading data files for large-scale analyses

## Data Downloads

For large-scale analyses, download study data directly:
```bash
# Download TCGA BRCA data
wget https://cbioportal-datahub.s3.amazonaws.com/brca_tcga.tar.gz
```

## Additional Resources

- **cBioPortal website**: https://www.cbioportal.org/
- **API Swagger UI**: https://www.cbioportal.org/api/swagger-ui/index.html
- **Documentation**: https://docs.cbioportal.org/
- **GitHub**: https://github.com/cBioPortal/cbioportal
- **Data hub**: https://datahub.cbioportal.org/
- **Citation**: Cerami E et al. (2012) Cancer Discovery. PMID: 22588877
- **API clients**: https://docs.cbioportal.org/web-api-and-clients/