117 lines
3.6 KiB
Markdown
117 lines
3.6 KiB
Markdown
# Effective Genome Sizes
|
|
|
|
## Definition
|
|
|
|
Effective genome size refers to the length of the "mappable" genome - regions that can be uniquely mapped by sequencing reads. This metric is crucial for proper normalization in many deepTools commands.
|
|
|
|
## Why It Matters
|
|
|
|
- Required for RPGC normalization (`--normalizeUsing RPGC`)
|
|
- Affects accuracy of coverage calculations
|
|
- Must match your data processing approach (filtered vs unfiltered reads)
|
|
|
|
## Calculation Methods
|
|
|
|
1. **Non-N bases**: Count of non-N nucleotides in genome sequence
|
|
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)
|
|
|
|
## Common Organism Values
|
|
|
|
### Using Non-N Bases Method
|
|
|
|
| Organism | Assembly | Effective Size | Full Command |
|
|
|----------|----------|----------------|--------------|
|
|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
|
|
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
|
|
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
|
|
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
|
|
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
|
|
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
|
|
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
|
|
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |
|
|
|
|
### Human (GRCh38) by Read Length
|
|
|
|
For quality-filtered reads, values vary by read length:
|
|
|
|
| Read Length | Effective Size |
|
|
|-------------|----------------|
|
|
| 50bp | ~2.7 billion |
|
|
| 75bp | ~2.8 billion |
|
|
| 100bp | ~2.8 billion |
|
|
| 150bp | ~2.9 billion |
|
|
| 250bp | ~2.9 billion |
|
|
|
|
### Mouse (GRCm38) by Read Length
|
|
|
|
| Read Length | Effective Size |
|
|
|-------------|----------------|
|
|
| 50bp | ~2.3 billion |
|
|
| 75bp | ~2.5 billion |
|
|
| 100bp | ~2.6 billion |
|
|
|
|
## Usage in deepTools
|
|
|
|
The effective genome size is most commonly used with:
|
|
|
|
### bamCoverage with RPGC normalization
|
|
```bash
|
|
bamCoverage --bam input.bam --outFileName output.bw \
|
|
--normalizeUsing RPGC \
|
|
--effectiveGenomeSize 2913022398
|
|
```
|
|
|
|
### bamCompare with RPGC normalization
|
|
```bash
|
|
bamCompare -b1 treatment.bam -b2 control.bam \
|
|
--outFileName comparison.bw \
|
|
--scaleFactorsMethod RPGC \
|
|
--effectiveGenomeSize 2913022398
|
|
```
|
|
|
|
### computeGCBias / correctGCBias
|
|
```bash
|
|
computeGCBias --bamfile input.bam \
|
|
--effectiveGenomeSize 2913022398 \
|
|
--genome genome.2bit \
|
|
--fragmentLength 200 \
|
|
--biasPlot bias.png
|
|
```
|
|
|
|
## Choosing the Right Value
|
|
|
|
**For most analyses:** Use the non-N bases method value for your reference genome
|
|
|
|
**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values
|
|
|
|
**When unsure:** Use the conservative non-N bases value - it's more widely applicable
|
|
|
|
## Common Shortcuts
|
|
|
|
deepTools also accepts these shorthand values in some contexts:
|
|
|
|
- `hs` or `GRCh38`: 2913022398
|
|
- `mm` or `GRCm38`: 2652783500
|
|
- `dm` or `dm6`: 142573017
|
|
- `ce` or `ce10`: 100286401
|
|
|
|
Check your specific deepTools version documentation for supported shortcuts.
|
|
|
|
## Calculating Custom Values
|
|
|
|
For custom genomes or assemblies, calculate the non-N bases count:
|
|
|
|
```bash
|
|
# Using faCount (UCSC tools)
|
|
faCount genome.fa | grep "total" | awk '{print $2-$7}'
|
|
|
|
# Using seqtk
|
|
seqtk comp genome.fa | awk '{x+=$2}END{print x}'
|
|
```
|
|
|
|
## References
|
|
|
|
For the most up-to-date effective genome sizes and detailed calculation methods, see:
|
|
- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
|
|
- ENCODE documentation for reference genome details
|