# Joint Calling
Call variants jointly across multiple samples for improved accuracy and consistent genotyping.
## Why Joint Calling?

- **Improved sensitivity**: evidence is leveraged across samples, so weak per-sample signals can still be called.
- **Consistent genotyping**: the same sites are genotyped in every sample, so "no call" and "homozygous reference" stay distinguishable (see the quick check below).
- **VQSR eligibility**: the machine-learning filtering step (VQSR) needs a cohort-scale callset to train on.
- **Population analysis**: allele frequencies are computed across the whole cohort.
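As a quick illustration of consistent genotyping: in the final joint-called VCF every sample has an explicit genotype (including `0/0` or `./.`) at every cohort site. A minimal check, assuming the joint-called output is named `cohort.vcf.gz` as in the steps below:

```bash
# List the samples in the joint VCF, then print one genotype column per sample
# for the first few sites ([\t%GT] expands to one field per sample)
bcftools query -l cohort.vcf.gz
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' cohort.vcf.gz | head
```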
## Workflow Overview

```
Sample BAMs
  │
  ├── HaplotypeCaller (per-sample, -ERC GVCF)
  │     └── sample1.g.vcf.gz, sample2.g.vcf.gz, ...
  │
  ├── CombineGVCFs or GenomicsDBImport
  │     └── Combine into cohort gVCF / GenomicsDB workspace
  │
  ├── GenotypeGVCFs
  │     └── Joint genotyping
  │
  └── VQSR or hard filtering
        └── Final cohort VCF
```
## Step 1: Per-Sample gVCF Generation

```bash
# Generate a gVCF for each sample
gatk HaplotypeCaller \
  -R reference.fa \
  -I sample1.bam \
  -O sample1.g.vcf.gz \
  -ERC GVCF

# With intervals (faster for targeted data)
gatk HaplotypeCaller \
  -R reference.fa \
  -I sample1.bam \
  -O sample1.g.vcf.gz \
  -ERC GVCF \
  -L intervals.bed
```
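Before moving on, it can be worth confirming that the output really is a gVCF rather than a plain VCF. This is a rough check with bcftools; the `##GVCFBlock` header lines and the `<NON_REF>` symbolic allele are what current GATK versions emit, so treat the exact strings as an assumption:

```bash
# A GATK gVCF carries reference-block definitions in its header ...
bcftools view -h sample1.g.vcf.gz | grep -c '^##GVCFBlock'
# ... and <NON_REF> in the ALT column of essentially every record
bcftools query -f '%CHROM\t%POS\t%ALT\n' sample1.g.vcf.gz | head
```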
### Batch Processing

```bash
# Process all samples; every job is backgrounded at once, so cap concurrency
# for large cohorts (see the alternative below and the pipeline script)
for bam in *.bam; do
  sample=$(basename "$bam" .bam)
  gatk HaplotypeCaller \
    -R reference.fa \
    -I "$bam" \
    -O "${sample}.g.vcf.gz" \
    -ERC GVCF &
done
wait
```
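The loop above launches every sample simultaneously, which can exhaust memory or CPU on large cohorts. If GNU parallel happens to be available (an extra dependency, not part of the original workflow), it is an easy way to cap concurrency:

```bash
# Run at most 4 HaplotypeCaller jobs at a time; {.} is the input path minus its extension
ls *.bam | parallel -j 4 \
  'gatk HaplotypeCaller -R reference.fa -I {} -O {.}.g.vcf.gz -ERC GVCF'
```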
## Step 2a: CombineGVCFs (Small Cohorts)

For <100 samples:

```bash
gatk CombineGVCFs \
  -R reference.fa \
  -V sample1.g.vcf.gz \
  -V sample2.g.vcf.gz \
  -V sample3.g.vcf.gz \
  -O cohort.g.vcf.gz
```
### From a Sample Map

```bash
# Create a sample map file (sample name <TAB> path), e.g.:
#   sample1   /path/to/sample1.g.vcf.gz
#   sample2   /path/to/sample2.g.vcf.gz
ls *.g.vcf.gz | while read f; do
  echo -e "$(basename "$f" .g.vcf.gz)\t$f"
done > sample_map.txt

# Combine by passing -V for each path in the map
gatk CombineGVCFs \
  -R reference.fa \
  $(cut -f2 sample_map.txt | sed 's/^/-V /') \
  -O cohort.g.vcf.gz
```
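A quick sanity check is to list the sample columns of the combined gVCF and confirm the count matches the cohort (note that sample names come from the read groups in the BAMs, not from the file names):

```bash
# One line per sample in the combined gVCF
bcftools query -l cohort.g.vcf.gz
bcftools query -l cohort.g.vcf.gz | wc -l
```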
## Step 2b: GenomicsDBImport (Large Cohorts)

For >100 samples, use a GenomicsDB workspace:

```bash
# Create the sample map (sample name <TAB> path)
ls *.g.vcf.gz | while read f; do
  echo -e "$(basename "$f" .g.vcf.gz)\t$f"
done > sample_map.txt

# Import into GenomicsDB (one workspace per chromosome for parallelism)
gatk GenomicsDBImport \
  --sample-name-map sample_map.txt \
  --genomicsdb-workspace-path genomicsdb_chr1 \
  -L chr1 \
  --reader-threads 4

# Or all chromosomes in parallel
for chr in {1..22} X Y; do
  gatk GenomicsDBImport \
    --sample-name-map sample_map.txt \
    --genomicsdb-workspace-path genomicsdb_chr${chr} \
    -L chr${chr} &
done
wait
```
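A GenomicsDB workspace is not a VCF, but any GATK tool that takes `-V` can read it through the `gendb://` scheme. A small SelectVariants extraction over an arbitrary interval (the coordinates below are just an example) is a cheap way to spot-check an import:

```bash
# Pull a small region back out of the workspace and count the samples it contains
gatk SelectVariants \
  -R reference.fa \
  -V gendb://genomicsdb_chr1 \
  -L chr1:1000000-1100000 \
  -O chr1_check.vcf.gz
bcftools query -l chr1_check.vcf.gz | wc -l   # should equal the cohort size
```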
### Update GenomicsDB with New Samples

```bash
# When updating, intervals are taken from the existing workspace, so -L is not passed
gatk GenomicsDBImport \
  --genomicsdb-update-workspace-path genomicsdb_chr1 \
  --sample-name-map new_samples.txt
```
## Step 3: GenotypeGVCFs

### From a Combined gVCF

```bash
gatk GenotypeGVCFs \
  -R reference.fa \
  -V cohort.g.vcf.gz \
  -O cohort.vcf.gz
```
### From GenomicsDB

```bash
gatk GenotypeGVCFs \
  -R reference.fa \
  -V gendb://genomicsdb_chr1 \
  -O chr1.vcf.gz

# All chromosomes in parallel
for chr in {1..22} X Y; do
  gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb_chr${chr} \
    -O chr${chr}.vcf.gz &
done
wait

# Merge per-chromosome VCFs
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz \
  -Oz -o cohort.vcf.gz
```
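Before concatenating, a quick check that each per-chromosome VCF is non-empty can save a confusing downstream failure. GATK writes a `.tbi` index alongside block-compressed output by default, and `bcftools index -n` reports the record count stored in that index (a sketch assuming a reasonably recent bcftools):

```bash
# Print the number of records in each per-chromosome VCF, read from its index
for f in chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz; do
  echo -e "$f\t$(bcftools index -n "$f")"
done
```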
### With Allele-Specific Annotations

```bash
gatk GenotypeGVCFs \
  -R reference.fa \
  -V gendb://genomicsdb \
  -O cohort.vcf.gz \
  -G StandardAnnotation \
  -G AS_StandardAnnotation
```
## Step 4: Filtering

### VQSR (Recommended for >30 Samples)

```bash
# SNPs
gatk VariantRecalibrator \
  -R reference.fa \
  -V cohort.vcf.gz \
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
  --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
  --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP \
  -O snps.recal \
  --tranches-file snps.tranches

gatk ApplyVQSR \
  -R reference.fa \
  -V cohort.vcf.gz \
  --recal-file snps.recal \
  --tranches-file snps.tranches \
  -mode SNP \
  --truth-sensitivity-filter-level 99.5 \
  -O cohort.snps.vcf.gz

# Indels
gatk VariantRecalibrator \
  -R reference.fa \
  -V cohort.snps.vcf.gz \
  --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
  -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode INDEL \
  -O indels.recal \
  --tranches-file indels.tranches

gatk ApplyVQSR \
  -R reference.fa \
  -V cohort.snps.vcf.gz \
  --recal-file indels.recal \
  --tranches-file indels.tranches \
  -mode INDEL \
  --truth-sensitivity-filter-level 99.0 \
  -O cohort.filtered.vcf.gz
```
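After VQSR the FILTER column holds either `PASS` or the tranche label the variant fell into; a quick tally shows how much of the callset survives at the chosen sensitivity levels (file names as above):

```bash
# Total records vs. records passing the selected tranche thresholds
bcftools view -H cohort.filtered.vcf.gz | wc -l
bcftools view -H -f PASS cohort.filtered.vcf.gz | wc -l
```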
### Hard Filtering (Small Cohorts)

```bash
# See the filtering-best-practices skill for the full set of recommended expressions
gatk VariantFiltration \
  -R reference.fa \
  -V cohort.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 60.0" --filter-name "FS60" \
  --filter-expression "MQ < 40.0" --filter-name "MQ40" \
  -O cohort.filtered.vcf.gz
```
## Complete Pipeline Script

```bash
#!/bin/bash
set -euo pipefail

REFERENCE=$1
OUTPUT_DIR=$2
THREADS=16

# GenomicsDBImport creates its workspace directory itself and refuses to reuse
# an existing one, so only the other output directories are pre-created
mkdir -p "$OUTPUT_DIR"/{gvcfs,vcfs}

echo "=== Step 1: Generate gVCFs ==="
for bam in data/*.bam; do
  sample=$(basename "$bam" .bam)
  gatk HaplotypeCaller \
    -R "$REFERENCE" \
    -I "$bam" \
    -O "$OUTPUT_DIR/gvcfs/${sample}.g.vcf.gz" \
    -ERC GVCF &
  # Limit parallelism
  while [ "$(jobs -r | wc -l)" -ge "$THREADS" ]; do sleep 1; done
done
wait

echo "=== Step 2: Create sample map ==="
ls "$OUTPUT_DIR"/gvcfs/*.g.vcf.gz | while read f; do
  echo -e "$(basename "$f" .g.vcf.gz)\t$(realpath "$f")"
done > "$OUTPUT_DIR/sample_map.txt"

echo "=== Step 3: GenomicsDBImport ==="
# Assumes intervals.bed (the calling intervals) is present in the working directory
gatk GenomicsDBImport \
  --sample-name-map "$OUTPUT_DIR/sample_map.txt" \
  --genomicsdb-workspace-path "$OUTPUT_DIR/genomicsdb" \
  -L intervals.bed \
  --reader-threads 4

echo "=== Step 4: Joint genotyping ==="
gatk GenotypeGVCFs \
  -R "$REFERENCE" \
  -V "gendb://$OUTPUT_DIR/genomicsdb" \
  -O "$OUTPUT_DIR/vcfs/cohort.vcf.gz"

echo "=== Step 5: Index ==="
bcftools index -t -f "$OUTPUT_DIR/vcfs/cohort.vcf.gz"

echo "=== Statistics ==="
bcftools stats "$OUTPUT_DIR/vcfs/cohort.vcf.gz" > "$OUTPUT_DIR/vcfs/cohort_stats.txt"

echo "=== Complete ==="
echo "Joint VCF: $OUTPUT_DIR/vcfs/cohort.vcf.gz"
```
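Assuming the script is saved as `joint_calling.sh` (a name chosen here for illustration) and the BAMs live under `data/` as it expects, a run looks like:

```bash
bash joint_calling.sh reference.fa results
# Key outputs: results/gvcfs/*.g.vcf.gz, results/genomicsdb/, results/vcfs/cohort.vcf.gz
```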
## Tips

### Memory for Large Cohorts

```bash
# Increase the Java heap
gatk --java-options "-Xmx64g" GenotypeGVCFs ...

# Import samples in batches to bound memory use in GenomicsDBImport
gatk GenomicsDBImport --batch-size 50 ...
```
### Incremental Updates

```bash
# Add new samples to an existing workspace (intervals come from the workspace)
gatk GenomicsDBImport \
  --genomicsdb-update-workspace-path existing_db \
  --sample-name-map new_samples.txt
```
## Related Skills

- `variant-calling/gatk-variant-calling` - Single-sample calling
- `variant-calling/filtering-best-practices` - VQSR and hard filtering
- `population-genetics/plink-basics` - Population analysis of joint calls
- `workflows/fastq-to-variants` - End-to-end germline pipeline