bio-vcf
VCF Toolkit
Toolkit for VCF/BCF variant file analysis: calculate statistics, filter variants, and export as JSON. Designed for WGS/WES sequencing result inspection and quality control.
Quick Start
Install
uv pip install pysam typer
Basic Usage
# 1. Get VCF statistics
python scripts/vcf_stats.py --vcf variants.vcf.gz --chrom chr1
# 2. Filter high-quality variants only and write a new VCF
python scripts/filter_vcf.py \
--vcf variants.vcf.gz \
--output high_quality.vcf \
--min-qual 30 \
--min-dp 10
# 3. Export the filtered variants as JSON (≤100 entries)
python scripts/inspect_vcf.py \
--vcf high_quality.vcf \
--chrom chr1 \
--output chr1.json
Scripts
inspect_vcf.py - VCF Inspection & JSON Export
Extract variants from a VCF file for a specific chromosome or region and export them as JSON.
Required Arguments
- --vcf PATH - Input VCF file path
- --chrom TEXT or --region TEXT - Either one is required
  - --chrom: Entire chromosome (e.g., chr1)
  - --region: Specific region (e.g., chr1:1000000-2000000)
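The two accepted forms (a bare chromosome name, or chrom:start-end) can be split with a small parser. The following is a minimal sketch, not the script's actual implementation: the names Region and parse_region are hypothetical, and it assumes 1-based inclusive coordinates and contig names without colons.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical container for a parsed --chrom / --region value."""
    chrom: str
    start: Optional[int]  # 1-based inclusive; None means whole chromosome
    end: Optional[int]


# "chr1" or "chr1:1000000-2000000" (assumes no ':' inside the contig name)
_REGION_RE = re.compile(r"^(?P<chrom>[^:]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_region(text: str) -> Region:
    """Parse a chromosome or chrom:start-end string into a Region."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Invalid region: {text!r}")
    if match["start"] is None:
        return Region(match["chrom"], None, None)
    start, end = int(match["start"]), int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"Invalid coordinates in region: {text!r}")
    return Region(match["chrom"], start, end)


print(parse_region("chr1"))
print(parse_region("chr1:1000000-2000000"))
```

Rejecting malformed input early gives a clearer message than a failure deep inside the VCF reader.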
Optional Arguments
Output:
--output PATH- JSON output path (default: stdout)
Filter Conditions:
- --min-qual FLOAT - Minimum quality score (QUAL >= X)
- --min-dp INT - Minimum depth (INFO/DP >= X)
- --min-af FLOAT - Minimum allele frequency (INFO/AF >= X)
- --max-af FLOAT - Maximum allele frequency (INFO/AF <= X)
- --pass-only / --all-filters - PASS only (default) / Include all filters
Limits:
- --max-variants INT - Maximum variant count (default: 100)
- --force - Ignore entry limit (allows large JSON output)
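One way to picture how --max-variants and --force interact is shown below. This is a sketch under assumptions: take_limited and TooManyVariantsError are hypothetical names, and the real script may count or report differently.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


class TooManyVariantsError(Exception):
    """Hypothetical error raised when the entry limit is exceeded."""


def take_limited(records: Iterable[T], max_variants: int = 100,
                 force: bool = False) -> List[T]:
    """Collect records, refusing to go past max_variants unless force is set."""
    if force:
        return list(records)
    iterator = iter(records)
    kept = list(islice(iterator, max_variants))
    # Peek at one more record: if it exists, the limit was exceeded.
    sentinel = object()
    if next(iterator, sentinel) is not sentinel:
        raise TooManyVariantsError(
            f"more than {max_variants} variants after filtering "
            f"(limit: {max_variants})"
        )
    return kept


print(len(take_limited(range(100))))              # within the default limit
print(len(take_limited(range(500), force=True)))  # limit ignored
```

Stopping after reading one record past the limit avoids scanning a large file just to report an exact count.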
Output Format (JSON)
{
"num_variants": 45,
"samples": ["sample1", "sample2"],
"variants": [
{
"chrom": "chr1",
"pos": 12345,
"id": "rs123456",
"ref": "A",
"alts": ["G"],
"qual": 100.0,
"filter": ["PASS"],
"info": {
"DP": 50,
"AF": [0.5],
"AC": [25]
},
"samples": {
"sample1": {"GT": "0/1", "DP": 25, "GQ": 99},
"sample2": {"GT": "0/0", "DP": 25, "GQ": 99}
}
}
]
}
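For downstream use, the exported JSON can be loaded with the standard library alone. The sketch below assumes the structure shown above (with num_variants set to match the single record included here) and tallies genotypes per sample; genotype_counts is a hypothetical helper, not part of the toolkit.

```python
import json
from collections import Counter

# A one-variant export following the documented format.
EXPORT_JSON = """
{
  "num_variants": 1,
  "samples": ["sample1", "sample2"],
  "variants": [
    {
      "chrom": "chr1", "pos": 12345, "id": "rs123456",
      "ref": "A", "alts": ["G"], "qual": 100.0, "filter": ["PASS"],
      "info": {"DP": 50, "AF": [0.5], "AC": [25]},
      "samples": {
        "sample1": {"GT": "0/1", "DP": 25, "GQ": 99},
        "sample2": {"GT": "0/0", "DP": 25, "GQ": 99}
      }
    }
  ]
}
"""


def genotype_counts(export: dict) -> dict:
    """Count how often each GT string occurs for every sample."""
    counts = {name: Counter() for name in export["samples"]}
    for variant in export["variants"]:
        for name, call in variant["samples"].items():
            # Assumption: a missing GT is recorded as "./."
            counts.setdefault(name, Counter())[call.get("GT", "./.")] += 1
    return counts


export = json.loads(EXPORT_JSON)
counts = genotype_counts(export)
for sample, tally in counts.items():
    print(sample, dict(tally))
```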
vcf_stats.py - VCF Statistics
Calculate comprehensive statistics from VCF files and output as JSON. Includes variant counts, quality distributions, depth distributions, and allele frequency statistics.
Arguments
Required:
--vcf PATH- Input VCF file path
Optional:
- --chrom TEXT - Chromosome specification (default: all chromosomes)
- --region TEXT - Region specification (e.g., chr1:1000-2000)
- --output PATH - JSON output path (default: stdout)
Output Content (JSON)
- total_variants - Total variant count
- filter_counts - Breakdown by filter (PASS, LowQual, etc.)
- variant_types - Breakdown by variant type (SNP, insertion, deletion)
- chrom_counts - Variant count per chromosome
- quality_stats - Quality score statistics (min, max, mean, median)
- depth_stats - Depth statistics (INFO/DP)
- allele_frequency_stats - Allele frequency statistics (INFO/AF)
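To make the variant_types and quality_stats fields concrete, here is a rough sketch of how such values could be derived. It is an illustration only: classify and summarize are hypothetical helpers, and the length-based classification rule (with symbolic alleles grouped as "other") is an assumption rather than the script's documented behavior.

```python
import statistics
from collections import Counter
from typing import Dict, Iterable, Optional


def classify(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (assumed rule)."""
    if alt.startswith("<") or alt == "*":
        return "other"  # symbolic or spanning-deletion alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # e.g. multi-nucleotide substitutions


def summarize(values: Iterable[Optional[float]]) -> Optional[Dict[str, float]]:
    """Return min, max, mean and median, skipping missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
    }


pairs = [("A", "G"), ("A", "AT"), ("AT", "A"), ("AT", "GC")]
print(Counter(classify(ref, alt) for ref, alt in pairs))
print(summarize([10.0, 30.0, 50.0, None]))
```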
Usage Examples
# Calculate statistics for chr1
python scripts/vcf_stats.py --vcf variants.vcf.gz --chrom chr1
# Calculate statistics for all chromosomes (output to JSON file)
python scripts/vcf_stats.py --vcf variants.vcf.gz --output stats.json
# Calculate statistics for a specific region
python scripts/vcf_stats.py --vcf variants.vcf.gz --region chr1:10000-20000
filter_vcf.py - VCF Filtering
Filter VCF files by quality, depth, and allele frequency criteria. Output filtered variants as a new VCF file.
Arguments
Required:
- --vcf PATH - Input VCF file path
- --output PATH - Output VCF file path
Optional:
- --chrom TEXT - Chromosome specification
- --region TEXT - Region specification (e.g., chr1:1000-2000)
- --min-qual FLOAT - Minimum quality score
- --min-dp INT - Minimum depth (INFO/DP)
- --min-af FLOAT - Minimum allele frequency (INFO/AF)
- --max-af FLOAT - Maximum allele frequency (INFO/AF)
- --pass-only - PASS variants only (default: False)
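The thresholds above combine as a logical AND. The sketch below expresses that as a predicate over plain dictionaries shaped like the inspect_vcf.py JSON records. It is not the script's code: passes_filters is a hypothetical name, and two behaviors are assumptions, namely that a missing QUAL, DP, or AF fails any threshold that is set, and that a multi-allelic record is kept when any ALT allele falls inside the AF range.

```python
from typing import Any, Dict, Optional


def passes_filters(record: Dict[str, Any], *,
                   min_qual: Optional[float] = None,
                   min_dp: Optional[int] = None,
                   min_af: Optional[float] = None,
                   max_af: Optional[float] = None,
                   pass_only: bool = False) -> bool:
    """Return True when the record satisfies every threshold that is set."""
    if pass_only and record.get("filter") != ["PASS"]:
        return False
    if min_qual is not None:
        qual = record.get("qual")
        if qual is None or qual < min_qual:  # assumption: missing QUAL fails
            return False
    info = record.get("info", {})
    if min_dp is not None:
        depth = info.get("DP")
        if depth is None or depth < min_dp:  # assumption: missing DP fails
            return False
    if min_af is not None or max_af is not None:
        freqs = info.get("AF")
        if freqs is None:  # assumption: missing AF fails
            return False
        if isinstance(freqs, (int, float)):
            freqs = [freqs]
        low = 0.0 if min_af is None else min_af
        high = 1.0 if max_af is None else max_af
        # Assumption: one ALT allele inside the range is enough.
        if not any(low <= af <= high for af in freqs):
            return False
    return True


record = {"qual": 100.0, "filter": ["PASS"], "info": {"DP": 50, "AF": [0.5]}}
print(passes_filters(record, min_qual=30, min_dp=10, pass_only=True))
print(passes_filters(record, max_af=0.01))
```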
Usage Examples
# Extract chr1 PASS variants only
python scripts/filter_vcf.py \
--vcf variants.vcf.gz \
--output chr1_pass.vcf \
--chrom chr1 \
--pass-only
# Extract high-quality variants (QUAL >= 30, DP >= 10)
python scripts/filter_vcf.py \
--vcf variants.vcf.gz \
--output high_quality.vcf \
--min-qual 30 \
--min-dp 10
# Extract rare variants (AF <= 0.01)
python scripts/filter_vcf.py \
--vcf variants.vcf.gz \
--output rare_variants.vcf \
--max-af 0.01
Workflow Examples
Example 1: Comprehensive Variant Analysis Workflow
Combine all three scripts for complete VCF analysis:
# Step 1: Calculate overall statistics
python scripts/vcf_stats.py --vcf variants.vcf.gz --chrom chr1 --output stats.json
# Step 2: Filter high-quality variants to new VCF
python scripts/filter_vcf.py \
--vcf variants.vcf.gz \
--output high_quality.vcf \
--chrom chr1 \
--min-qual 30 \
--min-dp 10 \
--pass-only
# Step 3: Export filtered variants as JSON for downstream analysis
python scripts/inspect_vcf.py \
--vcf high_quality.vcf \
--chrom chr1 \
--output chr1_filtered.json
Example 2: Rare Variant Discovery
Identify and export rare variants from a specific region:
# Filter rare variants (AF <= 0.01)
python scripts/filter_vcf.py \
--vcf variants.vcf.gz \
--output rare.vcf \
--region chr17:41196312-41277500 \
--max-af 0.01
# Export as JSON for analysis
python scripts/inspect_vcf.py \
--vcf rare.vcf \
--region chr17:41196312-41277500 \
--output brca1_rare.json
Error Handling
Variant Count Exceeds Limit
$ python scripts/inspect_vcf.py --vcf huge.vcf --chrom chr1 --output out.json
Error: VCF contains 1,234+ variants after filtering (limit: 100).
Suggestions:
- Apply more restrictive filters: --min-qual, --min-dp, --pass-only
- Specify a genomic region: --region chr1:1000-2000
- Override limit with --force (warning: may produce very large JSON)
- Use bcftools directly for large-scale processing
Current filter conditions:
--chrom chr1 --pass-only
Solutions:
- Apply more restrictive filters: --min-qual 30, --min-dp 10
- Narrow down the region: --region chr1:1000000-1100000
- Override limit with --force (use cautiously)
Missing Chromosome/Region Specification
$ python scripts/inspect_vcf.py --vcf variants.vcf --output out.json
Error: Either --chrom or --region must be specified.
Solutions:
- Add --chrom chr1 or --region chr1:1000-2000 to the command
Best Practices
1. Always Specify Chromosome or Region
inspect_vcf.py requires either --chrom or --region. Restricting the query this way avoids scanning the entire VCF file.
# ❌ Bad: No chromosome specified
python scripts/inspect_vcf.py --vcf variants.vcf
# ✅ Good: Chromosome specified
python scripts/inspect_vcf.py --vcf variants.vcf --chrom chr1
2. Apply Additional Filters for Efficiency
Combine quality and depth filters with the default PASS-only filtering to reduce the number of exported variants and stay under the entry limit.
# ✅ Good: Multiple filters applied
python scripts/inspect_vcf.py \
--vcf variants.vcf \
--chrom chr1 \
--min-qual 30 \
--min-dp 10
3. Respect 100-Entry Limit for JSON Export
Use inspect_vcf.py for small datasets only. Pre-filter large VCF files with filter_vcf.py or bcftools before JSON export.
# Pre-filter large datasets with bcftools (-r requires a bgzip-compressed, indexed VCF)
bcftools view -i 'QUAL>=30 && INFO/DP>=10' -r chr1:1000000-2000000 variants.vcf.gz > filtered.vcf
# Then export to JSON
python scripts/inspect_vcf.py --vcf filtered.vcf --chrom chr1 --output filtered.json
4. Use --force Cautiously
Use --force only when necessary. JSON files with thousands of entries can become several MB to tens of MB in size.
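Before using --force, the output size can be roughly estimated from a few representative records. This is a sketch only: estimate_json_bytes is a hypothetical helper, and the estimate assumes all records are about the same size as the sample, which will not hold when per-record annotation varies widely. Per-record size also grows with the number of samples in the VCF.

```python
import json
from typing import Any, Dict, List


def estimate_json_bytes(sample_records: List[Dict[str, Any]],
                        total_records: int) -> int:
    """Extrapolate the export size from the average size of a few records."""
    if not sample_records:
        return 0
    sizes = [len(json.dumps(rec).encode("utf-8")) for rec in sample_records]
    return int(sum(sizes) / len(sizes) * total_records)


record = {
    "chrom": "chr1", "pos": 12345, "id": "rs123456",
    "ref": "A", "alts": ["G"], "qual": 100.0, "filter": ["PASS"],
    "info": {"DP": 50, "AF": [0.5], "AC": [25]},
    "samples": {
        "sample1": {"GT": "0/1", "DP": 25, "GQ": 99},
        "sample2": {"GT": "0/0", "DP": 25, "GQ": 99},
    },
}

estimate = estimate_json_bytes([record], total_records=5000)
print(f"~{estimate / 1_000_000:.2f} MB for 5000 records")
```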
When to Use vcf-toolkit vs bcftools
| Task | vcf-toolkit | bcftools |
|---|---|---|
| Small dataset JSON export | ✅ inspect_vcf.py | - |
| Large-scale filtering | filter_vcf.py | ✅ bcftools view |
| Complex filter expressions | - | ✅ bcftools |
| VCF-to-VCF conversion | filter_vcf.py | ✅ bcftools |
| Variant statistics | ✅ vcf_stats.py | ✅ bcftools stats |
Recommended Workflow:
- Pre-filter large datasets with bcftools or filter_vcf.py
- Export filtered results to JSON with inspect_vcf.py for detailed inspection
- Perform downstream analysis in Python/R using JSON output
Related Skills
- pysam - BAM/CRAM alignment file operations
- sequence-io - FASTA/FASTQ sequence file operations
- blast-search - BLAST homology search
- blat-api-searching - BLAT genome mapping
Troubleshooting
VCF File Too Large
Specify a narrower region or pre-filter with bcftools before JSON export.
# Specify narrower region
python scripts/inspect_vcf.py --vcf variants.vcf --region chr1:1000000-1100000
# Pre-filter with bcftools
bcftools view -i 'QUAL>=50' variants.vcf | python scripts/inspect_vcf.py --vcf - --chrom chr1
Index Error
Region queries on compressed VCF files need a tabix index. Compress the file with bgzip, then create the index.
# Compress with bgzip
bgzip variants.vcf
# Create tabix index
tabix -p vcf variants.vcf.gz
# Use indexed VCF
python scripts/inspect_vcf.py --vcf variants.vcf.gz --chrom chr1
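A quick pre-flight check can confirm that an index sits next to the compressed VCF before running a query. The sketch below uses only the standard library; find_index is a hypothetical helper, and it assumes the index follows the usual naming of the VCF file name plus .tbi (tabix) or .csi.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def find_index(vcf_path: Union[str, Path]) -> Optional[Path]:
    """Return the .tbi or .csi file next to vcf_path, or None if absent."""
    path = Path(vcf_path)
    for suffix in (".tbi", ".csi"):
        candidate = path.with_name(path.name + suffix)
        if candidate.exists():
            return candidate
    return None


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    vcf = Path(tmp) / "variants.vcf.gz"
    vcf.touch()
    before = find_index(vcf)  # no index yet
    (Path(tmp) / "variants.vcf.gz.tbi").touch()
    after = find_index(vcf)
    after_name = after.name if after else None

print(before, after_name)
```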
Include Non-PASS Variants
Use the --all-filters flag to include all variants regardless of the FILTER field.
python scripts/inspect_vcf.py --vcf variants.vcf --chrom chr1 --all-filters