vgp-pipeline
VGP Assembly Pipeline Skill
Overview
The Vertebrate Genome Project (VGP) assembly pipeline consists of Galaxy workflows for producing high-quality, phased, chromosome-level genome assemblies. This skill covers workflow selection, execution patterns, and quality control checkpoints.
Supporting files (detailed reference material):
- RESOURCE_ANALYSIS.md - Workflow canonical names, official/non-official filtering, metric availability, tool-level resource optimization
- DATA_INTEGRATION.md - ToLID patterns, GenomeArk S3 integration, NCBI accession recovery, Meryl k-mer management, species-metrics merging
- QUALITY_VALIDATION.md - Curation impact analysis, GenomeScope data validation, assembly size interpretation, communication patterns
Trajectories (by frequency of use)
Trajectory A: HiFi + Hi-C (Most Common)
- Inputs: HiFi Reads, Hi-C Reads
- Path: WF1 -> WF4 -> [WF6] -> WF8 -> WF9 -> PreCuration
- Output: HiC Phased assembly (hap1/hap2)
- WF6: Optional (can skip directly to WF8)
Trajectory B: HiFi + Trio
- Inputs: HiFi Reads, Hi-C Reads, Parental Reads
- Path: WF2 -> WF5 -> [WF6] -> WF8 -> WF9 -> PreCuration
- Output: Trio Phased assembly (maternal/paternal)
- WF6: Optional (can skip directly to WF8)
Trajectory C: HiFi Only (Least Common)
- Inputs: HiFi Reads only
- Path: WF1 -> WF3 -> WF6 -> WF9 -> PreCuration
- Output: Pseudohaplotype assembly (primary/alternate)
- WF6: Required (no Hi-C scaffolding step)
- Note: Skips WF8 entirely
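The three trajectories above can be encoded as a small lookup plus a selection rule. This is a minimal sketch, not part of the VGP workflows: the names `TRAJECTORIES` and `select_trajectory` are hypothetical, HiFi reads are assumed to be present, and the case of parental reads without Hi-C reads is treated as unsupported because the trajectories above do not cover it.

```python
# Trajectory paths as listed above; optional steps are wrapped in brackets.
TRAJECTORIES = {
    "A": ["WF1", "WF4", "[WF6]", "WF8", "WF9", "PreCuration"],
    "B": ["WF2", "WF5", "[WF6]", "WF8", "WF9", "PreCuration"],
    "C": ["WF1", "WF3", "WF6", "WF9", "PreCuration"],
}


def select_trajectory(has_hic: bool, has_parental: bool) -> str:
    """Pick a trajectory letter from the available read types (HiFi assumed)."""
    if has_parental and has_hic:
        return "B"  # HiFi + Hi-C + parental reads
    if has_parental:
        # Assumption: Trajectory B lists Hi-C reads as an input, so parental
        # reads without Hi-C are not mapped to any trajectory in this sketch.
        raise ValueError("Parental reads without Hi-C reads are not covered")
    if has_hic:
        return "A"  # HiFi + Hi-C
    return "C"  # HiFi only


if __name__ == "__main__":
    letter = select_trajectory(has_hic=True, has_parental=False)
    print(letter, " -> ".join(TRAJECTORIES[letter]))
```

Keeping the bracket notation for optional steps means the same lists can be reused when drawing diagrams or planning runs.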
Workflow Selection by Data Availability
Non-trio workflows (HiFi reads only)
- VGP1 (WF1): K-mer profiling with HiFi reads alone
- VGP3 (WF3): HiFi-only assembly with Hifiasm
Trio workflows (HiFi + Parental Illumina)
- VGP2 (WF2): Trio k-mer profiling (HiFi child + Illumina parents)
- VGP5 (WF5): Trio-phased assembly with Hifiasm
Universal scaffolding workflows
- RagTag scaffolding: Used for both trio and non-trio assemblies
- Requires reference genome specification
Methods language pattern
When documenting workflow selection in publications:
"For species with available parental data (trio datasets), we employed
VGP2 -> VGP5 workflows. For species without parental data (non-trio datasets),
we performed VGP1 -> VGP3 workflows."
Workflow Descriptions
| Workflow | Name | Description |
|---|---|---|
| WF0 | Mitochondrial Assembly | MitoHiFi assembly (runs in parallel, may fail if no mito reads) |
| WF1 | K-mer Profiling | Genome size, heterozygosity estimation (HiFi) |
| WF2 | Trio K-mer Profiling | K-mer profiling with parental data |
| WF3 | Hifiasm | HiFi-only assembly |
| WF4 | Hifiasm + HiC | HiC-phased assembly |
| WF5 | Hifiasm Trio | Trio-phased assembly |
| WF6 | Purge Duplicates | Remove haplotypic duplications |
| WF7 | Deprecated | No longer used |
| WF8 | Hi-C Scaffolding | YaHS chromosome scaffolding |
| WF9 | Decontamination | Remove contaminants |
| PreCuration | Pretext Snapshot | Prepare files for manual curation |
IWC Workflow Versions (as of March 2026)
| Workflow | IWC Repo | Latest Version | Dockstore ID |
|---|---|---|---|
| WF1 | kmer-profiling-hifi-VGP1 | v0.6 | github.com/iwc-workflows/kmer-profiling-hifi-VGP1/main |
| WF4 | Assembly-Hifi-HiC-phasing-VGP4 | v0.5 | github.com/iwc-workflows/Assembly-Hifi-HiC-phasing-VGP4/main |
| WF8 | Scaffolding-HiC-VGP8 | v3.3 | github.com/iwc-workflows/Scaffolding-HiC-VGP8/main |
Recent Breaking Changes
BUSCO -> Compleasm (WF4 v0.5, WF8 v3.3):
- Compleasm (0.2.5+galaxy0) replaced BUSCO for gene completeness assessment
- Uses miniprot for protein-to-genome alignment (faster than BUSCO's BLAST approach)
- Same output categories: Complete (Single-copy + Duplicated), Fragmented, Missing
- Input parameters are still named "Database for Busco Lineage" and "Lineage" (for backward compatibility)
Hi-C reads format change (WF4 v0.5, WF8 v3.3):
- Changed from separate forward/reverse datasets to list:paired collection
- Users must build a list:paired collection before running these workflows
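Before building the collection in Galaxy, it can help to pair the forward and reverse files locally and confirm that no mate is missing. The sketch below only groups file names; it does not call Galaxy. The `_R1`/`_R2` naming convention and the function name `build_list_paired` are assumptions for illustration, and the `forward`/`reverse` keys mirror the element names commonly used for paired collections.

```python
import re
from collections import defaultdict

# Hypothetical naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
_READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.f(?:ast)?q(?:\.gz)?$")


def build_list_paired(filenames):
    """Group Hi-C files into {sample: {"forward": name, "reverse": name}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = _READ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unrecognized Hi-C file name: {name}")
        role = "forward" if match["read"] == "1" else "reverse"
        pairs[match["sample"]][role] = name

    # Every sample needs both mates to form a valid pair.
    incomplete = sorted(
        sample for sample, pair in pairs.items()
        if set(pair) != {"forward", "reverse"}
    )
    if incomplete:
        raise ValueError(f"Samples missing a mate: {incomplete}")
    return dict(pairs)


if __name__ == "__main__":
    files = ["lane1_R1.fastq.gz", "lane1_R2.fastq.gz"]
    print(build_list_paired(files))
```

The resulting mapping has the same shape as a list:paired collection (an outer list of samples, each holding one forward and one reverse element), so it can serve as a checklist when assembling the collection in the Galaxy interface.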
New required inputs across all workflows:
- Species Name (text) -- used for workflow reports
- Assembly Name (text) -- used for workflow reports
WF4 additional new inputs:
- Trim Hi-C reads? (boolean)
- Name for Haplotype 1/2 (defaults: Hap1/Hap2)
- Bits for bloom filter (default: 37)
WF8 additional new inputs:
- Haplotype (restricted: Haplotype 1/2, Maternal/Paternal, Primary/Alternate)
- Trim Hi-C Data? (boolean)
- Minimum Mapping Quality (default: 10)
Verifying IWC Versions
Check latest versions via Dockstore API:
https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fiwc-workflows%2F{REPO}%2Fmain/versions
Check workflow inputs by fetching the .ga file from GitHub:
https://raw.githubusercontent.com/iwc-workflows/{REPO}/main/{WORKFLOW_NAME}.ga
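The two URL templates above can be filled in programmatically. The following is a small sketch using only the standard library; the function names are hypothetical, and the `.ga` file name must be supplied by the caller because it is not always identical to the repository name.

```python
from urllib.parse import quote

DOCKSTORE_TRS_BASE = "https://dockstore.org/api/ga4gh/trs/v2/tools/"
GITHUB_RAW_BASE = "https://raw.githubusercontent.com/iwc-workflows/"


def dockstore_versions_url(repo: str) -> str:
    """Build the Dockstore versions URL for an IWC repository."""
    tool_id = f"#workflow/github.com/iwc-workflows/{repo}/main"
    # safe="" forces "/" and "#" to be percent-encoded, matching the template.
    return DOCKSTORE_TRS_BASE + quote(tool_id, safe="") + "/versions"


def workflow_ga_url(repo: str, workflow_name: str) -> str:
    """Build the raw GitHub URL of a workflow's .ga file."""
    return f"{GITHUB_RAW_BASE}{repo}/main/{workflow_name}.ga"


if __name__ == "__main__":
    print(dockstore_versions_url("Scaffolding-HiC-VGP8"))
```

The returned URLs can then be fetched with any HTTP client; the response structure is not assumed here.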
Haplotype Execution Patterns
Run Once (Both Haplotypes Together)
- WF1, WF2 (K-mer profiling)
- WF3, WF4, WF5 (Assembly)
- WF6 (Purge Duplicates) - depends on trajectory
- PreCuration
Run Twice (x2 per Haplotype)
- WF8 (Hi-C Scaffolding)
- WF9 (Decontamination)
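The run-once versus run-twice rule can be applied mechanically to a trajectory path to list the individual workflow invocations. This is an illustrative sketch; `RUN_PER_HAPLOTYPE`, `expand_runs`, and the `"both"` label are hypothetical names, not part of the pipeline.

```python
# Workflows that are run separately for each haplotype (from the list above).
RUN_PER_HAPLOTYPE = {"WF8", "WF9"}


def expand_runs(path, haplotypes):
    """Return (workflow, target) tuples, one per required invocation."""
    runs = []
    for workflow in path:
        if workflow in RUN_PER_HAPLOTYPE:
            # One invocation per haplotype
            runs.extend((workflow, hap) for hap in haplotypes)
        else:
            # A single invocation covers both haplotypes
            runs.append((workflow, "both"))
    return runs


if __name__ == "__main__":
    path = ["WF1", "WF4", "WF8", "WF9", "PreCuration"]
    for workflow, target in expand_runs(path, ["hap1", "hap2"]):
        print(f"{workflow}: {target}")
```

For Trajectory A without WF6, this gives seven invocations: three shared runs plus two each for WF8 and WF9.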
WF6 (Purge Duplicates) Decision Logic
if trajectory == "C":  # HiFi only
    wf6_required = True
    wf6_border = "solid"
else:  # Trajectory A or B
    wf6_required = False  # optional: can skip directly to WF8
    wf6_border = "dashed"
When to skip WF6 (Trajectories A/B):
- Merqury k-mer spectra show clean haplotype separation
- Assembly QV is already high
- No significant duplication detected
When to run WF6 (Trajectories A/B):
- K-mer spectra show residual duplications
- Higher heterozygosity samples
- Conservative approach preferred
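The skip/run criteria can be combined into a single decision helper. This sketch takes each criterion as a boolean that the analyst sets after inspecting the QC outputs, because no numeric thresholds are given here. The function name, the parameter names, and the choice to run WF6 whenever the skip criteria are not all met are assumptions.

```python
def should_run_wf6(
    trajectory,
    clean_separation=False,
    high_qv=False,
    duplication_detected=False,
    high_heterozygosity=False,
    conservative=False,
):
    """Return True if WF6 (Purge Duplicates) should be run."""
    if trajectory == "C":
        return True  # WF6 is required for HiFi-only assemblies

    # Any "run" criterion is enough to trigger purging in trajectories A/B.
    if duplication_detected or high_heterozygosity or conservative:
        return True

    # Assumption: skip only when all "skip" criteria hold; otherwise run.
    return not (clean_separation and high_qv)


if __name__ == "__main__":
    print(should_run_wf6("A", clean_separation=True, high_qv=True))
```

Defaulting every flag to `False` makes the helper err toward running WF6 when nothing has been assessed yet.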
Coverage Requirements
| Data Type | Minimum Coverage | Notes |
|---|---|---|
| HiFi | 30x | Diploid genome |
| Hi-C | 60x | Diploid genome |
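Coverage is the total number of sequenced bases divided by the genome size, so the minimums in the table can be checked with simple arithmetic. The sketch below is illustrative: the function names are hypothetical, and the genome size is whatever estimate the reader supplies (for example, the k-mer based estimate from WF1).

```python
# Minimum coverage values from the table above.
MIN_COVERAGE = {"HiFi": 30.0, "Hi-C": 60.0}


def coverage(total_bases, genome_size_bp):
    """Sequencing depth as total bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


def meets_minimum(data_type, total_bases, genome_size_bp):
    """Check a dataset against the minimum coverage for its data type."""
    return coverage(total_bases, genome_size_bp) >= MIN_COVERAGE[data_type]


if __name__ == "__main__":
    # Example: 35 Gb of reads against a 1 Gb genome estimate
    depth = coverage(35_000_000_000, 1_000_000_000)
    print(f"{depth:.1f}x", meets_minimum("HiFi", 35_000_000_000, 1_000_000_000))
```

In this example, 35 Gb of reads over a 1 Gb genome gives 35x, which is enough for HiFi but not for Hi-C.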
QC Checkpoints
After WF1/WF2 (K-mer Profiling)
- Verify GenomeScope2 model fit
- Check estimated genome size
- Review heterozygosity estimate
After WF4/WF5 (Assembly)
- Inspect Merqury k-mer spectra
- Decide whether to run WF6 based on duplication levels
After WF8 (Hi-C Scaffolding)
- Check Pretext Hi-C contact maps
- Verify chromosome-level scaffolding
- Validate against expected karyotype (see Karyotype Validation below)
After WF9 (Decontamination)
- Review contamination reports
- Check for unexpected removals
Karyotype-Based Scaffold Validation
Sex Chromosome Adjustment
Problem: VGP assemblies often place both sex chromosomes (X+Y or Z+W) in the main haplotype, so the expected chromosome count must be adjusted.
Solution: When both sex chromosomes are present, expected = n + 1 (not n).
Implementation:
# Adjust the expected haploid count when BOTH sex chromosomes are in the main haplotype.
# Assumes `df` is a pandas DataFrame with a 'num_chromosomes_haploid' column.
df['num_chromosomes_haploid_adjusted'] = df['num_chromosomes_haploid'].copy()

both_sex_chr_patterns = [
    'Has X and Y',
    'Has Z and W',
    'has Z and W',
    'Has X1, X2, and Y',
    'Has Z1, Z2, and W',
    'Has 5X and 5Y'
]

# The column name is kept verbatim (including the 'haploptype' spelling).
if 'Sex chromosomes main haploptype' in df.columns:
    has_both_sex = df['Sex chromosomes main haploptype'].isin(both_sex_chr_patterns)
    # Only adjust rows that have a known haploid count
    needs_adjustment = has_both_sex & df['num_chromosomes_haploid'].notna()
    df.loc[needs_adjustment, 'num_chromosomes_haploid_adjusted'] = (
        df.loc[needs_adjustment, 'num_chromosomes_haploid'] + 1
    )
Biological Reasoning:
- Diploid organisms have two sex chromosomes (XX, XY, ZZ, ZW)
- X and Y (or Z and W) are distinct chromosomes
- If both are in the main haplotype, two separate scaffolds are expected
- Example: Asian elephant 2n=56, n=28, has X+Y -> expect 29 scaffolds
Impact: Improved perfect match rate from 0% to ~90% in validation analyses
Validation Metrics:
# Use adjusted counts for validation
achieved = df['total_number_of_chromosomes']
expected = df['num_chromosomes_haploid_adjusted']
perfect_matches = (achieved == expected).sum()
within_1 = ((achieved - expected).abs() <= 1).sum()
ratio = achieved / expected
Common Pitfalls
Wrong: Compare diploid expected (2n) to haploid assembly
- Results in ~50% achievement rates
- Biologically incorrect
Wrong: Use haploid (n) when both sex chromosomes present
- Underestimates by 1
- Shows artificial "extra scaffold" problem
Correct: Use adjusted haploid (n or n+1 depending on sex chromosome configuration)
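The effect of each choice can be seen with the Asian elephant figures (2n=56, X and Y both in the main haplotype). The sketch below is a worked illustration only: `expected_scaffolds` is a hypothetical helper, it assumes an even diploid number, and the achieved count of 29 is a hypothetical assembly that recovered every expected scaffold.

```python
def expected_scaffolds(diploid_number, both_sex_chromosomes):
    """Expected chromosome-level scaffolds in the main haplotype."""
    haploid = diploid_number // 2  # assumes an even 2n
    return haploid + 1 if both_sex_chromosomes else haploid


diploid = 56   # Asian elephant, 2n
achieved = 29  # hypothetical: all expected scaffolds recovered

ratios = {
    "diploid (2n)": achieved / diploid,
    "haploid (n)": achieved / expected_scaffolds(diploid, False),
    "adjusted (n+1)": achieved / expected_scaffolds(diploid, True),
}

if __name__ == "__main__":
    for label, ratio in ratios.items():
        print(f"{label}: {ratio:.2f}")
```

Comparing against 2n gives a ratio near 0.52, comparing against n gives about 1.04 (an apparent extra scaffold), and only the adjusted count gives exactly 1.0.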
WF0 (Mitochondrial) Handling
WF0 runs in parallel with the main pipeline and may fail if no mitochondrial reads are present in the HiFi data. This is a biological failure, not a technical one.
def check_mitohifi_failure(wf0_result):
    """Distinguish biological vs technical failure."""
    if "no_mito_reads" in wf0_result.log:
        return "biological"  # Expected for some samples
    else:
        return "technical"  # Investigate further
Visual Diagram Elements
When creating workflow diagrams:
Color Coding (Suggested)
- K-mer Profiling section: Orange (#fff3e0)
- Assembly section: Green (#e8f5e9)
- Purging section: Purple (#f3e5f5)
- Scaffolding section: Blue (#e3f2fd)
- Finishing section: Green (#e8f5e9)
- WF0 (Mitochondrial): Pink (#fce4ec)
Visual Indicators
- Solid lines: Required workflow connections
- Dashed lines: Optional skip paths
- Dashed box border: Optional workflow (WF6 in trajectories A/B)
- Solid box border: Required workflow
- Dimmed elements: Workflows not used in current trajectory
Haplotype Badges
- Blue badge (#e3f2fd): "x2 per haplotype" - runs separately
- Green badge (#e8f5e9): "both haplotypes" - runs together
Input Data Labels
- HiFi Reads: Blue (#4285f4)
- Hi-C Reads: Green (#34a853)
- Parental Reads: Red (#ea4335)
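One way to apply these conventions is to generate a Graphviz DOT description from a trajectory path. The sketch below is an assumption about tooling (Graphviz is not prescribed here) and covers only the section colors, dashed borders for optional workflows, and dashed skip edges; the names `SECTION_COLORS` and `trajectory_to_dot` are hypothetical.

```python
# Fill colors per workflow, following the suggested section color coding.
SECTION_COLORS = {
    "WF1": "#fff3e0", "WF2": "#fff3e0",                    # K-mer profiling
    "WF3": "#e8f5e9", "WF4": "#e8f5e9", "WF5": "#e8f5e9",  # Assembly
    "WF6": "#f3e5f5",                                      # Purging
    "WF8": "#e3f2fd",                                      # Scaffolding
    "WF9": "#e8f5e9", "PreCuration": "#e8f5e9",            # Finishing
}


def trajectory_to_dot(path):
    """Render a trajectory as DOT text; steps written as "[WFn]" are optional."""
    lines = ["digraph trajectory {", "  rankdir=LR;", "  node [shape=box];"]
    steps = []
    for step in path:
        optional = step.startswith("[") and step.endswith("]")
        name = step.strip("[]")
        style = "filled,dashed" if optional else "filled"
        lines.append(
            f'  {name} [style="{style}", fillcolor="{SECTION_COLORS[name]}"];'
        )
        steps.append((name, optional))

    # Solid edges: required connections between consecutive steps
    for (src, _), (dst, _) in zip(steps, steps[1:]):
        lines.append(f"  {src} -> {dst};")

    # Dashed edges: skip paths around optional steps
    for i, (_, optional) in enumerate(steps):
        if optional and 0 < i < len(steps) - 1:
            lines.append(f"  {steps[i - 1][0]} -> {steps[i + 1][0]} [style=dashed];")

    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(trajectory_to_dot(["WF1", "WF4", "[WF6]", "WF8", "WF9", "PreCuration"]))
```

The printed text can be saved to a `.dot` file and rendered with any Graphviz-compatible tool.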
Summary Table
| Trajectory | Inputs | K-mer | Assembly | Purge | Scaffold | Finish | Output |
|---|---|---|---|---|---|---|---|
| A | HiFi+HiC | WF1 | WF4 | [WF6] | WF8 | WF9->Pre | hap1/hap2 |
| B | HiFi+Trio | WF2 | WF5 | [WF6] | WF8 | WF9->Pre | mat/pat |
| C | HiFi only | WF1 | WF3 | WF6 | - | WF9->Pre | pri/alt |
[WF6] = optional, WF6 = required, - = skipped
Reference Genomes for Scaffolding
Common Reference Genome
GCA_011100685.1 - Frequently used reference genome for RagTag scaffolding in canid genome assemblies.
When documenting scaffolding in methods sections:
- Always specify the reference genome accession
- Include version number if applicable
- Example: "scaffolded using RagTag v2.1.0 with the reference genome GCA_011100685.1"
Best Practices
For reproducibility:
- Document exact accession used
- Specify if custom modifications were made to reference
- Note if different references used for different species/assemblies
References
- VGP Galaxy Workflows - VGP workflows
- Vertebrate Genome Project