genomeark-aws
GenomeArk AWS S3 Data Repository
Comprehensive guide for accessing and navigating the GenomeArk AWS S3 public bucket containing Vertebrate Genomes Project (VGP) assemblies and quality control data.
Supporting files (read as needed for detailed code and strategies):
- assembly-date-extraction.md - Extract assembly dates from FASTA filenames, validation rules
- qc-data-fetching.md - GenomeScope, BUSCO, Merqury, Meryl fetching code and parsing
- best-practices.md - AWS CLI patterns, batch processing, common pitfalls, testing examples, version history
When to Use This Skill
Use this skill when:
- Accessing VGP genome assemblies from GenomeArk AWS S3
- Fetching QC metrics (GenomeScope, BUSCO, Merqury) for genomic analyses
- Downloading genome evaluation data for comparative studies
- Accessing meryl k-mer histograms for GenomeScope analysis
- Building automated pipelines that fetch VGP data
- Troubleshooting S3 path issues or missing data
- Working with species-specific genome data from VGP
Repository Overview
GenomeArk is a public AWS S3 bucket (s3://genomeark/) hosting:
- VGP genome assemblies (primary, alternate, trio)
- Quality control metrics (GenomeScope, BUSCO, Merqury)
- Intermediate files (meryl databases, k-mer histograms)
- Assembly evaluation reports
- Haplotype-resolved assemblies
Access Method: Public bucket requiring no AWS credentials when using --no-sign-request
Critical Discovery: GenomeArk structure has evolved over time (2022 -> 2024+). Always implement fallback path patterns for reliability.
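To see which layout a given specimen actually uses, it helps to list its prefix anonymously before fetching anything. The sketch below assumes the AWS CLI is installed; the helper names (`build_ls_command`, `parse_ls_prefixes`, `list_specimen_dirs`) are illustrative, not part of any existing tool.

```python
import subprocess

BUCKET_ROOT = "s3://genomeark/species"


def build_ls_command(species, tolid):
    """Build an anonymous `aws s3 ls` command for one specimen prefix."""
    prefix = f"{BUCKET_ROOT}/{species}/{tolid}/"
    return ["aws", "s3", "ls", prefix, "--no-sign-request"]


def parse_ls_prefixes(ls_output):
    """Extract sub-prefix names from `aws s3 ls` lines of the form 'PRE name/'."""
    names = []
    for line in ls_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "PRE":
            names.append(parts[1].rstrip("/"))
    return names


def list_specimen_dirs(species, tolid, timeout=30):
    """Return directory names under a specimen prefix (needs the AWS CLI)."""
    result = subprocess.run(
        build_ls_command(species, tolid),
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        return []
    return parse_ls_prefixes(result.stdout)
```

Calling `list_specimen_dirs("Rhinolophus_ferrumequinum", "mRhiFer1")` should return the assembly and `genomic_data` directory names present for that specimen, which can then be matched against the known naming patterns.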
Directory Structure
Base Structure
s3://genomeark/
└── species/
    └── {Genus_species}/                      # e.g., Rhinolophus_ferrumequinum
        └── {ToLID}/                          # e.g., mRhiFer1 (VGP specimen ID)
            ├── assembly_vgp_{type}_{version}/
            │   ├── evaluation/               # QC metrics (MAIN ACCESS POINT)
            │   │   ├── genomescope/
            │   │   ├── busco/
            │   │   ├── merqury/
            │   │   └── ...
            │   └── intermediates/            # K-mer databases, temp files
            │       └── meryl/
            └── genomic_data/                 # Raw sequencing data folders
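The layout above maps directly onto S3 prefixes. As a minimal sketch (the function names `evaluation_prefix` and `meryl_prefix` are hypothetical helpers introduced here), the prefixes can be assembled like this:

```python
BASE = "s3://genomeark/species"


def evaluation_prefix(species, tolid, assembly_dir, tool):
    """Prefix for one QC tool, e.g. tool='genomescope', 'busco', or 'merqury'."""
    return f"{BASE}/{species}/{tolid}/{assembly_dir}/evaluation/{tool}/"


def meryl_prefix(species, tolid, assembly_dir):
    """Prefix holding the meryl intermediates for one assembly directory."""
    return f"{BASE}/{species}/{tolid}/{assembly_dir}/intermediates/meryl/"
```

For example, `evaluation_prefix("Rhinolophus_ferrumequinum", "mRhiFer1", "assembly_vgp_HiC_2.0", "genomescope")` yields the GenomeScope directory for that specimen, provided the assembly directory exists under that exact name.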
Assembly Directory Variations
assembly_vgp_{type}_{version} - Standard VGP Patterns:
- assembly_vgp_HiC_2.0 - Hi-C phased assembly (case-sensitive!)
- assembly_vgp_standard_2.0 - Standard assembly without Hi-C
- assembly_vgp_hic_2.0 - Alternative Hi-C naming
- assembly_vgp_trio_2.0 - Trio-binned assembly
Legacy Versions (2019-2021 assemblies):
- assembly_vgp_standard_1.6 - Version 1.6 (common in fish, birds)
- assembly_vgp_standard_1.0 - Version 1.0 (early assemblies)
- assembly_vgp_HiC_1.6 - Hi-C version 1.6
- assembly_vgp_HiC_1.0 - Hi-C version 1.0
- assembly_vgp_HiC_1.4 - Hi-C version 1.4
Verkko Assemblies (diploid assemblies):
- assembly_verkko_1.4/ - Verkko version 1.4
- assembly_verkko_1.1-0.1/ - Verkko version 1.1-0.1
- assembly_verkko_1.1-0.1-freeze/ - Frozen version
- assembly_verkko_1.1-0.2/ - Version 1.1-0.2
- assembly_verkko_1.4.1r/ - Revised version 1.4.1
Clade-Specific Directories (2023+ specialized assemblies):
- assembly_primate_v1.4.2/ - Primate-specific pipeline
- assembly_fish_* - Fish-specific (potential)
- assembly_bird_* - Bird-specific (potential)
Institution-Specific Directories:
- assembly_rockefeller/ - Rockefeller University assemblies
- assembly_cambridge/ - Cambridge assemblies
- assembly_MT_rockefeller/ - Case variation
- assembly_mt_rockefeller/ - Lowercase variation
- assembly_mt_milan/ - Milan institute
Directories Without "assembly_" Prefix (rare):
- vgp_standard_1.6/ - Standard v1.6 without prefix
- vgp_standard_1.0/ - Standard v1.0 without prefix
- vgp_HiC_1.6/ - Hi-C v1.6 without prefix
Curated Assemblies (post-manual curation):
- assembly_curated/ - Exclude from date extraction (post-curation dates)
CRITICAL CASE SENSITIVITY:
- Metadata may store: assembly_vgp_hic_2.0 (lowercase)
- S3 requires: assembly_vgp_HiC_2.0 (mixed case!)
- Always normalize before fetching
COMPREHENSIVE PATTERN MATCHING:
- Don't stop at first match: Try ALL valid paths
- Pri/alt assemblies often use legacy versions (1.6, 1.0)
- Phased assemblies typically use version 2.0
- Verkko assemblies are diploid, use different naming
- Coverage improvement: Using all patterns -> 47-62% vs 27% with basic patterns
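The "try all valid paths" advice can be sketched as follows. The directory list is a subset copied from the names above. The `exists` callable is a hypothetical hook (for instance, a wrapper around an anonymous `aws s3 ls`) that this document does not define, so treat the whole block as an illustration rather than the reference implementation.

```python
# Concrete directory names taken from the lists above (wildcard patterns omitted).
CANDIDATE_ASSEMBLY_DIRS = [
    "assembly_vgp_HiC_2.0",
    "assembly_vgp_hic_2.0",
    "assembly_vgp_standard_2.0",
    "assembly_vgp_trio_2.0",
    "assembly_vgp_standard_1.6",
    "assembly_vgp_HiC_1.6",
    "assembly_vgp_HiC_1.4",
    "assembly_vgp_standard_1.0",
    "assembly_vgp_HiC_1.0",
    "assembly_verkko_1.4",
    "vgp_standard_1.6",
    "vgp_standard_1.0",
    "vgp_HiC_1.6",
]


def candidate_prefixes(species, tolid, dirs=CANDIDATE_ASSEMBLY_DIRS):
    """Build one S3 prefix per candidate assembly directory."""
    base = f"s3://genomeark/species/{species}/{tolid}"
    return [f"{base}/{d}/" for d in dirs]


def find_existing(prefixes, exists):
    """Return every prefix for which exists(prefix) is true.

    All candidates are checked; the search does not stop at the first hit,
    because one specimen can have several assembly directories.
    """
    return [p for p in prefixes if exists(p)]
```

Checking every candidate costs a few extra requests per specimen, which is the trade-off behind the coverage gain quoted above.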
Data Access Summary
For detailed fetching code and parsing logic, see qc-data-fetching.md.
| Data Type | Location | Key Notes |
|---|---|---|
| GenomeScope | evaluation/genomescope/ | 3 filename patterns (double underscore, single underscore, no genomescope prefix); validate heterozygosity ranges |
| BUSCO | evaluation/busco/{subdir}/ | Dynamic subdir search (c/, p/, c1/, p1/); parse C:XX.X% |
| Merqury | evaluation/merqury/ | Two path layouts (direct vs nested); QV in column 4 |
| Meryl hist | intermediates/meryl/ | Use .hist file only (~700KB), not full database (~10GB) |
| Assembly dates | FASTA filenames | YYYYMMDD stamps; see assembly-date-extraction.md |
| Technology | genomic_data/ subfolders | pacbio_hifi/ -> HiFi, ont/ -> ONT, etc. |
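Two of the parsing notes in the table can be illustrated briefly. This is a minimal sketch, not the code from qc-data-fetching.md: it assumes the BUSCO summary contains a token like `C:95.3%` and that "column 4" of a Merqury QV line is counted from 1 in whitespace-separated fields. Both function names are hypothetical.

```python
import re

# Matches the completeness token of a BUSCO summary line, e.g. "C:95.3%".
_BUSCO_COMPLETE = re.compile(r"C:(\d+(?:\.\d+)?)%")


def parse_busco_completeness(text):
    """Return the BUSCO completeness percentage as a float, or None if absent."""
    match = _BUSCO_COMPLETE.search(text)
    return float(match.group(1)) if match else None


def parse_merqury_qv(line):
    """Return the QV from the 4th whitespace-separated field, or None."""
    fields = line.split()
    if len(fields) < 4:
        return None
    try:
        return float(fields[3])
    except ValueError:
        return None
```

Returning `None` instead of raising keeps batch runs going when a file is missing or malformed; callers can count the `None` results to report coverage.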
Path Normalization (used by all fetching functions)
def normalize_s3_path(s3_path):
    """Normalize path for GenomeArk (case sensitivity!)"""
    if not s3_path:
        return None
    s3_path = s3_path.replace('/assembly_vgp_hic_2.0/', '/assembly_vgp_HiC_2.0/')
    if not s3_path.endswith('/'):
        s3_path += '/'
    return s3_path
GenomeScope Filename Patterns (TRY ALL THREE!)
- Pattern A: {ToLID}_genomescope__Summary.txt (double underscore, most common)
- Pattern C: {ToLID}_genomescope_Summary.txt (single underscore, easily missed)
- Pattern B: {ToLID}_Summary.txt (no prefix, older assemblies)
Checking only A and B causes ~30-40% of data to be missed.
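A small helper can generate all three candidate paths so none is forgotten. The sketch below assumes the caller already has the `evaluation/genomescope/` prefix; `genomescope_candidates` is an illustrative name.

```python
def genomescope_candidates(genomescope_prefix, tolid):
    """Return the three candidate summary paths in the order A, C, B."""
    prefix = genomescope_prefix if genomescope_prefix.endswith("/") else genomescope_prefix + "/"
    names = [
        f"{tolid}_genomescope__Summary.txt",  # Pattern A: double underscore
        f"{tolid}_genomescope_Summary.txt",   # Pattern C: single underscore
        f"{tolid}_Summary.txt",               # Pattern B: no genomescope prefix
    ]
    return [prefix + name for name in names]
```

A fetcher would then try each path in turn and keep the first one that downloads successfully.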
GenomeScope Validation
Reject failed runs where the heterozygosity range (max minus min) exceeds 50 percentage points or the maximum exceeds 95%. A range of 0%-100% indicates complete model failure.
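The validation rule can be expressed as a small check. This sketch makes two assumptions that should be verified against real files: that "range" means max minus min, and that the summary contains a line beginning with "Heterozyg" followed by the minimum and maximum as percentages. The function names are hypothetical.

```python
import re

# Assumed summary line shape: "Heterozygous (ab)   0.516%   0.529%"
_HET_LINE = re.compile(
    r"^Heterozyg\S*.*?(\d+(?:\.\d+)?)%\s+(\d+(?:\.\d+)?)%",
    re.MULTILINE,
)


def parse_heterozygosity_range(summary_text):
    """Return (min_pct, max_pct) from a GenomeScope summary, or None."""
    match = _HET_LINE.search(summary_text)
    if not match:
        return None
    return float(match.group(1)), float(match.group(2))


def is_plausible_heterozygosity(het_min, het_max):
    """Apply the rejection rule: range > 50 points or max > 95% means failure."""
    if het_max > 95.0:
        return False
    if het_max - het_min > 50.0:
        return False
    return True
```

A caller would treat a `None` parse result the same way as a failed validation and skip the record.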
Meryl Histograms - Direct HTTPS URLs (for Galaxy import)
https://genomeark.s3.amazonaws.com/species/{species}/{tolid}/assembly_vgp_standard_1.0/intermediates/meryl/{tolid}.cut.meryl.hist
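The URL template above can be filled in programmatically. In this sketch the `assembly_dir` parameter is an assumption added for illustration: the template in this document hard-codes `assembly_vgp_standard_1.0`, which is kept as the default.

```python
def meryl_hist_url(species, tolid, assembly_dir="assembly_vgp_standard_1.0"):
    """Build the direct HTTPS URL of a specimen's .hist file."""
    return (
        "https://genomeark.s3.amazonaws.com/species/"
        f"{species}/{tolid}/{assembly_dir}/intermediates/meryl/"
        f"{tolid}.cut.meryl.hist"
    )
```

The resulting URL can be pasted into Galaxy's upload dialog; whether the file exists for a given specimen and assembly directory still has to be checked.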
Quick Reference
AWS CLI pattern (prefer over boto3 for public buckets):
import subprocess
# Stream the object to stdout ('-') without AWS credentials
cmd = ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request']
result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
Rate limiting: 0.2s delay between requests.
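Putting the CLI pattern and the 0.2s delay together, a batch fetch might look like the sketch below. `fetch_s3_text` and `fetch_many` are hypothetical names; the `fetch` and `sleep` parameters exist only so the loop can be exercised without network access.

```python
import subprocess
import time


def fetch_s3_text(s3_path, timeout=30):
    """Stream one public S3 object to a string; return None on any failure."""
    cmd = ["aws", "s3", "cp", s3_path, "-", "--no-sign-request"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Timed out, or the AWS CLI is not installed.
        return None
    return result.stdout if result.returncode == 0 else None


def fetch_many(s3_paths, fetch=fetch_s3_text, delay=0.2, sleep=time.sleep):
    """Fetch paths sequentially, pausing `delay` seconds between requests."""
    results = {}
    for index, path in enumerate(s3_paths):
        if index > 0:
            sleep(delay)  # rate limiting between consecutive requests
        results[path] = fetch(path)
    return results
```

Missing objects come back as `None`, so the caller can distinguish "not found under this pattern" from a parsed value and move on to the next fallback path.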
Common pitfalls: Case sensitivity (hic vs HiC), directory evolution (2022 vs 2024 layouts), downloading full meryl databases instead of .hist files. See best-practices.md for full list.