NYC
skills/delphine-l/claude_global/bioinformatics-fundamentals

bioinformatics-fundamentals

SKILL.md

Bioinformatics Fundamentals

Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.

When to Use This Skill

  • Working with sequencing data (PacBio HiFi, Hi-C, Illumina)
  • Debugging SAM/BAM alignment or filtering issues
  • Processing AGP files for genome assembly curation
  • Validating AGP coordinate systems and unloc assignments
  • Understanding paired-end vs single-end data
  • Interpreting quality metrics (MAPQ, PHRED scores)
  • Troubleshooting empty outputs or broken read pairs
  • General bioinformatics data analysis

SAM/BAM Format Essentials

SAM Flags (Bitwise)

Flags are additive - a read can have multiple flags set simultaneously.

Common Flags:

  • 0x0001 (1): Read is paired in sequencing
  • 0x0002 (2): Each segment properly aligned (proper pair)
  • 0x0004 (4): Read unmapped
  • 0x0008 (8): Mate unmapped
  • 0x0010 (16): Read mapped to reverse strand
  • 0x0020 (32): Mate mapped to reverse strand
  • 0x0040 (64): First in pair (R1/forward)
  • 0x0080 (128): Second in pair (R2/reverse)
  • 0x0100 (256): Secondary alignment
  • 0x0400 (1024): PCR or optical duplicate
  • 0x0800 (2048): Supplementary alignment

Flag Combinations:

  • Properly paired R1: 99 (0x63 = 1 + 2 + 32 + 64)
  • Properly paired R2: 147 (0x93 = 1 + 2 + 16 + 128)
  • Unmapped read: 4
  • Mate unmapped: 8

Proper Pair Flag (0x0002)

What "proper pair" means:

  • Both R1 and R2 are mapped
  • Mapping orientations are correct (typically R1 forward, R2 reverse)
  • Insert size is reasonable for the library
  • Pair conforms to aligner's expectations

Important: Different aligners have different criteria for proper pairs!

MAPQ (Mapping Quality)

Formula: MAPQ = -10 * log10(P(mapping is wrong))

Common Thresholds:

  • MAPQ >= 60: High confidence (error probability < 0.0001%)
  • MAPQ >= 30: Good quality (error probability < 0.1%)
  • MAPQ >= 20: Acceptable (error probability < 1%)
  • MAPQ >= 10: Low confidence (error probability < 10%)
  • MAPQ = 0: Multi-mapper or unmapped

Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.

CIGAR String

Represents alignment between read and reference:

  • M: Match or mismatch (alignment match)
  • I: Insertion in read vs reference
  • D: Deletion in read vs reference
  • S: Soft clipping (bases in read not aligned)
  • H: Hard clipping (bases not in read sequence)
  • N: Skipped region (for RNA-seq splicing)

Example: 100M = perfect 100bp match Example: 50M5I45M = 50bp match, 5bp insertion, 45bp match

Sequencing Technologies

PacBio HiFi (High Fidelity)

Characteristics:

  • Long reads: 10-25 kb typical
  • High accuracy: >99.9% (Q20+)
  • Circular Consensus Sequencing (CCS)
  • Single-end data (though from circular molecules)
  • Excellent for de novo assembly

Best Mappers:

  • minimap2 presets: map-pb, map-hifi
  • BWA-MEM2 can work but optimized for short reads

Typical Use Cases:

  • De novo genome assembly
  • Structural variant detection
  • Isoform sequencing (Iso-Seq)
  • Haplotype phasing

Hi-C (Chromatin Conformation Capture)

Characteristics:

  • Paired-end short reads (typically 100-150 bp)
  • Read pairs capture chromatin interactions
  • R1 and R2 often map to different scaffolds/chromosomes
  • Requires careful proper pair handling
  • Used for scaffolding and 3D genome structure

Best Mappers:

  • BWA-MEM2 (paired-end mode)
  • BWA-MEM (paired-end mode)

Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!

Typical Use Cases:

  • Genome scaffolding (connecting contigs)
  • 3D chromatin structure analysis
  • Haplotype phasing
  • Assembly quality assessment

Illumina Short Reads

Characteristics:

  • Short reads: 50-300 bp
  • Paired-end or single-end
  • High throughput
  • Well-established quality scores

Best Mappers:

  • BWA-MEM2, BWA-MEM (general purpose)
  • Bowtie2 (fast, local alignment)
  • STAR (RNA-seq spliced alignment)

Common Tools and Their Behaviors

samtools view

Purpose: Filter, convert, and view SAM/BAM files

Key Flags:

  • -b: Output BAM format
  • -h: Include header
  • -f INT: Require flags (keep reads WITH these flags)
  • -F INT: Filter flags (remove reads WITH these flags)
  • -q INT: Minimum MAPQ threshold
  • -L FILE: Keep reads overlapping regions in BED file

Important Behavior:

  • -L (region filtering) checks each read individually, not pairs
  • Can break read pairs if mates map to different regions
  • Flag filters (-f, -F) are applied before region filters (-L)

Example - Filter for proper pairs:

samtools view -b -f 2 input.bam > proper_pairs.bam

Example - Filter by region (may break pairs):

samtools view -b -L regions.bed input.bam > filtered.bam

Example - Proper pairs in regions (correct order):

samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam

bamtools filter

Purpose: Advanced filtering with complex criteria

Key Features:

  • Can filter on multiple properties simultaneously
  • More strict about pair validation than samtools
  • Supports JSON filter rules

Common Filters:

  • isPaired: true - Read is from paired-end sequencing
  • isProperPair: true - Read is part of proper pair
  • isMapped: true - Read is mapped
  • mapQuality: >=30 - Mapping quality threshold

Important Difference from samtools:

  • isProperPair is more strict than samtools -f 2
  • Checks pair validity more thoroughly
  • Better for ensuring R1/R2 match correctly

samtools fastx

Purpose: Convert SAM/BAM to FASTQ/FASTA

Output Modes:

  • outputs: ["r1", "r2"] - Separate forward and reverse for paired-end
  • outputs: ["other"] - Single output for single-end data
  • outputs: ["r0"] - All reads (mixed paired/unpaired)

Filtering Options:

  • inclusive_filter: ["2"] - Require proper pair flag
  • exclusive_filter: ["4", "8"] - Exclude unmapped or mate unmapped
  • exclusive_filter_all: ["8"] - Exclude if mate unmapped

Critical: Use appropriate filters to ensure R1/R2 files match!

Common Patterns and Best Practices

Pattern 1: Filtering Paired-End Data by Regions

WRONG WAY (breaks pairs):

# Region filter first → breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair
# Result: Empty output (all pairs broken)

RIGHT WAY (preserves pairs):

# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
# Result: Pairs where both mates are in regions (or one mate in region, other anywhere)

BEST WAY (both mates in regions):

# Filter for proper pairs, then use paired-aware region filtering
samtools view -b -f 2 input.bam | \
  # Custom script to keep pairs where both mates in regions

Pattern 2: Extracting FASTQ from Filtered BAM

For Paired-End:

# Ensure proper pairs before extraction
samtools fastx -1 R1.fq.gz -2 R2.fq.gz \
  --i1-flags 2 \  # Require proper pair
  --i2-flags 64,128 \  # Separate R1/R2
  input.bam

For Single-End:

# Simple extraction
samtools fastx -0 output.fq.gz input.bam

Pattern 3: Quality Filtering

Conservative (high quality):

samtools view -b -q 30 -f 2 -F 256 -F 2048 input.bam
# MAPQ >= 30, proper pairs, no secondary/supplementary

Permissive (for low-coverage data):

samtools view -b -q 10 -F 4 input.bam
# MAPQ >= 10, mapped reads

Common Issues and Solutions

Issue 1: Empty Output After Region Filtering (Hi-C Data)

Symptom:

  • BAM file non-empty before filtering
  • Empty after region filtering + proper pair filtering
  • Happens with paired-end data (especially Hi-C)

Cause:

  • Region filter (samtools view -L) breaks read pairs
  • One mate in region, other mate outside region
  • Proper pair flag (0x2) is lost
  • Subsequent isProperPair filter removes all reads

Solution:

# Apply proper pair filter BEFORE region filtering
samtools view -b -f 2 -L regions.bed input.bam > output.bam

See Also: common-issues.md for detailed troubleshooting

Issue 2: R1 and R2 Files Have Different Read Counts

Symptom:

  • Forward and reverse FASTQ files have different numbers of reads
  • Downstream tools fail expecting matched pairs

Cause:

  • Improper filtering broke some pairs
  • One mate filtered out, other kept
  • Extraction didn't require proper pairing

Solution:

# Require proper pairs during extraction
samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam

Issue 3: Low Mapping Rate for Hi-C Data

Symptom:

  • Many Hi-C reads unmapped or low MAPQ
  • Expected for Hi-C due to chimeric reads

Not Actually a Problem:

  • Hi-C involves ligation of distant DNA fragments
  • Creates chimeric molecules
  • Mappers may mark these as low quality or unmapped
  • This is normal for Hi-C data

Solution:

  • Use Hi-C-specific pipelines (e.g., HiC-Pro, Juicer)
  • Don't filter too aggressively on MAPQ
  • Accept lower mapping rates than DNA-seq

Issue 4: Proper Pairs Lost After Mapping

Symptom:

  • Few reads marked as proper pairs (flag 0x2)
  • Expected paired-end data

Possible Causes:

  1. Insert size distribution wrong (check aligner parameters)
  2. Reference mismatch (reads from different assembly)
  3. Poor library quality
  4. Incorrect orientation flags passed to aligner

Solution:

# Check insert size distribution
samtools stats input.bam | grep "insert size"

# Check pairing flags
samtools flagstat input.bam

Quality Metrics

N50 and Related Metrics

N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer

How to interpret:

  • Higher N50 = better contiguity
  • Compare to expected chromosome/scaffold sizes
  • Use with caution - can be misleading for fragmented assemblies

Related Metrics:

  • L50: Number of contigs needed to reach N50
  • N90: More stringent than N50 (90% coverage)
  • NG50: N50 relative to genome size (better for comparisons)

Coverage and Depth

Coverage: Percentage of reference bases covered by at least one read Depth: Average number of reads covering each base

Recommended Depths:

  • Genome assembly (HiFi): 30-50x
  • Variant calling: 30x minimum
  • RNA-seq: 20-40 million reads
  • Hi-C scaffolding: 50-100x genomic coverage

File Format Quick Reference

FASTA

>sequence_id description
ATCGATCGATCG
ATCGATCG
  • Header line starts with >
  • Can span multiple lines
  • No quality scores

FASTQ

@read_id
ATCGATCGATCG
+
IIIIIIIIIIII
  • Four lines per read
  • Quality scores (Phred+33 encoding typical)
  • Can be gzipped (.fastq.gz)

BED

chr1    1000    2000    feature_name    score    +
  • 0-based coordinates
  • Used for regions, features, intervals
  • Minimum 3 columns (chrom, start, end)

AGP

chr1    1    5000    1    W    contig_1    1    5000    +
chr1    5001 5100    2    U    100    scaffold    yes    proximity_ligation
  • Tab-delimited genome assembly format
  • 1-based closed coordinates [start, end]
  • Describes construction of objects from components
  • Object and component lengths must match
  • See AGP Format section for complete specification

Best Practices

General

  1. Always check data type: Paired-end vs single-end determines filtering strategy
  2. Understand your sequencing technology: Hi-C behaves differently than HiFi
  3. Filter in the right order: Proper pairs BEFORE region filtering
  4. Validate outputs: Check file sizes, read counts, flagstat
  5. Use appropriate MAPQ thresholds: Too stringent = lost data, too permissive = noise

For Hi-C Data

  1. Expect distant read pairs: Don't be surprised by different scaffolds
  2. Preserve proper pairs: Critical for downstream scaffolding
  3. Use paired-aware tools: Standard filters may break pairs
  4. Don't over-filter on MAPQ: Hi-C often has lower MAPQ than DNA-seq

For HiFi Data

  1. Single-end processing: No pair concerns
  2. High quality expected: Can use strict filters
  3. Use appropriate presets: minimap2 map-hifi or map-pb
  4. Consider read length distribution: HiFi reads vary in length

For Tool Testing

  1. Create self-contained datasets: Both mates in selected region
  2. Maintain proper pairs: Essential for realistic testing
  3. Use representative data: Subsample proportionally, not randomly
  4. Verify file sizes: Too small = overly filtered

Genome Ark Data Retrieval

Overview

Genome Ark (s3://genomeark/) is a public AWS S3 bucket containing VGP assemblies and QC data. Access requires no credentials using --no-sign-request.

Directory Structure

s3://genomeark/species/{SPECIES_NAME}/{TOLID}/
├── assembly_vgp_HiC_2.0/          # Standard HiC assembly
├── assembly_vgp_standard_2.1/     # Standard assembly
├── assembly_vgp_trio_1.0/         # Trio assembly
├── assembly_curated/              # Manually curated
├── assembly_{METHOD}_{VERSION}/   # Method-specific (e.g., assembly_verkko_1.4)
└── genomic_data/                  # Raw sequencing data

GenomeScope Data Locations

GenomeScope summaries can be in multiple locations - search comprehensively:

Common paths (priority order):

  1. {assembly}/evaluation/genomescope/{TOLID}_genomescope__Summary.txt
  2. {assembly}/evaluation/genomescope2/{TOLID}_genomescope__Summary.txt
  3. {assembly}/qc/genomescope/{TOLID}_genomescope__Summary.txt
  4. {assembly}/intermediates/genomescope/{TOLID}_genomescope__Summary.txt

Search strategy:

  1. Dynamically discover assembly folders (not all species use same structure)
  2. Search multiple subfolder locations per assembly
  3. Try standard assemblies first (HiC_2.0, standard_2.1)
  4. Fall back to method-specific assemblies

Python Implementation Pattern

def search_genomescope_recursive(species_name, tolid):
    """Search for GenomeScope data across all assembly folders."""
    species_s3 = species_name.replace(' ', '_')

    # 1. Discover assembly folders
    result = subprocess.run(
        ['aws', 's3', 'ls', f's3://genomeark/species/{species_s3}/{tolid}/',
         '--no-sign-request'],
        capture_output=True, text=True
    )

    assembly_folders = []
    for line in result.stdout.strip().split('\n'):
        if 'PRE' in line:
            folder = line.split('PRE')[1].strip().rstrip('/')
            if folder.startswith('assembly'):
                assembly_folders.append(folder)

    # 2. Search patterns
    patterns = [
        'evaluation/genomescope/{tolid}_genomescope__Summary.txt',
        'evaluation/genomescope2/{tolid}_genomescope__Summary.txt',
        'qc/genomescope/{tolid}_genomescope__Summary.txt',
        'intermediates/genomescope/{tolid}_genomescope__Summary.txt',
    ]

    # 3. Try each combination
    for assembly in assembly_folders:
        for pattern in patterns:
            s3_path = f's3://genomeark/species/{species_s3}/{tolid}/{assembly}/{pattern.format(tolid=tolid)}'
            result = subprocess.run(
                ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request'],
                capture_output=True, text=True, timeout=10
            )
            if result.returncode == 0 and result.stdout:
                return result.stdout

    return None

GenomeScope Summary Parsing

def parse_genomescope_summary(content):
    """Extract genome characteristics from GenomeScope2 summary."""
    data = {}

    # Genome Haploid Length (max value - second column)
    match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['genome_size_genomescope'] = int(match.group(1).replace(',', ''))

    # Heterozygosity percentage (max value)
    match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
    if match:
        data['heterozygosity_percent'] = float(match.group(1))

    # Genome Unique Length (max value)
    match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['unique_length'] = int(match.group(1).replace(',', ''))

    # Genome Repeat Length (max value)
    match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['repeat_length'] = int(match.group(1).replace(',', ''))

    # Calculate repeat content percentage
    if 'repeat_length' in data and 'unique_length' in data:
        total = data['repeat_length'] + data['unique_length']
        if total > 0:
            data['repeat_content_percent'] = (data['repeat_length'] / total) * 100

    return data

Best Practices

  1. Rate limiting: Add 0.2s delay between successful fetches to be respectful to AWS
  2. Caching: Check if file exists locally before fetching
  3. Timeout: Use 10-15s timeout per request
  4. Assembly discovery: Always discover assemblies dynamically - don't assume structure
  5. Multiple locations: GenomeScope data may be in evaluation/, qc/, or intermediates/
  6. Expected coverage: ~15-20% of VGP assemblies have GenomeScope data available

Common Issues

Species not found: Some species use different naming (check exact spacing/underscores)

# Search for species
aws s3 ls s3://genomeark/species/ --no-sign-request | grep -i "species_name"

Assembly folder variations: Not all use standard names

# List all assemblies for a species
aws s3 ls s3://genomeark/species/{SPECIES}/{TOLID}/ --no-sign-request

AWS CLI Setup

# Install AWS CLI (no credentials needed for Genome Ark)
conda install -c conda-forge awscli

# Test access
aws s3 ls s3://genomeark/species/ --no-sign-request | head

Karyotype Data Curation and Literature Search

Overview

Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required.

Search Strategy

Effective Search Terms

"{species_name} karyotype chromosome 2n"
"{species_name} diploid number karyotype"
"{genus} karyotype evolution"
"cytogenetic analysis {family_name}"
"{species_name} chromosome number diploid"

Best Reference Sources

  1. PubMed/PMC: Primary cytogenetic studies
  2. ResearchGate: Karyotype descriptions and figures
  3. Specialized databases:
  4. Genome assembly papers: Often mention expected karyotype
  5. Comparative cytogenetic studies: Family-level analyses

Search Time Estimates

  • Model organisms, domestic species: 2-3 minutes
  • Well-studied taxonomic groups: 5-10 minutes
  • Rare/uncommon species: 10-20 minutes or not found

Taxonomic Conservation Patterns

Mammals

  • Cetaceans: Highly conserved 2n = 44, n = 22 (exceptions: pygmy sperm whale, right whale, beaked whales = 2n = 42)
  • Felidae: Conserved 2n = 38, n = 19
  • Canidae: Conserved 2n = 78, n = 39
  • Primates: Variable (great apes 2n = 48, macaques 2n = 42, marmosets 2n = 46)

Birds

  • Anatidae (waterfowl): Highly conserved 2n = 80, n = 40 across ducks, geese, swans
  • Galliformes (game birds): Typically 2n = 78, n = 39 (chicken, quail, grouse)
  • Passerines: Variable 2n = 78-82, most common 2n = 80
  • Ancestral avian karyotype: Putative 2n = 80
  • General pattern: 50.7% of birds have 2n = 78-82; 21.7% have exactly 2n = 80

Reptiles

  • Lacertidae (wall lizards): Often 2n = 38, n = 19

Genome Assembly Interpretation

⚠️ Warning: Chromosome-level assemblies often report fewer chromosomes than actual diploid number

Why: Assemblies typically capture only:

  • Macrochromosomes (large chromosomes)
  • Larger microchromosomes
  • Small microchromosomes remain unassembled

Example: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes"

  • True karyotype: 10 macro pairs + 30 micro pairs = 80
  • Assembly: ~34-42 scaffolds (only macro + larger micros)

Using Conservation for Inference

When specific karyotype data is unavailable but genus/family patterns are strong:

  1. High confidence inference (acceptable for publication):

    • Multiple congeneric species confirmed
    • Family-level conservation documented
    • No known exceptions in genus
  2. Document inference clearly:

    accession,taxid,species,2n,n,notes,reference
    GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url
    
  3. Priority for direct confirmation:

    • Species with conservation exceptions
    • Type specimens or reference species
    • Phylogenetically divergent lineages

VGP-Specific: Sex Chromosome Adjustment

When both sex chromosomes are in main haplotype (common in VGP assemblies):

  • Expected scaffolds = n + 1 (not n)
  • Reason: X+Y or Z+W = two distinct chromosomes
  • Check: VGP metadata column "Sex chromosomes main haplotype"
  • Patterns: "Has X and Y", "Has Z and W", "Has X1, X2, and Y"

Data Recording Format

CSV Structure:

accession,taxid,species_name,diploid_2n,haploid_n,notes,reference
GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/...

Notes field examples:

  • "Standard {family} karyotype"
  • "Conserved {genus} karyotype"
  • "Inferred from {family} conservation"
  • "Unusual karyotype for family"
  • "Geographic variation reported"

Prioritization for Literature Searches

TIER 1 (>90% success rate):

  • Model organisms (zebrafish, mouse, medaka)
  • Domestic species (chicken, goat, sheep)
  • Game animals (waterfowl, deer)
  • Laboratory species (fruit fly, nematode)

TIER 2 (70-90% success rate):

  • Well-studied taxonomic groups (Podarcis lizards, corvids)
  • Conservation focus species (raptors, large mammals)
  • Commercial species (salmonids, oysters)

TIER 3 (50-70% success rate):

  • Common but not economically important
  • Widespread distribution
  • Recent phylogenetic interest

Low priority (<50% success rate):

  • Deep-sea species
  • Rare/endangered without conservation genetics
  • Recently described species
  • Cryptic species complexes

AGP Format (A Golden Path)

Overview

AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI.

When to Use This Knowledge

  • Processing genome assemblies for submission to databases
  • Curating chromosome-level assemblies
  • Splitting haplotype assemblies
  • Assigning unlocalized scaffolds (unlocs)
  • Debugging AGP validation errors
  • Converting between assembly representations

AGP Format Structure

Tab-delimited format with 9 columns for sequence lines (type W) or 8+ columns for gap lines (type U/N)

Sequence Lines (Column 5 = 'W'):

object  obj_beg  obj_end  part_num  W  component_id  comp_beg  comp_end  orientation

Gap Lines (Column 5 = 'U' or 'N'):

object  obj_beg  obj_end  part_num  gap_type  gap_length  gap_type  linkage  linkage_evidence

Critical Coordinate Rules

Rule 1: Object and Component Lengths MUST Match

For sequence lines, the span in the object MUST equal the span in the component:

(obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)

Example - CORRECT:

Scaffold_47_unloc_1  1  54360  1  W  scaffold_23.hap1  19274039  19328398  -
# Object length: 54360 - 1 + 1 = 54,360 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓

Example - INCORRECT:

Scaffold_47_unloc_1  1  19328398  1  W  scaffold_23.hap1  19274039  19328398  -
# Object length: 19328398 - 1 + 1 = 19,328,398 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗
# ERROR: Lengths don't match!

Rule 2: Component Numbering Restarts for New Objects

Each object (column 1) has its own component numbering (column 4) starting at 1:

Scaffold_10         1  30578279  1  W  scaffold_4.hap2   1  30578279  -
Scaffold_10_unloc_1 1  65764     1  W  scaffold_74.hap2  1  65764     +  # ← Starts at 1, not 3!

Rule 3: Sequential Component Numbering Within Objects

Component numbers increment sequentially (gaps and sequences both count):

Scaffold_2  1        1731008   1  W  scaffold_25.hap1  1  1731008  -
Scaffold_2  1731009  1731108   2  U  100  scaffold  yes  proximity_ligation
Scaffold_2  1731109  1956041   3  W  scaffold_70.hap1  1  224933   -

Common AGP Processing Issues

Issue 1: Incorrect Object Coordinates When Creating Unlocs

Symptom:

ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398)
do not have the same length

Cause: When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the length of the extracted region, not the original component end coordinate.

Wrong Approach:

# Setting object end to component end coordinate
agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end']  # ✗ WRONG

Correct Approach:

# Calculate actual length from component coordinates
agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1  # ✓ CORRECT

Issue 2: Component Numbering Not Reset for New Objects

Symptom: Unloc scaffolds have component numbers > 1 when they should start at 1.

Cause: When creating a new object (unloc scaffold), component numbering wasn't reset.

Solution:

# When creating unlocs, reset component number
agp_df.loc[index, '#_scaffs'] = 1  # Column 4: component number

Issue 3: AGPcorrect Accumulating Coordinates

Symptom: Unloc sequences inherit cumulative coordinates from parent scaffolds.

Cause: AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans.

Solution: Always recalculate object coordinates from component spans when creating new objects (unlocs).

AGP Processing Best Practices

1. Coordinate System Understanding

  • Object coordinates (columns 2-3): Position within the assembled object (1-based, inclusive)
  • Component coordinates (columns 7-8): Position within the source sequence (1-based, inclusive)
  • Both use 1-based closed intervals [start, end]
  • Length calculation: end - start + 1

2. Creating Unlocalized Sequences (Unlocs)

# When extracting a region to create an unloc:
# 1. Calculate the actual length of the region
length = int(comp_end) - int(comp_start) + 1

# 2. Set object coordinates for the new unloc
obj_start = 1  # Always starts at 1
obj_end = length  # Equals the length

# 3. Reset component number
component_num = 1  # New object, new numbering

# 4. Rename the object
new_object_name = f"{parent_scaffold}_unloc_{unloc_number}"

3. Validating AGP Files

Use NCBI's AGP validator:

agp_validate assembly.agp

Common validation checks:

  • Object/component length match
  • Sequential component numbering
  • No coordinate overlaps
  • Gap specifications valid
  • Orientation values (+, -, ?, 0, na)

4. Handling Haplotype-Split Assemblies

When splitting diploid assemblies into haplotypes:

  1. Identify haplotype markers in sequence names (H1/hap1, H2/hap2)
  2. Maintain proper pairing information
  3. Process unlocs separately per haplotype
  4. Remove haplotig duplications
  5. Track gaps appropriately (especially proximity ligation gaps)

AGP Coordinate Debugging Pattern

When encountering coordinate errors:

# For each AGP line, verify:
obj_length = int(obj_end) - int(obj_beg) + 1
comp_length = int(comp_end) - int(comp_beg) + 1

assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}"

# For sequential component numbers:
assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}"

AGP File Structure by Assembly Stage

1. Raw Assembly AGP:

  • Direct representation from assembler
  • May have incorrect sequence lengths
  • Needs coordinate correction (AGPcorrect)

2. Corrected AGP:

  • Sequence lengths match actual FASTA
  • Coordinates adjusted for length discrepancies
  • Ready for haplotype splitting

3. Haplotype-Split AGP:

  • Separate files per haplotype
  • Unlocs identified but not separated
  • Haplotigs marked but not removed

4. Final Curated AGP:

  • Unlocs separated into individual objects
  • Haplotigs removed to separate file
  • Proximity ligation gaps cleaned
  • Ready for database submission

BED File Processing and Telomere Analysis

Pattern: Classifying Scaffolds by Telomere Types

When analyzing telomere data from BED files to classify scaffolds:

File Structure:

  • Terminal telomeres BED: columns include scaffold, start, end, orientation (p/q), accession
  • Interstitial telomeres BED: similar structure with position markers (p/q/u for internal)

Best Practice - Use Python CSV Module:

import csv
from collections import defaultdict

# Use defaultdict for automatic initialization
telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0})

# Process with csv.reader (more portable than pandas)
with open('telomeres.bed', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scaffold = row[0]
        accession = row[10]  # GCA accession
        key = (accession, scaffold)
        telomere_counts[key]['terminal'] += 1

Why CSV over pandas:

  • No external dependencies (pandas may not be installed)
  • Faster for simple tabular operations
  • Lower memory footprint for large files
  • Better portability across environments

Classification Categories:

  1. Category 1: 2 terminal telomeres, 0 interstitial (complete chromosomes)
  2. Category 2: 1 terminal telomere, 0 interstitial (partial)
  3. Category 3: Has interstitial telomeres (likely assembly issues)

NCBI Data Integration Strategies

Check Existing Data Sources Before API Calls

Problem: Need chromosome counts for 400+ assemblies from NCBI.

Anti-pattern: Query NCBI datasets API for each accession

# DON'T: Query 400+ times
for accession in missing_data:
    result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession])
    # Takes 10+ minutes, hits API rate limits

Better Pattern: Check if data already exists in compiled tables

# DO: Look for existing compiled data first
# VGP table has multiple chromosome count columns:
# - num_chromosomes (column 54)
# - total_number_of_chromosomes (column 106)
# - num_chromosomes_haploid (column 122)

# Read from existing comprehensive table
with open('VGP-table.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        num_chr = row[53] if row[53] else row[105]  # Fallback strategy

Results: Filled 392/417 missing values instantly vs 10+ minutes of API calls.

Fallback Strategy for Multiple Columns:

# Try multiple sources in order of preference
num_chromosomes = row[53] if (len(row) > 53 and row[53]) else ''
if not num_chromosomes and len(row) > 105:
    num_chromosomes = row[105]  # Alternative column

When to use NCBI API:

  • Data not in existing tables
  • Need real-time/latest data
  • Fetching assembly reports or sequence data
  • Small number of queries (<20)

API Best Practices (when necessary):

  • Use full path to datasets command (may be aliased)
  • Add delays between calls (time.sleep(0.5))
  • Set reasonable timeouts
  • Handle errors gracefully

Related Skills

  • vgp-pipeline - VGP workflows process Hi-C and HiFi data
  • galaxy-tool-wrapping - Galaxy tools work with SAM/BAM and sequencing data formats
  • galaxy-workflow-development - Workflows process sequencing data

Supporting Documentation

  • reference.md: Detailed format specifications and tool documentation
  • common-issues.md: Comprehensive troubleshooting guide with examples

Version History

  • v1.1.1: Added BED file processing patterns for telomere analysis and NCBI data integration strategies
  • v1.1.0: Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns
  • v1.0.0: Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns
Weekly Installs
29
First Seen
Jan 24, 2026
Installed on
claude-code19
opencode17
gemini-cli15
codex14
antigravity12
cursor12