Bioinformatics Fundamentals

Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.

When to Use This Skill

Working with sequencing data (PacBio HiFi, Hi-C, Illumina)
Debugging SAM/BAM alignment or filtering issues
Processing AGP files for genome assembly curation
Validating AGP coordinate systems and unloc assignments
Understanding paired-end vs single-end data
Interpreting quality metrics (MAPQ, PHRED scores)
Troubleshooting empty outputs or broken read pairs
General bioinformatics data analysis

SAM/BAM Format Essentials

SAM Flags (Bitwise)

Flags are additive - a read can have multiple flags set simultaneously.

Common Flags:

0x0001 (1): Read is paired in sequencing
0x0002 (2): Each segment properly aligned (proper pair)
0x0004 (4): Read unmapped
0x0008 (8): Mate unmapped
0x0010 (16): Read mapped to reverse strand
0x0020 (32): Mate mapped to reverse strand
0x0040 (64): First in pair (R1/forward)
0x0080 (128): Second in pair (R2/reverse)
0x0100 (256): Secondary alignment
0x0400 (1024): PCR or optical duplicate
0x0800 (2048): Supplementary alignment

Flag Combinations:

Properly paired R1: 99 (0x63 = 1 + 2 + 32 + 64)
Properly paired R2: 147 (0x93 = 1 + 2 + 16 + 128)
Unmapped read: 4
Mate unmapped: 8

Proper Pair Flag (0x0002)

What "proper pair" means:

Both R1 and R2 are mapped
Mapping orientations are correct (typically R1 forward, R2 reverse)
Insert size is reasonable for the library
Pair conforms to aligner's expectations

Important: Different aligners have different criteria for proper pairs!

MAPQ (Mapping Quality)

Formula: MAPQ = -10 * log10(P(mapping is wrong))

Common Thresholds:

MAPQ >= 60: High confidence (error probability < 0.0001%)
MAPQ >= 30: Good quality (error probability < 0.1%)
MAPQ >= 20: Acceptable (error probability < 1%)
MAPQ >= 10: Low confidence (error probability < 10%)
MAPQ = 0: Multi-mapper or unmapped

Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.

CIGAR String

Represents alignment between read and reference:

M: Match or mismatch (alignment match)
I: Insertion in read vs reference
D: Deletion in read vs reference
S: Soft clipping (bases in read not aligned)
H: Hard clipping (bases not in read sequence)
N: Skipped region (for RNA-seq splicing)

Example: 100M = perfect 100bp match Example: 50M5I45M = 50bp match, 5bp insertion, 45bp match

Sequencing Technologies

PacBio HiFi (High Fidelity)

Characteristics:

Long reads: 10-25 kb typical
High accuracy: >99.9% (Q20+)
Circular Consensus Sequencing (CCS)
Single-end data (though from circular molecules)
Excellent for de novo assembly

Best Mappers:

minimap2 presets: map-pb, map-hifi
BWA-MEM2 can work but optimized for short reads

Typical Use Cases:

De novo genome assembly
Structural variant detection
Isoform sequencing (Iso-Seq)
Haplotype phasing

Hi-C (Chromatin Conformation Capture)

Characteristics:

Paired-end short reads (typically 100-150 bp)
Read pairs capture chromatin interactions
R1 and R2 often map to different scaffolds/chromosomes
Requires careful proper pair handling
Used for scaffolding and 3D genome structure

Best Mappers:

BWA-MEM2 (paired-end mode)
BWA-MEM (paired-end mode)

Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!

Typical Use Cases:

Genome scaffolding (connecting contigs)
3D chromatin structure analysis
Haplotype phasing
Assembly quality assessment

Illumina Short Reads

Characteristics:

Short reads: 50-300 bp
Paired-end or single-end
High throughput
Well-established quality scores

Best Mappers:

BWA-MEM2, BWA-MEM (general purpose)
Bowtie2 (fast, local alignment)
STAR (RNA-seq spliced alignment)

Common Tools and Their Behaviors

samtools view

Purpose: Filter, convert, and view SAM/BAM files

Key Flags:

-b: Output BAM format
-h: Include header
-f INT: Require flags (keep reads WITH these flags)
-F INT: Filter flags (remove reads WITH these flags)
-q INT: Minimum MAPQ threshold
-L FILE: Keep reads overlapping regions in BED file

Important Behavior:

-L (region filtering) checks each read individually, not pairs
Can break read pairs if mates map to different regions
Flag filters (-f, -F) are applied before region filters (-L)

Example - Filter for proper pairs:

samtools view -b -f 2 input.bam > proper_pairs.bam

Example - Filter by region (may break pairs):

samtools view -b -L regions.bed input.bam > filtered.bam

Example - Proper pairs in regions (correct order):

samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam

bamtools filter

Purpose: Advanced filtering with complex criteria

Key Features:

Can filter on multiple properties simultaneously
More strict about pair validation than samtools
Supports JSON filter rules

Common Filters:

isPaired: true - Read is from paired-end sequencing
isProperPair: true - Read is part of proper pair
isMapped: true - Read is mapped
mapQuality: >=30 - Mapping quality threshold

Important Difference from samtools:

isProperPair is more strict than samtools -f 2
Checks pair validity more thoroughly
Better for ensuring R1/R2 match correctly

samtools fastx

Purpose: Convert SAM/BAM to FASTQ/FASTA

Output Modes:

outputs: ["r1", "r2"] - Separate forward and reverse for paired-end
outputs: ["other"] - Single output for single-end data
outputs: ["r0"] - All reads (mixed paired/unpaired)

Filtering Options:

inclusive_filter: ["2"] - Require proper pair flag
exclusive_filter: ["4", "8"] - Exclude unmapped or mate unmapped
exclusive_filter_all: ["8"] - Exclude if mate unmapped

Critical: Use appropriate filters to ensure R1/R2 files match!

Common Patterns and Best Practices

Pattern 1: Filtering Paired-End Data by Regions

WRONG WAY (breaks pairs):

# Region filter first → breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair
# Result: Empty output (all pairs broken)

RIGHT WAY (preserves pairs):

# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
# Result: Pairs where both mates are in regions (or one mate in region, other anywhere)

BEST WAY (both mates in regions):

# Filter for proper pairs, then use paired-aware region filtering
samtools view -b -f 2 input.bam | \
  # Custom script to keep pairs where both mates in regions

Pattern 2: Extracting FASTQ from Filtered BAM

For Paired-End:

# Ensure proper pairs before extraction
samtools fastx -1 R1.fq.gz -2 R2.fq.gz \
  --i1-flags 2 \  # Require proper pair
  --i2-flags 64,128 \  # Separate R1/R2
  input.bam

For Single-End:

# Simple extraction
samtools fastx -0 output.fq.gz input.bam

Pattern 3: Quality Filtering

Conservative (high quality):

samtools view -b -q 30 -f 2 -F 256 -F 2048 input.bam
# MAPQ >= 30, proper pairs, no secondary/supplementary

Permissive (for low-coverage data):

samtools view -b -q 10 -F 4 input.bam
# MAPQ >= 10, mapped reads

Common Issues and Solutions

Issue 1: Empty Output After Region Filtering (Hi-C Data)

Symptom:

BAM file non-empty before filtering
Empty after region filtering + proper pair filtering
Happens with paired-end data (especially Hi-C)

Cause:

Region filter (samtools view -L) breaks read pairs
One mate in region, other mate outside region
Proper pair flag (0x2) is lost
Subsequent isProperPair filter removes all reads

Solution:

# Apply proper pair filter BEFORE region filtering
samtools view -b -f 2 -L regions.bed input.bam > output.bam

See Also: common-issues.md for detailed troubleshooting

Issue 2: R1 and R2 Files Have Different Read Counts

Symptom:

Forward and reverse FASTQ files have different numbers of reads
Downstream tools fail expecting matched pairs

Cause:

Improper filtering broke some pairs
One mate filtered out, other kept
Extraction didn't require proper pairing

Solution:

# Require proper pairs during extraction
samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam

Issue 3: Low Mapping Rate for Hi-C Data

Symptom:

Many Hi-C reads unmapped or low MAPQ
Expected for Hi-C due to chimeric reads

Not Actually a Problem:

Hi-C involves ligation of distant DNA fragments
Creates chimeric molecules
Mappers may mark these as low quality or unmapped
This is normal for Hi-C data

Solution:

Use Hi-C-specific pipelines (e.g., HiC-Pro, Juicer)
Don't filter too aggressively on MAPQ
Accept lower mapping rates than DNA-seq

Issue 4: Proper Pairs Lost After Mapping

Symptom:

Few reads marked as proper pairs (flag 0x2)
Expected paired-end data

Possible Causes:

Insert size distribution wrong (check aligner parameters)
Reference mismatch (reads from different assembly)
Poor library quality
Incorrect orientation flags passed to aligner

Solution:

# Check insert size distribution
samtools stats input.bam | grep "insert size"

# Check pairing flags
samtools flagstat input.bam

Quality Metrics

N50 and Related Metrics

N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer

How to interpret:

Higher N50 = better contiguity
Compare to expected chromosome/scaffold sizes
Use with caution - can be misleading for fragmented assemblies

Related Metrics:

L50: Number of contigs needed to reach N50
N90: More stringent than N50 (90% coverage)
NG50: N50 relative to genome size (better for comparisons)

Coverage and Depth

Coverage: Percentage of reference bases covered by at least one read Depth: Average number of reads covering each base

Recommended Depths:

Genome assembly (HiFi): 30-50x
Variant calling: 30x minimum
RNA-seq: 20-40 million reads
Hi-C scaffolding: 50-100x genomic coverage

File Format Quick Reference

FASTA

>sequence_id description
ATCGATCGATCG
ATCGATCG

Header line starts with >
Can span multiple lines
No quality scores

FASTQ

@read_id
ATCGATCGATCG
+
IIIIIIIIIIII

Four lines per read
Quality scores (Phred+33 encoding typical)
Can be gzipped (.fastq.gz)

BED

chr1    1000    2000    feature_name    score    +

0-based coordinates
Used for regions, features, intervals
Minimum 3 columns (chrom, start, end)

AGP

chr1    1    5000    1    W    contig_1    1    5000    +
chr1    5001 5100    2    U    100    scaffold    yes    proximity_ligation

Tab-delimited genome assembly format
1-based closed coordinates [start, end]
Describes construction of objects from components
Object and component lengths must match
See AGP Format section for complete specification

Best Practices

General

Always check data type: Paired-end vs single-end determines filtering strategy
Understand your sequencing technology: Hi-C behaves differently than HiFi
Filter in the right order: Proper pairs BEFORE region filtering
Validate outputs: Check file sizes, read counts, flagstat
Use appropriate MAPQ thresholds: Too stringent = lost data, too permissive = noise

For Hi-C Data

Expect distant read pairs: Don't be surprised by different scaffolds
Preserve proper pairs: Critical for downstream scaffolding
Use paired-aware tools: Standard filters may break pairs
Don't over-filter on MAPQ: Hi-C often has lower MAPQ than DNA-seq

For HiFi Data

Single-end processing: No pair concerns
High quality expected: Can use strict filters
Use appropriate presets: minimap2 map-hifi or map-pb
Consider read length distribution: HiFi reads vary in length

For Tool Testing

Create self-contained datasets: Both mates in selected region
Maintain proper pairs: Essential for realistic testing
Use representative data: Subsample proportionally, not randomly
Verify file sizes: Too small = overly filtered

Genome Ark Data Retrieval

Overview

Genome Ark (s3://genomeark/) is a public AWS S3 bucket containing VGP assemblies and QC data. Access requires no credentials using --no-sign-request.

Directory Structure

s3://genomeark/species/{SPECIES_NAME}/{TOLID}/
├── assembly_vgp_HiC_2.0/          # Standard HiC assembly
├── assembly_vgp_standard_2.1/     # Standard assembly
├── assembly_vgp_trio_1.0/         # Trio assembly
├── assembly_curated/              # Manually curated
├── assembly_{METHOD}_{VERSION}/   # Method-specific (e.g., assembly_verkko_1.4)
└── genomic_data/                  # Raw sequencing data

GenomeScope Data Locations

GenomeScope summaries can be in multiple locations - search comprehensively:

Common paths (priority order):

{assembly}/evaluation/genomescope/{TOLID}_genomescope__Summary.txt
{assembly}/evaluation/genomescope2/{TOLID}_genomescope__Summary.txt
{assembly}/qc/genomescope/{TOLID}_genomescope__Summary.txt
{assembly}/intermediates/genomescope/{TOLID}_genomescope__Summary.txt

Search strategy:

Dynamically discover assembly folders (not all species use same structure)
Search multiple subfolder locations per assembly
Try standard assemblies first (HiC_2.0, standard_2.1)
Fall back to method-specific assemblies

Python Implementation Pattern

def search_genomescope_recursive(species_name, tolid):
    """Search for GenomeScope data across all assembly folders."""
    species_s3 = species_name.replace(' ', '_')

    # 1. Discover assembly folders
    result = subprocess.run(
        ['aws', 's3', 'ls', f's3://genomeark/species/{species_s3}/{tolid}/',
         '--no-sign-request'],
        capture_output=True, text=True
    )

    assembly_folders = []
    for line in result.stdout.strip().split('\n'):
        if 'PRE' in line:
            folder = line.split('PRE')[1].strip().rstrip('/')
            if folder.startswith('assembly'):
                assembly_folders.append(folder)

    # 2. Search patterns
    patterns = [
        'evaluation/genomescope/{tolid}_genomescope__Summary.txt',
        'evaluation/genomescope2/{tolid}_genomescope__Summary.txt',
        'qc/genomescope/{tolid}_genomescope__Summary.txt',
        'intermediates/genomescope/{tolid}_genomescope__Summary.txt',
    ]

    # 3. Try each combination
    for assembly in assembly_folders:
        for pattern in patterns:
            s3_path = f's3://genomeark/species/{species_s3}/{tolid}/{assembly}/{pattern.format(tolid=tolid)}'
            result = subprocess.run(
                ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request'],
                capture_output=True, text=True, timeout=10
            )
            if result.returncode == 0 and result.stdout:
                return result.stdout

    return None

GenomeScope Summary Parsing

def parse_genomescope_summary(content):
    """Extract genome characteristics from GenomeScope2 summary."""
    data = {}

    # Genome Haploid Length (max value - second column)
    match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['genome_size_genomescope'] = int(match.group(1).replace(',', ''))

    # Heterozygosity percentage (max value)
    match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
    if match:
        data['heterozygosity_percent'] = float(match.group(1))

    # Genome Unique Length (max value)
    match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['unique_length'] = int(match.group(1).replace(',', ''))

    # Genome Repeat Length (max value)
    match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['repeat_length'] = int(match.group(1).replace(',', ''))

    # Calculate repeat content percentage
    if 'repeat_length' in data and 'unique_length' in data:
        total = data['repeat_length'] + data['unique_length']
        if total > 0:
            data['repeat_content_percent'] = (data['repeat_length'] / total) * 100

    return data

Best Practices

Rate limiting: Add 0.2s delay between successful fetches to be respectful to AWS
Caching: Check if file exists locally before fetching
Timeout: Use 10-15s timeout per request
Assembly discovery: Always discover assemblies dynamically - don't assume structure
Multiple locations: GenomeScope data may be in evaluation/, qc/, or intermediates/
Expected coverage: ~15-20% of VGP assemblies have GenomeScope data available

Common Issues

Species not found: Some species use different naming (check exact spacing/underscores)

# Search for species
aws s3 ls s3://genomeark/species/ --no-sign-request | grep -i "species_name"

Assembly folder variations: Not all use standard names

# List all assemblies for a species
aws s3 ls s3://genomeark/species/{SPECIES}/{TOLID}/ --no-sign-request

AWS CLI Setup

# Install AWS CLI (no credentials needed for Genome Ark)
conda install -c conda-forge awscli

# Test access
aws s3 ls s3://genomeark/species/ --no-sign-request | head

Karyotype Data Curation and Literature Search

Overview

Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required.

Search Strategy

Effective Search Terms

"{species_name} karyotype chromosome 2n"
"{species_name} diploid number karyotype"
"{genus} karyotype evolution"
"cytogenetic analysis {family_name}"
"{species_name} chromosome number diploid"

Best Reference Sources

PubMed/PMC: Primary cytogenetic studies
ResearchGate: Karyotype descriptions and figures
Specialized databases:
- Bird Chromosome Database: https://sites.unipampa.edu.br/birdchromosomedatabase/
- Animal Genome Size Database: http://www.genomesize.com/
Genome assembly papers: Often mention expected karyotype
Comparative cytogenetic studies: Family-level analyses

Search Time Estimates

Model organisms, domestic species: 2-3 minutes
Well-studied taxonomic groups: 5-10 minutes
Rare/uncommon species: 10-20 minutes or not found

Taxonomic Conservation Patterns

Mammals

Cetaceans: Highly conserved 2n = 44, n = 22 (exceptions: pygmy sperm whale, right whale, beaked whales = 2n = 42)
Felidae: Conserved 2n = 38, n = 19
Canidae: Conserved 2n = 78, n = 39
Primates: Variable (great apes 2n = 48, macaques 2n = 42, marmosets 2n = 46)

Birds

Anatidae (waterfowl): Highly conserved 2n = 80, n = 40 across ducks, geese, swans
Galliformes (game birds): Typically 2n = 78, n = 39 (chicken, quail, grouse)
Passerines: Variable 2n = 78-82, most common 2n = 80
Ancestral avian karyotype: Putative 2n = 80
General pattern: 50.7% of birds have 2n = 78-82; 21.7% have exactly 2n = 80

Reptiles

Lacertidae (wall lizards): Often 2n = 38, n = 19

Genome Assembly Interpretation

⚠️ Warning: Chromosome-level assemblies often report fewer chromosomes than actual diploid number

Why: Assemblies typically capture only:

Macrochromosomes (large chromosomes)
Larger microchromosomes
Small microchromosomes remain unassembled

Example: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes"

True karyotype: 10 macro pairs + 30 micro pairs = 80
Assembly: ~34-42 scaffolds (only macro + larger micros)

Using Conservation for Inference

When specific karyotype data is unavailable but genus/family patterns are strong:

High confidence inference (acceptable for publication):
- Multiple congeneric species confirmed
- Family-level conservation documented
- No known exceptions in genus

Document inference clearly:

accession,taxid,species,2n,n,notes,reference
GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url

Priority for direct confirmation:
- Species with conservation exceptions
- Type specimens or reference species
- Phylogenetically divergent lineages

VGP-Specific: Sex Chromosome Adjustment

When both sex chromosomes are in main haplotype (common in VGP assemblies):

Expected scaffolds = n + 1 (not n)
Reason: X+Y or Z+W = two distinct chromosomes
Check: VGP metadata column "Sex chromosomes main haplotype"
Patterns: "Has X and Y", "Has Z and W", "Has X1, X2, and Y"

Data Recording Format

CSV Structure:

accession,taxid,species_name,diploid_2n,haploid_n,notes,reference
GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/...

Notes field examples:

"Standard {family} karyotype"
"Conserved {genus} karyotype"
"Inferred from {family} conservation"
"Unusual karyotype for family"
"Geographic variation reported"

Prioritization for Literature Searches

TIER 1 (>90% success rate):

Model organisms (zebrafish, mouse, medaka)
Domestic species (chicken, goat, sheep)
Game animals (waterfowl, deer)
Laboratory species (fruit fly, nematode)

TIER 2 (70-90% success rate):

Well-studied taxonomic groups (Podarcis lizards, corvids)
Conservation focus species (raptors, large mammals)
Commercial species (salmonids, oysters)

TIER 3 (50-70% success rate):

Common but not economically important
Widespread distribution
Recent phylogenetic interest

Low priority (<50% success rate):

Deep-sea species
Rare/endangered without conservation genetics
Recently described species
Cryptic species complexes

AGP Format (A Golden Path)

Overview

AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI.

When to Use This Knowledge

Processing genome assemblies for submission to databases
Curating chromosome-level assemblies
Splitting haplotype assemblies
Assigning unlocalized scaffolds (unlocs)
Debugging AGP validation errors
Converting between assembly representations

AGP Format Structure

Tab-delimited format with 9 columns for sequence lines (type W) or 8+ columns for gap lines (type U/N)

Sequence Lines (Column 5 = 'W'):

object  obj_beg  obj_end  part_num  W  component_id  comp_beg  comp_end  orientation

Gap Lines (Column 5 = 'U' or 'N'):

object  obj_beg  obj_end  part_num  gap_type  gap_length  gap_type  linkage  linkage_evidence

Critical Coordinate Rules

Rule 1: Object and Component Lengths MUST Match

For sequence lines, the span in the object MUST equal the span in the component:

(obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)

Example - CORRECT:

Scaffold_47_unloc_1  1  54360  1  W  scaffold_23.hap1  19274039  19328398  -
# Object length: 54360 - 1 + 1 = 54,360 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓

Example - INCORRECT:

Scaffold_47_unloc_1  1  19328398  1  W  scaffold_23.hap1  19274039  19328398  -
# Object length: 19328398 - 1 + 1 = 19,328,398 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗
# ERROR: Lengths don't match!

Rule 2: Component Numbering Restarts for New Objects

Each object (column 1) has its own component numbering (column 4) starting at 1:

Scaffold_10         1  30578279  1  W  scaffold_4.hap2   1  30578279  -
Scaffold_10_unloc_1 1  65764     1  W  scaffold_74.hap2  1  65764     +  # ← Starts at 1, not 3!

Rule 3: Sequential Component Numbering Within Objects

Component numbers increment sequentially (gaps and sequences both count):

Scaffold_2  1        1731008   1  W  scaffold_25.hap1  1  1731008  -
Scaffold_2  1731009  1731108   2  U  100  scaffold  yes  proximity_ligation
Scaffold_2  1731109  1956041   3  W  scaffold_70.hap1  1  224933   -

Common AGP Processing Issues

Issue 1: Incorrect Object Coordinates When Creating Unlocs

Symptom:

ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398)
do not have the same length

Cause: When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the length of the extracted region, not the original component end coordinate.

Wrong Approach:

# Setting object end to component end coordinate
agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end']  # ✗ WRONG

Correct Approach:

# Calculate actual length from component coordinates
agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1  # ✓ CORRECT

Issue 2: Component Numbering Not Reset for New Objects

Symptom: Unloc scaffolds have component numbers > 1 when they should start at 1.

Cause: When creating a new object (unloc scaffold), component numbering wasn't reset.

Solution:

# When creating unlocs, reset component number
agp_df.loc[index, '#_scaffs'] = 1  # Column 4: component number

Issue 3: AGPcorrect Accumulating Coordinates

Symptom: Unloc sequences inherit cumulative coordinates from parent scaffolds.

Cause: AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans.

Solution: Always recalculate object coordinates from component spans when creating new objects (unlocs).

AGP Processing Best Practices

1. Coordinate System Understanding

Object coordinates (columns 2-3): Position within the assembled object (1-based, inclusive)
Component coordinates (columns 7-8): Position within the source sequence (1-based, inclusive)
Both use 1-based closed intervals [start, end]
Length calculation: end - start + 1

2. Creating Unlocalized Sequences (Unlocs)

# When extracting a region to create an unloc:
# 1. Calculate the actual length of the region
length = int(comp_end) - int(comp_start) + 1

# 2. Set object coordinates for the new unloc
obj_start = 1  # Always starts at 1
obj_end = length  # Equals the length

# 3. Reset component number
component_num = 1  # New object, new numbering

# 4. Rename the object
new_object_name = f"{parent_scaffold}_unloc_{unloc_number}"

3. Validating AGP Files

Use NCBI's AGP validator:

agp_validate assembly.agp

Common validation checks:

Object/component length match
Sequential component numbering
No coordinate overlaps
Gap specifications valid
Orientation values (+, -, ?, 0, na)

4. Handling Haplotype-Split Assemblies

When splitting diploid assemblies into haplotypes:

Identify haplotype markers in sequence names (H1/hap1, H2/hap2)
Maintain proper pairing information
Process unlocs separately per haplotype
Remove haplotig duplications
Track gaps appropriately (especially proximity ligation gaps)

AGP Coordinate Debugging Pattern

When encountering coordinate errors:

# For each AGP line, verify:
obj_length = int(obj_end) - int(obj_beg) + 1
comp_length = int(comp_end) - int(comp_beg) + 1

assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}"

# For sequential component numbers:
assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}"

AGP File Structure by Assembly Stage

1. Raw Assembly AGP:

Direct representation from assembler
May have incorrect sequence lengths
Needs coordinate correction (AGPcorrect)

2. Corrected AGP:

Sequence lengths match actual FASTA
Coordinates adjusted for length discrepancies
Ready for haplotype splitting

3. Haplotype-Split AGP:

Separate files per haplotype
Unlocs identified but not separated
Haplotigs marked but not removed

4. Final Curated AGP:

Unlocs separated into individual objects
Haplotigs removed to separate file
Proximity ligation gaps cleaned
Ready for database submission

BED File Processing and Telomere Analysis

Pattern: Classifying Scaffolds by Telomere Types

When analyzing telomere data from BED files to classify scaffolds:

File Structure:

Terminal telomeres BED: columns include scaffold, start, end, orientation (p/q), accession
Interstitial telomeres BED: similar structure with position markers (p/q/u for internal)

Best Practice - Use Python CSV Module:

import csv
from collections import defaultdict

# Use defaultdict for automatic initialization
telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0})

# Process with csv.reader (more portable than pandas)
with open('telomeres.bed', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scaffold = row[0]
        accession = row[10]  # GCA accession
        key = (accession, scaffold)
        telomere_counts[key]['terminal'] += 1

Why CSV over pandas:

No external dependencies (pandas may not be installed)
Faster for simple tabular operations
Lower memory footprint for large files
Better portability across environments

Classification Categories:

Category 1: 2 terminal telomeres, 0 interstitial (complete chromosomes)
Category 2: 1 terminal telomere, 0 interstitial (partial)
Category 3: Has interstitial telomeres (likely assembly issues)

NCBI Data Integration Strategies

Check Existing Data Sources Before API Calls

Problem: Need chromosome counts for 400+ assemblies from NCBI.

Anti-pattern: Query NCBI datasets API for each accession

# DON'T: Query 400+ times
for accession in missing_data:
    result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession])
    # Takes 10+ minutes, hits API rate limits

Better Pattern: Check if data already exists in compiled tables

# DO: Look for existing compiled data first
# VGP table has multiple chromosome count columns:
# - num_chromosomes (column 54)
# - total_number_of_chromosomes (column 106)
# - num_chromosomes_haploid (column 122)

# Read from existing comprehensive table
with open('VGP-table.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        num_chr = row[53] if row[53] else row[105]  # Fallback strategy

Results: Filled 392/417 missing values instantly vs 10+ minutes of API calls.

Fallback Strategy for Multiple Columns:

# Try multiple sources in order of preference
num_chromosomes = row[53] if (len(row) > 53 and row[53]) else ''
if not num_chromosomes and len(row) > 105:
    num_chromosomes = row[105]  # Alternative column

When to use NCBI API:

Data not in existing tables
Need real-time/latest data
Fetching assembly reports or sequence data
Small number of queries (<20)

API Best Practices (when necessary):

Use full path to datasets command (may be aliased)
Add delays between calls (time.sleep(0.5))
Set reasonable timeouts
Handle errors gracefully

Related Skills

vgp-pipeline - VGP workflows process Hi-C and HiFi data
galaxy-tool-wrapping - Galaxy tools work with SAM/BAM and sequencing data formats
galaxy-workflow-development - Workflows process sequencing data

Supporting Documentation

reference.md: Detailed format specifications and tool documentation
common-issues.md: Comprehensive troubleshooting guide with examples

Version History

v1.1.1: Added BED file processing patterns for telomere analysis and NCBI data integration strategies
v1.1.0: Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns
v1.0.0: Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns

bioinformatics-fundamentals

Bioinformatics Fundamentals

When to Use This Skill

SAM/BAM Format Essentials

SAM Flags (Bitwise)

Proper Pair Flag (0x0002)

MAPQ (Mapping Quality)

CIGAR String

Sequencing Technologies

PacBio HiFi (High Fidelity)

Hi-C (Chromatin Conformation Capture)

Illumina Short Reads

Common Tools and Their Behaviors

samtools view

bamtools filter

samtools fastx

Common Patterns and Best Practices

Pattern 1: Filtering Paired-End Data by Regions

Pattern 2: Extracting FASTQ from Filtered BAM

Pattern 3: Quality Filtering

Common Issues and Solutions

Issue 1: Empty Output After Region Filtering (Hi-C Data)

Issue 2: R1 and R2 Files Have Different Read Counts

Issue 3: Low Mapping Rate for Hi-C Data

Issue 4: Proper Pairs Lost After Mapping

Quality Metrics

N50 and Related Metrics

Coverage and Depth

File Format Quick Reference

FASTA

FASTQ

BED

AGP

Best Practices

General

For Hi-C Data

For HiFi Data

For Tool Testing

Genome Ark Data Retrieval

Overview

Directory Structure

GenomeScope Data Locations

Python Implementation Pattern

GenomeScope Summary Parsing

Best Practices

Common Issues

AWS CLI Setup

Karyotype Data Curation and Literature Search

Overview

Search Strategy

Effective Search Terms

Best Reference Sources

Search Time Estimates

Taxonomic Conservation Patterns

Mammals

Birds

Reptiles

Genome Assembly Interpretation

Using Conservation for Inference

VGP-Specific: Sex Chromosome Adjustment

Data Recording Format

Prioritization for Literature Searches

AGP Format (A Golden Path)

Overview

When to Use This Knowledge

AGP Format Structure

Critical Coordinate Rules

Common AGP Processing Issues

Issue 1: Incorrect Object Coordinates When Creating Unlocs

Issue 2: Component Numbering Not Reset for New Objects

Issue 3: AGPcorrect Accumulating Coordinates

AGP Processing Best Practices

1. Coordinate System Understanding

2. Creating Unlocalized Sequences (Unlocs)

3. Validating AGP Files

4. Handling Haplotype-Split Assemblies

AGP Coordinate Debugging Pattern

AGP File Structure by Assembly Stage

BED File Processing and Telomere Analysis

Pattern: Classifying Scaffolds by Telomere Types