bioinformatics-fundamentals
Bioinformatics Fundamentals
Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.
When to Use This Skill
- Working with sequencing data (PacBio HiFi, Hi-C, Illumina)
- Debugging SAM/BAM alignment or filtering issues
- Processing AGP files for genome assembly curation
- Validating AGP coordinate systems and unloc assignments
- Understanding paired-end vs single-end data
- Interpreting quality metrics (MAPQ, PHRED scores)
- Troubleshooting empty outputs or broken read pairs
- General bioinformatics data analysis
SAM/BAM Format Essentials
SAM Flags (Bitwise)
Flags are additive - a read can have multiple flags set simultaneously.
Common Flags:
0x0001(1): Read is paired in sequencing0x0002(2): Each segment properly aligned (proper pair)0x0004(4): Read unmapped0x0008(8): Mate unmapped0x0010(16): Read mapped to reverse strand0x0020(32): Mate mapped to reverse strand0x0040(64): First in pair (R1/forward)0x0080(128): Second in pair (R2/reverse)0x0100(256): Secondary alignment0x0400(1024): PCR or optical duplicate0x0800(2048): Supplementary alignment
Flag Combinations:
- Properly paired R1:
99(0x63 = 1 + 2 + 32 + 64) - Properly paired R2:
147(0x93 = 1 + 2 + 16 + 128) - Unmapped read:
4 - Mate unmapped:
8
Proper Pair Flag (0x0002)
What "proper pair" means:
- Both R1 and R2 are mapped
- Mapping orientations are correct (typically R1 forward, R2 reverse)
- Insert size is reasonable for the library
- Pair conforms to aligner's expectations
Important: Different aligners have different criteria for proper pairs!
MAPQ (Mapping Quality)
Formula: MAPQ = -10 * log10(P(mapping is wrong))
Common Thresholds:
MAPQ >= 60: High confidence (error probability < 0.0001%)MAPQ >= 30: Good quality (error probability < 0.1%)MAPQ >= 20: Acceptable (error probability < 1%)MAPQ >= 10: Low confidence (error probability < 10%)MAPQ = 0: Multi-mapper or unmapped
Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.
CIGAR String
Represents alignment between read and reference:
M: Match or mismatch (alignment match)I: Insertion in read vs referenceD: Deletion in read vs referenceS: Soft clipping (bases in read not aligned)H: Hard clipping (bases not in read sequence)N: Skipped region (for RNA-seq splicing)
Example: 100M = perfect 100bp match
Example: 50M5I45M = 50bp match, 5bp insertion, 45bp match
Sequencing Technologies
PacBio HiFi (High Fidelity)
Characteristics:
- Long reads: 10-25 kb typical
- High accuracy: >99.9% (Q20+)
- Circular Consensus Sequencing (CCS)
- Single-end data (though from circular molecules)
- Excellent for de novo assembly
Best Mappers:
- minimap2 presets:
map-pb,map-hifi - BWA-MEM2 can work but optimized for short reads
Typical Use Cases:
- De novo genome assembly
- Structural variant detection
- Isoform sequencing (Iso-Seq)
- Haplotype phasing
Hi-C (Chromatin Conformation Capture)
Characteristics:
- Paired-end short reads (typically 100-150 bp)
- Read pairs capture chromatin interactions
- R1 and R2 often map to different scaffolds/chromosomes
- Requires careful proper pair handling
- Used for scaffolding and 3D genome structure
Best Mappers:
- BWA-MEM2 (paired-end mode)
- BWA-MEM (paired-end mode)
Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!
Typical Use Cases:
- Genome scaffolding (connecting contigs)
- 3D chromatin structure analysis
- Haplotype phasing
- Assembly quality assessment
Illumina Short Reads
Characteristics:
- Short reads: 50-300 bp
- Paired-end or single-end
- High throughput
- Well-established quality scores
Best Mappers:
- BWA-MEM2, BWA-MEM (general purpose)
- Bowtie2 (fast, local alignment)
- STAR (RNA-seq spliced alignment)
Common Tools and Their Behaviors
samtools view
Purpose: Filter, convert, and view SAM/BAM files
Key Flags:
-b: Output BAM format-h: Include header-f INT: Require flags (keep reads WITH these flags)-F INT: Filter flags (remove reads WITH these flags)-q INT: Minimum MAPQ threshold-L FILE: Keep reads overlapping regions in BED file
Important Behavior:
-L(region filtering) checks each read individually, not pairs- Can break read pairs if mates map to different regions
- Flag filters (
-f,-F) are applied before region filters (-L)
Example - Filter for proper pairs:
samtools view -b -f 2 input.bam > proper_pairs.bam
Example - Filter by region (may break pairs):
samtools view -b -L regions.bed input.bam > filtered.bam
Example - Proper pairs in regions (correct order):
samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam
bamtools filter
Purpose: Advanced filtering with complex criteria
Key Features:
- Can filter on multiple properties simultaneously
- More strict about pair validation than samtools
- Supports JSON filter rules
Common Filters:
isPaired: true- Read is from paired-end sequencingisProperPair: true- Read is part of proper pairisMapped: true- Read is mappedmapQuality: >=30- Mapping quality threshold
Important Difference from samtools:
isProperPairis more strict than samtools-f 2- Checks pair validity more thoroughly
- Better for ensuring R1/R2 match correctly
samtools fastx
Purpose: Convert SAM/BAM to FASTQ/FASTA
Output Modes:
outputs: ["r1", "r2"]- Separate forward and reverse for paired-endoutputs: ["other"]- Single output for single-end dataoutputs: ["r0"]- All reads (mixed paired/unpaired)
Filtering Options:
inclusive_filter: ["2"]- Require proper pair flagexclusive_filter: ["4", "8"]- Exclude unmapped or mate unmappedexclusive_filter_all: ["8"]- Exclude if mate unmapped
Critical: Use appropriate filters to ensure R1/R2 files match!
Common Patterns and Best Practices
Pattern 1: Filtering Paired-End Data by Regions
WRONG WAY (breaks pairs):
# Region filter first → breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair
# Result: Empty output (all pairs broken)
RIGHT WAY (preserves pairs):
# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
# Result: Pairs where both mates are in regions (or one mate in region, other anywhere)
BEST WAY (both mates in regions):
# Filter for proper pairs, then use paired-aware region filtering
samtools view -b -f 2 input.bam | \
# Custom script to keep pairs where both mates in regions
Pattern 2: Extracting FASTQ from Filtered BAM
For Paired-End:
# Ensure proper pairs before extraction
samtools fastx -1 R1.fq.gz -2 R2.fq.gz \
--i1-flags 2 \ # Require proper pair
--i2-flags 64,128 \ # Separate R1/R2
input.bam
For Single-End:
# Simple extraction
samtools fastx -0 output.fq.gz input.bam
Pattern 3: Quality Filtering
Conservative (high quality):
samtools view -b -q 30 -f 2 -F 256 -F 2048 input.bam
# MAPQ >= 30, proper pairs, no secondary/supplementary
Permissive (for low-coverage data):
samtools view -b -q 10 -F 4 input.bam
# MAPQ >= 10, mapped reads
Common Issues and Solutions
Issue 1: Empty Output After Region Filtering (Hi-C Data)
Symptom:
- BAM file non-empty before filtering
- Empty after region filtering + proper pair filtering
- Happens with paired-end data (especially Hi-C)
Cause:
- Region filter (
samtools view -L) breaks read pairs - One mate in region, other mate outside region
- Proper pair flag (0x2) is lost
- Subsequent
isProperPairfilter removes all reads
Solution:
# Apply proper pair filter BEFORE region filtering
samtools view -b -f 2 -L regions.bed input.bam > output.bam
See Also: common-issues.md for detailed troubleshooting
Issue 2: R1 and R2 Files Have Different Read Counts
Symptom:
- Forward and reverse FASTQ files have different numbers of reads
- Downstream tools fail expecting matched pairs
Cause:
- Improper filtering broke some pairs
- One mate filtered out, other kept
- Extraction didn't require proper pairing
Solution:
# Require proper pairs during extraction
samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam
Issue 3: Low Mapping Rate for Hi-C Data
Symptom:
- Many Hi-C reads unmapped or low MAPQ
- Expected for Hi-C due to chimeric reads
Not Actually a Problem:
- Hi-C involves ligation of distant DNA fragments
- Creates chimeric molecules
- Mappers may mark these as low quality or unmapped
- This is normal for Hi-C data
Solution:
- Use Hi-C-specific pipelines (e.g., HiC-Pro, Juicer)
- Don't filter too aggressively on MAPQ
- Accept lower mapping rates than DNA-seq
Issue 4: Proper Pairs Lost After Mapping
Symptom:
- Few reads marked as proper pairs (flag 0x2)
- Expected paired-end data
Possible Causes:
- Insert size distribution wrong (check aligner parameters)
- Reference mismatch (reads from different assembly)
- Poor library quality
- Incorrect orientation flags passed to aligner
Solution:
# Check insert size distribution
samtools stats input.bam | grep "insert size"
# Check pairing flags
samtools flagstat input.bam
Quality Metrics
N50 and Related Metrics
N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer
How to interpret:
- Higher N50 = better contiguity
- Compare to expected chromosome/scaffold sizes
- Use with caution - can be misleading for fragmented assemblies
Related Metrics:
- L50: Number of contigs needed to reach N50
- N90: More stringent than N50 (90% coverage)
- NG50: N50 relative to genome size (better for comparisons)
Coverage and Depth
Coverage: Percentage of reference bases covered by at least one read Depth: Average number of reads covering each base
Recommended Depths:
- Genome assembly (HiFi): 30-50x
- Variant calling: 30x minimum
- RNA-seq: 20-40 million reads
- Hi-C scaffolding: 50-100x genomic coverage
File Format Quick Reference
FASTA
>sequence_id description
ATCGATCGATCG
ATCGATCG
- Header line starts with
> - Can span multiple lines
- No quality scores
FASTQ
@read_id
ATCGATCGATCG
+
IIIIIIIIIIII
- Four lines per read
- Quality scores (Phred+33 encoding typical)
- Can be gzipped (.fastq.gz)
BED
chr1 1000 2000 feature_name score +
- 0-based coordinates
- Used for regions, features, intervals
- Minimum 3 columns (chrom, start, end)
AGP
chr1 1 5000 1 W contig_1 1 5000 +
chr1 5001 5100 2 U 100 scaffold yes proximity_ligation
- Tab-delimited genome assembly format
- 1-based closed coordinates [start, end]
- Describes construction of objects from components
- Object and component lengths must match
- See AGP Format section for complete specification
Best Practices
General
- Always check data type: Paired-end vs single-end determines filtering strategy
- Understand your sequencing technology: Hi-C behaves differently than HiFi
- Filter in the right order: Proper pairs BEFORE region filtering
- Validate outputs: Check file sizes, read counts, flagstat
- Use appropriate MAPQ thresholds: Too stringent = lost data, too permissive = noise
For Hi-C Data
- Expect distant read pairs: Don't be surprised by different scaffolds
- Preserve proper pairs: Critical for downstream scaffolding
- Use paired-aware tools: Standard filters may break pairs
- Don't over-filter on MAPQ: Hi-C often has lower MAPQ than DNA-seq
For HiFi Data
- Single-end processing: No pair concerns
- High quality expected: Can use strict filters
- Use appropriate presets: minimap2
map-hifiormap-pb - Consider read length distribution: HiFi reads vary in length
For Tool Testing
- Create self-contained datasets: Both mates in selected region
- Maintain proper pairs: Essential for realistic testing
- Use representative data: Subsample proportionally, not randomly
- Verify file sizes: Too small = overly filtered
Genome Ark Data Retrieval
Overview
Genome Ark (s3://genomeark/) is a public AWS S3 bucket containing VGP assemblies and QC data. Access requires no credentials using --no-sign-request.
Directory Structure
s3://genomeark/species/{SPECIES_NAME}/{TOLID}/
├── assembly_vgp_HiC_2.0/ # Standard HiC assembly
├── assembly_vgp_standard_2.1/ # Standard assembly
├── assembly_vgp_trio_1.0/ # Trio assembly
├── assembly_curated/ # Manually curated
├── assembly_{METHOD}_{VERSION}/ # Method-specific (e.g., assembly_verkko_1.4)
└── genomic_data/ # Raw sequencing data
GenomeScope Data Locations
GenomeScope summaries can be in multiple locations - search comprehensively:
Common paths (priority order):
{assembly}/evaluation/genomescope/{TOLID}_genomescope__Summary.txt{assembly}/evaluation/genomescope2/{TOLID}_genomescope__Summary.txt{assembly}/qc/genomescope/{TOLID}_genomescope__Summary.txt{assembly}/intermediates/genomescope/{TOLID}_genomescope__Summary.txt
Search strategy:
- Dynamically discover assembly folders (not all species use same structure)
- Search multiple subfolder locations per assembly
- Try standard assemblies first (HiC_2.0, standard_2.1)
- Fall back to method-specific assemblies
Python Implementation Pattern
def search_genomescope_recursive(species_name, tolid):
"""Search for GenomeScope data across all assembly folders."""
species_s3 = species_name.replace(' ', '_')
# 1. Discover assembly folders
result = subprocess.run(
['aws', 's3', 'ls', f's3://genomeark/species/{species_s3}/{tolid}/',
'--no-sign-request'],
capture_output=True, text=True
)
assembly_folders = []
for line in result.stdout.strip().split('\n'):
if 'PRE' in line:
folder = line.split('PRE')[1].strip().rstrip('/')
if folder.startswith('assembly'):
assembly_folders.append(folder)
# 2. Search patterns
patterns = [
'evaluation/genomescope/{tolid}_genomescope__Summary.txt',
'evaluation/genomescope2/{tolid}_genomescope__Summary.txt',
'qc/genomescope/{tolid}_genomescope__Summary.txt',
'intermediates/genomescope/{tolid}_genomescope__Summary.txt',
]
# 3. Try each combination
for assembly in assembly_folders:
for pattern in patterns:
s3_path = f's3://genomeark/species/{species_s3}/{tolid}/{assembly}/{pattern.format(tolid=tolid)}'
result = subprocess.run(
['aws', 's3', 'cp', s3_path, '-', '--no-sign-request'],
capture_output=True, text=True, timeout=10
)
if result.returncode == 0 and result.stdout:
return result.stdout
return None
GenomeScope Summary Parsing
def parse_genomescope_summary(content):
"""Extract genome characteristics from GenomeScope2 summary."""
data = {}
# Genome Haploid Length (max value - second column)
match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
if match:
data['genome_size_genomescope'] = int(match.group(1).replace(',', ''))
# Heterozygosity percentage (max value)
match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
if match:
data['heterozygosity_percent'] = float(match.group(1))
# Genome Unique Length (max value)
match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
if match:
data['unique_length'] = int(match.group(1).replace(',', ''))
# Genome Repeat Length (max value)
match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
if match:
data['repeat_length'] = int(match.group(1).replace(',', ''))
# Calculate repeat content percentage
if 'repeat_length' in data and 'unique_length' in data:
total = data['repeat_length'] + data['unique_length']
if total > 0:
data['repeat_content_percent'] = (data['repeat_length'] / total) * 100
return data
Best Practices
- Rate limiting: Add 0.2s delay between successful fetches to be respectful to AWS
- Caching: Check if file exists locally before fetching
- Timeout: Use 10-15s timeout per request
- Assembly discovery: Always discover assemblies dynamically - don't assume structure
- Multiple locations: GenomeScope data may be in evaluation/, qc/, or intermediates/
- Expected coverage: ~15-20% of VGP assemblies have GenomeScope data available
Common Issues
Species not found: Some species use different naming (check exact spacing/underscores)
# Search for species
aws s3 ls s3://genomeark/species/ --no-sign-request | grep -i "species_name"
Assembly folder variations: Not all use standard names
# List all assemblies for a species
aws s3 ls s3://genomeark/species/{SPECIES}/{TOLID}/ --no-sign-request
AWS CLI Setup
# Install AWS CLI (no credentials needed for Genome Ark)
conda install -c conda-forge awscli
# Test access
aws s3 ls s3://genomeark/species/ --no-sign-request | head
Karyotype Data Curation and Literature Search
Overview
Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required.
Search Strategy
Effective Search Terms
"{species_name} karyotype chromosome 2n"
"{species_name} diploid number karyotype"
"{genus} karyotype evolution"
"cytogenetic analysis {family_name}"
"{species_name} chromosome number diploid"
Best Reference Sources
- PubMed/PMC: Primary cytogenetic studies
- ResearchGate: Karyotype descriptions and figures
- Specialized databases:
- Bird Chromosome Database: https://sites.unipampa.edu.br/birdchromosomedatabase/
- Animal Genome Size Database: http://www.genomesize.com/
- Genome assembly papers: Often mention expected karyotype
- Comparative cytogenetic studies: Family-level analyses
Search Time Estimates
- Model organisms, domestic species: 2-3 minutes
- Well-studied taxonomic groups: 5-10 minutes
- Rare/uncommon species: 10-20 minutes or not found
Taxonomic Conservation Patterns
Mammals
- Cetaceans: Highly conserved 2n = 44, n = 22 (exceptions: pygmy sperm whale, right whale, beaked whales = 2n = 42)
- Felidae: Conserved 2n = 38, n = 19
- Canidae: Conserved 2n = 78, n = 39
- Primates: Variable (great apes 2n = 48, macaques 2n = 42, marmosets 2n = 46)
Birds
- Anatidae (waterfowl): Highly conserved 2n = 80, n = 40 across ducks, geese, swans
- Galliformes (game birds): Typically 2n = 78, n = 39 (chicken, quail, grouse)
- Passerines: Variable 2n = 78-82, most common 2n = 80
- Ancestral avian karyotype: Putative 2n = 80
- General pattern: 50.7% of birds have 2n = 78-82; 21.7% have exactly 2n = 80
Reptiles
- Lacertidae (wall lizards): Often 2n = 38, n = 19
Genome Assembly Interpretation
⚠️ Warning: Chromosome-level assemblies often report fewer chromosomes than actual diploid number
Why: Assemblies typically capture only:
- Macrochromosomes (large chromosomes)
- Larger microchromosomes
- Small microchromosomes remain unassembled
Example: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes"
- True karyotype: 10 macro pairs + 30 micro pairs = 80
- Assembly: ~34-42 scaffolds (only macro + larger micros)
Using Conservation for Inference
When specific karyotype data is unavailable but genus/family patterns are strong:
-
High confidence inference (acceptable for publication):
- Multiple congeneric species confirmed
- Family-level conservation documented
- No known exceptions in genus
-
Document inference clearly:
accession,taxid,species,2n,n,notes,reference GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url -
Priority for direct confirmation:
- Species with conservation exceptions
- Type specimens or reference species
- Phylogenetically divergent lineages
VGP-Specific: Sex Chromosome Adjustment
When both sex chromosomes are in main haplotype (common in VGP assemblies):
- Expected scaffolds = n + 1 (not n)
- Reason: X+Y or Z+W = two distinct chromosomes
- Check: VGP metadata column "Sex chromosomes main haplotype"
- Patterns: "Has X and Y", "Has Z and W", "Has X1, X2, and Y"
Data Recording Format
CSV Structure:
accession,taxid,species_name,diploid_2n,haploid_n,notes,reference
GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/...
Notes field examples:
- "Standard {family} karyotype"
- "Conserved {genus} karyotype"
- "Inferred from {family} conservation"
- "Unusual karyotype for family"
- "Geographic variation reported"
Prioritization for Literature Searches
TIER 1 (>90% success rate):
- Model organisms (zebrafish, mouse, medaka)
- Domestic species (chicken, goat, sheep)
- Game animals (waterfowl, deer)
- Laboratory species (fruit fly, nematode)
TIER 2 (70-90% success rate):
- Well-studied taxonomic groups (Podarcis lizards, corvids)
- Conservation focus species (raptors, large mammals)
- Commercial species (salmonids, oysters)
TIER 3 (50-70% success rate):
- Common but not economically important
- Widespread distribution
- Recent phylogenetic interest
Low priority (<50% success rate):
- Deep-sea species
- Rare/endangered without conservation genetics
- Recently described species
- Cryptic species complexes
AGP Format (A Golden Path)
Overview
AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI.
When to Use This Knowledge
- Processing genome assemblies for submission to databases
- Curating chromosome-level assemblies
- Splitting haplotype assemblies
- Assigning unlocalized scaffolds (unlocs)
- Debugging AGP validation errors
- Converting between assembly representations
AGP Format Structure
Tab-delimited format with 9 columns for sequence lines (type W) or 8+ columns for gap lines (type U/N)
Sequence Lines (Column 5 = 'W'):
object obj_beg obj_end part_num W component_id comp_beg comp_end orientation
Gap Lines (Column 5 = 'U' or 'N'):
object obj_beg obj_end part_num gap_type gap_length gap_type linkage linkage_evidence
Critical Coordinate Rules
Rule 1: Object and Component Lengths MUST Match
For sequence lines, the span in the object MUST equal the span in the component:
(obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)
Example - CORRECT:
Scaffold_47_unloc_1 1 54360 1 W scaffold_23.hap1 19274039 19328398 -
# Object length: 54360 - 1 + 1 = 54,360 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓
Example - INCORRECT:
Scaffold_47_unloc_1 1 19328398 1 W scaffold_23.hap1 19274039 19328398 -
# Object length: 19328398 - 1 + 1 = 19,328,398 bp
# Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗
# ERROR: Lengths don't match!
Rule 2: Component Numbering Restarts for New Objects
Each object (column 1) has its own component numbering (column 4) starting at 1:
Scaffold_10 1 30578279 1 W scaffold_4.hap2 1 30578279 -
Scaffold_10_unloc_1 1 65764 1 W scaffold_74.hap2 1 65764 + # ← Starts at 1, not 3!
Rule 3: Sequential Component Numbering Within Objects
Component numbers increment sequentially (gaps and sequences both count):
Scaffold_2 1 1731008 1 W scaffold_25.hap1 1 1731008 -
Scaffold_2 1731009 1731108 2 U 100 scaffold yes proximity_ligation
Scaffold_2 1731109 1956041 3 W scaffold_70.hap1 1 224933 -
Common AGP Processing Issues
Issue 1: Incorrect Object Coordinates When Creating Unlocs
Symptom:
ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398)
do not have the same length
Cause: When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the length of the extracted region, not the original component end coordinate.
Wrong Approach:
# Setting object end to component end coordinate
agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end'] # ✗ WRONG
Correct Approach:
# Calculate actual length from component coordinates
agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1 # ✓ CORRECT
Issue 2: Component Numbering Not Reset for New Objects
Symptom: Unloc scaffolds have component numbers > 1 when they should start at 1.
Cause: When creating a new object (unloc scaffold), component numbering wasn't reset.
Solution:
# When creating unlocs, reset component number
agp_df.loc[index, '#_scaffs'] = 1 # Column 4: component number
Issue 3: AGPcorrect Accumulating Coordinates
Symptom: Unloc sequences inherit cumulative coordinates from parent scaffolds.
Cause: AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans.
Solution: Always recalculate object coordinates from component spans when creating new objects (unlocs).
AGP Processing Best Practices
1. Coordinate System Understanding
- Object coordinates (columns 2-3): Position within the assembled object (1-based, inclusive)
- Component coordinates (columns 7-8): Position within the source sequence (1-based, inclusive)
- Both use 1-based closed intervals [start, end]
- Length calculation:
end - start + 1
2. Creating Unlocalized Sequences (Unlocs)
# When extracting a region to create an unloc:
# 1. Calculate the actual length of the region
length = int(comp_end) - int(comp_start) + 1
# 2. Set object coordinates for the new unloc
obj_start = 1 # Always starts at 1
obj_end = length # Equals the length
# 3. Reset component number
component_num = 1 # New object, new numbering
# 4. Rename the object
new_object_name = f"{parent_scaffold}_unloc_{unloc_number}"
3. Validating AGP Files
Use NCBI's AGP validator:
agp_validate assembly.agp
Common validation checks:
- Object/component length match
- Sequential component numbering
- No coordinate overlaps
- Gap specifications valid
- Orientation values (+, -, ?, 0, na)
4. Handling Haplotype-Split Assemblies
When splitting diploid assemblies into haplotypes:
- Identify haplotype markers in sequence names (H1/hap1, H2/hap2)
- Maintain proper pairing information
- Process unlocs separately per haplotype
- Remove haplotig duplications
- Track gaps appropriately (especially proximity ligation gaps)
AGP Coordinate Debugging Pattern
When encountering coordinate errors:
# For each AGP line, verify:
obj_length = int(obj_end) - int(obj_beg) + 1
comp_length = int(comp_end) - int(comp_beg) + 1
assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}"
# For sequential component numbers:
assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}"
AGP File Structure by Assembly Stage
1. Raw Assembly AGP:
- Direct representation from assembler
- May have incorrect sequence lengths
- Needs coordinate correction (AGPcorrect)
2. Corrected AGP:
- Sequence lengths match actual FASTA
- Coordinates adjusted for length discrepancies
- Ready for haplotype splitting
3. Haplotype-Split AGP:
- Separate files per haplotype
- Unlocs identified but not separated
- Haplotigs marked but not removed
4. Final Curated AGP:
- Unlocs separated into individual objects
- Haplotigs removed to separate file
- Proximity ligation gaps cleaned
- Ready for database submission
BED File Processing and Telomere Analysis
Pattern: Classifying Scaffolds by Telomere Types
When analyzing telomere data from BED files to classify scaffolds:
File Structure:
- Terminal telomeres BED: columns include scaffold, start, end, orientation (p/q), accession
- Interstitial telomeres BED: similar structure with position markers (p/q/u for internal)
Best Practice - Use Python CSV Module:
import csv
from collections import defaultdict
# Use defaultdict for automatic initialization
telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0})
# Process with csv.reader (more portable than pandas)
with open('telomeres.bed', 'r') as f:
reader = csv.reader(f, delimiter='\t')
for row in reader:
scaffold = row[0]
accession = row[10] # GCA accession
key = (accession, scaffold)
telomere_counts[key]['terminal'] += 1
Why CSV over pandas:
- No external dependencies (pandas may not be installed)
- Faster for simple tabular operations
- Lower memory footprint for large files
- Better portability across environments
Classification Categories:
- Category 1: 2 terminal telomeres, 0 interstitial (complete chromosomes)
- Category 2: 1 terminal telomere, 0 interstitial (partial)
- Category 3: Has interstitial telomeres (likely assembly issues)
NCBI Data Integration Strategies
Check Existing Data Sources Before API Calls
Problem: Need chromosome counts for 400+ assemblies from NCBI.
Anti-pattern: Query NCBI datasets API for each accession
# DON'T: Query 400+ times
for accession in missing_data:
result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession])
# Takes 10+ minutes, hits API rate limits
Better Pattern: Check if data already exists in compiled tables
# DO: Look for existing compiled data first
# VGP table has multiple chromosome count columns:
# - num_chromosomes (column 54)
# - total_number_of_chromosomes (column 106)
# - num_chromosomes_haploid (column 122)
# Read from existing comprehensive table
with open('VGP-table.csv') as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
num_chr = row[53] if row[53] else row[105] # Fallback strategy
Results: Filled 392/417 missing values instantly vs 10+ minutes of API calls.
Fallback Strategy for Multiple Columns:
# Try multiple sources in order of preference
num_chromosomes = row[53] if (len(row) > 53 and row[53]) else ''
if not num_chromosomes and len(row) > 105:
num_chromosomes = row[105] # Alternative column
When to use NCBI API:
- Data not in existing tables
- Need real-time/latest data
- Fetching assembly reports or sequence data
- Small number of queries (<20)
API Best Practices (when necessary):
- Use full path to datasets command (may be aliased)
- Add delays between calls (
time.sleep(0.5)) - Set reasonable timeouts
- Handle errors gracefully
Related Skills
- vgp-pipeline - VGP workflows process Hi-C and HiFi data
- galaxy-tool-wrapping - Galaxy tools work with SAM/BAM and sequencing data formats
- galaxy-workflow-development - Workflows process sequencing data
Supporting Documentation
- reference.md: Detailed format specifications and tool documentation
- common-issues.md: Comprehensive troubleshooting guide with examples
Version History
- v1.1.1: Added BED file processing patterns for telomere analysis and NCBI data integration strategies
- v1.1.0: Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns
- v1.0.0: Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns