pysam
SKILL.md
Pysam - Genomic Alignments
Used for high-throughput sequencing pipelines. It allows efficient access to billions of DNA fragments aligned to a reference genome.
When to Use
- Processing next-generation sequencing (NGS) data.
- Analyzing genomic variants (SNPs, indels).
- Extracting reads from specific genomic regions.
- Building custom bioinformatics pipelines.
- Quality control of sequencing data.
Core Principles
Indexed Access
BAM files must be indexed (.bai) for efficient random access to genomic regions.
Coordinate System
Genomic coordinates are 0-based (Python-style) for positions, but 1-based for ranges in some contexts.
Read Attributes
Each read contains sequence, quality scores, alignment position, and flags.
Quick Reference
Standard Imports
import pysam
Basic Patterns
# 1. Open BAM file
samfile = pysam.AlignmentFile("aligned_reads.bam", "rb")
# 2. Iterate over reads in a specific genomic region
for read in samfile.fetch("chr1", 10000, 10100):
print(f"Read: {read.query_name}, Quality: {read.mapping_quality}")
print(f"Sequence: {read.query_sequence}")
print(f"Position: {read.reference_start}")
# 3. Variant analysis (VCF)
vcf = pysam.VariantFile("mutations.vcf")
for rec in vcf.fetch("chr1", 10000, 10100):
print(f"Pos: {rec.pos}, Ref: {rec.ref}, Alt: {rec.alts}")
print(f"Genotype: {rec.samples['sample1']['GT']}")
# 4. Writing aligned reads
outfile = pysam.AlignmentFile("output.bam", "wb", template=samfile)
for read in samfile:
if read.mapping_quality > 30:
outfile.write(read)
outfile.close()
Critical Rules
✅ DO
- Always use indexed files - Create index with
pysam.index("file.bam")for fast access. - Check read flags - Use
read.is_paired,read.is_unmappedto filter reads. - Handle unmapped reads - Unmapped reads have
reference_start = -1. - Close files explicitly - Use context managers or
.close()to avoid resource leaks.
❌ DON'T
- Don't iterate over entire BAM - Use
fetch()with regions for efficiency. - Don't ignore quality scores - Low-quality bases can cause false variants.
- Don't mix coordinate systems - Be consistent with 0-based vs 1-based indexing.
Advanced Patterns
Counting Reads per Gene
# Using a gene annotation file
genes = {} # gene_name -> (chr, start, end)
for read in samfile.fetch():
# Check if read overlaps any gene
for gene, (chr, start, end) in genes.items():
if read.reference_name == chr and start <= read.reference_start < end:
genes[gene]['count'] += 1
Variant Filtering
# Filter variants by quality and depth
for rec in vcf.fetch():
depth = rec.samples['sample1']['DP']
qual = rec.qual
if depth > 10 and qual > 20:
# Process high-quality variant
pass
Pysam provides the low-level access needed for genomic data processing, enabling researchers to work directly with the raw data of life itself.
Weekly Installs
3
Repository
tondevrel/scien…t-skillsGitHub Stars
5
First Seen
Feb 8, 2026
Security Audits
Installed on
gemini-cli3
opencode3
codebuddy3
github-copilot3
codex3
kimi-cli3