pysam
Pysam - Genomic Alignments
Used for high-throughput sequencing pipelines. It allows efficient access to billions of DNA fragments aligned to a reference genome.
When to Use
- Processing next-generation sequencing (NGS) data.
- Analyzing genomic variants (SNPs, indels).
- Extracting reads from specific genomic regions.
- Building custom bioinformatics pipelines.
- Quality control of sequencing data.
Core Principles
Indexed Access
BAM files must be indexed (.bai) for efficient random access to genomic regions.
Coordinate System
Genomic coordinates are 0-based (Python-style) for positions, but 1-based for ranges in some contexts.
Read Attributes
Each read contains sequence, quality scores, alignment position, and flags.
Quick Reference
Standard Imports
import pysam
Basic Patterns
# 1. Open BAM file
samfile = pysam.AlignmentFile("aligned_reads.bam", "rb")
# 2. Iterate over reads in a specific genomic region
for read in samfile.fetch("chr1", 10000, 10100):
print(f"Read: {read.query_name}, Quality: {read.mapping_quality}")
print(f"Sequence: {read.query_sequence}")
print(f"Position: {read.reference_start}")
# 3. Variant analysis (VCF)
vcf = pysam.VariantFile("mutations.vcf")
for rec in vcf.fetch("chr1", 10000, 10100):
print(f"Pos: {rec.pos}, Ref: {rec.ref}, Alt: {rec.alts}")
print(f"Genotype: {rec.samples['sample1']['GT']}")
# 4. Writing aligned reads
outfile = pysam.AlignmentFile("output.bam", "wb", template=samfile)
for read in samfile:
if read.mapping_quality > 30:
outfile.write(read)
outfile.close()
Critical Rules
✅ DO
- Always use indexed files - Create index with
pysam.index("file.bam")for fast access. - Check read flags - Use
read.is_paired,read.is_unmappedto filter reads. - Handle unmapped reads - Unmapped reads have
reference_start = -1. - Close files explicitly - Use context managers or
.close()to avoid resource leaks.
❌ DON'T
- Don't iterate over entire BAM - Use
fetch()with regions for efficiency. - Don't ignore quality scores - Low-quality bases can cause false variants.
- Don't mix coordinate systems - Be consistent with 0-based vs 1-based indexing.
Advanced Patterns
Counting Reads per Gene
# Using a gene annotation file
genes = {} # gene_name -> (chr, start, end)
for read in samfile.fetch():
# Check if read overlaps any gene
for gene, (chr, start, end) in genes.items():
if read.reference_name == chr and start <= read.reference_start < end:
genes[gene]['count'] += 1
Variant Filtering
# Filter variants by quality and depth
for rec in vcf.fetch():
depth = rec.samples['sample1']['DP']
qual = rec.qual
if depth > 10 and qual > 20:
# Process high-quality variant
pass
Pysam provides the low-level access needed for genomic data processing, enabling researchers to work directly with the raw data of life itself.
More from tondevrel/scientific-agent-skills
xgboost-lightgbm
Industry-standard gradient boosting libraries for tabular data and structured datasets. XGBoost and LightGBM excel at classification and regression tasks on tables, CSVs, and databases. Use when working with tabular machine learning, gradient boosting trees, Kaggle competitions, feature importance analysis, hyperparameter tuning, or when you need state-of-the-art performance on structured data.
193opencv
Open Source Computer Vision Library (OpenCV) for real-time image processing, video analysis, object detection, face recognition, and camera calibration. Use when working with images, videos, cameras, edge detection, contours, feature detection, image transformations, object tracking, optical flow, or any computer vision task.
142ortools
Google Optimization Tools. An open-source software suite for optimization, specialized in vehicle routing, flows, integer and linear programming, and constraint programming. Features the world-class CP-SAT solver. Use for vehicle routing problems (VRP), scheduling, bin packing, knapsack problems, linear programming (LP), integer programming (MIP), network flows, constraint programming, combinatorial optimization, resource allocation, shift scheduling, job-shop scheduling, and discrete optimization problems.
74matplotlib
The foundational library for creating static, animated, and interactive visualizations in Python. Highly customizable and the industry standard for publication-quality figures. Use for 2D plotting, scientific data visualization, heatmaps, contours, vector fields, multi-panel figures, LaTeX-formatted plots, custom visualization tools, and plotting from NumPy arrays or Pandas DataFrames.
70scipy
Comprehensive guide for SciPy - the fundamental library for scientific and technical computing in Python. Use for integration, optimization, interpolation, linear algebra, signal processing, statistics, ODEs, Fourier transforms, and advanced scientific algorithms. Built on NumPy and essential for research and engineering.
51plotly
A high-level interactive graphing library for Python. Ideal for web-based visualizations, 3D plots, and complex interactive dashboards. Built on plotly.js, it allows users to zoom, pan, and hover over data points in a browser-based environment. Use for interactive charts, web applications, Jupyter notebooks, 3D data visualization, geographic maps, financial charts, animations, time-series analysis, and building production-ready dashboards with Dash.
50