bio-genome-intervals-gtf-gff-handling
SKILL.md
GTF/GFF Handling
GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.
Format Comparison
| Feature | GTF | GFF3 |
|---|---|---|
| Coordinate system | 1-based, inclusive | 1-based, inclusive |
| Hierarchy | Implicit (gene_id, transcript_id) | Explicit (Parent attribute) |
| Attribute format | key "value"; | key=value; |
| Comments | # | # |
| Fasta sequences | Not standard | ##FASTA directive |
GTF Format
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
GFF3 Format
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1
chr1 HAVANA mRNA 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972
chr1 HAVANA exon 11869 12227 . + . ID=exon1;Parent=ENST00000456328
Parse GTF with gtfparse (Python)
Installation
pip install gtfparse
Basic Parsing
import gtfparse
# Load entire GTF
df = gtfparse.read_gtf('annotation.gtf')
# View columns
print(df.columns)
# ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame',
# 'gene_id', 'transcript_id', 'gene_name', ...]
# Filter by feature type
genes = df[df['feature'] == 'gene']
transcripts = df[df['feature'] == 'transcript']
exons = df[df['feature'] == 'exon']
# Get specific gene
gene_df = df[df['gene_name'] == 'TP53']
Extract Gene Coordinates
import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
# All genes
genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]
# Convert to BED format (0-based)
genes_bed = genes.copy()
genes_bed['start'] = genes_bed['start'] - 1 # GTF is 1-based, BED is 0-based
genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'gene_id', 'strand']]
genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)
Get Exons for Gene
import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
# Get all exons for TP53
tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')]
tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']]
print(tp53_exons)
Parse GFF with gffutils (Python)
Installation
pip install gffutils
Create Database
import gffutils
# Create database (slow first time, fast for subsequent queries)
db = gffutils.create_db('annotation.gff3', 'annotation.db',
force=True, merge_strategy='create_unique')
# Or load existing database
db = gffutils.FeatureDB('annotation.db')
Query Features
import gffutils
db = gffutils.FeatureDB('annotation.db')
# Count features by type
for featuretype in db.featuretypes():
count = db.count_features_of_type(featuretype)
print(f'{featuretype}: {count}')
# Get all genes
for gene in db.features_of_type('gene'):
print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')
# Get gene by ID
gene = db['ENSG00000141510'] # TP53
print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')
# Get children (transcripts, exons)
for transcript in db.children(gene, featuretype='mRNA'):
print(f' Transcript: {transcript.id}')
for exon in db.children(transcript, featuretype='exon'):
print(f' Exon: {exon.start}-{exon.end}')
Get Introns
import gffutils
db = gffutils.FeatureDB('annotation.db')
# Get introns for a transcript
transcript = db['ENST00000269305']
introns = list(db.interfeatures(db.children(transcript, featuretype='exon'),
new_featuretype='intron'))
for intron in introns:
print(f'Intron: {intron.start}-{intron.end}')
Convert Formats with gffread (CLI)
Installation
conda install -c bioconda gffread
GTF to GFF3
gffread annotation.gtf -o annotation.gff3
GFF3 to GTF
gffread annotation.gff3 -T -o annotation.gtf
Extract Sequences
# Extract transcript sequences
gffread -w transcripts.fa -g genome.fa annotation.gtf
# Extract CDS sequences
gffread -x cds.fa -g genome.fa annotation.gtf
# Extract protein sequences
gffread -y proteins.fa -g genome.fa annotation.gtf
Filter Features
# Keep only protein-coding genes
gffread annotation.gtf -C -o coding.gtf
# Keep specific gene types
gffread annotation.gtf --keep-genes=protein_coding -o coding.gtf
Extract Regions with bedtools
Get Promoters
# Extract TSS (transcript start sites)
awk '$3 == "transcript"' annotation.gtf | \
awk -v OFS='\t' '{
if ($7 == "+") print $1, $4-1, $4, ".", ".", $7;
else print $1, $5-1, $5, ".", ".", $7;
}' > tss.bed
# Get promoter regions (2kb upstream of TSS)
bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed
Get Gene Bodies
# Extract gene coordinates to BED
awk '$3 == "gene"' annotation.gtf | \
awk -v OFS='\t' '{
split($0, a, "gene_id \""); split(a[2], b, "\"");
print $1, $4-1, $5, b[1], ".", $7;
}' > genes.bed
Get Exons
# Extract unique exons
awk '$3 == "exon"' annotation.gtf | \
awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' | \
sort -k1,1 -k2,2n | uniq > exons.bed
Python: GTF to BED Conversion
import gtfparse
import pandas as pd
def gtf_to_bed(gtf_path, feature_type='gene', output_path=None):
'''Convert GTF features to BED format.'''
df = gtfparse.read_gtf(gtf_path)
features = df[df['feature'] == feature_type].copy()
# Convert to 0-based coordinates
bed = pd.DataFrame({
'chrom': features['seqname'],
'start': features['start'] - 1,
'end': features['end'],
'name': features.get('gene_name', features.get('gene_id', '.')),
'score': 0,
'strand': features['strand']
})
if output_path:
bed.to_csv(output_path, sep='\t', header=False, index=False)
return bed
# Usage
genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed')
exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')
Validate GTF/GFF
# Check GTF format
gffread -E annotation.gtf
# Check GFF3 format
gffread -E annotation.gff3
# Detailed validation
gt gff3validator annotation.gff3 # requires genometools
Common Attributes
GTF Attributes
| Attribute | Description |
|---|---|
| gene_id | Ensembl gene ID |
| gene_name | Gene symbol |
| gene_biotype | protein_coding, lncRNA, etc. |
| transcript_id | Ensembl transcript ID |
| transcript_name | Transcript symbol |
| exon_number | Exon position in transcript |
| exon_id | Ensembl exon ID |
GFF3 Attributes
| Attribute | Description |
|---|---|
| ID | Unique feature identifier |
| Name | Display name |
| Parent | Parent feature ID |
| Dbxref | Database cross-references |
| gene_biotype | Gene type |
Memory-Efficient Processing
import gtfparse
# Process large files in chunks (gtfparse loads all into memory)
# For very large files, use gffutils database approach
# Or filter during parsing
df = gtfparse.read_gtf('annotation.gtf',
features=['gene', 'exon']) # Only load specific features
Related Skills
- bed-file-basics - BED format and conversion
- interval-arithmetic - Gene/exon overlap analysis
- proximity-operations - TSS proximity analysis
- differential-expression - Gene coordinate mapping
Weekly Installs
3
Repository
gptomics/bioskillsInstalled on
windsurf2
trae2
opencode2
codex2
claude-code2
antigravity2