skills/gptomics/bioskills/bio-population-genetics-plink-basics

bio-population-genetics-plink-basics

SKILL.md

PLINK Basics

File formats, conversion, and quality control filtering with PLINK 1.9 and 2.0.

File Formats

Binary Format (Recommended)

File Contents
.bed Binary genotype data
.bim Variant information (chr, ID, cM, pos, A1, A2)
.fam Sample information (FID, IID, father, mother, sex, pheno)

PLINK 2.0 Format

File Contents
.pgen Binary genotype data (compressed)
.pvar Variant information
.psam Sample information

Text Format (Legacy)

File Contents
.ped Genotypes (FID, IID, father, mother, sex, pheno, genotypes)
.map Variant positions (chr, ID, cM, pos)

Format Conversion

VCF to PLINK Binary

# PLINK 1.9
plink --vcf input.vcf.gz --make-bed --out output

# PLINK 2.0
plink2 --vcf input.vcf.gz --make-bed --out output

# With sample ID handling
plink2 --vcf input.vcf.gz --double-id --make-bed --out output

PLINK Binary to VCF

# PLINK 1.9
plink --bfile input --recode vcf --out output

# PLINK 2.0
plink2 --bfile input --export vcf --out output

# Compressed VCF
plink2 --bfile input --export vcf bgz --out output

PED/MAP to Binary (PLINK 1.9 Only)

# PLINK 1.9 (PLINK 2.0 doesn't support .ped/.map directly)
plink --file input --make-bed --out output

Binary to PED/MAP

# PLINK 1.9
plink --bfile input --recode --out output

# PLINK 2.0
plink2 --bfile input --export ped --out output

PLINK 1.9 to 2.0 Format

# Convert to PGEN format
plink2 --bfile input --make-pgen --out output

# Convert back to BED
plink2 --pfile input --make-bed --out output

Quality Control Filtering

MAF Filter (Minor Allele Frequency)

# Remove variants with MAF < 0.01
plink --bfile input --maf 0.01 --make-bed --out output

# PLINK 2.0
plink2 --bfile input --maf 0.01 --make-bed --out output

# Remove rare variants (MAF < 0.05)
plink2 --bfile input --maf 0.05 --make-bed --out output

Genotyping Rate Filters

# Per-variant missing rate (remove if >5% missing)
plink2 --bfile input --geno 0.05 --make-bed --out output

# Per-sample missing rate (remove if >5% missing)
plink2 --bfile input --mind 0.05 --make-bed --out output

Hardy-Weinberg Equilibrium Filter

# Remove variants with HWE p-value < 1e-6
plink2 --bfile input --hwe 1e-6 --make-bed --out output

# Different threshold for cases vs controls
plink2 --bfile input --hwe 1e-6 --hwe-all --make-bed --out output

Combined QC Pipeline

# Standard QC filtering
plink2 --bfile input \
    --maf 0.01 \
    --geno 0.05 \
    --mind 0.05 \
    --hwe 1e-6 \
    --make-bed --out qc_filtered

Sample and Variant Selection

Keep/Remove Samples

# Keep specific samples (samples.txt: FID IID per line)
plink2 --bfile input --keep samples.txt --make-bed --out output

# Remove specific samples
plink2 --bfile input --remove samples.txt --make-bed --out output

# Keep single sample
plink2 --bfile input --keep-fam sample_id --make-bed --out output

Extract/Exclude Variants

# Extract specific variants (variants.txt: variant IDs)
plink2 --bfile input --extract variants.txt --make-bed --out output

# Exclude specific variants
plink2 --bfile input --exclude variants.txt --make-bed --out output

# Extract by range
plink2 --bfile input --extract range chr1:1000000-2000000 --make-bed --out output

Chromosome Selection

# Single chromosome
plink2 --bfile input --chr 22 --make-bed --out chr22

# Multiple chromosomes
plink2 --bfile input --chr 1-22 --make-bed --out autosomes

# Exclude chromosome
plink2 --bfile input --not-chr 23,24,25,26 --make-bed --out autosomes

Allele Frequency

# PLINK 1.9 (MAF-based)
plink --bfile input --freq --out output

# PLINK 2.0 (ALT allele frequency - not MAF!)
plink2 --bfile input --freq --out output

# PLINK 2.0 with MAF
plink2 --bfile input --freq cols=+mac,+mafreq --out output

Missing Data Statistics

# Per-sample and per-variant missing rates
plink2 --bfile input --missing --out output

# Output files:
# output.smiss - sample missing rates
# output.vmiss - variant missing rates

Sex Check

Verify reported sex matches X chromosome heterozygosity.

# PLINK 1.9
plink --bfile input --check-sex --out sex_check

# PLINK 2.0
plink2 --bfile input --split-par hg38 --check-sex --out sex_check

Interpret Results

import pandas as pd

sex = pd.read_csv('sex_check.sexcheck', sep='\s+')

problems = sex[sex['STATUS'] == 'PROBLEM']
print(f'Sex mismatches: {len(problems)}')

# F statistic: <0.2 = female, >0.8 = male, between = ambiguous
# PEDSEX: reported sex (1=male, 2=female, 0=unknown)
# SNPSEX: inferred sex (1=male, 2=female, 0=undetermined)

Update or Remove

# Update sex from check results
plink2 --bfile input --update-sex sex_check.sexcheck col-num=4 --make-bed --out updated

# Remove sex mismatches
awk '$5 == "PROBLEM" {print $1, $2}' sex_check.sexcheck > sex_problems.txt
plink2 --bfile input --remove sex_problems.txt --make-bed --out output

Sample Information

Update Phenotypes

# phenotypes.txt: FID IID pheno (1=control, 2=case, -9=missing)
plink2 --bfile input --pheno phenotypes.txt --make-bed --out output

# Quantitative phenotype
plink2 --bfile input --pheno phenotypes.txt --make-bed --out output

Update Sex

# sex.txt: FID IID sex (1=male, 2=female, 0=unknown)
plink2 --bfile input --update-sex sex.txt --make-bed --out output

Update Sample IDs

# ids.txt: old_FID old_IID new_FID new_IID
plink2 --bfile input --update-ids ids.txt --make-bed --out output

Merging Datasets

# Merge two datasets (PLINK 1.9)
plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed --out merged

# Merge list of datasets
plink --bfile data1 --merge-list merge_list.txt --make-bed --out merged
# merge_list.txt contains: data2.bed data2.bim data2.fam (one set per line)

# Handle strand flips
plink --bfile data1 --bmerge data2 --make-bed --out merged
# If error: plink --bfile data2 --flip missnps.txt --make-bed --out data2_flipped

Variant Information

Set Variant IDs

# Set ID based on position
plink2 --bfile input --set-all-var-ids @:#:\$r:\$a --make-bed --out output
# Format: chr:pos:ref:alt

Update Variant Names

# update.txt: old_id new_id
plink2 --bfile input --update-name update.txt --make-bed --out output

PLINK 2.0 vs 1.9 Summary

Feature PLINK 2.0 PLINK 1.9
Status Current Legacy
Command plink2 plink
Format .pgen/.pvar/.psam .bed/.bim/.fam
Speed Faster Baseline
Memory More efficient Higher for large data
Export VCF --export vcf --recode vcf
Frequency output ALT frequency MAF
Missing output .smiss/.vmiss .imiss/.lmiss
PED/MAP support No (convert via 1.9) Yes (--file)

Related Skills

  • association-testing - GWAS with filtered data
  • population-structure - PCA after QC
  • variant-calling/vcf-basics - VCF format before conversion
Weekly Installs
3
GitHub Stars
349
First Seen
Jan 24, 2026
Installed on
opencode2
codex2
claude-code2
windsurf1
openclaw1
trae1