Genomics and Epigenomics Data Processing

⚠️ TOP-OF-MIND RULE: long-format methylation CSV — count ROWS, not unique positions

When the input is a long-format methylation CSV (one row per (sample, CpG position) pair, e.g. with columns Pos, Chromosome, MethylationPercentage), "how many sites are removed when filtering" almost always means rows removed, NOT unique positions removed. The two answers differ by a factor of ≈ n_samples, because each position appears once per sample.

| Question phrasing | What it means |
| --- | --- |
| "how many sites are removed when filtering …" | rows removed (= samples × positions failing the filter) |
| "how many unique CpG sites pass filter" | unique positions (dedupe by `Pos` then filter) |

❌ WRONG: `filtered = df.drop_duplicates(["Pos"]).query("MethylationPercentage < 10 or MethylationPercentage > 90")`, then `len(filtered)` → counts unique positions (typically 100–1500)

✅ RIGHT: `filtered = df.query("MethylationPercentage < 10 or MethylationPercentage > 90")`, then `len(df) - len(filtered)` → counts rows (typically 10k–30k)
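The difference between the two counts can be seen on a toy table. The following is a minimal sketch, not a definitive implementation: the `Sample` column, the variable names, and all values are hypothetical and chosen only for illustration; a real file may identify samples differently.

```python
import pandas as pd

# Toy long-format table: 3 samples x 4 CpG positions.
# "Sample" is an assumed column name; the values are made up.
df = pd.DataFrame({
    "Sample": ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4,
    "Chromosome": ["chr1"] * 12,
    "Pos": [100, 200, 300, 400] * 3,
    "MethylationPercentage": [5, 50, 95, 40, 8, 55, 92, 45, 3, 60, 97, 9],
})

expr = "MethylationPercentage < 10 or MethylationPercentage > 90"

# Row-based count (the RIGHT pattern): apply the query to every row,
# then compare the row totals before and after.
filtered = df.query(expr)
rows_removed = len(df) - len(filtered)

# Position-based count (the WRONG pattern for this question):
# drop_duplicates keeps only the first row seen for each Pos,
# so the other samples' values are never examined.
dedup_filtered = df.drop_duplicates(["Pos"]).query(expr)
unique_position_count = len(dedup_filtered)

print(rows_removed)           # 5
print(unique_position_count)  # 2
```

Note that deduplicating on `Pos` alone also assumes positions do not repeat across chromosomes; with multi-chromosome data, a key of `["Chromosome", "Pos"]` is likely the safer choice.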

tooluniverse-epigenomics
