Non-Coding RNA Analysis

Pipeline for identifying, annotating, and interpreting non-coding RNAs and their biological roles. Covers microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and other ncRNA classes.

Key principles:

  1. Class determines function — miRNAs repress translation and destabilize target mRNAs; lncRNAs have diverse mechanisms (scaffolds, guides, decoys, enhancers); rRNAs and tRNAs are core components of the translation machinery
  2. Targets matter more than the ncRNA itself — for miRNAs, the regulated mRNA targets determine the phenotype
  3. Expression context is critical — ncRNAs are highly tissue/cell-type specific
  4. Conservation indicates function — deeply conserved ncRNAs (let-7, MALAT1) have well-established roles
  5. Evidence grading — T1: validated targets (reporter assay, CLIP-seq), T2: high-confidence computational prediction, T3: expression correlation, T4: sequence-based prediction only

Type-based reasoning (look up, don't guess): ncRNA function depends on class. miRNAs silence target mRNAs (look up targets in miRTarBase/TargetScan); lncRNAs have diverse functions such as scaffolding, guiding, and decoying (check the literature for the specific lncRNA); circRNAs may sponge miRNAs.

For any ncRNA query: first identify the class from the name or sequence, then select the appropriate evidence source. Do not assume function from the name alone: a gene named "LINC" may have a characterized mechanism, or none at all. Always search PubMed for the specific ncRNA before interpreting.

  • miRNAs: validated targets (T1) from miRTarBase outweigh any computational prediction. A predicted target with no experimental support is a hypothesis, not a finding.
  • lncRNAs: mechanism is almost always determined by experimental studies. Use PubMed_search_articles with the lncRNA name plus "mechanism" or "function" to find relevant evidence.
  • circRNAs: miRNA sponging is the most common proposed mechanism but is frequently over-claimed. Look for CLIP-seq or reporter assay evidence before asserting it.
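The class-specific search advice above can be sketched as a small query builder. This is only an illustration: `build_pubmed_query` and its keyword lists are hypothetical helpers, not part of ToolUniverse, and the resulting string is meant to be passed to PubMed_search_articles as its query.

```python
def build_pubmed_query(name: str, rna_class: str) -> str:
    """Return a PubMed query string tailored to the ncRNA class.

    The keyword lists are illustrative choices, not a ToolUniverse API.
    """
    terms_by_class = {
        "miRNA": ["target", "luciferase", "CLIP"],
        "lncRNA": ["mechanism", "function"],
        "circRNA": ["sponge", "CLIP", "reporter assay"],
    }
    # Unknown classes fall back to a generic "function" search
    terms = terms_by_class.get(rna_class, ["function"])
    # Quote multi-word terms so PubMed treats them as phrases
    joined = " OR ".join(f'"{t}"' if " " in t else t for t in terms)
    return f'"{name}" AND ({joined})'


print(build_pubmed_query("miR-21", "miRNA"))
print(build_pubmed_query("HOTAIR", "lncRNA"))
```

Keeping the keywords in one dictionary makes it easy to tighten a search (for example, adding "CLIP" for circRNA sponge claims) without touching the calling code.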


When to Use

  • "What are the targets of miR-21?"
  • "Find lncRNAs associated with breast cancer"
  • "Is this lncRNA conserved across species?"
  • "What miRNAs regulate TP53?"
  • "Annotate these non-coding RNA IDs"
  • "Which miRNAs are biomarkers for [disease]?"

Not this skill: For mRNA expression analysis, use tooluniverse-rnaseq-deseq2. For CRISPR screens, use tooluniverse-crispr-screen-analysis.


Core Tools

Tool | Use For
--- | ---
miRBase_search_mirna | Search miRNAs by name, accession, or sequence
miRBase_get_mirna | Detailed miRNA info (sequence, genomic location, family)
miRBase_get_mature_mirna | Mature miRNA sequences and annotations
PubMed_search_articles | Search for validated miRNA targets in literature (e.g., "miR-21 target validation")
LNCipedia_search_lncrna | Search lncRNAs by name, gene symbol, or transcript ID
LNCipedia_get_lncrna | Detailed lncRNA transcript info (sequence, structure, conservation)
LNCipedia_get_lncrna_xrefs | lncRNA gene info with all transcript variants
LNCipedia_search_ncrna_by_type | List all transcripts for a lncRNA gene
LNCipedia_get_lncrna_publications | lncRNA sequence (FASTA format)
RNAcentral_search | Search all ncRNA types across databases
RNAcentral_get_rna | Detailed ncRNA annotations from 40+ databases
Rfam_get_family | RNA family details (structure, alignment, species distribution)
Rfam_search | Search RNA families by keyword
DisGeNET_search_gene | ncRNA-disease associations
PubMed_search_articles | ncRNA literature
GTEx_get_median_gene_expression | Tissue expression of ncRNA genes

Workflow

Phase 0: ncRNA Identity & Classification
  Name/ID → miRBase/LNCipedia/RNAcentral → class, sequence, genomic location
    |
Phase 1: Target & Interaction Analysis
  miRNA → target mRNAs; lncRNA → interacting proteins/RNAs/chromatin
    |
Phase 2: Expression & Tissue Specificity
  GTEx/GEO → where is it expressed? Tissue-specific or ubiquitous?
    |
Phase 3: Disease Associations
  DisGeNET/PubMed/CTD → ncRNA-disease links with evidence
    |
Phase 4: Functional Interpretation
  Pathway enrichment of targets → biological role → clinical significance

Phase 0: ncRNA Identity & Classification

ncRNA classes by size and database:

  • miRNA (~22 nt, miRBase): Post-transcriptional silencing via 3'UTR binding
  • lncRNA (>200 nt, LNCipedia): Diverse — chromatin remodeling, transcription regulation, miRNA sponges
  • rRNA (120-5000 nt, RNAcentral/Rfam): Ribosome components
  • tRNA (~76 nt, RNAcentral): Amino acid delivery
  • snoRNA (60-300 nt, Rfam): rRNA modification (methylation, pseudouridylation)
  • snRNA (~150 nt, Rfam): Spliceosome components
  • piRNA (26-31 nt, RNAcentral): Transposon silencing in germline
  • circRNA (variable, RNAcentral): miRNA sponges, protein scaffolds (experimental evidence required)

Identification workflow:

  • Name starts with miR-, hsa-mir-, or let- → search miRBase
  • Name starts with LINC, MALAT, HOTAIR, XIST, or ends in -AS1 → search LNCipedia
  • Any ncRNA type → search RNAcentral (aggregates all databases)
  • RNA family question → search Rfam
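A minimal sketch of the name-based routing above, assuming names arrive as plain strings. `route_ncrna_name` is a hypothetical helper; the `let-` prefix is handled alongside `miR-` because let-7 family members are miRNAs. Gene-symbol forms such as MIR21 are deliberately left to the RNAcentral fallback.

```python
import re


def route_ncrna_name(name: str) -> tuple[str, str]:
    """Guess (class, database) from an ncRNA name using simple prefix/suffix rules."""
    n = name.strip()
    # miR-21, hsa-mir-21, hsa-let-7a, let-7: optional 3-letter species prefix
    if re.match(r"^(?:[a-z]{3}-)?(?:mir|let)-", n, flags=re.IGNORECASE):
        return ("miRNA", "miRBase")
    upper = n.upper()
    if upper.startswith(("LINC", "MALAT", "HOTAIR", "XIST")) or upper.endswith("-AS1"):
        return ("lncRNA", "LNCipedia")
    # Anything else: RNAcentral aggregates all ncRNA databases
    return ("unknown", "RNAcentral")


for example in ["hsa-mir-21", "let-7a", "HOTAIR", "ZEB1-AS1", "SNORD115"]:
    print(example, route_ncrna_name(example))
```

The routing only picks a starting database. The class it returns is a guess from nomenclature and should be confirmed against the database record.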

Phase 1: Target & Interaction Analysis

For miRNAs — the targets determine the biology:

NOTE: There is no dedicated miRNA target lookup tool in ToolUniverse. To find miRNA targets:

  1. Literature search (most reliable): PubMed_search_articles(query="miR-21 target validation luciferase")
  2. Cross-references: miRBase_get_mirna_xrefs(accession="MIMAT0000076") — may link to external target databases
  3. Known targets for well-studied miRNAs: Use the reference list below, then examine their network and pathway context via STRING/Reactome
  4. For novel miRNAs: Search PubMed for "[miRNA] target" and extract validated targets from papers

Well-studied miRNA targets (for common oncomiRs/tumor suppressors):

  • miR-21: PTEN, PDCD4, TPM1, RECK, SPRY1, SPRY2, BTG2
  • miR-155: SOCS1, SHIP1, AID, TP53INP1
  • miR-122: SLC7A1, ADAM17 (also HCV IRES cofactor)
  • let-7: RAS, HMGA2, MYC, LIN28

Target interpretation framework:

  • Validated (T1): Luciferase reporter, CLIP-seq, degradome-seq — base conclusions on these
  • High-confidence prediction (T2): TargetScan conserved sites, DIANA-microT score > 0.9 — support validated findings
  • Weak evidence (T3-T4): expression correlation alone (T3) or sequence-only prediction by miRanda, PicTar, or RNA22 (T4); use for hypothesis generation only and do not report as findings
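The grading scheme can be expressed as a short function. This is a sketch under stated assumptions: `grade_target`, its arguments, and the assay name strings are hypothetical, and the tier order follows the T1 to T4 definitions in the key principles.

```python
# Assay names treated as direct experimental validation (T1); illustrative spelling
VALIDATION_ASSAYS = {"luciferase reporter", "clip-seq", "degradome-seq"}


def grade_target(assays, high_confidence_prediction=False, expression_correlated=False):
    """Return the best evidence tier (T1-T4) supported for one miRNA-target pair."""
    if any(a.lower() in VALIDATION_ASSAYS for a in assays):
        return "T1"  # validated by reporter assay, CLIP-seq, or degradome-seq
    if high_confidence_prediction:
        return "T2"  # e.g. TargetScan conserved site or DIANA-microT score > 0.9
    if expression_correlated:
        return "T3"  # expression correlation without a validated interaction
    return "T4"      # sequence-based prediction only


print(grade_target(["Luciferase reporter"]))
print(grade_target([], high_confidence_prediction=True))
print(grade_target([]))
```

Returning only the best tier keeps reports simple. Storing the full evidence list alongside the tier is a reasonable extension when provenance matters.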

For lncRNAs — the mechanism varies:

lncRNA Mechanism | Example | How to Investigate
--- | --- | ---
Chromatin modifier | HOTAIR, XIST | Check interacting proteins (PRC2, LSD1) via PubMed
Transcription regulator | NEAT1, MEG3 | Check nearby genes (cis-regulation) via genomic location
miRNA sponge | MALAT1, circRNAs | Search for miRNA binding sites
Scaffold | NKILA, BCAR4 | Check protein interactions
Enhancer RNA | eRNAs | Check ENCODE enhancer annotations

Phase 2: Expression & Tissue Specificity

GTEx_get_median_gene_expression(gene_symbol="MIR21")  # expression of the MIR21 gene (primary transcript), not the mature miRNA
# Note: GTEx is bulk RNA-seq and does not quantify mature miRNAs; use miRNA-seq (small RNA-seq) data from GEO for those

Interpretation: Tissue-restricted ncRNAs are often functionally important in that tissue. Ubiquitous ncRNAs (like MALAT1) tend to have housekeeping roles.
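One way to put a number on "tissue-restricted versus ubiquitous" is the tau index, a commonly used tissue-specificity measure. The skill itself does not prescribe it, so treat this as an optional sketch. `tau_specificity` is a hypothetical helper, and the expression values below are invented for illustration, not GTEx results.

```python
def tau_specificity(expression: dict[str, float]) -> float:
    """Tau index: 0 = uniform across tissues, 1 = expressed in a single tissue."""
    values = list(expression.values())
    if len(values) < 2:
        raise ValueError("need at least two tissues")
    x_max = max(values)
    if x_max <= 0:
        raise ValueError("gene is not expressed in any tissue")
    # Sum of (1 - normalized expression) over tissues, scaled by n - 1
    return sum(1 - v / x_max for v in values) / (len(values) - 1)


# Invented median TPM values, for illustration only
restricted = {"liver": 120.0, "lung": 1.0, "brain": 0.5, "heart": 0.8}
ubiquitous = {"liver": 95.0, "lung": 110.0, "brain": 100.0, "heart": 105.0}

print(f"restricted tau = {tau_specificity(restricted):.2f}")
print(f"ubiquitous tau = {tau_specificity(ubiquitous):.2f}")
```

Values near 1 suggest tissue-restricted expression worth following up in that tissue. Values near 0 suggest broad expression.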

Phase 3: Disease Associations

DisGeNET_search_gene(query="MIR21")  # miR-21 disease associations
PubMed_search_articles(query="miR-21 biomarker cancer")

Key ncRNA-disease associations (well-established T1 examples — always verify via DisGeNET or PubMed for the specific ncRNA):

  • miR-21: OncomiR in multiple cancers; targets PTEN, PDCD4, TPM1 (hundreds of T1 studies)
  • miR-155: B-cell lymphoma, inflammation — immune regulation
  • miR-122: Hepatitis C liver disease — HCV replication cofactor; therapeutic target (miravirsen)
  • let-7 family: Lung cancer, stem cell differentiation — tumor suppressor targeting RAS, HMGA2
  • HOTAIR: Breast/colorectal cancer — recruits PRC2, promotes metastasis
  • MALAT1: Lung cancer/metastasis — splicing regulation
  • XIST: X-inactivation, cancer — chromatin silencing
  • H19: Beckwith-Wiedemann syndrome, cancer — imprinted lncRNA, miR-675 host
  • ANRIL: CVD, diabetes, cancer — CDKN2A/B locus regulation (GWAS-validated)

Phase 4: Functional Interpretation

After identifying miRNA targets (Phase 1), run pathway enrichment:

# Collect validated target gene symbols
targets = ["PTEN", "PDCD4", "TPM1", "RECK", "SPRY1"]  # miR-21 targets

# Pathway enrichment (space-separated identifiers)
ReactomeAnalysis_pathway_enrichment(identifiers=" ".join(targets))
# Interaction network (carriage-return-separated identifiers; 9606 = human)
STRING_get_network(identifiers="\r".join(targets), species=9606)

Interpretation: If miR-21 targets are enriched in apoptosis and PI3K-AKT signaling → miR-21 is an oncomiR that promotes survival by simultaneously suppressing multiple tumor suppressors.
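To make "enriched" concrete, the sketch below computes a one-sided hypergeometric p-value, the kind of over-representation statistic that enrichment tools typically report. It is illustrative only: `hypergeom_enrichment_p` is a hypothetical helper, the counts are invented, and the Reactome tool performs its own statistics.

```python
from math import comb


def hypergeom_enrichment_p(n_targets: int, n_pathway: int,
                           n_overlap: int, n_background: int) -> float:
    """P(overlap >= n_overlap) when drawing n_targets genes from the background."""
    total = comb(n_background, n_targets)
    upper = min(n_targets, n_pathway)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented counts: 5 targets, 3 of them in a 150-gene pathway, 20000 background genes
p = hypergeom_enrichment_p(n_targets=5, n_pathway=150, n_overlap=3, n_background=20000)
print(f"enrichment p-value = {p:.2e}")
```

With many pathways tested at once, p-values like this need multiple-testing correction before any pathway is reported as enriched.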

Report structure:

  1. ncRNA Identity — class, sequence, genomic location, conservation
  2. Targets/Interactions — validated targets with evidence grades
  3. Expression Profile — tissue specificity, disease-specific expression changes
  4. Disease Associations — evidence-graded disease links
  5. Pathway Analysis — enriched pathways among targets
  6. Mechanistic Model — how this ncRNA contributes to disease biology
  7. Clinical Potential — biomarker utility, therapeutic target potential (antagomirs, ASOs)

Computational Procedure: TargetScan Predicted Targets (Download-and-Process)

TargetScan is a widely used source of computational miRNA target predictions but has no REST API. Download and process locally:

# Step 1: Download TargetScan predicted targets (one-time, ~10MB zipped)
import io
import zipfile

import pandas as pd
import requests

url = "https://www.targetscan.org/vert_80/vert_80_data_download/Summary_Counts.default_predictions.txt.zip"
resp = requests.get(url, timeout=60)
resp.raise_for_status()  # fail early on a bad download
with zipfile.ZipFile(io.BytesIO(resp.content)) as z:
    fname = z.namelist()[0]
    df = pd.read_csv(z.open(fname), sep="\t")

# Column names below follow the TargetScan 8.0 summary file; confirm with print(df.columns)
score_col = "Cumulative weighted context++ score"
df = df[df["Species ID"] == 9606]  # keep human rows only

# Step 2: Query for a specific miRNA family
mirna = "miR-21-5p"  # or "miR-21/590-5p" (TargetScan uses family names)
# (?!\d) stops "miR-21" from also matching miR-210, miR-214, etc.
targets = df[df["miRNA family"].str.contains(r"miR-21(?!\d)", case=False, na=False)]

# Step 3: Rank by cumulative weighted context++ score (most negative first)
targets_ranked = targets.sort_values(score_col, ascending=True)
print(f"Top 20 predicted targets of {mirna}:")
for _, row in targets_ranked.head(20).iterrows():
    print(f"  {row['Gene Symbol']:10s} score={row[score_col]:.3f}  "
          f"sites={row['Total num conserved sites']}")

Interpretation: A more negative context++ score means stronger predicted repression. Targets with more than one conserved site are higher confidence.

Computational Procedure: miRTarBase Validated Targets (Download-and-Process)

miRTarBase has Cloudflare protection blocking programmatic access. Use the R/Bioconductor data package or bulk download:

# Option 1: Download from miRTarBase bulk export (requires browser download first)
# Go to: https://mirtarbase.cuhk.edu.cn/~miRTarBase/miRTarBase_2025/
# Download: hsa_MTI.xlsx (human miRNA-target interactions)

# Option 2: Use the GitHub data dump
# https://github.com/jorainer/mirtarbase — R package with cached data

# Once you have the file:
import pandas as pd
mti = pd.read_excel("hsa_MTI.xlsx")  # or read_csv if TSV

# Filter for your miRNA; (?!\d) avoids matching hsa-miR-210, hsa-miR-214, etc.
mir21_targets = mti[mti['miRNA'].str.contains(r'hsa-miR-21(?!\d)', case=False, na=False)]
print(f"miR-21 validated targets: {len(mir21_targets)}")

# Filter by evidence strength: assay names are listed in 'Experiments';
# 'Support Type' holds the functional/non-functional MTI label
strong = mir21_targets[mir21_targets['Experiments'].str.contains(
    'Luciferase|Reporter|Western|CLIP', case=False, na=False
)]
print(f"  Strong evidence (reporter/CLIP): {len(strong)}")
for _, row in strong.head(10).iterrows():
    print(f"    {row['Target Gene']:10s} {row['Experiments']}")

When download is not available: Use the built-in reference table in Phase 1 for well-studied miRNAs, or search PubMed for validated targets.
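Once both a validated list and a prediction table are in hand, they can be cross-referenced so that predictions support validated findings rather than replace them. The sketch below assumes a plain set of validated gene symbols and a dict of prediction scores. `merge_target_evidence` is a hypothetical helper, and the scores and the gene GENE_X are invented placeholders.

```python
def merge_target_evidence(validated: set, predicted_scores: dict) -> list:
    """Label each gene by which evidence sources support it."""
    rows = []
    for gene in set(validated) | set(predicted_scores):
        if gene in validated and gene in predicted_scores:
            status = "validated + predicted"
        elif gene in validated:
            status = "validated only"
        else:
            status = "predicted only (hypothesis)"
        rows.append((gene, status, predicted_scores.get(gene)))
    # Validated genes first, then alphabetical for a stable order
    rows.sort(key=lambda r: (r[1].startswith("predicted"), r[0]))
    return rows


validated = {"PTEN", "PDCD4"}                   # from the Phase 1 reference list
predicted = {"PTEN": -0.42, "GENE_X": -0.31}    # invented scores, hypothetical gene

for gene, status, score in merge_target_evidence(validated, predicted):
    print(f"{gene:8s} {status:28s} score={score}")
```

Genes that appear only in the prediction table stay labeled as hypotheses, matching the rule that unvalidated predictions are not reported as findings.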


Limitations

  • miRNA target prediction is noisy — even the best algorithms have >50% false positive rates; always prioritize experimentally validated targets
  • lncRNA function is poorly characterized — only ~5% of annotated lncRNAs have known functions
  • Expression measurement varies — miRNA-seq, RNA-seq, and microarray capture different ncRNA classes; check the assay type
  • Species differences — miRNAs are often conserved but lncRNAs are frequently species-specific; cross-species lncRNA comparisons are unreliable