# Count Dataset Tokens

## Overview

This skill provides a systematic approach to counting tokens in datasets accurately. It emphasizes thorough data exploration, proper interpretation of task requirements, and verification of results to avoid common mistakes such as incomplete field coverage or misinterpreted terminology.

## When to Use This Skill

- Counting tokens in HuggingFace datasets or similar data sources
- Tasks involving tokenization of text fields
- Filtering datasets by domain, category, or other metadata
- Working with datasets that have multiple text fields that may contribute to token counts
- Any task requiring accurate quantification of textual content in structured datasets

## Critical Pre-Implementation Steps

### 1. Clarify Terminology Before Proceeding

When a task uses specific terms (e.g., "deepseek tokens", "science domain"), verify exactly what content each term refers to:

- **Examine the README/documentation thoroughly** - documentation often contains critical definitions
- **List all available fields** in the dataset schema before making assumptions
- **Identify all fields** that could potentially be relevant to the token count
- **Do not assume field names tell the complete story** - a field like `deepseek_reasoning` may not be the only field relevant to counting "deepseek tokens"

### 2. Explore Dataset Structure Thoroughly

Before writing any counting logic:

1. Load a sample of the dataset
2. Print ALL column names and their types
3. Examine multiple sample entries in full detail
4. Identify relationships between fields
5. Check for nested structures or JSON fields (see the sketch below)
6. Look for metadata columns that might indicate which fields to include
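
For step 5, the schema plus a decode attempt usually reveals nested content. A minimal sketch, assuming a HuggingFace dataset (the dataset id is a placeholder):

```python
import json

from datasets import load_dataset

# "dataset_name" is a placeholder; substitute the real dataset id.
ds = load_dataset("dataset_name", split="train[:10]")

# The features object exposes nested types that a flat column listing hides.
print(ds.features)

# Some datasets store structured content as JSON strings; attempt a decode
# on each string field to surface hidden sub-fields.
for key, value in ds[0].items():
    if isinstance(value, (list, dict)):
        print(f"{key}: nested {type(value).__name__}")
    elif isinstance(value, str):
        try:
            decoded = json.loads(value)
        except ValueError:  # plain text, not JSON
            continue
        if isinstance(decoded, dict):
            print(f"{key}: JSON-encoded dict with keys {sorted(decoded)}")
```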

### 3. Understand Domain/Category Mappings

When filtering by categories like "science":

- List all unique values in the domain/category column
- Determine whether the target category is an explicit value or a grouping of related values
- Example: "science" might mean biology + chemistry + physics rather than a literal "science" value
- Document your interpretation and verify that it aligns with the task intent (see the sketch below)
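
A minimal sketch of this verification, assuming the column is named `domain` and using an illustrative (not authoritative) grouping for "science":

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("dataset_name", split="train")  # placeholder id

# Enumerate what actually appears before filtering.
print(Counter(ds["domain"]))

# If "science" is a grouping rather than a literal value, make the
# mapping explicit so the interpretation is documented in code.
SCIENCE_DOMAINS = {"biology", "chemistry", "physics"}  # assumed grouping
science_ds = ds.filter(lambda row: row["domain"] in SCIENCE_DOMAINS)
print(f"Selected {len(science_ds)} of {len(ds)} entries")
```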

## Implementation Workflow

### Step 1: Data Discovery

1. Load the dataset (or a representative sample)
2. Enumerate all columns/fields
3. For each text field, examine:
   - Field name and description
   - Sample content from multiple entries
   - Whether it should be included in token counts
4. Check for metadata or a schema section that documents field purposes (one approach is sketched below)
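
For a HuggingFace dataset, one way to do step 4 is to fetch the dataset card directly, since it often defines each field (the repo id here is a placeholder):

```python
from huggingface_hub import hf_hub_download

# Download the dataset card (README.md) and read it in full.
readme_path = hf_hub_download(
    repo_id="dataset_name",  # placeholder
    filename="README.md",
    repo_type="dataset",
)
with open(readme_path, encoding="utf-8") as f:
    print(f.read())
```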

### Step 2: Define Scope Explicitly

Before counting, explicitly document:

- Which fields will be tokenized
- Which filter criteria will be applied (e.g., `domain == "biology"`)
- The tokenizer to be used and why
- Any fields being excluded, and the reasoning (see the sketch below)
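
One way to keep this documentation honest is to encode the scope as data before any counting runs. A sketch with purely illustrative values; the field names, domain grouping, and tokenizer id are assumptions, not facts about any particular dataset:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CountingScope:
    fields_to_tokenize: tuple[str, ...]  # which text fields count
    filter_column: str                   # column the filter applies to
    filter_values: frozenset[str]        # accepted values for that column
    tokenizer_name: str                  # which tokenizer (record why separately)
    excluded_fields: dict[str, str] = field(default_factory=dict)  # field -> reason


# All values below are illustrative assumptions.
scope = CountingScope(
    fields_to_tokenize=("deepseek_reasoning", "deepseek_solution"),
    filter_column="domain",
    filter_values=frozenset({"biology", "chemistry", "physics"}),
    tokenizer_name="deepseek-ai/DeepSeek-R1",
    excluded_fields={"metadata": "identifiers only, no counted text"},
)
print(scope)
```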

### Step 3: Implement with Verification Points

1. Load the appropriate tokenizer
2. Apply filters to select relevant entries
3. For each entry, tokenize ALL relevant fields
4. Sum token counts with running totals
5. Print progress checkpoints (e.g., every 1000 entries)
6. Track statistics: entry count, empty fields, errors (see the sketch below)
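
A minimal sketch of such a loop, assuming the `transformers` tokenizer API and the placeholder dataset, field, and tokenizer names from the earlier sketches:

```python
from collections import Counter

from datasets import load_dataset
from transformers import AutoTokenizer

FIELDS = ("deepseek_reasoning", "deepseek_solution")  # assumed field names
TARGET_DOMAINS = {"biology", "chemistry", "physics"}  # assumed grouping

tokenizer = AutoTokenizer.from_pretrained("tokenizer_name")  # placeholder
ds = load_dataset("dataset_name", split="train")             # placeholder
ds = ds.filter(lambda row: row["domain"] in TARGET_DOMAINS)

total_tokens = 0
stats = Counter()

for row in ds:
    stats["entries"] += 1
    for name in FIELDS:
        text = row.get(name)
        if not text:
            stats[f"empty_{name}"] += 1
            continue
        # add_special_tokens=False counts content tokens only; whichever
        # convention you pick, document it alongside the result.
        total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
    if stats["entries"] % 1000 == 0:
        print(f"{stats['entries']} entries, running total: {total_tokens:,} tokens")

print(f"Entries: {stats['entries']}, total tokens: {total_tokens:,}")
print("Field statistics:", dict(stats))
```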

### Step 4: Validate Results

Before reporting final numbers:

- **Spot-check individual entries** - manually verify token counts for 3-5 random samples
- **Sanity-check totals** - does the average number of tokens per entry seem reasonable?
- **Cross-reference with metadata** - if the dataset provides expected statistics, compare against them
- **Verify filter results** - confirm the filtered count matches the expected number of entries (see the spot-check sketch below)
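
A spot-check sketch, reusing `ds`, `tokenizer`, `FIELDS`, `total_tokens`, and `stats` from the counting loop above:

```python
import random

# Manually inspect a handful of entries so per-field counts can be
# eyeballed against the running totals.
for idx in random.sample(range(len(ds)), k=min(5, len(ds))):
    row = ds[idx]
    per_field = {
        name: len(tokenizer.encode(row[name] or "", add_special_tokens=False))
        for name in FIELDS
    }
    preview = str(row[FIELDS[0]])[:80]
    print(f"entry {idx}: {per_field} | preview: {preview!r}")

# Reasonableness check: an implausible average usually means a missed
# field or an over-broad filter.
print(f"average tokens/entry: {total_tokens / max(stats['entries'], 1):.1f}")
```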

## Common Pitfalls to Avoid

### Pitfall 1: Incomplete Field Identification

**Mistake:** Assuming a single field (e.g., `deepseek_reasoning`) contains all relevant content

**Solution:**

- Examine the full schema before deciding which fields to include
- Consider whether multiple fields contribute to the "complete" content
- Check for fields like `prompt`, `response`, `full_text`, or `conversation` that should be included (see the sketch below)
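
One way to avoid missing a candidate field is to enumerate every string-typed column up front. A sketch, again with a placeholder dataset id:

```python
from datasets import Value, load_dataset

ds = load_dataset("dataset_name", split="train[:100]")  # placeholder

# List every string-typed column so no candidate text field is silently
# skipped; decide inclusion or exclusion per field, explicitly.
text_fields = [
    name
    for name, feat in ds.features.items()
    if isinstance(feat, Value) and feat.dtype in ("string", "large_string")
]
print("Candidate text fields:", text_fields)
```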

### Pitfall 2: Ambiguous Terminology

**Mistake:** Interpreting "deepseek tokens" as "tokens in the `deepseek_reasoning` field only"

**Solution:**

- Research what the terminology means in the dataset's context
- Read any available documentation or README files completely
- When uncertain, consider multiple interpretations and document your choice

### Pitfall 3: Assuming Category Names Are Literal

**Mistake:** Looking for `domain == "science"` when "science" is actually a group of domains

**Solution:**

- Always enumerate unique values in category/domain fields first
- Understand the taxonomy before applying filters
- Common groupings: science (biology, chemistry, physics), stem (science + math), humanities

### Pitfall 4: Not Validating Intermediate Results

**Mistake:** Running a complete count without checking partial results

**Solution:**

- Process in batches with intermediate output
- Verify token counts for sample entries manually
- Compare against any available reference statistics

## Verification Checklist

Before finalizing results, confirm:

- All relevant text fields have been identified and included
- The correct tokenizer is being used
- Filter criteria correctly identify the target subset
- Sample entries have been manually verified
- Empty or null values are handled appropriately
- The final count passes a reasonableness check (average tokens/entry, total entries)
- Documentation has been consulted for any ambiguous terminology

## Example Exploration Code

When starting a dataset token counting task, use exploratory code like:

```python
# Initial exploration
from collections import Counter

from datasets import load_dataset

# Load a small sample first
ds = load_dataset("dataset_name", split="train[:100]")

# Print all column names
print("Columns:", ds.column_names)

# Examine a single entry in full
print("\nSample entry:")
for key, value in ds[0].items():
    print(f"  {key}: {type(value).__name__} = {str(value)[:200]}...")

# Check the domain/category distribution if filtering
if 'domain' in ds.column_names:
    domains = Counter(ds['domain'])
    print("\nDomain distribution:", domains)
```

## Key Principles

1. **Explore before implementing** - understand the full data structure first
2. **Clarify ambiguity explicitly** - don't assume; document interpretations
3. **Verify incrementally** - check results at multiple stages
4. **Consider all relevant fields** - token counts often span multiple columns
5. **Read documentation thoroughly** - READMEs contain critical context