# Count Dataset Tokens

## Overview
This skill provides a systematic approach for accurately counting tokens in datasets. It emphasizes thorough data exploration, proper interpretation of task requirements, and verification of results to avoid common mistakes such as incomplete field coverage or misinterpreted terminology.
## When to Use This Skill
- Counting tokens in HuggingFace datasets or similar data sources
- Tasks involving tokenization of text fields
- Filtering datasets by domain, category, or other metadata
- Working with datasets that have multiple text fields that may contribute to token counts
- Any task requiring accurate quantification of textual content in structured datasets
## Critical Pre-Implementation Steps

### 1. Clarify Terminology Before Proceeding
When a task uses specific terms (e.g., "deepseek tokens", "science domain"), verify exactly what content each term refers to:
- Examine the README/documentation thoroughly - Documentation often contains critical definitions
- List all available fields in the dataset schema before making assumptions
- Identify all fields that could potentially be relevant to the token count
- Do not assume field names tell the complete story - A field like `deepseek_reasoning` may not be the only field relevant for counting "deepseek tokens"
### 2. Explore Dataset Structure Thoroughly
Before writing any counting logic (see the Example Exploration Code section below for these steps in practice):
1. Load a sample of the dataset
2. Print ALL column names and their types
3. Examine multiple sample entries in full detail
4. Identify relationships between fields
5. Check for nested structures or JSON fields
6. Look for metadata columns that might indicate which fields to include
### 3. Understand Domain/Category Mappings
When filtering by categories like "science":
- List all unique values in the domain/category column
- Determine if the target category is an explicit value OR a grouping of related values
- Example: "science" might mean `biology` + `chemistry` + `physics` rather than a literal "science" value
- Document your interpretation and verify it aligns with the task intent
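As a concrete sketch of this step, assuming the dataset is loaded as `ds` (see the Example Exploration Code section below) and has a `domain` column; the science grouping here is hypothetical and must be verified against the dataset's actual taxonomy:

```python
from collections import Counter

# Enumerate every unique domain value before applying any filter
domain_counts = Counter(ds["domain"])
print("Unique domains:", sorted(domain_counts))

# Hypothetical grouping - replace with the dataset's documented taxonomy
SCIENCE_DOMAINS = {"biology", "chemistry", "physics"}
matched = sum(1 for d in ds["domain"] if d in SCIENCE_DOMAINS)
print(f"Matched {matched} of {len(ds)} entries")
```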
## Implementation Workflow

### Step 1: Data Discovery
1. Load the dataset (or a representative sample)
2. Enumerate all columns/fields
3. For each text field, examine:
- Field name and description
- Sample content from multiple entries
- Whether it should be included in token counts
4. Check for a metadata or schema subset (e.g., a dataset config) that documents field purposes
### Step 2: Define Scope Explicitly
Before counting, explicitly document:
- Which fields will be tokenized
- Which filter criteria will be applied (e.g., `domain == "biology"`)
- The tokenizer to be used and why
- Any fields being excluded and the reasoning
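One way to make this explicit is a small scope declaration at the top of the counting script; the field names, filter values, and tokenizer below are placeholders to be replaced with whatever the exploration step actually found:

```python
# Explicit scope declaration - every counting decision is visible and reviewable
SCOPE = {
    # Fields whose text contributes to the count (placeholder names)
    "fields_to_tokenize": ["deepseek_reasoning", "deepseek_solution"],
    # Filter applied before counting (placeholder values)
    "filter_column": "domain",
    "filter_values": {"biology", "chemistry", "physics"},
    # Tokenizer, chosen to match the model that produced the text (assumed)
    "tokenizer": "deepseek-ai/DeepSeek-V3",
    # Fields deliberately excluded, with the reasoning
    "excluded_fields": {"metadata": "no natural-language content"},
}
```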
### Step 3: Implement with Verification Points
1. Load the appropriate tokenizer
2. Apply filters to select relevant entries
3. For each entry, tokenize ALL relevant fields
4. Sum token counts with running totals
5. Print progress checkpoints (e.g., every 1000 entries)
6. Track statistics: entry count, empty fields, errors
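A minimal sketch of this loop using the `transformers` tokenizer API; the dataset name, tokenizer, field names, and filter values are the same placeholder assumptions as in Step 2:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed tokenizer and dataset - substitute the choices documented in Step 2
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")
ds = load_dataset("dataset_name", split="train")

FIELDS = ["deepseek_reasoning", "deepseek_solution"]  # placeholder field names
DOMAINS = {"biology", "chemistry", "physics"}         # placeholder filter values

total_tokens = entries_counted = empty_fields = 0
for row in ds:
    if row["domain"] not in DOMAINS:
        continue
    for field in FIELDS:
        text = row.get(field)
        if not text:
            empty_fields += 1
            continue
        total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
    entries_counted += 1
    if entries_counted % 1000 == 0:  # progress checkpoint
        print(f"{entries_counted} entries, {total_tokens:,} tokens so far")

print(f"Entries: {entries_counted}, empty fields: {empty_fields}, total tokens: {total_tokens:,}")
```

Whether special tokens should count (`add_special_tokens`) is itself a scope decision worth documenting in Step 2.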
### Step 4: Validate Results
Before reporting final numbers:
- Spot-check individual entries - Manually verify token counts for 3-5 random samples
- Sanity check totals - Does the average tokens per entry seem reasonable?
- Cross-reference with metadata - If the dataset provides expected statistics, compare
- Verify filter results - Confirm the filtered count matches expected entries
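A short sketch of the spot-check and sanity steps, reusing `ds`, `tokenizer`, `FIELDS`, `DOMAINS`, and the running totals from the Step 3 sketch:

```python
import random

# Spot-check: manually recount a few random filtered entries and compare by eye
filtered = [row for row in ds if row["domain"] in DOMAINS]
for row in random.sample(filtered, k=min(5, len(filtered))):
    manual = sum(
        len(tokenizer.encode(row[f], add_special_tokens=False))
        for f in FIELDS if row.get(f)
    )
    print(f"Manual recount for one sampled entry: {manual} tokens")

# Sanity checks: filter size and per-entry average should look plausible
print(f"Filtered entries: {len(filtered)}")
if entries_counted:
    print(f"Average tokens/entry: {total_tokens / entries_counted:.1f}")
```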
## Common Pitfalls to Avoid

### Pitfall 1: Incomplete Field Identification
Mistake: Assuming a single field (e.g., `deepseek_reasoning`) contains all relevant content
Solution:
- Examine the full schema before deciding which fields to include
- Consider whether multiple fields contribute to the "complete" content
- Check if there are fields like `prompt`, `response`, `full_text`, or `conversation` that should be included
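A quick diagnostic for this pitfall is a per-field token tally over a small sample, which makes it obvious when a field you planned to ignore actually carries most of the content; the field names are illustrative and `tokenizer` is the one loaded in Step 3:

```python
# Per-field token totals over the first 100 rows (illustrative field names)
candidate_fields = ["prompt", "response", "full_text", "deepseek_reasoning"]
per_field = {f: 0 for f in candidate_fields}

for row in ds.select(range(min(100, len(ds)))):
    for f in candidate_fields:
        text = row.get(f)
        if isinstance(text, str) and text:
            per_field[f] += len(tokenizer.encode(text, add_special_tokens=False))

print("Tokens per field (first 100 rows):", per_field)
```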
### Pitfall 2: Ambiguous Terminology
Mistake: Interpreting "deepseek tokens" as "tokens in the `deepseek_reasoning` field only"
Solution:
- Research what the terminology means in the dataset's context
- Read any available documentation or README files completely
- When uncertain, consider multiple interpretations and document your choice
### Pitfall 3: Assuming Category Names Are Literal
Mistake: Looking for `domain == "science"` when science is actually a group of domains
Solution:
- Always enumerate unique values in category/domain fields first
- Understand the taxonomy before applying filters
- Common groupings: science (biology, chemistry, physics), stem (science + math), humanities
### Pitfall 4: Not Validating Intermediate Results
Mistake: Running a complete count without checking partial results
Solution:
- Process in batches with intermediate output
- Verify token counts for sample entries manually
- Compare against any available reference statistics
## Verification Checklist
Before finalizing results, confirm:
- All relevant text fields have been identified and included
- The correct tokenizer is being used
- Filter criteria correctly identify the target subset
- Sample entries have been manually verified
- Empty or null values are handled appropriately
- The final count passes a reasonableness check (average tokens/entry, total entries)
- Documentation has been consulted for any ambiguous terminology
## Example Exploration Code
When starting a dataset token counting task, use exploratory code like:
```python
# Initial exploration
from datasets import load_dataset

# Load a small sample first
ds = load_dataset("dataset_name", split="train[:100]")

# Print all column names
print("Columns:", ds.column_names)

# Examine a single entry in full
print("\nSample entry:")
for key, value in ds[0].items():
    print(f"  {key}: {type(value).__name__} = {str(value)[:200]}...")

# Check for domain/category distributions if filtering
if 'domain' in ds.column_names:
    from collections import Counter
    domains = Counter(ds['domain'])
    print("\nDomain distribution:", domains)
```
## Key Principles
- Explore before implementing - Understand the full data structure first
- Clarify ambiguity explicitly - Don't assume; document interpretations
- Verify incrementally - Check results at multiple stages
- Consider all relevant fields - Token counts often span multiple columns
- Read documentation thoroughly - READMEs contain critical context