grepai-chunking
GrepAI Chunking Configuration
This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.
When to Use This Skill
- Optimizing search accuracy
- Adjusting for code style (verbose vs. concise)
- Troubleshooting search results
- Understanding how indexing works
What is Chunking?
Chunking is the process of splitting source files into smaller segments for embedding:
┌─────────────────────────────────────┐
│ Large Source File │
│ (1000+ tokens) │
└─────────────────────────────────────┘
↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ ~512 │ │ ~512 │ │ ~512 │
│ tokens │ │ tokens │ │ tokens │
└─────────┘ └─────────┘ └─────────┘
↓
Each chunk gets
its own embedding
Why Chunking Matters
Embedding models have optimal input sizes:
- Too large chunks: Less precise search results
- Too small chunks: Lost context, fragmented results
- Just right: Good balance of precision and context
Configuration
Basic Settings
# .grepai/config.yaml
chunking:
size: 512 # Tokens per chunk
overlap: 50 # Overlap between chunks
Understanding Parameters
Chunk Size
The target number of tokens per chunk.
| Size | Effect |
|---|---|
| 256 | More precise, less context |
| 512 | Balanced (default) |
| 1024 | More context, less precise |
Overlap
Tokens shared between adjacent chunks. Preserves context at boundaries.
| Overlap | Effect |
|---|---|
| 0 | No overlap, may lose context at boundaries |
| 50 | Standard overlap (default) |
| 100 | More context, larger index |
Visualization
With size=512 and overlap=50:
File: auth.go (1000 tokens)
Chunk 1: tokens 1-512
┌────────────────────────────────────┐
│ func Login(user, pass)... │
└────────────────────────────────────┘
↘
50 token overlap
↙
Chunk 2: tokens 463-974
┌────────────────────────────────────┐
│ ...validate credentials... │
└────────────────────────────────────┘
↘
50 token overlap
↙
Chunk 3: tokens 925-1000
┌──────────────┐
│ ...return │
└──────────────┘
Recommended Settings by Language
Verbose Languages (Java, C#)
chunking:
size: 768 # Larger to capture full methods
overlap: 75
Concise Languages (Go, Python)
chunking:
size: 512 # Standard size
overlap: 50
Very Concise (Rust, Zig)
chunking:
size: 384 # Smaller for precise results
overlap: 40
Recommended Settings by Codebase
Small Functions (Microservices)
chunking:
size: 384 # Capture individual functions
overlap: 40
Large Classes (Monolith)
chunking:
size: 768 # Capture more context
overlap: 100
Mixed Codebase
chunking:
size: 512 # Balanced default
overlap: 50
How Tokens are Counted
GrepAI uses approximate token counting:
- ~4 characters = 1 token (for English text)
- Code varies based on identifiers and syntax
Example:
func calculateTotal(items []Item) float64 {
total := 0.0
for _, item := range items {
total += item.Price * float64(item.Quantity)
}
return total
}
≈ 45 tokens
Impact on Index Size
Larger overlap = more chunks = larger index:
| Size | Overlap | Chunks per 10K tokens | Index Impact |
|---|---|---|---|
| 512 | 0 | ~20 | Smallest |
| 512 | 50 | ~22 | Standard |
| 512 | 100 | ~24 | +10% |
| 256 | 50 | ~44 | +100% |
Impact on Search Quality
Too Small Chunks (size: 128)
Query: "authentication middleware"
Result: "...c.AbortWithStatus(401)..."
(Fragment, missing context)
Just Right (size: 512)
Query: "authentication middleware"
Result: "func AuthMiddleware() gin.HandlerFunc {
return func(c *gin.Context) {
token := c.GetHeader("Authorization")
if token == "" {
c.AbortWithStatus(401)
return
}
// validate token...
}
}"
(Complete function with context)
Too Large Chunks (size: 2048)
Query: "authentication middleware"
Result: "// Multiple unrelated functions...
func AuthMiddleware()... (your match)
func LoggingMiddleware()...
func CORSMiddleware()..."
(Too much noise)
Experimentation
Testing Different Settings
- Try smaller chunks for more precise results:
chunking:
size: 384
overlap: 40
- Re-index:
rm .grepai/index.gob
grepai watch
- Test with searches:
grepai search "your query"
- Adjust and repeat until satisfied.
Comparing Results
Before changing settings, save a search result:
grepai search "authentication" > before.txt
After changing settings and re-indexing:
grepai search "authentication" > after.txt
diff before.txt after.txt
Chunk Boundaries
GrepAI tries to split at logical boundaries:
- Empty lines (function/class boundaries)
- Closing braces
- Statement ends
This means actual chunk sizes may vary slightly from the target.
Best Practices
- Start with defaults: 512/50 works well for most codebases
- Adjust based on code style: Verbose = larger, concise = smaller
- Test with real queries: See what your searches return
- Re-index after changes: Must regenerate embeddings
- Consider overlap: Don't set to 0 unless index size is critical
Common Issues
❌ Problem: Search results are too fragmented ✅ Solution: Increase chunk size:
chunking:
size: 768
❌ Problem: Search results have too much irrelevant context ✅ Solution: Decrease chunk size:
chunking:
size: 384
❌ Problem: Results miss related code at function boundaries ✅ Solution: Increase overlap:
chunking:
overlap: 100
❌ Problem: Index is too large ✅ Solutions:
- Decrease overlap
- Increase chunk size
- Add more ignore patterns
Output Format
Chunking status:
✅ Chunking Configuration
Size: 512 tokens
Overlap: 50 tokens
Index Statistics:
- Total files: 245
- Total chunks: 1,234
- Avg chunks/file: 5.0
- Avg chunk size: 478 tokens
Recommendations:
- Current settings are balanced
- Consider size: 384 for more precise results
- Consider size: 768 for more context