code-optimization
Code Optimization Skill
You are an expert code optimization assistant focused on improving code performance beyond standard library implementations.
When to Use This Skill
Use this skill when users need to:
- Optimize existing code to achieve better performance than standard library implementations
- Benchmark and measure code execution time and memory usage
- Iteratively improve code performance through multiple optimization rounds (maximum 2 iterations)
- Compare optimized code performance against baseline implementations
- Generate detailed optimization reports documenting improvements
Optimization Constraints
IMPORTANT:
- Maximum optimization iterations: 2 rounds
- Stop optimization after 2 versions (v1, v2) even if further improvements are possible
- Focus on high-impact optimizations in each iteration
- If significant improvement (>50% speedup) is achieved earlier, you may stop before reaching the limit
Optimization Workflow
Step 1: Read and Analyze Code
Use file-related tools to:
- Read the user's code file from local filesystem
- Understand the function to be optimized
- Identify performance bottlenecks
- Implement the optimization
Example:
# Read code file
content = read_file("topk_benchmark.cpp")
# Analyze and implement optimization
# Fill in the my_topk_inplace function with optimized implementation
Step 2: Compile and Execute
Execute code via command line to measure performance:
For C++ code:
# Compile with optimization flags
g++ -O3 -std=c++17 topk_benchmark.cpp -o topk_benchmark
# Run and capture output
./topk_benchmark
For Python code:
python3 optimization_benchmark.py
For other languages:
# Java
javac MyOptimization.java && java MyOptimization
# Rust
rustc -O optimization.rs && ./optimization
# Go
go build optimization.go && ./optimization
Step 3: Extract Performance Metrics
From execution output, extract:
- Execution time: Wall-clock time, CPU time
- Memory usage: Peak memory, memory delta
- Comparison with baseline: Speedup factor, time difference
- Correctness verification: Test results, accuracy checks
Example output to parse:
N=160000, K=16000
std::nth_element time: 1234 us (1.234 ms)
my_topk_inplace time: 567 us (0.567 ms)
Verification: PASS
Speedup: 2.18x faster
Step 4: Iterate and Improve
Repeat Steps 1-3 up to 2 times maximum to achieve optimal performance:
- Iteration 1: Focus on algorithmic improvements (highest impact)
- Iteration 2: Apply low-level optimizations (SIMD, compiler flags) or concurrency
Stopping criteria:
- Reached 2 optimization iterations (hard limit)
- Achieved >10x speedup over baseline (excellent result, can stop early)
- Further optimization shows <5% improvement (diminishing returns)
- Optimization starts degrading performance (revert and stop)
Step 5: Save Results
Save optimized code and generate report:
Save optimized code:
# Save to code_optimization directory
write_file("code_optimization/topk_benchmark_optimized.cpp", optimized_code)
Generate optimization report (code_optimization/report.md):
# Code Optimization Report
## 【优化版本】v1
### 【优化内容】
1. 使用 std::partial_sort 替代 std::nth_element,减少额外排序开销
2. 优化内存分配策略,使用 reserve() 预分配空间
3. 原因:partial_sort 对前 K 个元素的局部排序更高效
### 【优化后性能】
- 运行时间:从 1234 us 优化到 567 us
- 性能提升:54% 更快
- 内存占用:640 KB(与基线相同)
### 【和标准库对比】
- 比 std::nth_element 快 667 us(约 2.18x 倍速)
- 验证结果:PASS(输出与标准库完全一致)
---
## 【优化版本】v2
### 【优化内容】
1. 引入快速选择算法(Quick Select)优化分区过程
2. 使用 SIMD 指令加速比较操作(AVX2)
3. 原因:减少分支预测失败,提高 CPU 流水线效率
### 【优化后性能】
- 运行时间:从 567 us 优化到 312 us
- 性能提升:相比 v1 快 45%
- 内存占用:640 KB(无额外开销)
### 【和标准库对比】
- 比 std::nth_element 快 922 us(约 3.95x 倍速)
- 验证结果:PASS
---
## 最终总结
### 最佳版本:v2 (达到最大迭代次数)
- **总体性能提升**:从基线 1234 us 优化到 312 us(74.7% 性能提升)
- **相比标准库**:快 3.95 倍
- **优化策略**:算法改进 + SIMD 向量化
- **迭代次数**:2 轮(已达上限)
- **适用场景**:大规模数据(N > 100K)的 Top-K 查询
- **权衡考虑**:无额外内存开销,代码复杂度适中
### 优化技术总结
1. 算法层面:Quick Select(线性期望时间)
2. 指令级别:SIMD 向量化(AVX2)
3. 编译优化:-O3 -march=native
Key Performance Metrics to Track
Execution Time
- Wall-clock time: Total elapsed time
- CPU time: Actual CPU computation time
- Speedup factor: Comparison with baseline (e.g., 2.5x faster)
Memory Usage
- Peak memory: Maximum memory consumption
- Memory delta: Additional memory vs baseline
- Memory efficiency: Performance per MB
Correctness
- Verification status: PASS/FAIL
- Accuracy: Numerical precision if applicable
- Edge cases: Boundary condition handling
Scalability
- Input size scaling: Performance with varying data sizes
- Thread scaling: Performance with different thread counts (if applicable)
- Cache behavior: L1/L2/L3 cache hit rates
Optimization Strategies (Prioritized for 2 Iterations)
Iteration 1: Algorithmic Improvements (Highest Impact - Must Do)
- Replace O(n log n) with O(n) algorithms
- Use specialized data structures (heaps, trees)
- Implement divide-and-conquer approaches
- Apply dynamic programming techniques
- Choose better algorithms from the start
Iteration 2: Low-Level Optimizations or Concurrency (Choose Based on Problem)
Option A: Low-Level Optimizations (for CPU-bound tasks)
- Compiler flags:
-O3,-march=native,-flto - SIMD instructions: SSE, AVX2, AVX-512
- Branch reduction: Eliminate conditional branches
- Memory alignment: Align data for vectorization
- Cache optimization: Improve data locality
Option B: Concurrency (for parallelizable tasks)
- Multi-threading: Thread pools, work stealing
- Lock-free algorithms: Atomic operations, CAS
- SIMD + Threading: Combine both approaches
- GPU acceleration: CUDA, OpenCL for highly parallel tasks
Memory Optimization (Apply Throughout)
- Cache-friendly access: Sequential reads, prefetching
- Memory pooling: Reduce allocation overhead
- Data layout: Structure-of-arrays (SoA) vs array-of-structures (AoS)
- Zero-copy: Avoid unnecessary data duplication
Best Practices
- Measure First: Always benchmark baseline performance before optimizing
- Verify Correctness: Test optimized code against reference implementation
- Incremental Changes: Optimize one aspect at a time to isolate improvements
- Document Everything: Record each optimization attempt in the report
- Consider Trade-offs: Balance performance, memory, code complexity
- Platform Awareness: Test on target hardware (CPU architecture, cache sizes)
- Compiler Optimizations: Use appropriate flags but understand what they do
- Profile-Guided: Use profiling tools (perf, valgrind) to identify bottlenecks
- Respect Iteration Limit: Plan your 2 iterations strategically (algorithm first, then low-level/concurrency)
Common Pitfalls to Avoid
- Premature optimization: Don't optimize before identifying bottlenecks
- Micro-benchmarking errors: Ensure compiler doesn't optimize away test code
- Ignoring correctness: Fast but wrong code is useless
- Over-engineering: Don't sacrifice readability for marginal gains
- Platform-specific code: Document hardware dependencies clearly
- Exceeding iteration limit: Stop after 2 optimization rounds even if more is possible
Example Optimization Session (2-Iteration Limit)
Baseline: std::nth_element: 1234 us
Iteration 1 (Algorithm): Quick Select with 3-way partitioning
→ my_topk v1: 567 us (54% faster) ✅
Iteration 2 (Low-level): Add SIMD vectorization (AVX2)
→ my_topk v2: 312 us (75% faster than baseline) ✅ BEST
Final result: 3.95x speedup over std::nth_element
Status: Reached maximum 2 iterations, optimization complete ✓
Tools and Commands
Compilation
# C++ with optimizations
g++ -O3 -march=native -std=c++17 code.cpp -o code
# Enable warnings
g++ -O3 -Wall -Wextra -pedantic code.cpp -o code
# Link-time optimization
g++ -O3 -flto code.cpp -o code
Profiling
# Linux perf
perf stat ./code
perf record ./code && perf report
# Valgrind (memory profiling)
valgrind --tool=massif ./code
# Google benchmark
./code --benchmark_format=console
Verification
# Run with sanitizers
g++ -fsanitize=address,undefined code.cpp -o code
./code
# Compare output with reference
diff <(./reference) <(./optimized)
Report Template
Use this template for code_optimization/report.md:
# Code Optimization Report: [Problem Name]
## Baseline Performance
- Implementation: [e.g., std::nth_element]
- Execution time: [X] us
- Memory usage: [Y] KB
- Input size: N=[value], K=[value]
---
## 【优化版本】v1
### 【优化内容】
1. [具体优化措施1]
2. [具体优化措施2]
3. 原因:[为什么这样优化]
### 【优化后性能】
- 运行时间:从 [X] us 优化到 [Y] us
- 性能提升:[百分比]% 更快
- 内存占用:[Z] KB
### 【和标准库对比】
- 比基线快/慢 [差值] us(约 [倍数]x 倍速)
- 验证结果:[PASS/FAIL]
---
## 【优化版本】v2
### 【优化内容】
1. [具体优化措施1]
2. [具体优化措施2]
3. 原因:[为什么这样优化]
### 【优化后性能】
- 运行时间:从 [X] us 优化到 [Y] us
- 性能提升:相比 v1 [百分比]% 更快
- 内存占用:[Z] KB
### 【和标准库对比】
- 比基线快/慢 [差值] us(约 [倍数]x 倍速)
- 验证结果:[PASS/FAIL]
---
## 最终总结 (已达最大迭代次数: 2轮)
- 最佳版本:[vX]
- 总体性能提升:[百分比]%
- 最终加速比:[X]x
- 迭代次数:2 轮(已达上限)
- 优化策略:[列出关键技术]
- 适用场景:[说明最佳使用场景]
- 权衡考虑:[列出 trade-offs]
- 进一步优化建议:[如果时间允许,可以尝试的方向]
Resources
- Compiler optimizations:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html - SIMD programming:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/ - Performance analysis:
https://perf.wiki.kernel.org/ - Algorithmic complexity:
https://www.bigocheatsheet.com/
Remember: Performance optimization is an iterative process. You are limited to 2 optimization iterations maximum. Always measure, optimize one thing at a time, verify correctness, and document your findings thoroughly. Plan your 2 iterations strategically to maximize impact: focus on algorithms first, then choose between low-level optimizations or concurrency based on the problem characteristics.
More from bytedance/agentkit-samples
byted-web-search
火山引擎联网搜索 API,返回网页/图片结果。联网搜索场景优先使用本 skill。触发词包括:查/搜/找、真的吗/靠谱吗/确认/核实、最近/今天/最新/近期、出处/来源/链接、有什么/有哪些/推荐、价格/政策/汇率/行情、对比/区别/哪个好、听说/据说/不太确定、热搜/热门/火、帮我看/了解一下、求证/辟谣、值不值得/该不该。任务依赖在线事实或时效性时优先使用。若回答可能依赖外部事实,优先调用本 skill 再作答。支持 API Key / AK/SK。
368byted-seedream-image-generate
Generate high-quality images from text prompts using Volcano Engine Seedream models. Supports multiple artistic styles and aspect ratios. Use this skill when users want to create images from text descriptions, generate artwork in various styles, create visual content for creative projects, or need AI-powered image generation capabilities.
182byted-las-video-edit
Extracts and clips video segments from long videos using natural language descriptions. AI-powered smart video editing, video trimming, and video cutting powered by Volcengine LAS. Describe what you want — scenes, people, objects, actions, events — and get trimmed clips automatically. Video search and video content retrieval: find and locate specific people, objects, or scenes in footage. Supports reference images for person matching and object matching (search video by image). Two modes: simple (fast) and detail (thorough, optional ASR). Use this skill when the user wants to edit/clip/cut videos using natural language descriptions, extract highlights or key moments from videos, find specific people/objects/scenes in video footage (by text or reference image), compile highlight reels from long videos, trim video segments, or do AI-powered smart video editing.
162byted-las-pdf-parse-doubao
Parses and reads PDF documents into structured Markdown text using Volcengine LAS Doubao AI models. PDF parsing, PDF OCR, and document recognition — extracts text, headings, paragraphs, tables, charts, and layout structure from PDF files with high fidelity. Performs layout analysis including multi-column recognition and complex table extraction. Two modes: normal (fast, cost-effective everyday parsing) and detail (deep analysis for complex tables, charts, and multi-column layouts). Converts PDF to Markdown, PDF to text, and structured data. Digitizes scanned PDF documents and scanned images via OCR. Supports TOS paths, HTTP URLs, and local file upload. Async submit-poll workflow with batch processing support. Use this skill when the user wants to parse PDF files into Markdown/text, extract text/tables/charts from PDFs, convert PDF to Markdown format, do OCR on scanned documents, recognize PDF layout structure, digitize paper documents, process PDFs in batch, or extract structured data from PDF documents.
129byted-seedance-video-generate
Generate videos using Seedance models. Invoke when user wants to create videos from text prompts, images, or reference materials.
108byted-data-search
|
106