code-quality-review-all

Review All Evaluations

Review all evaluations in the repository against a single code quality standard or topic. This workflow is useful for systematic code quality improvements and ensuring consistency across all evaluations.

Workflow

Setup

If a topic has not already been provided, ask the user which topic from CONTRIBUTING.md or BEST_PRACTICES.md this review should focus on. If no identifier is given, derive a short topic identifier in a file-safe format (e.g., pytest_marks, import_patterns, test_coverage).
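
When an identifier has to be derived from a free-form topic name, a simple slug function is enough. A minimal sketch in Python (the helper name slugify_topic is illustrative, not part of the skill):

import re

def slugify_topic(topic: str) -> str:
    """Turn a free-form topic name into a file-safe identifier,
    e.g. "Pytest marks" -> "pytest_marks"."""
    return re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")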

Create the following directory structure, or read it if it already exists (a creation sketch follows the list):

  • <repo root>/agent_artefacts/code_quality/<topic_id>/ - Directory for this review topic
  • <repo root>/agent_artefacts/code_quality/<topic_id>/README.md - Documentation for this specific topic
  • <repo root>/agent_artefacts/code_quality/<topic_id>/results.json - Results of the review
  • <repo root>/agent_artefacts/code_quality/<topic_id>/SUMMARY.md - Summary of the review
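
Creating this layout is straightforward with pathlib. A minimal sketch, assuming the repository root is already known (ensure_topic_dir is an illustrative helper name):

from pathlib import Path

def ensure_topic_dir(repo_root: Path, topic_id: str) -> Path:
    """Create, or reuse, the artefact directory for one review topic."""
    topic_dir = repo_root / "agent_artefacts" / "code_quality" / topic_id
    topic_dir.mkdir(parents=True, exist_ok=True)
    return topic_dir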

README.md Structure

The README.md file should contain topic-specific information:

  • Topic Description: What this review checks for and why it matters
  • Requirements: Specific requirements from CONTRIBUTING.md or BEST_PRACTICES.md
  • Detection Strategy: How to identify issues (patterns to look for, tools to use)
  • Commands: Specific commands useful for this topic (grep patterns, pytest commands, etc.)
  • Good Examples: Code snippets showing correct implementation
  • Bad Examples: Code snippets showing common mistakes and how to fix them
  • Review Date: When this documentation was created/updated

results.json Structure

The results.json file should contain an entry for every evaluation in the <repo root>/src/inspect_evals directory, keyed by evaluation name, recording the status of each:

{
  "eval_name": {
    "as_of_date": "YYYY-MM-DD (date the code was last evaluated)",
    "status": "pass" | "fail" | "error",
    "issues": [
      {
        "issue_type": "issue type within the topic if applicable",
        "issue_location": "relative/path/from/repo/root/file.py:line_number",
        "issue_description": "clear description of the issue",
        "suggested_fix": "short description of how to fix it",
        "fix_status": "optional field added by code-quality-fix-all skill"
      }
    ]
  }
}

Important: The issue_location field should use paths relative to the repository root with forward slashes (e.g., tests/foo/test_foo.py:42 or src/inspect_evals/foo/bar.py:15, not C:\Users\...\test_foo.py:42 or tests\foo\test_foo.py:42).
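
An easy way to satisfy this requirement is to normalize every location before it is written to results.json. A minimal sketch, assuming pathlib objects for the reported file (to_issue_location is an illustrative helper, not part of the skill):

from pathlib import Path

def to_issue_location(file_path: Path, line: int, repo_root: Path) -> str:
    """Format a location as a repo-root-relative, forward-slash path,
    e.g. 'src/inspect_evals/foo/bar.py:15'."""
    relative = file_path.resolve().relative_to(repo_root.resolve())
    return f"{relative.as_posix()}:{line}"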

General Guidelines for All Topics

  1. Systematic Approach: Review all evaluations in a consistent manner. Use scripts or automated tools where possible to ensure completeness (a scan sketch follows this list).

  2. Clear Issue Reporting: Each issue should include:

    • The specific file and line number where the issue occurs
    • A clear description of what's wrong
    • A concrete suggestion for how to fix it
    • The issue type for categorization
  3. Verification: After identifying potential issues, verify a sample of them by reading the actual files to ensure accuracy.

  4. Statistics: Provide summary statistics including:

    • Total evaluations reviewed
    • Number passing/failing
    • Breakdown of issue types
    • Most affected evaluations
  5. Prioritization: Identify which issues are most critical or affect the most evaluations to help guide remediation efforts.

  6. Reusability: Write scripts and documentation that can be rerun easily as the codebase evolves. Include any helper scripts in the topic directory.

  7. False Positives: Be aware that automated detection may produce false positives. When possible, include logic to reduce these or document known limitations.
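
As an illustration of guidelines 1 and 4, a scan script typically walks every evaluation directory, applies the topic-specific detection logic, and tallies the results. A minimal sketch; find_issues stands in for whatever detection the topic's README.md describes and is an assumption, not a provided function:

from collections import Counter
from datetime import date
from pathlib import Path

def review_all(repo_root: Path, find_issues) -> dict:
    """Run a topic-specific checker over every evaluation.

    find_issues(eval_dir) should return a list of issue dicts in the
    results.json format described above.
    """
    results = {}
    for eval_dir in sorted((repo_root / "src" / "inspect_evals").iterdir()):
        if not eval_dir.is_dir() or eval_dir.name.startswith("_"):
            continue
        issues = find_issues(eval_dir)
        results[eval_dir.name] = {
            "as_of_date": date.today().isoformat(),
            "status": "pass" if not issues else "fail",
            "issues": issues,
        }
    # Summary statistics for SUMMARY.md (guideline 4)
    statuses = Counter(entry["status"] for entry in results.values())
    print(f"Reviewed {len(results)} evaluations: {dict(statuses)}")
    return results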

Workflow Steps

  1. Create the directory structure: agent_artefacts/code_quality/<topic_id>/
  2. Handle any existing results.json if re-running the review (a merge sketch follows these steps):
    • Read existing results.json
    • Note issues that have "fix_status" field (were attempted to be fixed)
    • After scanning current code state, compare with existing results
    • Remove entries for issues that no longer exist (they were successfully fixed)
    • Keep entries for issues that still exist, preserving "fix_status" if present
    • Add new entries for newly discovered issues
  3. Create or use automated tools to scan all evaluations in src/inspect_evals/
  4. Organize findings by evaluation name
  5. Write topic-specific documentation to README.md
  6. Write results to results.json with relative paths:
    • Include all currently detected issues
    • IMPORTANT: Preserve "fix_status" field for issues that still exist
    • Remove issues that are no longer detected in the code
  7. Create a SUMMARY.md file in the topic directory with:
    • Overview of findings
    • Key statistics
    • Most affected evaluations
    • Recommendations for remediation
    • Impact analysis
  8. If you created helper scripts, save them in the topic directory for future use
  9. Inform the user that the review is complete and where to find the results
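
The trickiest part of steps 2 and 6 is merging fresh findings with an existing results.json without losing fix_status. A minimal sketch of that merge, keyed on issue_location within each evaluation (the keying choice is an assumption, not something the skill mandates):

import json
from pathlib import Path

def merge_results(results_path: Path, new_results: dict) -> dict:
    """Carry fix_status over to issues that are still present; issues that
    are no longer detected are dropped by writing only new_results."""
    old = json.loads(results_path.read_text()) if results_path.exists() else {}
    for eval_name, entry in new_results.items():
        old_issues = {
            issue["issue_location"]: issue
            for issue in old.get(eval_name, {}).get("issues", [])
        }
        for issue in entry["issues"]:
            previous = old_issues.get(issue["issue_location"])
            if previous and "fix_status" in previous:
                issue["fix_status"] = previous["fix_status"]
    results_path.write_text(json.dumps(new_results, indent=2) + "\n")
    return new_results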

Skill Responsibilities

This skill (code-quality-review-all) owns results.json and has full control:

  • Add new issues when detected
  • Update existing issues if location/description changes
  • Remove issues that no longer exist in the codebase
  • Preserve "fix_status" field when updating issues
  • Update evaluation status (pass/fail) based on current findings

The code-quality-fix-all skill has limited control:

  • Can ONLY add/update the "fix_status" field on existing issues
  • Cannot remove entries from results.json
  • Relies on this skill to verify fixes and remove resolved issues

This separation ensures:

  • Clear ownership of results.json
  • Fix skill focuses on fixing, not determining what's fixed
  • Review skill has authoritative view of current code state

Expected Output

After running this workflow, you should have:

agent_artefacts/code_quality/<topic_id>/
├── README.md           # Topic-specific documentation
├── results.json        # Detailed results for all evaluations
├── SUMMARY.md          # Executive summary
└── <helper_scripts>    # Optional: automated checker scripts