task-checker
Task Checker
Goal
Turn “is this plan task actually done?” into a reviewable evidence chain:
- Per-task verification: confirm goals/acceptance criteria/deliverables exist in code and observable behavior
- Per-task drift detection: confirm the task description still matches the implementation (stale / diverged / re-scoped)
- Per-task test review: assess whether tests cover key success paths and failure/edge cases; propose minimal additions
Hard Rules (No False Positives)
These rules exist to prevent “looks done” audits that miss real gaps.
1) No fabricated or “hand-wavy” evidence
Disallowed evidence patterns (treat as a GAP):
- “Implicitly tested via integration” without a concrete test file + assertion that exercises the exact acceptance point
- “Build succeeds” / “tests pass” without actually running a command and recording its output (or finding a committed log with the exact output)
- “Git evidence shows…” unless you actually ran git commands and cited concrete commits/changes (do not assume history)
- Linking to a directory or repo root as evidence; evidence must point to a file and (usually) a line range
- “Task marked complete / passes:true in plan” as evidence; plan metadata is not implementation proof
2) Completion requires closure of acceptance points
- If any acceptance point lacks strong evidence, the task cannot be “Completed”.
- If an acceptance point is not verifiable as written, mark “Mis-specified” and list what must change in the plan to make it auditable.
3) Intermediate tasks are not “Completed” when superseded
If the plan has an explicit intermediate step (e.g., “hardcode X as a bridge”), and the codebase no longer contains that intermediate behavior:
- Mark completion as “Not Done (superseded)” and description accuracy as “Drifted”.
- Explain what replaced it and propose a plan revision (split/merge/retire the intermediate task).
4) Disabled tests never count as passing evidence
- DISABLED tests are always a gap, not coverage.
- If the only test coverage for a criterion is DISABLED (or “no crash”), the task cannot be “Completed”.
5) TODO/stub detectors gate completion
If the relevant code contains TODOs/stubs that directly relate to the task’s promised semantics (e.g., “Parse CONNECTIONS”):
- You must call it out as a gap.
- If the task’s acceptance points imply those semantics exist, the task is at most “Partial”.
Preconditions (must hold)
- The project provides a JSON plan that follows the writing-plans-plus schema (“plan file”)
- At minimum: plan metadata + a tasks list
- Each task includes: id, title, description (or equivalent), and at least one acceptance/done criteria field (e.g., acceptance_criteria / validation_criteria / done_definition)
If the plan does not meet the minimum structure: do not guess. Output a “plan-structure gap list” and require the missing fields before auditing.
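As a sketch, this precondition check can be mechanized. The field names below assume the writing-plans-plus shape described above; adjust them to the actual schema.

```python
# Sketch: precondition check for a plan file. Field names assume the
# writing-plans-plus shape (id/title/description + one acceptance field).
ACCEPTANCE_FIELDS = ("acceptance_criteria", "validation_criteria", "done_definition")

def plan_structure_gaps(plan: dict) -> list[str]:
    """Return a 'plan-structure gap list'; an empty list means the plan is auditable."""
    gaps = []
    if not isinstance(plan.get("tasks"), list):
        gaps.append("plan: missing 'tasks' list")
        return gaps
    for i, task in enumerate(plan["tasks"]):
        label = f"task[{i}] (id={task.get('id', '?')})"
        for field in ("id", "title", "description"):
            if not task.get(field):
                gaps.append(f"{label}: missing '{field}'")
        if not any(task.get(f) for f in ACCEPTANCE_FIELDS):
            gaps.append(f"{label}: no acceptance/done criteria field")
    return gaps
```

If the returned list is non-empty, report it and stop: do not guess at missing structure.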
Inputs
User should provide:
- Plan file path (required)
- Audit scope (optional):
- all tasks (default)
- specific task ids (e.g., [1,2,5])
- Audit depth (optional):
- static evidence only (default): code/tests/docs/scripts
- include local dynamic verification: run minimal necessary tests locally (ignore CI)
- Update plan file (optional):
- update_plan: boolean (default: false) - whether to write audit results back to the plan JSON
- write_mode: "minimal" | "full" (default: "minimal") - "minimal" only updates passes/issue; "full" also adds notes
Core Method (follow this order)
1) Read and “structure” the plan
- Read the plan file and extract:
- plan goal/scope/assumptions (goal/description/architecture/depends_on, etc.)
- task list: id/title/description/steps/criteria/deliverables/dependencies
- Convert each task into an audit checklist:
- intent: what behavior/artifact must change
- deliverables: which files/modules/interfaces should exist or change
- acceptance points: which observable facts must hold (build success is not completion)
- Record plan risk signals:
- acceptance criteria based on log strings/output text
- mandatory “temporary hardcode/intermediate stage” steps likely to be skipped later
- missing or non-verifiable criteria (e.g., “looks better”)
- Treat any “passes: true” fields as plan bookkeeping only; they are not evidence.
2) Build an evidence model per task
For each task, attempt to collect at least these evidence types:
- code evidence: where implemented, where invoked, whether it is reachable
- test evidence: which tests cover it, whether assertions are meaningful
- runtime evidence: reproducible commands/scripts/test cases that validate key paths
- doc/plan consistency evidence: whether README/guide/plan notes match implementation
Map each acceptance point to one or more evidence items, producing an “acceptance → evidence” table.
2.1 Evidence strength grading (use consistently)
Grade each acceptance point’s evidence:
- Strong: direct code evidence + a test with meaningful assertions OR runtime check with machine-checkable outputs
- Medium: direct code evidence + indirect validation (runtime log excerpt only, or a test that only partially asserts)
- Weak: code exists but no proof it is correct/reachable/covered
- None: no evidence
Task completion gates:
- Completed: all acceptance points are Strong (or a clearly justified mix of Strong + at most one Medium) AND no hard-rule violations
- Partial: at least one acceptance point is Medium/Weak/None, or there is a TODO/stub gap, or coverage is Borderline/Inadequate
- Not Done: core deliverables missing, acceptance contradicted, unreachable code, or intermediate task has been removed without plan update
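The grade-based part of these gates can be sketched mechanically. Note that several “Not Done” conditions (missing deliverables, unreachable code, superseded intermediates) require judgment beyond grades, so this only encodes the grade arithmetic:

```python
# Sketch of the completion gates, given one evidence grade per acceptance
# point ("Strong" / "Medium" / "Weak" / "None"). Hard-rule violations and
# TODO/stub gaps are passed in as flags determined by the audit itself.
def completion_status(grades: list[str],
                      hard_rule_violation: bool = False,
                      todo_stub_gap: bool = False) -> str:
    if not grades or all(g == "None" for g in grades):
        return "Not Done"
    # All Strong, or Strong with at most one (justified) Medium.
    strong_enough = all(g == "Strong" for g in grades) or (
        grades.count("Medium") == 1
        and all(g in ("Strong", "Medium") for g in grades))
    if strong_enough and not hard_rule_violation and not todo_stub_gap:
        return "Completed"
    return "Partial"
```

Anything short of the Completed bar falls to Partial here; downgrading Partial to Not Done (e.g., when core deliverables are absent) stays a manual call.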
3) Gather and validate evidence (do not stop at “found a file”)
3.1 Code evidence
- Prefer semantic search to locate implementation entry points:
- class/function definitions
- key fields/config/CLI flags
- files mentioned by the plan (if any)
- For each acceptance point, confirm at least:
- implementation truly exists (not a stub, not TODO-only, not “return true”)
- there is an invocation path (main flow can actually reach it)
- behavior matches the task description (e.g., “parse CONNECTIONS” really builds connectivity)
- if the criterion is a log/output string, require BOTH:
- code evidence that prints/emits the exact string, AND
- runtime evidence (or a test assertion) that the string actually appears
- Produce evidence links:
- always provide navigable file links with line ranges (file:///...#Lx-Ly)
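A tiny helper for that link format (the file:/// scheme and #Lx-Ly anchor follow the convention named above):

```python
# Sketch: build a navigable evidence link with a line range,
# e.g. file:///repo/src/parser.cc#L120-L134
def evidence_link(path: str, start: int, end: int) -> str:
    return f"file:///{path.lstrip('/')}#L{start}-L{end}"
```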
3.2 Test evidence
- Search tests related to the task:
- unit tests (same module/function)
- integration tests (end-to-end / main flow)
- Decide whether the tests are “countable” evidence:
- tests that only do SUCCEED() / “no crash” are not semantic completion evidence
- DISABLED tests (external env dependent) are coverage gaps, not coverage
- scripts without assertions or machine-checkable outputs are not sufficient validation
- Coverage assessment (answer at least):
- success path: typical valid inputs produce expected artifacts/state
- failure path: invalid/missing fields/boundaries are rejected or degraded properly
- regression risk: plan-stated compatibility points are pinned by tests (format/interface/output)
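The countability test can be approximated heuristically. The patterns below assume googletest-style naming (DISABLED_ prefix, SUCCEED(), ASSERT_/EXPECT_ macros); adapt them to the project's actual framework.

```python
import re

# Sketch: decide whether a test counts as semantic completion evidence.
# Heuristic only; the DISABLED_ / SUCCEED() / ASSERT_ / EXPECT_ conventions
# are googletest-style assumptions.
def is_countable_evidence(test_name: str, test_body: str) -> bool:
    if test_name.startswith("DISABLED_"):
        return False  # disabled tests are coverage gaps, not coverage
    # Assertion-free or SUCCEED()-only bodies prove "no crash", not semantics.
    has_assertion = re.search(r"\b(ASSERT_\w+|EXPECT_\w+|assert\b)", test_body)
    only_succeed = re.fullmatch(r"\s*SUCCEED\(\);?\s*", test_body)
    return bool(has_assertion) and not only_succeed
```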
3.3 Dynamic verification (optional but strongly recommended)
If the project supports local test execution:
- Prefer the minimal set:
- project-native test commands (pick the smallest relevant subset)
- examples:
pytest -q, npm test, go test ./..., cargo test, ctest --output-on-failure
- If the plan uses “manual command verification”, collapse it into repeatable checks:
- e.g., run command + check key output + verify artifact files exist
- Record an “executed checks list”:
- commands run
- key outputs / exit codes
- which task acceptance point each check supports
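One way to make each entry in the executed checks list reproducible is to capture command, exit code, and key output together. The command and acceptance-point label below are illustrative:

```python
import subprocess
import sys

# Sketch: run one minimal check and record it for the "executed checks list".
def run_check(cmd: list[str], supports: str) -> dict:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": " ".join(cmd),
        "exit_code": proc.returncode,
        "output_tail": proc.stdout[-2000:],  # keep the key output, not megabytes
        "supports": supports,                # which acceptance point this backs
    }

# Hypothetical example entry (a trivial command standing in for a real test run).
checks = [run_check([sys.executable, "-c", "print('ok')"],
                    supports="task 3: parser smoke test")]
```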
3.4 Data-shape validation (required for data-driven parsers)
If tasks involve parsing/loading structured inputs (JSON/XML/IR/etc.):
- Locate the canonical sample inputs in-repo referenced by docs/benchmarks/tests.
- Compare the sample’s shape with the parser’s branching logic (arrays vs objects, field names, nesting).
- If there is a shape mismatch that would drop data silently (e.g., ignoring non-PATTERN entries in an array), it is a high-risk gap and blocks “Completed”.
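A minimal sketch of this comparison, assuming a hypothetical parser that handles only certain entry types in a "connections" array (so any other type would be dropped silently):

```python
import json

# Hypothetical set of entry types the parser's branching logic handles.
HANDLED_TYPES = {"PATTERN", "CONNECTION"}

# Sketch: compare a canonical sample's shape with the parser's expectations
# and surface entries that would be silently dropped.
def shape_gaps(sample_text: str) -> list[str]:
    doc = json.loads(sample_text)
    gaps = []
    entries = doc.get("connections")
    if not isinstance(entries, list):
        gaps.append("expected 'connections' to be an array")
        return gaps
    for i, entry in enumerate(entries):
        t = entry.get("type") if isinstance(entry, dict) else None
        if t not in HANDLED_TYPES:
            gaps.append(f"connections[{i}]: type {t!r} would be silently dropped")
    return gaps
```

Any non-empty result here is a high-risk gap and blocks “Completed”.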
4) Decide completion per task (strict, explainable, reviewable)
Assign one of these statuses per task:
- Completed: every acceptance point has strong evidence; tests cover key success + main failure/edge cases; no obvious TODO/stub/bypass implementation
- Partial: some points met, but key gaps exist (e.g., connectivity not built, skeleton-only, tests disabled)
- Not Done: core deliverables or acceptance points missing, or implementation clearly contradicts the description
Also evaluate “task description accuracy”:
- Accurate: description matches implementation
- Drifted: implementation changed but plan/docs still reflect an older state (common around “stub/hardcode” phases)
- Mis-specified: the plan is not reasonable or not verifiable (e.g., acceptance based on log strings)
5) Rate test adequacy and propose minimal additions
For each task, rate tests (Adequate / Borderline / Inadequate) and output a “minimal test additions list”:
- specify test intent (which acceptance point / failure path)
- specify suggested location (reuse existing test framework and directories)
- specify assertion type (state/artifact/structure/behavior), avoid “no crash only”
6) Write back to plan (if requested)
If update_plan: true:
- Load the writing-plans-plus SKILL first:
- Invoke: Skill: writing-plans-plus - this ensures the correct schema is followed for task updates
- For each audited task, update the JSON with ONLY these fields:
- passes: boolean - true if Completed; false if Partial / Not Done / superseded
- issue: array (only if issues were found or the task is superseded) - for superseded tasks: [“Task superseded by Task X”]; for tasks with gaps: list the specific gaps found
- completed_at: ISO timestamp (if passes: true)
- completed_by: string (if passes: true)
- notes: string (optional, summary of audit findings)
- Fields NOT to add (avoid custom extensions):
- ❌ Do NOT add verified, verification_status, evidence, verification_date
- ❌ Do NOT add custom objects or nested structures
- ❌ Do NOT add description_accuracy, superseded_by, superseded_reason as top-level fields
- ✅ Only use standard writing-plans-plus fields
- Superseded task handling:
- Set passes: false
- Add issue: [“Task superseded by Task X: <reason>”]
- Optionally add notes: “Implementation evolved to skip this intermediate step”
- Validation before write:
- Validate that the JSON is still valid after modifications
- Preserve all other existing fields (dependencies, files, etc.)
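The write-back rules above can be sketched as follows; field names assume the writing-plans-plus schema, and the serialization check stands in for full schema validation:

```python
import json

# Only standard writing-plans-plus audit fields may be written back.
ALLOWED = {"passes", "issue", "completed_at", "completed_by", "notes"}

# Sketch: apply one task's audit result, touching only allowed fields,
# preserving everything else, and checking the plan still serializes.
def apply_audit(plan: dict, task_id, result: dict) -> dict:
    extra = set(result) - ALLOWED
    if extra:
        raise ValueError(f"non-standard fields: {extra}")
    for task in plan["tasks"]:
        if task["id"] == task_id:
            task.update(result)  # other fields (dependencies, files, ...) untouched
    json.dumps(plan)  # raises if the structure is no longer JSON-serializable
    return plan
```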
Output Format (must follow)
A. Overview
- Plan file:
- Audit scope: all tasks / task ids [...]
- Overall verdict: Completed / Partial / Not Done (1–2 evidence-based sentences)
- Executed checks: list commands actually run (or “None”)
- High-risk gaps (3–6 bullets): only issues that cause false “done” conclusions
B. Per-task Audit Table
For each task, output:
- Task: id + title
- Completion: Completed / Partial / Not Done
- Description accuracy: Accurate / Drifted / Mis-specified
- Acceptance → evidence mapping:
- acceptance point → evidence grade (Strong/Medium/Weak/None) → code evidence (link) → test evidence (link) → runtime evidence (if any)
- Gaps:
- missing implementation / missing invocation path / missing connectivity semantics / missing tests / tests without assertions / disabled tests, etc.
- Minimal test additions (max 3, priority-ordered)
C. Plan Revision Suggestions (optional)
Only output if clear plan drift/mismatch is found:
- which tasks/milestones should be updated or split
- which acceptance criteria should be converted from “soft signals” to “hard evidence”
D. Plan File Updates (only when update_plan: true)
- Tasks updated:
- Tasks marked completed:
- Tasks marked failed/superseded:
- Fields modified: passes, issue, completed_at, completed_by, notes (as applicable)
- Validation: JSON schema validated against writing-plans-plus requirements
Example Usage Patterns
Audit only (no write):
User: “Check plan file docs/plans/phase1.json”
→ Outputs audit report only, does NOT modify JSON
Audit with write-back:
User: “Check plan file docs/plans/phase1.json and update it”
→ Outputs audit report AND updates JSON fields per writing-plans-plus schema
Notes (common sources of false positives)
- “Builds / returns true / prints logs” is not completion: verify core semantics (e.g., real connectivity graph built)
- DISABLED tests or SUCCEED()-only tests are coverage gaps, not coverage
- Scripts that run without assertions or stable output checks are not sufficient validation
- If intermediate phases (e.g., hardcoding) are skipped during evolution, mark as Drifted and recommend plan updates
Audit Self-Check (before finalizing)
If you wrote any of the following, you must replace it with concrete evidence or mark a gap:
- “implicit / likely / probably / should be” (without evidence)
- “Build succeeds” (without executed command output)
- “Test passes” (without citing an actual test + assertion, or test output)
- “Verified via integration” (without pointing to the integration test code or runtime log)
- “Git shows” (without commands and concrete results)