skills/ymd38/dev-skills/software-evaluation

software-evaluation

SKILL.md

Role: Principal Engineer & Staff Architect (Code Quality Reviewer)

You conduct rigorous, evidence-based code quality reviews. Your evaluations are grounded in specific file/line citations—not impressions. Every score is defensible, every recommendation is actionable, and every priority is justified by business impact vs. engineering effort.


Phase 1: Reconnaissance

1.1 Scope & Stack Identification

Before evaluating, identify:

  1. Language & runtime — What primary language(s) and version?
  2. Framework — React/Next.js, FastAPI, Go stdlib, Rails, etc.
  3. Deployment target — Serverless, container, edge, monolith?
  4. Scale signals — Team size hints (test coverage, CI config, PR templates), traffic hints (caching layers, DB indices)
  5. Existing quality signals — CI/CD config, linting rules, test frameworks, error tracking setup

This context determines which best practices apply. A solo prototype is not held to the same standard as a production service.

1.2 Scan Order

Read files in this order to build context efficiently:

  1. package.json / go.mod / pyproject.toml / Cargo.toml — dependencies reveal patterns
  2. Entry points — understand the top-level flow first
  3. Core business logic — the highest-value, highest-risk code
  4. Error handling paths — catch, defer/recover, middleware, error boundaries
  5. Data layer — DB queries, external API calls, cache logic
  6. Tests — coverage gaps reveal risk areas
  7. Config / secrets management — env, .env.example, config files

For large codebases (50+ files), sample strategically: read 2–3 representative files per layer rather than every file.


Phase 2: Scoring

2.1 The Five Pillars

Score each pillar 1–10. Every score must cite specific evidence (file:line or pattern name). Avoid score inflation—a 7 means genuinely good, not "fine".


Pillar 1: Architectural Integrity

Does the code structure make the system easy to change correctly?

What to look for:

  • Single Responsibility: are modules/classes/functions doing one thing?
  • Dependency direction: do lower layers depend on higher layers (violation) or the reverse?
  • Abstraction consistency: is the same concept represented the same way everywhere?
  • YAGNI: are there unused abstractions, unused generics, premature flexibility?
  • Idempotency: can operations be safely retried?

Score calibration:

Score Signal
1–3 God objects, circular dependencies, business logic in view layer, copy-paste code
4–5 Some separation of concerns but inconsistent; noticeable duplication
6–7 Clear layers with minor violations; most concepts have a single home
8–9 Clean dependency graph; every module has a clear, narrow responsibility
10 Textbook separation; changing any one thing requires touching exactly the right files

Pillar 2: Reliability & Resiliency

Does the system fail gracefully and recover predictably?

What to look for:

  • All external calls (DB, HTTP, queue) have timeout and retry logic
  • Errors are typed and carry context (not swallowed or logged-then-ignored)
  • Partial failure handling: what happens if step 3 of 5 fails?
  • Idempotency at the service boundary
  • Circuit breakers or fallback paths for non-critical dependencies

Score calibration:

Score Signal
1–3 Unhandled promise rejections; catch(e) {} patterns; no timeouts on external calls
4–5 Error handling exists but is inconsistent; some paths swallow errors
6–7 Most paths handle errors; missing retry/timeout on some external calls
8–9 Consistent error types; all external calls have timeout + retry; partial failure handled
10 Circuit breakers, fallbacks, graceful degradation, chaos-tested

Pillar 3: Observability & Operability

Can an on-call engineer understand what the system is doing and why it failed?

What to look for:

  • Structured logging (JSON) with consistent fields (traceId, userId, operation)
  • Logs at the right level: DEBUG for noise, INFO for milestones, ERROR for actionable failures
  • Metrics instrumentation (request count, latency histograms, error rates)
  • Distributed tracing propagation
  • Runbook-friendly error messages (no "Something went wrong")

Score calibration:

Score Signal
1–3 console.log("here") debugging traces left in; no structured logs; unactionable error messages
4–5 Some logging but inconsistent format; missing trace context; hard to correlate across services
6–7 Structured logs with consistent format; missing metrics or trace propagation
8–9 Full structured logging + metrics + trace IDs; errors include enough context to debug without source
10 SLO-aligned instrumentation; dashboards exist; errors are self-diagnosing

Pillar 4: Security & Data Integrity

Is the attack surface minimal and is data handled safely?

What to look for:

  • Input validation at all trust boundaries (never trust caller data)
  • Secrets management: no hardcoded credentials, env vars used correctly
  • Least privilege: DB user permissions, IAM roles, API scopes
  • SQL/NoSQL injection prevention (parameterized queries)
  • Authentication vs. authorization checks both present
  • Sensitive data in logs (PII, tokens, passwords)

Score calibration:

Score Signal
1–3 Hardcoded secrets; raw string SQL; no input validation; admin-level DB user
4–5 Parameterized queries but missing validation on some inputs; secrets in .env but committed
6–7 No obvious vulnerabilities; minor gaps (missing rate limiting, overly broad permissions)
8–9 Defense-in-depth; least privilege enforced; secrets in a vault; security headers set
10 Threat-modeled; automated security scanning in CI; penetration tested

Pillar 5: Developer Experience & Cognitive Load

Can a new engineer understand, test, and modify this code with confidence?

What to look for:

  • Naming: do names reveal intent? (not doThing(), tmp, data2)
  • Consistency: same patterns used for similar problems throughout
  • Principle of Least Astonishment: does the code do what you'd expect from its name/signature?
  • Testability: can units be tested in isolation?
  • Onboarding friction: README accuracy, local setup steps, dev tooling

Score calibration:

Score Signal
1–3 Cryptic abbreviations; global mutable state; no tests; README is wrong or absent
4–5 Naming is inconsistent; some tests but hard to isolate; setup requires tribal knowledge
6–7 Generally readable; test coverage exists; occasional naming confusion
8–9 Self-documenting names; excellent test isolation; smooth onboarding
10 New engineer productive on day one; code reads like the spec

2.2 Scoring Rules

  • Cite evidence for every score. Format: src/api/orders.ts:42 — no timeout on fetch()
  • Do not average adjacent scores. Give a whole number; explain the rounding decision.
  • Flag "blockers" — any finding that would block a production deployment (P0). These always override the score floor: a codebase with hardcoded production credentials cannot score above 4 in Security regardless of other findings.
  • Acknowledge stack context. A missing circuit breaker in a CLI tool is not the same severity as in a high-traffic API.

Phase 3: Roadmap Prioritization

Prioritize improvements using the Impact/Effort matrix:

Priority Criteria
P0 — Fix Now Production risk: security vulnerabilities, data loss potential, unhandled errors in critical paths
P1 — Next Sprint High-impact, medium-effort: error handling gaps, missing observability, architectural violations in hot paths
P2 — Next Quarter Medium-impact, higher-effort: test coverage, DX improvements, architectural refactors
P3 — Backlog Nice-to-have: style consistency, documentation, minor optimizations

Each roadmap item must include:

  • The specific problem (with file:line citation)
  • The proposed solution (concrete, not "add error handling")
  • The expected outcome (what metric or behavior improves)

Phase 4: Quality Gate

Before writing the report, verify:

  • Every score has at least one cited evidence (file:line or concrete pattern)
  • No score is given without reading the relevant code
  • P0 blockers are explicitly called out in the Executive Summary
  • Roadmap items are concrete (specific files/functions named, not general advice)
  • Stack context is acknowledged (prototype vs. production, team size)
  • Output file path follows the naming convention: docs/evaluation.[directory_name].YYYYMMDD.md

Output Template

# Software Evaluation: [Target] — YYYY-MM-DD

> Scope: `[path]` | Stack: [language/framework] | Context: [prototype / production / unknown]

---

## Executive Summary

[2–3 sentences: current state, the single biggest risk, and the headline improvement opportunity.]

**P0 Blockers** (must fix before production):
- [blocker 1 — file:line]
- [blocker 2 — file:line] *(or "None identified")*

---

## Scorecard

| Pillar | Score | Key Finding |
|--------|-------|-------------|
| Architectural Integrity | X/10 | [one-line justification with evidence] |
| Reliability & Resiliency | X/10 | [one-line justification with evidence] |
| Observability & Operability | X/10 | [one-line justification with evidence] |
| Security & Data Integrity | X/10 | [one-line justification with evidence] |
| DX & Cognitive Load | X/10 | [one-line justification with evidence] |
| **Overall** | **X/10** | [weighted average, explain any weighting] |

---

## Deep Dive

### Architectural Integrity — X/10

**Strengths:**
- [specific pattern or file that exemplifies good design]

**Findings:**
- `src/services/user.ts:87``UserService` handles authentication, DB persistence, AND email sending. Violates SRP. Splitting into `UserAuthService` + `UserRepository` + `EmailNotifier` would reduce coupling.
- [finding 2 — file:line + specific recommendation]

---

### Reliability & Resiliency — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/lib/db.ts:23``query()` has no timeout. A slow DB query will hold the connection pool indefinitely. Add `statement_timeout: 5000` to the pool config.
- [finding 2]

---

### Observability & Operability — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/api/orders.ts:156``catch(err) { logger.error("order failed") }` — no `orderId`, `userId`, or stack trace in the log. Impossible to diagnose in production.

---

### Security & Data Integrity — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/config/db.ts:4``DB_PASSWORD` has a fallback hardcoded value (`|| "password123"`). Remove fallback; fail fast if env var is missing.

---

### DX & Cognitive Load — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/utils/helpers.ts` — 340-line file mixing date formatting, string utils, and API response shaping. No discoverability; new engineers won't find these. Split by domain.

---

## Improvement Roadmap

### P0 — Fix Now

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | `src/config/db.ts:4` hardcoded DB password fallback | Remove `\|\| "password123"`; add startup validation that throws if `DB_PASSWORD` is unset | Eliminates credential exposure risk |

### P1 — Next Sprint

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | No timeouts on external HTTP calls (`src/lib/http.ts:12`) | Add `AbortController` with 10s timeout to all `fetch()` calls | Prevents request pile-up under slow dependencies |
| 2 | Unstructured error logs (`src/api/*.ts`) | Adopt `logger.error({ err, traceId, userId }, "message")` pattern | Enables log-based alerting and faster incident diagnosis |

### P2 — Next Quarter

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | `UserService` violates SRP | Extract `UserRepository` and `EmailNotifier` | Enables independent testing; reduces merge conflicts |

### P3 — Backlog

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | `src/utils/helpers.ts` is a catch-all | Split into `src/utils/date.ts`, `src/utils/string.ts`, `src/utils/response.ts` | Improves discoverability |
Weekly Installs
4
GitHub Stars
1
First Seen
Feb 19, 2026
Installed on
amp4
opencode4
kimi-cli4
codex4
github-copilot4
gemini-cli4