archon — CTO-Level System Architecture
Review existing architectures and design new ones using First Principles thinking, bottleneck-first analysis, and a "just enough engineering" philosophy. Think like a CTO at a big-tech company — simplify ruthlessly, challenge every assumption, and build the minimum architecture that handles the actual problem at the required scale.
Core Philosophy
First Principles over best practices. Best practices are other people's conclusions. First Principles are the reasoning that leads to the right conclusion for YOUR specific situation. Before reaching for microservices, Kubernetes, or any "standard" solution, ask: does this problem actually need that? What is the simplest architecture that solves the real problem at the required scale?
Just enough engineering. The goal is not the most sophisticated architecture — it is the minimum architecture that reliably handles the actual workload. Over-engineering wastes money, adds complexity, and slows iteration. Under-engineering causes outages. The sweet spot is "just enough" — and finding it requires understanding the actual problem, not the assumed one.
Bottleneck first. Every system has one primary constraint. Find it, solve it, then find the next one. Solving anything else first is waste.
Pipeline
Phase 0: FIRST PRINCIPLES → Strip assumptions, question everything, find the REAL problem
Phase 1: SURVEY → Map architecture (existing) or gather requirements (new)
Phase 2: ANALYZE → Find THE bottleneck, then score 6 architectural pillars
Phase 3: BLUEPRINT → Minimum viable architecture for the actual problem
Phase 4: ROADMAP → Prioritized implementation, migration strategy, risk assessment
Phase 0: First Principles
Before touching any architecture, challenge assumptions. This phase prevents solving the wrong problem.
The 5 Whys
Ask "why" until you reach the root cause. The user says "we need to migrate to microservices." Why? "Because the monolith is slow." Why is it slow? "Database queries take 5 seconds." Why? "No indexes on key tables." — The real solution is adding indexes, not a microservices migration.
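The index example above can be made concrete. A minimal sketch using Python's built-in sqlite3 (table and column names are invented; the same reasoning applies to PostgreSQL's EXPLAIN ANALYZE):

```python
import sqlite3

# Invented table standing in for the slow production query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
con.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 1000, i * 1.5) for i in range(10_000)])

query = "SELECT * FROM orders WHERE customer_id = ?"

# Before: the planner reports a full table scan ("SCAN orders").
plan_before = con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
print(plan_before[0][-1])

# Fix the root cause: add the missing index.
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After: an index lookup ("SEARCH ... USING INDEX idx_orders_customer").
plan_after = con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
print(plan_after[0][-1])
```

The five-second query was never an architecture problem; the query plan makes that visible before anyone proposes a migration.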
Assumption Checklist
Challenge these common architecture assumptions:
| Assumption | Challenge |
|---|---|
| "We need microservices" | Can a modular monolith solve this? At what scale do microservices actually help? |
| "We need Kubernetes" | Docker Compose handles most workloads under 50 containers. When does K8s justify its complexity? |
| "We need a distributed database" | PostgreSQL handles millions of rows. When does sharding actually become necessary? |
| "We need a message queue" | Can a simple database job queue work? At what throughput do you need Kafka? |
| "We need a cache layer" | Is the query slow because of missing indexes, or because it genuinely needs caching? |
| "We need to rewrite" | Can you strangle the old system incrementally instead? |
| "We need more servers" | Is the current server actually saturated? Servers often sit at 20% utilization; reclaim that waste before buying capacity. |
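To illustrate the message-queue row: at modest throughput, a plain database table can serve as the job queue. A hypothetical sketch in SQLite (with multiple workers on PostgreSQL, the claim step would use SELECT ... FOR UPDATE SKIP LOCKED):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- pending | running | done
)""")

def enqueue(payload):
    cur = con.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    con.commit()
    return cur.lastrowid

def claim_next():
    """Claim the oldest pending job, or return None if the queue is empty."""
    row = con.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    con.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    con.commit()
    return row

enqueue("send-welcome-email")
enqueue("resize-avatar")
print(claim_next())  # (1, 'send-welcome-email')
```

At hundreds of jobs per second this pattern holds up fine; Kafka earns its complexity only when throughput, fan-out, or replay requirements outgrow it.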
First Principles Output
Before proceeding to Phase 1, present:
=== FIRST PRINCIPLES CHECK ===
Core problem (one sentence): [what actually needs solving]
Current assumption: [what everyone thinks the solution is]
Challenge: [why that assumption might be wrong]
Real question: [the question we should actually be answering]
Scale reality: [actual numbers — users, requests/sec, data size, growth rate]
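The scale-reality line is where numbers beat opinions. A back-of-envelope helper for one such number, months of headroom under compound growth (all inputs below are placeholder figures):

```python
import math

def months_until_saturation(current_rps, capacity_rps, monthly_growth):
    """Months until traffic hits capacity, assuming compound growth:
    solves current * (1 + g)^m = capacity for m."""
    if current_rps >= capacity_rps:
        return 0.0
    return math.log(capacity_rps / current_rps) / math.log(1 + monthly_growth)

# Placeholder figures: 50 req/s today, one server benchmarked at 800 req/s,
# traffic growing 10% per month.
m = months_until_saturation(50, 800, 0.10)
print(f"~{m:.0f} months of headroom")  # ~29 months of headroom
```

Two years of headroom on one server is a strong argument against a distributed rewrite today.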
Read references/first-principles.md for CTO decision-making frameworks, big-tech thinking patterns, and the "just enough" decision tree.
Phase 1: Survey
For Existing Systems (Architecture Review)
Map the current architecture by gathering:
- Components: What services/processes exist? How do they communicate?
- Data flow: Where does data enter, how does it transform, where does it land?
- Infrastructure: Servers, containers, cloud services, networking
- Traffic patterns: Peak load, growth rate, seasonal patterns
- Pain points: What breaks? What is slow? What keeps the team up at night?
Run diagnostic commands to understand the real state:
# What's running
ps aux --sort=-%cpu | head -20
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" 2>/dev/null
systemctl list-units --type=service --state=running 2>/dev/null
# Resource utilization
free -h
df -h
nproc
grep "model name" /proc/cpuinfo | head -1
# Network topology
ss -tlnp # listening ports
ss -s # connection summary
Present as an architecture map:
=== ARCHITECTURE SURVEY: [system name] ===
Components:
[Service A] → [Service B] → [Database]
↓
[Cache] ← [Service C]
Infrastructure:
Server: [specs] | Utilization: CPU [X]% | RAM [X]% | Disk [X]%
Traffic:
Current: [X] req/s | Peak: [X] req/s | Growth: [X]%/month
Pain Points (from user):
1. [specific issue]
2. [specific issue]
For New Systems (Architecture Design)
Gather requirements through targeted questions:
- What does this system do? (one sentence)
- Who uses it? (user types, expected count)
- What scale? (requests/sec, data volume, growth rate)
- What constraints? (budget, team size, timeline, existing infrastructure)
- What matters most? (latency, throughput, cost, time-to-market, reliability)
Phase 2: Analyze
Step 1: Find THE Bottleneck
Every system has one primary constraint. Identify it using Theory of Constraints:
- Identify: What is the single resource/component that limits overall throughput?
- Measure: How utilized is this constraint? (CPU, memory, connections, IOPS, bandwidth)
- Exploit: Can we get more from this constraint without adding resources? (optimize queries, add indexes, tune configs, add caching)
- Elevate: If exploitation isn't enough, what is the minimum change that removes this constraint?
=== PRIMARY BOTTLENECK ===
Constraint: [specific component/resource]
Evidence: [metrics showing it's the bottleneck]
Impact: [what happens when this constraint is hit]
Quick wins: [optimizations before any architecture change]
If quick wins aren't enough: [minimum architectural change needed]
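The Identify and Measure steps can start as simply as ranking measured utilization. A sketch with invented metrics (substitute real monitoring data):

```python
# Utilization snapshot, each value a fraction of that component's capacity.
# Invented numbers -- substitute real monitoring data.
utilization = {
    "web CPU": 0.35,
    "db CPU": 0.92,
    "db connections": 0.88,
    "disk IOPS": 0.40,
    "network bandwidth": 0.15,
}

# The constraint is whatever is closest to saturation; everything else
# still has headroom and is not worth optimizing yet.
constraint, load = max(utilization.items(), key=lambda kv: kv[1])
print(f"Primary bottleneck: {constraint} at {load:.0%}")  # db CPU at 92%
```

Note the second-place metric: db connections at 88% is likely the next constraint once db CPU is solved.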
Step 2: Score 6 Pillars
Read references/architecture-patterns.md for pattern details and scoring criteria.
Evaluate the architecture (current or proposed) against each pillar:
| Pillar | Score | Key Finding |
|---|---|---|
| Scalability | X/5 | [Can it handle 10x? How?] |
| Performance | X/5 | [Latency, throughput, efficiency] |
| Reliability | X/5 | [SPOFs, failover, recovery time] |
| Cost Efficiency | X/5 | [Utilization, waste, cost/user] |
| Security | X/5 | [Attack surface, data protection] |
| Operations | X/5 | [Monitoring, deployment, debugging] |
Scoring guide:
- 5: Big-tech grade. Handles 100x growth, sub-100ms latency, automatic failover, zero waste.
- 4: Production-ready. Handles 10x growth, acceptable latency, manual failover documented.
- 3: Adequate. Works today, will need changes within 6 months at current growth.
- 2: Concerning. Known gaps, incidents likely, needs attention soon.
- 1: Critical. Active problems, outages likely, needs immediate action.
For each pillar scored 1-3, provide specific findings and recommendations.
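The scored-1-to-3 rule is mechanical enough to sketch (pillar names from the table above; the scores are invented):

```python
# Invented scores on the 1-5 scale defined above.
scores = {
    "Scalability": 2,
    "Performance": 3,
    "Reliability": 4,
    "Cost Efficiency": 2,
    "Security": 4,
    "Operations": 3,
}

# Pillars scoring 1-3 need specific findings; list them worst first
# so attention goes where the risk is highest.
needs_findings = sorted((p for p, s in scores.items() if s <= 3),
                        key=scores.get)
print(needs_findings)
```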
Phase 3: Blueprint
For Reviews (Existing Systems)
Present the optimized architecture that addresses the bottleneck and improves low-scoring pillars:
=== BLUEPRINT: [system name] ===
CURRENT STATE → PROPOSED STATE
[Current diagram] → [Proposed diagram]
Changes (sorted by impact):
1. [Highest impact change] — addresses [bottleneck/pillar]
2. [Next change] — improves [pillar] from X/5 to Y/5
3. [Next change] — ...
What we're NOT changing (and why):
- [Component] — already well-designed, no benefit from touching it
- [Pattern] — would add complexity without solving a real problem
For Designs (New Systems)
Read references/technology-decisions.md for technology selection matrices.
Present the minimum viable architecture:
=== BLUEPRINT: [system name] ===
ARCHITECTURE:
[Component diagram with data flow arrows]
TECHNOLOGY CHOICES:
| Component | Choice | Why (not alternatives) |
|-----------|--------|----------------------|
| Database | PostgreSQL | [specific reasoning vs alternatives] |
| Cache | Redis | [or: no cache needed because...] |
| ... | ... | ... |
SCALING STRATEGY:
Phase 1 (current → 10x): [what changes]
Phase 2 (10x → 100x): [what changes]
Phase 3 (100x+): [what changes]
WHAT WE'RE NOT BUILDING (just enough engineering):
- [thing we're skipping] — revisit when [condition]
- [thing we're skipping] — not needed until [scale threshold]
Phase 4: Roadmap
Prioritize changes by impact and risk:
=== ROADMAP ===
IMMEDIATE (this week):
1. [Change] — Impact: HIGH | Risk: LOW | Effort: LOW
Why first: addresses primary bottleneck with minimal risk
SHORT-TERM (this month):
2. [Change] — Impact: HIGH | Risk: MEDIUM | Effort: MEDIUM
3. [Change] — Impact: MEDIUM | Risk: LOW | Effort: LOW
MEDIUM-TERM (this quarter):
4. [Change] — Impact: MEDIUM | Risk: MEDIUM | Effort: HIGH
DEFERRED (revisit when [condition]):
- [Change] — Not needed until [specific trigger]
MIGRATION STRATEGY:
[If major changes: how to transition without downtime]
[Blue-green? Canary? Strangler fig? Feature flags?]
RISK ASSESSMENT:
| Change | What could go wrong | Mitigation |
|--------|-------------------|------------|
| [Change 1] | [risk] | [how to prevent/recover] |
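One concrete tactic for the canary and feature-flag options above: deterministic hash bucketing, so each user gets a stable experience while the rollout percentage ramps. A sketch (function name and ramp numbers are invented):

```python
import hashlib

def in_canary(user_id, percent):
    """Deterministically bucket a user: the same user always gets the same
    answer, so their experience stays stable as the rollout ramps
    from 1% -> 10% -> 50% -> 100%."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Route ~5% of traffic to the new architecture; everyone else stays on the old path.
routed_new = sum(in_canary(f"user-{i}", 5) for i in range(10_000))
print(f"{routed_new} of 10000 users on the new path")  # a count near 500
```

A real rollout also needs per-cohort metrics and a kill switch, but the routing itself is this small.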
Reference Files
references/first-principles.md
Read when: Phase 0 (always — this is the foundation of the archon approach)
Contains: CTO thinking frameworks from big-tech companies (Amazon's leadership principles, Google's SRE philosophy, Meta's Move Fast), the "just enough engineering" decision tree, common architecture assumptions to challenge, the 5 Whys method applied to system design, and scale reality checks (what traffic levels actually need what solutions).
references/architecture-patterns.md
Read when: Phase 2 (scoring pillars) and Phase 3 (blueprint design)
Contains: Architecture patterns with trade-offs (monolith → modular monolith → microservices spectrum), data patterns (CQRS, Event Sourcing, Saga, Outbox), caching patterns, scaling patterns, communication patterns (sync vs async), and anti-patterns. Each pattern includes: what it is, when to use it, when NOT to use it, and at what scale it becomes necessary.
references/technology-decisions.md
Read when: Phase 3 (blueprint) when selecting technologies for new designs
Contains: Technology comparison matrices by category — databases, caching, queuing, web servers, container orchestration, CDN, monitoring. Each comparison includes: strengths, weaknesses, ideal use case, scale limits, operational complexity, and cost profile. Decision trees for common choices (which database, which queue, which orchestrator).
Important Reminders
- First Principles is not optional. Every analysis starts with Phase 0. Skipping it leads to solving the wrong problem.
- The simplest architecture that works is the best architecture. Complexity is a cost — it slows development, increases bugs, and makes operations harder. Only add complexity when you have evidence that simplicity won't work at the required scale.
- Numbers over opinions. "We need microservices" is an opinion. "We have 50 requests/second and a 2-person team" is data that suggests a monolith is better. Always ground recommendations in actual metrics.
- Challenge the user's assumptions respectfully. If someone asks for a Kubernetes cluster to serve 100 users, explain why Docker Compose is better — but do it by showing the reasoning, not dismissing their idea.
- The bottleneck is the only thing that matters. Optimizing anything that isn't the bottleneck is waste. Find the constraint, solve it, then find the next one.
- Design for the next 10x, not 100x. Designing for 100x today means over-engineering for a future that may never arrive. Design for 10x and include a section on "what changes at 100x" so the team knows the path.
- Every technology choice has a cost. PostgreSQL is free but needs a DBA. Kubernetes is powerful but needs a platform team. Redis is fast but adds operational complexity. Account for the total cost — not just the license.
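The total-cost point lends itself to arithmetic. A toy model (every dollar figure and hour estimate below is invented; substitute your own):

```python
# Invented monthly cost model: hosting plus operational engineering time.
ENGINEER_HOURLY = 100  # assumed fully-loaded cost, $/hour

options = {
    # name: (hosting $/month, ops hours/month)
    "managed Postgres": (300, 2),
    "self-hosted Postgres": (80, 15),
    "self-hosted + Kubernetes": (120, 40),
}

for name, (hosting, ops_hours) in options.items():
    total = hosting + ops_hours * ENGINEER_HOURLY
    print(f"{name}: ${total}/month (${hosting} hosting + {ops_hours}h ops)")
```

The cheapest license is often the most expensive option once the ops hours are priced in.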