# review-salience-xlsx

Review Salience → PCA → K-means Customer Segmentation
Full pipeline from a review corpus to customer segments:

- Score every review on every attribute (salience 0–7) → review × attribute matrix
- Reduce attributes to latent dimensions via PCA
- Segment reviews into customer groups via iterative K-means

Each stage is independently useful. Run only the stages the user needs.
For the Excel output format, read references/xlsx-format.md.
For PCA and K-means implementation details, read references/pca-kmeans.md.
For a worked example (safety-eyewear, 923 reviews, 30 attributes, 4 clusters), read references/worked-example.md.
## Salience scale

Salience measures how prominently a reviewer mentions an attribute — not sentiment. Each cell is an integer 0–7:
| Score | Meaning |
|---|---|
| 0 | Attribute not mentioned at all |
| 1–3 | Slight or indirect mention |
| 4 | Neutral or ambiguous mention |
| 5–6 | Clearly and explicitly mentioned |
| 7 | Strongly and fully emphasised |
## Concepts

### Attribute catalog

A frozen, ordered list of attributes produced upstream (e.g. by the
review-scoring-docx skill). Each attribute has an id (zero-padded, e.g.
`01`) and a label. The catalog must not change after scoring begins.
### Scorer
The component that reads one review and returns N integers. This skill is scorer-agnostic: Claude reads and scores by default, but the architecture supports swapping in any external scorer without changing the rest of the pipeline. See Scorer contract below.
### Salience matrix

A table with one row per review and one column per attribute, plus metadata
columns (review_id, product, review_text). Column names follow the pattern
`s01`, `s02` … `sN`. This is the input to both PCA and K-means.
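The `s01 … sN` naming can be generated straight from the frozen catalog. A minimal sketch (the two-entry `ATTRS` here is purely illustrative):

```python
# Derive salience-matrix column names from a frozen attribute catalog.
# This ATTRS is a stand-in; the real catalog comes from upstream.
ATTRS = [("01", "lens clarity"), ("02", "strap comfort")]

meta_cols = ["review_id", "product", "review_text"]
salience_cols = [f"s{attr_id}" for attr_id, _label in ATTRS]
columns = meta_cols + salience_cols
# → ['review_id', 'product', 'review_text', 's01', 's02']
```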
### PC scores matrix

Produced by PCA on the standardised salience matrix. Shape: (n_reviews, n_components).
Each column is a latent dimension (e.g. "overall perceived use value", "adaptability to new usage scenarios").
This matrix is the direct input to K-means clustering.
### Customer segments
K-means groups applied to the PC scores matrix. Each review (= each customer voice) is assigned to exactly one segment. The iterative pruning rule ensures no segment is smaller than 5% of the total corpus.
## Workflow

### Step 0 — Locate required skills
Check your available skills for an xlsx skill (covers .xlsx or spreadsheet
creation) and a docx skill if Word output is needed. Read their SKILL.md
files before writing any output code.
### Step 1 — Ingest reviews
Load all reviews for each product. Never truncate. Accept all languages.
```python
import csv

def load_reviews(filepath):
    """Return all non-trivial review texts from a CSV file."""
    candidates = ['body', 'Body', 'review', 'text', 'content']
    with open(filepath, encoding='utf-8', errors='replace') as f:
        reader = csv.DictReader(f)
        col = next((c for c in candidates if c in reader.fieldnames), None)
        if col is None:
            raise ValueError(f"No text column found. Headers: {reader.fieldnames}")
        # Guard against missing cells (None) before stripping
        return [r[col].strip() for r in reader
                if r[col] and len(r[col].strip()) > 15]
```
Print a per-product count summary before proceeding.
### Step 2 — Confirm the attribute catalog
Either receive the catalog from upstream or rediscover it by reading the corpus. Freeze as an ordered list before scoring starts:
```python
ATTRS = [
    ("01", "attribute label 1"),
    ("02", "attribute label 2"),
    # ...
]
```
### Step 3 — Score reviews ← scorer swap point
This is the scorer contract boundary. Anything that satisfies the contract below can replace Claude's built-in scoring.
#### Scorer contract

- Input: a single review string (any language)
- Output: a list of integers, one per attribute, in catalog order, each 0–7
Built-in scorer (Claude reads semantically):
Read each review, understand its meaning, assign scores. No keyword matching.
See references/worked-example.md for calibration examples.
External scorer (n8n / API / other):
See references/external-scorer.md for integration patterns.
```python
scores = {}
for pid, reviews in all_reviews.items():
    scores[pid] = []
    for review_text in reviews:
        scores[pid].append(score_one_review(review_text, ATTRS))
```
Save progress incrementally if corpus > 200 reviews.
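Whatever scorer is plugged in, its result can be checked against the contract before it enters the matrix. A minimal sketch (`validate_scores` is illustrative, not part of the skill):

```python
def validate_scores(scores, n_attrs):
    """Check one scorer result against the contract: n_attrs integers, each 0-7."""
    if len(scores) != n_attrs:
        raise ValueError(f"expected {n_attrs} scores, got {len(scores)}")
    for i, s in enumerate(scores):
        # bool is a subclass of int in Python, so reject it explicitly
        if not isinstance(s, int) or isinstance(s, bool) or not 0 <= s <= 7:
            raise ValueError(f"score {i}: {s!r} is not an integer in 0-7")
    return scores

validate_scores([0, 3, 7], n_attrs=3)    # passes
# validate_scores([0, 3, 8], n_attrs=3)  # would raise ValueError
```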
### Step 4 — Build salience Excel
See references/xlsx-format.md for exact sheet layout, colour scheme, and
openpyxl patterns. Two sheets: full matrix + product summary.
### Step 5 — PCA (if requested)
Read references/pca-kmeans.md → Section A before writing any PCA code.
Key steps:

- Standardise: `X_std = StandardScaler().fit_transform(X)`
- Fit PCA: use the Kaiser criterion (eigenvalue > 1) to choose `n_components`
- Compute factor loadings: `loadings = pca.components_.T * np.sqrt(pca.explained_variance_)`
- Name each PC by its dominant loadings (|loading| ≥ 0.30)
- Save the PC scores matrix for clustering: `PC = pca.fit_transform(X_std)`
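The standardise → Kaiser → loadings chain can be sketched in plain NumPy (an eigendecomposition of the correlation matrix stands in for sklearn's PCA here, and the salience matrix is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 8, size=(200, 6)).astype(float)  # fake salience matrix

# Standardise each attribute column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the correlation matrix (= PCA on standardised data)
corr = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]             # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Kaiser criterion: keep components with eigenvalue > 1
k = int(np.sum(eigvals > 1.0))

# Factor loadings = eigenvectors scaled by sqrt(eigenvalue)
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])

# PC scores matrix: project standardised data onto the kept components
PC = X_std @ eigvecs[:, :k]                   # shape (200, k)
```

With real (correlated) salience data the Kaiser cut typically keeps far fewer components than attributes; with this uniform noise it mostly demonstrates the mechanics.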
### Step 6 — K-means segmentation (if requested)
Read references/pca-kmeans.md → Section B before writing any clustering code.
Key steps:

- Scan K = 2–9 for Elbow + Silhouette
- Pick a starting K (favour interpretability over peak silhouette if the gap is small)
- Run iterative pruning: remove clusters < 5% of the corpus, decrement K, refit
- After convergence: call `km.predict(ALL_reviews)` — never discard pruned reviews
- Name clusters by PC centroid pattern and top attribute means
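The pruning loop can be sketched as follows. A deliberately tiny NumPy k-means stands in for sklearn's KMeans; the parts that matter are the 5% floor, the refit at K-1, and the final full-corpus assignment. The K=2 stopping floor is an assumption added so the loop always terminates:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means for illustration: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep old centroid if cluster went empty
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def segment_with_pruning(PC, k, min_frac=0.05):
    """Refit with k-1 while any cluster holds < min_frac of the corpus."""
    X = PC.copy()
    while True:
        centroids, labels = kmeans(X, k)
        sizes = np.bincount(labels, minlength=k)
        if k <= 2 or sizes.min() >= min_frac * len(PC):
            break
        # Drop members of the smallest cluster from the *fit* set only
        X = X[labels != sizes.argmin()]
        k -= 1
    # Assign EVERY review (including pruned ones) to the nearest final centroid
    final = np.argmin(((PC[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    return centroids, final
```

Pruned rows are removed only from the refit set; the closing nearest-centroid assignment over the full `PC` matrix plays the role of `km.predict()` and guarantees every review lands in a segment.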
### Step 7 — Output

- Save `.xlsx` to `/mnt/user-data/outputs/`
- Save `.docx` to `/mnt/user-data/outputs/` (if Word output was requested)
- Call `present_files` for every output file
- Write a prose summary: total reviewed, cluster sizes + names, top finding per cluster
## Hard rules
- Never truncate reviews. Every valid review (> 15 chars) must be scored and clustered.
- All languages count. Do not filter by language.
- Catalog is frozen before scoring. No mid-run additions or reordering.
- Integer scores only. Each salience cell is a whole number 0–7.
- Scorer is swappable. The pipeline must not assume Claude is the scorer.
- Never discard pruned reviews. After iterative pruning converges, use `km.predict()` on the full corpus to assign every review — including those removed during pruning — to the nearest final centroid.
- Use absolute column widths. Percentage widths break in Google Sheets.
- Present with `present_files`. Never ask the user to navigate to the file.