naming-tagging-standards

SKILL.md

Naming, Comment & Tagging Standards

Enterprise-wide standards for Databricks asset naming, SQL comments, and cost/governance tagging.

Essential Rules (Retain in Working Memory)

After reading this skill, retain these 5 rules and release the full content:

  1. snake_case everywhere — all object names: tables, columns, functions, schemas
  2. dim_/fact_ prefixes — Gold tables prefixed by entity type; Bronze/Silver by layer
  3. Dual-purpose COMMENT pattern[Definition]. Business: [context]. Technical: [details].
  4. Mandatory tagslayer, domain, PII (via class.* governed tags) on every table; team, cost_center, environment on every workflow
  5. Budget policy tag — all serverless resources must use approved budget policies

Golden Rules

ID Rule Severity
NC-01 All object names use snake_case Critical
NC-02 Tables prefixed by layer (bronze_, silver_) or entity type (dim_, fact_) Critical
NC-03 No abbreviations except approved list Required
CM-01 SQL block comments (/* */) for all DDL operations Required
CM-02 Table COMMENT follows dual-purpose format Critical
CM-03 Column COMMENT required for all columns Critical
CM-04 TVF COMMENT follows v3.0 structured format Critical
TG-00 Scan context/ for customer tagging standards (any format); derive smart defaults if absent Critical
TG-01 All workflows must have required tags (team, cost_center, environment) Critical
TG-02 Use Databricks Data Classification class.* governed tags for PII columns Critical
TG-03 Use UC governed tags for catalogs, schemas, and tables Critical
TG-04 Serverless resources must use approved budget policies Critical

Tagging Configuration Resolution

Before applying ANY tags, the agent MUST follow this resolution order:

Step 1: Scan context/ for Customer-Supplied Tagging Standards

Search the context/ directory for any file containing tagging standards. The customer supplies their standards in whatever format they already have — the agent adapts to them, not the other way around.

Format Detection Pattern How to Parse
YAML/JSON *tag*, *tagging*, *governance*, *standards* Parse key-value structure directly
CSV Files with columns like tag_name, tag_value, scope, object_type Parse rows as tag definitions
Markdown Files with tagging tables or bullet lists Extract tag names and values from tables/lists
Plain text *tag*, *tagging* with .txt Parse key: value or key = value lines
Free-form Any file mentioning tags, cost center, PII, classification Extract intent — the agent understands natural language

Parsing rules:

  • Look for recognizable tag keys: team, cost_center, business_unit, data_owner, data_classification, project, pii, class.*, or any custom keys
  • Normalize all keys to snake_case
  • Accept any reasonable structure — the agent interprets intent, not rigid schema
  • If multiple files match, merge them (later file wins on conflicts)

Step 1b: Normalize to Canonical Internal Schema

After parsing the customer file, the agent normalizes all extracted values into the canonical schema defined in references/canonical-tagging-schema.yaml. This schema is never shown to the user — it is the agent's internal working format.

Normalization examples (any input → canonical schema):

Customer Input (any format) Canonical Field
team = platform-engineering workflow_tags.team
CSV row: workflow,all,cost_center,CC-9901 workflow_tags.cost_center AND governed_tags.catalog.cost_center
MD bullet: - dim_guest.email → class.email_address pii_columns.dim_guest[0].class_tag
Prose: "Our data owner is data-platform@acme.com" governed_tags.schema.data_owner
Table: ` data_classification

The agent fills any fields not provided by the customer using smart defaults (Step 2 below).

Step 2: Derive Smart Defaults (No Customer Config Found, or Gaps in Config)

If no tagging file is found in context/, OR the customer file has gaps (e.g., they specify team and cost_center but not data_owner), the agent fills missing fields with meaningful values derived from project context:

Tag Derivation Strategy Example (Wanderbricks)
team "data-engineering" (universal default — most common team name) data-engineering
cost_center Cannot infer — use "REVIEW-REQUIRED" + # TODO comment REVIEW-REQUIRED
environment Always ${bundle.target} (non-negotiable) dev, staging, prod
project Extract from schema CSV schema name, catalog name, or repo folder wanderbricks
business_unit Infer domain from schema/table names (e.g., hospitality, retail, sales) hospitality
data_owner "{project}-data-team" pattern wanderbricks-data-team
data_classification "internal" (safe default for all tables) internal
PII columns Infer from column names using pattern matching (see PII Inference below) class.email_address

Key difference from v1: Only cost_center gets a placeholder — all other tags get real derived values.

Step 3: PII Column Inference (Always Active)

Whether or not customer config is supplied, the agent always scans column names from the schema CSV (or YAML designs) and applies Databricks Data Classification class.* system governed tags using column name pattern matching:

Column Name Pattern Databricks Classification Tag Examples
*email* class.email_address email, customer_email, contact_email
*phone*, *mobile*, *cell* class.phone_number phone_number, mobile_phone, work_phone
*_name (person context: first_name, last_name, guest_name, host_name, customer_name) class.name first_name, last_name, guest_name
*address*, *street* (physical, not email) class.location street_address, mailing_address
*ssn*, *social_security*, *tax_id* class.us_ssn ssn, social_security_number
*passport* class.us_passport passport_number
*driver_license*, *license_number* class.us_driver_license driver_license_number
*credit_card*, *card_number*, *cc_number* class.credit_card credit_card_number
*ip_address*, *ip_addr* class.ip_address client_ip_address
*iban* class.iban_code iban_code, bank_iban

Rules for PII inference:

  • Always apply inferred PII tags regardless of whether customer config exists
  • If customer config explicitly declares PII columns, those take precedence over inference (customer may override or suppress)
  • Use class.* system governed tags (not custom pii/pii_type tags) for compatibility with Databricks Data Classification
  • Customer can suppress a false positive by adding exclude_pii: [column_name] in their config
  • Also recommend enabling Databricks Data Classification on the catalog for automated AI-powered scanning beyond column name patterns

Step 4: Environment Tag (Always Automatic)

The environment tag is always ${bundle.target} — never hardcoded. This is non-negotiable regardless of config.

Step 5: Consistency Enforcement

All tags must be consistent across all assets in the project:

Rule Enforcement
Same team on every job, pipeline, and dashboard Agent verifies before generating
Same cost_center on workflows AND catalog governed tags Must match
Same project across all Asset Bundle resources Derived once, used everywhere
data_classification must be set on every Gold table Default "internal" if not specified
PII class.* tags must be consistent across all layers Same column tagged in Bronze, Silver, and Gold

Internal Reference

The canonical schema the agent normalizes to is defined in references/canonical-tagging-schema.yaml. This file is an agent-internal reference — the user never sees or fills it in. The user simply drops their tagging standards into context/ in whatever format they prefer.


Part 1: Naming Conventions

Object Naming Patterns

Object Format Example
Catalog {env}_{domain}_catalog prod_sales_catalog
Schema {layer} or {domain}_{layer} bronze, sales_gold
Table {prefix}_{entity} dim_customer, fact_orders
Column {descriptive_name} customer_id, order_date
Constraint {type}_{table}[_{column}] pk_dim_customer, fk_orders_customer
Function get_{entity}_{action} get_daily_sales
Job [${bundle.target}] {Domain} - {Action} {Entity} [dev] Sales - Merge Orders
Pipeline [${bundle.target}] {Layer} {Domain} Pipeline [dev] Silver Sales Pipeline

Never use: camelCase, PascalCase, SCREAMING_CASE, kebab-case, or spaces.

Table Prefixes

Layer/Type Prefix Example
Bronze bronze_ bronze_raw_orders
Silver silver_ silver_orders
Gold Dimension dim_ dim_customer
Gold Fact fact_ fact_sales
Bridge bridge_ bridge_customer_product
Aggregate agg_ agg_daily_sales
Staging stg_ stg_customer_updates
Quarantine _quarantine suffix silver_orders_quarantine
History _history suffix dim_customer_history

Approved Abbreviations (Use ONLY These)

id, ts, dt, amt, qty, pct, num, cnt, avg, min, max, pk, fk, dim, fact, agg, stg

Forbidden: cust, prod, inv, trans, ord, emp -- spell these out.

Column Patterns

Pattern Format Example
Primary Key {entity}_id or {entity}_key customer_id, store_key
Foreign Key Same as referenced PK customer_id
Business Key {entity}_{identifier} customer_number
Surrogate Key {entity}_key customer_key (MD5)
Date {event}_date order_date
Timestamp {event}_timestamp or {event}_ts created_ts
Boolean is_{condition} or has_{thing} is_active, has_discount
Amount {type}_amount or {type}_amt total_amount
Count {thing}_count or {thing}_cnt order_count
SCD2 effective_from, effective_to, is_current Standard names

For detailed naming examples, see references/naming-conventions.md.


Part 2: Comment Conventions

CM-01: SQL Block Comments

All DDL must include /* */ block comments with purpose, grain, and source:

/*
 * Table: dim_customer
 * Layer: Gold | Domain: Sales
 * Grain: One row per customer per effective period.
 * Source: Silver layer silver_customers table.
 */

CM-02: Table COMMENT (Dual-Purpose Format)

[One-line description]. Business: [use cases, consumers]. Technical: [grain, source, update frequency].

Examples:

-- Gold Dimension
COMMENT ON TABLE gold.dim_customer IS
'Customer dimension with SCD Type 2 history tracking. Business: Primary customer reference for segmentation and cohort analysis. Technical: MD5 surrogate key, is_current flag, daily merge from Silver.';

-- Gold Fact
COMMENT ON TABLE gold.fact_orders IS
'Daily order facts at customer-product-day grain. Business: Primary source for revenue reporting and sales dashboards. Technical: Composite PK, incremental merge, CDF enabled.';

CM-03: Column COMMENT (Dual-Purpose Format)

[Brief definition]. Business: [how it's used]. Technical: [data type notes, source, calculation].
customer_key STRING NOT NULL
COMMENT 'Surrogate key for SCD Type 2 versioning. Business: Used for joining fact tables. Technical: MD5 hash of customer_id + effective_from.';

total_amount DECIMAL(18,2)
COMMENT 'Total order amount after discounts in USD. Business: Primary revenue metric. Technical: SUM(quantity * unit_price * (1 - discount_pct)).';

CM-04: TVF COMMENT (v3.0 Structured Format)

COMMENT '
• PURPOSE: [One-line description of what the function returns].
• BEST FOR: [Query type 1] | [Query type 2] | [Query type 3]
• NOT FOR: [What to use instead] (use [alternative] instead)
• RETURNS: [PRE-AGGREGATED or DETAIL rows]
• PARAMS: [param1 (format)], [param2 (format)]
• SYNTAX: SELECT * FROM function_name(''param1'', ''param2'')
• NOTE: [Critical usage notes]
'

Metric View COMMENT Format (YAML)

comment: >
  PURPOSE: Cost analytics for billing analysis.
  BEST FOR: Total spend by workspace | Cost by SKU | Daily trends
  NOT FOR: Commit tracking (use commit_tracking) | Real-time alerts (use TVF)
  DIMENSIONS: usage_date, workspace_name, sku_name
  MEASURES: total_cost, total_dbus, tag_coverage_pct
  SOURCE: fact_usage (billing domain)
  JOINS: dim_workspace, dim_sku
  NOTE: Costs are list prices. Actual billed amounts may differ.

For complete comment examples, see references/comment-templates.md.


Part 3: Tagging Standards

Config-aware: Tag values come from customer-supplied config (any format) when available. When absent, the agent derives meaningful defaults from project context. See Tagging Configuration Resolution above.

TG-01: Workflow Tags (Asset Bundles)

All jobs and pipelines must include:

Tag Required With Customer Config Without Config (Smart Default)
team Yes From customer file "data-engineering"
cost_center Yes From customer file "REVIEW-REQUIRED" + # TODO
environment Yes Always ${bundle.target} Always ${bundle.target}
project Recommended From customer file Inferred from schema/catalog/repo name
layer Recommended Auto-derived Auto-derived from pipeline layer
job_type Recommended Auto-derived Auto-derived from job purpose

With customer config (values sourced from customer file in context/):

resources:
  jobs:
    gold_merge_job:
      tags:
        team: platform-engineering         # From customer tagging standards
        cost_center: CC-9901               # From customer tagging standards
        environment: ${bundle.target}      # Always automatic
        project: wanderbricks              # From customer tagging standards
        layer: gold                        # Auto-derived
        job_type: merge                    # Auto-derived

Without customer config (meaningful defaults):

resources:
  jobs:
    gold_merge_job:
      tags:
        team: data-engineering             # Default — add tagging config to context/ to customize
        cost_center: REVIEW-REQUIRED       # TODO: Add tagging config to context/ with cost_center
        environment: ${bundle.target}      # Always automatic
        project: wanderbricks              # Inferred from schema CSV (samples.wanderbricks)
        layer: gold                        # Auto-derived
        job_type: merge                    # Auto-derived

TG-02: Databricks Data Classification Tags (PII Columns)

Use Databricks Data Classification class.* system governed tags for PII — not custom pii/pii_type tags.

Supported class.* tags (global):

Tag Description Column Pattern Match
class.email_address Email addresses *email*
class.phone_number Phone numbers *phone*, *mobile*, *cell*
class.name Person names first_name, last_name, *_name (person context)
class.location Physical locations *address*, *street* (not email)
class.credit_card Credit card numbers *credit_card*, *card_number*
class.us_ssn US Social Security *ssn*, *social_security*
class.us_passport US Passport *passport*
class.us_driver_license US Driver License *driver_license*
class.ip_address IP addresses *ip_address*, *ip_addr*
class.iban_code Bank account (IBAN) *iban*
class.url URLs *_url, *website*

PII tags are always inferred from column names AND from customer config (if supplied). Customer config takes precedence — they can add, override, or suppress inferred tags.

-- Applied to columns based on name pattern inference + customer config
ALTER TABLE gold.dim_guest ALTER COLUMN email SET TAGS ('class.email_address' = '');
ALTER TABLE gold.dim_guest ALTER COLUMN phone_number SET TAGS ('class.phone_number' = '');
ALTER TABLE gold.dim_guest ALTER COLUMN guest_name SET TAGS ('class.name' = '');

-- Also recommend enabling automated Data Classification on the catalog:
-- Navigate to catalog > Details > Data Classification toggle > Enable

TG-03: Unity Catalog Governed Tags (Catalogs, Schemas, Tables)

Tag Apply To With Customer Config Without Config (Smart Default)
cost_center Catalogs From customer file "REVIEW-REQUIRED" + # TODO
business_unit Catalogs From customer file Inferred from domain (e.g., "hospitality")
data_owner Schemas From customer file "{project}-data-team" pattern
data_classification Tables From customer file "internal" (safe default for all tables)

With customer config:

ALTER CATALOG wanderbricks SET TAGS ('cost_center' = 'CC-9901', 'business_unit' = 'Hospitality');
ALTER SCHEMA wanderbricks.gold SET TAGS ('data_owner' = 'data-platform@wanderbricks.com');
ALTER TABLE gold.dim_guest SET TAGS ('data_classification' = 'confidential');
ALTER TABLE gold.fact_bookings SET TAGS ('data_classification' = 'internal');

Without customer config (meaningful defaults):

ALTER CATALOG wanderbricks SET TAGS ('cost_center' = 'REVIEW-REQUIRED');  -- TODO: Add tagging config to context/
ALTER CATALOG wanderbricks SET TAGS ('business_unit' = 'hospitality');     -- Inferred from domain
ALTER SCHEMA wanderbricks.gold SET TAGS ('data_owner' = 'wanderbricks-data-team');  -- Derived from project
ALTER TABLE gold.dim_guest SET TAGS ('data_classification' = 'internal');  -- Safe default

TG-04: Serverless Budget Policies

Serverless compute resources require budget policies with team, cost_center, and environment tags. If a named policy is specified in the customer config, reference it in Asset Bundles.

For tag query patterns and serverless cost attribution, see references/tagging-patterns.md.


Validation Checklist

Tagging Config Resolution

  • Scanned context/ for customer tagging files in any format (YAML, CSV, MD, JSON, TXT)
  • If config found: parsed and applied customer values (all tag keys normalized to snake_case)
  • If config absent: derived meaningful defaults from project context (not lazy placeholders)
  • Only cost_center uses REVIEW-REQUIRED placeholder (everything else gets a real derived value)
  • environment tag is ${bundle.target} (never hardcoded)

PII / Data Classification

  • Column names scanned for PII patterns and class.* system governed tags applied
  • Used class.* tags (NOT custom pii/pii_type tags)
  • Customer PII declarations (if any) take precedence over inference
  • Recommended enabling Databricks Data Classification on the catalog

Naming

  • All objects use snake_case
  • Tables have correct prefix (dim_, fact_, bronze_, silver_)
  • Only approved abbreviations used
  • Columns follow standard patterns (PK, FK, boolean, SCD2)
  • Jobs/pipelines follow naming format

Comments

  • DDL includes /* */ block comments
  • All tables have dual-purpose COMMENT
  • All columns have dual-purpose COMMENT
  • TVFs have v3.0 structured COMMENT
  • Metric views have structured comment in YAML

Tags

  • All jobs have team, cost_center, environment tags
  • All pipelines have required tags
  • Catalogs tagged with cost_center and business_unit
  • Schemas tagged with data_owner
  • Confidential tables tagged with data_classification
  • PII columns tagged with class.* governed tags (inferred + customer-declared)
  • Tag values consistent across all project assets (same team, cost_center, project everywhere)

Additional Resources

Weekly Installs
1
GitHub Stars
2
First Seen
8 days ago
Installed on
amp1
cline1
opencode1
cursor1
kimi-cli1
codex1