Naming, Comment & Tagging Standards

Enterprise-wide standards for Databricks asset naming, SQL comments, and cost/governance tagging.

Essential Rules (Retain in Working Memory)

After reading this skill, retain these 5 rules and release the full content:

snake_case everywhere — all object names: tables, columns, functions, schemas
dim_/fact_ prefixes — Gold tables prefixed by entity type; Bronze/Silver by layer
Dual-purpose COMMENT pattern — [Definition]. Business: [context]. Technical: [details].
Mandatory tags — layer, domain, PII (via class.* governed tags) on every table; team, cost_center, environment on every workflow
Budget policy tag — all serverless resources must use approved budget policies

Golden Rules

ID	Rule	Severity
NC-01	All object names use `snake_case`	Critical
NC-02	Tables prefixed by layer (`bronze_`, `silver_`) or entity type (`dim_`, `fact_`)	Critical
NC-03	No abbreviations except approved list	Required
CM-01	SQL block comments (`/* */`) for all DDL operations	Required
CM-02	Table COMMENT follows dual-purpose format	Critical
CM-03	Column COMMENT required for all columns	Critical
CM-04	TVF COMMENT follows v3.0 structured format	Critical
TG-00	Scan `context/` for customer tagging standards (any format); derive smart defaults if absent	Critical
TG-01	All workflows must have required tags (`team`, `cost_center`, `environment`)	Critical
TG-02	Use Databricks Data Classification `class.*` governed tags for PII columns	Critical
TG-03	Use UC governed tags for catalogs, schemas, and tables	Critical
TG-04	Serverless resources must use approved budget policies	Critical

Tagging Configuration Resolution

Before applying ANY tags, the agent MUST follow this resolution order:

Step 1: Scan `context/` for Customer-Supplied Tagging Standards

Search the context/ directory for any file containing tagging standards. The customer supplies their standards in whatever format they already have — the agent adapts to them, not the other way around.

Format	Detection Pattern	How to Parse
YAML/JSON	`tag`, `tagging`, `governance`, `standards`	Parse key-value structure directly
CSV	Files with columns like `tag_name`, `tag_value`, `scope`, `object_type`	Parse rows as tag definitions
Markdown	Files with tagging tables or bullet lists	Extract tag names and values from tables/lists
Plain text	`tag`, `tagging` with `.txt`	Parse `key: value` or `key = value` lines
Free-form	Any file mentioning tags, cost center, PII, classification	Extract intent — the agent understands natural language

Parsing rules:

Look for recognizable tag keys: team, cost_center, business_unit, data_owner, data_classification, project, pii, class.*, or any custom keys
Normalize all keys to snake_case
Accept any reasonable structure — the agent interprets intent, not rigid schema
If multiple files match, merge them (later file wins on conflicts)

Step 1b: Normalize to Canonical Internal Schema

After parsing the customer file, the agent normalizes all extracted values into the canonical schema defined in references/canonical-tagging-schema.yaml. This schema is never shown to the user — it is the agent's internal working format.

Normalization examples (any input → canonical schema):

Customer Input (any format)	Canonical Field
`team = platform-engineering`	`workflow_tags.team`
CSV row: `workflow,all,cost_center,CC-9901`	`workflow_tags.cost_center` AND `governed_tags.catalog.cost_center`
MD bullet: `- dim_guest.email → class.email_address`	`pii_columns.dim_guest[0].class_tag`
Prose: "Our data owner is data-platform@acme.com"	`governed_tags.schema.data_owner`
Table: `	data_classification

The agent fills any fields not provided by the customer using smart defaults (Step 2 below).

Step 2: Derive Smart Defaults (No Customer Config Found, or Gaps in Config)

If no tagging file is found in context/, OR the customer file has gaps (e.g., they specify team and cost_center but not data_owner), the agent fills missing fields with meaningful values derived from project context:

Tag	Derivation Strategy	Example (Wanderbricks)
`team`	`"data-engineering"` (universal default — most common team name)	`data-engineering`
`cost_center`	Cannot infer — use `"REVIEW-REQUIRED"` + `# TODO` comment	`REVIEW-REQUIRED`
`environment`	Always `${bundle.target}` (non-negotiable)	`dev`, `staging`, `prod`
`project`	Extract from schema CSV schema name, catalog name, or repo folder	`wanderbricks`
`business_unit`	Infer domain from schema/table names (e.g., hospitality, retail, sales)	`hospitality`
`data_owner`	`"{project}-data-team"` pattern	`wanderbricks-data-team`
`data_classification`	`"internal"` (safe default for all tables)	`internal`
PII columns	Infer from column names using pattern matching (see PII Inference below)	`class.email_address`

Key difference from v1: Only cost_center gets a placeholder — all other tags get real derived values.

Step 3: PII Column Inference (Always Active)

Whether or not customer config is supplied, the agent always scans column names from the schema CSV (or YAML designs) and applies Databricks Data Classification class.* system governed tags using column name pattern matching:

Column Name Pattern	Databricks Classification Tag	Examples
`email`	`class.email_address`	`email`, `customer_email`, `contact_email`
`phone`, `mobile`, `cell`	`class.phone_number`	`phone_number`, `mobile_phone`, `work_phone`
`*_name` (person context: `first_name`, `last_name`, `guest_name`, `host_name`, `customer_name`)	`class.name`	`first_name`, `last_name`, `guest_name`
`address`, `street` (physical, not email)	`class.location`	`street_address`, `mailing_address`
`ssn`, `social_security`, `tax_id`	`class.us_ssn`	`ssn`, `social_security_number`
`passport`	`class.us_passport`	`passport_number`
`driver_license`, `license_number`	`class.us_driver_license`	`driver_license_number`
`credit_card`, `card_number`, `cc_number`	`class.credit_card`	`credit_card_number`
`ip_address`, `ip_addr`	`class.ip_address`	`client_ip_address`
`iban`	`class.iban_code`	`iban_code`, `bank_iban`

Rules for PII inference:

Always apply inferred PII tags regardless of whether customer config exists
If customer config explicitly declares PII columns, those take precedence over inference (customer may override or suppress)
Use class.* system governed tags (not custom pii/pii_type tags) for compatibility with Databricks Data Classification
Customer can suppress a false positive by adding exclude_pii: [column_name] in their config
Also recommend enabling Databricks Data Classification on the catalog for automated AI-powered scanning beyond column name patterns

Step 4: Environment Tag (Always Automatic)

The environment tag is always ${bundle.target} — never hardcoded. This is non-negotiable regardless of config.

Step 5: Consistency Enforcement

All tags must be consistent across all assets in the project:

Rule	Enforcement
Same `team` on every job, pipeline, and dashboard	Agent verifies before generating
Same `cost_center` on workflows AND catalog governed tags	Must match
Same `project` across all Asset Bundle resources	Derived once, used everywhere
`data_classification` must be set on every Gold table	Default `"internal"` if not specified
PII `class.*` tags must be consistent across all layers	Same column tagged in Bronze, Silver, and Gold

Internal Reference

The canonical schema the agent normalizes to is defined in references/canonical-tagging-schema.yaml. This file is an agent-internal reference — the user never sees or fills it in. The user simply drops their tagging standards into context/ in whatever format they prefer.

Part 1: Naming Conventions

Object Naming Patterns

Object	Format	Example
Catalog	`{env}_{domain}_catalog`	`prod_sales_catalog`
Schema	`{layer}` or `{domain}_{layer}`	`bronze`, `sales_gold`
Table	`{prefix}_{entity}`	`dim_customer`, `fact_orders`
Column	`{descriptive_name}`	`customer_id`, `order_date`
Constraint	`{type}_{table}[_{column}]`	`pk_dim_customer`, `fk_orders_customer`
Function	`get_{entity}_{action}`	`get_daily_sales`
Job	`[${bundle.target}] {Domain} - {Action} {Entity}`	`[dev] Sales - Merge Orders`
Pipeline	`[${bundle.target}] {Layer} {Domain} Pipeline`	`[dev] Silver Sales Pipeline`

Never use: camelCase, PascalCase, SCREAMING_CASE, kebab-case, or spaces.

Table Prefixes

Layer/Type	Prefix	Example
Bronze	`bronze_`	`bronze_raw_orders`
Silver	`silver_`	`silver_orders`
Gold Dimension	`dim_`	`dim_customer`
Gold Fact	`fact_`	`fact_sales`
Bridge	`bridge_`	`bridge_customer_product`
Aggregate	`agg_`	`agg_daily_sales`
Staging	`stg_`	`stg_customer_updates`
Quarantine	`_quarantine` suffix	`silver_orders_quarantine`
History	`_history` suffix	`dim_customer_history`

Approved Abbreviations (Use ONLY These)

id, ts, dt, amt, qty, pct, num, cnt, avg, min, max, pk, fk, dim, fact, agg, stg

Forbidden: cust, prod, inv, trans, ord, emp -- spell these out.

Column Patterns

Pattern	Format	Example
Primary Key	`{entity}_id` or `{entity}_key`	`customer_id`, `store_key`
Foreign Key	Same as referenced PK	`customer_id`
Business Key	`{entity}_{identifier}`	`customer_number`
Surrogate Key	`{entity}_key`	`customer_key` (MD5)
Date	`{event}_date`	`order_date`
Timestamp	`{event}_timestamp` or `{event}_ts`	`created_ts`
Boolean	`is_{condition}` or `has_{thing}`	`is_active`, `has_discount`
Amount	`{type}_amount` or `{type}_amt`	`total_amount`
Count	`{thing}_count` or `{thing}_cnt`	`order_count`
SCD2	`effective_from`, `effective_to`, `is_current`	Standard names

For detailed naming examples, see references/naming-conventions.md.

Part 2: Comment Conventions

CM-01: SQL Block Comments

All DDL must include /* */ block comments with purpose, grain, and source:

/*
 * Table: dim_customer
 * Layer: Gold | Domain: Sales
 * Grain: One row per customer per effective period.
 * Source: Silver layer silver_customers table.
 */

CM-02: Table COMMENT (Dual-Purpose Format)

[One-line description]. Business: [use cases, consumers]. Technical: [grain, source, update frequency].

Examples:

-- Gold Dimension
COMMENT ON TABLE gold.dim_customer IS
'Customer dimension with SCD Type 2 history tracking. Business: Primary customer reference for segmentation and cohort analysis. Technical: MD5 surrogate key, is_current flag, daily merge from Silver.';

-- Gold Fact
COMMENT ON TABLE gold.fact_orders IS
'Daily order facts at customer-product-day grain. Business: Primary source for revenue reporting and sales dashboards. Technical: Composite PK, incremental merge, CDF enabled.';

CM-03: Column COMMENT (Dual-Purpose Format)

[Brief definition]. Business: [how it's used]. Technical: [data type notes, source, calculation].

customer_key STRING NOT NULL
COMMENT 'Surrogate key for SCD Type 2 versioning. Business: Used for joining fact tables. Technical: MD5 hash of customer_id + effective_from.';

total_amount DECIMAL(18,2)
COMMENT 'Total order amount after discounts in USD. Business: Primary revenue metric. Technical: SUM(quantity * unit_price * (1 - discount_pct)).';

CM-04: TVF COMMENT (v3.0 Structured Format)

COMMENT '
• PURPOSE: [One-line description of what the function returns].
• BEST FOR: [Query type 1] | [Query type 2] | [Query type 3]
• NOT FOR: [What to use instead] (use [alternative] instead)
• RETURNS: [PRE-AGGREGATED or DETAIL rows]
• PARAMS: [param1 (format)], [param2 (format)]
• SYNTAX: SELECT * FROM function_name(''param1'', ''param2'')
• NOTE: [Critical usage notes]
'

Metric View COMMENT Format (YAML)

comment: >
  PURPOSE: Cost analytics for billing analysis.
  BEST FOR: Total spend by workspace | Cost by SKU | Daily trends
  NOT FOR: Commit tracking (use commit_tracking) | Real-time alerts (use TVF)
  DIMENSIONS: usage_date, workspace_name, sku_name
  MEASURES: total_cost, total_dbus, tag_coverage_pct
  SOURCE: fact_usage (billing domain)
  JOINS: dim_workspace, dim_sku
  NOTE: Costs are list prices. Actual billed amounts may differ.

For complete comment examples, see references/comment-templates.md.

Part 3: Tagging Standards

Config-aware: Tag values come from customer-supplied config (any format) when available. When absent, the agent derives meaningful defaults from project context. See Tagging Configuration Resolution above.

TG-01: Workflow Tags (Asset Bundles)

All jobs and pipelines must include:

Tag	Required	With Customer Config	Without Config (Smart Default)
`team`	Yes	From customer file	`"data-engineering"`
`cost_center`	Yes	From customer file	`"REVIEW-REQUIRED"` + `# TODO`
`environment`	Yes	Always `${bundle.target}`	Always `${bundle.target}`
`project`	Recommended	From customer file	Inferred from schema/catalog/repo name
`layer`	Recommended	Auto-derived	Auto-derived from pipeline layer
`job_type`	Recommended	Auto-derived	Auto-derived from job purpose

With customer config (values sourced from customer file in context/):

resources:
  jobs:
    gold_merge_job:
      tags:
        team: platform-engineering         # From customer tagging standards
        cost_center: CC-9901               # From customer tagging standards
        environment: ${bundle.target}      # Always automatic
        project: wanderbricks              # From customer tagging standards
        layer: gold                        # Auto-derived
        job_type: merge                    # Auto-derived

Without customer config (meaningful defaults):

resources:
  jobs:
    gold_merge_job:
      tags:
        team: data-engineering             # Default — add tagging config to context/ to customize
        cost_center: REVIEW-REQUIRED       # TODO: Add tagging config to context/ with cost_center
        environment: ${bundle.target}      # Always automatic
        project: wanderbricks              # Inferred from schema CSV (samples.wanderbricks)
        layer: gold                        # Auto-derived
        job_type: merge                    # Auto-derived

TG-02: Databricks Data Classification Tags (PII Columns)

Use Databricks Data Classification class.* system governed tags for PII — not custom pii/pii_type tags.

Supported class.* tags (global):

Tag	Description	Column Pattern Match
`class.email_address`	Email addresses	`email`
`class.phone_number`	Phone numbers	`phone`, `mobile`, `cell`
`class.name`	Person names	`first_name`, `last_name`, `*_name` (person context)
`class.location`	Physical locations	`address`, `street` (not email)
`class.credit_card`	Credit card numbers	`credit_card`, `card_number`
`class.us_ssn`	US Social Security	`ssn`, `social_security`
`class.us_passport`	US Passport	`passport`
`class.us_driver_license`	US Driver License	`driver_license`
`class.ip_address`	IP addresses	`ip_address`, `ip_addr`
`class.iban_code`	Bank account (IBAN)	`iban`
`class.url`	URLs	`_url`, `website*`

PII tags are always inferred from column names AND from customer config (if supplied). Customer config takes precedence — they can add, override, or suppress inferred tags.

-- Applied to columns based on name pattern inference + customer config
ALTER TABLE gold.dim_guest ALTER COLUMN email SET TAGS ('class.email_address' = '');
ALTER TABLE gold.dim_guest ALTER COLUMN phone_number SET TAGS ('class.phone_number' = '');
ALTER TABLE gold.dim_guest ALTER COLUMN guest_name SET TAGS ('class.name' = '');

-- Also recommend enabling automated Data Classification on the catalog:
-- Navigate to catalog > Details > Data Classification toggle > Enable

TG-03: Unity Catalog Governed Tags (Catalogs, Schemas, Tables)

Tag	Apply To	With Customer Config	Without Config (Smart Default)
`cost_center`	Catalogs	From customer file	`"REVIEW-REQUIRED"` + `# TODO`
`business_unit`	Catalogs	From customer file	Inferred from domain (e.g., `"hospitality"`)
`data_owner`	Schemas	From customer file	`"{project}-data-team"` pattern
`data_classification`	Tables	From customer file	`"internal"` (safe default for all tables)

With customer config:

ALTER CATALOG wanderbricks SET TAGS ('cost_center' = 'CC-9901', 'business_unit' = 'Hospitality');
ALTER SCHEMA wanderbricks.gold SET TAGS ('data_owner' = 'data-platform@wanderbricks.com');
ALTER TABLE gold.dim_guest SET TAGS ('data_classification' = 'confidential');
ALTER TABLE gold.fact_bookings SET TAGS ('data_classification' = 'internal');