azure-best-practices

Installation
SKILL.md

Azure Best Practices

This skill provides rules for building secure, reliable, and cost-effective Azure services. Rules are ordered by impact: CRITICAL rules prevent outages or security breaches; HIGH rules prevent common production issues; MEDIUM rules improve maintainability and cost.


Learned Patterns (Auto-Updated)

Before applying the rules below, check if LESSONS.md exists in the project root. If it does, read the section tagged with azure-best-practices and apply those project-specific lessons alongside the rules below.


Category 1: Identity & Access (CRITICAL)

Rule 1: Use Managed Identity — Never Embed Credentials

Impact: CRITICAL — Hardcoded credentials are the #1 cause of Azure security incidents.

Wrong:

from azure.storage.blob import BlobServiceClient

client = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=mystore;AccountKey=abc123..."
)

Correct:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://mystore.blob.core.windows.net",
    credential=credential
)

Why: DefaultAzureCredential uses Managed Identity in production and your Azure CLI login locally. No secrets to leak.

Rule 2: Workload Identity Federation for CI/CD

Impact: CRITICAL — Service principal secrets in CI/CD pipelines expire and can be stolen.

Wrong:

# GitHub Actions
- uses: azure/login@v2
  with:
    creds: ${{ secrets.AZURE_CREDENTIALS }}  # JSON with client secret

Correct:

# GitHub Actions with OIDC
permissions:
  id-token: write
  contents: read

- uses: azure/login@v2
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

Why: Federated credentials use short-lived OIDC tokens — no long-lived secrets to rotate or protect.

Rule 3: Key Vault for All Secrets

Impact: CRITICAL — Environment variables are visible in portal, logs, and process listings.

Wrong:

resource appService 'Microsoft.Web/sites@2023-12-01' = {
  properties: {
    siteConfig: {
      appSettings: [
        { name: 'DB_PASSWORD', value: 'super-secret-123' }
      ]
    }
  }
}

Correct:

resource appService 'Microsoft.Web/sites@2023-12-01' = {
  properties: {
    siteConfig: {
      appSettings: [
        {
          name: 'DB_PASSWORD'
          value: '@Microsoft.KeyVault(SecretUri=${keyVault::dbPassword.properties.secretUri})'
        }
      ]
    }
  }
}

Why: Key Vault references are resolved at runtime; the secret value never appears in config or logs.

Rule 4: Least-Privilege Role Assignments

Impact: CRITICAL — Overly broad roles (Owner, Contributor) amplify the blast radius of a compromise.

Wrong:

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      'b24988ac-6180-42a0-ab88-20f7382dd24c') // Contributor — too broad
    principalId: app.identity.principalId
  }
}

Correct:

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      'ba92f5b4-2d11-453d-a403-e96b0029c9fe') // Storage Blob Data Contributor — scoped
    principalId: app.identity.principalId
  }
}

Category 2: Container Apps & Deployment (CRITICAL)

Rule 5: Pin Image Tags — Never Use :latest

Impact: CRITICAL — :latest causes unpredictable deployments and makes rollback impossible.

Wrong:

containers:
  - name: api
    image: myregistry.azurecr.io/api:latest

Correct:

containers:
  - name: api
    image: myregistry.azurecr.io/api:1.2.3-sha-a1b2c3d

Rule 6: Always Configure Health Probes

Impact: CRITICAL — Without health probes, Azure routes traffic to broken containers.

Wrong:

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  properties: {
    template: {
      containers: [{ name: 'api', image: '...' }]
      // No probes defined
    }
  }
}

Correct:

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  properties: {
    template: {
      containers: [{
        name: 'api'
        image: '...'
        probes: [
          {
            type: 'liveness'
            httpGet: { path: '/healthz', port: 8080 }
            periodSeconds: 10
          }
          {
            type: 'readiness'
            httpGet: { path: '/ready', port: 8080 }
            periodSeconds: 5
          }
        ]
      }]
    }
  }
}

Rule 7: Set CPU and Memory Limits

Impact: HIGH — Unlimited resources cause noisy-neighbor problems and cost overruns.

Wrong:

containers: [{
  name: 'api'
  image: '...'
  // No resource limits
}]

Correct:

containers: [{
  name: 'api'
  image: '...'
  resources: {
    cpu: json('0.5')
    memory: '1Gi'
  }
}]

Rule 8: Use Revision Scope for Zero-Downtime Deploys

Impact: HIGH — Single-revision mode causes downtime during deployments.

Wrong:

properties: {
  configuration: {
    activeRevisionsMode: 'Single'
  }
}

Correct:

properties: {
  configuration: {
    activeRevisionsMode: 'Multiple'
    ingress: {
      traffic: [
        { revisionName: 'api--v2', weight: 100 }
      ]
    }
  }
}

Category 3: Azure OpenAI & AI Foundry (HIGH)

Rule 9: Always Set max_tokens

Impact: HIGH — Without max_tokens, a runaway prompt can consume your entire quota.

Wrong:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
    # No max_tokens — defaults to model maximum
)

Correct:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096
)

Rule 10: Retry with Exponential Backoff for 429s

Impact: HIGH — Azure OpenAI rate-limits aggressively; tight retry loops make it worse.

Wrong:

import time

for attempt in range(5):
    try:
        response = client.chat.completions.create(...)
        break
    except Exception:
        time.sleep(1)  # Fixed 1-second delay

Correct:

import time
import random

for attempt in range(5):
    try:
        response = client.chat.completions.create(...)
        break
    except openai.RateLimitError:
        wait = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(wait)

Rule 11: Use Managed Identity for Azure OpenAI

Impact: HIGH — API keys are long-lived secrets that can be leaked.

Wrong:

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="sk-abc123...",
    api_version="2024-06-01",
    azure_endpoint="https://myoai.openai.azure.com"
)

Correct:

from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",
    azure_endpoint="https://myoai.openai.azure.com"
)

Category 4: Networking & Security (HIGH)

Rule 12: Enable HTTPS Only

Impact: HIGH — HTTP traffic is unencrypted and vulnerable to interception.

Wrong:

properties: {
  httpsOnly: false
}

Correct:

properties: {
  httpsOnly: true
}

Rule 13: Use Private Endpoints for Data Services

Impact: HIGH — Public endpoints expose databases and storage to the internet.

Wrong:

resource cosmosDb 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
  properties: {
    publicNetworkAccess: 'Enabled'  // Accessible from internet
  }
}

Correct:

resource cosmosDb 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
  properties: {
    publicNetworkAccess: 'Disabled'
  }
}

resource privateEndpoint 'Microsoft.Network/privateEndpoints@2023-11-01' = {
  properties: {
    privateLinkServiceConnections: [{
      properties: {
        privateLinkServiceId: cosmosDb.id
        groupIds: ['Sql']
      }
    }]
  }
}

Rule 14: Enable Diagnostic Logging

Impact: HIGH — Without logs, you cannot investigate incidents.

Wrong: No diagnostic settings configured (the default).

Correct:

resource diagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  scope: containerApp
  properties: {
    workspaceId: logAnalytics.id
    logs: [{ categoryGroup: 'allLogs', enabled: true }]
    metrics: [{ category: 'AllMetrics', enabled: true }]
  }
}

Rule 15: CORS — Restrict Origins

Impact: HIGH — Wildcard CORS allows any website to call your API.

Wrong:

cors: {
  allowedOrigins: ['*']
}

Correct:

cors: {
  allowedOrigins: ['https://myapp.example.com']
}

Category 4b: Key Vault RBAC Gotchas (HIGH)

Rule 15b: Contributor Does NOT Grant Key Vault Secret Access

Impact: HIGH — Apps with Contributor role on a resource group still cannot read Key Vault secrets when RBAC is enabled.

Wrong assumption:

"The backend has Contributor role on the resource group, so it can read Key Vault secrets."

Correct: When Key Vault uses RBAC authorization (enableRbacAuthorization: true), you need separate data-plane roles:

Who Role Purpose
Application (Managed Identity) Key Vault Secrets User Read secrets at runtime
Developer / CI/CD Key Vault Secrets Officer Write/manage secrets
Admin Key Vault Administrator Full control
// Grant the app read-only access to secrets
resource kvRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: keyVault
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6') // Key Vault Secrets User
    principalId: containerApp.identity.principalId
    principalType: 'ServicePrincipal'
  }
}

Why: Key Vault separates management-plane (Contributor) from data-plane (Secrets User/Officer). This is by design for defense-in-depth.

Rule 15c: Container Apps Need AcrPull for ACR

Impact: HIGH — Container Apps with identity: 'system' for ACR registry fail if no AcrPull role is assigned.

Wrong:

registries: [{
  server: acrLoginServer
  identity: 'system'  // Assumes Managed Identity can pull — it cannot without AcrPull
}]

Correct: Assign AcrPull role to the Container App's managed identity:

resource acrPullRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: acr
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '7f951dda-4ed3-4680-a7ca-43fe172d538d') // AcrPull
    principalId: containerApp.identity.principalId
    principalType: 'ServicePrincipal'
  }
}

Why: Without AcrPull, the Container App revision will fail with "Identity with resource ID 'system' not found for registry." This blocks all updates to the container app.


Category 4c: Azure OpenAI Multi-Region (HIGH)

Rule 15d: Use Multiple Regions for Latest Models

Impact: HIGH — The newest models (e.g., GPT-5.4) often launch in limited regions first.

Correct pattern: Deploy Azure OpenAI resources in multiple regions and route via LiteLLM:

Korea Central:  gpt-4o, gpt-5.1-chat, gpt-5-pro
East US 2:      gpt-5.4-mini (newest family)
# LiteLLM routes to the correct endpoint based on model
MODEL_TIERS = {
    "generation": [
        {"model": "azure/gpt-5-1-chat", "api_base": KR_ENDPOINT},
        {"model": "azure/gpt-5-4-mini", "api_base": US_ENDPOINT},  # fallback
    ],
}

Why: Korea Central may not have the latest models on launch day. Having a secondary region (East US 2 or Sweden Central) ensures access to new models while keeping primary workloads in-region.

Rule 15e: Azure Cache for Redis Takes 15+ Minutes to Provision

Impact: MEDIUM — Redis provisioning blocks downstream Bicep resources that depend on it.

Correct: For sandbox/dev environments, consider using a Redis container on Azure Container Apps instead of managed Azure Cache for Redis:

// Sandbox: use containerized Redis (fast, cheap)
// Production: use Azure Cache for Redis (managed, HA)

Why: Azure Cache for Redis (even Basic C0) takes 10-20 minutes to provision. This delays initial deployments significantly. Containerized Redis starts in seconds.


Category 4d: Bicep & IaC Gotchas (HIGH)

Rule 15f: Use JSON Parameter Files for Secure Params

Impact: HIGH — Bicep .bicepparam files require ALL parameters declared, including secrets. JSON parameter files allow CLI overrides.

Wrong:

// sandbox.bicepparam — must declare ALL params including secrets
using '../main.bicep'
param environment = 'sandbox'
// ERROR: postgresPassword and jwtSecret are missing

Correct: Use JSON parameter files and pass secrets via CLI:

// sandbox.json
{
  "parameters": {
    "environment": { "value": "sandbox" },
    "location": { "value": "koreacentral" }
  }
}
az deployment group create \
  --parameters @infra/parameters/sandbox.json \
  --parameters postgresPassword="$PG_PASS" jwtSecret="$JWT_SECRET"

Why: Secrets should never appear in parameter files (even .bicepparam). JSON files + CLI overrides keep secrets out of source control.

Rule 15g: Founders Hub Subscriptions Limit RBAC via CLI

Impact: MEDIUM — az role assignment create may fail on sponsored subscriptions with "MissingSubscription" error.

Workaround: Assign roles via the Azure Portal instead:

  1. Navigate to the resource → Access Control (IAM)
  2. + AddAdd role assignment
  3. Select role → Select principal → Review + assign

Why: Some sponsored/educational subscriptions have API-level restrictions on role assignments. The Portal uses a different auth flow that works.


Category 5: Data & Storage (MEDIUM)

Rule 16: Cosmos DB — Use Partition Keys Wisely

Impact: MEDIUM — Bad partition keys create hot partitions and throttling.

Wrong:

container = database.create_container(
    id="orders",
    partition_key=PartitionKey(path="/region")  # Low cardinality
)

Correct:

container = database.create_container(
    id="orders",
    partition_key=PartitionKey(path="/customerId")  # High cardinality
)

Rule 17: Enable Soft Delete on Storage Accounts

Impact: MEDIUM — Without soft delete, accidental deletions are permanent.

Correct:

resource storageAccount 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  properties: {
    isSoftDeleteEnabled: true
    softDeleteRetentionDays: 30
  }
}

Rule 18: Use Connection Pooling for SQL

Impact: MEDIUM — Opening a new connection per request causes performance issues.

Wrong:

def get_data():
    conn = pyodbc.connect(connection_string)  # New connection every call
    cursor = conn.cursor()
    # ...
    conn.close()

Correct:

from sqlalchemy import create_engine

engine = create_engine(connection_string, pool_size=10, max_overflow=20)

def get_data():
    with engine.connect() as conn:
        # Connection from pool
        # ...

Category 6: Cost Optimization (MEDIUM)

Rule 19: Use Consumption-Based SKUs for Dev/Test

Impact: MEDIUM — Running premium SKUs in dev/test wastes budget.

Wrong: Using S1 App Service Plan for a dev environment.

Correct: Use B1 or Free tier for dev; Consumption plan for Functions dev.

Rule 20: Set Budget Alerts

Impact: MEDIUM — Without alerts, cost overruns go unnoticed until the bill arrives.

Correct:

resource budget 'Microsoft.Consumption/budgets@2023-11-01' = {
  name: 'monthly-budget'
  properties: {
    amount: 500
    timeGrain: 'Monthly'
    notifications: {
      actual_GreaterThan_80_Percent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: ['team@example.com']
      }
    }
  }
}

Rule 21: Auto-Shutdown Dev VMs

Impact: MEDIUM — Idle VMs running overnight and weekends are pure waste.

Correct:

resource autoShutdown 'Microsoft.DevTestLab/schedules@2018-09-15' = {
  name: 'shutdown-computevm-${vm.name}'
  properties: {
    status: 'Enabled'
    dailyRecurrence: { time: '1900' }
    timeZoneId: 'Korea Standard Time'
    targetResourceId: vm.id
  }
}

Category 7: Container Apps Advanced (HIGH)

Rule 22: Container Apps Jobs for Batch Workloads

Impact: HIGH — Using always-on Container Apps for batch/scheduled work wastes compute.

Wrong:

// Always-on container running a cron job internally
resource app 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'batch-processor'
  properties: {
    template: {
      containers: [{
        name: 'batch'
        image: 'myregistry.azurecr.io/batch:1.0'
        // Runs 24/7, but only does work every hour
      }]
      scale: { minReplicas: 1, maxReplicas: 1 }
    }
  }
}

Correct:

resource job 'Microsoft.App/jobs@2024-03-01' = {
  name: 'batch-processor'
  properties: {
    configuration: {
      triggerType: 'Schedule'
      scheduleTriggerConfig: {
        cronExpression: '0 * * * *'  // Every hour
        parallelism: 1
        replicaCompletionCount: 1
      }
      replicaTimeout: 1800  // 30 min max
      replicaRetryLimit: 2
    }
    template: {
      containers: [{
        name: 'batch'
        image: 'myregistry.azurecr.io/batch:1.0'
        resources: { cpu: json('1.0'), memory: '2Gi' }
      }]
    }
  }
}

Why: Container Apps Jobs run on-demand or on a schedule and stop when done. You only pay for execution time. Use for data processing, report generation, cleanup tasks, and migration scripts.

Rule 23: Azure Static Web Apps with API Backend

Impact: HIGH — Hosting static frontends on Container Apps or App Service is over-provisioned and costly.

Wrong:

// Full Container App for a React SPA
resource app 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'frontend'
  properties: {
    template: {
      containers: [{
        name: 'nginx'
        image: 'myregistry.azurecr.io/frontend:1.0'
        resources: { cpu: json('0.5'), memory: '1Gi' }
      }]
    }
  }
}

Correct:

# staticwebapp.config.json
{
  "navigationFallback": {
    "rewrite": "/index.html",
    "exclude": ["/api/*", "/images/*"]
  },
  "routes": [
    { "route": "/api/*", "allowedRoles": ["authenticated"] }
  ],
  "responseOverrides": {
    "401": { "redirect": "/login", "statusCode": 302 }
  },
  "platform": {
    "apiRuntime": "node:18"
  }
}
# GitHub Actions deployment
- uses: Azure/static-web-apps-deploy@v1
  with:
    azure_static_web_apps_api_token: ${{ secrets.SWA_TOKEN }}
    app_location: "/"
    api_location: "api"
    output_location: "dist"

Why: Static Web Apps provides free SSL, global CDN, built-in auth, and API Functions backend. Free tier handles most SPAs. Use Container Apps only when you need WebSocket, SSR, or custom server logic.


Category 8: Infrastructure as Code Advanced (HIGH)

Rule 24: Bicep Module Registry for Reusable Components

Impact: HIGH — Copy-pasting Bicep modules across repos leads to drift and inconsistency.

Wrong:

# Copy-pasting the same container-app.bicep into every project
project-a/infra/modules/container-app.bicep  # v1
project-b/infra/modules/container-app.bicep  # v1 with local edits
project-c/infra/modules/container-app.bicep  # v2, different from above

Correct:

// Reference module from Azure Container Registry
module containerApp 'br:myregistry.azurecr.io/bicep/modules/container-app:1.2.0' = {
  name: 'deploy-api'
  params: {
    name: 'my-api'
    image: 'myregistry.azurecr.io/api:1.0'
    environmentId: containerAppEnv.id
    managedIdentity: true
    minReplicas: 1
  }
}
# Publish module to registry
az bicep publish \
  --file modules/container-app.bicep \
  --target br:myregistry.azurecr.io/bicep/modules/container-app:1.2.0

Why: Bicep module registry in ACR provides versioning, immutable artifacts, and a single source of truth for infrastructure patterns. Tag modules with semantic versions.

Rule 25: Bicep Linting and What-If Before Deploy

Impact: HIGH — Deploying untested Bicep can delete or misconfigure production resources.

Correct:

# 1. Lint — catch syntax and best practice issues
az bicep lint --file main.bicep

# 2. What-If — preview changes before applying
az deployment group what-if \
  --resource-group my-rg \
  --template-file main.bicep \
  --parameters @params.json \
  --parameters apiKey=${{ secrets.API_KEY }}

# 3. Deploy only after review
az deployment group create \
  --resource-group my-rg \
  --template-file main.bicep \
  --parameters @params.json \
  --parameters apiKey=${{ secrets.API_KEY }}

Why: what-if shows creates, deletes, and modifications before they happen. Always run it in CI/CD before deploy. Use --confirm-with-what-if for interactive deployments.


Category 9: Monitoring & Observability (HIGH)

Rule 26: Azure Monitor Alerts and Action Groups

Impact: HIGH — Without alerting, outages go undetected until users report them.

Correct:

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ops-team'
  location: 'global'
  properties: {
    groupShortName: 'ops'
    enabled: true
    emailReceivers: [
      { name: 'ops-email', emailAddress: 'ops@contoso.com' }
    ]
    // For Korean Teams channel notifications
    webhookReceivers: [
      {
        name: 'teams-webhook'
        serviceUri: teamsWebhookUrl
        useCommonAlertSchema: true
      }
    ]
  }
}

resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'high-cpu-alert'
  location: 'global'
  properties: {
    severity: 2
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [{
        name: 'cpu'
        metricName: 'CpuUsageNanoCores'
        operator: 'GreaterThan'
        threshold: 800000000  // 80% of 1 core
        timeAggregation: 'Average'
      }]
    }
    actions: [{ actionGroupId: actionGroup.id }]
    targetResourceType: 'Microsoft.App/containerApps'
    scopes: [containerApp.id]
  }
}

Why: Set alerts for: CPU > 80%, memory > 80%, HTTP 5xx > 1%, response time > 2s. Use Action Groups to route to email, Teams webhooks, or PagerDuty.

Rule 27: Application Insights for End-to-End Tracing

Impact: HIGH — Without distributed tracing, diagnosing multi-service issues requires log correlation across services.

Correct:

from azure.monitor.opentelemetry import configure_azure_monitor

# One-line setup — auto-instruments requests, dependencies, exceptions
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

# Custom telemetry for business metrics
from opentelemetry import metrics
meter = metrics.get_meter("contoso-app")
request_counter = meter.create_counter(
    "business.requests",
    description="Business operation requests"
)

@app.post("/api/orders")
async def create_order(request):
    request_counter.add(1, {"operation": "create_order", "region": "KR"})
    # ... business logic

Why: Azure Monitor OpenTelemetry auto-instruments HTTP requests, database calls, and external dependencies. Add custom metrics for business KPIs. Use Application Map in Azure Portal to visualize service dependencies.


Category 10: Cost Optimization (MEDIUM)

Rule 28: Reserved Instances and Savings Plans

Impact: MEDIUM — Pay-as-you-go pricing is 30-60% more expensive than reserved capacity for predictable workloads.

Decision guide:

| Workload Pattern | Recommendation | Savings |
|---|---|---|
| Steady 24/7 (prod DB, app server) | 1-year Reserved Instance | ~35% |
| Steady 24/7, committed | 3-year Reserved Instance | ~55% |
| Variable but predictable compute | Azure Savings Plan | ~25% |
| Batch/dev/test | Spot VMs | ~60-90% |
| Short-lived experiments | Pay-as-you-go | 0% (but no commitment) |

Why: After running a workload for 2+ months with stable usage, evaluate reserved pricing. Use Azure Advisor cost recommendations to identify candidates.

Rule 29: Azure Front Door for Global Distribution

Impact: MEDIUM — Serving all traffic from a single region increases latency for global users.

Correct:

resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'contoso-fd'
  sku: { name: 'Standard_AzureFrontDoor' }
  location: 'global'
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: frontDoor
  name: 'api-endpoint'
  location: 'global'
  properties: { enabledState: 'Enabled' }
}

resource originGroup 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: frontDoor
  name: 'api-origins'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
    }
    healthProbeSettings: {
      probePath: '/health'
      probeRequestType: 'HEAD'
      probeProtocol: 'Https'
      probeIntervalInSeconds: 30
    }
  }
}

Why: Front Door provides global load balancing, WAF, SSL termination, and caching. Use Standard tier for static content caching, Premium tier for private link origins and advanced WAF rules. Note: a complete deployment also requires Microsoft.Cdn/profiles/originGroups/origins and Microsoft.Cdn/profiles/afdEndpoints/routes child resources — see Azure Front Door Bicep docs for full examples.

Rule 30: Azure Service Bus vs. Event Grid Decision Matrix

Impact: MEDIUM — Choosing the wrong messaging service leads to over-engineering or under-reliability.

Decision guide:

| Scenario | Use | Why |
|---|---|---|
| Command/task queue (order processing) | Service Bus Queue | Guaranteed delivery, FIFO, dead-letter |
| Publish/subscribe (notifications) | Service Bus Topic | Filtered subscriptions, sessions |
| Event-driven react (blob created) | Event Grid | Push-based, serverless, low latency |
| High-volume telemetry/logs | Event Hubs | Partitioned streaming, replay |
| Simple webhook notifications | Event Grid | HTTP push, no polling |
| Saga/workflow coordination | Service Bus + Sessions | Session-based message grouping |

Why: Service Bus = reliable messaging with transactions. Event Grid = reactive event routing. Event Hubs = high-throughput streaming. Don't use Service Bus for simple event notifications (Event Grid is cheaper and simpler). Don't use Event Grid for ordered processing (it doesn't guarantee order).

Rule 31: Budget Alerts and Cost Anomaly Detection

Impact: MEDIUM — Unexpected cost spikes from misconfigured services or attacks go unnoticed without alerts.

Correct:

resource budget 'Microsoft.Consumption/budgets@2023-11-01' = {
  name: 'monthly-budget'
  properties: {
    category: 'Cost'
    amount: 5000000  // ₩5,000,000 / month
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2026-04-01'
      endDate: '2027-03-31'
    }
    notifications: {
      '80Percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: ['ops@contoso.com']
        thresholdType: 'Actual'
      }
      '100Percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 100
        contactEmails: ['ops@contoso.com', 'finance@contoso.com']
        thresholdType: 'Actual'
      }
      'Forecast120': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 120
        contactEmails: ['ops@contoso.com']
        thresholdType: 'Forecasted'  // Alert on projected overspend
      }
    }
  }
}

Why: Set budget alerts at 80%, 100% actual and 120% forecasted. Enable Azure Cost Management anomaly detection for automatic alerts on unusual spending patterns.


Pre-Deployment Checklist

Run through this checklist before every production deployment:

Identity & Access:

  • All service connections use Managed Identity (no connection strings with secrets)
  • CI/CD uses workload identity federation (no service principal secrets)
  • All secrets stored in Key Vault (not in environment variables or config)
  • Role assignments follow least privilege
  • Key Vault Secrets User role assigned to app Managed Identity (Contributor is NOT enough)
  • Key Vault Secrets Officer role assigned to developer/deployer accounts
  • AcrPull role assigned to Container App Managed Identities

Container Apps / Compute:

  • Image tags are pinned (no :latest)
  • Health probes configured (liveness + readiness)
  • Health probe endpoints return proper HTTP status codes (use JSONResponse, not tuples)
  • CPU and memory limits set
  • HTTPS only enabled

Networking:

  • Private endpoints for all data services
  • CORS restricted to known origins
  • Diagnostic logging enabled

Infrastructure as Code:

  • Bicep parameter files use JSON format (not .bicepparam) for secret override support
  • Secrets passed via CLI --parameters flag, never in parameter files
  • Role assignments included in Bicep modules (don't rely on manual Portal setup)

Cost:

  • Dev/test environments use consumption or basic SKUs
  • Budget alerts configured
  • Dev VMs have auto-shutdown scheduled
  • Consider containerized Redis for sandbox (Azure Cache takes 15+ min to provision)

AI Services:

  • max_tokens set on all Azure OpenAI calls
  • Retry with exponential backoff for rate limits
  • Azure OpenAI accessed via Managed Identity (not API key)
  • Multi-region Azure OpenAI for access to newest models
  • LiteLLM or equivalent router for provider failover and cost tracking
Related skills
Installs
8
First Seen
Apr 10, 2026