Azure Best Practices

This skill provides rules for building secure, reliable, and cost-effective Azure services. Rules are ordered by impact: CRITICAL rules prevent outages or security breaches; HIGH rules prevent common production issues; MEDIUM rules improve maintainability and cost.

Learned Patterns (Auto-Updated)

Before applying the rules below, check if LESSONS.md exists in the project root. If it does, read the section tagged with azure-best-practices and apply those project-specific lessons alongside the rules below.

Category 1: Identity & Access (CRITICAL)

Rule 1: Use Managed Identity — Never Embed Credentials

Impact: CRITICAL — Hardcoded credentials are a leading cause of Azure security incidents.

❌ Wrong:

from azure.storage.blob import BlobServiceClient

client = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=mystore;AccountKey=abc123..."
)

✅ Correct:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://mystore.blob.core.windows.net",
    credential=credential
)

Why: DefaultAzureCredential tries a chain of credential sources, so the same code uses Managed Identity when running in Azure and your Azure CLI login on a developer machine. No secrets to leak.

Rule 2: Workload Identity Federation for CI/CD

Impact: CRITICAL — Service principal secrets in CI/CD pipelines expire and can be stolen.

❌ Wrong:

# GitHub Actions
- uses: azure/login@v2
  with:
    creds: ${{ secrets.AZURE_CREDENTIALS }}  # JSON with client secret

✅ Correct:

# GitHub Actions with OIDC
permissions:
  id-token: write
  contents: read

- uses: azure/login@v2
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

Why: Federated credentials use short-lived OIDC tokens — no long-lived secrets to rotate or protect.

Rule 3: Key Vault for All Secrets

Impact: CRITICAL — Environment variables are visible in portal, logs, and process listings.

❌ Wrong:

resource appService 'Microsoft.Web/sites@2023-12-01' = {
  properties: {
    siteConfig: {
      appSettings: [
        { name: 'DB_PASSWORD', value: 'super-secret-123' }
      ]
    }
  }
}

✅ Correct:

resource appService 'Microsoft.Web/sites@2023-12-01' = {
  properties: {
    siteConfig: {
      appSettings: [
        {
          name: 'DB_PASSWORD'
          value: '@Microsoft.KeyVault(SecretUri=${keyVault::dbPassword.properties.secretUri})'
        }
      ]
    }
  }
}

Why: Key Vault references are resolved at runtime; the secret value never appears in config or logs.

Rule 4: Least-Privilege Role Assignments

Impact: CRITICAL — Overly broad roles (Owner, Contributor) amplify the blast radius of a compromise.

❌ Wrong:

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      'b24988ac-6180-42a0-ab88-20f7382dd24c') // Contributor — too broad
    principalId: app.identity.principalId
  }
}

✅ Correct:

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      'ba92f5b4-2d11-453d-a403-e96b0029c9fe') // Storage Blob Data Contributor — scoped
    principalId: app.identity.principalId
  }
}
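
One way to enforce this rule is to scan existing role assignments for broad built-in roles. The sketch below assumes each assignment is a dict with `principalId` and `roleDefinitionId` fields (loosely modeled on the JSON that `az role assignment list` returns); `find_broad_assignments` is a hypothetical helper, not part of any Azure SDK.

```python
# Built-in role definition GUIDs. Contributor is the GUID from the example above;
# Owner is the well-known built-in value.
BROAD_ROLES = {
    "8e3af657-a8ff-443c-a75c-2fe8c4bcb635": "Owner",
    "b24988ac-6180-42a0-ab88-20f7382dd24c": "Contributor",
}


def find_broad_assignments(assignments):
    """Return (principal_id, role_name) for every assignment that uses a broad role.

    The role GUID is taken as the last path segment of roleDefinitionId.
    """
    flagged = []
    for assignment in assignments:
        guid = assignment["roleDefinitionId"].rstrip("/").split("/")[-1].lower()
        if guid in BROAD_ROLES:
            flagged.append((assignment["principalId"], BROAD_ROLES[guid]))
    return flagged
```

Running this over the assignments in a resource group gives a short list of principals to review and narrow to data-plane roles such as Storage Blob Data Contributor.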

Category 2: Container Apps & Deployment (CRITICAL)

Rule 5: Pin Image Tags — Never Use :latest

Impact: CRITICAL — :latest makes deployments unpredictable and rollbacks unreliable, because the tag can point to a different image each time it is pulled.

❌ Wrong:

containers:
  - name: api
    image: myregistry.azurecr.io/api:latest

✅ Correct:

containers:
  - name: api
    image: myregistry.azurecr.io/api:1.2.3-sha-a1b2c3d
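
A check like the following could run in CI to reject unpinned image references before they reach a deployment. It is a minimal sketch: `is_pinned` is a hypothetical helper, and treating a digest reference (`@sha256:`) as pinned is an assumption that goes slightly beyond the rule above.

```python
def is_pinned(image: str) -> bool:
    """True if the image has an explicit tag other than 'latest', or a digest."""
    if "@sha256:" in image:
        return True
    # Only look for a tag after the final '/', so a registry port such as
    # host:5000/api is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # untagged references resolve to :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "" and tag != "latest"


if __name__ == "__main__":
    for ref in (
        "myregistry.azurecr.io/api:1.2.3-sha-a1b2c3d",
        "myregistry.azurecr.io/api:latest",
        "myregistry.azurecr.io/api",
    ):
        print(ref, is_pinned(ref))
```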

Rule 6: Always Configure Health Probes

Impact: CRITICAL — Without health probes that exercise the application, Azure can keep routing traffic to containers that are running but unable to serve requests.

❌ Wrong:

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  properties: {
    template: {
      containers: [{ name: 'api', image: '...' }]
      // No probes defined
    }
  }
}

✅ Correct:

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  properties: {
    template: {
      containers: [{
        name: 'api'
        image: '...'
        probes: [
          {
            type: 'Liveness'
            httpGet: { path: '/healthz', port: 8080 }
            periodSeconds: 10
          }
          {
            type: 'Readiness'
            httpGet: { path: '/ready', port: 8080 }
            periodSeconds: 5
          }
        ]
      }]
    }
  }
}

Rule 7: Set CPU and Memory Limits

Impact: HIGH — Unlimited resources cause noisy-neighbor problems and cost overruns.

❌ Wrong:

containers: [{
  name: 'api'
  image: '...'
  // No resource limits
}]

✅ Correct:

containers: [{
  name: 'api'
  image: '...'
  resources: {
    cpu: json('0.5')
    memory: '1Gi'
  }
}]

Rule 8: Use Revision Scope for Zero-Downtime Deploys

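Impact: HIGH — In single-revision mode every deployment replaces the running revision outright: there is no traffic splitting, so a bad release receives all traffic and rolling back means redeploying.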

❌ Wrong:

properties: {
  configuration: {
    activeRevisionsMode: 'Single'
  }
}

✅ Correct:

properties: {
  configuration: {
    activeRevisionsMode: 'Multiple'
    ingress: {
      traffic: [
        { revisionName: 'api--v2', weight: 100 }
      ]
    }
  }
}

Category 3: Azure OpenAI & AI Foundry (HIGH)

Rule 9: Always Set max_tokens

Impact: HIGH — Without max_tokens, a runaway prompt can consume your entire quota.

❌ Wrong:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
    # No max_tokens — defaults to model maximum
)

✅ Correct:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096
)

Rule 10: Retry with Exponential Backoff for 429s

Impact: HIGH — Azure OpenAI rate-limits aggressively; tight retry loops make it worse.

❌ Wrong:

import time

for attempt in range(5):
    try:
        response = client.chat.completions.create(...)
        break
    except Exception:
        time.sleep(1)  # Fixed 1-second delay

✅ Correct:

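import random
import time

import openai

for attempt in range(5):
    try:
        response = client.chat.completions.create(...)
        break
    except openai.RateLimitError:
        # Exponential backoff (1s, 2s, 4s, ...) plus jitter to avoid synchronized retries
        wait = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(wait)
else:
    # Every attempt was rate-limited; fail loudly instead of continuing without a response
    raise RuntimeError("Azure OpenAI request still rate-limited after 5 attempts")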
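
The same backoff logic can be factored into a reusable helper and exercised without calling the service. This is a sketch under assumptions: `RateLimited` is a stand-in for a 429 error such as `openai.RateLimitError`, the 30-second cap is an arbitrary choice, and `call_with_backoff` is a hypothetical name.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a 429 error such as openai.RateLimitError (hypothetical)."""


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=random.random):
    """Delay before retrying after 0-based `attempt`: base * 2^attempt, capped, plus jitter."""
    return min(cap, base * (2 ** attempt)) + jitter()


def call_with_backoff(fn, max_attempts=5, sleep=time.sleep, jitter=random.random):
    """Call fn(); on RateLimited, wait with exponential backoff and retry.

    Re-raises the last RateLimited error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, jitter=jitter))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Fails twice with a simulated 429, then succeeds
        calls["n"] += 1
        if calls["n"] < 3:
            raise RateLimited()
        return "ok"

    waits = []
    print(call_with_backoff(flaky, sleep=waits.append, jitter=lambda: 0.0))  # ok
    print(waits)  # [1.0, 2.0]
```

Injecting `sleep` and `jitter` keeps the helper deterministic in tests while production code uses the defaults.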

Rule 11: Use Managed Identity for Azure OpenAI

Impact: HIGH — API keys are long-lived secrets that can be leaked.

❌ Wrong:

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="sk-abc123...",
    api_version="2024-06-01",
    azure_endpoint="https://myoai.openai.azure.com"
)

✅ Correct:

from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",
    azure_endpoint="https://myoai.openai.azure.com"
)

Category 4: Networking & Security (HIGH)

Rule 12: Enable HTTPS Only

Impact: HIGH — HTTP traffic is unencrypted and vulnerable to interception.

❌ Wrong:

properties: {
  httpsOnly: false
}

✅ Correct:

properties: {
  httpsOnly: true
}

Rule 13: Use Private Endpoints for Data Services

Impact: HIGH — Public endpoints expose databases and storage to the internet.

❌ Wrong:

resource cosmosDb 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
  properties: {
    publicNetworkAccess: 'Enabled'  // Accessible from internet
  }
}

✅ Correct:

resource cosmosDb 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
  properties: {
    publicNetworkAccess: 'Disabled'
  }
}

resource privateEndpoint 'Microsoft.Network/privateEndpoints@2023-11-01' = {
  properties: {
    privateLinkServiceConnections: [{
      properties: {
        privateLinkServiceId: cosmosDb.id
        groupIds: ['Sql']
      }
    }]
  }
}

Rule 14: Enable Diagnostic Logging

Impact: HIGH — Without logs, you cannot investigate incidents.

❌ Wrong: No diagnostic settings configured (the default).

✅ Correct:

resource diagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  scope: containerApp
  properties: {
    workspaceId: logAnalytics.id
    logs: [{ categoryGroup: 'allLogs', enabled: true }]
    metrics: [{ category: 'AllMetrics', enabled: true }]
  }
}

Rule 15: CORS — Restrict Origins

Impact: HIGH — Wildcard CORS allows any website to call your API.

❌ Wrong:

cors: {
  allowedOrigins: ['*']
}

✅ Correct:

cors: {
  allowedOrigins: ['https://myapp.example.com']
}

Category 4b: Key Vault RBAC Gotchas (HIGH)

Rule 15b: Contributor Does NOT Grant Key Vault Secret Access

Impact: HIGH — Apps with Contributor role on a resource group still cannot read Key Vault secrets when RBAC is enabled.

❌ Wrong assumption:

"The backend has Contributor role on the resource group, so it can read Key Vault secrets."

✅ Correct: When Key Vault uses RBAC authorization (enableRbacAuthorization: true), you need separate data-plane roles:

| Who | Role | Purpose |
|---|---|---|
| Application (Managed Identity) | Key Vault Secrets User | Read secrets at runtime |
| Developer / CI/CD | Key Vault Secrets Officer | Write/manage secrets |
| Admin | Key Vault Administrator | Full control |

// Grant the app read-only access to secrets
resource kvRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: keyVault
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6') // Key Vault Secrets User
    principalId: containerApp.identity.principalId
    principalType: 'ServicePrincipal'
  }
}

Why: Key Vault separates management-plane (Contributor) from data-plane (Secrets User/Officer). This is by design for defense-in-depth.

Rule 15c: Container Apps Need AcrPull for ACR

Impact: HIGH — Container Apps with identity: 'system' for ACR registry fail if no AcrPull role is assigned.

❌ Wrong:

registries: [{
  server: acrLoginServer
  identity: 'system'  // Assumes Managed Identity can pull — it cannot without AcrPull
}]

✅ Correct: Assign AcrPull role to the Container App's managed identity:

resource acrPullRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: acr
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '7f951dda-4ed3-4680-a7ca-43fe172d538d') // AcrPull
    principalId: containerApp.identity.principalId
    principalType: 'ServicePrincipal'
  }
}

Why: Without AcrPull, the Container App revision will fail with "Identity with resource ID 'system' not found for registry." This blocks all updates to the container app.

Category 4c: Azure OpenAI Multi-Region (HIGH)

Rule 15d: Use Multiple Regions for Latest Models

Impact: HIGH — The newest models (e.g., GPT-5.4) often launch in limited regions first.

✅ Correct pattern: Deploy Azure OpenAI resources in multiple regions and route via LiteLLM:

Korea Central:  gpt-4o, gpt-5.1-chat, gpt-5-pro
East US 2:      gpt-5.4-mini (newest family)

# LiteLLM routes to the correct endpoint based on model
MODEL_TIERS = {
    "generation": [
        {"model": "azure/gpt-5-1-chat", "api_base": KR_ENDPOINT},
        {"model": "azure/gpt-5-4-mini", "api_base": US_ENDPOINT},  # fallback
    ],
}

Why: Korea Central may not have the latest models on launch day. Having a secondary region (East US 2 or Sweden Central) ensures access to new models while keeping primary workloads in-region.
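
The ordering logic behind such a tier list can be sketched in plain Python. This is not LiteLLM's API: `pick_deployment` is a hypothetical helper, the endpoint URLs are placeholders, and the `unavailable` set stands in for whatever health or error signal the real router uses.

```python
# Placeholder endpoints; substitute your own resource URLs.
KR_ENDPOINT = "https://example-kr.openai.azure.com"
US_ENDPOINT = "https://example-us.openai.azure.com"

MODEL_TIERS = {
    "generation": [
        {"model": "azure/gpt-5-1-chat", "api_base": KR_ENDPOINT},
        {"model": "azure/gpt-5-4-mini", "api_base": US_ENDPOINT},  # fallback
    ],
}


def pick_deployment(tier, unavailable=frozenset()):
    """Return the first deployment in the tier whose endpoint is not marked unavailable."""
    for deployment in MODEL_TIERS[tier]:
        if deployment["api_base"] not in unavailable:
            return deployment
    raise RuntimeError(f"no available deployment for tier {tier!r}")


if __name__ == "__main__":
    print(pick_deployment("generation")["model"])
    print(pick_deployment("generation", unavailable={KR_ENDPOINT})["model"])
```

List order encodes preference, so the in-region deployment is used whenever it is available and the secondary region only takes over on failure.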

Rule 15e: Azure Cache for Redis Takes 15+ Minutes to Provision

Impact: MEDIUM — Redis provisioning blocks downstream Bicep resources that depend on it.

✅ Correct: For sandbox/dev environments, consider using a Redis container on Azure Container Apps instead of managed Azure Cache for Redis:

// Sandbox: use containerized Redis (fast, cheap)
// Production: use Azure Cache for Redis (managed, HA)

Why: Azure Cache for Redis (even Basic C0) takes 10-20 minutes to provision. This delays initial deployments significantly. Containerized Redis starts in seconds.

Category 4d: Bicep & IaC Gotchas (HIGH)

Rule 15f: Use JSON Parameter Files for Secure Params

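Impact: HIGH — A .bicepparam file must assign a value to every parameter that has no default, secrets included, which pushes secret values into a file that tends to get committed. JSON parameter files can leave the secrets out and take them as CLI overrides.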

❌ Wrong:

// sandbox.bicepparam — must declare ALL params including secrets
using '../main.bicep'
param environment = 'sandbox'
// ERROR: postgresPassword and jwtSecret are missing

✅ Correct: Use JSON parameter files and pass secrets via CLI:

// sandbox.json
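{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environment": { "value": "sandbox" },
    "location": { "value": "koreacentral" }
  }
}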

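# my-rg is a placeholder resource group name
az deployment group create \
  --resource-group my-rg \
  --template-file infra/main.bicep \
  --parameters @infra/parameters/sandbox.json \
  --parameters postgresPassword="$PG_PASS" jwtSecret="$JWT_SECRET"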

Why: Secrets should never appear in parameter files (even .bicepparam). JSON files + CLI overrides keep secrets out of source control.

Rule 15g: Founders Hub Subscriptions Limit RBAC via CLI

Impact: MEDIUM — az role assignment create may fail on sponsored subscriptions with "MissingSubscription" error.

✅ Workaround: Assign roles via the Azure Portal instead:

1. Navigate to the resource → Access Control (IAM)
2. Click + Add → Add role assignment
3. Select role → Select principal → Review + assign

Why: Some sponsored/educational subscriptions have API-level restrictions on role assignments. The Portal uses a different auth flow that works.

Category 5: Data & Storage (MEDIUM)

Rule 16: Cosmos DB — Use Partition Keys Wisely

Impact: MEDIUM — Bad partition keys create hot partitions and throttling.

❌ Wrong:

container = database.create_container(
    id="orders",
    partition_key=PartitionKey(path="/region")  # Low cardinality
)

✅ Correct:

from azure.cosmos import PartitionKey

container = database.create_container(
    id="orders",
    partition_key=PartitionKey(path="/customerId")  # High cardinality
)
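
To see why cardinality matters, it helps to measure how much of the data lands in the busiest partition for each candidate key. The sketch below uses made-up sample data (1,000 orders, two regions, 250 customers); `largest_partition_share` is a hypothetical helper, not a Cosmos DB API.

```python
from collections import Counter


def largest_partition_share(items, key):
    """Fraction of items that share the single most common value of `key`."""
    counts = Counter(item[key] for item in items)
    return max(counts.values()) / len(items)


# Hypothetical sample: 80% of orders come from one region, customers are spread evenly
orders = [
    {"region": "KR" if i % 5 else "US", "customerId": f"c{i % 250}"}
    for i in range(1000)
]

if __name__ == "__main__":
    print(largest_partition_share(orders, "region"))      # 0.8
    print(largest_partition_share(orders, "customerId"))  # 0.004
```

With `/region` as the key, one logical partition would absorb 80% of this sample, while `/customerId` spreads it almost evenly.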

Rule 17: Enable Soft Delete on Storage Accounts

Impact: MEDIUM — Without soft delete, accidental deletions are permanent.

✅ Correct:

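resource blobService 'Microsoft.Storage/storageAccounts/blobServices@2023-05-01' = {
  parent: storageAccount
  name: 'default'
  properties: {
    // Soft delete is configured on the blob service, not on the storage account itself
    deleteRetentionPolicy: { enabled: true, days: 30 }
    containerDeleteRetentionPolicy: { enabled: true, days: 30 }
  }
}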

Rule 18: Use Connection Pooling for SQL

Impact: MEDIUM — Opening a new connection per request adds connection-setup latency to every call and can exhaust the database's connection limit under load.

❌ Wrong:

def get_data():
    conn = pyodbc.connect(connection_string)  # New connection every call
    cursor = conn.cursor()
    # ...
    conn.close()

✅ Correct:

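from sqlalchemy import create_engine, text

# Create the engine once at module load; it owns the connection pool
engine = create_engine(connection_string, pool_size=10, max_overflow=20)

def get_data():
    with engine.connect() as conn:
        # Connection is borrowed from the pool and returned when the block exits
        # Placeholder query; replace with the real statement
        return conn.execute(text("SELECT 1")).fetchall()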

Category 6: Cost Optimization (MEDIUM)

Rule 19: Use Consumption-Based SKUs for Dev/Test

Impact: MEDIUM — Running production-grade SKUs in dev/test wastes budget.

❌ Wrong: Using S1 App Service Plan for a dev environment.

✅ Correct: Use B1 or Free tier for dev; Consumption plan for Functions dev.

Rule 20: Set Budget Alerts

Impact: MEDIUM — Without alerts, cost overruns go unnoticed until the bill arrives.

✅ Correct:

resource budget 'Microsoft.Consumption/budgets@2023-11-01' = {
  name: 'monthly-budget'
  properties: {
    category: 'Cost'
    amount: 500
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2026-04-01'  // Example start date
    }
    notifications: {
      actual_GreaterThan_80_Percent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: ['team@example.com']
      }
    }
  }
}

Rule 21: Auto-Shutdown Dev VMs

Impact: MEDIUM — Idle VMs running overnight and weekends are pure waste.

✅ Correct:

resource autoShutdown 'Microsoft.DevTestLab/schedules@2018-09-15' = {
  name: 'shutdown-computevm-${vm.name}'
  location: vm.location
  properties: {
    status: 'Enabled'
    taskType: 'ComputeVmShutdownTask'
    dailyRecurrence: { time: '1900' }
    timeZoneId: 'Korea Standard Time'
    targetResourceId: vm.id
  }
}

Category 7: Container Apps Advanced (HIGH)

Rule 22: Container Apps Jobs for Batch Workloads

Impact: HIGH — Using always-on Container Apps for batch/scheduled work wastes compute.

❌ Wrong:

// Always-on container running a cron job internally
resource app 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'batch-processor'
  properties: {
    template: {
      containers: [{
        name: 'batch'
        image: 'myregistry.azurecr.io/batch:1.0'
        // Runs 24/7, but only does work every hour
      }]
      scale: { minReplicas: 1, maxReplicas: 1 }
    }
  }
}

✅ Correct:

resource job 'Microsoft.App/jobs@2024-03-01' = {
  name: 'batch-processor'
  properties: {
    configuration: {
      triggerType: 'Schedule'
      scheduleTriggerConfig: {
        cronExpression: '0 * * * *'  // Every hour
        parallelism: 1
        replicaCompletionCount: 1
      }
      replicaTimeout: 1800  // 30 min max
      replicaRetryLimit: 2
    }
    template: {
      containers: [{
        name: 'batch'
        image: 'myregistry.azurecr.io/batch:1.0'
        resources: { cpu: json('1.0'), memory: '2Gi' }
      }]
    }
  }
}

Why: Container Apps Jobs run on-demand or on a schedule and stop when done. You only pay for execution time. Use for data processing, report generation, cleanup tasks, and migration scripts.
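
The difference in active compute time can be worked out directly. The sketch below compares hours of running time only, not prices; the 5-minute run duration is an assumption for illustration, and `monthly_active_hours` is a hypothetical helper.

```python
ALWAYS_ON_HOURS = 24 * 30  # one replica running continuously for a 30-day month


def monthly_active_hours(runs_per_day: float, minutes_per_run: float, days: int = 30) -> float:
    """Hours per month that a scheduled job is actually running."""
    return runs_per_day * minutes_per_run / 60 * days


if __name__ == "__main__":
    job_hours = monthly_active_hours(runs_per_day=24, minutes_per_run=5)
    print(job_hours)                    # 60.0
    print(ALWAYS_ON_HOURS)              # 720
    print(job_hours / ALWAYS_ON_HOURS)  # fraction of always-on time the job needs
```

Under these assumptions an hourly job is active for 60 hours a month, against 720 hours for the always-on container in the wrong example.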

Rule 23: Azure Static Web Apps with API Backend

Impact: HIGH — Hosting static frontends on Container Apps or App Service is over-provisioned and costly.

❌ Wrong:

// Full Container App for a React SPA
resource app 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'frontend'
  properties: {
    template: {
      containers: [{
        name: 'nginx'
        image: 'myregistry.azurecr.io/frontend:1.0'
        resources: { cpu: json('0.5'), memory: '1Gi' }
      }]
    }
  }
}

✅ Correct:

# staticwebapp.config.json
{
  "navigationFallback": {
    "rewrite": "/index.html",
    "exclude": ["/api/*", "/images/*"]
  },
  "routes": [
    { "route": "/api/*", "allowedRoles": ["authenticated"] }
  ],
  "responseOverrides": {
    "401": { "redirect": "/login", "statusCode": 302 }
  },
  "platform": {
    "apiRuntime": "node:18"
  }
}

# GitHub Actions deployment
- uses: Azure/static-web-apps-deploy@v1
  with:
    azure_static_web_apps_api_token: ${{ secrets.SWA_TOKEN }}
    action: "upload"
    app_location: "/"
    api_location: "api"
    output_location: "dist"

Why: Static Web Apps provides free SSL, global CDN, built-in auth, and API Functions backend. Free tier handles most SPAs. Use Container Apps only when you need WebSocket, SSR, or custom server logic.

Category 8: Infrastructure as Code Advanced (HIGH)

Rule 24: Bicep Module Registry for Reusable Components

Impact: HIGH — Copy-pasting Bicep modules across repos leads to drift and inconsistency.

❌ Wrong:

# Copy-pasting the same container-app.bicep into every project
project-a/infra/modules/container-app.bicep  # v1
project-b/infra/modules/container-app.bicep  # v1 with local edits
project-c/infra/modules/container-app.bicep  # v2, different from above

✅ Correct:

// Reference module from Azure Container Registry
module containerApp 'br:myregistry.azurecr.io/bicep/modules/container-app:1.2.0' = {
  name: 'deploy-api'
  params: {
    name: 'my-api'
    image: 'myregistry.azurecr.io/api:1.0'
    environmentId: containerAppEnv.id
    managedIdentity: true
    minReplicas: 1
  }
}

# Publish module to registry
az bicep publish \
  --file modules/container-app.bicep \
  --target br:myregistry.azurecr.io/bicep/modules/container-app:1.2.0

Why: Bicep module registry in ACR provides versioning, immutable artifacts, and a single source of truth for infrastructure patterns. Tag modules with semantic versions.

Rule 25: Bicep Linting and What-If Before Deploy

Impact: HIGH — Deploying untested Bicep can delete or misconfigure production resources.

✅ Correct:

# 1. Lint — catch syntax and best practice issues
az bicep lint --file main.bicep

# 2. What-If — preview changes before applying
az deployment group what-if \
  --resource-group my-rg \
  --template-file main.bicep \
  --parameters @params.json \
  --parameters apiKey=${{ secrets.API_KEY }}

# 3. Deploy only after review
az deployment group create \
  --resource-group my-rg \
  --template-file main.bicep \
  --parameters @params.json \
  --parameters apiKey=${{ secrets.API_KEY }}

Why: what-if shows creates, deletes, and modifications before they happen. Always run it in CI/CD before deploy. Use --confirm-with-what-if for interactive deployments.

Category 9: Monitoring & Observability (HIGH)

Rule 26: Azure Monitor Alerts and Action Groups

Impact: HIGH — Without alerting, outages go undetected until users report them.

✅ Correct:

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ops-team'
  location: 'global'
  properties: {
    groupShortName: 'ops'
    enabled: true
    emailReceivers: [
      { name: 'ops-email', emailAddress: 'ops@contoso.com' }
    ]
    // For Korean Teams channel notifications
    webhookReceivers: [
      {
        name: 'teams-webhook'
        serviceUri: teamsWebhookUrl
        useCommonAlertSchema: true
      }
    ]
  }
}

resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'high-cpu-alert'
  location: 'global'
  properties: {
    severity: 2
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [{
        criterionType: 'StaticThresholdCriterion'
        name: 'cpu'
        metricName: 'UsageNanoCores'
        operator: 'GreaterThan'
        threshold: 800000000  // 80% of 1 core
        timeAggregation: 'Average'
      }]
    }
    actions: [{ actionGroupId: actionGroup.id }]
    targetResourceType: 'Microsoft.App/containerApps'
    scopes: [containerApp.id]
  }
}

Why: Set alerts for: CPU > 80%, memory > 80%, HTTP 5xx > 1%, response time > 2s. Use Action Groups to route to email, Teams webhooks, or PagerDuty.
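
The CPU threshold in the alert above is expressed in nanocores, so it has to be scaled to the CPU actually allocated to the container. A minimal sketch of that arithmetic follows; `cpu_threshold_nanocores` is a hypothetical helper, and the 0.5-core case assumes the allocation shown in Rule 7.

```python
NANOCORES_PER_CORE = 1_000_000_000


def cpu_threshold_nanocores(allocated_cores: float, percent: int) -> int:
    """Alert threshold in nanocores for `percent` of the allocated CPU."""
    return round(allocated_cores * NANOCORES_PER_CORE * percent / 100)


if __name__ == "__main__":
    print(cpu_threshold_nanocores(1.0, 80))  # 800000000
    print(cpu_threshold_nanocores(0.5, 80))  # 400000000
```

A container allocated 0.5 cores would never reach an 800,000,000 threshold, so an alert copied from a 1-core app would stay silent.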

Rule 27: Application Insights for End-to-End Tracing

Impact: HIGH — Without distributed tracing, diagnosing multi-service issues requires manually correlating logs across services.

✅ Correct:

import os

from azure.monitor.opentelemetry import configure_azure_monitor

# One-line setup — auto-instruments requests, dependencies, exceptions
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

# Custom telemetry for business metrics
from opentelemetry import metrics
meter = metrics.get_meter("contoso-app")
request_counter = meter.create_counter(
    "business.requests",
    description="Business operation requests"
)

@app.post("/api/orders")
async def create_order(request):
    request_counter.add(1, {"operation": "create_order", "region": "KR"})
    # ... business logic

Why: Azure Monitor OpenTelemetry auto-instruments HTTP requests, database calls, and external dependencies. Add custom metrics for business KPIs. Use Application Map in Azure Portal to visualize service dependencies.

Category 10: Cost Optimization (MEDIUM)

Rule 28: Reserved Instances and Savings Plans

Impact: MEDIUM — For predictable workloads, reserved capacity and savings plans typically cost about 25-55% less than pay-as-you-go pricing.

✅ Decision guide:

| Workload Pattern | Recommendation | Savings |
|---|---|---|
| Steady 24/7 (prod DB, app server) | 1-year Reserved Instance | ~35% |
| Steady 24/7, committed | 3-year Reserved Instance | ~55% |
| Variable but predictable compute | Azure Savings Plan | ~25% |
| Batch/dev/test | Spot VMs | ~60-90% |
| Short-lived experiments | Pay-as-you-go | 0% (but no commitment) |

Why: After running a workload for 2+ months with stable usage, evaluate reserved pricing. Use Azure Advisor cost recommendations to identify candidates.
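
The savings percentages in the table translate into a pay-as-you-go premium that is larger than the discount itself. The sketch below works through that arithmetic; the $0.20 hourly rate is a made-up figure and both function names are hypothetical.

```python
def payg_premium(savings_fraction: float) -> float:
    """How much more pay-as-you-go costs relative to the discounted price.

    If the discounted price is (1 - s) of pay-as-you-go, then pay-as-you-go
    is 1 / (1 - s) - 1 more expensive than the discounted price.
    """
    return 1 / (1 - savings_fraction) - 1


def annual_cost(hourly_rate: float, savings_fraction: float = 0.0, hours: int = 8760) -> float:
    """Yearly cost of an always-on resource at a given discount."""
    return hourly_rate * hours * (1 - savings_fraction)


if __name__ == "__main__":
    print(f"{payg_premium(0.35):.0%}")       # 54%
    print(f"{payg_premium(0.55):.0%}")       # 122%
    print(f"{annual_cost(0.20):.2f}")        # 1752.00
    print(f"{annual_cost(0.20, 0.35):.2f}")  # 1138.80
```

A 35% reservation discount means pay-as-you-go costs about 54% more for the same year of usage, and at 55% it costs more than double.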

Rule 29: Azure Front Door for Global Distribution

Impact: MEDIUM — Serving all traffic from a single region increases latency for global users.

✅ Correct:

resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'contoso-fd'
  sku: { name: 'Standard_AzureFrontDoor' }
  location: 'global'
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: frontDoor
  name: 'api-endpoint'
  location: 'global'
  properties: { enabledState: 'Enabled' }
}

resource originGroup 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: frontDoor
  name: 'api-origins'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
    }
    healthProbeSettings: {
      probePath: '/health'
      probeRequestType: 'HEAD'
      probeProtocol: 'Https'
      probeIntervalInSeconds: 30
    }
  }
}

Why: Front Door provides global load balancing, WAF, SSL termination, and caching. Use Standard tier for static content caching, Premium tier for private link origins and advanced WAF rules. Note: a complete deployment also requires Microsoft.Cdn/profiles/originGroups/origins and Microsoft.Cdn/profiles/afdEndpoints/routes child resources — see Azure Front Door Bicep docs for full examples.

Rule 30: Azure Service Bus vs. Event Grid Decision Matrix

Impact: MEDIUM — Choosing the wrong messaging service leads to over-engineering or under-reliability.

✅ Decision guide:

| Scenario | Use | Why |
|---|---|---|
| Command/task queue (order processing) | Service Bus Queue | Guaranteed delivery, FIFO (with sessions), dead-letter |
| Publish/subscribe (notifications) | Service Bus Topic | Filtered subscriptions, sessions |
| Event-driven react (blob created) | Event Grid | Push-based, serverless, low latency |
| High-volume telemetry/logs | Event Hubs | Partitioned streaming, replay |
| Simple webhook notifications | Event Grid | HTTP push, no polling |
| Saga/workflow coordination | Service Bus + Sessions | Session-based message grouping |

Why: Service Bus = reliable messaging with transactions. Event Grid = reactive event routing. Event Hubs = high-throughput streaming. Don't use Service Bus for simple event notifications (Event Grid is cheaper and simpler). Don't use Event Grid for ordered processing (it doesn't guarantee order).

Rule 31: Budget Alerts and Cost Anomaly Detection

Impact: MEDIUM — Unexpected cost spikes from misconfigured services or attacks go unnoticed without alerts.

✅ Correct:

resource budget 'Microsoft.Consumption/budgets@2023-11-01' = {
  name: 'monthly-budget'
  properties: {
    category: 'Cost'
    amount: 5000000  // ₩5,000,000 / month
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2026-04-01'
      endDate: '2027-03-31'
    }
    notifications: {
      '80Percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: ['ops@contoso.com']
        thresholdType: 'Actual'
      }
      '100Percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 100
        contactEmails: ['ops@contoso.com', 'finance@contoso.com']
        thresholdType: 'Actual'
      }
      'Forecast120': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 120
        contactEmails: ['ops@contoso.com']
        thresholdType: 'Forecasted'  // Alert on projected overspend
      }
    }
  }
}

Why: Set budget alerts at 80%, 100% actual and 120% forecasted. Enable Azure Cost Management anomaly detection for automatic alerts on unusual spending patterns.
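
How the three thresholds interact can be sketched with a small model. This is a simplification of how the notifications are evaluated, assuming a strict greater-than comparison against a percentage of the budget; `triggered` is a hypothetical helper and the spend figures are invented.

```python
BUDGET = 5_000_000  # monthly amount from the budget above

# (name, threshold percent, threshold type), mirroring the three notifications
NOTIFICATIONS = [
    ("80Percent", 80, "Actual"),
    ("100Percent", 100, "Actual"),
    ("Forecast120", 120, "Forecasted"),
]


def triggered(actual: float, forecast: float, budget: float = BUDGET):
    """Names of notifications whose threshold is exceeded (operator: GreaterThan)."""
    fired = []
    for name, percent, kind in NOTIFICATIONS:
        value = actual if kind == "Actual" else forecast
        if value > budget * percent / 100:
            fired.append(name)
    return fired


if __name__ == "__main__":
    # Mid-month: 4.2M spent so far, 6.5M projected by month end
    print(triggered(actual=4_200_000, forecast=6_500_000))  # ['80Percent', 'Forecast120']
```

The forecasted threshold fires while actual spend is still below budget, which is what gives the team time to react.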

Pre-Deployment Checklist

Run through this checklist before every production deployment:

Identity & Access:

All service connections use Managed Identity (no connection strings with secrets)
CI/CD uses workload identity federation (no service principal secrets)
All secrets stored in Key Vault (not in environment variables or config)
Role assignments follow least privilege
Key Vault Secrets User role assigned to app Managed Identity (Contributor is NOT enough)
Key Vault Secrets Officer role assigned to developer/deployer accounts
AcrPull role assigned to Container App Managed Identities

Container Apps / Compute:

Image tags are pinned (no :latest)
Health probes configured (liveness + readiness)
Health probe endpoints return proper HTTP status codes (use JSONResponse, not tuples)
CPU and memory limits set
HTTPS only enabled

Networking:

Private endpoints for all data services
CORS restricted to known origins
Diagnostic logging enabled

Infrastructure as Code:

Bicep parameter files use JSON format (not .bicepparam) for secret override support
Secrets passed via CLI --parameters flag, never in parameter files
Role assignments included in Bicep modules (don't rely on manual Portal setup)

Cost:

Dev/test environments use consumption or basic SKUs
Budget alerts configured
Dev VMs have auto-shutdown scheduled
Consider containerized Redis for sandbox (Azure Cache takes 15+ min to provision)

AI Services:

max_tokens set on all Azure OpenAI calls
Retry with exponential backoff for rate limits
Azure OpenAI accessed via Managed Identity (not API key)
Multi-region Azure OpenAI for access to newest models
LiteLLM or equivalent router for provider failover and cost tracking
