azure-ai-foundry
Azure AI Foundry Best Practices
This skill provides rules for building secure, reliable, and cost-effective AI applications on Azure AI Foundry. Rules are ordered by impact: CRITICAL rules prevent security breaches or runaway costs; HIGH rules prevent common production issues; MEDIUM rules improve maintainability and efficiency.
Learned Patterns (Auto-Updated)
Before applying the rules below, check if LESSONS.md exists in the project root. If it does, read the section tagged with azure-ai-foundry and apply those project-specific lessons alongside the rules below.
Category 1: Identity & Security (CRITICAL)
Rule 1: Use Managed Identity for AI Services — Never Use API Keys
Impact: CRITICAL — API keys for Azure OpenAI are the most commonly leaked credentials in AI projects.
❌ Wrong:
from openai import AzureOpenAI
client = AzureOpenAI(
    api_key="sk-abc123...",  # Hardcoded API key
    api_version="2024-10-21",
    azure_endpoint="https://my-openai.openai.azure.com"
)
✅ Correct:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
    azure_endpoint="https://my-openai.openai.azure.com"
)
Why: Managed Identity eliminates API key management entirely. Assign Cognitive Services OpenAI User role to the app's managed identity. This works for Azure OpenAI, AI Foundry, and all Cognitive Services.
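If the app runs with a user-assigned managed identity, DefaultAzureCredential must be told which identity to use. A minimal sketch, assuming the client ID is exposed through an AZURE_CLIENT_ID environment variable (a common convention, not a requirement):
import os
from azure.identity import DefaultAzureCredential

# Point DefaultAzureCredential at a user-assigned identity (AZURE_CLIENT_ID is an assumed env var).
credential = DefaultAzureCredential(
    managed_identity_client_id=os.environ.get("AZURE_CLIENT_ID")
)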
Rule 2: Configure Content Safety Filtering
Impact: CRITICAL — Unfiltered AI outputs can generate harmful, illegal, or brand-damaging content.
❌ Wrong:
# Deploying model with no content filtering
# (or requesting filter removal without business justification)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_input}]
    # No content filtering configured on the deployment
)
✅ Correct:
# 1. Configure content filtering policy in AI Foundry portal or via API
# Set severity thresholds for: hate, sexual, violence, self-harm
# Recommended minimum: medium severity for all categories
# 2. Handle filtered responses gracefully
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_input}]
)

if response.choices[0].finish_reason == "content_filter":
    # Log for monitoring, return safe fallback
    log.warning(f"Content filtered: prompt_id={prompt_id}")
    return {
        "message": "요청하신 내용을 처리할 수 없습니다.",
        "messageEn": "Unable to process the requested content.",
        "filtered": True
    }
Why: Azure AI Foundry provides built-in content filtering. Always configure it, handle content_filter finish reasons, and log filtered requests for monitoring.
Rule 3: Set max_tokens on Every AI Call
Impact: CRITICAL — Unbounded token generation causes runaway costs and timeouts.
❌ Wrong:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
    # No max_tokens — model can generate up to context limit
)
✅ Correct:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=2000,  # Appropriate for the use case
    temperature=0.7
)
# Track token usage for cost monitoring
usage = response.usage
log.info(
    f"Tokens: prompt={usage.prompt_tokens}, "
    f"completion={usage.completion_tokens}, "
    f"cost_estimate=${estimate_cost(usage)}"
)
Why: Without max_tokens, a single malformed prompt can keep the model generating until it hits its output limit or the context window. At GPT-4o output pricing ($10 per million tokens, per the table in Rule 9), runaway completions add real cost to every request and drive up latency and timeouts. Set limits appropriate for each use case.
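The logging call above references an estimate_cost helper that this skill does not define. A minimal sketch, assuming GPT-4o list pricing of $2.50 per million input tokens and $10 per million output tokens (verify against current Azure pricing):
# Hypothetical helper used in the logging example; rates are assumptions, not authoritative pricing.
GPT_4O_PRICE_PER_1K = {"input": 0.0025, "output": 0.01}  # USD per 1K tokens

def estimate_cost(usage) -> float:
    """Rough USD estimate from a chat completion's usage object."""
    return round(
        (usage.prompt_tokens / 1000) * GPT_4O_PRICE_PER_1K["input"]
        + (usage.completion_tokens / 1000) * GPT_4O_PRICE_PER_1K["output"],
        6,
    )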
Category 2: Model Deployment (HIGH)
Rule 4: Use Standard Deployments for Production, Provisioned for Predictable Workloads
Impact: HIGH — Wrong deployment type causes either cost overruns or throttling under load.
❌ Wrong:
// Using Global Standard for latency-sensitive production workload
resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  name: 'gpt-4o-prod'
  properties: {
    model: { format: 'OpenAI', name: 'gpt-4o', version: '2024-08-06' }
  }
  sku: { name: 'GlobalStandard', capacity: 100 } // Variable latency
}
✅ Correct:
// Standard deployment for most workloads
resource standardDeployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  name: 'gpt-4o-standard'
  properties: {
    model: { format: 'OpenAI', name: 'gpt-4o', version: '2024-08-06' }
  }
  sku: { name: 'Standard', capacity: 80 } // Regional, lower latency
}

// Provisioned Throughput for high-volume, latency-sensitive workloads
resource provisionedDeployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  name: 'gpt-4o-provisioned'
  properties: {
    model: { format: 'OpenAI', name: 'gpt-4o', version: '2024-08-06' }
  }
  sku: { name: 'ProvisionedManaged', capacity: 50 } // PTUs, guaranteed throughput
}
Why: Standard = pay-per-token, good for variable loads. Provisioned = reserved throughput units (PTUs), good for predictable high-volume workloads. Global Standard = cheapest but highest latency variance.
Rule 5: Multi-Region Deployment for Availability and Model Access
Impact: HIGH — New models and higher quotas are often available in specific regions first.
❌ Wrong:
# Single-region, single-endpoint
client = AzureOpenAI(
    azure_endpoint="https://my-openai-eastus.openai.azure.com",
    ...
)
✅ Correct:
# Multi-region with failover using LiteLLM or custom router
from litellm import Router
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_base": "https://my-openai-eastus.openai.azure.com",
                "api_version": "2024-10-21"
            }
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_base": "https://my-openai-swedencentral.openai.azure.com",
                "api_version": "2024-10-21"
            }
        }
    ],
    routing_strategy="latency-based-routing"
)

response = await router.acompletion(
    model="gpt-4o",
    messages=messages,
    max_tokens=2000
)
Why: Azure OpenAI regions have different model availability, quotas, and latency. Multi-region deployment provides failover and access to the newest models. koreacentral often has limited GPU model availability — pair with eastus or swedencentral.
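If adding a routing library is not an option, the same failover idea can be sketched with two plain AzureOpenAI clients. The endpoints are placeholders and token_provider is the one from Rule 1:
from openai import AzureOpenAI, APIConnectionError, APIError, RateLimitError

# Ordered list of regional clients; endpoints are placeholders for illustration.
clients = [
    AzureOpenAI(azure_ad_token_provider=token_provider, api_version="2024-10-21",
                azure_endpoint="https://my-openai-eastus.openai.azure.com"),
    AzureOpenAI(azure_ad_token_provider=token_provider, api_version="2024-10-21",
                azure_endpoint="https://my-openai-swedencentral.openai.azure.com"),
]

def complete_with_failover(messages):
    last_error = None
    for client in clients:  # try the primary region first, fall through on failure
        try:
            return client.chat.completions.create(
                model="gpt-4o", messages=messages, max_tokens=2000
            )
        except (RateLimitError, APIConnectionError, APIError) as exc:
            last_error = exc
    raise last_error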
Rule 6: Pin Model Versions in Production
Impact: HIGH — Model version updates can change output behavior and break downstream logic.
❌ Wrong:
resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  name: 'gpt-4o'
  properties: {
    model: { format: 'OpenAI', name: 'gpt-4o' } // No version — auto-updates
    versionUpgradeOption: 'OnceNewDefaultVersionAvailable' // Auto-upgrade
  }
}
✅ Correct:
resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  name: 'gpt-4o-2024-08-06'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'gpt-4o'
      version: '2024-08-06' // Pinned version
    }
    versionUpgradeOption: 'NoAutoUpgrade' // Manual upgrade only
  }
}
Why: Model version changes can alter output format, reasoning quality, and cost. Pin versions in production and test upgrades in staging before promoting.
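As a runtime safety net, each completion response carries a model field reporting which model served the request. Its exact format can vary by API version, so treat the comparison below as a hedged sketch rather than a guaranteed contract:
EXPECTED_MODEL = "gpt-4o-2024-08-06"  # assumed to match the pinned deployment above

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # deployment name from the Bicep example
    messages=messages,
    max_tokens=2000
)
if response.model != EXPECTED_MODEL:
    # Surfaces silent model changes in monitoring before they break downstream logic
    log.warning(f"Model version drift: expected {EXPECTED_MODEL}, got {response.model}")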
Category 3: Prompt Flow & Application Patterns (HIGH)
Rule 7: Use Prompt Flow for Complex AI Pipelines
Impact: HIGH — Ad-hoc chained LLM calls become unmaintainable and hard to debug.
❌ Wrong:
# Hard-coded chain of LLM calls
def process_document(doc):
    summary = call_llm("Summarize: " + doc)
    entities = call_llm("Extract entities from: " + summary)
    sentiment = call_llm("Analyze sentiment: " + summary)
    return combine(summary, entities, sentiment)
✅ Correct:
# flow.dag.yaml — Azure AI Foundry Prompt Flow
inputs:
  document:
    type: string
nodes:
- name: summarize
  type: llm
  source: { type: code, path: summarize.jinja2 }
  inputs: { document: ${inputs.document} }
- name: extract_entities
  type: llm
  source: { type: code, path: extract_entities.jinja2 }
  inputs: { summary: ${summarize.output} }
- name: analyze_sentiment
  type: python
  source: { type: code, path: sentiment.py }
  inputs: { text: ${summarize.output} }
outputs:
  result:
    type: object
    value:
      summary: ${summarize.output}
      entities: ${extract_entities.output}
      sentiment: ${analyze_sentiment.output}
Why: Prompt Flow provides visual debugging, built-in evaluation, versioned prompt templates, and easy deployment to Container Apps or Azure ML managed endpoints. Use it for any pipeline with 2+ LLM calls.
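For completeness, the sentiment.py node referenced in the flow would be a standard Prompt Flow Python tool. A minimal sketch, assuming the promptflow package's @tool decorator, with the actual sentiment logic left as a placeholder:
# sentiment.py: Prompt Flow Python node (sketch; replace the heuristic with real analysis)
from promptflow import tool

@tool
def analyze_sentiment(text: str) -> dict:
    # Placeholder heuristic so the node is runnable; swap in a real model or library call.
    negative_markers = ("bad", "terrible", "poor", "broken")
    label = "negative" if any(marker in text.lower() for marker in negative_markers) else "positive"
    return {"label": label}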
Rule 8: Implement Retry with Exponential Backoff for Rate Limits
Impact: HIGH — Azure OpenAI enforces TPM (tokens per minute) and RPM (requests per minute) limits. Without retry logic, requests fail silently under load.
❌ Wrong:
# No retry — fails on 429
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
✅ Correct:
import time
from openai import RateLimitError
def call_with_retry(client, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=2000
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use Retry-After header if available
            retry_after = int(e.response.headers.get("Retry-After", 2 ** attempt))
            log.warning(
                f"Rate limited (attempt {attempt + 1}/{max_retries}), "
                f"retrying in {retry_after}s"
            )
            time.sleep(retry_after)
Why: Azure OpenAI returns HTTP 429 with a Retry-After header. Respect it. The Azure OpenAI Python SDK has built-in retry, but configure max_retries explicitly.
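To make the SDK's built-in retry explicit rather than relying on its default, the client constructor accepts max_retries and timeout. A short sketch reusing the managed-identity setup from Rule 1:
# Built-in SDK retry handles transient 429/5xx; the manual backoff above covers longer waits.
client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
    azure_endpoint="https://my-openai.openai.azure.com",
    max_retries=3,   # SDK default is 2; set deliberately per workload
    timeout=60.0,    # seconds; keeps stuck requests from holding connections open
)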
Category 4: Cost Management (MEDIUM)
Rule 9: Monitor and Alert on Token Usage
Impact: MEDIUM — AI costs can spike unexpectedly without monitoring.
❌ Wrong:
# No cost tracking
response = client.chat.completions.create(model="gpt-4o", messages=messages)
return response.choices[0].message.content
✅ Correct:
import json
# Cost per 1K tokens (update as pricing changes)
COST_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def call_with_cost_tracking(client, model, messages):
    response = client.chat.completions.create(
        model=model, messages=messages, max_tokens=2000
    )
    usage = response.usage
    costs = COST_PER_1K.get(model, {"input": 0, "output": 0})
    estimated_cost = (
        (usage.prompt_tokens / 1000) * costs["input"] +
        (usage.completion_tokens / 1000) * costs["output"]
    )
    # Emit metric for Azure Monitor / Application Insights
    log.info(json.dumps({
        "event": "ai_call",
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "estimated_cost_usd": round(estimated_cost, 6)
    }))
    return response
Why: Set up Azure Monitor alerts on daily/weekly AI spend. Use Application Insights custom metrics to track cost per feature, per user, or per API endpoint.
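One way to turn these log lines into queryable metrics is the Azure Monitor OpenTelemetry distro. A sketch, assuming the azure-monitor-opentelemetry package is installed and APPLICATIONINSIGHTS_CONNECTION_STRING is set in the environment:
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

configure_azure_monitor()  # exports to Application Insights via the connection string env var

meter = metrics.get_meter("ai-cost")
cost_counter = meter.create_counter("ai_cost_usd", unit="USD", description="Estimated LLM spend")
token_counter = meter.create_counter("ai_tokens_total", description="Prompt plus completion tokens")

def record_usage(model: str, usage, estimated_cost: float) -> None:
    attributes = {"model": model}  # add feature/user dimensions as needed
    cost_counter.add(estimated_cost, attributes)
    token_counter.add(usage.prompt_tokens + usage.completion_tokens, attributes)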
Rule 10: Use the Right Model for the Task
Impact: MEDIUM — Using GPT-4o for tasks that GPT-4o-mini handles is 16x more expensive.
❌ Wrong:
# Using GPT-4o for simple classification
response = client.chat.completions.create(
    model="gpt-4o",  # $2.50/M input, $10/M output
    messages=[
        {"role": "system", "content": "Classify as positive/negative."},
        {"role": "user", "content": review_text}
    ]
)
✅ Correct:
# Model selection based on task complexity
MODEL_MAP = {
    "classification": "gpt-4o-mini",   # Simple tasks — 16x cheaper
    "extraction": "gpt-4o-mini",       # Structured extraction
    "summarization": "gpt-4o-mini",    # Short summaries
    "reasoning": "gpt-4o",             # Complex analysis
    "code_generation": "gpt-4o",       # Code requires stronger model
    "creative": "gpt-4o",              # Nuanced creative writing
}

def get_completion(task_type, messages):
    model = MODEL_MAP.get(task_type, "gpt-4o-mini")
    return client.chat.completions.create(
        model=model, messages=messages, max_tokens=2000
    )
Why: GPT-4o-mini handles classification, extraction, and simple summarization nearly as well as GPT-4o at 1/16th the cost. Reserve GPT-4o for reasoning, code generation, and creative tasks.
Category 5: Deployment to Container Apps (MEDIUM)
Rule 11: Deploy Prompt Flow to Container Apps with Managed Identity
Impact: MEDIUM — Prompt Flow needs proper identity and health probe configuration for production deployment.
❌ Wrong:
// Deploying without health probes or managed identity
resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'ai-flow'
  properties: {
    template: {
      containers: [{
        name: 'flow'
        image: 'myregistry.azurecr.io/prompt-flow:latest' // No pinned tag
        env: [{ name: 'OPENAI_API_KEY', value: 'sk-...' }] // Key in env
      }]
    }
  }
}
✅ Correct:
resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'ai-flow'
  identity: { type: 'SystemAssigned' }
  properties: {
    template: {
      containers: [{
        name: 'flow'
        image: 'myregistry.azurecr.io/prompt-flow:1.2.3' // Pinned version
        env: [
          { name: 'AZURE_OPENAI_ENDPOINT', value: openAiEndpoint }
          // No API key — uses managed identity
        ]
        resources: { cpu: json('1.0'), memory: '2Gi' }
      }]
      scale: {
        minReplicas: 1 // Avoid cold starts for AI endpoints
        maxReplicas: 10
        rules: [{ name: 'http', http: { metadata: { concurrentRequests: '10' }}}]
      }
    }
    configuration: {
      ingress: {
        external: true
        targetPort: 8080
        transport: 'http'
      }
    }
  }
}

// Assign Cognitive Services role to Container App identity
resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(containerApp.id, cognitiveServicesAccount.id, cognitiveServicesUserRole)
  scope: cognitiveServicesAccount
  properties: {
    roleDefinitionId: cognitiveServicesUserRole // Cognitive Services OpenAI User
    principalId: containerApp.identity.principalId
    principalType: 'ServicePrincipal'
  }
}
Why: Container Apps with managed identity eliminates API key management for AI services. Set minReplicas: 1 to avoid cold start latency on AI endpoints. Pin image tags to prevent unexpected model behavior changes.
Category 6: Open-Source and Fine-Tuned Models (MEDIUM)
Rule 12: Deploy Open-Source Models via AI Foundry Model Catalog
Impact: MEDIUM — Self-hosting open-source models (Phi, Mistral, Llama) without AI Foundry adds operational burden.
❌ Wrong:
# Self-hosting on a VM with manual GPU management
ssh gpu-vm
docker run -p 8080:8080 -v /models:/models \
vllm/vllm-openai --model mistralai/Mistral-7B-v0.1
✅ Correct:
# Deploy via AI Foundry Model Catalog — Serverless API or Managed Compute
# 1. Deploy from Azure AI Foundry portal: Model Catalog → Deploy
# 2. Access via the same OpenAI-compatible API
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

# Serverless API deployment — pay-per-token, no GPU management
client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
    azure_endpoint="https://my-ai-foundry.services.ai.azure.com/models"
)
response = client.chat.completions.create(
    model="Phi-4",  # Deployed model name
    messages=messages,
    max_tokens=1000
)
Why: AI Foundry Model Catalog provides one-click deployment for 1,600+ open-source models with managed infrastructure, auto-scaling, and the same OpenAI-compatible API. Serverless API means no GPU management.
Pre-Deployment Checklist (Azure AI)
Identity & Security:
- All AI service connections use Managed Identity (no API keys)
- Cognitive Services OpenAI User role assigned to app identity
- Content safety filtering configured on all model deployments
- max_tokens set on every LLM call
Model Deployment:
- Model versions pinned (no auto-upgrade in production)
- Deployment type matches workload (Standard vs. Provisioned)
- Multi-region failover configured for critical workloads
Cost Management:
- Token usage monitoring and alerting configured
- Right model for the task (GPT-4o-mini for simple tasks)
- Budget alerts set in Azure Cost Management
- Dev/test environments use lower-cost models
Production Readiness:
- Retry logic with exponential backoff for rate limits
- Content filter responses handled gracefully
- AI usage disclosure for user-facing decisions (per NIA guidelines)
- Container Apps minReplicas ≥ 1 for AI endpoints