azure-ml-foundry-workspace

Azure Machine Learning Workspace / Azure AI Foundry - Complete Deep-Dive Reference

Reference for Azure Machine Learning workspaces and Azure AI Foundry hubs and projects. It covers architecture, networking, private endpoints, compute clusters, endpoint deployment, managed identities, ACR integration, storage accounts, CLI and PowerShell commands, log reading, debugging, and Terraform integration.


1. ARCHITECTURE AND CORE CONCEPTS

Workspace Resource Hierarchy

Azure Subscription
  └── Resource Group
        ├── Azure ML Workspace (Microsoft.MachineLearningServices/workspaces)
        │     ├── Dependent Resources (auto-created or BYO)
        │     │     ├── Azure Storage Account (default datastore)
        │     │     ├── Azure Key Vault (secrets, connection strings)
        │     │     ├── Azure Application Insights (telemetry)
        │     │     └── Azure Container Registry (Docker images for environments)
        │     ├── Compute Targets
        │     │     ├── Compute Instances (dev/test VMs)
        │     │     ├── Compute Clusters (AmlCompute - training)
        │     │     ├── Serverless Compute (on-demand)
        │     │     ├── Kubernetes Compute (AKS / Arc-enabled)
        │     │     └── Attached Compute (Databricks, HDInsight, VMs)
        │     ├── Data Assets (versioned references to data)
        │     ├── Datastores (connections to storage)
        │     ├── Environments (Docker + conda specs)
        │     ├── Models (registered trained models)
        │     ├── Endpoints
        │     │     ├── Managed Online Endpoints (real-time)
        │     │     ├── Kubernetes Online Endpoints (BYO infra)
        │     │     ├── Batch Endpoints (large-scale scoring)
        │     │     └── Serverless Endpoints (MaaS - pay-per-token)
        │     ├── Jobs (training runs, pipelines, sweeps)
        │     ├── Components (reusable pipeline steps)
        │     ├── Schedules (recurring job triggers)
        │     └── Registries (cross-workspace sharing)
        └── AI Foundry Hub (kind=hub) + Projects (kind=project)

AI Foundry Hub/Project vs Classic Workspace

| Feature | Classic Workspace (kind=Default) | AI Foundry Hub + Project |
|---|---|---|
| Portal | ml.azure.com | ai.azure.com |
| Scope | Single workspace | Hub shares infra across projects |
| Networking | Per-workspace | Hub-level (shared across projects) |
| Identity | Per-workspace | Hub-level identity, project inherits |
| Model catalog | Yes | Yes, plus additional Foundry models |
| Prompt flow | Yes | Yes |
| AI agents | Limited | Full AI Agent Service |
| Use case | Classical ML, custom training | GenAI, LLM apps, AI agents |

Workspace Creation - All Methods

CLI:

# Install/upgrade ML extension
az extension add --name ml --upgrade

# Create resource group
az group create --name ml-rg --location eastus

# Create workspace with all dependencies auto-created
az ml workspace create \
  --name my-ml-workspace \
  --resource-group ml-rg \
  --location eastus

# Create workspace with explicit dependencies
az ml workspace create \
  --name my-ml-workspace \
  --resource-group ml-rg \
  --location eastus \
  --storage-account /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.Storage/storageAccounts/mlstorage \
  --key-vault /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.KeyVault/vaults/mlkeyvault \
  --app-insights /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.Insights/components/mlinsights \
  --container-registry /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.ContainerRegistry/registries/mlacr \
  --public-network-access Disabled \
  --managed-network allow_internet_outbound \
  --image-build-compute cpu-build-cluster \
  --enable-data-isolation true \
  --tags Environment=Production Team=DataScience

# Create AI Foundry Hub
az ml workspace create \
  --name my-ai-hub \
  --resource-group ml-rg \
  --location eastus \
  --kind hub \
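  --storage-account /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.Storage/storageAccounts/aihubstorage \
  --key-vault /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.KeyVault/vaults/aihubkeyvault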

# Create AI Foundry Project within Hub
az ml workspace create \
  --name my-ai-project \
  --resource-group ml-rg \
  --location eastus \
  --kind project \
  --hub-id /subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.MachineLearningServices/workspaces/my-ai-hub

# Show workspace details
az ml workspace show \
  --name my-ml-workspace \
  --resource-group ml-rg

# List all workspaces
az ml workspace list \
  --resource-group ml-rg \
  --output table

# Update workspace
az ml workspace update \
  --name my-ml-workspace \
  --resource-group ml-rg \
  --description "Updated workspace" \
  --public-network-access Disabled

# Delete workspace
az ml workspace delete \
  --name my-ml-workspace \
  --resource-group ml-rg \
  --permanently-delete --all-resources

# Diagnose workspace configuration
az ml workspace diagnose \
  --name my-ml-workspace \
  --resource-group ml-rg
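
The explicit-dependency form of az ml workspace create expects full ARM resource IDs. A minimal sketch of assembling them in the shell follows; `arm_id` is a hypothetical helper (not part of the az CLI), the subscription ID is a placeholder, and the resource names reuse the ones from the commands above.

```shell
# Hypothetical helper (not an az command): build an ARM resource ID.
# Usage: arm_id <subscription-id> <resource-group> <provider/type> <name>
arm_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder subscription ID; override via the environment.
SUB="${AZURE_SUBSCRIPTION_ID:-00000000-0000-0000-0000-000000000000}"
RG="ml-rg"

STORAGE_ID="$(arm_id "$SUB" "$RG" Microsoft.Storage/storageAccounts mlstorage)"
KEYVAULT_ID="$(arm_id "$SUB" "$RG" Microsoft.KeyVault/vaults mlkeyvault)"
ACR_ID="$(arm_id "$SUB" "$RG" Microsoft.ContainerRegistry/registries mlacr)"

echo "$STORAGE_ID"
echo "$KEYVAULT_ID"
echo "$ACR_ID"
```

The variables can then be passed as `--storage-account "$STORAGE_ID"`, `--key-vault "$KEYVAULT_ID"`, and `--container-registry "$ACR_ID"`. When the resources already exist, querying the ID directly (for example `az storage account show --name mlstorage --resource-group ml-rg --query id --output tsv`) avoids hand-building the string.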

PowerShell (Az.MachineLearningServices):

# Install the module
Install-Module -Name Az.MachineLearningServices -Scope CurrentUser -Repository PSGallery -Force

# Create workspace
New-AzMLWorkspace `
  -Name "my-ml-workspace" `
  -ResourceGroupName "ml-rg" `
  -Location "eastus" `
  -StorageAccountId "/subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.Storage/storageAccounts/mlstorage" `
  -KeyVaultId "/subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.KeyVault/vaults/mlkeyvault" `
  -ApplicationInsightId "/subscriptions/<sub>/resourceGroups/ml-rg/providers/Microsoft.Insights/components/mlinsights" `
  -IdentityType "SystemAssigned" `
  -PublicNetworkAccess "Disabled"

# Get workspace
Get-AzMLWorkspace -Name "my-ml-workspace" -ResourceGroupName "ml-rg"

# List workspaces
Get-AzMLWorkspace -ResourceGroupName "ml-rg"

# Update workspace
Update-AzMLWorkspace `
  -Name "my-ml-workspace" `
  -ResourceGroupName "ml-rg" `
  -Description "Updated workspace" `
  -Tag @{Environment="Production"}

# Remove workspace
Remove-AzMLWorkspace -Name "my-ml-workspace" -ResourceGroupName "ml-rg"

# Diagnose workspace
Invoke-AzMLWorkspaceDiagnose -Name "my-ml-workspace" -ResourceGroupName "ml-rg"

2. NETWORKING

Azure ML supports three managed network isolation modes (Disabled, AllowInternetOutbound, AllowOnlyApprovedOutbound) with the managed VNet approach recommended for production. Private endpoints provide inbound connectivity, and outbound rules control egress from compute resources.
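
As a sketch of how the approved-outbound-only mode might be configured, the script below writes a workspace YAML spec with a managed network and a single FQDN outbound rule. The file name, the rule name `allow-pypi`, and the `APPLY` guard variable are assumptions for illustration; the az calls run only when `APPLY=1` is set, because they change a live workspace.

```shell
# Write a workspace spec enabling a managed VNet with approved-outbound-only egress.
# The rule name and FQDN are illustrative assumptions.
cat > workspace-managed-vnet.yml <<'EOF'
name: my-ml-workspace
managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
    - name: allow-pypi
      type: fqdn
      destination: pypi.org
EOF

# Apply only on explicit request (APPLY=1); otherwise just leave the file for review.
if [ "${APPLY:-0}" = "1" ]; then
  az ml workspace update --file workspace-managed-vnet.yml \
    --name my-ml-workspace --resource-group ml-rg
  # Provision the managed network up front instead of waiting for first compute creation.
  az ml workspace provision-network --name my-ml-workspace --resource-group ml-rg
fi
```

Keeping the rules in a versioned YAML file rather than issuing individual rule commands makes the allowed egress reviewable in one place.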

Key DNS Zones

| Service | Private DNS Zone |
|---|---|
| ML Workspace API | privatelink.api.azureml.ms |
| ML Notebooks | privatelink.notebooks.azure.net |
| Storage Blob | privatelink.blob.core.windows.net |
| Storage File | privatelink.file.core.windows.net |
| Key Vault | privatelink.vaultcore.azure.net |
| Container Registry | privatelink.azurecr.io |
| Application Insights | privatelink.monitor.azure.com |
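
The zones listed above usually have to be created and linked to the VNet that hosts the private endpoints. The sketch below only prints the commands for review rather than running them; the VNet name `ml-vnet` and the function name `print_dns_commands` are assumptions.

```shell
# Private DNS zones for a private-endpoint-secured workspace, one per line.
ZONES="privatelink.api.azureml.ms
privatelink.notebooks.azure.net
privatelink.blob.core.windows.net
privatelink.file.core.windows.net
privatelink.vaultcore.azure.net
privatelink.azurecr.io
privatelink.monitor.azure.com"

# Print a create command and a VNet-link command for every zone.
# Arguments: <resource-group> <vnet-name>
print_dns_commands() {
  for zone in $ZONES; do
    echo "az network private-dns zone create --resource-group $1 --name $zone"
    echo "az network private-dns link vnet create --resource-group $1 --zone-name $zone --name $2-link --virtual-network $2 --registration-enabled false"
  done
}

print_dns_commands ml-rg ml-vnet
```

After reviewing the output, it can be piped to a shell (`print_dns_commands ml-rg ml-vnet | sh`) to create the zones and links.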

Key Service Tags

| Service Tag | Purpose |
|---|---|
| AzureMachineLearning | ML workspace management (inbound 44224, outbound 443) |
| BatchNodeManagement | Compute cluster management (inbound 29876-29877) |
| Storage | Access to Azure Storage (outbound 443) |
| AzureActiveDirectory | Authentication (outbound 443) |
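
For a customer-managed VNet, the service tags above translate into NSG rules roughly as sketched below. The NSG name `ml-nsg`, the rule names, and the priorities are assumptions, and the `run` wrapper is a local dry-run helper that prints each command unless `APPLY=1` is set.

```shell
# Dry-run wrapper: print the command unless APPLY=1 is set.
# The printed form is for review only; argument quoting is not preserved.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Assumed names: resource group "ml-rg", NSG "ml-nsg".
ml_nsg_rules() {
  rg="ml-rg"
  nsg="ml-nsg"

  # Inbound workspace management traffic.
  run az network nsg rule create --resource-group "$rg" --nsg-name "$nsg" \
    --name AllowAzureMLInbound --priority 100 --direction Inbound --access Allow \
    --protocol Tcp --source-address-prefixes AzureMachineLearning \
    --destination-port-ranges 44224

  # Inbound compute cluster management traffic.
  run az network nsg rule create --resource-group "$rg" --nsg-name "$nsg" \
    --name AllowBatchInbound --priority 110 --direction Inbound --access Allow \
    --protocol Tcp --source-address-prefixes BatchNodeManagement \
    --destination-port-ranges 29876-29877

  # Outbound HTTPS to storage and authentication; priorities must be unique per direction.
  priority=200
  for tag in Storage AzureActiveDirectory; do
    run az network nsg rule create --resource-group "$rg" --nsg-name "$nsg" \
      --name "Allow${tag}Outbound" --priority "$priority" --direction Outbound \
      --access Allow --protocol Tcp --destination-address-prefixes "$tag" \
      --destination-port-ranges 443
    priority=$((priority + 10))
  done
}

ml_nsg_rules
```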

For full VNet configuration, private endpoint setup, NSG rules, and outbound rule management, see references/networking.md.


3. COMPUTE

Azure ML offers multiple compute targets: Compute Instances for dev/test, AmlCompute Clusters for scalable training, Serverless Compute for on-demand jobs without cluster management, and Kubernetes Compute for BYO infrastructure scenarios.
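
A minimal sketch of defining an AmlCompute cluster as YAML follows. The cluster name, VM size, and scale limits are illustrative assumptions, and the az call is guarded by an assumed `APPLY` variable so the script can be run safely without an Azure login.

```shell
# Cluster spec: scales to zero when idle, so it costs nothing between jobs.
cat > cpu-cluster.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 120
EOF

# Create the cluster only on explicit request (APPLY=1).
if [ "${APPLY:-0}" = "1" ]; then
  az ml compute create --file cpu-cluster.yml \
    --resource-group ml-rg --workspace-name my-ml-workspace
fi
```

Setting `min_instances: 0` trades a cold-start delay on the first job for zero idle cost; raise it if startup latency matters more.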

GPU VM SKU Quick Reference

| VM Size / Series | GPU | Total GPU Memory | Use Case |
|---|---|---|---|
| Standard_NC24ads_A100_v4 | 1x A100 | 80 GB | Training, fine-tuning |
| Standard_ND96amsr_A100_v4 | 8x A100 80GB | 640 GB | Large model training |
| Standard_ND_H100_v5 | 8x H100 | 640 GB | GenAI, LLM training |
| Standard_ND_H200_v5 | 8x H200 | 1128 GB | Latest generation; more and faster GPU memory than H100 |
| Standard_NCads_H100_v5 | 1x H100 NVL | 94 GB | Inference, fine-tuning |
| Standard_NC4as_T4_v3 | 1x T4 | 16 GB | Budget inference |

For the complete GPU SKU table, compute instance/cluster CLI reference, serverless compute, Kubernetes attach, and debugging commands, see references/compute.md.


4. ENDPOINT DEPLOYMENT

Azure ML supports four endpoint types: Managed Online Endpoints (recommended for real-time inference with blue-green deployments), Batch Endpoints (large-scale scoring on compute clusters), Kubernetes Online Endpoints (BYO AKS/Arc infrastructure), and Serverless Endpoints (pay-per-token Model-as-a-Service).

Endpoint Types Quick Reference

| Type | Use Case | Auth Modes | Scaling |
|---|---|---|---|
| Managed Online | Real-time inference | key, aml_token, aad_token | Per-deployment instance count |
| Batch | Large-scale scoring | Microsoft Entra token (aad_token) | Compute cluster auto-scale |
| Kubernetes Online | BYO infra real-time | key, aml_token | K8s pod scaling |
| Serverless (MaaS) | Pay-per-token LLM | key | Automatic |
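
A blue-green rollout on a managed online endpoint can be sketched as the command sequence below. The endpoint name, the deployment names `blue` and `green`, and the two deployment YAML files are assumptions and are not created here; `run` is a local dry-run helper that prints each command unless `APPLY=1` is set.

```shell
# Dry-run wrapper: print the command unless APPLY=1 is set.
# The printed form is for review only; argument quoting is not preserved.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

rollout() {
  ep="my-endpoint"          # assumed endpoint name
  rg="ml-rg"
  ws="my-ml-workspace"

  # 1. Create the endpoint with key-based auth.
  run az ml online-endpoint create --name "$ep" --auth-mode key \
    --resource-group "$rg" --workspace-name "$ws"

  # 2. Deploy "blue" and send it all traffic.
  run az ml online-deployment create --file blue-deployment.yml \
    --endpoint-name "$ep" --all-traffic --resource-group "$rg" --workspace-name "$ws"

  # 3. Deploy "green" alongside it; it receives no traffic yet.
  run az ml online-deployment create --file green-deployment.yml \
    --endpoint-name "$ep" --resource-group "$rg" --workspace-name "$ws"

  # 4. Shift 10% of traffic to "green".
  run az ml online-endpoint update --name "$ep" --traffic "blue=90 green=10" \
    --resource-group "$rg" --workspace-name "$ws"
}

rollout
```

Note that traffic weights are a property of the endpoint, so the split is changed with `az ml online-endpoint update`, not on the deployment.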

For full endpoint creation, deployment, traffic splitting, log retrieval, and batch invocation commands, see references/endpoints.md.


5-7. IDENTITIES, ACR, AND STORAGE

Managed identities (system-assigned or user-assigned) control access between workspace, compute, endpoints, and dependent resources. ACR stores Docker images for environments and model serving, requiring Premium SKU for private endpoints and an image-build-compute cluster when behind a VNet. Storage accounts serve as the default datastore for blobs, file shares, job outputs, and MLflow artifacts.

Identity Types

| Identity Type | Use Case |
|---|---|
| System-Assigned (workspace) | Default workspace operations, auto-lifecycle |
| User-Assigned (workspace) | CMK encryption, cross-resource sharing |
| System-Assigned (compute) | Per-cluster storage/ACR access |
| User-Assigned (compute) | Fine-grained, reusable access control |

Key RBAC Roles

| Role | Description |
|---|---|
| AzureML Data Scientist | Run jobs, register and deploy models; cannot create or delete compute or modify the workspace itself |
| AzureML Compute Operator | Create/manage compute resources |
| Azure AI Developer | AI Foundry project development |
| Azure AI Inference Deployment Operator | Deploy models to endpoints |
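
As a sketch of wiring an identity to a dependent resource, the helpers below look up a compute cluster's system-assigned identity and grant it a role. Both function names are hypothetical, the `identity.principal_id` query path is an assumption about the `az ml compute show` output shape, and the final call uses placeholder IDs in dry-run mode.

```shell
# Dry-run wrapper: print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Look up the object ID of a cluster's system-assigned identity.
# Requires an Azure login; assumes the ID is exposed as identity.principal_id.
cluster_principal_id() {
  az ml compute show --name "$1" \
    --resource-group ml-rg --workspace-name my-ml-workspace \
    --query identity.principal_id --output tsv
}

# Grant a role to a managed identity at a given scope.
# Arguments: <principal-object-id> <role-name> <scope-resource-id>
grant_role() {
  run az role assignment create --assignee-object-id "$1" \
    --assignee-principal-type ServicePrincipal --role "$2" --scope "$3"
}

# Dry-run example with placeholder values.
grant_role "<principal-object-id>" "AcrPull" "<acr-resource-id>"
```

With a live login, `grant_role "$(cluster_principal_id cpu-cluster)" AcrPull "$ACR_ID"` would chain the two. Assigning a workspace role to a person follows the same pattern with `--assignee <user>` and a role such as "AzureML Data Scientist" scoped to the workspace resource ID.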

For full identity configuration, role assignment commands, ACR integration, private ACR setup, datastore registration, and storage account details, see references/identities-acr-storage.md.


8-9. CLI AND POWERSHELL

The az ml CLI extension provides comprehensive workspace management through 20+ command groups covering workspaces, compute, jobs, models, endpoints, environments, data, datastores, components, schedules, registries, and connections. The Az.MachineLearningServices PowerShell module offers equivalent functionality for Windows-native automation.

Key az ml Command Groups

| Command Group | Purpose |
|---|---|
| az ml workspace | Manage workspaces (create, diagnose, provision-network, outbound-rule) |
| az ml compute | Manage compute (create, start, stop, connect-ssh, attach) |
| az ml job | Manage jobs (create, stream, cancel, download) |
| az ml online-endpoint | Manage online endpoints (create, update --traffic, invoke, get-credentials) |
| az ml online-deployment | Manage deployments (create, update, get-logs, delete) |
| az ml batch-endpoint | Manage batch endpoints (create, invoke, list-jobs) |
| az ml serverless-endpoint | Manage serverless endpoints (create, get-credentials) |
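
A typical az ml job round trip is to submit a YAML job spec, capture the generated job name, and stream its logs. The sketch below assumes a cluster named `cpu-cluster` exists in the workspace; `submit_and_stream` is a hypothetical helper that is defined but not called, since it needs an Azure login.

```shell
# Minimal command job: run one shell command in a public Python image.
cat > job.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
compute: azureml:cpu-cluster
EOF

# Submit the job, then follow its logs until it finishes.
submit_and_stream() {
  job_name="$(az ml job create --file job.yml \
    --resource-group ml-rg --workspace-name my-ml-workspace \
    --query name --output tsv)"
  az ml job stream --name "$job_name" \
    --resource-group ml-rg --workspace-name my-ml-workspace
}
```

Running `az configure --defaults group=ml-rg workspace=my-ml-workspace` once removes the need to repeat the resource group and workspace flags on every command.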

For the complete command reference, job management deep-dive, schedule management, and full PowerShell cmdlet reference, see references/cli-powershell.md.


10. TERRAFORM INTEGRATION

Azure ML workspaces can be fully provisioned with Terraform using the azurerm provider. A production setup includes the workspace, VNet/subnets, NSG, storage account, key vault, ACR, Application Insights, private endpoints, DNS zones, compute clusters, and RBAC role assignments.

Key Terraform Resources

| Resource | Purpose |
|---|---|
| azurerm_machine_learning_workspace | ML workspace (kind Default or FeatureStore) |
| azurerm_ai_foundry, azurerm_ai_foundry_project | AI Foundry hub and project |
| azurerm_machine_learning_compute_cluster | AmlCompute training clusters |
| azurerm_machine_learning_compute_instance | Dev/test compute instances |
| azurerm_machine_learning_workspace_network_outbound_rule_* | Managed network outbound rules |
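
As a rough sketch of the workspace resource on its own, the script below writes a minimal Terraform file that takes the IDs of pre-existing dependent resources as input variables. The variable names, the resource label `this`, and the file name are assumptions; a production configuration would also define the networking, ACR, and role assignments described above.

```shell
# Write a minimal workspace definition; dependent resources are passed in by ID.
cat > workspace.tf <<'EOF'
provider "azurerm" {
  features {}
}

# Assumed inputs: IDs of resources created elsewhere.
variable "resource_group_name" { type = string }
variable "application_insights_id" { type = string }
variable "key_vault_id" { type = string }
variable "storage_account_id" { type = string }

resource "azurerm_machine_learning_workspace" "this" {
  name                          = "my-ml-workspace"
  location                      = "eastus"
  resource_group_name           = var.resource_group_name
  application_insights_id       = var.application_insights_id
  key_vault_id                  = var.key_vault_id
  storage_account_id            = var.storage_account_id
  public_network_access_enabled = false

  identity {
    type = "SystemAssigned"
  }
}
EOF

# Plan only on explicit request (APPLY=1) and only if terraform is installed.
if [ "${APPLY:-0}" = "1" ] && command -v terraform >/dev/null 2>&1; then
  terraform init && terraform plan
fi
```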

For the full production-ready Terraform configuration (providers, networking, storage, key vault, ACR, workspace, compute, role assignments, and outputs), see references/terraform.md.


11. TROUBLESHOOTING AND DEBUGGING

Azure ML provides multiple debugging surfaces: real-time job log streaming, deployment container logs (inference-server and storage-initializer), compute instance SSH access for system-level diagnostics, Log Analytics queries for historical analysis, and the az ml workspace diagnose command for configuration validation.

Common Error Categories

| Category | Common Errors |
|---|---|
| Compute | QuotaExceeded, AllocationFailed, disk full, GPU not detected |
| Endpoints | ScoringError, HealthCheckFailure, ImageBuildFailed, 429/503 errors |
| Networking | DNS resolution failure, connection timeout, storage/ACR access denied |
| Jobs | EnvironmentBuildError, OutOfMemoryError, NCCL timeout, blob not found |
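
A first-pass triage for a failing online deployment might pull logs from both containers and then run the workspace diagnosis. In this sketch the endpoint and deployment names are assumptions, `collect_diagnostics` is a hypothetical helper, and `run` prints each command unless `APPLY=1` is set.

```shell
# Dry-run wrapper: print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Arguments: <endpoint-name> <deployment-name>
collect_diagnostics() {
  # storage-initializer covers model/code download; inference-server covers scoring.
  for container in storage-initializer inference-server; do
    run az ml online-deployment get-logs --name "$2" --endpoint-name "$1" \
      --container "$container" --lines 200 \
      --resource-group ml-rg --workspace-name my-ml-workspace
  done

  # Validate workspace configuration (networking, dependent resources).
  run az ml workspace diagnose --name my-ml-workspace --resource-group ml-rg
}

collect_diagnostics my-endpoint blue
```

Checking storage-initializer first helps separate download or permission failures from errors raised by the scoring script itself.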

For full error reference tables, log locations, Log Analytics queries, endpoint metrics monitoring, workspace diagnostics, and the secure workspace setup checklist, see references/troubleshooting.md.

