Merag Nokhiz

Systems Architect & Engineer

June 2026

The AI Infrastructure Layer on Azure: Services, Patterns, and Trade-offs

A concrete breakdown of Azure's AI infrastructure stack — GPU compute tiers, storage architecture, model serving options, and how Azure AI Foundry ties it together.

Infrastructure · azure · ai · infrastructure · mlops · cloud-architecture

June 2026 · nokhiz.github.io

TL;DR — 8 Central Insights ⚡

| # | Insight |
|---|---------|
| 1 | Azure's AI layer is not a single service: it is a stack of compute, storage, networking, and platform services that must be wired together deliberately |
| 2 | Training and inference have completely different infrastructure requirements; conflating them is the most common architectural mistake in early AI deployments |
| 3 | Azure AI Foundry (formerly Azure AI Studio, which also absorbed the Azure OpenAI portal) is the unified orchestration plane: it manages models, agents, evaluations, and deployments |
| 4 | Storage is the hidden bottleneck: standard Blob throughput caps out fast on large training jobs; Azure Managed Lustre or premium ADLS Gen2 are the right choices |
| 5 | Private endpoints and VNet injection are not optional: any production AI workload touching proprietary data must run inside a network perimeter |
| 6 | GPU SKUs are expensive and quota-gated, but only relevant when you go beyond Serverless; dedicated compute and training workloads require quota planning before you architect |
| 7 | Observability operates at two distinct levels: infrastructure metrics (GPU utilization, storage IOPS) and model-level metrics (token usage, latency, error rates) must both be instrumented; one without the other leaves blind spots |
| 8 | Cost has multiple compounding drivers: GPU compute dominates, but token pricing, storage tier selection, and AI Search partition sizing all add up fast without active cost architecture |

1. The Azure AI Stack 🏗️

💡 Key Message: Azure does not offer a single “AI service” — it offers a layered infrastructure stack where every layer has multiple options, and the wrong choice at one layer creates constraints at every layer above it.

Before picking a service, you need to understand how Azure structures its AI capabilities. There are three distinct planes:

  • Infrastructure plane — GPU virtual machines, compute clusters, high-throughput storage, and accelerated networking
  • Platform plane — Azure AI Foundry (model catalog, deployment, fine-tuning, agent orchestration), Azure Machine Learning workspaces
  • API plane — Azure OpenAI Service, Azure AI Services (Vision, Speech, Language) — fully managed endpoints where the infrastructure is abstracted away

Most teams start at the API plane, which is the right call for experimentation. Production workloads — especially those requiring fine-tuning, private data, custom models, or agent orchestration — require you to engage the infrastructure and platform planes directly.


Figure: the three planes and their components.

Infrastructure Plane
  • GPU Compute: NC/ND-series · Compute Clusters
  • High-Throughput Storage: Managed Lustre · ADLS Gen2
  • Networking: VNet · Private Endpoints · AI Gateway

Platform Plane
  • Azure AI Foundry: Orchestration · Agents · Eval
  • Azure ML Workspace: Experiments · Pipelines · Registry

API Plane
  • Azure OpenAI Service: GPT-4o · o3 · Embeddings
  • Azure AI Services: Vision · Speech · Language


2. Compute — Training vs. Inference 🖥️

💡 Key Message: Training and inference are not the same workload. Sizing them identically is the infrastructure equivalent of running a database and a web server on the same box.

1. Training ⚙️

Training workloads are bursty, memory-intensive, and distributed. On Azure, the relevant SKUs are:

| SKU Family | GPU | Memory (per GPU) | Use Case |
|------------|-----|------------------|----------|
| NC-series (NCv3, NCasT4) | V100 / T4 | 16 GB | Experimentation, fine-tuning small models |
| ND-series (NDv4, NDv5) | A100 / H100 | 40 to 80 GB | Large-scale training, distributed workloads |
| NV-series | Older NVIDIA | | Visualization only; not for ML training |

Azure ML Compute Clusters are the right abstraction here — they auto-scale to zero when idle, support Spot instances for non-critical runs, and handle multi-node distributed training natively via mpirun or PyTorch DistributedDataParallel.

📋 Note: ND H100 v5 SKUs are quota-gated and regionally scarce. File a quota increase request before you architect a training pipeline that depends on them — lead times can be days to weeks.


2. Inference 🎯

Inference on Azure splits into two fundamentally different billing models:

| Option | GPU required? | Billing | Right for |
|--------|---------------|---------|-----------|
| Serverless API (Foundry) | No (fully managed) | Per token | Catalog models: GPT-4o, Llama, Phi, Mistral |
| Managed Online Endpoint (Azure ML) | Yes (dedicated SKU) | Per hour | Custom or fine-tuned model checkpoints |
| Dedicated Throughput (Foundry) | Yes (reserved capacity) | Per hour | Guaranteed TPM with no cold-start risk |
| AKS GPU node pool | Yes (self-managed) | Per node | Custom CUDA kernels, multi-model colocation, sub-100ms SLA |

For the vast majority of Foundry use cases — catalog models deployed via Serverless API — you never provision or see a GPU. Microsoft handles capacity. GPU infrastructure only enters the picture when you bring your own model or need a throughput guarantee that the shared Serverless tier cannot give you.

✏️ Key Rule: Start with Serverless. Add Dedicated Throughput when you need TPM guarantees. Move to Managed Online Endpoints only when your model is not in the catalog. AKS is the last resort, not the default.
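
The rule above can be read as a small decision procedure. The following is a minimal sketch of that reading; the function name, parameter names, and returned labels are illustrative and are not part of any Azure SDK.

```python
def choose_inference_option(
    model_in_catalog: bool,
    needs_tpm_guarantee: bool = False,
    needs_self_managed_serving: bool = False,
) -> str:
    """Map workload traits to one of the four serving options in the table.

    All parameter names are hypothetical; they only encode the key rule.
    """
    # Last resort: custom CUDA kernels, multi-model colocation, strict latency SLA.
    if needs_self_managed_serving:
        return "AKS GPU node pool"
    # Custom or fine-tuned checkpoints are not served by the catalog tiers.
    if not model_in_catalog:
        return "Managed Online Endpoint"
    # Catalog model that needs a throughput guarantee.
    if needs_tpm_guarantee:
        return "Dedicated Throughput"
    # Default starting point for catalog models.
    return "Serverless API"


print(choose_inference_option(model_in_catalog=True))
print(choose_inference_option(model_in_catalog=False))
```

Ordering the checks from most to least demanding keeps the default path (Serverless) as the fall-through case, which mirrors the "start with Serverless" advice.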


3. Storage — The Bottleneck Nobody Plans For 🗄️

Standard Azure Blob Storage has a default ingress throughput limit of around 10 Gbps per storage account, which can be raised to 60 Gbps via support request but never automatically; exact limits vary by region and account type. For a training job pulling a 10 TB dataset across an ND H100 cluster, you will hit that ceiling. The right storage tier depends on the workload phase:

| Phase | Storage Option | Why |
|-------|----------------|-----|
| Raw data landing | Azure Blob (LRS/ZRS) | Cost-optimal, sufficient throughput for batch ingest |
| Training data | ADLS Gen2 with hierarchical namespace | Higher parallelism, POSIX-compatible for ML frameworks |
| High-perf training | Azure Managed Lustre | Parallel filesystem, up to 1 TB/s aggregate throughput |
| Model artifacts / checkpoints | Azure Blob (with ML Datastore) | Native integration with Azure ML and AI Foundry |
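
To see why the account-level throughput cap matters, it helps to estimate how long one full pass over the dataset takes. The sketch below uses the 10 TB dataset and the 10 and 60 Gbps figures quoted above; it assumes decimal units, sustained throughput exactly at the cap, and no protocol overhead, so real numbers will be worse.

```python
def transfer_seconds(dataset_bytes: float, throughput_gbps: float) -> float:
    """Time to move dataset_bytes at a sustained rate given in gigabits per second."""
    bytes_per_second = throughput_gbps * 1e9 / 8  # gigabits -> bytes
    return dataset_bytes / bytes_per_second


DATASET_BYTES = 10e12  # 10 TB, decimal units

for label, gbps in [("default limit", 10), ("raised limit", 60)]:
    hours = transfer_seconds(DATASET_BYTES, gbps) / 3600
    print(f"{label}: {gbps} Gbps -> {hours:.2f} h per full pass")
```

Under these assumptions a single pass takes a bit over two hours at the lower figure, during which the GPUs are limited by storage rather than by compute.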

📌 Example: A fine-tuning job on a 70B parameter model generates checkpoints every N steps. At FP16, a single checkpoint is ~140 GB. With frequent checkpointing across a 10-node cluster, Blob throughput becomes the pacing constraint — Managed Lustre eliminates it.
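
The checkpoint figure in the example follows from simple arithmetic, sketched below. The calculation covers weights only (optimizer state would add to it), and the 10-minute checkpoint interval is a hypothetical value chosen for illustration; the 10 Gbps rate is the Blob figure used earlier in this section.

```python
def checkpoint_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only checkpoint size; FP16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param


def write_stall_fraction(
    ckpt_bytes: float, throughput_gbps: float, interval_seconds: float
) -> float:
    """Fraction of wall-clock time spent writing, if writes block training."""
    write_seconds = ckpt_bytes / (throughput_gbps * 1e9 / 8)
    return write_seconds / (interval_seconds + write_seconds)


size = checkpoint_bytes(70e9)  # 70B parameters at FP16
print(f"checkpoint size: {size / 1e9:.0f} GB")

# Hypothetical: checkpoint every 10 minutes of training, blocking writes.
stall = write_stall_fraction(size, throughput_gbps=10, interval_seconds=600)
print(f"time lost to checkpoint writes: {stall:.1%}")
```

Asynchronous checkpointing would hide some of this stall, so the fraction is best read as an upper bound for the blocking case.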


4. Azure AI Foundry — The Platform Layer 🤖

Azure AI Foundry (introduced in late 2024 as the successor to Azure AI Studio, which also absorbed the Azure OpenAI portal) is the unified control plane for everything above the infrastructure layer. Before going deeper, it helps to understand where it sits relative to the two other services teams frequently confuse it with:

| | Azure OpenAI Service | Azure AI Foundry | Azure ML Workspace |
|---|---|---|---|
| Core question | I need an API endpoint for GPT-4o | I'm building an AI application with agents, RAG, or evaluation | I'm training or fine-tuning my own model |
| Models | OpenAI only (GPT-4o, o3…) | OpenAI + Llama, Mistral, Phi, Cohere… | Custom + catalog via Foundry |
| GPU required? | No | No (Serverless is pay-per-token) | Yes (training always needs compute) |
| Complexity | Minimal | Medium | High |
| Relation | Foundry creates an OpenAI resource internally | Wraps Azure OpenAI + extends it | Separate; integrates via Foundry |

The key distinction: Foundry is for teams consuming models to build products. Azure ML is for teams building or customizing models. Azure OpenAI Service is the right choice when you need a clean, dependency-free API endpoint and nothing else.

1. Model Catalog and Deployment 🛠️

The Foundry model catalog includes Azure OpenAI models, Meta Llama, Mistral, Cohere, and others — deployed either as serverless APIs (billed per token) or as managed deployments to a dedicated compute tier.

Fine-tuning is available in Foundry for select models (GPT-4o mini, Llama 3, Phi-3) without leaving the platform. The fine-tuning job provisions compute, runs the job, and registers the resulting checkpoint directly into your Foundry project.


2. Agent Infrastructure 🤖

Foundry now includes a native Agent Service — a stateful orchestration layer for building agents that can use tools, maintain memory, and execute multi-step workflows.

Under the hood, this is Azure’s managed equivalent of the Agentic Layer described in Architecture, One Layer Further. Key components:

  • Tool calling — HTTP tools, Azure Functions, Logic Apps, and code interpreter
  • Thread and message management — persistent conversation state stored in Foundry-managed storage
  • Vector stores — integrated with Azure AI Search for retrieval-augmented generation

🎯 Core Function: Foundry Agent Service is the connective tissue between your models, your data, and your business logic — without you managing the stateful plumbing yourself.
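
To make the "stateful plumbing" concrete, here is a toy sketch of what a thread plus a tool-calling loop involves. It is not the Foundry Agent Service API: the `Thread` class, the `TOOLS` registry, and the `fake_model` stand-in are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Thread:
    """Persistent conversation state: an ordered list of messages."""
    messages: List[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})


# Hypothetical tool registry: tool name -> callable taking one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
}


def fake_model(thread: Thread) -> dict:
    """Stand-in for a model call: requests a tool once, then answers."""
    last = thread.messages[-1]
    if last["role"] == "user":
        return {"tool": "lookup_order", "argument": last["content"].split()[-1]}
    return {"answer": f"Result: {last['content']}"}


def run(thread: Thread, user_input: str, max_steps: int = 5) -> str:
    """Loop until the model returns a final answer or the step limit is hit."""
    thread.add("user", user_input)
    for _ in range(max_steps):
        decision = fake_model(thread)
        if "answer" in decision:
            thread.add("assistant", decision["answer"])
            return decision["answer"]
        # The model asked for a tool: run it and record the result in the thread.
        result = TOOLS[decision["tool"]](decision["argument"])
        thread.add("tool", result)
    raise RuntimeError("step limit reached without a final answer")


thread = Thread()
print(run(thread, "status of order 42"))
```

A managed service takes over the parts this sketch glosses over: durable storage of the thread, authentication for tools, and retries when a tool call fails.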


Figure: components of an agent deployment.

  • 👤 User / App
  • Azure AI Foundry: Agent Service
  • Model Deployment: GPT-4o · Llama · Custom
  • Tools: Functions · HTTP · Code
  • Azure AI Search: Vector Store / RAG
  • Thread Storage: Managed State
  • Compute: Serverless or Dedicated


5. Networking and Security 🔐

💡 Key Message: Default Azure AI service configurations are publicly accessible. Production workloads must explicitly configure network isolation — it does not happen automatically.

The network architecture for a production Azure AI workload follows a clear pattern:

  • Private endpoints — Azure OpenAI, AI Foundry hubs, Storage, AI Search should all be reached via private endpoints inside a VNet, not over the public internet
  • VNet-injected compute — Azure ML Compute Clusters and AKS node pools must be deployed into a subnet to eliminate data exfiltration paths
  • Azure AI Gateway (API Management) — sits in front of Azure OpenAI to handle rate limiting, token quotas per consumer, request logging, and failover between deployments
  • Managed Identity — eliminate API keys entirely; use system-assigned managed identities for service-to-service authentication
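
The per-consumer token quota that the gateway enforces is, at its core, token-bucket accounting. The sketch below illustrates only that accounting; in practice API Management handles it through policy configuration rather than application code, and the class and parameter names here are hypothetical.

```python
import time


class TokenQuota:
    """Per-consumer tokens-per-minute budget, refilled continuously (token bucket)."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._state = {}  # consumer -> (available tokens, last update timestamp)

    def allow(self, consumer: str, requested_tokens: int) -> bool:
        now = self.clock()
        available, last = self._state.get(consumer, (self.capacity, now))
        # Refill for the elapsed time, never exceeding the bucket capacity.
        available = min(self.capacity, available + (now - last) * self.refill_per_second)
        if requested_tokens <= available:
            self._state[consumer] = (available - requested_tokens, now)
            return True
        self._state[consumer] = (available, now)
        return False


quota = TokenQuota(tokens_per_minute=600)
print(quota.allow("team-a", 500))  # within budget
print(quota.allow("team-a", 200))  # exceeds what is left in this window
```

Keeping the budget per consumer is what stops one noisy client from exhausting a shared deployment's capacity.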

💡 Insight: Azure OpenAI's content filters, audit logs, and network controls are all configurable. Content filtering ships with a default policy, but diagnostic logging and network restrictions are not enabled until you configure them. The security posture of an Azure AI deployment reflects exactly what you configured, not what you assumed.
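
One way to make "what you configured" explicit is to audit an inventory of settings and treat anything unset as not configured. The sketch below assumes a hand-maintained inventory dictionary; its field names are illustrative and do not correspond to an Azure resource schema.

```python
from typing import Dict, List


def audit(resources: Dict[str, dict]) -> List[str]:
    """Return findings for each resource; missing fields count as insecure."""
    findings = []
    for name, cfg in sorted(resources.items()):
        if cfg.get("public_network_access", True):
            findings.append(f"{name}: public network access enabled")
        if not cfg.get("private_endpoint", False):
            findings.append(f"{name}: no private endpoint")
        if cfg.get("auth") != "managed_identity":
            findings.append(f"{name}: not using managed identity")
        if not cfg.get("diagnostic_logs", False):
            findings.append(f"{name}: diagnostic logging not configured")
    return findings


# Hypothetical inventory: one hardened resource, one left at defaults.
inventory = {
    "openai-prod": {
        "public_network_access": False,
        "private_endpoint": True,
        "auth": "managed_identity",
        "diagnostic_logs": True,
    },
    "search-prod": {"auth": "api_key"},
}

for finding in audit(inventory):
    print(finding)
```

Defaulting missing fields to the insecure value is deliberate: it forces every control to be stated rather than assumed.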


6. Observability 🔍

Azure AI workloads require monitoring at two levels:

  • Infrastructure metrics — GPU utilization, memory bandwidth, node health, storage IOPS. Azure Monitor + VM Insights covers this out of the box for compute clusters.
  • Model-level metrics — token usage, latency per request, error rates, prompt/completion lengths. Azure AI Foundry emits these to Azure Monitor if you configure a Log Analytics workspace on the hub.

For agents specifically, Foundry emits thread-level traces that can be forwarded to Application Insights. This gives you request-scoped visibility into which tool calls happened, how long they took, and where failures occurred.
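
Once request-level records are available, the model-level metrics listed above reduce to a small aggregation. The sketch below assumes each record is a dictionary with `latency_ms`, `prompt_tokens`, `completion_tokens`, and `ok` fields; that record shape is hypothetical and is not the schema Foundry or Azure Monitor emits.

```python
from typing import Iterable


def summarize(requests: Iterable[dict]) -> dict:
    """Aggregate request records into error rate, p95 latency, and token totals."""
    rows = list(requests)
    if not rows:
        raise ValueError("no requests to summarize")
    latencies = sorted(r["latency_ms"] for r in rows)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in rows if not r["ok"])
    return {
        "requests": len(rows),
        "error_rate": errors / len(rows),
        "p95_latency_ms": latencies[rank - 1],
        "total_tokens": sum(r["prompt_tokens"] + r["completion_tokens"] for r in rows),
    }


# Synthetic sample: 20 requests, the slowest one failed.
sample = [
    {"latency_ms": 10 * i, "prompt_tokens": 100, "completion_tokens": 50, "ok": i != 20}
    for i in range(1, 21)
]
print(summarize(sample))
```

Tracking a percentile rather than the mean matters here because a few slow tool calls or long completions can dominate user-perceived latency without moving the average much.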


7. Cost Architecture 💰

GPU compute is the dominant cost driver, but not the only one. A realistic cost breakdown for a production Azure AI workload:

| Component | Cost Driver | Control Lever |
|-----------|-------------|---------------|
| Training compute | GPU hours × SKU rate | Spot instances, right-sizing, auto-scale to zero |
| Inference compute | Tokens processed or endpoint hours | Serverless for variable load, dedicated for steady high-throughput |
| Storage | GB stored × tier + egress | Lifecycle policies, Managed Lustre only for training |
| Azure OpenAI | Tokens (input + output) × model rate | Prompt engineering, caching, model tiering (4o mini vs. 4o) |
| AI Search | Index size + query units | Vector index pruning, partition right-sizing |
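
The "serverless for variable load, dedicated for steady high-throughput" lever comes down to a break-even calculation. The sketch below shows the arithmetic only; the hourly rate and per-token price are placeholder values, not Azure prices, and 730 hours is used as an approximate month.

```python
def serverless_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Pay-per-token cost for a month of traffic."""
    return tokens_per_month / 1e6 * price_per_million_tokens


def dedicated_monthly_cost(hourly_rate: float, hours: float = 730) -> float:
    """Fixed cost of capacity that is billed per hour regardless of traffic."""
    return hourly_rate * hours


def break_even_tokens(hourly_rate: float, price_per_million_tokens: float, hours: float = 730) -> float:
    """Monthly token volume above which dedicated capacity is cheaper."""
    return dedicated_monthly_cost(hourly_rate, hours) / price_per_million_tokens * 1e6


# Placeholder prices for illustration only.
HOURLY_RATE = 10.0
PRICE_PER_MILLION = 5.0

threshold = break_even_tokens(HOURLY_RATE, PRICE_PER_MILLION)
print(f"break-even volume: {threshold / 1e9:.2f}B tokens per month")
```

Below the threshold, per-token billing wins; above it, the fixed hourly cost is spread over enough traffic to be cheaper, provided the dedicated capacity can actually serve that volume.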

📋 Note: Azure Spot VMs offer up to 90% discount on NC/ND-series but can be evicted with 30 seconds notice. Design training jobs to checkpoint frequently and resume from the last checkpoint — or use Azure ML’s built-in spot eviction handling.
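
The checkpoint-and-resume pattern can be sketched with a toy training loop. The example below simulates an eviction and a restart using a JSON file as the checkpoint; the function, its parameters, and the file format are hypothetical stand-ins for a real framework's checkpointing.

```python
import json
import os
import tempfile


def train(total_steps: int, ckpt_path: str, checkpoint_every: int, evict_at_step=None) -> int:
    """Run a toy loop, resuming from ckpt_path if present; return the last step reached.

    evict_at_step simulates a Spot eviction: the function returns early and any
    work since the last checkpoint is lost.
    """
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        if evict_at_step is not None and step == evict_at_step:
            return step
        step += 1  # stand-in for one optimizer step
        if step % checkpoint_every == 0:
            # Write to a temp file and rename so a partial write never
            # replaces the last good checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"step": step}, f)
            os.replace(tmp_path, ckpt_path)
    return step


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print("evicted at step", train(100, ckpt, checkpoint_every=10, evict_at_step=37))
print("resumed and finished at step", train(100, ckpt, checkpoint_every=10))
```

The checkpoint interval sets the trade-off: shorter intervals lose less work per eviction but spend more time writing to storage.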


8. Where to Start 🚀

Three entry points, ordered by infrastructure commitment:

  • Azure OpenAI Service — zero infrastructure, immediate API access, pay per token. Right for prototyping and production workloads where OpenAI models are sufficient and you want no platform overhead.
  • Azure AI Foundry + Serverless or Managed Compute — access the full model catalog, add agents, RAG, and evaluation. Serverless stays pay-per-token; switch to Managed Compute when you need a fine-tuned checkpoint or dedicated throughput.
  • Azure ML Workspace + AKS + Managed Lustre — full infrastructure control, custom serving, distributed training. Right for teams with bespoke model development at scale.

✏️ Key Rule: The API plane is where you validate the use case. The infrastructure plane is where you operationalize it. Don’t confuse experimenting with a GPT-4o API call with running a production AI system — the infrastructure gap between the two is substantial.


Tags: azure · ai · infrastructure · mlops · cloud-architecture