Azure AI Foundry

Microsoft’s unified platform-as-a-service for enterprise AI operations — 11,000+ models across vendors, integrated agent building, production evaluation, and monitoring. The unified hub consolidating Azure ML, Azure Cognitive Services, and Azure OpenAI Studio into one opinionated platform.


What Is Azure AI Foundry?

Azure AI Foundry is Microsoft’s bet on model-agnostic AI operations. Instead of being locked into a single model (e.g., GPT-4o), you can:

  • Choose from 11,000+ models (OpenAI, Anthropic, Google, Meta, Mistral, Cohere, DeepSeek, Stability AI, etc.)
  • Build agents with the right model for the task
  • Evaluate quality systematically before production
  • Deploy with auto-scaling
  • Monitor performance and catch regressions
  • Swap models without code changes

The Consolidation: Azure ML + Cognitive Services + Azure OpenAI Studio → Azure AI Foundry (launched 2024, becoming the canonical platform).


Core Capability Stack

1. Model Catalog (11,000+ Models)

Featured Model Families:

| Vendor | Models | Notes |
|---|---|---|
| OpenAI | GPT-4o, o1, o3, GPT-4 Turbo, GPT-3.5 Turbo, embeddings | Highest quality, most expensive |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Best reasoning, good safety |
| Google | Gemini 2.0, Gemini 1.5, embeddings | Multimodal strength, web search grounding |
| Meta | Llama 3.1, Llama 3 | Open-source, cost-effective |
| Mistral | Mistral 7B, Mixtral 8x22B, Mistral Large | Speed/quality trade-off |
| DeepSeek | DeepSeek Coder, DeepSeek R1 | Code generation strength |

2. Agent Service (Full Lifecycle)

Build Phase:

  • No-code: Portal UI with conversation designer
  • Low-code: Pre-built Agent Framework (Python)
  • Pro-code: Full control (LangGraph, Semantic Kernel, or custom)
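On the pro-code path you define the agent shape yourself. A minimal sketch — the `Agent` class and tool signature here are illustrative stand-ins, not the exact Foundry Agent Framework API:

```python
# Hypothetical pro-code sketch: Agent and tool shapes are illustrative,
# not the exact Foundry Agent Framework API.
class Agent:
    def __init__(self, model: str, system_prompt: str, tools=None):
        self.model = model
        self.system_prompt = system_prompt
        self.tools = tools or []

def lookup_order(order_id: str) -> dict:
    """Toy tool the agent could call during a conversation."""
    return {"order_id": order_id, "status": "shipped"}

support_agent = Agent(
    model="gpt-4o",
    system_prompt="You are a customer support agent.",
    tools=[lookup_order],
)
print(support_agent.model)  # gpt-4o
```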

Evaluate Phase:

  • Built-in Evaluators: ~15 quality metrics (intent resolution, task success, groundedness, etc.)
  • A/B Testing: Compare Model A vs Model B on same task
  • Regression Detection: Auto-catch quality drops when updating prompts/models

Deploy Phase:

  • Serverless hosting: No infrastructure to manage, auto-scale
  • Multiple deployment targets: Azure, on-prem, edge

3. Evaluation Framework (Deep Dive)

Built-in Evaluators:

| Evaluator | What It Measures | Scale |
|---|---|---|
| Intent Resolution | Did the agent understand what the user wanted? | 0–5 |
| Task Completion | Did the agent accomplish the task? | 0–5 |
| Groundedness | Is the response grounded in source documents? | 0–5 |
| Relevance | Is the response relevant to the question? | 0–5 |
| Faithfulness | Does the response accurately reflect source info? | 0–5 |
| Safety | Does the response avoid harmful content? | 0–5 |

Python Example: Programmatic Evaluation

```python
from azure.ai.foundry.evaluation import evaluate_agent
from azure.ai.foundry.evaluation.evaluators import (
    IntentResolutionEvaluator,
    TaskCompletionEvaluator,
    GroundednessEvaluator
)

# Define test cases
test_cases = [
    {
        "query": "What's your pricing?",
        "expected_answer": "Enterprise plan is $29/user/month",
        "context": "pricing_doc.txt"
    },
    # ... more test cases
]

# Run evaluation
results = evaluate_agent(
    agent_id="sales-bot",
    test_set=test_cases,
    evaluators=[
        IntentResolutionEvaluator(),
        TaskCompletionEvaluator(),
        GroundednessEvaluator()
    ]
)

print(f"Intent Resolution: {results.intent_resolution.mean():.1f}/5")
print(f"Task Completion: {results.task_completion.mean():.1f}/5")
```

4. Multi-Model Strategy

The Innovation: Build once, swap models without code changes.

```python
# Same agent code, different models
agent_a = Agent(model="gpt-4o", system_prompt="You are a helpful assistant")
agent_b = Agent(model="claude-3.5-sonnet", system_prompt="You are a helpful assistant")

# Same test set, two evaluations
eval_a = evaluate_agent(agent_a, test_cases)
eval_b = evaluate_agent(agent_b, test_cases)

# Pick the winner and deploy -- no code changes needed
winner = agent_b if eval_b.intent_resolution > eval_a.intent_resolution else agent_a
winner.deploy()
```

Foundry Agent Service Architecture

Trace and Debug

Every LLM call is traced:

```text
User Query: "What's my account balance?"

Trace:
+- Step 1: Model Call (gpt-4o)
|  +- Input: "What's my account balance?"
|  +- Tokens In: 42, Tokens Out: 156
|  +- Latency: 1.2 sec
|
+- Step 2: Tool Call (search_account)
|  +- Result: {"balance": "$5,432.10", "currency": "USD"}
|  +- Latency: 0.3 sec
|
+- Step 3: Model Call (gpt-4o, context-aware)
|  +- Output: "Your account balance is $5,432.10 USD."
|  +- Latency: 0.8 sec
|
+- Final Response: "Your account balance is $5,432.10 USD."
   +- Total Latency: 2.3 sec
```
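Traces in this shape are easy to post-process for dashboards or cost reports. A minimal sketch — using a local dataclass, not the Foundry SDK — that aggregates latency and token usage over the trace above:

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    kind: str        # "model_call" or "tool_call"
    latency: float   # seconds
    tokens_in: int = 0
    tokens_out: int = 0

def summarize(trace):
    """Total latency and token usage across all steps of a trace."""
    total_latency = sum(s.latency for s in trace)
    total_tokens = sum(s.tokens_in + s.tokens_out for s in trace)
    return total_latency, total_tokens

# The three steps from the trace above
trace = [
    TraceStep("model_call", 1.2, tokens_in=42, tokens_out=156),
    TraceStep("tool_call", 0.3),
    TraceStep("model_call", 0.8),
]
latency, tokens = summarize(trace)
print(f"Total latency: {latency:.1f} sec")  # Total latency: 2.3 sec
```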

Deploy Phase

```shell
# One command deploys
az ai foundry agent deploy \
  --name customer-support \
  --environment production \
  --auto-scale true \
  --max-concurrent 100
```

Evaluation in Production: Regression Detection

```python
from azure.ai.foundry.evaluation import RegressionDetector

detector = RegressionDetector(
    agent_id="customer-support",
    baseline_metrics={
        "intent_resolution": 4.2,
        "task_completion": 3.8,
        "groundedness": 4.5
    },
    alert_threshold={
        "intent_resolution": -0.3,
        "task_completion": -0.2,
        "groundedness": -0.3
    }
)

# Run daily evaluation on a sample of recent conversations
daily_eval = detector.evaluate_daily_sample(sample_size=50)

# send_alert() and agent are assumed to be defined elsewhere
if daily_eval.has_regression():
    send_alert(
        subject="Agent quality degraded",
        body=f"Intent resolution: {daily_eval.intent_resolution} (baseline: 4.2)"
    )
    if daily_eval.severity == "critical":
        agent.rollback_to_previous_version()
```

Real-World Deployment: Multi-Model Agent

Case Study: Insurance Claims

Solution: Conditional Multi-Model Dispatch

```text
Before Multi-Model Routing:
  All claims: gpt-4o ($0.01 per claim)
  100K claims/month: $1,000/month

After Multi-Model Routing:
  40% simple claims: gpt-3.5-turbo ($0.001 per claim) = $40
  50% balanced claims: gpt-4o ($0.01 per claim) = $500
  10% complex claims: o1 ($0.05 per claim) = $500
  100K claims/month: $1,040/month

Quality Improvement:
  Before: 86% claims processed correctly
  After: 92% claims processed correctly

Latency Improvement:
  Before: Avg 4.2 sec per claim
  After: Avg 2.8 sec (33% faster)
```
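The routing itself can be a thin dispatch layer in front of the agent. A hedged sketch, where `classify_complexity()` is a toy heuristic standing in for whatever classifier a real deployment would use:

```python
# Hypothetical complexity-based model router for claims triage.
ROUTES = {
    "simple": "gpt-3.5-turbo",   # ~40% of claims: cheap and fast
    "balanced": "gpt-4o",        # ~50% of claims: default quality
    "complex": "o1",             # ~10% of claims: deep reasoning
}

def classify_complexity(claim: dict) -> str:
    """Toy heuristic; a real system might use a small classifier model."""
    if claim.get("amount", 0) > 50_000 or claim.get("disputed"):
        return "complex"
    if len(claim.get("documents", [])) > 3:
        return "balanced"
    return "simple"

def route_claim(claim: dict) -> str:
    """Return the model to use for this claim."""
    return ROUTES[classify_complexity(claim)]

print(route_claim({"amount": 1_200, "documents": []}))    # gpt-3.5-turbo
print(route_claim({"amount": 80_000, "disputed": True}))  # o1
```

Because the agent code is model-agnostic, swapping the model per request is a string lookup, not a rewrite.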

Pricing and Cost Structure

Example: Customer Support Agent

```text
Deployment: Foundry serverless
Model: gpt-4o
Monthly Volume: 100K conversations

Monthly Cost Breakdown:
  Model tokens: 100K conversations x 200 tokens avg x $0.005/1K = $100
  RAG queries: 100K queries x $0.0002 per query = $20
  Evaluation (weekly): 4 runs x 500 test cases x $0.005 = $10
  Storage (knowledge): $5
  Hosting (serverless): 100K requests x $0.0001 = $10
  Total: ~$145/month

Annual cost: ~$1,740 for an agent that handles 1.2M customer interactions
ROI (assumes 1 FTE @ $70K/year): 40x
```
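The breakdown above is straightforward arithmetic, which makes it easy to re-run with your own volumes. A quick sanity check of the numbers as stated:

```python
# Reproduce the cost breakdown above; all rates are from the example.
conversations = 100_000
token_cost   = conversations * 200 / 1000 * 0.005  # $100
rag_cost     = conversations * 0.0002              # $20
eval_cost    = 4 * 500 * 0.005                     # $10 (weekly runs)
storage_cost = 5.0                                 # $5
hosting_cost = conversations * 0.0001              # $10

monthly = token_cost + rag_cost + eval_cost + storage_cost + hosting_cost
print(f"${monthly:.0f}/month, ${monthly * 12:.0f}/year")  # $145/month, $1740/year
```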

Key Properties

| Property | Value | Notes |
|---|---|---|
| Model Catalog Size | 11,000+ models | Largest on the market |
| Build-to-Production Time | 2–4 weeks | No-code: 1 week, Pro-code: 3–4 weeks |
| Evaluation Coverage | ~15 built-in metrics | Can add custom evaluators |
| Multi-Model Support | Native | Swap models without code changes |
| Trace Depth | Every LLM call + tool invocation | Debug-friendly |
| Compliance Certifications | SOC 2, HIPAA, FedRAMP, GDPR | Enterprise-grade |
| Scalability | 0 to 1000+ concurrent requests | Auto-scaling built-in |
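The built-in metrics can be extended with your own. A hedged sketch of a custom evaluator — the callable interface here is illustrative, not the exact Foundry SDK evaluator contract:

```python
class ResponseLengthEvaluator:
    """Scores a response 0-5 by how close it is to a target word count.

    Illustrative custom evaluator; a real one would follow the
    Foundry SDK's evaluator protocol and likely judge quality, not length.
    """
    def __init__(self, target_words: int = 80, tolerance: int = 40):
        self.target = target_words
        self.tolerance = tolerance

    def __call__(self, response: str) -> float:
        words = len(response.split())
        deviation = abs(words - self.target)
        # Full marks at the target, linearly decaying to 0 at 2x tolerance
        score = 5.0 * max(0.0, 1 - deviation / (2 * self.tolerance))
        return round(score, 1)

evaluator = ResponseLengthEvaluator(target_words=10, tolerance=5)
print(evaluator("one two three four five six seven eight nine ten"))  # 5.0
```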


Author’s Take

Azure AI Foundry is the most technically sophisticated AI operations platform on the market, but it’s not for everyone. Choose Foundry if you need to evaluate multiple models systematically, deploy agents at scale, and want full observability. Don’t choose Foundry if you just want to use ChatGPT or you’re a solo founder.

The real power: Foundry’s evaluation framework removes guesswork from “Is our agent good enough to ship?” You measure, compare, iterate, and only deploy when metrics meet thresholds. This is how you build AI systems people trust.

This post is licensed under CC BY 4.0 by the author.