Azure AI Foundry

Microsoft’s unified platform-as-a-service for enterprise AI operations — 11,000+ models across vendors, integrated agent building, production evaluation, and monitoring. The unified hub consolidating Azure ML, Azure Cognitive Services, and Azure OpenAI Studio into one opinionated platform.


What Is Azure AI Foundry?

Azure AI Foundry is Microsoft’s bet on model-agnostic AI operations. Instead of being locked into a single model (e.g., GPT-4o), you can:

  • Choose from 11,000+ models (OpenAI, Anthropic, Google, Meta, Mistral, Cohere, DeepSeek, Stability AI, etc.)
  • Build agents with the right model for the task
  • Evaluate quality systematically before production
  • Deploy with auto-scaling
  • Monitor performance and catch regressions
  • Swap models without code changes

The Consolidation: Azure ML + Cognitive Services + Azure OpenAI Studio → Azure AI Foundry (launched 2024, becoming the canonical platform).


Core Capability Stack

1. Model Catalog (11,000+ Models)

Featured Model Families:

| Vendor | Models | Notes |
|---|---|---|
| OpenAI | GPT-4o, o1, o3, GPT-4 Turbo, GPT-3.5 Turbo, embeddings | Highest quality, most expensive |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Best reasoning, good safety |
| Google | Gemini 2.0, Gemini 1.5, embeddings | Multimodal strength, web search grounding |
| Meta | Llama 3.1, Llama 3 | Open-source, cost-effective |
| Mistral | Mistral 7B, Mixtral 8x22B, Mistral Large | Speed/quality trade-off |
| DeepSeek | DeepSeek Coder, DeepSeek R1 | Code generation strength |

2. Agent Service (Full Lifecycle)

Build Phase:

  • No-code: Portal UI with conversation designer
  • Low-code: Pre-built Agent Framework (Python)
  • Pro-code: Full control (LangGraph, Semantic Kernel, or custom)
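On the pro-code path you define the agent shape yourself. A minimal sketch — the `Agent` class and tool signature here are illustrative stand-ins, not the exact Foundry Agent Framework API:

```python
# Hypothetical pro-code sketch: Agent and tool shapes are illustrative,
# not the exact Foundry Agent Framework API.
class Agent:
    def __init__(self, model: str, system_prompt: str, tools=None):
        self.model = model
        self.system_prompt = system_prompt
        self.tools = tools or []

def lookup_order(order_id: str) -> dict:
    """Toy tool the agent could call during a conversation."""
    return {"order_id": order_id, "status": "shipped"}

support_agent = Agent(
    model="gpt-4o",
    system_prompt="You are a customer support agent.",
    tools=[lookup_order],
)
print(support_agent.model)  # gpt-4o
```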

Evaluate Phase:

  • Built-in Evaluators: ~15 quality metrics (intent resolution, task success, groundedness, etc.)
  • A/B Testing: Compare Model A vs Model B on same task
  • Regression Detection: Auto-catch quality drops when updating prompts/models

Deploy Phase:

  • Serverless hosting: No infrastructure to manage, auto-scale
  • Multiple deployment targets: Azure, on-prem, edge

3. Evaluation Framework (Deep Dive)

Built-in Evaluators:

| Evaluator | What It Measures | Scale |
|---|---|---|
| Intent Resolution | Did the agent understand what the user wanted? | 0–5 |
| Task Completion | Did the agent accomplish the task? | 0–5 |
| Groundedness | Is the response grounded in source documents? | 0–5 |
| Relevance | Is the response relevant to the question? | 0–5 |
| Faithfulness | Does the response accurately reflect source info? | 0–5 |
| Safety | Does the response avoid harmful content? | 0–5 |

Python Example: Programmatic Evaluation

```python
from azure.ai.foundry.evaluation import evaluate_agent
from azure.ai.foundry.evaluation.evaluators import (
    IntentResolutionEvaluator,
    TaskCompletionEvaluator,
    GroundednessEvaluator
)

# Define test cases
test_cases = [
    {
        "query": "What's your pricing?",
        "expected_answer": "Enterprise plan is $29/user/month",
        "context": "pricing_doc.txt"
    },
    # ... more test cases
]

# Run evaluation
results = evaluate_agent(
    agent_id="sales-bot",
    test_set=test_cases,
    evaluators=[
        IntentResolutionEvaluator(),
        TaskCompletionEvaluator(),
        GroundednessEvaluator()
    ]
)

print(f"Intent Resolution: {results.intent_resolution.mean():.1f}/5")
print(f"Task Completion: {results.task_completion.mean():.1f}/5")
```

4. Multi-Model Strategy

The Innovation: Build once, swap models without code changes.

```python
# Same agent code, different models
agent_a = Agent(model="gpt-4o", system_prompt="You are a helpful assistant")
agent_b = Agent(model="claude-3.5-sonnet", system_prompt="You are a helpful assistant")

# Same test set, two evaluations
eval_a = evaluate_agent(agent_a, test_cases)
eval_b = evaluate_agent(agent_b, test_cases)

# Pick the winner and deploy -- no code changes needed
winner = agent_b if eval_b.intent_resolution > eval_a.intent_resolution else agent_a
winner.deploy()
```

Foundry Agent Service Architecture

Trace and Debug

Every LLM call is traced:

```text
User Query: "What's my account balance?"

Trace:
+- Step 1: Model Call (gpt-4o)
|  +- Input: "What's my account balance?"
|  +- Tokens In: 42, Tokens Out: 156
|  +- Latency: 1.2 sec
|
+- Step 2: Tool Call (search_account)
|  +- Result: {"balance": "$5,432.10", "currency": "USD"}
|  +- Latency: 0.3 sec
|
+- Step 3: Model Call (gpt-4o, context-aware)
|  +- Output: "Your account balance is $5,432.10 USD."
|  +- Latency: 0.8 sec
|
+- Final Response: "Your account balance is $5,432.10 USD."
   +- Total Latency: 2.3 sec
```
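Traces in this shape are easy to post-process for dashboards or cost reports. A minimal sketch — using a local dataclass, not the Foundry SDK — that aggregates latency and token usage over the trace above:

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    kind: str        # "model_call" or "tool_call"
    latency: float   # seconds
    tokens_in: int = 0
    tokens_out: int = 0

def summarize(trace):
    """Total latency and token usage across all steps of a trace."""
    total_latency = sum(s.latency for s in trace)
    total_tokens = sum(s.tokens_in + s.tokens_out for s in trace)
    return total_latency, total_tokens

# The three steps from the trace above
trace = [
    TraceStep("model_call", 1.2, tokens_in=42, tokens_out=156),
    TraceStep("tool_call", 0.3),
    TraceStep("model_call", 0.8),
]
latency, tokens = summarize(trace)
print(f"Total latency: {latency:.1f} sec")  # Total latency: 2.3 sec
```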

Deploy Phase

```shell
# One command deploys
az ai foundry agent deploy \
  --name customer-support \
  --environment production \
  --auto-scale true \
  --max-concurrent 100
```

Evaluation in Production: Regression Detection

```python
from azure.ai.foundry.evaluation import RegressionDetector

detector = RegressionDetector(
    agent_id="customer-support",
    baseline_metrics={
        "intent_resolution": 4.2,
        "task_completion": 3.8,
        "groundedness": 4.5
    },
    alert_threshold={
        "intent_resolution": -0.3,
        "task_completion": -0.2,
        "groundedness": -0.3
    }
)

# Run daily evaluation on a sample of recent conversations
daily_eval = detector.evaluate_daily_sample(sample_size=50)

# send_alert() and agent are assumed to be defined elsewhere
if daily_eval.has_regression():
    send_alert(
        subject="Agent quality degraded",
        body=f"Intent resolution: {daily_eval.intent_resolution} (baseline: 4.2)"
    )
    if daily_eval.severity == "critical":
        agent.rollback_to_previous_version()
```

Real-World Deployment: Multi-Model Agent

Case Study: Insurance Claims

Solution: Conditional Multi-Model Dispatch

```text
Before Multi-Model Routing:
  All claims: gpt-4o ($0.01 per claim)
  100K claims/month: $1,000/month

After Multi-Model Routing:
  40% simple claims: gpt-3.5-turbo ($0.001 per claim) = $40
  50% balanced claims: gpt-4o ($0.01 per claim) = $500
  10% complex claims: o1 ($0.05 per claim) = $500
  100K claims/month: $1,040/month

Quality Improvement:
  Before: 86% claims processed correctly
  After: 92% claims processed correctly

Latency Improvement:
  Before: Avg 4.2 sec per claim
  After: Avg 2.8 sec (33% faster)
```
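The routing itself can be a thin dispatch layer in front of the agent. A hedged sketch, where `classify_complexity()` is a toy heuristic standing in for whatever classifier a real deployment would use:

```python
# Hypothetical complexity-based model router for claims triage.
ROUTES = {
    "simple": "gpt-3.5-turbo",   # ~40% of claims: cheap and fast
    "balanced": "gpt-4o",        # ~50% of claims: default quality
    "complex": "o1",             # ~10% of claims: deep reasoning
}

def classify_complexity(claim: dict) -> str:
    """Toy heuristic; a real system might use a small classifier model."""
    if claim.get("amount", 0) > 50_000 or claim.get("disputed"):
        return "complex"
    if len(claim.get("documents", [])) > 3:
        return "balanced"
    return "simple"

def route_claim(claim: dict) -> str:
    """Return the model to use for this claim."""
    return ROUTES[classify_complexity(claim)]

print(route_claim({"amount": 1_200, "documents": []}))    # gpt-3.5-turbo
print(route_claim({"amount": 80_000, "disputed": True}))  # o1
```

Because the agent code is model-agnostic, swapping the model per request is a string lookup, not a rewrite.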

Pricing and Cost Structure

Example: Customer Support Agent

```text
Deployment: Foundry serverless
Model: gpt-4o
Monthly Volume: 100K conversations

Monthly Cost Breakdown:
  Model tokens: 100K conversations x 200 tokens avg x $0.005/1K = $100
  RAG queries: 100K queries x $0.0002 per query = $20
  Evaluation (weekly): 4 runs x 500 test cases x $0.005 = $10
  Storage (knowledge): $5
  Hosting (serverless): 100K requests x $0.0001 = $10
  Total: ~$145/month

Annual cost: ~$1,740 for an agent that handles 1.2M customer interactions
ROI (assumes 1 FTE @ $70K/year): 40x
```
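The breakdown above is straightforward arithmetic, which makes it easy to re-run with your own volumes. A quick sanity check of the numbers as stated:

```python
# Reproduce the cost breakdown above; all rates are from the example.
conversations = 100_000
token_cost   = conversations * 200 / 1000 * 0.005  # $100
rag_cost     = conversations * 0.0002              # $20
eval_cost    = 4 * 500 * 0.005                     # $10 (weekly runs)
storage_cost = 5.0                                 # $5
hosting_cost = conversations * 0.0001              # $10

monthly = token_cost + rag_cost + eval_cost + storage_cost + hosting_cost
print(f"${monthly:.0f}/month, ${monthly * 12:.0f}/year")  # $145/month, $1740/year
```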

Key Properties

| Property | Value | Notes |
|---|---|---|
| Model Catalog Size | 11,000+ models | Largest on the market |
| Build-to-Production Time | 2–4 weeks | No-code: 1 week, Pro-code: 3–4 weeks |
| Evaluation Coverage | ~15 built-in metrics | Can add custom evaluators |
| Multi-Model Support | Native | Swap models without code changes |
| Trace Depth | Every LLM call + tool invocation | Debug-friendly |
| Compliance Certifications | SOC 2, HIPAA, FedRAMP, GDPR | Enterprise-grade |
| Scalability | 0 to 1000+ concurrent requests | Auto-scaling built-in |
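The built-in metrics can be extended with your own. A hedged sketch of a custom evaluator — the callable interface here is illustrative, not the exact Foundry SDK evaluator contract:

```python
class ResponseLengthEvaluator:
    """Scores a response 0-5 by how close it is to a target word count.

    Illustrative custom evaluator; a real one would follow the
    Foundry SDK's evaluator protocol and likely judge quality, not length.
    """
    def __init__(self, target_words: int = 80, tolerance: int = 40):
        self.target = target_words
        self.tolerance = tolerance

    def __call__(self, response: str) -> float:
        words = len(response.split())
        deviation = abs(words - self.target)
        # Full marks at the target, linearly decaying to 0 at 2x tolerance
        score = 5.0 * max(0.0, 1 - deviation / (2 * self.tolerance))
        return round(score, 1)

evaluator = ResponseLengthEvaluator(target_words=10, tolerance=5)
print(evaluator("one two three four five six seven eight nine ten"))  # 5.0
```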


Author’s Take

Azure AI Foundry is the most technically sophisticated AI operations platform on the market, but it’s not for everyone. Choose Foundry if you need to evaluate multiple models systematically, deploy agents at scale, and want full observability. Don’t choose Foundry if you just want to use ChatGPT or you’re a solo founder.

The real power: Foundry’s evaluation framework removes guesswork from “Is our agent good enough to ship?” You measure, compare, iterate, and only deploy when metrics meet thresholds. This is how you build AI systems people trust.

This post is licensed under CC BY 4.0 by the author.