Azure AI Foundry
Microsoft's unified platform-as-a-service for enterprise AI operations: 11,000+ models from all vendors, integrated agent building, production evaluation, and monitoring. It is the unified hub that replaces Azure ML, Azure Cognitive Services, and Azure OpenAI Studio with one opinionated platform.
What Is Azure AI Foundry?
Azure AI Foundry is Microsoft’s bet on model-agnostic AI operations. Instead of being locked into a single model (say, GPT-4o), you can:
- Choose from 11,000+ models (OpenAI, Anthropic, Google, Meta, Mistral, Cohere, DeepSeek, Stability AI, etc.)
- Build agents with the right model for the task
- Evaluate quality systematically before production
- Deploy with auto-scaling
- Monitor performance and catch regressions
- Swap models without code changes
The Consolidation: Azure ML + Cognitive Services + Azure OpenAI Studio -> Azure AI Foundry (launched 2024, becoming the canonical platform).
Core Capability Stack
1. Model Catalog (11,000+ Models)
Featured Model Families:
| Vendor | Models | Notes |
|---|---|---|
| OpenAI | GPT-4o, o1, o3, GPT-4 Turbo, GPT-3.5 Turbo, embeddings | Highest quality, most expensive |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Best reasoning, good safety |
| Google | Gemini 2.0, Gemini 1.5, embeddings | Multimodal strength, web search grounding |
| Meta | Llama 3.1, Llama 3 | Open-source, cost-effective |
| Mistral | Mistral 7B, 8x22B, Large | Speed/quality trade-off |
| DeepSeek | DeepSeek Coder, DeepSeek R1 | Code generation strength |
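The table above suggests a simple selection heuristic: match the task's dominant requirement to a model family's strength. A minimal sketch of that idea, assuming an in-code lookup table (the `pick_model` helper and the requirement keys are illustrative, not part of any Foundry SDK):

```python
# Illustrative requirement -> model mapping; model names mirror the catalog table.
CATALOG = {
    "highest_quality": "gpt-4o",
    "deep_reasoning": "claude-3.5-sonnet",
    "multimodal": "gemini-1.5-pro",
    "cost_effective": "llama-3.1-70b",
    "code_generation": "deepseek-coder",
}

def pick_model(requirement: str) -> str:
    """Return a catalog model id for the task's dominant requirement."""
    return CATALOG.get(requirement, "gpt-4o")  # default to the general-purpose choice

print(pick_model("code_generation"))  # deepseek-coder
print(pick_model("unlisted_need"))    # gpt-4o
```

In practice the routing decision would live in configuration rather than code, which is what makes the "swap models without code changes" claim work.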
2. Agent Service (Full Lifecycle)
Build Phase:
- No-code: Portal UI with conversation designer
- Low-code: Pre-built Agent Framework (Python)
- Pro-code: Full control (LangGraph, Semantic Kernel, or custom)
Evaluate Phase:
- Built-in Evaluators: ~15 quality metrics (intent resolution, task success, groundedness, etc.)
- A/B Testing: Compare Model A vs Model B on same task
- Regression Detection: Auto-catch quality drops when updating prompts/models
Deploy Phase:
- Serverless hosting: No infrastructure to manage, auto-scale
- Multiple deployment targets: Azure, on-prem, edge
3. Evaluation Framework (Deep Dive)
Built-in Evaluators:
| Evaluator | What It Measures | Scale |
|---|---|---|
| Intent Resolution | Did the agent understand what the user wanted? | 0-5 |
| Task Completion | Did the agent accomplish the task? | 0-5 |
| Groundedness | Is the response grounded in source documents? | 0-5 |
| Relevance | Is the response relevant to the question? | 0-5 |
| Faithfulness | Does response accurately reflect source info? | 0-5 |
| Safety | Does response avoid harmful content? | 0-5 |
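To make the 0-5 scale concrete, here is a toy groundedness-style scorer in plain Python. This is a crude keyword-overlap heuristic for illustration only; Foundry's actual evaluators are LLM-judged, not lexical:

```python
def toy_groundedness(response: str, source: str) -> float:
    """Score 0-5: fraction of response words that also appear in the source.
    A crude stand-in for an LLM-judged groundedness evaluator."""
    resp_words = set(response.lower().split())
    src_words = set(source.lower().split())
    if not resp_words:
        return 0.0
    overlap = len(resp_words & src_words) / len(resp_words)
    return round(overlap * 5, 1)

source = "the enterprise plan costs $29 per user per month"
print(toy_groundedness("enterprise plan costs $29 per user", source))  # 5.0
print(toy_groundedness("our starter tier is free forever", source))    # 0.0
```

A fully grounded answer scores 5, an unsupported one scores 0; the built-in evaluators apply the same scale with far more nuance.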
Python Example: Programmatic Evaluation
```python
from azure.ai.foundry.evaluation import evaluate_agent
from azure.ai.foundry.evaluation.evaluators import (
    IntentResolutionEvaluator,
    TaskCompletionEvaluator,
    GroundednessEvaluator,
)

# Define test cases
test_cases = [
    {
        "query": "What's your pricing?",
        "expected_answer": "Enterprise plan is $29/user/month",
        "context": "pricing_doc.txt"
    },
    # ... more test cases
]

# Run evaluation
results = evaluate_agent(
    agent_id="sales-bot",
    test_set=test_cases,
    evaluators=[
        IntentResolutionEvaluator(),
        TaskCompletionEvaluator(),
        GroundednessEvaluator(),
    ],
)

print(f"Intent Resolution: {results.intent_resolution.mean():.1f}/5")
print(f"Task Completion: {results.task_completion.mean():.1f}/5")
```
4. Multi-Model Strategy
The Innovation: Build once, swap models without code changes.
```python
# Same agent code, different models
agent_a = Agent(model="gpt-4o", system_prompt="You are a helpful assistant")
agent_b = Agent(model="claude-3.5-sonnet", system_prompt="You are a helpful assistant")

# Same test set, two evaluations
eval_a = evaluate_agent(agent_a, test_cases)
eval_b = evaluate_agent(agent_b, test_cases)

# Pick the winner and deploy -- no code changes needed
winner = agent_b if eval_b.intent_resolution > eval_a.intent_resolution else agent_a
winner.deploy()
```
Foundry Agent Service Architecture
Trace and Debug
Every LLM call is traced:
```
User Query: "What's my account balance?"

Trace:
+- Step 1: Model Call (gpt-4o)
|  +- Input: "What's my account balance?"
|  +- Tokens In: 42, Tokens Out: 156
|  +- Latency: 1.2 sec
|
+- Step 2: Tool Call (search_account)
|  +- Result: {"balance": "$5,432.10", "currency": "USD"}
|  +- Latency: 0.3 sec
|
+- Step 3: Model Call (gpt-4o, context-aware)
|  +- Output: "Your account balance is $5,432.10 USD."
|  +- Latency: 0.8 sec
|
+- Final Response: "Your account balance is $5,432.10 USD."
+- Total Latency: 2.3 sec
```
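Because every step carries token counts and latency, a trace like the one above can be aggregated programmatically. A small sketch over a hand-built list of steps (the dict field names are illustrative; Foundry's actual trace schema may differ):

```python
# Steps mirror the trace above; field names are illustrative, not the real schema.
trace = [
    {"type": "model_call", "tokens_in": 42, "tokens_out": 156, "latency_s": 1.2},
    {"type": "tool_call",  "latency_s": 0.3},
    {"type": "model_call", "latency_s": 0.8},
]

total_latency = sum(step["latency_s"] for step in trace)
model_calls = sum(1 for step in trace if step["type"] == "model_call")

print(f"Total latency: {total_latency:.1f} sec")  # 2.3 sec, matching the trace
print(f"Model calls: {model_calls}")              # 2
```

The same per-step data is what makes regression detection and cost attribution possible downstream.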
Deploy Phase
```shell
# One command deploys
az ai foundry agent deploy \
  --name customer-support \
  --environment production \
  --auto-scale true \
  --max-concurrent 100
```
Evaluation in Production: Regression Detection
```python
from azure.ai.foundry.evaluation import RegressionDetector

detector = RegressionDetector(
    agent_id="customer-support",
    baseline_metrics={
        "intent_resolution": 4.2,
        "task_completion": 3.8,
        "groundedness": 4.5
    },
    alert_threshold={
        "intent_resolution": -0.3,
        "task_completion": -0.2,
        "groundedness": -0.3
    }
)

# Run daily evaluation on a sample of recent conversations
daily_eval = detector.evaluate_daily_sample(sample_size=50)

# send_alert and the deployed agent handle are assumed to be in scope
if daily_eval.has_regression():
    send_alert(
        subject="Agent quality degraded",
        body=f"Intent resolution: {daily_eval.intent_resolution} (baseline: 4.2)"
    )
    if daily_eval.severity == "critical":
        agent.rollback_to_previous_version()
```
Real-World Deployment: Multi-Model Agent
Case Study: Insurance Claims
Solution: Conditional Multi-Model Dispatch
```
Before Multi-Model Routing:
  All claims: gpt-4o ($0.01 per claim)
  100K claims/month: $1,000/month

After Multi-Model Routing:
  40% simple claims:   gpt-3.5-turbo ($0.001 per claim) = $40
  50% balanced claims: gpt-4o ($0.01 per claim)         = $500
  10% complex claims:  o1 ($0.05 per claim)             = $500
  100K claims/month: $1,040/month

Quality Improvement:
  Before: 86% of claims processed correctly
  After:  92% of claims processed correctly

Latency Improvement:
  Before: avg 4.2 sec per claim
  After:  avg 2.8 sec per claim (33% faster)
```
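The dispatch logic behind these numbers can be sketched as a simple complexity router. The per-claim prices and the 40/50/10 split are the ones quoted above; the `route` and `monthly_cost` functions are illustrative, not a Foundry API:

```python
# Per-claim model prices from the case study above.
PRICE = {"gpt-3.5-turbo": 0.001, "gpt-4o": 0.01, "o1": 0.05}

def route(complexity: str) -> str:
    """Map claim complexity to a model, per the dispatch policy above."""
    return {"simple": "gpt-3.5-turbo", "balanced": "gpt-4o", "complex": "o1"}[complexity]

def monthly_cost(claims_by_complexity: dict[str, int]) -> float:
    """Total monthly model spend for a given claim mix."""
    return sum(n * PRICE[route(c)] for c, n in claims_by_complexity.items())

# 100K claims/month at the 40/50/10 split quoted above
cost = monthly_cost({"simple": 40_000, "balanced": 50_000, "complex": 10_000})
print(f"${cost:,.0f}/month")  # $1,040/month, matching the case study
```

The interesting part is that routing here buys quality and latency, not raw savings: the spend is roughly flat while complex claims get a stronger model.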
Pricing and Cost Structure
Example: Customer Support Agent
```
Deployment: Foundry serverless
Model: gpt-4o
Monthly Volume: 100K conversations

Monthly Cost Breakdown:
  Model tokens:        100K conversations x 200 tokens avg x $0.005/1K = $100
  RAG queries:         100K queries x $0.0002 per query                = $20
  Evaluation (weekly): 4 runs x 500 test cases x $0.005                = $10
  Storage (knowledge):                                                   $5
  Hosting (serverless): 100K requests x $0.0001                        = $10

Total: ~$145/month
Annual cost: ~$1,740 for an agent that handles 1.2M customer interactions
ROI (assumes 1 FTE @ $70K/year): ~40x
```
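The breakdown above reduces to straightforward arithmetic. A sketch that reproduces it (the rates are the example's own figures, not published Foundry pricing):

```python
conversations = 100_000

costs = {
    "model_tokens": conversations * 200 / 1000 * 0.005,  # 200 avg tokens @ $0.005/1K
    "rag_queries": conversations * 0.0002,               # $0.0002 per query
    "evaluation": 4 * 500 * 0.005,                       # 4 weekly runs x 500 test cases
    "storage": 5.0,                                      # flat knowledge-store cost
    "hosting": conversations * 0.0001,                   # serverless, $0.0001/request
}

monthly = sum(costs.values())
annual = monthly * 12
roi = 70_000 / annual  # vs. 1 FTE at $70K/year

print(f"Monthly: ${monthly:.0f}, Annual: ${annual:.0f}, ROI: {roi:.0f}x")
# Monthly: $145, Annual: $1740, ROI: 40x
```

Parameterizing the model like this also makes it easy to see which line item dominates as volume grows: token cost scales linearly, while storage stays flat.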
Key Properties
| Property | Value | Notes |
|---|---|---|
| Model Catalog Size | 11,000+ models | Largest on the market |
| Build-to-Production Time | 2-4 weeks | No-code: 1 week, Pro-code: 3-4 weeks |
| Evaluation Coverage | ~15 built-in metrics | Can add custom evaluators |
| Multi-Model Support | Native | Swap models without code changes |
| Trace Depth | Every LLM call + tool invocation | Debug-friendly |
| Compliance Certifications | SOC 2, HIPAA, FedRAMP, GDPR | Enterprise-grade |
| Scalability | 0 to 1000+ concurrent requests | Auto-scaling built-in |
Author’s Take
Azure AI Foundry is the most technically sophisticated AI operations platform on the market, but it’s not for everyone. Choose Foundry if you need to evaluate multiple models systematically, deploy agents at scale, and want full observability. Don’t choose Foundry if you just want to use ChatGPT or you’re a solo founder.
The real power: Foundry’s evaluation framework removes guesswork from “Is our agent good enough to ship?” You measure, compare, iterate, and only deploy when metrics meet thresholds. This is how you build AI systems people trust.