Cloud Cost Optimization

Cloud spend is the fastest-growing and least-controlled budget line in most engineering organizations. Without FinOps discipline, cloud bills grow 30-40% year-over-year while utilization sits at 35-45%.

Key Dimensions

| Dimension | Definition | Typical Values |
| --- | --- | --- |
| Cloud Unit Cost | Cost per unit of business output (e.g., cost per transaction) | Varies by business |
| Utilization Rate | % of provisioned resources actively used | Industry average: 35-45% |
| Coverage Rate | % of eligible spend covered by commitments (RI/SP) | Target: 70-80% |
| Waste | Spend on unused or idle resources | Typically 25-35% of total cloud bill |
| Effective Savings Rate | % savings achieved vs on-demand pricing | Target: 30-50% |
| Cost per Engineer | Cloud spend divided by engineering headcount | €3,000-15,000/year |
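
Several of these dimensions are simple ratios over billing data. A minimal sketch in Python (the figures below are illustrative, not from a real bill):

```python
def utilization_rate(used: float, provisioned: float) -> float:
    """Percent of provisioned capacity actively used."""
    return 100 * used / provisioned

def coverage_rate(committed_spend: float, eligible_spend: float) -> float:
    """Percent of commitment-eligible spend covered by RIs / Savings Plans."""
    return 100 * committed_spend / eligible_spend

def effective_savings_rate(actual_spend: float, on_demand_equivalent: float) -> float:
    """Percent saved vs paying on-demand rates for the same usage."""
    return 100 * (1 - actual_spend / on_demand_equivalent)

print(utilization_rate(40, 100))                 # 40.0
print(coverage_rate(75_000, 100_000))            # 75.0
print(effective_savings_rate(65_000, 100_000))   # ≈ 35.0
```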

FinOps Framework

The FinOps Foundation (part of The Linux Foundation) defines a maturity model for cloud financial management. It’s the industry standard.

FinOps Lifecycle

        ┌─────────────┐
        │   INFORM    │ ← Visibility, allocation, benchmarking
        └──────┬──────┘
               │
        ┌──────▼──────┐
        │  OPTIMIZE   │ ← Right-sizing, commitments, waste elimination
        └──────┬──────┘
               │
        ┌──────▼──────┐
        │   OPERATE   │ ← Governance, automation, continuous improvement
        └──────┬──────┘
               │
               └──────→ (back to INFORM — continuous loop)

FinOps Maturity Levels

| Level | Characteristics | What You Do |
| --- | --- | --- |
| Crawl | Basic cost visibility, manual reporting, reactive management | Tag resources, set up cost dashboards, identify top spenders |
| Walk | Team-level accountability, commitment coverage, some automation | Implement showback, right-size top resources, buy RIs/SPs |
| Run | Automated optimization, unit economics, engineering-driven | Automated scaling, real-time alerts, cost in CI/CD, engineer self-serve |

The Big Five Cost Optimization Levers

1. Right-Sizing (Savings: 20-40%)

Right-sizing means matching resource allocation to actual usage. Most engineers over-provision because “more is safer.”

| Resource | Common Over-Provisioning | Right-Sizing Approach |
| --- | --- | --- |
| Compute (EC2/GCE) | m5.2xlarge running at 15% CPU | Downsize to m5.large, monitor for 2 weeks |
| RDS/Cloud SQL | db.r5.4xlarge at 20% CPU | Downsize to db.r5.xlarge, enable auto-scaling |
| EBS/Persistent Disk | 1TB provisioned, 200GB used | Resize to 300GB (leave 50% headroom) |
| Kubernetes | Resource requests 4x actual usage | Set requests to P95 usage + 20% buffer |

Right-sizing process:

  1. Pull 14-day CPU and memory utilization data
  2. Identify resources with <40% average utilization
  3. Recommend downsizing to next smaller size
  4. Implement in non-prod first, monitor for 1 week
  5. Apply to production with rollback plan
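
The steps above can be sketched as a small script. The size ladder and fleet here are hypothetical stand-ins for real 14-day monitoring data (step 1 would pull that from CloudWatch or Cloud Monitoring):

```python
# Hypothetical size ladder; real data would come from 14 days of
# CloudWatch / Cloud Monitoring utilization metrics (step 1).
SIZE_LADDER = ["m5.large", "m5.xlarge", "m5.2xlarge", "m5.4xlarge"]

def rightsize_candidates(fleet: dict) -> dict:
    """Map instance id -> recommended next-smaller size for instances
    averaging under 40% utilization (steps 2-3)."""
    recs = {}
    for instance_id, (size, avg_cpu_pct) in fleet.items():
        idx = SIZE_LADDER.index(size)
        if avg_cpu_pct < 40 and idx > 0:
            recs[instance_id] = SIZE_LADDER[idx - 1]
    return recs

fleet = {
    "i-0aaa": ("m5.2xlarge", 15.0),   # the 15%-CPU example from the table
    "i-0bbb": ("m5.xlarge", 72.0),    # healthy; leave alone
}
print(rightsize_candidates(fleet))    # {'i-0aaa': 'm5.xlarge'}
```

Steps 4-5 (staged rollout with monitoring and a rollback plan) remain a human process; the script only produces the candidate list.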

Kubernetes-specific:

# Before: Over-provisioned
resources:
  requests:
    cpu: "2000m"      # Requesting 2 full CPUs
    memory: "4Gi"     # Requesting 4GB RAM
  limits:
    cpu: "4000m"
    memory: "8Gi"

# After: Right-sized based on P95 metrics
resources:
  requests:
    cpu: "500m"       # P95 usage was 400m
    memory: "1Gi"     # P95 usage was 800Mi
  limits:
    cpu: "1000m"      # 2x request for burst
    memory: "2Gi"     # 2x request for burst
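
The "P95 + 20% buffer" rule behind the request values above can be sketched as a helper (the YAML rounds the result up to friendlier values like 500m):

```python
import math

def rightsized_requests(p95_cpu_millicores: float, p95_memory_mi: float,
                        buffer: float = 0.20, limit_factor: float = 2.0) -> dict:
    """Requests = P95 usage + buffer; limits = 2x request, as in the YAML above."""
    req_cpu = math.ceil(p95_cpu_millicores * (1 + buffer))
    req_mem = math.ceil(p95_memory_mi * (1 + buffer))
    return {
        "requests": {"cpu": f"{req_cpu}m", "memory": f"{req_mem}Mi"},
        "limits": {"cpu": f"{int(req_cpu * limit_factor)}m",
                   "memory": f"{int(req_mem * limit_factor)}Mi"},
    }

print(rightsized_requests(400, 800))
# {'requests': {'cpu': '480m', 'memory': '960Mi'},
#  'limits': {'cpu': '960m', 'memory': '1920Mi'}}
```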

2. Commitment Discounts – Reserved Instances & Savings Plans (Savings: 30-72%)

| Commitment Type | AWS | GCP | Azure | Discount vs On-Demand |
| --- | --- | --- | --- | --- |
| 1-year, no upfront | RI / Savings Plan | CUD | Reserved VM | 30-40% |
| 1-year, all upfront | RI / Savings Plan | CUD | Reserved VM | 35-45% |
| 3-year, no upfront | RI / Savings Plan | CUD | Reserved VM | 50-60% |
| 3-year, all upfront | RI / Savings Plan | CUD | Reserved VM | 60-72% |

Decision framework:

Should you commit?

1. Is this workload stable for 12+ months?
   ├── Yes → Consider commitment
   └── No → Stay on-demand

2. How predictable is the usage?
   ├── Steady baseline → Commit to baseline, on-demand for peaks
   └── Highly variable → Use Savings Plans (flexible) over RIs (rigid)

3. What's your risk tolerance?
   ├── Conservative → 1-year, no upfront (can adjust annually)
   └── Aggressive → 3-year, all upfront (maximum savings, minimum flexibility)

4. Coverage target?
   └── Cover 70-80% of steady-state with commitments
       Leave 20-30% on-demand for flexibility
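
The coverage target in step 4 implies a simple blended-rate calculation: effective savings equal coverage times discount, because the uncovered share pays full price. A sketch, assuming a 1-year, no-upfront ~35% discount (rates are illustrative):

```python
def blended_savings(on_demand_rate: float, coverage: float, discount: float) -> float:
    """Effective savings vs full on-demand when `coverage` of usage is
    committed at `discount` off and the rest stays on-demand."""
    blended = (coverage * on_demand_rate * (1 - discount)
               + (1 - coverage) * on_demand_rate)
    return 1 - blended / on_demand_rate   # algebraically: coverage * discount

# 80% coverage at an assumed 35% discount:
print(round(blended_savings(0.192, 0.80, 0.35), 4))   # 0.28
```

Note the hourly rate cancels out: only coverage and discount matter, which is why the decision framework focuses on those two numbers.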

3. Spot/Preemptible Instances (Savings: 60-90%)

Spot instances are spare cloud capacity sold at deep discounts with the caveat that they can be reclaimed with short notice (2 minutes on AWS, 30 seconds on GCP).

Good for: Batch processing, CI/CD pipelines, dev/test environments, stateless workers, data processing.

Bad for: Databases, stateful services, user-facing production, anything that can’t handle interruption.

Spot strategy for Kubernetes:

Cluster node pools:
  - On-demand pool:  30% of capacity (critical workloads)
  - Spot pool:       70% of capacity (stateless, fault-tolerant)

Use pod disruption budgets + node affinity to ensure:
  - Databases and stateful sets run on on-demand nodes
  - Stateless microservices prefer spot nodes
  - Batch jobs exclusively use spot nodes
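
The 30/70 split above implies a blended node rate. A sketch assuming an illustrative 70% spot discount (actual spot pricing varies by instance type and availability zone):

```python
def blended_node_cost(on_demand_hourly: float, spot_discount: float,
                      on_demand_share: float = 0.30) -> float:
    """Blended per-node hourly rate for a cluster split between
    on-demand and spot node pools."""
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    return (on_demand_share * on_demand_hourly
            + (1 - on_demand_share) * spot_hourly)

# m5.xlarge at $0.192/hr on-demand, spot at an assumed ~70% discount:
print(round(blended_node_cost(0.192, 0.70), 5))   # 0.09792, ~49% below on-demand
```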

4. Idle Resource Elimination (Savings: 10-25%)

The lowest-hanging fruit. Common idle resources:

| Resource | How It Gets Idle | How to Find It | Action |
| --- | --- | --- | --- |
| Unattached EBS volumes | Instance deleted, volume remains | Filter: status = available | Delete (after backup) |
| Old snapshots | Automated snapshots never cleaned up | Age > 90 days, no AMI reference | Delete |
| Idle load balancers | Service decommissioned, LB remains | 0 active connections for 7+ days | Delete |
| Non-prod environments | Dev/staging running 24/7 | Running outside business hours | Schedule on/off |
| Orphaned IPs | Elastic IP allocated but not attached | Billing without association | Release |
| Oversized dev instances | Developer provisioned large for testing | m5.4xlarge in dev account | Auto-resize to small |

Non-prod scheduling savings:

Dev/Staging environments: Run 10 hours/day x 5 days/week
  = 50 hours / 168 hours per week
  = 30% of on-demand cost
  = 70% savings on non-prod compute

Example:
  10 dev instances x m5.xlarge x $0.192/hr x 24 x 365 = $16,819/year (always-on)
  10 dev instances x m5.xlarge x $0.192/hr x 10 x 260 = $4,992/year (scheduled)
  Savings: $11,827/year (70%)
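
The arithmetic above generalizes to a small helper for any non-prod fleet (rates and counts below are the example's, not universal):

```python
def scheduled_savings(hourly_rate: float, count: int,
                      hours_per_day: int = 10, days_per_year: int = 260) -> dict:
    """Annual cost of always-on vs business-hours scheduling for non-prod."""
    always_on = count * hourly_rate * 24 * 365
    scheduled = count * hourly_rate * hours_per_day * days_per_year
    return {"always_on": round(always_on),
            "scheduled": round(scheduled),
            "savings": round(always_on - scheduled)}

# The 10 x m5.xlarge example above:
print(scheduled_savings(0.192, 10))
# {'always_on': 16819, 'scheduled': 4992, 'savings': 11827}
```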

5. Architecture Optimization (Savings: 20-60%, long-term)

Deeper changes that require engineering effort but deliver sustained savings:

| Pattern | Before | After | Savings |
| --- | --- | --- | --- |
| Serverless migration | Always-on EC2 fleet for bursty workloads | Lambda/Cloud Functions | 40-70% for bursty workloads |
| Managed services | Self-managed Kafka on EC2 | Amazon MSK or Confluent Cloud | 20-40% (less ops overhead) |
| Data tiering | All data on SSD/gp3 | Hot/warm/cold tiers with lifecycle policies | 30-60% on storage |
| CDN / caching | All traffic hits origin | CloudFront/CDN for static + cacheable content | 30-50% on bandwidth + compute |
| Database optimization | Oversized RDS for read-heavy workload | Read replicas + connection pooling | 30-50% |
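
As a rough sketch of the data-tiering row, here is a cost comparison with illustrative per-GB-month rates; these are assumptions, not any provider's actual price list:

```python
# Illustrative per-GB-month rates; check your provider's current pricing.
TIER_RATES = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}

def tiered_monthly_cost(gb_by_tier: dict) -> float:
    """Monthly storage cost for data spread across hot/warm/cold tiers."""
    return sum(TIER_RATES[tier] * gb for tier, gb in gb_by_tier.items())

all_hot = tiered_monthly_cost({"hot": 100_000})   # 100 TB, everything on hot storage
tiered = tiered_monthly_cost({"hot": 20_000, "warm": 30_000, "cold": 50_000})
print(round(1 - tiered / all_hot, 2))   # 0.55, i.e. ~55% storage savings
```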

Cost Allocation & Tagging Strategy

Why Tags Matter

Without proper tagging, your cloud bill is one big number. You can’t answer: “How much does Team X spend?” or “What does Product Y cost to run?”

Minimum Tag Set

| Tag Key | Example Values | Purpose |
| --- | --- | --- |
| team | ai-platform, checkout, platform | Cost allocation to teams |
| environment | production, staging, development | Separate prod from non-prod costs |
| service | recommendation-api, search-service | Per-service cost tracking |
| cost-center | CC-4521, CC-3300 | Map to finance cost centers |
| project | ai-mvp, migration-v2 | Track project-specific spend |
| owner | imrul.sheikh, team-alpha | Accountability |
| managed-by | terraform, manual | Identify IaC-managed resources |

Tag Compliance

Tag compliance rate = Resources with required tags / Total resources x 100

Target: >95%

Enforcement approaches:
  1. Preventive: AWS SCPs / GCP Org Policies block untagged resource creation
  2. Detective: Weekly report of untagged resources sent to team leads
  3. Corrective: Automated tagging based on resource metadata (account, VPC)
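
The compliance rate above is straightforward to compute from a resource inventory. A sketch, using a subset of the minimum tag set (the inventory entries are made up):

```python
REQUIRED_TAGS = {"team", "environment", "service", "cost-center", "owner"}

def tag_compliance(resources: list) -> float:
    """Percent of resources carrying every required tag key."""
    compliant = sum(1 for r in resources
                    if REQUIRED_TAGS <= set(r.get("tags", {})))
    return 100 * compliant / len(resources)

# Made-up inventory: one fully tagged instance, one under-tagged volume.
inventory = [
    {"id": "i-0aaa", "tags": {"team": "checkout", "environment": "production",
                              "service": "search-service",
                              "cost-center": "CC-4521", "owner": "team-alpha"}},
    {"id": "vol-0bbb", "tags": {"team": "checkout"}},
]
print(tag_compliance(inventory))   # 50.0
```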

Showback vs Chargeback

| Model | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Showback | Show teams their costs, no financial consequence | Low friction, raises awareness | No accountability – teams can ignore it |
| Chargeback | Charge costs to team budgets, affects their P&L | Real accountability, drives optimization | Complexity, disputes over shared costs, can discourage experimentation |
| Hybrid | Chargeback for production, showback for dev/test | Balances accountability with flexibility | Moderate complexity |

Recommendation: Start with showback. Most teams reduce spend 15-25% just from visibility. Move to chargeback only when showback stops driving behavior change (typically 12-18 months in).

Showback Report Template

CLOUD COST REPORT — [Team Name] — [Month]

                          This Month   Last Month   MoM Change   Budget   vs Budget
────────────────────────────────────────────────────────────────────────────────────
Compute (EC2/GCE)         €12,500      €11,800      +5.9%        €12,000  +4.2%
Databases (RDS/SQL)       €4,200       €4,100       +2.4%        €4,500   -6.7%
Storage (S3/GCS)          €1,800       €1,750       +2.9%        €2,000   -10.0%
Networking (transfer)     €2,100       €1,900       +10.5%       €1,800   +16.7%
Containers (EKS/GKE)      €3,400       €3,200       +6.3%        €3,500   -2.9%
Other                     €900         €850         +5.9%        €1,000   -10.0%
────────────────────────────────────────────────────────────────────────────────────
TOTAL                     €24,900      €23,600      +5.5%        €24,800  +0.4%

TOP 3 COST DRIVERS THIS MONTH:
1. Networking up 10.5% — caused by cross-region data transfer for new analytics pipeline
2. Compute up 5.9% — added 2 instances for load testing, not yet decommissioned
3. Containers up 6.3% — new microservice deployed, baseline increase expected

OPTIMIZATION OPPORTUNITIES:
1. 3 idle dev instances identified — potential savings: €450/month
2. GP2 volumes eligible for GP3 migration — potential savings: €200/month
3. RI coverage dropped to 65% — renew expiring RIs for €800/month savings

Cloud Budget Governance

Budget Alerts

Set up multi-tier alerts:

| Alert Level | Threshold | Action |
| --- | --- | --- |
| Info | 50% of monthly budget | Automated email to team lead |
| Warning | 80% of monthly budget | Slack notification to team channel |
| Critical | 100% of monthly budget | Page engineering manager + finance |
| Emergency | 120% of monthly budget | Escalate to VP + implement spend freeze |
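
The tiers above map naturally to a threshold lookup. A sketch (routing the alert to email, Slack, or a pager is left to your notification tooling):

```python
from typing import Optional

ALERT_TIERS = [            # (fraction of monthly budget, alert level)
    (1.20, "Emergency"), (1.00, "Critical"), (0.80, "Warning"), (0.50, "Info"),
]

def alert_level(month_to_date_spend: float, monthly_budget: float) -> Optional[str]:
    """Highest alert tier crossed so far this month, or None below 50%."""
    ratio = month_to_date_spend / monthly_budget
    for threshold, level in ALERT_TIERS:
        if ratio >= threshold:
            return level
    return None

print(alert_level(21_000, 24_800))   # Warning (~85% of budget)
```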

Anomaly Detection

Cloud providers offer anomaly detection (AWS Cost Anomaly Detection, GCP Budget Alerts with forecasting). Configure these to catch unexpected spikes before they hit your budget:

  • Set anomaly threshold at 20% above expected daily spend
  • Route alerts to a dedicated Slack channel
  • Assign on-call rotation for cost anomaly investigation
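
The 20%-above-expected rule fits in a few lines; deriving the expected baseline (e.g., a trailing average or the provider's forecast) is the real work and is assumed here:

```python
def is_anomalous(actual_daily: float, expected_daily: float,
                 threshold: float = 0.20) -> bool:
    """Flag a day whose spend runs more than `threshold` above the baseline."""
    return actual_daily > expected_daily * (1 + threshold)

# Baseline of $800/day (e.g. a trailing 30-day average, assumed here):
print(is_anomalous(1_050, 800))   # True  (+31%, over the 20% threshold)
print(is_anomalous(900, 800))     # False (+12.5%, within tolerance)
```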

Anti-Patterns and Common Mistakes

1. Treating Cloud Like On-Prem

The mistake: Buying reserved instances for everything because “that’s how we bought servers.”

Why it’s wrong: Cloud’s value is elasticity. Over-committing eliminates the flexibility advantage.

Instead: Commit to baseline (70-80%), keep headroom for elasticity. Use Savings Plans over RIs for flexibility.

2. Optimizing Too Early

The mistake: Spending weeks optimizing a $500/month dev environment.

Why it’s wrong: The ROI of your engineering time is negative. Focus on the big items first.

Instead: Sort by spend descending. Optimize the top 5 cost items first – they’re usually 80% of the bill (Pareto principle).

3. No Ownership

The mistake: “Cloud costs are the platform team’s problem.”

Why it’s wrong: The team writing the code controls the architecture, instance sizes, and data transfer patterns. Platform can provide tools, but teams must own their costs.

Instead: Every team sees their own costs weekly. Cost efficiency is part of code review (“Does this query need to scan the full table?”).

4. Ignoring Data Transfer Costs

The mistake: Focusing only on compute and storage while data transfer costs grow silently.

Why it’s wrong: Cross-region and internet egress costs are $0.08-0.12/GB and add up fast with high-traffic services.

Instead: Monitor data transfer as a separate line item. Use CDNs for static content. Keep services that communicate frequently in the same region/AZ.

5. Cost Optimization as One-Time Project

The mistake: Running a “cloud cost optimization initiative” once, then declaring victory.

Why it’s wrong: Cloud spend regresses. New services launch, developers forget to clean up, traffic patterns change.

Instead: FinOps is a practice, not a project. Build it into your operating rhythm: weekly cost reviews, monthly optimization sprints, quarterly commitment reviews.


FinOps Operating Rhythm

| Cadence | Activity | Who |
| --- | --- | --- |
| Daily | Anomaly alerts reviewed | On-call engineer |
| Weekly | Cost dashboard review, top anomalies discussed | Team leads |
| Monthly | Full cost report, optimization opportunities, tag compliance | Engineering manager + finance |
| Quarterly | Commitment review (RI/SP renewals), budget vs actual | EM + VP + finance |
| Annually | Cloud strategy review, vendor negotiations, budget planning | VP + CTO + finance |

This post is licensed under CC BY 4.0 by the author.