Seven AWS levers that cut $1.5M from a production SaaS bill

7 May 2026

A few years into my decade-long run as principal engineer at a private SaaS, our annual AWS bill had grown faster than revenue for three quarters running. The board had started asking questions that meant someone was going to have to answer for it. I was that someone.

I'm currently CTO at Allset, where I run production AI on Bedrock. The shape of that bill is different in some ways and identical in others, which is why I keep coming back to this story. The levers that worked then mostly still work, and the patterns that hide costs in a SaaS bill hide them in an AI bill the same way.

Over the next fiscal year we cut about $1.5M from the run rate without breaking anything in production and without slowing product velocity. The work wasn't elegant. It was a series of focused passes through the bill, in priority order, with each lever paying back something material before we moved to the next one.

If you're running a Bedrock or SageMaker workload that's started to spike, the same playbook applies. The line items have different names. The shape is the same. Read on with that translation in your head and the levers will map.

These are the seven levers, in the order they actually moved the bill.

1. Compute commitment optimization

EC2 compute was the biggest single line item by a wide margin. Around 40% of the total bill in any given month. Most of that workload was steady-state, predictable, and on-demand priced, which is the worst of all possible worlds.

The same shape applies on Bedrock. If you're running a steady-state inference workload at on-demand pricing and you've never looked at provisioned throughput, batch mode, or cross-region inference, you're paying retail on the line item that's about to be your top one.

The fastest lever was getting commitment-based pricing under it. Savings Plans for the steady-state baseline, Reserved Instances for the predictable always-on capacity, on-demand only for the bursty top of the load curve.

We deployed Pump for the daily auto-optimization layer. Pump watches your usage and buys, sells, and rebalances commitment buckets every day so you stay close to the optimal mix without anyone manually managing it. That alone took roughly $400K out of the EC2 compute line.

Put commitment pricing under the steady-state load before doing anything else. Right-sizing comes second, and only after you've stopped paying on-demand for capacity you're going to need anyway. This is true for EC2 and it's true for Bedrock provisioned throughput too.
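
If you want to see what this would look like on your own bill before committing, Cost Explorer will generate Savings Plans purchase recommendations from your recent usage. Here's a minimal boto3 sketch, assuming Cost Explorer is enabled on the account and you have permission to call it; the field names pulled from the summary are read defensively because the response shape is more detailed than what's shown here:

```python
import boto3

# Cost Explorer is effectively a global service; us-east-1 is the conventional region.
ce = boto3.client("ce", region_name="us-east-1")

# Ask for a Compute Savings Plans recommendation based on the last 30 days of
# on-demand usage, one-year term, no upfront payment.
resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = resp["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Hourly commitment to purchase:", summary.get("HourlyCommitmentToPurchase"))
print("Estimated monthly savings:    ", summary.get("EstimatedMonthlySavingsAmount"))
print("Estimated savings percentage: ", summary.get("EstimatedSavingsPercentage"))
```

Run it again with the three-year term; the gap between the two numbers is what you're paying for flexibility.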

2. EC2 right-sizing

Once commitments were under the workload, the next lever was looking at what we were actually paying for. Most of our instances were two to four times bigger than they needed to be. Some of that was history. Someone had picked an instance class three years ago for a workload that had since shrunk. Some of it was caution. The default in most engineering cultures is to over-provision because the cost of an instance that's too small (downtime, paged engineers, customer impact) feels worse than the cost of one that's too big (just money).

Compute Optimizer surfaces most of these automatically. We let it run for a week to gather enough data, then walked through every recommendation by hand. Some we accepted directly. Some we sized down further than it suggested. A few we left alone because we knew about workload patterns Compute Optimizer didn't.
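
Pulling the same recommendations programmatically is one paginated API call, which is handy if you want the list in a spreadsheet for the walkthrough. A sketch, assuming Compute Optimizer is already opted in for the account:

```python
import boto3

co = boto3.client("compute-optimizer")

# Walk every EC2 recommendation and print the over-provisioned ones.
token = None
while True:
    kwargs = {"nextToken": token} if token else {}
    resp = co.get_ec2_instance_recommendations(**kwargs)
    for rec in resp.get("instanceRecommendations", []):
        if rec.get("finding") != "OVER_PROVISIONED":
            continue
        top_option = rec.get("recommendationOptions", [{}])[0]
        print(
            rec.get("instanceArn"),
            rec.get("currentInstanceType"),
            "->",
            top_option.get("instanceType"),
        )
    token = resp.get("nextToken")
    if not token:
        break
```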

This pass cut roughly another 25% from the cost of the instances we rightsized, on top of the commitment savings.

3. NAT Gateway data processing fees

This is the quiet one. It's hidden inside the line that says "EC2 - Other." On our bill that line was $300K a year, and most of it was NAT Gateway data processing fees plus inter-AZ data transfer.

NAT Gateway charges $0.045 per GB processed. If your private subnets call AWS services like S3, DynamoDB, ECR, Secrets Manager, or any of dozens of others, that traffic goes out through NAT, hits the public endpoint for the service, and comes back. You're paying NAT processing fees to talk to AWS itself.

The fix is VPC endpoints. Gateway Endpoints for S3 and DynamoDB are free. Interface Endpoints for everything else are about $0.01 per hour per endpoint per AZ, which sounds like nothing until you realize the alternative is paying NAT processing on every call. We added Interface Endpoints for the dozen services our private workloads called most, plus the two Gateway Endpoints, and watched the EC2-Other line drop substantially in the next bill.
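
Creating the endpoints is a one-time change per VPC. A sketch of both kinds with boto3; the VPC, route table, subnet, and security group IDs are placeholders you'd swap for your own:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3: free, attached to the route tables of the private subnets.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",              # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder
)

# Interface endpoint for Secrets Manager: small hourly fee per AZ, no NAT processing.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",              # placeholder
    ServiceName="com.amazonaws.us-east-1.secretsmanager",
    VpcEndpointType="Interface",
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder
    PrivateDnsEnabled=True,
)
```

Repeat the interface endpoint per service your private workloads actually call; the per-hour fee only makes sense where there's real traffic behind it.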

The "EC2 - Other" line is almost always not what it looks like. Drill in before you assume it's load balancer traffic or storage.

4. S3 lifecycle and Intelligent Tiering

S3 was around $72K a year, and most of it was data sitting in Standard storage that hadn't been read in months. Logs we kept forever. Backups from systems that had since been decommissioned. Reports generated for one-off questions that nobody had looked at since.

We added lifecycle policies on the buckets with the most cold data. Standard for the first 30 days. Standard-IA for the next 60. Glacier Instant Retrieval after 90 days. Deep Archive after 180.
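
That whole schedule is one lifecycle configuration per bucket. A sketch with boto3; the bucket name is a placeholder, and you'd narrow the prefix filter if only part of the bucket should age out:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-cold-data-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # whole bucket; narrow this if needed
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```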

For the buckets where access patterns were genuinely unpredictable, we turned on Intelligent Tiering, which moves objects between tiers based on actual access patterns at no extra cost beyond a small monitoring fee per object.

Setting up lifecycle policies takes about an hour per bucket. The savings compound from there.

5. CloudWatch logs and metrics

CloudWatch was $35K a year, which seemed high for what we were actually using. Two things were going on.

First, log retention was set to "Never expire" on most log groups. That's the default if you don't set it explicitly, and most templates don't. We had years of debug-level logs from services that had been retired. We dropped retention to 30 days for most groups, one year for security-relevant groups, and deleted the log groups for retired services entirely.
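
Fixing retention account-wide is a short script. A sketch that sets 30 days on every log group that currently has no expiry; the prefixes used to pick out security-relevant groups are a placeholder for whatever naming convention you actually use:

```python
import boto3

logs = boto3.client("logs")

# Log groups that should keep a year of history instead of 30 days (placeholder convention).
SECURITY_PREFIXES = ("/aws/cloudtrail", "/security/")

paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        name = group["logGroupName"]
        if "retentionInDays" in group:
            continue  # already has an expiry; leave it alone
        days = 365 if name.startswith(SECURITY_PREFIXES) else 30
        logs.put_retention_policy(logGroupName=name, retentionInDays=days)
        print(f"set {name} to {days} days")
```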

If you're running voice or agent workloads, this gets worse. Every turn produces a span. Every retry produces another. We're seeing 30-day retention become the default working assumption now, with anything older going to S3 if you need it for replay or eval.

Second, custom metrics were quietly expensive. CloudWatch charges per metric per month, and high-cardinality custom metrics had blown up our metric count without anyone noticing. The usual culprits are dimensions like user ID, request ID, or anything else with thousands of unique values. On AI workloads add tenant ID and model name to that list.
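
The fix on the metrics side is to aggregate before you publish: keep dimensions to a handful of low-cardinality values and send statistic sets instead of one datapoint per user or request. A sketch, assuming latencies are batched in memory per service rather than emitted per caller; the namespace is a placeholder:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_latency(service, latencies_ms):
    """Publish one aggregated metric per service, not one metric per user or request."""
    if not latencies_ms:
        return
    cloudwatch.put_metric_data(
        Namespace="MyApp",  # placeholder namespace
        MetricData=[
            {
                "MetricName": "RequestLatency",
                # Low-cardinality dimension only; no user ID, request ID, or tenant ID.
                "Dimensions": [{"Name": "Service", "Value": service}],
                "StatisticValues": {
                    "SampleCount": len(latencies_ms),
                    "Sum": sum(latencies_ms),
                    "Minimum": min(latencies_ms),
                    "Maximum": max(latencies_ms),
                },
                "Unit": "Milliseconds",
            }
        ],
    )
```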

Combined, this cut the CloudWatch line by more than half.

6. RDS and ElastiCache rightsizing

RDS was about $46K and ElastiCache about $34K. Both had the same problem as EC2. Instances picked years earlier for workloads that had since changed shape, plus a couple of clusters that had been provisioned for projects that never shipped and were sitting idle.

The first pass was just removing the idle clusters. The second was rightsizing the instances based on actual CPU, memory, and connection metrics. The third was evaluating Aurora I/O-Optimized for the write-heavy databases, which is a different storage tier that costs more per hour but eliminates I/O-per-request charges. For one of our larger databases the math worked out to about 30% savings on net.
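
The evidence for those decisions came from CloudWatch. A sketch that pulls two weeks of CPU utilization for every RDS instance so the idle and oversized ones stand out; you'd do the same for memory (via Enhanced Monitoring) and connection counts before committing to a smaller class:

```python
import boto3
from datetime import datetime, timedelta, timezone

rds = boto3.client("rds")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

for db in rds.describe_db_instances()["DBInstances"]:
    db_id = db["DBInstanceIdentifier"]
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # hourly datapoints
        Statistics=["Average", "Maximum"],
    )
    points = stats["Datapoints"]
    if not points:
        print(f"{db_id}: no datapoints (likely idle)")
        continue
    avg = sum(p["Average"] for p in points) / len(points)
    peak = max(p["Maximum"] for p in points)
    print(f"{db_id} ({db['DBInstanceClass']}): avg {avg:.1f}%, peak {peak:.1f}%")
```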

This lever wasn't huge in absolute dollars compared to compute, but it was nearly free to do once we knew what to look at.

7. Negotiating the bill itself

This one isn't a technical lever. It's a negotiation lever, and it's worth knowing about because most engineering teams don't think to ask.

If you're spending more than roughly $1M a year on AWS, you can negotiate an Enterprise Discount Program. You commit to a multi-year spend at a defined growth rate. In exchange, AWS gives you a percentage discount on your entire bill. The discount tiers up with commitment size and term length.

We were on the Business support plan paying about $90K a year. As part of the EDP conversation, support went up to a higher tier with TAM access at no incremental cost, and the EDP discount took a percentage off the entire bill including the support line.

Negotiating an EDP takes a few weeks. The savings compound for the term of the agreement. If your bill is over a million dollars annually and you don't have one, you're paying retail by default.

What I'd do differently

I'd tag everything from day one. Cost-attribution tags on every resource, every project, every team. Not because the tags themselves save money, but because every single one of these levers was easier to identify and prioritize on the parts of the bill where tagging was thorough. Where tagging was missing, we spent days tracing line items back to owners before we could even decide what to cut.

The AI version of this is per-tenant attribution on every Bedrock invocation. If you can't answer "what does customer X cost us this month" you'll be in the same place we were on the SaaS bill, spending days tracing line items to owners before you can decide what to cut. Bedrock now supports cost allocation by IAM principal in CUR 2.0, which gets you part of the way. The other part is application-layer metering you build yourself.
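
The application-layer piece can start as a thin wrapper around the invocation call that records token usage against a tenant ID. A sketch using the Converse API; the per-token prices and the record_usage sink are placeholders you'd replace with your actual rate card and metering store:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder prices per 1K tokens; substitute your actual model and negotiated rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def record_usage(tenant_id, input_tokens, output_tokens, cost):
    # Placeholder sink: write to your metering table or events pipeline here.
    print(f"{tenant_id}: in={input_tokens} out={output_tokens} cost=${cost:.6f}")

def invoke_for_tenant(tenant_id, model_id, prompt):
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    usage = resp["usage"]
    cost = (
        usage["inputTokens"] / 1000 * PRICE_PER_1K["input"]
        + usage["outputTokens"] / 1000 * PRICE_PER_1K["output"]
    )
    record_usage(tenant_id, usage["inputTokens"], usage["outputTokens"], cost)
    return resp["output"]["message"]["content"][0]["text"]
```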

I'd run Compute Optimizer earlier. We waited until we'd already done most of the manual rightsizing work before letting it loose, which was backwards. It would have given us a better starting point.

I'd be more skeptical of CloudFormation defaults. Most of the oversized RDS and ElastiCache instances came from templates that had been written years ago by people who picked safe-feeling instance classes. The defaults compounded across dozens of stacks before anyone noticed.

Generalizable patterns

The top three line items in any AWS bill are usually 70% of the spend. Start there.

On a Bedrock-heavy bill the top three are usually inference, the EC2 or ECS hosting your orchestration, and storage or networking around RAG. Same rule. Start there.

The "EC2 - Other" line is almost always NAT Gateway data processing, EBS, and inter-AZ data transfer in disguise. Drill into it before you assume it's something inevitable.

S3 lifecycle policies are free to set up and save 60-70% on cold data. There's no good reason not to have them on every bucket with non-trivial volume.

If your bill is over a million dollars annually and you don't have an Enterprise Discount Program, you're paying retail.

Most teams over-provision because the cost of being wrong on the small side (downtime, paged engineers) feels much worse than the cost of being wrong on the big side (just money). Inverting that culture is the real lever. The technical work just makes the cost of over-provisioning visible.

What's different on a Bedrock bill

The bill I just described is a SaaS bill from a few years ago. The bills I'm looking at now are mostly AI workloads. The lever order is roughly the same. The line items are different.

Compute commitment becomes provisioned throughput vs on-demand vs batch. The math is the same as Savings Plans vs Reserved vs On-Demand on EC2. If your inference traffic is steady, you're a candidate for provisioned throughput. If it's bursty and async-tolerant, you're a candidate for batch at 50% off. If it's bursty and synchronous, on-demand is what you've got.

Right-sizing becomes model selection. Claude Sonnet for the workload that needs it. Haiku or Nova for the workload that doesn't. Intelligent prompt routing when the right model can only be picked per request. We're seeing 30-90% cuts from model choice alone on workloads that were over-provisioned to Sonnet by default.
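
The per-request version of that decision can start as a crude heuristic long before you reach for managed prompt routing. A sketch; the model IDs and the complexity test are placeholders, not a recommendation of any particular threshold:

```python
# Placeholder model IDs; use whatever your account actually has access to.
CHEAP_MODEL = "anthropic.claude-3-5-haiku-20241022-v1:0"
STRONG_MODEL = "anthropic.claude-sonnet-4-20250514-v1:0"

def pick_model(prompt, needs_tools, needs_long_context):
    """Route the obviously simple requests to the cheaper model; default up when unsure."""
    if needs_tools or needs_long_context or len(prompt) > 4000:
        return STRONG_MODEL
    return CHEAP_MODEL
```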

NAT Gateway processing has a Bedrock parallel. It's prompt caching. Static prefixes (system prompts, tool definitions, retrieved documents you're chatting against) get cached at 10% of the input token cost. Most teams haven't measured their cache potential. The savings are 60-90% on cacheable prefixes.
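
On the Converse API, caching is opt-in per request: you mark where the static prefix ends with a cache point. A sketch, assuming a model that supports prompt caching; the model ID is a placeholder, and the cache read/write fields in the usage block are what I'd check to confirm the cache is actually being hit:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SYSTEM_PROMPT = "You are a support agent for ..."  # long, static prefix

resp = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
    system=[
        {"text": SYSTEM_PROMPT},
        # Everything before this marker is eligible for caching across requests.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[{"role": "user", "content": [{"text": "Where is my order?"}]}],
)

usage = resp["usage"]
print("cache write tokens:", usage.get("cacheWriteInputTokens"))
print("cache read tokens: ", usage.get("cacheReadInputTokens"))
```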

S3 lifecycle becomes vector storage tiering. S3 Vectors at sub-second latency for cold archives is up to 90% cheaper than OpenSearch Serverless for the same data. The decision tree is similar. Hot data in OpenSearch or pgvector. Cold data in S3 Vectors. Lifecycle moves it for you.

CloudWatch becomes turn-level observability. Voice and agent workloads produce ten to a hundred times more spans per session than a typical SaaS request. Same retention rules. 30 days hot, longer cold, aggregate before you publish.

EDP still applies. If your AWS bill has crossed seven figures and Bedrock is a meaningful slice of it, the EDP conversation should also include Bedrock-specific commitments and discount tiers. AWS will negotiate.

The lever I'd add that didn't exist on the SaaS bill is multi-tenant cost attribution. If you're running Bedrock for more than one customer and you can't answer "what does customer X cost us per month," you'll spend days reconstructing it the first time finance asks. Build the metering at the application layer from day one.

The order matters more than the specifics

Each lever paid back something material before we moved to the next. By the end of the year we'd cut about $1.5M off the run rate, the bill was no longer the topic of every board meeting, and product velocity hadn't slowed.

If you're staring at a bill that's growing faster than revenue, work the top three line items first, find the hidden costs inside "EC2 - Other," then move down the list. The shape of the bill tells you which levers will pay.


If your AWS bill is climbing faster than revenue and you're running Bedrock or SageMaker in production, this is the work I do for Series A-C voice and agentic AI startups. Engagements run 8 weeks, scope-locked, 100% upfront. The way in is an application at knbrlo.com/work. If you have a specific question and don't need a full engagement, a 1-hour consultation works.

Working on a Bedrock workload right now? I'm shipping a small CLI that runs the equivalent of these levers against a Bedrock invocation pattern and ranks the ones that'll pay back. It'll be at /oss/bedrock-cost-calculator when it ships. Reply to me on X if you want early access.