
Rethinking IT Habits for the Public Cloud: Start Small

Optimising cloud costs starts with right-sizing your VMs. Traditional on-prem habits of over-provisioning can lead to inflated cloud bills. Reassess your needs and scale smartly to save money.

This is an op-ed from Stefan, our Cloud Economics Lead. If your view differs from his, contact Stefan using the form below the article or directly on LinkedIn and let him know!

Initial Sizing of Your Systems

In the on-prem data centre world you are familiar with, sizing a new system usually followed these steps:

  1. Ask the Vendor: Based on your company size and a rough estimate of future demand growth, the vendor will offer several configurations tailored to future utilisation scenarios. In my experience, these often err on the conservative side.

  2. Do Your Solution Architecture: Initial optimism about user adoption means the biggest-capacity vendor recommendation is often favoured.

  3. Check with Your Capacity Team: Given that you typically buy new hardware every 4 or 5 years, depending on when the next refresh is due, an additional buffer is added to the chosen configuration.

So, a service that could run well enough on 3 CPUs and 32 GB RAM was probably scaled up to 4 CPUs and 48 GB RAM and ended up with 6 CPUs and 80 GB RAM. And since you've already paid for the hardware, that wasn't a big deal.

Fast forward to a cloud migration that happens instead of the next hardware refresh. More often than not, systems are lifted and shifted and have to fit into the available VM 't-shirt sizes'.

Now, let's compare the Azure Pay-as-You-Go costs for two system configurations, "good enough" and final:

  • 3 CPUs + 32 GB RAM: E4as_v5 (4 AMD vCPUs, 32 GB RAM) at $6.58/day in West Europe.

  • 6 CPUs + 80 GB RAM: E16-8as_v5 (16 AMD vCPUs constrained to 8, 128 GB RAM) at $26.30/day in West Europe.

That's four times the cost. And remember, it's "per day".
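Retail rates drift over time, so don't take my numbers on faith. Here is a minimal sketch of checking them yourself, assuming Python with the requests package and the public Azure Retail Prices API (no authentication required); the meter filtering, and the assumption that the constrained-core SKU is listed under its own armSkuName, are mine:

```python
import requests

PRICES_URL = "https://prices.azure.com/api/retail/prices"

def daily_payg_usd(sku: str, region: str = "westeurope") -> float:
    """Linux pay-as-you-go price per day for a VM SKU, in USD."""
    flt = (
        f"armSkuName eq '{sku}' and armRegionName eq '{region}' "
        "and priceType eq 'Consumption' and serviceName eq 'Virtual Machines'"
    )
    items = requests.get(PRICES_URL, params={"$filter": flt}).json()["Items"]
    # Keep only the base Linux meter: drop Windows, Spot and Low Priority.
    base = [
        i for i in items
        if "Windows" not in i["productName"]
        and "Spot" not in i["meterName"]
        and "Low Priority" not in i["meterName"]
    ]
    return base[0]["retailPrice"] * 24  # the API returns an hourly rate

small = daily_payg_usd("Standard_E4as_v5")
big = daily_payg_usd("Standard_E16-8as_v5")
print(f"${small:.2f}/day vs ${big:.2f}/day -> {big / small:.1f}x")
```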

The cloud provider doesn't care whether a VM of a given size runs at a healthy 60% utilisation or a less-than-ideal 2%. Your company gets charged the same amount for that VM. Every. Single. Day. It. Sits. Idle.

There are many days in a year.
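To put a number on it, here is the back-of-the-envelope arithmetic for the two configurations above:

```python
# Difference between the "final" and the "good enough" VM from above,
# per day, compounded over a year of Pay-as-You-Go billing:
waste_per_day = 26.30 - 6.58           # USD/day
print(f"${waste_per_day * 365:,.2f}")  # -> $7,197.80 per year, per VM
```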

Why Did That Happen?

In your on-prem world, you had to ensure that, within reason, no matter how much your solution grew, it would still fit into the existing server capacity you might be limited to for the next 3+ years. So it made perfect sense to size systems for the best-case adoption scenario.

Add to that the fact that capacity teams didn't always have the luxury of revisiting every application at hardware renewal, and that there wasn't always even a product team still actively managing the application. In those cases, the gap between the needed and the actual sizing could be even larger than in my example above.

How Do I Know?

I've been part of numerous infrastructure assessments preceding cloud migrations. These assessments monitor actual system utilisation, typically over a 30-day period. The key reason these often show a positive ROI is that most on-prem landscapes can be downsized significantly.

Pro tip: You can apply these recommendations during your next maintenance window or at your next on-prem hardware refresh. There is no need to venture into the public cloud to rightsize a system back to its optimal configuration.

I've also heard the rumour—and I want to explicitly say this is what someone else has claimed—that the average CPU utilisation across the biggest public clouds sits somewhere in the single digits.

I've seen multi-million-dollar-per-month cloud bills with an average CPU utilisation below 5%. That's why I tend to give credibility to that rumour.

What to Do?

The reason for starting big on-prem was the hard upper limit on the compute capacity your capacity team could offer you. In the public cloud, that limit mostly no longer exists: you can start a system on the smallest VM still capable of running your application in its first month, then scale the VM up as demand increases. All it takes is a reboot.
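For illustration only, here is a sketch of such a scale-up with the Azure SDK for Python (the azure-identity and azure-mgmt-compute packages); the subscription, resource group, VM name and target size are placeholders, and `az vm resize` in the Azure CLI achieves the same:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import HardwareProfile, VirtualMachineUpdate

# Placeholder names: substitute your own subscription, group and VM.
client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.virtual_machines.begin_update(
    "my-resource-group",
    "my-app-vm",
    # Move one t-shirt size up; this restarts the VM.
    VirtualMachineUpdate(
        hardware_profile=HardwareProfile(vm_size="Standard_E8as_v5")
    ),
)
poller.result()  # blocks until the resize (and the reboot) completes
```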

So the next time you design a system running in the public cloud, ask your vendor this question:

What is the smallest t-shirt size in my public cloud where this application still runs smoothly, given my initial user base?

Then, work from there. Have your ops team monitor utilisation, and when the VM starts to struggle under its load, increase its size during the next maintenance window.
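As a sketch of what that monitoring could look like in Azure, assuming the azure-monitor-query package and a placeholder VM resource ID, this pulls a 30-day hourly CPU average, the same signal the pre-migration assessments above rely on:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

# Placeholder resource ID: substitute your own VM's.
VM_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/my-resource-group"
    "/providers/Microsoft.Compute/virtualMachines/my-app-vm"
)

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    VM_ID,
    metric_names=["Percentage CPU"],
    timespan=timedelta(days=30),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.AVERAGE],
)

samples = [
    point.average
    for metric in result.metrics
    for series in metric.timeseries
    for point in series.data
    if point.average is not None
]
print(f"30-day average CPU: {sum(samples) / len(samples):.1f}%")
```

If that average sits in the single digits, you have found the rumour from above in your own estate.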