
How Kajabi Saved Thousands on AWS Compute

Discover how Kajabi’s shift to Karpenter, use of ARM-based AWS Graviton instances, and transition to multi-architecture builds reduced compute costs and improved performance. Learn about our journey from EC2 instances to managed node groups, Puma, and spot instances for sustained savings and scalability.

By Allan Taylor

When I started at Kajabi about two years ago, my cloud compute background was all autoscaling groups, EC2 instances, and Heroku. It took some time to wrap my head around Kubernetes and how things worked under the hood, but I slowly became familiar with the mechanisms that allow workloads to be scheduled and scaled in a cluster, and with that came an understanding of Cluster Autoscaler.

Under the hood of Cluster Autoscaler were managed node groups. We had a few different managed node groups for different workloads in each of our clusters. What was really interesting was that managed node groups were actually just autoscaling groups of EC2 instances managed by EKS. One limitation of autoscaling groups at the time was that a group had to be composed of instances of the same size (e.g., 4 vCPU, 16 GiB RAM). This was limiting but created an interesting possibility that led to our first large compute optimization.

The Switch from Intel to AMD

Because our containerized applications already supported the amd64 architecture that both Intel and AMD use, this was low-hanging fruit: we were able to switch to a newer-generation instance type at a lower price and with a performance increase.
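As a rough illustration of how small this change is (a hedged sketch: we manage our clusters with our own tooling, and the names and sizes here are hypothetical), an eksctl-style managed node group is pinned to a single instance type, so the Intel-to-AMD switch is essentially a one-field change:

```yaml
# Hypothetical eksctl ClusterConfig fragment -- names and sizes are
# illustrative, not our actual configuration.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster
  region: us-east-1
managedNodeGroups:
  - name: general-workloads
    # Every node in the group must be the same size; moving the fleet
    # from Intel to AMD is a single-field change:
    # instanceType: m5.4xlarge   # Intel, previous generation
    instanceType: m6a.4xlarge    # AMD, newer generation, lower price
    minSize: 3
    maxSize: 30
```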

Instance type comparison by cost, savings, and performance, using m5.4xlarge as the baseline. Note that any of the newer-generation types (m6 or m7) provides benefits.

Puma Conversion

The next major milestone was the conversion from Unicorn to Puma as our primary web server. Where Unicorn dedicates a whole single-threaded process to each in-flight request, Puma serves many requests concurrently on threads within each worker process, so the same traffic needs fewer processes and far less memory. Getting there was a major effort spread across multiple engineering teams to test and implement thread safety inside the application, but after it was complete we were able to capture some amazing results.

CPU and memory usage both drop significantly at the moment we switched from Unicorn to Puma.

These lower consumption metrics not only allowed for greater pod density on existing nodes but also unlocked the more granular node selection detailed below.
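To make the density point concrete, here is a hypothetical sketch (every name and number is invented for illustration, not our actual settings) of how a Puma-based Rails deployment might be tuned in Kubernetes; Rails' default puma.rb reads WEB_CONCURRENCY and RAILS_MAX_THREADS:

```yaml
# Hypothetical Deployment container fragment -- values are illustrative.
containers:
  - name: web
    image: registry.example.com/app:latest
    env:
      - name: WEB_CONCURRENCY      # Puma worker processes per pod
        value: "2"
      - name: RAILS_MAX_THREADS    # threads per Puma worker
        value: "5"
    resources:
      requests:
        # Threads share memory within a worker, so each pod can request
        # less than an equivalent Unicorn pod would need, letting the
        # scheduler pack more pods onto each node.
        cpu: "2"
        memory: 4Gi
```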

Migration from Cluster Autoscaler to Karpenter

Karpenter offers more than a few advantages over the standard Cluster Autoscaler, but the most notable for us is optimal resource utilization. Karpenter achieves this by automatically provisioning nodes based on application needs, which means you can avoid the overprovisioning that leads to wasted resources and increased costs. In our case, it means we can add, for instance, r6a and c6a nodes to the mix, and Karpenter can pick between general purpose, compute-optimized, or memory-optimized instances based on the demands of the cluster.

Because Karpenter optimizes resource utilization through consolidation, it can lead to significant cost savings: by avoiding overprovisioning, you reduce the number of nodes required to run your applications, which results in lower cloud bills. Karpenter can also gather real-time price information about instances and intelligently select the optimal node to introduce to the cluster.
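As a sketch of what this looks like in practice (a minimal example using Karpenter's v1 API, not our exact NodePool), you declare the instance families you are willing to run, enable consolidation, and let Karpenter choose:

```yaml
# Minimal Karpenter v1 NodePool sketch -- not our production config.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        # Karpenter may pick general purpose (m6a), compute-optimized
        # (c6a), or memory-optimized (r6a) instances based on pending
        # pod shapes and current pricing.
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6a", "c6a", "r6a"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    # Replace or remove nodes whenever a cheaper or denser packing exists.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```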

More Instance Types

With Karpenter now in place, we added compute- and memory-optimized instances into the mix, as sketched above, and let Karpenter decide which instance type to select based on current cluster demands and cost.

Shows Karpenter making the switch from general purpose nodes (blue) to compute-optimized nodes. This makes for higher cluster optimization and fewer wasted resources.

Going Multiarch

The next big change for Kajabi was supporting multi-architecture builds for our applications, so that the same images could run both on our ARM-based local development machines and on AWS Graviton processors in the cloud. AWS claims Graviton delivers up to 40% better price performance over comparable current-generation x86-based instances for a broad spectrum of workloads.
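We won't detail our CI pipeline here, but as an assumption-laden sketch (the registry, tags, and choice of GitHub Actions are placeholders, not necessarily what we run), a multi-architecture image build with Docker Buildx looks roughly like this:

```yaml
# Hypothetical GitHub Actions job -- registry and tags are placeholders.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # QEMU lets the amd64 runner emulate arm64 during the build.
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          # One manifest list serves both local ARM dev machines and
          # Graviton nodes in the cluster.
          platforms: linux/amd64,linux/arm64
          push: true
          tags: registry.example.com/app:latest
```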

Through our own testing on Graviton instances, we were able to achieve the results shown in this table: Graviton was 34% faster, with 59% higher throughput and 25% higher bandwidth.

With performance up and compute costs down, we went for it. Illustrated below is one of our clusters running primarily on ARM-based compute, with some AMD-based compute mixed in where Karpenter made an intelligent decision, based on price or on the availability of instance types in a given AZ, to maintain our highly available architecture.

Terminal view of a flame graph showing utilization of different instance types as automatically determined by Karpenter. The different optimization choices (compute, memory, balanced) and architectures (AMD and Graviton) are notable.
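The configuration side of that is small. Assuming the multi-arch images above, widening the NodePool's architecture requirement is enough to let Karpenter weigh Graviton against x86 (again a minimal sketch, not our exact configuration):

```yaml
# NodePool requirements fragment -- Karpenter may now choose Graviton
# (arm64) or AMD (amd64) instances based on price and AZ availability.
requirements:
  - key: kubernetes.io/arch
    operator: In
    values: ["arm64", "amd64"]
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: ["m7g", "c7g", "r7g", "m6a", "c6a", "r6a"]
```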

Bonus Savings

The migration to Karpenter also allowed a super easy configuration change to move from gp2 to gp3 storage, which gave us more performant storage volumes at 20% less cost.

Graph showing more expensive gp2 (orange) storage being replaced by gp3 (purple).
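In current Karpenter APIs, the node volume type lives on the EC2NodeClass that the NodePool references; assuming that setup, the change amounts to a few lines (a minimal sketch, with an illustrative size and other required EC2NodeClass fields omitted):

```yaml
# EC2NodeClass sketch -- AMI, subnet, and security group selectors omitted.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        # gp3 offers a baseline of 3,000 IOPS and 125 MiB/s at roughly
        # 20% lower cost per GB than gp2.
        volumeType: gp3
        encrypted: true
```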

Convertible RIs, Spot Instances

Marching toward our annual RI/SP renewal with AWS, we knew there was work to be done on coverage. We selected and worked with a third-party vendor to orchestrate convertible RIs, so that we can incrementally get better compute prices over time, and we also successfully introduced spot instances into our lower environments. Karpenter is able to introduce spot instances and, with spot interruption notices and consolidation configured, natively handles resiliency.
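Enabling spot for a NodePool is, again, a requirement change (a sketch; resiliency in practice also depends on Karpenter's interruption handling being configured, e.g. an SQS interruption queue):

```yaml
# NodePool requirements fragment for lower environments -- illustrative.
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
# With both capacity types allowed, Karpenter's price-aware selection
# favors spot and can fall back to on-demand when spot capacity is
# unavailable or interrupted.
```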

On-demand (orange) has dropped considerably. Initially replaced only with savings plans (purple), in July and August we were able to start using reserved instances (blue) and dip our toes into spot instances (aqua).
A different view of similar information shifted one month later. Here we see the percentage of spend in each category. Spot instances (orange here) continue to see increased usage.

Spot instances are available at up to 90% off on-demand rates, though we generally observe savings in the 50-65% range versus on-demand prices. Combined, all of these changes have let us keep our up-front savings-plan spend the same as in years past while distributing it to other services like RDS and reaching a higher percentage of coverage from savings plans and RIs.

We have continued to increase our use of spot instances (aqua bars). We have seen a fairly steady savings over on-demand of about 60% (green line).

What's Next?

In summary, we've made a lot of progress in a year, and we plan to continue this optimization journey by identifying spot-stable workloads for our production environment and auditing our pod resource requests and limits to keep optimizing our platform. A huge thanks to the entire Production Engineering team and our Internal Platform and Automations team for supporting the projects and improvements that made this possible!