How Kajabi Saved Thousands on AWS Compute
Discover how Kajabi’s shift to Karpenter, use of ARM-based AWS Graviton instances, and transition to multi-architecture builds reduced compute costs and improved performance. Learn about our journey from EC2 instances to managed node groups, Puma, and spot instances for sustained savings and scalability.
When I started at Kajabi around two years ago, my cloud compute background was all autoscaling groups, EC2 instances, and Heroku. It took some time to wrap my head around Kubernetes and how things worked under the hood, but I slowly became familiar with the mechanisms that allow workloads to be scheduled and scaled in a cluster. That foundation helped me understand Cluster Autoscaler.
Under the hood of Cluster Autoscaler were managed node groups. Each of our clusters had a few managed node groups for different workloads. What was really interesting was that managed node groups were just autoscaling groups of EC2 instances managed by EKS. One limitation of autoscaling groups at the time was that a group had to consist of instances of the same size (e.g., 4 vCPU, 16 GB RAM). This was limiting, but it created an interesting possibility that led to our first large compute optimization.
The Switch from Intel to AMD
With containerized applications that supported the amd64 architecture used by both Intel and AMD, this was low-hanging fruit. We were able to switch to a newer-generation instance type at a lower price and with a performance increase.
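In practice, the change amounts to little more than swapping the instance type on a node group. A hypothetical eksctl sketch of that swap (the cluster name, node group name, and sizes are illustrative, not our actual configuration):

```yaml
# Hypothetical eksctl ClusterConfig fragment; names, region,
# and sizes are illustrative only.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster
  region: us-east-1
managedNodeGroups:
  - name: web-workloads
    # Was m5.xlarge (Intel); m5a.xlarge (AMD) is the same
    # 4 vCPU / 16 GB shape at a lower on-demand price.
    instanceType: m5a.xlarge
    minSize: 2
    maxSize: 20
```

Because every workload image already targeted amd64, no application changes were needed for a swap like this, only a rolling replacement of nodes.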
Puma Conversion
The next major milestone was the conversion from Unicorn to Puma as our primary web server. This was a major effort, spread across multiple engineering teams, to test and implement thread safety inside the application, but once it was complete we were able to capture some amazing results.
These lower consumption metrics allowed us to achieve greater pod density on existing nodes and also unlocked the more granular node selection detailed below.
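Greater pod density follows directly from the resource requests: if each pod asks the scheduler for less CPU and memory, more pods fit on the same node. A hypothetical before/after on a Deployment (the values are illustrative, not our real numbers):

```yaml
# Illustrative only: threaded Puma workers share memory across
# threads, so per-pod requests can come down relative to a fleet
# of single-threaded Unicorn processes.
resources:
  requests:
    cpu: "1"        # e.g., was "2" per Unicorn-based pod
    memory: 2Gi     # e.g., was 4Gi
  limits:
    memory: 3Gi
```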
Migration from Cluster Autoscaler to Karpenter
Karpenter offers more than a few advantages over the standard Cluster Autoscaler, but the most notable for us is optimal resource utilization. Karpenter achieves this by automatically provisioning nodes based on application needs, which means you can avoid the overprovisioning that leads to wasted resources and increased costs. In our case, it means we can add, for instance, r6a and c6a nodes to the mix, and Karpenter can pick between general-purpose, compute-optimized, or memory-optimized instances based on the demands of the cluster.
Because Karpenter optimizes resource utilization through consolidation, it can deliver significant cost savings. By avoiding overprovisioning, you can reduce the number of nodes required to run your applications, which results in a lower cloud bill. Karpenter can also gather real-time price information about instances and intelligently select the optimal node to introduce to the cluster.
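Consolidation is configured on the NodePool's disruption block. A minimal sketch using the Karpenter v1 API (the NodePool name and timing are assumptions, not our actual values):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default   # hypothetical name
spec:
  disruption:
    # Let Karpenter remove or replace nodes whenever the same
    # pods could run on fewer or cheaper instances.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```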
More Instance Types
With Karpenter now in place, we added the compute and memory optimized instances into the mix and let Karpenter decide which instance to select based on current cluster demands and cost.
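With Karpenter, "adding instance types to the mix" is a matter of widening the NodePool's requirements. A hedged sketch against the v1 API (the family list is an example, not our exact set):

```yaml
# Fragment of a NodePool spec; Karpenter picks the cheapest
# instance from this set that satisfies the pending pods.
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6a", "c6a", "r6a"]  # general, compute, memory optimized
```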
Going Multiarch
The next big change for Kajabi was supporting multi-architecture builds for our applications so we could take advantage of our ARM-based local development machines as well as AWS Graviton processors. AWS claims Graviton processors deliver up to 40% better price performance than comparable current-generation x86-based instances for a broad spectrum of workloads.
With those results being faster and the compute costing less, we went for it. Illustrated below is one of our clusters running primarily on ARM-based compute, alongside some AMD-based compute where Karpenter made an intelligent decision, either on price or on availability of instance types in a given AZ, to maintain our highly available architecture.
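Once the images are multi-arch, letting Karpenter mix Graviton and x86 nodes comes down to one more requirement on the NodePool (a sketch, assuming the standard well-known labels):

```yaml
# NodePool requirements fragment; with multi-arch images, pods
# can land on either architecture, so Karpenter is free to prefer
# Graviton on price and fall back to amd64 on availability.
- key: kubernetes.io/arch
  operator: In
  values: ["arm64", "amd64"]
```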
Bonus Savings
The migration to Karpenter also allowed a super easy change in the NodePool for us to move from gp2 storage to gp3 storage, which gave us more performant storage volumes at 20% less cost.
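In Karpenter's current API the node root volume is defined on the EC2NodeClass that the NodePool references. A minimal sketch of that change (the name and volume size are illustrative):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default   # hypothetical name
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi   # illustrative size
        volumeType: gp3     # was gp2
```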
Convertible RIs, Spot Instances
Marching toward our annual RI/SP renewal with AWS, we knew there was work to be done based on coverage. We selected and worked with a third-party vendor to orchestrate convertible RIs so that we could incrementally get better compute prices over time, and we also successfully introduced spot instances into our lower environments. Karpenter is able to introduce spot instances and natively handle resiliency, with spot instance interruption notices and consolidation configured.
Spot instances are available at up to 90% off on-demand rates, though we generally observe savings in the 50-65% range versus on-demand prices. All of these changes combined have allowed us to spend the same up-front amount on our savings plans as in years past while distributing spend to other services like RDS and reaching a higher percentage of coverage from savings plans and RIs.
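Opting a lower environment into spot is again a requirements change on the NodePool: Karpenter will prefer spot on price and fall back to on-demand when spot capacity is unavailable (a sketch, assuming the well-known capacity-type label):

```yaml
# NodePool requirements fragment for a lower environment.
- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot", "on-demand"]  # spot preferred on price when available
```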
What's Next?
In summary, we've made a lot of progress in a year, and we plan to continue this optimization journey by identifying spot-stable workloads for our production environment and auditing our pod resource requests and limits to further optimize our platform. A huge thanks to the entire Production Engineering team and our Internal Platform and Automations team for supporting the projects and improvements that made this possible!