Configurable Horizontal Pod Autoscaler

June 3, 2019



Postmates migrated from running its infrastructure on vanilla instances with autoscaling groups to Kubernetes a few years ago. This change brought a significant increase in developer velocity and simplified some of the common questions that exist with service architectures such as service discovery, autoscaling, load balancing, rollouts, and rollbacks.

But, like any open source offering, it didn't meet every requirement we had. In particular, Postmates hit an interesting roadblock while migrating workloads that needed to scale up very quickly: the dials and knobs available to tune that behavior are limited. Today, we're proud to announce that we're open sourcing our solution for making horizontal pod autoscaling more tunable. It is aptly named the Configurable Horizontal Pod Autoscaler (CHPA).
(https://github.com/postmates/configurable-hpa)

The Existing Solution (HPA)

The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics).

You can configure the following HPA parameters (a minimal example object is sketched after the list):

  • maxReplicas — the upper limit of replicas to which the autoscaler can scale up.
  • minReplicas — the lower limit of replicas.
  • metrics — the specification of the metrics used to calculate the desired replica count.
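
For example, with the official Go client libraries, a standard HPA targeting a Deployment can be constructed roughly like this (a minimal sketch assuming a Deployment named "web" and plain CPU utilization as the metric; custom metrics use the autoscaling/v2beta2 API instead):

import (
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newWebHPA sketches a standard HPA: keep the "web" Deployment between 2 and
// 20 replicas, targeting 80% average CPU utilization.
func newWebHPA() *autoscalingv1.HorizontalPodAutoscaler {
	minReplicas := int32(2)
	targetCPU := int32(80)
	return &autoscalingv1.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: autoscalingv1.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv1.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "web",
			},
			MinReplicas:                    &minReplicas,
			MaxReplicas:                    20,
			TargetCPUUtilizationPercentage: &targetCPU,
		},
	}
}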

The off-the-shelf HPA supports only one cluster-level configuration parameter that influences how fast workloads are scaled down: the downscale stabilization window, set on the kube-controller-manager via --horizontal-pod-autoscaler-downscale-stabilization (5 minutes by default).

And a couple of hard-coded constants that specify how fast they can scale up:
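
At the time of writing, the values baked into the upstream controller are approximately the following (a sketch of the upstream constants; they apply to every HPA in the cluster and cannot be tuned per object):

// Approximate constants from the upstream HPA controller: the next scale-up may
// at most double the current replica count, but may always reach at least 4 pods.
const (
	scaleUpLimitFactor  = 2.0
	scaleUpLimitMinimum = 4.0
)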

There’s no parameter that could be set per HPA to control the scale velocity of one particular cluster, which can be problematic for applications that need to scale faster or slower than others.

Different applications may have different business values, different logic, and may require different scaling behaviors. For example, at Postmates, we have at least three types of applications:

Applications that handle business-critical web traffic. These should scale up as fast as possible (false-positive signals to scale up are okay) and scale down very slowly (waiting for another traffic spike). If an incoming request is not processed within 1 minute, it times out.

Let’s have a look at the graph that illustrates this use case.

The red line on the chart is an ideal number of pods to handle the traffic (depicted by the blue line). It lags behind the traffic curve slightly as the HPA needs some time to gather metrics, process them, and start new pods.

Because it limits the maximum scale velocity, the current HPA algorithm (depicted by the green line) isn't able to scale that fast. As a result, a large portion of incoming requests during the first spike are answered with errors (due to timeouts), because we don't have enough resources to handle them.

Note that the HPA with a 5-minute Stabilization Window helped handle the second spike of requests, while the HPA with a 0-minute Stabilization Window (depicted by the yellow line) again led to errors during the next traffic spike.

Applications that process very important data events. These should scale up as fast as possible (to reduce the data processing time), and scale down as soon as possible (to reduce cost). False positive signals to scale up/down are ok. Data events are kept in the queue until they are processed.

The red line is again the ideal number of pods needed to handle the spike of events quickly.

Due to the scale velocity limitations in the current version of the HPA (the yellow and green lines), we aren't able to scale the cluster up fast enough. As a result, there is a delay of several minutes between the ideal HPA and the vanilla HPA.

Then, after the flow of events subsides, the cluster size should be decreased to reduce costs. Unlike the previous use case, the 5-minute Stabilization Window does not help us here. Rather, it prevents the cluster from shrinking and increases costs: we keep 79 pods for 5 minutes instead of reducing to 10 pods instantly. A 0-minute Stabilization Window fits this use case more appropriately.

Applications that process other data/web traffic. These are not as important and may scale up and down in the usual way to minimize jitter.

Fast Solution

A fast solution is to develop a CRD and a corresponding controller that mimics the vanilla HPA but is flexibly configurable.

The main advantage of this approach is that configuration is done per HPA object rather than cluster-wide.

The algorithm is quite simple and mirrors the vanilla HPA. The CHPA controller runs every 15 seconds, and on every iteration it performs the following steps (sketched in code after the list):

First, find all CHPA objects.

Second, for every CHPA object:

  • Find the corresponding Deployment object.
  • Check metrics for all the Containers for all the Pods of the Deployment object.
  • Calculate the desired number of Pods.
  • Adjust the number of Pods.
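
Sketched in Go with hypothetical, simplified types (the real controller works against the Kubernetes API; see the repository for the actual implementation), one iteration looks roughly like this:

// Hypothetical, simplified stand-ins for the real Kubernetes objects the
// controller works with.
type Deployment struct {
	Name     string
	Replicas int32
}

type CHPA struct {
	Target      *Deployment
	MinReplicas int32
	MaxReplicas int32
}

// reconcileOnce sketches a single controller iteration: for every CHPA object,
// compute the desired replica count from pod metrics (passed in here as a
// callback) and adjust the target Deployment, clamped to the CHPA's bounds.
func reconcileOnce(chpas []*CHPA, desiredFromMetrics func(*Deployment) int32) {
	for _, chpa := range chpas {
		desired := desiredFromMetrics(chpa.Target)
		if desired < chpa.MinReplicas {
			desired = chpa.MinReplicas
		}
		if desired > chpa.MaxReplicas {
			desired = chpa.MaxReplicas
		}
		chpa.Target.Replicas = desired
	}
}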

The difference lies in how the desired number of Pods is calculated, and here we have several new configuration parameters:

  • UpscaleForbiddenWindowSeconds and DownscaleForbiddenWindowSeconds — the window of time after the previous ScaleUp (or ScaleDown) event for a particular CHPA object during which we won't try to ScaleUp (or ScaleDown) again.
  • ScaleUpLimitFactor and ScaleUpLimitMinimum limit the number of replicas for the next ScaleUp event.

If the Pod metrics show that we should increase the number of replicas, the algorithm limits the increase to ScaleUpLimit.
ScaleUpLimit is the maximum of an absolute number (ScaleUpLimitMinimum) and the current replica count multiplied by a coefficient (ScaleUpLimitFactor):
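
In Go, the calculation looks roughly like this (a minimal sketch using the parameter names above, not the exact code from the repository):

// calculateScaleUpLimit caps how far a single scale-up may go: at least
// scaleUpLimitMinimum replicas, or scaleUpLimitFactor times the current count,
// whichever is larger.
func calculateScaleUpLimit(currentReplicas int32, scaleUpLimitFactor float64, scaleUpLimitMinimum int32) int32 {
	limit := int32(scaleUpLimitFactor * float64(currentReplicas))
	if limit < scaleUpLimitMinimum {
		limit = scaleUpLimitMinimum
	}
	return limit
}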

With these additional parameters, you can tune your autoscaler to fit your needs.

Let’s check how CHPA handles the same requests and event spikes as we had on the previous graphs.

Request spike with the new CHPA

The Ideal HPA line shows how the ideal algorithm would behave. You can't see it on the plot because the yellow line overlaps it at every point.

The fast CHPA has the following configuration:

ScaleUpLimitMinimum = 4
ScaleUpLimitFactor = 10
ScaleDownForbiddenWindowSeconds = 300

This lets us increase the cluster size by up to 10x on each CHPA controller cycle, which is enough to handle large spikes in traffic.

The green line has a slightly different configuration:

ScaleUpLimitMinimum = 4
ScaleUpLimitFactor = 10
ScaleDownForbiddenWindowSeconds = 300

The difference is in the size of the ScaleDownForbiddenWindowSeconds: we don't limit scaling up, but after each scale-down we wait another 5 minutes before allowing the next scale-down.

This prevents us from scaling down too fast and avoids cluster-size jitter.

Because the CHPA configuration can be defined separately for each deployment, you can configure critical services to scale down very slowly, while every other service can be configured to scale down instantly to reduce costs.

Data events spike with the new CHPA

The red line shows how the CHPA works with the following parameters:

ScaleUpLimitMinimum = 4
ScaleUpLimitFactor = 10

It scales the cluster up very fast to process all the data events without delay, and then scales it back down to reduce costs.

The yellow line demonstrates the same configuration as the vanilla HPA:

ScaleUpLimitMinimum = 4
ScaleUpLimitFactor = 2

You can see a delay of several minutes, compared to the new CHPA, before all the events in the spike can be processed.

Because the CHPA configuration can be defined separately for each deployment, it is up to you how to handle each particular spike of data events.

General Solution

We believe that this custom HPA solution fits our needs in the short term. In the long term, we would like to see configurable HPAs be a part of vanilla Kubernetes. So today, we’re also proud to announce our Kubernetes Enhancement Proposal for upstreaming these changes.

In the meantime, you can check out the proposed changes, and we'll have another blog post when it is ready!

Wrap Up

We have been using the CHPA in production for our critical workloads for several months now, and it has helped us withstand massive surges of traffic on business critical days when the whole country ordered pizzas.

Please feel free to give it a whirl and use the GitHub Issues section to ask questions or report issues, or follow the milestones to contribute!

If tackling engineering problems like this sounds interesting to you, check out our jobs page!

This post was written by Ivan Glushkov, staff engineer, Infrastructure. Ivan leads several of the infrastructure and platform initiatives at Postmates, and is responsible for building highly scalable and fault tolerant shared services in Python, Go, and Erlang.
