Author: JD Technology, Xu Xianzhang
1 What is Overcapacity Expansion
The overcapacity expansion function refers to pre-provisioning a certain number of worker nodes, so that during business peaks or when the overall cluster load is high, applications can complete horizontal scaling quickly without waiting for new cluster worker nodes to be added. HPA, ClusterAutoscaler, and overcapacity expansion are used together to meet the needs of business scenarios that are highly sensitive to load.
The overcapacity expansion function is realized through the joint action of Kubernetes application priority settings and ClusterAutoscaler. By maintaining a number of low-priority idle applications, the cluster's scheduled (requested) resources are kept at a high level. When high-priority applications are scaled out by HPA or by manually adjusting the number of replicas, idle applications are evicted to free up scheduling resources, so that the high-priority applications can be scheduled and created as soon as possible. Once the evicted idle applications return to a pending state, ClusterAutoscaler expands the cluster nodes, ensuring that enough evictable idle applications remain for the next round of high-priority scheduling.
The core of the overcapacity expansion function is OverprovisionAutoscaler (overcapacity expansion) and ClusterAutoscaler (cluster automatic scaling), both of which need their parameter configurations adjusted continuously to meet changing business needs.
The overcapacity expansion function lowers resource utilization to a certain extent and improves the stability of the cluster and its applications at the cost of additional spend; it should be adopted and configured reasonably according to the needs of the actual business scenario.
2 What situations require the use of Overprovision Scaling
When HPA and ClusterAutoscaler are enabled in the cluster, application scheduling usually takes 4-12 minutes, depending mainly on how long it takes to create a worker node and for the node to join the cluster and become Ready. The best-case and worst-case efficiency is analyzed below.
The best case scenario - 4 minutes
•30 seconds - Target metric value update (the update period is 30-60 seconds)
•30 seconds - HPA checks the metric value
•< 2 seconds - Pods enter pending state after creation
•< 2 seconds - CA sees the pending pods and calls the cloud provider to create a node
•3 minutes - the cloud provider creates the worker node, which then joins the cluster and becomes Ready
The worst case - 12 minutes
•60 seconds — Target metric value update
•30 seconds — HPA checks the metric value
•< 2 seconds — Pods enter pending state after creation
•< 2 seconds — CA sees the pending pods and calls the cloud provider to create a node
•10 minutes — the cloud provider creates the worker node, which then joins the cluster and becomes Ready
In both scenarios, the time spent creating worker nodes accounts for more than 75% of the total. If this time can be reduced or taken off the critical path entirely, application scaling becomes much faster, and with the overprovision feature the stability of the cluster and the business can be greatly enhanced. Overprovision scaling is mainly used in business scenarios that are highly sensitive to application load, such as:
1 Preparing for big sales events
2 Stream computing / real-time computing
3 DevOps systems
4 Other business scenarios with frequent scheduling
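The scaling flow analyzed above is driven by an ordinary HPA object on the business application. The following is a minimal sketch, assuming a Deployment named nginx-demo and CPU-based scaling with the standard autoscaling/v2 API; adjust the target, replica bounds, and metrics for your own workload.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-demo
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-demo        ## hypothetical business application
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60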
3 How to Enable Overprovision Scaling
The overprovision feature is based on ClusterAutoscaler and is realized together with OverprovisionAutoscaler. The following takes the JD Cloud public cloud Kubernetes container service as an example.
3.1 Enable ClusterAutoscaler
https://cns-console.jdcloud.com/host/nodeGroups/list
•Go to “kubernetes container service”->“work node group”
•Select the node group you need and click Enable Auto Scaling
•Set the node quantity range and click Confirm
3.2 Deploy OverprovisionAutoscaler
1 Deploy Controller and Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-autoscaler
  namespace: default
  labels:
    app: overprovisioning-autoscaler
    owner: cluster-autoscaler-overprovisioning
spec:
  selector:
    matchLabels:
      app: overprovisioning-autoscaler
      owner: cluster-autoscaler-overprovisioning
  replicas: 1
  template:
    metadata:
      labels:
        app: overprovisioning-autoscaler
        owner: cluster-autoscaler-overprovisioning
    spec:
      serviceAccountName: cluster-proportional-autoscaler
      containers:
        - image: jdcloud-cn-north-1.jcr.service.jdcloud.com/k8s/cluster-proportional-autoscaler:v1.16.3
          name: proportional-autoscaler
          command:
            - /autoscaler
            - --namespace=default
            ## Specify the name of the ConfigMap to use as needed:
            ## overprovisioning-autoscaler-ladder / overprovisioning-autoscaler-linear
            - --configmap=overprovisioning-autoscaler-{provision-mode}
            ## Warm-up (idle) application to scale, in the form (type)/(name);
            ## it must be in the namespace specified by --namespace above
            - --target=deployment/overprovisioning
            - --logtostderr=true
            - --v=2
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: host-time
              mountPath: /etc/localtime
      volumes:
        - name: host-time
          hostPath:
            path: /etc/localtime
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: cluster-proportional-autoscaler
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-proportional-autoscaler
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["replicationcontrollers/scale"]
    verbs: ["get", "update"]
  - apiGroups: ["extensions", "apps"]
    resources: ["deployments/scale", "replicasets/scale", "deployments", "replicasets"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-proportional-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-proportional-autoscaler
    namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-proportional-autoscaler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class used by overprovisioning."
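The --configmap value above is a placeholder. As a minimal sketch of the final arguments when using the linear mode described in section 4.2 (the ConfigMap and target names simply match the manifests in this article), the command section of the Deployment would read:
          command:
            - /autoscaler
            - --namespace=default
            - --configmap=overprovisioning-autoscaler-linear
            - --target=deployment/overprovisioning
            - --logtostderr=true
            - --v=2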
2 Deploy idle applications
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default
  labels:
    app: overprovisioning
    owner: cluster-autoscaler-overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
      owner: cluster-autoscaler-overprovisioning
  template:
    metadata:
      annotations:
        autoscaler.jke.jdcloud.com/overprovisioning: "reserve-pod"
      labels:
        app: overprovisioning
        owner: cluster-autoscaler-overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: reserve-resources
          image: jdcloud-cn-east-2.jcr.service.jdcloud.com/k8s/pause-amd64:3.1
          resources:
            requests:
              ## Set the number of replicas and the resources requested per replica
              ## according to the amount of capacity you expect to keep warm
              cpu: 7
          imagePullPolicy: IfNotPresent
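As a rough sizing sketch (the figures here are assumptions for illustration, not recommendations): if each worker node in the node group offers about 7 schedulable cores, one idle replica with cpu: 7 reserves roughly a full node of headroom, and 2 such replicas keep about two spare nodes' worth of capacity. A smaller per-replica request with proportionally more replicas reserves the same total capacity at a finer granularity, which lets OverprovisionAutoscaler follow its configured ratio more precisely, for example:
          resources:
            requests:
              ## finer-grained reservation; raise spec.replicas instead (e.g. 14 x 1 core)
              cpu: 1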
3.3 Verify that the overcapacity scaling feature is functioning normally
1 Verify Autoscaler
•Check if the autoscaler controller is Running
•Continuously create test applications, with the application resource requirements slightly less than the schedulable resources of a single node in the node group
•Observe the state of the cluster nodes: when insufficient resources leave pods in a pending state, check whether the autoscaler scales the node group up according to the preset parameters (scale-up wait time, cooldown, maximum number of nodes, etc.)
•With cluster auto scale-down enabled, delete the test applications and observe whether nodes are scaled down once the node resource requests drop below the scale-down utilization threshold
2 Verify OverprovisionAutoscaler
•Check if the OverprovisionAutoscaler controller is Running
•Continuously create test applications and check whether, as the cluster scales up, the number of idle application replicas changes according to the configuration
•When a business application is pending, check whether idle applications are evicted and the business application is then scheduled
4 Set OverprovisionAutoscaler and ClusterAutoscaler parameters
4.1 Configuration of ClusterAutoscaler
1 Description of CA parameters
Parameter name | Default value | Parameter description |
scan_interval | 20s | How often cluster is reevaluated for scale up or down |
max_nodes_total | 0 | Maximum number of nodes in all node groups |
estimator | binpacking | Type of resource estimator to be used in scale up. |
expander | least-waste | Type of node group expander to be used in scale up |
max_empty_bulk_delete | 15 | Maximum number of empty nodes that can be deleted at the same time |
max_graceful_termination_sec | 600 | Maximum number of seconds CA waits for pod termination when trying to scale down a node |
max_total_unready_percentage | 45 | Maximum percentage of unready nodes in the cluster. After this is exceeded, CA halts operations |
ok_total_unready_count | 100 | Number of allowed unready nodes, irrespective of max-total-unready-percentage |
max_node_provision_time | 900s | Maximum time CA waits for node to be provisioned |
scale_down_enabled | true | Should CA scale down the cluster |
scale_down_delay_after_add | 600s | How long after scale up that scale down evaluation resumes |
scale_down_delay_after_delete | 10s | How long after node deletion that scale down evaluation resumes, defaults to scanInterval |
scale_down_delay_after_failure | 180s | How long after scale down failure that scale down evaluation resumes |
scale_down_unneeded_time | 600s | How long a node should be unneeded before it is eligible for scale down |
scale_down_unready_time | 1200s | How long an unready node should be unneeded before it is eligible for scale down |
scale_down_utilization_threshold | 0.5 | Node utilization level, defined as the sum of requested resources divided by capacity, below which a node can be considered for scale down |
balance_similar_node_groups | false | Detect similar node groups and balance the number of nodes between them |
node_autoprovisioning_enabled | false | Should CA autoprovision node groups when needed |
max_autoprovisioned_node_group_count | 15 | The maximum number of autoprovisioned groups in the cluster |
skip_nodes_with_system_pods | true | If true, cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods) |
skip_nodes_with_local_storage | true | If true, cluster autoscaler will never delete nodes with pods having local storage, e.g., EmptyDir or HostPath |
2 Recommended Configuration
# Other settings remain default
scan_interval=10s
max_node_provision_time=180s
scale_down_delay_after_add=180s
scale_down_delay_after_delete=180s
scale_down_unneeded_time=300s
scale_down_utilization_threshold=0.4
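For reference, if you run cluster-autoscaler yourself rather than through the managed console, the recommended values above map to the standard cluster-autoscaler command-line flags. A sketch of the corresponding container args (on JD Cloud's managed service these values are normally adjusted through the platform instead):
          command:
            - ./cluster-autoscaler
            - --scan-interval=10s
            - --max-node-provision-time=180s
            - --scale-down-delay-after-add=180s
            - --scale-down-delay-after-delete=180s
            - --scale-down-unneeded-time=300s
            - --scale-down-utilization-threshold=0.4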
4.2 Configure OverprovisionAutoscaler
OverprovisionAutoscaler can be configured in two ways, linear configuration and step (ladder) configuration; only one of the two can be used at a time.
1 Linear configuration (linear)
Linear configuration achieves linear resource reservation by configuring the ratio of the total number of CPU cores, and of the number of nodes, to the number of idle application replicas. The number of idle replicas therefore always stays proportional to the total CPU amount and to the node count. The precision depends on the CPU request of each idle replica: the smaller the request value, the higher the precision. When the two ratios conflict, the larger of the resulting idle replica counts is used.
•min / max: the number of idle application replicas is kept within the [min, max] interval of the configuration
•preventSinglePointFailure: when true, only idle replicas in the Running state are counted toward the linear relationship; when false, replicas in the Failed/Running states are counted
•includeUnschedulableNodes: whether unschedulable nodes are taken into account
kind: ConfigMap
apiVersion: v1
metadata:
  name: overprovisioning-autoscaler-linear
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "min": 1,
      "max": 100,
      "includeUnschedulableNodes": false,
      "preventSinglePointFailure": true
    }
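As a worked example, under the usual cluster-proportional-autoscaler linear rule the replica count is max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)), clamped to [min, max]. With the ConfigMap above, a cluster with 10 nodes and 40 schedulable cores would keep max(ceil(40 / 2), ceil(10 / 1)) = max(20, 10) = 20 idle replicas.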
2 Step configuration (ladder)
Step configuration reserves resources in a stepped manner by configuring lookup tables that map the total number of CPU cores, and the number of nodes, to the number of idle application replicas. The number of idle replicas follows the step into which the current total CPU amount and node count fall; when the two lookups conflict, the larger of the resulting idle replica counts is used.
kind: ConfigMap
apiVersion: v1
metadata:
  name: overprovisioning-autoscaler-ladder
  namespace: default
data:
  ladder: |-
    {
      "coresToReplicas":
      [
        [ 1, 1 ],
        [ 50, 3 ],
        [ 200, 5 ],
        [ 500, 7 ]
      ],
      "nodesToReplicas":
      [
        [ 1, 1 ],
        [ 3, 4 ],
        [ 10, 5 ],
        [ 50, 20 ],
        [ 100, 120 ],
        [ 150, 120 ]
      ]
    }
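As a worked example of the step lookup (assuming, as in the upstream cluster-proportional-autoscaler ladder mode, that the largest threshold not exceeding the current value is used and the larger of the two results wins): with the ConfigMap above, a cluster with 100 schedulable cores and 8 nodes falls into the [50, 3] core step and the [3, 4] node step, so 4 idle replicas are kept.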
