Author: JD Technology, Xu Xianzhang
1 What is Overcapacity Expansion
The overcapacity expansion function refers to pre-provisioning a certain number of worker nodes, so that during business peaks or when the overall cluster load is high, applications can complete horizontal scaling quickly without waiting for new cluster worker nodes to be added. HPA, ClusterAutoscaler, and overcapacity expansion are used together to meet the needs of business scenarios that are highly sensitive to load.
The overcapacity expansion function is realized through the joint action of Kubernetes application priority settings and ClusterAutoscaler. By maintaining a number of low-priority idle applications, the cluster's scheduled (requested) resources are kept at a high level. When high-priority applications are scaled out by HPA or by manually adjusting the number of replicas, idle applications are evicted to free up scheduling resources, so that the high-priority applications can be scheduled and created as soon as possible. Once the evicted idle applications return to a pending state, ClusterAutoscaler expands the cluster nodes, ensuring that enough evictable idle applications remain for the next round of high-priority scheduling.
The core of the overcapacity expansion function is OverprovisionAutoscaler (overcapacity expansion) and ClusterAutoscaler (cluster automatic scaling), both of which need their parameter configurations adjusted continuously to meet changing business needs.
The overcapacity expansion function lowers resource utilization to a certain extent and improves the stability of the cluster and its applications at the cost of additional spend; it should be adopted and configured reasonably according to the needs of the actual business scenario.
2 What situations require the use of Overprovision Scaling
When HPA and ClusterAutoscaler are enabled in the cluster, application scheduling usually takes 4-12 minutes, depending mainly on how long it takes to create a worker node and for the node to join the cluster and become Ready. The best-case and worst-case efficiency is analyzed below.
The best case scenario - 4 minutes
•30 seconds - Target metric value update (the update period is 30-60 seconds)
•30 seconds - HPA checks the metric value
•< 2 seconds - Pods enter pending state after creation
•< 2 seconds - CA sees the pending pods and calls the cloud provider to create a node
•3 minutes - the cloud provider creates the worker node, which then joins the cluster and becomes Ready
The worst case - 12 minutes
•60 seconds — Target metric value update
•30 seconds — HPA checks the metric value
•< 2 seconds — Pods enter pending state after creation
•< 2 seconds — CA sees the pending pods and calls the cloud provider to create a node
•10 minutes — the cloud provider creates the worker node, which then joins the cluster and becomes Ready
In both scenarios, the time spent creating worker nodes accounts for more than 75% of the total. If this time can be reduced or taken off the critical path entirely, application scaling becomes much faster, and with the overprovision feature the stability of the cluster and the business can be greatly enhanced. Overprovision scaling is mainly used in business scenarios that are highly sensitive to application load, such as:
1 Preparing for big sales events
2 Stream computing / real-time computing
3 DevOps systems
4 Other business scenarios with frequent scheduling
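The scaling flow analyzed above is driven by an ordinary HPA object on the business application. The following is a minimal sketch, assuming a Deployment named nginx-demo and CPU-based scaling with the standard autoscaling/v2 API; adjust the target, replica bounds, and metrics for your own workload.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-demo
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-demo        ## hypothetical business application
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60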
3 How to Enable Overprovision Scaling
The overprovision feature is based on ClusterAutoscaler and is realized together with OverprovisionAutoscaler. The following takes the JD Cloud public cloud Kubernetes container service as an example.
3.1 Enable ClusterAutoscaler
https://cns-console.jdcloud.com/host/nodeGroups/list
•Go to “kubernetes container service”->“work node group”
•Select the node group you need and click Enable Auto Scaling
•Set the node quantity range and click Confirm
3.2 Deploy OverprovisionAutoscaler
1 Deploy Controller and Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-autoscaler
  namespace: default
  labels:
    app: overprovisioning-autoscaler
    owner: cluster-autoscaler-overprovisioning
spec:
  selector:
    matchLabels:
      app: overprovisioning-autoscaler
      owner: cluster-autoscaler-overprovisioning
  replicas: 1
  template:
    metadata:
      labels:
        app: overprovisioning-autoscaler
        owner: cluster-autoscaler-overprovisioning
    spec:
      serviceAccountName: cluster-proportional-autoscaler
      containers:
        - image: jdcloud-cn-north-1.jcr.service.jdcloud.com/k8s/cluster-proportional-autoscaler:v1.16.3
          name: proportional-autoscaler
          command:
            - /autoscaler
            - --namespace=default
            ## Specify the name of the ConfigMap to use as needed:
            ## overprovisioning-autoscaler-ladder / overprovisioning-autoscaler-linear
            - --configmap=overprovisioning-autoscaler-{provision-mode}
            ## Warm-up (idle) application to scale, in the form (type)/(name);
            ## it must be in the namespace specified by --namespace above
            - --target=deployment/overprovisioning
            - --logtostderr=true
            - --v=2
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: host-time
              mountPath: /etc/localtime
      volumes:
        - name: host-time
          hostPath:
            path: /etc/localtime
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: cluster-proportional-autoscaler
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-proportional-autoscaler
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["replicationcontrollers/scale"]
    verbs: ["get", "update"]
  - apiGroups: ["extensions", "apps"]
    resources: ["deployments/scale", "replicasets/scale", "deployments", "replicasets"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-proportional-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-proportional-autoscaler
    namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-proportional-autoscaler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class used by overprovisioning."
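The --configmap value above is a placeholder. As a minimal sketch of the final arguments when using the linear mode described in section 4.2 (the ConfigMap and target names simply match the manifests in this article), the command section of the Deployment would read:
          command:
            - /autoscaler
            - --namespace=default
            - --configmap=overprovisioning-autoscaler-linear
            - --target=deployment/overprovisioning
            - --logtostderr=true
            - --v=2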
2 Deploy idle applications
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default
  labels:
    app: overprovisioning
    owner: cluster-autoscaler-overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
      owner: cluster-autoscaler-overprovisioning
  template:
    metadata:
      annotations:
        autoscaler.jke.jdcloud.com/overprovisioning: "reserve-pod"
      labels:
        app: overprovisioning
        owner: cluster-autoscaler-overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: reserve-resources
          image: jdcloud-cn-east-2.jcr.service.jdcloud.com/k8s/pause-amd64:3.1
          resources:
            requests:
              ## Set the number of replicas and the resources requested per replica
              ## according to the amount of capacity you expect to keep warm
              cpu: 7
          imagePullPolicy: IfNotPresent
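As a rough sizing sketch (the figures here are assumptions for illustration, not recommendations): if each worker node in the node group offers about 7 schedulable cores, one idle replica with cpu: 7 reserves roughly a full node of headroom, and 2 such replicas keep about two spare nodes' worth of capacity. A smaller per-replica request with proportionally more replicas reserves the same total capacity at a finer granularity, which lets OverprovisionAutoscaler follow its configured ratio more precisely, for example:
          resources:
            requests:
              ## finer-grained reservation; raise spec.replicas instead (e.g. 14 x 1 core)
              cpu: 1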
3.3 Verify that the overcapacity scaling feature is functioning normally
1 Verify Autoscaler
•Check if the autoscaler controller is Running
•Continuously create test applications, with the application resource requirements slightly less than the schedulable resources of a single node in the node group
•Observe the state of the cluster nodes: when insufficient resources leave pods in a pending state, check whether the autoscaler scales the node group up according to the preset parameters (scale-up wait time, cooldown, maximum number of nodes, etc.)
•With cluster auto scale-down enabled, delete the test applications and observe whether nodes are scaled down once the node resource requests drop below the scale-down utilization threshold
2 Verify OverprovisionAutoscaler
•Check if the OverprovisionAutoscaler controller is Running
•Continuously create test applications and check whether, as the cluster scales up, the number of idle application replicas changes according to the configuration
•When a business application is pending, check whether idle applications are evicted and the business application is then scheduled
4 Set OverprovisionAutoscaler and ClusterAutoscaler parameters
4.1 Configuration of ClusterAutoscaler
1 Description of CA parameters
Parameter name | Default value | Parameter description |
scan_interval | 20s | How often cluster is reevaluated for scale up or down |
max_nodes_total | 0 | Maximum number of nodes in all node groups |
estimator | binpacking | Type of resource estimator to be used in scale up. |
expander | least-waste | Type of node group expander to be used in scale up |
max_empty_bulk_delete | 15 | Maximum number of empty nodes that can be deleted at the same time |
max_graceful_termination_sec | 600 | Maximum number of seconds CA waits for pod termination when trying to scale down a node |
max_total_unready_percentage | 45 | Maximum percentage of unready nodes in the cluster. After this is exceeded, CA halts operations |
ok_total_unready_count | 100 | Number of allowed unready nodes, irrespective of max-total-unready-percentage |
max_node_provision_time | 900s | Maximum time CA waits for node to be provisioned |
scale_down_enabled | true | Should CA scale down the cluster |
scale_down_delay_after_add | 600s | How long after scale up that scale down evaluation resumes |
scale_down_delay_after_delete | 10s | How long after node deletion that scale down evaluation resumes, defaults to scanInterval |
scale_down_delay_after_failure | 180s | How long after scale down failure that scale down evaluation resumes |
scale_down_unneeded_time | 600s | How long a node should be unneeded before it is eligible for scale down |
scale_down_unready_time | 1200s | How long an unready node should be unneeded before it is eligible for scale down |
scale_down_utilization_threshold | 0.5 | Node utilization level, defined as the sum of requested resources divided by capacity, below which a node can be considered for scale down |
balance_similar_node_groups | false | Detect similar node groups and balance the number of nodes between them |
node_autoprovisioning_enabled | false | Should CA autoprovision node groups when needed |
max_autoprovisioned_node_group_count | 15 | The maximum number of autoprovisioned groups in the cluster |
skip_nodes_with_system_pods | true | If true, cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods) |
skip_nodes_with_local_storage | true | If true, cluster autoscaler will never delete nodes with pods having local storage, e.g., EmptyDir or HostPath |
2 Recommended Configuration
# Other settings remain default
scan_interval=10s
max_node_provision_time=180s
scale_down_delay_after_add=180s
scale_down_delay_after_delete=180s
scale_down_unneeded_time=300s
scale_down_utilization_threshold=0.4
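For reference, if you run cluster-autoscaler yourself rather than through the managed console, the recommended values above map to the standard cluster-autoscaler command-line flags. A sketch of the corresponding container args (on JD Cloud's managed service these values are normally adjusted through the platform instead):
          command:
            - ./cluster-autoscaler
            - --scan-interval=10s
            - --max-node-provision-time=180s
            - --scale-down-delay-after-add=180s
            - --scale-down-delay-after-delete=180s
            - --scale-down-unneeded-time=300s
            - --scale-down-utilization-threshold=0.4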
4.2 Configure OverprovisionAutoscaler
OverprovisionAutoscaler can be configured in two ways, linear configuration and step (ladder) configuration; only one of the two can be used at a time.
1 Linear configuration (linear)
Linear configuration achieves linear resource reservation by configuring the ratio of the total number of CPU cores, and of the number of nodes, to the number of idle application replicas. The number of idle replicas therefore always stays proportional to the total CPU amount and to the node count. The precision depends on the CPU request of each idle replica: the smaller the request value, the higher the precision. When the two ratios conflict, the larger of the resulting idle replica counts is used.
•min / max: the number of idle application replicas is kept within the [min, max] interval of the configuration
•preventSinglePointFailure: when true, only idle replicas in the Running state are counted toward the linear relationship; when false, replicas in the Failed/Running states are counted
•includeUnschedulableNodes: whether unschedulable nodes are taken into account
kind: ConfigMap
apiVersion: v1
metadata:
  name: overprovisioning-autoscaler-linear
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "min": 1,
      "max": 100,
      "includeUnschedulableNodes": false,
      "preventSinglePointFailure": true
    }
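As a worked example, under the usual cluster-proportional-autoscaler linear rule the replica count is max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)), clamped to [min, max]. With the ConfigMap above, a cluster with 10 nodes and 40 schedulable cores would keep max(ceil(40 / 2), ceil(10 / 1)) = max(20, 10) = 20 idle replicas.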
2 Step configuration (ladder)
Step configuration reserves resources in a stepped manner by configuring lookup tables that map the total number of CPU cores, and the number of nodes, to the number of idle application replicas. The number of idle replicas follows the step into which the current total CPU amount and node count fall; when the two lookups conflict, the larger of the resulting idle replica counts is used.
kind: ConfigMap
apiVersion: v1
metadata:
  name: overprovisioning-autoscaler-ladder
  namespace: default
data:
  ladder: |-
    {
      "coresToReplicas":
      [
        [ 1, 1 ],
        [ 50, 3 ],
        [ 200, 5 ],
        [ 500, 7 ]
      ],
      "nodesToReplicas":
      [
        [ 1, 1 ],
        [ 3, 4 ],
        [ 10, 5 ],
        [ 50, 20 ],
        [ 100, 120 ],
        [ 150, 120 ]
      ]
    }
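As a worked example of the step lookup (assuming, as in the upstream cluster-proportional-autoscaler ladder mode, that the largest threshold not exceeding the current value is used and the larger of the two results wins): with the ConfigMap above, a cluster with 100 schedulable cores and 8 nodes falls into the [50, 3] core step and the [3, 4] node step, so 4 idle replicas are kept.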
