Application of RALB load balancing algorithm


1. Background

The search and recommendation algorithm architecture provides services for all search and recommendation businesses of the JD Group and returns processed results to upstream callers in real time. The department's subsystems have already implemented CPU-based adaptive throttling, but calls from the Client side to the Server side still use RR round-robin, which ignores performance differences between downstream machines. As a result, the cluster's overall CPU cannot be fully utilized, and Server-side CPU usage is unbalanced.

The load balancing method that JD's advertising department developed for its own business scenario is very instructive. They proposed the RALB (Remote Aware Load Balance) algorithm, which improves the CPU resource efficiency of downstream service cluster machines, avoids the CPU bottleneck effect, and lets better-performing machines handle more traffic. We applied its core idea to our system and obtained good benefits.


The structure of this article is as follows:

1. Introduction to RALB

◦ A brief introduction to the principle of the algorithm.

2. Function Verification

◦ RALB load balancing is applied to the search and recommendation architecture system for functional verification.

3. Throughput Testing

◦ This section compares RALB against RR. Under both no throttling and full cluster throttling, the two show no significant throughput difference. While RR is only partially throttled, their throughput diverges, and there is a point of maximum difference. For RALB, the Server side jumps from no throttling directly to full throttling at a single turning point, with almost no partial-throttling phase.

4. Boundary Testing

◦ Various boundary conditions are simulated to verify the stability and reliability of RALB.

5. Function Launch

◦ RALB load balancing is fully enabled toward all Server-side clusters. Comparing before and after launch, per-machine QPS on the Server side gradually stratifies, while Server-side CPU gradually converges.

2. Introduction to RALB

RALB is a high-performance load balancing algorithm that aims to balance CPU usage across the Server-side cluster.

2.1 Algorithm Objective

1. Adjust the Server side's load so that CPU usage is relatively balanced across nodes, preventing any node's CPU from climbing high enough to trigger cluster flow control.

2. QPS and CPU usage are linearly related, so CPU balance can be achieved by adjusting the QPS sent to each node.

2.2 Algorithm Principle

2.2.1 Algorithm Steps

1. Traffic distribution: distribute requests according to node weights (weighted random, wr).

2. CPU usage collection: the Server side feeds its CPU usage (1s average) back to the Client side via RPC.

3. Weight adjustment: every 3s, adjust weights according to the windowed-average CPU usage of the cluster and of each node, so that CPU is balanced across nodes.
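A minimal sketch of step 1 in Python (the node IPs are taken from the test setup later in the article; the weight values are purely illustrative):

```python
import random

def pick_node(nodes, weights):
    # Weighted-random (wr) selection: a node's share of traffic is
    # proportional to its current weight
    return random.choices(nodes, weights=weights, k=1)[0]

# illustrative weights; in RALB they are recomputed every 3s from CPU feedback
nodes = ["11.17.80.238", "11.18.159.191", "11.17.191.137"]
weights = [1.0, 2.0, 1.0]

counts = {n: 0 for n in nodes}
for _ in range(10_000):
    counts[pick_node(nodes, weights)] += 1
# the middle node receives roughly half of the requests
```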

2.2.2 Indicator Dependency

| # | Indicator | Function | Source |
|---|-----------|----------|--------|
| 1 | IP | Available IP list | Maintained by the service registration/discovery and fault-shielding modules |
| 2 | Real-time health index | Reflects real-time changes in IP availability; provides boundary conditions for the algorithm | Maintained by the RPC framework's health-check function |
| 3 | Historical health index | Used to judge IP failure, recovery, and other boundary conditions | Historical values of indicator 2 |
| 4 | Dynamic target (CPU usage rate) | The most direct objective basis for the load balancing algorithm | Collected periodically on the Server side and returned via RPC |
| 5 | Weight | Basis for real-time load distribution | Updated by the algorithm |

2.2.3 Weight Adjustment Algorithm

[figure: weight adjustment formula]
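The exact update rule is given in the figure for this section. The sketch below is only one plausible rule consistent with the stated goal (scale each weight by the ratio of cluster-average CPU to node CPU), not the article's actual formula:

```python
def adjust_weights(weights, cpu_usage):
    # Hypothetical CPU-balancing update: nodes running below the cluster
    # average gain weight, nodes above it lose weight. Per boundary 1,
    # CPU == 0 samples ("unknown") must be filtered out before this step.
    avg = sum(cpu_usage) / len(cpu_usage)
    return [w * (avg / c) for w, c in zip(weights, cpu_usage)]

new = adjust_weights([1.0, 1.0, 1.0], [0.4, 0.5, 0.6])
# the least-loaded node (40% CPU) ends up with the largest weight
```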

2.2.4 Boundary Handling

Boundary 1: if a downstream IP receives no traffic within the feedback window (3s), its average CPU is 0, and the weighting algorithm would treat the node as performing extremely well and raise its weight.

[figure: boundary 1 handling]

Boundary 2: when a network failure occurs, the RPC framework marks the faulty node unavailable and sets its CPU and weight to 0. After the network recovers, the RPC framework marks the IP available again, but a node with weight 0 receives no traffic, so the node stays effectively unavailable.

Handling: weight updates are driven by a timer that records each node's availability state. When a node transitions from unavailable to available, it is given a low weight, which then recovers gradually.

[figure: boundary 2 handling]
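A sketch of the timer-driven fix, assuming a hypothetical recovery weight of 0.1 and doubling growth (the article only specifies "a low weight, recovered gradually"):

```python
RECOVERY_WEIGHT = 0.1  # assumed starting weight after recovery
MAX_WEIGHT = 1.0

def on_timer_update(node):
    # Runs on the periodic timer rather than on RPC responses, so a node
    # whose weight is 0 can still be re-admitted after it recovers
    if not node["available"]:
        node["weight"] = 0.0
    elif node["weight"] == 0.0:
        # transition unavailable -> available: start low, not zero
        node["weight"] = RECOVERY_WEIGHT
    else:
        # gradually restore the weight toward normal
        node["weight"] = min(node["weight"] * 2, MAX_WEIGHT)
    return node

node = {"available": True, "weight": 0.0}
for _ in range(5):
    on_timer_update(node)  # 0.1 -> 0.2 -> 0.4 -> 0.8 -> 1.0
```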

2.3 Key to Implementation

The algorithm must be both fast and stable: it must never get stuck or collapse in any situation, and boundary conditions in particular must be handled well.

Algorithm Key Points:

1. Keep each dependent factor in the formula independent in meaning and in update mechanism, to keep the algorithm reliable and simple.

◦ The IP list is maintained jointly by service registration/discovery and the RPC framework.

◦ CPU readings are updated via RPC.

2. Pay attention to the meaning of boundary values, and distinguish them from ordinary continuous values.

◦ CPU = 0 means "unknown"; it does not mean the node's CPU performance is good.

◦ w = 0 means the node receives no traffic; it should be 0 only when the node is unavailable. An available node should keep at least a small weight, so that RPC calls still occur and the weights keep being refreshed.

3. Weight updates should not rely on RPC triggers; they should run on a timer.
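The two boundary-value rules can be made concrete with a small sketch (the 0.01 floor is an assumed value; the article only requires "a smaller value"):

```python
def usable_cpu_samples(samples):
    # CPU == 0 means "no data in this window", not "idle":
    # such samples must be excluded from the average
    return [c for c in samples if c > 0]

def floor_weight(weight, available, min_weight=0.01):
    # weight 0 is reserved for unavailable nodes; an available node keeps
    # at least min_weight so RPC calls (and CPU feedback) keep flowing
    return 0.0 if not available else max(weight, min_weight)
```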

3. Function Verification

3.1 Load Testing Preparation

| Module | IP | CPU (cores) |
|--------|----|-------------|
| Client side | 10.173.102.36 | 8 |
| Server side | 11.17.80.238 | 8 |
| Server side | 11.18.159.191 | 8 |
| Server side | 11.17.191.137 | 8 |

3.2 Load Testing Data

[figure: load test monitoring data, no initial load]

Since the performance gap between the machines is small, the CPU effect of the load test is not obvious. To make it more visible, an initial load is applied to node 11.17.80.238 (i.e., with no traffic, its CPU utilization is 12.5%).

[figure: load test monitoring data with initial load on 11.17.80.238]

3.3 Load Testing Conclusion

After load testing, both the RR and LA modes show CPU imbalance: differences in machine performance create a weakest-link effect, so cluster resources cannot be fully utilized.

RALB takes CPU as the balancing target, so it will adjust the QPS that the node can handle in real-time according to the node's CPU, thereby achieving the target of CPU balance. Functionally, it has been verified to be available, and the CPU performance meets expectations.

4. Throughput Testing

4.1 Load Testing Objective

RALB is a load balancing algorithm that uses CPU utilization as its dynamic indicator. It effectively solves CPU imbalance, avoids the CPU weakest-link effect, and lets better-performing machines handle more traffic. We therefore expect the RALB strategy to deliver a certain throughput improvement over the RR round-robin strategy.

4.2 Load Testing Preparation

One hundred Server-side machines are used for the test. The Server side uses purely CPU-adaptive throttling, with the throttling threshold configured at 55%.
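The article states only that Server-side throttling is purely CPU-adaptive with a 55% threshold; the linear shedding ramp below is an assumption for illustration, not the actual implementation:

```python
THRESHOLD = 0.55  # throttling threshold from the test configuration

def shed_probability(cpu, threshold=THRESHOLD):
    # Fraction of incoming requests to reject: 0 below the threshold,
    # ramping (hypothetically, linearly) to 1.0 as CPU approaches 100%
    if cpu <= threshold:
        return 0.0
    return min((cpu - threshold) / (1.0 - threshold), 1.0)
```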

4.3 Pressure Test Data

Through pressure testing, the change trend of Server side throughput under the two load balancing modes of RALB and RR is compared, analyzing the impact of both load balancing strategies on cluster throughput.

4.3.1 RALB

4.3.1.1 Throughput Data

The table below shows Server-side throughput data collected from the pressure-test Client side, with the load balancing mode set to RALB. At 18:17 the Server side is close to the point where throttling just begins. The pressure test covers three cases: no throttling, partial throttling, and full throttling.

| Time | 17:40 | 17:45 | 17:52 | 18:17 | 18:22 |
|------|-------|-------|-------|-------|-------|
| Total traffic | 2270 | 1715 | 1152 | 1096 | 973 |
| Processed traffic | 982 | 1010 | 1049 | 1061 | 973 |
| Throttled traffic | 1288 | 705 | 103 | 35 | 0 |
| Throttling ratio | 56.74% | 41% | 8.9% | 3.2% | 0% |
| Average CPU usage | 55% | 55% | 54% | 54% | 49% |

4.3.1.2 Indicator Monitoring

The traffic received by the Server side machines is distributed according to performance, and the CPU remains balanced.

[figure: RALB mode, per-machine Server-side QPS and CPU monitoring]

4.3.2 RR

4.3.2.1 Throughput Data

The table below shows Server-side throughput data collected from the pressure-test Client side, with the load balancing mode set to RR. The Server side's total traffic at 18:46 is close to its total traffic at 18:17 under RALB; the comparison below focuses on these two moments.

| Time | 18:40 | 18:46 | 19:57 | 20:02 | 20:04 | 20:09 |
|------|-------|-------|-------|-------|-------|-------|
| Total traffic | 967 | 1082 | 1149 | 1172 | 1263 | 1314 |
| Processed traffic | 927 | 991 | 1024 | 1036 | 1048 | 1047 |
| Throttled traffic | 40 | 91 | 125 | 136 | 216 | 267 |
| Throttling ratio | 4.18% | 8.4% | 10.92% | 11.6% | 17.1% | 20.32% |
| Average CPU usage | 45% (partial throttling) | 51% (partial throttling) | 53% (partial throttling) | 54% (close to full throttling) | 55% (full throttling) | 55% (full throttling) |

4.3.2.2 Indicator Monitoring

The traffic received by the Server side is balanced, but there is a difference in CPU.

[figure: RR mode, per-machine Server-side QPS and CPU monitoring]

4.4 Pressure Test Analysis

4.4.1 Throughput Curve

Based on the pressure test data in Section 4.3, the throughput curve of the Server side is plotted, and the change trend of throughput under the two load balancing modes of RALB and RR is compared.

import matplotlib.pyplot as plt

# Received QPS (x) vs processed QPS (y), both in hundreds, taken from
# the pressure-test tables in Section 4.3. Both curves follow y = x
# while no machine is throttled.
ralb_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.958, 11.52, 17.15, 22.7]
ralb_y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.61, 10.49, 10.10, 9.82]

rr_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.674, 10.823, 11.496, 11.723, 12.639, 13.141, 17.15, 22.7]
rr_y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.27, 9.91, 10.24, 10.36, 10.48, 10.47, 10.10, 9.82]

plt.plot(ralb_x, ralb_y, 'r-o', label='RALB')
plt.plot(rr_x, rr_y, 'g-o', label='RR')
plt.xlabel('Received QPS (x100)')
plt.ylabel('Processed QPS (x100)')
plt.legend()
plt.show()






[figure: throughput curves, RALB (red) vs RR (green)]

4.4.2 Curve Analysis

Stage 1: no machine throttled

◦ RALB: received QPS = processed QPS; the curve follows the line y = x.

◦ RR: same behavior, y = x.

Stage 2: partial throttling

◦ RALB: traffic is allocated according to downstream CPU, and the downstream throttles by CPU, so in theory all downstream CPUs stay consistent and every machine reaches the throttling threshold at the same time; a state where only some machines are throttled essentially does not occur. The curve therefore has a sharp turning point from no throttling to full throttling, without a smooth transition stage.

◦ RR: every downstream machine is allocated the same QPS, but since throttling is CPU-based, different machines begin throttling at different times. Compared with RALB, RR starts throttling earlier, and until full throttling is reached its throughput is always lower than RALB's.

Stage 3: all machines throttled

◦ RALB: once every machine reaches the 55% throttling threshold, processed QPS should in theory stay flat no matter how much more traffic arrives. The figure shows a slight decline, because handling throttled requests itself consumes some CPU.

◦ RR: reaches full throttling later than RALB; after full throttling, the processed QPS of the two modes is the same.

4.5 Pressure Test Conclusion

Critical point: the point of maximum throughput difference, i.e., the turning point at which RALB goes from no throttling to full throttling.

The analysis above shows that the throughput gap between RR and RALB is largest at this critical point.

At this time, the calculation shows that the throughput of the Server cluster is increased by 7.06% under the RALB mode.
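The 7.06% figure can be reproduced from the throughput tables in Section 4.3, comparing the two moments with near-equal total traffic (RALB at 18:17, RR at 18:46):

```python
ralb_processed = 1061  # RALB at 18:17, total traffic 1096
rr_processed = 991     # RR at 18:46, total traffic 1082

gain = (ralb_processed - rr_processed) / rr_processed
print(f"{gain:.2%}")  # → 7.06%
```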

5. Boundary Testing

Various boundary conditions are simulated to verify the stability of the system under them.

| Boundary Condition | Test Scenario | Conclusion |
|--------------------|---------------|------------|
| Downstream node throttling | CPU throttling | Adjustment of the penalty factor has a significant impact on traffic distribution |
| Downstream node throttling | QPS throttling | Meets expectations |
| Downstream node timeout | Server side sleeps a fixed 1s per request | Traffic allocated to the node during continuous timeouts is essentially 0 |
| Abnormal exit of downstream node | kill -9 the Server-side process | The process restarts automatically and traffic distribution recovers quickly |
| Addition/removal of downstream nodes | Manual jsf online/offline on the Server side | No traffic is handled during the jsf offline period |
| Server-side restart (stop + start) | Normal unregister and register of the Server-side process | Traffic distribution meets expectations |

6. Function Launch

The Client side in the Suqian data center was configured for launch, with RALB load balancing fully enabled toward all Server-side clusters. Comparing before and after launch, per-machine QPS on the Server side gradually stratifies, while Server-side CPU gradually converges.

[figure: Server-side QPS and CPU before and after launch]

