1. Background
The search and recommendation architecture serves all search and recommendation businesses of the JD Group, returning processed results to upstream callers in real time. The department's subsystems already implement CPU-based adaptive throttling, but calls from the Client side to the Server side still use round-robin (RR) polling, which ignores performance differences among downstream machines. As a result, the cluster's overall CPU cannot be fully utilized, and CPU usage across Server-side machines is unbalanced.
The load balancing method developed by the JD advertising department for its own business scenario is instructive here. They proposed the RALB (Remote Aware Load Balance) algorithm, which improves the CPU resource efficiency of downstream service clusters, avoids the CPU weakest-link effect, and lets better-performing machines handle more traffic. We applied its core idea to our system and obtained good results.

The structure of this article is as follows:
1. Introduction to RALB
◦ A brief introduction to the principle of the algorithm.
2. Function Verification
◦ The RALB load balancing technique is applied to the search and recommendation architecture system for functional verification.
3. Throughput Testing
◦ This section compares the RALB and RR load balancing techniques. It verifies that their throughput shows no significant difference when the cluster is either not throttled at all or fully throttled. While RR is only partially throttled, the two differ in throughput, with a point of maximum difference. Under RALB, the Server side passes from no throttling to full throttling at a single turning point, with almost no partial-throttling stage.
4. Boundary Testing
◦ Various boundary conditions are simulated to verify the stability and reliability of RALB.
5. Function Launch
◦ The RALB load balancing mode is fully enabled on all Server-side clusters. Before and after launch, the Server-side QPS gradually becomes stratified while the Server-side CPU usage gradually converges.
2. Introduction to RALB
RALB is a high-performance load balancing algorithm that takes balanced CPU usage as its goal.
2.1 Algorithm Objective
1. Adjust Server-side CPU usage so that CPU is relatively balanced across nodes, preventing excessively high CPU usage from triggering cluster flow control
2. QPS and CPU usage are linearly related, so adjusting a node's QPS adjusts its CPU usage; for example, under this linear relation, a node running at 60% CPU while the cluster averages 50% should have its share of traffic cut by roughly one sixth
2.2 Algorithm Principle
2.2.1 Algorithm Steps
1. Traffic distribution: requests are distributed according to node weights (weighted random, wr); see the sketch below
2. CPU collection: the Server side feeds its CPU usage (1 s average) back to the Client side through RPC responses
3. Weight adjustment: every 3 s, each node's weight is adjusted according to the windowed average CPU of the cluster and of each node, so that node CPUs stay balanced
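A minimal weighted-random selection sketch in Python for step 1; the node IPs, weights, and the pick_node helper are illustrative assumptions, not the article's implementation:

import random

def pick_node(nodes, weights):
    # Weighted random (wr): a node's chance of being picked is
    # proportional to its current weight.
    return random.choices(nodes, weights=weights, k=1)[0]

# Example: three downstream nodes with unequal weights.
nodes = ["11.17.80.238", "11.18.159.191", "11.17.191.137"]
weights = [0.2, 0.4, 0.4]
print(pick_node(nodes, weights))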
2.2.2 Indicator Dependency
Number | Indicator | Function | Source |
---|---|---|---|
1 | IP | Available IP list | Maintained by the service registration/discovery and fault-shielding modules |
2 | Real-time health index | Reflects real-time changes in IP availability and provides boundary conditions for the algorithm | Maintained by the RPC framework's health-check function |
3 | Historical health index | Used to judge boundary conditions such as IP failure and recovery | Historical values of indicator 2 |
4 | Dynamic target (CPU usage) | Provides the most direct basis for the load balancing objective | Collected periodically on the Server side and fed back through RPC |
5 | Weight | Real-time basis for traffic distribution | Updated by the algorithm |
2.2.3 Weight Adjustment Algorithm
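One plausible form of the weight-adjustment rule, consistent with the steps in Section 2.2.1 (an illustrative assumption, not necessarily the article's exact formula), scales each node's weight by the ratio of the cluster-average CPU to the node's own CPU over the feedback window:

w_i(t+1) = clamp(w_i(t) × C_avg / C_i, w_min, w_max), normalized so that Σ w_i = 1

where C_i is node i's windowed average CPU and C_avg is the cluster average: a node running hotter than average loses weight, and a cooler one gains it.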
2.2.4 Boundary Handling
Boundary 1: if a downstream IP receives no requests within the feedback window (3 s), its average CPU reading is 0, and the weighting algorithm would mistake the node for an extremely well-performing one and raise its weight
Boundary 2: when a network failure occurs, the RPC framework marks the faulty node unavailable and sets its CPU and weight to 0; after the network recovers, the framework marks the IP available again, but a node with weight 0 receives no traffic, so it would stay effectively unavailable
Handling: weight updates are driven by a timer; the updater records each node's availability status, assigns a low weight when a node transitions from unavailable back to available, and lets the weight recover gradually
2.3 Key to Implementation
The implementation must be both fast and stable, never getting stuck or collapsing in any situation; boundary conditions in particular must be handled well.
Algorithm Key Points:
1. Keep the meaning and update mechanism of each factor in the formula independent, to keep the algorithm reliable and simple
◦ The IP list is kept up to date jointly by service registration/discovery and the RPC framework
◦ CPU values are updated through RPC responses
2. Pay attention to the meaning of boundary values and distinguish them from ordinary continuous values
◦ CPU = 0 means "unknown"; it does not mean the node's CPU performance is good
◦ w = 0 means the node receives no traffic and should occur only when the node is unavailable; an available node must keep at least a small positive weight so that RPC calls can still reach it and its weight can be updated
3. Weight updates must not rely on RPC triggers; they should run on a regular timer, as in the sketch below
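The points above can be combined into a minimal timer-driven weight-update sketch in Python. The Node structure, the constants, and the cluster-average/node-CPU scaling rule are illustrative assumptions based on Sections 2.2.1-2.2.4, not the article's implementation:

import threading
from dataclasses import dataclass

@dataclass
class Node:
    ip: str
    healthy: bool = True
    cpu: float = 0.0          # 0 means "unknown", per key point 2
    weight: float = 0.0
    was_down: bool = False    # availability recorded at the previous tick

W_MIN, W_RECOVER = 0.01, 0.05  # assumed floor and recovery weights

def update_weights(nodes, interval=3.0):
    # Timer-driven update (key point 3): runs every 3 s regardless of RPC traffic.
    known = [n for n in nodes if n.healthy and n.cpu > 0]
    cpu_avg = sum(n.cpu for n in known) / len(known) if known else 0.0
    for n in nodes:
        if not n.healthy:
            n.weight = 0.0                    # w = 0 only when unavailable
        elif n.was_down:
            n.weight = W_RECOVER              # boundary 2: recovered node starts low
        elif n.cpu <= 0:
            n.weight = max(n.weight, W_MIN)   # boundary 1: unknown CPU is not "good" CPU
        else:
            # Assumed rule from 2.2.3: scale weight by cluster-average CPU / node CPU,
            # keeping at least the floor while the node is available.
            n.weight = max(W_MIN, n.weight * cpu_avg / n.cpu)
        n.was_down = not n.healthy
    total = sum(n.weight for n in nodes)
    if total > 0:
        for n in nodes:
            n.weight /= total                 # normalize for weighted-random selection
    threading.Timer(interval, update_weights, args=(nodes, interval)).start()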
3. Function Verification
3.1 Load Testing Preparation
Module | IP | CPU Cores |
---|---|---|
Client Side | 10.173.102.36 | 8 |
Server Side | 11.17.80.238 | 8 |
Server Side | 11.18.159.191 | 8 |
Server Side | 11.17.191.137 | 8 |
3.2 Load Testing Data
Since the performance gap between the machines is small, the CPU effect of the load test is not obvious. To make it more visible, an initial load is applied to node 11.17.80.238 (i.e., even with no traffic, its CPU utilization is 12.5%).
3.3 Load Testing Conclusion
Load testing shows that both RR and LA suffer from CPU imbalance: differences in machine resources create a weakest-link effect, so resources cannot be fully utilized.
RALB takes CPU as its balancing target, adjusting in real time the QPS each node handles according to that node's CPU, and thereby balancing CPU across nodes. The function is verified to work, and the CPU behavior meets expectations.
4. Throughput Testing
4.1 Load Testing Objective
RALB is a load balancing algorithm that uses CPU utilization as its dynamic indicator. It effectively solves CPU imbalance, avoids the CPU weakest-link effect, and lets better-performing machines handle more traffic. We therefore expect the RALB strategy to deliver some throughput improvement over the RR round-robin strategy.
4.2 Load Testing Preparation
The test uses 100 Server-side machines. The Server side applies pure CPU-based adaptive throttling, with the throttling threshold configured at 55% CPU utilization.
4.3 Load Testing Data
The load test compares the trend of Server-side throughput under the RALB and RR load balancing modes and analyzes the impact of each strategy on cluster throughput.
4.3.1 RALB
4.3.1.1 Throughput Data
The table below shows Server-side throughput data under the RALB load balancing mode, collected from the load-testing Client side. At 18:17 the Server side is just at the edge of throttling. Across the whole test, three cases were covered: no throttling, partial throttling, and full throttling.
Time | 17:40 | 17:45 | 17:52 | 18:17 | 18:22 |
---|---|---|---|---|---|
Total Traffic | 2270 | 1715 | 1152 | 1096 | 973 |
Processed Traffic | 982 | 1010 | 1049 | 1061 | 973 |
Throttled Traffic | 1288 | 705 | 103 | 35 | 0 |
Throttling Ratio | 56.74% | 41% | 8.9% | 3.2% | 0% |
Average CPU Usage | 55% | 55% | 54% | 54% | 49% |
4.3.1.2 Indicator Monitoring
The traffic received by the Server side machines is distributed according to performance, and the CPU remains balanced.
4.3.2 RR
4.3.2.1 Throughput Data
The table below shows Server-side throughput data under the RR load balancing mode, collected from the load-testing Client side. The total traffic at 18:46 is close to the total traffic at 18:17 in the RALB test; these two moments are compared in detail below.
Time | 18:40 | 18:46 | 19:57 | 20:02 | 20:04 | 20:09 |
---|---|---|---|---|---|---|
Total Traffic | 967 | 1082 | 1149 | 1172 | 1263 | 1314 |
Processed Traffic | 927 | 991 | 1024 | 1036 | 1048 | 1047 |
Throttled Traffic | 40 | 91 | 125 | 136 | 216 | 267 |
Throttling Ratio | 4.18% | 8.4% | 10.92% | 11.6% | 17.1% | 20.32% |
Average CPU Usage | 45% (partial throttling) | 51% (partial throttling) | 53% (partial throttling) | 54% (close to full throttling) | 55% (full throttling) | 55% (full throttling) |
4.3.2.2 Indicator Monitoring
The traffic received by the Server side is balanced, but there is a difference in CPU.
4.4 Load Testing Analysis
4.4.1 Throughput Curve
Based on the load testing data in Section 4.3, the Server-side throughput curves are plotted below to compare the throughput trends under the RALB and RR load balancing modes.
import matplotlib.pyplot as plt

# Throughput curves from the tables in Section 4.3, in units of 100 QPS
# (e.g. 22.7 corresponds to total traffic 2270 at 17:40). The leading integer
# points fill in the no-throttling stage, where processed QPS equals received QPS.
ralb_total = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.958, 11.52, 17.15, 22.7]
ralb_processed = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.61, 10.49, 10.10, 9.82]
rr_total = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.674, 10.823, 11.496, 11.723, 12.639, 13.141, 17.15, 22.7]
rr_processed = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.27, 9.91, 10.24, 10.36, 10.48, 10.47, 10.10, 9.82]

plt.plot(ralb_total, ralb_processed, 'r-o', label='RALB')
plt.plot(rr_total, rr_processed, 'g-o', label='RR')
plt.xlabel('Received QPS (x100)')
plt.ylabel('Processed QPS (x100)')
plt.legend()
plt.show()
4.4.2 Curve Analysis
Load Balancing Strategy | RALB | RR |
---|---|---|
Stage 1: All Machines Not Throttled | Received QPS = Processed QPS, represented by the straight line y = x | Received QPS = Processed QPS, represented by the straight line y = x |
Stage 2: Partial Machine Throttling | RALB allocates traffic according to downstream CPU, and the downstream throttles according to CPU, so in theory the downstream CPUs stay consistent and all machines reach the throttling threshold at the same time; a state where only some machines are throttled barely occurs. In the figure, the transition from no throttling to full throttling is therefore a single turning point with no smooth transition stage. | Under RR, the QPS allocated to each downstream machine is the same, but since machines throttle according to their own CPU, they start throttling at different times. RR therefore begins throttling earlier than RALB, and until full throttling is reached its throughput is always lower than RALB's. |
Stage 3: Full Machine Throttling | After all machines reach the 55% throttling threshold, the processed QPS should in theory stay constant no matter how much more traffic arrives. The figure shows a slight decline in processed QPS, because rejecting throttled requests also consumes some CPU. | RR reaches full throttling later than RALB; once both are fully throttled, their processed QPS is the same. |
4.5 Load Testing Conclusion
Critical point: the point of maximum throughput difference, i.e., the turning point at which RALB goes from no throttling to full throttling.
The analysis above shows that the throughput difference between RR and RALB is largest at this critical point.
At this point, calculation shows that RALB improves Server cluster throughput by 7.06%.
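The 7.06% figure follows from the two tables above: at RALB's turning point (18:17) the cluster's processed traffic is 1061, while RR at comparable total traffic (18:46) processes 991. A quick check in Python:

# Processed traffic at the two comparable moments (Sections 4.3.1 and 4.3.2).
ralb, rr = 1061, 991
print(f"RALB throughput gain over RR: {ralb / rr - 1:.2%}")  # -> 7.06%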
5. Boundary Testing
Various boundary conditions are simulated to verify the stability of the system at the boundaries.
Boundary Condition | Load Testing Scenario | Load Testing Conclusion |
---|---|---|
Downstream node throttling | CPU throttling | Adjustment of the penalty factor has an important impact on traffic distribution |
Downstream node throttling | QPS throttling | Meets expectations |
Downstream node timeout | Each request on the Server side sleeps a fixed 1 s to simulate timeouts | Traffic allocated to the node while requests continuously time out is essentially 0 |
Abnormal exit of a downstream node | The Server-side process is killed directly with kill -9 pid | After the process is killed and automatically restarted, traffic distribution recovers quickly |
Addition and removal of downstream nodes | The Server side is manually taken offline and brought back online via jsf | The node receives no traffic while it is offline in jsf |
Server-side restart (stop + start) | Normal unregistration and registration of the Server-side process | Traffic distribution meets expectations |
6. Function Launch
The Client side at the Suqian data center was configured for launch, with the RALB load balancing mode fully enabled on all Server-side clusters. Before and after launch, the Server-side QPS gradually becomes stratified while the Server-side CPU usage gradually converges.