
I. Background

Common preparation work for a major promotion currently includes: dedicated load testing (full-link load testing, internal load testing), disaster recovery drills, degradation drills, traffic control, inspections (monitoring, application health status), and chaos testing (red and blue simulation), as shown in the figure below. As platform business grows more complex, red and blue simulation plays an increasingly prominent role. This article describes in detail how the big data platform carried out red and blue simulation while preparing for this Double Eleven promotion.


Figure 1. Illustration of common work for major promotion preparation

First, let's look at what red and blue simulation is and what benefits it brings.

II. Introduction to red and blue simulation

The red and blue simulation is a common adversarial exercise method in cybersecurity. It refers to the process of discovering and fixing deep-seated security vulnerabilities in an enterprise's internal and external network assets and business data while keeping the business running stably. It brings together the platform's security threat monitoring, emergency response, and protective capabilities in live red and blue exercises conducted in a real network environment, improving the security protection technology and management system.

The blue team represents the attacking side, and the red team represents the defending side. The red and blue simulation mimics the real process of network attacks and defenses, conducted in a controlled environment. The blue team attacks the red team by simulating various threats and attack methods, testing their defense capabilities and system high availability. The red team is responsible for defense and response, finding and fixing problems in the system, and collecting information about the attackers.


Figure 2. Red and blue simulation

III. Benefits of red and blue simulation

1. Ensuring the effectiveness of monitoring and alerting

The red and blue simulation can help the production and research teams verify the effectiveness of monitoring and alert configurations, timeliness of notifications, and accuracy of information.

2. Enhancing system reliability

The red and blue simulation helps improve the reliability of the system by identifying potential problems that may cause system errors.

3. Risk reduction

The red and blue simulation helps reduce the relevant risks of online issues by identifying potential vulnerabilities that may be exploited by malicious attackers.

4. Economical and efficient testing

The red and blue simulation mimics the scenarios of the production environment, but it will not pose any risk to the production environment, ensuring the quality of the system from the perspective of testing.


Figure 3. The benefits of red and blue confrontation

IV. Red and blue confrontation practice

Red and blue confrontation practice consists of six parts: exercise announcement, personnel specification and task allocation, pre-exercise scenario collection, the red and blue confrontation process, exercise result collection, and exercise review.


Figure 4. Red and blue confrontation practice mainly includes six parts

4.1 Exercise announcement

It mainly includes two parts:

Firstly, the main person in charge of this red and blue confrontation organizes the start-up meeting of the confrontation exercise, determines the time range of the confrontation exercise, and specifies the real-time / offline exercise interface personnel.

Secondly, notify business users in advance by email / internal office app that a red and blue confrontation exercise will be conducted.


Figure 5. Red and blue confrontation exercise announcement

4.2 Personnel specification and task allocation

Firstly, specify the main person in charge of this red and blue confrontation, who is responsible for the overall planning of the exercise, including scheme formulation, putting the exercise documents in place, scenario collection notification and review, organizing the attack and defense process, and the exercise review.

Secondly, specify the real-time and offline exercise interface personnel. They act as the blue team attackers, mainly responsible for specifying the attack scenarios and initiating system attacks.

Thirdly, specify the real-time and offline backup support personnel separately. They are generally core R&D personnel. Because the exact attack time is not known in advance, the red team may be unable to handle a fault in time for various reasons after the blue team attacks, which could affect normal online business; in that case the backup support personnel can quickly restore the system.

Finally, specify the real-time and offline exercise monitors separately. Generally, they are test personnel, mainly responsible for recording the alarm information (mdc, ump) issued during the exercise process and reviewing the exercise record documents.


Figure 6. Red and blue confrontation personnel specification and task allocation

4.3 Pre-exercise scenario collection

This part is the most important link before the exercise, mainly including determining the exercise application scope and determining the attack scenario of the attacker.

4.3.1 Determine the exercise application scope

It is recommended to prioritize applications at levels L0 and L1 for the exercise; the specific selection can be based on business needs. In addition, at JD.com, the corresponding applications can be looked up quickly in the following two ways:

http://XXX.jd.com/dashboard/4/node/XXX

http://XXX.jd.com/health

The detailed exercise application list is provided by the real-time / offline interface person (after being reviewed and approved by the C3 leader). The output is the attack scenario collection.

Figure 7. Exercise application scope

4.3.2 Collect exercise fault scenarios

The jdos application mainly relies on the 【Chaos Engineering】platform for fault injection and adopts the following exercise scenarios:

High CPU usage, high memory usage, high disk usage, network latency, packet loss, process termination, abnormal MySQL request delay, abnormal JimDB request delay, and so on.

For the underlying cluster, operations personnel inject faults mainly through scripts, commands, and similar methods. The drill scenarios mainly include:

High database instance CPU, full hdfs queue, pending calculation tasks, busy RSS cluster, abnormal zk node shutdown, etc.

4.4 Red-Blue Confrontation Process

Once the drill scenarios are ready and the product side has sent the drill notification email, the red-blue confrontation can begin. Two points should be noted:

① Do not "reveal" the specific attack time to the Red Team;

② It is recommended to choose a production environment application or cluster for attack to simulate online issues as realistically as possible.

4.4.1 Pre-Drill Notice from the Main Person in Charge

Before the Blue Team attackers formally start the drill, the main person in charge posts a message in the group, using the following template:

@All Members 
Important Notice
Today from 17:30 to 21:30, the big data platform (real-time + offline) will run red-blue confrontation drills, and faults will be injected at unannounced, irregular times. Please post your handling progress in this group as you go. Handling is divided into three stages: problem discovery, cause analysis and diagnosis, and fault handling.
After confirming each link (problem discovery, fault diagnosis, fault handling), send a message immediately, do not send a summary at the end!
After confirming each link (problem discovery, fault diagnosis, fault handling), send a message immediately, do not send a summary at the end!

1. Problem Discovery
Problem Discovery
Product-Service Name:
(1) Received phone/dongdong alarm, alarm content xxx  
Or
(2) Radar large screen turns red, screenshot xx begins to investigate and handle

2. Cause Analysis
Fault Diagnosis
Product-Service Name: xx issue cause has been found, and a brief description of the cause.

3. Fault Handling
Fault Handling
Product-Service Name: xx issue has been handled, restored, and an alarm recovery/monitoring screenshot has been provided.

4.4.2 Blue Team Create & Execute Drill Tasks

The Blue Team creates or bulk creates drill tasks on the chaos engineering platform according to the drill scenarios collected previously. As shown in the following figure:

Figure 8. Blue Team Creates Task

The following points are explained:

① The attack on the underlying cluster is mainly realized through commands and scripts, and it will not be described in detail here.

② Network delay and packet loss faults may cause the drill to fail. Reason: network fault drills are limited on some hosts (kernel version "4.18.0-80.11.2.el8_0.x86_64" has known bugs and cannot be drilled).

③ For the memory utilization scenario, setting 100% is not advised, because Linux triggers the OOM killer when memory is full; 90% is recommended.

④ The drill duration should be longer than 5 minutes. Reason: some applications' mdc alarm rules evaluate over a window of up to 5 minutes, so a shorter drill may not produce an alarm.
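The notes above can be turned into a simple pre-flight check before tasks are submitted. The following is only a minimal sketch: the `DrillTask` fields, the scenario names, and `check_task` are hypothetical and do not correspond to any real chaos engineering platform API; only the kernel version, the 90% memory limit, and the 5-minute minimum come from the notes above.

```python
from dataclasses import dataclass
from typing import List

# Kernel version reported above as unable to run network fault drills.
KERNELS_WITHOUT_NETWORK_DRILLS = {"4.18.0-80.11.2.el8_0.x86_64"}
# Hypothetical scenario identifiers, chosen only for this sketch.
NETWORK_SCENARIOS = {"network_latency", "packet_loss"}
MAX_MEMORY_PERCENT = 90     # above this, Linux may trigger the OOM killer
MIN_DURATION_MINUTES = 5    # drills should run longer than this


@dataclass
class DrillTask:
    app: str
    scenario: str
    duration_minutes: float
    kernel_version: str
    memory_percent: int = 0  # only meaningful for the "high_memory" scenario


def check_task(task: DrillTask) -> List[str]:
    """Return warnings for a task; an empty list means no rule is violated."""
    warnings = []
    if (task.scenario in NETWORK_SCENARIOS
            and task.kernel_version in KERNELS_WITHOUT_NETWORK_DRILLS):
        warnings.append("network fault drills are limited on this kernel version")
    if task.scenario == "high_memory" and task.memory_percent > MAX_MEMORY_PERCENT:
        warnings.append("memory target above 90% may trigger the OOM killer")
    if task.duration_minutes <= MIN_DURATION_MINUTES:
        warnings.append("duration should exceed 5 minutes so mdc alarms can fire")
    return warnings
```

Running such a check over the whole scenario list before bulk creation would surface tasks that are likely to fail or to go unnoticed by alarms.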

4.4.3 Red Team Fault Defense and Repair

After the Blue Team initiates an attack, the Red Team will receive an alarm from the internal office app and proceed with fault repair according to the predefined plan. Some screenshots are as follows:

Figures 9 and 10. Internal Office App Alarm Notice

4.4.4 【Red】System recovery

Some simulation scenarios (process termination) will not recover automatically and require the red team to manually restart system application services to ensure that all production application services are normal.

4.4.5 【Red + Blue】Simulation ends

After the red-blue confrontation simulation ends, both the red and blue teams fill out the “Red-Blue Confrontation Simulation Scenario” document. The blue team fills in: chaos task link, chaos simulation scenario, simulation status, chaos simulation start time, and chaos simulation end time. The red team fills in: troubleshooter, alarm information, root cause, time to find the cause, troubleshooting process description (including troubleshooting process, tools used, and information used to support decisions), planned solution & emergency plan, and estimated time to handle the impact.

Figure 11. Illustration of document filling after the simulation ends
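If the record is kept in structured form rather than free text, response times can be computed directly from it. The sketch below is illustrative only: the `DrillRecord` class and its field names are hypothetical, it covers only a subset of the fields listed above, and it assumes that "time to find the cause" is recorded as a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DrillRecord:
    # Filled in by the blue team
    chaos_task_link: str
    scenario: str
    status: str
    attack_start: datetime
    attack_end: datetime
    # Filled in by the red team
    troubleshooter: str = ""
    alarm_info: str = ""
    root_cause: str = ""
    cause_found_at: Optional[datetime] = None  # assumed to be a timestamp

    def attack_duration(self) -> timedelta:
        """How long the injected fault lasted."""
        return self.attack_end - self.attack_start

    def time_to_diagnose(self) -> Optional[timedelta]:
        """Time from attack start until the cause was found, if recorded."""
        if self.cause_found_at is None:
            return None
        return self.cause_found_at - self.attack_start
```

A record whose `time_to_diagnose()` is `None`, or longer than the attack itself, is a natural candidate for discussion in the review.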

4.5 Exercise result collection

The main person in charge reviews the simulation results, sorts out the issues found in the simulation, and has the red and blue teams fix them as soon as possible. The main issues are as follows:

1) Not handled in a timely manner: After the red team receives the alarm, due to various reasons (meetings, not at the desk, etc.), the fault is not handled in a timely manner.

2) Incomplete handling: After the red team handles the ns failure issue, it does not notify users to handle the failed tasks.

3) No alarm received:

① Alarm rules are not configured. For example, the mdc or ump platform has not configured alarms.

② The alarm threshold has not been triggered. For example, when the blue team attacks, the CPU utilization rate is 90%, but the mdc alarm rule configuration is 95%.

③ Alarms are disabled on the mdc platform. For example, the mdc monitoring and alarms for the template center had been temporarily disabled.

Figure 12. Issues existing in the simulation
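The three "no alarm received" causes can be checked mechanically before a drill. The following is a minimal sketch under stated assumptions: `AlarmRule` and `why_no_alarm` are hypothetical names, not part of mdc or ump, and the rule is assumed to fire when the metric reaches or exceeds its threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmRule:
    metric: str
    threshold: float      # assumed: alarm fires when value >= threshold
    enabled: bool = True


def why_no_alarm(rule: Optional[AlarmRule], injected_value: float) -> Optional[str]:
    """Return the reason an injected fault would raise no alarm, or None if it should."""
    if rule is None:
        return "alarm rule not configured"
    if not rule.enabled:
        return "alarm rule disabled"
    if injected_value < rule.threshold:
        return "injected value below alarm threshold"
    return None
```

With the example above, a 90% CPU attack against a rule configured at 95% is reported as "injected value below alarm threshold".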

4.6 Simulation review

The main person in charge organizes the red-blue confrontation review meeting and presents the simulation results and problem list. Both on-site and off-site architects participate in real time and give evaluations or suggestions on the simulation process and its results:

① Alarm levels need self-inspection and correction. Some alarm levels are currently configured too low: CPU utilization above 90% reports 【Warning】, and it is recommended to change this to 【Emergency】.

② Extend the attack time. Pick a few applications and attack them for 30+ minutes to verify whether the defenders actually cut off the traffic.

③ Make chaos drills routine. Use the chaos engineering platform's routine drill feature, combined with the duty roster, to increase drill frequency so that the team trains under real conditions.

④ Simulate 【Warning】 and 【Emergency】 scenarios step by step. The first step is to attack for 10 minutes to trigger the 【Warning】 scenario, and the second step is to attack for another 10 minutes to trigger the 【Emergency】 scenario.

⑤ Java method exception and delay scenarios have not yet been simulated. Going forward, testers are expected to support this by driving traffic with forcebot load testing.
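The step-by-step suggestion in ④ amounts to a short timeline of attack stages, each with the alarm level the red team should expect. A minimal sketch follows; the `Stage` class and `schedule` function are hypothetical and exist only to illustrate the timeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    expected_level: str  # alarm level the stage is meant to trigger
    minutes: int         # how long this stage of the attack lasts


def schedule(stages: List[Stage]) -> List[Tuple[int, int, str]]:
    """Return (start_minute, end_minute, expected_level) for each stage in order."""
    timeline = []
    start = 0
    for stage in stages:
        end = start + stage.minutes
        timeline.append((start, end, stage.expected_level))
        start = end
    return timeline


# The two-step plan from suggestion ④: 10 minutes each.
plan = schedule([Stage("Warning", 10), Stage("Emergency", 10)])
```

Comparing the alarms actually received against this timeline shows whether each level fired in its intended window.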

Expected support from the chaos platform:

① The chaos engineering platform supports selecting multiple applications in one batch to create and start/stop chaos simulation tasks. This would improve the efficiency of task creation. Currently, even batch creation requires adding applications one by one.

② The Chaos Engineering Platform provides an API for routine chaos drills. This makes it easier for users to customize and create regular chaos drill tasks.

③ The Chaos Engineering Platform supports checking mdc, ump alarms within the platform. This reduces the need for users to switch between multiple platform systems.

V. Summary

Through this red-blue confrontation exercise, we not only effectively enhanced the risk resistance of the big data platform's system applications and reduced the probability of system failures in the production environment, but also greatly improved R&D personnel's ability to resolve problems and faults, and built up a fast, efficient drill plan that can be reused.

Last but not least, thanks for the strong support from the Chaos Engineering Platform!