Background

Currently, common preparation work for major promotions includes: special load testing (full-link load testing, internal load testing), disaster recovery drills, degradation drills, traffic control, inspection (monitoring, application health status), and chaos testing (Red-Blue confrontation), as shown in the figure below. As platform business grows more complex, Red-Blue confrontation plays an increasingly evident role in the big data platform's preparation for major promotions. Below, we introduce how the big data platform carries out Red-Blue confrontation when preparing for major promotions.

First, let's look at what Red-Blue confrontation is and what benefits it brings.

1. Introduction to Red-Blue confrontation

Red-Blue confrontation is a confrontational drill method commonly used in the field of cybersecurity. It brings together a platform's security threat monitoring, emergency response, and protection capabilities to discover and rectify deep-seated security risks in internal and external network assets and business data, keeping the business running smoothly. Live Red-Blue drills are carried out in a real network environment to improve security protection technology and the management system.

The Blue team represents the attacking side, and the Red team represents the defending side. The Red-Blue confrontation simulates a real network attack and defense process, conducted in a controlled environment. The Blue team attacks by simulating various threats and attack methods, testing the Red team's defense capabilities and the system's high availability. The Red team is responsible for defense and response: finding and fixing problems in the system and collecting information about the attackers.

2. Benefits of Red-Blue confrontation

1. Ensure the effectiveness of monitoring alarms

The Red-Blue confrontation helps product and R&D teams verify that monitoring alarms are configured effectively, that notifications arrive in time, and that the information they carry is accurate.

2. Enhance system reliability

The Red-Blue confrontation helps improve the reliability of the system by identifying potential issues that may cause system errors.

3. Risk reduction

The Red-Blue confrontation helps reduce the risk of online issues by identifying potential vulnerabilities that malicious attackers might exploit.

4. Economical and efficient testing

The Red-Blue confrontation simulates the production environment scenario but will not pose any risk to the production environment, ensuring the quality of the system from the perspective of testing.

3. Red-Blue confrontation practice

The Red-Blue confrontation practice consists of 6 parts: exercise announcement, personnel assignment and task distribution, pre-exercise scenario collection, the Red-Blue confrontation process, collection of exercise results, and exercise review.

1. Exercise announcement

This mainly includes two parts:

Firstly, the main person in charge of this red-blue confrontation organizes the start-up meeting of the confrontation exercise, determines the exercise time range, and specifies the real-time/offline exercise interface personnel.

Secondly, notify business users in advance, by email or Dongdong, about the upcoming red-blue confrontation exercise.

2. Personnel assignment and task distribution

Firstly, specify the main person in charge of this red-blue confrontation, who is responsible for the overall planning of the exercise: formulating the plan, putting the exercise documents in place, announcing the scenario collection and reviewing its results, organizing the attacking side's attacks and the defending side's defense, and running the exercise review.

Secondly, specify the real-time and offline preparation interface personnel. They act as the blue team's attacking side, mainly responsible for specifying the exercise attack scenario and initiating system attacks.

Thirdly, specify the real-time and offline backup support personnel separately; these are generally core R&D personnel. Because the exact time of an attack is not known in advance, the red team may for various reasons be unable to handle a fault promptly after the blue team attacks, which would affect normal online business. In that case the backup support personnel can quickly restore the system.

Finally, specify the real-time and offline exercise monitors separately. Generally, these are test personnel, mainly responsible for recording alarm information (mdc, ump) during the exercise process and reviewing the exercise record documents.

3. Pre-exercise scenario collection

This part is the most important step before the exercise, mainly including determining the exercise application scope and determining the attack scenario of the attacking side.

3.1. Determine the exercise application scope

It is recommended to give priority to applications at levels L0 and L1 when choosing exercise applications; the selection can be adjusted according to business needs. In addition, the following two pages can be used to quickly look up the corresponding applications:

http://XXX.jd.com/dashboard/4/node/XXX

http://XXX.jd.com/health

The detailed exercise application list is provided by the real-time/offline interface personnel (after review and approval by the C3 leader). Output: the "Collect attack scenarios in batches" document.

3.2. Collect exercise fault scenarios

jdos applications: fault injection mainly relies on the Chaos Engineering platform, with the following exercise scenarios:

High CPU usage, high memory usage, high disk usage, network latency, packet loss, process termination, MySQL request delay exceptions, jimdb request delay exceptions, etc.

Underlying clusters: faults are mainly injected by operation and maintenance personnel using scripts, commands, and similar methods. The main exercise scenarios include the following:

High CPU usage on database instances, full hdfs queues, pending computing tasks, a busy RSS cluster, zk node failures, etc.
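The two scenario lists above can be kept as a small catalog so that each exercise batch is drawn from a known set. The sketch below is illustrative only: `Scenario`, `SCENARIOS`, `scenarios_for`, and the target labels `jdos_app` and `cluster` are hypothetical names introduced here, not part of the Chaos Engineering platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    name: str         # fault scenario as named in the text
    target: str       # "jdos_app" or "cluster" (labels assumed for this sketch)
    injected_by: str  # injection route described in the text


# Scenario names come from the lists above; the grouping keys are assumptions.
SCENARIOS = [
    Scenario(name, "jdos_app", "chaos engineering platform")
    for name in (
        "high CPU usage",
        "high memory usage",
        "high disk usage",
        "network latency",
        "packet loss",
        "process termination",
        "MySQL request delay exceptions",
        "jimdb request delay exceptions",
    )
] + [
    Scenario(name, "cluster", "operations scripts and commands")
    for name in (
        "database instance CPU usage high",
        "hdfs queue full",
        "pending computing tasks",
        "busy RSS cluster",
        "zk node failure exceptions",
    )
]


def scenarios_for(target: str) -> List[str]:
    """Return the scenario names that apply to one kind of target."""
    return [s.name for s in SCENARIOS if s.target == target]
```

For example, `scenarios_for("cluster")` returns the five cluster-level scenarios, which the operations staff would inject by hand.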

4. Red-blue confrontation process

With the exercise scenario in place and the product having sent the exercise notification email, the red-blue confrontation can proceed. Here are some points to clarify:

① The specific attack time must not be "revealed" to the defending side (the red team).

② It is recommended to attack applications or clusters in the production environment, so as to simulate online problems as realistically as possible.

4.1. [Main Person in Charge] Pre-exercise notification

Before the blue team formally starts its attacks, the main person in charge sends a message to the group in advance. The template is as follows:

@All Members  
[Important Notice]
Today from 17:30 to 21:30, the big data platform (real-time + offline) will carry out a Red-Blue confrontation exercise, with surprise fault attacks launched at irregular intervals. Please post follow-up handling progress in this group, in three stages: problem discovery, cause analysis and diagnosis, and fault handling.
As soon as each stage (problem discovery, fault diagnosis, fault handling) is confirmed, send a message immediately. Do not wait and send a summary at the end!


1. Problem Discovery
[Problem Discovery]
Product-Service Name:
(1) Received a phone/Dongdong alarm, alarm content xxx
or
(2) The radar dashboard turned red (screenshot xx); troubleshooting and handling have started

2. Cause Analysis
[Fault Diagnosis]
Product-Service Name: the cause of issue xx has been found (brief description of the cause).

3. Fault Handling
[Fault Handling]
Product-Service Name: issue xx has been resolved and service is restored (alarm recovery/monitoring screenshot attached).
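Because every stage message in the template shares the same shape, it can be generated rather than typed by hand. This is a minimal sketch under that assumption; `stage_message` and its stage keys are hypothetical and do not correspond to any Dongdong API.

```python
# Stage tags taken from the template above; the dictionary keys are assumed.
STAGE_TAGS = {
    "discovery": "Problem Discovery",
    "diagnosis": "Fault Diagnosis",
    "handling": "Fault Handling",
}


def stage_message(stage: str, service: str, detail: str) -> str:
    """Format one progress message for the exercise group chat."""
    if stage not in STAGE_TAGS:
        raise ValueError(f"unknown stage: {stage}")
    return f"[{STAGE_TAGS[stage]}]\nProduct-Service Name: {service}\n{detail}"


if __name__ == "__main__":
    print(stage_message("discovery", "demo-service", "Received Dongdong alarm"))
```

Sending one such message per stage, as soon as the stage is confirmed, keeps the timeline of discovery, diagnosis, and handling visible to the main person in charge.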

4.2. [Blue] Create & execute exercise tasks

Using the exercise scenarios collected earlier, the blue team creates exercise tasks on the Chaos Engineering platform, either one at a time or through batch creation of exercise tasks, and executes them. As shown in the figure

The following points are explained:

① The attack on the underlying cluster is mainly realized through commands and scripts, and it will not be described in detail here.

② Network delay and packet loss faults may fail to run. The reason reported is: restricted network fault exercise (the kernel version of this host has a known bug and cannot be exercised), for kernel version "4.18.0-80.11.2.el8_0.x86_64".

③ Do not set the memory utilization rate to 100%. When Linux memory is exhausted, the OOM killer is triggered, so it is recommended to set it to 90%.

④ An exercise duration of more than 5 minutes is recommended. Some applications are configured with an mdc alarm cycle of up to 5 minutes, so if the exercise lasts less than 5 minutes, the alarm may not be received.
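Points ② to ④ above can be checked before a task is submitted. The sketch below encodes them as a simple pre-flight check; `ChaosTask`, `check_task`, and the fault labels are hypothetical and do not reflect the Chaos Engineering platform's actual task model.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_MEMORY_PERCENT = 90          # text: a 100% target triggers the OOM killer
MIN_DURATION_SECONDS = 5 * 60    # text: some mdc alarm cycles are up to 5 minutes
RESTRICTED_KERNELS = {"4.18.0-80.11.2.el8_0.x86_64"}  # kernel named in the text
NETWORK_FAULTS = {"network_latency", "packet_loss"}    # labels assumed here


@dataclass
class ChaosTask:
    fault: str
    duration_seconds: int
    memory_percent: Optional[int] = None
    kernel_version: Optional[str] = None


def check_task(task: ChaosTask) -> List[str]:
    """Return warnings for settings the text advises against."""
    warnings = []
    if task.fault in NETWORK_FAULTS and task.kernel_version in RESTRICTED_KERNELS:
        warnings.append("network faults are restricted on this kernel version")
    if task.memory_percent is not None and task.memory_percent > MAX_MEMORY_PERCENT:
        warnings.append("memory target above 90% risks triggering the OOM killer")
    # The text recommends a duration greater than 5 minutes, so exactly 5 also warns.
    if task.duration_seconds <= MIN_DURATION_SECONDS:
        warnings.append("duration should exceed 5 minutes so alarms can fire")
    return warnings
```

A task such as `ChaosTask("memory", 600, memory_percent=100)` would come back with a single warning about the memory target, while a 10-minute CPU task passes cleanly.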

4.3. [Red] Defense and fault repair

After the blue team launches an attack, the red team will receive a Dongdong alarm and proceed with fault repair according to the established emergency plan. Some screenshots are as follows:

4.4. [Red] System recovery

Some exercise scenarios (such as process termination) do not recover automatically. The red team must manually restart the affected application services and confirm that all production application services are back to normal.

4.5. [Red + Blue] Exercise completed

After the Red-Blue confrontation exercise is over, both teams fill out the "Red-Blue Confrontation Exercise Scenarios" document. The blue team fills in: chaos task link, chaos exercise scenario, exercise status, and the start and end times of the chaos exercise. The red team fills in: person in charge, alarm information, root cause, time when the root cause was found, description of the troubleshooting process (including the steps taken, tools used, and information that supported decisions), planned solution and emergency plan, and estimated impact and processing time. As shown in the figure below:
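The fields of the exercise document map naturally onto a record type, which also makes it easy to compute how long the red team took to find the root cause. This is only a sketch of one possible shape: `ExerciseRecord`, its field names, and `time_to_root_cause` are hypothetical and are not taken from the actual document template.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ExerciseRecord:
    # Filled in by the blue team
    task_link: str
    scenario: str
    status: str
    started_at: datetime
    ended_at: datetime
    # Filled in by the red team
    owner: str = ""
    alarm_info: str = ""
    root_cause: str = ""
    root_cause_found_at: Optional[datetime] = None
    troubleshooting: str = ""
    solution_and_emergency_plan: str = ""
    impact_and_processing_time: str = ""

    def time_to_root_cause(self) -> Optional[timedelta]:
        """Time from the start of the attack until the root cause was found."""
        if self.root_cause_found_at is None:
            return None
        return self.root_cause_found_at - self.started_at
```

Keeping the blue and red fields in one record makes gaps visible during review: a record whose `root_cause_found_at` is still empty points to a fault that was never diagnosed.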

5. Collect exercise results

The main person in charge reviews the exercise results and sorts out the issues the exercise exposed, so that the red and blue teams can fix them as soon as possible. The main problems include:

1) Not handled in a timely manner: after the red team received the alarm, the fault was not handled promptly for various reasons (in a meeting, away from the workstation, etc.).

2) Incomplete handling: after the red team handled the ns failure, it did not notify users to deal with their failed tasks.

3) No alarm received:

① No alarm rules were configured. For example, no alarms were configured on the MDC or ump platform.

② The alarm threshold was not reached. For example, the blue team's attack drove CPU utilization to 90%, but the mdc alarm rule was configured at 95%.

③ Alarms were disabled on the mdc platform. For example, MDC monitoring and alarms for the template center had been temporarily disabled.
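The three reasons for a missing alarm can be checked in a fixed order when reviewing each scenario. The sketch below assumes an alarm fires when the observed value is greater than or equal to the configured threshold; `AlarmRule` and `why_no_alarm` are hypothetical names, not MDC or ump APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmRule:
    metric: str
    threshold: float      # alarm assumed to fire when value >= threshold
    enabled: bool = True


def why_no_alarm(rule: Optional[AlarmRule], injected_level: float) -> Optional[str]:
    """Explain why no alarm fired, or return None if one should have fired."""
    if rule is None:
        return "no alarm rule configured"
    if not rule.enabled:
        return "alarm disabled"
    if injected_level < rule.threshold:
        return "threshold not reached"
    return None
```

With the example from the text, `why_no_alarm(AlarmRule("cpu", 95), 90)` reports that the threshold was not reached, which suggests either lowering the rule or raising the injected load.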

6. Exercise review

The main person in charge organizes the Red-Blue confrontation review meeting and presents the exercise results and the problem list. Both the real-time and offline architects attend and evaluate the exercise, or make suggestions, from the perspectives of exercise process and exercise effect.

① Alarm level needs self-check and correction. Currently, some alarm level configurations are too low. When the CPU utilization rate is greater than 90%, it reports [Warning], and it is recommended to change it to [Emergency].

② Extend the attack time. Pick some applications and attack them for 30+ minutes to verify whether the defenders really cut off the traffic.

③ Make chaos exercises routine. This can be done through the "Regular Exercise" feature of the Chaos Engineering platform, and the exercise frequency can be increased in line with the duty schedule, so that the team trains under real conditions.

④ Step-by-step exercise of [Warning] and [Emergency] scenarios. The first step is to attack for 10 minutes to trigger the [Warning] scenario, and the second step is to attack for another 10 minutes to trigger the [Emergency] scenario.

⑤ Java method exception and delay scenarios have not been exercised yet. Going forward, test personnel are expected to supply the traffic for these scenarios through forcebot load testing.
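The step-by-step exercise in point ④ amounts to two back-to-back attack windows. A minimal sketch of that schedule follows; `STAGES` and `stage_windows` are hypothetical helpers, and the load level needed to trigger each alarm level is deliberately left out because the text does not specify it.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Alarm level to trigger and attack duration in minutes, as suggested in point 4.
STAGES = [("Warning", 10), ("Emergency", 10)]


def stage_windows(
    start: datetime, stages: List[Tuple[str, int]] = STAGES
) -> List[Tuple[str, datetime, datetime]]:
    """Return (level, window_start, window_end) for consecutive attack stages."""
    windows = []
    cursor = start
    for level, minutes in stages:
        end = cursor + timedelta(minutes=minutes)
        windows.append((level, cursor, end))
        cursor = end  # the next stage begins where this one ends
    return windows
```

Recording the window for each stage lets the exercise monitor match every received alarm to the stage that was meant to trigger it.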

Features we hope the chaos platform will support:

① The chaos engineering platform supports selecting multiple applications in one batch to create and start chaos exercise tasks. This would improve the efficiency of task creation. Currently, batch creation of exercise tasks can only add applications one by one.

② The chaos engineering platform provides an API for regular chaos exercises. This would make it easier for users to customize and create regular chaos exercise tasks.

③ The chaos engineering platform supports viewing mdc, ump alarms within the platform. It reduces the need for users to switch between multiple platform systems.

4. Summary

This red-blue confrontation exercise effectively enhanced the risk resistance of the big data platform's systems and applications, reduced the probability of system failure in the production environment, greatly improved R&D personnel's ability to resolve problems and faults, and established a fast and efficient exercise plan.

Author: Jingdong Retail, Yin Wei

Source: JD Cloud Developer Community. Please indicate the source of reproduction.
