Recently, at the CIS2022 conference hosted by Freebuf, Lu Enzhe, head of server security at ByteDance, gave a talk titled 'Best Practices of ByteDance Cloud Workload Protection'. Drawing on ByteDance's internal practice, he described how the company protects every type of workload across complex business and deployment environments, and how it relieves the pressure on security operations at massive scale through data optimization, alert consolidation, and correlation tracing.

Threats Facing Cloud Workload Protection

At ByteDance, a large number of services with different requirements run on a variety of workload platforms: physical machines, virtual machines, containers, and Serverless. Research has found that each workload type faces its own security threats. For physical and virtual machines, the main concerns are traditional intrusion prevention, compliance, and risk awareness. For container workloads, new problems such as image security and cluster-level intrusion prevention keep emerging. For Serverless, high-privilege API calls made inside the application are hard to distinguish from the host or container layer, and the shorter lifecycle makes post-incident tracing and troubleshooting more difficult.

How to implement cloud workload protection in the production network? Practices shared by ByteDance

Enzhe noted that the resources available to a security team are usually limited. An enterprise cannot realistically assign a separate team to each workload type, and because intrusions are often related, responding to them means analyzing several workload types together. To address the threats above, Volcano Engine drew on ByteDance's internal experience to build an integrated cloud workload protection platform, CWPP.

According to the talk, Volcano Engine CWPP originates from Elkeid, ByteDance's internal host security solution. Elkeid has met the intrusion prevention and traceability needs of tens of millions of containers within ByteDance, and its data collection module has been open sourced (https://github.com/bytedance/Elkeid).

From the start, CWPP was designed to provide consistent protection and visibility across physical machines, virtual machines, containers, and serverless workloads. Host security, container security, RASP, blocking and response, and traceability are integrated into a single agent as plugins. ByteDance's own cross-region and cross-cloud deployment needs also led CWPP to support multi-cloud and hybrid cloud environments.

Design philosophy of Volcano Engine CWPP

As shown in the figure, Volcano Engine CWPP integrates protection for multiple workload types into one native architecture, and includes a dedicated high-speed policy engine and service discovery capabilities built for very large fleets of machines. This integration reduces the overall operational burden.

CWPP Data Collection

Data is collected from different layers and sources depending on the workload and the requirement. For hosts, virtual machines, and containers, CWPP currently collects data at the kernel layer. Compared with user-space solutions, kernel-layer collection offers richer data, lower performance overhead, and natural support for containers.

Application-layer attacks such as SSRF, RCE, and SQL injection may still occur. To cover them, we developed Volcano Engine Elkeid RASP, an application security defense technology built in-house by the Volcano Engine Cloud Security Team. It implants probes into the application runtime to collect critical runtime information and analyzes runtime behavior to produce timely alerts.

The diagram below shows where Elkeid RASP sits in on-premises data collection: it protects the application layer itself. As the right side of the diagram shows, Elkeid RASP is a plugin of the on-premises agent. It dynamically injects RASP probes into selected services and supports monitoring and managing services across containers. The RASP probe also provides runtime operations such as reporting hook information, hot patching, and blocking access. Volcano Engine Elkeid RASP supports multiple language runtimes, which covers the needs of most backend services at most companies.
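To make the probe idea concrete, here is a minimal Python sketch of an in-process hook that reports each call and can block it. It is an illustration only, not Elkeid RASP's implementation: the `Probe` class, the `BlockedByProbe` exception, the `block_markers` rule format, and the `read_file` target are all hypothetical names invented for this example.

```python
import functools
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Tuple


class BlockedByProbe(Exception):
    """Raised when the probe refuses a call that matches a blocking rule."""


class Probe:
    """Hypothetical in-process probe: reports hooked calls and can block them."""

    def __init__(self, block_markers: Tuple[str, ...] = ()) -> None:
        # Blocking rules are plain substrings here; they can be replaced at
        # runtime, which stands in for pushing a hot patch to a live service.
        self.block_markers = block_markers
        # Stand-in for the channel that would report hook data to the agent.
        self.reports: List[Dict[str, Any]] = []

    def hook(self, owner: Any, name: str) -> Callable[..., Any]:
        """Replace owner.<name> with a wrapper and return the original callable."""
        original = getattr(owner, name)

        @functools.wraps(original)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            rendered = [repr(a) for a in args]
            rendered += [f"{k}={v!r}" for k, v in kwargs.items()]
            blocked = any(m in text for text in rendered for m in self.block_markers)
            # Report every call, whether or not it is blocked.
            self.reports.append({"hook": name, "args": rendered, "blocked": blocked})
            if blocked:
                raise BlockedByProbe(f"{name} blocked by probe")
            return original(*args, **kwargs)

        setattr(owner, name, wrapper)
        return original


def read_file(path: str) -> str:
    """Hypothetical sensitive function inside the protected application."""
    return f"contents of {path}"


# A namespace object stands in for the application module being instrumented.
app = SimpleNamespace(read_file=read_file)
probe = Probe(block_markers=("/etc/shadow",))
probe.hook(app, "read_file")
```

With this setup, `app.read_file("/tmp/report.txt")` runs normally and is recorded in `probe.reports`, while `app.read_file("/etc/shadow")` raises `BlockedByProbe`. Assigning a new tuple to `probe.block_markers` changes the blocking behavior without restarting the application.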

In the cloud-native era, attacks on containers themselves are one concern, and attacks on the Kubernetes cluster itself are a new exposure surface. Elkeid RASP therefore uses Kubernetes' native audit log data, which is fed into the policy engine through the log collection layer. Rules written there analyze security events and threat risks, and the results are passed to security engineers for review.
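As an illustration of rules running over Kubernetes audit logs, the following Python sketch reads audit events in JSON-lines form and flags two patterns: exec into a pod and creation of a privileged pod. The field names (`stage`, `verb`, `user.username`, `objectRef`, `requestObject`) come from the Kubernetes audit event format; the rule names and the `evaluate` function are hypothetical and do not represent the CWPP policy engine.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def is_pod_exec(event: Dict[str, Any]) -> bool:
    """Match requests to the pods/exec subresource."""
    ref = event.get("objectRef") or {}
    # Depending on the client, exec is recorded as create (POST) or get.
    return (
        ref.get("resource") == "pods"
        and ref.get("subresource") == "exec"
        and event.get("verb") in {"create", "get"}
    )


def is_privileged_pod_create(event: Dict[str, Any]) -> bool:
    """Match pod creation where any container requests privileged mode."""
    ref = event.get("objectRef") or {}
    if event.get("verb") != "create" or ref.get("resource") != "pods":
        return False
    if ref.get("subresource"):
        return False
    # requestObject is only present when the audit policy logs request bodies.
    spec = (event.get("requestObject") or {}).get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    return any(
        (c.get("securityContext") or {}).get("privileged") is True
        for c in containers
    )


# Hypothetical rule table: rule name -> predicate over one audit event.
RULES: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "pod_exec": is_pod_exec,
    "privileged_pod_create": is_privileged_pod_create,
}


def evaluate(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Apply every rule to each completed audit event and collect findings."""
    findings: List[Dict[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # A request can be logged at several stages; only count the final one.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        for name, rule in RULES.items():
            if rule(event):
                findings.append({
                    "rule": name,
                    "user": (event.get("user") or {}).get("username", ""),
                    "namespace": ref.get("namespace", ""),
                    "name": ref.get("name", ""),
                })
    return findings
```

Each finding carries the user and target so that an engineer can judge whether the action was expected. A production rule set would need far more context than these two predicates.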

CWPP Alerts and Policies

Once data is collected, the next focus is writing the corresponding policies and acting on the resulting alerts promptly:

First, alerts are enriched with context, that is, behavior sequence detection. As shown in the figure below, none of the three events is a significant risk on its own, but together they form a behavior sequence that is risky in context and warrants an alert.

Second, for long-running resident binaries, CWPP has built a cloud-based malware scanning ("cloud disinfection") capability. Suspicious binaries are submitted to the cloud, where multiple engines perform static detection and the sample is executed in a real dynamic sandbox so its behavior can be observed and analyzed; richer rules and data are then used to decide whether the sample is malicious. So far, cloud disinfection has captured many malicious binaries from hidden entry points inside and outside ByteDance.

Third, Volcano Engine CWPP uses a self-developed storage layer to store raw data cheaply and query it efficiently. The solution supports queries over PB-scale raw data that return within seconds. When a security alert fires, the CWPP traceability engine tries to link the alert to queries over the raw data so that the scene can be reconstructed.
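The behavior sequence detection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CWPP policy engine: the `Event` fields, the event kinds used in the usage note, and the `SequenceDetector` class are hypothetical. The detector raises an alert only when all steps occur in order on the same host within a time window.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    ts: float   # event time in seconds
    host: str   # machine or container the event came from
    kind: str   # hypothetical event type label


class SequenceDetector:
    """Alert when the given steps occur in order on one host within a window."""

    def __init__(self, steps: Sequence[str], window_seconds: float) -> None:
        if len(steps) < 2:
            raise ValueError("a sequence needs at least two steps")
        self.steps = tuple(steps)
        self.window = window_seconds
        # host -> list of (start timestamp, index of the next expected step)
        self._partials: DefaultDict[str, List[Tuple[float, int]]] = defaultdict(list)

    def feed(self, event: Event) -> bool:
        """Consume one event (in timestamp order); True if a sequence completed."""
        completed = False
        kept: List[Tuple[float, int]] = []
        for start, idx in self._partials[event.host]:
            if event.ts - start > self.window:
                continue  # partial match is too old, drop it
            if self.steps[idx] == event.kind:
                idx += 1
            if idx == len(self.steps):
                completed = True  # all steps seen in order, inside the window
            else:
                kept.append((start, idx))
        # Start a new partial match after advancing the existing ones, so a
        # single event never counts for two consecutive steps.
        if event.kind == self.steps[0]:
            kept.append((event.ts, 1))
        self._partials[event.host] = kept
        return completed
```

For example, with steps `("download", "chmod_exec", "outbound_connect")` and a 60-second window, feeding those three event kinds for one host at seconds 0, 10, and 20 makes the third `feed` call return `True`. The same three events spread over more than 60 seconds, or split across different hosts, produce no alert.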

CWPP Alert Cases

In an emergency response to a Java RCE alert this year, the attacker got in through an edge machine, found exploitable arbitrary file read and SSRF vulnerabilities, and then probed blindly until they found and compromised a Jenkins cluster. Even after CWPP quickly raised alerts and blocked the activity, the attacker kept changing methods and targets and made multiple further attack attempts. With accurate alert information from CWPP, the operations staff quickly identified related alerts from the automatically associated events, assessed the damage and loss, pushed RASP protection out through the CWPP platform, and received the captured runtime alerts in real time. CWPP's alert traceability and handling capabilities resolved the incident effectively.
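One simple way to associate alerts automatically, as in the case above, is to group alerts that share an entity such as a host or a source IP. The Python sketch below does this with a union-find structure. It is an assumption-laden illustration, not CWPP's correlation logic: the alert fields (`id`, `host`, `source_ip`), the sample values in the usage note, and the `correlate` function are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def correlate(
    alerts: List[Dict[str, Any]],
    keys: Sequence[str] = ("host", "source_ip"),
) -> List[List[str]]:
    """Group alert ids so that alerts sharing any entity land in one incident."""
    parent = list(range(len(alerts)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Remember the first alert that mentioned each entity, then link later ones.
    first_seen: Dict[Tuple[str, Any], int] = {}
    for i, alert in enumerate(alerts):
        for key in keys:
            value = alert.get(key)
            if value is None:
                continue
            entity = (key, value)
            if entity in first_seen:
                union(first_seen[entity], i)
            else:
                first_seen[entity] = i

    groups: Dict[int, List[str]] = defaultdict(list)
    for i, alert in enumerate(alerts):
        groups[find(i)].append(alert["id"])
    return [groups[root] for root in sorted(groups)]
```

Given an alert on a hypothetical host `edge-1` from source IP `203.0.113.7`, a second alert on `jenkins-3` from the same IP, and a third alert on `jenkins-3` with no source IP, all three end up in one incident, because the second alert links the edge machine to the Jenkins host. An unrelated alert on another host with another IP stays in its own group.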

As a security product that provides consistent protection and visibility for physical machines, virtual machines, containers, and serverless workloads, Volcano Engine CWPP will continue to combine the overall network environment with ByteDance's internal practices and technical innovation to provide users with a more comprehensive host security solution.
