As the confrontation with the black market (organized fraud groups) escalates, mining one's own data has become an important weapon for the defense side, and machine learning has become one of its sharpest tools.
The application of machine learning to risk control differs from typical scenarios in several ways:
- Labeling is expensive, so data is often partially or entirely unlabeled.
- Attackers actively evolve their cheating strategies, so a fixed model degrades over time.
- Results must be returned in real time or near real time.
- Scenarios are diverse: account takeover, scalpers claiming coupons, offer-wall abuse, anti-crawling, and so on.

Below, three concrete examples illustrate how we use machine learning to solve specific problems in account risk control.
First, with too many rules, how do we ease the maintenance burden?
For risk control systems, rules take effect immediately and, backed by expert experience, can play an important role for a long time. As time goes by, however, more and more rules are added, more and more edge cases must be weighed, and eventually a small change can ripple through the whole system. How can these historical rules be maintained more sustainably?
Analyzing these historical rules, we find that each rule is essentially a manual summary and quantification of an anomaly. For example: if an IP address issues more than 50 requests per minute and its login failure rate exceeds 90%, block the IP. Rule maintenance then reduces to a cleaner question: given samples and features, how do we quantify anomalies?
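The IP rule above can be sketched as a tiny predicate. This is purely illustrative; the thresholds (50 requests per minute, 90% failure rate) are the ones quoted in the text:

```python
def should_block_ip(requests_per_minute: int, login_failure_rate: float) -> bool:
    """A hand-written risk rule: block an IP when both request volume
    and login failure rate are abnormal (thresholds from the example)."""
    return requests_per_minute > 50 and login_failure_rate > 0.9

print(should_block_ip(80, 0.95))  # an obviously abusive IP
print(should_block_ip(10, 0.20))  # a normal IP
```

Each such rule hard-codes both the features (request rate, failure rate) and the thresholds, which is exactly what makes a large rule set brittle to maintain.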
Given unlabeled data and attackers who adapt on their own initiative, we can choose an unsupervised anomaly detection model to separate anomalies from normal traffic. Among the candidates, the isolation forest proved the most effective.
However, given the complexity of real online traffic, we also want to score each sample's degree of abnormality, backtest strategies against historical data, and set thresholds flexibly to achieve the best recall, thereby truly quantifying anomalies.
In the end, we quantified anomalies and used the scores to improve the online rules. Rule maintenance thereby becomes feature maintenance: as long as the relevant features and their combinations can be generated, a corresponding lightweight anomaly model can be trained, enabling flexible management of online rules. Compared with the original rule-based interception, the model covers over 80% of what the rules intercepted, and recall has increased by over 70%.
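A minimal sketch of this approach, assuming scikit-learn and synthetic two-feature data (requests per minute, login failure rate): an isolation forest scores each sample's abnormality, and a threshold is then chosen against the score distribution, standing in for the backtesting step described above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-IP features: [requests per minute, login failure rate]
normal = np.column_stack([rng.normal(10, 3, 500), rng.uniform(0.0, 0.3, 500)])
abnormal = np.column_stack([rng.normal(80, 10, 10), rng.uniform(0.8, 1.0, 10)])
X = np.vstack([normal, abnormal])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)
# score_samples: higher means more normal; negate so higher means more anomalous
scores = -model.score_samples(X)

# In production the threshold would come from backtesting on historical data;
# here we simply flag the top 2% most anomalous samples.
threshold = np.quantile(scores, 0.98)
flagged = scores > threshold
print(flagged[-10:].mean())  # fraction of the injected anomalies caught
```

Because the model outputs a continuous score rather than a hard rule, the same trained model supports different thresholds for different strategies.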
Second, the imperceptible captcha
As the confrontation escalates, captchas are becoming increasingly ineffective at telling humans from machines. Attackers' cracking methods have long since moved from human captcha farms to deep-learning solvers. Even ordinary users sigh when facing a captcha: can a human really answer this? Perhaps only a machine can.
Since captchas cannot stop machines yet do hinder normal users, is there a more user-friendly replacement? Here we introduce human-machine recognition, which uses the phone's sensors to distinguish humans from machines without the user noticing.
Smartphones contain various sensors, such as accelerometers, gyroscopes (angular velocity), gravity sensors, and magnetometers. Whenever the phone is running, these sensors sample at hundreds or even thousands of hertz. Can such sensor data separate humans from machines? After all, attackers can only drive the phone through APIs or external force to produce physical motion, and thus sensor readings that mimic real movement. The options available to them include: keeping the phone stationary, making it vibrate, and placing it in a swinging rig. We collected the sensor data generated in these scenarios; the four plots below compare them visually:
From top to bottom, the plots correspond to: phone stationary, phone vibrating, phone swinging, and a normal tap operation. Even when stationary, the sensors show slight jitter; vibration produces regular fluctuations; the swinging rig yields data very close to a sine wave; and a normal tap produces a sharp rise or fall in acceleration from the finger's touch. It follows that deciding whether sensor data was generated by a human, or more precisely whether a given swipe or tap was produced by the current sensor stream, is exactly the kind of binary classification problem machine learning can handle.
The result is a human-machine recognition model that determines whether the sensor data collected by the SDK was generated by a human. Measured on online interception, overall accuracy is 97.3%, with 1.8% of legitimate users mistakenly intercepted.
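The production model is not described in detail, so the following is only an illustrative sketch: assuming simulated one-dimensional accelerometer windows and hand-crafted features, a simple classifier separates a human tap's sharp spike from a swing rig's smooth sine wave.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_window(human: bool, n: int = 200) -> np.ndarray:
    """Simulated accelerometer window (hypothetical data for illustration).
    Machine-driven motion is modeled as a smooth sine (swing rig); a human
    tap adds a sharp spike on top of low-level sensor jitter."""
    t = np.linspace(0, 2, n)
    if human:
        x = rng.normal(0, 0.05, n)
        x[n // 2] += 3.0  # sharp acceleration spike from a finger tap
        return x
    return np.sin(2 * np.pi * t) + rng.normal(0, 0.05, n)

def features(x: np.ndarray) -> list[float]:
    # Simple hand-crafted features: spikiness and overall energy
    dx = np.diff(x)
    return [np.abs(dx).max(), x.std(), np.abs(x).mean()]

X = np.array([features(make_window(human=i % 2 == 0)) for i in range(200)])
y = np.array([i % 2 == 0 for i in range(200)], dtype=int)

clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print(clf.score(X[150:], y[150:]))  # held-out accuracy
```

On real sensor streams the feature engineering and model would be far richer, but the framing is the same: window the raw signal, extract features, classify human vs. machine.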
From the confrontation perspective, the main purpose of human-machine recognition is to lower the entry barrier for legitimate users; raising the barrier for attackers is only a secondary, supplementary goal, because the security of human-machine recognition depends heavily on the security of the SDK. Once the SDK comes under attack in practice, we add more collected features to it, turning it into a probe and bringing the main battlefield back to the client itself.
Third, a post-event early-warning mechanism
Risk control is usually divided into three stages: before, during, and after the event. Given real-time constraints, it is hard to achieve high recall with pre-event and in-event interception alone. But even in the post-event stage, if we can discover anomalies early and turn them into high-precision interception strategies, the effort still pays off.
We know that attackers' behavior tends to cluster: aggregation on devices, IPs, and so on, or synchronized behavior, such as a group of UIDs doing the same thing at the same time. This clustering and synchrony closely resemble an infectious-disease outbreak, where patients share attributes and behave in sync: they all live in a particular area, or have all visited the same public place. Timely post-event warning in risk control thus resembles the problem public health institutions face with diseases such as COVID-19: how do we detect an outbreak in time and identify the large number of newly infected people?
Suppose the health institutions have collected the following (partial) patient consultation records:
| Name | Gender | Symptoms | Residential community | Workplace | Consultation date |
| --- | --- | --- | --- | --- | --- |
| Zhang San | Male | Cold | 1st Street | Vegetable market | 2020/1/4 |
| Li Si | Female | Pneumonia | 22 Road | Seafood market | 2020/1/5 |
| Wang Wu | Male | Pneumonia | Changjiang Family Dormitory | Fruit and vegetable market | 2020/1/5 |
| Zhao Liu | Female | Pneumonia | Huanghe Village | Seafood market | 2020/1/6 |
| Qian Qi | Male | Pneumonia | Jianghan Road | Seafood market | 2020/1/6 |
For example, when the number of pneumonia patients rose sharply in week 2020W1, locating this anomaly requires counting the same indicator over comparable historical periods (last week, last month, last year) and using statistical tests (a chi-square test, Fisher's exact test, etc.) to measure how much the current indicator differs from the historical ones (the p-value). In our data, the number of pneumonia patients in 2020W1 is significantly higher than in the same historical periods, completing the first step.
| Week | Total respiratory patients | Pneumonia patients | p-value |
| --- | --- | --- | --- |
| 2019W1 | 100 | 8 | 0.0193 |
| 2019W48 | 120 | 8 | 0.0052 |
| 2019W52 | 160 | 15 | 0.0187 |
| 2020W1 | 200 | 40 | 1 |
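As a sketch of this test, assuming SciPy and taking the 2020W1 and 2019W1 counts from the table, Fisher's exact test compares the two weeks' pneumonia proportions:

```python
from scipy.stats import fisher_exact

# 2x2 contingency table from the counts above:
# rows = week, columns = [pneumonia, other respiratory patients]
table = [[40, 200 - 40],   # 2020W1: 40 of 200
         [8, 100 - 8]]     # 2019W1:  8 of 100
_, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)
```

A small p-value here means the jump from 8% to 20% pneumonia is unlikely to be random fluctuation, which is exactly the signal used to flag the abnormal week.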
After spotting an abnormal indicator, we also want to locate the source of the infection and the close contacts. In epidemiology this calls for an epidemiological investigation, whose key is to find the source of infection and the mode of transmission. For risk control, it means uncovering the intricate connections among attackers, following a simple heuristic: the more closely someone is connected to known bad actors, the more likely they are to be bad.
For such problems it is natural to reach for graph techniques. In the infectious-disease setting, graph analysis can identify close contacts centered on the seafood market; in the risk-control setting, graph association analysis likewise exposes attackers in practice. We found the Louvain community detection algorithm works well: it needs no labels, runs fast, and is accurate in practice.
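A minimal sketch with NetworkX, which ships Louvain as `nx.community.louvain_communities` since version 3.0. The graph below is hypothetical: an edge links an account (UID) to a device it logged in from, so accounts sharing a device fall into the same community.

```python
import networkx as nx

# Hypothetical account-device graph: edges link UIDs to shared devices
G = nx.Graph()
G.add_edges_from([
    ("uid1", "dev_A"), ("uid2", "dev_A"), ("uid3", "dev_A"),  # device-sharing cluster
    ("uid4", "dev_B"), ("uid5", "dev_B"),                     # smaller cluster
    ("uid6", "dev_C"),                                        # an isolated user
])

# Louvain community detection: unsupervised, fast, no labels required
communities = nx.community.louvain_communities(G, seed=42)
print(communities)
```

If `uid1` is a confirmed bad actor, the other members of its community (accounts on the same device) become high-priority suspects, which is the "closer contact, more suspicious" heuristic made concrete.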
Overall, with the help of machine learning algorithms, the defense side can revisit the massive data it holds, mine more insight and understanding from it, and maintain a more proactive position in the confrontation.
