A brief discussion on the high availability design of service interfaces

Author: Wang Lei, JD Retail

Preface

As a backend developer, building service interfaces is my daily work. Whether an interface serves HTTP requests from the front end or RPC calls from other services, one topic cannot be avoided: high availability. Developing an interface often looks simple, but ensuring its high availability is not as easy as it seems. Let's look at what a highly available interface needs to take into account. Criticism and corrections are welcome if anything in the text falls short.

What is high availability

Simply put, it is whether our system has the ability to handle and avoid risks.

Why do we need high availability

1. Programs are written by people, and mistakes made during development can lead to online incidents
2. The system depends on its operating environment (CPU, memory, hard disk, network, and so on), any part of which can fail
3. If the registration interface fails while new users are signing up, user experience suffers
4. During large promotions such as Double 11 and 618, a surge of orders can bring down the order service interface and hurt GMV
5. Other unknown factors

In short, we must ensure high availability in order to cope with these uncontrollable factors.

Key points of high availability

As stated above, the essence of high availability is whether the system can handle and avoid risks. From this perspective, four key factors should be considered when designing a highly available interface: dependency, probability, duration, and scope.

1. The service relies on as few resources as possible
2. The probability of a risk occurring is sufficiently low
3. The scope of impact is sufficiently small
4. The duration of the impact is sufficiently short

Several Principles of Interface High Availability Design

Combining these key points, let's take a look at the specific precautions

1. Control Dependency

The fewer dependencies the better, and the fewer strong dependencies the better

Less Dependency
For example: at a routine load of 10 requests per minute, querying MySQL directly is enough to meet the need. Blindly introducing Redis at this point wastes resources and adds complexity to the system.

Weak Dependency
For example: if the user registration service strongly depends on the new-user coupon distribution service, the entire registration flow becomes unavailable whenever the coupon service fails. A better approach is to make the call asynchronous and treat it as a weak dependency, so that an unavailable coupon service does not affect the registration link.
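The weak-dependency idea above can be sketched as follows. This is a minimal illustration, not the author's implementation: `register_user`, `create_account`, `issue_coupon`, and `CouponServiceError` are hypothetical names introduced only for this example.

```python
import logging

logger = logging.getLogger("registration")


class CouponServiceError(Exception):
    """Raised by the hypothetical coupon service when it is unavailable."""


def register_user(username, create_account, issue_coupon):
    """Register a user while treating coupon issuance as a weak dependency.

    create_account and issue_coupon are hypothetical callables standing in
    for the real account store and the coupon distribution service.
    """
    # Strong dependency: if this raises, registration fails.
    user_id = create_account(username)

    coupon_issued = True
    try:
        # Weak dependency: a failure here is recorded but tolerated.
        issue_coupon(user_id)
    except CouponServiceError:
        coupon_issued = False
        logger.warning("coupon issuance failed for user %s", user_id)

    return {"user_id": user_id, "coupon_issued": coupon_issued}
```

The caller still receives a successful registration when the coupon service is down; the `coupon_issued` flag leaves room for a later compensation job.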

2. Avoid Single Point

The core of avoiding a single point of failure is to quickly perform fault tolerance through backup or redundancy

1. We deploy the application across multiple data centers and regions to spread the risk of failure, so that when one server fails, the others can continue to serve traffic
2. On every release we keep the previously released version, so that if the new version has a problem we can quickly roll back to it
3. At least 2 people should be familiar with the business behind each interface, so that either of them can quickly handle a problem with the online service
4. Middleware such as MySQL and Redis supports master-slave cluster deployment of data

There are many similar examples, so they will not be listed one by one here

3. Load Balancing

Distribute the risk and prevent it from spreading

For example: whether it is Nginx or JSF, the purpose of load balancing is to distribute traffic as evenly as possible across server nodes, so that no single node hits a system bottleneck, fails, and triggers a chain of further risks.

I think every developer knows the example above and already does this, but have we considered balance in every scenario?

For example: to improve read concurrency, we usually cache data in JIMDB. If one cached key becomes hot, the JIMDB shard that holds it comes under high load. That shard also caches other data, and because of the high CPU load its query performance degrades and a large number of requests time out, affecting the business. When interface design runs into a scenario like this, the balance of data storage must also be fully considered, and hot data should be monitored so that it can be rebalanced dynamically at any time.

4. Resource Isolation

The purpose of isolation is to control the risk within a manageable range and prevent the risk from spreading

For example: interfaces and services are deployed on physically isolated resources, so that a failure in a single data center or on a single server does not affect the entire service

For example: when storing business data, we split it across databases and tables and store it on different shards, so that the failure of a single server does not affect the entire service
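The database-and-table split above can be sketched with a stable hash that maps a key to a shard. This is only an illustration under assumptions: the function name `shard_for`, the key format, and the counts of 4 databases with 8 tables each are hypothetical and not taken from the article.

```python
import zlib


def shard_for(user_id: str, db_count: int = 4, tables_per_db: int = 8):
    """Map a key to (database index, table index) with a stable hash.

    zlib.crc32 is used because, unlike the built-in hash(), it returns the
    same value in every process, so routing stays consistent across servers.
    """
    slot = zlib.crc32(user_id.encode("utf-8")) % (db_count * tables_per_db)
    return slot // tables_per_db, slot % tables_per_db
```

Because each key lands on exactly one shard, losing one database server affects only the keys routed to it rather than the whole data set.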

5. Interface Flow Control

Flow control is a protective measure whose purpose is to keep risk within a manageable range

When developing an interface, we must design rate-limiting measures around the actual business traffic. Throttling protects not only our own service resources but also the resources we depend on.

At present the group's JSF provides throttling capabilities for traffic control, and we can also build our own throttling modules to fit the actual business
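As one possible shape for a self-built throttling module, here is a minimal token-bucket sketch. The token bucket is a common rate-limiting technique chosen for illustration; it is not a description of how JSF throttles, and the `TokenBucket` class and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.clock = clock        # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A handler would call `allow()` before doing any work and return a "busy" response when it yields False. This single-threaded sketch would need a lock around `allow()` if shared between threads.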

6. Service Circuit Breaking

Circuit breaking is also a kind of protection measure. The purpose is to control the risk within the controllable range and avoid the spread of risk

For example: our service A often calls several services such as B, C, and D at the same time. When one of these dependencies fails or its performance drops, the availability of the whole service decreases. When developing such services we must therefore pay attention to isolation between interfaces. We can use a component such as Hystrix, or rely on DUCC for manual isolation.

In fact, circuit breaking is also a way of controlling resource dependency: it downgrades a strong dependency to a weak one
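To make the mechanism concrete, here is a simplified circuit breaker sketch. It is not Hystrix's implementation or API; the `CircuitBreaker` class, its thresholds, and the `fallback` parameter are hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With a fallback supplied, service A keeps answering with a default value while dependency B is down, which is the strong-to-weak downgrade described above.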

7. Asynchronous Processing

Convert synchronous operations to asynchronous operations

For example: when users claim benefits on the user page during a major promotion, the number of users is very large. To avoid overloading the system, the claim request is accepted asynchronously through MQ, and the coupon is issued after the request has been received. This greatly reduces the scope of impact of an incident and also lowers the probability of problems occurring.
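The accept-then-process pattern above can be sketched with an in-process queue standing in for the MQ. This is an assumption made to keep the example self-contained: a real system would use a message broker, and the names `claim_queue`, `handle_claim_request`, and `coupon_worker` are hypothetical.

```python
import queue
import threading

claim_queue = queue.Queue()  # in-process stand-in for an MQ topic
issued = []                  # stand-in for the coupon store


def handle_claim_request(user_id):
    """Request path: enqueue the claim and return immediately."""
    claim_queue.put(user_id)
    return "accepted"


def coupon_worker():
    """Consumer: issue coupons at its own pace, independent of request load."""
    while True:
        user_id = claim_queue.get()
        try:
            if user_id is None:  # sentinel value: stop the worker
                return
            issued.append(user_id)  # placeholder for the real coupon issuance
        finally:
            claim_queue.task_done()
```

Starting the consumer with `threading.Thread(target=coupon_worker, daemon=True).start()` lets the request path respond as soon as the message is queued, so a traffic spike lengthens the queue instead of overloading the coupon service.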

8. Degradation Plan

Service degradation is a kind of remedial measure after the occurrence of problems. Through service degradation, the impact range of risk can be reduced

For important service interfaces we must have a complete degradation plan. Note that degradation comes at a cost, so the problems that may arise must be considered before the system is developed. The premise of degradation is to keep the core business running by degrading non-core business.

For example: during the peak of a major promotion, many functions are usually degraded in advance and traffic is limited at the same time, mainly to protect the ordering and payment experience of the vast majority of users during the peak
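A degradation switch for a non-core feature might look like the sketch below. The switch name, the `order_page` function, and the recommendation feature are hypothetical; in practice the switch would normally live in a configuration center rather than a module-level dictionary.

```python
# Hypothetical switch table; True means the feature is currently degraded.
DEGRADE_SWITCHES = {"degrade_recommendations": False}


def fetch_recommendations(user_id):
    """Placeholder for a call to a non-core downstream service."""
    return [f"item-for-{user_id}"]


def order_page(user_id, switches=DEGRADE_SWITCHES):
    """Build the order page; checkout is core, recommendations are not."""
    page = {"user_id": user_id, "checkout_enabled": True}
    if switches.get("degrade_recommendations"):
        # Degraded: skip the non-core call and return an empty section.
        page["recommendations"] = []
    else:
        page["recommendations"] = fetch_recommendations(user_id)
    return page
```

Turning the switch on before the peak removes the non-core call from the request path while the core checkout flow stays untouched.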

9. Gray Release

Reduce the scope of impact of risk through gray release

For example: when launching a new service, we use a gray release strategy to let some users try the new version first. We collect their feedback, along with evaluations of the new version's functionality, performance, and stability, and then decide whether to keep expanding the release toward a full upgrade or to roll back to the old version. The online feedback helps us find and fix gaps, and if a major problem is discovered we can roll back to the old version.
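One common gray release strategy is a percentage rollout keyed on a stable hash of the user ID, sketched below. The article does not say which strategy is used, so this is an assumption, and the function name `in_gray_release` and the `salt` parameter are hypothetical.

```python
import hashlib


def in_gray_release(user_id: str, percent: int, salt: str = "new-version") -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets.

    The hash is stable, so a user stays in the same bucket as the rollout
    grows, and a different salt reshuffles users for a different release.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent
```

Raising `percent` step by step expands the release, since every user admitted at a lower percentage remains admitted at a higher one. Setting it back to 0 routes everyone to the old version again.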

10. Chaos Engineering

By using some destructive means in advance, potential problems can be discovered in advance

For example: a complex interface system depends on many services and components, any of which may fail at any time. Once one fails, could it set off a butterfly effect that makes the whole service unavailable without our being aware of it? We can therefore use the chaos engineering capability of the Taihang platform to run drills, prepare plans for the scenarios that arise, and keep the risk within a controllable range.
