1. Business background
System business functions: the system handles data processing and integration internally, and exposes two services to external systems: initialization (writing) of result data and querying of result data.
System network architecture:

• Impact of the deployment architecture on the cutover launch: launching the internal management system has no impact on other systems' read traffic
• The distributed cache can be scaled independently; it is decoupled from the storage and query upgrades
• Because the cache layer isolates them, external systems remain unchanged during the expansion; only the internal management system is upgraded
• During the internal launch and verification, read services remain available (apart from the initialization operations of business scenario 1), reducing the impact of going live
2. Overall implementation plan for this upgrade
Overall implementation plan diagram:
(1) Setting goals
Full-volume product channel cutover plan (the target total is 10 times the current volume):
Current state:
The commonly used tables in the current database all exceed 50 million records, with some around 60 million, which is the practical capacity ceiling of a single MySQL table; they cannot support the full cutover.
Goal:
Support up to 900 million records: per the sharding plan, the system will hold roughly 670 million records after the full cutover; keeping 25% headroom gives 837.5 million, rounded up to 900 million. This value carries ample redundancy and can support the data growth of the next 5 years.
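The capacity target above is simple arithmetic; as a sketch (the rounding granularity of 100 million is an assumption inferred from the 837.5 → 900 million rounding):

```python
import math

# Capacity sizing from the sharding plan: ~670 million records after the
# full cutover, plus 25% headroom, rounded up to the next 100 million.
def sized_capacity(expected: int, headroom: float = 0.25,
                   round_to: int = 100_000_000) -> int:
    """Expected volume plus headroom, rounded up to a multiple of round_to."""
    with_headroom = expected * (1 + headroom)   # 670M -> 837.5M
    return math.ceil(with_headroom / round_to) * round_to

print(sized_capacity(670_000_000))  # 900000000
```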
Time goals: scheme finalized in early August; launch and verification from August 17 to August 22; cutover starts on August 24.
(2) Current system status
1. Resource usage
• Current deployment structure
——Data center distribution, MySQL: 1 master, 4 slaves (data center A: 1 master, 3 slaves; data center B: 1 read-only slave)
——Data center distribution, Doris: 32-core nodes, 63 nodes, 3 replicas
• Current application container (Docker) count and DB max connections
——Application containers: 62 (web group: 25, worker group: 31, MQ group: 6)
——DB max connections: 100 (configured per container)
• Is the current business read-write separated, and what is the read/write ratio?
——No read-write separation
• For each business scenario, can master-slave replication delay be tolerated? For how long?
——Most modification operations by business users are synchronous: the result is returned to the front end on completion. Because users query immediately after operating, replication delay cannot be tolerated there.
——Background task scenarios that process intermediate data can tolerate master-slave delay.
• At the product level, when the system hits bottleneck pressure, is rate limiting acceptable? Is delayed data display acceptable?
——The external service interfaces are not modified in this release, so they are unaffected; business page traffic is low, so short delays are acceptable.
• Does the team have ES experience?
——Partial familiarity; not yet used in this project
2. Internal database usage
A general inventory template was used to comprehensively survey the system's current state.
Table size, business scenarios, and other information (excerpt):
| Table | Current rows | Max rows (single table) | Fields | Shardable | Sharding key | SQL without sharding key | Cross-table query | Write:read ratio | Read right after write | Use cases | Truncatable | Truncation window | Est. rows after cutover | Distributed DB | ES (has complex query) | ES direct dual write (read right after write) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Approval flow table | 35M | 40M | 43 | Yes | sku | Yes | Yes | 1000:1 | Yes (manual query after write) | 1. Create approval flow on page; 2. Query approval flow on page; 3. Invalidate data on page; 4. Callback updates from the approval platform | No | | +300M | ✅ | ✅ | Not needed: after a UI modification the user must click "Query" again |
| Approval flow details table (historical data cleaned) | 8M | 40M | 20 | Yes | sku (newly added) | No | Yes | 1000:1 | Yes (manual query after write) | 1. Refresh approval flow (delete + insert); 2. Query in-progress approvals (tasks) | Cold backup once approval completes | Until cold backup | | ✅ | ✅ | |
| Business data table 1 | 33M | 40M | 15 | Yes | sku | No | No | 100:1 | | 1. Create after approval; 2. Invalidate/delete data; 3. Background tool: sync cache (complex, paginated query) | No | | +300M | ✅ | ❌ | ❌ |
| Business data table 2 | 59M | 40M | 16 | Yes | sku | Yes (delete by id after abnormal insert) | No | 1000:1 | | 1. Query/export dimension-1 data; 2. Query dimension-2 data; 3. Background tool: sync cache | No | | +500M | ✅ | Sync big-data pushes to cache; query by creator field; multi-SKU paginated query | ❌ |
| Support data table (computed and pushed by the big-data platform) | 12M | 40M | 12 | Yes | item_sku_id | No | No | 5:1 | | 1. Ops tool: add/delete records; 2. Clean up historical data (task); 3. Data query (display); 4. Calculation; 5. Big-data push | Pushed by date; only 3 days retained | Historical data unused (move to Doris?) | 30M to 40M per day | ✅ | Delete by data date (dt) | ❌ |
| … |
(3) Technology selection
System characteristics: high-concurrency single-table writes, complex reads.
1. Storage selection:
Conclusion:
Internal distributed database: expanded from a single shard to multiple shards, covering massive data storage and simple queries
ES: newly introduced, for complex queries (tokenized search) and global sorting
Redis: retained; needs expansion
Doris: retained; capacity increased
Complex queries (rationale: front-end business access includes multi-table joins, with two tables at the tens-of-millions scale joined together; as table sizes grow, join performance degrades and no longer meets business efficiency requirements)
Complex query decision factors:
| Decision indicator | Distributed DB (MySQL) | ES | Doris | TiDB |
|---|---|---|---|---|
| Product positioning | Database (OLTP) | Search engine | Database (OLAP) | Database (OLTP + 80% OLAP) |
| Advantages | 1. High-concurrency, high-throughput transaction processing; 2. Stability; 3. Real-time data (read right after write) | 1. Full-text indexing; 2. Complex structured queries | High-concurrency analytical queries | 1. Handles both transactions and analytics; 2. Automatic scaling; 3. Real-time data (read right after write) |
| Disadvantages | 1. Weak at large-scale data analysis; 2. Manual scaling | 1. No transaction processing; 2. Not real-time (no read-right-after-write) | 1. No transaction processing; 2. Not real-time (no read-right-after-write) | Weak at high-concurrency, high-throughput transactions |
| Reliability | High (multiple data centers) | High (multiple data centers) | Low (shared cluster) | Low (single data center) |
| Scalability | Database level: platform-managed; table level: application-controlled | Platform-managed | Database level: platform-managed; table level: application-controlled | |
| Data consistency | | Eventual consistency | Eventual consistency | Strong consistency |
| O&M support | DBA team | Department-level O&M | No dedicated O&M team | Department-level DBA |
| Summary | Complex business queries are slow; cannot support cross-table queries, multi-dimensional complex queries, or global sorting. Single-table queries on the sharding key are fast | High performance for complex business queries | Deployed as a shared cluster; write performance in particular is heavily affected by external tenants | Deployed in a single data center, which cannot meet the reliability requirements of a level-0 system |
| Architecture goal | | | | |
| Conclusion: massive storage, high-concurrency writes | ✅ Large-volume storage; fast single-table queries on the sharding key; single-shard transaction processing | ✖️ No high-concurrency transaction processing | ✖️ Queries heavily affected by writes/updates | ✖️ Transaction reliability under high concurrency |
| Conclusion: query complexity | ✖️ Poor performance; cross-shard queries possible | ✅ Reliable complex business queries at large data volumes | ✖️ Queries heavily affected by writes/updates | ✖️ Reliability |
2. Data synchronization plan
Plan A: real-time synchronization
Description: use the DRC platform to configure near-real-time data synchronization from the distributed DB to ES. (DRC is the company's internal general-purpose data synchronization platform, usable between many kinds of data sources.)
Advantages: simple; no code development needed. Disadvantages: read-after-write business scenarios risk seeing inconsistent data.
Plan B: strongly consistent dual write
Description: write to both the distributed DB and ES, guaranteeing data consistency.
Advantages: guarantees consistency for read-after-write scenarios. Disadvantages: high code-development cost.
Data synchronization plan selection suggestion:
Start with Plan A -> verify online whether it meets the business operation experience -> then decide whether to implement Plan B.
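For reference, a minimal sketch of what Plan B's dual write could look like, assuming the relational DB is the source of truth and failed ES writes are queued for asynchronous retry (class and store names are illustrative, not the project's actual code):

```python
from queue import Queue

class DualWriter:
    """Plan B sketch: write the DB first, then ES; queue ES failures for retry."""

    def __init__(self, db: dict, es: dict):
        self.db, self.es = db, es          # dicts stand in for real clients
        self.retry_queue: Queue = Queue()  # drained by a background worker

    def write(self, key: str, record: dict) -> None:
        self.db[key] = record              # 1. commit to the source of truth
        try:
            self.es[key] = dict(record)    # 2. index into ES
        except Exception:
            self.retry_queue.put((key, record))  # 3. compensate later

db, es = {}, {}
writer = DualWriter(db, es)
writer.write("sku-1", {"status": "approved"})
print(db == es)  # True: read-after-write sees the same data in both stores
```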
Data synchronization difficulties and solutions:
Problems:
• The two-table join scenario cannot be synchronized directly by the DRC platform; the options are to develop a custom synchronization module (a jar embedded in the DRC task) or to drop DRC and synchronize in application code, and both take long to develop
• A denormalized ES index occupies a lot of space and holds many redundant records, and query results must be deduplicated, making queries complex
Difficulty: the process table and the process detail (node) table are queried with joins, and each table has its own insert/update/delete operations, so the data model pushed into ES is complex and hard to keep synchronized
Solution: add redundant fields to the database table, dedicated to ES queries
Add fields for pending reviewers and completed reviewers to the process table in the DB, with values separated by spaces. ES's tokenization then makes them searchable, and because only this one table needs syncing, the stock DRC tool can be used directly, cutting the development time for synchronization.
Solution cost: update the new process-table fields whenever detail rows are added or modified; develop a tool to backfill historical data.
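A sketch of the redundant-field idea: reviewers are flattened into one space-separated column on the process table, so a whitespace-tokenizing ES analyzer can match a single reviewer with a plain term query, and DRC can sync the one table directly. Field names and erp values here are illustrative:

```python
def build_reviewer_field(reviewers: list) -> str:
    """Flatten detail-table reviewers into one space-separated DB column."""
    return " ".join(reviewers)

def matches_reviewer(stored: str, erp: str) -> bool:
    """What a whitespace analyzer plus a term query effectively computes."""
    return erp in stored.split(" ")

field = build_reviewer_field(["zhangsan1", "lisi2", "wangwu3"])
print(matches_reviewer(field, "lisi2"))  # True: exact-token match
print(matches_reviewer(field, "lisi"))   # False: no partial-token match

# The equivalent ES request body would be a standard term query, e.g.:
es_query = {"query": {"term": {"pending_reviewers": "lisi2"}}}
```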
(4) Phased development and launch steps
1. System business transformation: adding table fields (August 10)
1) Add a sharding field to the business tables
Some business tables lack a sharding field and cannot be sharded directly. A new SKU sharding column is added to those tables, and the existing logic is modified to include SKU conditions, which also improves query efficiency.
2) Add redundant fields for the ES-related queries (with a data backfill)
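Conceptually, the new SKU column becomes the routing key: every query must carry an SKU so the middleware can resolve it to a single shard. A sketch with an assumed shard count and hash (the platform's real routing algorithm may differ):

```python
import zlib

SHARD_COUNT = 32  # assumption for illustration

def shard_for(sku: str, shard_count: int = SHARD_COUNT) -> int:
    """Stable shard index derived from the SKU sharding key."""
    return zlib.crc32(sku.encode("utf-8")) % shard_count

# A query carrying an SKU condition resolves to exactly one shard...
s = shard_for("100012043978")
assert 0 <= s < SHARD_COUNT
# ...while a query without the sharding field must scan every shard,
# which is why the existing logic is rewritten to always include the SKU.
```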
2. Synchronize and verify the sharded data (August 11)
1) Initialize the sharded distributed database and ES;
2) Configure DRC to run full and incremental synchronization from the original single database to the sharded distributed database;
3) Configure DRC to run full and incremental synchronization from the sharded distributed database to ES;
4) Periodically compare data consistency among the original database, the sharded distributed database, and ES with inspection tools.
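The inspection in step 4) can be as simple as comparing a row count plus an order-independent checksum per table across the stores; a sketch with in-memory dicts standing in for the old DB, the sharded DB, and ES:

```python
import hashlib

def table_digest(rows: dict) -> tuple:
    """(row count, order-independent checksum) of one store's table."""
    acc = 0
    for key, row in rows.items():
        row_bytes = (str(key) + repr(sorted(row.items()))).encode("utf-8")
        # XOR of per-row hashes is independent of iteration order
        acc ^= int.from_bytes(hashlib.sha256(row_bytes).digest()[:8], "big")
    return len(rows), format(acc, "016x")

old_db  = {1: {"sku": "a", "v": 1}, 2: {"sku": "b", "v": 2}}
sharded = {2: {"sku": "b", "v": 2}, 1: {"sku": "a", "v": 1}}  # same data
print(table_digest(old_db) == table_digest(sharded))  # True
```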
3. Read-traffic switch + verification (August 17)
1) Add a new AOP aspect and gradually shift read requests to the new application cluster via DUCC configuration (erp whitelist, full reads, result comparison, etc.)
2) After the product and business sides finish verification, switch all read traffic to the new application cluster (note: the new cluster uses a read-only database account)
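The gradual switch in step 1) can be pictured as a per-request decision driven by configuration: an erp whitelist first, then a stable percentage ramp up to full reads (the config keys and bucketing hash are illustrative, not the actual DUCC schema):

```python
import zlib

def route_to_new_cluster(erp: str, config: dict) -> bool:
    """True if this user's read requests should go to the new cluster."""
    if erp in config["whitelist"]:
        return True                               # pilot users first
    bucket = zlib.crc32(erp.encode("utf-8")) % 100
    return bucket < config["percent"]             # stable percentage ramp

cfg = {"whitelist": {"pilot.user1"}, "percent": 0}
print(route_to_new_cluster("pilot.user1", cfg))   # True: whitelisted
print(route_to_new_cluster("someone.else", cfg))  # False: ramp still at 0%
cfg["percent"] = 100
print(route_to_new_cluster("someone.else", cfg))  # True: full reads
```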
4. Write-traffic switch (August 21)
1) Before the launch, notify the business side and upstream/downstream systems of the launch time and estimated duration to reduce business impact
2) Put up a static page announcing that the system is unavailable during the upgrade, and switch the front-end domain to it to block user operations
3) Stop the original system group so the original single database receives no write traffic, and have the DBA set the original database read-only (stop the workers, pause MQ consumption)
4) Wait until the original database is fully synchronized to the target database, then verify the consistency of the old and new databases both manually and automatically
5) Switch the new system group to the read-write account and deploy
6) R&D and QA exercise the new group with test products; once clean, hand over to business users for verification (switching away from the static maintenance page)
7) Restart the workers and resume MQ consumption
5. Results after launch
The system has run normally since launch; from August 23 to the present, 2.6 billion products have been cut over. The system now supports 316 million product-category records; the largest DB table has reached 284 million rows; ES holds 43.56 million documents.
Before/after comparison: erp account xxx, which owns 290,000 records: the original query took 9 s; the new query takes 1 s.
3. Summary
What worked well:
• A comprehensive, clear inventory of the system's current state reduced complexity and improved quality
• A clear launch plan guided a sensible division of labor, shortened the launch window, and lowered launch risk
Unsolved problems:
The distributed database's distributed-transaction support is still weak: it cannot guarantee that multiple records spanning different shards are all modified correctly within one transaction, so the code must commit, read back after the commit, and verify to confirm the data was saved correctly.
When a single business user's product data reaches the millions, queries are still slow; query performance will continue to be optimized.
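The commit, read-after-commit, verify workaround described above can be sketched like this (an in-memory dict stands in for records that may land on different shards):

```python
def write_and_verify(store: dict, updates: dict) -> list:
    """Apply updates, read each record back, return keys that failed."""
    for key, record in updates.items():
        store[key] = record   # in reality these may span different shards
    # Read-after-commit verification replaces the missing cross-shard txn.
    return [k for k, rec in updates.items() if store.get(k) != rec]

db = {}
failed = write_and_verify(db, {"a": {"v": 1}, "b": {"v": 2}})
print(failed)  # []: every record read back matches what was written
```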
Author: Wang Kai, JD Retail
Source: JD Cloud Developer Community. Please credit the source when reproducing.
