1. Business background

System business functions: the system handles data processing and integration internally, and provides two services to external systems: initialization (write) of result data and querying of result data.

System network architecture:

• Impact of the deployment architecture on the cut-over launch: launching the internal management system has no impact on the read business of other systems

• The distributed cache can be scaled out independently; its expansion is unrelated to the upgrade of the storage and query functions

• Because the cache layer isolates them, external systems can remain unchanged throughout the expansion; only the internal management system is affected by the upgrade

• While the internal system is being launched and verified, read services can still be provided (except for the initialization operations tied to business scenario 1), which reduces the impact of the launch

2. Overall implementation plan for this upgrade:

Overall implementation plan diagram:

(1) Establish goals

Commodity full-volume channel cut-over plan (the total after cut-over is about 10 times the current volume):

Currently:

The commonly used tables in the current database all exceed 50 million rows (some have reached 60 million), which is at the practical capacity ceiling of a single MySQL table and cannot support the full cut-over;

Goal:

Support up to 900 million rows: according to the cut-over plan, the system will hold about 670 million rows after the full cut-over; retaining 1/4 redundancy gives 837.5 million, rounded up to 900 million. This target carries ample headroom and can cover data growth for the next 5 years.
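The headroom behind the 900 million target can be written out explicitly (values taken from the plan above):

```latex
% Capacity target: post cut-over volume plus 1/4 redundancy, rounded up
6.7\times10^{8}\ \text{rows} \times \left(1+\tfrac{1}{4}\right)
  = 8.375\times10^{8}\ \text{rows}
  \;\approx\; 9\times10^{8}\ \text{rows (rounded up)}
```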

Time goal: solution designed in early August; launch and verification from August 17 to August 22; the cut-over starts on August 24

(2) Current system status

1. Resource usage

• Current deployment structure

——Data center distribution, MySQL: 1 master, 4 slaves (data center A: 1 master + 3 slaves; data center B: 1 read-only slave)

——Data center distribution, Doris: 32C, 63 nodes, 3 replicas

• Current application container (Docker) count and DB maximum connection count

——Application container count: 62 (Web group: 25, Worker group: 31, MQ group: 6)

——DB maximum connection count: 100 (configured per container)

• Is the current business read-write separated, and what is the read/write ratio?

——No read-write separation

• Can master-slave replication delay be tolerated under the various business scenarios? If so, how long a delay is tolerable?

——Most modification operations by business staff are synchronous, returning the result to the front end once the operation completes; for this operate-then-query usage, delay cannot be tolerated

——Background task scenarios, which process intermediate data, can tolerate master-slave delay

• At the product level, when the system hits a bottleneck, is rate limiting acceptable? Is delayed data display acceptable?

——The external service interfaces are not being developed in this upgrade and are unaffected; business page traffic is low, so short-term delay is acceptable

• Does the team have ES usage experience?

——Some familiarity, but ES has not been used in this project

2. Internal usage of the database

A general inventory framework was used to comprehensively survey the current state of the system

Table sizes, business scenarios and other information (excerpt):

Inventory dimensions recorded per table: table name, current record count (unit: 10,000), maximum supported record count (unit: 10,000), number of fields, whether a sharding key can be chosen, sharding key field, whether any SQL lacks the sharding field, whether cross-table query scenarios exist, read/write ratio, whether data is queried immediately after writing, use cases, whether data can be truncated, acceptable truncation window, estimated volume after cut-over, distributed DB/ES judgment condition (is there a complex query), and ES direct double-write judgment condition (is there a query immediately after writing).

• Approval flow table: 35 million records now, 40 million max; 43 fields; sharding key: yes (sku); SQL without the sharding field: yes; cross-table queries: yes; read/write ratio 1000/1; manual query after writing: yes. Use cases: 1. page creation of approval flows, 2. page query of approval flows, 3. setting page data to invalid, 4. callback modification from the approval platform. Data cannot be truncated; estimated +300 million after cut-over. Double-write judgment note: after a UI modification the user must click "Query" again.

• Approval flow details table (historical data already cleaned): 8 million records now, 40 million max; 20 fields; sharding key: yes (add SKU); SQL without the sharding field: none; cross-table queries: yes; read/write ratio 1000/1; manual query after writing: yes. Use cases: 1. refreshing an approval flow (delete + add), 2. querying in-progress approvals (tasks). Data is converted to cold backup after approval.

• Business data table 1: 33 million records now, 40 million max; 15 fields; sharding key: yes (sku); SQL without the sharding field: none; cross-table queries: none; read/write ratio 100/1. Use cases: 1. creation after approval, 2. data invalidation and deletion, 3. background tools: cache synchronization (complex and paginated queries). Data cannot be truncated; estimated +300 million after cut-over.

• Business data table 2: 59 million records now, 40 million max; 16 fields; sharding key: yes (sku); SQL without the sharding field: exists (deletion by id after abnormal insertion); cross-table queries: none; read/write ratio 1000/1. Use cases: 1. business query/export of dimension-1 data, 2. business query of dimension-2 data, 3. background tools: cache synchronization. Data cannot be truncated; estimated +500 million after cut-over. Note: big-data push data is synchronized to the cache and queried by the creator field; multi-SKU paginated queries exist.

• Support data table (calculated and pushed by the big data platform): 12 million records now, 40 million max; 12 fields; sharding key: yes (item_sku_id); SQL without the sharding field: none; cross-table queries: none; read/write ratio 5/1. Use cases: 1. operations tools: add/delete records, 2. historical data cleanup (task), 3. data query (for display), 4. calculation, 5. big data push. Data is pushed by date and currently retained for 3 days (whether historical data is still useful for Doris is an open question); 30 to 40 million records per day; data is deleted by dt.

… (remaining tables omitted)
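The last two columns of the inventory drive each table's storage and synchronization decision. A minimal sketch of that decision logic, in Java for illustration only (the class and field names are not from the actual inventory tool):

```java
/** Illustrative only: storage/sync decision derived from the last two inventory columns above. */
public class TableInventoryDecision {

    /** Simplified view of one inventory row (names are hypothetical). */
    public record TableProfile(String tableName,
                               boolean hasComplexQuery,               // "judgment condition: complex query"
                               boolean readsImmediatelyAfterWrite) { } // "query immediately after writing"

    /** Tables with complex queries also go to ES; everything stays on the sharded distributed DB. */
    public static String storageFor(TableProfile t) {
        return t.hasComplexQuery() ? "distributed DB + ES" : "distributed DB only";
    }

    /** Read-after-write tables would need ES double write; the rest can rely on DRC near-real-time sync. */
    public static String syncFor(TableProfile t) {
        if (!t.hasComplexQuery()) {
            return "no ES sync needed";
        }
        return t.readsImmediatelyAfterWrite() ? "double write to ES" : "DRC sync to ES";
    }
}
```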

(3) Technical solution selection

System characteristics: high-concurrency single-table writes, complex reads

1. Storage selection:

Conclusion:

Internal distributed database: expanded from a single shard to multiple shards, handling massive data storage and simple queries

ES: newly introduced, to support complex queries (tokenized search) and global sorting

Redis: retained, needs expansion

Doris: retained, capacity increased

Why ES for complex queries: front-end business access includes multi-table join scenarios (joins across 2 tables each at the tens-of-millions scale); as table sizes grow, join performance degrades and can no longer meet the business's efficiency requirements.

Complex query decision factors:

| Decision indicator | Distributed DB (MySQL) | ES | Doris | TiDB |
| --- | --- | --- | --- | --- |
| Product positioning | Database (OLTP) | Search engine | Database (OLAP) | Database (OLTP + 80% OLAP) |
| Advantages | 1. High-concurrency, high-throughput transaction processing 2. Stability 3. Real-time data (read immediately after write) | 1. Full-text indexing 2. Complex structured queries | High-concurrency query analysis | 1. Handles both transactions and analysis 2. Automatic scaling 3. Real-time data (read immediately after write) |
| Disadvantages | 1. Large-scale data analysis 2. Manual scaling | 1. Transaction processing 2. Read-immediately-after-write | 1. Transaction processing 2. Read-immediately-after-write | High-concurrency, high-throughput transaction processing |
| Reliability | High (multi data center) | High (multi data center) | Low (shared cluster) | Low (single data center) |
| Scalability | Database dimension: platform-managed; table dimension: application-controlled | Platform-managed | Database dimension: platform-managed; table dimension: application-controlled | |
| Data consistency | Eventual consistency | Eventual consistency | Strong consistency | |
| O&M support | DBA | Branch O&M | No dedicated O&M team | Branch DBA |
| Summary | Complex business queries are slow; cannot support cross-table queries, multi-dimensional complex queries, or global sorting; single-table queries on the sharding field are fast | High performance for complex business queries | Deployed as a shared cluster; write performance in particular is heavily affected by external factors | Deployed in a single data center, which cannot meet the reliability requirements of a level-0 system |
| Conclusion vs. goal: massive storage and high-concurrency writes | ✅ Large-volume storage, high-performance single-table queries on the sharding field, single-database transaction processing | ✖️ Transaction processing under high concurrency | ✖️ Queries heavily affected by writes/updates | ✖️ Transaction-processing reliability under high concurrency |
| Conclusion vs. goal: complex queries | ✖️ Poor performance; cross-database queries may be needed | ✅ High reliability for complex business queries over large data volumes | ✖️ Queries heavily affected by writes/updates | ✖️ Reliability |

2. Data synchronization plan

A - Near-real-time synchronization plan:

Solution description: use DRC platform configuration to achieve near-real-time data synchronization from the distributed DB to ES (note: DRC is the company's internal general-purpose data synchronization platform, which can synchronize data between many kinds of data sources)

Advantages: simple, no code development required. Disadvantages: business read-after-write scenarios may exist, with a risk of reading inconsistent data

B - Dual-write strong consistency plan:

Solution description: dual-write to the distributed DB and ES, guaranteeing data consistency

Advantages: guarantees consistency for write-then-read scenarios. Disadvantages: high code development cost

Data synchronization plan selection suggestion:

Start with Plan A (near-real-time synchronization) -> verify online whether it meets the business operation experience -> then decide whether Plan B (dual-write strong consistency) needs to be implemented
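Should Plan B become necessary after online verification, the dual write could be structured roughly as below: write the distributed DB first as the source of truth, then write ES, and fall back to asynchronous retry if the ES write fails. Every type and name in this sketch is hypothetical, not the project's actual code:

```java
// Rough sketch of Plan B (dual write); every type and name here is hypothetical.
public class DualWriteSketch {

    interface FlowDao { void insert(String id, String json); }               // sharded distributed DB
    interface EsIndexer { void index(String index, String id, String json); } // ES write
    interface RetryQueue { void enqueue(String index, String id); }           // async compensation

    private final FlowDao dao;
    private final EsIndexer es;
    private final RetryQueue retryQueue;

    DualWriteSketch(FlowDao dao, EsIndexer es, RetryQueue retryQueue) {
        this.dao = dao;
        this.es = es;
        this.retryQueue = retryQueue;
    }

    void save(String id, String json) {
        dao.insert(id, json);                        // 1. DB first: the DB row is the source of truth
        try {
            es.index("approval_flow", id, json);     // 2. then ES, outside the DB transaction
        } catch (RuntimeException e) {
            retryQueue.enqueue("approval_flow", id); // 3. on failure, retry asynchronously until ES converges
        }
    }
}
```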

Data synchronization difficulties and solutions:

Problem:

• The two-table join query scenario cannot be synchronized directly with the DRC platform; it requires either developing a dedicated synchronization module (jar) embedded in the DRC task, or abandoning DRC and synchronizing in code, both of which mean long development time

• The ES index would occupy a lot of space and contain many redundant records, and query results would need deduplication, making queries complex

Difficulty: the process table and the process detail node table are joined in queries, and both tables also receive single-table inserts/deletes/updates; this makes the data model synchronized to ES complex and hard to keep in sync

Solution: add redundant fields to the database table, used specifically for ES queries

Add pending-reviewer and reviewed-reviewer fields to the process table in the DB, with values separated by spaces. ES's tokenization can then match an individual reviewer, and the table can be synchronized to ES directly with the DRC tool, reducing synchronization development time

Solution cost: update the new process-table fields whenever detail nodes are added or modified; develop a tool for refreshing historical data
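A minimal sketch of the redundant-field maintenance described above: whenever detail nodes change, the pending and reviewed reviewer lists are flattened into space-separated strings on the process table, which DRC can then sync to ES, where a whitespace-style tokenizer (an assumption about the index mapping) lets a single reviewer be matched directly. Field and method names are illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: rebuild the redundant, space-separated reviewer fields
// on the process (approval flow) table whenever its detail nodes change.
public class ReviewerFieldsBuilder {

    /** Minimal stand-in for one detail node of an approval flow. */
    public record DetailNode(String reviewerErp, boolean reviewed) { }

    /** Space-separated list of reviewers who have not reviewed yet (hypothetical pending_reviewers column). */
    public static String pendingReviewers(List<DetailNode> nodes) {
        return nodes.stream()
                .filter(n -> !n.reviewed())
                .map(DetailNode::reviewerErp)
                .distinct()
                .collect(Collectors.joining(" "));
    }

    /** Space-separated list of reviewers who have already reviewed (hypothetical reviewed_reviewers column). */
    public static String reviewedReviewers(List<DetailNode> nodes) {
        return nodes.stream()
                .filter(DetailNode::reviewed)
                .map(DetailNode::reviewerErp)
                .distinct()
                .collect(Collectors.joining(" "));
    }
    // ES side (assumption): map these columns with a whitespace tokenizer so a single
    // reviewer ERP can be matched with a term-style query, avoiding the two-table join.
}
```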

(4) Phased development and implementation steps for going online

1. System business transformation - table field addition (August 10th)

1) Add sharding field to business table

Some business tables lack a sharding field and cannot be sharded directly. A new SKU sharding field is added to these tables, and existing logic is modified to include the SKU condition in queries to improve query efficiency;

2) Increase redundant fields for ES-related queries (data refresh)

2. Synchronization and verification of the sharded distributed-database data (August 11)

1) Complete the initialization of the distributed database sharding library + ES;

2) Configure DRC to complete the full and incremental data synchronization from the original single database to the distributed database sharding library;

3) Configure DRC to complete the full and incremental data synchronization from the distributed database sharding library to the ES;

4) Regularly compare data consistency between the original single database, the sharded distributed database, and ES using inspection tools.
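One cheap form of such an inspection is a per-table row-count comparison between the old and new databases (with a similar count against ES via its count API); a minimal JDBC sketch, with connection details and the table name as placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch of a row-count comparison between the old single DB and the sharded DB.
// URLs, credentials and table names are placeholders, not the project's real configuration.
public class RowCountChecker {

    static long countRows(String jdbcUrl, String user, String pass, String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws SQLException {
        String table = "approval_flow";
        long oldCount = countRows("jdbc:mysql://old-db:3306/biz", "reader", "***", table);
        long newCount = countRows("jdbc:mysql://sharded-db:3306/biz", "reader", "***", table);
        if (oldCount != newCount) {
            System.out.printf("MISMATCH %s: old=%d sharded=%d%n", table, oldCount, newCount);
        } else {
            System.out.printf("OK %s: %d rows%n", table, oldCount);
        }
    }
}
```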

3. Read traffic switching + verification (August 17)

1) Add a new AOP aspect and gradually shift read requests to the new application cluster via DUCC configuration (ERP whitelist, full read, result comparison, etc.); a rough sketch of this aspect follows after these steps

2) After the product and business sides complete the verification, switch all read traffic to the new application cluster (note: the new application cluster uses a read-only account for the database)
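A rough sketch of the read-routing aspect from step 1), assuming Spring AOP and a DUCC-style configuration switch; the pointcut, config flags, and helper classes below are hypothetical, not the project's actual code:

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

import java.util.Set;

// Rough sketch: a thread-local "cluster" key is set per read request and would be
// picked up by a routing DataSource / client factory. Flag names mimic a DUCC switch
// (erp whitelist -> full read); all identifiers here are hypothetical.
@Aspect
@Component
public class ReadRoutingAspect {

    // In practice these values would be refreshed from the DUCC configuration center.
    private volatile boolean fullReadOnNewCluster = false;
    private volatile Set<String> erpWhitelist = Set.of();

    @Around("execution(* com.example.biz.query..*Service.*(..))") // illustrative pointcut over read methods
    public Object route(ProceedingJoinPoint pjp) throws Throwable {
        String erp = CurrentUser.erp(); // hypothetical helper: current operator's ERP account
        boolean useNewCluster = fullReadOnNewCluster || erpWhitelist.contains(erp);
        ClusterContext.set(useNewCluster ? "NEW_SHARDED" : "LEGACY");
        try {
            return pjp.proceed(); // downstream DAOs resolve their data source from ClusterContext
        } finally {
            ClusterContext.clear();
        }
    }

    /** Thread-local holder consulted by a routing DataSource (hypothetical). */
    public static final class ClusterContext {
        private static final ThreadLocal<String> KEY = new ThreadLocal<>();
        public static void set(String key) { KEY.set(key); }
        public static String get() { return KEY.get(); }
        public static void clear() { KEY.remove(); }
    }

    /** Hypothetical current-user accessor. */
    public static final class CurrentUser {
        public static String erp() { return "unknown"; }
    }
}
```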

4. Write traffic switching (August 21)

1) Notify the business side and upstream/downstream systems before the launch of the launch time and estimated duration, to reduce business impact

2) Add a static page telling users the system is unavailable during the upgrade, and switch the front-end domain to it so that users cannot perform operations

3) Stop the original system group to ensure there is no write traffic to the original single database, and have the DBA disable writes on the original database (stop the workers, pause MQ consumption)

4) Wait until the original database is fully synchronized to the target database, then verify data consistency between the old and new databases both manually and with automated checks

5) Switch the new system group to the read-write database account and deploy

7) R&D and test personnel verify the new system group's functions with test products; after no problems are found, it is handed to business personnel for verification (switching away from the static maintenance page)

8) Start the workers and resume MQ consumption

5. Effect after launch

The system has run normally since launch; about 260 million products have been cut over from August 23 to the present. The system currently holds 316 million product records; the largest DB table has reached 284 million rows; ES holds 43.56 million documents.

Before/after comparison: for ERP account xxx, which has 290,000 records, the original query took 9 s; the new query takes 1 s.

4. Summary

Good suggestions:

• A comprehensive and clear inventory of the current system status reduces complexity and improves quality

• A clear launch plan guides a reasonable division of labor, shortens the launch window, and reduces the difficulty of the launch

Unsolved problems:

At present the distributed database's support for distributed transactions is relatively weak: when a single transaction modifies multiple records that land in different shards, correctness cannot be guaranteed, so it is necessary to commit, read back after the commit, and verify that the data was saved correctly

When the product data belonging to a single business user reaches the million-record level, queries still take a long time; query performance will continue to be optimized
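For the first unsolved problem, the commit / read-back / verify workaround looks roughly like the following (all types here are hypothetical):

```java
// Illustrative only: commit-then-verify workaround for weak cross-shard transaction support.
interface RecordStore {
    void save(String id, String payload);   // commits within its own shard
    String load(String id);                 // read back after the commit
}

final class CommitThenVerify {
    static boolean saveAndVerify(RecordStore store, String id, String payload) {
        store.save(id, payload);            // 1. submit
        String persisted = store.load(id);  // 2. read after submission
        return payload.equals(persisted);   // 3. verify; the caller compensates on mismatch
    }
}
```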

Author: JD Retail, Wang Kai

Source: JD Cloud Developer Community. Please credit the source when reproducing.
