Distributed Storage Technology (Part 2): Analysis of the architecture, principles, characteristics, and advantages and disadvantages of wide-column storage and full-text search engines


Conventional relational databases cannot meet the needs of write-intensive applications with massive daily write volumes, unpredictable data growth, and high requirements for performance and reliability. The same is true for scenarios with extremely demanding query performance, such as full-text search and data analysis. Wide-column storage and search engine technologies emerged to serve these two classes of scenarios. This article introduces their architecture, principles, and advantages and disadvantages.


Wide-column storage 

The concept of wide-column storage originated from Google's Bigtable paper, with the initial definition being:
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes." ('Bigtable: A Distributed Storage System for Structured Data' 【1】)

Bigtable stores data in a number of Tables. Each Cell (data unit) in a Table holds an uninterpreted string of bytes and is addressed by three dimensions: row, column, and timestamp.

[Figure from 'Bigtable: A Distributed Storage System for Structured Data']

When storing data, Bigtable sorts each Table lexicographically by the Cells' Row Keys, splits the Table by row into several contiguous Tablets, and distributes the Tablets across different Tablet Servers. As a result, when a client queries rows whose Row Keys are close together, the Cells are more likely to fall on the same Tablet, which improves query efficiency.

Different Cells in a Table can store multiple versions of the same data, distinguished by timestamps. A timestamp is essentially a 64-bit integer; Bigtable can set it automatically to the current write time (in microseconds), or the application can set it itself, in which case the application must ensure there are no conflicts between Cells. For Cells with the same Row Key and Column Key, Bigtable sorts the versions in descending timestamp order, so that the latest data is read first.

Because Google Bigtable solved the core business needs of concurrent retrieval and query, high-speed log writing, and massive data volumes, it inspired a number of industry projects, such as HBase and Cassandra, which are collectively referred to as Wide Column Stores (also known as table storage). A Wide Column Store can be schemaless, and table fields can be extended freely: a data table can have an effectively unlimited number of columns, each row can have a different set of columns, and a row can contain many null values, similar to a sparse matrix. A column family stores data that is frequently queried together.

[Figure: DB-Engines ranking of wide-column databases]

The current DB-Engines ranking of wide-column NoSQL databases shows that the most popular are Cassandra, HBase, and Cosmos DB on Azure. Next, we introduce HBase.

HBase is a column-oriented distributed NoSQL database and an open-source implementation of the Google Bigtable design, capable of serving random, real-time data retrieval. The main storage and processing objects of HBase are wide tables, its storage layer is compatible with local storage, HDFS, Amazon S3, and other file systems supported by Hadoop, and it offers strong linear scalability compared with an RDBMS. HBase uses an LSM-tree-based storage system to maintain a stable data write rate, while relying on its log management mechanism and the multi-replica mechanism of HDFS for fault tolerance. Typical application scenarios are OLTP workloads with highly concurrent writes and queries over multi-version, sparse, semi-structured, and structured data.

The data model of HBase is composed of the following logical concepts: tables, rows, row keys, columns, column families, cells, and timestamps (a minimal sketch of this cell model follows the list below).
  • Table: the organizational form of data in HBase, a collection of columns similar in meaning to a table in a traditional database. It also covers the update records of column data at different timestamps.
  • Column: a single data item in the database; each Column contains one type of data.
  • ColumnFamily: data in HBase tables is grouped by ColumnFamily, an aggregation of columns of similar type in one or more HBase tables. HBase stores the data of a ColumnFamily in a single file, which serves a similar function to vertical partitioning: it reduces unnecessary scans during queries and increases query speed.
  • Row: a Row is identified by a RowKey together with its column data, and can include one or more ColumnFamilies.
  • RowKey: row data in HBase is sorted by RowKey, which serves as the primary key. During queries, HBase locates data through the RowKey, and Regions are also divided by RowKey.
  • Timestamp: the version identifier for a given value, written into the HBase database together with the value. A Timestamp can use various time formats, and each RowKey can have multiple record versions with different timestamps.
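To make the cell model above concrete, the toy Python sketch below models a wide-column table as a sparse map keyed by (row key, column, timestamp), with cell versions returned newest first. The row keys and columns are hypothetical, and the code only illustrates the addressing scheme, not HBase's actual implementation.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy model of the Bigtable/HBase cell addressing scheme:
    value = map[(row key, column family:qualifier, timestamp)]."""

    def __init__(self):
        # (row_key, column) -> {timestamp: value}; rows are sparse and
        # each row may use a completely different set of columns
        self.cells = defaultdict(dict)

    def put(self, row_key, column, value, timestamp):
        self.cells[(row_key, column)][timestamp] = value

    def get(self, row_key, column, versions=1):
        # versions of the same cell come back newest first, mirroring
        # the descending-timestamp ordering described above
        history = self.cells.get((row_key, column), {})
        return sorted(history.items(), reverse=True)[:versions]

    def scan(self, start_row, stop_row):
        # cells are iterated in lexicographic row-key order, so adjacent
        # row keys are stored close together (one Tablet/Region in practice)
        for (row_key, column) in sorted(self.cells):
            if start_row <= row_key < stop_row:
                ts, value = max(self.cells[(row_key, column)].items())
                yield row_key, column, ts, value

# Example usage with hypothetical keys and values
table = WideColumnTable()
table.put("com.example/page1", "anchor:link1", b"v1", 1000)
table.put("com.example/page1", "anchor:link1", b"v2", 2000)
print(table.get("com.example/page1", "anchor:link1"))    # [(2000, b'v2')]
print(list(table.scan("com.example/", "com.example/z")))
```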
[Figure from 'HBase: The Definitive Guide']

In HBase, a table is divided into multiple Regions for storage according to the RowKey. Each Region is the basic unit of HBase data management; Regions are split by RowKey and serve a similar function to horizontal partitioning. Data is distributed across the nodes of the cluster, and the Regions on different nodes together form the overall logical view of the table. Adding Regions expands storage capacity.

[Figure from 'HBase: The Definitive Guide']

Regions are maintained by HRegionServers, which are centrally managed by the HMaster. The HMaster can automatically adjust the number of Regions on each HRegionServer, allowing data storage to scale out almost without limit.

[Figure: HBase technical architecture, from 'HBase: The Definitive Guide']

The technical architecture of HBase is shown in the figure above and includes the following main components and services (a hedged client read/write sketch follows the list):
  • Client: The Client is the access entry for the entire HBase cluster, responsible for communicating with HMaster to perform cluster management operations, or communicating with HRegionServer for data read/write operations.
  • ZooKeeper: The status information of each node in the cluster is registered in ZooKeeper, and HMaster perceives the health status of each HRegionServer through ZooKeeper. Additionally, HBase allows multiple HMasters to be started, and ZooKeeper ensures that only one HMaster is running in the cluster.
  • HMaster: It is responsible for managing CRUD operations on data tables, managing the load balancing of HRegionServers, and responsible for the allocation of new Regions; it is also responsible for migrating Regions from the failed HRegionServer when an HRegionServer fails and shuts down.
  • HRegionServer: A node corresponds to one HRegionServer. It is responsible for receiving Regions allocated by HMaster, communicating with Clients, and handling all read/write requests related to the Regions it manages.
  • HStore: HStore provides the data storage function in HBase, consisting of one MemStore and zero or more StoreFiles. Data is first written to the in-memory MemStore, then flushed to StoreFiles (wrappers around the on-disk HFile format), and finally persisted to HDFS. When querying a specific column, only the corresponding blocks need to be retrieved from HDFS.
  • HLog: Log management and replay, each operation that enters MemStore is recorded in HLog.
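As a small illustration of the client path described above, the sketch below writes and reads a few cells through HBase's Thrift gateway using the third-party happybase library. The table name, column family, row-key layout, and connection details are assumptions for the example; in production, the native Java client talking directly to HRegionServers is more common.

```python
import happybase

# Assumes an HBase Thrift server on localhost:9090 and an existing table
# 'metrics' with a column family 'cf'
connection = happybase.Connection('localhost', port=9090)
table = connection.table('metrics')

# Write: the row key encodes entity + time so related rows sort next to each other
table.put(b'sensor42#20230410120000', {b'cf:temp': b'21.5', b'cf:status': b'ok'})

# Point read by row key
print(table.row(b'sensor42#20230410120000'))

# Range scan: rows are sorted by row key, so this touches contiguous Regions
for key, data in table.scan(row_start=b'sensor42#20230410',
                            row_stop=b'sensor42#20230411'):
    print(key, data)

connection.close()
```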
HBase uses the LSM tree (Log-Structured Merge-tree) as its underlying storage structure. The LSM tree offers better write performance than the B+ tree commonly used in RDBMSs. Because random disk I/O is several orders of magnitude slower than sequential I/O, database storage designs generally try to avoid random disk I/O. Although the B+ tree tries to keep the same node on the same page, this is only effective for relatively small data volumes; under heavy random writes, nodes split frequently and the probability of random disk reads and writes increases. To guarantee write performance, the LSM tree writes to memory first and then flushes to disk sequentially in batches. This sequential-write design gives the LSM tree better performance than the B+ tree for massive data writes, but reads must merge in-memory data with historical data on disk, so some read performance is sacrificed. The LSM tree can recover read speed through compaction (merging small sorted sets into larger ones) and techniques such as Bloom filters.

[Figure from 'HBase: The Definitive Guide']

The storage-related structures in HBase include the MemStore, the HFile, and the WAL. The MemStore is an in-memory storage structure that converts random writes into sequential writes: data is first written to the MemStore until its capacity is exhausted, at which point the data is transferred to disk. The HFile is the file format in which HBase data is ultimately written to disk and is the underlying storage format of the StoreFile; each StoreFile corresponds to one HFile, and under normal circumstances HFiles are stored on HDFS, which guarantees data integrity and provides distributed storage. The WAL (Write-Ahead Log) provides highly concurrent, persistent log storage and replay services; every HBase operation is recorded in the WAL for disaster recovery. For example, if the machine loses power while MemStore data is being persisted to an HFile, the WAL can be replayed and no data is lost.
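The toy Python sketch below illustrates the LSM idea described above: writes land in an in-memory memtable, which is flushed as an immutable sorted run (standing in for a StoreFile/HFile), reads consult the memtable and then the runs from newest to oldest, and compaction merges runs. It is a conceptual illustration only, not HBase's actual on-disk format.

```python
import bisect

class ToyLSM:
    """Minimal LSM-tree sketch: memtable absorbs random writes, sorted runs
    are written sequentially, reads merge memtable and runs (newest wins)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}        # in-memory, mutable (plays the role of MemStore)
        self.runs = []            # immutable sorted runs, oldest first (StoreFiles)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value            # random write becomes a memory write
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once, then append sequentially: this is the sequential disk write
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check runs from newest to oldest so the latest version wins
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # "Merging": fold all runs into one, keeping only the newest value per key,
        # which bounds the number of runs a read has to examine
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())]

# Example usage
store = ToyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)      # triggers a flush to a sorted run
store.put("a", 3)      # newer version of "a" stays in the memtable
print(store.get("a"))  # 3: the memtable wins over the older run
store.compact()
```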

Full-text Search Engine

Relational databases store data in a fixed format, row by row. Search engines, by contrast, store data as documents, which are physically hierarchical or tree-like structures; the advantage of this approach is that it makes it very easy to extend the processing capability for semi-structured and unstructured data. Representative search engine databases currently include the open-source Elasticsearch, and the domestic vendor StarRing Technology has its independently developed search product Scope. In the DB-Engines ranking, Elasticsearch has consistently stayed within the top ten.

Compared to SQL databases, ES provides scalable, near-real-time distributed search, supporting the partitioning of text data into multiple parts that cluster nodes store and replicate to improve search speed and protect data integrity. It also supports automated load balancing and failure handling so that the whole cluster remains highly available. ES suits businesses that need to process unstructured data such as documents, including intelligent word segmentation, full-text search, and relevance ranking.

ES defines its own set of elements and concepts for storing and managing data, the most important of which are Field, Document, Type, and Index. A Field is the smallest data unit in Elasticsearch, similar to a column in a relational database: a collection of data values of the same type. A Document is similar to a row in a relational database; a document contains the data values for each of its fields. A Type is similar to the table-level concept in a database, while an Index is the largest data unit in Elasticsearch; unlike a SQL index, an index in ES corresponds to the schema concept in SQL. The mapping between these concepts is shown below.
Elasticsearch    SQL Database
Index            Database
Type             Table
Document         Row
Field            Column
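To connect these concepts to code, the hedged sketch below uses the official elasticsearch-py client (8.x-style keyword arguments are assumed; older 7.x clients pass a body= parameter instead) to create an index with a small hypothetical mapping, index a document, and run a full-text match query. The index name, field names, and cluster URL are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

# Hypothetical local cluster; adjust the URL and credentials for a real deployment
es = Elasticsearch("http://localhost:9200")

# Index ~ schema, document ~ row, field ~ column
es.indices.create(
    index="articles",
    mappings={
        "properties": {
            "title":     {"type": "text"},      # analyzed for full-text search
            "published": {"type": "date"},
            "views":     {"type": "integer"},
        }
    },
)

# A document is a JSON object whose keys are fields
es.index(index="articles", id="1", document={
    "title": "Wide-column storage and search engines",
    "published": "2023-04-10",
    "views": 30,
})

# Force a refresh so the just-indexed document is visible (search is near-real-time)
es.indices.refresh(index="articles")

# Full-text match query with relevance ranking
resp = es.search(index="articles", query={"match": {"title": "search engines"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```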
In the physical model, the main concepts of Elasticsearch are the Cluster, the Node, and the Shard. A node is an instance of Elasticsearch, i.e. a Java process, and multiple instances can run on a single server if hardware resources are sufficient. A shard is the smallest unit of data processing in an Elasticsearch cluster and is an instance of a Lucene index; each index consists of one or more shards spread over one or more nodes. Shards are divided into Primary Shards and Replica Shards. Each document is stored in a Primary Shard; if the Primary Shard or the node it lives on fails, a Replica Shard can be promoted to Primary Shard to keep the data highly available. During retrieval, queries can also be executed on Replica Shards to relieve pressure on the Primary Shard and improve query performance.
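As a companion sketch for the physical model, the snippet below (same hypothetical client and cluster as above) creates an index with explicit primary shard and replica counts; the numbers are illustrative, and Elasticsearch decides which nodes the primary and replica shards are placed on.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 3 primary shards, each with 1 replica: the index is split across the cluster,
# and a replica can be promoted if a primary shard or its node fails
es.indices.create(
    index="logs",
    settings={
        "number_of_shards": 3,
        "number_of_replicas": 1,
    },
)

# Inspect where the shards ended up (one entry per primary/replica shard)
print(es.cat.shards(index="logs", format="json"))
```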


When writing data, Elasticsearch distributes each document to a shard based on the document identifier (ID). When querying, Elasticsearch queries all shards and aggregates the results. To prevent failures of queries on individual shards from affecting the accuracy of the overall result, Elasticsearch provides a routing function: when data is written, the routing value determines which shard it is written to, and when data is queried, the same routing value specifies which shard should be queried (a hedged routing sketch follows below).

Relying on the underlying Lucene inverted index technology, Elasticsearch performs very well at querying and searching text and log data, outperforming almost all relational databases and other products based on modified Lucene (such as Solr). It is therefore widely used in fields such as log analysis, intelligence analysis, and knowledge retrieval, especially through the many industry solutions built on the Elastic Stack. However, in 2020 Elastic changed its license model to restrict cloud vendors from directly hosting and selling Elasticsearch, and the industry has tried to work around these license restrictions in other ways (for example, OpenSearch, developed from the Elasticsearch 7.10 codebase).
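The sketch below illustrates the routing behavior described above with the same hypothetical elasticsearch-py client: the document is written and queried with an explicit routing value, so both operations target the shard selected by hashing that value instead of the document ID. The index and field names are again assumptions for the example.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Write with an explicit routing value: the target shard is chosen by hashing
# the routing value rather than the document ID
es.index(
    index="orders",
    id="order-1001",
    routing="customer-42",
    document={"customer": "customer-42", "amount": 99.5},
)

# Query with the same routing value so only the matching shard is searched;
# queries without routing fan out to all shards and merge the results
resp = es.search(
    index="orders",
    routing="customer-42",
    query={"match": {"customer": "customer-42"}},
)
print(resp["hits"]["total"])
```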
However, ES also has several obvious architectural deficiencies that limit the further expansion of its application scenarios, including:
  • It does not support transactions and offers only eventual consistency of data, so it cannot be used to store and serve critical data.
  • Its analytical capabilities are weak, especially for complex aggregations.
  • High availability between shards is achieved through master-slave replication, which can suffer from split-brain issues.
  • The data handling capacity of a single Node still needs to be improved.
  • The security module of ES is a commercial plugin, and a large number of ES clusters lack appropriate security protection.
Because of the architectural deficiencies above, other search engines in the industry have clear directions for improvement; the security issues of Elasticsearch in particular have caused several major data leaks in China. StarRing Technology began developing a domestic alternative to ES in 2017 and in 2019 released Scope, a distributed search engine based on Lucene. It adopts a new high-availability architecture based on the Paxos protocol, supports cross-data-center deployment modes, has built-in native security features, and supports multiple Node instances on a single server, which greatly improves the stability, reliability, and performance of the search engine. It has been deployed in multiple large-scale production clusters in China, with single clusters exceeding 500 nodes.

Summary

This article introduced the architecture, principles, and advantages and disadvantages of wide-column storage and search engine technology (note that these technologies evolve quickly, and the descriptions here may lag behind the latest developments). With basic data storage and management in place, the next step is to obtain the required results through integrated computing. Facing large data volumes, achieving high throughput, low latency, high scalability, and fault tolerance is the key challenge of distributed computing technology. Starting with the next article, we will introduce distributed computing technologies represented by MapReduce and Spark.

【References】

【1】Chang F, Dean J, Ghemawat S, et al. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS), 2008, 26(2): 1-26.

【2】George L. HBase: The Definitive Guide. O'Reilly Media, 2011.
