In traditional enterprise data practice, organizations run many different systems, data is scattered across disparate storage, and analysis requirements frequently span databases. Consolidating that data into a lake or warehouse for analysis can raise security concerns or degrade the performance of business systems. Enterprises therefore need a flexible, fast, low-maintenance way to integrate data, and that is what the data federation architecture provides. This article introduces that architecture.

— Overview of Data Federation —
In traditional enterprise data practice, a common dilemma arises: enterprises run many different systems, data is scattered across different data stores, and analysis requirements frequently span databases. There are two ways to address this kind of demand. One is to first perform data integration (ETL) and consolidate the data into a data lake or data warehouse before analysis; this generally suits business scenarios where requirements are relatively stable and existing systems can be migrated to the new data platform. For other enterprises, because platform construction proceeds in phases, requirements may change rapidly or existing systems may not be directly migratable for various reasons. For them, the ETL-into-the-lake-then-analyze approach can suffer from slow business response, low flexibility, and complex processes. Such enterprises need a way to integrate data more flexibly and quickly, and that is the data federation architecture.
Data federation solves the 'data silo' problem and avoids the long lead times and high development and operations costs of traditional ETL pipelines. It applies to scenarios that require flexible, real-time data collection or that must process heterogeneous data sources. Typical application scenarios include the following.
- Virtual Operational Data Store (ODS)
Through a virtual operational data store (ODS), a manageable, integrated view of the data is built: changes in the source systems are quickly reflected in the ODS, and the data sources participating in the federation can be flexibly added or removed according to the specific analysis need. This suits lightweight, short-term data analysis and real-time, flexible dashboard applications, as in the sketch below.
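As a minimal sketch of such a virtual ODS (all table and column names here are hypothetical; `orders_crm` and `orders_erp` stand for tables the federation layer has mapped in from two source systems), the integrated view can simply be a SQL view with no copied data:

```sql
-- Minimal sketch of a virtual ODS: a view over two tables exposed by the
-- federation layer from different source databases. Only the view
-- definition is stored; no data is copied.
CREATE VIEW ods_orders AS
SELECT order_id, customer_id, amount, 'crm' AS source_system FROM orders_crm
UNION ALL
SELECT order_id, customer_id, amount, 'erp' AS source_system FROM orders_erp;
```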
- Building a Data Transfer Area
By using data federation to build a data transfer area, snapshots of the large volumes of data entering the data warehouse from production systems can be merged quickly, greatly reducing the interference that data replication causes to the production systems. Storing data changes in the transfer area in real time also preserves a complete record of data-change history.
- Expansion of Data Warehouse
Problems remain after an enterprise deploys data warehouses: on the one hand, it is unlikely that the entire enterprise will use only a single data warehouse; on the other hand, a large amount of data has never been loaded into any warehouse, yet a unified view is still needed. Data federation and federated computing can provide a unified view over all of an enterprise's data warehouses and scattered data without converting formats or moving data, thereby reducing the cost of data movement and conversion.
- Heterogeneous Platform Migration
Using federated computing during a heterogeneous platform migration makes the migration smoother: issues such as data migration and syntax incompatibilities between the heterogeneous platforms do not have to be dealt with up front, the application's use of the data is not affected, and once the migration is complete the data source configuration can simply be switched without affecting the new application.
- Heterogeneous Data Analysis
Enterprises can use data federation to analyze structured, semi-structured, and unstructured data together. Although data federation and federated computing are very valuable for streamlining data collection and platform migration, they are not a silver bullet. On the one hand, because an integrated view of the data can be built so quickly, many enterprises neglect data governance, which introduces unnecessary redundancy into the federation. On the other hand, since data federation has no data architecture designed in advance for the business, fixed, well-defined data requirements are still best served through data lakes, data warehouses, and similar means. Data federation therefore still needs to be combined with data governance, ETL, and other methods to address the broader data management problems inside an enterprise; it mainly addresses frequently evolving data requirements and ad-hoc development scenarios.
— An Analysis of Data Federation Architecture —
From the perspective of technical architecture, data federation adds a federated computing engine on top of the existing data sources to provide a unified data view, and lets developers query and analyze data from heterogeneous sources through that engine. Developers do not need to consider the physical location of the data, its structure, its access interface, or its storage capacity; they can access and analyze homogeneous or heterogeneous data on a single system. The data federation architecture brings several key advantages to enterprise data management:
- Virtualized data integration: compared with traditional ETL, data federation performs only virtual integration, so it can integrate large amounts of data faster and at lower cost, speeding up data integration and making it possible to quickly explore innovative data businesses;
- Cross-database data analysis over complex existing systems, which protects the enterprise's existing investment;
- Flexible data discovery and use for developers: users do not need to know the location or structure of a data source, the source systems do not need to be modified, and the flexibility of data processing improves;
- Unified data security control: because integration happens through virtual views rather than copies, security is controlled centrally on the data federation platform. This provides a single controlled point of data export and reduces the leakage risk that copying data creates, which is especially suitable for business systems whose data should not be imported into a data lake, thereby protecting data both at rest and in motion;
- Elimination of unnecessary data movement: infrequently used data no longer has to be synchronized into the data lake every day; under the federation architecture it is only accessed when an actual analysis need arises, which removes unnecessary data movement;
- Agile data services and a data portal: on top of the federation capability, an enterprise can build a data service portal that provides development and service tools, allowing users to work with data more flexibly.
The key point of the data federation technical architecture (shown in the diagram above) is a unified SQL query engine that can federate multiple homogeneous or heterogeneous, autonomous data sources. Users can query data anywhere in the federation without concerning themselves with where the data is stored, which SQL dialect the underlying source speaks, or how much it can store. As the diagram shows, the architecture has to achieve unification in four areas to deliver these core capabilities: unified metadata management, a unified query processing interface, unified development and operations tools, and unified security management.
- Unified Metadata Management
The metadata management module needs to build an abstract overall view of various homogeneous and heterogeneous data sources, providing unified data source connection management and metadata management. Subsequent business development layers only need to access metadata information under different databases through this centralized metadata management module, without the need to manage the different connection details of various databases.
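As an illustration of what such a unified metadata view feels like in practice, the Trino-style commands below (Trino is discussed later in this article; the catalog, schema, and table names are assumptions) browse all connected sources through a single interface:

```sql
-- Browsing federated metadata through one engine (Trino-style syntax;
-- catalog, schema, and table names are illustrative assumptions).
SHOW CATALOGS;                      -- every connected data source
SHOW SCHEMAS FROM mysql_crm;        -- schemas inside one source
SHOW TABLES FROM mysql_crm.sales;   -- tables inside one schema
DESCRIBE mysql_crm.sales.orders;    -- unified view of column metadata
```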
- Data Source Connection Module
Through the data federation platform, developers can build virtual connections across database instances and thereby access other databases from the current one, generally using SQL extensions similar to DBLink. This layer is responsible for managing data sources and supports connections both to traditional data sources and to big data platforms; ideally it supports access to structured as well as unstructured data. A DBLink syntax example follows; developers can then use the table names defined over the dblink to access tables in different data sources, and can even access multiple such tables in a single SQL statement.
```sql
CREATE DATABASE LINK <link_name>
  CONNECT TO <jdbc_username> IDENTIFIED BY '<jdbc_password>'
  USING '<jdbc_URL>' WITH '<jdbc_driver>';

CREATE EXTERNAL TABLE <table_name> (col_dummy string)
  STORED AS DBLINK WITH DBLINK <link_name>
  TBLPROPERTIES ('dblink.table.name' = <ref_table_name>);
```
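Assuming two such external tables have been created over different links (the table and column names below are hypothetical), a developer can then join data that physically lives in different databases within one statement:

```sql
-- Hypothetical usage: orders_ext and customers_ext are external tables
-- created over two different database links; the join executes in the
-- federation engine as if both tables were local.
SELECT c.customer_name,
       SUM(o.amount) AS total_amount
FROM   orders_ext    o
JOIN   customers_ext c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```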
- Metadata Management Module
Since the metadata of the various databases changes continuously as the business runs, the data federation layer also needs to pick up those changes promptly, for example when table fields are added or removed or their types change. The metadata management module provides this function: it collects metadata from the various data sources and manages it centrally, keeps it up to date by querying the sources, and thereby keeps metadata consistent across platforms. It plays a key supporting role throughout the lifecycle of building, running, and maintaining federated computing.
- Unified Query Processing Interface
This module is the entry point for data analysts. It mainly covers cross-platform data processing expressed in unified standard SQL, as well as data portal functions provided by the enterprise, such as SQL review and data access applications. Because it processes multiple databases in a unified way, its technical requirements are higher: besides handling the differences in data types and SQL dialects across databases, SQL performance is another critical technical factor.
- Federated Query SQL Engine Layer
As the unified syntax parsing layer, it parses SQL statements. Its core consists of the SQL compiler, the optimizer, and the transaction management unit, which together give developers a consistent database experience without requiring them to program against different underlying platforms and divergent APIs. The optimizer generates an optimal execution plan, which is then pushed to the computing engine layer. For developers, the SQL engine layer needs to expose standard JDBC/ODBC/REST interfaces.
- Federated Query Computing Engine Layer
Some SQL operations need to touch data in several databases and perform join analysis across them, so the computing engine layer must be able to load data from different databases into the engine and execute the various data operations. This requires a distributed computing architecture and an effective way to reconcile the differences in data types across databases. For example, the TRIM function may behave differently when a string has spaces on both sides, and the definitions of DECIMAL/NUMERIC types may also differ between databases. These differences in type systems add considerable processing complexity and are a core capability of the computing engine. Another important performance-related feature is that, because multiple remote databases are involved, the federated computing engine needs to support two typical SQL optimizations: caching of data results and pushing computation down to the remote database.
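As a small, hedged illustration of the kind of normalization involved (table and column names are assumptions), the query below makes explicit the casts and trims that a federated engine has to apply internally when it maps each source's type system onto its own:

```sql
-- Hypothetical sketch: explicitly normalizing string and numeric semantics
-- when joining data from two sources whose TRIM behavior and DECIMAL/NUMERIC
-- definitions differ. A federated engine must apply equivalent conversions
-- when mapping source types into its own type system.
SELECT a.account_id,
       CAST(a.balance AS DECIMAL(18, 2))
     + CAST(b.balance AS DECIMAL(18, 2)) AS total_balance
FROM   mysql_accounts  a
JOIN   oracle_accounts b
  ON   TRIM(a.account_id) = TRIM(b.account_id);
```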
- Data Result Cache Layer
The federated computing engine automatically tracks how often result sets are queried; for hot result sets and intermediate data sets it caches or synchronizes the underlying database's results into the federated computing platform, and at query time it chooses the best query path to optimize performance.
Take an aggregation query against MySQL as an example (a sketch of such a query follows below). MySQL is not well suited to statistical analysis: an aggregation over a lineitem table with 10 million rows takes about 200 seconds. With a federated computing engine, however, the engine observes from its query history that this table is analyzed repeatedly and persists the MySQL data into st_lineitem; subsequent queries against MySQL's tpch1.lineitem are then served directly from st_lineitem. Because distributed computing engines excel at statistical analysis, those later queries run entirely in the federated engine, performance improves by several orders of magnitude, and the resource consumption on the original MySQL database drops sharply.
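A hedged sketch of the kind of aggregation described above, using TPC-H style column names as an assumption:

```sql
-- Hypothetical sketch of an aggregation that benefits from result caching:
-- the first run scans MySQL's tpch1.lineitem; once the engine persists the
-- hot data set (st_lineitem in the scenario above), later runs are answered
-- from the cached copy inside the federated engine.
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       SUM(l_extendedprice) AS sum_price,
       COUNT(*)             AS line_count
FROM   tpch1.lineitem
GROUP BY l_returnflag, l_linestatus;
```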
- Pushing Computation Down to the Remote Database
In some ad-hoc statistical analysis scenarios, loading all of the source databases' data into the federated computing engine would generate a large amount of data IO and network IO and slow the query down. If part of the SQL execution plan can be executed directly in the remote database, the computing engine can choose, while building the execution plan, to push operators down to the underlying databases, let each remote database complete the basic computation, and then aggregate the partial results in the federated computing layer for the final calculation, thereby optimizing query performance.
Take a simple example (a sketch follows below): suppose we need the number of employees whose employee numbers are below a given value, drawn from both a MySQL and an Oracle database. Without aggregation pushdown, once the developer issues the statistical SQL, the federated computing engine has to load the full tens of millions of rows from MySQL and Oracle into the engine for computation. This loading alone may take several hundred seconds (limited mainly by the number of concurrent connections the databases allow the federated engine to open). If the engine supports pushing the computation down to the remote databases, very little data has to move, and performance improves from hundreds of seconds to tens of seconds.
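A hedged sketch of this example follows; the employee tables and the emp_no threshold are assumptions used only for illustration:

```sql
-- Hypothetical sketch. mysql_hr.employees and oracle_hr.employees stand for
-- the two remote tables exposed through the federation layer.
-- What the developer writes: a single count over the union of both sources.
SELECT COUNT(*) AS total_employees
FROM (
    SELECT emp_no FROM mysql_hr.employees
    UNION ALL
    SELECT emp_no FROM oracle_hr.employees
) all_employees
WHERE emp_no < 100000;

-- With filter and aggregation pushdown, the engine effectively lets each
-- remote database run "SELECT COUNT(*) ... WHERE emp_no < 100000" locally
-- and only sums the two single-row results, instead of pulling every row
-- across the network.
```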
- Unified Security Management
Since data federation opens the various databases up to business users, that openness also exposes the data to more risk. Finer-grained data security functions are therefore needed: authentication, auditing, and authorization for data access, together with data encryption, desensitization (masking), and classification, to keep data secure in storage, in transit, and during processing.
- Unified Permission Management
The data federation security layer needs to provide a unified user model for permission management and authorization across the federated databases. Through metadata mapping (external table creation), permission control over federated data sources can be given the same experience as table-level permission control inside a database, along with fine-grained row- and column-level permissions per user. In effect, the original table is restricted to a view at compile time, which ensures that sensitive data is not leaked during computation.
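A minimal sketch of this view-based restriction is shown below (all object names are hypothetical): only non-sensitive columns and one region's rows of a federated table are exposed, and access is granted on the view rather than the underlying table.

```sql
-- Hypothetical sketch of row- and column-level control over a federated
-- external table: the view hides sensitive columns and filters rows, and
-- the analyst role can only read through the view.
CREATE VIEW customers_east AS
SELECT customer_id, customer_name, city   -- phone and ID-card columns omitted
FROM   customers_ext
WHERE  region = 'EAST';

GRANT SELECT ON customers_east TO analyst_role;
```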
- SQL Dynamic Audit
SQL auditing means blocking sensitive queries (for example, equality lookups on sensitive fields) or returning optimization suggestions when business SQL is submitted to the platform. For ease of understanding, some typical DML audit rules and their descriptions are listed in the table below:
| Audit rule description | Audit rule expression |
| --- | --- |
| DML statements containing DELETE or UPDATE must have a WHERE condition | (sql.type = 'DELETE' or sql.type = 'UPDATE') and !sql.hasWhere |
| Single-table queries on ES or Hyperbase tables must add a LIMIT clause if they have no filter conditions and no aggregation | sql.type = 'select' and sql.joinNum = 0 and (sql.from.tableType = 'es' or sql.from.tableType = 'hyperbase') and sql.where = null and !sql.hasUdaf |
| Too many UNION ALL operations in the SQL statement | (sql.type = 'select' and sql.unionallNum > 100) or (sql.type = 'insert' and sql.select.unionallNum > 100) |
| Too many JOINs in the SQL statement | (sql.type = 'select' and sql.joinNum > 100) or (sql.type = 'insert' and sql.select.joinNum > 100) |
| Prohibit dropping data objects | sql.type = 'dropdatabase' or sql.type = 'droptable' |
| Prohibit INSERT OVERWRITE | sql.type = 'insert' and sql.isOverwrite |
| On table creation, the product of the number of partitions and the number of buckets must not exceed the specified limit; if it does, prompt the user to adjust and optimize | sql.type = 'createtable' and sql.partNum * sql.bucketNum > 3 |
| On table creation, the table name may contain only letters, numbers, and underscores | sql.type = 'createtable' and !(sql.tableName ~ '^[a-zA-Z0-9_]+$') |
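To make the mechanism concrete, here is a hedged example of a statement that the first rule in the table would intercept; the table name and the audit layer's reaction are illustrative, not taken from a specific product:

```sql
-- Violates the "DELETE/UPDATE must have a WHERE condition" rule, so the
-- audit layer would block it (or return an optimization suggestion)
-- instead of executing it.
DELETE FROM orders_ext;

-- A compliant version that passes the same audit rule.
DELETE FROM orders_ext WHERE order_date < DATE '2020-01-01';
```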
Large enterprises, as they open their data up to the business, have strong demands on SQL auditing. Many companies have built their own SQL submission and audit platforms, and the open-source community also offers some targeted solutions. The overall architecture can be seen in the diagram below: rules are defined in a DSL, and a corresponding parser carries out the SQL audit actions (such as blocking or optimization suggestions).
- Sensitive Data Dynamic Desensitization
Dynamic desensitization of sensitive data differs from row- and column-level permissions on database tables: it protects the content of the data itself and is a more business-level security measure that complements access control. One point to note is that, to keep results accurate, real data is used during the query and computation phases, and dynamic desensitization is applied only at the output stage, so that sensitive data is not leaked. To make desensitization effective, the data federation engine needs to maintain a global table- and field-level lineage graph; based on that lineage it can propagate sensitivity rules globally, identify more sensitive tables, and provide configurable, automatic sensitive-data identification and desensitization for them.
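As a hedged sketch of output-stage masking (the table, column, and masking expression are assumptions; a real engine would usually apply such rules automatically based on sensitivity tags and lineage), the filter runs on real values while the projected phone number is masked:

```sql
-- Hypothetical sketch of dynamic desensitization: computation and filtering
-- use the real values, but the phone number is masked in the output the
-- user actually sees.
SELECT customer_id,
       concat(substr(phone, 1, 3), '****', substr(phone, 8)) AS phone_masked
FROM   customers_ext
WHERE  region = 'EAST';
```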
- Unified Operations and Maintenance Management
Besides the underlying infrastructure, implementing federated computing also requires support from upper-layer data development tools that cover the whole process from data acquisition and processing to value realization, while keeping data secure across sources. These tools generally include platforms for data development, data management, and operations, enabling enterprises to use federated computing more efficiently to build their internal data service layer and data business value layer.
— Open-Source Data Federation Engines: Presto and Trino —
Presto (or PrestoDB) is an open-source distributed SQL query engine designed from the ground up for fast analytical queries over data of any scale. It supports non-relational data sources such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, as well as relational sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata. Presto queries data where it is stored, without moving it into a separate analytical system; query execution runs in parallel on a purely in-memory architecture, and most results come back within seconds.
Presto began as a project for running interactive analytical queries on Facebook's data warehouse, whose underlying storage was an HDFS-based cluster. Before Presto, Facebook used Apache Hive, created and released in 2008, but as workloads grew and the demand for interactive analysis increased, Hive could not deliver the performance interactive queries require. In 2012, Facebook's data infrastructure team built Presto, an interactive query system that runs quickly at petabyte scale. Today Presto is a popular choice for interactive queries on Hadoop, with many contributions from Facebook and other organizations. Facebook's version of Presto focuses more on internal enterprise requirements and is known as PrestoDB. Later, some of Presto's members created a more general-purpose branch named PrestoSQL, with its own version numbering (for example, version 345); this open-source branch is the more widely used one. To distinguish itself more clearly from Facebook's Presto, PrestoSQL was later renamed Trino, which is the origin of the Trino project.
Presto and Trino are both distributed computing engines and do not store data themselves; in addition, the community has adapted a large number of database connectors, which makes them well suited to data federation queries among open-source solutions. In Trino, the separation of storage and computation centers on the connector architecture: connectors give Trino an interface to access arbitrary data sources. As shown in Figure 2, each connector provides a table-based abstraction of its underlying source. As long as the data can be represented as tables, columns, and rows using Trino's data types, a connector can be created and the query engine can use that data for query processing. Currently supported connectors include Hive, Iceberg, MySQL, PostgreSQL, Oracle, SQL Server, ClickHouse, MongoDB, and more.
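A hedged example of what a federated query looks like in Trino, assuming a `mysql` and a `hive` catalog have been configured (the schema and table names are illustrative):

```sql
-- Cross-catalog federated join in Trino: each catalog maps to a connector,
-- and tables are addressed as catalog.schema.table.
SELECT c.customer_name,
       SUM(o.total_price) AS revenue
FROM   mysql.crm.customers   c
JOIN   hive.warehouse.orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY revenue DESC
LIMIT 10;
```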
As introduced in the previous chapter, a unified computing engine is the core of a data federation system, and offering flexible ad-hoc queries with good performance is an important design goal of Presto. Physically, Presto is divided into a Coordinator and Workers: the Coordinator compiles SQL and distributes computation tasks to the Workers, while the Workers pull data from the remote data stores and perform the corresponding computation. Because Presto is developed in Java and the JVM does not execute as efficiently as lower-level languages, Presto also uses code generation to produce bytecode that can be executed directly, improving analysis performance.
Presto adopts memory-based rather than disk-based computation, i.e., intermediate data is mainly cached in memory during processing. Because memory usage can inevitably balloon during computation and crash a Worker process, memory management is a very important part of Presto. Presto divides memory into three pools: the System Pool, the Reserved Pool, and the General Pool. The System Pool is reserved for system use, such as buffers for data transfer between machines; this memory is accounted under the system name and defaults to 40% of the memory space. The Reserved Pool and the General Pool allocate query execution memory: most queries use the General Pool, while the Reserved Pool is sized to the maximum memory a single query may use on one machine and defaults to 10% of the space.
Another core capability supporting distributed execution plans for federated analysis is Presto's pipelined computation model: an execution plan is compiled into multiple stages, where each stage is a set of computation tasks that need no data shuffle among themselves. The tasks within a stage execute in parallel, and stages without mutual dependencies can also execute in parallel, fully exploiting the concurrency of the distributed computation engine to improve performance.
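The stage decomposition can be inspected directly: in Presto/Trino, `EXPLAIN (TYPE DISTRIBUTED)` prints the plan fragments (stages) and the exchanges between them; the query below reuses the illustrative catalog and table names from the earlier example:

```sql
-- Prints the distributed plan: one fragment per stage, plus the exchange
-- operators that connect them, showing which parts can run in parallel.
EXPLAIN (TYPE DISTRIBUTED)
SELECT c.customer_name, SUM(o.total_price) AS revenue
FROM   mysql.crm.customers   c
JOIN   hive.warehouse.orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```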
Generally speaking, at the level of unified metadata management and the unified query interface, both Presto and Trino have very good architectures and clear code structures, making them well suited as a basis for secondary development. However, because the open-source communities pay limited attention to enterprise-level management needs, both lack solid modules for unified data security control and for operations and maintenance tooling. To build a complete data federation analysis system, secondary development or the introduction of commercial software is therefore needed to fill in the missing functions.
— Summary —
This article has introduced the data federation architecture and its typical application scenarios. Data federation solves the 'data silo' problem and avoids the high development and operations costs of long traditional ETL pipelines. It can be applied in scenarios with flexible, real-time data collection requirements, or where heterogeneous data sources need to be processed and analyzed.
