Hudi connector
The Hudi connector enables querying Hudi tables.
Note
The Hudi connector is a limited preview in Starburst Enterprise. Contact Starburst Support with questions or feedback.
Requirements
To use the Hudi connector, you need:
Hudi version 0.12.2 or higher.
Network access from the SEP coordinator and workers to the Hudi storage.
Access to the Hive metastore service (HMS).
Network access from the SEP coordinator to the HMS.
General configuration
The connector requires a Hive metastore for table metadata and supports the same
metastore configuration properties as the Hive connector. At a minimum, hive.metastore.uri
must be configured.
The connector recognizes Hudi tables synced to the metastore by the
Hudi sync tool.
To create a catalog that uses the Hudi connector, create a catalog properties file etc/catalog/example.properties that references the hudi connector.
Update the hive.metastore.uri with the URI of your Hive metastore Thrift service:
connector.name=hudi
hive.metastore.uri=thrift://example.net:9083
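Once the catalog file is in place, a quick sanity check can confirm that the catalog reaches the metastore. The following is a minimal sketch, assuming the catalog is named example (from the file name example.properties) and that a schema named example_schema exists, as in the usage examples in this document:

```sql
-- List the schemas the metastore exposes through the "example" catalog.
SHOW SCHEMAS FROM example;

-- List the tables in one schema; example_schema is an assumed schema name.
SHOW TABLES FROM example.example_schema;
```

If either statement fails, the metastore URI and the network access listed under the requirements are the first things to check.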
Additionally, the following configuration properties can be set depending on the use case:
| Property name | Description | Default |
|---|---|---|
| | Fetch the list of file names and sizes from metadata rather than storage. | |
| | List of column names that are hidden from the query output. It can be used to hide Hudi meta fields. By default, no fields are hidden. | |
| | Access Parquet columns using names from the file. If disabled, then columns are accessed using the index. Only applicable to Parquet file format. | |
| | Whether batched column readers must be used when reading Parquet files for improved performance. Set this property to | |
| | Whether batched column readers must be used when reading ARRAY, MAP and ROW types from Parquet files for improved performance. Set this property to | |
| | Minimum number of partitions returned in a single batch. | |
| | Maximum number of partitions returned in a single batch. | |
| | Unlike uniform splitting, size-based splitting ensures that each batch of splits has enough data to process. By default, it is enabled to improve performance. | |
| | The split size corresponding to the standard weight (1.0) when size-based split weights are enabled. | |
| | Minimum weight that a split can be assigned when size-based split weights are enabled. | |
| | Rate at which splits are queued for processing. The queue is throttled if this rate limit is breached. | |
| | Maximum outstanding splits in a batch enqueued for processing. | |
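To see what the hidden-columns setting affects, it can help to query the Hudi meta fields directly. The following is a minimal sketch, assuming a table named stock_ticks_cow in example.example_schema, and assuming the table carries Hudi's standard meta fields (_hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path) and that none of them have been hidden in the catalog configuration:

```sql
-- Inspect Hudi meta fields alongside a regular column.
-- Fails with a "column cannot be resolved" error if the fields are hidden.
SELECT _hoodie_commit_time,
       _hoodie_record_key,
       _hoodie_partition_path,
       symbol
FROM example.example_schema.stock_ticks_cow
LIMIT 5;
```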
SQL support
The connector provides read access to data in Hudi tables that have been synced to the Hive metastore. The globally available and read operation statements are supported.
Basic usage examples
In the following example queries, stock_ticks_cow is the Hudi copy-on-write table referred to in the Hudi quickstart guide.
USE example.example_schema;
SELECT symbol, max(ts)
FROM stock_ticks_cow
GROUP BY symbol
HAVING symbol = 'GOOG';
symbol | _col1 |
-----------+----------------------+
GOOG | 2018-08-31 10:59:00 |
(1 rows)
SELECT dt, symbol
FROM stock_ticks_cow
WHERE symbol = 'GOOG';
dt | symbol |
------------+--------+
2018-08-31 | GOOG |
(1 rows)
SELECT dt, count(*)
FROM stock_ticks_cow
GROUP BY dt;
dt | _col1 |
------------+--------+
2018-08-31 | 99 |
(1 rows)
Schema and table management
Hudi supports two types of tables depending on how the data is indexed and laid out on the file system. The following table displays a support matrix of table types and query types for the connector:
| Table type | Supported query type |
|---|---|
| Copy on write | Snapshot queries |
| Merge on read | Read-optimized queries |
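As a hedged illustration of a read-optimized query on a merge-on-read table, the sketch below assumes the Hudi sync tool registered the table under the name stock_ticks_mor_ro. That name is the read-optimized view used in the Hudi quickstart guide and is an assumption here; the name in your metastore may differ.

```sql
-- Read-optimized query against a merge-on-read table.
-- stock_ticks_mor_ro is an assumed table name taken from the Hudi quickstart.
SELECT symbol, max(ts) AS latest_ts
FROM example.example_schema.stock_ticks_mor_ro
GROUP BY symbol;
```

A read-optimized query reads only the compacted base files, so updates that have not yet been compacted are not reflected in the result.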
Metadata tables
The connector exposes a metadata table for each Hudi table. The metadata table contains information about the internal structure of the Hudi table. You can query each metadata table by appending the metadata table name to the table name:
SELECT * FROM "test_table$timeline"
$timeline table
The $timeline table provides a detailed view of metadata instants in the Hudi table. Instants are specific points in time.
You can retrieve the information about the timeline of the Hudi table test_table by using the following query:
SELECT * FROM "test_table$timeline"
timestamp | action | state
--------------------+---------+-----------
8667764846443717831 | commit | COMPLETED
7860805980949777961 | commit | COMPLETED
The output of the query has the following columns:
| Name | Type | Description |
|---|---|---|
| timestamp | | Instant time, typically the timestamp at which the action was performed |
| action | | Type of action performed on the table |
| state | | Current state of the instant |
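The timeline columns can be filtered and sorted like those of any other table. The following is a minimal sketch that looks up the most recent completed instant of test_table, assuming instant times sort chronologically and that COMPLETED is the state value of interest, as in the sample output above:

```sql
-- Find the latest completed instant on the timeline.
-- "timestamp" is quoted to avoid any clash with the type keyword.
SELECT "timestamp", action
FROM "test_table$timeline"
WHERE state = 'COMPLETED'
ORDER BY "timestamp" DESC
LIMIT 1;
```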
File formats
The connector supports the Parquet file format.