Hudi connector#

The Hudi connector enables querying Hudi tables.

Note

The Hudi connector is a limited preview in Starburst Enterprise. Contact Starburst Support with questions or feedback.

Requirements#

To use the Hudi connector, you need:

  • Hudi version 0.12.3 or higher.

  • Network access from the SEP coordinator and workers to the Hudi storage.

  • Access to a Hive metastore service (HMS).

  • Network access from the SEP coordinator to the HMS.

  • Data files stored in the Parquet file format. These can be configured using file format configuration properties per catalog.

General configuration#

To configure the Hive connector, create a catalog properties file etc/catalog/example.properties that references the hudi connector and defines the HMS to use with the hive.metastore.uri configuration property:

connector.name=hudi
hive.metastore.uri=thrift://example.net:9083

There are HMS configuration properties available for use with the Hudi connector. The connector recognizes Hudi tables synced to the metastore by the Hudi sync tool.

Additionally, following configuration properties can be set depending on the use-case:

Hudi configuration properties#

Property name

Description

Default

hudi.columns-to-hide

List of column names that are hidden from the query output. It can be used to hide Hudi meta fields. By default, no fields are hidden.

hudi.parquet.use-column-names

Access Parquet columns using names from the file. If disabled, then columns are accessed using the index. Only applicable to Parquet file format.

true

hudi.split-generator-parallelism

Number of threads to generate splits from partitions.

4

hudi.split-loader-parallelism

Number of threads to run background split loader. A single background split loader is needed per query.

4

hudi.size-based-split-weights-enabled

Unlike uniform splitting, size-based splitting ensures that each batch of splits has enough data to process. By default, it is enabled to improve performance.

true

hudi.standard-split-weight-size

The split size corresponding to the standard weight (1.0) when size-based split weights are enabled.

128MB

hudi.minimum-assigned-split-weight

Minimum weight that a split can be assigned when size-based split weights are enabled.

0.05

hudi.max-splits-per-second

Rate at which splits are queued for processing. The queue is throttled if this rate limit is breached.

Integer.MAX_VALUE

hudi.max-outstanding-splits

Maximum outstanding splits in a batch enqueued for processing.

1000

hudi.per-transaction-metastore-cache-maximum-size

Maximum number of metastore data objects per transaction in the Hive metastore cache.

2000

SQL support#

The connector provides read access to data in the Hudi table that has been synced to Hive metastore. The globally available and read operation statements are supported.

Basic usage examples#

In the following example queries, stock_ticks_cow is the Hudi copy-on-write table referred to in the Hudi quickstart guide.

USE example.example_schema;

SELECT symbol, max(ts)
FROM stock_ticks_cow
GROUP BY symbol
HAVING symbol = 'GOOG';
  symbol   |        _col1         |
-----------+----------------------+
 GOOG      | 2018-08-31 10:59:00  |
(1 rows)
SELECT dt, symbol
FROM stock_ticks_cow
WHERE symbol = 'GOOG';
    dt      | symbol |
------------+--------+
 2018-08-31 |  GOOG  |
(1 rows)
SELECT dt, count(*)
FROM stock_ticks_cow
GROUP BY dt;
    dt      | _col1 |
------------+--------+
 2018-08-31 |  99  |
(1 rows)

Schema and table management#

Hudi supports two types of tables depending on how the data is indexed and laid out on the file system. The following table displays a support matrix of tables types and query types for the connector:

Hudi configuration properties#

Table type

Supported query type

Copy on write

Snapshot queries

Merge on read

Read-optimized queries

Metadata tables#

The connector exposes a metadata table for each Hudi table. The metadata table contains information about the internal structure of the Hudi table. You can query each metadata table by appending the metadata table name to the table name:

SELECT * FROM "test_table$timeline"
$timeline table#

The $timeline table provides a detailed view of meta-data instants in the Hudi table. Instants are specific points in time.

You can retrieve the information about the timeline of the Hudi table test_table by using the following query:

SELECT * FROM "test_table$timeline"
 timestamp          | action  | state
--------------------+---------+-----------
8667764846443717831 | commit  | COMPLETED
7860805980949777961 | commit  | COMPLETED

The output of the query has the following columns:

Timeline columns#

Name

Type

Description

timestamp

VARCHAR

Instant time is typically a timestamp when the actions performed.

action

VARCHAR

Type of action performed on the table.

state

VARCHAR

Current state of the instant.