Materialized views#

Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) supports materialized views with the Hive and Iceberg connectors.

Materialized views increase query performance by providing pre-computed results from complex, analytical queries. With SEP, you can run federated queries, create materialized views, and access catalogs of results through the Hive and Iceberg connectors.

Materialized views created in a Hive catalog may be automatically refreshed, as discussed in this document.

Note

There is no support at this time for automatic refresh of materialized views in Iceberg.

Prerequisites#

Your SEP platform administrator or a data engineer with administrative access to SEP has to first enable materialized views in one or more Hive or Iceberg catalogs. Check with them to learn which catalogs are enabled, and which schemas may be used.

You also must have the necessary access privileges to create data in the schema designated in each enabled catalog.

Create and use a materialized view#

Any federated query that runs successfully in SEP can be used to create a materialized view. Materialized views in SEP are created in the same way as in other data platforms, with a CREATE MATERIALIZED VIEW statement. In the following example, the mysalescatalog has been configured to allow materialized views:

CREATE MATERIALIZED VIEW mysalescatalog.mysalesschema.mv_cust_tot_return AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

Once a materialized view exists, you can query it like any regular table:

SELECT * FROM mycatalog.mysalesschema.mv_cust_tot_return;

Automatically refresh materialized views in Hive#

Hive catalogs in SEP configured to allow materialized views provide several WITH clause properties to configure refresh schedules and how new data is imported:

  • refresh_interval and cron: Choose one method to specify a refresh frequency. The defined interval must be greater than or equal to five minutes.

  • max_import_duration: Specifies how long to allow a refresh to complete before failing.

  • grace_period: Specifies the amount of time in-flight queries can run against an expiring snapshot.

  • incremental_column: Specifies the column to be used to identify new data since the last refresh. If you do not use this field, SEP performs a full refresh.

We suggest that you review the Hive connector documentation , which has more information about these properties.

In the following example, the refresh_interval property is used to automatically refresh the data every 24 hours from the time the CREATE statement initially runs:

CREATE MATERIALIZED VIEW myhive.mysalesschema.mv_cust_tot_return
WITH (
  refresh_interval = '24h',
  grace_period = '5m',
  max_import_duration = '30m'
) AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

In this example, the refresh runs for a maximum of 30 minutes. Unless the cron property is specified, the time at which data in a materialized view is refreshed is based on the moment the CREATE MATERIALIZED VIEW statement first runs, plus the refresh_interval.

You can run refreshes on a set schedule by using the cron property instead. The cron property uses normal cron expressions. Here is the same materialized view, created with a cron schedule and an incremental column:

CREATE MATERIALIZED VIEW myhive.mysalesschema.mv_cust_tot_return
WITH (
  cron = '30 2 * * *'
  grace_period = '5m',
  max_import_duration = '30m',
  incremental_column = 'sr_returned_date_sk'
) AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

This causes the refresh to execute at 2:30 AM daily, and loads only new data as determined by the sr_returned_date_sk date column.

Enable Iceberg materialized views when using Glue#

Built-in access control (BIAC) must be configured to own table entities in order to enable materialized views in an Iceberg catalog when using Glue.

See Using BIAC with Iceberg materialized views in Glue for more details.