Materialized views#
Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) supports materialized views with the Hive and Iceberg connectors.
Materialized views increase query performance by providing pre-computed results from complex, analytical queries. With SEP, you can run federated queries, create materialized views, and access catalogs of results through the Hive and Iceberg connectors.
Materialized views created in a Hive catalog may be automatically refreshed, as discussed in this document.
Note
There is no support at this time for automatic refresh of materialized views in Iceberg.
Prerequisites#
Your SEP platform administrator or a data engineer with administrative access to SEP has to first enable materialized views in one or more Hive or Iceberg catalogs. Check with them to learn which catalogs are enabled, and which schemas may be used.
You also must have the necessary access privileges to create data in the schema designated in each enabled catalog.
Create and use a materialized view#
Any federated query that runs successfully in SEP can be used to
create a materialized view. Materialized views in SEP are
created in the same way as in other data platforms, with a CREATE MATERIALIZED VIEW statement. In the following
example, the mysalescatalog
has been configured to allow materialized views:
CREATE MATERIALIZED VIEW mysalescatalog.mysalesschema.mv_cust_tot_return AS
SELECT
sr_customer_sk ctr_customer_sk,
sr_store_sk ctr_store_sk,
sum(sr_return_amt) ctr_total_return
FROM
tpcds.sf1.store_returns,
tpcds.sf1.date_dim
WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
GROUP BY sr_customer_sk, sr_store_sk
;
Once a materialized view exists, you can query it like any regular table:
SELECT * FROM mycatalog.mysalesschema.mv_cust_tot_return;
Automatically refresh materialized views in Hive#
Hive catalogs in SEP configured to allow materialized views
provide several WITH
clause properties to configure refresh schedules and how
new data is imported:
refresh_interval
andcron
: Choose one method to specify a refresh frequency. The defined interval must be greater than or equal to five minutes.max_import_duration
: Specifies how long to allow a refresh to complete before failing.grace_period
: Specifies the amount of time in-flight queries can run against an expiring snapshot.incremental_column
: Specifies the column to be used to identify new data since the last refresh. If you do not use this field, SEP performs a full refresh.
We suggest that you review the Hive connector documentation , which has more information about these properties.
In the following example, the refresh_interval
property is used to
automatically refresh the data every 24 hours from the time the CREATE
statement initially runs:
CREATE MATERIALIZED VIEW myhive.mysalesschema.mv_cust_tot_return
WITH (
refresh_interval = '24h',
grace_period = '5m',
max_import_duration = '30m'
) AS
SELECT
sr_customer_sk ctr_customer_sk,
sr_store_sk ctr_store_sk,
sum(sr_return_amt) ctr_total_return
FROM
tpcds.sf1.store_returns,
tpcds.sf1.date_dim
WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
GROUP BY sr_customer_sk, sr_store_sk
;
In this example, the refresh runs for a maximum of 30 minutes. Unless the
cron
property is specified, the time at which data in a materialized view is
refreshed is based on the moment the CREATE MATERIALIZED VIEW
statement first
runs, plus the refresh_interval
.
You can run refreshes on a set schedule by using the cron
property instead.
The cron
property uses normal cron expressions. Here is the same materialized
view, created with a cron
schedule and an incremental column:
CREATE MATERIALIZED VIEW myhive.mysalesschema.mv_cust_tot_return
WITH (
cron = '30 2 * * *'
grace_period = '5m',
max_import_duration = '30m',
incremental_column = 'sr_returned_date_sk'
) AS
SELECT
sr_customer_sk ctr_customer_sk,
sr_store_sk ctr_store_sk,
sum(sr_return_amt) ctr_total_return
FROM
tpcds.sf1.store_returns,
tpcds.sf1.date_dim
WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
GROUP BY sr_customer_sk, sr_store_sk
;
This causes the refresh to execute at 2:30 AM daily, and loads only new data as
determined by the sr_returned_date_sk
date column.
Enable Iceberg materialized views when using Glue#
Built-in access control (BIAC) must be configured to own table entities in order to enable materialized views in an Iceberg catalog when using Glue.
See Using BIAC with Iceberg materialized views in Glue for more details.