Cache service#

The Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) cache service provides the ability to configure and automate the management of table scan redirections and materialized views in supported connectors.

The service connects to an existing SEP installation to run queries for copying data from the source catalog to the target catalog. The target catalog is regularly synchronized with the source and used as a cache.

With table scan redirections, you can redirect data requests for a table in one catalog to a cached version of it in another catalog on a more performant system, reducing the query load on the original data source. This redirection is transparent to the user, and therefore provides performance improvements without the need to modify user queries.

The cache service runs by default within the coordinator process. For evaluation purposes, you can also run the cache service as a standalone service. You can interact with it using its REST API, or the cache service CLI.

Note

The cache service requires a valid Starburst Enterprise license.

Requirements#

The cache service has similar requirements to SEP, which are described on the Deploying page. In particular, you must enable and configure the backend service before proceeding.

Linux Operating System#

  • 64-bit required

  • Newer release preferred, especially when running on containers

Java Runtime Environment#

The cache service requires a 64-bit version of Java 17.

Relational database#

The cache service requires an externally managed database for storing table scan redirections data and materialized view definitions. The following RDBMS are supported:

  • MySQL 8.0.12 or higher

  • PostgreSQL 9.6 or higher

  • OracleDB 12.2.0.1 or higher

Note

You can use the same externally managed database as for the required backend service. However, it must be located in a different schema in PostgreSQL or Oracle, or in a different database in a MySQL instance, due to naming conflicts. Optionally, you can use a separate externally managed database instead.

Select the right cache strategy#

Both materialized views and table scan redirections improve query performance. Find out which one is right for your organization.

Materialized views#

The cache service handles materialized view refreshes. To enable this, each supported catalog where storage tables reside must be configured to allow the creation of these tables.

Materialized views created in Hive catalogs are backed by the cache service and the Hive Metastore and can be automatically refreshed. If you use SEP with Hive and the cache service, you already have the necessary prerequisites to use automatically refreshed materialized views.

Any query that runs successfully in SEP can be used to create a materialized view. Users access the materialized view through the catalog it is made available from, just as they would any other data source.

Materialized views in Hive allow you to access the results of a query from any data catalog you have defined in SEP in real time without re-processing the query, and with a refresh scheme that works best for your organization.

Table scan redirections#

Table scan redirections allow you to transparently redirect queries to a cached version in a location that incurs lower egress costs, is more performant, or both. Your users get even faster performance without changing the vetted queries that they depend upon.

Embedded deployment#

The cache service can be setup to run embedded within the coordinator process by configuring the cache service on the coordinator in etc/cache.properties. This mode of deployment does not require installation of any additional packages or running a separate service.

Note

When running in embedded mode, the cache service uses the coordinator’s resources such as the JVM and logging. This additional resource usage must be planned for when sizing your cluster.

Standalone deployment#

A standalone deployment ensures that the service is not affected by coordinator performance, or the deployment of a new release on the SEP cluster.

Deployment of the cache service in Kubernetes can be managed with the available Helm chart.

Manual deployment relies on using a tarball:

  • To download the cache service binary file, contact Starburst Support

  • Starburst Support provides access to a file named much like starburst-cache-service-*.tar.gz

  • Extract it, for example with tar xfvz starburst-cache-service-*.tar.gz

The resulting directory starburst-cache-service-nnn, with nnn replaced by the release number, is called the installation directory. It contains all necessary resources.

Move the extracted directory into the desired location, such as /opt/, and you are ready to proceed with configuring the service.

Deploy the cache service#

The cache service is available at the configured cache-service.uri for both standalone and embedded deployments. The cache service can be deployed either within the existing coordinator process, or as a standalone application separate from your SEP cluster.

Note

We strongly suggest that you use the cache service embedded in the the SEP Helm chart. Use the standalone deployment only if you encounter an issue, such as performance degradation, that can be definitively attributed solely to the cache service running in embedded mode.

Enable the cache service#

To use table scan redirections or Hive materialized views, the cache service must first be installed and enabled. Our reference documentation has Requirements and instructions for doing so. Take the time to read about the two types of deployments, standalone and embedded, to see which is better suited to your cluster. If your organization uses our Helm charts to deploy SEP, you can use the Helm chart for the cache service to deploy it as standalone service.

Before you begin, ensure that you have an external database instance available to store information about materialized views and table scan redirections.

When you are ready, there are a number of configuration properties that must be set, shown in the following example:

service-database.user=alice
service-database.password=test123
service-database.jdbc-url=jdbc:mysql://mysql-server:3306/<service-database-name>
starburst.user=bob
starburst.jdbc-url=jdbc:trino://coordinator:8080
rules.file=etc/rules.json

Where these get set depends on whether or not you are using a standalone or embedded cache service, and whether you are using Helm charts or not:

  • Embedded - * via the SEP Helm-chart in the cache.properties node nested under the top level coordinator node in the config section node. * Other deployments in the etc/cache.properties file on the coordinator server.

  • Standalone - * via the cache service Helm chart in the config.properties node nested under the top level config node. * Other deployments in the etc/config.properties file on the cache service server.

Enable materialized views#

Materialized views must be enabled in the Hive catalog that they are accessed from. In addition, you must specify a schema to contain materialized view storage. We strongly recommend that you create a schema dedicated to materialized views for a given catalog. A namespace must also be specified; it is used internally by the cache service to avoid inadvertent name collisions. To create a new schema with a specified location, use a command similar to the following, which creates the views_cache_storage schema in the myhive catalog:

CREATE SCHEMA myhive.views_cache_storage WITH (location = 's3a://<s3-bucket-name>/hivepostgres_views/views_cache_storage/');

In addition, the schema for the materialized view itself must exist. If it does not exist, you must create it:

CREATE SCHEMA myhive.views_schema WITH (location = 's3a://<s3-bucket-name>/hivepostgres_views/views_schemas');

Each Hive catalog must be configured to allow materialized views and use the schema you created for materialized views:

materialized-views.enabled=true
materialized-views.namespace=mymvnamespace
materialized-views.storage-schema=mymvstorage
cache-service.uri=http://<my-cache-service-hostname>:8180

Enable table scan redirections#

Like materialized views, table scan redirections are also managed by the cache service. The redirections are transparent to users, so no additional training, modifications of queries or new commands are needed to use them.

Table scan redirection refreshes are configured in a JSON-formatted rules file. There are numerous properties to govern both global defaults for all rules and rule- specific behaviors.

Table scan redirections offer even finer-grained control for refreshes than do materialized views, including cleanup options.

Run the cache service#

Once deployed and configured, the cache service is run using the launcher script in bin/launcher as follows:

bin/launcher start

Alternatively, it can be run in the foreground, with the logs and other output written to stdout/stderr. Both streams should be captured if using a supervision system like daemontools:

bin/launcher run

Run the launcher with --help to see the supported commands and command line options. In particular, the --verbose option is very useful for debugging the installation.

The launcher configures default values for the configuration directory etc, configuration files, the data directory var, and log files in the data directory. You can change these values to adjust your usage to any requirements, such as using a directory outside the installation directory, specific mount points or locations, and even using other file names.

After starting the cache service, you can find log files in the log directory inside the data directory var:

  • launcher.log: This log is created by the launcher and is connected to the stdout and stderr streams of the server. It contains a few log messages that occur while the server logging is being initialized, and any errors or diagnostics produced by the JVM.

  • server.log: This is the main log file used by the service. It typically contains the relevant information if the server fails during initialization. It is automatically rotated and compressed.

  • http-request.log: This is the HTTP request log which contains every HTTP request received by the server. It is automatically rotated and compressed.

Configure the cache service#

The following files must exist inside the installation directory to hold the following configuration files. These files are used by the cache service alone in a standalone deployment, or by both the cache service and the coordinator in an embedded deployment:

The following file is specific to the cache service, and must also exist inside the installation directory to support type mapping:

  • etc/type-mapping.json - optional JSON file to specify type mapping between source and target catalogs.

The following file is specific the table scan redirection feature of the cache service, and must also exist inside the installation directory to support define table scan redirection rules:

  • etc/rules.json - JSON file specifying the source tables and target connector for the cache along with the schedule for refreshing them

General configuration properties#

The configuration properties file, etc/config.properties, contains the configuration for the cache service, when deployed as a standalone system. Users of the embedded mode use a etc/cache.properties file on the coordinator with the same properties.

The following is a minimal configuration for the service:

service-database.user=alice
service-database.password=test123
service-database.jdbc-url=jdbc:mysql://mysql-server:3306/<service-database-name>
starburst.user=bob
starburst.jdbc-url=jdbc:trino://coordinator:8080
rules.file=etc/rules.json

The properties to configure the cache service are explained in detail in the following sections.

General configuration properties#

Property name

Description

starburst.user

Username to connect to the SEP cluster for executing queries to refresh the cached tables. This user must also be able to read the source tables for any materialized views it refreshes.

starburst.password

Password to connect to the SEP cluster when password based authentication is enabled on the SEP cluster.

starburst.jdbc-url

JDBC URL of the SEP cluster used for executing queries to refresh the cached tables. You can use JDBC driver parameters in the connection string to configure details of the connection. For example, use jdbc:trino://coordinator:8080?SSL=true for a cluster secured with TLS.

cache-service.uri

URI of the cache service, including the hostname and port number.

rules.file

Path to the JSON file containing rules for identifying source tables and target connector for caching. It also specifies a schedule for refreshing cached tables.

rules.refresh-period

Frequency at which cache rules are refreshed from the rules.file. Defaults to 1m.

max-table-import-threads

Maximum number of table import jobs that can be run in parallel. Defaults to 20.

refresh-interval

Frequency at which the cache service triggers refresh of cached tables and checks for a need to refresh materialized views based on their refresh interval and cron expression. Do not set this cache service interval to large values that can cause defined materialized view refreshes to be skipped. Defaults to 2m.

refresh-initial-delay

Initial delay for startup of the refresh. Defaults to 0s.

cleanup-interval

Frequency at which cache service triggers cleanup of expired tables in the cache. Defaults to 5m.

cleanup-initial-delay

Initial delay for startup of the cleanup. Defaults to 0s.

HTTP/TLS and authentication configuration properties#

The following properties allow you to configure the cache service to use either HTTP or HTTPS, and to set up file-based authentication if appropriate for your environment.

HTTP/TLS and authentication properties#

Property name

Description

Default

http-server.http.port

HTTP port for the cache service

8180

http-server.https.port

HTTPS port of the cache service

8543

http-server.https.enabled

Flag to activate HTTPS/TLS

false

http-server.authentication.type

Authentication type used for the cache service, use password for password file based authentication

none

http-server.https.keystore.path

Path to the JKS keystore file used for TLS

http-server.https.keystore.key

Name of the key in the JKS keystore used for TLS

file.password-file

Path to the password file used with the file authentication type

Configure the cache service for HTTPS/TLS connections#

To configure the cache service to use HTTPS connections, add the following configuration properties to the cache.properties file on the coordinator:

starburst.user=<starburst-user>
starburst.password=<starburst-password>
starburst.jdbc-url=jdbc:trino://<coordinator-hostname>:8443?SSL=true

This example assumes that your cluster has the default 8443 HTTPS port already enabled.

You can configure the cache service to accept incoming HTTPS connections on a custom port by using the http-server.https.port configuration property in the cache.properties file:

http-server.https.port=<custom-port>

Configure the cache service for insecure HTTP connections#

To configure the cache service to use HTTP connections, add the following configuration properties to your cache.properties file:

starburst.user=cache-service
starburst.jdbc-url=jdbc:trino://<coordinator-hostname>:8080

The following configuration property must also be set in the config.properties file:

server.authentication.allow-insecure-over-http=true

This example assumes that your cluster has the default 8080 HTTP port still enabled.

You can configure the cache service to accept incoming HTTP connections on a custom port by using the http-server.http.port configuration property in the cache.properties file:

http-server.http.port=<custom-port>

File-based authentication#

File-based password authentication can be configured for the cache service by adding the following properties in the config.properties file:

http-server.https.enabled=true
http-server.authentication.type=password
http-server.https.keystore.path=etc/auth/localhost.keystore
http-server.https.keystore.key=changeit
file.password-file=etc/auth/password.db

Log levels configuration property#

The optional log levels file, etc/log.properties, allows setting the minimum log level for named logger hierarchies. There are four decreasingly verbose levels: DEBUG, INFO, WARN and ERROR. The default logging level is INFO.

Every logger has a name, which is typically the fully qualified name of the class that uses the logger. Loggers have a hierarchy based on the dots in the name, like Java packages. For example, consider the following log levels file:

com.starburstdata.cache=WARN

This sets the minimum level to WARN for both com.starburstdata.cache.db and com.starburstdata.cache.rules.

Type mapping#

Type mapping overcomes missing type support in target storage catalogs for cache service-manged table scan redirections and materialized views with Hive as the target catalog. It allows your users to create materialized views without the need to perform data type casting, and allows data engineers to create cached table projections in target catalogs where there is not a one-to-one type mapping. Type mapping uses definitions in a JSON file, type-mapping.json, to define type casting rules used when the target catalog is created or updated.

Mappings are key-value pairs as source: target. The following example shows how to construct type mapping rules for the target catalog, myhivesalesdata:

{
  "rules": {
    "myhivesalesdata": {
      "timestamp(0)": "timestamp(3)",
      "timestamp(1)": "timestamp(3)",
      "timestamp(2)": "timestamp(3)"
    }
  }
}

Each target catalog has a separate entry in the JSON file. The Trino name of any type that is supported by the source catalog can be used as a key in the type map. Values must be Trino types supported by the target catalog.

Type casting is applied to regular columns, partition columns, and the column used for incremental update.

Type mapping behavior#

Columns can only be cast based on their type. It is impossible to cast one column of a given type without casting all columns of that same type.

There are pre-configured type mappings CAST_TIMESTAMPS_TO_MILLISECONDS, CAST_TIMESTAMPS_TO_MICROSECONDS and CAST_TIMESTAMPS_TO_NANOSECONDS that can only be set using cache properties, not via the JSON type mapping. These mappings extend the precision of all timestamp-related types, they never truncate data, and ignore the target catalog name.

Some mappings, like INTEGER -> VARCHAR, can change the semantics of the max() function used for calculating the contents of an incremental update. For example "-1" > "7" because string lengths are compared first, so in incremental update the row with "-1" is added even if the table already contains data up to "7".

Using timestamp columns for partitioning is strongly discouraged. Timestamp partitions are silently changed to TIMESTAMP(3) while being written. This behavior can result in loss of precision for some columns. Additionally, it can lead to a huge amount of partitions, which negatively impacts performance.

Type mapping configuration#

The following optional properties allow configuring type mapping rules:

Cache service type mapping configuration#

Property name

Description

type-mapping

Set the kind of type mapping to apply:

  • NONE

  • CAST_TIMESTAMPS_TO_MILLISECONDS

  • CAST_TIMESTAMPS_TO_MICROSECONDS

  • CAST_TIMESTAMPS_TO_NANOSECONDS

  • FILE

Defaults to NONE.

type-mapping.file

Path to the JSON file

type-mapping.file.refresh-period

Frequency to use to refresh the type mapping rules from the type-mapping.file file. Defaults to 1m.

JVM configuration#

The Java Virtual Machine (JVM) config file, etc/jvm.config, contains a list of command line options used for launching the JVM running the cache service for the cache service, when deployed as a standalone system. When deployed embedded within the coordinator, the contents of the etc/jvm.config apply to both the cache-service and to all other services and processes running on the coordinator.

The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces, or other special characters, should not be quoted.

The following is a basic etc/jvm.config file:

-server
-Xmx512M
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError

An OutOfMemoryError typically leaves the JVM in an inconsistent state. The above configuration causes the JVM to write a heap dump file for debugging, and forcibly terminate the process when this occurs.

JMX metrics#

Metrics about table import are reported in the JMX table jmx.current."com.starburstdata.cache:name=TableImportService".

Metrics about cached table cleanup are reported in the JMX table jmx.current."com.starburstdata.cache:name=CleanupService".

Metrics about redirections requests on the web service resources are reported in the JMX table jmx.current."com.starburstdata.cache.resource:name=RedirectionsResource".

Metrics about table import and expiration requests on the web service resource are reported in the JMX table jmx.current."com.starburstdata.cache.resource:name=CacheResource".