Query result caching#

Query result caching is an industry standard mechanism used to accelerate query execution. When you execute a query, the result is cached for a specified period of time. By caching the results of queries, Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) relies on pre-computed results and avoids unnecessarily processing queries if the results can be read from the cache. Retrieving a result directly from the cache substantially reduces query time.

Note

Query result caching is a public preview in Starburst Enterprise. Contact Starburst Support with questions or feedback.

Configuration#

Query result caching is not enabled by default and must be explicitly enabled in the coordinator configuration. For Kubernetes deployments, enable the following property in the additionalProperties section of the top-level coordinator node in the values.yaml file. For Starburst Admin deployments, configure the property in the files/coordinator/config.properties.j2 file:

results-cache.enabled=true

The following configuration properties are available:

Query result caching configuration properties#

Property

Description

results-cache.s3.bucket

The name of the S3 bucket where results are stored. Required.

results-cache.s3.prefix

The prefix or directory path in the S3 bucket where cached results are stored.

results-cache.s3.aws-access-key

The access key of the principal to authenticate with for S3 service. If this property is set, automatic authentication is disabled.

results-cache.s3.aws-secret-key

The secret key of the principal to authenticate with for S3 service. If this property is set, automatic authentication is disabled.

results-cache.s3.endpoint

The S3 storage endpoint server, which can be used to connect to an S3-compatible storage system instead of AWS.

results-cache.s3.region

The AWS region where the S3 bucket is located.

results-cache.s3.sts.endpoint

The optional endpoint URL for AWS Security Token Service (STS).

results-cache.s3.sts.region

The region where the STS is located. This property is required if results-cache.s3.sts.endpoint is set.

results-cache.s3.aws-iam-role

The Amazon Resource Name (ARN) of the IAM role to assume when connecting to S3. If this configuration property is set, AWS STS authorization uses results-cache.s3.aws-access-key and results-cache.s3.aws-secret-key.

results-cache.s3.aws-external-id

The external ID that matches the IAM role’s trust policy.

results-cache.s3.path-style-access

Used to specify path-style access for all requests to the S3-compatible storage. Defaults to false.

results-cache.max-age

The maximum amount of time that query results are stored in the cache before they are considered stale. The upper limit is 7 days.

results-cache.max-result-size

The maximum size of the query results that are stored in the cache, accepts values between 1kB and 5MB, inclusive. Defaults to 1MB.

results-cache.upload.threads

The size of the thread pool allocated for uploading results to the results cache. Defaults to 5.

results-cache.telemetry-enabled

Used to enable telemetry data collection for the results cache.

The following is an example configuration for query result caching:

results-cache.enabled=true
results-cache.s3.endpoint=s3.us-east-2.amazonaws.com
results-cache.s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
results-cache.s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
results-cache.s3.bucket=sep-result-cache
results-cache.s3.region=us-east-2
results-cache.s3.path-style-access=true
results-cache.max-age=10m

Usage#

When you submit a query for the first time, it goes through a normal execution process. Query result caching stores the result of the query on S3-compatible storage. For subsequent runs of the same query, SEP fetches the cached result of the original query. The query results you serve from the cache are preserved in query history.

Cache entries are not shared between users. Any changes to the parameters of the query which might influence the results such as the query string, session or catalog session properties, or additional credentials change the cache key and are treated as a separate cache entry.

A user with appropriate permissions can use the following session property to bypass cached results:

SET SESSION
skip_results_cache=true

The flush_results_cache() procedure is used to clear the query results cache.

Expiration#

SEP does not delete cache files it creates in S3.Starburst recommends setting a lifecycle policy for the bucket used to store cached results in order to delete expired elements.

For example, the following lifecycle policy would expire cache entries after one day:

<LifecycleConfiguration>
  <Rule>
    <ID>Cleanup results cache entries</ID>
    <Status>Enabled</Status>
    <Expiration>
      <Days>1</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>

Monitoring#

You can use the JMX connector and the "com.starburstdata.cache.resultscache:name=dispatcher" MBean to see statistics related to the results cache. For example, the following query shows the total count of cache hits:

SELECT "stats.cachehits.totalcount"
FROM jmx.current."com.starburstdata.cache.resultscache:name=dispatcher"