Query result caching#
Query result caching is an industry standard mechanism used to accelerate query execution. When you execute a query, the result is cached for a specified period of time. By caching the results of queries, Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) relies on pre-computed results and avoids unnecessarily processing queries if the results can be read from the cache. Retrieving a result directly from the cache substantially reduces query time.
Note
Query result caching is a public preview in Starburst Enterprise. Contact Starburst Support with questions or feedback.
Configuration#
Query result caching is not enabled by default and must be explicitly enabled in
the coordinator configuration. For Kubernetes deployments, enable the following
property in the additionalProperties
section of the top-level coordinator
node in the values.yaml
file. For Starburst Admin deployments, configure the
property in the files/coordinator/config.properties.j2
file:
results-cache.enabled=true
The following configuration properties are available:
Property |
Description |
---|---|
|
The name of the S3 bucket where results are stored. Required. |
|
The prefix or directory path in the S3 bucket where cached results are stored. |
|
The access key of the principal to authenticate with for S3 service. If this property is set, automatic authentication is disabled. |
|
The secret key of the principal to authenticate with for S3 service. If this property is set, automatic authentication is disabled. |
|
The S3 storage endpoint server, which can be used to connect to an S3-compatible storage system instead of AWS. |
|
The AWS region where the S3 bucket is located. |
|
The optional endpoint URL for AWS Security Token Service (STS). |
|
The region where the STS is located. This property is required if
|
|
The Amazon Resource Name (ARN) of the IAM role to assume when connecting
to S3. If this configuration property is set, AWS STS authorization uses
|
|
The external ID that matches the IAM role’s trust policy. |
|
Used to specify path-style access for all requests to the S3-compatible
storage. Defaults to |
|
The maximum amount of time that query results are stored in the cache before they are considered stale. The upper limit is 7 days. |
|
The maximum size of the query results that are stored in the cache,
accepts values between |
|
The size of the thread pool allocated for uploading results to the
results cache. Defaults to |
|
Used to enable telemetry data collection for the results cache. |
The following is an example configuration for query result caching:
results-cache.enabled=true
results-cache.s3.endpoint=s3.us-east-2.amazonaws.com
results-cache.s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
results-cache.s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
results-cache.s3.bucket=sep-result-cache
results-cache.s3.region=us-east-2
results-cache.s3.path-style-access=true
results-cache.max-age=10m
Usage#
When you submit a query for the first time, it goes through a normal execution process. Query result caching stores the result of the query on S3-compatible storage. For subsequent runs of the same query, SEP fetches the cached result of the original query. The query results you serve from the cache are preserved in query history.
Cache entries are not shared between users. Any changes to the parameters of the query which might influence the results such as the query string, session or catalog session properties, or additional credentials change the cache key and are treated as a separate cache entry.
A user with appropriate permissions can use the following session property to bypass cached results:
SET SESSION
skip_results_cache=true
The flush_results_cache()
procedure is used to clear the query results cache.
Expiration#
SEP does not delete cache files it creates in S3.Starburst recommends setting a lifecycle policy for the bucket used to store cached results in order to delete expired elements.
For example, the following lifecycle policy would expire cache entries after one day:
<LifecycleConfiguration>
<Rule>
<ID>Cleanup results cache entries</ID>
<Status>Enabled</Status>
<Expiration>
<Days>1</Days>
</Expiration>
</Rule>
</LifecycleConfiguration>
Monitoring#
You can use the JMX connector and the
"com.starburstdata.cache.resultscache:name=dispatcher"
MBean to see statistics
related to the results cache. For example, the following query shows the total
count of cache hits:
SELECT "stats.cachehits.totalcount"
FROM jmx.current."com.starburstdata.cache.resultscache:name=dispatcher"