Managed statistics#
You can configure Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) to collect and store its own managed Table statistics for some data sources that only collect or expose a limited number of statistics. These additional statistics can then enable the query planner to make better-informed, cost-based optimizations.
Note
Managed statistics is a public preview feature. Contact Starburst Support with questions or feedback.
Configuration#
Managed statistics requires the following:
A valid Query logger configuration and underlying database.
An internal shared secret to secure transmission of statistics.
Note
The managed statistics feature is only supported in select connectors. Reference the connectors feature matrix for more information about which connectors support this feature.
Coordinator node configuration#
In order to enable managed statistics for a cluster, add the following
configuration properties to config.properties
on the coordinator node only:
starburst.managed-statistics.enabled=true
Catalog configuration#
In addition to enabling managed statistics for the cluster, you must also enable managed statistics for each catalog.
To enable managed statistics for a catalog, add the following catalog
configuration property to the catalog configuration file under etc/catalog
:
managed-statistics.enabled=true
The catalog must also be configured to support secured internal communication. Add the following catalog configuration properties depending on which type of internal security your cluster is using.
For a shared secret, add the following property to each catalog configuration file:
internal-communication.shared-secret=<secret>
For internal TLS with a Java keystore, add the following properties to each catalog configuration file:
internal-communication.shared-secret=exampleSecret
managed-statistics.http-client.key-store-path=/path/to/keystore.jks
managed-statistics.http-client.key-store-password=examplePassword
For internal TLS with a certificate truststore, add the following properties to each catalog configuration file:
internal-communication.shared-secret=exampleSecret
managed-statistics.http-client.trust-store-path=/path/to/keystore.jks
managed-statistics.http-client.trust-store-password=examplePassword
Managed statistics configuration properties#
The following configuration properties are used to control managed statistics on the coordinator:
Property name |
Description |
---|---|
|
Enables managed statistics on a cluster. Defaults to |
|
Maximum amount of time for managed statistics to be stored on the
in-memory cache on the coordinator before they are retrieved from the
database. For more information see statistics storage and management. Default value is |
|
Maximum number of objects stored in the in-memory statistics cache. If
this value is exceeded, the least-used statistics are discarded and only
retrieved again from the database when the query planner needs to
reference them. Defaults to |
Collecting statistics#
Managed statistics must be collected manually with the collect_statistics
command in an ALTER TABLE EXECUTE statement, with
the optional column
parameter to only collect statistics for a specified
array of column names. See the connector documentation for more information
on ALTER TABLE EXECUTE
statement support and syntax for that connector.
The following statistics are collected:
For a table:
Row count: The total number of rows in the table.
For each column in a table:
Nulls fraction: The fraction of null values.
Distinct value count: The number of distinct values.
Range: For data types that have a range, the range of values.
Data size: For textual data types, the total data size of all values.
Storage and management#
When statistics are collected, they are stored in the query logger database
indefinitely until collect_statistics
is run again and overwrites any stale
statistics. These statistics are pulled from the database into an in-memory
cache on the coordinator at a frequency defined by the
managed-statistics.cache-ttl
configuration property which defaults to 10
minutes. The query planner accesses this in-memory cache for statistics instead
of the query logger database, reducing network I/O on the database at the cost
of additional memory usage on the coordinator node. To disable the in-memory
cache and have the query planner directly access the query logger database for
statistics, set the managed-statistics.cache-ttl
property to 0
.
The in-memory cache retrieves its statistics from the SEP database, not from
catalogs directly, so you must still routinely run the collect_statistics
command to ensure that the cache does not store stale statistics.