Backend service#
The Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) backend service manages and stores information for:
Query completion details and events
Cluster metrics
Data products
Built-in role-based access control
Managed statistics
Cache service - Optional. Separate schema or database required, depending on the RDBMS.
Warning
This service is required; however by default, it is disabled. You must provide and configure a suitable database, and enable the service as described in the requirements and installation sections that follow.
Requirements#
The backend service requires the following:
An externally-managed database. The following RDBMS are supported:
MySQL 8.0.12+
PostgreSQL 9.6+
OracleDB 12.2.0.1+
Network access from the coordinator to the external database.
A valid Starburst Enterprise license.
The requirements for the external database vary based on a number of factors, including the following:
Specific RDBMS in use
Size of the cluster
Number of users
Number of executed queries
Number of users accessing metrics
Reports and other queries accessing the service database
The SEP backend service supports different authentication methods for each external database type:
RDBMS |
Authentication method |
---|---|
MySQL |
Basic, AWS IAM |
OracleDB |
Basic, Kerberos |
PostgreSQL |
Basic |
The RDBMS server is recommended to have the following hardware allocation, with adjustments based on actual server load:
64GB of RAM
8 CPU cores
At least 5GB of storage
Storage requirements vary widely based on usage patterns, and grow over time specifically due to the query history storage. To keep the cluster and database operational over time, you must monitor storage usage and adjust when necessary.
Due to the importance of the service for the proper functioning of SEP features and operations, we do not recommend allocating fewer resources for production infrastructure without testing. The following hardware allocation is the minimum required for the external service database, recommended only for use in non-production deployments:
8GB of RAM
4 CPU cores
5 GB of storage
In addition, the connection to the database needs to be highly performant in terms latency and throughput to support your specific usage load.
Installation and configuration#
To enable the backend service, configure the SEP coordinator to use the required external database with the following configuration properties:
Property name |
Description |
---|---|
|
JDBC connection string for the required external database. |
|
Username of a user with full read and write access to the required external database. |
|
Password for the database user. |
|
Enables the query history functionality. You must explicitly enable
the persistence for Insights to use the
persisted data in the backend service database. Defaults to |
Instructions for configuring the service in Kubernetes deployments are available in the Kubernetes documentation. For Starburst Admin deployments, read about the config.properties file.
The following examples show the service configuration in the coordinator’s
config.properties
for the starburstenterprise
database in PostgreSQL,
Oracle, and MySQL:
Example configuration with a PostgreSQL database
insights.jdbc.url=jdbc:postgresql://postgresql.example.com:5432/starburstenterprise
insights.jdbc.user=sepuser
insights.jdbc.password=test12
insights.persistence-enabled=true
It is not good practice to use the PostgreSQL public
schema. Configure the
user to use a different schema by default. See the PostgreSQL documentation
for details.
Example configuration with an Oracle database
In the following example, username and password authentication is used to connect the backend service to an Oracle database:
insights.jdbc.url=jdbc:oracle:thin:@oracle.example.com:1521/starburstenterprise
insights.jdbc.user=sepuser
insights.jdbc.password=test12
insights.persistence-enabled=true
As an alternative, the service can use Kerberos authentication to connect to Oracle with the following configuration:
insights.jdbc.url=jdbc:oracle:thin:@oracle.example.com:1521/starburstenterprise
insights.jdbc.authentication-type=KERBEROS
insights.jdbc.kerberos.client.principal=example@STARBURSTDATA.COM
insights.jdbc.kerberos.client.keytab=/etc/kerberos/example.keytab
insights.jdbc.kerberos.config=/etc/kerberos/krb5.conf
insights.persistence-enabled=true
Example query configuration with a MySQL database
In the following example, username and password authentication is used to connect the service to a MySQL database:
insights.jdbc.url=jdbc:mysql://mysql.example.com:3306/starburstenterprise?sessionVariables=sql_mode=ANSI
insights.jdbc.user=sepuser
insights.jdbc.password=test12
insights.persistence-enabled=true
As an alternative, the service can use IAM token-based authentication to connect to a MySQL database with the following configuration:
insights.jdbc.url=jdbc:mysql://mysql.example.com:3306/starburstenterprise?sessionVariables=sql_mode=ANSI
insights.jdbc.authentication-type=AWS_IAM
insights.jdbc.connection-user=db_user
insights.jdbc.aws.region-name=us-east-2
insights.jdbc.aws.iam-role=${ENV:RDS_ROLE_ARN}
insights.jdbc.aws.external-id=for_product_test
insights.jdbc.aws.token-expiration-timeout=1s
insights.persistence-enabled=true
You must specify sql_mode=ANSI
in the insights.jdbc.url
configuration
for a MySQL database. The configured user must have sufficient rights to create
tables and insert data in the configured schema.
Backup before upgrading#
We recommend creating a backup of your service database before upgrading to a new version of SEP. Refer to your database documentation for platform- and version-specific instructions.
Logged information#
Information from processing queries is logged in multiple tables and the contained fields. Details are documented in the following sections.
Completed queries#
A row is created in the completed_queries
table for each submitted query. It
captures everything that SEP emits – query, user, metadata, stats, performance
related attributes, resource consumption, start time, end time, and much more:
Column |
Description |
---|---|
|
Unique identifier of the query |
|
Session catalog |
|
Session schema |
|
Identity used to authenticate to SEP |
|
User agent that submitted the query |
|
Information about the client submitting the query |
|
Client tool or driver used to execute query |
|
Name of the SEP environment name |
|
Address of the client that submitted the query |
|
Version of SEP that executed the query |
|
Identity used for authorization and access control |
|
The groups the user executing the query belongs to, as an array of strings |
|
State of the query when logged |
|
Full SQL statement of the submitted query |
|
Full explain plan of the submitted query, including costs for each stage.
Identical to the output from |
|
Sum of all rows used in the query input |
|
Sum of all bytes used in the query input |
|
Number of filtered rows on query output |
|
Number of filtered bytes on query output |
|
Number of inserted rows |
|
Number of bytes of inserted rows |
|
Total accumulated CPU time across threads and workers |
|
Elapsed time for query pressing as measured by a wall clock in ms |
|
Time spent between query submission and the beginning of query planning in ms |
|
Maximum amount of memory directly tied to query resources used by a query at one time |
|
Maximum amount of memory used at one time which is not eligible to be spilled to disk |
|
Maximum amount of user memory reserved at one time by one task |
|
Maximum amount of user and system memory reserved at one time by one task, values are impacted by the memory management configurations |
|
Number of uncompressed bytes read from source |
|
Number of rows read from source |
|
Number of bytes of data exchanged between nodes during query execution |
|
Number of rows of data exchanged between nodes during query execution |
|
Integral of memory reservation with respect to time, giving units of
|
|
Total number of splits completed by the query |
|
Table statistics for each source table, as well as estimates for the CPU,
memory, and network I/O for each stage of the query plan and each operator
in each plan node. Operator costs are denoted by their plan node ID, as
indicated from the |
|
Information about JVM garbage collection tasks and time spent in GC for each query stage |
|
Min, max, average, and percentiles for reserved CPU time per task for each query stage. Can be useful for identifying skew in a query. |
|
Information about every operator, including operator type (TableScan, Aggregation, Join, etc.) and resources used (CPU time, network I/O, memory reserved, etc.) invoked at each stage of the query. |
|
Time spent waiting for sufficient resources before query execution in ms |
|
Time spent reading the metadata and checking the query for semantic errors in ms |
|
Total time a query spent in the execution phase in ms |
|
Timestamp from when query was received |
|
Timestamp when results were finished being consumed by the client |
|
List of catalogs, schemas, tables, and columns read. Query must be a
|
|
Time spent creating and optimizing the query plan in ms |
|
Total time a query’s tasks were scheduled on a thread to run in ms |
|
Information about a query’s failure reason (if applicable), as a JSON formatted string. |
|
The type of query. Possible values are the ones enumerated for the
|
|
The type of update the query performed (e.g. |
Query tables#
For every table referenced by a query, zero, one or more rows with information
about the accessed table is created in the query_tables
table. The
query_id
defines the relationship to the query in the completed_queries
table.
Column |
Description |
---|---|
|
Unique identifier of the query |
|
Name of the catalog containing the queried table |
|
Name of the schema containing the queried table |
|
Name of the queried table |
|
Number of bytes read or written |
|
Number of rows read or written |
|
Boolean flag indicating whether the table was written to ( |
Analyzing the query log#
Query log data is available for analysis and inspection in Insights query overview.
To create custom analysis and visualizations of the query log data, you can create a catalog with the PostgreSQL connector, for instance, if you are using PostgreSQL as your external backend service database, using the same JDBC connection parameters. This allows you to use the Trino CLI, or any other application connected to the catalog to create queries, dashboards, perform ad hoc analysis and more.
Example use cases:
measure query performance numbers and trends
understand impact of different cluster configurations
enable cluster workload management and resource consumption
The following section describes an example analysis.
Generate data by running a query with SEP.
trino> CREATE TABLE memory.default.country AS
SELECT n.name nation_name, r.name region_name
FROM delta.sf1.nation n JOIN delta.sf1.region r ON n.regionkey = r.regionkey;
CREATE TABLE: 25 rows
Query 20201209_021306_00003_qsyab, FINISHED, 1 node
Splits: 68 total, 68 done (100.00%)
3.34 [35 rows, 5.35KB] [10 rows/s, 1.6KB/s]
Data is written to the completed_queries
table in the PostgreSQL database.
In the following example, the catalog file is named
sepbackendservice.properties
and the public
schema contains the completed_queries
table. The
query_id
from the execution is used to locate the correct record. The output
in the following example has been trimmed.
trino> SELECT *
FROM sepbackendservice.public.completed_queries
WHERE query_id = '20201209_021306_00003_qsyab';
-[ RECORD 1 ]-------------------------+---------------------------------------
query_id | 20201209_021306_00003_qsyab
catalog | NULL
schema | NULL
principal | demo
user_agent | StatementClientV1/345
client_info | NULL
source | trino-cli
environment | demo
remote_client_address | 127.0.0.1
server_version | 350-e
usr | demo
query_state | FINISHED
query | CREATE TABLE memory.default.country AS SELECT n.name nation_name, r.name region_name FROM delta.sf1.nation n JOIN delta.sf1.region r ON n.regionkey = r.regionkey
query_plan | Fragment 0 [COORDINATOR_ONLY]
| CPU: 43.24ms, Scheduled: 60.81ms, Input: 2 rows (81B); per task: avg.: 2.00 std.dev.: 0.00, Output: 1 row (9B)
| Output layout: [rows]
| ...
total_rows | 35
total_bytes | 5477
output_rows | 1
output_bytes | 9
written_rows | 25
written_bytes | 597
cpu_time_ms | 392
wall_time_ms | 3342
queued_time_ms | 10
peak_user_memory_bytes | 387318
peak_total_non_revocable_memory_bytes | 387318
peak_task_user_memory | 387318
peak_task_total_memory | 387318
physical_input_bytes | 5274
physical_input_rows | 30
internal_network_bytes | 933
internal_network_rows | 32
cumulative_memory | 7.7086698E7
completed_splits | 68
plan_node_stats_and_costs | {"stats":{},"costs":{}}
stage_gc_statistics | [{"stageId":0,"tasks":1,"fullGcTasks":0,...
cpu_time_distribution | [{"stageId":0,"tasks":1,"p25":25,"p50":25,...
operator_summaries | [{"stageId":0,"pipelineId":0,"operatorId":0,..
resource_waiting_time | 783
analysis_time | 783
execution_time | 2559
create_time | 2020-12-09 02:13:06.140000 UTC
end_time | 2020-12-09 02:13:09.482000 UTC
accessed_metadata | [{"catalogName":"delta","schema":"sf1","table":"nation","columns":["name","regionkey"],...
failure_info | NULL
update_type | CREATE TABLE
planning_time | 1711
scheduled_time | 1263
The tables accessed by the query are recorded in the query_tables
table and
can be identified using the same query_id
.
trino> SELECT *
FROM sepbackendservice.public.query_tables
WHERE query_id = '20201209_021306_00003_qsyab';
query_id | catalog_name | schema_name | table_name | physical_bytes | physical_rows | is_output
-----------------------------+--------------+-------------+------------+----------------+---------------+-----------
20201209_021306_00003_qsyab | memory | default | country | 597 | 25 | true
20201209_021306_00003_qsyab | delta | sf1 | region | 1691 | 5 | false
20201209_021306_00003_qsyab | delta | sf1 | nation | 3583 | 25 | false
(3 rows)