Starburst Splunk connector#
The Starburst Splunk connector allows querying data stored in Splunk.
Requirements#
To connect to Splunk, you need:
Splunk username and password credentials.
Network access from the coordinator and workers to Splunk, the default port is 8089.
A valid Starburst Enterprise license.
Configuration#
Create a catalog properties file in etc/catalog
named example.properties
to access Splunk as configured in the example
catalog (replace example
with your database name or some other descriptive name of the catalog) with the
following contents:
connector.name=splunk
splunk.url=https://example.splunk.com:8089
splunk.user=admin
splunk.password=password
splunk.schema-directory=/path/to/schema/directory
The connector can only access Splunk with the access credentials specified in the catalog configuration file:
Specify the
connector.name
property assplunk
.Configure the catalog to the URL for your Splunk server hostname and management port, default 8089.
Add your Splunk user and password for your account.
If you need to access Splunk with different credentials, configure a separate catalog.
If the Splunk server is secured with TLS, you must export the server certificate
and store it on the SEP coordinator and all workers in the same path on each,
relative to the SEP etc
directory. Then, configure the
splunk.ssl-server-cert
property to specify this path to the Splunk server
certificate file.
General configuration properties#
The following table describes general catalog configuration properties for the connector:
Property name |
Description |
Default value |
---|---|---|
|
Support case insensitive schema and table names. |
|
|
This value should be a duration. |
|
|
Path to a name mapping configuration file in JSON format that allows Trino to disambiguate between schemas and tables with similar names in different cases. |
|
|
Frequency with which Trino checks the name matching configuration file for changes. This value should be a duration. |
(refresh disabled) |
|
The duration for which metadata, including table and column statistics, is cached. |
|
|
Cache the fact that metadata, including table and column statistics, is not available |
|
|
Maximum number of objects stored in the metadata cache |
|
|
Maximum number of statements in a batched execution. Do not change this setting from the default. Non-default values may negatively impact performance. |
|
|
Push down dynamic filters into JDBC queries |
|
|
Maximum duration for which Trino will wait for dynamic filters to be collected from the build side of joins before starting a JDBC query. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries. |
|
Access Splunk data#
In addition to Splunk system tables, the connector can use previously saved searches to generate reports for configured users. The name of the table in SEP is the name of the report in Splunk. To create a new table, run any search in Splunk, select Save As > Report, and specify a name.
The connector only supports accessing Splunk data via saved reports.
Metadata cache#
The connector caches report names that it reads from Splunk. If you create a new
report in Splunk, it may not show up right away. You can manually reset the
metadata cache by restarting SEP or running the following CALL
statement:
CALL example.system.reset_metadata_cache();
Generate a schema file#
A schema file for a Splunk report details the SEP table name, columns, and data types. If a table does not have a schema file, the connector scans up to 50 rows from Splunk to dynamically detect the types and columns to generate an internal schema. Alternatively, you can manually create a schema file or follow the example procedure to create a schema file.
When you manually create a schema file you must add it to the schema directory
on all nodes to specify the correct data types. We recommend that you use the
create_schema
procedure to create the initial schema file, provide the table
name, then edit the file as necessary. The coordinator generates the file in the
configured schema directory with the name of the table and an rsd
extension.
If you edit the file, you must copy the new version to each host in the cluster.
CALL example.system.create_schema('orders');
The following example schema file for the TPC-H orders
table was generated
by the connector. The name of the file is orders.rsd
. It must be placed in
the directory specified by the splunk.schema-directory
configuration
property on every SEP node.
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.cdata.com/ns/rsbscript/2" xmlns:other="http://apiscript.com/ns?v1">
<api:info title="orders" other:catalog="CData" other:schema="Splunk" description="null" other:earliest_time="0" other:savedsearch="true" other:tablename="orders" other:search=" from inputlookup:"orders.csv"" other:tablepath="servicesNS/admin/search/search/jobs" other:version="20">
<attr name="clerk" xs:type="string" isrequired="false" columnsize="2000" description="" other:internalname="clerk" other:filterable="false" />
<attr name="comment" xs:type="string" isrequired="false" columnsize="2000" description="" other:internalname="comment" other:filterable="false" />
<attr name="custkey" xs:type="int" isrequired="false" columnsize="4" precision="10" description="" other:internalname="custkey" other:filterable="false" />
<attr name="orderdate" xs:type="date" isrequired="false" columnsize="3" description="" other:internalname="orderdate" other:filterable="false" />
<attr name="orderkey" xs:type="int" isrequired="false" columnsize="4" precision="10" description="" other:internalname="orderkey" other:filterable="false" />
<attr name="orderpriority" xs:type="string" isrequired="false" columnsize="2000" description="" other:internalname="orderpriority" other:filterable="false" />
<attr name="orderstatus" xs:type="string" isrequired="false" columnsize="2000" description="" other:internalname="orderstatus" other:filterable="false" />
<attr name="shippriority" xs:type="int" isrequired="false" columnsize="4" precision="10" description="" other:internalname="shippriority" other:filterable="false" />
<attr name="totalprice" xs:type="double" isrequired="false" columnsize="8" precision="15" description="" other:internalname="totalprice" other:filterable="false" />
</api:info>
<api:script method="GET">
<api:call op="splunkadoSelect">
<api:push/>
</api:call>
</api:script>
</api:script>
Type mapping#
The connector maps the following SQL types in Splunk to SEP types:
BOOLEAN
INTEGER
BIGINT
DOUBLE
VARCHAR
DATE
TIME(3)
TIMESTAMP(3)
All other SQL types are not supported.
SQL support#
The connector supports globally available and read operation statements to access data and metadata in Splunk.
Performance#
The connector includes a number of performance improvements, detailed in the following sections.
Dynamic filtering#
Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a JDBC query.
You can disable dynamic filtering by setting the dynamic-filtering.enabled
property in your catalog configuration file to false
.
Wait timeout#
By default, table scans on the connector are delayed up to 20 seconds until dynamic filters are collected from the build side of joins. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries.
You can configure the dynamic-filtering.wait-timeout
property in your
catalog properties file:
dynamic-filtering.wait-timeout=1m
You can use the dynamic_filtering_wait_timeout
catalog session
property in a specific session:
SET SESSION example.dynamic_filtering_wait_timeout = 1s;
Compaction#
The maximum size of dynamic filter predicate, that is pushed down to the
connector during table scan for a column, is configured using the
domain-compaction-threshold
property in the catalog
properties file:
domain-compaction-threshold=100
You can use the domain_compaction_threshold
catalog
session property:
SET SESSION domain_compaction_threshold = 10;
By default, domain-compaction-threshold
is set to 32
.
When the dynamic predicate for a column exceeds this threshold, it is compacted
into a single range predicate.
For example, if the dynamic filter collected for a date column dt
on the
fact table selects more than 32 days, the filtering condition is simplified from
dt IN ('2020-01-10', '2020-01-12',..., '2020-05-30')
to dt BETWEEN
'2020-01-10' AND '2020-05-30'
. Using a large threshold can result in increased
table scan overhead due to a large IN
list getting pushed down to the data
source.
Metrics#
Metrics about dynamic filtering are reported in a JMX table for each catalog:
jmx.current."io.trino.plugin.jdbc:name=example,type=dynamicfilteringstats"
Metrics include information about the total number of dynamic filters, the number of completed dynamic filters, the number of available dynamic filters and the time spent waiting for dynamic filters.
Starburst Cached Views#
The connector supports table scan redirection to improve performance and reduce load on the data source.