Starburst Splunk connector#

The Starburst Splunk connector allows querying data stored in Splunk.

Requirements#

To connect to Splunk, you need:

  • Splunk username and password credentials.

  • Network access from the coordinator and workers to Splunk. The default port is 8089.

  • A valid Starburst Enterprise license.

Configuration#

Create a catalog properties file named example.properties in etc/catalog to access Splunk (replace example with a descriptive name for the catalog) with the following contents:

connector.name=splunk
splunk.url=https://example.splunk.com:8089
splunk.user=admin
splunk.password=password
splunk.schema-directory=/path/to/schema/directory

The connector can only access Splunk with the credentials specified in the catalog configuration file:

  1. Set the connector.name property to splunk.

  2. Set splunk.url to the URL of your Splunk server, including the hostname and management port (default 8089).

  3. Set splunk.user and splunk.password to the credentials for your account.

If you need to access Splunk with different credentials, configure a separate catalog.
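For example, to connect with a second set of credentials, you could add another catalog file such as etc/catalog/splunk_reporting.properties. The file name and credentials shown here are hypothetical:

```text
connector.name=splunk
splunk.url=https://example.splunk.com:8089
splunk.user=reporting_user
splunk.password=reporting_password
splunk.schema-directory=/path/to/schema/directory
```

Queries then reference the second catalog by its name, in this case splunk_reporting.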

If the Splunk server is secured with TLS, you must export the server certificate and store it on the SEP coordinator and all workers in the same path on each, relative to the SEP etc directory. Then, configure the splunk.ssl-server-cert property to specify this path to the Splunk server certificate file.
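For example, if the exported server certificate is stored at etc/splunk/splunk.pem on every node (a hypothetical path, relative to the SEP etc directory), the catalog configuration would include:

```text
splunk.ssl-server-cert=etc/splunk/splunk.pem
```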

General configuration properties#

The following table describes general catalog configuration properties for the connector:

  • case-insensitive-name-matching: Support case-insensitive schema and table names. Default: false

  • case-insensitive-name-matching.cache-ttl: Duration for which case-insensitive schema and table name matches are cached. This value should be a duration. Default: 1m

  • case-insensitive-name-matching.config-file: Path to a name mapping configuration file in JSON format that allows Trino to disambiguate between schemas and tables with similar names in different cases. Default: null

  • case-insensitive-name-matching.config-file.refresh-period: Frequency with which Trino checks the name matching configuration file for changes. This value should be a duration. Default: refresh disabled

  • metadata.cache-ttl: Duration for which metadata, including table and column statistics, is cached. Default: 0s (caching disabled)

  • metadata.cache-missing: Cache the fact that metadata, including table and column statistics, is not available. Default: false

  • metadata.cache-maximum-size: Maximum number of objects stored in the metadata cache. Default: 10000

  • write.batch-size: Maximum number of statements in a batched execution. Do not change this setting from the default; non-default values may negatively impact performance. Default: 1000

  • dynamic-filtering.enabled: Push down dynamic filters into JDBC queries. Default: true

  • dynamic-filtering.wait-timeout: Maximum duration for which Trino waits for dynamic filters to be collected from the build side of joins before starting a JDBC query. Using a large timeout can potentially result in more detailed dynamic filters, but it can also increase latency for some queries. Default: 20s
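As a sketch, a catalog that enables case-insensitive matching and metadata caching might add the following to the configuration shown earlier. The values are illustrative, not recommendations:

```text
case-insensitive-name-matching=true
metadata.cache-ttl=10m
metadata.cache-missing=true
```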

Access Splunk data#

In addition to Splunk system tables, the connector can use previously saved searches to generate reports for configured users. The name of the table in SEP is the name of the report in Splunk. To create a new table, run any search in Splunk, select Save As > Report, and specify a name.

The connector only supports accessing Splunk data via saved reports.
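For example, after saving a report named orders in Splunk, you can query it like any other table. The catalog and schema names below follow the schema file example later on this page and may differ in your deployment:

```sql
SELECT orderkey, orderstatus, totalprice
FROM example.splunk.orders
WHERE orderstatus = 'O'
LIMIT 10;
```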

Metadata cache#

The connector caches report names that it reads from Splunk. If you create a new report in Splunk, it may not show up right away. You can manually reset the metadata cache by restarting SEP or running the following CALL statement:

CALL example.system.reset_metadata_cache();

Generate a schema file#

A schema file for a Splunk report details the SEP table name, columns, and data types. If a table does not have a schema file, the connector scans up to 50 rows from Splunk to dynamically detect the columns and types, and generates an internal schema. Alternatively, you can create a schema file manually, or generate one with the create_schema procedure and edit it as needed.

When you manually create a schema file, you must add it to the schema directory on all nodes to specify the correct data types. We recommend that you use the create_schema procedure to create the initial schema file, providing the table name, and then edit the file as necessary. The coordinator generates the file in the configured schema directory with the name of the table and an rsd extension. If you edit the file, you must copy the new version to each host in the cluster.

CALL example.system.create_schema('orders');

The following example schema file for the TPC-H orders table was generated by the connector. The name of the file is orders.rsd. It must be placed in the directory specified by the splunk.schema-directory configuration property on every SEP node.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.cdata.com/ns/rsbscript/2" xmlns:other="http://apiscript.com/ns?v1">
  <api:info title="orders" other:catalog="CData" other:schema="Splunk" description="null" other:earliest_time="0" other:savedsearch="true" other:tablename="orders" other:search=" from inputlookup:&quot;orders.csv&quot;" other:tablepath="servicesNS/admin/search/search/jobs"  other:version="20">
    <attr   name="clerk"            xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="clerk"            other:filterable="false"   />
    <attr   name="comment"          xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="comment"          other:filterable="false"   />
    <attr   name="custkey"          xs:type="int"      isrequired="false"   columnsize="4"      precision="10"   description=""                                                      other:internalname="custkey"          other:filterable="false"   />
    <attr   name="orderdate"        xs:type="date"     isrequired="false"   columnsize="3"                       description=""                                                      other:internalname="orderdate"        other:filterable="false"   />
    <attr   name="orderkey"         xs:type="int"      isrequired="false"   columnsize="4"      precision="10"   description=""                                                      other:internalname="orderkey"         other:filterable="false"   />
    <attr   name="orderpriority"    xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="orderpriority"    other:filterable="false"   />
    <attr   name="orderstatus"      xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="orderstatus"      other:filterable="false"   />
    <attr   name="shippriority"     xs:type="int"      isrequired="false"   columnsize="4"      precision="10"   description=""                                                      other:internalname="shippriority"     other:filterable="false"   />
    <attr   name="totalprice"       xs:type="double"   isrequired="false"   columnsize="8"      precision="15"   description=""                                                      other:internalname="totalprice"       other:filterable="false"   />
  </api:info>

  <api:script method="GET">
    <api:call op="splunkadoSelect">
      <api:push/>
    </api:call>
  </api:script>
</api:script>

Type mapping#

The connector maps the following SQL types in Splunk to SEP types:

  • BOOLEAN

  • INTEGER

  • BIGINT

  • DOUBLE

  • VARCHAR

  • DATE

  • TIME(3)

  • TIMESTAMP(3)

No other SQL types are supported.

SQL support#

The connector supports globally available and read operation statements to access data and metadata in Splunk.

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Predicate pushdown support#

The connector supports predicate pushdown on columns of the following types:

  • BOOLEAN

  • INTEGER

  • BIGINT

  • DOUBLE

  • VARCHAR

  • TIMESTAMP(3)

You can set predicate-pushdown.enabled=true in the catalog configuration or use the predicate_pushdown_enabled session property to enable predicate pushdown. By default, this configuration property is set to false.
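For example, to enable pushdown for a single session, using the example catalog name from the configuration section:

```sql
SET SESSION example.predicate_pushdown_enabled = true;
```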

Note

Enabling predicate pushdown may cause correctness issues when pushing predicates down where the value of the column is NULL.

Dynamic filtering#

Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a JDBC query.

You can disable dynamic filtering by setting the dynamic-filtering.enabled property in your catalog configuration file to false.

Wait timeout#

By default, table scans on the connector are delayed up to 20 seconds until dynamic filters are collected from the build side of joins. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries.

You can configure the dynamic-filtering.wait-timeout property in your catalog properties file:

dynamic-filtering.wait-timeout=1m

You can use the dynamic_filtering_wait_timeout catalog session property in a specific session:

SET SESSION example.dynamic_filtering_wait_timeout = 1s;

Compaction#

The maximum size of the dynamic filter predicate that is pushed down to the connector during a table scan for a column is configured using the domain-compaction-threshold property in the catalog properties file:

domain-compaction-threshold=100

You can use the domain_compaction_threshold catalog session property:

SET SESSION example.domain_compaction_threshold = 10;

By default, domain-compaction-threshold is set to 32. When the dynamic predicate for a column exceeds this threshold, it is compacted into a single range predicate.

For example, if the dynamic filter collected for a date column dt on the fact table selects more than 32 days, the filtering condition is simplified from dt IN ('2020-01-10', '2020-01-12',..., '2020-05-30') to dt BETWEEN '2020-01-10' AND '2020-05-30'. Using a large threshold can result in increased table scan overhead due to a large IN list getting pushed down to the data source.

Metrics#

Metrics about dynamic filtering are reported in a JMX table for each catalog:

jmx.current."io.trino.plugin.jdbc:name=example,type=dynamicfilteringstats"

Metrics include information about the total number of dynamic filters, the number of completed dynamic filters, the number of available dynamic filters, and the time spent waiting for dynamic filters.
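For example, you can inspect these metrics with a query against the jmx catalog, assuming it is configured in your cluster:

```sql
SELECT *
FROM jmx.current."io.trino.plugin.jdbc:name=example,type=dynamicfilteringstats";
```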

Starburst Cached Views#

The connector supports table scan redirection to improve performance and reduce load on the data source.