Starburst Generic JDBC connector#

The Generic JDBC connector allows querying and creating tables in an external database that SEP does not have a dedicated connector for. This can be used for proof-of-concept deployments of federation use-cases, enabling joining multiple data sources. For production deployments a dedicated supported connector should be used.

Warning

Due to its generic nature, this connector may or may not work with a particular database or deployment scenario. In particular, you should expect the connector to handle at most only a subset of data types available in the external database.

Requirements#

To connect to a data source with the Generic JDBC connector, you need:

  • A JDBC driver for the database or system you want to connect.

  • Network access from the coordinator and workers to the data source. The used port varies for different databases and configurations.

  • A valid Starburst Enterprise license.

Configuration#

The Generic JDBC connector requires a JDBC 4.2 driver dedicated for the external database. No JDBC drivers are shipped with the connector. The installation steps assume you have the external database already running and that you have the appropriate JDBC driver jar.

  1. Locate generic-jdbc directory under SEP’s plugin directory. When SEP is installed from RPM, this is /usr/lib/starburst/lib/plugin/generic-jdbc.

  2. Add the appropriate JDBC driver jar file to the generic-jdbc directory.

  3. Locate etc/catalog directory containing configuration for SEP catalogs. When SEP is installed from RPM, this is /etc/starburst/catalog.

  4. Create example.properties file in the catalogs configuration directory (replace example with your database name or some other descriptive name of the catalog) with the following contents:

connector.name=generic_jdbc

# the name of the Driver class. E.g. for Postgres this would be org.postgresql.Driver
generic-jdbc.driver-class=FULLY_QUALIFIED_NAME_OF_THE_DRIVER_CLASS

# JDBC connection URL. E.g. for Postgres this could be jdbc:postgresql://example.net:5432/database?defaultRowFetchSize=10000
connection-url=JDBC_CONNECTION_URL_AS_APPROPRIATE_FOR_THE_EXTERNAL_DATABASE_DRIVER

# connection credentials
connection-user=USERNAME
connection-password=PASSWORD

# additional configuration
case-insensitive-name-matching=true
  1. Perform the above steps on every SEP cluster node.

  2. Restart SEP on every node.

General configuration properties#

The following table describes general catalog configuration properties for the connector:

Property name

Description

Default value

case-insensitive-name-matching

Support case insensitive schema and table names.

false

case-insensitive-name-matching.cache-ttl

This value should be a duration.

1m

case-insensitive-name-matching.config-file

Path to a name mapping configuration file in JSON format that allows Trino to disambiguate between schemas and tables with similar names in different cases.

null

case-insensitive-name-matching.config-file.refresh-period

Frequency with which Trino checks the name matching configuration file for changes. This value should be a duration.

(refresh disabled)

metadata.cache-ttl

The duration for which metadata, including table and column statistics, is cached.

0s (caching disabled)

metadata.cache-missing

Cache the fact that metadata, including table and column statistics, is not available

false

metadata.cache-maximum-size

Maximum number of objects stored in the metadata cache

10000

write.batch-size

Maximum number of statements in a batched execution. Do not change this setting from the default. Non-default values may negatively impact performance.

1000

dynamic-filtering.enabled

Push down dynamic filters into JDBC queries

true

dynamic-filtering.wait-timeout

Maximum duration for which Trino will wait for dynamic filters to be collected from the build side of joins before starting a JDBC query. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries.

20s

Caution

Because the connector is generic, while you can configure these properties there is no guarantee that the configured feature works as expected.

Multiple external databases#

If you want to connect to multiple external databases using the Generic JDBC connector, follow these steps:

  1. Add appropriate JDBC driver jars to the generic-jdbc directory under SEP’s plugin directory for all database servers you are going to connect to. If two external databases use the same driver jar, you do not need to add the driver jar twice.

  2. Define additional catalogs in SEP catalogs configuration directory, by creating additional_example.properties file (replacing additional_example with some descriptive name of the catalog).

JDBC compatibility#

Generic JDBC connector provides best-effort compatibility with older JDBC drivers, including drivers implementing version 3 of the JDBC specification (JDBC 3).

The best-effort compatibility can be turned on with:

generic-jdbc.experimental.jdbc-version-compatibility.enabled=true

In rare cases, JDBC drivers fail to implement search string escaping as documented in the JDBC specification. In those cases, you can force the driver to ignore escaping by setting the following property:

generic-jdbc.ignore-search-string-escape=true

Table functions#

The connector provides specific table functions to access a data source with the generic JDBC connector.

query(varchar) -> table#

The query function allows you to query the underlying database directly. It requires syntax native to the data source, because the full query is pushed down and processed in the data source. This can be useful for accessing native features or for improving query performance in situations where running a query natively may be faster.

The query table function is available in the system schema of any catalog that uses the generic JDBC connector, such as example. The following example passes myQuery to the data source. myQuery has to be a valid query for the data source, and is required to return a table as a result:

SELECT
  *
FROM
  TABLE(
    example.system.query(
      query => 'myQuery'
    )
  );

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Dynamic filtering#

Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a JDBC query.

You can disable dynamic filtering by setting the dynamic-filtering.enabled property in your catalog configuration file to false.

Wait timeout#

By default, table scans on the connector are delayed up to 20 seconds until dynamic filters are collected from the build side of joins. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries.

You can configure the dynamic-filtering.wait-timeout property in your catalog properties file:

dynamic-filtering.wait-timeout=1m

You can use the dynamic_filtering_wait_timeout catalog session property in a specific session:

SET SESSION example.dynamic_filtering_wait_timeout = 1s;

Compaction#

The maximum size of dynamic filter predicate, that is pushed down to the connector during table scan for a column, is configured using the domain-compaction-threshold property in the catalog properties file:

domain-compaction-threshold=100

You can use the domain_compaction_threshold catalog session property:

SET SESSION domain_compaction_threshold = 10;

By default, domain-compaction-threshold is set to 32. When the dynamic predicate for a column exceeds this threshold, it is compacted into a single range predicate.

For example, if the dynamic filter collected for a date column dt on the fact table selects more than 32 days, the filtering condition is simplified from dt IN ('2020-01-10', '2020-01-12',..., '2020-05-30') to dt BETWEEN '2020-01-10' AND '2020-05-30'. Using a large threshold can result in increased table scan overhead due to a large IN list getting pushed down to the data source.

Metrics#

Metrics about dynamic filtering are reported in a JMX table for each catalog:

jmx.current."io.trino.plugin.jdbc:name=example,type=dynamicfilteringstats"

Metrics include information about the total number of dynamic filters, the number of completed dynamic filters, the number of available dynamic filters and the time spent waiting for dynamic filters.

Table scan redirections#

The connector supports table scan redirection to improve performance and reduce load on the data source.

SQL support#

The connector provides read and write access to data and metadata in a database using a JDBC connection. In addition to the globally available and read operation statements, the connector supports the following features:

Some SQL statements may not work depending on the database and JDBC driver.

Type mapping#

Type mapping configuration properties#

The following properties can be used to configure how data types from the connected data source are mapped to Trino data types and how the metadata is cached in Trino.

Property name

Description

Default value

unsupported-type-handling

Configure how unsupported column data types are handled:

  • IGNORE, column is not accessible.

  • CONVERT_TO_VARCHAR, column is converted to unbounded VARCHAR.

The respective catalog session property is unsupported_type_handling.

IGNORE

jdbc-types-mapped-to-varchar

Allow forced mapping of comma separated lists of data types to convert to unbounded VARCHAR

Security#

The connector includes a number of security-related features, detailed in the following sections.

User impersonation#

Generic JDBC connector supports user impersonation.

User impersonation can be enabled in the catalog file:

generic-jdbc.impersonation.enabled=true

User impersonation in Generic JDBC connector is based on SET ROLE.

Limitations#

Data types#

Support for data types depends on the data types in the external database and how the data types are reported by the JDBC driver of the external database. Actual support may vary from database to database. The Generic JDBC connector will hide columns which data type it considered unsupported. In the extreme case, this may mean that no columns are visible, if all data types being used are unsupported.

If you want to query data with unsupported types, you can define a view in your external database, converting unsupported data types to supported ones (converting to varchar is usually a good choice). Then you should query this view from SEP, instead of the underlying table.