Configure the cache service in Kubernetes#

The starburst-cache-service Helm chart configures a standalone Cache service to use with Starburst Cached Views.

We strongly suggest that you follow best practices to customize your cluster. Create small, targeted files to override any defaults and add any configuration properties. There is an example file set in our deployment guide that describes the recommended way to manage your customizations.

You must configure the following to use the Starburst cache service:

  • The cache service

  • Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP), to use the cache service

  • The SEP query logger

Make sure that the cache service is available before configuring and restarting SEP to use it.

Requirements#

  • Externally managed, compatible relational database.

  • Full access to a dedicated schema on the database with username and password credentials.

  • Network access on the configured port between the database and the cache service container in the Kubernetes cluster.

Configure and start the cache service#

To update the cache service with any configuration changes, run the helm upgrade command with the updated YAML files and the --install switch. For example:

$ helm upgrade my-caching-service starburstdata/starburst-cache-service \
    --install \
    --version 413.18.0 \
    --values ./registry-access.yaml \
    --values ./cache-service-prod.yaml

When you update the cache service, you can use the same command that you use to upgrade to a new release. Helm compares all --values files and the version, and safely ignores any that are unchanged.
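
As a hedged illustration of a small, targeted override file, the following sketch shows what a file such as cache-service-prod.yaml might contain. The file name is taken from the command above. The label value and the choice of nodePort are assumptions made for illustration only, not recommended settings.

```yaml
# cache-service-prod.yaml
# List only the settings that differ from the chart defaults.
commonLabels:
  environment: prod    # hypothetical label value

expose:
  type: "nodePort"     # the chart default is "clusterIp"
```

Helm merges the keys in each --values file over the chart defaults, so any node omitted here keeps its default value.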

Top level nodes#

The top level nodes included in the values.yaml file are described in the following table. Default values and examples follow this table:

Top level values.yaml nodes#

Node name

Description

image:

Contains the details for the cache service Docker image. Review our best practices for managing registry access across all Starburst products. NOTE: The image: node is handled the same way as for the Docker image and registry section for the Helm chart for SEP.

keystore:

Contains the secret to be mounted under the specified podFileLocation within the container.

config:

Specifies the configuration properties for the cache service under config.properties, its JVM config under jvm.config, and its logging configuration under log.properties. It also contains the rules.json and type-mapping.json nodes.

registryCredentials:

Defines authentication details for Docker registry access. Typically, you need to use your username and password for the Starburst Harbor instance. Cannot be used if using imagePullSecrets:. NOTE: The registryCredentials: node is handled the same way as for the Docker image and registry section for the Helm chart for SEP.

imagePullSecrets:

Alternative authentication for selected Docker registry using secrets. Cannot be used if using registryCredentials:.

resources:

The CPU and memory resources to use for the cache service. Request and limit values should be identical. These settings can be adjusted to match your workload and available node sizes in the cluster.

expose:

Defines the mechanisms and their options that expose the cache service to an outside network. type: "clusterIp" is configured by default.

database:

Defines the database backend for the cache service. Defaults to type: "internal".

envFrom: []

Allows for the propagation of environment variables from different sources complying with the K8S schema specification. This can be used to deliver values to the cache service configuration properties files by creating a Kubernetes secret holding variable values.

env: []

Defines additional environment variables for the cache service.

nodeSelector: {}, affinity: {} and tolerations: []

Configuration to allow Kubernetes to determine the node and pod to use. These nodes are left empty by default.

commonLabels: {}

Defines common labels to identify all cache service objects in a KRM to use with the kustomize utility and other tools.
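
The following minimal sketch shows how the env:, nodeSelector:, and tolerations: nodes described in the table might be filled in. It assumes that env: accepts the standard Kubernetes name/value entries, which the chart documentation does not state explicitly. The variable, label, and taint names are hypothetical.

```yaml
# Extra environment variable for the cache service container.
env:
  - name: TZ                  # hypothetical variable
    value: "UTC"

# Schedule the pod only on nodes that carry this label.
nodeSelector:
  workload: cache-service     # hypothetical node label

# Tolerate a matching taint so the pod can run on dedicated nodes.
tolerations:
  - key: "dedicated"          # hypothetical taint key
    operator: "Equal"
    value: "cache-service"
    effect: "NoSchedule"
```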

image:#

The following are the default values for the cache service image: top level node in the values.yaml file:

image:
  repository: "starburstdata/starburst-cache-service"
  tag: "413-e.18"
  pullPolicy: "IfNotPresent"

config:#

The defaults for nodes nested under config: are described below.

config.properties:#

This node specifies configuration properties and their values for the cache service. The following are the default values for the cache service config.properties: node in the values.yaml file:

config:
  config.properties: |
    service-database.user=alice
    service-database.password=test123
    service-database.jdbc-url=jdbc:mysql://mysql-server:13306/redirections
    starburst.user=bob
    starburst.jdbc-url=jdbc:trino://coordinator:8080
    rules.file=etc/rules.json

jvm.config:#

This node specifies the command line configuration options for starting the Java Virtual Machine (JVM) used by the cache service. The following are the default values for the cache service jvm.config: node in the values.yaml file:

config:
  jvm.config: |
    -server
    --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
    --add-opens=java.base/java.nio=ALL-UNNAMED
    --add-opens=java.base/java.lang=ALL-UNNAMED
    --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
    -XX:G1HeapRegionSize=32M
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+ExitOnOutOfMemoryError
    -XX:ReservedCodeCacheSize=512M
    -XX:PerMethodRecompilationCutoff=10000
    -XX:PerBytecodeRecompilationCutoff=10000
    -XX:+UnlockDiagnosticVMOptions
    -XX:+UseAESCTRIntrinsics
    -XX:InitialRAMPercentage=80
    -XX:MaxRAMPercentage=80
    -Djdk.nio.maxCachedBufferSize=2000000
    -Djdk.attach.allowAttachSelf=true

log.properties:#

This optional node specifies the logging configuration for the cache service. The following are the default values for the cache service log.properties: node in the values.yaml file:

config:
  log.properties: |
    io.airlift=INFO

rules.json:#

The rules.json node is specific to table scan redirections. It specifies the source tables and target connector for the cache, along with the schedule for refreshing them. The following are the default values for the cache service rules.json: node in the values.yaml file:

config:
  rules.json: |
    {
      "rules": []
    }

The following example demonstrates how to implement cache service refresh rules in this node by adding the JSON file content in a multi-line segment:

config:
  rules.json: |
    {
      "defaultGracePeriod": "42m",
      "defaultMaxImportDuration": "1m",
      "defaultCacheCatalog": "default_cache_catalog",
      "defaultCacheSchema": "default_cache_schema",
      "defaultUnpartitionedImportConfig": {
        "usePreferredWritePartitioning": false,
        "preferredWritePartitioningMinNumberOfPartitions": 1,
        "writerCount": 128,
        "scaleWriters": false,
        "writerMinSize": "110MB"
      },
      "defaultPartitionedImportConfig": {
        "usePreferredWritePartitioning": true,
        "preferredWritePartitioningMinNumberOfPartitions": 40,
        "writerCount": 256,
        "scaleWriters": false,
        "writerMinSize": "52MB"
      },
      "rules": [
        {
          "catalogName": "mysql",
          "schemaName": "marketing",
          "tableName": "events",
          "refreshInterval": "2m",
          "gracePeriod": "15m",
          "incrementalColumn": "event_id",
          "deletePredicate": "event_date < date_add('day', -31, CURRENT_DATE)"
        }
      ]
    }

type-mapping.json:#

The type-mapping.json node specifies type mapping rules between source and target catalogs. This node is not included in the values.yaml file by default. The following example maps integer to long under tpch, and maps three timestamp precisions under hive to timestamp(3) in the target:

config:
  type-mapping.json: |
    {
      "rules": {
        "tpch": {
          "integer": "long"
        },
        "hive": {
          "timestamp(0)": "timestamp(3)",
          "timestamp(1)": "timestamp(3)",
          "timestamp(2)": "timestamp(3)"
        }
      }
    }

registryCredentials:#

The following are the default values for the cache service registryCredentials: top level node in the values.yaml file:

registryCredentials:
  enabled: false
  registry:
  username:
  password:
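
As a sketch of an enabled configuration, the following example fills in the nodes shown above. The registry address is an assumption about the Starburst Harbor instance, and the credentials are placeholders. Use the values supplied with your Starburst account.

```yaml
registryCredentials:
  enabled: true
  registry: "harbor.starburstdata.net/starburstdata"   # assumed registry address
  username: "<your_harbor_username>"                   # placeholder
  password: "<your_harbor_password>"                   # placeholder
```

Keeping this node in its own file, such as the registry-access.yaml file used in the helm upgrade example, lets you reuse it across charts.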

imagePullSecrets:#

Instead of setting registryCredentials you can pass a list of secrets in the following format. This feature is disabled by default:

# imagePullSecrets:
#  - name: secret1
#  - name: secret2
imagePullSecrets:

resources:#

The following are the default values for the cache service resources: top level node in the values.yaml file:

resources:
  requests:
    memory: 2Gi
    cpu: 0.5
  limits:
    memory: 2Gi
    cpu: 4
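
The description of the resources: node recommends identical request and limit values, while the defaults above differ for cpu. The following sketch shows one way to align them. The specific sizes are illustrative assumptions and should be adjusted to your workload.

```yaml
resources:
  requests:
    memory: 2Gi
    cpu: 2        # illustrative value, equal to the limit
  limits:
    memory: 2Gi
    cpu: 2
```

When every container in a pod has equal requests and limits for both cpu and memory, Kubernetes assigns the pod the Guaranteed quality of service class.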

expose:#

You must expose the service to allow it to connect to the SEP coordinator, and to reach it with tools such as the cache service CLI. This service-type configuration is defined by the expose: top level node. You can choose from four different mechanisms by setting the type: value to the common configurations in k8s.

You only need to configure the section whose name matches the type you choose.

Types for expose: top level node#

Type

Description

clusterIp

Default value. Only exposes the service internally within the k8s cluster using an IP address internal to the cluster. Use this in the early stages of configuration.

nodePort

Exposes the service on a static port, the nodePort, on each node so that requests from outside the cluster can reach it. External service requests are made to <nodeIP>:<nodePort>. The clusterIP service is automatically created to supply all internal IP addresses.

loadBalancer

Used with platforms that provide a load balancer. This option automatically creates nodePort and clusterIP services, to which the external load balancer routes requests.

ingress

This option provides a production-level, securable configuration. It allows a load balancer to route to multiple apps in the cluster, and may provide load balancing, SSL termination, and name-based virtual hosting. For example, the SEP coordinator and Ranger server can be in the same cluster, and can be exposed via ingress: configuration.

The following are the default values for the cache service expose: top level node in the values.yaml file:

expose:
  port: 8180
  # one of: nodePort, clusterIp, loadBalancer, ingress
  type: "clusterIp"
  clusterIp:
    name: "cache-service"
    ports:
      http:
        port: 8180
  nodePort:
    name: "cache-service"
    ports:
      http:
        port: 8180
        nodePort: 30180
  loadBalancer:
    name: "cache-service"
    IP: ""
    ports:
      http:
        port: 8180
    annotations: {}
    sourceRanges: []
  ingress:
    ingressName: "cache-service-ingress"
    serviceName: "cache-service"
    servicePort: 8180
    ingressClassName:
    tls:
      enabled: true
      secretName:
    host:
    path: "/"
    annotations: {}
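
As a hedged example of switching from the default, the following sketch selects the ingress type and fills in the nodes that are empty by default. The ingress class, host name, and TLS secret name are hypothetical and depend on the ingress controller and certificates available in your cluster.

```yaml
expose:
  type: "ingress"
  ingress:
    ingressName: "cache-service-ingress"
    serviceName: "cache-service"
    servicePort: 8180
    ingressClassName: "nginx"            # assumes an NGINX ingress controller
    tls:
      enabled: true
      secretName: "cache-service-tls"    # hypothetical TLS secret
    host: "cache.example.com"            # hypothetical host name
    path: "/"
    annotations: {}
```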

database.internal:#

The database backend for the cache service is a PostgreSQL database internal to the cluster by default.

Alternatively, you can use an external backend database for production usage that you must manage yourself.

In the .Values.config.config.properties configuration, you must refer to the appropriate environment variables to enable integration with the database backend:

config.properties: |
  service-database.user=${ENV:SERVICE_DATABASE_USER}
  service-database.password=${ENV:SERVICE_DATABASE_PASSWORD}
  service-database.jdbc-url=${ENV:SERVICE_DATABASE_JDBC_URL}

The following snippet shows the default configuration for the internal cache service backend database:

database:
  type: internal
  internal:
    image:
      repository: "library/postgres"
      tag: "10.6"
      pullPolicy: "IfNotPresent"
    volume:
      # use one of:
      # - existingVolumeClaim to specify existing PVC
      # - persistentVolumeClaim to specify spec for new PVC
      # - other volume type inline configuration, e.g. emptyDir
      # Examples:
      # existingVolumeClaim: "my_claim"
      # persistentVolumeClaim:
      #  storageClassName:
      #  accessModes:
      #    - ReadWriteOnce
      #  resources:
      #    requests:
      #      storage: "2Gi"
      emptyDir: {}
    resources:
      requests:
        memory: "1Gi"
        cpu: 2
      limits:
        memory: "1Gi"
        cpu: 2
    driver: "org.postgresql.Driver"
    port: 5432
    databaseName: "cacheservice"
    databaseUser: "cacheservice"
    databasePassword: "CacheServicePass1234"
    envFrom: []
    env: []

Internal cache service backend database configuration#

Node name

Description

database.type

Set to internal to use a database in the k8s cluster, managed by the chart.

database.internal.image

Docker container images used for the PostgreSQL server

database.internal.volume

Storage volume to persist the database. The default configuration requests a new persistent volume (PV).

database.internal.volume.persistentVolumeClaim

The default configuration, which requests a new persistent volume (PV).

database.internal.volume.existingVolumeClaim

Alternative volume configuration, which uses an existing volume claim by referencing the name as the value in quotes, e.g., "my_claim".

database.internal.volume.emptyDir

Alternative volume configuration, which configures an empty directory on the pod. Keep in mind that a pod replacement loses the database content.

database.internal.resources

CPU and memory resources for the PostgreSQL container.

database.internal.databaseName

Name of the internal database

database.internal.databaseUser

User to connect to the internal database

database.internal.databasePassword

Password to connect to the internal database

database.internal.envFrom

YAML sequence of mappings to define a Secret or ConfigMap as a source of environment variables for the PostgreSQL container.

database.internal.env

YAML sequence of mappings, each with two keys, to define environment variables for the PostgreSQL container.
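
The following sketch shows how the internal database might be given persistent storage by requesting a new persistent volume claim, based on the commented example in the default configuration. The storage class name is hypothetical.

```yaml
database:
  type: internal
  internal:
    volume:
      persistentVolumeClaim:
        storageClassName: "standard"   # hypothetical storage class
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "2Gi"
```

Because Helm merges this file over the chart defaults, the default emptyDir: {} entry may still be present after the merge. How the chart resolves that is not described here, so verify the rendered manifest, for example with helm template, before relying on this configuration.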

database.external:#

This section shows the setup for using an external PostgreSQL or MySQL database. You must provide the necessary details for the external server, and ensure that it can be reached from the k8s cluster pod. Set database.type to external and configure the connection properties.

In the .Values.config.config.properties configuration, you must refer to the appropriate environment variables to enable integration with the database backend:

config.properties: |
  service-database.user=${ENV:SERVICE_DATABASE_USER}
  service-database.password=${ENV:SERVICE_DATABASE_PASSWORD}
  service-database.jdbc-url=${ENV:SERVICE_DATABASE_JDBC_URL}

database:
  type: external
  external:
    jdbcUrl:
    user:
    password:

External cache service backend database configuration#

Node name

Description

database.type

Set to external to use an existing PostgreSQL or MySQL database outside the cluster.

database.external.jdbcUrl

JDBC URL to connect to the external database, including hostname and port, in the format required by the PostgreSQL or MySQL driver. Typically the syntax requires the host, port, and database name, for example jdbc:postgresql://host:port/database or jdbc:mysql://host:port/database.

database.external.user

Database user name to access the external database using JDBC.

database.external.password

Password for the user configured to access the external database using JDBC.
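
As an illustration of the nodes in the table above, the following sketch points the cache service at an external PostgreSQL database. The host name, database name, and user are hypothetical, and the password is a placeholder.

```yaml
database:
  type: external
  external:
    jdbcUrl: "jdbc:postgresql://db.example.com:5432/cacheservice"   # hypothetical host and database
    user: "cacheservice"                                            # hypothetical user
    password: "<your_database_password>"                            # placeholder
```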

commonLabels:#

The following are the default values for the commonLabels: top level node in the cache service values.yaml file:

commonLabels: {}
#  environment: dev
#  myLabel: labelValue

Configure SEP to use the cache service#

The cache service requires a database schema to store configuration data. Ensure that you have created the schema, and note the connection information. Once the cache service and the external RDBMS are configured and running, SEP must be configured as in the following example, which shows a PostgreSQL database providing the backing schema:

coordinator:
  etcFiles:
    properties:
      cache.properties: |
        service-database.user=postgres
        service-database.password=S3cr3t1v3
        service-database.jdbc-url=jdbc:postgresql://<your_rds_endpoint>:5432/redirections
        starburst.user=starburst_service
        starburst.password=
        starburst.jdbc-url=jdbc:trino://coordinator:8080
        rules.file=secretRef:cache-rules:cache-rules.json
        rules.refresh-period=1m
        refresh-initial-delay=1m
        refresh-interval=24h

Many connectors support the use of the cache service. For each supported catalog that you wish to use with the cache service, two lines must be added to the catalog properties configuration:

redirection.config-source=SERVICE
cache-service.uri=http://cache-service:8180

In the following example, the mysalesdata catalog is configured to use the cache service:

catalogs:
  mysalesdata: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://<mydbhost>:5432/bootcamp
    connection-user=postgres
    connection-password=S3cr3t1v3
    statistics.enabled=true
    redirection.config-source=SERVICE
    cache-service.uri=http://cache-service:8180

Configuration examples#

External secret reference#

To configure the cache service to work with the cache rules as an external secret reference, first create a k8s secret holding the file:

kubectl create secret generic cache-rules --from-file=cache-rules.json
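
If you prefer to manage the secret declaratively, a manifest along the following lines should produce an equivalent secret when applied with kubectl apply. This is a sketch. The rules content shown is the empty default from the rules.json section and stands in for your own cache-rules.json file.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cache-rules
type: Opaque
stringData:
  # The key name must match the file name referenced in rules.file.
  cache-rules.json: |
    {
      "rules": []
    }
```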

After the secret is created, you can reference it in the cache service configuration:

config:
  config.properties: |
    service-database.user=${ENV:SERVICE_DATABASE_USER}
    service-database.password=${ENV:SERVICE_DATABASE_PASSWORD}
    service-database.jdbc-url=${ENV:SERVICE_DATABASE_JDBC_URL}
    starburst.user=bob
    starburst.jdbc-url=jdbc:trino://coordinator:8080
    rules.file=secretRef:cache-rules:cache-rules.json

This mounts the secret named cache-rules in the path /mnt/secretRef/cache-rules and replaces the secretRef:cache-rules occurrences with the absolute path, resulting in the following configuration property setting:

rules.file=/mnt/secretRef/cache-rules/cache-rules.json

This mechanism can only be applied to properties files defined under the .Values.config node. Specific secret values, such as passwords, can be passed into properties files using .Values.envFrom.
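
A minimal sketch of the envFrom approach follows. It assumes a Kubernetes secret, here given the hypothetical name cache-service-credentials, that holds a key named STARBURST_PASSWORD. Both names are illustrative. The secret's keys are exposed as environment variables, which the properties file then reads with the ${ENV:...} syntax.

```yaml
# Expose every key of the secret as an environment variable.
envFrom:
  - secretRef:
      name: cache-service-credentials   # hypothetical secret name

config:
  config.properties: |
    service-database.user=${ENV:SERVICE_DATABASE_USER}
    service-database.password=${ENV:SERVICE_DATABASE_PASSWORD}
    service-database.jdbc-url=${ENV:SERVICE_DATABASE_JDBC_URL}
    starburst.user=bob
    starburst.password=${ENV:STARBURST_PASSWORD}
    starburst.jdbc-url=jdbc:trino://coordinator:8080
    rules.file=secretRef:cache-rules:cache-rules.json
```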