Configuring the Hive Metastore Service in Kubernetes#

Looking for the installation guide? This topic covers configuring the Hive Metastore Service (HMS) after you have completed a basic install of Starburst Enterprise. If you have not yet done that, go to our installation guide.


The starburst-hive Helm chart configures an HMS, the Hive metastore, and optionally the backing database in the cluster, as detailed in the following sections. A number of Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) connectors and features require an HMS.

Use your registry credentials, and follow best practices by creating an override file for changes to default values as desired.

In addition to the configuration properties described in this document, you can also use the base Hive connector’s metastore configuration properties, Thrift metastore configuration properties and AWS Glue metastore configuration properties as needed, depending on your environment.

Using the HMS#

The expose section configures the DNS availability of the HMS in the cluster. By default the HMS is available at the hostname hive and port 9083. As a result the Thrift URL within the cluster is thrift://hive:9083.

You can use the URL for any catalog:

catalog:
  datalake: |
    connector.name=hive
    hive.metastore.uri=thrift://hive:9083

Docker image and registry#

This works the same as the Docker image and registry section for the SEP Helm chart.

image:
  repository: "harbor.starburstdata.net/starburstdata/hive"
  tag: "3.1.3-e.3"
  pullPolicy: "IfNotPresent"

registryCredentials:
  enabled: false
  registry:
  username:
  password:

imagePullSecrets:
 - name:

Exposing the pod to the outside network#

The expose section for the HMS works identically to the SEP server expose section. Differences are isolated to the configured default values. Additionally, ingress is not supported, since the HMS service uses the TCP protocol and not HTTP. The default type is clusterIp, with the hostname hive and port 9083. When changing this configuration, ensure that you adapt your configured catalogs to use the correct Thrift URLs.

Default clusterIp:

expose:
  type: "clusterIp"
  clusterIp:
    name: "hive"
    ports:
      http:
        port: 9083

Alternative with nodePort:

expose:
  type: "nodePort"
  nodePort:
    name: "hive"
    ports:
      http:
        port: 9083
        nodePort: 30083

Alternative with loadBalancer:

expose:
  type: "loadBalancer"
  loadBalancer:
    name: "hive"
    IP: ""
    ports:
      http:
        port: 9083
    annotations: {}
    sourceRanges: []

Internal database backend for HMS#

The database backend for HMS is a PostgreSQL database internal to the cluster by default.

Note

Alternatively, for production usage, you can use an external backend database that you must manage yourself.

The following snippet shows the default configuration for the internal HMS backend database:

database:
  type: internal
  internal:
    image:
      repository: "library/postgres"
      tag: "10.6"
      pullPolicy: "IfNotPresent"
    volume:
      # use one of:
      # - existingVolumeClaim to specify existing PVC
      # - persistentVolumeClaim to specify spec for new PVC
      # - other volume type inline configuration, e.g. emptyDir
      # Examples:
      # existingVolumeClaim: "my_claim"
      # persistentVolumeClaim:
      #  storageClassName:
      #  accessModes:
      #    - ReadWriteOnce
      #  resources:
      #    requests:
      #      storage: "2Gi"
      emptyDir: {}
    resources:
      requests:
        memory: "1Gi"
        cpu: 2
      limits:
        memory: "1Gi"
        cpu: 2
    driver: "org.postgresql.Driver"
    port: 5432
    databaseName: "hive"
    databaseUser: "hive"
    databasePassword: "HivePass1234"
    envFrom: []
    env: []

Internal HMS backend database configuration#

Node name

Description

database.type

Set to internal to use a database in the k8s cluster, managed by the chart

database.internal.image

Docker container images used for the PostgreSQL server

database.internal.volume

Storage volume to persist the database. The default configuration requests a new persistent volume (PV).

database.internal.volume.persistentVolumeClaim

The default configuration, which requests a new persistent volume (PV).

database.internal.volume.existingVolumeClaim

Alternative volume configuration, which uses an existing volume claim by referencing its name as the value in quotes, e.g., "my_claim".

database.internal.volume.emptyDir

Alternative volume configuration, which configures an empty directory on the pod. Keep in mind that a pod replacement loses the database content.

database.internal.resources

Resource requests and limits for the internal PostgreSQL container

database.internal.databaseName

Name of the internal database

database.internal.databaseUser

User to connect to the internal database

database.internal.databasePassword

Password to connect to internal database

database.internal.envFrom

YAML sequence of mappings to define a Secret or ConfigMap as a source of environment variables for the PostgreSQL container.

database.internal.env

YAML sequence of mappings, each with name and value keys, to define environment variables for the PostgreSQL container.

Examples#

OpenShift deployments often do not have access to pull the default library/postgres image from the Docker registry. You can replace it with an image from the Red Hat registry, which requires additional environment variables set with the database.internal.env parameter:

database:
  type: internal
  internal:
    image:
       repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
       tag: "latest"
    env:
      - name: POSTGRESQL_DATABASE
        value: "hive"
      - name: POSTGRESQL_USER
        value: "hive"
      - name: POSTGRESQL_PASSWORD
        value: "HivePass1234"

Another option is to create a Secret (for example, postgresql-secret) containing the variables needed by PostgreSQL that are shown in the previous code block, and pass it to the container with the envFrom parameter:

database:
  type: internal
  internal:
    image:
       repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
       tag: "latest"
    envFrom:
      - secretRef:
          name: postgresql-secret
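
A minimal sketch of such a secret, assuming the same credentials as in the env example above; the secret name and values are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: postgresql-secret
type: Opaque
stringData:
  POSTGRESQL_DATABASE: "hive"
  POSTGRESQL_USER: "hive"
  POSTGRESQL_PASSWORD: "HivePass1234"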

External backend database for HMS#

This section shows the setup for using an external PostgreSQL, MySQL, Oracle, or Microsoft SQL Server database. You must provide the necessary details for the external server and ensure that it can be reached from the k8s cluster pod. Set database.type to external and configure the connection properties:

database:
  type: external
  external:
    jdbcUrl:
    driver:
    user:
    password:

External HMS backend database configuration#

Node name

Description

database.type

Set to external to use an existing PostgreSQL, MySQL, Oracle, or SQL Server database outside the cluster.

database.external.jdbcUrl

JDBC URL to connect to the external database, as required by the database and the driver used, including hostname and port. Ensure you use a valid JDBC URL as required by the PostgreSQL, MySQL, Oracle, or SQL Server driver. Typically, the syntax requires the host, port, and database name as follows:

  • For PostgreSQL: jdbc:postgresql://host:port/database.

  • For MySQL: jdbc:mysql://host:port/database.

  • For Oracle, using the thin or OCI driver: jdbc:oracle:<driver>:@//host:port/database.

  • For SQL Server: The database name must be passed in the HIVE_METASTORE_DB_NAME environment variable, and the JDBC connection string uses the syntax jdbc:sqlserver://host:port.

database.external.driver

Valid values are as follows:

  • For PostgreSQL: org.postgresql.Driver.

  • For an external MySQL or a compatible database: com.mysql.jdbc.Driver.

  • For Oracle: oracle.jdbc.OracleDriver.

  • For SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver.

database.external.user

Database user name to access the external database using JDBC.

database.external.password

Password for the user configured to access the external database using JDBC.
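
For example, a minimal sketch for an external PostgreSQL backend; the hostname, database name, and credentials are placeholders:

database:
  type: external
  external:
    jdbcUrl: "jdbc:postgresql://postgres.example.com:5432/hive"
    driver: "org.postgresql.Driver"
    user: "hive"
    password: "HivePass1234"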

Server start up configuration#

You can create a startup shell script to customize how HMS is started, and pass additional arguments to it.

initFile:
extraArguments:

initFile

A shell script to run before HMS is launched. The content of the file has to be an inline string in the YAML file. The original startup command is passed as the first argument. The script needs to invoke it at the end as exec "$@". Use exec "$1" if passing any extra arguments.

extraArguments

List of extra arguments to pass to the initFile script.

The following example shows how you can use initFile to run a custom start up script. The init script must end with exec "$@":

initFile: |
  #!/bin/bash
  echo "Custom init for $2"
  exec "$@"
extraArguments:
  - TEST_ARG

Additional volumes#

Additional volumes can be necessary for persisting files. These can be defined in the additionalVolumes section. None are defined by default:

additionalVolumes: []

You can add one or more volumes supported by k8s to all nodes in the cluster.

If you specify path only, a directory at the location given by path is created. When mounting a ConfigMap or Secret, files are created in this directory for each key.

Volumes also support an optional subPath parameter, which references a key in the ConfigMap or Secret volume you create. If you specify subPath, the key named by subPath in the ConfigMap or Secret is mounted as a file with the name provided by path.

additionalVolumes:
  - path: /mnt/InContainer
    volume:
      emptyDir: {}
  - path: /etc/hive/conf/test_config.txt
    subPath: test_config.txt
    volume:
      configMap:
        name: "configmap-in-volume"

Storage#

You can configure the credentials and other details for access to the Hive metastore, HDFS storage, and other supported object storage. The credentials enable the HMS to access storage for metadata information, including statistics gathering.

In addition, you have to configure the catalog with sufficient, corresponding credentials.

The default configuration is empty:

hiveMetastoreWarehouseDir:
hdfs:
  hadoopUserName:
objectStorage:
  awsS3:
    region:
    endpoint:
    accessKey:
    secretKey:
    pathStyleAccess: false
  gs:
    cloudKeyFileSecret:
  azure:
    abfs:
      authType: "accessKey"
      accessKey:
        storageAccount:
        accessKey:
      oauth:
        clientId:
        secret:
        endpoint:
    wasb:
      storageAccount:
      accessKey:
  adl:
    oauth2:
      clientId:
      credential:
      refreshUrl:

The following table describes the properties for storage access configuration:

HMS storage-related configuration#

Node name

Description

hiveMetastoreWarehouseDir

The location of your Hive metastore’s warehouse directory. For example, s3://example/hive-default-warehouse/.

hdfs.hadoopUserName

User name for Hadoop HDFS access

objectStorage.awsS3.*

Configuration for AWS S3 access

objectStorage.awsS3.region

AWS region name

objectStorage.awsS3.endpoint

AWS S3 endpoint, for example http[s]://<bucket>.s3-<AWS-region>.amazonaws.com.

objectStorage.awsS3.accessKey

Name of the access key for AWS S3

objectStorage.awsS3.secretKey

Name of the secret key for AWS S3

objectStorage.awsS3.pathStyleAccess

Set to true to use path-style access for all requests to the S3-compatible storage endpoint

objectStorage.gs.*

Configuration for Google Storage access

objectStorage.gs.cloudKeyFileSecret

Name of the secret with the file containing the access key to the cloud storage. The key of the secret must be named key.json.

objectStorage.azure.*:

Configuration for Microsoft Azure storage systems

objectStorage.azure.abfs.*

Configuration for Azure Blob Filesystem (ABFS)

objectStorage.azure.abfs.authType

Authentication type used to access ABFS. Valid values are accessKey or oauth, with configuration in the following properties

objectStorage.azure.abfs.accessKey.*

Configuration for access key authentication to ABFS

objectStorage.azure.abfs.accessKey.storageAccount

Name of the ABFS account to access

objectStorage.azure.abfs.accessKey.accessKey

Actual access key to use for ABFS access

objectStorage.azure.abfs.oauth.*

Configuration for OAuth authentication to ABFS

objectStorage.azure.abfs.oauth.clientId

Client identifier for OAuth authentication

objectStorage.azure.abfs.oauth.secret

Secret for OAuth

objectStorage.azure.abfs.oauth.endpoint

Endpoint URL for OAuth

objectStorage.azure.wasb.*

Configuration for Windows Azure Storage Blob (WASB)

objectStorage.azure.wasb.storageAccount

Name of the storage account to use for WASB

objectStorage.azure.wasb.accessKey

Key to access WASB

objectStorage.adl

Configuration for Azure Data Lake (ADL)

objectStorage.adl.oauth2.*

Configuration for OAuth authentication to ADL

objectStorage.adl.oauth2.clientId

Client identifier for OAuth access to ADL

objectStorage.adl.oauth2.credential

Credential for OAuth access to ADL

objectStorage.adl.oauth2.refreshUrl

Refresh URL for the OAuth access to ADL
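
For example, a minimal sketch that points the HMS at an S3 warehouse; the bucket, region, and credential values are placeholders:

hiveMetastoreWarehouseDir: "s3://example/hive-default-warehouse/"
objectStorage:
  awsS3:
    region: "us-east-1"
    accessKey: "<access-key>"
    secretKey: "<secret-key>"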

Metastore configuration for Avro#

In order to enable Avro tables when using Hive 3.x, you need to add the following property definition to the Hive metastore configuration file hive-site.xml:

<property>
    <name>metastore.storage.schema.reader.impl</name>
    <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>

For more information about additional files, see Adding files.

Server configuration#

The following snippet shows the default resource configuration for the HMS container:

heapSizePercentage: 85

resources:
  requests:
    memory: "1Gi"
    cpu: 1
  limits:
    memory: "1Gi"
    cpu: 1

Node assignment#

You can configure your cluster to determine the node and pod to use for the HMS:

nodeSelector: {}
tolerations: []
affinity: {}

Our SEP configuration documentation contains examples and resources to help you configure these YAML nodes.
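
As a brief sketch, the following nodeSelector restricts the HMS pod to nodes carrying an illustrative label:

nodeSelector:
  example.com/pool: "metastore"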

Annotations#

You can add configuration to annotate the deployment and pod:

deploymentAnnotations: {}
podAnnotations: {}
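
For example, a sketch that attaches an illustrative annotation to the HMS pod:

podAnnotations:
  example.com/owner: "data-platform"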

Security context#

You can optionally configure security contexts to define privilege and access control settings for the HMS pods.

securityContext:
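
For example, a minimal sketch that runs the HMS pod with a specific user and group ID; the values are illustrative:

securityContext:
  runAsUser: 1000
  runAsGroup: 1000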

If you do not want to set the securityContext for the default service account, you can restrict it by configuring the service account for the HMS pod.

Service account#

You can configure a service account for the HMS pod using:

serviceAccountName:
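
For example, assuming a pre-existing service account named hive-hms (an illustrative name):

serviceAccountName: "hive-hms"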

Environment variables#

You can pass environment variables to the HMS container using the same mechanism used for the internal database:

envFrom: []
env: []

Both are specified as sequences of mappings, for example:

envFrom:
  - secretRef:
      name: my-secret-with-vars
env:
  - name: MY_VARIABLE
    value: some-value