Configuring the Hive Metastore Service in Kubernetes#

Looking for the installation guide? This topic covers configuring the Hive Metastore Service (HMS) after you have completed a basic install of Starburst Enterprise. If you have not yet done that, go to our installation guide.


The starburst-hive Helm chart configures an HMS, the Hive metastore, and optionally the backing database in the cluster, as detailed in the following sections. A number of Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) connectors and features require an HMS.

Use your registry credentials, and follow best practices by creating an override file for changes to default values as desired.

In addition to the configuration properties described in this document, you can also use the base Hive connector’s metastore configuration properties, Thrift metastore configuration properties and AWS Glue metastore configuration properties as needed, depending on your environment.

Using the HMS#

The expose section configures the DNS availability of the HMS in the cluster. By default the HMS is available at the hostname hive and port 9083. As a result the Thrift URL within the cluster is thrift://hive:9083.

You can use the URL for any catalog:

catalog:
  datalake: |
    connector.name=hive
    hive.metastore.uri=thrift://hive:9083

Docker image and registry#

This works the same as the Docker image and registry section for the SEP Helm chart.

image:
  repository: "harbor.starburstdata.net/starburstdata/hive"
  tag: "3.1.3-e.3"
  pullPolicy: "IfNotPresent"

registryCredentials:
  enabled: false
  registry:
  username:
  password:

imagePullSecrets:
 - name:

Exposing the pod to the outside network#

The expose section for the HMS works identically to the SEP server expose section. Differences are isolated to the configured default values. Additionally, ingress is not supported, since the HMS service uses the TCP protocol and not HTTP. The default type is clusterIp, with the hostname hive and port 9083. When changing this configuration, ensure that you adapt your configured catalogs to use the correct Thrift URLs.

Default clusterIp:

expose:
  type: "clusterIp"
  clusterIp:
    name: "hive"
    ports:
      http:
        port: 9083

Alternative with nodePort:

expose:
  type: "nodePort"
  nodePort:
    name: "hive"
    ports:
      http:
        port: 9083
        nodePort: 30083

Alternative with loadBalancer:

expose:
  type: "loadBalancer"
  loadBalancer:
    name: "hive"
    IP: ""
    ports:
      http:
        port: 9083
    annotations: {}
    sourceRanges: []

Internal database backend for HMS#

The database backend for HMS is a PostgreSQL database internal to the cluster by default.

Note

Alternatively, for production usage, you can use an external backend database that you must manage yourself.

The following snippet shows the default configuration for the internal HMS backend database:

database:
  type: internal
  internal:
    image:
      repository: "library/postgres"
      tag: "10.6"
      pullPolicy: "IfNotPresent"
    volume:
      # use one of:
      # - existingVolumeClaim to specify existing PVC
      # - persistentVolumeClaim to specify spec for new PVC
      # - other volume type inline configuration, e.g. emptyDir
      # Examples:
      # existingVolumeClaim: "my_claim"
      # persistentVolumeClaim:
      #  storageClassName:
      #  accessModes:
      #    - ReadWriteOnce
      #  resources:
      #    requests:
      #      storage: "2Gi"
      emptyDir: {}
    resources:
      requests:
        memory: "1Gi"
        cpu: 2
      limits:
        memory: "1Gi"
        cpu: 2
    driver: "org.postgresql.Driver"
    port: 5432
    databaseName: "hive"
    databaseUser: "hive"
    databasePassword: "HivePass1234"
    envFrom: []
    env: []

Internal HMS backend database configuration#

Node name

Description

database.type

Set to internal to use a database in the k8s cluster, managed by the chart

database.internal.image

Docker container images used for the PostgreSQL server

database.internal.volume

Storage volume to persist the database. The default configuration requests a new persistent volume (PV).

database.internal.volume.persistentVolumeClaim

The default configuration, which requests a new persistent volume (PV).

database.internal.volume.existingVolumeClaim

Alternative volume configuration, which uses an existing volume claim by referencing its name as the value in quotes, e.g., "my_claim".

database.internal.volume.emptyDir

Alternative volume configuration, which configures an empty directory on the pod. Keep in mind that a pod replacement loses the database content.

database.internal.resources

Resource requests and limits for the internal PostgreSQL container

database.internal.databaseName

Name of the internal database

database.internal.databaseUser

User to connect to the internal database

database.internal.databasePassword

Password to connect to internal database

database.internal.envFrom

YAML sequence of mappings to define a Secret or ConfigMap as a source of environment variables for the PostgreSQL container.

database.internal.env

YAML sequence of mappings, each with name and value keys, to define environment variables for the PostgreSQL container.

Examples#

OpenShift deployments often do not have access to pull the default library/postgres image from the Docker registry. You can replace it with an image from the Red Hat registry, which requires additional environment variables set with the database.internal.env parameter:

database:
  type: internal
  internal:
    image:
       repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
       tag: "latest"
    env:
      - name: POSTGRESQL_DATABASE
        value: "hive"
      - name: POSTGRESQL_USER
        value: "hive"
      - name: POSTGRESQL_PASSWORD
        value: "HivePass1234"

Another option is to create a Secret (for example, postgresql-secret) containing the variables needed by PostgreSQL that are shown in the previous code block, and pass it to the container with the envFrom parameter:

database:
  type: internal
  internal:
    image:
       repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
       tag: "latest"
    envFrom:
      - secretRef:
          name: postgresql-secret
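
A minimal sketch of such a secret, assuming the same credentials as in the env example above; the secret name and values are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: postgresql-secret
type: Opaque
stringData:
  POSTGRESQL_DATABASE: "hive"
  POSTGRESQL_USER: "hive"
  POSTGRESQL_PASSWORD: "HivePass1234"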

External backend database for HMS#

This section shows the setup for using an external PostgreSQL, MySQL, Oracle, or Microsoft SQL Server database. You must provide the necessary details for the external server and ensure that it can be reached from the k8s cluster pod. Set database.type to external and configure the connection properties:

database:
  type: external
  external:
    jdbcUrl:
    driver:
    user:
    password:

External HMS backend database configuration#

Node name

Description

database.type

Set to external to use an existing PostgreSQL, MySQL, Oracle, or SQL Server database outside the cluster.

database.external.jdbcUrl

JDBC URL to connect to the external database, as required by the database and the driver used, including hostname and port. Ensure you use a valid JDBC URL as required by the PostgreSQL, MySQL, Oracle, or SQL Server driver. Typically, the syntax requires the host, port, and database name as follows:

  • For PostgreSQL: jdbc:postgresql://host:port/database.

  • For MySQL: jdbc:mysql://host:port/database.

  • For Oracle, using the thin or OCI driver: jdbc:oracle:<driver>:@//host:port/database.

  • For SQL Server: The database name must be passed in the HIVE_METASTORE_DB_NAME environment variable, and the JDBC connection string uses the syntax jdbc:sqlserver://host:port.

database.external.driver

Valid values are as follows:

  • For PostgreSQL: org.postgresql.Driver.

  • For an external MySQL or a compatible database: com.mysql.jdbc.Driver.

  • For Oracle: oracle.jdbc.OracleDriver.

  • For SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver.

database.external.user

Database user name to access the external database using JDBC.

database.external.password

Password for the user configured to access the external database using JDBC.
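
For example, a minimal sketch for an external PostgreSQL backend; the hostname, database name, and credentials are placeholders:

database:
  type: external
  external:
    jdbcUrl: "jdbc:postgresql://postgres.example.com:5432/hive"
    driver: "org.postgresql.Driver"
    user: "hive"
    password: "HivePass1234"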

Server start up configuration#

You can create a startup shell script to customize how HMS is started, and pass additional arguments to it.

initFile:
extraArguments:

initFile

A shell script to run before HMS is launched. The content of the file has to be an inline string in the YAML file. The original startup command is passed as the first argument. The script needs to invoke it at the end as exec "$@". Use exec "$1" if passing any extra arguments.

extraArguments

List of extra arguments to pass to the initFile script.

The following example shows how you can use initFile to run a custom start up script. The init script must end with exec "$@":

initFile: |
  #!/bin/bash
  echo "Custom init for $2"
  exec "$@"
extraArguments:
  - TEST_ARG

Additional volumes#

Additional volumes can be necessary for persisting files. These can be defined in the additionalVolumes section. None are defined by default:

additionalVolumes: []

You can add one or more volumes supported by k8s to all nodes in the cluster.

If you specify path only, a directory at the location given by path is created. When mounting a ConfigMap or Secret, files are created in this directory for each key.

Volumes also support an optional subPath parameter, which references a key in the ConfigMap or Secret volume you create. If you specify subPath, the key named by subPath in the ConfigMap or Secret is mounted as a file with the name provided by path.

additionalVolumes:
  - path: /mnt/InContainer
    volume:
      emptyDir: {}
  - path: /etc/hive/conf/test_config.txt
    subPath: test_config.txt
    volume:
      configMap:
        name: "configmap-in-volume"

Storage#

You can configure the credentials and other details for access to the Hive metastore, HDFS storage, and other supported object storage. The credentials enable the HMS to access storage for metadata information, including statistics gathering.

In addition, you have to configure the catalog with sufficient, corresponding credentials.

The default configuration is empty:

hiveMetastoreWarehouseDir:
hdfs:
  hadoopUserName:
objectStorage:
  awsS3:
    region:
    endpoint:
    accessKey:
    secretKey:
    pathStyleAccess: false
  gs:
    cloudKeyFileSecret:
  azure:
    abfs:
      authType: "accessKey"
      accessKey:
        storageAccount:
        accessKey:
      oauth:
        clientId:
        secret:
        endpoint:
    wasb:
      storageAccount:
      accessKey:
  adl:
    oauth2:
      clientId:
      credential:
      refreshUrl:

The following table describes the properties for storage access configuration:

HMS storage-related configuration#

Node name

Description

hiveMetastoreWarehouseDir

The location of your Hive metastore’s warehouse directory. For example, s3://example/hive-default-warehouse/.

hdfs.hadoopUserName

User name for Hadoop HDFS access

objectStorage.awsS3.*

Configuration for AWS S3 access

objectStorage.awsS3.region

AWS region name

objectStorage.awsS3.endpoint

AWS S3 endpoint, for example http[s]://<bucket>.s3-<AWS-region>.amazonaws.com.

objectStorage.awsS3.accessKey

Name of the access key for AWS S3

objectStorage.awsS3.secretKey

Name of the secret key for AWS S3

objectStorage.awsS3.pathStyleAccess

Set to true to use path-style access for all requests to the S3-compatible storage endpoint

objectStorage.gs.*

Configuration for Google Storage access

objectStorage.gs.cloudKeyFileSecret

Name of the secret with the file containing the access key to the cloud storage. The key of the secret must be named key.json.

objectStorage.azure.*:

Configuration for Microsoft Azure storage systems

objectStorage.azure.abfs.*

Configuration for Azure Blob Filesystem (ABFS)

objectStorage.azure.abfs.authType

Authentication type used to access ABFS. Valid values are accessKey or oauth, with configuration in the following properties

objectStorage.azure.abfs.accessKey.*

Configuration for access key authentication to ABFS

objectStorage.azure.abfs.accessKey.storageAccount

Name of the ABFS account to access

objectStorage.azure.abfs.accessKey.accessKey

Actual access key to use for ABFS access

objectStorage.azure.abfs.oauth.*

Configuration for OAuth authentication to ABFS

objectStorage.azure.abfs.oauth.clientId

Client identifier for OAuth authentication

objectStorage.azure.abfs.oauth.secret

Secret for OAuth

objectStorage.azure.abfs.oauth.endpoint

Endpoint URL for OAuth

objectStorage.azure.wasb.*

Configuration for Windows Azure Storage Blob (WASB)

objectStorage.azure.wasb.storageAccount

Name of the storage account to use for WASB

objectStorage.azure.wasb.accessKey

Key to access WASB

objectStorage.adl

Configuration for Azure Data Lake (ADL)

objectStorage.adl.oauth2.*

Configuration for OAuth authentication to ADL

objectStorage.adl.oauth2.clientId

Client identifier for OAuth access to ADL

objectStorage.adl.oauth2.credential

Credential for OAuth access to ADL

objectStorage.adl.oauth2.refreshUrl

Refresh URL for the OAuth access to ADL
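
For example, a minimal sketch that points the HMS at an S3 warehouse; the bucket, region, and credential values are placeholders:

hiveMetastoreWarehouseDir: "s3://example/hive-default-warehouse/"
objectStorage:
  awsS3:
    region: "us-east-1"
    accessKey: "<access-key>"
    secretKey: "<secret-key>"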

Metastore configuration for Avro#

In order to enable Avro tables when using Hive 3.x, you need to add the following property definition to the Hive metastore configuration file hive-site.xml:

<property>
    <name>metastore.storage.schema.reader.impl</name>
    <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>

For more information about additional files, see Adding files.

Server configuration#

The following snippet shows the default resource configuration for the HMS container:

heapSizePercentage: 85

resources:
  requests:
    memory: "1Gi"
    cpu: 1
  limits:
    memory: "1Gi"
    cpu: 1

Node assignment#

You can configure your cluster to determine the node and pod to use for the HMS:

nodeSelector: {}
tolerations: []
affinity: {}

Our SEP configuration documentation contains examples and resources to help you configure these YAML nodes.
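
As a brief sketch, the following nodeSelector restricts the HMS pod to nodes carrying an illustrative label:

nodeSelector:
  example.com/pool: "metastore"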

Annotations#

You can add configuration to annotate the deployment and pod:

deploymentAnnotations: {}
podAnnotations: {}
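
For example, a sketch that attaches an illustrative annotation to the HMS pod:

podAnnotations:
  example.com/owner: "data-platform"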

Security context#

You can optionally configure security contexts to define privilege and access control settings for the HMS pods.

securityContext:
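
For example, a minimal sketch that runs the HMS pod with a specific user and group ID; the values are illustrative:

securityContext:
  runAsUser: 1000
  runAsGroup: 1000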

If you do not want to set the securityContext for the default service account, you can restrict it by configuring the service account for the HMS pod.

Service account#

You can configure a service account for the HMS pod using:

serviceAccountName:
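
For example, assuming a pre-existing service account named hive-hms (an illustrative name):

serviceAccountName: "hive-hms"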

Environment variables#

You can pass environment variables to the HMS container using the same mechanism used for the internal database:

envFrom: []
env: []

Both are specified as sequences of mappings, for example:

envFrom:
  - secretRef:
      name: my-secret-with-vars
env:
  - name: MY_VARIABLE
    value: some-value