Configuring the Hive Metastore Service in Kubernetes#
Looking for the installation guide? This topic covers configuring the Hive Metastore Service (HMS) after you have completed a basic installation of Starburst Enterprise. If you have not yet done that, go to our installation guide.
The starburst-hive Helm chart configures the Hive Metastore Service (HMS) and optionally the backing database in the cluster, as detailed in the following sections. A number of Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP) connectors and features require an HMS.
Use your registry credentials, and follow best practices by creating an override file for changes to default values as desired.
In addition to the configuration properties described in this document, you can also use the base Hive connector’s metastore configuration properties, Thrift metastore configuration properties and AWS Glue metastore configuration properties as needed, depending on your environment.
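For example, a minimal sketch of an override file that enables registry credentials, assuming hypothetical values for the registry account; pass the file to helm install or helm upgrade with the --values option:
registryCredentials:
  enabled: true
  registry: "harbor.starburstdata.net"
  # hypothetical account values, replace with your own credentials
  username: "myuser"
  password: "mypassword"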
Using the HMS#
The expose section configures the DNS availability of the HMS in the cluster. By default, the HMS is available at the hostname hive and port 9083. As a result, the Thrift URL within the cluster is thrift://hive:9083. You can use the URL for any catalog:
catalog:
datalake: |
connector.name=hive
hive.metastore.uri=thrift://hive:9083
Docker image and registry#
Same as the Docker image and registry section of the Helm chart for SEP.
image:
repository: "harbor.starburstdata.net/starburstdata/hive"
tag: "3.1.3-e.3"
pullPolicy: "IfNotPresent"
registryCredentials:
enabled: false
registry:
username:
password:
imagePullSecrets:
- name:
Exposing the pod to outside network#
The expose section for the HMS works identically to the SEP server expose section. Differences are isolated to the configured default values. Additionally, ingress is not supported, since the HMS service uses the TCP protocol and not HTTP. The default type is clusterIp, with the hostname hive and port 9083. Ensure that you adapt your configured catalogs to use the correct Thrift URLs when changing this configuration.
Default clusterIp:
expose:
type: "clusterIp"
clusterIp:
name: "hive"
ports:
http:
port: 9083
Alternative with nodePort:
expose:
type: "nodePort"
nodePort:
name: "hive"
ports:
http:
port: 9083
nodePort: 30083
Alternative with loadBalancer:
expose:
type: "loadBalancer"
loadBalancer:
name: "hive"
IP: ""
ports:
http:
port: 9083
annotations: {}
sourceRanges: []
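For example, if you rename the service to a hypothetical hms and change the port to 9084, each catalog must use the matching Thrift URL:
catalog:
  datalake: |
    connector.name=hive
    hive.metastore.uri=thrift://hms:9084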
Internal database backend for HMS#
The database backend for HMS is a PostgreSQL database internal to the cluster by default.
Note
Alternatively, for production usage, you can use an external backend database that you manage yourself.
The following snippet shows the default configuration for the internal HMS backend database:
database:
type: internal
internal:
image:
repository: "library/postgres"
tag: "10.6"
pullPolicy: "IfNotPresent"
volume:
# use one of:
# - existingVolumeClaim to specify existing PVC
# - persistentVolumeClaim to specify spec for new PVC
# - other volume type inline configuration, e.g. emptyDir
# Examples:
# existingVolumeClaim: "my_claim"
# persistentVolumeClaim:
# storageClassName:
# accessModes:
# - ReadWriteOnce
# resources:
# requests:
# storage: "2Gi"
emptyDir: {}
resources:
requests:
memory: "1Gi"
cpu: 2
limits:
memory: "1Gi"
cpu: 2
driver: "org.postgresql.Driver"
port: 5432
databaseName: "hive"
databaseUser: "hive"
databasePassword: "HivePass1234"
envFrom: []
env: []
| Node name | Description |
| --- | --- |
| `type` | Set to `internal` |
| `internal.image` | Docker container images used for the PostgreSQL server |
| `internal.volume` | Storage volume to persist the database. The default configuration requests a new persistent volume (PV). |
| `internal.volume.persistentVolumeClaim` | The default configuration, which requests a new persistent volume (PV). |
| `internal.volume.existingVolumeClaim` | Alternative volume configuration, which uses an existing volume claim by referencing the name as the value in quotes, e.g., `"my_claim"` |
| `internal.volume.emptyDir` | Alternative volume configuration, which configures an empty directory on the pod, keeping in mind that a pod replacement loses the database content. |
| `internal.resources` | CPU and memory resources for the PostgreSQL container |
| `internal.driver` | JDBC driver class used to connect to the internal database |
| `internal.port` | Port of the internal database |
| `internal.databaseName` | Name of the internal database |
| `internal.databaseUser` | User to connect to the internal database |
| `internal.databasePassword` | Password to connect to the internal database |
| `internal.envFrom` | YAML sequence of mappings to define a Secret or ConfigMap as a source of environment variables for the PostgreSQL container. |
| `internal.env` | YAML sequence of mappings to define environment variables as name and value pairs for the PostgreSQL container. |
Examples#
OpenShift deployments often do not have access to pull from the default Docker registry library/postgres. You can replace it with an image from the Red Hat registry, which requires additional environment variables set with the parameter database.internal.env:
database:
type: internal
internal:
image:
repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
tag: "latest"
env:
- name: POSTGRESQL_DATABASE
value: "hive"
- name: POSTGRESQL_USER
value: "hive"
- name: POSTGRESQL_PASSWORD
value: "HivePass1234"
Another option is to create a Secret (for example, postgresql-secret) containing the variables needed by PostgreSQL that are mentioned in the previous code block, and pass it to the container with the envFrom parameter:
database:
type: internal
internal:
image:
repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
tag: "latest"
envFrom:
- secretRef:
name: postgresql-secret
External backend database for HMS#
This section shows the setup for using an external PostgreSQL, MySQL, or Microsoft SQL Server database. You must provide the necessary details for the external server, and ensure that it can be reached from the k8s cluster pod. Set the database.type to external and configure the connection properties:
database:
type: external
external:
jdbcUrl:
driver:
user:
password:
| Node name | Description |
| --- | --- |
| `type` | Set to `external` |
| `external.jdbcUrl` | JDBC URL to connect to the external database as required by the database and the used driver, including hostname and port. Ensure you use a valid JDBC URL as required by the PostgreSQL, MySQL, or SQL Server driver. Typically the syntax requires the host, port, and database name. |
| `external.driver` | Valid values are `org.postgresql.Driver`, `com.mysql.jdbc.Driver`, or `com.microsoft.sqlserver.jdbc.SQLServerDriver` |
| `external.user` | Database user name to access the external database using JDBC |
| `external.password` | Password for the user configured to access the external database using JDBC |
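For example, a minimal sketch for a hypothetical external PostgreSQL server reachable from the cluster as postgres.example.com, with a hypothetical database named hivemetastore:
database:
  type: external
  external:
    jdbcUrl: "jdbc:postgresql://postgres.example.com:5432/hivemetastore"
    driver: "org.postgresql.Driver"
    user: "hive"
    password: "HivePass1234"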
Server start up configuration#
You can create a startup shell script to customize how HMS is started, and pass additional arguments to it.
initFile:
extraArguments:
initFile
A shell script to run before HMS is launched. The content of the file has to be an inline string in the YAML file. The original startup command is passed as the first argument. The script needs to invoke it at the end as exec "$@". Use exec "$1" if passing any extra arguments.
extraArguments
List of extra arguments to pass to the initFile script.
The following example shows how you can use initFile to run a custom startup script. The init script must end with exec "$@":
initFile: |
#!/bin/bash
echo "Custom init for $2"
exec "$@"
extraArguments:
- TEST_ARG
Additional volumes#
Additional volumes can be necessary for persisting files. These can be defined in the additionalVolumes section. None are defined by default:
additionalVolumes: []
You can add one or more volumes, supported by k8s, to all nodes in the cluster. If you specify path only, a directory named in path is created. When mounting a ConfigMap or Secret, files are created in this directory for each key. An optional subPath parameter is also supported, which takes the name of a key in the ConfigMap or Secret volume you create. If you specify subPath, the specific key named subPath from the ConfigMap or Secret is mounted as a file with the name provided by path.
additionalVolumes:
- path: /mnt/InContainer
volume:
emptyDir: {}
- path: /etc/hive/conf/test_config.txt
subPath: test_config.txt
volume:
configMap:
name: "configmap-in-volume"
Storage#
You can configure the credentials and other details for access to the Hive metastore, HDFS storage, and other supported object storage. The credentials enable the HMS to access the storage for metadata information, including statistics gathering.
In addition, you must configure the catalog with sufficient, corresponding credentials.
The default configuration is empty:
hiveMetastoreWarehouseDir:
hdfs:
hadoopUserName:
objectStorage:
awsS3:
region:
endpoint:
accessKey:
secretKey:
pathStyleAccess: false
gs:
cloudKeyFileSecret:
azure:
abfs:
authType: "accessKey"
accessKey:
storageAccount:
accessKey:
oauth:
clientId:
secret:
endpoint:
wasb:
storageAccount:
accessKey:
adl:
oauth2:
clientId:
credential:
refreshUrl:
The following table describes the properties for storage access configuration:
| Node name | Description |
| --- | --- |
| `hiveMetastoreWarehouseDir` | The location of your Hive metastore's warehouse directory |
| `hdfs.hadoopUserName` | User name for Hadoop HDFS access |
| `objectStorage.awsS3` | Configuration for AWS S3 access |
| `objectStorage.awsS3.region` | AWS region name |
| `objectStorage.awsS3.endpoint` | AWS S3 endpoint |
| `objectStorage.awsS3.accessKey` | Name of the access key for AWS S3 |
| `objectStorage.awsS3.secretKey` | Name of the secret key for AWS S3 |
| `objectStorage.awsS3.pathStyleAccess` | Set to `true` to use path-style access to S3 |
| `objectStorage.gs` | Configuration for Google Storage access |
| `objectStorage.gs.cloudKeyFileSecret` | Name of the secret with the file containing the access key to the cloud storage |
| `objectStorage.azure` | Configuration for Microsoft Azure storage systems |
| `objectStorage.azure.abfs` | Configuration for Azure Blob Filesystem (ABFS) |
| `objectStorage.azure.abfs.authType` | Authentication type to access ABFS. Valid values are `accessKey` or `oauth`. |
| `objectStorage.azure.abfs.accessKey` | Configuration for access key authentication to ABFS |
| `objectStorage.azure.abfs.accessKey.storageAccount` | Name of the ABFS account to access |
| `objectStorage.azure.abfs.accessKey.accessKey` | Actual access key to use for ABFS access |
| `objectStorage.azure.abfs.oauth` | Configuration for OAuth authentication to ABFS |
| `objectStorage.azure.abfs.oauth.clientId` | Client identifier for OAuth authentication |
| `objectStorage.azure.abfs.oauth.secret` | Secret for OAuth |
| `objectStorage.azure.abfs.oauth.endpoint` | Endpoint URL for OAuth |
| `objectStorage.azure.wasb` | Configuration for Windows Azure Storage Blob (WASB) |
| `objectStorage.azure.wasb.storageAccount` | Name of the storage account to use for WASB |
| `objectStorage.azure.wasb.accessKey` | Key to access WASB |
| `objectStorage.azure.adl` | Configuration for Azure Data Lake (ADL) |
| `objectStorage.azure.adl.oauth2` | Configuration for OAuth authentication to ADL |
| `objectStorage.azure.adl.oauth2.clientId` | Client identifier for OAuth access to ADL |
| `objectStorage.azure.adl.oauth2.credential` | Credential for OAuth access to ADL |
| `objectStorage.azure.adl.oauth2.refreshUrl` | Refresh URL for the OAuth access to ADL |
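For example, a minimal sketch of the objectStorage section for AWS S3 access, assuming hypothetical key values:
objectStorage:
  awsS3:
    region: "us-east-1"
    # hypothetical credentials, replace with your own
    accessKey: "exampleaccesskey"
    secretKey: "examplesecretkey"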
More information about the configuration options is available in the metastore configuration resources referenced at the beginning of this topic.
Metastore configuration for Avro#
In order to enable Avro tables when using Hive 3.x, you need to add the following property definition to the Hive metastore configuration file hive-site.xml:
<property>
<name>metastore.storage.schema.reader.impl</name>
<value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>
For more information about additional files, see Adding files.
Server configuration#
The following default values set the heap size of the HMS, as a percentage of the available container memory, and the resources for the HMS container:
heapSizePercentage: 85
resources:
requests:
memory: "1Gi"
cpu: 1
limits:
memory: "1Gi"
cpu: 1
Node assignment#
You can configure your cluster to determine the node and pod to use for the HMS:
nodeSelector: {}
tolerations: []
affinity: {}
Our SEP configuration documentation contains examples and resources to help you configure these YAML nodes.
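For example, a minimal sketch that schedules the HMS pod onto nodes carrying a hypothetical label:
nodeSelector:
  # hypothetical label, adjust to the labels used in your cluster
  starburstpool: base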
Annotations#
You can add configuration to annotate the deployment and pod:
deploymentAnnotations: {}
podAnnotations: {}
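For example, a minimal sketch adding a hypothetical annotation to the HMS pod:
podAnnotations:
  # hypothetical annotation for illustration only
  example.com/owner: "data-platform"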
Security context#
You can optionally configure security contexts to define privilege and access control settings for the HMS pods.
securityContext:
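A minimal sketch using the standard Kubernetes security context fields, assuming a hypothetical non-root user and group:
securityContext:
  runAsUser: 1000
  runAsGroup: 1000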
If you do not want to set the securityContext for the default service account, you can restrict it by configuring the service account for the HMS pod.
Service account#
You can configure a service account for the HMS pod using:
serviceAccountName:
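For example, assuming a hypothetical service account that already exists in the namespace:
serviceAccountName: "hive-metastore-sa"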
Environment variables#
You can pass environment variables to the HMS container using the same mechanism used for the internal database:
envFrom: []
env: []
Both are specified as mapping sequences, for example:
envFrom:
- secretRef:
name: my-secret-with-vars
env:
- name: MY_VARIABLE
value: some-value