Command line interface#

Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP), provides a terminal-based interactive shell for running Dell Data Processing Engine queries.

Requirements#

The Dell Data Processing Engine CLI has the following requirements:

  • Java 22

  • Must be able to open a web browser for logging in to your IdP

Installation#

Download and extract the .tar.gz file. In a terminal, navigate to the bin directory and run the dell-data-processing-engine script with the version command to display the version of the CLI:

./dell-data-processing-engine version

Windows users must run the .bat script located in the same directory.
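For example, assuming the Windows batch script shares the same base name as the shell script:

dell-data-processing-engine.bat version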

Running the CLI#

The CLI wraps the Dell Data Processing Engine API endpoints, letting you accomplish tasks such as managing your profiles, submitting batch jobs, and configuring resource pools. Run the CLI with the help command, or with --help or -h, to see all available commands and options:

❯ ./dell-data-processing-engine help
Usage: dell-data-processing-engine [-hV] [-p=<profile>] [COMMAND]
Use this application to manage Dell Data Processing Engine
  -h, --help                Show this help message and exit.
  -p, --profile=<profile>   Configuration profile to use (default profile is used if not specified)
  -V, --version             Print version information and exit.
Commands:
  help     Display help information about the specified command.
  config   Commands to configure this application
  login    Login, start a new session, and set/change the default profile
  submit   Emulation of the spark-submit script
  uploads  Commands to manage file uploads and secrets
  status   List status of a batch or connect job
  logs     Get logs of a batch or connect job
  delete   Delete a batch job or Spark Connect instance
  profile  Commands to get or set the default profile

Commands#

Global arguments#

The following CLI arguments are global arguments:

Dell Data Processing Engine global CLI command arguments#

  • -h, --help. Show usage help and exit.

  • --format=<displayMode>. For commands that produce a result, changes how the result is formatted: table, simple, json, prettyJson, or raw.

  • -p, --profile. Configuration profile to use. If no profile is specified, the default profile is used.

  • --insecure. Skip SSL certificate validation.

    Warning: Use this option only for testing.

  • --logLevel=<messagingMode>. Changes the output/logging level: standard or debug. Defaults to standard.

  • --keystore. Path to a KeyStore containing client certificates and the KeyStore password.

  • --cert. Path to an X509 client certificate in PEM format or a directory of X509 client certificates.
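For example, assuming all global arguments are accepted before the command, as -p is in the usage line above (the profile name and sparkId here are placeholders):

./dell-data-processing-engine -p staging --format json status b-7537272864c477686cd6bfc11f5ca66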

General#

The following commands are general commands:

Dell Data Processing Engine general CLI commands#

login
    Login, start a new session, and set or change the default profile.

    Options:

      • --force. Request that your IdP always presents a login screen, even if you are already logged in.

submit
    Emulation of Spark’s spark-submit script. Not all options and combinations are supported. See the example after this list.

    Options:

      • --name. A name for your application. If a name is not provided, one is generated.

      • --jars. Comma-separated list of .jar files to include on the driver and executor classpaths.

      • --py-files. Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

      • --files. Comma-separated list of files to be placed in the working directory of each executor. The paths of these files in executors can be accessed with SparkFiles.get(fileName).

      • --archives. Comma-separated list of archives to be extracted into the working directory of each executor.

      • -c, --conf. Arbitrary Spark configuration property.

      • --executor-memory. The total memory per executor. Defaults to 1G.

      • --num-executors. The number of executors to launch. Defaults to 2.

      • --executor-cores. The number of cores used by each executor. Defaults to 1.

      • --pool. Resource pool to use.

      • --encrypt. Enable encryption for communication between the driver and executors. Defaults to true.

      • --file-upload. Upload files with this submission. Each file must be 1MB or less, and the total size of all uploads cannot be more than 10MB. The form for each file is LOCAL_PATH:DESTINATION_DIRECTORY. The destination path is relative to the uploads directory, which is always prefixed onto the DESTINATION_DIRECTORY you provide.

        For example, --file-upload /usr/local/myfile.txt:/my/mount uploads the local file /usr/local/myfile.txt and mounts it in Spark executors at <uploads-directory>/my/mount/myfile.txt.

        The uploaded files can be referenced using the standard Spark prefix:

        local://<uploads-directory>/...

      • --uploaded-secrets. Attach uploaded secrets to Spark executors as environment variables.

      • --uploaded-files. Attach uploaded files to Spark executors. The form for each upload is UPLOAD_ID:DESTINATION_DIRECTORY. The destination path is relative to the uploads directory, which is always prefixed onto the DESTINATION_DIRECTORY you provide. All files in the upload are mounted into the destination directory. For example, if upload m-12345678 contained two text files, one.txt and two.txt, and one binary file, three.dat, then --uploaded-files m-12345678:/my/mount would mount them in Spark executors at <uploads-directory>/my/mount/one.txt, <uploads-directory>/my/mount/two.txt, and <uploads-directory>/my/mount/three.dat.

        The uploaded files can be referenced using the standard Spark prefix:

        local://<uploads-directory>/...

      • --spark-connect. Start a Spark Connect instance instead of a batch job.

      • --cron-timezone. CRON timezone. Defaults to UTC.

      • --cron. Set a CRON schedule.

      • --ttl-seconds-after-finished. Set TTL cleanup in seconds. Defaults to 0.

      • --class. Your Java or Scala application’s main class.
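    For instance, a submission that overrides the default executor sizing might look like the following; the application name and resource values here are illustrative, not required settings:

    ./dell-data-processing-engine submit --name pi-demo --num-executors 4 --executor-cores 2 --executor-memory 2G --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"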

uris
    List URIs for a batch or connect job. Spark Connect URIs contain your current access and refresh tokens.

    Note: It can take a few seconds for URIs to become available.

version
    Show the current version.

Admin#

Note

Admin commands are available for admin users only.

The following CLI commands can be used for admin tasks:

Dell Data Processing Engine general admin CLI commands#

api-logs
    List API server logs.

list-all-instances
    List all instances.

list-file-upload
    Get details of all uploaded files.

list-secret-uploads
    Get details of all uploaded secrets.

scheduler-logs
    List scheduler logs.

system-events
    List system events.

Dell Data Processing Engine resource pool CLI commands#

get-assignments
    Get the current resource pool user assignments.

get
    Get the current state of the resource pools.

resource-pools
    Resource pool management.

set
    Set the resource pools. Specify a set of arguments for each resource pool.

    Options:

      • -n, --name. The name for this resource pool.

      • --priority. Resource pool priority. Defaults to 0.

      • --max-applications. The maximum number of applications in the resource pool. 0 is unlimited. Defaults to 0.

      • --min-memory. The minimum amount of memory in gigabytes. Defaults to 1g.

      • --min-cores. The minimum number of cores. Defaults to 1.

      • --max-memory. The maximum amount of memory in gigabytes. Defaults to 1g.

      • --max-cores. The maximum number of cores. Defaults to 1.

      • --default-job-memory. The default amount of memory in gigabytes per job. Defaults to 1g.

      • --default-job-cores. The default number of cores per job. Defaults to 1.

      • --default-job-executors. The default number of executors per job. Defaults to 1.

      • --max-job-memory. The maximum amount of memory in gigabytes per job. Defaults to 1g.

      • --max-job-cores. The maximum number of cores per job. Defaults to 1.

      • --max-job-executors. The maximum number of executors per job. Defaults to 1.

update-assignments
    Replace the current resource pool user assignments.
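For example, to view the current pool-to-user assignments, assuming the admin command structure used in the examples later on this page:

./dell-data-processing-engine admin resource-pools get-assignments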

Dell Data Processing Engine user and role management CLI commands#

assign-roles
    Assign a user to a role.

    Options:

      • -u, --user. The user name being assigned the role. Required.

      • -r, --role. The role that the user is assigned to. Required.

get-roles
    Show the current user’s role assignments.

unassign-roles
    Unassign a user from a role.

    Options:

      • -u, --user. The user name being unassigned from the role. Required.

      • -r, --role. The role that the user is unassigned from. Required.

update-role-assignments
    Replace all user role assignments.
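For example, to list the current role assignments, assuming the admin command structure used in the examples later on this page:

./dell-data-processing-engine admin users get-roles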

Configuration#

The following CLI commands can be used for configuration tasks:

Dell Data Processing Engine configuration CLI commands#

config
    Commands to configure this application.

get
    Display the current configuration.

set
    Set the configuration.

    Options:

      • -a, --api-endpoint. The URI of your API server. Required.
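For example, to point the CLI at your API server and then verify the setting; the endpoint URL shown here is a placeholder:

./dell-data-processing-engine config set --api-endpoint https://dpe.example.com:8787
./dell-data-processing-engine config get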

Instance#

The following CLI commands can be used for instance-related tasks:

Dell Data Processing Engine instance CLI commands#

instance
    Commands to manage instances.

uris
    List URIs for a batch or connect job. Spark Connect URIs contain your current access and refresh tokens.

delete
    Delete a batch job or Spark Connect instance.

    Options:

      • --no-confirm. Disable the prompt for deletion confirmation.

logs
    Get logs of a batch or Spark Connect job.

    Options:

      • -i, --index. Index of the log to retrieve; indexing starts at 0.

      • --zipfile. Download logs for the driver and executors as a .zip file. Optional.

status
    List the status of a batch or Spark Connect job.
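For example, to check the status of a batch job by its sparkId (the ID shown here is the one generated in the batch job example later on this page):

./dell-data-processing-engine instance status b-7537272864c477686cd6bfc11f5ca66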

Profile#

The following CLI commands can be used for profile-related tasks:

Dell Data Processing Engine profile CLI commands#

profile
    Commands to get or set the default profile.

delete
    Delete the specified profile.

    Options:

      • --no-confirm. Disable the prompt for deletion confirmation.

get
    Show the current default profile.

list
    Show all configured profile names.

set
    Set the default profile.
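For example, to see which profiles are configured and which one is the current default:

./dell-data-processing-engine profile list
./dell-data-processing-engine profile get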

Uploads#

The following CLI commands can be used for tasks related to file uploads and secrets:

Dell Data Processing Engine uploads CLI commands#

uploads
    Commands to manage file uploads and secrets.

create-file
    Create a file upload set.

    Options:

      • -c, --comment. Comment or description, used only for your own reference. Required.

create-secret
    Create a new secret set.

    Options:

      • -c, --comment. Comment or description, used only for your own reference. Required.

delete-file
    Delete a file set.

    Options:

      • -i, --id. The uploadID of the file. Required.

delete-secret
    Delete a secret set.

    Options:

      • -i, --id. The uploadID of the secret. Required.

get-file
    Get details of an uploaded file.

    Options:

      • -i, --id. The uploadID of the file. Required.

get-secret
    Get details of an uploaded secret.

    Options:

      • -i, --id. The uploadID of the secret. Required.

update-file
    Update a file upload set.

    Options:

      • -i, --id. The uploadID of the file. Required.

      • -c, --comment. Comment or description, used only for your own reference. Required.

update-secret
    Update a secret set.

    Options:

      • -i, --id. The uploadID of the secret. Required.

      • -c, --comment. Comment or description, used only for your own reference. Required.
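For example, to inspect an uploaded file set by its uploadID; m-12345678 is the illustrative ID from the --uploaded-files description above:

./dell-data-processing-engine uploads get-file -i m-12345678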

Examples#

The following sections show you how to use various Dell Data Processing Engine CLI commands.

Login to CLI#

Logging in to the CLI requires authorizing with your IdP. Enter the login command to log in:

./dell-data-processing-engine login

The login command opens a new tab in your browser and prompts you to enter your credentials. Upon logging in, an access token and an encrypted value are generated and written to your local configuration file for later use.

Submit a batch job#

An example Dell Data Processing Engine batch job submission that uses the org.apache.spark.examples.SparkPi class to calculate pi:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"
-------------------------------------------
sparkId | b-7537272864c477686cd6bfc11f5ca66
-------------------------------------------

Note

Save the sparkId for subsequent use with the API.

View the logs for the job using the sparkId generated with the previous command:

./dell-data-processing-engine instance logs b-7537272864c477686cd6bfc11f5ca66
Job 0 finished: reduce at SparkPi.scala:38, took 7.584185 s
Pi is roughly 3.1372556862784315

To delete an instance:

./dell-data-processing-engine instance delete b-7537272864c477686cd6bfc11f5ca66
Delete b-7537272864c477686cd6bfc11f5ca66? [y/N]> y
Delete request sent

Submit a batch job on a schedule#

Using the same class and .jar file as the previous example, you can run the job on a CRON schedule by using the --cron option. The following example runs the job every hour at minute 0:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar" --cron "0 * * * *"

To run the job every day at 12:00 PM:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar" --cron "0 12 * * *"

The timezone is UTC by default. You can change the timezone with the --cron-timezone option. The following example runs the batch job at 12:00 PM EST:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar" --cron-timezone="EST" --cron "0 12 * * *"

Submit a Spark Connect job#

To submit a Spark Connect job:

./dell-data-processing-engine submit --spark-connect
Spark connect job started:
---------------------------------------------
sparkId | c-eed1bf18f8526486c9ecddf9819a273db
---------------------------------------------

You can then use the uris command to retrieve the connection URIs for the instance:

./dell-data-processing-engine instance uris c-eed1bf18f8526486c9ecddf9819a273db
Spark Web UI: https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787
Spark Connect: sc://c-eed1bf18f8526486c9ecddf9819a273db-grpc.local.gate.net:8787/token=eyJraWQiOiJ0ZXN0aW5nIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyN ...

Copy the Spark Connect URI and pass it to PySpark with the --remote option:

./bin/pyspark --remote "sc://c-eed1bf18f8526486c9ecddf9819a273db-grpc.local.gate.net:8787/token=eyJraWQiOiJ0ZXN0aW5nIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyN ..."
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.10.14 (main, Mar 19 2024 21:46:16)
Client connected to the Spark Connect server at https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787
SparkSession available as `spark`.

You can now execute Python code with Spark Connect:

>>> from datetime import datetime, date
>>> from pyspark.sql import Row
>>>
>>> df = spark.createDataFrame([
...    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
...    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
...    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
... ])
>>> df.show()
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+

Use the https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787 URL to view the Spark Web UI. Dell Data Processing Engine requires authentication through your IdP before you can access the Spark Web UI. If you have not already logged in, you are taken to your IdP’s login screen to enter your credentials.

Configure resource pools#

The following example shows two resource pools being set, one named default with a minimum number of cores set to 1, and another named large with a minimum number of cores set to 3 and a maximum number of cores set to 8:

./dell-data-processing-engine admin resource-pools set --name default --min-cores 1 --name large --min-cores 3 --max-cores 8

Note that in this example the default resource pool does not explicitly set a maximum number of cores with the --max-cores option, so it uses the default value for --max-cores, which is 1.
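If that is not the intent, set the maximum explicitly; the values here are illustrative:

./dell-data-processing-engine admin resource-pools set --name default --min-cores 1 --max-cores 4 --name large --min-cores 3 --max-cores 8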

You can use the get command to return a list of your resource pools and their configuration:

./dell-data-processing-engine admin resource-pools get

Assign roles to users#

The assign-roles command assigns a role to a user. For example, to assign a role named role to a user named user:

./dell-data-processing-engine admin users assign-roles -r role -u user

To assign the admin role to user:

./dell-data-processing-engine admin users assign-roles -r admin -u user

Similarly, you can unassign roles from users with the unassign-roles command:

./dell-data-processing-engine admin users unassign-roles -r role -u user
./dell-data-processing-engine admin users unassign-roles -r admin -u user