Command line interface#

Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP), provides a terminal-based interactive shell for running Dell Data Processing Engine queries.

Requirements#

The Dell Data Processing Engine CLI has the following requirements:

  • Java 22

  • Must be able to open a web browser for logging in to your IdP

Installation#

Download and extract the .tar.gz file. In a terminal, navigate to the bin directory and run the dell-data-processing-engine script with the version command to display the version of the CLI:

./dell-data-processing-engine version

Windows users must run the .bat script located in the same directory.
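For example, assuming the Windows batch script shares the same base name as the shell script:

dell-data-processing-engine.bat version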

Running the CLI#

The CLI wraps the Dell Data Processing Engine API endpoints, letting you accomplish tasks such as managing your profiles, submitting batch jobs, and configuring resource pools. Run the CLI with the help command, or with --help or -h, to see all available commands and options:

❯ ./dell-data-processing-engine help
Usage: dell-data-processing-engine [-hV] [-p=<profile>] [COMMAND]
Use this application to manage Dell Data Processing Engine
  -h, --help                Show this help message and exit.
  -p, --profile=<profile>   Configuration profile to use (default profile is used if not specified)
  -V, --version             Print version information and exit.
Commands:
  help     Display help information about the specified command.
  config   Commands to configure this application
  login    Login, start a new session, and set/change the default profile
  submit   Emulation of the spark-submit script
  uploads  Commands to manage file uploads and secrets
  status   List status of a batch or connect job
  logs     Get logs of a batch or connect job
  delete   Delete a batch job or Spark Connect instance
  profile  Commands to get or set the default profile

Commands#

Global arguments#

The following CLI arguments are global arguments:

Dell Data Processing Engine global CLI command arguments#

  • -h, --help. Show usage help and exit.

  • --format=<displayMode>. For commands that produce a result, changes how the result is formatted: table, simple, json, prettyJson, or raw.

  • -p, --profile. Configuration profile to use. If no profile is specified, the default profile is used.

  • --insecure. Skip SSL certificate validation.

    Warning: Use this option only for testing.

  • --logLevel=<messagingMode>. Changes the output/logging level: standard or debug. Defaults to standard.

  • --keystore. Path to a KeyStore containing client certificates and the KeyStore password.

  • --cert. Path to an X509 client certificate in PEM format or a directory of X509 client certificates.
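For example, assuming all global arguments are accepted before the command, as -p is in the usage line above (the profile name and sparkId here are placeholders):

./dell-data-processing-engine -p staging --format json status b-7537272864c477686cd6bfc11f5ca66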

General#

The following commands are general commands:

Dell Data Processing Engine general CLI commands#

login
    Login, start a new session, and set or change the default profile.

    Options:

      • --force. Request that your IdP always presents a login screen, even if you are already logged in.

submit
    Emulation of Spark’s spark-submit script. Not all options and combinations are supported. See the example after this list.

    Options:

      • --name. A name for your application. If a name is not provided, one is generated.

      • --jars. Comma-separated list of .jar files to include on the driver and executor classpaths.

      • --py-files. Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

      • --files. Comma-separated list of files to be placed in the working directory of each executor. The paths of these files in executors can be accessed with SparkFiles.get(fileName).

      • --archives. Comma-separated list of archives to be extracted into the working directory of each executor.

      • -c, --conf. Arbitrary Spark configuration property.

      • --executor-memory. The total memory per executor. Defaults to 1G.

      • --num-executors. The number of executors to launch. Defaults to 2.

      • --executor-cores. The number of cores used by each executor. Defaults to 1.

      • --pool. Resource pool to use.

      • --encrypt. Enable encryption for communication between the driver and executors. Defaults to true.

      • --file-upload. Upload files with this submission. Each file must be 1MB or less, and the total size of all uploads cannot be more than 10MB. The form for each file is LOCAL_PATH:DESTINATION_DIRECTORY. The destination path is relative to the uploads directory, which is always prefixed onto the DESTINATION_DIRECTORY you provide.

        For example, --file-upload /usr/local/myfile.txt:/my/mount uploads the local file /usr/local/myfile.txt and mounts it in Spark executors at <uploads-directory>/my/mount/myfile.txt.

        The uploaded files can be referenced using the standard Spark prefix:

        local://<uploads-directory>/...

      • --uploaded-secrets. Attach uploaded secrets to Spark executors as environment variables.

      • --uploaded-files. Attach uploaded files to Spark executors. The form for each upload is UPLOAD_ID:DESTINATION_DIRECTORY. The destination path is relative to the uploads directory, which is always prefixed onto the DESTINATION_DIRECTORY you provide. All files in the upload are mounted into the destination directory. For example, if upload m-12345678 contained two text files, one.txt and two.txt, and one binary file, three.dat, then --uploaded-files m-12345678:/my/mount would mount them in Spark executors at <uploads-directory>/my/mount/one.txt, <uploads-directory>/my/mount/two.txt, and <uploads-directory>/my/mount/three.dat.

        The uploaded files can be referenced using the standard Spark prefix:

        local://<uploads-directory>/...

      • --spark-connect. Start a Spark Connect instance instead of a batch job.

      • --cron-timezone. CRON timezone. Defaults to UTC.

      • --cron. Set a CRON schedule.

      • --ttl-seconds-after-finished. Set TTL cleanup in seconds. Defaults to 0.

      • --class. Your Java or Scala application’s main class.
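    For instance, a submission that overrides the default executor sizing might look like the following; the application name and resource values here are illustrative, not required settings:

    ./dell-data-processing-engine submit --name pi-demo --num-executors 4 --executor-cores 2 --executor-memory 2G --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"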

uris
    List URIs for a batch or connect job. Spark Connect URIs contain your current access and refresh tokens.

    Note: It can take a few seconds for URIs to become available.

version
    Show the current version.

Admin#

Note

Admin commands are available for admin users only.

The following CLI commands can be used for admin tasks:

Dell Data Processing Engine general admin CLI commands#

api-logs
    List API server logs.

list-all-instances
    List all instances.

list-file-upload
    Get details of all uploaded files.

list-secret-uploads
    Get details of all uploaded secrets.

scheduler-logs
    List scheduler logs.

system-events
    List system events.

Dell Data Processing Engine resource pool CLI commands#

get-assignments
    Get the current resource pool user assignments.

get
    Get the current state of the resource pools.

resource-pools
    Resource pool management.

set
    Set the resource pools. Specify a set of arguments for each resource pool.

    Options:

      • -n, --name. The name for this resource pool.

      • --priority. Resource pool priority. Defaults to 0.

      • --max-applications. The maximum number of applications in the resource pool. 0 is unlimited. Defaults to 0.

      • --min-memory. The minimum amount of memory in gigabytes. Defaults to 1g.

      • --min-cores. The minimum number of cores. Defaults to 1.

      • --max-memory. The maximum amount of memory in gigabytes. Defaults to 1g.

      • --max-cores. The maximum number of cores. Defaults to 1.

      • --default-job-memory. The default amount of memory in gigabytes per job. Defaults to 1g.

      • --default-job-cores. The default number of cores per job. Defaults to 1.

      • --default-job-executors. The default number of executors per job. Defaults to 1.

      • --max-job-memory. The maximum amount of memory in gigabytes per job. Defaults to 1g.

      • --max-job-cores. The maximum number of cores per job. Defaults to 1.

      • --max-job-executors. The maximum number of executors per job. Defaults to 1.

update-assignments
    Replace the current resource pool user assignments.
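For example, to view the current pool-to-user assignments, assuming the admin command structure used in the examples later on this page:

./dell-data-processing-engine admin resource-pools get-assignments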

Dell Data Processing Engine user and role management CLI commands#

assign-roles
    Assign a user to a role.

    Options:

      • -u, --user. The user name being assigned the role. Required.

      • -r, --role. The role that the user is assigned to. Required.

get-roles
    Show the current user’s role assignments.

unassign-roles
    Unassign a user from a role.

    Options:

      • -u, --user. The user name being unassigned from the role. Required.

      • -r, --role. The role that the user is unassigned from. Required.

update-role-assignments
    Replace all user role assignments.
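For example, to list the current role assignments, assuming the admin command structure used in the examples later on this page:

./dell-data-processing-engine admin users get-roles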

Configuration#

The following CLI commands can be used for configuration tasks:

Dell Data Processing Engine configuration CLI commands#

config
    Commands to configure this application.

get
    Display the current configuration.

set
    Set the configuration.

    Options:

      • -a, --api-endpoint. The URI of your API server. Required.
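For example, to point the CLI at your API server and then verify the setting; the endpoint URL shown here is a placeholder:

./dell-data-processing-engine config set --api-endpoint https://dpe.example.com:8787
./dell-data-processing-engine config get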

Instance#

The following CLI commands can be used for instance-related tasks:

Dell Data Processing Engine instance CLI commands#

instance
    Commands to manage instances.

uris
    List URIs for a batch or connect job. Spark Connect URIs contain your current access and refresh tokens.

delete
    Delete a batch job or Spark Connect instance.

    Options:

      • --no-confirm. Disable the prompt for deletion confirmation.

logs
    Get logs of a batch or Spark Connect job.

    Options:

      • -i, --index. Index of the log to retrieve; indexing starts at 0.

      • --zipfile. Download logs for the driver and executors as a .zip file. Optional.

status
    List the status of a batch or Spark Connect job.
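For example, to check the status of a batch job by its sparkId (the ID shown here is the one generated in the batch job example later on this page):

./dell-data-processing-engine instance status b-7537272864c477686cd6bfc11f5ca66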

Profile#

The following CLI commands can be used for profile-related tasks:

Dell Data Processing Engine profile CLI commands#

profile
    Commands to get or set the default profile.

delete
    Delete the specified profile.

    Options:

      • --no-confirm. Disable the prompt for deletion confirmation.

get
    Show the current default profile.

list
    Show all configured profile names.

set
    Set the default profile.
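For example, to see which profiles are configured and which one is the current default:

./dell-data-processing-engine profile list
./dell-data-processing-engine profile get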

Uploads#

The following CLI commands can be used for tasks related to file uploads and secrets:

Dell Data Processing Engine uploads CLI commands#

uploads
    Commands to manage file uploads and secrets.

create-file
    Create a file upload set.

    Options:

      • -c, --comment. Comment or description, used only for your own reference. Required.

create-secret
    Create a new secret set.

    Options:

      • -c, --comment. Comment or description, used only for your own reference. Required.

delete-file
    Delete a file set.

    Options:

      • -i, --id. The uploadID of the file. Required.

delete-secret
    Delete a secret set.

    Options:

      • -i, --id. The uploadID of the secret. Required.

get-file
    Get details of an uploaded file.

    Options:

      • -i, --id. The uploadID of the file. Required.

get-secret
    Get details of an uploaded secret.

    Options:

      • -i, --id. The uploadID of the secret. Required.

update-file
    Update a file upload set.

    Options:

      • -i, --id. The uploadID of the file. Required.

      • -c, --comment. Comment or description, used only for your own reference. Required.

update-secret
    Update a secret set.

    Options:

      • -i, --id. The uploadID of the secret. Required.

      • -c, --comment. Comment or description, used only for your own reference. Required.
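For example, to inspect an uploaded file set by its uploadID; m-12345678 is the illustrative ID from the --uploaded-files description above:

./dell-data-processing-engine uploads get-file -i m-12345678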

Examples#

The following sections show you how to use various Dell Data Processing Engine CLI commands.

Login to CLI#

Logging in to the CLI requires authorizing with your IdP. Enter the login command to log in:

./dell-data-processing-engine login

The login command opens a new tab in your browser and prompts you to enter your credentials. Upon logging in, an access token and an encrypted value are generated and written to your local configuration file for later use.

Submit a batch job#

An example Dell Data Processing Engine batch job submission that uses the org.apache.spark.examples.SparkPi class to calculate pi:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"
-------------------------------------------
sparkId | b-7537272864c477686cd6bfc11f5ca66
-------------------------------------------

Note

Save the sparkId for subsequent use with the API.

View the logs for the job using the sparkId generated with the previous command:

./dell-data-processing-engine instance logs b-7537272864c477686cd6bfc11f5ca66
Job 0 finished: reduce at SparkPi.scala:38, took 7.584185 s
Pi is roughly 3.1372556862784315

To delete an instance:

./dell-data-processing-engine instance delete b-7537272864c477686cd6bfc11f5ca66
Delete b-7537272864c477686cd6bfc11f5ca66? [y/N]> y
Delete request sent

Submit a batch job on a schedule#

Using the same class and .jar file as the previous example, you can run the job on a CRON schedule by using the --cron option. The following example runs the job every hour at minute 0:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar" --cron "0 * * * *"

To run the job every day at 12:00 PM:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar" --cron "0 12 * * *"

The timezone is UTC by default. You can change the timezone with the --cron-timezone option. The following example runs the batch job at 12:00 PM EST:

./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar" --cron-timezone="EST" --cron "0 12 * * *"

Submit a Spark Connect job#

To submit a Spark Connect job:

./dell-data-processing-engine submit --spark-connect
Spark connect job started:
---------------------------------------------
sparkId | c-eed1bf18f8526486c9ecddf9819a273db
---------------------------------------------

You can then use the uris command to retrieve the connection URIs for the instance:

./dell-data-processing-engine instance uris c-eed1bf18f8526486c9ecddf9819a273db
Spark Web UI: https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787
Spark Connect: sc://c-eed1bf18f8526486c9ecddf9819a273db-grpc.local.gate.net:8787/token=eyJraWQiOiJ0ZXN0aW5nIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyN ...

Copy the Spark Connect URI and pass it to PySpark with the --remote option:

./bin/pyspark --remote "sc://c-eed1bf18f8526486c9ecddf9819a273db-grpc.local.gate.net:8787/token=eyJraWQiOiJ0ZXN0aW5nIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyN ..."
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.10.14 (main, Mar 19 2024 21:46:16)
Client connected to the Spark Connect server at https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787
SparkSession available as `spark`.

You can now execute Python code with Spark Connect:

>>> from datetime import datetime, date
>>> from pyspark.sql import Row
>>>
>>> df = spark.createDataFrame([
...    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
...    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
...    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
... ])
>>> df.show()
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+

Use the https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787 URL to view the Spark Web UI. Dell Data Processing Engine requires authentication through your IdP before you can access the Spark Web UI. If you have not already logged in, you are taken to your IdP’s login screen to enter your credentials.

Configure resource pools#

The following example shows two resource pools being set, one named default with a minimum number of cores set to 1, and another named large with a minimum number of cores set to 3 and a maximum number of cores set to 8:

./dell-data-processing-engine admin resource-pools set --name default --min-cores 1 --name large --min-cores 3 --max-cores 8

Note that in this example the default resource pool does not explicitly set a maximum number of cores with the --max-cores option, so it uses the default value for --max-cores, which is 1.
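If that is not the intent, set the maximum explicitly; the values here are illustrative:

./dell-data-processing-engine admin resource-pools set --name default --min-cores 1 --max-cores 4 --name large --min-cores 3 --max-cores 8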

You can use the get command to return a list of your resource pools and their configuration:

./dell-data-processing-engine admin resource-pools get

Assign roles to users#

The assign-roles command assigns a role to a user. For example, to assign a role named role to a user named user:

./dell-data-processing-engine admin users assign-roles -r role -u user

To assign the admin role to user:

./dell-data-processing-engine admin users assign-roles -r admin -u user

Similarly, you can unassign roles from users with the unassign-roles command:

./dell-data-processing-engine admin users unassign-roles -r role -u user
./dell-data-processing-engine admin users unassign-roles -r admin -u user