# Command line interface

Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP), provides a terminal-based interactive shell for running Dell Data Processing Engine queries.
## Requirements

The Dell Data Processing Engine CLI has the following requirements:

- Java 22
- The ability to open a web browser for logging in to your IdP
## Installation

Download and extract the `.tar.gz` file. In the terminal, navigate to the `bin` directory and execute the `dell-data-processing-engine.sh` script with the `version` command to see the version of the CLI:

```shell
./dell-data-processing-engine version
```

Windows users must run the `.bat` script located in the same directory.
## Running the CLI

You can accomplish many tasks with the CLI using the Dell Data Processing Engine API endpoints, such as managing your profiles, submitting batch jobs, and configuring resource pools. To see all available commands and options, run the CLI with the `help` command, or with `--help` or `-h`:
```text
❯ ./dell-data-processing-engine help
Usage: dell-data-processing-engine [-hV] [-p=<profile>] [COMMAND]
Use this application to manage Dell Data Processing Engine
  -h, --help                Show this help message and exit.
  -p, --profile=<profile>   Configuration profile to use (default profile is used if not specified)
  -V, --version             Print version information and exit.
Commands:
  help     Display help information about the specified command.
  config   Commands to configure this application
  login    Login, start a new session, and set/change the default profile
  submit   Emulation of the spark-submit script
  uploads  Commands to manage file uploads and secrets
  status   List status of a batch or connect job
  logs     Get logs of a batch or connect job
  delete   Delete a batch job or Spark Connect instance
  profile  Commands to get or set the default profile
```
## Commands
### Global arguments

The following CLI arguments are global arguments:

| Argument | Description |
| --- | --- |
| `-h`, `--help` | Show usage help and exit. |
|  | For commands that produce a result, changes how the result is formatted. |
| `-p`, `--profile=<profile>` | Configuration profile to use. If no profile is specified, the default profile is used. |
|  | Skip SSL certificate validation. **Warning:** This option should only be used for testing. |
|  | Changes the output/logging level. |
|  | Path to a KeyStore containing client certificates and the KeyStore password. |
|  | Path to an X509 client certificate in PEM format, or a directory of X509 client certificates. |
### General

The following commands are general commands:

| Command | Description |
| --- | --- |
| `login` | Log in, start a new session, and set or change the default profile. |
| `submit` | Emulation of Spark's `spark-submit` script. Not all options and combinations are supported. |
| `uris` | List URIs for a batch or connect job. Spark Connect URIs contain your current access and refresh tokens. **Note:** It can take a few seconds for URIs to become available. |
| `version` | Show the current version. |
### Admin

**Note:** Admin commands are available for admin users only.

The following CLI commands can be used for admin tasks:

| Command | Description |
| --- | --- |
|  | List API server logs. |
|  | List all instances. |
|  | Get details of all uploaded files. |
|  | Get details of all uploaded secrets. |
|  | List scheduler logs. |
|  | List system events. |
The following commands manage resource pools:

| Command | Description |
| --- | --- |
|  | Get the current resource pool user assignments. |
| `resource-pools get` | Get the current state of the resource pools. |
| `resource-pools` | Resource pool management. |
| `resource-pools set` | Set the resource pools. Specify a set of arguments for each resource pool. |
|  | Replace the current resource pool user assignments. |
The following commands manage user role assignments:

| Command | Description |
| --- | --- |
| `users assign-roles` | Assign a user to a role. |
|  | Show the current user's role assignments. |
| `users unassign-roles` | Unassign a user from a role. |
|  | Replace all user role assignments. |
### Configuration

The following CLI commands can be used for configuration tasks:

| Command | Description |
| --- | --- |
| `config` | Commands to configure this application. |
|  | Display the current configuration. |
|  | Set configuration. |
### Instance

The following CLI commands can be used for instance-related tasks:

| Command | Description |
| --- | --- |
| `instance` | Commands to manage instances. |
| `uris` | List URIs for a batch or connect job. Spark Connect URIs contain your current access and refresh tokens. |
| `delete` | Delete a batch job or Spark Connect instance. |
| `logs` | Get logs of a batch or Spark Connect job. |
| `status` | List the status of a batch or Spark Connect job. |
### Profile

The following CLI commands can be used for profile-related tasks:

| Command | Description |
| --- | --- |
| `profile` | Commands to get or set the default profile. |
|  | Delete the specified profile. |
|  | Show the current default profile. |
|  | Show all configured profile names. |
|  | Set the default profile. |
### Uploads

The following CLI commands can be used for tasks related to file uploads and secrets:

| Command | Description |
| --- | --- |
| `uploads` | Commands to manage file uploads and secrets. |
|  | Create a file upload set. |
|  | Create a new secret set. |
|  | Delete a file set. |
|  | Delete a secret set. |
|  | Get details of an uploaded file. |
|  | Get details of an uploaded secret. |
|  | Update a file upload set. |
|  | Update a secret set. |
## Examples

The following sections show you how to use various Dell Data Processing Engine CLI commands.
### Login to CLI

Logging in to the CLI requires authorizing with your IdP. Enter the `login` command to log in:

```shell
./dell-data-processing-engine login
```

The `login` command opens a new tab in your browser where you enter your credentials. Upon logging in, an access token and an encrypted value are generated and written to your local configuration file for later use.
### Submit a batch job

An example Dell Data Processing Engine batch job submission that uses the `org.apache.spark.examples.SparkPi` class to calculate pi:

```shell
./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"
```

```text
-------------------------------------------
sparkId | b-7537272864c477686cd6bfc11f5ca66
-------------------------------------------
```

**Note:** Save the `sparkId` for subsequent use with the API.

View the logs for the job using the `sparkId` generated by the previous command:

```shell
./dell-data-processing-engine instance logs b-7537272864c477686cd6bfc11f5ca66
```

```text
Job 0 finished: reduce at SparkPi.scala:38, took 7.584185 s
Pi is roughly 3.1372556862784315
```
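The `SparkPi` example estimates pi with a Monte Carlo method: it samples random points in the unit square and counts how many fall inside the unit circle. A minimal local Python sketch of the same computation, without Spark:

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points in the unit square and
    counting how many fall inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter-circle covers pi/4 of the unit square.
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

As with the distributed job, the result is only approximate and varies with the number of samples.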
To delete an instance:

```shell
./dell-data-processing-engine instance delete b-7537272864c477686cd6bfc11f5ca66
```

```text
Delete b-7537272864c477686cd6bfc11f5ca66? [y/N]> y
Delete request sent
```
### Submit a batch job on a schedule

Using the same class and `.jar` file as the previous example, you can run the job on a CRON schedule by using the `--cron` option. The following example runs the job every hour at minute 0:

```shell
./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi --cron "0 * * * *" "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"
```

To run the job every day at 12:00 PM:

```shell
./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi --cron "0 12 * * *" "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"
```

The timezone is UTC by default. You can change the timezone with the `--cron-timezone` option. The following example runs the batch job at 12:00 PM EST:

```shell
./dell-data-processing-engine submit --class org.apache.spark.examples.SparkPi --cron-timezone="EST" --cron "0 12 * * *" "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar"
```
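The schedules above are standard five-field cron expressions: minute, hour, day of month, month, and day of week. A small illustrative Python sketch (not part of the CLI) that labels the fields:

```python
# Standard cron field order, left to right.
CRON_FIELDS = ["minute", "hour", "day of month", "month", "day of week"]

def describe_cron(expr: str) -> dict:
    """Split a five-field cron expression into labeled fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))

print(describe_cron("0 * * * *"))   # every hour at minute 0
print(describe_cron("0 12 * * *"))  # every day at 12:00
```

A `*` means "every value" for that field, so `0 12 * * *` fires when the minute is 0 and the hour is 12, on every day.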
### Submit a Spark Connect job

To submit a Spark Connect job:

```shell
./dell-data-processing-engine submit --spark-connect
```

```text
Spark connect job started:
---------------------------------------------
sparkId | c-eed1bf18f8526486c9ecddf9819a273db
---------------------------------------------
```

You can then use the provided URI to start the instance:

```shell
./dell-data-processing-engine instance uris c-eed1bf18f8526486c9ecddf9819a273db
```

```text
Spark Web UI: https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787
Spark Connect: sc://c-eed1bf18f8526486c9ecddf9819a273db-grpc.local.gate.net:8787/token=eyJraWQiOiJ0ZXN0aW5nIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyN ...
```

Copy the token and use the `--remote` option to run on PySpark:

```shell
./bin/pyspark --remote "sc://c-eed1bf18f8526486c9ecddf9819a273db-grpc.local.gate.net:8787/token=eyJraWQiOiJ0ZXN0aW5nIiwidHlwIjoiSldUIiwiYWxnIjoiUlMyN ..."
```
```text
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.10.14 (main, Mar 19 2024 21:46:16)
Client connected to the Spark Connect server at https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787
SparkSession available as `spark`.
```
You can now execute Python code with Spark Connect:

```python
>>> from datetime import datetime, date
>>> from pyspark.sql import Row
>>>
>>> df = spark.createDataFrame([
...     Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
...     Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
...     Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
... ])
>>> df.show()
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+
```
Use the `https://c-eed1bf18f8526486c9ecddf9819a273db-ui.local.net:8787` URL to view the Spark Web UI. Dell Data Processing Engine requires authentication through your IdP before accessing the Spark Web UI. If you have not already logged in, you are taken to your IdP's login screen to enter your credentials.
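The Spark Connect URI shown above packs the gRPC endpoint and the access token into one string of the form `sc://<host>:<port>/token=<jwt>`. If you need the pieces separately, for example to pass the token to another tool, a small Python sketch can split them (the hostname below is a made-up placeholder):

```python
def split_connect_uri(uri: str) -> tuple[str, str]:
    """Split an sc://<host>:<port>/token=<jwt> URI, the format
    shown above, into its endpoint and token parts."""
    prefix = "sc://"
    if not uri.startswith(prefix):
        raise ValueError("expected a Spark Connect URI starting with sc://")
    # Everything before the first "/" is host:port; the rest is the parameter.
    endpoint, _, params = uri[len(prefix):].partition("/")
    key, _, token = params.partition("=")
    if key != "token":
        raise ValueError("expected a token parameter")
    return endpoint, token

endpoint, token = split_connect_uri("sc://example-grpc.local.net:8787/token=abc123")
print(endpoint)  # example-grpc.local.net:8787
print(token)     # abc123
```

Treat the token as a secret: it grants access to your session until it expires.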
### Configure resource pools

The following example sets two resource pools: one named `default` with a minimum number of cores set to `1`, and another named `large` with a minimum number of cores set to `3` and a maximum number of cores set to `8`:

```shell
./dell-data-processing-engine admin resource-pools set --name default --min-cores 1 --name large --min-cores 3 --max-cores 8
```
Note that in this example, the `default` resource pool does not explicitly set a maximum number of cores with the `--max-cores` option. The `default` resource pool instead uses the default value for `--max-cores`, which is `1`.
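The repeated `--name`, `--min-cores`, and `--max-cores` flags are grouped positionally: each `--name` starts a new pool, and the core options that follow apply to that pool. A hypothetical Python sketch of this grouping, using the `--max-cores` default of `1` noted above (illustrative only, not the CLI's actual parser):

```python
def group_pool_args(args: list[str], default_max_cores: int = 1) -> list[dict]:
    """Group repeated flag/value pairs into one dict per resource pool.
    Each --name begins a new pool; following options attach to it."""
    pools = []
    it = iter(args)
    for flag in it:
        value = next(it)  # flags and values alternate
        if flag == "--name":
            pools.append({"name": value, "min_cores": None,
                          "max_cores": default_max_cores})
        elif flag == "--min-cores":
            pools[-1]["min_cores"] = int(value)
        elif flag == "--max-cores":
            pools[-1]["max_cores"] = int(value)
    return pools

args = ["--name", "default", "--min-cores", "1",
        "--name", "large", "--min-cores", "3", "--max-cores", "8"]
print(group_pool_args(args))
```

Under this reading, the example command produces a `default` pool with 1 minimum and 1 maximum core, and a `large` pool with 3 minimum and 8 maximum cores.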
You can use the `get` command to return a list of your resource pools and their configuration:

```shell
./dell-data-processing-engine admin resource-pools get
```
### Assign roles to users

The `assign-roles` command assigns the role `role` to the user `user`:

```shell
./dell-data-processing-engine admin users assign-roles -r role -u user
```

To assign the `admin` role to `user`:

```shell
./dell-data-processing-engine admin users assign-roles -r admin -u user
```

Similarly, you can unassign roles from users with the `unassign-roles` command:

```shell
./dell-data-processing-engine admin users unassign-roles -r role -u user
./dell-data-processing-engine admin users unassign-roles -r admin -u user
```