Overview

Dell Data Analytics Engine, powered by Starburst Enterprise platform (SEP), includes support for Dell Data Processing Engine, built for Apache Spark, the open-source distributed processing engine designed for large-scale data analytics.

About Dell Data Processing Engine

User personas

Dell Data Processing Engine is used primarily by two personas: platform administrators, who configure and manage the Dell Data Processing Engine environment, and data engineers, who develop and deploy data pipelines using Spark clusters. View the following documentation for each persona to learn more:

For platform administrators:

For data engineers:

Command line interface

Dell Data Processing Engine includes a command line interface for interacting with the Dell Data Processing Engine API. View the command line interface documentation.

User interface

A user interface is available for many components of the Dell Data Processing Engine:

Release notes

The release notes page lists new features, upgraded functionality, bug fixes, and breaking changes.

Software version information

The following software versions are included:

Component Version
Spark 3.5.3
Hive Metastore 3.1.3
Hudi 1.0.1
Iceberg 1.8.0
Delta Lake 3.3.0
Parquet 1.13.1

The following lists show the included Python and Java libraries:

Python libraries
Library Version
absl-py 2.1.0
aiobotocore 2.17.0
aiohappyeyeballs 2.4.4
aiohttp 3.11.11
aioitertools 0.12.0
aiosignal 1.3.2
astunparse 1.6.3
async-timeout 5.0.1
attrs 24.3.0
blis 1.2.0
boto3 1.35.93
botocore 1.35.93
certifi 2024.12.14
charset-normalizer 3.4.1
cramjam 2.9.1
et_xmlfile 2.0.0
fastparquet 2024.11.0
flatbuffers 24.12.23
frozenlist 1.5.0
fsspec 2024.12.0
gast 0.6.0
google-pasta 0.2.0
greenlet 3.1.1
grpcio 1.69.0
h5py 3.12.1
idna 3.10
jmespath 1.0.1
joblib 1.4.2
keras 3.8.0
libclang 18.1.1
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 3.0.2
mdurl 0.1.2
ml-dtypes 0.4.1
multidict 6.1.0
namex 0.0.8
numpy 2.0.2
openpyxl 3.1.5
opt_einsum 3.4.0
optree 0.13.1
packaging 24.2
pandas 2.2.3
patsy 1.0.1
pip 24.3.1
propcache 0.2.1
protobuf 5.29.3
pyarrow 18.1.0
Pygments 2.19.1
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
requests 2.32.3
rich 13.9.4
s3fs 2024.12.0
s3transfer 0.10.4
scipy 1.15.1
setuptools 59.6.0
six 1.17.0
SQLAlchemy 2.0.37
statsmodels 0.14.4
spacy 3.8.3
tensorboard 2.18.0
tensorboard-data-server 0.7.2
tensorflow 2.18.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor 2.5.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.3.0
Werkzeug 3.1.3
wheel 0.37.1
wrapt 1.17.2
xlrd 2.0.1
yarl 1.18.3
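
When a job depends on specific library versions, it can help to compare the environment against the Python library list before running. The following is a minimal sketch using only the Python standard library (`importlib.metadata`). The `EXPECTED` dictionary and the `find_mismatches` helper are illustrative names, not part of Dell Data Processing Engine, and only a small subset of the list is shown.

```python
from importlib import metadata

# Expected versions copied from the Python libraries list (a small subset).
EXPECTED = {
    "numpy": "2.0.2",
    "pandas": "2.2.3",
    "pyarrow": "18.1.0",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {name: (expected, found)} for packages whose version differs.

    `found` is None when the package is not installed. `lookup` is a
    hypothetical hook that defaults to importlib.metadata.version.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Report every package that is missing or at a different version.
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script inside the Spark driver's Python environment prints one line per missing or mismatched package and prints nothing when the listed subset matches.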
Java libraries
Library Artifact
Hudi org.apache.hudi:hudi-spark3.3.x_2_12:1.0.0
Iceberg org.apache.iceberg:iceberg-spark-runtime:0.13.2
Nessie org.projectnessie:nessie-spark-extensions:0.45.0
Delta io.delta:delta-spark_2_12:3.3.0

The Java libraries are stored in the opt/spark/lib directory and must be added to your jobs as needed.
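
One common way to attach extra JARs to a Spark job is the `--jars` option of `spark-submit`, which takes a comma-separated list of paths. The sketch below only builds such an argument list and does not run anything. It assumes jobs are launched with `spark-submit`, which may not match how Dell Data Processing Engine submits jobs. The `build_spark_submit` helper, the JAR file names, and `my_job.py` are hypothetical and used for illustration only.

```python
from pathlib import PurePosixPath


def build_spark_submit(app, jar_dir, jar_names):
    """Build a spark-submit argument list that attaches the named JARs via --jars."""
    # Join each JAR file name onto the library directory.
    jars = [str(PurePosixPath(jar_dir) / name) for name in jar_names]
    cmd = ["spark-submit"]
    if jars:
        # --jars expects a single comma-separated value.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Hypothetical JAR file names; check the library directory for the real ones.
    command = build_spark_submit(
        "my_job.py",
        "opt/spark/lib",
        ["iceberg-spark-runtime.jar", "delta-spark.jar"],
    )
    print(" ".join(command))
```

Listing only the JARs a job actually needs keeps the classpath small and avoids loading table-format libraries the job does not use.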