Apache Spark is incredibly powerful, but anyone who has worked with it long enough knows the feeling:
Why is this job suddenly slower today?
Why are executors running out of memory?
Why is one stage taking 90% of the runtime?
What exactly is Spark doing behind the scenes?
Spark comes with instrumentation, notably the Spark UI
provides much information on the jobs execution details. However, there are
additional instrumentation, mased on metrics, that require additional effort to
collect, display and analyze.
Over the years, we developed and open-sourced a set of tools
to help Spark users better understand, monitor, and troubleshoot their
applications, from interactive notebooks to large production clusters.
Today we would like to present the three complementary Spark
monitoring tools we publish and actively use internally at CERN.
---
Three Different Tools, Three Different Perspectives on Spark
Each tool addresses a different layer of Spark
observability:
|
Tool |
Main Goal |
|
Capture and analyze Spark performance metrics
programmatically |
|
|
Improve visibility and usability inside notebooks |
|
|
Build real-time monitoring dashboards and historical
observability |
Together, they form a lightweight but powerful ecosystem for
Spark monitoring and troubleshooting.
1. sparkMeasure — Understand What Your Spark Job Is Really Doing
Project
sparkMeasure GitHub Repository
If you use Spark interactively, tune jobs, or benchmark
workloads, chances are you have wished for an easier way to collect performance
metrics directly from your code.
That is exactly what sparkMeasure was built for.
sparkMeasure provides lightweight instrumentation for Apache
Spark applications and exposes metrics directly inside Scala and PySpark
workflows.
Why Users Like It
One major advantage of sparkMeasure is simplicity.
A few lines of code are enough to start collecting
meaningful metrics.
Example Use Cases
- Compare
two DataFrame/SQL optimization strategies
- Investigate
shuffle-heavy workloads
- Measure
the impact of partitioning changes
- Capture
metrics during automated tests
- Export
performance summaries to DataFrames
Adoption
sparkMeasure has seen strong adoption in the Spark community. The Python wrapper alone currently exceeds 1 million downloads per month. This level of usage demonstrates how broadly useful lightweight Spark instrumentation has become.
This Figure shows the architecture of sparkMeasure and how it integrates with Spark Metrics to collect and store Spark execution metrics.
2. sparkmonitor — Making Spark Friendlier in Notebooks
Project
sparkmonitor GitHub Repository
Interactive notebooks are one of the most popular ways to
use Spark today.
But notebook users often struggle with:
- limited
visibility into job progress,
- difficulty
understanding Spark execution,
- and
poor feedback during long-running computations.
This is where sparkmonitor helps.
sparkmonitor was originally developed for CERN’s hosted
notebook environments to improve the Spark experience for users running Jupyter
notebooks interactively.
It provides notebook-friendly monitoring and visualization
capabilities that make Spark behavior easier to understand while jobs are
running.
The goal is simple:
Make Spark execution more transparent and easier to follow
directly from the notebook environment.
3. Spark Dashboard — Real-Time Observability for Spark Clusters
Project
Spark Dashboard GitHub Repository
Sometimes notebook-level visibility is not enough.
Operations teams and platform administrators need real-time
monitoring and historical metrics,
This is the goal of the Spark Dashboard project.
Why This Matters
Real-time dashboards help answer operational questions such
as:
- Are
executors under memory pressure?
- Is the
cluster saturated?
- Did a
deployment introduce regressions?
- Are
workloads becoming slower over time?
- Which
applications consume the most resources?
At CERN, this dashboard is also used internally as an optional enhanced monitoring solution for Spark services.
The dashboard stack combines Spark metrics with other powerful open source metrics storage and observability technologies, as Telegraf, Victoria Metrics and Grafana Dashboards:
Test-Driving the Monitoring Tools with TPC-DS Benchmarks
A practical way to explore and validate these monitoring
tools is to run them against reproducible Spark benchmark workloads such as
TPC-DS.
We use an open-source PySpark wrapper for TPC-DS benchmark execution: TPCDS PySpark Wrapper
Complementary Rather Than Competing
These tools are designed to work at different levels. Think of them as complementary building blocks.
Open Source and Community
All three projects are open source and publicly available on
GitHub.
Repositories
Final Thoughts
Spark applications can sometimes feel like black boxes.
The more visibility users have into execution behavior,
metrics, and resource consumption, the easier it becomes to:
- optimize
workloads,
- troubleshoot
issues,
- improve
cluster efficiency,
- and
teach others how Spark really works.
We hope these tools help make Spark more observable,
understandable, and easier to operate, whether you are running a single
notebook or a large production platform.
If you are already using them, we would love to hear about
your use cases and feedback from the community.



