Wednesday, September 27, 2023

Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.


The Puzzle of the Slow Query

Set within the framework of data analysis for the ATLAS experiment's Detector Control System (DCS), our exploration uses data stored in Parquet format and deploys Apache Spark for queries. The setup: Jupyter notebooks running on the SWAN service at CERN, interfacing with the Hadoop and Spark service.

The Hiccup: A notably slow query during data analysis where two tables are joined. Running on 32 cores, this query takes 27 minutes—surprisingly long given the amount of data in play.

The tables involved:

  • EVENTHISTORY: a log of events for specific sub-detectors; each row contains a timestamp, the subsystem id, and a value
  • LUMINOSITY: a table containing the details of time intervals called "luminosity blocks", see Luminosity block - Particle Wiki

Data size:
EVENTHISTORY is a large table that can collect millions of data points per day, while LUMINOSITY is a much smaller table (only thousands of points per day). In the test case reported here we used data collected over 1 day: roughly 75M records in EVENTHISTORY and 2K records in LUMINOSITY.
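
The SQL below refers to the two tables as eventhistory and luminosity. As a minimal sketch of how the Parquet data could be exposed to Spark SQL (the paths are hypothetical placeholders), the DataFrames can be registered as temporary views:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DCS analysis").getOrCreate()

# Hypothetical Parquet locations: adapt to where the data actually resides
eventhistory_df = spark.read.parquet("/data/dcs/eventhistory")
luminosity_df = spark.read.parquet("/data/dcs/luminosity")

# Register temporary views so the tables can be referenced by name from spark.sql()
eventhistory_df.createOrReplaceTempView("eventhistory")
luminosity_df.createOrReplaceTempView("luminosity")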


The join condition between EVENTHISTORY and LUMINOSITY is an expression that matches events in EVENTHISTORY with time intervals in LUMINOSITY (note this is not a join based on an equality predicate). This is what the query looks like in SQL:


spark.sql("""
select l.LUMI_NUMBER, e.ELEMENT_ID, e.VALUE_NUMBER
from eventhistory e, luminosity l
where e.ts between l.starttime and l.endtime
""")


An alternative version of the same query written using the DataFrame API:

eventhistory_df.join(
    luminosity_df, 
    (eventhistory_df.ts >= luminosity_df.starttime) & 
    (eventhistory_df.ts <= luminosity_df.endtime)
    ).select(luminosity_df.LUMI_NUMBER,
             eventhistory_df.ELEMENT_ID,
             eventhistory_df.VALUE_NUMBER)
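
Both snippets above only define the query; a Spark action is needed to actually execute it. A simple way to force full execution for timing purposes (a sketch assuming Spark 3.x; the original analysis may have materialized the result differently) is to write to the no-op data source:

import time

result_df = spark.sql("""
select l.LUMI_NUMBER, e.ELEMENT_ID, e.VALUE_NUMBER
from eventhistory e, luminosity l
where e.ts between l.starttime and l.endtime
""")

# The "noop" sink executes the full query without writing the output anywhere
start = time.time()
result_df.write.format("noop").mode("overwrite").save()
print(f"Query executed in {time.time() - start:.1f} seconds")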


Cracking the Performance Case

WebUI: The first point of entry for troubleshooting this was the Spark WebUI. There we found the execution time of the query (27 minutes) and details of the execution plan and SQL metrics under the "SQL / DataFrame" tab. Figure 1 shows a relevant snippet where we can clearly see that a broadcast nested loop join was used.


Execution Plan: The execution plan is the one we wanted for this query: the small LUMINOSITY table is broadcast to all the executors and then joined with each partition of the larger EVENTHISTORY table (see also the programmatic check sketched after Figure 1).


Figure 1: This shows a relevant snippet of the execution graph from the Spark WebUI. The slow query discussed in this post runs using broadcast nested loops join. This means that the small table is broadcasted to all the nodes and then joined to each partition of the larger table.
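
The join strategy can also be checked programmatically. Reusing the result_df DataFrame from the sketch above, the physical plan, where BroadcastNestedLoopJoin and the join condition appear, can be printed directly from the notebook:

# Prints the physical plan; look for BroadcastNestedLoopJoin and the join condition
result_df.explain()

# Spark 3.x also supports a more detailed, per-operator output similar to the WebUI view
result_df.explain(mode="formatted")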


CPU utilization measured with Spark Dashboard

Spark Dashboard instrumentation provides a way to collect and visualize Spark execution metrics. This makes it easy to plot the CPU used during the SQL execution, and from there we could see that the workload was CPU-bound.
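
As a lighter-weight cross-check from the notebook, aggregated executor CPU time can be compared with executor run time using the sparkMeasure tool (listed in the related work below); this is only a sketch and assumes the sparkmeasure Python package and its companion jar are available in the session:

from sparkmeasure import StageMetrics

# Collect stage-level metrics around the execution of the slow query
stagemetrics = StageMetrics(spark)
stagemetrics.begin()
result_df.write.format("noop").mode("overwrite").save()
stagemetrics.end()

# The report includes executorRunTime and executorCpuTime: when CPU time is close
# to run time, the tasks spend most of their time on CPU, i.e. the workload is CPU-bound
stagemetrics.print_report()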


The Clue: Profiling with Flame Graphs and Pyroscope

Stack profiling and Flame Graph visualization are powerful techniques to investigate CPU-bound workloads. We use them here to find where the CPU cycles are being consumed and, therefore, what makes the query slow.

First a little recap of what is stack profiling with flame graph visualization, and what tools we can use to apply it to Apache Spark workloads:

Stack profiling and Flame Graphs visualization provide a powerful technique for troubleshooting CPU-bound workloads. 
  • Flame Graphs provide information on the "hot methods" consuming CPU
  • Flame Graphs and profiling can also be used to profile time spent waiting (off-cpu) and memory allocation

Grafana Pyroscope simplifies data collection and visualization, using agents and a custom WebUI. Key motivations for using it with Spark are:
  • Streamlined Data Collection & Visualization: The Pyroscope project page offers a simplified approach to data gathering and visualization with its custom WebUI and agent integration.
  • Java Integration: The Pyroscope java agent is tailored to work seamlessly with Spark. This integration shines especially when Spark is running on various clusters such as YARN, K8S, or standalone Spark clusters.
  • Correlation with Grafana: Grafana’s integration with Pyroscope lets you juxtapose metrics with other instruments, including the Spark metrics dashboard.
  • Proven Underlying Technology: For Java and Python, the tech essentials for collecting stack profiling data, async-profiler and py-spy, are time-tested and reliable.
  • Functional & Detailed WebUI: Pyroscope’s WebUI stands out with features that allow users to:
    • Select specific data periods
    • Store and display data across various measurements
    • Offer functionalities to contrast and differentiate measurements
    • Showcase collected data for all Spark executors, with an option to focus on individual executors or machines
  • Lightweight Data Acquisition: The Pyroscope java agent is efficient in data gathering. By default, stacks are sampled every 10 milliseconds and uploaded every 10 seconds. We did not observe any measurable performance or stability impact of the instrumentation.

Spark Configuration


To use Pyroscope with Spark we added a few extra configurations. Note that the setup shown here uses a specialized Spark Plugin from the Spark Plugins repository; it is also possible to attach the Pyroscope java agent directly, as sketched below.
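
For reference, here is a rough sketch of the java-agent alternative. It assumes the Pyroscope agent jar is already available on the cluster nodes at a placeholder path, and that the agent reads the PYROSCOPE_SERVER_ADDRESS and PYROSCOPE_APPLICATION_NAME environment variables (check the Pyroscope java agent documentation for the authoritative configuration). The plugin-based setup actually used for this post follows in the steps below.

from pyspark.sql import SparkSession

# Sketch only: /path/to/pyroscope.jar is a placeholder for the agent jar location on the nodes
spark = (SparkSession.builder
         .appName("DCS analysis")
         .master("yarn")
         .config("spark.executor.extraJavaOptions", "-javaagent:/path/to/pyroscope.jar")
         .config("spark.executorEnv.PYROSCOPE_SERVER_ADDRESS", "http://pcitdbgpu1:5040")
         .config("spark.executorEnv.PYROSCOPE_APPLICATION_NAME", "spark_executors")
         .getOrCreate()
        )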

This is how we profiled and visualized the Flame Graph of the query execution:

1. Start Pyroscope
  • Download from https://github.com/grafana/pyroscope/releases
  • CLI start: ./pyroscope -server.http-listen-port 5040
  • Or use docker: docker run -it -p 5040:4040 grafana/pyroscope
  • Note: customize the port number as needed; I used port 5040 to avoid confusion with the Spark WebUI, which also uses port 4040 by default
2. Start Spark with custom configuration, as in this example with PySpark:

# Get the Spark session
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DCS analysis")
         .master("yarn")
         # Pull in the Spark Plugins package and the Pyroscope java agent
         .config("spark.jars.packages",
                 "ch.cern.sparkmeasure:sparkplugins_2.12:0.3,io.pyroscope:agent:0.12.0")
         # The plugin wires the Pyroscope agent into the driver and the executors
         .config("spark.plugins", "ch.cern.PyroscopePlugin")
         # Endpoint of the Pyroscope server started in step 1
         .config("spark.pyroscope.server", "http://pcitdbgpu1:5040")
         .getOrCreate()
        )



Figure 2: This is a snapshot from the Grafana Pyroscope dashboard with data collected during the execution of the slow query (the join between EVENTHISTORY and LUMINOSITY). The query runs in 27 minutes, using 32 cores. The Flame Graph shows the top executed methods. Notably, a large fraction of the execution time appears to be spent in SparkDateTimeUtils performing date-datatype conversion operations. This is a crucial finding for the rest of the troubleshooting and the proposed fix.


The Insight  


Using profiling data from Pyroscope, we pinpointed the root cause of the query's sluggishness. Spark was expending excessive CPU cycles on data type conversion operations during the evaluation of the join predicate. Upon revisiting the WebUI and delving deeper into the execution plan under the SQL/DataFrame tab, we discovered, almost concealed in plain view, the specific step responsible for the heightened CPU consumption:

(9) BroadcastNestedLoopJoin [codegen id : 2]
Join condition: ((ts#1 >= cast(starttime_dec#57 as timestamp)) AND (ts#1 <= cast(endtime_dec#58 as timestamp)))

The extra "cast to timestamp" operations appear to be key to explaining the issue.
Why do we have data type conversions?
Inspecting the schema of the involved tables shows that in the LUMINOSITY table the fields used in the join condition are of type Decimal rather than timestamp.
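
A quick way to verify this from the notebook (a sketch; only the columns relevant to the join are shown):

# Compare the data types of the columns used in the join condition
eventhistory_df.select("ts").printSchema()                   # ts is a timestamp
luminosity_df.select("starttime", "endtime").printSchema()   # starttime and endtime are Decimal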

To recap, profiling data, together with the execution plan, showed that the query was slow because it forced data type conversion over and over for each row where the join condition was evaluated.

The fix:
The solution we applied was simple: we used the same data type for all the columns involved in the join, in particular converting the starttime and endtime columns of the LUMINOSITY table to timestamp.
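
A minimal sketch of the conversion, applied here at query time on the DataFrames from the example above (the same change can equally be applied once to the stored table); casting the small table up front means the join predicate no longer needs a per-row cast:

from pyspark.sql.functions import col

# Cast the Decimal interval boundaries to timestamp up front (only ~2K rows to convert)
luminosity_fixed_df = (luminosity_df
    .withColumn("starttime", col("starttime").cast("timestamp"))
    .withColumn("endtime", col("endtime").cast("timestamp")))

eventhistory_df.join(
    luminosity_fixed_df,
    (eventhistory_df.ts >= luminosity_fixed_df.starttime) &
    (eventhistory_df.ts <= luminosity_fixed_df.endtime)
    ).select(luminosity_fixed_df.LUMI_NUMBER,
             eventhistory_df.ELEMENT_ID,
             eventhistory_df.VALUE_NUMBER)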

Results: improved performance 70x:
After the change, the query runs in 23 seconds, compared to the previous runtime of 27 minutes (27 minutes is roughly 1620 seconds, and 1620 / 23 ≈ 70, hence the 70x speedup). Figure 3 shows the Flame Graph after the fix was applied.



Figure 3: This is a snapshot of the Grafana Pyroscope dashboard with data collected during the execution of the query after tuning. The query takes only 23 seconds compared to 27 minutes before tuning (see Figure 2)

Related work and links

Details of how to use Pyroscope with Spark can be found in the notes referenced below.
Related work of interest for Apache Spark performance troubleshooting:
  • Spark Dashboard - tooling and configuration for deploying an Apache Spark Performance Dashboard using containers technology.
  • Spark Measure - a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
  • Spark Plugins - Code and examples of how to write and deploy Apache Spark Plugins.
  • Spark Notes and Performance Testing notes

Wrapping up

Stack profiling and Flame Graph visualization aren't just jargon: they're game-changers. Our deep dive showed how they helped improve the performance of an Apache Spark query by 70x. Using Grafana Pyroscope with Spark, we demonstrated a practical approach to gathering, analyzing, and leveraging stack profile data.

A hearty thank you to my colleagues at CERN for their guidance. A special nod to the CERN data analytics, monitoring, and web notebook services, and to the ATLAS database team.