Thursday, March 26, 2020

Distributed Deep Learning for Physics with TensorFlow and Kubernetes

Summary: This post details a solution for distributed deep learning training for a High Energy Physics use case, deployed using cloud resources and Kubernetes. You will find the results for training using CPU and GPU nodes. This post also describes an experimental tool that we developed, TF-Spawner, and how we used it to run distributed TensorFlow on a Kubernetes cluster.

Authors: and

A Particle Classifier

This work was developed as part of the pipeline described in Machine Learning Pipelines with Modern Big DataTools for High Energy Physics. The main goal is to build a particle classifier to improve the quality of data filtering for online systems at the LHC experiments. The classifier is implemented using a neural network model described in this research article.
The datasets used for test and training are stored in TFRecord format, with a cumulative size of about 250 GB, with 34 million events in the training dataset. A key part of the neural network (see Figure 1) is a GRU layer that is trained using lists of 801 particles with 19 low-level features each, which account for most of the training dataset. The datasets used for this work have been produced using Apache Spark, see details and code. The original pipeline produces files in Apache Parquet format; we have used Spark and the spark-tensorflow-connector to convert the datasets into TFRecord format, see also the code.

Data: download the datasets used for this work at this link
Code: see the code used for the tests reported in this post at this link

Figure 1: (left) Diagram of the neural network for the Inclusive Classifier model, from T. Nguyen et. al. (right) TF.Keras implementation used in this work.

Distributed Training on Cloud Resources

Cloud resources provide a suitable environment for scaling distributed training of neural networks. One of the key advantages of using cloud resources is the elasticity of the platform that allows allocating resources when needed. Moreover, container orchestration systems, in particular Kubernetes, provide a powerful and flexible API for deploying many types of workloads on cloud resources, including machine learning and data processing pipelines. CERN physicists, and data scientists in general, can access cloud resources and Kubernetes clusters via the CERN OpenStack private cloud. The use of public clouds is also being actively tested for High Energy Physics (HEP) workloads. The tests reported here have been run using resources from Oracle's OCI.
For this work, we have developed a custom launcher script, TF-Spawner (see also the paragraph on TF-Spawner for more details) for running distributed TensorFlow training code on Kubernetes clusters.
Training and test datasets have been copied to the cloud object storage prior to running the tests, OCI object storage in this case, while for tests run at CERN we used an S3 installation based on Ceph. Our model training job with TensorFlow used training and test data in TFRecord format, produced at the end of the data preparation part of the pipeline, as discussed in the previous paragraph. TensorFlow reads natively TFRecord format and has tunable parameters and optimizations when ingesting this type of data using the modules and We found that reading from OCI object storage can become a bottleneck for distributed training, as it requires reading data over the network which can suffer from bandwidth saturation, latency spikes and/or multi-tenancy noise. We followed TensorFlow's documentation recommendations for improving the data pipeline performance, by using prefetching, parallel data extraction, sequential interleaving, caching, and by using a large read buffer. Notably, caching has proven to be very useful for distributed training with GPUs and for some of the largest tests on CPU, where we observed that the first training epoch, which has to read the data into the cache, was much slower than subsequent epoch which would find data already cached.
Tests were run using TensorFlow version 2.0.1, using tf.distribute strategy "multi worker mirror strategy''. Additional care was taken to make sure that the different tests would also yield the same good results in terms of accuracy on the test dataset as what was found with training methods tested in previous work. To achieve this we have found that additional tuning was needed on the settings of the learning rate for the optimizer (we use the Adam optimizer for all the tests discussed in this article). We scaled the learning rate with the number of workers, to match the increase in effective batch size (we used 128 for each worker). In addition, we found that slowly reducing the learning rate as the number of epochs progressed, was beneficial to the convergence of the network. This additional step is an ad hoc tuning that we developed by trial and error and that we validated by monitoring the accuracy and loss on the test dataset at the end of each training.
To gather performance data, we ran the training for 6 epochs, which provided accuracy and loss very close to the best results that we would obtain by training the network up to 12 epochs. We have also tested adding shuffling between each epoch, using the shuffle method of the API, however it has not shown measurable improvements so this technique has not been further used in the tests reported here.

Figure 2: Measured speedup for the distributed training of the Inclusive Classifier model using TensorFlow and tf.distribute with “multi  worker  mirror  strategy”, running on cloud resources with CPU and GPU nodes (Nvidia P100), training for 6 epochs. The speedup values indicate how well the distributed training scales as the number of worker nodes, with CPU and GPU resources, increases.

Results and Performance Measurements, CPU and GPU Tests

We deployed our tests using Oracle's OCI. Cloud resources were used to build Kubernetes clusters using virtual machines (VMs). We used a set of Terraform script to automate the configuration process. The cluster for CPU tests used VMs of the flavor "VM.Standard2.16'', based on 2.0 GHz Intel Xeon Platinum 8167M, each providing 16 physical cores (Oracle cloud refers to this as OCPUs) and 240 GB of RAM. Tests in our configuration deployed 3 pods for each VM (Kubernetes node), each pod running one TensorFlow worker. Additional OS-based measurements on the VMs confirmed that this was a suitable configuration, as we could measure that the CPU utilization on each VM matched the number of available physical cores (OCPUs), therefore providing good utilization without saturation. The available RAM in the worker nodes was used to cache the training dataset using the API (data populates the cache during the first epoch).
Figure 2 shows the results of the Inclusive Classifier model training speedup for a variable number of nodes and CPU cores. Tests have been run using TF-Spawner. Measurements show that the training time decreases as the number of allocated cores increases. The speedup grows close to linearly in the range tested: from 32 cores to 480 cores. The largest distributed training test that we ran using CPU, used 480 physical cores (OCPU), distributed over 30 VM, each running 3 workers each (each worker running in a separate container in a pod), for a total of 90 workers.

Similarly, we have performed tests using GPU resources on OCI and running the workload with TF-Spawner. For the GPU tests we have used the VM flavor "GPU 2.1'' which comes equipped with one Nvidia P100 GPU, 12 physical cores (OCPU) and 72 GB of RAM. We have tested with distributed training up to 10 GPUs, and found that scalability was close to linear in the tested range. One important lesson learned when using GPUs is, that the slow performance of reading data from OCI storage makes the first training epoch much slower than the rest of the epochs (up to 3-4 times slower). It was therefore very important to use TensorFlow's caching for the training dataset for our tests with GPUs. However, we could only cache the training dataset for tests using 4 nodes or more, given the limited amount of memory in the VM flavor used (72 GB of RAM per node) compared to the size of the training set (200 GB).
Distributed training tests with CPUs and GPUs were performed using the same infrastructure, namely a Kubernetes cluster built on cloud resources and cloud storage allocated on OCI. Moreover, we used the same script for CPU and GPU training and used the same APIs, tf.distribute and tf.keras, and the same TensorFlow version. The TensorFlow runtime used was different for the two cases, as training on GPU resources took advantage of TensorFlow's optimizations for CUDA and Nvidia GPUs. Figure 3 shows the distributed training time measured for some selected cluster configurations. We can use these results to compare the performance we found when training on GPU and on CPU. For example, we find in Figure 3 that the training time of the Inclusive Classifier for 6 epochs using 400 CPU cores (distributed over 25 VMs equipped with 16 physical cores each) is about 2000 seconds, which is similar to the training time we measured when distributing the training over 6 nodes equipped with GPUs.
When training using GPU resources (Nvidia P100), we measured that each batch is processed in about 59 ms (except for epoch 1 which is I/O bound and is about 3x slower). Each batch contains 128 records, and has a size of about 7.4 MB. This corresponds to a measured throughput of training data flowing through the GPU of about 125 MB/sec per node (i.e. 1.2 GB/sec when training using 10 GPUs). When training on CPU, the measured processing time per batch is about 930 ms, which corresponds to 8 MB/sec per node, and amounts to 716 MB/sec for the training test with 90 workers and 480 CPU cores.
We do not believe these results can be easily generalized to other environments and models, however, they are reported here as they can be useful as an example and for future reference.

Figure 3: Selected measurements of the distributed training time for the Inclusive Classifier model using TensorFlow and tf.distribute with “multi worker mirror strategy”, training for 6 epochs, running on cloud resources, using CPU (2.0 GHz Intel Xeon Platinum 8167M) and GPU (Nvidia P100) nodes, on Oracle's OCI.


TF-Spawner is an experimental tool for running TensorFlow distributed training on Kubernetes clusters.
TF-Spawner takes as input the user's Python code for TensorFlow training, which is expected to use tf.distribute strategy for multi worker training, and runs it on a Kubernetes cluster. TF-Spawner takes care of requesting the desired number of workers, each running in a container image inside a dedicated pod (unit of execution) on a Kubernetes cluster. We used the official TensorFlow images from Docker Hub for this work. Moreover, TF-Spawner handles the distribution of the necessary credentials for authenticating with cloud storage and manages the TF_CONFIG environment variable needed by tf.distribute.


TensorBoard  metrics visualization:

TensorBoard provides monitoring and instrumentation for TensorFlow operations. To use TensorBoard with TF-Spawner you can follow a few additional steps detailed in the documentation.

Figure 4: TensorBoard visualization of the distributed training metrics for the Inclusive Classifier, trained on 10 GPUs nodes on a Kubernetes cluster using TF-Spawner. Measurements show that training convergences smoothly. Note: the reason why we see lower accuracy and greater loss for the training dataset compared to the validation dataset is due to the use of dropout in the model.

Limitations: We found TF-Spawner powerful and easy to use for the scope of this work. However, it is an experimental tool. Notably, there is no validation of the user-provided training script, it is simply passed to Python for execution. Users need to make sure that all the requested pods are effectively running, and have to manually take care of possible failures. At the end of the training, the pods will be found in "Completed" state, users can then manually get the information they need, such as the training time from the pods' log files. Similarly, other common operations, such as fetching the saved trained model, or monitoring training with TensorBoard, will need to be performed manually. These are all relatively easy tasks, but require additional effort and some familiarity with the Kubernetes environment.
Another limitation to the proposed approach is that the use of TF-Spawner does not naturally fit with the use of Jupyter Notebooks, which are often the preferred environment for ML development. Ideas for future work in this direction and other tools that can be helpful in this area are listed in the conclusions.
If you try and find TF-Spawner useful for your work, we welcome feedback.

Conclusions and Acknowledgements

This work shows an example of how we implemented distributed deep learning for a High Energy Physics use case, using commonly used tools and platforms from industry and open source, namely TensorFlow and Kubernetes. A key point of this work is demonstrating the use of cloud resources to scale out distributed training.
Machine learning and deep learning on large amounts of data are standard tools for particle physics, and their use is expected to increase in the HEP community in the coming year, both for data acquisition and for data analysis workflows, notably in the context of the challenges of the High Luminosity LHC project. Improvements in productivity and cost reduction for development, deployment, and maintenance of machine learning pipelines on HEP data are of high interest.
We have developed and used a simple tool for running TensorFlow distributed training on Kubernetes clusters, TF-Spawner. Previously reported work has addressed the implementation of the pipeline and distributed training using Apache Spark. Future work may address the use of other solutions for distributed training, using cloud resources and open source tools, such as Horovod on Spark and KubeFlow. In particular, we are interested in further exploring the integration of distributed training with the analytics platforms based on Jupyter Notebooks.

This work has been developed in the context of the Data Analytics services at CERN and of the CERN openlab project on machine learning in the cloud in collaboration with Oracle. Additional information on the work described here can be found in the article Machine Learning Pipelines with Modern Big DataTools for High Energy Physics. The authors would like to thank Matteo Migliorini and Marco Zanetti of the University of Padova for their collaboration and joint work, Thong Nguyen and Maurizio Pierini for their  help, suggestions, and for providing the dataset and models for this work. Many thanks also to CERN openlab, to our Oracle contacts for this project, and to our colleagues at the Spark and Hadoop Service at CERN.

Thursday, April 25, 2019

Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo

Topic: This post describes a data pipeline for a machine learning task of interest in high energy physics: building a particle classifier to improve event selection at the particle detectors. The pipeline is built using tools from the "Big Data ecosystem", notably Apache Spark, BigDL and Analytics Zoo. The work is accompanied by a set of notebooks with code and sample data.

The Physics use case

High Energy Physics (HEP) is a data-intensive domain: particle detectors and their sensors collect vast amounts of data. Data pipelines that move data from detectors to analysis engines are key to the experiments' success and operate at high throughput. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Most of the data thus collected has to be thrown away as it would be too costly to store the raw data. This is performed by fast data processing by a trigger system organized in multiple levels. The rate of data stored for later analysis is currently around 1-10 GB/sec.
Every improvement in the accuracy of the event filtering system is welcome as it can provide large savings in terms of compute and storage resources needed for data analysis. The work "Topology classification  with  deep  learning  to  improve  real-time event  selection  at  the  LHC" by Nguyen et al. provides a promising solution to this problem by using neural networks. The paper details how to build a classifier to used to improve the purity of data samples selected at trigger level, by distinguishing three different event topologies of interest ("W + jet", "QCD", "t tbar"). The classifier is intended to be used in the online filtering system to improve accuracy and in particular reduce the number of false positives compared to current systems, often implemented as complex rule-based systems.
The primary goal of this work is to reproduce the classification performance results of Nguyen et al., showing that the proposed pipeline makes a more efficient usage of computing resources and/or provides a more productive interface for the data scientist/physicists, along all the steps of the processing pipeline.

Figure 1: Event data collected from the particle detector (CMS experiment) contains different types of event topologies of interest. A particle classifier built with neural networks can be used as an event filter, improving state of the art in accuracy.

Data pipelines

Data pipelines are of paramount importance to make machine learning projects successful, by integrating multiple components and APIs used across the entire data processing chain. A good data pipeline implementation can accelerate and improve the productivity of the work around the core machine learning tasks. In particular, data pipelines are expected to provide solid tools for data processing, a task that ends up being one of the most time-consuming for data scientists/physicists approaching data analysis problems.
Traditionally, HEP has developed custom tools for data processing, which have been successfully used for decades. Recently, a large range of solutions for data processing and machine learning have become available from open source communities. The maturity and adoption of such solutions continue to grow both in industry and academia. Using software from open source communities comes with several advantages, including the reduced cost of development and maintenance and the possibility of sharing solutions and expertise with a large user and expert base.

In this work we build the machine learning pipeline detailed in the paper on topology classification with DL mentioned above, using tools from the "Big Data" ecosystem.
One of the key objectives for the machine learning data pipeline is to transform raw data into more valuable information used to train the ML/DL models. Apache Spark provides the backbone of the pipeline, from the task of fetching data from the storage system to feature processing and feeding training data into a DL engine (BigDL and Analytics Zoo are used in this work). The four steps of the pipeline we built are:

  • Data Ingestion: where we read data from ROOT format and from the CERN-EOS storage system, into a Spark DataFrame and save the results as a table stored in Apache Parquet files
  • Feature Engineering and Event Selection: where the Parquet files containing all the events details processed in Data Ingestion are filtered, and datasets with new  features are produced
  • Parameter Tuning: where the best set of hyperparameters for each model architecture are found by performing grid search
  • Training: where the best models found in the previous step are trained on the entire dataset

Figure 2: This illustrates the data processing pipeline used for training the event topology classifier.

Data source 

Data used for this work have been generated using software simulators to generate events and to calculate the detector response. The dataset is the result of a Monte Carlo event generation, where three different processes (categories) have been simulated: the inclusive production of a leptonically decaying W+/- boson, the pair production of a top-antitop pair (t, t_bar) and hadronic production of multijet events. Variables of low and high level are included in the dataset. See the paper on topology classification with DL for details.
For this exercise, the generated training data amounts to 4.5 TB, for a total of 54 million events, divided in 3 classes: "W + jet", "QCD", "t tbar" events. The generated training data is stored using the ROOT format, as it is a very common format for HEP. Data are originally stored in the CERN EOS storage system as it is the case for the majority of HEP data at CERN at present. The authors of the paper have kindly shared the training data for the purpose of this work. Each event of the dataset consists of a list of reconstructed particles. Each particle is associated with features providing information on the particle cinematic (position and momentum) and on the type of particle. Follow this link for additional details on the datasets and links to download data samples made available to reproduce this work.

Data ingestion

Data ingestion is the first step of the pipeline, where we read ROOT files from the CERN EOS storage system into a Spark DataFrame. For this, we use a dedicated library able to ingest ROOT data into Spark DataFrames: spark-root, an Apache Spark data source for the ROOT file format. It is based on a Java implementation of the ROOT I/O libraries, which offers the ability to translate ROOT files into Spark DataFrames and RDDs. In order to access the files stored in the CERN EOS storage system from Spark applications, another library has been developed: the Hadoop-XRootD connector. The Hadoop-XRootD connector is a Java library, it extends the HADOOP Filesystem and makes its capable of accessing files in EOS via the XRootD protocol. This allows Spark to read directly from EOS. which is convenient for our use case as it avoids the need for copying data in HDFS or other storage compatible with Spark/Hadoop libraries.
At the end of the data ingestion step the result is that data, with the same structure as the original ROOT files, are made available as a Spark DataFrame on which we can perform event selection and feature engineering.

Event selection and feature engineering

In this step of the pipeline, we process the dataset by applying relevant filters, by computing derived features and by applying data normalization techniques.
The first part of the processing requires domain-specific knowledge in HEP to simulate trigger selection: this is emulated by requiring all the events to include one isolated electron or muon with transverse momentum (p_T) above a given threshold, p_T >= 23 GeV. All particle are then ranked in decreasing order of p_T. For each event, the isolated lepton is the first entry of the list of particles. Together with the isolated lepton, the first 450 charged particles, the first 150 photons, and the first 200 neutral hadrons have been considered, for a total of 801 particles with 19 features each. The result is, that each event passing the filter is associated to a matrix with shape 801 x 19. This defines the Low Level Features (LLF) dataset.
Starting from the LLF, an additional set of 14 High Level Features (HLF) is computed. These additional features are motivated by known physics and data processing steps and will be of great help for improving the neural network model in later steps.
LLF and HLF datasets, computed as described above, are saved in Apache Parquet format: the amount of training data is reduced at this point from the original 4.4 TB of ROOT files to 950 GB of snappy-compressed Parquet files.

Additional processing steps are performed and include operations frequently found when preparing training data for classifiers, notably:
  • Data undersampling. This is done to work around the class imbalance in the original training data. After this step, we have the same number of events per each of the three classes, that is 1.4 million events for each of the 3 classes.
  • Data shuffling
  • All the features present in the datasets are scaled to take values between 0 and 1 or normalized, as needed for the different classifiers.
  • The datasets (for HLF and LLF features) are split in training and test datasets (80% and 20% respectively) and saved in two sets of Parquet files. Smaller datasets are also created as samples of the full train and test datasets, for development purposes,
Data are saved in snappy-compressed Parquet format at the end of this stage and amount to about 310 GB. The decrease in size from the previous step is mostly due to the undersampling step and the fact that the population on the 3 topology classes in the original training data used for this work are not balanced.

Code for this step: notebook for data ingestion and for feature preparation

Neural network models

We have tested three different neural network models, following the guiding research paper:

1. The first and simplest model is the "HLF Classifier". It is a fully connected feed-forward deep neural network taking as input the 14 high level features. The chosen architecture consists of three hidden layers with 50, 20, 10 nodes activated by Rectified Linear Units (ReLU). The output layer consists of 3 nodes, activated by the Softmax activation function.

2. The "Particle Sequence Classifier" is trained using recursive layers, taking as input the 801 particles in the Low Level Features dataset. Prior to "feeding the particles" into a recurrent neural network, particles are ordered by decreasing distance from the isolated lepton (see the original article for details).
Gated Recurrent Units (GRU) have been used to aggregate the input sequence of particles with a recurrent layer width of 50. The output of the GRU layer is fed into a fully connected layer with 3 Softmax-activated nodes.

3. The "Inclusive Classifier" is the most complex of the 3 models tested. In this classifier, some domain knowledge is injected into the particle-sequence classifier. In order to do this, the architecture of the particle sequence classifier has been modified by concatenating the 14 High Level Features to the output of the GRU layer after a dropout layer. An additional dense layer of 25 nodes is introduced before the final output layer consisting of 3 nodes, activated by the Softmax activation function.

Figure 3: (left) Code snippet of the inclusive classifier model with Analytics zoo, using Keras-API. (right) Neural network model diagram from Nguyen et al. 

Parameter tuning

Hyperparameter tuning is a common step for improving machine learning pipelines. In this work, we have used Spark to speed up this step. Spark allows training multiple models with different parameters concurrently on a cluster, with the result of speeding up the hyperparameter tuning step. We used the Area Under the ROC curve as the performance metric to compare different classifiers. When performing a grid search each run is independent of the others, hence this process can be easily parallelized and scales very well.
For example, we tested the High Level Features classifier, a feed forward neural network, taking as input the 14 High Level Features. For this model, we tested changing the number of layers and units per layer, the activation function, the optimizer, etc. As an experiment, we ran grid search on the small dataset containing 100K events using a grid of ~200 hyper-parameters sets (models).
Hyperparameter tuning can similarly be repeated for all three classifiers described above. For the following work, we have decided to use the same models as the ones presented in the paper, as they were offering the best trade-off between performances and complexity.
To run grid search in parallel, we used Spark with spark-sklearn and Tensorflow/Keras wrapper for scikit_learn.

Code: notebook with hyperparameter tuning

Figure 4: The hyperparameter tuning step consists of running multiple training jobs with the goal of finding an optimal set of parameters. Spark has been used to speed-up this step by running multiple Keras training jobs in parallel on a cluster.

Distributed training with Spark, BigDL and Analytics Zoo

There are many suitable software and platform solutions for deep learning training at present, and it is an exciting time in this respect with many paths to explore. However, choosing among them is not straightforward, as many products are available with different characteristics and optimizations for different areas of application.
For this work, we wanted to use a solution that easily integrates with the Spark service at CERN, running on YARN/Hadoop cluster (more recently also running Spark on Kubernetes using cloud resources). GPUs and other HW accelerators are not yet available for Spark deployments at CERN, so we also wanted a solution that could scale on CPUs. Moreover, we wanted to use Python/PySpark and standard APIs for processing the neural network: notably Keras API in this case.
Those reasons combined with our ongoing collaboration with Intel in the context of CERN openlab has led us to successfully test and use BigDL and Analytics Zoo for distributed model training in this work. BigDL and Analytics Zoo are open source projects distributed under the Apache 2.0 license. They provide a distributed deep learning framework for Big Data platforms and are implemented as libraries on top of Apache Spark. They allow users to write large-scale deep learning applications as standard Spark programs, running on existing Spark clusters. Notably, they allow users to work with models defined with Keras and Tensorflow APIs.
BigDL provides data-parallelism to train models in a distributed fashion across a cluster using synchronous mini-batch Stochastic Gradient Descent. Data are automatically partitioned across Spark executors as an RDD of Sample: an RDD of N-Dimensional array containing the input features and label. Distributed training is implemented as an iterative process, thus there will be multiple iterations over the same data. Reading data from disk multiple times is slow, for this reason, BigDL exploits the in-memory capabilities of Spark to cache the train RDD in the memory of each worker allowing faster access during the iterations (this also means that sufficient memory needs to be allocated for training large datasets). More information on BigDL architecture can be found in the BigDL white paper.
Notebooks with the code used for training the DL models are:

Figure 5: Snippets of code illustrating key operations for deploying distributed model training in Spark using BigDL and Analytics Zoo.

The resulting classifiers show very good performance figures for the three models tested, and reproduce the findings of the original research paper. The fact that the HLF classifier performs well, despite its simplicity, can be justified by the fact that we are putting a considerable physics knowledge into the the definition of the 14 high level features. This conclusion is further reinforced by additional tests, where we have built a topology classifier using a "random decision forest" and a "gradient boosting" (XGBoost) model, trained using the high level features dataset. In both cases we have obtained very good performance of the classifiers, just close to what has been achieved with the HLF classifier model.
In contrast, the fact that the particle sequence classifier performs better than the HLF classifier is remarkable, because we are not putting any a priori knowledge into the particle sequence classifier: the model is taking just a list of particles as input. Further improvements are seen by combining the HLF with the particle sequence classifier. In some sense, the GRU layer is identifying important features and physics quantities out of the training data, rather than using knowledge injected via feature engineering.

Figure 6: (Right) ROC curve and AUC for the three models tested. Good results are obtained that are consistent with the findings in the original research paper. (Left) Training and validation loss (categorical cross entropy) plotted as a function of iteration number for HLF classifier. Training convergence is obtained at around 50 epochs, depending on the model.

Figure 7: The confusion matrix for the "Inclusive classifier" model after training. Note that the off-diagonal elements are below 5% and the diagonal elements are close to 95%.

Workload and performance

Two Spark clusters have been used to develop and run the workload, the first is a development/test cluster, the second is a production consisting of 52 nodes, with the following resources:  1800 vcores, 14 TB of RAM, 9 PB of storage. It is a shared general-purpose YARN/Hadoop cluster build on commodity hardware, installed with Linux (CentOS 7). Only a fraction of the resources of the cluster was used when executing the pipeline, moreover jobs of other nature were running on the system currently.

Data ingestion and event filtering is a very resource-demanding step of the pipeline. Processing the original data set of 4.4 TB took ~3 hours, using 50 executors with 8 cores each, for a total of 400 cores allocated to the Spark job.
Monitoring of the workload showed that the job was almost completely CPU-bound. The large majority of the CPU cycles are spent executing Python code, that is outside of the Spark JVM. The fact that we chose to process the bulk of the training data using Python UDF functions mapped on RDDs is the main cause of the use of such large amount of CPU resources, as this a well-known "performance gotcha" in current versions of Spark. Notably, many CPU cycles are "wasted" in data serialization and deserialization operations, going back and forth from JVM and Python, in the current implementation.
Additional work on improving the performance of the event filtering has introduced the use of Spark SQL to implement some of the filters. Notably, we have made use of Spark SQL Higher Order Functions, a specialized category of SQL functions, introduced in Spark from version 2.4.0, that allow to improve processing for nested data (arrays). For our workload this has introduced the benefit of running a significant fraction of the filtering operations fully inside the JVM, optimized by the Spark code for DataFrame operations. The result is that the data ingestion step, optimized with Spark SQL and higher order functions, runs in ~2 hours (was 3 hours in the implementation that uses only Python UDF).
Future work on the pipeline may further address the issue of reducing the time spent in serialization and deserialization when using Python UDF, for example, we might decide to re-write the critical parts of the ingestion code in Scala. However, with the current training data size (4.4 TB), the performance of the data ingestion step, of just a couple of hours, is acceptable.

Performance of hyperparameter tuning with grid search: grid search runs multiple training jobs in parallel, therefore it is expected to scale well when run in parallel, as it was the case in our work (see also the paragraph on hyperparameter tuning). This is confirmed by measurements, as shown in Figure 8, by adding executors, i.e. the number of models trained in parallel, the time required to scan all the parameters decreases, as expected.

Figure 8: Performance of the hyperparameter tuning with grid search shows near-linear scalability as expected by this embarrassingly parallel process.

Performance of distributed training with BigDL and Analytics Zoo is also important for this exercise, as faster training time means improved productivity of the data scientist/physicists who will typically perform many experiments on the models and data. Figure 9 shows some preliminary results of the training speed for the HLF classifier, tested on the development cluster, using batch size of 32 per worker. You can notice there very good scalability behavior at the scale tested. Additional measurements, using 20 executors, with 6 cores each (for a total of 120 allocated cores), using batch size of 128 per worker, have shown training speed for the HLF classifier of the order of 100K rows/sec, sustained for the training of the full dataset. We have been able to train the model in less than 5 minutes, running for 12 epochs (each epoch processing 3.4 million training events). The batch size has an impact on the execution time and on the accuracy of the training. We found that a batch size of 128 for the HLF classifier is a good compromise for speed, which a batch size of 32 is slower but gives slightly better results, for example for producing publication plots. Future work will address more thorough investigation of the performance and scalability of the training step.

Figure 9: Performance of the training step for the HLF classifier on the development cluster. Very good scalability has been found at this scale.

An important point to keep in mind when training models with BigDL, is that the RDDs containing the features and labels datasets need to be cached in the executors' JVM memory. This can be a significant amount of memory. Figure 10 shows the RDDs cached by Spark block manager when training the model "Inclusive classifier" using BigDL.

Figure 10: Datasets cached using Spark block manager when training the "Inclusive Classifier" model using BigDL. More than 200 GB of RAM are used across the Spark cluster to cache the training and test data.

The training dataset size and the model complexity of the HLF classifier are of relatively small scale, making the HLF classifier suitable for training also on desktop computers. However, the Particle sequence classifier and the Inclusive classifier have a much higher complexity and require processing hundreds of GB of training data. Figure 11 shows the amount of CPU consumed during the training of the Particle Sequence classifier using BigDL/Analytics-zoo on a Spark cluster, using 70 executor and training with a batch size of 32. Notably, Figure 11 illustrates the fact that the training has lasted for 9 hours and has utilized the equivalent of 200 CPUs for the whole duration of the training. The executor CPU measurement has been performed using Spark monitoring instrumentation feeding into an InfluxDB backed and visualized with Grafana, as described in this post.

Figure 11: Measured aggregated CPU utilization of Spark executors during the training of the Particle Sequence classifier.

Conclusions, lessons learned and plans

We have successfully implemented the full data pipeline for training a "topology classifier" for event filtering at LHC experiments, using deep neural networks as detailed in an original research paper using tools and platforms from the Big Data and machine learning ecosystem in the High Energy Physics domain.
Apache Spark has been used as the main engine for the pipeline, we have appreciated the flexibility and the ease of use of Spark's API in the Python environment.
The integration of PySpark with Jupyter notebooks provides a user-friendly environment for data processing and exploration.
Additional key integrations with Spark have made this work possible, in particular, the spark-root library allows Spark to read data in ROOT format and the Hadoop-XRootD connector, which integrates Spark with CERN EOS storage system.
BigDL and Analytics Zoo have allowed us to easily scale out training using familiar Keras APIs and using the Spark clusters currently available at CERN.
We have identified areas for future performance improvements, notably in the data ingestion and filtering step where we chose to use Python functions on Spark RDDs for the current implementation.
Future work and plans include further exploration of the workload at scale, including running on cloud resources and investigating the use of HW accelerators for DL training. Another area we plan to explore in future work is to implement a streaming data pipeline with model inference, as the natural next step that follows this work.

Code for this work:
Presentation at Spark Summit 2019: Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies


This work has been performed in the context of the Spark, Hadoop and Streaming services at CERN IT and has profited of the support and consultancy of the engineers in the service. The primary author of this work is Matteo Migliorini, who worked on it during his stay at CERN in 2018 under the supervision of Luca Canali. Additional credits go to Riccardo Castellotti, Viktor Khristenko and Maria Girone of CERN openlab, to Marco Zanetti of Padua University and to the authors of the research paper Topology classification  with  deep  learning  to  improve  real-time event  selection  at  the  LHC", in particular to Maurizio Pierini and Thong Nguyen. Credits to the BigDL and Analytics Zoo team at Intel, in particular, many thanks to Jiao (Jennie) Wang and Sajan Govindan.

Tuesday, February 19, 2019

A Performance Dashboard for Apache Spark

Topic: This post dives into the steps for deploying and using a performance dashboard for Apache Spark, using Spark metrics system instrumentation, InfluxDB and Grafana.

What problem does it solve: The dashboard can provide important insights for performance troubleshooting and online monitoring of Apache Spark workloads. In particular when running Spark applications in a distributed environment (Kubernetes, YARN, etc.).

Context: Familiarity with Spark architecture built-in instrumentation, notably the Web UI. This discussion also overlaps with other monitoring tooling discussed in this blog, notably sparkMeasure.

What's new: This work profits of recent updates to Spark metrics instrumentation, notably SPARK-22190,  SPARK-25228. It also builds on top of the ideas of previous work by Hammer Lab.

Deploy using containers (updates May 2020): The repository spark-dashboard supports the installation of the Spark Performance Dashboard using containers technology. Two different installation options are packaged in the repository, use the one that suits your environment best:
  • dockerfiles -> Docker build files for a Docker container image, use this to deploy the Spark Dashboard using Docker.
  • charts -> a Helm chart for deploying the Spark Dashboard on Kubernetes.

Step 1: Understanding the architecture

Spark metrics dashboard architecture

Spark is instrumented with the Dropwizard/Codahale metrics library. Several components of Spark are instrumented with metrics, see also the Spark monitoring guide, notably the driver and executors components are instrumented with multiple metrics each. In addition, Spark provides various sink solutions for the metrics. This work makes use of the Spark graphite sink and utilizes InfluxDB with a Graphite endpoint to collect the metrics. Finally Grafana is used for querying InfluxDB and plotting the graphs (see architectural diagram below).
An important architectural detail of the metrics system is that the metrics are sent directly from the sources to the sink. This is important when running in distributed mode. Each Spark executor, for example, will sink directly the metrics to InfluxDB. By contrast, WebUI/Eventlog instrumentation uses the Spark Listener Bus and metrics flow through the driver in that case (see this link for further details). The number of metrics instrumenting Spark components is quite large. You can find a list of Spark metrics at this link

Step 2: Configuring InfluxDB

  • Download and install InfluxDB 
    • Note: tested in Feb 2019 with InfluxDB version 1.7.3
  • Edit config file /etc/influxdb/influxdb.conf
    • Enable graphite endpoint
  • Key step: Configure the templates, this instructs InfluxDB on how to map data received on the graphite endpoint to the measurements and tags in the DB
  • Optionally configure other InfluxDB parameters of interest as data location and retention
  • Start/restart InfluxDB service: systemctl restart influxdb.service

Step 3: Configuring Grafana and prepare/import the dashboard

  • Download and install Grafana 
    • Download the rpm from and start Grafana service: `systemctl start grafana-server.service`
    • Note: tested in Feb 2019 using Grafana 6.0.0 beta.
  • Alternative: run Grafana on a Docker container
  • Connect to the Grafana web interface as admin and configure
    • By default: http://yourGrafanaMachine:3000
    • Create a data source to connect to InfluxDB. 
      • Set the http URL with the correct port number, default: http://yourInfluxdbMachine:8086
      • Set the InfluxDB database name: default is graphite (no password)
    • Key step: Prepare the dashboard. 

Step 4: Preparing Spark configuration to sink metrics to graphite endpoint in InfluxDB

There are a few alternative ways on how to do this, depending on your environment and preferences. One way is to set a list of configuration parameters of the type "spark.metrics.conf.xx" another is editing the file $SPARK_CONF_DIR/
Configuration for the metrics sink need to be available to all the components being traced, as each component will connect directly to the sink. See details at Spark_metrics_config_options. Example:

$SPARK_HOME/bin/spark-shell \
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*"="graphiteEndPoint_influxDB_hostName>" \
--conf "spark.metrics.conf.*.sink.graphite.port"=<graphite_listening_port> \
--conf "spark.metrics.conf.*.sink.graphite.period"=10 \
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
--conf "spark.metrics.conf.*.sink.graphite.prefix"="lucatest" \
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"

Step 5: Test/understand the metrics for performance troubleshooting

The configuration is finished, now you are ready to test the dashboard. Run Spark using the configuration as in Step 4 and start a test workload. Open the web page of the Grafana dashboard:
  • A drop-down list should appear on top left of the dashboard page, select the application you want to monitor. Metrics related to the selected application should start being populated as time and workload progresses.
  • If you use the dashboard to measure a workload that has already been running for some time, note to set the Grafana time range selector (top right of the dashboard) to a suitable time window
  • For best results test this using Spark 2.4.0 or higher (note Spark 2.3.x will also work, but it will not populate executor JVM CPU)
  • Avoid local mode and use Spark with a cluster manager (for example YARN or Kubernetes) when testing this. Most of the interesting metrics are in the executor source, which is not populated in local mode (up to Spark 2.4.0 included). 
Dashboard view:
The following links show an example and general overview of the example dashboard, measuring a test workload. You can find there a large number of graphs and gauges, however, this is still a selection of the many available metrics in Spark instrumentation. The test workload is Spark TPCDS benchmark at scale 100 GB, running on a test YARN cluster, using 24 executors, with 5 cores and 12 GB of RAM each.

Example Graphs

The next step is to further drill down in understanding Spark metrics, the dashboard graphs and in general investigate how the dashboard can help you troubleshoot your application performance. One way to start is to run a simple workload that you can understand and reproduce. In the following you will find example graphs from a simple Spark SQL query reading a Parquet table from HDFS.
  • The query used is spark.sql("select * from web_sales where ws_list_price=10.123456").show
  • web_sales is a 1.3 TB table from the Spark TPCDS benchmark]( generated at scale 10000 GB.
  • What the query does is reading the entire table, applying the given filter and finally returning an empty result set. This query is used as a "trick to the Spark engine" to force a full read of the table and intentionally avoiding optimization, like Parquet filter pushdown.
  • This follows the discussion of Diving into Spark and Parquet Workloads, by Example
  • Infrastructure and resources allocated: the Spark Session ran on a test YARN cluster, using 24 executors, with 5 cores and 12 GB of RAM each.
Graph: Number of Active Tasks

Graph: Number of Active Tasks
One key metric when troubleshooting distributed workloads is the graph of the number of active sessions as a function of time. This shows how Spark is able to make use of the available cores allocated by the executors.

Graph: JVM CPU Usage

Graph: JVM CPU Usage*
CPU used by the executors is another key metric to understand the workload. The dashboard also reports the CPU consumed by tasks, the difference is that the CPU consumed by the JVM includes for example of the CPU used by Garbage collection and more.
Garbage collection can take an important amount of time, in particular when processing large amounts of data as in this case, see Graph: JVM Garbage Collection Time

Graph: Time components

Graph: Time components
Decomposing the run time in component run time and/or wait time can be of help to pinpoint the bottlenecks. In this graph you can see that CPU time and Garbage collection are important components of the workload. A large component of time, between the "executor run time" and the sum of "cpu time and GC time" is not instrumented. From previous studies and by knowing the workload, we can take the educated guess that this is the read time. See also the discussion at Diving into Spark and Parquet Workloads, by Example

Graph: HDFS read throughput

Graph: HDFS read throughput
Reading from HDFS is an important part of this workload. In this graph you can see the measured throughput using HDFS instrumentation exported via the Spark metrics system into the dashboard.

Lessons learned

- The dashboard provides many useful insights on Spark workload:

  • It allows to measure and visualize the time profile of resource utilization, starting from the number of concurrent tasks, to CPU, I/O and shuffle metrics.
  • Additional info is provided on time spent (or time waited) by Spark workload per component.
  • The most useful metrics for the cases I used Spark performance dashboard appear to come from the executor source.

- Understanding the dashboard graphs is still the domain of the dashboard user/performance analysis.

  • The dashboard does not directly provide root cause analysis, nor finds bottlenecks. However it helps the user to do that.
  • The dashboard is only as good as the metrics that feed it. Interference can have many forms, for example when running on highly-loaded systems.
  • This work is still experimental. For example we noticed that in certain cases metric values seem wrong and/or hard to understand (see also issues with task executor metrics described below)

- Spark task executor metrics contain very important information, but come with additional complexity

  • Task metrics are cumulative, which means that for many graphs you want to compute deltas, for example with InfluxQL this can be done with the function non_negative_derivative.
  • Task metrics values are incremented only at the end of each task. This may create an issue that affects workload graphs of jobs where task duration is greater than metric data input rate. As a workaround, tune spark.metrics.conf.*.sink.graphite.period (the suggested value in Step 2 is to set this to 10s).
  • Task metrics are updated only for successful tasks. Workload metrics of failed tasks are not visible in the task metrics graphs.

Templates for the graphite endpoint of InfluxDB are very important

  • Templates allow to structure incoming metrics data into InfluxDB tables. 
  • The proposed template values in the example have been set up by experimentation, they are still being developed: see details in Step 2 and in particular InfluxDB graphite endpoint configuration snippet.

- There are many areas in which this work can be improved both on the Apache Spark side and the metrics collection/visualization side. Some random notes on this:

  • Annotations linking Spark job information with metrics with job information would be useful to help drilling down on the workload. This for example on how to answer questions like: "I see an increase in CPU usage at time XX, which query/job/stage caused that?"  (note added, this has been since addressed, see this link).
  • Many components of the time spent/time waited are not directly instrumented in Apache Spark yet: think I/O read time. Also time and resources consumed outside the JVM (for example by Python UDF) are not instrumented
  • The Graphite endpoint does not use a password. It would be advisable to have a full InfluxDB sink in Apache Spark, including full credential connectivity support, rather than using the graphite endpoint.

CERN Spark service experience

  • We have installed and provide a shared dashboard as an experimental offering to the users of CERN Spark service (both for Spark on YARN/Hadoop and Spark on Kubernetes).
  • We have worked with Spark community to push a few improvements to the instrumentation, as discussed here, in particular exposing Spark task metrics with the Dropwizard metrics system: SPARK-22190,  SPARK-25228SPARK-25277SPARK-26890
  • We have also used the dashboard in the context of a project for data analysis at scale (up to 800 concurrent cores). In that context we have also introduced custom instrumentation/metrics to Spark and the monitoring dashboard, which has proved to be easily extensible and customizable.


You find in this post the details of how to build a performance dashboard for Apache Spark workloads. The dashboard can provide important insights for performance troubleshooting and real-time monitoring of Apache Spark workloads. The implementation on Spark metrics system integrated with InfluxDB and Grafana. In addition you can find links to details of the available metrics instrumentation, configuration details for InfluxDB and examples of Grafana dashboard and graphs that you can use a starting point for  further exploration.
We wish you a fruitful time monitoring and troubleshooting Spark, if you find some mistakes of good additions to the dashboard that you want to share, we will be happy to hear from you.

Acknowledgments: this work has been developed in the context of the data analytics services with Apache Spark, Hadoop and streaming at CERN IT, thanks to the collaboration of several colleagues in the teams. Contacts for questions about this work: and  

Friday, August 24, 2018

SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads


  SparkMeasure simplifies the collection and analysis of Spark task metrics data. It is also intended as a working example of how to use Spark listeners for collecting and processing Spark performance metrics.
The work on sparkMeasure has been previously presented in this blog with examples. Recently, an updated version of sparkMeasure (version 0.13) introduces additional integration for the PySpark and Jupyter environments, improved documentation and additional features provided by the community via PRs (many thanks to the contributors).
At CERN Spark and Hadoop service we have been using sparkMeasure in a few occasions and found it useful for understanding the performance characteristics of Spark workloads and for performance troubleshooting.

Download and deploy with examples

You can find sparkMeasure, its documentation with examples in the sparkMeasure development repository on GitHub and/or in its mirror cerndb repo.
You can deploy sparkMeasure from Maven Central or build with "sbt package". PySpark users can find the Python wrapper API on PyPI: "pip install sparkmeasure".

Use sparkMeasure for measuring interactive and batch workloads

  • Interactive: measure and analyze performance from shell or notebooks: using spark-shell (Scala), PySpark (Python) or Jupyter notebooks.
  • Code instrumentation: add calls in your code to deploy sparkMeasure custom Spark listeners and/or use the classes StageMetrics/TaskMetrics and related APIs for collecting, analyzing and saving metrics data.
  • "Flight Recorder" mode: this records all performance metrics automatically and saves data for later processing.

Documentation and examples


Architecture diagram

sparkMeasure architecture diagram

 Main concepts underlying sparkMeasure

  • The tool is based on the Spark Listener interface. Listeners transport Spark executor Task Metrics data from the executor to the driver. They are a standard part of Spark instrumentation, used by the Spark Web UI and History Server for example.
  • Metrics can be collected using sparkMeasure at the granularity of stage completion and/or task completion (configurable)
  • Metrics are flattened and collected into local memory structures in the driver (ListBuffer of a custom case class).
  • Spark DataFrame and SQL are used to further process metrics data for example to generate reports.
  • Metrics data and reports can be saved for offline analysis.

Getting started examples of sparkMeasure usage

  1. Link to an example Python_Jupyter Notebook
  2. Example notebooks on the Databricks platform (community edition): example Scala notebook on Databricks, example Python notebook on Databricks
  3. An example using Scala REPL/spark-shell:
bin/spark-shell --packages

val stageMetrics = 
stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show())
The output should look like this:
Scheduling mode = FIFO
Spark Context default degree of parallelism = 8
Aggregated Spark stage metrics:
numStages => 3
sum(numTasks) => 17
elapsedTime => 9103 (9 s)
sum(stageDuration) => 9027 (9 s)
sum(executorRunTime) => 69238 (1.2 min)
sum(executorCpuTime) => 68004 (1.1 min)
sum(executorDeserializeTime) => 1031 (1 s)
sum(executorDeserializeCpuTime) => 151 (0.2 s)
sum(resultSerializationTime) => 5 (5 ms)
sum(jvmGCTime) => 64 (64 ms)
sum(shuffleFetchWaitTime) => 0 (0 ms)
sum(shuffleWriteTime) => 26 (26 ms)
max(resultSize) => 17934 (17.0 KB)
sum(numUpdatedBlockStatuses) => 0
sum(diskBytesSpilled) => 0 (0 Bytes)
sum(memoryBytesSpilled) => 0 (0 Bytes)
max(peakExecutionMemory) => 0
sum(recordsRead) => 2000
sum(bytesRead) => 0 (0 Bytes)
sum(recordsWritten) => 0
sum(bytesWritten) => 0 (0 Bytes)
sum(shuffleTotalBytesRead) => 472 (472 Bytes)
sum(shuffleTotalBlocksFetched) => 8
sum(shuffleLocalBlocksFetched) => 8
sum(shuffleRemoteBlocksFetched) => 0
sum(shuffleBytesWritten) => 472 (472 Bytes)
sum(shuffleRecordsWritten) => 8


  • Why measuring performance with workload metrics instrumentation rather than just using time?
    • Measuring elapsed time, treats your workload as "a black box" and most often does not allow you to understand the root cause of the performance. With workload metrics you can (attempt to) go further in understanding and root cause analysis, bottleneck identification, resource usage measurement.
  • What are Apache Spark tasks metrics and what can I use them for?
    • Apache Spark measures several details of each task execution, including run time, CPU time, information on garbage collection time, shuffle metrics and on task I/O. See also this short description of the Spark Task Metrics
  • How is sparkMeasure different from Web UI/Spark History Server and EventLog?
    • sparkMeasure uses the same ListenerBus infrastructure used to collect data for the Web UI and Spark EventLog.
      • Spark collects metrics and other execution details and exposes them via the Web UI.
      • Notably Task execution metrics are also available through the REST API
      • In addition Spark writes all details of the task execution in the EventLog file (see config of spark.eventlog.enabled and spark.eventLog.dir)
      • The EventLog is used by the Spark History server + other tools and programs can read and parse the EventLog file(s) for workload analysis and performance troubleshooting, see a proof-of-concept example of reading the EventLog with Spark SQL
    • There are key differences that motivate this development:
      • sparkmeasure can collect data at the stage completion-level, which is more lightweight than measuring all the tasks, in case you only need to compute aggregated performance metrics. When needed, sparkMeasure can also collect data at the task granularity level.
      • sparkmeasure has an API that makes it simple to add instrumention/performance measurements in notebooks and application code.
      • sparkmeasure collects data in a flat structure, which makes it natural to use Spark SQL for workload data processing, which provides a simple and powerful interface
      • limitations: sparkMeasure does not collect all the data available in the EventLog, sparkMeasure buffers data in the driver memory, see also the TODO and issues doc
  • What are known limitations and gotchas?
    • The currently available Spark task metrics can give you precious quantitative information on resources used by the executors, however there do not allow to fully perform time-based analysis of the workload performance, notably they do not expose the time spent doing I/O or network traffic.
    • Metrics are collected on the driver, which can be quickly become a bottleneck. This is true in general for ListenerBus instrumentation, in addition sparkMeasure in the current version buffers all data in the driver memory.
    • Task metrics values collected by sparkMeasure are only for successfully executed tasks. Note that resources used by failed tasks are not collected in the current version.
    • Task metrics are collected by Spark executors running on the JVM, resources utilized outside the JVM are currently not directly accounted for (notably the resources used when running Python code inside the python.daemon in the case of PySpark).
  • When should I use stage metrics and when should I use task metrics?
    • Use stage metrics whenever possible as they are much more lightweight. Collect metrics at the task granularity if you need the extra information, for example if you want to study effects of skew, long tails and task stragglers.
  • What are accumulables?
    • Metrics are first collected into accumulators that are sent from the executors to the driver. Many metrics of interest are exposed via [[TaskMetrics]] others are only available in StageInfo/TaskInfo accumulables (notably SQL Metrics, such as "scan time")
  • How can I save/sink the collected metrics?
    • You can print metrics data and reports to standard output or save them to files (local or on HDFS). Additionally you can sink metrics to external systems (such as Prometheus, other sinks like InfluxDB or Kafka may be implemented in future versions).
  • How can I process metrics data?
    • You can use Spark to read the saved metrics data and perform further post-processing and analysis. See the also Notes on metrics analysis.
  • How can I contribute to sparkMeasure?
    • SparkMeasure has already profited from PR contributions. Additional contributions are welcome. See the TODO_and_issues list for a list of known issues and ideas on what you can contribute.

Acknowledgements and additional links

This work has been developed in the context of the CERN Hadoop, Spark and Streaming services and of the CERN openlab project on data analytics. Credits go to my colleagues for collaboration. Many thanks also to the Apache Spark community, in particular to the contributors of PRs and reporters of issues. Additional links on this topic are: 

Friday, September 29, 2017

Performance Analysis of a CPU-Intensive Workload in Apache Spark

Topic: This post is about techniques and tools for measuring and understanding CPU-bound and memory-bound workloads in Apache Spark. You will find examples applied to studying a simple workload consisting of reading Apache Parquet files into a Spark DataFrame.

Why are the topics discussed here relevant

Many workloads for data processing on modern distributed computing architectures are often CPU-bound. Typical servers underlying current data platforms have a large and increasing amount of RAM and CPU cores compared to what was typically available just a few years ago. Moreover the widespread adoption of compressed columnar and/or in-memory data formats for data analytics further drives data platform to higher utilization of CPU cycles and of memory bandwidth.  This is mostly good news as CPU-bound workloads in many cases are easy to scale up (in particular for read-mostly workloads that don't require expensive serialization at the application/database engine level, like locks and latches). Memory throughput is a scarse resource in modern data processing platforms (people often refer to this as a "hitting a memory wall" or "RAM is the new disk") and one of the key challenges that modern software for data processing has to handle is how to best utilize CPU cores, this often means implementing algorithms to minimize stalls due to cache misses and utilizing memory bandwidth with vector processing (e.g. using SIMD instructions).
Technologies like Spark, Parquet, Hadoop, make data processing at scale easier and cheaper. Techniques and tools for measuring and understanding data processing workloads, including their CPU and memory footprint, are important to get the best value out of your software and hardware investment. If you are a software developer you want to be able to measure and optimize your workloads. If you are working with operations you want to be able to measure what your systems can deliver to you and know what to watch out for in case you have noisy neighbors in your shared/cloud environment. Either way you can find in this post a few examples and techniques that may help you in your job.


The work described here originates from a larger set of activities related to characterizing workloads of interest for sizing Spark clusters that the CERN Hadoop and Spark service team did in Q1 2017. In that context working with Spark and data files in columnar format, such as Parquet and also with Spark and ROOT (ROOT is a key data format for high energy physics) is very important for many use cases from CERN users. Some of the first qualitative observations we noticed when running test workloads/benchmarks on Parquet tables were:
  • Workloads reading from Parquet are often CPU-intensive
  • It is important to allocate enough memory to the Spark executors, often more than we thought at the start of the tests
  • Our typical test machines (dual socket servers) would process roughly 1 to 2 GB/sec of data in Parquet format. 
  • The throughput of simple workloads would scale well horizontally on the cluster nodes.
In this post I report on follow-up tests to drill down with measurements on some of the interesting details. The goal of the post is also to showcase methods of investigation and tools of interest for troubleshooting CPU-bound workloads that I believe are of general interest for data-intensive platforms.

This post

In fist part of this post you will find an example and a few drill-down lab tests about profiling and understanding CPU-intensive Spark workloads. You will see examples of profiling the workload of a query reading from Parquet by measuring systems resources utilization using Spark metrics and OS tools, including tools for stack profiling and flame graph visualization and tools for dynamic tracing.
In the second part, you will see how the test workload scales up when it is executed in parallel by multiple executor threads. You will see how to measure and model the workload scalability profile. You will see how to measure latency, CPU instruction and cycles (IPC), memory bandwidth utilization and how to interpret the results.

Lab setup

The workload for this test is a simple query reading from a partitioned table in Apache Parquet format. The test table used in this post is WEB_SALES, a partitioned table of ~68 GB, part of the TPCDS benchmark (scale 1500GB). The setup instructions for the TPCDS database on Spark can be found at: "Spark SQL Performance Tests".
The tests have been performed using Spark 2.2.0, running in local mode. The test machine is a dual socket server with 2 Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (10 cores per CPU, 20 cores in total). The test server has 512 GB of RAM (16 DIMMs DDR4 of 32GB each). It is important for the tests/labs described here to have enough memory to cache the test table and to provide enough RAM for the executors (128 GB of RAM in the test machine is probably the minimum if you want to reproduce the results at the scale described in this post). The storage of the test machine is provided by just two local disks, this is undersized compared to but it is not matter much for the results of the tests, as the test workload is CPU-bound. The server used for the tests described here was installed with Linux, RHEL 6.8.
Some of the labs described in post make reference to Spark task metrics, in particular collected using a tool called sparkMeasure, described in this post. Also relevant are additional examples and lab setup instructions as discussed in "Diving into Spark and Parquet Workloads, by Example".

Part 1 - Defining and understanding the test workload

  • Reading from Parquet is CPU-intensive
  • You need to allocate enough heap memory for your executors or else the JVM Garbage Collector can become resource-hungry
  • Measure key metrics and profile Spark jobs using Spark instrumentation as task metrics and OS tools to measure CPU, I/O, network.

The first observation that got me started with the investigations in this post is that reading Parquet files with Spark is often a CPU-intensive operation. A simple test to realize this is by reading a test table using a Spark job running with just one task/core and measure the workload using Spark. This is the subject of the first experiment:

Lab 1: Experiment with reading a table from Parquet and measure the workload

The idea for this experiment is to run a Spark job to read a test table stored in Parquet format (see lab setup details above). The job just scans through the table partitions with a filter condition that is written in such a way that it is never true for the given test data, this has the effect of generating I/O to read the entire table, producing a minimal amount of overhead for processing and returning the result set (which will be empty). From more details of how this "trick" works, see the post "Diving into Spark and Parquet Workloads, by Example".
One more prerequisite, before starting the test, is to clean the file system cache, this ensure reproducibility of the results if you want to run it the test multiple times:

# drop the filesystem cache before testing
# echo 3 > /proc/sys/vm/drop_caches

# startup Spark shell with additional instrumentation using spark-measure
$ bin/spark-shell --master local[1] --packages --driver-memory 16g

// Test workload
// Run a SQL statement that forces reading the entire table"/data/tpcds_1500/web_sales").createOrReplaceTempView("web_sales")
val stageMetrics =
stageMetrics.runAndMeasure(spark.sql("select * from web_sales where ws_sales_price=-1").collect())

The output of the Spark instrumentation is a series of metrics for the job execution. The metrics are collected by Spark executors and handled via Spark listeners (see additional details on sparkMeasure, at this link). The measured metrics for the test workload on Lab 1 are:

Spark Context default degree of parallelism = 1
Aggregated Spark stage metrics:
numStages => 1
sum(numTasks) => 787
elapsedTime => 465430 (7.8 min)
sum(stageDuration) => 465430 (7.8 min)
sum(executorRunTime) => 463966 (7.7 min)
sum(executorCpuTime) => 325077 (5.4 min)
sum(executorDeserializeTime) => 905 (0.9 s)
sum(executorDeserializeCpuTime) => 917 (0.9 s)
sum(resultSerializationTime) => 14 (14 ms)
sum(jvmGCTime) => 5793 (6 s)
sum(shuffleFetchWaitTime) => 0 (0 ms)
sum(shuffleWriteTime) => 0 (0 ms)
max(resultSize) => 1067344 (1042.0 KB)
sum(numUpdatedBlockStatuses) => 0
sum(diskBytesSpilled) => 0 (0 Bytes)
sum(memoryBytesSpilled) => 0 (0 Bytes)
max(peakExecutionMemory) => 0
sum(recordsRead) => 1079971165
sum(bytesRead) => 72942144585 (67.9 GB)
sum(recordsWritten) => 0
sum(bytesWritten) => 0 (0 Bytes)
sum(shuffleTotalBytesRead) => 0 (0 Bytes)
sum(shuffleTotalBlocksFetched) => 0
sum(shuffleLocalBlocksFetched) => 0
sum(shuffleRemoteBlocksFetched) => 0
sum(shuffleBytesWritten) => 0 (0 Bytes)
sum(shuffleRecordsWritten) => 0

What you can learn from this test and the collected metrics values:
  • The workload consists in reading one partitioned table, the web_sales table part of the TPCDS benchmark. The table size is ~68 GB (it has ~1 billion rows), generated as described above in "Lab setup".
  • From the metrics you can see that the query ran in about 8 minutes, with only one active task at a time. Note also that "executor time" = "elapsed time" in this example.
  • Most of the execution time was spent on CPU. Other metrics about time measures (serialization/deserialization, garbage collection, shuffle), although of importance in general for Spark workloads, can be ignored in this example as their values are very close to zero.
  • The difference between between the elapsed time and the CPU time can be interpreted as the I/O read time (more on this later in this post). The estimated value of the time spent on I/O amounts to ~ 30% of the total execution time. Further in this post you can see how to directly measure I/O time and confirm this finding.
  • We conclude that the workload is mostly CPU-bound 

Lab 2: focus on the CPU time of the workload

This experiment runs the same query as experiment 1, with one important change: data is read entirely from file system cache in this second experiment. This happens naturally in a test system with enough memory (assuming that no other workload of importance runs there). Reading data from filesystem cache brings to zero the time spend waiting for the slow response time of the HDD and provides an important boost to I/O performance: this is as if the system was attached to fast storage. The test machine used did not have fast I/O, however in a real-world data system we may expect to have a fast storage subsystem, as a SSD or NVMe or a network-based filesystem over 10 GigE. This is probably a more realistic scenarios than reading from a local disk and further motivates the tests proposed here, where I/O is served from filesystem cache to decouple the measurements from HDD response time. The results for the key metrics in the case of reading the test Parquet table from the filesystem cache (using the same SQL as in Lab 1) are:

elapsedTime => 316575 (5.3 min)
sum(stageDuration) => 316575 (5.3 min)
sum(executorRunTime) => 315270 (5.3 min)
sum(executorCpuTime) => 310535 (5.2 min)
sum(executorDeserializeTime) => 867 (0.9 s)
sum(executorDeserializeCpuTime) => 874 (0.9 s)
sum(resultSerializationTime) => 13 (13 ms)
sum(jvmGCTime) => 4961 (5 s)
sum(recordsRead) => 1079971165
sum(bytesRead) => 72942144585 (67.9 GB)

What you can learn from this test:
  • The workload response time i Lab 2 isalmost entirely due to time spent on CPU by the executor.
  • With some simple calculation you can find that one single Spark task/core on the test machine can process ~220 MB/s of data in Parquet format (this is computed diving the "bytes Read" / "executor Run Time").
Note: Spark task workload metrics still report reading ~68 GB of data in this experiment, which is way correct, as data comes in via the OS read call. Data is actually being served by OS cache, so no physical I/O is happening to serve the read, a fact unknown to Spark executors. You need OS tools to find this, which is the subject on the next lab.

Lab 3: Measure Spark workload with OS tools

This is about using OS tools to measure systems resource utilization for the test workload. By contrast, the Spark workload metrics you have seen so far in the results of experiments 1 and 2 are measured by the Spark executor itself. If you are interested in the details, you can see the code of how this is done in Executor.scala, for example, the CPU time metric is collected using ThreadMXBean and the method getCurrentThreadCpuTime(), Garbage collection time comes from the GarbageCollectorMXBean and the method getCollectionTime().
This experiment is about measuring CPU usage with OS tools and comparing the results with the metrics reported by Spark Measure (as in the tests of Lab 1 and 2). There are multiple tools you can use to measure process CPU usage in Linux, ultimately with instrumentation coming from /proc filesystem. This is an example of using the Linux comman ps:

# use this to find pid_Spark_JVM
$ jps | grep SparkSubmit 

# measure cpu time 
$ ps -efo cputime -p <pid_Spark_JVM>

... run test query in the spark-shell as in Lab 2 and run again:

$ ps -efo cputime -p <pid_Spark_JVM>

Just take the difference of the 2 measurements of CPU time. The results for the test system and workload (Spark SQL query reading from Parquet) is:

CPU measured at OS level: 446 seconds

What you can learn from this test:

  • By comparing the CPU measurement with OS tools (this Lab)  with the result of  Lab 2: sum(stageDuration) => 316575 and sum(executorCpuTime) => 310535 you can conclude that the executor consumes in reality more CPU than reported by Spark metrics.
  • This is because additional threads are run in the JVM besides the task executor. Note: in these tests only 1 task is executed at a time, this is done on purpose for ease of troubleshooting. 

What about I/O activity ? There are multiple tools for measuring I/O, the following example uses iotop to measure I/O detailed by OS thread

Results: By running "iotop -a" on the test machine (the option -a is to collect cumulative metrics) while the workload of Lab 1 runs (after dropping  the filesysytem cache) I have confirmed that indeed 68 GB of data are read from the disk. In the case of Lab 2, reading a second time the same table, therefore using the filesystem cache, "iotop -a" confirms that the amount of data physically read from the disk device is 0 in this case. Additional measurements to confirm that the I/O goes through the filesystem cache can be done using tracing tools. As an example I have used cachestat from perf-tools. Note that RATIO = 100% in the reported measurements means that the tool has measured that I/O was being served by filesystem cache:

# bin/cachestat 5
Counting cache functions... Output every 5 seconds.
  281001        0      205   100.0%          215      70648
  280151        2        8   100.0%          215      70648

Dynamic tracing: Measuring the latency of I/O calls is a good task for dynamic tracing tools, like SystemTap, bcc/BPF, DTrace. Earlier in this post, I commented that in Lab 1 the query execution spent 70% of its time on CPU and 30% of the time was spent on I/O (reading from Parquet files): the CPU time is a direct measurement reported via Spark metrics, the time spent on I/O is inferred from subtracting the total executor time from CPU time.
It is possible to directly measure the time spent on I/O using a SystemTap script  to measure I/O latency of the "read" calls. An example of the script used for this test is:

# stap read_latencyhistogram_filterPID.stp -x <pid_JVM> 60

Using this I have confirmed with a direct measurement that I/O time (reading from Parquet files) is indeed responsible for 30% of the workload time as described in Lab 1. In general dynamic tracing are powerful tools for investigating beyond standard OS tools. See also this link for a list of useful OS tools.

Lab 4: Drilling down with stack profiling and flame graphs

Stack profiling and on-CPU Flame Graph visualization are useful techniques and tools for investigating CPU workloads, see also Brendan Gregg's page on Flame Graphs.
Stack profiling helps in understanding and drilling-down on "hot code":
you can find which parts of the code spend considerable amount of time on CPU and this provides insights for troubleshooting.
Flame graph visualization of the stack profiles brings additional value, besides being an appealing interface, it provides context about the running code. In particular flame graphs provide information on the parent functions, this can be used to match the names of the functions consuming CPU resources and their parents with the application architecture/code. There are other methods to represent stack profiles, including reporting a summary or representing it in a tree view. Nitsan Wakart in "Exploring Java Perf Flamegraphs" explains the advantages of using flame graphs and compares a few of the available tools for collecting stack profiles for the JVM. See also notes and examples of how to use tools for stack sampling and flame graph visualization for Spark at this link.
The figure here below is a flame graph of the test workload of Lab 2, stack frame captured using async-profiler, in addition the SVG version of the flamegraph is available at this link 

What you can learn from the flame graph:

  • The flame graph shows the stack traces of the parts of the code that take most of the time. Code paths for execution of Spark data sources and "Vectorized Parquet Reader" are clearly occupying most of the trace samples (i.e. are taking most of the CPU time). 
  • You can see also that the execution is under "whole stage code generation", another important feature of Spark 2.x (see also the discussion in the blog entry "Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs".

There are many interesting details to explore in the flame graph, for this it is best to use SVG version with the zoom feature (click on the graph to zoom) and the search feature (use Control-F).
See also the exercise at the end of Lab 5 and the memory heap profile in Lab 6.

    Lab 5:  Measure the impact using of compression

    The Parquet files used in the previous tests are compressed using snappy compression. This is the default compression algorithm used when writing Parquet files in Spark: you can check this by running spark.conf.get("spark.sql.parquet.compression.codec"). Compression/decompression takes CPU cycles, this exercise is about measuring how much of the workload time is due to decompression of the Parquet files. A simple test is to create a copy of the table without compression and then run the test SQL against it. The table copy can be created with the following code:

    val df = spark.table("web_Sales")

    The next step is running a SQL statement to read the entire table (similarly to previous Labs, this time, however, using the non-compressed table data). As in the case of Lab 2 above the test is done after caching the table in the file system cache, which happens naturally on a system with enough memory after running the query a couple of times. The metrics collected with sparkMeasure when running the test are:

    Scheduling mode = FIFO
    Spark Context default degree of parallelism = 1
    Aggregated Spark stage metrics:
    numStages => 1
    sum(numTasks) => 796
    elapsedTime => 259021 (4.3 min)
    sum(stageDuration) => 259021 (4.3 min)
    sum(executorRunTime) => 257667 (4.3 min)
    sum(executorCpuTime) => 254627 (4.2 min)
    sum(executorDeserializeTime) => 858 (0.9 s)
    sum(executorDeserializeCpuTime) => 896 (0.9 s)
    sum(resultSerializationTime) => 9 (9 ms)
    sum(jvmGCTime) => 3220 (3 s)
    sum(shuffleFetchWaitTime) => 0 (0 ms)
    sum(shuffleWriteTime) => 0 (0 ms)
    max(resultSize) => 1076538 (1051.0 KB)
    sum(numUpdatedBlockStatuses) => 0
    sum(diskBytesSpilled) => 0 (0 Bytes)
    sum(memoryBytesSpilled) => 0 (0 Bytes)
    max(peakExecutionMemory) => 0
    sum(recordsRead) => 1079971165
    sum(bytesRead) => 81634460367 (76.0 GB)
    sum(recordsWritten) => 0
    sum(bytesWritten) => 0 (0 Bytes)
    sum(shuffleTotalBytesRead) => 0 (0 Bytes)
    sum(shuffleTotalBlocksFetched) => 0
    sum(shuffleLocalBlocksFetched) => 0
    sum(shuffleRemoteBlocksFetched) => 0
    sum(shuffleBytesWritten) => 0 (0 Bytes)
    sum(shuffleRecordsWritten) => 0

    Learning points
    • The query execution time is ~ 4.3 minutes and the non-compressed test table size is 76.0 GB (~ 12% larger than the snappy-compressed table).
    • The workload is CPU-bound and the execution time reduction compared to Lab 2 is about 20%.
    • The query execution time is faster on the non-compressed table than in the case of Lab 2, using the snappy-compressed table. This can be explained as fewer CPU cycles are spent processing the table in this case, as decompression is not needed. The fact that the non-compressed table is larger (of about 12%) does not appear to offset the increase in speed in this case.
    • Decompression is an important part of the SQL workload used in this post, but only a fraction of the computing time is spend on decompression, this part of the workload appears not to be responsible for the bulk of the CPU time used.
    Comment: this test is consistent with the general finding that snappy is a lightweight algorithm for data compression and suitable for working with Parquet files for data analytics in many cases.

    Exercise: the fact that the use of snappy-compresses files is responsible only for a small percentage of the test workload can also be predicted by analyzing the flame graph shown in Lab 4. To do this search (using CONTROL-F) for "decompress" on the SVG version of the flame graph. You will find that a few methods of the flame graph containing the keyword "decompress" are highlighted. You should find that only a fraction of the time is spent in decompression activities, in agreement with the conclusions of this Lab.

    Lab 6: Measuring the effect of garbage collection

    Heap memory management is an important part of the activity of the JVM. Systems resource utilization for Garbage Collection workload can also be non-negligible and overall show up as considerable overhead to the processing time. The experiments proposed so far have been run with a "generously sized" heap  (i.e. --driver-memory 16g) to steer away from such issue. In the next experiment you will see what happens when reading the test table using the default values for  spark-shell/spark-submit, that is --driver-memory 1g.
    Results of selected metrics when reading the test Parquet table with one executor and --driver-memory 1g:

    Spark Context default degree of parallelism = 1
    Aggregated Spark stage metrics:
    numStages => 1
    sum(numTasks) => 787
    elapsedTime => 469846 (7.8 min)
    sum(stageDuration) => 469846 (7.8 min)
    sum(executorRunTime) => 468480 (7.8 min)
    sum(executorCpuTime) => 304396 (5.1 min)
    sum(executorDeserializeTime) => 833 (0.8 s)
    sum(executorDeserializeCpuTime) => 859 (0.9 s)
    sum(resultSerializationTime) => 9 (9 ms)
    sum(jvmGCTime) => 163641 (2.7 min)

    Additional measurement of the CPU time used using OS tools (ps -efo cputime -p <pid_of_SparkSubmit>) report CPU time = 2306 seconds.

    Lessons learned:
    • The execution time of the test SQL statement (reading a Parquet table) increases significantly when the heap memory is decreased to the default value of 1g. Spark metrics report that additional 2.7 minutes are spent on garbage collection time in this case. The garbage collection time is added to the CPU time used for table data processing, with the net effect of increasing the query execution time of ~50%.
    • The additional processing for garbage collection also consumes extra CPU cycles in the system. Spark metrics report that the CPU time used by the executor does not increase (it stays at about 300 seconds, as in the case of previous tests), however OS tools show that the JVM process overall takes ~2300 seconds of CPU time, which represents a staggering  7-fold overhead on the amount of CPU used by the executor (300 seconds). Additional investigations show that the extra CPU time is used by multiple threads in the JVM related to the extra work needed by garbage collection.

    Heap memory usage can be traced using async-profiler and displayed as a heap memory flame graph (see also the discussion of on-CPU flame graphs in Lab 4). The resulting heap memory flame graph for the test workload is reported in the figure here below. You can see there that Parquet reader code allocates large amounts of heap memory. Data have been collected using "./ -d 315 -m heap -o collapsed -f myheapprofile.txt <pid_Spark_JVM>", see also this link.
    Follow this link for the SVG version of the heap memory flame graph.

    Lab 7: Measuring IPC and memory throughput

    The experiments discussed so far show that the test workload (reading a Parquet table) is CPU-intensive. In this paragraph you will see how to further drill down on what this means.
    Measuring CPU utilization at the OS-level is easy, for example you have seen earlier in this post that Spark metrics report executor CPU time, also you can use OS tools measure CPU utilization (see Lab 3 and also this link). What is more complicated is to drill down in the CPU workload and understand what the CPU cores do when they are busy.
    Brendan Gregg has covered the topic of measuring and understanding CPU workloads in a popular blog post called "CPU Utilization is Wrong". Other resources with many details are "A Crash Course in Modern Hardware" by Cliff Click and Tanel Poders's blog posts on "RAM is the new disk". Overall the subject appears quite complex, with many details that are CPU-model-dependent. I just want to stress a few key points relevant to the scope of this post.
    One of the key points is that it is important to measure and understand the efficiency of the CPU utilization and being able to find answers to question that drill down in CPU workload like: are CPU cores executing  instructions as fast as the CPU cores allow, if not why? How many cycles are being "wasted" stalling and in particular waiting for cache misses on memory  I/O? Are branch misses important for my workload? etc.
    Ideally, we would like to be able to measure the time the CPU spends in different areas of its activity (instruction execution, stalling etc), such a direct measure is currently not available to my knowledge, however modern CPUs have instrumentation in the form of hardware counters ("Performance Monitoring Counters", PMC, that allow to further drill down beyond the CPU utilization numbers you can get from the OS.
    Instructions and cycles are two key metrics, often reported as a ratio instructions over cycles, IPC or the reverse, cycles over instructions CPI, to understand if CPU cores are mostly executing instructions or stalling (for example for cache misses). On Linux, "perf" is a reference tool to measure PMCs. Perf can also measure many more metrics, including cache misses, which are an important cause of stalled CPU time. Metrics measured with perf for the test workload of reading a Parquet table (as in Lab 2 above) are reported here below (measured using "perf stat"):

    # perf stat -a -e task-clock,cycles,instructions,branches,branch-misses -e stalled-cycles-frontend,stalled-cycles-backend -e cache-references,cache-misses -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses -p <pid_Spark_JVM>

     Performance counter stats for process id '12345':

         345151.182659      task-clock (msec)         #    1.100 CPUs utilized
       748,844,357,817      cycles                    #    2.170 GHz                      (38.60%)
     1,259,538,420,042      instructions              #    1.68  insns per cycle          (46.28%)
       158,154,239,473      branches                  #  458.217 M/sec                    (46.31%)
         1,628,794,907      branch-misses             #    1.03% of all branches          (46.33%)
       <not supported>      stalled-cycles-frontend
       <not supported>      stalled-cycles-backend
        28,214,731,947      cache-references          #   81.746 M/sec                    (46.34%)
         6,756,246,371      cache-misses              #   23.946 % of all cache refs      (46.35%)
         4,967,291,822      LLC-loads                 #   14.392 M/sec                    (30.69%)
         1,346,955,145      LLC-load-misses           #   27.12% of all LL-cache hits     (30.80%)
         7,974,695,379      LLC-stores                #   23.105 M/sec                    (15.41%)
         1,627,016,422      LLC-store-misses          #    4.714 M/sec                    (15.39%)
       360,746,862,614      L1-dcache-loads           # 1045.185 M/sec                    (23.09%)
        31,109,549,485      L1-dcache-load-misses     #    8.62% of all L1-dcache hits    (30.77%)
       170,609,291,683      L1-dcache-stores          #  494.303 M/sec                    (30.74%)
       <not supported>      L1-dcache-store-misses

         313.666825089 seconds time elapsed

    Comments on IPC measurements with perf stat: 

    • The metrics collected by perf show that the workload is CPU-bound: there is only one task executing in the system and the equivalent of one CPU core is reported as utilized on average during the job execution. To be more precise, the measurements show that the utilization is about 10% higher than 1 CPU-core equivalent, an overhead that is consistent with the finding discussed earlier on measuring CPU utilization with OS tools. 
    • Relevant to this discussion are the measurement of cycles and instructions: number of instructions ~ 1.3E12 and number of cycles ~7.5E11. The ratio of the two, instructions per cycle: IPC ratio ~1.7
    • Using Brendan Gregg's heuristics, the measured IPC of 1.7 can be interpreted as being generated by an instruction-bound workload. Later in this post you will find additional comments on this in the context of concurrent execution.
    • Note: each core of the CPU model used in this test (Intel Xeon CPU E5-2630 v4) can retire up to 4 instructions per cycle. You can expect to see IPC values greater than 1 (up to 4) for  compute-intensive workloads, while IPC values are expected to drop to values lower than 1 for memory-bound processes that stall frequently trying to read from memory. 

    Comments on cache utilization and cache misses:

    • In the metrics reported for the perf stat output you can see a large number of load and store operations on CPU caches. The test workload is about data processing (reading and Parquet files and applying a filter condition), so it makes sense that data access is an important part of what the CPUs need to process.
    • L1 caches utilization in particular, appears to be quite active in serving a large number of requests: the sum of L1-dcache-stores and L1-dcache-loads mounts to almost 50% of the value of "instructions". Operations that access L1 cache are relatively fast, so you can expect them to cause stalls of short duration (the actual details vary as different CPUs models may have various optimizations in place to reduce the impact of stalls). Note also the value of L1-dcache-load-misses, these are reported to be below 10% of the total loads, however their impact can still be significant.
    • In the reported metrics you can also find details on the number of LLC (last level cache) loads, stores and most importantly the number of LLC misses. LLC misses are expensive operations as they require memory access. 

    Further drill down on CPU-to-DRAM access

    The measurements with perf stat show that the test workload is instruction-intensive, but has also an important component of CPU-to-memory throughput. You can use additional information coming from hardware counters to further drill down on memory access. A few handy tools are available for this: see this link for additional information on tools to measure memory throughput in Linux.
    In the following I have used Intel's performance counter monitor (pcm) tool to measure the test workload. This is a snippet of the output of pcm when measuring the test workload (note the full output is quite verbose and contains more information, see this link for an example):

    # ./pcm.x <measurement_duration_in_seconds>

    Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

                   QPI0     QPI1    |  QPI0   QPI1
     SKT    0      191 G    191 G   |    3%     3%
     SKT    1      169 G    169 G   |    3%     3%

    Total QPI outgoing data and non-data traffic:  721 G
    MEM (GB)->|  READ |  WRITE | CPU energy | DIMM energy
     SKT   0    333.67    469.09     6516.63     2697.37
     SKT   1    279.68    269.33     10062.19     4048.73
    -------------------------------------------------------------------         *    613.36    738.42     16578.81     6746.10

    Key points and comments on CPU-to-memory throughput measurements:
    • The measured data transfer to and from main memory is 613.36 GB (read) + 738.42 GB (write) =  ~1350 GB of total "CPUs-memory traffic". 
    • The execution time of the tests is ~310 seconds (not reported in the data of this Lab, see Lab 2 metrics). 
    • The measured average throughput between CPU and memory for the job is ~ 4.3 GB/s
    • For comparison the CPU specs at list 68.3 GB/s as the maximum memory bandwidth for the CPU model used in this test.
    • The CPUs used for these tests use the Intel Quick Path Interconnect QPI to access memory. There are 4 links in total in the system (2 links per socket). The measured QPI utilization is ~3% in this test. 
    • "Low memory throughput": For the current CPU model it appears that the measured throughput of 4 GB/s and QPI utilization of 3% can be considered as low utilization of the CPU-to-memory channel. I have reached this tentative conclusion by comparing with a memory-intensive workload on the same system generated with the Stream memory test by John D. McCalpin (see also more info at this link). In the case of "Stream memory test" run using only one thread, I could measure QPI utilization of ~12% and and measure CPU-to-memory throughput of about 14 GB/s. This is about 3-4 time more than with the test workload of the Labs in this post and therefore appears to further justify the statement that memory throughput is low for the test workload (reading a Parquet table).
    • "Front-end and back-end workload": It is interesting to compare the figures of DRAM throughput with the measured data processing throughput as seen from the users and similarly, how much of the original table data in Parquet is processed vs the amount of memory moved between DRAM and CPU. There is a factor 20 in volume between the original table size and the amount of memory processed: The measured throughput between CPU and memory is ~4.3 GB/s, the measured throughput of Parquet data processing is 220 MB/s. The size of the test table in Parquet is 68 GB, while roughly 1.3 TB of data have been transferred between memory and CPU processed during the workload (read + write activity). 

    • From measurements of the CPU hardware counters it appears that the test workload is CPU-bound and instruction-intensive, however it also has an important component of data transfer between CPU and main memory. More on this topic will be covered later in this post in the analysis of the workload at scale.
    • The efficiency of the code used to process Parquet data appears to have room for optimizations, in particular the amount of data processed in memory is measured to be more than 1 order of magnitude larger than the size of the Parquet table being processed.
    Recap of lessons learned in Part 1 of this post
    • Reading from Parquet is a CPU-intensive activity. 
    • You need to allocate enough heap memory for your executors to avoid large overheads with the JVM garbage collector.
    • The test system can process ~ 200 MB per second of Parquet data on a single core for the given test workload. The CPU workload is mostly instruction-bound, the utilization of the channel CPU-memory is low. 
    • There are many tools you can use to measure Spark workloads: Spark task metrics and OS tools to measure CPU, I/O, network, memory, stack profiling with flame graph visualization and dynamic tracing tools.

    Part 2: Scaling up the workload

    • The test workload, reading from a Parquet partitioned table, scales up on a 20-core server. 
    • The workload is instruction-bound up to 10 concurrent tasks, at higher load it is limited by CPU-to-memory bandwidth.
    • CPU utilization metrics can be misleading, drill down by measuring instruction per cycle (IPC) and memory throughput. Beware of hyper-threading.

    Lab 8: Measuring elapsed time at scale

    In the experiments described so far you have seen how the test workload (reading from a Parquet partitioned table) runs on Spark using a single task and how to measure it in some simple and controlled cases. Distributed data processing and Spark are all about running tasks concurrently, this is the subject of the the rest of this post.
    The test machine I used has 20 cores (2 sockets, with 10 cores each, see also Lab setup earlier in this post). The results of running the workload with increasing number of concurrent tasks is summarized in the following table:

    Concurrent Tasks (N#) Elapsed time (millisec)
    1 315816
    2 151787
    3 102726
    4 76959
    19 20638
    20 19917

    Key takeaway point from the table:

    • As the number of concurrent tasks deployed increases, the job duration decreases, as expected. 
    • The measurements show that elapsed time drops from about 5 minutes, in the case of no parallelism, down to 20 seconds, when 20 CPU cores are fully utilized. 
    • The next question is to quantify how well the workload scales on multiple CPU cores, which is the subject of the next Lab.

    Note on the measurement method: measurements in the top scale of the number of concurrent tasks shows an important amount of time spent on garbage collection (jvmGCTime) in their first execution. In subsequent executions the garbage collection time is gradually reduced to just a few seconds. The measurement reported here are collected after a few "warm up runs" of the query (typically around 3 "warm up" runs before measuring). For similar reason the parameter --driver-memory is "oversize" to a high value (typically 32g * num_executors).

    Lab 9: How well does the workload scale?

    A useful parameter to study scalability is the "Speedup": if you execute your job with "p" (as in parallel) concurrent processes, Speedup(p) is defined as the response time of the job at concurrency p divided the response time of the job run with p=1. If you prefer a formula: Speedup(p) = R(1)/R(p).
    The expected behavior for the speedup of a perfectly scalable job will show as a straight line in a graph of speedup vs. number of concurrent tasks: the amount of work is directly proportional to the number of workers in this case. Any bottleneck and/or serialization effect will cause the graph of the speedup to eventually "bend down" and lay below the "ideal case" of linear scalability. See also how the speedup can be modeled using Amdahl's law.
    The following table and graph show the measurements of Lab 8 (query execution time for reading the test table of 68 GB) transformed into speedup values vs. N# of executors:

    p=N# executors
    Elapsed time (s)
    1 311302 1.00
    2 149673 2.08
    3 100194 3.11
    4 75547 4.12
    5 60931 5.11
    6 51342 6.06
    7 44165 7.05
    8 39613 7.86
    9 35450 8.78
    10 32061 9.71
    11 29436 10.58
    12 27735 11.22
    13 25764 12.08
    14 24021 12.96
    15 23217 13.41
    16 22092 14.09
    17 21016 14.81
    18 20647 15.08
    19 19999 15.57
    20 19989 15.57

    Lessons learned:
    • The speedup grows almost linearly, at least up to about 10 concurrent tasks
    • This can be interpreted as a sign that the workload scales and does not encounter bottlenecks at least up to the measured scale. 
    • At higher load you can see that the speedup reaches a saturation zone. Drilling down on this is the subject of the next Lab.

    Lab 10: CPU utilization, instructions, cycles and memory at scale

    CPU utilization is a metric that does not reveal many insights about the nature of the workload. Measuring CPU instructions and cycles helps in understanding if the workload is instruction-bound or memory-bound. Also measuring the throughput between CPU and memory is important for data-intensive workloads as the one testes here.
    The following graph reports the measurement of the CPU-to-memory throughput for the test workloads measured using Intel's pcm (see also this link for more info).

    CPU-to-memory traffic measurement at high load - Here below some key metrics measured in the test with 20 concurrent tasks executing the test workload:
    • The measured throughput of data transfer between CPU and memory is about 84 GB/s (55% of which is writes and 45% reads). 
    • Another important metric is the utilization of the QPI links, there are 4 links in total in the system (2 links per socket). The measured QPI utilization is ~65% for the test with 20 concurrent processes.

    Comparing with benchmarks - The question I would like to answer is: I have measured 84 GB/s of CPU-to-memory throughput, is that a high value for my test system? To approach the problem I have used a couple of benchmark tools to generate high load on the system: John D. McCalpin's stream memory test and Intel memory latency latency checker, see further details on those tools at this link. The test system during the stress tests has reached up to 100 GB/s of CPU-to-memory throughput with the benchmark tools. By comparing the benchmark values with the measured throughput of 84 GB/s I am tempted to conclude that the test workload (reading a Parquet table) stresses the system towards saturation levels of the CPU-to-memory channel at high load (with 20 cores busy executing tasks).

    IPC measurement under load - Another measure hinting that the CPUs are spending an increasing amount of time in stalled cycles at high load, comes from the measurement of instruction per cycle (IPC). The number of instructions for the test workload (reading 68 GB of a test Parquet table) stays constant at about 1.3E12 instructions at various load levels (this number can be measured with perf for example). However, the number of cycles needed to process the workload increases with load. Measurements on a the test system are reported in the following figure:

    Comments on IPC measurements - By comparing the graph of job speedup and the graph of IPC, you can see that the drop in IPC corresponds to the inflection point in the scalability graph, around a number of concurrent tasks ~10. This is a good hint that serialization mechanism are in place to limit the scalability when the load is higher than 10 concurrent tasks (for the test system used).
    Additional notes on the measurements of IPC at scale:
    - The measurements reported are an average of the IPC values measured over the duration of the job, there is more structure to it when looking at a finer time scale, which is not reported here.
    - The measured amount of data transferred between CPU and memory shows a slow linear increase with number of concurrent tasks (from ~1300 GB at N# tasks=1 to 1.6 TB at N# tasks=20).
    Lessons learned: 
    • The test workload in this post (reading a partitioned Parquet table) when run with concurrently executing tasks, scales up linearly to about 10 concurrent tasks, then the scalability appears affected by the higher utilization of the CPU-memory channels. 
    • The highest throughput of the test with Spark SQL measured with 20 concurrent tasks is  ~3.4 GB/s of Parquet data processed and this correspond to ~84 GB/s throughput between CPUs and memory.
    • As noted in Lab 7, to process 68 GB of parquet data in the test workload used in this post, the system has accessed more than 1 TB of memory. Similarly, at high load  the throughput observed at the user-end is of about 3.4 GB/s RAM while the system throughput  at systems level at 84 GB/s. The high ratio of useful work done compared to the system load hints to possible optimizations in the Parquet reader.

    Lab 11: Understanding CPU metrics with hyper-threading and pitfalls when measuring CPU workloads on multi-tenant systems

    In this section I want to drill down on a few pitfalls when measuring CPU utilization. The first one is about CPU Hyper-Threading.
    The server used for the tests reported here has hyper-threading activated. In practice this means that the OS sees "40 CPU" (see for example /proc/cpuinfo), while the actual number of cores in the system is 20. In the previous Labs you have seen how the test workload scales up to 20 concurrent tasks, the number 20 being the number of cores in the system. In that "load regime" the OS appeared to be able to schedule the tasks and utilize the available CPU cores to scale up the workload (the test system was dedicated to the tests and did not have any other relevant load). The following test is about scaling up to 40 concurrent tasks and comparing with the case of 20 concurrent tasks.

    Metric 20 concurrent tasks 40 concurrent tasks
    Elapsed time 20 s 23 s
    Executor run time 392 s 892 s
    Executor CPU Time 376 s 849 s
    CPU-memory data volume 1.6 TB 2.2 TB
    CPU-memory throughput 85 GB/s 90 GB/s
    IPC 1.42 0.66
    QPI utilization 65% 80%

    Lessons learned:

    • The response time of the test workload does not improve when going from 20 concurrent tasks up to 40 concurrent tasks. 
    • The measured SQL elapsed time stays basically constant, actually it increases slightly for the chosen workload. The increase in elapsed time (i.e. a reduction of throughput) can be correlated with the extra work in CPU-to-memory data transfer, that grows with the number of threads utilized by the load (this is also reported in the measurements table).
    • The OS scheduler runs 40 concurrent processes/tasks, corresponding to the number of threads in the CPUs, however this does not improve the performance compared to the case of 20 concurrent processes/tasks.
    • When CPU utilization goes above 20 concurrent processes, some of the processes are scheduled to run on threads that have to share the same physical cores (up to 2 threads per core in the CPU used in the test system). The end result is that the system behaves as if it was running on slower CPUs (roughly twice slower moving from load= 20 concurrent tasks to load=40 concurrent tasks).
    • Gotcha: process utilization and measured CPU time are difficult metrics to interpret and even more so when using hyper-threading, this can be an important point to keep in mind when troubleshooting workloads at high load and/or on multi-tenant systems utilizing shared resources where the additional load can come from other "noisy neighbors".

    In the previous paragraphs you have seen that when the number of concurrent tasks goes above the number of available CPU cores in the system, the throughput of the test workload does not increase, even though the OS reports an increasing utilization of the CPU threads.
    What happens when the number of concurrent tasks keeps increasing and goes above the number of available CPU threads in the system?
    This is another case of running on a system with high load and it is a valid case, possibly because you are running on a system shared with other users creating additional load. Here is an example, when running the workload with 60 concurrent tasks (the test system has 40 CPU threads):

    Metric 20 concurrent tasks 40 concurrent tasks 60 concurrent tasks
    Elapsed time 20 s 23 s 23 s
    Executor run time 392 s 892 s 1354 s
    Executor CPU Time 376 s 849 s 872 s
    CPU-memory data volume 1.6 TB 2.2 TB 2.2 TB
    CPU-memory throughput 85 GB/s 90 GB/s 90 GB/s
    IPC 1.42 0.66 0.63
    QPI utilization 65% 80% 80%

    Key points: 
    • At high load the job latency stays constant with increasing load, roughly to the value reached when running with 20 concurrent tasks (that is about 20 seconds, 20 concurrent tasks corresponds to one task per core in the test system).
    • The amount of utilized CPU seconds does not increase above the value measured at 40 concurrent tasks. This is explained as the OS can schedule work on 40 "logical CPUs" corresponding to 40 possible concurrent execution threads in the CPUs using hyper-threading.
    • When the number of concurrent tasks is higher than the number of available CPU threads (40) the measured CPU utilization time does not increase, while the sum of the elapsed time on the executor processes keeps increasing. The additional executor time measured accounts for time spent waiting in OS scheduler queues.
    • Gotcha: The amount of useful work that the system does when executing the test workload at 20, 40 or 60 concurrent tasks is very similar. In all cases the workload is CPU-bound, with high throughput to memory. The measured metrics values CPU time and executor run time can be misleading at high load for the reasons just explained. It is useful to bear in mind this type of examples when troubleshooting workloads on shared/multi-tenant systems, where additional load on the system can come from sources external to your jobs.


    CPU-bound workloads  are very important in data processing. In particular modern hardware has typically abundance of cores and memory. Technologies like Spark, Parquet, Hadoop, make data processing at scale easier and cheaper. Techniques and tools for measuring and understanding data processing workloads, including their CPU and memory footprint, are important to get the best value out of your software and hardware investment.
    Reading from Parquet is a CPU-bound workload examined in this post. Some of the lessons learned investigating the test workloads are:

    • Spark metrics can be used to understand resource usage and workload behavior, with some limitations.
    • Use OS-tools to measure CPU, network, disk. Dynamic tracing can drill down on  latency. Stack profiling and flame graphs are useful to drill down on hot parts of the code
    • CPU utilization does not always give the full picture: use hardware counters to drill down, including instructions and cycles (IPC) and CPU-to-memory throughput. 
    • Measure your workload, speedup, IPC, CPU utilization, as function of increasing number of concurrent tasks to understand its scalability profile and bottlenecks.
    • Watch out for common pitfalls when measuring CPU metrics on multi-tenant systems, in particular be aware if Hyper-threading is used.
    Having good tools is key for troubleshooting, here some links relevant to this post:

    - sparkMeasure - a tool to work with Spark Executor task metrics
    - OS tools - complement Spark metrics with measurements with OS tools
    - Stack profiling and flame graph visualization to drill down on hot pieces of code

    Relevant articles and presentations:

    - Brendan Gregg's blog post CPU Utilization is Wrong
    - The presentation: A Crash Course in Modern Hardware by Cliff Click
    Brendan Gregg's page on Flame Graphs
    Nitsan Wakart: Using FlameGraphs To Illuminate The JVM by , Exploring Java Perf Flamegraphs
    - For related topics in the Oracle context, see Tanel Poders's blog post "RAM is the new disk"
    Intel 64 and IA-32 Architectures, Optimization Reference Manual
    Intel Tuning Guides and Performance Analysis Papers


    This  work has been done in the context of CERN Hadoop and Spark service and has profited from the collaboration of several of my colleagues at CERN.