Friday, December 12, 2025

CPU Performance Testing Made Simple with test_cpu_parallel

🚀 Quickly measure and load test your CPUs

How fast is your CPU really, right now?
In modern environments, that simple question is surprisingly hard to answer, and it matters for performance, cost, and scalability.

Between virtual machines, containers, cloud instances, power-management policies, and opaque hardware specifications, CPU performance often feels like a black box. Traditional benchmarking suites can help, but they are usually heavy, slow, and overkill when all you need is a quick, practical measurement.

This post shows a fast and lightweight way to measure real CPU (and memory) performance using a simple tool:

test_cpu_parallel


🛠 A simple way to measure CPU speed

test_cpu_parallel is a lightweight Rust tool that runs configurable CPU and memory workloads and gives you comparable performance numbers quickly. 

It creates multiple worker threads and times how long they take to complete a fixed amount of CPU (or memory) work. Try it:

# Run with docker or podman:
docker run lucacanali/test_cpu_parallel /opt/test_cpu_parallel -w 2 


This runs a synthetic CPU workload on 2 threads and returns the job completion time. You can directly compare these numbers across machines, architectures, or configurations.
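
To make the measurement model concrete, here is a minimal Python sketch of the same idea; it is not the tool's actual code (the tool is written in Rust), and the worker count and loop size are arbitrary illustration values:

# Minimal Python sketch of the measurement idea (the real tool is written in Rust)
# Each worker runs a fixed CPU-bound loop; we time the whole batch
import time
from multiprocessing import Pool  # processes, so the work runs truly in parallel

def cpu_burn(n_iterations: int) -> float:
    acc = 0.0
    for i in range(1, n_iterations + 1):
        acc += i ** 0.5  # a fixed amount of compute per worker
    return acc

if __name__ == "__main__":
    num_workers = 2               # analogous to the -w option
    work_per_worker = 20_000_000  # arbitrary illustration value
    start = time.time()
    with Pool(num_workers) as pool:
        pool.map(cpu_burn, [work_per_worker] * num_workers)
    print(f"{num_workers} workers finished in {time.time() - start:.2f} sec")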

Disclaimer: this post is not about precise measurements or rigorous CPU benchmarks, which would be very hard to get right. CPU workloads are not uniform: some are memory-intensive, others compute-intensive, and others dominated by high I/O rates to/from fast disks or the network. In general, benchmarks that try to reduce CPU performance to a single "speed number" can be misleading, and this work is no exception.


📈 Compare performance across systems

(Old laptop vs new laptop, on-prem vs cloud, etc.)

How does the performance of your new laptop compare to the old one or to a cloud VM you are using? To test scalability or compare architectures:

./test_cpu_parallel --num_workers 8 --full -o results.csv

This collects performance test runs for 1 through 8 threads and saves the results to a CSV file.

The results can be used to analyze performance and how it scales as you increase the number of parallel threads on the CPU. The speedup curve (see Figure 1) is particularly useful for finding the saturation point of your system; a short sketch of how to compute it from the CSV output follows the list below.

  • Does performance scale linearly at low thread counts?
  • Where does it flatten?
  • Does hyperthreading help or hurt?
  • Where does CPU saturation occur?
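
As a reference: since each worker performs the same fixed amount of work, a natural definition of speedup is speedup(N) = N * T(1) / T(N), where T(N) is the job completion time with N concurrent threads. Here is a minimal Python sketch of computing it from the CSV output; the column names are assumptions for illustration, so check the header of your results.csv:

# Minimal sketch: compute speedup(N) = N * T(1) / T(N) from the CSV output
# Column names ("num_workers", "job_run_time") are assumptions; check your CSV header
import pandas as pd

df = pd.read_csv("results.csv")
t = df.groupby("num_workers")["job_run_time"].mean()  # mean job time per load level
for n, t_n in t.items():
    print(f"{n:3d} workers -> speedup {n * t.loc[1] / t_n:5.2f}")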

I’ve used this to compare everything from small laptops to 128-core servers — and the curves quickly reveal when performance is real vs. just advertised.

Figure 1: The speedup curve is a way to plot CPU performance data that highlights where the CPU system saturates as the load increases. The higher the curve goes, the more CPU cycles you can get from the machine. More details on how to interpret this graph are available in this notebook.


☁️ Benchmarking cloud CPUs you don’t control

Cloud environments hide a lot from you:
Which CPU model? How many real cores? Are you being throttled?

test_cpu_parallel gives you a black-box diagnostic:

  • Measure how performance scales with threads
  • Check whether doubling the threads doubles the throughput
  • Detect oversubscription or noisy neighbors
  • Catch “burstable CPU” limits or throttling

If the speedup curve flattens early, something’s not right, and now you know.

 

🔄 How many CPU cores can you really use?

Say the Linux command lscpu reports:

  • 16 CPUs
  • 8 physical + 8 hyperthreads

Does that mean you can scale a CPU-intensive workload by 16x? By 8x? Something in between? What do those numbers actually mean for performance?

Measure and find out how far your system really scales by running:

./test_cpu_parallel --num_workers 16 --full

This is quite useful for:

  • CI/CD runners whose specs change unexpectedly
  • Cloud VMs with unclear provisioning
  • Detecting misconfigured containers or Kubernetes limits
  • Creating reliable performance baselines

📊 Advanced: Plotting Speedup Graphs

From the CSV output, it's easy to create speedup plots or efficiency curves. I've used this to:

  • Compare different generations of CPU architectures
  • Profile performance saturation points
  • Provide performance regression baselines over time

See the example Jupyter notebooks with measurements and plots to get started.
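
As a starting point, here is a minimal matplotlib sketch of a speedup plot, using the same assumed column names as in the sketch above (the linked notebooks show the full analysis):

# Minimal sketch: plot the measured speedup against ideal linear scaling
# Column names are assumptions; see the linked notebooks for the real analysis
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")
t = df.groupby("num_workers")["job_run_time"].mean()
speedup = [n * t.loc[1] / t_n for n, t_n in t.items()]

plt.plot(t.index, speedup, marker="o", label="measured speedup")
plt.plot(t.index, t.index, linestyle="--", label="ideal linear scaling")
plt.xlabel("Number of concurrent worker threads")
plt.ylabel("Speedup")
plt.legend()
plt.savefig("speedup_curve.png")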


Examples:

Q1: How fast are your CPUs, today?

Run this from the CLI; it will run a test and measure the CPU speed:

# Run with docker or podman:
docker run lucacanali/test_cpu_parallel /opt/test_cpu_parallel -w 1 


Alternatively, download the binary and run it locally (for Linux x64):

wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/test_cpu_parallel/test_cpu_parallel
chmod +x test_cpu_parallel
./test_cpu_parallel -w 1 


The tool runs a compute-intensive loop on a single CPU thread (configured with the option -w 1) and reports the execution time. It runs for about one minute, depending on your CPU speed.

You can then compare the run time across multiple platforms and/or at different times of day to see if there are variations.

Example output:

$ docker run lucacanali/test_cpu_parallel /opt/test_cpu_parallel -w 1

test_cpu_parallel - A basic CPU workload generator written in Rust
Use for testing and comparing CPU performance [-h, --help] for help

Starting a test with num_workers = 1, num_job_execution_loops = 3,
worker_inner_loop_size = 30000, full = false, output_file = ""

Scheduling job batch number 1
Scheduled running of 1 concurrent worker threads
Job 0 finished. Result, delta_time = 44.49 sec
Scheduling job batch number 2
Scheduled running of 1 concurrent worker threads
Job 0 finished. Result, delta_time = 44.47 sec
Scheduling job batch number 3
Scheduled running of 1 concurrent worker threads
Job 0 finished. Result, delta_time = 44.49 sec

CPU-intensive jobs using num_workers=1 finished.
Job runtime statistics:
Mean job runtime = 44.49 sec
Median job runtime = 44.49 sec
Standard deviation = 0.01 sec


More info on the options, features, and limitations of the testing tool at: Test_CPU_parallel_Rust on GitHub

---


Q2: How many CPU cores can I use?

A key advantage of using Test_CPU_parallel_Rust is that it provides an easy way to test multi-core and multi-CPU systems.

For example, if you have two or more cores available on your system, you can run the tool with the option -w 2 and the load will run on two concurrent threads. Expect it to complete in about the same time as the single-worker example; if that's not the case, you probably don't have 2 cores fully available.


# Run with docker or podman, configure the number of concurrently running threads with -w option
docker run lucacanali/test_cpu_parallel /opt/test_cpu_parallel -w 2 


Bonus points: Keep increasing the load with the option -w to see when you reach saturation.  

What can you discover? Possibly that the system reports N cores available but cannot actually sustain N concurrent threads at full speed, for example because you have been given a mixture of physical cores and "logical cores" (cores that come from CPU hyperthreading).


Run on Kubernetes (see also: Doc on how to run on K8S):

# Run on a Kubernetes cluster
kubectl run test-cpu-pod --image=lucacanali/test_cpu_parallel --restart=Never -- /opt/test_cpu_parallel -w 2


Q3: How to scale up the CPU load to saturation

When testing with increasing CPU load, we expect CPU utilization to reach saturation at some point. Test_cpu_parallel provides an easy way to scale up the CPU load: it runs multiple CPU load tests with an increasing number of concurrent threads and measures the time for each run. This can be used to measure when the CPUs reach saturation. Saturation is expected when the number of concurrently running threads gets close to the number of available (physical) CPU cores in the system.


# Full mode will test all the values of num_workers from 1 to the value set with --num_workers
test_cpu_parallel --full --num_workers 8


The exercise described in the blog post CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers shows how to use test_cpu_parallel to compare the relative performance of different CPU types.

Two key metrics to analyze the performance are (see the sketch after this list):

  • Throughput: how many jobs per minute can the server run as the load increases, and at saturation?
  • Speedup: how well does the system scale? When running 2 concurrent threads, do I get twice the work done, or is there some overhead? As the load scales up, when do I reach saturation?
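
Here is a minimal Python sketch of deriving the throughput metric from the CSV output (same caveat as earlier: the column names are assumptions, check your CSV header):

# Minimal sketch: throughput in jobs per minute at each load level
# With N workers, N fixed-size jobs complete every T(N) seconds
import pandas as pd

df = pd.read_csv("results.csv")
t = df.groupby("num_workers")["job_run_time"].mean()
for n, t_n in t.items():
    print(f"{n:3d} workers -> {n * 60.0 / t_n:6.1f} jobs/min")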

Another interesting metric is single-threaded speed: the CPU speed you get when the system is lightly loaded. This is relevant on a laptop or desktop, for example, whereas on servers we would typically see systems highly loaded most of the time.


Lessons learned: The case of the large box 

Testing a large server with 128 cores using test_cpu_parallel.

The results show that the box's throughput increases as the number of concurrent threads increases, but it reaches saturation much earlier than the expected value of 128. There can be various reasons for that, including thermal dissipation causing CPU cores to throttle. The lesson: don't make assumptions about how your CPUs scale, measure them!

Figure 2: The speedup curve obtained from test data taken with test_cpu_parallel shows how CPU performance scales as the number of concurrent CPU-bound executions increases. For a 32-core machine, we see saturation around 32 concurrent jobs, as expected. The surprise was the 128-core server, which did not perform as expected and showed saturation already at 80 concurrent tasks.


Conclusions

Modern CPUs, especially in cloud and virtualized environments, rarely behave the way specs suggest. Core counts, clock speeds, and instance types don’t tell you how much performance you actually get under load.

test_cpu_parallel focuses on one simple idea: measure, don’t guess.

It’s not a full benchmarking suite, but a fast, practical tool to:

  • sanity-check CPU and memory performance,
  • see how well a system really scales with parallel load,
  • detect early saturation, throttling, or misconfigured resources,
  • compare machines, VMs, containers, or environments with minimal effort.

The key takeaway is straightforward: assumptions about CPU scaling are often wrong. A quick measurement and a speedup curve will usually tell you more than any spec sheet.

Link to the project GitHub page