When More Cores Means Less Speed: Debugging PyTorch with Valgrind on ARM
If you’ve ever tried to debug a PyTorch program on an ARM64 system using
Valgrind, you might have stumbled on something really odd and found yourself
asking: “Why does it take so long?”. And if you’re like us, you would probably
try to run it locally, on a Raspberry Pi, to see what’s going on… And the
madness begins!
TL;DR, as you probably figured out from the title of this post, it’s a
counter-intuitive experience: the more cores your machine has, the slower your
(torch) code seems to run under Valgrind. Shouldn’t more cores mean more speed?
Let’s dive into why that’s not always the case ;)
The background
In an effort to improve our testing infrastructure for
vAccel and make it more robust, we
started cleaning up our examples, unifying the build & test scripts, and
adding more elaborate test cases for both the library and the plugins.
Valgrind provides quite a decent experience for this, especially for catching
multi-arch errors, memory leaks, and dangling pointers (something quite common
when writing in C :D).
The issue
While adding the Valgrind mode of execution in our tests for the vAccel
plugins, we noticed something really weird in the
Torch
case. The test was taking forever!

Specifically, while the equivalent amd64 run was taking roughly 4 and a half
minutes (Figure 1), the arm64 run was taking nearly an hour (53 minutes) –
see Figure 2.

Debugging
The first thing that came to mind was that something was wrong with our
infrastructure. We run self-hosted GitHub runners, with custom container images
that support the relevant software components we need for each plugin/case. We
run those on our infra, a set of VMs running on top of diverse low-end
bare-metal machines, both amd64 and arm64. The arm64 runners run on a
couple of Jetson AGX Orins, with 8 cores and 32GB of RAM.
And what’s the first thing to try (especially when debugging on arm64)? A
Raspberry Pi, of course!
Getting the runner container image on a Raspberry Pi 5 with 8GB of RAM, spinning up the container, and building the library and the plugin all took roughly 10 minutes. And then we were ready for the test:
# ninja run-examples-valgrind -C build-container
ninja: Entering directory `build-container'
[0/1] Running external command run-examples-valgrind (wrapped by meson to set env)
Arch is 64bit : true
[snipped]
Running examples with plugin 'libvaccel-torch.so'
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==371== Memcheck, a memory error detector
==371== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==371== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==371== Command: /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==371==
2025.07.10-20:48:01.91 - <debug> Initializing vAccel
2025.07.10-20:48:01.93 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:48:01.93 - <debug> Config:
2025.07.10-20:48:01.93 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:48:01.93 - <debug> log_level = debug
2025.07.10-20:48:01.93 - <debug> log_file = (null)
2025.07.10-20:48:01.93 - <debug> profiling_enabled = false
2025.07.10-20:48:01.93 - <debug> version_ignore = false
2025.07.10-20:48:01.94 - <debug> Created top-level rundir: /run/user/0/vaccel/ZpNkGT
2025.07.10-20:48:47.87 - <info> Registered plugin torch 0.2.1-3-0b1978fb
[snipped]
2025.07.10-20:48:48.07 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:48:53.18 - <debug> Downloaded: 2.4 KB of 13.7 MB (17.2%) | Speed: 474.96 KB/sec
2025.07.10-20:48:54.93 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 2.01 MB/sec
2025.07.10-20:48:54.95 - <debug> Download completed successfully
2025.07.10-20:48:55.04 - <debug> session:1 Registered resource 1
2025.07.10-20:48:56.37 - <debug> session:1 Looking for plugin implementing torch_jitload_forward operation
2025.07.10-20:48:56.37 - <debug> Returning func from hint plugin torch
[snipped]
CUDA not available, running in CPU mode
Success!
Result Tensor :
Output tensor => type:7 nr_dims:2
size: 4000 B
Prediction: banana
[snipped]
==371== HEAP SUMMARY:
==371== in use at exit: 339,636 bytes in 3,300 blocks
==371== total heap usage: 1,779,929 allocs, 1,776,629 frees, 405,074,676 bytes allocated
==371==
==371== LEAK SUMMARY:
==371== definitely lost: 0 bytes in 0 blocks
==371== indirectly lost: 0 bytes in 0 blocks
==371== possibly lost: 0 bytes in 0 blocks
==371== still reachable: 0 bytes in 0 blocks
==371== suppressed: 339,636 bytes in 3,300 blocks
==371==
==371== For lists of detected and suppressed errors, rerun with: -s
==371== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3160 from 3160)
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==376== Memcheck, a memory error detector
==376== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==376== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==376== Command: /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==376==
2025.07.10-20:54:37.78 - <debug> Initializing vAccel
2025.07.10-20:54:37.80 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:54:37.80 - <debug> Config:
2025.07.10-20:54:37.80 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:54:37.80 - <debug> log_level = debug
2025.07.10-20:54:37.80 - <debug> log_file = (null)
[snipped]
2025.07.10-20:55:30.78 - <debug> Found implementation in torch plugin
2025.07.10-20:55:30.78 - <debug> [torch] Loading model from /run/user/0/vaccel/zazTtc/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
2025.07.10-21:01:14.77 - <debug> [torch] Prediction: banana
classification tags: banana
[snipped]
2025.07.10-21:01:23.92 - <debug> Unregistered plugin torch
==376==
==376== HEAP SUMMARY:
==376== in use at exit: 341,280 bytes in 3,304 blocks
==376== total heap usage: 3,167,523 allocs, 3,164,219 frees, 534,094,402 bytes allocated
==376==
==376== LEAK SUMMARY:
==376== definitely lost: 0 bytes in 0 blocks
==376== indirectly lost: 0 bytes in 0 blocks
==376== possibly lost: 0 bytes in 0 blocks
==376== still reachable: 0 bytes in 0 blocks
==376== suppressed: 341,280 bytes in 3,304 blocks
==376==
==376== For lists of detected and suppressed errors, rerun with: -s
==376== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3161 from 3161)
+ set +x
Note: We’ll talk about the suppressions a bit later.
The test took roughly 13 minutes. At this point, we were scratching our heads.
Why would a high-end Jetson Orin, with way more cores and RAM, perform so much
worse under Valgrind than a humble Raspberry Pi? Time to dig deeper into what’s
really going on under the hood…
The Surprise
When the results came in, the numbers were still striking: the same
Valgrind-wrapped Torch test that took almost an hour on our Jetson Orin
finished in just 13 minutes on the Raspberry Pi. The Pi, with far less RAM and
CPU muscle, still managed to outperform the Orin by a wide margin under these
specific conditions.
This result was the definition of counter-intuitive. Everything we know about hardware says the Orin should wipe the floor with the Pi. Yet, here we were, staring at the Pi’s prompt, wondering if we’d missed something obvious.
Digging Deeper: What’s Really Happening?
So, what’s going on? Why does a high-end, multi-core ARM system get crushed by
a humble Pi in this scenario? The answer lies at the intersection of Valgrind,
multi-threaded workloads, and the quirks of the ARM64 ecosystem.
Thread Count: The Double-Edged Sword
Modern CPUs, especially high-end ARM chips like the Orin, have lots of cores, and frameworks like PyTorch are eager to use them all. By default, PyTorch will spawn as many threads as it thinks your system can handle, aiming for maximum parallelism.
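You can check what the default actually is on a given machine; a one-liner sketch, assuming the PyTorch Python wheel is installed (the C++ API exposes the same value through at::get_num_threads()):

```sh
# Prints PyTorch's intra-op thread pool size; by default this tracks
# the number of cores the runtime detects.
python3 -c "import torch; print(torch.get_num_threads())"
```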
But Valgrind, which works by instrumenting every memory access and
synchronizing thread activity to catch bugs, doesn’t scale gracefully with
thread count. In fact:
- Each additional thread multiplies Valgrind’s overhead. More threads mean more context switches, more synchronization, and more internal bookkeeping.
- On platforms where Valgrind’s threading support is less mature (like aarch64), this overhead can balloon out of control.
- On the Raspberry Pi, with its modest core count, PyTorch only spawns a handful of threads. But on the Orin, with many more cores, PyTorch ramps up the thread count, and Valgrind’s overhead explodes (an effect you can measure directly; see the sketch below).
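A quick way to see this on your own box is to sweep the thread count and time the identical Memcheck run. This is a minimal sketch, with the binary from the test above and its arguments abbreviated as `...`; the exact timings will obviously vary:

```sh
# Sweep the thread count and time the same Memcheck run; on a
# high-core-count arm64 box, wall-clock time tends to grow with $n
# instead of shrinking. Replace "..." with the full arguments from
# the run above.
for n in 1 2 4 8; do
    echo "== OMP_NUM_THREADS=$n =="
    export OMP_NUM_THREADS=$n
    time valgrind --leak-check=full \
        /home/runner/artifacts/bin/torch_inference ... > /dev/null 2>&1
done
```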
The ARM64 Valgrind Quirk
The arm64 port of Valgrind is still catching up to its amd64 sibling in
terms of optimizations and robustness. Some operations, especially those
involving threads and memory, are simply slower to emulate and track on arm64.
This compounds the thread explosion problem, making high-core-count systems
paradoxically slower under Valgrind.
Dealing with library suppressions on arm64 with Valgrind
In practice, when running applications that rely on specific libraries under
Valgrind on arm64 systems, developers frequently encounter a barrage of
memory-related warnings and errors. Many of these issues are not actual bugs in
your code, but rather artifacts of how these libraries manage memory
internally, or limitations in Valgrind’s emulation on such architectures.
For instance, OpenSSL is known for its custom memory management strategies. It
often allocates memory statically or uses platform-specific tricks, which can
confuse Valgrind’s memory checker. For example, you might see reports of
“still reachable” memory or even “definitely lost” memory at program exit.
In reality, much of this memory is intentionally held for the lifetime of the
process—such as global tables or the state for the random number generator.
These are not leaks in the conventional sense, but Valgrind will still flag
them, especially if you run with strict leak checking enabled.
On arm64 platforms, the situation can be further complicated. Valgrind may
not fully emulate every instruction used by the specific library. This can lead
to false positives, such as uninitialized value warnings, or even more dramatic
errors like SIGILL (illegal instruction) if Valgrind encounters an
unsupported operation.
It’s not uncommon to see a flood of warnings that are, in practice, harmless or simply not actionable unless you’re developing for that specific library itself.
To manage this noise and focus on real issues in our application, we use
Valgrind’s suppression mechanism. Suppression files allow us
to tell Valgrind to ignore specific known issues, so we can zero in on genuine
bugs in our own code.
Suppression entries are typically matched by library object names, so on
arm64 we use patterns like obj:/usr/lib/aarch64-linux-gnu/libssh.so* or
obj:*libc10*.so*, obj:*libtorch*.so*.
An example suppression snippet (valgrind.supp) looks like the following:
[...]
{
   suppress_libtorch_leaks
   Memcheck:Leak
   match-leak-kinds: reachable,possible
   ...
   obj:*libtorch*.so*
}
{
   suppress_libtorch_overlaps
   Memcheck:Overlap
   ...
   obj:*libtorch*.so*
}
[...]
It’s important to note that not all problems can be suppressed away. For
example, if Valgrind encounters a truly unsupported instruction and throws a
SIGILL, a suppression file won’t help; you may need to update Valgrind or avoid
that code path. Still, for the majority of benign memory warnings from OpenSSL
or Torch, well-crafted suppressions keep our Valgrind output manageable
and meaningful.
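If you need to bootstrap such a suppression file, Valgrind can generate candidate entries for you. A minimal sketch, reusing the torch_inference example from above (arguments abbreviated as `...`):

```sh
# Print a ready-to-paste suppression entry for every error reported
# (they go to stderr along with the error output); curate the
# interesting ones into valgrind.supp afterwards.
valgrind --leak-check=full --show-leak-kinds=all \
         --gen-suppressions=all \
         /home/runner/artifacts/bin/torch_inference ... 2> candidates.supp
```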
Debug Symbol Overhead
Another factor: large binaries with lots of debug symbols (common in deep
learning stacks) can cause Valgrind to spend an inordinate amount of time just
parsing and managing symbol information. The more complex the binary and its
dependencies, the longer the startup and runtime overhead. Again, this is
amplified on arm64.
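To get a rough feel for how much symbol data Valgrind has to chew through, you can list the debug sections of the heavyweight libraries involved. A sketch, with an illustrative library path:

```sh
# List the .debug_* sections and their sizes; the bigger these are,
# the longer Valgrind spends parsing symbol information at startup.
readelf -SW /path/to/libtorch_cpu.so | grep '\.debug'
```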
Lessons Learned (and What You Can Do)
- Limit Thread Count: When running under Valgrind, explicitly set PyTorch to use a single thread (e.g. OMP_NUM_THREADS=1; see the sketch below). This alone can make a world of difference.
- Test Small: Use the smallest possible model and dataset for Valgrind runs. Save the big workloads for native or lighter-weight profiling tools.
- Expect the Unexpected: Don’t assume that “bigger is better” when debugging with Valgrind – sometimes, less really is more!
- Profile Performance Separately: Use Valgrind for correctness and bug-hunting, not for benchmarking or performance profiling.
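Putting the first lesson into practice is a one-line change to the test invocation. A minimal sketch, assuming the same binary, model, and suppression file as in the runs above (arguments and paths abbreviated):

```sh
# Pin the OpenMP/MKL thread pools to a single thread before handing
# the binary to Valgrind.
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
valgrind --leak-check=full --error-exitcode=1 \
         --suppressions=scripts/config/valgrind.supp \
         /home/runner/artifacts/bin/torch_inference \
         example.jpg https://s3.nbfc.io/torch/mobilenet.pt imagenet.txt
```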
And here’s the full snippet of the test, after applying the lessons above, on a runner VM on the Jetson Orin, this time taking less than 6 minutes:
$ ninja run-examples-valgrind -C build
ninja: Entering directory `build'
[0/1] Running external command run-examples-valgrind (wrapped by meson to set env)
Arch is 64bit : true
Default config dir : /home/ananos/vaccel-plugin-torch/scripts/common/config
Package : vaccel-torch
Package config dir : /home/ananos/vaccel-plugin-torch/scripts/config
Package lib dir : /home/ananos/vaccel-plugin-torch/build/src
vAccel prefix : /home/runner/artifacts
vAccel lib dir : /home/runner/artifacts/lib/aarch64-linux-gnu
vAccel bin dir : /home/runner/artifacts/bin
vAccel share dir : /home/runner/artifacts/share/vaccel
Running examples with plugin 'libvaccel-torch.so'
+ eval valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==1655== Memcheck, a memory error detector
==1655== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1655== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==1655== Command: /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==1655==
2025.07.10-20:06:28.83 - <debug> Initializing vAccel
2025.07.10-20:06:28.85 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:06:28.86 - <debug> Config:
2025.07.10-20:06:28.86 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:06:28.86 - <debug> log_level = debug
2025.07.10-20:06:28.86 - <debug> log_file = (null)
2025.07.10-20:06:28.86 - <debug> profiling_enabled = false
2025.07.10-20:06:28.86 - <debug> version_ignore = false
2025.07.10-20:06:28.87 - <debug> Created top-level rundir: /run/user/1000/vaccel/P01ae4
2025.07.10-20:07:27.35 - <info> Registered plugin torch 0.2.1-3-0b1978fb
2025.07.10-20:07:27.35 - <debug> Registered op torch_jitload_forward from plugin torch
2025.07.10-20:07:27.35 - <debug> Registered op torch_sgemm from plugin torch
2025.07.10-20:07:27.35 - <debug> Registered op image_classify from plugin torch
2025.07.10-20:07:27.35 - <debug> Loaded plugin torch from libvaccel-torch.so
2025.07.10-20:07:27.39 - <debug> Initialized resource 1
Initialized model resource 1
2025.07.10-20:07:27.39 - <debug> New rundir for session 1: /run/user/1000/vaccel/P01ae4/session.1
2025.07.10-20:07:27.39 - <debug> Initialized session 1
Initialized vAccel session 1
2025.07.10-20:07:27.40 - <debug> New rundir for resource 1: /run/user/1000/vaccel/P01ae4/resource.1
2025.07.10-20:07:27.62 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:07:33.90 - <debug> Downloaded: 555.7 KB of 13.7 MB (4.0%) | Speed: 88.84 KB/sec
2025.07.10-20:07:36.78 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 1.50 MB/sec
2025.07.10-20:07:36.80 - <debug> Download completed successfully
2025.07.10-20:07:36.94 - <debug> session:1 Registered resource 1
2025.07.10-20:07:38.16 - <debug> session:1 Looking for plugin implementing torch_jitload_forward operation
2025.07.10-20:07:38.16 - <debug> Returning func from hint plugin torch
2025.07.10-20:07:38.16 - <debug> Found implementation in torch plugin
2025.07.10-20:07:38.16 - <debug> [torch] session:1 Jitload & Forward Process
2025.07.10-20:07:38.16 - <debug> [torch] Model: /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
2025.07.10-20:07:38.17 - <debug> [torch] Loading model from /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
Success!
Result Tensor :
Output tensor => type:7 nr_dims:2
size: 4000 B
Prediction: banana
2025.07.10-20:08:39.93 - <debug> session:1 Unregistered resource 1
2025.07.10-20:08:39.94 - <debug> Released session 1
2025.07.10-20:08:39.94 - <debug> Removing file /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
2025.07.10-20:08:39.95 - <debug> Released resource 1
2025.07.10-20:08:48.91 - <debug> Cleaning up vAccel
2025.07.10-20:08:48.91 - <debug> Cleaning up sessions
2025.07.10-20:08:48.91 - <debug> Cleaning up resources
2025.07.10-20:08:48.91 - <debug> Cleaning up plugins
2025.07.10-20:08:48.92 - <debug> Unregistered plugin torch
==1655==
==1655== HEAP SUMMARY:
==1655== in use at exit: 304,924 bytes in 3,290 blocks
==1655== total heap usage: 1,780,098 allocs, 1,776,808 frees, 406,800,553 bytes allocated
==1655==
==1655== LEAK SUMMARY:
==1655== definitely lost: 0 bytes in 0 blocks
==1655== indirectly lost: 0 bytes in 0 blocks
==1655== possibly lost: 0 bytes in 0 blocks
==1655== still reachable: 0 bytes in 0 blocks
==1655== suppressed: 304,924 bytes in 3,290 blocks
==1655==
==1655== For lists of detected and suppressed errors, rerun with: -s
==1655== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3153 from 3153)
+ [ 1 = 1 ]
+ eval valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==1657== Memcheck, a memory error detector
==1657== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1657== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==1657== Command: /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==1657==
2025.07.10-20:08:50.40 - <debug> Initializing vAccel
2025.07.10-20:08:50.42 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:08:50.42 - <debug> Config:
2025.07.10-20:08:50.42 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:08:50.42 - <debug> log_level = debug
2025.07.10-20:08:50.42 - <debug> log_file = (null)
2025.07.10-20:08:50.42 - <debug> profiling_enabled = false
2025.07.10-20:08:50.42 - <debug> version_ignore = false
2025.07.10-20:08:50.43 - <debug> Created top-level rundir: /run/user/1000/vaccel/73XJNT
2025.07.10-20:09:48.93 - <info> Registered plugin torch 0.2.1-3-0b1978fb
2025.07.10-20:09:48.93 - <debug> Registered op torch_jitload_forward from plugin torch
2025.07.10-20:09:48.93 - <debug> Registered op torch_sgemm from plugin torch
2025.07.10-20:09:48.93 - <debug> Registered op image_classify from plugin torch
2025.07.10-20:09:48.93 - <debug> Loaded plugin torch from libvaccel-torch.so
2025.07.10-20:09:48.94 - <debug> New rundir for session 1: /run/user/1000/vaccel/73XJNT/session.1
2025.07.10-20:09:48.95 - <debug> Initialized session 1
Initialized session with id: 1
2025.07.10-20:09:48.97 - <debug> Initialized resource 1
2025.07.10-20:09:48.98 - <debug> New rundir for resource 1: /run/user/1000/vaccel/73XJNT/resource.1
2025.07.10-20:09:49.19 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:09:55.17 - <debug> Downloaded: 816.6 KB of 13.7 MB (5.8%) | Speed: 137.30 KB/sec
2025.07.10-20:09:57.71 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 1.62 MB/sec
2025.07.10-20:09:57.73 - <debug> Download completed successfully
2025.07.10-20:09:57.87 - <debug> session:1 Registered resource 1
2025.07.10-20:09:57.88 - <debug> session:1 Looking for plugin implementing VACCEL_OP_IMAGE_CLASSIFY
2025.07.10-20:09:57.88 - <debug> Returning func from hint plugin torch
2025.07.10-20:09:57.88 - <debug> Found implementation in torch plugin
2025.07.10-20:09:57.88 - <debug> [torch] Loading model from /run/user/1000/vaccel/73XJNT/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
2025.07.10-20:11:31.42 - <debug> [torch] Prediction: banana
classification tags: banana
classification imagename: PLACEHOLDER
2025.07.10-20:11:31.93 - <debug> session:1 Unregistered resource 1
2025.07.10-20:11:31.93 - <debug> Removing file /run/user/1000/vaccel/73XJNT/resource.1/mobilenet.pt
2025.07.10-20:11:31.94 - <debug> Released resource 1
2025.07.10-20:11:31.95 - <debug> Released session 1
2025.07.10-20:11:44.12 - <debug> Cleaning up vAccel
2025.07.10-20:11:44.12 - <debug> Cleaning up sessions
2025.07.10-20:11:44.12 - <debug> Cleaning up resources
2025.07.10-20:11:44.12 - <debug> Cleaning up plugins
2025.07.10-20:11:44.12 - <debug> Unregistered plugin torch
==1657==
==1657== HEAP SUMMARY:
==1657== in use at exit: 306,616 bytes in 3,294 blocks
==1657== total heap usage: 3,167,511 allocs, 3,164,217 frees, 533,893,229 bytes allocated
==1657==
==1657== LEAK SUMMARY:
==1657== definitely lost: 0 bytes in 0 blocks
==1657== indirectly lost: 0 bytes in 0 blocks
==1657== possibly lost: 0 bytes in 0 blocks
==1657== still reachable: 0 bytes in 0 blocks
==1657== suppressed: 306,616 bytes in 3,294 blocks
==1657==
==1657== For lists of detected and suppressed errors, rerun with: -s
==1657== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3153 from 3153)
+ set +x
And the actual CI test run is shown in Figure 3, taking 8 minutes, almost 7 times faster than the original execution:

Wrapping Up
This experience was a great reminder that debugging tools and parallel
workloads don’t always play nicely, especially on less mature platforms.
Sometimes, the humble Raspberry Pi will leave a high-end chip in the dust,
at least when Valgrind is in the mix.
So next time you’re staring at a progress bar that refuses to budge, remember: more cores might just mean more waiting. And don’t be afraid to try your tests on the “little guy” – you might be surprised by what you find.