Over the past few months, I have been deep in the trenches of cross-compiling.
This story takes place within Project Digidow, a facial recognition sensor written in Rust, built with Nix, and deployed on a Raspberry Pi.
Everything started when the version of TensorFlow Lite in
nixpkgs was deprecated and stopped building.
The ideal solution seemed simple: just update TensorFlow
Lite to a newer version. Unfortunately, that turned out to
be far from easy, since TensorFlow Lite can be built with
either Bazel or CMake. The existing Nix build used Bazel,
which did not want to update gracefully.
I decided to switch to building TensorFlow Lite from source
using CMake. This worked significantly better, but still
required some effort. I replaced all fetchers in the CMake
scripts with file:// URLs and used pre-built
binaries from nixpkgs whenever possible.
However, since the CMake build system insists on compiling
nearly everything from source, I had to manually provide
most dependencies.
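The fetcher rewrite can be sketched roughly like this. The CMake fragment, dependency name, URL, and Nix store path below are all placeholders for illustration, not the real TensorFlow Lite dependency list:

```shell
# Hypothetical sketch of the fetcher rewrite; every name here is a placeholder.
workdir="${TMPDIR:-/tmp}/fetcher-demo"
mkdir -p "$workdir"
cat > "$workdir/deps.cmake" <<'EOF'
FetchContent_Declare(abseil
  URL https://example.com/abseil.tar.gz)
EOF
# Point the fetcher at a pre-fetched, Nix-provided local copy instead of
# the network, so the sandboxed build never has to go online.
sed 's|https://example.com/abseil.tar.gz|file:///nix/store/placeholder-abseil.tar.gz|' \
  "$workdir/deps.cmake" > "$workdir/deps.local.cmake"
cat "$workdir/deps.local.cmake"
```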
At this point, I ran into linker issues with
tflite-rs. The Bazel build produces a “fat”
.so file that includes all transitive
dependencies, while the CMake build does not. This meant
that I had to manually link all dependencies.
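Manually linking them boils down to handing rustc extra `-L` search paths and `-l` flags, for instance via `RUSTFLAGS`. A rough sketch; the store path and the library list are placeholders, and the real tflite-rs build wiring may differ:

```shell
# Hypothetical sketch: pointing the Rust linker at the transitive
# dependencies the CMake-built libtensorflow-lite.so no longer bundles.
# Path and library names are placeholders, not the full dependency list.
export RUSTFLAGS='-L /nix/store/placeholder-tflite/lib -l dylib=tensorflow-lite -l dylib=ruy'
echo "$RUSTFLAGS"
```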
My eventual solution was to provide all transitive
dependencies in the Nix build output directory as well. As a
result, tflite-rs now uses Bazel for local
building, but it also supports a provided binary, which I
now build with CMake.
Unfortunately, the sensor was far too large to compile
directly on the Raspberry Pi. The obvious next step was
cross-compiling. Rust supports cross-compilation out of the
box, and Nix has great support for it via the
cross toolchain. In theory, this should have
been simple. In practice, nothing worked. Or rather, some
parts worked, others did not, and I had no idea why.
The biggest issue was the classic “tools for host vs. tools for target” problem.
The solution was to use the appropriate tools for the host
and target platforms via Nix (thanks,
pkgsCross). During build time, I injected and
replaced tool paths in the CMake scripts. The worst
offenders were Protobuf and FlatBuffers, since both rely on
compiler executables that need to run on the host machine,
yet require version-specific shared libraries that belong to
the target platform.
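The kind of overrides I injected look roughly like this. The variable names follow the conventions of CMake's FindProtobuf module and FlatBuffers, but the exact names TensorFlow Lite honors, and the store paths, are placeholders here:

```shell
# Hypothetical sketch of injected CMake cache overrides. protoc and flatc
# must be native binaries, because they execute during the build; the
# libraries they pair with must match the target platform instead.
overrides="${TMPDIR:-/tmp}/cross-overrides.cmake"
cat > "$overrides" <<'EOF'
set(Protobuf_PROTOC_EXECUTABLE /nix/store/placeholder-protobuf-native/bin/protoc)
set(FLATBUFFERS_FLATC_EXECUTABLE /nix/store/placeholder-flatbuffers-native/bin/flatc)
set(Protobuf_LIBRARY /nix/store/placeholder-protobuf-aarch64/lib/libprotobuf.so)
EOF
grep -c 'placeholder' "$overrides"
```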
After a lot of manual labor, everything finally compiled successfully.
To run the binary on the Raspberry Pi, I had to transfer it there along with all its transitive dependencies. Of course, those dependencies also had their own dependencies. I copied them all over, only to find that the binary still refused to run.
Nix uses absolute library paths starting with
/nix/store, which normally ensures version
consistency and isolation. However, that system falls apart
when you move binaries to another machine.
I tried statically linking everything into one binary to solve this problem, but that did not work either: not all of the dependencies support static linking yet.
So, I went back to manually editing the library paths in the
binary. At this point, patchelf became my new
best friend:
patchelf --remove-needed libfoo.so --remove-needed libbar.so ./binary
patchelf --add-needed libfoo.so --add-needed libbar.so ./binary
Later, I realized that I could have simply used
patchelf --set-rpath '$ORIGIN/../lib' ./binary
instead of spending hours on manual path edits.
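$ORIGIN expands at load time to the directory containing the binary itself, so with an rpath of $ORIGIN/../lib the whole bundle stays relocatable no matter where it lands on the Pi. A sketch of the layout that convention expects, with made-up file names:

```shell
# Hypothetical sketch of a relocatable bundle layout: the binary in bin/
# finds its libraries in the sibling lib/ directory, wherever the bundle
# as a whole is copied to. File names are placeholders.
bundle="${TMPDIR:-/tmp}/bundle-demo"
mkdir -p "$bundle/bin" "$bundle/lib"
touch "$bundle/bin/sensor" "$bundle/lib/libtensorflow-lite.so"
find "$bundle" -type f | sort
```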
After fixing the paths, I could finally execute the binary using:
LD_LIBRARY_PATH=/path/to/libs:$LD_LIBRARY_PATH ./binary
The binary started, printed some initial logs, and then failed when attempting to import an OpenCV model. The error message indicated that the model file could not be parsed. This was confusing because the exact same model worked before, and it still worked perfectly on x86_64 — just not on aarch64.
Running the sensor under GDB confirmed that all the correct libraries were being called. The relevant call chain looked like this:
Rust code → rust-opencv → OpenCV → Protobuf.
My first theory was that the protoc compiler
had generated code for the wrong architecture or endianness.
Testing the import in a minimal standalone program, however, worked fine. The conclusion was clear: the import code itself worked. It just didn’t work inside my main binary.
I switched from a bottom-up approach to a top-down one and
started reducing the binary until I found the smallest
version that still failed. The culprit quickly emerged:
simply importing tflite with
use tflite::Tflite was enough to break OpenCV.
Remember how Nix uses absolute library paths? And how I mentioned that CMake wanted to build everything from source? To get TensorFlow Lite building, I had given it the version of Protobuf it wanted, and I also exported all its transitive dependencies — including that version of Protobuf.
This led to a situation where OpenCV loaded the wrong Protobuf library. It wasn’t different enough to cause a linker error, but it was just different enough to break model parsing in subtle ways.
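In hindsight, the duplicate would have been easy to spot just by listing the bundled libraries: two different libprotobuf sonames side by side is a red flag. A sketch, with made-up version suffixes:

```shell
# Hypothetical sketch: a deployment directory that accidentally contains
# two different Protobuf versions. The soname suffixes are made up.
libs="${TMPDIR:-/tmp}/libs-demo"
mkdir -p "$libs"
touch "$libs/libprotobuf.so.3.21.12" "$libs/libprotobuf.so.4.25.1"
ls "$libs" | grep -c '^libprotobuf'
```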
The fix turned out to be surprisingly simple. I just needed
to make TensorFlow Lite use the same version of Protobuf
that was packaged with nixpkgs.
After months of debugging, rebuilding, and manually patching
binaries, everything finally works again. Until the next
nixpkgs update, of course.