
the perils of cross-compilation

Over the past few months, I have been deep in the trenches of cross-compiling.

context

This story takes place within Project Digidow, a facial recognition sensor written in Rust, built with Nix, and deployed on a Raspberry Pi.

the trigger

Everything started when the version of TensorFlow Lite in nixpkgs was deprecated and stopped building. The ideal solution seemed simple: just update TensorFlow Lite to a newer version. Unfortunately, that turned out to be far from easy. TensorFlow Lite can be built with either Bazel or CMake, and the existing Nix build used Bazel, which refused to update gracefully.

first attempt: ditching bazel

I decided to switch to building TensorFlow Lite from source using CMake. This worked significantly better, but still required some effort. I replaced all fetchers in the CMake scripts with file:// URLs and used pre-built binaries from nixpkgs whenever possible. However, since the CMake build system insists on compiling nearly everything from source, I had to manually provide most dependencies.
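The fetcher replacement itself was mostly mechanical. A minimal sketch of the idea (the module file name, URL, and store path here are illustrative stand-ins, not the exact ones TFLite uses):

```shell
# Illustrative only: TFLite's CMake build describes each dependency in its
# own module file. Create a stand-in module to show the rewrite:
module=abseil-cpp.cmake
printf 'GIT_REPOSITORY "https://github.com/abseil/abseil-cpp"\n' > "$module"

# Swap the network URL for a pre-fetched local copy via a file:// URL.
sed -i 's|https://github.com/abseil/abseil-cpp|file:///nix/store/example-abseil-cpp-src|' "$module"
cat "$module"
```

In the real build, the same substitution is applied to every dependency module before CMake ever runs, so nothing tries to reach the network inside the sandbox.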

new problem: tflite-rs

At this point, I ran into linker issues with tflite-rs. The Bazel build produces a “fat” .so file that includes all transitive dependencies, while the CMake build does not. This meant that I had to manually link all dependencies.

My eventual solution was to provide all transitive dependencies in the Nix build output directory as well. As a result, tflite-rs now uses Bazel for local building, but it also supports a provided binary, which I now build with CMake.

compiling on the raspberry pi

Unfortunately, the sensor was far too large to compile directly on the Raspberry Pi. The obvious next step was cross-compiling. Rust supports cross-compilation out of the box, and Nix has great support for it via the cross toolchain. In theory, this should have been simple. In practice, nothing worked. Or rather, some parts worked, others did not, and I had no idea why.
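For context, Nix exposes its cross toolchains through the pkgsCross attribute set; cross-building a package for a 64-bit ARM target looks roughly like this (hello is just a stand-in for the sensor's own derivation):

```shell
# Cross-build a package for aarch64-linux using nixpkgs' bundled toolchain.
# 'hello' is a placeholder; the real invocation targets the sensor derivation.
nix-build '<nixpkgs>' -A pkgsCross.aarch64-multiplatform.hello
```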

the usual suspects

The biggest issue was the classic “tools for host vs. tools for target” problem.

solution: using pkgsCross

The solution was to use the appropriate tools for the host and target platforms via Nix (thanks, pkgsCross). During build time, I injected and replaced tool paths in the CMake scripts. The worst offenders were Protobuf and FlatBuffers, since both rely on compiler executables that need to run on the host machine, yet require version-specific shared libraries that belong to the target platform.
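Concretely, the trick for Protobuf is to hand CMake a protoc that runs on the build machine while linking against the target platform's libprotobuf. A hedged sketch (Protobuf_PROTOC_EXECUTABLE and Protobuf_LIBRARIES are the standard CMake FindProtobuf variables; the toolchain file and store paths are placeholders):

```shell
# Illustrative: native protoc for code generation, target libraries for linking.
cmake ../tensorflow/lite \
  -DCMAKE_TOOLCHAIN_FILE=aarch64-toolchain.cmake \
  -DProtobuf_PROTOC_EXECUTABLE=/nix/store/native-protobuf/bin/protoc \
  -DProtobuf_LIBRARIES=/nix/store/cross-protobuf/lib/libprotobuf.so
```

The same split applies to FlatBuffers and its flatc compiler.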

After a lot of manual labor, everything finally compiled successfully.

deployment: the next obstacle

To run the binary on the Raspberry Pi, I had to transfer it there along with all its transitive dependencies. Of course, those dependencies also had their own dependencies. I copied them all over, only to find that the binary still refused to run.

Nix uses absolute library paths starting with /nix/store, which normally ensures version consistency and isolation. However, that system falls apart when you move binaries to another machine.
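Enumerating those transitive dependencies by hand is tedious; ldd can do the walking. A rough sketch of the copy step, using /bin/sh here as a stand-in for the sensor binary (on a real cross-built binary this has to run on the target, since the host's loader would resolve host paths):

```shell
# List the shared libraries a binary resolves and copy them into ./lib.
mkdir -p lib
ldd /bin/sh | awk '$3 ~ /^\// { print $3 }' | while read -r sofile; do
  cp -n "$sofile" lib/   # -n: keep the first copy if names collide
done
ls lib
```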

I tried statically linking everything into one binary to solve this problem, but that did not work either: not all dependencies support static linking yet.

back to patching

So, I went back to manually editing the library paths in the binary. At this point, patchelf became my new best friend:

patchelf --remove-needed libfoo.so --remove-needed libbar.so ./binary
patchelf --add-needed libfoo.so --add-needed libbar.so ./binary

Later, I realized that I could have simply used patchelf --set-rpath $ORIGIN/../lib instead of spending hours on manual path edits.
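One gotcha with that approach: $ORIGIN has to reach patchelf literally so the dynamic linker can expand it at runtime, which means single quotes in the shell. A quick check of the quoting (patchelf itself is not invoked here):

```shell
# Single quotes keep $ORIGIN literal for the dynamic linker:
#   patchelf --set-rpath '$ORIGIN/../lib' ./binary
# Unquoted or double-quoted, the shell would expand $ORIGIN (usually to "").
rpath='$ORIGIN/../lib'
echo "$rpath"
```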

making it run (almost)

After fixing the paths, I could finally execute the binary using:

LD_LIBRARY_PATH=/path/to/libs:$LD_LIBRARY_PATH ./binary

The binary started, printed some initial logs, and then failed when attempting to import an OpenCV model. The error message indicated that the model file could not be parsed. This was confusing because the exact same model worked before, and it still worked perfectly on x86_64 — just not on aarch64.

debugging phase one

Running the sensor under GDB confirmed that all the correct libraries were being called. The relevant call chain looked like this:

Rust code → rust-opencv → OpenCV → Protobuf.

My first theory was that the protoc compiler had generated code for the wrong architecture or endianness.

investigations

Importing the same model from a minimal standalone test program, built against the same OpenCV, worked without issue. The conclusion was clear: the import code itself worked. It just didn’t work inside my main binary.

debugging phase two

I switched from a bottom-up approach to a top-down one and started reducing the binary until I found the smallest version that still failed. The culprit quickly emerged: simply importing tflite with use tflite::Tflite was enough to break OpenCV.

the real root cause

Remember how Nix uses absolute library paths? And how I mentioned that CMake wanted to build everything from source? To get TensorFlow Lite building, I had given it the version of Protobuf it wanted, and I also exported all its transitive dependencies — including that version of Protobuf.

This led to a situation where OpenCV loaded the wrong Protobuf library. It wasn’t different enough to cause a linker error, but it was just different enough to break model parsing in subtle ways.
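This kind of wrong-library mixup is visible at load time: glibc's dynamic linker can log which file each library request actually resolves to. Using /bin/true as a stand-in for the sensor binary:

```shell
# Ask the glibc loader to log library resolution. On the real binary,
# grepping this output for 'protobuf' shows exactly which libprotobuf.so
# gets loaded ahead of the one OpenCV expects.
LD_DEBUG=libs /bin/true 2>&1 | head -n 5
```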

the fix

The fix turned out to be surprisingly simple. I just needed to make TensorFlow Lite use the same version of Protobuf that nixpkgs already shipped, the one OpenCV itself was linked against.

the end (for now)

After months of debugging, rebuilding, and manually patching binaries, everything finally works again. Until the next nixpkgs update, of course.