diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..4d80992 --- /dev/null +++ b/LICENSE @@ -0,0 +1 @@ +ALL RIGHTS RESERVED. diff --git a/README.md b/README.md new file mode 100644 index 0000000..f90cb17 --- /dev/null +++ b/README.md @@ -0,0 +1,3 @@ +# kempinger.at blog + +All rights reserved. \ No newline at end of file diff --git a/public_html/index.html b/public_html/index.html index 9348054..898f916 100644 --- a/public_html/index.html +++ b/public_html/index.html @@ -637,15 +637,15 @@ laboratory work.

- +

The Perils of Cross-Compilation

+

The story of how I cross-compiled a giant Rust + C++ facial recognition project using Nix

+ Read full observation → +
diff --git a/public_html/page/cross-compile.html b/public_html/page/cross-compile.html new file mode 100644 index 0000000..bc8f56f --- /dev/null +++ b/public_html/page/cross-compile.html @@ -0,0 +1,424 @@ + + + + + + cross-compiling - kempinger.at + + + + + +
+ ← Back to Laboratory +
+

the perils of cross-compilation

+ +

+ Over the past few months, I have been deep in the trenches + of cross-compiling. +

+ +

context

+

+ This story takes place within + Project Digidow, a facial + recognition sensor written in Rust, built with Nix, and + deployed on a Raspberry Pi. +

+ +

the trigger

+

+ Everything started when the version of TensorFlow Lite in + nixpkgs was deprecated and stopped building. + The ideal solution seemed simple: just update TensorFlow + Lite to a newer version. Unfortunately, that turned out to + be far from easy. TensorFlow Lite can be built with + either Bazel or CMake, and the existing Nix build used Bazel, + which did not want to update gracefully.

+ +

first attempt: ditching bazel

+

+ I decided to switch to building TensorFlow Lite from source + using CMake. This worked significantly better, but still + required some effort. I replaced all fetchers in the CMake + scripts with file:// URLs and used pre-built + binaries from nixpkgs whenever possible. + However, since the CMake build system insists on compiling + nearly everything from source, I had to manually provide + most dependencies. +

+ +

new problem: tflite-rs

+

+ At this point, I ran into linker issues with + tflite-rs. The Bazel build produces a “fat” + .so file that includes all transitive + dependencies, while the CMake build does not. This meant + that I had to manually link all dependencies. +

+

+ My eventual solution was to provide all transitive + dependencies in the Nix build output directory as well. As a + result, tflite-rs now uses Bazel for local + building, but it also supports a provided binary, which I + now build with CMake. +

+ +

compiling on the raspberry pi

+

+ Unfortunately, the sensor was far too large to compile + directly on the Raspberry Pi. The obvious next step was + cross-compiling. Rust supports cross-compilation out of the + box, and Nix has great support for it via the + cross toolchain. In theory, this should have + been simple. In practice, nothing worked. Or rather, some + parts worked, others did not, and I had no idea why. +

+ +

the usual suspects

+ +

+ The biggest issue was the classic “tools for host vs. tools + for target” problem. +

+ +

solution: using pkgsCross

+

+ The solution was to use the appropriate tools for the host + and target platforms via Nix (thanks, + pkgsCross). During build time, I injected and + replaced tool paths in the CMake scripts. The worst + offenders were Protobuf and FlatBuffers, since both rely on + compiler executables that need to run on the host machine, + yet require version-specific shared libraries that belong to + the target platform. +

+

+ After a lot of manual labor, everything finally compiled + successfully. +

+ +

deployment: the next obstacle

+

+ To run the binary on the Raspberry Pi, I had to transfer it + there along with all its transitive dependencies. Of course, + those dependencies also had their own dependencies. I copied + them all over, only to find that the binary still refused to + run. +

+

+ Nix uses absolute library paths starting with + /nix/store, which normally ensures version + consistency and isolation. However, that system falls apart + when you move binaries to another machine. +
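+ To see how many of these absolute paths a binary carries, a small filter like the one below can help. This is only a sketch: store_refs is a hypothetical helper name, and the usage line assumes that patchelf is installed and that ./binary is the cross-compiled sensor.

```shell
# Print only the /nix/store entries of a colon-separated
# search path (for example an RPATH) read from stdin.
store_refs() {
    # Split on ':' so each directory is on its own line, then keep
    # the store paths. '|| true' avoids a failing exit status when
    # nothing matches.
    tr ':' '\n' | grep '^/nix/store/' || true
}

# Usage (assumption: patchelf is available on the build machine):
#   patchelf --print-rpath ./binary | store_refs
```

+ Every line this prints is a directory that has to exist at exactly that path on the Raspberry Pi, which is why copying the libraries alone was not enough.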

+

+ I tried statically linking everything into one binary to + solve this problem, but that did not work either: not all + dependencies support static linking yet.

+ +

back to patching

+

+ So, I went back to manually editing the library paths in the + binary. At this point, patchelf became my new + best friend: +

+
+patchelf --remove-needed libfoo.so --remove-needed libbar.so ./binary
+patchelf --add-needed libfoo.so --add-needed libbar.so ./binary
+

+ Later, I realized that I could have simply used + patchelf --set-rpath '$ORIGIN/../lib' + instead of spending hours on manual path edits.
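+ A minimal sketch of that approach, assuming a hypothetical bundle layout with the executable in bundle/bin and the copied libraries in bundle/lib (the function name relocate and the file name sensor are made up for illustration):

```shell
# Point a binary at a lib/ directory that sits next to its bin/ directory.
relocate() {
    bin="$1"
    # Single quotes matter: the shell must not expand $ORIGIN.
    # The dynamic linker replaces it at load time with the
    # directory that contains the binary.
    patchelf --set-rpath '$ORIGIN/../lib' "$bin" &&
        # Print the new RPATH as a quick sanity check.
        patchelf --print-rpath "$bin"
}

# Usage (assumption: patchelf is installed):
#   relocate bundle/bin/sensor
```

+ Because the path is relative to the binary, the whole bundle directory can be copied to any location on the target.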

+ +

making it run (almost)

+

+ After fixing the paths, I could finally execute the binary + using: +

+
+LD_LIBRARY_PATH=/path/to/libs:$LD_LIBRARY_PATH ./binary
+

+ The binary started, printed some initial logs, and then + failed when attempting to import an OpenCV model. The error + message indicated that the model file could not be parsed. + This was confusing because the exact same model had worked + before, and it still worked perfectly on x86_64, just not + on aarch64.

+ +

debugging phase one

+

+ Running the sensor under GDB confirmed that all the correct + libraries were being called. The relevant call chain looked + like this: +

+

Rust code → rust-opencv → OpenCV → Protobuf.

+

+ My first theory was that the protoc compiler + had generated code for the wrong architecture or endianness.

+ +

investigations

+ +

+ The conclusion was clear: the import code itself worked. It + just didn’t work inside my main binary. +

+ +

debugging phase two

+

+ I switched from a bottom-up approach to a top-down one and + started reducing the binary until I found the smallest + version that still failed. The culprit quickly emerged: + simply importing tflite with + use tflite::Tflite was enough to break OpenCV. +

+ +

the real root cause

+

+ Remember how Nix uses absolute library paths? And how I + mentioned that CMake wanted to build everything from source? + To get TensorFlow Lite building, I had given it the version + of Protobuf it wanted, and I also exported all its + transitive dependencies — including that version of + Protobuf. +

+

+ This led to a situation where OpenCV loaded the wrong + Protobuf library. It wasn’t different enough to cause a + linker error, but it was just different enough to break + model parsing in subtle ways. +
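+ In hindsight, a check like the following might have pointed at the problem earlier. It is a sketch, not what I actually ran: dup_libs is a hypothetical helper that takes one library path per line and reports every library that shows up under more than one path. It only catches the case where two copies end up in the same process. If only the wrong copy is loaded, the path has to be compared against the expected one by hand.

```shell
# Read one library path per line on stdin and print every library
# (file name up to ".so") that appears under more than one path.
dup_libs() {
    sort -u | awk -F/ '
        {
            name = $NF
            sub(/\.so.*/, "", name)   # libprotobuf.so.32 -> libprotobuf
            count[name]++
            paths[name] = paths[name] "\n  " $0
        }
        END {
            for (n in count)
                if (count[n] > 1) print n ":" paths[n]
        }'
}

# Usage on Linux, with <pid> being the running sensor process
# (field 6 of /proc/<pid>/maps is the mapped file):
#   awk '{print $6}' /proc/<pid>/maps | grep '\.so' | dup_libs
```

+ Two different store paths for libprotobuf in that output would be a strong hint that two versions are competing.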

+ +

the fix

+

+ The fix turned out to be surprisingly simple. I just needed + to make TensorFlow Lite use the same version of Protobuf + that was packaged with nixpkgs. +

+ +

the end (for now)

+

+ After months of debugging, rebuilding, and manually patching + binaries, everything finally works again. Until the next + nixpkgs update, of course. +

+ +
+
+ +