Add cross-compilation post and site updates

Add full cross-compilation article, update index laboratory note, and
add LICENSE and README files
Stefan Kempinger 2025-11-10 12:09:52 +01:00
parent d7ff8b0d09
commit aaabfcf030
4 changed files with 435 additions and 7 deletions


@@ -637,15 +637,15 @@
 laboratory work.</p>
 </div>
 <div class="notes-grid">
-<!-- <article class="note-card">
+<article class="note-card">
 <div class="note-header">
-<span class="note-date">2024-07-08</span>
-<span class="note-tag">Systems</span>
+<span class="note-date">2025-11-10</span>
+<span class="note-tag">Nix + Rust</span>
 </div>
-<h3>The Myth of Perfect Code Architecture</h3>
-<p>After years of chasing the "perfect" system design, I've concluded that adaptability trumps perfection. The best architectures are those that can evolve...</p>
-<a href="#" class="note-link">Read full observation →</a>
-</article>-->
+<h3>The Perils of Cross-Compilation</h3>
+<p>The story of how I cross-compiled a giant Rust + C++ facial recognition project using Nix.</p>
+<a href="page/cross-compile.html" class="note-link">Read full observation →</a>
+</article>
 </div>
 </div>
 </section>


@@ -0,0 +1,424 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>cross-compiling - kempinger.at</title>
<link rel="stylesheet" href="../style.css" />
<style>
body {
font-family:
"Inter",
-apple-system,
BlinkMacSystemFont,
"Segoe UI",
Roboto,
sans-serif;
background: linear-gradient(135deg, #0a0a0a 0%, #1a1a2e 100%);
color: #ffffff;
line-height: 1.6;
margin: 0;
min-height: 100vh;
}
.container {
max-width: 800px;
margin: 0 auto;
padding: 4rem 20px 2rem 20px;
}
.boxed-content {
background: rgba(255, 255, 255, 0.02);
border-radius: 20px;
padding: 3rem;
backdrop-filter: blur(10px);
border: 1px solid rgba(255, 255, 255, 0.1);
margin: 0 auto;
max-width: 700px;
}
h1 {
font-size: 2.5rem;
font-weight: 700;
color: #00d4ff;
text-align: center;
margin-bottom: 2rem;
}
.back-link {
display: inline-block;
margin-bottom: 2rem;
color: #00d4ff;
text-decoration: none;
font-weight: 500;
}
.back-link:hover {
text-decoration: underline;
}
.construction-message {
text-align: center;
margin-top: 2rem;
}
.construction-message img {
width: 80px;
opacity: 0.7;
}
</style>
</head>
<body>
<div class="container">
<a href="../index.html#lab" class="back-link">← Back to Laboratory</a>
<div class="boxed-content">
<h1>the perils of cross-compilation</h1>
<p>
Over the past few months, I have been deep in the trenches
of cross-compiling.
</p>
<h2>context</h2>
<p>
This story takes place within
<a href="https://digidow.eu">Project Digidow</a>, a facial
recognition sensor written in Rust, built with Nix, and
deployed on a Raspberry Pi.
</p>
<h2>the trigger</h2>
<p>
Everything started when the version of TensorFlow Lite in
<code>nixpkgs</code> was deprecated and stopped building.
The ideal solution seemed simple: just update TensorFlow
Lite to a newer version. Unfortunately, that turned out to
be far from easy, since TensorFlow Lite can be built with
either Bazel or CMake. The existing Nix build used Bazel,
which did not want to update gracefully.
</p>
<h2>first attempt: ditching bazel</h2>
<p>
I decided to switch to building TensorFlow Lite from source
using CMake. This worked significantly better, but still
required some effort. I replaced all fetchers in the CMake
scripts with <code>file://</code> URLs and used pre-built
binaries from <code>nixpkgs</code> whenever possible.
However, since the CMake build system insists on compiling
nearly everything from source, I had to manually provide
most dependencies.
</p>
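<p>
A minimal sketch of that fetcher rewrite, assuming GNU
<code>sed</code>; the module path and the archive location below
are illustrative, not the real ones:
</p>
<pre>
# Rewrite one dependency's download URL in TFLite's CMake modules
# to point at a pre-fetched local archive (paths are made up here).
MODULE=tensorflow/lite/tools/cmake/modules/abseil-cpp.cmake
LOCAL=/build/deps/abseil-cpp.tar.gz
sed -i "s|https://github.com/abseil/abseil-cpp/archive/[^\"]*|file://$LOCAL|" "$MODULE"</pre>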
<h2>new problem: tflite-rs</h2>
<p>
At this point, I ran into linker issues with
<code>tflite-rs</code>. The Bazel build produces a “fat”
<code>.so</code> file that includes all transitive
dependencies, while the CMake build does not. This meant
that I had to manually link all dependencies.
</p>
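<p>
The difference is visible in the dynamic section of the two
artifacts. A quick check along these lines (library name assumed)
shows that the CMake-built <code>.so</code> expects each
dependency at link time, while the fat Bazel one bundles them:
</p>
<pre>
# List the runtime dependencies recorded in the shared object;
# the CMake build prints many NEEDED entries, the Bazel build few.
readelf -d libtensorflowlite.so | grep NEEDED</pre>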
<p>
My eventual solution was to provide all transitive
dependencies in the Nix build output directory as well. As a
result, <code>tflite-rs</code> now uses Bazel for local
building, but it also supports a provided binary, which I
now build with CMake.
</p>
<h2>compiling on the raspberry pi</h2>
<p>
Unfortunately, the sensor was far too large to compile
directly on the Raspberry Pi. The obvious next step was
cross-compiling. Rust supports cross-compilation out of the
box, and Nix has great support for it via the
<code>cross</code> toolchain. In theory, this should have
been simple. In practice, nothing worked. Or rather, some
parts worked, others did not, and I had no idea why.
</p>
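<p>
For reference, the pure-Rust half of the story really is small; a
sketch assuming a GNU cross toolchain on the <code>PATH</code>
(for example one provided through Nix):
</p>
<pre>
# Install the target's standard library, tell cargo which linker
# to use for that target, and build. The linker binary name
# depends on your particular toolchain.
rustup target add aarch64-unknown-linux-gnu
mkdir -p .cargo
printf '[target.aarch64-unknown-linux-gnu]\nlinker = "aarch64-unknown-linux-gnu-gcc"\n' > .cargo/config.toml
cargo build --release --target aarch64-unknown-linux-gnu</pre>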
<h3>the usual suspects</h3>
<ul>
<li>OpenCV refused to build.</li>
<li>TensorFlow Lite refused to build.</li>
</ul>
<p>
The biggest issue was the classic “tools for host vs. tools
for target” problem.
</p>
<h2>solution: using pkgsCross</h2>
<p>
The solution was to use the appropriate tools for the host
and target platforms via Nix (thanks,
<code>pkgsCross</code>). During build time, I injected and
replaced tool paths in the CMake scripts. The worst
offenders were Protobuf and FlatBuffers, since both rely on
compiler executables that need to run on the host machine,
yet require version-specific shared libraries that belong to
the target platform.
</p>
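<p>
The injection itself was plain text surgery on the configured
build. A hedged sketch (the store-path pattern and the cache file
entry are illustrative):
</p>
<pre>
# After CMake configures, swap the target-platform protoc it
# recorded for one that can actually execute on the build machine.
HOST_PROTOC=$(command -v protoc || echo /path/to/host/protoc)
sed -i "s|/nix/store/[a-z0-9]*-protobuf[^ ]*/bin/protoc|$HOST_PROTOC|g" CMakeCache.txt</pre>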
<p>
After a lot of manual labor, everything finally compiled
successfully.
</p>
<h2>deployment: the next obstacle</h2>
<p>
To run the binary on the Raspberry Pi, I had to transfer it
there along with all its transitive dependencies. Of course,
those dependencies also had their own dependencies. I copied
them all over, only to find that the binary still refused to
run.
</p>
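<p>
Collecting that closure can at least be scripted; a sketch
(binary name assumed, and <code>ldd</code> has to run on the
target itself or under emulation):
</p>
<pre>
# ldd already resolves transitive dependencies, so one pass over
# its output yields the full set of libraries to copy across.
mkdir -p libs
for lib in $(ldd ./sensor | awk '/=> \//{print $3}'); do
    cp -n "$lib" libs/
done</pre>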
<p>
Nix uses absolute library paths starting with
<code>/nix/store</code>, which normally ensures version
consistency and isolation. However, that system falls apart
when you move binaries to another machine.
</p>
<p>
I tried statically linking everything into one binary to
solve this problem, but that did not work either. The static
linking process simply is not supported by all dependencies
yet.
</p>
<h3>back to patching</h3>
<p>
So, I went back to manually editing the library paths in the
binary. At this point, <code>patchelf</code> became my new
best friend:
</p>
<pre>
patchelf --remove-needed libfoo.so --remove-needed libbar.so
patchelf --add-needed libfoo.so --add-needed libbar.so</pre>
<p>
Later, I realized that I could have simply used
<code>patchelf --set-rpath $ORIGIN/../lib</code>
instead of spending hours on manual path edits.
</p>
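<p>
With that approach, deployment collapses to a fixed layout plus a
single <code>patchelf</code> call (the directory names here are
mine):
</p>
<pre>
# Ship bin/sensor with its libraries in a sibling lib/ directory.
# The single quotes matter: $ORIGIN must reach the dynamic linker
# literally; it expands to the binary's own directory at load time.
mkdir -p deploy/bin deploy/lib
cp ./sensor deploy/bin/
cp libs/*.so* deploy/lib/
patchelf --set-rpath '$ORIGIN/../lib' deploy/bin/sensor</pre>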
<h2>making it run (almost)</h2>
<p>
After fixing the paths, I could finally execute the binary
using:
</p>
<pre>
LD_LIBRARY_PATH=/path/to/libs:$LD_LIBRARY_PATH ./binary</pre>
<p>
The binary started, printed some initial logs, and then
failed when attempting to import an OpenCV model. The error
message indicated that the model file could not be parsed.
This was confusing because the exact same model worked
before, and it still worked perfectly on x86_64 — just not
on aarch64.
</p>
<h2>debugging phase one</h2>
<p>
Running the sensor under GDB confirmed that all the correct
libraries were being called. The relevant call chain looked
like this:
</p>
<p>Rust code → rust-opencv → OpenCV → Protobuf.</p>
<p>
My first theory was that the <code>protoc</code> compiler
had generated code for the wrong architecture or endianness.
</p>
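<p>
That theory was cheap to test straight from the ELF header; a
sketch (binary name assumed):
</p>
<pre>
# Byte 5 of the ELF header (EI_DATA) encodes endianness:
# 01 = little-endian, 02 = big-endian. Both x86_64 and 64-bit
# Raspberry Pi OS are little-endian, so this prints 01 on both.
od -An -tx1 -j5 -N1 ./sensor | tr -d ' '</pre>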
<h3>investigations</h3>
<ul>
<li>
I developed a small program to parse the model file
directly on the Raspberry Pi. It worked, confirming the
model file was not corrupted.
</li>
<li>
I tried several Protobuf versions. None of them changed
the outcome.
</li>
<li>
I built a minimal OpenCV program to parse the same
model. It also worked fine.
</li>
</ul>
<p>
The conclusion was clear: the import code itself worked. It
just didn't work inside my main binary.
</p>
<h2>debugging phase two</h2>
<p>
I switched from a bottom-up approach to a top-down one and
started reducing the binary until I found the smallest
version that still failed. The culprit quickly emerged:
simply importing <code>tflite</code> with
<code>use tflite::Tflite</code> was enough to break OpenCV.
</p>
<h3>the real root cause</h3>
<p>
Remember how Nix uses absolute library paths? And how I
mentioned that CMake wanted to build everything from source?
To get TensorFlow Lite building, I had given it the version
of Protobuf it wanted, and I also exported all its
transitive dependencies — including that version of
Protobuf.
</p>
<p>
This led to a situation where OpenCV loaded the wrong
Protobuf library. It wasn't different enough to cause a
linker error, but it was just different enough to break
model parsing in subtle ways.
</p>
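<p>
In hindsight, glibc's loader tracing would have exposed the
mismatch immediately; a sketch (binary name assumed):
</p>
<pre>
# LD_DEBUG=libs logs, on stderr, every library the dynamic linker
# tries and picks; grepping for protobuf shows which copy wins.
LD_DEBUG=libs ./sensor 2>&1 | grep -m1 libprotobuf</pre>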
<h2>the fix</h2>
<p>
The fix turned out to be surprisingly simple. I just needed
to make TensorFlow Lite use the same version of Protobuf
that was packaged with <code>nixpkgs</code>.
</p>
<h2>the end (for now)</h2>
<p>
After months of debugging, rebuilding, and manually patching
binaries, everything finally works again. Until the next
<code>nixpkgs</code> update, of course.
</p>
</div>
</div>
</body>
</html>