<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>cross-compiling - kempinger.at</title>
    <link rel="stylesheet" href="../style.css" />
    <style>
      body {
        font-family:
          "Inter",
          -apple-system,
          BlinkMacSystemFont,
          "Segoe UI",
          Roboto,
          sans-serif;
        background: linear-gradient(135deg, #0a0a0a 0%, #1a1a2e 100%);
        color: #ffffff;
        line-height: 1.6;
        margin: 0;
        min-height: 100vh;
      }

      .container {
        max-width: 800px;
        margin: 0 auto;
        padding: 4rem 20px 2rem 20px;
      }

      .boxed-content {
        background: rgba(255, 255, 255, 0.02);
        border-radius: 20px;
        padding: 3rem;
        backdrop-filter: blur(10px);
        border: 1px solid rgba(255, 255, 255, 0.1);
        margin: 0 auto;
        max-width: 700px;
      }

      h1 {
        font-size: 2.5rem;
        font-weight: 700;
        color: #00d4ff;
        text-align: center;
        margin-bottom: 2rem;
      }

      .back-link {
        display: inline-block;
        margin-bottom: 2rem;
        color: #00d4ff;
        text-decoration: none;
        font-weight: 500;
      }

      .back-link:hover {
        text-decoration: underline;
      }

      .construction-message {
        text-align: center;
        margin-top: 2rem;
      }

      .construction-message img {
        width: 80px;
        opacity: 0.7;
      }
    </style>
  </head>

  <body>
    <div class="container">
      <a href="../index.html#lab" class="back-link">← Back to Laboratory</a>
      <div class="boxed-content">
        <h1>the perils of cross-compilation</h1>

        <p>
          Over the past few months, I have been deep in the trenches of
          cross-compiling.
        </p>

        <h2>context</h2>
        <p>
          This story takes place within
          <a href="https://digidow.eu">Project Digidow</a>, a facial
          recognition sensor written in Rust, built with Nix, and deployed on
          a Raspberry Pi.
        </p>

        <h2>the trigger</h2>
        <p>
          Everything started when the version of TensorFlow Lite in
          <code>nixpkgs</code> was deprecated and stopped building. The ideal
          solution seemed simple: just update TensorFlow Lite to a newer
          version. Unfortunately, that turned out to be far from easy, since
          TensorFlow Lite can be built with either Bazel or CMake. The
          existing Nix build used Bazel, which did not want to update
          gracefully.
        </p>

        <h2>first attempt: ditching bazel</h2>
        <p>
          I decided to switch to building TensorFlow Lite from source using
          CMake. This worked significantly better, but still required some
          effort. I replaced all fetchers in the CMake scripts with
          <code>file://</code> URLs and used pre-built binaries from
          <code>nixpkgs</code> whenever possible. However, since the CMake
          build system insists on compiling nearly everything from source, I
          had to manually provide most dependencies.
        </p>

        <h2>new problem: tflite-rs</h2>
        <p>
          At this point, I ran into linker issues with
          <code>tflite-rs</code>. The Bazel build produces a “fat”
          <code>.so</code> file that includes all transitive dependencies,
          while the CMake build does not. This meant that I had to manually
          link all dependencies.
        </p>
        <p>
          My eventual solution was to provide all transitive dependencies in
          the Nix build output directory as well. As a result,
          <code>tflite-rs</code> now uses Bazel for local building, but it
          also supports a provided binary, which I now build with CMake.
        </p>
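        <p>
          For the record, “manually link all dependencies” means emitting
          extra link directives from the crate’s <code>build.rs</code>. The
          dry-run sketch below only prints the kind of directives involved;
          the search path and library names are placeholders, not the real
          dependency list.
        </p>

```shell
# Directives a build.rs would print to stdout so that Cargo links the
# CMake-built TensorFlow Lite together with its transitive dependencies.
# All paths and library names below are illustrative placeholders.
directives='cargo:rustc-link-search=native=/path/to/tflite-build/lib
cargo:rustc-link-lib=dylib=tensorflowlite
cargo:rustc-link-lib=dylib=flatbuffers
cargo:rustc-link-lib=dylib=protobuf'
printf '%s\n' "$directives"
```

        <p>
          In a real <code>build.rs</code> these lines are produced with
          <code>println!</code>; Cargo reads them from the build script’s
          stdout.
        </p>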

        <h2>compiling on the raspberry pi</h2>
        <p>
          Unfortunately, the sensor was far too large to compile directly on
          the Raspberry Pi. The obvious next step was cross-compiling. Rust
          supports cross-compilation out of the box, and Nix has great
          support for it via the <code>cross</code> toolchain. In theory,
          this should have been simple. In practice, nothing worked. Or
          rather, some parts worked, others did not, and I had no idea why.
        </p>

        <h3>the usual suspects</h3>
        <ul>
          <li>OpenCV refused to build.</li>
          <li>TensorFlow Lite refused to build.</li>
        </ul>
        <p>
          The biggest issue was the classic “tools for host vs. tools for
          target” problem.
        </p>

        <h2>solution: using pkgsCross</h2>
        <p>
          The solution was to use the appropriate tools for the host and
          target platforms via Nix (thanks, <code>pkgsCross</code>). During
          build time, I injected and replaced tool paths in the CMake
          scripts. The worst offenders were Protobuf and FlatBuffers, since
          both rely on compiler executables that need to run on the host
          machine, yet require version-specific shared libraries that belong
          to the target platform.
        </p>
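        <p>
          A minimal sketch of the <code>pkgsCross</code> side of this,
          assuming the sensor is packaged as an ordinary
          <code>callPackage</code>-able derivation (the file name
          <code>sensor.nix</code> is made up): building through the cross
          package set gives every dependency the right split between
          host-runnable tools and target-platform libraries.
        </p>

```nix
# Build the same derivation through nixpkgs' cross package set.
# pkgsCross.aarch64-multiplatform selects build machinery that runs on the
# build host but produces aarch64-linux output; ./sensor.nix is hypothetical.
let
  pkgs = import <nixpkgs> { };
  crossPkgs = pkgs.pkgsCross.aarch64-multiplatform;
in
crossPkgs.callPackage ./sensor.nix { }
```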
        <p>
          After a lot of manual labor, everything finally compiled
          successfully.
        </p>

        <h2>deployment: the next obstacle</h2>
        <p>
          To run the binary on the Raspberry Pi, I had to transfer it there
          along with all its transitive dependencies. Of course, those
          dependencies also had their own dependencies. I copied them all
          over, only to find that the binary still refused to run.
        </p>
        <p>
          Nix uses absolute library paths starting with
          <code>/nix/store</code>, which normally ensures version consistency
          and isolation. However, that system falls apart when you move
          binaries to another machine.
        </p>
        <p>
          I tried statically linking everything into one binary to solve this
          problem, but that did not work either. Static linking simply is not
          supported by all of the dependencies yet.
        </p>

        <h3>back to patching</h3>
        <p>
          So, I went back to manually editing the library paths in the
          binary. At this point, <code>patchelf</code> became my new best
          friend:
        </p>
        <pre>
patchelf --remove-needed libfoo.so --remove-needed libbar.so
patchelf --add-needed libfoo.so --add-needed libbar.so</pre>
        <p>
          Each remove-and-add pair strips the absolute
          <code>/nix/store</code> path and replaces it with just the library
          name. Later, I realized that I could have simply used
          <code>patchelf --set-rpath '$ORIGIN/../lib'</code> instead of
          spending hours on manual path edits.
        </p>
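        <p>
          In hindsight, the whole dance collapses into one rpath per file. A
          dry-run sketch on a throwaway copy of the deployment layout (the
          directory and library names are made up) that only prints the
          <code>patchelf</code> invocations:
        </p>

```shell
# Fake deployment layout: the sensor binary in bin/, its copied .so files
# in lib/ next to it. (All names are placeholders.)
mkdir -p deploy/bin deploy/lib
touch deploy/bin/sensor deploy/lib/libfoo.so deploy/lib/libbar.so

# Print (rather than run) the patchelf calls. '$ORIGIN' is single-quoted so
# the dynamic linker receives it literally; at load time it expands to the
# directory containing the patched file.
for f in deploy/bin/sensor deploy/lib/*.so; do
  echo patchelf --set-rpath '$ORIGIN/../lib' "$f"
done
```

        <p>
          With <code>$ORIGIN</code> baked into the rpath, the binary finds
          its libraries relative to its own location, wherever it is copied.
        </p>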

        <h2>making it run (almost)</h2>
        <p>
          After fixing the paths, I could finally execute the binary using:
        </p>
        <pre>
LD_LIBRARY_PATH=/path/to/libs:$LD_LIBRARY_PATH ./binary</pre>
        <p>
          The binary started, printed some initial logs, and then failed when
          attempting to import an ONNX model through OpenCV. The error
          message indicated that the model file could not be parsed. This was
          confusing because the exact same model had worked before, and it
          still worked perfectly on x86_64 — just not on aarch64.
        </p>

        <h2>debugging phase one</h2>
        <p>
          Running the sensor under GDB confirmed that all the correct
          libraries were being called. The relevant call chain looked like
          this:
        </p>
        <p>
          Rust code →
          <a href="https://github.com/twistedfall/opencv-rust">opencv-rust</a>
          →
          <a href="https://github.com/opencv/opencv/blob/4.x/modules/dnn/src/onnx/onnx_importer.cpp#L283">OpenCV</a>
          → Protobuf
        </p>
        <p>
          My first theory was that the <code>protoc</code> compiler had
          generated code for the wrong architecture or endianness.
        </p>

        <h3>investigations</h3>
        <ul>
          <li>
            I developed a small program to parse the model file directly on
            the Raspberry Pi. It worked, confirming the model file was not
            corrupted.
          </li>
          <li>
            I tried several Protobuf versions. None of them changed the
            outcome.
          </li>
          <li>
            I built a minimal OpenCV program to parse the same model. It also
            worked fine.
          </li>
        </ul>
        <p>
          The conclusion was clear: the import code itself worked, just not
          inside my binary.
        </p>

        <h2>debugging phase two</h2>
        <p>
          I switched from a bottom-up approach to a top-down one and started
          reducing the binary until I found the smallest version that still
          failed. The culprit quickly emerged: simply importing
          <code>tflite</code> with <code>use tflite::Tflite</code> was enough
          to break OpenCV. Two unrelated libraries were somehow breaking each
          other.
        </p>

        <h3>the real root cause</h3>
        <p>
          Remember how Nix uses absolute library paths? And how I mentioned
          that CMake wanted to build everything from source? To get
          TensorFlow Lite building, I had given it the version of Protobuf it
          wanted, and I also exported all its transitive dependencies —
          including that version of Protobuf.
        </p>
        <p>
          This led to a situation where OpenCV loaded the wrong Protobuf
          library. It wasn’t different enough to cause a linker error, but it
          was just different enough to break model parsing in subtle ways.
        </p>
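        <p>
          This kind of mixup is visible from the outside: listing the
          resolved shared objects shows which <code>libprotobuf.so</code> the
          loader will actually map. A sketch with <code>ldd</code>
          (demonstrated on <code>/bin/sh</code> here, standing in for the
          sensor binary):
        </p>

```shell
# ldd resolves each DT_NEEDED entry the same way the dynamic linker would,
# printing the concrete file that will be loaded. Running it on the sensor
# binary would have revealed the duplicated Protobuf; /bin/sh is used here
# only so the command is self-contained.
ldd /bin/sh
```

        <p>
          Each output line maps a requested library name to the absolute path
          that will be loaded; a <code>libprotobuf</code> line pointing into
          the TensorFlow Lite output directory instead of the
          <code>nixpkgs</code> one would have given the game away.
        </p>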

        <h2>the fix</h2>
        <p>
          The fix turned out to be surprisingly simple. I just needed to make
          TensorFlow Lite use the same version of Protobuf that was packaged
          with <code>nixpkgs</code>.
        </p>

        <h2>the end (for now)</h2>
        <p>
          After months of debugging, rebuilding, and manually patching
          binaries, everything finally works again. As a bonus, the sensor is
          now cross-compiled instead of being built on the Raspberry Pi or
          under emulation.
        </p>
      </div>
    </div>
  </body>
</html>