Change-Id: I424b79460e903392f73270ee68fe88fc0ba49811
Reviewed-on: https://go-review.googlesource.com/c/website/+/523635
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Russ Cox 2023-08-28 13:29:22 -04:00
Parent 6fcd42d2ce
Commit 4f91f9e3eb
13 changed files: 692 additions and 0 deletions

_content/blog/rebuild.md Normal file

@@ -0,0 +1,692 @@
---
title: Perfectly Reproducible, Verified Go Toolchains
date: 2023-08-28
by:
- Russ Cox
summary: Go 1.21 is the first perfectly reproducible Go toolchain.
---
One of the key benefits of open-source software is that anyone can read
the source code and inspect what it does.
And yet most software, even open-source software,
is downloaded in the form of compiled binaries,
which are much more difficult to inspect.
If an attacker wanted to run a [supply chain attack](https://cloud.google.com/software-supply-chain-security/docs/attack-vectors)
on an open-source project,
the least visible way would be to replace the binaries being served while
leaving the source code unmodified.
The best way to address this kind of attack is to make open-source software
builds _reproducible_,
meaning that a build that starts with the same sources produces the same
outputs every time it runs.
That way, anyone can verify that posted binaries are free of hidden changes
by building from authentic sources and checking that the rebuilt binaries
are bit-for-bit identical to the posted binaries.
That approach proves the binaries have no backdoors or other changes not
present in the source code,
without having to disassemble or look inside them at all.
Since anyone can verify the binaries, independent groups can easily detect
and report supply chain attacks.
As supply chain security becomes more important,
so do reproducible builds, because they provide a simple way to verify the
posted binaries for open-source projects.
Go 1.21.0 is the first Go toolchain with perfectly reproducible builds.
Earlier toolchains could be reproduced,
but only with significant effort, and probably no one did:
they just trusted that the binaries posted on [go.dev/dl](/dl/) were the correct ones.
Now it's easy to “trust but verify.”
This post explains what goes into making builds reproducible,
examines the many changes we had to make to Go to make Go toolchains reproducible,
and then demonstrates one of the benefits of reproducibility by verifying
the Ubuntu package for Go 1.21.0.
## Making a Build Reproducible {#how}
Computers are generally deterministic, so you might think all builds would
be equally reproducible.
That's only true from a certain point of view.
Let's call a piece of information a _relevant input_ when the output of
a build can change depending on that input.
A build is reproducible if it can be repeated with all the same relevant inputs.
Unfortunately, lots of build tools turn out to incorporate inputs that we
would usually not realize are relevant and that might be difficult to recreate
or provide as input.
Let's call an input an _unintentional input_ when it turns out to be relevant
but we didn't mean it to be.
The most common unintentional input in build systems is the current time.
If a build writes an executable to disk, the file system records the current
time as the executable's modification time.
If the build then packages that file using a tool like “tar” or “zip”,
the modification time is written into the archive.
We certainly didn't want our build to change based on the current time, but it does.
So the current time turns out to be an unintentional input to the build.
Worse, most programs don't let you provide the current time as an input,
so there is no way to repeat this build.
To fix this, we might set the time stamps on created files to Unix time
0 or to a specific time read from one of the build's source files.
That way, the current time is no longer a relevant input to the build.
Common relevant inputs to a build include:
- the specific version of the source code to build;
- the specific versions of dependencies that will be included in the build;
- the operating system running the build, which may affect path names in the resulting binaries;
- the architecture of the CPU on the build system,
  which may affect which optimizations the compiler uses or the layout of certain data structures;
- the compiler version being used, as well as compiler options passed to it, which affect how the code is compiled;
- the name of the directory containing the source code, which may appear in debug information;
- the user name, group name, uid, and gid of the account running the build, which may appear in file metadata in an archive;
- and many more.

To have a reproducible build, every relevant input must be configurable in the build,
and then the binaries must be posted alongside an explicit configuration
listing every relevant input.
If you've done that, you have a reproducible build. Congratulations!
We're not done, though. If the binaries can only be reproduced if you
first find a computer with the right architecture,
install a specific operating system version,
compiler version, put the source code in the right directory,
set your user identity correctly, and so on,
that may be too much work in practice for anyone to bother.
We want builds to be not just reproducible but _easy to reproduce_.
To do that, we need to identify relevant inputs and then,
instead of documenting them, eliminate them.
The build obviously has to depend on the source code being built,
but everything else can be eliminated.
When a build's only relevant input is its source code,
let's call that _perfectly reproducible_.
## Perfectly Reproducible Builds for Go {#go}
As of Go 1.21, the Go toolchain is perfectly reproducible:
its only relevant input is the source code for that build.
We can build a specific toolchain (say, Go for Linux/x86-64) on a Linux/x86-64 host,
or a Windows/ARM64 host, or a FreeBSD/386 host,
or any other host that supports Go, and we can use any Go bootstrap compiler,
including bootstrapping all the way back to Go 1.4's C implementation,
and we can vary any other details.
None of that changes the toolchains that are built.
If we start with the same toolchain source code,
we will get the exact same toolchain binaries out.
This perfect reproducibility is the culmination of efforts dating back originally to Go 1.10,
although most of the effort was concentrated in Go 1.20 and Go 1.21.
This section highlights some of the most interesting relevant inputs that we eliminated.
### Reproducibility in Go 1.10 {#go110}
Go 1.10 introduced a content-aware build cache that decides whether targets
are up-to-date based on a fingerprint of the build inputs instead of file modification times.
Because the toolchain itself is one of those build inputs,
and because Go is written in Go, the [bootstrap process](/s/go15bootstrap)
would only converge if the toolchain build on a single machine was reproducible.
The overall toolchain build looks like this:

<div class="image">
<img src="rebuild/bootstrap.png" srcset="rebuild/bootstrap.png 1x, rebuild/bootstrap@2x.png 2x" width="515" height="177">
</div>

We start by building the sources for the current Go toolchain using an earlier Go version,
the bootstrap toolchain (Go 1.10 used Go 1.4, written in C;
Go 1.21 uses Go 1.17).
That produces “toolchain1”, which we use to build everything again,
producing “toolchain2”, which we use to build everything again,
producing “toolchain3”.
Toolchain1 and toolchain2 have been built from the same sources but with
different Go implementations (compilers and libraries),
so their binaries are certain to be different.
However, if both Go implementations are non-buggy,
correct implementations, toolchain1 and toolchain2 should behave exactly the same.
In particular, when presented with the Go 1.X sources,
toolchain1's output (toolchain2) and toolchain2's output (toolchain3)
should be identical,
meaning toolchain2 and toolchain3 should be identical.
At least, that's the idea. Making that true in practice required removing a couple of unintentional inputs:

**Randomness.** Map iteration and running work in multiple goroutines serialized
with locks both introduce randomness in the order that results may be generated.
This randomness can make the toolchain produce one of several different
possible outputs each time it runs.
To make the build reproducible, we had to find each of these and sort the
relevant list of items before using it to generate output.
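
The usual shape of that fix can be sketched as follows. This is an illustrative example rather than code from the toolchain; `emitSymbols` and the sample symbol names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// emitSymbols renders a name-to-size table. Ranging over the map directly
// would visit entries in a randomized order, so the keys are collected and
// sorted first to make the output identical on every run.
func emitSymbols(sizes map[string]int) string {
	names := make([]string, 0, len(sizes))
	for name := range sizes {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s %d\n", name, sizes[name])
	}
	return b.String()
}

func main() {
	fmt.Print(emitSymbols(map[string]int{
		"main.main":   64,
		"fmt.Println": 128,
		"runtime.gc":  256,
	}))
}
```
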
**Bootstrap Libraries.** Any library used by the compiler that can choose
from multiple different correct outputs might change its output from one
Go version to the next.
If that library output change causes a compiler output change,
then toolchain1 and toolchain2 will not be semantically identical,
and toolchain2 and toolchain3 will not be bit-for-bit identical.
The canonical example is the [`sort`](/pkg/sort/) package,
which can place elements that compare equal in [any order it likes](/blog/compat#output).
A register allocator might sort to prioritize commonly used variables,
and the linker sorts symbols in the data section by size.
To completely eliminate any effect from the sorting algorithm,
the comparison function used must never report two distinct elements as equal.
In practice, this invariant turned out to be too onerous to impose on every
use of sort in the toolchain,
so instead we arranged to copy the Go 1.X `sort` package into the source
tree that is presented to the bootstrap compiler.
That way, the compiler uses the same sort algorithm when using the bootstrap
toolchain as it does when built with itself.
Another package we had to copy was [`compress/zlib`](/pkg/compress/zlib/),
because the linker writes compressed debug information,
and optimizations to compression libraries can change the exact output.
Over time, we've [added other packages to that list too](https://go.googlesource.com/go/+/go1.21.0/src/cmd/dist/buildtool.go#55).
This approach has the added benefit of allowing the Go 1.X compiler to use
new APIs added to those packages immediately,
at the cost that those packages must be written to compile with older versions of Go.
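
To illustrate the invariant mentioned above (a comparison function that never reports two distinct elements as equal), here is a small sketch. It is not how the toolchain solved the problem, since the toolchain copies the `sort` package instead. The `symbol` type and `sortSymbols` function are hypothetical, and the sketch assumes names are unique.

```go
package main

import (
	"fmt"
	"sort"
)

// symbol is a hypothetical stand-in for something a linker might sort.
type symbol struct {
	Name string
	Size int
}

// sortSymbols orders symbols by size and breaks ties by name. Assuming
// names are unique, no two distinct symbols ever compare equal, so the
// result does not depend on how the sort algorithm treats equal elements.
func sortSymbols(syms []symbol) {
	sort.Slice(syms, func(i, j int) bool {
		if syms[i].Size != syms[j].Size {
			return syms[i].Size < syms[j].Size
		}
		return syms[i].Name < syms[j].Name
	})
}

func main() {
	syms := []symbol{{"b", 8}, {"a", 8}, {"c", 4}}
	sortSymbols(syms)
	fmt.Println(syms)
}
```

Without the tie-break on `Name`, the relative order of `a` and `b` would be left to the sort implementation, which is the kind of difference that can change from one Go version to the next.
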
### Reproducibility in Go 1.20 {#go120}
Work on Go 1.20 prepared for both easy reproducible builds and [toolchain management](toolchain)
by removing two more relevant inputs from the toolchain build.

**Host C toolchain.** Some Go packages, most notably `net`,
default to [using `cgo`](cgo) on most operating systems.
In some cases, such as macOS and Windows,
invoking system DLLs using `cgo` is the only reliable way to resolve host names.
When we use `cgo`, though, we invoke the host C toolchain (meaning a specific
C compiler and C library),
and different toolchains have different compilation algorithms and library code,
producing different outputs.
The build graph for a `cgo` package looks like:

<div class="image">
<img src="rebuild/cgo.png" srcset="rebuild/cgo.png 1x, rebuild/cgo@2x.png 2x" width="441" height="344">
</div>

The host C toolchain is therefore a relevant input to the pre-compiled `net.a`
that ships with the toolchain.
For Go 1.20, we decided to fix this by removing `net.a` from the toolchain.
That is, Go 1.20 stopped shipping pre-compiled packages to seed the build cache with.
Now, the first time a program uses package `net`,
the Go toolchain compiles it using the local system's C toolchain and caches that result.
In addition to removing a relevant input from toolchain builds and making
toolchain downloads smaller,
not shipping pre-compiled packages also makes toolchain downloads more portable.
If we build package `net` on one system with one C toolchain and then compile
other parts of the program on a different system with a different C toolchain,
in general there is no guarantee that the two parts can be linked together.
One reason we shipped the pre-compiled `net` package in the first place
was to allow building programs that used package net even on systems without
a C toolchain installed.
If there's no pre-compiled package, what happens on those systems? The
answer varies by operating system,
but in all cases we arranged for the Go toolchain to continue to work well
for building pure Go programs without a host C toolchain.
- On macOS, we rewrote package net using the underlying mechanisms that cgo would use,
without any actual C code.
This avoids invoking the host C toolchain but still emits a binary that
refers to the required system DLLs.
This approach is only possible because every Mac has the same dynamic libraries installed.
Making the non-cgo macOS package net use the system DLLs also meant that
cross-compiled macOS executables now use the system DLLs for network access,
resolving a long-standing feature request.
- On Windows, package net already made direct use of DLLs without C code, so nothing needed to be changed.
- On Unix systems, we cannot assume a specific DLL interface to network code,
but the pure Go version works fine for systems that use typical IP and DNS setups.
Also, it is much easier to install a C toolchain on Unix systems than it
is on macOS and especially Windows.
We changed the `go` command to enable or disable `cgo` automatically based
on whether the system has a C toolchain installed.
Unix systems without a C toolchain fall back to the pure Go version of package net,
and in the rare cases where that's not good enough,
they can install a C toolchain.

Having dropped the pre-compiled packages,
the only part of the Go toolchain that still depended on the host C toolchain
was binaries built using package net,
specifically the `go` command.
With the macOS improvements, it was now viable to build those commands with `cgo` disabled,
completely removing the host C toolchain as an input,
but we left that final step for Go 1.21.

**Host dynamic linker.** When programs use `cgo` on a system using dynamically linked C libraries,
the resulting binaries contain the path to the system's dynamic linker,
something like `/lib64/ld-linux-x86-64.so.2`.
If the path is wrong, the binaries don't run.
Typically each operating system/architecture combination has a single correct
answer for this path.
Unfortunately, musl-based Linuxes like Alpine Linux use a different dynamic
linker than glibc-based Linuxes like Ubuntu.
To make Go run at all on Alpine Linux, the Go bootstrap process looked like this:

<div class="image">
<img src="rebuild/linker1.png" srcset="rebuild/linker1.png 1x, rebuild/linker1@2x.png 2x" width="480" height="209">
</div>

The bootstrap program cmd/dist inspected the local system's dynamic linker
and wrote that value into a new source file compiled along with the rest
of the linker sources,
effectively hard-coding that default into the linker itself.
Then when the linker built a program from a set of compiled packages,
it used that default.
The result is that a Go toolchain built on Alpine is different from a toolchain built on Ubuntu:
the host configuration is a relevant input to the toolchain build.
This is a reproducibility problem but also a portability problem:
a Go toolchain built on Alpine doesn't build working binaries or even
run on Ubuntu, and vice versa.
For Go 1.20, we took a step toward fixing the reproducibility problem by
changing the linker to consult the host configuration when it is running,
instead of having a default hard-coded at toolchain build time:

<div class="image">
<img src="rebuild/linker2.png" srcset="rebuild/linker2.png 1x, rebuild/linker2@2x.png 2x" width="450" height="175">
</div>

This fixed the portability of the linker binary on Alpine Linux,
although not the overall toolchain, since the `go` command still used package
`net` and therefore `cgo` and therefore had a dynamic linker reference in its own binary.
Just as in the previous section, compiling the `go` command without `cgo`
enabled would fix this,
but we left that change for Go 1.21.
(We didn't feel there was enough time left in the Go 1.20 cycle to test
such a change properly.)
### Reproducibility in Go 1.21 {#go121}
For Go 1.21, the goal of perfect reproducibility was in sight,
and we took care of the remaining, mostly small,
relevant inputs.

**Host C toolchain and dynamic linker.** As discussed above,
Go 1.20 took important steps toward removing the host C toolchain and dynamic
linker as relevant inputs.
Go 1.21 completed the removal of these relevant inputs by building the toolchain
with `cgo` disabled.
This improved portability of the toolchain too:
Go 1.21 is the first Go release where the standard Go toolchain runs unmodified
on Alpine Linux systems.
Removing these relevant inputs made it possible to cross-compile a Go toolchain
from a different system without any loss in functionality.
That in turn improved the supply chain security of the Go toolchain:
we can now build Go toolchains for all target systems using a trusted Linux/x86-64 system,
instead of needing to arrange a separate trusted system for each target.
As a result, Go 1.21 is the first release to include posted binaries for
all systems at [go.dev/dl/](/dl/).

**Source directory.** Go programs include full paths in the runtime and debugging metadata,
so that when a program crashes or is run in a debugger,
stack traces include the full path to the source file,
not just the name of the file in an unspecified directory.
Unfortunately, including the full path makes the directory where the source
code is stored a relevant input to the build.
To fix this, Go 1.21 changed the release toolchain builds to install commands
like the compiler using `go install -trimpath`,
which replaces the source directory with the module path of the code.
If a released compiler crashes, the stack trace will print paths like `cmd/compile/main.go`
instead of `/home/user/go/src/cmd/compile/main.go`.
Since the full paths would refer to a directory on a different machine anyway,
this rewrite is no loss.
On the other hand, for non-release builds,
we keep the full path, so that when developers working on the compiler itself cause it to crash,
IDEs and other tools reading those crashes can easily find the correct source file.

**Host operating system.** Paths on Windows systems are backslash-separated,
like `cmd\compile\main.go`.
Other systems use forward slashes, like `cmd/compile/main.go`.
Although earlier versions of Go had normalized most of these paths to use forward slashes,
one inconsistency had crept back in, causing slightly different toolchain builds on Windows.
We found and fixed the bug.
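
The kind of normalization involved can be sketched like this. It is only an illustration, not the actual fix, and the helper name `toSlash` is hypothetical. Note that the standard `filepath.ToSlash` replaces only the host's own separator, so on a non-Windows host it leaves backslashes alone; the sketch therefore rewrites them explicitly.

```go
package main

import (
	"fmt"
	"strings"
)

// toSlash rewrites backslash separators as forward slashes regardless of
// which operating system the program is running on, so that recorded paths
// are the same no matter where the build happens.
func toSlash(path string) string {
	return strings.ReplaceAll(path, `\`, "/")
}

func main() {
	fmt.Println(toSlash(`cmd\compile\main.go`))
	fmt.Println(toSlash("cmd/compile/main.go"))
}
```
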
**Host architecture.** Go runs on a variety of ARM systems and can emit
code using a software library for floating-point math (SWFP) or using hardware
floating-point instructions (HWFP).
Toolchains defaulting to one mode or the other will necessarily differ.
Like we saw with the dynamic linker earlier,
the Go bootstrap process inspected the build system to make sure that the
resulting toolchain worked on that system.
For historical reasons, the rule was “assume SWFP unless the build is
running on an ARM system with floating-point hardware”,
with cross-compiled toolchains assuming SWFP.
The vast majority of ARM systems today do have floating-point hardware,
so this introduced an unnecessary difference between natively compiled and
cross-compiled toolchains,
and as a further wrinkle, Windows ARM builds always assumed HWFP,
making the decision operating system-dependent.
We changed the rule to be “assume HWFP unless the build is running on
an ARM system without floating-point hardware”.
This way, cross-compilation and builds on modern ARM systems produce identical toolchains.

**Packaging logic.** All the code to create the actual toolchain archives
we post for download lived in a separate Git repository,
golang.org/x/build, and the exact details of how archives get packaged do change over time.
If you wanted to reproduce those archives,
you needed to have the right version of that repository.
We removed this relevant input by moving the code to package the archives
into the main Go source tree, as `cmd/distpack`.
As of Go 1.21, if you have the sources for a given version of Go,
you also have the sources for packaging the archives.
The golang.org/x/build repository is no longer a relevant input.

**User IDs.** The tar archives we posted for download were built from a
distribution written to the file system,
and using [`tar.FileInfoHeader`](/pkg/archive/tar/#FileInfoHeader) copies
the user and group IDs from the file system into the tar file,
making the user running the build a relevant input.
We changed the archiving code to clear these.
**Current time.** Like with user IDs, the tar and zip archives we posted
for download had been built by copying the file system modification times into the archives,
making the current time a relevant input.
We could have cleared the time, but we thought it would look surprising
and possibly even break some tools to use the Unix or MS-DOS zero time.
Instead, we changed the go/VERSION file stored in the repository to add
the time associated with that version:

    $ cat go1.21.0/VERSION
    go1.21.0
    time 2023-08-04T20:14:06Z
    $

The packagers now copy the time from the VERSION file when writing files to archives,
instead of copying the local file's modification times.
**Cryptographic signing keys.** The Go toolchain for macOS won't run on
end-user systems unless we sign the binaries with an Apple-approved signing key.
We use an internal system to get them signed with Google's signing key,
and obviously we cannot share that secret key in order to allow others to
reproduce the signed binaries.
Instead, we wrote a verifier that can check whether two binaries are identical
except for their signatures.

**OS-specific packagers.** We use the Xcode tools `pkgbuild` and `productbuild`
to create the downloadable macOS PKG installer,
and we use WiX to create the downloadable Windows MSI installer.
We don't want verifiers to need the same exact versions of those tools,
so we took the same approach as for the cryptographic signing keys,
writing a verifier that can look inside the packages and check that the
toolchain files are exactly as expected.
## Verifying the Go Toolchains {#verify}
It's not enough to make Go toolchains reproducible once.
We want to make sure they stay reproducible,
and we want to make sure others can reproduce them easily.
To keep ourselves honest, we now build all Go distributions on both a trusted
Linux/x86-64 system and a Windows/x86-64 system.
Except for the architecture, the two systems have almost nothing in common.
The two systems must produce bit-for-bit identical archives or else we do
not proceed with the release.
To allow others to verify that we're honest,
we've written and published a verifier,
[`golang.org/x/build/cmd/gorebuild`](https://pkg.go.dev/golang.org/x/build/cmd/gorebuild).
That program will start with the source code in our Git repository and rebuild the
current Go versions, checking that they match the archives posted on [go.dev/dl](/dl/).
Most archives are required to match bit-for-bit.
As mentioned above, there are three exceptions where a more relaxed check is used:
- The macOS tar.gz file is expected to differ,
but then the verifier compares the contents inside.
The rebuilt and posted copies must contain the same files,
and all the files must match exactly, except for executable binaries.
Executable binaries must match exactly after stripping code signatures.
- The macOS PKG installer is not rebuilt. Instead,
the verifier reads the files inside the PKG installer and checks that they
match the macOS tar.gz exactly,
again after code signature stripping.
In the long term, the PKG creation is trivial enough that it could potentially
be added to cmd/distpack,
but the verifier would still have to parse the PKG file to run the signature-ignoring
code executable comparison.
- The Windows MSI installer is not rebuilt.
Instead, the verifier invokes the Linux program `msiextract` to extract
the files inside and check that they match the rebuilt Windows zip file exactly.
In the long term, perhaps the MSI creation could be added to cmd/distpack,
and then the verifier could use a bit-for-bit MSI comparison.

We run `gorebuild` nightly, posting the results at [go.dev/rebuild](/rebuild),
and of course anyone else can run it too.
## Verifying Ubuntu's Go Toolchain {#ubuntu}
The Go toolchain's easily reproducible builds should mean that the binaries
in the toolchains posted on go.dev match the binaries included in other packaging systems,
even when those packagers build from source.
Even if the packagers have compiled with different configurations or other changes,
the easily reproducible builds should still make it easy to reproduce their binaries.
To demonstrate this, let's reproduce the Ubuntu `golang-1.21` package
version `1.21.0-1` for Linux/x86-64.
To start, we need to download and extract the Ubuntu packages,
which are [ar(1) archives](https://linux.die.net/man/1/ar) containing zstd-compressed tar archives:
{{raw `
    $ mkdir deb
    $ cd deb
    $ curl -LO http://mirrors.kernel.org/ubuntu/pool/main/g/golang-1.21/golang-1.21-src_1.21.0-1_all.deb
    $ ar xv golang-1.21-src_1.21.0-1_all.deb
    x - debian-binary
    x - control.tar.zst
    x - data.tar.zst
    $ unzstd < data.tar.zst | tar xv
    ...
    x ./usr/share/go-1.21/src/archive/tar/common.go
    x ./usr/share/go-1.21/src/archive/tar/example_test.go
    x ./usr/share/go-1.21/src/archive/tar/format.go
    x ./usr/share/go-1.21/src/archive/tar/fuzz_test.go
    ...
    $
`}}
That was the source archive. Now the amd64 binary archive:
{{raw `
    $ rm -f debian-binary *.zst
    $ curl -LO http://mirrors.kernel.org/ubuntu/pool/main/g/golang-1.21/golang-1.21-go_1.21.0-1_amd64.deb
    $ ar xv golang-1.21-go_1.21.0-1_amd64.deb
    x - debian-binary
    x - control.tar.zst
    x - data.tar.zst
    $ unzstd < data.tar.zst | tar xv | grep -v '/$'
    ...
    x ./usr/lib/go-1.21/bin/go
    x ./usr/lib/go-1.21/bin/gofmt
    x ./usr/lib/go-1.21/go.env
    x ./usr/lib/go-1.21/pkg/tool/linux_amd64/addr2line
    x ./usr/lib/go-1.21/pkg/tool/linux_amd64/asm
    x ./usr/lib/go-1.21/pkg/tool/linux_amd64/buildid
    ...
    $
`}}
Ubuntu splits the normal Go tree into two halves,
in /usr/share/go-1.21 and /usr/lib/go-1.21.
Let's put them back together:

    $ mkdir go-ubuntu
    $ cp -R usr/share/go-1.21/* usr/lib/go-1.21/* go-ubuntu
    cp: cannot overwrite directory go-ubuntu/api with non-directory usr/lib/go-1.21/api
    cp: cannot overwrite directory go-ubuntu/misc with non-directory usr/lib/go-1.21/misc
    cp: cannot overwrite directory go-ubuntu/pkg/include with non-directory usr/lib/go-1.21/pkg/include
    cp: cannot overwrite directory go-ubuntu/src with non-directory usr/lib/go-1.21/src
    cp: cannot overwrite directory go-ubuntu/test with non-directory usr/lib/go-1.21/test
    $

The errors are complaining about copying symlinks, which we can ignore.
Now we need to download and extract the upstream Go sources:

    $ curl -LO https://go.googlesource.com/go/+archive/refs/tags/go1.21.0.tar.gz
    $ mkdir go-clean
    $ cd go-clean
    $ curl -L https://go.googlesource.com/go/+archive/refs/tags/go1.21.0.tar.gz | tar xzv
    ...
    x src/archive/tar/common.go
    x src/archive/tar/example_test.go
    x src/archive/tar/format.go
    x src/archive/tar/fuzz_test.go
    ...
    $

To skip some trial and error, it turns out that Ubuntu builds Go with `GO386=softfloat`,
which forces the use of software floating point when compiling for 32-bit x86,
and strips (removes symbol tables from) the resulting ELF binaries.
Let's start with a `GO386=softfloat` build:

    $ cd src
    $ GOOS=linux GO386=softfloat ./make.bash -distpack
    Building Go cmd/dist using /Users/rsc/sdk/go1.17.13. (go1.17.13 darwin/amd64)
    Building Go toolchain1 using /Users/rsc/sdk/go1.17.13.
    Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
    Building Go toolchain2 using go_bootstrap and Go toolchain1.
    Building Go toolchain3 using go_bootstrap and Go toolchain2.
    Building commands for host, darwin/amd64.
    Building packages and commands for target, linux/amd64.
    Packaging archives for linux/amd64.
    distpack: 818d46ede85682dd go1.21.0.src.tar.gz
    distpack: 4fcd8651d084a03d go1.21.0.linux-amd64.tar.gz
    distpack: eab8ed80024f444f v0.0.1-go1.21.0.linux-amd64.zip
    distpack: 58528cce1848ddf4 v0.0.1-go1.21.0.linux-amd64.mod
    distpack: d8da1f27296edea4 v0.0.1-go1.21.0.linux-amd64.info
    ---
    Installed Go for linux/amd64 in /Users/rsc/deb/go-clean
    Installed commands in /Users/rsc/deb/go-clean/bin
    *** You need to add /Users/rsc/deb/go-clean/bin to your PATH.
    $

That left the standard package in `pkg/distpack/go1.21.0.linux-amd64.tar.gz`.
Let's unpack it and strip the binaries to match Ubuntu:

    $ cd ../..
    $ tar xzvf go-clean/pkg/distpack/go1.21.0.linux-amd64.tar.gz
    x go/CONTRIBUTING.md
    x go/LICENSE
    x go/PATENTS
    x go/README.md
    x go/SECURITY.md
    x go/VERSION
    ...
    $ elfstrip go/bin/* go/pkg/tool/linux_amd64/*
    $

Now we can diff the Go toolchain we've created on our Mac with the Go toolchain that Ubuntu ships:
{{raw `
    $ diff -r go go-ubuntu
    Only in go: CONTRIBUTING.md
    Only in go: LICENSE
    Only in go: PATENTS
    Only in go: README.md
    Only in go: SECURITY.md
    Only in go: codereview.cfg
    Only in go: doc
    Only in go: lib
    Binary files go/misc/chrome/gophertool/gopher.png and go-ubuntu/misc/chrome/gophertool/gopher.png differ
    Only in go-ubuntu/pkg/tool/linux_amd64: dist
    Only in go-ubuntu/pkg/tool/linux_amd64: distpack
    Only in go/src: all.rc
    Only in go/src: clean.rc
    Only in go/src: make.rc
    Only in go/src: run.rc
    diff -r go/src/syscall/mksyscall.pl go-ubuntu/src/syscall/mksyscall.pl
    1c1
    < #!/usr/bin/env perl
    ---
    > #! /usr/bin/perl
    ...
    $
`}}
We've successfully reproduced the Ubuntu package's executables and identified
the complete set of changes that remain:
- Various metadata and supporting files have been deleted.
- The `gopher.png` file has been modified. On closer inspection the two are
identical except for an embedded timestamp that Ubuntu has updated.
Perhaps Ubuntu's packaging scripts recompressed the png with a tool that
rewrites the timestamp even when it cannot improve on the existing compression.
- The binaries `dist` and `distpack`, which are built during bootstrap but
not included in standard archives,
have been included in the Ubuntu package.
- The Plan 9 build scripts (`*.rc`) have been deleted, although the Windows build scripts (`*.bat`) remain.
- `mksyscall.pl` and seven other Perl scripts not shown have had their headers changed.

Note in particular that we've reconstructed the toolchain binaries bit-for-bit:
they do not show up in the diff at all.
That is, we proved that the Ubuntu Go binaries correspond exactly to the
upstream Go sources.
Even better, we proved this without using any Ubuntu software at all:
these commands were run on a Mac, and [`unzstd`](https://github.com/rsc/tmp/blob/master/unzstd/)
and [`elfstrip`](https://github.com/rsc/tmp/blob/master/elfstrip/) are short Go programs.
A sophisticated attacker might insert malicious code into an Ubuntu package
by changing the package-creation tools.
If they did, reproducing the Go Ubuntu package from clean sources using
those malicious tools would still produce bit-for-bit identical copies of
the malicious packages.
This attack would be invisible to that kind of rebuild,
much like [Ken Thompson's compiler attack](https://dl.acm.org/doi/10.1145/358198.358210).
Verifying the Ubuntu packages using no Ubuntu software at all is a much
stronger check.
Go's perfectly reproducible builds, which don't depend on unintended
details like the host operating system,
host architecture, and host C toolchain, are what make this stronger check possible.
(As an aside for the historical record, Ken Thompson told me once that his
attack was in fact detected,
because the compiler build stopped being reproducible.
It had a bug: a string constant in the backdoor added to the compiler was
imperfectly handled and grew by a single NUL byte each time the compiler compiled itself.
Eventually someone noticed the non-reproducible build and tried to find the cause by compiling to assembly.
The compiler's backdoor did not reproduce itself into assembly output at all,
so assembling that output removed the backdoor.)
## Conclusion
Reproducible builds are an important tool for strengthening the open-source supply chain.
Frameworks like [SLSA](https://slsa.dev/) focus on provenance and a software
chain of custody that can be used to inform decisions about trust.
Reproducible builds complement that approach by providing a way to verify
that the trust is well-placed.
Perfect reproducibility (when the source files are the build's only relevant
input) is only possible for programs that build themselves,
like compiler toolchains.
It is a lofty but worthwhile goal precisely because self-hosting compiler
toolchains are otherwise quite difficult to verify.
Go's perfect reproducibility means that,
assuming packagers don't modify the source code,
every repackaging of Go 1.21.0 for Linux/x86-64 (substitute your favorite
system) in any form should be distributing exactly the same binaries,
even when they all build from source.
We've seen that this is not quite true for Ubuntu Linux,
but perfect reproducibility still lets us reproduce the Ubuntu packaging
using a very different, non-Ubuntu system.
Ideally all open source software distributed in binary form would have easy-to-reproduce builds.
In practice, as we've seen in this post,
it is very easy for unintended inputs to leak into builds.
For Go programs that don't need `cgo`, a reproducible build is as simple
as compiling with `CGO_ENABLED=0 go build -trimpath`.
Disabling `cgo` removes the host C toolchain as a relevant input,
and `-trimpath` removes the current directory.
If your program does need `cgo`, you need to arrange for a specific host
C toolchain version before running `go build`,
such as by running the build in a specific virtual machine or container image.
Moving beyond Go, the [Reproducible Builds](https://reproducible-builds.org/)
project aims to improve reproducibility of all open source and is a good
starting point for more information about making your own software builds reproducible.

Binary files added (not shown):

- _content/blog/rebuild/bootstrap.graffle
- _content/blog/rebuild/bootstrap.png (17 KiB)
- _content/blog/rebuild/bootstrap@2x.png (44 KiB)
- _content/blog/rebuild/cgo.graffle
- _content/blog/rebuild/cgo.png (19 KiB)
- _content/blog/rebuild/cgo@2x.png (45 KiB)
- _content/blog/rebuild/linker1.graffle
- _content/blog/rebuild/linker1.png (14 KiB)
- _content/blog/rebuild/linker1@2x.png (35 KiB)
- _content/blog/rebuild/linker2.graffle
- _content/blog/rebuild/linker2.png (12 KiB)
- _content/blog/rebuild/linker2@2x.png (28 KiB)