We hit GKE bugs and changes when upgrading from GKE 1.2 to 1.4.
The main issue is that Kubernetes doesn't reserve CPU or memory for
itself on nodes, so things were OOMing and getting killed. And when
Docker or Kubernetes got killed themselves, they were wedging and not
recovering.
So we're going to run a DaemonSet (a pod on every node) to reserve space
for Kubernetes itself. That's not in this CL.
But this CL got us limping along and was already in production. It
doubles resource RAM usage for jobs, so fewer things schedule per node.
While we're at it, let jobs use more CPU if it's available.
Also, disable auto-scaling. It was off before by hand. Force it off
programmatically too. And make the node count 5, like it was by hand.
Also, force un-graceful pod deletes, since GKE 1.3 or something
introduced a graceful-vs-ungraceful distinction, which we weren't
handling previously, so pods were never being deleted.
Change-Id: I3606e4e2e92c496d8194503d510921bd1614d34e
Reviewed-on: https://go-review.googlesource.com/33490
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Our builders have names of the form "GOOS-GOARCH" or
"GOOS-GOARCH-suffix".
Over time we've grown many builders. This CL doesn't change
that. Builders continue to be named and operate as before.
Previously the build configuration file (dashboard/builders.go) made
each builder type ("linux-amd64-race", etc) define how to create a
host running a buildlet of that type, even though many builders had
identical host configs. For example, these builders all share the same
host type (a Kubernetes container):
linux-amd64
linux-amd64-race
linux-386
linux-386-387
And these are the same host type (a GCE VM):
windows-amd64-gce
windows-amd64-race
windows-386-gce
This CL creates a new concept of a "hostType" which defines how
the buildlet is created (Kube, GCE, Reverse, and how), and then each
builder itself references a host type.
Users never see the hostType (except perhaps in gomote list output),
and they never need to care about it.
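A minimal sketch of the hostType indirection described above. The type names, field names, host type names, and image names are illustrative assumptions and are not taken from dashboard/builders.go; the point is only that several builders resolve to one shared host config.

```go
package main

import "fmt"

// HostConfig says how to create a buildlet host. All fields are
// hypothetical; the real config may look different.
type HostConfig struct {
	HostType  string // hypothetical name, e.g. "host-linux-kube"
	KubeImage string // set for Kubernetes container hosts
	VMImage   string // set for GCE VM hosts
	IsReverse bool   // set for reverse buildlets
}

// BuildConfig describes one builder and names the host type it runs on.
type BuildConfig struct {
	Name     string
	HostType string
}

// hosts holds each host definition exactly once.
var hosts = map[string]*HostConfig{
	"host-linux-kube":  {HostType: "host-linux-kube", KubeImage: "example-linux-image"},
	"host-windows-gce": {HostType: "host-windows-gce", VMImage: "example-windows-image"},
}

// builders keeps the existing builder names and points each at a host type.
var builders = map[string]*BuildConfig{
	"linux-amd64":        {Name: "linux-amd64", HostType: "host-linux-kube"},
	"linux-amd64-race":   {Name: "linux-amd64-race", HostType: "host-linux-kube"},
	"linux-386":          {Name: "linux-386", HostType: "host-linux-kube"},
	"linux-386-387":      {Name: "linux-386-387", HostType: "host-linux-kube"},
	"windows-amd64-gce":  {Name: "windows-amd64-gce", HostType: "host-windows-gce"},
	"windows-amd64-race": {Name: "windows-amd64-race", HostType: "host-windows-gce"},
	"windows-386-gce":    {Name: "windows-386-gce", HostType: "host-windows-gce"},
}

// hostFor resolves a builder name to its shared host configuration.
func hostFor(builder string) (*HostConfig, error) {
	bc, ok := builders[builder]
	if !ok {
		return nil, fmt.Errorf("unknown builder %q", builder)
	}
	hc, ok := hosts[bc.HostType]
	if !ok {
		return nil, fmt.Errorf("builder %q references unknown host type %q", builder, bc.HostType)
	}
	return hc, nil
}

func main() {
	for _, name := range []string{"linux-amd64", "linux-386-387", "windows-amd64-race"} {
		hc, err := hostFor(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s runs on %s\n", name, hc.HostType)
	}
}
```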
Reverse buildlets can now only be one hostType at a time, which
simplifies things. We stopped using multiple roles per machine
once we moved to VMs for OS X.
gomote continues to operate as it did previously but its underlying
protocol changed and clients will need to be updated. As a new
feature, gomote now has a new flag to let you reuse a buildlet host
connection for different builder rules if they share the same
underlying host type. But users can ignore that.
This CL is a long-standing TODO (previously attempted and aborted) and
will make many things easier and faster, including the linux-arm
cross-compilation effort, and keeping pre-warmed buildlets of VM types
ready to go.
Updates golang/go#17104
Change-Id: Iad8387f48680424a8441e878a2f4762bf79ea4d2
Reviewed-on: https://go-review.googlesource.com/29551
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Also, add count of fds and goroutines to the coordinator's status
page.
Change-Id: I857e609623cfa280716d5d079180d0e4021d0bac
Reviewed-on: https://go-review.googlesource.com/27550
Reviewed-by: Quentin Smith <quentin@golang.org>
ctx was initialized in the wrong place.
Change-Id: I7fb3c56071a3e4d5cc2199e07fcc4b8ecbc7a674
Reviewed-on: https://go-review.googlesource.com/22967
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Improvements to support rapid scheduling of many build jobs:
- Retry logic in Kubernetes client to handle sporadic connection
closes from their API server under heavy load
- Cluster autoscaler scales on default CPU utilization metric
- Debug mode allows scheduling multiple builds to test scaling
- Account for scheduled vs. provisioned resources in a cluster
and use that information to estimate when a build's pod
will be scheduled and in running state
- Use estimated scheduled time to set context timeout
- Track pod lifecycle (requested time, estimated available time,
actual available time, terminate time, etc)
Change-Id: I14d6c5e01af0970dbb3390a29d1ee5c43049fff8
Reviewed-on: https://go-review.googlesource.com/19524
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
buildenv.Environment type defines configuration options:
- Coordinator uses the GCE project name to look up config. A custom
config name can be provided at runtime to override.
- The conventional prod and stage project names ('symbolic-datum-552'
and 'go-dashboard-dev') map to prod and staging configuration structs.
- Production and staging status is explicitly defined in configuration.
- GCS bucket names for buildlet, logs, and snapshots are
configurable.
Change-Id: I7e6d7874eb0bdfe35dbdd5fcf6212ab50d576b88
Reviewed-on: https://go-review.googlesource.com/19502
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
* status page shows kube pool details
* pods created by the coordinator are tracked
* pods that fail to create are deleted
* pods older than delete-at are deleted
* pods created by a different coordinator are deleted
Updates golang/go#12546
Change-Id: I4c4f8ff906962b4a014a66d0a9d490ff17710d62
Reviewed-on: https://go-review.googlesource.com/16101
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
* Replaced cancel with context.Context
* StartPod can be canceled
* Wait for buildlet to come online, but fail fast if pod fails first
* Support timeout waiting for pod to leave pending phase
* Use Kubernetes watch API (long poll)
Updates golang/go#12546
Change-Id: I792a3b8fed615362a0290feee7de0c2cefe43c0e
Reviewed-on: https://go-review.googlesource.com/15285
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
* set correct metadata as env vars on each container in pod
* stage0 detects when running in a pod
* add scope to support farmer on GCE communicating with Google
Container Engine API
* dev mode supports GCE and Kubernetes even when not running
on GCE
* pod buildlet returns a configured http.Client
* pod buildlets work, but pods are not removed after completion
(or on creation failure)
Updates golang/go#12546
Change-Id: If91673b49223130c1e7077c130f1abe1e7966d02
Reviewed-on: https://go-review.googlesource.com/15041
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Find the default Kubernetes cluster and configure a client to talk to it.
Use application default credentials.
Updates golang/go#12546
Change-Id: Ifb1ce57f52f4fbbee3267f8cc3cf02a78146bd5b
Reviewed-on: https://go-review.googlesource.com/14532
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Previously it wasn't noticing their death until the next health check.
Take advantage of the fact that the revdial is always blocked in a Read, so it
will see a TCP shutdown in the case of normal shutdowns. (health checks
will still catch disappearing machines)
Change-Id: I9a7f60a38b3acaf02057b2da9e0cbc91d328f651
Reviewed-on: https://go-review.googlesource.com/14736
Reviewed-by: Andrew Gerrand <adg@golang.org>
should be rare now.
Change-Id: Icc4bfd13c8dfe8f2e189db819bc0d552f35fb3c9
Reviewed-on: https://go-review.googlesource.com/14731
Reviewed-by: Andrew Gerrand <adg@golang.org>
Use the new Go 1.5 mechanism to abort HTTP requests with a channel.
This is respected by the Go 1.5 http.Transport, which we always use.
This CL shouldn't be necessary (CL 14700 was the real bug), but
doesn't hurt. This still would've probably prevented most of
golang/go#12666
Change-Id: I6890e016ee04183fc0d600baed8046c2f79113d8
Reviewed-on: https://go-review.googlesource.com/14701
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Now Close always means destroy.
The old code & API was a half-baked implementation of a (lack of)
design where there was a difference between being done with a buildlet
(with it possibly being reused by somebody else?) and you wanting to
nuke it completely. Unfortunately this just grew messier and more
broken over time.
This attempts to clean it all up.
We can add the sharing ideas back later when there's actually a design
and implementation. (We're not losing anything with this CL, because
nothing ever shared buildlets)
In fact, this CL fixes a problem where reverse buildlets weren't
getting their underlying net.Conns closed when their healthchecks
failed, leading to 4+ hour (and counting) build hangs due to buildlets
getting killed at inopportune moments (e.g. me testing running a
reverse buildlet on my home mac to help out with the dashboard
backlog)
Change-Id: I07be09f4d5f0f09d35e51e41c48b1296b71bb9b5
Reviewed-on: https://go-review.googlesource.com/14585
Reviewed-by: David Crawshaw <crawshaw@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
* reverse buildlet rework (multiplexed TCP connections, instead
of a hacky reverse roundtripper)
* scaleway ARM image improvements
* parallel gzip implementation, which makes things ~8x faster on
Scaleway.
* merge watcher into the coordinator, for easier deployments
Change-Id: I55d769f982e6583b261435309faa1f718a15fde1
Reviewed-on: https://go-review.googlesource.com/12665
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This creates a mechanism for clients (such as cmd/release and
cmd/gomote) to obtain buildlets via the coordinator. Previously
cmd/release and cmd/gomote could only create GCE VMs themselves, and
required the GCE project's credentials. In addition to the awkwardness
of needing to hand out the GCE credentials, it also meant ARM and
Darwin buildlets (which use the reverse buildlet pool) weren't usable.
Instead, this creates a new auth mechanism where the coordinator is
contacted over TLS with key pinning (the CA system isn't used) in the
same way that the reverse builders already dialed into the
coordinator, and then a "user build type" and hash are sent as the
username and password. The same master key is used to sign user
builder keys, and they always start with "user-". (which isn't a GOOS).
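To make the username/password scheme above concrete, here is one way such a check could work. The choice of HMAC-SHA256 with hex encoding is an assumption of this sketch, as are the function names; the source only says that the master key signs user builder keys and that user names start with "user-".

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// userKey derives the key for a user build type from the master key.
// HMAC-SHA256 is an assumed construction, not confirmed by the source.
func userKey(masterKey []byte, user string) string {
	h := hmac.New(sha256.New, masterKey)
	h.Write([]byte(user))
	return hex.EncodeToString(h.Sum(nil))
}

// checkUser validates a (username, password) pair as sent by a client.
func checkUser(masterKey []byte, user, pass string) error {
	// "user-" cannot collide with a builder name, since it isn't a GOOS.
	if !strings.HasPrefix(user, "user-") {
		return fmt.Errorf("%q is not a user build type", user)
	}
	want := userKey(masterKey, user)
	if !hmac.Equal([]byte(want), []byte(pass)) {
		return fmt.Errorf("bad key for %q", user)
	}
	return nil
}

func main() {
	master := []byte("example-master-key") // placeholder secret
	key := userKey(master, "user-alice")
	fmt.Println("valid key accepted:", checkUser(master, "user-alice", key) == nil)
	fmt.Println("wrong key rejected:", checkUser(master, "user-alice", "nope") != nil)
}
```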
Then the coordinator provides an API to create and list buildlets.
They auto-expire after a duration and are auto-renewed upon use.
The buildlet library (as used by cmd/release etc) then proxies HTTP
requests via the coordinator to the backend buildlet.
See doc/remote-buildlet.txt for protocol details.
Change-Id: I12e27eae788fdd91927cb182b950893dc759f8e9
Reviewed-on: https://go-review.googlesource.com/11901
Reviewed-by: Andrew Gerrand <adg@golang.org>
Now you have to fail 3 in a row before you're killed.
This helped Plan 9 a bunch for a few days, before the change was
reverted (and then it started failing consistently again).
We still don't know why Plan 9 sucks at replying to heartbeats.
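The "3 in a row" rule amounts to a counter that resets on any success. A minimal sketch, with a hypothetical healthTracker type standing in for whatever state the coordinator actually keeps per buildlet:

```go
package main

import "fmt"

// healthTracker counts consecutive failed heartbeats.
type healthTracker struct {
	maxFails int // failures in a row before the buildlet is killed
	fails    int // current run of consecutive failures
}

// record notes one heartbeat result and reports whether the buildlet
// should now be considered dead.
func (h *healthTracker) record(ok bool) (dead bool) {
	if ok {
		h.fails = 0 // any success resets the run
		return false
	}
	h.fails++
	return h.fails >= h.maxFails
}

func main() {
	h := &healthTracker{maxFails: 3}
	for i, ok := range []bool{false, false, true, false, false, false} {
		fmt.Printf("heartbeat %d ok=%v dead=%v\n", i, ok, h.record(ok))
	}
}
```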
Change-Id: Ic64d2e8fb75f544c7c3e9a62c28ab9e20ff8392d
Reviewed-on: https://go-review.googlesource.com/11132
Reviewed-by: Andrew Gerrand <adg@golang.org>
Reviewed-by: David du Colombier <0intro@gmail.com>
Add background heartbeats to detect dead GCE VMs (the OpenBSD
buildlets seem to hang a lot).
Add timeouts to test executions.
Take helper buildlets out of service once they're marked bad.
Keep the in-order buildlet running forever when sharding tests, in
case all the helpers die. (observed once)
Keep a cache of recently deleted VMs and don't try to delete VMs again
if we've recently deleted them. (they're slow to delete)
Make reverse buildlets more paranoid in their health checking and closing
of the connection.
Make status page link to /try set URLs.
Also, better logging (more sometimes, less others), and misc bug fixes.
Change-Id: I57a5e8e39381234006cac4dd799b655d64be71bb
Reviewed-on: https://go-review.googlesource.com/10981
Reviewed-by: Andrew Gerrand <adg@golang.org>
Also: heartbeat reverse buildlets more often
Also: adjust cmd/go test expected duration.
Change-Id: I75fbb66a0b20ac7357ad7cf78fc545101ac9aa33
Reviewed-on: https://go-review.googlesource.com/10884
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
The reverse buildlets' RoundTrip calls are hanging, which is its own problem,
but this calling code should be robust and time out anyway.
Change-Id: Id9e3e1d9feb6ffa58cc0995d0623bd90845bb9d6
Reviewed-on: https://go-review.googlesource.com/10847
Reviewed-by: Andrew Gerrand <adg@golang.org>
And deal with Preemptible resource exhaustion errors.
And change all-compile to misc-compile and only do the builders
not covered otherwise (Fixes #11073)
And make the watcher serve git source.
And cache and singleflight fetching of git source.
And a million other things.
Fixes golang/go#11073
Change-Id: I0f45610f0c6a06bd0c8ba9632b8624e00aeb52fc
Reviewed-on: https://go-review.googlesource.com/10750
Reviewed-by: Andrew Gerrand <adg@golang.org>
Also gomote updates which came about during the process of developing
and debugging this.
Change-Id: Ia53d674118a6b99bcdda7062d3b7161279b6ad52
Reviewed-on: https://go-review.googlesource.com/10463
Reviewed-by: Andrew Gerrand <adg@golang.org>
Have the buildlet send build keys to the coordinator in reverse mode.
Rename Info to Status, as it will be used periodically by the
coordinator to check that the buildlet is OK.
And add a simple connection test.
Change-Id: Ic7285636c7818e2584e11555a8f5b5b66be30638
Reviewed-on: https://go-review.googlesource.com/8593
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
In -reverse mode, the buildlet dials the coordinator. The connection
is then turned around so the coordinator can control the buildlet.
This lets us start buildlets on machines without APIs behind
firewalls.
Also add the -mode flag to the coordinator, which defaults to dev
mode when running off GCE. In prod mode the coordinator attempts to
become farmer.golang.org and pick up its real certificates. In dev
mode, built-in certificates are used, allowing you to start a
local coordinator and buildlet for testing.
A simple connection test will be in a followup CL, as soon as key
checking is implemented.
Change-Id: I2a7dcdfbb4efda71df31b571788945e9ce1f3365
Reviewed-on: https://go-review.googlesource.com/8490
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This starts the process of making the coordinator able to use the
buildlet on things other than GCE.
Update golang/go#8647 (ARM, reverse proxy or online.net)
Update golang/go#10267 (windows 2003 on AWS)
Update golang/go#9495 (OS X VMWare VMs on racked Mac Minis)
Change-Id: I5df79ea67e0ececba8b880e81bd93d4c80897455
Reviewed-on: https://go-review.googlesource.com/8198
Reviewed-by: David Crawshaw <crawshaw@golang.org>
Also, I rebuilt the Windows image with the 30GB base image. I named it
-v2 during manual testing with gomote, and I'm keeping it like that
now, so update dashboard/builders.go too.
Fixes golang/go#10071
Change-Id: I30029310cbf61fb21ef80063f9822cb90ce843c0
Reviewed-on: https://go-review.googlesource.com/7914
Reviewed-by: David Crawshaw <crawshaw@golang.org>
Also, make Windows use regular disks for now, since its image is so
large (100 GB) and we only have 2TB of SSD quota.
This is all very conservative and paranoid for now until I figure out
what part of the coordinator was misbehaving.
Change-Id: Icead5c07cf706c2cfc4d1dd66a108649429018ac
Reviewed-on: https://go-review.googlesource.com/7910
Reviewed-by: David Crawshaw <crawshaw@golang.org>
Add gomote and buildlet support for listing remote files, and
efficiently syncing from the local workstation to the remote buildlet.
Change-Id: Ifab1fb1c208ca4bc66f8d6916c38e1914001a3a5
Reviewed-on: https://go-review.googlesource.com/4270
Reviewed-by: Andrew Gerrand <adg@golang.org>
Also fix the RemoveAll method by setting the request Content-Type.
Change-Id: I87ec29c5c0da06eba5eaebcd00bdbef18e6ae8ad
Reviewed-on: https://go-review.googlesource.com/3880
Reviewed-by: Andrew Gerrand <adg@golang.org>
And use it in gomote and release tools.
Change-Id: I87fa013d6d6729e7305dacd137be1b3d3b02f5f4
Reviewed-on: https://go-review.googlesource.com/3771
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
We still use Dockerfiles to describe them, but then a new tool
(docker2boot) converts them into VM images suitable for booting on
GCE, running the buildlet like all the other operating systems. Things
are easier if everything acts the same way.
Note that since we're no longer moving around Docker images, the image
size and the layer accumulation cruft no longer matters. We can now
have Dockerfile lines like "RUN rm -rf /usr/share/doc" and it actually
results in a smaller VM image, since we just "docker export" the files
out of it to create the VM image.
This doesn't yet convert the clang, sid, or nacl builders. The
coordinator still runs those under Docker directly. A future change
will convert those to VM images as well.
Change-Id: Iedb136ae3daf888c955eb843bdcc9a638d08f5e9
Reviewed-on: https://go-review.googlesource.com/3341
Reviewed-by: Andrew Gerrand <adg@golang.org>
The 'build' repo (unlike the 'tools' repo) is allowed to depend on
anything.
Change-Id: I4caa9fe61bccf05f488152eac53ed5769a848d4d
Reviewed-on: https://go-review.googlesource.com/3113
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Also:
- Move the watcher to cmd/watcher (somehow this was missed earlier).
- Move dashboard package from the repo root to its own directory.
- Update docker build scripts. (Although not yet the version hashes in
the Dockerfiles; this leaves the docker builds broken, but they were
already broken after moving the builder to cmd/builder. They'll be
fixed in a followup CL after this one is submitted.)
Change-Id: I29a9758da1f3c60446e3ce18174c0df26e4d8325
Reviewed-on: https://go-review.googlesource.com/3077
Reviewed-by: Andrew Gerrand <adg@golang.org>