Mostly cache specs, but also touching some shared stuff:
* Unified some NuGet packages (System.Collections.Immutable, System.Memory, etc.) and bumped versions to the latest. This triggered a bunch of other updates...
* Removed many unnecessary 'withQualifier' calls
* Simplified some references by fixing the 'runtimes' folder behavior in the NuGet resolver (which previously forced a bunch of 'withWinRuntime' calls)
More cleanup passes are possible, but those will come later.
We need to always hash files when uploading to ensure the `ContentHash` matches the one we're supposed to be uploading. However, in most code paths we're already hashing the files when inserting into the local cache, so we've already checked that the file hash matches. In such cases, we can avoid hashing the file twice.
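A minimal Python sketch of the idea (the names `hash_file`, `put_file`, and the `trusted` flag are hypothetical, and SHA-256 stands in for the real `ContentHash` algorithm): a caller that already hashed the file during the local-cache insert can mark the hash as trusted and skip the second pass.

```python
import hashlib

def hash_file(path):
    """Hash a file's contents (SHA-256 stands in for the real ContentHash)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def put_file(path, expected_hash, store, *, trusted=False):
    """Upload a file under expected_hash.

    When the caller has already hashed the file while inserting it into the
    local cache, it can pass trusted=True to avoid hashing the file twice.
    """
    if not trusted:
        actual = hash_file(path)
        if actual != expected_hash:
            raise ValueError(f"hash mismatch: {actual} != {expected_hash}")
    with open(path, "rb") as f:
        store[expected_hash] = f.read()  # stands in for the storage upload
```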
QB inserts the same file hash from 3 different paths at the exact same time. Because we don't limit concurrency per hash at the ephemeral layer, the 3 requests start at the same time.
The 1st request starts up and inserts into the local FSCS, then spends 2s uploading to Azure Storage.
The other 2 requests start up and both fall into the optimization this PR deletes. At this point, we have: 1 request uploading, and 2 other requests that completed the insert with Success due to the heuristic.
Because QB deduplicates the file uploads based on hash at some layer, it decides that the target has been uploaded, and proceeds to schedule a target in a different worker.
That different worker doesn't need to check for a cache hit because the targets it depends on were built in the same build, so it just proceeds to Place the content. We try to copy the file from the other workers in the build, but they all time out (we've got extremely low timeouts here on purpose), so we fall back to storage. Because this races with the file upload, and the upload isn't done, storage says "this doesn't exist".
We fail the build!
This failed 4 prod builds today. Please note, removing the optimization actually improves our situation:
1. The case this optimization is meant to optimize will be caught by the call to AllowPersistentPutElisionAsync, which is a lock-free in-memory lookup and so _a priori_ should be quite fast.
2. Calling AllowPersistentPutElision achieves what we actually want (avoiding the upload when we know the content is already in storage).
3. The removal of this heuristic means we no longer need to ensure there's a unique cache location for every build. We can always re-use the same cache and the ephemeral layer won't give incorrect results.
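For illustration, here is a toy Python simulation of the deleted heuristic and the race it enables. All names, timings, and data structures are made up for the sketch; the point is that the elided puts report success while the real upload is still in flight.

```python
import threading, time

storage = {}                 # stands in for Azure Storage
local_cache = set()          # stands in for the local FSCS
lock = threading.Lock()

def put(hash_, payload):
    """Hypothetical model of the deleted heuristic: if the hash is already
    in the local cache, assume the upload is (or will be) done and succeed."""
    with lock:
        if hash_ in local_cache:
            return "elided-success"   # the other 2 requests take this path
        local_cache.add(hash_)
    time.sleep(0.2)                   # slow upload to storage (2s in the PR)
    storage[hash_] = payload
    return "uploaded"

t = threading.Thread(target=put, args=("h1", b"data"))
t.start()
time.sleep(0.05)                      # let request 1 claim the local cache
assert put("h1", b"data") == "elided-success"  # request 2 "succeeds" early
# A downstream worker now tries to place the content from storage:
assert "h1" not in storage            # upload still in flight -> build failure
t.join()
```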
In the most recent scenarios, we are giving the engines a list of URIs that include the SAS token. We should support this scenario without requiring them to parse anything.
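A small Python sketch of the intended behavior, assuming a hypothetical `accept_container_uri` helper: the engine hands over the full URI, SAS token and all, and we split it internally so the caller never has to parse anything.

```python
from urllib.parse import urlparse

def accept_container_uri(uri):
    """Accept a full container URI that may already carry a SAS token in its
    query string, instead of requiring callers to split endpoint and token."""
    parts = urlparse(uri)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    sas_token = parts.query or None   # kept verbatim; never log this value
    return endpoint, sas_token

ep, sas = accept_container_uri(
    "https://account.blob.core.windows.net/container?sv=2021-08-06&sig=abc")
assert ep == "https://account.blob.core.windows.net/container"
```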
Make sure blob cache creation blows up if a container fails to be created. We need to allow the `CreateIfNotExists` call to fail in case we only have read-only permissions. However, if it fails and the container in fact does not exist, the cache should blow up. The hope is that the engines will be able to handle this error and proceed without the remote cache in dev builds.
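A hedged Python sketch of the intended control flow. The `ContainerClient` here is a made-up stand-in, not the real Azure SDK client; it only models the two cases that matter (read-only credentials vs. a genuinely missing container).

```python
class ContainerClient:
    """Minimal stand-in for a blob container client (hypothetical API)."""
    def __init__(self, exists, can_create):
        self._exists, self._can_create = exists, can_create
    def create_if_not_exists(self):
        if not self._can_create:
            raise PermissionError("read-only credentials")
        self._exists = True
    def exists(self):
        return self._exists

def ensure_container(client):
    """CreateIfNotExists may fail under read-only permissions; that's fine as
    long as the container actually exists. If it doesn't exist, blow up so
    the engine can proceed without the remote cache."""
    try:
        client.create_if_not_exists()
    except PermissionError:
        if not client.exists():
            raise RuntimeError("container missing and cannot be created")

# Read-only creds against an existing container: tolerated.
ensure_container(ContainerClient(exists=True, can_create=False))
```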
This PR:
- Makes gRPC clients use a single connection pool
- Eliminates a lot of gRPC options being passed around that aren't actually used or plumbed
- Simplifies the gRPC configuration to follow similar settings as AnyBuild does
- Gets us one step closer to enabling gRPC encryption by always enabling server-side encryption on a separate port, and allowing usage of grpcs://, https:// etc in MachineLocation to denote HTTPS connections.
- Makes all instances actually try to enable gRPC encryption on the server side. If it fails, it fails, but we try.
Validation: https://cloudbuild.microsoft.com/user/jubayard_20240126152912
- This makes it a bit harder to build test packages locally, since you need a Mac to build the runtime package.
- Will revert this once I have a better fix.
Reuses the NLog component already available for CASaaS. The schema of the Kusto table is preserved to keep existing Kusto queries compatible, even though many of the columns are not meaningful outside of CloudBuild.
For the time being, this adds an explicit flag to turn it on for deployment purposes; eventually the existing /logsToKusto flag will also turn on cache logging.
Instructions/scripts for setting up the ingestion/table are coming afterwards.
The ephemeral cache currently assumes that something exists in Azure Storage if it exists in the content tracker (distributed, accessible to all machines) OR if it's in the put elision cache (local to the machine). This causes a series of things to happen:
1. The number of storage accesses is higher than it should be in cases where the content is added on one machine, evicted, and then accessed from another. This scenario is relatively common because the local cache size we use for ephemeral builds is small, as we have small drives.
2. A race condition is possible (see deleted comments); it's prevented by waiting, but the cost of that is more storage accesses.
3. In the datacenter wide case, it also means that the accessing of Azure Storage is less efficient than it could be, because the knowledge about storage accesses isn't shared across builds.
This PR adds Azure Storage as a location and modifies things so they take that extra information into consideration.
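As a rough illustration, a toy Python sketch of treating Azure Storage as an explicit location in the content tracker (all names are hypothetical): existence in storage becomes a shared, recorded fact rather than something inferred from the machine-local elision cache.

```python
# Toy model: the content tracker maps hashes to the set of locations known
# to hold that content; Azure Storage is just another location entry.
STORAGE = "azure-storage"

content_tracker = {}   # hash -> set of locations

def record(hash_, location):
    """Record that `location` holds the content for `hash_`."""
    content_tracker.setdefault(hash_, set()).add(location)

def exists_in_storage(hash_):
    # Before this change: (hash in tracker) OR (hash in the local put-elision
    # cache). After: storage is an explicit location, shared across builds.
    return STORAGE in content_tracker.get(hash_, set())

record("h1", "machineA")
assert not exists_in_storage("h1")   # known on one machine only
record("h1", STORAGE)                # recorded after a successful upload
assert exists_in_storage("h1")
```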
Example builds: https://cloudbuild.microsoft.com/user/jubayard_20231215164957
Further example builds: https://cloudbuild.microsoft.com/user/jubayard_20240125180851
Related work items: #2122872
For file access reports, Detours generally normalizes/canonicalizes paths by calling `GetFullPath`.
However, there is one case where Detours does not do that: the case where it removes the path's last segment.
Consider the API call `FindFirstFile("B:\\*.cpp")`. This call starts enumerating `B:\` with the `*.cpp` pattern. Detours needs to report an enumeration access on `B:\`. To do so, it has to remove `*.cpp` from the given input. The last-segment removal was so naive that it simply removed everything after, and including, the last `\`. This resulted in Detours reporting `B:`.
The fix is to compare the path with the path root. If the position of the last `\` is within the root's length, we should take the whole root instead.
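The fix can be sketched in Python as follows. This is a simplified model of the Detours logic, not the actual C++ code, and it only handles drive-rooted paths (not UNC paths):

```python
import ntpath

def remove_last_segment(path):
    """Drop everything after the last '\\', but never cut into the path
    root: FindFirstFile("B:\\*.cpp") must yield "B:\\", not "B:"."""
    root = ntpath.splitdrive(path)[0] + "\\"   # e.g. "B:\\"
    i = path.rfind("\\")
    if i < len(root):
        return root            # the last '\' is part of the root: keep it
    return path[:i]

assert remove_last_segment("B:\\*.cpp") == "B:\\"       # fixed behavior
assert remove_last_segment("B:\\src\\*.cpp") == "B:\\src"
```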
Related work items: #2126559
The bug is caused by the following scenario:
Consider a scenario where a build session consists of 2 builds: one consisting of pips that construct pip graph fragments, and another that stitches the fragments together and executes the pips in them. In the first build session, the 1st build has a pip that constructs a pip P in a graph fragment G with a fingerprint salt A. The fragment is then included in the 2nd build, which executes P.
In the next build session, the first build keeps using fingerprint salt A for constructing graph fragment G, so the construction of G has a cache hit. In the second build, one changes the fingerprint salt to B and expects P to have a cache miss. In this build we will have a graph cache miss because the fingerprint salt B is different. However, during graph construction we still use the serialized graph fragments generated in the first build. If we do not include fingerprint salt B when calculating the pip's static weak fingerprint (when stitching the fragments together), then P will have the same static weak fingerprint as in the first build session. This can result in a hit in the incremental scheduling state and make P get skipped during the execution phase, causing an underbuild.
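A tiny Python sketch of the fix, with hypothetical names: mixing the current fingerprint salt into the static weak fingerprint guarantees that a salt change (A to B) produces a different fingerprint, even when the serialized graph fragment itself is reused.

```python
import hashlib

def static_weak_fingerprint(pip_description, fingerprint_salt):
    """Model of the fix: include the *current* fingerprint salt when
    calculating the static weak fingerprint while stitching fragments,
    so a salt change invalidates incremental-scheduling state."""
    h = hashlib.sha256()
    h.update(fingerprint_salt.encode())
    h.update(pip_description.encode())
    return h.hexdigest()

fp_a = static_weak_fingerprint("pip P", "saltA")
fp_b = static_weak_fingerprint("pip P", "saltB")
assert fp_a != fp_b   # salt change forces a miss, preventing underbuild
```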
Related work items: #2137467
This PR protects against the following bug:
- QuickBuild tries to Put a file, but hands us a file that doesn't match the hash QuickBuild expects
- We fail to insert it into the local cache. However, because a previous target had the scenario above and contained the hash that we expected to put, the file is considered to be alive in storage and therefore gets a put elision (i.e., we don't even try calling storage), which means we return that the put was successful!
- QuickBuild then assumes the file is successfully in the cache, and proceeds with life happily.
- Eventually, it adds a fingerprint which references the file that doesn't actually exist!
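A minimal Python sketch of the guarded flow (hypothetical names; SHA-256 stands in for the real hash): the hash check must happen before any elision decision, so a file that fails local insertion can never be reported as successfully put.

```python
import hashlib

def put_with_elision(content, expected_hash, elision_cache):
    """Verify the content against expected_hash *before* consulting the
    put-elision cache. Before the fix, a mismatched file could still hit
    elision and be reported as a successful put."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_hash:
        return "error: hash mismatch"
    if expected_hash in elision_cache:
        return "elided"
    elision_cache.add(expected_hash)
    return "uploaded"

good = hashlib.sha256(b"good").hexdigest()
cache = {good}
# A corrupt file must fail even though `good` is in the elision cache:
assert put_with_elision(b"corrupt", good, cache) == "error: hash mismatch"
```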
Make environment variable values case-sensitive when computing the cache fingerprint.
Currently we ignore case when comparing environment variable values in InputTracker and CacheGraphProvider. The Office team has had some builds fail because of this.
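A toy Python sketch of the change (the helper name is made up): include the variable values verbatim in the fingerprint inputs instead of case-folding them, so values differing only in case produce different fingerprints.

```python
def env_fingerprint_inputs(env, tracked):
    """Collect (name, value) pairs for the fingerprint. Values are taken
    verbatim (case-sensitive) rather than normalized, so 'Debug' and
    'debug' are treated as different inputs."""
    return tuple(sorted((k, env.get(k, "")) for k in tracked))

a = env_fingerprint_inputs({"CONFIG": "Debug"}, ["CONFIG"])
b = env_fingerprint_inputs({"CONFIG": "debug"}, ["CONFIG"])
assert a != b   # previously these compared equal, causing bad cache hits
```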
Related work items: #2130263
- This was obsoleted in glibc 2.33, but we still need to support it for Ubuntu 20.04, which uses glibc 2.31 by default.
- The regular mknod/mknodat calls don't get interposed for whatever reason, even though the symbols exist in glibc < 2.33
Related work items: #2132116
The GetCacheInitializationTask method returns a task; its result (a failure, if present) is handled immediately for distributed builds, but for dev builds we handle it during GraphReuse and ConstructingSchedule.
Based on the stack trace, the issue occurs when cache initialization has failed but the ConstructSchedule method has decided to proceed with a new graph and returns successfully.
However, an error is logged for any failure in cache initialization. This causes the exception we see, as it fails the assertions in the ValidateSuccessMatches method.
Related work items: #2138736
Retrieve contextual information about the branch the build is running on from the Azure DevOps environment variables, and use those values as keys. We consider two scenarios: builds running from a PR trigger and builds that aren't.
For builds running on a PR, these are the candidate keys we use:
1. The PR merge-branch name (e.g., /refs/pull/1234/merge)
2. The source branch for the PR (e.g., /refs/dev/chelo/myFeatureBranch)
3. The target branch for the PR (e.g., /refs/heads/main)
The rationale is that we want to get the latest fingerprint store pushed by other iterations of 'this' PR, which will share the PR merge-branch name; failing that, we fall back to the closest branches that might have been built before in a non-PR build (and thus pushed a fingerprint store using their branch names).
For builds triggered outside of PRs, we just use the branch name, which is BUILD_SOURCEBRANCH. The other PR-related variables are undefined in this context.
This strategy is effective assuming the target branches are built regularly and that FP stores from those branches are accessible from the PR builds (i.e., with 'baseline builds' running on branches that are targeted by PRs, in the same cache namespace as the PRs).
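A Python sketch of the key selection. The `SYSTEM_PULLREQUEST_*` variable names follow standard Azure DevOps predefined variables but are my assumption here, since the description only names BUILD_SOURCEBRANCH explicitly:

```python
import os

def candidate_branch_keys(env=os.environ):
    """Build the ordered list of candidate fingerprint-store keys from
    Azure DevOps environment variables."""
    merge_branch = env.get("BUILD_SOURCEBRANCH")            # refs/pull/N/merge on PR triggers
    pr_source = env.get("SYSTEM_PULLREQUEST_SOURCEBRANCH")  # PR source branch
    pr_target = env.get("SYSTEM_PULLREQUEST_TARGETBRANCH")  # PR target branch
    if pr_source or pr_target:   # PR-triggered build: try PR keys, then fall back
        return [k for k in (merge_branch, pr_source, pr_target) if k]
    return [merge_branch] if merge_branch else []

keys = candidate_branch_keys({
    "BUILD_SOURCEBRANCH": "refs/pull/1234/merge",
    "SYSTEM_PULLREQUEST_SOURCEBRANCH": "refs/dev/chelo/myFeatureBranch",
    "SYSTEM_PULLREQUEST_TARGETBRANCH": "refs/heads/main",
})
assert keys == ["refs/pull/1234/merge",
                "refs/dev/chelo/myFeatureBranch",
                "refs/heads/main"]
```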