Use a new common pool, BuildXL-DevOpsAgents-Selfhost, for both Windows and Linux runs. This is a prerequisite for enabling network isolation and using a common firewall for all runs.
This PR enables the compliance build only in the Linux PR validation.
Ideally, we should only enable source analyses in the rolling pipeline. However, because our rolling pipeline contains CB builds, SDL source analyses are all disabled, in favor of the CB compliance build. Unfortunately, some analyses, like policheck, do not work in network isolation when building in CB.
SDL source analyses typically take 2.5-4 minutes, and should not be in the critical path of this Linux pipeline run.
Add a `bxl_ado.sh` script that can interpret the command line given by 1ESPT. Notably, AdoBuildRunner arguments are separated by `--` there, whereas our bash scripts use `--runner-arg` to process them. Migrate all our selfhost builds to use 1ESPT with automatic cache config generation.
- Make the public validation public end-to-end, by building a public engine and then using that to run tests (also without passing `--internal`)
- Use the 1ESPT build job for the distributed build test. We're still keeping the old YML for backup, but it becomes unused.
- Note that using 1ESPT for our own selfhost is not possible because of how we invoke `bxl.sh`, which is why that one still adds the worker stage manually.
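The argument translation described for `bxl_ado.sh` can be sketched as follows. This is a minimal Python illustration of the idea, not the script itself. It assumes that the AdoBuildRunner arguments come before the `--` separator and the remaining arguments after it; the function name and the placeholder arguments are hypothetical.

```python
def translate_args(argv):
    """Rewrite '<runner args> -- <other args>' into the '--runner-arg' form.

    Assumption (not confirmed by the PR text): everything before the first
    '--' is meant for the AdoBuildRunner; everything after it is passed
    through unchanged.
    """
    if "--" not in argv:
        # No separator present: nothing to translate.
        return list(argv)
    split = argv.index("--")
    runner_args, other_args = argv[:split], argv[split + 1:]
    translated = []
    for arg in runner_args:
        # Each runner argument is forwarded behind its own '--runner-arg'.
        translated += ["--runner-arg", arg]
    return translated + other_args


# Placeholder arguments, for illustration only.
print(translate_args(["RUNNER_OPT_1", "RUNNER_OPT_2", "--", "BXL_OPT"]))
```

Keeping the translation in one wrapper means the existing bash scripts do not need to learn the 1ESPT convention.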
Use the ado runner cache config generation for our selfhost Linux validations:
* Remove the explicit generation of a cache config for the ado build runner (distributed) case
* Use the new ado build runner functionality and let it generate one (ephemeral, datacenter wide, as we are currently dogfooding)
* Make internal build distributed (main reason is to keep dogfooding the ephemeral cache with a more real-life build). The explicit distributed clean build with a 2-pip build remains as is.
- Only build the engine that will be under test once, and then consume from the different validations via pipeline artifacts, instead of doing it once per validation
- Modularize the YAMLs by moving shared tasks to a common file
- Add some dependencies between the stages to short-circuit failures: namely, run the PTrace validation only after the public and
- The internal validation is now single machine, in favor of validating distribution with a small two-pip build at the end
- Add a validation that runs a small distributed build consisting of two pips, one of which is forced to run on a remote worker. This runs at the end of the validation
Let's dogfood the data center wide flavor of the ephemeral cache. Use the ephemeral cache as well for the ptrace validation (with a different universe, just to keep those two validations from interacting with each other)
The purposes of upgrading XUnit are twofold:
1. To use the latest XUnit packages.
2. As an attempt to relieve hanging XUnit pips.
For (2), we also make other changes:
- Split Ninja UTs to multiple "smaller" pips, where each pip should run faster than the big one.
- Limit the number of XUnit pips that can run in parallel by using a semaphore. We suspect that the hanging XUnit pips are caused by too many XUnit pips running in parallel.
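The effect of the semaphore can be illustrated with a small, generic Python sketch. It is an analogy using `threading.Semaphore`, not BuildXL's pip semaphore mechanism; the limit of 2 and the names are hypothetical.

```python
import threading
import time

MAX_PARALLEL = 2  # hypothetical limit on concurrently running test pips
slots = threading.Semaphore(MAX_PARALLEL)
lock = threading.Lock()
running = 0  # test pips currently executing
peak = 0     # highest concurrency observed


def run_test_pip(name):
    global running, peak
    with slots:  # blocks while MAX_PARALLEL pips are already running
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # stand-in for the actual test run
        with lock:
            running -= 1


threads = [threading.Thread(target=run_test_pip, args=(f"pip{i}",)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The observed peak never exceeds MAX_PARALLEL, however many pips are queued.
print(peak <= MAX_PARALLEL)
```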
Example of successful validation: https://dev.azure.com/mseng/Domino/_build/results?buildId=23035493&view=results
Related work items: #2111327
Finally abandon the parallel workers approach for the Linux PR validation and use a stage for the workers instead. This is more representative of both the "worker pipelines" used in production and the future of 1ESPT distributed builds.
Hostname resolution stopped working when we removed the private subnet we were using to run distributed builds. It turns out that DNS resolution in the "default network" needs the hostnames to be qualified with a domain (https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-name-resolution-for-vms-and-role-instances).
This PR adds an option (to be set by the AdoBuildRunner) so we can inject the correct hostnames in the build, without changing how this works for CloudBuild (where `Dns.GetHostName` works).
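As a rough illustration of what "qualifying" a hostname means here, the following Python sketch appends a domain suffix to a bare machine name. The function name and the suffix `example.internal` are hypothetical; the real suffix and the way the AdoBuildRunner passes the name to the build are not described in this PR text.

```python
def qualify_hostname(hostname, domain):
    """Return the hostname qualified with the given domain.

    Names that already contain a dot are assumed to be qualified and are
    returned unchanged; an empty domain also leaves the name untouched.
    """
    if not domain or "." in hostname:
        return hostname
    return f"{hostname}.{domain.lstrip('.')}"


# Hypothetical machine name and domain suffix, for illustration only.
print(qualify_hostname("agent-01", "example.internal"))
```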
Related work items: #2116072
Wrap the bxl call for the Linux public & internal selfhost PR validation with a 60m timeout. This should allow us to get logs uploaded if BuildXL hangs for any reason.
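The idea of the wrapper can be sketched in Python as below. This only illustrates the pattern (kill the build after a deadline so the following steps, such as log upload, still run); how the pipeline actually implements the timeout is not shown in this description, and the function name is hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it hit the timeout.

    subprocess.run kills the child process before raising TimeoutExpired,
    so a hung build does not outlive the wrapper.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_seconds).returncode
    except subprocess.TimeoutExpired:
        return None


# A 60-minute limit corresponds to timeout_seconds=3600. Here a trivial
# command stands in for the real build invocation.
print(run_with_timeout([sys.executable, "-c", "pass"], 3600))
```

A `None` result signals that the build was cut off, so the caller can still collect and upload whatever logs exist.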
We needed this because we were using a stateful pool, but we have moved to a stateless one now. The "justification" doesn't actually describe anything related to what this breakglass was accomplishing.
As of today we have two distinct ways of running distributed builds on ADO: the original model, where all build agents run the same job and are multiplied with the parallel strategy, and the worker pipeline model, where a second pipeline is triggered where all agents run as workers, and a single agent runs a build in the original pipeline (as an orchestrator).
The two approaches used different ways to coordinate the distributed build (namely, to communicate the relevant build information such as build id and orchestrator location), so we had two "BuildManager" classes which were instantiated for the different scenarios. But we can in fact use the same pre-build coordination in both scenarios: this PR consolidates it into a single `BuildManager` class, which corresponds to what was until now called `WorkerPipelineBuildManager`. It communicates the information via the build properties of the orchestrator build (which for the "parallel agents" scenario is the exact same build as the workers', but for the "worker pipeline" scenario is a separate build from the one the workers run in).
Note that the "parallel agents" scenario is only used internally for our selfhost validations. The PR includes the only change needed in the Linux validation YAML to accommodate this consolidation (passing an invocation key is now required).
Use the Linux stateless pool to mitigate the DFA issue (theory: cancellations leave some shared opaque outputs unmarked, the next build then finds those, and rsync has an incremental behavior where copies are avoided)
Introduced a flag to enable/disable explicitly setting the execute permission bit for the root process in Linux builds.
Added a condition to explicitly set this bit for node and dotnet in the extractor.
This is done to obtain more information about the Linux permissions bug.
Related work items: #2104538
Add salt to both Windows and Linux PR to address a cache poisoning issue
----
## AI-Generated Description
This change adds a new condition to the **bxl.ps1** and **bxl.sh** scripts that checks if the user-provided arguments contain the `/p:BUILDXL_FINGERPRINT_SALT` parameter. If not, it adds this parameter with the value `casingPR` to the arguments passed to the BuildXL executable. This is done to force a salt for the cache fingerprinting, because a previous PR introduced some casing issues that polluted the cache. This change can be removed once the poisoned content is evicted from the cache.
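The check described above amounts to the following logic, shown here as a Python sketch rather than the actual **bxl.ps1** / **bxl.sh** code. The `name=value` form of the parameter and the function name are assumptions.

```python
SALT_PARAM = "/p:BUILDXL_FINGERPRINT_SALT"


def with_default_salt(args, salt="casingPR"):
    """Append the fingerprint salt unless the caller already provided one."""
    if any(arg.startswith(SALT_PARAM) for arg in args):
        # The user passed an explicit salt: leave the arguments untouched.
        return list(args)
    # Assumption: the parameter is written as '/p:NAME=value'.
    return list(args) + [f"{SALT_PARAM}={salt}"]


# Placeholder argument, for illustration only.
print(with_default_salt(["SOME_BXL_ARG"]))
```

Because a user-provided salt takes precedence, pipelines that already set their own salt are unaffected by the default.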
Rev fingerprint salt of Linux PR pipeline to deal with bad rsync entry
----
## AI-Generated Description
This change modifies the **job-selfhost.yml** file by updating the fingerprint salt for the distributed build. The fingerprint salt is a value that feeds into the build's cache fingerprints, so changing it invalidates existing cache entries and forces a rebuild. The change also removes the `/historicMetadataCache-` flag from the bxl command line. That flag disables the historic metadata cache, which can speed up the build by reusing previous results.
Use a blob-backed L3 cache for Linux selfhost builds. The retention period is configured to be 1 day for now, as a way to stress test the eviction mechanism. Something like 3 or 4 days will probably be more reasonable after we are done testing this feature.
Related work items: #2074180
This change adds a `/logToKusto` option to send all log lines to a Kusto database (the same way this happens today for DominoMessage on CB). The primary scenario for enabling this option is ADO builds.
In order to authenticate, the pipeline running the build needs to provide a managed identity that we need to authorize. 1eshp allows associating a managed identity with a pipeline, so that's pretty straightforward.
Docs (on EngHub, since the flow today is mostly tied to 1eshp) to follow.
*Note*: this PR needs a fix in the nuget generation logic that is already in, but it needs to become part of the LKG, so we'll get failures till that happens
Related work items: #2047667