Add 'orchestrator' and 'worker' modes to the BuildXLNugetInstaller so that the version is decided on a single agent in a distributed build: this avoids race conditions between the agents when resolving a version.
In the future this strategy might be applied to the configuration download itself, but a prerequisite for that is distributing the configuration in a versioned manner (like NuGet), so for now it is good enough to keep this logic specific to the BuildXL installer.
This PR also adds an integration pipeline to test the installer before releasing. This pipeline exercises a straightforward flow and also the worker-orchestrator consistency when using the distributed mode.
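The orchestrator/worker split can be sketched as follows. This is a minimal illustration with hypothetical names (the store and functions are assumptions, not the installer's actual API): only the agent in orchestrator mode resolves a version, and workers read the value the orchestrator published, so every agent agrees.

```typescript
type Role = "orchestrator" | "worker";

interface VersionStore {
  publish(version: string): void;
  read(): string | undefined;
}

// In-memory stand-in for whatever shared channel the agents use to
// communicate the decided version (an assumption for this sketch).
class InMemoryStore implements VersionStore {
  private version?: string;
  publish(v: string): void { this.version = v; }
  read(): string | undefined { return this.version; }
}

// Placeholder for the feed query that is racy when every agent runs it.
function resolveLatestVersion(): string {
  return "0.1.0-20240101.1";
}

function installerVersion(role: Role, store: VersionStore): string {
  if (role === "orchestrator") {
    const v = resolveLatestVersion();
    store.publish(v); // decide once, make the decision visible to workers
    return v;
  }
  const v = store.read(); // workers never resolve on their own
  if (v === undefined) {
    throw new Error("orchestrator has not published a version yet");
  }
  return v;
}
```

Because workers never query the feed themselves, a feed update landing mid-build cannot leave agents on different versions.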
Updated the APH version; it passed all validations.
Also, no issues observed with the package on Ubuntu 24.04.
The globalization issue seen earlier with the older version of APH has also been resolved.
Related work items: #2227478
Produce an error file for the Yarn and Lage resolvers (so they are on par with the Rush resolver).
Gate this behavior change behind an optional argument, so we don't break Office consumption.
When bxl is running under Linux without an X server available, the regular interactive browser credential is not an option for cache auth. Use device code authentication in that case.
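The fallback decision can be sketched like this. Names are illustrative, not BuildXL's actual API; treating a missing `DISPLAY`/`WAYLAND_DISPLAY` as "headless" is the assumption here.

```typescript
type AuthMode = "interactiveBrowser" | "deviceCode";

// Interactive browser auth needs a display server, so a headless Linux
// session falls back to device code authentication.
function pickAuthMode(
  platform: string,
  env: Record<string, string | undefined>,
): AuthMode {
  const headless =
    platform === "linux" && !env["DISPLAY"] && !env["WAYLAND_DISPLAY"];
  return headless ? "deviceCode" : "interactiveBrowser";
}
```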
Align the worker id with what is shown in the ADO pipelines UI (worker #1 at the top, log names, etc.). Do this by using the predefined System.JobPositionInPhase variable to select a particular worker id.
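A minimal sketch of the selection: ADO exposes the predefined System.JobPositionInPhase variable to scripts as the SYSTEM_JOBPOSITIONINPHASE environment variable, with 1-based positions matching the UI ordering. Any further offset between job position and worker id is an implementation detail left out here.

```typescript
// Derive a stable worker id from ADO's System.JobPositionInPhase so it
// lines up with the 1-based ordering the pipelines UI shows.
function workerIdFromEnvironment(
  env: Record<string, string | undefined>,
): number {
  const raw = env["SYSTEM_JOBPOSITIONINPHASE"];
  const position = Number(raw);
  if (!Number.isInteger(position) || position < 1) {
    throw new Error(`Unexpected System.JobPositionInPhase value: ${raw}`);
  }
  return position; // 1-based, matching "worker #1" at the top of the UI
}
```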
Related work items: #2226709
BuildXL is not able to dump a hung process due to "Process with an Id of N is not running". However, the `GetExitCodeProcess` call just before did not return success, and it had just created a handle, so I'm fairly sure this case was not a race condition. Using this PR's changes I confirmed that the process was in fact running and that `Process.GetProcessById` was too eager in diagnosing the process as complete. I was able to dump the process without this method, so avoiding this call may be the best way to capture these kinds of process dumps.
Related work items: #2221778
This pull request addresses two bugs encountered when the Content Addressable Storage (CAS) is mounted separately from the outputs on Linux:
1. Temporary File Deletion: If TryInKernelFileCopyAsync fails, the temporary file is now deleted. Previously, this temp file would prevent CopyWithStreamAsync from copying the file.
2. Hardlink Creation: When creating a hardlink for content already in the cache, the process attempts to delete the original file and replace it with a new one that includes the hardlink. If this operation failed, the original file was left deleted, and the fallback to copying would return true because the content is already in the store; the pip would then fail in later operations due to the missing file. Now, if the file has been deleted, the original file is re-created using the existing content from the cache.
Related work items: #2216529, #2219745
Added an uncacheableExitCodes property to Process.cs: when a pip exits with one of these codes, its result is not cached, even though the pip is considered successful.
Added a corresponding integration unit test.
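The semantics as described can be sketched like this (the shape of the pip object and helper are assumptions based on the description, not the actual Process.cs code): a pip whose exit code is listed in uncacheableExitCodes still succeeds, but the result is not stored, so the pip re-runs in the next build.

```typescript
interface ProcessPip {
  successExitCodes: number[];      // non-zero exit codes treated as success
  uncacheableExitCodes: number[];  // successful exit codes that skip caching
}

function shouldCache(pip: ProcessPip, exitCode: number): boolean {
  const succeeded = exitCode === 0 || pip.successExitCodes.includes(exitCode);
  return succeeded && !pip.uncacheableExitCodes.includes(exitCode);
}
```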
Related work items: #2210032
In various environments we hit some amount of pip cancellation and retrying due to memory throttling. At a low rate this is not necessarily problematic and shouldn't be in the end user's face. Demote those messages to verbose and log if the total count passes a threshold.
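The demotion can be sketched as follows, with an assumed threshold and illustrative names: each individual memory-throttling retry is logged as verbose, and a single higher-visibility message is emitted only once the total count crosses the threshold.

```typescript
type Level = "verbose" | "warning";

class MemoryRetryLogger {
  private count = 0;
  readonly messages: { level: Level; text: string }[] = [];

  constructor(private readonly threshold: number) {}

  onPipRetriedForMemory(pipId: string): void {
    this.count++;
    // Individual retries stay out of the user's face.
    this.messages.push({
      level: "verbose",
      text: `Pip ${pipId} retried due to memory throttling`,
    });
    // Surface the aggregate only when it becomes noteworthy.
    if (this.count === this.threshold) {
      this.messages.push({
        level: "warning",
        text: `${this.count} pips retried due to memory throttling`,
      });
    }
  }
}
```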
We suppress end-user-facing logs in the workers even if they are selected for console redirection. Note that these events are always forwarded to the orchestrator, so they *will* end up in the orchestrator console anyway.
Currently, content hash lists are always replaced in the Blob L3 implementation because AutomaticallyOverwriteContentHashLists is set to false. AutomaticallyOverwriteContentHashLists is poorly named: setting it to false indicates that the content session should not be used to pin content for availability checks, the checks that would otherwise prevent replacing content hash lists whose content is available. The end result is that content hash lists are always replaced. This negatively impacts graph caching and pip caching, because concurrent runs generating the same fingerprint will overwrite each other rather than converging. Graph caching is impacted even more because it relies on entries not being replaced for its fingerprint chaining. This means that similar builds which generate the same initial graph fingerprint but ultimately have different graph fingerprints continuously stomp over each other.
Defaulting /pipTimeoutMultiplier to 3 on ADO causes /pipDefaultTimeout to be silently multiplied by 3, which is pretty confusing. In addition, we have no reason to think that builds on ADO take longer than in non-ADO environments.
Last 30 days of builds on ADO not passing an explicit multiplier:
https://cbuild.kusto.windows.net/b73924d6ec544e76a97c39510c054b21
**Customer**
* domoreexp-teams-modular-packages
* mseng-BuildXL.Internal
* microsoft-OSGTools
* Teams (domoreexp) is now passing an explicit timeout (at some point they were not, and that's why the query picks it up).
* Our own bxl internal builds should be good with the change
* We ran a forced clean build under OSGTools and validated that with /pipTimeoutMultiplier:1 they are still green.
Initial implementation of a build tools installer to be run in pipeline templates to acquire BuildXL and provide versioning through some global configuration.
This is meant to be distributed as a standalone package. This PR builds an MVP that we can roll out to some onboarding customers to try out (in experimental scenarios so far).
Engine output directories are retained in some builds, which causes BXL to crash.
Ex: the BuildXLCurrentLog directory, which is supposed to be deleted towards the end of the build, is retained for the next build, causing an issue.
This is handled by ensuring that we rename the retained folder and create a new directory.
Added a unit test to check the scenario.
Attaching the previous PR:
https://dev.azure.com/mseng/Domino/_git/BuildXL.Internal/pullrequest/805664
Related work items: #2188585
`Context.isWindowsOS` is accepted by DScript as a legitimate value, while obviously being a typo. Due to the TypeScript nature of DScript, when `Context.isWindowsOS` is evaluated, it evaluates to `true`. This PR fixes a few ternary conditionals where we were always getting the first value.
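The bug class can be illustrated in plain TypeScript (whether DScript resolves `Context.isWindowsOS` exactly this way is an assumption of this sketch): referencing a function without calling it yields the function object, which is truthy, so the ternary always takes its first branch.

```typescript
// Stand-in for the DScript Context API.
const Context = {
  isWindowsOS: (): boolean => process.platform === "win32",
};

// Missing parentheses: the function object itself is truthy, so this is
// always "win" regardless of the actual OS.
const buggy = Context.isWindowsOS ? "win" : "unix";

// Calling the function actually evaluates the predicate.
const fixed = Context.isWindowsOS() ? "win" : "unix";
```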
Change the public external feed to @ dev.azure.com/PipelineTools.
Manually change the feed in BuildXLLkgVersionPublic.cmd since the release pipeline only changes the version.
Related work items: #2217690