Noticed that a few more files are used only by ProcessPipExecutor, so moving those there as well. Additionally, pulling out all the VM and ExternalSandbox code into a separate dll, as it's not used by Processes and doesn't fit in ProcessPipExecutor.
Processes depends on a very small portion of BXL.Utilities.Configurations, and most of those dependencies are BXL specific as well, so it makes sense to just break this dependency.
Office would like to experiment with a light version of BuildXL in their PR builds. The first step is to add an option to disable XLG events from workers to the orchestrator. Manifest events will still be sent, though.
Refactoring Instrumentation.Common to only contain Aria-specific files, and moving all the non-Aria code into Utilities.Core instead. The motivation is that BXL released packages won't contain Aria packages, which are external packages. We want to have as few external packages as possible because they can cause diamond dependencies (e.g., if BXL.Utilities depends on Aria 1.0, ExPkg depends on Aria 1.1, and a customer pulls both in, there will be a version conflict).
There are two scenarios when a build will have work waiting on resources:
1. When the scheduler limits have been throttled due to available resources.
2. When the scheduler must actively cancel some executing processes due to resource availability.
We see negative numbers in the pipsWaitingOnResources value in scenario 2. Here, the scheduler temporarily decreases TotalProcessSlots as a signal to cancel some of the actively running processes. In this case AcquiredProcessSlots will actually be larger than the total, and the prior calculation was yielding a negative number. The negative number is only temporary while the cancellation is happening: once cancelled, AcquiredProcessSlots will no longer exceed TotalProcessSlots, and the cancelled pips will once again be back in the ChooseWorkerCpu PipExecutionStep.
Calculating based on LocalWorker.TotalProcessSlots - LocalWorker.AcquiredProcessSlots is clearly wrong while the cancellation is happening. It is also incorrect in the steady state while a build is running, either throttled or unthrottled: when a machine is maxed out, the two have the same value and the difference is zero. Math.Min() will pick that 0 value even though there may still be many pips in the ChooseWorkerCpu step.
It seems directly using the count of pips in the ChooseWorkerCpu step is the appropriate thing to do.
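To make the difference concrete, here is a minimal sketch contrasting the two computations; the names are illustrative of the description above, not the exact BuildXL code:

```
using System;

// Old: goes negative while acquired > total (mid-cancellation), and is 0
// on a maxed-out machine (total == acquired) even when pips are still
// queued in the ChooseWorkerCpu step.
int OldPipsWaitingOnResources(int totalProcessSlots, int acquiredProcessSlots, int pipsInChooseWorkerCpu) =>
    Math.Min(totalProcessSlots - acquiredProcessSlots, pipsInChooseWorkerCpu);

// New: simply report the pips actually sitting in the ChooseWorkerCpu step.
int NewPipsWaitingOnResources(int pipsInChooseWorkerCpu) =>
    pipsInChooseWorkerCpu;
```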
Related work items: #2041760
Currently, to check if statx is supported, we check the major/minor version of the OS. However, that is not enough: we also need to check whether the installed glibc supports statx.
There are a couple of ways to get the glibc version, e.g., calling `ldd --version` and parsing its output, or calling `gnu_get_libc_version()`. In this PR we simply invoke statx on the executing assembly and check for `EntryPointNotFoundException`.
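A minimal sketch of that probe, assuming a P/Invoke binding along these lines (the real struct layout and flags are elided; only the missing-symbol behavior matters here):

```
using System;
using System.Reflection;
using System.Runtime.InteropServices;

internal static class StatxProbe
{
    // Oversized opaque buffer; the probe doesn't read the results, it only
    // needs room for the kernel to write into.
    [StructLayout(LayoutKind.Sequential, Size = 512)]
    private struct StatxBuffer { }

    private const int AT_FDCWD = -100; // "relative to the current directory"

    [DllImport("libc", SetLastError = true)]
    private static extern int statx(int dirfd, string pathname, int flags, uint mask, ref StatxBuffer buf);

    public static bool IsStatxSupported()
    {
        try
        {
            var buf = default(StatxBuffer);
            // Probe against the executing assembly, as described above.
            statx(AT_FDCWD, Assembly.GetExecutingAssembly().Location, 0, 0, ref buf);
            return true;
        }
        catch (EntryPointNotFoundException)
        {
            // The installed glibc does not export statx.
            return false;
        }
    }
}
```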
Related work items: #2043763
Updated Tool.QTestRunner.dsc to use qTestMsTestPlatformRootPathValue, a path value, for --msBuildToolsRoot so that the input to QTest, qTestMsTestPlatformRootPath, can be different from the qTestMsTestPlatformRootPathValue path. This happens when qTestMsTestPlatformRootPath is an opaque from a NuGet package and qTestMsTestPlatformRootPathValue is a subdirectory of it.
This PR:
- Adds the concept of a `Topology`. The `Topology` is responsible for handing out a storage client for any given object in the system. Currently, we have only one implementation, `ShardedTopology`, which uses a sharding scheme to point to a specific storage account.
- Adds the concept of an `IShardingScheme` over a set of nodes (i.e., "how do I distribute objects uniformly across shards?"). The `ShardedTopology` uses this interface to point to a specific storage account (see the sketch after this list).
- Adds the concept of an `IStorageCacheSecretsProvider`. This interface is used by the `Topology` to obtain storage secrets for all the involved storage accounts / containers. The intention is that we'd be able to obtain container-specific SAS tokens to scope security.
- Rewrites `BlobFolderStorage` into `BlobStorageClientAdapter`. Instead of using the concept of paths that we had before, this class now takes an actual client and basically just adds a bunch of methods that you can run against storage. It's not an extension class because there are several shared fields.
- The concept of using a single folder inside of Azure Storage has been split off into `AzureBlobStorageFolder` (which now takes care of basically handing out storage clients).
- Renames `AzureBlobStorageCredentials` to `AzureStorageCredentials` (because this will be used for queues in the future).
- Removes bulk pin from Blob L3; it doesn't work.
- Adds test coverage to memoization operations
- Adds bicep scripts for provisioning L3 instances
- Flattens the serialization format for content and metadata: they are now at the root of the container instead of inside folders. This is deliberate, to allow for longer names, which is particularly important for Selectors inside StrongFingerprint.
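To illustrate how these pieces compose, here is a rough sketch; the interface names follow the bullets above, but all member signatures and the `CreateBlobServiceClient` helper are hypothetical:

```
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using BuildXL.Cache.ContentStore.Hashing;

// "How do I distribute objects uniformly across shards?"
public interface IShardingScheme<TKey, TShard>
{
    TShard Locate(TKey key);
}

// Obtains secrets (e.g., container-scoped SAS tokens) for a storage account.
public interface IStorageCacheSecretsProvider
{
    Task<AzureStorageCredentials> RetrieveSecretsAsync(string accountName);
}

// Hands out a storage client for any given object in the system.
public class ShardedTopology
{
    private readonly IShardingScheme<ContentHash, string> _scheme;
    private readonly IStorageCacheSecretsProvider _secrets;

    public ShardedTopology(IShardingScheme<ContentHash, string> scheme, IStorageCacheSecretsProvider secrets)
        => (_scheme, _secrets) = (scheme, secrets);

    public async Task<BlobServiceClient> GetClientAsync(ContentHash key)
    {
        string account = _scheme.Locate(key);
        AzureStorageCredentials credentials = await _secrets.RetrieveSecretsAsync(account);
        return credentials.CreateBlobServiceClient(); // hypothetical helper
    }
}
```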
[WIP]
- Deploy to CBTest to ensure nothing here breaks our current way of working. This is only because of the `BlobFolderStorage` change.
Add retries when opening a file fails with UnauthorizedAccessException.
The PR also changes the exception type thrown due to a sharing violation on Linux. Before, the code was failing with `IOException` on Linux and `UnauthorizedAccessException` on Windows. Now the behavior is consistent.
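A minimal sketch of the retry, assuming a bounded loop with a short backoff (attempt counts, delays, and sharing flags are illustrative):

```
using System;
using System.IO;
using System.Threading;

internal static class FileOpenRetry
{
    public static FileStream OpenWithRetries(string path, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read | FileShare.Delete);
            }
            catch (UnauthorizedAccessException) when (attempt < maxAttempts)
            {
                // The failure may be transient (e.g., a racing handle);
                // back off briefly and try again.
                Thread.Sleep(100 * attempt);
            }
        }
    }
}
```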
Related work items: #2039887
Don't publish recompute information globally. Also removes the concept of LocationAddRecentInactiveEager. That feature has been useless for months because machines are filtered out at both the local and global levels, so doing a proactive global registration doesn't help.
We realized that the `UseInRingMachinesForCopies` feature is not helpful: it generates a lot of extra networking traffic and doesn't improve build reliability.
This PR removes the code that we should not use. If we ever decide we need something like this, we'll implement it from scratch.
This removes the additional line on the console, below the main status line, that gets displayed when pips are paused on resources, e.g.:
`3 running pips waiting on resources`
The data is already shown in the primary status line, and we want to decrease the amount of scrolling back people have to do in the ADO console view when looking at builds that have already completed.
This PR moves the majority of the Pips dependencies out of Processes. All of the Pips usages are BXL specific. Additionally, moves a bunch of files that are only used by SandboxedProcessPipExecutor or above into ProcessPipExecutor.
After some reflection, we should be more aggressive about limiting the frequency of status updates that go to the ADO console. We are getting customer feedback that it is annoying to scroll through the many lines of status updates for longer builds. This change aims to strike a balance: provide some feedback in case folks want to see status while the build is running, but err on the side of fewer status lines, which aren't relevant once the build is over.
Usage of /RedirectedUserProfileJunctionRoot was thrown off because /enableProfileRedirect must also be set for it to take effect. This change simplifies the config by just removing /enableProfileRedirect. It was not being set anyway, and now that CloudBuild is externally setting the path, there is no longer a need for a CB-specific default path in the BuildXL codebase.
Only ensure PTrace files exist and are executable when PTrace is enabled:
- AnyBuild doesn't use PTrace. Missing access reports are compensated for by Fuse.
- In a previous PR I suggested that Pasindu use BuildXL itself as the daemon. In that case we don't have to ensure the existence of `ptracedaemon`.
- The existence and executability of `ptracerunner` still need to be guaranteed, and this should only be done once (see the sketch below).
This PR should allow AnyBuild to ingest new BuildXL without further modification in AnyBuild file distrib.
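A sketch of doing the ptracerunner check exactly once per process; the names and the executable-bit helper are hypothetical:

```
using System;
using System.IO;

internal static class PTraceDeployment
{
    // Lazy<T> guarantees the check body runs at most once per process.
    private static readonly Lazy<bool> s_runnerReady = new Lazy<bool>(() =>
    {
        string runner = Path.Combine(AppContext.BaseDirectory, "ptracerunner");
        return File.Exists(runner) && EnsureExecutableBit(runner);
    });

    public static bool EnsureRunnerReady() => s_runnerReady.Value;

    private static bool EnsureExecutableBit(string path)
    {
        // Hypothetical: e.g., chmod via P/Invoke or shelling out to 'chmod +x'.
        return true;
    }
}
```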
Move Counter from BXL.Utilities to Utilities.Core. CounterAttributes are needed by Processes, but it made sense to just move all of Counter to Utilities.Core.
Processes.Remoting is currently only used by ProcessPipExecutor. It must be moved out of Processes due to its dependency on Pips, so it's moving to its own dll, as it's separate from both Processes and ProcessPipExecutor.
The previously developed status update throttling feature only applied to the "##vso[task.setprogress" messages, which are interpreted by ADO. This change refactors that throttling to extend it to the user-visible status lines on the console. The goal is to decrease the status updates that people scroll through on ADO.
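A minimal sketch of the shared time-based throttle; the interval and names are illustrative:

```
using System;

internal sealed class StatusThrottle
{
    private readonly TimeSpan m_minInterval;
    private DateTime m_lastEmit = DateTime.MinValue;

    public StatusThrottle(TimeSpan minInterval) => m_minInterval = minInterval;

    // Returns true only when enough time has passed to emit another status
    // line; both the console status and the "##vso" progress messages can
    // share one instance.
    public bool ShouldEmit(DateTime nowUtc)
    {
        if (nowUtc - m_lastEmit < m_minInterval)
        {
            return false;
        }

        m_lastEmit = nowUtc;
        return true;
    }
}
```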
The process might clear the environment, so we need to propagate this variable on exec. Also, let's check the variable every time, even if the FAM flag is not set, to fail fast if it is unexpectedly absent: we set the variable even when PTrace mode is not enabled, so it should always be there.
Related work items: #2041253
Measure the time it takes to push outputs to the cache for each pip and record it in the perf info object used by the scheduler. Add this counter as a sub-step of ExecuteProcess in the main log perf summary.
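Conceptually the measurement looks like the following; the names are hypothetical, and the real change records the value in the scheduler's per-pip perf info:

```
using System;
using System.Diagnostics;
using System.Threading.Tasks;

internal static class CachePushTiming
{
    // Wraps the cache push and returns its duration in milliseconds.
    public static async Task<long> MeasureCachePushAsync(Func<Task> storeOutputsToCache)
    {
        var stopwatch = Stopwatch.StartNew();
        await storeOutputsToCache();
        return stopwatch.ElapsedMilliseconds;
    }
}
```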
The FIFO might still hold some unread reports when the process tree has finished, so it's incorrect to stop processing the reports as soon as we find out that all the processes for the pip have exited. Doing this causes reports to be dropped, messing with the observations for the pip.
What we do know at the point where we call `RequestStop` is that there will be no more reports pushed to the FIFO (because all the processes have exited, there are no more possible writes), so we can send a special value through it to signal the FIFO-consuming thread that it can finish. This is preferred over just disposing our keep-alive write handle to produce an EOF in the consumer (which was our original strategy), because the `Dispose` on that handle produces the EOF with a considerable delay.
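A conceptual sketch of the sentinel approach; the real logic lives in the native Linux sandbox, so this only illustrates in-band termination versus waiting for EOF:

```
using System;
using System.IO;

internal static class FifoSentinel
{
    private const int Sentinel = -1;

    // Producer side: called once all processes in the pip have exited, at
    // which point no further reports can be written.
    public static void SignalNoMoreReports(BinaryWriter fifoWriter)
    {
        fifoWriter.Write(Sentinel);
        fifoWriter.Flush();
    }

    // Consumer side: drains every queued report, then stops on the sentinel
    // instead of waiting for a (delayed) EOF from disposing the write handle.
    public static void ConsumeReports(BinaryReader fifoReader, Action<int> handleReport)
    {
        while (true)
        {
            int report = fifoReader.ReadInt32();
            if (report == Sentinel)
            {
                break;
            }

            handleReport(report);
        }
    }
}
```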
Related work items: #2021040
Capture observed inputs for failed & retried pips in the XLG.
- Let ObservedInputProcessor return observed inputs for failed pips and DFA pips.
- Add two new kinds of FingerprintComputation so the runtime analyzer ignores failed pips' fingerprint information but still does the computation for DFA pips.
When we are approaching a CB timeout, we trigger a cancellation token. As a result, we cancel all running pips as well as the post-processing of all finished pips (even if they were successful). This PR fixes the assert so we don't crash during build termination.
Related work items: #2028490
The implementation of rename (and renameat) in the Linux sandbox was generating accesses for the destination files without the write flag, and therefore they were interpreted as reads.
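Conceptually the fix is to report the rename destination with the write flag; the names below are illustrative, since the actual change is in the native interpose code:

```
using System;

[Flags]
enum RequestedAccess { Read = 1, Write = 2 }

void ReportRenameDestination(string destinationPath)
{
    // Before the fix the destination was reported without the write flag,
    // so it was interpreted as a read.
    Report(destinationPath, RequestedAccess.Write);
}

// Hypothetical sink for access reports.
void Report(string path, RequestedAccess access) { }
```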
- This change allows ptrace to run outside of the process tree of the pip, reducing any interference we have in the process execution. The previous iteration of the sandbox had issues with inheriting handles from the parent process it was forked from, causing pips to hang.
- Adds the ptracedaemon service that runs in the background to listen to ptrace requests from processes running under the interpose sandbox.
- When a request is received, the ptracerunner is executed in a separate process to attach to the tracee using PTRACE_ATTACH.
Related work items: #2037038
The breakdown of process overhead just displays the PipStep, but the ExecuteProcess PipStep can involve quite a bit more work than just having the child process active. Grab a couple of counters from constituent parts of this overhead and include them in the appropriate part of the breakdown instead of under "other".
AnyBuild started failing with the following error after updating to the latest bxl bits:
```
2023-03-11 05:48:57,219 [18] ERROR root: An exception is unhandled. Process will exit.
BuildXL.Utilities.Serialization.InsufficientLengthException: The buffer should have at least 2 length but has 0.
   at BuildXL.Utilities.Serialization.InsufficientLengthException.Throw(Int32 minLength, Int32 remainingLength) in \.\Public\Src\Utilities\Utilities\Serialization\Insuffic>
   at BuildXL.Utilities.Serialization.SpanWriter.DoEnsureLength(Int32 minLength) in \.\Public\Src\Utilities\Utilities\Serialization\SpanWriter.cs:line 187
   at BuildXL.Utilities.Serialization.SpanWriter.EnsureLength(Int32 minLength) in \.\Public\Src\Utilities\Utilities\Serialization\SpanWriter.cs:line 159
   at BuildXL.Utilities.Serialization.SpanSerializationExtensions.Write[T](SpanWriter& writer, T value) in \.\Public\Src\Utilities\Utilities\Serialization\SpanSerializatio>
   at BuildXL.Cache.ContentStore.Distributed.MetadataService.RocksDbOperations.WriteMergeLocations(SpanWriter& mergeWriter, SpanReader& reader1, SpanReader& reader2, Boole>
   at BuildXL.Cache.ContentStore.Distributed.MetadataService.RocksDbOperations.WriteMergeLocations(SpanWriter& mergeWriter, ReadOnlySpan`1 value1, ReadOnlySpan`1 value2, B>
   at BuildXL.Cache.ContentStore.Distributed.MetadataService.RocksDbOperations.MergeLocations(ReadOnlySpan`1 key, ReadOnlySpan`1 value1, ReadOnlySpan`1 value2, MergeResult>
   at BuildXL.Cache.ContentStore.Distributed.MetadataService.RocksDbOperations.ProcessSingleLocationEntry(ReadOnlySpan`1 key, ReadOnlySpan`1 value, MergeResult result) in >
   at RocksDbSharp.MergeOperators.<>c__DisplayClass4_0.<CreateAssociative>g__mergeCore|1(ReadOnlySpan`1 key, MergeOperandsEnumerator operands, MergeValueBuffers buffers, R>
   at RocksDbSharp.MergeOperators.<>c__DisplayClass4_0.<CreateAssociative>b__3(ReadOnlySpan`1 key, NullableReadOnlySpan`1 value, MergeOperandsEnumerator operands, MergeVal>
   at RocksDbSharp.MergeOperators.MergeOperatorImpl.RocksDbSharp.MergeOperator.FullMerge(IntPtr key, UIntPtr keyLength, IntPtr existingValue, UIntPtr existingValueLength, >
   at RocksDbSharp.ColumnFamilyOptions.MergeOperator_FullMerge(IntPtr state, IntPtr key, UIntPtr keyLength, IntPtr existingValue, UIntPtr existingValueLength, IntPtr opera>
```
To mitigate this issue, this PR makes the following changes:
* Increases the target buffer size for merged location entries
* Adds error handling to trace the error instead of crashing the app
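A sketch of the error handling's shape; this is hypothetical, the point being that merge operators run inside a native RocksDb callback, where an unhandled managed exception takes down the whole process:

```
// Hypothetical guard around the merge callback described in the stack above.
try
{
    RocksDbOperations.MergeLocations(key, value1, value2, result);
}
catch (Exception e)
{
    // Trace and swallow rather than letting the exception escape into the
    // native RocksDb callback and crash the app.
    Tracer.Error(context, e, "Failed to merge location entries");
}
```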