- Adds a workaround for the following issue seen on ooui:
- `error DX10105: [Pip62D946C910711FCE, !! ooui - @1js/excel-online [bundle]] PTraceRunner logged the following error: (relative_to_absolute) ['openat'] Could not get path for fd -1 with path ''; errno: 3`
- Neither path nor dirfd is set to a proper value, which causes the crash.
- Adds a retry to ReadArgumentLong.
Related work items: #2171844
Run CodeQL only in the CodeQL pipeline and enable bug filing. In addition to this PR, a change was made to disable CodeQL in the `RunCheckInTests BuildXL PR Validation` pipeline (it is not YAML-based).
We shouldn't process writes in ObservedInputProcessor (OIP shouldn't process writes, only "inputs"), so we should exclude these accesses from the list that we then process. Before this change, observations such as these were subsequently treated as probes by the OIP, and considering such probes in the fingerprint computation could cause spurious cache misses (in particular this happened while working on an orthogonal feature to consider directory probes a bit more carefully).
This change also moves the logic adding the created directories to the `createdDirectoriesMutable` collection so the paths are added before being filtered out (excluded) from the collection of observations.
The check for unique specs when adding JS pips had a potential race: even though a concurrent collection is used, the actual spec pip addition was happening outside a lock.
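A minimal sketch of the check-then-add pattern under a lock, to illustrate the kind of race described above. `SpecRegistry`, `TryAddSpecPip`, and the string-based bookkeeping are hypothetical names for illustration, not the actual BuildXL code.

```cs
using System.Collections.Generic;

// Hypothetical sketch: the uniqueness check and the addition happen under
// the same lock, so two threads cannot both pass the check for one spec.
public sealed class SpecRegistry
{
    private readonly object _lock = new();
    private readonly HashSet<string> _seenSpecs = new();
    private readonly List<string> _specPips = new();

    // Returns true only for the first caller that registers a given spec.
    public bool TryAddSpecPip(string specPath)
    {
        lock (_lock)
        {
            if (!_seenSpecs.Add(specPath))
            {
                return false; // already added by another caller
            }

            // Stands in for the actual spec pip addition.
            _specPips.Add(specPath);
            return true;
        }
    }

    public int Count
    {
        get { lock (_lock) { return _specPips.Count; } }
    }
}
```

Keeping the check and the addition in one critical section is what removes the race; a concurrent collection alone only protects each individual operation.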
Related work items: #2189126
Make the interpose sandbox more robust on missing process start events. This applies in particular to clone3, which cannot be interposed by design.
* The interpose sandbox starts populating the parent id for every access (before this change it was already being sent, but it was always 0).
* When an access arrives for which we haven't seen a process start event, we use the parent process to retrieve the path and arguments. On Linux, clone/fork doesn't alter these, so this assumption is safe.
* If we haven't heard from the parent, we keep track of this process until we hear from it as the callstack unfolds.
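The fallback described in the list above can be sketched roughly as follows. All names here (`ProcessTable`, `ProcessInfo`, `OnAccess`, `OnProcessStart`) are hypothetical and only illustrate the idea; the real sandbox code is organized differently.

```cs
#nullable enable
using System.Collections.Generic;

// Hypothetical sketch, not the BuildXL implementation.
public sealed record ProcessInfo(string Path, string Args);

public sealed class ProcessTable
{
    private readonly Dictionary<int, ProcessInfo> _known = new();
    // Child pid -> parent pid, for children whose parent is not known yet.
    private readonly Dictionary<int, int> _waitingOnParent = new();

    // An explicit process start (or exec) report arrived for this pid.
    public void OnProcessStart(int pid, string path, string args)
    {
        _waitingOnParent.Remove(pid);
        _known[pid] = new ProcessInfo(path, args);
        ResolveWaiters(pid);
    }

    // An access report arrived; returns the process info if it is known now.
    public ProcessInfo? OnAccess(int pid, int parentPid)
    {
        if (_known.TryGetValue(pid, out var info))
        {
            return info;
        }

        if (_known.TryGetValue(parentPid, out var parent))
        {
            // clone/fork keeps path and args, so inherit them from the parent.
            _known[pid] = parent;
            ResolveWaiters(pid);
            return parent;
        }

        // Parent not seen yet: remember the child until we hear from it.
        _waitingOnParent[pid] = parentPid;
        return null;
    }

    // Propagates the info of a newly known pid to children waiting on it.
    private void ResolveWaiters(int pid)
    {
        var ready = new List<int>();
        foreach (var (child, parent) in _waitingOnParent)
        {
            if (parent == pid) ready.Add(child);
        }

        foreach (var child in ready)
        {
            _waitingOnParent.Remove(child);
            _known[child] = _known[pid];
            ResolveWaiters(child);
        }
    }
}
```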
A couple other details:
* FailFast is now enabled on the action blocks. We always want unhandled exceptions to propagate and crash bxl.
* We now make sure to always have one instance of each active process id around. This avoids duplicate reports (we report process start on the parent and the child). Some test counters had to be updated because of this. The main advantage is that whenever we have to update the path & args of a process because we see an exec() coming, we only have to update one instance. Before, we were only updating the 'first one', leaving subsequent process starts with the same pid with the wrong path.
Related work items: #2172444
The worker crashes due to an error event and a success result (ValidateSuccess assertion failure). When there is a connection issue with the orchestrator, the OnConnectionFailure event handler is triggered on the worker. However, it simultaneously receives an Exit message from the orchestrator. Therefore, we now only log an error if the exit was not previously triggered.
An example build (https://cloudbuild.microsoft.com/build/18e62434-6d8f-cedf-37dd-68d9e6af1c74)
[12:05.527] [08:05:28.55 UTC] verbose DX7047: Connection with orchestrator timed out. Details: Timed out on a call to the worker. Assuming the worker is dead. Call timeout: 5 min. Retries: 2.
[18:42.912] [08:12:05.93 UTC] error DX7004: There were no calls from orchestrator in the last 10 minutes. Assuming it is dead and exiting.
[22:20.792] critical DX0059: Catastrophic BuildXL Failure.
Build:[0.1.0-20240531.3.1][refs/heads/releases/0.1.0-20240531.3.1:6227578178e15e04f42f6ebf33dbef0ab1a5da4e].
Exception:BuildXL.Utilities.Core.BuildXLException: No error should be logged if status is success. Errors logged: 7004
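One way to express the "only log an error if the exit was not previously triggered" guard is an atomic flag, sketched below. `WorkerExitGuard` and its members are hypothetical names; the sketch only illustrates the pattern, assuming both events can fire concurrently.

```cs
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch, not the BuildXL worker service.
public sealed class WorkerExitGuard
{
    private int _exitTriggered; // 0 = not yet, 1 = exit already triggered

    // Stands in for the error log.
    public List<string> Errors { get; } = new();

    // Returns true only for the first caller, even under concurrency.
    private bool TryBeginExit() =>
        Interlocked.CompareExchange(ref _exitTriggered, 1, 0) == 0;

    // The orchestrator sent an Exit message.
    public void OnExitMessage() => TryBeginExit();

    // The connection to the orchestrator failed.
    public void OnConnectionFailure(string details)
    {
        // Log the error only if no exit was triggered before this event.
        if (TryBeginExit())
        {
            lock (Errors)
            {
                Errors.Add(details);
            }
        }
    }
}
```

With this shape, an Exit message that wins the race suppresses the later connection-failure error, so a successful result is never paired with a logged error.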
Related work items: #2172783
Refactor and fix the capture of the org and codebase build properties for the Reliability dashboard.
Most of the URLs in the Reliability dashboard use the dev.azure.com or visualstudio.com format,
so I reused the existing code to normalize the gitRemoteRepoUrl values.
Added a few test cases to cover these scenarios.
Related work items: #2188136
When the ptrace sandbox is active, we have two concurrent processing blocks sending messages to the pending report queue. That means the queue cannot guarantee single writer.
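As an illustration of the single-writer concern, here is a sketch using `System.Threading.Channels`, where the queue is explicitly configured for multiple writers. This is an assumption for illustration only: the actual pending report queue may be built on a different primitive, and `PendingReportQueueSketch` and the producer names are hypothetical.

```cs
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch, not the BuildXL pending report queue.
public static class PendingReportQueueSketch
{
    // Two producers write concurrently; returns how many reports were read.
    public static async Task<int> RunAsync(int reportsPerProducer)
    {
        var queue = Channel.CreateUnbounded<string>(new UnboundedChannelOptions
        {
            SingleWriter = false, // two blocks may write at the same time
            SingleReader = true,
        });

        Task Produce(string source) => Task.Run(() =>
        {
            for (int i = 0; i < reportsPerProducer; i++)
            {
                // TryWrite always succeeds on an open unbounded channel.
                queue.Writer.TryWrite($"{source}:{i}");
            }
        });

        var producers = Task.WhenAll(Produce("blockA"), Produce("blockB"));

        var consumer = Task.Run(async () =>
        {
            int received = 0;
            await foreach (string report in queue.Reader.ReadAllAsync())
            {
                received++;
            }
            return received;
        });

        await producers;
        queue.Writer.Complete(); // lets the reader loop finish
        return await consumer;
    }
}
```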
- Reintroduces !785928
- Removes dependency on IOHandler/AccessHandler from macOS code.
- Cleans up file operations in the native sandbox to remove macOS-specific ones.
- Remove duplicate operations in ReportedFileOperation.
- Report exec calls separately from forks (using the ProcessExec report type), and update exe and command line on tracked process objects on the managed code.
- Update report format, and allow for different report types on the Linux sandbox.
- Report parent process pid on sandbox reports.
- Rename MacLookup to AbsentProbe
- Updates SandboxEvent type to contain everything required for an access report including the access check result.
- Adds a report builder class that is responsible for building the report string to be returned to the managed side.
- Removes old ES_EVENT_* constants and replaces them with buildxl::linux::EventType
Related work items: #2111125
Refactor HistoricMetadata cache - Compensate for BlobL3 topology not having HistoricMetadataCache
Separated the functionality currently present in the HistoricMetadataCache class into two classes:
**HistoricMetadataHashLookupManager.cs**
- This class extends the functionality of PipTwoPhaseCache by incorporating hash-to-hash lookup capabilities.
**HistoricMetadataCache.cs**
- This class extends the functionality of the HistoricMetadataHashLookupManager class by incorporating the metadata querying functionality.
Added the counter **InternalHashToHashHistoricMetadataCacheReadCount** to ensure that we use HMD for hash-to-hash (H2H) lookup.
Attached are test results
[blobll3 (2).xlsx](https://dev.azure.com/mseng/9ed2c125-1cd5-4a17-886b-9d267f3a5fab/_apis/git/repositories/50d331c7-ea65-45eb-833f-0303c6c2387e/pullRequests/788089/attachments/blobll3%20%282%29.xlsx)
Related work items: #2175115
Adding back this PR with a small fix.
https://dev.azure.com/mseng/Domino/_git/BuildXL.Internal/pullrequest/784626
With this PR, we first check whether the path matches with the Windows prefix in place. If it does not, we remove the prefix and try the match again.
The first test case in the newly added unit test validates the change:
[InlineData(@"^\\\\\?\\.*", @"\\?\c:\foo\bar\file.txt", true)]
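The two-step match can be sketched as below. `PathRegexMatcher` is a hypothetical helper, and treating `\\?\` as the Windows prefix is an assumption based on the test case above.

```cs
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch, not the BuildXL implementation.
public static class PathRegexMatcher
{
    // Assumed to be the "Windows prefix" referred to above.
    private const string LongPathPrefix = @"\\?\";

    public static bool IsMatch(string pattern, string path)
    {
        var regex = new Regex(pattern);

        // First attempt: match the path as given, prefix included.
        if (regex.IsMatch(path))
        {
            return true;
        }

        // Second attempt: strip the prefix, if present, and match again.
        return path.StartsWith(LongPathPrefix, StringComparison.Ordinal)
            && regex.IsMatch(path.Substring(LongPathPrefix.Length));
    }
}
```

Under these assumptions, a pattern written against the prefixed form and a pattern written against the plain `c:\...` form both match the same prefixed path.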
Related work items: #2165593
We are seeing some 1JS builds that spend quite a lot of time in ReplayWarningsFromCacheAsync (some up to 20 minutes) waiting on the materialization semaphore to materialize stderr/stdout. Make those materialization requests bypass the materialization semaphore. There are always 2 such materialization requests per pip (so in terms of throttling they should be negligible), yet this step blocks the scheduler while it waits.
A more comprehensive change should include regular input/output materialization, such that we have some sort of distinct queues/priority queues for those. But that's a bigger change to make.
- Removes dependency on IOHandler/AccessHandler from macOS code.
- Cleans up file operations in the native sandbox to remove macOS-specific ones.
- Remove duplicate operations in ReportedFileOperation.
- Report exec calls separately from forks (using the ProcessExec report type), and update exe and command line on tracked process objects on the managed code.
- Update report format, and allow for different report types on the Linux sandbox.
- Report parent process pid on sandbox reports.
- Rename MacLookup to AbsentProbe
- Updates SandboxEvent type to contain everything required for an access report including the access check result.
- Adds a report builder class that is responsible for building the report string to be returned to the managed side.
- Removes old ES_EVENT_* constants and replaces them with buildxl::linux::EventType
Reverts !785928
The point of this abstraction is to provide a way to add debugging capabilities throughout the code with minimal overhead when in a 'disabled' state. This way, consumers only need to decide at construction time whether debugging is enabled (typically through an optional setting for that particular section of the code), and can add arbitrary debugging lines in the rest of the logic without worrying about impacting builds that are not meant to be observed with the extra logging. This is achieved by using a custom InterpolatedStringHandler, which only constructs the interpolated strings when debugging is enabled.
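For readers unfamiliar with the mechanism, here is a minimal sketch of a custom interpolated string handler (C# 10 or later). `DebugTraceSketch` and `TraceHandler` are hypothetical names and this is not the actual DebugTrace implementation; it only shows how the compiler can skip building the string, including evaluating the interpolation holes, when tracing is disabled.

```cs
#nullable enable
using System.Runtime.CompilerServices;
using System.Text;

// Hypothetical sketch, not the actual DebugTrace class.
public sealed class DebugTraceSketch
{
    private readonly StringBuilder? _log;

    public DebugTraceSketch(bool enabled) => _log = enabled ? new StringBuilder() : null;

    public bool Enabled => _log != null;

    // Plain (non-interpolated) strings bind to this overload.
    public void AppendLine(string line) => _log?.AppendLine(line);

    // Interpolated strings bind here; "" passes the receiver to the handler.
    public void AppendLine([InterpolatedStringHandlerArgument("")] TraceHandler handler)
    {
        if (_log != null)
        {
            _log.AppendLine(handler.GetText());
        }
    }

    public override string ToString() => _log?.ToString() ?? string.Empty;
}

[InterpolatedStringHandler]
public ref struct TraceHandler
{
    private readonly StringBuilder? _builder;

    // When shouldAppend is false, the compiler skips every Append* call,
    // so the expressions in the interpolation holes are never evaluated.
    public TraceHandler(int literalLength, int formattedCount,
                        DebugTraceSketch trace, out bool shouldAppend)
    {
        shouldAppend = trace.Enabled;
        _builder = shouldAppend ? new StringBuilder(literalLength) : null;
    }

    public void AppendLiteral(string value) => _builder!.Append(value);

    public void AppendFormatted<T>(T value) => _builder!.Append(value?.ToString());

    internal string GetText() => _builder?.ToString() ?? string.Empty;
}
```

A call such as `trace.AppendLine($"split {ss}")` on a disabled instance then costs little more than the handler construction and a boolean check.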
I ran a BenchmarkDotNet benchmark that does some random processing of a bunch of strings:
```cs
// `enabled`, `diag` and `myGuidStrings` are assumed to be members of the benchmark class.
int containsA = 0, randomCondition = 0;
// myGuidStrings contains 5000 guids as strings
for (int i = 0; i < myGuidStrings.Count; i++)
{
    using var trace = new DebugTrace(enabled);
    trace.AppendLine("Starting processing");
    var s = myGuidStrings[i];
    var someRandomCondition = s.Contains("a477");
    trace.AppendLine($"Condition: {someRandomCondition}");
    if (someRandomCondition)
    {
        randomCondition++;
    }

    var splits = s.Split("8");
    foreach (var ss in splits)
    {
        trace.AppendLine($"split {ss}");
        if (ss.Contains('a'))
        {
            trace.AppendLine("split contains a");
            containsA++;
        }
    }

    // Keep the result so the trace contents are not optimized away.
    diag = trace.ToString();
}
```
The benchmark runs with
```
(1) Enabled: Constructing DebugTrace with enabled = true
(2) Disabled: Constructing DebugTrace with enabled = false
(3) NoTrace: DebugTrace stripped from the code (not using it at all)
```
With 5000 strings:
```
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|--------- |-----------:|---------:|--------:|---------:|-----------:|
| Enabled | 2,976.2 us | 11.27 us | 9.99 us | 203.1250 | 3344.74 KB |
| Disabled | 888.7 us | 3.68 us | 3.26 us | 52.7344 | 868.15 KB |
| NoTrace | 872.0 us | 3.05 us | 2.85 us | 52.7344 | 868.98 KB |
```
With 100,000 strings:
```
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|--------- |---------:|---------:|---------:|----------:|----------:|
| Enabled | 63.15 ms | 0.323 ms | 0.302 ms | 4000.0000 | 65.44 MB |
| Disabled | 19.30 ms | 0.224 ms | 0.209 ms | 1062.5000 | 17 MB |
| NoTrace | 17.70 ms | 0.133 ms | 0.124 ms | 1062.5000 | 17.03 MB |
```
Comparing `Disabled` and `NoTrace` shows that having this code in the "disabled" state adds very little overhead (about 2% in mean time with 5,000 strings and about 9% with 100,000, bearing in mind that roughly half of the operations in the benchmark are on the DebugTrace object), and no extra allocations.
Use the grpcs scheme instead of grpc when creating the location. With the grpc scheme, the encrypted gRPC port is identified as -1 instead of 7092:
```
2024-05-21 14:23:36,202 [14] DEBUG Creating Ephemeral cache. Type=[DatacenterWideEphemeral] RootPath=[F:\dbs\sh\cb_m\0521_141826\TempFileStore\097a2a56-be5a-40b9-82b2-8eea9ae54557] MaxCacheSizeMb=[52428] Location=[grpc://61fb6450c000004:7092/] Leader=[61fb6450c000004 -> grpc://61fb6450c000004:7092/] Id=[CacheId { Universe = default, Namespace = default }]
...
2024-05-21 14:23:36,640 [17] DEBUG 0d000adf-5f53-4c98-bbc7-42842594cb8e GrpcCoreServerHost.TryGetEncryptedCredentials: Found gRPC Encryption Certificate.
2024-05-21 14:23:36,642 [23] DEBUG 0d000adf-5f53-4c98-bbc7-42842594cb8e GrpcCoreServerHost.StartAsync: Server creating Encrypted Grpc channel on port -1
```
Related work items: #2181670