## Description
Currently, there are some filesystem operations in Gitrest that result
in a generic 400 HTTP error code, rather than a helpful HTTP status and
message based on the error that occurred.
This PR adds some wrapper functions that help determine if an error is a
FileSystemError (or RedisFSError, which is similar) and bubble that up
as a NetworkError that can be parsed for the HTTP response.
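
A minimal sketch of what such a wrapper could look like. The `NetworkError` class, the `IFileSystemErrorLike` shape, and the code-to-status table below are stand-ins invented for illustration; the actual types and mappings used in Gitrest may differ.

```typescript
// Hypothetical stand-in for the NetworkError type consumed by the HTTP layer.
class NetworkError extends Error {
    public readonly code: number;
    constructor(code: number, message: string) {
        super(message);
        this.name = "NetworkError";
        this.code = code;
    }
}

// Assumed shape: Node-style filesystem errors carry a string `code` such as "ENOENT".
interface IFileSystemErrorLike {
    code: string;
    message: string;
}

function isFileSystemErrorLike(error: unknown): error is IFileSystemErrorLike {
    if (typeof error !== "object" || error === null) {
        return false;
    }
    const candidate = error as { code?: unknown; message?: unknown };
    return typeof candidate.code === "string" && typeof candidate.message === "string";
}

// Illustrative mapping only; the PR description does not list the real statuses.
const statusByFsCode: Record<string, number> = {
    ENOENT: 404,
    EEXIST: 409,
    EACCES: 403,
};

// Converts any thrown value into a NetworkError with a best-effort HTTP status.
function toNetworkError(error: unknown, fallbackStatus: number = 500): NetworkError {
    if (error instanceof NetworkError) {
        return error;
    }
    if (isFileSystemErrorLike(error)) {
        const status = statusByFsCode[error.code] ?? fallbackStatus;
        return new NetworkError(status, `${error.code}: ${error.message}`);
    }
    return new NetworkError(fallbackStatus, "Unknown error");
}

// Wraps a filesystem operation so callers only ever see NetworkError.
async function withFsErrorTranslation<T>(operation: () => Promise<T>): Promise<T> {
    try {
        return await operation();
    } catch (error: unknown) {
        throw toNetworkError(error);
    }
}
```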
## Description
Updates our transitive dependencies on `path-to-regexp` to versions that
fixed https://nvd.nist.gov/vuln/detail/CVE-2024-45296 . Accomplished by
updating our direct dependencies on `sinon` to a mix of version 18 and
19, since that's the main way in which we get transitive dependencies on
`path-to-regexp`.
`@types/sinon` was also opportunistically updated to the latest version
where it wasn't already up to date.
## Description
Updates `tar` to version `6.2.1` to address
https://nvd.nist.gov/vuln/detail/CVE-2024-28863 . Done by adding a
`pnpm.overrides` entry `"tar": "^6.2.1"` to the package.json of each of
the affected packages, running `pnpm i --no-frozen-lockfile`, then
removing the override from package.json and running the same command
again.
## Description
This PR adds support for a fluid token issuance endpoint. This endpoint
can be used to issue fluid access tokens based on a custom
implementation of the `IFluidAccessTokenGenerator` interface. One use of
this endpoint is to enable access using cloud identity providers such as
Entra-ID. Alfred has a new endpoint -
`api/v1/tenants/:tenantid/accesstoken`. This endpoint expects a `Bearer`
token and creates an access token for a given fluid tenant.
This PR also adds support to inject a custom implementation of the
`IFluidAccessTokenGenerator`. This can be used to implement any custom
logic that is needed for a business use case.
It also has unit tests for the following:
1) Throttling of the endpoint
2) Validation of cases where a `Bearer` token is not provided or an
invalid authorization method is used
3) Validation of a valid token creation path
4) Validation of token creation failure due to invalid token signature
and due to unauthorized access
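
As an illustration of the `Bearer` validation cases listed above, here is a small sketch of header parsing. The function name and the `reason` values are hypothetical and are not taken from the Alfred implementation.

```typescript
// Result of inspecting an Authorization header. The reason strings are illustrative.
type BearerParseResult =
    | { ok: true; token: string }
    | { ok: false; reason: "missing-header" | "unsupported-scheme" | "malformed-token" };

function parseBearerToken(authorizationHeader: string | undefined): BearerParseResult {
    const header = authorizationHeader?.trim() ?? "";
    if (header === "") {
        return { ok: false, reason: "missing-header" };
    }
    const parts = header.split(/\s+/);
    const scheme = parts[0] ?? "";
    // The auth scheme name is compared case-insensitively.
    if (scheme.toLowerCase() !== "bearer") {
        return { ok: false, reason: "unsupported-scheme" };
    }
    const token = parts[1];
    // Expect exactly "<scheme> <token>".
    if (token === undefined || parts.length !== 2) {
        return { ok: false, reason: "malformed-token" };
    }
    return { ok: true, token };
}
```

A route handler would reject the request when `ok` is false and otherwise hand the token to the injected `IFluidAccessTokenGenerator` implementation.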
## Breaking Changes
This PR adds a new resourceFactory arg of type
`IFluidAccessTokenGenerator`. This arg is customizable as well.
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
## Description
Updates the build-tools and build-cli dev dependencies in
protocol-definitions, common-utils, server/historian, and
server/gitrest. This gets webpack updated to the latest 5.x version,
which addresses https://nvd.nist.gov/vuln/detail/CVE-2024-43788 .
Also updates eslint-config-fluid in protocol-definitions,
server/historian, and server/gitrest, just to keep with the latest
version.
## Description
Updates dependencies to get to ws@8.17.1 (or ws@7.5.10) to address
https://nvd.nist.gov/vuln/detail/CVE-2024-37890. Updating socket.io to
4.8.0 was necessary in some cases to get the required dependency ranges.
Going from socket.io 4.7.5 to 4.8.0 is a minor semver update, but it
contains [a breaking change in the type of the `close()`
function](https://github.com/socketio/socket.io/pull/4971/files), so two
places had to be updated to account for that.
## Description
The `StartupCheck` implementation was consumed by Gitrest and Historian.
However, since the r11 packages consumed by Gitrest and Historian are
not updated as frequently as r11s consumed by repos such as FRS, this
dependency caused a breakage in Gitrest in FRS.
The solution is to make the startupCheck parameter in resourceFactories
more generic by typing it as `IReadinessCheck`. This will prevent any
future class dependencies from breaking.
## Breaking Changes
There should be none as this change makes the parameter more 'generic'.
That is, old implementations should still work fine.
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
Skip the token cache if the token is due to expire within 5 minutes.
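
A rough sketch of that check, assuming a cache entry that records its expiry as a millisecond timestamp. The `ICachedToken` shape and the function name are hypothetical.

```typescript
// Tokens expiring within this window are treated as cache misses.
const expiryBufferMs = 5 * 60 * 1000;

// Hypothetical cache entry shape.
interface ICachedToken {
    token: string;
    expiresAtMs: number;
}

// Returns the cached token only if it stays valid for longer than the buffer.
function getUsableCachedToken(
    cached: ICachedToken | undefined,
    nowMs: number = Date.now(),
): string | undefined {
    if (cached === undefined) {
        return undefined;
    }
    return cached.expiresAtMs - nowMs > expiryBufferMs ? cached.token : undefined;
}
```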
---------
Co-authored-by: Yunho <yunho-macbookpro2024@DESKTOP-M86HBMH.redmond.corp.microsoft.com>
Co-authored-by: Yunho <yunho-macbookpro2024@Yunhos-MacBook-Pro.local>
Co-authored-by: Yunho <yunho-macbookpro2024@Yunhos-MBP.guest.corp.microsoft.com>
## Description
This PR follows the r11s PR -
https://github.com/microsoft/FluidFramework/pull/22819 - to remove the
usage of the `StartupCheck` singleton by Historian and Gitrest. This
singleton along with r11 package mismatch caused bugs. Hence, now I pass
the implementation of the startup probe as a resource to the server.
## Breaking Changes
Changes the resourceFactory args of both Gitrest and Historian.
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
## Description
The singleton implementation of `StartupCheck` causes bugs when the r11
packages do not match the historian packages consumed. Hence, I decided
to switch to a non-singleton implementation. This introduces the
`StartupCheck` as an implementation of `IReadinessCheck`. This probe is
a resource provided to all HTTP services in r11.
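
To illustrate the difference from a singleton, here is a sketch in which each service owns its own probe instance. The `IReadinessStatus` and `IReadinessCheck` shapes and the `setReady` method are assumptions made for this example and may not match the real definitions.

```typescript
// Assumed shapes for the readiness contract.
interface IReadinessStatus {
    ready: boolean;
    exception?: unknown;
}

interface IReadinessCheck {
    isReady(): Promise<IReadinessStatus>;
}

// Non-singleton probe: state lives on the instance rather than in module scope,
// so two copies of the package cannot disagree about a shared global.
class StartupCheck implements IReadinessCheck {
    private ready = false;

    public setReady(): void {
        this.ready = true;
    }

    public async isReady(): Promise<IReadinessStatus> {
        return { ready: this.ready };
    }
}

// Each service creates its own probe and passes it along as a resource.
const alfredStartupCheck = new StartupCheck();
const riddlerStartupCheck = new StartupCheck();
alfredStartupCheck.setReady();
```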
## Breaking Changes
Changes Resource and Runner objects for Alfred, Riddler and Nexus to
include the `StartupCheck` object.
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
## Description
This PR adds circuit breaker functionality to the scriptorium lambda. It
handles exceptions where a service restart does not help and we instead
want to wait and retry. For example, when MongoDB is unavailable and
scriptorium cannot write ops to the db, restarting the service doesn't
help; instead, we wait and retry after some time. The circuit breaker
pattern helps in such cases by maintaining an open/closed/halfOpen
state.
So in scriptorium, all the calls to db are wrapped by the circuit
breaker, and in case of such errors, the circuit will open and pause the
lambda (i.e. pause the incoming messages). After some time, the circuit
will go to halfOpen state and call a healthCheck function - if it
succeeds, the circuit will close and resume the incoming messages, else
it will stay open and paused.
We can configure various options, like the error threshold, the reset
timeout, the errors for which we want to engage the circuit breaker,
etc. Also, if the circuit is not able to close or resume for some
(configurable) time, we will fall back to restarting the service to
avoid waiting endlessly.
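
The open/closed/halfOpen flow described above can be sketched roughly as follows. This is not the scriptorium implementation: the class and option names are hypothetical, and where the real lambda pauses incoming messages, this sketch simply rejects calls while the circuit is open.

```typescript
type CircuitState = "closed" | "open" | "halfOpen";

// Hypothetical option names for illustration.
interface ICircuitBreakerOptions {
    errorThreshold: number; // consecutive matching failures before opening
    resetTimeoutMs: number; // how long to stay open before probing
    healthCheck: () => Promise<void>; // rejects while the dependency is down
    shouldTrip: (error: unknown) => boolean; // error filter
    now?: () => number; // injectable clock, useful for tests
}

class SimpleCircuitBreaker {
    private state: CircuitState = "closed";
    private failures = 0;
    private openedAtMs = 0;
    private readonly options: ICircuitBreakerOptions;
    private readonly now: () => number;

    constructor(options: ICircuitBreakerOptions) {
        this.options = options;
        this.now = options.now ?? (() => Date.now());
    }

    public getState(): CircuitState {
        return this.state;
    }

    public async fire<T>(operation: () => Promise<T>): Promise<T> {
        if (this.state === "open") {
            if (this.now() - this.openedAtMs < this.options.resetTimeoutMs) {
                throw new Error("Circuit is open");
            }
            // Reset timeout elapsed: probe the dependency before resuming.
            this.state = "halfOpen";
            try {
                await this.options.healthCheck();
            } catch {
                this.open();
                throw new Error("Circuit is open: health check failed");
            }
            this.close();
        }
        try {
            const result = await operation();
            this.failures = 0;
            return result;
        } catch (error: unknown) {
            if (this.options.shouldTrip(error)) {
                this.failures++;
                if (this.failures >= this.options.errorThreshold) {
                    this.open();
                }
            }
            throw error;
        }
    }

    private open(): void {
        this.state = "open";
        this.openedAtMs = this.now();
    }

    private close(): void {
        this.state = "closed";
        this.failures = 0;
    }
}
```

The fallback restart described above would sit outside this class, for example as a timer that fires if the circuit has not closed within the configured window.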
This PR is for scriptorium, and once we validate and roll this out in
production, we will add the same pattern for document lambdas too.
Summary of changes made in this PR:
- Circuit Breaker Implementation: Adds a circuit breaker pattern to
scriptorium->db calls, with various configuration options for error
thresholds, reset timeouts, and error filters.
- Pause and Resume Methods: Adds pause and resume methods for lambdas,
context, documentContext, partition, partitionManager, kafkaRunner,
rdKafkaConsumer, and lambda to manage message flow during circuit
breaker states.
- Health Check for MongoDB: Adds a health check method to the MongoDB
class and exposes a healthCheck property from the MongoManager class.
## Testing
- [X] Added unit tests for circuit breaker.
- [X] Tested the scriptorium end to end functionality locally by forcing
the db to be unavailable in the local setup.
- [x] Tested in dev cluster by changing mongo db settings to replicate a
networking error.
We will roll this out slowly by testing in each ring.
---------
Co-authored-by: Shubhangi Agarwal <shuagarwal@microsoft.com>
A long time ago (5acfef448f) we added
support in ContainerRuntime to parse op contents if it's a string. The
intention was to stop parsing in DeltaManager once that saturated. This
is that long overdue follow-up.
Taking this opportunity to make a few things hopefully clearer in
ContainerRuntime too:
* Highlighting where/how the serialization/deserialization of `contents`
happens
* Highlighting the different treatment/expectations for runtime v.
non-runtime messages during `process` flow
## Deprecations:
Deprecating use of `contents` on the event arg `op` for
`batchBegin`/`batchEnd` events. Anyone still reading it is in for a
surprise, so I added a changeset for this case.
## Description
- Add metrics to know where time is being spent during session discovery
- Broken down into two primary pieces: verifyStorageToken and getSession
- GetSession is further broken down into three parts:
checkDocumentExistence, updateExistingSession, and createNewSession
- checkDocumentExistence is the DB call that is made to retrieve the doc
and see if it exists
- updateExistingSession will only happen if the session is not yet
alive/discovered
- createNewSession will only happen if the session is undefined (docs
created before the concept of service sessions)
---------
Co-authored-by: Brandon Diaz <“BrandonLouisDiaz@gmail.com”>
## Description
Updates transitive dependencies on `braces` from 3.0.2 to 3.0.3 to
address [CVE-2024-4068](https://nvd.nist.gov/vuln/detail/CVE-2024-4068).
A couple of applications of `flub modify lockfile --dependency braces
--version 3.0.3 --releaseGroup <release group>`, and some manual updates
in packages/release groups that we can't target with `flub`, basically
doing the same thing but manually (add an override in package.json,
install dependencies, remove override, install dependencies again to
clean up override from the lockfile).
In a few cases I got unrelated updates, mostly about node types, which I
reverted manually.
Server packages also got a `semver` update from 7.6.0 to 7.6.3, which
seems fine.
Adds a new error code, `TokenRevoked`, to the `InternalErrorCode` enum so
the driver can handle the token revocation scenario: it should refresh
the token and reconnect.
Co-authored-by: Yunho <yunho-macbookpro2024@Yunhos-MBP.guest.corp.microsoft.com>
## Description
This PR takes in the r11 changes -
https://github.com/microsoft/FluidFramework/pull/22635 - and adds
support for the `/healthz` endpoints for `Historian` and `Gitrest`.
1. `/healthz/startup`: Startup readiness check endpoint
2. `/healthz/ready`: Service lifecycle readiness check endpoint
3. `/healthz/ping`: Liveness endpoint. This endpoint was not added for
`Historian` as it already has an existing ping endpoint `/repos/ping`
These are needed to support Kubernetes Health Checks.
The readiness endpoint would need a custom implementation of
IReadinessCheck. If this is not provided, the endpoint will not be
created.
## Breaking Changes
Adds customizations to the ResourceFactory and Runners of each of the
services mentioned above. These are used to inject an implementation of
IReadinessCheck.
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
## Description
This PR adds support for the following endpoints for `Riddler, Nexus,
and Alfred`:
1) `/healthz/startup`: Startup readiness check endpoint
2) `/healthz/ready`: Service lifecycle readiness check endpoint
3) `/healthz/ping`: Liveness endpoint. This endpoint was not added for
`Alfred` as it already has an existing ping endpoint `/api/v1/ping`
These are needed to support Kubernetes Health Checks.
The startup endpoint relies on a new singleton class introduced in this
PR - `StartupChecker`. This class returns the `startup` status as
`isReady: true` after the service runner is created.
The readiness endpoint would need a custom implementation of
`IReadinessCheck`. If this is not provided, the endpoint will not be
created.
To support HTTP endpoints in Nexus, it also adds a request listener to
the HTTP server setup in Nexus.
## Breaking Changes
Adds customizations to the ResourceFactory and Runners of each of the
services mentioned above. These are used to inject an implementation of
`IReadinessCheck`.
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
## Description
This PR makes it so alfred redirects requests whose path starts with
`/socket.io` to nexus for handling instead of trying to handle them
itself, specifically in the case of a local routerlicious environment
running in docker.
### Context
While trying to run our e2e tests against a local routerlicious
environment running in docker I noticed that some compat tests with
older versions (1.x) were failing consistently, and looking at the
server logs I realized that requests for the delta stream were being
received by alfred, who doesn't handle them anymore since
https://github.com/microsoft/FluidFramework/pull/19227. That PR updated
the kubernetes manifests so requests to alfred's URL where the path
starts with `/socket.io` are actually routed to nexus now. I believe
that was necessary because older versions of the driver would not
understand new settings for the deltaStreamUrl. That makes things work
for an AKS deployment, but we missed doing the same thing for the local
docker environment, which this PR fixes.
## Description
Client side changes needed to support targeting signals to a specific
client id.
Signals are now sent with the v2 signals protocol (`ISentSignalMessage`).
The unnecessary override of the `submitSignal` function is removed from
localDocumentDeltaConnection, since this is handled in the
documentDeltaConnection of the base driver.
These changes follow the server changes to support targeted signals in
#19519.
[ADO Task
7026](https://dev.azure.com/fluidframework/internal/_workitems/edit/7026)
## Description
During peak traffic hours, the RedisCollaborationSessionManager
introduced in #22381 could potentially return thousands of sessions.
After 1,600 sessions, this exceeds the recommended maximum Redis
response size of 200kb (each session+key is about 172 bytes) for optimal
efficiency.
To improve efficiency, we can use [Redis
HSCAN](https://redis.io/docs/latest/commands/hscan/) to fetch sessions
from Redis in batches. Here, the default number of sessions per batch is
800 (half the maximum) to allow wiggle room for future session
information.
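
A sketch of the batched fetch, written against a minimal client interface so it stays self-contained. The `IHashScanner` interface and `scanAllFields` name are hypothetical; the `hscan(key, cursor, "COUNT", count)` call shape mirrors what ioredis exposes. Note that Redis treats `COUNT` as a hint, so a batch may contain more or fewer entries than requested.

```typescript
// Minimal subset of a Redis client needed for this sketch.
interface IHashScanner {
    hscan(
        key: string,
        cursor: string,
        countToken: "COUNT",
        count: number,
    ): Promise<[string, string[]]>;
}

// Iterates a Redis hash in batches and collects every field/value pair.
async function scanAllFields(
    client: IHashScanner,
    key: string,
    batchSize: number = 800,
): Promise<Map<string, string>> {
    const result = new Map<string, string>();
    let cursor = "0";
    do {
        const [nextCursor, flat] = await client.hscan(key, cursor, "COUNT", batchSize);
        // HSCAN returns a flat [field, value, field, value, ...] array.
        for (let i = 0; i + 1 < flat.length; i += 2) {
            const field = flat[i];
            const value = flat[i + 1];
            if (field !== undefined && value !== undefined) {
                result.set(field, value);
            }
        }
        cursor = nextCursor;
    } while (cursor !== "0"); // a returned cursor of "0" marks the end of the scan
    return result;
}
```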
### Tests
Added some unit tests for the RedisCollaborationSessionManager, and
bumped the `ioredis-mock` version to include stipsan/ioredis-mock#1300.
## Description
It is redundant and a waste of space to store the documentId and
tenantId in redis fields when they are already present in the key.
Improves #22381
Use Node's `net` library for IP type detection instead of a custom
method. Some IP addresses may not be recognized or printed correctly by
a hand-written regular expression.
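
For reference, a small sketch of IP type detection with Node's built-in `net.isIP`, which returns `4`, `6`, or `0`. The `describeIpType` helper is a hypothetical name used only for this example.

```typescript
import { isIP } from "node:net";

// Classifies an address string using Node's parser instead of a regex.
function describeIpType(address: string): "IPv4" | "IPv6" | "invalid" {
    const version = isIP(address);
    if (version === 4) {
        return "IPv4";
    }
    if (version === 6) {
        return "IPv6";
    }
    return "invalid";
}
```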
## Description
Currently, the only reliable way to track a session in R11s is via
Deli's `SessionResult` metric, which depends on Join/Leave Ops and
Deli's "close" handler. This session tracking does not account for
sessions that only have Reader clients with no Ops.
This PR introduces an optional, alternative method for tracking
collaboration sessions within the Nexus lambda itself, which is able to
account for Read-only sessions.
> **Note:** This is an alternative to #9191 which requires creating
Orderer connections to manage read clients using Deli, as well as
keep-alive pings from the frontend (Nexus in our case). We do not want
to spin up Deli and create Orderer connections for read sessions.
### Solution Design Details
> **Context** The original design attempted to only use information
already available from `IClientManager` to understand active session
information and act accordingly. However, the "currently connected
client list" available via `IClientManager` was insufficient for
handling various multi-instance scenarios such as clients leaving from
separate Nexus instances causing the session to "terminate" too
quickly/twice or a Nexus instance shutting down causing a session end
timer to be lost.
1. **Session Creation (First Client Join):** When a client for a given
document connects to the socket server while no other clients are
connected/active for that document, and the previous session either
never existed or was inactive for more than 10 minutes, the session is
"created/started."
2. **Session Expansion/Continuation (Client Join):** When a client for
a given document connects to the socket server while other clients are
connected/active for that document, or the previous session has been
inactive for less than 10 minutes, the session is updated with
information about that new client, and any existing timers are reset.
3. **Session End (Last Client Leave):** When the only remaining
connected client for a given document disconnects from the socket
server, the session is updated with "last client leave time" and a 10
minute timeout is started.
4. **Session Timeout (Inactive for 10 minutes):** When a session's
inactivity timer expires and there are still no clients in the session
according to the ClientManager, the session is logged as "ended" and
cleaned up.
All of the above "session" information is stored within a Redis HashMap
that allows the list of current sessions to be retrieved and iterated
over, or a single session to be retrieved and updated.
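
The four lifecycle rules above can be sketched as pure functions over a session record. The `ISessionRecord` fields and function names here are hypothetical, and the sketch leaves out the Redis storage and timer plumbing entirely.

```typescript
const sessionInactivityTimeoutMs = 10 * 60 * 1000;

// Hypothetical per-document session record.
interface ISessionRecord {
    firstClientJoinTimeMs: number;
    lastClientLeaveTimeMs: number | undefined;
}

interface IJoinResult {
    session: ISessionRecord;
    isNewSession: boolean;
}

// Rules 1 and 2: decide whether a joining client starts or continues a session.
function handleClientJoin(
    existing: ISessionRecord | undefined,
    otherClientsConnected: boolean,
    nowMs: number,
): IJoinResult {
    const fresh: ISessionRecord = { firstClientJoinTimeMs: nowMs, lastClientLeaveTimeMs: undefined };
    if (existing === undefined) {
        return { session: fresh, isNewSession: true };
    }
    const leftAtMs = existing.lastClientLeaveTimeMs;
    const inactiveTooLong =
        !otherClientsConnected &&
        leftAtMs !== undefined &&
        nowMs - leftAtMs > sessionInactivityTimeoutMs;
    if (inactiveTooLong) {
        return { session: fresh, isNewSession: true };
    }
    // Continuation: clear the leave marker, which corresponds to resetting the timer.
    return { session: { ...existing, lastClientLeaveTimeMs: undefined }, isNewSession: false };
}

// Rule 3: the last client left, so record when the inactivity window started.
function handleLastClientLeave(session: ISessionRecord, nowMs: number): ISessionRecord {
    return { ...session, lastClientLeaveTimeMs: nowMs };
}

// Rule 4: the session ends once the window elapsed with no clients connected.
function shouldEndSession(session: ISessionRecord, clientsConnected: boolean, nowMs: number): boolean {
    const leftAtMs = session.lastClientLeaveTimeMs;
    return (
        !clientsConnected &&
        leftAtMs !== undefined &&
        nowMs - leftAtMs >= sessionInactivityTimeoutMs
    );
}
```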
## Breaking Changes
### Firm Input Validation
When the client sends a malformed connect message (i.e. the message does
not contain all expected properties with expected types), Nexus will
emit a `connect_document_error` message with a 400 error code,
indicating malformed user input to the client.
#### Context
Nexus currently makes a lot of type assumptions about the client's
`IConnect` message in the `connect_document` event handler. This can
cause the service to crash due to unhandled TypeErrors at runtime. This
PR introduces strong type checks for the incoming `IConnect` message and
its internal `IClient` details so that Nexus can safely access the
expected properties in that message.
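
A sketch of what a runtime type check for the connect message could look like. `IConnectLike` and `IClientLike` are deliberately reduced, hypothetical subsets of the message shape, not the real `IConnect` and `IClient` definitions, and the error object shape is also an assumption.

```typescript
// Reduced, hypothetical subsets of the real message types.
interface IClientLike {
    mode: "read" | "write";
}

interface IConnectLike {
    tenantId: string;
    id: string;
    versions: string[];
    client: IClientLike;
}

function isRecord(value: unknown): value is Record<string, unknown> {
    return typeof value === "object" && value !== null;
}

// Type guard: verifies every property before the handler touches it.
function isConnectLike(message: unknown): message is IConnectLike {
    if (!isRecord(message)) {
        return false;
    }
    const versions = message["versions"];
    const client = message["client"];
    return (
        typeof message["tenantId"] === "string" &&
        typeof message["id"] === "string" &&
        Array.isArray(versions) &&
        versions.every((version) => typeof version === "string") &&
        isRecord(client) &&
        (client["mode"] === "read" || client["mode"] === "write")
    );
}

// Returns the error to emit for malformed input, or undefined when the input is valid.
function validateConnectMessage(message: unknown): { code: number; message: string } | undefined {
    return isConnectLike(message)
        ? undefined
        : { code: 400, message: "Malformed connect message" };
}
```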
## Reviewer Guidance
- **Main Session Tracking Logic**: server/r11s/packages/services/src
`redisSessionManager.ts` and `sessionTracker.ts`
- **Main Nexus Session Tracking**:
server/r11s/packages/lambdas/src/nexus `connect.ts` and `disconnect.ts`
- There is also a small refactor in `disconnect.ts` to make the
Disconnect handler structure more similar to the Connect handler by
moving the internal loops into their own named functions.
- **Type Validation**: server/r11s/packages/lambdas/src/nexus `index.ts`
and `protocol.ts`
---------
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
## Description
Updates axios dependencies to the latest version (in package.json direct
dependencies and in transitive dependencies in lockfiles) throughout the
repo to address a few CVEs.
## Description
We don't have a good way of hooking up connect document metrics with
isEphemeralContainer flags. Get session is the entry point for
connecting to a document, so this will provide us with more accurate
information.
## Breaking Changes
N/A
---------
Co-authored-by: Xin Zhang <zhangxin@microsoft.com>
## Description
Refactors and changes the prop `correlationIdSource` to `requestSource`
to avoid ambiguity in understanding whether we are tracking request
origin or correlationId origin.
## Description
This PR adds telemetry to track the origin of the correlationId associated
with an API call by adding a new telemetry prop - `correlationIdSource`.
If the client sends a correlationId in the `x-correlation-id` header or
in the `x-telemetry-header`, then the source is set as
`"correlationIdSource": "client"`. Else the correlationId is generated
by the server and the prop is set as `"correlationIdSource": "server"`.
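
The rule can be sketched as a small pure function. It assumes the caller has already extracted any client-provided id from the request headers; the function name and the result shape are hypothetical.

```typescript
type CorrelationIdSource = "client" | "server";

interface ICorrelationInfo {
    correlationId: string;
    correlationIdSource: CorrelationIdSource;
}

// Uses the client's id when present, otherwise generates one on the server.
function resolveCorrelationId(
    clientProvidedId: string | undefined,
    generateId: () => string,
): ICorrelationInfo {
    if (clientProvidedId !== undefined && clientProvidedId !== "") {
        return { correlationId: clientProvidedId, correlationIdSource: "client" };
    }
    return { correlationId: generateId(), correlationIdSource: "server" };
}
```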
## Breaking Changes
Updates `ITelemetryContextProperties` to include the
`correlationIdSource` property.
## Description
Upgrading Routerlicious server packages in Gitrest and Historian to pull
in changes from #22109.
Adds the `getTelemetryContextProperties` param to each BasicRestWrapper
instantiation.
## Description
Customers depend on the "Document is deleted..." message, not the error
code. Some of our E2E tests do too. When an EC is considered expired,
just say it's "deleted" to match existing client logic.
Follow-up to move to a better message: [ADO
#12867](https://dev.azure.com/fluidframework/internal/_workitems/edit/12867)
## Description
Global TelemetryContext was implemented several major server versions
ago. At the same time, the old `getCorrelationId` and
`bindCorrelationId` method of tracking correlationId was deprecated.
This PR removes usage of those methods, and also adds a new Telemetry
Context header that can be extended to track other information for the
lifetime of an API request.
For the new `x-telemetry-context` header, the old `x-correlation-id`
header will still be respected (for now) if the `x-telemetry-context`
header does not contain a `correlationId` property. BasicRestWrapper now takes
in an optional `getTelemetryContextProperties` method, similar to how it
takes a `getCorrelationId` method. This is used to generate
telemetryContext header on outgoing requests from within R11s.
`x-correlation-id` is still generated.
## Breaking Changes
- `enableGlobalTelemetryContext` config switched to `true` in code. Was
already true in configs.
- `bindCorrelationId` usage was removed from Gitrest, Historian, and
Routerlicious Rest APIs, meaning `getCorrelationId` without
`enableGlobalTelemetryContext: true` will not work anymore.
I'm leaving the old `getCorrelationId` and `bindCorrelationId` methods
in for 1 more release cycle out of an abundance of caution, even though
they have been deprecated for almost a year.
## Description
This PR fixes the CredScan warnings we were getting in the server
pipelines, before they become a blocker that makes the pipeline runs
fail.
The auto-injected CredScan task in server pipelines was complaining
about things that we had already indicated should be skipped (through
the CredScanSuppressions.json file). Turns out that for docker builds,
the file is expected in the "root context" for the docker build, not at
the root of the repo like it is for some other auto-injected tasks. This
PR makes it so we copy the file to the necessary new location in the
server pipelines.
It also replaces a bunch of fake usernames/passwords in a file's
comments with "PLACEHOLDER" which the CredScan task automatically skips
(pro-tip: don't use "PLACEHOLDER" as your actual password 😄).
Finally, it adds more suppressions for files that are part of test code
in some server dependencies.
## Description
Currently, we rely on an Ephemeral Container to either 1) be cleaned up
by the Deli lambda on session end, or 2) expire due to DB and Redis TTL
values.
There are inconsistencies in configurations and TTL behaviors regardless
of configs, so we want to explicitly reject access to Ephemeral
containers that are older than a certain time.
This PR causes all Historian requests and Alfred getSession requests to
fail with an explicit `404 - Ephemeral Container Expired: ...` error
when the container was created longer ago than the EphemeralDocumentTTL
config value. It also changes Gitrest's Ephemeral TTL configuration to
use an explicit EphemeralDocumentTTL value for consistency, rather than
an implicit general Redis TTL value.
The defaults for these values are remaining as 24 hours.
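
A minimal sketch of the age check, assuming the creation time is available as a millisecond timestamp and the TTL is expressed in milliseconds. The names below are hypothetical, and the real config may use different units.

```typescript
// 24 hours, matching the default mentioned above.
const defaultEphemeralDocumentTtlMs = 24 * 60 * 60 * 1000;

// True when the container was created longer ago than the configured TTL.
function isEphemeralContainerExpired(
    createdAtMs: number,
    nowMs: number,
    ttlMs: number = defaultEphemeralDocumentTtlMs,
): boolean {
    return nowMs - createdAtMs > ttlMs;
}
```

A request handler would run this check for ephemeral containers before serving the request and respond with the 404 error when it returns `true`.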
This change reverts part of the changes made in #21018. The past changes
inadvertently caused the packages to be published without any built
content. I have verified from test builds that the published packages do
have built content with this change.
In this change, the pack process for docker pipelines is once again run
with a unique shell command that is run in the docker container, and the
package lists are created directly in the pipeline instead of by a
script.
This is unfortunate from a maintenance perspective because it means
there are two slightly different pack paths depending on the pipeline.
That said, this is by far the most straightforward fix.