Sync manager

We've identified the need for a "sync manager" (although we are yet to identify a good name for it!) This manager will be responsible for managing "global sync state" and coordinating each engine.

At a very high level, the sync manager is responsible for all syncing. So far, so obvious. However, given our architecture, it's possible to identify a key architectural split.

  • The embedding application will be responsible for most high-level operations. For example, the app itself will choose how often regular syncs should happen, what environmental concerns apply (eg, should I only sync on WiFi?), letting the user choose exactly what to sync, and so forth.

  • A lower-level component will be responsible for the direct interaction with the engines and with the various servers needed to perform a sync. It will also have the ultimate responsibility to not cause harm to the service (for example, it will likely enforce some kind of rate limiting and ensure that service requests for backoff are honored).

Because all application-services engines are written in Rust, it's tempting to suggest that this lower-level component also be written in Rust and everything "just works", but there are a couple of complications here:

  • For iOS, we hope to integrate with older engines which aren't written in Rust, even if iOS does move to the new Sync Manager.

  • For Desktop, we hope to start by reusing the existing "sync manager" implemented by Desktop, and start moving individual engines across.

  • There may be some cross-crate issues even for the Rust implemented engines. Or more specifically, we'd like to avoid assuming any particular linkage or packaging of Rust implemented engines.

Even with these complications, we expect there to be a number of high-level components, each written in a platform-specific language (eg, Kotlin or Swift), and a single lower-level component implemented in Rust and delivered as part of the application-services library - but that's not a free pass.

Why "a number of high-level components"? Because that is the thing which understands the requirements of the embedding application. For example, Android may end up with a single high-level component in the android-components repo and shared between all Android components. Alternatively, the Android teams may decide the sync manager isn't generic enough to share, so each app will have their own. iOS will probably end up with its own and you could imagine a future where Desktop does too - but they should all be able to share the low level component.

The responsibilities of the Sync Manager.

The primary responsibilities of the "high level" portion of the sync manager are:

  • Manage all FxA interaction. The low-level component will have a way to communicate auth related problems, but it is the high-level component which takes concrete FxA action.

  • Expose all UI for the user to choose what to sync and coordinate this with the low-level component. Note that because these choices can be made on any connected device, these choices must be communicated in both directions.

  • Implement timers or any other mechanism to fully implement the "sync scheduler", including any policy decisions such as only syncing on WiFi, etc (a sketch of this kind of policy decision follows this list).

  • Implement a UI so the user can "sync now".

  • Collect telemetry from the low-level component, probably augment it, then submit it to the telemetry pipeline.
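
To make the scheduling point above concrete, here's a minimal sketch of the kind of policy decision the high-level component will own. It's written in Rust for consistency with the rest of this document (the real component will be Kotlin or Swift), and every name in it is illustrative rather than part of any proposed API:

```rust
use std::time::{Duration, Instant};

// Hypothetical policy configuration owned by the embedding app.
struct SchedulerPolicy {
    sync_interval: Duration,
    wifi_only: bool,
}

// Hypothetical view of the device's current environment.
struct DeviceEnvironment {
    on_wifi: bool,
    last_sync: Option<Instant>,
}

fn should_schedule_sync(policy: &SchedulerPolicy, env: &DeviceEnvironment) -> bool {
    // Respect the "only sync on WiFi" preference.
    if policy.wifi_only && !env.on_wifi {
        return false;
    }
    match env.last_sync {
        // Never synced - do it now.
        None => true,
        // Otherwise, only sync once the regular interval has elapsed.
        Some(when) => when.elapsed() >= policy.sync_interval,
    }
}
```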

The primary responsibilities of the "low level" portion of the sync manager are:

  • Manage the meta/global, crypto/keys and info/collections resources, and interact with each engine as necessary based on the content of these resources.

  • Manage interaction with the token server.

  • Enforce constraints necessary to ensure the entire ecosystem is not subject to undue load. For example, this component should ignore attempts to sync continuously, or to sync when the services have requested backoff (a sketch of this follows this list).

  • Manage the "clients" collection - we probably can't ignore this any longer, especially for bookmarks (as desktop will send a wipe command on bookmark restore, and things will "be bad" if we don't see that command).

  • Define a minimal "service state" so certain things can be coordinated with the high-level component. Examples of these states are "everything seems ok", "the service requested we backoff for some period", "an authentication error occurred", and possibly others.

  • Perform, or coordinate, the actual sync of the rust implemented engines - from the containing app's POV, there's a single "sync now" entry-point (in practice there might be a couple, but conceptually there's a single way to sync). Note that, as discussed below, how non-rust implemented engines are managed is TBD.

  • Manage the collection of (but not the submission of) telemetry from the various engines.

  • Expose APIs and otherwise coordinate with the high-level component.
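
As a sketch of what enforcing those constraints might look like, assuming hypothetical names, and assuming backoff arrives as a number of seconds (eg, via something like a Retry-After or X-Weave-Backoff header):

```rust
use std::time::{Duration, Instant};

// Hypothetical bookkeeping the low-level component might hold.
struct BackoffState {
    // Set when the service requested backoff; None means no backoff is in force.
    next_sync_allowed_at: Option<Instant>,
    // The minimum time we allow between any two syncs, so attempts to
    // sync continuously are ignored.
    min_interval: Duration,
    last_sync: Option<Instant>,
}

impl BackoffState {
    // Should we decline to sync right now?
    fn should_decline(&self) -> bool {
        if let Some(when) = self.next_sync_allowed_at {
            if Instant::now() < when {
                return true; // the service asked us to back off.
            }
        }
        match self.last_sync {
            // Rate limiting: too soon since the last sync.
            Some(when) => when.elapsed() < self.min_interval,
            None => false,
        }
    }

    // Called when a server response carries a backoff hint, in seconds.
    fn note_backoff(&mut self, seconds: u64) {
        self.next_sync_allowed_at = Some(Instant::now() + Duration::from_secs(seconds));
    }
}
```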

Things we aren't quite sure where they fit include:

  • Coordination with non-rust implemented engines. These engines are almost certainly going to be implemented in the same language as the high-level component, which will make integration simpler. However, the low-level component will almost certainly need some information about these engines for populating info/collections etc. For now, we are punting on this until things become a bit clearer.

Implementation Details.

The above has been carefully written to try and avoid implementation details - the intent is that it's an overview of the architecture without any specific implementation decisions.

These next sections start getting specific, so implementation choices need to be made, and thus will possibly be more contentious.

In other words, get your spray-cans ready because there's a bikeshed being built!

However, let's start small and make some general observations.

Current implementations and challenges with the Rust components

  • Some apps only care about a subset of the engines - lockbox is one such app and only cares about a single collection/engine. It might be the case that lockbox uses a generic application-services library with many engines available, even though it only wants logins. Thus, the embedding application is the only thing which knows which engines should be considered to "exist". It may be that the app layer passes an engine to the sync manager, or the sync manager knows via some magic how to obtain these handles.

  • Some apps will use a combination of Rust components and "legacy" engines. For example, iOS is moving some of the engines to using Rust components, while other engines will be ported after delivery of the sync manager, if they are ported at all. We also plan to introduce some rust engines into desktop without integrating the "sync manager".

  • The rust components themselves are designed to be consumed as individual components - the "logins" component doesn't know anything about the "bookmarks" component.

There are a couple of gotchas in the current implementations too - there's an issue when certain engines don't yet appear in meta/global - see bug 1479929 for all the details.

The tl;dr of the above is that each rust component should be capable of working with different sync managers. That said though, let's not over-engineer this and pretend we can design a single, canonical thing that will not need changing as we consider desktop and iOS.

State, state and more state. And then some state.

There's loads of state here. The app itself has some state. The high-level Sync Manager component will have state, the low-level component will have state, and each engine has state. Some of this state will need to be persisted (either on the device or on the storage servers) and some of this state can be considered ephemeral and lives only as long as the app.

A key challenge will be defining this state in a coherent way with clear boundaries between the various pieces, in an effort to decouple them so Desktop and iOS can fit into this world.

This state management should also provide the correct degree of isolation for the various components - each engine should only know about state which directly impacts how it works. For example, the keys used to encrypt a collection should only be exposed to that specific engine, and there's no need for one engine to know what info/collections returns for other engines, nor whether the device is currently connected to WiFi.

A thorny issue here is persisted state - it would be ideal if the low-level component could avoid needing to persist any state, so it can avoid any kind of storage abstraction. We have a couple of ways of managing this:

  • The state which needs to be persisted is quite small, so we could delegate state storage to the high-level component in an opaque way, as this high-level component almost certainly already has storage requirements, such as storing the "choose what to sync" preferences.

  • The low-level component could add its own storage abstraction. This would isolate the high-level component from this storage requirement, but would add complexity to the sync manager - for example, it would need to be passed a directory where it should create a file or database.

We'll probably go with the former.
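
To illustrate the former, here's a sketch of the opaque round-trip. `StateStorage`, `SYNC_STATE_KEY` and `run_sync` are all hypothetical stand-ins; the real shape of this is the `persisted_state` field in the straw-man structures at the end of this document:

```rust
// Storage the high-level component already has - it never interprets
// the blob, it just stores and returns it.
trait StateStorage {
    fn load(&self, key: &str) -> Option<String>;
    fn save(&mut self, key: &str, value: &str);
}

const SYNC_STATE_KEY: &str = "sync-manager-persisted-state";

// `run_sync` stands in for the low-level component's sync entry-point:
// it accepts the previous blob and returns a new one.
fn sync_once(
    storage: &mut dyn StateStorage,
    run_sync: impl Fn(Option<String>) -> String,
) {
    let previous = storage.load(SYNC_STATE_KEY);
    let new_state = run_sync(previous);
    storage.save(SYNC_STATE_KEY, &new_state);
}
```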

Implementation plan for the low-level component.

Let's try and move into actionable decisions for the implementation. We expect the implementation of the low-level component to happen first, followed very closely by the implementation of the high-level component for Android. So we focus first on these.

Clients Engine

The clients engine includes some meta-data about each client. We've decided we can't replace the clients engine with the FxA device record and we can't simply drop this engine entirely.

Of particular interest is "commands" - these involve communicating with the engine regarding commands targeting it, and accepting commands to be sent to other devices. Note that outgoing commands are likely to not originate from a sync, but instead from other actions, such as "restore bookmarks".

However, because the only current requirement for commands is to wipe the store, and because you could anticipate "wipe" also being used when remotely disconnecting a device (eg, when a device is lost or stolen), our lives would probably be made much simpler by initially supporting only per-engine wipe commands.

Note that there has been some discussion about not implementing the client engine and replacing "commands" with some other mechanism. However, we have decided to not do that because the implementation isn't considered too difficult, and because desktop will probably require a number of changes to remove it (eg, "synced tabs" will not work correctly without a client record with the same guid as the clients engine.)

Note however that unlike desktop, we will use the FxA device ID as the client ID. Because FxA device IDs are more ephemeral than sync IDs, it will be necessary for engines using this ID to track the most-recent ID they synced with so the old record can be deleted when a change is detected.
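
A sketch of the bookkeeping this implies - the names are hypothetical, but the logic is simply "if the device ID changed since we last synced, delete the record written under the old ID before uploading the new one":

```rust
// Hypothetical persisted state for the clients engine.
struct ClientsEngineState {
    // The FxA device ID we last synced the clients collection with.
    last_synced_device_id: Option<String>,
}

enum ClientRecordAction {
    // Nothing special - just upload our record as usual.
    UploadOnly,
    // The device ID changed - delete the record with the old ID, then upload.
    DeleteOldThenUpload { old_id: String },
}

fn plan_client_record(state: &ClientsEngineState, current_device_id: &str) -> ClientRecordAction {
    match &state.last_synced_device_id {
        Some(old) if old != current_device_id => ClientRecordAction::DeleteOldThenUpload {
            old_id: old.clone(),
        },
        _ => ClientRecordAction::UploadOnly,
    }
}
```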

Collections vs engines vs stores vs preferences vs Apis

For the purposes of the sync manager, we define:

  • An engine is the unit exposed to the user - an "engine" can be enabled or disabled. There is a single set of canonical "engines" used across the entire sync ecosystem - ie, desktop and mobile devices all need to agree about what engines exist and what the identifier for an engine is.

  • An Api is the unit exposed to the application layer for general application functionality. Application-services has 'places' and 'logins' Apis, and these are what the application uses to store and fetch items. Each 'Api' may have one or more 'stores' (although the application layer will generally not interact directly with a store)

  • A store is the code which actually syncs. This is largely an implementation detail. There may be multiple stores per engine (eg, the "history" engine may have "history" and "forms" stores) and a single 'Api' may expose multiple stores (eg, the "places Api" will expose history and bookmarks stores)

  • A collection is a unit of storage on a server. It's even more of an implementation detail than a store. For example, you might imagine a future where the "history" store uses multiple "collections" to help with containers.

In practice, this means that the high-level component should only need to care about an engine (for exposing a choice of what to sync to the user) and an api (for interacting with the data managed by that api). The low-level component will manage the mapping of engines to stores.
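
For illustration only, the mapping for the examples above might look something like this (the function itself is hypothetical):

```rust
// The low-level component owns the engine -> store mapping; the
// high-level component never sees it.
fn stores_for_engine(engine: &str) -> &'static [&'static str] {
    match engine {
        // The "history" engine covers both the history and forms stores.
        "history" => &["history", "forms"],
        "bookmarks" => &["bookmarks"],
        "passwords" => &["passwords"],
        _ => &[],
    }
}
```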

The declined list

This document isn't going to outline the history of how "declined" is used, nor talk about how this might change in the future. For the purposes of the sync manager, we have the following hard requirements:

  • The low-level component needs to know what the currently declined set of engines is for the purposes of re-populating meta/global.

  • The low-level component needs to know when the declined set needs to change based on user input (ie, when the user opts in to or out of a particular engine on this device)

  • The high-level component needs to be aware that the set of declined engines may change on every sync (ie, when the user opts in to or out of a particular engine on another device)

A complication is that due to networks being unreliable, there's an inherent conflict between "what is the current state?" and "what state changes are requested?". For example, if the user changes the state of an engine while there's no network, then exits the app, how do we ensure the user's new state is updated next time the app starts? What if the user has since made a different state request on a different device? Is the state as last-known on this device considered canonical?

To clarify, consider:

  • User on this device declines logins. This device now believes logins is disabled but history is enabled, but is unable to write this to the server due to no network.

  • The user declines history on a different device, but doesn't change logins. This device does manage to write the new list to the server.

  • This device restarts and the network is up. It believes history is enabled but logins is not - however, the state on the server is the exact opposite.

How does this device react?

(On the plus side, this is an extreme edge-case which none of our existing implementations handle "correctly" - which is easy to say, because there's no real definition for "correctly")

Regardless, the low-level component will not pretend to hide this complexity (ie, it will ignore it!). The low-level component will allow the app to ask for state changes as part of a sync, and will return what's on the server at the end of every sync. The app is then free to build whatever experience it desires around this.
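
As a sketch of what that looks like from the app's point of view, using the straw-man structures at the end of this document (`load_persisted_state` and `update_choose_what_to_sync_ui` are hypothetical app-side helpers, and error handling is elided):

```rust
fn decline_passwords_and_reconcile(manager: &SyncManager) -> Result<()> {
    let result = manager.sync(SyncParams {
        engines: None, // a full sync
        reason: SyncReason::User,
        // Ask for the passwords engine to be declined as part of this sync.
        engine_state_changes: vec![EngineState {
            engine: Engine::Passwords,
            enabled: false,
        }],
        persisted_state: load_persisted_state(),
    })?;
    // Whatever the server ends up with is what we reflect back to the
    // user - the low-level component makes no attempt to hide the races
    // described above.
    if let Some(declined) = result.declined_engines {
        update_choose_what_to_sync_ui(&declined);
    }
    Ok(())
}
```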

Disconnecting from Sync

The low-level component needs to have the ability to disconnect all engines from Sync. Engines which are declined should also be reset.

Because we will need wipe() functionality to implement the clients engine, and because Lockbox wants to wipe on disconnect, we will provide disconnect and wipe functionality.

Specific deliverables for the low-level component.

Breaking the above down into actionable tasks which can be done somewhat concurrently, we will deliver:

The API

A straw-man for the API we will expose to the high-level components. This probably isn't too big, but we should do this as thoroughly as we can. In particular, ensure we have covered:

  • Declined management - how the app changes the declined list and how it learns of changes from other devices.

  • How telemetry gets handed from the low-level to the high-level.

  • The "state" - in particular, how the high-level component understands the auth state is wrong, and whether the service is in a degraded mode (eg, server requested backoff)

  • How the high-level component might specify "special" syncs, such as "just one engine" or "this is a pre-sleep, quick-as-possible sync", etc

There's a straw-man proposal for this at the end of the document.

A command-line (and possibly Android) utility.

We should build a utility (or 2) which can stand in for the high-level component, for testing and demonstration purposes.

This is something like places-utils.rs and the little utility Grisha has been using. This utility should act like a real client (ie, it should have an FxA device record, etc) and it should use the low-level component in exactly the same way we expect real products to use it.

Because it is just a consumer of the low-level component, it will force us to confront some key issues, such as how to get references to engines stored in multiple crates, how to present a unified "state" for things like auth errors, etc.
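
A sketch of what the utility's main loop might look like, driving the straw-man API from the end of this document exactly as a product would:

```rust
fn main() -> Result<()> {
    let manager = SyncManager {};
    // Tell the manager which engines this "app" knows about.
    manager.init(vec!["history", "bookmarks", "passwords"])?;
    let result = manager.sync(SyncParams {
        engines: None, // sync everything
        reason: SyncReason::User,
        engine_state_changes: vec![],
        persisted_state: None,
    })?;
    println!("service status: {:?}", result.service_status);
    for (engine, engine_result) in &result.engine_results {
        println!("{:?}: {:?}", engine, engine_result);
    }
    Ok(())
}
```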

The "clients" engine

The initial work for the clients engine can probably be done without too much regard for how things are tied together - for example, much work could be done without caring how we get a reference to engines across crates.

State work

Implementing the things needed so we can expose the correct state to the high-level manager for things like auth errors, backoff semantics, etc.

Tie it together and other misc things.

There will be lots of loose ends to clean up - things like telemetry, etc.

Followup with non-rust engines.

We have identified that iOS will, at least in the short term, want the sync manager to be implemented in Swift. This will be responsible for syncing both the Swift and Rust implemented engines.

At some point in the future, Desktop may do the same - we will have both Rust and JS implemented engines which need to be coordinated. We ignore this requirement for now.

This approach still has a fairly easy time coordinating with the Rust implemented engines - the FFI will need to expose the relevant sync entry-points to be called by Swift, but the Swift code can hard-code the Rust engines it has and make explicit calls to these entry-points.

This Swift code will need to create the structures identified below, but this shouldn't be too much of a burden as it already has the information necessary to do so (ie, it already has info/collections etc)

TODO: dig into the Swift code and make sure this is sane.

Details

While we use rust struct definitions here, it's important to keep in mind that, as mentioned above, we'll need to support the manager being written in something other than Rust, and to support engines written in languages other than Rust.

The structures below are a straw-man, but hopefully capture all the information that needs to be passed around.


```rust
// We want to define a list of "engine IDs" - ie, canonical strings which
// refer to what the user perceives as an "engine" - but as above, these
// *do not* correspond 1:1 with either "stores" or "collections" (eg, "history"
// refers to 2 stores, and in a future world, might involve 3 collections).
enum Engine {
  History, // The "History" and "Forms" stores.
  Bookmarks, // The "Bookmark" store.
  Passwords,
}

impl Engine {
  fn as_str(&self) -> &'static str {
    match self {
      Engine::History => "history",
      Engine::Bookmarks => "bookmarks",
      Engine::Passwords => "passwords",
    }
  }
}

// A struct which reflects engine declined states.
struct EngineState {
  engine: Engine,
  enabled: bool,
}

// A straw-man for the reasons why a sync is happening.
enum SyncReason {
  Scheduled,
  User,
  PreSleep,
  Startup,
}

// A straw man for the general status.
enum ServiceStatus {
  Ok,
  // Some general network issue.
  NetworkError,
  // Some apparent issue with the servers.
  ServiceError,
  // Some external FxA action needs to be taken.
  AuthenticationError,
  // We declined to do anything for backoff or rate-limiting reasons.
  BackedOff,
  // Something else - you need to check the logs for more details.
  OtherError,
}

// Info we need from FxA to sync. This is roughly our Sync15StorageClientInit
// structure with the FxA device ID.
struct AccountInfo {
  key_id: String,
  access_token: String,
  tokenserver_url: Url,
  device_id: String,
}

// Instead of massive param and result lists, we use structures.
// This structure is passed to each and every sync.
struct SyncParams {
  // The engines to Sync. None means "sync all"
  engines: Option<Vec<Engine>>,
  // Why this sync is being done.
  reason: SyncReason,

  // Any state changes which should be done as part of this sync.
  engine_state_changes: Vec<EngineState>,

  // An opaque state "blob". This should be persisted by the app so it is
  // reused next sync.
  persisted_state: Option<String>,
}

struct SyncResult {
  // The general health.
  service_status: ServiceStatus,

  // The result for each engine.
  engine_results: HashMap<Engine, Result<()>>,

  // The list of declined engines, or None if we failed to get that far.
  declined_engines: Option<Vec<Engine>>,

  // When we are allowed to sync again. If > now() then there's some kind
  // of back-off. Note that it's not strictly necessary for the app to
  // enforce this (ie, it can keep asking us to sync, but we might decline).
  // But we might not too - eg, we might try a user-initiated sync.
  next_sync_allowed_at: Timestamp,

  // New opaque state which should be persisted by the embedding app and supplied
  // the next time Sync is called.
  persisted_state: String,

  // Telemetry. Nailing this down is tbd.
  telemetry: Option<JSONValue>,
}

struct SyncManager {}

impl SyncManager {
  // Initialize the sync manager with the set of Engines known by this
  // application without regard to the enabled/declined states.
  // XXX - still TBD is how we will pass "stores" around - it may be that
  // this function ends up taking an `impl Store`
  fn init(&self, engines: Vec<&str>) -> Result<()>;

  fn sync(&self, params: SyncParams) -> Result<SyncResult>;

  // Interrupt any current syncs. Note that this can be called from a different
  // thread.
  fn interrupt(&self) -> Result<()>;

  // Disconnect this device from sync. This may "reset" the stores, but will
  // not wipe local data.
  fn disconnect(&self) -> Result<()>;

  // Wipe all local data for all local stores. This can be done after
  // disconnecting.
  // There's no exposed way to wipe the remote store - while it's possible
  // stores will want to do this, there's no need to expose this to the user.
  fn wipe(&self) -> Result<()>;
}
```
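
To tie the structures above together, here's a sketch of how an embedding application might drive a regular scheduled sync. `load_state`, `save_state`, `submit_telemetry` and `schedule_next_sync_at` are hypothetical app-side helpers:

```rust
fn scheduled_sync(manager: &SyncManager) -> Result<()> {
    let result = manager.sync(SyncParams {
        engines: None, // a regular "sync everything" sync
        reason: SyncReason::Scheduled,
        engine_state_changes: vec![],
        persisted_state: load_state(),
    })?;

    // Always persist the opaque state blob for the next sync.
    save_state(&result.persisted_state);

    // Hand telemetry up the stack for augmentation and submission.
    if let Some(telemetry) = result.telemetry {
        submit_telemetry(telemetry);
    }

    // Respect any backoff the low-level component reported when
    // scheduling the next sync.
    schedule_next_sync_at(result.next_sync_allowed_at);
    Ok(())
}
```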