isolate redundant sections
This commit is contained in:
Parent: 42920fb6b4
Commit: 92b182f779
## Data Availability and Orchestration features
**TODO - Dev Team - Review the following and confirm if they apply for TRI-1**
The logical Data Warehouse architecture and orchestration address these requirements:
1. Each logical data warehouse (LDW) consists of a single physical data warehouse by default. More replicas per LDW can be configured for scalability and high availability.
A set of control tables associates physical DWs with tables, schemas, and time ranges, and records dataset auditing information (start date, end date, row count, file size, checksum) in a separate audit file.
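
As an illustration, here is a minimal sketch of what one such control-table record might carry; all field names are hypothetical, since the actual schema is defined by the TRI control database:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DatasetControlRecord:
    """Associates a physical DW with a table/schema and a time range.

    Field names are hypothetical; the real control tables define their own schema.
    """
    physical_dw: str       # e.g. "PDW01-LDW01"
    schema_name: str
    table_name: str
    range_start: datetime  # start of the time range covered by the dataset
    range_end: datetime    # end of the time range covered by the dataset
    row_count: int         # auditing information recorded for the dataset
    file_size_bytes: int
    checksum: str          # digest of the data file, e.g. SHA-256
```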
The LDW load and read data sets iterate through three states:
- `Load`: The LDW set is processing uploaded data files to "catch-up" to the latest and greatest data.
- `Standby`: The LDW is not processing updates nor serving customers; it is a hot standby with "best available" data staleness, for disaster recovery purposes.
- `Active`: The LDW is up-to-date and serving requests but not receiving any additional data loads.
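
As an aside, a minimal sketch of these states as an enumeration (illustrative only; the TRI itself tracks LDW state in its control tables):

```python
from enum import Enum

class LdwState(Enum):
    """The three states an LDW iterates through."""
    LOAD = "Load"        # catching up on uploaded data files
    STANDBY = "Standby"  # hot standby for disaster recovery; not loading, not serving
    ACTIVE = "Active"    # serving queries; not receiving data loads
```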
It is recommended that the data files that are loaded into physical DW instances have the following naming structure:
This provides sufficient information to determine the intent of a file should it appear outside of the expected system paths. The audit file records the row count, start/end date, file size, and checksum of its data file. Audit files must always appear next to their data files in the same working directory; orphaned data or audit files should not be loaded.
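
As a sketch of how an audit file could be produced and checked before loading (the JSON layout, `.audit` extension, and helper names are hypothetical; only the audited fields come from this document):

```python
import hashlib
import json
from pathlib import Path

def write_audit_file(data_path: Path, start_date: str, end_date: str, row_count: int) -> Path:
    """Write an audit file next to its data file, in the same directory."""
    checksum = hashlib.sha256(data_path.read_bytes()).hexdigest()
    audit = {
        "rowcount": row_count,
        "start_date": start_date,
        "end_date": end_date,
        "filesize": data_path.stat().st_size,
        "checksum": checksum,
    }
    # Hypothetical convention: "<datafile>.audit" alongside the data file.
    audit_path = data_path.with_suffix(data_path.suffix + ".audit")
    audit_path.write_text(json.dumps(audit, indent=2))
    return audit_path

def is_loadable(data_path: Path) -> bool:
    """Orphaned data files (no audit file alongside) should not be loaded."""
    return data_path.with_suffix(data_path.suffix + ".audit").exists()
```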
> To gain a deeper understanding of the flip process, please read [Anatomy of a Logical Data Warehouse Flip](./5-Understanding%20data%20warehouse%20flip).
# Anatomy of a Logical Data Warehouse Flip
> For a detailed description of the LDW transition states (`Active`, `Standby`, `Load`), please read [LDW States and Availability](./4-Understanding%20logical%20datawarehouses.md#logical-data-warehouse-status-and-availability).
The transition of an LDW from `Load` to `Active` and vice versa, a.k.a. the "Flip Operation", is done every T hours, where T is configurable by the user.
The flip operation is triggered by daemons which run as scheduled tasks on each of the Analysis Server Direct Query (ASDQ) Nodes.
Every few minutes, each daemon running on an ASDQ node queries the job manager to check whether an LDW flip needs to happen.
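
A minimal sketch of that polling loop; `query_job_manager` is a hypothetical stand-in for the real call, while the "No-Op" and "Transition" responses come from the steps below:

```python
import time
from typing import Callable

POLL_INTERVAL_SECONDS = 120  # "every few minutes"; the real schedule is configurable

def asdq_daemon_loop(node_name: str, query_job_manager: Callable[[str], str]) -> None:
    """Runs on each ASDQ node as a scheduled task; polls the job manager.

    `query_job_manager` is expected to return "No-Op" when nothing is due, or
    "Transition" when this node should stop accepting new connections and
    drain (see step 4 of the Step-by-Step section).
    """
    while True:
        if query_job_manager(node_name) == "Transition":
            # Drain existing connections, then repoint to the newly Active PDW.
            break
        time.sleep(POLL_INTERVAL_SECONDS)
```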
## Step-by-Step
1. The Job Manager maintains a database table, `LDWExpectedStates`, which stores the start and end times of the current flip interval. It consists of records that define which LDW is in `Load` state and which is in `Active` state, and until what time they are supposed to be in those states.
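
For illustration, a sketch of the kind of record this table holds, with hypothetical column names:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LdwExpectedState:
    """One row of the LDWExpectedStates table (column names are hypothetical)."""
    ldw_name: str               # e.g. "LDW01"
    expected_state: str         # "Load" or "Active"
    interval_start_utc: datetime
    interval_end_utc: datetime  # when the current flip interval ends

def flip_is_due(row: LdwExpectedState) -> bool:
    """A flip is due once the current UTC time is past the interval's end time."""
    return datetime.now(timezone.utc) >= row.interval_end_utc
```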
2. When queried by an ASDQ daemon, the job manager reads this table and checks whether the current UTC time is past the end time of the current flip interval; if not, it responds with a No-Op. If the current UTC time is past the end time, a flip operation needs to be executed and the following steps are performed.
* (a) The LDW which needs to be `Active` in the next flip interval is determined, and the `LDWExpectedStates` table is populated with details regarding the start and end time of the next flip interval. The end timestamp is determined by adding T hours to the start time, which is the current UTC time. If there are no LDWs in `Load` state then no flip will happen. If there is more than one LDW in `Load` state then the next LDW in sequence after the currently `Active` LDW is picked as the one to be flipped to `Active` state (see the sketch below).
* (b) The state of the next-to-be-Active LDW is switched to `Active` and the state transitions of its PDWs from `Load` to `Active` are initiated.
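
A minimal sketch of the selection rule from step 2(a), assuming the LDWs are kept in a fixed sequence:

```python
from typing import Optional

def pick_next_active(ldws_in_sequence: list[str],
                     states: dict[str, str],
                     currently_active: str) -> Optional[str]:
    """Pick the LDW to flip to Active: the next one in sequence that is in Load state.

    Returns None when no LDW is in Load state, in which case no flip happens.
    """
    n = len(ldws_in_sequence)
    start = ldws_in_sequence.index(currently_active)
    for offset in range(1, n):
        candidate = ldws_in_sequence[(start + offset) % n]
        if states.get(candidate) == "Load":
            return candidate
    return None

# Example: with LDW01 active and LDW02 loading, LDW02 is picked.
assert pick_next_active(["LDW01", "LDW02"],
                        {"LDW01": "Active", "LDW02": "Load"},
                        "LDW01") == "LDW02"
```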
3. The PDW state transition from `Load` to `Active` goes through two intermediate states, as follows (a minimal sketch follows this list):
* (a) `Load`: The PDW is processing uploaded data files to "catch-up" to the latest and greatest data.
* (b) `StopLoading`: The PDW does not accept any new data load jobs, but waits until its current load jobs complete.
* (c) `ScaleUpToActive`: The PDW has completed all of its assigned load jobs and is being scaled up to Active DWU capacity.
* (d) `Active`: The PDW is up-to-date and serving requests, but not receiving any additional data loads.
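
A minimal sketch of this progression as a guarded state machine; the two boolean inputs stand in for the job manager's real checks:

```python
PDW_FLIP_SEQUENCE = ["Load", "StopLoading", "ScaleUpToActive", "Active"]

def advance_pdw_state(current: str, load_jobs_running: bool, scale_up_done: bool) -> str:
    """Advance one step along Load -> StopLoading -> ScaleUpToActive -> Active.

    A PDW leaves StopLoading only when its in-flight load jobs have completed,
    and leaves ScaleUpToActive only when the DWU scale-up has finished.
    """
    if current == "Load":
        return "StopLoading"  # stop accepting new load jobs
    if current == "StopLoading":
        return "ScaleUpToActive" if not load_jobs_running else current
    if current == "ScaleUpToActive":
        return "Active" if scale_up_done else current
    return current            # Active is terminal for this flip
```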
4. Once a PDW is changed to `Active` state, the job manager checks that there is at least one DQ node in the "DQ Alias group" which is still serving active queries. A "DQ Alias group" is the group of DQ nodes that point to the same PDW instance in an LDW; multiple DQ nodes can point to the same PDW, which makes it possible to increase the availability of DQs when needed, assuming the PDW can support concurrent queries from all of these DQs. Checking that at least one DQ is still serving ensures new requests do not get dropped. If this check succeeds, a "Transition" response is sent to the DQ node, which stops accepting new connections from the DQ LoadBalancer and drains off its existing connections. Once the grace time is over, the DQ changes its connection string to point to the newly Active PDW and reports to the job manager, which then allows other ASDQs to start their transitions.
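
A minimal sketch of the availability check that gates each DQ node's transition; the state names mirror those used in the Example below, and the data structures are illustrative:

```python
# A node counts as "serving" whether it still points at the old PDW (Normal)
# or has already repointed to the new one (ChangeCompleted).
SERVING_STATES = {"Normal", "ChangeCompleted"}

def can_start_transition(dq_node: str, alias_group: dict[str, str]) -> bool:
    """True if at least one *other* DQ node in the alias group is still
    serving queries, so draining this node drops no new requests."""
    return any(
        state in SERVING_STATES
        for node, state in alias_group.items()
        if node != dq_node
    )

# Example: with both nodes serving, DQ01 may begin its transition.
group = {"DQ01": "Normal", "DQ02": "Normal"}
assert can_start_transition("DQ01", group)
```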
5. Once all the DQs in a "DQ Alias group" have flipped to a different PDW, the group's original PDW is transitioned to the `Load` state after scaling down its DWU capacity.
6. After all of the `Active` PDWs of the previously Active LDW have been transitioned to `Load`, the LDW itself is changed to `Load` state. This marks the end of the flip operation.
## Example
Here is an example schedule showing how the Flip Operation occurs, using the following configuration:
* 2 LDWs: `LDW01`, `LDW02`
* 2 PDWs: `PDW01-LDW01` (LDW01), `PDW01-LDW02` (LDW02)
* 2 DQ Nodes: `DQ01` (points to PDW01-LDW01), `DQ02` (points to PDW01-LDW01)
* ASDQ daemon schedule: 1 minute
* Connection drain time: 10 minutes
| UTC | LDW01 | LDW02 | PDW01-LDW01 | PDW01-LDW02 | DQ01 | DQ02 |
|:----|:----|:----|:----|:----|:----|:----|
|00:00 | Active | Load | Active | Load | Normal : PDW01-LDW01 | Normal : PDW01-LDW01 |
|00:01 | Active | Load | Active | `StopLoading` | Normal : PDW01-LDW01 | Normal : PDW01-LDW01 |
|00:03 | Active | Load | Active | `ScaleUpToActive` | Normal : PDW01-LDW01 | Normal : PDW01-LDW01 |
|00:05 | Active | `Active` | Active | `Active` | Normal : PDW01-LDW01 | Normal : PDW01-LDW01 |
|00:06 | Active | Active | Active | Active | `Transition` : PDW01-LDW01 | Normal : PDW01-LDW01 |
|00:16 | Active | Active | Active | Active | `ChangeCompleted` : PDW01-LDW02 | Normal : PDW01-LDW01 |
|00:26 | Active | Active | Active | Active | ChangeCompleted : PDW01-LDW02 | `Transition` : PDW01-LDW01 |
|00:27 | Active | Active | `ScaleDownToLoad` | Active | `Normal` : PDW01-LDW02 | Normal : PDW01-LDW02 |
|00:29 | `Load` | Active | `Load` | Active | Normal : PDW01-LDW02 | Normal : PDW01-LDW02 |
## FAQ
**Are there any timing instructions for the Admin to restart the process?**
The flip interval of T hours is a configurable property and can be set by the Admin by updating a ControlServer database property.
When the next flip time comes around, this value will be used to set the next flip interval.
If the Admin wants to flip immediately, the end timestamp of the current flip interval needs to be updated to the current UTC time in the `LDWExpectedStates` database table; the flip operation will then be initiated within the next couple of minutes (see the sketch below).
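
A sketch of that manual override, assuming a `pyodbc` connection to the ControlServer database; the column name `IntervalEndTimeUtc` is hypothetical, so check the actual schema first:

```python
import pyodbc

def force_immediate_flip(connection_string: str) -> None:
    """Pull the current flip interval's end time back to 'now' so that the
    next ASDQ daemon poll (within a couple of minutes) initiates the flip.

    Table name comes from this document; the column name is hypothetical.
    """
    with pyodbc.connect(connection_string) as conn:
        conn.execute(
            "UPDATE LDWExpectedStates "
            "SET IntervalEndTimeUtc = SYSUTCDATETIME() "
            "WHERE IntervalEndTimeUtc > SYSUTCDATETIME()"
        )
        conn.commit()
```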
**What other situations will require Admin intervention?**
The flip operation requires a `Load` PDW to satisfy certain conditions before it can be made Active. These are explained in steps 3(a)-3(d) of the Step-by-Step section above. If load jobs get stuck, or if scaling takes a long time, the flip operation will be halted. Even if all of the Direct Query nodes die, the flip operation will not be triggered, because the ASDQ daemons are currently what initiate it. Admin intervention will be required to address these situations.
**What other steps should the Admin NOT do with the flip pattern?**
Once a flip operation has started, the Admin should not try to change the state of PDWs or LDWs themselves. Since these states are maintained in the job manager's database, any mismatch between them and the real state will throw off the flip operation. If any of the PDWs dies, the Admin needs to get it back into the state last recorded in the database.