Mirrors what we've done for nondesktop.
Actually rolling this out will involve getting the change merged
to `spark-parquet-to-bigquery`, then doing a manual recreation of the table
and its downstream dependencies. Because there's no windowing or joining involved,
though, those recreations can each be done efficiently in a single query that takes
perhaps a few minutes to run.
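As a rough sketch of what one of those single-query recreations could look like (all table, dataset, and column names below are placeholders modeled on a `clients_last_seen`-style upstream table, not the actual production names):

```sql
-- Hypothetical sketch: recreate the raw MAU table in one pass.
-- Because days_since_seen is precomputed upstream, no windowing or
-- joining is needed; a single aggregation suffices.
CREATE OR REPLACE TABLE
  `my-project.analysis.example_exact_mau28_raw_v1`
PARTITION BY
  submission_date
AS
SELECT
  submission_date,
  normalized_channel,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  `my-project.telemetry.example_clients_last_seen_v1`
GROUP BY
  submission_date,
  normalized_channel;
```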
This change will allow us to move the Growth Dashboard off of Spark-based
analysis of Parquet datasets and onto BQ (via Redash).
Per [BQ docs on creating views](https://cloud.google.com/bigquery/docs/views#creating_a_view):
> For standard SQL views, the query must include the project ID in table and view references in the form `[PROJECT_ID].[DATASET].[TABLE]`.
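So a view definition along these lines (names are placeholders) has to spell out the project rather than using a bare `dataset.table` reference:

```sql
-- Placeholder names; the point is only that the table reference inside
-- a standard SQL view must be fully qualified as project.dataset.table.
-- A bare reference like `analysis.example_exact_mau28_raw_v1` would be rejected.
CREATE VIEW
  `my-project.analysis.example_mau_v1`
AS
SELECT
  *
FROM
  `my-project.analysis.example_exact_mau28_raw_v1`;
```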
We also remove the `observation_date` column. After further discussion with
the Growth Dashboard group, we're comfortable using the established definition
of the MAU window, which includes the date of report; the concern is to make sure
we never include a partial day, and that is already guaranteed because batch
processing runs only after a day is closed.
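As a sketch of what that definition means in practice (table and column names here are assumptions, not the actual schema): the 28-day window for a given report date runs from 27 days before it through the report date itself, and there is no partial final day because the job runs only after that day closes.

```sql
-- Sketch: exact 28-day MAU as of a report date, with the window
-- inclusive of the report date itself.
DECLARE report_date DATE DEFAULT DATE '2019-06-30';

SELECT
  COUNT(DISTINCT client_id) AS mau
FROM
  `my-project.telemetry.example_clients_daily_v1`
WHERE
  submission_date BETWEEN DATE_SUB(report_date, INTERVAL 27 DAY) AND report_date;
```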
As discussed in the [Smoot Growth Dashboard Engineering Requirements](https://docs.google.com/document/d/1L8tWDUjccutGGAldhpypRtPCaw3kkXboPUTtTZb02OA/edit?usp=sharing):
> In order to support statistical inference (confidence intervals and hypothesis tests), we require the concept of an ID Bucket. We divide the space of user IDs (probably client_id in most cases) into 20 buckets with equal probability of assignment into each bucket. It is very important that this assignment is orthogonal to anything else that we care about - for example, it would be a problem if newer profiles were more likely to be assigned to particular buckets, or if profiles in certain experiment arms always ended up in particular buckets.
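One common way to get such an assignment (a sketch only; the hash function and the choice of `client_id` are assumptions, not a statement of what this PR implements) is a stable hash of the ID taken mod 20, which is deterministic and unrelated to profile age, channel, or experiment enrollment:

```sql
-- Sketch: derive a 20-way ID bucket from a stable hash of client_id.
-- FARM_FINGERPRINT is deterministic, so a given client always lands in
-- the same bucket, and the hash is orthogonal to anything we analyze.
SELECT
  client_id,
  MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket
FROM
  `my-project.telemetry.example_clients_last_seen_v1`;
```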
This is an alternative to #13 that separates ETL into a single
dimensional `_raw_v1` table and defers presentation-layer details to live
views where possible.
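Concretely (all names below are placeholders), the presentation layer can then be a plain view over the raw table, so labels and other display details can change without re-running ETL:

```sql
-- Sketch: presentation details live in a view over the _raw_v1 table,
-- so they can be updated in place without backfilling the raw data.
CREATE VIEW
  `my-project.analysis.example_exact_mau28_v1`
AS
SELECT
  submission_date,
  normalized_channel,
  IF(normalized_channel = 'release', 'Release', 'Prerelease') AS channel_group,
  mau
FROM
  `my-project.analysis.example_exact_mau28_raw_v1`;
```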
For a time, we thought we were going to use only the release channel for KPI
calculations, but this is likely to revert to the historical
stance of including prerelease channels in MAU calculations.
This PR removes the filter we had on channel and retains `normalized_channel`
as a dimension, so we can still filter by channel at the analysis level if needed.
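At analysis time (in Redash, for example), restricting to release then becomes a simple query-time filter; the names below are illustrative only:

```sql
-- Sketch: channel filtering moves to query time instead of ETL.
SELECT
  submission_date,
  mau
FROM
  `my-project.analysis.example_exact_mau28_v1`
WHERE
  normalized_channel = 'release'
ORDER BY
  submission_date;
```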