Prowac
Primary Goals
The goal of Prowac is to collect and provide information about the adoption of the technologies, identified in WADI, that enable sites to be called Progressive Web Apps:
- Measure adoption of progressive web apps technologies
- Understand which tools and services are used with progressive web apps
Description
Prowac can be organized into three pieces: A data store that contains information about the web technologies in use by popular sites, a crawler that populates the data store, and a dashboard that displays the information from the data store.
Prior Art
- https://libscore.com/
- https://trends.builtwith.com/
- https://www.chromestatus.com/metrics/feature/popularity
Requirements - MVP requirements to be completed in Q1
Data Store
For each site in the top one million websites ranked by Alexa, the data store should contain:
- Whether the landing page declares a W3C App Manifest
- Whether the landing page imports GoRoost
- Whether the landing page imports OneSignal
- Whether the landing page imports Mobify
- Whether the landing page redirects to HTTPS
- Whether the landing page registers a Service Worker
- Whether the landing page registers for Push Notifications
- Whether the Service Worker caches the landing page
Each website in this data store should be updated approximately once per month. The data store should retain the last 12 data points for each website (i.e. the information above, for each site, covering about 1 year).
TODO: Verify this last requirement.
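As a rough sketch, one monthly data point per site might look like the following (the field names are illustrative assumptions, not a finalized schema):

  // Illustrative shape of one monthly data point for one site.
  // Field names are assumptions, not a finalized schema.
  const dataPoint = {
    url: 'https://example.com/',
    fetchedAt: '2016-02-15T00:00:00Z',
    hasManifest: true,                    // landing page declares a W3C App Manifest
    importsGoRoost: false,
    importsOneSignal: false,
    importsMobify: false,
    forcesHttps: true,                    // landing page redirects to HTTPS
    registersServiceWorker: true,
    registersPushNotifications: false,
    serviceWorkerCachesLandingPage: true
  };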
Crawler
- Populates the data store.
- Must be automatic
- Must be capable of resuming from failure/stop
- Must process each site in less than 1s on average
Dashboard
Should present the following information, in graphs where appropriate:
- % of sites with a manifest
- % of sites importing any of [GoRoost, OneSignal, Mobify]
- % of sites importing GoRoost
- % of sites importing OneSignal
- % of sites importing Mobify
- % of sites forcing HTTPS
- % of sites registering a Service Worker
- % of sites registering for Push Notifications
- % of sites caching the landing page in a Service Worker
It should additionally be possible to view the individual data points for a specified site
Deployment
- No requirements about specific domain names or redirects
- Must be deployed somewhere publicly-accessible (e.g. Heroku)
Design
Data Store
Redis DB!
For future consideration if necessary:
- Cassandra
- Redshift
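A minimal sketch of how the "last 12 data points per site" requirement could map onto Redis, assuming a Node.js client such as ioredis (the client choice and key naming are assumptions, not decisions):

  const Redis = require('ioredis');
  const redis = new Redis(process.env.REDIS_URL);

  // Append this month's data point for a site and keep only the last 12,
  // i.e. roughly one year of monthly measurements.
  // The key scheme ("site:<url>:datapoints") is illustrative, not final.
  async function storeDataPoint(url, dataPoint) {
    const key = 'site:' + url + ':datapoints';
    await redis.rpush(key, JSON.stringify(dataPoint));
    await redis.ltrim(key, -12, -1);
  }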
Crawler
Written in Node.js; it does the following:
TODO: Flow diagram (started @ https://www.draw.io/#G0BzDhuKIxRtxdNnR1RzVXbkxYQXc )
- Check URL DB: If empty, fetch Alexa list and populate URL DB
- For each URL in the URL DB:
- For each of the User Agents we wish to test:
- Try to retrieve page using HTTP
- Resolve any redirects, tracking whether the page is eventually served over HTTPS
- Check the HTML for <link rel="manifest"> or similar
- Parse the HTML for any scripts and fetch those
- Check the JS scripts for a GoRoost import - TODO: What does this look like?
- Check the JS scripts for a OneSignal import - TODO: What does this look like?
- Check the JS scripts for a Mobify import - TODO: What does this look like?
- Check the JS scripts for navigator.serviceWorker.register and fetch the SW - TODO: How to determine whether the SW caches the landing page?
- Check the JS scripts for pushManager.subscribe() (probe sketches for these checks appear at the end of this Crawler section)
- Store data point in our data store for this URL
- Remove URL from URL DB
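A rough sketch of the loop above in Node.js; the URL DB, data store, and probe interfaces are placeholders based on the names used elsewhere in this document:

  // Rough sketch of the crawl loop. Resumability comes from removing a URL
  // from the URL DB only after its data point has been stored.
  const USER_AGENTS = ['desktop-firefox', 'desktop-chrome', 'fennec', 'mobile-chrome'];

  async function crawl(urlDb, dataStore, probes) {
    if (await urlDb.isEmpty()) {
      await urlDb.populateFromAlexaList(); // fetch the Alexa top sites list
    }
    for await (const url of urlDb.urls()) {
      const dataPoint = { url, fetchedAt: new Date().toISOString() };
      for (const userAgent of USER_AGENTS) {
        // Fetch the landing page, following redirects and noting HTTPS.
        const page = await fetchPageFollowingRedirects(url, userAgent);
        for (const probe of probes) {
          Object.assign(dataPoint, await probe(page)); // each probe contributes fields
        }
      }
      await dataStore.storeDataPoint(url, dataPoint);
      await urlDb.remove(url); // only now is the URL considered done
    }
  }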
User agents we wish to test:
- Desktop Firefox
- Desktop Chrome
- Fennec
- Mobile Chrome
Other options considered:
- A Python client
- remote controlling a real browser engine using Marionette, CasperJS, Selenium, or similar and mocking the various calls that we care about
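As a sketch of what the individual probes could look like, here are regex-based checks over the fetched HTML and scripts (the page object shape and the GoRoost/OneSignal/Mobify signatures are assumptions; the exact signatures are still TODOs above):

  // Illustrative probes over the fetched landing-page HTML and its scripts.
  // `page` is assumed to expose the raw HTML, the fetched script bodies,
  // and the script URLs found in the HTML.
  function probeManifest(page) {
    return { hasManifest: /<link[^>]+rel=["']?manifest["']?/i.test(page.html) };
  }

  function probeServiceWorker(page) {
    const js = page.scripts.join('\n');
    return {
      registersServiceWorker: /navigator\.serviceWorker\.register\s*\(/.test(js),
      registersPushNotifications: /pushManager\.subscribe\s*\(/.test(js)
    };
  }

  function probeThirdParties(page) {
    const srcs = page.scriptUrls.join('\n');
    return {
      importsGoRoost: /goroost/i.test(srcs),     // assumed hostname/URL match
      importsOneSignal: /onesignal/i.test(srcs),
      importsMobify: /mobify/i.test(srcs)
    };
  }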
Dashboard
Mozaik or similar front-end
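As a sketch, the dashboard's percentage metrics could be computed from the most recent data point per site, assuming the Redis layout sketched above (helper and field names are illustrative):

  // Compute "% of sites with a manifest" from the latest data point per site.
  async function percentWithManifest(redis, siteKeys) {
    let withManifest = 0;
    for (const key of siteKeys) {
      const latest = await redis.lindex(key, -1); // most recent data point
      if (latest && JSON.parse(latest).hasManifest) {
        withManifest++;
      }
    }
    return (100 * withManifest) / siteKeys.length;
  }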
Deployment
- Heroku
Risks
Data store
- May be too large to store in memory, to return all records in one query, etc. - extra design/implementation time will be needed to deal with these issues
Analysis: 1M sites × 12 data points each; if each data point is 1KB, that's a 12GB database.
Crawler
- Performance: If each site takes 1s to analyze, updating the whole 1M+ list serially will take 11+ days, so sites will need to be processed concurrently (see the sketch after this list)
- False negatives: Deferred loading of additional JavaScript that is not in the initial payload may cause technologies to go undetected
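One way to meet the <1s-per-site average despite slow individual fetches is to process many sites concurrently; a minimal worker-pool sketch (the concurrency level and the processSite helper are placeholders):

  // Process URLs with a fixed number of concurrent workers so the average
  // per-site cost stays well below 1s even when individual fetches are slow.
  async function crawlConcurrently(urls, processSite, concurrency = 50) {
    const queue = urls.slice();
    const workers = Array.from({ length: concurrency }, async () => {
      while (queue.length > 0) {
        const url = queue.pop();
        try {
          await processSite(url);
        } catch (err) {
          console.error('failed to process ' + url + ': ' + err.message);
        }
      }
    });
    await Promise.all(workers);
  }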
Dashboard
- Tim is unfamiliar with front-end design and JS libraries
Deployment
- No currently visible risks
People
- Tim - project coordination, design/engineering (crawler, dashboard). Can spend about 75% time on this project over 4 weeks
- Piotr - design/engineering (crawler, dashboard). Can spend about 40% time on this project over 4 weeks
- Harald - design/engineering (crawler, dashboard). Can spend about 10% time on this project over 4 weeks
Schedule
Feb 1-5
Planning/coordination
- Stabilize MVP requirements
- File implementation issues
- Wikis/flow diagrams/etc.
- Design discussions
Crawler
- Design roughly complete
- Initial implementation
Dashboard
- Design/mockups
- Investigating technologies
Deployment
Feb 8-12
Planning/coordination
- Verify progress against schedule, report/alter as necessary
Crawler
- Implementation, refinement of design
Dashboard
Deployment
- Travis CI
Feb 15-19
Planning/coordination
- Verify progress against schedule, report/alter as necessary
Crawler
- "Feature complete"
- Fetch from Alexa
- Fetch sites, pass sites to probes, store data in data store
Dashboard
Deployment
Feb 22-26
Planning/coordination
- Tim hands off the project to Piotr & Harald
Crawler
- Implement probes
Dashboard
- Initial implementation
Deployment
- Once dashboard is ready, deploy to Heroku
Feb 29-Mar 4 (Tim gone)
- Add/refine tests
- Refine crawler (urlJobPopulator, urlJobProcessor, probes)
- Refine data store
- Refine dashboard
- Solidify Heroku deployment
Mar 7-11 (Tim gone)
- Add/refine tests
- Refine crawler (urlJobPopulator, urlJobProcessor, probes)
- Refine data store
- Refine dashboard
- Solidify Heroku deployment
Mar 14-18 (Tim gone)
Mar 21-25 (Tim gone)
Mar 28-31 (Tim gone)
Meetings
Monday, Feb 1 @ 10:00AM PT
Agenda
- Overview, people/time resources & schedule (7 min)
- Requirements (8 min)
- Design (10 min)
- Questions/closing (5 min)
Action Items
- Harald: Mockup Dashboard
- Tim: Start crawler, pass on follow-up info to Piotr