Home
Tim Abraldes edited this page on 2016-02-16 18:05:14 -08:00

Prowac

Primary Goals

The goal of Prowac is to collect and provide information about the adoption of the technologies identified in WADI that enable sites to be called Progressive Web Apps:

  • Measure adoption of Progressive Web App technologies
  • Understand which tools and services are used with Progressive Web Apps

Description

Prowac can be organized into three pieces: A data store that contains information about the web technologies in use by popular sites, a crawler that populates the data store, and a dashboard that displays the information from the data store.

Prior Art

Requirements - MVP requirements to be completed in Q1

Data Store

For each site in the top one million websites ranked by Alexa, the data store should contain:

  • Whether the landing page declares a W3C App Manifest
  • Whether the landing page imports GoRoost
  • Whether the landing page imports OneSignal
  • Whether the landing page imports Mobify
  • Whether the landing page redirects to HTTPS
  • Whether the landing page registers a Service Worker
  • Whether the landing page registers for Push Notifications
  • Whether the Service Worker caches the landing page

Each website in this data store should be updated approximately once per month. The data store should retain the last 12 data points for each website (i.e. the information above, for each site, for about one year).

TODO: Verify this last requirement.
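
To make the intended shape of a single data point concrete, it might look roughly like the sketch below. This is illustrative only; the field names and the example URL are assumptions, not part of the requirements.

  // Sketch of one data point for one site (field names illustrative, not settled)
  const exampleDataPoint = {
    url: 'example.com',
    alexaRank: 1234,
    fetchedAt: '2016-02-16T00:00:00Z',    // crawled roughly once per month
    hasManifest: true,
    importsGoRoost: false,
    importsOneSignal: false,
    importsMobify: false,
    redirectsToHttps: true,
    registersServiceWorker: true,
    registersPushNotifications: false,
    serviceWorkerCachesLandingPage: true
  };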

Crawler

  • Populates the data store.
  • Must be automatic
  • Must be capable of resuming from failure/stop
  • Must process each site in less than 1s on average
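
One way the "resume from failure/stop" requirement could be satisfied is a reliable-queue pattern over the URL DB. The sketch below assumes the URL DB is kept in Redis (as proposed for the data store in the Design section) and uses the node redis client; the key and function names are illustrative only.

  // Sketch: crash-safe URL processing via a Redis reliable queue (RPOPLPUSH).
  // Key names ('urls:pending', 'urls:inProgress') are illustrative.
  const redis = require('redis');
  const client = redis.createClient();

  function processNextUrl(handleUrl, callback) {
    // Atomically move the next URL to an in-progress list so a crash never loses it.
    client.rpoplpush('urls:pending', 'urls:inProgress', (err, nextUrl) => {
      if (err || nextUrl === null) return callback(err || null);
      handleUrl(nextUrl, (handleErr) => {
        if (handleErr) return callback(handleErr);
        // Only drop the URL once it has been fully processed; on restart,
        // anything left in 'urls:inProgress' can simply be re-queued.
        client.lrem('urls:inProgress', 1, nextUrl, callback);
      });
    });
  }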

Dashboard

Should present the following information, in graphs where appropriate:

  • % of sites with a manifest
  • % of sites importing any of [GoRoost, OneSignal, Mobify]
  • % of sites importing GoRoost
  • % of sites importing OneSignal
  • % of sites importing Mobify
  • % of sites forcing HTTPS
  • % of sites registering a Service Worker
  • % of sites registering for Push Notifications
  • % of sites caching the landing page in a Service Worker

It should additionally be possible to view individual data points for a specified site.
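
As a rough illustration of how these percentages could be derived from the data store, assuming each site's most recent data point looks like the record sketched under Data Store: `latestDataPoints` (an array of per-site records) and the field names are assumptions for illustration only.

  // Sketch: derive dashboard percentages from the latest data point per site.
  function percentWhere(dataPoints, predicate) {
    if (dataPoints.length === 0) return 0;
    return 100 * dataPoints.filter(predicate).length / dataPoints.length;
  }

  const dashboardStats = {
    manifest: percentWhere(latestDataPoints, (d) => d.hasManifest),
    anyVendor: percentWhere(latestDataPoints,
      (d) => d.importsGoRoost || d.importsOneSignal || d.importsMobify),
    forcesHttps: percentWhere(latestDataPoints, (d) => d.redirectsToHttps),
    serviceWorker: percentWhere(latestDataPoints, (d) => d.registersServiceWorker),
    pushNotifications: percentWhere(latestDataPoints, (d) => d.registersPushNotifications),
    swCachesLandingPage: percentWhere(latestDataPoints, (d) => d.serviceWorkerCachesLandingPage)
  };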

Deployment

  • No requirements about specific domain names or redirects
  • Must be deployed somewhere publicly-accessible (e.g. Heroku)

Design

Data Store

Redis DB!
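
A minimal sketch of how the retention requirement could map onto Redis with the node redis client: one list per site, trimmed to the last 12 data points. The key scheme is illustrative, not decided.

  // Sketch: store a data point and keep only the most recent 12 per site.
  const redis = require('redis');
  const client = redis.createClient();

  function storeDataPoint(site, dataPoint, callback) {
    const key = 'datapoints:' + site;              // e.g. 'datapoints:example.com'
    client.lpush(key, JSON.stringify(dataPoint), (err) => {
      if (err) return callback(err);
      client.ltrim(key, 0, 11, callback);          // retain the last 12 entries
    });
  }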

For future consideration if necessary:

Crawler

Written in Node.js, it does the following:

TODO: Flow diagram (started @ https://www.draw.io/#G0BzDhuKIxRtxdNnR1RzVXbkxYQXc )

  • Check URL DB: If empty, fetch Alexa list and populate URL DB
  • For each URL in the URL DB:
    • For each of the User Agents we wish to test:
      • Try to retrieve page using HTTP
      • Resolve any redirects, tracking whether the page is eventually served over HTTPS
      • Check the HTML for <link rel="manifest" or similar
      • Parse HTML for any scripts and fetch those
        • Check the JS scripts for GoRoost import - TODO: What does this look like?
        • Check the JS scripts for OneSignal import - TODO: What does this look like?
        • Check the JS scripts for Mobify import - TODO: What does this look like?
        • Check the JS scripts for navigator.serviceWorker.register and fetch the SW
          • TODO: How to determine whether the SW caches the landing page?
        • Check the JS scripts for pushManager.subscribe()
      • Store data point in our data store for this URL
      • Remove URL from URL DB
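
A minimal sketch of the per-URL probe step for one User Agent, using only built-in Node.js modules and regex checks against the landing page HTML. It covers the HTTPS, manifest, Service Worker, and push checks; it does not yet fetch external scripts or the Service Worker itself, and the function and field names are illustrative only.

  // Sketch: probe one URL. Follows redirects manually, then runs regex
  // checks against the landing page HTML.
  const http = require('http');
  const https = require('https');
  const url = require('url');

  function fetchFollowingRedirects(pageUrl, userAgent, redirectsLeft, callback) {
    const parsed = url.parse(pageUrl);
    const lib = parsed.protocol === 'https:' ? https : http;
    const req = lib.get({
      hostname: parsed.hostname,
      path: parsed.path,
      headers: { 'User-Agent': userAgent }
    }, (res) => {
      const isRedirect = res.statusCode >= 300 && res.statusCode < 400 && res.headers.location;
      if (isRedirect && redirectsLeft > 0) {
        res.resume();  // discard the body and follow the redirect
        const next = url.resolve(pageUrl, res.headers.location);
        return fetchFollowingRedirects(next, userAgent, redirectsLeft - 1, callback);
      }
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => callback(null, { finalUrl: pageUrl, html: body }));
    });
    req.on('error', callback);
  }

  function probe(siteUrl, userAgent, callback) {
    fetchFollowingRedirects(siteUrl, userAgent, 5, (err, page) => {
      if (err) return callback(err);
      callback(null, {
        redirectsToHttps: page.finalUrl.indexOf('https:') === 0,
        hasManifest: /<link[^>]+rel=["']?manifest/i.test(page.html),
        registersServiceWorker: /navigator\.serviceWorker\.register/.test(page.html),
        registersPushNotifications: /pushManager\.subscribe/.test(page.html)
      });
    });
  }

In this sketch, calling probe() for a URL and User Agent yields a partial data point that the crawler would still need to merge with the script- and Service-Worker-level checks before storing it and removing the URL from the URL DB.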

User agents we wish to test:

  • Desktop Firefox
  • Desktop Chrome
  • Fennec
  • Mobile Chrome

Other options considered:

  • A Python client
  • Remote-controlling a real browser engine using Marionette, CasperJS, Selenium, or similar, and mocking the various calls that we care about

Dashboard

Mozaik or similar front-end

Deployment

  • Heroku

Risks

Data store

  • May be too large to store in memory, to return all records in one query, etc. - extra design/implementation time will be needed to deal with these issues

Analysis: 1M sites with 12 data points each. If each data point is 1 KB, that's a 12 GB database.

Crawler

  • Performance: If each site takes 1s to analyze, updating the whole 1M+ list will take 11+ days.
  • False negatives: deferred loading of additional JavaScript that is not in the initial payload

Dashboard

  • Tim unfamiliar with front-end design and JS libraries

Deployment

  • No currently visible risks

People

  • Tim - project coordination, design/engineering (crawler, dashboard). Can spend about 75% time on this project over 4 weeks
  • Piotr - design/engineering (crawler, dashboard). Can spend about 40% time on this project over 4 weeks
  • Harald - design/engineering (crawler, dashboard). Can spend about 10% time on this project over 4 weeks

Schedule

Feb 1-5

Planning/coordination

  • Stabilize MVP requirements
  • File implementation issues
  • Wikis/flow diagrams/etc.
  • Design discussions

Crawler

  • Design roughly complete
  • Initial implementation

Dashboard

  • Design/mockups
  • Investigating technologies

Deployment

Feb 8-12

Planning/coordination

  • Verify progress against schedule, report/alter as necessary

Crawler

  • Implementation, refinement of design

Dashboard

Deployment

  • Travis CI

Feb 15-19

Planning/coordination

  • Verify progress against schedule, report/alter as necessary

Crawler

  • "Feature complete"
  • Fetch from Alexa
  • Fetch sites, pass sites to probes, store data in data store

Dashboard

Deployment

Feb 22-26

Planning/coordination

  • Tim hands off the project to Piotr & Harald

Crawler

  • Implement probes

Dashboard

  • Initial implementation

Deployment

  • Once dashboard is ready, deploy to Heroku

Feb 29-Mar 4 (Tim gone)

  • Add/refine tests
  • Refine crawler (urlJobPopulator, urlJobProcessor, probes)
  • Refine data store
  • Refine dashboard
  • Solidify Heroku deployment

Mar 7-11 (Tim gone)

  • Add/refine tests
  • Refine crawler (urlJobPopulator, urlJobProcessor, probes)
  • Refine data store
  • Refine dashboard
  • Solidify Heroku deployment

Mar 14-18 (Tim gone)

Mar 21-25 (Tim gone)

Mar 28-31 (Tim gone)

Meetings

Monday, Feb 1 @ 10:00AM PT

Agenda

  • Overview, people/time resources & schedule (7 min)
  • Requirements (8 min)
  • Design (10 min)
  • Questions/closing (5 min)

Action Items

  • Harald: Mock up the Dashboard
  • Tim: Start crawler, pass on follow-up info to Piotr