Adding host database update mode

* Update mode `srcupdate` pulls "Umbrella" top url list
* Databases can be overridden by files in workdir
* Rewrote url_db.py to become sources_db.py
* Adapted argument parser
* Non-existent test set handles now yield empty sets
* Adapted existing code to new interface
* Added unit tests for sources_db
* Documentation update
* Version bump to 3.1.0-alpha.3
* Moving reportdir check to modes that require it
Christiane Ruetten 2017-06-09 22:58:04 +02:00
Parent 81bd5478be
Commit cd798a2457
27 changed files with 713 additions and 2678248 deletions


@ -1,13 +1,17 @@
# TLS Canary version 3
Automated testing of Firefox for TLS/SSL web compatibility
Results live here:
Regression scanning results live here:
http://tlscanary.mozilla.org
## This project
* Downloads a branch build and a release build of Firefox.
* Automatically runs thousands of secure sites on those builds.
* Diffs the results and presents potential regressions in an HTML page for further diagnosis.
* Does performance regression testing
* Extracts SSL state information
* Can maintain an updated list of TLS-enabled top sites
* Requires a highly reliable network link. **WiFi will not do.**
## Requirements
* Python 2.7
@ -18,20 +22,40 @@ http://tlscanary.mozilla.org
* OpenSSL-dev
* libffi-dev
The script ```linux_bootstrap.sh``` provides bootstrapping for an Ubuntu-based EC2 instance.
The script [linux_bootstrap.sh](linux_bootstrap.sh) provides bootstrapping for an Ubuntu-based EC2 instance.
### Windows support
## Linux and Mac usage
```
git clone https://github.com/mozilla/tls-canary
cd tls-canary
virtualenv .
source bin/activate
pip install -e .
tls_canary --help
tls_canary --reportdir=/tmp/test --debug debug
```
The file ```windows_bootstrap.txt``` contains information on Windows-specific installation steps.
Target environment is PowerShell. We're assuming [Chocolatey](https://chocolatey.org/) for dependency management.
## Windows support
Windows support targets **PowerShell 5.1** on **Windows 10**. Windows 7 and 8
are generally able to run TLS Canary, but expect minor Unicode
encoding issues in terminal logging output.
## Usage
* cd tls-canary
* virtualenv .
* source bin/activate
* pip install -e .
* tls_canary --help
* tls_canary --reportdir=/tmp/test --debug debug
### Run in an admin PowerShell
First, [install Chocolatey](https://chocolatey.org/install), then
```
choco install 7zip.commandline git golang openssh python2
choco install python3 # Optional, provides the virtualenv cmdlet
pip install virtualenv # Not required if python3 installed
```
### Run in a user PowerShell
```
git clone https://github.com/mozilla/tls-canary
cd tls-canary
virtualenv -p c:\python27\python.exe venv
venv\Scripts\activate
pip install -e .
```
### Command line arguments
Argument | Choices / **default** | Description
@ -51,7 +75,7 @@ Argument | Choices / **default** | Description
-t --test | release, **nightly**, beta, aurora, esr | Specify the main test candidate. Used by every run mode.
-w --workdir | **~/.tlscanary** | Directory where cached files and other state are stored
-x --scans | 3 | Number of scans to run against each host during performance mode. Currently limited to 20.
MODE | **performance**, regression, scan | Test mode to run, given as positional parameter
MODE | **performance**, regression, scan, srcupdate | Test mode to run, given as positional parameter
### Test modes
@ -62,6 +86,7 @@ Mode | Description
performance | Runs a performance analysis against the hosts in the test set. Use `--scans` to specify how often each host is tested.
regression | Runs a TLS regression test, comparing the 'test' candidate against the 'baseline' candidate. Only reports errors that are new to the test candidate; no error generated by the baseline can make it into the report.
scan | This mode only collects connection state information for every host in the test set.
srcupdate | Compile a fresh set of TLS-enabled 'top' sites from the *Umbrella Top 1M* list. Use `-l` to override the default target size of 500k hosts. Use `-x` to adjust the number of retry passes used to weed out spurious errors; `-x1` gives roughly a factor-two speed improvement at the cost of slightly less stable results. Use `-b` to change the Firefox version used for filtering. `-s` can create a new database, but cannot make it the default.
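The chunk-and-retry filtering strategy behind `srcupdate` (implemented in `modes/sourceupdate.py` in this commit) can be sketched in a few lines. This is an illustrative Python 3 sketch, not project code: `filter_hosts` and the `is_reachable` predicate are hypothetical names, and the real implementation drives Firefox scans rather than a predicate.

```python
def filter_hosts(hosts, limit, passes, is_reachable):
    """Collect up to `limit` hosts that survive `passes` retry rounds.

    Hosts are processed in chunks; errors from one pass are re-tested in
    the next pass, so only persistently failing hosts are dropped.
    """
    working = []
    chunk_size = max(limit // 20, 1000)
    for start in range(0, len(hosts), chunk_size):
        if len(working) >= limit:
            break
        chunk = set(hosts[start:start + chunk_size])
        errors = chunk
        for _ in range(passes):
            # Only re-test the hosts that failed the previous pass
            errors = {h for h in errors if not is_reachable(h)}
            if not errors:
                break
        working.extend(sorted(chunk - errors))
    return working[:limit]
```

Because only hosts that fail every pass are dropped, transient network errors do not evict a host from the working set.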
## Testing
* nosetests -sv
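The commit message mentions new unit tests for `sources_db`. As a minimal standalone sketch of the kind of property such tests can check, the row-handling of `Sources.sort()` and `Sources.trim()` is reimplemented here (illustrative code only; the helper name `sort_and_trim` is not part of the project):

```python
def sort_and_trim(rows, limit):
    """Mirror Sources.sort() followed by Sources.trim(): order rows by
    integer rank and cap the list at `limit` (None means no limit)."""
    rows = sorted(rows, key=lambda row: int(row["rank"]))
    if limit is not None:
        rows = rows[:limit]
    return rows
```

Note that ranks are stored as strings in the CSV rows, so sorting must convert them to integers to avoid lexicographic ordering (`"10" < "2"`).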

0
cert.py Executable file → Normal file

109
firefox_downloader.py Executable file → Normal file

@ -14,6 +14,60 @@ import cache
logger = logging.getLogger(__name__)
def get_to_file(url, filename):
global logger
try:
# TODO: Validate the server's SSL certificate
req = urllib2.urlopen(url)
file_size = int(req.info().getheader('Content-Length').strip())
# Caching logic is: don't re-download if file of same size is
# already in place. TODO: Switch to ETag if that's not good enough.
# This already prevents cache clutter with incomplete files.
if os.path.isfile(filename):
if os.stat(filename).st_size == file_size:
req.close()
logger.warning('Skipping download, using cached file `%s` instead' % filename)
return filename
else:
logger.warning('Purging incomplete or obsolete cache file `%s`' % filename)
os.remove(filename)
logger.debug('Downloading `%s` to %s' % (url, filename))
downloaded_size = 0
chunk_size = 32 * 1024
with open(filename, 'wb') as fp:
while True:
chunk = req.read(chunk_size)
if not chunk:
break
downloaded_size += len(chunk)
fp.write(chunk)
except urllib2.HTTPError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('HTTP error: %s, %s' % (err.code, url))
return None
except urllib2.URLError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('URL error: %s, %s' % (err.reason, url))
return None
except KeyboardInterrupt:
if os.path.isfile(filename):
os.remove(filename)
if sys.stdout.isatty():
print
logger.critical('Download interrupted by user')
return None
return filename
class FirefoxDownloader(object):
__base_url = 'https://download.mozilla.org/?product=firefox' \
@ -66,59 +120,6 @@ class FirefoxDownloader(object):
self.__workdir = workdir
self.__cache = cache.DiskCache(os.path.join(workdir, "cache"), cache_timeout, purge=True)
@staticmethod
def __get_to_file(url, filename):
try:
# TODO: Validate the server's SSL certificate
req = urllib2.urlopen(url)
file_size = int(req.info().getheader('Content-Length').strip())
# Caching logic is: don't re-download if file of same size is
# already in place. TODO: Switch to ETag if that's not good enough.
# This already prevents cache clutter with incomplete files.
if os.path.isfile(filename):
if os.stat(filename).st_size == file_size:
req.close()
logger.warning('Skipping download using cached file `%s`' % filename)
return filename
else:
logger.warning('Purging incomplete or obsolete cache file `%s`' % filename)
os.remove(filename)
logger.info('Downloading `%s` to %s' % (url, filename))
downloaded_size = 0
chunk_size = 32 * 1024
with open(filename, 'wb') as fp:
while True:
chunk = req.read(chunk_size)
if not chunk:
break
downloaded_size += len(chunk)
fp.write(chunk)
except urllib2.HTTPError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('HTTP error: %s, %s' % (err.code, url))
return None
except urllib2.URLError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('URL error: %s, %s' % (err.reason, url))
return None
except KeyboardInterrupt:
if os.path.isfile(filename):
os.remove(filename)
if sys.stdout.isatty():
print
logger.critical('Download interrupted by user')
return None
return filename
def download(self, release, platform=None, use_cache=True):
if platform is None:
@ -138,4 +139,4 @@ class FirefoxDownloader(object):
self.__cache.delete(cache_id)
# __get_to_file will not re-download if same-size file is already there.
return self.__get_to_file(url, self.__cache[cache_id])
return get_to_file(url, self.__cache[cache_id])
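The caching rule described in the comments above — skip the download when a same-size file is already cached, purge it otherwise — can be sketched standalone. A Python 3 sketch; `cached_copy_is_complete` is a hypothetical helper for illustration, not part of the module:

```python
import os


def cached_copy_is_complete(filename, expected_size):
    """Return True when `filename` exists and matches the remote
    Content-Length, i.e. the earlier download completed and the
    cached copy can be reused without re-downloading."""
    return os.path.isfile(filename) and os.stat(filename).st_size == expected_size
```

Comparing sizes catches interrupted downloads cheaply; as the TODO in the diff notes, an ETag comparison would additionally catch same-size content changes.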

0
firefox_extractor.py Executable file → Normal file

43
main.py Executable file → Normal file

@ -1,5 +1,3 @@
#!/usr/bin/env python2
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
@ -8,6 +6,7 @@ import argparse
import logging
import coloredlogs
import os
import pkg_resources
import shutil
import sys
import tempfile
@ -16,7 +15,7 @@ import cleanup
import firefox_downloader as fd
import loader
import modes
import url_store as us
import sources_db as sdb
# Initialize coloredlogs
@ -30,13 +29,17 @@ def get_argparser():
Argument parsing
:return: Parsed arguments object
"""
pkg_version = pkg_resources.require("tls_canary")[0].version
home = os.path.expanduser('~')
testset_choice, testset_default = us.URLStore.list()
testset_choice.append('list')
# By nature of workdir being undetermined at this point, user-defined test sets in
# the override directory cannot override the default test set. The defaulting logic
# needs to move behind the argument parser for that to happen.
src = sdb.SourcesDB()
testset_default = src.default
release_choice, _, test_default, base_default = fd.FirefoxDownloader.list()
parser = argparse.ArgumentParser(prog="tls_canary")
parser.add_argument('--version', action='version', version='%(prog)s 3.1.0-alpha.2')
parser.add_argument('--version', action='version', version='%(prog)s ' + pkg_version)
parser.add_argument('-b', '--base',
help='Firefox base version to compare against (default: `%s`)' % base_default,
choices=release_choice,
@ -62,10 +65,10 @@ def get_argparser():
action='store',
default=4)
parser.add_argument('-l', '--limit',
help='Limit for number of URLs to test (default: unlimited)',
help='Limit for number of hosts to test (default: no limit)',
type=int,
action='store',
default=0)
default=None)
parser.add_argument('-m', '--timeout',
help='Timeout for worker requests (default: 10)',
type=float,
@ -88,9 +91,7 @@ def get_argparser():
action='store',
default=os.getcwd())
parser.add_argument('-s', '--source',
metavar='TESTSET',
help='Test set to run. Use `list` for info. (default: `%s`)' % testset_default,
choices=testset_choice,
action='store',
default=testset_default)
parser.add_argument('-t', '--test',
@ -231,20 +232,19 @@ def main():
# If 'list' is specified as test, list available test sets, builds, and platforms
if args.source == "list":
testset_list, testset_default = us.URLStore.list()
coloredlogs.install(level='ERROR')
db = sdb.SourcesDB(args)
build_list, platform_list, _, _ = fd.FirefoxDownloader.list()
urldb = us.URLStore(os.path.join(module_dir, "sources"))
print "Available builds: %s" % ' '.join(build_list)
print "Available platforms: %s" % ' '.join(platform_list)
print "Available test sets:"
for testset in testset_list:
urldb.clear()
urldb.load(testset)
if testset == testset_default:
default = "(default)"
for handle in db.list():
test_set = db.read(handle)
if handle == db.default:
default = " (default)"
else:
default = ""
print " - %s [%d] %s" % (testset, len(urldb), default)
print " - %s [%d hosts]%s" % (handle, len(test_set), default)
sys.exit(1)
# Create workdir (usually ~/.tlscanary, used for caching etc.)
@ -253,13 +253,6 @@ def main():
logger.debug('Creating working directory %s' % args.workdir)
os.makedirs(args.workdir)
# All code paths after this will generate a report, so check
# whether the report dir is a valid target. Specifically, prevent
# writing to the module directory.
if os.path.normcase(os.path.realpath(args.reportdir)) == os.path.normcase(os.path.realpath(module_dir)):
logger.critical("Refusing to write report to module directory. Please set --reportdir")
sys.exit(1)
# Load the specified test mode
try:
loader.run(args, module_dir, tmp_dir)


@ -6,8 +6,9 @@ import basemode
import performance
import regression
import scan
import sourceupdate
__all__ = ["performance", "regression", "scan"]
__all__ = ["performance", "regression", "scan", "sourceupdate"]
def __subclasses_of(cls):


@ -26,10 +26,8 @@ class BaseMode(object):
Base functionality for all tests
"""
def __init__(self, args, module_dir, tmp_dir):
global logger
self.__args = args
self.__mode = args.mode
self.args = args
self.mode = args.mode
self.module_dir = module_dir
self.tmp_dir = tmp_dir
@ -57,15 +55,14 @@ class BaseMode(object):
logger.error('Unsupported platform: %s' % sys.platform)
sys.exit(5)
logger.debug('Detected platform: %s' % platform)
# Download test candidate
fdl = fd.FirefoxDownloader(self.__args.workdir, cache_timeout=1*60*60)
logger.info('Downloading Firefox `%s` build for platform `%s`' % (build, platform))
fdl = fd.FirefoxDownloader(self.args.workdir, cache_timeout=1 * 60 * 60)
build_archive_file = fdl.download(build, platform)
if build_archive_file is None:
sys.exit(-1)
# Extract candidate archive
candidate_app = fe.extract(build_archive_file, self.__args.workdir, cache_timeout=1*60*60)
candidate_app = fe.extract(build_archive_file, self.args.workdir, cache_timeout=1 * 60 * 60)
logger.debug("Build candidate executable is `%s`" % candidate_app.exe)
return candidate_app
@ -93,9 +90,9 @@ class BaseMode(object):
dir_util.copy_tree(default_profile_dir, new_profile_dir)
logger.info("Updating OneCRL revocation data")
if self.__args.onecrl == "production" or self.__args.onecrl == "stage":
if self.args.onecrl == "production" or self.args.onecrl == "stage":
# overwrite revocations file in test profile with live OneCRL entries from requested environment
revocations_file = one_crl.get_list(self.__args.onecrl, self.__args.workdir)
revocations_file = one_crl.get_list(self.args.onecrl, self.args.workdir)
profile_file = os.path.join(new_profile_dir, "revocations.txt")
logger.debug("Writing OneCRL revocations data to `%s`" % profile_file)
shutil.copyfile(revocations_file, profile_file)
@ -114,7 +111,7 @@ class BaseMode(object):
global logger
timestamp = start_time.strftime("%Y-%m-%d-%H-%M-%S")
run_dir = os.path.join(self.__args.reportdir, "runs", timestamp)
run_dir = os.path.join(self.args.reportdir, "runs", timestamp)
logger.debug("Saving profile to `%s`" % run_dir)
dir_util.copy_tree(os.path.join(self.tmp_dir, profile_name), os.path.join(run_dir, profile_name))
@ -126,11 +123,11 @@ class BaseMode(object):
# Default to values from args
if num_workers is None:
num_workers = self.__args.parallel
num_workers = self.args.parallel
if n_per_worker is None:
n_per_worker = self.__args.requestsperworker
n_per_worker = self.args.requestsperworker
if timeout is None:
timeout = self.__args.timeout
timeout = self.args.timeout
try:
results = wp.run_scans(app, list(url_list), profile=profile, num_workers=num_workers,


@ -2,15 +2,17 @@
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
from math import ceil
import datetime
import logging
from math import ceil
import os
import pkg_resources as pkgr
import sys
from modes.basemode import BaseMode
import firefox_downloader as fd
import report
import url_store as us
import sources_db as sdb
logger = logging.getLogger(__name__)
@ -25,8 +27,6 @@ class RegressionMode(BaseMode):
super(RegressionMode, self).__init__(args, module_dir, tmp_dir)
# TODO: argument validation logic
# Define instance attributes for later use
self.test_app = None
self.base_app = None
@ -39,6 +39,17 @@ class RegressionMode(BaseMode):
self.error_set = None
def setup(self):
global logger
# Code paths after this will generate a report, so check
# whether the report dir is a valid target. Specifically, prevent
# writing to the module directory.
module_dir = pkgr.require("tls_canary")[0].location
if os.path.normcase(os.path.realpath(self.args.reportdir))\
.startswith(os.path.normcase(os.path.realpath(module_dir))):
logger.critical("Refusing to write report to module directory. Please set --reportdir")
sys.exit(1)
self.test_app = self.get_test_candidate(self.args.test)
self.base_app = self.get_test_candidate(self.args.base)
@ -50,13 +61,14 @@ class RegressionMode(BaseMode):
self.base_profile = self.make_profile("base_profile")
# Compile the set of URLs to test
sources_dir = os.path.join(self.module_dir, 'sources')
urldb = us.URLStore(sources_dir, limit=self.args.limit)
urldb.load(self.args.source)
self.url_set = set(urldb)
db = sdb.SourcesDB(self.args)
logger.info("Reading `%s` host database" % self.args.source)
self.url_set = db.read(self.args.source).as_set()
logger.info("%d URLs in test set" % len(self.url_set))
def run(self):
global logger
logger.info("Testing Firefox %s %s against Firefox %s %s" %
(self.test_metadata["appVersion"], self.test_metadata["branch"],
self.base_metadata["appVersion"], self.base_metadata["branch"]))


@ -5,12 +5,13 @@
import datetime
import logging
import os
import pkg_resources as pkgr
import sys
from modes.basemode import BaseMode
import firefox_downloader as fd
import report
import url_store as us
import sources_db as sdb
logger = logging.getLogger(__name__)
@ -33,7 +34,6 @@ class ScanMode(BaseMode):
logger.debug('Found base build parameter, ignoring')
# Define instance attributes for later use
self.sources_dir = None
self.url_set = None
self.info_uri_set = None
self.test_profile = None
@ -42,12 +42,21 @@ class ScanMode(BaseMode):
self.start_time = None
def setup(self):
global logger
# Code paths after this will generate a report, so check
# whether the report dir is a valid target. Specifically, prevent
# writing to the module directory.
module_dir = pkgr.require("tls_canary")[0].location
if os.path.normcase(os.path.realpath(self.args.reportdir))\
.startswith(os.path.normcase(os.path.realpath(module_dir))):
logger.critical("Refusing to write report to module directory. Please set --reportdir")
sys.exit(1)
# Compile the set of URLs to test
self.sources_dir = os.path.join(self.module_dir, 'sources')
logger.info(self.args)
urldb = us.URLStore(self.sources_dir, limit=self.args.limit)
urldb.load(self.args.source)
self.url_set = set(urldb)
db = sdb.SourcesDB(self.args)
logger.info("Reading `%s` host database" % self.args.source)
self.url_set = db.read(self.args.source).as_set()
logger.info("%d URLs in test set" % len(self.url_set))
# Create custom profile

171
modes/sourceupdate.py Normal file

@ -0,0 +1,171 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
import csv
import datetime
import logging
import os
import sys
import zipfile
from firefox_downloader import get_to_file
from modes.basemode import BaseMode
import sources_db as sdb
logger = logging.getLogger(__name__)
class SourceUpdateMode(BaseMode):
"""
Mode to update the `top` host database from publicly available top sites data
"""
name = "srcupdate"
# There are various top sites databases that might be considered for querying here.
# The other notable database is the notorious `Alexa Top 1M` which is available at
# "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip". It is based on usage data
# gathered from the equally notorious Alexa browser toolbar, while the `Umbrella top 1M`
# used below is DNS-based and its ranking is hence considered to be more representative.
# `Umbrella` and `Alexa` use precisely the same format and their links are thus
# interchangeable.
# For future reference, there is also Ulfr's database at
# "https://ulfr.io/f/top1m_has_tls_sorted.csv". It requires a different parser but
# has the advantage of clustering hosts by shared certificates.
top_sites_location = "http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip"
def __init__(self, args, module_dir, tmp_dir):
super(SourceUpdateMode, self).__init__(args, module_dir, tmp_dir)
self.start_time = None
self.db = None
self.sources = None
self.app = None
self.profile = None
self.result = None
def setup(self):
global logger
self.app = self.get_test_candidate(self.args.base)
self.profile = self.make_profile("base_profile")
tmp_zip_name = os.path.join(self.tmp_dir, "top.zip")
logger.info("Fetching unfiltered top sites data from the `Umbrella Top 1M` online database")
get_to_file(self.top_sites_location, tmp_zip_name)
try:
zipped = zipfile.ZipFile(tmp_zip_name)
if len(zipped.filelist) != 1 or not zipped.filelist[0].orig_filename.lower().endswith(".csv"):
logger.critical("Top sites zip file has unexpected content")
sys.exit(5)
tmp_csv_name = zipped.extract(zipped.filelist[0], self.tmp_dir)
except zipfile.BadZipfile:
logger.critical("Error opening top sites zip archive")
sys.exit(5)
self.db = sdb.SourcesDB(self.args)
is_default = self.args.source == self.db.default
self.sources = sdb.Sources(self.args.source, is_default)
with open(tmp_csv_name) as f:
cr = csv.DictReader(f, fieldnames=["rank", "hostname"])
self.sources.rows = [row for row in cr]
# A mild sanity check to see whether the downloaded data is valid.
if len(self.sources) < 900000:
logger.warning("Top sites is surprisingly small, just %d hosts" % len(self.sources))
if self.sources.rows[0] != {"hostname": "google.com", "rank": "1"}:
logger.warning("Top sites data looks weird. First line: `%s`" % self.sources.rows[0])
def run(self):
"""
Perform the filter run. The objective is to filter out permanent errors so
we don't waste time on them during regular test runs.
The concept is:
Run top sites in chunks through Firefox and re-test all error URLs from that
chunk a number of times to weed out spurious network errors. Stop the process
once the required number of working hosts is collected.
"""
global logger
self.start_time = datetime.datetime.now()
limit = 500000
if self.args.limit is not None:
limit = self.args.limit
logger.info("There are %d hosts in the unfiltered host set" % len(self.sources))
logger.info("Compiling set of %d working hosts for `%s` database update" % (limit, self.sources.handle))
working_set = set()
# Chop unfiltered sources data into chunks and iterate over each
chunk_size = max(int(limit / 20), 1000)
# TODO: Remove this log line once progress reporting is done properly
logger.warning("Progress is reported per chunk of %d hosts, not overall" % chunk_size)
for chunk_start in xrange(0, len(self.sources), chunk_size):
hosts_to_go = max(0, limit - len(working_set))
# Check if we're done
if hosts_to_go == 0:
break
logger.info("%d hosts to go to complete the working set" % hosts_to_go)
chunk_end = chunk_start + chunk_size
# Shrink chunk if it contains way more hosts than required to complete the working set
if chunk_size > hosts_to_go * 2:
# CAVE: This assumes that this is the last chunk we require. The downsized chunk
# is still 50% larger than required to complete the set to compensate for broken
# hosts. If the error rate in the chunk is greater than 50%, another chunk will be
# consumed, resulting in a gap of untested hosts between the end of this downsized
# chunk and the beginning of the next. Not too bad, but important to be aware of.
chunk_end = chunk_start + hosts_to_go * 2
# Check if we're running out of data for completing the set
if chunk_end > len(self.sources):
chunk_end = len(self.sources)
# Run chunk through multiple passes of Firefox, leaving only persistent errors in the
# error set.
logger.info("Processing chunk of %d hosts from the unfiltered set (#%d to #%d)"
% (chunk_end - chunk_start, chunk_start, chunk_end - 1))
pass_chunk = self.sources.as_set(start=chunk_start, end=chunk_end)
pass_errors = pass_chunk
for _ in xrange(self.args.scans):
pass_errors = self.run_test(self.app, pass_errors, profile=self.profile, get_info=False,
get_certs=False, progress=True, return_only_errors=True)
if len(pass_errors) == 0:
break
logger.info("Error rate in chunk was %.1f%%"
% (100.0 * float(len(pass_errors)) / float(chunk_end - chunk_start)))
# Add all non-errors to the working set
working_set.update(pass_chunk.difference(pass_errors))
final_src = sdb.Sources(self.sources.handle, is_default=self.sources.is_default)
final_src.from_set(working_set)
final_src.sort()
final_src.trim(limit)
if len(final_src) < limit:
logger.warning("Ran out of hosts to complete the working set")
self.result = final_src
def report(self):
# There is no actual report for this mode, just write out the database
logger.info("Collected %d working hosts for the updated test set" % len(self.result))
logger.info("Writing updated `%s` host database" % self.result.handle)
self.db.write(self.result)
def teardown(self):
# Free some memory
self.db = None
self.sources = None
self.app = None
self.profile = None
self.result = None

0
progress_bar.py Executable file → Normal file


@ -4,7 +4,7 @@
from setuptools import setup, find_packages
PACKAGE_VERSION = '3.1.0-alpha.2'
PACKAGE_VERSION = '3.1.0-alpha.3'
# Dependencies
with open('requirements.txt') as f:


@ -1,3 +1,4 @@
rank,hostname
1,google.com
2,facebook.com
3,youtube.com



@ -1,3 +1,4 @@
rank,hostname
1,google.com
2,facebook.com
3,youtube.com


File diff not shown because it is too large.

Binary data
sources/google_ct_list.csv.bz2 Normal file

Binary file not shown.


@ -1,3 +1,5 @@
#handle:smoke
hostname
1010.m2m.com
1800cpap.com
1800registry.com



@ -1,3 +1,5 @@
#handle:test
hostname
bip2.opi.org.pl
centernet.fhcrc.org
correo.pas.ucam.edu
@ -997,4 +999,4 @@ achillesparadise.reserve-online.net
achillesplaza.reserve-online.net
achtsamessen.wordpress.com
acidborg.wordpress.com
acidmartin.wordpress.com
acidmartin.wordpress.com



@ -1,3 +1,5 @@
#default:handle:top
rank,hostname
1,google.com
2,facebook.com
3,microsoft.com

This file cannot be displayed because it is too large.
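The `#default:handle:top` control line above follows the keyword convention that `parse_csv_header` (added in `sources_db.py` by this commit) implements. A standalone Python 3 reimplementation for illustration; `parse_header_line` is a hypothetical name:

```python
def parse_header_line(line, fallback_handle):
    """Parse a `#`-prefixed control line of colon-separated keywords.
    `handle` makes the last keyword the database handle; `default`
    marks the database as the default one. Lines without a leading
    `#` leave the fallback handle unchanged."""
    handle, is_default = fallback_handle, False
    line = line.strip()
    if line.startswith("#"):
        keywords = line.lstrip("#").split(":")
        if "handle" in keywords:
            handle = keywords[-1]
        if "default" in keywords:
            is_default = True
    return handle, is_default
```

So `#default:handle:top` yields the handle `top` and marks the file as the default database, overriding the file-name-derived handle.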

268
sources_db.py Normal file

@ -0,0 +1,268 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
import csv
import logging
import os
logger = logging.getLogger(__name__)
module_dir = os.path.abspath(os.path.split(__file__)[0])
module_data_dir = os.path.join(module_dir, "sources")
def list_sources(data_dirs):
"""
This function trawls through all the sources CSV files in the given data directories
and generates a dictionary of handle names and associated file names. Per default, the
base part of the file name (without `.csv`) is used as handle for that list.
Files in latter data directories override files in former ones.
If the first line of a CSV file begins with a `#`, it is interpreted as a
colon-separated list of keywords. If it contains the keyword `handle`, the last
keyword is used as its handle instead of the file name derivative.
If the line contains the keyword `default`, it is being used as the default list.
When multiple CSV files use the `default` keyword, the lexicographically last file
name is used as default.
:param data_dirs: List of paths to directories containing CSV files
:return: (dict mapping handles to file names, str handle of default list)
"""
global logger
sources_list = {}
default_source = None
for data_dir in data_dirs:
if not os.path.isdir(data_dir):
continue
for root, dirs, files in os.walk(data_dir):
for name in files:
if name.endswith(".csv"):
file_name = os.path.abspath(os.path.join(root, name))
logger.debug("Indexing sources database file `%s`" % file_name)
source_handle, is_default = parse_csv_header(file_name)
sources_list[source_handle] = os.path.abspath(os.path.join(root, name))
if is_default:
default_source = source_handle
return sources_list, default_source
def parse_csv_header(file_name):
"""
Read first line of file and try to interpret it as a series of colon-separated
keywords if the line starts with a `#`. Currently supported keywords:
- handle: The last keyword is interpreted as database vanity handle
- default: The database is used as default database.
If no handle is specified, the file name's base is used instead.
:param file_name: str with file name to check
:return: (string with handle, bool default state)
"""
source_handle = os.path.splitext(os.path.basename(file_name))[0]
is_default = False
with open(file_name) as f:
line = f.readline().strip()
if line.startswith("#"):
keywords = line.lstrip("#").split(":")
if "handle" in keywords:
source_handle = keywords[-1]
if "default" in keywords:
is_default = True
return source_handle, is_default
class SourcesDB(object):
"""
Class to represent the database store for host data. CSV files from the `sources`
subdirectory of the module directory are considered as database source files.
Additionally, CSV files inside the `sources` subdirectory of the working directory
(usually ~/.tlscanary) are parsed and thus can override files from the module
directory.
Each database file is referenced by a unique handle. The first line of the CSV can
be a special control line that modifies how the database file is handled. See
sources_db.parse_csv_header().
The CSV files are required to contain a regular CSV header line, the column
`hostname`, and optionally the column `rank`.
"""
def __init__(self, args=None):
global module_data_dir
self.__args = args
if args is not None:
self.__data_dirs = [module_data_dir, os.path.join(args.workdir, "sources")]
else:
self.__data_dirs = [module_data_dir]
self.__list, self.default = list_sources(self.__data_dirs)
if self.default is None:
self.default = self.__list.keys()[0]
def list(self):
"""
List handles of available source CSVs
:return: list with handles
"""
handles_list = self.__list.keys()
handles_list.sort()
return handles_list
def read(self, handle):
"""
Read the database file referenced by the given handle.
:param handle: str with handle
:return: Sources object containing the data
"""
global logger
if handle not in self.__list:
logger.error("Unknown sources database handle `%s`. Continuing with empty set" % handle)
return Sources(handle)
file_name = self.__list[handle]
source = Sources(handle, handle == self.default)
source.load(file_name)
source.trim(self.__args.limit)
return source
def write(self, source):
"""
Write a Sources object to a CSV database file into the `sources` subdirectory of
the working directory (usually ~/.tlscanary). The file is named <handle.csv>.
Metadata like handle and default state are stored in the first line of the file.
:param source: Sources object
:return: None
"""
sources_dir = os.path.join(self.__args.workdir, "sources")
if not os.path.isdir(sources_dir):
os.makedirs(sources_dir)
file_name = os.path.join(sources_dir, "%s.csv" % source.handle)
source.write(file_name)
class Sources(object):
def __init__(self, handle, is_default=False):
self.handle = handle
self.is_default = is_default
self.rows = []
def __len__(self):
return len(self.rows)
def __getitem__(self, item):
return self.rows[item]
def __iter__(self):
for row in self.rows:
yield row
def append(self, row):
"""
Add a row to the end of the current sources list
:param row: dict of `rank` and `hostname`
:return: None
"""
self.rows.append(row)
def sort(self):
"""
Sort rows according to rank
:return: None
"""
self.rows.sort(key=lambda row: int(row["rank"]))
def load(self, file_name):
"""
Load content of a sources database from a CSV file
:param file_name: str containing existing file name
:return: None
"""
global logger
self.handle, self.is_default = parse_csv_header(file_name)
logger.debug("Reading `%s` sources from `%s`" % (self.handle, file_name))
with open(file_name) as f:
csv_reader = csv.DictReader(filter(lambda r: not r.startswith("#"), f))
self.rows = [row for row in csv_reader]
def trim(self, limit):
"""
Trim length of sources list to given limit. Does not trim if
limit is None.
:param limit: int maximum length or None
:return: None
"""
if limit is not None:
if len(self) > limit:
self.rows = self.rows[:limit]
def write(self, location):
"""
Write out instance sources list to a CSV file. If location refers to
a directory, the file is written there and the file name is chosen as
<handle>.csv. Metadata like handle and default state are stored in the
first line of the file.
If location refers to a file name, it is used as the file name directly.
The target directory must exist.
:param location: directory or file name in an existing directory
:return: str with the written file name
"""
global logger
if os.path.isdir(location):
file_name = os.path.join(location, "%s.csv" % self.handle)
elif os.path.isdir(os.path.dirname(location)):
file_name = location
else:
raise Exception("Can't write to location `%s`" % location)
logger.debug("Writing `%s` sources to `%s`" % (self.handle, file_name))
with open(file_name, "w") as f:
header_keywords = []
if self.is_default:
header_keywords.append("default")
header_keywords += ["handle", self.handle]
f.write("#%s\n" % ":".join(header_keywords))
csv_writer = csv.DictWriter(f, self.rows[0].keys())
csv_writer.writeheader()
csv_writer.writerows(self.rows)
return file_name
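The header line written above encodes metadata as colon-separated keywords behind a `#`, e.g. `#default:handle:debug`. A standalone sketch of that layout (the `parse_header` helper here is illustrative; the module's real counterpart, `parse_csv_header(file_name)`, is not shown in this diff):

```python
import csv
import io

def parse_header(line):
    # Header looks like "#default:handle:debug" or "#handle:top";
    # a leading "default" keyword marks the default database.
    keywords = line.lstrip("#").strip().split(":")
    is_default = keywords[0] == "default"
    if is_default:
        keywords = keywords[1:]
    assert keywords[0] == "handle"
    return keywords[1], is_default

# Emulate the file layout produced by Sources.write():
buf = io.StringIO()
buf.write("#default:handle:debug\n")
writer = csv.DictWriter(buf, ["rank", "hostname"])
writer.writeheader()
writer.writerows([{"rank": "1", "hostname": "mozilla.org"}])

first_line = buf.getvalue().splitlines()[0]
handle, is_default = parse_header(first_line)
print(handle, is_default)  # debug True
```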
def from_set(self, src_set):
"""
Use set to fill this Sources object. The set is expected to contain (rank, hostname) pairs.
:param src_set: set with (rank, host) pairs
:return: None
"""
self.rows = [{"rank": str(rank), "hostname": hostname} for rank, hostname in src_set]
def as_set(self, start=0, end=None):
"""
Return rows of this sources list as a set. The set does not retain any of
the sources' metadata (DB handle, default). You can specify `start` and `end`
to select just a chunk of data from the rows.
Warning: There is no plausibility checking on `start` and `end` parameters.
:param start: optional int marking beginning of chunk
:param end: optional int marking end of chunk
:return: set of (int rank, str hostname) pairs
"""
if len(self.rows) == 0:
return set()
if end is None:
end = len(self.rows)
if "rank" in self.rows[0].keys():
return set([(int(row["rank"]), row["hostname"]) for row in self.rows[start:end]])
else:
return set([(0, row["hostname"]) for row in self.rows[start:end]])
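The set interface can be exercised on its own. This sketch mirrors the `from_set()`/`sort()`/`as_set()` round trip without the rest of the module; rows store ranks as strings (as in the CSV), while the set side uses `(int rank, str hostname)` pairs:

```python
# Standalone mirror of the Sources set round trip. Variable names are
# illustrative only.
src_set = {(3, "addons.mozilla.org"), (1, "mozilla.org"), (2, "mozilla.com")}

# from_set: set -> rows, ranks stored as strings
rows = [{"rank": str(rank), "hostname": host} for rank, host in src_set]

# sort: numeric ordering despite string storage
rows.sort(key=lambda row: int(row["rank"]))

# as_set: rows -> set, restoring integer ranks
round_tripped = set((int(r["rank"]), r["hostname"]) for r in rows)

assert rows[0]["hostname"] == "mozilla.org"
assert round_tripped == src_set
```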


@@ -45,3 +45,18 @@ def teardown_package():
if tmp_dir is not None:
shutil.rmtree(tmp_dir, ignore_errors=True)
tmp_dir = None
class ArgsMock(object):
"""
Mock used for testing functionality that
requires access to an args-style object.
"""
def __init__(self, **kwargs):
self.kwargs = kwargs
def __getattr__(self, attr):
try:
return self.kwargs[attr]
except KeyError:
return None
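`ArgsMock`'s `__getattr__` fallback means any attribute not supplied at construction reads as `None`, which matches what code expecting an argparse namespace tolerates. A quick sketch (the class body is copied from the test helper above; the attribute names are arbitrary):

```python
class ArgsMock(object):
    """Minimal stand-in for an argparse namespace (from the test helper)."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails,
        # so self.kwargs itself resolves normally.
        try:
            return self.kwargs[attr]
        except KeyError:
            return None

args = ArgsMock(workdir="/tmp/test", limit=10)
print(args.workdir)  # /tmp/test
print(args.source)   # None (unset attributes fall back to None)
```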

tests/sources_db_test.py Normal file

@@ -0,0 +1,88 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
from nose.tools import *
import os
import sources_db as sdb
import tests
def test_sources_db_instance():
"""SourcesDB can list database handles"""
test_tmp_dir = os.path.join(tests.tmp_dir, "sources_db_test")
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
handle_list = db.list()
assert_true(type(handle_list) is list, "handle listing is an actual list")
assert_true(len(handle_list) > 0, "handle listing is not empty")
assert_true(db.default in handle_list, "default handle appears in listing")
assert_true("list" not in handle_list, "`list` must not be an existing handle")
assert_true("debug" in handle_list, "`debug` handle is required for testing")
def test_sources_db_read():
"""SourcesDB can read databases"""
test_tmp_dir = os.path.join(tests.tmp_dir, "sources_db_test")
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
src = db.read("debug")
assert_true(type(src) is sdb.Sources, "reading yields a Sources object")
assert_equal(len(src), len(src.rows), "length seems to be correct")
assert_true("hostname" in src[0].keys(), "`hostname` is amongst keys")
assert_true("rank" in src[0].keys(), "`rank` is amongst keys")
rows = [row for row in src]
assert_equal(len(rows), len(src), "yields expected number of iterable rows")
def test_sources_db_write_and_override():
"""SourcesDB databases can be written and overridden"""
test_tmp_dir = os.path.join(tests.tmp_dir, "sources_db_test")
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
old = db.read("debug")
old_default = db.default
override = sdb.Sources("debug", True)
row_one = {"foo": "bar", "baz": "bang", "boom": "bang"}
row_two = {"foo": "bar2", "baz": "bang2", "boom": "bang2"}
override.append(row_one)
override.append(row_two)
db.write(override)
# New SourcesDB instance required to detect overrides
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
assert_true(os.path.exists(os.path.join(test_tmp_dir, "sources", "debug.csv")), "override file is written")
assert_equal(db.default, "debug", "overriding the default works")
assert_not_equal(old_default, db.default, "overridden default actually changes")
new = db.read("debug")
assert_equal(len(new), 2, "number of overridden rows is correct")
assert_true(new[0] == row_one and new[1] == row_two, "new rows are written as expected")
assert_not_equal(old[0], new[0], "overridden rows actually change")
def test_sources_set_interface():
"""Sources object can be created from and yield sets"""
# Sets are assumed to contain (rank, hostname) pairs
src_set = {(1, "mozilla.org"), (2, "mozilla.com"), (3, "addons.mozilla.org")}
src = sdb.Sources("foo")
src.from_set(src_set)
assert_equal(len(src), 3, "database from set has correct length")
assert_equal(src_set, src.as_set(), "yielded set is identical to the original")
assert_equal(len(src.as_set(1, 2)), 1, "yielded subset has expected length")
def test_sources_sorting():
"""Sources object can sort its rows by rank"""
src_set = {(1, "mozilla.org"), (2, "mozilla.com"), (3, "addons.mozilla.org")}
src = sdb.Sources("foo")
src.from_set(src_set)
# Definitely "unsort"
if int(src.rows[0]["rank"]) < int(src.rows[1]["rank"]):
src.rows[0], src.rows[1] = src.rows[1], src.rows[0]
assert_false(int(src.rows[0]["rank"]) < int(src.rows[1]["rank"]) < int(src.rows[2]["rank"]), "list is scrambled")
src.sort()
assert_true(int(src.rows[0]["rank"]) < int(src.rows[1]["rank"]) < int(src.rows[2]["rank"]), "sorting works")


@@ -13,6 +13,7 @@ import xpcshell_worker as xw
@mock.patch('sys.stdout') # to silence progress bar
def test_xpcshell_worker(mock_sys):
"""XPCShell worker runs and is responsive"""
# Skip test if there is no app for this platform
if tests.test_app is None:


@@ -1,82 +0,0 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
import csv
import os
__datasets = {
'debug': 'debug.csv',
'debug2': 'debug2.csv',
# 'google': 'google_ct_list.csv', # disabled until cleaned
'smoke': 'smoke_list.csv',
'test': 'test_url_list.csv',
'top': 'top_sites.csv'
}
def list_datasets():
dataset_list = __datasets.keys()
dataset_list.sort()
dataset_default = "top"
assert dataset_default in dataset_list
return dataset_list, dataset_default
def iterate(dataset, data_dir):
if dataset.endswith('.csv'):
csv_file_name = os.path.abspath(dataset)
else:
csv_file_name = __datasets[dataset]
with open(os.path.join(data_dir, csv_file_name)) as f:
csv_reader = csv.reader(f)
for row in csv_reader:
assert 0 <= len(row) <= 2
if len(row) == 2:
rank, url = row
yield int(rank), url
elif len(row) == 1:
rank = 0
url = row[0]
yield int(rank), url
else:
continue
class URLStore(object):
def __init__(self, data_dir, limit=0):
self.__data_dir = os.path.abspath(data_dir)
self.__loaded_datasets = []
self.__limit = limit
self.__urls = []
def clear(self):
"""Clear all active URLs from store."""
self.__urls = []
def __len__(self):
"""Returns number of active URLs in store."""
if self.__limit > 0:
return min(len(self.__urls), self.__limit)
else:
return len(self.__urls)
def __iter__(self):
"""Iterate all active URLs in store."""
for rank, url in self.__urls[:len(self)]:
yield rank, url
@staticmethod
def list():
"""List handles and files for all static URL databases."""
return list_datasets()
def load(self, datasets):
"""Load datasets array into active URL store."""
if type(datasets) == str:
datasets = [datasets]
for dataset in datasets:
for nr, url in iterate(dataset, self.__data_dir):
self.__urls.append((nr, url))
self.__loaded_datasets.append(dataset)


@@ -1,17 +0,0 @@
Windows support targets Windows 10 and PowerShell. Windows 7 and 8
are generally able to run TLS Canary, but terminal escape sequences
used for colored logging won't work properly.
- Run admin PowerShell
- Install chocolatey, https://chocolatey.org/install
- choco install 7zip.commandline git golang openssh python2
- choco install python3 # Optional, provides the virtualenv cmdlet
- pip install virtualenv # Not required if python3 installed
- Run user PowerShell
- git clone https://github.com/mozilla/tls-canary
- cd tls-canary
- virtualenv -p c:\python27\python.exe venv
- venv\Scripts\activate
- pip install -e .


@@ -99,7 +99,7 @@ def scan_urls(app, target_list, profile=None, get_certs=False, timeout=10):
if response.original_cmd["mode"] == "scan":
timeout_time = time.time() + timeout + 1
# Ignore other ACKs.
-            continue;
+            continue
# Else we know this is the result of a scan command.
result = ScanResult(response)
results[result.host] = result

xpcshell_worker.py Executable file → Normal file