Adding host database update mode

* Update mode `srcupdate` pulls "Umbrella" top url list
* Databases can be overridden by files in workdir
* Rewrote url_db.py to become sources_db.py
* Adapted argument parser
* Non-existent test set handles now yield empty sets
* Adapted existing code to new interface
* Added unit tests for sources_db
* Documentation update
* Version bump to 3.1.0-alpha.3
* Moving reportdir check to modes that require it
Christiane Ruetten 2017-06-09 22:58:04 +02:00
Parent 81bd5478be
Commit cd798a2457
27 changed files with 713 additions and 2678248 deletions


@ -1,13 +1,17 @@
# TLS Canary version 3
Automated testing of Firefox for TLS/SSL web compatibility
Results live here:
Regression scanning results live here:
http://tlscanary.mozilla.org
## This project
* Downloads a branch build and a release build of Firefox.
* Automatically runs thousands of secure sites on those builds.
* Diffs the results and presents potential regressions in an HTML page for further diagnosis.
* Does performance regression testing
* Extracts SSL state information
* Can maintain an updated list of TLS-enabled top sites
* Requires a highly reliable network link. **WiFi will not do.**
## Requirements
* Python 2.7
@ -18,20 +22,40 @@ http://tlscanary.mozilla.org
* OpenSSL-dev
* libffi-dev
The script ```linux_bootstrap.sh``` provides bootstrapping for an Ubuntu-based EC2 instance.
The script [linux_bootstrap.sh](linux_bootstrap.sh) provides bootstrapping for an Ubuntu-based EC2 instance.
### Windows support
## Linux and Mac usage
```
git clone https://github.com/mozilla/tls-canary
cd tls-canary
virtualenv .
source bin/activate
pip install -e .
tls_canary --help
tls_canary --reportdir=/tmp/test --debug debug
```
The file ```windows_bootstrap.txt``` contains information on Windows-specific installation steps.
Target environment is PowerShell. We're assuming [Chocolatey](https://chocolatey.org/) for dependency management.
## Windows support
Windows support targets **PowerShell 5.1** on **Windows 10**. Windows 7 and 8
are generally able to run TLS Canary, but expect minor Unicode
encoding issues in terminal logging output.
## Usage
* cd tls-canary
* virtualenv .
* source bin/activate
* pip install -e .
* tls_canary --help
* tls_canary --reportdir=/tmp/test --debug debug
### Run in an admin PowerShell
First, [install Chocolatey](https://chocolatey.org/install), then
```
choco install 7zip.commandline git golang openssh python2
choco install python3 # Optional, provides the virtualenv cmdlet
pip install virtualenv # Not required if python3 installed
```
### Run in a user PowerShell
```
git clone https://github.com/mozilla/tls-canary
cd tls-canary
virtualenv -p c:\python27\python.exe venv
venv\Scripts\activate
pip install -e .
```
### Command line arguments
Argument | Choices / **default** | Description
@ -51,7 +75,7 @@ Argument | Choices / **default** | Description
-t --test | release, **nightly**, beta, aurora, esr | Specify the main test candidate. Used by every run mode.
-w --workdir | **~/.tlscanary** | Directory where cached files and other state are stored
-x --scans | 3 | Number of scans to run against each host during performance mode. Currently limited to 20.
MODE | **performance**, regression, scan | Test mode to run, given as positional parameter
MODE | **performance**, regression, scan, srcupdate | Test mode to run, given as positional parameter
### Test modes
@ -62,6 +86,7 @@ Mode | Description
performance | Runs a performance analysis against the hosts in the test set. Use `--scans` to specify how often each host is tested.
regression | Runs a TLS regression test, comparing the 'test' candidate against the 'baseline' candidate. Only reports errors that are new to the test candidate; no error generated by the baseline can make it into the report.
scan | This mode only collects connection state information for every host in the test set.
srcupdate | Compile a fresh set of TLS-enabled 'top' sites from the *Umbrella Top 1M* list. Use `-l` to override the default target size of 500k hosts. Use `-x` to adjust the number of retry passes used to weed out spurious errors; `-x1` gives roughly a factor-two speed improvement at the cost of slightly less stable results. Use `-b` to change the Firefox version used for filtering. `-s` can create a new database, but cannot make it the default.
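The chunk-and-retry filtering strategy behind `srcupdate` (implemented in `modes/sourceupdate.py` in this commit) can be sketched in a few lines. This is an illustrative Python 3 sketch, not project code: `filter_hosts` and the `is_reachable` predicate are hypothetical names, and the real implementation drives Firefox scans rather than a predicate.

```python
def filter_hosts(hosts, limit, passes, is_reachable):
    """Collect up to `limit` hosts that survive `passes` retry rounds.

    Hosts are processed in chunks; errors from one pass are re-tested in
    the next pass, so only persistently failing hosts are dropped.
    """
    working = []
    chunk_size = max(limit // 20, 1000)
    for start in range(0, len(hosts), chunk_size):
        if len(working) >= limit:
            break
        chunk = set(hosts[start:start + chunk_size])
        errors = chunk
        for _ in range(passes):
            # Only re-test the hosts that failed the previous pass
            errors = {h for h in errors if not is_reachable(h)}
            if not errors:
                break
        working.extend(sorted(chunk - errors))
    return working[:limit]
```

Because only hosts that fail every pass are dropped, transient network errors do not evict a host from the working set.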
## Testing
* nosetests -sv
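The commit message mentions new unit tests for `sources_db`. As a minimal standalone sketch of the kind of property such tests can check, the row-handling of `Sources.sort()` and `Sources.trim()` is reimplemented here (illustrative code only; the helper name `sort_and_trim` is not part of the project):

```python
def sort_and_trim(rows, limit):
    """Mirror Sources.sort() followed by Sources.trim(): order rows by
    integer rank and cap the list at `limit` (None means no limit)."""
    rows = sorted(rows, key=lambda row: int(row["rank"]))
    if limit is not None:
        rows = rows[:limit]
    return rows
```

Note that ranks are stored as strings in the CSV rows, so sorting must convert them to integers to avoid lexicographic ordering (`"10" < "2"`).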

0
cert.py Executable file → Normal file

109
firefox_downloader.py Executable file → Normal file

@ -14,6 +14,60 @@ import cache
logger = logging.getLogger(__name__)
def get_to_file(url, filename):
global logger
try:
# TODO: Validate the server's SSL certificate
req = urllib2.urlopen(url)
file_size = int(req.info().getheader('Content-Length').strip())
# Caching logic is: don't re-download if file of same size is
# already in place. TODO: Switch to ETag if that's not good enough.
# This already prevents cache clutter with incomplete files.
if os.path.isfile(filename):
if os.stat(filename).st_size == file_size:
req.close()
logger.warning('Skipping download, using cached file `%s` instead' % filename)
return filename
else:
logger.warning('Purging incomplete or obsolete cache file `%s`' % filename)
os.remove(filename)
logger.debug('Downloading `%s` to %s' % (url, filename))
downloaded_size = 0
chunk_size = 32 * 1024
with open(filename, 'wb') as fp:
while True:
chunk = req.read(chunk_size)
if not chunk:
break
downloaded_size += len(chunk)
fp.write(chunk)
except urllib2.HTTPError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('HTTP error: %s, %s' % (err.code, url))
return None
except urllib2.URLError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('URL error: %s, %s' % (err.reason, url))
return None
except KeyboardInterrupt:
if os.path.isfile(filename):
os.remove(filename)
if sys.stdout.isatty():
print
logger.critical('Download interrupted by user')
return None
return filename
class FirefoxDownloader(object):
__base_url = 'https://download.mozilla.org/?product=firefox' \
@ -66,59 +120,6 @@ class FirefoxDownloader(object):
self.__workdir = workdir
self.__cache = cache.DiskCache(os.path.join(workdir, "cache"), cache_timeout, purge=True)
@staticmethod
def __get_to_file(url, filename):
try:
# TODO: Validate the server's SSL certificate
req = urllib2.urlopen(url)
file_size = int(req.info().getheader('Content-Length').strip())
# Caching logic is: don't re-download if file of same size is
# already in place. TODO: Switch to ETag if that's not good enough.
# This already prevents cache clutter with incomplete files.
if os.path.isfile(filename):
if os.stat(filename).st_size == file_size:
req.close()
logger.warning('Skipping download using cached file `%s`' % filename)
return filename
else:
logger.warning('Purging incomplete or obsolete cache file `%s`' % filename)
os.remove(filename)
logger.info('Downloading `%s` to %s' % (url, filename))
downloaded_size = 0
chunk_size = 32 * 1024
with open(filename, 'wb') as fp:
while True:
chunk = req.read(chunk_size)
if not chunk:
break
downloaded_size += len(chunk)
fp.write(chunk)
except urllib2.HTTPError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('HTTP error: %s, %s' % (err.code, url))
return None
except urllib2.URLError, err:
if os.path.isfile(filename):
os.remove(filename)
logger.error('URL error: %s, %s' % (err.reason, url))
return None
except KeyboardInterrupt:
if os.path.isfile(filename):
os.remove(filename)
if sys.stdout.isatty():
print
logger.critical('Download interrupted by user')
return None
return filename
def download(self, release, platform=None, use_cache=True):
if platform is None:
@ -138,4 +139,4 @@ class FirefoxDownloader(object):
self.__cache.delete(cache_id)
# __get_to_file will not re-download if same-size file is already there.
return self.__get_to_file(url, self.__cache[cache_id])
return get_to_file(url, self.__cache[cache_id])
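The caching rule described in the comments above — skip the download when a same-size file is already cached, purge it otherwise — can be sketched standalone. A Python 3 sketch; `cached_copy_is_complete` is a hypothetical helper for illustration, not part of the module:

```python
import os


def cached_copy_is_complete(filename, expected_size):
    """Return True when `filename` exists and matches the remote
    Content-Length, i.e. the earlier download completed and the
    cached copy can be reused without re-downloading."""
    return os.path.isfile(filename) and os.stat(filename).st_size == expected_size
```

Comparing sizes catches interrupted downloads cheaply; as the TODO in the diff notes, an ETag comparison would additionally catch same-size content changes.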

0
firefox_extractor.py Executable file → Normal file

43
main.py Executable file → Normal file

@ -1,5 +1,3 @@
#!/usr/bin/env python2
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
@ -8,6 +6,7 @@ import argparse
import logging
import coloredlogs
import os
import pkg_resources
import shutil
import sys
import tempfile
@ -16,7 +15,7 @@ import cleanup
import firefox_downloader as fd
import loader
import modes
import url_store as us
import sources_db as sdb
# Initialize coloredlogs
@ -30,13 +29,17 @@ def get_argparser():
Argument parsing
:return: Parsed arguments object
"""
pkg_version = pkg_resources.require("tls_canary")[0].version
home = os.path.expanduser('~')
testset_choice, testset_default = us.URLStore.list()
testset_choice.append('list')
# By nature of workdir being undetermined at this point, user-defined test sets in
# the override directory cannot override the default test set. The defaulting logic
# needs to move behind the argument parser for that to happen.
src = sdb.SourcesDB()
testset_default = src.default
release_choice, _, test_default, base_default = fd.FirefoxDownloader.list()
parser = argparse.ArgumentParser(prog="tls_canary")
parser.add_argument('--version', action='version', version='%(prog)s 3.1.0-alpha.2')
parser.add_argument('--version', action='version', version='%(prog)s ' + pkg_version)
parser.add_argument('-b', '--base',
help='Firefox base version to compare against (default: `%s`)' % base_default,
choices=release_choice,
@ -62,10 +65,10 @@ def get_argparser():
action='store',
default=4)
parser.add_argument('-l', '--limit',
help='Limit for number of URLs to test (default: unlimited)',
help='Limit for number of hosts to test (default: no limit)',
type=int,
action='store',
default=0)
default=None)
parser.add_argument('-m', '--timeout',
help='Timeout for worker requests (default: 10)',
type=float,
@ -88,9 +91,7 @@ def get_argparser():
action='store',
default=os.getcwd())
parser.add_argument('-s', '--source',
metavar='TESTSET',
help='Test set to run. Use `list` for info. (default: `%s`)' % testset_default,
choices=testset_choice,
action='store',
default=testset_default)
parser.add_argument('-t', '--test',
@ -231,20 +232,19 @@ def main():
# If 'list' is specified as test, list available test sets, builds, and platforms
if args.source == "list":
testset_list, testset_default = us.URLStore.list()
coloredlogs.install(level='ERROR')
db = sdb.SourcesDB(args)
build_list, platform_list, _, _ = fd.FirefoxDownloader.list()
urldb = us.URLStore(os.path.join(module_dir, "sources"))
print "Available builds: %s" % ' '.join(build_list)
print "Available platforms: %s" % ' '.join(platform_list)
print "Available test sets:"
for testset in testset_list:
urldb.clear()
urldb.load(testset)
if testset == testset_default:
default = "(default)"
for handle in db.list():
test_set = db.read(handle)
if handle == db.default:
default = " (default)"
else:
default = ""
print " - %s [%d] %s" % (testset, len(urldb), default)
print " - %s [%d hosts]%s" % (handle, len(test_set), default)
sys.exit(1)
# Create workdir (usually ~/.tlscanary, used for caching etc.)
@ -253,13 +253,6 @@ def main():
logger.debug('Creating working directory %s' % args.workdir)
os.makedirs(args.workdir)
# All code paths after this will generate a report, so check
# whether the report dir is a valid target. Specifically, prevent
# writing to the module directory.
if os.path.normcase(os.path.realpath(args.reportdir)) == os.path.normcase(os.path.realpath(module_dir)):
logger.critical("Refusing to write report to module directory. Please set --reportdir")
sys.exit(1)
# Load the specified test mode
try:
loader.run(args, module_dir, tmp_dir)


@ -6,8 +6,9 @@ import basemode
import performance
import regression
import scan
import sourceupdate
__all__ = ["performance", "regression", "scan"]
__all__ = ["performance", "regression", "scan", "sourceupdate"]
def __subclasses_of(cls):


@ -26,10 +26,8 @@ class BaseMode(object):
Base functionality for all tests
"""
def __init__(self, args, module_dir, tmp_dir):
global logger
self.__args = args
self.__mode = args.mode
self.args = args
self.mode = args.mode
self.module_dir = module_dir
self.tmp_dir = tmp_dir
@ -57,15 +55,14 @@ class BaseMode(object):
logger.error('Unsupported platform: %s' % sys.platform)
sys.exit(5)
logger.debug('Detected platform: %s' % platform)
# Download test candidate
fdl = fd.FirefoxDownloader(self.__args.workdir, cache_timeout=1*60*60)
logger.info('Downloading Firefox `%s` build for platform `%s`' % (build, platform))
fdl = fd.FirefoxDownloader(self.args.workdir, cache_timeout=1 * 60 * 60)
build_archive_file = fdl.download(build, platform)
if build_archive_file is None:
sys.exit(-1)
# Extract candidate archive
candidate_app = fe.extract(build_archive_file, self.__args.workdir, cache_timeout=1*60*60)
candidate_app = fe.extract(build_archive_file, self.args.workdir, cache_timeout=1 * 60 * 60)
logger.debug("Build candidate executable is `%s`" % candidate_app.exe)
return candidate_app
@ -93,9 +90,9 @@ class BaseMode(object):
dir_util.copy_tree(default_profile_dir, new_profile_dir)
logger.info("Updating OneCRL revocation data")
if self.__args.onecrl == "production" or self.__args.onecrl == "stage":
if self.args.onecrl == "production" or self.args.onecrl == "stage":
# overwrite revocations file in test profile with live OneCRL entries from requested environment
revocations_file = one_crl.get_list(self.__args.onecrl, self.__args.workdir)
revocations_file = one_crl.get_list(self.args.onecrl, self.args.workdir)
profile_file = os.path.join(new_profile_dir, "revocations.txt")
logger.debug("Writing OneCRL revocations data to `%s`" % profile_file)
shutil.copyfile(revocations_file, profile_file)
@ -114,7 +111,7 @@ class BaseMode(object):
global logger
timestamp = start_time.strftime("%Y-%m-%d-%H-%M-%S")
run_dir = os.path.join(self.__args.reportdir, "runs", timestamp)
run_dir = os.path.join(self.args.reportdir, "runs", timestamp)
logger.debug("Saving profile to `%s`" % run_dir)
dir_util.copy_tree(os.path.join(self.tmp_dir, profile_name), os.path.join(run_dir, profile_name))
@ -126,11 +123,11 @@ class BaseMode(object):
# Default to values from args
if num_workers is None:
num_workers = self.__args.parallel
num_workers = self.args.parallel
if n_per_worker is None:
n_per_worker = self.__args.requestsperworker
n_per_worker = self.args.requestsperworker
if timeout is None:
timeout = self.__args.timeout
timeout = self.args.timeout
try:
results = wp.run_scans(app, list(url_list), profile=profile, num_workers=num_workers,


@ -2,15 +2,17 @@
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
from math import ceil
import datetime
import logging
from math import ceil
import os
import pkg_resources as pkgr
import sys
from modes.basemode import BaseMode
import firefox_downloader as fd
import report
import url_store as us
import sources_db as sdb
logger = logging.getLogger(__name__)
@ -25,8 +27,6 @@ class RegressionMode(BaseMode):
super(RegressionMode, self).__init__(args, module_dir, tmp_dir)
# TODO: argument validation logic
# Define instance attributes for later use
self.test_app = None
self.base_app = None
@ -39,6 +39,17 @@ class RegressionMode(BaseMode):
self.error_set = None
def setup(self):
global logger
# Code paths after this will generate a report, so check
# whether the report dir is a valid target. Specifically, prevent
# writing to the module directory.
module_dir = pkgr.require("tls_canary")[0].location
if os.path.normcase(os.path.realpath(self.args.reportdir))\
.startswith(os.path.normcase(os.path.realpath(module_dir))):
logger.critical("Refusing to write report to module directory. Please set --reportdir")
sys.exit(1)
self.test_app = self.get_test_candidate(self.args.test)
self.base_app = self.get_test_candidate(self.args.base)
@ -50,13 +61,14 @@ class RegressionMode(BaseMode):
self.base_profile = self.make_profile("base_profile")
# Compile the set of URLs to test
sources_dir = os.path.join(self.module_dir, 'sources')
urldb = us.URLStore(sources_dir, limit=self.args.limit)
urldb.load(self.args.source)
self.url_set = set(urldb)
db = sdb.SourcesDB(self.args)
logger.info("Reading `%s` host database" % self.args.source)
self.url_set = db.read(self.args.source).as_set()
logger.info("%d URLs in test set" % len(self.url_set))
def run(self):
global logger
logger.info("Testing Firefox %s %s against Firefox %s %s" %
(self.test_metadata["appVersion"], self.test_metadata["branch"],
self.base_metadata["appVersion"], self.base_metadata["branch"]))


@ -5,12 +5,13 @@
import datetime
import logging
import os
import pkg_resources as pkgr
import sys
from modes.basemode import BaseMode
import firefox_downloader as fd
import report
import url_store as us
import sources_db as sdb
logger = logging.getLogger(__name__)
@ -33,7 +34,6 @@ class ScanMode(BaseMode):
logger.debug('Found base build parameter, ignoring')
# Define instance attributes for later use
self.sources_dir = None
self.url_set = None
self.info_uri_set = None
self.test_profile = None
@ -42,12 +42,21 @@ class ScanMode(BaseMode):
self.start_time = None
def setup(self):
global logger
# Code paths after this will generate a report, so check
# whether the report dir is a valid target. Specifically, prevent
# writing to the module directory.
module_dir = pkgr.require("tls_canary")[0].location
if os.path.normcase(os.path.realpath(self.args.reportdir))\
.startswith(os.path.normcase(os.path.realpath(module_dir))):
logger.critical("Refusing to write report to module directory. Please set --reportdir")
sys.exit(1)
# Compile the set of URLs to test
self.sources_dir = os.path.join(self.module_dir, 'sources')
logger.info(self.args)
urldb = us.URLStore(self.sources_dir, limit=self.args.limit)
urldb.load(self.args.source)
self.url_set = set(urldb)
db = sdb.SourcesDB(self.args)
logger.info("Reading `%s` host database" % self.args.source)
self.url_set = db.read(self.args.source).as_set()
logger.info("%d URLs in test set" % len(self.url_set))
# Create custom profile

171
modes/sourceupdate.py Normal file

@ -0,0 +1,171 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
import csv
import datetime
import logging
import os
import sys
import zipfile
from firefox_downloader import get_to_file
from modes.basemode import BaseMode
import sources_db as sdb
logger = logging.getLogger(__name__)
class SourceUpdateMode(BaseMode):
"""
Mode to update the `top` host database from publicly available top sites data
"""
name = "srcupdate"
# There are various top sites databases that might be considered for querying here.
# The other notable database is the notorious `Alexa Top 1M` which is available at
# "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip". It is based on usage data
# gathered from the equally notorious Alexa browser toolbar, while the `Umbrella top 1M`
# used below is DNS-based and its ranking is hence considered to be more representative.
# `Umbrella` and `Alexa` use precisely the same format and their links are thus
# interchangeable.
# For future reference, there is also Ulfr's database at
# "https://ulfr.io/f/top1m_has_tls_sorted.csv". It requires a different parser but
# has the advantage of clustering hosts by shared certificates.
top_sites_location = "http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip"
def __init__(self, args, module_dir, tmp_dir):
super(SourceUpdateMode, self).__init__(args, module_dir, tmp_dir)
self.start_time = None
self.db = None
self.sources = None
self.app = None
self.profile = None
self.result = None
def setup(self):
global logger
self.app = self.get_test_candidate(self.args.base)
self.profile = self.make_profile("base_profile")
tmp_zip_name = os.path.join(self.tmp_dir, "top.zip")
logger.info("Fetching unfiltered top sites data from the `Umbrella Top 1M` online database")
get_to_file(self.top_sites_location, tmp_zip_name)
try:
zipped = zipfile.ZipFile(tmp_zip_name)
if len(zipped.filelist) != 1 or not zipped.filelist[0].orig_filename.lower().endswith(".csv"):
logger.critical("Top sites zip file has unexpected content")
sys.exit(5)
tmp_csv_name = zipped.extract(zipped.filelist[0], self.tmp_dir)
except zipfile.BadZipfile:
logger.critical("Error opening top sites zip archive")
sys.exit(5)
self.db = sdb.SourcesDB(self.args)
is_default = self.args.source == self.db.default
self.sources = sdb.Sources(self.args.source, is_default)
with open(tmp_csv_name) as f:
cr = csv.DictReader(f, fieldnames=["rank", "hostname"])
self.sources.rows = [row for row in cr]
# A mild sanity check to see whether the downloaded data is valid.
if len(self.sources) < 900000:
logger.warning("Top sites is surprisingly small, just %d hosts" % len(self.sources))
if self.sources.rows[0] != {"hostname": "google.com", "rank": "1"}:
logger.warning("Top sites data looks weird. First line: `%s`" % self.sources.rows[0])
def run(self):
"""
Perform the filter run. The objective is to filter out permanent errors so
we don't waste time on them during regular test runs.
The concept is:
Run top sites in chunks through Firefox and re-test all error URLs from that
chunk a number of times to weed out spurious network errors. Stop the process
once the required number of working hosts is collected.
"""
global logger
self.start_time = datetime.datetime.now()
limit = 500000
if self.args.limit is not None:
limit = self.args.limit
logger.info("There are %d hosts in the unfiltered host set" % len(self.sources))
logger.info("Compiling set of %d working hosts for `%s` database update" % (limit, self.sources.handle))
working_set = set()
# Chop unfiltered sources data into chunks and iterate over each
chunk_size = max(int(limit / 20), 1000)
# TODO: Remove this log line once progress reporting is done properly
logger.warning("Progress is reported per chunk of %d hosts, not overall" % chunk_size)
for chunk_start in xrange(0, len(self.sources), chunk_size):
hosts_to_go = max(0, limit - len(working_set))
# Check if we're done
if hosts_to_go == 0:
break
logger.info("%d hosts to go to complete the working set" % hosts_to_go)
chunk_end = chunk_start + chunk_size
# Shrink chunk if it contains way more hosts than required to complete the working set
if chunk_size > hosts_to_go * 2:
# CAVE: This assumes that this is the last chunk we require. The downsized chunk
# is still 50% larger than required to complete the set to compensate for broken
# hosts. If the error rate in the chunk is greater than 50%, another chunk will be
# consumed, resulting in a gap of untested hosts between the end of this downsized
# chunk and the beginning of the next. Not too bad, but important to be aware of.
chunk_end = chunk_start + hosts_to_go * 2
# Check if we're running out of data for completing the set
if chunk_end > len(self.sources):
chunk_end = len(self.sources)
# Run chunk through multiple passes of Firefox, leaving only persistent errors in the
# error set.
logger.info("Processing chunk of %d hosts from the unfiltered set (#%d to #%d)"
% (chunk_end - chunk_start, chunk_start, chunk_end - 1))
pass_chunk = self.sources.as_set(start=chunk_start, end=chunk_end)
pass_errors = pass_chunk
for _ in xrange(self.args.scans):
pass_errors = self.run_test(self.app, pass_errors, profile=self.profile, get_info=False,
get_certs=False, progress=True, return_only_errors=True)
if len(pass_errors) == 0:
break
logger.info("Error rate in chunk was %.1f%%"
% (100.0 * float(len(pass_errors)) / float(chunk_end - chunk_start)))
# Add all non-errors to the working set
working_set.update(pass_chunk.difference(pass_errors))
final_src = sdb.Sources(self.sources.handle, is_default=self.sources.is_default)
final_src.from_set(working_set)
final_src.sort()
final_src.trim(limit)
if len(final_src) < limit:
logger.warning("Ran out of hosts to complete the working set")
self.result = final_src
def report(self):
# There is no actual report for this mode, just write out the database
logger.info("Collected %d working hosts for the updated test set" % len(self.result))
logger.info("Writing updated `%s` host database" % self.result.handle)
self.db.write(self.result)
def teardown(self):
# Free some memory
self.db = None
self.sources = None
self.app = None
self.profile = None
self.result = None

0
progress_bar.py Executable file → Normal file


@ -4,7 +4,7 @@
from setuptools import setup, find_packages
PACKAGE_VERSION = '3.1.0-alpha.2'
PACKAGE_VERSION = '3.1.0-alpha.3'
# Dependencies
with open('requirements.txt') as f:


@ -1,3 +1,4 @@
rank,hostname
1,google.com
2,facebook.com
3,youtube.com



@ -1,3 +1,4 @@
rank,hostname
1,google.com
2,facebook.com
3,youtube.com


File diff not shown because it is too large.

Binary data
sources/google_ct_list.csv.bz2 Normal file

Binary file not shown.


@ -1,3 +1,5 @@
#handle:smoke
hostname
1010.m2m.com
1800cpap.com
1800registry.com



@ -1,3 +1,5 @@
#handle:test
hostname
bip2.opi.org.pl
centernet.fhcrc.org
correo.pas.ucam.edu
@ -997,4 +999,4 @@ achillesparadise.reserve-online.net
achillesplaza.reserve-online.net
achtsamessen.wordpress.com
acidborg.wordpress.com
acidmartin.wordpress.com
acidmartin.wordpress.com



@ -1,3 +1,5 @@
#default:handle:top
rank,hostname
1,google.com
2,facebook.com
3,microsoft.com

This file cannot be displayed because it is too large.
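The `#default:handle:top` control line above follows the keyword convention that `parse_csv_header` (added in `sources_db.py` by this commit) implements. A standalone Python 3 reimplementation for illustration; `parse_header_line` is a hypothetical name:

```python
def parse_header_line(line, fallback_handle):
    """Parse a `#`-prefixed control line of colon-separated keywords.
    `handle` makes the last keyword the database handle; `default`
    marks the database as the default one. Lines without a leading
    `#` leave the fallback handle unchanged."""
    handle, is_default = fallback_handle, False
    line = line.strip()
    if line.startswith("#"):
        keywords = line.lstrip("#").split(":")
        if "handle" in keywords:
            handle = keywords[-1]
        if "default" in keywords:
            is_default = True
    return handle, is_default
```

So `#default:handle:top` yields the handle `top` and marks the file as the default database, overriding the file-name-derived handle.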

268
sources_db.py Normal file

@ -0,0 +1,268 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
import csv
import logging
import os
logger = logging.getLogger(__name__)
module_dir = os.path.abspath(os.path.split(__file__)[0])
module_data_dir = os.path.join(module_dir, "sources")
def list_sources(data_dirs):
"""
This function trawls through all the sources CSV files in the given data directories
and generates a dictionary of handle names and associated file names. Per default, the
base part of the file name (without `.csv`) is used as handle for that list.
Files in latter data directories override files in former ones.
If the first line of a CSV file begins with a `#`, it is interpreted as a
colon-separated list of keywords. If it contains the keyword `handle`, the last
keyword is used as its handle instead of the file name derivative.
If the line contains the keyword `default`, it is being used as the default list.
When multiple CSV files use the `default` keyword, the lexicographically last file
name is used as default.
:param data_dirs: List of paths to directories containing CSV files
:return: (dict mapping handles to file names, str handle of default list)
"""
global logger
sources_list = {}
default_source = None
for data_dir in data_dirs:
if not os.path.isdir(data_dir):
continue
for root, dirs, files in os.walk(data_dir):
for name in files:
if name.endswith(".csv"):
file_name = os.path.abspath(os.path.join(root, name))
logger.debug("Indexing sources database file `%s`" % file_name)
source_handle, is_default = parse_csv_header(file_name)
sources_list[source_handle] = os.path.abspath(os.path.join(root, name))
if is_default:
default_source = source_handle
return sources_list, default_source
def parse_csv_header(file_name):
"""
Read first line of file and try to interpret it as a series of colon-separated
keywords if the line starts with a `#`. Currently supported keywords:
- handle: The last keyword is interpreted as database vanity handle
- default: The database is used as default database.
If no handle is specified, the file name's base is used instead.
:param file_name: str with file name to check
:return: (string with handle, bool default state)
"""
source_handle = os.path.splitext(os.path.basename(file_name))[0]
is_default = False
with open(file_name) as f:
line = f.readline().strip()
if line.startswith("#"):
keywords = line.lstrip("#").split(":")
if "handle" in keywords:
source_handle = keywords[-1]
if "default" in keywords:
is_default = True
return source_handle, is_default
class SourcesDB(object):
"""
Class to represent the database store for host data. CSV files from the `sources`
subdirectory of the module directory are considered as database source files.
Additionally, CSV files inside the `sources` subdirectory of the working directory
(usually ~/.tlscanary) are parsed and thus can override files from the module
directory.
Each database file is referenced by a unique handle. The first line of the CSV can
be a special control line that modifies how the database file is handled. See
sources_db.parse_csv_header().
The CSV files are required to contain a regular CSV header line, the column
`hostname`, and optionally the column `rank`.
"""
def __init__(self, args=None):
global module_data_dir
self.__args = args
if args is not None:
self.__data_dirs = [module_data_dir, os.path.join(args.workdir, "sources")]
else:
self.__data_dirs = [module_data_dir]
self.__list, self.default = list_sources(self.__data_dirs)
if self.default is None:
self.default = self.__list.keys()[0]
def list(self):
"""
List handles of available source CSVs
:return: list with handles
"""
handles_list = self.__list.keys()
handles_list.sort()
return handles_list
def read(self, handle):
"""
Read the database file referenced by the given handle.
:param handle: str with handle
:return: Sources object containing the data
"""
global logger
if handle not in self.__list:
logger.error("Unknown sources database handle `%s`. Continuing with empty set" % handle)
return Sources(handle)
file_name = self.__list[handle]
source = Sources(handle, handle == self.default)
source.load(file_name)
source.trim(self.__args.limit)
return source
def write(self, source):
"""
Write a Sources object to a CSV database file into the `sources` subdirectory of
the working directory (usually ~/.tlscanary). The file is named <handle.csv>.
Metadata like handle and default state are stored in the first line of the file.
:param source: Sources object
:return: None
"""
sources_dir = os.path.join(self.__args.workdir, "sources")
if not os.path.isdir(sources_dir):
os.makedirs(sources_dir)
file_name = os.path.join(sources_dir, "%s.csv" % source.handle)
source.write(file_name)
class Sources(object):
def __init__(self, handle, is_default=False):
self.handle = handle
self.is_default = is_default
self.rows = []
def __len__(self):
return len(self.rows)
def __getitem__(self, item):
return self.rows[item]
def __iter__(self):
for row in self.rows:
yield row
def append(self, row):
"""
Add a row to the end of the current sources list
:param row: dict of `rank` and `hostname`
:return: None
"""
self.rows.append(row)
def sort(self):
"""
Sort rows according to rank
:return: None
"""
self.rows.sort(key=lambda row: int(row["rank"]))
def load(self, file_name):
"""
Load content of a sources database from a CSV file
:param file_name: str containing existing file name
:return: None
"""
global logger
self.handle, self.is_default = parse_csv_header(file_name)
logger.debug("Reading `%s` sources from `%s`" % (self.handle, file_name))
with open(file_name) as f:
csv_reader = csv.DictReader(filter(lambda r: not r.startswith("#"), f))
self.rows = [row for row in csv_reader]
def trim(self, limit):
"""
Trim length of sources list to given limit. Does not trim if
limit is None.
:param limit: int maximum length or None
:return: None
"""
if limit is not None:
if len(self) > limit:
self.rows = self.rows[:limit]
def write(self, location):
"""
Write out instance sources list to a CSV file. If location refers to
a directory, the file is written there and the file name is chosen as
<handle>.csv. Metadata like handle and default state are stored in the
first line of the file.
If location refers to a file name, it is used as the file name directly.
The target directory must exist.
:param location: directory or file name in an existing directory
:return: str with the written file name
"""
global logger
if os.path.isdir(location):
file_name = os.path.join(location, "%s.csv" % self.handle)
elif os.path.isdir(os.path.dirname(location)):
file_name = location
else:
raise Exception("Can't write to location `%s`" % location)
logger.debug("Writing `%s` sources to `%s`" % (self.handle, file_name))
with open(file_name, "w") as f:
header_keywords = []
if self.is_default:
header_keywords.append("default")
header_keywords += ["handle", self.handle]
f.write("#%s\n" % ":".join(header_keywords))
csv_writer = csv.DictWriter(f, self.rows[0].keys())
csv_writer.writeheader()
csv_writer.writerows(self.rows)
return file_name
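The header line written above encodes metadata as colon-separated keywords behind a `#`, e.g. `#default:handle:debug`. A standalone sketch of that layout (the `parse_header` helper here is illustrative; the module's real counterpart, `parse_csv_header(file_name)`, is not shown in this diff):

```python
import csv
import io

def parse_header(line):
    # Header looks like "#default:handle:debug" or "#handle:top";
    # a leading "default" keyword marks the default database.
    keywords = line.lstrip("#").strip().split(":")
    is_default = keywords[0] == "default"
    if is_default:
        keywords = keywords[1:]
    assert keywords[0] == "handle"
    return keywords[1], is_default

# Emulate the file layout produced by Sources.write():
buf = io.StringIO()
buf.write("#default:handle:debug\n")
writer = csv.DictWriter(buf, ["rank", "hostname"])
writer.writeheader()
writer.writerows([{"rank": "1", "hostname": "mozilla.org"}])

first_line = buf.getvalue().splitlines()[0]
handle, is_default = parse_header(first_line)
print(handle, is_default)  # debug True
```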
def from_set(self, src_set):
"""
Use set to fill this Sources object. The set is expected to contain (rank, hostname) pairs.
:param src_set: set with (rank, host) pairs
:return: None
"""
self.rows = [{"rank": str(rank), "hostname": hostname} for rank, hostname in src_set]
def as_set(self, start=0, end=None):
"""
Return rows of this sources list as a set. The set does not retain any of
the sources' metadata (DB handle, default). You can specify `start` and `end`
to select just a chunk of data from the rows.
Warning: There is no plausibility checking on `start` and `end` parameters.
:param start: optional int marking beginning of chunk
:param end: optional int marking end of chunk
:return: set of (int rank, str hostname) pairs
"""
if len(self.rows) == 0:
return set()
if end is None:
end = len(self.rows)
if "rank" in self.rows[0].keys():
return set([(int(row["rank"]), row["hostname"]) for row in self.rows[start:end]])
else:
return set([(0, row["hostname"]) for row in self.rows[start:end]])
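The set interface can be exercised on its own. This sketch mirrors the `from_set()`/`sort()`/`as_set()` round trip without the rest of the module; rows store ranks as strings (as in the CSV), while the set side uses `(int rank, str hostname)` pairs:

```python
# Standalone mirror of the Sources set round trip. Variable names are
# illustrative only.
src_set = {(3, "addons.mozilla.org"), (1, "mozilla.org"), (2, "mozilla.com")}

# from_set: set -> rows, ranks stored as strings
rows = [{"rank": str(rank), "hostname": host} for rank, host in src_set]

# sort: numeric ordering despite string storage
rows.sort(key=lambda row: int(row["rank"]))

# as_set: rows -> set, restoring integer ranks
round_tripped = set((int(r["rank"]), r["hostname"]) for r in rows)

assert rows[0]["hostname"] == "mozilla.org"
assert round_tripped == src_set
```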


@@ -45,3 +45,18 @@ def teardown_package():
if tmp_dir is not None:
shutil.rmtree(tmp_dir, ignore_errors=True)
tmp_dir = None
class ArgsMock(object):
"""
Mock used for testing functionality that
requires access to an args-style object.
"""
def __init__(self, **kwargs):
self.kwargs = kwargs
def __getattr__(self, attr):
try:
return self.kwargs[attr]
except KeyError:
return None
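`ArgsMock`'s `__getattr__` fallback means any attribute not supplied at construction reads as `None`, which matches what code expecting an argparse namespace tolerates. A quick sketch (the class body is copied from the test helper above; the attribute names are arbitrary):

```python
class ArgsMock(object):
    """Minimal stand-in for an argparse namespace (from the test helper)."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails,
        # so self.kwargs itself resolves normally.
        try:
            return self.kwargs[attr]
        except KeyError:
            return None

args = ArgsMock(workdir="/tmp/test", limit=10)
print(args.workdir)  # /tmp/test
print(args.source)   # None (unset attributes fall back to None)
```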

tests/sources_db_test.py Normal file

@@ -0,0 +1,88 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
from nose.tools import *
import os
import sources_db as sdb
import tests
def test_sources_db_instance():
"""SourcesDB can list database handles"""
test_tmp_dir = os.path.join(tests.tmp_dir, "sources_db_test")
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
handle_list = db.list()
assert_true(type(handle_list) is list, "handle listing is an actual list")
assert_true(len(handle_list) > 0, "handle listing is not empty")
assert_true(db.default in handle_list, "default handle appears in listing")
assert_true("list" not in handle_list, "`list` must not be an existing handle")
assert_true("debug" in handle_list, "`debug` handle is required for testing")
def test_sources_db_read():
"""SourcesDB can read databases"""
test_tmp_dir = os.path.join(tests.tmp_dir, "sources_db_test")
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
src = db.read("debug")
assert_true(type(src) is sdb.Sources, "reading yields a Sources object")
assert_equal(len(src), len(src.rows), "length seems to be correct")
assert_true("hostname" in src[0].keys(), "`hostname` is amongst keys")
assert_true("rank" in src[0].keys(), "`rank` is amongst keys")
rows = [row for row in src]
assert_equal(len(rows), len(src), "yields expected number of iterable rows")
def test_sources_db_write_and_override():
"""SourcesDB databases can be written and overridden"""
test_tmp_dir = os.path.join(tests.tmp_dir, "sources_db_test")
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
old = db.read("debug")
old_default = db.default
override = sdb.Sources("debug", True)
row_one = {"foo": "bar", "baz": "bang", "boom": "bang"}
row_two = {"foo": "bar2", "baz": "bang2", "boom": "bang2"}
override.append(row_one)
override.append(row_two)
db.write(override)
# New SourcesDB instance required to detect overrides
db = sdb.SourcesDB(tests.ArgsMock(workdir=test_tmp_dir))
assert_true(os.path.exists(os.path.join(test_tmp_dir, "sources", "debug.csv")), "override file is written")
assert_equal(db.default, "debug", "overriding the default works")
assert_not_equal(old_default, db.default, "overridden default actually changes")
new = db.read("debug")
assert_equal(len(new), 2, "number of overridden rows is correct")
assert_true(new[0] == row_one and new[1] == row_two, "new rows are written as expected")
assert_not_equal(old[0], new[0], "overridden rows actually change")
def test_sources_set_interface():
"""Sources object can be created from and yield sets"""
# Sets are assumed to contain (rank, hostname) pairs
src_set = {(1, "mozilla.org"), (2, "mozilla.com"), (3, "addons.mozilla.org")}
src = sdb.Sources("foo")
src.from_set(src_set)
assert_equal(len(src), 3, "database from set has correct length")
assert_equal(src_set, src.as_set(), "yielded set is identical to the original")
assert_equal(len(src.as_set(1, 2)), 1, "yielded subset has expected length")
def test_sources_sorting():
"""Sources object can sort its rows by rank"""
src_set = {(1, "mozilla.org"), (2, "mozilla.com"), (3, "addons.mozilla.org")}
src = sdb.Sources("foo")
src.from_set(src_set)
# Definitely "unsort"
if int(src.rows[0]["rank"]) < int(src.rows[1]["rank"]):
src.rows[0], src.rows[1] = src.rows[1], src.rows[0]
assert_false(int(src.rows[0]["rank"]) < int(src.rows[1]["rank"]) < int(src.rows[2]["rank"]), "list is scrambled")
src.sort()
assert_true(int(src.rows[0]["rank"]) < int(src.rows[1]["rank"]) < int(src.rows[2]["rank"]), "sorting works")


@@ -13,6 +13,7 @@ import xpcshell_worker as xw
@mock.patch('sys.stdout') # to silence progress bar
def test_xpcshell_worker(mock_sys):
"""XPCShell worker runs and is responsive"""
# Skip test if there is no app for this platform
if tests.test_app is None:


@@ -1,82 +0,0 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this file,
# You can obtain one at http://mozilla.org/MPL/2.0/.
import csv
import os
__datasets = {
'debug': 'debug.csv',
'debug2': 'debug2.csv',
# 'google': 'google_ct_list.csv', # disabled until cleaned
'smoke': 'smoke_list.csv',
'test': 'test_url_list.csv',
'top': 'top_sites.csv'
}
def list_datasets():
dataset_list = __datasets.keys()
dataset_list.sort()
dataset_default = "top"
assert dataset_default in dataset_list
return dataset_list, dataset_default
def iterate(dataset, data_dir):
if dataset.endswith('.csv'):
csv_file_name = os.path.abspath(dataset)
else:
csv_file_name = __datasets[dataset]
with open(os.path.join(data_dir, csv_file_name)) as f:
csv_reader = csv.reader(f)
for row in csv_reader:
assert 0 <= len(row) <= 2
if len(row) == 2:
rank, url = row
yield int(rank), url
elif len(row) == 1:
rank = 0
url = row[0]
yield int(rank), url
else:
continue
class URLStore(object):
def __init__(self, data_dir, limit=0):
self.__data_dir = os.path.abspath(data_dir)
self.__loaded_datasets = []
self.__limit = limit
self.__urls = []
def clear(self):
"""Clear all active URLs from store."""
self.__urls = []
def __len__(self):
"""Returns number of active URLs in store."""
if self.__limit > 0:
return min(len(self.__urls), self.__limit)
else:
return len(self.__urls)
def __iter__(self):
"""Iterate all active URLs in store."""
for rank, url in self.__urls[:len(self)]:
yield rank, url
@staticmethod
def list():
"""List handles and files for all static URL databases."""
return list_datasets()
def load(self, datasets):
"""Load datasets array into active URL store."""
if type(datasets) == str:
datasets = [datasets]
for dataset in datasets:
for nr, url in iterate(dataset, self.__data_dir):
self.__urls.append((nr, url))
self.__loaded_datasets.append(dataset)


@@ -1,17 +0,0 @@
Windows support targets Windows 10 and PowerShell. Windows 7 and 8
are generally able to run TLS Canary, but terminal escape sequences
used for colored logging won't work properly.
- Run admin PowerShell
- Install chocolatey, https://chocolatey.org/install
- choco install 7zip.commandline git golang openssh python2
- choco install python3 # Optional, provides the virtualenv cmdlet
- pip install virtualenv # Not required if python3 installed
- Run user PowerShell
- git clone https://github.com/mozilla/tls-canary
- cd tls-canary
- virtualenv -p c:\python27\python.exe venv
- venv\Scripts\activate
- pip install -e .


@@ -99,7 +99,7 @@ def scan_urls(app, target_list, profile=None, get_certs=False, timeout=10):
if response.original_cmd["mode"] == "scan":
timeout_time = time.time() + timeout + 1
# Ignore other ACKs.
-            continue;
+            continue
# Else we know this is the result of a scan command.
result = ScanResult(response)
results[result.host] = result

xpcshell_worker.py Executable file → Normal file