Server for the Mozilla Telemetry project

Mark Reid a685e20534 Merge pull request #162 from mozilla/update_readme Update README with deprecation notice.		2018-10-01 10:03:52 -03:00

Telemetry Server

This repository is deprecated. Details on the current server for Firefox Telemetry can be found here and here.

Server components to receive, validate, convert, store, and process Telemetry data from the Mozilla Firefox browser.

Talk to us on irc.mozilla.org in the #telemetry channel, or visit the Project Wiki for more information.

See the TODO list for some outstanding tasks.

Storage Format

See StorageFormat for details.

On-disk Storage Structure

See StorageLayout for details.

Data Converter

Use RevisionCache to load the correct Histograms.json for a given payload:
    1. Use revision if possible
    2. Fall back to appUpdateChannel and appBuildID or appVersion as needed
    3. Use the Mercurial history to export each version of Histograms.json with the date range it was in effect for each repo (mozilla-central, -aurora, -beta, -release)
    4. Keep a local cache of Histograms.json versions to avoid re-fetching
Filter out bad submission data:
    1. Invalid histogram names
    2. Histogram configs that don't match the expected parameters (histogram type, number of buckets, etc.)
    3. Keep metrics for bad data
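
The filtering step might look roughly like the sketch below (Python 3). It is not the repository's Converter code: the function name `filter_histograms`, the dictionary shapes, and the metric names are all assumptions chosen for illustration.

```python
from collections import Counter


def filter_histograms(histograms, spec, metrics=None):
    """Keep histograms that match the spec and count the rejected ones.

    histograms: name -> {"kind": str, "values": list}   (hypothetical shape)
    spec:       name -> {"kind": str, "n_buckets": int} (hypothetical shape)
    """
    metrics = Counter() if metrics is None else metrics
    valid = {}
    for name, hist in histograms.items():
        expected = spec.get(name)
        if expected is None:
            metrics["invalid_name"] += 1        # name not in Histograms.json
        elif hist.get("kind") != expected["kind"]:
            metrics["bad_kind"] += 1            # histogram type mismatch
        elif len(hist.get("values", [])) != expected["n_buckets"]:
            metrics["bad_bucket_count"] += 1    # wrong number of buckets
        else:
            valid[name] = hist
    return valid, metrics


# Example data (made up for this sketch).
spec = {"GC_MS": {"kind": "exponential", "n_buckets": 3}}
submitted = {
    "GC_MS": {"kind": "exponential", "values": [1, 0, 2]},
    "BOGUS": {"kind": "linear", "values": [1]},
}
valid, metrics = filter_histograms(submitted, spec)
```

Passing the same `Counter` into successive calls accumulates the bad-data metrics across submissions.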

MapReduce

We have implemented a lightweight MapReduce framework that uses the operating system's support for parallelism. It relies on simple Python functions for the Map, Combine, and Reduce phases.

For data stored on multiple machines, each machine will run a combine phase, with the final reduce combining output for the entire cluster.
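
To make the three phases concrete, here is a minimal word-count sketch in Python 3. It does not use the interface of `mapreduce/job.py`. The function names, signatures, and the use of `multiprocessing.Pool` are assumptions made for this example, with each partition standing in for the data held by one machine.

```python
from collections import defaultdict
from multiprocessing import Pool


def mapper(record):
    """Map: emit (key, value) pairs for one record."""
    return [(word, 1) for word in record.split()]


def combiner(key, values):
    """Combine: pre-aggregate the values for a key within one partition."""
    return sum(values)


def reducer(key, values):
    """Reduce: merge the combined values from all partitions."""
    return sum(values)


def _group(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_partition(records):
    """Map, then combine, a single partition."""
    pairs = [pair for record in records for pair in mapper(record)]
    return [(k, combiner(k, vs)) for k, vs in _group(pairs).items()]


def run_job(partitions, workers=1):
    """Run map+combine per partition, then one final reduce."""
    if workers > 1:
        # One OS process per worker; partitions are spread across them.
        with Pool(workers) as pool:
            combined = pool.map(run_partition, partitions)
    else:
        combined = [run_partition(p) for p in partitions]
    merged = _group(pair for part in combined for pair in part)
    return {k: reducer(k, vs) for k, vs in merged.items()}


result = run_job([["a b a"], ["b c"]])
```

Calling `run_job(partitions, workers=4)` would spread the partitions over four processes. On platforms that spawn rather than fork, that call belongs under an `if __name__ == "__main__":` guard.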

MongoDB Importer

Telemetry data can optionally be imported into MongoDB. The benefit is a shorter run time for multiple map-reduce jobs on the same dataset, because MongoDB keeps as much data as possible in memory.

Start mongodb, e.g. mongod --nojournal
Fetch a dataset from S3, e.g. aws s3 cp s3://... /mnt/yourdataset --recursive
Import the dataset, e.g. python -m mongodb.importer /mnt/yourdataset
Run a map-reduce job, e.g. mongo localhost/telemetry mongodb/examples/osdistribution.js

Plumbing

Once we have the converter and MapReduce framework available, we can easily consume from the existing Telemetry data source. This will mark the first point at which the new dashboards can be fed with live data.

Integration with the existing pipeline is discussed in more detail on the Bagheera Integration page.

Data Acquisition

When everything is ready and productionized, we will route the client (Firefox) submissions directly into the new pipeline.

Code Overview

These are the important parts of the Telemetry Server architecture.

`http/server.js`

Contains the Node.js HTTP server for receiving payloads. The server's job is simply to write incoming submissions to disk as quickly as possible.

It accepts single submissions using the same type of URLs supported by Bagheera, and expects (but doesn't require) the partition information to be submitted as part of the URL.

To set up a test server locally:

1. Install Node.js (left as an exercise to the reader)
2. Edit http/server_config.json, replacing log_path and stats_log_file with paths suitable for your machine
3. Run the server using cd http; node ./server.js ./server_config.json
4. Send some test data to the server. Using curl: curl -X POST http://127.0.0.1:8080/submit/telemetry/foo/bar/baz -d '{"test": 1}'
5. Stop the server, and check that there is a telemetry.log.<something>.finished file in the directory you specified in step 2 above.
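
The test submission from the steps above can also be built from Python 3 instead of curl. The helper below is a sketch: `build_submission` is a hypothetical name, and the host, port, and path segments are simply copied from the curl example. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_submission(base_url, segments, payload):
    """Build (but do not send) a POST request like the curl example."""
    path = "/".join(["submit", "telemetry"] + list(segments))
    body = json.dumps(payload).encode("utf-8")
    url = "%s/%s" % (base_url.rstrip("/"), path)
    return urllib.request.Request(url, data=body, method="POST")


req = build_submission("http://127.0.0.1:8080", ["foo", "bar", "baz"], {"test": 1})

# With the test server running, the request can be sent with:
#     urllib.request.urlopen(req).read()
```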

You can examine the resulting file in python (from the root of the repo):

import telemetry.util.files as fu
for r in fu.unpack('/path/to/telemetry.log.<something>.finished'):
    print "URL Path:", r.path
    print "JSON Payload:", r.data
    print "Submission Timestamp:", r.timestamp
    print "Submission IP:", r.ip
    print "Error (if any):", r.error

`telemetry/convert.py`

Contains the Converter class, which is used to convert a JSON payload from the raw form submitted by Firefox to the more compact storage format for on-disk storage and processing.

You can run the main method in this file to process a given data file (the expected format is one record per line, each line containing an id followed by a tab character, followed by a JSON string).

You can also use the Converter class to convert data in a more flexible way.
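
The one-record-per-line input format described above can be read with a few lines of Python 3. This is a sketch for inspecting such files, not part of `convert.py`. The `read_records` name is hypothetical.

```python
import json


def read_records(lines):
    """Yield (record_id, payload) from lines of the form '<id>\\t<json>'."""
    for line_number, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        record_id, sep, raw = line.partition("\t")
        if not sep:
            raise ValueError("line %d: missing tab separator" % line_number)
        yield record_id, json.loads(raw)


records = list(read_records(['abc\t{"ver": 1}\n', "\n"]))
```

Because `read_records` accepts any iterable of lines, an open file object can be passed in directly.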

`telemetry/export.py`

Contains code to export data to Amazon S3.

`telemetry/persist.py`

Contains the StorageLayout class, which is used to save payloads to disk using the directory structure as documented in the storage layout section above.

`telemetry/revision_cache.py`

Contains the RevisionCache class, which provides a mechanism for fetching the Histograms.json spec file for a given revision URL. Histogram data is cached locally on disk and in-memory as revisions are requested.
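
The disk-plus-memory caching described here can be sketched in Python 3 as follows. This is not the RevisionCache implementation. The `TwoLevelCache` class, its file-naming scheme, and the injected `fetch` callable are assumptions for illustration, and the demo uses a stand-in fetcher rather than contacting hg.mozilla.org.

```python
import hashlib
import json
import os
import tempfile


class TwoLevelCache:
    """Look up a value in memory, then on disk, then via fetch(url)."""

    def __init__(self, cache_dir, fetch):
        self._dir = cache_dir
        self._fetch = fetch      # callable: url -> JSON-serialisable object
        self._memory = {}

    def _path(self, url):
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return os.path.join(self._dir, digest + ".json")

    def get(self, url):
        if url in self._memory:
            return self._memory[url]
        path = self._path(url)
        if os.path.exists(path):
            with open(path) as handle:
                value = json.load(handle)
        else:
            value = self._fetch(url)
            os.makedirs(self._dir, exist_ok=True)
            with open(path, "w") as handle:
                json.dump(value, handle)
        self._memory[url] = value
        return value


# Demo with a fake fetcher that records how often it is called.
fetch_calls = []

def fake_fetch(url):
    fetch_calls.append(url)
    return {"GC_MS": {"kind": "exponential"}}

cache_dir = tempfile.mkdtemp()
url = "https://hg.mozilla.org/releases/mozilla-release/rev/a55c55edf302"
first = TwoLevelCache(cache_dir, fake_fetch).get(url)
second = TwoLevelCache(cache_dir, fake_fetch).get(url)  # served from disk
```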

`telemetry/telemetry_schema.py`

Contains the TelemetrySchema class, which encapsulates logic used by the StorageLayout and MapReduce code.

`process_incoming/process_incoming_mp.py`

Contains the multi-process version of the data-transformation code. This is used to download incoming data (as received by the HTTP server), validate and convert it, then publish the results back to S3.

`process_incoming/worker`

Contains the C++ data validation and conversion routines.

Prerequisites

Clang 3.1 or GCC 4.7.0 or Visual Studio 10
CMake (2.8.7+) - http://cmake.org/cmake/resources/software.html
Boost (1.54.0) - http://www.boost.org/users/download/
zlib
OpenSSL
Protobuf

Optional (used for documentation)

Graphviz (2.28.0) - http://graphviz.org/Download.php
Doxygen (1.8+) - http://www.stack.nl/~dimitri/doxygen/download.html#latestsrc

convert - Build instructions (from the telemetry-server root)

mkdir release
cd release
cmake -DCMAKE_BUILD_TYPE=release ..
make

Configuring the converter

heka_server (string) - Hostname:port of the heka log/stats service.
telemetry_schema (string) - JSON file containing the dimension mapping.
histogram_server (string) - Hostname:port of the histogram.json web service.
storage_path (string) - Converter output directory
upload_path (string) - Staging directory for S3 uploads.
max_uncompressed (int) - Maximum uncompressed size of a telemetry record.
memory_constraint (int) -
compression_preset (int) -

    {
        "heka_server": "localhost:5565",
        "telemetry_schema": "telemetry_schema.json",
        "histogram_server": "localhost:9898",
        "storage_path": "storage",
        "upload_path": "upload",
        "max_uncompressed": 1048576,
        "memory_constraint": 1000,
        "compression_preset": 0
    }
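
Before starting the converter, a config file like the one above can be sanity-checked with a short Python 3 script. The key names and types come from the option list. The `check_config` helper, and the idea that every key must be present, are assumptions of this sketch rather than documented converter behavior.

```python
import json

# Keys and types as listed in the option descriptions.
EXPECTED = {
    "heka_server": str,
    "telemetry_schema": str,
    "histogram_server": str,
    "storage_path": str,
    "upload_path": str,
    "max_uncompressed": int,
    "memory_constraint": int,
    "compression_preset": int,
}


def check_config(config):
    """Return a list of problems found in a converter config dict."""
    problems = []
    for key, expected_type in EXPECTED.items():
        if key not in config:
            problems.append("missing key: %s" % key)
        elif type(config[key]) is not expected_type:
            problems.append("%s should be %s" % (key, expected_type.__name__))
    for key in ("heka_server", "histogram_server"):
        value = config.get(key)
        # The part after the last ':' should be a numeric port.
        if isinstance(value, str) and not value.rpartition(":")[2].isdigit():
            problems.append("%s should look like hostname:port" % key)
    return problems


sample = json.loads("""
{
    "heka_server": "localhost:5565",
    "telemetry_schema": "telemetry_schema.json",
    "histogram_server": "localhost:9898",
    "storage_path": "storage",
    "upload_path": "upload",
    "max_uncompressed": 1048576,
    "memory_constraint": 1000,
    "compression_preset": 0
}
""")
problems = check_config(sample)
```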

Setting up/running the histogram server

pushd http
../bin/get_histogram_tools.sh
popd
python -m http.histogram_server

Running the converter

In the release directory:

mkdir input
./convert convert.json input.txt

# input.txt should contain a list of files to process (newline delimited)
# i.e. /<path to telemetry-server>/release/input/telemetry1.log

From another shell, in the release directory:

cp ../process_incoming/worker/common/test/data/telemetry1.log input

Without the histogram server running it will produce something like this:

processing file:"telemetry1.log"
LoadHistogram - connect: Connection refused
ConvertHistogramData - histogram not found: https://hg.mozilla.org/releases/mozilla-release/rev/a55c55edf302
done processing file:"telemetry1.log" processed:1 failures:1 time:0.001871 throughput (MiB/s):9.3563 data in (B):18356 data out (B):0

With the histogram server running:

processing file:"telemetry1.log"
done processing file:"telemetry1.log" processed:1 failures:0 time:0.013622 throughput (MiB/s):1.2851 data in (B):18356 data out (B):45909
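
The summary lines above can be parsed for monitoring or quick checks. The Python 3 sketch below is written against the two sample lines only. `parse_summary` and `input_throughput_mib` are hypothetical helpers, and the reading of `time` as seconds and of throughput as input bytes per second in MiB is an inference from those samples, not a documented definition.

```python
import re

SUMMARY = re.compile(
    r"processed:(?P<processed>\d+) failures:(?P<failures>\d+) "
    r"time:(?P<time>[\d.]+) throughput \(MiB/s\):(?P<throughput>[\d.]+) "
    r"data in \(B\):(?P<data_in>\d+) data out \(B\):(?P<data_out>\d+)"
)


def parse_summary(line):
    """Return the numeric fields of a 'done processing' line, or None."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {
        name: float(value) if name in ("time", "throughput") else int(value)
        for name, value in match.groupdict().items()
    }


def input_throughput_mib(summary):
    """Input bytes per unit of time, expressed in MiB."""
    return summary["data_in"] / summary["time"] / (1024 * 1024)


line = ('done processing file:"telemetry1.log" processed:1 failures:0 '
        'time:0.013622 throughput (MiB/s):1.2851 data in (B):18356 '
        'data out (B):45909')
summary = parse_summary(line)
recomputed = input_throughput_mib(summary)
```

For both sample lines, the recomputed value agrees with the reported throughput to the printed precision.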

Ubuntu Notes

apt-get install cmake libprotoc-dev zlib1g-dev libboost-system1.54-dev \
   libboost-filesystem1.54-dev libboost-thread1.54-dev libboost-test1.54-dev \
   libboost-log1.54-dev libboost-regex1.54-dev protobuf-compiler libssl-dev \
   liblzma-dev xz-utils

`mapreduce/job.py`

Contains the MapReduce code. This is the interface for running jobs on Telemetry data. There are example job scripts and input filters in the examples/ directory.

`provisioning/aws/*`

Contains scripts to provision and launch various kinds of cloud services. This includes launching a telemetry server node, a MapReduce job, or a node to process incoming data.

`monitoring/heka/*`

Contains the configuration used by Heka to process server logs.