lightbeam/doc/data_format.v1.0.md

File Formats in Collusion

Collusion had an ad-hoc file format for saved files, but has migrated to a new format that better captures multiple site visits and changes over time. This is the 1.0 version, which may be superseded by newer versions as we gain experience with what data is required for visualizations and tracker detection.

Format 1.0

This format has the following structure:

A root object whose keys are format, version, [optional] token, connections, and [optional] lastSync.

The format value is the string "Collusion Save File". It documents the file and identifies a JSON file as Collusion-specific.

The version value is the string "1.0" and identifies the specific format documented here. A missing version key is the same as a version value of "0" and should be parsed and interpreted as specified for Format 0 above.
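As a minimal sketch of the version rule above, a reader could fall back to "0" when the key is absent. The function name `detect_version` is hypothetical, and the sketch assumes the root of the file is a JSON object.

```python
import json


def detect_version(save_file_text: str) -> str:
    """Return the format version of a save file.

    A missing "version" key is treated as version "0", as described above.
    Assumes the root of the file is a JSON object.
    """
    root = json.loads(save_file_text)
    return root.get("version", "0")
```

A caller would then dispatch on the returned string to the parser for that format.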

The token value is a version 4 (randomly generated) UUID which a server can use to group updates and check for overlapping or duplicate calls. The client generates it the first time data is shared.

The token value is shared only between client and server. It is not exposed by the server to queries, not associated with an IP address, and not otherwise traceable back to a specific client. If a client has not opted into sharing Collusion data, it will not have a token key.
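The token lifecycle described above might look like the following sketch. The helper name `ensure_token` and the idea of storing the token directly in the root dictionary are assumptions made for illustration; only the use of a random version 4 UUID comes from the text.

```python
import uuid


def ensure_token(root: dict) -> str:
    """Return the client's token, creating it on first share.

    Hypothetical helper: it should only be called once the user has
    opted into sharing, so clients that never share never get a token key.
    """
    if "token" not in root:
        # uuid4() produces a randomly generated (version 4) UUID.
        root["token"] = str(uuid.uuid4())
    return root["token"]
```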

The optional lastSync value is likewise present only if the client has opted into sharing Collusion data, and records the timestamp of the last item in the connections value which was shared. This value does not need to be, and should not be, correlated with the time the shared data was uploaded, so that data cannot be connected back to the client that shared it. The lastSync value is kept by the client only, and is not sent to the server when data is shared.

The connections value makes up the bulk of the file. It is an array of connections, where each connection is itself an array of values in the following order: [source, target, timestamp, contentType, cookie, sourceVisited, secure, sourcePathDepth, sourceQueryDepth]. Because we will be storing much more information over time than the older format did, we do not repeat the keys with each connection. Each connection is effectively a 9-tuple corresponding closely to a database row.
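To make the layout concrete, here is an illustrative save file with a single connection, together with a sketch of how a reader could give the positional values names. The sample values, the `Connection` class, and `load_connections` are all invented for this example; only the key names and the field order come from the description above.

```python
import json
from typing import NamedTuple


class Connection(NamedTuple):
    """One connection row, with fields in the order given by the format."""
    source: str
    target: str
    timestamp: int
    contentType: str
    cookie: bool
    sourceVisited: bool
    secure: bool
    sourcePathDepth: int
    sourceQueryDepth: int


# Illustrative file for a client that has not opted into sharing,
# so there is no token or lastSync key. All values are made up.
SAMPLE = """
{
  "format": "Collusion Save File",
  "version": "1.0",
  "connections": [
    ["www.example.com", "tracker.example.net", 1356048000000,
     "image/gif", true, true, false, 2, 0]
  ]
}
"""


def load_connections(text: str) -> list[Connection]:
    """Parse a save file and wrap each 9-element row in a Connection."""
    root = json.loads(text)
    return [Connection(*row) for row in root["connections"]]


connections = load_connections(SAMPLE)
```

Wrapping each row in a named tuple keeps the compact on-disk form while letting code refer to `connections[0].target` rather than `connections[0][1]`.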

The source value is a URL containing domain and subdomain information for the requested site, but stripped of protocol, path, query, and fragment. Note, this is a change from the earlier format which also stripped off subdomain information that is now retained.

The target value is a URL containing domain and subdomain information for a resource loaded from a third-party site, and like the source the target is stripped of protocol, path, query, and fragment. Likewise, this is a change from the earlier format which also stripped off subdomain information that is now retained. Connections which differ only by subdomain are not considered third-party content, which means that if you visit example.com and it loads content from ads.example.com, those connections will not be tracked by Collusion.
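The stripping applied to both source and target could be sketched as follows. The function name `sanitize_host` is hypothetical, and dropping the port and lowercasing the host are assumptions of this sketch (they are simply what `urlsplit(...).hostname` does), not something the format specifies.

```python
from urllib.parse import urlsplit


def sanitize_host(url: str) -> str:
    """Keep only the host (domain plus any subdomains) of a URL.

    Protocol, path, query, and fragment are discarded. As an assumption
    of this sketch, the port and any credentials are discarded too.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    return host
```

Deciding whether two hosts differ only by subdomain is a separate step and is not covered by this sketch.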

The timestamp is an integer number of milliseconds since the Unix epoch (January 1, 1970), as normally used for JavaScript Date objects. The granularity of the timestamp is intentionally reduced when this data is shared: timestamps are rounded down to the previous 10-minute boundary to prevent trackers from comparing our data with theirs to re-associate our data with individual users. Note, this is a change from the earlier data format, which only stored timestamps relative to the time Collusion was started and couldn't be used to restore actual session dates or times.
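The rounding step can be sketched with integer arithmetic. The names `TEN_MINUTES_MS` and `coarsen_timestamp` are hypothetical, and the sketch assumes boundaries are aligned to the Unix epoch.

```python
TEN_MINUTES_MS = 10 * 60 * 1000  # 600,000 milliseconds


def coarsen_timestamp(timestamp_ms: int) -> int:
    """Round a millisecond timestamp down to a 10-minute boundary."""
    return timestamp_ms - (timestamp_ms % TEN_MINUTES_MS)
```

A timestamp that already sits on a boundary is returned unchanged, and every other timestamp within the same 10-minute window maps to the same value.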

The contentType value is the string reported by the target in the Content-Type header, although we MAY also compare with the actual type returned to improve the accuracy of this value. If there is no Content-Type header we WILL attempt to determine the type of the content. If all attempts fail, a default content type of "text/plain" (the standard default content type) will be used, but the value WILL NOT be null.
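The fallback order described above might be expressed as in the sketch below. The function `resolve_content_type` and its `sniffed_type` parameter are hypothetical; how the content would actually be inspected is left out, and treating a blank header as missing is an assumption.

```python
from typing import Optional

DEFAULT_CONTENT_TYPE = "text/plain"


def resolve_content_type(header_value: Optional[str],
                         sniffed_type: Optional[str] = None) -> str:
    """Pick a contentType value that is never None.

    Preference order: the Content-Type header as reported by the target,
    then a type determined by inspecting the content, then the default.
    """
    if header_value and header_value.strip():
        return header_value.strip()
    if sniffed_type and sniffed_type.strip():
        return sniffed_type.strip()
    return DEFAULT_CONTENT_TYPE
```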

The cookie value is either true or false, indicating whether the target returned one or more Set-Cookie headers.

The sourceVisited value is either true or false, indicating whether or not the source was loaded by the user in a page or tab. While it is expected that this value will generally be true, it may be false for sources in iframes.

The secure value is true for content loaded via the HTTPS protocol and false for content loaded via the HTTP protocol. No other protocols are currently tracked by Collusion as connections.

The sourcePathDepth is a metric of how many path elements there were in the source URL before it was sanitized. The URL http://example.com/ has a depth of 0, while the URL http://example.com/blog/post/2012/12/21 has a depth of 5.
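One way to compute this metric, as a sketch: count the non-empty segments of the URL path. The function name `source_path_depth` is hypothetical, and ignoring empty segments (so that a trailing slash does not add to the depth) is an assumption the text does not settle.

```python
from urllib.parse import urlsplit


def source_path_depth(url: str) -> int:
    """Count the path elements of a URL before it is sanitized."""
    path = urlsplit(url).path
    # Empty segments come from the leading slash, a trailing slash,
    # or doubled slashes; this sketch does not count them.
    return sum(1 for segment in path.split("/") if segment)
```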

The sourceQueryDepth is a metric of how many items there were in the query string. This is not a test of unique keys, just simple breaking after the "?" and splitting on "&" and ";". Again, http://example.com/ has a depth of 0, as does http://example.com/?, while http://example.com/?captain=kirk&ship=enterprise, http://example.com/?captain=kirk&captain=picard, and http://example.com/?captain=kirk;ship=enterprise all have a depth of 2.
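The counting rule above can be sketched like this. The function name `source_query_depth` is hypothetical, and skipping empty items (for example in "a=1&&b=2") is an assumption of the sketch.

```python
import re
from urllib.parse import urlsplit


def source_query_depth(url: str) -> int:
    """Count the items in a URL's query string.

    Items are separated by "&" or ";". Keys are not de-duplicated,
    so a repeated key counts once per occurrence.
    """
    query = urlsplit(url).query
    return sum(1 for item in re.split(r"[&;]", query) if item)
```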

Questions

  1. Should we be tracking the HTTP Method (GET, POST, etc.) of each connection?
  2. Is 10 minutes the right granularity to obfuscate the timestamps to? Should there be a random component?
  3. (Not a question, more of an implementation note) We also want to keep track of the tab a connection is loaded in, internally. This is used both to determine the sourceVisited value and for per-tab visualizations.