File Formats in Collusion
Collusion had an ad-hoc file format for saved files, but has migrated to a new format that better captures multiple site visits and changes over time. This is version 1.0, which may be superseded by newer versions as we gain experience with what data is required for visualizations and tracker detection.
Format 1.0
This format has the following structure:
A root object whose keys are format, version, [optional] token, connections, and [optional] lastSync.
The format value is the string "Collusion Save File". It documents and identifies JSON files which are Collusion-specific.
The version value is the string "1.0" and identifies the specific format documented here. A missing version key is the same as a version value of "0" and should be parsed and interpreted as specified for Format 0 above.
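The fallback rule for a missing version key can be sketched as follows. This is a minimal illustration, not Collusion's actual code; the helper name getSaveFileVersion is hypothetical.

```javascript
// Hypothetical helper: returns the format version of an already-parsed save file.
function getSaveFileVersion(saveFile) {
  // A missing version key is treated the same as a version value of "0".
  return Object.prototype.hasOwnProperty.call(saveFile, "version")
    ? saveFile.version
    : "0";
}
```

A loader could branch on the returned string to choose between the Format 0 and Format 1.0 parsing rules.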
The token value is a Type-4 UUID (randomly generated) which a server can use to group updates and check for overlapping or duplicate calls. The client generates it the first time data is shared.
The token value is only shared between client and server. It is not exposed by the server to queries, not associated with an IP address, and not otherwise traceable back to a specific client. If a client has not opted into sharing Collusion data, it will not have a token key.
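As an illustration of lazily creating the token, the sketch below uses Node.js's crypto.randomUUID(), which returns a random (version 4) UUID. The helper name ensureToken and the choice of Node.js are assumptions made for the example; the text does not say how Collusion generates the value.

```javascript
const { randomUUID } = require("crypto");

// Hypothetical helper: creates the token only once, the first time data is shared.
function ensureToken(saveFile) {
  if (!saveFile.token) {
    saveFile.token = randomUUID(); // random (version 4) UUID string
  }
  return saveFile.token;
}
```

Calling ensureToken only from the sharing code path keeps the token key absent for clients that never opt in.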
The optional lastSync value is likewise only present if the client has opted into sharing Collusion data. It records the timestamp of the last item in the connections value which was shared. This value need not be, and should not be, correlated with the time the shared data was uploaded, so that the data cannot be connected back to the client that shared it. The lastSync value is kept by the client only, and is not sent to the server when data is shared.
The connections value makes up the bulk of the file. It is an array of connections, where each connection is itself an array of values in the following order: [source, target, timestamp, contentType, cookie, sourceVisited, secure, sourcePathDepth, sourceQueryDepth]. Because this format stores much more information over time than the older format did, keys are not repeated with each connection. Instead, each connection is effectively a 9-tuple, corresponding closely to a database row.
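To make the positional layout concrete, here is a small sketch that maps one connection array back to named fields. The field order comes from the description above; the helper name connectionToObject and the sample data are invented for illustration.

```javascript
// Field names in the order given for a Format 1.0 connection.
const CONNECTION_FIELDS = [
  "source", "target", "timestamp", "contentType", "cookie",
  "sourceVisited", "secure", "sourcePathDepth", "sourceQueryDepth",
];

// Hypothetical helper: turns a 9-element connection array into a keyed object.
function connectionToObject(tuple) {
  if (!Array.isArray(tuple) || tuple.length !== CONNECTION_FIELDS.length) {
    throw new Error("expected a 9-element connection array");
  }
  const record = {};
  CONNECTION_FIELDS.forEach((name, index) => {
    record[name] = tuple[index];
  });
  return record;
}

// Made-up sample file for a client that has not opted into sharing
// (so it has no token or lastSync key).
const sampleSaveFile = {
  format: "Collusion Save File",
  version: "1.0",
  connections: [
    ["www.example.com", "cdn.example.net", 1356048000000, "image/png",
      true, true, false, 2, 0],
  ],
};
```

Keeping the field list in one constant means the file stays compact while code that reads it can still refer to values by name.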
The source value is a URL containing domain and subdomain information for the requested site, but stripped of protocol, path, query, and fragment. Note that this is a change from the earlier format, which also stripped off the subdomain information that is now retained.
The target value is a URL containing domain and subdomain information for a resource loaded from a third-party site. Like the source, the target is stripped of protocol, path, query, and fragment. This is likewise a change from the earlier format, which also stripped off the subdomain information that is now retained. Connections whose source and target differ only by subdomain are not considered third-party content: if you visit example.com and it loads content from ads.example.com, Collusion will not track that connection.
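The stripping applied to source and target might look like the sketch below, which uses the standard URL class. It assumes that "domain and subdomain information" means the hostname only, with any port dropped; the text does not address ports. The helper name sanitizeUrl is hypothetical.

```javascript
// Hypothetical helper: reduces a full URL to its domain and subdomain.
// Protocol, path, query, and fragment are all discarded.
// Assumption: the port, if any, is dropped as well.
function sanitizeUrl(rawUrl) {
  return new URL(rawUrl).hostname;
}
```

For example, sanitizeUrl("https://ads.example.com/banner?id=7#top") yields "ads.example.com". Deciding whether two hostnames differ only by subdomain is a separate problem that this sketch does not attempt.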
The timestamp is an integer number of milliseconds since the Unix epoch (January 1, 1970), as normally used for JavaScript Date objects. When this data is shared, the granularity of the timestamp is intentionally reduced by rounding it down to the previous 10-minute boundary, so that trackers cannot compare our data with theirs to re-associate our data with individual users. Note that this is a change from the earlier data format, which only stored timestamps relative to the time Collusion was started and could not be used to restore actual session dates or times.
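The rounding step can be sketched as a single integer operation. This assumes 10-minute buckets aligned to the Unix epoch; the names TEN_MINUTES_MS and coarsenTimestamp are hypothetical.

```javascript
const TEN_MINUTES_MS = 10 * 60 * 1000;

// Hypothetical helper: rounds a millisecond timestamp down to the start of
// its 10-minute bucket (buckets are assumed to be aligned to the Unix epoch).
function coarsenTimestamp(timestampMs) {
  return Math.floor(timestampMs / TEN_MINUTES_MS) * TEN_MINUTES_MS;
}
```

Every connection made within the same 10-minute window is reported with the same timestamp, and a value that is already on a boundary is left unchanged.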
The contentType value is the string reported by the target in the Content-Type header, although we MAY also compare with the actual type returned to improve the accuracy of this value. If there is no Content-Type header we WILL attempt to determine the type of the content. If all attempts fail, a default content type of "text/plain" (the standard default content type) will be used, but the value WILL NOT be null.
The cookie value is either true or false, indicating whether the target returned one or more Set-Cookie headers.
The sourceVisited value is either true or false, indicating whether or not the source was loaded by the user in a page or tab. While it is expected that this value will generally be true, it may be false for sources in iframes.
The secure value will be true for content loaded via the HTTPS protocol and false for content loaded via the HTTP protocol. No other protocols are currently tracked by Collusion as connections.
The sourcePathDepth value is a count of how many path elements there were in the source URL before it was sanitized. The URL http://example.com/ has a depth of 0, while the URL http://example.com/blog/post/2012/12/21 has a depth of 5.
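One way to compute this count is sketched below with the standard URL class. It assumes that empty path segments, such as the one produced by a trailing slash, are not counted, which the text does not specify. The function name is hypothetical.

```javascript
// Hypothetical helper: counts the non-empty path elements of a URL.
// Assumption: empty segments (for example from a trailing slash) are ignored.
function sourcePathDepth(rawUrl) {
  return new URL(rawUrl).pathname
    .split("/")
    .filter((segment) => segment.length > 0)
    .length;
}
```

Under these assumptions the two example URLs above give 0 and 5 respectively.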
The sourceQueryDepth value is a count of how many items there were in the query string. It is not a count of unique keys: the string after the "?" is simply split on "&" and ";". Again, http://example.com/ has a depth of 0, as does http://example.com/?, while http://example.com/?captain=kirk&ship=enterprise, http://example.com/?captain=kirk&captain=picard, and http://example.com/?captain=kirk;ship=enterprise all have a depth of 2.
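The split described above could be written as in the sketch below. Two details are assumptions rather than part of the format description: the fragment is removed before counting, and empty items (as in "?a=1&&b=2") are ignored. The function name is hypothetical.

```javascript
// Hypothetical helper: counts query-string items by splitting on "&" and ";".
// Assumptions: any fragment is dropped first, and empty items are not counted.
function sourceQueryDepth(rawUrl) {
  const withoutFragment = rawUrl.split("#")[0];
  const queryStart = withoutFragment.indexOf("?");
  if (queryStart === -1) {
    return 0; // no query string at all
  }
  return withoutFragment
    .slice(queryStart + 1)
    .split(/[&;]/)
    .filter((item) => item.length > 0)
    .length;
}
```

Because repeated keys are counted individually, the captain=kirk&captain=picard example still gives a depth of 2.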
Questions
- Should we be tracking the HTTP Method (GET, POST, etc.) of each connection?
- Is 10 minutes the right granularity to obfuscate the timestamps to? Should there be a random component?
- (Not a question, more of an implementation note) We also want to keep track of the tab a connection is loaded in, internally. This is used both to determine the sourceVisited value and for per-tab visualizations.