Script to reconcile a dump of the FxA user database with Basket

Перейти к файлу

Shane Tomlinson 5e175b1a63 Merge pull request #21 from Mozilla-GitHub-Standards/master r=@shane-tomlinson Add Mozilla Code of Conduct		2019-04-01 16:07:59 +01:00
bin	feat(performance): Add a configurable highWaterMark for performance.	2018-06-05 20:23:48 +01:00
lib	fix(uuid): npm node-uuid is deprecated in favour of npm uuid	2018-06-05 10:49:15 -07:00
test_data	fix(tests): write expected counts when generating test data and import those values in tests	2018-05-31 22:41:27 -07:00
tests	fix(eslint): add eslint; fixes mostly unused vars and key ordering	2018-06-05 10:44:00 -07:00
.eslintrc	fix(eslint): add eslint; fixes mostly unused vars and key ordering	2018-06-05 10:44:00 -07:00
.gitignore	Initial checkin.	2018-05-08 14:43:30 +01:00
.nvmrc	fix(nvmrc): this requires node@8 to run	2018-05-21 01:00:43 -07:00
.travis.yml	fix(eslint): add eslint; fixes mostly unused vars and key ordering	2018-06-05 10:44:00 -07:00
CODE_OF_CONDUCT.md	Add Mozilla Code of Conduct file	2019-03-29 23:54:20 -07:00
LICENSE	Add LICENSE	2018-05-10 18:14:55 +01:00
README.md	refactor: Convert to streams to handle pressure/backoff	2018-06-01 17:16:41 +01:00
index.js	Initial checkin.	2018-05-08 14:43:30 +01:00
package-lock.json	fix(uuid): npm node-uuid is deprecated in favour of npm uuid	2018-06-05 10:49:15 -07:00
package.json	fix(uuid): npm node-uuid is deprecated in favour of npm uuid	2018-06-05 10:49:15 -07:00

README.md

fxa-reconcile-with-salesforce

A utility to reconcile the user databases between FxA and Mozilla's Salesforce instance.

Usage

Two database dumps in CSV format are needed. Both dumps need to be stripped of leading and trailing whitespace, contain no blank lines, and sorted by uid.

The FxA dump must contain uid, base64(email), base64(locale), and create_date

124516347fsdf2361425,base64('stomlinson@mozilla.com'),base64('en_US'),1526933232113

create_date is milliseconds since the Unix epoch.

The Salesforce dump must contain uid,email

124516347fsdf2361425,stomlinson@mozilla.com

To help check that the CSVs are parsed as expected, an intermediate JSON representation of the SQS commands to be sent is generated. This writes to stdout by default:

node bin/csv-to-json.js -f <fxa_sorted.csv> -s <salesforce_sorted.csv>

To save to a file:

node bin/csv-to-json.js -f <fxa_sorted.csv> -s <salesforce_sorted.csv> > commands.json

To do a dry run and display the list of SQS commands that would be sent:

node bin/json-to-sqs.js -i <commands_json>

To send to SQS, the target SQS endpoint and region must be specified:

node bin/json-to-sqs.js -i <commands_json> -u <target_sqs_url> -r <target_sqs_region>

AWS credentials are expected to be available. See https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/setting-credentials-node.html

Preparing the Salesforce data

The Salesforce data comes in a UTF-16 CSV with the following format:

CREATED_DATE_,EMAIL_ADDRESS_,FXA_ID,FXA_LANGUAGE_ISO2

This needs to be converted to a UTF-8 file with the following format:

FXA_ID,EMAIL

Standard unix commands can take care of the prep work:

iconv -f utf-16 -t utf-8 <salesforce.original.utf16.csv> > <salesforce.original.utf8.csv> - converts UTF-16 to UTF-8, creating an output file of half the size.
tail -n +2 <salesforce.original.utf8.csv> | cut -d , -f 3,2 | perl -pe 's/^(\S+),(\S+)/$2,$1/;' > <salesforce.correct_columns.unsorted.csv> - Ditch the schema line, extract and output one UID,email per line
sort <salesforce.correct_columns.unsorted.csv> > <salesforce.sorted.csv> - Sort the CSV by UID.

Allowing for more nodejs memory, if needed. Recommended.

node bin/csv-to-json.js --max-old-space-size=8192 -f <fxa_sorted.csv> -s <salesforce_sorted.csv>

Generating test data

Test data can be generated:

node ./bin/generate-test-data.js -f ./test_data/fxa.csv -s ./test_data/salesforce.csv -c 4000 --pc 27 --pu 2 --pd 1 --pi 1

This would generate a total of 4000 rows, 27% creates, 2% updates, 1% deletions, 1% invalid uid saved to test_data/fxa.csv and test_data/salesforce.csv

-c, --count <count> Total record count
-f, --fxa [filename] FxA CSV
-s, --salesforce [filename] Salesforce CSV
--pc [percentage] % of accounts that need to be created on Salesforce, defaults to 10%
--pu [percentage] % of accounts that need to be updated on Salesforce, defaults to 5%
--pd [percentage] % of accounts that need to be deleted on Salesforce, defaults to 10%
--pi [percentage] % of accounts that have an invalid uid from either source, defaults to 1%

Output files are unsorted. The generated uid for each line is random and sorting in the generation script is prohibitively expensive for large datasets. The best tool to sort the output is the Unix sort command:

sort fxa.csv > fxa_sorted.csv

sort salesforce.csv > salesforce_sorted.csv

You can run ./bin/regenerate-testdata.sh to do the above. Use those Counts to update expected in ./tests/lib/readers/csv.test.js.

Architecture

Everything is based on NodeJS Streams

lib/readers contains logic to read .csv and .diff files. A JSON blob with the event will be pushed onto the stream queue for each entry in the input file.
lib/transforms contains logic to transform the JSON blob from the reader into a message suitable for output. The transformed message is pushed onto the stream queue for further transformation or output.
lib/writers contains logic to send messages to SQS.

This architecture allows us to read multiple types of files as well as perform manual verification of the parsing and message format.

LICENSE

MPL-2.0