Merge pull request #156 from ajupazhamayil/final_report

Update final report
2019-08-23 10:12:32 -04:00 · 2019-08-23 10:12:32 -04:00 · c5a3791b1a
--- a/docs/GSoC_Final_Report_2019.md
+++ b/docs/GSoC_Final_Report_2019.md
@@ -2,7 +2,7 @@
 # TUID Service Improvements

 ## Introduction
-The TUID service is responsible for making Temporally Unique IDentifiers to identify unique lines of source code. TUIDs can be used for mapping code coverage between various revisions. The service is a single-process Flask application which was using SQLite database and was unable to handle a large volume of incoming requests from the ETL machines.
+The TUID service is responsible for making Temporally Unique IDentifiers to identify unique lines of source code. TUIDs can be used for mapping code coverage between various revisions. The service is a single-process Flask application which was using an SQLite database and was unable to handle a large volume of incoming requests from the ETL machines.

 The aim of the proposal was to make the service more stable and faster. The proposal comprised of  two major changes:
 1. Porting of application to use Elasticsearch instead of SQLite to enable parallelism. 
@@ -12,7 +12,7 @@ Elasticsearch porting was completed, but due to the nature of the project, the p


 ## Work
-Porting the project to use Elasticsearch instead of SQLite was the starting point. Here, the difference between ES and SQLite was considered- the lack of transaction in ES, the way ES writes, the way ES deletes, etc. The recomputation of revnums was removed to allow negative revnums which eliminated unnecessary processing. We changed the structure of the annotations table to save TUID as an ordered list instead of a pair (line:tuid).
+The starting point was to port the project to use Elasticsearch instead of SQLite. The differences between ES and SQLite were considered- lack of transactions in ES, mechanism of write and delete operations in ES, etc. Another point of change was regarding revision numbers. A revision number is an integer associated with every revision in the changeset log table. Since revision is a hash id, revision numbers enable us to order the revisions in the changeset log table so that we can quickly find batches of revisions and have them applied. Earlier, revision numbers were recomputed to start from zero when the revision numbers had negative values. This recomputation was eliminated to allow negative values so that it is optimized. Also, the structure of the annotations table was changed to save TUIDs as an ordered list instead of a (line, tuid) pair.

 The original idea was to incorporate multi-processing with two Flask servers. Due to synchronization issues, the focus shifted to have multiple processes in the service using DB.  One process acts as a tuid generator and others communicate to the generator process via a table to get TUIDs. Work on this was in progress when the synchronization issue again was a blocker. The synchronization issue referred here is that coordination of TUID creation across all the processes, to ensure TUIDs are unique would remove all the speed that was gained with a multi-process service. Also for each file, ordering of application of the changesets had to be ensured (so no parallelism possible). Each changeset may have multiple files, so if the update was done on all the files in a changeset, it had to be ensured that no other process was doing the same. One solution was to assign each file to one process. The issue was co-ordination among processes while the distribution of the same. An important issue that needs to be addressed is what should be the status of the process that was assigned a particular file gets terminated or blocked. The mechanism through which other processes get notified needs to be defined. It should be clear what amount of time other processes should wait for a blocked process. If the original process is resurrected or unblocked, we should define how it behaves with the original file to which it was associated. The problem was not fully understood. There could be other hidden issues which were not obvious.

@@ -20,7 +20,7 @@ A decision was made not to have multiple processes. Instead, caching of TUIDs wa

 The code has not reached production yet and hence the ETL machines do not use the new changes for now.

-Link to the PRs: https://github.com/mozilla/TUID/pulls?q=is%3Apr+author%3Aajupazhamayil
+Link to the PRs: [Click here](https://github.com/mozilla/TUID/pulls?q=is%3Apr+author%3Aajupazhamayil)


 ## Challenges