Commit graph

293 commits

Author SHA1 Message Date
hlums aeb9486a6a Updated wikigold utils to be consistent with other datasets. 2019-06-23 21:57:04 +00:00
hlums 7c35f670d2 Added probabilities output to BERT token classifier. 2019-06-23 21:55:51 +00:00
hlums 7b827d6d5b Updated ner token preprocessing for Chinese text. 2019-06-23 21:53:06 +00:00
hlums 480a08f544 Updated NER notebook with new tokenizer api. 2019-06-23 21:51:34 +00:00
Said Bleik 35fc04c383
Merge pull request #113 from microsoft/hlu/two_sequence_utils_and_XNLI_notebook
Hlu/two sequence utils and xnli notebook
2019-06-21 17:39:07 -04:00
Said Bleik 9c3a95159c add sequential loader test 2019-06-21 17:37:03 -04:00
Hong Lu fb3f7ddef6 Moved _truncate_seq_pair outside of if else block. 2019-06-21 14:48:10 -04:00
Said Bleik 399707e747 meh 2019-06-21 14:43:09 -04:00
Said Bleik 65f1c82a81 arg name change 2019-06-21 14:41:39 -04:00
Said Bleik 8b56eec5cd updated defaults for predict's output 2019-06-21 14:24:40 -04:00
Said Bleik aec2ffbfe7 add namedtuple preds output 2019-06-21 13:17:04 -04:00
Said Bleik 136cadf0fe lm name changes 2019-06-21 13:16:43 -04:00
Hong Lu c53edc41b3 Added training and prediction time to notebook. 2019-06-21 10:50:20 -04:00
Said Bleik c0951fae7c rem data_loader 2019-06-20 14:44:18 -04:00
Said Bleik d53c17e1ed minor edit to preds 2019-06-20 14:37:47 -04:00
Said Bleik fad2564604 added optional prob dist predictions 2019-06-20 14:23:09 -04:00
Hong Lu 2010daf637 Removed redundant code. 2019-06-20 14:08:40 -04:00
Hong Lu 76fa9d7ed4 Fixed formatting. 2019-06-20 14:01:48 -04:00
Said Bleik 4388e6c588 add whole-word pretrained models 2019-06-20 13:28:10 -04:00
hlums ed3415b320 Updated utils of XNLI dataset. 2019-06-19 21:06:38 +00:00
hlums 593bb4eb5d Added convert_to_unicode helper function. 2019-06-19 21:06:15 +00:00
hlums 946a687729 Resolved conflict with staging. 2019-06-19 21:05:53 +00:00
Said Bleik b407204c79 add sequential loader 2019-06-19 16:12:20 -04:00
Said Bleik 5245afafdd add dask data loader 2019-06-19 14:54:33 -04:00
Casey Hong 236a64e9d8 update stsbenchmark 📓 2019-06-18 17:20:47 -04:00
Janhavi Mahajan 074bca3619 black formatter 2019-06-18 16:49:08 -04:00
Janhavi Mahajan 7f1bb8a039 bug fix: sts-benchmark has extra tabs in some rows which caused incorrect reading of pandas df or azureml dataflow object 2019-06-18 16:49:08 -04:00
hlums e09a54512a Resolved conflict and merged from staging. 2019-06-18 15:54:55 +00:00
hlums 7cba858a42 Updated docstring. 2019-06-18 14:57:49 +00:00
hlums e5b5af4c78 Added warmup and support for two-sequence classification. 2019-06-17 22:40:24 +00:00
Miguel González-Fierro 818f2ef3d3
Merge pull request #78 from microsoft/liqun-first-pull
GenSen on AML deep dive notebook (sentence similarity)
2019-06-17 23:46:44 +02:00
Miguel González-Fierro 800ab9ac00
Merge pull request #103 from microsoft/bleik
update xnli dataset utils
2019-06-17 23:40:56 +02:00
Liqun Shao 0af38f7ccb make changes on notebook based on the structure change 2019-06-17 16:57:09 -04:00
Liqun Shao cc0fabcd55 add docstring for training on gpu only 2019-06-17 16:57:09 -04:00
Liqun Shao 70733e9d77 change the structure 2019-06-17 16:57:09 -04:00
Abhiram E f7a14146ea Fixed model state empty bug 2019-06-17 16:57:09 -04:00
Liqun Shao cc6cf46a28 format the files 2019-06-17 16:57:09 -04:00
Liqun Shao a7e0555235 fix the path 2019-06-17 16:57:09 -04:00
Liqun Shao d200d525fe remove unwanted logs; fix the bug for getting training time 2019-06-17 16:57:09 -04:00
Liqun Shao ae2f8f4d2a Make the following changes to increase the performance of horovod
training.
1. Add random seeds for iterators
2. learning rate=lr*hvd.size()
3. sync the optimizer
4. remove DataParallel
2019-06-17 16:57:09 -04:00
Liqun Shao 9e504182c3 fix the issue min_epoch_loss not updated during training, then it will never stop; min_epoch_loss always equals val_epoch_loss 2019-06-17 16:57:09 -04:00
Liqun Shao 4fa8bcc50f fix import 2019-06-17 16:57:09 -04:00
Abhiram E a065ae9bb0 Removed nested path joins 2019-06-17 16:57:09 -04:00
Abhiram E eb6719d9ec Minor fix to docstrings 2019-06-17 16:57:08 -04:00
Liqun Shao 0cff6731bb remove unwanted log 2019-06-17 16:57:08 -04:00
Abhiram E f8ffc290b7 Refactored gensen train.py 2019-06-17 16:57:08 -04:00
Liqun Shao 4369138371 add docstring to utils.py 2019-06-17 16:57:08 -04:00
Abhiram E 1ab6490348 Resolved comments on the Gensen code 2019-06-17 16:57:08 -04:00
Abhiram E 13ec0a36a9 Moved prints to logging 2019-06-17 16:57:08 -04:00
Abhiram E f9a0bd1435 Added docstrings to Gensen code and refactored based on code review comments 2019-06-17 16:57:08 -04:00
Liqun Shao b670abbb88 add original code source to all the code 2019-06-17 16:57:08 -04:00
Liqun Shao 45811ef9b6 add aml explanation 2019-06-17 16:57:08 -04:00
Liqun Shao 349958bafb 1. correct typo in the notebook
2. add header to all the python files
3. add comments for train.py to explain what it does
2019-06-17 16:57:08 -04:00
Liqun Shao 01fdc9c82a The HyperDrive will --> The HyperDrive run automatically shows... 2019-06-17 16:57:08 -04:00
Liqun Shao 9e6c4680fe put create workspace in the first place 2019-06-17 16:57:08 -04:00
Liqun Shao 0c79d68381 remove all unnecessary labels 2019-06-17 16:57:08 -04:00
Liqun Shao 2c4cc44839 include imports 2019-06-17 16:57:08 -04:00
Liqun Shao c7cc976063 1. Move similarity explaining to README
2. Separate model.py into two
3. Remove unnecessary imports
2019-06-17 16:57:08 -04:00
Liqun Shao 9e212bc832 add explanation on tuning results 2019-06-17 16:57:07 -04:00
Liqun Shao 635eab8cb6 change the name of saving model 2019-06-17 16:57:07 -04:00
Liqun Shao b440661f6a fix the bug for stopping the training 2019-06-17 16:57:07 -04:00
Liqun Shao 6f3c89f0a7 Fixed the bug on training not converging 2019-06-17 16:57:07 -04:00
Liqun Shao ead07a6551 remove auto loader and save the best model state 2019-06-17 16:56:25 -04:00
Liqun Shao 786f8de629 change the stopping condition to when the validation loss is small 2019-06-17 16:56:25 -04:00
Liqun Shao fde487ea89 use adam optimizer instead of SGD 2019-06-17 16:56:25 -04:00
Liqun Shao c5687f9159 add README to gensen repo 2019-06-17 16:56:25 -04:00
Liqun Shao 3c43c229b2 Resolved conflicts 2019-06-17 16:56:25 -04:00
Liqun Shao 5137d91c46 add horovod distributed training to the gensen model and make the training stop with small validation loss 2019-06-17 16:56:25 -04:00
Abhiram E 2d5bfe6862 Refactored gensen related files by Maluuba 2019-06-17 16:56:24 -04:00
Liqun Shao 51ccc72cd3 first push 2019-06-17 16:56:24 -04:00
Said Bleik f12aabd5b0 add xnli dataset utils 2019-06-17 12:25:01 -04:00
Hong Lu dfb8553c5b Resolved conflict and merged staging. 2019-06-17 12:01:10 -04:00
Said Bleik a514025f5d
Merge pull request #99 from microsoft/bleik
ar TC example
2019-06-14 16:21:03 -04:00
Said Bleik 0929c37d56 removed unused arg 2019-06-14 16:18:50 -04:00
Said Bleik b0ead86bf2 added dataset utils 2019-06-14 15:11:17 -04:00
Said Bleik 51c22b9607 minor fixes 2019-06-13 22:26:29 -04:00
Said Bleik 9e85c2923f added missing imports 2019-06-13 15:30:53 -04:00
Hong Lu 4de4ece15c Changed python version in pre-commit-config back to 3.6 2019-06-13 14:46:57 -04:00
Ubuntu 5438d76596 Added test code for NER utils. 2019-06-13 18:25:27 +00:00
Hong Lu 6d671b6221 Started adding test code for NER. 2019-06-12 15:33:12 -04:00
Casey Hong e031d3d225 suppress nltk messages 2019-06-12 12:37:43 -04:00
Said Bleik 98a7071294
Merge pull request #85 from microsoft/casey-senteval
SentEval examples (local and with azureml support)
2019-06-11 14:39:51 -04:00
Casey Hong ba3ba5b5a8 use azureml_utils for workspace creation 2019-06-11 13:05:39 -04:00
Casey Hong e5b12c6f32 resolve merge conflicts 2019-06-11 11:45:30 -04:00
Chaoyu Guan f0d6a2f55c rename files and revise README for #62 2019-06-08 13:15:23 +00:00
Chaoyu Guan f4f3591668 add explain-NLP-model part for issue #62 2019-06-08 12:51:44 +00:00
Hong Lu 26fcc3cbe4 Added random seed option to wikigold util function. 2019-06-07 17:32:49 -04:00
Hong Lu 049ddf6442 Added BERT prefix to classifier names and some minor docstring updates. 2019-06-07 17:08:29 -04:00
Hong Lu 4e7ac8adc1 Minor updates in token classifier. 2019-06-07 10:56:49 -04:00
Hong Lu e40e9636f3 Removed old data utils script. 2019-06-07 10:42:27 -04:00
Hong Lu a8feb91a89 Removed common_ner.py 2019-06-07 10:35:12 -04:00
Hong Lu 2593620633 Added utility functions for token classification. 2019-06-07 10:34:09 -04:00
Hong Lu fbf15e64c6 Merge remote-tracking branch 'origin/staging' into hlu/BERT_NER_utils 2019-06-06 16:45:09 -04:00
Said Bleik b040c481eb
Merge pull request #86 from microsoft/abhiram-requests-fix
Minor fix suggested in Recommenders repo
2019-06-06 16:03:59 -04:00
Abhiram E 802188e115 Minor fix suggested in Recommenders repo 2019-06-06 15:46:17 -04:00
Casey Hong 23d9635230 senteval local and azureml 📓 2019-06-06 10:57:05 -07:00
Abhiram E f0db07fb3a Minor change. 2019-06-06 10:20:57 -07:00
Abhiram E 5b1ed5f447 FastText loader - Code changes and unit tests.
1. Added methods to download, extract and load glove vectors.
2. Added unit test to test the public method.

Other changes
 1. Refactored files to add return types to docstrings.
 2. Minor changes to path variables.
2019-06-06 10:20:57 -07:00
Abhiram E 2498dbaaa1 Minor changes 2019-06-06 10:18:13 -07:00
abeswara 008bfa2c57 Glove loader - Code changes and unit tests.
1. Added methods to download, extract and load glove vectors.
2. Added unit tests to test the public methods.

Other changes
 1. Made download and extract methods private.
 2. Refactored Word2vec unit tests to exclude private methods.
2019-06-06 10:16:46 -07:00
abeswara ae31e05a84 Word2vec loader - Code changes and unit tests.
1. Refactored word2vec loader to perform existing file checks before downloading or extracting.

2. Added unit tests to load, download and extract functions.
2019-06-06 10:12:29 -07:00
Said Bleik 9269ef5482 merge staging 2019-06-06 13:01:07 -04:00
Said Bleik c518d6a735 updated tc notebook and some utils 2019-06-05 21:37:16 -04:00
Abhiram E 3ac927edfa Using tqdm to show progress bar 2019-06-05 13:08:23 -04:00
Abhiram E 0e296b6291 Changed url fetch from urlretrieve to requests 2019-06-04 16:26:35 -04:00
Said Bleik ee9134d96f minor updates to seq classification 2019-06-03 10:03:49 -04:00
Hong Lu 9bcad55d20 Updated NER notebook with wikigold data. 2019-05-31 18:44:01 -04:00
Said Bleik 61b66a57aa updated device utils 2019-05-31 16:08:36 -04:00
Hong Lu 320b08d9af Removed BERT image. 2019-05-31 13:46:34 -04:00
Said Bleik 1a96bce557 added missing assignment 2019-05-30 10:42:30 -04:00
Hong Lu aaf0114cd7 Removed old scripts. 2019-05-29 14:57:23 -04:00
Hong Lu 52bd027555 Added helper function for postprocessing token classification results. 2019-05-29 14:39:58 -04:00
Said Bleik 5a81055e70 updated device utils and bert seq classifier 2019-05-28 23:16:19 -04:00
Abhiram E 36d7411bec Fix to limit the memory usage when using fasttext embedding loaders. Code changes to use the simpler version 2019-05-28 12:04:57 -04:00
Hong Lu 52cc16fb9b Updated token classifier api. 2019-05-24 18:09:56 -04:00
Hong Lu 5258c9cd7e Added some utility functions to the common script. Will be merged with common.py later. 2019-05-24 18:09:04 -04:00
Casey Hong 1cd36ccff7 fix snli noblank bug and add preprocessing tests 2019-05-21 23:00:56 -04:00
Said Bleik 63e546ab3c updated preprocessing, utils, classification 2019-05-21 16:45:23 -04:00
Hong Lu 2473e1a75c Black auto formatting. 2019-05-20 18:53:57 -04:00
Hong Lu 3d1c1862d9 Removed old data utils script. 2019-05-20 14:08:39 -04:00
Hong Lu 4a41ec41e8 Added a constant file. 2019-05-20 14:00:12 -04:00
Hong Lu 1393c74fb3 Minor updates for data class updates. 2019-05-20 13:59:38 -04:00
Hong Lu 9919a7bd35 Removed InputFeature class. Use namedtuple instead of class for input data. 2019-05-20 13:58:54 -04:00
Hong Lu e81138ad08 Changed optimizer and number of epochs configuration. 2019-05-20 13:58:16 -04:00
Said Bleik 49bb116474 update seq classifier 2019-05-17 10:04:46 -04:00
Hong Lu eef85dea41 Consolidated all configuration classes into a single class. 2019-05-16 18:11:21 -04:00
Hong Lu 7ca29691ae Consolidated some utility functions into BertTokenClassifier. 2019-05-16 18:10:47 -04:00
Hong Lu d87dfbc2af Minor edits and added docstring. 2019-05-16 18:10:14 -04:00
Hong Lu 14543fbd52 Added yaml configuration file for NER example. 2019-05-16 18:08:50 -04:00
Abhiram E 52d720e9bf Added option to limit number of word vectors for glove and word2vec 2019-05-15 00:22:37 -04:00
Janhavi Mahajan 1ed2c4dc0a feat(bug fix) updated snli notebook with to_lowercase_all() instead of to_lowercase() that expects a column name list. Fixed None object returning in to_lowercase when column name list is not passed 2019-05-13 18:14:31 -04:00
Said Bleik e9c17a961e update BERTSequenceClassifier and notebook 2019-05-13 15:18:21 -04:00
Said Bleik 7430e3b178 updated BERTSequenceClassifier + documentation 2019-05-13 14:38:54 -04:00
Said Bleik 7d2d74f975 BERTSequenceClassifier 2019-05-13 16:31:58 +00:00
Janhavi Mahajan bb5764a56a feat(code fix) rm_nltk_stop_words now expects sentences and stop_word column names 2019-05-10 16:50:34 -04:00
Janhavi Mahajan 197d771208 feat(code review comments) generalize nltk utils tokenize, remove_sto_words to more than 2 sentences 2019-05-10 16:27:48 -04:00
Janhavi Mahajan 6e3523810a feat(code review) fix to_nltk_tokens, add to_lowercase_all and to_lowercase as per said's comments 2019-05-10 16:27:48 -04:00
Abhiram E 49595b8666 Moved urls to module constants for pretrained embedding utils. 2019-05-09 14:58:12 -04:00
Casey Hong faf924b45b token_cols bugfix 2019-05-09 14:58:11 -04:00
Abhiram E 6ba272308b Minor change. 2019-05-09 14:58:11 -04:00
Abhiram E 2502d91e1b FastText loader - Code changes and unit tests.
1. Added methods to download, extract and load glove vectors.
2. Added unit test to test the public method.

Other changes
 1. Refactored files to add return types to docstrings.
 2. Minor changes to path variables.
2019-05-09 14:58:11 -04:00
Abhiram E 4e480026a0 Minor changes 2019-05-09 14:58:10 -04:00
abeswara 8025b4449d Glove loader - Code changes and unit tests.
1. Added methods to download, extract and load glove vectors.
2. Added unit tests to test the public methods.

Other changes
 1. Made download and extract methods private.
 2. Refactored Word2vec unit tests to exclude private methods.
2019-05-09 14:58:10 -04:00
abeswara 8408d7cce2 Word2vec loader - Code changes and unit tests.
1. Refactored word2vec loader to perform existing file checks before downloading or extracting.

2. Added unit tests to load, download and extract functions.
2019-05-09 14:58:10 -04:00
Abhiram E 9895dd41d7 Reformatted files 2019-05-09 14:58:10 -04:00
Abhiram E 47ada0d03c Added support to download and extract word2vec pretrained vectors 2019-05-09 14:58:10 -04:00
Abhiram E 48adc4f619 Initial commit for word embeddings 2019-05-09 14:58:10 -04:00
miguelgfierro 3c3ce8c14a got timer from recommenders 2019-05-09 17:25:44 +01:00
Hong Lu 2af4d4a008 Moved notebooks to example folder. 2019-05-07 10:22:48 -04:00
Hong Lu 6e5b060e08 Added utils path to system path. 2019-05-07 10:20:03 -04:00