Liqun Shao
b670abbb88
add original code source to all the code
2019-06-17 16:57:08 -04:00
Liqun Shao
45811ef9b6
add aml explanation
2019-06-17 16:57:08 -04:00
Liqun Shao
349958bafb
1. correct typo in the notebook
...
2. add header to all the python files
3. add comments for train.py to explain what it does
2019-06-17 16:57:08 -04:00
Liqun Shao
01fdc9c82a
The HyperDrive will --> The HyperDrive run automatically shows...
2019-06-17 16:57:08 -04:00
Liqun Shao
9e6c4680fe
put create workspace in the first place
2019-06-17 16:57:08 -04:00
Liqun Shao
0c79d68381
remove all unnecessary labels
2019-06-17 16:57:08 -04:00
Liqun Shao
2c4cc44839
include imports
2019-06-17 16:57:08 -04:00
Liqun Shao
c7cc976063
1. Move similarity explaining to README
...
2. Separate model.py into two
3. Remove unnecessary imports
2019-06-17 16:57:08 -04:00
Liqun Shao
9e212bc832
add explanation on tuning results
2019-06-17 16:57:07 -04:00
Liqun Shao
635eab8cb6
change the name of saving model
2019-06-17 16:57:07 -04:00
Liqun Shao
b440661f6a
fix the bug for stopping the training
2019-06-17 16:57:07 -04:00
Liqun Shao
6f3c89f0a7
Fixed the bug on training not converging
2019-06-17 16:57:07 -04:00
Liqun Shao
ead07a6551
remove auto loader and save the best model state
2019-06-17 16:56:25 -04:00
Liqun Shao
786f8de629
change the stopping condition to when the validation loss is small
2019-06-17 16:56:25 -04:00
Liqun Shao
fde487ea89
use adam optimizer instead of SGD
2019-06-17 16:56:25 -04:00
Liqun Shao
c5687f9159
add README to gensen repo
2019-06-17 16:56:25 -04:00
Liqun Shao
3c43c229b2
Resolved conflicts
2019-06-17 16:56:25 -04:00
Liqun Shao
5137d91c46
add horovod distributed training to the gensen model and make the training stop with small validation loss
2019-06-17 16:56:25 -04:00
Abhiram E
2d5bfe6862
Refactored gensen related files by Maluuba
2019-06-17 16:56:24 -04:00
Liqun Shao
51ccc72cd3
first push
2019-06-17 16:56:24 -04:00
Said Bleik
f12aabd5b0
add xnli dataset utils
2019-06-17 12:25:01 -04:00
Hong Lu
dfb8553c5b
Resolved conflict and merged staging.
2019-06-17 12:01:10 -04:00
Said Bleik
a514025f5d
Merge pull request #99 from microsoft/bleik
...
ar TC example
2019-06-14 16:21:03 -04:00
Said Bleik
0929c37d56
removed unused arg
2019-06-14 16:18:50 -04:00
Said Bleik
b0ead86bf2
added dataset utils
2019-06-14 15:11:17 -04:00
Said Bleik
51c22b9607
minor fixes
2019-06-13 22:26:29 -04:00
Said Bleik
9e85c2923f
added missing imports
2019-06-13 15:30:53 -04:00
Hong Lu
4de4ece15c
Changed python version in pre-commit-config back to 3.6
2019-06-13 14:46:57 -04:00
Ubuntu
5438d76596
Added test code for NER utils.
2019-06-13 18:25:27 +00:00
Hong Lu
6d671b6221
Started adding test code for NER.
2019-06-12 15:33:12 -04:00
Casey Hong
e031d3d225
suppress nltk messages
2019-06-12 12:37:43 -04:00
Said Bleik
98a7071294
Merge pull request #85 from microsoft/casey-senteval
...
SentEval examples (local and with azureml support)
2019-06-11 14:39:51 -04:00
Casey Hong
ba3ba5b5a8
use azureml_utils for workspace creation
2019-06-11 13:05:39 -04:00
Casey Hong
e5b12c6f32
resolve merge conflicts
2019-06-11 11:45:30 -04:00
Chaoyu Guan
f0d6a2f55c
rename files and revise README for #62
2019-06-08 13:15:23 +00:00
Chaoyu Guan
f4f3591668
add explain-NLP-model part for issue #62
2019-06-08 12:51:44 +00:00
Hong Lu
26fcc3cbe4
Added random seed option to wikigold util function.
2019-06-07 17:32:49 -04:00
Hong Lu
049ddf6442
Added BERT prefix to classifier names and some minor docstring updates.
2019-06-07 17:08:29 -04:00
Hong Lu
4e7ac8adc1
Minor updates in token classifier.
2019-06-07 10:56:49 -04:00
Hong Lu
e40e9636f3
Removed old data utils script.
2019-06-07 10:42:27 -04:00
Hong Lu
a8feb91a89
Removed common_ner.py
2019-06-07 10:35:12 -04:00
Hong Lu
2593620633
Added utility functions for token classification.
2019-06-07 10:34:09 -04:00
Hong Lu
fbf15e64c6
Merge remote-tracking branch 'origin/staging' into hlu/BERT_NER_utils
2019-06-06 16:45:09 -04:00
Said Bleik
b040c481eb
Merge pull request #86 from microsoft/abhiram-requests-fix
...
Minor fix suggested in Recommenders repo
2019-06-06 16:03:59 -04:00
Abhiram E
802188e115
Minor fix suggested in Recommenders repo
2019-06-06 15:46:17 -04:00
Casey Hong
23d9635230
senteval local and azureml 📓
2019-06-06 10:57:05 -07:00
Abhiram E
f0db07fb3a
Minor change.
2019-06-06 10:20:57 -07:00
Abhiram E
5b1ed5f447
FastText loader - Code changes and unit tests.
...
1. Added methods to download, extract and load glove vectors.
2. Added unit test to test the public method.
Other changes
1. Refactored files to add return types to docstrings.
2. Minor changes to path variables.
2019-06-06 10:20:57 -07:00
Abhiram E
2498dbaaa1
Minor changes
2019-06-06 10:18:13 -07:00
abeswara
008bfa2c57
Glove loader - Code changes and unit tests.
...
1. Added methods to download, extract and load glove vectors.
2. Added unit tests to test the public methods.
Other changes
1. Made download and extract methods private.
2. Refactored Word2vec unit tests to exclude private methods.
2019-06-06 10:16:46 -07:00
abeswara
ae31e05a84
Word2vec loader - Code changes and unit tests.
...
1. Refactored word2vec loader to perform existing file checks before downloading or extracting.
2. Added unit tests to load, download and extract functions.
2019-06-06 10:12:29 -07:00
Said Bleik
9269ef5482
merge staging
2019-06-06 13:01:07 -04:00
Said Bleik
c518d6a735
updated tc notebook and some utils
2019-06-05 21:37:16 -04:00
Abhiram E
3ac927edfa
Using tqdm to show progress bar
2019-06-05 13:08:23 -04:00
Abhiram E
0e296b6291
Changed url fetch from urlretrieve to requests
2019-06-04 16:26:35 -04:00
Said Bleik
ee9134d96f
minor updates to seq classification
2019-06-03 10:03:49 -04:00
Hong Lu
9bcad55d20
Updated NER notebook with wikigold data.
2019-05-31 18:44:01 -04:00
Said Bleik
61b66a57aa
updated device utils
2019-05-31 16:08:36 -04:00
Hong Lu
320b08d9af
Removed BERT image.
2019-05-31 13:46:34 -04:00
Said Bleik
1a96bce557
added missing assignment
2019-05-30 10:42:30 -04:00
Hong Lu
aaf0114cd7
Removed old scripts.
2019-05-29 14:57:23 -04:00
Hong Lu
52bd027555
Added helper function for postprocessing token classification results.
2019-05-29 14:39:58 -04:00
Said Bleik
5a81055e70
updated device utils and bert seq classifier
2019-05-28 23:16:19 -04:00
Abhiram E
36d7411bec
Fix to limit the memory usage when using fasttext embedding loaders. Code changes to use the simpler version
2019-05-28 12:04:57 -04:00
Hong Lu
52cc16fb9b
Updated token classifier api.
2019-05-24 18:09:56 -04:00
Hong Lu
5258c9cd7e
Added some utility functions to the common script. Will be merged with common.py later.
2019-05-24 18:09:04 -04:00
Casey Hong
1cd36ccff7
fix snli noblank bug and add preprocessing tests
2019-05-21 23:00:56 -04:00
Said Bleik
63e546ab3c
updated preprocessing, utils, classification
2019-05-21 16:45:23 -04:00
Hong Lu
2473e1a75c
Black auto formatting.
2019-05-20 18:53:57 -04:00
Hong Lu
3d1c1862d9
Removed old data utils script.
2019-05-20 14:08:39 -04:00
Hong Lu
4a41ec41e8
Added a constant file.
2019-05-20 14:00:12 -04:00
Hong Lu
1393c74fb3
Minor updates for data class updates.
2019-05-20 13:59:38 -04:00
Hong Lu
9919a7bd35
Removed InputFeature class. Use namedtuple instead of class for input data.
2019-05-20 13:58:54 -04:00
Hong Lu
e81138ad08
Changed optimizer and number of epochs configuration.
2019-05-20 13:58:16 -04:00
Said Bleik
49bb116474
update seq classifier
2019-05-17 10:04:46 -04:00
Hong Lu
eef85dea41
Consolidated all configuration classes into a single class.
2019-05-16 18:11:21 -04:00
Hong Lu
7ca29691ae
Consolidated some utility functions into BertTokenClassifier.
2019-05-16 18:10:47 -04:00
Hong Lu
d87dfbc2af
Minor edits and added docstring.
2019-05-16 18:10:14 -04:00
Hong Lu
14543fbd52
Added yaml configuration file for NER example.
2019-05-16 18:08:50 -04:00
Abhiram E
52d720e9bf
Added option to limit number of word vectors for glove and word2vec
2019-05-15 00:22:37 -04:00
Janhavi Mahajan
1ed2c4dc0a
feat(bug fix) updated snli notebook with to_lowercase_all() instead of to_lowercase() that expects a column name list. Fixed None object returning in to_lowercase when column name list is not passed
2019-05-13 18:14:31 -04:00
Said Bleik
e9c17a961e
update BERTSequenceClassifier and notebook
2019-05-13 15:18:21 -04:00
Said Bleik
7430e3b178
updated BERTSequenceClassifier + documentation
2019-05-13 14:38:54 -04:00
Said Bleik
7d2d74f975
BERTSequenceClassifier
2019-05-13 16:31:58 +00:00
Janhavi Mahajan
bb5764a56a
feat(code fix) rm_nltk_stop_words now expects sentences and stop_word column names
2019-05-10 16:50:34 -04:00
Janhavi Mahajan
197d771208
feat(code review comments) generalize nltk utils tokenize, remove_sto_words to more than 2 sentences
2019-05-10 16:27:48 -04:00
Janhavi Mahajan
6e3523810a
feat(code review) fix to_nltk_tokens, add to_lowercase_all and to_lowercase as per Said's comments
2019-05-10 16:27:48 -04:00
Abhiram E
49595b8666
Moved urls to module constants for pretrained embedding utils.
2019-05-09 14:58:12 -04:00
Casey Hong
faf924b45b
token_cols bugfix
2019-05-09 14:58:11 -04:00
Abhiram E
6ba272308b
Minor change.
2019-05-09 14:58:11 -04:00
Abhiram E
2502d91e1b
FastText loader - Code changes and unit tests.
...
1. Added methods to download, extract and load glove vectors.
2. Added unit test to test the public method.
Other changes
1. Refactored files to add return types to docstrings.
2. Minor changes to path variables.
2019-05-09 14:58:11 -04:00
Abhiram E
4e480026a0
Minor changes
2019-05-09 14:58:10 -04:00
abeswara
8025b4449d
Glove loader - Code changes and unit tests.
...
1. Added methods to download, extract and load glove vectors.
2. Added unit tests to test the public methods.
Other changes
1. Made download and extract methods private.
2. Refactored Word2vec unit tests to exclude private methods.
2019-05-09 14:58:10 -04:00
abeswara
8408d7cce2
Word2vec loader - Code changes and unit tests.
...
1. Refactored word2vec loader to perform existing file checks before downloading or extracting.
2. Added unit tests to load, download and extract functions.
2019-05-09 14:58:10 -04:00
Abhiram E
9895dd41d7
Reformatted files
2019-05-09 14:58:10 -04:00
Abhiram E
47ada0d03c
Added support to download and extract word2vec pretrained vectors
2019-05-09 14:58:10 -04:00
Abhiram E
48adc4f619
Initial commit for word embeddings
2019-05-09 14:58:10 -04:00
miguelgfierro
3c3ce8c14a
got timer from recommenders
2019-05-09 17:25:44 +01:00
Hong Lu
2af4d4a008
Moved notebooks to example folder.
2019-05-07 10:22:48 -04:00
Hong Lu
6e5b060e08
Added utils path to system path.
2019-05-07 10:20:03 -04:00
Hong Lu
bd4e805733
Updates to expose BERT objects to the user.
2019-05-07 10:01:55 -04:00
Said Bleik
23dad01abb
Merge pull request #35 from Microsoft/maidap-sentence-similarity
...
Sentence Similarity Datasets with New Folder Structure
2019-05-03 20:49:18 -04:00
Casey Hong
d65afe27f8
make colnames args in preprocess
2019-05-03 16:47:43 -04:00
Hong Lu
b15b0a4dfd
Fixed a few minor issues found during testing.
2019-05-02 17:59:47 -04:00
abeswara
84ac44cbc0
Resolved code review comments
2019-05-02 12:06:52 -04:00
Hong Lu
d5ee6d46cb
Initial check in of bert utility functions.
2019-05-02 10:50:30 -04:00
Said Bleik
10adf59777
update env, yahoo_answers, & classification eval
2019-05-01 22:49:41 +00:00
Janhavi Mahajan
338e606c5e
feat(code review comments) refactoring based on Miguel's comments
2019-05-01 18:40:44 -04:00
Casey Hong
810beb6f2c
organize stsbenchmark under new folder structure
2019-05-01 18:35:02 -04:00
Casey Hong
25a176b2cc
rm_stopwords suffix
2019-04-30 15:05:17 -04:00
Said Bleik
757e7d063d
Merge pull request #28 from Microsoft/maidap-sentence-similarity
...
Sentence similarity dataset
2019-04-30 12:26:04 -04:00
Said Bleik
f2467d5286
folder structure & example utils
2019-04-30 15:51:47 +00:00
Casey Hong
dc4eac5aee
refactor for consistency between snli <=> sts notebooks, add gensen-specific preprocessing for snli
2019-04-29 14:59:52 -04:00
Casey Hong
1aa60a3a00
begin snli-sts consistency refactoring
2019-04-29 14:59:52 -04:00
Janhavi Mahajan
1498bfb853
feat(code refactoring) moving code around as per the new structure decided.
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
ba2ad0cbfa
feat(code reformat) deleted snli from util_nlp
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
f0070819ea
feat(code reformat) Formatting code based on new folder structure
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
4aadf66654
feat(code reformat) moved nltk utils to preprocess.py
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
faa26b3c54
feat(doc strings) fixed doc string format
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
88e5a3d724
feat(code format) formatted file with black
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
c969085424
feat(code format) added doc strings, rewrite clean_snli function
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
44db348fe5
feat(data prep) save dataframe to csv and renamed folder from nltk to nltk_utils
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
6e46eade15
feat(data_prep) SNLI notebook showcasing data prep, Corrected nltk util for column_name
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
3964c04a7c
feat(data prep) NLTK tokenizer util file and notebook, deleted some redundant files, updated snli util with cleaner data prep functions
2019-04-26 12:11:55 -04:00
Janhavi Mahajan
f7b487cfbd
feat(data_prep) Added SNLI dataset prep utility
2019-04-26 12:11:55 -04:00
Abhiram E
84443d478c
Refactored STS notebooks, updated utils_nlp files with the latest code from utils_ss and deleted utils_ss
2019-04-24 17:16:06 -04:00
Abhiram E
ffb38ea42b
Refactored code according to new structure, moved files and modified imports
2019-04-24 15:33:41 -04:00
Abhiram E
d4db5a1860
Resolving code review comments.
...
1. Refactored and renamed msrpc_load notebook.
2. Removed redundant parameter to load_pandas_df function
2019-04-24 15:05:53 -04:00
Abhiram E
f66ee268c0
Refactoring changes to MSRPC
2019-04-24 15:05:52 -04:00
Abhiram E
b9fce4ae61
Notebooks and Tests
...
1. Added Jupyter Notebook for MSR-PC dataset quickstart task
2. Added unit tests for downloading the dataset and loading pandas df
3. Changes to MSRPC to take in path to the dataset if it already exists.
2019-04-24 15:05:00 -04:00
Abhiram E
ac0abdfd61
Data loader for MSR PC
...
1. Added data downloader for MSR PC
2. Added support to clean data and load specified datasets as a
pandas dataframe.
3. Updates to environment.yml for newly added packages.
2019-04-24 15:03:41 -04:00
Casey Hong
d20081766d
Add preprocessing notebook
2019-04-24 15:02:26 -04:00
Casey Hong
abacb5d022
Add tokenization with spacy
2019-04-24 15:02:26 -04:00
Casey Hong
b2bed84e0d
Include score column in dataframe
2019-04-24 15:02:26 -04:00
Casey Hong
f06630a55d
Download and clean stsbenchmark data
2019-04-24 15:02:26 -04:00
Casey Hong
819f0a215b
moving files to the sentence_similarity scenario directory
2019-04-24 13:54:53 -04:00
Casey Hong
6793a77608
clip docstring line length at 120
2019-04-22 17:47:59 -04:00
Casey Hong
81980e9eb6
Add and format docstrings
2019-04-22 17:47:59 -04:00
Casey Hong
42a9c11ac7
Add docstrings
2019-04-22 14:14:12 -04:00
Casey Hong
b31b7c3b13
Fix merge conflicts for rebase
2019-04-22 14:14:12 -04:00
Casey Hong
7176d7812e
Create sentence similarity branch
2019-04-18 15:10:46 -04:00
miguelgfierro
2effbfcfcb
cleaning
2019-04-16 19:53:15 +01:00
Richin Jain
2c5b8e587e
Initial commit to put the recipe template in
2019-04-05 13:55:58 -04:00