Mirror of https://github.com/microsoft/LoRA.git
initial commit
This commit is contained in:
Commit
2eb74c2dc8
|
@@ -0,0 +1,135 @@
|
|||
# all files with SCRATCH prefix and large model / metric files
|
||||
**/SCRATCH*
|
||||
/pretrained_checkpoints
|
||||
/trained_models
|
||||
eval/e2e
|
||||
eval/GenerationEval
|
||||
venv
|
||||
trained_models
|
||||
pretrained_checkpoints
|
||||
.*
|
||||
*~*
|
||||
|
||||
# Byte-compiled / optimized / DLL files
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
|
||||
# C extensions
|
||||
*.so
|
||||
|
||||
# Distribution / packaging
|
||||
.Python
|
||||
build/
|
||||
develop-eggs/
|
||||
dist/
|
||||
downloads/
|
||||
eggs/
|
||||
.eggs/
|
||||
lib/
|
||||
lib64/
|
||||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
MANIFEST
|
||||
|
||||
# PyInstaller
|
||||
# Usually these files are written by a python script from a template
|
||||
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
||||
*.manifest
|
||||
|
||||
# Installer logs
|
||||
pip-log.txt
|
||||
pip-delete-this-directory.txt
|
||||
|
||||
# Unit test / coverage reports
|
||||
htmlcov/
|
||||
.tox/
|
||||
.nox/
|
||||
.coverage
|
||||
.coverage.*
|
||||
.cache
|
||||
nosetests.xml
|
||||
coverage.xml
|
||||
*.cover
|
||||
.hypothesis/
|
||||
.pytest_cache/
|
||||
|
||||
# Translations
|
||||
*.mo
|
||||
*.pot
|
||||
|
||||
# Django stuff:
|
||||
*.log
|
||||
local_settings.py
|
||||
db.sqlite3
|
||||
|
||||
# Flask stuff:
|
||||
instance/
|
||||
.webassets-cache
|
||||
|
||||
# Scrapy stuff:
|
||||
.scrapy
|
||||
|
||||
# Sphinx documentation
|
||||
docs/_build/
|
||||
|
||||
# PyBuilder
|
||||
target/
|
||||
|
||||
# Jupyter Notebook
|
||||
.ipynb_checkpoints
|
||||
|
||||
# IPython
|
||||
profile_default/
|
||||
ipython_config.py
|
||||
|
||||
# pyenv
|
||||
.python-version
|
||||
|
||||
# celery beat schedule file
|
||||
celerybeat-schedule
|
||||
|
||||
# SageMath parsed files
|
||||
*.sage.py
|
||||
|
||||
# Environments
|
||||
.env
|
||||
.venv
|
||||
env/
|
||||
venv/
|
||||
ENV/
|
||||
env.bak/
|
||||
venv.bak/
|
||||
.vscode/
|
||||
|
||||
# Spyder project settings
|
||||
.spyderproject
|
||||
.spyproject
|
||||
|
||||
# Rope project settings
|
||||
.ropeproject
|
||||
|
||||
# mkdocs documentation
|
||||
/site
|
||||
|
||||
# mypy
|
||||
.mypy_cache/
|
||||
.dmypy.json
|
||||
dmypy.json
|
||||
|
||||
# Pyre type checker
|
||||
.pyre/
|
||||
.idea/
|
||||
toy*.py
|
||||
.DS_Store
|
||||
toy*.py
|
||||
post/
|
||||
*.user
|
||||
*.nupkg
|
||||
/Packages
|
||||
|
|
@@ -0,0 +1,9 @@
|
|||
# Microsoft Open Source Code of Conduct
|
||||
|
||||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
|
||||
|
||||
Resources:
|
||||
|
||||
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
|
||||
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
|
||||
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
|
|
@@ -0,0 +1,21 @@
|
|||
MIT License
|
||||
|
||||
Copyright (c) Microsoft Corporation.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
|
@@ -0,0 +1,182 @@
|
|||
# Adapting GPT-2 using LoRA
|
||||
|
||||
This folder contains the implementation of LoRA in GPT-2 using the Python package `lora` and steps to replicate the results in our recent paper
|
||||
|
||||
**LoRA: Low-Rank Adaptation of Large Language Models** <br>
|
||||
*Edward J. Hu\*, Yelong Shen\*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Weizhu Chen* <br>
|
||||
Paper: https://arxiv.org/abs/2106.09685 <br>
|
||||
|
||||
<p>
|
||||
<img src="figures/LoRA_GPT2.PNG" width="800" >
|
||||
</p>
|
||||
|
||||
This repo reproduces our experiments on GPT-2.
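
The training commands below expose the LoRA-specific flags `--lora_dim` (the rank r) and `--lora_alpha` (the scaling numerator). As a minimal, self-contained sketch of what these control — illustrative only, and not the actual API of the `lora` package — the pretrained weight is frozen and a trainable low-rank update scaled by alpha/r is added on top:
```
# Illustrative sketch of the LoRA update; not the `lora` package's API.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=4, alpha=32):
        super().__init__()
        # Frozen pretrained weight W0 (random here only for illustration).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable low-rank factors; B starts at zero so training begins exactly at W0.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W0^T + (alpha / r) * x A^T B^T, i.e. (W0 + scaling * B A) applied to x
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```
Only the low-rank factors receive gradients, so each adapted weight adds just r * (d_in + d_out) trainable parameters.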
|
||||
|
||||
## Repository Overview
|
||||
|
||||
Our implementation is based on the fine-tuning code for GPT-2 in [Hugging Face](https://huggingface.co/).
|
||||
There are several directories in this repo:
|
||||
* [src/](src) contains the source code used for data processing, training, and decoding.
|
||||
* [eval/](eval) contains the code for task-specific evaluation scripts.
|
||||
* [data/](data) contains the raw data we used in our experiments.
|
||||
* [vocab/](vocab) contains the GPT-2 vocabulary files.
|
||||
|
||||
## Getting Started
|
||||
|
||||
1. You can start with the following Docker image on a GPU-capable machine: `nvcr.io/nvidia/pytorch:20.03-py3`. Any generic PyTorch image should also work.
|
||||
```
|
||||
docker run -it nvcr.io/nvidia/pytorch:20.03-py3
|
||||
```
|
||||
|
||||
2. Clone the repo and install dependencies in a virtual environment (remove `sudo` if running inside a Docker container):
|
||||
```
|
||||
sudo apt-get update
|
||||
sudo apt-get -y install git jq virtualenv
|
||||
git clone https://github.com/microsoft/LoRA.git; cd LoRA
|
||||
virtualenv -p `which python3` ./venv
|
||||
. ./venv/bin/activate
|
||||
pip install -r requirement.txt
|
||||
bash download_pretrained_checkpoints.sh
|
||||
bash create_datasets.sh
|
||||
cd ./eval
|
||||
bash download_evalscript.sh
|
||||
cd ..
|
||||
```
|
||||
|
||||
#### Now we are ready to replicate the results in our paper.
|
||||
|
||||
## Replicating Our Result on E2E
|
||||
|
||||
1. Train GPT-2 Medium with LoRA (see our paper for the GPT-2 Medium hyperparameters)
|
||||
```
|
||||
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
|
||||
--train_data ./data/e2e/train.jsonl \
|
||||
--valid_data ./data/e2e/valid.jsonl \
|
||||
--train_batch_size 8 \
|
||||
--grad_acc 1 \
|
||||
--valid_batch_size 4 \
|
||||
--seq_len 512 \
|
||||
--model_card gpt2.md \
|
||||
--init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
|
||||
--platform local \
|
||||
--clip 0.0 \
|
||||
--lr 0.0002 \
|
||||
--weight_decay 0.01 \
|
||||
--correct_bias \
|
||||
--adam_beta2 0.999 \
|
||||
--scheduler linear \
|
||||
--warmup_step 500 \
|
||||
--max_epoch 5 \
|
||||
--save_interval 1000 \
|
||||
--lora_dim 4 \
|
||||
--lora_alpha 32 \
|
||||
--lora_dropout 0.1 \
|
||||
--label_smooth 0.1 \
|
||||
--work_dir ./trained_models/GPT2_M/e2e \
|
||||
--random_seed 110
|
||||
```
|
||||
|
||||
2. Generate outputs from the trained model using beam search:
|
||||
```
|
||||
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
|
||||
--data ./data/e2e/test.jsonl \
|
||||
--batch_size 1 \
|
||||
--seq_len 512 \
|
||||
--eval_len 64 \
|
||||
--model_card gpt2.md \
|
||||
--init_checkpoint ./trained_models/GPT2_M/e2e/model.26289.pt \
|
||||
--platform local \
|
||||
--lora_dim 4 \
|
||||
--lora_alpha 32 \
|
||||
--beam 10 \
|
||||
--length_penalty 0.8 \
|
||||
--no_repeat_ngram_size 4 \
|
||||
--repetition_penalty 1.0 \
|
||||
--eos_token_id 628 \
|
||||
--work_dir ./trained_models/GPT2_M/e2e \
|
||||
--output_file predict.26289.b10p08r4.jsonl
|
||||
```
|
||||
|
||||
3. Decode outputs from step (2)
|
||||
```
|
||||
python src/gpt2_decode.py \
|
||||
--vocab ./vocab \
|
||||
--sample_file ./trained_models/GPT2_M/e2e/predict.26289.b10p08r4.jsonl \
|
||||
--input_file ./data/e2e/test_formatted.jsonl \
|
||||
--output_ref_file e2e_ref.txt \
|
||||
--output_pred_file e2e_pred.txt
|
||||
```
|
||||
|
||||
4. Run evaluation on E2E test set
|
||||
|
||||
```
|
||||
python eval/e2e/measure_scores.py e2e_ref.txt e2e_pred.txt -p
|
||||
```
|
||||
|
||||
## Replicating Our Result on WebNLG
|
||||
|
||||
1. Follow steps 1 and 2 from the E2E pipeline, replacing references to E2E with WebNLG (see our paper for the WebNLG hyperparameters). An illustrative training command is sketched below.
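
For illustration only — the data paths follow `create_datasets.sh`, and the remaining flag values are carried over unchanged from the E2E command above rather than taken from the paper:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
--train_data ./data/webnlg_challenge_2017/train.jsonl \
--valid_data ./data/webnlg_challenge_2017/valid.jsonl \
--train_batch_size 8 \
--grad_acc 1 \
--valid_batch_size 4 \
--seq_len 512 \
--model_card gpt2.md \
--init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
--platform local \
--clip 0.0 \
--lr 0.0002 \
--weight_decay 0.01 \
--correct_bias \
--adam_beta2 0.999 \
--scheduler linear \
--warmup_step 500 \
--max_epoch 5 \
--save_interval 1000 \
--lora_dim 4 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--label_smooth 0.1 \
--work_dir ./trained_models/GPT2_M/webnlg \
--random_seed 110
```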
|
||||
|
||||
2. Decode outputs from beam search (step 2 above)
|
||||
```
|
||||
python src/gpt2_decode.py \
|
||||
--vocab ./vocab \
|
||||
--sample_file ./trained_models/GPT2_M/webnlg/predict.20000.b10p08.jsonl \
|
||||
--input_file ./data/webnlg_challenge_2017/test_formatted.jsonl \
|
||||
--ref_type webnlg \
|
||||
--ref_num 6 \
|
||||
--output_ref_file eval/GenerationEval/data/references_webnlg \
|
||||
--output_pred_file eval/GenerationEval/data/hypothesis_webnlg \
|
||||
--tokenize --lower
|
||||
```
|
||||
|
||||
3. Run evaluation on WebNLG test set
|
||||
```
|
||||
cd ./eval/GenerationEval/
|
||||
python eval.py \
|
||||
-R data/references_webnlg/reference \
|
||||
-H data/hypothesis_webnlg \
|
||||
-nr 6 \
|
||||
-m bleu,meteor,ter
|
||||
cd ../..
|
||||
```
|
||||
|
||||
## Replicating Our Result on DART
|
||||
|
||||
1. Follow steps 1 and 2 from the E2E pipeline, replacing references to E2E with DART (see our paper for the DART hyperparameters). An illustrative training command is sketched below.
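
For illustration only — the data paths follow `create_datasets.sh`, and the remaining flag values are carried over unchanged from the E2E command above rather than taken from the paper:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
--train_data ./data/dart/train.jsonl \
--valid_data ./data/dart/valid.jsonl \
--train_batch_size 8 \
--grad_acc 1 \
--valid_batch_size 4 \
--seq_len 512 \
--model_card gpt2.md \
--init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
--platform local \
--clip 0.0 \
--lr 0.0002 \
--weight_decay 0.01 \
--correct_bias \
--adam_beta2 0.999 \
--scheduler linear \
--warmup_step 500 \
--max_epoch 5 \
--save_interval 1000 \
--lora_dim 4 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--label_smooth 0.1 \
--work_dir ./trained_models/GPT2_M/dart \
--random_seed 110
```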
|
||||
|
||||
2. Decode outputs from beam search (step 2 above)
|
||||
```
|
||||
python src/gpt2_decode.py \
|
||||
--vocab ./vocab \
|
||||
--sample_file ./trained_models/GPT2_M/dart/predict.20000.b10p08.jsonl \
|
||||
--input_file ./data/dart/test_formatted.jsonl \
|
||||
--ref_type dart \
|
||||
--ref_num 6 \
|
||||
--output_ref_file eval/GenerationEval/data/references_dart \
|
||||
--output_pred_file eval/GenerationEval/data/hypothesis_dart \
|
||||
--tokenize --lower
|
||||
```
|
||||
|
||||
3. Run evaluation on the DART test set
|
||||
```
|
||||
cd ./eval/GenerationEval/
|
||||
python eval.py \
|
||||
-R data/references_dart/reference \
|
||||
-H data/hypothesis_dart \
|
||||
-nr 6 \
|
||||
-m bleu,meteor,ter
|
||||
cd ../..
|
||||
```
|
||||
|
||||
## Citation
|
||||
```
|
||||
@misc{hu2021lora,
|
||||
title={LoRA: Low-Rank Adaptation of Large Language Models},
|
||||
author={Hu, Edward and Shen, Yelong and Wallis, Phil and Allen-Zhu, Zeyuan and Li, Yuanzhi and Chen, Weizhu},
|
||||
year={2021},
|
||||
eprint={2106.09685},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
```
|
|
@@ -0,0 +1,41 @@
|
|||
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.5 BLOCK -->
|
||||
|
||||
## Security
|
||||
|
||||
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
|
||||
|
||||
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
|
||||
|
||||
## Reporting Security Issues
|
||||
|
||||
**Please do not report security vulnerabilities through public GitHub issues.**
|
||||
|
||||
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
|
||||
|
||||
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
|
||||
|
||||
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
|
||||
|
||||
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
|
||||
|
||||
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
|
||||
* Full paths of source file(s) related to the manifestation of the issue
|
||||
* The location of the affected source code (tag/branch/commit or direct URL)
|
||||
* Any special configuration required to reproduce the issue
|
||||
* Step-by-step instructions to reproduce the issue
|
||||
* Proof-of-concept or exploit code (if possible)
|
||||
* Impact of the issue, including how an attacker might exploit the issue
|
||||
|
||||
This information will help us triage your report more quickly.
|
||||
|
||||
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
|
||||
|
||||
## Preferred Languages
|
||||
|
||||
We prefer all communications to be in English.
|
||||
|
||||
## Policy
|
||||
|
||||
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
|
||||
|
||||
<!-- END MICROSOFT SECURITY.MD BLOCK -->
|
|
@@ -0,0 +1,44 @@
|
|||
#!/bin/bash
|
||||
|
||||
echo "creating e2e datasets..."
|
||||
path=data/e2e
|
||||
echo "train..."
|
||||
python src/format_converting_e2e.py $path/train.txt $path/train_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos
|
||||
echo "test..."
|
||||
python src/format_converting_e2e.py $path/test.txt $path/test_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos
|
||||
|
||||
echo "valid..."
|
||||
python src/format_converting_e2e.py $path/valid.txt $path/valid_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos
|
||||
|
||||
echo "creating webnlg datasets..."
|
||||
path=data/webnlg_challenge_2017
|
||||
echo "train..."
|
||||
python src/format_converting_webnlg.py $path/train.json $path/train_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos
|
||||
|
||||
echo "test..."
|
||||
python src/format_converting_webnlg.py $path/test.json $path/test_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos
|
||||
|
||||
echo "valid..."
|
||||
python src/format_converting_webnlg.py $path/dev.json $path/valid_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos
|
||||
|
||||
echo "creating dart datasets..."
|
||||
path=data/dart
|
||||
echo "train..."
|
||||
python src/format_converting_dart.py data/dart/dart-v1.1.1-full-train.json data/dart/train_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos
|
||||
|
||||
echo "test..."
|
||||
python src/format_converting_dart.py data/dart/dart-v1.1.1-full-test.json data/dart/test_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos
|
||||
|
||||
echo "valid..."
|
||||
python src/format_converting_dart.py data/dart/dart-v1.1.1-full-dev.json data/dart/valid_formatted.jsonl
|
||||
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos
|
||||
|
||||
echo "script complete!"
|
The diffs of nine large files are not shown because of their size.
|
@@ -0,0 +1,11 @@
|
|||
#!/bin/bash
|
||||
|
||||
echo "downloading pretrained model checkpoints..."
|
||||
mkdir pretrained_checkpoints
|
||||
cd pretrained_checkpoints
|
||||
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
|
||||
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin
|
||||
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin
|
||||
cd ..
|
||||
|
||||
echo "script complete!"
|
|
@@ -0,0 +1,7 @@
|
|||
### Evaluation code for E2E, WebNLG, and DART
|
||||
|
||||
* Code for evaluating E2E: https://github.com/tuetschek/e2e-metrics
|
||||
* Code for evaluating WebNLG and DART: https://github.com/WebNLG/GenerationEval.git
|
||||
|
||||
Before running evaluation for the first time, you must run
|
||||
`bash download_evalscript.sh`
|
|
@@ -0,0 +1,18 @@
|
|||
#!/bin/bash
|
||||
|
||||
cd eval
|
||||
echo "installing evaluation dependencies"
|
||||
echo "downloading e2e-metrics..."
|
||||
git clone https://github.com/tuetschek/e2e-metrics e2e
|
||||
pip install -r e2e/requirements.txt
|
||||
|
||||
echo "downloading GenerationEval for webnlg and dart..."
|
||||
git clone https://github.com/WebNLG/GenerationEval.git
|
||||
cd GenerationEval
|
||||
./install_dependencies.sh
|
||||
rm -r data/en
|
||||
rm -r data/ru
|
||||
cd ..
|
||||
mv eval.py GenerationEval/
|
||||
|
||||
echo "script complete!"
|
|
@@ -0,0 +1,364 @@
|
|||
__author__='thiagocastroferreira'
|
||||
|
||||
"""
|
||||
Author: Organizers of the 2nd WebNLG Challenge
|
||||
Date: 23/04/2020
|
||||
Description:
|
||||
This script aims to evaluate the output of data-to-text NLG models by
|
||||
computing popular automatic metrics such as BLEU (two implementations),
|
||||
METEOR, chrF++, TER and BERT-Score.
|
||||
|
||||
ARGS:
|
||||
usage: eval.py [-h] -R REFERENCE -H HYPOTHESIS [-lng LANGUAGE] [-nr NUM_REFS]
|
||||
[-m METRICS] [-nc NCORDER] [-nw NWORDER] [-b BETA]
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-R REFERENCE, --reference REFERENCE
|
||||
reference translation
|
||||
-H HYPOTHESIS, --hypothesis HYPOTHESIS
|
||||
hypothesis translation
|
||||
-lng LANGUAGE, --language LANGUAGE
|
||||
evaluated language
|
||||
-nr NUM_REFS, --num_refs NUM_REFS
|
||||
number of references
|
||||
-m METRICS, --metrics METRICS
|
||||
evaluation metrics to be computed
|
||||
-nc NCORDER, --ncorder NCORDER
|
||||
chrF metric: character n-gram order (default=6)
|
||||
-nw NWORDER, --nworder NWORDER
|
||||
chrF metric: word n-gram order (default=2)
|
||||
-b BETA, --beta BETA chrF metric: beta parameter (default=2)
|
||||
|
||||
EXAMPLE:
|
||||
ENGLISH:
|
||||
python3 eval.py -R data/en/references/reference -H data/en/hypothesis -nr 4 -m bleu,meteor,chrf++,ter,bert,bleurt
|
||||
RUSSIAN:
|
||||
python3 eval.py -R data/ru/reference -H data/ru/hypothesis -lng ru -nr 1 -m bleu,meteor,chrf++,ter,bert
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
import codecs
|
||||
import copy
|
||||
import os
|
||||
import pyter
|
||||
import logging
|
||||
import nltk
|
||||
import subprocess
|
||||
import re
|
||||
|
||||
from bert_score import score
|
||||
from metrics.chrF import computeChrF
|
||||
from metrics.bleurt.bleurt import score as bleurt_score
|
||||
|
||||
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
|
||||
from razdel import tokenize
|
||||
from tabulate import tabulate
|
||||
|
||||
BLEU_PATH = 'metrics/multi-bleu-detok.perl'
|
||||
METEOR_PATH = 'metrics/meteor-1.5/meteor-1.5.jar'
|
||||
|
||||
|
||||
def parse(refs_path, hyps_path, num_refs, lng='en'):
|
||||
logging.info('STARTING TO PARSE INPUTS...')
|
||||
print('STARTING TO PARSE INPUTS...')
|
||||
# references
|
||||
references = []
|
||||
for i in range(num_refs):
|
||||
fname = refs_path + str(i) if num_refs > 1 else refs_path
|
||||
with codecs.open(fname, 'r', 'utf-8') as f:
|
||||
texts = f.read().split('\n')
|
||||
for j, text in enumerate(texts):
|
||||
if len(references) <= j:
|
||||
references.append([text])
|
||||
else:
|
||||
references[j].append(text)
|
||||
|
||||
# references tokenized
|
||||
references_tok = copy.copy(references)
|
||||
for i, refs in enumerate(references_tok):
|
||||
if lng == 'ru':
|
||||
references_tok[i] = [' '.join([_.text for _ in tokenize(ref)]) for ref in refs]
|
||||
else:
|
||||
references_tok[i] = [' '.join(nltk.word_tokenize(ref)) for ref in refs]
|
||||
|
||||
# hypothesis
|
||||
with codecs.open(hyps_path, 'r', 'utf-8') as f:
|
||||
hypothesis = f.read().split('\n')
|
||||
|
||||
# hypothesis tokenized
|
||||
hypothesis_tok = copy.copy(hypothesis)
|
||||
if lng == 'ru':
|
||||
hypothesis_tok = [' '.join([_.text for _ in tokenize(hyp)]) for hyp in hypothesis_tok]
|
||||
else:
|
||||
hypothesis_tok = [' '.join(nltk.word_tokenize(hyp)) for hyp in hypothesis_tok]
|
||||
|
||||
|
||||
logging.info('FINISHING TO PARSE INPUTS...')
|
||||
print('FINISHING TO PARSE INPUTS...')
|
||||
return references, references_tok, hypothesis, hypothesis_tok
|
||||
|
||||
def bleu_score(refs_path, hyps_path, num_refs):
|
||||
logging.info('STARTING TO COMPUTE BLEU...')
|
||||
print('STARTING TO COMPUTE BLEU...')
|
||||
ref_files = []
|
||||
for i in range(num_refs):
|
||||
if num_refs == 1:
|
||||
ref_files.append(refs_path)
|
||||
else:
|
||||
ref_files.append(refs_path + str(i))
|
||||
|
||||
command = 'perl {0} {1} < {2}'.format(BLEU_PATH, ' '.join(ref_files), hyps_path)
|
||||
result = subprocess.check_output(command, shell=True)
|
||||
try:
|
||||
bleu = float(re.findall('BLEU = (.+?),', str(result))[0])
|
||||
except:
|
||||
logging.error('ERROR ON COMPUTING BLEU. MAKE SURE YOU HAVE PERL INSTALLED GLOBALLY ON YOUR MACHINE.')
|
||||
print('ERROR ON COMPUTING BLEU. MAKE SURE YOU HAVE PERL INSTALLED GLOBALLY ON YOUR MACHINE.')
|
||||
bleu = -1
|
||||
logging.info('FINISHING TO COMPUTE BLEU...')
|
||||
print('FINISHING TO COMPUTE BLEU...')
|
||||
return bleu
|
||||
|
||||
|
||||
def bleu_nltk(references, hypothesis):
|
||||
# check for empty lists
|
||||
references_, hypothesis_ = [], []
|
||||
for i, refs in enumerate(references):
|
||||
refs_ = [ref for ref in refs if ref.strip() != '']
|
||||
if len(refs_) > 0:
|
||||
references_.append([ref.split() for ref in refs_])
|
||||
hypothesis_.append(hypothesis[i].split())
|
||||
|
||||
chencherry = SmoothingFunction()
|
||||
return corpus_bleu(references_, hypothesis_, smoothing_function=chencherry.method3)
|
||||
|
||||
|
||||
def meteor_score(references, hypothesis, num_refs, lng='en'):
|
||||
logging.info('STARTING TO COMPUTE METEOR...')
|
||||
print('STARTING TO COMPUTE METEOR...')
|
||||
hyps_tmp, refs_tmp = 'hypothesis_meteor', 'reference_meteor'
|
||||
|
||||
with codecs.open(hyps_tmp, 'w', 'utf-8') as f:
|
||||
f.write('\n'.join(hypothesis))
|
||||
|
||||
linear_references = []
|
||||
for refs in references:
|
||||
for i in range(num_refs):
|
||||
linear_references.append(refs[i])
|
||||
|
||||
with codecs.open(refs_tmp, 'w', 'utf-8') as f:
|
||||
f.write('\n'.join(linear_references))
|
||||
|
||||
try:
|
||||
command = 'java -Xmx2G -jar {0} '.format(METEOR_PATH)
|
||||
command += '{0} {1} -l {2} -norm -r {3}'.format(hyps_tmp, refs_tmp, lng, num_refs)
|
||||
result = subprocess.check_output(command, shell=True)
|
||||
meteor = result.split(b'\n')[-2].split()[-1]
|
||||
except:
|
||||
logging.error('ERROR ON COMPUTING METEOR. MAKE SURE YOU HAVE JAVA INSTALLED GLOBALLY ON YOUR MACHINE.')
|
||||
print('ERROR ON COMPUTING METEOR. MAKE SURE YOU HAVE JAVA INSTALLED GLOBALLY ON YOUR MACHINE.')
|
||||
meteor = -1
|
||||
|
||||
try:
|
||||
os.remove(hyps_tmp)
|
||||
os.remove(refs_tmp)
|
||||
except:
|
||||
pass
|
||||
logging.info('FINISHING TO COMPUTE METEOR...')
|
||||
print('FINISHING TO COMPUTE METEOR...')
|
||||
return float(meteor)
|
||||
|
||||
|
||||
def chrF_score(references, hypothesis, num_refs, nworder, ncorder, beta):
|
||||
logging.info('STARTING TO COMPUTE CHRF++...')
|
||||
print('STARTING TO COMPUTE CHRF++...')
|
||||
hyps_tmp, refs_tmp = 'hypothesis_chrF', 'reference_chrF'
|
||||
|
||||
# check for empty lists
|
||||
references_, hypothesis_ = [], []
|
||||
for i, refs in enumerate(references):
|
||||
refs_ = [ref for ref in refs if ref.strip() != '']
|
||||
if len(refs_) > 0:
|
||||
references_.append(refs_)
|
||||
hypothesis_.append(hypothesis[i])
|
||||
|
||||
with codecs.open(hyps_tmp, 'w', 'utf-8') as f:
|
||||
f.write('\n'.join(hypothesis_))
|
||||
|
||||
linear_references = []
|
||||
for refs in references_:
|
||||
linear_references.append('*#'.join(refs[:num_refs]))
|
||||
|
||||
with codecs.open(refs_tmp, 'w', 'utf-8') as f:
|
||||
f.write('\n'.join(linear_references))
|
||||
|
||||
rtxt = codecs.open(refs_tmp, 'r', 'utf-8')
|
||||
htxt = codecs.open(hyps_tmp, 'r', 'utf-8')
|
||||
|
||||
try:
|
||||
totalF, averageTotalF, totalPrec, totalRec = computeChrF(rtxt, htxt, nworder, ncorder, beta, None)
|
||||
except:
|
||||
logging.error('ERROR ON COMPUTING CHRF++.')
|
||||
print('ERROR ON COMPUTING CHRF++.')
|
||||
totalF, averageTotalF, totalPrec, totalRec = -1, -1, -1, -1
|
||||
try:
|
||||
os.remove(hyps_tmp)
|
||||
os.remove(refs_tmp)
|
||||
except:
|
||||
pass
|
||||
logging.info('FINISHING TO COMPUTE CHRF++...')
|
||||
print('FINISHING TO COMPUTE CHRF++...')
|
||||
return totalF, averageTotalF, totalPrec, totalRec
|
||||
|
||||
|
||||
def ter_score(references, hypothesis, num_refs):
|
||||
logging.info('STARTING TO COMPUTE TER...')
|
||||
print('STARTING TO COMPUTE TER...')
|
||||
ter_scores = []
|
||||
for hyp, refs in zip(hypothesis, references):
|
||||
candidates = []
|
||||
for ref in refs[:num_refs]:
|
||||
if len(ref) == 0:
|
||||
ter_score = 1
|
||||
else:
|
||||
try:
|
||||
ter_score = pyter.ter(hyp.split(), ref.split())
|
||||
except:
|
||||
ter_score = 1
|
||||
candidates.append(ter_score)
|
||||
|
||||
ter_scores.append(min(candidates))
|
||||
|
||||
logging.info('FINISHING TO COMPUTE TER...')
|
||||
print('FINISHING TO COMPUTE TER...')
|
||||
return sum(ter_scores) / len(ter_scores)
|
||||
|
||||
|
||||
def bert_score_(references, hypothesis, lng='en'):
|
||||
logging.info('STARTING TO COMPUTE BERT SCORE...')
|
||||
print('STARTING TO COMPUTE BERT SCORE...')
|
||||
for i, refs in enumerate(references):
|
||||
references[i] = [ref for ref in refs if ref.strip() != '']
|
||||
|
||||
try:
|
||||
P, R, F1 = score(hypothesis, references, lang=lng)
|
||||
logging.info('FINISHING TO COMPUTE BERT SCORE...')
|
||||
# print('FINISHING TO COMPUTE BERT SCORE...')
|
||||
P, R, F1 = list(P), list(R), list(F1)
|
||||
F1 = float(sum(F1) / len(F1))
|
||||
P = float(sum(P) / len(P))
|
||||
R = float(sum(R) / len(R))
|
||||
except:
|
||||
P, R, F1 = 0, 0, 0
|
||||
return P, R, F1
|
||||
|
||||
def bleurt(references, hypothesis, num_refs, checkpoint = "metrics/bleurt/bleurt-base-128"):
|
||||
refs, cands = [], []
|
||||
for i, hyp in enumerate(hypothesis):
|
||||
for ref in references[i][:num_refs]:
|
||||
cands.append(hyp)
|
||||
refs.append(ref)
|
||||
|
||||
scorer = bleurt_score.BleurtScorer(checkpoint)
|
||||
scores = scorer.score(refs, cands)
|
||||
scores = [max(scores[i:i+num_refs]) for i in range(0, len(scores), num_refs)]
|
||||
return round(sum(scores) / len(scores), 2)
|
||||
|
||||
|
||||
def run(refs_path, hyps_path, num_refs, lng='en', metrics='bleu,meteor,chrf++,ter,bert,bleurt',ncorder=6, nworder=2, beta=2):
|
||||
metrics = metrics.lower().split(',')
|
||||
references, references_tok, hypothesis, hypothesis_tok = parse(refs_path, hyps_path, num_refs, lng)
|
||||
|
||||
result = {}
|
||||
|
||||
logging.info('STARTING EVALUATION...')
|
||||
if 'bleu' in metrics:
|
||||
bleu = bleu_score(refs_path, hyps_path, num_refs)
|
||||
result['bleu'] = bleu
|
||||
|
||||
b = bleu_nltk(references_tok, hypothesis_tok)
|
||||
result['bleu_nltk'] = b
|
||||
if 'meteor' in metrics:
|
||||
meteor = meteor_score(references_tok, hypothesis_tok, num_refs, lng=lng)
|
||||
result['meteor'] = meteor
|
||||
if 'chrf++' in metrics:
|
||||
chrf, _, _, _ = chrF_score(references, hypothesis, num_refs, nworder, ncorder, beta)
|
||||
result['chrf++'] = chrf
|
||||
if 'ter' in metrics:
|
||||
ter = ter_score(references_tok, hypothesis_tok, num_refs)
|
||||
result['ter'] = ter
|
||||
if 'bert' in metrics:
|
||||
P, R, F1 = bert_score_(references, hypothesis, lng=lng)
|
||||
result['bert_precision'] = P
|
||||
result['bert_recall'] = R
|
||||
result['bert_f1'] = F1
|
||||
if 'bleurt' in metrics and lng == 'en':
|
||||
s = bleurt(references, hypothesis, num_refs)
|
||||
result['bleurt'] = s
|
||||
logging.info('FINISHING EVALUATION...')
|
||||
|
||||
return result
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
FORMAT = '%(levelname)s: %(asctime)-15s - %(message)s'
|
||||
logging.basicConfig(filename='eval.log', level=logging.INFO, format=FORMAT)
|
||||
|
||||
argParser = argparse.ArgumentParser()
|
||||
argParser.add_argument("-R", "--reference", help="reference translation", required=True)
|
||||
argParser.add_argument("-H", "--hypothesis", help="hypothesis translation", required=True)
|
||||
argParser.add_argument("-lng", "--language", help="evaluated language", default='en')
|
||||
argParser.add_argument("-nr", "--num_refs", help="number of references", type=int, default=4)
|
||||
argParser.add_argument("-m", "--metrics", help="evaluation metrics to be computed", default='bleu,meteor,ter,chrf++,bert,bleurt')
|
||||
argParser.add_argument("-nc", "--ncorder", help="chrF metric: character n-gram order (default=6)", type=int, default=6)
|
||||
argParser.add_argument("-nw", "--nworder", help="chrF metric: word n-gram order (default=2)", type=int, default=2)
|
||||
argParser.add_argument("-b", "--beta", help="chrF metric: beta parameter (default=2)", type=float, default=2.0)
|
||||
|
||||
args = argParser.parse_args()
|
||||
|
||||
logging.info('READING INPUTS...')
|
||||
refs_path = args.reference
|
||||
hyps_path = args.hypothesis
|
||||
lng = args.language
|
||||
num_refs = args.num_refs
|
||||
metrics = args.metrics#.lower().split(',')
|
||||
|
||||
nworder = args.nworder
|
||||
ncorder = args.ncorder
|
||||
beta = args.beta
|
||||
logging.info('FINISHING TO READ INPUTS...')
|
||||
|
||||
result = run(refs_path=refs_path, hyps_path=hyps_path, num_refs=num_refs, lng=lng, metrics=metrics, ncorder=ncorder, nworder=nworder, beta=beta)
|
||||
|
||||
metrics = metrics.lower().split(',')
|
||||
headers, values = [], []
|
||||
if 'bleu' in metrics:
|
||||
headers.append('BLEU')
|
||||
values.append(result['bleu'])
|
||||
|
||||
headers.append('BLEU NLTK')
|
||||
values.append(round(result['bleu_nltk'], 2))
|
||||
if 'meteor' in metrics:
|
||||
headers.append('METEOR')
|
||||
values.append(round(result['meteor'], 2))
|
||||
if 'chrf++' in metrics:
|
||||
headers.append('chrF++')
|
||||
values.append(round(result['chrf++'], 2))
|
||||
if 'ter' in metrics:
|
||||
headers.append('TER')
|
||||
values.append(round(result['ter'], 2))
|
||||
if 'bert' in metrics:
|
||||
headers.append('BERT-SCORE P')
|
||||
values.append(round(result['bert_precision'], 2))
|
||||
headers.append('BERT-SCORE R')
|
||||
values.append(round(result['bert_recall'], 2))
|
||||
headers.append('BERT-SCORE F1')
|
||||
values.append(round(result['bert_f1'], 2))
|
||||
if 'bleurt' in metrics and lng == 'en':
|
||||
headers.append('BLEURT')
|
||||
values.append(round(result['bleurt'], 2))
|
||||
|
||||
logging.info('PRINTING RESULTS...')
|
||||
print(tabulate([values], headers=headers))
|
Binary file not shown (image, 266 KiB).
Binary file not shown (image, 98 KiB).
|
@@ -0,0 +1,7 @@
|
|||
--find-links https://download.pytorch.org/whl/torch_stable.html
|
||||
torch==1.7.1+cu101
|
||||
transformers==3.3.1
|
||||
spacy
|
||||
tqdm
|
||||
tensorboard
|
||||
progress
|
|
@@ -0,0 +1,269 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import os, sys
|
||||
import glob
|
||||
import random
|
||||
from collections import Counter, OrderedDict
|
||||
import numpy as np
|
||||
import torch
|
||||
import json
|
||||
|
||||
import torch
|
||||
from torch.utils.data import Dataset
|
||||
from torch.utils.data import DataLoader
|
||||
|
||||
|
||||
class LMOrderedIterator(object):
|
||||
def __init__(self, data, bsz, bptt, eval_len=None, device='cpu', world_size=1, rank=0):
|
||||
"""
|
||||
data -- LongTensor -- the LongTensor is strictly ordered
|
||||
"""
|
||||
self.data = data
|
||||
self.bsz = bsz
|
||||
self.world_size = world_size
|
||||
self.rank = rank
|
||||
self.bptt = bptt # tgt_len
|
||||
# evaluation length; defaults to bptt.
|
||||
self.eval_len = bptt if eval_len is None else eval_len
|
||||
|
||||
self.device = device
|
||||
|
||||
self.global_bsz = bsz * world_size
|
||||
# Work out how cleanly we can divide the dataset into bsz parts.
|
||||
self.n_step = len(data) // self.global_bsz # bsz
|
||||
|
||||
self.split_data = torch.tensor(
|
||||
data[rank * self.n_step * bsz : (rank + 1) * self.n_step * bsz],
|
||||
dtype=torch.long, device=self.device
|
||||
) # data.view(-1)
|
||||
|
||||
self.split_data = self.split_data.view(bsz, -1)
|
||||
|
||||
def __iter__(self):
|
||||
return self.get_fixlen_iter()
|
||||
|
||||
def get_batch(self, i, bptt, eval_len):
|
||||
beg_idx = i
|
||||
end_idx = i + bptt # seq_len
|
||||
|
||||
# batch_size, length;
|
||||
_input = self.split_data[:, beg_idx : end_idx].contiguous()
|
||||
_target = self.split_data[:, beg_idx+1 : end_idx+1].contiguous()
|
||||
|
||||
_msk = torch.cat(
|
||||
[
|
||||
torch.zeros(bptt-eval_len, dtype=torch.float, device=self.device),
|
||||
torch.ones(eval_len, dtype=torch.float, device=self.device)
|
||||
]
|
||||
)
|
||||
_msk = _msk.unsqueeze(0).expand_as(_input) # .unsqueeze(-1) # length, 1;
|
||||
return _input, _target, _msk
|
||||
|
||||
def get_fixlen_iter(self, start=0):
|
||||
self.data_len = self.split_data.size(1)
|
||||
_eval_cursor = 0
|
||||
for i in range(start, self.data_len - 1, self.eval_len):
|
||||
bptt = min(self.bptt, self.data_len - i - 1)
|
||||
_end_idx = i + bptt
|
||||
yield self.get_batch(i, bptt, _end_idx - _eval_cursor)
|
||||
_eval_cursor = _end_idx
|
||||
|
||||
|
||||
class Corpus(object):
|
||||
def __init__(self, path):
|
||||
self.path = path
|
||||
self.num_words = 0
|
||||
self.tokens = []
|
||||
with open(self.path, "r") as reader:
|
||||
for line in reader:
|
||||
items = json.loads(line.strip())
|
||||
book = items['book']
|
||||
tokens = items['tokens']
|
||||
num_words = items['num_words']
|
||||
|
||||
self.num_words += num_words
|
||||
self.tokens.extend(tokens)
|
||||
|
||||
|
||||
class BinLMOrderedIterator(object):
|
||||
def __init__(self, corpus, bsz, bptt, eval_len=None, device='cpu', world_size=1, rank=0):
|
||||
"""
|
||||
data -- LongTensor -- the LongTensor is strictly ordered
|
||||
"""
|
||||
self.corpus = corpus
|
||||
self.bsz = bsz
|
||||
self.world_size = world_size
|
||||
self.rank = rank
|
||||
self.bptt = bptt # tgt_len
|
||||
# evaluation length; defaults to bptt.
|
||||
self.eval_len = bptt if eval_len is None else eval_len
|
||||
self.device = device
|
||||
self.global_bsz = bsz * world_size
|
||||
# Work out how cleanly we can divide the dataset into bsz parts.
|
||||
self.n_step = corpus.length // self.global_bsz # bsz
|
||||
|
||||
self.offset = [(rank * bsz + _b) * self.n_step for _b in range(bsz)]
|
||||
|
||||
def __iter__(self):
|
||||
return self.get_fixlen_iter()
|
||||
|
||||
def get_batch(self, i, bptt, eval_len):
|
||||
# batch_size, length;
|
||||
_inputs = []
|
||||
_targets = []
|
||||
for _b in range(0, self.bsz):
|
||||
_input = self.corpus.get_tokens(self.offset[_b] + i, bptt)
|
||||
_target = self.corpus.get_tokens(self.offset[_b] + i + 1, bptt)
|
||||
|
||||
_inputs.append(_input)
|
||||
_targets.append(_target)
|
||||
|
||||
_input = torch.tensor(_inputs, dtype=torch.int64, device=self.device).contiguous()
|
||||
_target = torch.tensor(_targets, dtype=torch.int64, device=self.device).contiguous()
|
||||
|
||||
_msk = torch.cat(
|
||||
[
|
||||
torch.zeros(bptt-eval_len, dtype=torch.float, device=self.device),
|
||||
torch.ones(eval_len, dtype=torch.float, device=self.device)
|
||||
]
|
||||
)
|
||||
_msk = _msk.unsqueeze(0).expand_as(_input) # .unsqueeze(-1) # length, 1;
|
||||
return _input, _target, _msk
|
||||
|
||||
def get_fixlen_iter(self, start=0):
|
||||
#self.data_len = self.split_data.size(1)
|
||||
_eval_cursor = 0
|
||||
for i in range(start, self.n_step - 1, self.eval_len):
|
||||
bptt = min(self.bptt, self.n_step - i - 1)
|
||||
_end_idx = i + bptt
|
||||
yield self.get_batch(i, bptt, _end_idx - _eval_cursor)
|
||||
_eval_cursor = _end_idx
|
||||
|
||||
|
||||
class BinCorpus(object):
|
||||
def __init__(self, path):
|
||||
self.path = path
|
||||
|
||||
self.book_token_span = []
|
||||
self.book_token_span.append(0)
|
||||
tokens_sum = 0
|
||||
self.num_words = 0
|
||||
|
||||
with open(path+'.info', 'r') as info_reader:
|
||||
for line in info_reader:
|
||||
items = json.loads(line.strip())
|
||||
book = items['book']
|
||||
num_tokens = items['num_subtokens']
|
||||
num_words = items['num_words']
|
||||
|
||||
tokens_sum += num_tokens
|
||||
self.book_token_span.append(tokens_sum)
|
||||
self.num_words += num_words
|
||||
|
||||
self.length = self.book_token_span[-1]
|
||||
self.bin_reader = open(path+'.bin', 'rb')
|
||||
|
||||
def get_tokens(self, offset, count):
|
||||
INT64_SIZE = 8
|
||||
self.bin_reader.seek(offset * INT64_SIZE)
|
||||
x = np.fromfile(self.bin_reader, count=count, dtype=np.int64)
|
||||
return x
|
||||
|
||||
|
||||
def get_lm_corpus(data):
|
||||
print('Producing dataset {}...'.format(data))
|
||||
corpus = Corpus(data)
|
||||
return corpus
|
||||
|
||||
|
||||
def padding_tokens(tokens, max_seq_length, pad_token, direct, max_context_length=0):
|
||||
|
||||
if max_context_length == 0:
|
||||
max_context_length = max_seq_length
|
||||
|
||||
if len(tokens) > max_context_length:
|
||||
if direct > 0:
|
||||
pad_tokens = tokens[:max_context_length]
|
||||
else:
|
||||
pad_tokens = tokens[-max_context_length:]
|
||||
else:
|
||||
pad_tokens = tokens
|
||||
token_len = len(pad_tokens)
|
||||
pad_tokens = pad_tokens + [pad_token for _ in range(max_seq_length - token_len)]
|
||||
return pad_tokens, token_len
|
||||
|
||||
|
||||
class FT_Dataset(Dataset):
|
||||
def __init__(self, ft_file, batch_size, max_seq_length,
|
||||
max_eval_length=0, joint_lm=False, prefix_len=0, infix_len=0,
|
||||
prefix_cursor=1000000, infix_cursor=2000000):
|
||||
self.ft_file = ft_file
|
||||
self.ft_samples = self.read_ft_file(ft_file)
|
||||
self.batch_size = batch_size
|
||||
self.num_examples = len(self.ft_samples)
|
||||
self.max_seq_length = max_seq_length
|
||||
self.max_eval_length = max_eval_length
|
||||
self.rng = random.Random(911)
|
||||
self.joint_lm = joint_lm
|
||||
|
||||
self.num_batches = int((self.num_examples + self.batch_size - 1) / self.batch_size)
|
||||
|
||||
self.prefix_len = prefix_len
|
||||
self.infix_len = infix_len
|
||||
self.prefix_cursor = prefix_cursor
|
||||
self.infix_cursor = infix_cursor
|
||||
|
||||
def __len__(self):
|
||||
return self.num_batches * self.batch_size
|
||||
|
||||
def __getitem__(self, item):
|
||||
if(item >= self.num_examples):
|
||||
item = self.rng.randint(0, self.num_examples - 1)
|
||||
|
||||
example = self.ft_samples[item]
|
||||
context = example[0]
|
||||
completion = example[1]
|
||||
|
||||
pretokens = [i + self.prefix_cursor for i in range(0, self.prefix_len)]
|
||||
intokens = [i + self.infix_cursor for i in range(0, self.infix_len)]
|
||||
|
||||
conditions = pretokens + context + intokens
|
||||
_input, _input_len = padding_tokens(conditions + completion, self.max_seq_length, 0, 1)
|
||||
|
||||
pad_targets = [0 for i in range(0, self.prefix_len)] + context + [0 for i in range(0, self.infix_len)] + completion
|
||||
_target, _ = padding_tokens(pad_targets[1:], self.max_seq_length, 0, 1)
|
||||
|
||||
if not self.joint_lm:
|
||||
_msk = [0.0] * (len(conditions) - 1) + [1.0] * (_input_len - len(conditions))
|
||||
else:
|
||||
_msk = [1.0] * (_input_len - 1)
|
||||
|
||||
_msk, _ = padding_tokens(_msk, self.max_seq_length, 0.0, 1)
|
||||
|
||||
output = {}
|
||||
output["id"] = torch.tensor(item, dtype=torch.long)
|
||||
|
||||
_query, _query_len = padding_tokens(
|
||||
conditions, self.max_seq_length, 0, -1,
|
||||
max_context_length = self.max_seq_length - self.max_eval_length
|
||||
)
|
||||
output["query"] = torch.tensor(_query, dtype=torch.long)
|
||||
output["query_len"] = torch.tensor(_query_len, dtype=torch.long)
|
||||
|
||||
output["input"] = torch.tensor(_input, dtype=torch.long)
|
||||
output["target"] = torch.tensor(_target, dtype=torch.long)
|
||||
|
||||
output["mask"] = torch.tensor(_msk, dtype=torch.float)
|
||||
return output
|
||||
|
||||
def read_ft_file(self, ft_file):
|
||||
ft_samples = []
|
||||
with open(ft_file, 'r') as reader:
|
||||
for line in reader:
|
||||
items = json.loads(line.strip())
|
||||
context = items['context']
|
||||
completion = items['completion']
|
||||
ft_samples.append([context, completion])
|
||||
return ft_samples
|
|
@@ -0,0 +1,132 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import os
|
||||
import json
|
||||
import regex as re
|
||||
from functools import lru_cache
|
||||
|
||||
|
||||
@lru_cache()
|
||||
def bytes_to_unicode():
|
||||
"""
|
||||
Returns a list of utf-8 bytes and a corresponding list of unicode strings.
|
||||
The reversible bpe codes work on unicode strings.
|
||||
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
|
||||
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
|
||||
This is a significant percentage of your normal, say, 32K bpe vocab.
|
||||
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
|
||||
And avoids mapping to whitespace/control characters the bpe code barfs on.
|
||||
"""
|
||||
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
|
||||
cs = bs[:]
|
||||
n = 0
|
||||
for b in range(2**8):
|
||||
if b not in bs:
|
||||
bs.append(b)
|
||||
cs.append(2**8+n)
|
||||
n += 1
|
||||
cs = [chr(n) for n in cs]
|
||||
return dict(zip(bs, cs))
|
||||
|
||||
|
||||
def get_pairs(word):
|
||||
"""Return set of symbol pairs in a word.
|
||||
Word is represented as tuple of symbols (symbols being variable-length strings).
|
||||
"""
|
||||
pairs = set()
|
||||
prev_char = word[0]
|
||||
for char in word[1:]:
|
||||
pairs.add((prev_char, char))
|
||||
prev_char = char
|
||||
return pairs
|
||||
|
||||
|
||||
class Encoder:
|
||||
|
||||
def __init__(self, encoder, bpe_merges, errors='replace'):
|
||||
self.encoder = encoder
|
||||
self.decoder = {v:k for k,v in self.encoder.items()}
|
||||
self.errors = errors # how to handle errors in decoding
|
||||
self.byte_encoder = bytes_to_unicode()
|
||||
self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
|
||||
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
|
||||
self.cache = {}
|
||||
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
|
||||
try:
|
||||
import regex as re
|
||||
self.re = re
|
||||
except ImportError:
|
||||
raise ImportError('Please install regex with: pip install regex')
|
||||
|
||||
|
||||
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
|
||||
|
||||
def bpe(self, token):
|
||||
if token in self.cache:
|
||||
return self.cache[token]
|
||||
word = tuple(token)
|
||||
pairs = get_pairs(word)
|
||||
|
||||
if not pairs:
|
||||
return token
|
||||
|
||||
while True:
|
||||
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
|
||||
if bigram not in self.bpe_ranks:
|
||||
break
|
||||
first, second = bigram
|
||||
new_word = []
|
||||
i = 0
|
||||
while i < len(word):
|
||||
try:
|
||||
j = word.index(first, i)
|
||||
new_word.extend(word[i:j])
|
||||
i = j
|
||||
except:
|
||||
new_word.extend(word[i:])
|
||||
break
|
||||
|
||||
if word[i] == first and i < len(word)-1 and word[i+1] == second:
|
||||
new_word.append(first+second)
|
||||
i += 2
|
||||
else:
|
||||
new_word.append(word[i])
|
||||
i += 1
|
||||
new_word = tuple(new_word)
|
||||
word = new_word
|
||||
if len(word) == 1:
|
||||
break
|
||||
else:
|
||||
pairs = get_pairs(word)
|
||||
word = ' '.join(word)
|
||||
self.cache[token] = word
|
||||
return word
|
||||
|
||||
def encode(self, text):
|
||||
bpe_tokens = []
|
||||
tokens = []
|
||||
for token in re.findall(self.pat, text):
|
||||
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
|
||||
bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
|
||||
if token:
|
||||
tokens.append(token)
|
||||
return bpe_tokens, tokens
|
||||
|
||||
def decode(self, tokens):
|
||||
text = ''.join([self.decoder[token] for token in tokens])
|
||||
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
|
||||
return text
|
||||
|
||||
|
||||
def get_encoder(models_dir):
|
||||
with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
|
||||
encoder = json.load(f)
|
||||
with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding="utf-8") as f:
|
||||
bpe_data = f.read()
|
||||
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
|
||||
return Encoder(
|
||||
encoder=encoder,
|
||||
bpe_merges=bpe_merges,
|
||||
)
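

# Illustrative usage sketch (assumes encoder.json and vocab.bpe live in the repo's vocab/ folder):
if __name__ == '__main__':
    enc = get_encoder('./vocab')
    bpe_ids, byte_tokens = enc.encode('Low-rank adaptation of GPT-2')
    # decode() maps the BPE ids back to the original text.
    assert enc.decode(bpe_ids) == 'Low-rank adaptation of GPT-2'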
|
|
@@ -0,0 +1,46 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import functools
|
||||
import os, shutil
|
||||
import numpy as np
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
def logging(s, log_path, print_=True, log_=True):
|
||||
if print_:
|
||||
print(s)
|
||||
if log_:
|
||||
with open(log_path, 'a+') as f_log:
|
||||
f_log.write(s + '\n')
|
||||
|
||||
|
||||
def get_logger(log_path, **kwargs):
|
||||
return functools.partial(logging, log_path=log_path, **kwargs)
|
||||
|
||||
|
||||
def create_exp_dir(dir_path, scripts_to_save=None, debug=False):
|
||||
if debug:
|
||||
print('Debug Mode : no experiment dir created')
|
||||
return functools.partial(logging, log_path=None, log_=False)
|
||||
|
||||
if not os.path.exists(dir_path):
|
||||
os.makedirs(dir_path)
|
||||
|
||||
print('Experiment dir : {}'.format(dir_path))
|
||||
if scripts_to_save is not None:
|
||||
script_path = os.path.join(dir_path, 'scripts')
|
||||
if not os.path.exists(script_path):
|
||||
os.makedirs(script_path)
|
||||
for script in scripts_to_save:
|
||||
dst_file = os.path.join(dir_path, 'scripts', os.path.basename(script))
|
||||
shutil.copyfile(script, dst_file)
|
||||
|
||||
return get_logger(log_path=os.path.join(dir_path, 'log.txt'))
|
||||
|
||||
|
||||
def save_checkpoint(model, optimizer, path, epoch):
|
||||
torch.save(model, os.path.join(path, 'model_{}.pt'.format(epoch)))
|
||||
torch.save(optimizer.state_dict(), os.path.join(path, 'optimizer_{}.pt'.format(epoch)))
|
|
@@ -0,0 +1,43 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import sys
|
||||
import io
|
||||
import json
|
||||
|
||||
|
||||
with open(sys.argv[1], 'r', encoding='utf8') as reader, \
|
||||
open(sys.argv[2], 'w', encoding='utf8') as writer :
|
||||
lines_dict = json.load(reader)
|
||||
|
||||
full_rela_lst = []
|
||||
full_src_lst = []
|
||||
full_tgt_lst = []
|
||||
unique_src = 0
|
||||
|
||||
for example in lines_dict:
|
||||
rela_lst = []
|
||||
temp_triples = ''
|
||||
for i, tripleset in enumerate(example['tripleset']):
|
||||
subj, rela, obj = tripleset
|
||||
rela = rela.lower()
|
||||
rela_lst.append(rela)
|
||||
if i > 0:
|
||||
temp_triples += ' | '
|
||||
temp_triples += '{} : {} : {}'.format(subj, rela, obj)
|
||||
|
||||
unique_src += 1
|
||||
|
||||
for sent in example['annotations']:
|
||||
full_tgt_lst.append(sent['text'])
|
||||
full_src_lst.append(temp_triples)
|
||||
full_rela_lst.append(rela_lst)
|
||||
|
||||
print('unique source is', unique_src)
|
||||
|
||||
for src, tgt in zip(full_src_lst, full_tgt_lst):
|
||||
x = {}
|
||||
x['context'] = src # context #+ '||'
|
||||
x['completion'] = tgt #completion
|
||||
writer.write(json.dumps(x)+'\n')
|
|
@@ -0,0 +1,20 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import sys
|
||||
import io
|
||||
import json
|
||||
|
||||
|
||||
with open(sys.argv[1], 'r', encoding='utf8') as reader, \
|
||||
open(sys.argv[2], 'w', encoding='utf8') as writer :
|
||||
for line in reader:
|
||||
items = line.strip().split('||')
|
||||
context = items[0]
|
||||
completion = items[1].strip('\n')
|
||||
x = {}
|
||||
x['context'] = context #+ '||'
|
||||
x['completion'] = completion
|
||||
writer.write(json.dumps(x)+'\n')
|
||||
|
|
@@ -0,0 +1,68 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import sys
|
||||
import io
|
||||
import json
|
||||
|
||||
|
||||
with open(sys.argv[1], 'r', encoding='utf8') as reader, \
|
||||
open(sys.argv[2], 'w', encoding='utf8') as writer :
|
||||
lines_dict = json.load(reader)
|
||||
|
||||
full_rela_lst = []
|
||||
full_src_lst = []
|
||||
full_tgt_lst = []
|
||||
full_cate_lst = []
|
||||
|
||||
seen = [
|
||||
'Airport',
|
||||
'Astronaut',
|
||||
'Building',
|
||||
'City',
|
||||
'ComicsCharacter',
|
||||
'Food',
|
||||
'Monument',
|
||||
'SportsTeam',
|
||||
'University',
|
||||
'WrittenWork'
|
||||
]
|
||||
|
||||
cate_dict = {}
|
||||
for i, example in enumerate(lines_dict['entries']):
|
||||
sents = example[str(i+1)]['lexicalisations']
|
||||
triples = example[str(i + 1)]['modifiedtripleset']
|
||||
cate = example[str(i + 1)]['category']
|
||||
|
||||
if not cate in cate_dict:
|
||||
cate_dict[cate] = 0
|
||||
cate_dict[cate] += 1
|
||||
|
||||
rela_lst = []
|
||||
temp_triples = ''
|
||||
for i, tripleset in enumerate(triples):
|
||||
subj, rela, obj = tripleset['subject'], tripleset['property'], tripleset['object']
|
||||
rela_lst.append(rela)
|
||||
if i > 0:
|
||||
temp_triples += ' | '
|
||||
temp_triples += '{} : {} : {}'.format(subj, rela, obj)
|
||||
|
||||
for sent in sents:
|
||||
if sent["comment"] == 'good':
|
||||
full_tgt_lst.append(sent['lex'])
|
||||
full_src_lst.append(temp_triples)
|
||||
full_rela_lst.append(rela_lst)
|
||||
full_cate_lst.append(cate)
|
||||
|
||||
for cate in cate_dict:
|
||||
print('cate', cate, cate_dict[cate])
|
||||
|
||||
#edited_sents = []
|
||||
for src, tgt, cate in zip(full_src_lst, full_tgt_lst, full_cate_lst):
|
||||
x = {}
|
||||
x['context'] = src # context #+ '||'
|
||||
x['completion'] = tgt #completion
|
||||
x['cate'] = cate in seen
|
||||
writer.write(json.dumps(x)+'\n')
|
||||
|
|
@@ -0,0 +1,400 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import argparse
|
||||
import time
|
||||
import math
|
||||
import os, sys
|
||||
import json
|
||||
import itertools
|
||||
from typing import Callable, Dict, Iterable, List, Optional, Tuple
|
||||
|
||||
import torch
|
||||
from torch import Tensor, device, dtype, nn
|
||||
from torch.nn import CrossEntropyLoss
|
||||
from torch.nn import functional as F
|
||||
from torch.utils.data import DataLoader
|
||||
import torch.nn.functional as F
|
||||
torch.set_printoptions(threshold=100000)
|
||||
|
||||
import numpy as np
|
||||
|
||||
from gpu import (
|
||||
add_gpu_params,
|
||||
parse_gpu,
|
||||
distributed_opt,
|
||||
distributed_gather,
|
||||
distributed_sync,
|
||||
cleanup
|
||||
)
|
||||
|
||||
from exp_utils import create_exp_dir
|
||||
|
||||
from data_utils import FT_Dataset
|
||||
from model import GPT2Config, GPT2LMModel
|
||||
|
||||
|
||||
parser = argparse.ArgumentParser(description='PyTorch GPT2 beam decoding')
|
||||
|
||||
add_gpu_params(parser)
|
||||
|
||||
parser.add_argument('--data', type=str, default='../data/wikitext-103',
|
||||
help='location of the data corpus')
|
||||
|
||||
parser.add_argument('--batch_size', type=int, default=10,
|
||||
help='batch size')
|
||||
|
||||
parser.add_argument('--seq_len', type=int, default=512,
|
||||
help='number of tokens to predict')
|
||||
|
||||
parser.add_argument('--eval_len', type=int, default=256,
|
||||
help='evaluation length')
|
||||
|
||||
parser.add_argument('--min_length', type=int, default=0,
|
||||
help='minimum generation length')
|
||||
|
||||
parser.add_argument('--model_card', default='gpt2.sm', choices=['gpt2.sm', 'gpt2.md', 'gpt2.lg'],
|
||||
help='model names')
|
||||
|
||||
parser.add_argument('--init_checkpoint', default=None, type=str, help='initial checkpoint')
|
||||
|
||||
parser.add_argument('--lora_dim', type=int, default=0, help='lora attn dimension')
|
||||
|
||||
parser.add_argument('--lora_alpha', type=int, default=128, help='lora attn alpha')
|
||||
|
||||
parser.add_argument('--work_dir', type=str, default=os.getenv('PT_OUTPUT_DIR', 'gpt2_model'),
|
||||
help='working folder')
|
||||
|
||||
parser.add_argument('--beam', type=int, default=1, help='beam search size')
|
||||
|
||||
parser.add_argument('--length_penalty', type=float, default=1.0, help='length penalty')
|
||||
|
||||
parser.add_argument('--no_repeat_ngram_size', type=int, default=4, help='no_repeat_ngram_size')
|
||||
|
||||
parser.add_argument('--repetition_penalty', type=float, default=1.0, help='repetition_penalty')
|
||||
|
||||
parser.add_argument('--eos_token_id', action='append', type=int, default=[50256],
|
||||
help='eos token id')
|
||||
|
||||
parser.add_argument('--output_file', type=str, default='beam_prediction.jsonl',
|
||||
help='output file name')
|
||||
|
||||
parser.add_argument('--prefix_len', default=0, type=int, help='prefix length')
|
||||
|
||||
parser.add_argument('--infix_len', default=0, type=int, help='infix length')
|
||||
|
||||
|
||||
def print_args(args):
|
||||
if args.rank == 0:
|
||||
print('=' * 100)
|
||||
for k, v in args.__dict__.items():
|
||||
print(' - {} : {}'.format(k, v))
|
||||
print('=' * 100)
|
||||
|
||||
|
||||
def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:
|
||||
return tuple(layer_past.index_select(1, beam_idx).contiguous().detach() for layer_past in past)
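# Editor's sketch (not in the original file): each element of `past` is the stacked
# (key, value) tensor produced by the model, shaped roughly
# (2, batch_size * num_beams, n_head, seq_len, head_dim), so index_select on dim=1
# re-orders the cached hypotheses after each beam step. A hypothetical call:
#   beam_idx = torch.tensor([1, 0, 3, 2])   # swap the two beams of each batch item
#   past = _reorder_cache(past, beam_idx)   # cache now follows the surviving beams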
|
||||
|
||||
|
||||
def _calc_banned_ngram_tokens(
|
||||
prev_input_ids: Tensor,
|
||||
num_hypos: int,
|
||||
no_repeat_ngram_size: int,
|
||||
cur_len: int
|
||||
) -> List[List[int]]:
|
||||
"""Copied from fairseq for no_repeat_ngram in beam_search"""
|
||||
if cur_len + 1 < no_repeat_ngram_size:
|
||||
# return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
|
||||
return [[] for _ in range(num_hypos)]
|
||||
|
||||
generated_ngrams = [{} for _ in range(num_hypos)]
|
||||
for idx in range(num_hypos):
|
||||
gen_tokens = prev_input_ids[idx].tolist()
|
||||
generated_ngram = generated_ngrams[idx]
|
||||
for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):
|
||||
prev_ngram_tuple = tuple(ngram[:-1])
|
||||
generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]
|
||||
|
||||
def _get_generated_ngrams(hypo_idx):
|
||||
# Before decoding the next token, prevent decoding of ngrams that have already appeared
|
||||
start_idx = cur_len + 1 - no_repeat_ngram_size
|
||||
ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())
|
||||
return generated_ngrams[hypo_idx].get(ngram_idx, [])
|
||||
|
||||
banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]
|
||||
return banned_tokens
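# Worked example (editor's addition, assuming torch LongTensor inputs): with
# no_repeat_ngram_size=2 and a single hypothesis whose generated tokens are
# [5, 6, 5, 6], the bigram table maps (5,) -> [6, 6] and (6,) -> [5]; at cur_len=4
# the last token is 6, so the call returns [[5]] and 5 is banned for the next step.
#   banned = _calc_banned_ngram_tokens(torch.tensor([[5, 6, 5, 6]]), 1, 2, 4)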
|
||||
|
||||
|
||||
def _enforce_repetition_penalty_(
|
||||
lprobs,
|
||||
batch_size,
|
||||
num_beams,
|
||||
prev_output_tokens,
|
||||
repetition_penalty
|
||||
):
|
||||
"""repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). """
|
||||
|
||||
for i in range(batch_size * num_beams):
|
||||
        # print('prev_output_tokens.shape', prev_output_tokens.shape)  # debug print disabled
|
||||
        # print('prev_output_tokens[i].shape', prev_output_tokens[i].shape)  # debug print disabled
|
||||
|
||||
for previous_token in set(prev_output_tokens[i].tolist()):
|
||||
            # if score < 0 then repetition penalty has to be multiplied to reduce the previous token probability
|
||||
if lprobs[i, previous_token] < 0:
|
||||
lprobs[i, previous_token] *= repetition_penalty
|
||||
else:
|
||||
lprobs[i, previous_token] /= repetition_penalty
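# Editor's note (illustrative): the scores passed in here are raw logits, so the
# CTRL-style rule lowers a previously generated token either way, e.g. with
# repetition_penalty=1.2 a logit of 2.0 becomes 2.0 / 1.2 ≈ 1.67 while a logit of
# -2.0 becomes -2.0 * 1.2 = -2.4; both make the token less likely to repeat.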
|
||||
|
||||
def _postprocess_next_token_scores(
|
||||
scores,
|
||||
history,
|
||||
cur_len,
|
||||
batch_size,
|
||||
num_beams,
|
||||
repetition_penalty=1.0,
|
||||
no_repeat_ngram_size=4,
|
||||
bad_words_ids=None,
|
||||
min_length=0,
|
||||
max_length=100,
|
||||
eos_token_id=None,
|
||||
):
|
||||
# repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)
|
||||
if repetition_penalty != 1.0 and history is not None:
|
||||
_enforce_repetition_penalty_(scores, batch_size, num_beams, history, repetition_penalty)
|
||||
|
||||
# score: batch_size * beam, vocab
|
||||
# set eos token prob to zero if min_length is not reached
|
||||
if eos_token_id is not None and cur_len < min_length:
|
||||
for eos in eos_token_id:
|
||||
scores[:, eos] = -float("inf")
|
||||
|
||||
if no_repeat_ngram_size > 0 and history is not None:
|
||||
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
|
||||
num_batch_hypotheses = batch_size * num_beams
|
||||
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
|
||||
banned_batch_tokens = _calc_banned_ngram_tokens(
|
||||
history, num_batch_hypotheses, no_repeat_ngram_size, cur_len
|
||||
)
|
||||
|
||||
for i, banned_tokens in enumerate(banned_batch_tokens):
|
||||
scores[i, banned_tokens] = -float("inf")
|
||||
|
||||
return scores
|
||||
|
||||
|
||||
def _add_beam_candidate(
|
||||
best_score,
|
||||
best_sequence,
|
||||
batch_size,
|
||||
num_beams,
|
||||
beam_scores,
|
||||
history,
|
||||
eos_token_id=None
|
||||
):
|
||||
last_tokens = history[:, -1]
|
||||
for _i in range(batch_size * num_beams):
|
||||
if eos_token_id is None or last_tokens[_i] in eos_token_id:
|
||||
cur_len = history.shape[-1]
|
||||
_score = beam_scores.view(-1)[_i] / cur_len ** args.length_penalty
|
||||
|
||||
batch_id = _i // num_beams
|
||||
|
||||
            if batch_id not in best_score or best_score[batch_id] < _score:
|
||||
best_score[batch_id] = _score
|
||||
best_sequence[batch_id][:cur_len] = history[_i]
|
||||
|
||||
beam_scores.view(-1)[_i] = -float("inf")
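# Editor's sketch (not in the original): beam_scores holds the running sum of token
# log-probabilities, so the candidate score uses simple length normalisation,
# score = sum_logprob / cur_len ** length_penalty. For example, a 10-token hypothesis
# with summed log-prob -12.0 and length_penalty=1.0 scores -1.2; setting the slot to
# -inf afterwards stops the finished hypothesis from being extended any further.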
|
||||
|
||||
|
||||
def beam(model, data_iter, args):
|
||||
model.eval()
|
||||
total_loss = 0.
|
||||
start_time = time.time()
|
||||
|
||||
all_predictions = {}
|
||||
with torch.no_grad():
|
||||
for idx, data in enumerate(data_iter):
|
||||
data = {key: value for key, value in data.items()}
|
||||
|
||||
_id = data['id'].to(args.device)
|
||||
_query = data['query'].to(args.device)
|
||||
_query_len = data['query_len'].to(args.device)
|
||||
|
||||
## local adaptation start.
|
||||
|
||||
## local adaptation end.
|
||||
|
||||
|
||||
output = None
|
||||
score = None
|
||||
|
||||
batch_size = _id.size(0)
|
||||
num_beams = args.beam
|
||||
length_penalty = args.length_penalty
|
||||
|
||||
_batch = torch.arange(0, _id.size(0), device=args.device, dtype=torch.long)
|
||||
|
||||
past = None
|
||||
len_past = None
|
||||
|
||||
_query = _query.repeat(1, num_beams).view(batch_size * num_beams, -1)
|
||||
_query_len = _query_len.unsqueeze(-1).repeat(1, num_beams).view(-1)
|
||||
|
||||
_bbatch = _batch.unsqueeze(-1).repeat(1, num_beams).view(-1)
|
||||
|
||||
# scores for each sentence in the beam
|
||||
beam_scores = torch.zeros(
|
||||
(batch_size, num_beams), dtype=torch.float, device=_query.device
|
||||
)
|
||||
|
||||
best_sequence = torch.zeros(
|
||||
(batch_size, args.eval_len), dtype=torch.long, device=_query.device
|
||||
)
|
||||
best_score = {}
|
||||
|
||||
history = None
|
||||
with torch.no_grad():
|
||||
for i in range(0, args.eval_len):
|
||||
if i == 0:
|
||||
logits, past = model(_query)
|
||||
logits = logits[_bbatch, (_query_len-1).long(), :] # batch_size * beam, vocab
|
||||
else:
|
||||
#print('token_id.shape', token_id.shape, token_id)
|
||||
#print('past.shape', past[0].shape)
|
||||
#print('len_past.shape', len_past.shape, len_past)
|
||||
|
||||
logits, past = model(token_id, past=past, len_past=len_past)
|
||||
logits = logits[:, -1, :] # batch_size * beam, vocab
|
||||
|
||||
logits = _postprocess_next_token_scores(
|
||||
logits,
|
||||
history,
|
||||
i,
|
||||
batch_size,
|
||||
num_beams,
|
||||
repetition_penalty=args.repetition_penalty,
|
||||
no_repeat_ngram_size=args.no_repeat_ngram_size,
|
||||
min_length=args.min_length,
|
||||
eos_token_id=args.eos_token_id,
|
||||
)
|
||||
|
||||
softmax_probs = F.softmax(logits, dim=-1)
|
||||
##_prob, _w_idx = torch.topk(softmax_probs, num_beams) # batch_size, beam
|
||||
|
||||
vocab_size = softmax_probs.shape[-1]
|
||||
|
||||
|
||||
_logprob = torch.log(softmax_probs) # batch_size * beam, vocab
|
||||
if i == 0:
|
||||
next_scores = _logprob.view(batch_size, num_beams, -1)[:, 0, :] # batch_size, vocab
|
||||
|
||||
else:
|
||||
next_scores = beam_scores.unsqueeze(-1) + _logprob.view(batch_size, num_beams, -1)
|
||||
next_scores = next_scores.view(batch_size, -1) # batch_size, beam * vocab
|
||||
|
||||
next_scores, next_tokens = torch.topk(
|
||||
next_scores, num_beams, dim=1, largest=True, sorted=True
|
||||
) # batch_size, num_beams
|
||||
|
||||
beam_id = (next_tokens // vocab_size).view(-1) # batch_size * num_beams
|
||||
token_id = (next_tokens % vocab_size).view(-1).unsqueeze(-1) # batch_size, num_beams
|
||||
|
||||
beam_idx = beam_id.view(batch_size, num_beams) + (_batch * num_beams).unsqueeze(-1)
|
||||
past = _reorder_cache(past, beam_idx.view(-1))
|
||||
beam_scores = next_scores # batch_size, num_beams
|
||||
len_past = (_query_len + i).long()
|
||||
|
||||
if history is None:
|
||||
history = token_id.detach()
|
||||
else:
|
||||
history = torch.cat((history[beam_idx.view(-1)], token_id.detach()), dim=1).detach()
|
||||
|
||||
_add_beam_candidate(
|
||||
best_score, best_sequence, batch_size, num_beams, beam_scores, history,
|
||||
eos_token_id=args.eos_token_id
|
||||
)
|
||||
|
||||
_add_beam_candidate(
|
||||
best_score, best_sequence, batch_size, num_beams, beam_scores, history
|
||||
)
|
||||
|
||||
|
||||
with torch.no_grad():
|
||||
_id = distributed_gather(args, _id)
|
||||
output = distributed_gather(args, best_sequence)
|
||||
#score = distributed_gather(args, score)
|
||||
distributed_sync(args)
|
||||
|
||||
if args.rank == 0:
|
||||
_id = _id.view(-1).cpu()
|
||||
output = output.view(-1, output.shape[-1]).cpu()
|
||||
#score = score.view(-1, score.shape[-1]).cpu()
|
||||
|
||||
for _b in range(0, _id.shape[-1]):
|
||||
_i = int(_id[_b].item())
|
||||
all_predictions[_i] = {}
|
||||
all_predictions[_i]['id'] = _i
|
||||
all_predictions[_i]['predict'] = output[_b].tolist()
|
||||
#all_predictions[_i]['score'] = score[_b].tolist()
|
||||
|
||||
if idx % 10 == 0:
|
||||
print('inference samples', idx)
|
||||
|
||||
if args.rank == 0:
|
||||
pred_file = os.path.join(args.work_dir, args.output_file)
|
||||
print('saving prediction file', pred_file)
|
||||
with open(pred_file, 'w') as writer:
|
||||
for _i in all_predictions:
|
||||
writer.write(json.dumps(all_predictions[_i]) + '\n')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parser.parse_args()
|
||||
parse_gpu(args)
|
||||
print_args(args)
|
||||
|
||||
if args.rank == 0:
|
||||
args.logging = create_exp_dir(args.work_dir)
|
||||
|
||||
valid_data = FT_Dataset(
|
||||
args.data, args.batch_size, args.seq_len, args.eval_len,
|
||||
prefix_len=args.prefix_len, infix_len=args.infix_len
|
||||
)
|
||||
valid_sampler = torch.utils.data.distributed.DistributedSampler(valid_data)
|
||||
valid_loader = DataLoader(
|
||||
valid_data, batch_size=args.batch_size, num_workers=0, shuffle=False,
|
||||
pin_memory=False, drop_last=False, sampler=valid_sampler
|
||||
)
|
||||
|
||||
if args.model_card == 'gpt2.sm':
|
||||
config = GPT2Config(
|
||||
n_embd=768, n_layer=12, n_head=12,
|
||||
lora_attn_dim=args.lora_dim, lora_attn_alpha=args.lora_alpha,
|
||||
prefix_len=args.prefix_len, infix_len=args.infix_len
|
||||
)
|
||||
elif args.model_card == 'gpt2.md':
|
||||
config = GPT2Config(
|
||||
n_embd=1024, n_layer=24, n_head=16,
|
||||
lora_attn_dim=args.lora_dim, lora_attn_alpha=args.lora_alpha,
|
||||
prefix_len=args.prefix_len, infix_len=args.infix_len
|
||||
)
|
||||
elif args.model_card == 'gpt2.lg':
|
||||
config = GPT2Config(
|
||||
n_embd=1280, n_layer=36, n_head=20,
|
||||
lora_attn_dim=args.lora_dim, lora_attn_alpha=args.lora_alpha,
|
||||
prefix_len=args.prefix_len, infix_len=args.infix_len
|
||||
)
|
||||
|
||||
lm_net = GPT2LMModel(config)
|
||||
if args.init_checkpoint is not None:
|
||||
        print('loading pretrained model weights.')
|
||||
cp = torch.load(args.init_checkpoint, map_location=torch.device('cpu'))
|
||||
lm_net.load_weight(cp)
|
||||
lm_net = lm_net.cuda()
|
||||
|
||||
print('model sampling ...')
|
||||
beam(lm_net, valid_loader, args)
|
||||
distributed_sync(args)
|
||||
print('cleanup dist ...')
|
||||
cleanup(args)
|
|
@@ -0,0 +1,162 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import json
|
||||
import numpy as np
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import re
|
||||
import json
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.parallel
|
||||
import torch.backends.cudnn as cudnn
|
||||
import torch.optim as optim
|
||||
import torch.utils.data
|
||||
|
||||
import encoder
|
||||
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument('--vocab', type=str, default=None, help='vocab path')
|
||||
|
||||
parser.add_argument('--sample_file', default=None, type=str, help='ft sample file')
|
||||
parser.add_argument('--input_file', default=None, type=str, help='ft input file')
|
||||
|
||||
parser.add_argument('--output_ref_file', default=None, type=str, help='output reference file')
|
||||
parser.add_argument('--output_pred_file', default=None, type=str, help='output prediction file')
|
||||
|
||||
parser.add_argument('--ref_unique_file', default=None, type=str, help='reference unique id file')
|
||||
|
||||
parser.add_argument('--ref_type', default='e2e', choices=['e2e', 'webnlg', 'dart'],
|
||||
                    help='reference style: e2e, webnlg, or dart.')
|
||||
parser.add_argument('--ref_num', default=4, type=int, help='number of references.')
|
||||
|
||||
|
||||
parser.add_argument('--tokenize', action='store_true', help='tokenize references and predictions before writing')
|
||||
parser.add_argument('--lower', action='store_true', help='lowercase references and predictions before writing')
|
||||
|
||||
parser.add_argument('--filter', default='all', choices=['all', 'seen', 'unseen'],
|
||||
help='for webnlg only, filter categories that are seen during training, unseen, or all')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
|
||||
def standard_tokenize(sent):
|
||||
    sent = ' '.join(re.split(r'(\W)', sent))
|
||||
sent = sent.split()
|
||||
sent = ' '.join(sent)
|
||||
return sent
|
||||
|
||||
|
||||
def post_process(sent, is_tokenize, is_lower):
|
||||
if is_lower:
|
||||
sent = sent.lower()
|
||||
if is_tokenize:
|
||||
        sent = standard_tokenize(sent)
|
||||
|
||||
return sent
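# Illustrative example (editor's addition): with --tokenize and --lower the helper
# above turns "The Vaults' riverside pub." into "the vaults ' riverside pub .",
# i.e. punctuation is split off and whitespace is collapsed before metric scoring.
#   post_process("The Vaults' riverside pub.", is_tokenize=True, is_lower=True)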
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
enc = encoder.get_encoder(args.vocab)
|
||||
|
||||
ref_unique = None
|
||||
|
||||
if args.ref_unique_file is not None:
|
||||
print('reading ref_unique_file.')
|
||||
ref_unique = []
|
||||
uniques = {}
|
||||
with open(args.ref_unique_file, 'r') as ref_unique_reader:
|
||||
for line in ref_unique_reader:
|
||||
_id = int(line.strip())
|
||||
ref_unique.append(_id)
|
||||
uniques[_id] = 1
|
||||
print('len refer dict', len(ref_unique), 'unique', len(uniques))
|
||||
|
||||
with open(args.sample_file, 'r') as sample_reader, \
|
||||
open(args.input_file, 'r', encoding='utf8') as input_reader, \
|
||||
open(args.output_pred_file, 'w', encoding='utf8') as pred_writer:
|
||||
|
||||
refer_dict = {}
|
||||
context_list = []
|
||||
line_id = 0
|
||||
for line in input_reader:
|
||||
items = json.loads(line.strip())
|
||||
context = items['context']
|
||||
completion = items['completion']
|
||||
|
||||
context_list.append(context)
|
||||
|
||||
keep = False
|
||||
|
||||
if args.filter == 'all':
|
||||
keep = True
|
||||
if args.filter == 'seen' and items['cate']:
|
||||
keep = True
|
||||
if args.filter == 'unseen' and not items['cate']:
|
||||
keep = True
|
||||
|
||||
if ref_unique is None:
|
||||
_key = context
|
||||
else:
|
||||
_key = ref_unique[line_id]
|
||||
|
||||
if keep:
|
||||
if not _key in refer_dict:
|
||||
refer_dict[_key] = {}
|
||||
refer_dict[_key]['references'] = []
|
||||
refer_dict[_key]['references'].append(completion.split('<|endoftext|>')[0].split('\n\n')[0].strip())
|
||||
|
||||
line_id += 1
|
||||
|
||||
print('unique refer dict', len(refer_dict))
|
||||
|
||||
for line in sample_reader:
|
||||
items = json.loads(line.strip())
|
||||
_id = items['id']
|
||||
_pred_tokens = items['predict']
|
||||
|
||||
if ref_unique is None:
|
||||
_key = context_list[_id]
|
||||
else:
|
||||
_key = ref_unique[_id]
|
||||
|
||||
#assert _key in refer_dict
|
||||
if _key in refer_dict:
|
||||
refer_dict[_key]['sample'] = enc.decode(_pred_tokens).split('<|endoftext|>')[0].split('\n\n')[0].strip()
|
||||
|
||||
references = [refer_dict[s]['references'] for s in refer_dict]
|
||||
hypothesis = [refer_dict[s]['sample'] for s in refer_dict]
|
||||
|
||||
if args.ref_type == 'e2e':
|
||||
with open(args.output_ref_file, 'w', encoding='utf8') as ref_writer:
|
||||
for ref, hyp in zip(references, hypothesis):
|
||||
for r in ref:
|
||||
ref_writer.write(post_process(r, args.tokenize, args.lower) + '\n')
|
||||
ref_writer.write('\n')
|
||||
pred_writer.write(post_process(hyp, args.tokenize, args.lower) + '\n')
|
||||
|
||||
elif args.ref_type in ['webnlg', 'dart']:
|
||||
if not os.path.exists(args.output_ref_file):
|
||||
os.makedirs(args.output_ref_file)
|
||||
|
||||
reference_writers = [
|
||||
open(os.path.join(args.output_ref_file, f'reference{fid}'), 'w', encoding='utf8')
|
||||
for fid in range(0, args.ref_num)
|
||||
]
|
||||
|
||||
for ref, hyp in zip(references, hypothesis):
|
||||
for fid in range(0, args.ref_num):
|
||||
if len(ref) > fid:
|
||||
reference_writers[fid].write(post_process(ref[fid], args.tokenize, args.lower) + '\n')
|
||||
else:
|
||||
reference_writers[fid].write(post_process(ref[0], args.tokenize, args.lower) + '\n')
|
||||
pred_writer.write(post_process(hyp, args.tokenize, args.lower) + '\n')
|
||||
|
||||
for writer in reference_writers:
|
||||
writer.close()
|
|
@@ -0,0 +1,70 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import json
|
||||
import numpy as np
|
||||
|
||||
import encoder
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import random
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.parallel
|
||||
import torch.backends.cudnn as cudnn
|
||||
import torch.optim as optim
|
||||
import torch.utils.data
|
||||
|
||||
import numpy
|
||||
import io
|
||||
import sys
|
||||
import threading
|
||||
import math
|
||||
import random
|
||||
|
||||
import json
|
||||
import collections
|
||||
from collections import Counter
|
||||
from collections import OrderedDict
|
||||
from progress.bar import Bar as Bar
|
||||
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--input', default=None, type=str, help='ft input file')
|
||||
parser.add_argument('--vocab', type=str, default=None, help='vocab path')
|
||||
parser.add_argument('--output', default=None, type=str, help='ft output file')
|
||||
parser.add_argument('--add_bos', action='store_true', help='')
|
||||
parser.add_argument('--add_eos', action='store_true', help='')
|
||||
args = parser.parse_args()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
enc = encoder.get_encoder(args.vocab)
|
||||
|
||||
writer = open(args.output, 'w')
|
||||
|
||||
with open(args.input, 'r') as reader:
|
||||
line_idx = 0
|
||||
for line in reader:
|
||||
items = json.loads(line.strip())
|
||||
context = items['context']
|
||||
completion = items['completion']
|
||||
|
||||
bos = 50256
|
||||
eos = 50256
|
||||
context_bpes, _ = enc.encode(context)
|
||||
context_bpes += [bos] if args.add_bos else []
|
||||
|
||||
completion_bpes, _ = enc.encode(' ' + completion)
|
||||
completion_bpes += [eos] if args.add_eos else []
|
||||
|
||||
ft_json = {}
|
||||
ft_json['context'] = context_bpes
|
||||
ft_json['completion'] = completion_bpes
|
||||
writer.write(json.dumps(ft_json)+'\n')
|
||||
|
||||
line_idx += 1
|
||||
|
||||
writer.close()
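# Editor's note (illustrative; file names and token ids are hypothetical): each
# output line is a JSON object of BPE token ids, e.g.
#   {"context": [318, 257, 1110], "completion": [262, 3290, 50256]}
# and a sketch of one possible invocation would be
#   python gpt2_encode.py --vocab vocab --input train.jsonl --output train.bpe.jsonl --add_eos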
|
|
@@ -0,0 +1,361 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import argparse
|
||||
import time
|
||||
import math
|
||||
import os, sys
|
||||
import numpy as np
|
||||
import itertools
|
||||
|
||||
import torch
|
||||
import random
|
||||
from torch.utils.data import DataLoader
|
||||
torch.set_printoptions(threshold=100000)
|
||||
|
||||
from gpu import (
|
||||
add_gpu_params,
|
||||
parse_gpu,
|
||||
distributed_opt,
|
||||
distributed_gather,
|
||||
distributed_sync,
|
||||
cleanup
|
||||
)
|
||||
from optimizer import (
|
||||
create_adam_optimizer,
|
||||
create_optimizer_scheduler,
|
||||
add_optimizer_params,
|
||||
create_adam_optimizer_from_args
|
||||
)
|
||||
|
||||
from data_utils import FT_Dataset
|
||||
from model import GPT2Config, GPT2LMModel
|
||||
from exp_utils import create_exp_dir
|
||||
|
||||
import loralib as lora
|
||||
|
||||
parser = argparse.ArgumentParser(description='PyTorch GPT2 ft script')
|
||||
|
||||
add_gpu_params(parser)
|
||||
add_optimizer_params(parser)
|
||||
|
||||
parser.add_argument('--train_data', required=True, help='location of training data corpus')
|
||||
|
||||
parser.add_argument('--valid_data', required=True, help='location of validation data corpus')
|
||||
|
||||
parser.add_argument('--train_batch_size', type=int, default=8, help='training batch size')
|
||||
|
||||
parser.add_argument('--valid_batch_size', type=int, default=4, help='validation batch size')
|
||||
|
||||
parser.add_argument('--grad_acc', type=int, default=1, help='gradient accumulation steps')
|
||||
|
||||
parser.add_argument('--clip', type=float, default=0.0, help='gradient clip')
|
||||
|
||||
parser.add_argument('--seq_len', type=int, default=512, help='number of tokens to predict.')
|
||||
|
||||
parser.add_argument('--model_card', default='gpt2.md', choices=['gpt2.sm', 'gpt2.md', 'gpt2.lg'],
|
||||
help='model names')
|
||||
|
||||
parser.add_argument('--init_checkpoint', default=None, help='pretrained checkpoint path')
|
||||
|
||||
parser.add_argument('--fp16', action='store_true', help='train model with fp16')
|
||||
|
||||
parser.add_argument('--log_interval', type=int, default=100, help='log interval')
|
||||
|
||||
parser.add_argument('--eval_interval', type=int, default=2000, help='eval interval')
|
||||
|
||||
parser.add_argument('--save_interval', type=int, default=500, help='save interval')
|
||||
|
||||
parser.add_argument('--work_dir', type=str, default=os.getenv('PT_OUTPUT_DIR', 'gpt2_model'),
|
||||
help='working folder.')
|
||||
|
||||
parser.add_argument('--lora_dim', type=int, default=0, help='lora attn dimension')
|
||||
|
||||
parser.add_argument('--lora_alpha', type=int, default=128, help='lora attn alpha')
|
||||
|
||||
parser.add_argument('--obj', default='clm', choices=['jlm', 'clm'],
|
||||
help='language model training objective')
|
||||
|
||||
parser.add_argument('--lora_dropout', default=0.0, type=float,
|
||||
help='dropout probability for lora layers')
|
||||
|
||||
parser.add_argument('--label_smooth', default=0.0, type=float, help='label smoothing')
|
||||
|
||||
parser.add_argument('--roll_interval', type=int, default=-1, help='rolling interval')
|
||||
|
||||
parser.add_argument('--roll_lr', type=float, default=0.00001, help='rolling learning rate')
|
||||
|
||||
parser.add_argument('--roll_step', type=int, default=100, help='rolling step')
|
||||
|
||||
parser.add_argument('--eval_epoch', type=int, default=1, help='eval per number of epochs')
|
||||
|
||||
# influence model, calculate the influence score between two samples.
|
||||
def print_args(args):
|
||||
if args.rank == 0:
|
||||
print('=' * 100)
|
||||
for k, v in args.__dict__.items():
|
||||
print(f' - {k} : {v}')
|
||||
print('=' * 100)
|
||||
|
||||
|
||||
class AverageMeter(object):
|
||||
"""Computes and stores the average and current value
|
||||
Imported from https://github.com/pytorch/examples/blob/master/imagenet/main.py#L247-L262
|
||||
"""
|
||||
def __init__(self):
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
self.val = 0
|
||||
self.avg = 0
|
||||
self.sum = 0
|
||||
self.count = 0
|
||||
|
||||
def update(self, val, n=1):
|
||||
self.val = val
|
||||
self.sum += val * n
|
||||
self.count += n
|
||||
self.avg = self.sum / self.count
|
||||
|
||||
|
||||
def optimizer_step(_loss, _optimizer, _model, _schedule, args, is_update=True):
|
||||
if args.fp16:
|
||||
with amp.scale_loss(_loss, _optimizer) as _scaled_loss:
|
||||
_scaled_loss.backward()
|
||||
else:
|
||||
_loss.backward()
|
||||
|
||||
if is_update:
|
||||
if args.clip > 0:
|
||||
if args.fp16:
|
||||
torch.nn.utils.clip_grad_norm_(amp.master_params(_optimizer), args.clip)
|
||||
else:
|
||||
torch.nn.utils.clip_grad_norm_(_model.parameters(), args.clip)
|
||||
|
||||
_optimizer.step()
|
||||
_optimizer.zero_grad()
|
||||
|
||||
if _schedule is not None:
|
||||
_schedule.step()
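# Editor's note (illustrative): gradients are accumulated over args.grad_acc
# micro-batches before an optimizer step, so the caller passes in a loss already
# divided by grad_acc and the effective batch size is
# train_batch_size * grad_acc * world_size, e.g. 8 * 4 * 2 GPUs = 64 sequences per update.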
|
||||
|
||||
|
||||
def evaluate(model, valid_loader, args):
|
||||
model.eval()
|
||||
total_loss = 0.
|
||||
start_time = time.time()
|
||||
|
||||
avg_lm_loss = AverageMeter()
|
||||
|
||||
with torch.no_grad():
|
||||
for idx, data in enumerate(valid_loader):
|
||||
data = {key: value for key, value in data.items()}
|
||||
|
||||
_input = data['input'].to(args.device)
|
||||
_target = data['target'].to(args.device)
|
||||
_msk = data['mask'].to(args.device)
|
||||
|
||||
_lm_logits, _loss = model(_input, lm_labels=_target, lm_mask=_msk)
|
||||
loss = _loss.mean()
|
||||
|
||||
avg_lm_loss.update(loss.item())
|
||||
|
||||
if idx % 100 == 0:
|
||||
print('eval samples:', idx, 'loss:', loss.float())
|
||||
|
||||
total_time = time.time() - start_time
|
||||
print('average loss', avg_lm_loss.avg)
|
||||
return avg_lm_loss.avg, math.exp(avg_lm_loss.avg)
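# Editor's note (illustrative): the second return value is perplexity, i.e.
# exp(mean token cross-entropy); an average loss of 3.0 corresponds to ppl ≈ 20.1.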
|
||||
|
||||
|
||||
def train_validate(
|
||||
model,
|
||||
optimizer,
|
||||
scheduler,
|
||||
train_loader,
|
||||
valid_loader,
|
||||
args,
|
||||
train_step=0,
|
||||
epoch=0
|
||||
):
|
||||
model.train()
|
||||
avg_lm_loss = AverageMeter()
|
||||
print('start to train the model................', epoch)
|
||||
log_start_time = time.time()
|
||||
best_val_ppl = None
|
||||
|
||||
train_loader.sampler.set_epoch(epoch)
|
||||
|
||||
for idx, data in enumerate(train_loader):
|
||||
data = {key: value for key, value in data.items()}
|
||||
|
||||
_input = data['input'].to(args.device)
|
||||
_target = data['target'].to(args.device)
|
||||
_msk = data['mask'].to(args.device)
|
||||
|
||||
_lm_logits, _lm_loss = model(
|
||||
_input, lm_labels=_target, lm_mask=_msk, label_smooth=args.label_smooth
|
||||
)
|
||||
|
||||
_lm_loss = _lm_loss.mean()
|
||||
|
||||
train_step += 1
|
||||
        is_update = (train_step % args.grad_acc == 0)
|
||||
avg_lm_loss.update(_lm_loss.item())
|
||||
optimizer_step(
|
||||
_lm_loss/(args.grad_acc), optimizer, model, scheduler, args, is_update=is_update
|
||||
)
|
||||
|
||||
if train_step % args.log_interval == 0:
|
||||
elapsed = time.time() - log_start_time
|
||||
lr = optimizer.param_groups[0]['lr']
|
||||
log_str = f'| epoch {epoch:3d} step {train_step:>8d} | { idx + 1:>6d} batches | ' \
|
||||
f'lr {lr:.3g} | ms/batch {elapsed * 1000 / args.log_interval:5.2f} | ' \
|
||||
f'loss {avg_lm_loss.val:5.2f} | avg loss {avg_lm_loss.avg:5.2f} | ' \
|
||||
f'ppl {math.exp(avg_lm_loss.avg):5.2f}'
|
||||
|
||||
if args.rank == 0:
|
||||
print(log_str)
|
||||
log_start_time = time.time()
|
||||
avg_lm_loss.reset()
|
||||
|
||||
if train_step % args.save_interval == 0:
|
||||
if args.rank == 0:
|
||||
model_path = os.path.join(args.work_dir, f'model.{train_step}.pt')
|
||||
print('saving checkpoint', model_path)
|
||||
torch.save({'model_state_dict': lora.lora_state_dict(model)}, model_path)
|
||||
distributed_sync(args)
|
||||
|
||||
# evaluation interval
|
||||
if train_step % args.eval_interval == 0:
|
||||
eval_start_time = time.time()
|
||||
|
||||
valid_loss, valid_ppl = evaluate(model, valid_loader, args)
|
||||
|
||||
if best_val_ppl is None or valid_ppl < best_val_ppl:
|
||||
best_val_ppl = valid_ppl
|
||||
|
||||
log_str = f'| Eval {train_step // args.eval_interval:3d} at step {train_step:>8d} | ' \
|
||||
f'time: {time.time() - eval_start_time:5.2f}s | valid loss {valid_loss:5.2f} | ' \
|
||||
f'valid ppl {valid_ppl:5.2f} | best ppl {best_val_ppl:5.2f} '
|
||||
|
||||
if args.rank == 0:
|
||||
print('-' * 100)
|
||||
print(log_str)
|
||||
print('-' * 100)
|
||||
|
||||
model.train()
|
||||
distributed_sync(args)
|
||||
|
||||
if train_step == args.max_step:
|
||||
break
|
||||
|
||||
if args.rank == 0:
|
||||
model_path = os.path.join(args.work_dir, f'model.{train_step}.pt')
|
||||
print('saving checkpoint', model_path)
|
||||
torch.save({'model_state_dict': model.state_dict()}, model_path)
|
||||
distributed_sync(args)
|
||||
return train_step
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parser.parse_args()
|
||||
parse_gpu(args)
|
||||
print_args(args)
|
||||
|
||||
if args.fp16:
|
||||
try:
|
||||
from apex import amp
|
||||
except Exception as e:
|
||||
            print('Warning: could not import amp; apex may not be installed')
|
||||
|
||||
torch.manual_seed(args.random_seed)
|
||||
random.seed(args.random_seed)
|
||||
|
||||
if args.rank == 0:
|
||||
args.logging = create_exp_dir(args.work_dir)
|
||||
|
||||
train_data = FT_Dataset(
|
||||
args.train_data, args.train_batch_size, args.seq_len,
|
||||
joint_lm=args.obj=='jlm'
|
||||
)
|
||||
|
||||
valid_data = FT_Dataset(
|
||||
args.valid_data, args.valid_batch_size, args.seq_len,
|
||||
)
|
||||
|
||||
train_loader = DataLoader(
|
||||
train_data, batch_size=args.train_batch_size, num_workers=0,
|
||||
shuffle=False, pin_memory=False, drop_last=True,
|
||||
sampler=torch.utils.data.distributed.DistributedSampler(train_data, seed=args.random_seed)
|
||||
)
|
||||
|
||||
valid_loader = DataLoader(
|
||||
valid_data, batch_size=args.valid_batch_size, num_workers=0,
|
||||
shuffle=False, pin_memory=False, drop_last=False,
|
||||
sampler=torch.utils.data.distributed.DistributedSampler(valid_data, seed=args.random_seed)
|
||||
)
|
||||
|
||||
if args.model_card == 'gpt2.sm':
|
||||
config = GPT2Config(
|
||||
n_embd=768, n_layer=12, n_head=12,
|
||||
lora_attn_dim=args.lora_dim,
|
||||
lora_attn_alpha=args.lora_alpha,
|
||||
lora_dropout=args.lora_dropout,
|
||||
)
|
||||
elif args.model_card == 'gpt2.md':
|
||||
config = GPT2Config(
|
||||
n_embd=1024, n_layer=24, n_head=16,
|
||||
lora_attn_dim=args.lora_dim,
|
||||
lora_attn_alpha=args.lora_alpha,
|
||||
lora_dropout=args.lora_dropout,
|
||||
)
|
||||
elif args.model_card == 'gpt2.lg':
|
||||
config = GPT2Config(
|
||||
n_embd=1280, n_layer=36, n_head=20,
|
||||
lora_attn_dim=args.lora_dim,
|
||||
lora_attn_alpha=args.lora_alpha,
|
||||
lora_dropout=args.lora_dropout,
|
||||
)
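# Editor's note (approximate figures, not stated in this file): the three model cards
# correspond to GPT-2 small / medium / large backbones with roughly 124M, 355M and
# 774M pretrained parameters; when --lora_dim > 0 only the LoRA matrices added to the
# attention layers are trained and the backbone weights stay frozen.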
|
||||
|
||||
lm_net = GPT2LMModel(config)
|
||||
if args.init_checkpoint is not None:
|
||||
        print('loading pretrained model weights.')
|
||||
lm_net.load_weight(torch.load(args.init_checkpoint))
|
||||
|
||||
lm_net = lm_net.cuda()
|
||||
|
||||
if args.lora_dim > 0:
|
||||
lora.mark_only_lora_as_trainable(lm_net)
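# Editor's sketch (numbers are an estimate, not from the source): with gpt2.md and
# --lora_dim 4 each attention block gets lora_A of shape (8, 1024) and lora_B of
# shape (2048, 4) for the query/value projections, about 16K extra weights per layer
# and roughly 0.4M trainable parameters across 24 layers, versus ~355M frozen ones.
#   # hypothetical check:
#   trainable = sum(p.numel() for p in lm_net.parameters() if p.requires_grad)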
|
||||
optimizer = create_adam_optimizer_from_args(lm_net, args)
|
||||
|
||||
if args.max_step is None:
|
||||
args.max_step = (args.max_epoch * train_data.num_batches + args.world_size - 1) // args.world_size
|
||||
print('set max_step:', args.max_step)
|
||||
|
||||
scheduler = create_optimizer_scheduler(optimizer, args)
|
||||
if args.fp16:
|
||||
lm_net, optimizer = amp.initialize(lm_net, optimizer, opt_level="O1")
|
||||
lm_net, optimizer = distributed_opt(args, lm_net, optimizer, grad_acc=args.grad_acc)
|
||||
|
||||
try:
|
||||
train_step = 0
|
||||
for epoch in itertools.count(start=1):
|
||||
train_step = train_validate(
|
||||
lm_net, optimizer, scheduler, train_loader, valid_loader, args,
|
||||
train_step=train_step, epoch=epoch
|
||||
)
|
||||
|
||||
if train_step >= args.max_step or (args.max_epoch is not None and epoch >= args.max_epoch):
|
||||
if args.rank == 0:
|
||||
print('-' * 100)
|
||||
print('End of training')
|
||||
break
|
||||
except KeyboardInterrupt:
|
||||
if args.rank == 0:
|
||||
print('-' * 100)
|
||||
print('Exiting from training early')
|
||||
|
||||
distributed_sync(args)
|
||||
print('cleanup dist ...')
|
||||
cleanup(args)
|
|
@@ -0,0 +1,126 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import argparse
|
||||
import time
|
||||
import math
|
||||
import os, sys
|
||||
import itertools
|
||||
|
||||
import numpy as np
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.optim as optim
|
||||
import torch.distributed as dist
|
||||
|
||||
|
||||
def add_gpu_params(parser: argparse.ArgumentParser):
|
||||
parser.add_argument("--platform", default='k8s', type=str, help='platform cloud')
|
||||
parser.add_argument("--local_rank", default=0, type=int, help='local rank')
|
||||
parser.add_argument("--rank", default=0, type=int, help='rank')
|
||||
parser.add_argument("--device", default=0, type=int, help='device')
|
||||
parser.add_argument("--world_size", default=0, type=int, help='world size')
|
||||
parser.add_argument("--random_seed", default=10, type=int, help='random seed')
|
||||
|
||||
|
||||
def distributed_opt(args, model, opt, grad_acc=1):
|
||||
if args.platform == 'azure':
|
||||
args.hvd.broadcast_parameters(model.state_dict(), root_rank=0)
|
||||
opt = args.hvd.DistributedOptimizer(
|
||||
opt, named_parameters=model.named_parameters(), backward_passes_per_step=grad_acc
|
||||
)
|
||||
elif args.platform == 'philly' or args.platform == 'k8s' or args.platform == 'local':
|
||||
model = torch.nn.parallel.DistributedDataParallel(
|
||||
model, device_ids=[args.local_rank], output_device=args.local_rank,
|
||||
find_unused_parameters=False, broadcast_buffers=False
|
||||
)
|
||||
return model, opt
|
||||
|
||||
|
||||
def distributed_gather(args, tensor):
|
||||
g_y = [torch.zeros_like(tensor) for _ in range(args.world_size)]
|
||||
torch.distributed.all_gather(g_y, tensor, async_op=False)
|
||||
return torch.stack(g_y)
|
||||
|
||||
|
||||
def distributed_sync(args):
|
||||
if args.platform == 'azure':
|
||||
args.hvd.allreduce(torch.tensor(0), name='barrier')
|
||||
else:
|
||||
args.dist.barrier()
|
||||
|
||||
|
||||
def parse_gpu(args):
|
||||
torch.manual_seed(args.random_seed)
|
||||
|
||||
if args.platform == 'local':
|
||||
dist.init_process_group(backend='nccl')
|
||||
local_rank = torch.distributed.get_rank()
|
||||
torch.cuda.set_device(local_rank)
|
||||
device = torch.device('cuda', local_rank)
|
||||
args.rank = local_rank
|
||||
args.device = device
|
||||
args.world_size = torch.distributed.get_world_size()
|
||||
args.dist = dist
|
||||
|
||||
elif args.platform == 'azure':
|
||||
import horovod.torch as hvd
|
||||
hvd.init()
|
||||
print('azure hvd rank', hvd.rank(), 'local rank', hvd.local_rank())
|
||||
local_rank = hvd.local_rank()
|
||||
torch.cuda.set_device(local_rank)
|
||||
device = torch.device('cuda', local_rank)
|
||||
rank = hvd.rank()
|
||||
world_size = hvd.size()
|
||||
|
||||
args.local_rank = local_rank
|
||||
args.rank = rank
|
||||
args.device = device
|
||||
args.world_size = world_size
|
||||
args.hvd = hvd
|
||||
|
||||
elif args.platform == 'philly':
|
||||
local_rank = args.local_rank
|
||||
torch.cuda.set_device(local_rank)
|
||||
dist.init_process_group(backend='nccl')
|
||||
rank = dist.get_rank()
|
||||
world_size = torch.distributed.get_world_size()
|
||||
device = torch.device('cuda', local_rank)
|
||||
|
||||
args.rank = rank
|
||||
args.device = device
|
||||
args.world_size = world_size
|
||||
args.dist = dist
|
||||
elif args.platform == 'k8s':
|
||||
master_uri = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
|
||||
local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
|
||||
args.local_rank = local_rank
|
||||
world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
|
||||
world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
|
||||
rank = world_rank
|
||||
torch.cuda.set_device(local_rank)
|
||||
|
||||
dist.init_process_group(
|
||||
backend='nccl',
|
||||
init_method=master_uri,
|
||||
world_size=world_size,
|
||||
rank=world_rank,
|
||||
)
|
||||
device = torch.device("cuda", local_rank)
|
||||
args.rank = rank
|
||||
args.device = device
|
||||
args.world_size = world_size
|
||||
args.dist = dist
|
||||
print(
|
||||
'myrank:', args.rank,
|
||||
'local_rank:', args.local_rank,
|
||||
'device_count:', torch.cuda.device_count(),
|
||||
'world_size:', args.world_size
|
||||
)
|
||||
|
||||
|
||||
def cleanup(args):
|
||||
if args.platform == 'k8s' or args.platform == 'philly':
|
||||
args.dist.destroy_process_group()
|
|
@@ -0,0 +1,450 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
from collections import OrderedDict
|
||||
import copy
|
||||
import math
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
from torch.nn import CrossEntropyLoss, MSELoss
|
||||
import torch.nn.functional as F
|
||||
from torch.optim import Optimizer
|
||||
from torch.optim.lr_scheduler import LambdaLR
|
||||
from torch.nn.parameter import Parameter
|
||||
|
||||
import loralib as lora
|
||||
|
||||
|
||||
def gelu(x):
|
||||
return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
||||
|
||||
|
||||
def gelu_fast(x):
|
||||
return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))
|
||||
|
||||
|
||||
def gelu_new(x):
|
||||
""" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
|
||||
Also see https://arxiv.org/abs/1606.08415
|
||||
"""
|
||||
return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
|
||||
|
||||
|
||||
def swish(x):
|
||||
return x * torch.sigmoid(x)
|
||||
|
||||
|
||||
def _gelu_python(x):
|
||||
""" Original Implementation of the gelu activation function in Google Bert repo when initially created.
|
||||
For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
|
||||
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
||||
This is now written in C in torch.nn.functional
|
||||
Also see https://arxiv.org/abs/1606.08415
|
||||
"""
|
||||
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
|
||||
|
||||
|
||||
class LayerNorm(nn.Module):
|
||||
def __init__(self, hidden_size, eps=1e-12):
|
||||
"""Construct a layernorm module in the TF style (epsilon inside the square root)."""
|
||||
super(LayerNorm, self).__init__()
|
||||
self.weight = nn.Parameter(torch.ones(hidden_size))
|
||||
self.bias = nn.Parameter(torch.zeros(hidden_size))
|
||||
self.variance_epsilon = eps
|
||||
|
||||
def forward(self, x):
|
||||
u = x.mean(-1, keepdim=True)
|
||||
s = (x - u).pow(2).mean(-1, keepdim=True)
|
||||
x = (x - u) / torch.sqrt(s + self.variance_epsilon)
|
||||
return self.weight * x + self.bias
|
||||
|
||||
|
||||
class Conv1D(nn.Module):
|
||||
def __init__(self, nf, nx):
|
||||
super(Conv1D, self).__init__()
|
||||
self.nf = nf
|
||||
w = torch.empty(nx, nf)
|
||||
nn.init.normal_(w, std=0.02)
|
||||
self.weight = Parameter(w)
|
||||
self.bias = Parameter(torch.zeros(nf))
|
||||
|
||||
def forward(self, x):
|
||||
size_out = x.size()[:-1] + (self.nf,)
|
||||
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
|
||||
x = x.view(*size_out)
|
||||
return x
|
||||
|
||||
|
||||
class Attention(nn.Module):
|
||||
def __init__(self, nx, n_ctx, config, scale=False):
|
||||
super(Attention, self).__init__()
|
||||
n_state = nx # in Attention: n_state=768 (nx=n_embd)
|
||||
# [switch nx => n_state from Block to Attention to keep identical to TF implem]
|
||||
|
||||
assert n_state % config.n_head == 0
|
||||
self.register_buffer("bias", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))
|
||||
self.n_head = config.n_head
|
||||
self.split_size = n_state
|
||||
self.scale = scale
|
||||
self.c_attn = Conv1D(n_state * 3, nx)
|
||||
self.c_attn = lora.MergedLinear(
|
||||
nx, n_state * 3,
|
||||
r=config.lora_attn_dim,
|
||||
lora_alpha=config.lora_attn_alpha,
|
||||
lora_dropout=config.lora_dropout,
|
||||
enable_lora=[True, False, True],
|
||||
fan_in_fan_out=True,
|
||||
merge_weights=False
|
||||
)
|
||||
self.c_proj = Conv1D(n_state, nx)
|
||||
|
||||
self.config = config
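# Editor's note (interpretation of the code above): the Conv1D c_attn created first is
# immediately replaced by lora.MergedLinear, which adds rank-r low-rank update matrices
# to the fused qkv projection; enable_lora=[True, False, True] adapts only the query and
# value projections, and fan_in_fan_out=True matches the (nx, 3*nx) weight layout that
# Conv1D and the GPT-2 checkpoint use.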
|
||||
|
||||
def _attn(self, q, k, v, len_kv=None):
|
||||
w = torch.matmul(q, k)
|
||||
if self.scale:
|
||||
w = w / math.sqrt(v.size(-1))
|
||||
nd, ns = w.size(-2), w.size(-1)
|
||||
b = self.bias[:, :, ns-nd:ns, :ns]
|
||||
w = w * b - 1e10 * (1 - b)
|
||||
|
||||
# q : (batch, head, q_seq_length, head_features)
|
||||
# k : (batch, head, head_features, kv_seq_length)
|
||||
# w : (batch, head, q_seq_length, kv_seq_length)
|
||||
# v : (batch, head, kv_seq_length, head_features)
|
||||
if len_kv is not None:
|
||||
_len = torch.arange(k.size(-1), device=k.device)
|
||||
_input_msk = _len[None, :] >= (len_kv)[:, None]
|
||||
w = w.masked_fill(_input_msk.unsqueeze(1).unsqueeze(2), -1.0e10)
|
||||
|
||||
w = nn.Softmax(dim=-1)(w)
|
||||
return torch.matmul(w, v)
|
||||
|
||||
def merge_heads(self, x):
|
||||
x = x.permute(0, 2, 1, 3).contiguous()
|
||||
new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)
|
||||
return x.view(*new_x_shape) # in Tensorflow implem: fct merge_states
|
||||
|
||||
def split_heads(self, x, k=False):
|
||||
new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)
|
||||
x = x.view(*new_x_shape) # in Tensorflow implem: fct split_states
|
||||
if k:
|
||||
return x.permute(0, 2, 3, 1).contiguous() # (batch, head, head_features, seq_length)
|
||||
else:
|
||||
return x.permute(0, 2, 1, 3).contiguous() # (batch, head, seq_length, head_features)
|
||||
|
||||
def forward(self, x, history=None, layer_past=None, len_past=None):
|
||||
hidden_states = x
|
||||
|
||||
x = self.c_attn(x)
|
||||
query, key, value = x.split(self.split_size, dim=2)
|
||||
|
||||
query = self.split_heads(query)
|
||||
key = self.split_heads(key, k=True)
|
||||
value = self.split_heads(value)
|
||||
|
||||
#_input_msk = None
|
||||
|
||||
len_kv = None
|
||||
|
||||
if layer_past is not None:
|
||||
# key : (batch, head, head_features, seq_length)
|
||||
# value : (batch, head, seq_length, head_features)
|
||||
# layer_past, key : (batch, head, seq_length, head_features)
|
||||
if len_past is None:
|
||||
past_key, past_value = layer_past[0].transpose(-2, -1), layer_past[1] # transpose back cf below
|
||||
key = torch.cat((past_key, key), dim=-1)
|
||||
value = torch.cat((past_value, value), dim=-2)
|
||||
else:
|
||||
key_seq = key.shape[-1]
|
||||
assert key_seq == 1
|
||||
|
||||
_batch = torch.arange(0, key.shape[0], dtype=torch.long, device=key.device)
|
||||
|
||||
past_key, past_value = layer_past[0], layer_past[1]
|
||||
|
||||
past_key[_batch,:,len_past,:] = key.squeeze(-1)
|
||||
past_value[_batch,:,len_past,:] = value.squeeze(-2)
|
||||
|
||||
key = past_key.transpose(-2, -1)
|
||||
value = past_value
|
||||
|
||||
len_kv = len_past + 1
|
||||
|
||||
present = torch.stack((key.transpose(-2, -1), value)) # transpose to have same shapes for stacking
|
||||
a = self._attn(query, key, value, len_kv = len_kv)
|
||||
a = self.merge_heads(a)
|
||||
a = self.c_proj(a)
|
||||
return a, present
|
||||
|
||||
|
||||
class MLP(nn.Module):
|
||||
def __init__(self, n_state, config): # in MLP: n_state=3072 (4 * n_embd)
|
||||
super(MLP, self).__init__()
|
||||
nx = config.n_embd
|
||||
self.c_fc = Conv1D(n_state, nx)
|
||||
self.c_proj = Conv1D(nx, n_state)
|
||||
self.act = gelu
|
||||
|
||||
def forward(self, x):
|
||||
h = self.act(self.c_fc(x))
|
||||
h2 = self.c_proj(h)
|
||||
return h2
|
||||
|
||||
|
||||
class Block(nn.Module):
|
||||
def __init__(self, n_ctx, config, scale=False):
|
||||
super(Block, self).__init__()
|
||||
nx = config.n_embd
|
||||
self.ln_1 = LayerNorm(nx, eps=config.layer_norm_epsilon)
|
||||
self.attn = Attention(nx, n_ctx, config, scale)
|
||||
self.ln_2 = LayerNorm(nx, eps=config.layer_norm_epsilon)
|
||||
self.mlp = MLP(4 * nx, config)
|
||||
|
||||
def forward(self, x, layer_past=None, len_past=None):
|
||||
a, present = self.attn(self.ln_1(x), layer_past=layer_past, len_past=len_past)
|
||||
x = x + a
|
||||
m = self.mlp(self.ln_2(x))
|
||||
x = x + m
|
||||
return x, present
|
||||
|
||||
|
||||
class GPT2Model(nn.Module):
|
||||
def __init__(self, config):
|
||||
super(GPT2Model, self).__init__()
|
||||
self.n_layer = config.n_layer
|
||||
self.n_embd = config.n_embd
|
||||
self.n_vocab = config.vocab_size
|
||||
|
||||
self.wte = nn.Embedding(config.vocab_size, config.n_embd)
|
||||
self.wpe = nn.Embedding(config.n_positions, config.n_embd)
|
||||
block = Block(config.n_ctx, config, scale=True)
|
||||
self.h = nn.ModuleList([copy.deepcopy(block) for _ in range(config.n_layer)])
|
||||
self.ln_f = LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
|
||||
|
||||
self.config = config
|
||||
|
||||
|
||||
def forward(
|
||||
self,
|
||||
input_ids,
|
||||
position_ids=None,
|
||||
token_type_ids=None,
|
||||
past=None,
|
||||
len_past=None
|
||||
):
|
||||
if past is None:
|
||||
past_length = 0
|
||||
past = [None] * len(self.h)
|
||||
elif len_past is None:
|
||||
# equal size for past. []
|
||||
past_length = past[0][0].size(-2)
|
||||
|
||||
if position_ids is None and len_past is None:
|
||||
position_ids = torch.arange(
|
||||
past_length, input_ids.size(-1) + past_length,
|
||||
dtype=torch.long, device=input_ids.device
|
||||
)
|
||||
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
|
||||
elif len_past is not None:
|
||||
position_ids = (len_past).unsqueeze(1) #.long()
|
||||
|
||||
input_shape = input_ids.size()
|
||||
input_ids = input_ids.view(-1, input_ids.size(-1))
|
||||
position_ids = position_ids.view(-1, position_ids.size(-1))
|
||||
|
||||
inputs_embeds = self.wte(input_ids)
|
||||
|
||||
position_embeds = self.wpe(position_ids)
|
||||
|
||||
if token_type_ids is not None:
|
||||
token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))
|
||||
token_type_embeds = self.wte(token_type_ids)
|
||||
else:
|
||||
token_type_embeds = 0
|
||||
hidden_states = inputs_embeds + position_embeds + token_type_embeds
|
||||
presents = []
|
||||
for block, layer_past in zip(self.h, past):
|
||||
hidden_states, present = block(hidden_states, layer_past = layer_past, len_past=len_past)
|
||||
presents.append(present)
|
||||
hidden_states = self.ln_f(hidden_states)
|
||||
output_shape = input_shape + (hidden_states.size(-1),)
|
||||
return hidden_states.view(*output_shape), presents
|
||||
|
||||
|
||||
class GPT2LMHead(nn.Module):
|
||||
def __init__(self, model_embeddings_weights, config):
|
||||
super(GPT2LMHead, self).__init__()
|
||||
self.n_embd = config.n_embd
|
||||
self.set_embeddings_weights(model_embeddings_weights)
|
||||
|
||||
def set_embeddings_weights(self, model_embeddings_weights):
|
||||
embed_shape = model_embeddings_weights.shape
|
||||
self.decoder = nn.Linear(embed_shape[1], embed_shape[0], bias=False)
|
||||
self.decoder.weight = model_embeddings_weights # Tied weights
|
||||
|
||||
def forward(self, hidden_state):
|
||||
# Truncated Language modeling logits (we remove the last token)
|
||||
# h_trunc = h[:, :-1].contiguous().view(-1, self.n_embd)
|
||||
lm_logits = self.decoder(hidden_state)
|
||||
return lm_logits
|
||||
|
||||
|
||||
class GPT2Config(object):
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size_or_config_json_file=50257,
|
||||
n_positions=1024,
|
||||
n_ctx=1024,
|
||||
n_embd=768,
|
||||
n_layer=12,
|
||||
n_head=12,
|
||||
layer_norm_epsilon=1e-5,
|
||||
initializer_range=0.02,
|
||||
lora_attn_dim=0,
|
||||
lora_attn_alpha=128,
|
||||
lora_dropout=0.0,
|
||||
lora_r_dropout=0.0,
|
||||
fix_dropout=0.0,
|
||||
):
|
||||
self.vocab_size = vocab_size_or_config_json_file
|
||||
self.n_ctx = n_ctx
|
||||
self.n_positions = n_positions
|
||||
self.n_embd = n_embd
|
||||
self.n_layer = n_layer
|
||||
self.n_head = n_head
|
||||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_range = initializer_range
|
||||
self.lora_attn_dim = lora_attn_dim
|
||||
self.lora_attn_alpha = lora_attn_alpha
|
||||
self.lora_dropout = lora_dropout
|
||||
self.lora_r_dropout = lora_r_dropout
|
||||
|
||||
self.fix_dropout = fix_dropout
|
||||
|
||||
|
||||
class GPT2LMModel(nn.Module):
|
||||
def __init__(self, config):
|
||||
super(GPT2LMModel, self).__init__()
|
||||
self.transformer = GPT2Model(config)
|
||||
self.lm_head = GPT2LMHead(self.transformer.wte.weight, config)
|
||||
self.apply(self._init_weights)
|
||||
|
||||
def set_tied(self):
|
||||
""" Make sure we are sharing the embeddings"""
|
||||
self.lm_head.set_embeddings_weights(self.transformer.wte.weight)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
input_ids,
|
||||
lm_labels=None,
|
||||
lm_mask=None,
|
||||
past=None,
|
||||
len_past=None,
|
||||
label_smooth=0.0,
|
||||
is_report_accuracy=False
|
||||
):
|
||||
_batch, _len = input_ids.shape
|
||||
hidden_states, presents = self.transformer(input_ids, past=past, len_past=len_past)
|
||||
|
||||
# batch, seq, vocab
|
||||
lm_logits = self.lm_head(hidden_states)
|
||||
|
||||
if lm_labels is not None:
|
||||
|
||||
if is_report_accuracy:
|
||||
_pred_token = torch.argmax(lm_logits, dim=-1)
|
||||
_hit = (_pred_token == lm_labels) * lm_mask
|
||||
|
||||
_t1_acc = torch.zeros(_batch, dtype=torch.float, device=input_ids.device)
|
||||
_all_acc = torch.zeros(_batch, dtype=torch.float, device=input_ids.device)
|
||||
|
||||
for _b in range(0, _batch):
|
||||
for _i in range(0, _len):
|
||||
if lm_mask[_b, _i] >= 1.0:
|
||||
if _hit[_b, _i] > 0:
|
||||
_t1_acc[_b] = 1.0
|
||||
break
|
||||
|
||||
_is_succ = True
|
||||
for _i in range(0, _len):
|
||||
if lm_mask[_b, _i] >= 1.0:
|
||||
if _hit[_b, _i] <= 0:
|
||||
_is_succ = False
|
||||
break
|
||||
|
||||
if _is_succ:
|
||||
_all_acc[_b] = 1.0
|
||||
|
||||
#_t1_acc = _t1_acc * 1.0 / _batch
|
||||
#_all_acc = _all_acc * 1.0 / _batch
|
||||
|
||||
if label_smooth > 0.0001:
|
||||
logprobs = torch.nn.functional.log_softmax(lm_logits.view(-1, lm_logits.size(-1)), dim=-1)
|
||||
nll_loss = -logprobs.gather(dim=-1, index=lm_labels.view(-1).unsqueeze(1))
|
||||
nll_loss = nll_loss.squeeze(1)
|
||||
smooth_loss = -logprobs.mean(dim=-1)
|
||||
loss = (1.0 - label_smooth) * nll_loss + label_smooth * smooth_loss
|
||||
loss = loss.view(_batch, _len)
|
||||
else:
|
||||
                loss_fct = nn.CrossEntropyLoss(ignore_index=-1, reduction='none')
|
||||
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1)).view(_batch, _len)
|
||||
|
||||
if lm_mask is None:
|
||||
lm_mask = torch.ones(loss.shape, dtype=loss.dtype, device=loss.device)
|
||||
loss = loss * lm_mask
|
||||
|
||||
loss = loss.sum() / (lm_mask.sum() + 0.0001)
|
||||
|
||||
if is_report_accuracy:
|
||||
return lm_logits, loss, _t1_acc, _all_acc
|
||||
else:
|
||||
return lm_logits, loss
|
||||
return lm_logits, presents
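# Editor's note (illustrative): with label smoothing eps the per-token loss above is
# (1 - eps) * nll + eps * mean(-log p) over the vocabulary, and the masked losses are
# averaged as loss.sum() / (lm_mask.sum() + 1e-4), so padding positions with mask 0
# contribute nothing to the objective.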
|
||||
|
||||
def _init_weights(self, module):
|
||||
if isinstance(module, (nn.Linear, nn.Embedding)):
|
||||
module.weight.data.normal_(mean=0.0, std=0.02)
|
||||
elif isinstance(module, nn.LayerNorm):
|
||||
module.bias.data.zero_()
|
||||
module.weight.data.fill_(1.0)
|
||||
if isinstance(module, nn.Linear) and module.bias is not None:
|
||||
module.bias.data.zero_()
|
||||
|
||||
def load_weight(self, state_dict):
|
||||
if 'model_state_dict' in state_dict:
|
||||
state_dict = state_dict['model_state_dict']
|
||||
|
||||
state_dict_tmp = copy.deepcopy(state_dict)
|
||||
old_keys = []
|
||||
new_keys = []
|
||||
for key in state_dict_tmp:
|
||||
new_key = None
|
||||
if key.endswith(".g"):
|
||||
new_key = key[:-2] + ".weight"
|
||||
elif key.endswith(".b"):
|
||||
new_key = key[:-2] + ".bias"
|
||||
elif key.endswith(".w"):
|
||||
new_key = key[:-2] + ".weight"
|
||||
|
||||
if key.startswith("module.transformer."):
|
||||
new_key = key[len("module.transformer."):]
|
||||
|
||||
if new_key:
|
||||
old_keys.append(key)
|
||||
new_keys.append(new_key)
|
||||
|
||||
for old_key, new_key in zip(old_keys, new_keys):
|
||||
state_dict[new_key] = state_dict.pop(old_key)
|
||||
|
||||
for n, p in self.transformer.named_parameters():
|
||||
if n not in state_dict:
|
||||
state_dict[n] = p
|
||||
|
||||
self.transformer.load_state_dict(state_dict, strict=False)
|
||||
self.set_tied()
|
|
@@ -0,0 +1,360 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
from collections import OrderedDict
|
||||
import argparse
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
from torch.nn import CrossEntropyLoss, MSELoss
|
||||
import torch.nn.functional as F
|
||||
from torch.optim import Optimizer
|
||||
from torch.optim.lr_scheduler import LambdaLR, _LRScheduler
|
||||
|
||||
|
||||
def add_optimizer_params(parser: argparse.ArgumentParser):
|
||||
parser.add_argument('--lr', default=0.00001, type=float, help='learning rate')
|
||||
parser.add_argument('--weight_decay', default=0.01, type=float, help='weight decay rate')
|
||||
parser.add_argument('--correct_bias', action='store_true', help='correct adam bias term')
|
||||
parser.add_argument('--adam_epislon', default=1e-6, type=float, help='adam epsilon')
|
||||
    parser.add_argument('--no_decay_bias', action='store_true', help='no weight decay on bias weights')
|
||||
parser.add_argument('--adam_beta1', default=0.9, type=float, help='adam beta1 term')
|
||||
parser.add_argument('--adam_beta2', default=0.98, type=float, help='adam beta2 term')
|
||||
|
||||
parser.add_argument('--scheduler', default='linear', type=str,
|
||||
choices=['cosine', 'inv_sqrt', 'dev_perf', 'constant', 'linear', 'cycle'],
|
||||
help='lr scheduler to use.')
|
||||
|
||||
    parser.add_argument('--max_step', type=int, default=None, help='maximum number of training steps')
|
||||
|
||||
parser.add_argument('--max_epoch', type=int, default=None, help='max epoch of training')
|
||||
|
||||
    parser.add_argument('--warmup_step', type=int, default=0, help='number of learning rate warmup steps')
|
||||
|
||||
parser.add_argument('--i_steps', type=str, default='0', help='interval_steps')
|
||||
parser.add_argument('--i_lrs', type=str, default='0.00025', help='interval_lrs')
|
||||
|
||||
|
||||
class AdamW(Optimizer):
|
||||
""" Implements Adam algorithm with weight decay fix.
|
||||
Parameters:
|
||||
lr (float): learning rate. Default 1e-3.
|
||||
betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.98)
|
||||
eps (float): Adams epsilon. Default: 1e-6
|
||||
weight_decay (float): Weight decay. Default: 0.0
|
||||
correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.
|
||||
"""
|
||||
def __init__(self, params, lr=1e-3, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.0, correct_bias=True):
|
||||
if lr < 0.0:
|
||||
raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr))
|
||||
if not 0.0 <= betas[0] < 1.0:
|
||||
raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0[".format(betas[0]))
|
||||
if not 0.0 <= betas[1] < 1.0:
|
||||
raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0[".format(betas[1]))
|
||||
if not 0.0 <= eps:
|
||||
raise ValueError("Invalid epsilon value: {} - should be >= 0.0".format(eps))
|
||||
defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)
|
||||
super().__init__(params, defaults)
|
||||
|
||||
|
||||
def reset_state(self):
|
||||
        for group in self.param_groups:
|
||||
for p in group['params']:
|
||||
state = self.state[p]
|
||||
state['step'] = 0
|
||||
state["exp_avg"] = torch.zeros_like(p.data)
|
||||
state["exp_avg_sq"] = torch.zeros_like(p.data)
|
||||
|
||||
def step(self, closure=None):
|
||||
"""Performs a single optimization step.
|
||||
Arguments:
|
||||
closure (callable, optional): A closure that reevaluates the model
|
||||
and returns the loss.
|
||||
"""
|
||||
loss = None
|
||||
if closure is not None:
|
||||
loss = closure()
|
||||
|
||||
for group in self.param_groups:
|
||||
for p in group["params"]:
|
||||
if p.grad is None:
|
||||
continue
|
||||
grad = p.grad.data
|
||||
if grad.is_sparse:
|
||||
raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")
|
||||
|
||||
state = self.state[p]
|
||||
|
||||
# State initialization
|
||||
if len(state) == 0:
|
||||
state["step"] = 0
|
||||
# Exponential moving average of gradient values
|
||||
state["exp_avg"] = torch.zeros_like(p.data)
|
||||
# Exponential moving average of squared gradient values
|
||||
state["exp_avg_sq"] = torch.zeros_like(p.data)
|
||||
|
||||
exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
|
||||
beta1, beta2 = group["betas"]
|
||||
|
||||
state["step"] += 1
|
||||
|
||||
# Decay the first and second moment running average coefficient
|
||||
# In-place operations to update the averages at the same time
|
||||
exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
|
||||
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
|
||||
denom = exp_avg_sq.sqrt().add_(group["eps"])
|
||||
|
||||
step_size = group["lr"]
|
||||
if 'correct_bias' in group and group["correct_bias"]: # No bias correction for Bert
|
||||
bias_correction1 = 1.0 - beta1 ** state["step"]
|
||||
bias_correction2 = 1.0 - beta2 ** state["step"]
|
||||
step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
|
||||
|
||||
                p.data.addcdiv_(exp_avg, denom, value=-step_size)
|
||||
|
||||
# Just adding the square of the weights to the loss function is *not*
|
||||
# the correct way of using L2 regularization/weight decay with Adam,
|
||||
# since that will interact with the m and v parameters in strange ways.
|
||||
#
|
||||
# Instead we want to decay the weights in a manner that doesn't interact
|
||||
# with the m/v parameters. This is equivalent to adding the square
|
||||
# of the weights to the loss with plain (non-momentum) SGD.
|
||||
# Add weight decay at the end (fixed version)
|
||||
if group["weight_decay"] > 0.0:
|
||||
p.data.add_(p.data, alpha=-group["lr"] * group["weight_decay"])
|
||||
|
||||
return loss
|
||||
|
||||
|
||||
class CosineAnnealingWarmupRestarts(_LRScheduler):
|
||||
"""
|
||||
    Linear warmup followed by a single cosine decay (no restarts are actually performed).

    optimizer (Optimizer): Wrapped optimizer.
    max_lr (float): Peak learning rate reached at the end of warmup. Default: 0.1.
    min_lr (float): Learning rate used to initialize the optimizer's param groups. Default: 0.0.
    warmup_steps (int): Number of linear warmup steps. Default: 0.
    max_steps (int): Total number of steps over which the cosine decay is applied. Default: 1.
    alpha (float): Floor of the decay factor; the learning rate decays towards max_lr * alpha. Default: 0.
    last_epoch (int): The index of the last epoch. Default: -1.
|
||||
"""
|
||||
def __init__(
|
||||
self,
|
||||
optimizer : torch.optim.Optimizer,
|
||||
max_lr : float = 0.1,
|
||||
min_lr : float = 0.0,
|
||||
warmup_steps : int = 0,
|
||||
max_steps : int = 1,
|
||||
alpha : float = 0.,
|
||||
last_epoch : int = -1
|
||||
):
|
||||
self.max_lr = max_lr # max learning rate in the current cycle
|
||||
self.min_lr = min_lr # min learning rate
|
||||
self.warmup_steps = warmup_steps # warmup step size
|
||||
|
||||
self.alpha = alpha # decrease rate of max learning rate by cycle
|
||||
self.max_steps = max_steps
|
||||
super(CosineAnnealingWarmupRestarts, self).__init__(optimizer, last_epoch)
|
||||
self.init_lr()
|
||||
|
||||
def init_lr(self):
|
||||
for param_group in self.optimizer.param_groups:
|
||||
param_group['lr'] = self.min_lr
|
||||
|
||||
def get_lr(self):
|
||||
if self.last_epoch < self.warmup_steps:
|
||||
curr_lr = self.max_lr * self.last_epoch / self.warmup_steps
|
||||
return curr_lr
|
||||
else:
|
||||
_step = min(self.last_epoch, self.max_steps)
|
||||
cosine_decay = 0.5 * (1 + math.cos(math.pi * _step / self.max_steps))
|
||||
decayed = (1 - self.alpha) * cosine_decay + self.alpha
|
||||
return self.max_lr * decayed # learning_rate * decayed
|
||||
|
||||
def step(self, epoch=None):
|
||||
if epoch is None:
|
||||
epoch = self.last_epoch + 1
|
||||
|
||||
self.last_epoch = math.floor(epoch)
|
||||
_lr = self.get_lr()
|
||||
for param_group in self.optimizer.param_groups:
|
||||
param_group['lr'] = _lr
|
||||
|
||||
|
||||
class CyclicScheduler(_LRScheduler):
|
||||
def __init__(
|
||||
self,
|
||||
optimizer,
|
||||
interval_steps = [],
|
||||
interval_lrs = [],
|
||||
last_epoch = -1,
|
||||
):
|
||||
self.optimizer = optimizer
|
||||
|
||||
self.interval_steps = interval_steps
|
||||
self.interval_lrs = interval_lrs
|
||||
|
||||
self.last_epoch = last_epoch
|
||||
|
||||
super(CyclicScheduler, self).__init__(optimizer, last_epoch)
|
||||
|
||||
self.init_lr()
|
||||
|
||||
def init_lr(self):
|
||||
for param_group in self.optimizer.param_groups:
|
||||
param_group['lr'] = self.interval_lrs[0]
|
||||
|
||||
def get_lr(self):
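        # Linearly interpolate between the configured (interval_steps[i], interval_lrs[i])
        # breakpoints; beyond the last breakpoint, hold the final learning rate.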
|
||||
for _i in range(0, len(self.interval_steps)-1):
|
||||
if self.last_epoch >= self.interval_steps[_i] and self.last_epoch < self.interval_steps[_i + 1]:
|
||||
_alpha = (self.last_epoch - self.interval_steps[_i]) / (self.interval_steps[_i + 1] - self.interval_steps[_i] + 1e-6)
|
||||
if _alpha < 0:
|
||||
_alpha = 0
|
||||
if _alpha >= 1:
|
||||
_alpha = 1
|
||||
curr_lr = _alpha * self.interval_lrs[_i + 1] + (1.0 - _alpha) * self.interval_lrs[_i]
|
||||
return curr_lr
|
||||
return self.interval_lrs[-1]
|
||||
|
||||
def step(self, epoch=None):
|
||||
if epoch is None:
|
||||
epoch = self.last_epoch + 1
|
||||
|
||||
#self.max_lr = self.base_max_lr * (self.gamma**self.cycle)
|
||||
self.last_epoch = math.floor(epoch)
|
||||
_lr = self.get_lr()
|
||||
for param_group in self.optimizer.param_groups: #, self.get_lr()):
|
||||
param_group['lr'] = _lr
|
||||
|
||||
|
||||
|
||||
def get_linear_schedule_with_warmup(
|
||||
optimizer,
|
||||
num_warmup_steps,
|
||||
num_training_steps,
|
||||
last_epoch=-1
|
||||
):
|
||||
""" Create a schedule with a learning rate that decreases linearly after
|
||||
linearly increasing during a warmup period.
|
||||
"""
|
||||
def lr_lambda(current_step):
|
||||
if current_step < num_warmup_steps:
|
||||
return float(current_step) / float(max(1, num_warmup_steps))
|
||||
return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))
|
||||
return LambdaLR(optimizer, lr_lambda, last_epoch)
|
||||
|
||||
|
||||
def get_constant_schedule_with_warmup(
|
||||
optimizer,
|
||||
num_warmup_steps,
|
||||
num_training_steps,
|
||||
last_epoch=-1
|
||||
):
|
||||
""" Create a schedule with a learning rate that decreases linearly after
|
||||
linearly increasing during a warmup period.
|
||||
"""
|
||||
def lr_lambda(current_step):
|
||||
if current_step < num_warmup_steps:
|
||||
return float(current_step) / float(max(1, num_warmup_steps))
|
||||
return 1.0
|
||||
return LambdaLR(optimizer, lr_lambda, last_epoch)
|
||||
|
||||
|
||||
def create_grouped_parameters(model, no_decay_bias): # args):
|
||||
if not no_decay_bias:
|
||||
optimizer_grouped_parameters = [
|
||||
{
|
||||
"params": [p for n, p in model.named_parameters()], # if not any(nd in n for nd in no_decay)],
|
||||
}]
|
||||
else:
|
||||
no_decay = ["bias", "layer_norm.weight"]
|
||||
|
||||
optimizer_grouped_parameters = [
|
||||
{
|
||||
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
|
||||
},
|
||||
{
|
||||
"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
|
||||
"weight_decay": 0.0,
|
||||
}]
|
||||
return optimizer_grouped_parameters
|
||||
|
||||
|
||||
def create_adam_optimizer(
|
||||
model,
|
||||
lr,
|
||||
weight_decay,
|
||||
optimizer_grouped_parameters=None,
|
||||
beta1=0.9,
|
||||
beta2=0.98,
|
||||
correct_bias=True,
|
||||
adam_epislon=1e-6,
|
||||
no_decay_bias=False
|
||||
):
|
||||
if optimizer_grouped_parameters is None:
|
||||
optimizer_grouped_parameters = create_grouped_parameters(model, no_decay_bias)
|
||||
|
||||
optimizer = AdamW(
|
||||
optimizer_grouped_parameters,
|
||||
lr=lr,
|
||||
betas=(beta1, beta2),
|
||||
eps=adam_epislon,
|
||||
weight_decay=weight_decay,
|
||||
correct_bias=correct_bias
|
||||
)
|
||||
return optimizer
|
||||
|
||||
|
||||
def create_sgd_optimizer(model, lr):
|
||||
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.0)
|
||||
return optimizer
|
||||
|
||||
|
||||
def create_adam_optimizer_from_args(model, args, grouped_parameters=None):
|
||||
if grouped_parameters is None:
|
||||
grouped_parameters = create_grouped_parameters(model, args.no_decay_bias)
|
||||
|
||||
optimizer = AdamW(
|
||||
grouped_parameters,
|
||||
lr=args.lr,
|
||||
betas=(args.adam_beta1, args.adam_beta2),
|
||||
eps=args.adam_epislon,
|
||||
weight_decay=args.weight_decay,
|
||||
correct_bias=args.correct_bias
|
||||
)
|
||||
return optimizer
|
||||
|
||||
|
||||
def create_optimizer_scheduler(optimizer, args):
|
||||
if args.scheduler == 'cosine':
|
||||
scheduler = CosineAnnealingWarmupRestarts(
|
||||
optimizer,
|
||||
max_lr=args.lr,
|
||||
min_lr=0.0,
|
||||
warmup_steps=args.warmup_step,
|
||||
max_steps=args.max_step, alpha=0
|
||||
)
|
||||
elif args.scheduler == 'linear':
|
||||
scheduler = get_linear_schedule_with_warmup(
|
||||
optimizer, args.warmup_step, args.max_step, last_epoch=-1
|
||||
)
|
||||
elif args.scheduler == 'cycle':
|
||||
if args.i_steps is not None:
|
||||
args.i_steps = [int(_i) for _i in args.i_steps.split(',')]
|
||||
args.i_lrs = [float(_i) for _i in args.i_lrs.split(',')]
|
||||
args.max_step = args.i_steps[-1]
|
||||
            print('max_step is reset to', args.max_step)
|
||||
scheduler = CyclicScheduler(
|
||||
optimizer, interval_steps=args.i_steps, interval_lrs=args.i_lrs
|
||||
)
|
||||
elif args.scheduler == 'constant':
|
||||
scheduler = get_constant_schedule_with_warmup(
|
||||
optimizer, args.warmup_step, args.max_step, last_epoch=-1
|
||||
)
|
||||
else:
|
||||
        # constant learning rate.
|
||||
scheduler = None
|
||||
return scheduler
|
|
@ -0,0 +1,33 @@
|
|||
{
|
||||
"activation_function": "gelu_new",
|
||||
"architectures": [
|
||||
"GPT2LMHeadModel"
|
||||
],
|
||||
"attn_pdrop": 0.1,
|
||||
"bos_token_id": 50256,
|
||||
"embd_pdrop": 0.1,
|
||||
"eos_token_id": 50256,
|
||||
"initializer_range": 0.02,
|
||||
"layer_norm_epsilon": 1e-05,
|
||||
"model_type": "gpt2",
|
||||
"n_ctx": 1024,
|
||||
"n_embd": 1024,
|
||||
"n_head": 16,
|
||||
"n_layer": 24,
|
||||
"n_positions": 1024,
|
||||
"n_special": 0,
|
||||
"predict_special_tokens": true,
|
||||
"resid_pdrop": 0.1,
|
||||
"summary_activation": null,
|
||||
"summary_first_dropout": 0.1,
|
||||
"summary_proj_to_labels": true,
|
||||
"summary_type": "cls_index",
|
||||
"summary_use_proj": true,
|
||||
"task_specific_params": {
|
||||
"text-generation": {
|
||||
"do_sample": true,
|
||||
"max_length": 50
|
||||
}
|
||||
},
|
||||
"vocab_size": 50257
|
||||
}
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
Diff not shown because of its large size
Load Diff
File diff suppressed because one or more lines are too long
|
@ -0,0 +1,186 @@
|
|||
# LoRA: Low-Rank Adaptation of Large Language Models
|
||||
|
||||
|
||||
This repo contains the source code of the Python package `loralib` and several examples of how to integrate it with PyTorch models, such as those in HuggingFace.
|
||||
We only support PyTorch for now.
|
||||
See our paper for a detailed description of LoRA.
|
||||
|
||||
**LoRA: Low-Rank Adaptation of Large Language Models** <br>
|
||||
*Edward J. Hu\*, Yelong Shen\*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Weizhu Chen* <br>
|
||||
Paper: https://arxiv.org/abs/2106.09685 <br>
|
||||
|
||||
LoRA reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while freezing the original weights.
|
||||
This vastly reduces the storage requirement for large language models adapted to specific tasks and enables efficient task-switching during deployment, all without introducing inference latency.
|
||||
LoRA also outperforms several other adaptation methods including adapter, prefix-tuning, and fine-tuning.
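To make the mechanics concrete, here is a minimal, illustrative PyTorch sketch of the idea (sizes and names are placeholders; the actual implementation lives in `loralib`): a frozen weight `W0` is augmented with a trainable low-rank pair `B @ A`, and only `A` and `B` receive gradients.
```
import torch
import torch.nn as nn

d, r = 1024, 8                                   # illustrative sizes
W0 = nn.Linear(d, d)                             # pre-trained layer, kept frozen
for p in W0.parameters():
    p.requires_grad = False
A = nn.Parameter(torch.randn(r, d) * 0.01)       # low-rank pair: only A and B are trained
B = nn.Parameter(torch.zeros(d, r))              # B starts at zero, so training starts from W0

def lora_forward(x):
    # W0 x + B A x; loralib additionally scales the low-rank term by lora_alpha / r
    return W0(x) + x @ A.T @ B.T
```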
|
||||
|
||||
We obtain results comparable or superior to full finetuning on the GLUE benchmark using [RoBERTa (Liu et al., 2019)](https://arxiv.org/abs/1907.11692) base and large and [DeBERTa (He et al., 2020)](https://arxiv.org/abs/2006.03654) XXL 1.5B, while only training and storing a fraction of the parameters. Click the numbers below to download the RoBERTa and DeBERTa LoRA checkpoints.
|
||||
|
||||
| | | RoBERTa base <br> Fine-tune | RoBERTa base <br> LoRA | DeBERTa XXL <br> Fine-tune | DeBERTa XXL <br> LoRA |
|
||||
|---|-------------------------|----------------|--------------------------|-----------------|-----------------|
|
||||
| | # of Trainable Params. | 125M | 0.8M | 1.5B | 4.7M |
|
||||
| | MNLI (m-Acc/mm-Acc) | <b>87.6</b> | [<b>87.5</b>±.3/86.9±.3](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_mnli.bin) |91.7/<b>91.9</b>| [<b>91.9</b>±.1/<b>91.9</b>±.2](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_mnli.bin) |
|
||||
| | SST2 (Acc) | 94.8 | [<b>95.1</b>±.2](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_sst2.bin) | <b>97.2</b> | [96.9±.2](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_sst2.bin) |
|
||||
| | MRPC (Acc) | <b>90.2</b> | [<b>89.7</b>±.7](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_mrpc.bin) | 92.0 | [<b>92.6</b>±.6](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_mrpc.bin) |
|
||||
| | CoLA (Matthew's Corr) | <b>63.6</b> | [<b>63.4</b>±1.2](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_cola.bin) | <b>72.0</b> | [<b>72.4</b>±1.1](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_cola.bin) |
|
||||
| | QNLI (Acc) | 92.8 | [<b>93.3</b>±.3](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_qnli.bin) | <b>96.0</b> | [<b>96.0</b>±.1](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_qnli.bin) |
|
||||
| | QQP (Acc) | <b>91.9</b> | [90.8±.1](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_qqp.bin) | 92.7 | [<b>92.9</b>±.1](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_qqp.bin) |
|
||||
| | RTE (Acc) | 78.7 | [<b>86.6</b>±.7](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_rte.bin) | 93.9 | [<b>94.9</b>±.4](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_rte.bin) |
|
||||
| | STSB (Pearson/Spearman Corr) | 91.2 | [<b>91.5</b>±.2/<b>91.3</b>±.2](https://github.com/microsoft/LoRA/releases/download/RoBERTa/roberta_base_lora_stsb.bin) |<b>92.9</b>/92.6| [<b>93.0</b>±.2/<b>92.9</b>±.3](https://github.com/microsoft/LoRA/releases/download/DeBERTa/deberta_v2_xxlarge_lora_stsb.bin) |
|
||||
| | Average | 86.40 | <b>87.24</b> | 91.06 | <b>91.32</b> |
|
||||
|
||||
<i>Note: You still need the original pre-trained checkpoint from [HuggingFace](https://huggingface.co/) to use the LoRA checkpoints.</i>
|
||||
|
||||
Fine-tuning numbers are taken from [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) and [He et al. (2020)](https://arxiv.org/abs/2006.03654). We include confidence intervals on results from our experiments. Please follow the instructions in `NLU/` to reproduce our results.
|
||||
|
||||
On GPT-2, LoRA compares favorably to both full finetuning and other efficient tuning methods, such as [adapter (Houlsby et al., 2019)](https://arxiv.org/abs/1902.00751) and [prefix tuning (Li and Liang, 2021)](https://arxiv.org/abs/2101.00190). We evaluated on E2E NLG Challenge, DART, and WebNLG:
|
||||
|
||||
| | Method | # of Trainable Params | E2E (BLEU) | DART (BLEU) | WebNLG (BLEU-U/S/A) |
|
||||
|---|---------------------|-----------------------|--------------|--------------|--------------------------------|
|
||||
| | GPT-2 M (Fine-Tune) | 354.92M | 68.2 | 46.0 | 30.4/<b>63.2</b>/47.6 |
|
||||
| | GPT-2 M (Adapter) | 0.37M | 66.3 | 42.4 | 45.1/54.5/50.2 |
|
||||
| | GPT-2 M (Prefix) | 0.35M | 69.7 | 45.7 | 44.1/63.1/54.4 |
|
||||
| | GPT-2 M (LoRA) | 0.35M |<b>70.4</b>±.1|<b>47.1</b>±.2| <b>46.7</b>±.4/62.1±.2/<b>55.3</b>±.2 |
|
||||
| | GPT-2 L (Fine-Tune) | 774.03M | 68.5 | 46.5 | 41.7/<b>64.6</b>/54.2 |
|
||||
| | GPT-2 L (Adapter) | 0.88M | 69.1±.1 | 45.7±.1 | <b>49.8</b>±.0/61.1±.0/56.0±.0 |
|
||||
| | GPT-2 L (Prefix) | 0.77M | 70.3 | 46.5 | 47.0/64.2/56.4 |
|
||||
| | GPT-2 L (LoRA) | 0.77M |<b>70.4</b>±.1|<b>47.5</b>±.1| 48.4±.3/<b>64.0</b>±.3/<b>57.0</b>±.1 |
|
||||
|
||||
Non-LoRA baselines, except for adapter on GPT-2 large, are taken from [Li and Liang (2021)](https://arxiv.org/abs/2101.00190). We include confidence intervals on results from our experiments.
|
||||
|
||||
Download the GPT-2 LoRA checkpoints:
|
||||
* [GPT-2 Medium E2E](https://github.com/microsoft/LoRA/releases/download/GPT-2/gpt2_md_lora_e2e.pt) (1.5 MB)
|
||||
* [GPT-2 Medium DART](https://github.com/microsoft/LoRA/releases/download/GPT-2/gpt2_md_lora_dart.pt) (1.5 MB)
|
||||
* [GPT-2 Medium WebNLG](https://github.com/microsoft/LoRA/releases/download/GPT-2/gpt2_md_lora_webnlg.pt) (1.5 MB)
|
||||
* [GPT-2 Large E2E](https://github.com/microsoft/LoRA/releases/download/GPT-2/gpt2_lg_lora_e2e.pt) (2.3 MB)
|
||||
* [GPT-2 Large DART](https://github.com/microsoft/LoRA/releases/download/GPT-2/gpt2_lg_lora_dart.pt) (2.3 MB)
|
||||
* [GPT-2 Large WebNLG](https://github.com/microsoft/LoRA/releases/download/GPT-2/gpt2_lg_lora_webnlg.pt) (2.3 MB)
|
||||
|
||||
Please follow the instructions in `NLG/` to reproduce our results.
|
||||
## Repository Overview
|
||||
|
||||
There are several directories in this repo:
|
||||
* [loralib/](loralib) contains the source code for the package `loralib`, which needs to be installed to run the examples we provide;
|
||||
* [NLG/](NLG) contains an example implementation of LoRA in GPT-2 using our package, which can be used to reproduce the result in our paper;
|
||||
* [NLU/](NLU) contains an example implementation of LoRA in RoBERTa and DeBERTa using our package, which produces competitive results on the GLUE benchmark;
|
||||
* See how we use `loralib` in [GPT-2](NLG/src/model.py), [RoBERTa](NLU/src/transformers/models/roberta/modeling_roberta.py), and [DeBERTa v2](NLU/src/transformers/models/deberta_v2/modeling_deberta_v2.py)
|
||||
|
||||
## Quickstart
|
||||
|
||||
1. Installing `loralib` is simply
|
||||
```
|
||||
pip install loralib
|
||||
# Alternatively
|
||||
# pip install git+https://github.com/microsoft/LoRA
|
||||
```
|
||||
|
||||
2. You can choose to adapt some layers by replacing them with counterparts implemented in `loralib`. We currently support `nn.Linear`, `nn.Embedding`, and `nn.Conv2d`. We also support a `MergedLinear` for cases where a single `nn.Linear` represents more than one layer, such as in some implementations of the attention `qkv` projection (see Additional Notes for more).
|
||||
```
|
||||
# ===== Before =====
|
||||
# layer = nn.Linear(in_features, out_features)
|
||||
|
||||
# ===== After ======
|
||||
import loralib as lora
|
||||
# Add a pair of low-rank adaptation matrices with rank r=16
|
||||
layer = lora.Linear(in_features, out_features, r=16)
|
||||
```
|
||||
|
||||
3. Before the training loop begins, mark only LoRA parameters as trainable.
|
||||
```
|
||||
import loralib as lora
|
||||
model = BigModel()
|
||||
# This sets requires_grad to False for all parameters without the string "lora_" in their names
|
||||
lora.mark_only_lora_as_trainable(model)
|
||||
# Training loop
|
||||
for batch in dataloader:
|
||||
...
|
||||
```
|
||||
4. When saving a checkpoint, generate a `state_dict` that only contains LoRA parameters.
|
||||
```
|
||||
# ===== Before =====
|
||||
# torch.save(model.state_dict(), checkpoint_path)
|
||||
# ===== After =====
|
||||
torch.save(lora.lora_state_dict(model), checkpoint_path)
|
||||
```
|
||||
5. When loading a checkpoint using `load_state_dict`, be sure to set `strict=False`.
|
||||
```
|
||||
# Load the pretrained checkpoint first
|
||||
model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
|
||||
# Then load the LoRA checkpoint
|
||||
model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
|
||||
```
|
||||
|
||||
#### Now training can proceed as usual.
|
||||
|
||||
## Additional Notes
|
||||
|
||||
1. While we focus on a simple yet effective setup, namely adapting only the `q` and `v` projections in a Transformer, in our examples, LoRA can be applied to any subset of pre-trained weights. We encourage you to explore different configurations, such as adapting the embedding layer by replacing `nn.Embedding` with `lora.Embedding` and/or adapting the MLP layers. It's very likely that the optimal configuration varies for different model architectures and tasks.
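A minimal sketch of adapting an embedding layer (`vocab_size` and `d_model` are placeholders for your model's sizes):
```
import torch.nn as nn
import loralib as lora

# ===== Before =====
# embedding = nn.Embedding(vocab_size, d_model)
# ===== After =====
embedding = lora.Embedding(vocab_size, d_model, r=8)
```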
|
||||
|
||||
2. Some Transformer implementations use a single `nn.Linear` layer for the query, key, and value projection matrices. If one wishes to constrain the rank of the updates to the individual matrices, one has to either break it up into three separate matrices or use `lora.MergedLinear`. Make sure to modify the checkpoint accordingly if you choose to break up the layer.
|
||||
```
|
||||
# ===== Before =====
|
||||
# qkv_proj = nn.Linear(d_model, 3*d_model)
|
||||
# ===== After =====
|
||||
# Break it up (remember to modify the pretrained checkpoint accordingly)
|
||||
q_proj = lora.Linear(d_model, d_model, r=8)
|
||||
k_proj = nn.Linear(d_model, d_model)
|
||||
v_proj = lora.Linear(d_model, d_model, r=8)
|
||||
# Alternatively, use lora.MergedLinear (recommended)
|
||||
qkv_proj = lora.MergedLinear(d_model, 3*d_model, r=8, enable_lora=[True, False, True])
|
||||
```
|
||||
3. Training bias vectors in tandem with LoRA might be a cost-efficient way to squeeze out extra task performance (if you tune the learning rate carefully). While we did not study its effect thoroughly in our paper, we make it easy to try in `lora`. You can mark some biases as trainable by passing "all" or "lora_only" to `bias=` when calling `mark_only_lora_as_trainable`. Remember to pass the corresponding `bias=` argument to `lora_state_dict` when saving a checkpoint.
|
||||
```
|
||||
# ===== Before =====
|
||||
# lora.mark_only_lora_as_trainable(model) # Not training any bias vectors
|
||||
# ===== After =====
|
||||
# Training all bias vectors associated with modules we apply LoRA to
|
||||
lora.mark_only_lora_as_trainable(model, bias='lora_only')
|
||||
# Alternatively, we can train *all* bias vectors in the model, including LayerNorm biases
|
||||
lora.mark_only_lora_as_trainable(model, bias='all')
|
||||
# When saving a checkpoint, use the same bias= ('all' or 'lora_only')
|
||||
torch.save(lora.lora_state_dict(model, bias='all'), checkpoint_path)
|
||||
```
|
||||
4. Calling `model.eval()` will trigger the merging of LoRA parameters with the corresponding pretrained ones, which eliminates additional latency for subsequent forward passes. Calling `model.train()` again will undo the merge. This can be disabled by passing `merge_weights=False` to LoRA layers.
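For example (a sketch; `in_features` and `out_features` are placeholders):
```
import loralib as lora

layer = lora.Linear(in_features, out_features, r=16)
layer.eval()    # merges the scaled BA update into the weight: inference is a single matmul
layer.train()   # subtracts the LoRA update again so it can keep being trained separately
# Opt out of merging entirely
layer = lora.Linear(in_features, out_features, r=16, merge_weights=False)
```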
|
||||
|
||||
## Contact
|
||||
Please contact us or post an issue if you have any questions.
|
||||
|
||||
For questions related to the package `loralib`:
|
||||
* Edward Hu (edwardhu@microsoft.com)
|
||||
* Phillip Wallis (phwallis@microsoft.com)
|
||||
* Weizhu Chen (wzchen@microsoft.com)
|
||||
|
||||
The GPT-2 example:
|
||||
* Phillip Wallis (phwallis@microsoft.com)
|
||||
* Yelong Shen (yeshe@microsoft.com)
|
||||
|
||||
The DeBERTa example:
|
||||
* Lu Wang (luw@microsoft.com)
|
||||
|
||||
## Acknowledgements
|
||||
We thank in alphabetical order Jianfeng Gao, Jade Huang, Jiayuan Huang, Lisa Xiang Li, Xiaodong Liu, Yabin Liu, Benjamin Van Durme, Luis Vargas, Haoran Wei, Peter Welinder, and Greg Yang for providing valuable feedback.
|
||||
|
||||
## Citation
|
||||
```
|
||||
@misc{hu2021lora,
|
||||
title={LoRA: Low-Rank Adaptation of Large Language Models},
|
||||
author={Hu, Edward and Shen, Yelong and Wallis, Phil and Allen-Zhu, Zeyuan and Li, Yuanzhi and Chen, Weizhu},
|
||||
year={2021},
|
||||
eprint={2106.09685},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
```
|
||||
|
||||
## Contributing
|
||||
|
||||
This project welcomes contributions and suggestions. Most contributions require you to agree to a
|
||||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
|
||||
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
|
||||
|
||||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
|
||||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
|
||||
provided by the bot. You will only need to do this once across all repos using our CLA.
|
||||
|
||||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
|
||||
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
|
||||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
|
|
@ -0,0 +1,4 @@
|
|||
name = "lora"
|
||||
|
||||
from .layers import *
|
||||
from .utils import *
|
|
@ -0,0 +1,322 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
import math
|
||||
from typing import Optional, List
|
||||
|
||||
class LoRALayer():
|
||||
def __init__(
|
||||
self,
|
||||
r: int,
|
||||
lora_alpha: int,
|
||||
lora_dropout: float,
|
||||
merge_weights: bool,
|
||||
):
|
||||
self.r = r
|
||||
self.lora_alpha = lora_alpha
|
||||
# Optional dropout
|
||||
if lora_dropout > 0.:
|
||||
self.lora_dropout = nn.Dropout(p=lora_dropout)
|
||||
else:
|
||||
self.lora_dropout = lambda x: x
|
||||
# Mark the weight as unmerged
|
||||
self.merged = False
|
||||
self.merge_weights = merge_weights
|
||||
|
||||
|
||||
class Embedding(nn.Embedding, LoRALayer):
|
||||
    # LoRA implemented in an embedding layer
|
||||
def __init__(
|
||||
self,
|
||||
num_embeddings: int,
|
||||
embedding_dim: int,
|
||||
r: int = 0,
|
||||
lora_alpha: int = 1,
|
||||
merge_weights: bool = True,
|
||||
**kwargs
|
||||
):
|
||||
nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
|
||||
LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=0,
|
||||
merge_weights=merge_weights)
|
||||
# Actual trainable parameters
|
||||
if r > 0:
|
||||
self.lora_A = nn.Parameter(self.weight.new_zeros((r, num_embeddings)))
|
||||
self.lora_B = nn.Parameter(self.weight.new_zeros((embedding_dim, r)))
|
||||
self.scaling = self.lora_alpha / self.r
|
||||
# Freezing the pre-trained weight matrix
|
||||
self.weight.requires_grad = False
|
||||
self.reset_parameters()
|
||||
|
||||
def reset_parameters(self):
|
||||
nn.Embedding.reset_parameters(self)
|
||||
if hasattr(self, 'lora_A'):
|
||||
            # initialize A to zero and B the same way as the default for nn.Embedding
|
||||
nn.init.zeros_(self.lora_A)
|
||||
nn.init.normal_(self.lora_B)
|
||||
|
||||
def train(self, mode: bool = True):
|
||||
nn.Embedding.train(self, mode)
|
||||
if self.merge_weights and self.merged:
|
||||
# Make sure that the weights are not merged
|
||||
if self.r > 0:
|
||||
self.weight.data -= (self.lora_B @ self.lora_A).T * self.scaling
|
||||
self.merged = False
|
||||
|
||||
def eval(self):
|
||||
        nn.Embedding.eval(self)
|
||||
if self.merge_weights and not self.merged:
|
||||
# Merge the weights and mark it
|
||||
if self.r > 0:
|
||||
                self.weight.data += (self.lora_B @ self.lora_A).T * self.scaling
|
||||
self.merged = True
|
||||
|
||||
def forward(self, x: torch.Tensor):
|
||||
if self.r > 0 and not self.merged:
|
||||
result = nn.Embedding.forward(self, x)
|
||||
if self.r > 0:
|
||||
after_A = F.embedding(
|
||||
x, self.lora_A.T, self.padding_idx, self.max_norm,
|
||||
self.norm_type, self.scale_grad_by_freq, self.sparse
|
||||
)
|
||||
result += (after_A @ self.lora_B.T) * self.scaling
|
||||
return result
|
||||
else:
|
||||
return nn.Embedding.forward(self, x)
|
||||
|
||||
|
||||
class Linear(nn.Linear, LoRALayer):
|
||||
# LoRA implemented in a dense layer
|
||||
def __init__(
|
||||
self,
|
||||
in_features: int,
|
||||
out_features: int,
|
||||
r: int = 0,
|
||||
lora_alpha: int = 1,
|
||||
lora_dropout: float = 0.,
|
||||
fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
|
||||
merge_weights: bool = True,
|
||||
**kwargs
|
||||
):
|
||||
nn.Linear.__init__(self, in_features, out_features, **kwargs)
|
||||
LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
|
||||
merge_weights=merge_weights)
|
||||
|
||||
self.fan_in_fan_out = fan_in_fan_out
|
||||
# Actual trainable parameters
|
||||
if r > 0:
|
||||
self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
|
||||
self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
|
||||
self.scaling = self.lora_alpha / self.r
|
||||
# Freezing the pre-trained weight matrix
|
||||
self.weight.requires_grad = False
|
||||
self.reset_parameters()
|
||||
if fan_in_fan_out:
|
||||
self.weight.data = self.weight.data.T
|
||||
|
||||
def reset_parameters(self):
|
||||
nn.Linear.reset_parameters(self)
|
||||
if hasattr(self, 'lora_A'):
|
||||
# initialize A the same way as the default for nn.Linear and B to zero
|
||||
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
|
||||
nn.init.zeros_(self.lora_B)
|
||||
|
||||
def train(self, mode: bool = True):
|
||||
def T(w):
|
||||
return w.T if self.fan_in_fan_out else w
|
||||
nn.Linear.train(self, mode)
|
||||
if self.merge_weights and self.merged:
|
||||
# Make sure that the weights are not merged
|
||||
if self.r > 0:
|
||||
self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
|
||||
self.merged = False
|
||||
|
||||
def eval(self):
|
||||
def T(w):
|
||||
return w.T if self.fan_in_fan_out else w
|
||||
nn.Linear.eval(self)
|
||||
if self.merge_weights and not self.merged:
|
||||
# Merge the weights and mark it
|
||||
if self.r > 0:
|
||||
self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
|
||||
self.merged = True
|
||||
|
||||
def forward(self, x: torch.Tensor):
|
||||
def T(w):
|
||||
return w.T if self.fan_in_fan_out else w
|
||||
if self.r > 0 and not self.merged:
|
||||
result = F.linear(x, T(self.weight), bias=self.bias)
|
||||
if self.r > 0:
|
||||
result += (self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T) * self.scaling
|
||||
return result
|
||||
else:
|
||||
return F.linear(x, T(self.weight), bias=self.bias)
|
||||
|
||||
|
||||
class MergedLinear(nn.Linear, LoRALayer):
|
||||
# LoRA implemented in a dense layer
|
||||
def __init__(
|
||||
self,
|
||||
in_features: int,
|
||||
out_features: int,
|
||||
r: int = 0,
|
||||
lora_alpha: int = 1,
|
||||
lora_dropout: float = 0.,
|
||||
enable_lora: List[bool] = [False],
|
||||
fan_in_fan_out: bool = False,
|
||||
merge_weights: bool = True,
|
||||
**kwargs
|
||||
):
|
||||
nn.Linear.__init__(self, in_features, out_features, **kwargs)
|
||||
LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
|
||||
merge_weights=merge_weights)
|
||||
assert out_features % len(enable_lora) == 0, \
|
||||
'The length of enable_lora must divide out_features'
|
||||
self.enable_lora = enable_lora
|
||||
self.fan_in_fan_out = fan_in_fan_out
|
||||
# Actual trainable parameters
|
||||
if r > 0 and any(enable_lora):
|
||||
self.lora_A = nn.Parameter(
|
||||
self.weight.new_zeros((r * sum(enable_lora), in_features)))
|
||||
self.lora_B = nn.Parameter(
|
||||
self.weight.new_zeros((out_features // len(enable_lora) * sum(enable_lora), r))
|
||||
) # weights for Conv1D with groups=sum(enable_lora)
|
||||
self.scaling = self.lora_alpha / self.r
|
||||
# Freezing the pre-trained weight matrix
|
||||
self.weight.requires_grad = False
|
||||
# Compute the indices
|
||||
self.lora_ind = self.weight.new_zeros(
|
||||
(out_features, ), dtype=torch.bool
|
||||
).view(len(enable_lora), -1)
|
||||
self.lora_ind[enable_lora, :] = True
|
||||
self.lora_ind = self.lora_ind.view(-1)
|
||||
self.reset_parameters()
|
||||
if fan_in_fan_out:
|
||||
self.weight.data = self.weight.data.T
|
||||
|
||||
def reset_parameters(self):
|
||||
nn.Linear.reset_parameters(self)
|
||||
if hasattr(self, 'lora_A'):
|
||||
# initialize A the same way as the default for nn.Linear and B to zero
|
||||
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
|
||||
nn.init.zeros_(self.lora_B)
|
||||
|
||||
def zero_pad(self, x):
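        # Scatter the LoRA output, which covers only the enabled blocks, into a zero tensor
        # spanning the full output dimension so it can be added to the dense result.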
|
||||
result = x.new_zeros((*x.shape[:-1], self.out_features))
|
||||
result = result.view(-1, self.out_features)
|
||||
result[:, self.lora_ind] = x.reshape(
|
||||
-1, self.out_features // len(self.enable_lora) * sum(self.enable_lora)
|
||||
)
|
||||
return result.view((*x.shape[:-1], self.out_features))
|
||||
|
||||
def train(self, mode: bool = True):
|
||||
def T(w):
|
||||
return w.T if self.fan_in_fan_out else w
|
||||
nn.Linear.train(self, mode)
|
||||
if self.merge_weights and self.merged:
|
||||
# Make sure that the weights are not merged
|
||||
if self.r > 0 and any(self.enable_lora):
|
||||
delta_w = F.conv1d(
|
||||
self.lora_A.data.unsqueeze(0),
|
||||
self.lora_B.data.unsqueeze(-1),
|
||||
groups=sum(self.enable_lora)
|
||||
).squeeze(0)
|
||||
self.weight.data -= self.zero_pad(T(delta_w * self.scaling))
|
||||
self.merged = False
|
||||
|
||||
def eval(self):
|
||||
def T(w):
|
||||
return w.T if self.fan_in_fan_out else w
|
||||
nn.Linear.eval(self)
|
||||
if self.merge_weights and not self.merged:
|
||||
# Merge the weights and mark it
|
||||
if self.r > 0 and any(self.enable_lora):
|
||||
delta_w = F.conv1d(
|
||||
self.lora_A.data.unsqueeze(0),
|
||||
self.lora_B.data.unsqueeze(-1),
|
||||
groups=sum(self.enable_lora)
|
||||
).squeeze(0)
|
||||
self.weight.data += self.zero_pad(T(delta_w * self.scaling))
|
||||
self.merged = True
|
||||
|
||||
def forward(self, x: torch.Tensor):
|
||||
def T(w):
|
||||
return w.T if self.fan_in_fan_out else w
|
||||
if self.merged:
|
||||
return F.linear(x, T(self.weight), bias=self.bias)
|
||||
else:
|
||||
result = F.linear(x, T(self.weight), bias=self.bias)
|
||||
if self.r > 0:
|
||||
after_A = F.linear(self.lora_dropout(x), self.lora_A)
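                # Grouped 1D convolution applies each enabled block's slice of lora_B to the
                # matching slice of after_A, i.e. an independent B @ (A x) per enabled block.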
|
||||
after_B = F.conv1d(
|
||||
after_A.transpose(-2, -1),
|
||||
self.lora_B.unsqueeze(-1),
|
||||
groups=sum(self.enable_lora)
|
||||
).transpose(-2, -1)
|
||||
result += self.zero_pad(after_B) * self.scaling
|
||||
return result
|
||||
|
||||
|
||||
class Conv2d(nn.Conv2d, LoRALayer):
|
||||
    # LoRA implemented in a 2d convolution layer
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
out_channels: int,
|
||||
kernel_size: int,
|
||||
r: int = 0,
|
||||
lora_alpha: int = 1,
|
||||
lora_dropout: float = 0.,
|
||||
merge_weights: bool = True,
|
||||
**kwargs
|
||||
):
|
||||
nn.Conv2d.__init__(self, in_channels, out_channels, kernel_size, **kwargs)
|
||||
LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
|
||||
merge_weights=merge_weights)
|
||||
assert type(kernel_size) is int
|
||||
# Actual trainable parameters
|
||||
if r > 0:
|
||||
self.lora_A = nn.Parameter(
|
||||
self.weight.new_zeros((r*kernel_size, in_channels*kernel_size))
|
||||
)
|
||||
self.lora_B = nn.Parameter(
|
||||
self.weight.new_zeros((out_channels*kernel_size, r*kernel_size))
|
||||
)
|
||||
self.scaling = self.lora_alpha / self.r
|
||||
# Freezing the pre-trained weight matrix
|
||||
self.weight.requires_grad = False
|
||||
self.reset_parameters()
|
||||
|
||||
def reset_parameters(self):
|
||||
nn.Conv2d.reset_parameters(self)
|
||||
if hasattr(self, 'lora_A'):
|
||||
# initialize A the same way as the default for nn.Linear and B to zero
|
||||
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
|
||||
nn.init.zeros_(self.lora_B)
|
||||
|
||||
def train(self, mode: bool = True):
|
||||
nn.Conv2d.train(self, mode)
|
||||
if self.merge_weights and self.merged:
|
||||
# Make sure that the weights are not merged
|
||||
            if self.r > 0:
                self.weight.data -= (self.lora_B @ self.lora_A).view(self.weight.shape) * self.scaling
|
||||
self.merged = False
|
||||
|
||||
def eval(self):
|
||||
nn.Conv2d.eval(self)
|
||||
if self.merge_weights and not self.merged:
|
||||
# Merge the weights and mark it
|
||||
            if self.r > 0:
                self.weight.data += (self.lora_B @ self.lora_A).view(self.weight.shape) * self.scaling
|
||||
self.merged = True
|
||||
|
||||
def forward(self, x: torch.Tensor):
|
||||
if self.r > 0 and not self.merged:
|
||||
return F.conv2d(
|
||||
x,
|
||||
self.weight + (self.lora_B @ self.lora_A).view(self.weight.shape) * self.scaling,
|
||||
self.bias, self.stride, self.padding, self.dilation, self.groups
|
||||
)
|
||||
return nn.Conv2d.forward(self, x)
|
|
@ -0,0 +1,49 @@
|
|||
# ------------------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
|
||||
# ------------------------------------------------------------------------------------------
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
from typing import Dict
|
||||
|
||||
from .layers import LoRALayer
|
||||
|
||||
|
||||
def mark_only_lora_as_trainable(model: nn.Module, bias: str = 'none') -> None:
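    # Freeze every parameter whose name does not contain 'lora_'; depending on `bias`,
    # optionally keep bias vectors trainable ('none', 'all', or 'lora_only').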
|
||||
for n, p in model.named_parameters():
|
||||
if 'lora_' not in n:
|
||||
p.requires_grad = False
|
||||
if bias == 'none':
|
||||
return
|
||||
elif bias == 'all':
|
||||
for n, p in model.named_parameters():
|
||||
if 'bias' in n:
|
||||
p.requires_grad = True
|
||||
elif bias == 'lora_only':
|
||||
for m in model.modules():
|
||||
if isinstance(m, LoRALayer) and \
|
||||
hasattr(m, 'bias') and \
|
||||
m.bias is not None:
|
||||
m.bias.requires_grad = True
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
|
||||
def lora_state_dict(model: nn.Module, bias: str = 'none') -> Dict[str, torch.Tensor]:
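    # Return only the LoRA parameters (plus, depending on `bias`, the matching bias terms)
    # so checkpoints stay small; load them later with load_state_dict(..., strict=False).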
|
||||
my_state_dict = model.state_dict()
|
||||
if bias == 'none':
|
||||
return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k}
|
||||
elif bias == 'all':
|
||||
return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k or 'bias' in k}
|
||||
elif bias == 'lora_only':
|
||||
to_return = {}
|
||||
for k in my_state_dict:
|
||||
if 'lora_' in k:
|
||||
to_return[k] = my_state_dict[k]
|
||||
bias_name = k.split('lora_')[0]+'bias'
|
||||
if bias_name in my_state_dict:
|
||||
to_return[bias_name] = my_state_dict[bias_name]
|
||||
return to_return
|
||||
else:
|
||||
raise NotImplementedError
|
|
@ -0,0 +1,22 @@
|
|||
import setuptools
|
||||
|
||||
with open("README.md", "r", encoding="utf-8") as fh:
|
||||
long_description = fh.read()
|
||||
|
||||
setuptools.setup(
|
||||
name="loralib",
|
||||
version="0.1.0",
|
||||
author="Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen",
|
||||
author_email="edward.hu@microsoft.com",
|
||||
description="PyTorch implementation of low-rank adaptation (LoRA), a parameter-efficient approach to adapt a large pre-trained deep learning model which obtains performance on-par with full fine-tuning.",
|
||||
long_description=long_description,
|
||||
long_description_content_type="text/markdown",
|
||||
url="https://github.com/microsoft/LoRA",
|
||||
packages=setuptools.find_packages(),
|
||||
classifiers=[
|
||||
"Programming Language :: Python :: 3",
|
||||
"License :: OSI Approved :: MIT License",
|
||||
"Operating System :: OS Independent",
|
||||
],
|
||||
python_requires='>=3.6',
|
||||
)
|