Added 12 model cards for Indian Language Models (#8198)

* Create README.md

* added model cards
This commit is contained in:
Kushal 2020-11-02 10:47:43 +05:30 коммит произвёл GitHub
Родитель 9bd30f7cf4
Коммит aa79aa4e7d
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
12 изменённых файлов: 344 добавлений и 0 удалений

Просмотреть файл

@ -0,0 +1,25 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
---
# Indic-Transformers Bengali BERT
## Model description
This is a BERT language model pre-trained on ~3 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 6, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali DistilBERT
## Model description
This is a DistilBERT language model pre-trained on ~6 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali RoBERTa
## Model description
This is a RoBERTa language model pre-trained on ~6 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 10, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali XLMRoBERTa
## Model description
This is a XLMRoBERTa language model pre-trained on ~3 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi BERT
## Model description
This is a BERT language model pre-trained on ~3 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi DistilBERT
## Model description
This is a DistilBERT language model pre-trained on ~10 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi RoBERTa
## Model description
This is a RoBERTa language model pre-trained on ~10 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 11, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi XLMRoBERTa
## Model description
This is a XLMRoBERTa language model pre-trained on ~3 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu BERT
## Model description
This is a BERT language model pre-trained on ~1.6 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu DistilBERT
## Model description
This is a DistilBERT language model pre-trained on ~2 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu RoBERTa
## Model description
This is a RoBERTa language model pre-trained on ~2 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 14, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

Просмотреть файл

@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu XLMRoBERTa
## Model description
This is a XLMRoBERTa language model pre-trained on ~1.6 GB of monolingual training corpus. The pre-training data was majorly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model has been trained using `PyTorch` and hence the use of `pytorch_model.bin` weights file is recommended. The h5 file for `Tensorflow` has been generated manually by commands suggested [here](https://huggingface.co/transformers/model_sharing.html).