Added 12 model cards for Indian Language Models (#8198)
* Create README.md
* added model cards
Parent: 9bd30f7cf4
Commit: aa79aa4e7d
@@ -0,0 +1,25 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
---
# Indic-Transformers Bengali BERT
## Model description
This is a BERT language model pre-trained on a ~3 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
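For instance, fine-tuning for text classification can start from this checkpoint with a task head on top. The sketch below is illustrative only: `num_labels=2` and the single placeholder label are assumptions, not part of this release.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder plus a freshly initialised classification head;
# num_labels=2 stands in for a hypothetical binary task.
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
model = AutoModelForSequenceClassification.from_pretrained(
    'neuralspace-reverie/indic-transformers-bn-bert', num_labels=2)

inputs = tokenizer("আপনি কেমন আছেন?", return_tensors='pt')
labels = torch.tensor([1])  # placeholder label
outputs = model(**inputs, labels=labels)
loss = outputs[0]           # cross-entropy loss over the 2 classes
loss.backward()             # a real run would loop over a labelled dataset
```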
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 6, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
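If the converted TensorFlow weights give trouble, one fallback (a sketch, assuming a local PyTorch install; not an official recommendation of this card) is to convert the PyTorch weights on the fly:
```python
from transformers import TFAutoModel

# from_pt=True converts pytorch_model.bin to TF weights in memory
model = TFAutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert', from_pt=True)
```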

@@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali DistilBERT
## Model description
This is a DistilBERT language model pre-trained on a ~6 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali RoBERTa
## Model description
This is a RoBERTa language model pre-trained on a ~6 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
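As a concrete example of feature-based use, the token embeddings can be pooled into a fixed-size sentence vector. Mean pooling below is one common choice, assumed here for illustration:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')

with torch.no_grad():
    enc = tokenizer("আপনি কেমন আছেন?", return_tensors='pt')
    hidden = model(**enc)[0]       # [1, seq_len, 768] token embeddings
sentence_vec = hidden.mean(dim=1)  # [1, 768] fixed-size feature vector
print(sentence_vec.shape)
```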
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 10, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali XLMRoBERTa
## Model description
This is an XLMRoBERTa language model pre-trained on a ~3 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi BERT
## Model description
This is a BERT language model pre-trained on a ~3 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
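Since the checkpoint was trained with masked-language modelling, it can also back a fill-mask pipeline directly. A small sketch (a real run's predictions are not reproduced here):
```python
from transformers import pipeline

fill = pipeline('fill-mask',
                model='neuralspace-reverie/indic-transformers-hi-bert',
                tokenizer='neuralspace-reverie/indic-transformers-hi-bert')
# Use the tokenizer's own mask token rather than hard-coding '[MASK]'
masked = "आपका स्वागत " + fill.tokenizer.mask_token
print(fill(masked))  # top candidate tokens with scores
```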
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi DistilBERT
## Model description
This is a DistilBERT language model pre-trained on a ~10 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
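For token-level tasks such as POS tagging, a token-classification head can sit on top of the encoder. The 4-tag scheme below is hypothetical, purely to show the shapes involved:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
# num_labels=4 stands in for a hypothetical POS tag set
model = AutoModelForTokenClassification.from_pretrained(
    'neuralspace-reverie/indic-transformers-hi-distilbert', num_labels=4)

inputs = tokenizer("आपका स्वागत हैं", return_tensors='pt')
logits = model(**inputs)[0]  # [1, seq_len, 4]: one score per tag per token
print(logits.shape)
```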
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi RoBERTa
## Model description
This is a RoBERTa language model pre-trained on a ~10 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 11, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi XLMRoBERTa
## Model description
This is an XLMRoBERTa language model pre-trained on a ~3 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu BERT
## Model description
This is a BERT language model pre-trained on a ~1.6 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
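Inputs of different lengths can also be batched; the tokenizer handles padding and the attention mask. A sketch, with a prefix of the card's example sentence as the second, shorter entry:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')

batch = ["మీరు ఎలా ఉన్నారు", "మీరు"]  # second entry is a shorter fragment
enc = tokenizer(batch, padding=True, return_tensors='pt')
with torch.no_grad():
    out = model(**enc)[0]  # [2, max_seq_len, 768]; padding handled via attention_mask
print(out.shape)
```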
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu DistilBERT
## Model description
This is a DistilBERT language model pre-trained on a ~2 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu RoBERTa
## Model description
This is a RoBERTa language model pre-trained on a ~2 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
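For extractive question answering, the same checkpoint can back a span-prediction head. A minimal sketch with placeholder strings; the QA head stays randomly initialised until fine-tuned on a Telugu QA dataset:
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = 'neuralspace-reverie/indic-transformers-te-roberta'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)  # QA head untrained

question = "మీరు ఎలా ఉన్నారు"  # placeholder question reused from this card
context = question              # placeholder context, for shape-checking only
inputs = tokenizer(question, context, return_tensors='pt')
start_scores, end_scores = model(**inputs)[:2]  # each [1, seq_len]
```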
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 14, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).

@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu XLMRoBERTa
## Model description
This is an XLMRoBERTa language model pre-trained on a ~1.6 GB monolingual training corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained using `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).