diff --git a/model_cards/MoseliMotsoehli/TswanaBert/README.md b/model_cards/MoseliMotsoehli/TswanaBert/README.md
new file mode 100644
index 000000000..a6018d130
--- /dev/null
+++ b/model_cards/MoseliMotsoehli/TswanaBert/README.md
@@ -0,0 +1,72 @@
+---
+language: setswana
+---
+
+# TswanaBert
+
+## Model Description
+TswanaBERT is a transformer model pretrained on a corpus of Setswana text in a self-supervised fashion: part of the input tokens are masked, and the model is trained to predict the masked tokens.
+
+## Intended uses & limitations
+The model can be used for masked language modeling or next-word prediction. It can also be fine-tuned for a specific downstream application (see the fine-tuning sketch at the end of this card).
+
+#### How to use
+
+```python
+>>> from transformers import pipeline
+>>> from transformers import AutoTokenizer, AutoModelWithLMHead
+
+>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
+>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/TswanaBert")
+>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+>>> unmasker("Ntshopotse <mask> e godile.")
+
+[{'score': 0.32749542593955994,
+  'sequence': 'Ntshopotse setse e godile.',
+  'token': 538,
+  'token_str': 'Ġsetse'},
+ {'score': 0.060260992497205734,
+  'sequence': 'Ntshopotse le e godile.',
+  'token': 270,
+  'token_str': 'Ġle'},
+ {'score': 0.058460816740989685,
+  'sequence': 'Ntshopotse bone e godile.',
+  'token': 364,
+  'token_str': 'Ġbone'},
+ {'score': 0.05694682151079178,
+  'sequence': 'Ntshopotse ga e godile.',
+  'token': 298,
+  'token_str': 'Ġga'},
+ {'score': 0.0565204992890358,
+  'sequence': 'Ntshopotse, e godile.',
+  'token': 16,
+  'token_str': ','}]
+```
+
+#### Limitations and bias
+The model was trained on a fairly small collection of Setswana text, mostly news articles and creative writing, so it is not yet fully representative of the language.
+
+## Training data
+
+The largest portion of this dataset, about 10k lines of text, comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download).
+
+I then added 200 more phrases and sentences by scraping the following sites, and I continue to expand the dataset:
+
+* http://setswana.blogspot.com/
+* https://omniglot.com/writing/tswana.php
+* http://www.dailynews.gov.bw/
+* http://www.mmegi.bw/index.php
+* https://tsena.co.bw
+* http://www.botswana.co.za/Cultural_Issues-travel/botswana-country-guide-en-route.html
+
+## Training procedure
+The model was trained on a Google Colab Tesla T4 GPU for 200 epochs with a batch size of 64 and a learned vocabulary of 13,446 tokens. An illustrative pretraining sketch is given at the end of this card.
+Other training configuration settings can be found [here](https://s3.amazonaws.com/models.huggingface.co/bert/MoseliMotsoehli/TswanaBert/config.json).
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{motsoehli2020tswanabert,
+  author = {Moseli Motsoehli},
+  title = {TswanaBERT},
+  year = {2020}
+}
+```
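+
+## Pretraining sketch
+
+For reference, below is a minimal, illustrative sketch of the masked-language-modeling setup described under "Training procedure". It is not the exact script used to train TswanaBERT: the RoBERTa-style architecture (suggested by the byte-level `Ġ` tokens in the example output), the corpus file name, the sequence length, and the masking rate are assumptions; only the epoch count, batch size, and vocabulary size come from this card.
+
+```python
+from datasets import load_dataset
+from transformers import (
+    DataCollatorForLanguageModeling,
+    RobertaConfig,
+    RobertaForMaskedLM,
+    RobertaTokenizerFast,
+    Trainer,
+    TrainingArguments,
+)
+
+# Assumption: a byte-level BPE tokenizer with the 13,446-token vocabulary
+# reported above has already been trained and saved to ./tokenizer.
+tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer")
+
+# Fresh RoBERTa-style model sized to the learned vocabulary.
+model = RobertaForMaskedLM(RobertaConfig(vocab_size=13446))
+
+# Assumption: the Setswana corpus is a plain-text file, one sentence per line.
+dataset = load_dataset("text", data_files={"train": "setswana.txt"})["train"]
+dataset = dataset.map(
+    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
+    batched=True,
+    remove_columns=["text"],
+)
+
+# Randomly mask tokens at the standard 15% BERT/RoBERTa rate (assumed).
+collator = DataCollatorForLanguageModeling(
+    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
+)
+
+args = TrainingArguments(
+    output_dir="./TswanaBert",
+    num_train_epochs=200,            # from "Training procedure"
+    per_device_train_batch_size=64,  # from "Training procedure"
+)
+
+Trainer(
+    model=model,
+    args=args,
+    data_collator=collator,
+    train_dataset=dataset,
+).train()
+```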
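+
+## Fine-tuning sketch
+
+As one illustration of fine-tuning the model for a specific application, here is a minimal text-classification sketch using the `Trainer` API. The dataset (a hypothetical `train.csv` with `text` and `label` columns), the number of labels, and the hyperparameters are all assumptions, not part of this model's release.
+
+```python
+from datasets import load_dataset
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    Trainer,
+    TrainingArguments,
+)
+
+tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
+
+# Assumption: a hypothetical CSV with "text" and "label" columns, 2 classes.
+dataset = load_dataset("csv", data_files={"train": "train.csv"})["train"]
+dataset = dataset.map(
+    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
+    batched=True,
+)
+
+# Loads the pretrained encoder and adds a freshly initialized classification head.
+model = AutoModelForSequenceClassification.from_pretrained(
+    "MoseliMotsoehli/TswanaBert", num_labels=2
+)
+
+args = TrainingArguments(output_dir="./tswanabert-classifier", num_train_epochs=3)
+Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer).train()
+```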