Mirror of https://github.com/microsoft/LID-tool.git
Added code of the LID-tool project.
This commit is contained in:
Parent: 23d014e839
Commit: 39a0b8f82b
.gitignore
@@ -1,129 +1,2 @@
|
|||
# Byte-compiled / optimized / DLL files
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
|
||||
# C extensions
|
||||
*.so
|
||||
|
||||
# Distribution / packaging
|
||||
.Python
|
||||
build/
|
||||
develop-eggs/
|
||||
dist/
|
||||
downloads/
|
||||
eggs/
|
||||
.eggs/
|
||||
lib/
|
||||
lib64/
|
||||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
pip-wheel-metadata/
|
||||
share/python-wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
MANIFEST
|
||||
|
||||
# PyInstaller
|
||||
# Usually these files are written by a python script from a template
|
||||
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
||||
*.manifest
|
||||
*.spec
|
||||
|
||||
# Installer logs
|
||||
pip-log.txt
|
||||
pip-delete-this-directory.txt
|
||||
|
||||
# Unit test / coverage reports
|
||||
htmlcov/
|
||||
.tox/
|
||||
.nox/
|
||||
.coverage
|
||||
.coverage.*
|
||||
.cache
|
||||
nosetests.xml
|
||||
coverage.xml
|
||||
*.cover
|
||||
*.py,cover
|
||||
.hypothesis/
|
||||
.pytest_cache/
|
||||
|
||||
# Translations
|
||||
*.mo
|
||||
*.pot
|
||||
|
||||
# Django stuff:
|
||||
*.log
|
||||
local_settings.py
|
||||
db.sqlite3
|
||||
db.sqlite3-journal
|
||||
|
||||
# Flask stuff:
|
||||
instance/
|
||||
.webassets-cache
|
||||
|
||||
# Scrapy stuff:
|
||||
.scrapy
|
||||
|
||||
# Sphinx documentation
|
||||
docs/_build/
|
||||
|
||||
# PyBuilder
|
||||
target/
|
||||
|
||||
# Jupyter Notebook
|
||||
.ipynb_checkpoints
|
||||
|
||||
# IPython
|
||||
profile_default/
|
||||
ipython_config.py
|
||||
|
||||
# pyenv
|
||||
.python-version
|
||||
|
||||
# pipenv
|
||||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
||||
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
||||
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
||||
# install all needed dependencies.
|
||||
#Pipfile.lock
|
||||
|
||||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
|
||||
__pypackages__/
|
||||
|
||||
# Celery stuff
|
||||
celerybeat-schedule
|
||||
celerybeat.pid
|
||||
|
||||
# SageMath parsed files
|
||||
*.sage.py
|
||||
|
||||
# Environments
|
||||
.env
|
||||
.venv
|
||||
env/
|
||||
venv/
|
||||
ENV/
|
||||
env.bak/
|
||||
venv.bak/
|
||||
|
||||
# Spyder project settings
|
||||
.spyderproject
|
||||
.spyproject
|
||||
|
||||
# Rope project settings
|
||||
.ropeproject
|
||||
|
||||
# mkdocs documentation
|
||||
/site
|
||||
|
||||
# mypy
|
||||
.mypy_cache/
|
||||
.dmypy.json
|
||||
dmypy.json
|
||||
|
||||
# Pyre type checker
|
||||
.pyre/
|
||||
*.pyc
|
||||
*Zone.Identifier
|
||||
|
|
README.md (216 lines changed)
@@ -1,5 +1,219 @@
# Language Identification (LID) for Code-Mixed text
---

This is a word-level language identification tool for Code-Mixed text of languages (such as Hindi) written in roman script and mixed with English. At a broad level, we use an ML classifier trained with MALLET to generate word-level probabilities for language tags. We then combine these probabilities with the context information of surrounding words to generate a language tag for each word of the input. We also use hand-crafted dictionaries as look-up tables to cover unique, corner and conflicting cases, which makes for a robust language identification tool.

**Note:**

- Please read the [Papers](#Papers) section to understand the theory and experiments surrounding this project.

- The trained ML classifier model and dictionaries shipped by default with this project are specifically for `Hindi-English` Code-Mixed text.

- You can extend this project to `any language pair`. More information in [Train your Custom LID](#train-your-custom-lid).


# Project Structure
---
The project has the following structure:
```
LID-tool/
├── README.md
├── classifiers/
├── config.ini
├── dictionaries/
├── getLanguage.py
├── sampleinp.txt
├── sampleinp.txt_tagged
├── sampleoutp.txt
├── tests/
├── tmp/
└── utils/
```

Here is more info about each component:
- **classifiers/** - contains classifiers trained using MALLET. For now, we ship a single classifier, "HiEn.classifier".
- **config.ini** - config file for the project. You can learn more about it in [Working with the Config file to create dictionary and control other aspects of the project](Train_Custom_LID.md#working-with-the-config-file-to-create-dictionary-and-control-other-aspects-of-the-project).
- **dictionaries/** - contains the various Hindi and English dictionaries used in the project.
- **getLanguage.py** - main file of the project; contains the code for classifying input text into language tags.
- **sample\* files** - contain the sample input, tagged and output files of the LID.
- **tests/** - contains validation sample sets from FIRE shared tasks, useful for validating the performance of the LID.
- **tmp/** - temporary folder for holding intermediate MALLET files.
- **utils/** - contains utility code for the LID, such as feature extraction.

# Papers

- [Query word labeling and Back Transliteration for Indian Languages: Shared task system description - Spandana Gella et al.](https://www.isical.ac.in/~fire/wn/STTS/2013_translit_search-gella-msri.pdf)
- [Testing the Limits of Word level Language Identification](https://www.aclweb.org/anthology/W14-5151.pdf)

# Installation
---
Installation of the tool is straightforward, as most of it is plug-and-play. Once you have all the required dependencies, you are good to go.

After cloning the repository to your local system, install the dependencies one by one.

## Dependencies
1. **Java** - This project uses MALLET (written in Java) to train/run classifiers, so you need Java on your system. You can get a JRE from here:

```
https://www.oracle.com/java/technologies/javase-jre8-downloads.html
```
2. **Python 3** - This project is written in Python 3, so make sure you have it before running the LID. You can get Python 3 from Miniconda:

```
https://docs.conda.io/en/latest/miniconda.html
```

3. **MALLET** - You can download the MALLET binaries from the [mallet download page](http://mallet.cs.umass.edu/download.php) and follow the installation instructions given there.

4. **Twitter Text Python (TTP)** - You can simply install it using pip:

```
pip install twitter-text-python
```

## Setup

**1. Linux Installation:**
The main setup step is to make the MALLET binary executable. You can do so with the following command:

```
chmod +x mallet-2.0.8/bin/mallet
```

**2. Windows Installation:**
The main setup step on Windows is to make sure that the environment variables for Python and Java are set correctly. To check whether they are, confirm that both are accessible from the command prompt.

You also have to set the environment variable `MALLET_HOME` to the LID project's mallet folder. You can do so by opening a command prompt and typing:
```
set MALLET_HOME=\path to your LID directory\mallet-2.0.8\
```
Once you are done with the steps above, you can start using the LID.

# Usage
---

## I. Getting Inference on a text input


## a. Using LID in File Mode

Simply execute getLanguage.py with the input file containing the text data to be classified.

**Usage:**

```
python getLanguage.py <input_file>
```

**Example:**
```
python getLanguage.py sampleinp.txt
```
Output is written to `sampleinp.txt_tagged` (in general, `<input_file>_tagged`).

**Things to Note:**

1. The input file should contain lines in the following format:
```
<sentenceId>TAB<sentence>
```
**Example:**
```
1 Yeh mera pehla sentence hai
```
See sampleinp.txt for an example.

2. Make sure the input file contains no empty lines or lines without text, as the LID might throw an error.
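Once the tagged file has been produced, a minimal sketch for reading it back looks like the following. This helper is not part of the toolkit; it only assumes the format written by `langIdentifyFile()` and shown in the sample `sampleinp.txt_tagged` file: a `##<id>` line echoing the input, followed by `<id>TAB<word/TAG word/TAG ...>`.

```python
# Reader-side sketch (not shipped with the project) for <input_file>_tagged files.
def read_tagged(path):
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("##"):
                continue                      # skip blank lines and echoes of the input
            sent_id, tagged = line.split("\t", 1)
            # each token looks like "word/TAG"; split on the last "/" to be safe
            pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
            sentences[sent_id] = [(w, t) for w, t in pairs]
    return sentences

print(read_tagged("sampleinp.txt_tagged")["1"])
# [('Yeh', 'HI'), ('mera', 'HI'), ('pehla', 'HI'), ('sentence', 'EN'), ('hai', 'HI')]
```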
## b. Using LID in Library Mode
You can also use this LID as a library imported into your own code.

Simply write the following lines of code:

```python
from getLanguage import langIdentify

# inputText is a list of input sentences
# classifier is the name of the mallet classifier to be used
langIdentify(inputText, classifier)
```
The input is a list of sentences to be language tagged, like this:

![](images/langIdentify_input.PNG)

The output is a list of language-tagged input sentences, where each element is a (word, tag) tuple:

![](images/langIdentify_output.PNG)
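For example, a minimal end-to-end call, run from the repository root, might look like the sketch below. This is an illustration rather than shipped code: note that although the docstring describes a list, the shipped `langIdentify()` splits its argument on newlines, so the sentences are joined with `"\n"` here, and the default Hindi-English classifier path from this repository is used.

```python
from getLanguage import langIdentify

sentences = [
    "Yeh mera pehla sentence hai",
    "This is main sentence",
]

# the shipped implementation splits its input on newlines, so join the sentences;
# "classifiers/HiEn.classifier" is the Hindi-English classifier bundled with the repo
tagged = langIdentify("\n".join(sentences), "classifiers/HiEn.classifier")

for sentence in tagged:
    print(" ".join("{}/{}".format(word, tag) for word, tag in sentence))
# e.g. Yeh/HI mera/HI pehla/HI sentence/EN hai/HI
```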
## II. Training your own MALLET classifier

The project currently ships with a classifier for the Hindi-English pair by default, but you can also train a classifier for your own language pair.

Refer to the research papers listed above to understand the methodology and the training parameters.

The MALLET documentation explains how to use its API to train a new classifier: http://mallet.cs.umass.edu/classification.php. A rough command-line sketch is shown below.

More information in [Train your Custom LID](#train-your-custom-lid).
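As a rough, hedged sketch of what training with the MALLET command-line tools can look like (the file names here are hypothetical, and the exact import format, trainer and options should be taken from the MALLET documentation and the papers above, not from this snippet):

```
# one labeled instance per line: <instance-name> <label> <feature> <feature> ...
bin/mallet import-file --input training_data.txt --output training_data.mallet

# train a classifier (MaxEnt shown only as an example) and save it where the LID can load it
bin/mallet train-classifier --input training_data.mallet --trainer MaxEnt \
    --output-classifier classifiers/MyPair.classifier
```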
## III. Testing a new classifier

We have collated a set of datasets that can be used for validation when you want to test a new version of the classifier or any changes to the LID itself. Currently, they cover only the Code-Mixed Hindi-English pair:

1. tests/Adversarial_FIRE_2015_Sentiment_Analysis_25.txt - the 25 hardest input sentences from the [FIRE 2015 Sentiment Analysis task](http://amitavadas.com/SAIL/data.html).
2. tests/FIRE_2015_Sentiment_Analysis_25.txt - the first 25 sentences from the FIRE 2015 Sentiment Analysis task.
3. tests/test_sample_20.txt - 20 manually written code-mixed sentences.

You can use a test set by simply passing it as a parameter to getLanguage.py:

```
python getLanguage.py tests/<test_name>
```

For example:

```
python getLanguage.py tests/test_sample_20.txt
```

The above command runs your updated/new LID code on 20 manually crafted code-mixed sentences.

**Larger Datasets**

If you want to test your LID on larger datasets, you can look at these two FIRE tasks:
1. [FIRE 2013 LID task](https://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/fire_data.html) - the original dataset of 500 sentences for which this LID was built.
2. [FIRE 2015 Sentiment Analysis task](http://amitavadas.com/SAIL/data.html) - 12,000+ language-tagged sentences.

# Train your Custom LID
---

There are a couple of changes and preliminary steps needed to train your own custom LID. Follow the documentation page on [Train your Custom LID](Train_Custom_LID.md) for more information.

# Attribution
---
These are the open-source projects that this LID uses:

1. [MALLET: A Machine Learning for Language Toolkit. McCallum, Andrew Kachites.](http://mallet.cs.umass.edu/about.php)
2. [Twitter-Text-Python (TTP) by Edmond Burnett](https://github.com/edmondburnett/twitter-text-python)

Apart from the above projects, we also use freely and openly hosted dictionaries to improve the LID; you can learn more about them in [Train your Custom LID](Train_Custom_LID.md).

# Contributors
---

In order of recency, most recent first:

1. Mohd Sanad Zaki Rizvi
2. Anirudh Srinivasan
3. Sebastin Santy
4. Anshul Bawa
5. Silvana Hartmann
6. Spandana Gella


# Contributing to this Code
---

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
Train_Custom_LID.md
@@ -0,0 +1,97 @@
# Information Flow of LID


The following is a very high-level approximation of how information flows in the LID:

![](images/info_flow_new_lid.PNG)

Let's understand what is happening here:

- Whenever you provide input to `getLanguage.py`, either `langIdentify()` or `langIdentifyFile()` is triggered, depending on whether you are using the LID in library mode or file mode.
- Either of these functions takes your input sentences and splits them into words. Then, on each sentence, we invoke `callMallet()`. This function essentially has two tasks.
- The first task is to check which words are already present in the language dictionaries we have; such words can be assigned language tags directly using our dictionary logic.
- The words that aren't present in the dictionaries are split off and sent to the ML classifier trained using MALLET.
- The second task is to call the ML classifier on those words, but before that we invoke `ExtractFeatures.py` to generate `n-gram` features for the ML model. Once the features are generated and the ML classifier is called, it outputs a probability for each word for both languages.
- Sometimes words that are present in the dictionaries are still not tagged, due to corner cases discussed in the next section. In that case, we re-invoke the ML classifier on these words to get language probabilities.
- Finally, we combine the probability values of both sets of words: the ones processed by the ML classifier and the ones tagged using dictionaries (we use custom probability values of `1E-9` and `0.999999999` for the wrong and right language tags respectively for each dictionary-tagged word; you can change these values using the `config` file).
- Now that the input words are combined once again, their probability values are fed into the context-based tagging logic (explained in the research paper). This algorithm produces the final language tags. A simplified sketch of this flow is shown below.
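To make the walkthrough concrete, here is a small, self-contained illustration of the dictionary-plus-classifier merging step described above. This is only a sketch with toy dictionaries and a stand-in classifier; the real logic lives in `callMallet()`, `dictTagging()` and `mergeBlurbs()` in `getLanguage.py`, and the context-based smoothing step is omitted here.

```python
DICT_PROB_YES = 0.999999999   # default probability for the dictionary-confirmed language
DICT_PROB_NO = 1e-9           # default probability for the other language

hindi_dict = {"mera", "pehla", "hai"}        # toy dictionaries; the shipped ones are far larger
english_dict = {"sentence", "is", "the"}

def toy_classifier(word):
    """Stand-in for the MALLET classifier: returns (P(hindi), P(english))."""
    return (0.7, 0.3) if word[-1] in "ah" else (0.3, 0.7)

def tag_sentence(words):
    tagged = []
    for w in words:
        lw = w.lower()
        in_hi, in_en = lw in hindi_dict, lw in english_dict
        if in_hi and not in_en:                       # dictionary decides
            probs = (DICT_PROB_YES, DICT_PROB_NO)
        elif in_en and not in_hi:
            probs = (DICT_PROB_NO, DICT_PROB_YES)
        else:                                         # unknown or ambiguous word -> classifier
            probs = toy_classifier(lw)
        tagged.append((w, "HI" if probs[0] >= probs[1] else "EN"))
    return tagged

print(tag_sentence("Yeh mera pehla sentence hai".split()))
# [('Yeh', 'HI'), ('mera', 'HI'), ('pehla', 'HI'), ('sentence', 'EN'), ('hai', 'HI')]
```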
This was a quick walkthrough of the code; you can read the research papers for more information about the algorithm:

- [Query word labeling and Back Transliteration for Indian Languages: Shared task system description - Spandana Gella et al.](https://www.isical.ac.in/~fire/wn/STTS/2013_translit_search-gella-msri.pdf)
- [Testing the Limits of Word level Language Identification](https://www.aclweb.org/anthology/W14-5151.pdf)

# About Dictionaries and How to create your own?

## I. Brief description of each dictionary currently used in the project

The project currently uses 4 manually crafted dictionaries for English and 2 for Hindi. Here is how they have been sourced:

1. dictionaries/dict1bigr.txt – a sample of English 2-grams from [Google's Ngram viewer]().
2. dictionaries/dict1coca.txt – a sample of English words from the [Corpus of Contemporary American English]().
3. dictionaries/dict1goog10k.txt – a sample of the 10,000 most frequent words from [Google's Trillion Word Corpus]().
4. dictionaries/dict1hi.txt – a semi-automatically selected set of common Hindi words in the roman script.
5. dictionaries/dict1hinmov.txt – same as 4, but with common Hindi words prevalent in movie scripts.
6. dictionaries/dict1text.txt – a manually curated list of slang and commonly used internet short-hands of English.

The above dictionaries are combined in different combinations to form the language dictionaries for Hindi and English. Here is an overview:

![](images/dictionary_structure.PNG)

You can choose which files are combined, and in what order, for each dictionary using the `[DICTIONARY HIERARCHY]` section of the `config.ini` file; a sketch is shown below. Check [Working with the Config file to create dictionary and control other aspects of the project](#working-with-the-config-file-to-create-dictionary-and-control-other-aspects-of-the-project) for more info.
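For illustration only, a hypothetical hierarchy for a new Spanish-English LID might look like this (the Spanish dictionary and file names below are made up; the actual shipped values are in the `config.ini` included in this commit):

```
[DICTIONARY NAMES]
language_1_dicts = spadict1
language_2_dicts = eng0dict1, eng1dict1

[DICTIONARY HIERARCHY]
# each dictionary is built by combining these files, in this order
spadict1 = dict1spa_common.txt, dict1spa_slang.txt
eng0dict1 = dict1goog10k.txt, dict1coca.txt
eng1dict1 = dict1bigr.txt, dict1text.txt
```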
## II. Reason for using Dictionaries and How to create your own custom dictionaries for a new use-case or language pair?

The addition of dictionaries to the project was an engineering decision taken after empirical results showed that the dictionaries complement the performance of the ML-based classifier (MALLET) in certain corner cases.

Here are some of the problems that this method solved:
### 1. Dealing with “common words” that can belong to either of the languages.

For example, the English word `“to”` is one of the ways the Hindi word `“तो”` or `“तू”` is spelt when written in `roman script`, so the word “to” is classified differently in the following two sentences:


**Input:** I have to get back to my advisor

**Output:** I/EN have/EN ***to/EN*** get/EN back/EN to/EN my/EN advisor/EN

**Input:** Bhai to kabhi nahi sudhrega

**Output:** Bhai/HI ***to/HI*** kabhi/HI nahi/HI sudhrega/HI


In this case, we make sure that the word “to” is present in both dictionaries, so the LID relies on the combination of ML probabilities and context (surrounding words) to tag the language.

### 2. Words that surely belong to only one language.

For example, words like “bhai”, “nahi”, “kabhi” in Hindi, and words like “advisor”, “get” etc. in English. In this case, we use the relevant dictionary to force-tag the word to the correct language even if the ML classifier says otherwise.

So the questions that you have to ask yourself while creating the dictionaries are:

1. “Are there certain words that can be spelt the same way in both languages?” And,
2. “Are there common words in one language that surely can’t be used in the other language?”

These are just a couple of the things we looked at while building this tool; given your specific use-case, you can consider more such engineering decisions and customize the dictionaries accordingly.

# Working with the Config file to create dictionary and control other aspects of the project

The [config.ini](config.ini) file is the central command center of the project. The fields are self-explanatory and commented, so you have an idea of what each field does.

All the information required to run the project is picked up from this file.

Some of the important things for which you can use the config file are:

1. You can give the paths to the different folders of the project, such as the data folder that contains all the dictionaries, or the location of MALLET's binaries on your system.

2. You can also give the names of the language pair for which you are training your LID; this information is used everywhere internally in the project.

3. You can also specify the names of the dictionaries for each language, and which files are combined, in what order, to create each dictionary. See the existing `config.ini` file for an example.

4. Finally, you can provide the custom probability values to be used for dictionary-tagged words in the project.

# Specific areas in code that need to be changed when creating your own LID

Here are a couple of changes that you'd need to make if you are creating a new LID for a different language pair:

1. First of all, use the `config.ini` file to override the default values for language names, the structure and names of dictionaries, etc.
2. You will have to rewrite the custom logic of the `dictTagging()` function in order to utilize your new dictionaries. Have a look at our logic to understand how, using a simple set of conditionals, we choose which language tag is allotted; a simplified sketch follows.
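A simplified sketch of such a rewrite is shown below, using hypothetical Spanish-English dictionary names; the shipped `dictTagging()` in `getLanguage.py` is the authoritative version and handles more cases.

```python
# toy stand-ins for the dictionaries that readConfig()/createDicts() normally build
language_1_dicts = {"spadict1": {"hola", "gracias", "bueno"}}
language_2_dicts = {"eng0dict1": {"the", "advisor", "get"}, "eng1dict1": {"lol", "btw"}}

def dictTagging(word, tag):
    """Assign a tag from the dictionaries; leave 'tag' unchanged for ambiguous words."""
    w = word.lower()
    in_spa = w in language_1_dicts["spadict1"]
    in_en = w in language_2_dicts["eng0dict1"] or w in language_2_dicts["eng1dict1"]

    if in_spa and not in_en:
        tag = "SP"            # unambiguous first-language word: force the tag
    elif in_en and not in_spa:
        tag = "EN"            # unambiguous English word: force the tag
    # in both or neither dictionary: keep the incoming tag (e.g. "oth") so that
    # the MALLET probabilities and the context logic decide later
    return tag

print(dictTagging("hola", "oth"), dictTagging("advisor", "oth"), dictTagging("xyz", "oth"))
# SP EN oth
```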
Binary file not shown.
|
config.ini
@@ -0,0 +1,39 @@
[GENERAL]
# if verbose is 1, display language probabilities for each word; on (1) by default; set to 0 to turn off.
verbose =
# default: HINDI
language_1 =
# default: ENGLISH
language_2 =

[DEFAULT PATHS]
# Path to the classifier file, default: os.path.join(os.getcwd(), 'classifiers', 'HiEn.classifier')
CLASSIFIER_PATH =
# Path to the temporary folder, default: os.path.join(os.getcwd(), 'tmp', '')
TMP_FILE_PATH =
# Path to the dictionary folder, default: os.path.join(os.getcwd(), 'dictionaries', '')
DICT_PATH =
# Path to the mallet binary, default: os.path.join(os.getcwd(), 'mallet-2.0.8', 'bin', 'mallet')
MALLET_PATH =

[DICTIONARY PROBABILITY VALUES]
# probability values for the correct and incorrect language
# default: 0.999999999
dict_prob_yes =
# default: 1E-9
dict_prob_no =

[DICTIONARY NAMES]
# dictionary used to store already classified words between runs and in-memory
# default: memoize_dict.pkl
memoize_dict_file =

# names of the dictionaries per language
language_1_dicts = hindict1
language_2_dicts = eng0dict1, eng1dict1

[DICTIONARY HIERARCHY]
# which files are combined to form which dictionary
eng0dict1 = dict1goog10k.txt, dict1coca.txt
eng1dict1 = dict1bigr.txt, dict1text.txt
hindict1 = dict1hinmov.txt, dict1hi.txt
(Diffs of four large files are not shown because of their size.)
@ -0,0 +1,447 @@
|
|||
hai
|
||||
ke
|
||||
ye
|
||||
mein
|
||||
se
|
||||
ki
|
||||
hain
|
||||
toh
|
||||
nahi
|
||||
ho
|
||||
nahin
|
||||
aur
|
||||
kya
|
||||
ko
|
||||
kar
|
||||
ka
|
||||
na
|
||||
ek
|
||||
tha
|
||||
yeh
|
||||
raha
|
||||
hoon
|
||||
tum
|
||||
hum
|
||||
pe
|
||||
sab
|
||||
mujhe
|
||||
liye
|
||||
baat
|
||||
koi
|
||||
gaya
|
||||
rahe
|
||||
tu
|
||||
kuch
|
||||
ne
|
||||
jo
|
||||
ab
|
||||
ye
|
||||
phir
|
||||
haan
|
||||
woh
|
||||
mere
|
||||
pata
|
||||
aa
|
||||
saath
|
||||
par
|
||||
thi
|
||||
le
|
||||
hua
|
||||
kiya
|
||||
bahut
|
||||
maine
|
||||
bhai
|
||||
gaye
|
||||
de
|
||||
yahan
|
||||
meri
|
||||
ji
|
||||
apne
|
||||
diya
|
||||
mera
|
||||
hoga
|
||||
abhi
|
||||
bas
|
||||
din
|
||||
theek
|
||||
kaam
|
||||
bol
|
||||
hun
|
||||
log
|
||||
h
|
||||
mat
|
||||
kaise
|
||||
naa
|
||||
lekin
|
||||
tak
|
||||
kyon
|
||||
kahan
|
||||
jab
|
||||
kisi
|
||||
chal
|
||||
jaa
|
||||
baad
|
||||
dekh
|
||||
aaj
|
||||
sakta
|
||||
paas
|
||||
chahiye
|
||||
lo
|
||||
agar
|
||||
apni
|
||||
kyun
|
||||
kaun
|
||||
kuchh
|
||||
karo
|
||||
karte
|
||||
paise
|
||||
wo
|
||||
beta
|
||||
aapne
|
||||
kal
|
||||
pehle
|
||||
tere
|
||||
aise
|
||||
hai-
|
||||
voh
|
||||
aapke
|
||||
yaar
|
||||
sirf
|
||||
k
|
||||
haath
|
||||
ja
|
||||
tujhe
|
||||
vah
|
||||
arre
|
||||
raat
|
||||
uske
|
||||
saal
|
||||
dono
|
||||
bola
|
||||
laga
|
||||
andar
|
||||
lagta
|
||||
naam
|
||||
apna
|
||||
hee
|
||||
liya
|
||||
aaya
|
||||
khud
|
||||
lag
|
||||
tumhe
|
||||
jaise
|
||||
aata
|
||||
har
|
||||
aisa
|
||||
tarah
|
||||
matlab
|
||||
mil
|
||||
achha
|
||||
sahab
|
||||
itna
|
||||
maar
|
||||
bahar
|
||||
teri
|
||||
shahid
|
||||
jao
|
||||
usse
|
||||
di
|
||||
tumhare
|
||||
logon
|
||||
kab
|
||||
iss
|
||||
samajh
|
||||
arjun
|
||||
thoda
|
||||
tumne
|
||||
ladki
|
||||
pahle
|
||||
uski
|
||||
kaha
|
||||
aapki
|
||||
jaan
|
||||
tera
|
||||
itni
|
||||
kahin
|
||||
chhod
|
||||
bata
|
||||
uss
|
||||
rahul
|
||||
sach
|
||||
bhee
|
||||
hamare
|
||||
badi
|
||||
mai
|
||||
teen
|
||||
kam
|
||||
aap
|
||||
chup
|
||||
bauji
|
||||
waqt
|
||||
maa
|
||||
tumhara
|
||||
baap
|
||||
bada
|
||||
humko
|
||||
dost
|
||||
saab
|
||||
kitna
|
||||
yaad
|
||||
wahan
|
||||
wali
|
||||
hui
|
||||
unke
|
||||
subah
|
||||
sahi
|
||||
jaao
|
||||
wala
|
||||
soch
|
||||
uska
|
||||
aadmi
|
||||
bolo
|
||||
jaldi
|
||||
sa
|
||||
taraf
|
||||
usko
|
||||
denge
|
||||
hue
|
||||
shuru
|
||||
kah
|
||||
bilkul
|
||||
dena
|
||||
unko
|
||||
itne
|
||||
saare
|
||||
aage
|
||||
tab
|
||||
lena
|
||||
jee
|
||||
bana
|
||||
kee
|
||||
neeche
|
||||
yahi
|
||||
aao
|
||||
humne
|
||||
keh
|
||||
hone
|
||||
karenge
|
||||
beti
|
||||
der
|
||||
bade
|
||||
maan
|
||||
poora
|
||||
si
|
||||
arey
|
||||
paanch
|
||||
poori
|
||||
ladke
|
||||
bina
|
||||
prem
|
||||
wapas
|
||||
yahin
|
||||
sau
|
||||
waise
|
||||
galat
|
||||
khush
|
||||
hote
|
||||
akele
|
||||
paisa
|
||||
aane
|
||||
koshish
|
||||
usne
|
||||
socha
|
||||
humein
|
||||
isliye
|
||||
ali
|
||||
babu
|
||||
karega
|
||||
jhoot
|
||||
dete
|
||||
arrey
|
||||
baje
|
||||
sabse
|
||||
dil
|
||||
rohan
|
||||
thay
|
||||
baith
|
||||
upar
|
||||
mar
|
||||
rani
|
||||
unka
|
||||
lete
|
||||
duniya
|
||||
tune
|
||||
zara
|
||||
ma
|
||||
leke
|
||||
rajvir
|
||||
beech
|
||||
minal
|
||||
tumse
|
||||
rahey
|
||||
kitni
|
||||
kitne
|
||||
unhe
|
||||
omi
|
||||
suna
|
||||
saaf
|
||||
hona
|
||||
hamara
|
||||
ladka
|
||||
neerja
|
||||
aisi
|
||||
hamesha
|
||||
nawab
|
||||
kyonki
|
||||
yehi
|
||||
ammi
|
||||
alag
|
||||
hahn
|
||||
dadu
|
||||
kaisa
|
||||
usey
|
||||
khempal
|
||||
cheez
|
||||
ruk
|
||||
aapse
|
||||
pooch
|
||||
lenge
|
||||
doon
|
||||
iske
|
||||
iska
|
||||
inko
|
||||
chai
|
||||
isse
|
||||
kis
|
||||
honge
|
||||
li
|
||||
mama
|
||||
kamre
|
||||
kuuch
|
||||
dus
|
||||
pakad
|
||||
agle
|
||||
iski
|
||||
singh
|
||||
accha
|
||||
dega
|
||||
papaji
|
||||
zinda
|
||||
wapis
|
||||
didi
|
||||
milega
|
||||
vijay
|
||||
unhone
|
||||
roz
|
||||
wahi
|
||||
kyoun
|
||||
varun
|
||||
bete
|
||||
bhool
|
||||
pade
|
||||
naya
|
||||
jis
|
||||
karein
|
||||
che
|
||||
bahu
|
||||
apun
|
||||
lene
|
||||
lage
|
||||
jhooth
|
||||
jaisa
|
||||
doosre
|
||||
ekdum
|
||||
khol
|
||||
chorr
|
||||
thhey
|
||||
raho
|
||||
aate
|
||||
dene
|
||||
ashwin
|
||||
oye
|
||||
ha
|
||||
re
|
||||
vo
|
||||
ae
|
||||
mamma
|
||||
khan
|
||||
dein
|
||||
acha
|
||||
biwi
|
||||
bolna
|
||||
wale
|
||||
allah
|
||||
nayi
|
||||
sar
|
||||
saalon
|
||||
chor
|
||||
thha
|
||||
tumhey
|
||||
lakh
|
||||
bolta
|
||||
inka
|
||||
pee
|
||||
jahan
|
||||
abhie
|
||||
bura
|
||||
lega
|
||||
line
|
||||
humse
|
||||
jaye
|
||||
bula
|
||||
fir
|
||||
lado
|
||||
saat
|
||||
sandhya
|
||||
wahin
|
||||
dum
|
||||
usi
|
||||
batao
|
||||
darr
|
||||
inn
|
||||
thee
|
||||
gussa
|
||||
pad
|
||||
deepak
|
||||
sabko
|
||||
bees
|
||||
baithe
|
||||
mushkil
|
||||
dilli
|
||||
baba
|
||||
zamindar
|
||||
kisne
|
||||
suno
|
||||
kum
|
||||
haar
|
||||
pandrah
|
||||
un
|
||||
jai
|
||||
akela
|
||||
wah
|
||||
kai
|
||||
hafte
|
||||
karachi
|
||||
isi
|
||||
ise
|
||||
madad
|
||||
bolne
|
||||
bech
|
||||
ram
|
||||
rah
|
||||
baitho
|
||||
isne
|
||||
prashant
|
||||
ro
|
||||
inhoney
|
||||
abbu
|
||||
maut
|
||||
desh
|
||||
hume
|
||||
kumar
|
||||
falak
|
||||
govi
|
||||
mile
|
||||
kaat
|
||||
dar
|
||||
ni
|
|
@ -0,0 +1,420 @@
|
|||
<3
|
||||
4u
|
||||
8L3W
|
||||
a3
|
||||
aamof
|
||||
aap
|
||||
aar
|
||||
aas
|
||||
add
|
||||
adn
|
||||
aeap
|
||||
afaik
|
||||
afk
|
||||
aisb
|
||||
aka
|
||||
aml
|
||||
aota
|
||||
asap
|
||||
at
|
||||
atm
|
||||
ayec
|
||||
ayor
|
||||
abtr
|
||||
b4
|
||||
b4n
|
||||
bak
|
||||
bau
|
||||
bbiaf
|
||||
bbiam
|
||||
bbl
|
||||
bbs
|
||||
bc
|
||||
bcnu
|
||||
bf
|
||||
bff
|
||||
bfn
|
||||
bg
|
||||
blnt
|
||||
bm&y
|
||||
bol
|
||||
brb
|
||||
brt
|
||||
bta
|
||||
btdt
|
||||
btw
|
||||
bcoz
|
||||
cam
|
||||
cas
|
||||
cmiiw
|
||||
cmon
|
||||
cnt
|
||||
cob
|
||||
cos
|
||||
coz
|
||||
cr8
|
||||
crb
|
||||
crbt
|
||||
cya
|
||||
cu
|
||||
cua
|
||||
cul
|
||||
cul8r
|
||||
cwyl
|
||||
cyo
|
||||
cys
|
||||
cth
|
||||
cuz
|
||||
cz
|
||||
da
|
||||
dl
|
||||
degt
|
||||
diku
|
||||
dqmot
|
||||
dno
|
||||
dnt
|
||||
dts
|
||||
dv8
|
||||
dw
|
||||
ebkac
|
||||
ez
|
||||
eg
|
||||
ema
|
||||
emfbi
|
||||
eod
|
||||
eom
|
||||
ezy
|
||||
eoc
|
||||
f2f
|
||||
f2t
|
||||
fbm
|
||||
fc
|
||||
fd's
|
||||
ficcl
|
||||
fitb
|
||||
fnx
|
||||
fomcl
|
||||
frt
|
||||
ftw
|
||||
ftc
|
||||
fwiw
|
||||
fya
|
||||
fyeo
|
||||
fyi
|
||||
f8
|
||||
f9
|
||||
fotb
|
||||
g
|
||||
g2cu
|
||||
gtg
|
||||
g2g
|
||||
g2r
|
||||
g9
|
||||
ga
|
||||
gal
|
||||
gb
|
||||
gbu
|
||||
gd
|
||||
gdr
|
||||
gf
|
||||
gfi
|
||||
gg
|
||||
giar
|
||||
gigo
|
||||
gj
|
||||
gl
|
||||
gl/hf
|
||||
gmta
|
||||
gna
|
||||
goi
|
||||
gol
|
||||
gr8
|
||||
gr&d
|
||||
gt
|
||||
gtg
|
||||
h&k
|
||||
h2cus
|
||||
h8
|
||||
hagn
|
||||
hags
|
||||
hago
|
||||
hand
|
||||
hf
|
||||
hhis
|
||||
h/o
|
||||
hoas
|
||||
hru
|
||||
hth
|
||||
hv
|
||||
iac
|
||||
ianal
|
||||
ib
|
||||
ic
|
||||
icbw
|
||||
idawtc
|
||||
idc
|
||||
idk
|
||||
idts
|
||||
idunno
|
||||
ig2r
|
||||
iirc
|
||||
ilu
|
||||
ilbl8
|
||||
ily
|
||||
im
|
||||
imao
|
||||
imho
|
||||
imnsho
|
||||
imo
|
||||
ims
|
||||
inal
|
||||
indtd
|
||||
iow
|
||||
ipo
|
||||
irl
|
||||
irmc
|
||||
iuss
|
||||
iykwim
|
||||
iyo
|
||||
iyss
|
||||
jb
|
||||
jic
|
||||
jk
|
||||
j00r
|
||||
jac
|
||||
jic
|
||||
jja
|
||||
jk
|
||||
jmo
|
||||
jp
|
||||
jpp
|
||||
jtlyk
|
||||
jsun
|
||||
k
|
||||
kk
|
||||
kl
|
||||
kiss
|
||||
kit
|
||||
kotc
|
||||
kotl
|
||||
kwim
|
||||
l8
|
||||
l8r
|
||||
lbr
|
||||
ld
|
||||
lgp
|
||||
lgr
|
||||
lol
|
||||
lqtm
|
||||
lsl
|
||||
ltm
|
||||
ltns
|
||||
lylas
|
||||
lyk
|
||||
m8
|
||||
mfi
|
||||
mmc
|
||||
msg
|
||||
mtf
|
||||
mtfbwu
|
||||
musm
|
||||
myob
|
||||
n
|
||||
n2u
|
||||
n1
|
||||
nbd
|
||||
nvm
|
||||
ne
|
||||
ne1
|
||||
ni
|
||||
nfm
|
||||
nimby
|
||||
nlt
|
||||
nm
|
||||
nmjc
|
||||
np
|
||||
no1
|
||||
noyb
|
||||
np
|
||||
nrn
|
||||
nt
|
||||
ntu
|
||||
nvm
|
||||
nw
|
||||
nwo
|
||||
nvm
|
||||
nbtr
|
||||
omg
|
||||
oic
|
||||
omw
|
||||
oo
|
||||
ooh
|
||||
ootd
|
||||
op
|
||||
otb
|
||||
otl
|
||||
otoh
|
||||
ott
|
||||
ottomh
|
||||
otw
|
||||
ova
|
||||
o
|
||||
pcm
|
||||
pdq
|
||||
plmk
|
||||
plz
|
||||
pls
|
||||
plu
|
||||
pm
|
||||
pmfi
|
||||
pmfji
|
||||
poahf
|
||||
ppl
|
||||
prob
|
||||
prolly
|
||||
prt
|
||||
prw
|
||||
ptmm
|
||||
pu
|
||||
pwn
|
||||
pxt
|
||||
q
|
||||
qik
|
||||
qt
|
||||
rodl
|
||||
rofl
|
||||
rotfl
|
||||
rotflol
|
||||
rotfluts
|
||||
rp
|
||||
rl
|
||||
rme
|
||||
rmv
|
||||
rsn
|
||||
ruok
|
||||
icnr
|
||||
ig2r
|
||||
sal
|
||||
say
|
||||
sbtsbc
|
||||
sc
|
||||
sete
|
||||
sis
|
||||
sit
|
||||
slap
|
||||
slp
|
||||
slpn
|
||||
smhid
|
||||
smt
|
||||
snafu
|
||||
so
|
||||
sol
|
||||
somy
|
||||
sotmg
|
||||
soz or sry
|
||||
spk
|
||||
spst
|
||||
ss
|
||||
ssinf
|
||||
str8
|
||||
stq
|
||||
suitm
|
||||
sul
|
||||
sup
|
||||
syl
|
||||
t+
|
||||
ta
|
||||
tafn
|
||||
tam
|
||||
tb
|
||||
tbd
|
||||
tbh
|
||||
tc
|
||||
tgif
|
||||
thts
|
||||
thnx
|
||||
thnq
|
||||
tu
|
||||
tq
|
||||
ty
|
||||
thx
|
||||
tia
|
||||
tiad
|
||||
tlk2ul8r
|
||||
tma
|
||||
tmb
|
||||
tmi
|
||||
tmot
|
||||
tmrw
|
||||
tmwfi
|
||||
tnstaafl
|
||||
toy
|
||||
tpm
|
||||
tptb
|
||||
tstb
|
||||
ttfn
|
||||
ttly
|
||||
ttml
|
||||
tttt
|
||||
ttyl
|
||||
ttys
|
||||
txt
|
||||
txtm8
|
||||
tym
|
||||
tyt
|
||||
tyvm
|
||||
tfpi
|
||||
ugtbk
|
||||
uktr
|
||||
ul
|
||||
ur
|
||||
uv
|
||||
uw
|
||||
vf
|
||||
vms
|
||||
vmsi
|
||||
w
|
||||
w/e
|
||||
w/i
|
||||
w/o
|
||||
wam
|
||||
wan2tlk
|
||||
wat
|
||||
wayf
|
||||
wb
|
||||
wb2my
|
||||
wg
|
||||
woteva
|
||||
whteva
|
||||
wiifm
|
||||
wk
|
||||
wkd
|
||||
wombat
|
||||
wrk
|
||||
wrud
|
||||
wut
|
||||
wt
|
||||
wtb
|
||||
wtg
|
||||
wth
|
||||
wts
|
||||
wubu2
|
||||
wuu2
|
||||
wu?
|
||||
wubu2?
|
||||
wuciwug
|
||||
wuf?
|
||||
wuwh
|
||||
wwyc
|
||||
wylei
|
||||
wat
|
||||
xlnt
|
||||
ya
|
||||
ybs
|
||||
ygbkm
|
||||
ykwycd
|
||||
ymmv
|
||||
yr
|
||||
yw
|
||||
zzz
|
||||
wysiwyg
|
|
getLanguage.py
@@ -0,0 +1,579 @@
|
|||
"""
|
||||
Master code to take input, generate features, call MALLET and use the probabilities for generating language tags
|
||||
"""
|
||||
|
||||
# !/usr/bin/python
|
||||
|
||||
import sys
|
||||
import subprocess
|
||||
import re
|
||||
import os
|
||||
import time
|
||||
import codecs
|
||||
import pickle
|
||||
|
||||
from utils import extractFeatures as ef
|
||||
from utils import generateLanguageTags as genLangTag
|
||||
from collections import OrderedDict
|
||||
from configparser import ConfigParser
|
||||
|
||||
|
||||
def readConfig():
|
||||
"""
|
||||
Read config file to load global variables for the project
|
||||
"""
|
||||
|
||||
global language_1_dicts
|
||||
global language_2_dicts
|
||||
global memoize_dict
|
||||
global combined_dicts
|
||||
global CLASSIFIER_PATH
|
||||
global TMP_FILE_PATH
|
||||
global DICT_PATH
|
||||
global MALLET_PATH
|
||||
global dict_prob_yes
|
||||
global dict_prob_no
|
||||
global memoize_dict_file
|
||||
global verbose
|
||||
global lang1
|
||||
global lang2
|
||||
|
||||
# initialize dictionary variables
|
||||
language_1_dicts = {}
|
||||
language_2_dicts = {}
|
||||
# initialize list of dictionary words
|
||||
combined_dicts = []
|
||||
|
||||
# read config
|
||||
config = ConfigParser()
|
||||
config.read("config.ini")
|
||||
config_paths = config["DEFAULT PATHS"]
|
||||
config_probs = config["DICTIONARY PROBABILITY VALUES"]
|
||||
config_dicts = config["DICTIONARY NAMES"]
|
||||
config_gen = config["GENERAL"]
|
||||
|
||||
# setup paths for classifier, tmp folder, dictionaries and mallet
|
||||
CLASSIFIER_PATH = config_paths["CLASSIFIER_PATH"] if config_paths["CLASSIFIER_PATH"] else os.path.join(
|
||||
os.getcwd(), 'classifiers', 'HiEn.classifier')
|
||||
TMP_FILE_PATH = config_paths["TMP_FILE_PATH"] if config_paths["TMP_FILE_PATH"] else os.path.join(
|
||||
os.getcwd(), 'tmp', '')
|
||||
DICT_PATH = config_paths["DICT_PATH"] if config_paths["DICT_PATH"] else os.path.join(
|
||||
os.getcwd(), 'dictionaries', '')
|
||||
MALLET_PATH = config_paths["MALLET_PATH"] if config_paths["MALLET_PATH"] else os.path.join(
|
||||
os.getcwd(), 'mallet-2.0.8', 'bin', 'mallet')
|
||||
|
||||
# initialize probability values for the correct and incorrect language
|
||||
dict_prob_yes = config_probs["dict_prob_yes"] if config_probs["dict_prob_yes"] else 0.999999999
|
||||
dict_prob_no = config_probs["dict_prob_no"] if config_probs["dict_prob_no"] else 1E-9
|
||||
|
||||
# initialize memoize_dict from file if already present, else with an empty dictionary
|
||||
memoize_dict_file = config_dicts["memoize_dict_file"] if config_dicts["memoize_dict_file"] else "memoize_dict.pkl"
|
||||
if os.path.isfile(DICT_PATH + memoize_dict_file):
|
||||
with open(DICT_PATH + memoize_dict_file, "rb") as fp:
|
||||
memoize_dict = pickle.load(fp)
|
||||
else:
|
||||
memoize_dict = {}
|
||||
|
||||
# by default verbose is ON
|
||||
verbose = int(config_gen["verbose"]) if config_gen["verbose"] else 1
|
||||
|
||||
# get language names by default language 1 is HINDI and language 2 is ENGLISH
|
||||
lang1 = config_gen["language_1"].upper(
|
||||
) if config_gen["language_1"] else "HINDI"
|
||||
lang2 = config_gen["language_2"].upper(
|
||||
) if config_gen["language_2"] else "ENGLISH"
|
||||
|
||||
lang_1dict_names = config_dicts["language_1_dicts"].split(
|
||||
",") if config_dicts["language_1_dicts"] else "hindict1"
|
||||
lang_2dict_names = config_dicts["language_2_dicts"].split(
|
||||
",") if config_dicts["language_2_dicts"] else "eng0dict1, eng1dict1"
|
||||
|
||||
# initialize language_1_dict and language_2_dict with all the sub dictionaries
|
||||
for dict_names in lang_1dict_names:
|
||||
language_1_dicts[dict_names.strip()] = {}
|
||||
for dict_names in lang_2dict_names:
|
||||
language_2_dicts[dict_names.strip()] = {}
|
||||
|
||||
|
||||
def createDicts():
|
||||
"""
|
||||
Create and populate language dictionaries for Language 1 and Language 2
|
||||
"""
|
||||
|
||||
global language_1_dicts
|
||||
global language_2_dicts
|
||||
global combined_dicts
|
||||
global DICT_PATH
|
||||
global lang1
|
||||
global lang2
|
||||
|
||||
language_1_words = []
|
||||
language_2_words = []
|
||||
|
||||
# read config to get dictionary structures
|
||||
config = ConfigParser()
|
||||
config.read("config.ini")
|
||||
dict_struct = dict(config.items("DICTIONARY HIERARCHY"))
|
||||
|
||||
# create language_1 dictionary
|
||||
for sub_dict in language_1_dicts:
|
||||
input_files = dict_struct[sub_dict].split(",")
|
||||
for filename in input_files:
|
||||
with open(DICT_PATH + filename.strip(), 'r') as dictfile:
|
||||
words = dictfile.read().split('\n')
|
||||
for w in words:
|
||||
language_1_dicts[sub_dict][w.strip().lower()] = ''
|
||||
|
||||
language_1_words.extend(list(language_1_dicts[sub_dict].keys()))
|
||||
print(lang1, 'dictionary created')
|
||||
|
||||
# create language_2 dictionary
|
||||
for sub_dict in language_2_dicts:
|
||||
input_files = dict_struct[sub_dict].split(",")
|
||||
for filename in input_files:
|
||||
with open(DICT_PATH + filename.strip(), 'r') as dictfile:
|
||||
words = dictfile.read().split('\n')
|
||||
for w in words:
|
||||
language_2_dicts[sub_dict][w.strip().lower()] = ''
|
||||
|
||||
language_2_words.extend(list(language_2_dicts[sub_dict].keys()))
|
||||
print(lang2, 'dictionary created')
|
||||
|
||||
# populate the combined word list
|
||||
combined_dicts.extend(language_1_words)
|
||||
combined_dicts.extend(language_2_words)
|
||||
|
||||
|
||||
def dictTagging(word, tag):
|
||||
"""
|
||||
Use language dictionaries to tag words
|
||||
"""
|
||||
|
||||
global language_1_dicts
|
||||
global language_2_dicts
|
||||
global lang1
|
||||
global lang2
|
||||
|
||||
dhin, den0, den1 = 0, 0, 0
|
||||
|
||||
word = word
|
||||
|
||||
if word.lower() in language_1_dicts["hindict1"].keys():
|
||||
dhin = 1
|
||||
if word.lower() in language_2_dicts["eng0dict1"].keys():
|
||||
den0 = 1
|
||||
if word.lower() in language_2_dicts["eng1dict1"].keys():
|
||||
den1 = 1
|
||||
|
||||
# if not den0 and not den1 and not dhin : do nothing
|
||||
if (not den0 and not den1 and dhin) or (not den0 and den1 and dhin): # make HI
|
||||
tag = lang1[:2]
|
||||
|
||||
if (not den0 and den1 and not dhin) or (den0 and not dhin): # make EN
|
||||
tag = lang2[:2]
|
||||
|
||||
# if den0 and not den1 and not dhin : subsumed
|
||||
# if den0 and not den1 and dhin : do nothing
|
||||
# if den0 and den1 and not dhin : subsumed
|
||||
# if den0 and den1 and dhin : do nothing
|
||||
|
||||
return tag
|
||||
|
||||
|
||||
def dictLookup(word):
|
||||
"""
|
||||
Check whether a word is already present in a dictionary
|
||||
"""
|
||||
|
||||
global combined_dicts
|
||||
word = word.lower()
|
||||
if word in set(combined_dicts):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def blurb2Dict(blurb):
|
||||
"""
|
||||
Convert a str blurb to an ordered dictionary for comparison
|
||||
"""
|
||||
|
||||
dic2 = OrderedDict()
|
||||
wordlist = []
|
||||
for line in blurb.split("\n"):
|
||||
line = line.split("\t")
|
||||
word = line[0].split()
|
||||
tags = line[1:]
|
||||
|
||||
if len(word) != 0:
|
||||
dic2[word[0]] = tags
|
||||
wordlist.append(word)
|
||||
|
||||
return dic2, wordlist
|
||||
|
||||
|
||||
def memoizeWord(mallet_output):
|
||||
"""
|
||||
Update the memoize_dict with words that are recently classified by mallet
|
||||
"""
|
||||
|
||||
global memoize_dict
|
||||
|
||||
mallet_output = blurb2Dict(mallet_output)[0]
|
||||
|
||||
for word in mallet_output.keys():
|
||||
memoize_dict[word] = mallet_output[word]
|
||||
|
||||
|
||||
def mergeBlurbs(blurb, mallet_output, blurb_dict):
|
||||
"""
|
||||
Combine probabilities of words from both MALLET and dictionary outputs
|
||||
"""
|
||||
|
||||
global dict_prob_yes
|
||||
global dict_prob_no
|
||||
global verbose
|
||||
global lang1
|
||||
global lang2
|
||||
|
||||
# convert main blurb to OrderedDict
|
||||
main_dict = OrderedDict()
|
||||
wordlist_main = []
|
||||
for line in blurb.split("\n"):
|
||||
word, tag = line.split("\t")
|
||||
main_dict[word] = tag
|
||||
wordlist_main.append([word])
|
||||
|
||||
# populate dictionary based language tags with fixed probabilities for correct and incorrect
|
||||
blurb_dict = blurb_dict.replace(lang1[:2], lang1[:2].lower(
|
||||
) + "\t" + str(dict_prob_yes) + "\t" + lang2[:2].lower() + "\t" + str(dict_prob_no))
|
||||
blurb_dict = blurb_dict.replace(lang2[:2], lang2[:2].lower(
|
||||
) + "\t" + str(dict_prob_yes) + "\t" + lang1[:2].lower() + "\t" + str(dict_prob_no))
|
||||
blurb_dict, _wordlist_dict = blurb2Dict(blurb_dict)
|
||||
|
||||
# convert mallet blurb to OrderedDict only when it isn't empty
|
||||
mallet_is_empty = 1
|
||||
if mallet_output != "":
|
||||
mallet_is_empty = 0
|
||||
blurb_mallet, _wordlist_mallet = blurb2Dict(mallet_output)
|
||||
|
||||
# combining logic
|
||||
# iterate over the word list and populate probability values for tags from both dictionary and MALLET output
|
||||
for idx, word in enumerate(wordlist_main):
|
||||
current_word = word[0]
|
||||
updated_word = word
|
||||
if current_word in blurb_dict:
|
||||
updated_word.extend(blurb_dict[current_word])
|
||||
wordlist_main[idx] = updated_word
|
||||
else:
|
||||
if not mallet_is_empty:
|
||||
if current_word in blurb_mallet:
|
||||
updated_word.extend(blurb_mallet[current_word])
|
||||
wordlist_main[idx] = updated_word
|
||||
|
||||
# convert the updated blurb to str
|
||||
blurb_updated = []
|
||||
st = ""
|
||||
for word in wordlist_main:
|
||||
st = word[0]
|
||||
for tag in word[1:]:
|
||||
st = st + "\t" + str(tag)
|
||||
|
||||
st = st.strip()
|
||||
blurb_updated.append(st)
|
||||
st = ""
|
||||
|
||||
blurb_updated = "\n".join(blurb_updated)
|
||||
|
||||
if verbose != 0:
|
||||
print(blurb_updated, "\n---------------------------------\n")
|
||||
return blurb_updated
|
||||
|
||||
|
||||
def callMallet(inputText, classifier):
|
||||
"""
|
||||
Invokes the mallet classifier with input text and returns Main BLURB, MALLET OUTPUT and BLURB DICT
|
||||
"""
|
||||
|
||||
global combined_dicts
|
||||
global TMP_FILE_PATH
|
||||
global memoize_dict
|
||||
|
||||
"""
|
||||
DICTIONARY CREATION CODE
|
||||
"""
|
||||
# create a dictionary if not already created, needed when using as a library
|
||||
if len(combined_dicts) == 0:
|
||||
createDicts()
|
||||
|
||||
# split words based on whether they are already present in the dictionary
|
||||
# new words go to MALLET for generating probabilities
|
||||
fixline_mallet = list(filter(lambda x: not dictLookup(x), inputText))
|
||||
fixline_dict = list(
|
||||
filter(lambda x: (x not in fixline_mallet) or (x in memoize_dict), inputText))
|
||||
|
||||
# create str blurb for mallet and dictionary input
|
||||
blurb = '\n'.join(["%s\toth" % (v.strip()) for v in inputText])
|
||||
blurb_mallet = '\n'.join(["%s\toth" % (v.strip()) for v in fixline_mallet])
|
||||
dict_tags = list(map(lambda x: dictTagging(x, "oth"), fixline_dict))
|
||||
|
||||
# get dict_tags from words that are already classified by mallet
|
||||
for idx, word in enumerate(fixline_dict):
|
||||
if word in memoize_dict:
|
||||
dict_tags[idx] = memoize_dict[word]
|
||||
|
||||
"""
|
||||
LOGIC FOR WORDS THAT ARE PRESENT IN MULTIPLE DICTIONARIES
|
||||
"""
|
||||
fixline_mallet_corrections = []
|
||||
for t, w in zip(dict_tags, fixline_dict):
|
||||
# if, even after dictionary lookup, some words are still tagged 'oth' due to a corner case, call mallet on those words
|
||||
if t == "oth":
|
||||
fixline_mallet_corrections.append(w)
|
||||
|
||||
# update blurb_mallet
|
||||
blurb_mallet_corrections = '\n'.join(
|
||||
["%s\toth" % (v.strip()) for v in fixline_mallet_corrections])
|
||||
|
||||
# if mallet is not empty then you need to append the correction to the bottom, separated by a \n; otherwise you can just append it directly
|
||||
if blurb_mallet != "":
|
||||
blurb_mallet = blurb_mallet + "\n" + blurb_mallet_corrections
|
||||
else:
|
||||
blurb_mallet += blurb_mallet_corrections
|
||||
|
||||
# remove the words from blurb_dict
|
||||
dict_tags = filter(lambda x: x != "oth", dict_tags)
|
||||
fixline_dict = filter(
|
||||
lambda x: x not in fixline_mallet_corrections, fixline_dict)
|
||||
|
||||
blurb_dict = ""
|
||||
for word, tag in zip(fixline_dict, dict_tags):
|
||||
if not type(tag) == list:
|
||||
blurb_dict = blurb_dict + "%s\t%s" % (word.strip(), tag) + "\n"
|
||||
else:
|
||||
tmp_tags = "\t".join(tag)
|
||||
blurb_dict = blurb_dict + \
|
||||
"%s\t%s" % (word.strip(), tmp_tags) + "\n"
|
||||
|
||||
"""
|
||||
CALLING MALLET
|
||||
"""
|
||||
# this checks the case when blurb_mallet only has a \n due to words being taken into blurb_dict
|
||||
if blurb_mallet != "\n":
|
||||
# open a temp file and generate input features for mallet
|
||||
open(TMP_FILE_PATH + 'temp_testFile.txt', 'w').write(blurb_mallet)
|
||||
ef.main(TMP_FILE_PATH + 'temp_testFile.txt')
|
||||
# initialize t7 to track time taken by mallet
|
||||
t7 = time.time()
|
||||
# call mallet to get probability output
|
||||
subprocess.Popen(MALLET_PATH + " classify-file --input " + TMP_FILE_PATH + "temp_testFile.txt.features" +
|
||||
" --output " + TMP_FILE_PATH + "temp_testFile.txt.out --classifier %s" % (classifier), shell=True).wait()
|
||||
t_total = time.time()-t7
|
||||
mallet_output = open(
|
||||
TMP_FILE_PATH + 'temp_testFile.txt.out', 'r').read()
|
||||
else:
|
||||
mallet_output = ""
|
||||
|
||||
# memoize the probabilities of words already classified
|
||||
memoizeWord(mallet_output)
|
||||
|
||||
print("time for mallet classification", t_total, file=sys.stderr)
|
||||
return blurb, mallet_output, blurb_dict
|
||||
|
||||
|
||||
def genUID(results, fixline):
|
||||
"""
|
||||
ADDING UNIQUE IDS TO OUTPUT FILE AND FORMATTING
|
||||
|
||||
where:
|
||||
fixline is input text
|
||||
results is language probabilities for each word
|
||||
"""
|
||||
# NEW add unique id to results - which separator
|
||||
uniqueresults = list(range(len(results)))
|
||||
for idx in range(len(results)):
|
||||
uniqueresults[idx] = results[idx]
|
||||
uniqueresults[idx][0] = uniqueresults[idx][0]+"::{}".format(idx)
|
||||
langOut = OrderedDict()
|
||||
for v in uniqueresults:
|
||||
langOut[v[0]] = OrderedDict()
|
||||
for ii in range(1, len(v), 2):
|
||||
langOut[v[0]][v[ii]] = float(v[ii+1])
|
||||
fixmyline = fixline
|
||||
fnewlines = list(range(len(fixmyline)))
|
||||
for vvv in range(len(fixmyline)):
|
||||
fnewlines[vvv] = fixmyline[vvv]+"::{}".format(vvv)
|
||||
ffixedline = " ".join(fnewlines)
|
||||
|
||||
return ffixedline, langOut
|
||||
|
||||
|
||||
def langIdentify(inputText, classifier):
|
||||
"""
|
||||
Get language tags for sentences passed as a list
|
||||
|
||||
Input : list of sentences
|
||||
Output : list of words for each sentence with the language probabilities
|
||||
"""
|
||||
|
||||
global TMP_FILE_PATH
|
||||
|
||||
inputText = inputText.split("\n")
|
||||
outputText = []
|
||||
|
||||
"""
|
||||
CONFIG FILE CODE
|
||||
"""
|
||||
readConfig()
|
||||
|
||||
"""
|
||||
DICTIONARY CREATION CODE
|
||||
"""
|
||||
createDicts()
|
||||
|
||||
for line in inputText:
|
||||
text = re.sub(r"([\w@#\'\\\"]+)([.:,;?!]+)", r"\g<1> \g<2> ", line)
|
||||
text = text.split()
|
||||
text = [x.strip() for x in text]
|
||||
text = [x for x in text if not re.match(r"^\s*$", x)]
|
||||
"""
|
||||
CALLING MALLET CODE HERE
|
||||
"""
|
||||
blurb, mallet_output, blurb_dict = callMallet(text, classifier)
|
||||
|
||||
"""
|
||||
WRITE COMBINING LOGIC HERE
|
||||
"""
|
||||
blurb_tagged = mergeBlurbs(blurb, mallet_output, blurb_dict)
|
||||
|
||||
results = [v.split("\t") for v in blurb_tagged.split("\n")]
|
||||
# generate unique id for output sentences and format
|
||||
ffixedline, langOut = genUID(results, text)
|
||||
# get language tags using context logic from probabilities
|
||||
out = genLangTag.get_res(ffixedline, langOut)
|
||||
realOut = re.sub("::[0-9]+/", "/", out)
|
||||
# get word, label pairs in the output
|
||||
realOut = realOut.split()
|
||||
realOut = [tuple(word.split("/")) for word in realOut]
|
||||
# generate output
|
||||
outputText.append(realOut)
|
||||
|
||||
return outputText
|
||||
|
||||
|
||||
def langIdentifyFile(filename, classifier):
|
||||
"""
|
||||
Get language tags for sentences from an input file
|
||||
|
||||
Input file: tsv with sentence id in first column and sentence in second column
|
||||
Output file: tsv with word per line, sentences separated by newline
|
||||
Output of sentence id in first column and best language tag in last column
|
||||
"""
|
||||
global TMP_FILE_PATH
|
||||
|
||||
# reading the input file
|
||||
fil = codecs.open(filename, 'r', errors="ignore")
|
||||
outfil = codecs.open(filename+"_tagged", 'a',
|
||||
errors="ignore", encoding='utf-8')
|
||||
line_count = 0
|
||||
line = (fil.readline()).strip()
|
||||
|
||||
while line is not None and line != "":
|
||||
line_count += 1
|
||||
|
||||
if (line_count % 100 == 0):
|
||||
print(line_count, file=sys.stderr)
|
||||
|
||||
if not line.startswith("#"):
|
||||
# reading sentences and basic pre-processing
|
||||
lineid = "\t".join(line.split("\t")[:1])
|
||||
line = " ".join(line.split("\t")[1:])
|
||||
fline = re.sub(r"([\w@#\'\\\"]+)([.:,;?!]+)",
|
||||
r"\g<1> \g<2> ", line)
|
||||
fixline = fline.split()
|
||||
fixline = [x.strip() for x in fixline]
|
||||
fixline = [x for x in fixline if not re.match(r"^\s*$", x)]
|
||||
|
||||
"""
|
||||
CALLING MALLET CODE HERE
|
||||
"""
|
||||
blurb, mallet_output, blurb_dict = callMallet(fixline, classifier)
|
||||
|
||||
"""
|
||||
WRITE COMBINING LOGIC HERE
|
||||
"""
|
||||
blurb_tagged = mergeBlurbs(blurb, mallet_output, blurb_dict)
|
||||
|
||||
results = [v.split("\t") for v in blurb_tagged.split("\n")]
|
||||
|
||||
# generate unique id for output sentences and format
|
||||
ffixedline, langOut = genUID(results, fixline)
|
||||
|
||||
# get language tags using context logic from probabilities
|
||||
out = genLangTag.get_res(ffixedline, langOut)
|
||||
outfil.write(u"##"+lineid+u"\t"+line+u"\n")
|
||||
realout = re.sub("::[0-9]+/", "/", out)
|
||||
outfil.write(lineid+u"\t"+realout+u'\n')
|
||||
else:
|
||||
print("### skipped commented line:: " + line.encode('utf-8') + "\n")
|
||||
outfil.write("skipped line" + line.encode('utf-8') + "\n")
|
||||
line = (fil.readline()).strip()
|
||||
fil.close()
|
||||
outfil.close()
|
||||
print("written to " + filename + "_tagged")
|
||||
|
||||
|
||||
def writeMemoizeDict():
|
||||
"""
|
||||
Write the Memoization Dictionary to the disk, update it with new words if already present
|
||||
"""
|
||||
|
||||
if os.path.isfile(DICT_PATH + memoize_dict_file):
|
||||
# if file already exists, then update memoize_dict before writing
|
||||
with open(DICT_PATH + memoize_dict_file, "rb") as fp:
|
||||
memoize_file = pickle.load(fp)
|
||||
if memoize_file != memoize_dict:
|
||||
print("updating memoize dictionary")
|
||||
memoize_dict.update(memoize_file)
|
||||
# write the memoize_dict to file
|
||||
with open(DICT_PATH + memoize_dict_file, "wb") as fp:
|
||||
pickle.dump(memoize_dict, fp)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
"""
|
||||
CONFIG FILE CODE
|
||||
"""
|
||||
readConfig()
|
||||
|
||||
"""
|
||||
DICTIONARY CREATION CODE
|
||||
"""
|
||||
createDicts()
|
||||
|
||||
"""
|
||||
CLASSIFICATION CODE
|
||||
"""
|
||||
|
||||
blurb = sys.argv[1]
|
||||
print(blurb)
|
||||
print(sys.argv)
|
||||
classifier = CLASSIFIER_PATH
|
||||
mode = "file"
|
||||
|
||||
if len(sys.argv) > 2:
|
||||
mode = sys.argv[1]
|
||||
blurb = sys.argv[2]
|
||||
if len(sys.argv) > 3:
|
||||
classifier = sys.argv[3]
|
||||
if mode == "file" or mode == "f":
|
||||
# CHECK FILE EXISTS
|
||||
langIdentifyFile(blurb, classifier)
|
||||
else:
|
||||
langIdentify(blurb, classifier)
|
||||
|
||||
"""
|
||||
WRITE UPDATED MEMOIZE DICTIONARY TO DISK
|
||||
"""
|
||||
writeMemoizeDict()
|
||||
exit()
|
(Four binary image files added; not shown. Sizes: 16 KiB, 88 KiB, 27 KiB, 15 KiB.)
|
@ -0,0 +1,4 @@
|
|||
1 Yeh mera pehla sentence hai
|
||||
2 Aur ye dusra
|
||||
3 Main kya karoon
|
||||
4 This is main sentence
|
|
@ -0,0 +1,8 @@
|
|||
##1 Yeh mera pehla sentence hai
|
||||
1 Yeh/HI mera/HI pehla/HI sentence/EN hai/HI
|
||||
##2 Aur ye dusra
|
||||
2 Aur/HI ye/HI dusra/HI
|
||||
##3 Main kya karoon
|
||||
3 Main/HI kya/HI karoon/HI
|
||||
##4 This is main sentence
|
||||
4 This/EN is/EN main/EN sentence/EN
|
|
@ -0,0 +1,32 @@
|
|||
HINDI dictionary created
|
||||
ENGLISH dictionary created
|
||||
sampleinp.txt
|
||||
['getLanguage.py', 'sampleinp.txt']
|
||||
time for mallet classification 1.1737194061279297
|
||||
Yeh hi 0.999999999 en 1e-09
|
||||
mera hi 0.999999999 en 1e-09
|
||||
pehla hi 0.999999999 en 1e-09
|
||||
sentence en 0.999999999 hi 1e-09
|
||||
hai hi 0.999999999 en 1e-09
|
||||
---------------------------------
|
||||
|
||||
time for mallet classification 0.46548032760620117
|
||||
Aur hi 0.999999999 en 1e-09
|
||||
ye en 0.9276992223233895 hi 0.07230077767661053
|
||||
dusra hi 0.999999999 en 1e-09
|
||||
---------------------------------
|
||||
|
||||
time for mallet classification 0.37943124771118164
|
||||
Main en 0.9098939778371711 hi 0.09010602216282892
|
||||
kya hi 0.999999999 en 1e-09
|
||||
karoon hi 0.999999999 en 1e-09
|
||||
---------------------------------
|
||||
|
||||
time for mallet classification 0.38225436210632324
|
||||
This en 0.8779217719993668 hi 0.1220782280006331
|
||||
is en 0.9597935346160775 hi 0.040206465383922474
|
||||
main en 0.4926395772768498 hi 0.5073604227231503
|
||||
sentence en 0.999999999 hi 1e-09
|
||||
---------------------------------
|
||||
|
||||
written to sampleinp.txt_tagged
|
|
@ -0,0 +1,25 @@
1 Uhhh what ever happened to the this page is dedicated to quality confessions motto I thought stuff like this made its way to the other confessions page
2 aap manifesto speaks a lot about your vision of delhi go ahead may god bless you to fulfill peoples dream
3 greenbrier amp summers counties have just been placed under a winter storm watch for monday night through tuesday http URL
4 intermission fc took a big step towards the title with a hard fought win over fellow table toppers the real ajax on monday night
5 july th it's goin down at dar constitution hall i'm gonna make this last part rhyme mb y'all
6 Ha dharamshala me hain
7 Bhai aaap ki Sab say aache film kon si hai
8 things to know for friday the u s postal service on the brink of default on a second multibillion dollar p http URL
9 supw ki aisi ki taisi d kavi marks add toh hue nahi
10 dont be so sure for all you know his webcam may be on and you may be famous on youtube already
11 Sallu miya wonder ful
12 Toh late kahan khaya tha
13 on tuesday the theatres open at ten o'clock in the morning as lent begins after eight at night on tuesday all those who through want
14 na na tum bahut chu panti ki harkat kiye P D
15 drew peterson is no longer tune in for the lifetime original movie sunday at c on someUSER http URL
16 i should've been guarding parker on that last play he may have scored but i bet i would've been on the right side of the court
17 Akdam kick ke jesa
18 that means ur menu chart must also include salt n water hahaha
19 the ravens release their first injury report of the week on wednesday sounds like it will be loaded with names http URL
20 bj penn vs nick diaz replay on fueltv i'd pay to watch it a nd time
21 listen bhagavad gita was a conspiracy hatched by brahmins in collusion with kshatiryas to enslave women vaishya amp sudra komal
22 sari earth ke faadu sticker lagey hotey they ullu bananey ko grrrrr
23 garbage bin bhai batao na mera comment pahla hai na
24 i have a strong sense for who this guys might be and looking at some of the past comments you should be able to at least make a good guess
25 guddu ki to halat kharab hai use to ab itni garmi lag rhi hogi jitni ki delhi ki summer me lagti h p d
@ -0,0 +1,26 @@
0 Han wo bhi baat hai
1 its not unionbudget its buredinkabudget khana cmputr kapde ghar sab mehenga bewakuf bnaya sirf inhone
2 public movie review shamitabh shamitabh amitabhbachchan moviereview
3 ok
4 kachhua sir toh students ke pakke wale dushman hain
5 salmaan khan tumhare naam k pichey khan accha nahi lagtahai hata de muslaman ke naam ko jo itna hi bura kagta hai islamic dharm to abhi chod de zarurat nahi hai tumhari islaam ko
6 bhaijaan thagaye hoghaey itne saare comments dekkar bhaijaan abhi online mein hai
7 mummy jabb apne kisi dost ko nick name se bulati hai to kitta mazaa aata haii
8 bas kar rulayega kya
9 Bajrangi Bhai hamse bhi guftgu kar liya karo
10 Uhhh what ever happened to the this page is dedicated to quality confessions motto I thought stuff like this made its way to the other confessions page
11 koi mujhe bataega du colleges ke form ki cutt off kab aayegi
12 Kuch nhi launde Bas thodi bahut holi
13 commentator huge celebrations for the dismissal of suresh kohli cwc indvspak
14 Kon h Bhai
15 acha ji aisa kya bt we wer nt lyk dt knd f seniors i mean hum to aise ni the d
16 aap manifesto speaks a lot about your vision of delhi go ahead may god bless you to fulfill peoples dream
17 Maine bola h ki se le liyo
18 Aur suna
19 uske baap ko mobile ki factory tha kya
20 Swagat nahi karoge humahra
21 i remember it happen in my place also koi apni jagah se nahin hilega
22 so pls
23 I have had a couple of hot makeout sessions in a corner of the liby journal section with different guys It was weird as recently we saw another couple using our favored spot WTF seriously
24 agr tu itna handsome h to vo bechaariya milte he ignore kyu krne lgi
25 Bhai jaha fayeda ho vaha pe ladies first vala formula lagu ho jata hai p According to the time change ho jati hai ye
@ -0,0 +1,20 @@
1 Yeh mera pehla sentence hai
2 Aur ye dusra
3 Main kya karoon?
4 I love eating खाना with people
5 Mujhe sabke साथ khelna happy करते hai
6 Mujhe afaik transend rafataar maze
7 Mujhe apne manager kaafi pasand hai, I like that guy
8 Aaj ka day humesha yaad rahega humein because India won the World Cup
9 Purana photo bhejna band karo
10 Shoes pehen ke aao
11 Itna serious nhi hona tha
12 Friday fir repeat karna hoga tujhe
13 Nahi wo society wale group me aaya tha ki drainage clog ho gaya hai tower 7 ka
14 Bahar jana toh restricted hai
15 Bangalore me lockdown hai
16 Main nahi aa rha main apne cousins ke saath video call pe hoon
17 Monday bhi chutti hai ?
18 Airport pe abhi bhi foreign log aa rahe hain
19 Pata chal gaya sound kya tha
20 Tumlog apna bill verify kiya ?
@ -0,0 +1,3 @@
This oth
is oth
main oth
@ -0,0 +1,3 @@
This i:1 h:1 s:1 T:1 $T:1 hi:1 is:1 Th:1 s$:1 his:1 $Th:1 Thi:1 is$:1 This:1 his$:1 $Thi:1
is i:1 s:1 is:1 s$:1 $i:1
main a:1 i:1 m:1 n:1 ai:1 n$:1 $m:1 ma:1 in:1 mai:1 $ma:1 ain:1 in$:1 ain$:1 main:1 $mai:1
@ -0,0 +1,3 @@
This en 0.7463832689008091 hi 0.253616731099191
is en 0.9597935346160775 hi 0.040206465383922474
main en 0.4291624908780122 hi 0.5708375091219877
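These per-word probability lines (word, then alternating language code and score) are what the context logic below consumes: get_res expects a mapping from each word to its per-language probabilities. A sketch of loading such a file into that shape; load_probabilities is a hypothetical helper, not part of this commit:

from collections import OrderedDict

def load_probabilities(path):
    # "This en 0.746... hi 0.253..." -> {"This": {"en": 0.746..., "hi": 0.253...}}
    vals = OrderedDict()
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 5:
                continue
            vals[parts[0]] = {parts[1]: float(parts[2]), parts[3]: float(parts[4])}
    return vals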
@ -0,0 +1,66 @@
"""
Given a word list with language, prepare the data for input to MALLET
"""

import sys
from collections import defaultdict
import codecs


def get_ngrams(word, n):
    """
    Extracting all ngrams from a word given a value of n
    """

    if word[0] == '#':
        word = word[1:]

    if n != 1:
        word = '$'+word+'$'

    ngrams = defaultdict(int)
    for i in range(len(word)-(n-1)):
        ngrams[word[i:i+n]] += 1
    return ngrams


def main(input_file_name):
    """
    The main function
    """

    # The input file containing wordlist with language
    input_file = open(input_file_name, 'r')

    # The output file
    output_file_name = input_file_name + ".features"
    output_file = codecs.open(output_file_name, 'w', encoding='utf-8')

    # N upto which n grams have to be considered
    n = 5

    # Iterate through the input-file
    for each_line in input_file:
        fields = list(filter(None, each_line.strip().split("\t")))
        word = fields[0]
        output_file.write(word)
        output_file.write('\t')

        # Get all ngrams for the word
        for i in range(1, min(n+1, len(word) + 1)):
            ngrams = get_ngrams(word, i)
            for each_ngram in ngrams:
                output_file.write(each_ngram)
                output_file.write(':')
                output_file.write(str(ngrams[each_ngram]))
                output_file.write('\t')

        output_file.write('\n')

    input_file.close()
    output_file.close()


if __name__ == "__main__":
    input_file_name = sys.argv[1]
    main(input_file_name)
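As a worked check of the feature format shown earlier (e.g. "is i:1 s:1 is:1 s$:1 $i:1"), the same n-gram extraction can be exercised directly; get_ngrams is re-stated inline here so the snippet runs on its own:

from collections import defaultdict

def get_ngrams(word, n):
    # identical to get_ngrams in the script above
    if word[0] == '#':
        word = word[1:]
    if n != 1:
        word = '$' + word + '$'
    ngrams = defaultdict(int)
    for i in range(len(word) - (n - 1)):
        ngrams[word[i:i + n]] += 1
    return ngrams

word = "is"
features = {}
for n in range(1, min(5 + 1, len(word) + 1)):
    features.update(get_ngrams(word, n))
print(features)  # {'i': 1, 's': 1, '$i': 1, 'is': 1, 's$': 1}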
@ -0,0 +1,520 @@
"""
Context logic to generate language tags from probability values
"""

import itertools
import heapq
import re
from collections import Counter, OrderedDict
from ttp import ttp
from configparser import ConfigParser


class RunSpan(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y


runList = []


def run_compute_recur(strn, curpoint, curlist):

    runstart = curpoint

    while curpoint < len(strn)-1:
        if (strn[curpoint] != strn[curpoint+1]) or strn[curpoint] == '$':
            if (((curpoint - runstart) > 1) and (strn[curpoint] == strn[runstart] or strn[curpoint] == '$' or strn[runstart] == '$')):
                if(not(((curpoint-runstart) == 2) and runstart == 0)):
                    newrun = RunSpan(runstart, curpoint)
                    newrunlist = curlist[:]
                    newrunlist.append(newrun)
                    run_compute_recur(strn, curpoint+1, newrunlist)

        curpoint += 1

    lastrun = RunSpan(runstart, curpoint)
    if ((curpoint - runstart) > 2):
        curlist.append(lastrun)
    runList.append(curlist)


def check_skips(strn, lang):
    if lang == '0':
        alter = '1'
    else:
        alter = '0'

    i = 0

    while i < len(strn)-2:

        if strn[i] == alter:
            if (strn[i+1] == alter) and (strn[i+2] == alter):
                return False
            else:
                i += 1
        i += 1

    return True


def check_CS(strn):
    global runList
    runList = []
    run_compute_recur(strn, 0, [])

    TrueRunList = []
    TrueStrnList = []

    for runset in runList:
        check = True
        check2 = True
        check3 = True
        TrueStrn = ""
        TrueStrn2 = ""

        for run in runset:
            if check is False or (check2 is False and check3 is False):
                break

            if strn[run.y] == '0':
                tosend = strn[run.x:run.y+1]
                tosend = tosend.replace("$", "0")
                check = check_skips(tosend, '0')
                TrueStrn = TrueStrn + tosend
                TrueStrn2 = TrueStrn2 + tosend

            if strn[run.y] == '1':
                tosend = strn[run.x:run.y+1]
                tosend = tosend.replace("$", "1")
                check = check_skips(tosend, '1')
                TrueStrn = TrueStrn + tosend
                TrueStrn2 = TrueStrn2 + tosend

            if strn[run.y] == '$':
                if strn[run.x] == '0':
                    tosend = strn[run.x:run.y+1]
                    tosend = tosend.replace("$", "0")
                    check = check_skips(tosend, '0')
                    TrueStrn = TrueStrn + tosend
                    TrueStrn2 = TrueStrn2 + tosend

                if strn[run.x] == '1':
                    tosend = strn[run.x:run.y+1]
                    tosend = tosend.replace("$", "1")
                    check = check_skips(tosend, '1')
                    TrueStrn = TrueStrn + tosend
                    TrueStrn2 = TrueStrn2 + tosend

                if strn[run.x] == '$':
                    tosend = strn[run.x:run.y+1]
                    tosend = tosend.replace("$", "1")
                    check3 = check3 and check_skips(tosend, '1')
                    if check3:
                        TrueStrn = TrueStrn + tosend

                    tosend = strn[run.x:run.y+1]
                    tosend = tosend.replace("$", "0")
                    check2 = check2 and check_skips(tosend, '0')
                    if check2:
                        TrueStrn2 = TrueStrn2 + tosend

        if check is True:
            if check3 is True and TrueStrn != "":
                TrueRunList.append(runset)
                TrueStrnList.append(TrueStrn)

            if check2 is True and TrueStrn2 != "":
                TrueRunList.append(runset)
                TrueStrnList.append(TrueStrn2)

    i = 0
    Purities = {}

    while i < len(TrueRunList):
        purity = 0.0

        for run in TrueRunList[i]:
            counter = Counter(TrueStrnList[i][run.x:run.y+1]).most_common(1)[0]
            if counter[0] == '0':
                purity += float(float(counter[1])/float(len(TrueStrnList[i])))
            else:
                purity += float(float(counter[1])/float(len(TrueStrnList[i])))

        purity /= (float(len(TrueRunList[i]))*float(len(TrueRunList[i])))

        Purities[i] = purity

        i += 1

    if Purities:
        for item in dict_nlargest(Purities, 1):
            return (TrueRunList[item], TrueStrnList[item], Purities[item])

    else:
        return ("", "", -1)


def compute_tag(m, strn):
    global lang1
    global lang2

    strn = strn.strip()
    origstrn = strn

    encount = 0.0
    hicount = 0.0
    for ch in strn:
        if (ch == '0'):
            encount += 1
        if (ch == '1'):
            hicount += 1
        if (ch == '$'):
            hicount += 1
            encount += 1

    if (encount/len(strn)) > 0.7:
        return lang2, "-1"
    if (hicount/len(strn)) > 0.8:
        return lang1, "-1"
    else:

        count = 0
        for x, _y in m.items():
            if len(x) < 4:
                strn = strn[:count+1] + "$" + strn[count+2:]
            count += 1

        a, b, c = check_CS(strn)
        if(c == -1 or len(a) < 2):
            CMrun = b[1:-1]

            encount1 = 0
            hicount1 = 0
            for ch in CMrun:
                if (ch == '0'):
                    encount1 += 1
                if (ch == '1'):
                    hicount1 += 1
                if (ch == '$'):
                    hicount1 += 1
                    encount1 += 1

            if(encount1 > hicount1):
                return "Code mixed" + " " + lang2, "-1"
            elif (encount1 < hicount1):
                return "Code mixed" + " " + lang1, "-1"
            else:
                return "Code mixed Equal", "-1"

        else:
            strncs = ""

            for i in a:
                if i.x > 0:
                    if (strn[i.x] == '$') and (origstrn[i.x] != b[i.x]):
                        b = b[:i.x] + origstrn[i.x] + b[i.x+1:]
                        i.x += 1
                if i.y < len(b)-1:
                    if (strn[i.y] == '$') and (origstrn[i.y] != b[i.y]):
                        b = b[:i.y] + origstrn[i.y] + b[i.y+1:]
                        i.y -= 1

            for i in a:
                strncs += b[i.x:i.y+1] + "|"

            return "Code switched", strncs[:-1]


def dict_nlargest(d, n):
    return heapq.nlargest(n, d, key=lambda k: d[k])


def get_res(orig, vals):

    global lang1
    global lang1_code
    global lang2
    global lang2_code

    # read config
    config = ConfigParser()
    config.read("config.ini")
    # use config to extract language names
    config_gen = config["GENERAL"]
    # get language names by default language 1 is HINDI and language 2 is ENGLISH
    lang1 = config_gen["language_1"].upper() if config_gen["language_1"] else "HINDI"
    lang2 = config_gen["language_2"].upper() if config_gen["language_2"] else "ENGLISH"

    lang1_code = lang1.lower()[:2]
    lang2_code = lang2.lower()[:2]

    prs = ttp.Parser()

    count = 0
    mult = 0.95
    topx = 32
    dic = OrderedDict()
    origdic = OrderedDict()
    dic[1] = vals
    origdic[1] = orig

    initlist = [u"".join(seq) for seq in itertools.product("01", repeat=5)]

    alreadyExistingTweets = {}
    processedTweets = []
    finalTweetDict = OrderedDict()

    for b, c in dic.items():
        if b:
            v = OrderedDict()
            origstr = u""
            for m, _n in c.items():
                origstr = origstr + u" " + m
            # SH had to comment out following line bc of errors:
            # ne_removed = origstr.encode('ascii', 'ignore') # why
            ne_removed = origstr

            ne_removed = u' '.join(ne_removed.split())

            total_length = 0.0
            max_length = 0

            for word in ne_removed.split():
                if True:  # SH changed
                    total_length += len(word)
                    if len(word) > max_length:
                        max_length = len(word)
                    v[word] = c[word]

            initdic = {}
            curlen = 5

            for item in initlist:
                initdic[item] = 1
            for i, _j in initdic.items():
                count = 0
                for x, y in v.items():
                    if i[count] == '0':
                        initdic[i] = initdic[i]*y[lang2_code]
                    else:
                        initdic[i] = initdic[i]*y[lang1_code]

                    if count > 0 and i[count-1] != i[count]:
                        initdic[i] = initdic[i]*(1-mult)
                    else:
                        initdic[i] = initdic[i]*mult
                    count += 1

                    if count == curlen:
                        break

            top32 = initdic
            curlen = count
            wordcount = 0

            if curlen < 5:
                newdic = {}
                for x, y in top32.items():
                    newdic[x[:curlen]] = y
                top32.clear()
                top32 = newdic

            strn = u""

            for x, y in v.items():
                strn = strn + u" " + x
                if wordcount < 5:
                    wordcount += 1
                else:
                    curlen += 1
                    newdic = {}
                    for k, p in top32.items():
                        newstr = k + '0'
                        newdic[newstr] = p * y[lang2_code]

                        if newstr[curlen-1] == '0':
                            newdic[newstr] = newdic[newstr]*mult
                        else:
                            newdic[newstr] = newdic[newstr]*(1-mult)

                        newstr = k + '1'
                        newdic[newstr] = p * y[lang1_code]

                        if newstr[curlen-1] == '1':
                            newdic[newstr] = newdic[newstr]*mult
                        else:
                            newdic[newstr] = newdic[newstr]*(1-mult)

                    top32.clear()
                    for x in dict_nlargest(newdic, topx):
                        top32[x] = newdic[x]

            for item in dict_nlargest(top32, 1):

                if len(item) > 0:
                    curOrigTweet = origdic[b]
                    superOrig = curOrigTweet

                    curOrigTweet = re.sub(r"\s+", ' ', curOrigTweet.strip())
                    tweetparse = prs.parse(curOrigTweet)
                    tweetUrls = []
                    for url in tweetparse.urls:
                        for _e in range(0, (len(re.findall(url, curOrigTweet)))):
                            tweetUrls.append(url)
                        if (len(re.findall(url, curOrigTweet))) == 0:
                            tweetUrls.append(url)

                        curOrigTweet = curOrigTweet.replace(url, " THIS_IS_MY_URL ")

                    # SH NEW new keep punctuation
                    curOrigTweet = curOrigTweet.replace("#", " #")
                    curOrigTweet = curOrigTweet.replace("@", " @")
                    curOrigTweet = re.sub(r"\s+", ' ', curOrigTweet.strip())

                    tweetdic = OrderedDict()

                    splitOrig = curOrigTweet.split(' ')
                    splitHT = origstr.strip().split(' ')

                    urlcount = 0
                    for word in splitOrig:
                        if word in splitHT:
                            tweetdic[word] = "OK"
                        elif '_' in word:
                            if word == "THIS_IS_MY_URL":

                                tweetdic[tweetUrls[urlcount]] = "OTHER"
                                urlcount += 1
                            elif ((word[1:] in tweetparse.tags) or (word[1:] in tweetparse.users)):
                                tweetdic[word] = "OTHER"
                            else:
                                splt = word.split('_')
                                for wd in splt:
                                    if wd in splitHT:
                                        tweetdic[wd] = "OK"
                                    else:
                                        tweetdic[wd] = "OTHER"

                        else:
                            tweetdic[word] = "OTHER"

                    splitNE = strn.strip().split(u' ')
                    newNE = []
                    newItem = ""

                    for word, tag in tweetdic.items():
                        if tag == "OK":
                            if word in splitNE:
                                reqindex = splitNE.index(word)
                                wordlist = []
                                wordlist2 = []

                                if word in wordlist:
                                    tweetdic[word] = lang1_code.upper()
                                    newItem += "1"
                                elif word in wordlist2:
                                    tweetdic[word] = lang2_code.upper()
                                    newItem += "0"
                                # new next condition:
                                elif re.match(r"\W+", word):
                                    print(word)
                                    tweetdic[word] = "OTHER"
                                elif item[reqindex] == '0':
                                    tweetdic[word] = lang2_code.upper()
                                    newItem += "0"
                                else:
                                    tweetdic[word] = lang1_code.upper()
                                    newItem += "1"
                                newNE.append(word)
                            else:
                                tweetdic[word] = "OTHER"

                    newStrn = " ".join(newNE)

                    if len(newStrn) > 0:
                        newV = OrderedDict()

                        for q, r in v.items():
                            if q in newNE:
                                newV[q] = r

                        tweettag, runs = compute_tag(newV, '$' + newItem + '$')

                        if runs != "-1":
                            runSplit = runs.split('|')
                            runs = runs[1:-1]

                        wordDict = OrderedDict()
                        runcount = 0
                        ind = 0

                        for word in tweetdic:
                            wordlabel = OrderedDict()
                            wordlabel["Label"] = tweetdic[word]

                            if tweettag == "Code mixed" + " " + lang2 or tweettag == lang2:
                                wordlabel["Matrix"] = lang2_code.upper()
                            elif tweettag == "Code mixed" + " " + lang1 or tweettag == lang1:
                                wordlabel["Matrix"] = lang1_code.upper()
                            elif tweettag == "Code mixed Equal" or tweettag == lang1:
                                wordlabel["Matrix"] = "X"
                            else:
                                if (tweetdic[word] == lang2_code.upper() or tweetdic[word] == lang1_code.upper()):
                                    ind += 1
                                    if (runs[ind-1] == '|'):
                                        runcount += 1

                                    if runSplit[runcount][0] == "0":
                                        wordlabel["Matrix"] = lang2_code.upper()
                                    else:
                                        wordlabel["Matrix"] = lang1_code.upper()

                            wordDict[word] = wordlabel

                        sansRT = superOrig.replace("rt", "").strip()

                        if not(newStrn in alreadyExistingTweets) and not(sansRT in alreadyExistingTweets):

                            wholeTweetDict = OrderedDict()
                            wholeTweetDict["Tweet"] = superOrig

                            wholeTweetDict["Tweet-tag"] = tweettag
                            wholeTweetDict["Word-level"] = wordDict
                            wholeTweetDict["Twitter-tag"] = "None"

                            finalTweetDict[b] = wholeTweetDict

                            alreadyExistingTweets[sansRT] = "yes"
                            alreadyExistingTweets[newStrn] = "yes"

                            processedTweets.append(b)
                        else:
                            processedTweets.append(b)

    check_tw = {}
    for t, u in origdic.items():
        if t in finalTweetDict or t in processedTweets or u in check_tw:
            continue
        else:
            wholeTweetDict = OrderedDict()
            wholeTweetDict["Tweet"] = u
            wholeTweetDict["Tweet-tag"] = "Other_Noise"
            wholeTweetDict["Word-level"] = {}
            wholeTweetDict["Twitter-tag"] = "Noise"
            check_tw[u] = True
            finalTweetDict[t] = wholeTweetDict

    final_output = ""

    for x, y in finalTweetDict.items():
        for k, v in y['Word-level'].items():
            final_output += k + '/' + v['Label'] + ' '

    return final_output
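Putting the pieces together, get_res takes the original sentence and the per-word probability mapping and returns the space-separated word/LABEL string that ends up in the *_tagged output. A usage sketch under stated assumptions: get_res as defined above is in scope (e.g. the snippet is appended to the same file), a config.ini with a GENERAL section exists in the working directory, and the ttp package is installed:

from collections import OrderedDict

# Probabilities as in the three-word sample shown earlier in this commit.
vals = OrderedDict([
    ("This", {"en": 0.7463832689008091, "hi": 0.253616731099191}),
    ("is", {"en": 0.9597935346160775, "hi": 0.040206465383922474}),
    ("main", {"en": 0.4291624908780122, "hi": 0.5708375091219877}),
])

# Expected to yield something like "This/EN is/EN main/EN "
print(get_res("This is main", vals))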