Added code of the LID-tool project.

Mohd Sanad Zaki Rizvi 2020-07-27 17:48:40 +05:30, committed by sanad
Parent 23d014e839
Commit 39a0b8f82b
28 changed files with 122007 additions and 130 deletions

131
.gitignore vendored

@@ -1,129 +1,2 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
*.pyc
*Zone.Identifier

216
README.md

@@ -1,5 +1,219 @@
# Language Identification (LID) for Code-Mixed text
---
# Contributing
This is a word-level language identification tool for Code-Mixed text of languages (such as Hindi) written in the Roman script and mixed with English. At a broad level, we use an ML classifier trained with MALLET to generate word-level probabilities for language tags. We then combine these probabilities with the context information of the surrounding words to generate a language tag for each word of the input. We also use hand-crafted dictionaries as look-up tables to cover unique, corner and conflicting cases, giving a robust language identification tool.
**Note:**
- Please read the [Papers](#Papers) section to understand the theory and experiments surrounding this project.
- The trained ML classifier model and dictionaries that are shipped by default with this project are specifically for `Hindi-English` Code-Mixed text.
- You can extend this project to `any language pair`. More information in [Train your Custom LID](#train-your-custom-lid).
# Project Structure
---
The project has the following structure:
```
LID-tool/
├── README.md
├── classifiers/
├── config.ini
├── dictionaries/
├── getLanguage.py
├── sampleinp.txt
├── sampleinp.txt_tagged
├── sampleoutp.txt
├── tests/
├── tmp/
└── utils/
```
Here is more info about each component:
- **classifiers/** - contains classifiers that are trained using MALLET. For now, we have a single classifier "HiEn.classifier".
- **config.ini** - config file for the project. You can learn more about the config file in the [Working with the Config file to create dictionary and control other aspects of the project](Train_Custom_LID.md#working-with-the-config-file-to-create-dictionary-and-control-other-aspects-of-the-project).
- **dictionaries/** - contains various Hindi and English dictionaries used in the project.
- **getLanguage.py** - main file of the project, contains code for classifying input text into language tags.
- **sample\* files** - contain the sample input, its tagged version and the console output of the LID.
- **tests/** - contains validation sample sets from FIRE shared tasks, good for validating performance of the LID.
- **tmp/** - temporary folder for holding intermediate MALLET files.
- **utils/** - contains utility code for the LID, such as feature extraction.
# Papers
- [Query word labeling and Back Transliteration for Indian Languages: Shared task system description - Spandana Gella et al.](https://www.isical.ac.in/~fire/wn/STTS/2013_translit_search-gella-msri.pdf)
- [Testing the Limits of Word level Language Identification](https://www.aclweb.org/anthology/W14-5151.pdf)
# Installation
---
The installation of the tool is pretty straightforward as most of it is plug-n-play. Once you get all the required dependencies you are good to go.
Once you clone the repository on your local system, you can start installing the dependencies one by one.
## Dependencies
1. **Java** - This project uses MALLET (written in Java) to train/run classifiers, hence you need Java on your system. You can get a JRE from here:
```
https://www.oracle.com/java/technologies/javase-jre8-downloads.html
```
2. **Python 3** - This project is written in Python 3, so make sure it is installed before running the LID. You can get Python 3 from Miniconda:
```
https://docs.conda.io/en/latest/miniconda.html
```
3. **MALLET** - You can download mallet binaries from the [mallet download page](http://mallet.cs.umass.edu/download.php) and follow the installation instructions given there.
4. **Twitter Text Python (TTP)** - You can simply install it using pip:
```
pip install twitter-text-python
```
## Setup
**1. Linux Installation:**
The major setup step required is to give executable rights to the mallet binary. You can do so with the following command:
```
chmod +x mallet-2.0.8/bin/mallet
```
**2. Windows Installation:**
The major setup step on Windows is to make sure that you have set the correct environment variables for Python and Java. To check whether the environment variables are set correctly, verify that both are accessible from the command prompt.
You also have to make sure that you have set the environment variable `MALLET_HOME` to the LID project's mallet folder. You can do so by opening a command prompt and typing:
```
set MALLET_HOME=\path to your LID directory\mallet-2.0.8\
```
Once you are done with the above steps, you can start using the LID.
# Usage
---
## I. Getting Inference on a text input
## a. Using LID in File Mode
You can simply execute getLanguage.py with the input file containing text data to be classified.
**Usage:**
```
python getLanguage.py <input_file>
```
**Example:**
```
python getLanguage.py sampleinp.txt
```
Output is written to `<input_file>_tagged`, here `sampleinp.txt_tagged`.
**Things to Note:**
1. Input file should contain lines in the following format:
```
<sentenceId>TAB<sentence>
```
**Example:**
```
1 Yeh mera pehla sentence hai
```
See sampleinp.txt for an example.
2. Make sure that there are no empty lines or lines with no text in the input file, as the LID might throw an error.
## b. Using LID in Library Mode
You can also use this LID as a library that can be imported into your own code.
Simply write the following lines of code:
```python
from getLanguage import langIdentify
# inputText is a list of input sentences
# classifier is the name of the mallet classifier to be used
langIdentify(inputText, classifier)
```
The input will be a list of sentences to be language tagged, like this:
![](images/langIdentify_input.PNG)
The output will be a list of language tagged input sentences in such a way that each (word, tag) is a tuple pair:
![](images/langIdentify_output.PNG)
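In case the screenshots above do not render for you, here is a minimal, illustrative call. The sentences and expected tags are taken from the shipped sample files; note that the current `langIdentify()` implementation splits its input on newlines, so a newline-joined string of sentences works as input:
```python
from getLanguage import langIdentify

# Sentences from sampleinp.txt, joined by newlines
inputText = "Yeh mera pehla sentence hai\nAur ye dusra"

# HiEn.classifier is the classifier shipped with the project
tagged = langIdentify(inputText, "classifiers/HiEn.classifier")

# Expected shape of the result: one list per sentence, each entry a (word, tag) tuple, e.g.
# [[('Yeh', 'HI'), ('mera', 'HI'), ('pehla', 'HI'), ('sentence', 'EN'), ('hai', 'HI')],
#  [('Aur', 'HI'), ('ye', 'HI'), ('dusra', 'HI')]]
```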
## II. Training your own MALLET classifier
Currently the project ships with a classifier for the Hindi-English pair by default, but you can also train a classifier for your own language pair.
Refer to the attached research papers to understand the methodology and the training parameters.
You can use the MALLET documentation on how to use its API for training a new classifier: http://mallet.cs.umass.edu/classification.php
More information in [Train your Custom LID](#train-your-custom-lid).
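As a rough, hedged sketch only (the feature file, its labels and the choice of `MaxEnt` as trainer are assumptions, not something this repository ships), training could mirror the way `getLanguage.py` already invokes the MALLET binary through `subprocess`:
```python
import subprocess

# Assumed location of the MALLET binary (see MALLET_PATH in config.ini)
MALLET = "mallet-2.0.8/bin/mallet"

# train_features.txt is assumed to contain one word per line with a gold language
# label and the same character n-gram features that utils/extractFeatures.py produces
subprocess.run(MALLET + " import-file --input train_features.txt --output train.mallet",
               shell=True, check=True)

# Train a classifier and store it where getLanguage.py expects classifiers to live
subprocess.run(MALLET + " train-classifier --input train.mallet"
               " --output-classifier classifiers/MyPair.classifier --trainer MaxEnt",
               shell=True, check=True)
```
You can then point `CLASSIFIER_PATH` in `config.ini` at the new file so the LID picks it up.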
## III. Testing a new classifier
We have collated a set of datasets that can be used as a validation set in case you want to test a new version of the classifier or any changes to the LID itself. Currently, they are only for the Code-Mixed Hindi-English pair:
1. tests/Adversarial_FIRE_2015_Sentiment_Analysis_25.txt - 25 hardest input sentences from [FIRE 2015 Sentiment Analysis task](http://amitavadas.com/SAIL/data.html).
2. tests/FIRE_2015_Sentiment_Analysis_25.txt - first 25 sentences from the FIRE 2015 Sentiment Analysis task.
3. tests/test_sample_20.txt - 20 manually written code-mixed sentences.
You can use a test set by simply giving it as a parameter to getLanguage.py:
```
python getLanguage.py tests/<test_name>
```
For example:
```
python getLanguage.py tests/test_sample_20.txt
```
The above command will execute your updated/new LID code on 20 manually crafted code-mixed sentences.
**Larger Datasets**
If you want to test your LID on larger datasets, then you can look at these two FIRE tasks:
1. [FIRE 2013 LID task](https://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/fire_data.html) - The original dataset of 500 sentences for which this LID was built.
2. [FIRE 2015 Sentiment Analysis task](http://amitavadas.com/SAIL/data.html) - 12,000+ language tagged sentences.
# Train your Custom LID
---
There are a couple of changes and preliminary steps you need to train your own custom LID. You can follow the documentation page on [Train your Custom LID](Train_Custom_LID.md) for more information.
# Attribution
---
These are the open-source projects that this LID uses:
1. [MALLET: A Machine Learning for Language Toolkit. McCallum, Andrew Kachites.](http://mallet.cs.umass.edu/about.php)
2. [Twitter-Text-Python (TTP) by Edmond Burnett](https://github.com/edmondburnett/twitter-text-python)
Apart from the above set of projects, we also use free and openly hosted dictionaries to improve the LID; you can learn more about them in [Train your Custom LID](Train_Custom_LID.md).
# Contributors
---
In order of recency, most recent first:
1. Mohd Sanad Zaki Rizvi
2. Anirudh Srinivasan
3. Sebastin Santy
4. Anshul Bawa
5. Silvana Hartmann
6. Spandana Gella
# Contributing to this Code
---
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us

97
Train_Custom_LID.md Normal file

@@ -0,0 +1,97 @@
# Information Flow of LID
The following is a very high level approximation of how information flows in the LID:
![](images/info_flow_new_lid.PNG)
Let's understand what is happening here:
- Whenever you provide input to `getLanguage.py`, one of the `langIdentify()` or `langIdentifyFile()` functions is triggered, depending on whether you are using the LID in library mode or file mode.
- Either of these functions takes your input sentences and splits them into words. Then, on each sentence, we invoke `callMallet()`. This function essentially has two tasks.
  - The first task is to check which words are already present in the language dictionaries that we have; we can directly assign language tags to such words using our logic.
  - The words that aren't present in the dictionaries are split away and sent to the ML classifier that's trained using MALLET.
  - The second task is to call the ML classifier on such words, but before that we invoke `extractFeatures.py` to generate `n-gram` features for the ML model. Once the features are generated and the ML classifier is called, it gives out a probability for each word for both languages.
- Sometimes a set of words that are present in the dictionary are not tagged due to some corner cases (discussed in the next section). In such cases, we `re-invoke the ML classifier` on these words to get language probabilities.
- Finally, we combine the probability values of both sets of words: the ones that were processed by the ML classifier and the ones that were tagged using dictionaries (we use custom probability values of `1E-9` and `0.999999999` for the wrong and right language tags respectively for each dictionary-tagged word; you can change these values using the `config` file).
- Now that the input words are combined once again, their probability values are fed into the context-based tagging logic (explained in the research paper). This algorithm gives out the final language tags.
This was a quick walkthrough of the code; you can read the research papers for more information about the algorithm:
- [Query word labeling and Back Transliteration for Indian Languages: Shared task system description - Spandana Gella et al.](https://www.isical.ac.in/~fire/wn/STTS/2013_translit_search-gella-msri.pdf)
- [Testing the Limits of Word level Language Identification](https://www.aclweb.org/anthology/W14-5151.pdf)
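For readers who prefer code to diagrams, the library-mode path boils down to roughly the following compressed sketch; it uses the real function names from `getLanguage.py` (the example sentence is illustrative, and the full logic, including output formatting, lives in `langIdentify()`):
```python
from getLanguage import readConfig, createDicts, callMallet, mergeBlurbs

readConfig()    # load paths, language names and dictionary probabilities from config.ini
createDicts()   # build the language dictionaries from the files in dictionaries/

words = ["Bhai", "to", "kabhi", "nahi", "sudhrega"]

# 1. dictionary look-up, n-gram feature extraction and a MALLET call for unknown words
blurb, mallet_output, blurb_dict = callMallet(words, "classifiers/HiEn.classifier")

# 2. merge MALLET probabilities with the fixed dictionary probabilities
#    (dict_prob_yes / dict_prob_no from config.ini)
blurb_tagged = mergeBlurbs(blurb, mallet_output, blurb_dict)

# 3. the per-word probabilities are then fed to the context-based tagging logic
#    in utils/generateLanguageTags.py, which produces the final word/TAG output
```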
# About Dictionaries and How to create your own?
## I. Brief description of each dictionary currently used in the project:
The project currently uses 4 manually crafted dictionaries for English and 2 for Hindi words. Here is how they have been sourced:
1. dictionaries/dict1bigr.txt – Uses a sample of English 2-grams from [Google's Ngram viewer]().
2. dictionaries/dict1coca.txt – Uses a sample of English words from [Corpus of Contemporary American English]().
3. dictionaries/dict1goog10k.txt – Uses a sample of 10000 most frequent words from [Google's Trillion Word Corpus]().
4. dictionaries/dict1hi.txt – Semi-automatically selected set of common Hindi words in the roman script.
5. dictionaries/dict1hinmov.txt – Same as 4 but has common Hindi words prevalent in movie scripts.
6. dictionaries/dict1text.txt – Manually curated list of slang and commonly used internet short-hands of English.
The above set of dictionaries is combined in different combinations to form the language dictionaries for Hindi and English. Here is an overview:
![](images/dictionary_structure.PNG)
You can choose which files are combined, and in what order, for each dictionary using the `[DICTIONARY HIERARCHY]` section of the `config.ini` file. Check [Working with the Config file to create dictionary and control other aspects of the project](#working-with-the-config-file-to-create-dictionary-and-control-other-aspects-of-the-project) for more info.
## II. Reason for using Dictionaries and How to create your own custom dictionaries for a new use-case or language pairs?
The addition of dictionaries in the project was an engineering decision that was taken after considering the empirical results, which showed that the dictionaries complemented the performance of the ML-based classifier (MALLET) for certain corner-cases.
Here are some of the problems that this method solved:
### 1. Dealing with “common words” that can belong to either of the languages.
For example, the English word `“to”` is one of the ways in which the Hindi word `“तो”` or `“तू”` is spelt when written in the `Roman script`, so the word “to” will be classified differently in the following two sentences:
**Input:** I have to get back to my advisor
**Output:** I/EN have/EN ***to/EN*** get/EN back/EN to/EN my/EN advisor/EN
**Input:** Bhai to kabhi nahi sudhrega
**Output:** Bhai/HI ***to/HI*** kabhi/HI nahi/HI sudhrega/HI
In this case, we make sure that the word “to” is present in both dictionaries, and the LID then relies more on the combination of ML probabilities and the context (surrounding words) to tag the language.
### 2. Words that surely belong to only one language.
For example, words like “bhai”, “nahi”, “kabhi” in Hindi and words like “advisor”, “get” etc. in English. In this case, we utilize the relevant dictionary to force tag it to the correct language even if the ML classifier says otherwise.
So the questions that you have to ask yourself while creating the dictionaries are:
1. “Are there certain words that can be spelt the same way in both the languages?” And,
2. “Are there common words in one language that surely can't be used in the other language?” 
These are just a couple of things that we looked at while building this tool, but given your specific use-case you can consider more such engineering cases and customize the dictionaries accordingly.
# Working with the Config file to create dictionary and control other aspects of the project
The [config.ini](config.ini) file is the central command center of the project. The fields are self-explanatory and commented so that you have an idea of what each field does.
All the information required to run the project is picked from this file.
Some of the important things for which you can use config file are:
1. You can give the paths to the different folders of the project, such as the folder that contains all the dictionaries or the location of Mallet's binaries on your system.
2. You can also give the names of the language pair for which you are training your LID; this information is used everywhere internally in the project.
3. You can also specify the names of the dictionaries for each language, and how the different files are combined, and in what order, to create each dictionary. See the existing `config.ini` file for an example, and the excerpt after this list.
4. Finally, you can provide the custom probability values to be used for dictionary tagged words in the project.
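For instance, a hedged excerpt of what the relevant sections could look like for a hypothetical SPANISH-ENGLISH pair is shown below (the Spanish dictionary file names are made up purely for illustration; all other sections stay as in the shipped `config.ini`):
```
[GENERAL]
verbose = 0
language_1 = SPANISH
language_2 = ENGLISH

[DICTIONARY NAMES]
language_1_dicts = spadict1
language_2_dicts = eng0dict1, eng1dict1

[DICTIONARY HIERARCHY]
spadict1 = dict1spa.txt, dict1spa_slang.txt
eng0dict1 = dict1goog10k.txt, dict1coca.txt
eng1dict1 = dict1bigr.txt, dict1text.txt
```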
# Specific areas in code that need to be changed when creating your own LID
Here are a couple of changes that you'd need to make in case you are creating a new LID for a different language pair:
1. First of all, use the `config.ini` file to override the default values for language names, the structure and names of dictionaries etc.
2. You will have to re-write the custom logic of the `dictTagging()` function in order to utilize your new dictionaries. Have a look at our logic to understand how, using a simple set of conditionals, we choose which language tag is to be allotted; a hedged sketch is given below.
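The following is a minimal sketch, not the shipped implementation, of what a re-written `dictTagging()` could look like for the hypothetical SPANISH-ENGLISH pair used above (the dictionary name `spadict1` and the fallback behaviour are assumptions; the real Hindi-English version in `getLanguage.py` also handles multi-dictionary corner cases):
```python
def dictTagging(word, tag):
    """
    Hypothetical dictionary tagging for a SPANISH-ENGLISH pair,
    mirroring the structure of dictTagging() in getLanguage.py.
    """
    in_spa = word.lower() in language_1_dicts["spadict1"]   # Spanish look-up
    in_eng = word.lower() in language_2_dicts["eng0dict1"]  # English look-up

    if in_spa and not in_eng:
        tag = lang1[:2]   # only in the Spanish dictionary -> "SP"
    elif in_eng and not in_spa:
        tag = lang2[:2]   # only in the English dictionary -> "EN"
    # in both or in neither: keep the incoming tag ("oth") so that the
    # MALLET probabilities and the context logic decide
    return tag
```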

Binary data
classifiers/HiEn.classifier Normal file

Binary file not shown.

39
config.ini Normal file

@@ -0,0 +1,39 @@
[GENERAL]
# if verbose is 1 then display language probabilities for each word; by default it is on or set to 1; set to 0 to turn off.
verbose =
# default: HINDI
language_1 =
# default: ENGLISH
language_2 =
[DEFAULT PATHS]
# Path to the classifiers folder, default: os.path.join(os.getcwd(), 'classifiers', 'HiEn.classifier')
CLASSIFIER_PATH =
# Path to the temporary folder, os.path.join(os.getcwd(), 'tmp', '')
TMP_FILE_PATH =
# Path to the dictionary folder, default: os.path.join(os.getcwd(), 'dictionaries', '')
DICT_PATH =
# Path to the mallet binary folder, default: os.path.join(os.getcwd(), 'mallet-2.0.8', 'bin', 'mallet')
MALLET_PATH =
[DICTIONARY PROBABILITY VALUES]
# initialize probability values for the correct and incorrect language
# default: 0.999999999
dict_prob_yes =
# default: 1E-9
dict_prob_no =
[DICTIONARY NAMES]
# dictionary used to store already classified words between runs and in-memory
# default: memoize_dict.pkl
memoize_dict_file =
# name/number of dictionaries per language
language_1_dicts = hindict1
language_2_dicts = eng0dict1, eng1dict1
[DICTIONARY HIERARCHY]
# which files are combined to form which dictionary
eng0dict1 = dict1goog10k.txt, dict1coca.txt
eng1dict1 = dict1bigr.txt, dict1text.txt
hindict1 = dict1hinmov.txt, dict1hi.txt

68784
dictionaries/dict1bigr.txt Normal file

File diff not shown because it is too large.

13574
dictionaries/dict1coca.txt Normal file

File diff not shown because it is too large.

dictionaries/dict1goog10k.txt Normal file

File diff not shown because it is too large.

27242
dictionaries/dict1hi.txt Normal file

File diff not shown because it is too large.

447
dictionaries/dict1hinmov.txt Normal file

@@ -0,0 +1,447 @@
hai
ke
ye
mein
se
ki
hain
toh
nahi
ho
nahin
aur
kya
ko
kar
ka
na
ek
tha
yeh
raha
hoon
tum
hum
pe
sab
mujhe
liye
baat
koi
gaya
rahe
tu
kuch
ne
jo
ab
ye
phir
haan
woh
mere
pata
aa
saath
par
thi
le
hua
kiya
bahut
maine
bhai
gaye
de
yahan
meri
ji
apne
diya
mera
hoga
abhi
bas
din
theek
kaam
bol
hun
log
h
mat
kaise
naa
lekin
tak
kyon
kahan
jab
kisi
chal
jaa
baad
dekh
aaj
sakta
paas
chahiye
lo
agar
apni
kyun
kaun
kuchh
karo
karte
paise
wo
beta
aapne
kal
pehle
tere
aise
hai-
voh
aapke
yaar
sirf
k
haath
ja
tujhe
vah
arre
raat
uske
saal
dono
bola
laga
andar
lagta
naam
apna
hee
liya
aaya
khud
lag
tumhe
jaise
aata
har
aisa
tarah
matlab
mil
achha
sahab
itna
maar
bahar
teri
shahid
jao
usse
di
tumhare
logon
kab
iss
samajh
arjun
thoda
tumne
ladki
pahle
uski
kaha
aapki
jaan
tera
itni
kahin
chhod
bata
uss
rahul
sach
bhee
hamare
badi
mai
teen
kam
aap
chup
bauji
waqt
maa
tumhara
baap
bada
humko
dost
saab
kitna
yaad
wahan
wali
hui
unke
subah
sahi
jaao
wala
soch
uska
aadmi
bolo
jaldi
sa
taraf
usko
denge
hue
shuru
kah
bilkul
dena
unko
itne
saare
aage
tab
lena
jee
bana
kee
neeche
yahi
aao
humne
keh
hone
karenge
beti
der
bade
maan
poora
si
arey
paanch
poori
ladke
bina
prem
wapas
yahin
sau
waise
galat
khush
hote
akele
paisa
aane
koshish
usne
socha
humein
isliye
ali
babu
karega
jhoot
dete
arrey
baje
sabse
dil
rohan
thay
baith
upar
mar
rani
unka
lete
duniya
tune
zara
ma
leke
rajvir
beech
minal
tumse
rahey
kitni
kitne
unhe
omi
suna
saaf
hona
hamara
ladka
neerja
aisi
hamesha
nawab
kyonki
yehi
ammi
alag
hahn
dadu
kaisa
usey
khempal
cheez
ruk
aapse
pooch
lenge
doon
iske
iska
inko
chai
isse
kis
honge
li
mama
kamre
kuuch
dus
pakad
agle
iski
singh
accha
dega
papaji
zinda
wapis
didi
milega
vijay
unhone
roz
wahi
kyoun
varun
bete
bhool
pade
naya
jis
karein
che
bahu
apun
lene
lage
jhooth
jaisa
doosre
ekdum
khol
chorr
thhey
raho
aate
dene
ashwin
oye
ha
re
vo
ae
mamma
khan
dein
acha
biwi
bolna
wale
allah
nayi
sar
saalon
chor
thha
tumhey
lakh
bolta
inka
pee
jahan
abhie
bura
lega
line
humse
jaye
bula
fir
lado
saat
sandhya
wahin
dum
usi
batao
darr
inn
thee
gussa
pad
deepak
sabko
bees
baithe
mushkil
dilli
baba
zamindar
kisne
suno
kum
haar
pandrah
un
jai
akela
wah
kai
hafte
karachi
isi
ise
madad
bolne
bech
ram
rah
baitho
isne
prashant
ro
inhoney
abbu
maut
desh
hume
kumar
falak
govi
mile
kaat
dar
ni

420
dictionaries/dict1text.txt Normal file

@@ -0,0 +1,420 @@
<3
4u
8L3W
a3
aamof
aap
aar
aas
add
adn
aeap
afaik
afk
aisb
aka
aml
aota
asap
at
atm
ayec
ayor
abtr
b4
b4n
bak
bau
bbiaf
bbiam
bbl
bbs
bc
bcnu
bf
bff
bfn
bg
blnt
bm&y
bol
brb
brt
bta
btdt
btw
bcoz
cam
cas
cmiiw
cmon
cnt
cob
cos
coz
cr8
crb
crbt
cya
cu
cua
cul
cul8r
cwyl
cyo
cys
cth
cuz
cz
da
dl
degt
diku
dqmot
dno
dnt
dts
dv8
dw
ebkac
ez
eg
ema
emfbi
eod
eom
ezy
eoc
f2f
f2t
fbm
fc
fd's
ficcl
fitb
fnx
fomcl
frt
ftw
ftc
fwiw
fya
fyeo
fyi
f8
f9
fotb
g
g2cu
gtg
g2g
g2r
g9
ga
gal
gb
gbu
gd
gdr
gf
gfi
gg
giar
gigo
gj
gl
gl/hf
gmta
gna
goi
gol
gr8
gr&d
gt
gtg
h&k
h2cus
h8
hagn
hags
hago
hand
hf
hhis
h/o
hoas
hru
hth
hv
iac
ianal
ib
ic
icbw
idawtc
idc
idk
idts
idunno
ig2r
iirc
ilu
ilbl8
ily
im
imao
imho
imnsho
imo
ims
inal
indtd
iow
ipo
irl
irmc
iuss
iykwim
iyo
iyss
jb
jic
jk
j00r
jac
jic
jja
jk
jmo
jp
jpp
jtlyk
jsun
k
kk
kl
kiss
kit
kotc
kotl
kwim
l8
l8r
lbr
ld
lgp
lgr
lol
lqtm
lsl
ltm
ltns
lylas
lyk
m8
mfi
mmc
msg
mtf
mtfbwu
musm
myob
n
n2u
n1
nbd
nvm
ne
ne1
ni
nfm
nimby
nlt
nm
nmjc
np
no1
noyb
np
nrn
nt
ntu
nvm
nw
nwo
nvm
nbtr
omg
oic
omw
oo
ooh
ootd
op
otb
otl
otoh
ott
ottomh
otw
ova
o
pcm
pdq
plmk
plz
pls
plu
pm
pmfi
pmfji
poahf
ppl
prob
prolly
prt
prw
ptmm
pu
pwn
pxt
q
qik
qt
rodl
rofl
rotfl
rotflol
rotfluts
rp
rl
rme
rmv
rsn
ruok
icnr
ig2r
sal
say
sbtsbc
sc
sete
sis
sit
slap
slp
slpn
smhid
smt
snafu
so
sol
somy
sotmg
soz or sry
spk
spst
ss
ssinf
str8
stq
suitm
sul
sup
syl
t+
ta
tafn
tam
tb
tbd
tbh
tc
tgif
thts
thnx
thnq
tu
tq
ty
thx
tia
tiad
tlk2ul8r
tma
tmb
tmi
tmot
tmrw
tmwfi
tnstaafl
toy
tpm
tptb
tstb
ttfn
ttly
ttml
tttt
ttyl
ttys
txt
txtm8
tym
tyt
tyvm
tfpi
ugtbk
uktr
ul
ur
uv
uw
vf
vms
vmsi
w
w/e
w/i
w/o
wam
wan2tlk
wat
wayf
wb
wb2my
wg
woteva
whteva
wiifm
wk
wkd
wombat
wrk
wrud
wut
wt
wtb
wtg
wth
wts
wubu2
wuu2
wu?
wubu2?
wuciwug
wuf?
wuwh
wwyc
wylei
wat
xlnt
ya
ybs
ygbkm
ykwycd
ymmv
yr
yw
zzz
wysiwyg

579
getLanguage.py Normal file

@@ -0,0 +1,579 @@
"""
Master code to take input, generate features, call MALLET and use the probabilities for generating language tags
"""
# !/usr/bin/python
import sys
import subprocess
import re
import os
import time
import codecs
import pickle
from utils import extractFeatures as ef
from utils import generateLanguageTags as genLangTag
from collections import OrderedDict
from configparser import ConfigParser
def readConfig():
"""
Read config file to load global variables for the project
"""
global language_1_dicts
global language_2_dicts
global memoize_dict
global combined_dicts
global CLASSIFIER_PATH
global TMP_FILE_PATH
global DICT_PATH
global MALLET_PATH
global dict_prob_yes
global dict_prob_no
global memoize_dict_file
global verbose
global lang1
global lang2
# initialize dictionary variables
language_1_dicts = {}
language_2_dicts = {}
# initialize list of dictionary words
combined_dicts = []
# read config
config = ConfigParser()
config.read("config.ini")
config_paths = config["DEFAULT PATHS"]
config_probs = config["DICTIONARY PROBABILITY VALUES"]
config_dicts = config["DICTIONARY NAMES"]
config_gen = config["GENERAL"]
# setup paths for classifier, tmp folder, dictionaries and mallet
CLASSIFIER_PATH = config_paths["CLASSIFIER_PATH"] if config_paths["CLASSIFIER_PATH"] else os.path.join(
os.getcwd(), 'classifiers', 'HiEn.classifier')
TMP_FILE_PATH = config_paths["TMP_FILE_PATH"] if config_paths["TMP_FILE_PATH"] else os.path.join(
os.getcwd(), 'tmp', '')
DICT_PATH = config_paths["DICT_PATH"] if config_paths["DICT_PATH"] else os.path.join(
os.getcwd(), 'dictionaries', '')
MALLET_PATH = config_paths["MALLET_PATH"] if config_paths["MALLET_PATH"] else os.path.join(
os.getcwd(), 'mallet-2.0.8', 'bin', 'mallet')
# initialize probability values for the correct and incorrect language
dict_prob_yes = config_probs["dict_prob_yes"] if config_probs["dict_prob_yes"] else 0.999999999
dict_prob_no = config_probs["dict_prob_no"] if config_probs["dict_prob_no"] else 1E-9
# initialize memoize_dict from file if already present, else with an empty dictionary
memoize_dict_file = config_dicts["memoize_dict_file"] if config_dicts["memoize_dict_file"] else "memoize_dict.pkl"
if os.path.isfile(DICT_PATH + memoize_dict_file):
with open(DICT_PATH + memoize_dict_file, "rb") as fp:
memoize_dict = pickle.load(fp)
else:
memoize_dict = {}
# by default verbose is ON
verbose = int(config_gen["verbose"]) if config_gen["verbose"] else 1
# get language names by default language 1 is HINDI and language 2 is ENGLISH
lang1 = config_gen["language_1"].upper(
) if config_gen["language_1"] else "HINDI"
lang2 = config_gen["language_2"].upper(
) if config_gen["language_2"] else "ENGLISH"
lang_1dict_names = config_dicts["language_1_dicts"].split(
",") if config_dicts["language_1_dicts"] else "hindict1"
lang_2dict_names = config_dicts["language_2_dicts"].split(
",") if config_dicts["language_2_dicts"] else "eng0dict1, eng1dict1"
# initialize language_1_dict and language_2_dict with all the sub dictionaries
for dict_names in lang_1dict_names:
language_1_dicts[dict_names.strip()] = {}
for dict_names in lang_2dict_names:
language_2_dicts[dict_names.strip()] = {}
def createDicts():
"""
Create and populate language dictionaries for Language 1 and Language 2
"""
global language_1_dicts
global language_2_dicts
global combined_dicts
global DICT_PATH
global lang1
global lang2
language_1_words = []
language_2_words = []
# read config to get dictionary structures
config = ConfigParser()
config.read("config.ini")
dict_struct = dict(config.items("DICTIONARY HIERARCHY"))
# create language_1 dictionary
for sub_dict in language_1_dicts:
input_files = dict_struct[sub_dict].split(",")
for filename in input_files:
with open(DICT_PATH + filename.strip(), 'r') as dictfile:
words = dictfile.read().split('\n')
for w in words:
language_1_dicts[sub_dict][w.strip().lower()] = ''
language_1_words.extend(list(language_1_dicts[sub_dict].keys()))
print(lang1, 'dictionary created')
# create language_2 dictionary
for sub_dict in language_2_dicts:
input_files = dict_struct[sub_dict].split(",")
for filename in input_files:
with open(DICT_PATH + filename.strip(), 'r') as dictfile:
words = dictfile.read().split('\n')
for w in words:
language_2_dicts[sub_dict][w.strip().lower()] = ''
language_2_words.extend(list(language_2_dicts[sub_dict].keys()))
print(lang2, 'dictionary created')
# populate the combined word list
combined_dicts.extend(language_1_words)
combined_dicts.extend(language_2_words)
def dictTagging(word, tag):
"""
Use language dictionaries to tag words
"""
global language_1_dicts
global language_2_dicts
global lang1
global lang2
dhin, den0, den1 = 0, 0, 0
word = word
if word.lower() in language_1_dicts["hindict1"].keys():
dhin = 1
if word.lower() in language_2_dicts["eng0dict1"].keys():
den0 = 1
if word.lower() in language_2_dicts["eng1dict1"].keys():
den1 = 1
# if not den0 and not den1 and not dhin : do nothing
if (not den0 and not den1 and dhin) or (not den0 and den1 and dhin): # make HI
tag = lang1[:2]
if (not den0 and den1 and not dhin) or (den0 and not dhin): # make EN
tag = lang2[:2]
# if den0 and not den1 and not dhin : subsumed
# if den0 and not den1 and dhin : do nothing
# if den0 and den1 and not dhin : subsumed
# if den0 and den1 and dhin : do nothing
return tag
def dictLookup(word):
"""
Check whether a word is already present in a dictionary
"""
global combined_dicts
word = word.lower()
if word in set(combined_dicts):
return True
return False
def blurb2Dict(blurb):
"""
Convert a str blurb to an ordered dictionary for comparison
"""
dic2 = OrderedDict()
wordlist = []
for line in blurb.split("\n"):
line = line.split("\t")
word = line[0].split()
tags = line[1:]
if len(word) != 0:
dic2[word[0]] = tags
wordlist.append(word)
return dic2, wordlist
def memoizeWord(mallet_output):
"""
Update the memoize_dict with words that are recently classified by mallet
"""
global memoize_dict
mallet_output = blurb2Dict(mallet_output)[0]
for word in mallet_output.keys():
memoize_dict[word] = mallet_output[word]
def mergeBlurbs(blurb, mallet_output, blurb_dict):
"""
Combine probabilities of words from both MALLET and dictionary outputs
"""
global dict_prob_yes
global dict_prob_no
global verbose
global lang1
global lang2
# convert main blurb to OrderedDict
main_dict = OrderedDict()
wordlist_main = []
for line in blurb.split("\n"):
word, tag = line.split("\t")
main_dict[word] = tag
wordlist_main.append([word])
# populate dictionary based language tags with fixed probabilities for correct and incorrect
blurb_dict = blurb_dict.replace(lang1[:2], lang1[:2].lower(
) + "\t" + str(dict_prob_yes) + "\t" + lang2[:2].lower() + "\t" + str(dict_prob_no))
blurb_dict = blurb_dict.replace(lang2[:2], lang2[:2].lower(
) + "\t" + str(dict_prob_yes) + "\t" + lang1[:2].lower() + "\t" + str(dict_prob_no))
blurb_dict, _wordlist_dict = blurb2Dict(blurb_dict)
# convert mallet blurb to OrderedDict only when it isn't empty
mallet_is_empty = 1
if mallet_output != "":
mallet_is_empty = 0
blurb_mallet, _wordlist_mallet = blurb2Dict(mallet_output)
# combining logic
# iterate over the word list and populate probability values for tags from both dictionary and MALLET output
for idx, word in enumerate(wordlist_main):
current_word = word[0]
updated_word = word
if current_word in blurb_dict:
updated_word.extend(blurb_dict[current_word])
wordlist_main[idx] = updated_word
else:
if not mallet_is_empty:
if current_word in blurb_mallet:
updated_word.extend(blurb_mallet[current_word])
wordlist_main[idx] = updated_word
# convert the updated blurb to str
blurb_updated = []
st = ""
for word in wordlist_main:
st = word[0]
for tag in word[1:]:
st = st + "\t" + str(tag)
st = st.strip()
blurb_updated.append(st)
st = ""
blurb_updated = "\n".join(blurb_updated)
if verbose != 0:
print(blurb_updated, "\n---------------------------------\n")
return blurb_updated
def callMallet(inputText, classifier):
"""
Invokes the mallet classifier with input text and returns Main BLURB, MALLET OUTPUT and BLURB DICT
"""
global combined_dicts
global TMP_FILE_PATH
global memoize_dict
"""
DICTIONARY CREATION CODE
"""
# create a dictionary if not already created, needed when using as a library
if len(combined_dicts) == 0:
createDicts()
# split words based on whether they are already present in the dictionary
# new words go to MALLET for generating probabilities
fixline_mallet = list(filter(lambda x: not dictLookup(x), inputText))
fixline_dict = list(
filter(lambda x: (x not in fixline_mallet) or (x in memoize_dict), inputText))
# create str blurb for mallet and dictionary input
blurb = '\n'.join(["%s\toth" % (v.strip()) for v in inputText])
blurb_mallet = '\n'.join(["%s\toth" % (v.strip()) for v in fixline_mallet])
dict_tags = list(map(lambda x: dictTagging(x, "oth"), fixline_dict))
# get dict_tags from words that are already classified by mallet
for idx, word in enumerate(fixline_dict):
if word in memoize_dict:
dict_tags[idx] = memoize_dict[word]
"""
LOGIC FOR WORDS THAT ARE PRESENT IN MULTIPLE DICTIONARIES
"""
fixline_mallet_corrections = []
for t, w in zip(dict_tags, fixline_dict):
# if, even after dict lookup, some words are still tagged oth due to a corner case, then call mallet on those words
if t == "oth":
fixline_mallet_corrections.append(w)
# update blurb_mallet
blurb_mallet_corrections = '\n'.join(
["%s\toth" % (v.strip()) for v in fixline_mallet_corrections])
# if mallet is not empty then you need to append the correction to the bottom, separated by a \n, otherwise you can just append it directly
if blurb_mallet != "":
blurb_mallet = blurb_mallet + "\n" + blurb_mallet_corrections
else:
blurb_mallet += blurb_mallet_corrections
# remove the words from blurb_dict
dict_tags = filter(lambda x: x != "oth", dict_tags)
fixline_dict = filter(
lambda x: x not in fixline_mallet_corrections, fixline_dict)
blurb_dict = ""
for word, tag in zip(fixline_dict, dict_tags):
if not type(tag) == list:
blurb_dict = blurb_dict + "%s\t%s" % (word.strip(), tag) + "\n"
else:
tmp_tags = "\t".join(tag)
blurb_dict = blurb_dict + \
"%s\t%s" % (word.strip(), tmp_tags) + "\n"
"""
CALLING MALLET
"""
# this checks the case when blurb_mallet only has a \n due to words being taken into blurb_dict
if blurb_mallet != "\n":
# open a temp file and generate input features for mallet
open(TMP_FILE_PATH + 'temp_testFile.txt', 'w').write(blurb_mallet)
ef.main(TMP_FILE_PATH + 'temp_testFile.txt')
# initialize t7 to track time taken by mallet
t7 = time.time()
# call mallet to get probability output
subprocess.Popen(MALLET_PATH + " classify-file --input " + TMP_FILE_PATH + "temp_testFile.txt.features" +
" --output " + TMP_FILE_PATH + "temp_testFile.txt.out --classifier %s" % (classifier), shell=True).wait()
t_total = time.time()-t7
mallet_output = open(
TMP_FILE_PATH + 'temp_testFile.txt.out', 'r').read()
else:
mallet_output = ""
# memoize the probabilities of words already classified
memoizeWord(mallet_output)
print("time for mallet classification", t_total, file=sys.stderr)
return blurb, mallet_output, blurb_dict
def genUID(results, fixline):
"""
ADDING UNIQUE IDS TO OUTPUT FILE AND FORMATTING
where:
fixline is input text
results is language probabilities for each word
"""
# NEW add unique id to results - which separator
uniqueresults = list(range(len(results)))
for idx in range(len(results)):
uniqueresults[idx] = results[idx]
uniqueresults[idx][0] = uniqueresults[idx][0]+"::{}".format(idx)
langOut = OrderedDict()
for v in uniqueresults:
langOut[v[0]] = OrderedDict()
for ii in range(1, len(v), 2):
langOut[v[0]][v[ii]] = float(v[ii+1])
fixmyline = fixline
fnewlines = list(range(len(fixmyline)))
for vvv in range(len(fixmyline)):
fnewlines[vvv] = fixmyline[vvv]+"::{}".format(vvv)
ffixedline = " ".join(fnewlines)
return ffixedline, langOut
def langIdentify(inputText, classifier):
"""
Get language tags for sentences passed as a list
Input : list of sentences
Output : list of words for each sentence with the language probabilities
"""
global TMP_FILE_PATH
inputText = inputText.split("\n")
outputText = []
"""
CONFIG FILE CODE
"""
readConfig()
"""
DICTIONARY CREATION CODE
"""
createDicts()
for line in inputText:
text = re.sub(r"([\w@#\'\\\"]+)([.:,;?!]+)", r"\g<1> \g<2> ", line)
text = text.split()
text = [x.strip() for x in text]
text = [x for x in text if not re.match(r"^\s*$", x)]
"""
CALLING MALLET CODE HERE
"""
blurb, mallet_output, blurb_dict = callMallet(text, classifier)
"""
WRITE COMBINING LOGIC HERE
"""
blurb_tagged = mergeBlurbs(blurb, mallet_output, blurb_dict)
results = [v.split("\t") for v in blurb_tagged.split("\n")]
# generate unique id for output sentences and format
ffixedline, langOut = genUID(results, text)
# get language tags using context logic from probabilities
out = genLangTag.get_res(ffixedline, langOut)
realOut = re.sub("::[0-9]+/", "/", out)
# get word, label pairs in the output
realOut = realOut.split()
realOut = [tuple(word.split("/")) for word in realOut]
# generate output
outputText.append(realOut)
return outputText
def langIdentifyFile(filename, classifier):
"""
Get language tags for sentences from an input file
Input file: tsv with sentence id in first column and sentence in second column
Output file: tsv with word per line, sentences separated by newline
Output of sentence id in first column and best language tag in last column
"""
global TMP_FILE_PATH
# reading the input file
fil = codecs.open(filename, 'r', errors="ignore")
outfil = codecs.open(filename+"_tagged", 'a',
errors="ignore", encoding='utf-8')
line_count = 0
line = (fil.readline()).strip()
while line is not None and line != "":
line_count += 1
if (line_count % 100 == 0):
print(line_count, file=sys.stderr)
if not line.startswith("#"):
# reading sentences and basic pre-processing
lineid = "\t".join(line.split("\t")[:1])
line = " ".join(line.split("\t")[1:])
fline = re.sub(r"([\w@#\'\\\"]+)([.:,;?!]+)",
r"\g<1> \g<2> ", line)
fixline = fline.split()
fixline = [x.strip() for x in fixline]
fixline = [x for x in fixline if not re.match(r"^\s*$", x)]
"""
CALLING MALLET CODE HERE
"""
blurb, mallet_output, blurb_dict = callMallet(fixline, classifier)
"""
WRITE COMBINING LOGIC HERE
"""
blurb_tagged = mergeBlurbs(blurb, mallet_output, blurb_dict)
results = [v.split("\t") for v in blurb_tagged.split("\n")]
# generate unique id for output sentences and format
ffixedline, langOut = genUID(results, fixline)
# get language tags using context logic from probabilities
out = genLangTag.get_res(ffixedline, langOut)
outfil.write(u"##"+lineid+u"\t"+line+u"\n")
realout = re.sub("::[0-9]+/", "/", out)
outfil.write(lineid+u"\t"+realout+u'\n')
else:
print("### skipped commented line:: " + line.encode('utf-8') + "\n")
outfil.write("skipped line" + line.encode('utf-8') + "\n")
line = (fil.readline()).strip()
fil.close()
outfil.close()
print("written to " + filename + "_tagged")
def writeMemoizeDict():
"""
Write the Memoization Dictionary to the disk, update it with new words if already present
"""
if os.path.isfile(DICT_PATH + memoize_dict_file):
# if file already exists, then update memoize_dict before writing
with open(DICT_PATH + memoize_dict_file, "rb") as fp:
memoize_file = pickle.load(fp)
if memoize_file != memoize_dict:
print("updating memoize dictionary")
memoize_dict.update(memoize_file)
# write the memoize_dict to file
with open(DICT_PATH + memoize_dict_file, "wb") as fp:
pickle.dump(memoize_dict, fp)
if __name__ == "__main__":
"""
CONFIG FILE CODE
"""
readConfig()
"""
DICTIONARY CREATION CODE
"""
createDicts()
"""
CLASSIFICATION CODE
"""
blurb = sys.argv[1]
print(blurb)
print(sys.argv)
classifier = CLASSIFIER_PATH
mode = "file"
if len(sys.argv) > 2:
mode = sys.argv[1]
blurb = sys.argv[2]
if len(sys.argv) > 3:
classifier = sys.argv[3]
if mode == "file" or mode == "f":
# CHECK FILE EXISTS
langIdentifyFile(blurb, classifier)
else:
langIdentify(blurb, classifier)
"""
WRITE UPDATED MEMOIZE DICTIONARY TO DISK
"""
writeMemoizeDict()
exit()

Binary data
images/dictionary_structure.PNG Normal file

Binary file not shown. (Size: 16 KiB)

Binary data
images/info_flow_new_lid.PNG Normal file

Binary file not shown. (Size: 88 KiB)

Binary data
images/langIdentify_input.PNG Normal file

Binary file not shown. (Size: 27 KiB)

Binary data
images/langIdentify_output.PNG Normal file

Binary file not shown. (Size: 15 KiB)

4
sampleinp.txt Normal file

@@ -0,0 +1,4 @@
1 Yeh mera pehla sentence hai
2 Aur ye dusra
3 Main kya karoon
4 This is main sentence

8
sampleinp.txt_tagged Normal file

@@ -0,0 +1,8 @@
##1 Yeh mera pehla sentence hai
1 Yeh/HI mera/HI pehla/HI sentence/EN hai/HI
##2 Aur ye dusra
2 Aur/HI ye/HI dusra/HI
##3 Main kya karoon
3 Main/HI kya/HI karoon/HI
##4 This is main sentence
4 This/EN is/EN main/EN sentence/EN

32
sampleoutp.txt Normal file

@@ -0,0 +1,32 @@
HINDI dictionary created
ENGLISH dictionary created
sampleinp.txt
['getLanguage.py', 'sampleinp.txt']
time for mallet classification 1.1737194061279297
Yeh hi 0.999999999 en 1e-09
mera hi 0.999999999 en 1e-09
pehla hi 0.999999999 en 1e-09
sentence en 0.999999999 hi 1e-09
hai hi 0.999999999 en 1e-09
---------------------------------
time for mallet classification 0.46548032760620117
Aur hi 0.999999999 en 1e-09
ye en 0.9276992223233895 hi 0.07230077767661053
dusra hi 0.999999999 en 1e-09
---------------------------------
time for mallet classification 0.37943124771118164
Main en 0.9098939778371711 hi 0.09010602216282892
kya hi 0.999999999 en 1e-09
karoon hi 0.999999999 en 1e-09
---------------------------------
time for mallet classification 0.38225436210632324
This en 0.8779217719993668 hi 0.1220782280006331
is en 0.9597935346160775 hi 0.040206465383922474
main en 0.4926395772768498 hi 0.5073604227231503
sentence en 0.999999999 hi 1e-09
---------------------------------
written to sampleinp.txt_tagged


@@ -0,0 +1,25 @@
1 Uhhh what ever happened to the this page is dedicated to quality confessions motto I thought stuff like this made its way to the other confessions page
2 aap manifesto speaks a lot about your vision of delhi go ahead may god bless you to fulfill peoples dream
3 greenbrier amp summers counties have just been placed under a winter storm watch for monday night through tuesday http URL
4 intermission fc took a big step towards the title with a hard fought win over fellow table toppers the real ajax on monday night
5 july th it's goin down at dar constitution hall i'm gonna make this last part rhyme mb y'all
6 Ha dharamshala me hain
7 Bhai aaap ki Sab say aache film kon si hai
8 things to know for friday the u s postal service on the brink of default on a second multibillion dollar p http URL
9 supw ki aisi ki taisi d kavi marks add toh hue nahi
10 dont be so sure for all you know his webcam may be on and you may be famous on youtube already
11 Sallu miya wonder ful
12 Toh late kahan khaya tha
13 on tuesday the theatres open at ten o'clock in the morning as lent begins after eight at night on tuesday all those who through want
14 na na tum bahut chu panti ki harkat kiye P D
15 drew peterson is no longer tune in for the lifetime original movie sunday at c on someUSER http URL
16 i should've been guarding parker on that last play he may have scored but i bet i would've been on the right side of the court
17 Akdam kick ke jesa
18 that means ur menu chart must also include salt n water hahaha
19 the ravens release their first injury report of the week on wednesday sounds like it will be loaded with names http URL
20 bj penn vs nick diaz replay on fueltv i'd pay to watch it a nd time
21 listen bhagavad gita was a conspiracy hatched by brahmins in collusion with kshatiryas to enslave women vaishya amp sudra komal
22 sari earth ke faadu sticker lagey hotey they ullu bananey ko grrrrr
23 garbage bin bhai batao na mera comment pahla hai na
24 i have a strong sense for who this guys might be and looking at some of the past comments you should be able to at least make a good guess
25 guddu ki to halat kharab hai use to ab itni garmi lag rhi hogi jitni ki delhi ki summer me lagti h p d


@@ -0,0 +1,26 @@
0 Han wo bhi baat hai
1 its not unionbudget its buredinkabudget khana cmputr kapde ghar sab mehenga bewakuf bnaya sirf inhone
2 public movie review shamitabh shamitabh amitabhbachchan moviereview
3 ok
4 kachhua sir toh students ke pakke wale dushman hain
5 salmaan khan tumhare naam k pichey khan accha nahi lagtahai hata de muslaman ke naam ko jo itna hi bura kagta hai islamic dharm to abhi chod de zarurat nahi hai tumhari islaam ko
6 bhaijaan thagaye hoghaey itne saare comments dekkar bhaijaan abhi online mein hai
7 mummy jabb apne kisi dost ko nick name se bulati hai to kitta mazaa aata haii
8 bas kar rulayega kya
9 Bajrangi Bhai hamse bhi guftgu kar liya karo
10 Uhhh what ever happened to the this page is dedicated to quality confessions motto I thought stuff like this made its way to the other confessions page
11 koi mujhe bataega du colleges ke form ki cutt off kab aayegi
12 Kuch nhi launde Bas thodi bahut holi
13 commentator huge celebrations for the dismissal of suresh kohli cwc indvspak
14 Kon h Bhai
15 acha ji aisa kya bt we wer nt lyk dt knd f seniors i mean hum to aise ni the d
16 aap manifesto speaks a lot about your vision of delhi go ahead may god bless you to fulfill peoples dream
17 Maine bola h ki se le liyo
18 Aur suna
19 uske baap ko mobile ki factory tha kya
20 Swagat nahi karoge humahra
21 i remember it happen in my place also koi apni jagah se nahin hilega
22 so pls
23 I have had a couple of hot makeout sessions in a corner of the liby journal section with different guys It was weird as recently we saw another couple using our favored spot WTF seriously
24 agr tu itna handsome h to vo bechaariya milte he ignore kyu krne lgi
25 Bhai jaha fayeda ho vaha pe ladies first vala formula lagu ho jata hai p According to the time change ho jati hai ye

20
tests/test_sample_20.txt Normal file

@@ -0,0 +1,20 @@
1 Yeh mera pehla sentence hai
2 Aur ye dusra
3 Main kya karoon?
4 I love eating खाना with people
5 Mujhe sabke साथ khelna happy करते hai
6 Mujhe afaik transend rafataar maze
7 Mujhe apne manager kaafi pasand hai, I like that guy
8 Aaj ka day humesha yaad rahega humein because India won the World Cup
9 Purana photo bhejna band karo
10 Shoes pehen ke aao
11 Itna serious nhi hona tha
12 Friday fir repeat karna hoga tujhe
13 Nahi wo society wale group me aaya tha ki drainage clog ho gaya hai tower 7 ka
14 Bahar jana toh restricted hai
15 Bangalore me lockdown hai
16 Main nahi aa rha main apne cousins ke saath video call pe hoon
17 Monday bhi chutti hai ?
18 Airport pe abhi bhi foreign log aa rahe hain
19 Pata chal gaya sound kya tha
20 Tumlog apna bill verify kiya ?

3
tmp/temp_testFile.txt Normal file

@@ -0,0 +1,3 @@
This oth
is oth
main oth

3
tmp/temp_testFile.txt.features Normal file

@@ -0,0 +1,3 @@
This i:1 h:1 s:1 T:1 $T:1 hi:1 is:1 Th:1 s$:1 his:1 $Th:1 Thi:1 is$:1 This:1 his$:1 $Thi:1
is i:1 s:1 is:1 s$:1 $i:1
main a:1 i:1 m:1 n:1 ai:1 n$:1 $m:1 ma:1 in:1 mai:1 $ma:1 ain:1 in$:1 ain$:1 main:1 $mai:1

3
tmp/temp_testFile.txt.out Normal file

@@ -0,0 +1,3 @@
This en 0.7463832689008091 hi 0.253616731099191
is en 0.9597935346160775 hi 0.040206465383922474
main en 0.4291624908780122 hi 0.5708375091219877

0
utils/__init__.py Normal file

66
utils/extractFeatures.py Normal file

@@ -0,0 +1,66 @@
"""
Given a word list with language, prepare the data for input to MALLET
"""
import sys
from collections import defaultdict
import codecs
def get_ngrams(word, n):
"""
Extracting all ngrams from a word given a value of n
"""
if word[0] == '#':
word = word[1:]
if n != 1:
word = '$'+word+'$'
ngrams = defaultdict(int)
for i in range(len(word)-(n-1)):
ngrams[word[i:i+n]] += 1
return ngrams
def main(input_file_name):
"""
The main function
"""
# The input file containing wordlist with language
input_file = open(input_file_name, 'r')
# The output file
output_file_name = input_file_name + ".features"
output_file = codecs.open(output_file_name, 'w', encoding='utf-8')
# N up to which n-grams have to be considered
n = 5
# Iterate through the input-file
for each_line in input_file:
fields = list(filter(None, each_line.strip().split("\t")))
word = fields[0]
output_file.write(word)
output_file.write('\t')
# Get all ngrams for the word
for i in range(1, min(n+1, len(word) + 1)):
ngrams = get_ngrams(word, i)
for each_ngram in ngrams:
output_file.write(each_ngram)
output_file.write(':')
output_file.write(str(ngrams[each_ngram]))
output_file.write('\t')
output_file.write('\n')
input_file.close()
output_file.close()
if __name__ == "__main__":
input_file_name = sys.argv[1]
main(input_file_name)

520
utils/generateLanguageTags.py Normal file

@@ -0,0 +1,520 @@
"""
Context logic to generate language tags from probability values
"""
import itertools
import heapq
import re
from collections import Counter, OrderedDict
from ttp import ttp
from configparser import ConfigParser
class RunSpan(object):
def __init__(self, x, y):
self.x = x
self.y = y
runList = []
def run_compute_recur(strn, curpoint, curlist):
runstart = curpoint
while curpoint < len(strn)-1:
if (strn[curpoint] != strn[curpoint+1]) or strn[curpoint] == '$':
if (((curpoint - runstart) > 1) and (strn[curpoint] == strn[runstart] or strn[curpoint] == '$' or strn[runstart] == '$')):
if(not(((curpoint-runstart) == 2) and runstart == 0)):
newrun = RunSpan(runstart, curpoint)
newrunlist = curlist[:]
newrunlist.append(newrun)
run_compute_recur(strn, curpoint+1, newrunlist)
curpoint += 1
lastrun = RunSpan(runstart, curpoint)
if ((curpoint - runstart) > 2):
curlist.append(lastrun)
runList.append(curlist)
def check_skips(strn, lang):
if lang == '0':
alter = '1'
else:
alter = '0'
i = 0
while i < len(strn)-2:
if strn[i] == alter:
if (strn[i+1] == alter) and (strn[i+2] == alter):
return False
else:
i += 1
i += 1
return True
def check_CS(strn):
global runList
runList = []
run_compute_recur(strn, 0, [])
TrueRunList = []
TrueStrnList = []
for runset in runList:
check = True
check2 = True
check3 = True
TrueStrn = ""
TrueStrn2 = ""
for run in runset:
if check is False or (check2 is False and check3 is False):
break
if strn[run.y] == '0':
tosend = strn[run.x:run.y+1]
tosend = tosend.replace("$", "0")
check = check_skips(tosend, '0')
TrueStrn = TrueStrn + tosend
TrueStrn2 = TrueStrn2 + tosend
if strn[run.y] == '1':
tosend = strn[run.x:run.y+1]
tosend = tosend.replace("$", "1")
check = check_skips(tosend, '1')
TrueStrn = TrueStrn + tosend
TrueStrn2 = TrueStrn2 + tosend
if strn[run.y] == '$':
if strn[run.x] == '0':
tosend = strn[run.x:run.y+1]
tosend = tosend.replace("$", "0")
check = check_skips(tosend, '0')
TrueStrn = TrueStrn + tosend
TrueStrn2 = TrueStrn2 + tosend
if strn[run.x] == '1':
tosend = strn[run.x:run.y+1]
tosend = tosend.replace("$", "1")
check = check_skips(tosend, '1')
TrueStrn = TrueStrn + tosend
TrueStrn2 = TrueStrn2 + tosend
if strn[run.x] == '$':
tosend = strn[run.x:run.y+1]
tosend = tosend.replace("$", "1")
check3 = check3 and check_skips(tosend, '1')
if check3:
TrueStrn = TrueStrn + tosend
tosend = strn[run.x:run.y+1]
tosend = tosend.replace("$", "0")
check2 = check2 and check_skips(tosend, '0')
if check2:
TrueStrn2 = TrueStrn2 + tosend
if check is True:
if check3 is True and TrueStrn != "":
TrueRunList.append(runset)
TrueStrnList.append(TrueStrn)
if check2 is True and TrueStrn2 != "":
TrueRunList.append(runset)
TrueStrnList.append(TrueStrn2)
i = 0
Purities = {}
while i < len(TrueRunList):
purity = 0.0
for run in TrueRunList[i]:
counter = (
Counter(TrueStrnList[i][run.x:run.y+1]).most_common(1)[0])
if counter[0] == '0':
purity += float(float(counter[1])/float(len(TrueStrnList[i])))
else:
purity += float(float(counter[1])/float(len(TrueStrnList[i])))
purity /= (float(len(TrueRunList[i]))*float(len(TrueRunList[i])))
Purities[i] = purity
i += 1
if Purities:
for item in dict_nlargest(Purities, 1):
return (TrueRunList[item], TrueStrnList[item], Purities[item])
else:
return ("", "", -1)
def compute_tag(m, strn):
global lang1
global lang2
strn = strn.strip()
origstrn = strn
encount = 0.0
hicount = 0.0
for ch in strn:
if (ch == '0'):
encount += 1
if (ch == '1'):
hicount += 1
if (ch == '$'):
hicount += 1
encount += 1
if (encount/len(strn)) > 0.7:
return lang2, "-1"
if (hicount/len(strn)) > 0.8:
return lang1, "-1"
else:
count = 0
for x, _y in m.items():
if len(x) < 4:
strn = strn[:count+1] + "$" + strn[count+2:]
count += 1
a, b, c = check_CS(strn)
if(c == -1 or len(a) < 2):
CMrun = b[1:-1]
encount1 = 0
hicount1 = 0
for ch in CMrun:
if (ch == '0'):
encount1 += 1
if (ch == '1'):
hicount1 += 1
if (ch == '$'):
hicount1 += 1
encount1 += 1
if(encount1 > hicount1):
return "Code mixed" + " " + lang2, "-1"
elif (encount1 < hicount1):
return "Code mixed" + " " + lang1, "-1"
else:
return "Code mixed Equal", "-1"
else:
strncs = ""
for i in a:
if i.x > 0:
if (strn[i.x] == '$') and (origstrn[i.x] != b[i.x]):
b = b[:i.x] + origstrn[i.x] + b[i.x+1:]
i.x += 1
if i.y < len(b)-1:
if (strn[i.y] == '$') and (origstrn[i.y] != b[i.y]):
b = b[:i.y] + origstrn[i.y] + b[i.y+1:]
i.y -= 1
for i in a:
strncs += b[i.x:i.y+1] + "|"
return "Code switched", strncs[:-1]
def dict_nlargest(d, n):
return heapq.nlargest(n, d, key=lambda k: d[k])
def get_res(orig, vals):
global lang1
global lang1_code
global lang2
global lang2_code
# read config
config = ConfigParser()
config.read("config.ini")
# use config to extract language names
config_gen = config["GENERAL"]
# get language names by default language 1 is HINDI and language 2 is ENGLISH
lang1 = config_gen["language_1"].upper(
) if config_gen["language_1"] else "HINDI"
lang2 = config_gen["language_2"].upper(
) if config_gen["language_2"] else "ENGLISH"
lang1_code = lang1.lower()[:2]
lang2_code = lang2.lower()[:2]
prs = ttp.Parser()
count = 0
mult = 0.95
topx = 32
dic = OrderedDict()
origdic = OrderedDict()
dic[1] = vals
origdic[1] = orig
initlist = [u"".join(seq) for seq in itertools.product("01", repeat=5)]
alreadyExistingTweets = {}
processedTweets = []
finalTweetDict = OrderedDict()
for b, c in dic.items():
if b:
v = OrderedDict()
origstr = u""
for m, _n in c.items():
origstr = origstr + u" " + m
# SH had to comment out following line bc of errors:
# ne_removed = origstr.encode('ascii', 'ignore') # why
ne_removed = origstr
ne_removed = u' '.join(ne_removed.split())
total_length = 0.0
max_length = 0
for word in ne_removed.split():
if True: # SH changed
total_length += len(word)
if len(word) > max_length:
max_length = len(word)
v[word] = c[word]
initdic = {}
curlen = 5
for item in initlist:
initdic[item] = 1
for i, _j in initdic.items():
count = 0
for x, y in v.items():
if i[count] == '0':
initdic[i] = initdic[i]*y[lang2_code]
else:
initdic[i] = initdic[i]*y[lang1_code]
if count > 0 and i[count-1] != i[count]:
initdic[i] = initdic[i]*(1-mult)
else:
initdic[i] = initdic[i]*mult
count += 1
if count == curlen:
break
top32 = initdic
curlen = count
wordcount = 0
if curlen < 5:
newdic = {}
for x, y in top32.items():
newdic[x[:curlen]] = y
top32.clear()
top32 = newdic
strn = u""
for x, y in v.items():
strn = strn + u" " + x
if wordcount < 5:
wordcount += 1
else:
curlen += 1
newdic = {}
for k, p in top32.items():
newstr = k + '0'
newdic[newstr] = p * y[lang2_code]
if newstr[curlen-1] == '0':
newdic[newstr] = newdic[newstr]*mult
else:
newdic[newstr] = newdic[newstr]*(1-mult)
newstr = k + '1'
newdic[newstr] = p * y[lang1_code]
if newstr[curlen-1] == '1':
newdic[newstr] = newdic[newstr]*mult
else:
newdic[newstr] = newdic[newstr]*(1-mult)
top32.clear()
for x in dict_nlargest(newdic, topx):
top32[x] = newdic[x]
for item in dict_nlargest(top32, 1):
if len(item) > 0:
curOrigTweet = origdic[b]
superOrig = curOrigTweet
curOrigTweet = re.sub(r"\s+", ' ', curOrigTweet.strip())
tweetparse = prs.parse(curOrigTweet)
tweetUrls = []
for url in tweetparse.urls:
for _e in range(0, (len(re.findall(url, curOrigTweet)))):
tweetUrls.append(url)
if (len(re.findall(url, curOrigTweet))) == 0:
tweetUrls.append(url)
curOrigTweet = curOrigTweet.replace(
url, " THIS_IS_MY_URL ")
# SH NEW new keep punctuation
curOrigTweet = curOrigTweet.replace("#", " #")
curOrigTweet = curOrigTweet.replace("@", " @")
curOrigTweet = re.sub(r"\s+", ' ', curOrigTweet.strip())
tweetdic = OrderedDict()
splitOrig = curOrigTweet.split(' ')
splitHT = origstr.strip().split(' ')
urlcount = 0
for word in splitOrig:
if word in splitHT:
tweetdic[word] = "OK"
elif '_' in word:
if word == "THIS_IS_MY_URL":
tweetdic[tweetUrls[urlcount]] = "OTHER"
urlcount += 1
elif ((word[1:] in tweetparse.tags) or (word[1:] in tweetparse.users)):
tweetdic[word] = "OTHER"
else:
splt = word.split('_')
for wd in splt:
if wd in splitHT:
tweetdic[wd] = "OK"
else:
tweetdic[wd] = "OTHER"
else:
tweetdic[word] = "OTHER"
splitNE = strn.strip().split(u' ')
newNE = []
newItem = ""
for word, tag in tweetdic.items():
if tag == "OK":
if word in splitNE:
reqindex = splitNE.index(word)
wordlist = []
wordlist2 = []
if word in wordlist:
tweetdic[word] = lang1_code.upper()
newItem += "1"
elif word in wordlist2:
tweetdic[word] = lang2_code.upper()
newItem += "0"
# new next condition:
elif re.match(r"\W+", word):
print(word)
tweetdic[word] = "OTHER"
elif item[reqindex] == '0':
tweetdic[word] = lang2_code.upper()
newItem += "0"
else:
tweetdic[word] = lang1_code.upper()
newItem += "1"
newNE.append(word)
else:
tweetdic[word] = "OTHER"
newStrn = " ".join(newNE)
if len(newStrn) > 0:
newV = OrderedDict()
for q, r in v.items():
if q in newNE:
newV[q] = r
tweettag, runs = compute_tag(newV, '$' + newItem + '$')
if runs != "-1":
runSplit = runs.split('|')
runs = runs[1:-1]
wordDict = OrderedDict()
runcount = 0
ind = 0
for word in tweetdic:
wordlabel = OrderedDict()
wordlabel["Label"] = tweetdic[word]
if tweettag == "Code mixed" + " " + lang2 or tweettag == lang2:
wordlabel["Matrix"] = lang2_code.upper()
elif tweettag == "Code mixed" + " " + lang1 or tweettag == lang1:
wordlabel["Matrix"] = lang1_code.upper()
elif tweettag == "Code mixed Equal" or tweettag == lang1:
wordlabel["Matrix"] = "X"
else:
if (tweetdic[word] == lang2_code.upper() or tweetdic[word] == lang1_code.upper()):
ind += 1
if (runs[ind-1] == '|'):
runcount += 1
if runSplit[runcount][0] == "0":
wordlabel["Matrix"] = lang2_code.upper()
else:
wordlabel["Matrix"] = lang1_code.upper()
wordDict[word] = wordlabel
sansRT = superOrig.replace("rt", "").strip()
if not(newStrn in alreadyExistingTweets) and not(sansRT in alreadyExistingTweets):
wholeTweetDict = OrderedDict()
wholeTweetDict["Tweet"] = superOrig
wholeTweetDict["Tweet-tag"] = tweettag
wholeTweetDict["Word-level"] = wordDict
wholeTweetDict["Twitter-tag"] = "None"
finalTweetDict[b] = wholeTweetDict
alreadyExistingTweets[sansRT] = "yes"
alreadyExistingTweets[newStrn] = "yes"
processedTweets.append(b)
else:
processedTweets.append(b)
check_tw = {}
for t, u in origdic.items():
if t in finalTweetDict or t in processedTweets or u in check_tw:
continue
else:
wholeTweetDict = OrderedDict()
wholeTweetDict["Tweet"] = u
wholeTweetDict["Tweet-tag"] = "Other_Noise"
wholeTweetDict["Word-level"] = {}
wholeTweetDict["Twitter-tag"] = "Noise"
check_tw[u] = True
finalTweetDict[t] = wholeTweetDict
final_output = ""
for x, y in finalTweetDict.items():
for k, v in y['Word-level'].items():
final_output += k + '/' + v['Label'] + ' '
return final_output