
In this project, the authors propose using a contextual Word2Vec model to understand out-of-vocabulary (OOV) words. OOV words are extracted using left-right entropy and pointwise mutual information. Word2Vec is used to construct the word vector space, and CBOW (continuous bag of words) is used to capture the contextual information of the words.


gabrielpondc/oovunderstand


Chinese OOV recognition and understanding by contextual Word2Vec model


Content

Project Introduction

image


How to Run

Mine the data for the corpus
$ python weibomining.py
Extract candidate words from the corpus as a word list
$ python oovfinder.py
Compare the word list with the dictionary and extract the OOV words as a list
$ python isoov.py
Filter person names, organization names, and place names out of the OOV list to produce a cleaned OOV list
$ python namefinder.py
$ python placefinder.py
$ python orgfinder.py
Crawl additional corpus from Weibo using the OOV words as keywords
$ python keywordcorpuscrawl.py
Merge the keyword corpus with the original corpus and segment the words with jieba
$ python splitsystem.py
Train the model and calculate the similarity for each OOV word
$ python modeltraining.py
As an additional experiment, input an OOV word for direct semantic understanding
$ python modeltraining.py
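
The dictionary comparison and name-filtering steps can be pictured as simple set operations. The sketch below is only illustrative: the function names `extract_oov` and `remove_entities` and the sample words are assumptions, and it does not reproduce what isoov.py or the namefinder/placefinder/orgfinder scripts actually do.

```python
def extract_oov(candidates, dictionary):
    """Return candidates missing from the dictionary, in first-seen order, without duplicates."""
    known = set(dictionary)
    seen = set()
    oov = []
    for word in candidates:
        if word not in known and word not in seen:
            seen.add(word)
            oov.append(word)
    return oov


def remove_entities(oov, *entity_lists):
    """Drop words that appear in any of the given name lists (person, place, organization)."""
    blocked = set().union(*entity_lists)
    return [word for word in oov if word not in blocked]


# Hypothetical sample data, not taken from the project's corpus.
candidates = ["新冠", "感染", "北京", "新冠", "天才病"]
dictionary = ["感染"]
place_names = ["北京"]

oov_list = extract_oov(candidates, dictionary)
cleaned_oov_list = remove_entities(oov_list, place_names)
print(cleaned_oov_list)
```

In this toy run, the dictionary word and the place name are dropped and the remaining candidates form the cleaned OOV list.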

Word Extraction

Mutual information (MI)
image
The higher the mutual information, the stronger the correlation between X and Y and the more likely it is that X and Y form a word. The lower the mutual information, the weaker the correlation and the more likely it is that there is a word boundary between X and Y.
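
As a rough illustration of this criterion, the sketch below computes the standard pointwise mutual information, log2(P(X,Y) / (P(X)P(Y))), from adjacent-token counts. This is an assumption about the form of the formula, since the README's equation is only available as an image; the function name `pmi` and the toy text are hypothetical.

```python
import math
from collections import Counter


def pmi(tokens, x, y):
    """Pointwise mutual information of the adjacent pair (x, y), in bits."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(x, y)]
    if pair_count == 0:
        # The pair never occurs together, so there is no evidence of a word.
        return float("-inf")
    p_xy = pair_count / (len(tokens) - 1)
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


# Toy character sequence (hypothetical, not from the project's corpus).
tokens = list("新冠的新冠的新冠好的好的")
print(pmi(tokens, "新", "冠"))  # pair inside a word
print(pmi(tokens, "冠", "的"))  # pair across a likely boundary
```

On this toy input the pair inside "新冠" scores higher than the pair that straddles the boundary, which matches the rule stated above.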
Left and right entropy
image
image

W: a candidate word after N-gram segmentation.
A: the collection of all words appearing to the left of the candidate.
a: a word appearing on the left.
B: the collection of all words appearing to the right of the candidate.
b: a word appearing on the right.
The more varied the words that appear on either side of the candidate W (that is, the higher its left and right entropy), the more likely it is that W is a word.
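
The definitions above can be sketched in code as the Shannon entropy of the left-neighbor and right-neighbor distributions of a candidate. This is a minimal sketch under stated assumptions: neighbors are single characters rather than segmented words, the entropy is in bits, and the names `neighbor_entropy` and `left_right_entropy` are hypothetical rather than taken from oovfinder.py.

```python
import math
from collections import Counter


def neighbor_entropy(counts):
    """Shannon entropy (bits) of a neighbor frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def left_right_entropy(text, candidate):
    """Return (left entropy, right entropy) of candidate within text."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[text[start - 1]] += 1   # character just before the candidate
        if end < len(text):
            right[text[end]] += 1        # character just after the candidate
        start = text.find(candidate, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


# Toy text (hypothetical, not from the project's corpus).
text = "吃新冠药，得新冠了，防新冠吧，吃新冠药"
print(left_right_entropy(text, "新冠"))
print(left_right_entropy(text, "新"))
```

In the toy text, "新冠" has varied neighbors on both sides, while the fragment "新" is always followed by the same character, so its right entropy is zero and it is unlikely to be a complete word.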


Some Results

| Class | OOV | Similar Words of OOV |
| --- | --- | --- |
| A | 天才病 (Genius Disease) | 阿兹伯格综合症 (Asperger's Syndrome) |
| B | 新冠 (COVID-19) | 感染 (Infection), 病毒 (Virus), 肺炎 (Pneumonia) |
| C | 凤凰网 (Media Organization) | 应该 (Should), 讨论 (Discuss), 看法 (View) |

The example shows '凤凰网' (a media organization) on the left and '新冠' (COVID-19) on the right. Because '凤凰网' usually appears at the end of news posts, its context carries little information and a lot of noise, which makes the meaning of the word difficult to predict. In contrast, '新冠' is rich in contextual information, so its predicted similar words are relatively accurate.

image

This example shows how the CBOW and Skip-gram models understand '耗子尾汁'. Both models find semantically appropriate words, but the similarity scores produced by the CBOW model are higher.

image

| Model | A | B | C | Accuracy |
| --- | --- | --- | --- | --- |
| CBOW | 21 | 13 | 1 | 97.10% |
| Skip-gram | 17 | 14 | 4 | 88.57% |
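
The README does not define how the Accuracy column is computed. One reading that reproduces the Skip-gram row is that classes A and B count as correct and class C as incorrect; the sketch below checks that arithmetic. This interpretation is an assumption, and under it the CBOW row comes out at about 97.14%, slightly different from the reported 97.10%.

```python
# Class counts copied from the table above.
counts = {
    "CBOW": {"A": 21, "B": 13, "C": 1},
    "Skip-gram": {"A": 17, "B": 14, "C": 4},
}


def accuracy(class_counts):
    """Share of OOV words in classes A or B (assumed to be the 'correct' classes)."""
    total = sum(class_counts.values())
    return (class_counts["A"] + class_counts["B"]) / total


for model, class_counts in counts.items():
    print(f"{model}: {accuracy(class_counts):.2%}")
```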

Results for the OOV '耗子尾汁'

| Word | Translation | Similarity |
| --- | --- | --- |
| 好自为之 | Take care of yourself | 0.99997896 |
|  | particle (in Chinese) | 0.99997878 |
|  | I | 0.99997693 |
| 马保国 | Baoguo Ma | 0.99997658 |
|  | also | 0.99997264 |
|  | and | 0.99997222 |
|  | particle (in Chinese) | 0.99997193 |
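
Similarity between Word2Vec vectors is commonly measured as cosine similarity. The README does not state which measure produced the Similarity column, so the sketch below is only an assumption about how such a score could be computed; `cosine_similarity` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors; real Word2Vec vectors are much longer.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

Vectors pointing in nearly the same direction score close to 1, which is the range seen in the table above.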

About the Authors

JiaKai Gu
E-mail: gabrielpondc@cau.ac.kr
Jason J. Jung
Department of Computer Engineering, Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul, Republic of Korea 06974
Tel.: +82-2-820-5136
Fax: +82-2-820-5301
E-mail: j3ung@cau.ac.kr

Cite this project

@article{gu2022contextual,
   author = {Gu, JiaKai and Li, Gen and Vo, Nam D. and Jung, Jason J.},
   title = {Contextual Word2Vec Model for Understanding Chinese Out of Vocabularies on Online Social Media},
   journal = {International Journal on Semantic Web and Information Systems (IJSWIS)},
   volume = {18},
   number = {1},
   pages = {1-14},
   ISSN = {1552-6283},
   DOI = {10.4018/IJSWIS.309428},
   url = {https://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/IJSWIS.309428},
   year = {2022},
   type = {Journal Article}
}

Data source

