Chinese OOV recognition and understanding by contextual Word2Vec model

Project Introduction

How to Run

Mine Weibo data to build the corpus
$ python weibomining.py
Extract candidate words from the corpus as a word list
$ python oovfinder.py
Compare the word list with the dictionary and extract the OOV words as a list
$ python isoov.py
Filter person names, organization names, and place names out of the OOV list to obtain the cleaned OOV list
$ python namefinder.py
$ python placefinder.py
$ python orgfinder.py
Crawl additional Weibo posts using each OOV word as a keyword
$ python keywordcorpuscrawl.py
Merge the keyword corpus with the original corpus and split the text into words with jieba
$ python splitsystem.py
Train the model and calculate the similarity for each OOV word
$ python modeltraining.py
As an additional experiment, input a single OOV word for direct semantic understanding
$ python modeltraining.py
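
The dictionary comparison and named-entity filtering steps above can be sketched roughly as follows. This is only an illustration, not the logic of isoov.py or the *finder.py scripts: the function names, the toy word lists, and the assumption that each step reduces to set membership are all hypothetical.

```python
def find_oov(candidates, dictionary):
    """Keep candidates that are absent from the dictionary (order preserved, no repeats)."""
    seen = set()
    oov = []
    for word in candidates:
        if word not in dictionary and word not in seen:
            seen.add(word)
            oov.append(word)
    return oov


def remove_named_entities(oov, *entity_lists):
    """Drop words found in any of the given name lists (persons, places, organizations)."""
    blocked = set().union(*entity_lists)
    return [word for word in oov if word not in blocked]


# Toy data, invented for illustration only.
candidates = ["新冠", "感染", "马保国", "天才病", "新冠"]
dictionary = {"感染", "病毒"}
person_names = {"马保国"}

oov = find_oov(candidates, dictionary)
cleaned = remove_named_entities(oov, person_names)
print(oov)      # words missing from the dictionary
print(cleaned)  # the same list without person names
```

In practice the name lists would come from the name, place, and organization recognition steps rather than being written by hand.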

Word Extraction

Mutual information (MI)

The higher the mutual information between X and Y, the stronger their correlation and the more likely it is that X and Y form a word. The lower the mutual information, the weaker the correlation and the more likely it is that a word boundary lies between X and Y.
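
As a rough illustration of the idea, the sketch below computes the standard pointwise form, log2(P(X,Y) / (P(X) P(Y))), from adjacent-pair counts. The project's exact formula and smoothing are not shown here, so the function name, the probability estimates, and the toy character sequence are assumptions.

```python
import math
from collections import Counter


def pointwise_mi(tokens, x, y):
    """Pointwise mutual information of the adjacent pair (x, y), in bits."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    p_xy = bigrams[(x, y)] / n_bi
    if p_xy == 0:
        return float("-inf")  # the pair never occurs next to each other
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


# Toy character sequence, invented for illustration only.
tokens = list("新冠的新冠的我的你的新冠")

mi_inside = pointwise_mi(tokens, "新", "冠")    # pair inside a word
mi_boundary = pointwise_mi(tokens, "冠", "的")  # pair across a likely boundary
print(mi_inside, mi_boundary)
```

In this toy sequence the pair 新/冠 scores higher than 冠/的, which matches the intuition that a low score marks a likely boundary.
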
Left and right entropy

W: a candidate word after N-gram segmentation.
A: the collection of all words appearing to the left of the candidate.
a: a word appearing to the left.
B: the collection of all words appearing to the right of the candidate.
b: a word appearing to the right.
The more varied the words that appear to the left and right of the candidate W (that is, the higher its left and right entropy), the more likely it is that W is a word.
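
A minimal sketch of this measure, assuming the usual Shannon entropy over the left-neighbor distribution P(a|W) and the right-neighbor distribution P(b|W). The function names and the toy token list are hypothetical and are not taken from the project's scripts.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def neighbor_entropy(tokens, candidate):
    """Return (left_entropy, right_entropy) of `candidate` within `tokens`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != candidate:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1   # a word from the left collection A
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1  # a word from the right collection B
    return _entropy(left), _entropy(right)


# Toy segmented text, invented for illustration only.
tokens = ["关于", "新冠", "疫苗", "讨论", "新冠", "病毒", "预防", "新冠", "病毒"]
left_h, right_h = neighbor_entropy(tokens, "新冠")
print(left_h, right_h)
```

Here the candidate has three different left neighbors but only two distinct right neighbors, so its left entropy is higher than its right entropy.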

Some Results

Class	OOV	Similar Words of OOV
A	天才病(Genius Disease)	阿兹伯格综合症(Asperger's Syndrome)
B	新冠 (COVID-19)	感染 (Infection), 病毒 (Virus), 肺炎 (Pneumonia)
C	凤凰网 (Media Organization)	应该 (Should), 讨论 (Discuss), 看法 (View)

The example shows '凤凰网' (a media organization) on the left and '新冠' (COVID-19) on the right. Because '凤凰网' usually appears at the end of news posts, its context carries little information and a lot of noise, so its meaning is hard to predict. In contrast, '新冠' has rich contextual information, so its predicted similar words are relatively accurate.

The next example shows how the CBOW and Skip-gram models understand '耗子尾汁'. Both models find semantically appropriate words, but the CBOW model assigns a higher similarity to the word pair.

Model	A	B	C	Accuracy
CBOW	21	13	1	97.10%
Skip-gram	17	14	4	88.57%

Results for the OOV '耗子尾汁'

Word	Translation	Similarity
好自为之	Take care of yourself	0.99997896
吗	particle (in Chinese)	0.99997878
我	I	0.99997693
马保国	Ma Baoguo (person name)	0.99997658
又	Also	0.99997264
和	and	0.99997222
呢	particle (in Chinese)	0.99997193
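
The similarity scores above are presumably cosine similarities between word vectors, which is the usual measure for Word2Vec models, although this README does not say so explicitly. The sketch below ranks words by cosine similarity using made-up 3-dimensional vectors. The vectors, function names, and the extra word '天气' are purely illustrative and are not output from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=5):
    """Rank every other word in `vectors` by similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; a real model would use far more dimensions.
vectors = {
    "耗子尾汁": [0.90, 0.10, 0.30],
    "好自为之": [0.88, 0.12, 0.31],
    "天气": [0.10, 0.90, 0.20],
}
ranked = most_similar("耗子尾汁", vectors)
print(ranked)
```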

About the Author

JiaKai Gu
E-mail: gabrielpondc@cau.ac.kr
Jason J. Jung
Department of Computer Engineering, Chung-Ang University 84, Heukseok-ro, Dongjak-gu, Seoul, Republic of Korea 06974
Tel.: +82-2-820-5136
Fax: +82-2-820-5301
E-mail: j3ung@cau.ac.kr

Cite this project

@article{gu2022contextual,
   author = {Gu, JiaKai and Li, Gen and Vo, Nam D. and Jung, Jason J.},
   title = {Contextual Word2Vec Model for Understanding Chinese Out of Vocabularies on Online Social Media},
   journal = {International Journal on Semantic Web and Information Systems (IJSWIS)},
   volume = {18},
   number = {1},
   pages = {1-14},
   ISSN = {1552-6283},
   DOI = {10.4018/IJSWIS.309428},
   url = { https://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/IJSWIS.309428 },
   year = {2022},
   type = {Journal Article}
}
