GitHub - tuanio/real-estates-recommend-search

For NLP (natural language processing) problems, the most important step is still data processing. The data file real_estates.xlsx contains many run-together words, unaccented words that still carry meaning, emoji, icons, URLs, and email addresses, all mixed together. Training an n-gram model directly on that raw text would not work well, so the data has to be cleaned first. The preprocessing lives in preprocessing_data.ipynb. Running that notebook produces a file named real_estates_preprocesed.txt, which holds the concatenated and cleaned text.
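As a rough illustration of this kind of cleanup, here is a minimal sketch. It is not the code from preprocessing_data.ipynb; the function name `clean_text` and the regular expressions are assumptions made for this example. It only strips URLs, email addresses, emoji, and punctuation, and does not try to split run-together words or restore missing diacritics.

```python
import re
import unicodedata

# Hypothetical patterns for this sketch; the notebook may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Anything that is not a letter, digit, underscore, or whitespace
# (this covers emoji, icons, and punctuation).
NON_TEXT_RE = re.compile(r"[^\w\s]")


def clean_text(text: str) -> str:
    """Return lowercased text with URLs, emails, emoji, and punctuation removed."""
    # Compose diacritics so Vietnamese letters are single code points.
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    text = text.replace("_", " ")
    # Collapse runs of whitespace and lowercase everything.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    raw = "Bán nhà 🏠 giá tốt!!! Liên hệ: abc@gmail.com hoặc https://example.com/nha"
    print(clean_text(raw))
```

URLs and emails are removed before punctuation on purpose: stripping punctuation first would break them into fragments that the patterns no longer match.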
Once preprocessing is done, the next step is training an n-gram language model. The tutorial at https://viblo.asia/p/language-modeling-mo-hinh-ngon-ngu-va-bai-toan-them-dau-cau-trong-tieng-viet-1VgZveV2KAw describes the Statistical Language Model, which computes the probability of a word sequence (w_1...w_n). This approach should work well here, because such sequences tend to resemble written text, and the model learns from that written text. One concrete approach to statistical language modeling is the N-gram language model, which caps n in the sequence (w_1...w_n) and computes the conditional probability of the sequence based on shorter sequences such as (w_1...w_{n-1}). Code for this, based on the tutorial linked above, is in n_gram_model.ipynb.
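To make the conditional probability concrete, here is a minimal count-based sketch. It is not the code from n_gram_model.ipynb; the names `train_ngram` and `cond_prob` are hypothetical, and the unsmoothed estimate count(w_1...w_n) / count(w_1...w_{n-1}) is an assumption about how the probability is computed.

```python
from collections import Counter


def train_ngram(tokens, n=2):
    """Count every n-gram and the (n-1)-word context that starts it."""
    ngrams = Counter()
    contexts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1
    return ngrams, contexts


def cond_prob(ngrams, contexts, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0  # context never seen in training; no smoothing in this sketch
    return ngrams[context + (word,)] / contexts[context]


if __name__ == "__main__":
    corpus = "ban nha quan 1 ban nha quan 2 ban dat".split()
    ngrams, contexts = train_ngram(corpus, n=2)
    print(cond_prob(ngrams, contexts, ["ban"], "nha"))
```

In this toy corpus "ban" is followed by "nha" twice and by "dat" once, so the estimate for "nha" given "ban" is 2/3. A real model would usually add smoothing so that unseen n-grams do not get probability zero.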
To use the data and the notebooks, first extract the datasets.zip archive.
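The archive can be extracted with any unzip tool. As one option, here is a short sketch using Python's standard zipfile module. The helper name `extract_datasets` and the default destination (the current directory) are assumptions for this example; the layout inside datasets.zip is not described here, so the notebooks may expect a specific location.

```python
import zipfile
from pathlib import Path


def extract_datasets(zip_path="datasets.zip", dest="."):
    """Extract the archive into dest and return the list of extracted member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Run from the repository root, where datasets.zip is located.
    for name in extract_datasets():
        print(name)
```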

Folders and files:
models
pickles
README.md
character_base_lstm_lm.ipynb
datasets.zip
lstm_word_base_lm.ipynb
make_bigcorpus.ipynb
n_gram_model.ipynb
preprocessing_data.ipynb
