GitHub - seanghay/khmerpunctuate: Punctuation Restoration for Khmer language

Punctuation Restoration for Khmer language

Built with [xashru/punctuation-restoration] on top of [xlm-roberta-base], then exported to ONNX so it runs on onnxruntime.

Install

pip install khmerpunctuate

# Or
pip install git+https://github.com/seanghay/khmerpunctuate.git

Usage

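`punctuate` assigns one of the following token types to each input token. IDs 0 to 6 are literal strings to append after the token; IDs 7 to 10 are span tags for numbers and quotes: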

{
  0: "",
  1: " ",
  2: "!",
  3: "។",
  4: "?",
  5: "៖",
  6: "។\n",
  7: "B-NUMBER",
  8: "I-NUMBER",
  9: "B-QUOTE",
  10: "I-QUOTE",
}
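The split between literal punctuation and span tags can be made explicit in code. The sketch below restates the table as a Python mapping with a small predicate; `PUNCT_LABELS`, `FIRST_TAG_ID`, and `is_punctuation` are hypothetical names chosen for illustration and are not exported by khmerpunctuate.

```python
# Hypothetical constants mirroring the token-type table above;
# they are not part of the khmerpunctuate package.
PUNCT_LABELS = {
    0: "",
    1: " ",
    2: "!",
    3: "។",
    4: "?",
    5: "៖",
    6: "។\n",
    7: "B-NUMBER",
    8: "I-NUMBER",
    9: "B-QUOTE",
    10: "I-QUOTE",
}

# IDs below this value are literal strings to append after a token;
# IDs at or above it are span tags.
FIRST_TAG_ID = 7


def is_punctuation(punct_id: int) -> bool:
    """Return True when the ID maps to text that can be appended as-is."""
    return 0 <= punct_id < FIRST_TAG_ID


# List the IDs that are tags rather than punctuation.
print([i for i in PUNCT_LABELS if not is_punctuation(i)])  # [7, 8, 9, 10]
```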

from khmernormalizer import normalize
from khmercut import tokenize
from khmerpunctuate import punctuate

text = normalize("អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញបានព្រមានថានឹងចេញដីកាបញ្ជាឲ្យបង្ខំនិងឲ្យឃុំខ្លួនតាមនីតិវិធីប្រសិនបើលោករ៉ុងឈុនដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិមិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀលឲ្យបានមុនថ្ងៃទី០៤ខែមីនាឆ្នាំ២០២៤ទេនោះ")
tokens = tokenize(text)

output_text = ""
for token, punct, punct_id in punctuate(tokens):
  # IDs 7-10 (B-NUMBER, I-NUMBER, B-QUOTE, I-QUOTE) are span tags, not punctuation,
  # so only IDs below 7 are appended after the token
  if punct_id < 7:
    output_text += token + punct
  else:
    output_text += token

print(output_text)

អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញ បានព្រមានថា នឹងចេញដីកាបញ្ជាឱ្យបង្ខំ និងឱ្យឃុំខ្លួនតាមនីតិវិធី ប្រសិនបើលោក រ៉ុង ឈុន ដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិ មិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀលឱ្យបានមុនថ្ងៃទី០៤ខែមីនា ឆ្នាំ២០២៤ទេនោះ
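The joining loop above can be wrapped in a reusable helper. This is a minimal sketch, assuming only the `(token, punct, punct_id)` triple shape shown in the usage code; `join_punctuated` is a hypothetical name, and the sample triples are hand-written for illustration rather than real model output (in particular, the `punct` value returned for tag IDs is an assumption).

```python
from typing import Iterable, List, Tuple

# (token, punct, punct_id), the shape yielded by punctuate() in the usage code
Triple = Tuple[str, str, int]


def join_punctuated(triples: Iterable[Triple], first_tag_id: int = 7) -> str:
    """Concatenate tokens, appending punctuation only for IDs below first_tag_id."""
    parts: List[str] = []
    for token, punct, punct_id in triples:
        parts.append(token)
        if punct_id < first_tag_id:
            parts.append(punct)  # literal punctuation or a space
        # IDs >= first_tag_id are span tags and contribute no text
    return "".join(parts)


# Hand-written triples for illustration only; not real model output.
sample = [
    ("hello", " ", 1),
    ("2024", "B-NUMBER", 7),
    ("world", "?", 4),
]
print(join_punctuated(sample))  # hello 2024world?
```

With the real library, the call would look like `join_punctuated(punctuate(tokens))`. Collecting parts in a list and joining once avoids repeated string concatenation on long inputs.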

Example

A runnable example is available on [Google Colab].

The model file is hosted on [HuggingFace].

License

MIT
