Add normalization option to Chinese characters (using OpenCC) and separate symbols from merging #473

ecchochan · 2020-10-19T05:21:32Z

I was working on two things and would like to contribute them to this library in case others need them:

- normalize Chinese characters
- prevent symbols from merging

Features implemented:

Add a normalization option to normalizers::BertNormalizer - norm_options
- prevent the digits 0-9 from merging - SEPARATE_INTEGERS
  e.g. 100 → 1 0 0
- prevent symbols from merging - SEPARATE_SYMBOLS
  e.g. Mr. Stark → Mr . Stark
- convert Simplified to Traditional characters - SIMPL_TO_TRAD
  e.g. 头 → 頭
- convert Traditional to Simplified characters - TRAD_TO_SIMPL
  e.g. 頭 → 头
- hand-crafted character mapping - ZH_NORM_MAPPING
  e.g. 【 → [
Check whether OpenCC is installed and can be used - normalizers::opencc_enabled
Test cases
- Separate Symbols
- Simpl. to Trad. characters
Expose the above features to Python
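
The separation and mapping options above can be illustrated with a small pure-Python sketch. This is not the PR's Rust implementation: the function names are hypothetical, whitespace is collapsed for readability (a real normalizer would track offsets), and the mapping table contains only the one pair given above plus one assumed counterpart.

```python
import re

# Hypothetical table: "【" → "[" comes from the description above;
# "】" → "]" is an assumed counterpart added for illustration.
ZH_NORM_MAPPING = str.maketrans({"【": "[", "】": "]"})


def separate_integers(text: str) -> str:
    """Surround every ASCII digit with spaces so digits cannot merge."""
    spaced = re.sub(r"([0-9])", r" \1 ", text)
    return " ".join(spaced.split())  # collapse the extra whitespace


def separate_symbols(text: str) -> str:
    """Surround every non-word, non-space character with spaces."""
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    return " ".join(spaced.split())


def apply_mapping(text: str) -> str:
    """Apply the hand-crafted character mapping."""
    return text.translate(ZH_NORM_MAPPING)


print(separate_integers("100"))       # 1 0 0
print(separate_symbols("Mr. Stark"))  # Mr . Stark
print(apply_mapping("【"))            # [
```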

How to install OpenCC:

I used the following script to install OpenCC.

sudo su

apt-get install -y build-essential pkg-config opencc cmake doxygen

git clone https://github.com/BYVoid/OpenCC.git && cd OpenCC

git checkout ver.1.1.1  # or whatever version

make && make install
cd .. && rm -r OpenCC
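
After installing, a quick best-effort check from Python can confirm that OpenCC is visible on the system. This sketch uses only the standard library and is independent of the PR's normalizers::opencc_enabled check; the function name opencc_available is hypothetical, and a positive result does not guarantee that the Rust build will link against the library.

```python
import ctypes.util
import shutil


def opencc_available() -> bool:
    """Return True if an OpenCC shared library or CLI binary can be located."""
    has_library = ctypes.util.find_library("opencc") is not None
    has_cli = shutil.which("opencc") is not None
    return has_library or has_cli


print("OpenCC found" if opencc_available() else "OpenCC not found")
```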

Usage

from tokenizers.normalizers import BertNormalizer, NORM_OPTIONS
normalizer = BertNormalizer(
  norm_options=(
    NORM_OPTIONS.ZH_NORM_MAPPING | 
    NORM_OPTIONS.SIMPL_TO_TRAD | 
    NORM_OPTIONS.SEPARATE_INTEGERS | 
    NORM_OPTIONS.SEPARATE_SYMBOLS
    # Enable the options according to your needs
  )
)
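
The options are combined with `|`, which suggests bit-flag semantics. A minimal sketch of how such flags compose and can be tested, assuming an IntFlag-like design; the class name NormOptions and the bit values here are hypothetical and may differ from the PR's actual constants:

```python
from enum import IntFlag


class NormOptions(IntFlag):
    # Hypothetical bit values, chosen only for illustration.
    SEPARATE_INTEGERS = 1
    SEPARATE_SYMBOLS = 2
    SIMPL_TO_TRAD = 4
    TRAD_TO_SIMPL = 8
    ZH_NORM_MAPPING = 16


def is_enabled(options: NormOptions, flag: NormOptions) -> bool:
    """Check whether a single flag is set in a combined option value."""
    return bool(options & flag)


opts = NormOptions.ZH_NORM_MAPPING | NormOptions.SEPARATE_SYMBOLS
print(is_enabled(opts, NormOptions.SEPARATE_SYMBOLS))  # True
print(is_enabled(opts, NormOptions.SIMPL_TO_TRAD))     # False
```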

Limitations

I used lazy_static to load OpenCC so that it is not loaded unnecessarily when the option is disabled.
As a result, if the first tokenizer created does not have OpenCC enabled, tokenizers created afterwards cannot use OpenCC either.
Better solutions are welcome 🤒
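
One possible direction, sketched in Python rather than Rust, is to defer loading to the first call that actually needs a converter, so the decision does not depend on which tokenizer was created first. Everything here is a stand-in: get_converter and Normalizer are hypothetical names, and a tiny translation table replaces the real OpenCC load.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_converter(config: str):
    """Build the converter on the first request for a config, then reuse it."""
    # Placeholder for an expensive OpenCC load; a tiny table stands in for it.
    table = {"s2t": {"头": "頭"}, "t2s": {"頭": "头"}}[config]
    return str.maketrans(table)


class Normalizer:
    def __init__(self, config=None):
        self.config = config  # None means no conversion was requested

    def normalize(self, text: str) -> str:
        if self.config is None:
            return text  # never triggers the loader
        return text.translate(get_converter(self.config))


plain = Normalizer()         # created first, conversion disabled
zh = Normalizer("s2t")       # created later, conversion enabled
print(plain.normalize("头"))  # 头
print(zh.normalize("头"))     # 頭
```

The cache is keyed by configuration, so a normalizer created without conversion never loads anything, and a later one can still obtain its converter.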

Updates (2020/10/19)

The checks failed because the rust-opencc dependency requires the OpenCC library to be installed.

I have made opencc an optional feature instead.

Install using:

# Rust
cargo build --features opencc

# Python
python3 setup.py install --opencc

ecchochan added 26 commits October 14, 2020 11:58

update

3dbe3eb

Merge branch 'master' into zh-norm-3

0d941d1

[feature] Add Chinese Normalization to BertNormalizer w/ test cases

fb24cd7

Merge remote-tracking branch 'upstream/master' into zh-norm-3

d14f12b

Make OpenCC a feature in Rust

50d6b28

Formatting

86795b6

Formatting

924a0c7

Try to make feature optional, and workflow without this feature

ef94c76

Formatting :S

440a7c5

Node - Add bindings norm_options

f1bc628

Formatting :S

cfc986e

Merge commit '180371d92945c13d13985b21c893edc4074ab3a2' into zh-norm-3

ecde2ff

Merge commit '2364d376f7c1ccfe14389d71a1308e40635c0af3' into zh-norm-3

b28a602

Merge commit '8f03d6ddc1f5d503160aac9082a94ed0006aca43' into zh-norm-3

37296f4

# Conflicts: # bindings/node/lib/bindings/normalizers.js # bindings/node/native/src/normalizers.rs

Merge commit '4929809af087a0a200ced77506b5bffc565faa6f' into zh-norm-3

cb2b4fe

Merge commit '655809c718933374bc28b668689429487642d585' into zh-norm-3

8e1bb7f

no message

e9c7db2

Merge commit '558f2d87795ffc9d9786f1e923398e3eebe14187' into zh-norm-3

54342ec

# Conflicts: # docs/source/_static/js/custom.js # docs/source/api/python.inc # docs/source/components.rst # docs/source/conf.py # docs/source/index.rst # docs/source/quicktour.rst

Merge commit 'dc60d4fc0c940c7c24962aec996150cd9708430f' into zh-norm-3

4d2152c

# Conflicts: # bindings/python/Cargo.lock # bindings/python/py_src/tokenizers/normalizers/__init__.pyi

Merge commit '6e364cb685858dab4d19a8ac79176588053c8c0e' into zh-norm-3

a6c0f65

# Conflicts: # bindings/python/src/normalizers.rs # bindings/python/tests/bindings/test_normalizers.py # tokenizers/src/normalizers/bert.rs

Merge commit 'd71e66e53c8107be2b0c767a897a94b7bb221791' into zh-norm-3

6aaeb79

Merge commit '2c711d45ce20c08d9ef67167797c80b767a758e6' into zh-norm-3

0ad10e1

Merge commit 'bc8bbf637ad5a936e141570dfe583f3e62e32a94' into zh-norm-3

3d06542

fix for linting

b105bf3

formatting

c8152a9

Update setup.py

0f8370c

github-actions bot added the Stale label May 10, 2024

github-actions bot closed this May 15, 2024
