Add normalization option to Chinese characters (using OpenCC) and separate symbols from merging #473

Closed
wants to merge 26 commits into from

@ecchochan ecchochan commented Oct 19, 2020

I was working on two things and would like to contribute them to this library in case someone else needs them:

  1. Normalize Chinese characters
  2. Prevent symbols from merging

Features implemented:

  1. Add a normalization option to normalizers::BertNormalizer - norm_options

    • prevent the digits 0-9 from merging - SEPARATE_INTEGERS
      e.g. 100 → 1 0 0
    • prevent symbols from merging - SEPARATE_SYMBOLS
      e.g. Mr. Stark → Mr . Stark
    • convert Simplified to Traditional characters - SIMPL_TO_TRAD
      e.g.
    • convert Traditional to Simplified characters - TRAD_TO_SIMPL
      e.g.
    • hand-crafted character mapping - ZH_NORM_MAPPING
      e.g. [
  2. Check whether OpenCC is installed and can be used - normalizers::opencc_enabled

  3. Test cases

    • Separate Symbols
    • Simpl. to Trad. characters
  4. Expose the above features to Python
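
To make the intent of SEPARATE_INTEGERS and SEPARATE_SYMBOLS concrete, here is a minimal pure-Python sketch of the effect they are described as having. It is not the PR's Rust implementation: the function names `separate_integers` and `separate_symbols` are hypothetical, "symbol" is assumed to mean any character in a Unicode punctuation or symbol category, and the sketch collapses runs of whitespace, which the real normalizer may not do.

```python
import unicodedata


def separate_integers(text: str) -> str:
    """Put spaces around every ASCII digit so digits cannot merge into one token."""
    spaced = "".join(f" {ch} " if ch in "0123456789" else ch for ch in text)
    # Collapse the extra whitespace introduced above (an assumption of this sketch).
    return " ".join(spaced.split())


def separate_symbols(text: str) -> str:
    """Put spaces around every punctuation or symbol character."""

    def is_symbol(ch: str) -> bool:
        # Assumption: Unicode categories starting with P (punctuation) or S (symbol).
        return unicodedata.category(ch).startswith(("P", "S"))

    spaced = "".join(f" {ch} " if is_symbol(ch) else ch for ch in text)
    return " ".join(spaced.split())


print(separate_integers("100"))
print(separate_symbols("Mr. Stark"))
```

With these assumptions, the two calls reproduce the examples from the list above: each digit of 100 becomes its own whitespace-delimited unit, and the period in Mr. Stark is detached from the preceding word.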

How to install OpenCC:

I used the following script to install OpenCC.

sudo su

apt-get install -y build-essential pkg-config opencc cmake doxygen

git clone https://github.com/BYVoid/OpenCC.git && cd OpenCC

git checkout ver.1.1.1  # or whatever version

make && make install
cd .. && rm -r OpenCC

Usage

from tokenizers.normalizers import BertNormalizer, NORM_OPTIONS
normalizer = BertNormalizer(
  norm_options=(
    NORM_OPTIONS.ZH_NORM_MAPPING | 
    NORM_OPTIONS.SIMPL_TO_TRAD | 
    NORM_OPTIONS.SEPARATE_INTEGERS | 
    NORM_OPTIONS.SEPARATE_SYMBOLS
    # Enable the options according to your needs
  )
)
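
The usage above combines options with `|`, which suggests they behave as bit flags. The following sketch shows how such flags can be combined and tested using the standard library's `enum.IntFlag`. The class `NormOptions`, its bit values, and the helper `is_enabled` are hypothetical stand-ins; the PR does not state how NORM_OPTIONS is actually defined.

```python
from enum import IntFlag


class NormOptions(IntFlag):
    # Hypothetical bit values; the real NORM_OPTIONS values are not given in the PR.
    ZH_NORM_MAPPING = 1
    SIMPL_TO_TRAD = 2
    TRAD_TO_SIMPL = 4
    SEPARATE_INTEGERS = 8
    SEPARATE_SYMBOLS = 16


def is_enabled(options: NormOptions, flag: NormOptions) -> bool:
    """Return True when `flag` is set inside the combined `options` value."""
    return (options & flag) == flag


# Combine two options with bitwise OR, as in the usage example.
selected = NormOptions.SIMPL_TO_TRAD | NormOptions.SEPARATE_SYMBOLS

print(is_enabled(selected, NormOptions.SEPARATE_SYMBOLS))
print(is_enabled(selected, NormOptions.SEPARATE_INTEGERS))
```

Because each option occupies its own bit, any subset can be passed as a single integer-like value, which is convenient when the value has to cross the Python/Rust boundary.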

Limitations

  1. I used lazy_static to load OpenCC so that it is not loaded unnecessarily when the option is disabled.
    As a result, if the first tokenizer created does not have OpenCC enabled, no tokenizer created afterwards will be able to use OpenCC.
    Better solutions are welcome 🤒
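
The limitation is a general property of one-time lazy initialization: whatever the first caller requests is frozen for everyone else. The sketch below is a Python analogy only, not the PR's Rust code. The name `get_converter` and the string `"opencc-handle"` are hypothetical placeholders, and it assumes that the shared state is decided from the first tokenizer's options.

```python
from typing import Optional

# Shared state, initialised at most once (analogous to a lazy static).
_converter: Optional[str] = None  # placeholder for a real OpenCC handle
_initialized = False


def get_converter(want_opencc: bool) -> Optional[str]:
    """Initialise the shared handle on first use; later calls ignore `want_opencc`."""
    global _converter, _initialized
    if not _initialized:
        # The first caller's choice is stored permanently.
        _converter = "opencc-handle" if want_opencc else None
        _initialized = True
    return _converter


# The first "tokenizer" does not enable OpenCC, the second one does.
first = get_converter(want_opencc=False)
second = get_converter(want_opencc=True)

print(first, second)
```

In this analogy the second caller receives no converter even though it asked for one. One possible direction, offered only as a suggestion, is to initialize the shared handle the first time any tokenizer actually requests OpenCC rather than when the first tokenizer is created.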

Updates (2020/10/19)

The checks failed because the rust-opencc dependency requires the OpenCC library to be installed.

I have made opencc an optional feature instead.

Install using:

# Rust
cargo build --features opencc

# Python
python3 setup.py install --opencc

# Conflicts:
#	bindings/node/lib/bindings/normalizers.js
#	bindings/node/native/src/normalizers.rs
# Conflicts:
#	docs/source/_static/js/custom.js
#	docs/source/api/python.inc
#	docs/source/components.rst
#	docs/source/conf.py
#	docs/source/index.rst
#	docs/source/quicktour.rst
# Conflicts:
#	bindings/python/Cargo.lock
#	bindings/python/py_src/tokenizers/normalizers/__init__.pyi
# Conflicts:
#	bindings/python/src/normalizers.rs
#	bindings/python/tests/bindings/test_normalizers.py
#	tokenizers/src/normalizers/bert.rs
@github-actions github-actions bot added the Stale label May 10, 2024
@github-actions github-actions bot closed this May 15, 2024