code and data for "CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers"

Data

Due to the lack of GitHub LFS quota, the data folder is hosted on Google Drive (link) or Baidu Netdisk (link); download it and move it to the main directory.

CSCD-NS

CSCD-NS is a Chinese Spelling Check dataset for native speakers, containing 40,000 annotated sentences drawn from real posts by official media accounts on Sina Weibo.

training set   development set   test set   all
30,000         5,000             5,000      40,000

It can be found in ./data/cscd-ime. The data format is label \t origin \t corrected, where label indicates whether the original sentence is correct or contains errors.
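
A minimal loading sketch for the tab-separated format described above. The file name train.tsv is only an assumption for illustration; adjust it to the actual files shipped in ./data/cscd-ime.

# Sketch: load one CSCD file with "label \t origin \t corrected" per line.
# The path "./data/cscd-ime/train.tsv" is a hypothetical example.
def load_cscd(path):
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            label, origin, corrected = line.split("\t")
            samples.append({
                "label": label,          # whether the original sentence contains errors
                "origin": origin,        # possibly erroneous sentence
                "corrected": corrected,  # gold corrected sentence
            })
    return samples

samples = load_cscd("./data/cscd-ime/train.tsv")
print(len(samples), samples[0])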

LCSTS-IME-2M

LCSTS-IME-2M is a large-scale, high-quality pseudo dataset for the CSC task, containing over 2 million samples. The data comes from the LCSTS dataset and is constructed by simulating text input through a pinyin IME. It can be found in ./data/lcsts-ime-2m, and the format is the same as CSCD-IME.

Build Pseudo Dataset

Install requirements:

pip install https://github.com/kpu/kenlm/archive/master.zip
pip install -r requirements.txt

Build:

cd pseudo-data-construction
python build.py

The script takes the sentence 一不小心选到了错误的方向 as an example, and you will obtain constructed pseudo data like this:

{
  "origin": "一不小心选到了错误的方向",
  "noise": "一部小心选到了错误的方向",
  "details": [
    {
      "start": 0,
      "end": 2,
      "origin_token": "一不",
      "noise_token": "一部",
      "pinyin_type": "same",
      "pinyin_token": "yibu",
      "ppl_improve": 105.68778749231029
    }
  ]
}
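
A short sketch of consuming such records: each entry in details marks the span replaced by the IME simulation in both the original and the noised sentence. Reading the output as one JSON object per line is an assumption, and the path below is hypothetical; check how build.py actually serializes its results.

import json

# Hypothetical output path; adjust to whatever build.py writes.
with open("pseudo-data-construction/output.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for d in record["details"]:
            s, e = d["start"], d["end"]
            # Each detail marks the replaced span in both sentences.
            assert record["origin"][s:e] == d["origin_token"]
            assert record["noise"][s:e] == d["noise_token"]
            print(d["origin_token"], "->", d["noise_token"],
                  d["pinyin_type"], "pinyin, ppl improve", round(d["ppl_improve"], 2))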

Evaluation

We provide an evaluation script to evaluate CSC systems, reporting precision, recall, and F1 for detection and correction at both the sentence level and the character level.

cd evaluation
python evaluate.py

This evaluation script will evaluate the prediction results of a BERT model on CSCD-IME and generate a report file.
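
In the report below, the S_/C_ prefixes denote sentence/character level, D/C denote detection/correction, and p/r/f1 denote precision, recall, and F1. For reference, here is a simplified sketch of how sentence-level metrics can be computed from (origin, gold, prediction) triples; this is an illustration only, and evaluate.py may use stricter definitions (e.g. requiring the detected positions to match exactly).

# Illustrative sketch of sentence-level detection/correction metrics.
def sentence_level_metrics(samples):
    def prf(tp, n_pred, n_gold):
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_gold if n_gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    n_pred = sum(1 for o, g, p in samples if p != o)           # sentences the model changed
    n_gold = sum(1 for o, g, p in samples if g != o)           # sentences with errors
    d_tp = sum(1 for o, g, p in samples if p != o and g != o)  # changed an erroneous sentence
    c_tp = sum(1 for o, g, p in samples if p != o and p == g)  # changed it into the gold sentence
    return {"detection": prf(d_tp, n_pred, n_gold),
            "correction": prf(c_tp, n_pred, n_gold)}

print(sentence_level_metrics([
    ("含笨的清洁剂", "含苯的清洁剂", "含苯的清洁剂"),  # corrected successfully
]))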

Report file preview:

overview:
S_D_p:79.164
S_D_r:65.827
S_D_f1:71.882
S_C_p:70.548
S_C_r:58.663
S_C_f1:64.059
C_D_p:82.999
C_D_r:67.009
C_D_f1:74.152
C_C_p:73.591
C_C_r:59.415
C_C_f1:65.748

bad cases:
原始 (original): 接受该报采访的患者家属与劳工组织认为,这与员工上班时接触某些含笨的清洁剂有关
正确 (gold): 接受该报采访的患者家属与劳工组织认为,这与员工上班时接触某些含【苯】的清洁剂有关
预测 (prediction): 接受该报采访的患者家属与劳工组织认为,这与员工上班时接触某些含【笨】的清洁剂有关
错误类型 (error type): 漏纠 (missed correction)
...

Citation

CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [pdf]

@misc{hu2024cscdns,
      title={CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers}, 
      author={Yong Hu and Fandong Meng and Jie Zhou},
      year={2024},
      eprint={2211.08788},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
