code and data for "CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers"

Data

Due to the lack of GitHub LFS quota, the data folder is hosted on Google Drive (link) or Baidu Netdisk (link); download it and move it to the main directory.

CSCD-NS

CSCD-NS is a Chinese Spelling Check dataset for native speakers, containing 40,000 annotated sentences drawn from real posts by official media accounts on Sina Weibo.

training set   development set   test set   all
30,000         5,000             5,000      40,000

It can be found in ./data/cscd-ime. The data format is label \t origin \t corrected, where label indicates whether the original sentence is correct or contains errors.
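
A minimal loading sketch for the tab-separated format described above. The file name train.tsv is only an assumption for illustration; adjust it to the actual files shipped in ./data/cscd-ime.

# Sketch: load one CSCD file with "label \t origin \t corrected" per line.
# The path "./data/cscd-ime/train.tsv" is a hypothetical example.
def load_cscd(path):
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            label, origin, corrected = line.split("\t")
            samples.append({
                "label": label,          # whether the original sentence contains errors
                "origin": origin,        # possibly erroneous sentence
                "corrected": corrected,  # gold corrected sentence
            })
    return samples

samples = load_cscd("./data/cscd-ime/train.tsv")
print(len(samples), samples[0])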

LCSTS-IME-2M

LCSTS-IME-2M is a large-scale, high-quality pseudo dataset for the CSC task, containing over 2 million samples. The data comes from the LCSTS dataset and is constructed by simulating text input through a pinyin IME. It can be found in ./data/lcsts-ime-2m, and the format is the same as CSCD-IME.

Build Pseudo Dataset

Install requirements:

pip install https://github.com/kpu/kenlm/archive/master.zip
pip install -r requirements.txt

Build:

cd pseudo-data-construction
python build.py

The script takes the sentence 一不小心选到了错误的方向 as an example, and you will obtain constructed pseudo data like this:

{
  "origin": "一不小心选到了错误的方向",
  "noise": "一部小心选到了错误的方向",
  "details": [
    {
      "start": 0,
      "end": 2,
      "origin_token": "一不",
      "noise_token": "一部",
      "pinyin_type": "same",
      "pinyin_token": "yibu",
      "ppl_improve": 105.68778749231029
    }
  ]
}
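
A short sketch of consuming such records: each entry in details marks the span replaced by the IME simulation in both the original and the noised sentence. Reading the output as one JSON object per line is an assumption, and the path below is hypothetical; check how build.py actually serializes its results.

import json

# Hypothetical output path; adjust to whatever build.py writes.
with open("pseudo-data-construction/output.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for d in record["details"]:
            s, e = d["start"], d["end"]
            # Each detail marks the replaced span in both sentences.
            assert record["origin"][s:e] == d["origin_token"]
            assert record["noise"][s:e] == d["noise_token"]
            print(d["origin_token"], "->", d["noise_token"],
                  d["pinyin_type"], "pinyin, ppl improve", round(d["ppl_improve"], 2))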

Evaluation

We provide an evaluation script to evaluate CSC systems, reporting precision, recall, and F1 for detection and correction at both the sentence level and the character level.

cd evaluation
python evaluate.py

This evaluation script will evaluate the prediction results of a BERT model on CSCD-IME and generate a report file.
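
In the report below, the S_/C_ prefixes denote sentence/character level, D/C denote detection/correction, and p/r/f1 denote precision, recall, and F1. For reference, here is a simplified sketch of how sentence-level metrics can be computed from (origin, gold, prediction) triples; this is an illustration only, and evaluate.py may use stricter definitions (e.g. requiring the detected positions to match exactly).

# Illustrative sketch of sentence-level detection/correction metrics.
def sentence_level_metrics(samples):
    def prf(tp, n_pred, n_gold):
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_gold if n_gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    n_pred = sum(1 for o, g, p in samples if p != o)           # sentences the model changed
    n_gold = sum(1 for o, g, p in samples if g != o)           # sentences with errors
    d_tp = sum(1 for o, g, p in samples if p != o and g != o)  # changed an erroneous sentence
    c_tp = sum(1 for o, g, p in samples if p != o and p == g)  # changed it into the gold sentence
    return {"detection": prf(d_tp, n_pred, n_gold),
            "correction": prf(c_tp, n_pred, n_gold)}

print(sentence_level_metrics([
    ("含笨的清洁剂", "含苯的清洁剂", "含苯的清洁剂"),  # corrected successfully
]))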

Report file preview:

overview:
S_D_p:79.164
S_D_r:65.827
S_D_f1:71.882
S_C_p:70.548
S_C_r:58.663
S_C_f1:64.059
C_D_p:82.999
C_D_r:67.009
C_D_f1:74.152
C_C_p:73.591
C_C_r:59.415
C_C_f1:65.748

bad cases:
原始 (original): 接受该报采访的患者家属与劳工组织认为,这与员工上班时接触某些含笨的清洁剂有关
正确 (gold): 接受该报采访的患者家属与劳工组织认为,这与员工上班时接触某些含【苯】的清洁剂有关
预测 (prediction): 接受该报采访的患者家属与劳工组织认为,这与员工上班时接触某些含【笨】的清洁剂有关
错误类型 (error type): 漏纠 (missed correction)
...

Citation

CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [pdf]

@misc{hu2024cscdns,
      title={CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers}, 
      author={Yong Hu and Fandong Meng and Jie Zhou},
      year={2024},
      eprint={2211.08788},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
