Skip to content

longyuewangdcu/GuoFeng-Webnovel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

74 Commits
 
 
 
 

Repository files navigation

Logo

🀄 GuoFeng Webnovel: A Discourse-Level and Multilingual Corpus of Web Fiction

License Stars Issues

¹ Tencent AI Lab, ² China Literature Ltd.

*Longyue Wang¹ is the corresponding author: vinnlywang@tencent.com

GuoFeng Webnovel is a publicly copyrighted, high-quality, discourse-level and multilingual corpus of web fiction. Specifically, its distinguishing feature lies in:

  • Rich Linguistic and Cultural Phenomena: literary texts contain more complex linguistic and cultural knowledge than non-literary ones.
  • Long-Range Context: literature such as novels have much longer contexts than texts in other domains.
  • Artificial General Intelligence: we anticipate that this dataset will not only advance existing research in the fields of machine translation but also inspire novel studies with Large Language Models.

News 🤩🤩🤩

  • [20/05/2024] 🛰️🛰️🛰️ GuoFeng Webnovel Corpus V2 is released: two discourse-level datasets for Chinese→German and Chinese→Russian.
  • [15/05/2023] WMT23 Shared Task: Discourse-Level Literary Translation
  • [15/05/2024] 🎉🎉🎉 GuoFeng Github is online now 🎉🎉🎉
  • [06/05/2023] 🚀🚀🚀 GuoFeng Webnovel Corpus V1 is released: a discourse-level dataset with sentence-level alignment for Chinese→English.
  • [14/04/2023] WMT23 Shared Task: Discourse-Level Literary Translation

Overview of GuoFeng Webnovel Corpus 💕

The dataset covers 14 genres such as fantasy science and romance. The detailed statistics are shown as follows.

domain

The word cloud of high-frequency words in different langauges as shown follows.

word

The data example sampled from Chinese-English set, and the colored words demonstrate rich linguistic phenomena.

word

Copyright and Licence

Copyright is a crucial consideration when it comes to releasing literary texts, and we (Tencent AI Lab and China Literature Ltd.) are the rightful copyright owners of the web fictions included in this dataset. We are pleased to make this data available to the research community, subject to certain terms and conditions.

  • 🔔 GuoFeng Webnovel Corpus is copyrighted by Tencent AI Lab and China Literature Limited.
  • 🚦 After completing the registration process with your institute information, WMT participants or researchers are granted permission to use the dataset solely for non-commercial research purposes and must comply with the principles of fair use (CC-BY 4.0).
  • 🔒 Modifying or redistributing the dataset is strictly prohibited. If you plan to make any changes to the dataset, such as adding more annotations, with the intention of publishing it publicly, please contact us first to obtain written consent.
  • 🚧 By using this dataset, you agree to the terms and conditions outlined above. We take copyright infringement very seriously and will take legal action against any unauthorized use of our data.

Citation ❗❗❗

📝 If you use the GuoFeng Webnovel Corpus, please cite the following papers and claim the original download link:

@inproceedings{wang2023findings,
  title={Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs},
  author={Wang, Longyue and Tu, Zhaopeng and Gu, Yan and Liu, Siyou and Yu, Dian and Ma, Qingsong and Lyu, Chenyang and Zhou, Liting and Liu, Chao-Hong and Ma, Yufeng and others},
  booktitle={Proceedings of the Eighth Conference on Machine Translation},
  pages={55--67},
  year={2023}
}

@inproceedings{wang2024findings,
  title={Findings of the WMT 2024 Shared Task on Discourse-Level Literary Translation},
  author={Wang, Longyue and Liu, Siyou and Wu, Minghao and Jiao, Wenxiang and Wang, Xing and Xu, Jiahao and Tu, Zhaopeng and Zhou, Liting and Gu, Yan and Chen, Weiyu and Koehn, Philipp and Way, Andy and Yuan, Yulin},
  booktitle={Proceedings of the Ninth Conference on Machine Translation},
  year={2024}
}

Download Link: https://github.com/longyuewangdcu/GuoFeng-Webnovel.

Data Processing

💌 The web novels are originally written in Chinese by novel writers and then translated into other languages by professional translators. Taking Chinese-English for instance, we processed the data using automatic and manual methods:

  1. we match Chinese books with its English counterparts based on bilingual titles;
  2. within each book, Chinese-English chapters are aligned using Chapter ID numbers;
  3. within each chapter, we build a MT-based sentence aligner to align sentences in parallel, preserving the sentence order in the chapter;
  4. human annotators are engaged to review and correct any discrepancies in sentence-level alignment.

💡 Note that:

  1. Some sentences may have no aligned translations because human translators translate novels in a document way;
  2. For Chinese-German and Chinese-Russian, we currently skip 3~4 steps and only keep chapter-level parallel data. The current version may contain some translation errors, e.g. mistranslation.
Logo

Data Description (GuoFeng Webnovel Corpus V1) 1️⃣

Chinese→English

We release 22,567 continuous chapters from 179 web novels, covering 14 genres such as fantasy science and romance. The data are document-level with cross-sentence alignment information. The data statistics are listed as follows.

Table1 Book Chapter Sentence Notes
Train 179 22,567 1,939,187 14 genres
Valid 1 22 22 755 same books with Train
Test 1 26 22 697 same books with Train
Valid 2 10 10 853 different books with Train
Test 2 12 12 917 different books with Train
Testing Input - - - TBA

Data Format 💾

Taking "train.en" for exaple, the data format is shown as follows: <BOOK id=""> </BOOK> indicates a book boundary, which contains a number of continous chapters with the tag <CHAPTER id=""> </CHAPTER>. The contents are splited into sentences and manually aligned to Chinese sentences in "train.zh".

<BOOK id="100-jdxx">
<CHAPTER id="jdxx_0001">
Chapter 1 Make Your Choice, Youth
"Implode reality, pulverize thy spirit. By banishing this world, comply with the blood pact, I will summon forth thee, O' young Demon King!"
At a park during sunset, a childlike, handsome youth placed his left hand on his chest, while his right hand was stretched out with his fingers wide open, as though he was about to release something amazing from his palm. He looked serious and solemn.
... ...
</CHAPTER>
<CHAPTER id="jdxx_0002">
....
</CHAPTER>
</BOOK>

Data Description (GuoFeng Webnovel Corpus V2) 2️⃣

We release ~19K continuous chapters from ~120 web novels, covering 14 genres such as fantasy science and romance. The data are document-level without alignment information. The data statistics are listed as follows.

Chinese→German

Subset # Book # Chapter # X Word / CH Char Notes
Train 118 19,101 25,562,039 / 36,790,017 14 genres
Valid -- -- -- --
Test -- -- -- --
Testing Input -- -- -- --

Chinese→Russian

Subset # Book # Chapter # X Word / CH Char Notes
Train 122 19,971 23,521,169 / 39,074,007 14 genres
Valid -- -- -- --
Test -- -- -- --
Testing Input -- -- -- --

Data Format 💾

Data Format: Taking Chinese-German for exaple, the data format is shown as follows: (1) 1-ac, 2-ccg, ...... indicates book-level folders. (2) In "1-ac" folder, 15-jlws_0001-CH.txt, 15-jlws_0001-DE.txt, .... are continous chapters in Chinese and German languages. (3) In each file, there is no tags and sentence-level alignment information.

.
    ├── 1-ac                       # Book ID - English Title
    │   ├── 15-jlws_0001-CH.txt    # Chapter ID - Chinese
    │   ├── 15-jlws_0001-DE.txt    # Chapter ID - German
    │   ├── ......                 # more chapters
    ├── 2-ccg                      # Book ID - English Title
    │   ├── 62-xzltq_0002-CH.txt   # Chapter ID - Chinese
    │   ├── 62-xzltq_0002-DE.txt   # Chapter ID - German
    │   ├── ......                 # more chapters
    ├── ......                     # more books
	
15-jlws_0001-CH.txt

第一章 李戴
李戴走出考场,穿梭在密密麻麻的人群当中。看着周围那一张张春风得意的脸,耳边响起路人兴高采烈的讨论声,李戴心中却愈加的沮丧。
“哎,考砸了!想进入到面试是肯定没戏了。”李戴揉了揉太阳穴,头脑中那种沉甸甸的感觉却愈发的浓郁。

15-jlws_0001-DE.txt

Kapitel 1: Li Dai
Li Dai verließ das Prüfungszentrum und bewegte sich durch die dichte Menschenmenge. Er sah die triumphierenden Gesichter um ihn herum und hörte die enthusiastischen Diskussionen der Passanten, doch in seinem Herzen wurde er immer deprimierter.
"Oh, ich habe die Prüfung vergeigt! Eine Chance auf ein Vorstellungsgespräch gibt es sicherlich nicht mehr." Li Dai massierte seine Schläfen, das schwere Gefühl in seinem Kopf wurde immer intensiver.

Pretrained Models 🔢

We provide three types of in-domain pretrained models (same as last year) and large language models (new in this year):

Version Layer Hidden Size Vocab Continuous Train
Chinese-Llama-2-7B 32 4096 32000 Chinese and English literary texts (115B tokens)
RoBERTa base 12 enc 768 21128
Chinese literary text (84B tokens) mBARTCC25 12 enc + 12 dec 1024 250000

Download ⏬

Data Download 👨‍👩

The GuoFeng Webnovel Corpus V1 and V2 can be download via Github: (1) Go to "Download" Section and click the buttion; (2) Fill out the registration form and you will get the link in the Final Page.

🎈 Download GuoFeng Webnovel Corpus (via Google Form and Dropbox) 🎈
🎈 Download GuoFeng Webnovel Corpus (via Tencent Form and Weiyun) 🎈

Models Download 🤖

🎈 Download Chinese-Llama-2 🎈
🎈 Download RoBERTa & mBART 🎈

Committee

Data Team 🧑🏻‍💼

Longyue Wang* (vincentwang0229@gmail.com) (Tencent AI Lab)
Zhaopeng Tu (Tencent AI Lab)
Yan Gu (China Literature Limited)
Weiyu Chen (China Literature Limited)

Technical Team 🧑‍🏫

Jiahao Xu (Tencent AI Lab)
Wenxiang Jiao (Tencent AI Lab)
Xiang Wang (Tencent AI Lab)

Contact ☎️

If you have any further questions or suggestions, please do not hesitate to send an email to Longyue Wang (vincentwang0229@gmail.com or vinnylywang@tencent.com).

Sponsors 🙏🙏🙏

Logo Logo

Star History

Star History Chart