LcsPIRT: A Large-Scale Long Chinese Text Summarization Corpus

Introduction:

Text summarization is one of the major tasks of natural language processing that automatically converts text into a short summary. Some summarization systems, for example, for short English, long English, and short Chinese text, have benefited from advances in the neural encoder-decoder model because of the availability of large datasets. However, the summarization of long Chinese text has been limited to datasets of a couple of hundred instances. The aim of this paper is to explore the long Chinese text summarization task. To begin with, a first large-scale long Chinese text summarization corpus, the Long Chinese Summarization of Police Inquiry Record Text (LcsPIRT) is introduced. Additionally, a sequence-to-sequence (Seq2Seq) model that incorporates a global encoding process with an attention mechanism is proposed, and its competitive results on the LcsPIRT corpus are achieved compared with several benchmarked methods. Finally, a flexible model selection strategy is present by comparing and analyzing the performance of various models on different subsets of the corpus. This strategy is helpful for choosing the right Seq2Seq model based on different attributes of the source text.

Data Property:

This corpus includes two sub-corpora: LcsPIRT in a Dialogue Manner (LcsPIRT-DM) and LcsPIRT in a Paragraph Manner (LcsPIRT-PM). Each sub-corpus contains 38,500 text-summary pairs, hence, the two sub-corpora contain a total of 77,000 text-summary pairs. The text in the corpus is desensitized to ensure that the privacy of the person involved is not exposed. The dataset consists of two parts shown as Table 2.

Download:

If you want to acquire the corpus. Please fill the application form and send to Xuefeng Xi: xfxi@usts.edu.cn or Zhou Pi: pizhou@post.usts.edu.cn [application]

Copyright Notice:

1.Respect the privacy of personal information of the original source.

2.The original copyright of all the data of the Long Chinese Summarization of Police Inquiry Record Text corpus belongs to the NLP Lab. of SUST, Suzhou University of Science and Technology, organizes, filters and purifies them. LcsPIRT is free to the public.

3.If you want to use the dataset for depth study, data providers (the NLP Lab. of SUST, Suzhou University of Science and Technology) should be identified in your results.

4.The dataset is only for the specified applicant or study groups for research purposes. Without permission, it may not be used for any commercial purposes.

5.If the terms changed, the latest online version shall prevail.