Introduction:
Text
summarization is one of the major tasks of natural language processing that
automatically converts text into a short summary. Some summarization systems,
for example, for short English, long English, and short Chinese text, have
benefited from advances in the neural encoder-decoder model because of the
availability of large datasets. However, the summarization of long Chinese text
has been limited to datasets of a couple of hundred instances. The aim of this
paper is to explore the long Chinese text summarization task. To begin with, a
first large-scale long Chinese text summarization corpus, the Long Chinese Summarization of
Police Inquiry Record Text (LcsPIRT) is introduced.
Additionally, a sequence-to-sequence (Seq2Seq) model that incorporates a global
encoding process with an attention mechanism is proposed, and its competitive
results on the LcsPIRT corpus are achieved compared with several benchmarked
methods. Finally, a flexible model selection strategy is present by comparing
and analyzing the performance of various models on different subsets of the
corpus. This strategy is helpful for choosing the right Seq2Seq model based on
different attributes of the source text.
Data Property:
This corpus
includes two sub-corpora: LcsPIRT in a Dialogue Manner (LcsPIRT-DM) and LcsPIRT
in a Paragraph Manner (LcsPIRT-PM). Each sub-corpus contains 38,500
text-summary pairs, hence, the two sub-corpora contain a total of 77,000
text-summary pairs. The text in the corpus is desensitized to ensure that the
privacy of the person involved is not exposed. The dataset consists of two
parts shown as Table 2.
Download:
If you want to acquire the corpus. Please fill the application form and send to Xuefeng Xi: xfxi@usts.edu.cn or Zhou Pi: pizhou@post.usts.edu.cn [application]
Copyright
Notice:
1.Respect the
privacy of personal information of the original source.
2.The original
copyright of all the data of the Long
Chinese Summarization of Police Inquiry Record Text corpus
belongs to the NLP Lab. of SUST, Suzhou University of Science and Technology,
organizes, filters and purifies them. LcsPIRT is free to the public.
3.If you want to
use the dataset for depth study, data providers (the NLP Lab. of SUST,
Suzhou University of Science and Technology) should be identified in your
results.
4.The dataset is
only for the specified applicant or study groups for research purposes. Without
permission, it may not be used for any commercial purposes.
5.If the terms
changed, the latest online version shall prevail.