- Description:
SummScreen Summarization dataset, non-anonymized, non-tokenized version.
Train/val/test splits and filtering are based on the final tokenized dataset, but transcripts and recaps provided are based on the untokenized text.
There are two features:
- transcript: Full episode transcripts, each line of dialogue separated by newlines
- recap: Recaps or summaries of episodes
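
Since each line of dialogue in a transcript is separated by a newline, a transcript string can be broken back into utterances with plain string operations. The following is a minimal sketch that assumes only this newline convention; `split_dialogue` and the sample text are hypothetical and are not part of the dataset or of TFDS.

```python
def split_dialogue(transcript: str) -> list[str]:
    """Split a transcript into non-empty, whitespace-stripped lines."""
    return [line.strip() for line in transcript.splitlines() if line.strip()]


# Made-up sample text for illustration; not taken from the dataset.
sample = "ALICE: Where were you?\nBOB: At the lab.\n\nALICE: Again?"
lines = split_dialogue(sample)
print(len(lines))
```

Blank lines are dropped here on the assumption that they carry no dialogue; keep them if scene breaks matter for your use case.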
Homepage: https://github.com/mingdachen/SummScreen
Source code: tfds.datasets.summscreen.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: 841.27 MiB
Supervised keys (See as_supervised doc): ('transcript', 'recap')
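
The supervised keys mean that loading with `as_supervised=True` yields `(transcript, recap)` pairs instead of full feature dictionaries. The sketch below illustrates this under some assumptions: `to_supervised` is a hypothetical helper that mimics the pairing on a plain dictionary, and `load_supervised` is a hypothetical wrapper around the real `tfds.load` call, which downloads and prepares the data the first time it runs.

```python
SUPERVISED_KEYS = ("transcript", "recap")


def to_supervised(example: dict) -> tuple:
    """Mimic as_supervised=True: reduce an example to an (input, target) pair."""
    return tuple(example[key] for key in SUPERVISED_KEYS)


def load_supervised(config: str = "fd", split: str = "train"):
    """Load (transcript, recap) pairs; requires tensorflow_datasets installed."""
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    return tfds.load(f"summscreen/{config}", split=split, as_supervised=True)


# Made-up example for illustration; not taken from the dataset.
example = {
    "show_title": "Some Show",
    "transcript": "ALICE: Hello.\nBOB: Hi.",
    "recap": "Alice and Bob greet each other.",
}
pair = to_supervised(example)
print(pair[1])
```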
Figure (tfds.show_examples): Not supported.
Citation:
@article{DBLP:journals/corr/abs-2104-07091,
author = {Mingda Chen and
Zewei Chu and
Sam Wiseman and
Kevin Gimpel},
title = {SummScreen: {A} Dataset for Abstractive Screenplay Summarization},
journal = {CoRR},
volume = {abs/2104.07091},
year = {2021},
url = {https://arxiv.org/abs/2104.07091},
archivePrefix = {arXiv},
eprint = {2104.07091},
timestamp = {Mon, 19 Apr 2021 16:45:47 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-07091.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
summscreen/fd (default config)
Config description: ForeverDreaming
Dataset size: 132.99 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 337 |
'train' | 3,673 |
'validation' | 338 |
- Feature structure:
FeaturesDict({
'episode_number': Text(shape=(), dtype=string),
'episode_title': Text(shape=(), dtype=string),
'recap': Text(shape=(), dtype=string),
'show_title': Text(shape=(), dtype=string),
'transcript': Text(shape=(), dtype=string),
'transcript_author': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
| | FeaturesDict | | | |
episode_number | Text | | string | |
episode_title | Text | | string | |
recap | Text | | string | |
show_title | Text | | string | |
transcript | Text | | string | |
transcript_author | Text | | string | |
summscreen/tms
Config description: TVMegaSite
Dataset size: 592.53 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 1,793 |
'train' | 18,915 |
'validation' | 1,795 |
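
The split tables for the two configs can be compared directly. As a small sketch, the snippet below hard-codes the example counts listed on this page and computes each split's share of its config; `SPLITS` and `split_fractions` are hypothetical names introduced only for this illustration.

```python
# Example counts copied from the split tables of each config.
SPLITS = {
    "fd": {"train": 3673, "validation": 338, "test": 337},
    "tms": {"train": 18915, "validation": 1795, "test": 1793},
}


def split_fractions(counts: dict) -> dict:
    """Return each split's share of the config's total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    fractions = split_fractions(counts)
    rounded = {name: round(value, 3) for name, value in fractions.items()}
    print(config, sum(counts.values()), rounded)
```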
- Feature structure:
FeaturesDict({
'episode_summary': Text(shape=(), dtype=string),
'recap': Text(shape=(), dtype=string),
'recap_author': Text(shape=(), dtype=string),
'show_title': Text(shape=(), dtype=string),
'transcript': Text(shape=(), dtype=string),
'transcript_author': Tensor(shape=(None,), dtype=string),
})
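
Note that in this config `transcript_author` is a variable-length string tensor rather than a scalar `Text` feature, so code that decodes examples needs to handle both cases. The sketch below assumes that string values arrive as `bytes` (as is typical when iterating with `tfds.as_numpy`) and that sequences arrive as lists or NumPy arrays; `decode_value` and `decode_example` are hypothetical helpers, and the sample record is made up.

```python
def decode_value(value):
    """Decode bytes to str, recursing into lists, tuples, and array-likes."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, (list, tuple)):
        return [decode_value(item) for item in value]
    if hasattr(value, "tolist"):  # e.g. a NumPy array of bytes
        return decode_value(value.tolist())
    return value


def decode_example(example: dict) -> dict:
    """Decode every feature of one example into plain Python strings."""
    return {key: decode_value(value) for key, value in example.items()}


# Made-up record for illustration; not taken from the dataset.
raw = {
    "show_title": b"Some Show",
    "recap": b"A short recap.",
    "transcript_author": [b"author_a", b"author_b"],
}
decoded = decode_example(raw)
print(decoded["transcript_author"])
```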
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
| | FeaturesDict | | | |
episode_summary | Text | | string | |
recap | Text | | string | |
recap_author | Text | | string | |
show_title | Text | | string | |
transcript | Text | | string | |
transcript_author | Tensor | (None,) | string | |