summscreen

Description:

SummScreen Summarization dataset, non-anonymized, non-tokenized version.

Train/val/test splits and filtering are based on the final tokenized dataset, but transcripts and recaps provided are based on the untokenized text.

There are two features:

- transcript: Full episode transcripts, with each line of dialogue separated by a newline.
- recap: Recaps or summaries of episodes.
Homepage: https://github.com/mingdachen/SummScreen
Source code: tfds.datasets.summscreen.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: 841.27 MiB
Supervised keys (see the as_supervised documentation): ('transcript', 'recap')
Figure (tfds.show_examples): Not supported.
Citation:

@article{DBLP:journals/corr/abs-2104-07091,
  author    = {Mingda Chen and
               Zewei Chu and
               Sam Wiseman and
               Kevin Gimpel},
  title     = {SummScreen: {A} Dataset for Abstractive Screenplay Summarization},
  journal   = {CoRR},
  volume    = {abs/2104.07091},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.07091},
  archivePrefix = {arXiv},
  eprint    = {2104.07091},
  timestamp = {Mon, 19 Apr 2021 16:45:47 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2104-07091.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
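
The supervised keys listed above mean that loading with `as_supervised=True` should yield `(transcript, recap)` pairs. The following is a minimal sketch, not an official recipe: `summscreen_name` and `load_supervised` are hypothetical helper names introduced here for illustration, and the config names `fd` and `tms` are taken from the config sections of this page.

```python
def summscreen_name(config: str = "fd") -> str:
    """Build the TFDS dataset name for a SummScreen config.

    Hypothetical helper: 'fd' (ForeverDreaming) and 'tms' (TVMegaSite)
    are the two configs documented on this page.
    """
    if config not in ("fd", "tms"):
        raise ValueError(f"unknown SummScreen config: {config!r}")
    return f"summscreen/{config}"


def load_supervised(config: str = "fd", split: str = "train"):
    """Load one split as (transcript, recap) pairs.

    Hypothetical helper. TFDS is imported lazily so that
    summscreen_name() stays usable without TFDS installed.
    The first call downloads and prepares the dataset.
    """
    import tensorflow_datasets as tfds

    return tfds.load(summscreen_name(config), split=split, as_supervised=True)
```

Calling `load_supervised("fd", "validation")` would trigger the full download (841.27 MiB per the figure above) on first use. Each element is then a pair of scalar string tensors, so the text can be recovered with `.numpy().decode("utf-8")`.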

summscreen/fd (default config)

Config description: ForeverDreaming
Dataset size: 132.99 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	337
`'train'`	3,673
`'validation'`	338

Feature structure:

FeaturesDict({
    'episode_number': Text(shape=(), dtype=string),
    'episode_title': Text(shape=(), dtype=string),
    'recap': Text(shape=(), dtype=string),
    'show_title': Text(shape=(), dtype=string),
    'transcript': Text(shape=(), dtype=string),
    'transcript_author': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
episode_number	Text	string
episode_title	Text	string
recap	Text	string
show_title	Text	string
transcript	Text	string
transcript_author	Text	string

summscreen/tms

Config description: TVMegaSite
Dataset size: 592.53 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	1,793
`'train'`	18,915
`'validation'`	1,795
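
As a quick sanity check on the split tables, the sketch below totals the examples per config and computes each split's share. The counts are copied from this page; the names `SPLITS` and `split_shares` are illustrative only.

```python
# Example counts copied from the split tables on this page.
SPLITS = {
    "fd": {"train": 3673, "validation": 338, "test": 337},
    "tms": {"train": 18915, "validation": 1795, "test": 1793},
}


def split_shares(counts):
    """Return each split's fraction of the config's total examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total = sum(counts.values())
    summary = ", ".join(
        f"{name}: {share:.1%}" for name, share in split_shares(counts).items()
    )
    print(f"summscreen/{config}: {total} examples ({summary})")
```

Both configs come out at roughly an 84/8/8 train/validation/test split.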

Feature structure:

FeaturesDict({
    'episode_summary': Text(shape=(), dtype=string),
    'recap': Text(shape=(), dtype=string),
    'recap_author': Text(shape=(), dtype=string),
    'show_title': Text(shape=(), dtype=string),
    'transcript': Text(shape=(), dtype=string),
    'transcript_author': Tensor(shape=(None,), dtype=string),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
episode_summary	Text		string
recap	Text		string
recap_author	Text		string
show_title	Text		string
transcript	Text		string
transcript_author	Tensor	(None,)	string
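
Note that `transcript_author` is a scalar `Text` feature in `summscreen/fd` but a variable-length string `Tensor` in `summscreen/tms`. Code that handles both configs may want to normalize the field. The sketch below assumes examples have already been converted to NumPy form (for instance with `tfds.as_numpy`), where a scalar string arrives as `bytes` and a variable-length string tensor as an array of `bytes`. `authors_as_list` is a hypothetical helper name.

```python
from typing import List


def authors_as_list(value) -> List[str]:
    """Normalize a transcript_author value to a list of Python strings.

    Assumes NumPy-form input: a single bytes/str value (as in the fd
    config) or an iterable of bytes/str values (as in the tms config).
    """
    if isinstance(value, (bytes, str)):
        items = [value]
    else:
        items = list(value)
    return [
        item.decode("utf-8") if isinstance(item, bytes) else str(item)
        for item in items
    ]
```

With this helper, downstream code can treat the author field as a list for either config, with a one-element list for `fd` and zero or more elements for `tms`.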