- Description:
This dataset contains the PG-19 language modelling benchmark. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org) that were published before 1919. It also contains metadata of book titles and publication dates. PG-19 is over double the size of the Billion Word benchmark and contains documents that are, on average, 20x longer than those in the WikiText long-range language modelling benchmark.
Books are partitioned into train, validation, and test sets. Book metadata is stored in metadata.csv, which contains (book_id, short_book_title, publication_date, book_link).
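A minimal sketch for loading that metadata with pandas, assuming metadata.csv has been downloaded locally; the column names follow the description above, and the absence of a header row in the file is an assumption:
import pandas as pd

# Column order taken from the description above; header=None is an
# assumption about the raw CSV layout.
metadata = pd.read_csv(
    'metadata.csv',
    header=None,
    names=['book_id', 'short_book_title', 'publication_date', 'book_link'],
)
print(metadata.head())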
Homepage: https://github.com/deepmind/pg19
Source code: tfds.datasets.pg19.Builder
Versions: 0.1.1 (default): No release notes.
Download size: Unknown
Dataset size: 10.94 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 100 |
'train' | 28,602 |
'validation' | 50 |
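The split names above can be passed directly to the standard TFDS loading API. A minimal sketch (builder name 'pg19' taken from the source code reference above; the full dataset is large, so the small validation split is used here):
import tensorflow_datasets as tfds

# Load only the validation split (50 books); use 'train' or 'test' as needed.
ds = tfds.load('pg19', split='validation')

# Confirm the example count listed in the table above.
print(sum(1 for _ in ds))  # expected: 50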
- Feature structure:
FeaturesDict({
'book_id': int32,
'book_link': string,
'book_text': Text(shape=(), dtype=string),
'book_title': string,
'publication_date': string,
})
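Each example is a dictionary matching the feature structure above; since there are no supervised keys (see below), as_supervised=True does not apply. A minimal sketch for inspecting a single record:
import tensorflow_datasets as tfds

ds = tfds.load('pg19', split='test')

# Inspect one book; string features are returned as byte tensors.
for example in ds.take(1):
    print(int(example['book_id']))
    print(example['book_title'].numpy().decode('utf-8'))
    print(example['publication_date'].numpy().decode('utf-8'))
    print(example['book_link'].numpy().decode('utf-8'))
    # book_text holds the full book; show only the first 500 characters.
    print(example['book_text'].numpy().decode('utf-8')[:500])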
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
| FeaturesDict | | | |
book_id | Tensor | | int32 | |
book_link | Tensor | | string | |
book_text | Text | | string | |
book_title | Tensor | | string | |
publication_date | Tensor | | string | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
- Citation:
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}