- Description:
BIGPATENT consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories:
- A (Human Necessities),
- B (Performing Operations; Transporting),
- C (Chemistry; Metallurgy),
- D (Textiles; Paper),
- E (Fixed Constructions),
- F (Mechanical Engineering; Lighting; Heating; Weapons; Blasting),
- G (Physics),
- H (Electricity), and
- Y (General tagging of new or cross-sectional technology)
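The nine categories above correspond to the per-category configs on this page (`big_patent/a` through `big_patent/y`, plus `big_patent/all`). As a minimal sketch, the mapping can be written down as a lookup table; the names `CPC_CATEGORIES` and `config_name` are hypothetical helpers for illustration, not part of TFDS.

```python
# CPC section letters mapped to their category titles, as listed on this page.
CPC_CATEGORIES = {
    "a": "Human Necessities",
    "b": "Performing Operations; Transporting",
    "c": "Chemistry; Metallurgy",
    "d": "Textiles; Paper",
    "e": "Fixed Constructions",
    "f": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "g": "Physics",
    "h": "Electricity",
    "y": "General tagging of new or cross-sectional technology",
}


def config_name(cpc_code: str = "all") -> str:
    """Return the dataset/config name for a CPC letter (hypothetical helper)."""
    code = cpc_code.strip().lower()
    if code != "all" and code not in CPC_CATEGORIES:
        raise ValueError(f"Unknown CPC code: {cpc_code!r}")
    return f"big_patent/{code}"


print(config_name("G"))  # big_patent/g
print(config_name())     # big_patent/all
```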
There are two features:
- description: Detailed description of the patent.
- abstract: Patent abstract.
Additional Documentation: Explore on Papers With Code
Homepage: https://evasharma.github.io/bigpatent/
Source code:
tfds.datasets.big_patent.Builder
Versions:
- 1.0.0: lower cased tokenized words
- 2.0.0: Update to use cased raw strings
- 2.1.2 (default): Fix update to cased raw strings.
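Because the text casing changed between versions, it can help to pin the version explicitly when loading. The sketch below assumes the usual TFDS `name/config:version` naming convention and the `tensorflow-datasets` package; `versioned_name` and `load_pinned` are hypothetical helper names. The import is deferred so the name helper works without TFDS installed, and calling `load_pinned` triggers the full download and preparation on first use.

```python
def versioned_name(config: str = "all", version: str = "2.1.2") -> str:
    """Build a 'dataset/config:version' string (hypothetical helper)."""
    return f"big_patent/{config}:{version}"


def load_pinned(config: str = "all", version: str = "2.1.2", split: str = "train"):
    """Load one split at a pinned version (downloads data on first use)."""
    # Lazy import: requires the tensorflow-datasets package.
    import tensorflow_datasets as tfds

    return tfds.load(versioned_name(config, version), split=split)


print(versioned_name("d"))  # big_patent/d:2.1.2
```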
Download size:
9.45 GiB
Auto-cached (documentation): No
Feature structure:
FeaturesDict({
'abstract': Text(shape=(), dtype=string),
'description': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
abstract | Text | | string | |
description | Text | | string | |
Supervised keys (See as_supervised doc): ('description', 'abstract')
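With the supervised keys above, loading with `as_supervised=True` should yield `(description, abstract)` tuples, that is, the model input followed by the target summary. The following is a sketch under that assumption; `decode_pair` and `iter_pairs` are hypothetical helpers, and the TFDS import is deferred so that nothing is downloaded until `iter_pairs` is actually called.

```python
from typing import Iterator, Tuple


def decode_pair(description: bytes, abstract: bytes) -> Tuple[str, str]:
    """Decode one (input, target) pair from raw bytes to Python strings."""
    return description.decode("utf-8"), abstract.decode("utf-8")


def iter_pairs(config: str = "all", split: str = "validation",
               limit: int = 3) -> Iterator[Tuple[str, str]]:
    """Yield up to `limit` decoded (description, abstract) pairs."""
    # Lazy import: requires the tensorflow-datasets package.
    import tensorflow_datasets as tfds

    ds = tfds.load(f"big_patent/{config}", split=split, as_supervised=True)
    # as_numpy converts scalar string tensors to Python bytes objects.
    for description, abstract in tfds.as_numpy(ds.take(limit)):
        yield decode_pair(description, abstract)


print(decode_pair(b"A fastening device ...", b"A fastener is disclosed."))
```

A small config such as `d` is a cheaper choice than `all` for a first experiment.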
Figure (tfds.show_examples): Not supported.
Citation:
@misc{sharma2019bigpatent,
title={BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization},
author={Eva Sharma and Chen Li and Lu Wang},
year={2019},
eprint={1906.03741},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
big_patent/all (default config)
Config description: Patents under all categories.
Dataset size:
35.17 GiB
Splits:
Split | Examples |
---|---|
'test' | 67,072 |
'train' | 1,207,222 |
'validation' | 67,068 |
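As a quick arithmetic check on the split table above, the counts can be totalled and turned into fractions. The numbers are copied from the table; the variable names are only illustrative.

```python
# Example counts for big_patent/all, copied from the split table.
SPLITS_ALL = {"train": 1_207_222, "validation": 67_068, "test": 67_072}

total = sum(SPLITS_ALL.values())
fractions = {name: count / total for name, count in SPLITS_ALL.items()}

print(total)  # 1341362
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

The total of about 1.34 million matches the "1.3 million records" figure in the description, and the split works out to roughly 90% train, 5% validation, and 5% test.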
big_patent/a
Config description: Patents under Cooperative Patent Classification (CPC) a: Human Necessities
Dataset size:
5.16 GiB
Splits:
Split | Examples |
---|---|
'test' | 9,675 |
'train' | 174,134 |
'validation' | 9,674 |
big_patent/b
Config description: Patents under Cooperative Patent Classification (CPC) b: Performing Operations; Transporting
Dataset size:
4.06 GiB
Splits:
Split | Examples |
---|---|
'test' | 8,974 |
'train' | 161,520 |
'validation' | 8,973 |
big_patent/c
Config description: Patents under Cooperative Patent Classification (CPC) c: Chemistry; Metallurgy
Dataset size:
3.63 GiB
Splits:
Split | Examples |
---|---|
'test' | 5,614 |
'train' | 101,042 |
'validation' | 5,613 |
big_patent/d
Config description: Patents under Cooperative Patent Classification (CPC) d: Textiles; Paper
Dataset size:
255.56 MiB
Splits:
Split | Examples |
---|---|
'test' | 565 |
'train' | 10,164 |
'validation' | 565 |
big_patent/e
Config description: Patents under Cooperative Patent Classification (CPC) e: Fixed Constructions
Dataset size:
871.40 MiB
Splits:
Split | Examples |
---|---|
'test' | 1,914 |
'train' | 34,443 |
'validation' | 1,914 |
big_patent/f
Config description: Patents under Cooperative Patent Classification (CPC) f: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
Dataset size:
2.06 GiB
Splits:
Split | Examples |
---|---|
'test' | 4,754 |
'train' | 85,568 |
'validation' | 4,754 |
big_patent/g
Config description: Patents under Cooperative Patent Classification (CPC) g: Physics
Dataset size:
8.19 GiB
Splits:
Split | Examples |
---|---|
'test' | 14,386 |
'train' | 258,935 |
'validation' | 14,385 |
big_patent/h
Config description: Patents under Cooperative Patent Classification (CPC) h: Electricity
Dataset size:
7.50 GiB
Splits:
Split | Examples |
---|---|
'test' | 14,279 |
'train' | 257,019 |
'validation' | 14,279 |
big_patent/y
Config description: Patents under Cooperative Patent Classification (CPC) y: General tagging of new or cross-sectional technology
Dataset size:
3.46 GiB
Splits:
Split | Examples |
---|---|
'test' | 6,911 |
'train' | 124,397 |
'validation' | 6,911 |
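The per-category split counts on this page can be cross-checked against the `big_patent/all` config. The sketch below copies the counts from the split tables and sums them per split; the variable names are illustrative only.

```python
# (train, validation, test) example counts per config, copied from this page.
PER_CATEGORY = {
    "a": (174_134, 9_674, 9_675),
    "b": (161_520, 8_973, 8_974),
    "c": (101_042, 5_613, 5_614),
    "d": (10_164, 565, 565),
    "e": (34_443, 1_914, 1_914),
    "f": (85_568, 4_754, 4_754),
    "g": (258_935, 14_385, 14_386),
    "h": (257_019, 14_279, 14_279),
    "y": (124_397, 6_911, 6_911),
}
ALL_CONFIG = (1_207_222, 67_068, 67_072)

# Sum each column (train, validation, test) across the nine categories.
totals = tuple(sum(column) for column in zip(*PER_CATEGORY.values()))

print(totals)                # (1207222, 67068, 67072)
print(totals == ALL_CONFIG)  # True
```

The sums match the `all` config exactly, which is consistent with each record appearing in exactly one category config, although the page does not state this explicitly.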