- Description:
This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages:
- French
- Spanish
- German
- Chinese
- Japanese
- Korean
For further details, see the accompanying paper: PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification at https://arxiv.org/abs/1908.11828
Similar to the PAWS dataset, examples are split into Train/Dev/Test sections. All files are in TSV format with four columns:
- id: A unique id for each pair.
- sentence1: The first sentence.
- sentence2: The second sentence.
- (noisy_)label: (Noisy) label for each pair.
Each label has two possible values: 0 indicates that the two sentences have different meanings, while 1 indicates that they are paraphrases.
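The four-column layout above can be read with Python's standard `csv` module. This is a minimal sketch, not the dataset's official loader: it assumes each file starts with a header row naming the columns, the sample rows are invented for illustration, and `read_pairs` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample in the four-column layout described above; the rows are
# invented for illustration and are not taken from the dataset.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tThe cat chased the dog.\tThe dog chased the cat.\t0\n"
    "2\tShe flew from Paris to Rome.\tShe flew to Rome from Paris.\t1\n"
)


def read_pairs(handle):
    """Yield (id, sentence1, sentence2, label) tuples from a TSV file object."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: the label column is named "label" or "noisy_label".
        raw = row["label"] if "label" in row else row["noisy_label"]
        yield row["id"], row["sentence1"], row["sentence2"], int(raw)


pairs = list(read_pairs(io.StringIO(SAMPLE_TSV)))
# Keep only the pairs labelled 1, i.e. the paraphrases.
paraphrases = [pair for pair in pairs if pair[3] == 1]
print(len(pairs), len(paraphrases))
```

To read a real file, pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` wrapper.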
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
Source code: tfds.datasets.paws_x_wiki.Builder
Versions:
- 1.0.0 (default): No release notes.
Download size: 28.88 MiB
Auto-cached (documentation): Yes
Feature structure:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'sentence1': Text(shape=(), dtype=string),
'sentence2': Text(shape=(), dtype=string),
})
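The feature structure above means each example arrives as a dictionary with `label`, `sentence1`, and `sentence2` keys. The sketch below shows one way to load a language config and decode a few examples. `config_name` and `load_pairs` are hypothetical helper names, and the `tensorflow_datasets` import is deferred so that the name helper can be used without TensorFlow installed.

```python
def config_name(language):
    """Build the TFDS name for one language config, e.g. 'paws_x_wiki/de'."""
    supported = ("de", "en", "es", "fr", "ja", "ko", "zh")
    if language not in supported:
        raise ValueError(f"unknown language config: {language!r}")
    return f"paws_x_wiki/{language}"


def load_pairs(language="de", split="validation", limit=3):
    """Return up to `limit` examples as plain Python dicts."""
    # Imported lazily; requires the tensorflow_datasets package.
    import tensorflow_datasets as tfds

    ds = tfds.load(config_name(language), split=split)
    examples = []
    for ex in tfds.as_numpy(ds.take(limit)):
        examples.append({
            # Text features come back as UTF-8 encoded bytes.
            "sentence1": ex["sentence1"].decode("utf-8"),
            "sentence2": ex["sentence2"].decode("utf-8"),
            "label": int(ex["label"]),
        })
    return examples
```

Calling `load_pairs("ko", "test")` would download and prepare the dataset on first use, so it needs network access and disk space.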
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| label | ClassLabel | | int64 | |
| sentence1 | Text | | string | |
| sentence2 | Text | | string | |
Supervised keys (See as_supervised doc): None
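With no supervised keys defined, the `(input, label)` tuple form is presumably not available through `as_supervised`, so the feature dictionary has to be mapped by hand. The following is a minimal sketch of such a mapping. `to_supervised` is a hypothetical helper, and the example dictionary is invented for illustration.

```python
def to_supervised(example):
    """Map a feature dict to an ((sentence1, sentence2), label) tuple."""
    return (example["sentence1"], example["sentence2"]), example["label"]


# Hypothetical example in the shape given by the feature structure.
example = {"sentence1": "A follows B.", "sentence2": "B follows A.", "label": 0}
inputs, label = to_supervised(example)
print(inputs, label)
```

The same function could be passed to `tf.data.Dataset.map` on a loaded split, since it only indexes the dictionary.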
Figure (tfds.show_examples): Not supported.
Citation:
@InProceedings{pawsx2019emnlp,
title = { {PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification} },
author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
booktitle = {Proc. of EMNLP},
year = {2019}
}
paws_x_wiki/de (default config)
Config description: Translated to de
Dataset size: 15.27 MiB
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 49,380 |
'validation' | 2,000 |
paws_x_wiki/en
Config description: Translated to en
Dataset size: 14.59 MiB
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 49,175 |
'validation' | 2,000 |
paws_x_wiki/es
Config description: Translated to es
Dataset size: 15.27 MiB
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 49,401 |
'validation' | 1,961 |
paws_x_wiki/fr
Config description: Translated to fr
Dataset size: 15.79 MiB
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 49,399 |
'validation' | 1,988 |
paws_x_wiki/ja
Config description: Translated to ja
Dataset size: 17.77 MiB
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 49,401 |
'validation' | 2,000 |
paws_x_wiki/ko
Config description: Translated to ko
Dataset size: 16.42 MiB
Splits:
Split | Examples |
---|---|
'test' | 1,999 |
'train' | 49,164 |
'validation' | 2,000 |
paws_x_wiki/zh
Config description: Translated to zh
Dataset size: 13.20 MiB
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 49,401 |
'validation' | 2,000 |