
paws_x_wiki

Description:

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages:

French
Spanish
German
Chinese
Japanese
Korean

For further details, see the accompanying paper, "PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification": https://arxiv.org/abs/1908.11828

As in the PAWS dataset, examples are split into Train/Dev/Test sections. All files are in TSV format with four columns:

id: A unique id for each pair.
sentence1: The first sentence.
sentence2: The second sentence.
(noisy_)label: (Noisy) label for each pair.

Each label takes one of two values: 0 indicates that the two sentences differ in meaning, while 1 indicates that they are paraphrases of each other.
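
The four-column layout described above can be read with Python's standard csv module. The following is a minimal sketch, not part of the dataset tooling: it assumes each file starts with a header row naming the columns and that the label column is called either `label` or `noisy_label`. The helper name `read_pairs` and the sample rows are made up for illustration.

```python
import csv
import io

# Human-readable meaning of the two label values.
LABEL_MEANING = {0: "different meaning", 1: "paraphrase"}


def read_pairs(tsv_file):
    """Yield one dict per row of a PAWS-X style TSV file.

    Assumes the first row is a header naming the four columns and that the
    label column is called either `label` or `noisy_label`.
    """
    # QUOTE_NONE: treat quote characters inside sentences as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        raw_label = row["label"] if "label" in row else row["noisy_label"]
        yield {
            "id": row["id"],
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            "label": int(raw_label),
        }


# Tiny made-up sample, not taken from the dataset.
sample = io.StringIO(
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA visits B.\tB visits A.\t0\n"
    "2\tA met B in 1990.\tIn 1990, A met B.\t1\n"
)
pairs = list(read_pairs(sample))
for pair in pairs:
    print(pair["id"], LABEL_MEANING[pair["label"]])
```

Disabling quote handling is a deliberate choice here: sentences may contain quotation marks, and the default csv dialect would otherwise try to interpret them as field delimiters.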

Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
Source code: tfds.datasets.paws_x_wiki.Builder
Versions:
- 1.0.0 (default): No release notes.
Download size: 28.88 MiB
Auto-cached (documentation): Yes
Feature structure:

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=string),
    'sentence2': Text(shape=(), dtype=string),
})
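
As a rough illustration of how this feature structure appears when loading the data, the sketch below reads one config with `tfds.load` and converts each example to plain Python values. The helper names `config_name` and `load_pairs` are hypothetical; the language codes are the config names listed on this page. `load_pairs` needs the tensorflow_datasets package and downloads the data on first use, so it is only defined here, not called.

```python
def config_name(language):
    """Build the TFDS name for one language config, e.g. 'paws_x_wiki/de'."""
    languages = ("de", "en", "es", "fr", "ja", "ko", "zh")
    if language not in languages:
        raise ValueError(f"unknown language config: {language!r}")
    return f"paws_x_wiki/{language}"


def load_pairs(language="de", split="validation"):
    """Yield (sentence1, sentence2, label) tuples for one config and split.

    Requires tensorflow_datasets; the data is downloaded on first use.
    """
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    ds = tfds.load(config_name(language), split=split)
    for example in tfds.as_numpy(ds):
        # Text features arrive as bytes; the label is an integer class index.
        yield (
            example["sentence1"].decode("utf-8"),
            example["sentence2"].decode("utf-8"),
            int(example["label"]),
        )
```

A call such as `next(load_pairs("fr", "test"))` would then return the first French test pair together with its 0/1 label.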

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
label	ClassLabel	int64
sentence1	Text	string
sentence2	Text	string

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
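
Because no supervised keys are defined, each example comes back as a feature dictionary, and an (inputs, label) pair has to be built by hand. A minimal sketch, using a hypothetical helper named `to_supervised` and a made-up example dictionary:

```python
def to_supervised(example):
    """Split one feature dict into ((sentence1, sentence2), label)."""
    inputs = (example["sentence1"], example["sentence2"])
    return inputs, example["label"]


# Made-up example with the same keys as the feature structure.
example = {"sentence1": "A met B.", "sentence2": "B met A.", "label": 1}
inputs, label = to_supervised(example)
print(inputs, label)
```

The same function should also work as the argument to `tf.data.Dataset.map` on a dataset returned by `tfds.load`, since it only indexes the dictionary by key.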
Citation:

@InProceedings{pawsx2019emnlp,
  title = { {PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification} },
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

paws_x_wiki/de (default config)

Config description: Translated to de
Dataset size: 15.27 MiB
Splits:

Split	Examples
`'test'`	2,000
`'train'`	49,380
`'validation'`	2,000


paws_x_wiki/en

Config description: Translated to en
Dataset size: 14.59 MiB
Splits:

Split	Examples
`'test'`	2,000
`'train'`	49,175
`'validation'`	2,000


paws_x_wiki/es

Config description: Translated to es
Dataset size: 15.27 MiB
Splits:

Split	Examples
`'test'`	2,000
`'train'`	49,401
`'validation'`	1,961


paws_x_wiki/fr

Config description: Translated to fr
Dataset size: 15.79 MiB
Splits:

Split	Examples
`'test'`	2,000
`'train'`	49,399
`'validation'`	1,988


paws_x_wiki/ja

Config description: Translated to ja
Dataset size: 17.77 MiB
Splits:

Split	Examples
`'test'`	2,000
`'train'`	49,401
`'validation'`	2,000


paws_x_wiki/ko

Config description: Translated to ko
Dataset size: 16.42 MiB
Splits:

Split	Examples
`'test'`	1,999
`'train'`	49,164
`'validation'`	2,000


paws_x_wiki/zh

Config description: Translated to zh
Dataset size: 13.20 MiB
Splits:

Split	Examples
`'test'`	2,000
`'train'`	49,401
`'validation'`	2,000
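
For quick reference, the per-config split sizes listed on this page can be gathered into one lookup table and summed. This is only a convenience sketch: the numbers are copied from the split tables on this page, and the names `SPLITS` and `total_examples` are made up for illustration.

```python
# Split sizes copied from the per-config tables on this page.
SPLITS = {
    "de": {"test": 2000, "train": 49380, "validation": 2000},
    "en": {"test": 2000, "train": 49175, "validation": 2000},
    "es": {"test": 2000, "train": 49401, "validation": 1961},
    "fr": {"test": 2000, "train": 49399, "validation": 1988},
    "ja": {"test": 2000, "train": 49401, "validation": 2000},
    "ko": {"test": 1999, "train": 49164, "validation": 2000},
    "zh": {"test": 2000, "train": 49401, "validation": 2000},
}


def total_examples(split, configs=None):
    """Sum the number of examples in `split` over the given configs.

    With configs=None, all configs in SPLITS are included.
    """
    configs = SPLITS if configs is None else configs
    return sum(SPLITS[name][split] for name in configs)


for split in ("train", "validation", "test"):
    print(split, total_examples(split))
```

Passing an explicit list, for example `total_examples("train", ["de", "fr"])`, restricts the sum to those configs.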
