genomics_ood

Description:

Bacteria identification based on genomic sequences holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on out-of-distribution (OOD) genomic sequences from new bacteria that were not present in the training data.

We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. New bacterial classes are gradually discovered over the years, so grouping classes by discovery year is a natural way to mimic in-distribution and OOD examples.

The dataset contains genomic sequences sampled from 10 bacteria classes discovered before 2011 as in-distribution classes, 60 bacteria classes discovered between 2011 and 2016 as OOD for validation, and another 60 bacteria classes discovered after 2016 as OOD for test, for a total of 130 bacteria classes. Training, validation, and test data are provided for the in-distribution classes, while only validation and test data are provided for the OOD classes. By its nature, OOD data is not available at training time.

Each genomic sequence is 250 characters long, composed of the characters {A, C, G, T}. Each class has 100,000 examples in the training set and 10,000 examples in each of the validation and test sets.
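As a quick consistency check, the per-class sample sizes above can be multiplied by the class counts and compared with the numbers in the Splits table. This is a minimal sketch; the variable names are illustrative and not part of TFDS.

```python
# Class counts and per-class sample sizes as stated in the description.
IN_DIST_CLASSES = 10          # classes discovered before 2011
OOD_CLASSES_PER_SPLIT = 60    # OOD classes in each of validation and test
TRAIN_PER_CLASS = 100_000
EVAL_PER_CLASS = 10_000       # applies to validation and test sets

expected_sizes = {
    "train": IN_DIST_CLASSES * TRAIN_PER_CLASS,
    "validation": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "test": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "validation_ood": OOD_CLASSES_PER_SPLIT * EVAL_PER_CLASS,
    "test_ood": OOD_CLASSES_PER_SPLIT * EVAL_PER_CLASS,
}

for name, size in expected_sizes.items():
    print(f"{name}: {size:,}")
```

The computed values agree with the split sizes listed under Splits (1,000,000 training examples, 100,000 for each in-distribution evaluation split, and 600,000 for each OOD split).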

For each example, the features include:
- seq: the input DNA sequence, composed of {A, C, G, T}.
- label: the name of the bacteria class.
- seq_info: the source of the DNA sequence, i.e., the genome name, NCBI accession number, and the position where it was sampled from.
- domain: whether the bacteria is in-distribution (in) or OOD (ood).
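To illustrate how the seq and domain features might be consumed, here is a small sketch in plain Python. The helper names (`encode_seq`, `is_ood`) and the particular integer mapping are assumptions made for illustration; they are not part of the dataset or of TFDS.

```python
# Hypothetical mapping; any fixed assignment of the four letters would do.
NUCLEOTIDE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_seq(seq):
    """Map a DNA string over {A, C, G, T} to a list of integer indices."""
    try:
        return [NUCLEOTIDE_TO_INDEX[base] for base in seq]
    except KeyError as err:
        raise ValueError(f"unexpected character in sequence: {err}") from None


def is_ood(domain):
    """Return True when the `domain` feature marks an OOD example."""
    return domain == "ood"


print(encode_seq("ACGTTA"))  # [0, 1, 2, 3, 3, 0]
print(is_ood("in"))          # False
```

When examples are read through TFDS, text features typically arrive as byte strings, so a `.decode("utf-8")` step would likely be needed before calling helpers like these.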

Details of the dataset can be found in the supplemental material of the paper.

Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/google-research/google-research/tree/master/genomics_ood
Source code: tfds.structured.GenomicsOod
Versions:
- 0.0.1 (default): No release notes.
Download size: Unknown size
Dataset size: 926.87 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	100,000
`'test_ood'`	600,000
`'train'`	1,000,000
`'validation'`	100,000
`'validation_ood'`	600,000

Feature structure:

FeaturesDict({
    'domain': Text(shape=(), dtype=string),
    'label': ClassLabel(shape=(), dtype=int64, num_classes=130),
    'seq': Text(shape=(), dtype=string),
    'seq_info': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
domain	Text	string
label	ClassLabel	int64
seq	Text	string
seq_info	Text	string

Supervised keys (See as_supervised doc): ('seq', 'label')
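The supervised keys above mean that loading with `as_supervised=True` should yield (seq, label) tuples instead of the full feature dictionary. The sketch below wraps `tfds.load` in a hypothetical helper, `load_split`, which is not part of TFDS; the split names are taken from the Splits table.

```python
SPLITS = ("train", "validation", "test", "validation_ood", "test_ood")


def load_split(split, as_supervised=True):
    """Load one genomics_ood split as a tf.data.Dataset.

    With as_supervised=True each element is a (seq, label) tuple, following
    the supervised keys; otherwise it is the full feature dictionary.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the helper can be defined without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load("genomics_ood", split=split, as_supervised=as_supervised)


# Example usage (prepares the dataset on first call, roughly 927 MiB on disk):
# ds = load_split("train")
# for seq, label in ds.take(1):
#     print(seq.numpy(), label.numpy())
```

Passing `as_supervised=False` keeps the `domain` and `seq_info` features, which is probably what you want when evaluating on the OOD splits.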
Figure (tfds.show_examples): Not supported.

Citation:

@inproceedings{ren2019likelihood,
  title={Likelihood ratios for out-of-distribution detection},
  author={Ren, Jie and
  Liu, Peter J and
  Fertig, Emily and
  Snoek, Jasper and
  Poplin, Ryan and
  Depristo, Mark and
  Dillon, Joshua and
  Lakshminarayanan, Balaji},
  booktitle={Advances in Neural Information Processing Systems},
  pages={14707--14718},
  year={2019}
}