- Description:
Bacteria identification based on genomic sequences holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on out-of-distribution (OOD) genomic sequences from new bacteria that were not present in the training data.
We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. New bacterial classes are discovered gradually over the years, so grouping classes by year of discovery is a natural way to mimic in-distribution and OOD examples.
The dataset contains genomic sequences sampled from 10 bacteria classes that were discovered before the year 2011 as in-distribution classes, 60 bacteria classes discovered between 2011-2016 as OOD for validation, and another 60 different bacteria classes discovered after 2016 as OOD for test, for a total of 130 bacteria classes. Note that training, validation, and test data are provided for the in-distribution classes, while only validation and test data are provided for the OOD classes. By its nature, OOD data is not available at training time.
Each genomic sequence is 250 characters long and is composed of the characters {A, C, G, T}. Each class has 100,000 examples in the training set and 10,000 examples in each of the validation and test sets.
For each example, the features include:
  - seq: the input DNA sequence, composed of {A, C, G, T}.
  - label: the name of the bacteria class.
  - seq_info: the source of the DNA sequence, i.e., the genome name, NCBI accession number, and the position where it was sampled from.
  - domain: whether the bacteria class is in-distribution (in) or OOD (ood).
The details of the dataset can be found in the supplementary material of the paper.
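The sequence format described above (250 characters over {A, C, G, T}) can be validated and turned into model input along the following lines. This is a minimal sketch, not the authors' preprocessing: the helper names `encode_sequence` and `one_hot` are hypothetical, and the index mapping A=0, C=1, G=2, T=3 is an assumption chosen for illustration.

```python
from typing import List

SEQ_LEN = 250        # sequence length stated in the description
ALPHABET = "ACGT"    # assumed index order: A=0, C=1, G=2, T=3
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_sequence(seq: str) -> List[int]:
    """Validate a DNA string and map each base to an integer index."""
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} characters, got {len(seq)}")
    try:
        return [CHAR_TO_INDEX[ch] for ch in seq]
    except KeyError as err:
        raise ValueError(f"unexpected character {err.args[0]!r}") from None


def one_hot(indices: List[int]) -> List[List[int]]:
    """Turn integer indices into one-hot rows of width len(ALPHABET)."""
    width = len(ALPHABET)
    return [[1 if i == idx else 0 for i in range(width)] for idx in indices]


# Synthetic 250-character sequence for illustration; not taken from the dataset.
example = "ACGT" * 62 + "AC"
encoded = encode_sequence(example)
encoded_one_hot = one_hot(encoded)
```

Rejecting sequences with the wrong length or unexpected characters up front keeps malformed inputs from silently reaching a model.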
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/google-research/google-research/tree/master/genomics_ood
Source code: tfds.structured.GenomicsOod
Versions:
  - 0.0.1 (default): No release notes.
Download size: Unknown size
Dataset size: 926.87 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 100,000 |
'test_ood' | 600,000 |
'train' | 1,000,000 |
'validation' | 100,000 |
'validation_ood' | 600,000 |
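The split sizes above follow directly from the class counts and per-class sample sizes given in the description. As a quick sanity check, the arithmetic can be sketched as follows (the variable names are illustrative only):

```python
# Class counts and per-class sample sizes, as stated in the description.
IN_DIST_CLASSES = 10
OOD_VALIDATION_CLASSES = 60
OOD_TEST_CLASSES = 60
TRAIN_PER_CLASS = 100_000
EVAL_PER_CLASS = 10_000  # applies to both validation and test sets

expected_splits = {
    "train": IN_DIST_CLASSES * TRAIN_PER_CLASS,
    "validation": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "test": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "validation_ood": OOD_VALIDATION_CLASSES * EVAL_PER_CLASS,
    "test_ood": OOD_TEST_CLASSES * EVAL_PER_CLASS,
}
total_classes = IN_DIST_CLASSES + OOD_VALIDATION_CLASSES + OOD_TEST_CLASSES

for name, count in sorted(expected_splits.items()):
    print(f"{name}: {count:,}")
```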
- Feature structure:
FeaturesDict({
'domain': Text(shape=(), dtype=string),
'label': ClassLabel(shape=(), dtype=int64, num_classes=130),
'seq': Text(shape=(), dtype=string),
'seq_info': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
domain | Text | | string | |
label | ClassLabel | | int64 | |
seq | Text | | string | |
seq_info | Text | | string | |
Supervised keys (See as_supervised doc): ('seq', 'label')
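With the supervised keys above, each element of a split loaded with `as_supervised=True` is a `(seq, label)` pair. The sketch below wraps `tfds.load` in a hypothetical helper, `load_genomics_ood`, and assumes the dataset is registered under the name `genomics_ood` and that the `tensorflow-datasets` package is installed; the import is deferred so that the package is only needed when a split is actually loaded.

```python
# Split names taken from the Splits table.
IN_DISTRIBUTION_SPLITS = ("train", "validation", "test")
OOD_SPLITS = ("validation_ood", "test_ood")


def load_genomics_ood(split: str, supervised: bool = True):
    """Load one split of the dataset (hypothetical convenience wrapper).

    With supervised=True, elements are (seq, label) tuples; otherwise they are
    dictionaries with the keys 'domain', 'label', 'seq', and 'seq_info'.
    """
    if split not in IN_DISTRIBUTION_SPLITS + OOD_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Deferred import: requires the tensorflow-datasets package.
    import tensorflow_datasets as tfds

    # Assumed registered dataset name: "genomics_ood".
    return tfds.load("genomics_ood", split=split, as_supervised=supervised)
```

For example, `for seq, label in load_genomics_ood("train").take(1): ...` would iterate over a single training pair. Since the download size is listed as unknown, loading a split may require preparing the data first.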
Figure (tfds.show_examples): Not supported.
- Citation:
@inproceedings{ren2019likelihood,
title={Likelihood ratios for out-of-distribution detection},
author={Ren, Jie and
Liu, Peter J and
Fertig, Emily and
Snoek, Jasper and
Poplin, Ryan and
Depristo, Mark and
Dillon, Joshua and
Lakshminarayanan, Balaji},
booktitle={Advances in Neural Information Processing Systems},
pages={14707--14718},
year={2019}
}