- Description:
The Drug Cardiotoxicity dataset [1-2] is a molecule classification task for detecting cardiotoxicity caused by binding to the hERG target, a protein associated with heartbeat rhythm. The data covers over 9,000 molecules with hERG activity.
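The data is divided into four splits: train, test-iid, test-ood1, and test-ood2. Note that the TFDS split names listed under Splits are 'train', 'validation', 'test', and 'test2'.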
Each molecule in the dataset has 2D graph annotations, which are designed to facilitate graph neural network modeling. Nodes are the atoms of the molecule and edges are the bonds. Each atom is represented as a vector encoding basic atom information such as atom type. Bonds are encoded in the same way.
We include the Tanimoto fingerprint distance (to the training data) for each molecule in the test sets to facilitate research on distributional shift in the graph domain.
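For readers unfamiliar with the metric, the sketch below shows the standard Tanimoto (Jaccard) distance between two binary fingerprints, written as sets of on-bit indices. The fingerprint type and the way distances to training molecules are aggregated in this dataset are not specified here, so `mean_distance_to_nearest` and its parameter `k` are hypothetical illustrations, not the dataset's actual procedure. The toy fingerprints are made up.

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_distance_to_nearest(query_fp, train_fps, k=1):
    """Hypothetical aggregation: mean distance to the k nearest training fingerprints."""
    if not train_fps:
        raise ValueError("train_fps must not be empty")
    dists = sorted(tanimoto_distance(query_fp, fp) for fp in train_fps)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


# Toy fingerprints (invented for illustration only).
train_fps = [{1, 4, 9}, {2, 4, 7, 9}, {0, 3}]
query = {1, 4, 7, 9}
print(mean_distance_to_nearest(query, train_fps, k=2))
```

A larger distance means the test molecule is less similar to anything in the training set, which is why such a score is useful for studying distributional shift.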
For each example, the features include:
  - atoms: a 2D tensor with shape (60, 27) storing node features. Molecules with fewer than 60 atoms are padded with zeros. Each atom has 27 atom features.
  - pairs: a 3D tensor with shape (60, 60, 12) storing edge features. Each edge has 12 edge features.
  - atom_mask: a 1D tensor with shape (60,) storing node masks. 1 indicates that the corresponding atom is real; otherwise it is padding.
  - pair_mask: a 2D tensor with shape (60, 60) storing edge masks. 1 indicates that the corresponding edge is real; otherwise it is padding.
  - active: a one-hot vector indicating whether the molecule is toxic. [0, 1] indicates that it is toxic; [1, 0] indicates that it is non-toxic.
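The padding and label conventions above can be illustrated with a minimal, dependency-free sketch. The helper names `pad_atoms` and `is_toxic` are hypothetical, and the toy atom features are invented; only the shapes and the label encoding come from the description.

```python
MAX_ATOMS = 60      # padded number of atoms per molecule (from the description)
ATOM_FEATURES = 27  # number of features per atom (from the description)


def pad_atoms(atom_rows):
    """Zero-pad per-atom feature rows to (MAX_ATOMS, ATOM_FEATURES) and build atom_mask."""
    n = len(atom_rows)
    if n > MAX_ATOMS:
        raise ValueError(f"molecule has {n} atoms, more than {MAX_ATOMS}")
    if any(len(row) != ATOM_FEATURES for row in atom_rows):
        raise ValueError(f"every atom needs exactly {ATOM_FEATURES} features")
    padding = [[0.0] * ATOM_FEATURES for _ in range(MAX_ATOMS - n)]
    atoms = [list(row) for row in atom_rows] + padding
    atom_mask = [1.0] * n + [0.0] * (MAX_ATOMS - n)  # 1 = real atom, 0 = padding
    return atoms, atom_mask


def is_toxic(active):
    """Decode the one-hot label: [0, 1] is toxic, [1, 0] is non-toxic."""
    label = [int(v) for v in active]
    if label == [0, 1]:
        return True
    if label == [1, 0]:
        return False
    raise ValueError(f"not a valid one-hot label: {active!r}")


# Toy molecule with 3 atoms; the feature values are invented for illustration.
toy_rows = [[1.0 if j == i else 0.0 for j in range(ATOM_FEATURES)] for i in range(3)]
atoms, atom_mask = pad_atoms(toy_rows)
print(len(atoms), len(atoms[0]), int(sum(atom_mask)))
print(is_toxic([0, 1]))
```

Summing `atom_mask` recovers the number of real atoms, which is the usual way to exclude padded rows when pooling node features.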
References
[1]: V. B. Siramshetty et al. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the Big Data Era. JCIM, 2020. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00884
[2]: K. Han et al. Reliable Graph Neural Networks for Drug Discovery Under Distributional Shift. NeurIPS DistShift Workshop 2021. https://arxiv.org/abs/2111.12951
Homepage: https://github.com/google/uncertainty-baselines/tree/main/baselines/drug_cardiotoxicity
Source code: tfds.graphs.cardiotox.Cardiotox
Versions:
1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 1.66 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 839 |
'test2' | 177 |
'train' | 6,523 |
'validation' | 1,631 |
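As a quick sanity check on the table above, the following sketch totals the split sizes and shows one plausible way to load a split. It assumes the dataset is registered in TFDS under the name `cardiotox` (inferred from the source code path) and that `tensorflow-datasets` is installed; `load_split` is a hypothetical helper and is not called here, so the rest of the snippet runs without TFDS.

```python
# Split sizes copied from the Splits table.
SPLIT_SIZES = {"train": 6523, "validation": 1631, "test": 839, "test2": 177}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_split(split="train"):
    """Load one split with TFDS (assumes the dataset name 'cardiotox')."""
    import tensorflow_datasets as tfds  # imported lazily; only needed when called
    return tfds.load("cardiotox", split=split)


total = sum(SPLIT_SIZES.values())
print(total)
for name, fraction in split_fractions(SPLIT_SIZES).items():
    print(f"{name}: {fraction:.1%}")
```

The total agrees with the "over 9,000 molecules" stated in the description.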
- Feature structure:
FeaturesDict({
'active': Tensor(shape=(2,), dtype=int64),
'atom_mask': Tensor(shape=(60,), dtype=float32),
'atoms': Tensor(shape=(60, 27), dtype=float32),
'dist2topk_nbs': Tensor(shape=(1,), dtype=float32),
'molecule_id': string,
'pair_mask': Tensor(shape=(60, 60), dtype=float32),
'pairs': Tensor(shape=(60, 60, 12), dtype=float32),
})
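When wiring this dataset into a model, it can help to verify that each unbatched example matches the structure above. The sketch below is a hypothetical checker (`check_shapes` is not part of TFDS); it takes a mapping from feature name to shape and assumes `molecule_id` is a scalar string, represented as the empty shape `()`.

```python
# Expected per-example shapes, copied from the feature structure.
# molecule_id is assumed to be a scalar string, hence the empty shape.
EXPECTED_SHAPES = {
    "active": (2,),
    "atom_mask": (60,),
    "atoms": (60, 27),
    "dist2topk_nbs": (1,),
    "molecule_id": (),
    "pair_mask": (60, 60),
    "pairs": (60, 60, 12),
}


def check_shapes(example_shapes):
    """Return a list of problems; an empty list means all shapes match."""
    problems = []
    for name, expected in EXPECTED_SHAPES.items():
        if name not in example_shapes:
            problems.append(f"missing feature: {name}")
            continue
        actual = tuple(example_shapes[name])
        if actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems


good = dict(EXPECTED_SHAPES)
bad = dict(EXPECTED_SHAPES, atoms=(50, 27))
print(check_shapes(good))
print(check_shapes(bad))
```

With real data, the shape mapping could be built from the `.shape` attribute of each array in an example, for instance after converting the dataset to NumPy.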
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
FeaturesDict | ||||
active | Tensor | (2,) | int64 | |
atom_mask | Tensor | (60,) | float32 | |
atoms | Tensor | (60, 27) | float32 | |
dist2topk_nbs | Tensor | (1,) | float32 | |
molecule_id | Tensor | | string | |
pair_mask | Tensor | (60, 60) | float32 | |
pairs | Tensor | (60, 60, 12) | float32 | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
- Citation:
@ARTICLE{Han2021-tu,
title = "Reliable Graph Neural Networks for Drug Discovery Under
Distributional Shift",
author = "Han, Kehang and Lakshminarayanan, Balaji and Liu, Jeremiah",
month = nov,
year = 2021,
archivePrefix = "arXiv",
primaryClass = "cs.LG",
eprint = "2111.12951"
}