- Description:
'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).
This dataset is experimental, and the API is subject to change in future releases.
The description below is adapted from the OGB paper:
Input Format
All the molecules are pre-processed using RDKit ([1]).
- Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds.
- Input node features are 9-dimensional, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring.
- Input edge features are 3-dimensional, containing bond type, bond stereochemistry, as well as an additional bond feature indicating whether the bond is conjugated.
The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py
Prediction
The task is to predict 128 different biological activities, each a binary label (inactive/active). See [2] and [3] for a fuller description of these targets. Not all targets apply to every molecule: missing targets are indicated by NaNs.
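Because missing targets are stored as NaNs, a loss or metric has to skip them rather than treat them as inactive. The following is a minimal sketch of one way to do that, using only the standard library; the function name `masked_bce`, the `eps` clipping value, and the toy inputs are illustrative assumptions, not part of the dataset or of OGB.

```python
import math


def masked_bce(labels, probs, eps=1e-7):
    """Mean binary cross-entropy over the targets that are not NaN.

    `labels` holds 0.0, 1.0 or NaN per task; `probs` holds the predicted
    probability of the active class for the same tasks.
    """
    total, count = 0.0, 0
    for y, p in zip(labels, probs):
        if math.isnan(y):
            continue  # target was not measured for this molecule
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
        count += 1
    # Hypothetical convention: a molecule with no measured target contributes 0.
    return total / count if count else 0.0


# Toy example with 3 tasks instead of 128; the middle target is missing.
labels = [1.0, float("nan"), 0.0]
probs = [0.9, 0.2, 0.1]
loss = masked_bce(labels, probs)
print(round(loss, 4))
```

Only the first and third tasks contribute here, so the prediction for the missing target has no effect on the result.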
References
[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit
[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf
[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. 'MoleculeNet: a benchmark for molecular machine learning'. Chemical Science, 9(2):513-530, 2018.
Homepage: https://ogb.stanford.edu/docs/graphprop
Source code: tfds.datasets.ogbg_molpcba.Builder
Versions:
- 0.1.0: Initial release of experimental API.
- 0.1.1: Exposes the number of edges in each graph explicitly.
- 0.1.2: Add metadata field for GraphVisualizer.
- 0.1.3 (default): Add metadata field for names of individual tasks.
Download size: 37.70 MiB
Dataset size: 822.53 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 43,793 |
'train' | 350,343 |
'validation' | 43,793 |
- Feature structure:
    FeaturesDict({
        'edge_feat': Tensor(shape=(None, 3), dtype=float32),
        'edge_index': Tensor(shape=(None, 2), dtype=int64),
        'labels': Tensor(shape=(128,), dtype=float32),
        'node_feat': Tensor(shape=(None, 9), dtype=float32),
        'num_edges': Tensor(shape=(None,), dtype=int64),
        'num_nodes': Tensor(shape=(None,), dtype=int64),
    })
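The sketch below shows how one example with this structure might be checked for internal consistency and turned into an adjacency list. It rests on several assumptions that the feature structure alone does not confirm: each row of `edge_index` is read as a (source, target) pair of node indices, and `num_nodes` and `num_edges` are read as length-1 vectors for a single graph. The names `load_examples`, `summarize_graph` and `_require` are hypothetical helpers, and the toy graph is made up for illustration. `load_examples` relies on `tfds.load` and `tfds.as_numpy`; it is defined but not called here, since calling it downloads the dataset.

```python
def load_examples(split="train"):
    """Return an iterable of NumPy examples (downloads the data when called)."""
    import tensorflow_datasets as tfds  # lazy import: the rest runs without it

    return tfds.as_numpy(tfds.load("ogbg_molpcba", split=split))


def _require(condition, message):
    if not condition:
        raise ValueError(message)


def summarize_graph(example):
    """Validate one example against the feature structure; return neighbors."""
    num_nodes = int(example["num_nodes"][0])
    num_edges = int(example["num_edges"][0])
    _require(len(example["node_feat"]) == num_nodes, "node_feat rows != num_nodes")
    _require(all(len(r) == 9 for r in example["node_feat"]), "node_feat width != 9")
    _require(len(example["edge_index"]) == num_edges, "edge_index rows != num_edges")
    _require(len(example["edge_feat"]) == num_edges, "edge_feat rows != num_edges")
    _require(all(len(r) == 3 for r in example["edge_feat"]), "edge_feat width != 3")
    _require(len(example["labels"]) == 128, "labels length != 128")

    neighbors = [[] for _ in range(num_nodes)]
    for src, dst in example["edge_index"]:  # assumed (source, target) rows
        _require(0 <= src < num_nodes and 0 <= dst < num_nodes, "endpoint out of range")
        neighbors[int(src)].append(int(dst))
    return neighbors


# Hypothetical 3-atom chain 0-1-2, with each bond listed in both directions.
toy = {
    "node_feat": [[0.0] * 9 for _ in range(3)],
    "edge_index": [[0, 1], [1, 0], [1, 2], [2, 1]],
    "edge_feat": [[0.0] * 3 for _ in range(4)],
    "labels": [float("nan")] * 128,
    "num_nodes": [3],
    "num_edges": [4],
}
neighbors = summarize_graph(toy)
print(neighbors)
```

With real data, the same check could be applied to each item yielded by `load_examples("validation")`, since NumPy arrays support the same `len`, indexing and row iteration used here.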
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
edge_feat | Tensor | (None, 3) | float32 | |
edge_index | Tensor | (None, 2) | int64 | |
labels | Tensor | (128,) | float32 | |
node_feat | Tensor | (None, 9) | float32 | |
num_edges | Tensor | (None,) | int64 | |
num_nodes | Tensor | (None,) | int64 | |
Supervised keys (See as_supervised doc): None
- Citation:
@inproceedings{DBLP:conf/nips/HuFZDRLCL20,
author = {Weihua Hu and
Matthias Fey and
Marinka Zitnik and
Yuxiao Dong and
Hongyu Ren and
Bowen Liu and
Michele Catasta and
Jure Leskovec},
editor = {Hugo Larochelle and
Marc Aurelio Ranzato and
Raia Hadsell and
Maria{-}Florina Balcan and
Hsuan{-}Tien Lin},
title = {Open Graph Benchmark: Datasets for Machine Learning on Graphs},
booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference
on Neural Information Processing Systems 2020, NeurIPS 2020, December
6-12, 2020, virtual},
year = {2020},
url = {https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html},
timestamp = {Tue, 19 Jan 2021 15:57:06 +0100},
biburl = {https://dblp.org/rec/conf/nips/HuFZDRLCL20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}