
ogbg_molpcba

Description:

'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).

This dataset is experimental, and the API is subject to change in future releases.

The description below is adapted from the OGB paper:

Input Format

All molecules are pre-processed using RDKit [1].

Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds.
Input node features are 9-dimensional, containing the atomic number and chirality, plus additional atom features such as formal charge and whether the atom is in a ring.
Input edge features are 3-dimensional, containing the bond type and bond stereochemistry, plus an additional feature indicating whether the bond is conjugated.

The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py
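
As a rough illustration of the layout described above, the following sketch builds a toy three-atom chain with NumPy. The feature values are zero placeholders rather than real OGB encodings, and storing each bond in both directions is an assumption made for this example. The (num_edges, 2) layout of edge_index follows the feature structure listed for this dataset.

```python
import numpy as np

# Toy chain of three atoms: 0 - 1 - 2.
# All feature values are placeholders, not real OGB encodings.
num_nodes = 3
node_feat = np.zeros((num_nodes, 9), dtype=np.float32)  # 9 features per atom

# Assumption for this sketch: each bond appears once per direction.
edge_index = np.array([[0, 1], [1, 0], [1, 2], [2, 1]], dtype=np.int64)
edge_feat = np.zeros((len(edge_index), 3), dtype=np.float32)  # 3 features per bond

# Count how often each atom appears as the first endpoint of an edge.
degree = np.bincount(edge_index[:, 0], minlength=num_nodes)
print(degree)  # [1 2 1]
```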

Prediction

The task is to predict 128 different biological activities, each a binary label (inactive/active). See [2] and [3] for details on these targets. Not all targets apply to every molecule; missing targets are indicated by NaNs.
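
Because missing targets are NaNs, a loss or metric usually has to skip those entries. The sketch below shows one possible way to do this with NumPy, assuming the model outputs raw logits with the same shape as the labels. The function name masked_bce_with_logits is hypothetical and is not part of TFDS or OGB.

```python
import numpy as np


def masked_bce_with_logits(logits, labels):
    """Mean binary cross-entropy over the entries whose label is not NaN."""
    mask = ~np.isnan(labels)
    # Replace NaNs with a dummy value so they cannot propagate into the sum.
    safe_labels = np.where(mask, labels, 0.0)
    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|)).
    per_entry = (
        np.maximum(logits, 0.0)
        - logits * safe_labels
        + np.log1p(np.exp(-np.abs(logits)))
    )
    # Guard against division by zero when every label is missing.
    return float(np.sum(per_entry * mask) / np.maximum(mask.sum(), 1))


# Two molecules, 128 targets each, only two targets actually measured.
labels = np.full((2, 128), np.nan)
labels[0, 0] = 1.0
labels[1, 5] = 0.0
logits = np.zeros((2, 128))

loss = masked_bce_with_logits(logits, labels)
print(round(loss, 4))  # 0.6931
```

With all logits at zero, every measured target contributes log(2), so the masked mean is log(2) no matter how many entries are missing.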

References

[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit

[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf

[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. 'MoleculeNet: a benchmark for molecular machine learning'. Chemical Science, 9(2):513-530, 2018.

Homepage: https://ogb.stanford.edu/docs/graphprop
Source code: tfds.datasets.ogbg_molpcba.Builder
Versions:
- 0.1.0: Initial release of experimental API.
- 0.1.1: Exposes the number of edges in each graph explicitly.
- 0.1.2: Add metadata field for GraphVisualizer.
- 0.1.3 (default): Add metadata field for names of individual tasks.
Download size: 37.70 MiB
Dataset size: 822.53 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	43,793
`'train'`	350,343
`'validation'`	43,793
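
The counts in the table work out to roughly an 80/10/10 split. A short check of that arithmetic, using only the numbers listed above:

```python
# Example counts copied from the splits table.
splits = {"train": 350_343, "validation": 43_793, "test": 43_793}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(total)  # 437929
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
# train: 80.0%
# validation: 10.0%
# test: 10.0%
```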

Feature structure:

FeaturesDict({
    'edge_feat': Tensor(shape=(None, 3), dtype=float32),
    'edge_index': Tensor(shape=(None, 2), dtype=int64),
    'labels': Tensor(shape=(128,), dtype=float32),
    'node_feat': Tensor(shape=(None, 9), dtype=float32),
    'num_edges': Tensor(shape=(None,), dtype=int64),
    'num_nodes': Tensor(shape=(None,), dtype=int64),
})
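
The structure above can be checked against a loaded example with a small helper. This is a sketch rather than an official utility: check_example and iter_examples are hypothetical names, and the check that edge_feat and edge_index have the same number of rows is an assumption (one feature row per edge). Loading uses tfds.load and tfds.as_numpy, and downloads the data on first use.

```python
import numpy as np

# Trailing dimensions taken from the feature structure above.
EXPECTED_WIDTH = {"edge_feat": 3, "edge_index": 2, "node_feat": 9}
NUM_TASKS = 128


def check_example(example):
    """Raise ValueError if `example` does not match the feature structure."""
    for key, width in EXPECTED_WIDTH.items():
        arr = np.asarray(example[key])
        if arr.ndim != 2 or arr.shape[1] != width:
            raise ValueError(f"{key}: expected (None, {width}), got {arr.shape}")
    labels = np.asarray(example["labels"])
    if labels.shape != (NUM_TASKS,):
        raise ValueError(f"labels: expected ({NUM_TASKS},), got {labels.shape}")
    # Assumption: one row of edge features per row of edge_index.
    if len(example["edge_feat"]) != len(example["edge_index"]):
        raise ValueError("edge_feat and edge_index disagree on the edge count")


def iter_examples(split="train"):
    """Yield examples as dicts of NumPy arrays."""
    # Imported lazily so check_example works without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load("ogbg_molpcba", split=split)
    yield from tfds.as_numpy(ds)
```

A typical use would be to call check_example on the first few items from iter_examples("validation") before building a training pipeline.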

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
edge_feat	Tensor	(None, 3)	float32
edge_index	Tensor	(None, 2)	int64
labels	Tensor	(128,)	float32
node_feat	Tensor	(None, 9)	float32
num_edges	Tensor	(None,)	int64
num_nodes	Tensor	(None,)	int64

Supervised keys (See as_supervised doc): None

Citation:

@inproceedings{DBLP:conf/nips/HuFZDRLCL20,
  author    = {Weihua Hu and
               Matthias Fey and
               Marinka Zitnik and
               Yuxiao Dong and
               Hongyu Ren and
               Bowen Liu and
               Michele Catasta and
               Jure Leskovec},
  editor    = {Hugo Larochelle and
               Marc Aurelio Ranzato and
               Raia Hadsell and
               Maria{-}Florina Balcan and
               Hsuan{-}Tien Lin},
  title     = {Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference
               on Neural Information Processing Systems 2020, NeurIPS 2020, December
               6-12, 2020, virtual},
  year      = {2020},
  url       = {https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html},
  timestamp = {Tue, 19 Jan 2021 15:57:06 +0100},
  biburl    = {https://dblp.org/rec/conf/nips/HuFZDRLCL20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}