List of compound datasets

2019/9/13

A list of machine learning and chemoinformatics datasets that I might use for practice later.

Compound DB

A group of databases containing structural information of compounds.

PubChem

PubChem is a database of chemical molecules.
Structure and description datasets for millions of compounds can be downloaded via FTP. PubChem stores small molecules with fewer than 1000 atoms and 1000 bonds.
More than 80 database sources contribute to PubChem's growth.

Source: Wikipedia

https://pubchem.ncbi.nlm.nih.gov/
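
Bulk structure downloads of this kind are commonly handled as SDF files, where each record carries its atom and bond counts in a fixed-width "counts line". The following is a minimal sketch, assuming V2000-style records, of reading those counts with only the standard library. The three-atom record is hand-written for illustration and is not taken from PubChem; the function name is hypothetical.

```python
# Hand-written ethanol heavy-atom record for illustration; not real PubChem data.
SAMPLE_SDF = "\n".join([
    "ethanol",
    "  hypothetical-example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
    "$$$$",
])


def count_atoms_and_bonds(sdf_text):
    """Return one (n_atoms, n_bonds) tuple per V2000 record in an SDF string."""
    counts = []
    # Records in an SDF file are separated by a line containing "$$$$".
    for record in sdf_text.split("$$$$"):
        for line in record.splitlines():
            if line.rstrip().endswith("V2000"):
                # Columns 1-3 hold the atom count, columns 4-6 the bond count.
                counts.append((int(line[0:3]), int(line[3:6])))
                break
    return counts


print(count_atoms_and_bonds(SAMPLE_SDF))
```

Because each count occupies only three characters, a V2000 record cannot describe more than 999 atoms or 999 bonds, which fits the size limit mentioned above.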

ChEMBL

A database of drugs and drug candidate compounds. It currently records about 1.8 million compounds and about 15 million activity data points. The 54 sub-datasets of various screening and assay data look easy to use.

https://www.ebi.ac.uk/chembl/

ZINC15 database

A dataset of drug-like organic compounds containing 3D information, originally developed for virtual screening by docking calculations. More than 7 million structures are listed.

http://zinc15.docking.org/
Related paper: ZINC 15 – Ligand Discovery for Everyone

Platinum dataset

A benchmark dataset for conformer generation. A conformation of each compound has to be generated (calculated) before a docking simulation, and this dataset is used to verify the accuracy of those conformations. It contains fewer than 5000 compounds, all of them protein-bound ligands.

Although it is a small dataset, it appears to be rich in structural diversity. It is not too heavy and seems ideal for practice. It can also be downloaded and used with RDKit.

http://biosig.unimelb.edu.au/platinum/
Related paper: High-Quality Dataset of Protein-Bound Ligand Conformations and Its Application to Benchmarking Conformer Ensemble Generators
Japanese reference article: Paper Memo – Benchmarking Commercial Conformer Ensemble Generators
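
Accuracy in this kind of benchmark is typically expressed as the RMSD between a generated conformer and the experimental one. The following is a minimal sketch, assuming the two atom lists are in the same order and already superimposed; a real evaluation would align the structures first, for example with RDKit. The coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    # Sum of squared distances between corresponding atoms.
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates: the "generated" conformer is shifted 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
generated = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
print(rmsd(reference, generated))
```

A lower value means the generated conformer is closer to the experimental one, so a generator can be scored by the best RMSD it reaches within its ensemble.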

Compound dataset

Datasets that pair compounds with a target variable such as an activity value.

Tox21

The dataset for the Tox21 Data Challenge 2014, a competition sponsored by the National Institutes of Health (NIH), the US Environmental Protection Agency (EPA), and the US Food and Drug Administration (FDA), in which participants competed on the accuracy of toxicity prediction from chemical structures.

It includes assay results for nuclear receptor reporter genes (ER, AR, aromatase, etc.) and stress response pathways (p53, ARE, HSE, etc.).
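
Since each compound has results for several assays and not every compound was measured in every assay, a useful first step is to count labels per assay. The following is a minimal sketch, assuming a CSV layout with one column per assay where 1 means active, 0 means inactive, and an empty cell means not measured. The column names and rows below are invented for illustration and are not real Tox21 records.

```python
import csv
import io

# Invented rows in the assumed layout; not real Tox21 data.
SAMPLE_CSV = """smiles,NR-AR,NR-ER,SR-p53
CCO,0,0,
c1ccccc1,0,,1
CC(=O)O,1,0,0
"""


def summarize_labels(csv_text, assay_columns):
    """Count active, inactive, and missing labels for each assay column."""
    summary = {
        name: {"active": 0, "inactive": 0, "missing": 0} for name in assay_columns
    }
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in assay_columns:
            value = (row.get(name) or "").strip()
            if value == "":
                summary[name]["missing"] += 1
            elif float(value) == 1.0:
                summary[name]["active"] += 1
            else:
                summary[name]["inactive"] += 1
    return summary


for assay, counts in summarize_labels(SAMPLE_CSV, ["NR-AR", "NR-ER", "SR-p53"]).items():
    print(assay, counts)
```

Counting the missing cells separately matters because treating them as inactive would distort the class balance of each assay.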

MoleculeNet

MoleculeNet is a collection of benchmark datasets designed to test machine learning predictions of molecular properties. It is based on multiple public databases and contains the following datasets:

  • QM7, QM8, QM9: datasets pairing chemical structures with quantum chemistry calculation outputs
  • Water solubility, logP
  • HIV replication inhibition, human β-secretase inhibition, etc.
  • Blood-brain barrier permeability, a database of marketed drugs and their side effects, Tox21, ToxCast, etc.

Kaggle: Predicting Molecular Properties

A dataset of coupling constants between pairs of atoms, used in the Kaggle competition. The compounds are given as xyz coordinate data (about 130,000 compounds and about 4.5 million atom pairs with coupling constants).
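
With xyz coordinates and a pair of atom indices, the distance between the two atoms is a natural first feature to compute. The following is a minimal sketch, assuming the usual xyz layout (atom count on the first line, a comment line, then one "element x y z" line per atom) and 0-based atom indices. The two-atom fragment and its coordinates are made up, and the function names are hypothetical.

```python
import math

# Hypothetical two-atom fragment in xyz layout; coordinates are made up.
SAMPLE_XYZ = """2
toy C-H fragment
C 0.0 0.0 0.0
H 0.0 0.0 1.0
"""


def parse_xyz(text):
    """Parse xyz text into a list of (element, (x, y, z)) entries."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    # Line 0 is the atom count and line 1 is a comment, so atoms start at line 2.
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, (float(x), float(y), float(z))))
    return atoms


def pair_distance(atoms, index_0, index_1):
    """Euclidean distance between two atoms given their 0-based indices."""
    return math.dist(atoms[index_0][1], atoms[index_1][1])


atoms = parse_xyz(SAMPLE_XYZ)
print(pair_distance(atoms, 0, 1))
```

Computing such features for millions of atom pairs is where the data volume starts to hurt, so it is worth trying the code on a handful of molecules first.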

Other references

Some of the above datasets are integrated into "DeepChem" and "Chainer Chemistry", deep learning frameworks for chemistry and biology. DeepChem supports MoleculeNet; Chainer Chemistry supports MoleculeNet, QM9, Tox21, and ZINC.

There is also a site that catalogs life science databases, covering not only compounds but also genomes and organisms.
Integbio database catalog

I participated in the Kaggle competition "Predicting Molecular Properties", but the data was huge (300 MB or more once features were added), calculations took a long time, and I had a hard time. For practicing visualization of compound data and building and validating machine learning models, I felt that MoleculeNet, Tox21, or even just the logP data would be quite sufficient.