This page provides archives of, or links to, several chemical databases of molecules. Each database addresses a specific problem (either classification or regression).
Dataset | Number of molecules | Mean size | Mean degree | Min size | Max size | Stereoisomerism | Problem type |
---|---|---|---|---|---|---|---|
PAH | 94 | 20.7 | 2.4 | 10 | 28 | No | Classification |
MAO | 68 | 18.4 | 2.1 | 11 | 27 | No | Classification |
PTC | 416 | 14.4 | 2.1 | 2 | 64 | No | Classification |
AIDS | 2000 | 15.7 | 2.1 | 2 | 95 | No | Classification |
Alkane | 150 | 8.9 | 1.8 | 1 | 10 | No | Regression |
Acyclic | 185 | 8.2 | 1.8 | 3 | 11 | No | Regression |
Chiral Acyclic | 35 | 21.29 | 1.98 | 14 | 32 | Yes | Regression |
Vitamin D | 69 | 76.91 | 2.05 | 68 | 88 | Yes | Regression |
ACE | 32 | 52 | 2.04 | 52 | 52 | Yes | Classification |
Steroid | 64 | 75.11 | 2.08 | 57 | 94 | Yes | Regression |
Monoterpens | 382 | 10 | No | Classification |
The results below were obtained for each method using a leave-one-out procedure with a two-class SVM: each of the 68 molecules of the dataset is classified in turn, with the remaining 67 molecules used as the training set.
No. | Method | Classification accuracy |
---|---|---|
(1) | Suard et al. (2002) | 80% (55/68) |
(2) | Vishwanathan et al. (2010) | 82% (56/68) |
(3) | Neuhaus and Bunke (2007) | 90% (61/68) |
(4) | Riesen et al. (2007) | 91% (62/68) |
(5) | Normalized standard Graph Laplacian kernel | 90% (61/68) |
(6) | Normalized fast Graph Laplacian kernel | 90% (61/68) |
(7) | Mahé and Vert (2008) | 96% (65/68) |
(8) | Gaüzère et al. (2012) | 94% (64/68) |
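
The leave-one-out protocol used for the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the reported figures: the Gram matrix `K` and labels `y` are toy values, and a nearest-neighbour rule on the kernel-induced distance stands in for the two-class SVM so that the example needs only the standard library.

```python
from typing import Callable, Sequence


def leave_one_out_accuracy(
    kernel: Sequence[Sequence[float]],
    labels: Sequence[int],
    predict: Callable[..., int],
) -> tuple:
    """Hold out each item once, train on the rest, count correct predictions."""
    n = len(labels)
    correct = 0
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        if predict(kernel, labels, train, held_out) == labels[held_out]:
            correct += 1
    return correct, correct / n


def nearest_neighbour(kernel, labels, train, test):
    """Stand-in classifier (NOT an SVM): label of the closest training item
    under the kernel-induced distance d(i, j)^2 = k(i,i) + k(j,j) - 2 k(i,j)."""
    def squared_distance(i, j):
        return kernel[i][i] + kernel[j][j] - 2.0 * kernel[i][j]

    closest = min(train, key=lambda i: squared_distance(i, test))
    return labels[closest]


# Toy Gram matrix for four hypothetical molecules (two per class).
K = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
y = [0, 0, 1, 1]

correct, accuracy = leave_one_out_accuracy(K, y, nearest_neighbour)
print(f"{accuracy:.0%} ({correct}/{len(y)})")
```

Any classifier that accepts a precomputed kernel can be passed as `predict`; the outer loop is the only part that reflects the protocol described in the text.
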
In both sets of results presented below, GLK stands for graph Laplacian kernel.
The results displayed below were obtained using several test sets, each composed of 10% of the database, with the remaining 90% used as the training set. For each test, the molecules composing the test and training sets were chosen randomly.
No. | Method | Average error (°C) | RMSE (°C) |
---|---|---|---|
(1) | Cherqaoui and Villemin (1994) | 3.11 | 3.70 |
(2) | Neuhaus and Bunke (2007) | 5.42 | 10.01 |
(3) | Riesen et al. (2007) | 5.27 | 7.10 |
(4) | Suard et al. (2002) | 4.66 | 6.21 |
(5) | Vishwanathan et al. (2010) | 10.61 | 16.28 |
(6) | Graph Laplacian kernel | 10.79 | 16.45 |
(7) | Mahé and Vert (2008) | 2.41 | 3.48 |
(8) | Gaüzère et al. (2012) | 1.41 | 1.92 |
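
The evaluation protocol behind the table above (random 90%/10% splits, then average error and RMSE) can be sketched as follows. Several details are assumptions because the page does not state them: the number of splits (`n_splits=10`), the pooling of errors across splits before averaging, and the toy target values. The predictor is a trivial stand-in that returns the mean of the training targets, not one of the kernels listed.

```python
import math
import random


def random_split(n, test_fraction, rng):
    """Return (train_indices, test_indices) for one random split."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_squared_error(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def evaluate(targets, fit_predict, n_splits=10, test_fraction=0.1, seed=0):
    """Pool prediction errors over several random splits (assumed protocol)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_splits):
        train, test = random_split(len(targets), test_fraction, rng)
        predictions = fit_predict(train, test)
        errors.extend(p - targets[i] for p, i in zip(predictions, test))
    return mean_absolute_error(errors), root_mean_squared_error(errors)


# Hypothetical boiling points in °C, NOT taken from the dataset.
targets = [float(10 * i) for i in range(20)]


def mean_predictor(train, test):
    """Stand-in model: predict the mean training target for every test item."""
    mean = sum(targets[i] for i in train) / len(train)
    return [mean for _ in test]


mae, rmse = evaluate(targets, mean_predictor)
print(f"average error = {mae:.2f} °C, RMSE = {rmse:.2f} °C")
```

The RMSE is never smaller than the average absolute error, which matches the ordering of the two columns in every row of the table.
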
No. | Method | Classification accuracy (%) |
---|---|---|
(1) | Treelet Kernel[5] | 71.3 |
(2) | Graph Edit distance[8] | 72 |
(3) | Cycles [16] | 63 |
(4) | Relevant Cycle graph [6] | 77.7 |
(5) | Relevant Cycle Hypergraph [7] | 76.3 |
(6) | Augmented Cycle [8] | 80.7 |
A dataset of acyclic molecules with chiral atoms. Molecules are provided with their optical rotation for a regression problem. Each molecule contains one or two chiral vertices (mean number: 1.06).
The standard deviation of the optical rotation angles is 38.25°, with values ranging from −89° to 78°.
No. | Method | RMSE (degrees) |
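One way to read the standard deviation quoted above is as a baseline: a model that always predicts the mean rotation has an RMSE equal to the population standard deviation of the targets. The sketch below checks this identity on hypothetical values; the rotations are made up for illustration, and whether the page reports the population or the sample standard deviation is not stated.

```python
import math
import statistics


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical optical rotations in degrees, NOT taken from the dataset.
rotations = [-89.0, -40.5, -12.0, 3.5, 25.0, 78.0]

mean_rotation = statistics.fmean(rotations)
baseline = rmse([mean_rotation] * len(rotations), rotations)
spread = statistics.pstdev(rotations)  # population standard deviation

print(f"constant-mean RMSE = {baseline:.3f}, std = {spread:.3f}")
```

Under this reading, an RMSE close to the standard deviation indicates a method that does little better than predicting the mean.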
---|---|---|
(1) | Mahé and Vert (2008) | 34.1 |
(2) | Gauzere B. et al, 2012 | 26.2 |
(3) | Brown J. et al. 2010 | 24.2 |
(4) | Grenier P.-A. et al, 2013 | 14.8 |
This dataset is composed of 69 vitamin D derivatives. Molecules are provided with their biological activity for a regression problem. After normalization, the standard deviation of the biological activity is 0.256259.
No. | Method | RMSE |
---|---|---|
(1) | Mahé and Vert (2008) | 0.251 |
(2) | Gauzere B. et al, 2012 | 0.271 |
(3) | Brown J. et al. 2010 | 0.184 |
(4) | Grenier P.-A. et al, ICPR 2014 | 0.194 |
(5) | Grenier P.-A. et al, ICPR 2014 kernel multiplied by Gauzere B. et al, 2012 kernel | 0.191 |
(6) | Grenier P.-A. et al, S+SPR2014 | 0.180 |
This dataset is composed of all the stereoisomers of perindoprilat. As this molecule has 5 stereocenters, the dataset contains 2^5 = 32 molecules. This is a classification problem: the molecules that inhibit the angiotensin-converting enzyme (ACE) form one class, and those that do not inhibit it form the other.
Please consult Castillo-Garit 2007 for further details on this dataset.
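The count of 32 molecules follows from assigning one of two configurations to each of the 5 stereocenters. A minimal sketch of that enumeration is given below; the R/S string encoding is an illustrative assumption and is not the representation used in the dataset files. The 2^n count holds only when no two assignments describe the same molecule, which the dataset size suggests is the case here.

```python
from itertools import product

N_STEREOCENTERS = 5  # number of stereocenters stated in the text

# Each stereocenter is labelled R or S; one string per stereoisomer.
configurations = ["".join(c) for c in product("RS", repeat=N_STEREOCENTERS)]

print(len(configurations))  # 32
print(configurations[:3])   # ['RRRRR', 'RRRRS', 'RRRSR']
```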
No. | Method | Accuracy (%) |
---|---|---|
(1) | Treelet Kernel (Gauzere B. et al, 2012) | 71.875 |
(2) | Brown J. et al. 2010 | 96.875 |
(3) | Grenier P.-A. et al, ICPR 2014 | 87.5 |
(4) | Grenier P.-A. et al, S+SPR2014 | 96.875 |
(5) | Interaction graph 1 (or 2), Grenier P.-A. et al, GbR2015 | 93.75 |
(6) | Interaction graph 1 +MKL, Grenier P.-A. et al, GbR2015 | 100 |
This dataset is composed of 64 steroids. One molecule was withdrawn from the original database because it differs from all the other molecules and thus biased the results. Molecules are provided with their molecular rotation for a regression problem. After normalization, the standard deviation of the molecular rotation is 0.208116. The mean number of stereocenters is 9.16.
Further information about this dataset may be found in Bernstein S. (1941).
Ref | Method | MAE/RMSE |
---|---|---|
Mahé, P., Vert, J.-P., 2008. | Tree pattern kernel | 0.030/0.056 |
Gauzere B. et al, 2012 | Treelet Kernel | 0.044/0.087 |
Brown J. et al. 2010 | Oriented tree pattern | 0.030/0.070 |
Grenier P.-A. et al, ICPR 2014 | Stereo kernel | 0.051/0.083 |
Grenier P.-A. et al, GbR 2015 | Graph of interaction 3 with Treelets | 0.039/0.069 |