This page provides archives of, or links to, several chemical databases of molecules. Each database addresses a specific problem (either classification or regression).
| Dataset | # molecules | mean size | mean degree | min size | max size | Stereoisomerism | Problem type |
|---|---|---|---|---|---|---|---|
| PAH | 94 | 20.7 | 2.4 | 10 | 28 | No | Classif. |
| MAO | 68 | 18.4 | 2.1 | 11 | 27 | No | Classif. |
| PTC | 416 | 14.4 | 2.1 | 2 | 64 | No | Classif. |
| AIDS | 2000 | 15.7 | 2.1 | 2 | 95 | No | Classif. |
| Alkane | 150 | 8.9 | 1.8 | 1 | 10 | No | Regression |
| Acyclic | 185 | 8.2 | 1.8 | 3 | 11 | No | Regression |
| Chiral Acyclic | 35 | 21.29 | 1.98 | 14 | 32 | Yes | Regression |
| Vitamin D | 69 | 76.91 | 2.05 | 68 | 88 | Yes | Regression |
| ACE | 32 | 52 | 2.04 | 52 | 52 | Yes | Classification |
| Steroid | 64 | 75.11 | 2.08 | 57 | 94 | Yes | Regression |
| Monoterpens | 382 | 10 | | | | No | Classification |
The results below were obtained for each method using a leave-one-out procedure with a two-class SVM. Each of the 68 molecules of the dataset is held out in turn and classified by an SVM trained on the other 67.
| | Method | Classification accuracy |
|---|---|---|
| (1) | Suard et al. (2002) | 80% (55/68) |
| (2) | Vishwanathan et al. (2010) | 82% (56/68) |
| (3) | Neuhaus and Bunke (2007) | 90% (61/68) |
| (4) | Riesen et al. (2007) | 91% (62/68) |
| (5) | Normalized standard Graph Laplacian kernel | 90% (61/68) |
| (6) | Normalized fast Graph Laplacian kernel | 90% (61/68) |
| (7) | Mahé and Vert (2008) | 96% (65/68) |
| (8) | Gaüzère et al. (2012) | 94% (64/68) |
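The leave-one-out protocol behind the table above can be sketched as follows. This is a minimal illustration, not the code used to produce the reported numbers: the function names, the toy data, and the nearest-neighbour stand-in classifier are all assumptions introduced here, and the stand-in is deliberately not an SVM.

```python
from typing import Callable, Sequence, Tuple


def leave_one_out_accuracy(
    samples: Sequence,
    labels: Sequence[int],
    fit_predict: Callable[[Sequence, Sequence[int], object], int],
) -> Tuple[int, int]:
    """Return (number of correct predictions, number of samples)."""
    correct = 0
    for i in range(len(samples)):
        # Train on every sample except the i-th one.
        train_x = [x for j, x in enumerate(samples) if j != i]
        train_y = [y for j, y in enumerate(labels) if j != i]
        # Predict the held-out sample and compare with its true label.
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct, len(samples)


def nearest_neighbour(train_x, train_y, query):
    """Hypothetical stand-in classifier (NOT an SVM): label of the closest value."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - query))
    return train_y[best]


# Toy one-dimensional data, invented for illustration only.
toy_x = [0.0, 0.2, 0.4, 1.0, 1.2, 1.4]
toy_y = [0, 0, 0, 1, 1, 1]
correct, total = leave_one_out_accuracy(toy_x, toy_y, nearest_neighbour)
print(f"{correct}/{total} correctly classified")
```

The table reports accuracy in the same "correct/total" form, with a total of 68. To reproduce an SVM-based result, `fit_predict` would have to wrap an SVM trained on a graph kernel; with scikit-learn, one possible route is `sklearn.svm.SVC(kernel="precomputed")` fed with the Gram matrix of the chosen kernel.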
In both results presented below, GLK stands for graph Laplacian kernel.
The results displayed below were obtained using several test sets, each composed of 10% of the database, with the remaining 90% used as the training set. For each test, the molecules composing the test and training sets were chosen randomly.
| | Method | Average error (°C) | RMSE (°C) |
|---|---|---|---|
| (1) | Cherqaoui and Villemin (1994) | 3.11 | 3.70 |
| (2) | Neuhaus and Bunke (2007) | 5.42 | 10.01 |
| (3) | Riesen et al. (2007) | 5.27 | 7.10 |
| (4) | Suard et al. (2002) | 4.66 | 6.21 |
| (5) | Vishwanathan et al. (2010) | 10.61 | 16.28 |
| (6) | Graph Laplacian kernel | 10.79 | 16.45 |
| (7) | Mahé and Vert (2008) | 2.41 | 3.48 |
| (8) | Gaüzère et al. (2012) | 1.41 | 1.92 |
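The repeated random 90%/10% evaluation described above can be sketched as follows. The number of splits, the pooling of errors over all splits, the function names, and the mean-value stand-in model are assumptions made for this illustration; the source does not state how the errors were aggregated.

```python
import math
import random


def random_split_errors(xs, ys, fit, n_splits=10, test_fraction=0.1, seed=0):
    """Return (average absolute error, RMSE) pooled over random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(xs)))
    abs_errors, sq_errors = [], []
    for _ in range(n_splits):
        # Draw a fresh random partition into test and training indices.
        indices = list(range(len(xs)))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        # `fit` returns a prediction function trained on the training part.
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            err = predict(xs[i]) - ys[i]
            abs_errors.append(abs(err))
            sq_errors.append(err * err)
    mae = sum(abs_errors) / len(abs_errors)
    rmse = math.sqrt(sum(sq_errors) / len(sq_errors))
    return mae, rmse


def fit_mean(train_x, train_y):
    """Hypothetical stand-in model: always predict the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


# Toy data, invented for illustration only.
toy_x = list(range(20))
toy_y = [3.0 * x + 1.0 for x in toy_x]
mae, rmse = random_split_errors(toy_x, toy_y, fit_mean)
print(f"average error = {mae:.2f}, RMSE = {rmse:.2f}")
```

The RMSE is never smaller than the average absolute error, which is consistent with every row of the table; a large gap between the two suggests that a few molecules are predicted much worse than the rest.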
| | Method | Classification accuracy |
|---|---|---|
| (1) | Treelet Kernel [5] | 71.3 |
| (2) | Graph Edit Distance [8] | 72 |
| (3) | Cycles [16] | 63 |
| (4) | Relevant Cycle Graph [6] | 77.7 |
| (5) | Relevant Cycle Hypergraph [7] | 76.3 |
| (6) | Augmented Cycle [8] | 80.7 |
A dataset of acyclic molecules with chiral atoms. Molecules are provided with their optical rotation for a regression problem. Each molecule contains one or two chiral vertices (mean number: 1.06).
The standard deviation of the optical rotation angles is 38.25°, with values ranging from −89° to 78°.
| | Method | RMSE (°) |
|---|---|---|
| (1) | Mahé and Vert (2008) | 34.1 |
| (2) | Gauzere B. et al, 2012 | 26.2 |
| (3) | Brown J. et al. 2010 | 24.2 |
| (4) | Grenier P.-A. et al, 2013 | 14.8 |
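The standard deviation quoted before the table gives a useful reference point for reading the RMSE column: a trivial model that always predicts the mean target value has an RMSE equal to the (population) standard deviation of the targets. The sketch below checks this on invented values; only the two extremes match the range quoted above, and whether the source used the population or the sample standard deviation is not stated.

```python
import math
import statistics


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Invented optical rotation angles in degrees, for illustration only.
toy_angles = [-89.0, -20.0, 5.0, 31.0, 78.0]

# Baseline: predict the mean angle for every molecule.
mean_angle = statistics.mean(toy_angles)
baseline = rmse([mean_angle] * len(toy_angles), toy_angles)

# The baseline RMSE coincides with the population standard deviation.
assert math.isclose(baseline, statistics.pstdev(toy_angles))
```

Read this way, an RMSE close to 38.25° indicates little improvement over predicting the mean, while a much lower value indicates that the method captures part of the relation between structure and optical rotation.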
This dataset is composed of 69 derivatives of vitamin D. Molecules are provided with their biological activity for a regression problem. After normalization, the standard deviation of the biological activity is 0.256259.
| | Method | RMSE |
|---|---|---|
| (1) | Mahé and Vert (2008) | 0.251 |
| (2) | Gauzere B. et al, 2012 | 0.271 |
| (3) | Brown J. et al. 2010 | 0.184 |
| (4) | Grenier P.-A. et al, ICPR 2014 | 0.194 |
| (5) | Grenier P.-A. et al, ICPR 2014 kernel multiplied by Gauzere B. et al, 2012 kernel | 0.191 |
| (6) | Grenier P.-A. et al, S+SPR2014 | 0.180 |
This dataset is composed of all the stereoisomers of perindoprilat. As this molecule has 5 stereocenters, the dataset contains 2^5 = 32 molecules. This is a classification problem: the molecules that inhibit the angiotensin-converting enzyme (ACE) form one class, and those that do not inhibit it form the other.
Please consult Castillo-Garit 2007 for further details on this dataset.
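The count of 32 molecules follows from assigning one of two configurations to each of the 5 stereocenters. The enumeration below is a minimal sketch that assumes each stereocenter independently takes the descriptor R or S and that no two assignments describe the same molecule, which is consistent with the dataset size stated above; the variable names are illustrative.

```python
from itertools import product

N_STEREOCENTERS = 5

# Every assignment of R or S to the five stereocenters (assumed independent).
configurations = list(product("RS", repeat=N_STEREOCENTERS))

print(len(configurations))            # number of stereoisomers enumerated
print("".join(configurations[0]))     # first assignment in the enumeration
```

Since all 32 molecules share the same atoms and bonds, a method that ignores stereochemistry sees them as identical graphs and cannot separate the two classes by structure alone.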
| | Method | Accuracy (%) |
|---|---|---|
| (1) | Treelet Kernel (Gauzere B. et al, 2012) | 71.875 |
| (2) | Brown J. et al. 2010 | 96.875 |
| (3) | Grenier P.-A. et al, ICPR 2014 | 87.5 |
| (4) | Grenier P.-A. et al, S+SPR2014 | 96.875 |
| (5) | Interaction graph 1 (or 2), Grenier P.-A. et al, GbR2015 | 93.75 |
| (6) | Interaction graph 1 +MKL, Grenier P.-A. et al, GbR2015 | 100 |
This dataset is composed of 64 steroids. One molecule was withdrawn from the original database because it differs from all the other molecules and thus biased the results. Molecules are provided with their molecular rotation for a regression problem. After normalization, the standard deviation of the molecular rotation is 0.208116. The mean number of stereocenters is 9.16.
Further information about this dataset may be found in Bernstein S. (1941).
| Ref | Method | MAE/RMSE |
|---|---|---|
| Mahé, P., Vert, J.-P., 2008. | Tree pattern kernel | 0.030/0.056 |
| Gauzere B. et al, 2012 | Treelet Kernel | 0.044/0.087 |
| Brown J. et al. 2010 | Oriented tree pattern | 0.030/0.070 |
| Grenier P.-A. et al, ICPR 2014 | Stereo kernel | 0.051/0.083 |
| Grenier P.-A. et al, GbR 2015 | Graph of interaction 3 with Treelets | 0.039/0.069 |