#### Bibliographic record number: 151922

## Modeling water solubility of molecules by ensemble of multivariate regression models

Modeling water solubility of molecules by ensemble of multivariate regression models // *15th European Symposium on QSAR and Molecular Modelling in Rational Design of Bioactive Molecules (The Euro-QSAR 2004) : Proceeding Book* / Aki-Sener, Esin ; Yalcin, Ismail (eds.).

Istanbul: The Euro-QSAR 2004, 2004. p. 349 (poster, not peer-reviewed, abstract, scientific)

CROSBI ID: **151922**

**Title**

Modeling water solubility of molecules by ensemble of multivariate regression models

**Authors**

Lučić, Bono ; Nadramija, Damir

**Type, subtype and category of work**

Conference abstracts, abstract, scientific

**Source**

15th European Symposium on QSAR and Molecular Modelling in Rational Design of Bioactive Molecules (The Euro-QSAR 2004) : Proceeding Book
/ Aki-Sener, Esin ; Yalcin, Ismail - Istanbul : The Euro-QSAR 2004, 2004, 349-349

**Conference**

European Symposium on QSAR & Molecular Modelling in Rational Design of Bioactive Molecules (15 ; 2004)

**Place and date**

Istanbul, Turkey, 05.-10.09.2004

**Type of participation**

Poster

**Type of review**

Not peer-reviewed

**Keywords**

solubility in water; SMILES structure; 3D structure; molecular descriptors; Dragon program; single multivariate regression models; ensemble of single multivariate models

**Abstract**

The relationship between molecular structure and water solubility was studied for 1297 organic compounds. The data set was partitioned into a training set (1039 molecules) and a test set (258 molecules), as was done by Liu and So in their neural network study [1]. Initial structures of the molecules were encoded as SMILES and converted into 3D structures by the CORINA program (www2.chemie.uni-erlangen.de/software/corina/), and more than 1000 initial descriptors were computed by the program Dragon 2.1 (http://www.disat.unimib.it/chm/). The initial set of descriptors was filtered to remove non-significant and highly inter-correlated descriptors; 123 descriptors remained after filtering. Finally, the best single linear multivariate regression (MR) models containing 1-7 descriptors were selected from the set of 123 descriptors by the CROMRsel program [2-4]. The standard error of estimate and the standard error of leave-one-out (LOO) cross-validation (CV) obtained on the training set (1039 molecules) are 0.739 and 0.746 log units, respectively. The best seven-descriptor model involves five topological descriptors, one atom-centered fragment descriptor and the calculated Moriguchi octanol-water partition coefficient. Using the best linear seven-descriptor model, we calculated predicted values for the external data set of 258 molecules and obtained the same result (Stst = 0.745 log units) as was obtained by the neural network model developed on the same sets of molecules [1]. It is important to note that the statistical parameter (Stst = 0.71 log units) given in ref. [1] (Table 1, page 1636) for the best neural network model was not computed correctly; the correct value is Stst = 0.745 log units. However, our seven-descriptor MR model is linear (containing only eight optimized parameters) and much simpler than the above-mentioned neural network model (a nonlinear model with seven descriptors and 19 optimized parameters).
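The descriptor-filtering step mentioned above can be illustrated with a minimal sketch. The abstract does not state the actual criteria or thresholds used, so everything here is an assumption made for illustration: the function names `pearson` and `filter_descriptors`, the reading of "non-significant" as a low absolute correlation with the target, the two threshold values, and the toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(columns, target, min_abs_r_target=0.05, max_abs_r_pair=0.95):
    """Return names of descriptors kept after two assumed filters.

    columns: dict mapping descriptor name -> list of values (one per molecule).
    target:  list of experimental values (e.g. log solubility).
    Both thresholds are illustrative assumptions, not values from the abstract.
    """
    # Filter 1 (assumed): drop descriptors barely correlated with the target.
    scored = [(abs(pearson(values, target)), name) for name, values in columns.items()]
    candidates = sorted((item for item in scored if item[0] >= min_abs_r_target),
                        reverse=True)
    # Filter 2 (assumed): greedily keep a descriptor only if it is not highly
    # correlated with any descriptor that has already been kept.
    kept = []
    for _, name in candidates:
        if all(abs(pearson(columns[name], columns[other])) <= max_abs_r_pair
               for other in kept):
            kept.append(name)
    return kept


# Toy data: "a" and "b" are nearly collinear, "const" is constant.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],
    "c": [5, 1, 4, 2, 3],
    "const": [1, 1, 1, 1, 1],
}
target = [1.1, 1.9, 3.2, 3.9, 5.0]
kept = filter_descriptors(columns, target)
```

On the toy data, only one of the two nearly collinear descriptors survives, and the constant column is dropped. A greedy pass is only one of several reasonable ways to resolve inter-correlated pairs.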
To assess the reliability of the solubility prediction obtained for each molecule by the best MR model, we selected the K top models (each having i descriptors, i = 1, ..., 10). In this study, the total number (K) of top models (selected as 'the best' according to the lowest fitted and LOO cross-validated standard errors) was varied between 50 and 104. Selection of the top linear models into the ensemble of multivariate regression models was performed by the CROMRsel program [2-3], starting from the initial set of 123 descriptors. In addition, nonlinear MR models were constructed by selecting the top models from the descriptors involved in the best linear models (obtained in the first step, when the best single MR models were selected) together with their nonlinear terms (squares and cross-products). For each of the top K models, new predictions on the test set were obtained. Using the K predictions of water solubility for each molecule in the test set, we computed the mean value and the range of the K predictions on a logarithmic scale. Generating ensemble MR models in this way, we obtained important information about the reliability of the solubility prediction for a molecule. However, interpreting ensemble MR models (like any other ensemble model) is more difficult than interpreting single MR models, which will be discussed.

[1] Liu R, So S-S. Development of Quantitative Structure-Property Relationship Models for Early ADME Evaluation in Drug Discovery. 1. Aqueous Solubility. J Chem Inf Comput Sci 2001; 41:1633-1639.

[2] Lučić B, Trinajstić N. Multivariate regression outperforms several robust architectures of neural networks in QSAR modeling. J Chem Inf Comput Sci 1999; 39:121-132.

[3] Lučić B, Amić D, Trinajstić N. Nonlinear multivariate regression outperforms several concisely designed neural networks on three QSPR data sets. J Chem Inf Comput Sci 2000; 40:403-413.

[4] Lučić B, Nadramija D, Bašic I, Trinajstić N. Toward generating simpler QSAR models: nonlinear multivariate regression versus several neural network ensembles and some related methods. J Chem Inf Comput Sci 2003; 43:1094-1102.
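The ensemble step described in the abstract (the mean and the range of the K predictions for each test-set molecule) can be sketched as follows. This is a minimal illustration, not the CROMRsel implementation: the function name `ensemble_summary`, the input layout, and the toy numbers are assumptions.

```python
def ensemble_summary(predictions):
    """Summarise K model predictions for each molecule.

    predictions: list of K lists; predictions[k][j] is the log-solubility
                 predicted by model k for molecule j (assumed layout).
    Returns a list with one (mean, range) tuple per molecule, where
    range = max - min of the K predictions on the logarithmic scale.
    """
    k = len(predictions)
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same number of molecules")
    summary = []
    for j in range(n):
        values = [p[j] for p in predictions]
        mean = sum(values) / k
        spread = max(values) - min(values)
        summary.append((mean, spread))
    return summary


# Toy example: K = 3 models, 2 molecules (made-up values in log units).
preds = [
    [-1.0, -3.0],
    [-1.2, -2.0],
    [-0.8, -4.0],
]
summary = ensemble_summary(preds)
for mean, spread in summary:
    print(f"mean = {mean:.2f}, range = {spread:.2f}")
```

In this toy case the three models agree closely on the first molecule (narrow range) and disagree on the second (wide range), which is the kind of per-molecule reliability signal the abstract refers to.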

**Original language**

English

**Scientific fields**

Chemistry

**WORK AFFILIATIONS**

**Project / topic**

0098034 - 0098034 (, )

**Institutions**

Institut "Ruđer Bošković", Zagreb

PLIVA HRVATSKA d.o.o.