CroRIS - CROSBI

Data source: CROSBI ✓

Machine learning in prediction of intrinsic aqueous solubility of drug‐like compounds: Generalization, complexity, or predictive ability? (CROSBI ID 294426)

Journal contribution | original scientific paper | international peer review

Lovrić, Mario ; Pavlović, Kristina ; Žuvela, Petar ; Spataru, Adrian ; Lučić, Bono ; Kern, Roman ; Wong, Ming Wah
Machine learning in prediction of intrinsic aqueous solubility of drug‐like compounds: Generalization, complexity, or predictive ability? // Journal of chemometrics, 35 (2021), 7-8; e3349, 16. doi: 10.1002/cem.3349

Responsibility data

Authors

Lovrić, Mario ; Pavlović, Kristina ; Žuvela, Petar ; Spataru, Adrian ; Lučić, Bono ; Kern, Roman ; Wong, Ming Wah

Basic data in the original language
Basic data in other languages

Language

English

Title

Machine learning in prediction of intrinsic aqueous solubility of drug‐like compounds: Generalization, complexity, or predictive ability?

Abstract

We present a collection of publicly available intrinsic aqueous solubility data of 829 drug‐like compounds. Four different machine learning algorithms (random forests [RF], LightGBM, partial least squares, and least absolute shrinkage and selection operator [LASSO]) coupled with multistage permutation importance for feature selection and Bayesian hyperparameter optimization were used for the prediction of solubility based on chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set, with a root mean square error, RMSE(test), of 0.70 log points, an R²(test) of 0.80, and 105 features. Taking into account the number of descriptors as well, an RF model achieves the best balance between complexity and predictive ability, with an RMSE(test) of 0.72 log points, an R²(test) of 0.78, and only 17 features. On a more aggressive test set (principal component analysis [PCA]‐based split), better generalization was observed for the RF model. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features, and test performance. Out of the two best learners, a consensus model was built exhibiting the best predictive ability and generalization, with an RMSE(test) of 0.67 log points and an R²(test) of 0.81.
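The consensus idea and the two reported metrics can be illustrated with a minimal sketch. The abstract does not say how the two learners were combined, so the unweighted mean used here is an assumption, and the function names (`rmse`, `r2`, `consensus`) and all numeric values are hypothetical toy inputs, not data or results from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def consensus(pred_a, pred_b):
    """Assumed consensus rule: unweighted mean of two models' predictions."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]

# Toy log-solubility values for illustration only; not from the paper's data set.
y_obs = [-2.0, -3.5, -1.0, -4.2, -2.8]
pred_lasso = [-2.4, -3.1, -1.5, -4.0, -3.3]  # stand-in for a LASSO model's output
pred_rf = [-1.7, -3.8, -0.8, -4.6, -2.5]     # stand-in for an RF model's output
pred_cons = consensus(pred_lasso, pred_rf)

for name, pred in [("LASSO", pred_lasso), ("RF", pred_rf), ("consensus", pred_cons)]:
    print(f"{name}: RMSE={rmse(y_obs, pred):.2f}, R2={r2(y_obs, pred):.2f}")
```

With a simple mean, the consensus RMSE can never exceed the larger of the two individual RMSE values, and it drops below both when the models' errors partly cancel, which is one plausible reason a consensus of two different learners can outperform either alone.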

Keywords

consensus modeling ; LASSO ; LightGBM ; PCA ; permutation importance ; QSAR ; random forests

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Journal

Journal of chemometrics

Volume (issue)

35 (7-8)

Year

2021

Article number

e3349

Number of pages

16

Publication status

published

ISSN

0886-9383

e-ISSN

1099-128X

DOI

10.1002/cem.3349

Related records

Related persons

Mario Lovrić (author)

Bono Lučić (author)

Related institutions

Institut Ruđer Bošković (098) (author's institution)

Related projects

Napredne metode i tehnologije u znanosti o podatcima i kooperativnim sustavima (result of work on the project)

Field

Interdisciplinary natural sciences, Chemistry, Computer science

Links

analyticalsciencejournals.onlinelibrary.wiley.com

doi.org

fulir.irb.hr

Indexing

Scopus

Current Contents Connect (CCC)

Web of Science Core Collection, Science Citation Index Expanded (WoSCC-SCI-Exp)

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)
