Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints

Lovrić, Mario; Đuričić, Tomislav; Tran, Han T. N.; Hussain, Hussain; Lacić, Emanuel; Rasmussen, Morten A.; Kern, Roman

Data source: CROSBI

CROSBI ID: 297286

Journal contribution | other | international peer review

Lovrić, Mario ; Đuričić, Tomislav ; Tran, Han T. N. ; Hussain, Hussain ; Lacić, Emanuel ; Rasmussen, Morten A. ; Kern, Roman Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints // Pharmaceuticals, 14 (2021), 8; 758, 17. doi: 10.3390/ph14080758

Responsibility details

Authors

Lovrić, Mario ; Đuričić, Tomislav ; Tran, Han T. N. ; Hussain, Hussain ; Lacić, Emanuel ; Rasmussen, Morten A. ; Kern, Roman

Basic data in the original language
Basic data in other languages

Language

English

Title

Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints

Abstract

Dimensionality reduction methods contribute substantially to knowledge generation in high-dimensional modeling scenarios across many disciplines. A lower-dimensional representation (also called an embedding) requires fewer computing resources in downstream machine learning tasks, leading to faster training, lower complexity, and statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques, principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and variational autoencoders (VAEs), for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not use additional preprocessing of features. Inspired by the success of transfer learning in several fields, we further study the performance of embedders when trained on an external dataset of chemical compounds. To better understand their characteristics, we evaluate the embedders with different embedding dimensionalities and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be used alongside established techniques such as PCA and VAE as a pre-compression technique in the toxicology domain. Nevertheless, the generative VAE model shows an advantage in pre-compressing the data with respect to classification accuracy.
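The workflow summarized in the abstract (fit an unsupervised embedder on an external set of compounds, then reuse it to compress fingerprints for a downstream classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it covers only the PCA variant, the fingerprints and labels are random synthetic stand-ins rather than Tox21 data, and the choice of logistic regression, the 1024-bit fingerprint length, and the function name `embed_and_classify` are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real data: random 1024-bit "fingerprints".
X_external = rng.integers(0, 2, size=(2000, 1024)).astype(float)  # unlabeled external compounds
X_task = rng.integers(0, 2, size=(600, 1024)).astype(float)       # labeled task compounds
# Synthetic binary label that depends on the first eight bits only.
y_task = (X_task[:, :8].sum(axis=1) > 4).astype(int)


def embed_and_classify(n_components):
    """Fit PCA on the external set, embed the task set, and score a classifier."""
    # Unsupervised step: the embedder never sees the task labels.
    embedder = PCA(n_components=n_components, random_state=0).fit(X_external)
    Z = embedder.transform(X_task)

    # Supervised step: train and evaluate on the compressed representation.
    Z_train, Z_test, y_train, y_test = train_test_split(
        Z, y_task, test_size=0.25, stratify=y_task, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(Z_test)[:, 1])
    return Z.shape, auc


if __name__ == "__main__":
    # Vary the embedding dimensionality, as the abstract describes.
    for k in (8, 32, 128):
        shape, auc = embed_and_classify(k)
        print(f"n_components={k}: embedded shape={shape}, ROC AUC={auc:.3f}")
```

Because the data here are random, the printed scores say nothing about the methods themselves; the sketch only shows where the external dataset, the embedding dimensionality, and the downstream classifier enter the pipeline. Swapping PCA for UMAP or a VAE would change only the embedder step.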

Keywords

manifold learning ; machine learning ; rdkit ; embeddings ; Tox21 ; principal component analysis ; autoencoder

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication details

Journal

Pharmaceuticals

Volume (issue)

14 (8)

Year

2021

Article number

758

Number of pages

17

Publication status

published

e-ISSN

1424-8247

DOI

10.3390/ph14080758

Open-access publication cost

APC

1600.00 CHF

Associations

Related persons

Mario Lovrić (CroRIS ID: 37884; MBZ: 397424) (author)

Related institutions

Institut za antropologiju (196) (author's institution)

Field

Information and communication sciences, Interdisciplinary natural sciences, Chemistry

Links

doi.org

mdpi.com

Indexing

Scopus

Current Contents Connect (CCC)

Web of Science Core Collection, Science Citation Index Expanded (WoSCC-SCI-Exp)

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)