Bibliographic record number: 36099
A new efficient approach for variable selection based on multiregression : prediction of gas chromatographic retention time and response factors // Journal of chemical information and computer sciences, 39 (1999), 3; 610-621 (international peer review, article, scientific)
CROSBI ID: 36099
Title
A new efficient approach for variable selection based on multiregression : prediction of gas chromatographic retention time and response factors
Authors
Lučić, Bono ; Trinajstić, Nenad ; Sild, S. ; Karelson, M. ; Katritzky, A.R.
Source
Journal of chemical information and computer sciences (0095-2338) 39 (1999), 3; 610-621
Type, subtype and category of work
Journal papers, article, scientific
Abstract
The selection of the most relevant variables is a frequent problem in the
analysis of chemical data, especially now, given the large amounts of
data created by increased computer power and analytical resolution. A
novel procedure for variable selection based on multiregression (MR)
analysis is developed and applied to the quantitative structure-property
relationship (QSPR) modeling of gas chromatographic retention times tR
and Dietz response factors RF on 152 diverse chemical compounds. Using 296
descriptors generated by the CODESSA program, "absolutely the best" linear
MR models containing from 1 to 5 descriptors were first selected
(approximately 2 x 10^10 models were checked), and then "the best" linear stepwise MR
models with six and seven descriptors were obtained through "i by i"
stepwise selection. In this paper i was varied from 1 to 4, so that in
each next step i descriptors were added to the previously selected
descriptors. Nonlinear models were developed by the inclusion of
cross-products of initial descriptors. We selected as the most important
descriptors for tR the number of C-H and C-X bonds, connectivity indices
of order 3, the highest normal mode vibrational frequency, and the
rotational entropy of the molecule at 300 K. In the case of RF modeling
the most important descriptors are those related to the relative number
and weight of effective C atoms, the orbital electronic population, and
the bond order and valency of C and H atoms. Comparison with the best
six-descriptor models obtained by the normal CODESSA procedure shows that
the nonlinear seven-descriptor MR models now obtained achieve 30% (0.3520 vs
0.5032) and 12% (0.0472 vs 0.0530) lower standard errors of estimate for tR
and RF, respectively. Our novel procedure of selecting a small number of
the most important descriptors from a data set allows us to extract a
larger amount of useful information than with the procedure implemented in
CODESSA. Thus, our new procedure enables the selection of the best
possible MR models from 10^10 possibilities. Through the introduction of
cross-product terms, we obtained nonlinear MR models which are superior to
the corresponding linear models.
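The selection scheme described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the selection criterion is assumed to be the standard error of estimate of an ordinary least-squares fit with an intercept, and cross-products are assumed to mean products of two distinct descriptors. A brute-force search like this one is practical only for small descriptor sets; the abstract does not say how the authors made the search over 296 descriptors efficient.

```python
from itertools import combinations
from math import comb

import numpy as np


def n_models(n_descriptors, max_size):
    """Count all models containing 1..max_size descriptors."""
    return sum(comb(n_descriptors, k) for k in range(1, max_size + 1))


def std_error(X, y):
    """Standard error of estimate of a least-squares fit with intercept
    (assumed criterion; degrees of freedom = n - p - 1)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(np.sqrt(resid @ resid / (len(y) - A.shape[1])))


def best_subset(X, y, k):
    """Check every k-descriptor model and return the column indices
    of the one with the lowest standard error."""
    return min(combinations(range(X.shape[1]), k),
               key=lambda cols: std_error(X[:, list(cols)], y))


def stepwise_i_by_i(X, y, selected, i):
    """One 'i by i' step: add the i descriptors that, together with
    the already selected ones, give the lowest standard error."""
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    best_add = min(combinations(remaining, i),
                   key=lambda add: std_error(X[:, list(selected) + list(add)], y))
    return tuple(selected) + best_add


def with_cross_products(X):
    """Append products of all pairs of distinct descriptors as new columns
    (squares are left out; that choice is an assumption)."""
    pairs = combinations(range(X.shape[1]), 2)
    return np.column_stack([X] + [X[:, a] * X[:, b] for a, b in pairs])
```

Under these assumptions, `n_models(296, 5)` comes to about 1.9 x 10^10, in line with the approximately 2 x 10^10 models quoted in the abstract. A six-descriptor model would then be obtained by calling `stepwise_i_by_i` with `i=1` on the best five-descriptor subset, and nonlinear models by running the same selection on the output of `with_cross_products`.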
Original language
English
Scientific fields
Chemistry
Journal indexed in:
- Current Contents Connect (CCC)
- Web of Science Core Collection (WoSCC)
- Science Citation Index Expanded (SCI-EXP)
- SCI-EXP, SSCI and/or A&HCI
- Scopus
- MEDLINE