FloatX: A C++ Library for Customized Floating-Point Arithmetic (CROSBI ID 269527)
Journal contribution | original scientific paper | international peer review
Responsibility details
Flegar, Goran ; Scheidegger, Florian ; Novaković, Vedran ; Mariani, Giovanni ; Tomás, Andrés E. ; Malossi, A. Cristiano I. ; Quintana-Ortí, Enrique S.
English
FloatX: A C++ Library for Customized Floating-Point Arithmetic
We present FloatX (Float eXtended), a C++ framework to investigate the effect of leveraging customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754, with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as a back end to preserve efficiency, and incurs no storage overhead. The paper discusses in detail the design principles, programming interface, and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX's usage and benefits via several case studies from well-known numerical dense linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related to image processing and text recognition.
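The idea of a reduced format hosted on a hardware type can be illustrated with a minimal sketch. It does not use FloatX's own interface: the helper `round_to_format` and its parameters are hypothetical names introduced here, and the sketch assumes the default round-to-nearest-even mode and a target format no wider than `double`. It rounds a `double` to the nearest value representable with a given number of exponent and significand bits, laid out as in binary IEEE 754.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of FloatX): round x to the nearest value
// representable with `exp_bits` exponent bits and `sig_bits` explicit
// significand bits, using double as the hardware back end.
double round_to_format(double x, int exp_bits, int sig_bits) {
    // Assumption: the target format fits inside IEEE 754 binary64.
    assert(exp_bits >= 2 && exp_bits <= 11);
    assert(sig_bits >= 1 && sig_bits <= 52);

    // Zeros, infinities and NaN are passed through unchanged.
    if (x == 0.0 || !std::isfinite(x)) return x;

    const int bias = (1 << (exp_bits - 1)) - 1;
    const int emax = bias;      // largest unbiased exponent
    const int emin = 1 - bias;  // smallest normal unbiased exponent

    int e = 0;
    std::frexp(x, &e);          // |x| = m * 2^e with 0.5 <= m < 1
    int exp = e - 1;            // |x| = f * 2^exp with 1 <= f < 2
    if (exp < emin) exp = emin; // subnormal range uses a fixed spacing

    // Spacing between neighbouring representable values around x.
    const double quantum = std::ldexp(1.0, exp - sig_bits);

    // Dividing and multiplying by a power of two is exact, so the only
    // rounding happens in nearbyint (ties to even in the default mode).
    const double r = std::nearbyint(x / quantum) * quantum;

    // Largest finite value of the target format; anything beyond overflows.
    const double max_val =
        std::ldexp(2.0 - std::ldexp(1.0, -sig_bits), emax);
    if (std::fabs(r) > max_val) {
        return std::copysign(std::numeric_limits<double>::infinity(), x);
    }
    return r;
}
```

With 5 exponent bits and 10 significand bits (the IEEE 754 half-precision layout), `round_to_format(0.1, 5, 10)` yields 1638/16384, and 65520.0 overflows to infinity. Unlike the library described in the abstract, this sketch stores the result in a full `double`, so it shows only the rounding behaviour, not the storage savings.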
Mathematics of computing ; Mathematical software ; Arbitrary-precision arithmetic
The paper was accepted for publication on 01.10.2019. The paper was published online on 09.12.2019.
Publication details
45 (4)
2019.
40
23
published
0098-3500
1557-7295
10.1145/3368086