Bibliographic record number: 1076524
Design of performance optimized transform and quantization computation blocks for video compression in heterogeneous high performance computing systems
Design of performance optimized transform and quantization computation blocks for video compression in heterogeneous high performance computing systems, 2020, doctoral dissertation, Fakultet elektrotehnike i računarstva, Zagreb
CROSBI ID: 1076524
Title
Design of performance optimized transform and quantization computation blocks for video compression in heterogeneous high performance computing systems
Authors
Čobrnić, Mate
Type, subtype and category of work
Theses, doctoral dissertation
Faculty
Fakultet elektrotehnike i računarstva
Place
Zagreb
Date
19.06
Year
2020
Pages
107
Mentor
Mario Kovač
Keywords
Video Coding ; High Efficiency Video Coding (HEVC) ; Integer Discrete Cosine Transform (DCT) ; Heterogeneous Computing ; Hardware Acceleration
Abstract
An analysis of today's Internet traffic shows that digital video content prevails. Its dominance will continue to grow in the coming years and reach 80% of all traffic by 2021, which corresponds to about one million minutes of Internet video per second. Video processing devices are therefore expected to provide and support improved compression capability. This will relieve the pressure on storage systems and communication networks while creating the preconditions for further development of video services. Transform and quantization is one of the most compute-intensive parts of modern hybrid video coding systems. The compression capability of this computation block is improved by using complex algorithms, at the expense of increased implementation complexity. Design requirements for higher throughput, reduced communication latency and low power consumption cannot be met by homogeneous systems, so heterogeneous multiprocessor high performance systems emerge as a solution. This thesis presents an area-efficient reusable architecture for the integer discrete cosine transform and quantization, as well as a highly performance-optimized kernel designed for execution on a GPU. In the hardware architecture, optimization is based on exploiting the symmetry and subset properties of the transform matrix. The proposed multiply-accumulate architecture is fully pipelined. It provides a two-way interface over which the processing system can control the data path of the transform process and receive feedback about utilization from the device. The proposed architecture is implemented on an FPGA platform, where it achieves a throughput of 815 Msps and can support real-time encoding of a 4K UHD@30 fps video sequence.
For the GPU implementation, the performance optimization strategy involved all three aspects of parallel design: exposing as much of the algorithm's intrinsic parallelism as possible, exploiting high-throughput memory, and using instructions efficiently. It combined efficient mapping of transform blocks to thread blocks with efficient vectorized access patterns to shared memory for all transform sizes. Two different GPUs were used to evaluate the proposed implementation. Speedup factors over the CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times, respectively.
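The symmetry and subset properties mentioned in the abstract can be illustrated with a short Python sketch. It is not the thesis's hardware architecture or GPU kernel; it only shows, for the 8-point HEVC core transform, why these properties reduce the work. The coefficient values are those of the HEVC 8-point integer DCT matrix, while the function names (`matvec`, `dct8_direct`, `dct8_butterfly`) are hypothetical and chosen for illustration. Scaling, rounding and quantization are omitted.

```python
# Coefficients of the HEVC 8-point integer DCT (core transform matrix).
DCT8 = [
    [64,  64,  64,  64,  64,  64,  64,  64],
    [89,  75,  50,  18, -18, -50, -75, -89],
    [83,  36, -36, -83, -83, -36,  36,  83],
    [75, -18, -89, -50,  50,  89,  18, -75],
    [64, -64, -64,  64,  64, -64, -64,  64],
    [50, -89,  18,  75, -75, -18,  89, -50],
    [36, -83,  83, -36, -36,  83, -83,  36],
    [18, -50,  75, -89,  89, -75,  50, -18],
]

# Subset property: the left half of the even rows of the 8-point matrix
# is exactly the 4-point matrix, so smaller transforms can reuse it.
DCT4 = [row[:4] for row in DCT8[0::2]]

# Left half of the odd rows; these rows are antisymmetric.
ODD4 = [row[:4] for row in DCT8[1::2]]


def matvec(matrix, vector):
    """Plain integer matrix-vector product."""
    return [sum(c * x for c, x in zip(row, vector)) for row in matrix]


def dct8_direct(x):
    """Reference: full 8x8 product, 64 multiplications."""
    return matvec(DCT8, x)


def dct8_butterfly(x):
    """Symmetry-based version: two 4x4 products, 32 multiplications."""
    sums = [x[i] + x[7 - i] for i in range(4)]   # feeds the symmetric (even) rows
    diffs = [x[i] - x[7 - i] for i in range(4)]  # feeds the antisymmetric (odd) rows
    out = [0] * 8
    out[0::2] = matvec(DCT4, sums)
    out[1::2] = matvec(ODD4, diffs)
    return out


if __name__ == "__main__":
    sample = [12, -7, 30, 4, -15, 9, 0, 21]  # arbitrary residual samples
    assert dct8_butterfly(sample) == dct8_direct(sample)
    print(dct8_butterfly(sample))
```

The even part could be decomposed again in the same way, since the 4-point matrix has the same even/odd row structure; the sketch stops at one level to keep the idea visible.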
Original language
English
Scientific fields
Computing
AFFILIATIONS
Institutions:
Fakultet elektrotehnike i računarstva, Zagreb