Title
APU/GPU Heterogeneous Systems Trade offs /
Author
Ali, Omar Shaaban Ibrahim.
Preparation Committee
Researcher / Omar Shaaban Ibrahim Ali
Supervisor / Mohamed Moness Ali Bayoumi
Supervisor / Aziza Ibrahim Hussein
Supervisor / Hassan Ali Younes El-Ansary
Subject
Parallel programming (Computer science). High performance computing.
Publication Date
2018.
Number of Pages
111 p.
Language
English
Degree
Master's
Specialization
Electrical and Electronic Engineering
Approval Date
1/1/2018
Approval Venue
Minia University - Faculty of Engineering - Electrical Engineering
Table of Contents
Abstract

Answering the question of how much potential the APU has for accelerating regular and irregular algorithms is not a trivial matter; it is a daunting task. Numerous aspects and different angles of the problem need to be studied carefully.
This work was divided into two parts. In the first part, we used the astrophysical N-body problem as a case study, with Barnes-Hut (BH) as an irregular algorithm and All-Pairs as a regular algorithm. In the second part, we extended this study by further exploring the Intel HD 5500 APU, conducting a concentrated study of the device's underlying architecture to deduce its characteristics, to understand how to fully leverage the device and reach its maximum performance, and finally to determine its performance limitations and boundaries.
In the first part, a serial implementation of BH was first tested on the CPU. In addition, two parallel implementations were tested for each algorithm: one for the GPU and one for the APU.
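For illustration, a minimal OpenCL C sketch of the regular All-Pairs force kernel is given below. The float4 layout (mass stored in w), the softening constant, and all identifiers are assumptions made for the sketch, not the thesis implementation.

    /* All-Pairs: every work-item accumulates the force on one body
       from all n bodies -- an O(n^2), fully regular access pattern. */
    __kernel void all_pairs(__global const float4 *pos, /* xyz = position, w = mass */
                            __global float4 *acc,
                            const int n)
    {
        int i = get_global_id(0);
        if (i >= n) return;
        float4 pi = pos[i];
        float3 a = (float3)(0.0f);
        for (int j = 0; j < n; ++j) {
            float4 pj = pos[j];
            float3 r = pj.xyz - pi.xyz;
            float d2 = dot(r, r) + 1e-6f;    /* softening avoids divide-by-zero */
            float inv = rsqrt(d2);
            a += pj.w * inv * inv * inv * r; /* G folded into the masses */
        }
        acc[i] = (float4)(a, 0.0f);
    }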
The results clearly showed that the dedicated GPU outperformed the integrated APU in all cases. However, the APU showed some potential for a smaller dataset of roughly 30k particles. Although the APU did not outperform the GPU at this smaller dataset, this might be acceptable performance for some other regular or irregular applications (more tests need to be conducted to fully prove this), making the APU a cheaper solution, in terms of FLOPS per dollar and power consumption, than an expensive discrete GPU. In addition, lowering the arithmetic intensity, coupled with the use of local memory and the device's maximum WG size, improved APU performance by 94.6% over the original implementation. Hence, the APU is a good choice for algorithms, or applications relying on irregular data structures with relatively mid-sized datasets, that have been fully optimized to exploit all of the APU's resources and make maximum use of all of the device's features (i.e., achieving the highest occupancy that can be reached for the problem).
We need to emphasize that higher occupancy does not automatically translate into higher performance gains. GPUs demand a certain number of warps per multiprocessor to hide instruction pipeline latency, and targeting higher occupancy might cause the performance bottleneck to migrate from one part of the code to another. In our case, increasing the WG size of the GPU to the maximum (Table 7.4) enhanced performance by only 4.45%; on other problems performance might even drop, because larger WGs increase contention on the memory controller and can increase cache misses. Consequently, optimizations should be made after the code is entirely written and should target a global rather than a local impact: the code must first be profiled thoroughly to determine the hot spots and bottlenecks that affect performance most heavily, and effort should be concentrated on those spots only.
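As a concrete illustration of the WG-size tuning mentioned above, the host-side C sketch below queries the kernel's maximum work-group size and launches with it. The helper name is hypothetical and error checking is elided.

    #include <CL/cl.h>

    /* Launch `kernel` over `n` work-items using the maximum work-group
       size the kernel supports on `device` (a sketch; errors ignored). */
    static void launch_with_max_wg(cl_command_queue queue, cl_kernel kernel,
                                   cl_device_id device, size_t n)
    {
        size_t wg = 0;
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(wg), &wg, NULL);
        size_t global = ((n + wg - 1) / wg) * wg; /* round up to a WG multiple */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &wg,
                               0, NULL, NULL);
    }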
The second part of this study was subdivided into two sub-sections:
I: Peak capabilities of the device, which includes:
- Memory operations, such as memory read/write, copy bandwidth, and host-to-device and device-to-host copies (a measurement sketch follows this list).
- Arithmetic pipeline, such as the device's maximum compute FLOPs and floating-point mathematical operations for both scalars and vectors.
II: Computation patterns, including Sparse Linear Algebra, Dense Linear Algebra, and N-Body Methods, plus the OCL 2.0 shared virtual memory feature across three applications (Advanced Encryption Standard, Finite Impulse Response, and Histogram of colored images).
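The copy-bandwidth measurements in sub-section I can be sketched as follows, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE; the function name is hypothetical and error checks are elided.

    #include <CL/cl.h>

    /* Time one host-to-device copy with event profiling and return GB/s
       (bytes per nanosecond equals GB/s). */
    static double h2d_bandwidth_gbs(cl_command_queue queue, cl_mem dst,
                                    const void *src, size_t bytes)
    {
        cl_event ev;
        clEnqueueWriteBuffer(queue, dst, CL_TRUE, 0, bytes, src, 0, NULL, &ev);
        cl_ulong t0 = 0, t1 = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(t1), &t1, NULL);
        clReleaseEvent(ev);
        return (double)bytes / (double)(t1 - t0);
    }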
We concluded that vector sizes of 2 and 4 perform better for all operations tested; this was supported by inspecting the compiler's generated PTX and reporting the SIMD width of each operation. Only single precision (SP) supports all scalar and vector-sized FLOP operations; both 24-bit and 32-bit operations support vectors up to size 4, and 64-bit operations support only scalars and size-2 vectors. However, the performance of 64-bit operations was very poor, indicating that they are emulated on the device and there is no dedicated hardware for 64-bit operations. Pinned memory showed lower latency than paged memory for both H2D and D2H copies.
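The pinned-memory path can be sketched as below: CL_MEM_ALLOC_HOST_PTR asks the runtime for page-locked host memory that the device can transfer from directly, which is the usual explanation for the lower copy latency. The helper is an assumption, not the benchmark code.

    #include <CL/cl.h>

    /* Allocate a pinned host buffer and map it for host access.
       Returns the host pointer; the cl_mem handle is kept in *buf_out. */
    static void *alloc_pinned(cl_context ctx, cl_command_queue queue,
                              size_t bytes, cl_mem *buf_out)
    {
        cl_int err;
        *buf_out = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
        /* Mapping yields a host pointer backed by the pinned allocation. */
        return clEnqueueMapBuffer(queue, *buf_out, CL_TRUE,
                                  CL_MAP_READ | CL_MAP_WRITE, 0, bytes,
                                  0, NULL, NULL, &err);
    }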
NT and NN matrix multiplication performed comparably on the GPU, whereas on the APU the NT variant performed better than NN. This implies that row-order computation has faster memory access than column-order and is more suitable for the APU than for the GPU in this case.
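The NT access pattern can be illustrated with the OpenCL C sketch below: with B stored transposed, the inner loop walks both operands in row order. The kernel is illustrative, not the benchmarked code.

    /* NT matrix multiply: C = A * B with B supplied transposed (Bt),
       so both A and Bt are read row-wise, i.e. with unit stride. */
    __kernel void matmul_nt(__global const float *A,
                            __global const float *Bt, /* B^T, row-major */
                            __global float *C,
                            const int n)
    {
        int row = get_global_id(1);
        int col = get_global_id(0);
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * Bt[col * n + k];
        C[row * n + col] = sum;
    }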
The Scan dwarf reported better performance on the discrete GPU than on the APU, while the MD application performed noticeably well on the APU.
The OCL 2.0 SVM feature performed better for all three applications tested (AES, FIR, and HIST). It proved a particularly strong point where the algorithm works on data that swings back and forth between the host and the device, or operates on the data in multiple passes, as in the FIR application, where the OCL 2.0 implementation performed noticeably better than the OCL 1.2 one.
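A minimal coarse-grained SVM sketch, assuming ctx, queue, kernel, bytes, and global already exist and with error checks elided, shows why data that moves back and forth benefits: host and device share one allocation, so no explicit buffer copies are enqueued between passes.

    #include <CL/cl.h>

    float *signal = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, bytes, 0);

    /* Host access to coarse-grained SVM is bracketed by map/unmap. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, signal, bytes, 0, NULL, NULL);
    /* ... fill `signal` on the host ... */
    clEnqueueSVMUnmap(queue, signal, 0, NULL, NULL);

    /* The device uses the same pointer; no clEnqueueWriteBuffer needed. */
    clSetKernelArgSVMPointer(kernel, 0, signal);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, signal);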