Introduction to High Performance Scientific Computing
This is a textbook that teaches the bridging topics between numerical analysis, parallel computing, code performance, and large-scale applications.
number of disciplines and skill sets; correspondingly, being successful at using high performance computing in science requires at least elementary knowledge of, and skills in, all these areas. Computations stem from an application context, so some acquaintance with physics and the engineering sciences is desirable. Problems in these application areas are typically translated into linear algebraic, and sometimes combinatorial, problems, so a computational scientist needs knowledge
now repeatedly reused, and are therefore more likely to remain in the cache. This rearranged code displays better temporal locality in its use of x[i].

Spatial locality

The concept of spatial locality is slightly more involved. A program is said to exhibit spatial locality if it references memory that is 'close' to memory it already referenced. In the classical von Neumann architecture with only a processor and memory, spatial locality should be irrelevant, since one address in memory
languages to machine instructions. However, on occasion we will discuss how a program at a high level can be written to ensure efficiency at the low level. In scientific computing, we typically do not pay much attention to program code, focusing almost exclusively on data and how it is moved about during program execution. For most practical purposes it is as if program and data are stored separately. The little that is essential about instruction handling can be described as follows. The
90 percent. However, it should be noted that many scientific codes do not feature the dense linear system solution kernel, so the performance on this benchmark is not indicative of the performance on a typical code. Linear system solution through iterative methods (section 5.5), for instance, is much less efficient in a flops-per-second sense, being dominated by the bandwidth between CPU and memory (a bandwidth-bound algorithm). One implementation of the Linpack benchmark that is often used is
as
• load the value of b from memory into a register,
• load the value of c from memory into another register,
• compute the sum and write that into yet another register, and
• write the sum value back to the memory location of c.
Looking at assembly code (for instance the output of a compiler), you see the explicit load, compute, and store instructions. Compute instructions such as add or multiply only operate on registers; for instance

    addl %eax, %edx

7. Actually, a[i] is loaded before it