A. Types of parallelism
Bit-level parallelism
From the
advent of very-large-scale integration (VLSI)
computer-chip fabrication technology in the 1970s until about 1986, speed-up in
computer architecture was driven by doubling computer word size—the amount of information
the processor can manipulate per cycle. Increasing the word size reduces the
number of instructions the processor must execute to perform an operation on
variables whose sizes are greater than the length of the word. For example,
where an 8-bit
processor must add two 16-bit integers, the processor must first add the 8 lower-order
bits from each integer using the standard addition instruction, then add the
8 higher-order bits using an add-with-carry instruction and the carry bit
from the lower-order addition; thus, an 8-bit processor requires two
instructions to complete a single operation, where a 16-bit processor would be
able to complete the operation with a single instruction.
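As a rough illustration, the following C sketch mimics that two-step sequence in software; the function and variable names are invented for this example, and the carry handling corresponds to what the add-with-carry instruction does in hardware.

#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit integers using only 8-bit quantities, mimicking an
   8-bit processor's add / add-with-carry instruction pair. */
uint16_t add16_on_8bit(uint16_t a, uint16_t b) {
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint16_t lo_sum = (uint16_t)a_lo + b_lo;   /* ADD: lower-order bytes    */
    uint8_t  carry  = lo_sum >> 8;             /* carry bit out of that add */
    uint8_t  hi_sum = a_hi + b_hi + carry;     /* ADC: higher-order bytes   */

    return ((uint16_t)hi_sum << 8) | (lo_sum & 0xFF);
}

int main(void) {
    printf("%u\n", (unsigned)add16_on_8bit(300, 700));   /* prints 1000 */
    return 0;
}

A 16-bit processor performs the same addition in a single instruction, which is the speed-up that wider words provide.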
Instruction-level parallelism
A canonical five-stage pipeline in a RISC machine (IF =
Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access,
WB = Register write back)
A computer program is, in essence, a stream of instructions executed by a processor. These
instructions can be re-ordered and combined into groups which
are then executed in parallel without changing the result of the program. This
is known as instruction-level parallelism. Advances in instruction-level
parallelism dominated computer architecture from the mid-1980s until the
mid-1990s.
Modern
processors have multi-stage instruction pipelines. Each stage in the
pipeline corresponds to a different action the processor performs on that
instruction in that stage; a processor with an N-stage pipeline can have up to
N different instructions at different stages of completion. The canonical
example of a pipelined processor is a RISC processor, with five
stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4
processor had a 35-stage pipeline.
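As a small, invented illustration of the idea: in the fragment below the first three statements have no data dependences on one another, so a pipelined or superscalar processor is free to overlap them, while the final statement must wait for all three results.

/* Illustration only: independent statements can occupy different
   pipeline stages (or issue slots) at the same time. */
int combine(int x, int y, int u, int v, int p, int q) {
    int a = x + y;    /* independent of b and c */
    int b = u * v;    /* independent of a and c */
    int c = p - q;    /* independent of a and b */
    return a + b + c; /* depends on a, b, and c */
}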
Task parallelism
Task
parallelism is the characteristic of a parallel program that "entirely
different calculations can be performed on either the same or different sets of
data". This contrasts with data parallelism, where the same calculation is
performed on the same or different sets of data. Task parallelism does not
usually scale with the size of a problem.
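As a minimal sketch (assuming a C compiler with OpenMP support, which is discussed further in section D), the program below expresses task parallelism with OpenMP sections: two entirely different calculations run concurrently over the same data. The array contents and section bodies are invented for illustration.

#include <omp.h>
#include <stdio.h>

int main(void) {
    double data[1000];
    for (int i = 0; i < 1000; i++) data[i] = i * 0.5;

    double sum = 0.0, sum_sq = 0.0;

    /* Task parallelism: each section performs a different calculation
       on the same data, and the sections may run concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < 1000; i++) sum += data[i]; }

        #pragma omp section
        { for (int i = 0; i < 1000; i++) sum_sq += data[i] * data[i]; }
    }

    printf("sum = %f, sum of squares = %f\n", sum, sum_sq);
    return 0;
}

With GCC or Clang this compiles with the -fopenmp flag; if the flag is omitted, the pragmas are ignored and the two sections simply run one after the other.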
B. Distributed computing
A distributed computer (also known as a distributed
memory multiprocessor) is a distributed memory computer system in which the
processing elements are connected by a network. Distributed computers are
highly scalable.
A logical view of a Non-Uniform Memory Access (NUMA)
architecture. Processors in one directory can access that directory's memory
with less latency than they can access the other directory's memory.
D. Parallel programming languages
Concurrent programming languages,
libraries, APIs, and parallel programming models (such as Algorithmic Skeletons) have been created
for programming parallel computers. These can generally be divided into classes
based on the assumptions they make about the underlying memory
architecture—shared memory, distributed memory, or shared distributed memory.
Shared memory programming languages communicate by manipulating shared memory
variables. Distributed memory uses message
passing. POSIX Threads and OpenMP are two
of the most widely used shared memory APIs, whereas Message Passing Interface (MPI) is the
most widely used message-passing system API. One concept used in parallel programming is the future, where one part of a
program promises to deliver a required datum to another part of the program at
some future time.
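As a minimal sketch of the message-passing model (assuming an MPI implementation such as MPICH or Open MPI is available), the program below sends one integer from process 0 to process 1; the payload value and message tag are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;                 /* datum to communicate */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int received;
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", received);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (for example, mpirun -np 2 ./a.out); there is no shared variable here, so the only way the datum reaches process 1 is through the explicit send and receive.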
CAPS entreprise and PathScale are also
coordinating their efforts to make HMPP (Hybrid Multicore Parallel
Programming) directives an open standard called OpenHMPP.
The OpenHMPP
directive-based programming model offers a syntax to efficiently offload
computations on hardware accelerators and to optimize data movement to/from the
hardware memory. OpenHMPP directives describe remote procedure calls (RPCs) on an
accelerator device (e.g. a GPU) or, more generally, a set of cores. The directives
annotate C or Fortran code to describe two sets of functionalities: the
offloading of procedures (denoted codelets) onto a remote device and the
optimization of data transfers between the CPU main memory and the accelerator
memory.
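The sketch below shows roughly what this looks like in C; the directive clauses follow the general OpenHMPP pattern but may differ in detail between implementations, and the codelet label, function, and data are invented for illustration.

/* Declare a codelet: a function that can be offloaded to an
   accelerator (here targeting CUDA); the array argument is copied
   to the device and back around the remote call. */
#pragma hmpp scale codelet, target=CUDA, args[v].io=inout
static void scale(int n, float alpha, float v[n]) {
    for (int i = 0; i < n; i++)
        v[i] *= alpha;
}

int main(void) {
    float data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (float)i;

    /* The callsite directive turns this call into an RPC-like
       invocation on the accelerator device. */
    #pragma hmpp scale callsite
    scale(1024, 2.0f, data);

    return 0;
}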
E. Introductory GPU programming with CUDA
General-purpose computing on graphics processing units (GPGPU, rarely GPGP or GP²U) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
Using CUDA, GPUs can be used for general-purpose processing (i.e., not exclusively graphics); this approach is known as GPGPU. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly.
Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
OpenCL is the currently dominant open general-purpose GPU computing language. The dominant proprietary framework is Nvidia's CUDA.
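As a minimal CUDA sketch of this throughput-oriented style, the kernel below adds two vectors with one thread per element; the kernel and variable names are invented for this example.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread handles a single element: many concurrent threads,
   each doing a small amount of work. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Copy inputs to the GPU, launch the kernel, copy the result back. */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);   /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Compiled with nvcc, the same source file also contains ordinary host code, which is why CUDA programs mix serial CPU sections with massively parallel kernel launches.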
References: http://en.wikipedia.org/wiki/