Most modern processors are in denial about two critical aspects of machine organization: parallel execution and hierarchical memory. These processors present an illusion of sequential execution and uniform, flat memory. The evolution of these sequential, latency-optimized processors is at an end, and their performance is now increasing only slowly over time. In contrast, the performance of throughput-optimized processors, such as GPUs, continues to scale at historical rates. Throughput processors embrace, rather than deny, parallelism and the memory hierarchy to realize their performance and efficiency advantage over conventional processors. Throughput processors have hundreds of cores today and will have thousands of cores by 2015. They will deliver most of the performance, and most of the user value, in future computer systems.
This talk will discuss some of the challenges and opportunities in the architecture and programming of future throughput processors. In these processors, performance derives from parallelism, and efficiency derives from locality. Parallelism can take advantage of the plentiful and inexpensive arithmetic units in a throughput processor. Without locality, however, bandwidth quickly becomes a bottleneck. Communication bandwidth, not arithmetic, is the critical resource in a modern computing system; it dominates cost, performance, and power. This talk will discuss the exploitation of parallelism and locality with examples drawn from the Imagine and Merrimac projects, from NVIDIA GPUs, and from three generations of stream programming systems.
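The claim that bandwidth, not arithmetic, is the limiting resource can be illustrated with a back-of-the-envelope roofline estimate. The sketch below is not drawn from the talk; the hardware figures (peak throughput, memory bandwidth) and the arithmetic intensities are illustrative assumptions chosen to show how locality shifts a kernel from bandwidth-bound to arithmetic-bound.

```python
# Minimal roofline sketch, assuming illustrative hardware numbers:
# attainable performance is capped by either peak arithmetic throughput
# or by memory bandwidth times arithmetic intensity (flops per byte moved).

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: min of the compute ceiling and the bandwidth ceiling."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical throughput processor: 1000 GFLOP/s peak, 100 GB/s DRAM bandwidth.
peak, bw = 1000.0, 100.0

# A streaming vector sum moves ~12 bytes per flop (two loads, one store of
# 4-byte floats), so without on-chip reuse it is bandwidth-bound:
low = attainable_gflops(peak, bw, 1 / 12)   # well under 10 GFLOP/s

# A blocked matrix multiply that reuses operands on chip raises the
# arithmetic intensity, and only then does arithmetic become the limit:
high = attainable_gflops(peak, bw, 32.0)    # hits the 1000 GFLOP/s ceiling
```

Under these assumed numbers, the plentiful arithmetic units sit mostly idle on the low-intensity kernel; exploiting locality is what lets a program approach the machine's peak.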