[hcs-d] TALK: CUDA Optimization: A Case Study of High-Performance Sorting (Duane Merrill from University of Virginia, Tuesday April 26th, 7:35 PM, Harvard MD G125)

Nicolas Pinto pinto at mit.edu
Sat Apr 23 18:20:40 EDT 2011

Title: CUDA Optimization: A Case Study of High-Performance Sorting
Speaker: Duane Merrill (University of Virginia)

Date: 4-26-2011
Time: 7:35 PM
Location: Harvard Maxwell Dworkin G125 (http://j.mp/eCgV66)

Harvard CS264 2011 Guest Lecture Series
"Massively Parallel Computing" Course
Host: Nicolas Pinto (Harvard, MIT)


In this presentation, we use our implementation of high-performance
radix sorting as a case study for illustrating advanced design
patterns and idioms. These techniques have allowed us to demonstrate
Fermi sorting rates that exceed 1.0 billion 32-bit keys per second
(and over 770 million key-value pairs per second), making Fermi the
fastest fully programmable micro-architecture for this genre of
sorting problems. Although the CUDA programming model is elegantly
decoupled from any particular hardware configuration, we present
techniques for exploiting knowledge of the NVIDIA GPU machine model in
order to produce more efficient implementations. Our design patterns
enable the compiler to specialize a single program text for a variety
of architectures, resulting in target code that “fits” the underlying
hardware significantly better than more general approaches. In
particular, we discuss strategies for kernel fusion, warp-synchronous
programming, flexible granularity via meta-programming, algorithm
serialization, and data-movement tuning.

Speaker biography:

Duane Merrill is a Ph.D. candidate at the University of Virginia,
Department of Computer Science. His advisor is Professor Andrew
Grimshaw. His current research interests lie in parallel and
high-performance computing, specifically in regard to programming
models and algorithmic primitives for GPU architectures.  His
dissertation work investigates efficient strategies and algorithms for
solving classes of fine-grained, parallel problems that are typically
thought to be poorly suited to the bulk-synchronous GPU machine
model.  Much of his prior academic work has involved concurrent
systems in one form or another, including grid and distributed
computing; virtual machines and hypervisor technologies; operating
systems and meta-systems; and security architecture and protocols.

Nicolas Pinto, PhD
