^{[NEW]} Watch video and slides from the GTC'12 presentation by Alexey Titov.
^{[NEW]} NCSA deploys GPU-enabled TeraChem software on the Forge cluster. (Link)
^{[NEW]} TeraChem is now available for download.
TeraChem publications, upcoming presentations, in the news
Detailed simulations based on the
principles of quantum mechanics play an ever-increasing role in suggesting, guiding, and
explaining experiments in chemistry and materials science. In fact, quantum chemistry is one of
the major consumers of CPU cycles at national supercomputer centers. The rise to prominence of
computational chemistry was largely driven by early demonstrations of the predictive power of
quantum mechanics applied to chemical problems and the tremendous advances in computing
power over the past decades. However,
limited computational resources remain a serious obstacle to the application of quantum
chemistry in problems of widespread importance, such as the design of more effective drugs to
treat diseases, or new catalysts for applications such as fuel cells or
environmental remediation. Thus, there is considerable impetus to seek relief for this bottleneck
in any way possible, including both the development of new and more effective algorithms and
also the exploration of new computer architectures.
The past decade has seen a tremendous increase in the
computing requirements of consumer video games, and this
demand is being met through novel hardware architectures
in the form of proprietary consoles and graphics cards.
Offerings such as the Sony PlayStation 3 (designed around
IBM's Cell processor) and the nVidia
GeForce 8800 GTX
graphics card are excellent examples, both of which may be
characterized as stream processors. Stream processing is a
generalization of the single instruction multiple data (SIMD)
vector processing model which formed the core of the Cray-1
supercomputer. Applications are organized into streams and
kernels, representing blocks of data and code transformations,
respectively. The kernel typically comprises a tight loop
of relatively few instructions. Streams of data are then
processed in pipelined and parallel fashion by many processors
executing a small number (possibly only one) of kernels.
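To make the stream/kernel terminology concrete, here is a minimal serial sketch in plain C (the function names are ours, purely for illustration; on a real stream processor the loop iterations would run in parallel across many processing elements):

```c
#include <stddef.h>

/* Kernel: a tight loop body of a few instructions, applied to every
   element of a stream.  Here it is a single multiply-add. */
static double axpb_kernel(double x, double a, double b) {
    return a * x + b;
}

/* Stream processing: the same kernel is applied to a whole stream of
   data.  A GPU would execute these iterations in parallel. */
void process_stream(const double *in, double *out, size_t n,
                    double a, double b) {
    for (size_t i = 0; i < n; ++i)
        out[i] = axpb_kernel(in[i], a, b);
}
```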
Since a graphics card typically costs less than a single
CPU used in conventional scientific clusters, it is tempting
to consider the use of graphics cards for computational
chemistry. The earliest attempts to use graphics processing
units (GPUs) for non-graphical computing in fields outside
of chemistry were largely stymied by limited precision
and difficulty of programming. The former problem has been
partially remedied, and the latest GPUs support 32-bit
floating point arithmetic. The next generation of GPUs and
stream processors from nVidia and AMD has already been
announced and will extend this support to 64-bit. The latter
problem of programming difficulty has been largely removed
by nVidia's recent introduction of the Compute Unified
Device Architecture (CUDA), which provides a relatively
simple programming interface that can be called from the
standard C language.
The GeForce 8800 GTX
(Figure 1)
consists of 16 independent streaming
multiprocessors (SMs), running at a clock speed of
1.35 GHz, which is comparable to the conventional CPUs
commonly used as the basis for scientific computing clusters. Each SM is a SIMD unit
with eight scalar
processors and one instruction unit. At each clock cycle, the
instruction unit of an SM broadcasts the same instruction to
all eight of its scalar processor units, which then operate on
different data. Each SM can process several blocks of threads concurrently,
but all the threads in a given block are guaranteed to
be executed on a single SM. Threads within the same block
are thereby able to communicate with each other very
efficiently using fast on-chip shared memory and are
furthermore able to synchronize their execution. Perhaps the most detailed
descriptions of the nVidia GeForce GPU architecture and
the CUDA API are provided in the CUDA
Programming Guide.
Figure 1. Schematic block diagram of the nVidia GeForce 8800 GTX.
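A rough serial analogue of this execution model can be written in C (purely illustrative; the names and structure are ours, not CUDA's): the outer loop stands in for the grid of thread blocks, the inner loops for the eight scalar processors of an SM, and a per-block array for the fast on-chip shared memory.

```c
#include <stddef.h>

#define THREADS_PER_BLOCK 8  /* mirrors the 8 scalar processors per SM */

/* Each "block" loads its slice of the input into a private array that
   stands in for shared memory, then reduces it to a partial sum.  On
   the GPU, a __syncthreads() barrier would separate the load phase
   from the phase where threads read each other's entries. */
void block_partial_sums(const double *in, double *block_sums,
                        size_t nblocks) {
    for (size_t block = 0; block < nblocks; ++block) {
        double shared[THREADS_PER_BLOCK]; /* stand-in for shared memory */
        for (int t = 0; t < THREADS_PER_BLOCK; ++t)
            shared[t] = in[block * THREADS_PER_BLOCK + t];
        double sum = 0.0;                 /* reduction over the block */
        for (int t = 0; t < THREADS_PER_BLOCK; ++t)
            sum += shared[t];
        block_sums[block] = sum;
    }
}
```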
Two-electron repulsion integral (ERI) problem
Two of the most basic questions in chemistry are "Where are the electrons?" and "Where are
the nuclei?"
Electronic structure
theory, i.e. quantum chemistry, focuses on the first of these.
Because the electrons are very light, the laws of quantum mechanics must be applied and they
are described with an electronic wavefunction determined from solution of the time-independent
Schrödinger equation. As usual in quantum mechanics, the absolute square of this wavefunction
is interpreted as a probability distribution for the positions of the electrons. Once the electronic
distribution is known for a fixed nuclear configuration, it is straightforward to calculate the
resulting forces on the nuclei. Thus, the answer to the second question follows from the answer
to the first, through either a search for the arrangement of the nuclei which minimizes the energy
(molecular geometry optimization) or solution of the classical
Newtonian equations
of motion.
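As a toy illustration of the geometry optimization step (not TeraChem's optimizer), the C sketch below relaxes a single bond length by steepest descent on a harmonic potential E(r) = k/2 (r − r0)^{2}; in a real calculation the force would be obtained from the converged electronic distribution.

```c
#include <math.h>

/* Steepest-descent geometry optimization for one degree of freedom.
   The force on the "nucleus" is F = -dE/dr = -k (r - r0); we step
   along the force until it (nearly) vanishes at the minimum. */
double optimize_bond(double r, double r0, double k,
                     double step, int max_iter) {
    for (int i = 0; i < max_iter; ++i) {
        double force = -k * (r - r0);
        if (fabs(force) < 1e-10)   /* converged: forces vanish */
            break;
        r += step * force;         /* move downhill in energy */
    }
    return r;
}
```

Starting from a stretched bond of 1.5 Å, the sketch converges to whatever equilibrium distance r0 is supplied, mirroring the force-driven search described above.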
In many electronic structure methods, the electronic distribution function is linearly expanded over
so-called Gaussian type basis functions (also known as Gaussian type orbitals, or GTOs), centered
on atomic nuclei R:

χ_{lmn}(r) = (x − R_{x})^{l} (y − R_{y})^{m} (z − R_{z})^{n} e^{−α|r − R|^{2}}

where α is the Gaussian exponent and the integers l, m, and n represent the angular momentum of the
orbital (normalization omitted). The total angular momentum is given by L = l + m + n and is often referred to as
s, p, d, etc. for L = 0, 1, 2, ... respectively.
The unknown expansion coefficients are then found via a subsequent iterative procedure
(the Hartree-Fock method, for example),
which requires evaluation of a large number of two-electron repulsion integrals (ERIs):

(μν|λσ) = ∫∫ χ_{μ}(r_{1}) χ_{ν}(r_{1}) r_{12}^{−1} χ_{λ}(r_{2}) χ_{σ}(r_{2}) dr_{1} dr_{2}
Even though the Gaussian form of the basis functions leads to
analytic expressions for the two-electron integrals, there are O(N^{4}) such integrals to be
evaluated, where N grows linearly with the size of the molecule under consideration. In practice,
many of these integrals are small and can be neglected, but the number of non-negligible
integrals still grows faster than O(N^{2}), making their evaluation a critical bottleneck in quantum
chemistry.
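The scaling is easy to check. Exploiting the eight-fold permutational symmetry (ij|kl) = (ji|kl) = (ij|lk) = (kl|ij) = ..., the number of symmetry-unique ERIs for N basis functions is P(P+1)/2 with P = N(N+1)/2 unique index pairs, which the following helper (ours, for illustration) computes:

```c
/* Count of symmetry-unique two-electron integrals (ij|kl) for n
   basis functions: pairs = n(n+1)/2 unique (ij) combinations, and
   pairs(pairs+1)/2 unique pairs-of-pairs. */
unsigned long long unique_eris(unsigned long long n) {
    unsigned long long pairs = n * (n + 1) / 2;
    return pairs * (pairs + 1) / 2;
}
```

For N = 64 this gives about 2.2 × 10^{6} unique integrals and for N = 192 about 1.7 × 10^{8}, consistent with the O(10^{6}) and O(10^{8}) counts quoted below for the hydrogen lattice test case.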
We have previously explored three different algorithms
(Figure 2)
to evaluate the O(N^{4}) ERIs over
contracted basis functions and store them in the GPU memory (see
Publications
for details). The algorithms were tested on a system composed of 64 hydrogen atoms arranged on a 4x4x4 lattice. Two basis sets
were used: the first (denoted STO-6G) had six s-type Gaussian basis functions for each contracted basis
function, with one contracted basis function per atom. The second (denoted 6-311G) has three
contracted s-type Gaussian basis functions per atom, and these contracted functions are combinations of three, one, and
one Gaussian basis functions, respectively. These two basis sets represent highly contracted and
relatively uncontracted basis sets and serve to show how the performance of the algorithms is
affected by the degree of contraction in the basis set. For the hydrogen atom lattice test case, the
number of contracted basis functions is 64 and 192 for the STO-6G and 6-311G basis sets,
respectively. This leads to O(10^{6}) and O(10^{8}) ERIs over contracted basis functions.
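The contraction degree translates directly into primitive work: a contracted ERI over four functions with contraction degrees K1..K4 is a sum of K1·K2·K3·K4 primitive integrals. A trivial helper (ours, for illustration):

```c
/* Primitive ERIs contributing to one contracted ERI, given the
   contraction degrees of the four contracted basis functions. */
unsigned long long primitives_per_eri(unsigned k1, unsigned k2,
                                      unsigned k3, unsigned k4) {
    return (unsigned long long)k1 * k2 * k3 * k4;
}
```

For STO-6G every contracted ERI thus requires 6^{4} = 1296 primitive integrals, while the singly contracted functions of 6-311G need as few as one, which is one reason the mapping schemes perform so differently on the two basis sets.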
A benchmark test performed on the evaluation of these integrals showed that the current GPU implementation achieves
up to 130-fold speedup over a traditional optimized CPU implementation running on an AMD Opteron.
Table 1 summarizes corresponding
timings.
Later, we extended the algorithms to support p-type basis functions and performed a series of benchmarks
on large molecules depicted on the right, like the 768-atom duplex DNA strand (over 5000 basis functions). The results
confirmed that more than 100-fold GPU over CPU speedups are directly achievable for chemically and biologically important systems,
making calculations of 1000-atom systems on desktop-size machines a routine task in the future. Specifications and performance
of some of the hand-coded GPU kernels handling different kinds of two-electron integrals are presented in
Table 2.
Figure 2. Schematic of three different mapping schemes for evaluating ERIs on the GPU. The matrix of
contracted integrals is represented by the large square. Small squares below the main diagonal (dark green)
represent integrals which do not need to be computed because the integral matrix is symmetric. Each of the
contracted integrals is a sum over primitive integrals, represented by the small squares in the blow-up of two
squares corresponding to contracted integrals. The different mapping schemes differ in how the computational
work is apportioned to threads and thread blocks, and are depicted schematically with red squares
superimposed on the integral matrix denoting the work done by a representative thread block and the three
blow-ups showing how the work is apportioned to threads within the thread block.
Table 1. Timings for the two-electron integral evaluation on the GPU and CPU
using the three mapping schemes (1B1CI, 1T1CI, 1T1PI). The benchmarked system
is a 4x4x4 hydrogen atom cube with 0.74 Å nearest-neighbor distance.^{a}


^{a}
The "CPU precalculation" column lists the amount of time required
to generate pair quantities on the CPU, and the "GPU-CPU transfer"
column lists the amount of time required to copy the contracted
integrals from the GPU to CPU memory. Timings for the same test
case using the GAMESS program package on a single Opteron 175
CPU are provided for comparison.

