QUANTUM CHEMISTRY ON GPU
[NEW] Watch video and slides from the GTC'12 presentation by Alexey Titov.
[NEW] NCSA deploys GPU-enabled TeraChem software on Forge cluster. (Link)
[NEW] TeraChem is now available for download.



     Detailed simulations based on the principles of quantum mechanics play an ever increasing role in suggesting, guiding, and explaining experiments in chemistry and materials science. In fact, quantum chemistry is one of the major consumers of CPU cycles at national supercomputer centers. The rise to prominence of computational chemistry was largely driven by early demonstrations of the predictive power of quantum mechanics applied to chemical problems and the tremendous advances in computing power over the past decades. However, limited computational resources remain a serious obstacle to the application of quantum chemistry in problems of widespread importance, such as the design of more effective drugs to treat diseases or new catalysts which can be used in applications such as fuel cells or environmental remediation. Thus, there is considerable impetus to seek relief for this bottleneck in any way possible, including both the development of new and more effective algorithms and also the exploration of new computer architectures.

      The past decade has seen a tremendous increase in the computing requirements of consumer videogames, and this demand is being met through novel hardware architectures in the form of proprietary consoles and graphics cards. Offerings such as the Sony PlayStation 3 (designed around IBM's Cell processor) and the nVidia GeForce 8800 GTX graphics card are excellent examples, both of which may be characterized as stream processors. Stream processing is a generalization of the single instruction multiple data (SIMD) vector processing model which formed the core of the Cray-1 supercomputer. Applications are organized into streams and kernels, representing blocks of data and code transformations, respectively. The kernel typically consists of a tight loop of relatively few instructions. Streams of data are then processed in pipelined and parallel fashion by many processors executing a small number of kernels (possibly only one).

      Since a graphics card typically costs less than a single CPU used in conventional scientific clusters, it is tempting to consider the use of graphics cards for computational chemistry. The earliest attempts to use graphics processing units (GPUs) for nongraphical computing in fields outside of chemistry were largely stymied by limited precision and difficulty of programming. The former problem has been partially remedied, and the latest GPUs support 32-bit floating point arithmetic. The next generation of GPUs and stream processors from nVidia and AMD have already been announced and will extend this support to 64-bit. The latter problem of programming difficulty has been largely removed by nVidia's recent introduction of the Compute Unified Device Architecture (CUDA), which provides a relatively simple programming interface that can be called from the standard C language.

      The GeForce 8800 GTX (Figure 1) consists of 16 independent streaming multiprocessors (SMs) running at a clock speed of 1.35 GHz, which is comparable to that of the conventional CPUs commonly used as the basis for scientific computing clusters. Each SM has a Single Instruction Multiple Data (SIMD) implementation with eight scalar processors and one instruction unit. At each clock cycle, the instruction unit of an SM broadcasts the same instruction to all eight of its scalar processor units, which then operate on different data. Each SM can process several blocks of threads concurrently, but all the threads in a given block are guaranteed to be executed on a single SM. Threads within the same block are thereby able to communicate with each other very efficiently using fast on-chip shared memory and are furthermore able to synchronize their execution. Perhaps the most detailed descriptions of the nVidia GeForce GPU architecture and the CUDA API are provided in the CUDA Programming Guide.
Figure 1. Schematic block diagram of the nVidia GeForce 8800 GTX.
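
The figures quoted above are enough for a rough upper bound on arithmetic throughput. The sketch below multiplies them out in Python. It assumes that each scalar processor can retire one multiply-add (MAD) per clock and that a MAD counts as 2 FLOPs (the counting convention used in the footnote to Table 2); the one-MAD-per-clock rate is an assumption, not a figure from the text.

```python
# Figures quoted in the text for the GeForce 8800 GTX.
NUM_SM = 16        # streaming multiprocessors
SP_PER_SM = 8      # scalar processors per SM
CLOCK_GHZ = 1.35   # clock speed in GHz

# Assumption (not stated in the text): one multiply-add (MAD) per scalar
# processor per clock, counted as 2 FLOPs.
FLOPS_PER_CLOCK = 2


def peak_gflops(num_sm, sp_per_sm, clock_ghz, flops_per_clock):
    """Upper bound on single-precision throughput in GFLOPS."""
    return num_sm * sp_per_sm * clock_ghz * flops_per_clock


scalar_processors = NUM_SM * SP_PER_SM
peak = peak_gflops(NUM_SM, SP_PER_SM, CLOCK_GHZ, FLOPS_PER_CLOCK)
print(f"{scalar_processors} scalar processors, peak about {peak:.1f} GFLOPS")
```

Under these assumptions the bound is about 345.6 GFLOPS across 128 scalar processors, which gives a scale against which the kernel rates in Table 2 can be read.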



Two-electron repulsion integral (ERI) problem


     Two of the most basic questions in chemistry are "Where are the electrons?" and "Where are the nuclei?" Electronic structure theory, i.e. quantum chemistry, focuses on the first of these. Because the electrons are very light, the laws of quantum mechanics must be applied and they are described with an electronic wavefunction determined from solution of the time-independent Schrodinger equation. As usual in quantum mechanics, the absolute square of this wavefunction is interpreted as a probability distribution for the positions of the electrons. Once the electronic distribution is known for a fixed nuclear configuration, it is straightforward to calculate the resulting forces on the nuclei. Thus, the answer to the second question follows from the answer to the first, through either a search for the arrangement of the nuclei which minimizes the energy (molecular geometry optimization) or solution of the classical Newtonian equations of motion.

     In many electronic structure methods, the electronic distribution function is linearly expanded over so-called Gaussian type basis functions (also known as Gaussian type orbitals, or GTOs), centered on atomic nuclei R

$$\phi(\mathbf{r}) \propto (x - X)^{l}\,(y - Y)^{m}\,(z - Z)^{n}\,\exp\left(-\alpha\,|\mathbf{r} - \mathbf{R}|^{2}\right), \qquad \mathbf{R} = (X, Y, Z),$$

where α is the Gaussian exponent and the integers l, m, and n represent the angular momentum of the orbital. The total angular momentum is given by L = l + m + n and is often referred to as s, p, d, etc. for L = 0, 1, 2, ..., respectively. The unknown expansion coefficients are then found by an iterative procedure (the Hartree-Fock method, for example), which requires the evaluation of a large number of two-electron repulsion integrals (ERIs)

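$$(ab|cd) = \iint \phi_a(\mathbf{r}_1)\,\phi_b(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\phi_c(\mathbf{r}_2)\,\phi_d(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2.$$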

Even though the Gaussian form of the basis functions leads to analytic expressions for the two-electron integrals, there are O(N^4) such integrals to be evaluated, where N grows linearly with the size of the molecule under consideration. In practice, many of these integrals are small and can be neglected, but the number of non-negligible integrals still grows faster than O(N^2), making their evaluation a critical bottleneck in quantum chemistry.
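
To illustrate what "analytic expressions" means in the simplest case, here is a minimal Python sketch of the textbook closed form for an ERI over four unnormalized s-type primitives (L = 0). It is not the GPU code described on this page. The function names are hypothetical, and the formula is the standard one based on the Gaussian product theorem and the zeroth-order Boys function.

```python
import math


def boys_f0(t):
    """Zeroth-order Boys function F0(t) = integral_0^1 exp(-t u^2) du."""
    if t < 1e-12:
        return 1.0 - t / 3.0  # leading terms of the Taylor series near t = 0
    return 0.5 * math.sqrt(math.pi / t) * math.erf(math.sqrt(t))


def dist2(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def eri_ssss(a, A, b, B, c, C, d, D):
    """(ab|cd) for unnormalized s primitives exp(-a|r-A|^2), exp(-b|r-B|^2), ..."""
    p, q = a + b, c + d
    # Gaussian product theorem: each pair collapses to one Gaussian at P (or Q).
    P = [(a * Ai + b * Bi) / p for Ai, Bi in zip(A, B)]
    Q = [(c * Ci + d * Di) / q for Ci, Di in zip(C, D)]
    k_ab = math.exp(-a * b / p * dist2(A, B))
    k_cd = math.exp(-c * d / q * dist2(C, D))
    prefactor = 2.0 * math.pi ** 2.5 / (p * q * math.sqrt(p + q))
    return prefactor * k_ab * k_cd * boys_f0(p * q / (p + q) * dist2(P, Q))


origin = (0.0, 0.0, 0.0)
far = (0.0, 0.0, 5.0)
print(eri_ssss(1.0, origin, 1.0, origin, 1.0, origin, 1.0, origin))
print(eri_ssss(1.0, origin, 1.0, origin, 1.0, far, 1.0, far))
```

Each integral costs only a handful of exponentials, square roots, and one error function, so the difficulty lies in the sheer number of integrals rather than in any single one. Higher angular momenta (p, d, ...) need higher-order Boys functions and recurrence relations, which this sketch does not cover.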

      We have previously explored three different algorithms (Figure 2) to evaluate the O(N^4) ERIs over contracted basis functions and store them in GPU memory (see Publications for details). The algorithms were tested on a system composed of 64 hydrogen atoms arranged on a 4x4x4 lattice. Two basis sets were used. The first (denoted STO-6G) has one contracted basis function per atom, built from six s-type Gaussian basis functions. The second (denoted 6-311G) has three contracted s-type basis functions per atom, which are combinations of three, one, and one Gaussian basis functions, respectively. These two basis sets represent highly contracted and relatively uncontracted basis sets, respectively, and serve to show how the performance of the algorithms is affected by the degree of contraction in the basis set. For the hydrogen atom lattice test case, the number of contracted basis functions is 64 and 192 for the STO-6G and 6-311G basis sets, respectively. This leads to O(10^6) and O(10^8) ERIs over contracted basis functions. A benchmark of the evaluation of these integrals showed that the current GPU implementation achieves up to a 130-fold speedup over a traditional optimized CPU implementation running on an AMD Opteron. Table 1 summarizes the corresponding timings.
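
The orders of magnitude quoted above can be checked with a short count. The Python sketch below assumes real basis functions, so that each integral (ij|kl) has the usual 8-fold permutational symmetry and only the symmetry-unique ones need to be evaluated. The function names are hypothetical.

```python
def unique_pairs(n):
    """Number of index pairs (i, j) with i <= j."""
    return n * (n + 1) // 2


def unique_eris(n_basis):
    """Symmetry-unique integrals (ij|kl), assuming 8-fold symmetry."""
    return unique_pairs(unique_pairs(n_basis))


def primitives_per_contracted(k_a, k_b, k_c, k_d):
    """Primitive integrals summed into one contracted integral."""
    return k_a * k_b * k_c * k_d


n_sto6g = unique_eris(64)    # 2164240, i.e. O(10^6)
n_6311g = unique_eris(192)   # 171652656, i.e. O(10^8)
print(n_sto6g, n_6311g)

# STO-6G: every contracted function is a sum of six primitives.
print(primitives_per_contracted(6, 6, 6, 6))   # 1296
# 6-311G: contraction lengths are 3, 1, and 1, so the count varies from 1 to 81.
print(primitives_per_contracted(1, 1, 1, 1), primitives_per_contracted(3, 3, 3, 3))
```

The two basis sets therefore stress the algorithms differently: STO-6G has few contracted integrals with a lot of work in each, while 6-311G has many contracted integrals, most of which contain little work.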

     Later, we extended the algorithms to support p-type basis functions and performed a series of benchmarks on the large molecules pictured alongside this text, such as the 768-atom duplex DNA strand (over 5000 basis functions). The results confirmed that GPU-over-CPU speedups of more than 100-fold are directly achievable for chemically and biologically important systems, which should make calculations on 1000-atom systems a routine task on desktop-size machines in the future. Specifications and performance of some of the hand-coded GPU kernels handling different kinds of two-electron integrals are listed in Table 2.


Figure 2. Schematic of three different mapping schemes for evaluating ERIs on the GPU. The matrix of contracted integrals is represented by the large square. Small squares below the main diagonal (dark green) represent integrals which do not need to be computed because the integral matrix is symmetric. Each of the contracted integrals is a sum over primitive integrals, represented by the small squares in the blow-up of two squares corresponding to contracted integrals. The mapping schemes differ in how the computational work is apportioned to threads and thread blocks. They are depicted schematically with red squares superimposed on the integral matrix, denoting the work done by a representative thread block, and with three blow-ups showing how the work is apportioned to threads within the thread block.
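
A rough way to see how such mappings differ is to count threads and per-thread work. The Python sketch below does this for the scheme names that appear in Table 1. It reads 1B1CI, 1T1CI, and 1T1PI as "one block, one contracted integral", "one thread, one contracted integral", and "one thread, one primitive integral"; that reading, the function name, and the block size of 64 threads are assumptions made for illustration, not details taken from this page.

```python
import math


def work_split(n_contracted, prims_per_contracted, threads_per_block=64):
    """Return {scheme: (threads launched, primitive integrals per thread)}.

    threads_per_block is a hypothetical launch parameter, not a value
    from the text.
    """
    return {
        # Assumed reading: one block computes one contracted integral,
        # so the threads of a block share that integral's primitives.
        "1B1CI": (n_contracted * threads_per_block,
                  math.ceil(prims_per_contracted / threads_per_block)),
        # Assumed reading: one thread loops over all primitives of one
        # contracted integral.
        "1T1CI": (n_contracted, prims_per_contracted),
        # Assumed reading: one thread computes one primitive integral and
        # the contraction sum is formed afterwards.
        "1T1PI": (n_contracted * prims_per_contracted, 1),
    }


# STO-6G-like case: every contracted integral sums 6**4 primitives.
for scheme, (threads, per_thread) in work_split(1000, 6 ** 4).items():
    print(f"{scheme}: {threads} threads, {per_thread} primitive(s) per thread")
```

In this simplified picture, a contracted integral with a single primitive still occupies a whole block under 1B1CI, leaving most of its threads idle. That is at least consistent with 1B1CI being the slowest scheme for the weakly contracted 6-311G basis in Table 1, although the page itself does not state the cause.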



Table 1. Timings (in seconds) for the two-electron integral evaluation on the GPU and CPU using the three mapping schemes (1B1CI, 1T1CI, 1T1PI). The benchmarked system is a 4x4x4 hydrogen atom cube with a 0.74 Å nearest-neighbor distance. [a]

Basis set   GPU 1B1CI   GPU 1T1CI   GPU 1T1PI   CPU pre-calculation   GPU-CPU transfer   GAMESS
6-311G      7.086       0.675       0.428       0.009                 0.883              170.8
STO-6G      1.608       1.099       2.863       0.012                 0.012              90.8

   [a] The "CPU pre-calculation" column lists the amount of time required to generate pair quantities on the CPU, and the "GPU-CPU transfer" column lists the amount of time required to copy the contracted integrals from the GPU to CPU memory. Timings for the same test case using the GAMESS program package on a single Opteron 175 CPU are provided for comparison.
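
The speedups implied by Table 1 can be recomputed directly from its entries. The Python sketch below does so with and without the CPU pre-calculation and GPU-CPU transfer times. Whether the quoted 130-fold figure includes those overheads is not stated on this page, so both variants are shown; the dictionary layout and function name are hypothetical.

```python
# Timings in seconds, copied from Table 1.
TABLE_1 = {
    "6-311G": {"gpu": {"1B1CI": 7.086, "1T1CI": 0.675, "1T1PI": 0.428},
               "precalc": 0.009, "transfer": 0.883, "gamess": 170.8},
    "STO-6G": {"gpu": {"1B1CI": 1.608, "1T1CI": 1.099, "1T1PI": 2.863},
               "precalc": 0.012, "transfer": 0.012, "gamess": 90.8},
}


def speedup(row, scheme, include_overheads=True):
    """GAMESS time divided by GPU time for one mapping scheme."""
    gpu_time = row["gpu"][scheme]
    if include_overheads:
        gpu_time += row["precalc"] + row["transfer"]
    return row["gamess"] / gpu_time


for basis, row in TABLE_1.items():
    best = min(row["gpu"], key=row["gpu"].get)  # fastest GPU scheme
    print(f"{basis}: best scheme {best}, "
          f"{speedup(row, best):.1f}x with overheads, "
          f"{speedup(row, best, include_overheads=False):.1f}x kernel only")
```

With overheads included, the fastest 6-311G scheme (1T1PI) comes out at roughly 129x, which is close to the "up to 130-fold" figure quoted in the text.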
" The promise of revolutionary materials and processes from materials science will routinely require several petaflops of computer power in the not too distant future."
                     Raymond Orbach, 2003
" The fact that the field of quantum-chemical computations is undergoing a revolutionary change due to the availability of high speed computers is too well known to be emphasized here."
                      Enrico Clementi, 1965


[Molecule images: caffeine, cholesterol, buckyball, taxol, valinomycin, duplex DNA]

Table 2. Integral evaluation GPU kernel specifications and performance results. [b]

GPU kernel   FLOPs per kernel   MOPs per kernel   Performance (GFLOPS)   Bandwidth (GBytes/s)
ssss         30                 12                88                     131
sssp         55                 15                70                     71
sspp         84                 21                69                     64
pppp         387                21                97                     20

   [b] All instructions (SUM, MUL, DIV, SQRT, EXP, etc.) were counted as 1 FLOP, except MAD, which was counted as 2 FLOPs. Each 32-bit load instruction was counted as 1 MOP, while 64-bit and 128-bit load instructions were counted as 2 and 4 MOPs, respectively.
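
The columns of Table 2 are related by simple arithmetic, which the Python sketch below makes explicit. It takes 1 MOP as 4 bytes (a 32-bit load, per the table footnote) and assumes that "GBytes" means 2^30 bytes. The latter is a guess, not something the page states, but it reproduces the tabulated bandwidths after rounding. The names are hypothetical.

```python
# (FLOPs per kernel, MOPs per kernel, GFLOPS, tabulated GBytes/s) from Table 2.
KERNELS = {
    "ssss": (30, 12, 88, 131),
    "sssp": (55, 15, 70, 71),
    "sspp": (84, 21, 69, 64),
    "pppp": (387, 21, 97, 20),
}

BYTES_PER_MOP = 4   # one 32-bit load, per the table footnote
GBYTE = 2 ** 30     # assumption: binary gigabytes


def implied_bandwidth(flops_per_kernel, mops_per_kernel, gflops):
    """Memory traffic in GBytes/s implied by a kernel's arithmetic rate."""
    kernels_per_second = gflops * 1e9 / flops_per_kernel
    return kernels_per_second * mops_per_kernel * BYTES_PER_MOP / GBYTE


for name, (flops, mops, gflops, tabulated) in KERNELS.items():
    implied = implied_bandwidth(flops, mops, gflops)
    intensity = flops / (mops * BYTES_PER_MOP)  # FLOPs per byte loaded
    print(f"{name}: implied {implied:.1f} GBytes/s (table: {tabulated}), "
          f"{intensity:.2f} FLOPs/byte")
```

The FLOPs-per-byte ratio rises from about 0.6 for ssss to about 4.6 for pppp, so kernels with higher angular momentum do more arithmetic per loaded value and need far less memory bandwidth for a similar arithmetic rate.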


Publications
  • Excited-State Electronic Structure with Configuration Interaction Singles and Tamm–Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units. Christine M. Isborn, Nathan Luehr, Ivan S. Ufimtsev, and Todd J. Martinez, Journal of Chemical Theory and Computation,
    June 2011 (Link)

  • Dynamic Precision for Electron Repulsion Integral Evaluation on Graphical Processing Units (GPUs). Nathan Luehr, Ivan S. Ufimtsev, and Todd J. Martinez, Journal of Chemical Theory and Computation,
    April 2011 (Link)

  • Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics. Ivan S. Ufimtsev and Todd J. Martinez, Journal of Chemical Theory and Computation,
    October 2009 (Link)

  • Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. Ivan S. Ufimtsev and Todd J. Martinez, Journal of Chemical Theory and Computation,
    March 2009 (Link)

  • Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. Ivan S. Ufimtsev and Todd J. Martinez, Journal of Chemical Theory and Computation,
    February 2008 (Link)

  • Graphical Processing Units for Quantum Chemistry. Ivan S. Ufimtsev and Todd J. Martinez, Computing in Science and Engineering,
    November 2008 (Link)


Upcoming Presentations
  • Electronic structure and first principles molecular dynamics on graphical processing units (GPUs). T.J. Martinez, I.S. Ufimtsev, N. Luehr, C. Isborn, PACIFICHEM'10, December 2010 (Link)

In the News
  • The GPU Revolution (by Lauren Wolf). Chemical and Engineering News,
    November 2010 (Link)

  • Speeding Up Computations (by Elizabeth Wilson). Chemical and Engineering News,
    September 2008 (Link)

  • Reflecting Chemical Intuition (by J. William Bell). NCSA News,
    June 2008 (Link)

  • Games Consoles Reveal the Supercomputer Within (by Mike Nagle). New Scientist,
    February 2008 (Link)

  • PetaChem website (Link)




Funded by
National Science Foundation |  National Institutes of Health |  Research Corporation
Department of Energy |  Sloan Foundation |  Beckman Foundation
Dreyfus Foundation |  Packard Foundation