<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Polly - Performance</title>
<link type="text/css" rel="stylesheet" href="menu.css">
<link type="text/css" rel="stylesheet" href="content.css">
</head>
<body>
<div id="box">
<!--#include virtual="menu.html.incl"-->
<div id="content">
<h1>Performance</h1>

<p>To evaluate the performance benefits Polly currently provides, we compiled the
<a href="https://web.cse.ohio-state.edu/~pouchet.2/software/polybench/">Polybench
2.0</a> benchmark suite. Each benchmark was run with double-precision
floating-point values on an Intel Xeon X5670 CPU @ 2.93GHz (12 cores, 24 threads)
system. We used <a href="https://sourceforge.net/projects/pocc/files/">PoCC</a> and the included <a
href="http://pluto-compiler.sf.net">Pluto</a> transformations to optimize the
code. The source code of Polly and LLVM/clang was checked out on
25/03/2011.</p>

<p>The results shown were created fully automatically, without manual
intervention. We have not yet spent any time tuning the results. Hence
further improvements may be achieved by tuning the code generated by Polly, by
tuning the heuristics used by Pluto, or by investigating whether more code could
be optimized. As Pluto was never used at such a low level, its heuristics are
probably far from perfect. Another area where we expect larger performance
improvements is SIMD vector code generation. At the moment, it rarely yields
performance improvements, as we have not yet included vectorization in our
heuristics. By changing this we should be able to significantly increase the
number of test cases that show improvements.</p>

<p>The Polybench test suite contains computation kernels from linear algebra
routines, stencil computations, image processing and data mining. Polly
recognizes the majority of them and is able to show good speedups. However,
to show similar speedups on larger examples like the SPEC CPU benchmarks, Polly
still lacks support for integer casts, variable-sized multi-dimensional arrays
and probably several other constructs. This support is necessary because such
constructs appear in larger programs, but not in our limited test suite.</p>

<h2> Sequential runs</h2>

For the sequential runs we used Polly to create a program structure that is
optimized for data locality. One of the major optimizations performed is tiling.
The speedups shown do not use any multi-core parallelism. No
additional hardware is used, but the single available core is used more
efficiently.
<h3> Small data size</h3>
<img src="images/performance/sequential-small.png" /><br />
<h3> Large data size</h3>
<img src="images/performance/sequential-large.png" />
<h2> Parallel runs</h2>
For the parallel runs we used Polly to expose parallelism and to add calls to an
OpenMP runtime library. With OpenMP we can use all 12 hardware cores
instead of the single core that was used before. In several
cases we obtain more than linear (super-linear) speedup. This additional speedup
is due to improved data locality.
<h3> Small data size</h3>
<img src="images/performance/parallel-small.png" /><br />
<h3> Large data size</h3>
<img src="images/performance/parallel-large.png" />
</div>
</div>
</body>
</html>