comparison docs/Vectorizers.rst @ 148:63bd29f05246

merged
author Shinji KONO <kono@ie.u-ryukyu.ac.jp>
date Wed, 14 Aug 2019 19:46:37 +0900
parents c2174574ed3a
146:3fc4d5c3e21e 148:63bd29f05246
309 } 309 }
310 310
311 Vectorization of function calls 311 Vectorization of function calls
312 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 312 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
313 313
314 The Loop Vectorize can vectorize intrinsic math functions. 314 The Loop Vectorizer can vectorize intrinsic math functions.
315 See the table below for a list of these functions. 315 See the table below for a list of these functions.
316 316
317 +-----+-----+---------+ 317 +-----+-----+---------+
318 | pow | exp | exp2 | 318 | pow | exp | exp2 |
319 +-----+-----+---------+ 319 +-----+-----+---------+
325 +-----+-----+---------+ 325 +-----+-----+---------+
326 |fma |trunc|nearbyint| 326 |fma |trunc|nearbyint|
327 +-----+-----+---------+ 327 +-----+-----+---------+
328 | | | fmuladd | 328 | | | fmuladd |
329 +-----+-----+---------+ 329 +-----+-----+---------+
330
331 Note that the optimizer may not be able to vectorize math library functions
332 that correspond to these intrinsics if the library calls access external state
333 such as "errno". To allow better optimization of C/C++ math library functions,
334 use "-fno-math-errno".
330 335
331 The loop vectorizer knows about special instructions on the target and will 336 The loop vectorizer knows about special instructions on the target and will
332 vectorize a loop containing a function call that maps to the instructions. For 337 vectorize a loop containing a function call that maps to the instructions. For
333 example, the loop below will be vectorized on Intel x86 if the SSE4.1 roundps 338 example, the loop below will be vectorized on Intel x86 if the SSE4.1 roundps
334 instruction is available. 339 instruction is available.
367 372
368 Performance 373 Performance
369 ----------- 374 -----------
370 375
371 This section shows the execution time of Clang on a simple benchmark: 376 This section shows the execution time of Clang on a simple benchmark:
372 `gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_. 377 `gcc-loops <https://github.com/llvm/llvm-test-suite/tree/master/SingleSource/UnitTests/Vectorizer>`_.
373 This benchmark is a collection of loops from the GCC autovectorization 378 This benchmark is a collection of loops from the GCC autovectorization
374 `page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman. 379 `page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.
375 380
376 The chart below compares GCC-4.7, ICC-13, and Clang-SVN with and without loop vectorization at -O3, tuned for "corei7-avx", running on a Sandybridge iMac. 381 The chart below compares GCC-4.7, ICC-13, and Clang-SVN with and without loop vectorization at -O3, tuned for "corei7-avx", running on a Sandybridge iMac.
377 The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels. 382 The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels.