Binary file paper/figures/GPU/fft_dragonfly.pdf has changed
Binary file paper/figures/multicore/.DS_Store has changed
Binary file paper/figures/multicore/word_count.pdf has changed
--- a/slide/blank.html	Wed Feb 18 02:10:50 2015 +0900
+++ b/slide/blank.html	Wed Feb 18 03:27:34 2015 +0900
@@ -666,47 +666,173 @@
       </div>

       <div class='slide'>
-        <h2>-</h2>
+        <h2>Experimental Environment</h2>
+        <table border="0" >
+          <tr bgcolor="palegreen">
+            <th align="center">Model</th><th align="center">MacPro Mid 2010</th>
+          </tr>
+
+          <tr bgcolor="dbffa3">
+            <th align="left" >CPU</th><th align="left">6-core Intel Xeon@2.66GHz</th>
+          </tr>
+          <tr bgcolor="palegreen">
+            <th align="left">Serial-ATA Device</th><th align="left">HDD ST4000VN000-1H4168</th>
+          </tr>
+          <tr bgcolor="dbffa3">
+            <th align="left">Memory</th><th align="left">16GB</th>
+          </tr>
+          <tr bgcolor="palegreen">
+            <th align="left">OS</th><th align="left">MacOSX 10.10.1</th>
+          </tr>
+          <tr bgcolor="dbffa3">
+            <th align="left">Graphics</th><th align="left">NVIDIA Quadro K5000 4096MB</th>
+          </tr>
+        </table>
+        <hr>
+        <table border="0" >
+          <tr bgcolor="palegreen">
+            <th align="center">Model</th><th align="center">MacPro Late 2013</th>
+          </tr>
+          <tr bgcolor="dbffa3">
+            <th align="left" >CPU</th><th align="left">6-core Intel E5@3.5GHz</th>
+          </tr>
+          <tr bgcolor="palegreen">
+            <th align="left">Serial-ATA Device</th><th align="left">Apple SSD SM0256</th>
+          </tr>
+          <tr bgcolor="dbffa3">
+            <th align="left">Memory</th><th align="left">16GB</th>
+          </tr>
+          <tr bgcolor="palegreen">
+            <th align="left">OS</th><th align="left">MacOSX 10.10.1</th>
+          </tr>
+          <tr bgcolor="dbffa3">
+            <th align="left">Graphics</th><th align="left">AMD FireProD700 6144MB</th>
+          </tr>
+        </table>
+        <p>
+          Experiments were run on a MacPro 2010 and a MacPro 2013.
+          The MacPro 2013 is the newer model, with a higher clock frequency and an SSD.
+        </p>
       </div>

       <div class='slide'>
-        <h2>Example Used in the Experiments: WordCount</h2>
-      </div>
-
-      <div class='slide'>
-        <h2>Example Used in the Experiments: FFT</h2>
-      </div>
-
-      <div class='slide'>
-        <h2>Experimental Environment</h2>
-      </div>
-
-      <div class='slide'>
-        <h2>Benchmark of Parallel Execution on a Multicore CPU</h2>
+        <h2>Benchmark of Parallel Execution on a Multicore CPU with WordCount</h2>
+        <table><tr align="left">
+            <th><img src="./images/word_count_multicore.png" width="600">
+            </th>
+            <th>
+              <p>Execution time was measured against the number of cores in both environments.</p>
+              <p>
+                On the MacPro 2010, using 6 CPUs gave a
+                <font color="red">5.0x</font> speedup over using 1 CPU.
+              </p>
+              <p>
+                On the MacPro 2013, using 6 CPUs gave a
+                <font color="red">5.2x</font> speedup over using 1 CPU.
+              </p>
+              <p>Parallelism was maintained well up to 6 CPUs, the number of cores in each machine.</p>
+            </th>
+        </tr></table>
       </div>

       <div class='slide'>
         <h2>Benchmark of DMA Prefetch</h2>
-      </div>
-
-      <div class='slide'>
-        <h2>GPGPU Benchmark</h2>
+        <table><tr align="left">
+            <th><img src="./images/dmabench.png" width="600">
+            </th>
+            <th>
+              <p>
+                Measurements were taken with the DMA transfer prefetch feature enabled (prefetch)
+                and disabled (no_prefetch).
+              </p>
+              <p>
+                With prefetch, performance improved by <font color="red">1.17%</font> with 1 CPU
+                and by <font color="red">1.63%</font> with 6 CPUs.
+                Up to 6 CPUs, prefetch performed better. Beyond 6 CPUs, the two were nearly identical.
+              </p>
+            </th>
+        </tr></table>
       </div>

       <div class='slide'>
         <h2>Benchmark of Data-Parallel Execution</h2>
+        <table><tr align="left">
+            <th><img src="./images/wordcount_dataparallel.png" width="600">
+            </th>
+            <th>
+              <p>
+                Using WordCount, normal execution and data-parallel execution were compared
+                on the multicore CPU, OpenCL, and CUDA.
+              </p>
+              <p>
+                The multicore CPU showed a <font color="red">1.06x</font> performance improvement.
+                The GPU improved dramatically: OpenCL by <font color="red">115x</font> and
+                CUDA by <font color="red">14x</font>.
+                These results show that data-parallel execution is essential for GPGPU.
+              </p>
+              <p>
+                Performance improved overall, but <font color="blue">the GPU still underperforms</font> the multicore CPU.
+              </p>
+            </th>
+        </tr></table>
+      </div>
+
+      <div class='slide'>
+        <h2>GPGPU Benchmark with FFT (MacPro2010)</h2>
+        <table><tr align="left">
+            <th><img src="./images/fft_firefly.png" width="600">
+            </th>
+            <th>
+              <p>
+                FFT was measured on the multicore CPU, CUDA, and OpenCL.
+                CUDA was <font color="red">3.5x</font> faster than 1 CPU and
+                <font color="red">1.1x</font> faster than 6 CPUs.
+              </p>
+              <p>
+                OpenCL was <font color="red">2.75x</font> faster than 1 CPU,
+                but dropped to <font color="blue">0.76x the performance</font> of 6 CPUs.
+                It also dropped to <font color="blue">0.76x</font> the performance of FFT implemented in OpenCL alone.
+              </p>
+            </th>
+        </tr></table>
       </div>

       <div class='slide'>
-        <h2>GPGPU Benchmark</h2>
-      </div>
-
-      <div class='slide'>
-        <h2>GPGPU Benchmark with FFT</h2>
+        <h2>GPGPU Benchmark with FFT (MacPro2013)</h2>
+        <table><tr align="left">
+            <th><img src="./images/fft_dragonfly.png" width="600">
+            </th>
+            <th>
+              <p>
+                Measured on a machine with a more powerful GPU, GPGPU performance improved.
+                OpenCL was <font color="red">6x</font> faster than 1 CPU and
+                <font color="red">1.6x</font> faster than 6 CPUs.
+                It also matched the performance of FFT implemented in OpenCL alone.
+              </p>
+            </th>
+        </tr></table>
       </div>

       <div class='slide'>
         <h2>Benchmark of Parallel I/O with BlockedRead</h2>
+        <table><tr align="left">
+            <th><img src="./images/io_thread_firefly.png" width="600">
+            </th>
+            <th>
+              <p>
+                We measured mmap, Cerium's existing read method, read, an ordinary file read,
+                and BlockedRead, which was implemented in this work.
+                BlockedRead was measured both with the io Thread (BlockedRead_io) and
+                without it (BlockedRead_speany).
+              </p>
+              <p>
+                With 6 CPUs, BlockedRead_io was <font color="red">1.1x</font> faster than mmap,
+                <font color="red">1.58x</font> faster than read, and
+                <font color="red">1.34x</font> faster than BlockedRead_speany.
+              </p>
+            </th>
+        </tr></table>
+
       </div>

       <div class='slide'>
Binary file slide/images/dmabench.png has changed
Binary file slide/images/fft_dragonfly.png has changed
Binary file slide/images/fft_firefly.png has changed
Binary file slide/images/io_thread_firefly.png has changed
Binary file slide/images/sort_multicore.png has changed
Binary file slide/images/word_count_multicore.png has changed
Binary file slide/images/wordcount_dataparallel.pdf has changed
Binary file slide/images/wordcount_dataparallel.png has changed