annotate paper/c5.tex @ 97:c1738511433c

add tSearch Source code
author Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
date Thu, 18 Feb 2016 20:40:30 +0900
parents ca50770a7fde
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
53
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
1 \chapter{ベンチマーク}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
2 本項で行なった実験の環境は以下の通りである。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
3 \begin{itemize}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
4 \item Mac OS X 10.10.5
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
5 \item 2*2.66 GHz 6-Core Intel Xeon
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
6 \item Memory 16GB 1333MHz DDR3
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
7 \item 1TB HDD
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
8 \end{itemize}
45
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 40
diff changeset
9
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
10 Cerium で実装した Word Count と Mac の wc の比較と、実装した正規表現と Mac の egrep の比較を行なった。
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
11 また、それぞれの結果に実装した並列処理向け I/O の結果も含む。
54
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 53
diff changeset
12
16
a3c5125aea03 add images
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 15
diff changeset
13 \section{Word Count}
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
14 ファイルの大きさは 約500MByte で、このファイルには 約650万行、約8300万単語が含まれている。
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
15
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
16 表\ref{table:IOwordcount} は、ファイル読み込みを含めた Word Count の結果である。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
17 Mac の wc ではこのファイルを処理するのに 10.59 秒かかる。それに対して、Cerium Word Count は mmap Blocked Read 全ての状況で Mac の wc よりも速いことを示している。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
18 Cerium Word Count 12 CPU のとき、7.83 秒で処理をしており、Mac の wc の 1.4 倍ほど速くなっている。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
19
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
20 mmap は読み込みを OS が制御しており、書き手が制御できない。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
21 また Word Count が走る際ファイルアクセスはランダムアクセスとなる。
89
a4af6428370a add images
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 76
diff changeset
22 mmap はランダムアクセスを想定していなくて結果がばらつきが起こっていると考えられる。
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
23 Blocked Read では読み込みをプログラムの書き手が制御しており、ファイルの読み込みもファイルの先頭から順次読み込みを行なっている。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
24 そのため、読み込みを含めた結果にばらつきが起こりにくくなっていると予想される。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
25
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
26 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
27 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
28 \begin{center}
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
29 ファイルサイズ : 500MB(単語数約8500万)\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
30 ファイル読み込みも含む\\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
31 \begin{tabular}[t]{|r|r|r|r|}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
32 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
33 CPU Num / 実行方式 & Mac(wc) & mmap & Blocked Read\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
34 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
35 1 & 10.59 & 9.96 & 9.33 \\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
36 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
37 4 & --- & 8.63 & 8.52 \\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
38 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
39 8 & --- & 10.35 & 8.04 \\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
40 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
41 12 & --- & 9.26 & 7.82 \\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
42 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
43 \end{tabular}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
44 \caption{ファイル読み込みを含む Word Count}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
45 \label{table:IOwordcount}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
46 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
47 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
48 \end{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
49
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
50 \newpage
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
51 表\ref{fig:wordcount} はファイル読み込みを含まない Word Count の結果である。
50
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 47
diff changeset
52
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
53 Mac の wc ではこのファイルを処理するのに 4.08 秒かかる。それに対して、Cerium Word Count は 1 CPU で 3.70 秒、12 CPU だと 0.40 秒で処理できる。
53
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
54
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
55 1 CPU で動作させると Mac の wc よりも 1.1 倍ほど速くなり、12 CPU で動作させると wc よりも 10.2 倍ほど速くなった。
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
56 1 CPU と 12 CPU で比較すると、9.25 倍ほど速くなった。
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
57
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
58 ファイルを読み込んだ結果と比較すると、ファイルを読み込まないで実行したほうが 6,7 秒ほど速くなる。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
59 これよりファイルを読み込んだ文字列処理の場合、処理時間の60\%から90\% はファイルの読み込みであることがわかる。
53
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
60
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
61 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
62 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
63 \begin{center}
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
64 ファイルサイズ : 500MB(単語数約8500万)\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
65 ファイル読み込みは含まない\\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
66 \begin{tabular}[t]{|r|r|}
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
67 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
68 実行方式 & 実行速度(秒)\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
69 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
70 Mac(wc) & 4.08 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
71 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
72 Cerium Word Count(CPU 1) & 3.70\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
73 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
74 Cerium Word Count(CPU 4) & 1.00\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
75 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
76 Cerium Word Count(CPU 8) & 0.52\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
77 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
78 Cerium Word Count(CPU 12) & 0.40\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
79 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
80 \end{tabular}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
81 \caption{ファイル読み込み無しの Word Count}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
82 \label{fig:wordcount}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
83 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
84 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
85 \end{tiny}
54
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 53
diff changeset
86
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
87 \section{Boyer-Moore Stirng Search}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
88 ファイルの大きさは 約500MByte で、このファイルには、約8300万単語が含まれている。このファイルに `Paskistan' という文字列がいくつマッチするかカウントしている。(マッチ数 約400万)
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
89
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
90 表\ref{table:bmsearch}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
91 はファイル読み込みを含まないで計測している。力任せ法と比較すると、Boyer-Moore String Search が最大 63 \% ほど速くなる。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
92
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
93 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
94 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
95 \begin{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
96 ファイルサイズ : 500MB(単語数約8500万)\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
97 検索文字列 : Pakistan\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
98 マッチ数約400万 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
99 ファイル読み込みは含まない\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
100 \begin{tabular}[t]{|r|r|r|}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
101 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
102 CPU 数 & 力任せ法 & Boyer-Moore String Search\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
103 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
104 1 & 3.17 & 1.70 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
105 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
106 4 & 0.87 & 0.49 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
107 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
108 8 & 0.47 & 0.27 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
109 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
110 12 & 0.33 & 0.21 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
111 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
112 \end{tabular}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
113 \caption{力任せ法と Boyer-Moore String Search の比較}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
114 \label{table:bmsearch}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
115 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
116 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
117 \end{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
118
16
a3c5125aea03 add images
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 15
diff changeset
119 \section{正規表現}
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
120 当実験では、Mac の egrep 、C で実装した逐次に DFA の状態遷移と照らし合わせマッチングをする single thread grep、Cerium で並列処理をする CeriumGrep を比較している。
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
121
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
122 表\ref{table:metachar}は500MB(単語数約8500万) のファイルに対して正規表現 `[A-Z][A-Za-z0-9]*s' をマッチングした結果である。
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
123 これはファイル読み込みを含めた結果と読み込みを含めていない結果の比較である。
47
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 45
diff changeset
124
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
125 single thread grep や CeriumGrep は繰返し実行をすると実行速度が短くなる。
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
126 これは、読み込んだファイルがキャッシュに残っており、ファイル読み込みが省略されるためである。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
127 同じファイルに対して違う正規表現でマッチングさせたとしてもファイル読み込みが省略されるため、高速に結果が返ってくる。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
128
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
129 しかし egrep は繰返し実行しても同じ速度で実行されるため、毎回ファイルを読み込む。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
130 そのため、同じファイルに違う正規表現をマッチさせようとすると毎回読み込みが行われる。
53
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
131
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
132 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
133 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
134 \begin{center}
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
135 正規表現 `[A-Z][A-Za-z0-9]*s'\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
136 ファイルサイズ : 500MB(単語数約8500万)\\
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
137 \begin{tabular}[t]{|c|r|r|}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
138 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
139 実行方式 & ファイル読み込み有 & ファイル読み込み無\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
140 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
141 single thread grep & 21.17 & 16.15\\
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
142 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
143 CeriumGrep(CPU 12) mmap & 18.00 & 5.12\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
144 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
145 CeriumGrep(CPU 12) bread & 15.76 & 5.18\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
146 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
147 egrep & 59.51 & 59.51 \\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
148 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
149 \end{tabular}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
150 \caption{ファイル読み込み有りと無しを変化させた各 grep の結果}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
151 \label{table:metachar}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
152 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
153 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
154 \end{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
155
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
156 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
157
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
158 表\ref{table:AZaz} は正規表現 `[A-Z][A-Za-z0-9]*s' を 500MB(単語数約8500万)、1GB(単語数約1.7億語)のファイルに対してマッチングを行なった。
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
159
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
160 ファイルサイズに比例して処理時間も長くなっていく。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
161
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
162 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
163 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
164 \begin{center}
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
165 正規表現 `[A-Z][A-Za-z0-9]*s'\\
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
166 ファイルサイズ : 500MB(単語数約8500万)\\
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
167 ファイル読み込みを含む\\
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
168 \begin{tabular}[t]{|c|r|r|r|r|}
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
169 \hline
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
170 実行方式/File Size(Match Num) & 50MB(54万) & 100MB(107万) & 500MB(536万) & 1GB(1072万) \\
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
171 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
172 single thread grep & 4.51 & 9.42 & 20.62 & 40.10\\
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
173 \hline
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
174 CeriumGrep(CPU 12) mmap & 8.97 & 10.79 & 18.00 & 29.16\\
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
175 \hline
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
176 CeriumGrep(CPU 12) bread & 7.75 & 10.49 & 15.76 & 26.83\\
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
177 \hline
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
178 egrep & 6.42 & 12.80 & 59.51 & 119.23\\
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
179 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
180 \end{tabular}
67
9c16f6b18100 add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 66
diff changeset
181 \caption{ファイルサイズを変化させた各 grep の結果}
56
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
182 \label{table:AZaz}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
183 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
184 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
185 \end{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 54
diff changeset
186
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
187 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
188
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
189 表\ref{table:abab}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
190 は aとb が多く含まれている約500MB(単語数約2400万)のファイルに対して、正規表現の状態数を変化させてみた。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
191 これは読み込みを含んでいる結果で、CeriumGrep のファイル読み込みは Blocked Read、CPU 数 12 にて実行した。
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
192
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
193
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
194 この例題で使用した正規表現は `(a \textbar b)' を `*a' の直後に付け加えていくと、それだけ状態数が指数関数的に増えていく。
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
195 CeriumGrep は状態数が指数関数的に増加しても 1 秒ほど増加するが、egrep では 5,6 秒ほど増加していく。
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
196 \newpage
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
197
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
198 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
199 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
200 \begin{center}
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
201 ファイルサイズ : 500MB(単語数約2400万)\\
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
202 状態数の内訳 : 状態割当時の状態数/subset 後の状態数\\
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
203 ファイル読み込みを含む\\
73
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 72
diff changeset
204 \begin{tabular}[t]{|l|r|r|r|r|}
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
205 \hline
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
206 正規表現 & 状態数 & マッチ数 & CeriumGrep (CPU 12) bread & egrep \\
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
207 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
208 '(a \textbar b)*a(a \textbar b)(a \textbar b)z' & 5/12 & 約10万 & 26.58 & 70.11 \\
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
209 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
210 '(a \textbar b)*a(a \textbar b)(a \textbar b)(a \textbar b)z' & 6/21 & 約8万 & 27.89 & 76.78 \\
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
211 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
212 '(a \textbar b)*a(a \textbar b)(a \textbar b)(a \textbar b)(a \textbar b)z' & 7/38 & 約4万 & 28.86 & 81.88 \\
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
213 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
214 '(a \textbar b)*a(a \textbar b)(a \textbar b)(a \textbar b)(a \textbar b)(a \textbar b)z' & 8/71 & 約2万 & 29.15 & 86.93 \\
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
215 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
216 \end{tabular}
67
9c16f6b18100 add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 66
diff changeset
217 \caption{正規表現の状態数を増やした Grep の結果}
65
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
218 \label{table:abab}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
219 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
220 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
221 \end{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
222
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 62
diff changeset
223 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
53
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 50
diff changeset
224
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
225 表\ref{table:nomatch} ab の文字列がならんでいるところに (W \textbar w)ord の正規表現
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
226 aとb が多く含まれている約500MB(単語数約2300万)のファイルに対して、全くマッチしない正規表現を与えてマッチングさせてみた。
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
227
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
228 マッチングしない場合でも egrep と比較して CeriumGrep bread のほうが 30 \% ほど速くなる。
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
229 \begin{tiny}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
230 \begin{table}[ht]
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
231 \begin{center}
74
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
232 ファイルサイズ : 500MB(単語数約2400万)\\
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 73
diff changeset
233 正規表現 : (W \textbar w)ord\\
75
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 74
diff changeset
234 ファイル読み込みを含む\\
66
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 65
diff changeset
235 \begin{tabular}[t]{|c|r|}
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
236 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
237 実行方式 & time (s)\\
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
238 \hline
91
ca50770a7fde add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 89
diff changeset
239 single thread grep & 27.13\\
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
240 \hline
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
241 CeriumGrep(CPU 12) mmap & 21.58\\
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
242 \hline
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
243 CeriumGrep(CPU 12) bread & 19.99\\
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
244 \hline
71
c01a514d33f7 add bm_search
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 67
diff changeset
245 egrep & 28.33\\
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
246 \hline
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
247 \end{tabular}
67
9c16f6b18100 add result
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 66
diff changeset
248 \caption{全くマッチングしないパターンを grep した結果}
62
0d13c52a54fd remove bm_search explain
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 61
diff changeset
249 \label{table:nomatch}
61
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
250 \end{center}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
251 \end{table}
Masataka Kohagura <kohagura@cr.ie.u-ryukyu.ac.jp>
parents: 57
diff changeset
252 \end{tiny}