Why is egrep [wW][oO][rR][dD] faster than grep -i word?

Question

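I have been using grep -i more often, and I found that it is slower than its egrep equivalent, in which I match against the upper- or lowercase form of each letter: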

    $ time grep -iq "thats" testfile

    real    0m0.041s
    user    0m0.038s
    sys     0m0.003s

    $ time egrep -q "[tT][hH][aA][tT][sS]" testfile

    real    0m0.010s
    user    0m0.003s
    sys     0m0.006s

Does grep -i perform additional tests that egrep does not?

Gilles 'SO- stop being evil' · Accepted Answer

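grep -i 'a' is equivalent to grep '[Aa]' in an ASCII-only locale. In a Unicode locale, character equivalences and conversions can be complex, so grep may have to do extra work to determine which characters are equivalent. The relevant locale setting is LC_CTYPE, which determines how bytes are interpreted as characters.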
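The equivalence in an ASCII-only locale can be checked with a small sketch like the following. The sample file and its contents are made up for illustration, and the pattern is the one from the question; it only shows that both forms select the same lines under LC_ALL=C, not anything about speed.

```shell
# Build a small sample file (hypothetical contents, not the asker's testfile).
sample=$(mktemp)
printf '%s\n' 'That is it' 'THATS all' 'thats fine' 'nothing here' > "$sample"

# Under the C locale, both forms should select the same lines.
with_i=$(LC_ALL=C grep -i 'thats' "$sample")
with_classes=$(LC_ALL=C grep '[tT][hH][aA][tT][sS]' "$sample")

printf 'grep -i:\n%s\n' "$with_i"
printf 'bracket expressions:\n%s\n' "$with_classes"

# Show the locale variables that decide how bytes are read as characters.
# LC_ALL overrides LC_CTYPE, which overrides LANG.
echo "LC_ALL=${LC_ALL-} LC_CTYPE=${LC_CTYPE-} LANG=${LANG-}"

rm -f "$sample"
```

In a UTF-8 locale the two forms can differ for non-ASCII letters, since a bracket expression lists only the characters written in it.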

In my experience, GNU grep can be slow when invoked in a UTF-8 locale. If you know that you are searching only for ASCII characters, invoking it in an ASCII-only locale may be faster. I expect that

    time LC_ALL=C grep -iq "thats" testfile
    time LC_ALL=C egrep -q "[tT][hH][aA][tT][sS]" testfile

would produce indistinguishable timings.

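That said, I cannot reproduce your finding with GNU grep on Debian jessie (but you did not specify your test file). If I set an ASCII locale (LC_ALL=C), grep -i is faster. The effect depends on the exact nature of the string; for example, a string with repeated characters reduces performance (which is to be expected).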
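To compare the four combinations discussed here on your own machine, a rough harness along these lines may help. It is only a sketch under several assumptions: the data file is freshly generated random base64 text (not the asker's testfile), the helper names now_ns and bench are invented for this example, GNU date is assumed for the %N format, and grep -E stands in for egrep, which recent GNU grep releases flag as obsolescent.

```shell
# Hypothetical test data: roughly 1 MB of base64 text.
data=$(mktemp)
head -c 750000 /dev/urandom | base64 > "$data"

# now_ns: current time in nanoseconds (assumes GNU date, which supports %N).
now_ns() { date +%s%N; }

# bench LABEL CMD...: run CMD 10 times and print the total elapsed milliseconds.
bench() {
    label=$1
    shift
    start=$(now_ns)
    i=0
    while [ "$i" -lt 10 ]; do
        "$@" >/dev/null 2>&1 || true   # grep exits 1 when nothing matches; ignore that
        i=$((i + 1))
    done
    end=$(now_ns)
    echo "$label: $(( (end - start) / 1000000 )) ms"
}

bench 'grep -i, current locale'  grep -iq 'thats' "$data"
bench 'brackets, current locale' grep -Eq '[tT][hH][aA][tT][sS]' "$data"
bench 'grep -i, LC_ALL=C'        env LC_ALL=C grep -iq 'thats' "$data"
bench 'brackets, LC_ALL=C'       env LC_ALL=C grep -Eq '[tT][hH][aA][tT][sS]' "$data"

rm -f "$data"
```

Wall-clock totals over ten runs are noisy, so treat small differences as inconclusive; a larger file and more repetitions give steadier numbers.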

muru · Answer

Out of curiosity, I tested this on an Arch Linux system:

    $ uname -r
    4.4.5-1-Arch
    $ df -h .
    Filesystem      Size  Used Avail Use% Mounted on
    tmpfs           3.9G  720K  3.9G   1% /tmp
    $ dd if=/dev/urandom bs=1M count=1K | base64 > foo
    $ df -h .
    Filesystem      Size  Used Avail Use% Mounted on
    tmpfs           3.9G  1.4G  2.6G  35% /tmp
    $ for i in {1..100}; do /usr/bin/time -f '%e' -ao grep.log grep -iq foobar foo; done
    $ for i in {1..100}; do /usr/bin/time -f '%e' -ao egrep.log egrep -q '[fF][oO][oO][bB][aA][rR]' foo; done
    $ grep --version
    grep (GNU grep) 2.23
    Copyright (C) 2016 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.

    Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

And then some statistics, courtesy of "Is there a way to get the min, max, median, and average of a list of numbers in a single command?":

    $ R -q -e "x <- read.csv('grep.log', header = F); summary(x); sd(x[ , 1])"
    > x <- read.csv('grep.log', header = F); summary(x); sd(x[ , 1])
           V1
     Min.   :1.330
     1st Qu.:1.347
     Median :1.360
     Mean   :1.362
     3rd Qu.:1.370
     Max.   :1.440
    [1] 0.02322725
    >
    >
    $ R -q -e "x <- read.csv('egrep.log', header = F); summary(x); sd(x[ , 1])"
    > x <- read.csv('egrep.log', header = F); summary(x); sd(x[ , 1])
           V1
     Min.   :1.330
     1st Qu.:1.340
     Median :1.360
     Mean   :1.365
     3rd Qu.:1.380
     Max.   :1.430
    [1] 0.02320288
    >
    >

I am in the en_GB.utf8 locale, but the timings are nearly indistinguishable.
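If R is not installed, a comparable summary of a timing log can be sketched with sort and awk. The function name summarize and the sample values below are invented for illustration; it expects one number per line, as /usr/bin/time -f '%e' -ao writes, and reports fewer figures than R's summary (no quartiles or standard deviation).

```shell
# summarize FILE: print min, median, mean and max of one number per line.
summarize() {
    sort -n "$1" | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) { print "no data"; exit }
            # Middle value for odd counts, mean of the two middle values otherwise.
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "min=%.3f median=%.3f mean=%.3f max=%.3f\n", v[1], median, sum / NR, v[NR]
        }'
}

# Hypothetical sample data standing in for grep.log.
log=$(mktemp)
printf '%s\n' 1.33 1.36 1.44 1.35 > "$log"
summarize "$log"
rm -f "$log"
```

Running summarize grep.log and summarize egrep.log side by side gives a quick check of whether the two distributions overlap.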