How can I compare the same words across multiple files?

Question

I want to count the words that several files have in common and show which files they appear in.

File1:

This is so beautiful

File2:

There are so beautiful

File3:

so beautiful

Desired output 1:

    so:3
    beautiful:3

Desired output 2:

    so:
    file1:1
    file2:1
    file3:1
    beautiful:
    file1:1
    file2:1
    file3:1

pLumo · Answer

Try this:

    # Declare the files you want to include
    files=( file* )

    # Function to find common words in any number of files
    wcomm() {
        # If no files provided, exit the function.
        [ $# -lt 1 ] && return 1
        # Extract words from first file
        local common_words=$(grep -o "\w*" "$1" | sort -u)
        while [ $# -gt 1 ]; do
            # shift $1 to next file
            shift
            # Extract words from next file
            local next_words=$(grep -o "\w*" "$1" | sort -u)
            # Get only words in common from $common_words and $next_words
            common_words=$(comm -12 <(echo "${common_words,,}") <(echo "${next_words,,}"))
        done
        # Output the words common to all input files
        echo "$common_words"
    }

    # Output number of matches for each of the common words in total and per file
    for w in $(wcomm "${files[@]}"); do
        echo $w:$(grep -oiw "$w" "${files[@]}" | wc -l);
        for f in "${files[@]}"; do
            echo $f:$(grep -oiw "$w" "$f" | wc -l);
        done;
        echo;
    done

Output:

    beautiful:3
    file1:1
    file2:1
    file3:1

    so:3
    file1:1
    file2:1
    file3:1

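Explanation:

It is included as comments in the script.

Features:

Handles as many files as ARG_MAX allows.
Finds all words separated by anything grep treats as a word delimiter.
Case-insensitive, so "Beautiful" and "beautiful" count as the same word.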
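The core of wcomm is the intersection of two sorted word lists with comm -12. The following is a minimal sketch of that step for just two files, assuming GNU grep (for -o and \w) and a plain POSIX sh without process substitution. The helper name common_words and the temporary files are illustrative and not part of the answer above.

```shell
#!/bin/sh
# common_words (hypothetical helper): print the words that occur in both
# files, compared case-insensitively.
common_words() {
    a=$(mktemp)
    b=$(mktemp)
    # Lowercase first, then sort, so both lists are in the order comm expects.
    grep -oE '\w+' "$1" | tr '[:upper:]' '[:lower:]' | sort -u > "$a"
    grep -oE '\w+' "$2" | tr '[:upper:]' '[:lower:]' | sort -u > "$b"
    # -12 suppresses the lines unique to either file, leaving the shared ones.
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Small demonstration on two throwaway files.
dir=$(mktemp -d)
printf 'This is so beautiful\n' > "$dir/file1"
printf 'There are so Beautiful\n' > "$dir/file2"
common_words "$dir/file1" "$dir/file2"
rm -rf "$dir"
```

In this sketch the words are lowercased before sorting rather than after, because comm requires both inputs to be sorted.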

RudiC · Answer

Try this all-awk approach:

    awk '
        {
            for (i=1; i<=NF; i++) {
                WC[$i]++
                FC[$i,FILENAME]++
            }
        }
        END {
            for (w in WC) if (WC[w] > 1) print w, WC[w]
            print ""
            for (f in FC) {
                split (f, T, SUBSEP)
                w = T[1]
                if (WC[w] > 1) {
                    if (!D[w]) print w, ""
                    print T[2], FC[f]
                    D[w] = 1
                }
            }
        }
    ' OFS=":" file[1-3]
    so:3
    beautiful:3

    beautiful:
    file3:1
    file2:1
    file1:1
    so:
    file1:1
    file2:1
    file3:1

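It collects the relevant data for each file (the total word count and the per-file word count) and then, in the END section, prints desired outputs 1 and 2 for every word whose total count is greater than 1.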

Celios · Answer

If you don't want to write code and just need a quick result, you can use the following command:

cat list_of_words | while read line; do echo $line; grep -riE "$line" -c where_to_look_or_folder; done

-r: search directories recursively
-i: case-insensitive
-E: enables extended regular expressions, in case you want to search for something more complicated
-c: count the matching lines

Output:

    Word1
    path:filename:count

Example:

cat text | while read line; do echo $line; grep -riE "$line" -c somwhwere/nowhere; done
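One detail worth checking before relying on these numbers: grep -c counts matching lines, not individual occurrences, so a word that appears several times on one line is counted once. A small sketch of the difference, assuming a grep that supports -o (GNU and BSD grep both do); the helper names count_lines and count_matches are made up for this illustration.

```shell
#!/bin/sh
# count_lines: number of lines containing the pattern (what grep -c reports).
count_lines()   { grep -ci -- "$1" "$2"; }
# count_matches: number of individual matches, one per line of grep -o output.
# tr strips the padding that some wc implementations add.
count_matches() { grep -oi -- "$1" "$2" | wc -l | tr -d ' '; }

sample=$(mktemp)
printf 'foo foo foo\nbar foo\n' > "$sample"
echo "lines:   $(count_lines foo "$sample")"
echo "matches: $(count_matches foo "$sample")"
rm -f "$sample"
```

For the sample above, the first helper reports 2 and the second reports 4.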

glenn jackman · Answer

This Perl one-liner counts every word in the given files and, for each word seen more than once, prints the total count and the per-file counts.

    $ cat file1
    This is so beautiful
    foo foo foo
    $ cat file2
    There are so beautiful
    foo bar bar
    $ cat file3
    so beautiful
    bar baz

Then:

    perl -lane '
        for (@F) {$count{$_}++; $filecount{$_}{$ARGV}++}
        END {
            for $Word (sort keys %count) {
                if ($count{$Word} > 1) {
                    print "$Word:$count{$Word}";
                    print "$_:$filecount{$Word}{$_}" for sort keys %{ $filecount{$Word} };
                    print "";
                }
            }
        }
    ' file{1,2,3}

    bar:3
    file2:2
    file3:1

    beautiful:3
    file1:1
    file2:1
    file3:1

    foo:4
    file1:3
    file2:1

    so:3
    file1:1
    file2:1
    file3:1

To sort the results by count in descending order, use this line in the END block instead:

for $Word (reverse sort {$count{$a} <=> $count{$b}} keys %count) {
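If the totals are already available as plain word:count lines (as in desired output 1), the same ordering can also be obtained outside Perl with sort. This is a different technique from the Perl line above, shown only as a sketch; the helper name by_count_desc is hypothetical, and it does not handle the multi-line per-file blocks.

```shell
#!/bin/sh
# by_count_desc (hypothetical helper): sort "word:count" lines read from
# standard input by the numeric count, highest first; ties are ordered by word.
by_count_desc() {
    sort -t: -k2,2nr -k1,1
}

printf 'bar:3\nbeautiful:3\nfoo:4\nso:3\n' | by_count_desc
```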

Fólkvangr · Answer

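This process shows how many times each word occurs in the text files found in a directory or its subdirectories, both per file and across all files.

Concatenate all text files.
Suppress duplicate words.

The result is a list of words; each word in the list is then counted in turn, in each file as well as in the concatenated files.

Find the number of occurrences of each word in the list.

1. Concatenate all text files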

find . -type f -exec cat {} \;

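find locates every regular file in the current directory and its subdirectories (all of them are assumed to be text files) and calls cat on each match, so their contents are concatenated.

2. Suppress duplicate words

A word list can be built by putting each word on its own line: every non-alphabetic character is replaced with a newline, and each run of newlines is squeezed into a single newline.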

tr -cs '[:alpha:]' '[\n*]'

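Finally, to get a proper word list, duplicate words must be suppressed. uniq filters out repeated lines, but only when they are adjacent, so the lines have to be sorted first with sort.

sort | uniq

or

sort -u

Note: the search can be made case-insensitive by adding the following command to the pipeline: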

tr '[:upper:]' '[:lower:]'
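The word-list steps above can be tried on their own before wiring them to find. A minimal sketch, assuming the text arrives on standard input; the function name word_list is hypothetical and not part of the original answer.

```shell
#!/bin/sh
# word_list (hypothetical helper name): read text on standard input and
# print every distinct word once, lowercased, one word per line.
word_list() {
    # 1. turn each run of non-letters into a single newline
    # 2. fold upper case to lower case
    # 3. sort and drop duplicate lines
    tr -cs '[:alpha:]' '[\n*]' |
        tr '[:upper:]' '[:lower:]' |
        sort -u
}

printf 'Sed tristique egestas massa sed facilisis.\n' | word_list
```

Because of the case folding, "Sed" and "sed" collapse into a single entry in this example.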

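Potential problem

The script file itself must not be concatenated with the other files, even if it is located in the current directory or one of its subdirectories (cf. man find).

find . -type f \( \! -name "*${0##*/}" \) -exec cat {} \;

Note: $0 expands to the name of the shell or shell script. ${PARAMETER##Word} expands to the parameter with the longest match of the pattern removed from its beginning (cf. shell parameter expansion).

For example, /usr/local/bin/myscript becomes myscript.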
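The expansion is easy to check in isolation. The sketch below applies the same ##*/ pattern to an ordinary variable instead of $0; the function name script_name is hypothetical.

```shell
#!/bin/sh
# script_name (hypothetical helper): strip everything up to and including
# the last "/" from a path, as "${0##*/}" does for the script's own path.
script_name() {
    path=$1
    printf '%s\n' "${path##*/}"
}

script_name /usr/local/bin/myscript
script_name myscript    # no slash in the argument: printed unchanged
```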

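Result

list=$(find . -type f \( \! -name "*${0##*/}" \) -exec cat {} \; | tr -cs '[:alpha:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort -u)

Command substitution lets you store the output of the pipeline in a variable.

3. Find the number of occurrences

a) Find every occurrence of the given word; each occurrence is printed on its own line...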

grep --exclude="*${0##*/}" -Rowi $Word .

...and count the lines:

grep --exclude="*${0##*/}" -Rowi $Word . | wc -l

b) Count the matches in each input file:

grep --exclude="*${0##*/}" -Rowi $Word . | tr '[:upper:]' '[:lower:]' | uniq -c | sed -E "$sed_script"

This finds every occurrence of the given word (each on its own line), prints the number of occurrences together with the source file, and reformats the data.
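To see what the reformatting step does, it can be run on a few hand-written lines in the FILE:WORD shape that grep -Ro prints. This is only a sketch: the sed script is the one defined in the full shell script of this answer, the helper name reformat_counts is hypothetical, and a sed that accepts -E is assumed.

```shell
#!/bin/sh
# sed script from the answer: rewrite "COUNT FILE:WORD" as "FILE:COUNT",
# then strip the leading blanks that uniq -c pads its counts with.
sed_script='s/([[:digit:]]+) (.*):[[:alpha:]]+/\2:\1/;s/[[:blank:]]+//'

# reformat_counts (hypothetical helper): read "FILE:WORD" lines and print
# one "FILE:COUNT" line per run of identical lines.
reformat_counts() {
    tr '[:upper:]' '[:lower:]' | uniq -c | sed -E "$sed_script"
}

printf './file1:cat\n./file1:Cat\n./file2:cat\n' | reformat_counts
```

Note that tr lowercases the whole line in this pipeline, so file names are folded to lower case as well.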

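Note: --exclude is a GNU extension and is not specified for POSIX grep. If that is a problem, remove the option and put the following statements at the top of the shell script, so that it works on a directory passed as an argument (presumably one that does not contain the script itself).

    # assign first positional parameter to "dirname"
    # and move to this directory
    dirname=${1:?first positional parameter missing\!}
    cd "$dirname"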

Shell script

    #!/bin/sh

    list=$(find . -type f \( \! -name "*${0##*/}" \) -exec cat {} \; |
        tr -cs '[:alpha:]' '[\n*]' |
        tr '[:upper:]' '[:lower:]' |
        sort -u)

    # extract data and print formatted data
    sed_script='s/([[:digit:]]+) (.*):[[:alpha:]]+/\2:\1/;s/[[:blank:]]+//'

    for Word in $list; do
        echo $Word:$(grep --exclude="*${0##*/}" -Rowi $Word . | wc -l)
        grep --exclude="*${0##*/}" -Rowi $Word . |
            tr '[:upper:]' '[:lower:]' |
            uniq -c |
            sed -E "$sed_script"
        echo
    done | sed -e "/^[[:alpha:]]\+:1/{N;N;d;}"

Input

    Prompt% cat file1
    Lorem ipsum dolor sit amet.
    cat cat
    Prompt% cat file2
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    cat ut
    Prompt% cat file3
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Sed tristique egestas massa sed facilisis.
    Duis hendrerit ut.
    cat tristique tristique tristique tristique

Output

    adipiscing:2
    ./file2:1
    ./file3:1

    amet:3
    ./file1:1
    ./file2:1
    ./file3:1

    cat:4
    ./file1:2
    ./file2:1
    ./file3:1

    consectetur:2
    ./file2:1
    ./file3:1

    dolor:3
    ./file1:1
    ./file2:1
    ./file3:1

    elit:2
    ./file2:1
    ./file3:1

    ipsum:3
    ./file1:1
    ./file2:1
    ./file3:1

    lorem:3
    ./file1:1
    ./file2:1
    ./file3:1

    sed:2
    ./file3:2

    sit:3
    ./file1:1
    ./file2:1
    ./file3:1

    tristique:5
    ./file3:5

    ut:2
    ./file2:1
    ./file3:1

Kamaraj · Answer

Try this code and adjust it as needed.

    bash-4.1$ cat test.sh
    #!/bin/bash

    OUTPUT_FILE=/tmp/output.txt

    awk '{
        for(i=1;i<=NF;i++)
        {
            Arr[$i]++
        }
    }
    END{
        for (i in Arr){
            if(Arr[i]>1)
            {
                print i":"Arr[i]
            }
        }
    }' file* > ${OUTPUT_FILE}

    cat ${OUTPUT_FILE}
    echo ""

    IFS=":"
    while read Word TOTAL_COUNT
    do
        echo "${Word}:"
        for FILE_NAME in file*
        do
            COUNT=$(tr ' ' '\n' < ${FILE_NAME} | grep -c "${Word}")
            if [ "${COUNT}" -gt "0" ]
            then
                echo "${FILE_NAME}:${COUNT}"
            fi
        done
    done < ${OUTPUT_FILE}
    bash-4.1$ bash test.sh
    beautiful:3
    so:3

    beautiful:
    file1:1
    file2:1
    file3:1
    so:
    file1:1
    file2:1
    file3:1

αғsнιη · Answer

Use grep to print each word together with the name of the file it came from, then use awk to reformat that output into the desired result.

    grep -Ho '\w\+' file* |
    awk -F':' '{ words[$1 FS $2]++; seen[$2]++ }
    END{
        for (x in seen) {
            print x":" seen[x];
            for (y in words) {
                if (y ~ "\<" x "\>")print substr(y, 1, length(y)-length(x)), words[y]
            }
        }
    }'

This gives nicely readable output like the following (both desired outputs at once):

    so:3
    file1: 1
    file2: 1
    file3: 1
    This:1
    file1: 1
    beautiful:3
    file3: 1
    file1: 1
    file2: 1
    There:1
    file2: 1
    are:1
    file2: 1
    is:1
    file1: 1