How do I create a frequency list of every word in a file?

Question

I have a file like this:

This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.

I would like to generate a two-column list. The first column shows which words appear, and the second column shows how often they appear. For example:

    this@1
    is@1
    a@1
    file@1
    with@1
    many@1
    words@3
    some@2
    of@2
    the@2
    only@1
    appear@2
    more@1
    than@1
    one@1
    once@1
    time@1

To make this simpler, I will remove all punctuation and change all text to lowercase before processing the list.
Unless there is a simple way around it, words and word can count as two separate words.

So far, I have this:

    sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
    while read line
    do
        count="$(grep -c $line file1.txt)"
        echo $line"@"$count >> file2.txt # add word and frequency to file
    done < ./file1.txt
    sort -u -d # remove duplicate lines

For some reason, this only shows "0" after each word.

How can I generate a list of every word that appears in a file, along with its frequency?

eduffy · Accepted Answer

Not sed and grep, but tr, sort, uniq, and awk:

    % (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
    This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.
    EOF

    a@1
    appear@2
    file@1
    is@1
    many@1
    more@1
    of@2
    once.@1
    one@1
    only@1
    Some@2
    than@1
    the@2
    This@1
    time.@1
    with@1
    words@2
    words.@1
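One caveat with tr ' ' '\n' is that it splits on single spaces only, so repeated spaces produce empty lines (which then get counted) and tabs are left alone. A minimal sketch of a whitespace-tolerant variant follows. The function name count_words_at is made up for illustration, and punctuation and case are left untouched, exactly as in the pipeline above.

```shell
# Hypothetical helper (the name is illustrative, not from the answer):
# reads text on stdin and prints word@count lines sorted by word.
count_words_at() {
    tr -s '[:space:]' '\n' |                # any run of whitespace -> one newline
        LC_ALL=C sort |                     # byte-order sort puts equal words together
        uniq -c |                           # prefix each distinct word with its count
        awk 'NF == 2 { print $2 "@" $1 }'   # skip the empty-word line, emit word@count
}
```

It can be called as count_words_at < file1.txt. The NF == 2 guard is there because leading whitespace in the input still yields one empty line after tr.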

Bohdan · Answer

uniq -c already does what you want; just sort the input:

    echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

Output:

    6 a
    7 d
    7 s

Rony · Answer

Contents of the input file

    $ cat inputFile.txt
    This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.

Using sed | sort | uniq

    $ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
          1 a
          2 appear
          1 file
          1 is
          1 many
          1 more
          2 of
          1 once
          1 one
          1 only
          2 some
          1 than
          2 the
          1 this
          1 time
          1 with
          3 words

uniq -ic will also count while ignoring case, but the result list will then contain This instead of this.
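The uniq -ic remark can be tried out with a short sketch. The function name count_ignore_case is hypothetical, and the sketch assumes a uniq that supports -i (GNU and BSD versions do, but POSIX does not require it). Because nothing is lowercased, each group is reported with the spelling of its first line.

```shell
# Hypothetical helper: counts words case-insensitively without lowercasing them.
# Assumption: uniq supports -i (a GNU/BSD option, not required by POSIX).
count_ignore_case() {
    tr -d '.' |                             # strip full stops, like the sed version
        tr -s ' ' '\n' |                    # one word per line
        LC_ALL=C sort -f |                  # fold case so This/this become adjacent
        uniq -ic |                          # count, comparing case-insensitively
        awk 'NF == 2 { print $2 "@" $1 }'   # normalise the output to word@count
}
```

Run as count_ignore_case < inputFile.txt, the sample file reports Some and This with their capitals intact, which is the difference from the \L approach.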

Sheharyar · Answer

Let's AWK!

This function lists the frequency of each word occurring in the provided file in descending order:

    function wordfrequency() {
      awk '
        BEGIN { FS="[^a-zA-Z]+" }
        {
          for (i=1; i<=NF; i++) {
            if ($i == "") continue   # skip empty fields next to punctuation
            word = tolower($i)
            words[word]++
          }
        }
        END {
          for (w in words)
            printf("%3d %s\n", words[w], w)
        } ' | sort -rn
    }

You can call it on your file like this:

    $ cat your_file.txt | wordfrequency

Source: AWK-ward Ruby

Jerin A Mathews · Answer

You can use tr for this. Just run:

tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt

Sample output for a text file of city names:

    3026 Toronto
    2006 Montréal
    1117 Edmonton
    1048 Calgary
     905 Ottawa
     724 Winnipeg
     673 Vancouver
     495 Brampton
     489 Mississauga
     482 London
     467 Hamilton
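If only the most frequent entries are of interest, the ranked list can be cut off with head. The following is a rough sketch under a few assumptions: top_words is an invented name, the default of 10 is arbitrary, and ties are broken alphabetically so the output is repeatable.

```shell
# Hypothetical helper: print the N most frequent words read from stdin.
# N defaults to 10 (an arbitrary choice for this sketch).
top_words() {
    tr -s '[:space:]' '\n' |            # split on any whitespace
        grep -v '^$' |                  # drop the empty line left by leading blanks
        LC_ALL=C sort |
        uniq -c |
        LC_ALL=C sort -k1,1nr -k2 |     # highest count first, then by word
        head -n "${1:-10}" |
        awk '{ print $1, $2 }'          # trim the padding that uniq -c adds
}
```

For example, top_words 5 < NAME_OF_FILE would print the five most common words, one "count word" pair per line.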

potong · Answer

This might work for you:

    tr '[:upper:]' '[:lower:]' <file | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'

John Red · Answer

Python 3!

    """Counts the frequency of each word in the given text; words are
    defined as entities separated by whitespaces; punctuations and other
    symbols are ignored; case-insensitive; input can be passed through
    stdin or through a file specified as an argument; prints highest
    frequency words first"""

    # Case-insensitive
    # Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/

    import sys

    # Find if input is being given through stdin or from a file
    lines = None
    if len(sys.argv) == 1:
        lines = sys.stdin
    else:
        lines = open(sys.argv[1])

    D = {}
    for line in lines:
        for word in line.split():
            # Drop punctuation characters, then normalise case
            word = ''.join(list(filter(
                lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
                word)))
            word = word.lower()
            if word in D:
                D[word] += 1
            else:
                D[word] = 1

    # Print the most frequent words first
    for word in sorted(D, key=D.get, reverse=True):
        print(word + ' ' + str(D[word]))

Let's name this script "frequency.py" and add a line to "~/.bash_aliases":

    alias freq="python3 /path/to/frequency.py"

Now, to find the most frequent words in the file "content.txt", run:

    freq content.txt

You can also pipe text into it:

    cat content.txt | freq

And even analyze text from multiple files:

    cat content.txt story.txt article.txt | freq

If you are using Python 2, just replace:

    ''.join(list(filter(args...))) with filter(args...)
    python3 with python
    print(whatever) with print whatever

Dennis Williamson · Answer

This requires GNU AWK (gawk) for the sorting. If you have another AWK without asort(), it can easily be adjusted and then piped to sort.

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile

Broken out on multiple lines:

    awk '{
        gsub(/\./, "");
        for (i = 1; i <= NF; i++) {
            w = tolower($i);
            count[w]++;
            words[w] = w
        }
    }
    END {
        qty = asort(words);
        for (w = 1; w <= qty; w++)
            print words[w] "@" count[words[w]]
    }' inputfile
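The adjustment for an AWK without asort() could look roughly like the sketch below: the awk program only counts, and the ordering is delegated to sort. The function name count_words_portable is invented here, and LC_ALL=C is an added assumption so that the ordering is the same on every system.

```shell
# Hypothetical portable variant: no asort(), so it should run on any POSIX awk.
# Reads the named files, or stdin when no file is given.
count_words_portable() {
    awk '{
        gsub(/\./, "")                  # strip full stops, as in the gawk version
        for (i = 1; i <= NF; i++)
            count[tolower($i)]++        # case-insensitive tally
    }
    END {
        for (w in count)                # iteration order is unspecified in awk
            print w "@" count[w]
    }' "$@" | LC_ALL=C sort             # so the sorting happens outside awk
}
```

Calling count_words_portable inputfile should give the same word@count lines as the gawk one-liner for plain ASCII input.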

Dani Konoplya · Answer

    awk '
    BEGIN { word[""] = 0 }
    { for (el = 1; el <= NF; ++el) word[$el]++ }
    END { for (i in word) if (i != "") print word[i], i }
    ' file.txt | sort -nr

GL2014 · Answer

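    #!/usr/bin/env bash
    # Requires bash 4+ for associative arrays.

    declare -A map
    words="$1"

    [[ -f $1 ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }

    # Tally every whitespace-separated word in the file
    while read -r line; do
      for word in $line; do
        ((map[$word]++))
      done
    done < "$words"

    # Field 5 of each output line is the count, so sort on it
    for key in "${!map[@]}"; do
      echo "the word $key appears ${map[$key]} times"
    done | sort -nr -k5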