Identify duplicate lines in a file without deleting them?

Question

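I keep my references in a text file as a long list of entries, each with two (or more) fields.

The first column is the reference's URL. The second column is the title, which may vary a bit depending on how the entry was made. The same goes for the third field, which may or may not be present.

I want to identify, but not remove, entries whose first field (the reference URL) is identical. I know about sort -k1,1 -u, but that automatically (non-interactively) removes everything except the first hit. Is there a way to just be told about the duplicates so that I can choose which one to keep?

In the extract below, three lines share the same first field (http://unix.stackexchange.com/questions/49569/). I would like to keep line 2, because it has additional tags (sort, CLI), and delete lines #1 and #3: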

http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field

Is there a program that helps identify such "duplicates"? I could then clean up manually by deleting lines #1 and #3 myself.

Radu Rădeanu · Accepted Answer

If I understand your question, I think you need something like:

for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done

Or:

for dup in $(cut -d " " -f1 file.txt | sort | uniq -d); do grep -n -- "$dup" file.txt; done

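Here file.txt is the file containing the data you are interested in.

The output lists the matching lines together with their line numbers. The first command groups the lines for every URL, so a repeated URL shows up as a group of two or more lines; the second prints only the lines whose first field is found more than once.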

Lekensteyn · Answer

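This is a classic problem that can be solved with the uniq command. uniq detects duplicate consecutive lines and can either print only the lines that are not repeated (-u, --unique) or keep only the duplicated ones (-d, --repeated).

Since the ordering of duplicate lines does not matter to you, sort the file first. Then use uniq to print unique lines only: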

sort yourfile.txt | uniq -u

There is also a -c (--count) option that prints how many times each line occurs; it can be combined with -d. See the uniq manual page for details.
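To make the difference between these options concrete, here is a minimal Python sketch of what uniq -c, -u and -d report for input that is already sorted. The function name uniq_report and the sample list are made up for illustration; the sketch only mimics uniq's grouping of adjacent equal lines and is not how uniq itself is implemented.

```python
from itertools import groupby

def uniq_report(sorted_lines):
    """Group adjacent equal lines, as uniq does, and return (count, line) pairs."""
    return [(len(list(group)), line) for line, group in groupby(sorted_lines)]

# Hypothetical sample data; sorting makes equal lines adjacent.
lines = sorted(["b", "a", "b", "c", "a", "b"])
report = uniq_report(lines)

print(report)                                        # counts, like uniq -c
print([line for count, line in report if count == 1])  # like uniq -u
print([line for count, line in report if count > 1])   # like uniq -d
```

Because only adjacent lines are grouped, calling uniq_report on unsorted input would miss duplicates that are separated by other lines, which is why the sort step comes first.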

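If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print each line number for them (append another | sort -n to have the output sorted by line number):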

cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -f1 -D

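Since you want to see the duplicate lines themselves (using the first field as the key), you cannot use uniq directly. What makes automation difficult is that the title parts vary, and a program cannot decide on its own which title should be considered the final one.

Here is an AWK script (save it as script.awk) that takes your text file as input and prints all duplicate lines so that you can decide which ones to delete. It uses arrays of arrays, so it needs GNU awk 4.0 or later. (awk -f script.awk yourfile.txt)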

#!/usr/bin/awk -f
{
    # Store the line ($0) grouped per URL ($1) with line number (NR) as key
    lines[$1][NR] = $0;
}
END {
    for (url in lines) {
        # find lines that have the URL occur multiple times
        if (length(lines[url]) > 1) {
            for (lineno in lines[url]) {
                # Print duplicate line for decision purposes
                print lines[url][lineno];
                # Alternative: print line number and line
                #print lineno, lines[url][lineno];
            }
        }
    }
}

terdon · Answer

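If I read this correctly, all you need is something like

awk '{print $1}' file | sort | uniq -c | while read -r num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done

That prints the number of each line that contains a dupe, followed by the line itself. For example, using this file: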

foo bar baz
http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
bar foo baz
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
baz foo bar
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field

It will produce this output:

2:http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
4:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
6:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field

To print only the line numbers, you could do

awk '{print $1}' file | sort | uniq -c | while read -r num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 1

And to print only the lines:

awk '{print $1}' file | sort | uniq -c | while read -r num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 2-

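Explanation:

The awk script prints only the first space-separated field of each line of the file. Use $N to print the Nth field. sort sorts the result and uniq -c counts the occurrences of each line.

This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe. If $num is greater than one (so the line is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command-line option, which is useful when $dupe starts with -.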
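The same pipeline can be sketched in Python. This also sidesteps one caveat of the grep step: grep treats $dupe as a pattern and matches it anywhere in a line, whereas the version below compares the first field exactly. The function name duplicate_lines and the shortened sample data are assumptions made for illustration.

```python
from collections import Counter

def duplicate_lines(lines):
    """Return (line_number, line) for lines whose first field occurs more than once."""
    # First whitespace-separated field of each line; empty string for blank lines.
    keys = [line.split()[0] if line.split() else "" for line in lines]
    counts = Counter(keys)  # plays the role of sort | uniq -c
    return [
        (number, line)
        for number, (key, line) in enumerate(zip(keys, lines), start=1)
        if key and counts[key] > 1  # the "$num greater than one" check
    ]

sample = [
    "foo bar baz",
    "http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field",
    "bar foo baz",
    "http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI",
]
for number, line in duplicate_lines(sample):
    print(f"{number}:{line}")  # same "N:line" shape as grep -n
```

Line numbers start at 1 so that they line up with what grep -n would report for the same file.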

Jacob Vlijm · Answer

No doubt the most verbose one in the list, and it could probably be shorter:

#!/usr/bin/python3
import collections

file = "file.txt"

def find_duplicates(file):
    with open(file, "r") as sourcefile:
        data = sourcefile.readlines()
    splitlines = [
        (index, data[index].split(" ")) for index in range(0, len(data))
        ]
    lineheaders = [item[1][0] for item in splitlines]
    dups = [x for x, y in collections.Counter(lineheaders).items() if y > 1]
    dupsdata = []
    for item in dups:
        occurrences = [
            splitlines_item[0] for splitlines_item in splitlines
            if splitlines_item[1][0] == item
            ]
        corresponding_lines = [
            "["+str(index)+"] "+data[index] for index in occurrences
            ]
        dupsdata.append((occurrences, corresponding_lines))

    # printing output
    print("found duplicates:\n"+"-"*17)
    for index in range(0, len(dups)):
        print(dups[index], dupsdata[index][0])
        lines = [item for item in dupsdata[index][1]]
        for line in lines:
            print(line, end = "")

find_duplicates(file)

Given a text file like:

monkey banana
dog bone
monkey banana peanut
cat mice
dog cowmeat

it gives an output like:

found duplicates:
-----------------
dog [1, 4]
[1] dog bone
[4] dog cowmeat
monkey [0, 2]
[0] monkey banana
[2] monkey banana peanut

Once you have picked the lines to remove:

removelist = [2,1]

def remove_duplicates(file, removelist):
    removelist = sorted(removelist, reverse=True)
    with open(file, "r") as sourcefile:
        data = sourcefile.readlines()
    for index in removelist:
        data.pop(index)
    with open(file, "wt") as sourcefile:
        for line in data:
            sourcefile.write(line)

remove_duplicates(file, removelist)

Clint Smith · Answer

Here is how I solved it:

file_with_duplicates:

1,a,c
2,a,d
3,a,e <--duplicate
4,a,t
5,b,k <--duplicate
6,b,l
7,b,s
8,b,j
1,b,l
3,a,d <--duplicate
5,b,l <--duplicate

File sorted and deduplicated by columns 1 and 2:

sort -t',' -k1,1 -k2,2 -u file_with_duplicates

File sorted only by columns 1 and 2:

sort -t',' -k1,1 -k2,2 file_with_duplicates

Show only the difference:

diff <(sort -t',' -k1,1 -k2,2 -u file_with_duplicates) <(sort -t',' -k1,1 -k2,2 file_with_duplicates)
3a4
> 3,a,d
6a8
> 5,b,l
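The diff shows exactly the lines that sort -u would have discarded. A rough Python equivalent is sketched below, assuming comma-separated fields and a key made of the first two columns. dropped_by_dedup is a hypothetical helper name, and unlike sort it walks the file in its original order instead of sorting it.

```python
def dropped_by_dedup(lines, key_fields=2, sep=","):
    """Return the lines that a first-wins dedup on the leading fields would discard."""
    seen = set()
    dropped = []
    for line in lines:
        key = tuple(line.split(sep)[:key_fields])
        if key in seen:
            dropped.append(line)  # a later line whose key was already seen
        else:
            seen.add(key)
    return dropped

# The sample data from this answer, without the "<--duplicate" annotations.
data = ["1,a,c", "2,a,d", "3,a,e", "4,a,t", "5,b,k", "6,b,l",
        "7,b,s", "8,b,j", "1,b,l", "3,a,d", "5,b,l"]
print(dropped_by_dedup(data))
```

For this sample it reports 3,a,d and 5,b,l, the same two lines that the diff output singles out.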

DK Bose · Answer

Consider the following sorted file.txt:

addons.mozilla.org/en-US/firefox/addon/click-to-play-per-element/ ::: C2P per-element
addons.mozilla.org/en-us/firefox/addon/prospector-oneLiner/ ::: OneLiner
askubuntu.com/q/21033 ::: What is the difference between gksudo and gksu?
askubuntu.com/q/21148 ::: openoffice calc sheet tabs (also askubuntu.com/q/138623)
askubuntu.com/q/50540 ::: What is Ubuntu's Definition of a "Registered Application"?
askubuntu.com/q/53762 ::: How to use lm-sensors?
askubuntu.com/q/53762 ::: how-to-use-to-use-lm-sensors
stackoverflow.com/q/4594319 ::: bash - Shell replace cr\lf by comma
stackoverflow.com/q/4594319 ::: Shell replace cr\lf by comma
wiki.ubuntu.com/ClipboardPersistence ::: ClipboardPersistence
wiki.ubuntu.com/ClipboardPersistence ::: ClipboardPersistence - Ubuntu Wiki
www.youtube.com/watch?v=1olY5Qzmbk8 ::: Create new mime types in Ubuntu
www.youtube.com/watch?v=2hu9JrdSXB8 ::: Change mouse cursor
www.youtube.com/watch?v=Yxfa2fXJ1Wc ::: Mouse cursor size

Because the list is short, I can see (after sorting) that there are three sets of duplicates.

Then, for example, I can choose to keep:

askubuntu.com/q/53762 ::: How to use lm-sensors?

rather than

askubuntu.com/q/53762 ::: how-to-use-to-use-lm-sensors

But for a longer list this will be difficult. Based on the two answers, one suggesting uniq and the other suggesting cut, I find that this command gives me the output I would like:

$ cut -d " " -f1 file.txt | uniq -d
askubuntu.com/q/53762
stackoverflow.com/q/4594319
wiki.ubuntu.com/ClipboardPersistence
$