Manipulating delimited data into a useful CSV

Question

I have output in the following format:

    count id type
    588 10 | 3
    10 12 | 3
    883 14 | 3
    98 17 | 3
    17 18 | 1
    77598 18 | 3
    10000 21 | 3
    17892 2 | 3
    20000 23 | 3
    63 27 | 3
    6 3 | 3
    2446 35 | 3
    14 4 | 3
    15 4 | 1
    253 4 | 2
    19857 4 | 3
    1000 5 | 3
    ...

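This is pretty messy and needs to be cleaned up into a CSV so that I can hand it to a project manager as a spreadsheet.

The core of the problem is this: I need the output to be:

id,sum_of_type_1,sum_of_type_2,sum_of_type_3

Take id "4" as an example:

    14 4 | 3
    15 4 | 1
    253 4 | 2
    19857 4 | 3

This should instead become: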

4,15,253,19871

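Unfortunately, I am pretty rubbish at this sort of thing. I have managed to clean up all the lines into CSV, but I have not been able to de-duplicate and group the rows. Right now I have this:

    awk 'BEGIN{OFS=",";} {split($line, part, " "); print part[1],part[2],part[4]}' | awk '{ gsub (" ", "", $0); print}'

But all that does is clean up the junk characters and print the rows again.

What is the best way to massage the rows into the output described above?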

DarkHeart · Accepted Answer

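The way to do it is to put everything into a hash.

    # put values into a hash based on the id and tag
    awk 'NR>1{n[$2","$4]+=$1}
    END{
        # merge the same ids on the one line
        # note: for-in order is unspecified in awk, so rows and type columns are unordered
        for(i in n){
            id=i;
            sub(/,.*/,"",id);
            a[id]=a[id]","n[i];
        }
        # print everything
        for(i in a){
            print i""a[i];
        }
    }'

Edit: my first answer did not answer the question correctly.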

choroba · Answer

Perl to the rescue:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    <>;  # Skip the header.

    my %sum;
    my %types;
    while (<>) {
        my ($count, $id, $type) = grep length, split '[\s|]+';
        $sum{$id}{$type} += $count;
        $types{$type} = 1;
    }

    say join ',', 'id', sort keys %types;
    for my $id (sort { $a <=> $b } keys %sum) {
        say join ',', $id, map $_ // q(), @{ $sum{$id} }{ sort keys %types };
    }

It keeps two tables: a table of types and a table of ids. For each id, it stores the sum per type.
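As a rough illustration of the same two-table idea in Python (standard library only), the sketch below keeps a nested dictionary of sums per id and type, plus a set of the types seen. The function name `crosstab_rows` and the sample lines are made up for this example. Missing combinations are filled with 0 here, whereas the Perl script above leaves them empty.

```python
import re
from collections import defaultdict

def crosstab_rows(lines):
    """Sum counts per (id, type) and return CSV rows, header first."""
    sums = defaultdict(lambda: defaultdict(int))  # id -> type -> sum
    types = set()                                 # every type seen so far
    for line in lines[1:]:                        # skip the header line
        fields = [f for f in re.split(r"[\s|]+", line) if f]
        if len(fields) != 3:
            continue                              # ignore blank or malformed lines
        count, id_, type_ = map(int, fields)
        sums[id_][type_] += count
        types.add(type_)
    columns = sorted(types)
    rows = [",".join(["id"] + [str(t) for t in columns])]
    for id_ in sorted(sums):
        cells = [str(sums[id_].get(t, 0)) for t in columns]  # 0 for missing types
        rows.append(",".join([str(id_)] + cells))
    return rows

# Hypothetical sample taken from the rows shown in the question.
sample = [
    "count id type",
    "14 4 | 3",
    "15 4 | 1",
    "253 4 | 2",
    "19857 4 | 3",
    "1000 5 | 3",
]
print("\n".join(crosstab_rows(sample)))
```

Because the column list is derived from the types actually seen, a new type in the input simply adds a column.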

steeldriver · Answer

If GNU datamash is an option,

    awk 'NR>1 {print $1, $2, $4}' OFS=, file | datamash -t, -s --filler=0 crosstab 2,3 sum 1
    ,1,2,3
    10,0,0,588
    12,0,0,10
    14,0,0,883
    17,0,0,98
    18,17,0,77598
    2,0,0,17892
    21,0,0,10000
    23,0,0,20000
    27,0,0,63
    3,0,0,6
    35,0,0,2446
    4,15,253,19871
    5,0,0,1000

Maarten Fabré · Answer

Python (and especially the pandas library) is very well suited to this kind of work.

    data = """count id type
    588 10 | 3
    10 12 | 3
    883 14 | 3
    98 17 | 3
    17 18 | 1
    77598 18 | 3
    10000 21 | 3
    17892 2 | 3
    20000 23 | 3
    63 27 | 3
    6 3 | 3
    2446 35 | 3
    14 4 | 3
    15 4 | 1
    253 4 | 2
    19857 4 | 3
    1000 5 | 3"""

    import pandas as pd
    from io import StringIO  # to read from string, not needed to read from file

    # a regex separator requires the python parsing engine
    df = pd.read_csv(StringIO(data), sep=r'\s+\|?\s*', index_col=None, engine='python')

This reads the csv data into a pandas DataFrame:

        count  id  type
    0     588  10     3
    1      10  12     3
    2     883  14     3
    3      98  17     3
    4      17  18     1
    5   77598  18     3
    6   10000  21     3
    7   17892   2     3
    8   20000  23     3
    9      63  27     3
    10      6   3     3
    11   2446  35     3
    12     14   4     3
    13     15   4     1
    14    253   4     2
    15  19857   4     3
    16   1000   5     3

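Then we group this data by type and id, and take the sum of the count column:

    # pass the keys as a list; recent pandas versions treat a tuple as a single key
    df_sum = df.groupby(['type', 'id'])['count'].sum().unstack('type').fillna(0)

unstack reshapes the result so that type moves into the columns, and fillna fills the empty fields with 0.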

    df_sum.to_csv()

This returns:

    id,1,2,3
    2,0.0,0.0,17892.0
    3,0.0,0.0,6.0
    4,15.0,253.0,19871.0
    5,0.0,0.0,1000.0
    10,0.0,0.0,588.0
    12,0.0,0.0,10.0
    14,0.0,0.0,883.0
    17,0.0,0.0,98.0
    18,17.0,0.0,77598.0
    21,0.0,0.0,10000.0
    23,0.0,0.0,20000.0
    27,0.0,0.0,63.0
    35,0.0,0.0,2446.0

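Because the DataFrame contains missing data (empty id and type combinations), pandas converts the ints to floats (a limitation of its internal workings). If you know the input is all ints, you can change the second-to-last line to df_sum = df.groupby(['type', 'id'])['count'].sum().unstack('type').fillna(0).astype(int)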

user218374 · Answer

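You can use Perl to loop through the CSV file and accumulate the sums of the appropriate types in a hash along the way. At the end, display the information collected for each id.

Data structure

    %h = (
        ID1 => [ sum_of_type1, sum_of_type2, sum_of_type3 ],
        ...
    )

This helps in understanding the code below.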

Perl

    perl -wMstrict -Mvars='*h' -F'\s+|\|' -lane '
        $, = chr 44, next if $. == 1;
        my($count, $id, $type) = grep /./, @F;
        $h{ $id }[ $type-1 ] += $count}{
        print $_, map { $_ || 0 } @{ $h{$_} } for sort { $a <=> $b } keys %h
    ' yourcsvfile

Output

    2,0,0,17892
    3,0,0,6
    4,15,253,19871
    5,0,0,1000
    ...
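The `%h` layout above can be mirrored in Python as a dictionary of fixed-length lists. This is only a sketch: `NUM_TYPES = 3` is an assumption taken from the question's sample (types 1 to 3), and `sum_by_id` is a name invented here. Preallocating each list means an id that lacks the highest type still gets all three columns.

```python
NUM_TYPES = 3  # assumption: types are the integers 1..3, as in the sample

def sum_by_id(lines):
    """Return {id: [sum_of_type1, sum_of_type2, sum_of_type3]}."""
    sums = {}
    for line in lines[1:]:                       # skip the header line
        fields = line.replace("|", " ").split()  # treat "|" as whitespace
        if len(fields) != 3:
            continue                             # ignore blank or malformed lines
        count, id_, type_ = map(int, fields)
        # Create the zero-filled list on first sight of an id, then add to
        # the slot for this type (type 1 lives at index 0).
        sums.setdefault(id_, [0] * NUM_TYPES)[type_ - 1] += count
    return sums

# Hypothetical sample taken from the rows shown in the question.
sample = ["count id type", "17 18 | 1", "77598 18 | 3", "6 3 | 3"]
for id_, totals in sorted(sum_by_id(sample).items()):
    print(id_, *totals, sep=",")
```

A type outside 1..3 would need explicit handling, since the sketch does not validate it.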

glenn jackman · Answer

My take, not much different from the others. It uses GNU awk, which has arrays of arrays.

    gawk '
        NR == 1 {next}
        {count[$2][$4] += $1}
        END {
            for (id in count) {
                printf "%d", id
                for (type=1; type<=3; type++) {
                    # add zero to coerce possible empty string into a number
                    printf ",%d", 0 + count[id][type]
                }
                print ""        # adds the newline for this line
            }
        }
    ' file

Output

    2,0,0,17892
    3,0,0,6
    4,15,253,19871
    5,0,0,1000
    10,0,0,588
    12,0,0,10
    14,0,0,883
    17,0,0,98
    18,17,0,77598
    21,0,0,10000
    23,0,0,20000
    27,0,0,63
    35,0,0,2446

Prem Joshi · Answer

You can use this code to sum the values based on the id column.

I added one awk statement after your code:

    awk 'BEGIN{OFS=",";} {split($line, part, " "); print part[1],part[2],part[4]}' abcd | awk '{ gsub (" ", "", $0); print}' | awk 'BEGIN{FS=OFS=SUBSEP=","}{arr[$2,$3]+=$1;}END{for ( i in arr ) print i,arr[i];}'

Go ahead with this...