How do I merge files, keeping only the required columns, in Linux?

Question

I have many files in the directory "results", like this:

 58052 results/TB1.genes.results
198003 results/TB1.isoforms.results
 58052 results/TB2.genes.results
198003 results/TB2.isoforms.results
 58052 results/TB3.genes.results
198003 results/TB3.isoforms.results
 58052 results/TB4.genes.results
198003 results/TB4.isoforms.results

For example, the file TB1.genes.results looks like this:

gene_id transcript_id(s) length effective_length expected_count TPM FPKM
ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008 2206.00 1997.20 1.00 0.00 0.01
ENSG00000000005 ENST00000373031,ENST00000485971 940.50 731.73 0.00 0.00 0.00
ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752 977.15 768.35 1865.00 14.27 37.82
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 3779.11 3570.31 1521.00 2.50 6.64
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 1936.74 1727.94 1860.00 6.33 16.77
ENSG00000000938 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472 2020.10 1811.30 6846.00 22.22 58.90
ENSG00000000971 ENST00000359637,ENST00000367429,ENST00000466229,ENST00000470918,ENST00000496761,ENST00000630130 2587.83 2379.04 0.00 0.00 0.00
ENSG00000001036 ENST00000002165,ENST00000367585,ENST00000451668 1912.64 1703.85 1358.00 4.69 12.42
ENSG00000001084 ENST00000229416,ENST00000504353,ENST00000504525,ENST00000505197,ENST00000505294,ENST00000509541,ENST00000510837,ENST00000513939,ENST00000514004,ENST00000514373,ENST00000514933,ENST00000515580,ENST00000616923 2333.50 2124.73 1178.00 3.26 8.64

The other files have the same columns. To merge the "gene_id" and "expected_count" columns from all the "genes.results" files into one text file, I ran the following command:

paste results/*.genes.results | tail -n+2 | cut -f1,5,12,19,26 > final.genes.rsem.txt

[-f1 (gene_id), 5 (expected_count column from TB1.genes.results), 12 (expected_count column from TB2.genes.results), 19 (expected_count column from TB3.genes.results), 26 (expected_count column from TB4.genes.results)]

"final.genes.rsem.txt" now contains the gene_id column plus the expected_count column selected from every file:

ENSG00000000003 1.00 0.00 3.00 2.00
ENSG00000000005 0.00 0.00 0.00 0.00
ENSG00000000419 1865.00 1951.00 5909.00 8163.00
ENSG00000000457 1521.00 1488.00 849.00 1400.00
ENSG00000000460 1860.00 1616.00 2577.00 2715.00
ENSG00000000938 6846.00 5298.00 1.00 2.00
ENSG00000000971 0.00 0.00 6159.00 7069.00
ENSG00000001036 1358.00 1186.00 6196.00 7009.00
ENSG00000001084 1178.00 1186.00 631.00 1293.00

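My question: since I only have a few samples so far, I simply typed the column numbers into the command (as in cut -f1,5,12,19,26). What should I do if I have more than 100 samples? How can I merge them while keeping only the required columns?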

MiniMax · Accepted Answer

This uses GNU awk. I put the command into a bash script, which makes it more convenient to use.

Usage: ./join_files.sh or, for pretty-printed output, ./join_files.sh | column -t.

#!/bin/bash
gawk '
NR == 1 {
    PROCINFO["sorted_in"] = "@ind_num_asc";
    header = $1;
}
FNR == 1 {
    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME);
    header = header OFS file;
}
FNR > 1 {
    arr[$1] = arr[$1] OFS $5;
}
END {
    print header;
    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results

Output (for testing, I created three files with identical contents):

$ ./join_files.sh | column -t
gene_id          TB1      TB2      TB3
ENSG00000000003  1.00     1.00     1.00
ENSG00000000005  0.00     0.00     0.00
ENSG00000000419  1865.00  1865.00  1865.00
ENSG00000000457  1521.00  1521.00  1521.00
ENSG00000000460  1860.00  1860.00  1860.00
ENSG00000000938  6846.00  6846.00  6846.00
ENSG00000000971  0.00     0.00     0.00
ENSG00000001036  1358.00  1358.00  1358.00
ENSG00000001084  1178.00  1178.00  1178.00

Explanation: the same code with comments added. See also man gawk.

gawk '
# NR - the total number of input records seen so far.
# True only for the very first line of the first file.
NR == 1 {
    # If the "sorted_in" element exists in PROCINFO, its value controls
    # the order in which array elements are traversed in a (for in) loop;
    # otherwise the order is undefined.
    PROCINFO["sorted_in"] = "@ind_num_asc";
    # Each field in the input record may be referenced by its position: $1, $2, and so on.
    # $1 is the first field (the first column).
    # The first field of the first line is the word "gene_id";
    # assign it to the header variable.
    header = $1;
}
# FNR - the input record number in the current input file.
# NR counts lines across all files, FNR counts lines within the current file.
# FNR == 1 - true for the first line of each file.
FNR == 1 {
    # Strip the unneeded parts of the file name with the "gensub" function:
    # was    - results/TB1.genes.results
    # become - TB1
    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME);
    # Append it to the header variable, joining it to the
    # previous content of the header with OFS as the delimiter.
    # OFS - the output field separator, a space by default.
    header = header OFS file;
}
# A small trick is used here.
# $1 - the first column value  - "gene_id"
# $5 - the fifth column value  - "expected_count"
FNR > 1 {
    # Create an array indexed by "gene_id": arr["ENSG00000000003"], arr["ENSG00000000419"], and so on,
    # and append the "expected_count" values to it, separated by OFS.
    # Each time $1 equals a given "gene_id", the $5 value is
    # appended to that array item.
    # Example:
    # arr["ENSG00000000003"] = 1.00
    # arr["ENSG00000000003"] = 1.00 2.00
    # arr["ENSG00000000003"] = 1.00 2.00 3.00
    arr[$1] = arr[$1] OFS $5;
}
END {
    print header;
    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results
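Where gawk is not installed, the same idea can probably be expressed in POSIX awk. The following is only a sketch under stated assumptions, not the answer's original method: it assumes the files are tab-separated, it keeps genes in the order they first appear instead of sorting them, and the function name merge_expected_counts and the two tiny demo files are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: merge column 5 (expected_count) of all given files,
# keyed on column 1 (gene_id). Uses only POSIX awk features.
merge_expected_counts() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 {
            # Derive the sample name: strip the directory, then everything from the first dot.
            name = FILENAME
            sub(/.*\//, "", name)
            sub(/\..*/, "", name)
            header = (NR == 1 ? $1 : header) OFS name
            next
        }
        # Remember the order in which gene IDs first appear.
        !($1 in counts) { order[++n] = $1 }
        { counts[$1] = counts[$1] OFS $5 }
        END {
            print header
            for (i = 1; i <= n; i++) print order[i] counts[order[i]]
        }
    ' "$@"
}

# Demo with two one-gene files (illustrative data only).
demo=$(mktemp -d)
for s in TB1 TB2; do
    printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n' > "$demo/$s.genes.results"
done
printf 'ENSG00000000003\tENST00000373020\t2206.00\t1997.20\t1.00\t0.00\t0.01\n' >> "$demo/TB1.genes.results"
printf 'ENSG00000000003\tENST00000373020\t2206.00\t1997.20\t0.00\t0.00\t0.00\n' >> "$demo/TB2.genes.results"

merged=$(merge_expected_counts "$demo"/*.genes.results)
printf '%s\n' "$merged"
rm -r "$demo"
```

On the real data this would be called as merge_expected_counts results/*.genes.results. Like the gawk version, it assumes every file lists the same genes; a gene missing from one file would shift the later columns of its row.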

John Smith · Answer

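If I understand your question correctly, you want to know what to do when you need to output many columns. The cut command you are using understands column ranges. For example, to output columns 1, 5, 7 through 13, and everything from 17 to the last column, use: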

cut -f1,5,7-13,17-
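With many samples, the field list itself can be generated rather than typed by hand. A minimal sketch, assuming every file has the same 7 columns with expected_count in column 5 (as in the question) and that GNU seq is available; the variable names are hypothetical:

```shell
# Assumption: each file contributes 7 columns, and the wanted column is the 5th of each.
cols_per_file=7
wanted=5
n_files=4   # four TB*.genes.results files in the question; adjust to the real count

# Field 1 (gene_id) plus fields 5, 12, 19, ... one per file.
fields="1,$(seq -s, "$wanted" "$cols_per_file" "$((cols_per_file * n_files))")"
echo "$fields"
```

The result can then be passed to the original pipeline as paste results/*.genes.results | tail -n+2 | cut -f"$fields".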

Alternatively, you can use cut to exclude specific fields. For example, to exclude field number 5:

cut --complement -f5

As far as I can see, all you want to do is remove the second column, transcript_id(s), so:

cut --complement -f2
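As a quick check of what this does, here is a small sketch that runs it on one tab-separated row taken from the question. It assumes GNU cut, since --complement is a GNU extension; the variable names are only for illustration.

```shell
# One row from TB1.genes.results, written out with tab separators.
row=$(printf 'ENSG00000000005\tENST00000373031,ENST00000485971\t940.50\t731.73\t0.00\t0.00\t0.00')

# Drop field 2 (transcript_id(s)) and keep every other field.
trimmed=$(printf '%s\n' "$row" | cut --complement -f2)
printf '%s\n' "$trimmed"
```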

P.S. Note that the data you posted does not work with your script. I assume you simplified it and removed some columns.