2つのCSV巨大CSVファイルを比較し、perlを使用してその違いを別のcsvファイルに出力します

Question

複数のフィールド（約30フィールド）の2つのcsvファイルがあり、巨大なサイズ（約4GB）です。

File1: EmployeeName,Age,Salary,Address Vinoth,12,2548.245,"140,North Street,India" Vivek,40,2548.245,"140,North Street,India" Karthick,10,10.245,"140,North Street,India" File2: EmployeeName,Age,Salary,Address Vinoth,12,2548.245,"140,North Street,USA" Karthick,10,10.245,"140,North Street,India" Vivek,40,2548.245,"140,North Street,India"

これら2つのファイルを比較して、その違いを別のcsvファイルに報告します。上記の例では、Employee VivekとKarthickの詳細が異なる行番号で表示されていますが、それでもレコードデータは同じであるため、一致していると見なす必要があります。住所に不一致があるため、従業員Vinothレコードは不一致と見なす必要があります。

出力diff.csvファイルには、以下のようにFile1とFile 2の不一致レコードが含まれている可能性があります。

Diff.csv EmployeeName,Age,Salary,Address F1, Vinoth,12,2548.245,"140,North Street,India" F2, Vinoth,12,2548.245,"140,North Street,USA"

これまでにコードを書きました。この後、バイナリ検索またはこれを行う他の効率的な方法かどうかを選択するオプションが混乱しています。手伝っていただけませんか？

My approach 1. Load the File2 in memory as hashes of hashes. 2.Read line by line from File1 and match it with the hash of hashes in memory. use strict; use warnings; use Text::CSV_XS; use Getopt::Long; use Data::Dumper; use Text::CSV::Hashify; use List::BinarySearch qw( :all ); # Get Command Line Parameters my %opts = (); GetOptions( \%opts, "file1=s", "file2=s", ) or die("Error in command line arguments
"); if ( !defined $opts{'file1'} ) { die "CSV file --file1 not specified.
"; } if ( !defined $opts{'file2'} ) { die "CSV file --file2 not specified.
"; } my $file1 = $opts{'file1'}; my $file2 = $opts{'file2'}; my $file3 = 'diff.csv'; print $file2 . "
"; my $csv1 = Text::CSV_XS->new( { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } ); my $csv2 = Text::CSV_XS->new( { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } ); my $csvout = Text::CSV_XS->new( { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } ); open( my $fh1, '<:encoding(utf8)', $file1 ) or die "Cannot not open '$file1' $!.
"; open( my $fh2, '<:encoding(utf8)', $file2 ) or die "Cannot not open '$file2' $!.
"; open( my $fh3, '>:encoding(utf8)', $file3 ) or die "Cannot not open '$file3' $!.
"; binmode( STDOUT, ":utf8" ); my $f1line = undef; my $f2line = undef; my $header1 = undef; my $f1empty = 'false'; my $f2empty = 'false'; my $reccount = 0; my $hash_ref = hashify( "$file2", 'EmployeeName' ); if ( $f1empty eq 'false' ) { $f1line = $csv1->getline($fh1); } while (1) { if ( $f1empty eq 'false' ) { $f1line = $csv1->getline($fh1); } if ( !defined $f1line ) { $f1empty = 'true'; } if ( $f1empty eq 'true' ) { last; } else { ## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in Perl. Need help here. } } print "End of Program" . "
";

Polar Bear · Accepted Answer

このようなmagnitudeのデータをデータベースに保存することは、この種のタスクへの最も適切なアプローチです。少なくとも SQLlite が推奨されますが、他のデータベース MariaDB 、 MySQL 、 PostgreSQL は非常にうまく機能します。

次のコードは、特別なモジュールなしで目的の出力を実現する方法を示していますが、めちゃくちゃな入力データは考慮されていません。このスクリプトは、違いが1つの余分なスペースである場合でも、データレコードを異なるものとして報告します。

オプションoutputを指定しない限り、デフォルトの出力はコンソールウィンドウです。

注：ファイル＃1全体がメモリに読み込まれます。大きなファイルの処理にはしばらく時間がかかる場合があります。

use strict; use warnings; use feature 'say'; use Getopt::Long qw(GetOptions); use Pod::Usage; my %opt; my @args = ( 'file1|f1=s', 'file2|f2=s', 'output|o=s', 'debug|d', 'help|?', 'man|m' ); GetOptions( \%opt, @args ) or pod2usage(2); print Dumper(\%opt) if $opt{debug}; pod2usage(1) if $opt{help}; pod2usage(-exitval => 0, -verbose => 2) if $opt{man}; pod2usage(1) unless $opt{file1}; pod2usage(1) unless $opt{file2}; unlink $opt{output} if defined $opt{output} and -f $opt{output}; compare($opt{file1},$opt{file2}); sub compare { my $fname1 = shift; my $fname2 = shift; my $hfile1 = file2hash($fname1); open my $fh, '<:encoding(utf8)', $fname2 or die "Couldn't open $fname2"; while(<$fh>) { chomp; next unless /^(.*?),(.*)$/; my($key,$data) = ($1, $2); if( !defined $hfile1->{$key} ) { my $msg = "$fname1 $key is missing"; say_msg($msg); } elsif( $data ne $hfile1->{$key} ) { my $msg = "$fname1 $key,$hfile1->{$key}
$fname2 $_"; say_msg($msg); } } } sub say_msg { my $msg = shift; if( $opt{output} ) { open my $fh, '>>:encoding(utf8)', $opt{output} or die "Couldn't to open $opt{output}"; say $fh $msg; close $fh; } else { say $msg; } } sub file2hash { my $fname = shift; my %hash; open my $fh, '<:encoding(utf8)', $fname or die "Couldn't open $fname"; while(<$fh>) { chomp; next unless /^(.*?),(.*)$/; $hash{$1} = $2; } close $fh; return \%hash; } __END__ =head1 NAME comp_cvs - compares two CVS files and stores differense =head1 SYNOPSIS comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt Options: -f1,--file1 input CVS filename #1 -f2,--file2 input CVS filename #2 -o,--output output filename -d,--debug output debug information -?,--help brief help message -m,--man full documentation =head1 OPTIONS =over 4 =item B<-f1,--file1> Input CVS filename #1 =item B<-f2,--file2> Input CVS filename #2 =item B<-o,--output> Output filename =item B<-d,--debug> Print debug information. =item B<-?,--help> Print a brief help message and exits. =item B<--man> Prints the manual page and exits. =back =head1 DESCRIPTION B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal. =head1 EXIT STATUS The section describes B<EXIT STATUS> codes of the program =head1 ENVIRONMENT The section describes B<ENVIRONMENT VARIABLES> utilized in the program =head1 FILES The section describes B<FILES> which used for program's configuration =head1 EXAMPLES The section demonstrates some B<EXAMPLES> of the code =head1 REPORTING BUGS The section provides information how to report bugs =head1 AUTHOR The section describing author and his contanct information =head1 ACKNOWLEDGMENT The section to give credits people in some way related to the code =head1 SEE ALSO The section describing related information - reference to other programs, blogs, website, ... =head1 HISTORY The section gives historical information related to the code of the program =head1 COPYRIGHT Copyright information related to the code =cut

テストファイルの出力

file1.cvs Vinoth,12,2548.245,"140,North Street,India" file2.cvs Vinoth,12,2548.245,"140,North Street,USA"

user3343917 · Answer

#!/usr/bin/env Perl use Data::Dumper; use Digest::MD5; use 5.01800; use warnings; my %POS; my %chars; open my $FILEA,'<',q{FileA.txt} or die "Can't open 'FileA.txt' for reading! $!"; open my $FILEB,'<',q{FileB.txt} or die "Can't open 'FileB.txt' for reading! $!"; open my $OnlyInA,'>',q{OnlyInA.txt} or die "Can't open 'OnlyInA.txt' for writing! $!"; open my $InBoth,'>',q{InBoth.txt} or die "Can't open 'InBoth.txt' for writing! $!"; open my $OnlyInB,'>',q{OnlyInB.txt} or die "Can't open 'OnlyInB.txt' for writing! $!"; <$FILEA>, $POS{FILEA}=tell $FILEA; <$FILEB>, $POS{FILEB}=tell $FILEB; warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' '; { # Scan for first character of the records involved while (<$FILEA>) { $chars{substr($_,0,1)}++; }; while (<$FILEB>) { $chars{substr($_,0,1)}--; }; # So what characters do we need to deal with? warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' '; }; my @chars=sort keys %chars; { my %_h; # For each of the characters in our character set for my $char (@chars) { warn Data::Dumper->Dump([\$char],[qw(*char)]),' '; # Beginning of data sections seek $FILEA,$POS{FILEA},0; seek $FILEB,$POS{FILEB},0; %_h=(); my $pos=tell $FILEA; while (<$FILEA>) { next unless (substr($_,0,1) eq $char); # for each record save the lengthAndMD5 as the key and its start as the value $_h{lengthAndMD5(\$_)}=$pos; $pos=tell $FILEA; }; my $_s; while (<$FILEB>) { next unless (substr($_,0,1) eq $char); if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate print {$InBoth} $_; delete $_h{$_s}; } else { # (Not in FILEA) It's only in FILEB print {$OnlyInB} $_; } }; # only in FILEA warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' '; for my $key (keys %_h) { # Only in FILEA seek $FILEA,delete $_h{$key},0; print {$OnlyInA} scalar <$FILEA>; }; # Should be empty warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' '; }; }; close $OnlyInB or die "Could NOT close 'OnlyInB.txt' after writing! $!"; close $InBoth or die "Could NOT close 'InBoth.txt' after writing! $!"; close $OnlyInA or die "Could NOT close 'OnlyInA.txt' after writing! $!"; close $FILEB or die "Could NOT close 'FileB.txt' after reading! $!"; close $FILEA or die "Could NOT close 'FileA.txt' after reading! $!"; exit; sub lengthAndMD5 { return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]})); }; __END__