I have two CSV files, each with many fields (about 30) and very large (about 4 GB).
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
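I want to compare these two files and report the differences in another CSV file. In the example above, the details for employees Vivek and Karthick appear on different line numbers, but the record data is the same, so they should be treated as a match. The record for employee Vinoth should be treated as a mismatch because the address differs.
The output Diff.csv file can contain the mismatched records from File1 and File2, as shown below.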
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
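Below is the code I have written so far. I am unsure how to proceed from here: should I use a binary search, or is there some other efficient way of doing this? Could you please help me?
My approach
1. Load File2 into memory as a hash of hashes.
2. Read File1 line by line and match each record against the hash of hashes in memory.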
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );
# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
    or die("Error in command line arguments\n");
if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}
my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';
print $file2 . "\n";
my $csv1 =
    Text::CSV_XS->new(
        { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
    Text::CSV_XS->new(
        { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
    Text::CSV_XS->new(
        { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
open( my $fh1, '<:encoding(utf8)', $file1 )
    or die "Cannot open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
    or die "Cannot open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
    or die "Cannot open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );
my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;
# Load File2 into memory, keyed by the EmployeeName column.
my $hash_ref = hashify( "$file2", 'EmployeeName' );
# Read the first line of File1 (the header) before entering the loop.
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}
while (1) {
    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }
    if ( $f1empty eq 'true' ) {
        last;
    }
    else {
        ## Read each line from File1 and match it with File2, which is loaded as a hash of hashes in Perl. Need help here.
    }
}
print "End of Program" . "\n";
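Storing data of this magnitude in a database is the most appropriate approach for this kind of task. SQLite is the minimum I would recommend, but other databases such as MariaDB, MySQL, or PostgreSQL will work very well too.
The following code shows how to achieve the desired output without any special modules, but it does not account for messy input data. The script reports two records as different even if the only difference is a single extra space.
Unless the --output option is specified, output goes to the console by default.
Note: the whole of file #1 is read into memory. Processing large files may take a while.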
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
use Data::Dumper;    # needed for the --debug dump below
my %opt;
my @args = (
    'file1|f1=s',
    'file2|f2=s',
    'output|o=s',
    'debug|d',
    'help|?',
    'man|m'
);
GetOptions( \%opt, @args ) or pod2usage(2);
print Dumper(\%opt) if $opt{debug};
pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};
pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};
unlink $opt{output} if defined $opt{output} and -f $opt{output};
compare($opt{file1},$opt{file2});
# Compare file #2 line by line against the in-memory copy of file #1
sub compare {
    my $fname1 = shift;
    my $fname2 = shift;
    # Hash of file #1: first field => rest of the line
    my $hfile1 = file2hash($fname1);
    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2: $!";
    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        my($key,$data) = ($1, $2);
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }
    close $fh;
}
# Print a message to the console, or append it to the --output file
sub say_msg {
    my $msg = shift;
    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't open $opt{output}: $!";
        say $fh $msg;
        close $fh;
    } else {
        say $msg;
    }
}
# Read a file into a hash: first field => rest of the line
sub file2hash {
    my $fname = shift;
    my %hash;
    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname: $!";
    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;
    }
    close $fh;
    return \%hash;
}
__END__

=head1 NAME

comp_cvs - compares two CSV files and stores the differences

=head1 SYNOPSIS

    comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt

    Options:
      -f1,--file1   input CSV filename #1
      -f2,--file2   input CSV filename #2
      -o,--output   output filename
      -d,--debug    output debug information
      -?,--help     brief help message
      -m,--man      full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CSV filename #1

=item B<-f2,--file2>

Input CSV filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Print debug information.

=item B<-?,--help>

Prints a brief help message and exits.

=item B<-m,--man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> accepts B<input> and processes it to B<output> with the purpose of achieving some goal.

=head1 EXIT STATUS

This section describes the B<EXIT STATUS> codes of the program

=head1 ENVIRONMENT

This section describes the B<ENVIRONMENT VARIABLES> used by the program

=head1 FILES

This section describes the B<FILES> used for the program's configuration

=head1 EXAMPLES

This section demonstrates some B<EXAMPLES> of the code

=head1 REPORTING BUGS

This section explains how to report bugs

=head1 AUTHOR

This section describes the author and his contact information

=head1 ACKNOWLEDGMENT

This section gives credit to people related to the code in some way

=head1 SEE ALSO

This section lists related information - references to other programs, blogs, websites, ...

=head1 HISTORY

This section gives historical information related to the code of the program

=head1 COPYRIGHT

Copyright information related to the code

=cut
Output for the test files
file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"
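Regarding the memory note above: one way to shrink the in-memory table is to keep a fixed-size digest of each record instead of the record text itself. The following is only a minimal sketch under stated assumptions, not a drop-in replacement for the script. The helper names normalize, digest_table and changed_keys are hypothetical; the first field is assumed to be a unique key without embedded commas (the same assumption the script above makes); and the whitespace trimming is deliberately naive, so it would also remove spaces around commas inside quoted fields.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Trim the line and drop whitespace around commas, so that "a, b" and
# "a,b" compare equal. Naive: quoted fields are not treated specially.
sub normalize {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    $line =~ s/\s*,\s*/,/g;
    return $line;
}

# Build a table: first field => MD5 digest of the rest of the record.
sub digest_table {
    my ($fh) = @_;
    my %digest;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless normalize($line) =~ /^(.*?),(.*)$/;
        $digest{$1} = md5_hex( encode_utf8($2) );
    }
    return \%digest;
}

# Return the keys read from $fh whose record is missing from the table
# or whose digest differs from the one stored in the table.
sub changed_keys {
    my ( $table, $fh ) = @_;
    my @changed;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless normalize($line) =~ /^(.*?),(.*)$/;
        my ( $key, $rest ) = ( $1, $2 );
        push @changed, $key
            if ( !exists $table->{$key}
              || $table->{$key} ne md5_hex( encode_utf8($rest) ) );
    }
    return @changed;
}

# Demo with in-memory filehandles standing in for the two CSV files.
my $old = qq{Vinoth,12,2548.245,"140,North Street,India"\n};
my $new = qq{Vinoth,12,2548.245,"140,North Street,USA"\n};
open my $old_fh, '<', \$old or die "Cannot open in-memory file: $!";
open my $new_fh, '<', \$new or die "Cannot open in-memory file: $!";
print "$_\n" for changed_keys( digest_table($old_fh), $new_fh );
```

With this layout each key costs a 32-character digest rather than a full row of about 30 fields. The trade-off is that the original text of file #1 is no longer available for the report, so printing both versions of a record would need a second lookup in file #1.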
#!/usr/bin/env perl
use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;
my %POS;
my %chars;
open my $FILEA,'<',q{FileA.txt}
or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
or die "Can't open 'OnlyInB.txt' for writing! $!";
# Skip the header line of each file and remember where the data starts
<$FILEA>;
$POS{FILEA}=tell $FILEA;
<$FILEB>;
$POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';
{ # Scan for first character of the records involved
    while (<$FILEA>) {
        $chars{substr($_,0,1)}++;
    };
    while (<$FILEB>) {
        $chars{substr($_,0,1)}--;
    };
    # So what characters do we need to deal with?
    warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
};
my @chars=sort keys %chars;
{
    my %_h;
    # For each of the characters in our character set
    for my $char (@chars) {
        warn Data::Dumper->Dump([\$char],[qw(*char)]),' ';
        # Beginning of data sections
        seek $FILEA,$POS{FILEA},0;
        seek $FILEB,$POS{FILEB},0;
        %_h=();
        my $pos=tell $FILEA;
        while (<$FILEA>) {
            # for each record save the lengthAndMD5 as the key and its start as the value
            $_h{lengthAndMD5(\$_)}=$pos
                if (substr($_,0,1) eq $char);
            # Always advance $pos, even for skipped records, so that it
            # points at the start of the next line to be read
            $pos=tell $FILEA;
        };
        my $_s;
        while (<$FILEB>) {
            next
                unless (substr($_,0,1) eq $char);
            if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
                print {$InBoth} $_;
                delete $_h{$_s};
            }
            else { # (Not in FILEA) It's only in FILEB
                print {$OnlyInB} $_;
            }
        };
        # only in FILEA
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        for my $key (keys %_h) { # Only in FILEA
            seek $FILEA,delete $_h{$key},0;
            print {$OnlyInA} scalar <$FILEA>;
        };
        # Should be empty
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
    };
};
close $OnlyInB
or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
or die "Could NOT close 'FileA.txt' after reading! $!";
exit;
# Key for a record: its length in hex followed by its MD5 digest
sub lengthAndMD5 {
    return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
};
__END__
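The script above writes whole records to OnlyInA.txt and OnlyInB.txt, whereas the question asks for a single Diff.csv with F1/F2 prefixes. A minimal sketch of that last step follows. It assumes that the mismatched records are few enough to hold in memory and that the first field identifies a record; format_diff is a hypothetical helper, not part of the script above.

```perl
use strict;
use warnings;

# Group mismatched records by their first field and label each one with
# the file it came from (F1 or F2), as in the Diff.csv from the question.
sub format_diff {
    my ( $only_a, $only_b ) = @_;    # array refs of raw record lines
    my %by_key;
    for my $pair ( [ F1 => $only_a ], [ F2 => $only_b ] ) {
        my ( $label, $lines ) = @$pair;
        for my $line (@$lines) {
            chomp( my $record = $line );
            my ($key) = $record =~ /^([^,]*)/;
            push @{ $by_key{$key} }, "$label, $record";
        }
    }
    # Sort by key so both versions of a record end up next to each other.
    return map { @{ $by_key{$_} } } sort keys %by_key;
}

# Demo with the mismatching record from the question.
my @diff = format_diff(
    [ qq{Vinoth,12,2548.245,"140,North Street,India"\n} ],
    [ qq{Vinoth,12,2548.245,"140,North Street,USA"\n} ],
);
print "$_\n" for @diff;
```

In practice the two array refs could be filled by reading OnlyInA.txt and OnlyInB.txt back in, and the returned lines printed to Diff.csv after the header row.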