PowerShell: get the number of lines in a big (large) file

Question

One way to get the number of lines in a file is this method in PowerShell:

PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a=Get-Content .\sub.ps1
PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a.count
34
PS C:\Users\Pranav\Desktop\PS_Test_Scripts>

However, when I have a large 800 MB text file, how do I get the number of lines in it without reading the whole file?

The method above consumes too much RAM, so the script either crashes or takes too long to finish.

Akim · Answer

Use Get-Content -Read $nLinesAtTime to read your file part by part (-Read is an abbreviation of -ReadCount):

$nlines = 0;

# Read file by 1000 lines at a time
gc $YOURFILE -read 1000 | % { $nlines += $_.Length };
[string]::Format("{0} has {1} lines", $YOURFILE, $nlines)

And here is a simple but slow script to validate the result on a small file:

gc $YOURFILE | Measure-Object -Line
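The chunked read above can be wrapped in a small function so that the chunk size becomes a parameter. This is a minimal sketch, not part of the original answer: the function name Get-LineCountChunked and its parameter names are made up for illustration.

```powershell
# Hypothetical helper (not a built-in cmdlet): counts lines by reading the
# file in chunks of $ChunkSize lines instead of loading it all at once.
function Get-LineCountChunked {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $ChunkSize = 1000
    )

    $total = 0
    # With -ReadCount greater than 1, each pipeline object is an array of lines.
    # @($_) also covers -ReadCount 1, where each object is a single string.
    Get-Content -Path $Path -ReadCount $ChunkSize | ForEach-Object {
        $total += @($_).Count
    }
    return $total
}
```

It could be called as Get-LineCountChunked -Path $YOURFILE -ChunkSize 1000. A larger chunk size generally trades more memory for fewer pipeline iterations.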

Pseudothink · Answer

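Here is a PowerShell script I put together that demonstrates a few different methods of counting lines in text files, along with the time and memory each method requires. The results below show clear differences in the time and memory requirements. In my tests, the sweet spot appeared to be Get-Content with a ReadCount setting of 100. The other tests required significantly more time and/or memory.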

#$testFile = 'C:\test_small.csv' # 245 lines, 150 KB
#$testFile = 'C:\test_medium.csv' # 95,365 lines, 104 MB
$testFile = 'C:\test_large.csv' # 285,776 lines, 308 MB

# Using ArrayList just because they are faster than Powershell arrays, for some operations with large arrays.
$results = New-Object System.Collections.ArrayList

function AddResult {
    param( [string] $sMethod, [string] $iCount )
    $result = New-Object -TypeName PSObject -Property @{
        "Method" = $sMethod
        "Count" = $iCount
        "Elapsed Time" = ((Get-Date) - $dtStart)
        "Memory Total" = [System.Math]::Round((GetMemoryUsage)/1mb, 1)
        "Memory Delta" = [System.Math]::Round(((GetMemoryUsage) - $dMemStart)/1mb, 1)
    }
    [void]$results.Add($result)
    Write-Output "$sMethod : $count"
    [System.GC]::Collect()
}

function GetMemoryUsage {
    # return ((Get-Process -Id $pid).PrivateMemorySize)
    return ([System.GC]::GetTotalMemory($false))
}

# Get-Content -ReadCount 1
[System.GC]::Collect()
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
Get-Content -Path $testFile -ReadCount 1 |% { $count++ }
AddResult "Get-Content -ReadCount 1" $count

# Get-Content -ReadCount 10,100,1000,0
# Note: ReadCount = 1 returns a string. Any other value returns an array of strings.
# Thus, the Count property only applies when ReadCount is not 1.
@(10,100,1000,0) |% {
    $dMemStart = GetMemoryUsage
    $dtStart = Get-Date
    $count = 0
    Get-Content -Path $testFile -ReadCount $_ |% { $count += $_.Count }
    AddResult "Get-Content -ReadCount $_" $count
}

# Get-Content | Measure-Object
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = (Get-Content -Path $testFile -ReadCount 1 | Measure-Object -line).Lines
AddResult "Get-Content -ReadCount 1 | Measure-Object" $count

# Get-Content.Count
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = (Get-Content -Path $testFile -ReadCount 1).Count
AddResult "Get-Content.Count" $count

# StreamReader.ReadLine
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
# Use this constructor to avoid file access errors, like Get-Content does.
$stream = New-Object -TypeName System.IO.FileStream(
    $testFile,
    [System.IO.FileMode]::Open,
    [System.IO.FileAccess]::Read,
    [System.IO.FileShare]::ReadWrite)
if ($stream) {
    $reader = New-Object IO.StreamReader $stream
    if ($reader) {
        while(-not ($reader.EndOfStream)) { [void]$reader.ReadLine(); $count++ }
        $reader.Close()
    }
    $stream.Close()
}
AddResult "StreamReader.ReadLine" $count

$results | Select Method, Count, "Elapsed Time", "Memory Total", "Memory Delta" | ft -auto | Write-Output

Here are the results for a text file containing ~95k lines, 104 MB (the memory columns are in MB, as computed by the script):

Method                                    Count  Elapsed Time     Memory Total Memory Delta
------                                    -----  ------------     ------------ ------------
Get-Content -ReadCount 1                  95365  00:00:11.1451841         45.8          0.2
Get-Content -ReadCount 10                 95365  00:00:02.9015023         47.3          1.7
Get-Content -ReadCount 100                95365  00:00:01.4522507         59.9         14.3
Get-Content -ReadCount 1000               95365  00:00:01.1539634         75.4         29.7
Get-Content -ReadCount 0                  95365  00:00:01.3888746          346        300.4
Get-Content -ReadCount 1 | Measure-Object 95365  00:00:08.6867159         46.2          0.6
Get-Content.Count                         95365  00:00:03.0574433        465.8        420.1
StreamReader.ReadLine                     95365  00:00:02.5740262         46.2          0.6

Here are the results for a larger file (containing ~285k lines, 308 MB):

Method                                    Count  Elapsed Time     Memory Total Memory Delta
------                                    -----  ------------     ------------ ------------
Get-Content -ReadCount 1                  285776 00:00:36.2280995         46.3          0.8
Get-Content -ReadCount 10                 285776 00:00:06.3486006         46.3          0.7
Get-Content -ReadCount 100                285776 00:00:03.1590055         55.1          9.5
Get-Content -ReadCount 1000               285776 00:00:02.8381262         88.1         42.4
Get-Content -ReadCount 0                  285776 00:00:29.4240734        894.5        848.8
Get-Content -ReadCount 1 | Measure-Object 285776 00:00:32.7905971         46.5          0.9
Get-Content.Count                         285776 00:00:28.4504388       1219.8       1174.2
StreamReader.ReadLine                     285776 00:00:20.4495721           46          0.4
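To repeat a single measurement on your own file without the whole harness, something like the following could be used. It is only a sketch: the function name Measure-LineCount and its parameters are invented here, and it uses a Stopwatch for timing rather than the Get-Date subtraction from the script above. It does not measure memory.

```powershell
# Hypothetical helper (not from the original script): times one
# Get-Content -ReadCount run and reports the line count it found.
function Measure-LineCount {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $ReadCount = 100
    )

    $count = 0
    $watch = [System.Diagnostics.Stopwatch]::StartNew()
    # @($_) handles both a single string (ReadCount 1) and an array of strings.
    Get-Content -Path $Path -ReadCount $ReadCount | ForEach-Object {
        $count += @($_).Count
    }
    $watch.Stop()

    New-Object -TypeName PSObject -Property @{
        ReadCount = $ReadCount
        Count     = $count
        Elapsed   = $watch.Elapsed
    }
}
```

Calling it in a loop, for example 10, 100, 1000 | ForEach-Object { Measure-LineCount -Path $testFile -ReadCount $_ }, gives a quick comparison for a specific file. Timings will vary with disk caching, so the first run is often slower than later ones.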

latkin · Answer

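The first thing to try is to stream Get-Content and build up the line count one line at a time, rather than storing all the lines in an array at once. I think this gives proper streaming behavior: the entire file is never in memory at once, only the current line.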

$lines = 0
Get-Content .\File.txt |%{ $lines++ }

And as the other answer suggests, adding -ReadCount could speed this up.

If that doesn't work for you (too slow or too much memory), you could go directly to a StreamReader:

$count = 0
$reader = New-Object IO.StreamReader 'c:\logs\MyLog.txt'
while($reader.ReadLine() -ne $null){ $count++ }
$reader.Close()  # Don't forget to do this. Ideally put this in a try/finally block to make sure it happens.
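The try/finally arrangement mentioned in that comment might look like the sketch below. The function name Get-LineCountStreamReader is hypothetical, and $Path is assumed to be a full path, because .NET resolves relative paths against the process working directory rather than the current PowerShell location.

```powershell
# Hypothetical helper: counts lines with a StreamReader and always releases
# the file handle, even if reading fails partway through.
function Get-LineCountStreamReader {
    param(
        # Assumed to be a full path (see the note above).
        [Parameter(Mandatory = $true)] [string] $Path
    )

    $count = 0
    $reader = New-Object System.IO.StreamReader $Path
    try {
        # ReadLine returns $null only at end of file; blank lines come back
        # as empty strings and are therefore still counted.
        while ($null -ne $reader.ReadLine()) { $count++ }
    }
    finally {
        $reader.Dispose()
    }
    return $count
}
```

Putting $null on the left of the comparison is a common PowerShell habit, since a collection on the left side would turn -ne into a filter instead of a test.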

greenjaed · Answer

Here's another solution that uses .NET:

[Linq.Enumerable]::Count([System.IO.File]::ReadLines("FileToCount.txt"))

It's not very interruptible, but it's very easy on memory.
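One thing to watch with the one-liner: .NET methods resolve a relative path such as "FileToCount.txt" against the process working directory, which does not necessarily match the PowerShell location after Set-Location. A minimal sketch that resolves the path first, assuming PowerShell 3.0 or later; the function name Get-LineCountReadLines is made up for illustration:

```powershell
# Hypothetical helper: resolves the path in PowerShell, then lets
# File.ReadLines enumerate the file lazily while Enumerable.Count tallies it.
function Get-LineCountReadLines {
    param(
        [Parameter(Mandatory = $true)] [string] $Path
    )

    # ProviderPath is the plain file system path as a string.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    return [System.Linq.Enumerable]::Count([System.IO.File]::ReadLines($fullPath))
}
```

Because ReadLines yields one line at a time, only a small buffer is held in memory regardless of file size.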

user2176024 · Answer

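Here is something I wrote to try to reduce memory usage when stripping the white space from my txt file. That said, memory usage still gets fairly high, but the process takes less time to run.

To give some background on my file: it had over 2 million records, with white space at both the start and end of each line. I believe the total time was 5+ minutes.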

$testing = 'C:\Users\something\something\test3.txt'

$filecleanup = Get-ChildItem $testing

foreach ($file in $filecleanup)
{
    $file1 = Get-Content $file -readcount 1000 | foreach{$_.Trim()}
    $file1 > $filecleanup
}