What method does unzip use to find a single file in an archive?

Question

Say I create 100 files, each containing 30MB of random text data. I then create a zip archive with no compression, i.e. `zip dataset.zip -r -0 *.txt`. Now I want to extract just one file from this archive.

As described here, there are two ways of unzipping/extracting a file from the archive:

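1. Seek to the end of the file and look up the central directory, then use it for fast random access to the file to be extracted. (Amortized O(1) complexity)
2. Look through each local header and extract the one whose header matches. (O(n) complexity)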
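
To make the second approach concrete, here is a minimal Python sketch of a linear walk over local file headers. It is not how unzip is implemented; the function names and the demo archive are made up for illustration, and it assumes a plain archive (no ZIP64, no data descriptors, nothing prepended to the file).

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4s5H3I2H")
LOCAL_SIG = b"PK\x03\x04"


def find_by_local_headers(fp, wanted):
    """Walk local headers from offset 0.

    Returns (header_offset, headers_visited); header_offset is None
    when the name is not found.
    """
    offset, visited = 0, 0
    while True:
        fp.seek(offset)
        raw = fp.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size or raw[:4] != LOCAL_SIG:
            # Reached the central directory or the end of the file.
            return None, visited
        (_sig, _ver, flags, _method, _time, _date,
         _crc, csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack(raw)
        if flags & 0x08:
            # Sizes live in a trailing data descriptor; this sketch gives up.
            raise ValueError("data descriptor in use")
        visited += 1
        name = fp.read(name_len).decode("utf-8", "replace")
        if name == wanted:
            return offset, visited
        # Skip header, name, extra field and the member data itself.
        offset += LOCAL_HEADER.size + name_len + extra_len + csize


def build_demo_archive(count=5, size=1000):
    """Hypothetical helper: a small stored (uncompressed) archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(count):
            zf.writestr(f"Rand-{i}.txt", bytes([65 + i]) * size)
    return buf


if __name__ == "__main__":
    print(find_by_local_headers(build_demo_archive(), "Rand-3.txt"))
```

The number of headers visited grows with the position of the wanted member, which is where the O(n) cost comes from.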

Which method does unzip use? From my experiments, it seems to use method 2?

Stephen Kitt · Accepted Answer

When searching for a single file in a large archive, it uses method 1, which you can verify using strace:

    open("dataset.zip", O_RDONLY) = 3
    ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
    write(1, "Archive:  dataset.zip\n", 22Archive:  dataset.zip
    ) = 22
    lseek(3, 943718400, SEEK_SET) = 943718400
    read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 4522) = 4522
    lseek(3, 943722880, SEEK_SET) = 943722880
    read(3, "\3\f\225P\ux\v\0\1\4\350\3\0\0\4\350\3\0\0", 20) = 20
    lseek(3, 943718400, SEEK_SET) = 943718400
    read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 8192) = 4522
    lseek(3, 849346560, SEEK_SET) = 849346560
    read(3, "D\262nv\210\343\240C\24\227\344\367q\300\223\231\306\330\275\266\213\276M\7I'&35\2\234J"..., 8192) = 8192
    stat("Rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    lstat("Rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    stat("Rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    lstat("Rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    open("Rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
    ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
    write(1, " extracting: Rand-28.txt "..., 37 extracting: Rand-28.txt ) = 37
    read(3, "\275\3279Y\206\223\217}\355W%:\220YNT\0\257\260z^\361T\242\2\370\21\336\372+\306\310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (Rand-28.txt, at offset 849346560), and reads from there.
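
The same pattern (read the directory at the end, then jump straight to the member) can be sketched in Python with the standard `zipfile` module, whose `ZipInfo.header_offset` plays the role of the offset seen in the trace. This is an illustration rather than unzip's code; `read_stored_member` and the demo archive are hypothetical, and the sketch only handles stored (uncompressed) members.

```python
import io
import struct
import zipfile


def read_stored_member(fp, name):
    """Return (header_offset, data) for an uncompressed member.

    The central directory supplies the offset, so no other member
    is touched on the way.
    """
    with zipfile.ZipFile(fp) as zf:  # parses the central directory at the end
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("this sketch only handles stored members")
    # Bytes 26..29 of the local header hold the name and extra-field lengths.
    fp.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<2H", fp.read(4))
    # Member data starts right after the 30-byte header, name and extra field.
    fp.seek(info.header_offset + 30 + name_len + extra_len)
    return info.header_offset, fp.read(info.file_size)


if __name__ == "__main__":
    # Hypothetical demo archive; the names only mimic the ones in the trace.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(100):
            zf.writestr(f"Rand-{i}.txt", f"payload {i}\n" * 100)
    offset, data = read_stored_member(buf, "Rand-28.txt")
    print(offset, len(data))
```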

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

    /*---------------------------------------------------------------------------
        Find and process the end-of-central-directory header.  UnZip need only
        check last 65557 bytes of zipfile:  comment may be up to 65535,
        end-of-central-directory record is 18 bytes, and signature itself is
        4 bytes; add some to allow for appended garbage.  Since ZipInfo is
        often used as a debugging tool, search the whole zipfile if
        zipinfo_mode is true.
      ---------------------------------------------------------------------------*/
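
As a rough illustration of that scan, the following Python sketch reads the last 65557 bytes and searches backwards for the end-of-central-directory signature. It is a simplification under stated assumptions: `find_central_directory` is a made-up name, ZIP64 archives are not handled, and a signature that happens to occur inside the archive comment would fool it.

```python
import io
import struct
import zipfile

# End-of-central-directory record: 4-byte signature plus 18 bytes of fields.
EOCD = struct.Struct("<4s4H2IH")
EOCD_SIG = b"PK\x05\x06"
TAIL = 65557  # figure quoted in the UnZip source comment


def find_central_directory(fp):
    """Return (cd_offset, cd_size, total_entries) read from the EOCD record."""
    fp.seek(0, io.SEEK_END)
    size = fp.tell()
    fp.seek(max(0, size - TAIL))
    tail = fp.read()
    pos = tail.rfind(EOCD_SIG)  # last occurrence in the tail
    if pos < 0 or pos + EOCD.size > len(tail):
        raise ValueError("no end-of-central-directory record found")
    (_sig, _disk, _cd_disk, _entries_here, total_entries,
     cd_size, cd_offset, _comment_len) = EOCD.unpack_from(tail, pos)
    return cd_offset, cd_size, total_entries


if __name__ == "__main__":
    # Hypothetical demo archive built in memory.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"hello")
        zf.writestr("b.txt", b"world")
    print(find_central_directory(buf))
```

Only the tail of the file is read, so the cost of locating the directory does not depend on how large the archive is.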

Thomas Dickey · Answer

Actually it's a mixture. unzip reads some data from known locations, and then reads blocks of data that are related to (but not identical to) the target entry in the zip file.

The design of zip/unzip is explained in comments in the source files. Here is the relevant one from extract.c:

    /*---------------------------------------------------------------------------
        The basic idea of this function is as follows.  Since the central
        directory lies at the end of the zipfile and the member files lie at
        the beginning or middle or wherever, it is not very desirable to
        simply read a central directory entry, jump to the member and extract
        it, and then jump back to the central directory.  In the case of a
        large zipfile this would lead to a whole lot of disk-grinding,
        especially if each member file is small.  Instead, we read from the
        central directory the pertinent information for a block of files,
        then go extract/test the whole block.  Thus this routine contains two
        small(er) loops within a very large outer loop:  the first of the
        small ones reads a block of files from the central directory; the
        second extracts or tests each file; and the outer one loops over
        blocks.  There's some file-pointer positioning stuff in between, but
        that's about it.  Btw, it's because of this jumping around that we
        can afford to be lenient if an error occurs in one of the member
        files:  we should still be able to go find the other members, since
        we know the offset of each from the beginning of the zipfile.
      ---------------------------------------------------------------------------*/

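The format itself is mostly derived from PK-Ware's implementation, and is summarized in the programming information text-files. According to those, there are multiple types of records in the central directory as well, so unzip cannot simply go to the end of the file and build an array of entries in which to look up the target file.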

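If you take the time to read the source code, you'll see that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use single-file extraction for a fairly large zip file (I had Java sources in mind), but even for a smaller zip file you can see the effect of the buffer size. To demonstrate, I zipped up the Git files for PuTTY, giving 2727 files (counting a copy of the git log). Java was larger than that 20 years ago, and has not shrunk. Extracting that log from the zip file (chosen because it is not at the end of an alphabetically sorted index, and likely not in the first block read from the central directory) gave this from strace for the lseek calls: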

    lseek(3, -2252, SEEK_CUR) = 1267
    lseek(3, 120463360, SEEK_SET) = 120463360
    lseek(3, 120468731, SEEK_SET) = 120468731
    lseek(3, 120135680, SEEK_SET) = 120135680
    lseek(3, 270336, SEEK_SET) = 270336
    lseek(3, 120463360, SEEK_SET) = 120463360
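
One way to see the buffer size in these numbers is to divide each SEEK_SET target by 8192. The short Python sketch below does only that arithmetic on the offsets from the trace; it says nothing about how unzip computes them, and `split_offset` is a name made up for this example.

```python
INBUFSIZ = 8192  # buffer size named in the answer

# SEEK_SET targets copied from the strace output above.
SEEK_SET_OFFSETS = [120463360, 120468731, 120135680, 270336]


def split_offset(offset, bufsize=INBUFSIZ):
    """Split an offset into (whole buffers, leftover bytes)."""
    return divmod(offset, bufsize)


if __name__ == "__main__":
    for off in SEEK_SET_OFFSETS:
        blocks, rest = split_offset(off)
        print(f"{off} = {blocks} * {INBUFSIZ} + {rest}")
```

Three of the four targets are exact multiples of 8192, which is consistent with reads being issued in whole buffers; 120468731 is the exception, lying 5371 bytes past such a boundary.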

As usual with benchmarks, ymmv.