How can I detect whether a file is binary (non-text) in Python?

Question

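How can I tell whether a file is binary (non-text) in Python?

I am searching a large set of files in Python and keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than grep allows for.

In the past I would have just searched for characters greater than 0x7f, but UTF-8 and the like make that impossible on modern systems. Ideally the solution would be fast, but any solution will do.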

Gavin M. Roy · Accepted Answer

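import mimetypes
...
mime = mimetypes.guess_type(file)  # returns a (type, encoding) tuple

It is fairly easy to compile a list of binary MIME types. For example, Apache is distributed with a mime.types file. You could parse that file into two lists, binary and text, and then check whether the guessed MIME type is in your text list or your binary list.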
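As a rough sketch of that idea, the snippet below classifies a file from its name alone. It uses the standard library's built-in type map rather than a parsed Apache mime.types file, and the function name `guess_is_text` and the small allow-list are assumptions made for illustration, not part of the answer above.

```python
import mimetypes

# Hypothetical allow-list: MIME types outside "text/" that are still textual.
TEXTUAL_APPLICATION_TYPES = {"application/json", "application/xml"}


def guess_is_text(path):
    """Guess from the file name only: True, False, or None if unknown."""
    mime, encoding = mimetypes.guess_type(path)
    if mime is None:
        return None  # unknown or missing extension: the name tells us nothing
    if encoding is not None:
        return False  # e.g. "gzip" for .gz: the bytes on disk are compressed
    return mime.startswith("text/") or mime in TEXTUAL_APPLICATION_TYPES
```

Because this never opens the file, it is fast but only as reliable as the file extensions; a `None` result is a signal to fall back to a content-based check.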

jfs · Answer

Yet another method, based on the behavior of file(1):

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

skyking · Answer

If you are using Python 3 with UTF-8, this is straightforward: open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses Unicode when handling files in text mode (and bytes in binary mode), so if your encoding cannot decode arbitrary files, you are quite likely to get a UnicodeDecodeError.

Example:

try:
    with open(filename, "r") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass # Found non-text data


Jorge Orpinel · Answer

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.

    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte (bytes literal, needed on Python 3)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Note: Python does not need an "except:" clause here.
    finally:
        fin.close()

    return False

Shane C. Mason · Answer

If it helps, many binary file types begin with a magic number. See the list of file signatures.
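A minimal sketch of a magic-number check follows. The table holds only a handful of widely documented signatures, and the names `SIGNATURES`, `identify_signature` and `has_binary_signature` are made up for this example; a real list of file signatures is far longer.

```python
# A few well-known file signatures; a tiny sample, not a complete list.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\x1f\x8b": "gzip data",
}


def identify_signature(data):
    """Return a description if data starts with a known signature, else None."""
    for magic, description in SIGNATURES.items():
        if data.startswith(magic):
            return description
    return None


def has_binary_signature(path):
    """Read only the first few bytes of the file and compare them to the table."""
    with open(path, "rb") as f:
        return identify_signature(f.read(16)) is not None
```

A match says the file is a known binary format; no match says nothing either way, so this works best as a quick first filter before a slower heuristic.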

Jacob Gabrielson · Answer

Here is a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    # Use a bytes pattern, because the pipe yields bytes on Python 3.
    output = subprocess.Popen(["file", '-L', path],
                              stdout=subprocess.PIPE).stdout.read()
    return re.search(br':.* text', output) is not None

Example usage:

>>> istext('/etc/motd')
True
>>> istext('/vmlinuz')
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and of having to spawn an external process for each file.

guettli · Answer

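Use the binaryornot library (on GitHub).

It is very simple and is based on the code found in this Stack Overflow question.

You could actually write this in two lines of code. However, the package saves you from having to write those two lines and test them thoroughly against all sorts of weird file types across platforms.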

Douglas Leeder · Answer

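Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats and ignore those.

Otherwise, see what proportion of non-printable ASCII bytes the file has and take a guess from that.

You can also try decoding from UTF-8 and see whether that produces sensible output.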
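The last two suggestions can be combined into one guess. The sketch below is only one possible reading of them: the name `looks_binary`, the 30% threshold, and the set of control characters treated as text are all assumptions, not something the answer specifies.

```python
import codecs

# Control characters that commonly appear in real text files.
TEXT_CONTROLS = "\n\r\t\f\b"
PRINTABLE_ASCII = bytes(range(0x20, 0x7F)) + TEXT_CONTROLS.encode("ascii")


def looks_binary(data, threshold=0.30):
    """Guess whether a chunk of bytes is binary. The threshold is arbitrary."""
    if not data:
        return False  # treat an empty chunk as text
    if b"\x00" in data:
        return True  # NUL bytes almost never occur in 8-bit text
    try:
        # final=False tolerates a multi-byte character cut off at the chunk end.
        decoder = codecs.getincrementaldecoder("utf-8")()
        text = decoder.decode(data, final=False)
    except UnicodeDecodeError:
        # Not UTF-8: fall back to the share of bytes outside printable ASCII.
        suspicious = len(data.translate(None, PRINTABLE_ASCII))
        return suspicious / len(data) > threshold
    # Valid UTF-8: "sensible output" here means few stray control characters.
    controls = sum(
        1 for ch in text
        if (ch < " " and ch not in TEXT_CONTROLS) or ch == "\x7f"
    )
    return controls / max(len(text), 1) > threshold
```

Typical use would be `looks_binary(open(path, "rb").read(1024))`. Reading only the first block keeps the check cheap, at the cost of missing binary data that appears later in the file.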

Kamil Kisiel · Answer

If you are not on Windows, you can use Python Magic to determine the file type. You can then check whether it is a text/ MIME type.

Tom Kennedy · Answer

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False
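The FIXME is easy to reproduce. The short demonstration below (the helper names are invented for this example) encodes the same ASCII string two ways: UTF-16 stores each of these characters in two bytes, one of which is zero, so a plain NUL-byte test flags it.

```python
import codecs

sample = "plain text"
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16")  # BOM followed by two bytes per character


def contains_nul(data):
    """The NUL-byte test used by the function above."""
    return b"\0" in data


def starts_with_utf16_bom(data):
    """Check for a UTF-16 byte order mark in either byte order."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


print(contains_nul(utf8_bytes))            # False
print(contains_nul(utf16_bytes))           # True: text, yet reported as binary
print(starts_with_utf16_bom(utf16_bytes))  # True
```

Checking for a byte order mark before looking for NUL bytes is one way around the false positive, though UTF-16 files written without a BOM would still be misreported.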

roskakori · Answer

Here is a function that first checks whether the file starts with a BOM and, if not, looks for a zero byte within the initial 8192 bytes:

import codecs

#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)

def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically the check for the UTF-8 BOM is unnecessary, because for all practical purposes UTF-8 text does not contain zero bytes. But since it is a very common encoding, it is quicker to check for the BOM at the beginning than to scan all 8192 bytes for a zero.

Serhii · Answer

We can use Python itself to check whether a file is binary, because reading a binary file in text mode fails with a decoding error.

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try to open the file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True

Leonardo · Answer

I think the best solution is to use the guess_type function. It keeps a list of many MIME types, and you can also add your own. Here is the script I wrote to solve my problem:

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir

class TextFileFinder:  # hypothetical name; the original omits the class header
    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:
            print("The directory {0} could not be accessed".format(path))
            return []

    def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            mime = guess_type(files)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(files)
        return asciiFiles

As you can see from the structure of the code, it lives inside a class, but you can change almost all of it to fit your own application. It is simple to use: the getTextFiles method returns a list of all the text files in the directory you pass in the path argument.

rsaw · Answer

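I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the *nix file command looks like the best choice (I am only developing for Linux boxen). Some others have posted solutions using file, but in my opinion they are unnecessarily complicated, so here is what I came up with:

import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so quotes or spaces in the name are safe.
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output() returns bytes on Python 3, so compare against bytes.
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True

It should go without saying, but the code that calls this function should make sure the file is readable before testing it. Otherwise the file will be mistakenly detected as binary.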
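One way to make that precondition explicit is a small wrapper, sketched below. The names `check_readable_file` and `safe_is_binary` are hypothetical, and `detector` stands for whichever detection function you use, such as the one above.

```python
import os


def check_readable_file(path):
    """Raise a descriptive error unless path is a regular file we can read."""
    if not os.path.isfile(path):
        raise FileNotFoundError("not a regular file: {}".format(path))
    if not os.access(path, os.R_OK):
        raise PermissionError("cannot read: {}".format(path))


def safe_is_binary(path, detector):
    """Run detector(path) only after the readability checks have passed."""
    check_readable_file(path)
    return detector(path)
```

The check and the detection are separate steps, so a file can still disappear or change permissions in between; this reduces misreports rather than ruling them out.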

Eat at Joes · Answer

Try the currently maintained python-magic, which is not the same module as the one in @Kami Kisiel's answer. It supports all platforms, including Windows, although you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it does not use the file's extension; it inspects the contents of the file instead.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'

kenorb · Answer

Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NULL character.

Here is Perl's pp_fttext() (pp_sys.c) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
    b''.join(int2byte(i) for i in range(32, 127)) +
    b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

This code was written to run unchanged on both Python 2 and Python 3.

Source: Perl's "guess if file is text or binary" implemented in Python

kenorb · Answer

A simpler way is to check whether the file contains a NULL character (\x00) using the in operator, for example:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below:

#!/usr/bin/env python3
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

fortran · Answer

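Are you on Unix? If so, try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

Shell return values are inverted (0 means success). If grep finds "text", the command returns 0, which is a false value in Python, so isBinary is falsy for text files and nonzero (truthy) otherwise.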

Rob Truxal · Answer

On *NIX:

If you have access to the file shell command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import split

filepath = realpath('rel/or/abs/path/to/file')
# check_output() returns bytes on Python 3; lowercase the output, not the command.
assert b'ascii' in check_output(split('file {}'.format(filepath))).lower()

Or you could put that in a for loop to check every file in the current directory:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert b'ascii' in check_output(split('file {}'.format(afile))).lower()

Or for all subdirectories:

for curdir, _subdirs, filelist in os.walk('.'):
    for afile in filelist:
        path = os.path.join(curdir, afile)
        assert b'ascii' in check_output(split('file {}'.format(path))).lower()
