Extracting fields from forms with varying structures

Question

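I'm trying to extract specific fields from a balance sheet. For example, I'd like to be able to tell that the value of "Inventory" in the following balance sheet is 1,277,838.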

Balance sheet

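I'm currently using Tesseract to convert the image to text. However, this conversion produces a stream of text, which makes it hard to associate a field with its value (the values are not always right next to the corresponding field's text).

After some searching, I found that Tesseract can use uzn files to read from zones of an image. However, the specific zones of the balance sheet values can shift from form to form, so I'm interested in any solution that can determine that "Inventory" and 1,277,838 are on the same line. Ideally, I'd like a grid-structured output of the text, so that I can spatially identify which chunks of text are in the same row or column.

Could anyone help explain how I can achieve this result?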

gaw89 · Accepted Answer

I've been doing a similar task using Tesseract and Python (the pytesseract library). I use Tesseract's .hocr output files (https://en.wikipedia.org/wiki/HOCR) to find the location of my search term (e.g. "Inventory") on the page, and then re-run Tesseract on a small section of the page, which gives higher accuracy for that area. Here's the code I use to parse the hOCR output from Tesseract:

import difflib

import bs4 as bs


def parse_hocr(search_terms=None, hocr_file=None, regex=None):
    """Parse the hocr file and find a reasonable bounding box for each of the
    strings in search_terms. Return a dictionary with values as the bounding
    box to be used for extracting the appropriate text.

    inputs:
        search_terms = Tuple, A tuple of search terms to look for in the HOCR file.

    outputs:
        box_dict = Dictionary, A dictionary whose keys are the elements of
        search_terms and values are the bounding boxes where those terms are
        located in the document.
    """
    # Make sure the search terms provided are a tuple.
    if not isinstance(search_terms, tuple):
        raise ValueError('The search_terms parameter must be a tuple')

    # Make sure we got an HOCR file (path) when called.
    if not hocr_file:
        raise ValueError('The parser must be provided with an HOCR file handle.')

    # Open the hocr file, read it into BeautifulSoup and extract all the ocr words.
    with open(hocr_file, 'r') as f:
        hocr = f.read()
    soup = bs.BeautifulSoup(hocr, 'html.parser')
    words = soup.find_all('span', class_='ocrx_word')

    result = dict()

    # Loop through all the words and look for our search terms.
    for word in words:
        w = word.get_text().lower()
        for s in search_terms:
            # If the word is (approximately) one of our search terms, find the bounding box.
            if len(w) > 1 and difflib.SequenceMatcher(None, s, w).ratio() > .5:
                # The title looks like "bbox 23 183 112 204; x_wconf 93".
                bbox = word['title'].split(';')
                bbox = bbox[0].split(' ')
                bbox = tuple([int(x) for x in bbox[1:]])

                # Keep only the first match for each search term.
                if s not in result.keys():
                    result.update({s: bbox})
    return result

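This lets me search the hOCR file for the appropriate terms and return the bounding box of that particular word. I then expand the bounding box slightly and run Tesseract on a very small subset of the page. This is much more accurate than just OCRing the whole page. Obviously, some of this code is specific to my use, but it should give you a place to start.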
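The "expand the bounding box and re-run Tesseract" step could look roughly like the sketch below. It is not the code from this answer: `expand_bbox`, `ocr_region`, the `margin` default and the `to_right_edge` option are hypothetical names chosen for illustration, and it assumes Pillow and pytesseract are installed.

```python
def expand_bbox(bbox, image_size, margin=10, to_right_edge=False):
    """Grow an hOCR bbox (x0, y0, x1, y1) by `margin` pixels, clamped to the image.

    With to_right_edge=True the box is stretched to the right border, so that
    values printed on the same row as a label end up inside the crop.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    right = width if to_right_edge else min(width, x1 + margin)
    return (max(0, x0 - margin), max(0, y0 - margin), right, min(height, y1 + margin))


def ocr_region(image_path, bbox, margin=10, to_right_edge=True, config=""):
    """Crop the expanded region out of the page image and OCR only that crop."""
    # Imported lazily so the geometry helper above works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        box = expand_bbox(bbox, image.size, margin, to_right_edge)
        return pytesseract.image_to_string(image.crop(box), config=config)
```

For example, `ocr_region("4LV05.png", result["inventory"])` would OCR the strip of the page to the right of the label found by `parse_hocr`, assuming the values sit on the same row as the label.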

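This page is very helpful for finding the appropriate arguments to give to Tesseract. I found the page segmentation mode to be very important for getting accurate results on small sections of an image.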
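As a rough illustration of the page segmentation point, the sketch below builds the config string that pytesseract passes through to Tesseract. The helper `psm_config` and the selection of modes are assumptions made for this example; the mode descriptions paraphrase Tesseract's own help text, and older 3.x releases spell the flag `-psm` rather than `--psm`.

```python
# A few of Tesseract's page segmentation modes that tend to matter for small crops.
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD (the default)",
    6: "a single uniform block of text",
    7: "a single text line",
    8: "a single word",
    11: "sparse text, in no particular order",
}


def psm_config(mode):
    """Return the config string for one of the modes listed above."""
    if mode not in PSM_MODES:
        raise ValueError(f"unsupported page segmentation mode: {mode}")
    return f"--psm {mode}"
```

A crop that contains one label and its values on a single row might then be read with `pytesseract.image_to_string(crop, config=psm_config(7))`; which mode works best is something to try out per document.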

zuphilip · Answer

As gaw89 already mentioned, Tesseract can output more information than just the text as a stream. The hocr file format also gives you the position (bounding box) of each paragraph, line, and word.

$ tesseract 4LV05.png out -l eng hocr

Then you can, for example, find the bounding box of the word "Inventory" simply with

$ grep 'Inventory' out.hocr
<span class='ocr_line' id='line_1_5' title="bbox 23 183 112 204; baseline 0 -5; x_size 21; x_descenders 5; x_ascenders 4"><span class='ocrx_word' id='word_1_15' title='bbox 23 183 112 204; x_wconf 93'>Inventory</span>

So the bounding box of this word spans vertically from 183 to 204, and to find the value that belongs to this label we have to look for boxes in the same vertical space. This can be achieved, for example, with

$ grep 'bbox [0-9]* 18[0-9]' out.hocr
<p class='ocr_par' id='par_1_4' lang='eng' title="bbox 23 183 112 204">
<span class='ocr_line' id='line_1_5' title="bbox 23 183 112 204; baseline 0 -5; x_size 21; x_descenders 5; x_ascenders 4"><span class='ocrx_word' id='word_1_15' title='bbox 23 183 112 204; x_wconf 93'>Inventory</span>
<span class='ocr_line' id='line_1_30' title="bbox 1082 183 1178 202; baseline 0 -3; x_size 22; x_descenders 5.5; x_ascenders 5.5"><span class='ocrx_word' id='word_1_82' title='bbox 1082 183 1178 202; x_wconf 93'>1,277,838</span>
<span class='ocr_line' id='line_1_54' title="bbox 1301 183 1379 202; baseline 0 -3; x_size 22; x_descenders 5.5; x_ascenders 5.5"><span class='ocrx_word' id='word_1_107' title='bbox 1301 183 1379 202; x_wconf 95'>953,675</span>

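The match right after the Inventory label, 1,277,838, is the target value. You can compare the bbox coordinates to be sure you extract the first column: the vertical coordinates confirm the row, and the horizontal ones tell the two value columns apart.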
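The same row lookup can also be done in Python instead of grep. The following is only a sketch: it uses the standard library's `html.parser` rather than BeautifulSoup, `WordBoxParser` and `values_on_same_row` are hypothetical names, and the "Prepaid" row in the sample is made up to show that other rows are ignored.

```python
from html.parser import HTMLParser


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            # The title looks like "bbox 23 183 112 204; x_wconf 93".
            coords = attrs["title"].split(";")[0].split()[1:5]
            self._bbox = tuple(int(c) for c in coords)

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def values_on_same_row(hocr_text, label):
    """Return the words that vertically overlap `label`, ordered left to right."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    matches = [i for i, (text, _) in enumerate(parser.words)
               if text.lower() == label.lower()]
    if not matches:
        raise ValueError(f"label {label!r} not found")
    label_index = matches[0]
    _, top, _, bottom = parser.words[label_index][1]
    same_row = []
    for i, (text, (x0, y0, x1, y1)) in enumerate(parser.words):
        overlap = min(bottom, y1) - max(top, y0)  # shared vertical extent in pixels
        if i != label_index and overlap > 0:
            same_row.append((x0, text))
    return [text for _, text in sorted(same_row)]


SAMPLE_HOCR = """
<span class='ocr_line'><span class='ocrx_word' title='bbox 23 183 112 204; x_wconf 93'>Inventory</span></span>
<span class='ocr_line'><span class='ocrx_word' title='bbox 1082 183 1178 202; x_wconf 93'>1,277,838</span></span>
<span class='ocr_line'><span class='ocrx_word' title='bbox 1301 183 1379 202; x_wconf 95'>953,675</span></span>
<span class='ocr_line'><span class='ocrx_word' title='bbox 23 215 100 236; x_wconf 90'>Prepaid</span></span>
"""

print(values_on_same_row(SAMPLE_HOCR, "Inventory"))  # ['1,277,838', '953,675']
```

The first element of the returned list is the value in the leftmost column, since the words are sorted by their left x coordinate.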

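For this example the grep command is sufficient, but there are other ways to do the same thing. Also, depending on how skewed the page is, the regular expression may need to be replaced by some other calculation.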
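One such "other calculation" is to group words into rows by the vertical centre of their bounding boxes, with a tolerance for slight skew, which also yields the grid-like output asked for in the question. This is a minimal sketch under assumptions: `group_into_rows` and the 10-pixel tolerance are hypothetical, the input is a list of `(text, bbox)` pairs already extracted from the hOCR file, and the second row of sample data is invented.

```python
def group_into_rows(words, tolerance=10):
    """Cluster (text, (x0, y0, x1, y1)) pairs into rows, each sorted left to right.

    A word joins the current row when its vertical centre lies within
    `tolerance` pixels of the centre of the first word placed in that row.
    """
    rows = []
    by_centre = sorted(words, key=lambda w: (w[1][1] + w[1][3]) / 2)
    for text, (x0, y0, x1, y1) in by_centre:
        centre = (y0 + y1) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance:
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"centre": centre, "cells": [(x0, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


WORDS = [
    ("Inventory", (23, 183, 112, 204)),
    ("1,277,838", (1082, 183, 1178, 202)),
    ("953,675", (1301, 183, 1379, 202)),
    # Invented second row, shifted down a few pixels on the right to mimic skew.
    ("Example", (23, 215, 100, 236)),
    ("42", (1082, 218, 1178, 238)),
]

for row in group_into_rows(WORDS):
    print(row)
```

For strongly skewed scans a fixed tolerance will not be enough, and deskewing the image before OCR is probably the more robust route.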

As an alternative, you could try the open-source tool Tabula to extract tabular data from PDFs.