How can I find a multi-line string in a shell script?

Question

I want to find the string

Time series prediction with ensemble models

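I am using pdftotext "$file" - | grep "$string", where $file is the PDF file name and $string is the string above. This finds lines that contain the whole string, but it cannot find the string when it is split across lines, like:

Time series prediction with
ensemble models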

How can I solve this? I am new to Linux, so a detailed explanation would be welcome.

Jacob Vlijm · Answer

With Python, you can do a lot...

Looking at it again later, it can probably be optimized, but in my tests the script below does the job.

Tested on a file:

Monkey eats banana since he ran out of peanuts Monkey eats banana since he ran out of peanuts really, Monkey eats banana since he ran out of peanuts A lot of useless text here… Have to add some lines for the sake of the test. Monkey eats banana since he ran out of peanuts

Searching for the string "Monkey eats banana since he ran out of peanuts" gives the following output:

Found matches
--------------------
[line 1]
Monkey eats banana since he ran out of peanuts
[line 2]
Monkey eats banana since he ran out of peanuts
[line 5]
Monkey eats banana since he ran out of peanuts
[line 9]
Monkey eats banana since he ran out of peanuts

The script

#!/usr/bin/env python3
import os
import subprocess
import sys

f = sys.argv[1]
string = sys.argv[2]

# convert to .txt with pdftotext, as in your own command
txt = os.path.splitext(f)[0] + ".txt"
subprocess.call(["pdftotext", f, txt])

# read the converted file
with open(txt) as src:
    text = src.read()

# replace newlines by spaces for searching / define the length of the searched string
subtext = text.replace("\n", " ")
size = len(string)

# in a while loop, find the matching string and set the last found index
# as a start for the next match
matches = []
start = 0
while True:
    match = subtext.find(string, start)
    if match == -1:
        break
    matches.append(match)
    start = match + 1

print("Found matches\n" + 20 * "-")
for m in matches:
    # print the found matches, with the (possibly) original \n instead of the inserted spaces
    print("[line " + str(text[:m].count("\n") + 1) + "]\n" + text[m:m + size].strip())

To use it:

Copy the script into an empty file and save it as search_pdf.py.

Run it with the command:

python3 /path/to/search_pdf.py /path/to/file.pdf string_to_look_for

Needless to say, if the paths or the search string contain spaces, you need to use quotes:

python3 '/path to/search_pdf.py' '/path to/file.pdf' 'string to look for'
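The core idea of the script can be tried without a PDF at hand. Replacing each newline with a single space keeps the text the same length, so an index found in the flattened copy points at the same position in the original text, and counting the newlines before that index gives the line number. Below is a minimal sketch of just that step; the function name find_across_lines and the sample text are made up for illustration and are not part of the script above.

```python
def find_across_lines(text, needle):
    """Return (line_number, matched_text) for every match of needle,
    treating each newline in text as a single space."""
    # Same length as text, so indices in flat are valid indices in text.
    flat = text.replace("\n", " ")
    hits = []
    start = flat.find(needle)
    while start != -1:
        # Line number = newlines before the match start, plus one.
        line_no = text.count("\n", 0, start) + 1
        hits.append((line_no, text[start:start + len(needle)]))
        start = flat.find(needle, start + 1)
    return hits


# Hypothetical sample text, only for demonstration.
sample = "Monkey eats banana\nsince he ran out of peanuts\nno match here\n"
for line_no, matched in find_across_lines(sample, "banana since he"):
    print(line_no, repr(matched))
```

Note that this only tolerates a single newline where the search string has a single space; runs of extra spaces or blank lines inside the phrase would still prevent a match.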

terdon · Answer

Another approach, suggested in steeldriver's comment, is to replace all newlines with spaces, turning the output of pdftotext into one long line, and then search that:

string="Time series prediction with ensemble models"
pdftotext "$file" - | tr '\n' ' ' | grep -o "$string"

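I added -o so that grep prints only the matching part of the line. Without it, the entire content of the file would be printed, since it is now a single line.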

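Another approach is to use grep's -z switch, which tells it to use \0 instead of \n to define lines. This means the whole input is treated as a single "line", and you can match across newlines using Perl-compatible or extended regular expressions: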

$ printf 'foo\nbar\nbaz\n' | grep -oPz 'foo\nbar'
foo
bar

However, this does not help unless you know in advance how the string is split across lines.
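One way around that limitation, sketched here in Python rather than grep, is to build a regular expression from the search phrase in which every gap between words matches any run of whitespace, so it no longer matters where the line breaks fall. This is only a sketch under assumptions: the helper name whitespace_tolerant_pattern and the sample text are hypothetical, and real pdftotext output may contain other surprises (hyphenation, ligatures) that it does not handle.

```python
import re


def whitespace_tolerant_pattern(phrase):
    """Compile a regex matching the words of phrase separated by any
    whitespace (spaces, tabs, newlines), in the same order."""
    words = phrase.split()
    # re.escape keeps characters such as "." or "(" literal.
    return re.compile(r"\s+".join(re.escape(word) for word in words))


# Hypothetical text standing in for the output of pdftotext.
text = "Intro\nTime series prediction with\nensemble models\nOutro\n"
pattern = whitespace_tolerant_pattern("Time series prediction with ensemble models")

for m in pattern.finditer(text):
    # Line on which the match starts, counted from 1.
    line_no = text.count("\n", 0, m.start()) + 1
    # Collapse the matched whitespace so the hit prints on one line.
    print(line_no, " ".join(m.group().split()))
```

To run it on a real file, the text could be read from the .txt file that pdftotext writes instead of from the sample string.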