Javaを使用して一連のバイナリファイルで検索する必要があるバイトシーケンスがあります。
例:バイナリファイルでバイトシーケンスDEADBEEF
(16進数)を検索しています。これをJavaで実行するにはどうすればよいですか?バイナリファイルのString.contains()
などの組み込みメソッドはありますか?
いいえ、そのための組み込みの方法はありません。しかし、 [〜#〜] here [〜#〜] から直接コピーします(元のコードに2つの修正が適用されています):
/**
* Knuth-Morris-Pratt Algorithm for Pattern Matching
*/
class KMPMatch {
/**
* Finds the first occurrence of the pattern in the text.
*/
public static int indexOf(byte[] data, byte[] pattern) {
if (data.length == 0) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) { j++; }
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j > 0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
private int bytesIndexOf(byte[] source, byte[] search, int fromIndex) {
boolean find = false;
int i;
for (i = fromIndex; i < (source.length - search.length); i++) {
if (source[i] == search[0]) {
find = true;
for (int j = 0; j < search.length; j++) {
if (source[i + j] != search[j]) {
find = false;
}
}
}
if (find) {
break;
}
}
if (!find) {
return -1;
}
return i;
}
あなたはbigdocを使用してギガバイトオーダーファイルからバイトのシーケンスを見つけることができます。
以下のGithubのLibと例: https://github.com/riversun/bigdoc
package org.example;
import Java.io.File;
import Java.util.List;
import org.riversun.bigdoc.bin.BigFileSearcher;
public class Example {
public static void main(String[] args) throws Exception {
byte[] searchBytes = "hello world.".getBytes("UTF-8");
File file = new File("/var/tmp/yourBigfile.bin");
BigFileSearcher searcher = new BigFileSearcher();
List<Long> findList = searcher.searchBigFile(file, searchBytes);
System.out.println("positions = " + findList);
}
}
メモリ上で検索したい場合はチェックしてください。 Githubの例: https://github.com/riversun/finbin
import Java.util.List;
import org.riversun.finbin.BigBinarySearcher;
public class Example {
public static void main(String[] args) throws Exception {
BigBinarySearcher bbs = new BigBinarySearcher();
byte[] iamBigSrcBytes = "Hello world.It's a small world.".getBytes("utf-8");
byte[] searchBytes = "world".getBytes("utf-8");
List<Integer> indexList = bbs.searchBytes(iamBigSrcBytes, searchBytes);
System.out.println("indexList=" + indexList);
}
}
バイトの配列内の一致した位置をすべて返します
また、大きなバイト配列にも耐えることができます。
ライブラリを好む人のために、TwitterのElephant-Birdオープンソースライブラリ(Apacheライセンス)にKnuth-Morris-Prattアルゴリズムの実装(以下のソース)があります。
ライブラリはGithubの次の場所にあります。 https://github.com/Twitter/elephant-bird
package com.Twitter.elephantbird.util;
import Java.io.IOException;
import Java.io.InputStream;
import Java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher {
protected byte[] pattern_;
protected int[] borders_;
// An upper bound on pattern length for searching. Throws exception on longer patterns
public static final int MAX_PATTERN_LENGTH = 1024;
public StreamSearcher(byte[] pattern) {
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
* @param pattern
* the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern) {
if (pattern.length > MAX_PATTERN_LENGTH) {
throw new IllegalArgumentException("The maximum pattern length is " + MAX_PATTERN_LENGTH);
}
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* @return bytes consumed if found, -1 otherwise.
* @throws IOException
*/
public long search(InputStream stream) throws IOException {
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1) {
bytesRead++;
while (j >= 0 && (byte)b != pattern_[j]) {
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length) {
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Moore-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
protected void preProcess() {
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length) {
while (j >= 0 && pattern_[i] != pattern_[j]) {
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}