How do I truncate a Java String so that it fits in a given number of bytes once UTF-8 encoded?

Question

How do I truncate a Java String so that, once it is UTF-8 encoded, it fits into storage of a given number of bytes?

Matt Quail · Accepted Answer

Here is a simple loop that counts how big the UTF-8 representation is going to be and truncates when the limit is exceeded.

    public static String truncateWhenUTF8(String s, int maxBytes) {
        int b = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);

            // ranges from http://en.wikipedia.org/wiki/UTF-8
            int skip = 0;
            int more;
            if (c <= 0x007f) {
                more = 1;
            } else if (c <= 0x07FF) {
                more = 2;
            } else if (c <= 0xd7ff) {
                more = 3;
            } else if (c <= 0xDFFF) {
                // surrogate area, consume next char as well
                more = 4;
                skip = 1;
            } else {
                more = 3;
            }

            if (b + more > maxBytes) {
                return s.substring(0, i);
            }
            b += more;
            i += skip;
        }
        return s;
    }

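This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence rather than as two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If the implementation ignored surrogate pairs, the truncated string could end up shorter than necessary.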
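To make the 4-byte claim concrete, here is a minimal sketch that measures one surrogate pair. The class name SurrogatePairBytes and the helper utf8Length are made up for this illustration; only String.getBytes and String.codePointCount are standard API.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairBytes {
    // Hypothetical helper: number of bytes Java's UTF-8 encoder produces for s.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef): one code point stored as two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        System.out.println(utf8Length(clef));                      // 4
    }
}
```

Counting each half of the pair as a separate 3-byte character would give 6 bytes for this string, which is why a limit of 4 or 5 bytes would wrongly truncate it.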

I haven't done a lot of testing on that code, but here are some preliminary tests:

    import java.nio.charset.Charset;

    private static void test(String s, int maxBytes, int expectedBytes) {
        String result = truncateWhenUTF8(s, maxBytes);
        byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
        if (utf8.length > maxBytes) {
            System.out.println("BAD: our truncation of " + s + " was too big");
        }
        if (utf8.length != expectedBytes) {
            System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
        }
        System.out.println(s + " truncated to " + result);
    }

    public static void main(String[] args) {
        test("abcd", 0, 0);
        test("abcd", 1, 1);
        test("abcd", 2, 2);
        test("abcd", 3, 3);
        test("abcd", 4, 4);
        test("abcd", 5, 4);

        test("a\u0080b", 0, 0);
        test("a\u0080b", 1, 1);
        test("a\u0080b", 2, 1);
        test("a\u0080b", 3, 3);
        test("a\u0080b", 4, 4);
        test("a\u0080b", 5, 4);

        test("a\u0800b", 0, 0);
        test("a\u0800b", 1, 1);
        test("a\u0800b", 2, 1);
        test("a\u0800b", 3, 1);
        test("a\u0800b", 4, 4);
        test("a\u0800b", 5, 5);
        test("a\u0800b", 6, 5);

        // surrogate pairs
        test("\uD834\uDD1E", 0, 0);
        test("\uD834\uDD1E", 1, 0);
        test("\uD834\uDD1E", 2, 0);
        test("\uD834\uDD1E", 3, 0);
        test("\uD834\uDD1E", 4, 4);
        test("\uD834\uDD1E", 5, 4);
    }

Updated: Modified the code example; it now handles surrogate pairs.

mitchnull · Answer

You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as fit can cut a UTF-8 character in half.

Something like this:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    public static int truncateUtf8(String input, byte[] output) {
        ByteBuffer outBuf = ByteBuffer.wrap(output);
        CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

        Charset utf8 = Charset.forName("UTF-8");
        // Encoding stops when the next character no longer fits in outBuf,
        // so a multi-byte character is never split.
        utf8.newEncoder().encode(inBuf, outBuf, true);
        System.out.println("encoded " + inBuf.position() + " chars of " + input.length()
                + ", result: " + outBuf.position() + " bytes");
        return outBuf.position();
    }
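The method above fills a byte array and returns the number of bytes written. If a truncated String is wanted instead, the same encoder idea can be adapted. This is a minimal sketch, not part of the original answer; the class name EncoderTruncate and the method truncateToUtf8Bytes are made up for illustration, and it assumes maxBytes is not negative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    // Hypothetical helper: longest prefix of s whose UTF-8 form fits in maxBytes.
    // Assumes maxBytes >= 0 (ByteBuffer.allocate rejects negative sizes).
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        // encode() stops when the next character (or surrogate pair) does not
        // fit in `out`; in.position() then counts the chars fully encoded.
        // With the encoder's default error action, an unpaired surrogate also
        // stops encoding at that position.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateToUtf8Bytes("a\u0080b", 2)); // prints "a"
    }
}
```

The encoded bytes are thrown away here, so this does more work than a pure counting loop, but the boundary logic is left entirely to the JDK.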

sigget · Answer

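Here is what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode quirks, surrogate pairs, and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the, with added checks for null and to avoid decoding when the string is already no longer than maxBytes bytes.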

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    /**
     * Truncates a string to the number of characters that fit in X bytes, avoiding multi-byte characters being cut in
     * half at the cut-off point. Also handles surrogate pairs, where 2 characters in the string are actually one literal
     * character.
     *
     * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
     */
    public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
        if (s == null) {
            return null;
        }
        Charset charset = Charset.forName("UTF-8");
        CharsetDecoder decoder = charset.newDecoder();
        byte[] sba = s.getBytes(charset);
        if (sba.length <= maxBytes) {
            return s;
        }
        // Ensure truncation by having byte buffer = maxBytes
        ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
        CharBuffer cb = CharBuffer.allocate(maxBytes);
        // Ignore an incomplete character
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        decoder.decode(bb, cb, true);
        decoder.flush(cb);
        return new String(cb.array(), 0, cb.position());
    }

billjamesdev · Answer

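UTF-8 encoding has a neat property that lets you tell where you are in a byte sequence.

Check the byte in the stream at the length limit you want.

If its high bit is 0, it is a single-byte character; just replace it with 0 and you are fine.
If its high bit is 1 and the next bit is also 1, you are at the start of a multi-byte character, so just set that byte to 0 and you are fine.
If its high bit is 1 but the next bit is 0, you are in the middle of a character; travel back along the buffer until you reach a byte with two or more 1s in its high bits, and replace that byte with 0.

Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make the string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the 0 after C1, which is the start of a multi-byte character.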
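The three cases above can be sketched in Java as a backward scan over the encoded bytes. This is an illustration under stated assumptions rather than the answerer's code: the names Utf8Boundary, safeCutLength, and truncate are made up, the input is assumed to be well-formed UTF-8, and the method returns a cut length instead of writing a 0 terminator, since Java strings are not NUL-terminated. The demo uses C2 A3 (the pound sign) for the two-byte character, because C1 never occurs as a lead byte in well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8Boundary {
    // Largest length <= maxBytes at which `utf8` can be cut without
    // splitting a character. Assumes `utf8` is well-formed UTF-8.
    static int safeCutLength(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        int cut = Math.max(maxBytes, 0);
        // utf8[cut] is the first byte that would be dropped. Continuation
        // bytes have the bit pattern 10xxxxxx; while the cut lands on one,
        // step back towards the lead byte of that character.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    // Hypothetical convenience wrapper: encode, cut on a boundary, decode.
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCutLength(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes: 31 33 31 C2 A3 32 33
        System.out.println(truncate("131\u00A323", 4)); // prints "131"
    }
}
```

Because a UTF-8 character is at most 4 bytes long, the loop steps back at most 3 bytes on well-formed input.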

Suresh Gupta · Answer

You can use new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
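One thing to check before relying on this one-liner: if the cut falls inside a multi-byte character, the String constructor replaces the incomplete sequence with U+FFFD, and the result can then re-encode to more than maxLen bytes. The sketch below shows this; the class name NaiveByteCut and the method naiveCut are hypothetical, and the Math.min clamp is an addition to avoid an IndexOutOfBoundsException when maxLen exceeds the encoded length.

```java
import java.nio.charset.StandardCharsets;

class NaiveByteCut {
    // Hypothetical wrapper around the one-liner above.
    static String naiveCut(String data, int maxLen) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        // Clamp so that short inputs do not throw IndexOutOfBoundsException.
        int len = Math.min(maxLen, bytes.length);
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" followed by e-acute encodes as 61 C3 A9; cutting at 2 keeps 61 C3.
        String cut = naiveCut("a\u00E9", 2);
        // The dangling C3 is decoded as U+FFFD, which itself needs 3 bytes.
        System.out.println(cut.equals("a\uFFFD"));                         // true
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```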

user19050 · Answer

You can calculate the number of bytes without doing any conversion.

    foreach character in the Java string
        if 0 <= character <= 0x7f
            count += 1
        else if 0x80 <= character <= 0x7ff
            count += 2
        else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
            count += 3
        else if 0xdc00 <= character <= 0xffff
            count += 3
        else { // surrogate, a bit more complicated
            count += 4
            skip one extra character in the input stream
        }

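You need to detect surrogate pairs (U+D800–U+DBFF followed by U+DC00–U+DFFF) and count 4 bytes for each valid pair. If you get a first value in the first range and a second value in the second range, all is well: skip both and add 4. If not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.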
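As a sketch of how the pseudocode could look in Java with pair validation added: the version below counts 4 bytes only for a high surrogate followed by a low surrogate. For an unpaired surrogate it counts 1 byte, on the assumption that the bytes will later be produced by String.getBytes, which substitutes a single '?' for input it cannot encode. A CharsetEncoder configured to report errors would instead fail on such input, so adjust that branch to match the encoder actually used. The class name Utf8ByteCount and the method utf8Length are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteCount {
    // Counts the bytes s.getBytes(UTF_8) would produce, without encoding.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // unpaired surrogate: getBytes writes '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\u0800\uD834\uDD1E";
        System.out.println(utf8Length(s));                             // 8
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 8
    }
}
```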