PHP preg_match and UTF-8

Question

I am trying to search a UTF-8 encoded string using preg_match.

    preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
    echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems that the subject is not being treated as a UTF-8 encoded string, even though I am passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF-8 functions are working:

    mbstring.func_overload = 7
    mbstring.language = Neutral
    mbstring.internal_encoding = UTF-8
    mbstring.http_input = pass
    mbstring.http_output = pass
    mbstring.encoding_translation = Off

Any ideas?

user187291 · Accepted Answer

This looks like a "feature"; see http://bugs.php.net/bug.php?id=37391

The 'u' switch only makes sense to PCRE; PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset is logical (I am not saying "correct").
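
To make the byte-versus-character distinction concrete, here is a minimal sketch that compares PHP's byte-oriented string functions with their mbstring counterparts on the string from the question. It assumes the mbstring extension is loaded and that mbstring.func_overload is not enabled (with overloading on, strlen and strpos would be replaced by their multibyte versions).

```php
<?php
// "¡Hola!" in UTF-8: "¡" is encoded as the two bytes C2 A1.
$str = "\xC2\xA1Hola!";

$byteLength = strlen($str);                    // counts bytes
$charLength = mb_strlen($str, 'UTF-8');        // counts characters
$bytePos    = strpos($str, 'H');               // byte position of "H"
$charPos    = mb_strpos($str, 'H', 0, 'UTF-8'); // character position of "H"

var_dump($byteLength, $charLength, $bytePos, $charPos);
// int(7) int(6) int(2) int(1)
```

The 2 that preg_match reports is the same number strpos gives: the count of bytes before "H", not the count of characters.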

Gumbo · Answer

Although the u modifier makes both the pattern and the subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

    $str = "\xC2\xA1Hola!";
    preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
    echo mb_strlen(substr($str, 0, $a_matches[0][1]));
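
The same idea can be wrapped in a small helper and applied to every match of preg_match_all. This is only a sketch: byteToCharOffset is a hypothetical name, not a PHP built-in, and it assumes mbstring.func_overload is off so that substr really cuts on bytes.

```php
<?php
// Hypothetical helper: convert a byte offset reported by PCRE into a
// character offset by counting the characters that precede it.
function byteToCharOffset(string $subject, int $byteOffset): int
{
    // substr() works on bytes; mb_strlen() then counts UTF-8 characters.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// "¡Hola! ¿Qué tal?" written with escapes so the file encoding does not matter.
$subject = "\xC2\xA1Hola! \xC2\xBFQu\xC3\xA9 tal?";

// Match runs of letters and record their byte offsets.
preg_match_all('/\p{L}+/u', $subject, $matches, PREG_OFFSET_CAPTURE);

$charOffsets = [];
foreach ($matches[0] as [$text, $byteOffset]) {
    $charOffsets[$text] = byteToCharOffset($subject, $byteOffset);
}

print_r($charOffsets); // Hola => 1, Qué => 8, tal => 12
```

The byte offsets for the three words are 2, 10 and 15; the helper maps them to character offsets 1, 8 and 12. Each conversion rescans the prefix of the subject, which is fine for short strings but adds up on long subjects with many matches.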

Natxet · Answer

Try adding (*UTF8) at the very start of the pattern, inside the delimiters:

    preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment at http://www.php.net/manual/es/function.preg-match.php#95828
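
A quick way to see what the (*UTF8) verb changes is to run it next to the u modifier and compare the reported offsets. The sketch below assumes the PCRE build bundled with PHP accepts the verb, and that it is placed at the very start of the pattern, inside the delimiters.

```php
<?php
$subject = "\xC2\xA1Hola!"; // "¡Hola!" in UTF-8

// Variant 1: UTF-8 mode requested through the (*UTF8) verb in the pattern.
preg_match('/(*UTF8)H/', $subject, $withVerb, PREG_OFFSET_CAPTURE);

// Variant 2: UTF-8 mode requested through the u modifier.
preg_match('/H/u', $subject, $withModifier, PREG_OFFSET_CAPTURE);

// Print both offsets side by side for comparison.
var_dump($withVerb[0][1], $withModifier[0][1]);
```

As far as I can tell, both variants report the same byte offset, so a byte-to-character conversion such as mb_strlen(substr(...)) is still needed if character positions are required.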

Guy Fawkes · Answer

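Sorry for the necroposting, but someone may find this useful: the code below works as both the preg_match and the preg_match_all function, and returns the correct matches with the correct offsets for UTF-8 encoded strings.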

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText   = $match[0];
                    $matchedLength = $match[1];
                    $positions[]   = array(
                        $matchedText,
                        mb_strlen(mb_strcut($subject, 0, $matchedLength))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));
    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

velcrow · Answer

If all you want to do is find the multibyte-safe position of H, try mb_strpos():

    mb_internal_encoding('UTF-8');
    $str = "\xC2\xA1Hola!";
    $pos = mb_strpos($str, 'H');
    echo $str . "\n";
    echo $pos . "\n";
    echo mb_substr($str, $pos, 1) . "\n";

Output:

    ¡Hola!
    1
    H

bronek89 · Answer

I wrote a small class that converts the offsets returned by preg_match into proper UTF offsets.

    final class NonUtfToUtfOffset
    {
        /** @var int[] */
        private $utfMap = [];

        public function __construct(string $content)
        {
            $contentLength = mb_strlen($content);
            for ($offset = 0; $offset < $contentLength; $offset++) {
                $char = mb_substr($content, $offset, 1);
                $nonUtfLength = strlen($char);
                for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset++) {
                    $this->utfMap[] = $offset;
                }
            }
        }

        public function convertOffset(int $nonUtfOffset): int
        {
            return $this->utfMap[$nonUtfOffset];
        }
    }

You can use it like this:

    $content = 'aą bać d';
    $offsetConverter = new NonUtfToUtfOffset($content);
    preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[1] as [$word, $offset]) {
        echo "bad: " . mb_substr($content, $offset, mb_strlen($word)) . "\n";
        echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word)) . "\n";
    }

https://3v4l.org/8Y32J

Danon · Answer

Take a look at the T-Regx library.

    pattern('/Hola/u')->match("\xC2\xA1Hola!")->first(function (Match $match) {
        echo $match->offset();     // characters
        echo $match->byteOffset(); // bytes
    });

Here $match->offset() is the UTF-8 safe offset.