PHP: Convert a Unicode codepoint to UTF-8

Question

I have data in this format: U+597D, or like this: U+6211. I want to convert these to UTF-8 (the original characters are 好 and 我). How can I do that?

Mez · Accepted Answer

$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');

This is probably the simplest solution.
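The one-liner above matches exactly four hex digits and runs html_entity_decode() over the whole string, so other entities in the input (such as &amp;) are decoded too. A minimal sketch of a variant that touches only the U+XXXX tokens and accepts four to six hex digits; the function name decode_u_plus is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: replaces each U+XXXX token (4 to 6 hex digits) with its UTF-8 character.
function decode_u_plus(string $text): string
{
    return preg_replace_callback(
        '/U\+([0-9A-Fa-f]{4,6})/',
        function (array $m): string {
            $entity = '&#x' . $m[1] . ';';
            // Decode only this one numeric entity; the rest of the text is left alone.
            $char = html_entity_decode($entity, ENT_NOQUOTES, 'UTF-8');
            // If the entity was not decoded, keep the original token.
            return $char === $entity ? $m[0] : $char;
        },
        $text
    );
}

echo decode_u_plus('U+597D and U+6211'), "\n";
```

Because the pattern is greedy, a token directly followed by more hex digits (for example U+597DAB) is read as one longer codepoint, so this assumes tokens are separated by a non-hex character or by the next U+.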

velcrow · Answer

// Encode a codepoint as a UTF-8 string (multibyte counterpart of chr()).
function utf8($num)
{
    if ($num <= 0x7F) return chr($num);
    if ($num <= 0x7FF) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num <= 0xFFFF) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num <= 0x1FFFFF) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

// Return the codepoint of the first UTF-8 character in $c (multibyte counterpart of ord()).
// Uses $c[0] rather than $c{0}: the curly-brace offset syntax was removed in PHP 8.
function uniord($c)
{
    $ord0 = ord($c[0]);
    if ($ord0 >= 0 && $ord0 <= 127) return $ord0;
    $ord1 = ord($c[1]);
    if ($ord0 >= 192 && $ord0 <= 223) return ($ord0 - 192) * 64 + ($ord1 - 128);
    $ord2 = ord($c[2]);
    if ($ord0 >= 224 && $ord0 <= 239) return ($ord0 - 224) * 4096 + ($ord1 - 128) * 64 + ($ord2 - 128);
    $ord3 = ord($c[3]);
    if ($ord0 >= 240 && $ord0 <= 247) return ($ord0 - 240) * 262144 + ($ord1 - 128) * 4096 + ($ord2 - 128) * 64 + ($ord3 - 128);
    return false;
}

utf8() and uniord() try to mirror PHP's chr() and ord() functions.

echo utf8(0x6211)."\n";
echo uniord(utf8(0x6211))."\n";
echo "U+".dechex(uniord(utf8(0x6211)))."\n";

// In your case:
$wo='U+6211';
$hao='U+597D';
echo utf8(hexdec(str_replace("U+","", $wo)))."\n";
echo utf8(hexdec(str_replace("U+","", $hao)))."\n";

Output:

我
25105
U+6211
我
好

Rabin Lama Dong · Answer

PHP 7 and later

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{597D}"; outputs 好.
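The \u{...} escape is resolved when PHP parses a double-quoted string literal (or heredoc), so it suits codepoints known in advance rather than values read at run time. A small sketch showing the bytes it produces:

```php
<?php
// "\u{...}" is expanded to UTF-8 bytes inside double-quoted strings (PHP 7.0+).
$hao = "\u{597D}";
$wo  = "\u{6211}";

echo $hao, $wo, "\n";                 // the two characters from the question
echo strtoupper(bin2hex($hao)), "\n"; // hex dump of the UTF-8 bytes: E5A5BD

// Single-quoted strings keep the escape as literal text.
echo '\u{597D}', "\n";
```

For input such as the string 'U+597D' that arrives at run time, one of the conversion functions from the other answers is still needed.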

John Slegers · Answer

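I put together a polyfill for the missing multibyte versions of ord and chr, with the following in mind:

It defines the functions mb_ord and mb_chr only if they do not already exist. If they exist in your framework or in a future version of PHP, the polyfill is ignored.
It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it uses the iconv extension instead.

I also added functions for HTML entity encoding/decoding and for encoding/decoding to JSON format, along with some demo code showing how to use these functions.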

Code

if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

// Fallback used only when mbstring is not loaded.
if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // A codepoint packed as a 32-bit big-endian integer is already UCS-4BE.
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        // Replace every non-ASCII character with a numeric entity.
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

Usage

echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('我好', false));
var_dump(mb_html_entity_decode('&#25105;&#22909;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('我好'));
var_dump(mb_html_entity_decode('&#x6211;&#x597D;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));

Output

Get string from numeric DEC value
string(3) "我"
string(3) "好"

Get string from numeric HEX value
string(3) "我"
string(3) "好"

Get numeric value of character as DEC int
int(25105)
int(22909)

Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"

Encode / decode to DEC based HTML entities
string(16) "&#25105;&#22909;"
string(6) "我好"

Encode / decode to HEX based HTML entities
string(16) "&#x6211;&#x597D;"
string(6) "我好"

Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
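Since PHP 7.2 the mbstring extension provides mb_chr() and mb_ord() natively, so the polyfill above only comes into play on older versions or when mbstring is not loaded. A minimal sketch of converting the question's U+XXXX notation in both directions; the helper names u_plus_to_utf8 and utf8_to_u_plus are hypothetical, and the fallback branches assume the iconv extension is available.

```php
<?php
// Hypothetical helper: "U+6211" -> UTF-8 character ('' if the codepoint is invalid).
function u_plus_to_utf8(string $token): string
{
    $cp = (int) hexdec(substr($token, 2)); // drop the leading "U+"
    if (function_exists('mb_chr')) {
        $char = mb_chr($cp, 'UTF-8');
        return $char === false ? '' : $char;
    }
    // Fallback: a 32-bit big-endian integer is the UCS-4BE form of the codepoint.
    return (string) iconv('UCS-4BE', 'UTF-8', pack('N', $cp));
}

// Hypothetical helper: first UTF-8 character of $char -> "U+XXXX".
function utf8_to_u_plus(string $char): string
{
    if (function_exists('mb_ord')) {
        $cp = mb_ord($char, 'UTF-8');
    } else {
        $cp = unpack('N', iconv('UTF-8', 'UCS-4BE', $char))[1];
    }
    return sprintf('U+%04X', $cp);
}

echo u_plus_to_utf8('U+6211'), u_plus_to_utf8('U+597D'), "\n";
echo utf8_to_u_plus("\u{597D}"), "\n";
```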

eleg · Answer

// Note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2.
mb_convert_encoding(
    preg_replace("/U\+([0-9A-F]+)/", "&#x\\1;", 'U+597DU+6211'),
    "UTF-8",
    "HTML-ENTITIES"
);

This works fine.

Dor · Answer

Using the following table:

http://en.wikipedia.org/wiki/UTF-8#Description

It can't get any simpler :)

Just mask the Unicode number according to the range it falls in.
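The table referenced above assigns each codepoint range a fixed byte layout. As a rough sketch of that masking (the function name codepoint_to_utf8 is hypothetical), the four layouts can be written out as follows. Rejecting surrogates and values above U+10FFFF is an added assumption about what callers want, not something the answer states.

```php
<?php
// Hypothetical helper: encode one Unicode scalar value as UTF-8 by masking its bits.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// U+6211 falls in the three-byte range, giving the bytes E6 88 91.
echo bin2hex(codepoint_to_utf8(0x6211)), "\n";
```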

Php'Regex · Answer

<?php
// Encode codepoint $n as UTF-8; returns '' for values of 0x110000 and above.
function chr_utf8($n, $f = 'C*') {
    return $n < (1<<7) ? chr($n)
        : ($n < (1<<11) ? pack($f, 192|$n>>6, 1<<7|191&$n)
        : ($n < (1<<16) ? pack($f, 224|$n>>12, 1<<7|63&$n>>6, 1<<7|63&$n)
        : ($n < (1<<20|1<<16) ? pack($f, 240|$n>>18, 1<<7|63&$n>>12, 1<<7|63&$n>>6, 1<<7|63&$n)
        : '')));
}

$your_input = 'U+597D';
echo (chr_utf8(hexdec(ltrim($your_input, 'U+'))));
// Output 好

If you want to use a callback function, you can try:

<?php
// Note: function chr_utf8 shown above is required
$your_input = 'U+597DU+6211';
$result = preg_replace_callback('#U\+([a-f0-9]+)#i', function ($a) {
    return chr_utf8(hexdec($a[1]));
}, $your_input);
echo $result;
// Output 好我

Check it at https://eval.in/748187

Claudio Garaycochea · Answer

This worked well for me. If you have a string such as "Letters u00e1 u00e9 etc.", it is replaced with "Letters á é etc.".

function unicode2html($str){
    // Set the locale to something that's UTF-8 capable
    setlocale(LC_ALL, 'en_US.UTF-8');
    // Convert the codepoints to hexadecimal HTML entities
    $str = preg_replace("/u([0-9a-fA-F]{4})/", "&#x\\1;", $str);
    // Transliterate the remaining text from UTF-8 to ISO-8859-1;
    // the entities are plain ASCII and pass through unchanged
    return iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
}

Tschallacka · Answer

I was using a WYSIWYG editor, so I needed to filter certain characters without affecting the HTML; copy-pasting from Word adds some nice unrenderable characters to the content.

My solution boils down to a simple replacement list.

class ReplaceIllegal
{
    public static $find = array();
    private static $replace = array();
    private static $initialized = false;

    /*
     * Build the replacement list once: the bytes 0x00-0x1E and 0x80-0xFE,
     * each mapped to its decimal HTML entity (&#0; ... &#30;, &#128; ... &#254;).
     */
    private static function init()
    {
        if (self::$initialized) {
            return;
        }
        self::$initialized = true;
        foreach (array_merge(range(0x00, 0x1E), range(0x80, 0xFE)) as $byte) {
            self::$find[] = chr($byte);
            self::$replace[] = '&#' . $byte . ';';
        }
    }

    /*
     * replace illegal characters for escaped html character but don't touch anything else.
     * Note: this works byte by byte, so it suits single-byte input such as
     * ISO-8859-1 or Windows-1252; on UTF-8 input it would also rewrite the
     * bytes of multibyte characters.
     */
    public static function getSaveValue($value)
    {
        self::init();
        return str_replace(self::$find, self::$replace, $value);
    }

    public static function makeIllegal($find, $replace)
    {
        self::init();
        self::$find[] = $find;
        self::$replace[] = $replace;
    }
}