utf-8 - How to decode such a strange string to UTF-8? (PHP)


I have the string %u041E%u043B%u0435%u0433%20%u042F%u043a. Is it possible to decode it to UTF-8 or (better for me) to HTML entities?

This is the JavaScript escape() format. It is similar to URL-encoding but is not compatible with it. Its use is usually a mistake.

The best fix is to change the script that generates it to use proper URL-encoding (encodeURIComponent()) instead. You can then decode it with urldecode or any other normal URL-decoding function on the server side.
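As a rough illustration of the server side, assuming the client sent the result of `encodeURIComponent('Олег Як')` (the variable names are made up for the example):

```php
<?php
// Assumed client output: encodeURIComponent('Олег Як') percent-encodes the UTF-8 bytes.
$encoded = '%D0%9E%D0%BB%D0%B5%D0%B3%20%D0%AF%D0%BA';

// rawurldecode() turns each %XX back into one byte, giving a raw UTF-8 string.
$decoded = rawurldecode($encoded);

echo $decoded, "\n";
```

Values that arrive through $_GET or $_POST are already URL-decoded by PHP, so an explicit call is only needed when you handle the encoded string yourself.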

If you have to exchange data in this non-standard format, you will have to write a custom decoder for it. Here is a quick hack that takes advantage of the HTML character-reference decoder:

    function jsunescape($s) {
        // Turn four-digit %uXXXX code-unit escapes into hex character references.
        $s = preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
        // Do the same for the two-digit %XX escapes.
        $s = preg_replace('/%([0-9A-Fa-f]{2})/', '&#x$1;', $s);
        // Let the HTML decoder convert the references into UTF-8 bytes.
        return html_entity_decode($s, ENT_COMPAT, 'utf-8');
    }

This gives a raw UTF-8 byte string. If you really want it in HTML character references like &#1056;&#1091;... then leave out the html_entity_decode call, but generally you do not want that. It is best to keep strings in raw format until they need to be escaped for final output, and preferably not to replace non-ASCII characters with character references at all unless you really need to.
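A minimal sketch of "escape only at output", assuming the page is served as UTF-8; `$name` is a hypothetical variable holding an already-decoded string:

```php
<?php
// Hypothetical decoded value, kept as a raw UTF-8 string inside the application.
$name = "Олег <Як>";

// Escape only the HTML-special characters at the moment of output;
// the Cyrillic letters stay as raw UTF-8 rather than character references.
$html = '<p>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>';

echo $html, "\n";
```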

What would happen if something like '%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED' was sent to me?

That is URL-form-encoded, which is not directly compatible with the escape() format. While the 2-digit byte escapes from URL-encoding can be told apart from the crazy escape()-format 4-digit code-unit escapes, the character + is ambiguous. It may be a plus (if the string comes from escape()) or a space (if it comes from a browser form submission). There is no way to tell which it is. This is yet another reason not to use escape().

Other than that, if the charset of this string had been UTF-8, then yes, the function above would have been fine, converting both URL-encoded bytes and nutty escape()-encoded Unicode characters into UTF-8 bytes.
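To make the + ambiguity concrete, here is a small sketch that reads the same made-up input under both conventions, using PHP's two standard decoders:

```php
<?php
// The same input string, read under the two possible conventions.
$s = 'a+b%20c';

// escape() convention: '+' is a literal plus, %20 is a space.
$fromEscape = rawurldecode($s);   // "a+b c"

// Form-submission convention: '+' is also a space.
$fromForm = urldecode($s);        // "a b c"

var_dump($fromEscape === $fromForm);   // bool(false)
```

Nothing in the string itself says which reading is right; the receiver has to know where it came from.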

However, this actually appears to be code page 1251 (Windows Russian). Do you really want to handle all your strings in cp1251? If so, you would have to change the function a little to encode the four-digit escapes into a different charset. It gets messy:

    function url_or_maybe_jsescape_decode($s, $charset, $isform) {
        // In a form submission '+' means a space; escape() leaves a literal plus as-is.
        if ($isform)
            $s = str_replace('+', ' ', $s);
        // Four-digit %uXXXX escapes: turn into character references, then decode into $charset.
        $s = preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
        $s = html_entity_decode($s, ENT_COMPAT, $charset);
        // Two-digit %XX escapes are already bytes in $charset, so decode them as raw bytes.
        return rawurldecode($s);
    }

    echo url_or_maybe_jsescape_decode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'cp1251', TRUE);
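If the input is known to be a form submission and UTF-8 is what you want to work with, one alternative (a sketch, not the function above) is to decode the bytes first and convert the charset afterwards. It assumes the iconv extension is available and that anything which is not valid UTF-8 is cp1251; the `//u` regex is just a common way to test UTF-8 validity:

```php
<?php
$raw = '%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED';

// Form encoding: urldecode() turns '+' into a space and %XX into raw bytes.
$bytes = urldecode($raw);

// An empty pattern with the /u flag only matches when the subject is valid UTF-8.
$isUtf8 = preg_match('//u', $bytes) === 1;

// Assumption: anything that is not valid UTF-8 here is cp1251.
$utf8 = $isUtf8 ? $bytes : iconv('CP1251', 'UTF-8', $bytes);

echo $utf8, "\n";
```

The guess in the last step is exactly the kind of thing that goes away once everything is UTF-8 end to end.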

I strongly recommend:

  1. Fix the Flash file so that it uses proper encodeURIComponent and not escape, so that you can use a standard URL-decoder instead of this ugly hack.

  2. Use UTF-8 throughout your application, so that you can support languages other than just Russian and do not need to worry about the input encoding of submitted forms.

(All encodings that are not UTF-8 suck, and this is a fact proven by science!)

