C++ iterate or split UTF-8 string into array of symbols?


I am looking for a platform-independent way, without third-party libraries, of iterating over a UTF-8 string or splitting it into an array of UTF-8 symbols.

Answer:

If I understand correctly, it sounds like you want to find the start of each UTF-8 character. If so, it would be fairly straightforward to parse them (interpreting them is a different matter). The number of octets involved is well defined:

  Char. number range  |        UTF-8 octet sequence
     (hexadecimal)    |              (binary)
  --------------------+-------------------------------------
  0000 0000-0000 007F | 0xxxxxxx
  0000 0080-0000 07FF | 110xxxxx 10xxxxxx
  0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
  0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, if lb holds the first octet of a UTF-8 character, I think the following would determine the number of octets involved.

    unsigned char lb;  // must hold the first octet of the character

    if ((lb & 0x80) == 0)            // lead bit is zero, must be a single ASCII octet
        printf("1 octet\n");
    else if ((lb & 0xE0) == 0xC0)    // 110x xxxx
        printf("2 octets\n");
    else if ((lb & 0xF0) == 0xE0)    // 1110 xxxx
        printf("3 octets\n");
    else if ((lb & 0xF8) == 0xF0)    // 1111 0xxx
        printf("4 octets\n");
    else
        printf("Unrecognized lead byte (%02x)\n", lb);

Ultimately, though, you would be better off using an existing library, as suggested in another post. The code above can classify characters by their octet count, but it does not help you "do" anything with them once that is finished.
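For the narrower task in the question, splitting a string into its UTF-8 characters, the lead-byte test can be wrapped in a loop. The following is only a minimal sketch, not a full validator. The names utf8_length and split_utf8 are hypothetical, and it assumes that a bad lead byte or a truncated or broken sequence should be emitted as a single one-byte element so that the loop always advances. It does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of octets implied by a lead byte, or 0 if the
// byte cannot start a UTF-8 character (continuation byte or invalid value).
std::size_t utf8_length(unsigned char lb) {
    if ((lb & 0x80) == 0x00) return 1;  // 0xxx xxxx
    if ((lb & 0xE0) == 0xC0) return 2;  // 110x xxxx
    if ((lb & 0xF0) == 0xE0) return 3;  // 1110 xxxx
    if ((lb & 0xF8) == 0xF0) return 4;  // 1111 0xxx
    return 0;
}

// Hypothetical helper: split a UTF-8 string into one std::string per character.
// Assumption: malformed input is passed through one byte at a time.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t n = utf8_length(static_cast<unsigned char>(s[i]));
        // Bad lead byte, or the sequence would run past the end of the string.
        if (n == 0 || i + n > s.size()) n = 1;
        // Every following octet must look like 10xx xxxx.
        for (std::size_t k = 1; k < n; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
                n = 1;  // broken sequence: fall back to a single byte
                break;
            }
        }
        out.push_back(s.substr(i, n));
        i += n;
    }
    return out;
}
```

Calling split_utf8 on a string holding "a", "é" and "€" would give three elements of 1, 2 and 3 octets respectively. Keeping each character as its own std::string is simple but allocates per element; storing offsets into the original string would be a cheaper alternative.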

