utf-8 - Is there a list of language-only character regions for UTF-8 somewhere?


I am trying to analyze some UTF-8 encoded documents in a way that recognizes characters from different languages. For my approach to work, I need to ignore non-language characters such as control characters, mathematical symbols, etc. Just trying to break apart the basic Latin section of the Unicode standard has resulted in several regions, because characters such as the division sign sit in the middle of a range of otherwise valid Latin letters.

Is there a list somewhere that identifies these regions? Or better yet, a regex that defines the regions, or something in C# that can identify the different characters?

Look at the Unicode general categories; you can use them in C# regular expressions with the character class syntax \p{catname}. So to match a lowercase letter, you would use \p{Ll}. You can combine them: [\p{Ll}\p{Lu}] matches characters in either the Ll or the Lu category.
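As a minimal sketch of that idea (not a complete solution), the snippet below strips everything that is not a letter or a combining mark. The class name LetterFilter and the method KeepLetters are hypothetical names chosen for illustration. It assumes the UTF-8 document has already been decoded into a .NET string, and it uses \p{L} (all letter categories) together with \p{M} (combining marks) rather than listing Ll, Lu, etc. one by one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterFilter
{
    // \p{L} covers every letter category (Lu, Ll, Lt, Lm, Lo);
    // \p{M} covers combining marks such as accents stored separately.
    // The negated class matches runs of anything else: digits,
    // punctuation, whitespace, control characters, math symbols.
    private static readonly Regex NonLetters = new Regex(@"[^\p{L}\p{M}]+");

    // Hypothetical helper: returns the input with all non-letter
    // characters removed.
    public static string KeepLetters(string text)
    {
        return NonLetters.Replace(text, "");
    }

    public static void Main()
    {
        // The division and multiplication signs are category Sm,
        // so they are dropped along with the digits and spaces.
        Console.WriteLine(KeepLetters("a÷b × c = 42")); // prints "abc"
    }
}
```

.NET regular expressions also accept named Unicode blocks in the same syntax, for example \p{IsCyrillic}, which may help when telling scripts apart. Note that .NET matches on UTF-16 code units, so letters outside the Basic Multilingual Plane would likely need separate handling.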

