I want to replace all the high Unicode characters in large documents, such as accented E's, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to do this repeatedly on very large documents. I have seen an example of this in what I think might be Perl.

Is there a faster way to do this in Python than `s.replace(...).replace(...).replace(...)...`? I have tried it with just a few characters to replace, and splitting the document up actually slowed things down.
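For reference, a minimal sketch of the chained-replace approach I mean (the character pairs here are just examples):

```python
# -*- coding: utf-8 -*-
# Naive approach: one .replace() call per character pair.
# Each call scans the whole document, so the cost grows with the
# number of substitutions as well as the document size.
def naive_replace(s):
    return (s.replace(u'\u00c9', u'E')   # E with acute accent
             .replace(u'\u2018', u"'")   # left single quote
             .replace(u'\u2019', u"'")   # right single quote
             .replace(u'\u201c', u'"')   # left double quote
             .replace(u'\u201d', u'"'))  # right double quote
```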
Edit: here is my version of the code, which doesn't seem to work:
```python
# -*- coding: iso-8859-15 -*-
import unidecode

def ascii_map():
    data = {}
    for num in range(256):
        h = num
        filename = 'x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.' + filename, fromlist=True)
        except ImportError:
            pass
        else:
            for l, val in enumerate(mod.data):
                i = h << 8
                i += l
                if i >= 0x80:
                    data[i] = unicode(val)
    return data

if __name__ == '__main__':
    s = u'“fancy” fancy2'
    print(s.translate(ascii_map()))
```
You can normalize the string to NFKD and then encode it to ASCII, ignoring the characters that don't decompose:

```python
# -*- encoding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == '__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC
```
(Note: as @Mark Tolonen kindly points out, `encode('ascii', 'ignore')` above drops characters that have no ASCII decomposition, such as ß and the curly quotes ‘’“”. If the code above truncates characters that you want translated, then you may have to fix those problems manually with the string's `translate` method; another option is to use `unidecode`, discussed below.)
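For instance, a minimal sketch of such a manual fix-up (the mapping here is a hypothetical example chosen by hand, not unidecode's):

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical fix-up table: map codepoints that NFKD + ASCII-encode
# would otherwise drop to hand-picked replacements.
fixups = {
    0xdf:   u'ss',  # ß has no NFKD decomposition to ASCII
    0x2018: u"'",   # left single quote
    0x2019: u"'",   # right single quote
    0x201c: u'"',   # left double quote
    0x201d: u'"',   # right double quote
}

def shoehorn_with_fixups(s):
    # unicode.translate maps ordinals to ordinals, strings, or None
    s = s.translate(fixups)
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
```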
When you have a large Unicode string, using its `translate` method will be much, much faster than using the `replace` method.
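If you want to check that claim on your own data, a rough micro-benchmark sketch (the string contents and sizes here are made-up examples):

```python
# -*- coding: utf-8 -*-
import timeit

# A large unicode string with two high characters mixed in.
setup = """
s = u'x\\u00e9y\\u2019z' * 200000
table = {0xe9: u'e', 0x2019: u"'"}
"""
print(timeit.timeit('s.translate(table)', setup=setup, number=10))
print(timeit.timeit(
    's.replace(u"\\u00e9", u"e").replace(u"\\u2019", u"\'")',
    setup=setup, number=10))
```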
Edit: `unidecode` has a more complete mapping of Unicode codepoints to ASCII. However, `unidecode.unidecode` loops through the string character-by-character (in a Python loop), which is slower than using the `translate` method.

The following helper function uses unidecode's data files and the `translate` method to achieve better speed, especially for long strings. In my tests on 1-6 MB text files, using `ascii_map` is about 4-6 times faster than `unidecode.unidecode`.
```python
# -*- coding: utf-8 -*-
import unidecode

def ascii_map():
    data = {}
    for num in range(256):
        h = num
        filename = 'x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.' + filename, fromlist=True)
        except ImportError:
            pass
        else:
            # mod.data holds the ASCII replacements for one block of
            # 256 codepoints; build {codepoint: replacement} from it.
            for l, val in enumerate(mod.data):
                i = h << 8
                i += l
                if i >= 0x80:
                    data[i] = unicode(val)
    return data

if __name__ == '__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC
```
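Since the goal is to run this repeatedly over very large documents, build the table once and reuse it; a minimal sketch (`documents` is a placeholder for however you iterate over your texts):

```python
# Build the mapping once at startup; each subsequent document is then
# converted in a single translate() pass instead of one pass per
# substitution.
table = ascii_map()

for doc in documents:  # `documents`: hypothetical iterable of unicode strings
    ascii_doc = doc.translate(table)
```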
Edit 2: Rhubarb, if `# -*- encoding: utf-8 -*-` is causing a SyntaxError, try `# -*- encoding: cp1252 -*-` instead. Which encoding to declare depends on the encoding your text editor uses to save the file. Linux tends to use utf-8, and (it seems, perhaps) Windows tends toward cp1252.
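Alternatively, a sketch of a way to sidestep the issue entirely: keep the script pure ASCII and spell the test string with escapes, so the file's on-disk encoding no longer matters and no coding declaration is needed:

```python
# Pure-ASCII source: the escapes below spell u"éèêàùçÇ".
s = u'\xe9\xe8\xea\xe0\xf9\xe7\xc7'
print(s.translate(ascii_map()))  # eeeaucC  (ascii_map as defined above)
```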