I want to replace all the high-unicode characters in large documents, such as accented E's, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to perform this on a very large document repeatedly. I've seen an example of this in what I think might be Perl.

Is there a faster way of doing this in Python without using s.replace(...).replace(...).replace(...)...? I've tried it with just a few characters to replace, and the document stripping became really slow.
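For concreteness, here is a minimal sketch of the two approaches under discussion (the sample characters are illustrative, not a complete mapping):

```python
# -*- coding: utf-8 -*-
# Chained replace: one full pass over the string per substitution.
def clean_with_replace(s):
    return (s.replace(u'\u2018', u"'")   # left single quote
             .replace(u'\u2019', u"'")   # right single quote
             .replace(u'\u201c', u'"')   # left double quote
             .replace(u'\u201d', u'"')   # right double quote
             .replace(u'\xe9', u'e'))    # e with acute accent

# translate: a single pass, driven by one table of ordinals.
TABLE = {0x2018: u"'", 0x2019: u"'",
         0x201c: u'"', 0x201d: u'"',
         0xe9: u'e'}

def clean_with_translate(s):
    return s.translate(TABLE)
```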
Edit: my version of your code doesn't seem to work:
```python
# -*- coding: iso-8859-15 -*-
import unidecode

def ascii_map():
    data = {}
    for num in range(256):
        h = num
        filename = 'x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.' + filename, fromlist=True)
        except ImportError:
            pass
        else:
            for l, val in enumerate(mod.data):
                i = h << 8
                i += l
                if i >= 0x80:
                    data[i] = unicode(val)
    return data

if __name__ == '__main__':
    s = u'“fancy” fancy2'
    print(s.translate(ascii_map()))
```

There's the string's translate method:

```python
# -*- coding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    # Decompose accented characters, then drop anything non-ASCII.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == '__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC
```

Note, as @Mark Tolonen kindly points out, the method above removes some characters like ß‘’“”. If the above code truncates characters that you wish translated, then you may have to fix those problems manually using the string's translate method.
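A minimal sketch of such a manual fix-up, assuming a hand-picked FIXUPS table (the entries below are illustrative, not a complete table):

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hand-picked fix-ups applied *before* NFKD, so characters that
# encode('ascii', 'ignore') would otherwise drop get sensible
# ASCII replacements. Entries here are illustrative assumptions.
FIXUPS = {
    0xdf: u'ss',     # ß
    0x2018: u"'",    # left single quote
    0x2019: u"'",    # right single quote
    0x201c: u'"',    # left double quote
    0x201d: u'"',    # right double quote
}

def shoehorn_with_fixups(s):
    s = s.translate(FIXUPS)
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == '__main__':
    print(shoehorn_with_fixups(u'\u201cna\xefve\u201d stra\xdfe'))
    # "naive" strasse
```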
Another option is to use unidecode (see the edit below).

When you have a large unicode string, using its translate method will be much, much faster than using the replace method.

Edit:
unidecode has a more complete mapping of Unicode codepoints to ASCII. However, unidecode.unidecode loops through the string character by character (in a Python loop), which is slower than using the translate method. The following helper function uses unidecode's data files and the translate method to attain better speed, especially for long strings.
In my tests on 1-6 MB text files, using ascii_map is about 4-6 times faster than unidecode.unidecode.

```python
# -*- coding: utf-8 -*-
import unidecode

def ascii_map():
    # Build a translate table from unidecode's data files: for every
    # codepoint >= 0x80, map its ordinal to unidecode's ASCII replacement.
    data = {}
    for num in range(256):
        h = num
        filename = 'x{num:02x}'.format(num=num)
        try:
            # Each unidecode.xNN module holds the replacements for one
            # 256-codepoint block.
            mod = __import__('unidecode.' + filename, fromlist=True)
        except ImportError:
            pass
        else:
            for l, val in enumerate(mod.data):
                i = h << 8   # high byte of the codepoint
                i += l       # low byte
                if i >= 0x80:
                    data[i] = unicode(val)
    return data

if __name__ == '__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC
```
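A rough way to reproduce that comparison (a minimal timing sketch: 'big.txt' is a placeholder for any large UTF-8 text file, and ascii_map is the helper defined above):

```python
# -*- coding: utf-8 -*-
import codecs
import timeit
import unidecode

with codecs.open('big.txt', encoding='utf-8') as f:
    text = f.read()

table = ascii_map()  # build the table once, outside the timed calls

print(timeit.timeit(lambda: text.translate(table), number=10))
print(timeit.timeit(lambda: unidecode.unidecode(text), number=10))
```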
Edit2: Rhubarb, if # -*- coding: utf-8 -*- is causing a SyntaxError, try # -*- coding: cp1252 -*- instead. What encoding to declare depends on the encoding your text editor uses to save the file. Linux tends to use utf-8, and (it seems, perhaps) Windows tends to use cp1252.