What's the fastest way to strip and replace a document of high unicode characters using Python?


I want to replace all the high unicode characters in large documents, such as accented E's, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be Perl:

Is there a faster way to do this in Python than s.replace(...).replace(...).replace(...)...? I have tried it with just a few characters to replace, and stripping the document became really slow.
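For illustration, the kind of chained-replace code I am trying to avoid looks roughly like this (the characters and the function name are just examples):

    # Each .replace() makes a full pass over the whole document,
    # so the cost grows with every character that needs mapping.
    def strip_fancy(text):
        return (text.replace(u'\u2018', u"'")   # left single quote
                    .replace(u'\u2019', u"'")   # right single quote
                    .replace(u'\u201c', u'"')   # left double quote
                    .replace(u'\u201d', u'"')   # right double quote
                    .replace(u'\xe9', u'e')     # e with acute accent
                    .replace(u'\xe8', u'e'))    # e with grave accent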

EDIT, my version of the code below, which doesn't seem to work:

    # -*- coding: iso-8859-15 -*-
    import unidecode

    def ascii_map():
        data = {}
        for num in range(256):
            h = num
            filename = 'x{num:02x}'.format(num=num)
            try:
                mod = __import__('unidecode.' + filename, fromlist=True)
            except ImportError:
                pass
            for l, val in enumerate(mod.data):
                i = h << 8
                i += l
                if i >= 0x80:
                    data[i] = unicode(val)
        return data

    if __name__ == '__main__':
        s = u'“fancy” fancy2'
        print(s.translate(ascii_map()))
    # -*- encoding: utf-8 -*-
    import unicodedata

    def shoehorn_unicode_into_ascii(s):
        # Decompose accented characters, then drop anything non-ASCII.
        return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

    if __name__ == '__main__':
        s = u"éèêàùçÇ"
        print(shoehorn_unicode_into_ascii(s))
        # eeeaucC

Note: as @Mark Tolonen kindly points out, the method above removes some characters entirely, such as ß ‘ ’ “ ”, because they have no ASCII decomposition and encode('ascii', 'ignore') simply drops them. If the above code truncates characters that you want translated, you may have to use the string's translate method to fix those cases manually. Another option is to use unidecode (see the edit below).
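A minimal sketch of that truncation (the sample string is only an illustration):

    # -*- encoding: utf-8 -*-
    import unicodedata

    # ß and the curly quotes have no NFKD decomposition into ASCII,
    # so encode('ascii', 'ignore') silently drops them.
    s = u"Stra\xdfe \u201cfancy\u201d"
    print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))
    # Strae fancy   <- the ß and the quotes are simply gone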

When you have a large unicode string, using its translate method will be much, much faster than using the replace method.
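For example, a small hand-built translation table (ordinals mapped to replacement strings; the particular mappings are just illustrative) lets a single pass over the string handle every substitution at once:

    # A translation table maps unicode ordinals to replacement strings.
    table = {
        0x2018: u"'", 0x2019: u"'",   # curly single quotes -> straight
        0x201c: u'"', 0x201d: u'"',   # curly double quotes -> straight
        0x00e9: u'e', 0x00e8: u'e',   # é, è -> e
    }
    s = u"\u201cfancy\u201d caf\xe9"
    print(s.translate(table))
    # "fancy" cafe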

EDIT: unidecode has a more complete mapping of Unicode codepoints to ASCII. However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.
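For comparison, calling unidecode directly looks like this (output shown as a comment):

    from unidecode import unidecode

    # unidecode looks up an ASCII approximation for each character,
    # one character at a time, in a Python-level loop.
    print(unidecode(u"éèêàùçÇ \u201cfancy\u201d"))
    # eeeaucC "fancy"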

The following helper function uses unidecode's data files and the translate method to get better speed, especially on long strings.

In my tests on 1-6 MB text files, using ascii_map is about 4-6x faster than unidecode.unidecode.

    # -*- coding: utf-8 -*-
    import unidecode

    def ascii_map():
        # Build a dict mapping every high unicode ordinal that unidecode
        # knows about to its ASCII replacement string.
        data = {}
        for num in range(256):
            h = num
            filename = 'x{num:02x}'.format(num=num)
            try:
                mod = __import__('unidecode.' + filename, fromlist=True)
            except ImportError:
                pass
            else:
                for l, val in enumerate(mod.data):
                    i = h << 8
                    i += l
                    if i >= 0x80:
                        data[i] = unicode(val)
        return data

    if __name__ == '__main__':
        s = u"éèêàùçÇ"
        print(s.translate(ascii_map()))
        # eeeaucC
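The table only has to be built once; a rough sketch of reusing it on a whole document (the filename here is just a placeholder):

    import io

    table = ascii_map()                  # build the ordinal -> ASCII map once
    with io.open('big_document.txt', encoding='utf-8') as f:
        text = f.read()
    ascii_text = text.translate(table)   # single pass over the whole string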

EDIT2: @Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try # -*- encoding: cp1252 -*- instead. Which encoding to declare depends on the encoding your text editor uses to save the file. Linux tends to use utf-8, and Windows (it seems) tends toward cp1252.

