What's the fastest way to strip and replace a document of high unicode characters using Python?


I want to replace all the high unicode characters in large documents, such as accented E's, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be Perl:

Is there a faster way to do this in Python than s.replace(...).replace(...).replace(...)...? I have tried it with just a few characters to replace, and stripping the document became really slow.
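For illustration, the kind of chained-replace code I am trying to avoid looks roughly like this (the characters and the function name are just examples):

    # Each .replace() makes a full pass over the whole document,
    # so the cost grows with every character that needs mapping.
    def strip_fancy(text):
        return (text.replace(u'\u2018', u"'")   # left single quote
                    .replace(u'\u2019', u"'")   # right single quote
                    .replace(u'\u201c', u'"')   # left double quote
                    .replace(u'\u201d', u'"')   # right double quote
                    .replace(u'\xe9', u'e')     # e with acute accent
                    .replace(u'\xe8', u'e'))    # e with grave accent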

EDIT, my version of the code below, which doesn't seem to work:

    # -*- coding: iso-8859-15 -*-
    import unidecode

    def ascii_map():
        data = {}
        for num in range(256):
            h = num
            filename = 'x{num:02x}'.format(num=num)
            try:
                mod = __import__('unidecode.' + filename, fromlist=True)
            except ImportError:
                pass
            for l, val in enumerate(mod.data):
                i = h << 8
                i += l
                if i >= 0x80:
                    data[i] = unicode(val)
        return data

    if __name__ == '__main__':
        s = u'“fancy” fancy2'
        print(s.translate(ascii_map()))
    # -*- encoding: utf-8 -*-
    import unicodedata

    def shoehorn_unicode_into_ascii(s):
        # Decompose accented characters, then drop anything non-ASCII.
        return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

    if __name__ == '__main__':
        s = u"éèêàùçÇ"
        print(shoehorn_unicode_into_ascii(s))
        # eeeaucC

Note: as @Mark Tolonen kindly points out, the method above removes some characters entirely, such as ß ‘ ’ “ ”, because they have no ASCII decomposition and encode('ascii', 'ignore') simply drops them. If the above code truncates characters that you want translated, you may have to use the string's translate method to fix those cases manually. Another option is to use unidecode (see the edit below).
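A minimal sketch of that truncation (the sample string is only an illustration):

    # -*- encoding: utf-8 -*-
    import unicodedata

    # ß and the curly quotes have no NFKD decomposition into ASCII,
    # so encode('ascii', 'ignore') silently drops them.
    s = u"Stra\xdfe \u201cfancy\u201d"
    print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))
    # Strae fancy   <- the ß and the quotes are simply gone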

When you have a large unicode string, using its translate method will be much, much faster than using the replace method.
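For example, a small hand-built translation table (ordinals mapped to replacement strings; the particular mappings are just illustrative) lets a single pass over the string handle every substitution at once:

    # A translation table maps unicode ordinals to replacement strings.
    table = {
        0x2018: u"'", 0x2019: u"'",   # curly single quotes -> straight
        0x201c: u'"', 0x201d: u'"',   # curly double quotes -> straight
        0x00e9: u'e', 0x00e8: u'e',   # é, è -> e
    }
    s = u"\u201cfancy\u201d caf\xe9"
    print(s.translate(table))
    # "fancy" cafe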

EDIT: unidecode has a more complete mapping of Unicode codepoints to ASCII. However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.
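For comparison, calling unidecode directly looks like this (output shown as a comment):

    from unidecode import unidecode

    # unidecode looks up an ASCII approximation for each character,
    # one character at a time, in a Python-level loop.
    print(unidecode(u"éèêàùçÇ \u201cfancy\u201d"))
    # eeeaucC "fancy"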

The following helper function uses unidecode's data files and the translate method to get better speed, especially on long strings.

In my tests on 1-6 MB text files, using ascii_map is about 4-6x faster than unidecode.unidecode.

    # -*- coding: utf-8 -*-
    import unidecode

    def ascii_map():
        # Build a dict mapping every high unicode ordinal that unidecode
        # knows about to its ASCII replacement string.
        data = {}
        for num in range(256):
            h = num
            filename = 'x{num:02x}'.format(num=num)
            try:
                mod = __import__('unidecode.' + filename, fromlist=True)
            except ImportError:
                pass
            else:
                for l, val in enumerate(mod.data):
                    i = h << 8
                    i += l
                    if i >= 0x80:
                        data[i] = unicode(val)
        return data

    if __name__ == '__main__':
        s = u"éèêàùçÇ"
        print(s.translate(ascii_map()))
        # eeeaucC
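The table only has to be built once; a rough sketch of reusing it on a whole document (the filename here is just a placeholder):

    import io

    table = ascii_map()                  # build the ordinal -> ASCII map once
    with io.open('big_document.txt', encoding='utf-8') as f:
        text = f.read()
    ascii_text = text.translate(table)   # single pass over the whole string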

EDIT2: @Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try # -*- encoding: cp1252 -*- instead. Which encoding to declare depends on the encoding your text editor uses to save the file. Linux tends to use utf-8, and Windows (it seems) tends toward cp1252.

