algorithm - Searching a 25 GB corpus for a single word


I would like to search Wikipedia's 25 GB corpus for a single word. I've used grep, but it takes a lot of time. Is there an efficient and easy representation that can be searched quickly? Also, I need exact matches only.

Thank you.

You probably want to build an index that maps each word to a list of its locations (byte offsets) in the corpus. The list of words would be sorted alphabetically. You can then keep a secondary index that records where each starting letter (or prefix) begins in this large word list.

  Lazy hash         | Word index              | Corpus
  aaa starts at X   | aaa                     | lorem ipsum dolor
  aab starts at Y   | ...                     | sit amet .....
  aac ...           | and 486, 549, 684, ...  | ...
  ...               |                         |
  zzz ...           |                         |

This is the approach advocated by the natural language professor in my department (we did this exercise as a lab in an algorithms course).
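
To make the layout above concrete, here is a minimal Python sketch of the idea. It is not the professor's actual code. The function names (`build_index`, `lookup`), the ASCII-only tokenizer, and the choice to key the secondary index on the first byte of each word are all assumptions made for illustration. It also keeps everything in memory, which would likely not work as-is for a 25 GB corpus, where the word list and offset lists would need to live on disk.

```python
import bisect
import re
from collections import defaultdict

# Assumption: a "word" is a run of ASCII letters or digits.
WORD_RE = re.compile(rb"[A-Za-z0-9]+")


def build_index(corpus_path):
    """Scan the corpus once; map each lowercased word to its byte offsets."""
    postings = defaultdict(list)
    offset = 0  # byte offset of the current line within the file
    with open(corpus_path, "rb") as f:
        for line in f:
            for m in WORD_RE.finditer(line):
                postings[m.group().lower()].append(offset + m.start())
            offset += len(line)

    # Word index: all distinct words, sorted (byte order).
    words = sorted(postings)

    # Secondary index ("lazy hash"): first byte -> position in `words`
    # of the first word that starts with that byte.
    first_letter = {}
    for i, w in enumerate(words):
        first_letter.setdefault(w[:1], i)
    return words, postings, first_letter


def lookup(word, words, postings, first_letter):
    """Return the byte offsets of exact matches of `word`, or []."""
    key = word.lower().encode("utf-8")
    lo = first_letter.get(key[:1])
    if lo is None:
        return []
    # Binary search in the sorted word list, starting at the letter's block.
    i = bisect.bisect_left(words, key, lo)
    if i < len(words) and words[i] == key:
        return postings[key]
    return []
```

Once the index is built, each query costs a dictionary lookup plus a binary search instead of a full scan of the corpus, and because whole words are compared, a query for `lor` does not match `lorem`. Each returned offset can be passed to `f.seek()` to read the surrounding text.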

