c# - How can I parse free text (Twitter tweets) against a large database of values?


Suppose I have a database of 500,000 records, each representing, say, an animal. What would be the best way to parse 140-character tweets to identify records matching the animal's name? For example, in this string ...

"I went to the forest today and could not believe my eyes: I saw a huge polar bear having a picnic with a red squirrel."

... I would like to flag the phrases "huge polar bear" and "red squirrel", because they appear in my database.

It strikes me as a problem that has probably been solved many times before, but from where I'm sitting it looks prohibitively expensive: surely checking every DB record for a match against the string is a crazy way to go about this.

Can someone with a computer-science degree put me out of my misery? I'm working in C#, if it makes a difference. Cheers!

Assuming the database is fairly static, use a Bloom filter. This is a degenerate form of hash table that stores only bits indicating the presence of a value, without storing the value itself. It is probabilistic, since hashes can collide, so every hit needs a full lookup to confirm. But a 1 MB Bloom filter holding 500,000 entries can have a false-positive rate as low as 0.03%.
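A minimal sketch of such a filter in C# (hypothetical code, not from the original answer; the double-hashing scheme and the FNV second hash are my own choices here):

```csharp
using System;
using System.Collections;

// Minimal Bloom filter sketch. It stores only presence bits, never the
// values themselves, so a positive answer is only "maybe" and must be
// confirmed with a real database lookup.
class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilter(int sizeInBits, int hashCount)
    {
        bits = new BitArray(sizeInBits);
        this.hashCount = hashCount;
    }

    // FNV-1a, used as a second, independent hash.
    private static int Fnv(string s)
    {
        unchecked
        {
            int h = (int)2166136261;
            foreach (char c in s) h = (h ^ c) * 16777619;
            return h;
        }
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int Index(string item, int i)
    {
        unchecked
        {
            int h2 = Fnv(item) | 1; // force odd
            uint combined = (uint)(item.GetHashCode() + i * h2);
            return (int)(combined % (uint)bits.Length);
        }
    }

    public void Add(string item)
    {
        for (int i = 0; i < hashCount; i++)
            bits[Index(item, i)] = true;
    }

    public bool MightContain(string item)
    {
        for (int i = 0; i < hashCount; i++)
            if (!bits[Index(item, i)]) return false; // definite miss
        return true; // possible hit: confirm against the database
    }
}
```

You would load every animal name into the filter once at startup, then query it per token; only the rare "maybe" answers cost a real database lookup.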

Some maths: achieving that low a rate requires up to 23 hash codes, each with 23 bits of entropy, for a total of 529 bits. Bob Jenkins' hash function generates 192 bits of entropy in a single pass (if you also recover the internal state variables that hash() does not normally return; otherwise it behaves as a "normal" hash), so at most three passes are needed. And because of the way Bloom filters work, you don't need all that entropy on every query: most lookups will report a miss well before reaching the 23rd hash code.
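The quoted 0.03% can be sanity-checked with the textbook approximation for a Bloom filter's false-positive rate, p ≈ (1 − e^(−k·n/m))^k for m bits, n entries, and k hash functions (this formula is standard Bloom-filter theory, not taken from the answer above). For a 1 MB (2^23-bit) filter over 500,000 entries, the rate does indeed bottom out in the neighbourhood of 0.03%:

```csharp
using System;

// Standard approximation of a Bloom filter's false-positive rate:
// p ≈ (1 - e^(-k*n/m))^k  for m bits, n entries, k hash functions.
class BloomMath
{
    public static double FalsePositiveRate(long mBits, long nEntries, int kHashes)
    {
        // Fraction of bits expected to be set after n insertions.
        double fillFraction = 1.0 - Math.Exp(-(double)kHashes * nEntries / mBits);
        // A false positive needs all k probed bits to be set.
        return Math.Pow(fillFraction, kHashes);
    }

    static void Main()
    {
        long m = 8_388_608; // 1 MB = 2^23 bits
        long n = 500_000;
        for (int k = 4; k <= 24; k += 4)
            Console.WriteLine($"k={k}: p={FalsePositiveRate(m, n, k):P4}");
    }
}
```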

Edit: You will obviously have to parse the words out of the text first. Finding every instance of /\b\w+\b/ should do for a first version.
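In C#, that pattern maps directly onto System.Text.RegularExpressions (the helper class and lowercasing are my additions for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Pull word tokens out of a tweet using the \b\w+\b pattern.
class Tokenizer
{
    public static string[] Words(string text)
    {
        var matches = Regex.Matches(text, @"\b\w+\b");
        var words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            words[i] = matches[i].Value.ToLowerInvariant(); // normalize case for lookups
        return words;
    }
}
```

For example, Tokenizer.Words("I saw a huge polar bear!") yields ["i", "saw", "a", "huge", "polar", "bear"].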

To match phrases, you would have to test every run of n consecutive words (a.k.a. n-grams), where n ranges up to the word count of the longest phrase in your database. You can make this much cheaper by also adding each individual word of every phrase to a second Bloom filter, and only testing those n-grams in which every word passes this second filter.
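A sketch of that candidate generation, where the hypothetical wordMayMatch delegate stands in for a lookup in the second, word-level Bloom filter (single words, n = 1, are included as the degenerate case):

```csharp
using System;
using System.Collections.Generic;

class NGrams
{
    // Enumerate every run of 1..maxN consecutive words, but only yield
    // candidates in which every word passes the word-level test
    // (e.g. membership in the second Bloom filter).
    public static IEnumerable<string> Candidates(
        string[] words, int maxN, Func<string, bool> wordMayMatch)
    {
        for (int start = 0; start < words.Length; start++)
        {
            for (int n = 1; n <= maxN && start + n <= words.Length; n++)
            {
                // If the newest word fails, every longer n-gram starting
                // here contains it too, so we can stop extending.
                if (!wordMayMatch(words[start + n - 1]))
                    break;
                yield return string.Join(" ", words, start, n);
            }
        }
    }
}
```

Each surviving candidate would then be checked against the main phrase filter, and finally confirmed with a real database lookup.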
