Phrase analysis and expansion with Ruby

The idea is to take a phrase and analyze it for use in information retrieval (IR). We need to tokenize it into words, possibly transform some of the tokens, and possibly expand some tokens into subphrases. This class lets you register lambdas to perform transformations, substitutions, and expansions. An expansion can carry a numerical value representing the cost of the operation; this is intended for raising or lowering the scores of matches in a hypothetical IR application. For example, given the phrase "joe's sushi & bait-shop shack", assume I want to tokenize on whitespace, replace the ampersand with the word "and", and create word variants for the hyphenated words and the words containing apostrophes. See the last spec for an example of the Ruby data structure this class generates.
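To make the flow concrete, here is a minimal sketch of how such a class might be wired up. It is not the class described above: the names PhraseSketch, Variant, substitute, expand, and analyze are hypothetical, the cost of 0.5 is arbitrary, and the array-of-variants structure it returns is an assumption rather than the structure shown in the spec.

```ruby
# Hypothetical sketch: class name, method names, and the returned structure
# are assumptions for illustration, not the original class's API.
class PhraseSketch
  # One candidate form of a token, plus the cost of producing it.
  Variant = Struct.new(:text, :cost)

  # The optional block is the tokenizer; the default splits on whitespace.
  def initialize(&tokenizer)
    @tokenizer = tokenizer || ->(phrase) { phrase.split }
    @substitutions = []
    @expansions = []
  end

  # Register a block that returns a replacement token, or nil to keep the token.
  def substitute(&block)
    @substitutions << block
    self
  end

  # Register a block that returns an array of variant strings for a token.
  # Every variant it produces is tagged with the given cost.
  def expand(cost: 1.0, &block)
    @expansions << [cost, block]
    self
  end

  # Returns one array of Variants per token; the first entry is the
  # (possibly substituted) token itself at cost 0.0.
  def analyze(phrase)
    @tokenizer.call(phrase).map do |token|
      token = @substitutions.reduce(token) { |t, sub| sub.call(t) || t }
      variants = [Variant.new(token, 0.0)]
      @expansions.each do |cost, expansion|
        Array(expansion.call(token)).each do |text|
          variants << Variant.new(text, cost)
        end
      end
      variants.uniq(&:text)
    end
  end
end

analyzer = PhraseSketch.new
analyzer.substitute { |t| t == "&" ? "and" : nil }
# Hyphenated words: offer a joined form and a two-word form.
analyzer.expand(cost: 0.5) { |t| t.include?("-") ? [t.delete("-"), t.tr("-", " ")] : [] }
# Words with apostrophes: offer the form without the apostrophe and the stem before it.
analyzer.expand(cost: 0.5) { |t| t.include?("'") ? [t.delete("'"), t.split("'").first] : [] }

result = analyzer.analyze("joe's sushi & bait-shop shack")
```

In this sketch, result holds five arrays, one per token. The entry for "bait-shop" contains the original token at cost 0.0 followed by "baitshop" and "bait shop" at cost 0.5, and the "&" token has been replaced by "and" before any expansion runs. A scoring layer could then subtract each variant's cost from the score of a match on that variant.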
