## Tuesday, June 2, 2009

### Scalable Algorithms 2

In my last post, I told of a programming challenge taken, and my first go at it. While the algorithm I came up with had a certain cleverness, there was much to criticize about it.

Another friend, who had received the same challenge, attacked the problem from a different perspective. His algorithm was not as liberal with the input as mine, as it accepted only abbreviations or single words. It would match a prefix of a word other than the first one, though, which is something mine didn't.

For instance, if there was an option "create application shortcuts", it would match that against "cas" or any prefix of it (eg. "ca"), "create" or any prefix of it, "application" or any prefix of it and "shortcuts" or any prefix of it.

Map of Prefixes

I won't post his algorithm of it here (though I'd be happy to link to a post of his, if he does so), but the idea was precomputing a couple of maps: one of full abbreviations into set of menu options, and one of words into set of menu options.

The cleverness of it, though, is that he did not precompute all possible prefixes. Instead, he used ranges, close-ended at the input being matched, and open ended at the input being matched with the last character replaced by it's successor, and applied that range to the keys of each map. The result was that all keys to which the input was a prefix of would be selected.

Now, there were two things about his algorithm I didn't like. First, it would use mutable sets and maps when precomputing. I was trying to avoid that. Second, it would not match a concatenation of the prefix of different words, so appshort would match the example previously given.

So I decided to take a shot at it, which, sadly, did not benefit from the range cleverness. As before, I'll follow the code with my criticisms of it.

`val wordsep = "[ :,;]"def words(s : String) = s.split(wordsep)def wordPrefixes(s : String) = (0 to s.length).map(s.substring(0,_))def combinedPrefixes(s : String) =  words(s).foldLeft(Set(""))((set, word) =>    set.flatMap(x => wordPrefixes(word).map(y => x + y)))def reversePhraseMap(l : List[String]) =  l.foldLeft(Map[String,Set[String]]())((map, phrase) =>    map + (phrase -> combinedPrefixes(phrase)))def allPrefixes(m : Map[String,Set[String]]) = m.values.reduceLeft(_ ++ _)def phraseMap(m : Map[String,Set[String]]) =  Map[String,Set[String]]() ++ allPrefixes(m).  map(p => p -> m.foldLeft(Set[String]())((s, x) =>    if(x._2.contains(p)) s + x._1 else s))def indexPhrases(l : List[String]) = phraseMap(reversePhraseMap(l))def searchPhrases(l : List[String]) = {  val index = indexPhrases(l.map(_.toLowerCase))  (s : String) => index(s.toLowerCase)}`

The way you use it is you pass a list of menu options to searchPhrases, and searchPhrases returns a function which will look up an input in the list and return the result.

Well. I'm not particularly fond of this algorithm, but the code is much more obvious, in my opinion, than the first one. I had to compute the reversed map first (ie, the menu options pointing to a set of their abbreviations), and then reverse that. I wonder if I shouldn't have reversed the definitions name, though (pun half-intended).

Now, this code is 19 lines vs the first's 34, but this isn't really enough of a difference when considers code complexity. But the first code had 4 "if" statements, all of them followed by "else" statements, one "match" statement and, to boot, a couple of mutually recursive functions. This one has a single "if". This is functional programming at it's best (well, almost best :). In fact, the whole code could be written as a single expression, a single statement:

`def searchPhrases(l : List[String]) =   (s : String) => (    Map[String,Set[String]]() ++     (      (        l        map (_ toLowerCase)        foldLeft Map[String,Set[String]]()      ) ((map, phrase) =>        map + (phrase ->           (            (              phrase               split "[ :,;]"              foldLeft Set("")            ) ((set, word) =>              set flatMap (x => (                  0                  to word.length                  map (word substring (0, _))                  map (x + _)                )              )            )          )        )      ).values      reduceLeft (_ ++ _)      map (p =>         p -> (          (            (              (                l                map (_.toLowerCase)                foldLeft Map[String,Set[String]]()              ) ((map, phrase) =>                map +                 (phrase -> (                  (                    phrase                    split ("[ :,;]")                    foldLeft Set("")                  ) ((set, word) =>                    (                      set                      flatMap (x =>                         (                          0                           to word.length                          map (word substring (0, _))                           map (x + _)                        )                      )                    )                  )                ))              )            )            foldLeft Set[String]()          ) ((s, x) =>            if (x._2 contains p) s + x._1 else s          )        )      )    )  ) (s toLowerCase)`

Now, that's a long, long expression to keep track of, and there's some repetition in it. Broken up in smaller definition helps manage the size, and the definition's names help understand what is being done.

Still, the fact the lack of control structures helps one to focus on what is being done. You don't have to keep backtracking the code and thinking what happens on that other condition. So, while both algorithms deal in immutable objects only, the more you keep to high order functions in place of control structures, the clearer becomes the code.

This algorithm has a fast search, and it can catch any variation of prefixes, as long as no words are inverted. To do that it spends some time building the map, and the map itself is very wasteful of memory. It doesn't make take advantage of character-by-character input, but it doesn't really need it. It has no heuristics to order the list of matched menu options, though, and no easy way to retrofit it. That I can see, at least.

At this point, I decided that what I wanted was a DFA matcher, but this took me to an unexpected place first. The next algorithm has 9 lines of code, two being def lines without code, one being a constant definition, and one being "}". At that, it matches everything this algorithm does. Can you beat that before my next post? :-)