Posted on Sun 03 June 2012

Decrypting scrambled words

I'm sure you've all seen posts like

Olny srmat poelpe can raed tihs. I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, ...

Now this is mildly amusing, but not all that interesting on itself. However, I began to wonder - is it possible to automatically unscramble texts like the above? Surprisingly, it (almost) is!

Of the 72945 unique words in my /usr/share/dict/words (not counting possesive forms, eg "storekeeper's"), only 630 have a non-unique scrambled form, if you ignore plural forms. What do I mean by this? Well, if you take the letters in those words and sort them, there is more than one word with the same sequence of sorted letters. However, for 606 of those, there are just two possible options. Only for 24 words there are groups of 3 words with the same possible representation, 6 of which are quite similar:

bread, bared, beard
field, filed, flied
bleary, barely, barley
petard, parted, prated
staled, salted, slated
staling, salting, slating
storied, sortied, steroid
paternal, prenatal, parental

If you include plural forms (ie anything with "s" as last letter), you get considerably more duplicates - 9 groups of 4 words each, 22 groups of 3 and 585 of 2, for 1272 non-unique words in total.

It turns out that this is not quite enough to allow for naive unscrambling - the snippet from above is turned into this (thanks for "Zzzz" pointing out how to improve it a bit):

only smart people can read this i cloud not believe that i cloud actually understand what i was reading the phenomenal power of the human mind

Enough to get the meaning of the text, but not perfect. The accuracy would certainly improve if my algorithm always chose the most common word, but, alas, it's just a few lines of scala (also, my dictionary contains no information about word frequency):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
object Scramble extends App {
  val words = io.Source.fromFile("/usr/share/dict/words").getLines()
  val uniques = words.filter(! _.contains("'")).toList

  // group words by what you get when you sort their individual characters
  val scrambled_lookup = uniques.map(w => w.toList.sorted.mkString).zip(uniques).groupBy(_._1).map {
    case (scrambled, ws) => (scrambled, ws.map(_._2))
  }
  // words whose sorted versions collide
  val non_uniques = scrambled_lookup.toList.map(_._2).sortWith(_.length > _.length).filter(_.length > 1)

  // words which have non-unique sorted representation even if you fix first and last letter
  val not_recognizable = non_uniques.map {
    // remove the first filter to include plural words
    case anagrams => anagrams.filter(_.last != 's').combinations(2).toList.filter {
      case first :: second :: nil => 
        first.head == second.head && first.last == second.last
    } .flatten.groupBy(w => w).toList.map(_._1)
  } .filter(_.length > 0)

  // all the non-unique words
  println(not_recognizable.sortWith(_.length > _.length))

  // number of words with same sorted representation, grouped by size of group
  println(not_recognizable.groupBy(l => l.length).toList.map {
    case (length, list_of_ws) => (length, list_of_ws.length)
  })

  def unscramble(text: String) = text.replaceAll("[\\.,]", "").split(" ").map(_.toLowerCase).map {
    case w => scrambled_lookup(w.toList.sorted.mkString).filter{
      case p => p.head == w.head && p.last == w.last
    }.head
  }.mkString(" ")

  val txt = "Olny srmat poelpe can raed tihs. I cluod not blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The pheonmneal pweor of the hmuan mnid"
  println(unscramble(txt))
}

Play around with it if you like - maybe you know of a better dictionary file?

© Julian Schrittwieser. Built using Pelican. Theme by Giulio Fidente on github. .