Thinking algebraically: a functional-programming dividend that pays during your imperative-programming day job

Posted by Tom Moertel Thu, 21 Aug 2008 01:50:00 GMT

Although at work I code mostly in Python – a language from which lambda and map were nearly removed – I still find that functional-programming experience has its benefits. One of the “functional-programming dividends” I notice most often is insight gained from considering problems from an algebraic perspective.

Recently, for example, I had a small parsing problem. I had to write code that, given a simple grammar specification as input, emits a regular expression that matches the language generated by the grammar.

Here’s a simplified version of the problem. A grammar specification is limited to a series of one or more atoms. For example, “a b c” represents the atom “a”, followed by the atom “b”, followed by the atom “c”. To generate the grammar, the series of atoms is interpreted such that each atom (except the last) generates a production rule of the following form:

atom_rule ::=
  <the literal atom> (SPACE <the next rule> | NOTHING)

(SPACE represents literal white space and NOTHING represents an empty string.) The grammar as a whole is rooted in the first atom’s rule.

Thus the specification “a b c” represents the following grammar:

grammar ::= a_rule
a_rule  ::= "a" (SPACE b_rule | NOTHING)
b_rule  ::= "b" (SPACE c_rule | NOTHING)
c_rule  ::= "c" 

Note that the final atom’s production matches only the literal atom itself: it has no following rule on which to chain.

The grammar, in turn, generates the following language:

a
a b
a b c

Thus, given the grammar specification “a b c”, my code had to produce a regular expression that would match “a”, “a b”, or “a b c”.

At this point, please stop for a moment and think about this little programming exercise. Do any solutions leap to mind? How would you approach the problem? Form your opinions now, because I’m going to ask you about them later. (If you’re feeling especially caffeinated, try coding a solution before reading on.)

For me, the insight that made the exercise easy was seeing that the grammar is given by folding a (suitably defined) right-associative binary operator through the series of atoms. The relationship might be easier to see if you substitute away the intermediate production rules from the grammar above:

grammar ::= "a" (SPACE "b" (SPACE "c" | NOTHING) | NOTHING)

If you squint past the SPACE and NOTHING terms, you’ll see that the grammar has the form

(a + (b + (c)))

The + is a binary operator that generates the parts we squinted away. Once you see what’s going on structurally, the operator is easy to define:

x + y = x (SPACE y | NOTHING)

Compare the operator’s definition with that of the atom_rule I presented at the beginning of the article. They’re structurally the same: the operator’s x and y are the atom rule’s <the literal atom> and <the next rule>.

Now all that remains is to generalize the “a b c” formula into a general formula that works for arbitrary grammar specifications. Fortunately, this work has already been done for us. The generalized formula is nothing more than a right fold. In Haskell, the particular right-fold flavor we want is called foldr1.

Given a list of atoms, we can use foldr1 to construct its grammar as follows:

mk_grammar atoms = foldr1 (+) atoms

But Python, our implementation language, does not offer a foldr1 function. This wrinkle, however, is another thing we can iron out by thinking algebraically. Python doesn’t have foldr1, but it does have a reduce function, which represents a left fold, equivalent to Haskell’s foldl’ or foldl1’. Because our + operator is strict and our list of atoms is finite, we can take advantage of the following identity:

foldr1 (+) xs == foldl1 (flip (+)) (reverse xs)

That is, we can convert a right fold into a left fold by flipping the arguments of the operator and operating on the list in reverse. Thus we can implement the fold we want in terms of the fold we have:

# Python code

def foldr1(f, xs):
    return reduce(flip(f), reversed(xs))

def flip(f):
    def g(x, y):
        return f(y, x)
    return g

Now writing a Python-based solution is straightforward:

def grammar_spec_to_re(spec):
    atoms = grammar_spec_to_atoms(spec)
    atom_res = map(atom_to_re, atoms)
    grammar_re = r'\A%s\Z' % foldr1(op, atom_res)
    return grammar_re

def op(x, y):
    # x + y = x (SPACE y | NOTHING)
    return r'%s(\s+%s)?' % (x, y)

def grammar_spec_to_atoms(spec):
    return spec.split()

def atom_to_re(atom):
    return re.escape(atom)

Using our solution, let’s compile the “a b c” grammar specification into its corresponding regular expression:

>>> print grammar_spec_to_re('a b c')
\Aa(\s+b(\s+c)?)?\Z

And that’s basically how I solved the problem.

To play around with the solution, here’s a small helper class that compiles a grammar specification into a regular expression and then tests strings for matching the regexp:

class GrammarMatcher(object):
    def __init__(self, spec):
        self.re = re.compile(grammar_spec_to_re(spec))
    def __call__(self, s):
        return not not self.re.match(s)

Now, let’s try out the regular expression generated for the grammar specification “a b c” :

>>> matcher = GrammarMatcher('a b c')
>>> matcher('')
False
>>> matcher('a')
True
>>> matcher('ab')
False
>>> matcher('a b')
True
>>> matcher('a  b')
True
>>> matcher('a b c')
True
>>> matcher('a b c d')
False
>>> matcher('a c')
False
>>> matcher('b')
False

Now, those questions I promised. If you’re a functional programmer, did a fold-based solution leap out at you? (Did you think of the problem in terms of folds?) If you’re not a functional programmer, how did you see the problem? Did the solution above seem twisted, confusing, or overly clever?

(There are no right or wrong answers. I’m just curious about how people with different backgrounds view the problem.)

Update: Edited to clarify that the problem is to convert a grammar specification into a regular expression, not just test whether a string matches a specified grammar.

Posted in
Tags , , ,
20 comments
no trackbacks
Reddit Delicious

Comments

  1. Anonymous said about 6 hours later:

    The solution that came immediately to my mind was:

    regEx = compile . concat . intersperse ”|” . map escapeRegExChars . drop 1 . inits . words

    (untested)

    But if you want to generate /small/ regexs then I would certainly have approached it through a right fold. Though I probably wouldn’t have gone through the process of thinking of it as an “algebra”.

  2. Anon said about 7 hours later:

    I’ve read about half of The Haskell School of Expression and I use Ruby for my day job. I had little trouble following your solution, but it certainly didn’t jump out at me. The solution seems quite elegant to me.

  3. Jules Jacobs said about 9 hours later:

    I’d use set operations on strings and that T is a subset of P if the intersection of sets P and T is equal to T: P \ (P \ T) = T.

    class String
      def -(other) # using - as a poor man's \
        # remove other from self, i.e. self - other
        self.sub(other, '')
      end
    end
    
    def matches?(text,pattern)
      # remove whitespace and insert start-of-string marker
      t,p = [text,pattern].map{|s| '. ' + s.split.join(' ')}
      # perform subset test and reject empty text
      p - (p - t) == t && t != '. '
    end

    Because we only want to match from the beginning we have to use a start-of-string marker (I used ’.’).

    matches?('a b', 'a b c') => true
    matches?('a c', 'a b c') => false
  4. Tom Moertel said about 11 hours later:

    Jules Jacobs, what happens if an atom contains more than one character? For example, consider the grammar specification “a bc d”, which generates the strings “a”, “a bc”, and “a bc d” :

    ?> matches?('a b', 'a bc d')
    => true
    

    Cheers,
    Tom

  5. Derek Elkins said about 11 hours later:

    I wouldn’t have bothered going through a grammar form since the original is already closer to the regex than the grammar. The foldr1 solution was pretty much immediately apparent to me at the point where you suggested thinking of a solution. On the other hand, I already tend to unconsciously think of alternation and sequencing as operators in a ring-like structure, though I don’t think that was particularly significant here.

  6. Jules Jacobs said about 12 hours later:

    Good catch Tom! It can be fixed by adding an end-of-string marker too.

    def matches?(text,pattern)
      t,p = [text,pattern].map{|s| '.' + s.split.join('.') + '.'}
      p - (p - t) == t
    end

    This way we don’t even need to check if the text is empty. Is this code correct now?

    Of course you’d use something like this in a real program:

    class Array
      def all_with_index?
        each_with_index do |elem,i|
          return false unless yield elem,i
        end
        return true
      end
    
      def prefix_of?(other)
        all_with_index?{|elem,i| other[i] == elem}
      end
    end
    
    def matches?(text,pattern)
      text.split.prefix_of?(pattern.split) and not text.empty?
    end
  7. Graham Fawcett said about 12 hours later:

    Enjoyable article, thanks. I thought it was funny that in a thinking-functionally article, you implemented GrammarMatcher as a class, instead of a functional one-liner like grammar_matcher = lambda re: lambda s: bool(re.match(s)) ;-)

  8. Larry Yogman said about 15 hours later:

    Using higher-order programming to reverse both the op argument order and the list order seems like a long way to go just to reuse Python’s fold.

    I find it far more natural to just use simple recursion to walk down to the end of the list, and then build up the result by applying op while popping the call stack.

  9. bob said about 16 hours later:

    The solution is way obscure and will only work for this particular kind of trivial grammar that consumes one symbol after another. I would fire you and hire Larry, whose approach is maintainable and generalizable.

  10. Simple said about 16 hours later:

    I’m not sure I understand what the complexity of the grammar you’re going for is, but given what you’ve shown (including the a bc d kind), the simple solution is (in pseudo pseudo code):

     var allowed = array( a , b , cd, e )
     var values = input.split( ' ' )
    
     // the code:
    
    function is_legit( values, allowed )
     values[0] == allowed[0] && 
     (values != nil && allowed != nil) && 
      is_legit(unshift(values), unshift( allowed ))
    
    I threw in the tail recursion to make your cut of elegance, but really, it could just as easily be done using an index. If your grammar isn’t a linear and is a graph this method is also much more adaptable. It’s basically a finite state graph walk (in the park).
  11. Simple said about 17 hours later:

    Holy batman: the formating on that post!

    here’s the code again:

    var allowed = array( a , b , cd, e )
    
    var values = input.split(' ')
    
    function is_legit( values, allowed )
     values[ 0 ] == allowed[ 0 ] && 
     (values != empty && allowed != empty ) &&
     is_legit( values.unshift, allowed.unshift )
    
  12. Simple said about 17 hours later:

    Last update (I’m so used to being able to edit my posts):

    for proper termination, values == empty should return true if we want sub sequences to test true. This is “left to the reader” as it’s trivial to do.

  13. Jules Jacobs said about 17 hours later:
    s this code correct now?

    Still not correct: matches?(‘b c’, ‘a b c’) => true, should be false.

    Corrected version:

    def matches?(text,pattern)
      t,p = [text,pattern].map{|s| '#.' + s.split.join('.') + '.'}
      p - (p - t) == t
    end

    That’s what I get for trying to be clever…

  14. Richard said about 22 hours later:

    I’ve got a mostly-imperative background, plus a couple of months of learning Haskell. My first thought, as a quick hack (that is, something I don’t intend to maintain for a long period) is this:

     def spec_to_re(spec):
       x = spec.split(' ')
       return '( '.join(x) + ')?' * (len(x)-1) + '$'
    
    In Haskell I’d probably write:
     spec_to_re (' ':xs) = "( " ++ spec_to_re xs ++ ")?" 
     spec_to_re (x:xs) = x:spec_to_re xs
     spec_to_re [] = "$" 
    

    ... and maybe later “simplify” it to …

     spec_to_re = (++ "$") . foldr1 (\a b -> a ++ "( " ++ b ++ ")?") . words
    
  15. Richard said 1 day later:

    Why does it have to be a regexp? There’s a much simpler non-regexp solution:

    def match(pat, str): return (pat + " ").startswith(str + " ")
    
    (with a whitespace-tidying pass first if necessary).

  16. Tom Moertel said 1 day later:

    Richard, the problem is defined this way to capture the interesting part of a problem I encountered at work. In the real problem, the atoms themselves represented regular expressions (and they could match whitespace, too), and the computed, whole-grammar regular expression was used not only for match-testing but also for binding: to capture the portion of a whole-grammar match that matched each atom. In the article, I was able to strip away that complexity, but I had to ask for a regular expression as output to preserve the interesting bit.

    Cheers,
    Tom

  17. SICP said 1 day later:

    Have you read your SICP?

  18. Larry Yogman said 1 day later:

    SICP is a classic, beautiful book.

    The Structure and Interpretation of Computer Programs

    My recursive list walking above was based on instincts laid down by a CS education that started from that book. But on further reflection, I realized that I actually didn’t simplify enough.

    My recursive approach, the KISS functional programmer’s solution, was appropriate for a linked list, but not for Python’s (array-backed?) indexed lists. I don’t think Python lists provide an efficient sublist view equivalent to the LISP/Scheme “cdr”.

    The best thing to do at a Python shop, for performance and maintainability, would probably be the KISS procedural programmer’s approach, a loop with the list index running backwards. BORING. But right. Whatever language you’re working in, learn to write like native – then the natives will be less inclined to make war against you.

  19. enishada said 7 days later:

    Nice solution. Curious about the class of grammars for which you are able to generate a matching RE.

  20. raphael said 185 days later:

    Hi Tom,

    re = "\A%s%s\Z" % ("(\s+".join(words) ,(len(words)-1)*")?")
    I’d probably go for that one, here it seems using an imperative approach seems to result in less/easier understandable code.

    Sometimes I think using Haskell makes you (a general you) a worse imperative programmer rather than a better one. But I know I can’t stop anyway ;-)

    regards,

    raphael

    p.s. I really enjoy reading your blog, keep writing!

Trackbacks

Use the following link to trackback from your own site:
http://tea.moertel.com/articles/trackback/766

(leave url/email »)

   Comment Markup Help Preview comment