The inner beauty of tree traversals

Posted by Tom Moertel Fri, 27 Jan 2012 04:24:00 GMT

This has been done a million times before, but if you haven’t seen it, it’s pretty neat. Let’s say you have a simple recursive data structure like a binary tree:

module Tree where

data Tree a = Empty
            | Node a (Tree a) (Tree a)
            deriving (Eq, Show)

-- some binary trees

t0 = Empty
t1 = Node 1 Empty Empty
t3 = Node 2 (Node 1 Empty Empty) (Node 3 Empty Empty)

Now say you want to traverse the nodes, in “preorder.” So you write a traversal function. It folds a binary operator f through the values in the tree, starting with an initial accumulator value of z. First, it visits the current node, then the left subtree, and finally the right subtree, combining each value it encounters with the current accumulator value to get the next accumulator value. The final accumulator value is the result.

Spelling out the steps, it looks like this:

preorder_traversal f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = let z'   = f v z     -- current node
                            z''  = go l z'   -- left subtree
                            z''' = go r z''  -- right subtree
                        in z'''

But if you wanted an “inorder” traversal instead, you’d write a slightly different function, visiting the left subtree before the current node:

inorder_traversal f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = let z'   = go l z    -- left subtree
                            z''  = f v z'    -- current node
                            z''' = go r z''  -- right subtree
                        in z'''

To see how the two traversal functions work, let’s use both to flatten the 3-node tree t3 we defined earlier.

flatten traversal = reverse . traversal (:) []

test0i = flatten inorder_traversal t3   -- [1,2,3]
test0p = flatten preorder_traversal t3  -- [2,1,3]

Great.

But now you start thinking about post-order traversal. Do you want to write a third traversal function that’s almost the same as the first two?

So you go back to the inorder traversal and start staring real hard at that let statement. Can you rewrite it? Yes:

inorder_traversal2 f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = go r . f v . go l $ z

And now you’ve got it. Just by reordering the applications of (go l), (f v), and (go r), you can control the traversal.

Which leads to a general traversal function that lets some other function control that ordering:

traverse step f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = step (f v) (go l) (go r) z

Doesn’t that look beautiful?

Now you can just spell out your desired traversals:

preorder   = traverse $ \n l r -> r . l . n
inorder    = traverse $ \n l r -> r . n . l
postorder  = traverse $ \n l r -> n . r . l

A test drive:

test1p = flatten preorder t3   -- [2,1,3]
test1i = flatten inorder t3    -- [1,2,3]
test1o = flatten postorder t3  -- [1,3,2]

And you’re happy.

But can you go further? What if your trees are binary search trees and you just want to find the minimum or maximum value? Can you just traverse to the left or right?

Yes:

leftorder  = traverse $ \n l r -> l . n
rightorder = traverse $ \n l r -> r . n

treemin = leftorder min maxBound
treemax = rightorder max minBound

test2l = treemin t3 :: Int  -- 1
test2r = treemax t3 :: Int  -- 3

Isn’t that neat?

Posted in
Tags , , , ,
3 comments
no trackbacks
Reddit Delicious

A flyweight mocking helper for Python

Posted by Tom Moertel Mon, 07 Nov 2011 05:05:00 GMT

Recently, I needed a lightweight mocking solution when adding unit tests to some existing Python code. I could have used one of the many Python mocking libraries, but I only had to replace a few module functions during testing, so I just wrote a tiny mocking helper rather than add a dependency to the project.

The helper is simple and surprisingly versatile:

import contextlib
import functools

@contextlib.contextmanager
def mocked(func,
           expecting=None, returning=None, raising=None,  # specs
           replacement_logic=None, called=1):
    """Stub out and mock a function for a yield's duration.""" 

    if (returning, raising, replacement_logic).count(None) < 2:
        raise ValueError("returning, raising, and replacement_logic " 
                         "are incompatible with each other")

    # default logic for implementing mock fn: return or raise per specs
    def default_logic(*_args, **_kwds):
        if raising:
            raise raising
        return returning

    # prepare wrapper to replace mocked function for duration of yield
    invocations = [0]
    @functools.wraps(func)
    def replacement(*args, **kwds):
        if expecting is not None:
            assert expecting == (args, kwds)  # did we get expected args?
        invocations[0] += 1
        return (replacement_logic or default_logic)(*args, **kwds)

    # replace mocked function, yield to test, and then check & clean up
    module = sys.modules.get(func.__module__)
    setattr(module, func.__name__, replacement)
    try:
        yield  # give control back to test for a while
        assert invocations[0] == called  # was mock called enough?
    finally:
        setattr(module, func.__name__, func)

def uncalled(func):
    """Require that a function not be called for a yield's duration.""" 
    return mocked(func, called=0)

The idea is to wrap test code that requires mocked external functions with a mocking helper. Here’s an example:

def test_ml_errors_must_be_reported(self):
    """When an error occurs, it must be reported; nothing must be sent.""" 
    data = self._request_data()
    exception = mls.MlError()
    exception.exc = '[[[error description]]]'
    with mocked(mls.subscriber_count, raising=exception):
        with uncalled(sms.send):
            resp = self.app.request("/mailings", method='POST', data=data)
    self.assertEqual(resp.status, '200 OK')
    self.assertIn(exception.exc, resp.data)  # must be reported

In this test method, I’m making a simulated POST request to a web service that’s supposed to check how many users are in a mailing list and then, possibly, send some SMS messages. In this case, however, I want to simulate that an error occurs when talking to the mailing-list service. The test must verify that the error is reported and, crucially, that no messages are sent.

The library modules for talking to the mailing-list service and the SMS service are called mls and sms. So, for the duration of the simulated request, I’m replacing two functions in these modules with mock versions. The mock version of mls.subscriber_count, when called, will raise the exception I’m trying to simulate. The mock version of sms.send, however, must not be called; if it is, the uncalled mocking helper will alert me by raising an exception.

So the mocking helpers not only temporarily install mock implementations of functions but also assert that those mock implementations are (or are not) called as expected. In the following code, for example, I use this capability to make sure that the mailing-list service is asked to get the subscriber count for the mailing list having the right key. I also simulate the service returning a subscriber count of 123.

def test_unconfirmed_msgs_must_be_confirmed(self):
    """An unconfirmed msg must be confirmed and not sent.""" 
    mlkey = 'TEST_ML_KEY'
    body = 'Test message!'
    data = self._request_data(mlkey=mlkey, body=body)
    with mocked(mls.subscriber_count, expecting=((mlkey,),{}), returning=123):
        with uncalled(sms.send):
            resp = self.app.request("/mailings", method='POST', data=data)
    self.assertEqual(resp.status, '200 OK')
    self.assertIn('Reply CONFIRM', resp.data)    # we must ask for confirmation
    self.assertIn('123 subscribers', resp.data)  # and supply subscriber count
    self.assertIn(body, resp.data)               # and the msg to be sent

The mocked helper lets you do a few more things, too, but you get the idea: Sometimes a short helper function can take you a long way.

Tags , ,
no comments
no trackbacks
Reddit Delicious

The most surprisingly helpful thing I have written

Posted by Tom Moertel Wed, 02 Nov 2011 02:55:00 GMT

Back in 2007, I repaired an aging and fairly obscure A/V receiver that had lost the ability to respond to its remote control. This I did by re-soldering some hard-to-find solder joints that had broken on its circuit board.

On the chance that someone else had a similar problem, I posted some instructions and photos on my blog. I didn’t think much of it at the time.

But since then, every week or so, another comment shows up, thanking me for writing it. Some typical examples:

Fixed my Kenwood V6030D 10 minutes ago. Life is good again, Thanks mate….

Ditto! Worked on my VR-209 like a champ! Thanks!

Thank you for this posting, which I stumbled upon when I was researching the problem of my remote control no longer working for this receiver (VR-507). These pictures were invaluable to locate the faulty pins. (They sure are small.) Re-soldering them restored full functionality to the receiver and the original remote control. Good job!

Thanks, Fixed my KRF-8010D with this, been 5 years fighting with remote working now and then.

There are now about 60 comments like that. I never would have imagined that 60 people would have read the post let alone get out a soldering iron because of it. But they did! And it helped them!

Now, every time I’m feeling down, I Google up that post and read the thank-yous. It cheers me up.

So here’s the lesson: Write it down. If you’ve figured something out, even if it seems unimportant, write it down. Maybe someone else will find it helpful. Maybe a lot of someone elses will find it helpful.

You never know. It might even cheer you up someday.

Posted in
Tags , , ,
5 comments
no trackbacks
Reddit Delicious

Almost there: 100K words

Posted by Tom Moertel Thu, 21 Jul 2011 00:16:00 GMT

In 2007, I counted how many words I had written for my blog and was surprised to discover it was 60 thousand, about a short novel’s worth. Just now, I stumbled upon that earlier count and wondered what the count was today.

>> Article.find(:all).inject(0) { |sum,a| sum +=
?>        (a.body + a.extended.to_s).split(/\s+/).length }
=> 98702

I’m about 1300 words shy of 100 thousand.

Now I’ve got to write a few more posts for the blog.

:-)

Posted in
Tags ,
no comments
no trackbacks
Reddit Delicious

Good enough to steal

Posted by Tom Moertel Sun, 10 Jul 2011 22:06:00 GMT

Recently, I wrote a simple plagiarism detector as a fun programming exercise. Then, merely a few days later, some company gave me cause to use it.

This company, it seems, was looking to hire a programmer. So they posted a job ad that was more or less word-for-word copied from a job ad that I had written for the company where I work. The duplication, being so extensive, was hard to miss. (The offending company, to its credit, promptly removed the copied ad from its web site when we let them know about it.)

I had originally written the text used in that ad back in December 2010, when we were starting another round of hiring. I had hoped that when the right kind of programmers read it, they would discern that we were programmers just like them, programmers who cared for their craft enough and who cared for their team enough to take hiring other programmers seriously. I didn’t want our ads to seem anything like those spat out by people just mouthing the words that everyone else was mouthing to “get talent.” So I worked on getting the words right, thinking the investment would somehow help us stand apart, if just a bit, when hiring.

But I had failed to consider that authenticity in job ads can be faked by just copying what seems authentic. So, a few days ago, when Googling for statistically unlikely phrases from the text I had written, I was actually surprised to discover that a number of companies and recruiters were now using my words, more or less unchanged, to signal how “authentic” they were.

At first, I was annoyed. But, upon reflection, I realized that the plagiarism was telling me that I had written something worth stealing. That’s a good thing, right?

After all, in a society where all too many people are willing to claim your words as their own, which is worse: to write something and have it stolen or to write something and have it not?

P.S. For the pedants who would point out that nothing was actually “stolen” here, please understand that steal has a well-established sense that means basically to plagiarize, as in T.S. Eliot’s quip that “Mediocre writers borrow. Great writers steal.”

Posted in
Tags , , , ,
5 comments
no trackbacks
Reddit Delicious

Writing a simple plagiarism detector in Haskell

Posted by Tom Moertel Thu, 16 Jun 2011 04:13:00 GMT

Recently, I wrote a simple plagiarism detector in Haskell using n-grams. The problem it solves is this: You have a document, the suspect, which you believe may be derived from other documents, the sources. You want to find regions in the suspect that are identical to those in the sources.

One simple solution is to use n-grams, which are just the n-word neighborhoods taken from some sequence of words. For example, the 4-grams of the sequence a b c d e f are a b c d, b c d e, and c d e f.

The idea for plagiarism detection with n-grams is that, as you increase n, the probability that a given n-gram will be shared between two arbitrary documents rapidly vanishes. For example, Google says, right now, that the 2-gram “rapidly vanishes” appears in about 11,000 documents, but the 5-gram “two different documents rapidly vanishes” occurs in no documents. (Once Google indexes this post, however, that will change.)

But when one document is copied partly from another, they will probably share many long sequences of words. So, the sharing of longer n-grams is evidence of copying.

It’s not proof, however. If two papers about U.S. history both contain the 5-gram “The United States of America,” that coincidence is not surprising. But when lots of unlikely sharing occurs, it’s strong evidence of some form of common ancestry.

Fortunately, when copying occurs, it’s often easy to see, once you know where to look. So my detector just annotates the suspect document to indicate which portions of it share n-grams with the source documents. Shared regions are converted to ALL CAPS. For example, looking for shared 4-grams, my detector finds ample evidence that the following paragraph was copied from the text above:

$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt

But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA," 
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.

In fact, it was copied: I duplicated an earlier paragraph and then changed a few words.

Using n-grams to detect sharing is not a perfect system, but it’s a simple system, and, because its output is easily interpreted, it’s an effective system. When your eyeballs see a lot of ALL CAPS TEXT, it’s easy to judge what it represents.

Anyway, here’s the code.

{-

Simple plagiarism detector that detects n-gram duplication.

Usage: ./Plag N suspect.txt source.txt... > annotated-suspect.txt
(where N is the n-gram size; try 3, 4, and 5).

Tom Moertel <tmoertel@gmail.com>
2011-06-11

-}

module Main where

import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)

main = do
  n:suspect:sources <- getArgs
  putStrLn =<< findMatches (read n) suspect sources

findMatches n suspect sources = do
  db <- S.fromList . concat  <$> mapM (loadNgrams n) sources
  stxt <- readFile suspect
  return $ concat (match n db (nGrams n stxt) (whiteWords stxt))

match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
  where
    matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
    isMatch loc ng | ng `S.member` db = [loc .. loc + n - 1]
                   | otherwise        = []

mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
  where
    go _   []     ws               = ws
    go _   _      []               = []
    go i (l:ls) (w:ws) | i < l     = w : go (i + 1) (l:ls) ws
    go i (l:ls) (w:ws) | otherwise = f w :  go (i + 1) ls ws

uniq = S.toList . S.fromList

loadNgrams n file = nGrams n <$> readFile file

nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
           tails . map normalize . words

normalize = map toLower . filter isAlphaNum

-- returns words like 'words' but preserves leading whitespace for each

whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
  ("", "") -> []
  (s1, "") -> [s1]
  (s1, s') -> (s1 ++ w) : whiteWords s''
    where
      (w, s'') = break isSpace s'

Posted in
Tags , ,
9 comments
no trackbacks
Reddit Delicious

Presso hand-powered espresso maker: thoughts?

Posted by Tom Moertel Tue, 05 Apr 2011 16:50:00 GMT

Has anyone tried the Presso hand-powered espresso maker? It seems that some people have figured out how to get good espresso from it.

I’m wondering if it’s reliably good espresso. If it is, the Presso might make a good take-into-work espresso maker.

Any experience reports?

Posted in
Tags ,
6 comments
no trackbacks
Reddit Delicious

Dear readers: let me know if my feeds explode

Posted by Tom Moertel Sat, 26 Mar 2011 16:14:00 GMT

I recently started redirecting requests for my blog’s RSS and Atom feeds to a consolidated feed with FeedBurner. If you’re reading my blog via one of these feeds – and you should be since my blog’s design is sufficiently horrible to burst your eyeballs if you try to read it directly – please let me know if the feeds break for you.

Examples of breakage:
  • the feeds no longer update
  • the content of the feeds looks garbled or goofy
  • images or links are broken
  • the feeds literally catch on fire and burn

If you see any of the above, please let me know.

Thanks!

Posted in
Tags , ,
4 comments
no trackbacks
Reddit Delicious

Errata-sleuthing for fun and knowledge

Posted by Tom Moertel Fri, 25 Mar 2011 03:30:00 GMT

I love inspecting errata, those lists of corrections and desired changes that authors compile for their books, hoping that someday their books will be sent back to the presses and the changes can be worked in.

For many technical books, the errata are mostly boring fixes for publishing errors. Sometimes, however, you can see that the author’s thinking has evolved, and then errata-sleuthing becomes fun. The errata list becomes a kind of “brain diff” for the author’s thoughts. It’s not often that someone can tell you with such precision how their thinking has evolved on important subjects.

By studying these diffs, you can learn something about the way their creators think. Similarly, Seth Roberts wrote on his blog that he learned something about writing and editing by comparing New Yorker articles to the slightly different versions published elsewhere; the New Yorker versions were better, and the differences between the versions explained why they were better. It’s like getting free writing lessons from The New Yorker.

A recent example of errata-sleuting: The errata for Judea Pearl’s Causality are fascinating. Causality research is a rapidly advancing field, and it’s interesting to see, through edits to his two-year-old book, which new developments one of the pioneers of the field thinks are worth working into the text. For instance: the addition of the Mediation Formula to the discussion of indirect effects. There are also many new citations to recent research and a few helpful clarifications. I also like (and am slightly amused) that he’s sharpened some of the poking-sticks he’s using to gently prod reluctant researchers, clinging to traditional methods, toward the advantages of modern causality theory: “some statisticians” has become “traditional statisticians”; a potential-outcomes “fallacy” now has “damaging consequences.” Good stuff.

The next time you’re working through an interesting textbook, see if you can find errata for it, even if you have the most-recent printing. You might learn something unexpected, and for very little cost.

Posted in
Tags , , , ,
no comments
no trackbacks
Reddit Delicious

A quick summary of the VL2 data-center network scheme

Posted by Tom Moertel Thu, 17 Mar 2011 15:21:00 GMT

Reading the most-recent issue of Communications of the ACM, I enjoyed the reprint of “VL2: A Scalable and Flexible Data Center Network.” Here’s a summary.

First, the problem: Start out with a rack of 20–40 servers. At the top of the rack, place a switch – the top-of-rack (ToR) switch – to connect the servers together. Now fill your data center with racks just like this. To allow the servers to talk across racks, take each rack’s ToR switch and run its uplink ports (typically 2) into separate (for redundancy) higher-level aggregation switches. To connect aggregated rack-groups together, uplink the aggregation switches into higher-level aggregation switches. Keep aggregating this way until everything is connected.

But, eventually, you can’t join the switched domains further because they’ll get too big and unwieldy; that’s when you uplink into the final layer of the hierarchy: access routers; they split your aggregated-aggregated rack-group-groups into separate layer-2 networks and route packets between them.

What’s wrong with this hierarchical network graph? In a word, oversubscription. When big jobs need to run across the data center, the uplinks saturate and become bottlenecks. According to the paper, even ToR uplinks are typically oversubscribed 1:2 to 1:20; the higher-level uplinks are even worse: 1:200 is not uncommon.

To work around these bottlenecks, network designers end up buying expensive network hardware and configuring it for specific workloads. But running large data centers is so expensive that you want the flexibility to squeeze lots of different jobs into the spare capacity, and networks tuned for one kind of workload are the opposite of flexibility.

How do you get both performance and flexibility? The authors of the paper propose VL2: creating virtual layer-2 networks that allow application addresses to be separated from network devices. This separation lets you design the above-rack network to provide huge path capacity using commodity hardware. In the paper’s running example, it’s a folded Clos network in two levels of switches – the aggregation level and, above that, the intermediate level.

Here’s how it works. You assign each network device on each server a location address (LA); this is an IP address that stays with it for life, naming the device. These LAs get advertised to the switches above, which keep track of them using a typical layer-3 link-state routing protocol.

Each application, however, you assign a block of application addresses (AAs) from a separate pool. LAs and AAs never mix. Each application is coded as if it runs on a dedicated Ethernet segment having only its AAs attached. When you want to give an application an extra server, you map one of the application’s unused AAs to the server’s LA, effectively attaching the server to the application’s virtual Ethernet network.

AA-to-LA mapping is handled by a fast, scalable directory system that is invisible to applications. It’s invisible because it’s implemented as a kernel extension. When a server sends a packet to one of its AA-addressed peers on the application’s virtual network, the kernel extension consults the directory to get the destination server’s corresponding LA and sends the packet to the AA via the LA.

How the packet gets there is clever. The kernel extension doesn’t send the packet directly to the LA but instead sends it up to the very top level of the network, bouncing it off a randomly selected intermediate-level switch, before it is delivered to the LA device.

All this happens by encapsulation and tunneling. The packet to the destination AA is wrapped within a packet to the corresponding destination LA. That packet, in turn, is wrapped within a packet to the randomly chosen intermediate switch. When the packet is sent, it goes up to the intermediate switch, which unwraps the outer layer of encapsulation and bounces the inner packet down to the switch handling the LA. That switch, in turn, unwraps the remaining layer and sends the original packet – this is the one to the AA – to the destination server, which gets it via its application-specific virtual Ethernet adapter.

Why bounce packets off a random intermediate switch? The randomization spreads out traffic, allowing for high network utilization. This “valiant load balancing” is cheap and effective, resulting in about 94% efficiency in the paper’s tests.

That, in a nutshell, is VL2. (I simplified some things; in reality the randomization doesn’t occur for each packet but for entire flows of packets.)

For more information, check out the original paper at Microsoft Research.

Posted in
Tags , , , , ,
no comments
no trackbacks
Reddit Delicious

Older posts: 1 2 3 ... 19