Posted by Tom Moertel
Fri, 27 Jan 2012 04:24:00 GMT
This has been done a million times before, but if you
haven’t seen it, it’s pretty neat. Let’s say you have a simple
recursive data structure like a binary tree:
module Tree where
data Tree a = Empty
| Node a (Tree a) (Tree a)
deriving (Eq, Show)
t0 = Empty
t1 = Node 1 Empty Empty
t3 = Node 2 (Node 1 Empty Empty) (Node 3 Empty Empty)
Now say you want to traverse the nodes, in “preorder.” So you write a
traversal function. It folds a binary operator f through
the values in the tree, starting with an initial accumulator value of
z. First, it visits the current node, then the left subtree, and
finally the right subtree, combining each value it encounters
with the current accumulator value to get the next accumulator
value. The final accumulator value is the result.
Spelling out the steps, it looks like this:
preorder_traversal f z tree = go tree z
where
go Empty z = z
go (Node v l r) z = let z' = f v z
z'' = go l z'
z''' = go r z''
in z'''
But if you wanted an “inorder” traversal instead, you’d write a
slightly different function, visiting the left subtree before the
current node:
inorder_traversal f z tree = go tree z
where
go Empty z = z
go (Node v l r) z = let z' = go l z
z'' = f v z'
z''' = go r z''
in z'''
To see how the two traversal functions work, let’s use both to flatten
the 3-node tree t3 we defined earlier.
flatten traversal = reverse . traversal (:) []
test0i = flatten inorder_traversal t3
test0p = flatten preorder_traversal t3
Great.
But now you start thinking about post-order traversal. Do you want
to write a third traversal function that’s almost the same as the
first two?
So you go back to the inorder traversal and start staring real hard at that let statement.
Can you rewrite it? Yes:
inorder_traversal2 f z tree = go tree z
where
go Empty z = z
go (Node v l r) z = go r . f v . go l $ z
And now you’ve got it. Just by reordering the applications of (go
l), (f v), and (go r), you can control the
traversal.
Which leads to a general traversal function that lets some other function
control that ordering:
traverse step f z tree = go tree z
where
go Empty z = z
go (Node v l r) z = step (f v) (go l) (go r) z
Doesn’t that look beautiful?
Now you can just spell out your desired traversals:
preorder = traverse $ \n l r -> r . l . n
inorder = traverse $ \n l r -> r . n . l
postorder = traverse $ \n l r -> n . r . l
A test drive:
test1p = flatten preorder t3
test1i = flatten inorder t3
test1o = flatten postorder t3
And you’re happy.
But can you go further? What if your trees are binary search trees
and you just want to find the minimum or maximum value? Can you
just traverse to the left or right?
Yes:
leftorder = traverse $ \n l r -> l . n
rightorder = traverse $ \n l r -> r . n
treemin = leftorder min maxBound
treemax = rightorder max minBound
test2l = treemin t3 :: Int
test2r = treemax t3 :: Int
Isn’t that neat?
Posted in functional programming
Tags fun, haskell, recursive, refactoring, trees
3 comments
no trackbacks

Posted by Tom Moertel
Mon, 07 Nov 2011 05:05:00 GMT
Recently, I needed a lightweight mocking solution when adding unit tests to some existing Python code. I could have used one of the many Python mocking libraries, but I only had to replace
a few module functions during testing, so I just wrote a tiny
mocking helper rather than add a dependency to the project.
The helper is simple and surprisingly versatile:
import contextlib
import functools
@contextlib.contextmanager
def mocked(func,
expecting=None, returning=None, raising=None, # specs
replacement_logic=None, called=1):
"""Stub out and mock a function for a yield's duration."""
if (returning, raising, replacement_logic).count(None) < 2:
raise ValueError("returning, raising, and replacement_logic "
"are incompatible with each other")
# default logic for implementing mock fn: return or raise per specs
def default_logic(*_args, **_kwds):
if raising:
raise raising
return returning
# prepare wrapper to replace mocked function for duration of yield
invocations = [0]
@functools.wraps(func)
def replacement(*args, **kwds):
if expecting is not None:
assert expecting == (args, kwds) # did we get expected args?
invocations[0] += 1
return (replacement_logic or default_logic)(*args, **kwds)
# replace mocked function, yield to test, and then check & clean up
module = sys.modules.get(func.__module__)
setattr(module, func.__name__, replacement)
try:
yield # give control back to test for a while
assert invocations[0] == called # was mock called enough?
finally:
setattr(module, func.__name__, func)
def uncalled(func):
"""Require that a function not be called for a yield's duration."""
return mocked(func, called=0)
The idea is to wrap test code that requires mocked
external functions with a mocking helper. Here’s an example:
def test_ml_errors_must_be_reported(self):
"""When an error occurs, it must be reported; nothing must be sent."""
data = self._request_data()
exception = mls.MlError()
exception.exc = '[[[error description]]]'
with mocked(mls.subscriber_count, raising=exception):
with uncalled(sms.send):
resp = self.app.request("/mailings", method='POST', data=data)
self.assertEqual(resp.status, '200 OK')
self.assertIn(exception.exc, resp.data) # must be reported
In this test method, I’m making a simulated POST request to a web
service that’s supposed to check how many users are in a mailing
list and then, possibly, send some SMS messages. In this case,
however, I want to simulate that an error occurs when talking to the
mailing-list service. The test must verify that the error is reported
and, crucially, that no messages are sent.
The library modules for talking to the mailing-list service and the
SMS service are called mls and sms. So,
for the duration of the simulated request, I’m replacing two functions
in these modules with mock versions.
The mock version of mls.subscriber_count, when called,
will raise the exception I’m trying to simulate. The mock version of
sms.send, however, must not be called; if it is, the
uncalled mocking helper will alert me by raising an
exception.
So the mocking helpers not only temporarily install mock
implementations of functions but also assert that those mock
implementations are (or are not) called as expected. In the following
code, for example, I use this capability to make sure that the
mailing-list service is asked to get the subscriber count for the
mailing list having the right key. I also simulate the service
returning a subscriber count of 123.
def test_unconfirmed_msgs_must_be_confirmed(self):
"""An unconfirmed msg must be confirmed and not sent."""
mlkey = 'TEST_ML_KEY'
body = 'Test message!'
data = self._request_data(mlkey=mlkey, body=body)
with mocked(mls.subscriber_count, expecting=((mlkey,),{}), returning=123):
with uncalled(sms.send):
resp = self.app.request("/mailings", method='POST', data=data)
self.assertEqual(resp.status, '200 OK')
self.assertIn('Reply CONFIRM', resp.data) # we must ask for confirmation
self.assertIn('123 subscribers', resp.data) # and supply subscriber count
self.assertIn(body, resp.data) # and the msg to be sent
The mocked helper lets you do a few more things, too, but
you get the idea: Sometimes a short helper function can take you a
long way.
Tags mocks, python, testing
no comments
no trackbacks

Posted by Tom Moertel
Wed, 02 Nov 2011 02:55:00 GMT
Back in 2007, I repaired an aging and fairly obscure A/V receiver that had lost the ability to respond to its remote control. This I did by re-soldering some hard-to-find solder joints that had broken on its circuit board.
On the chance that someone else had a similar problem, I posted some instructions and photos on my blog. I didn’t think much of it at the time.
But since then, every week or so, another comment shows up, thanking me for writing it. Some typical examples:
Fixed my Kenwood V6030D 10 minutes ago. Life is good again, Thanks mate….
Ditto! Worked on my VR-209 like a champ! Thanks!
Thank you for this posting, which I stumbled upon when I was researching the problem of my remote control no longer working for this receiver (VR-507). These pictures were invaluable to locate the faulty pins. (They sure are small.) Re-soldering them restored full functionality to the receiver and the original remote control. Good job!
Thanks, Fixed my KRF-8010D with this, been 5 years fighting with remote working now and then.
There are now about 60 comments like that. I never would have imagined that 60 people would have read the post let alone get out a soldering iron because of it. But they did! And it helped them!
Now, every time I’m feeling down, I Google up that post and read the thank-yous. It cheers me up.
So here’s the lesson: Write it down. If you’ve figured something out, even if it seems unimportant, write it down. Maybe someone else will find it helpful. Maybe a lot of someone elses will find it helpful.
You never know. It might even cheer you up someday.
Posted in interesting stuff
Tags helping, lessons, surprising, writing
5 comments
no trackbacks

Posted by Tom Moertel
Thu, 21 Jul 2011 00:16:00 GMT
In 2007, I counted how many words I had written for my blog and was surprised to discover it was 60 thousand, about a short novel’s worth. Just now, I stumbled upon that earlier count and wondered what the count was today.
>> Article.find(:all).inject(0) { |sum,a| sum +=
?> (a.body + a.extended.to_s).split(/\s+/).length }
=> 98702
I’m about 1300 words shy of 100 thousand.
Now I’ve got to write a few more posts for the blog.
:-)
Posted in writing
Tags blog, writing
no comments
no trackbacks

Posted by Tom Moertel
Sun, 10 Jul 2011 22:06:00 GMT
Recently, I wrote a simple plagiarism detector as a fun programming exercise. Then, merely a few days later, some company gave me cause to use it.
This company, it seems, was looking to hire a programmer. So they posted a job ad that was more or less word-for-word copied from a job ad that I had written for the company where I work. The duplication, being so extensive, was hard to miss. (The offending company, to its credit, promptly removed the copied ad from its web site when we let them know about it.)
I had originally written the text used in that ad back in December 2010, when we were starting another round of hiring. I had hoped that when the right kind of programmers read it, they would discern that we were programmers just like them, programmers who cared for their craft enough and who cared for their team enough to take hiring other programmers seriously. I didn’t want our ads to seem anything like those spat out by people just mouthing the words that everyone else was mouthing to “get talent.” So I worked on getting the words right, thinking the investment would somehow help us stand apart, if just a bit, when hiring.
But I had failed to consider that authenticity in job ads can be faked by just copying what seems authentic. So, a few days ago, when Googling for statistically unlikely phrases from the text I had written, I was actually surprised to discover that a number of companies and recruiters were now using my words, more or less unchanged, to signal how “authentic” they were.
At first, I was annoyed. But, upon reflection, I realized that the plagiarism was telling me that I had written something worth stealing. That’s a good thing, right?
After all, in a society where all too many people are willing to claim your words as their own, which is worse: to write something and have it stolen or to write something and have it not?
P.S. For the pedants who would point out that nothing was actually “stolen” here, please understand that steal has a well-established sense that means basically to plagiarize, as in T.S. Eliot’s quip that “Mediocre writers borrow. Great writers steal.”
Posted in writing
Tags hiring, plagiarism, recruiters, weaselry, writing
5 comments
no trackbacks

Posted by Tom Moertel
Thu, 16 Jun 2011 04:13:00 GMT
Recently, I wrote a simple plagiarism detector in Haskell using
n-grams. The problem it solves is this: You have a document, the
suspect, which you believe may be derived from other documents, the
sources. You want to find regions in the suspect that are identical
to those in the sources.
One simple solution is to use n-grams, which are just the n-word
neighborhoods taken from some sequence of words. For example, the
4-grams of the sequence a b c d e f are a b c d, b c d e, and
c d e f.
The idea for plagiarism detection with n-grams is that, as you
increase n, the probability that a given n-gram will be
shared between two arbitrary documents rapidly vanishes. For example,
Google says, right now, that the 2-gram “rapidly vanishes” appears in
about 11,000 documents, but the 5-gram “two different documents
rapidly vanishes” occurs in no documents. (Once Google indexes this
post, however, that will change.)
But when one document is copied partly from another, they will
probably share many long sequences of words. So, the sharing of longer
n-grams is evidence of copying.
It’s not proof, however. If two papers about U.S. history both contain
the 5-gram “The United States of America,” that coincidence is not surprising. But
when lots of unlikely sharing occurs, it’s strong evidence of some
form of common ancestry.
Fortunately, when copying occurs, it’s often easy to see, once
you know where to look. So my detector just annotates the suspect
document to indicate which portions of it share n-grams with the
source documents. Shared regions are converted to ALL CAPS. For
example, looking for shared 4-grams, my detector finds ample evidence
that the following paragraph was copied from the text above:
$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt
But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA,"
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.
In fact, it was copied: I duplicated an earlier paragraph and then
changed a few words.
Using n-grams to detect sharing is not a perfect system,
but it’s a simple system, and, because its output is
easily interpreted, it’s an effective system. When your eyeballs see
a lot of ALL CAPS TEXT, it’s easy to judge what
it represents.
Anyway, here’s the code.
module Main where
import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)
main = do
n:suspect:sources <- getArgs
putStrLn =<< findMatches (read n) suspect sources
findMatches n suspect sources = do
db <- S.fromList . concat <$> mapM (loadNgrams n) sources
stxt <- readFile suspect
return $ concat (match n db (nGrams n stxt) (whiteWords stxt))
match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
where
matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
isMatch loc ng | ng `S.member` db = [loc .. loc + n 1]
| otherwise = []
mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
where
go _ [] ws = ws
go _ _ [] = []
go i (l:ls) (w:ws) | i < l = w : go (i + 1) (l:ls) ws
go i (l:ls) (w:ws) | otherwise = f w : go (i + 1) ls ws
uniq = S.toList . S.fromList
loadNgrams n file = nGrams n <$> readFile file
nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
tails . map normalize . words
normalize = map toLower . filter isAlphaNum
whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
("", "") -> []
(s1, "") -> [s1]
(s1, s') -> (s1 ++ w) : whiteWords s''
where
(w, s'') = break isSpace s'
Posted in programming
Tags haskell, ngrams, plagiarism
9 comments
no trackbacks

Posted by Tom Moertel
Tue, 05 Apr 2011 16:50:00 GMT
Has anyone tried the Presso hand-powered espresso maker? It seems that some people have figured out how to get good espresso from it.
I’m wondering if it’s reliably good espresso. If it is, the Presso might make a good take-into-work espresso maker.
Any experience reports?
Posted in espresso
Tags espresso, presso
6 comments
no trackbacks

Posted by Tom Moertel
Sat, 26 Mar 2011 16:14:00 GMT
I recently started redirecting requests for my blog’s RSS and Atom feeds to a consolidated feed with FeedBurner. If you’re reading my blog via one of these feeds – and you should be since my blog’s design is sufficiently horrible to burst your eyeballs if you try to read it directly – please let me know if the feeds break for you.
Examples of breakage:
- the feeds no longer update
- the content of the feeds looks garbled or goofy
- images or links are broken
- the feeds literally catch on fire and burn
If you see any of the above, please let me know.
Thanks!
Posted in site news
Tags feedburner, feeds, redirect
4 comments
no trackbacks

Posted by Tom Moertel
Fri, 25 Mar 2011 03:30:00 GMT
I love inspecting errata, those lists of corrections and desired changes that authors compile for their books, hoping that someday their books will be sent back to the presses and the changes can be worked in.
For many technical books, the errata are mostly boring fixes for publishing errors. Sometimes, however, you can see that the author’s thinking has evolved, and then errata-sleuthing becomes fun. The errata list becomes a kind of “brain diff” for the author’s thoughts. It’s not often that someone can tell you with such precision how their thinking has evolved on important subjects.
By studying these diffs, you can learn something about the way their creators think. Similarly, Seth Roberts wrote on his blog that he learned something about writing and editing by comparing New Yorker articles to the slightly different versions published elsewhere; the New Yorker versions were better, and the differences between the versions explained why they were better. It’s like getting free writing lessons from The New Yorker.
A recent example of errata-sleuting: The errata for Judea Pearl’s Causality are fascinating. Causality research is a rapidly advancing field, and it’s interesting to see, through edits to his two-year-old book, which new developments one of the pioneers of the field thinks are worth working into the text. For instance: the addition of the Mediation Formula to the discussion of indirect effects. There are also many new citations to recent research and a few helpful clarifications. I also like (and am slightly amused) that he’s sharpened some of the poking-sticks he’s using to gently prod reluctant researchers, clinging to traditional methods, toward the advantages of modern causality theory: “some statisticians” has become “traditional statisticians”; a potential-outcomes “fallacy” now has “damaging consequences.” Good stuff.
The next time you’re working through an interesting textbook, see if you can find errata for it, even if you have the most-recent printing. You might learn something unexpected, and for very little cost.
Posted in fun stuff
Tags causality, errata, hacks, learning, sleuthing
no comments
no trackbacks

Posted by Tom Moertel
Thu, 17 Mar 2011 15:21:00 GMT
Reading the most-recent issue of Communications of the ACM, I enjoyed the reprint of “VL2: A Scalable and Flexible Data Center Network.” Here’s a summary.
First, the problem: Start out with a rack of 20–40 servers. At the top of the rack, place a switch – the top-of-rack (ToR) switch – to connect the servers together. Now fill your data center with racks just like this. To allow the servers to talk across racks, take each rack’s ToR switch and run its uplink ports (typically 2) into separate (for redundancy) higher-level aggregation switches. To connect aggregated rack-groups together, uplink the aggregation switches into higher-level aggregation switches. Keep aggregating this way until everything is connected.
But, eventually, you can’t join the switched domains further because they’ll get too big and unwieldy; that’s when you uplink into the final layer of the hierarchy: access routers; they split your aggregated-aggregated rack-group-groups into separate layer-2 networks and route packets between them.
What’s wrong with this hierarchical network graph? In a word, oversubscription. When big jobs need to run across the data center, the uplinks saturate and become bottlenecks. According to the paper, even ToR uplinks are typically oversubscribed 1:2 to 1:20; the higher-level uplinks are even worse: 1:200 is not uncommon.
To work around these bottlenecks, network designers end up buying expensive network hardware and configuring it for specific workloads. But running large data centers is so expensive that you want the flexibility to squeeze lots of different jobs into the spare capacity, and networks tuned for one kind of workload are the opposite of flexibility.
How do you get both performance and flexibility? The authors of the paper propose VL2: creating virtual layer-2 networks that allow application addresses to be separated from network devices. This separation lets you design the above-rack network to provide huge path capacity using commodity hardware. In the paper’s running example, it’s a folded Clos network in two levels of switches – the aggregation level and, above that, the intermediate level.
Here’s how it works. You assign each network device on each server a location address (LA); this is an IP address that stays with it for life, naming the device. These LAs get advertised to the switches above, which keep track of them using a typical layer-3 link-state routing protocol.
Each application, however, you assign a block of application addresses (AAs) from a separate pool. LAs and AAs never mix. Each application is coded as if it runs on a dedicated Ethernet segment having only its AAs attached. When you want to give an application an extra server, you map one of the application’s unused AAs to the server’s LA, effectively attaching the server to the application’s virtual Ethernet network.
AA-to-LA mapping is handled by a fast, scalable directory system that is invisible to applications. It’s invisible because it’s implemented as a kernel extension. When a server sends a packet to one of its AA-addressed peers on the application’s virtual network, the kernel extension consults the directory to get the destination server’s corresponding LA and sends the packet to the AA via the LA.
How the packet gets there is clever. The kernel extension doesn’t send the packet directly to the LA but instead sends it up to the very top level of the network, bouncing it off a randomly selected intermediate-level switch, before it is delivered to the LA device.
All this happens by encapsulation and tunneling. The packet to the destination AA is wrapped within a packet to the corresponding destination LA. That packet, in turn, is wrapped within a packet to the randomly chosen intermediate switch. When the packet is sent, it goes up to the intermediate switch, which unwraps the outer layer of encapsulation and bounces the inner packet down to the switch handling the LA. That switch, in turn, unwraps the remaining layer and sends the original packet – this is the one to the AA – to the destination server, which gets it via its application-specific virtual Ethernet adapter.
Why bounce packets off a random intermediate switch? The randomization spreads out traffic, allowing for high network utilization. This “valiant load balancing” is cheap and effective, resulting in about 94% efficiency in the paper’s tests.
That, in a nutshell, is VL2. (I simplified some things; in reality the randomization doesn’t occur for each packet but for entire flows of packets.)
For more information, check out the original paper at Microsoft Research.
Posted in networking
Tags centers, clos, data, networking, scaling, vl2
no comments
no trackbacks
