Sunday, October 21, 2007

Making foreach useful in the corner cases

One of the nice things about recent developments in mainstream programming languages is improved support for looping.

In a functional language with proper support for lambda functions, it might not be a problem but if you're programming in something a bit more mainstream, chances are that you're writing loops over data structures all the time. In my experience, it turns out to be a real improvement if the language offers a bit of syntactic sugar and allows you to write something like:
for element in datastructure 
do something with element

Python's got it, C# and Java have it, last I heard there were plans to introduce it in C++ too.

Until recently I thought this particular piece of syntactic sugar was of the kind which is really useful in 90% of the cases and completely useless in the remaining 10%. For example, what if you want to extract two elements and not just one? Or loop over two datastructures? Or only loop over certain elements?

In fact, there are elegant solutions to all these problems. I've been pondering over them this week. Here's how you do in Python, but all this requires of the language is the concept of iterators and some kind of convenient support for tuples.

Extracting only part of the sequence: Slice the datastructure with a function that skips the parts you don't want.

Python has the [start:end] operator which is as succinct as it gets for extracting an interval. There's also a more generic islice function in the itertools module which works with iterators:
from itertools import islice
mylist = [1, 2, 3, 4, 5]
for x in islice(mylist, 1, 4):
print x
(... prints 2, 3, 4)

This idea can be extended in many ways, e.g. if you need every second element, it's easy to come up with a function that only iterates over them. If you need to exclude elements based on something related to the elements themselves and not their position in the datastructure, there's always the filter function.

Extracting elements from several datastructures: Construct an iterator that iterates over the sequences in parallel, returning a tuple with one element from each.

Python has the built-in zip function which returns the complete sequences zipped up as a list of tuples. Until Python 3 is out, izip in itertools is closer to what we want:
from itertools import izip
xlist = [1, 2, 3]
ylist = ["a", "b", "c"]
for x, y in izip(xlist, ylist):
print x, y
(... prints 1 a, 2 b, 3 c)

So it's no problem. We can still maintain our nice, succinct for loop, there's no need to manually iterate over each sequence in turn. Note that zip and izip will automatically stop when one of sequences runs out of data.

Extracting several elements from the same datastructure: Construct an iterator that returns a sliding window of the sequence as a tuple.

You could code this manually, but it's even easier to do once you know zip: just zip the datastructure with itself, with a little offset. In Python, that's zip(seq, seq[1:]), or more generally:
from itertools import izip, islice
def slidingwindow(seq, windowsize):
t = ()
for i in range(windowsize):
t += (islice(seq, i, None),)
return izip(*t)

That's all you need to get two or more elements out at the same time:
mylist = [1, 2, 4, 8]
for prev, val in slidingwindow(mylist, 2):
print val - prev
(... prints 1, 2, 4)

No need to manually set prev to the predecessor in each iteration. You could easily add padding or let the window advance in larger steps if needed. The latter is probably easiest to do by combining the function with an appropriate slicing operation.

Which brings me to the final point. An important part of the beauty of these operations is that they can be combined in a multitude of ways.


Note: if you're going to use slidingwindow in production and want something to paste in, here's an even more general version that works with arbitrary iterator input:
from itertools import izip, tee
def slidingwindow(iterable, windowsize):
t = tee(iterable, windowsize)
for j in range(1, windowsize):
for i in range(j, windowsize):
try:
t[i].next()
except StopIteration:
pass

return izip(*t)

The tee function duplicates an iterator by storing the data returned by it until all its duplicates have passed it too. The code above advances each of the returned iterators by the correct amount. It seems it's a bit faster to use tee than doing the equivalent stuff by hand.

No comments:

Post a Comment