Heapq vs Timsort: When the tortoise beats the hare

Mon 22 Feb 2010 08:09 PM

Tags: heapq performance timsort

Today during the Pycon sprints, we had a discussion about which questions people like to ask in interviews. This lead to some interesting discussions and tests that show that that using a heapq can actually be faster than timsort in certain conditions.

Some Classic Interview Questions

During this discussion, I was reminded of one of my boss Matt Billenstein's favorite questions:

Using any language, take a file with n lines, 5 columns a line, and sort it by the integer value of the second column.

Now this is not a particularly exciting or difficult one to solve, but it is surprising how many people get this question wrong. The answer is basically bound to running time of O(n log n), and memory usage of O(n). It can even be solved without ever leaving the shell:

sort -nk2 input.txt > output.txt

I, on the other hand, do really like a slight variation of this question:

Using any language, take this file with n lines, 5 columns a line, and sort it by the integer value of the second column and output the 10 smallest lines.

Now this is still really easy to do. It can still be done in the command line:

sort -nk2 input.txt | head -n10 > output.txt

But its not necessarily the best way to do it.

Now a crucial constraint

Adding the constraint that memory usage must be bound to the size of the output makes this problem a little harder. Now the memory must be O(m) where m is the number of lines I want in the output file (in this case O(10)). This means that you are only allowed to keep a list that is 10 items long in memory at any given time.

This is where my bash skills are no longer sufficient so I will switch to python. Now me and a colleague went back and forth over which would be faster.

  • Inserting the object, in order, into the result array. (In-order Insertion)
  • Appending the item to the array and calling sort on the array. (Lazy Insertion)

The respective runtime complexity for these two are O(nm) and O(nm log m), where n is the number of lines in the file and m is the size of the desired output file. So it would reason that the first would be faster, but the lazy insertion uses timsort and it is super optimized and written in C. One sure fire way to find out which is faster is to just test it.

The results of this are some what interesting:

http://dl.dropbox.com/u/191765/pics/graph_IIvLI.png

So all this really tells us is that is that timsort is fast and written in C. Comparing it to something written in python is probably not even worth it, but we did get pretty close. I wonder if we can do better.

Using The heapq For Speed

During working through these tests Honza Kral came up with an optimization for the in-order insertion. The heap in python is written in C, and it has faster run time complexity than timsort. We could store the list as a heap and keep a variable that is the size of the largest item in the heap. If the item we are trying to insert into the heapq is smaller than the largest item in the heapq we can do one poppush operation. This can be done in O( log m) time where m is the number of lines desired in the output file. If we do that for every line in the file (n) we get a total runtime of (n log m). For this very reason there seems to be a function in the standard library heapq.nsmallest designed to do just this.

The results:

http://dl.dropbox.com/u/191765/pics/graph_all.png

* Violates the memory constraint but just for comparison

So as we expected it does better than both in-order insertion and lazy insertion. Just for fun we compared it to reading everything into memory and running timsort on it. Because heapq is O(n log m) and timsort is O(n log n) and m < n it makes sense that heapq runs faster, but it doesn't beat timsort by as much as the runtime complexity would imply.

Comparing heapq vs timsort

With a large data set and a small desired output heapq will out perform timsort in both memory usage and run time complexity, but, heapq doesn't out perform timsort by the margin I thought it would because of what I can only assume is massive optimization of timsort in C. In fact as the magnitude difference to the input size to the output size decreases timsort will beat heapq:

http://dl.dropbox.com/u/191765/pics/graph_HvS.png

So the main conclusion we can draw here is that timsort is massively optimized to so greatly out perform heapq. More than just making smart decision, but it must be optimizations done in the C code as well. But one thing we have to keep in mind is that heapq is using far less memory. In conclusion, the heapq.nsmallest is better than a sort and truncate, if you are concearned with memory usage or if your output is considerably smaller than your input.

The Code

import random
import heapq
import time

time_a = []
time_b = []
time_c = []
time_d = []

def random_gen(size):
    for i in xrange(size):
        yield random.randint(0,size)

def a(nums, size_of_result_list):
    # A
    # In-order Insertion
    # memory = O(size_of_result_list)
    # time = O(size_of_array * size_of_result_list) *sort not done in C
    t = time.time()
    result_list = []
    for i in nums:
        length = len(result_list)
        j = -1
        if length:
            while j >= (-1 - length):
                if j == -1-length:
                    result_list.insert(0,i)
                    break;
                if result_list[j] < i:
                    result_list.insert(length+j+1,i)
                    break;
                j -= 1
            if length+1 > size_of_result_list:
                result_list = result_list[:size_of_result_list]
        else:
            result_list.append(i)
    time_a.append(time.time() - t)

def c(nums, size_of_result_list):
    # C
    # Lazy insertion (generate the item, append it, sort it, trucate it)
    # memory = O(size_of_result_list)
    # time = O(size_of_array*size_of_result_list*log(size_of_result_list))
    t = time.time()
    result_list = []
    for i in nums:
        result_list.append(i)
        result_list.sort()
        result_list = result_list[:10]
    time_c.append(time.time() - t)

def b(nums, size_of_result_list):
    # B
    # Read it all, sort it with timsort, truncate
    # memory = O(size_of_array) *violates the first premise, but just for fun
    # time = O(size_of_array*log(size_of_array)) *optimized and written in C
    t = time.time()
    nums = list(nums)
    nums.sort()
    result_list = nums[:size_of_result_list]
    time_b.append(time.time() - t)


def d(nums, size_of_result_list):
    # D
    # HeapQ
    # memory = O(size_of_result_list)
    # time = O(size_of_array*log(size_of_array))
    t = time.time()
    result_list = heapq.nsmallest(size_of_result_list, nums)
    time_d.append(time.time() - t)


def test(num_of_tests, size_of_array, size_of_result_list):
    for i in xrange(num_of_tests):
        a(random_gen(size_of_array), size_of_result_list)
        b(random_gen(size_of_array), size_of_result_list)
        c(random_gen(size_of_array), size_of_result_list)
        d(random_gen(size_of_array), size_of_result_list)

    print "With", size_of_array, "items, outputting ", size_of_result_list, "smallest in", num_of_tests, "tests"
    print "Insertion Sort -> Total:", sum(time_a), ", Average:", sum(time_a)/len(time_a)
    print "Lazy Insertion Sort -> Total:", sum(time_c), ", Average:", sum(time_c)/len(time_c)
    print "Read All, TimSort -> Total:", sum(time_b), ", Average:", sum(time_b)/len(time_b)
    print "HeapQ Insertion -> Total:", sum(time_d), ", Average:", sum(time_d)/len(time_d)

if __name__ == "__main__":
    test(10000, 10000, 100)