Python Performance Part 2 Redux: Split & Reduce Large Strings for 'A Href' Hypertext

split_2.py

def get_value(a):
    return a[1:a.find(">")-1]
hrefs = map(get_value,open("hypertext.html","r").read().split("<a href="))

Timing Comparison: Roughly 3.2x Faster (1.263s vs 0.392s Real Time)

Note: hypertext.html is 48MB.

braydon@bgf:~/python_tests/extract$ time python split.py 

real    0m1.263s
user    0m1.112s
sys     0m0.156s

braydon@bgf:~/python_tests/extract$ time python split_2.py 

real    0m0.392s
user    0m0.268s
sys     0m0.120s
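
The figures above come from the shell's time command, so they also include interpreter startup and reading the file from disk. To compare only the two extraction strategies (splitting on ">" as in split.py, and splitting on "<a href=" as in split_2.py), something like the following could be used. It is a rough sketch, assuming Python 3 and a synthetic page built in memory; the make_page helper and its contents are hypothetical, so its numbers will not match the ones above.

```python
import time
from functools import reduce

def make_page(n_links):
    """Build a synthetic HTML string with n_links anchors plus non-link tags."""
    row = '<li><b>item</b> <a href="http://example.com/%d">link</a></li>\n'
    return "<ul>\n" + "".join(row % i for i in range(n_links)) + "</ul>\n"

def split_on_gt(page):
    """split.py strategy: split on '>' and keep chunks that contain an href."""
    def is_ahref(acc, chunk):
        y = chunk.find("<a href=")
        if y != -1:
            acc.append(chunk[y + 9:-1])
        return acc
    return reduce(is_ahref, page.split(">"), [])

def split_on_href(page):
    """split_2.py strategy: split on '<a href=' and slice each chunk after the first."""
    return [c[1:c.find(">") - 1] for c in page.split("<a href=")[1:]]

def timed(fn, page):
    """Run fn(page) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(page)
    return result, time.perf_counter() - start

page = make_page(100000)
links_a, secs_a = timed(split_on_gt, page)
links_b, secs_b = timed(split_on_href, page)
print("split on '>':        %.3fs, %d links" % (secs_a, len(links_a)))
print("split on '<a href=': %.3fs, %d links" % (secs_b, len(links_b)))
```

A single run like this is noisy; repeating each call a few times and taking the best time would give steadier numbers.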

split.py

Previously, I had found that the best solution to my problem was to split() the large string on the “>” character, and then reduce the resulting list to a list of hyperlinks.

def is_ahref(a,b):
    y = b.find("<a href=")
    if y != -1: a.append(b[y+9:-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

hrefs = preduce(is_ahref,open("hypertext.html","r").read().split(">"),[])
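
As a side note, preduce seeds the reduction by inserting the accumulator at the front of the list. reduce() also accepts an initial value as a third argument, which gives the same result without the insert. A minimal sketch, assuming Python 3 (where reduce is imported from functools) and a small made-up HTML string standing in for hypertext.html:

```python
from functools import reduce  # a builtin on Python 2, imported on Python 3

def is_ahref(acc, chunk):
    """Append the href value to acc if chunk contains an '<a href=' tag."""
    y = chunk.find("<a href=")
    if y != -1:
        # Skip the 8 characters of '<a href=' plus the opening quote,
        # and drop the closing quote at the end of the chunk.
        acc.append(chunk[y + 9:-1])
    return acc

# Hypothetical sample input, not the 48MB file from the timings.
sample = '<p>See <a href="http://a.example/">A</a> and <a href="http://b.example/">B</a></p>'

# The third argument is the starting accumulator, so no insert(0, ...) is needed.
hrefs = reduce(is_ahref, sample.split(">"), [])
print(hrefs)
```

Like the original, this assumes the href is double-quoted and is the last attribute before the closing “>”.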

There is a better solution. Splitting the text on the “>” character is wasteful: there are many “>”s in HTML, and most of them do not belong to hyperlinks. If we instead split the string on “<a href=”, every item in the list after the first (which is just the text before the first link) starts with an href value, so we don’t even need to check whether an item is an href, and we can reduce the list as before.

def is_ahref(a,b):
    z = b.find(">")
    a.append(b[1:z-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

fc = open("hypertext.html","r").read().split("<a href=")
hrefs = preduce(is_ahref,fc,[])

However, because the resulting list has exactly as many items as the list we started with, we don’t need reduce() at all. We can simply map() a function over the entire list, ‘reducing’ each item to just its hyperlink.

split_2.py

def get_value(a):
    return a[1:a.find(">")-1]
hrefs = map(get_value,open("hypertext.html","r").read().split("<a href="))
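
One caveat with splitting on “<a href=”: the first chunk is whatever text comes before the first link, so it does not hold an href. Below is a sketch that skips it, assuming Python 3 (where map() returns a lazy iterator, hence the list() call), double-quoted href values that are the only attribute in the tag, and a made-up sample string rather than the real file:

```python
def get_value(chunk):
    """Return the text between the opening quote and the quote before '>'."""
    return chunk[1:chunk.find(">") - 1]

# Hypothetical sample input, not the 48MB file from the timings.
sample = '<p>See <a href="http://a.example/">A</a> and <a href="http://b.example/">B</a></p>'

chunks = sample.split("<a href=")
# chunks[0] is '<p>See ', the text before the first link, so it is skipped.
hrefs = list(map(get_value, chunks[1:]))
print(hrefs)
```

The slice chunks[1:] copies the list; for a very large page, itertools.islice(chunks, 1, None) would avoid that copy.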