split_2.py
def get_value(a):
    return a[1:a.find(">")-1]

hrefs = map(get_value, open("hypertext.html", "r").read().split("<a href="))
Timing Comparison: Roughly 3x Faster
Note: hypertext.html is 48MB.
braydon@bgf:~/python_tests/extract$ time python split.py
real    0m1.263s
user    0m1.112s
sys     0m0.156s
braydon@bgf:~/python_tests/extract$ time python split_2.py
real    0m0.392s
user    0m0.268s
sys     0m0.120s
split.py
Previously, I had found that the best solution to my problem was to split() the large string on the “>” character and then reduce() the pieces to a list of hyperlinks.
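from functools import reduce  # a builtin on Python 2; Python 3 needs this import

def is_ahref(a, b):
    # Keep a chunk only if it contains a link: skip past '<a href="'
    # and drop the closing quote at the end of the chunk.
    y = b.find("<a href=")
    if y != -1:
        a.append(b[y+9:-1])
    return a

def preduce(fn, ls, a):
    # Put the accumulator at the front so reduce() starts from it.
    ls.insert(0, a)
    return reduce(fn, ls)

hrefs = preduce(is_ahref, open("hypertext.html", "r").read().split(">"), [])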
There is a better solution. Splitting the text on the “>” character is wasteful: HTML contains many “>”s, and most of the resulting pieces will not contain a hyperlink. If we instead split the string on “<a href=”, every item in the list (apart from whatever precedes the first link) starts with an href value, so we don’t even need to check for one, and we can reduce the list as before.
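from functools import reduce  # a builtin on Python 2; Python 3 needs this import

def is_ahref(a, b):
    # Each chunk starts with the quoted URL; slice off the surrounding quotes.
    z = b.find(">")
    a.append(b[1:z-1])
    return a

def preduce(fn, ls, a):
    ls.insert(0, a)
    return reduce(fn, ls)

fc = open("hypertext.html", "r").read().split("<a href=")
hrefs = preduce(is_ahref, fc, [])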
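To make the difference between the two splits concrete, here is a small sketch that counts the chunks each one produces. The `sample` string is made up for illustration and is not taken from hypertext.html.

```python
# Hypothetical sample markup, used only for illustration.
sample = (
    '<html><body><p>Intro</p>'
    '<a href="http://example.com/a">A</a>'
    '<div><span>filler</span></div>'
    '<a href="http://example.com/b">B</a>'
    '</body></html>'
)

by_gt = sample.split(">")           # one chunk per ">" in the markup, plus one
by_href = sample.split("<a href=")  # one leading chunk, plus one per link

# Chunks from the ">" split that actually contain a link.
with_link = [c for c in by_gt if "<a href=" in c]

print(len(by_gt), len(with_link))
print(len(by_href))
```

Even on this tiny fragment, most chunks from the “>” split are searched for nothing, while the “<a href=” split produces only the leading text plus one chunk per link.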
However, because the resulting list will be exactly the same size as the one we started with, we don’t need reduce() at all; we can simply map() a function over the entire list, ‘reducing’ each item to just its hyperlink.
split_2.py
def get_value(a):
    return a[1:a.find(">")-1]

hrefs = map(get_value, open("hypertext.html", "r").read().split("<a href="))
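On Python 3, map() returns a lazy iterator rather than a list, so the same idea might be written with a list comprehension instead. This is only a sketch, not the script that was timed above. The names `extract_hrefs` and `hrefs_from_file` are hypothetical, and it assumes every link is written exactly as `<a href="...">` with double quotes and no further attributes. Unlike split_2.py, it also skips the first chunk, which is the text before the first link.

```python
def extract_hrefs(html):
    """Return the href values found in html (hypothetical helper)."""
    chunks = html.split("<a href=")
    # chunks[0] is the text before the first link, so it is skipped.
    # Each remaining chunk looks like '"URL">...'; slice off the quotes.
    return [c[1:c.find(">") - 1] for c in chunks[1:]]


def hrefs_from_file(path):
    """Read the whole file at path and extract its hrefs (hypothetical helper)."""
    with open(path, "r") as f:
        return extract_hrefs(f.read())
```

Calling `hrefs_from_file("hypertext.html")` would then play the role of the one-liner above, with the file closed explicitly once it has been read.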

