Extract Content from an "href" Attribute of an "A" tag in HTML using Python

Stack Overflow solution

My first thought was that this question on Stack Overflow was a stupid anti-pattern, because anybody can split strings to get the wanted data. I consider string splitting a hack, though, when the data follows a scheme and could be handled by a proper parser.

The poser of the question later edited it to clarify that he wanted the content of a data URL. I'm happy he did, because I learned something new. Until now I have been creating the data URLs that I embed into a website by joining strings.
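For illustration, building a data URL by joining strings might look like the sketch below. The helper name `make_data_url` and its parameters are hypothetical and not from the Stack Overflow question. It assumes the payload is percent-encoded with `urllib.parse.quote`, or base64-encoded when requested.

```python
import base64
from urllib.parse import quote


def make_data_url(payload: bytes,
                  mediatype: str = "text/csv;charset=UTF-8",
                  use_base64: bool = False) -> str:
    """Build a data URL by joining strings (hypothetical helper)."""
    if use_base64:
        # Base64 keeps arbitrary binary payloads URL-safe.
        body = base64.b64encode(payload).decode("ascii")
        return f"data:{mediatype};base64,{body}"
    # Percent-encode characters such as commas and newlines.
    return f"data:{mediatype},{quote(payload)}"


print(make_data_url(b"a,b\n1,2\n"))
```

The base64 form is usually the safer choice for binary content, while the percent-encoded form stays readable for short text.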

It turns out that data URLs follow a well-defined scheme and that there is a parser for them. Usually when I think "this is a stupid anti-pattern", I do some research and learn something along the way. Some things do remain stupid anti-patterns, though, and knowing which ones is a skill, too.

For example, trying to parse HTML with regular expressions is a stupid anti-pattern.
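As a minimal sketch of the alternative, the standard library's `html.parser` can collect `href` values without any regular expressions. The class `HrefCollector` and the function `collect_hrefs` are names made up for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "a":
            self.hrefs.extend(
                value for name, value in attrs
                if name == "href" and value is not None
            )


def collect_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


print(collect_hrefs('<a href="data:text/csv;charset=UTF-8,x">one</a>'))
```

A real parser copes with attribute order, quoting styles, and letter case, which is where regular expressions tend to break.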


In [1]:
html_string = """
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
"""

Update: A comment on Stack Overflow reveals there is native Python support for data URIs.

In [3]:
from contextlib import ExitStack
from urllib.request import urlopen
import lxml.etree

HREF = "href"

tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

# Lazily yield the href value of every element that has one.
uris = (
    item.attrib[HREF]
    for item in tree.iterdescendants()
    if HREF in item.attrib
)

with ExitStack() as stack:
    resources = (stack.enter_context(urlopen(uri)) for uri in uris)
    data = [fh.read().decode() for fh in resources]

data
Out[3]:
['csvcontentfollows', 'csvcontentfollows', 'csvcontentfollows']
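The response that `urlopen` returns for a data URI also carries the declared media type and charset in its headers. The sketch below reads them for a single URI. The helper name `read_data_uri` is hypothetical, and the fallback to ASCII when no charset is declared is an assumption of this example.

```python
from urllib.request import urlopen


def read_data_uri(uri):
    """Return (media type, decoded text) for one data URI."""
    with urlopen(uri) as response:
        # Assumption: fall back to ASCII when no charset is declared.
        charset = response.headers.get_content_charset() or "ascii"
        media_type = response.headers.get_content_type()
        return media_type, response.read().decode(charset)


print(read_data_uri("data:text/csv;charset=UTF-8,csvcontentfollows"))
print(read_data_uri("data:text/plain;base64,aGVsbG8="))
```

Base64-encoded payloads are decoded by `urlopen` itself, so the caller does not need to handle the `;base64` marker.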