# Extract Content from an "href" Attribute of an "A" tag in HTML using Python

## Stack Overflow solution

My first thought was that this question on Stack Overflow was a stupid anti-pattern, because anybody can split strings to get the wanted data. I consider string splitting a hack, though, when the data follows a scheme and could be handled by a proper parser.

The asker later edited the question to clarify what he wanted. I'm happy he did, because I learned something new. Until now I have been creating the data URLs that I embed into a website by joining strings.
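For context, here is a minimal sketch of what building a data URL by joining strings can look like. The helper name `make_data_url` and the sample CSV text are hypothetical and only serve as illustration; the one detail that matters is percent-encoding the payload with `urllib.parse.quote` so that commas, newlines, and `#` characters survive the trip.

```python
from urllib.parse import quote
from urllib.request import urlopen


def make_data_url(text, media_type="text/csv", charset="UTF-8"):
    """Hypothetical helper: build a percent-encoded data URL from text."""
    # quote() encodes the text as UTF-8 and escapes reserved characters.
    return f"data:{media_type};charset={charset},{quote(text)}"


csv_text = "name,qty\nwidget,2\n"  # made-up sample payload
url = make_data_url(csv_text)
print(url)

# Reading the URL back with the standard library should return the same text.
with urlopen(url) as response:
    round_trip = response.read().decode("utf-8")

print(round_trip == csv_text)
```

For binary payloads, base64 encoding with a `;base64` marker before the comma is the usual alternative to percent-encoding.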

It turns out there is a well-defined scheme for data URLs, and there is a parser for it. Usually when I think "this is a stupid anti-pattern", I then do some research and learn something along the way. Some things do remain stupid anti-patterns, though, and knowing which ones is a skill, too.

For example, trying to parse HTML with regular expressions is a stupid anti-pattern.
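A small sketch of why that is, using made-up markup: a naive regular expression happily matches a link inside an HTML comment and misses a single-quoted attribute, while a real parser gets both cases right. The `HrefCollector` class is a hypothetical name, built on the standard library's `html.parser.HTMLParser`.

```python
import re
from html.parser import HTMLParser

# Made-up markup: one commented-out link, one single-quoted live link.
html = """
<!-- <a href="data:text/plain,old">retired link</a> -->
<a href='data:text/plain,current'>single-quoted link</a>
"""

# Naive approach: only understands double quotes and knows nothing of comments.
naive = re.findall(r'href="([^"]*)"', html)


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect href values of <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(
                value for name, value in attrs
                if name == "href" and value is not None
            )


collector = HrefCollector()
collector.feed(html)
collector.close()

print(naive)            # the commented-out link only
print(collector.hrefs)  # the live link only
```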

### Resources

In [1]:
html_string = """
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
"""


### Update: A comment on Stack Overflow reveals there is native Python support for data URIs.

In [3]:
from contextlib import ExitStack
from urllib.request import urlopen
import lxml.etree

HREF = "href"

tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

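# Collect the href value of every element that has one.
uris = (
    item.attrib[HREF]
    for item in tree.iterdescendants()
    if HREF in item.attrib
)

# ExitStack closes every opened response when the block exits.
with ExitStack() as stack:
    resources = (stack.enter_context(urlopen(uri)) for uri in uris)
    data = [fh.read().decode() for fh in resources]
print(data)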

['csvcontentfollows', 'csvcontentfollows', 'csvcontentfollows']
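As a follow-up sketch, the same `urlopen` call also appears to cope with base64-encoded data URIs, and the response headers expose the declared media type and charset. The payload below is made up for illustration, and falling back to ASCII when no charset is declared is an assumption on my part.

```python
import base64
from urllib.request import urlopen

payload = "a,b\n1,2\n"  # made-up CSV content
encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
uri = f"data:text/csv;charset=UTF-8;base64,{encoded}"

with urlopen(uri) as response:
    # The headers object is an email.message.Message built from the URI.
    media_type = response.headers.get_content_type()
    charset = response.headers.get_content_charset() or "ascii"
    text = response.read().decode(charset)

print(media_type, repr(text))
```

Decoding with the declared charset instead of the default avoids surprises when a data URI carries something other than UTF-8.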