Extract Content from an "A" Tag "href" attribute in HTML using a data URL parser written in Python

A Stack Overflow solution for parsing data URLs that are "href" attributes of an "a" tag.

My first thought was that this question on Stack Overflow was a stupid anti-pattern, because anybody can split strings to get the data they want.

I do consider string splitting a hack, though, when the data follows a scheme and could be handled by a proper parser. I have been hacking together data URLs to embed images into a website by joining strings. It felt like a stupid anti-pattern. I continued the hack because I didn't know that data URIs were a thing.
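For what it's worth, the string joining I was doing can at least be wrapped in one small function. This is a minimal sketch using only the standard library; the name `make_data_url` and the sample bytes are my own illustration and do not come from the Stack Overflow answer:

```python
import base64


def make_data_url(payload: bytes, mimetype: str = "application/octet-stream") -> str:
    """Build a base64 data URL from raw bytes (hypothetical helper)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mimetype};base64,{encoded}"


# Placeholder bytes stand in for the contents of a real image file.
print(make_data_url(b"hello", "text/plain"))
# data:text/plain;base64,aGVsbG8=
```

For an actual image, the payload would be the file's bytes and the MIME type something like "image/png".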

The poser of the question then edited it to clarify that he wanted a proper, pure-Python parser for data URLs that are the value of an "href" attribute on an HTML "a" tag.

I'm happy he did because I learned something new. And now I know better.

It turns out there is a parser and data URIs are a well-defined entity.
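To make "well-defined" concrete: RFC 2397 describes a data URL as `data:[<mediatype>][;base64],<data>`. The sketch below is a rough, standard-library-only reading of that layout. The function name `parse_data_url` and the returned dictionary keys are my own assumptions, chosen to mirror the attributes used in the cells further down; it is not how the `datauri` package is implemented and it skips many edge cases.

```python
import base64
from urllib.parse import unquote_to_bytes


def parse_data_url(url: str) -> dict:
    """Split a data URL into its parts (hypothetical helper, not the datauri API)."""
    scheme, colon, rest = url.partition(":")
    if not colon or scheme.lower() != "data":
        raise ValueError("not a data URL")
    header, comma, payload = rest.partition(",")
    if not comma:
        raise ValueError("data URL is missing the comma before the data")
    if not header:
        # RFC 2397 default when the media type is omitted entirely
        header = "text/plain;charset=US-ASCII"

    params = header.split(";")
    is_base64 = params[-1].lower() == "base64"
    if is_base64:
        params = params[:-1]

    mimetype = params[0] or "text/plain"
    charset = None
    for param in params[1:]:
        key, _, value = param.partition("=")
        if key.lower() == "charset":
            charset = value

    # Percent-decoding comes first; base64 decoding applies only when flagged
    raw = unquote_to_bytes(payload)
    if is_base64:
        raw = base64.b64decode(raw)
    return {"mimetype": mimetype, "charset": charset, "is_base64": is_base64, "data": raw}


print(parse_data_url("data:text/csv;charset=UTF-8,%22csvcontentfollows"))
```

Unlike the library output shown later, this sketch returns the data as bytes and leaves decoding with the charset to the caller.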

Usually when I think "This is a stupid anti-pattern," I force myself to do some research. Many times I change my mind from "This is a stupid anti-pattern" to "This is what learning is."

Some things do remain stupid anti-patterns, though, and knowing which ones is a skill, too.

For example, trying to parse HTML with regular expressions is a stupid anti-pattern.
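A small illustration of why, as a sketch with made-up input: the HTML snippet, the regular expression, and the `HrefCollector` class are all hypothetical examples of mine. A naive pattern picks up an anchor inside a comment and misses the real one, which uses single quotes and has an extra attribute, while the standard library's `html.parser` handles both.

```python
import re
from html.parser import HTMLParser

html = """<!-- <a href="data:text/plain,old"> -->
<a class='x' href='data:text/plain,new'>link</a>"""

# Naive regex: expects href right after "<a " and double quotes only
naive = re.findall(r'<a href="([^"]*)"', html)


class HrefCollector(HTMLParser):
    """Collect href values from real <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


collector = HrefCollector()
collector.feed(html)
collector.close()

print(naive)            # ['data:text/plain,old']
print(collector.hrefs)  # ['data:text/plain,new']
```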

First accepted answer on Stack Overflow

Resources

In [4]:
html_string = """
<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">
"""
In [5]:
import lxml.etree
from datauri import DataURI

tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

uris = (
    DataURI(item.attrib["href"])
    for item in tree.iter("a")
    # Skip anchors whose href is missing or is not a data URL
    if item.attrib.get("href", "").startswith("data:")
)
attrs = ("mimetype", "charset", "is_base64", "data")
print([{attr: getattr(uri, attr) for attr in attrs} for uri in uris])
[{'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': '"csvcontentfollows'}]
In [6]:
from html.parser import HTMLParser
from datauri import DataURI

uri_attrs = ("mimetype", "charset", "is_base64", "data")

class MyHTMLParser(HTMLParser):
    
    def __init__(self):
        super().__init__()
        self.data = []
    
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Parse only href values that are data URLs; other attributes are ignored
            if name == "href" and value and value.startswith("data:"):
                uri = DataURI(value)
                self.data.append({attr: getattr(uri, attr) for attr in uri_attrs})
        
parser = MyHTMLParser()
parser.feed(html_string)
print(parser.data)
[{'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': '"csvcontentfollows'}]