Stack Overflow solution
- Question: Extract a content from
- Answer: My Answer
My first thought was that this question on Stack Overflow was a stupid anti-pattern, because anybody can split strings to get the wanted data. I consider that a hack, though, if the data has a scheme and could be parsed by a proper parser.
The poser of the question then edited it to clarify what he wanted. I'm happy he did, because I learned something new. I have been creating the data URLs I embed into a website by joining strings.
It turns out there is a parser, and data URLs follow a well-defined scheme. Usually when I think "this is a stupid anti-pattern", I then do some research and learn something along the way. Some things do remain stupid anti-patterns, though, and knowing which ones is a skill, too.
For example, trying to parse HTML with regular expressions is still a stupid anti-pattern.
```python
html_string = """
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,csvcontentfollows">
"""
```
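```python
from contextlib import ExitStack
from urllib.request import urlopen

import lxml.etree

HREF = "href"

# Parse the markup with a real HTML parser instead of splitting strings.
tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

# Lazily collect the href attribute of every element that has one.
uris = (
    item.attrib[HREF]
    for item in tree.iterdescendants()
    if HREF in item.attrib
)

# urlopen understands the data: scheme; ExitStack closes every response on exit.
with ExitStack() as stack:
    resources = (stack.enter_context(urlopen(uri)) for uri in uris)
    data = [fh.read().decode() for fh in resources]

print(data)
```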
['csvcontentfollows', 'csvcontentfollows', 'csvcontentfollows']
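Since I used to build data URLs by joining strings, here is a minimal sketch of the round trip: build a base64 data URL, then let `urlopen` take it apart again. The helper names `make_data_url` and `read_data_url` and the sample CSV payload are my own assumptions for illustration and are not part of the Stack Overflow answer. The sketch relies on the response object exposing the media type and charset through its `headers` attribute.

```python
import base64
from typing import Optional, Tuple
from urllib.request import urlopen


def make_data_url(payload: bytes, media_type: str = "text/csv;charset=UTF-8") -> str:
    """Build a base64 data URL (hypothetical helper for this sketch)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


def read_data_url(uri: str) -> Tuple[str, Optional[str], bytes]:
    """Return (media type, charset, raw bytes) parsed from a data URL."""
    with urlopen(uri) as response:
        headers = response.headers
        return (
            headers.get_content_type(),
            headers.get_content_charset(),
            response.read(),
        )


# Sample payload, assumed purely for illustration.
uri = make_data_url("a,b\n1,2\n".encode("utf-8"))
media_type, charset, raw = read_data_url(uri)

# Decode with the charset declared in the URL, falling back to ASCII.
text = raw.decode(charset or "ascii")
print(media_type, charset, repr(text))
```

Reading the charset from the headers avoids hard-coding an encoding in `decode()`, which the snippet above gets away with only because the content happens to be plain ASCII.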