Writing XML/HTML content with Python

Writing XML/HTML content with Python

Resources

Molina, Alessandro

Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python

Packt Publishing. Kindle Edition.

In [13]:
from IPython.display import display, IFrame
In [14]:
display(
    IFrame(
        src="https://read.amazon.com/kp/card?asin=B07C5Q59ZZ&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_vLO2DbDBHB5GJ",
        width="336",
        height="550",
    )
)

Writing SGML-based languages is generally not very hard, most languages provide utilities to work with them, but if the document gets too big, it's easy to get lost when trying to build the tree of elements programmatically.

Ending up with hundreds of .addChild or similar calls all after each other makes it really hard to understand where we were in the document and what part of it we are currently editing.

Thankfully, by joining the Python ElementTree module with context managers, we can have a solution that allows our code structure to match the structure of the XML/HTML we are trying to generate.

Molina, Alessandro. Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python . Packt Publishing. Kindle Edition.

In [15]:
import xml.etree.ElementTree as ET
from contextlib import contextmanager
In [16]:
import pprint as _pprint
In [17]:
def custom_print(items, indent=4, width=None, minwidth=5):
    width = (
        min(length if (length := len(item)) >= minwidth else minwidth for item in items)
        if width is None
        else width
    )
    _pprint.pprint(items, indent=indent, width=width)

Output from ET.Element??:

Init signature: ET.Element(self, /, *args, **kwargs)
Docstring:      <no docstring>
Source:        
class Element:
    """An XML element.

    This class is the reference implementation of the Element interface.

    An element's length is its number of subelements.  That means if you
    want to check if an element is truly empty, you should check BOTH
    its length AND its text attribute.

    The element tag, attribute names, and attribute values can be either
    bytes or strings.

    *tag* is the element name.  *attrib* is an optional dictionary containing
    element attributes. *extra* are additional element attributes given as
    keyword arguments.

    Example form:
        <tag attrib>text<child/>...</tag>tail

    """
In [18]:
class XMLDocument:
    def __init__(self, root="document", mode="xml"):
        self._root = ET.Element(root) # self._root has append method
        self._mode = mode

    def __str__(self):

        return ET.tostring(self._root, encoding="unicode", method=self._mode)

    def write(self, fobj):
        ET.ElementTree(self._root).write(fobj)

    def __enter__(self):
        return XMLDocumentBuilder(self._root)

    def __exit__(self, exc_type, value, traceback):
        return None


class XMLDocumentBuilder:
    def __init__(self, root):
        self._current = [root]

    def tag(self, *args, **kwargs):
        el = ET.Element(*args, **kwargs)
        self._current[-1].append(el)

        @contextmanager
        def _context():
            self._current.append(el)
            try:
                yield el
            finally:
                self._current.pop()

        return _context()

    def text(self, text):
        if self._current[-1].text is None:
            self._current[-1].text = ""
        self._current[-1].text += text

We can then use our XMLDocument to build the document we want. For example, we can build web pages in HTML mode:

Molina, Alessandro. Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python . Packt Publishing. Kindle Edition.

In [19]:
HTML = "html"
doc = XMLDocument(HTML, mode=HTML)

with doc as _:  # _ is an instance of XMLDocumentBuilder
    with _.tag("head"):
        with _.tag("title"):
            _.text("This is the title.")
    with _.tag("body"):
        with _.tag("div", id="main-div"):
            with _.tag("h1"):
                _.text("My Document")
            with _.tag("strong"):
                _.text("Hello World")
            _.tag("img", src="https://placeholder.apps.selfip.com/image/150x150")
In [20]:
from chamelboots.html.utils import prettify_html
In [21]:
print(prettify_html(doc.__str__()))
<html>
 <head>
  <title>
   This is the title.
  </title>
 </head>
 <body>
  <div id="main-div">
   <h1>
    My Document
   </h1>
   <strong>
    Hello World
   </strong>
   <img src="https://placeholder.apps.selfip.com/image/150x150"/>
  </div>
 </body>
</html>

Transfer the file to my static web server.

In [22]:
import tempfile
from subprocess import check_output
import shlex
from pathlib import Path
from time import sleep

# add .transferred for listener script on remote server
_, filepath_ = tempfile.mkstemp(suffix=".transferred.html", prefix="xmlhtml_")
filepath = Path(filepath_)
doc.write(filepath)
# webshare is a bash scipt that uses scp to copy the file to the server
check_output(shlex.split(f"webshare {filepath}"))
Out[22]:
b''
In [23]:
filepath
Out[23]:
PosixPath('/tmp/xmlhtml_f877db9j.transferred.html')

Display new HTML document.

In [24]:
sleep(3) # Give time for script on webserver to run and change file permissions.
new_path = '.'.join(filepath.name.split('.')[::2])
src = f"https://static.apps.selfip.com/{new_path}"
print(src)
print(new_path)
IFrame(
    src=src,
    width="auto",
    height="300",
)
https://static.apps.selfip.com/xmlhtml_f877db9j.html
xmlhtml_f877db9j.html
Out[24]:

XMLDocumentBuilder keeps a stack of nodes to track where we are in the tree (XMLDocumentBuilder._current). The tail of that list will always tell us which tag we're currently inside.

The interesting part is that the XMLDocumentBuilder.tag method also returns a context manager. On entry, it will set the entered tag as the currently active one and on exit, it will recover the previously active node.

That allows us to nest XMLDocumentBuilder.tag calls and generate a tree of tags

The actual document node can be grabbed through as, so in previous examples we were able to grab the title node that was just created and set a text for it, but XMLDocumentBuilder.text would have worked too because the title node was now the active element once we entered its context.

There's more…

There is one trick that I frequently apply when using this recipe. It makes it a bit harder to understand what's going on, on the Python side, and that's the reason why I avoided doing it while explaining the recipe itself, but it makes the HTML/XML structure even more readable by getting rid of most Python noise.

If you assign the XMLDocumentBuilder.tag and XMLDocumentBuilder.text methods to some short names, you can nearly disappear the fact that you are calling Python functions and make the XML structure more relevant

In [25]:
doc = XMLDocument('html', mode="html")

with doc as builder:
    _ = builder.tag
    _t = builder.text
    
    with _("head"):
        with _("title"): _t("This is the title.")
    with _("body"):
        with _("div", id="main-div"):
            with _("h1"): _t("My Document")
            with _("strong"): _t("Hello World")
            _("img", scr="https://placeholder.apps.selfip.com/image/325x325")
print(prettify_html(doc.__str__()))
<html>
 <head>
  <title>
   This is the title.
  </title>
 </head>
 <body>
  <div id="main-div">
   <h1>
    My Document
   </h1>
   <strong>
    Hello World
   </strong>
   <img scr="https://placeholder.apps.selfip.com/image/325x325"/>
  </div>
 </body>
</html>

Use with chamelboots.

Is it possible?

In [26]:
import pprint
import itertools as it
from functools import partial

from chamelboots import ChameleonTemplate as CT
from chamelboots import TalStatement as TS
from chamelboots.constants import Join, FAKE
In [27]:
from llist import dllist, dllistnode

dll = dllist()
dll.appendleft(dllistnode(dllist(range(10, 20))))
dll.appendright(dllistnode(dllist(range(10))))
list(dll)
Out[27]:
[dllist([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
 dllist([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
In [28]:
import operator as op
from functools import reduce
In [39]:
from chamelboots.datautils import paths_in_data, get_from
from chamelboots.html import dictdata
from chamelboots.constants import HTML_PARSER
from lxml import etree
In [84]:
d = dictdata(
    etree.fromstring(
        """<top>
 <!--Generated for PyMOTW-->
 <child>
  This child contains text.
 </child>
 <child_with_tail>
  This child has regular text.
 </child_with_tail>
 And "tail" text.
 <child_with_entity_ref>
  This &amp; that
 </child_with_entity_ref>
</top>

""",
        HTML_PARSER,
    )
)
path = paths_in_data(d)[0]
# print(path)
example = get_from(d, path[:6])
# print({'top': {'html': [{'head': {}}, {'body': {}}]}})
# print(paths_in_data(d['html']))

paths_in_data({})
Out[84]:
dict_keys(['inner_content', 'attribs', 'tail'])
In [33]:
CLASS = "class"

ATTRIBUTES, CONTENT = "attributes", "content"
TAL_STATEMENTS = TSA, TSC = tuple(
    TS(*args) for args in ((item,) * 2 for item in (ATTRIBUTES, CONTENT))
)


class HTMLDocument:
    def __init__(self, root="html", mode="html"):
        # self._root = ET.Element(root) # self._root has append method
        self._root = [root]
        self._mode = mode

    def __enter__(self):
        return HTMLDocumentBuilder(self._root)

    def __exit__(self, exc_type, value, traceback):
        return None


class HTMLDocumentBuilder:

    parents = ("head", "body")

    def __init__(self, root):
        self._current = [root]
        (self.current_context_tag,) = root
        self.path = [
            self.current_context_tag,
        ]
        self.tree = {
            'top': {self.current_context_tag: [{parent: {}} for parent in self.parents]}
        }

    def paths_in_data_(self, obj):
        return paths_in_data(obj)

    def get_by_path(self, root, items):
        """Access a nested object in root by item sequence."""
        return reduce(op.getitem, items, root)

    def set_by_path(self, root, items, value):
        """Set a value in a nested object in root by item sequence."""
        # return value of get_by_path has to be a dict
        self.get_by_path(root, items[:-1])[items[-1]] = value

    def tag(self, tag, **kwargs):
        item = (tag,)
        element = [item]
        self._current[-1].append(element)

        @contextmanager
        def _context():  # runs when context entered using "with"
            self._current.append(element)

            # work on tree
            self.path.append(tag)
            self.current_context_tag = tag
            print(self.tree)

            try:
                yield element
            finally:
                self._current.pop()

        return _context()


doc = HTMLDocument()

with doc as builder:
    _ = builder.tag

    with _("head", attributes={}, content=""):
        _("meta", attributes={}, content="")
        _("title", attributes={}, content=FAKE.word())
    with _("body", attributes={}, content=""):
        with _("div", attributes={CLASS: "aa"}, content=""):
            _("span", attributes={CLASS: "bb"}, content=FAKE.paragraph())
        with _("div", attributes={CLASS: "a"}, content=""):
            _("span", attributes={CLASS: "b"}, content=FAKE.paragraph())
            with _("div", attributes={CLASS: "c"}, content=""):
                _("span", attributes={CLASS: "d"}, content=FAKE.paragraph())
                with _("p", attributes={CLASS: "e"}, content=""):
                    _("span", attributes={CLASS: "1"}, content=FAKE.paragraph())
                    _("span", attributes={CLASS: "2"}, content=FAKE.paragraph())
                    _("span", attributes={CLASS: "3"}, content=FAKE.paragraph())
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
In [34]:
from xml.etree.ElementTree import Element, SubElement, tostring, XML
In [35]:
COUNT = 3
# XML eats the parent
children = XML(
    CT("div", (TS(CONTENT, f"structure {CONTENT}"),)).render(
        content=CT(
            "p",
            (
                TS("repeat", "item items"),
                TS("content", "item"),
                TS(ATTRIBUTES, "next(attributes)"),
            ),
        ).render(
            items=range(COUNT),
            attributes=iter(dict(id=hex(id(dict()))) for _ in range(COUNT)),
        )
    )
)
In [36]:
top = Element('html')

HEAD = SubElement(top, 'head')
BODY = SubElement(top, 'body')
BODY.extend(children)

print(prettify_html(tostring(top).decode()))
<html>
 <head>
 </head>
 <body>
  <p id="0x7f39203b9ad0">
   0
  </p>
  <p id="0x7f39203f72f0">
   1
  </p>
  <p id="0x7f39203698f0">
   2
  </p>
 </body>
</html>
In [37]:
top.getchildren()
Out[37]:
[<Element 'head' at 0x7f3910736cb0>, <Element 'body' at 0x7f3910957a10>]

Create XML/HTML document

In [38]:
from xml.etree.ElementTree import Element, SubElement, Comment, tostring

top = Element('top')

comment = Comment('Generated for PyMOTW')
top.append(comment)

child = SubElement(top, 'child')
child.text = 'This child contains text.'

child_with_tail = SubElement(top, 'child_with_tail')
child_with_tail.text = 'This child has regular text.'
child_with_tail.tail = 'And "tail" text.'

child_with_entity_ref = SubElement(top, 'child_with_entity_ref')
child_with_entity_ref.text = 'This & that'

print(prettify_html(tostring(top).decode()))
<top>
 <!--Generated for PyMOTW-->
 <child>
  This child contains text.
 </child>
 <child_with_tail>
  This child has regular text.
 </child_with_tail>
 And "tail" text.
 <child_with_entity_ref>
  This &amp; that
 </child_with_entity_ref>
</top>