{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing XML/HTML content with Python\n", "\n", "### Resources\n", "\n", "[Molina, Alessandro](https://medium.com/@__amol__/markov-chains-with-python-1109663f3678)\n", "\n", "_Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python_ \n", "\n", "Packt Publishing. Kindle Edition. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display, IFrame" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(\n", " IFrame(\n", " src=\"https://read.amazon.com/kp/card?asin=B07C5Q59ZZ&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_vLO2DbDBHB5GJ\",\n", " width=\"336\",\n", " height=\"550\",\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Writing SGML-based languages is generally not very hard, most languages provide utilities to work with them, but if the document gets too big, it's easy to get lost when trying to build the tree of elements programmatically. \n", "\n", "> Ending up with hundreds of .addChild or similar calls all after each other makes it really hard to understand where we were in the document and what part of it we are currently editing. \n", "\n", "> Thankfully, by joining the Python ElementTree module with context managers, we can have a solution that allows our code structure to match the structure of the XML/HTML we are trying to generate.\n", "\n", "Molina, Alessandro. Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python . Packt Publishing. Kindle Edition. 
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "from contextlib import contextmanager" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import pprint as _pprint\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def custom_print(items, indent=4, width=None, minwidth=5):\n", " width = (\n", " min(length if (length := len(item)) >= minwidth else minwidth for item in items)\n", " if width is None\n", " else width\n", " )\n", " _pprint.pprint(items, indent=indent, width=width)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output from ET.Element??:\n", "\n", "```python\n", "Init signature: ET.Element(self, /, *args, **kwargs)\n", "Docstring: \n", "Source: \n", "class Element:\n", " \"\"\"An XML element.\n", "\n", " This class is the reference implementation of the Element interface.\n", "\n", " An element's length is its number of subelements. That means if you\n", " want to check if an element is truly empty, you should check BOTH\n", " its length AND its text attribute.\n", "\n", " The element tag, attribute names, and attribute values can be either\n", " bytes or strings.\n", "\n", " *tag* is the element name. *attrib* is an optional dictionary containing\n", " element attributes. 
*extra* are additional element attributes given as\n", " keyword arguments.\n", "\n", " Example form:\n", " <tag attrib>text<child/>...</tag>tail\n", "\n", " \"\"\"\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "class XMLDocument:\n", "    def __init__(self, root=\"document\", mode=\"xml\"):\n", "        self._root = ET.Element(root)  # self._root has append method\n", "        self._mode = mode\n", "\n", "    def __str__(self):\n", "        return ET.tostring(self._root, encoding=\"unicode\", method=self._mode)\n", "\n", "    def write(self, fobj):\n", "        ET.ElementTree(self._root).write(fobj)\n", "\n", "    def __enter__(self):\n", "        return XMLDocumentBuilder(self._root)\n", "\n", "    def __exit__(self, exc_type, value, traceback):\n", "        return None\n", "\n", "\n", "class XMLDocumentBuilder:\n", "    def __init__(self, root):\n", "        self._current = [root]\n", "\n", "    def tag(self, *args, **kwargs):\n", "        el = ET.Element(*args, **kwargs)\n", "        self._current[-1].append(el)\n", "\n", "        @contextmanager\n", "        def _context():\n", "            self._current.append(el)\n", "            try:\n", "                yield el\n", "            finally:\n", "                self._current.pop()\n", "\n", "        return _context()\n", "\n", "    def text(self, text):\n", "        if self._current[-1].text is None:\n", "            self._current[-1].text = \"\"\n", "        self._current[-1].text += text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> We can then use our XMLDocument to build the document we want. For example, we can build web pages in HTML mode:\n", "\n", "Molina, Alessandro. Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python . Packt Publishing. Kindle Edition. 
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "HTML = \"html\"\n", "doc = XMLDocument(HTML, mode=HTML)\n", "\n", "with doc as _:  # _ is an instance of XMLDocumentBuilder\n", "    with _.tag(\"head\"):\n", "        with _.tag(\"title\"):\n", "            _.text(\"This is the title.\")\n", "    with _.tag(\"body\"):\n", "        with _.tag(\"div\", id=\"main-div\"):\n", "            with _.tag(\"h1\"):\n", "                _.text(\"My Document\")\n", "            with _.tag(\"strong\"):\n", "                _.text(\"Hello World\")\n", "            _.tag(\"img\", src=\"https://placeholder.apps.selfip.com/image/150x150\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from chamelboots.html.utils import prettify_html" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " This is the title.\n", " \n", " \n", " \n", "
\n", "

\n", " My Document\n", "

\n", " \n", " Hello World\n", " \n", " \n", "
\n", " \n", "\n" ] } ], "source": [ "print(prettify_html(str(doc)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transfer the file to my static web server." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b''" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tempfile\n", "from subprocess import check_output\n", "import shlex\n", "from pathlib import Path\n", "from time import sleep\n", "\n", "# Add the .transferred suffix for the listener script on the remote server.\n", "_, filepath_ = tempfile.mkstemp(suffix=\".transferred.html\", prefix=\"xmlhtml_\")\n", "filepath = Path(filepath_)\n", "doc.write(filepath)\n", "# webshare is a bash script that uses scp to copy the file to the server.\n", "check_output(shlex.split(f\"webshare {filepath}\"))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/tmp/xmlhtml_f877db9j.transferred.html')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filepath" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display new HTML document." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://static.apps.selfip.com/xmlhtml_f877db9j.html\n", "xmlhtml_f877db9j.html\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sleep(3) # Give time for script on webserver to run and change file permissions.\n", "new_path = '.'.join(filepath.name.split('.')[::2])\n", "src = f\"https://static.apps.selfip.com/{new_path}\"\n", "print(src)\n", "print(new_path)\n", "IFrame(\n", " src=src,\n", " width=\"auto\",\n", " height=\"300\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> XMLDocumentBuilder keeps a stack of nodes to track where we are in the tree (XMLDocumentBuilder._current). The tail of that list will always tell us which tag we're currently inside.\n", "\n", "> The interesting part is that the XMLDocumentBuilder.tag method also returns a context manager. On entry, it will set the entered tag as the currently active one and on exit, it will recover the previously active node.\n", "\n", "> That allows us to nest XMLDocumentBuilder.tag calls and generate a tree of tags\n", "\n", "> The actual document node can be grabbed through as, so in previous examples we were able to grab the title node that was just created and set a text for it, but XMLDocumentBuilder.text would have worked too because the title node was now the active element once we entered its context." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## There's more…\n", "\n", "> There is one trick that I frequently apply when using this recipe. 
It makes it a bit harder to understand what's going on, on the Python side, and that's the reason why I avoided doing it while explaining the recipe itself, but it makes the HTML/XML structure even more readable by getting rid of most Python noise.\n", "\n", "> If you assign the XMLDocumentBuilder.tag and XMLDocumentBuilder.text methods to some short names, you can nearly disappear the fact that you are calling Python functions and make the XML structure more relevant" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " This is the title.\n", " \n", " \n", " \n", "
\n", "

\n", " My Document\n", "

\n", " \n", " Hello World\n", " \n", " \n", "
\n", " \n", "\n" ] } ], "source": [ "doc = XMLDocument('html', mode=\"html\")\n", "\n", "with doc as builder:\n", "    _ = builder.tag\n", "    _t = builder.text\n", "\n", "    with _(\"head\"):\n", "        with _(\"title\"): _t(\"This is the title.\")\n", "    with _(\"body\"):\n", "        with _(\"div\", id=\"main-div\"):\n", "            with _(\"h1\"): _t(\"My Document\")\n", "            with _(\"strong\"): _t(\"Hello World\")\n", "            _(\"img\", src=\"https://placeholder.apps.selfip.com/image/325x325\")\n", "print(prettify_html(str(doc)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use with chamelboots.\n", "\n", "Is it possible?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "import pprint\n", "import itertools as it\n", "from functools import partial\n", "\n", "from chamelboots import ChameleonTemplate as CT\n", "from chamelboots import TalStatement as TS\n", "from chamelboots.constants import Join, FAKE" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[dllist([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),\n", " dllist([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from llist import dllist, dllistnode\n", "\n", "dll = dllist()\n", "dll.appendleft(dllistnode(dllist(range(10, 20))))\n", "dll.appendright(dllistnode(dllist(range(10))))\n", "list(dll)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "import operator as op\n", "from functools import reduce" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "from chamelboots.datautils import paths_in_data, get_from\n", "from chamelboots.html import dictdata\n", "from chamelboots.constants import HTML_PARSER\n", "from lxml import etree" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ 
"dict_keys(['inner_content', 'attribs', 'tail'])" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = dictdata(\n", " etree.fromstring(\n", " \"\"\"\n", " \n", " \n", " This child contains text.\n", " \n", " \n", " This child has regular text.\n", " \n", " And \"tail\" text.\n", " \n", " This & that\n", " \n", "\n", "\n", "\"\"\",\n", " HTML_PARSER,\n", " )\n", ")\n", "path = paths_in_data(d)[0]\n", "# print(path)\n", "example = get_from(d, path[:6])\n", "# print({'top': {'html': [{'head': {}}, {'body': {}}]}})\n", "# print(paths_in_data(d['html']))\n", "\n", "paths_in_data({})" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'top': {'html': [{'head': {}}, {'body': {}}]}}\n", "{'top': {'html': [{'head': {}}, {'body': {}}]}}\n", "{'top': {'html': [{'head': {}}, {'body': {}}]}}\n", "{'top': {'html': [{'head': {}}, {'body': {}}]}}\n", "{'top': {'html': [{'head': {}}, {'body': {}}]}}\n", "{'top': {'html': [{'head': {}}, {'body': {}}]}}\n" ] } ], "source": [ "CLASS = \"class\"\n", "\n", "ATTRIBUTES, CONTENT = \"attributes\", \"content\"\n", "TAL_STATEMENTS = TSA, TSC = tuple(\n", " TS(*args) for args in ((item,) * 2 for item in (ATTRIBUTES, CONTENT))\n", ")\n", "\n", "\n", "class HTMLDocument:\n", " def __init__(self, root=\"html\", mode=\"html\"):\n", " # self._root = ET.Element(root) # self._root has append method\n", " self._root = [root]\n", " self._mode = mode\n", "\n", " def __enter__(self):\n", " return HTMLDocumentBuilder(self._root)\n", "\n", " def __exit__(self, exc_type, value, traceback):\n", " return None\n", "\n", "\n", "class HTMLDocumentBuilder:\n", "\n", " parents = (\"head\", \"body\")\n", "\n", " def __init__(self, root):\n", " self._current = [root]\n", " (self.current_context_tag,) = root\n", " self.path = [\n", " self.current_context_tag,\n", " ]\n", " self.tree = {\n", " 'top': 
{self.current_context_tag: [{parent: {}} for parent in self.parents]}\n", " }\n", "\n", " def paths_in_data_(self, obj):\n", " return paths_in_data(obj)\n", "\n", " def get_by_path(self, root, items):\n", " \"\"\"Access a nested object in root by item sequence.\"\"\"\n", " return reduce(op.getitem, items, root)\n", "\n", " def set_by_path(self, root, items, value):\n", " \"\"\"Set a value in a nested object in root by item sequence.\"\"\"\n", " # return value of get_by_path has to be a dict\n", " self.get_by_path(root, items[:-1])[items[-1]] = value\n", "\n", " def tag(self, tag, **kwargs):\n", " item = (tag,)\n", " element = [item]\n", " self._current[-1].append(element)\n", "\n", " @contextmanager\n", " def _context(): # runs when context entered using \"with\"\n", " self._current.append(element)\n", "\n", " # work on tree\n", " self.path.append(tag)\n", " self.current_context_tag = tag\n", " print(self.tree)\n", "\n", " try:\n", " yield element\n", " finally:\n", " self._current.pop()\n", "\n", " return _context()\n", "\n", "\n", "doc = HTMLDocument()\n", "\n", "with doc as builder:\n", " _ = builder.tag\n", "\n", " with _(\"head\", attributes={}, content=\"\"):\n", " _(\"meta\", attributes={}, content=\"\")\n", " _(\"title\", attributes={}, content=FAKE.word())\n", " with _(\"body\", attributes={}, content=\"\"):\n", " with _(\"div\", attributes={CLASS: \"aa\"}, content=\"\"):\n", " _(\"span\", attributes={CLASS: \"bb\"}, content=FAKE.paragraph())\n", " with _(\"div\", attributes={CLASS: \"a\"}, content=\"\"):\n", " _(\"span\", attributes={CLASS: \"b\"}, content=FAKE.paragraph())\n", " with _(\"div\", attributes={CLASS: \"c\"}, content=\"\"):\n", " _(\"span\", attributes={CLASS: \"d\"}, content=FAKE.paragraph())\n", " with _(\"p\", attributes={CLASS: \"e\"}, content=\"\"):\n", " _(\"span\", attributes={CLASS: \"1\"}, content=FAKE.paragraph())\n", " _(\"span\", attributes={CLASS: \"2\"}, content=FAKE.paragraph())\n", " _(\"span\", attributes={CLASS: \"3\"}, 
content=FAKE.paragraph())" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from xml.etree.ElementTree import Element, SubElement, tostring, XML\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "COUNT = 3\n", "# XML eats the parent\n", "children = XML(\n", " CT(\"div\", (TS(CONTENT, f\"structure {CONTENT}\"),)).render(\n", " content=CT(\n", " \"p\",\n", " (\n", " TS(\"repeat\", \"item items\"),\n", " TS(\"content\", \"item\"),\n", " TS(ATTRIBUTES, \"next(attributes)\"),\n", " ),\n", " ).render(\n", " items=range(COUNT),\n", " attributes=iter(dict(id=hex(id(dict()))) for _ in range(COUNT)),\n", " )\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " \n", "

\n", " 0\n", "

\n", "

\n", " 1\n", "

\n", "

\n", " 2\n", "

\n", " \n", "\n" ] } ], "source": [ "\n", "top = Element('html')\n", "\n", "HEAD = SubElement(top, 'head')\n", "BODY = SubElement(top, 'body')\n", "BODY.extend(children)\n", "\n", "print(prettify_html(tostring(top).decode()))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[, ]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(top)  # Element.getchildren() is deprecated and was removed in Python 3.9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create XML/HTML document\n", "\n", "* [examples in Python](https://pymotw.com/2/xml/etree/ElementTree/create.html)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " This child contains text.\n", " \n", " \n", " This child has regular text.\n", " \n", " And \"tail\" text.\n", " \n", " This & that\n", " \n", "\n" ] } ], "source": [ "from xml.etree.ElementTree import Element, SubElement, Comment, tostring\n", "\n", "top = Element('top')\n", "\n", "comment = Comment('Generated for PyMOTW')\n", "top.append(comment)\n", "\n", "child = SubElement(top, 'child')\n", "child.text = 'This child contains text.'\n", "\n", "child_with_tail = SubElement(top, 'child_with_tail')\n", "child_with_tail.text = 'This child has regular text.'\n", "child_with_tail.tail = 'And \"tail\" text.'\n", "\n", "child_with_entity_ref = SubElement(top, 'child_with_entity_ref')\n", "child_with_entity_ref.text = 'This & that'\n", "\n", "print(prettify_html(tostring(top).decode()))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" }, "nikola": { "category": "do-it-yourself", "date": "2019-11-26 01:14:05 
UTC", "description": "A programmatic way to write HTML using Python from a recipe in the book 'Modern Python Standard Library Cookbook'.", "link": "", "slug": "writing-xmlhtml-content-with-python", "tags": "python, code, examples", "title": "Writing XML/HTML content with Python", "type": "text" } }, "nbformat": 4, "nbformat_minor": 2 }