Writing XML/HTML content with Python

Resources

Molina, Alessandro

Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python

Packt Publishing. Kindle Edition.

In [13]:
from IPython.display import display, IFrame
In [14]:
display(
    IFrame(
        src="https://read.amazon.com/kp/card?asin=B07C5Q59ZZ&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_vLO2DbDBHB5GJ",
        width="336",
        height="550",
    )
)

Writing SGML-based languages is generally not very hard, most languages provide utilities to work with them, but if the document gets too big, it's easy to get lost when trying to build the tree of elements programmatically.

Ending up with hundreds of .addChild or similar calls all after each other makes it really hard to understand where we were in the document and what part of it we are currently editing.

Thankfully, by joining the Python ElementTree module with context managers, we can have a solution that allows our code structure to match the structure of the XML/HTML we are trying to generate.

Molina, Alessandro. Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python. Packt Publishing. Kindle Edition.

In [15]:
import xml.etree.ElementTree as ET
from contextlib import contextmanager
In [16]:
import pprint as _pprint
In [17]:
def custom_print(items, indent=4, width=None, minwidth=5):
    width = (
        min(length if (length := len(item)) >= minwidth else minwidth for item in items)
        if width is None
        else width
    )
    _pprint.pprint(items, indent=indent, width=width)

Output from ET.Element??:

Init signature: ET.Element(self, /, *args, **kwargs)
Docstring:      <no docstring>
Source:        
class Element:
    """An XML element.

    This class is the reference implementation of the Element interface.

    An element's length is its number of subelements.  That means if you
    want to check if an element is truly empty, you should check BOTH
    its length AND its text attribute.

    The element tag, attribute names, and attribute values can be either
    bytes or strings.

    *tag* is the element name.  *attrib* is an optional dictionary containing
    element attributes. *extra* are additional element attributes given as
    keyword arguments.

    Example form:
        <tag attrib>text<child/>...</tag>tail

    """
In [18]:
class XMLDocument:
    def __init__(self, root="document", mode="xml"):
        self._root = ET.Element(root) # self._root has append method
        self._mode = mode

    def __str__(self):

        return ET.tostring(self._root, encoding="unicode", method=self._mode)

    def write(self, fobj):
        ET.ElementTree(self._root).write(fobj)

    def __enter__(self):
        return XMLDocumentBuilder(self._root)

    def __exit__(self, exc_type, value, traceback):
        return None


class XMLDocumentBuilder:
    def __init__(self, root):
        self._current = [root]

    def tag(self, *args, **kwargs):
        el = ET.Element(*args, **kwargs)
        self._current[-1].append(el)

        @contextmanager
        def _context():
            self._current.append(el)
            try:
                yield el
            finally:
                self._current.pop()

        return _context()

    def text(self, text):
        if self._current[-1].text is None:
            self._current[-1].text = ""
        self._current[-1].text += text

We can then use our XMLDocument to build the document we want. For example, we can build web pages in HTML mode:

Molina, Alessandro. Modern Python Standard Library Cookbook: Over 100 recipes to fully leverage the features of the standard library in Python. Packt Publishing. Kindle Edition.

In [19]:
HTML = "html"
doc = XMLDocument(HTML, mode=HTML)

with doc as _:  # _ is an instance of XMLDocumentBuilder
    with _.tag("head"):
        with _.tag("title"):
            _.text("This is the title.")
    with _.tag("body"):
        with _.tag("div", id="main-div"):
            with _.tag("h1"):
                _.text("My Document")
            with _.tag("strong"):
                _.text("Hello World")
            _.tag("img", src="https://placeholder.apps.selfip.com/image/150x150")
In [20]:
from chamelboots.html.utils import prettify_html
In [21]:
print(prettify_html(str(doc)))
<html>
 <head>
  <title>
   This is the title.
  </title>
 </head>
 <body>
  <div id="main-div">
   <h1>
    My Document
   </h1>
   <strong>
    Hello World
   </strong>
   <img src="https://placeholder.apps.selfip.com/image/150x150"/>
  </div>
 </body>
</html>

Transfer the file to my static web server.

In [22]:
import tempfile
from subprocess import check_output
import shlex
from pathlib import Path
from time import sleep

# add .transferred for listener script on remote server
_, filepath_ = tempfile.mkstemp(suffix=".transferred.html", prefix="xmlhtml_")
filepath = Path(filepath_)
doc.write(filepath)
# webshare is a bash script that uses scp to copy the file to the server
check_output(shlex.split(f"webshare {filepath}"))
Out[22]:
b''
In [23]:
filepath
Out[23]:
PosixPath('/tmp/xmlhtml_f877db9j.transferred.html')

Display new HTML document.

In [24]:
sleep(3) # Give time for script on webserver to run and change file permissions.
new_path = '.'.join(filepath.name.split('.')[::2])
src = f"https://static.apps.selfip.com/{new_path}"
print(src)
print(new_path)
IFrame(
    src=src,
    width="auto",
    height="300",
)
https://static.apps.selfip.com/xmlhtml_f877db9j.html
xmlhtml_f877db9j.html
Out[24]:

XMLDocumentBuilder keeps a stack of nodes to track where we are in the tree (XMLDocumentBuilder._current). The tail of that list will always tell us which tag we're currently inside.

The interesting part is that the XMLDocumentBuilder.tag method also returns a context manager. On entry, it will set the entered tag as the currently active one and on exit, it will recover the previously active node.

That allows us to nest XMLDocumentBuilder.tag calls and generate a tree of tags.

The actual element node can be grabbed through the as clause of the with statement, so we could bind the title node that was just created and set its text directly. XMLDocumentBuilder.text works too, because the title node becomes the active element once we enter its context.
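As a minimal sketch of the as form described above, the block below re-declares a condensed copy of the builder so that it runs on its own. The class name Builder and the variable name title are illustrative only; the logic is meant to mirror XMLDocumentBuilder, not replace it.

```python
import xml.etree.ElementTree as ET
from contextlib import contextmanager


class Builder:
    """Condensed, illustrative stand-in for XMLDocumentBuilder."""

    def __init__(self, root):
        self._current = [root]  # stack of currently open elements

    def tag(self, *args, **kwargs):
        el = ET.Element(*args, **kwargs)
        self._current[-1].append(el)  # attached even if never entered

        @contextmanager
        def _context():
            self._current.append(el)
            try:
                yield el  # this is what "as" binds
            finally:
                self._current.pop()

        return _context()

    def text(self, text):
        node = self._current[-1]
        node.text = (node.text or "") + text


root = ET.Element("html")
builder = Builder(root)

with builder.tag("head"):
    with builder.tag("title") as title:
        title.text = "Set through the yielded node"
with builder.tag("body"):
    with builder.tag("h1"):
        builder.text("Set through the active element")

result = ET.tostring(root, encoding="unicode", method="html")
print(result)
```

Both routes end up writing to the same element: the one on top of the stack while its with block is open.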

There's more…

There is one trick that I frequently apply when using this recipe. It makes it a bit harder to understand what's going on, on the Python side, and that's the reason why I avoided doing it while explaining the recipe itself, but it makes the HTML/XML structure even more readable by getting rid of most Python noise.

If you assign the XMLDocumentBuilder.tag and XMLDocumentBuilder.text methods to some short names, you can nearly hide the fact that you are calling Python functions and make the XML structure more prominent:

In [25]:
doc = XMLDocument('html', mode="html")

with doc as builder:
    _ = builder.tag
    _t = builder.text
    
    with _("head"):
        with _("title"): _t("This is the title.")
    with _("body"):
        with _("div", id="main-div"):
            with _("h1"): _t("My Document")
            with _("strong"): _t("Hello World")
            _("img", src="https://placeholder.apps.selfip.com/image/325x325")
print(prettify_html(str(doc)))
<html>
 <head>
  <title>
   This is the title.
  </title>
 </head>
 <body>
  <div id="main-div">
   <h1>
    My Document
   </h1>
   <strong>
    Hello World
   </strong>
   <img src="https://placeholder.apps.selfip.com/image/325x325"/>
  </div>
 </body>
</html>

Use with chamelboots.

Is it possible?

In [26]:
import pprint
import itertools as it
from functools import partial

from chamelboots import ChameleonTemplate as CT
from chamelboots import TalStatement as TS
from chamelboots.constants import Join, FAKE
In [27]:
from llist import dllist, dllistnode

dll = dllist()
dll.appendleft(dllistnode(dllist(range(10, 20))))
dll.appendright(dllistnode(dllist(range(10))))
list(dll)
Out[27]:
[dllist([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
 dllist([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
In [28]:
import operator as op
from functools import reduce
In [39]:
from chamelboots.datautils import paths_in_data, get_from
from chamelboots.html import dictdata
from chamelboots.constants import HTML_PARSER
from lxml import etree
In [84]:
d = dictdata(
    etree.fromstring(
        """<top>
 <!--Generated for PyMOTW-->
 <child>
  This child contains text.
 </child>
 <child_with_tail>
  This child has regular text.
 </child_with_tail>
 And "tail" text.
 <child_with_entity_ref>
  This &amp; that
 </child_with_entity_ref>
</top>

""",
        HTML_PARSER,
    )
)
path = paths_in_data(d)[0]
# print(path)
example = get_from(d, path[:6])
# print({'top': {'html': [{'head': {}}, {'body': {}}]}})
# print(paths_in_data(d['html']))

paths_in_data({})
Out[84]:
dict_keys(['inner_content', 'attribs', 'tail'])
In [33]:
CLASS = "class"

ATTRIBUTES, CONTENT = "attributes", "content"
TAL_STATEMENTS = TSA, TSC = tuple(
    TS(*args) for args in ((item,) * 2 for item in (ATTRIBUTES, CONTENT))
)


class HTMLDocument:
    def __init__(self, root="html", mode="html"):
        # self._root = ET.Element(root) # self._root has append method
        self._root = [root]
        self._mode = mode

    def __enter__(self):
        return HTMLDocumentBuilder(self._root)

    def __exit__(self, exc_type, value, traceback):
        return None


class HTMLDocumentBuilder:

    parents = ("head", "body")

    def __init__(self, root):
        self._current = [root]
        (self.current_context_tag,) = root
        self.path = [
            self.current_context_tag,
        ]
        self.tree = {
            'top': {self.current_context_tag: [{parent: {}} for parent in self.parents]}
        }

    def paths_in_data_(self, obj):
        return paths_in_data(obj)

    def get_by_path(self, root, items):
        """Access a nested object in root by item sequence."""
        return reduce(op.getitem, items, root)

    def set_by_path(self, root, items, value):
        """Set a value in a nested object in root by item sequence."""
        # return value of get_by_path has to be a dict
        self.get_by_path(root, items[:-1])[items[-1]] = value

    def tag(self, tag, **kwargs):
        item = (tag,)
        element = [item]
        self._current[-1].append(element)

        @contextmanager
        def _context():  # runs when context entered using "with"
            self._current.append(element)

            # work on tree
            self.path.append(tag)
            self.current_context_tag = tag
            print(self.tree)

            try:
                yield element
            finally:
                self._current.pop()

        return _context()


doc = HTMLDocument()

with doc as builder:
    _ = builder.tag

    with _("head", attributes={}, content=""):
        _("meta", attributes={}, content="")
        _("title", attributes={}, content=FAKE.word())
    with _("body", attributes={}, content=""):
        with _("div", attributes={CLASS: "aa"}, content=""):
            _("span", attributes={CLASS: "bb"}, content=FAKE.paragraph())
        with _("div", attributes={CLASS: "a"}, content=""):
            _("span", attributes={CLASS: "b"}, content=FAKE.paragraph())
            with _("div", attributes={CLASS: "c"}, content=""):
                _("span", attributes={CLASS: "d"}, content=FAKE.paragraph())
                with _("p", attributes={CLASS: "e"}, content=""):
                    _("span", attributes={CLASS: "1"}, content=FAKE.paragraph())
                    _("span", attributes={CLASS: "2"}, content=FAKE.paragraph())
                    _("span", attributes={CLASS: "3"}, content=FAKE.paragraph())
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
{'top': {'html': [{'head': {}}, {'body': {}}]}}
In [34]:
from xml.etree.ElementTree import Element, SubElement, tostring, XML
In [35]:
COUNT = 3
# XML eats the parent
children = XML(
    CT("div", (TS(CONTENT, f"structure {CONTENT}"),)).render(
        content=CT(
            "p",
            (
                TS("repeat", "item items"),
                TS("content", "item"),
                TS(ATTRIBUTES, "next(attributes)"),
            ),
        ).render(
            items=range(COUNT),
            attributes=iter(dict(id=hex(id(dict()))) for _ in range(COUNT)),
        )
    )
)
In [36]:
top = Element('html')

HEAD = SubElement(top, 'head')
BODY = SubElement(top, 'body')
BODY.extend(children)

print(prettify_html(tostring(top).decode()))
<html>
 <head>
 </head>
 <body>
  <p id="0x7f39203b9ad0">
   0
  </p>
  <p id="0x7f39203f72f0">
   1
  </p>
  <p id="0x7f39203698f0">
   2
  </p>
 </body>
</html>
In [37]:
list(top)  # Element.getchildren() was removed in Python 3.9
Out[37]:
[<Element 'head' at 0x7f3910736cb0>, <Element 'body' at 0x7f3910957a10>]

Create XML/HTML document

In [38]:
from xml.etree.ElementTree import Element, SubElement, Comment, tostring

top = Element('top')

comment = Comment('Generated for PyMOTW')
top.append(comment)

child = SubElement(top, 'child')
child.text = 'This child contains text.'

child_with_tail = SubElement(top, 'child_with_tail')
child_with_tail.text = 'This child has regular text.'
child_with_tail.tail = 'And "tail" text.'

child_with_entity_ref = SubElement(top, 'child_with_entity_ref')
child_with_entity_ref.text = 'This & that'

print(prettify_html(tostring(top).decode()))
<top>
 <!--Generated for PyMOTW-->
 <child>
  This child contains text.
 </child>
 <child_with_tail>
  This child has regular text.
 </child_with_tail>
 And "tail" text.
 <child_with_entity_ref>
  This &amp; that
 </child_with_entity_ref>
</top>

Real Python Pandas Groupby Tutorial

In [1]:
import pandas as pd
In [2]:
# 3 decimal places in output display
pd.set_option("display.precision", 3)
In [3]:
# Don't wrap repr(DataFrame) across additional lines

pd.set_option("display.expand_frame_repr", False)
In [4]:
# Set max rows displayed in output to 25
pd.set_option("display.max_rows", 25)

Download datasets

In [5]:
import urllib.request
from pathlib import Path
import os

zipfilepath = Path(os.curdir, "groupby-data.zip")
with urllib.request.urlopen(
    urllib.request.Request(
        "https://github.com/realpython/materials/raw/master/pandas-groupby/groupby-data.zip",
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0"
        },
    )
) as fh:
    zipfilepath.write_bytes(fh.read())

Store on my MinIO server.

In [6]:
from subprocess import check_output
import shlex
from pathlib import Path
import urllib.parse

minopath = Path("dokkuminio")
filename = Path("groupby-data.zip")
path = Path(minopath, "mymedia/realpython")
filepath = path.joinpath(filename)
command = f"mc cp {filename} {filepath}"
command
Out[6]:
'mc cp groupby-data.zip dokkuminio/mymedia/realpython/groupby-data.zip'
In [7]:
check_output(
    shlex.split(command), universal_newlines=True,
)
Out[7]:
'`groupby-data.zip` -> `dokkuminio/mymedia/realpython/groupby-data.zip`\nTotal: 28.24 MB, Transferred: 28.24 MB, Speed: 104.84 MB/s\n'
In [8]:
dataset_url = urllib.parse.urlunsplit(
    ("https", "minio.apps.selfip.com", Path(*filepath.parts[1:]).as_posix(), "", "")
)
dataset_url
Out[8]:
'https://minio.apps.selfip.com/mymedia/realpython/groupby-data.zip'

Remove local file.

In [9]:
zipfilepath.unlink()
assert not zipfilepath.exists()

Download remote file and store in temporary file.

In [10]:
tmp_zipfile, response = urllib.request.urlretrieve(dataset_url)
tmp_zipfile, response
Out[10]:
('/tmp/tmp77tpwj1z', <http.client.HTTPMessage at 0x7fa121cfdc30>)
In [11]:
response.items()
Out[11]:
[('Server', 'nginx'),
 ('Date', 'Sat, 23 Nov 2019 14:50:48 GMT'),
 ('Content-Type', 'application/zip'),
 ('Content-Length', '29612211'),
 ('Connection', 'close'),
 ('Accept-Ranges', 'bytes'),
 ('Content-Security-Policy', 'block-all-mixed-content'),
 ('ETag', '"c5e4045daa771f652d557c88a0f1cf7a-1"'),
 ('Last-Modified', 'Sat, 23 Nov 2019 14:50:48 GMT'),
 ('Vary', 'Origin'),
 ('X-Amz-Request-Id', '15D9D2301E5FD7DD'),
 ('X-Minio-Deployment-Id', '50591baa-6478-4283-9b56-a19476205422'),
 ('X-Xss-Protection', '1; mode=block')]

Unzip the tempfile.

In [12]:
import zipfile
import tempfile

destination = Path(tempfile.mkdtemp())
with zipfile.ZipFile(tmp_zipfile, "r") as zip_ref:
    zip_ref.extractall(destination)
In [13]:
import itertools as it

legislators_path, news_path, airqual_path = it.islice(destination.iterdir(), 0, 3)
legislators_path, news_path, airqual_path
Out[13]:
(PosixPath('/tmp/tmp08a67hrn/legislators-historical.csv'),
 PosixPath('/tmp/tmp08a67hrn/news.csv'),
 PosixPath('/tmp/tmp08a67hrn/airqual.csv'))

Dissecting a dataset of historical members of Congress

Load the data

In [14]:
import pandas as pd

CATEGORY = "category"
keys = "first_name gender type state party".split()
NON_CATEGORY_KEYS = BIRTHDAY_, LAST_NAME_ = "birthday", "last_name"
dtypes = dict(zip(keys, it.repeat(CATEGORY)))
dtypes
Out[14]:
{'first_name': 'category',
 'gender': 'category',
 'type': 'category',
 'state': 'category',
 'party': 'category'}

Because I am lazy, I am going to add the column names to the globals with an Enum.

I find that typing and retyping commonly used strings is error-prone and tedious.

So I use the following to create variables programmatically.

For those who think updating globals() is risky, I added a conditional to prevent overwriting existing globals.

In [ ]:
from enum import Enum

ColumnNames = Enum(
    "ColumnNames",
    type=str,
    names=(
        zip(
            (
                *(key.upper() for key in keys),
                *(extra.upper() for extra in NON_CATEGORY_KEYS),
            ),
            (
                *keys,
                *NON_CATEGORY_KEYS
            ),
        )
    ),
)

Add the Enum names to the globals.

In [15]:
# Check the member names (FIRST_NAME, ...), since those are the globals being added.
if not any(key.name in (GLOBALS := globals()) for key in ColumnNames):
    GLOBALS.update(ColumnNames.__members__)
assert BIRTHDAY == BIRTHDAY_
ColumnNames.__members__
Out[15]:
mappingproxy({'FIRST_NAME': <ColumnNames.FIRST_NAME: 'first_name'>,
              'GENDER': <ColumnNames.GENDER: 'gender'>,
              'TYPE': <ColumnNames.TYPE: 'type'>,
              'STATE': <ColumnNames.STATE: 'state'>,
              'PARTY': <ColumnNames.PARTY: 'party'>,
              'BIRTHDAY': <ColumnNames.BIRTHDAY: 'birthday'>,
              'LAST_NAME': <ColumnNames.LAST_NAME: 'last_name'>})
In [16]:
df = pd.read_csv(
    legislators_path,
    dtype=dtypes,
    usecols=[*dtypes, BIRTHDAY, LAST_NAME,],
    parse_dates=[BIRTHDAY],
)
df.tail()
Out[16]:
      last_name first_name   birthday gender type state       party
11970   Garrett     Thomas 1972-03-27      M  rep    VA  Republican
11971    Handel      Karen 1962-04-18      F  rep    GA  Republican
11972     Jones     Brenda 1959-10-24      F  rep    MI    Democrat
11973    Marino        Tom 1952-08-15      M  rep    PA  Republican
11974     Jones     Walter 1943-02-10      M  rep    NC  Republican

You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.

In [17]:
df.dtypes
Out[17]:
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

The “Hello, World!” of Pandas GroupBy

Because I added the ColumnNames names to the globals, I don't have to type strings.

STATE is the str-based enum member ColumnNames.STATE, which compares equal to the plain string "state".
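A small sketch of why this works, and of why the groupby output in this section is labelled ColumnNames.STATE rather than state. The two-member enum here is a cut-down, illustrative copy of ColumnNames; the point is that a str-mixin member behaves like its value in comparisons and lookups but keeps the Enum string form.

```python
from enum import Enum

# Cut-down, illustrative copy of the ColumnNames enum.
ColumnNames = Enum(
    "ColumnNames",
    names=[("STATE", "state"), ("LAST_NAME", "last_name")],
    type=str,
)

STATE = ColumnNames.STATE

is_str = isinstance(STATE, str)   # members are real str instances
equals_value = STATE == "state"   # and compare equal to the plain value
lookup = {"state": 1}[STATE]      # so key lookups by member succeed
plain = STATE.value               # the underlying plain string
shown = str(STATE)                # Enum's own string form

print(is_str, equals_value, lookup, plain)
print(shown)
```

Passing STATE.value, or the plain string, to .groupby() instead would label the result with state.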

Select the last_name count by state.

In [20]:
n_by_state = df.groupby(STATE)[LAST_NAME].count()
n_by_state.head(10)
Out[20]:
ColumnNames.STATE
AK     16
AL    206
AR    117
AS      2
AZ     48
CA    361
CO     90
CT    240
DC      2
DE     97
Name: ColumnNames.LAST_NAME, dtype: int64

A list of multiple column names.

In [26]:
n_by_state_gender = df.groupby([STATE, GENDER])[LAST_NAME].count()
n_by_state_gender
Out[26]:
ColumnNames.STATE  ColumnNames.GENDER
AK                 M                      16
AL                 F                       3
                   M                     203
AR                 F                       5
                   M                     112
                                        ... 
WI                 M                     196
WV                 F                       1
                   M                     119
WY                 F                       2
                   M                      38
Name: ColumnNames.LAST_NAME, Length: 104, dtype: int64

In the Pandas version, the grouped-on columns are pushed into the MultiIndex of the resulting Series by default:

In [27]:
type(n_by_state_gender)
Out[27]:
pandas.core.series.Series
In [28]:
n_by_state_gender.index[:5]
Out[28]:
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M'),
            ('AR', 'F'),
            ('AR', 'M')],
           names=['ColumnNames.STATE', 'ColumnNames.GENDER'])

To more closely emulate the SQL result and push the grouped-on columns back into columns in the result, you can use as_index=False:

In [29]:
df.groupby([STATE, GENDER], as_index=False)[LAST_NAME].count()
Out[29]:
    ColumnNames.STATE ColumnNames.GENDER  ColumnNames.LAST_NAME
0                  AK                  F                    NaN
1                  AK                  M                   16.0
2                  AL                  F                    3.0
3                  AL                  M                  203.0
4                  AR                  F                    5.0
...               ...                ...                    ...
111                WI                  M                  196.0
112                WV                  F                    1.0
113                WV                  M                  119.0
114                WY                  F                    2.0
115                WY                  M                   38.0

116 rows × 3 columns

Note: In df.groupby(["state", "gender"])["last_name"].count(), you could also use .size() instead of .count(), since you know that there are no NaN last names. Using .count() excludes NaN values, while .size() includes everything, NaN or not.
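To make the difference concrete, here is a minimal sketch on a made-up three-row frame (not the legislators data) in which one last name is missing. It assumes only that pandas is installed.

```python
import pandas as pd

# Tiny made-up frame; the None becomes a missing last name.
toy = pd.DataFrame(
    {
        "state": ["AK", "AK", "AL"],
        "last_name": ["Waskey", None, "Crowell"],
    }
)

counts = toy.groupby("state")["last_name"].count()  # skips missing values
sizes = toy.groupby("state")["last_name"].size()    # counts every row

print(counts.to_dict())
print(sizes.to_dict())
```

The two results differ only for AK, the group that contains the missing value.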

Also note that the SQL queries above explicitly use ORDER BY, whereas .groupby() does not. That’s because .groupby() does this by default through its parameter sort, which is True unless you tell it otherwise:

In [30]:
df.groupby(STATE, sort=False)[LAST_NAME].count()
Out[30]:
ColumnNames.STATE
DE      97
VA     432
SC     251
MD     305
PA    1053
      ... 
AK      16
PI      13
VI       4
GU       4
AS       2
Name: ColumnNames.LAST_NAME, Length: 58, dtype: int64

Next, you’ll dive into the object that .groupby() actually produces.

One term that’s frequently used alongside .groupby() is split-apply-combine. This refers to a chain of three steps:

  1. Split a table into groups
  2. Apply some operations to each of those smaller tables
  3. Combine the results

One useful way to inspect a Pandas GroupBy object and see the splitting in action is to iterate over it. This is implemented in DataFrameGroupBy.__iter__() and produces an iterator of (group, DataFrame) pairs for DataFrames:

In [31]:
by_state = df.groupby(STATE)

Learn something new.

f"{state!r}" quotes the value: the !r conversion makes the f-string use __repr__ instead of __str__.

Resources

By default, f-strings will use __str__(), but you can make sure they use __repr__() if you include the conversion flag !r:
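A quick illustration of the conversion flag, using a made-up state value:

```python
state = "AK"

plain = f"First 2 entries for {state}"     # formatted with str(state)
quoted = f"First 2 entries for {state!r}"  # formatted with repr(state)

print(plain)   # First 2 entries for AK
print(quoted)  # First 2 entries for 'AK'
```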

Tip

If you’re working on a challenging aggregation problem, then iterating over the Pandas GroupBy object can be a great way to visualize the split part of split-apply-combine.

In [51]:
LINE, END = "-" * 30, "\n\n"


def endprint(item):
    print(item, end=END)


for state, frame in it.islice(by_state, 5):
    for _print, item in zip(
        (print, print, endprint,),
        (f"First 2 entries for {state!r}", LINE, frame.head(2)),
    ):
        _print(item)
First 2 entries for 'AK'
------------------------------
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent

First 2 entries for 'AL'
------------------------------
    last_name first_name   birthday gender type state       party
912   Crowell       John 1780-09-18      M  rep    AL  Republican
991    Walker       John 1783-08-12      M  sen    AL  Republican

First 2 entries for 'AR'
------------------------------
     last_name first_name   birthday gender type state party
1001     Bates      James 1788-08-25      M  rep    AR   NaN
1279    Conway      Henry 1793-03-18      M  rep    AR   NaN

First 2 entries for 'AS'
------------------------------
          last_name first_name   birthday gender type state     party
10797         Sunia       Fofó 1937-03-13      M  rep    AS  Democrat
11755  Faleomavaega        Eni 1943-08-15      M  rep    AS  Democrat

First 2 entries for 'AZ'
------------------------------
     last_name first_name   birthday gender type state       party
3674    Poston    Charles 1825-04-20      M  rep    AZ  Republican
3725   Goodwin       John 1824-10-18      M  rep    AZ  Republican

The .groups attribute will give you a dictionary of {group name: group label} pairs.

In [41]:
by_state.groups["MT"]
Out[41]:
Int64Index([ 3756,  3941,  4094,  5033,  5342,  5536,  5537,  5581,  5968,
             6086,  6188,  6295,  6308,  6316,  6529,  6662,  6967,  7252,
             7344,  7615,  7642,  8113,  8167,  8231,  8298,  8387,  8458,
             8587,  8713,  8820,  8850,  9026,  9113,  9410,  9493,  9511,
             9618, 10037, 10080, 10156, 10261, 10282, 10473, 10683, 10939,
            11164, 11250, 11284, 11700, 11731, 11793, 11864],
           dtype='int64')
In [53]:
for i in by_state.groups["MT"][-5:]:
    endprint(df.loc[i])  # .groups holds index labels, so look rows up with .loc
last_name                   Burns
first_name                 Conrad
birthday      1935-01-25 00:00:00
gender                          M
type                          sen
state                          MT
party                  Republican
Name: 11284, dtype: object

last_name                 Rehberg
first_name                 Dennis
birthday      1955-10-05 00:00:00
gender                          M
type                          rep
state                          MT
party                  Republican
Name: 11700, dtype: object

last_name                  Baucus
first_name                    Max
birthday      1941-12-11 00:00:00
gender                          M
type                          sen
state                          MT
party                    Democrat
Name: 11731, dtype: object

last_name                   Walsh
first_name                   John
birthday      1960-11-03 00:00:00
gender                          M
type                          sen
state                          MT
party                    Democrat
Name: 11793, dtype: object

last_name                   Zinke
first_name                   Ryan
birthday      1961-11-01 00:00:00
gender                          M
type                          rep
state                          MT
party                  Republican
Name: 11864, dtype: object

Note: I use the generic term Pandas GroupBy object to refer to either a DataFrameGroupBy object or a SeriesGroupBy object, which have a lot in common.

Next, what about the apply part?

You can think of this step of the process as applying the same operation (or callable) to every “sub-table” that is produced by the splitting stage.

In [54]:
state, frame = next(iter(by_state))  # First tuple from iterator
state
Out[54]:
'AK'
In [55]:
frame.head(3)
Out[55]:
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent
7442   Grigsby     George 1874-12-02      M  rep    AK          NaN
In [56]:
frame[LAST_NAME].count() # Count for state == 'AK'
Out[56]:
16

The last step, combine, is the most self-explanatory. It simply takes the results of all of the applied operations on all of the sub-tables and combines them back together in an intuitive way.
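The three steps can be spelled out by hand and compared with the one-line .groupby() call. This is only a sketch on made-up rows, assuming pandas is installed; the names toy, pieces, applied, manual and direct are illustrative.

```python
import pandas as pd

# Made-up rows standing in for the legislators data.
toy = pd.DataFrame(
    {
        "state": ["AK", "AL", "AK", "AL", "AR"],
        "last_name": ["Waskey", "Crowell", "Cale", "Walker", "Bates"],
    }
)

# 1. Split: one sub-table per distinct state.
pieces = {state: frame for state, frame in toy.groupby("state")}

# 2. Apply: run the same operation on every sub-table.
applied = {state: frame["last_name"].count() for state, frame in pieces.items()}

# 3. Combine: collect the per-group results into one Series.
manual = pd.Series(applied, name="last_name")

# The one-liner performs all three steps internally.
direct = toy.groupby("state")["last_name"].count()

print(manual.to_dict())
print(direct.to_dict())
```

Both routes give the same per-state counts; the manual version just makes each stage visible.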

Next up

Example 2: Air Quality Dataset

to be continued…

Explore HTML Tools Chamelboots

Explore the html and datautil modules added to chamelboots.

Verify what exactly the attribute sourceline is in lxml. Does it correspond in any way to the raw html source?

In [1]:
from lxml import etree
In [2]:
from chamelboots.html.packages import bootstrap
from chamelboots.constants import HTML_PARSER
from chamelboots.html import get_html_as_data
from chamelboots.datautil import get_from, paths_in_data
In [3]:
[item for item in dir(bootstrap) if not item.startswith("_")]
Out[3]:
['starter_html']
In [4]:
html_element = (
    etree.fromstring(bootstrap.starter_html, HTML_PARSER).getroottree().getroot()
)
In [5]:
{
    tuple(item for item in dir(element) if not item.startswith("_"))
    for element in html_element.iterdescendants()
}
Out[5]:
{('addnext',
  'addprevious',
  'append',
  'attrib',
  'base',
  'clear',
  'cssselect',
  'extend',
  'find',
  'findall',
  'findtext',
  'get',
  'getchildren',
  'getiterator',
  'getnext',
  'getparent',
  'getprevious',
  'getroottree',
  'index',
  'insert',
  'items',
  'iter',
  'iterancestors',
  'iterchildren',
  'iterdescendants',
  'iterfind',
  'itersiblings',
  'itertext',
  'keys',
  'makeelement',
  'nsmap',
  'prefix',
  'remove',
  'replace',
  'set',
  'sourceline',
  'tag',
  'tail',
  'text',
  'values',
  'xpath')}
In [6]:
html_element.sourceline
Out[6]:
2
In [7]:
list(enumerate(bootstrap.starter_html.splitlines()))
Out[7]:
[(0, '<!doctype html>'),
 (1, '<html lang="en">'),
 (2, '  <head>'),
 (3, '    <!-- Required meta tags -->'),
 (4, '    <meta charset="utf-8">'),
 (5,
  '    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">'),
 (6, '    <!-- Bootstrap CSS -->'),
 (7, '    <link'),
 (8, '    rel="stylesheet"'),
 (9,
  '    href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"'),
 (10,
  '    integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">'),
 (11, '    <!-- Optional JavaScript -->'),
 (12, '    <!-- jQuery first, then Popper.js, then Bootstrap JS -->'),
 (13, '    <script'),
 (14, '    defer="defer"'),
 (15, '    src="https://code.jquery.com/jquery-3.3.1.slim.min.js"'),
 (16,
  '    integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"'),
 (17, '    crossorigin="anonymous"></script>'),
 (18, '    <script'),
 (19, '    defer="defer"'),
 (20,
  '    src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js"'),
 (21,
  '    integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1"'),
 (22, '    crossorigin="anonymous"></script>'),
 (23, '    <script'),
 (24, '    defer="defer"'),
 (25,
  '    src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"'),
 (26,
  '    integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM"'),
 (27, '    crossorigin="anonymous"></script>'),
 (28, '    <title>Bootstrap title</title>'),
 (29, '  </head>'),
 (30, '  <body>'),
 (31, '    <div>'),
 (32, '        <h1>Hello, world!</h1>'),
 (33, '    </div>'),
 (34, '  </body>'),
 (35, '</html>')]
In [8]:
[
    (element, element.sourceline,)
    for element in html_element.iterdescendants()
    if element.tag == "meta"
]
Out[8]:
[(<Element meta at 0x7f137b6cdf50>, 5), (<Element meta at 0x7f137b6cd460>, 6)]
In [9]:
[
    (i, line)
    for i, line in enumerate(bootstrap.starter_html.splitlines())
    if "meta" in line
]
Out[9]:
[(3, '    <!-- Required meta tags -->'),
 (4, '    <meta charset="utf-8">'),
 (5,
  '    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">')]

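Conclusion: The sourceline attribute of an lxml element does correspond to the raw HTML, but it is 1-based while enumerate() counts from 0. The html element sits at index 1 and reports sourceline 2, and the two meta elements sit at indexes 4 and 5 and report 5 and 6.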
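One detail worth keeping in mind when comparing the two listings: enumerate() starts counting at 0 unless told otherwise, while the sourceline values reported for html (2) and the two meta elements (5 and 6) are each one higher than the matching enumerate index. The stdlib-only sketch below uses the first six lines of the starter HTML, copied from the output above, to show that counting from 1 reproduces those numbers. It does not call lxml itself.

```python
# First lines of the starter HTML, copied from the enumerate() output.
starter_lines = [
    "<!doctype html>",
    '<html lang="en">',
    "  <head>",
    "    <!-- Required meta tags -->",
    '    <meta charset="utf-8">',
    '    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">',
]

# Default enumerate(): 0-based positions of the <meta> lines.
zero_based = [i for i, line in enumerate(starter_lines) if "<meta" in line]

# Counting from 1, the way a line-number display would.
one_based = [i for i, line in enumerate(starter_lines, start=1) if "<meta" in line]

print(zero_based)  # [4, 5]
print(one_based)   # [5, 6]
```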

HTML in a Pandas dataframe

In [10]:
import itertools as it
import operator as op
In [11]:
import pandas as pd
from IPython.display import display, HTML
In [12]:
data = get_html_as_data(bootstrap.starter_html)
In [13]:
paths = paths_in_data(data)
In [14]:
{len(path) for path in paths}
Out[14]:
{2, 5, 8, 11}
In [15]:
by_first = op.itemgetter(0)
dfs = [
    pd.DataFrame([op.add(p, (repr(get_from(data, p)),)) for l, p in group])
    for key, group in it.groupby(
        sorted([(len(path), path) for path in paths], key=by_first), key=by_first,
    )
]
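itertools.groupby only merges adjacent items, which is why the paths are sorted by length before grouping. A minimal sketch of the same pattern on hypothetical sample paths (sample_paths and by_length are illustrative names, not part of the notebook):

```python
import itertools as it

# Hypothetical paths standing in for the output of paths_in_data.
sample_paths = [
    ("html", "attribs"),
    ("html", "inner_content", 0, "head", "tail"),
    ("html", "tail"),
    ("html", "inner_content", 1, "body", "tail"),
]

# Sort by the grouping key first; sorted() is stable, so paths of equal
# length keep their original relative order.
by_length = {
    length: list(group)
    for length, group in it.groupby(sorted(sample_paths, key=len), key=len)
}

for length, group in by_length.items():
    print(length, group)
```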

Set the index of each dataframe to the tag name, which sits at position -3 in the columns.

In [16]:
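for df in dfs:
    # The tag name is in the third column from the end. For the 3-column
    # dataframe that position is 0, which is falsy, so all columns are kept
    # and the index falls back to the last column.
    stop = i if (i := df.columns.stop - 3) else len(df.columns)
    df.index = [list(row)[-1] for _, row in df.loc[:, :stop].iterrows()]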
In [17]:
for df in dfs:
    try:
        display(df.loc[("script", "link"), :])
    except KeyError:
        pass
        0     1              2  3     4              5  6       7              8
script  html  inner_content  0  head  inner_content  7  script  inner_content  None
script  html  inner_content  0  head  inner_content  7  script  attributes     {'defer': 'defer', 'src': 'https://code.jquery...
script  html  inner_content  0  head  inner_content  7  script  tail           '\n '
script  html  inner_content  0  head  inner_content  8  script  inner_content  None
script  html  inner_content  0  head  inner_content  8  script  attributes     {'defer': 'defer', 'src': 'https://cdnjs.cloud...
script  html  inner_content  0  head  inner_content  8  script  tail           '\n '
script  html  inner_content  0  head  inner_content  9  script  inner_content  None
script  html  inner_content  0  head  inner_content  9  script  attributes     {'defer': 'defer', 'src': 'https://stackpath.b...
script  html  inner_content  0  head  inner_content  9  script  tail           '\n '
link    html  inner_content  0  head  inner_content  4  link    inner_content  None
link    html  inner_content  0  head  inner_content  4  link    attributes     {'rel': 'stylesheet', 'href': 'https://stackpa...
link    html  inner_content  0  head  inner_content  4  link    tail           '\n '
In [18]:
for df in dfs:
    display(HTML(df.to_html()))
                0     1        2
{'lang': 'en'}  html  attribs  {'lang': 'en'}
None            html  tail     None

      0     1              2  3     4           5
head  html  inner_content  0  head  attributes  {}
head  html  inner_content  0  head  tail        '\n '
body  html  inner_content  1  body  attributes  {}
body  html  inner_content  1  body  tail        '\n'
                                        0     1              2  3     4              5   6                                       7              8
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  0   <cyfunction Comment at 0x7f1398068ae0>  inner_content  ' Required meta tags '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  0   <cyfunction Comment at 0x7f1398068ae0>  attributes     <lxml.etree._ImmutableMapping object at 0x7f139804fc30>
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  0   <cyfunction Comment at 0x7f1398068ae0>  tail           '\n '
meta                                    html  inner_content  0  head  inner_content  1   meta                                    inner_content  None
meta                                    html  inner_content  0  head  inner_content  1   meta                                    attributes     {'charset': 'utf-8'}
meta                                    html  inner_content  0  head  inner_content  1   meta                                    tail           '\n '
meta                                    html  inner_content  0  head  inner_content  2   meta                                    inner_content  None
meta                                    html  inner_content  0  head  inner_content  2   meta                                    attributes     {'name': 'viewport', 'content': 'width=device-width, initial-scale=1, shrink-to-fit=no'}
meta                                    html  inner_content  0  head  inner_content  2   meta                                    tail           '\n '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  3   <cyfunction Comment at 0x7f1398068ae0>  inner_content  ' Bootstrap CSS '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  3   <cyfunction Comment at 0x7f1398068ae0>  attributes     <lxml.etree._ImmutableMapping object at 0x7f139804fc30>
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  3   <cyfunction Comment at 0x7f1398068ae0>  tail           '\n '
link                                    html  inner_content  0  head  inner_content  4   link                                    inner_content  None
link                                    html  inner_content  0  head  inner_content  4   link                                    attributes     {'rel': 'stylesheet', 'href': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css', 'integrity': 'sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T', 'crossorigin': 'anonymous'}
link                                    html  inner_content  0  head  inner_content  4   link                                    tail           '\n '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  5   <cyfunction Comment at 0x7f1398068ae0>  inner_content  ' Optional JavaScript '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  5   <cyfunction Comment at 0x7f1398068ae0>  attributes     <lxml.etree._ImmutableMapping object at 0x7f139804fc30>
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  5   <cyfunction Comment at 0x7f1398068ae0>  tail           '\n '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  6   <cyfunction Comment at 0x7f1398068ae0>  inner_content  ' jQuery first, then Popper.js, then Bootstrap JS '
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  6   <cyfunction Comment at 0x7f1398068ae0>  attributes     <lxml.etree._ImmutableMapping object at 0x7f139804fc30>
<cyfunction Comment at 0x7f1398068ae0>  html  inner_content  0  head  inner_content  6   <cyfunction Comment at 0x7f1398068ae0>  tail           '\n '
script                                  html  inner_content  0  head  inner_content  7   script                                  inner_content  None
script                                  html  inner_content  0  head  inner_content  7   script                                  attributes     {'defer': 'defer', 'src': 'https://code.jquery.com/jquery-3.3.1.slim.min.js', 'integrity': 'sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo', 'crossorigin': 'anonymous'}
script                                  html  inner_content  0  head  inner_content  7   script                                  tail           '\n '
script                                  html  inner_content  0  head  inner_content  8   script                                  inner_content  None
script                                  html  inner_content  0  head  inner_content  8   script                                  attributes     {'defer': 'defer', 'src': 'https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js', 'integrity': 'sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1', 'crossorigin': 'anonymous'}
script                                  html  inner_content  0  head  inner_content  8   script                                  tail           '\n '
script                                  html  inner_content  0  head  inner_content  9   script                                  inner_content  None
script                                  html  inner_content  0  head  inner_content  9   script                                  attributes     {'defer': 'defer', 'src': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js', 'integrity': 'sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM', 'crossorigin': 'anonymous'}
script                                  html  inner_content  0  head  inner_content  9   script                                  tail           '\n '
title                                   html  inner_content  0  head  inner_content  10  title                                   inner_content  'Bootstrap title'
title                                   html  inner_content  0  head  inner_content  10  title                                   attributes     {}
title                                   html  inner_content  0  head  inner_content  10  title                                   tail           '\n '
div                                     html  inner_content  1  body  inner_content  0   div                                     attributes     {}
div                                     html  inner_content  1  body  inner_content  0   div                                     tail           '\n '
    0     1              2  3     4              5  6    7              8  9   10             11
h1  html  inner_content  1  body  inner_content  0  div  inner_content  0  h1  inner_content  'Hello, world!'
h1  html  inner_content  1  body  inner_content  0  div  inner_content  0  h1  attributes     {}
h1  html  inner_content  1  body  inner_content  0  div  inner_content  0  h1  tail           '\n '

Create HTML with python-chamelboots: An Experiment

Experiment with python-chamelboots to create HTML.

Resources

Replicate an HTML document using chamelboots.

Specs

Replace the href and integrity attributes in the link tag and the src and integrity attributes in the script tags with different values, without editing the starter_html string.

The new result should be a list of strings that would replace a range of lines in starter_html.

In [1]:
from chamelboots.constants import HTML_PARSER, Join
from chamelboots import ChameleonTemplate as CT
from chamelboots import TalStatement as TS
In [2]:
from functools import reduce
import operator as op
from pprint import pprint
import itertools as it
from subprocess import check_call
import shlex
from pathlib import Path
import tempfile
In [3]:
from lxml import etree
from bs4 import BeautifulSoup
from IPython.display import display, IFrame
In [4]:
starter_html = """<!doctype html>
<html lang="en">
  <head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <!-- Bootstrap CSS -->
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    <!-- Optional JavaScript -->
    <!-- jQuery first, then Popper.js, then Bootstrap JS -->
    <script defer="defer" src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
    <script defer="defer" src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
    <script defer="defer" src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
    <title>Bootstrap title</title>
  </head>
  <body>
    <div>
        <h1>Hello, world!{nested_span}</h1>
        {list_}
    </div>
  </body>
</html>""".format(  # add some extra HTML using chamelboots
    list_=CT(
        "ul", (TS("content", "structure content"), TS("attributes", "attributes"))
    ).render(
        attributes={"class": "list-group"},
        content=CT(
            "li",
            (TS("repeat", "item items"), TS("attributes", "attributes")),
            "${item}",
        ).render(
            items=(f"foo item number {i}" for i in range(10)),
            attributes={"class": "list-group-item"},
        ),
    ),
    nested_span=CT("span", (), "I am a nested span."),
)
print(BeautifulSoup(starter_html, "html.parser").prettify())
<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <!-- Required meta tags -->
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <!-- Bootstrap CSS -->
  <link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" rel="stylesheet"/>
  <!-- Optional JavaScript -->
  <!-- jQuery first, then Popper.js, then Bootstrap JS -->
  <script crossorigin="anonymous" defer="defer" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" src="https://code.jquery.com/jquery-3.3.1.slim.min.js">
  </script>
  <script crossorigin="anonymous" defer="defer" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js">
  </script>
  <script crossorigin="anonymous" defer="defer" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js">
  </script>
  <title>
   Bootstrap title
  </title>
 </head>
 <body>
  <div>
   <h1>
    Hello, world!
    <span>
     I am a nested span.
    </span>
   </h1>
   <ul class="list-group">
    <li class="list-group-item">
     foo item number 0
    </li>
    <li class="list-group-item">
     foo item number 1
    </li>
    <li class="list-group-item">
     foo item number 2
    </li>
    <li class="list-group-item">
     foo item number 3
    </li>
    <li class="list-group-item">
     foo item number 4
    </li>
    <li class="list-group-item">
     foo item number 5
    </li>
    <li class="list-group-item">
     foo item number 6
    </li>
    <li class="list-group-item">
     foo item number 7
    </li>
    <li class="list-group-item">
     foo item number 8
    </li>
    <li class="list-group-item">
     foo item number 9
    </li>
   </ul>
  </div>
 </body>
</html>

Upload starter_html to my static webserver to display in an IFrame

In [5]:
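import os


def save_to_minio(text):
    """Write text to a temporary .html file, upload it with the mc CLI, and return its URL."""
    fd, name = tempfile.mkstemp(suffix=".html")
    os.close(fd)  # mkstemp returns an open file descriptor; close it so it is not leaked
    tmpfile = Path(name)
    tmpfile.write_text(text)
    url = f"https://minio.apps.selfip.com/mymedia/html/{tmpfile.name}"
    check_call(shlex.split(f"mc cp {tmpfile} dokkuminio/mymedia/html/"))
    return url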

Display template HTML document.

In [6]:
url = save_to_minio(starter_html)
print(url)
display(IFrame(src=url, width="auto", height=500))
https://minio.apps.selfip.com/mymedia/html/tmpixne_sks.html
In [7]:
tree = etree.fromstring(starter_html, HTML_PARSER)

Flat structure.

Flat is better than nested, but without the nesting it is difficult to reconstruct the original HTML.

In [8]:
groups = [
    (e.tag, tuple(e.attrib.items()), e.text.strip() if e.text is not None else "")
    for e in tree.iter()
    if isinstance(e.tag, str)
]
groups
Out[8]:
[('html', (('lang', 'en'),), ''),
 ('head', (), ''),
 ('meta', (('charset', 'utf-8'),), ''),
 ('meta',
  (('name', 'viewport'),
   ('content', 'width=device-width, initial-scale=1, shrink-to-fit=no')),
  ''),
 ('link',
  (('rel', 'stylesheet'),
   ('href',
    'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css'),
   ('integrity',
    'sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T'),
   ('crossorigin', 'anonymous')),
  ''),
 ('script',
  (('defer', 'defer'),
   ('src', 'https://code.jquery.com/jquery-3.3.1.slim.min.js'),
   ('integrity',
    'sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo'),
   ('crossorigin', 'anonymous')),
  ''),
 ('script',
  (('defer', 'defer'),
   ('src',
    'https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js'),
   ('integrity',
    'sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1'),
   ('crossorigin', 'anonymous')),
  ''),
 ('script',
  (('defer', 'defer'),
   ('src',
    'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js'),
   ('integrity',
    'sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM'),
   ('crossorigin', 'anonymous')),
  ''),
 ('title', (), 'Bootstrap title'),
 ('body', (), ''),
 ('div', (), ''),
 ('h1', (), 'Hello, world!'),
 ('span', (), 'I am a nested span.'),
 ('ul', (('class', 'list-group'),), ''),
 ('li', (('class', 'list-group-item'),), 'foo item number 0'),
 ('li', (('class', 'list-group-item'),), 'foo item number 1'),
 ('li', (('class', 'list-group-item'),), 'foo item number 2'),
 ('li', (('class', 'list-group-item'),), 'foo item number 3'),
 ('li', (('class', 'list-group-item'),), 'foo item number 4'),
 ('li', (('class', 'list-group-item'),), 'foo item number 5'),
 ('li', (('class', 'list-group-item'),), 'foo item number 6'),
 ('li', (('class', 'list-group-item'),), 'foo item number 7'),
 ('li', (('class', 'list-group-item'),), 'foo item number 8'),
 ('li', (('class', 'list-group-item'),), 'foo item number 9')]
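The flat listing drops the parent/child relationships. As a rough illustration using the standard library's xml.etree.ElementTree instead of lxml, recording a depth alongside each tag keeps enough information to rebuild the nesting later. The fragment and the with_depth helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment (hypothetical); the notebook parses a full
# HTML document with lxml instead.
fragment = ET.fromstring(
    "<div><h1>Hello, world!<span>I am a nested span.</span></h1>"
    '<ul class="list-group"><li>foo</li><li>bar</li></ul></div>'
)

# Flat view: document order is kept, but not which element contains which.
flat = [
    (e.tag, tuple(e.attrib.items()), (e.text or "").strip())
    for e in fragment.iter()
]


def with_depth(element, depth=0):
    """Yield (depth, tag) pairs in document order."""
    yield depth, element.tag
    for child in element:
        yield from with_depth(child, depth + 1)


nested = list(with_depth(fragment))
print(flat)
print(nested)
```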

Define some constants.

In [9]:
INNER_CONTENT, ATTRIBS, ATTRIBUTES, TAIL = (
    "inner_content",
    "attribs",
    "attributes",
    "tail",
)

Define functions to recursively walk the element tree and convert to nested dictionaries and lists.

In [10]:
def dictdata(node):
    """Convert the tree rooted at node into nested dicts and lists."""
    inner = []
    html_to_dict(node, inner)
    # The root stores its attributes under ATTRIBS; descendants use ATTRIBUTES.
    return {node.tag: {INNER_CONTENT: inner, ATTRIBS: node.attrib, TAIL: node.tail}}


def html_to_dict(node, res):
    """Append one dict per child of node (or node itself, if it is a leaf) to res."""
    if len(node):
        for n in list(node):
            children = []
            html_to_dict(n, children)
            if len(n):
                # n has children: nest their converted data under INNER_CONTENT.
                # The text of n itself is not recorded in this branch.
                value = {
                    INNER_CONTENT: children,
                    ATTRIBUTES: n.attrib,
                    TAIL: n.tail,
                }
                res.append({n.tag: value})
            else:
                # n is a leaf: the recursive call produced exactly one entry for it.
                res.append(children[0])
    else:
        value = {INNER_CONTENT: node.text, ATTRIBUTES: node.attrib, TAIL: node.tail}
        res.append({node.tag: value})
In [11]:
data = dictdata(tree.getroottree().getroot())
In [12]:
data
Out[12]:
{'html': {'inner_content': [{'head': {'inner_content': [{<cyfunction Comment at 0x7f2c140317a0>: {'inner_content': ' Required meta tags ',
        'attributes': <lxml.etree._ImmutableMapping at 0x7f2c1401c780>,
        'tail': '\n    '}},
      {'meta': {'inner_content': None,
        'attributes': {'charset': 'utf-8'},
        'tail': '\n    '}},
      {'meta': {'inner_content': None,
        'attributes': {'name': 'viewport', 'content': 'width=device-width, initial-scale=1, shrink-to-fit=no'},
        'tail': '\n    '}},
      {<cyfunction Comment at 0x7f2c140317a0>: {'inner_content': ' Bootstrap CSS ',
        'attributes': <lxml.etree._ImmutableMapping at 0x7f2c1401c780>,
        'tail': '\n    '}},
      {'link': {'inner_content': None,
        'attributes': {'rel': 'stylesheet', 'href': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css', 'integrity': 'sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T', 'crossorigin': 'anonymous'},
        'tail': '\n    '}},
      {<cyfunction Comment at 0x7f2c140317a0>: {'inner_content': ' Optional JavaScript ',
        'attributes': <lxml.etree._ImmutableMapping at 0x7f2c1401c780>,
        'tail': '\n    '}},
      {<cyfunction Comment at 0x7f2c140317a0>: {'inner_content': ' jQuery first, then Popper.js, then Bootstrap JS ',
        'attributes': <lxml.etree._ImmutableMapping at 0x7f2c1401c780>,
        'tail': '\n    '}},
      {'script': {'inner_content': None,
        'attributes': {'defer': 'defer', 'src': 'https://code.jquery.com/jquery-3.3.1.slim.min.js', 'integrity': 'sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo', 'crossorigin': 'anonymous'},
        'tail': '\n    '}},
      {'script': {'inner_content': None,
        'attributes': {'defer': 'defer', 'src': 'https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js', 'integrity': 'sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1', 'crossorigin': 'anonymous'},
        'tail': '\n    '}},
      {'script': {'inner_content': None,
        'attributes': {'defer': 'defer', 'src': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js', 'integrity': 'sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM', 'crossorigin': 'anonymous'},
        'tail': '\n    '}},
      {'title': {'inner_content': 'Bootstrap title',
        'attributes': {},
        'tail': '\n  '}}],
     'attributes': {},
     'tail': '\n  '}},
   {'body': {'inner_content': [{'div': {'inner_content': [{'h1': {'inner_content': [{'span': {'inner_content': 'I am a nested span.',
              'attributes': {},
              'tail': None}}],
           'attributes': {},
           'tail': '\n        '}},
         {'ul': {'inner_content': [{'li': {'inner_content': 'foo item number 0',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 1',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 2',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 3',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 4',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 5',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 6',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 7',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 8',
              'attributes': {'class': 'list-group-item'},
              'tail': '\n'}},
            {'li': {'inner_content': 'foo item number 9',
              'attributes': {'class': 'list-group-item'},
              'tail': None}}],
           'attributes': {'class': 'list-group'},
           'tail': '\n    '}}],
        'attributes': {},
        'tail': '\n  '}}],
     'attributes': {},
     'tail': '\n'}}],
  'attribs': {'lang': 'en'},
  'tail': None}}
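In the output above, the h1 entry holds only its span child; the text 'Hello, world!' is not recorded, because inner_content stores either the text or the list of children. A minimal alternative sketch, assuming the standard library's xml.etree.ElementTree and a hypothetical element_to_dict helper, keeps text and children under separate keys:

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element into {tag: {text, attributes, children, tail}}."""
    return {
        element.tag: {
            "text": element.text,
            "attributes": dict(element.attrib),
            "children": [element_to_dict(child) for child in element],
            "tail": element.tail,
        }
    }


# Hypothetical fragment mirroring the h1 element of the document.
root = ET.fromstring(
    '<h1 id="title">Hello, world!<span>I am a nested span.</span></h1>'
)
converted = element_to_dict(root)
print(converted)
```

The key names here differ from the constants used in this notebook; they are chosen only to make the separation visible.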

Define functions for getting all the "paths" to item leaves in the nested dictionary and for getting the leaf using the path.

See this solution to Access nested dictionary items via a list of keys? on Stack Overflow.

In [13]:
def paths_in_data(data, parent=()):
    """Return a tuple of paths (tuples of keys and/or indices) to every leaf in data."""

    if not isinstance(data, (dict, list, tuple)):
        return (parent,)
    # Dicts are walked by key, lists and tuples by position. enumerate is used
    # instead of data.index(v), which returns the first match and would give
    # the wrong index for repeated values.
    items = data.items() if isinstance(data, dict) else enumerate(data)
    return reduce(
        op.add,
        (paths_in_data(v, op.add(parent, (k,))) for k, v in items),
        (),
    )


def get_from(data, path):
    """Get a leaf from iterable of keys and/or indices.
    
    :data: Collection where nodes are either a dict or list.
    :path: Collection of keys and/or indices leading to a leaf.
    """
    return reduce(op.getitem, path, data)
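A small usage sketch of the reduce(op.getitem, ...) idea on a hand-written structure. The set_in helper is a hypothetical companion for writing a value back at a path; it is not part of the notebook.

```python
from functools import reduce
import operator as op

# Hypothetical nested structure shaped like the converted document.
sample = {
    "html": {
        "inner_content": [
            {"head": {"tail": "\n"}},
            {"body": {"tail": None}},
        ]
    }
}
path = ("html", "inner_content", 0, "head", "tail")

# Follow each key or index in turn, starting from the whole structure.
leaf = reduce(op.getitem, path, sample)


def set_in(data, path, value):
    """Assign value at the leaf addressed by path (hypothetical helper)."""
    *parents, last = path
    reduce(op.getitem, parents, data)[last] = value


set_in(sample, path, "\n    ")
print(repr(leaf), repr(reduce(op.getitem, path, sample)))
```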

Get the items to change.

In [14]:
WANTED_TAGS = ("link", "script")
paths_to_mutables = [
    item for item in paths_in_data(data) if any(tag in item for tag in WANTED_TAGS)
]

Group the paths by HTML element

In [15]:
TAG_INDEX = 5  # position 5 of each path is the element's index among head's children
mutables = it.groupby(paths_to_mutables, key=op.itemgetter(TAG_INDEX))
for key, group in mutables:
    for row in group:
        print(row)
('html', 'inner_content', 0, 'head', 'inner_content', 4, 'link', 'inner_content')
('html', 'inner_content', 0, 'head', 'inner_content', 4, 'link', 'attributes')
('html', 'inner_content', 0, 'head', 'inner_content', 4, 'link', 'tail')
('html', 'inner_content', 0, 'head', 'inner_content', 7, 'script', 'inner_content')
('html', 'inner_content', 0, 'head', 'inner_content', 7, 'script', 'attributes')
('html', 'inner_content', 0, 'head', 'inner_content', 7, 'script', 'tail')
('html', 'inner_content', 0, 'head', 'inner_content', 8, 'script', 'inner_content')
('html', 'inner_content', 0, 'head', 'inner_content', 8, 'script', 'attributes')
('html', 'inner_content', 0, 'head', 'inner_content', 8, 'script', 'tail')
('html', 'inner_content', 0, 'head', 'inner_content', 9, 'script', 'inner_content')
('html', 'inner_content', 0, 'head', 'inner_content', 9, 'script', 'attributes')
('html', 'inner_content', 0, 'head', 'inner_content', 9, 'script', 'tail')
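Grouping on position 5 (the element's index within its parent) rather than position 6 (the tag name) matters because groupby merges adjacent rows with equal keys: consecutive script elements would collapse into one group if the tag name were the key. A sketch with a few rows copied by hand from the output above:

```python
import itertools as it
import operator as op

rows = [
    ("html", "inner_content", 0, "head", "inner_content", 7, "script", "attributes"),
    ("html", "inner_content", 0, "head", "inner_content", 7, "script", "tail"),
    ("html", "inner_content", 0, "head", "inner_content", 8, "script", "attributes"),
    ("html", "inner_content", 0, "head", "inner_content", 8, "script", "tail"),
]

# Size of each group when keyed on the child position (index 5).
by_position = [len(list(g)) for _, g in it.groupby(rows, key=op.itemgetter(5))]
# Size of each group when keyed on the tag name (index 6).
by_tag_name = [len(list(g)) for _, g in it.groupby(rows, key=op.itemgetter(6))]

print(by_position, by_tag_name)
```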
In [16]:
items_to_edit = [
    [get_from(data, row) for row in group][1:]  # drop inner_content; keep attributes and tail
    for key, group in it.groupby(paths_to_mutables, key=op.itemgetter(TAG_INDEX))
]
items_to_edit
Out[16]:
[[{'rel': 'stylesheet', 'href': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css', 'integrity': 'sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T', 'crossorigin': 'anonymous'},
  '\n    '],
 [{'defer': 'defer', 'src': 'https://code.jquery.com/jquery-3.3.1.slim.min.js', 'integrity': 'sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo', 'crossorigin': 'anonymous'},
  '\n    '],
 [{'defer': 'defer', 'src': 'https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js', 'integrity': 'sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1', 'crossorigin': 'anonymous'},
  '\n    '],
 [{'defer': 'defer', 'src': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js', 'integrity': 'sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM', 'crossorigin': 'anonymous'},
  '\n    ']]
In [17]:
INTEGRITY = "integrity"
DEFER = "defer"
link_keys = ("href", "rel", INTEGRITY, "crossorigin")
# Script tags share the trailing integrity and crossorigin keys with link tags.
script_keys = (DEFER, "src", *link_keys[link_keys.index(INTEGRITY):])
TAIL_DEFAULT = "\n    "

Bootswatch CSS breaks the basic Bootstrap view.

In [18]:
STYLESHEET = "stylesheet"
BOOTSWATCH_LINK_DATA = (
    None,
    dict(
        zip(
            link_keys,
            (
                "http://netdna.bootstrapcdn.com/bootswatch/4.3.1/cerulean/bootstrap.min.css",
                STYLESHEET,
                None,
                None,
            ),
        )
    ),
    TAIL_DEFAULT,
)
MY_LINK_DATA = (
    None,
    dict(
        zip(
            link_keys,
            (
                "https://static.apps.selfip.com/bootstrap/4.3.1/css/boostrap.min.css",
                STYLESHEET,
                None,
                None,
            ),
        )
    ),
    TAIL_DEFAULT,
)
ALTERNATE_LINK_DATA = (
    None,
    dict(
        zip(
            link_keys,
            (
                "https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.3.1/css/bootstrap.min.css",
                STYLESHEET,
                None,
                None,
            ),
        )
    ),
    TAIL_DEFAULT,
)

LINK_DATA = (
    None,
    items_to_edit[0][0],
    TAIL_DEFAULT,
)
LINK_DATA = ALTERNATE_LINK_DATA
ALTERNATE_LINK_DATA, items_to_edit[0][0]
Out[18]:
((None,
  {'href': 'https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.3.1/css/bootstrap.min.css',
   'rel': 'stylesheet',
   'integrity': None,
   'crossorigin': None},
  '\n    '),
 {'rel': 'stylesheet', 'href': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css', 'integrity': 'sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T', 'crossorigin': 'anonymous'})
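The attribute dictionaries above are built by zipping a tuple of keys with a tuple of values, with None marking attributes that should be left out. A minimal sketch of that pattern; the URL is a placeholder and the present dictionary is an illustrative extra step, not something the notebook does:

```python
INTEGRITY = "integrity"
link_keys = ("href", "rel", INTEGRITY, "crossorigin")
# Reuse the tail of link_keys, starting at "integrity", for script tags.
script_keys = ("defer", "src", *link_keys[link_keys.index(INTEGRITY):])

# Hypothetical values, in the same order as script_keys.
values = ("defer", "https://example.com/app.js", None, None)

attributes = dict(zip(script_keys, values))
# Keep only the attributes that actually have a value.
present = {key: value for key, value in attributes.items() if value is not None}

print(script_keys)
print(present)
```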
In [19]:
new_values = (
    ("link", LINK_DATA),
    *(
        ("script", [None, dict(zip(script_keys, values)), TAIL_DEFAULT])
        for values in (
            (
                DEFER,
                "https://code.jquery.com/jquery-3.3.1.slim.min.js",
                "sha256-3edrmyuQ0w65f8gfBsqowzjJe2iM6n0nKciPUp8y+7E=",
                "anonymous",
            ),
            (
                DEFER,
                "https://unpkg.com/popper.js@1.14.7/dist/umd/popper.min.js",
                None,
                None,
            ),
            (
                DEFER,
                "https://ajax.aspnetcdn.com/ajax/bootstrap/4.3.1/bootstrap.min.js",
                None,
                None,
            ),
        )
    ),
)
In [20]:
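TAG_INDEX = 5  # position of the element's index among head's children: one group per element
grouped = (
    tuple(group)
    for key, group in it.groupby(paths_to_mutables, key=op.itemgetter(TAG_INDEX))
)
TAG_INDEX_ = 6  # position of the tag name within each path
values = tuple(
    (paths[0][TAG_INDEX_], [get_from(data, path) for path in paths])
    for paths in grouped
)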
values
Out[20]:
(('link',
  [None,
   {'rel': 'stylesheet', 'href': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css', 'integrity': 'sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T', 'crossorigin': 'anonymous'},
   '\n    ']),
 ('script',
  [None,
   {'defer': 'defer', 'src': 'https://code.jquery.com/jquery-3.3.1.slim.min.js', 'integrity': 'sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo', 'crossorigin': 'anonymous'},
   '\n    ']),
 ('script',
  [None,
   {'defer': 'defer', 'src': 'https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js', 'integrity': 'sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1', 'crossorigin': 'anonymous'},
   '\n    ']),
 ('script',
  [None,
   {'defer': 'defer', 'src': 'https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js', 'integrity': 'sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM', 'crossorigin': 'anonymous'},
   '\n    ']))
In [21]:
previous_parts = [
    CT(
        tag=tag,
        tal_statements=(TS(ATTRIBUTES, ATTRIBUTES),),
        inner_content=value[2],  # the stored tail whitespace is passed as inner content
    ).render(attributes=value[1])
    for tag, value in values
]
previous_parts
Out[21]:
['<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">',
 '<script defer="defer" src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous">\n    </script>',
 '<script defer="defer" src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous">\n    </script>',
 '<script defer="defer" src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous">\n    </script>']
In [22]:
new_parts = [
    CT(
        tag=tag,
        tal_statements=(TS(ATTRIBUTES, ATTRIBUTES),),
        inner_content=value[2],  # the default tail whitespace is passed as inner content
    ).render(attributes=value[1])
    for tag, value in new_values
]
new_parts
Out[22]:
['<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.3.1/css/bootstrap.min.css" rel="stylesheet">',
 '<script defer="defer" src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha256-3edrmyuQ0w65f8gfBsqowzjJe2iM6n0nKciPUp8y+7E=" crossorigin="anonymous">\n    </script>',
 '<script defer="defer" src="https://unpkg.com/popper.js@1.14.7/dist/umd/popper.min.js">\n    </script>',
 '<script defer="defer" src="https://ajax.aspnetcdn.com/ajax/bootstrap/4.3.1/bootstrap.min.js">\n    </script>']
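In the rendered output above, attributes whose value is None do not appear. The following stand-in shows that behaviour with the standard library only; render_tag is a hypothetical helper and says nothing about how chamelboots or Chameleon implement it.

```python
from html import escape


def render_tag(tag, attributes, inner=None):
    """Render a tag, skipping None-valued attributes (a stand-in, not chamelboots)."""
    rendered = "".join(
        f' {name}="{escape(str(value), quote=True)}"'
        for name, value in attributes.items()
        if value is not None
    )
    if inner is None:
        # No inner content: emit only the start tag, as for a void element.
        return f"<{tag}{rendered}>"
    return f"<{tag}{rendered}>{inner}</{tag}>"


# Placeholder URLs, not the ones used in the notebook.
link = render_tag(
    "link",
    {"href": "https://example.com/a.css", "rel": "stylesheet", "integrity": None},
)
script = render_tag(
    "script",
    {"defer": "defer", "src": "https://example.com/a.js?x=1&y=2"},
    inner="",
)
print(link)
print(script)
```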

Get the lines from starter_html that need replacing

In [23]:
lines_to_replace = (
    (
        i,
        line
        if any(
            item.tag in WANTED_TAGS for item in tuple(element.iterdescendants())[-1:]
        )
        else None,
    )
    for i, line in enumerate(starter_html.splitlines())
    if (element := etree.fromstring(line, HTML_PARSER)) is not None
)
# Keep the zero-based indices of the lines whose parsed element is a wanted tag.
indices = tuple(i for i, line in lines_to_replace if line)
indices
Out[23]:
(7, 10, 11, 12)
In [24]:
new_parts_iter = iter(new_parts)
new_html = Join.LINES(
    line if i not in indices else next(new_parts_iter)
    for i, line in enumerate(starter_html.splitlines())
)
print(BeautifulSoup(new_html, "html.parser").prettify())
<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <!-- Required meta tags -->
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <!-- Bootstrap CSS -->
  <link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.3.1/css/bootstrap.min.css" rel="stylesheet"/>
  <!-- Optional JavaScript -->
  <!-- jQuery first, then Popper.js, then Bootstrap JS -->
  <script crossorigin="anonymous" defer="defer" integrity="sha256-3edrmyuQ0w65f8gfBsqowzjJe2iM6n0nKciPUp8y+7E=" src="https://code.jquery.com/jquery-3.3.1.slim.min.js">
  </script>
  <script defer="defer" src="https://unpkg.com/popper.js@1.14.7/dist/umd/popper.min.js">
  </script>
  <script defer="defer" src="https://ajax.aspnetcdn.com/ajax/bootstrap/4.3.1/bootstrap.min.js">
  </script>
  <title>
   Bootstrap title
  </title>
 </head>
 <body>
  <div>
   <h1>
    Hello, world!
    <span>
     I am a nested span.
    </span>
   </h1>
   <ul class="list-group">
    <li class="list-group-item">
     foo item number 0
    </li>
    <li class="list-group-item">
     foo item number 1
    </li>
    <li class="list-group-item">
     foo item number 2
    </li>
    <li class="list-group-item">
     foo item number 3
    </li>
    <li class="list-group-item">
     foo item number 4
    </li>
    <li class="list-group-item">
     foo item number 5
    </li>
    <li class="list-group-item">
     foo item number 6
    </li>
    <li class="list-group-item">
     foo item number 7
    </li>
    <li class="list-group-item">
     foo item number 8
    </li>
    <li class="list-group-item">
     foo item number 9
    </li>
   </ul>
  </div>
 </body>
</html>

Verify that new_html displays Bootstrap styling.

In [25]:
url = save_to_minio(new_html)
print(url)
https://minio.apps.selfip.com/mymedia/html/tmp8rtlpbmx.html

All values were programmatically replaced with the above code.

In [26]:
display(IFrame(src=url, width="auto", height=500))

What is SymPy?

Make a note about SymPy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.

Resources

In [5]:
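from sympy import Symbol, lambdify


def Unit(x):
    # Plain Python branch: for a Symbol, `x != 0` is a structural comparison
    # that is always True, so Unit(Symbol("x")) returns 0 immediately.
    if x != 0:
        return 0
    else:
        return 1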

x = Symbol('x')
x
Out[5]:
$\displaystyle x$

Maths formatted output!

In [6]:
fx = x**2 + Unit(x)
fx
Out[6]:
$\displaystyle x^{2}$
In [7]:
lam_f = lambdify(x, fx, modules=['sympy'])
lam_f
Out[7]:
<function _lambdifygenerated(x)>
In [8]:
print(lam_f(-1))  # prints 1, i.e. (-1)**2; Unit(x) was already evaluated to 0 when fx was built
1

Stack Overflow Solution: Comprehension List and new File

Stack Overflow solution

In [69]:
from pathlib import Path
import tempfile
from datetime import datetime
import operator as op
In [70]:
_, filename = tempfile.mkstemp()
file_A = Path(filename)
_, filename = tempfile.mkstemp()
file_B = Path(filename)
In [71]:
contents_A = """Adam Brown 10/11/1999
Lauren Marie Smith 9/8/2001
Vincent Guth II 7/9/1980"""
file_A.write_text(contents_A)
Out[71]:
74

Without an assignment expression (the walrus operator, :=).

In [72]:
file_B.write_text("\n".join([
    " ".join(
        op.add(
            line.split()[:-1],
            [
                datetime.strptime(
                    line.split()[-1], "%m/%d/%Y",
                # "%-d" and "%-m" (no zero padding) are platform extensions, not portable everywhere.
                ).strftime("%-d/%-m/%Y",),
            ],
        )
    )
    for line in file_A.read_text().splitlines()
]))
Out[72]:
74

With an assignment expression (the walrus operator, :=).

In [73]:
file_B.write_text(
    "\n".join( # re-join the lines
        [
            " ".join( # re-join the words
                (
                    # Store the split line into words. Then slice all but last.
                    " ".join((words := line.split())[:-1]),
                    # Convert the last word to desired date format.
                    datetime.strptime(words[-1], "%m/%d/%Y",).strftime("%-d/%-m/%Y",),
                )
            )
            for line in file_A.read_text().splitlines()
        ]
    )
)
Out[73]:
74
In [74]:
print(file_B.read_text())
Adam Brown 11/10/1999
Lauren Marie Smith 8/9/2001
Vincent Guth II 9/7/1980
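
The "%-d" and "%-m" directives used above are platform extensions to strftime and may fail on some systems (Windows, for example). A minimal sketch of a portable alternative, assuming the same three sample lines: build the output date from the parsed datetime's attributes instead. The helper name swap_day_month is hypothetical and is not part of the original solution.

```python
from datetime import datetime


def swap_day_month(line: str) -> str:
    """Rewrite the trailing m/d/Y date of a line as d/m/Y without zero padding."""
    # Everything except the last word is the name; the last word is the date.
    *names, date_text = line.split()
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    # Integer attributes format without padding on every platform.
    return " ".join([*names, f"{parsed.day}/{parsed.month}/{parsed.year}"])


sample_lines = [
    "Adam Brown 10/11/1999",
    "Lauren Marie Smith 9/8/2001",
    "Vincent Guth II 7/9/1980",
]
converted = [swap_day_month(line) for line in sample_lines]
print("\n".join(converted))
```

Star-unpacking the split line plays the same role as the walrus assignment above: the words are computed once and reused.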

Trees Using Autovivification with Default Dict

Create trees using autovivification in Python with defaultdict.

Resources

In [1]:
from collections import defaultdict
import operator as op
from functools import reduce, partial
from pprint import pprint

from bs4 import BeautifulSoup
In [2]:
_pprint = partial(pprint, indent=4)
In [3]:
def tree():
    return defaultdict(tree)
In [4]:
def dicts(t):
    return {k: dicts(t[k]) for k in t}
In [5]:
def get(item, path):
    return reduce(op.getitem, path, item)
In [6]:
def add(t, path):
    for node in path:
        t = t[node]
In [7]:
taxonomy = tree()  # root
path = "Animalia>Chordata>Mammalia>Cetacea>Balaenopteridae>Balaenoptera>blue whale".split(
    ">"
)
path
Out[7]:
['Animalia',
 'Chordata',
 'Mammalia',
 'Cetacea',
 'Balaenopteridae',
 'Balaenoptera',
 'blue whale']
In [8]:
add(
    taxonomy, path,
)
None  # prevent output
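
A small sketch of what autovivification means in practice, assuming the same tree and dicts helpers as above: a bare lookup is enough to create every missing level, so probing for a key should use "in" or .get() rather than indexing. The variable names (demo, probe_in, probe_get) are illustrative only.

```python
from collections import defaultdict


def tree():
    # Each missing key is filled in with another tree, recursively.
    return defaultdict(tree)


def dicts(t):
    # Convert nested defaultdicts to plain dicts for display.
    return {k: dicts(t[k]) for k in t}


demo = tree()
demo["a"]["b"]["c"]  # a bare lookup creates all three levels
created = dicts(demo)

probe_in = "z" in demo     # a membership test does not create a key
probe_get = demo.get("z")  # neither does .get()
unchanged = dicts(demo)

print(created)
print(probe_in, probe_get, unchanged == created)
```

This side effect is why add() above needs no explicit insertion: walking the path with t = t[node] builds it.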

❝Flat is better than nested.❞

A dict with keys that are the path to a node.

In [9]:
{
    path_ or ("root",): dicts(get(taxonomy, path_))
    for path_ in (tuple(path[:i]) for i in range(len(path)))
}  # order is outermost to innermost
Out[9]:
{('root',): {'Animalia': {'Chordata': {'Mammalia': {'Cetacea': {'Balaenopteridae': {'Balaenoptera': {'blue whale': {}}}}}}}},
 ('Animalia',): {'Chordata': {'Mammalia': {'Cetacea': {'Balaenopteridae': {'Balaenoptera': {'blue whale': {}}}}}}},
 ('Animalia',
  'Chordata'): {'Mammalia': {'Cetacea': {'Balaenopteridae': {'Balaenoptera': {'blue whale': {}}}}}},
 ('Animalia',
  'Chordata',
  'Mammalia'): {'Cetacea': {'Balaenopteridae': {'Balaenoptera': {'blue whale': {}}}}},
 ('Animalia',
  'Chordata',
  'Mammalia',
  'Cetacea'): {'Balaenopteridae': {'Balaenoptera': {'blue whale': {}}}},
 ('Animalia',
  'Chordata',
  'Mammalia',
  'Cetacea',
  'Balaenopteridae'): {'Balaenoptera': {'blue whale': {}}},
 ('Animalia',
  'Chordata',
  'Mammalia',
  'Cetacea',
  'Balaenopteridae',
  'Balaenoptera'): {'blue whale': {}}}
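
The comprehension above follows a single known path. For a tree with several branches, one possible sketch is a recursive generator that yields every path together with its subtree, which then feeds the same kind of flat, path-keyed dict. The walk helper and the second branch ("Aves") are assumptions added for illustration.

```python
from collections import defaultdict


def tree():
    return defaultdict(tree)


def add(t, path):
    # Walking the path creates any missing nodes along the way.
    for node in path:
        t = t[node]


def walk(t, prefix=()):
    """Yield (path, subtree) for every node, parents before children."""
    for key, child in t.items():
        path = prefix + (key,)
        yield path, child
        yield from walk(child, path)


taxonomy = tree()
add(taxonomy, ("Animalia", "Chordata", "Mammalia"))
add(taxonomy, ("Animalia", "Chordata", "Aves"))

# Flat mapping: path tuple -> sorted names of the direct children.
flat = {path: sorted(child) for path, child in walk(taxonomy)}
for path, children in flat.items():
    print(path, children)
```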
In [10]:
from chamelboots import ChameleonTemplate as CT
from chamelboots import TalStatement as TS
from chamelboots.constants import Join

ATTRIBUTE_CONTENT = (
    TS("attributes", "attrib"),
    TS("content", "structure content"),
)
In [11]:
ID = "query"
groups = (
    (("form", tuple({"method": "GET", "action": "#", "id": "search-form"}.items()),),),
    (("div", tuple({"class": "form-group",}.items()),),),
    (
        ("label", tuple({"class": "col-sm-2 col-form-label col-form-label-lg", "for": ID,}.items()),),
        (
            "input",
            tuple({
                "type": "text",
                "class": "form-control",
                "id": ID,
                "placeholder": "Enter search term",
            }.items()),
        ),
        ("button", tuple({"type": "submit", "class": "btn btn-primary"}.items()),),
        ("button", tuple({"type": "button", "class": "btn btn-secondary"}.items()),),
    ),
)
root = tree()
add(root, groups)
dom = dicts(root)
_pprint(dom)
{   (('form', (('method', 'GET'), ('action', '#'), ('id', 'search-form'))),): {   (('div', (('class', 'form-group'),)),): {   (('label', (('class', 'col-sm-2 col-form-label col-form-label-lg'), ('for', 'query'))), ('input', (('type', 'text'), ('class', 'form-control'), ('id', 'query'), ('placeholder', 'Enter search term'))), ('button', (('type', 'submit'), ('class', 'btn btn-primary'))), ('button', (('type', 'button'), ('class', 'btn btn-secondary')))): {   }}}}

Convert the dom into HTML.

In [12]:
groups = (
    (("form", (("method", "GET"), ("action", "#"), ("id", "search-form")),),),
    (("div", (("class", "form-group"),),),),
    (
        (
            "label",
            (("class", "col-sm-2 col-form-label col-form-label-lg",), ("for", ID,),),
        ),
        (
            "input",
            (
                ("type", "text"),
                ("class", "form-control"),
                ("id", ID),
                ("placeholder", "Enter search term"),
            ),
        ),
        ("button", (("type", "submit",), ("class", "btn btn-primary"),),),
        ("button", (("type", "submit",), ("class", "btn btn-secondary"),),),
    ),
)
root = tree()
add(root, groups)
dom = dicts(root)

start = len(groups) - 1
leaf = Join.LINES(
    CT(tag, ATTRIBUTE_CONTENT).render(attrib=dict(attrib), content="")
    for items in get(dom, groups[:start]).keys()
    for tag, attrib in items
)

for i in reversed(range(start)):
    leaf = Join.LINES(
        CT(tag, ATTRIBUTE_CONTENT).render(attrib=dict(attrib), content=leaf)
        for items in get(dom, groups[:i]).keys()
        for tag, attrib in items
    )
print(BeautifulSoup(leaf, "html.parser").prettify())
<form action="#" id="search-form" method="GET">
 <div class="form-group">
  <label class="col-sm-2 col-form-label col-form-label-lg" for="query">
  </label>
  <input class="form-control" id="query" placeholder="Enter search term" type="text"/>
  <button class="btn btn-primary" type="submit">
  </button>
  <button class="btn btn-secondary" type="submit">
  </button>
 </div>
</form>

Avoid using nested dicts.

In [13]:
groups = (
    (("form", (("method", "GET"), ("action", "#"), ("id", "search-form",)),),),
    (("div", (("class", "form-group"),),),),
    (
        (
            "label",
            (("class", "col-sm-2 col-form-label col-form-label-lg",), ("for", ID,),),
        ),
        (
            "input",
            (
                ("type", "text"),
                ("class", "form-control"),
                ("id", ID),
                ("placeholder", "Enter search term"),
            ),
        ),
        ("button", (("type", "submit",), ("class", "btn btn-primary"),),),
        ("button", (("type", "button",), ("class", "btn btn-secondary"),),),
    ),
)
groups_ = groups[::-1]  # reverse
leaf = Join.LINES(
    CT(tag, ATTRIBUTE_CONTENT).render(attrib=dict(attrib), content=content)
    for (tag, attrib), content in zip(groups_[0], ("Search", "", "search", "clear"))
)
for i in range(1, len(groups)):
    leaf = Join.LINES(
        CT(tag, ATTRIBUTE_CONTENT).render(attrib=dict(attrib), content=leaf)
        for tag, attrib in groups_[i]
    )
print(BeautifulSoup(leaf, "html.parser").prettify())
<form action="#" id="search-form" method="GET">
 <div class="form-group">
  <label class="col-sm-2 col-form-label col-form-label-lg" for="query">
   Search
  </label>
  <input class="form-control" id="query" placeholder="Enter search term" type="text"/>
  <button class="btn btn-primary" type="submit">
   search
  </button>
  <button class="btn btn-secondary" type="button">
   clear
  </button>
 </div>
</form>
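
For comparison, a rough sketch of the same nesting built with only the standard library's xml.etree.ElementTree instead of chamelboots. This is a swapped-in technique, not the approach used above. It assumes, as in this example, that every level except the last holds exactly one wrapper element. The groups data is shortened and the texts tuple is illustrative.

```python
import xml.etree.ElementTree as ET

ID = "query"
groups = (
    (("form", (("method", "GET"), ("action", "#"), ("id", "search-form"))),),
    (("div", (("class", "form-group"),)),),
    (
        ("label", (("class", "col-form-label"), ("for", ID))),
        ("input", (("type", "text"), ("class", "form-control"), ("id", ID))),
        ("button", (("type", "submit"), ("class", "btn btn-primary"))),
    ),
)
texts = ("Search", None, "search")  # one text per leaf element

*wrappers, leaves = groups

# Nest the single-element wrapper levels from the outside in.
root = None
parent = None
for ((tag, attrib),) in wrappers:
    if parent is None:
        parent = root = ET.Element(tag, dict(attrib))
    else:
        parent = ET.SubElement(parent, tag, dict(attrib))

# Attach every leaf to the innermost wrapper.
for (tag, attrib), text in zip(leaves, texts):
    ET.SubElement(parent, tag, dict(attrib)).text = text

# method="html" writes void elements such as <input> without an end tag.
html = ET.tostring(root, encoding="unicode", method="html")
print(html)
```

Working from the outside in avoids the reversed loop, because each new element is attached to its parent as it is created rather than rendered into a string first.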
In [14]:
from IPython.display import HTML
HTML(leaf)
Out[14]:
In [15]:
from chamelboots.constants import FAKE

string = CT(
    "button",
    (TS("repeat", "button buttons"), TS("attributes", "attrib")),
    inner_content="${button}",
).render(
    buttons=(FAKE.name() for _ in range(5)),
    attrib=dict((("type", "button",), ("class", "btn btn-success"),)),
)

display(HTML(string))
print(BeautifulSoup(string, "html.parser").prettify())
<button class="btn btn-success" type="button">
 John Howell
</button>
<button class="btn btn-success" type="button">
 Anthony Stevenson
</button>
<button class="btn btn-success" type="button">
 Zachary Williams
</button>
<button class="btn btn-success" type="button">
 Chelsea Wilson
</button>
<button class="btn btn-success" type="button">
 Scott Reyes
</button>