Demonstrate Example of Tail in HTML

Demonstrate an example of "tail" text in HTML.

While I was trying to scrape some sports parameters from a website, some of the text inside the <td> tags could not be scraped.

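This is when I discovered "tail" in the lxml documentation.

When a <br/> tag is surrounded by text, as in "Hello,<br/>world.", the markup is often referred to as document-style or mixed-content XML. Elements support this through their tail property, which holds the text that directly follows the element, up to the next element in the tree.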
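
A minimal, self-contained sketch of text versus tail may help here. It assumes the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages; lxml.etree exposes the same text and tail attributes. The markup is a made-up snippet, not the scraped page.

```python
import xml.etree.ElementTree as ET

# Mixed content: text, an empty <br/> element, then more text.
p = ET.fromstring("<p>Hello,<br/>world.</p>")
br = p[0]

print(repr(p.text))   # text before the first child of <p>
print(repr(br.text))  # <br/> has no content of its own
print(repr(br.tail))  # text that directly follows <br/>
```

Reading only the text attribute of each element would therefore miss everything after the <br/>.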

In [1]:
from chamelboots import ChameleonTemplate as CT
from chamelboots import TalStatement as TS
from chamelboots.constants import HTML_PARSER
from chamelboots.html.utils import prettify_html

from lxml import etree
In [2]:
TSC = (TS((CONTENT := "content"), "structure content"),)
TSA = (TS((ATTRIB := "attributes"), ATTRIB),)
In [3]:
BR = CT("br")

Create HTML and display it.

In [4]:
html_string = prettify_html(CT("html", TSC).render(content=f"Hello,{BR}world."))
print(html_string)
<html>
 Hello,
 <br/>
 world.
</html>

Create an lxml tree.

In [5]:
tree = etree.fromstring(html_string, HTML_PARSER)

Get all the text from all the elements.

Note how "world" is missing.

In [6]:
[element.text for element in tree.iterdescendants()]
Out[6]:
[None, '\n Hello,\n ', None]

"world." is the tail of the empty <br> element. Note that the HTML parser has also wrapped the content in <body> and <p> elements.

In [7]:
[(element, element.text, element.tail) for element in tree.iterdescendants()]
Out[7]:
[(<Element body at 0x7faed47b6410>, None, None),
 (<Element p at 0x7faed47b6550>, '\n Hello,\n ', None),
 (<Element br at 0x7faed47b60f0>, None, '\n world.\n')]

One way to get all the text is to use itertext.

In [8]:
{element: list(element.itertext()) for element in tree.iterdescendants()}
Out[8]:
{<Element body at 0x7faed47b6410>: ['\n Hello,\n ', '\n world.\n'],
 <Element p at 0x7faed47b6550>: ['\n Hello,\n ', '\n world.\n'],
 <Element br at 0x7faed47b60f0>: []}
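
The empty list for <br> follows from how itertext walks the tree: it yields an element's own text plus the text and tails of its descendants, but not the tail of the element it was called on. The sketch below illustrates this under two assumptions: it uses the standard library's xml.etree.ElementTree (which offers the same itertext() name as lxml.etree), and the markup is invented for the example.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring("<body><p>Hello,<br/>world.</p>after</body>")
p = body[0]
br = p[0]

# Tails of descendants are included ...
body_texts = list(body.itertext())
p_texts = list(p.itertext())
# ... but the tail of the starting element is not.
br_texts = list(br.itertext())

print(body_texts)
print(p_texts)
print(br_texts)
```

Here "after" is the tail of <p>, so it shows up for <body> but not for <p> itself.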

Filter out the empty lists with an assignment expression.

In [9]:
{
    element.tag: words
    for element in tree.iterdescendants()
    if (words := list(element.itertext()))
}
Out[9]:
{'body': ['\n Hello,\n ', '\n world.\n'], 'p': ['\n Hello,\n ', '\n world.\n']}

Stripping whitespace is a common requirement.

In [10]:
{
    element.tag: words
    for element in tree.iterdescendants()
    if (words := [e.strip() for e in element.itertext()])
}
Out[10]:
{'body': ['Hello,', 'world.'], 'p': ['Hello,', 'world.']}
In [11]:
html_string_ = prettify_html(
    CT("html", TSC).render(
        content="Hello,{BR}world.{BR}{link}".format(
            link=CT("a", TSA).render(attributes=dict(href="http://example.com"),),
            BR=BR,
        ),
    )
)
print(html_string_)
<html>
 Hello,
 <br/>
 world.
 <br/>
 <a href="http://example.com">
 </a>
</html>
In [12]:
tree_ = etree.fromstring(html_string_, HTML_PARSER)
In [13]:
[(element, element.text, element.tail) for element in tree_.iterdescendants()]
Out[13]:
[(<Element body at 0x7faed47ad640>, None, None),
 (<Element p at 0x7faed47ad910>, '\n Hello,\n ', None),
 (<Element br at 0x7faed47ad6e0>, None, '\n world.\n '),
 (<Element br at 0x7faed47add70>, None, '\n '),
 (<Element a at 0x7faed47adb90>, '\n ', '\n')]
In [14]:
{element: list(element.itertext()) for element in tree_.iterdescendants()}
Out[14]:
{<Element body at 0x7faed47ad640>: ['\n Hello,\n ',
  '\n world.\n ',
  '\n ',
  '\n ',
  '\n'],
 <Element p at 0x7faed47ad910>: ['\n Hello,\n ',
  '\n world.\n ',
  '\n ',
  '\n ',
  '\n'],
 <Element br at 0x7faed47ad6e0>: [],
 <Element br at 0x7faed47add70>: [],
 <Element a at 0x7faed47adb90>: ['\n ']}
In [15]:
{
    element.tag: words
    for element in tree_.iterdescendants()
    if (words := list(element.itertext()))
}
Out[15]:
{'body': ['\n Hello,\n ', '\n world.\n ', '\n ', '\n ', '\n'],
 'p': ['\n Hello,\n ', '\n world.\n ', '\n ', '\n ', '\n'],
 'a': ['\n ']}

A lot of whitespace from the HTML string itself is in the text.

Here is how to eliminate it.

In [16]:
{
    element.tag: words
    for element in tree_.iterdescendants()
    if (words := [text.strip() for text in element.itertext()])
}
Out[16]:
{'body': ['Hello,', 'world.', '', '', ''],
 'p': ['Hello,', 'world.', '', '', ''],
 'a': ['']}

Now we have lists with empty strings in them.

Here is how to get rid of them.

The Python assignment expression makes that process beautiful.

In [17]:
{
    element.tag: words
    for element in tree_.iterdescendants()
    if (words := [ts for text in element.itertext() if (ts := text.strip())])
}
Out[17]:
{'body': ['Hello,', 'world.'], 'p': ['Hello,', 'world.']}
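
The nested comprehension can be wrapped in a small helper so it is easier to reuse. This is only a sketch: the name stripped_texts is hypothetical, the markup is typed out by hand rather than rendered with chamelboots, and the standard library's xml.etree.ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET


def stripped_texts(element):
    """Return the non-empty, whitespace-stripped text chunks under element."""
    return [ts for text in element.itertext() if (ts := text.strip())]


html = """<html>
 <body>
  <p>
   Hello,
   <br/>
   world.
   <br/>
   <a href="http://example.com">
   </a>
  </p>
 </body>
</html>"""

root = ET.fromstring(html)
# Keep only the elements that contain some non-whitespace text.
result = {el.tag: words for el in root.iter() if (words := stripped_texts(el))}
print(result)
```

Because root.iter() starts with the root element itself, the result has an entry for html as well as for body and p.
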
In [18]:
html_string__ = prettify_html(
    CT("html", TSC).render(
        content="Hello,{BR}world.{BR}{link}{DIV}".format(
            link=CT("a", TSA).render(attributes=dict(href="http://example.com"),),
            BR=BR,
            DIV=CT(),
        ),
    )
)
print(html_string__)
<html>
 Hello,
 <br/>
 world.
 <br/>
 <a href="http://example.com">
 </a>
 <div>
 </div>
</html>

And what if the text is None?

In [19]:
tree__ = etree.fromstring(html_string__, HTML_PARSER)
In [20]:
## Error. NoneType has no attribute "strip"
try:
    [(element, element.text.strip(), element.tail) for element in tree__.iterdescendants()]
except AttributeError as err:
    print(err)
'NoneType' object has no attribute 'strip'
In [32]:
## No error this time: the walrus filter skips elements whose text is None or empty.
try:
    display(
        [
            {element: dict(text=text.strip(), tail=element.tail)}
            for element in tree__.iterdescendants()
            if (text := element.text)
        ]
    )
except AttributeError as err:
    print(err)
[{<Element p at 0x7faed4745c30>: {'text': 'Hello,', 'tail': None}},
 {<Element a at 0x7faed47459b0>: {'text': '', 'tail': '\n '}},
 {<Element div at 0x7faed4745e10>: {'text': '', 'tail': '\n'}}]
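
Another way to avoid the AttributeError while keeping every element is to substitute an empty string for None before stripping. The helper name text_and_tail and the markup below are hypothetical, and the sketch again assumes the standard library's xml.etree.ElementTree in place of lxml.

```python
import xml.etree.ElementTree as ET


def text_and_tail(element):
    """Return (text, tail) stripped, treating None as an empty string."""
    return (element.text or "").strip(), (element.tail or "").strip()


p = ET.fromstring("<p> Hello, <br/> world. <div></div></p>")

# One entry per child of <p>; no element is dropped.
pairs = {child.tag: text_and_tail(child) for child in p}
print(text_and_tail(p))
print(pairs)
```

Unlike the filtered comprehension above, this keeps the <br> element, whose text is None but whose tail carries "world.".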

element.itertext() only yields strings, so "strip" is never called on None.

In [21]:
{
    element: words
    for element in tree__.iterdescendants()
    if (words := [text.strip() for text in element.itertext()])
}
Out[21]:
{<Element body at 0x7faed47455a0>: ['Hello,', 'world.', '', '', '', '', ''],
 <Element p at 0x7faed4745c30>: ['Hello,', 'world.', '', '', ''],
 <Element a at 0x7faed47459b0>: [''],
 <Element div at 0x7faed4745e10>: ['']}

If unwanted empty strings are in the list, remove them with another walrus operator in an if clause.

In [30]:
{
    element: words
    for element in tree__.iterdescendants()
    if (words := [ts for text in element.itertext() if (ts := text.strip())])
}
Out[30]:
{<Element body at 0x7faed47455a0>: ['Hello,', 'world.'],
 <Element p at 0x7faed4745c30>: ['Hello,', 'world.']}
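
To tie this back to the original problem, the sketch below applies the same idea to table cells that contain a <br/>. The row and its values are invented for illustration, and the standard library's xml.etree.ElementTree is assumed in place of lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical table row; the values are made up.
row = ET.fromstring(
    "<tr><td>Height<br/>188 cm</td><td>Weight<br/>85 kg</td></tr>"
)

# Reading only .text loses everything after the <br/>.
only_text = [td.text for td in row]

# itertext() also picks up the tail of the <br/> inside each cell.
full_text = [
    " ".join(ts for text in td.itertext() if (ts := text.strip()))
    for td in row
]

print(only_text)
print(full_text)
```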