Iterate through rows of HTML tables.

Iterate through rows of HTML tables.

Visually verify that scraped data matches desired data by inserting a screenshot next to the scraped data.

Jump to bottom.

In [1]:
import urllib.request
from pathlib import Path
from pprint import pprint
import itertools as it
import operator as op
import json
from functools import partial
from collections import namedtuple
import tempfile
from subprocess import check_call
import shlex
from string import whitespace, punctuation
In [2]:
from lxml import etree
import pandas as pd
from IPython.display import display, HTML, Image
from PIL import Image as PILImage
from chamelboots import ChameleonTemplate as CT
from chamelboots import TalStatement as TS
In [3]:
CONTENT = "content"
ATTRIBUTES = "attributes"
TAL_CONTENTS, TAL_ATTRIBUTES = TAL_CONTENTS_WITH_ATTRIBUTES = (
    TS(CONTENT, CONTENT),
    TS(ATTRIBUTES, ATTRIBUTES),
)
HTML_PARSER = etree.HTMLParser()

Load the contents at URL into tempfile.

The data is data scraped from this sports website using a Scrapy script.

I discovered in the process of scraping the website that the HTML tables had different shapes. Some had a head and a body. Some had just rows. I managed to put together some comprehensions that handled both cases. They are difficult to read so here I am going to iterate through all the tables on all the pages to see if a function can be defined that handles all the tables with simpler code.

In [4]:
BASEURL = Path(Path.home(), ".texpander/iowa_sports_stats_json").read_text().strip()

Get Iowa sports data

In [5]:
(filepath,) = (Path(f) for f, _ in (urllib.request.urlretrieve(BASEURL),))

Sort and group the list of dictionaries by sex and sport.

In [6]:
groups = it.groupby(
    sorted(
        json.loads(filepath.read_text()),
        key=(SORT_KEY := op.itemgetter("sex", "sport")),
    ),
    key=SORT_KEY,
)
TABLE_SIZE = 5
dfs = [
    pd.DataFrame(items).head(TABLE_SIZE)
    for key, items in ((k, list(group)) for k, group in groups)
]
In [7]:
df_table, *_ = dfs
display(df_table)
sport sex url Team Record GP AB R H 2B ... RBI SAC BB SO SB SBA HBP OBP SLG AVG
0 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStand... Xavier 41 - 2 - 0 43 1120 371 391 79 ... 306 50 212 162 92 112 53 0.466 0.485 0.349
1 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStand... West Lyon 27 - 2 - 0 29 732 298 254 49 ... 240 37 196 157 141 151 45 0.509 0.485 0.347
2 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStand... Newman Catholic 38 - 3 - 0 41 1055 429 382 90 ... 346 0 257 139 101 103 88 0.519 0.586 0.362
3 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStand... Van Meter 34 - 3 - 0 37 1000 336 340 67 ... 287 43 153 146 125 143 38 0.446 0.454 0.340
4 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStand... Central DeWitt 38 - 4 - 0 42 1222 309 408 92 ... 261 20 161 227 125 156 41 0.425 0.470 0.334

5 rows × 22 columns

In [8]:
LINE = 20 * "*"

Store the element trees in an array to avoid repeated network calls.

In [9]:
DF_Row = namedtuple("DF_Row", "Index url sex sport")
In [10]:
element_trees = [
    (
        DF_Row(index, url, (df_row := df_table.iloc[0]).sex, df_row.sport),
        etree.fromstring(Path(fp).read_text(), HTML_PARSER),
    )
    for index, df_table in enumerate(dfs)
    for url in df_table["url"].drop_duplicates()
    for fp, _ in (urllib.request.urlretrieve(url),)
]
In [11]:
element_trees[:40:8]  # display boys and girls
Out[11]:
[(DF_Row(Index=0, url='http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD', sex='boys', sport='Baseball'),
  <Element html at 0x7f28730536e0>),
 (DF_Row(Index=8, url='http://quikstatsiowa.com/Public/Football/TeamStandings.aspx?IDSport=91A308DE-5763-4DAA-8C03-9AF66611E0BC', sex='boys', sport='Football'),
  <Element html at 0x7f2873734820>),
 (DF_Row(Index=16, url='http://quikstatsiowa.com/Public/Tennis/TeamStandings.aspx?IDSport=19786FF3-ADA3-4C7A-A94F-FAC0811118F5', sex='boys', sport='Tennis'),
  <Element html at 0x7f28730683c0>),
 (DF_Row(Index=24, url='http://quikstatsiowa.com/Public/Golf/IndividualStandings.aspx?IDSport=6DC124A1-D8C4-4F88-84EF-5C6B4FD4A688', sex='girls', sport='Golf: IndividualStandings'),
  <Element html at 0x7f28730a0190>),
 (DF_Row(Index=32, url='http://quikstatsiowa.com/Public/Tennis/IndividualStandings.aspx?IDSport=6086C2DF-4661-4701-BFF1-3BB32C081B88', sex='girls', sport='Tennis: IndividualStandings'),
  <Element html at 0x7f287309a870>)]
In [12]:
assert len(element_trees) == len(dfs)
In [13]:
H2 = CT("h2", (TAL_CONTENTS,)).render
HR = CT("hr").render()

REDH2 = partial(  # create a red h2 HTML
    CT("h2", TAL_CONTENTS_WITH_ATTRIBUTES).render, attributes={"style": "color: red;"},
)
BLUEH3 = partial(  # create a red h2 HTML
    CT("h3", TAL_CONTENTS_WITH_ATTRIBUTES).render, attributes={"style": "color: blue;"},
)

Analyze all the tables in all of the HTML documents.

In [14]:
HTML(REDH2(content="If there are more than 2 tables, then a header is printed in red."))
Out[14]:

If there are more than 2 tables, then a header is printed in red.

In [15]:
TABLE_XPATH = "/html/body/form/div/div[3]/div[3]/div[2]/table"

for (df_row, tree), root, df_table, label, tables in zip(
    element_trees,
    (tree.getroottree() for df_row, tree in element_trees),
    dfs,
    (df_table[["sex", "sport"]].drop_duplicates() for df_table in dfs),
    (tree.xpath("//table") for df_row, tree in element_trees),
):
    for item, text in (
        (label, ""),
        (df_row, ""),
        (
            tables,
            HTML(H2(content=f"table count: {table_length}"))
            if (table_length := len(tables)) == 2
            else HTML(REDH2(content=f"table count: {table_length}")),
        ),
        (
            [root.getpath(table) for table in tables],
            HTML(H2(content="xpaths to tables")),
        ),
        ("", HTML(HR)),
    ):

        if text:
            display(text)
        if not getattr(item, "empty", True):  # handle ambiguity of bool(df)
            display(item)
        elif item:
            display(item)
sex sport
0 boys Baseball
DF_Row(Index=0, url='http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD', sex='boys', sport='Baseball')

table count: 2

[<Element table at 0x7f287304faa0>, <Element table at 0x7f2873040d70>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Baseball: IndividualBestGameStandings
DF_Row(Index=1, url='http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD', sex='boys', sport='Baseball: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f28730420f0>, <Element table at 0x7f28726af280>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Baseball: IndividualStandings
DF_Row(Index=2, url='http://quikstatsiowa.com/Public/BBSB/IndividualStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD', sex='boys', sport='Baseball: IndividualStandings')

table count: 2

[<Element table at 0x7f287304e410>, <Element table at 0x7f287304ec30>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Basketball
DF_Row(Index=3, url='http://quikstatsiowa.com/Public/Basketball/TeamStandings.aspx?IDSport=57C38F60-B323-4087-A557-9ED925DC546D', sex='boys', sport='Basketball')

table count: 2

[<Element table at 0x7f287309c0f0>, <Element table at 0x7f287309ca50>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Bowling
DF_Row(Index=4, url='http://quikstatsiowa.com/Public/Bowling/TeamStandings.aspx?IDSport=DA3506E8-E4CA-4175-BF69-BEBBDC2FD878', sex='boys', sport='Bowling')

table count: 2

[<Element table at 0x7f287304cd20>, <Element table at 0x7f28726af280>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Bowling: IndividualStandings
DF_Row(Index=5, url='http://quikstatsiowa.com/Public/Bowling/IndividualStandings.aspx?IDSport=DA3506E8-E4CA-4175-BF69-BEBBDC2FD878', sex='boys', sport='Bowling: IndividualStandings')

table count: 2

[<Element table at 0x7f287309c050>, <Element table at 0x7f287309c230>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Fall Golf
DF_Row(Index=6, url='http://quikstatsiowa.com/Public/Golf/TeamStandings.aspx?IDSport=92A34DE4-ACB3-4282-BF29-571A97DE1946', sex='boys', sport='Fall Golf')

table count: 2

[<Element table at 0x7f28726af280>, <Element table at 0x7f2873807050>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Fall Golf: IndividualStandings
DF_Row(Index=7, url='http://quikstatsiowa.com/Public/Golf/IndividualStandings.aspx?IDSport=92A34DE4-ACB3-4282-BF29-571A97DE1946', sex='boys', sport='Fall Golf: IndividualStandings')

table count: 2

[<Element table at 0x7f28726afbe0>, <Element table at 0x7f2873040d70>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Football
DF_Row(Index=8, url='http://quikstatsiowa.com/Public/Football/TeamStandings.aspx?IDSport=91A308DE-5763-4DAA-8C03-9AF66611E0BC', sex='boys', sport='Football')

table count: 3

[<Element table at 0x7f28730541e0>,
 <Element table at 0x7f287309cdc0>,
 <Element table at 0x7f287309c230>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/div[10]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Football: IndividualBestGameStandings
DF_Row(Index=9, url='http://quikstatsiowa.com/Public/Football/IndividualBestGameStandings.aspx?IDSport=91A308DE-5763-4DAA-8C03-9AF66611E0BC', sex='boys', sport='Football: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f2873807410>, <Element table at 0x7f2873042dc0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Football: IndividualStandings
DF_Row(Index=10, url='http://quikstatsiowa.com/Public/Football/IndividualStandings.aspx?IDSport=91A308DE-5763-4DAA-8C03-9AF66611E0BC', sex='boys', sport='Football: IndividualStandings')

table count: 3

[<Element table at 0x7f28737b53c0>,
 <Element table at 0x7f28737b5190>,
 <Element table at 0x7f2873036730>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/div[10]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Soccer
DF_Row(Index=11, url='http://quikstatsiowa.com/Public/Soccer/TeamStandings.aspx?IDSport=9D4214D2-EBE6-429E-9005-C11D2A29C89B', sex='boys', sport='Soccer')

table count: 2

[<Element table at 0x7f28738075f0>, <Element table at 0x7f28730365a0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Soccer: IndividualBestGameStandings
DF_Row(Index=12, url='http://quikstatsiowa.com/Public/Soccer/IndividualBestGameStandings.aspx?IDSport=9D4214D2-EBE6-429E-9005-C11D2A29C89B', sex='boys', sport='Soccer: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f2873807050>, <Element table at 0x7f287309ca50>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Soccer: IndividualStandings
DF_Row(Index=13, url='http://quikstatsiowa.com/Public/Soccer/IndividualStandings.aspx?IDSport=9D4214D2-EBE6-429E-9005-C11D2A29C89B', sex='boys', sport='Soccer: IndividualStandings')

table count: 2

[<Element table at 0x7f287304e320>, <Element table at 0x7f28730367d0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Spring Golf
DF_Row(Index=14, url='http://quikstatsiowa.com/Public/Golf/TeamStandings.aspx?IDSport=FC614ADE-B5DA-4012-A95E-0FD2A594FE9D', sex='boys', sport='Spring Golf')

table count: 2

[<Element table at 0x7f2873807640>, <Element table at 0x7f2873036960>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Spring Golf: IndividualStandings
DF_Row(Index=15, url='http://quikstatsiowa.com/Public/Golf/IndividualStandings.aspx?IDSport=FC614ADE-B5DA-4012-A95E-0FD2A594FE9D', sex='boys', sport='Spring Golf: IndividualStandings')

table count: 2

[<Element table at 0x7f287309c320>, <Element table at 0x7f287309c280>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Tennis
DF_Row(Index=16, url='http://quikstatsiowa.com/Public/Tennis/TeamStandings.aspx?IDSport=19786FF3-ADA3-4C7A-A94F-FAC0811118F5', sex='boys', sport='Tennis')

table count: 2

[<Element table at 0x7f287303c1e0>, <Element table at 0x7f287303c500>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 boys Tennis: IndividualStandings
DF_Row(Index=17, url='http://quikstatsiowa.com/Public/Tennis/IndividualStandings.aspx?IDSport=19786FF3-ADA3-4C7A-A94F-FAC0811118F5', sex='boys', sport='Tennis: IndividualStandings')

table count: 2

[<Element table at 0x7f287309c3c0>, <Element table at 0x7f287309caa0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Basketball
DF_Row(Index=18, url='http://quikstatsiowa.com/Public/Basketball/TeamStandings.aspx?IDSport=B657ECDF-ECD0-4429-810A-9F9274EC4AAA', sex='girls', sport='Basketball')

table count: 2

[<Element table at 0x7f28730364b0>, <Element table at 0x7f2873036a00>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Basketball: IndividualBestGameStandings
DF_Row(Index=19, url='http://quikstatsiowa.com/Public/Basketball/IndividualBestGameStandings.aspx?IDSport=B657ECDF-ECD0-4429-810A-9F9274EC4AAA', sex='girls', sport='Basketball: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f2873026550>, <Element table at 0x7f2873026410>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Basketball: IndividualStandings
DF_Row(Index=20, url='http://quikstatsiowa.com/Public/Basketball/IndividualStandings.aspx?IDSport=B657ECDF-ECD0-4429-810A-9F9274EC4AAA', sex='girls', sport='Basketball: IndividualStandings')

table count: 2

[<Element table at 0x7f2873052dc0>, <Element table at 0x7f2873052460>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Bowling
DF_Row(Index=21, url='http://quikstatsiowa.com/Public/Bowling/TeamStandings.aspx?IDSport=0C6DFBCF-98C4-4B01-9F56-17B02E9E47E1', sex='girls', sport='Bowling')

table count: 2

[<Element table at 0x7f2873026410>, <Element table at 0x7f2873052500>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Bowling: IndividualStandings
DF_Row(Index=22, url='http://quikstatsiowa.com/Public/Bowling/IndividualStandings.aspx?IDSport=0C6DFBCF-98C4-4B01-9F56-17B02E9E47E1', sex='girls', sport='Bowling: IndividualStandings')

table count: 2

[<Element table at 0x7f2873052690>, <Element table at 0x7f2873052f00>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Golf
DF_Row(Index=23, url='http://quikstatsiowa.com/Public/Golf/TeamStandings.aspx?IDSport=6DC124A1-D8C4-4F88-84EF-5C6B4FD4A688', sex='girls', sport='Golf')

table count: 2

[<Element table at 0x7f2873076140>, <Element table at 0x7f2873076d70>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Golf: IndividualStandings
DF_Row(Index=24, url='http://quikstatsiowa.com/Public/Golf/IndividualStandings.aspx?IDSport=6DC124A1-D8C4-4F88-84EF-5C6B4FD4A688', sex='girls', sport='Golf: IndividualStandings')

table count: 2

[<Element table at 0x7f2873052e60>, <Element table at 0x7f287303cf00>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Soccer
DF_Row(Index=25, url='http://quikstatsiowa.com/Public/Soccer/TeamStandings.aspx?IDSport=65E5DA09-90C6-45F5-847A-F9A84FD9C5B0', sex='girls', sport='Soccer')

table count: 2

[<Element table at 0x7f2873052f00>, <Element table at 0x7f2873052c80>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Soccer: IndividualBestGameStandings
DF_Row(Index=26, url='http://quikstatsiowa.com/Public/Soccer/IndividualBestGameStandings.aspx?IDSport=65E5DA09-90C6-45F5-847A-F9A84FD9C5B0', sex='girls', sport='Soccer: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f28730766e0>, <Element table at 0x7f287303c140>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Soccer: IndividualStandings
DF_Row(Index=27, url='http://quikstatsiowa.com/Public/Soccer/IndividualStandings.aspx?IDSport=65E5DA09-90C6-45F5-847A-F9A84FD9C5B0', sex='girls', sport='Soccer: IndividualStandings')

table count: 2

[<Element table at 0x7f287303cdc0>, <Element table at 0x7f287303c500>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Softball
DF_Row(Index=28, url='http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=D97DD7D0-0BEF-404A-B041-7E51ACFDBD16', sex='girls', sport='Softball')

table count: 2

[<Element table at 0x7f28730765f0>, <Element table at 0x7f2873076fa0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Softball: IndividualBestGameStandings
DF_Row(Index=29, url='http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=D97DD7D0-0BEF-404A-B041-7E51ACFDBD16', sex='girls', sport='Softball: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f2873076410>, <Element table at 0x7f2873076f00>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Softball: IndividualStandings
DF_Row(Index=30, url='http://quikstatsiowa.com/Public/BBSB/IndividualStandings.aspx?IDSport=D97DD7D0-0BEF-404A-B041-7E51ACFDBD16', sex='girls', sport='Softball: IndividualStandings')

table count: 2

[<Element table at 0x7f28730760a0>, <Element table at 0x7f2873095cd0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Tennis
DF_Row(Index=31, url='http://quikstatsiowa.com/Public/Tennis/TeamStandings.aspx?IDSport=6086C2DF-4661-4701-BFF1-3BB32C081B88', sex='girls', sport='Tennis')

table count: 2

[<Element table at 0x7f2873076410>, <Element table at 0x7f2873095a50>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Tennis: IndividualStandings
DF_Row(Index=32, url='http://quikstatsiowa.com/Public/Tennis/IndividualStandings.aspx?IDSport=6086C2DF-4661-4701-BFF1-3BB32C081B88', sex='girls', sport='Tennis: IndividualStandings')

table count: 2

[<Element table at 0x7f2873095d70>, <Element table at 0x7f28730954b0>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Volleyball
DF_Row(Index=33, url='http://quikstatsiowa.com/Public/Volleyball/TeamStandings.aspx?IDSport=83298383-D7D7-4670-9C6B-24DDB8B2E773', sex='girls', sport='Volleyball')

table count: 2

[<Element table at 0x7f2873069e60>, <Element table at 0x7f2873069b90>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Volleyball: IndividualBestGameStandings
DF_Row(Index=34, url='http://quikstatsiowa.com/Public/Volleyball/IndividualBestGameStandings.aspx?IDSport=83298383-D7D7-4670-9C6B-24DDB8B2E773', sex='girls', sport='Volleyball: IndividualBestGameStandings')

table count: 2

[<Element table at 0x7f28730695a0>, <Element table at 0x7f2873069050>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

sex sport
0 girls Volleyball: IndividualStandings
DF_Row(Index=35, url='http://quikstatsiowa.com/Public/Volleyball/IndividualStandings.aspx?IDSport=83298383-D7D7-4670-9C6B-24DDB8B2E773', sex='girls', sport='Volleyball: IndividualStandings')

table count: 2

[<Element table at 0x7f28730695f0>, <Element table at 0x7f2873095280>]

xpaths to tables

['/html/body/form/div/div[2]/table',
 '/html/body/form/div/div[3]/div[3]/div[2]/table']

Assert that all the HTML docs of interest have a table at the same xpath

In [16]:
TABLE_XPATH = "/html/body/form/div/div[3]/div[3]/div[2]/table"

table_xpaths_per_page = [
    [root.getpath(table) for table in tree.xpath(TABLE_XPATH)]
    for root, tree in ((tree.getroottree(), tree) for df_row, tree in element_trees)
]
display(table_xpaths_per_page[:5])  # display first 5

assert (set_of_xpaths := set(it.chain.from_iterable(table_xpaths_per_page))) == set(
    (TABLE_XPATH,)
)
assert len(set_of_xpaths) == 1
[['/html/body/form/div/div[3]/div[3]/div[2]/table'],
 ['/html/body/form/div/div[3]/div[3]/div[2]/table'],
 ['/html/body/form/div/div[3]/div[3]/div[2]/table'],
 ['/html/body/form/div/div[3]/div[3]/div[2]/table'],
 ['/html/body/form/div/div[3]/div[3]/div[2]/table']]

Analyze tables at path TABLE_XPATH

In [17]:
from chameleon import PageTemplate
In [18]:
# Creating a template for JavaScript code becomes problamatic with templating systems that use {}
# So will use Chameleon

PHANTOM_JS_CODE_TEMPLATE = PageTemplate(
    """
<script tal:omit-tag tal:define="filename filename" tal:define="url url">
var page = require('webpage').create();
page.open('${url}', function() {
  page.render('${filename}');
  phantom.exit();
});
</script>
"""
)
print(PHANTOM_JS_CODE_TEMPLATE.render(filename="foofile", url="barurl"))

var page = require('webpage').create();
page.open('barurl', function() {
  page.render('foofile');
  phantom.exit();
});


Function to crop and resize screenshots

In [19]:
def scale_image(input_image_path, output_image_path, width=None, height=None):
    original_image = PILImage.open(input_image_path)
    w, h = original_image.size
    (
        "The original image size is {wide} wide x {height} "
        "high".format(wide=w, height=h)
    )

    if width and height:
        max_size = (width, height)
    elif width:
        max_size = (width, h)
    elif height:
        max_size = (w, height)
    else:
        # No width or height specified
        raise RuntimeError("Width or height required!")

    original_image.thumbnail(max_size, PILImage.ANTIALIAS)
    original_image.save(output_image_path)

    scaled_image = PILImage.open(output_image_path)
    width, height = scaled_image.size
    (
        "The scaled image size is {wide} wide x {height} "
        "high".format(wide=width, height=height)
    )
In [20]:
def crop_image(input_image_path, output_image_path, area=(0, 0, 0, 0,)):
    img = PILImage.open(input_image_path)
    cropped_img = img.crop(area)
    cropped_img.save(output_image_path)
In [21]:
display(
    HTML(
        REDH2(
            content="Visually verify that the headers obtained from scrape match the website."
        )
    )
)

Visually verify that the headers obtained from scrape match the website.

Issues:

  • When there are 3 tables such as is the case of Football, the wrong headers are obtained. Have the correct headers now for Football.
  • Volleyball headers are broken into too many pieces.

The text inside of the headers sometimes has a <br> tag inside which creates tail text.

Using itertext function makes it impossible to join the words properly.

The comprehension below is a monster, but it gets the job done.

Isolate and examine Footbal table.

In [22]:
(football_html,) = (
    tree for df_row, tree in element_trees if df_row.sport == "Football"
)

[
    [[t for text in e.itertext() if (t := text.strip())] for e in element.xpath("//tr")]
    for element in football_html.xpath(TABLE_XPATH)
][0][4]
Out[22]:
['Team',
 'Record',
 'Pass Att',
 'Comp',
 'Comp %',
 'Yards',
 'Yards/Comp',
 'Ints',
 'TDs',
 'Longest TD',
 'QB Rating']
In [23]:
head, tail = (op.itemgetter(i) for i in range(2))
In [24]:
headers_ = it.chain(
    *(
        [  # handle <thead>
            (
                (
                    results := [
                        words
                        for e in element.iterdescendants()
                        if any(
                            words := (
                                e_text.strip()
                                if (e_text := e.text) is not None
                                else "",
                                e_tail.strip()
                                if (e_tail := e.tail) is not None
                                else "",
                            )
                        )
                    ]
                ),
                [
                    item for item in list(zip(*results))[-1] if item
                ],  # empty if no special joins,
                df_row,
            )
            for table in tree.xpath(TABLE_XPATH)
            for element in table.xpath("//thead")
        ]
        or (
            [  # handle <tr>
                (
                    (
                        results := [
                            words
                            for e in element.iterdescendants()
                            if any(
                                words := (
                                    e_text.strip()
                                    if (e_text := e.text) is not None
                                    else "",
                                    e_tail.strip()
                                    if (e_tail := e.tail) is not None
                                    else "",
                                )
                            )
                        ]
                    ),
                    [
                        item for item in list(zip(*results))[-1] if item
                    ],  # empty if no special joins
                    df_row,
                )
                for table in tree.xpath(TABLE_XPATH)
                for element in table.xpath("//tr")
            ][1:2]
            if df_row.sport
            not in ("Football", "Football: IndividualStandings")  # special cases
            else [  # handle <tr> for special cases
                (
                    (
                        results := [
                            words
                            for e in element.iterdescendants()
                            if any(
                                words := (
                                    e_text.strip()
                                    if (e_text := e.text) is not None
                                    else "",
                                    e_tail.strip()
                                    if (e_tail := e.tail) is not None
                                    else "",
                                )
                            )
                        ]
                    ),
                    [
                        item for item in list(zip(*results))[-1] if item
                    ],  # empty if no special joins
                    df_row,
                )
                for table in tree.xpath(TABLE_XPATH)
                for element in table.xpath("//tr")
            ][
                4:5
            ]  # could not make this work with an assignment expression for some reason
        )
        for df_row, tree in element_trees
    )
)

HEADERS = [
    (
        [
            # first word of first tuple if both the last are empty otherwise join all words
            " ".join(words).strip()
            for items in zip(headers__, headers__[1:])
            if any(words := (head(head(items)), tail(tail(items))))
        ],
        df_row,
    )
    if must_join
    else (list(zip(*headers__))[0], df_row,)
    for headers__, must_join, df_row in headers_
]
for item in HEADERS[-2:]:
    display(item)
(('Athlete', 'Team', 'Aces', 'Date', 'Location', 'Opponent'),
 DF_Row(Index=34, url='http://quikstatsiowa.com/Public/Volleyball/IndividualBestGameStandings.aspx?IDSport=83298383-D7D7-4670-9C6B-24DDB8B2E773', sex='girls', sport='Volleyball: IndividualBestGameStandings'))
(['Athlete',
  'Team',
  'Sets',
  'Kills',
  'Kill Err',
  'Attack Attempts',
  'Kill Eff',
  'Kills per Set',
  'Serve Success',
  'Serves',
  'Serve Eff',
  'Aces',
  'Aces per Set'],
 DF_Row(Index=35, url='http://quikstatsiowa.com/Public/Volleyball/IndividualStandings.aspx?IDSport=83298383-D7D7-4670-9C6B-24DDB8B2E773', sex='girls', sport='Volleyball: IndividualStandings'))
In [25]:
translation = str.maketrans(
    dict(
        zip(
            (*whitespace, *(item for item in punctuation if item != "-")), it.cycle("_")
        )
    )
)

dash_join = "-".join
space_join = ": ".join

LINK = CT("a", TAL_CONTENTS_WITH_ATTRIBUTES).render

anchors, links = zip(
    *(
        (
            (base_word := df_row[-2:][::-1]),
            (word := dash_join(base_word).translate(translation).lower()),
            (top_id := f"top-{word}"),
            LINK(
                attributes=dict(
                    id=word, href=f"#{top_id}", style="font-size: x-large;"
                ),
                content="back to top",
            ),
            LINK(
                attributes=dict(href=f"#{word}", id=top_id),
                content=space_join(base_word),
            ),
        )[-2:]
        for df_row, tree in element_trees
    )
)
display(HTML(CT("a", (TAL_ATTRIBUTES,)).render(attributes=dict(id="bottom", href="3"))))
display(
    HTML(
        CT("ul", (TS(CONTENT, f"structure {CONTENT}"),)).render(
            content=CT(
                "li", (TS("repeat", "link links"), TS(CONTENT, "structure link"))
            ).render(links=links)
        )
    )
)
In [26]:
for anchor, (df_row, tree), (headers, _), df_table in it.islice(
    zip(anchors, element_trees, HEADERS, dfs), 2
):

    for item in (
        HTML(item) for item in (anchor, H2(content=" ".join(df_row[-2:][::-1])))
    ):
        display(item)
    display(df_row)
    print(df_row.url)
    code = PHANTOM_JS_CODE_TEMPLATE.render(
        filename=(input_image_path := tempfile.mkstemp(suffix=".png")[1]),
        url=df_row.url,
    )
    (code_file,) = (Path(fp) for _, fp in (tempfile.mkstemp(suffix=".js"),))
    code_file.write_text(code)
    check_call(shlex.split(f"/usr/local/bin/phantomjs {code_file}"))
    _, output_image_path = tempfile.mkstemp(suffix=".png")
    with PILImage.open(input_image_path) as img:
        w, h = img.size
    area = (0, ll := 305, w, ll * 2)

    crop_image(
        input_image_path, output_image_path, area=area,
    )

    _, scale_image_path = tempfile.mkstemp(suffix=".png")
    with PILImage.open(output_image_path) as img:
        w, h = img.size
    scale_image(output_image_path, scale_image_path, width=w // 1.3, height=None)
    for item in (
        HTML(BLUEH3(content="Scraped data from Scrapy script:")),
        HTML(df_table.head(5).to_html()),
        HTML(BLUEH3(content="Scraped headers from this notebook code:")),
        HTML(BLUEH3(content=headers)),
        Image(filename=scale_image_path),
    ):
        display(item)

Baseball boys

DF_Row(Index=0, url='http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD', sex='boys', sport='Baseball')
http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD

Scraped data from Scrapy script:

sport sex url Team Record GP AB R H 2B 3B HR RBI SAC BB SO SB SBA HBP OBP SLG AVG
0 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Xavier 41 - 2 - 0 43 1120 371 391 79 17 13 306 50 212 162 92 112 53 0.466 0.485 0.349
1 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD West Lyon 27 - 2 - 0 29 732 298 254 49 14 8 240 37 196 157 141 151 45 0.509 0.485 0.347
2 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Newman Catholic 38 - 3 - 0 41 1055 429 382 90 10 42 346 0 257 139 101 103 88 0.519 0.586 0.362
3 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Van Meter 34 - 3 - 0 37 1000 336 340 67 19 3 287 43 153 146 125 143 38 0.446 0.454 0.340
4 Baseball boys http://quikstatsiowa.com/Public/BBSB/TeamStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Central DeWitt 38 - 4 - 0 42 1222 309 408 92 7 20 261 20 161 227 125 156 41 0.425 0.470 0.334

Scraped headers from this notebook code:

('Team', 'Record', 'GP', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SAC', 'BB', 'SO', 'SB', 'SBA', 'HBP', 'OBP', 'SLG', 'AVG')

Baseball: IndividualBestGameStandings boys

DF_Row(Index=1, url='http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD', sex='boys', sport='Baseball: IndividualBestGameStandings')
http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD

Scraped data from Scrapy script:

sport sex url Athlete Team Runs Date Location Opponent
0 Baseball: IndividualBestGameStandings boys http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Braiden Beane 08 West Central Valley 6 6/13/2019 Woodward Academy Woodward Academy
1 Baseball: IndividualBestGameStandings boys http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Carson Wadle SR West Central Valley 6 6/13/2019 Woodward Academy Woodward Academy
2 Baseball: IndividualBestGameStandings boys http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Matt Grein SR Algona 5 5/22/2019 Algona High School Emmetsburg
3 Baseball: IndividualBestGameStandings boys http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Wes Cummings SR PCM 5 5/22/2019 Pleasantville High School Pleasantville
4 Baseball: IndividualBestGameStandings boys http://quikstatsiowa.com/Public/BBSB/IndividualBestGameStandings.aspx?IDSport=B25923B5-D303-41CA-B9B3-DF2527D84CDD Trevor Boge SR North Linn 5 5/23/2019 North Linn High School Edgewood-Colesburg

Scraped headers from this notebook code:

('Athlete', 'Team', 'Runs', 'Date', 'Location', 'Opponent')