End-to-End Data Analysis Project: Part 1. Getting Data

Sonny Torres, Sun 23 December 2018, Python

beautifulsoup, python, tutorial, webscraping


Introduction

After working with Python to manipulate and analyze commercial loan data at my job, I wanted to start a blog where I could showcase my programming projects. However, the loan data I used at my job was subject to copyright protection. Luckily, the Securities and Exchange Commission website publishes similar data.

The Goal

My goals for this project were the following:

  1. Get data : Acquire commercial property loan data that I could use and publish on my blog in order to showcase some of my Python projects in future posts.
  2. Automation : Write a script that can retrieve the data from scratch (no copying and pasting, no tailoring the script to this exact URL) and that can generalize, with minimal modifications, to similar SEC postings.
  3. Practice: Refine my programming and data manipulation skills.

What to Expect in this Post

I will be detailing the process of parsing a messy HTML file with multiple tables and writing the data to CSV files. These files will be combined and cleaned in future posts. This particular post covers the following:

  1. The data source and what the data covers.
  2. A walk through of the layout of the web page and the HTML structure.
  3. Helper functions that are utilized in the main script.
  4. A break down of the main HTML parsing script.

This is not a beginner-friendly Python post. However, I created visualizations where I thought they would help, included several links throughout to reference fundamental programming concepts, and aimed to explain everything in detail while keeping things as simple as possible.

The Source Code

Link to repository here: Github.

The Tools Used

The code in this post is written in Python 3.6 and built primarily with the BeautifulSoup library, plus a little bit of Regular Expressions.

Understanding the Data

The Data Source

To retrieve the HTML file that I used, you can download the copy hosted on my GitHub page, or you can visit the SEC URL below and press Ctrl + S to save the page as an HTML document.

Right-click on the link & save as an HTML file: Download

URL: https://www.sec.gov/Archives/edgar/data/1013611/000152153611000217/q1100154_fwpa1-jpm.htm

To view a webpage's HTML, open the file or website in your browser, right-click anywhere on the page and click "Inspect". Ctrl+Shift+I is the Windows shortcut.

What is CMBS?

Commercial Mortgage-Backed Securities (CMBS) are bonds, or securities as they are sometimes called, backed by a group of commercial mortgages. Because the bonds are eventually sold on a public exchange, the issuers of the bonds are legally required to register the securities with the SEC and must comply with the Securities Act of 1933 by disclosing key due diligence information for potential investors to evaluate. The public data I am working with in this blog post is the Annex A section of a prospectus. A CMBS prospectus informs potential investors of an investment opportunity (bonds of varying risk backed by mortgages), and the Annex A section tabulates the loan data as well as the commercial properties that serve as the collateral for the loans. The SEC data being used in this post covers one securitized pool, or group, of CMBS loans: 44 loans and 216 rows of data.

The individual CMBS data points included in the dataset fall into four main categories:

  1. Mortgage data (Loan Balance, Interest Rate, Maturity Date, etc.)
  2. Commercial Property Data (Location, Property Type, Square Feet, etc.)
  3. Securitization Data (characteristics that describe the securitized structure, such as a given loan's balance as a proportion of the entire pool/group of loans).
  4. Performance Data (historical financial performance of the commercial properties).

Data Arrangement

Despite being one data set, the CMBS data is arranged on the web page such that it is split into several vertically aligned tables. This is likely because the data set has over 100 columns and, if presented as a single data set, would require an inconvenient horizontal scrolling experience for the end user.

Because I wanted a single tabular data set, I had to first understand how the various tables were organized in order to put them together. The charts below illustrate how the first four tables are arranged and how they must be transformed in order to fit together as a single dataset like a CSV or excel file.

Below are four tables with various headers, columns and rows. For simplicity, the visuals below include only 10 commercial properties. Each column, which includes a data point for every row, is broken up into two separate tables: Properties 1-5 are in Table 1, and Table 2 lists the other 5 properties, Properties 6-10. Similarly, Table 4 is a continuation of Table 3.

table of four new legend

Tables 1 & 2 need to be glued together, and 3 & 4 need to be glued together, both along their horizontal axes. Notice how the excess headers (the grey-shaded headers in the above figure) have been eliminated in the figure below.

table of two

Next, the pairs of tables have to be joined together along their respective vertical axes. Because "Prop1, Prop2...etc." not only represent commercial properties in the pool of mortgages but also serve as the indices for every table, merging the data as shown in the image below will produce duplicate indices that must be deleted later.

table horizontal

When iteratively joining every 4 tables in this way, the result will be a single tabular dataset, albeit with plenty of clean up work left to do.

Although the data is not merged into a single data set in this post, assessing the layout significantly informed how I approached parsing the data and writing it to files, as you will see. The tricky table layout actually pales in comparison, for me, to the complexity of parsing the header rows and creating rules to account for blank spaces in the table, which I will explain later in this post.

The HTML Structure

Before writing any code, I first had to carefully examine the HTML structure (Ctrl+Shift+I on Windows to view the HTML, followed by Ctrl+F to find keywords). The key to understanding the HTML, for me, was understanding the tag structure, as well as what characteristics make certain kinds of data unique: specifically, which HTML tag attributes make a given row of data identifiable as a header row, a data row or a row to skip.

Some HTML Table Basics:

  1. table tags define a table of data.
  2. tr tags define the rows in a table.
  3. td tags define a cell in a table.

More on HTML table structures here.
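
To make that nesting concrete, here is a minimal sketch using BeautifulSoup (introduced in the next section). The markup is hypothetical, not taken from the SEC page.

from bs4 import BeautifulSoup

# A tiny, made-up table: one table tag, two tr rows, two td cells per row
html = """
<table>
  <tr><td>Prop1</td><td>1,000,000</td></tr>
  <tr><td>Prop2</td><td>2,000,000</td></tr>
</table>
"""
mini_soup = BeautifulSoup(html, 'lxml')
for row in mini_soup.findAll('tr'):
    print([td.text for td in row.findAll('td')])
# ['Prop1', '1,000,000']
# ['Prop2', '2,000,000']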

Import Libraries and Make Soup

Below are the libraries I used for this task. The working_dir variable is the path that contains the HTML file, and the name of the file is assigned to filename. I often use the os.chdir() function, out of preference, to point my scripts at the appropriate directory, but other methods work just as well.

from bs4 import BeautifulSoup
from bs4 import Tag
import pandas as pd
import shutil
import glob
import csv
import re
import os


working_dir = r'C:\Users\Username\Desktop\End to End Data Analysis Project'
os.chdir(working_dir)
filename = 'JPC11C05.html'

I used Python's with open syntax to create a file object, which gave me the granular, line-by-line, element-by-element detail needed for parsing the HTML file.

with open(filename) as sec:
    soup = BeautifulSoup(sec, 'lxml')

Grab the Tables

Each table of data is encased in table tags in the HTML document. However, there were tables that I did not want, specifically the table tags consisting only of footnotes (at the bottom of the HTML document). I noticed that these tables all had a border attribute within the main table tag (see below).

Good Table

good table

Bad Table

bad table

So, I used a list comprehension to grab all the tables by targeting the table HTML tag, excluding all table tags with a 'border' attribute.

soup.findAll('table') returns a bs4.element.ResultSet which is a list of bs4.element.Tag objects - the table tags.

table_count stores the number of tables to iterate over.

tables = [
    table for table in soup.findAll("table") if 'border' not in table.attrs
    ]

table_count = len(tables)
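
As a quick illustration of why the filter works: a Tag's attributes behave like a dictionary, so a membership test for 'border' is all that is needed. The markup below is hypothetical; the real footnote tables carry more attributes.

# A made-up footnote table - its 'border' attribute gets it excluded
footnote = BeautifulSoup(
    '<table border="0"><tr><td>(1) footnote</td></tr></table>', 'lxml').table
print('border' in footnote.attrs)  # True, so the comprehension skips it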

The Headers

Because the table headers came with their own unique challenges, I wanted to keep track of the headers and table rows separately. I first needed to see what distinguished the headers in the HTML.

Embedded in each td tag is a font tag with a style attribute. I found that all header cells contained the word bold in the style attribute's value (see below), while the table rows never did.

header td

Good Headers

The function I made to identify headers has two parameters, tag and keyword.

def is_header(tag, keyword=r'bold'):
    # returns bs4.element.Tag or None
    return tag.find("font", {"style": re.compile(keyword)})
  • The tag parameter accepts an HTML tag to search for the keyword "bold" within the font tag's style attribute.
  • BeautifulSoup's .find() method accepts the tag to find, "font", and the attribute/value pair, "style" and "bold". .find() returns only the first match it finds, which is sufficient for identifying a header row.
  • When the function does not find the "bold" keyword, it returns None, which is important to know for later; see the short demonstration below.
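
The demonstration, on two hypothetical rows (the markup is illustrative, not copied from the SEC page):

header_html = ('<table><tr><td>'
               '<font style="font-weight: bold;">Loan Balance</font>'
               '</td></tr></table>')
data_html = ('<table><tr><td>'
             '<font style="font-size: 8pt;">1,000,000</font>'
             '</td></tr></table>')

print(is_header(BeautifulSoup(header_html, 'lxml').tr))  # <font ...> tag (truthy)
print(is_header(BeautifulSoup(data_html, 'lxml').tr))    # None (falsy)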

Bad Headers

However, I wanted to exclude some of the header rows. Some of the tables have a hierarchical index structure where one large header row describes a group of columns (see "Hotel Operating Statistics" below).

hotel op stats header

The above "Bad Header" is differentiated in the HTML from the "Good Header" by having a large colspan attribute value. The "colspan" attribute tells us how many columns a given data point spans across. The HTML below shows that the "Hotel Operating Statistics" header has a colspan value of 15.

colspan 15

To be clear, in this HTML file, all colspan attributes are associated with headers but not all headers have a colspan attribute.

I then searched the HTML document manually in my browser, using Ctrl+F, for "colspan". There are only 20 instances of "colspan" in the entire HTML document, and I saw that the highest colspan value for a good header was 2.

Skip Headers with Colspans Greater than 2

The function unwanted_header(), below, accepts the tr tag as the argument for the tag parameter, and the colspan_limit default is set to 2, since a good header can have a colspan value of 2.

def unwanted_header(tag, colspan_limit=2):
    colspan_obj = tag.find("td", {"colspan": re.compile(r"\d+")})
    if colspan_obj:
        colspan_val = int(colspan_obj['colspan'])
        if colspan_val > colspan_limit:
            return True
    # no colspan attribute found, or colspan within the limit
    return False
  • unwanted_header() searches the tr tag for any td tag that has a colspan attribute with a numerical string value.
  • re.compile() is a Regular Expressions function that creates a pattern object BeautifulSoup can match attribute values against.
  • The 'r' prefix marks a raw string, telling Python not to interpret backslash characters (for example, \n is normally interpreted by Python as a newline) so that the re module can apply its own syntax rules. In this case, \d+ means "one or more digit characters".
  • If the colspan attribute is present in one of the row's td tags, colspan_obj will be a Tag object and the if statement is truthy. If the tag is not found, colspan_obj will be None and the if statement is falsy.
  • colspan_val is the string value of the colspan attribute converted to an integer with the int() function.
  • colspan_obj is a bs4.element.Tag object, which means the tag's attributes can be accessed like a dictionary, like this: colspan_obj['colspan'].
  • Finally, if the integer is larger than colspan_limit, the function returns True; if there is no colspan_obj, or the integer is less than or equal to colspan_limit, it returns False.

Then I created a simple function to test the previous two functions' criteria at once.

def is_good_header(tag):
    if is_header(tag) and not unwanted_header(tag):
        return True
    else:
        return False
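
Putting the two checks together on a hypothetical wide header, modeled on the "Hotel Operating Statistics" row:

wide_html = ('<table><tr><td colspan="15">'
             '<font style="font-weight: bold;">Hotel Operating Statistics</font>'
             '</td></tr></table>')
wide_row = BeautifulSoup(wide_html, 'lxml').tr

print(is_header(wide_row) is not None)  # True - the row is bold
print(unwanted_header(wide_row))        # True - colspan 15 > 2
print(is_good_header(wide_row))         # False - skip this header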

The Bad Side of Good Headers

Before, I mentioned that there are some good headers with a colspan value equal to 2. These headers need to be identified in order to correct a problem they present.

In the image below, the column "Other Reserve Cap" reflects how the website's tags are organized when a header value spans 2 columns. A header value with a colspan attribute value of 2, "Other Reserve Cap", is enclosed in 1 td tag, while each row of data below the header has 2 td tags (see the yellow and blue td tags in the picture). This 2-to-1 arrangement is not visible on the website, where everything aligns appropriately. However, because my script iterates over every td tag and writes each one to a cell in my CSV file, I ran into the cell alignment problem illustrated in the figure below labeled "CSV Result". You can see how the "Single Tenant" cell is pushed over to the right by the yellow and blue cells.

cell align new

To identify the good headers with a colspan value of 2, I created a boolean function that returns True when every td tag spans exactly 1 column and False when any does not. Below is the boolean function cells_same_colspan().

def cells_same_colspan(colspan_dict):
    # reduce colspan dict values
    max_colspan_per_row = [max(val) for val in colspan_dict.values()]
    if max(max_colspan_per_row) == 1:
        return True
    else:
        return False
  • colspan_dict is a dictionary object where the keys are the row numbers of a table and the values are lists with the colspan values.
  • By using a list comprehension that calls the max() function on every list of colspan values, I can create a list of the maximum colspan value per row. max() is called once again on that list of per-row maximums to find the highest colspan value for an entire table of data (headers and table rows).
  • The function returns False when the largest colspan value does not equal 1.

Below is an example of what the colspan_dict object, which is passed to the function above, would look like for a table with two rows and a single colspan value equal to 2 in the second row.

# COLSPAN_DICT object

{0: [1, 1, 1, 1],
 1: [1, 1, 2, 1]}

# where the dict keys are table rows (including headers)
# and the dict values are lists containing
# each data point's colspan value

The function cells_same_colspan() would return False on the COLSPAN_DICT above and the row with the colspan value equal to 2 would need to be modified to align the table columns properly.
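
Running the function against both cases confirms the behavior:

print(cells_same_colspan({0: [1, 1, 1, 1],
                          1: [1, 1, 2, 1]}))  # False - row 1 has a colspan of 2
print(cells_same_colspan({0: [1, 1, 1, 1],
                          1: [1, 1, 1, 1]}))  # True - every cell spans 1 column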

Infinite Nest

Like the book Infinite Jest, where the author's writing style can yield a page-long sentence with several tangential thoughts nested within the main thought, the code here required a number of nested objects to work the way I wanted. The data structures I used to track each piece of data (rows, data points, colspan values, etc.), as well as the control structure of the main loop, have a very layered, nested character.

For example, nested_colspan_dict is a function, which I will elaborate on in a moment, that returns the dictionary object below. In this structure, tuples are nested within lists, which are nested within a dictionary. It stores all the information needed to make cell alignment corrections before the data is written to a CSV file.

# NESTED_COLSPAN_DICT object
{0: [ (0, 1), (1, 1), (2, 1) ],
 1: [ (0, 1), (1, 1), (2, 1) ]}

# where the dict keys are table rows
# and the dict values are lists containing tuples
# each tuple contains..
# (index value of a given data point, the colspan value for given point)

Below is the function that produces the nested object.

def nested_colspan_dict(colspan_dict):
    # enumerate each row's colspan values into (index, colspan) tuples
    nested_colspan_obj = {
        key: list(enumerate(value))
        for key, value in colspan_dict.items()}
    return nested_colspan_obj
  • The function nested_colspan_dict creates the data structure by iterating over the key, value pairs in the colspan_dict object, accessed via the .items() method on the dictionary object. The dictionary comprehension creates a new dictionary where the key will remain the row number and the value will be a list of tuples, similar to the example above.
  • Using Python's built-in enumerate() function on the colspan values in each row will produce tuples with the index of a given colspan value and the colspan value itself.
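
For example, feeding it the two-row COLSPAN_DICT from the previous section produces:

print(nested_colspan_dict({0: [1, 1, 1, 1],
                           1: [1, 1, 2, 1]}))
# {0: [(0, 1), (1, 1), (2, 1), (3, 1)],
#  1: [(0, 1), (1, 1), (2, 2), (3, 1)]}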

Next, since I needed to extract data from the nested dictionary multiple times, it made sense to create a function to do this efficiently. The function iterates over the nested dictionary object and, for any colspan value that does not equal 1, yields the row number, the index of the colspan value and the colspan value itself.

def colspans_to_fix(nested_colspan):
    for key, value in nested_colspan.items():
        for tupl in value:
            #tuples are (colspan_index, colspan_value)
            if tupl[1] != 1:
                # key = row number
                # tuple = (index_of_cell, colspan_value_for_cell)
                yield {key: tupl}
  • The function first iterates over the key, value pairs of the nested_colspan object which was created with the previous function nested_colspan_dict.
  • Because the nested_colspan dict values are lists containing tuples, you have to iterate over the individual tuples to access the information inside them (such as the colspan values). For any tuple, if the colspan value does not equal one, the function yields a dictionary containing the row number as the key and the tuple as the value. The tuple is later unpacked to get the colspan value and the index at which it occurs in its row (see the example below).
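
Continuing the same example, the generator surfaces only the problem cell:

nested = nested_colspan_dict({0: [1, 1, 1, 1],
                              1: [1, 1, 2, 1]})
print(list(colspans_to_fix(nested)))
# [{1: (2, 2)}] - row 1, cell at index 2, colspan value of 2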

File Writer Function

Below are some commonly used lines of Python code for Input/Output (I/O) operations (see the Python docs for working with CSV files). I put these basic instructions into a function to curtail redundancy.

# A function to write the rows to a file.
def write_rows(current_table, filename, data):
    with open(str(current_table) + filename + '.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        # the 'if row' condition below skips blank rows
        writer.writerows(row for row in data.values() if row)
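
A hypothetical call, to show the naming convention: the line below writes a file named '0 header.csv' containing two rows, with the empty list filtered out by the 'if row' condition.

write_rows(current_table=0, filename=' header',
           data={0: ['Prop', 'City'], 1: ['Prop1', 'Dallas'], 2: []})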

File Organization Function

After the main for loop finishes (shown after this function), there will be multiple CSV files in the program's directory. To organize the header and table files, the following function creates a folder for each and moves the files into their respective folders.

def headers_tables(name, file_path=os.getcwd()):
    if not os.path.exists(name):
        os.mkdir(name)
    pattern = '*' + name + '.csv'
    file_list = glob.glob(os.path.join(file_path, pattern))
    for file in file_list:
        shutil.move(file, name)

The Bulk of the Script

The main portion of the script is a long for loop with multiple conditional statements to control the flow of the various instructions.

Below is the main script in one block of code, which I will expand on in pieces.

for i in range(table_count):

    table = tables[i]
    header_rows = {'headers': {}, 'tdata': {}}
    colspan = {}
    line_count = 0

    for row in table.findAll("tr"):
        colspan[line_count] = [int(td_tag['colspan'])
                               if 'colspan' in td_tag.attrs else 1
                               for td_tag in row.findAll('td')]
        if is_good_header(row) and i % 2 == 0:
            header_rows['headers'][line_count] = [data.font.text
                                                  if isinstance(data.font, Tag) else ''
                                                  for data in row.findAll('td')]
            line_count += 1
        elif is_header(row) is None:
            header_rows['tdata'][line_count] = [data.font.text
                                                if isinstance(data.font, Tag) else ''
                                                for data in row.findAll('td')]
            line_count += 1
        else:
            continue
    if cells_same_colspan(colspan):
        write_rows(current_table=i, filename=' header', data=header_rows['headers'])
        write_rows(current_table=i, filename=' table', data=header_rows['tdata'])
    else:
        nested_cs = nested_colspan_dict(colspan)
        bad_colspans = [cs for cs in colspans_to_fix(nested_cs)]
        headers_only = [h for h in header_rows['headers']]
        for colspan_dict in bad_colspans:
            for k, v in colspan_dict.items():
                row_index = k
                colspan_ind, colspan_val = v
                colspan_minus_one = colspan_val - 1
                index_for_insert = colspan_ind + 1
                if row_index in headers_only:
                    headers = header_rows['headers'][row_index]
                    for insert in range(colspan_minus_one):
                        headers.insert(index_for_insert, '')
                else:
                    rows = header_rows['tdata'][row_index]
                    for insert in range(colspan_minus_one):
                        rows.insert(index_for_insert, '')
        write_rows(current_table=i, filename=' header', data=header_rows['headers'])
        write_rows(current_table=i, filename=' table', data=header_rows['tdata'])

The Data Container Objects

Before expanding on the for loop, I want to cover the main objects that are created for each iteration.

  1. table : Because I previously created the tables object, I can access each table by its index value. Here, i will be an integer, generated by calling the range() function with table_count, that is used to index each table in the list object tables.
  2. header_rows : This will be the main data structure that stores the data that will be written directly to a CSV file. The nested dictionary will have two main keys, one for header data ('headers') and another for table data ('tdata'). Because this object's structure is layered, I created a small visual of the data structure below.
# header_rows

{'headers': {0: ['a', 'b', 'c'],
             1: ['d', 'e', 'f']},
 'tdata':   {2: ['g', 'h', 'i'],
             3: ['j', 'k', 'l']}
}

# 'headers' is the key for the header dictionary and 'tdata' is the key for the table dictionary.
# 'headers' contains all headers for one table and 'tdata' contains all table data for one table.
# Each integer above represents a row number.
# Each list represents a row.
# Each letter represents a data point in a row.
  3. colspan : This object will be a dictionary where the dictionary keys are the row count and the values will be lists containing the colspan values.
  4. line_count : a simple row counter.

Part 1: Iterate over each Table and each Row

for i in range(table_count):

    table = tables[i]
    header_rows = {'headers': {}, 'tdata': {}}
    colspan = {}
    line_count = 0

Part 2: Create List of Colspan Values

for row in table.findAll("tr"):
    colspan[line_count] = [int(td_tag['colspan'])
                           if 'colspan' in td_tag.attrs else 1
                           for td_tag in row.findAll('td')]
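
To see what this comprehension produces, here is a small hypothetical row run through the same logic (reusing the BeautifulSoup import from earlier):

row = BeautifulSoup(
    '<table><tr><td colspan="2">a</td><td>b</td></tr></table>', 'lxml').tr
print([int(td_tag['colspan']) if 'colspan' in td_tag.attrs else 1
       for td_tag in row.findAll('td')])
# [2, 1] - cells without a colspan attribute default to 1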

Part 3: Populate Dictionary with Header Data

After recording the colspan values, the row, or tr tag, is passed to the boolean is_good_header() function. The i % 2 == 0 condition keeps header rows from even-indexed tables only: as shown in the Data Arrangement section, every second table repeats the headers of the table before it (the grey-shaded "excess" headers), and those duplicates are skipped here.

if is_good_header(row) and i % 2 == 0:
    header_rows['headers'][line_count] = [data.font.text
                                          if isinstance(data.font, Tag) else ''
                                          for data in row.findAll('td')]
    line_count += 1

Part 4: Populate Dictionary with Table Data

elif is_header(row) is None:
    header_rows['tdata'][line_count] = [data.font.text
                                        if isinstance(data.font, Tag) else ''
                                        for data in row.findAll('td')]
    line_count += 1
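
Parts 3 and 4 rely on the same guard: if a td cell has a font child, take its text; otherwise record an empty string. A minimal sketch of that guard on a hypothetical row (reusing the BeautifulSoup and Tag imports from earlier):

row = BeautifulSoup(
    '<table><tr><td><font>Prop1</font></td><td></td></tr></table>', 'lxml').tr
print([data.font.text if isinstance(data.font, Tag) else ''
       for data in row.findAll('td')])
# ['Prop1', ''] - the empty cell has no font tag, so it becomes ''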

Part 5: Write Data to File

    else:
        continue
if cells_same_colspan(colspan):
    write_rows(current_table=i, filename=' header', data=header_rows['headers'])
    write_rows(current_table=i, filename=' table', data=header_rows['tdata'])

Part 6: Fix Colspan Misalignment Issues

This next section was pretty tricky for me. When cells_same_colspan() returns False, meaning there is a colspan value larger than 1, a cell alignment issue is present and must be corrected.

Create Objects to Correct Alignment

  • The nested_cs object below uses the nested_colspan_dict() function, which associates an index value with each colspan value via Python's built-in enumerate() function.
else:
    nested_cs = nested_colspan_dict(colspan)
    bad_colspans = [cs for cs in colspans_to_fix(nested_cs)]
  • The bad_colspans object is created with a list comprehension, producing a list of dictionaries that contain all the information needed to modify a given row's data points. Because colspans_to_fix is a generator function, it only yields values when iterated over or when next() is called on it.
  • As an example of a generated value by the colspans_to_fix() function, if row 1, cell 9 had a colspan value of 2, the generator would yield the object { 1: (9, 2)}.

Because the header data and table data are handled separately, headers_only extracts only the header row numbers (the dictionary keys), which will be referenced later.

headers_only = [h for h in header_rows['headers']]

Use For Loop to Correct Alignment

This last section essentially loops over the bad_colspans list and uses the information contained in it to modify the data container header_rows.

for colspan_dict in bad_colspans:
    for k, v in colspan_dict.items():
        row_index = k
        colspan_ind, colspan_val = v
  • The for loop starts by iterating over the individual dictionaries in the bad_colspans list and then iterates over the keys and values of each dictionary using .items().
  • row_index is assigned to the for loop variable k ('k' being short for 'key' and 'v' being short for 'value'). row_index will be a unique identifier because row numbers are assigned to each row, starting with the very first header row down to the very last row of data in the table.
  • Because the dictionary values are tuples, two variables are used for each tuple to unpack the tuples and assign a variable to each value in the tuple. The tuple variables are colspan_ind (colspan index value) and colspan_val (the colspan values).

Give it Some Space(s)

Continuing down the for loop, you will see two variables with + 1 and - 1. Because the script is structured so that one td tag is written to one cell in the CSV file, I gave each data point a default colspan value of 1 unless the td explicitly showed, via a "colspan" attribute, a value greater than 1. To be clear, the colspan value of 1 is only a placeholder indicating that the data point occupies exactly one cell and can be written to the CSV file as-is.

  • colspan_minus_one is the colspan_val variable minus 1. Because each data point has a default colspan value of 1, I subtract 1 from the larger-than-1 colspan values, so that the result is the number of blank cells that must be added to the right of the larger-than-1 colspan value.
  • Again, using the picture below, labeled "HTML", the header value "Other Reserve Cap" has a colspan value of 2 and the row beneath it has two td tags. In order to avoid the misalignment shown in the "CSV RESULT" table, 1 blank string is added next to "Other Reserve Cap", see the black box below, and the columns are aligned with the data.
cell align new two
  • Similarly, because the blank strings are added to the right of the larger-than-1 colspan values, I add 1 to the colspan index value. So, if a given header cell, "Interest Rate" for example, had a colspan value of 15, then 15 minus 1, or 14, empty strings would be inserted after the header value to account for all the cells underneath "Interest Rate".
  • Because the code iterates over one table at a time, there is only one set of headers and one set of table data rows per table, and each table has one continuous index, the line_count (referenced in this loop as row_index). If the row index is found in the headers_only list of index values, the header dictionary values are modified by calling the .insert() method on the headers variable to insert a blank string one position to the right of the colspan index value (index_for_insert).
  • The else statement then handles any larger-than-1 colspan values present in the table rows (the rows variable). There are no large colspan values in the actual table rows for this particular website, but I have seen them in other CMBS pages on the SEC website. A worked example follows the code below.
colspan_minus_one = colspan_val - 1
index_for_insert = colspan_ind + 1
if row_index in headers_only:
    headers = header_rows['headers'][row_index]
    for insert in range(colspan_minus_one):
        headers.insert(index_for_insert, '')
else:
    rows = header_rows['tdata'][row_index]
    for insert in range(colspan_minus_one):
        rows.insert(index_for_insert, '')
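
As a worked example on a hypothetical header row: if "Other Reserve Cap" sat at index 2 with a colspan value of 2, one blank string would be inserted at index 3, directly to its right.

headers = ['Prop', 'City', 'Other Reserve Cap', 'NOI']
colspan_ind, colspan_val = 2, 2
for insert in range(colspan_val - 1):     # colspan_minus_one = 1
    headers.insert(colspan_ind + 1, '')   # index_for_insert = 3
print(headers)  # ['Prop', 'City', 'Other Reserve Cap', '', 'NOI']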

Part 7: Write the Remaining Data to CSV

Lastly, after the cell alignment issues are taken care of, the write_rows() function is used to write the data to CSV files. Calling headers_tables() with the string arguments 'header' and 'table' will create a directory for each and move the respective files into it.

        write_rows(current_table=i, filename=' header', data=header_rows['headers'])
        write_rows(current_table=i, filename=' table', data=header_rows['tdata'])

headers_tables('header')
headers_tables('table')

There should now be 19 header CSV files and 19 table CSV files, organized into the 'header' and 'table' folders in the directory where you ran the script. The next post will combine all the CSV files and clean up the data.