End-to-End Data Analysis Project: Part 1. Getting Data

Sonny Torres, Sun 23 December 2018, Python

beautifulsoup, python, tutorial, webscraping


Introduction

After working with Python to manipulate and analyze commercial loan data at my job, I wanted to start a blog where I could showcase my programming projects. However, the loan data I used at my job was subject to copyright protection. Luckily, the Securities and Exchange Commission website publishes similar data.

The Goal

My goals for this project were the following:

  1. Get data : Acquire commercial property loan data that I could use and publish on my blog in order to showcase some of my Python projects in future posts.
  2. Automation : Write a script that can retrieve the data from scratch (no copying and pasting, no tailoring the script to this exact URL) and that can generalize, with minimal modifications, to similar SEC postings.
  3. Practice: Refine my programming and data manipulation skills.

What to Expect in this Post

I will be detailing the process of parsing a messy HTML file with multiple tables and writing the data to CSV files. These files will be combined and cleaned in future posts. This particular post covers the following:

  1. The data source and what the data covers.
  2. A walk through of the layout of the web page and the HTML structure.
  3. Helper functions that are utilized in the main script.
  4. A break down of the main HTML parsing script.

This is not a beginner-friendly Python post. However, I created visualizations where I thought they would help, included several links throughout to reference fundamental programming concepts, and aimed to explain everything in detail while keeping things as simple as possible.

The Source Code

Link to repository here: Github.

The Tools Used

The code in this post is written in Python 3.6 and built primarily with the BeautifulSoup library, plus a little bit of Regular Expressions.

Understanding the Data

The Data Source

To retrieve the HTML file that I used, you can download the copy hosted on my GitHub page, or you can visit the SEC URL below and press Ctrl + S to save the page as an HTML document.

Right-click on the link & save as an HTML file: Download

URL: https://www.sec.gov/Archives/edgar/data/1013611/000152153611000217/q1100154_fwpa1-jpm.htm

To view a webpage's HTML, open the file or website in your browser, right-click anywhere on the page and click "Inspect". Ctrl+Shift+I is the Windows shortcut.

What is CMBS?

Commercial Mortgage-Backed Securities (CMBS) are bonds, or securities as they are sometimes called, backed by a group of commercial mortgages. Because the bonds are eventually sold on a public exchange, the issuers of the bonds are legally required to register the securities with the SEC and must comply with the Securities Act of 1933 by disclosing key due diligence information for potential investors to evaluate. The public data I am working with in this blog post is the Annex A section of a prospectus. A CMBS prospectus informs potential investors of an investment opportunity (bonds of varying risk backed by mortgages), and the Annex A section tabulates the loan data as well as the commercial properties that serve as the collateral for the loans. The SEC data being used in this post covers one securitized pool, or group, of CMBS loans: 44 loans and 216 rows of data.

The individual CMBS data points included in the dataset fall into four main categories:

  1. Mortgage data (Loan Balance, Interest Rate, Maturity Date, etc.)
  2. Commercial Property Data (Location, Property Type, Square Feet, etc.)
  3. Securitization Data (characteristics that describe the securitized structure, such as a given loan's balance as a proportion of the entire pool/group of loans).
  4. Performance Data (historical financial performance of the commercial properties).

Data Arrangement

Despite being one data set, the CMBS data is arranged on the web page such that it is split into several vertically aligned tables. This is likely because the data set has over 100 columns and, if presented as a single data set, would require an inconvenient horizontal scrolling experience for the end user.

Because I wanted a single tabular data set, I had to first understand how the various tables were organized in order to put them together. The charts below illustrate how the first four tables are arranged and how they must be transformed in order to fit together as a single dataset like a CSV or excel file.

Below are four tables with various headers, columns and rows. For simplicity, the visuals below include only 10 commercial properties. Each column, which includes a data point for every row, is broken up into two separate tables: Properties 1-5 are in Table 1, and Table 2 lists the other 5 properties, Properties 6-10. Similarly, Table 4 is a continuation of Table 3.

table of four new legend

Tables 1 & 2 need to be glued together, and 3 & 4 need to be glued together, both along their horizontal axes. Notice how the excess headers (the grey-shaded headers in the above figure) have been eliminated in the figure below.

table of two

Next, the pairs of tables have to be joined together along their respective vertical axes. Because "Prop1, Prop2...etc." not only represent commercial properties in the pool of mortgages but also serve as the indices for every table, merging the data as shown in the image below will produce duplicate indices that must be deleted later.

table horizontal

When iteratively joining every 4 tables in this way, the result will be a single tabular dataset, albeit with plenty of clean up work left to do.

Although the data is not merged into a single data set in this post, assessing the layout significantly informed how I approached parsing the data and writing it to files, as you will see. The tricky table layout actually pales in comparison, for me, to the complexity of parsing the header rows and creating rules to account for blank spaces in the table, which I will explain later in this post.

The HTML Structure

Before writing any code, I first had to carefully examine the HTML structure (Ctrl+Shift+I on Windows to view the HTML, followed by Ctrl+F to find keywords). The key to understanding the HTML, for me, was understanding the tag structure, as well as what characteristics make certain kinds of data unique: specifically, which HTML tag attributes make a given row of data identifiable as a header row, a data row or a row to skip.

Some HTML Table Basics:

  1. table tags define a table of data.
  2. tr tags define the rows in a table.
  3. td tags define a cell in a table.

More on HTML table structures here.
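
To make that nesting concrete, here is a minimal sketch using BeautifulSoup (introduced in the next section). The markup is hypothetical, not taken from the SEC page.

from bs4 import BeautifulSoup

# A tiny, made-up table: one table tag, two tr rows, two td cells per row
html = """
<table>
  <tr><td>Prop1</td><td>1,000,000</td></tr>
  <tr><td>Prop2</td><td>2,000,000</td></tr>
</table>
"""
mini_soup = BeautifulSoup(html, 'lxml')
for row in mini_soup.findAll('tr'):
    print([td.text for td in row.findAll('td')])
# ['Prop1', '1,000,000']
# ['Prop2', '2,000,000']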

Import Libraries and Make Soup

Below are the libraries I used for this task. The working_dir variable is the path that contains the HTML file, and the name of the file is assigned to filename. I often use the os.chdir() function, out of preference, to point my scripts at the appropriate directory, but other methods work just as well.

from bs4 import BeautifulSoup
from bs4 import Tag
import pandas as pd
import shutil
import glob
import csv
import re
import os


working_dir = r'C:\Users\Username\Desktop\End to End Data Analysis Project'
os.chdir(working_dir)
filename = 'JPC11C05.html'

I used Python's with open syntax to create a file object, which gave me the granular, line-by-line, element-by-element detail needed for parsing the HTML file.

with open(filename) as sec:
    soup = BeautifulSoup(sec, 'lxml')

Grab the Tables

Each table of data is encased in table tags in the HTML document. However, there were tables that I did not want, specifically the table tags consisting only of footnotes (at the bottom of the HTML document). I noticed that these tables all had a border attribute within the main table tag (see below).

Good Table

good table

Bad Table

bad table

So, I used a list comprehension to grab all the tables by targeting the table HTML tag, excluding all table tags with a 'border' attribute.

soup.findAll('table') returns a bs4.element.ResultSet which is a list of bs4.element.Tag objects - the table tags.

table_count stores the number of tables to iterate over.

tables = [
    table for table in soup.findAll("table") if 'border' not in table.attrs
    ]

table_count = len(tables)
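
As a quick illustration of why the filter works: a Tag's attributes behave like a dictionary, so a membership test for 'border' is all that is needed. The markup below is hypothetical; the real footnote tables carry more attributes.

# A made-up footnote table - its 'border' attribute gets it excluded
footnote = BeautifulSoup(
    '<table border="0"><tr><td>(1) footnote</td></tr></table>', 'lxml').table
print('border' in footnote.attrs)  # True, so the comprehension skips it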

The Headers

Because the table headers came with their own unique challenges, I wanted to keep track of the headers and table rows separately. I first needed to see what distinguished the headers in the HTML.

Embedded in each td tag is a font tag with a style attribute. I found that all header cells contained the word bold in the style attribute's value (see below), while the table rows never did.

header td

Good Headers

The function I made to identify headers has two parameters, tag and keyword.

def is_header(tag, keyword=r'bold'):
    # returns bs4.element.Tag or None
    return tag.find("font", {"style": re.compile(keyword)})
  • The tag parameter accepts an HTML tag to search for the keyword "bold" within the font tag's style attribute.
  • BeautifulSoup's .find() method accepts the tag to find, "font", and the attribute/value pair, "style" and "bold". .find() returns only the first match it finds, which is sufficient for identifying a header row.
  • When the function does not find the "bold" keyword, it returns None, which is important to know for later; see the short demonstration below.
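
The demonstration, on two hypothetical rows (the markup is illustrative, not copied from the SEC page):

header_html = ('<table><tr><td>'
               '<font style="font-weight: bold;">Loan Balance</font>'
               '</td></tr></table>')
data_html = ('<table><tr><td>'
             '<font style="font-size: 8pt;">1,000,000</font>'
             '</td></tr></table>')

print(is_header(BeautifulSoup(header_html, 'lxml').tr))  # <font ...> tag (truthy)
print(is_header(BeautifulSoup(data_html, 'lxml').tr))    # None (falsy)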

Bad Headers

However, I wanted to exclude some of the header rows. Some of the tables have a hierarchical index structure where one large header row describes a group of columns (see "Hotel Operating Statistics" below).

hotel op stats header

The above "Bad Header" is differentiated in the HTML from the "Good Header" by having a large colspan attribute value. The "colspan" attribute tells us how many columns a given data point spans across. The HTML below shows that the "Hotel Operating Statistics" header has a colspan value of 15.

colspan 15

To be clear, in this HTML file, all colspan attributes are associated with headers but not all headers have a colspan attribute.

I then searched the HTML document manually in my browser, using Ctrl+F, for "colspan". There are only 20 instances of "colspan" in the entire HTML document, and I saw that the highest colspan value for a good header was 2.

Skip Headers with Colspans Greater than 2

The function unwanted_header(), below, accepts the tr tag as the argument for the tag parameter, and the colspan_limit default is set to 2, since a good header can have a colspan value of 2.

def unwanted_header(tag, colspan_limit=2):
    colspan_obj = tag.find("td", {"colspan": re.compile(r"\d+")})
    if colspan_obj:
        colspan_val = int(colspan_obj['colspan'])
        if colspan_val > colspan_limit:
            return True
    # no colspan attribute found, or colspan within the limit
    return False
  • unwanted_header() searches the tr tag for any td tag that has a colspan attribute with a numerical string value.
  • re.compile() is a Regular Expressions function that creates a pattern object BeautifulSoup can match attribute values against.
  • The 'r' prefix marks a raw string, telling Python not to interpret backslash characters (for example, \n is normally interpreted by Python as a newline) so that the re module can apply its own syntax rules. In this case, \d+ means "one or more digit characters".
  • If the colspan attribute is present in one of the row's td tags, colspan_obj will be a Tag object and the if statement is truthy. If the tag is not found, colspan_obj will be None and the if statement is falsy.
  • colspan_val is the string value of the colspan attribute converted to an integer with the int() function.
  • colspan_obj is a bs4.element.Tag object, which means the tag's attributes can be accessed like a dictionary, like this: colspan_obj['colspan'].
  • Finally, if the integer is larger than colspan_limit, the function returns True; if there is no colspan_obj, or the integer is less than or equal to colspan_limit, it returns False.

Then I created a simple function to test the previous two functions' criteria at once.

def is_good_header(tag):
    if is_header(tag) and not unwanted_header(tag):
        return True
    else:
        return False
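
Putting the two checks together on a hypothetical wide header, modeled on the "Hotel Operating Statistics" row:

wide_html = ('<table><tr><td colspan="15">'
             '<font style="font-weight: bold;">Hotel Operating Statistics</font>'
             '</td></tr></table>')
wide_row = BeautifulSoup(wide_html, 'lxml').tr

print(is_header(wide_row) is not None)  # True - the row is bold
print(unwanted_header(wide_row))        # True - colspan 15 > 2
print(is_good_header(wide_row))         # False - skip this header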

The Bad Side of Good Headers

Before, I mentioned that there are some good headers with a colspan value equal to 2. These headers need to be identified in order to correct a problem they present.

In the image below, the column "Other Reserve Cap" reflects how the website's tags are organized when a header value spans 2 columns. A header value with a colspan attribute value of 2, "Other Reserve Cap", is enclosed in 1 td tag, while each row of data below the header has 2 td tags (see the yellow and blue td tags in the picture). This 2-to-1 arrangement is not visible on the website, where everything aligns appropriately. However, because my script iterates over every td tag and writes each one to a cell in my CSV file, I ran into the cell alignment problem illustrated in the figure below labeled "CSV Result". You can see how the "Single Tenant" cell is pushed over to the right by the yellow and blue cells.

cell align new

To identify the good headers with a colspan value of 2, I created a boolean function that returns True when every td tag spans exactly 1 column and False when any does not. Below is the boolean function cells_same_colspan().

def cells_same_colspan(colspan_dict):
    # reduce colspan dict values
    max_colspan_per_row = [max(val) for val in colspan_dict.values()]
    if max(max_colspan_per_row) == 1:
        return True
    else:
        return False
  • colspan_dict is a dictionary object where the keys are the row numbers of a table and the values are lists with the colspan values.
  • By using a list comprehension that calls the max() function on every list of colspan values, I can create a list of the maximum colspan value per row. max() is called once again on that list of per-row maximums to find the highest colspan value for an entire table of data (headers and table rows).
  • The function returns False when the largest colspan value does not equal 1.

Below is an example of what the colspan_dict object, which is passed to the function above, would look like for a table with two rows and a single colspan value equal to 2 in the second row.

# COLSPAN_DICT object

{0: [1, 1, 1, 1],
 1: [1, 1, 2, 1]}

# where the dict keys are table rows (including headers)
# and the dict values are lists containing
# each data point's colspan value

The function cells_same_colspan() would return False on the COLSPAN_DICT above and the row with the colspan value equal to 2 would need to be modified to align the table columns properly.
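
Running the function against both cases confirms the behavior:

print(cells_same_colspan({0: [1, 1, 1, 1],
                          1: [1, 1, 2, 1]}))  # False - row 1 has a colspan of 2
print(cells_same_colspan({0: [1, 1, 1, 1],
                          1: [1, 1, 1, 1]}))  # True - every cell spans 1 column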

Infinite Nest

Like the book Infinite Jest, where the author's writing style can yield a page-long sentence with several tangential thoughts nested within the main thought, the code here required a number of nested objects to work the way I wanted. The data structures I used to track each piece of data (rows, data points, colspan values, etc.), as well as the control structure of the main loop, have a very layered, nested character.

For example, nested_colspan_dict is a function, which I will elaborate on in a moment, that returns the dictionary object below. In this structure, tuples are nested within lists, which are nested within a dictionary. It stores all the information needed to make cell alignment corrections before the data is written to a CSV file.

# NESTED_COLSPAN_DICT object
{0: [ (0, 1), (1, 1), (2, 1) ],
 1: [ (0, 1), (1, 1), (2, 1) ]}

# where the dict keys are table rows
# and the dict values are lists containing tuples
# each tuple contains..
# (index value of a given data point, the colspan value for given point)

Below is the function that produces the nested object.

def nested_colspan_dict(colspan_dict):
    # enumerate each row's colspan values into (index, colspan) tuples
    nested_colspan_obj = {
        key: list(enumerate(value))
        for key, value in colspan_dict.items()}
    return nested_colspan_obj
  • The function nested_colspan_dict creates the data structure by iterating over the key, value pairs in the colspan_dict object, accessed via the .items() method on the dictionary object. The dictionary comprehension creates a new dictionary where the key will remain the row number and the value will be a list of tuples, similar to the example above.
  • Using Python's built-in enumerate() function on the colspan values in each row will produce tuples with the index of a given colspan value and the colspan value itself.
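
For example, feeding it the two-row COLSPAN_DICT from the previous section produces:

print(nested_colspan_dict({0: [1, 1, 1, 1],
                           1: [1, 1, 2, 1]}))
# {0: [(0, 1), (1, 1), (2, 1), (3, 1)],
#  1: [(0, 1), (1, 1), (2, 2), (3, 1)]}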

Next, since I needed to extract data from the nested dictionary multiple times, it made sense to create a function to do this efficiently. The function iterates over the nested dictionary object and, for any colspan value that does not equal 1, yields the row number, the index of the colspan value and the colspan value itself.

def colspans_to_fix(nested_colspan):
    for key, value in nested_colspan.items():
        for tupl in value:
            #tuples are (colspan_index, colspan_value)
            if tupl[1] != 1:
                # key = row number
                # tuple = (index_of_cell, colspan_value_for_cell)
                yield {key: tupl}
  • The function first iterates over the key, value pairs of the nested_colspan object which was created with the previous function nested_colspan_dict.
  • Because the nested_colspan dict values are lists containing tuples, you have to iterate over the individual tuples to access the information inside them (such as the colspan values). For any tuple, if the colspan value does not equal one, the function yields a dictionary containing the row number as the key and the tuple as the value. The tuple is later unpacked to get the colspan value and the index at which it occurs in its row (see the example below).
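
Continuing the same example, the generator surfaces only the problem cell:

nested = nested_colspan_dict({0: [1, 1, 1, 1],
                              1: [1, 1, 2, 1]})
print(list(colspans_to_fix(nested)))
# [{1: (2, 2)}] - row 1, cell at index 2, colspan value of 2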

File Writer Function

Below are some commonly used lines of Python code for Input/Output (I/O) operations (see the Python docs for working with CSV files). I put these basic instructions into a function to curtail redundancy.

# A function to write the rows to a file.
def write_rows(current_table, filename, data):
    with open(str(current_table) + filename + '.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        # the 'if row' condition below skips blank rows
        writer.writerows(row for row in data.values() if row)
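
A hypothetical call, to show the naming convention: the line below writes a file named '0 header.csv' containing two rows, with the empty list filtered out by the 'if row' condition.

write_rows(current_table=0, filename=' header',
           data={0: ['Prop', 'City'], 1: ['Prop1', 'Dallas'], 2: []})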

File Organization Function

After the main for loop finishes (shown after this function), there will be multiple CSV files in the program's directory. To organize the header and table files, the following function creates a folder for each and moves the files into their respective folders.

def headers_tables(name, file_path=os.getcwd()):
    if not os.path.exists(name):
        os.mkdir(name)
    pattern = '*' + name + '.csv'
    file_list = glob.glob(os.path.join(file_path, pattern))
    for file in file_list:
        shutil.move(file, name)

The Bulk of the Script

The main portion of the script is a long for loop with multiple conditional statements to control the flow of the various instructions.

Below is the main script in one block of code, which I will expand on in pieces.

for i in range(table_count):

    table = tables[i]
    header_rows = {'headers': {}, 'tdata': {}}
    colspan = {}
    line_count = 0

    for row in table.findAll("tr"):
        colspan[line_count] = [int(td_tag['colspan'])
                               if 'colspan' in td_tag.attrs else 1
                               for td_tag in row.findAll('td')]
        if is_good_header(row) and i % 2 == 0:
            header_rows['headers'][line_count] = [data.font.text
                                                  if isinstance(data.font, Tag) else ''
                                                  for data in row.findAll('td')]
            line_count += 1
        elif is_header(row) is None:
            header_rows['tdata'][line_count] = [data.font.text
                                                if isinstance(data.font, Tag) else ''
                                                for data in row.findAll('td')]
            line_count += 1
        else:
            continue
    if cells_same_colspan(colspan):
        write_rows(current_table=i, filename=' header', data=header_rows['headers'])
        write_rows(current_table=i, filename=' table', data=header_rows['tdata'])
    else:
        nested_cs = nested_colspan_dict(colspan)
        bad_colspans = [cs for cs in colspans_to_fix(nested_cs)]
        headers_only = [h for h in header_rows['headers']]
        for colspan_dict in bad_colspans:
            for k, v in colspan_dict.items():
                row_index = k
                colspan_ind, colspan_val = v
                colspan_minus_one = colspan_val - 1
                index_for_insert = colspan_ind + 1
                if row_index in headers_only:
                    headers = header_rows['headers'][row_index]
                    for insert in range(colspan_minus_one):
                        headers.insert(index_for_insert, '')
                else:
                    rows = header_rows['tdata'][row_index]
                    for insert in range(colspan_minus_one):
                        rows.insert(index_for_insert, '')
        write_rows(current_table=i, filename=' header', data=header_rows['headers'])
        write_rows(current_table=i, filename=' table', data=header_rows['tdata'])

The Data Container Objects

Before expanding on the for loop, I want to cover the main objects that are created for each iteration.

  1. table : Because I previously created the tables object, I can access each table by its index value. Here, i will be an integer, generated by calling the range() function with table_count, that is used to index each table in the list object tables.
  2. header_rows : This will be the main data structure that stores the data that will be written directly to a CSV file. The nested dictionary will have two main keys, one for header data ('headers') and another for table data ('tdata'). Because this object's structure is layered, I created a small visual of the data structure below.
# header_rows

{'headers': {0: ['a', 'b', 'c'],
             1: ['d', 'e', 'f']},
 'tdata':   {2: ['g', 'h', 'i'],
             3: ['j', 'k', 'l']}
}

# 'headers' is the key for the header dictionary and 'tdata' is the key for the table dictionary.
# 'headers' contains all headers for one table and 'tdata' contains all table data for one table.
# Each integer above represents a row number.
# Each list represents a row.
# Each letter represents a data point in a row.
  3. colspan : This object will be a dictionary where the dictionary keys are the row count and the values will be lists containing the colspan values.
  4. line_count : a simple row counter.

Part 1: Iterate over each Table and each Row

for i in range(table_count):

    table = tables[i]
    header_rows = {'headers': {}, 'tdata': {}}
    colspan = {}
    line_count = 0

Part 2: Create List of Colspan Values

for row in table.findAll("tr"):
    colspan[line_count] = [int(td_tag['colspan'])
                           if 'colspan' in td_tag.attrs else 1
                           for td_tag in row.findAll('td')]
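
To see what this comprehension produces, here is a small hypothetical row run through the same logic (reusing the BeautifulSoup import from earlier):

row = BeautifulSoup(
    '<table><tr><td colspan="2">a</td><td>b</td></tr></table>', 'lxml').tr
print([int(td_tag['colspan']) if 'colspan' in td_tag.attrs else 1
       for td_tag in row.findAll('td')])
# [2, 1] - cells without a colspan attribute default to 1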

Part 3: Populate Dictionary with Header Data

After recording the colspan values, the row, or tr tag, is passed to the boolean is_good_header() function. The i % 2 == 0 condition keeps header rows from even-indexed tables only: as shown in the Data Arrangement section, every second table repeats the headers of the table before it (the grey-shaded "excess" headers), and those duplicates are skipped here.

if is_good_header(row) and i % 2 == 0:
    header_rows['headers'][line_count] = [data.font.text
                                          if isinstance(data.font, Tag) else ''
                                          for data in row.findAll('td')]
    line_count += 1

Part 4: Populate Dictionary with Table Data

elif is_header(row) is None:
    header_rows['tdata'][line_count] = [data.font.text
                                        if isinstance(data.font, Tag) else ''
                                        for data in row.findAll('td')]
    line_count += 1
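
Parts 3 and 4 rely on the same guard: if a td cell has a font child, take its text; otherwise record an empty string. A minimal sketch of that guard on a hypothetical row (reusing the BeautifulSoup and Tag imports from earlier):

row = BeautifulSoup(
    '<table><tr><td><font>Prop1</font></td><td></td></tr></table>', 'lxml').tr
print([data.font.text if isinstance(data.font, Tag) else ''
       for data in row.findAll('td')])
# ['Prop1', ''] - the empty cell has no font tag, so it becomes ''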

Part 5: Write Data to File

    else:
        continue
if cells_same_colspan(colspan):
    write_rows(current_table=i, filename=' header', data=header_rows['headers'])
    write_rows(current_table=i, filename=' table', data=header_rows['tdata'])

Part 6: Fix Colspan Misalignment Issues

This next section was pretty tricky for me. When cells_same_colspan() returns False, meaning there is a colspan value larger than 1, a cell alignment issue is present and must be corrected.

Create Objects to Correct Alignment

  • The nested_cs object below uses the nested_colspan_dict() function, which associates an index value with each colspan value via Python's built-in enumerate() function.
else:
    nested_cs = nested_colspan_dict(colspan)
    bad_colspans = [cs for cs in colspans_to_fix(nested_cs)]
  • The bad_colspans object is created with a list comprehension, producing a list of dictionaries that contain all the information needed to modify a given row's data points. Because colspans_to_fix is a generator function, it only yields values when iterated over or when next() is called on it.
  • As an example of a generated value by the colspans_to_fix() function, if row 1, cell 9 had a colspan value of 2, the generator would yield the object { 1: (9, 2)}.

Because the header data and table data are handled separately, headers_only extracts only the header row numbers (the dictionary keys), which will be referenced later.

headers_only = [h for h in header_rows['headers']]

Use For Loop to Correct Alignment

This last section essentially loops over the bad_colspans list and uses the information contained in it to modify the data container header_rows.

for colspan_dict in bad_colspans:
    for k, v in colspan_dict.items():
        row_index = k
        colspan_ind, colspan_val = v
  • The for loop starts by iterating over the individual dictionaries in the bad_colspans list and then iterates over the keys and values of each dictionary using .items().
  • row_index is assigned to the for loop variable k ('k' being short for 'key' and 'v' being short for 'value'). row_index will be a unique identifier because row numbers are assigned to each row, starting with the very first header row down to the very last row of data in the table.
  • Because the dictionary values are tuples, two variables are used for each tuple to unpack the tuples and assign a variable to each value in the tuple. The tuple variables are colspan_ind (colspan index value) and colspan_val (the colspan values).

Give it Some Space(s)

Continuing down the for loop, you will see two variables with + 1 and - 1. Because the script is structured so that one td tag is written to one cell in the CSV file, I gave each data point a default colspan value of 1 unless the td explicitly showed, via a "colspan" attribute, a value greater than 1. To be clear, the colspan value of 1 is only a placeholder indicating that the data point occupies exactly one cell and can be written to the CSV file as-is.

  • colspan_minus_one is the colspan_val variable minus 1. Because each data point has a default colspan value of 1, I subtract 1 from the larger-than-1 colspan values, so that the result is the number of blank cells that must be added to the right of the larger-than-1 colspan value.
  • Again, using the picture below, labeled "HTML", the header value "Other Reserve Cap" has a colspan value of 2 and the row beneath it has two td tags. In order to avoid the misalignment shown in the "CSV RESULT" table, 1 blank string is added next to "Other Reserve Cap", see the black box below, and the columns are aligned with the data.
cell align new two
  • Similarly, because the blank strings are added to the right of the larger-than-1 colspan values, I add 1 to the colspan index value. So, if a given header cell, "Interest Rate" for example, had a colspan value of 15, then 15 minus 1, or 14, empty strings would be inserted after the header value to account for all the cells underneath "Interest Rate".
  • Because the code iterates over one table at a time, there is only one set of headers and one set of table data rows per table, and each table has one continuous index, the line_count (referenced in this loop as row_index). If the row index is found in the headers_only list of index values, the header dictionary values are modified by calling the .insert() method on the headers variable to insert a blank string one position to the right of the colspan index value (index_for_insert).
  • The else statement then handles any larger-than-1 colspan values present in the table rows (the rows variable). There are no large colspan values in the actual table rows for this particular website, but I have seen them in other CMBS pages on the SEC website. A worked example follows the code below.
colspan_minus_one = colspan_val - 1
index_for_insert = colspan_ind + 1
if row_index in headers_only:
    headers = header_rows['headers'][row_index]
    for insert in range(colspan_minus_one):
        headers.insert(index_for_insert, '')
else:
    rows = header_rows['tdata'][row_index]
    for insert in range(colspan_minus_one):
        rows.insert(index_for_insert, '')
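
As a worked example on a hypothetical header row: if "Other Reserve Cap" sat at index 2 with a colspan value of 2, one blank string would be inserted at index 3, directly to its right.

headers = ['Prop', 'City', 'Other Reserve Cap', 'NOI']
colspan_ind, colspan_val = 2, 2
for insert in range(colspan_val - 1):     # colspan_minus_one = 1
    headers.insert(colspan_ind + 1, '')   # index_for_insert = 3
print(headers)  # ['Prop', 'City', 'Other Reserve Cap', '', 'NOI']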

Part 7: Write the Remaining Data to CSV

Lastly, after the cell alignment issues are taken care of, the write_rows() function is used to write the data to CSV files. Calling headers_tables() with the string arguments 'header' and 'table' will create a directory for each and move the respective files into it.

        write_rows(current_table=i, filename=' header', data=header_rows['headers'])
        write_rows(current_table=i, filename=' table', data=header_rows['tdata'])

headers_tables('header')
headers_tables('table')

There should now be 19 header CSV files and 19 table CSV files, organized into the 'header' and 'table' folders in the directory where you ran the script. The next post will combine all the CSV files and clean up the data.