End-to-End Data Analysis Project: Part 3. Join and Stack

Sonny Torres, Tue 23 April 2019, Python

python, tutorial

Contents

Recap of Part 1 & 2

After scraping all of the CMBS loan data from the HTML file downloaded from the SEC.gov website (shown in my first post) and parsing the unwieldy header data (see my previous post), I now have to join all of the table data, which is saved as a collection of CSV files, and save it as one CSV file. After joining all of the disparate pieces, the final dataset will be a single table, almost ready for analysis.

Join Tables & Stack the Headers

Below is an illustration that describes the join operations I will explain in this post.

[Image: all tables numbered]

The two main components of this data are:

  1. The Headers: The header data, once a disparate group of files, now exists as a single row in a single CSV file. The header row continues to the right for every column in the dataset.
  2. The Tables: Because the SEC web page presented one large dataset as disjointed panels arranged in a continuous online scroll, the disparate tables need to be concatenated back together both horizontally and vertically. The "Prop1, Prop2..." indices are an accurate but simplified representation of how the tables need to be coherently joined. The table data can be thought of as sections of four, as seen in the image above, continuing to the right until the end of the dataset.

The Code to Concatenate and Join the Data

import unicodedata
import pandas as pd
import re
import glob
import os

main_dir = r"C:\Users\Username\Desktop\End-to-End-Data-Analysis\1. Get the Data\table"
parsed_header_file = r"C:\Users\Username\Desktop\End-to-End-Data-Analysis\1. Get the Data\header\frame.csv"
os.chdir(main_dir)

pattern = '*table.csv'

all_table_files = [os.path.basename(file)
                   for file in glob.glob(os.path.join(main_dir, pattern))]

number_of_files = len(all_table_files)

df_top_half = []
df_bottom_half = []

def blank_rows(df):
    # only the first column ('Loan #') is inspected; it is never blank
    # for a valid row. It loads as dtype object, so cast to string first.
    df[0] = df[0].astype(str)
    # NFKD normalization turns non-breaking spaces into plain spaces
    df[0] = df[0].map(lambda x: unicodedata.normalize('NFKD', x))
    blank_match = re.compile(r'\s+')
    for index, value in list(df[0].items()):
        # fullmatch so only rows that are entirely whitespace are dropped
        if blank_match.fullmatch(value):
            df.drop(index, axis=0, inplace=True)
    df.reset_index(drop=True, inplace=True)

for num in range(0, number_of_files):
    # files are named '0 table.csv', '1 table.csv', ...
    file = str(num) + ' table.csv'
    df = pd.read_csv(file, encoding='ISO-8859-1', na_values='empty', header=None)
    blank_rows(df)
    # even-numbered files hold top-half panels, odd-numbered bottom-half
    if num % 2 == 0:
        df_top_half.append(df)
    else:
        df_bottom_half.append(df)

def join_frames(list_of_frames):
    # horizontal join: line the panels up side by side
    concat_df = pd.concat(list_of_frames, axis=1, join='inner')
    return concat_df

def final_stack_df(top, bottom, headers):
    # join each half horizontally, then stack the halves vertically
    top_tables = join_frames(top)
    bottom_tables = join_frames(bottom)
    final = pd.concat([top_tables, bottom_tables], axis=0, join='outer')
    # assign the parsed header row from Part 2 as the column names
    headers = pd.read_csv(headers, encoding='ISO-8859-1')
    final.columns = headers.columns
    return final

cmbs = final_stack_df(top=df_top_half, bottom=df_bottom_half, headers=parsed_header_file)
cmbs.to_csv('CMBS Table.csv', index=False, encoding='ISO-8859-1')

The overall code structure (i.e. many helper functions to process data in CSV files) is similar to my previous post, where I parsed the header data. Luckily, the table data is far less complicated and requires just one parsing function, blank_rows, while the rest of the script simply appends the top half of the table to one list and the bottom half to another. In the table illustration shown earlier in this post, tables 0 and 2 would be the "top half" and tables 1 and 3 would be the "bottom half".

Because I have elaborated on similar code in Posts 1 & 2, I will break down the functionality of the code starting with the blank_rows function.

Header vs. "Not Header"

In my first post, the code was structured so that, with carefully considered criteria, a single header row could be identified and distinguished from other rows. All other rows that didn't meet the special header criteria were considered table data and organized as such. However, the HTML file contains a few rows which are neither header nor table data. Had my code been built to recognize rows in this third category (IS NOT HEADER does not necessarily mean IS TABLE DATA), I would not have to delete blank rows from the table data CSV files.

The blank lines are a result of hierarchical indexing that occurs in some of the header data, where there is a hierarchical index, then a blank row, and then the header data (see image below).

[Image: blank row]

So the goal of the blank_rows function is to identify whether a row is blank (in practice, only the first row of a file can be). If the first row is blank, that means the not-header/blank row in the image above was saved to a table data CSV as if it were indeed table data. Due to some encoding inconsistencies, covered in Part 2, even finding a blank row requires a little more work than a binary True or False.

  1. Blank rows are not actually empty. Due to encoding differences between Python's default encoding (UTF-8) and the website's encoding (Windows-1252), the "blank" rows actually contain a character called a non-breaking space.
  2. In order to identify the non-breaking space character as a "blank string", the data must be normalized to Unicode.
  3. In order to normalize the data, the data type must be a string.

The 3 facts above, as well as the fact that the very top left element/cell of every table is consistent and will never be blank for a valid table row, were considered when making the blank_rows function.
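To see the encoding problem concretely, here is a minimal snippet (illustrative only, not part of the pipeline) showing that a "blank" cell really holds a non-breaking space and that NFKD normalization turns it into a plain space:

import unicodedata

cell = '\xa0'  # non-breaking space, as scraped from the Windows-1252 page
print(cell == ' ')                                 # False
print(unicodedata.normalize('NFKD', cell) == ' ')  # True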

blank_rows function

The blank_rows function is shown again below:

def blank_rows(df):
    # only the first column ('Loan #') is inspected; it is never blank
    # for a valid row. It loads as dtype object, so cast to string first.
    df[0] = df[0].astype(str)
    # NFKD normalization turns non-breaking spaces into plain spaces
    df[0] = df[0].map(lambda x: unicodedata.normalize('NFKD', x))
    blank_match = re.compile(r'\s+')
    for index, value in list(df[0].items()):
        # fullmatch so only rows that are entirely whitespace are dropped
        if blank_match.fullmatch(value):
            df.drop(index, axis=0, inplace=True)
    df.reset_index(drop=True, inplace=True)

The blank_rows function accepts a dataframe through the df parameter and then immediately indexes the first column of the dataframe, converts its datatype to string and maps unicodedata.normalize over each element in the series (column). Again, indexing only the first column is necessary because this column, called 'Loan #', will never be blank for a valid row. An important note: the first column can only be accessed with the integer 0 because the dataframe was loaded with the pd.read_csv parameter header set to None, meaning the column names default to integers.
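As a quick illustration (using a made-up two-column sample rather than the real data), loading a CSV with header=None leaves the columns labeled with integers:

import pandas as pd
from io import StringIO

# hypothetical sample standing in for one scraped table file
sample = StringIO("1,Some Plaza\n2,Other Center")
df = pd.read_csv(sample, header=None)

print(df.columns.tolist())  # [0, 1] - integer column labels
print(df[0])                # the 'Loan #' column, accessed as df[0]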

Iterate Table Files

Running the code from Part 1 should have produced a directory containing all of the table files, like the image below.

[Image: all table files]

The code below iterates over each file in the directory, appending the even-numbered files to the df_top_half list and the odd-numbered files to the df_bottom_half list.

for num in range(0, number_of_files):
    # files are named '0 table.csv', '1 table.csv', ...
    file = str(num) + ' table.csv'
    df = pd.read_csv(file, encoding='ISO-8859-1', na_values='empty', header=None)
    blank_rows(df)
    # even-numbered files hold top-half panels, odd-numbered bottom-half
    if num % 2 == 0:
        df_top_half.append(df)
    else:
        df_bottom_half.append(df)
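If you want to sanity-check the split before running the full loop, a quick preview (purely illustrative) shows which half each file lands in:

# illustrative only: preview the even/odd assignment for the first six files
for num in range(6):
    half = 'top' if num % 2 == 0 else 'bottom'
    print(str(num) + ' table.csv -> ' + half)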

Join and Stack

Finally, now that the tools for processing the table files are ready, we can concatenate and stack all of the tables. Again, refer to the image presented earlier.

[Image: all tables numbered]

The lists df_top_half and df_bottom_half hold the dataframes for the top and bottom halves of the large table we are building. join_frames, the function below, horizontally joins the dataframes in each list into a single dataframe: tables 0, 2, 4, 6, 8... become one table and tables 1, 3, 5, 7, 9... become another.

The function below concatenates the tables this way. axis=1 ensures that the tables are concatenated along axis 1, which can be thought of as the X or horizontal axis, while axis 0 can be thought of as the Y or vertical axis. See this Stack Overflow post for more.

def join_frames(list_of_frames):
    # horizontal join: line the panels up side by side
    concat_df = pd.concat(list_of_frames, axis=1, join='inner')
    return concat_df
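A toy example (with made-up frames, not the CMBS data) makes the axis=1 behavior concrete:

import pandas as pd

# two panels with the same rows but different columns
left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [3, 4]})

wide = pd.concat([left, right], axis=1, join='inner')
print(wide)
#    a  b
# 0  1  3
# 1  2  4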

The next function below applies join_frames to each half, stacks the top and bottom halves together, and assigns the parsed header row from Part 2 as the column names.

def final_stack_df(top, bottom, headers):
    # join each half horizontally, then stack the halves vertically
    top_tables = join_frames(top)
    bottom_tables = join_frames(bottom)
    final = pd.concat([top_tables, bottom_tables], axis=0, join='outer')
    # assign the parsed header row from Part 2 as the column names
    headers = pd.read_csv(headers, encoding='ISO-8859-1')
    final.columns = headers.columns
    return final
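One assumption baked into final_stack_df is that the horizontal joins produce exactly one data column per parsed header; otherwise the final.columns assignment raises an error. A defensive check (my addition, not in the original script) could be placed just before that assignment:

# hypothetical guard, placed inside final_stack_df before assigning columns
assert len(final.columns) == len(headers.columns), (
    'column count mismatch: ' + str(len(final.columns)) + ' data columns vs '
    + str(len(headers.columns)) + ' headers')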

Create the Final Dataframe

The final dataframe is created by calling the final_stack_df function, and the result is saved to a CSV.

cmbs = final_stack_df(top=df_top_half, bottom=df_bottom_half, headers=parsed_header_file)
cmbs.to_csv('CMBS Table.csv', index=False, encoding='ISO-8859-1')
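As a final, optional sanity check (not in the original script), the saved file can be read back to confirm its shape and headers:

# optional: reload the combined file and inspect it
check = pd.read_csv('CMBS Table.csv', encoding='ISO-8859-1')
print(check.shape)
print(list(check.columns)[:5])  # first few header names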

In the next post, I will refine and clean up the final dataset so that it is ready for analysis.