End-to-End Data Analysis Project: Part 4. More Cleanup!?!

Sonny Torres, Sat 04 May 2019, Python

python, tutorial

Contents

Recap of Part 3

The CMBS loan data has been scraped, the headers have been parsed, and the tables and headers have been joined, so we can now refine the final dataset that will be used for analysis. Thanks to the meticulous extraction and parsing of the data thus far, only minor refinements are left, most of them matters of preference and convenience.

Return of the Headers

After an entire post dedicated to parsing header data, there are still a few ways the headers could be better. It is important to be able to access columns as conveniently as possible, since indexing by column name is one of the most common operations when working with data in the Pandas library. Below are the minor changes I will make to the headers and, in some cases, entire columns.

  1. Fix inconsistent spacing in the header string values.
  2. Delete unnecessary symbols and parentheses.
  3. Delete duplicate columns.
  4. Delete columns that are completely empty.

Weapon of Choice

Since much of the cleaning I want to do involves locating and parsing string patterns, I will use Regular Expressions in conjunction with the Pandas library.

The Code

Below is the code I will cover in this post.

import pandas as pd
import os
import re

main_dir = r'C:\Users\Username\Desktop\End-to-End-Data-Analysis\1. Get the Data\table'
file = 'CMBS Table.csv'

os.chdir(main_dir)

cmbs = pd.read_csv(file, encoding='ISO-8859-1')

# Delete duplicate Loan #, Seller & Property Name columns
loan_seller_cols = [val for val in cmbs.columns.values if re.search(r'(^Loan\s#|^Seller|^Property\sName)', val)][3:]

for col in loan_seller_cols:
    cmbs.drop(columns=col, inplace=True)

# Regex to edit headers
regex_dict = {r'_\d': '', r'\(.+\)+': '', '#': '', '%': '', r'\/': '', r'\s\s+': ' ', r'^\s+': '', r'\s+$': ''}

for key, value in regex_dict.items():
    cmbs.columns = [re.sub(key, value, col) for col in cmbs.columns]

# Delete columns whose values are all non-breaking spaces (i.e. empty)
for col in list(cmbs.columns.values):
    try:
        if cmbs[col].str.normalize('NFKD').str.match('  ').all():
            cmbs.drop(columns=col, inplace=True)
    except AttributeError:
        continue

cmbs.to_csv('CMBS Final.csv', index=False, encoding='ISO-8859-1')

The Headers: Before

Loading the CSV file produced in Post 3 and calling print(cmbs.columns.values) outputs the following:

['Loan #' 'Seller(1)' 'Property Name' 'Street Address' 'City' 'State'
 'Zip Code' 'County' 'Number of Properties' 'Property_2 Type'
 'Property_9 Subtype' 'Year Built' 'Year Renovated' 'Loan #.1'
 'Seller(1).1' 'Property Name.1' 'Units(2)' 'Unit of  Measure'
 'Occupancy %' 'Occupancy Date' 'Appraised Value ($)(3)'
 'Appraisal Date(3)' 'Current_4 LTV %(3)' 'Original Balance ($)(4)_7'
 'Original Balance_4 per Unit ($)_9' 'Current_9 Balance ($)(4)_1'
 'Current Balance_0 per Unit ($)_8' '% of Initial Pool Balance'
 'Crossed Loan' 'Related Borrower(5)' 'Interest Rate %(6)_9'
 'Admin. Fee %(6)' 'Net Mortgage Rate %(6)_2' 'Accrual Type' 'Loan #.2'
 'Seller(1).2' 'Property Name.2' 'Monthly Debt Service ($)(7)'
 'Annual Debt Service ($)(8)' 'Note Date' 'First Payment Date(9)'
 'Partial IO Last IO Payment' 'Partial IO Loan First P&I Payment'
 'Rem._3  Term(9)' 'Rem._8  Amort' 'I/O Period(9)(10)' 'Seasoning'
 'Payment Due Date' 'Grace Period_4  (Late Payment)'
 'Grace Period_4  (Default)' 'Maturity Date' 'ARD Loan(11)'
 'Final Mat Date(11)' 'Maturity/ARD Balance ($)(4)' 'Maturity LTV %(3)'
 'Loan #.3' 'Seller(1).3' 'Property Name.3'
 'Prepayment Provision (Payments)(9)(12)(13)' '2008_7 Revenues ($)_9'
 '2008_0 Total Expenses ($)_5' '2008_4 NOI ($)_5' '2009_2 Revenues ($)_2'
 '2009_3 Total Expenses ($)_9' '2009_0 NOI ($)_8' '2010_3 Revenues ($)_7'
 '2010_5 Total Expenses ($)_6' '2010_6 NOI ($)_0'
 'Most Recent_3  Revenues ($)' 'Most Recent_1  Total Expenses ($)'
 'Most Recent_9  NOI ($)(14)' 'As of' 'Loan #.4' 'Seller(1).4'
 'Property Name.4' 'UW Economic Occupancy %' 'UW_8 Revenues ($)'
 'UW Total Expenses ($)' 'UW NOI ($)(14)' 'UW_8  Capital Items ($)'
 'UW NCF ($)(14)' 'UW_7  NOI DSCR(14)' 'UW NCF_7 DSCR(14)'
 'UW NOI Debt Yield %' 'UW NCF_5  Debt Yield %' 'Title Type'
 'Ground Lease_9  Expiration' 'Ground Lease_6  Extension Terms' 'PML %'
 'Loan #.5' 'Seller(1).5' 'Property Name.5' 'Upfront Capex Reserve ($)_6'
 'Upfront Engin. Reserve ($)_2' 'Upfront Envir.  Reserve ($)'
 'Upfront TI/LC Reserve ($)_4' 'Upfront RE Tax Reserve ($)_8'
 'Upfront Ins. Reserve ($)_7' 'Upfront Other Reserve ($)_3' 'Unnamed: 99'
 'Monthly Capex Reserve ($)_0' 'Monthly Envir. Reserve ($)_0'
 'Monthly TI/LC Reserve ($)_2' 'Monthly RE Tax Reserve ($)_3'
 'Monthly Ins. Reserve ($)_2' 'Monthly Other Reserve ($)_0' 'Loan #.6'
 'Seller(1).6' 'Property Name.6' 'Capex  Reserve Cap ($)_3'
 'Envir.  Reserve Cap ($)_8' 'TI/LC  Reserve Cap ($)_3'
 'RE Tax  Reserve Cap ($)_7' 'Insur.  Reserve Cap ($)_5'
 'Debt Service  Reserve Cap ($)_3' 'Other  Reserve Cap ($)_8'
 'Unnamed: 10_level_1' 'Single Tenant' 'Largest Tenant' 'Unit Size_8'
 'Lease_6 Expiration_3' 'Unnamed: 121' '2nd Largest Tenant' 'Unit Size_9'
 'Lease_5 Expiration_1' 'Loan #.7' 'Seller(1).7' 'Property Name.7'
 '3rd Largest Tenant' 'Unit Size_4' 'Lease_9 Expiration_2' 'Unnamed: 131'
 '4th Largest Tenant' 'Unit Size_5' 'Lease_3 Expiration_8'
 "( '     ' ,   '     ' ) . 1" '5th Largest Tenant' 'Unit Size_6'
 'Lease_5 Expiration_3' 'Loan Purpose' 'Loan #.8' 'Seller(1).8'
 'Property Name.8' 'Principal / Carveout Guarantor(20)' 'Lockbox_1  (Y/N)'
 'Lockbox_3  Type(21)' 'Additional Debt_2  Permitted (Y/N)'
 'Additional Debt_0  Exist (Y/N)(22)' 'Additional Debt_7  Amount ($)(22)'
 'Additional Debt_6  Type(22)' 'Total Debt_5 Current Balance ($)'
 'Total Debt_9  UW NCF  DSCR' 'Total Debt Current LTV %'
 'Total Debt_0  UW NOI  Debt Yield %' 'Loan #_3' 'Seller(1).9'
 'Property Name.9' '2008_0 Occupancy %_8' '2008_4  ADR ($)_9'
 '2008_2  RevPAR ($)_8' '2009_9  Occupancy %_9' '2009_9  ADR ($)_7'
 '2009_8  RevPAR ($)_6' '2010_3  Occupancy %_0' '2010_1  ADR ($)_6'
 '2010_6  RevPAR ($)_4' 'Most Recent_2  Occupancy %_3'
 'Most Recent_5  ADR ($)_6' 'Most Recent_1  RevPAR ($)_6'
 'UW_1 Occupancy %_0' 'UW_8  ADR ($)_9' 'UW_5  RevPAR ($)_8' 'Loan #_7']

There are a number of duplicate columns, numbers appended to column names, and other messy symbols that can be deleted. Removing the various symbols and duplicate columns is relatively easy using Regular Expressions while iterating over each header.

Delete Duplicate Columns

The duplicate columns stem from the fact that the data source was arranged such that each table of data always contained three specific columns (Loan #, Seller & Property Name). Rather than deleting each column individually, I created a list of the duplicate column names to iterate over and delete.

loan_seller_cols = [val for val in cmbs.columns.values if re.search(r'(^Loan\s#|^Seller|^Property\sName)', val)][3:]

The list comprehension above builds a list by iterating over every column in the dataframe and keeping the names that match one of the three repeated columns. The Regular Expression pattern is a group of three individual patterns separated by a vertical bar, |, meaning "or". So if a given column starts with "Loan #", "Seller" or "Property Name" (the caret, ^, anchors each pattern to the beginning of the string), re.search returns a match and the column name is kept. An important note is that I did not want to delete every column that matched this pattern: indexing the list with [3:] skips the first three matches, because those first occurrences of "Loan #", "Seller" and "Property Name" are the good columns I want to keep. The sketch below illustrates the pattern and the slice; after that, a for loop deletes the duplicates.
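Here is that sketch, run on a hypothetical handful of column names (sample_cols is invented for illustration):

import re

# Hypothetical sample of column names, mimicking the raw headers
sample_cols = ['Loan #', 'Seller(1)', 'Property Name', 'City',
               'Loan #.1', 'Seller(1).1', 'Property Name.1']

matches = [c for c in sample_cols if re.search(r'(^Loan\s#|^Seller|^Property\sName)', c)]
print(matches[3:])  # the [3:] slice skips the three columns to keep
# ['Loan #.1', 'Seller(1).1', 'Property Name.1']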

for col in loan_seller_cols:
    cmbs.drop(columns=col, inplace=True)

Passing each name to the columns parameter drops that column, and inplace=True modifies the DataFrame directly rather than returning a new copy.
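Incidentally, drop also accepts a list of labels, so the loop could be collapsed into a single, equivalent call:

# Equivalent to the loop above: drop all duplicate columns in one call
cmbs.drop(columns=loan_seller_cols, inplace=True)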

Clean Up Headers (again)

There are many ways to use Regular Expressions creatively for a given task. I used a dictionary of Regular Expression patterns, with the patterns to match as the dictionary keys and the strings to replace each match with as the values. I got the idea from code I found online and made a more elementary version of it.

regex_dict = {r'_\d': '', r'\(.+\)+': '', '#': '', '%': '', r'\/': '', r'\s\s+': ' ', r'^\s+': '', r'\s+$': ''}

Each regex pattern above is explained below.

  1. _\d matches an underscore followed by a digit (e.g. _7).
  2. \(.+\)+ matches anything wrapped in parentheses, like (1) or ($)(4).
  3. #, % and \/ match those literal symbols.
  4. \s\s+ matches two or more consecutive whitespace characters.
  5. ^\s+ and \s+$ match leading and trailing whitespace, respectively.

Now I can iterate over regex_dict, passing each key (the regex pattern) and value (the replacement string) to re.sub:

for key, value in regex_dict.items():
    cmbs.columns = [re.sub(key, value, col) for col in cmbs.columns]

Each pass of the for loop applies one pattern to every column name, so by the time the loop finishes, every regex pattern in the dictionary has been located in the headers and replaced. All of the replacement values are empty strings except the one for \s\s+, which is replaced by a single space: I want to preserve spaces, I just don't want inconsistent spacing.
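To see the cumulative effect, here is a sketch that traces one of the raw headers from the output above, 'Current_4 LTV %(3)', through every substitution in the dictionary:

import re

regex_dict = {r'_\d': '', r'\(.+\)+': '', '#': '', '%': '', r'\/': '',
              r'\s\s+': ' ', r'^\s+': '', r'\s+$': ''}

header = 'Current_4 LTV %(3)'  # a raw header from the output above
for pattern, replacement in regex_dict.items():
    header = re.sub(pattern, replacement, header)

print(header)
# Current LTV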

Delete Columns with All NA Values

As covered in my second post, the data I scraped uses the Windows-1252 encoding, while Python's default encoding is UTF-8. So, in order to recognize columns that are entirely filled with "NA" values, the column values must first be normalized for Unicode.
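To see what that normalization actually does, here is a quick sketch of NFKD normalization applied to a pair of non-breaking space characters (\xa0), the characters filling the "empty" cells:

import unicodedata

nbsp_cell = '\xa0\xa0'  # two non-breaking spaces, as in the "empty" cells
print(repr(unicodedata.normalize('NFKD', nbsp_cell)))
# '  '  <- two ordinary spaces after normalization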

for col in list(cmbs.columns.values):
    try:
        if cmbs[col].str.normalize('NFKD').str.match('  ').all():
            cmbs.drop(columns=col, inplace=True)
    except AttributeError:
        continue

Because the "NA" values I am looking for are not actually NA values but non-breaking space characters which, when normalized for Unicode, appear as two single spaces '  ', the columns I am looking to delete are completely filled with these double-spaced strings. The Pandas library has this really cool feature called "Method Chaining" where you can apply a series of transformations to a dataset by "chaining" multiple Class methods together. So, by calling .str.normalize('NFKD').str.match('  ').all() on a given column, I can normalize the column to Unicode, search for string matches and produce a True if all of the column values match the double-space string pattern. The for loop is organized by the try / except Python control flow because not all column values are loaded as strings by default and Python produces an AttributeError when a program tries to normalize a non-string object.

The Headers: After

Now calling print(cmbs.columns.values) will produce:

['Loan' 'Seller' 'Property Name' 'Street Address' 'City' 'State'
 'Zip Code' 'County' 'Number of Properties' 'Property Type'
 'Property Subtype' 'Year Built' 'Year Renovated' 'Units'
 'Unit of Measure' 'Occupancy' 'Occupancy Date' 'Appraised Value'
 'Appraisal Date' 'Current LTV' 'Original Balance'
 'Original Balance per Unit' 'Current Balance' 'Current Balance per Unit'
 'of Initial Pool Balance' 'Crossed Loan' 'Related Borrower'
 'Interest Rate' 'Admin. Fee' 'Net Mortgage Rate' 'Accrual Type'
 'Monthly Debt Service' 'Annual Debt Service' 'Note Date'
 'First Payment Date' 'Partial IO Last IO Payment'
 'Partial IO Loan First P&I Payment' 'Rem. Term' 'Rem. Amort' 'IO Period'
 'Seasoning' 'Payment Due Date' 'Grace Period' 'Grace Period'
 'Maturity Date' 'ARD Loan' 'Final Mat Date' 'MaturityARD Balance'
 'Maturity LTV' 'Prepayment Provision' '2008 Revenues'
 '2008 Total Expenses' '2008 NOI' '2009 Revenues' '2009 Total Expenses'
 '2009 NOI' '2010 Revenues' '2010 Total Expenses' '2010 NOI'
 'Most Recent Revenues' 'Most Recent Total Expenses' 'Most Recent NOI'
 'As of' 'UW Economic Occupancy' 'UW Revenues' 'UW Total Expenses'
 'UW NOI' 'UW Capital Items' 'UW NCF' 'UW NOI DSCR' 'UW NCF DSCR'
 'UW NOI Debt Yield' 'UW NCF Debt Yield' 'Title Type'
 'Ground Lease Expiration' 'Ground Lease Extension Terms' 'PML'
 'Upfront Capex Reserve' 'Upfront Engin. Reserve' 'Upfront Envir. Reserve'
 'Upfront TILC Reserve' 'Upfront RE Tax Reserve' 'Upfront Ins. Reserve'
 'Upfront Other Reserve' 'Monthly Capex Reserve' 'Monthly Envir. Reserve'
 'Monthly TILC Reserve' 'Monthly RE Tax Reserve' 'Monthly Ins. Reserve'
 'Monthly Other Reserve' 'Capex Reserve Cap' 'TILC Reserve Cap'
 'Single Tenant' 'Largest Tenant' 'Unit Size' 'Lease Expiration'
 '2nd Largest Tenant' 'Unit Size' 'Lease Expiration' '3rd Largest Tenant'
 'Unit Size' 'Lease Expiration' '4th Largest Tenant' 'Unit Size'
 'Lease Expiration' '5th Largest Tenant' 'Unit Size' 'Lease Expiration'
 'Loan Purpose' 'Principal Carveout Guarantor' 'Lockbox' 'Lockbox Type'
 'Additional Debt Permitted' 'Additional Debt Exist'
 'Additional Debt Amount' 'Additional Debt Type'
 'Total Debt Current Balance' 'Total Debt UW NCF DSCR'
 'Total Debt Current LTV' 'Total Debt UW NOI Debt Yield' '2008 Occupancy'
 '2008 ADR' '2008 RevPAR' '2009 Occupancy' '2009 ADR' '2009 RevPAR'
 '2010 Occupancy' '2010 ADR' '2010 RevPAR' 'Most Recent Occupancy'
 'Most Recent ADR' 'Most Recent RevPAR' 'UW Occupancy' 'UW ADR'
 'UW RevPAR']

The duplicate columns are gone, the annoying symbols are gone, and the data is generally much cleaner. Next, I will... actually begin analyzing the data.