Wrestling Text with “the Champ”

One thing I have learned about learning is that it’s often a random, messy process, full of unexpected twists and turns. I set out toward a particular goal, intent on learning every single step required to get to that goal, but then some unexpected obstacle trips me up on my way and I become tempted to give up in frustration. Then, to my surprise and wonder, the obstacle itself becomes a learning experience, and I attain valuable knowledge I wouldn’t have discovered if it had been 100% up to me to find it. This new knowledge may cause me to reconsider my goal and alter it, or even scrap it altogether and set a new one. Those who have followed this blog have seen this process first hand. With that in mind, a funny thing happened on my way to completion of my latest cool app. I found my quest unexpectedly distracted by that noisy, demanding child whose name is Immediate Practical Need.

At my job, every week I have to upload a very precisely formatted text file to our third-party retirement plan administrator. The administrator tracks this data for us to ensure compliance with federal regulations and reporting requirements. The text file includes weekly payroll information, such as employee names, gross pay and retirement plan contribution amounts. Each field must be an exact number of spaces and aligned just right in order to upload correctly.

Currently, I produce this file using Microsoft Access 2010. Access has an option called “Saved Exports” that I use to export the payroll data to a text file. Back when I set this process up I didn’t know much about MS Access’s built-in programming language (Visual Basic for Applications, aka VBA), so all the precise spacing required for the text column is “hard wired” into the report object. Should the plan administrator ever change their file format, I’d have to change the report manually. Another problem is that Access’s “Saved Exports” can’t be edited without some high-end hacking, so if the directory where I need to save the file changes, or I have to move my applications and data to a new PC, I’d have to re-do the whole “Saved Export.” I could just re-do this whole procedure using VBA code, but I decided I’d prefer to use this as an opportunity to learn more Python! Result: “CsvChamp.”

“CsvChamp” is a tool set of Python functions I designed to import, format, display and export data from a comma-separated (CSV) file. At first glance, it’s obvious none of these require Python to accomplish. The same results could be produced using a spreadsheet app. Their power, like everything in Python, lies in their programmability and flexibility. The heart of this tool set is the function GetCsvData. It reads content from a CSV file into a Python dictionary. It adds a unique integer row ID to each row read from the file, and these row IDs become the keys in the dictionary and the rows of data are stored as list object values. Once the data is in a Python dictionary, the sky is the limit, as each and every item can be accessed using code. In this way, code can be written to sort, display, add, delete, summarize and so on. Not only does this have an immediate practical use to me on the job, I can also see this applying to my database management goals. Here’s a summary of what each tool does. See the comments in the code for more details.

CsvFilePicker uses good old Tkinter to prompt the user to select a CSV file.

GetCsvData reads the CSV data into a dictionary, so a line like this:
Doe,John,456 Any St.,Anywhere,TX,45678,(997) 777-7878,john@doeadeer.com

is stored like this:
{1:[‘Doe’,’John’,’456 Any St.’,’Anywhere’,’TX’,’45678′,'(997) 777-7878′,’john@doeadeer.com’]}

GetCsvCols reads the column names from the first line of the CSV file and stores them as a list, so this:
Last,First,Address,City,State,Postal Code,Phone,Email

is stored like this:
[‘Last’,’First’,’Address’,’City’,’State’,’Postal Code’,’Phone’,’Email’]

CreateColSpec examines each column in a dictionary produced by Get CsvData to determine how to align and space the columns for display or export. It adds one space to the maximum with of each column, and defaults to left-alignment of each column.

FormatCsvCol takes a column of data and adds spaces before and/or after it — depending on the specifications read from the dictionary produced by CreateColSpec — so it will align correctly when displayed or output.

Here’s some sample input and output:

Messy, ugly, mean CSV file:
CsvChampBefore

Clean, pretty, friendly output file:
CsvChampAfter

Of course, all of this is 100% flexible. Column specifications can be find-tuned by individual column, so some could be left-aligned and others right-aligned. We don’t have to use the column names in the CSV file, nor do we even have to have the CSV file provide column names at all; we could easily pick our own. We could change the display options so that only selected columns are displayed. The surface has barely been scratched! Feel free to try it yourself. Enjoy!

CSV champ code below. Sorry for the long lines. I haven’t been able to get wrapping to work right yet.

def CsvFilePicker():
    """Prompts user to select a file, using the GUI file dialog. Returns file path.
    """
    # Use csv & the Tkinter graphical file dialog.
    import csv
    import Tkinter, tkFileDialog

    # Asks the user to select a file to load
    root = Tkinter.Tk()
    root.withdraw()
    file_path = tkFileDialog.askopenfilename()

    return file_path

def GetCsvData(file_path):
    """Reads data from a CSV file and returns a dictionary of the file's data, assigning each row a unique integer row ID.
    """
    import csv
    # Copy the contents of the file into a dictionary.
    with open(file_path,"rb") as csvData:

        FileReader = csv.reader(csvData)

        # FileData = the dictionary to hold the data.
        FileData = {}

        # Counts number of rows in file, to use for row IDs.
        row_count = 1
        for row in FileReader:
            row_id = row_count

            # Add the row to the dictionary, assigning the
            # row ID as its key.
            FileData[row_id] = row

            row_count = row_count+1

    return FileData

def GetCsvCols(FileData):
    """Reads column names from the first value of a dictionary created by GetCSVdata.
    """
    col_names = []
    for item in FileData[1]:
        col_names.append(item)

    return col_names

def CreateColSpec(FileData):
    col_spec = {}

    #Iterate through each column
    for c in range(0, len(FileData[1])):

        #Set the maximum with to the column name
        max = len(file[1])

        #Check each row in the col
        for r in FileData:

            #If the current row in the column > the
            #current maximum, store it.
            if len(FileData[r]) > max:
                max = len(FileData[r])

        #Specify the width of the column to be 1
        #space more than the width of its longest item.
        col_spec[FileData[1]] = [max+1,"L"]

    return col_spec

def ListCsvCols(col_names):
    """List and count the columns."""
    for item in range(1, len(col_names)+1):
        print str(item)+": " + col_names[item-1]

def FormatCsvCol(width,align,cont):
    """Returns a column of data with content aligned
    according to these specifications:

    width = total number of characters allotted to the column including leading and/or trailing spaces.

    align = "L" (left), "C" {center} or "R" (right).

    cont = content (the actual data to be displayed).

    """

    # The total number of empty spaces in the column.
    spaces = width - len(cont)

    if align == "L":
        # Place all the spaces after the content
        column = cont + (spaces * " ")

    elif align == "R":
        # Place all the spaces before the content.
        column = spaces * " " + cont

    elif align == "C":
        """Divide the number of spaces by 2 and round down to the nearest integer. Put the content between this number of spaces, and the rest.
        """
        sp_before = int(spaces/2) * " "
        sp_after = (spaces - sp_before) * " "
        column = sp_before + cont + sp_after

    return column

def DisplayCsvCols(FileData, col_spec):
    """Uses FormatCsvCol to display the data in FileData.
    Requires col_spec.
    """
     #Get the column names from the 1st row.
    columns = GetCsvCols(FileData)

    #Step through FileData in row ID order.
    for row_id in range(1,len(FileData)+1):
        disp_row = ""

        #Step through each column in a row
        for col_num in range(0,len(columns)):
            disp_row = disp_row + FormatCsvCol(col_spec[columns[col_num]][0],col_spec[columns[col_num]][1],FileData[row_id][col_num])

        print disp_row

def OutputCsv(FileData, output_spec):
    """Uses FormatCsvCol to output the data in FileData to a space-delimited text file. output_spec = a dictionary providing column output specifications.
    """
    import os

    # Edit path and name as needed.
    output_path = "D:\Dropbox\projects\CsvChamp"

    file_name = "output.txt"

	#Get the column names from the 1st row.
    columns = GetCsvCols(FileData)

    with open(os.path.join(output_path,file_name),"w") as OutputFile:

        #Step through FileData in row ID order.
            for row_id in range(1,len(FileData)+1):
                    disp_row = ""

                    #Step through each column in a row
                    for col_num in range(0,len(columns)):
                            disp_row = disp_row + FormatCsvCol(col_spec[columns[col_num]][0],col_spec[columns[col_num]][1],FileData[row_id][col_num])

                    disp_row = disp_row + "\n"

                    OutputFile.writelines(disp_row)

"""Sample CsvChamp usage. We need to grab data from a messy, mean, ugly CSV file and export it as a nice, neat, clean, friendly text file.
"""

#Prompt the user to select a file.
file = GetCsvData(CsvFilePicker())

#Determine column width and alignment for output.
col_spec = CreateColSpec(file)

#Create the output file.
OutputCsv(file, col_spec)