csv.Sniffer gets the sniffles

Since I regularly work with both comma delimited and tab delimited files, I’d like for my Quiffer project to be able to automatically detect if a csv file is tab delimited, and if it is, specify tab (‘\t’) as the delimiter when reading it using Python’s csv.reader class.

Here’s a sample of tab delimited data — an Amazon transaction file:

Transaction Summary for December 1, 2016 to December 8, 2016
Transactions: 3

Date Order ID SKU Transaction type Payment Type Payment Detail Amount Quantity Product Title
Dec 3, 2017 114-5190104 DI-2QQ3-4X5H Order Payment Amazon fees Shipping commission $-0.50 Disney's 2012 Lady and the Tramp Sketchbook Ornament
Dec 3, 2017 114-5190104 DI-2QQ3-4X5H Order Payment Product charges $29.86 1 Disney's 2012 Lady and the Tramp Sketchbook Ornament
Dec 3, 2017 114-5190104 DI-2QQ3-4X5H Order Payment Other Shipping $3.99 Disney's 2012 Lady and the Tramp Sketchbook Ornament

I need Quiffer to be able to parse each row into a list, like so:

['Date', 'Order ID', 'SKU', 'Transaction type', 'Payment Type', 'Payment Detail', 'Amount', 'Quantity', 'Product Title']
['Dec 3, 2017', '114-5190104', 'DI-2QQ3-4X5H', 'Order Payment', 'Amazon fees', 'Shipping commission', '$-0.50', '', "Disney's 2012 Lady and the Tramp Sketchbook Ornament"]
['Dec 3, 2017', '114-5190104', 'DI-2QQ3-4X5H', 'Order Payment', 'Product charges', '', '$29.86', '1', "Disney's 2012 Lady and the Tramp Sketchbook Ornament"]
['Dec 3, 2017', '114-5190104', 'DI-2QQ3-4X5H', 'Order Payment', 'Other', 'Shipping', '$3.99', '', "Disney's 2012 Lady and the Tramp Sketchbook Ornament"]

Of course, the above is the default behavior when a comma delimited file is fed to csv.reader without a delimiter specified, as in the following code:

import csv

src_file = 'sample_data/comma_sample.csv'

with open(src_file, newline='', encoding='utf-8') as csv_file:
    data = list(csv.reader(csv_file))

When a tab-delimited file is opened with this code, however, things get messy:

[['Transaction Summary for December 1', ' 2016 to December 8', ' 2016'],
['Transactions: 3'],
['Date\tOrder ID\tSKU\tTransaction type\tPayment Type\tPayment Detail\tAmount\tQuantity\tProduct Title'],
['Dec 3', " 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tAmazon fees\tShipping commission\t$-0.50\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"],
['Dec 3', " 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tProduct charges\t\t$29.86\t1\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"],
['Dec 3', " 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tOther\tShipping\t$3.99\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]]

Since csv.reader is expecting a comma delimited file, it splits each row in two at the comma after the month in the Date column.  The rest of the columns are mashed together in a single string, with ‘\t’ in between.

I’d rather not code two different versions of Quiffer for different types of files, with the only difference being the ‘delimiter’ specification in csv.reader.  Surely, I thought, there must be a way for Python to examine a file and determine how its fields are delimited, and parse it accordingly.

A while back I discovered a class in the Python csv module called csv.Sniffer.  My understanding is that Sniffer could figure out if a file was tab delimited or comma delimited, and provide this information to csv.reader in the ‘dialect’ parameters.  I wrote the following to test it with the Amazon sample file:

import csv

def sniff_test(file):
    with open(file, newline='') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))

        reader = csv.reader(csvfile, dialect)

        return [row for row in reader]

This was the output:

['Transaction Summary for December 1, 2016 to December 8, 2016']
['Transactions: 3']
['Date\tOrder ID\tSKU\tTransaction type\tPayment Type\tPayment Detail\tAmount\tQuantity\tProduct Title']
["Dec 3, 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tAmazon fees\tShipping commission\t$-0.50\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]
["Dec 3, 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tProduct charges\t\t$29.86\t1\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]
["Dec 3, 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tOther\tShipping\t$3.99\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]

I tried again, using a list of delimiters, as suggested in How to use csv.Sniffer for 2 different CSV-types? and the code threw a Could not determine parameter error. I did some more experimenting, and to make a long story short, I found that what was giving Sniffer a stuffy nose was the inconsistent layout of the Amazon file.  Note that the first three lines are like a report header, and do not contain any delimiters.  I removed these three lines and made a copy of the file without them, and Sniffer worked.  Nevertheless, I don’t want to have to physically alter my files before parsing them with Quiffer. Solution: have Quiffer read the file a line at a time and look for tabs, and if tabs are present, open as a tab delimited file.  Here’s the latest, tested version:

import csv

def read_data(src_file):
    Import rows of data from a csv file into a list of rows, where each row is
     a list containing column values.

    with open(src_file, newline='', encoding='utf-8') as csv_file:

        # Check the file for tabs.
        line = csv_file.readline()

        while line != "":
            if '\t' in line:

                # tab was found, so read it as csv with delimiter specified

                return list(csv.reader(csv_file, delimiter='\t'))
            line = csv_file.readline()

        # if no tabs present, read as a normal csv

        return list(csv.reader(csv_file))

Reading CSV data

Last week I ran across this article on Real Python on how to code a simple CSV file reader, using test driven development, and work with data from the files.  For a long time I’ve understood the value of TDD but wasn’t quite sure how to apply it meaningfully to writing code involving data from external files.  This article helped tremendously — and just in time for me to write code for the very first step of my Quiffer project:  reading financial transactions from a CSV file.

The sample data I’m using for this project is from a PayPal transaction file download.  Here are my first tests, for which I have written code that passes:

import unittest
from quiffer import read_data

class QuifferTest(unittest.TestCase):

    def setUp(self):
        self.data = 'sample_data/paypal_sample.csv'

    def test_read_data_headers(self):
             ' Time', 
             ' Time Zone', 
             ' Name', 
             ' Type', 
             ' Status', 
             ' Currency', 
             ' Gross', 
             ' Fee', 
             ' Net', 
             ' From Email Address', 
             ' To Email Address', 
             ' Counterparty Status', 
             ' Address Status', 
             ' Item Title', 
             ' Item ID', 
             ' Shipping and Handling Amount', 
             ' Insurance Amount', 
             ' Sales Tax', 
             ' Option 1 Name', 
             ' Option 1 Value', 
             ' Option 2 Name', 
             ' Option 2 Value', 
             ' Auction Site', 
             ' Buyer ID', 
             ' Item URL', 
             ' Closing Date', 
             ' Reference Txn ID', 
             ' Invoice Number', 
             ' Custom Number', 
             ' Receipt ID', 
             ' Balance']

    def test_read_data_name(self):
        self.assertEqual(read_data(self.data)[1][3], 'Almech Devices, LLC')

    def test_read_data_name_chinese(self):
        self.assertEqual(read_data(self.data)[3][3], '广州满翼易有限公司')

    def test_read_data_gross(self):
        self.assertEqual(read_data(self.data)[1][7], '-2.75')

if __name__ == '__main__':

As you can see, I’m following right along with the RealPython article’s methods. Here’s the Quiffer code so far:

import csv

def read_data(data):
    with open(data, encoding='utf-8') as csv_file:
        data = list(csv.reader(csv_file))
    return data

data = 'sample_data/paypal_sample.csv'

paypal_data = read_data(data)

I added the “encoding=’utf-8′” to the ‘open’ statement, because otherwise it threw a UnicodeDecode error. The original article had line 5 as:

data = [row for row in csv.reader(f.read().splitlines())]

..but thanks to what I learned from Trey Hunner’s CSV Python chat, I simplified this to:

data = list(csv.reader(csv_file))

I’m sure there are more tests I can add to be more thorough, so that’s the next step.  If anyone has any suggestions, hit me up on Twitter.

‘Quiffer’: Getting CSV Data into GnuCash via QIF Import

I use the excellent open source accounting software GnuCash to manage my personal finances.  I discovered it earlier this year and found it to be just as good as Quicken and without the evil.   I’m old-fashioned when it comes to managing finances.  I don’t need to have access to my financial records on all my devices; I am content to have a single app running on my laptop (backed up using Dropbox, of course) and to enter my receipts into GnuCash when I get home each day.  I prefer apps I can easily get my data into and out of.  Finally, I don’t believe handing my bank account login information to a third party is a good idea, preferring to download financial transactions from my accounts, in CSV format, manually.

GnuCash has the option to import transactions from QIF format, which allows for “split” (also known as double entry) transactions.   For example, this is what a QIF file would look like for my monthly electric payment:

NAssets:Checking Account
PFlorida Power and Light

This transaction has two entries: the deduction of $50.00 from the bank account to the power company, and the increase of $50.00 in the GnuCash account “Expenses:Electric.”  The same transaction as downloaded in a CSV file from my bank would be something like this:

11/28/17,Florida Power and Light,-50.00

Obviously, this is a radically different file format than QIF.  Also note that the bank only has one “side” of the transaction, the deduction from the bank account.  Getting transactions like these into GnuCash, then, requires the additional step of adding the correct GnuCash expense account to each transaction — not just converting from CSV to QIF.

Fortunately, I’ve discovered Python is very good at converting between very different text file formats.

In my transition to GnuCash, the biggest challenge was converting four years of financial transactions from Microsoft Access 2010 to QIF.  I started by exporting from Access to CSV.   The resulting CSV file had over 20,000 lines.  I was able to throw together some messy Python code to convert the 20,000 lines of CSV to QIF, and then get all the data into Gnu Cash.  There were so many lines that I worked with them in batches so I could check the account balances as I went.

I still use a slightly cleaner version of this code on a monthly basis to import transactions from PayPal, my Amazon seller account and eBay into QIF format, for import into GnuCash.  This makes it easy to keep detailed sales and expense records.  Of all the Python code I’ve written, this QIF creator has been by far the most useful to me.  Now I hope to make it useful to others.

The more I learn about Python, the more I see how much my code can be improved.   My existing code for CSV to QIF conversion is neither user-friendly nor GitHub worthy, so I plan to clean up, refactor and redesign it.  From this point forward, I’ll call this project “Quiffer.”

To illustrate the challenges faced by this project, here’s a sample PayPal CSV file, which is the most complicated file type I deal with regularly:

Date, Time, Time Zone, Name, Type, Status, Currency, Gross, Fee, Net, From Email Address, To Email Address, Counterparty Status, Address Status, Item Title, Item ID, Shipping and Handling Amount, Insurance Amount, Sales Tax, Option 1 Name, Option 1 Value, Option 2 Name, Option 2 Value, Auction Site, Buyer ID, Item URL, Closing Date, Reference Txn ID, Invoice Number, Custom Number, Receipt ID, Balance
11/28/2017,21:26:38,EST,"Almech Devices, LLC",Express Checkout Payment Sent,Completed,USD,-2.75,0,-2.75,bxx4@gmail.com,bxxxk@gmail.com,Verified,Confirmed,Tempered Glass Protective Screen Protector Film for Samsung Galaxy S5 Active,131511092340,0,0,0,,,,,Ebay,rxx04,http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=131511092340,,,,,,58.28
11/28/2017,21:26:36,EST,rosie castle,Express Checkout Payment Sent,Completed,USD,-7.98,0,-7.98,bxx4@gmail.com,ixx976@gmail.com,Verified,Confirmed,"The Good Dinosaur (DVD, 2016) NEW MOVIE! NOW SHIPPING ! ",222735118701,0,0,0,,,,,Ebay,rxx4,http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=222735118701,,,,,,61.03
11/28/2017,21:26:34,EST,广州满翼易有限公司,Express Checkout Payment Sent,Completed,USD,-5.5,0,-5.5,bxx4@gmail.com,vxxxl@48express.com,Non-U.S. - Verified,Confirmed,Cotton Linen Harry Potter Pillow Case Sofa Waist Throw Cushion Cover Home Decor [S1],282606610348,0,0,0,,,,,Ebay,rxx04,http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=282606610348&var=581788364837,,,,,,69.01

Note there are a few difficulties here:

  1. There are numerous columns I don’t need.
  2. The date format will need to be changed to YYYY-MM-DD for the QIF file.
  3. There are multiple columns (Name, Type, Status, Item Title) containing data that I want to appear in GnuCash’s “Description” field.  These columns will therefore need to be combined into a single column.
  4. Transactions that are payments for products I’ve sold include an amount in separate column, “Fee”, which is the fee charged by PayPal to receive funds.  Such transactions will actually need FOUR entries in GnuCash: two each for the payment and fee.
  5. My code needs to include a means of determining, automatically, the correct GnuCash accounts where the income and expense will be posted.

The first step for Quiffer will be to import the data from the CSV file into a Python data structure.  That will be the topic of my next post.  Stay tuned.