csv.Sniffer gets the sniffles

Since I regularly work with both comma delimited and tab delimited files, I’d like for my Quiffer project to be able to automatically detect if a csv file is tab delimited, and if it is, specify tab (‘\t’) as the delimiter when reading it using Python’s csv.reader class.

Here’s a sample of tab delimited data — an Amazon transaction file:

Transaction Summary for December 1, 2016 to December 8, 2016
Transactions: 3

Date Order ID SKU Transaction type Payment Type Payment Detail Amount Quantity Product Title
Dec 3, 2017 114-5190104 DI-2QQ3-4X5H Order Payment Amazon fees Shipping commission $-0.50 Disney's 2012 Lady and the Tramp Sketchbook Ornament
Dec 3, 2017 114-5190104 DI-2QQ3-4X5H Order Payment Product charges $29.86 1 Disney's 2012 Lady and the Tramp Sketchbook Ornament
Dec 3, 2017 114-5190104 DI-2QQ3-4X5H Order Payment Other Shipping $3.99 Disney's 2012 Lady and the Tramp Sketchbook Ornament

I need Quiffer to be able to parse each row into a list, like so:


['Date', 'Order ID', 'SKU', 'Transaction type', 'Payment Type', 'Payment Detail', 'Amount', 'Quantity', 'Product Title']
['Dec 3, 2017', '114-5190104', 'DI-2QQ3-4X5H', 'Order Payment', 'Amazon fees', 'Shipping commission', '$-0.50', '', "Disney's 2012 Lady and the Tramp Sketchbook Ornament"]
['Dec 3, 2017', '114-5190104', 'DI-2QQ3-4X5H', 'Order Payment', 'Product charges', '', '$29.86', '1', "Disney's 2012 Lady and the Tramp Sketchbook Ornament"]
['Dec 3, 2017', '114-5190104', 'DI-2QQ3-4X5H', 'Order Payment', 'Other', 'Shipping', '$3.99', '', "Disney's 2012 Lady and the Tramp Sketchbook Ornament"]

Of course, the above is the default behavior when a comma delimited file is fed to csv.reader without a delimiter specified, as in the following code:


import csv

src_file = 'sample_data/comma_sample.csv'

with open(src_file, newline='', encoding='utf-8') as csv_file:
    data = list(csv.reader(csv_file))

When a tab-delimited file is opened with this code, however, things get messy:


[['Transaction Summary for December 1', ' 2016 to December 8', ' 2016'],
['Transactions: 3'],
[],
['Date\tOrder ID\tSKU\tTransaction type\tPayment Type\tPayment Detail\tAmount\tQuantity\tProduct Title'],
['Dec 3', " 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tAmazon fees\tShipping commission\t$-0.50\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"],
['Dec 3', " 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tProduct charges\t\t$29.86\t1\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"],
['Dec 3', " 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tOther\tShipping\t$3.99\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]]

Since csv.reader is expecting a comma delimited file, it splits each row in two at the comma after the month in the Date column.  The rest of the columns are mashed together in a single string, with ‘\t’ in between.

I’d rather not code two different versions of Quiffer for different types of files, with the only difference being the ‘delimiter’ specification in csv.reader.  Surely, I thought, there must be a way for Python to examine a file and determine how its fields are delimited, and parse it accordingly.

A while back I discovered a class in the Python csv module called csv.Sniffer.  My understanding is that Sniffer could figure out if a file was tab delimited or comma delimited, and provide this information to csv.reader in the ‘dialect’ parameters.  I wrote the following to test it with the Amazon sample file:

import csv

def sniff_test(file):
    with open(file, newline='') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
    
        csvfile.seek(0)

        reader = csv.reader(csvfile, dialect)

        return [row for row in reader]

This was the output:

['Transaction Summary for December 1, 2016 to December 8, 2016']
['Transactions: 3']
[]
['Date\tOrder ID\tSKU\tTransaction type\tPayment Type\tPayment Detail\tAmount\tQuantity\tProduct Title']
["Dec 3, 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tAmazon fees\tShipping commission\t$-0.50\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]
["Dec 3, 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tProduct charges\t\t$29.86\t1\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]
["Dec 3, 2017\t114-5190104\tDI-2QQ3-4X5H\tOrder Payment\tOther\tShipping\t$3.99\t\tDisney's 2012 Lady and the Tramp Sketchbook Ornament"]

I tried again, using a list of delimiters, as suggested in How to use csv.Sniffer for 2 different CSV-types? and the code threw a Could not determine parameter error. I did some more experimenting, and to make a long story short, I found that what was giving Sniffer a stuffy nose was the inconsistent layout of the Amazon file.  Note that the first three lines are like a report header, and do not contain any delimiters.  I removed these three lines and made a copy of the file without them, and Sniffer worked.  Nevertheless, I don’t want to have to physically alter my files before parsing them with Quiffer. Solution: have Quiffer read the file a line at a time and look for tabs, and if tabs are present, open as a tab delimited file.  Here’s the latest, tested version:

import csv

def read_data(src_file):
    """
    Import rows of data from a csv file into a list of rows, where each row is
     a list containing column values.
    """

    with open(src_file, newline='', encoding='utf-8') as csv_file:

        # Check the file for tabs.
        line = csv_file.readline()

        while line != "":
            if '\t' in line:

                # tab was found, so read it as csv with delimiter specified
                csv_file.seek(0)

                return list(csv.reader(csv_file, delimiter='\t'))
                
            line = csv_file.readline()

        # if no tabs present, read as a normal csv
        csv_file.seek(0)

        return list(csv.reader(csv_file))