Detect Headers in CSV Tables Statistically
Generally speaking, a CSV table may or may not contain a header, and the header, when it exists, may consists of multiple lines. Question: How can we detect which first rows constitutes the header of a CSV file?
I wrote a script, named headsman, that finds the best amount of first few rows to cut off from a table in order to maximize the purity of datatypes in each column.
What do I mean?
Notice that, in a column of a table, even if it contains only numerical entries, its header (if any) is often a string. For example, the table
contains the following datatypes in each cell
so the purity in each column is
- The first column: 100%
- The second column: 67%
whose average is 83%. However, if we cut off the first row, the first and the second columns will only contain strings and numbers, respectively.
This gives purities of
- The first column: 100%
- The second column: 100%
the average of which is 100%, an increase over the former value of 67%. Hence, at least for tables with numerical-valued column(s), we can define headers as:
Apparent header: the first few rows whose removal would maximize the column-averaged purity in datatypes.
To make the argument more obvious, I used standard deviation weighted by number of rows remaining, averaged across all columns in each table, and standardized to 0~1. I call this metric “Data Type Deviation” (DTD).
Standard deviation tells me how heterogeneous the remaining cells in each column is. They are weighted by number of rows remaining, so that unnecessary cut-offs will lower these standard deviations. These weights convert saturation points into turning points, as we will see in the plot below. Weighted standard deviations (WSDs) of all columns in each table is averaged and scaled so that it’s maximum value is 1. Since the aforementioned weight, i.e. number of remaining rows, will always be zero when we have cut off all rows in each table, by scaling the max value to 1, I am essentially standardizing the WSDs (i.e., scaling the series of values so that their max is 1 and their min is 0). I ran this metric over some thousands of tables derived from Wharton Research Data Services (a big shout-out). The results looks like:
Each curve represents a table. Notice how each curve has a turning point achieving the y-value of 1. These are the turning points representing the last rows of headers.
A quick histogram tells us that most tables does not have an apparent header:
Notice how the first bin dominated the scale.
Intuitively, the header should not contain too many rows. Eliminating the tables who does not have an apparent header, the following histogram fits our expectation:
The fifth row (if present — we may have tables with 4 or less rows!) seems to be a good position to stop header detection. This greatly improved Headsman’s efficiency.