One CSV, thirty stories: 2. Counting things

This is day 2 of One CSV, 30 stories, a series of articles exploring the price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

Statistics: The science of producing unreliable facts from reliable figures — Evan Esar

The file we made yesterday contains 19 million property transactions. Let's use awk to find some basic information:

$ cut -f1 data/pp.tsv | awk 'NR == 1 {
    # seed min and max from the first price
    min = $1;
    max = $1;
}
{
    # keep a running min, max, sum and sum of squares for every price
    if ($1 < min) min = $1;
    if ($1 > max) max = $1;
    sum += $1;
    sumsq += $1 * $1
}
END {
    # write the summary statistics out as tab-separated values
    printf "count\t%d\n", NR
    printf "min\t%d\n", min
    printf "max\t%d\n", max
    printf "sum\t%d\n", sum
    printf "mean\t%f\n", sum/NR
    printf "sd\t%f\n", sqrt(sumsq/NR - (sum/NR)**2)
}' > stats.tsv

That gives us some basic statistics about the property prices contained within the file:

$ cat stats.tsv
count  19325571
min    5050
max    54959000
sum    3078384329109
mean   159290.730872
sd     180368.570700

That tells us our file contains a record of more than £3 trillion transacted over the course of a number of years, but over how many years? We can find that out by chopping out the date column, trimming each date down to just the year, and counting the uniquely sorted result:

$ cut -f2 < data/pp.tsv | sed 's/-.*$//' | sort | uniq | wc -l
20
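
Twenty years’ worth. If we also want to see which years those are, the same pipeline with sed picking out the first and last lines should do the trick, assuming every date in the file is well formed:

$ cut -f2 < data/pp.tsv | sed 's/-.*$//' | sort | uniq | sed -n '1p;$p'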

The standard deviation makes me think the median price would be useful. We can find that with sort, using the record count we saved in stats.tsv:

$ cut -f1 < data/pp.tsv | sort -n | sed -n $(expr $(grep count stats.tsv | cut -f2) / 2)p
265000
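
The same trick gives other percentiles too; for instance, a sketch along the same lines should pull out the lower and upper quartiles, reusing the count saved in stats.tsv:

$ count=$(grep count stats.tsv | cut -f2)
$ cut -f1 < data/pp.tsv | sort -n | sed -n "$(expr $count / 4)p;$(expr $count \* 3 / 4)p"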

Judging from enterprisey emails in my inbox, some people are quite excited about Data Warehousing and cool things like Hadoop, but for this kind of experimental hackery Unix sort is great. I guess it’s had 40-odd years of computer scientists showing off by optimising the heck out of the open source code. There’s an idiom of sort I use a lot to find the distribution of a data item; for example, we can quickly find the busiest days using:

$ cut -f2 data/pp.tsv | sort | uniq -c | sort -rn | head
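
The same counts can also come from an awk associative array in a single pass over the column, leaving only the unique values to be sorted at the end; a sketch rather than a recommendation:

$ cut -f2 data/pp.tsv | awk '{ count[$0]++ } END { for (d in count) print count[d], d }' | sort -rn | head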

I say quickly, but even with the wonder of sort, counting the occurrences of every value in such a large dataset is a reasonably expensive operation, and we’re sorting things a lot, so let’s create some index files as a one-off activity. I’m sure they’ll come in handy later:

while read column title
do
    # count the occurrences of each value in the column,
    # most frequent first, as a tab-separated file
    cat data/pp.tsv |
        cut -f$column |
        sort |
        uniq -c |
        sort -rn |
        sed -e 's/^ *//' -e 's/ /\t/' > $title.tsv
done <<-!
1   price
2   date
3   postcode
4   type
5   new
6   duration
7   paon
8   saon
9   street
10  locality
11  town
12  district
13  county
14  status
!

You might like to make some tea whilst that happens. You could probably use your laptop to warm the pot. When it’s complete we have a file for each column, allowing us to find the busiest days:

$ head date.tsv 
26299 2001-06-29
26154 2002-06-28
26141 2002-05-31
25454 2003-11-28
24889 2007-06-29
24749 2000-06-30
24146 2006-06-30
24138 1999-05-28
23195 2000-03-31
22870 2003-12-19

and the most popular prices:

$ head price.tsv
208199 250000
185912 125000
163323 120000
159519 60000
147645 110000
145214 150000
140833 115000
134731 135000
131334 175000
131223 85000

So that’s the mode for each column, and a breakdown of categories such as the number of recorded transactions on new versus old builds:

$ cat new.tsv
17351417 N
1974154 Y
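
Those two numbers say roughly one transaction in ten was on a new build; a quick awk sketch turns the counts into a percentage, assuming the count-then-flag layout above:

$ awk '{ count[$2] = $1; total += $1 } END { printf "%.1f%% new builds\n", 100 * count["Y"] / total }' new.tsv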

and the most active postcodes:

$ head postcode.tsv
29913
280 TR8 4LX
274 CM21 9PF
266 B5 4TR
260 BS3 3NG
258 CM7 1WG
255 N7 6JT
253 HA1 2EX
248 W2 6HP
242 M3 5AS

Shame the most popular postcode is blank. That could be for legitimate reasons; after all, not every parcel of land bought or sold has a postal address. We’ll get to that another day.
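
For now, a quick cross-check against the source file should confirm the blanks really are empty fields rather than something mangled in yesterday’s processing; assuming the postcode is still column 3, the count ought to match the top line of postcode.tsv:

$ awk -F'\t' '$3 == ""' data/pp.tsv | wc -l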

I’ve gone this far without looking into any particular record. That’s because the data contains addresses, and it feels a little strange to highlight what is, after all, probably someone’s home. Ideally I’d cite somewhere such as Admiralty Arch or the Land Registry Head Office, but the dataset excludes quite a few transactions, including those on commercial properties and leaseholds. That’s definitely a thing I should talk to people about.

To be fair, and to reduce the risk of weirding someone out, I need to pick on a property almost at random. I was quite interested in the maximum price paid for a property, so let’s look for that one:

$ grep "54959000"  data/pp.tsv
54959000    2012-03-26  SW10 9SU    S   N   F   20      THE BOLTONS     LONDON  KENSINGTON AND CHELSEA  GREATER LONDON  A

It’s a very large house in Chelsea! No surprise there, then. If I were interested I could now go to the Land Registry view title service, pay £3 and find out who owns it, along with details of others with an interest in the property including any mortgage, a title plan, and possibly other restrictions such as rights of way or whether they have to pay to repair the local church spire. This is what Theodore Burton Fox Ruoff called looking behind the curtain.

Anyway, back to the spelunking. Something that is noticeable is how slowly that grep command runs on our large file:

$ time grep "54959000"  data/pp.tsv
...
real    0m46.950s
user    0m45.926s
sys 0m0.837s

Maybe we can speed that up a little using a regular expression anchored to the start of the line:

$ time egrep "^54959000"  data/pp.tsv
...
real    0m30.036s
user    0m29.130s
sys 0m0.727s

Which is still quite slow. I hope I’ve shown how playing around with Unix pipelines can be quick and fun, but it’s easy to forget how quick a programming language can be, even one as simplistic as awk:

$ time awk '$1 == "54959000" { print }' <  data/pp.tsv 
...
real    0m11.475s
user    0m8.553s
sys 0m1.086s
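
As an aside, another trick worth trying is forcing the C locale, which often speeds grep up considerably on plain ASCII data like ours, though I haven’t timed it here:

$ time LC_ALL=C grep "54959000" data/pp.tsv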

With that, my lap is suffering from local warming, it’s very late, and I’m quite tired, so I think that’s probably enough command-line bashing for now. Statistics are great, but it’s quite hard to grok numbers. Tomorrow we should probably draw some charts.