WhatfettleOne CSV, thirty stories: 4. Scattering

This is day 4 of One CSV, 30 stories a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub

I had some feedback after yesterday mostly from people enjoying my low-tech approach, which was nice. Today I wanted to look at the price paid for property. All 19 million prices on a single page in a hope to see any apparent trends or anomalies.

To do this we only need the date and the price columns, and we might as well sort them by date as I’m pretty sure that’ll be useful later:

awk -F'⋯' '{print $2 "⋯" $1}' < data/pp.tsv | sort > prices.tsv

Now to scatter the prices with time on the x-axis, and the price paid on the y-axis. We’ll use yet another awk script to do this:

cat prices.tsv | {
cat <<!
%!
%%Orientation: Landscape
%%Page: 1 1
0 0 0 setrgbcolor
/p {
    1 0 360 arc fill
} def
!
awk -F'	' -v max=15000000 '
    function epoch(s) {
        gsub(/[:-]/, " ", s);
        s = s " 00 00 00"
        return mktime(s);
    }
    NR == 1 {
        first = epoch($1);
        last = systime() - first;
    }
    {
        this = epoch($1) - first;
        x = 600 * this / last;
        y = 600 * $2 / max;
        printf "%d %d p\n", x, y;
    }'
echo showpage
}

which generates a rather large PostScript document:

%!
%%Orientation: Landscape
%%Page: 1 1
0 0 0 setrgbcolor
/p {
    1 0 360 arc fill
} def
0 4 p
0 0 p
   ... [19 million lines removed] ...
595 3 p
595 13 p
showpage

Back in the day the quickest way to see the output would be to attach a laser printer to the parallel port on the back of a server and cat prices.ps > /dev/lp but these days we have a raft of ways of executing PostScript. Most anything that can render a PDF can usually also run the older PostScript language — it’s a little bit weird how we bat executable programs back and forth when we’re exchanging text and images. Just to emphasise the capacity for mischief, the generated 1.5 Gig PostScript reliably crashes the Apple OS X preview application, so it’s best to use something more solid, such as the open source ImageMagick in this case to make a raster image:

scatterps.sh < data/prices.tsv | convert -density 300 - out.png

This image is intriguing, but we should be able to differentiate the density of points if we make them slightly transparent. PostScript is notoriously poor at rendering opacity, but luckily ImageMagick has its own drawing language which makes png files directly and it’s fairly straightforward to tweak the awk to generate MVG:

We can see from this a general, apparently slow trend in the bulk of house prices, with seasonal and a marked dip at what looks like 2009. There’s also a strange vertical gap in higher priced properties towards the right which along with the horizontal bands more apparent on the first plot could be down to bunching around the stamp duty bands.

So there’s a few stories to delve into. I completely mismanaged my time writing this post, so will leave adding axis to the graphs until tomorrow.