This is day 4 of "One CSV, 30 stories":http://blog.whatfettle.com/2014/10/13/one-csv-thirty-stories/, a series of articles exploring "price paid data":https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from "GitHub":https://github.com/psd/price-paid-data.
I had some feedback after "yesterday":http://blog.whatfettle.com/2014/10/15/one-csv-30-stories-3-minimal-viable-histograms/, mostly from people enjoying my low-tech approach, which was nice. Today I wanted to look at the prices paid for property: all 19 million of them on a single page, in the hope of spotting any apparent trends or anomalies.
To do this we only need the date and the price columns, and we might as well sort them by date as I'm pretty sure that'll be useful later:
bc. awk -F'⋯' '{print $2 "⋯" $1}' < data/pp.tsv | sort > prices.tsv
Now to scatter the prices with time on the x-axis, and the price paid on the y-axis. We'll use yet another awk script to do this:
bc. cat prices.tsv | {
cat < /dev/lp@ but these days we have a raft of ways of executing PostScript. Almost anything that can render a PDF can usually also run the older PostScript language; it's a little bit weird how we bat executable programs back and forth when we're exchanging text and images. Just to emphasise the capacity for mischief, the generated 1.5 GB PostScript file reliably crashes the Preview application on Apple OS X, so it's best to use something more solid, such as the open source "ImageMagick":http://www.imagemagick.org/, in this case to make a raster image:
bc. scatterps.sh < data/prices.tsv | convert -density 300 - out.png

!https://raw.githubusercontent.com/psd/price-paid-data/master/out/scatterps.png!

This image is intriguing, but we should be able to differentiate the density of points if we make them slightly transparent. PostScript is notoriously poor at rendering opacity, but luckily ImageMagick has its own drawing language which makes PNG files directly, and it's fairly straightforward to "tweak the awk":https://github.com/psd/price-paid-data/blob/master/bin/scatterim.sh to generate "MVG":http://www.imagemagick.org/Usage/draw/:

!https://raw.githubusercontent.com/psd/price-paid-data/master/out/scatterim.png!

We can see from this a general, apparently slow trend in the bulk of house prices, with seasonal variation and a marked dip at what looks like 2009. There's also a strange vertical gap in higher-priced properties towards the right which, along with the horizontal bands more apparent on the first plot, could be down to "bunching around the stamp duty bands":https://twitter.com/scedwar/status/522327111865237504. So there are a few stories to delve into.

I completely mismanaged my time writing this post, so will leave adding axes to the graphs until "tomorrow":http://blog.whatfettle.com/2014/10/18/one-csv-thirty-stories-5-axes/.
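
For anyone wanting to try the PostScript approach described above, here is a rough sketch of the idea. It is not the actual @scatterps.sh@ from the repository: the function name @scatter_ps@, the tab separator, the 1995 to 2015 date range, the 10 million price ceiling and the 500 point square page are all assumptions made for illustration, and dates are only approximated as fractional years.

```shell
#!/bin/sh
# Sketch only: turn "date<TAB>price" lines on stdin into a PostScript
# scatter plot on stdout. The separator, date range, price ceiling and
# page size below are assumptions, not values taken from the article.
scatter_ps() {
    awk -F'\t' '
    BEGIN {
        first_year = 1995; years = 20      # assumed x-axis range
        max_price = 10000000               # assumed y-axis ceiling
        width = 500; height = 500          # page size in points
        print "%!PS-Adobe-3.0"
        print "%%BoundingBox: 0 0 " width " " height
        # d: fill a half-point square at the x y left on the stack
        print "/d { 0.5 0.5 rectfill } bind def"
    }
    {
        # approximate the ISO date as a fractional year
        year = substr($1, 1, 4); month = substr($1, 6, 2); day = substr($1, 9, 2)
        t = (year - first_year) + (month - 1) / 12 + (day - 1) / 372
        x = width * t / years
        y = height * $2 / max_price
        if (y > height) y = height         # clamp outliers to the top edge
        printf "%.2f %.2f d\n", x, y
    }
    END { print "showpage" }
    '
}
```

Under those assumptions, something like @scatter_ps < prices.tsv | convert -density 300 - out.png@ should produce a raster image in the same way as the pipeline above. Defining a one-letter PostScript procedure for each point keeps the output to a single short line per price, which matters when there are 19 million of them.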