This is day 7 of "One CSV, 30 stories":http://blog.whatfettle.com/2014/10/13/one-csv-thirty-stories/ a series of articles exploring "price paid data":https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from "GitHub":https://github.com/psd/price-paid-data
Continuing on from "yesterday's":http://blog.whatfettle.com/2014/10/19/one-csv-thirty-stories-6-prices/ foray into prices, today sees more of the same with more or less the same "gnuplot":http://gnuplot.sourceforge.net/ script.
The prices file from "Day 2":http://blog.whatfettle.com/2014/10/15/one-csv-thirty-stories-2-counting-things/ contains almost 150,000 different prices:
bc. $ wc -l price.tsv
141464
| Count | Price (£) | 
|---|---|
| 208199 | 250000 | 
| 185912 | 125000 | 
| 163323 | 120000 | 
| 159519 | 60000 | 
| 147645 | 110000 | 
| 145214 | 150000 | 
| 140833 | 115000 | 
| 134731 | 135000 | 
| 131334 | 175000 | 
| 131223 | 85000 | 
| 129597 | 130000 | 
| 129336 | 105000 | 
| 126161 | 165000 | 
| 126004 | 95000 | 
| 124379 | 145000 | 
| 123968 | 75000 | 
| 123893 | 140000 | 
| 123451 | 160000 | 
| 123340 | 90000 | 
| 120306 | 100000 | 
| 119776 | 80000 | 
which when plotted by rank using the "gnuplot pseudo-column zero":http://gnuplot.sourceforge.net/docs_4.2/node133.html :
bc. plot "/dev/stdin" using 0:1 with boxes lc rgb "black"
shows how the prices are distributed in quite a steep power-curve, a "long-tail":http://en.wikipedia.org/wiki/Long_tail if you will:
A quick awk script to collate prices, modulo 10:
bc. cut -f1 < data/pp.tsv | awk '{ print $1 % 10 }' | sort | uniq -c | sort -rn gives us the distribution of the last digit in the prices:
| Count | Price (£1) | 
|---|---|
| 18437019 | 0 | 
| 715633 | 5 | 
| 56195 | 9 | 
| 21890 | 2 | 
| 17549 | 6 | 
| 17395 | 3 | 
| 16889 | 1 | 
| 16235 | 7 | 
| 14888 | 8 | 
| 11878 | 4 | 
and can be tweaked to show the last two digits:
| Count | Price (£10) | 
|---|---|
| 16282411 | 0 | 
| 2087949 | 50 | 
| 636253 | 95 | 
| 45710 | 99 | 
| 22419 | 75 | 
| 20194 | 25 | 
| 11271 | 45 | 
| 11121 | 60 | 
| 9890 | 20 | 
| 9425 | 80 | 
| 9235 | 40 | 
| 7677 | 90 | 
| 6855 | 70 | 
| 6532 | 10 | 
| 6519 | 55 | 
| 5924 | 30 | 
and the last three digits in the prices:
| Count | Price (£100) | 
|---|---|
| 3682320 | 0 | 
| 3332503 | 5000 | 
| 980975 | 8000 | 
| 897786 | 2000 | 
| 835579 | 7000 | 
| 765799 | 3000 | 
| 732587 | 9950 | 
| 713121 | 6000 | 
| 707063 | 4000 | 
| 687129 | 9000 | 
| 596687 | 7500 | 
| 567882 | 2500 | 
| 503076 | 1000 | 
| 298398 | 8500 | 
| 294878 | 4950 | 
| 267618 | 9995 | 
A logarithmic scale can help see patterns in the lower values whilst showing the peaks on the same page; it's a bit like squinting at the chart from a low angle:
I think tomorrow will be "pretty average":http://blog.whatfettle.com/2014/10/25/one-csv-thirty-stories-8-heatmap-meh/.




