This is day 7 of "One CSV, 30 stories":http://blog.whatfettle.com/2014/10/13/one-csv-thirty-stories/ a series of articles exploring "price paid data":https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from "GitHub":https://github.com/psd/price-paid-data.
Continuing on from "yesterday's":http://blog.whatfettle.com/2014/10/19/one-csv-thirty-stories-6-prices/ foray into prices, today sees more of the same with more or less the same "gnuplot":http://gnuplot.sourceforge.net/ script.
The prices file from "Day 2":http://blog.whatfettle.com/2014/10/15/one-csv-thirty-stories-2-counting-things/ contains more than 140,000 different prices:
bc. $ wc -l price.tsv
141464
| Count | Price (£) |
|---|---|
| 208199 | 250000 |
| 185912 | 125000 |
| 163323 | 120000 |
| 159519 | 60000 |
| 147645 | 110000 |
| 145214 | 150000 |
| 140833 | 115000 |
| 134731 | 135000 |
| 131334 | 175000 |
| 131223 | 85000 |
| 129597 | 130000 |
| 129336 | 105000 |
| 126161 | 165000 |
| 126004 | 95000 |
| 124379 | 145000 |
| 123968 | 75000 |
| 123893 | 140000 |
| 123451 | 160000 |
| 123340 | 90000 |
| 120306 | 100000 |
| 119776 | 80000 |
which when plotted by rank using the "gnuplot pseudo-column zero":http://gnuplot.sourceforge.net/docs_4.2/node133.html :
bc. plot "/dev/stdin" using 0:1 with boxes lc rgb "black"
shows how the prices are distributed in quite a steep power curve, a "long tail":http://en.wikipedia.org/wiki/Long_tail if you will.
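gnuplot's pseudo-column zero is the zero-based record number of each data line, so "using 0:1" plots the count against its position in the file. The sketch below makes that rank explicit with awk. It assumes the count is in the first column and the file is already sorted by descending count, as the table above suggests; the function name with_rank is hypothetical, not part of the series' code.

```shell
# Hypothetical helper: prefix each count with its zero-based rank,
# mirroring what gnuplot's pseudo-column 0 supplies implicitly.
# Assumes the count is the first whitespace-separated field.
with_rank() {
    awk '{ printf "%d\t%s\n", NR - 1, $1 }'
}

# First three rows (count, price) from the table above.
printf '%s\t%s\n' 208199 250000 185912 125000 163323 120000 | with_rank
```

Piping price.tsv through this and plotting with "using 1:2" should give the same picture as "using 0:1" on the raw file.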
A quick awk script to collate prices, modulo 10:
bc. cut -f1 < data/pp.tsv | awk '{ print $1 % 10 }' | sort | uniq -c | sort -rn

gives us the distribution of the last digit in the prices:
| Count | Last digit of price |
|---|---|
| 18437019 | 0 |
| 715633 | 5 |
| 56195 | 9 |
| 21890 | 2 |
| 17549 | 6 |
| 17395 | 3 |
| 16889 | 1 |
| 16235 | 7 |
| 14888 | 8 |
| 11878 | 4 |
and can be tweaked to show the last two digits:
| Count | Last two digits of price |
|---|---|
| 16282411 | 0 |
| 2087949 | 50 |
| 636253 | 95 |
| 45710 | 99 |
| 22419 | 75 |
| 20194 | 25 |
| 11271 | 45 |
| 11121 | 60 |
| 9890 | 20 |
| 9425 | 80 |
| 9235 | 40 |
| 7677 | 90 |
| 6855 | 70 |
| 6532 | 10 |
| 6519 | 55 |
| 5924 | 30 |
and the last four digits in the prices:
| Count | Last four digits of price |
|---|---|
| 3682320 | 0 |
| 3332503 | 5000 |
| 980975 | 8000 |
| 897786 | 2000 |
| 835579 | 7000 |
| 765799 | 3000 |
| 732587 | 9950 |
| 713121 | 6000 |
| 707063 | 4000 |
| 687129 | 9000 |
| 596687 | 7500 |
| 567882 | 2500 |
| 503076 | 1000 |
| 298398 | 8500 |
| 294878 | 4950 |
| 267618 | 9995 |
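The tables above differ only in the modulus handed to awk: 10, 100 and, judging by the values in the last table, 10000. A minimal sketch of that tweak, with the modulus as a parameter, might look like this. The function name price_endings is hypothetical, and the sample prices are made up rather than taken from the Land Registry data.

```shell
# Hypothetical helper: count prices by their remainder modulo $1.
# Reads one price per line on stdin and prints "count<TAB>remainder",
# most frequent first, ties broken by the smaller remainder.
price_endings() {
    awk -v m="$1" '{ print $1 % m }' |
        sort -n |
        uniq -c |
        sort -k1,1nr -k2,2n |
        awk '{ print $1 "\t" $2 }'
}

# Tiny made-up sample, not real price paid data.
printf '%s\n' 250000 124995 89950 125000 99995 | price_endings 100
```

Against the real file this would be something like "cut -f1 < data/pp.tsv | price_endings 10000", assuming, as in the command above, that the price is the first column of data/pp.tsv.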
A logarithmic scale can help reveal patterns in the lower values whilst still showing the peaks on the same page; it's a bit like squinting at the chart from a low angle.
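To see why the squint works, take the largest and smallest counts from the last-digit table: on a linear axis they differ by a factor of roughly 1,550, but their base-10 logarithms are only about three units apart. The sketch below is just that arithmetic in awk; log10_counts is a hypothetical name. In gnuplot itself the usual switch is "set logscale y", though I am assuming rather than quoting the script used for these charts.

```shell
# Hypothetical helper: print each count alongside its base-10 logarithm,
# rounded to two decimal places.
log10_counts() {
    awk '{ printf "%d\t%.2f\n", $1, log($1) / log(10) }'
}

# Largest and smallest counts from the last-digit table above.
printf '%s\n' 18437019 11878 | log10_counts
```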
I think tomorrow will be "pretty average":http://blog.whatfettle.com/2014/10/25/one-csv-thirty-stories-8-heatmap-meh/.




