Scraping web data with Python

After reading about the Revoke Article 50 ePetition at the weekend, and as it's a hot topic at the moment, I thought I would write something to pull the data and break down where the signatures were coming from.

The petition data is available as a JSON file.

I was going to do this in Perl, something I have used for years, but then I thought it might be a good exercise to try it in Python.

So attempt one looked a bit like this, using Python 3.7.1 on macOS:

  • Fetch the JSON file from the Internet
  • Decode it into a dict data structure
  • Pull out the values of interest and display them

import json
import urllib.request

url = "https://petition.parliament.uk/petitions/241584.json"

# Fetch the JSON document; fall back to UTF-8 if the server sends no charset
with urllib.request.urlopen(url) as pageObject:
    charset = pageObject.headers.get_content_charset() or "utf-8"
    pageContentAsString = pageObject.read().decode(charset)

# Decode the JSON text into a dict
decodeJson = json.loads(pageContentAsString)
#print(json.dumps(decodeJson, sort_keys=True, indent=4))

# Pull out the overall total and the UK signature count
attributes = decodeJson["data"]["attributes"]
print("Total: " + str(attributes["signature_count"]))
for country in attributes["signatures_by_country"]:
    if country["name"] == "United Kingdom":
        print("UK   : " + str(country["signature_count"]))

That was it. However, later that evening I went back to clean up the output formatting:

  • Format the output to be clearer
    • Display the top 20 countries with % of signatures
    • Display the top/bottom 20 constituencies
  • Write the downloaded JSON to a file
    • Use datetime to datestamp the filename
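
As a rough sketch of how the country ranking and the datestamped file could be done (this is not the script linked from this post): the helper names `top_countries`, `datestamped_path` and `save_json` are made up for illustration, the `sample` dict is invented data in the same shape as the `data.attributes` part of the petition JSON, and the `%y%m%d_%H%M` filename pattern is an assumption inferred from the `190325_2329.json` name in the output.

```python
import json
import os
from datetime import datetime


def top_countries(attributes, limit=20):
    """Return (name, count, percent) tuples for the countries with most signatures."""
    total = attributes["signature_count"]
    ranked = sorted(
        attributes["signatures_by_country"],
        key=lambda country: country["signature_count"],
        reverse=True,
    )
    return [
        (c["name"], c["signature_count"], 100.0 * c["signature_count"] / total)
        for c in ranked[:limit]
    ]


def datestamped_path(directory, when):
    """Build a path such as epetitions/brexit/190325_2329.json (assumed pattern)."""
    return os.path.join(directory, when.strftime("%y%m%d_%H%M") + ".json")


def save_json(decoded, directory, when=None):
    """Write the decoded JSON to a datestamped file and return its path."""
    when = when or datetime.now()
    os.makedirs(directory, exist_ok=True)
    path = datestamped_path(directory, when)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(decoded, handle)
    return path


# Invented sample data in the same shape as decodeJson["data"]["attributes"]
sample = {
    "signature_count": 1000,
    "signatures_by_country": [
        {"name": "France", "signature_count": 40},
        {"name": "United Kingdom", "signature_count": 950},
        {"name": "Spain", "signature_count": 10},
    ],
}

for name, count, percent in top_countries(sample):
    print(f"{name:<30} {count:>12,} {percent:.2f}%")
```

With the real data, `top_countries(decodeJson["data"]["attributes"])` would take the place of the sample, and `save_json(decodeJson, "epetitions/brexit")` would write the file.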

That got me to the following output: signatures from the top 20 countries, then the 20 constituencies with the most signatures, followed by the 20 constituencies with the fewest.

Write downloaded JSON to epetitions/brexit/190325_2329.json

DateTime : 2019-03-25 23:29:41.409749

Total :  5,648,486 United Kingdom                    5,416,903 95.90%
Total :  5,648,486 France                               41,783 0.74%
Total :  5,648,486 Spain                                23,763 0.42%
Total :  5,648,486 United States                        22,858 0.40%
Total :  5,648,486 Germany                              19,225 0.34%
Total :  5,648,486 Australia                            17,538 0.31%
Total :  5,648,486 Canada                               10,163 0.18%
Total :  5,648,486 Netherlands                           9,275 0.16%
Total :  5,648,486 Ireland                               8,994 0.16%
Total :  5,648,486 New Zealand                           6,536 0.12%
Total :  5,648,486 Switzerland                           5,994 0.11%
Total :  5,648,486 Italy                                 5,477 0.10%
Total :  5,648,486 Belgium                               4,764 0.08%
Total :  5,648,486 Gibraltar                             4,369 0.08%
Total :  5,648,486 Sweden                                3,680 0.07%
Total :  5,648,486 Hong Kong                             2,922 0.05%
Total :  5,648,486 Denmark                               2,666 0.05%
Total :  5,648,486 Austria                               2,605 0.05%
Total :  5,648,486 Portugal                              2,596 0.05%
Total :  5,648,486 United Arab Emirates                  2,387 0.04%

34900    Thangam Debbonaire MP          Bristol West
30320    Catherine West MP              Hornsey and Wood Green
27399    Caroline Lucas MP              Brighton, Pavilion
27335    Tulip Siddiq MP                Hampstead and Kilburn
26488    Rt Hon Keir Starmer MP         Holborn and St Pancras
26462    Rt Hon Diane Abbott MP         Hackney North and Stoke Newington
26312    Daniel Zeichner MP             Cambridge
26203    Deidre Brock MP                Edinburgh North and Leith
26104    Zac Goldsmith MP               Richmond Park
25582    Rt Hon Jeremy Corbyn MP        Islington North
25344    Meg Hillier MP                 Hackney South and Shoreditch
25029    Helen Hayes MP                 Dulwich and West Norwood
23748    Neil Coyle MP                  Bermondsey and Old Southwark
23721    Kate Hoey MP                   Vauxhall
23408    Chuka Umunna MP                Streatham
23314    Vicky Foxcroft MP              Lewisham, Deptford
23312    Rt Hon Mark Field MP           Cities of London and Westminster
23302    Rt Hon Sir Vince Cable MP      Twickenham
23188    Marsha De Cordova MP           Battersea
23105    Rushanara Ali MP               Bethnal Green and Bow

1816     Eddie Hughes MP                Walsall North
1972     Ian Austin MP                  Dudley North
1974     Rt Hon Pat McFadden MP         Wolverhampton South East
1980     Mr Adrian Bailey MP            West Bromwich West
2079     Grahame Morris MP              Easington
2083     Nick Smith MP                  Blaenau Gwent
2154     Emma Reynolds MP               Wolverhampton North East
2188     Karl Turner MP                 Kingston upon Hull East
2196     Tom Watson MP                  West Bromwich East
2221     Melanie Onn MP                 Great Grimsby
2232     Angus Brendan MacNeil MP       Na h-Eileanan an Iar
2237     Sarah Champion MP              Rotherham
2279     Stephanie Peacock MP           Barnsley East
2341     Chris Bryant MP                Rhondda
2346     Mike Wood MP                   Dudley South
2399     Rt Hon Liam Byrne MP           Birmingham, Hodge Hill
2400     Jack Brereton MP               Stoke-on-Trent South
2438     Gordon Marsden MP              Blackpool South
2438     Ruth Smeeth MP                 Stoke-on-Trent North
2472     Anna Turley MP                 Redcar
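
The two constituency lists above come down to one sort. Here is a minimal sketch of that idea, not the original script: `rank_constituencies` and `format_row` are hypothetical names, and it assumes each entry in `signatures_by_constituency` carries `name`, `mp` and `signature_count` keys. The sample rows reuse three figures from the output above.

```python
def rank_constituencies(attributes, limit=20):
    """Return the `limit` highest and `limit` lowest constituencies by signatures."""
    ranked = sorted(
        attributes["signatures_by_constituency"],
        key=lambda entry: entry["signature_count"],
    )
    bottom = ranked[:limit]       # fewest signatures first
    top = ranked[::-1][:limit]    # most signatures first
    return top, bottom


def format_row(entry):
    """Lay out count, MP and constituency name in fixed-width columns."""
    return f"{entry['signature_count']:<8} {entry['mp']:<30} {entry['name']}"


# Three rows taken from the output above; the key names are an assumption
sample = {
    "signatures_by_constituency": [
        {"name": "Cambridge", "mp": "Daniel Zeichner MP", "signature_count": 26312},
        {"name": "Walsall North", "mp": "Eddie Hughes MP", "signature_count": 1816},
        {"name": "Bristol West", "mp": "Thangam Debbonaire MP", "signature_count": 34900},
    ]
}

top, bottom = rank_constituencies(sample, limit=2)
for entry in top:
    print(format_row(entry))
print()
for entry in bottom:
    print(format_row(entry))
```

Sorting once and slicing both ends keeps the top and bottom lists consistent with each other, which matters when two constituencies have the same count.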

The next day, I thought of my Raspberry Pi. I added a step that appends a summary line, containing just the UK data, to a file. After installing the required modules on the Pi, I set the script running every 15 minutes using a cron job.

## crontab -l gives:
*/15 * * * * /usr/bin/python3 /home/pi/ripPetition.py
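
The summary append could be sketched roughly as follows. This is an illustration rather than the code in `ripPetition.py`: `summary_line` and `append_summary` are hypothetical names, and the column order (timestamp, total, country, country count, percentage) is an assumption based on the summary file contents shown in this post. The sample figures are taken from the first row of that file.

```python
from datetime import datetime


def summary_line(attributes, when, country_name="United Kingdom"):
    """Build one CSV line: timestamp, total, country, country count, percentage."""
    total = attributes["signature_count"]
    for country in attributes["signatures_by_country"]:
        if country["name"] == country_name:
            count = country["signature_count"]
            percent = 100.0 * count / total
            return f"{when},{total},{country_name},{count},{percent:.2f}"
    raise ValueError(f"{country_name} not found in signatures_by_country")


def append_summary(path, line):
    """Append a single line to the summary file, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


# Figures from the first summary row in this post, in the assumed JSON shape
sample = {
    "signature_count": 5305289,
    "signatures_by_country": [
        {"name": "United Kingdom", "signature_count": 5093045},
    ],
}

print(summary_line(sample, datetime(2019, 3, 24, 23, 30, 3, 764265)))
```

When a script like this runs from cron, it is safer to pass `append_summary` an absolute path, since the working directory is usually not the script's own folder.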

Over time this gave the following output (timestamp, total signatures, country, UK signatures, UK percentage):

2019-03-24 23:30:03.764265,5305289,United Kingdom,5093045,96.00
2019-03-24 23:45:03.690786,5308423,United Kingdom,5095988,96.00
2019-03-25 00:15:04.629576,5313114,United Kingdom,5100484,96.00
2019-03-25 00:30:04.227742,5314813,United Kingdom,5101958,96.00
2019-03-25 00:45:03.546101,5316210,United Kingdom,5103097,95.99
2019-03-25 01:00:03.889650,5317601,United Kingdom,5104143,95.99
2019-03-25 01:15:04.158192,5318623,United Kingdom,5105560,95.99
2019-03-25 01:30:03.638560,5319530,United Kingdom,5106118,95.99
2019-03-25 01:45:03.957981,5320350,United Kingdom,5106118,95.97
2019-03-25 02:00:04.410973,5321054,United Kingdom,5107640,95.99
2019-03-25 02:15:04.364589,5321701,United Kingdom,5107809,95.98
2019-03-25 02:30:03.899075,5322303,United Kingdom,5108611,95.98
2019-03-25 02:45:04.169249,5322894,United Kingdom,5109133,95.98
2019-03-25 03:00:04.066469,5323416,United Kingdom,5109337,95.98
2019-03-25 03:15:03.744909,5323877,United Kingdom,5109863,95.98
2019-03-25 03:30:03.662507,5324376,United Kingdom,5110317,95.98
2019-03-25 03:45:04.128469,5324818,United Kingdom,5110669,95.98
2019-03-25 04:00:03.608843,5325278,United Kingdom,5111061,95.98
2019-03-25 04:15:04.069418,5325733,United Kingdom,5111267,95.97
2019-03-25 04:30:04.025370,5326172,United Kingdom,5111684,95.97

To visualise the data, I used Matplotlib with a separate Python script. It is available here along with the code above. The following chart is the output from this script. I think I even found a bug. I did email the ePetitions site about this 🙂 but didn’t hear back.

[Image: Screen Shot 2019-03-25 at 21.01.46]
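
For anyone wanting to try something similar, here is a hedged sketch of reading the summary lines and charting them with Matplotlib. It is not the linked script: `read_summary` and `plot_summary` are hypothetical names, the column positions are assumed from the summary rows shown earlier, and the tick format is just one reasonable choice for the x-axis.

```python
import csv
from datetime import datetime


def read_summary(lines):
    """Parse summary lines into (timestamps, totals, uk_counts)."""
    timestamps, totals, uk_counts = [], [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        timestamps.append(datetime.fromisoformat(row[0]))
        totals.append(int(row[1]))
        uk_counts.append(int(row[3]))
    return timestamps, totals, uk_counts


def plot_summary(timestamps, totals, uk_counts, output_path):
    """Plot total and UK signatures against time and save the chart to a file."""
    # Imported here so the parsing above still works without Matplotlib installed
    import matplotlib
    matplotlib.use("Agg")  # render straight to a file, no display needed
    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(timestamps, totals, label="Total")
    ax.plot(timestamps, uk_counts, label="United Kingdom")
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b %H:%M"))
    fig.autofmt_xdate()  # rotate the x-axis tick labels so they do not overlap
    ax.set_xlabel("Time")
    ax.set_ylabel("Signatures")
    ax.legend()
    fig.savefig(output_path)
    plt.close(fig)


# Two rows copied from the summary output earlier in this post
sample_lines = [
    "2019-03-24 23:30:03.764265,5305289,United Kingdom,5093045,96.00",
    "2019-03-24 23:45:03.690786,5308423,United Kingdom,5095988,96.00",
]

timestamps, totals, uk_counts = read_summary(sample_lines)
print(totals, uk_counts)
```

Calling `plot_summary(timestamps, totals, uk_counts, "signatures.png")` would then write the chart; the output filename is only an example.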

Overall, what started as a quick experiment turned into learning a bit more Python and a fun use of a Raspberry Pi. Thanks to the ePetitions website, the BBC and the various Stack Overflow posts that helped me out, especially with the x-axis ticks.

[Image: Screen Shot 2019-04-02 at 23.04.13]

The code can be found here.

Some links for reference.

Link to the petition: www.petition.parliament.uk/petitions/241584

The article on the BBC: www.bbc.co.uk/news/technology-47668946

A very cool petition visualiser: www.splasho.com/petitions/index.php?petition=241584
