Screen scraping MPs' info from government websites

July 17, 2007

For any number of applications, you might find yourself needing the contact information for our federal politicians. There are several ways to get it, including going to a government web page like this and manually copying and pasting the information.

"dude, manual data input is sooooo last decade!"

Or, of course, you could screen scrape this data into a basic CSV.

DISCLAIMER: while my scrapings are normally done in the likeness of ninjas robbing a museum, this time around I resorted to some pretty hacky antics. This means that what follows is not one clean method. Rather, I took bits and pieces that I had already written and used them in a series of trial-and-error steps to trim my strings to the desired length.

Ultimately, the Government of Canada could save everyone a ton of time/energy/money by exporting this to a CSV for us.

Here are some functions that can help you excise the HTML you need.

 

This function crops out the content between two strings (which act as bookends). Beware: if it doesn't find what you put in for "from" or "to", it will return most of the page and jumble your data. I chose not to troubleshoot this and simply deleted by hand the junk that came back the three or four times it got messed up.
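A minimal sketch of the bookend idea, written in Python rather than the PHP used for the original (which is not reproduced here). The function name `crop_between` and the sample markup are made up for illustration. Unlike the behaviour described above, this version returns `None` when either bookend is missing, so a bad match shows up as an empty field instead of most of the page.

```python
def crop_between(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Returns None if either bookend is missing, rather than a large
    chunk of the page.
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


# Made-up markup, only to show the call.
page = '<td class="name">Jane Doe</td><td class="phone">555-0100</td>'
print(crop_between(page, '<td class="name">', '</td>'))  # Jane Doe
print(crop_between(page, '<td class="fax">', '</td>'))   # None
```

Checking for `None` before writing a row is probably cheaper than cleaning up the junk rows afterwards.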

[PHP listing partly lost: the HTML inside the search patterns was stripped. Surviving fragments:]

    ... "]+spacer\.gif\"\ width\=\"575\"[^<>]+>", "]+>" , "" , " " , "/n" , "", "]+>", "" );
    $replace = array(
        'src="http://webinfo.parl.gc.ca/MembersOfParliament/',
        'href="http://webinfo.parl.gc.ca/MembersOfParliament/',
        '', '', '', '', '', '', '', '', ' ', '', '', '', '', '', '', '', ''
    );
    for ( $i=0 ; $i ...

This function searches using regex (don't assume I know regex, I just mumbled through it) for various things, such as HTML tags and the spacer.gif images, and replaces them with ''. Effectively, it is a find and replace, but it also serves to strip lots of HTML junk out of your string.
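The same find-and-replace idea can be sketched in Python with `re.sub`. The base URL is the one that appears in this article. The patterns themselves are illustrative guesses, since the originals did not survive, and `example-member.html` is a made-up file name.

```python
import re

BASE = "http://webinfo.parl.gc.ca/MembersOfParliament/"

# (pattern, replacement) pairs, applied in order.
# These patterns are assumptions for illustration, not the original list.
RULES = [
    (r'<img[^<>]+spacer\.gif[^<>]+>', ''),        # drop spacer images
    (r'src="(?!https?://)', 'src="' + BASE),      # make relative src absolute
    (r'href="(?!https?://)', 'href="' + BASE),    # make relative href absolute
    (r'<font[^<>]*>|</font>', ''),                # drop font tags
    (r'&nbsp;', ' '),                             # non-breaking space entity
    (r'[ \t]+', ' '),                             # collapse runs of spaces/tabs
]


def clean_html(text):
    """Apply every rule in RULES to `text` and trim the result."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text.strip()


# Made-up fragment, only to show the call.
sample = ('<img src="images/spacer.gif" width="575" height="1">'
          '<font size="2"><a href="example-member.html">Jane&nbsp;Doe</a></font>')
print(clean_html(sample))
```

Keeping each pattern next to its replacement avoids the two parallel arrays drifting out of step, which is easy to do when most of the replacements are empty strings.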

The first step involved going to the index page and getting the URL for each MP page. Then, with those URLs in an array, you can loop over them to fetch the content of each page, which you can parse in the same step into fields like name, phone, etc.
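The two steps above could look roughly like this in Python, using only the standard library. The base URL comes from this article and may no longer resolve. The `member-` marker, the helper names, and the sample index markup are all hypothetical. `parse_page` stands in for whatever field extraction you write.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Base URL taken from the article; it may no longer resolve.
INDEX_URL = "http://webinfo.parl.gc.ca/MembersOfParliament/"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def member_urls(index_html, base_url, marker):
    """Return absolute links containing `marker`, in order, without repeats."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    seen = []
    for link in collector.links:
        if marker in link and link not in seen:
            seen.append(link)
    return seen


def fetch(url):
    """Download a page (assumes UTF-8; adjust if the site says otherwise)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, parse_page, fetch_page=fetch):
    """Fetch each URL and pass the HTML to `parse_page`, one record per page."""
    return [parse_page(fetch_page(url)) for url in urls]


# Made-up index markup; no network access is needed for this part.
index_html = ('<a href="member-1.html">A</a> <a href="about.html">About</a> '
              '<a href="member-2.html">B</a>')
print(member_urls(index_html, INDEX_URL, "member-"))
```

Passing the fetch function in as a parameter means the loop can be tried against saved copies of the pages before pointing it at the live site.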

However, this is messy and won't work without some cleaning up. I tried to put the find-and-replace function into the previous step, but it didn't work (probably something to do with the way I wrote it). So I just ran it afterwards, in conjunction with some find-and-replace passes in a text editor (which makes things go much faster).

In the end, you should have a CSV with all the contact information of the MPs! There are usually some problems with characters such as é, which you have to find and replace into the right format.
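One way to handle the accented characters while writing the CSV, sketched in Python. It assumes the trouble comes from HTML entities such as `&eacute;`. If it is a character-set mismatch instead, the fix is to decode the downloaded bytes with the page's declared encoding. The column names and the sample record are made up.

```python
import csv
import html
import io

FIELDS = ["name", "phone", "email"]  # hypothetical column names


def normalise(value):
    """Turn HTML entities (e.g. &eacute;) into real characters and trim."""
    return html.unescape(value).strip()


def write_csv(records, out):
    """Write `records` (a list of dicts) to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow({field: normalise(record.get(field, "")) for field in FIELDS})


# Made-up record, only to show the call.
records = [{"name": "Andr&eacute; Exemple", "phone": " 555-0100 ",
            "email": "andre@example.org"}]
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", encoding="utf-8", newline="")` keeps the accents intact and avoids blank lines between rows on Windows.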

... or you could just download the names (attached), but where is the fun in that?!

About The Author

Mike Gifford is the founder of OpenConcept Consulting Inc, which he started in 1999. Since then, he has been particularly active in developing and extending open source content management systems to allow people to get closer to their content. Before starting OpenConcept, Mike had worked for a number of national NGOs including Oxfam Canada and Friends of the Earth.