Virginia Web Scraping

Virginia Data Scraping, Web Scraping Tennessee, Data Extraction Tennessee, Scraping Web Data, Website Data Scraping, Email Scraping Tennessee, Email Database, Data Scraping Services, Scraping Contact Information, Data Scrubbing

Monday, 15 December 2014

Local ScraperWiki Library

It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.

How to use
pip install scraperwiki_local
A dump truck dumping its payload

You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.

pip install dumptruck

DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:

Complex cell values
What happens if you do this?
import scraperwiki
shopping_list = ['carrots', 'orange juice', 'chainsaw'][], {'shopping_list': shopping_list})
On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:
[u'carrots', u'orange juice', u'chainsaw']
In the local version, it is encoded to JSON, so it looks like this:
["carrots","orange juice","chainsaw"]

And if it can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.

Case-insensitive column names
SQL is less sensitive to case than Python. The following code works fine in both versions of the library.

In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]:[], {'shopping_list': shopping_list})
In [3]:[], {'sHOpPiNg_liST': shopping_list})
In [4]:'* from swdata')

Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]

Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.

In [5]: data ='* from swdata')
In [6]: print([row['shopping_list'] for row in data])
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]

The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.

In [7]: print(data[0]['Shopping_List'])
Out[7]: [u'carrots', u'orange juice', u'chainsaw']

Here’s why. In the hosted version, returns a list of ordinary dictionaries. In the local version, returns a list of special dictionaries that have case-insensitive keys.

Develop locally

Here’s a start at developing ScraperWiki scripts locally, with whatever coding environment you are used to. For a lot of things, the local library will do the same thing as the hosted. For another lot of things, there will be differences and the differences won’t matter.

If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!


No comments:

Post a Comment