Journalism Data Camp NY potential data sets

Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday.

This list is only a backup in case people don’t turn up with enough ideas on the day (it never happens, but it’s always a fear).

1. Iowa accident reports

The site contains all the police roadside reports of accidents. It’s easy to scrape because the database ids are consecutive numbers, so you can simply count through them.

And it contains thousands of rinky-dink diagrams of the incidents.

The first step is to copy all the HTML from each page into one database. The second step is to scan through all these pages and progressively extract more and more data from them.
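The two steps can be sketched roughly as below. This is only an illustration: `fetch` is a hypothetical function you would write yourself to download one report by its id (the real URL pattern is not given here), and the table and column names are made up for the example.

```python
import re
import sqlite3


def store_raw_pages(conn, report_ids, fetch):
    """Step 1: save the untouched HTML of each report, keyed by its id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages "
        "(report_id INTEGER PRIMARY KEY, html TEXT)"
    )
    for report_id in report_ids:
        already = conn.execute(
            "SELECT 1 FROM raw_pages WHERE report_id = ?", (report_id,)
        ).fetchone()
        if already:
            continue  # saved on an earlier run, no need to download again
        html = fetch(report_id)  # hypothetical downloader supplied by the caller
        if html is None:
            continue  # a gap in the id sequence
        conn.execute(
            "INSERT INTO raw_pages (report_id, html) VALUES (?, ?)",
            (report_id, html),
        )
    conn.commit()


def extract_field(conn, name, pattern):
    """Step 2: re-scan the saved HTML and pull out one more named field.

    `pattern` is a regular expression with one capture group.
    Returns the number of pages in which the field was found.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fields "
        "(report_id INTEGER, name TEXT, value TEXT, "
        "PRIMARY KEY (report_id, name))"
    )
    found = 0
    rows = conn.execute("SELECT report_id, html FROM raw_pages").fetchall()
    for report_id, html in rows:
        match = re.search(pattern, html)
        if match:
            conn.execute(
                "INSERT OR REPLACE INTO fields VALUES (?, ?, ?)",
                (report_id, name, match.group(1)),
            )
            found += 1
    conn.commit()
    return found
```

Because the raw pages are kept, you can call `extract_field` again with a new name and pattern every time you spot something else worth pulling out, without hitting the site a second time.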

Contrast with the dataset of accidents available for the UK.

2. South Dakota state budget information

An apparently complete set of expenditures, contracts and revenues, disclosed in a form that is easy to scrape (some datasets even allow CSV download). Many states do this, with varying degrees of success.

Use this case to learn how to restructure and analyse financial accountancy flow information. Can you find any contracts that have suddenly been dropped in favour of another supplier?
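One possible way to look for dropped contracts, once the payments are in a table, is to compare the set of suppliers paid for each service from one year to the next. The sketch below assumes each payment has been reduced to a `(year, service, supplier, amount)` tuple; those field names are my own invention and the real data will need restructuring to fit.

```python
from collections import defaultdict


def supplier_changes(payments):
    """Find services where one supplier disappears as another appears.

    payments: iterable of (year, service, supplier, amount) tuples.
    Returns a list of (service, year, dropped, added), where `year` is the
    first year in which the change shows up.
    """
    # service -> year -> set of suppliers that were actually paid
    by_service = defaultdict(lambda: defaultdict(set))
    for year, service, supplier, amount in payments:
        if amount > 0:
            by_service[service][year].add(supplier)

    changes = []
    for service in sorted(by_service):
        years = by_service[service]
        ordered = sorted(years)
        # Compare each year that has data with the next year that has data.
        for prev, curr in zip(ordered, ordered[1:]):
            dropped = years[prev] - years[curr]
            added = years[curr] - years[prev]
            if dropped and added:
                changes.append((service, curr, sorted(dropped), sorted(added)))
    return changes
```

A hit only says that the payee changed; whether the contract was really dropped in favour of someone else is a question for a phone call.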

3. New York School budgets

The site requires a school code. Try “M411”.

Apparently there is a spreadsheet of school codes.

Is there anything interesting to plot across all schools, such as the PSAL SNAPPLE FUNDS?

4. New York Lobbying registers

Lobbying at the state and city level. Some of this is challenging.

Is there a cross-over between the jurisdictions? Can you uniquely identify the corporate interests and relate them to the legislative or regulatory program?
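A crude first pass at the cross-over question is to normalise the client names in both registers and see which ones match. The sketch below assumes you already have two plain lists of names, one per jurisdiction; the list of suffixes to strip is a guess and will need extending, and real identification of corporate interests takes more than string matching.

```python
import re

# Common company suffixes to ignore when comparing names (an assumed list).
SUFFIXES = {"inc", "incorporated", "llc", "llp", "ltd", "corp",
            "corporation", "co", "company"}


def normalise(name):
    """Lower-case, strip punctuation and drop trailing company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def group_by_key(names):
    """Map each normalised key to the original spellings that produced it."""
    groups = {}
    for name in names:
        groups.setdefault(normalise(name), []).append(name)
    return groups


def crossover(state_names, city_names):
    """Return names that appear in both registers after normalisation."""
    state = group_by_key(state_names)
    city = group_by_key(city_names)
    shared = sorted(key for key in state.keys() & city.keys() if key)
    return {key: (state[key], city[key]) for key in shared}
```

Keeping the original spellings alongside each match makes it easier to check by eye whether two entries really are the same organisation.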

5. Court case information

Search the site (try “Lockheed”). It is not obvious where the information is.

The New York City courts are behind a captcha. Maybe better luck with the New York State courts.

Court datasets are usually very difficult to obtain and jealously protected. The legal process resists modernization and is universally paper-based. Electronic documents (contracts, settlements, filings) almost always turn out to be image scans of papers.

6. New York City Police crime data

There are weekly PDFs for each police precinct. These are taken down and replaced by the next one, so there is no historical record.

Luckily someone has scraped the data since 2010, though the numbers may need some processing before you map them.
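If you want to keep your own history of the weekly PDFs from now on, something like the following sketch would do: save each download under a dated filename, and skip it when an identical file is already in the archive. The download itself is left out because the URLs are not given here; `content` is assumed to be the raw bytes of one precinct’s PDF, and the folder layout is just one possible choice.

```python
import hashlib
from datetime import date
from pathlib import Path


def archive_if_changed(content, precinct, archive_dir, today=None):
    """Save this week's PDF unless an identical copy is already archived.

    content: bytes of the downloaded PDF.
    Returns the path written, or None if nothing new was saved.
    """
    today = today or date.today()
    folder = Path(archive_dir) / str(precinct)
    folder.mkdir(parents=True, exist_ok=True)

    # A short content hash in the filename lets us spot repeats cheaply.
    digest = hashlib.sha256(content).hexdigest()[:16]
    if any(folder.glob(f"*_{digest}.pdf")):
        return None  # the site has not replaced the file since the last run

    path = folder / f"{today.isoformat()}_{digest}.pdf"
    path.write_bytes(content)
    return path
```

Run once a day per precinct, this leaves one file for each distinct weekly report, named by the date it was first seen.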

7. New York State gas drilling permits

These are available but don’t seem to have been updated recently. What’s going on?

Wouldn’t it be nice to make another twitterbot to be friends with NorthSeaOil1?

Don’t forget to read the Well ownership transfers.

About Julian Todd

Co-creator of,, and Also writes software for running machine tools and drawing cave maps.

2 Responses to Journalism Data Camp NY potential data sets

  1. Pingback: 100 Years of history…and I just hope that we do it justice… | ScraperWiki Data Blog

  2. pallih says:

    A while back I wrote a scraper for accident reports in Iceland:
