I’m just about to return from Prague, Czech Republic, where I gave a workshop at the Big Clean. What a nice little conference this was!
It had two tracks, the talks and the workshop, so I didn’t get to see many of the talks :(. But this meant I had the whole day to teach people about cleaning up data.
We started with some overview thoughts on cleaning up data, and then I went through the architecture of an analog data-cleaning process, like it might have been done 30 years ago and is still done more often than you’d realize. I work with computers enough that I can’t stand them, so I drew out a diagram on paper instead of using slides.
The fun part happens when we realize that the architecture is the same when we digitize the process. Once we realize this, the process seems less magic; it’s just a faster version of what people would do. Also, when you break up the project like this, it’s easier to work on it in stages.
I went through the writing of a simple web scraper script, then we broke for lunch, which was prepared by HotKarot using big data experimental social media crowdsourced realtime open-source catering methodologies.
After stuffing ourselves at lunch, we worked on some of the participants’ projects/ideas.
- We added a column to a spreadsheet of Czech municipality characteristics by finding municipality areas on another website.
- We talked about various approaches to parsing PDF documents for one of Juha’s projects.
- We pulled the song-play history out of Last.fm. (I unfortunately don’t recall whose account we were looking at.) Last.fm exposes loads of data about your activities through its surprisingly convenient API, and this gets interesting if you’ve been using Last.fm for seven years.
- The Dutch acronym for FOI is way better; it’s “WOB”. Wobbing is way better than foiaing or foiing or foiling. Wob wob wob.
- Diacritics are the bane of the Czech programmer’s existence.
- Ask your public servants what you should wob them for.
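
To make the diacritics point concrete, here is a minimal Python sketch of the kind of matching you need when you add a column from a second source, as with the municipality areas. It is not the code we wrote in the workshop. The function names, the `municipality` and `area_km2` column names, and the area values are all hypothetical placeholders. The idea is to compare names on a key with the diacritics stripped, so that "Plzeň" in one source still matches "Plzen" in the other.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, so 'Plzeň' becomes 'Plzen'."""
    # NFKD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(name):
    """Build a forgiving comparison key: no diacritics, no case, no outer spaces."""
    return strip_diacritics(name).casefold().strip()


def add_area_column(rows, areas_by_name):
    """Return copies of rows with a hypothetical 'area_km2' column added.

    rows: list of dicts that each have a 'municipality' key (assumed name).
    areas_by_name: dict of municipality name -> area, from the other source.
    Rows with no match get None rather than a guess.
    """
    lookup = {match_key(name): area for name, area in areas_by_name.items()}
    return [
        dict(row, area_km2=lookup.get(match_key(row["municipality"])))
        for row in rows
    ]


# Placeholder data; the area values are made up for illustration only.
spreadsheet = [{"municipality": "Plzen"}, {"municipality": "České Budějovice"}]
scraped_areas = {"Plzeň": 1.0, "Ceske Budejovice": 2.0}
with_areas = add_area_column(spreadsheet, scraped_areas)
```

One caveat: stripping diacritics can make two genuinely different names collide, so it is worth checking the keys for duplicates before trusting the join.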