In Italy, many prisoners die in jail every month. According to an independent dossier by the Italian non-profit association Ristretti Orizzonti (lit. Narrow Horizons), almost one thousand deaths were registered over the last ten years (2002-2012). Most of the detainees who died committed suicide (57%) or died of illness (20%). A further significant share died in unclear circumstances (19%); the rest died of drug overdose (26 cases) or homicide (11 cases). Such a death rate underlines the inadequate conditions in which prisoners are forced to live.
The data journalism project I am introducing here is intended to map deaths in prisons by location and cause. The whole project has been published by Il Fatto Quotidiano (lit. The Daily Fact, a young yet leading newspaper in Italy) and on the Guardian Datablog. An English version is also available.
The development process consisted of a series of steps, each carried out with free-of-charge tools. ScraperWiki was one of them: two scrapers were coded, extracting 1276 rows of data.
Two sources of data were scraped and cross-referenced: the dossier released by Ristretti Orizzonti and the data published on the Ministry of Justice website. The former lists the actual casualties, reporting names, dates, causes and prisons’ names; the latter provides the addresses of prisons in Italy.
An overview of the development follows.
- 1st scraper (sourcecode and data) – The first scraper processed the Excel file containing the list of deaths in prisons. Each record reported the dead detainee’s personal details (name, surname and age), the date and cause of death, and the name of the prison where the detainee died (but not the address, which was taken from the 2nd source).
The XLS Python module made the scraping straightforward. The only problem came from a couple of malformed records that did not contain proper Excel dates (i.e., the dates were stored as strings rather than in date format). These records had to be corrected manually. After that, the scraper ran smoothly.
- 2nd scraper (sourcecode and data) – Prisons’ addresses and contact details are published by the Italian Ministry of Justice. A PHP scraper crawled the whole list of prisons, which was spread over a set of consecutive pages, and parsed the addresses and contact details stored in HTML tables. Some cells contained semi-structured text (e.g., telephone numbers sometimes started with “Tel.” and sometimes with “Telefono:”), but simple regular expressions could cope with that.
- Preprocessing – Once the data had been crawled and scraped, both tables were refined with Google Refine. GREL string functions and clustering methods were used to improve data consistency. Common transforms were performed as well, such as trimming, unescaping HTML entities and collapsing consecutive whitespace.
- Join – The tables were then merged by matching prison names in the first table to city names in the second one. This “unnatural join” was based on string inclusion. Some prisons had to be associated explicitly, since a few records in the Excel file did not contain the city name (this happened with prisons better known by unofficial names, such as “Rebibbia”, the common name for the Prison of Rome).
- Geolocation – The final table’s records contained all the dead detainees’ data and the geographical address of the prison where they died. Using Batchgeo, each casualty was geolocated on the map of Italy. Markers represent single casualties and were clustered into pie charts that aggregate the causes of death at different zoom levels.
- Error checking – Errors mainly fell into two categories. First, there were incorrectly joined records. These occurred because some cities have more than one prison, so string inclusion did not work. To fix that, explicit rules were created (e.g., “Rebibbia” => “Prison of Rome”). This problem would not arise if each prison (in both datasets) were marked with a unique name/ID, making a natural join straightforward. Second, there were geolocation mismatches due to Google Maps disambiguation problems, made harder by the complex Italian toponymy. These errors were fixed manually.
- Publishing (Italian, English) – Browsing the map brought many stories to light. Some of them were published along with the map.
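As a note on the first scraper: the malformed date cells were corrected by hand, but the same problem could also be handled in code. The following is only a minimal sketch, not the scraper's actual code. It assumes that a well-formed cell arrives as an Excel serial number (1900 date system) and that a malformed cell is a string in one of a few day-first formats. The function name `parse_cell_date` and the list of text formats are hypothetical.

```python
from datetime import date, datetime, timedelta

# Excel's 1900 date system counts days from this epoch. Using 1899-12-30
# absorbs Excel's fictitious 29 February 1900, so serials from 61 onward
# map to the correct calendar date.
EXCEL_EPOCH = date(1899, 12, 30)

# Assumed (hypothetical) text formats for the malformed cells.
TEXT_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y")


def parse_cell_date(value):
    """Return a date from either an Excel serial number or a date string."""
    if isinstance(value, (int, float)):
        # Proper Excel date: drop the time-of-day fraction, add the days.
        return EXCEL_EPOCH + timedelta(days=int(value))
    text = str(value).strip()
    for fmt in TEXT_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date cell: {value!r}")


if __name__ == "__main__":
    print(parse_cell_date(40909.0))       # a numeric Excel cell
    print(parse_cell_date("01/01/2012"))  # a cell stored as text
```

Raising an error on anything unrecognised keeps the remaining odd records visible, so they can still be reviewed manually.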
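For the second scraper, the inconsistent telephone labels can be stripped with a single regular expression. The original scraper was written in PHP; the sketch below shows the same idea in Python. The pattern, the function name `extract_phone` and the sample numbers are illustrative assumptions, not taken from the real data.

```python
import re

# Matches a leading "Tel", "Tel.", "Telefono" or "Telefono:" label,
# case-insensitively, together with the surrounding whitespace.
PHONE_LABEL = re.compile(r"^\s*(?:Telefono|Tel)\s*[.:]?\s*", re.IGNORECASE)


def extract_phone(cell_text):
    """Strip the label from a telephone cell and return the bare number."""
    return PHONE_LABEL.sub("", cell_text).strip()


if __name__ == "__main__":
    # Made-up sample cells, for illustration only.
    for cell in ("Tel. 06 0000000", "Telefono: 06 0000000", "06 0000000"):
        print(repr(extract_phone(cell)))
```

Cells without a label pass through unchanged, so the function can be applied to the whole column.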
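The common transforms from the preprocessing step were applied in Google Refine, but they are easy to reproduce elsewhere. As a rough Python equivalent, using only the standard library (the function name `clean_cell` is hypothetical):

```python
import html
import re


def clean_cell(text):
    """Unescape HTML entities, collapse whitespace runs, and trim."""
    text = html.unescape(text)        # e.g. "&amp;" becomes "&"
    text = re.sub(r"\s+", " ", text)  # any run of whitespace becomes one space
    return text.strip()


if __name__ == "__main__":
    print(repr(clean_cell("  L&#39;Aquila \n ")))
    print(repr(clean_cell("Reggio    Emilia")))
```

This covers only the mechanical cleanup; the clustering of near-duplicate values is a separate task that this sketch does not attempt.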
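The join step, together with the explicit rules added during error checking, can be sketched as follows. This illustrates string inclusion with manual overrides and is not the project's actual code. The function name `match_city`, the sample city list and the choice of preferring the longest matching city name are assumptions. The override table follows the “Rebibbia” example above and uses the Italian city name “Roma”.

```python
def match_city(prison_name, cities, overrides):
    """Return the city for a prison name, or None if nothing matches.

    Explicit overrides are checked first; otherwise a city matches when
    its name is contained in the prison name (case-insensitive).
    """
    name = prison_name.casefold()
    for alias, city in overrides.items():
        if alias.casefold() in name:
            return city
    hits = [city for city in cities if city.casefold() in name]
    # Assumption: when several city names are contained, prefer the longest.
    return max(hits, key=len) if hits else None


if __name__ == "__main__":
    # Illustrative sample values, not rows from the real tables.
    cities = ["Roma", "Milano", "Reggio Emilia"]
    overrides = {"Rebibbia": "Roma"}
    for prison in ("Rebibbia", "Casa Circondariale di Milano", "Sconosciuto"):
        print(prison, "->", match_city(prison, cities, overrides))
```

Records that return `None` are the ones that need a new explicit rule or a manual fix. A shared unique ID in both datasets would make this matching unnecessary.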