Cleaner Data

Geoffrey Hing (@geoffhing)

http://ghing.github.io/cleaner-data/

My kind of town

city_state
CCHICAGO, IL
CDHICAGO
CHHICAGO IL
CHICAGO CH
CHICAHO IL
CHCAGO IL
CHCAGO IL
CHCAGO IL
CHCAIGO IL
CHCIACO IL
CHCIAGO
...
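
One way to collapse variants like these is fuzzy matching against a list of known-good names. The sketch below uses Python's standard-library difflib. The CANONICAL list, the clean_city helper, and the 0.7 cutoff are assumptions made for illustration, not part of the original project.

```python
import difflib

# Hypothetical list of known-good city names; a real list would be longer.
CANONICAL = ["CHICAGO"]


def clean_city(raw, cutoff=0.7):
    """Return the closest canonical city name, or None if nothing is close."""
    tokens = raw.upper().replace(",", " ").split()
    # Assumption: a trailing two-letter token is a state code (or a typo of one).
    if len(tokens) > 1 and len(tokens[-1]) == 2:
        tokens = tokens[:-1]
    name = " ".join(tokens)
    matches = difflib.get_close_matches(name, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A cutoff that is too low will merge genuinely different cities, so the matches are worth reviewing by hand before they are written back.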

Multiple values in one column

statute
720-5\8-4(19-1)
720-5\8-4(18-5)
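
A regular expression can pull a value like this apart into separate fields. The sketch below is a minimal example. The field names are hypothetical, and reading a parenthesized part that contains a hyphen as a second statute section (rather than a subsection) is an assumption based on the sample rows.

```python
import re

STATUTE_RE = re.compile(
    r"^(?P<chapter>\d+)-(?P<act>\d+)[\\/]"  # "720-5\" or "720-5/"
    r"(?P<section>\w+(?:-[\w.]+)?)"         # "8-4", "16A-3", "402"
    r"(?P<parens>(?:\([^()]*\))*)$"         # "(19-1)", "(a)(1)", or nothing
)
# Assumption: a parenthesized "NN-NN" is another section, not a subsection.
SECTION_RE = re.compile(r"^\w+-[\w.]+$")


def split_statute(raw):
    """Split a statute string into its parts, or return None if it doesn't parse."""
    m = STATUTE_RE.match(raw.strip())
    if m is None:
        return None
    groups = re.findall(r"\(([^()]*)\)", m.group("parens"))
    return {
        "chapter": m.group("chapter"),
        "act": m.group("act"),
        "section": m.group("section"),
        "subsections": [g for g in groups if not SECTION_RE.match(g)],
        "related_sections": [g for g in groups if SECTION_RE.match(g)],
    }
```

Values that do not parse come back as None, which makes it easy to count and inspect them rather than silently mangling them.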

Different references over time

statutechrgdesc
38-9-1-A(1)MURDER
38-9-1-A(1)MURDER
38-9-1-A(2)MURDER
720-5/9-1(a)(1)MURDER/INTENT TO KILL/INJURE
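
One possible approach is to split the statute from the description and rewrite older-style references into the newer style through a lookup table. In the sketch below, the OLD_TO_NEW table and the normalize helper are hypothetical. The single table entry is inferred from the sample rows above and should be checked against the statutes before it is relied on.

```python
import re

# Hypothetical lookup from older-style to newer-style references.
# The one entry here is an assumption inferred from the sample rows.
OLD_TO_NEW = {"38-9-1": "720-5/9-1"}

# Older style: "38-9-1-A(1)MURDER"
OLD_RE = re.compile(
    r"^(?P<base>\d+-\d+-\d+)-(?P<sub>[A-Z])\((?P<num>\d+)\)(?P<desc>.*)$"
)
# Newer style: "720-5/9-1(a)(1)MURDER/INTENT TO KILL/INJURE"
NEW_RE = re.compile(
    r"^(?P<base>\d+-\d+/[\w.]+-[\w.]+)\((?P<sub>[a-z])\)\((?P<num>\d+)\)(?P<desc>.*)$"
)


def normalize(raw):
    """Return (statute, description) in the newer style, or None if unknown."""
    raw = raw.strip()
    m = NEW_RE.match(raw)
    if m:
        base = m.group("base")
    else:
        m = OLD_RE.match(raw)
        if m is None or m.group("base") not in OLD_TO_NEW:
            return None
        base = OLD_TO_NEW[m.group("base")]
    statute = "{}({})({})".format(base, m.group("sub").lower(), m.group("num"))
    return statute, m.group("desc").strip()
```

Returning None for anything not in the table keeps unknown references visible instead of guessing at a mapping.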

Encoded values

minsent
5
90
24000
24000
24000
10
300000
14

Quick data stats


csvstat -c statute data/Criminal_Convictions_ALLCOOK_05-09.csv 
14. statute
<type 'unicode'>
Nulls: True
Unique values: 1616
5 most frequent values:
720-570/402(c):29290
720-5/19-1(a):14697
720-5/16A-3(a):13613
720-570/401(c)(2):13415
720-570/401(d)(i):10959
Max length: 27

Row count: 321590

Or facet and cluster the column in OpenRefine.
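
The same kind of summary can be approximated in a few lines of Python when the stats need to feed into a script. This is a rough sketch using only the standard library, not a reimplementation of csvstat. The column_stats helper and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def column_stats(rows, column, top=5):
    """Summarize one column of an iterable of dict rows."""
    values = [row[column] for row in rows]
    non_empty = [v for v in values if v != ""]
    counts = Counter(non_empty)
    return {
        "row_count": len(values),
        "nulls": len(non_empty) < len(values),
        "unique": len(counts),
        "most_common": counts.most_common(top),
        "max_length": max((len(v) for v in non_empty), default=0),
    }


# Made-up sample standing in for the real CSV file.
sample = io.StringIO(
    "case,statute\n"
    "1,720-570/402(c)\n"
    "2,720-5/19-1(a)\n"
    "3,720-570/402(c)\n"
    "4,\n"
)
stats = column_stats(csv.DictReader(sample), "statute")
```

For the real file, open it with open(path, newline="") and pass the csv.DictReader to column_stats in the same way.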

Data pipelines

Example from OpenElections

openelex.us.md.transform

class RemoveBaltimoreCityComptroller(BaseTransform):
    """
    Remove Baltimore City comptroller results.

    Maryland election results use the string "Comptroller" for both the 
    state comptroller and the Baltimore City Comptroller.  We're only
    interested in the state comptroller.

    """
    name = 'remove_baltimore_city_comptroller'

    def __call__(self):
        election_id = 'md-2004-11-02-general'
        office = Office.objects.get(state='MD', name='Comptroller')
        Contest.objects.filter(election_id=election_id, office=office).delete()
        Candidate.objects.filter(election_id=election_id,
            contest_slug='comptroller').delete()
        Result.objects.filter(election_id=election_id,
            contest_slug='comptroller').delete()
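
The general pattern, stripped of the database, might look like the sketch below: each cleaning step is a small, named, documented callable, and a runner applies them in order. RowTransform, the two example steps, and run_pipeline are hypothetical names for illustration. This is not the OpenElections API, whose transforms operate on stored models rather than on rows passed in.

```python
class RowTransform:
    """Hypothetical base class: one named cleaning step over a list of dicts."""
    name = None

    def __call__(self, rows):
        raise NotImplementedError


class StripWhitespace(RowTransform):
    """Trim leading and trailing whitespace from every string value."""
    name = 'strip_whitespace'

    def __call__(self, rows):
        return [
            {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows
        ]


class DropBlankCity(RowTransform):
    """Drop rows whose city_state value is empty."""
    name = 'drop_blank_city'

    def __call__(self, rows):
        return [row for row in rows if row.get('city_state')]


def run_pipeline(rows, transforms):
    """Apply each transform in order and record which steps ran."""
    applied = []
    for transform in transforms:
        rows = transform(rows)
        applied.append(transform.name)
    return rows, applied
```

Keeping each step separate, with its reasoning in the docstring, means the list of applied names doubles as a record of what was done to the data.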

Example from cook-convictions-data

convictions_data.models

def _load_field_minsent(self, val):
    (self.minsent_years, self.minsent_months, self.minsent_days,
     self.minsent_life, self.minsent_death) = self._parse_sentence(val)
    return self
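
The parsing step itself is not shown on the slide. A standalone sketch of what a decoder for the minsent values could look like follows. It rests on two loudly stated assumptions: that the values use a YYYMMDDD layout with leading zeros dropped, and that life and death sentences are marked by sentinel values. Both the layout and the LIFE and DEATH constants are guesses that should be checked against the data dictionary.

```python
# ASSUMPTION: values are laid out as YYYMMDDD with leading zeros dropped,
# so "24000" would mean 24 months and "300000" would mean 3 years.
# ASSUMPTION: these sentinel values are hypothetical placeholders.
LIFE = "88888888"
DEATH = "99999999"


def parse_sentence(val):
    """Return (years, months, days, life, death) for an encoded sentence."""
    val = val.strip()
    if val == LIFE:
        return None, None, None, True, False
    if val == DEATH:
        return None, None, None, False, True
    padded = val.zfill(8)  # restore the dropped leading zeros
    years = int(padded[0:3])
    months = int(padded[3:5])
    days = int(padded[5:8])
    return years, months, days, False, False
```

Under these assumptions, the decoded fields line up with the five attributes assigned in _load_field_minsent.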

The End