I don't have all the answers, but I run into this a lot.
My goal is to facilitate a discussion about common problems, tools, and techniques.
Most of my recent experience is working with a dataset of convictions in Cook County and with elections data from a number of states.
My kind of town
city_state
CCHICAGO, IL
CDHICAGO
CHHICAGO IL
CHICAGO CH
CHICAHO IL
CHCAGO IL
CHCAGO IL
CHCAGO IL
CHCAIGO IL
CHCIACO IL
CHCIAGO
...
At some point in the data processing before this got to us, the CSV got mangled and these two fields were merged into one.
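One way to start untangling a merged field like this is to peel off a trailing two-letter state code when one is present. A minimal sketch (the function and pattern are mine, not from the project):

    import re

    # Peel a trailing two-letter state code off a merged "city_state"
    # value, if one is present.
    STATE_RE = re.compile(r'[,\s]+([A-Z]{2})$')

    def split_city_state(value):
        match = STATE_RE.search(value)
        if match:
            return value[:match.start()].strip(), match.group(1)
        return value.strip(), None

    # split_city_state("CHCAGO IL")   -> ("CHCAGO", "IL")
    # split_city_state("CDHICAGO")    -> ("CDHICAGO", None)

This only separates the fields; the misspellings themselves still need their own cleanup.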
Multiple values in one column
statute
720-5\8-4(19-1)
720-5\8-4(18-5)
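A sketch of pulling the two references apart, assuming the layout suggested by the samples above (a statute reference followed by a second, parenthesized reference):

    import re

    # Assumed layout from the samples: "<statute>(<second reference>)".
    PACKED_RE = re.compile(r'^(?P<statute>[^(]+)\((?P<target>[^)]+)\)$')

    def split_statute(value):
        match = PACKED_RE.match(value)
        if match:
            return match.group('statute'), match.group('target')
        return value, None

    # split_statute(r'720-5\8-4(19-1)') -> (r'720-5\8-4', '19-1')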
Different references over time
statute          chrgdesc
38-9-1-A(1)      MURDER
38-9-1-A(1)      MURDER
38-9-1-A(2)      MURDER
720-5/9-1(a)(1)  MURDER/INTENT TO KILL/INJURE
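One way to handle references that changed over time is a crosswalk table that maps old citations to current ones. A hypothetical sketch covering only the rows above (a real crosswalk would be built from a legal reference, not hard-coded):

    # Hypothetical crosswalk from older statute references to current
    # ones; only the samples above are included.
    STATUTE_CROSSWALK = {
        '38-9-1-A(1)': '720-5/9-1(a)(1)',
        '38-9-1-A(2)': '720-5/9-1(a)(2)',
    }

    def normalize_statute(value):
        # Fall back to the original value if we don't have a mapping.
        return STATUTE_CROSSWALK.get(value, value)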
Encoded values
minsent
5
90
24000
24000
24000
10
300000
14
The columns for minimum and maximum sentences were encoded as 8-digit values. The leftmost three digits were years, the middle two were months, and the rightmost three were days. There were special values for life and death sentences. The values weren't zero-padded.
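A sketch of decoding that format by zero-padding and slicing. The life and death sentinels below are placeholders, not the dataset's real special values:

    # Placeholder sentinels; the dataset used its own special values
    # for life and death sentences, which aren't reproduced here.
    LIFE_SENTINEL = 'LIFE'
    DEATH_SENTINEL = 'DEATH'

    def parse_sentence(val):
        """Return (years, months, days, is_life, is_death)."""
        raw = str(val)
        if raw == LIFE_SENTINEL:
            return 0, 0, 0, True, False
        if raw == DEATH_SENTINEL:
            return 0, 0, 0, False, True
        # Zero-pad to 8 digits, then slice: YYY MM DDD.
        padded = raw.zfill(8)
        return int(padded[:3]), int(padded[3:5]), int(padded[5:]), False, False

    # parse_sentence(24000)  -> (0, 24, 0, False, False)  # 24 months
    # parse_sentence(300000) -> (3, 0, 0, False, False)   # 3 years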
Does anyone else have any stories about dirty data that they'd like to share?
In other projects I've run into issues with:
* Unexpected character encodings
* Non-structured data (PDFs converted to text)
* Duplicate rows (a quick dedup sketch follows this list)
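For the duplicate-row case, dropping exact repeats is usually a one-liner. A sketch with pandas (the file names are just examples):

    import pandas as pd

    # Drop rows that repeat exactly; keep='first' retains one copy of each.
    df = pd.read_csv('data/results.csv')
    df = df.drop_duplicates(keep='first')
    df.to_csv('data/results_deduped.csv', index=False)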
I prefer coding or command-line tools because that's what's familiar and fast to me. You can probably do a lot of these things using a spreadsheet application.
I prefer scripted or command-line solutions because:
* It's easier to replay your work, make your changes explicit, and share your work with others
* I've had problems with spreadsheet applications on larger datasets (~80)
With OpenElections we were able to find errors in published results data by comparing values between different reporting levels.
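A sketch of that kind of consistency check, assuming hypothetical column names for the two reporting levels:

    import pandas as pd

    # Hypothetical columns: candidate, reporting_level ('county' or
    # 'precinct'), county, and a vote count.
    results = pd.read_csv('data/results.csv')

    county = results[results['reporting_level'] == 'county']
    precinct = results[results['reporting_level'] == 'precinct']

    # Sum precinct votes up to the county level and compare against
    # the published county-level numbers.
    rollup = (precinct.groupby(['county', 'candidate'])['votes']
              .sum().reset_index())
    merged = county.merge(rollup, on=['county', 'candidate'],
                          suffixes=('_reported', '_rollup'))
    mismatches = merged[merged['votes_reported'] != merged['votes_rollup']]
    print(mismatches)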
Open your data in a spreadsheet and visually inspect it.
Sorting alphabetically can help you find unexpected values.
Quick data stats
csvstat -c statute data/Criminal_Convictions_ALLCOOK_05-09.csv
14. statute
<type 'unicode'>
Nulls: True
Unique values: 1616
5 most frequent values:
720-570/402(c):29290
720-5/19-1(a):14697
720-5/16A-3(a):13613
720-570/401(c)(2):13415
720-570/401(d)(i):10959
Max length: 27
Row count: 321590
Or use facet/cluster in Refine.
If data in one column is dirty, you might be able to determine the correct value from the others.
Avoid unnecessary cleaning. For example: the geocoder I used worked well with just street address and zip code. The zip code field was pretty clean even though city and state values were missing or mangled.
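When you do need to recover a dirty value, a lookup keyed on a cleaner column can do it. A hypothetical sketch that replaces the mangled city with one derived from the zip code:

    # Hypothetical zip-to-city lookup; in practice you'd build this
    # from a reference dataset rather than hard-coding it.
    ZIP_TO_CITY = {
        '60601': 'CHICAGO',
        '60602': 'CHICAGO',
    }

    def city_from_zip(row):
        # Fall back to the original (possibly mangled) value if the
        # zip code isn't in the lookup.
        return ZIP_TO_CITY.get(row.get('zip_code'), row.get('city_state'))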
Data pipelines
As you load or inspect your data, you may notice dirty values. While you could fix some values at different points in the process, it's best to do one thing at a time with your data.
You should document your changes and assumptions. At the very least, use a paper notepad or a text file; at best, write scripts. For the in-between, I've found IPython Notebooks to be a useful tool.
Example from OpenElections
openelex.us.md.transform
# Imports assumed from the openelex codebase; exact paths may differ.
from openelex.base.transform import BaseTransform
from openelex.models import Candidate, Contest, Office, Result


class RemoveBaltimoreCityComptroller(BaseTransform):
    """
    Remove Baltimore City comptroller results.

    Maryland election results use the string "Comptroller" for both the
    state comptroller and the Baltimore City Comptroller. We're only
    interested in the state comptroller.
    """
    name = 'remove_baltimore_city_comptroller'

    def __call__(self):
        election_id = 'md-2004-11-02-general'
        office = Office.objects.get(state='MD', name='Comptroller')
        # Delete the contest, candidates, and results that belong to
        # the city comptroller race in this election.
        Contest.objects.filter(election_id=election_id, office=office).delete()
        Candidate.objects.filter(election_id=election_id,
                                 contest_slug='comptroller').delete()
        Result.objects.filter(election_id=election_id,
                              contest_slug='comptroller').delete()
Example from cook-convictions-data
convictions_data.models
def _load_field_minsent(self, val):
    # Unpack the parsed sentence into years/months/days plus
    # life- and death-sentence flags.
    (self.minsent_years, self.minsent_months, self.minsent_days,
     self.minsent_life, self.minsent_death) = self._parse_sentence(val)
    return self