
Cleaning Dirty Data with Pandas & Python

While you can do a lot of really powerful things with Python and data analysis, your analysis is only ever as good as your dataset. And many datasets have missing, malformed, or erroneous data. It’s often unavoidable: anything from incomplete reporting to technical glitches can cause “dirty” data.

Thankfully, Pandas provides a robust library of functions to help you clean up, sort through, and make sense of your datasets, no matter what state they’re in. For our example, we’re going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year. In practice, you’ll be using much larger datasets consisting of potentially millions of rows, but this is a good sample dataset to start with.

Unfortunately, some of the fields in this dataset aren’t filled in and some of them have default values such as 0 or NaN (Not a Number).

 


No good. Let’s go through some Pandas hacks you can use to clean up your dirty data.

Getting started

To get started with Pandas, first you will need to have it installed. You can do so by running:

$ pip install pandas

Then we need to load the data we downloaded into Pandas. You can do this with a few Python commands:

import pandas as pd

data = pd.read_csv('movie_metadata.csv')

Make sure you have your movie dataset in the same folder where you’re running the Python script. If you have it stored elsewhere, you’ll need to change the read_csv parameter to point to the file’s location.

Look at your data

To check out the basic structure of the data we just read in, you can use the head() command to print out the first five rows. That should give you a general idea of the structure of the dataset.

data.head()

When we look at the dataset either in Pandas or in a more traditional program like Excel, we can start to note down the problems, and then we’ll come up with solutions to fix those problems.
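Before fixing anything, it helps to take a quick inventory of the damage. As a minimal sketch (using the dataset loaded above), Pandas can summarize column types and count the missing values per column:

# data comes from the pd.read_csv() call above
# Show column names, dtypes, and non-null counts
data.info()

# Count the missing (NaN) values in each column
print(data.isnull().sum())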

Pandas has some selection methods which you can use to slice and dice the dataset based on your queries. Let’s go through some quick examples before moving on:

  • Look at some basic stats for the ‘imdb_score’ column: data.imdb_score.describe()
  • Select a column: data['movie_title']
  • Select the first 10 rows of a column: data['duration'][:10]
  • Select multiple columns: data[['budget', 'gross']]
  • Select all movies over two hours long: data[data['duration'] > 120]
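These selections can also be combined. As a quick sketch, here’s a row filter and a column selection together, which is handy for spot-checking suspect records:

# Titles and durations of all movies over two hours long
long_movies = data[data['duration'] > 120][['movie_title', 'duration']]
print(long_movies.head())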

Deal with missing data

One of the most common problems is missing data. This could be because it was never filled out properly, the data wasn’t available, or there was a computing error. Whatever the reason, if we leave the blank values in there, they will cause errors in analysis later on. There are a few ways to deal with missing data:

  • Add in a default value for the missing data
  • Get rid of (delete) the rows that have missing data
  • Get rid of (delete) the columns that have a high incidence of missing data

We’ll go through each of those in turn.

Add default values

First of all, we should probably get rid of all those nasty NaN values. But what do we put in their place? Well, this is where you’re going to have to eyeball the data a little bit. For our example, let’s look at the ‘country’ column. It’s straightforward enough, but some of the movies don’t have a country provided, so the data shows up as NaN. In this case, we probably don’t want to guess at the country, so we can replace the NaN values with an empty string or some other default value.

data.country = data.country.fillna('')

This replaces the NaN entries in the ‘country’ column with the empty string, but we could just as easily tell it to replace with a default name such as “None Given”. You can find more information on fillna() in the Pandas documentation.
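For instance, to use a visible placeholder instead of an empty string:

# 'None Given' is just an example default; pick whatever suits your analysis
data.country = data.country.fillna('None Given')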

With numerical data like the duration of the movie, a calculation like taking the mean duration can help us even the dataset out. It’s not a great measure, but it’s an estimate of what the duration could be based on the other data. That way we don’t have values like NaN throwing off our analysis. (Note that fillna() only fills missing values; if a column uses 0 as a placeholder, you’d need to replace those zeros with NaN first.)

data.duration = data.duration.fillna(data.duration.mean())

Remove incomplete rows

Let’s say we want to get rid of any rows that have a missing value. It’s a pretty aggressive technique, but there may be a use case where that’s exactly what you want to do.

Dropping all rows with any NA values is easy:

data.dropna()
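Note that dropna() returns a new DataFrame rather than modifying data in place, so assign the result back if you want to keep it:

# Keep only the rows with no missing values
data = data.dropna()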

Of course, we can also drop rows that have all NA values:

data.dropna(how='all')

We can also put a limitation on how many non-null values need to be in a row in order to keep it (in this example, the data needs to have at least 5 non-null values):

data.dropna(thresh=5)

Let’s say for instance that we don’t want to include any movie that doesn’t have information on when the movie came out:

data.dropna(subset=['title_year'])

The subset parameter tells dropna() to only look at the columns you name, and since it takes a list, you can require several columns at once.
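For instance, to drop any row that’s missing either the release year or the director (assuming, as in this dataset, a ‘director_name’ column):

# Both columns must be non-null for a row to be kept
data.dropna(subset=['title_year', 'director_name'])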

Deal with error-prone columns

We can apply the same kind of criteria to our columns. We just need to use the parameter axis=1 in our code. That means to operate on columns, not rows. (We could have used axis=0 in our row examples, but it is 0 by default if you don’t enter anything.)

Drop the columns that are all NA values:

data.dropna(axis=1, how='all')

Drop all columns with any NA values:

data.dropna(axis=1, how='any')

The same threshold and subset parameters from above apply as well. For more information and examples, visit the Pandas documentation.

Normalize data types

Sometimes, especially when you’re reading in a CSV with a bunch of numbers, some of the numbers will read in as strings instead of numeric values, or vice versa. Here’s a way you can fix that and normalize your data types:

data = pd.read_csv('movie_metadata.csv', dtype={'duration': int})

This tells Pandas that the column ‘duration’ needs to be an integer value. One caveat: converting to int will fail if the column still contains missing values, since NaN can’t be represented as an integer, so fill or drop those first. Similarly, if we want the release year to be a string and not a number, we can do the same kind of thing:

data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})

Keep in mind that this code reads the CSV from disk again, so make sure you either normalize your data types first or dump your intermediary results to a file before doing so.
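If you’d rather convert a column without re-reading the file, one approach is astype(). Here’s a minimal sketch that fills the missing durations first (as above) so the integer conversion succeeds:

# Fill NaNs with the mean, then convert to integers (astype(int) truncates decimals)
data.duration = data.duration.fillna(data.duration.mean()).astype(int)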

Change casing

Columns with user-provided data are ripe for corruption. People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn’t.

To change all our movie titles to uppercase:

data['movie_title'].str.upper()

Similarly, to get rid of leading and trailing whitespace:

data['movie_title'].str.strip()
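As with dropna(), these string methods return a new Series rather than changing the DataFrame, so assign the result back to the column to keep the change:

# Uppercase the titles and trim surrounding whitespace, then save the result
data['movie_title'] = data['movie_title'].str.upper().str.strip()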

We won’t be able to cover correcting spelling mistakes in this tutorial, but you can read up on fuzzy matching for more information.

Rename columns

Finally, if your data was generated by a computer program, it probably has some computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it like this:

data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})

Here we’ve renamed ‘title_year’ to ‘release_date’ and ‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is not an in-place operation, you’ll need to save the DataFrame by assigning it to a variable.

data = data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})

Save your results

When you’re done cleaning your data, you may want to export it back into CSV format for further processing in another program. This is easy to do in Pandas:

data.to_csv('cleanfile.csv', encoding='utf-8')
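One note: by default, to_csv() also writes the DataFrame’s row index as an extra column. If you don’t want it in the output file, pass index=False:

# Write the cleaned data without the index column
data.to_csv('cleanfile.csv', encoding='utf-8', index=False)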

More resources

Of course, this is only the tip of the iceberg. With variations in user environments, languages, and user input, there are many ways that a potential dataset may be dirty or corrupted. At this point you should have learned some of the most common ways to clean your dataset with Pandas and Python.

For more resources on Pandas and data cleaning, see the official Pandas documentation.

Source: http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/

 
