Data Cleaning with Pandas

Some say data scientists spend 80% of their time cleaning data and 20% of their time doing analysis. Learn how to do this 80% in Python!

Introduction to Regular Expressions
  1. 1

    When registering an account for a new social media app or completing an order for a gift online, nearly every piece of information you enter into a web form is validated. Did you enter a properly f...

  2. 2

    The simplest text we can match with regular expressions are literals. This is where our regular expression contains the exact text that we want to match. The regex [...] , for example, will ...

  3. 3

    Do you love baboons and gorillas? You can find either of them with the same regular expression using alternation! Alternation, performed in regular expressions with the pipe symbol, [...] , ...

  4. 4

    Spelling tests may seem like a distant memory from grade school, but we ultimately take them every day while typing. It's easy to make mistakes on commonly misspelled words like [...] , and on top...

  5. 5

    Sometimes we don't care exactly WHAT characters are in a text, just that there are SOME characters. Enter the wildcard [...] ! Wildcards will match any single character (letter, number, symb...

  6. 6

    Character sets are great, but their true power isn't realized without ranges. Ranges allow us to specify a range of characters in which we can make a match without having to type out each ind...

  7. 7

    While character ranges are extremely useful, they can be cumbersome to write out every single time you want to match common ranges such as those that designate alphabetical characters or digits. To...

  8. 8

    Remember when we were in love with baboons and gorillas a few exercises ago? We were able to match either [...] or [...] using the regex [...] , taking advantage of the [...] symbol. But wh...

  9. 9

    Here's where things start to get really interesting. So far we have only matched text on a character by character basis. But instead of writing the regex [...] , which would match 6 word character...

  10. 10

    You are working on a research project that summarizes the findings of primate behavioral scientists from around the world. Of particular interest to you are the scientists' observations of humor in...

  11. 11

    In 1951, mathematician Stephen Cole Kleene developed a system to match patterns in written language with mathematical notation. This notation is now known as regular expressions! In his honor, the...

  12. 12

    When writing regular expressions, it's useful to make the expression as specific as possible in order to ensure that we do not match unintended text. To aid in this mission of specificity, we can us...

  13. 13

    Do you feel those regular expression superpowers coursing through your body? Do you just want to scream [...] really loud? Awesome! You are now ready to take these skills and use them out in the ...
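
The techniques previewed in this lesson (alternation, character sets and ranges, wildcards, shorthand classes, quantifiers, and anchors) can be sketched with Python's built-in re module. The patterns and sample strings below are illustrative assumptions, not the lesson's own examples.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "I saw 3 baboons and 12 gorillas."

# Alternation: the pipe matches either alternative.
animals = re.findall(r"baboons|gorillas", text)

# Character set with a range: b, c, or d followed by "at".
rhymes = re.findall(r"[b-d]at", "bat cat dat eat")

# Wildcard: the dot matches any single character except a newline.
wild = re.findall(r"b.g", "bag big b9g bg")

# Shorthand class \d with the + quantifier: one or more digits.
numbers = re.findall(r"\d+", text)


def is_five_digit_code(value: str) -> bool:
    """Anchors ^ and $ pin the pattern to the start and end of the string."""
    return re.search(r"^\d{5}$", value) is not None


print(animals)   # ['baboons', 'gorillas']
print(rhymes)    # ['bat', 'cat', 'dat']
print(wild)      # ['bag', 'big', 'b9g']
print(numbers)   # ['3', '12']
```

Without the anchors, the five-digit pattern would also succeed on longer strings that merely contain five digits in a row, which is why anchoring helps when validating form input.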

  1. 1

    A huge part of data science involves acquiring raw data and getting it into a form ready for analysis. Some have estimated that data scientists spend 80% of their time cleaning and manipulating dat...

  2. 2

    We often describe data that is easy to analyze and visualize as "tidy data". What does it mean to have tidy data? For data to be tidy, it must have:

    - Each variable as a separate column
    - Each row...

  3. 3

    Often, you have the same data separated out into multiple files. Let's say that we have a ton of files following the filename structure: [...] , [...] , [...] , and so on. The power of pandas i...

  4. 4

    Since we want:

    - Each variable as a separate column
    - Each row as a separate observation

    we would want to reshape a table like:

    |Account|Checking|Savings|
    |-|-|-|
    |"12456543"|8500|8900|
    |"...

  5. 5

    Often we see duplicated rows of data in the DataFrames we are working with. This could happen due to errors in data collection or in saving and loading the data. To check for duplicates, we can us...

  6. 6

    In trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so ...

  7. 7

    Let's say we have a column called "type" with data entries in the format [...] or [...] . Just like we saw before, this column actually contains two types of data. One seems to be the user type ...

  8. 8

    Each column of a DataFrame can hold items of the same data type or dtype. The dtypes that pandas uses are: float, int, bool, datetime, timedelta, category and object. Often, we want to convert ...

  9. 9

    Sometimes we need to modify strings in our DataFrames to help us transform them into more meaningful metrics. For example, in our fruits table from before:

    |item|price|calories|
    |-|-|-|
    |"...

  10. 10

    Sometimes we want to do analysis on numbers that are hidden within string values. We can use regex to extract this numerical data from the strings they are trapped in. Suppose we had this DataFrame...

  11. 11

    We often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. The missing elements normally show up as [...] (or No...

  12. 12

    Great! We have looked at a number of different methods we may use to get data into the format we want for analysis. Specifically, we have covered:

    - diagnosing the "tidiness" of the data
    - reshapi...
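
Several of the cleaning steps previewed in this lesson (dropping duplicates, reshaping to tidy form, stripping characters from strings, converting types, and filling missing values) can be combined in one short pandas sketch. The table contents, account IDs, and the "Account Type" and "Amount" column names are hypothetical and not taken from the lesson.

```python
import pandas as pd

# Hypothetical wide table: one column per account type, with a
# duplicated row and a missing value.
wide = pd.DataFrame({
    "Account": ["A1", "A2", "A2"],
    "Checking": ["$8,500", "$210", "$210"],
    "Savings": ["$8,900", None, None],
})

# Remove exact duplicate rows.
deduped = wide.drop_duplicates().reset_index(drop=True)

# Reshape so each row is a single observation (account, type, amount).
tidy = pd.melt(
    deduped,
    id_vars=["Account"],
    value_vars=["Checking", "Savings"],
    var_name="Account Type",
    value_name="Amount",
)

# Strip the dollar sign and thousands separator, then convert to numbers.
tidy["Amount"] = pd.to_numeric(
    tidy["Amount"].str.replace(r"[$,]", "", regex=True)
)

# Fill missing amounts with 0 (an assumption; the right fill value
# depends on what a missing entry means in the real data).
tidy["Amount"] = tidy["Amount"].fillna(0)

print(tidy)
```

Whether to fill, drop, or keep missing values is a judgment call that depends on the analysis; filling with 0 here is only one option.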

What you'll create

Portfolio projects that showcase your new skills

How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory
