Key Concepts

Review core concepts you need to learn to master this subject

The dplyr and tidyr packages

The dplyr and tidyr packages provide functions that solve common data cleaning challenges in R.

A “messy” dataset should be cleaned and prepared before analysis begins. This process can include:

  • diagnosing the “tidiness” of the data
  • reshaping the data
  • combining multiple files of data
  • changing the data types of values
  • manipulating strings to better represent the data
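
As a rough illustration of several of these steps together, the sketch below combines two tables, drops duplicate rows, reshapes the result into long form, and converts a text column to numbers. The data frames, column names, and dollar amounts are made up for this example and merely stand in for data that would normally be read from separate files. This is one possible approach using dplyr and tidyr, not the only one.

```r
library(dplyr)
library(tidyr)

# Hypothetical tables standing in for two separate data files
accounts_1 <- data.frame(
  account  = c("12456543", "12283942"),
  checking = c("$8500", "$6410"),
  savings  = c("$8900", "$8020"),
  stringsAsFactors = FALSE
)
accounts_2 <- data.frame(
  account  = c("12839485", "12839485"),  # same row twice, on purpose
  checking = c("$78000", "$78000"),
  savings  = c("$92000", "$92000"),
  stringsAsFactors = FALSE
)

clean <- bind_rows(accounts_1, accounts_2) %>%  # combine the tables
  distinct() %>%                                # drop exact duplicate rows
  pivot_longer(                                 # one row per account and type
    cols = c(checking, savings),
    names_to = "account_type",
    values_to = "amount"
  ) %>%
  # strip the "$" from each string, then convert character to numeric
  mutate(amount = as.numeric(gsub("$", "", amount, fixed = TRUE)))

print(clean)
```

After these steps each row holds a single observation (one account, one account type, one amount), and `amount` is numeric, so it can be summed or averaged directly.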

Data Cleaning in R
Lesson 1 of 1
  1.
    A huge part of data science involves acquiring raw data and getting it into a form ready for analysis. Some have estimated that data scientists spend 80% of their time cleaning and manipulating dat…
  2.
    We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data? For data to be tidy, it must have: - Each variable as a separate column - Each row…
  3.
    Often, you have the same data separated out into multiple files. Let’s say that you have a ton of files following the filename structure: ‘file_1.csv’, ‘file_2.csv’, ‘file_3.csv’, and so on. The p…
  4.
    Since we want - Each variable as a separate column - Each row as a separate observation We would want to reshape a table like: |Account|Checking|Savings| |-|-|-| |“12456543”|8500|8900| |…
  5.
    Often we see duplicated rows of data in the data frames we are working with. This could happen due to errors in data collection or in saving and loading the data. To check for duplicates, we can u…
  6.
    In trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so …
  7.
    Let’s say we have a column called “type” with data entries in the format “admin_US” or “user_Kenya”, as shown in the table below. |id|type| |-|-| |1011|“user_Kenya”| |1112|“admin_US”| |1113…
  8.
    Each column of a data frame can hold items of the same data type. The data types that R uses are: character, numeric (real or decimal), integer, logical, or complex. Often, we want to convert bet…
  9.
    Sometimes we need to modify strings in our data frames to help us transform them into more meaningful metrics. For example, in our fruits table from before: |item|price|calories| |-|-|-|…
  10.
    Great! We have looked at a number of different methods we may use to get data into the format we want for analysis. Specifically, we have covered: * diagnosing the “tidiness” of data * combining …

What you'll create

Portfolio projects that showcase your new skills

How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory
