Data Cleaning in R
Diagnose the Data

We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data?

For data to be tidy, it must have:

• Each variable as a separate column
• Each row as a separate observation

For example, we would want to reshape a table like:

| Account | Checking | Savings |
|---|---|---|
| “12456543” | 8500 | 8900 |
| “12283942” | 6410 | 8020 |
| “12839485” | 78000 | 92000 |

Into a table that looks more like:

| Account | Account Type | Amount |
|---|---|---|
| “12456543” | “Checking” | 8500 |
| “12456543” | “Savings” | 8900 |
| “12283942” | “Checking” | 6410 |
| “12283942” | “Savings” | 8020 |
| “12839485” | “Checking” | 78000 |
| “12839485” | “Savings” | 92000 |
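
As an illustration of this kind of reshape, here is a minimal sketch that assumes the tidyr package is installed. The data frame names `accounts_wide` and `accounts_long` are hypothetical, and `pivot_longer()` is only one of several ways to go from the first layout to the second.

```r
library(tidyr)

# The "wide" table: one column per account type (values taken from the example above)
accounts_wide <- data.frame(
  Account = c("12456543", "12283942", "12839485"),
  Checking = c(8500, 6410, 78000),
  Savings = c(8900, 8020, 92000),
  stringsAsFactors = FALSE
)

# Stack the Checking and Savings columns into two new columns:
# the old column names go into "Account Type", the old cell values into "Amount"
accounts_long <- pivot_longer(
  accounts_wide,
  cols = c(Checking, Savings),
  names_to = "Account Type",
  values_to = "Amount"
)

print(accounts_long)
```

Each account now appears once per account type, so every row is a single observation and every column is a single variable.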

The first step in diagnosing whether a dataset is tidy is to explore and probe it with base R and dplyr functions.

You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:

• `head()` — display the first 6 rows of the table
• `summary()` — display the summary statistics of the table
• `colnames()` — display the column names of the table
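
A short sketch of these three functions in use, run on a small data frame built from the account table above. The name `accounts` is hypothetical and only stands in for whatever data frame you are diagnosing.

```r
# A small example data frame (values from the account table in this lesson)
accounts <- data.frame(
  Account = c("12456543", "12283942", "12839485"),
  Checking = c(8500, 6410, 78000),
  Savings = c(8900, 8020, 92000),
  stringsAsFactors = FALSE
)

# First rows of the table (up to 6 by default; this table only has 3)
print(head(accounts))

# The n argument changes how many rows are shown
first_two <- head(accounts, n = 2)
print(first_two)

# Summary statistics for each column (min, quartiles, mean, max for numeric columns)
print(summary(accounts))

# Column names as a character vector
col_names <- colnames(accounts)
print(col_names)
```

Looking at the column names and the first few rows is often enough to spot a variable that has been spread across several columns.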

### Instructions

1. Provided in `notebook.Rmd` are two data frames, `grocery_1` and `grocery_2`.

   Begin by viewing the `head()` of both `grocery_1` and `grocery_2`.

2. Explore the data frames using the other functions listed.

   Which data frame is “clean”, tidy, and ready for analysis? Create a variable named `clean_data_frame` and assign it the value `1` if `grocery_1` is a clean and tidy data frame or `2` if `grocery_2` is a clean and tidy data frame.