In this exercise, you will learn how to *combine* all of the values from a column for a single calculation. This can be done with the help of the dplyr function `summarize()`

, which returns a new data frame containing the desired calculation.

Some examples of this type of calculation include:

- The data frame
`customers`

contains the names and ages of all of your customers. You want to find the median age:

customers %>% select(age) # c(23, 25, 31, 35, 35, 46, 62) customers %>% summarize(median_age = median(age)) # 35

- The data frame
`shipments`

contains address information for all shipments that you’ve sent out in the past year. You want to know how many different states you have shipped to.

shipments %>% select(states) # c('CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ') shipments %>% summarize(n_distinct_states = n_distinct(states)) # 3

- The data frame
`inventory`

contains a list of types of t-shirts that your company makes. You want to know the standard deviation of the prices of your inventory.

inventory %>% select(price) # c(31, 23, 30, 27, 30, 22, 27, 22, 39, 27, 36) inventory %>% summarize(sd_price = sd(price)) # 5.465595

The general syntax for these calculations is:

df %>% summarize(var_name = command(column_name))

`df`

is the data frame you are working with`summarize`

is a dplyr function that reduces multiple values to a single value`var_name`

is the name you assign to the column that stores the results of the summary function in the returned data frame`command`

is the summary function that is applied to the column by`summarize()`

`column_name`

is the name of the column of`df`

that is being summarized

The following table includes common summary functions that can be given as an argument to `summarize()`

:

Command | Description |
---|---|

`mean()` |
Average of all values in column |

`median()` |
Median value of column |

`sd()` |
Standard deviation of column |

`var()` |
Variance of column |

`min()` |
Minimum value in column |

`max()` |
Maximum value in column |

`IQR()` |
Interquartile range of column |

`n_distinct()` |
Number of unique values in column |

`sum()` |
Sum values of column |

### Instructions

**1.**

ShoeFly.com has a new batch of orders stored in the data frame `orders`

. Inspect the first `10`

rows if the data frame using `head()`

.

**2.**

Our finance department wants to know the price of the most expensive pair of shoes purchased. Save your answer to the variable `most_expensive`

.

**3.**

Woah, wait a minute! Take a look at the output of the code you just ran. The result for the most expensive pair of shoes is coming back as `NA`

. Why is this happening?

If you scroll up in the rendered notebook to where `orders.csv`

is loaded, you can see a warning about row `99`

of the file. There is a missing column of information! It appears that the `price`

for row `99`

was not in the file, and this is causing your maximum value calculation to return `NA`

.

Add the following as an additional argument to `max()`

so that it removes all missing values before computing the maximum value.

na.rm = TRUE

**4.**

Our fashion department wants to know how many different colors of shoes we are selling. Save your answer to the variable `num_colors`

.

# Take this course for free

By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.