Key Concepts

Review core concepts you need to learn to master this subject

dplyr package

The dplyr package provides functions for exploring and manipulating datasets. At the most basic level, its functions are data manipulation “verbs” such as select, filter, mutate, arrange, and summarize, which can be chained so that multiple steps fit in a few lines of code. The dplyr package is suited both to simple work on a single dataset and to producing complex results from large datasets.

data frame object

A data frame is an R object that stores data in two dimensions, represented by columns and rows. The columns are the variables of the data frame and the rows are the observations. Each row holds one observation, with a value for each variable. Because each column can hold a different data type, the data frame is a useful structure for storing mixed data and analyzing it.
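As a minimal sketch of the structure described above, the following builds a small data frame with base R. The artists data frame, its column names, and its values are hypothetical and made up for illustration.

```r
# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B", "Band C"),
  genre = c("Rock", "Pop", "Rock"),
  year_founded = c(1990, 2005, 2012),
  stringsAsFactors = FALSE
)

# Each column is a variable with its own type; each row is one observation.
str(artists)

n_observations <- nrow(artists)  # number of rows
n_variables <- ncol(artists)     # number of columns
```

Here the group and genre columns hold text while year_founded holds numbers, which is the mixed-type storage that makes data frames convenient.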

Comma Separated Values (CSV)

CSV (comma-separated values) files are plain-text files that represent tabular data, like a spreadsheet, using commas to separate individual values. This type of file is easy to manage and compatible with many different platforms. A CSV file can be imported into a database or opened in an Integrated Development Environment (IDE) to work with its content.

Loading and Saving CSVs with R

The read_csv() and write_csv() functions belong to the readr package, which is part of the tidyverse, and read and write CSV files in R. The read_csv() function reads a file and converts it to a tibble, an enhanced kind of data frame. The first argument of read_csv() is the file to be read. Tibbles can be exported to CSV files using the write_csv() function. The first argument of write_csv() is the tibble to be exported.
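The round trip described above might look like the sketch below. It assumes the readr package is installed, uses a temporary file so that nothing is written to the working directory, and relies on a hypothetical artists data frame with made-up values.

```r
library(readr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B"),
  year_founded = c(1990, 2005)
)

# Write the data to a temporary CSV file; the data comes first, then the path.
csv_path <- tempfile(fileext = ".csv")
write_csv(artists, csv_path)

# Read the file back; the file to read is the first argument.
artists_from_csv <- read_csv(csv_path)

# The result is a tibble, which is also a data frame.
class(artists_from_csv)
```

read_csv() prints a short message about the column types it guessed, which is a quick way to check that the file was parsed as expected.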

data frames primary information

Data frames in R can be inspected using head() and summary(). The head() function accepts an integer argument that determines how many rows of the data frame are shown, starting from the top; the default is 6. The summary() function returns summary statistics for each column, such as the minimum, maximum, mean, and three quartiles (first quartile, median, and third quartile) for numeric columns.
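A short sketch of both inspection functions, using a hypothetical orders data frame with made-up values:

```r
# Hypothetical data: ten orders with made-up prices.
orders <- data.frame(
  id = 1:10,
  price = seq(10, 100, by = 10)
)

first_rows <- head(orders)      # default: the first 6 rows
first_three <- head(orders, 3)  # only the first 3 rows

# Summary statistics for a single numeric column.
price_summary <- summary(orders$price)
price_summary

# Calling summary() on the whole data frame summarizes every column.
summary(orders)
```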

pipes

The pipe %>% passes a value or an object into the first argument of a function. Instead of passing the argument inside the function call, you can write the value or object first and then use the pipe to supply it as the function's first argument on the same line. This works with functions such as select() and filter(), which take a data frame as their first argument.

For example, a weather data frame can be piped into the select() function to select its first two columns: weather %>% select(1:2).
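The piping described above can be sketched as follows. The weather data frame here is hypothetical, with made-up column names and values, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data: column names and values are made up for illustration.
weather <- data.frame(
  day = c("Mon", "Tue", "Wed"),
  temp_c = c(18, 21, 19),
  humidity = c(0.60, 0.55, 0.70)
)

# Piped form: weather becomes the first argument of select().
piped <- weather %>% select(1:2)

# Equivalent call without the pipe.
nested <- select(weather, 1:2)

names(piped)
```

Both forms select the first two columns; the piped form reads left to right, which helps once several steps are chained.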

Dplyr’s select()

The select() function of the dplyr package is used to choose which columns of a data frame you would like to work with. It takes column names as arguments and creates a new data frame using the selected columns. select() can be combined with other functions such as filter().
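A minimal sketch of selecting columns by name, and of combining select() with filter(). The artists data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B", "Band C"),
  genre = c("Rock", "Pop", "Rock"),
  year_founded = c(1990, 2005, 2012),
  stringsAsFactors = FALSE
)

# Keep only the group and genre columns.
selected <- artists %>% select(group, genre)

# Combine with filter(): keep Rock rows, then two of the columns.
rock_founded <- artists %>%
  filter(genre == "Rock") %>%
  select(group, year_founded)
```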

Excluding Columns with select() in Dplyr

The select() function of dplyr also allows users to select all columns of a data frame except for the specified ones. To exclude columns, add the - operator before the name of each column when passing it as an argument to select(). This returns a new data frame with all columns except those preceded by a - operator. For example, with the data frame piped in: select(-genre, -spotify_monthly_listeners, -year_founded).
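The exclusion above can be tried on a small data frame. This is a sketch: the artists data frame is hypothetical, its values are made up, and the country column is an assumption added so that something remains after the exclusion.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B"),
  country = c("US", "UK"),
  genre = c("Rock", "Pop"),
  spotify_monthly_listeners = c(25000000, 30000000),
  year_founded = c(1990, 2005),
  stringsAsFactors = FALSE
)

# Drop three columns; every other column is kept.
remaining <- artists %>%
  select(-genre, -spotify_monthly_listeners, -year_founded)

names(remaining)
```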

Dplyr’s filter()

The filter() function of the dplyr package allows users to select the subset of rows in a data frame that match certain conditions passed as arguments. The first argument of the function is the data frame, and the following arguments are the conditional expressions that serve as the filter criteria. When several conditions are given, a row must satisfy all of them. For example: filter(artists, genre == 'Rock', spotify_monthly_listeners > 20000000).
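The filter() call above can be run against a small data frame to see which rows survive. The artists data frame below is a hypothetical stand-in with made-up values.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B", "Band C"),
  genre = c("Rock", "Pop", "Rock"),
  spotify_monthly_listeners = c(25000000, 30000000, 5000000),
  stringsAsFactors = FALSE
)

# Conditions separated by commas must all hold for a row to be kept.
popular_rock <- filter(artists, genre == 'Rock', spotify_monthly_listeners > 20000000)

popular_rock
```

Only Band A is both a Rock group and above the listener threshold, so it is the only row returned.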

filter with logical operators

The filter() function can subset rows of a data frame based on logical operations on certain columns. Each condition must be passed explicitly as an argument with the following syntax: the name of the column, a comparison operator (<, ==, >, !=), and a value. It is also possible to combine conditions on a single column or across different columns using the logical operators & (and), | (or), and ! (not).
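A sketch of the three logical operators in use, again on a hypothetical artists data frame with made-up values:

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B", "Band C", "Band D"),
  genre = c("Rock", "Pop", "Rock", "Jazz"),
  year_founded = c(1990, 2005, 2012, 1975),
  stringsAsFactors = FALSE
)

# | (or): rows where either condition holds.
rock_or_pop <- artists %>% filter(genre == "Rock" | genre == "Pop")

# & (and): rows where both conditions hold, across two columns.
recent_rock <- artists %>% filter(genre == "Rock" & year_founded > 2000)

# ! (not): rows where the condition does not hold.
not_rock <- artists %>% filter(!(genre == "Rock"))
```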

dplyr arrange()

The arrange() function of the dplyr package orders the rows of a data frame based on the values of a column or a set of columns passed as arguments. By default, arrange() sorts in ascending order; to sort in descending order, wrap the column in the desc() function.
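The following sketch shows ascending, descending, and multi-column sorting. The artists data frame is hypothetical, with made-up values.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band B", "Band C", "Band A"),
  genre = c("Pop", "Rock", "Rock"),
  year_founded = c(2005, 2012, 1990),
  stringsAsFactors = FALSE
)

# Ascending order by year (the default).
oldest_first <- artists %>% arrange(year_founded)

# Descending order by wrapping the column in desc().
newest_first <- artists %>% arrange(desc(year_founded))

# Several columns: sort by genre, then by year (newest first) within each genre.
by_genre_then_year <- artists %>% arrange(genre, desc(year_founded))
```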

rename-dplyr

The rename() function of the dplyr package changes the column names of a data frame. Its syntax is simple: pass the new name, followed by the = operator and the old name of the column. To rename multiple columns based on logical criteria, rename() has the variants rename_if(), rename_at(), and rename_all() (recent versions of dplyr supersede these with rename_with()).
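A minimal sketch of the new_name = old_name syntax, plus one of the variants named above. The artists data frame and the new column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
artists <- data.frame(
  group = c("Band A", "Band B"),
  genre = c("Rock", "Pop"),
  year_founded = c(1990, 2005),
  stringsAsFactors = FALSE
)

# New name on the left of =, old name on the right.
renamed <- artists %>% rename(band = group, founded = year_founded)

# rename_all() applies a function to every column name.
upper_names <- artists %>% rename_all(toupper)
```

Columns that are not mentioned, such as genre here, keep their names and positions.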

mutate() dplyr

The mutate() function from the dplyr package adds new columns to a data frame, typically by transforming existing columns, while keeping all the other columns. The function receives the data frame as its first argument, followed by the new column name, the = operator, and an expression that computes the column's values. Further name = expression arguments can be added to create several columns in the same call.
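A sketch of mutate() adding two columns at once. The inventory data frame, its values, and the profit and price_with_tax columns are hypothetical and chosen for illustration.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
inventory <- data.frame(
  product_id = c(1, 2),
  cost_to_manufacture = c(2, 1),
  price = c(5, 3),
  sales_tax = c(0.1, 0.1)
)

# Each name = expression pair adds one new column; existing columns are kept.
with_profit <- inventory %>%
  mutate(
    profit = price - cost_to_manufacture,
    price_with_tax = price * (1 + sales_tax)
  )

names(with_profit)
```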

transmute() dplyr

The transmute() function, from dplyr, creates new columns from a data frame by transforming existing ones. The result contains only the new columns; the original columns are dropped from the new data frame. The function receives the data frame as the first argument, followed by the new variable name and the expression that computes it. Multiple transformations can be performed in the same call by specifying each one as a separate argument.
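For contrast with mutate(), the sketch below applies two transformations with transmute(). The inventory data frame, its values, and the new column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: names and values are made up for illustration.
inventory <- data.frame(
  product_id = c(1, 2),
  cost_to_manufacture = c(2, 1),
  price = c(5, 3),
  sales_tax = c(0.1, 0.1)
)

# Only the columns named in the call appear in the result.
profit_only <- inventory %>%
  transmute(
    profit = price - cost_to_manufacture,
    price_with_tax = price * (1 + sales_tax)
  )

names(profit_only)
```

To carry an original column through, name it in the call as well, for example transmute(product_id, profit = price - cost_to_manufacture).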

Introduction to Data Frames in R
Lesson 1 of 2
  1. Data lies at the heart of nearly every problem in the business world and society. Having the right tools to manipulate data and organize it in a meaningful way is integral to performing data analys…
  2. A data frame is an R object that stores tabular data in a table structure made up of rows and columns. You can think of a data frame as a spreadsheet or as a SQL table. While data frames can be cre…
  3. When working with data frames, most of the time you will load in data from an existing data set. One of the most common formats for big datasets is the CSV. CSV (comma separated values) is a t…
  4. When you have data in a CSV, you can load it into a data frame in R using readr’s read_csv() function: df <- read_csv('my_csv_file.csv'). In the example above, the read_csv() function is called …
  5. When you load a new data frame from a CSV, you want to get an understanding of what the data looks like. If the data frame is small, you can display it by typing its name df. If the data frame is …
  6. One of the most appealing aspects of dplyr is the ability to easily manipulate data frames. Each of the dplyr functions you will explore takes a data frame as its first argument. The pipe operato…
  7. Suppose you have a data frame called customers, which contains the ages of your business’s customers: |name|age|gender| |-|-|-| |Rebecca Erikson|35|F| |Thomas Roberson|28|M| |Diane Ochoa|42|NA|…
  8. Sometimes rather than specify what columns you want to select from a data frame, it’s easier to state what columns you do not want to select. dplyr’s select() function also enables you to do just t…
  9. In addition to subsetting a data frame by columns, you can also subset a data frame by rows using dplyr’s filter() function and comparison operators! Consider an orders data frame that contains dat…
  10. The filter() function also allows for more complex filtering with the help of logical operators! Take a look at the same orders data frame from the last exercise: |id|first_name|last_name|email…
  11. Sometimes all the data you want is in your data frame, but it’s all unorganized! Step in the handy dandy dplyr function arrange()! arrange() will sort the rows of a data frame in ascending order by…
  12. There you have it! With the power of readr and dplyr in your hands, you can now: load data from a CSV into a data frame; inspect the data frame with head() and summary(); select() the columns y…
  1. When working with data frames, you often need to modify the columns for your analysis at hand. With the help of the dplyr package, data frame modifications are easily performed. In this lesson, yo…
  2. Sometimes you might want to add a new column to a data frame. This new column could be a calculation based on the data that you already have. Suppose you own a hardware store called The Handy Woma…
  3. Let’s refer back to the inventory table for your store, The Handy Woman. |product_id|product_description|cost_to_manufacture|price|sales_tax| |-|-|-…
  4. When creating new columns from a data frame, sometimes you are interested in only keeping the new columns you add, and removing the ones you do not need. dplyr’s transmute() function will add new c…
  5. Since dplyr functions operate on data frames using column names, it is often useful to update the column names of a data frame so they are as clear and meaningful as possible. dplyr’s rename() func…
  6. With an understanding of how to manipulate data frames with dplyr, you are well on your way to becoming a data analysis expert! In this lesson you learned how to: add new columns to a data frame …
