Key Concepts

Review core concepts you need to learn to master this subject

Efficient Data Storage with Multiple Tables

For efficient data storage, related information is often spread across multiple tables of a database.

Consider an e-commerce business that tracks the products that have been ordered from its website. Business data for the company could be split into three tables:

  • orders would contain the information necessary to describe an order: order_id, customer_id, product_id, quantity, and timestamp
  • products would contain the information to describe each product: product_id, product_description and product_price
  • customers would contain the information for each customer: customer_id, customer_name, customer_address, and customer_phone_number

This table structure prevents the storage of redundant information, given that each customer’s and product’s information is only stored once, rather than each time a customer places an order for another item.

dplyr inner_join()

R data frame objects can be joined together with the dplyr function inner_join(). Corresponding rows with a matching column value in each data frame are combined into one row of a new data frame, and non-matching rows are dropped.

dplyr‘s inner_join() takes two data frames as arguments and returns a new data frame with the corresponding rows merged together. Non-matching rows from each data frame are dropped in the resulting data frame.

For example, consider the sales and targets data frames of a t-shirt company. sales contains the monthly revenue for the company and has two columns: month and revenue. targets contains the goals for monthly revenue for each month and has two columns: month and target. To perform an inner join on the two data frames using dplyr:

sales_vs_targets <- sales %>% inner_join(targets)

inner_join() will use the month column as the column to match on, as both the sales and target data frames have a month column. The resultant data frame will only contain the matching rows from sales and targets.

Multiple data frames can be merged together at once by stringing multiple calls to inner_join with the pipe %>%.

For example, consider the same sales and targets data frames of a t-shirt company. An additional data frame small_medium_large contains the number of small, medium and large t-shirts sold per month and has four columns: month, small, medium, and large. To perform an inner join on the three data frames using dplyr:

sales_vs_targets <- sales %>% inner_join(targets) %>% inner_join(small_medium_large)

inner_join() will use the month column as the column to match on, as the sales, target, and small_medium_large data frames have a month column. The resultant data frame will only contain the matching rows from sales, targets, and small_medium_large.

dplyr join functions

R data frames can be joined on specific columns using one of the dplyr join functions and the by argument.

The dplyr join functions can take the additional by argument, which indicates the columns in the “left” and “right” data frames of a join to match on.

For example, consider the orders and products data frames of a business. The orders data frame contains five columns: id, product_id, customer_id,quantity and timestamp. The products data frame contains three columns: id, product_id, and price. To perform an inner join on the two data frames using product_id from the orders data frame and id from the products data frame as the columns to join on:

orders_products <- orders %>% inner_join(products, by = c("product_id" = "id") suffix = c("_orders", "_products"))

The suffix argument will append suffixes to column names that duplicate between the two data frames. id in the original orders data frame will become id_orders in the resultant data frame and id in the original products data frame will become id_products in the resultant data frame.

dplyr full_join()

In a full join, R data frame objects are merged together with the dplyr function full_join(). Corresponding rows with a matching column value in each data frame are combined into one row of a new data frame, and non-matching rows are also added to the resultant data frame with NAs for the missing information.

dplyr‘s full_join() function will perform a full join, where non-matching rows are also added to the resultant merged data frame with NAs for the missing information.

For example, consider the inventory data frames of two stores, store_a_inventory and store_b_inventory. The store_a_inventory data frame contains two columns: item and store_a_inventory. The store_b_inventory data frame contains two columns: item and store_b_inventory. To perform a full join on the two data frames:

store_a_b_inventory <- store_a_inventory %>% full_join(store_b_inventory)

The resultant data frame will contain each matching row from store_a_inventory and store_b_inventory as well as the non-matching rows from store_a_inventory and store_b_inventory.

dplyr bind_rows()

Multiple R data frames containing the same columns can be concatenated into one data frame using the dplyr function bind_rows().

dplyr‘s bind_rows() function takes all the data frames to bind as arguments and returns a single data frame where the data frames have been concatenated into a longer data frame.

For example, consider two customer data frames customer_1 and customer_2, each containing columns name and email. To concatenate the data frames into one longer data frame:

customers <- customer_1 %>% bind_rows(customer_2)

If a third data frame customer_3 with columns name and email also existed, all three data frames could be concatenated into one longer data frame as follows:

customers <- customer_1 %>% bind_rows(customer_2) %>% bind_rows(customer_3)
Joining Tables in R
Lesson 1 of 1
  1. 1
    In order to efficiently store data, we often spread related information across multiple tables. For instance, imagine that we own an e-commerce business and we want to track the products that have…
  2. 2
    Suppose we have the following three tables that describe our eCommerce business: - orders — a table with information on each transaction: order_id customer_id product_id quantity …
  3. 3
    It is easy to do this kind of matching for one row, but hard to do it for multiple rows. Luckily, dplyr can efficiently do this for the entire table using the inner_join() method. The **inner_jo…
  4. 4
    In addition to using inner_join() to join two data frames together, we can use the pipe %>% to join multiple data frames together at once. The following command would join orders with customers, an…
  5. 5
    In the previous example, the inner_join() function “knew” how to combine tables based on the columns that were the same between two tables. For instance, orders and customers both had a column cal…
  6. 6
    In the previous exercise, we learned how to use rename() to join two data frames whose columns don’t match. A better option, however, exists. We can add the by argument when calling inner_join() …
  7. 7
    In our previous examples, there were always matching values when we were performing our joins. What happens when that isn’t true? Let’s imagine that our products table is out of date and is missi…
  8. 8
    In the previous exercise, we saw that when we join two data frames whose rows don’t match perfectly, we lose the unmatched rows. This type of join (where we only include matching rows) is called a…
  9. 9
    Let’s return to the join of Company A and Company B. — ### Left Join Suppose we want to identify which customers are missing phone information. We would want a list of all customers who have e…
  10. 10
    Sometimes, a dataset is broken into multiple tables. For instance, data is often split into multiple CSV files so that each download is smaller. When we need to reconstruct a single data frame fr…
  11. 11
    This lesson introduced some methods for combining multiple data frames: * Creating a data frame made by matching the common columns of two data frames is called a join * We can specify which column…

What you'll create

Portfolio projects that showcase your new skills

Pro Logo

How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory

Pro Logo