Data Cleaning in R
Dealing with Multiple Files

Often, you have the same data separated out into multiple files.

Let’s say that you have a ton of files following the filename structure: 'file_1.csv', 'file_2.csv', 'file_3.csv', and so on. dplyr and tidyr are most useful when they can operate on a large amount of structured data at once, so you want to get all of the relevant information into one table before you analyze the aggregate data.

You can combine the base R functions list.files() and lapply() with readr and dplyr to organize this data better, as shown below:

files <- list.files(pattern = "file_.*csv")
df_list <- lapply(files, read_csv)
df <- bind_rows(df_list)
  • The first line uses list.files() with a regular expression (a sequence of characters describing a pattern of text to match) to find every file in the current directory whose name contains 'file_' followed by 'csv'. Here that means the files that start with 'file_' and have a csv extension. The matching file names are stored in a character vector, files.
  • The second line uses lapply() to read each file in files into a data frame with read_csv(), storing the resulting list of data frames in df_list.
  • The third line then concatenates all of those data frames row-wise into a single data frame with dplyr’s bind_rows() function.
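
The three steps above can be tried end to end with a small sketch. It assumes the readr and dplyr packages are installed. The directory name demo_dir, the column names id and score, and the values written to the files are all made up for illustration; full.names = TRUE is used only because the files live in a temporary directory rather than the current one.

```r
library(readr)
library(dplyr)

# Work in a throwaway directory so the sketch does not touch real files.
demo_dir <- file.path(tempdir(), "multi_file_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Write three small, made-up CSV files: file_1.csv, file_2.csv, file_3.csv.
for (i in 1:3) {
  write_csv(
    data.frame(id = c(2 * i - 1, 2 * i), score = c(10 * i, 10 * i + 5)),
    file.path(demo_dir, paste0("file_", i, ".csv"))
  )
}

# Step 1: collect the matching file names (full.names = TRUE keeps the directory).
files <- list.files(demo_dir, pattern = "file_.*csv", full.names = TRUE)

# Step 2: read each file into its own data frame.
# col_types = cols() lets readr guess the column types without printing a message.
df_list <- lapply(files, read_csv, col_types = cols())

# Step 3: stack the data frames into one.
df <- bind_rows(df_list)
```

Each file contributes two rows, so df should hold the rows of all three files stacked in the order the names appear in files.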

Instructions

1.

You have 10 different files containing 100 students each. These files follow the naming structure:

  • exams_0.csv
  • exams_1.csv
  • … up to exams_9.csv

You are going to read each file into an individual data frame and then combine all of the entries into one data frame.

First, create a variable called student_files and set it equal to the result of calling list.files() with a pattern that matches all of the CSV files we want to import.
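
Since the pattern argument of list.files() is a regular expression, it can help to check what a pattern matches before pointing it at a directory. The sketch below uses base R's grepl() on a hypothetical vector of file names (none of these names come from the exercise apart from the exams_N.csv form) and compares a loose pattern with an anchored one. The anchored pattern is one possible choice, not the one the exercise requires.

```r
# Hypothetical directory listing, used only to illustrate the matching.
candidates <- c("exams_0.csv", "exams_9.csv", "exams_notes.txt",
                "old_exams_1.csv", "exams_csv_backup.zip")

# Unanchored pattern: matches "exams_" followed later by "csv",
# anywhere in the name.
loose <- grepl("exams_.*csv", candidates)

# Anchored pattern: the name must start with "exams_", continue with
# one or more digits, and end in ".csv".
strict <- grepl("^exams_[0-9]+\\.csv$", candidates)

candidates[loose]   # includes "old_exams_1.csv" and "exams_csv_backup.zip"
candidates[strict]  # only "exams_0.csv" and "exams_9.csv"
```

In a directory that contains only the exam files, both patterns select the same names; the anchored form matters when unrelated files sit alongside them.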

2.

Read each file in student_files into a data frame using lapply() and save the result to df_list.

3.

Concatenate all of the data frames in df_list into one data frame called students.

4.

Inspect students. Save the number of rows in students to nrow_students. Did you get all of them?
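
One way to answer "did you get all of them?" is to compare the combined row count with the sum of the per-file row counts. The sketch below simulates the setup with 10 generated files of 100 rows each, so the expected total is 10 × 100 = 1000. The directory exam_dir and the columns student_id and file_index are invented stand-ins, since the real exams files are not available here, and readr and dplyr are assumed to be installed.

```r
library(readr)
library(dplyr)

# Simulated stand-ins for the exercise files: 10 files with 100 rows each.
# The column names are invented for illustration.
exam_dir <- file.path(tempdir(), "exams_demo")
dir.create(exam_dir, showWarnings = FALSE)
for (i in 0:9) {
  write_csv(
    data.frame(student_id = i * 100 + 1:100, file_index = i),
    file.path(exam_dir, paste0("exams_", i, ".csv"))
  )
}

# Same three steps as in the lesson, pointed at the simulated directory.
student_files <- list.files(exam_dir, pattern = "exams_.*csv", full.names = TRUE)
df_list <- lapply(student_files, read_csv, col_types = cols())
students <- bind_rows(df_list)

# The combined row count should equal the sum of the per-file row counts.
nrow_students <- nrow(students)
rows_per_file <- sapply(df_list, nrow)
all_rows_present <- nrow_students == sum(rows_per_file)
```

If all_rows_present is FALSE, or length(student_files) is smaller than expected, the pattern passed to list.files() is the first thing to check.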
