Wrangling Client Data

Exploratory Data Analysis - Part 1

Last updated on Mar 23, 2021 15 min read R, Senior Project, Data Wrangling, tidyverse

I’m currently working on a project with data provided by BYU-Idaho Peer-Success Mentoring. Luckily, the provided data was already mostly in the format that I need for my project. However, no project is complete without at least some data wrangling.

In this post, I’ll go through all of the steps I took to wrangle the data - from reading in the raw data to my final tidy dataset. I’ll be using using the tidyverse package, specifically dplyr for data manipulation and readxl for reading in data from an excel workbook.

#load packages
library(dplyr) #for data wrangling
library(readxl) #to read in .xls and .xlsx sheets.
library(pander) #to print tables in markdown
library(tidyverse) #includes dplyr, readxl, and other packages that are in the tidyverse core

Project Background

BYU-Idaho Peer-Success Mentoring provides short-term mentoring to a variety of BYU-Idaho student populations with a goal to improve student success and retention. We’re interested in looking for possible correlation between participation in mentoring and defined measures of student success, including improved GPA.

The project will be focused on answering the following question: Do students who participate in Peer-Success Mentoring have more success (improvement in GPA) than students who don’t participate in Peer-Success Mentoring?

This question will be explored through various forms of data analysis, hypothesis testing, and inferential statistics, using data about students who met defined mentoring criteria. A Control group of students who met the criteria but did not participate in mentoring will be compared with a Mentored group of students who met the same criteria and did participate in mentoring. Click here to read more about Peer-Success Mentoring and the background for this project.

Provided Data

The data is stored in an excel workbook, where each tab represents a different group of potential mentee.

Control (Students who met Tier 3 mentoring criteria but opted out of mentoring)
Mentored (Students who met Tier 3 mentoring criteria and participated in mentoring)
Referrals (Students who were referred to the mentoring program by a faculty member)

Tier 3

Most students in the provided data are classified as “Tier 3”. These students were all offered mentoring and either participated or did not participate, so they are assigned into either the Mentored or Control group. Tier 3 can be defined by following: “The third tier category focuses on freshmen, in particular any freshman with a 2.0 GPA or lower after their second semester. These students are referred to the [Peer-Success] Mentoring program, where they receive a personal student mentor’s help in addition to their academic advisor.” source

Referral

A smaller number of students in this data are classified as “Referral”. These students were referred to Peer-Success Mentoring by a professor, friend, faculty member, or other person, or they signed up for mentoring on their own accord. These students do not necessarily meet the Tier 3 criteria, so the impact of mentoring might be different.

Click to hide definitions

(Note: Mentored and Referrals are both made up of students who did participate in mentoring, so we will actually only have two groups: Control and Mentored.)

Each tab in the workbook follows the same format, where each row is one student with the following columns:

Semester Mentored : Semester they participated in (or opted out of) mentoring (i.e. Fall 2018)
Mentee Name: Student’s name (Lastname, Firstname)
Status : Mentoring Status (Active Mentoring, Dropped, Fulfilled Program, or Referred to HJG/LS)

Status Definitions

Every student that meets mentoring criteria is assigned a “status” that describes where the student is at in the mentoring process. Their mentor adjusts this status as the student moves through the process. In this project, each student’s status is the status they were marked as at the end of their mentoring semester. The following statuses are included in this project:

Attempted

Students classified as “Attempted” are students who met Tier 3 criteria and were contacted by a Peer-Success Mentor but did not participate in mentoring. They could have declined mentoring or could have not responded to the mentor’s contact attempts. These students are all in the control group.

Fulfilled Program

These students completed the 3-4 mentoring sessions in the program.

Dropped

These students were dropped from the mentoring program for a variety of reasons:

They attended one or more mentoring sessions and informed the mentor that they wanted to discontinue mentoring.
They attended one or more mentoring sessions but no-showed for their remaining sessions without communicating to the mentor that they wanted to discontinue mentoring
They enrolled in mentoring but no-showed for their first session without communicating to the mentor that they wanted to discontinue mentoring

Referred to HJG/LS

These students were referred to other peer-mentoring programs, including Heber J. Grant Mentoring and Life-Skills Mentoring. These students could have met with a Peer-Success Mentor one or more times before being referred to another program, or they could have been referred without meeting with a Peer-Success Mentor.

Active Mentoring

These students enrolled in Peer-Success Mentoring, but for some reason were never marked as “Fulfilled Program”. This could be due to a variety of reasons:

The student signed up for mentoring, but never attended any of their sessions
They attended one or more mentoring sessions but then “no-showed” for their remaining sessions without communicating to the mentor that they wanted to discontinue mentoring
The student actively participated in mentoring, but enrolled too late in the semester to complete enough sessions to be marked as “fulfilled” (Each semester, the mentors start over with a new list of Tier 3 students)
The student was moved to another status (Fulfilled, Dropped, Referred to HJG/LS), but the mentor forgot to change the student’s status in the records.

Because of this, it is hard to know the experience that these students had. They probably do not all belong in the same group.

Click to hide definitions of each Status

I-Number : Student’s unique ID number
Student Type : Mentoring criteria that student falls under (Tier 3 or Referral)
Pre 1 : Student’s GPA the semester before the semester mentored
Semester Mentored : Student’s GPA the semester mentored
Post 1 : Student’s GPA the semester after the semester mentored
Post 2 : Student’s GPA 2 semesters after the semester mentored

Even before reading in the data, I noticed some changes that needed to be made:

Two columns have the same name, “Semester Mentored”.
There are 3 tabs, but there are really two overlapping sets of two groups: Control vs Mentored and Tier 3 vs Referral. The three tabs don’t group the students in the way we need.
This data includes student identifiers, such as their name and their student ID number, which need to be removed for privacy reasons.
Some rows don’t have any student data in them.
The column names contain spaces, which will make referencing them harder.

Read in the Data using read_excel

The read_excel() function from the readxl package can read in data from an excel workbook to store in a dataframe.

#Read in the data from the client's excel sheet
Mentored <- read_excel("StudentGPAs.xlsx", sheet = "Tier 3 Mentored", skip = 1)
Control <- read_excel("StudentGPAs.xlsx", sheet = "Tier 3 Control", skip = 1)
Referrals <- read_excel("StudentGPAs.xlsx", sheet = "Referrals", skip = 1)

Mentored %>% head() %>% pander::pander()

Table continues below
Semester Mentored…1	Mentee Name	Status	I-Number
Fall 2018	Student 14	Active Mentoring	107937214
Fall 2018	Student 57	Active Mentoring	109482852
Winter 2019	NA	Attempted	NA
Fall 2018	Student 73	Active Mentoring	115113476
Fall 2018	Student 88	Referred to HJG/LS	1.29e+08
Fall 2018	Student 71	Active Mentoring	136664545

Student Type	Pre 1	Semester Mentored…7	Post 1	Post 2
Tier 3	0	2.1	0.6	0
Tier 3	1.3	2.5	0.7	NA
Tier 3	NA	NA	NA	NA
Tier 3	0.6	3.6	3.1	NA
Tier 3	1.7	2.3	3	NA
Tier 3	0	3.5	0.2	NA

(For the sake of this post, I created fabricated data in the same format as the real data).

Combine the 3 Datasets

Because the three groups overlap a bit, let’s add a new column to specify whether that student is Mentored or Control. I used the mutate() function from the dplyr package for this.

#Add a column to specify which group that student belongs to.
Mentored <- Mentored %>%
  mutate("Group" = "Mentored")

Control <- Control %>%
  mutate("Group" = "Control")

Referrals <- Referrals %>%
  mutate("Group" = "Mentored")

Now that all three datasets have a Group column, let’s combine them into one dataframe using rbind() function. While we’re at it, let’s also rename the columns to make things a bit easier going forward.

#Combine both groups into one dataframe
Students <- rbind(Mentored, Control, Referrals) %>%
  
  #rename some columns
  rename(semester_mentored = `Semester Mentored...1`,
         `Pre_1` = `Pre 1`, 
         `During` = `Semester Mentored...7`, 
         `Post_1` = `Post 1`, 
         `Post_2` = `Post 2`,
        `Student_Type` = `Student Type`)

Finally, we need to remove the rows that have no data in them. The main reason for doing this is that later on, I’ll want to count how many NA’s are in each column. If there are complete rows of NA, that will mess with those counts. The full NA rows shouldn’t be in there at all. To make sure we remove rows that have no student data at all, we’ll check for rows where the name is “NA” and remove those. Then we know for sure that we didn’t accidentally remove a student who is in the data, but is missing a GPA in some semesters.

#Remove rows that do not have any actual mentee info
Students <- Students %>%
  filter(`Mentee Name` != "NA")

Here’s what the dataset looks like so far:

#Show row 1, 30, and 50 (for a variety of different types of mentees)
Students[c(1,30, 50),] %>% pander()

Table continues below
semester_mentored	Mentee Name	Status	I-Number	Student_Type
Fall 2018	Student 14	Active Mentoring	107937214	Tier 3
Fall 2018	Student 41	Active Mentoring	362332643	Tier 3
Fall 2018	Student 29	Fulfilled Program	480249861	Tier 3

Pre_1	During	Post_1	Post_2	Group
0	2.1	0.6	0	Mentored
1.3	2.8	1.2	NA	Mentored
1.8	2.7	0.2	NA	Mentored

Optional Wrangling to Aid in Plotting

Later on when I create plots, I want the Status column to appear in a specific order in facets or on an axis. I can specify this order using fct_relevel() from the forcats package, which is included in the tidyverse.

#Reorder the status values so they are in the order we want for plots
Students <- Students %>%
  mutate(Status = fct_relevel(Status,
                              "Attempted",
                              "Fulfilled Program",
                              "Dropped",
                              "Referred to HJG/LS",
                              "Active Mentoring"))

Creating Sample Size Labels

I want to know sample sizes of each group/status and have them ready so I can easily label my plots later on. To do this, I’ll first create small dataframes using group_by() and summarise() that contain a column for the sample group and a column for the size of that sample group.

#Number of students in each Type
sample_size_type = Students %>%
  group_by(Student_Type) %>%
  summarise(Type_size = n()) 

sample_size_type %>% 
  pander()

Student_Type	Type_size
Referral	14
Tier 3	369

#Number of students in each Type/semester mentored
sample_size_type_sem = Students %>%
  group_by(Student_Type, semester_mentored) %>%
  summarise(Type_sem_size = n()) 

sample_size_type_sem %>% 
  pander()

Student_Type	semester_mentored	Type_sem_size
Referral	Fall 2018	12
Referral	Winter 2019	2
Tier 3	Fall 2018	200
Tier 3	Winter 2019	169

## Group Sizes ##

#Number of students in each Group
sample_size_group = Students %>%
  group_by(Group) %>%
  summarise(Group_size = n())

sample_size_group %>% 
  pander()

Group	Group_size
Control	169
Mentored	214

#Number of students in each Group/semester mentored
sample_size_group_sem = Students %>%
  group_by(Group, semester_mentored) %>%
  summarise(Group_sem_size = n()) 

sample_size_group_sem %>% 
  pander()

Group	semester_mentored	Group_sem_size
Control	Fall 2018	98
Control	Winter 2019	71
Mentored	Fall 2018	114
Mentored	Winter 2019	100

## Status Sizes ##

#Number of students in each Status
sample_size_status = Students %>%
  group_by(Status) %>%
  summarise(Status_size = n()) 

sample_size_status %>% 
  pander()

Status	Status_size
Attempted	169
Fulfilled Program	72
Dropped	42
Referred to HJG/LS	21
Active Mentoring	79

#Number of students in each Status/semester mentored
sample_size_status_sem = Students %>%
  group_by(Status, semester_mentored) %>%
  summarise(Status_sem_size = n()) 

sample_size_status_sem %>% 
  pander()

Status	semester_mentored	Status_sem_size
Attempted	Fall 2018	98
Attempted	Winter 2019	71
Fulfilled Program	Fall 2018	36
Fulfilled Program	Winter 2019	36
Dropped	Fall 2018	13
Dropped	Winter 2019	29
Referred to HJG/LS	Fall 2018	9
Referred to HJG/LS	Winter 2019	12
Active Mentoring	Fall 2018	56
Active Mentoring	Winter 2019	23

Next, I used left_join() to add each sample size dataframe onto my original Students dataframe. I also used mutate() to make a new column with the text for the label, using the value that we just added on. After that, I no longer need the sample size dataframes, so I removed them from my environment.

#### Add sample size labels to my dataset ####

Students <- Students %>%
 
  ## Student Type ##
  
  #Add on Type sizes
  left_join(sample_size_type) %>%
  
  #make new column with Type size labels
  mutate(Type_size_label = paste0(Student_Type, "\n", "n = ", Type_size)) %>%
  
  #Add on Type size / semester mentored
  left_join(sample_size_type_sem) %>%
  
  #make new column with Type size/semester mentored labels
  mutate(Type_sem_size_label = paste0(Student_Type, "\n", semester_mentored, "\n", "n = ", Type_sem_size)) %>%
  
  
  ## Group ##
  
  #Add on Group sizes
  left_join(sample_size_group) %>%
  
  #make new column with Group size labels
  mutate(Group_size_label = paste0(Group, "\n", "n = ", Group_size)) %>%
  
  #Add on Groups size / semester mentored
  left_join(sample_size_group_sem) %>%
  
  #make new column with Group size/semester mentored labels
  mutate(Group_sem_size_label = paste0(Group, "\n", semester_mentored, "\n", "n = ", Group_sem_size)) %>%
  
  
  ## Status ##
  
  #Add on Status sizes
  left_join(sample_size_status) %>%
  
  #make new column with Status size labels
  mutate(Status_size_label = paste0(Status, "\n", "n = ", Status_size)) %>%
  
  #Add on Status sizes / semester mentored
  left_join(sample_size_status_sem) %>%
  
  #make new column with Status size labels
  mutate(Status_sem_size_label = paste0(Status, "\n", semester_mentored, "\n", "n = ", Status_sem_size))



### Remove the sample size datasets to clean up my environment ###
rm(sample_size_type,
   sample_size_type_sem,
   sample_size_group,
   sample_size_group_sem,
   sample_size_status,
   sample_size_status_sem)

The dataset now has a ton of new columns:

Students[c(1,30),] %>% pander()

Table continues below
semester_mentored	Mentee Name	Status	I-Number	Student_Type
Fall 2018	Student 14	Active Mentoring	107937214	Tier 3
Fall 2018	Student 41	Active Mentoring	362332643	Tier 3

Table continues below
Pre_1	During	Post_1	Post_2	Group	Type_size	Type_size_label
0	2.1	0.6	0	Mentored	369	Tier 3 n = 369
1.3	2.8	1.2	NA	Mentored	369	Tier 3 n = 369

Table continues below
Type_sem_size	Type_sem_size_label	Group_size	Group_size_label
200	Tier 3 Fall 2018 n = 200	214	Mentored n = 214
200	Tier 3 Fall 2018 n = 200	214	Mentored n = 214

Table continues below
Group_sem_size	Group_sem_size_label	Status_size
114	Mentored Fall 2018 n = 114	79
114	Mentored Fall 2018 n = 114	79

Status_size_label	Status_sem_size	Status_sem_size_label
Active Mentoring n = 79	56	Active Mentoring Fall 2018 n = 56
Active Mentoring n = 79	56	Active Mentoring Fall 2018 n = 56

Now I want to just keep the columns that are useful to me using select(). This will also take care of removing the student identifiers.

Students_new <- Students %>%
  select(Pre_1,
         During,
         Post_1,
         Post_2,
         semester_mentored,
         Student_Type,
         Group,
         Status,
         Type_size_label,
         Type_sem_size_label,
         Group_size_label,
         Group_sem_size_label,
         Status_size_label,
         Status_sem_size_label)

Wide and Long Formats

Our data is currently in a “wide” format. It is technically not tidy, but the wide format will be useful in some cases, so I want to keep this. Later on, I’ll be looking at each student’s change in GPA from Pre_1 to Post_1. I’ll create a column with those values now so they are ready to go when I need them.

# Wide format data with a new column that is the change in GPA from pre1 to post 1
Students_wide <- Students_new %>%
  mutate(Change = Post_1 - Pre_1)

Students_wide[c(1,30),] %>% pander()

Table continues below
Pre_1	During	Post_1	Post_2	semester_mentored	Student_Type	Group
0	2.1	0.6	0	Fall 2018	Tier 3	Mentored
1.3	2.8	1.2	NA	Fall 2018	Tier 3	Mentored

Table continues below
Status	Type_size_label	Type_sem_size_label
Active Mentoring	Tier 3 n = 369	Tier 3 Fall 2018 n = 200
Active Mentoring	Tier 3 n = 369	Tier 3 Fall 2018 n = 200

Table continues below
Group_size_label	Group_sem_size_label	Status_size_label
Mentored n = 214	Mentored Fall 2018 n = 114	Active Mentoring n = 79
Mentored n = 214	Mentored Fall 2018 n = 114	Active Mentoring n = 79

Status_sem_size_label	Change
Active Mentoring Fall 2018 n = 56	0.6
Active Mentoring Fall 2018 n = 56	-0.1

The tidy-est version of our data will be the “long” format. Pre_1, During, Post_1, and Post_2 could be considered variables, but they are really values for the variable semester. I’ll use pivot_longer() to turn those 4 semester GPA columns into two columns - semester and GPA, and then fct_relevel() to specify the order of the semesters.

#We want pivot the data so that each row is only one GPA measurement (one student+one semester)
Students_long <- Students_new  %>%
  pivot_longer(cols = c(`Pre_1`, `During`, `Post_1`, `Post_2`), 
               names_to = "semester", 
               values_to = "GPA")

#Reorder the semester values so they are in the right order
Students_long <- Students_long %>%
  mutate(semester = fct_relevel(semester,
                                "Pre_1",
                                "During",
                                "Post_1",
                                "Post_2"))
Students_long %>% head(4) %>% pander()

Table continues below
semester_mentored	Student_Type	Group	Status
Fall 2018	Tier 3	Mentored	Active Mentoring
Fall 2018	Tier 3	Mentored	Active Mentoring
Fall 2018	Tier 3	Mentored	Active Mentoring
Fall 2018	Tier 3	Mentored	Active Mentoring

Table continues below
Type_size_label	Type_sem_size_label	Group_size_label
Tier 3 n = 369	Tier 3 Fall 2018 n = 200	Mentored n = 214
Tier 3 n = 369	Tier 3 Fall 2018 n = 200	Mentored n = 214
Tier 3 n = 369	Tier 3 Fall 2018 n = 200	Mentored n = 214
Tier 3 n = 369	Tier 3 Fall 2018 n = 200	Mentored n = 214

Table continues below
Group_sem_size_label	Status_size_label
Mentored Fall 2018 n = 114	Active Mentoring n = 79
Mentored Fall 2018 n = 114	Active Mentoring n = 79
Mentored Fall 2018 n = 114	Active Mentoring n = 79
Mentored Fall 2018 n = 114	Active Mentoring n = 79

Status_sem_size_label	semester	GPA
Active Mentoring Fall 2018 n = 56	Pre_1	0
Active Mentoring Fall 2018 n = 56	During	2.1
Active Mentoring Fall 2018 n = 56	Post_1	0.6
Active Mentoring Fall 2018 n = 56	Post_2	0

Our new tidy dataframe Student_long holds all students with the following columns:

semester_mentored : Semester they participated in (or opted out of) mentoring (i.e. Fall 2018)
Status : Mentoring Status (Attempted, Active Mentoring, Dropped, Fulfilled Program, or Referred to HJG/LS)
Student Type : Mentoring criteria that student falls under (Tier 3 or Referral)
Group : Control or Mentored
semester : The semester of that GPA measurement (Pre_1, During, Post_1, or Post_2)
GPA : Student’s GPA that semester

(Plus the size labels for Type, Group, and Status)

Moving Forward

Now that the data is tidy and ready to go, I can use it to make visuals or run hypothesis tests. Wrangling can be a big part of any data science project (sometimes even the biggest part!), but it is so necessary for the rest of the project to go smoothly.

Check out the rest of my work on this project here.

r tidyverse