Wrangling Client Data

Exploratory Data Analysis - Part 1

I’m currently working on a project with data provided by BYU-Idaho Peer-Success Mentoring. Luckily, the provided data was already mostly in the format that I need for my project. However, no project is complete without at least some data wrangling.

In this post, I’ll go through all of the steps I took to wrangle the data - from reading in the raw data to my final tidy dataset. I’ll be using using the tidyverse package, specifically dplyr for data manipulation and readxl for reading in data from an excel workbook.

#load packages
library(dplyr) #for data wrangling
library(readxl) #to read in .xls and .xlsx sheets.
library(pander) #to print tables in markdown
library(tidyverse) #includes dplyr, readxl, and other packages that are in the tidyverse core

Project Background

BYU-Idaho Peer-Success Mentoring provides short-term mentoring to a variety of BYU-Idaho student populations with a goal to improve student success and retention. We’re interested in looking for possible correlation between participation in mentoring and defined measures of student success, including improved GPA.

The project will be focused on answering the following question: Do students who participate in Peer-Success Mentoring have more success (improvement in GPA) than students who don’t participate in Peer-Success Mentoring?

This question will be explored through various forms of data analysis, hypothesis testing, and inferential statistics, using data about students who met defined mentoring criteria. A Control group of students who met the criteria but did not participate in mentoring will be compared with a Mentored group of students who met the same criteria and did participate in mentoring. Click here to read more about Peer-Success Mentoring and the background for this project.

Provided Data

The data is stored in an excel workbook, where each tab represents a different group of potential mentee.

  1. Control (Students who met Tier 3 mentoring criteria but opted out of mentoring)

  2. Mentored (Students who met Tier 3 mentoring criteria and participated in mentoring)

  3. Referrals (Students who were referred to the mentoring program by a faculty member)

Tier 3

Most students in the provided data are classified as “Tier 3”. These students were all offered mentoring and either participated or did not participate, so they are assigned into either the Mentored or Control group. Tier 3 can be defined by following: “The third tier category focuses on freshmen, in particular any freshman with a 2.0 GPA or lower after their second semester. These students are referred to the [Peer-Success] Mentoring program, where they receive a personal student mentor’s help in addition to their academic advisor.” source

Referral

A smaller number of students in this data are classified as “Referral”. These students were referred to Peer-Success Mentoring by a professor, friend, faculty member, or other person, or they signed up for mentoring on their own accord. These students do not necessarily meet the Tier 3 criteria, so the impact of mentoring might be different.

Click to hide definitions

(Note: Mentored and Referrals are both made up of students who did participate in mentoring, so we will actually only have two groups: Control and Mentored.)

Each tab in the workbook follows the same format, where each row is one student with the following columns:

  1. Semester Mentored : Semester they participated in (or opted out of) mentoring (i.e. Fall 2018)

  2. Mentee Name: Student’s name (Lastname, Firstname)

  3. Status : Mentoring Status (Active Mentoring, Dropped, Fulfilled Program, or Referred to HJG/LS)

Status Definitions

Every student that meets mentoring criteria is assigned a “status” that describes where the student is at in the mentoring process. Their mentor adjusts this status as the student moves through the process. In this project, each student’s status is the status they were marked as at the end of their mentoring semester. The following statuses are included in this project:

Attempted

Students classified as “Attempted” are students who met Tier 3 criteria and were contacted by a Peer-Success Mentor but did not participate in mentoring. They could have declined mentoring or could have not responded to the mentor’s contact attempts. These students are all in the control group.

Fulfilled Program

These students completed the 3-4 mentoring sessions in the program.

Dropped

These students were dropped from the mentoring program for a variety of reasons:

  • They attended one or more mentoring sessions and informed the mentor that they wanted to discontinue mentoring.

  • They attended one or more mentoring sessions but no-showed for their remaining sessions without communicating to the mentor that they wanted to discontinue mentoring

  • They enrolled in mentoring but no-showed for their first session without communicating to the mentor that they wanted to discontinue mentoring

Referred to HJG/LS

These students were referred to other peer-mentoring programs, including Heber J. Grant Mentoring and Life-Skills Mentoring. These students could have met with a Peer-Success Mentor one or more times before being referred to another program, or they could have been referred without meeting with a Peer-Success Mentor.

Active Mentoring

These students enrolled in Peer-Success Mentoring, but for some reason were never marked as “Fulfilled Program”. This could be due to a variety of reasons:

  • The student signed up for mentoring, but never attended any of their sessions

  • They attended one or more mentoring sessions but then “no-showed” for their remaining sessions without communicating to the mentor that they wanted to discontinue mentoring

  • The student actively participated in mentoring, but enrolled too late in the semester to complete enough sessions to be marked as “fulfilled” (Each semester, the mentors start over with a new list of Tier 3 students)

  • The student was moved to another status (Fulfilled, Dropped, Referred to HJG/LS), but the mentor forgot to change the student’s status in the records.

Because of this, it is hard to know the experience that these students had. They probably do not all belong in the same group.

Click to hide definitions of each Status

  1. I-Number : Student’s unique ID number

  2. Student Type : Mentoring criteria that student falls under (Tier 3 or Referral)

  3. Pre 1 : Student’s GPA the semester before the semester mentored

  4. Semester Mentored : Student’s GPA the semester mentored

  5. Post 1 : Student’s GPA the semester after the semester mentored

  6. Post 2 : Student’s GPA 2 semesters after the semester mentored

Even before reading in the data, I noticed some changes that needed to be made:

  • Two columns have the same name, “Semester Mentored”.
  • There are 3 tabs, but there are really two overlapping sets of two groups: Control vs Mentored and Tier 3 vs Referral. The three tabs don’t group the students in the way we need.
  • This data includes student identifiers, such as their name and their student ID number, which need to be removed for privacy reasons.
  • Some rows don’t have any student data in them.
  • The column names contain spaces, which will make referencing them harder.

Read in the Data using read_excel

The read_excel() function from the readxl package can read in data from an excel workbook to store in a dataframe.

#Read in the data from the client's excel sheet
Mentored <- read_excel("StudentGPAs.xlsx", sheet = "Tier 3 Mentored", skip = 1)
Control <- read_excel("StudentGPAs.xlsx", sheet = "Tier 3 Control", skip = 1)
Referrals <- read_excel("StudentGPAs.xlsx", sheet = "Referrals", skip = 1)

Mentored %>% head() %>% pander::pander()
Table continues below
Semester Mentored…1Mentee NameStatusI-Number
Fall 2018Student 14Active Mentoring107937214
Fall 2018Student 57Active Mentoring109482852
Winter 2019NAAttemptedNA
Fall 2018Student 73Active Mentoring115113476
Fall 2018Student 88Referred to HJG/LS1.29e+08
Fall 2018Student 71Active Mentoring136664545
Student TypePre 1Semester Mentored…7Post 1Post 2
Tier 302.10.60
Tier 31.32.50.7NA
Tier 3NANANANA
Tier 30.63.63.1NA
Tier 31.72.33NA
Tier 303.50.2NA

(For the sake of this post, I created fabricated data in the same format as the real data).

Combine the 3 Datasets

Because the three groups overlap a bit, let’s add a new column to specify whether that student is Mentored or Control. I used the mutate() function from the dplyr package for this.

#Add a column to specify which group that student belongs to.
Mentored <- Mentored %>%
  mutate("Group" = "Mentored")

Control <- Control %>%
  mutate("Group" = "Control")

Referrals <- Referrals %>%
  mutate("Group" = "Mentored")

Now that all three datasets have a Group column, let’s combine them into one dataframe using rbind() function. While we’re at it, let’s also rename the columns to make things a bit easier going forward.

#Combine both groups into one dataframe
Students <- rbind(Mentored, Control, Referrals) %>%
  
  #rename some columns
  rename(semester_mentored = `Semester Mentored...1`,
         `Pre_1` = `Pre 1`, 
         `During` = `Semester Mentored...7`, 
         `Post_1` = `Post 1`, 
         `Post_2` = `Post 2`,
        `Student_Type` = `Student Type`)

Finally, we need to remove the rows that have no data in them. The main reason for doing this is that later on, I’ll want to count how many NA’s are in each column. If there are complete rows of NA, that will mess with those counts. The full NA rows shouldn’t be in there at all. To make sure we remove rows that have no student data at all, we’ll check for rows where the name is “NA” and remove those. Then we know for sure that we didn’t accidentally remove a student who is in the data, but is missing a GPA in some semesters.

#Remove rows that do not have any actual mentee info
Students <- Students %>%
  filter(`Mentee Name` != "NA")

Here’s what the dataset looks like so far:

#Show row 1, 30, and 50 (for a variety of different types of mentees)
Students[c(1,30, 50),] %>% pander()
Table continues below
semester_mentoredMentee NameStatusI-NumberStudent_Type
Fall 2018Student 14Active Mentoring107937214Tier 3
Fall 2018Student 41Active Mentoring362332643Tier 3
Fall 2018Student 29Fulfilled Program480249861Tier 3
Pre_1DuringPost_1Post_2Group
02.10.60Mentored
1.32.81.2NAMentored
1.82.70.2NAMentored

Optional Wrangling to Aid in Plotting

Later on when I create plots, I want the Status column to appear in a specific order in facets or on an axis. I can specify this order using fct_relevel() from the forcats package, which is included in the tidyverse.

#Reorder the status values so they are in the order we want for plots
Students <- Students %>%
  mutate(Status = fct_relevel(Status,
                              "Attempted",
                              "Fulfilled Program",
                              "Dropped",
                              "Referred to HJG/LS",
                              "Active Mentoring"))

Creating Sample Size Labels

I want to know sample sizes of each group/status and have them ready so I can easily label my plots later on. To do this, I’ll first create small dataframes using group_by() and summarise() that contain a column for the sample group and a column for the size of that sample group.

#Number of students in each Type
sample_size_type = Students %>%
  group_by(Student_Type) %>%
  summarise(Type_size = n()) 

sample_size_type %>% 
  pander()
Student_TypeType_size
Referral14
Tier 3369
#Number of students in each Type/semester mentored
sample_size_type_sem = Students %>%
  group_by(Student_Type, semester_mentored) %>%
  summarise(Type_sem_size = n()) 

sample_size_type_sem %>% 
  pander()
Student_Typesemester_mentoredType_sem_size
ReferralFall 201812
ReferralWinter 20192
Tier 3Fall 2018200
Tier 3Winter 2019169
## Group Sizes ##

#Number of students in each Group
sample_size_group = Students %>%
  group_by(Group) %>%
  summarise(Group_size = n())

sample_size_group %>% 
  pander()
GroupGroup_size
Control169
Mentored214
#Number of students in each Group/semester mentored
sample_size_group_sem = Students %>%
  group_by(Group, semester_mentored) %>%
  summarise(Group_sem_size = n()) 

sample_size_group_sem %>% 
  pander()
Groupsemester_mentoredGroup_sem_size
ControlFall 201898
ControlWinter 201971
MentoredFall 2018114
MentoredWinter 2019100
## Status Sizes ##

#Number of students in each Status
sample_size_status = Students %>%
  group_by(Status) %>%
  summarise(Status_size = n()) 

sample_size_status %>% 
  pander()
StatusStatus_size
Attempted169
Fulfilled Program72
Dropped42
Referred to HJG/LS21
Active Mentoring79
#Number of students in each Status/semester mentored
sample_size_status_sem = Students %>%
  group_by(Status, semester_mentored) %>%
  summarise(Status_sem_size = n()) 

sample_size_status_sem %>% 
  pander()
Statussemester_mentoredStatus_sem_size
AttemptedFall 201898
AttemptedWinter 201971
Fulfilled ProgramFall 201836
Fulfilled ProgramWinter 201936
DroppedFall 201813
DroppedWinter 201929
Referred to HJG/LSFall 20189
Referred to HJG/LSWinter 201912
Active MentoringFall 201856
Active MentoringWinter 201923

Next, I used left_join() to add each sample size dataframe onto my original Students dataframe. I also used mutate() to make a new column with the text for the label, using the value that we just added on. After that, I no longer need the sample size dataframes, so I removed them from my environment.

#### Add sample size labels to my dataset ####

Students <- Students %>%
 
  ## Student Type ##
  
  #Add on Type sizes
  left_join(sample_size_type) %>%
  
  #make new column with Type size labels
  mutate(Type_size_label = paste0(Student_Type, "\n", "n = ", Type_size)) %>%
  
  #Add on Type size / semester mentored
  left_join(sample_size_type_sem) %>%
  
  #make new column with Type size/semester mentored labels
  mutate(Type_sem_size_label = paste0(Student_Type, "\n", semester_mentored, "\n", "n = ", Type_sem_size)) %>%
  
  
  ## Group ##
  
  #Add on Group sizes
  left_join(sample_size_group) %>%
  
  #make new column with Group size labels
  mutate(Group_size_label = paste0(Group, "\n", "n = ", Group_size)) %>%
  
  #Add on Groups size / semester mentored
  left_join(sample_size_group_sem) %>%
  
  #make new column with Group size/semester mentored labels
  mutate(Group_sem_size_label = paste0(Group, "\n", semester_mentored, "\n", "n = ", Group_sem_size)) %>%
  
  
  ## Status ##
  
  #Add on Status sizes
  left_join(sample_size_status) %>%
  
  #make new column with Status size labels
  mutate(Status_size_label = paste0(Status, "\n", "n = ", Status_size)) %>%
  
  #Add on Status sizes / semester mentored
  left_join(sample_size_status_sem) %>%
  
  #make new column with Status size labels
  mutate(Status_sem_size_label = paste0(Status, "\n", semester_mentored, "\n", "n = ", Status_sem_size))



### Remove the sample size datasets to clean up my environment ###
rm(sample_size_type,
   sample_size_type_sem,
   sample_size_group,
   sample_size_group_sem,
   sample_size_status,
   sample_size_status_sem)

The dataset now has a ton of new columns:

Students[c(1,30),] %>% pander()
Table continues below
semester_mentoredMentee NameStatusI-NumberStudent_Type
Fall 2018Student 14Active Mentoring107937214Tier 3
Fall 2018Student 41Active Mentoring362332643Tier 3
Table continues below
Pre_1DuringPost_1Post_2GroupType_sizeType_size_label
02.10.60Mentored369Tier 3 n = 369
1.32.81.2NAMentored369Tier 3 n = 369
Table continues below
Type_sem_sizeType_sem_size_labelGroup_sizeGroup_size_label
200Tier 3 Fall 2018 n = 200214Mentored n = 214
200Tier 3 Fall 2018 n = 200214Mentored n = 214
Table continues below
Group_sem_sizeGroup_sem_size_labelStatus_size
114Mentored Fall 2018 n = 11479
114Mentored Fall 2018 n = 11479
Status_size_labelStatus_sem_sizeStatus_sem_size_label
Active Mentoring n = 7956Active Mentoring Fall 2018 n = 56
Active Mentoring n = 7956Active Mentoring Fall 2018 n = 56

Now I want to just keep the columns that are useful to me using select(). This will also take care of removing the student identifiers.

Students_new <- Students %>%
  select(Pre_1,
         During,
         Post_1,
         Post_2,
         semester_mentored,
         Student_Type,
         Group,
         Status,
         Type_size_label,
         Type_sem_size_label,
         Group_size_label,
         Group_sem_size_label,
         Status_size_label,
         Status_sem_size_label)

Wide and Long Formats

Our data is currently in a “wide” format. It is technically not tidy, but the wide format will be useful in some cases, so I want to keep this. Later on, I’ll be looking at each student’s change in GPA from Pre_1 to Post_1. I’ll create a column with those values now so they are ready to go when I need them.

# Wide format data with a new column that is the change in GPA from pre1 to post 1
Students_wide <- Students_new %>%
  mutate(Change = Post_1 - Pre_1)

Students_wide[c(1,30),] %>% pander()
Table continues below
Pre_1DuringPost_1Post_2semester_mentoredStudent_TypeGroup
02.10.60Fall 2018Tier 3Mentored
1.32.81.2NAFall 2018Tier 3Mentored
Table continues below
StatusType_size_labelType_sem_size_label
Active MentoringTier 3 n = 369Tier 3 Fall 2018 n = 200
Active MentoringTier 3 n = 369Tier 3 Fall 2018 n = 200
Table continues below
Group_size_labelGroup_sem_size_labelStatus_size_label
Mentored n = 214Mentored Fall 2018 n = 114Active Mentoring n = 79
Mentored n = 214Mentored Fall 2018 n = 114Active Mentoring n = 79
Status_sem_size_labelChange
Active Mentoring Fall 2018 n = 560.6
Active Mentoring Fall 2018 n = 56-0.1

The tidy-est version of our data will be the “long” format. Pre_1, During, Post_1, and Post_2 could be considered variables, but they are really values for the variable semester. I’ll use pivot_longer() to turn those 4 semester GPA columns into two columns - semester and GPA, and then fct_relevel() to specify the order of the semesters.

#We want pivot the data so that each row is only one GPA measurement (one student+one semester)
Students_long <- Students_new  %>%
  pivot_longer(cols = c(`Pre_1`, `During`, `Post_1`, `Post_2`), 
               names_to = "semester", 
               values_to = "GPA")

#Reorder the semester values so they are in the right order
Students_long <- Students_long %>%
  mutate(semester = fct_relevel(semester,
                                "Pre_1",
                                "During",
                                "Post_1",
                                "Post_2"))
Students_long %>% head(4) %>% pander()
Table continues below
semester_mentoredStudent_TypeGroupStatus
Fall 2018Tier 3MentoredActive Mentoring
Fall 2018Tier 3MentoredActive Mentoring
Fall 2018Tier 3MentoredActive Mentoring
Fall 2018Tier 3MentoredActive Mentoring
Table continues below
Type_size_labelType_sem_size_labelGroup_size_label
Tier 3 n = 369Tier 3 Fall 2018 n = 200Mentored n = 214
Tier 3 n = 369Tier 3 Fall 2018 n = 200Mentored n = 214
Tier 3 n = 369Tier 3 Fall 2018 n = 200Mentored n = 214
Tier 3 n = 369Tier 3 Fall 2018 n = 200Mentored n = 214
Table continues below
Group_sem_size_labelStatus_size_label
Mentored Fall 2018 n = 114Active Mentoring n = 79
Mentored Fall 2018 n = 114Active Mentoring n = 79
Mentored Fall 2018 n = 114Active Mentoring n = 79
Mentored Fall 2018 n = 114Active Mentoring n = 79
Status_sem_size_labelsemesterGPA
Active Mentoring Fall 2018 n = 56Pre_10
Active Mentoring Fall 2018 n = 56During2.1
Active Mentoring Fall 2018 n = 56Post_10.6
Active Mentoring Fall 2018 n = 56Post_20

Our new tidy dataframe Student_long holds all students with the following columns:

  • semester_mentored : Semester they participated in (or opted out of) mentoring (i.e. Fall 2018)

  • Status : Mentoring Status (Attempted, Active Mentoring, Dropped, Fulfilled Program, or Referred to HJG/LS)

  • Student Type : Mentoring criteria that student falls under (Tier 3 or Referral)

  • Group : Control or Mentored

  • semester : The semester of that GPA measurement (Pre_1, During, Post_1, or Post_2)

  • GPA : Student’s GPA that semester

(Plus the size labels for Type, Group, and Status)

Moving Forward

Now that the data is tidy and ready to go, I can use it to make visuals or run hypothesis tests. Wrangling can be a big part of any data science project (sometimes even the biggest part!), but it is so necessary for the rest of the project to go smoothly.

Check out the rest of my work on this project here.

Aubrey Shuga
Aubrey Shuga
Data Science Student

Related