Wrangling Client Data
Exploratory Data Analysis - Part 1
I’m currently working on a project with data provided by BYU-Idaho Peer-Success Mentoring. Luckily, the provided data was already mostly in the format that I need for my project. However, no project is complete without at least some data wrangling.
In this post, I’ll go through all of the steps I took to wrangle the data - from reading in the raw data to my final tidy dataset. I’ll be using using the tidyverse package, specifically dplyr for data manipulation and readxl for reading in data from an excel workbook.
#load packages
library(dplyr) #for data wrangling
library(readxl) #to read in .xls and .xlsx sheets.
library(pander) #to print tables in markdown
library(tidyverse) #includes dplyr, readxl, and other packages that are in the tidyverse core
Project Background
BYU-Idaho Peer-Success Mentoring provides short-term mentoring to a variety of BYU-Idaho student populations with a goal to improve student success and retention. We’re interested in looking for possible correlation between participation in mentoring and defined measures of student success, including improved GPA.
The project will be focused on answering the following question: Do students who participate in Peer-Success Mentoring have more success (improvement in GPA) than students who don’t participate in Peer-Success Mentoring?
This question will be explored through various forms of data analysis, hypothesis testing, and inferential statistics, using data about students who met defined mentoring criteria. A Control group of students who met the criteria but did not participate in mentoring will be compared with a Mentored group of students who met the same criteria and did participate in mentoring. Click here to read more about Peer-Success Mentoring and the background for this project.
Provided Data
The data is stored in an excel workbook, where each tab represents a different group of potential mentee.
Control (Students who met Tier 3 mentoring criteria but opted out of mentoring)
Mentored (Students who met Tier 3 mentoring criteria and participated in mentoring)
Referrals (Students who were referred to the mentoring program by a faculty member)
Tier 3
Most students in the provided data are classified as “Tier 3”. These students were all offered mentoring and either participated or did not participate, so they are assigned into either the Mentored or Control group. Tier 3 can be defined by following: “The third tier category focuses on freshmen, in particular any freshman with a 2.0 GPA or lower after their second semester. These students are referred to the [Peer-Success] Mentoring program, where they receive a personal student mentor’s help in addition to their academic advisor.” source
Referral
A smaller number of students in this data are classified as “Referral”. These students were referred to Peer-Success Mentoring by a professor, friend, faculty member, or other person, or they signed up for mentoring on their own accord. These students do not necessarily meet the Tier 3 criteria, so the impact of mentoring might be different.
(Note: Mentored and Referrals are both made up of students who did participate in mentoring, so we will actually only have two groups: Control and Mentored.)
Each tab in the workbook follows the same format, where each row is one student with the following columns:
Semester Mentored
: Semester they participated in (or opted out of) mentoring (i.e. Fall 2018)Mentee Name
: Student’s name (Lastname, Firstname)Status
: Mentoring Status (Active Mentoring, Dropped, Fulfilled Program, or Referred to HJG/LS)
Status Definitions
Every student that meets mentoring criteria is assigned a “status” that describes where the student is at in the mentoring process. Their mentor adjusts this status as the student moves through the process. In this project, each student’s status is the status they were marked as at the end of their mentoring semester. The following statuses are included in this project:
Attempted
Students classified as “Attempted” are students who met Tier 3 criteria and were contacted by a Peer-Success Mentor but did not participate in mentoring. They could have declined mentoring or could have not responded to the mentor’s contact attempts. These students are all in the control group.
Fulfilled Program
These students completed the 3-4 mentoring sessions in the program.
Dropped
These students were dropped from the mentoring program for a variety of reasons:
They attended one or more mentoring sessions and informed the mentor that they wanted to discontinue mentoring.
They attended one or more mentoring sessions but no-showed for their remaining sessions without communicating to the mentor that they wanted to discontinue mentoring
They enrolled in mentoring but no-showed for their first session without communicating to the mentor that they wanted to discontinue mentoring
Referred to HJG/LS
These students were referred to other peer-mentoring programs, including Heber J. Grant Mentoring and Life-Skills Mentoring. These students could have met with a Peer-Success Mentor one or more times before being referred to another program, or they could have been referred without meeting with a Peer-Success Mentor.
Active Mentoring
These students enrolled in Peer-Success Mentoring, but for some reason were never marked as “Fulfilled Program”. This could be due to a variety of reasons:
The student signed up for mentoring, but never attended any of their sessions
They attended one or more mentoring sessions but then “no-showed” for their remaining sessions without communicating to the mentor that they wanted to discontinue mentoring
The student actively participated in mentoring, but enrolled too late in the semester to complete enough sessions to be marked as “fulfilled” (Each semester, the mentors start over with a new list of Tier 3 students)
The student was moved to another status (Fulfilled, Dropped, Referred to HJG/LS), but the mentor forgot to change the student’s status in the records.
Because of this, it is hard to know the experience that these students had. They probably do not all belong in the same group.
I-Number
: Student’s unique ID numberStudent Type
: Mentoring criteria that student falls under (Tier 3 or Referral)Pre 1
: Student’s GPA the semester before the semester mentoredSemester Mentored
: Student’s GPA the semester mentoredPost 1
: Student’s GPA the semester after the semester mentoredPost 2
: Student’s GPA 2 semesters after the semester mentored
Even before reading in the data, I noticed some changes that needed to be made:
- Two columns have the same name, “Semester Mentored”.
- There are 3 tabs, but there are really two overlapping sets of two groups: Control vs Mentored and Tier 3 vs Referral. The three tabs don’t group the students in the way we need.
- This data includes student identifiers, such as their name and their student ID number, which need to be removed for privacy reasons.
- Some rows don’t have any student data in them.
- The column names contain spaces, which will make referencing them harder.
Read in the Data using read_excel
The read_excel()
function from the readxl package can read in data from an excel workbook to store in a dataframe.
#Read in the data from the client's excel sheet
Mentored <- read_excel("StudentGPAs.xlsx", sheet = "Tier 3 Mentored", skip = 1)
Control <- read_excel("StudentGPAs.xlsx", sheet = "Tier 3 Control", skip = 1)
Referrals <- read_excel("StudentGPAs.xlsx", sheet = "Referrals", skip = 1)
Mentored %>% head() %>% pander::pander()
Semester Mentored…1 | Mentee Name | Status | I-Number |
---|---|---|---|
Fall 2018 | Student 14 | Active Mentoring | 107937214 |
Fall 2018 | Student 57 | Active Mentoring | 109482852 |
Winter 2019 | NA | Attempted | NA |
Fall 2018 | Student 73 | Active Mentoring | 115113476 |
Fall 2018 | Student 88 | Referred to HJG/LS | 1.29e+08 |
Fall 2018 | Student 71 | Active Mentoring | 136664545 |
Student Type | Pre 1 | Semester Mentored…7 | Post 1 | Post 2 |
---|---|---|---|---|
Tier 3 | 0 | 2.1 | 0.6 | 0 |
Tier 3 | 1.3 | 2.5 | 0.7 | NA |
Tier 3 | NA | NA | NA | NA |
Tier 3 | 0.6 | 3.6 | 3.1 | NA |
Tier 3 | 1.7 | 2.3 | 3 | NA |
Tier 3 | 0 | 3.5 | 0.2 | NA |
(For the sake of this post, I created fabricated data in the same format as the real data).
Combine the 3 Datasets
Because the three groups overlap a bit, let’s add a new column to specify whether that student is Mentored or Control. I used the mutate()
function from the dplyr package for this.
#Add a column to specify which group that student belongs to.
Mentored <- Mentored %>%
mutate("Group" = "Mentored")
Control <- Control %>%
mutate("Group" = "Control")
Referrals <- Referrals %>%
mutate("Group" = "Mentored")
Now that all three datasets have a Group
column, let’s combine them into one dataframe using rbind()
function. While we’re at it, let’s also rename the columns to make things a bit easier going forward.
#Combine both groups into one dataframe
Students <- rbind(Mentored, Control, Referrals) %>%
#rename some columns
rename(semester_mentored = `Semester Mentored...1`,
`Pre_1` = `Pre 1`,
`During` = `Semester Mentored...7`,
`Post_1` = `Post 1`,
`Post_2` = `Post 2`,
`Student_Type` = `Student Type`)
Finally, we need to remove the rows that have no data in them. The main reason for doing this is that later on, I’ll want to count how many NA’s are in each column. If there are complete rows of NA, that will mess with those counts. The full NA rows shouldn’t be in there at all. To make sure we remove rows that have no student data at all, we’ll check for rows where the name is “NA” and remove those. Then we know for sure that we didn’t accidentally remove a student who is in the data, but is missing a GPA in some semesters.
#Remove rows that do not have any actual mentee info
Students <- Students %>%
filter(`Mentee Name` != "NA")
Here’s what the dataset looks like so far:
#Show row 1, 30, and 50 (for a variety of different types of mentees)
Students[c(1,30, 50),] %>% pander()
semester_mentored | Mentee Name | Status | I-Number | Student_Type |
---|---|---|---|---|
Fall 2018 | Student 14 | Active Mentoring | 107937214 | Tier 3 |
Fall 2018 | Student 41 | Active Mentoring | 362332643 | Tier 3 |
Fall 2018 | Student 29 | Fulfilled Program | 480249861 | Tier 3 |
Pre_1 | During | Post_1 | Post_2 | Group |
---|---|---|---|---|
0 | 2.1 | 0.6 | 0 | Mentored |
1.3 | 2.8 | 1.2 | NA | Mentored |
1.8 | 2.7 | 0.2 | NA | Mentored |
Optional Wrangling to Aid in Plotting
Later on when I create plots, I want the Status column to appear in a specific order in facets or on an axis. I can specify this order using fct_relevel()
from the forcats
package, which is included in the tidyverse
.
#Reorder the status values so they are in the order we want for plots
Students <- Students %>%
mutate(Status = fct_relevel(Status,
"Attempted",
"Fulfilled Program",
"Dropped",
"Referred to HJG/LS",
"Active Mentoring"))
Creating Sample Size Labels
I want to know sample sizes of each group/status and have them ready so I can easily label my plots later on. To do this, I’ll first create small dataframes using group_by()
and summarise()
that contain a column for the sample group and a column for the size of that sample group.
#Number of students in each Type
sample_size_type = Students %>%
group_by(Student_Type) %>%
summarise(Type_size = n())
sample_size_type %>%
pander()
Student_Type | Type_size |
---|---|
Referral | 14 |
Tier 3 | 369 |
#Number of students in each Type/semester mentored
sample_size_type_sem = Students %>%
group_by(Student_Type, semester_mentored) %>%
summarise(Type_sem_size = n())
sample_size_type_sem %>%
pander()
Student_Type | semester_mentored | Type_sem_size |
---|---|---|
Referral | Fall 2018 | 12 |
Referral | Winter 2019 | 2 |
Tier 3 | Fall 2018 | 200 |
Tier 3 | Winter 2019 | 169 |
## Group Sizes ##
#Number of students in each Group
sample_size_group = Students %>%
group_by(Group) %>%
summarise(Group_size = n())
sample_size_group %>%
pander()
Group | Group_size |
---|---|
Control | 169 |
Mentored | 214 |
#Number of students in each Group/semester mentored
sample_size_group_sem = Students %>%
group_by(Group, semester_mentored) %>%
summarise(Group_sem_size = n())
sample_size_group_sem %>%
pander()
Group | semester_mentored | Group_sem_size |
---|---|---|
Control | Fall 2018 | 98 |
Control | Winter 2019 | 71 |
Mentored | Fall 2018 | 114 |
Mentored | Winter 2019 | 100 |
## Status Sizes ##
#Number of students in each Status
sample_size_status = Students %>%
group_by(Status) %>%
summarise(Status_size = n())
sample_size_status %>%
pander()
Status | Status_size |
---|---|
Attempted | 169 |
Fulfilled Program | 72 |
Dropped | 42 |
Referred to HJG/LS | 21 |
Active Mentoring | 79 |
#Number of students in each Status/semester mentored
sample_size_status_sem = Students %>%
group_by(Status, semester_mentored) %>%
summarise(Status_sem_size = n())
sample_size_status_sem %>%
pander()
Status | semester_mentored | Status_sem_size |
---|---|---|
Attempted | Fall 2018 | 98 |
Attempted | Winter 2019 | 71 |
Fulfilled Program | Fall 2018 | 36 |
Fulfilled Program | Winter 2019 | 36 |
Dropped | Fall 2018 | 13 |
Dropped | Winter 2019 | 29 |
Referred to HJG/LS | Fall 2018 | 9 |
Referred to HJG/LS | Winter 2019 | 12 |
Active Mentoring | Fall 2018 | 56 |
Active Mentoring | Winter 2019 | 23 |
Next, I used left_join()
to add each sample size dataframe onto my original Students dataframe. I also used mutate()
to make a new column with the text for the label, using the value that we just added on. After that, I no longer need the sample size dataframes, so I removed them from my environment.
#### Add sample size labels to my dataset ####
Students <- Students %>%
## Student Type ##
#Add on Type sizes
left_join(sample_size_type) %>%
#make new column with Type size labels
mutate(Type_size_label = paste0(Student_Type, "\n", "n = ", Type_size)) %>%
#Add on Type size / semester mentored
left_join(sample_size_type_sem) %>%
#make new column with Type size/semester mentored labels
mutate(Type_sem_size_label = paste0(Student_Type, "\n", semester_mentored, "\n", "n = ", Type_sem_size)) %>%
## Group ##
#Add on Group sizes
left_join(sample_size_group) %>%
#make new column with Group size labels
mutate(Group_size_label = paste0(Group, "\n", "n = ", Group_size)) %>%
#Add on Groups size / semester mentored
left_join(sample_size_group_sem) %>%
#make new column with Group size/semester mentored labels
mutate(Group_sem_size_label = paste0(Group, "\n", semester_mentored, "\n", "n = ", Group_sem_size)) %>%
## Status ##
#Add on Status sizes
left_join(sample_size_status) %>%
#make new column with Status size labels
mutate(Status_size_label = paste0(Status, "\n", "n = ", Status_size)) %>%
#Add on Status sizes / semester mentored
left_join(sample_size_status_sem) %>%
#make new column with Status size labels
mutate(Status_sem_size_label = paste0(Status, "\n", semester_mentored, "\n", "n = ", Status_sem_size))
### Remove the sample size datasets to clean up my environment ###
rm(sample_size_type,
sample_size_type_sem,
sample_size_group,
sample_size_group_sem,
sample_size_status,
sample_size_status_sem)
The dataset now has a ton of new columns:
Students[c(1,30),] %>% pander()
semester_mentored | Mentee Name | Status | I-Number | Student_Type |
---|---|---|---|---|
Fall 2018 | Student 14 | Active Mentoring | 107937214 | Tier 3 |
Fall 2018 | Student 41 | Active Mentoring | 362332643 | Tier 3 |
Pre_1 | During | Post_1 | Post_2 | Group | Type_size | Type_size_label |
---|---|---|---|---|---|---|
0 | 2.1 | 0.6 | 0 | Mentored | 369 | Tier 3 n = 369 |
1.3 | 2.8 | 1.2 | NA | Mentored | 369 | Tier 3 n = 369 |
Type_sem_size | Type_sem_size_label | Group_size | Group_size_label |
---|---|---|---|
200 | Tier 3 Fall 2018 n = 200 | 214 | Mentored n = 214 |
200 | Tier 3 Fall 2018 n = 200 | 214 | Mentored n = 214 |
Group_sem_size | Group_sem_size_label | Status_size |
---|---|---|
114 | Mentored Fall 2018 n = 114 | 79 |
114 | Mentored Fall 2018 n = 114 | 79 |
Status_size_label | Status_sem_size | Status_sem_size_label |
---|---|---|
Active Mentoring n = 79 | 56 | Active Mentoring Fall 2018 n = 56 |
Active Mentoring n = 79 | 56 | Active Mentoring Fall 2018 n = 56 |
Now I want to just keep the columns that are useful to me using select()
. This will also take care of removing the student identifiers.
Students_new <- Students %>%
select(Pre_1,
During,
Post_1,
Post_2,
semester_mentored,
Student_Type,
Group,
Status,
Type_size_label,
Type_sem_size_label,
Group_size_label,
Group_sem_size_label,
Status_size_label,
Status_sem_size_label)
Wide and Long Formats
Our data is currently in a “wide” format. It is technically not tidy, but the wide format will be useful in some cases, so I want to keep this. Later on, I’ll be looking at each student’s change in GPA from Pre_1 to Post_1. I’ll create a column with those values now so they are ready to go when I need them.
# Wide format data with a new column that is the change in GPA from pre1 to post 1
Students_wide <- Students_new %>%
mutate(Change = Post_1 - Pre_1)
Students_wide[c(1,30),] %>% pander()
Pre_1 | During | Post_1 | Post_2 | semester_mentored | Student_Type | Group |
---|---|---|---|---|---|---|
0 | 2.1 | 0.6 | 0 | Fall 2018 | Tier 3 | Mentored |
1.3 | 2.8 | 1.2 | NA | Fall 2018 | Tier 3 | Mentored |
Status | Type_size_label | Type_sem_size_label |
---|---|---|
Active Mentoring | Tier 3 n = 369 | Tier 3 Fall 2018 n = 200 |
Active Mentoring | Tier 3 n = 369 | Tier 3 Fall 2018 n = 200 |
Group_size_label | Group_sem_size_label | Status_size_label |
---|---|---|
Mentored n = 214 | Mentored Fall 2018 n = 114 | Active Mentoring n = 79 |
Mentored n = 214 | Mentored Fall 2018 n = 114 | Active Mentoring n = 79 |
Status_sem_size_label | Change |
---|---|
Active Mentoring Fall 2018 n = 56 | 0.6 |
Active Mentoring Fall 2018 n = 56 | -0.1 |
The tidy-est version of our data will be the “long” format. Pre_1, During, Post_1, and Post_2 could be considered variables, but they are really values for the variable semester. I’ll use pivot_longer()
to turn those 4 semester GPA columns into two columns - semester and GPA, and then fct_relevel()
to specify the order of the semesters.
#We want pivot the data so that each row is only one GPA measurement (one student+one semester)
Students_long <- Students_new %>%
pivot_longer(cols = c(`Pre_1`, `During`, `Post_1`, `Post_2`),
names_to = "semester",
values_to = "GPA")
#Reorder the semester values so they are in the right order
Students_long <- Students_long %>%
mutate(semester = fct_relevel(semester,
"Pre_1",
"During",
"Post_1",
"Post_2"))
Students_long %>% head(4) %>% pander()
semester_mentored | Student_Type | Group | Status |
---|---|---|---|
Fall 2018 | Tier 3 | Mentored | Active Mentoring |
Fall 2018 | Tier 3 | Mentored | Active Mentoring |
Fall 2018 | Tier 3 | Mentored | Active Mentoring |
Fall 2018 | Tier 3 | Mentored | Active Mentoring |
Type_size_label | Type_sem_size_label | Group_size_label |
---|---|---|
Tier 3 n = 369 | Tier 3 Fall 2018 n = 200 | Mentored n = 214 |
Tier 3 n = 369 | Tier 3 Fall 2018 n = 200 | Mentored n = 214 |
Tier 3 n = 369 | Tier 3 Fall 2018 n = 200 | Mentored n = 214 |
Tier 3 n = 369 | Tier 3 Fall 2018 n = 200 | Mentored n = 214 |
Group_sem_size_label | Status_size_label |
---|---|
Mentored Fall 2018 n = 114 | Active Mentoring n = 79 |
Mentored Fall 2018 n = 114 | Active Mentoring n = 79 |
Mentored Fall 2018 n = 114 | Active Mentoring n = 79 |
Mentored Fall 2018 n = 114 | Active Mentoring n = 79 |
Status_sem_size_label | semester | GPA |
---|---|---|
Active Mentoring Fall 2018 n = 56 | Pre_1 | 0 |
Active Mentoring Fall 2018 n = 56 | During | 2.1 |
Active Mentoring Fall 2018 n = 56 | Post_1 | 0.6 |
Active Mentoring Fall 2018 n = 56 | Post_2 | 0 |
Our new tidy dataframe Student_long holds all students with the following columns:
semester_mentored
: Semester they participated in (or opted out of) mentoring (i.e. Fall 2018)Status
: Mentoring Status (Attempted, Active Mentoring, Dropped, Fulfilled Program, or Referred to HJG/LS)Student Type
: Mentoring criteria that student falls under (Tier 3 or Referral)Group
: Control or Mentoredsemester
: The semester of that GPA measurement (Pre_1, During, Post_1, or Post_2)GPA
: Student’s GPA that semester
(Plus the size labels for Type, Group, and Status)
Moving Forward
Now that the data is tidy and ready to go, I can use it to make visuals or run hypothesis tests. Wrangling can be a big part of any data science project (sometimes even the biggest part!), but it is so necessary for the rest of the project to go smoothly.
Check out the rest of my work on this project here.