#TidyTuesday - Plastic Pollution

Intro

library(tidytuesdayR)
library(tidyverse)
library(knitr)

blogpost

Challenge intro

data

Tidy Tuesday

The Data

Tidy Tuesday provides the following code to load in the data:

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2021-01-26')
tuesdata <- tidytuesdayR::tt_load(2021, week = 5)

plastics <- tuesdata$plastics

# Or read in the data manually

plastics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-26/plastics.csv')

Since there are a few options, I decided to use the tidytuesdayR package option. That way, I can work with any other Tidy Tuesday data without having to find the link. I just need to know the week number.

#tuesdata <- tidytuesdayR::tt_load(2021, week = 5) #load in this week's data

tuesdata is a list where the first item is our dataframe, so we’ll need to extract the plastics item from the list and save it.

#plastics <- tuesdata$plastics #grab the plastics item from the list and save it as plastics (this is our data)

Note: I sometimes had to fall back on the manual option when I hit the limit on GitHub requests.

plastics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-26/plastics.csv')
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   parent_company = col_character(),
##   empty = col_double(),
##   hdpe = col_double(),
##   ldpe = col_double(),
##   o = col_double(),
##   pet = col_double(),
##   pp = col_double(),
##   ps = col_double(),
##   pvc = col_double(),
##   grand_total = col_double(),
##   num_events = col_double(),
##   volunteers = col_double()
## )
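
Switching between the two loading options by hand can also be automated. The sketch below is not part of the original workflow: `load_with_fallback()` is a hypothetical helper name, and it assumes that `tt_load()` signals an error when the request fails. The real call needs network access, so it is left as a comment and the demonstration uses stand-in loaders.

```r
# Hypothetical helper: try `primary()`, and if it errors, run `fallback()` instead.
load_with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary loader failed: ", conditionMessage(e))
      fallback()
    }
  )
}

# Intended usage (needs network access, so it is commented out here):
# plastics <- load_with_fallback(
#   primary  = function() tidytuesdayR::tt_load(2021, week = 5)$plastics,
#   fallback = function() readr::read_csv(
#     "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-26/plastics.csv"
#   )
# )

# Offline demonstration with stand-in loaders (made-up data)
demo <- load_with_fallback(
  primary  = function() stop("rate limit exceeded"),
  fallback = function() data.frame(country = "Argentina", year = 2019)
)
demo
```

Passing the loaders as functions keeps the fallback lazy: the CSV is only downloaded if the first attempt fails.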

If we peek at the data, it looks like each row is a country-year-parent_company combination. This data only contains 2019 and 2020.

plastics %>% head() %>% knitr::kable()
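| country | year | parent_company | empty | hdpe | ldpe | o | pet | pp | ps | pvc | grand_total | num_events | volunteers |
|:--|--:|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Argentina | 2019 | Grand Total | 0 | 215 | 55 | 607 | 1376 | 281 | 116 | 18 | 2668 | 4 | 243 |
| Argentina | 2019 | Unbranded | 0 | 155 | 50 | 532 | 848 | 122 | 114 | 17 | 1838 | 4 | 243 |
| Argentina | 2019 | The Coca-Cola Company | 0 | 0 | 0 | 0 | 222 | 35 | 0 | 0 | 257 | 4 | 243 |
| Argentina | 2019 | Secco | 0 | 0 | 0 | 0 | 39 | 4 | 0 | 0 | 43 | 4 | 243 |
| Argentina | 2019 | Doble Cola | 0 | 0 | 0 | 0 | 38 | 0 | 0 | 0 | 38 | 4 | 243 |
| Argentina | 2019 | Pritty | 0 | 0 | 0 | 0 | 22 | 7 | 0 | 0 | 29 | 4 | 243 |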
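
Whether each row really is a unique country-year-parent_company combination can be checked by counting rows per combination. This is a minimal sketch on made-up rows shaped like `plastics` (the `toy` tibble and its values are illustrative only); running the same `count()` on the real data would show whether any combination repeats.

```r
library(dplyr)

# Made-up rows in the same shape as `plastics` (values are illustrative only)
toy <- tibble::tibble(
  country        = c("Argentina", "Argentina", "Argentina", "Chile"),
  year           = c(2019, 2019, 2020, 2019),
  parent_company = c("Secco", "Pritty", "Secco", "Secco"),
  grand_total    = c(43, 29, 10, 5)
)

# Count rows per key combination and keep any combination that appears more than once
dupes <- toy %>%
  count(country, year, parent_company) %>%
  filter(n > 1)

# Zero rows here means the three columns uniquely identify each row of `toy`
nrow(dupes)
```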

Tidy Tuesday also provides a Data Dictionary that explains each column in the dataset:

Data Dictionary

The plastic is categorized by recycling codes.

plastics.csv

| variable | class | description |
|:--|:--|:--|
| country | character | Country of cleanup |
| year | double | Year (2019 or 2020) |
| parent_company | character | Source of plastic |
| empty | double | Category left empty count |
| hdpe | double | High density polyethylene count (Plastic milk containers, plastic bags, bottle caps, trash cans, oil cans, plastic lumber, toolboxes, supplement containers) |
| ldpe | double | Low density polyethylene count (Plastic bags, Ziploc bags, buckets, squeeze bottles, plastic tubes, chopping boards) |
| o | double | Category marked other count |
| pet | double | Polyester plastic count (Polyester fibers, soft drink bottles, food containers (also see plastic bottles)) |
| pp | double | Polypropylene count (Flower pots, bumpers, car interior trim, industrial fibers, carry-out beverage cups, microwavable food containers, DVD keep cases) |
| ps | double | Polystyrene count (Toys, video cassettes, ashtrays, trunks, beverage/food coolers, beer cups, wine and champagne cups, carry-out food containers, Styrofoam) |
| pvc | double | PVC plastic count (Window frames, bottles for chemicals, flooring, plumbing pipes) |
| grand_total | double | Grand total count (all types of plastic) |
| num_events | double | Number of counting events |
| volunteers | double | Number of volunteers |

Data Summary

We can use skimr::skim() to see a summary of the data, including the number of rows and columns, column types, and some summary stats for each column.

skimr::skim(plastics) #print a summary of the data
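Table 1: Data summary

| | |
|:--|:--|
| Name | plastics |
| Number of rows | 13380 |
| Number of columns | 14 |
| Column type frequency: | |
| character | 2 |
| numeric | 12 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:--|--:|--:|--:|--:|--:|--:|--:|
| country | 0 | 1 | 4 | 50 | 0 | 69 | 0 |
| parent_company | 0 | 1 | 1 | 84 | 0 | 10823 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|:--|
| year | 0 | 1.00 | 2019.31 | 0.46 | 2019 | 2019 | 2019 | 2020 | 2020 | ▇▁▁▁▃ |
| empty | 3243 | 0.76 | 0.41 | 22.59 | 0 | 0 | 0 | 0 | 2208 | ▇▁▁▁▁ |
| hdpe | 1646 | 0.88 | 3.05 | 66.12 | 0 | 0 | 0 | 0 | 3728 | ▇▁▁▁▁ |
| ldpe | 2077 | 0.84 | 10.32 | 194.64 | 0 | 0 | 0 | 0 | 11700 | ▇▁▁▁▁ |
| o | 267 | 0.98 | 49.61 | 1601.99 | 0 | 0 | 0 | 2 | 120646 | ▇▁▁▁▁ |
| pet | 214 | 0.98 | 20.94 | 428.16 | 0 | 0 | 0 | 0 | 36226 | ▇▁▁▁▁ |
| pp | 1496 | 0.89 | 8.22 | 141.81 | 0 | 0 | 0 | 0 | 6046 | ▇▁▁▁▁ |
| ps | 1972 | 0.85 | 1.86 | 39.74 | 0 | 0 | 0 | 0 | 2101 | ▇▁▁▁▁ |
| pvc | 4328 | 0.68 | 0.35 | 7.89 | 0 | 0 | 0 | 0 | 622 | ▇▁▁▁▁ |
| grand_total | 14 | 1.00 | 90.15 | 1873.68 | 0 | 1 | 1 | 6 | 120646 | ▇▁▁▁▁ |
| num_events | 0 | 1.00 | 33.37 | 44.71 | 1 | 4 | 15 | 42 | 145 | ▇▃▁▁▂ |
| volunteers | 107 | 0.99 | 1117.65 | 1812.40 | 1 | 114 | 400 | 1416 | 31318 | ▇▁▁▁▁ |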

Data Wrangling

The first step I’d like to take is renaming the columns using the information in the data dictionary. This will help me to remember what each column means.

plastics_new <- plastics %>%
  rename("empty_count" = "empty",
         "high_density_polyethylene_count" = "hdpe",
         "low_density_polyethylene_count" = "ldpe",
         "other_count" = "o",
         "polyester_plastic_count" = "pet",
         "polypropylene_count" = "pp",
         "polystyrene_count" = "ps",
         "pvc_plastic_count" = "pvc",
         "total_plastic_count" = "grand_total", #I'm renaming this because there is also a parent_company value called "Grand Total" and I don't want to mix them up
         "times_counted" = "num_events")
  
  # #rename some values that are the same but have diff names
  # mutate(parent_company = gsub("estle", "estlé", parent_company),
  #        parent_company = gsub("PT Mayora Indah Tbk", "Mayora Indah", parent_company),
  #        parent_company = gsub("Pepsico", "PepsiCo", parent_company))
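
The commented-out step above hints at harmonizing company names that are spelled differently across rows. Here is a minimal sketch of what that step would do, run on made-up rows rather than the real data (the `toy` tibble and its counts are illustrative only). Note that `gsub()` matches substrings, so a pattern like "estle" would also rewrite any other name containing those letters.

```r
library(dplyr)

# Made-up rows; the spellings mirror the ones in the commented-out code
toy <- tibble::tibble(
  parent_company      = c("Nestle", "Nestlé", "Pepsico", "PepsiCo",
                          "PT Mayora Indah Tbk", "Mayora Indah"),
  total_plastic_count = c(5, 7, 3, 4, 2, 1)
)

harmonized <- toy %>%
  # Map each variant spelling onto a single canonical name
  mutate(parent_company = gsub("estle", "estlé", parent_company),
         parent_company = gsub("PT Mayora Indah Tbk", "Mayora Indah", parent_company),
         parent_company = gsub("Pepsico", "PepsiCo", parent_company)) %>%
  # After harmonizing, the variants collapse into one group per company
  group_by(parent_company) %>%
  summarise(total_plastic_count = sum(total_plastic_count))

harmonized
```

Without the `mutate()`, the six rows would stay as six separate companies and each company's total would be split across its spellings.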

Viz

I’m interested in which parent companies create the most plastic pollution. Here is the code for my final plot:

#I want to look at the top 10 `parent_company`s with the highest total `total_plastic_count` (of all countries and years)

#Create new dataframe to use for my plot
p_dat <- plastics_new %>% 
  
  #Get total `total_plastic_count` for each company
  group_by(parent_company) %>%
  summarise(total_plastic_count = sum(total_plastic_count)) %>%
  
  #Remove any parent_company where the name isn't actually 1 company 
  filter(!parent_company %in% c("null","NULL","Grand Total","Unbranded", "Assorted")) %>%
  
  #Keep the rows with the top 10 total_plastic_count values
  slice_max(order_by = total_plastic_count, n = 10) %>%
  
  #Add a line break inside the longest company name so its axis label wraps onto two lines
  mutate(parent_company = gsub("Tamil Nadu Co-operative Milk Producers' Federation Ltd", "Tamil Nadu Co-operative \nMilk Producers' Federation Ltd", parent_company))
  

#Create my plot
p <- p_dat %>%
  ggplot() +
  
  #Add a point for each company's count
  geom_point(aes(y = reorder(parent_company, total_plastic_count),
                 x = total_plastic_count,
                 col = parent_company),
                size = 15) +
  
  #Add a line for each company (lollipop chart)
  geom_segment(aes(y = parent_company,
                   yend = parent_company,
                   x = 0,
                   xend = total_plastic_count,
                   col = parent_company),
                  size = 2) +
  
  geom_text(aes(y = reorder(parent_company, total_plastic_count),
                x = total_plastic_count,
                label = total_plastic_count),
            size = 4,
            col = "white",
            fontface = "bold") +
  
  #scale the y-axis on a log scale so the values aren't as spread out
  # scale_y_continuous(trans='log10', 
  #                    breaks = scales::trans_breaks("log10", function(x) 10^x),
  #                    labels = scales::trans_format("log2",
  #                                                  scales::math_format(10^.x))) +
  
  #Add all my labels
  labs(title = "Top 10 Creators of Plastic Pollution Worldwide", 
       subtitle = "Out of all cleanup events in 2019 and 2020, these 10 companies had the most plastic items that were made by the company.",
       caption = "Data from breakfreefromplastic.org | Viz by Aubrey Shuga",
       y = "Company", 
       x = "Number of Plastic Items Found") +
  
  #All theme elements
  theme(legend.position ="none",
        panel.background = element_blank(), #remove gray background
        axis.text.y = element_text(face = "bold", size = 10), #bold y-axis labels (company names)
        axis.ticks.y = element_blank(), #no y-axis tick marks
        axis.ticks.x = element_blank()) #no x-axis tick marks
p


#save each plot iteration so I can create a gif at the end
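#the timestamp is formatted without spaces or colons so the filename is valid on every OS and still sorts chronologically
ggsave(plot = p, filename = file.path("iterations", paste0(format(Sys.time(), "%Y-%m-%d_%H-%M-%S"), ".png"))) 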
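
One thing to watch in the summarise() step above: the skim summary lists 14 missing values in the grand total column, and `sum()` returns `NA` for a group as soon as one of its values is missing. Whether that changes the top 10 depends on which companies those rows belong to, which this sketch does not check. It only shows the mechanism on made-up numbers (the `toy` tibble is illustrative only).

```r
library(dplyr)

# Made-up counts: company "A" has one missing value
toy <- tibble::tibble(
  parent_company      = c("A", "A", "B", "B"),
  total_plastic_count = c(10, NA, 3, 4)
)

# Default behaviour: any NA in a group makes that group's total NA
default_sum <- toy %>%
  group_by(parent_company) %>%
  summarise(total = sum(total_plastic_count))

# With na.rm = TRUE the missing value is skipped instead
narm_sum <- toy %>%
  group_by(parent_company) %>%
  summarise(total = sum(total_plastic_count, na.rm = TRUE))

default_sum
narm_sum
```

Dropping missing values treats them as zero items found, which may or may not be the right reading of a blank count, so it is a judgment call rather than a required fix.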

Gif

While creating my viz for this week, I saved a copy of each iteration. I now want to create a gif to show how my plot changed from my initial stab at it to my final plot. Some steps are taken from this tutorial.

library(magick)

## list file names in iterations folder
imgs <- list.files("iterations", full.names = TRUE)

#For each filename in imgs, read the image and store it in img_list
img_list <- lapply(imgs, image_read)

## join the images together
img_joined <- image_join(img_list)

## animate at 4 frames per second
img_animated <- image_animate(img_joined, fps = 4)

## view animated image
img_animated

## save to disk
image_write(image = img_animated,
            path = "plastic_pollution.gif")

You can see that I wanted to add more, but I ended up giving up and going back to an earlier iteration.

Aubrey Shuga
Data Science Student
