#TidyTuesday - Plastic Pollution

Last updated on Mar 23, 2021 7 min read R, TidyTuesday

Intro

library(tidytuesdayR)
library(tidyverse)
library(knitr)

The Data

Tidy Tuesday provides the following code to load in the data:

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2021-01-26')
tuesdata <- tidytuesdayR::tt_load(2021, week = 5)

plastics <- tuesdata$plastics

# Or read in the data manually

plastics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-26/plastics.csv')

Since there are a few options, I decided to use the tidytuesdayR package option. That way, I can work with any other Tidy Tuesday data without having to find the link. I just need to know the week number.

#tuesdata <- tidytuesdayR::tt_load(2021, week = 5) #load in this week's data

tuesdata is a list where the first item is our dataframe, so we’ll need to extract the plastics item from the list and save it.

#plastics <- tuesdata$plastics #grab the plastics item from the list and save it as plastics (this is our data)

Note: I ended up having to use the other option sometimes when I had too many github requests

plastics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-26/plastics.csv')

## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   parent_company = col_character(),
##   empty = col_double(),
##   hdpe = col_double(),
##   ldpe = col_double(),
##   o = col_double(),
##   pet = col_double(),
##   pp = col_double(),
##   ps = col_double(),
##   pvc = col_double(),
##   grand_total = col_double(),
##   num_events = col_double(),
##   volunteers = col_double()
## )

If we peek at the data, it looks like each row is a country-year-parent_company combination. This data only contains 2019 and 2020.

plastics %>% head() %>% knitr::kable()

country	year	parent_company	hdpe	ldpe	o	pet	pp	ps	pvc	grand_total	num_events	volunteers
Argentina	2019	Grand Total	215	55	607	1376	281	116	18	2668	4	243
Argentina	2019	Unbranded	155	50	532	848	122	114	17	1838	4	243
Argentina	2019	The Coca-Cola Company	0	0	0	222	35	0	0	257	4	243
Argentina	2019	Secco	0	0	0	39	4	0	0	43	4	243
Argentina	2019	Doble Cola	0	0	0	38	0	0	0	38	4	243
Argentina	2019	Pritty	0	0	0	22	7	0	0	29	4	243

Tidy Tuesday also provides a Data Dictionary that explains each column in the dataset:

Data Dictionary

The plastic is categorized by recycling codes.

plastics.csv

variable	class	description
country	character	Country of cleanup
year	double	Year (2019 or 2020)
parent_company	character	Source of plastic
empty	double	Category left empty count
hdpe	double	High density polyethylene count (Plastic milk containers, plastic bags, bottle caps, trash cans, oil cans, plastic lumber, toolboxes, supplement containers)
ldpe	double	Low density polyethylene count (Plastic bags, Ziploc bags, buckets, squeeze bottles, plastic tubes, chopping boards)
o	double	Category marked other count
pet	double	Polyester plastic count (Polyester fibers, soft drink bottles, food containers (also see plastic bottles)
pp	double	Polypropylene count (Flower pots, bumpers, car interior trim, industrial fibers, carry-out beverage cups, microwavable food containers, DVD keep cases)
ps	double	Polystyrene count (Toys, video cassettes, ashtrays, trunks, beverage/food coolers, beer cups, wine and champagne cups, carry-out food containers, Styrofoam)
pvc	double	PVC plastic count (Window frames, bottles for chemicals, flooring, plumbing pipes)
grand_total	double	Grand total count (all types of plastic)
num_events	double	Number of counting events
volunteers	double	Number of volunteers

Data Summary

We can use skimr::skim() to see a summary of the data, including the number of rows and columns, column types, and some summary stats for each column.

skimr::skim(plastics) #print a summary of the data

Table 1: Data summary
Name	plastics
Number of rows	13380
Number of columns	14
_______________________
Column type frequency:
character	2
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	4	50	0	69	0
parent_company	0	1	1	84	0	10823	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2019.31	0.46	2019	2019	2019	2020	2020	▇▁▁▁▃
empty	3243	0.76	0.41	22.59	0	0	0	0	2208	▇▁▁▁▁
hdpe	1646	0.88	3.05	66.12	0	0	0	0	3728	▇▁▁▁▁
ldpe	2077	0.84	10.32	194.64	0	0	0	0	11700	▇▁▁▁▁
o	267	0.98	49.61	1601.99	0	0	0	2	120646	▇▁▁▁▁
pet	214	0.98	20.94	428.16	0	0	0	0	36226	▇▁▁▁▁
pp	1496	0.89	8.22	141.81	0	0	0	0	6046	▇▁▁▁▁
ps	1972	0.85	1.86	39.74	0	0	0	0	2101	▇▁▁▁▁
pvc	4328	0.68	0.35	7.89	0	0	0	0	622	▇▁▁▁▁
grand_total	14	1.00	90.15	1873.68	0	1	1	6	120646	▇▁▁▁▁
num_events	0	1.00	33.37	44.71	1	4	15	42	145	▇▃▁▁▂
volunteers	107	0.99	1117.65	1812.40	1	114	400	1416	31318	▇▁▁▁▁

Data Wrangling

The first step I’d like to take is renaming the columns using the information in the data dictionary. This will help me to remember what each column means.

plastics_new <- plastics %>%
  rename("empty_count" = "empty",
         "high_density_polyethylene_count" = "hdpe",
         "low_density_polyethylene_count" = "ldpe",
         "other_count" = "o",
         "polyester_plastic_count" = "pet",
         "polypropylene_count" = "pp",
         "polystyrene_count" = "ps",
         "pvc_plastic_count" = "pvc",
         "total_plastic_count" = "grand_total", #I'm renaming this because there is also a country called "Grand Total" and I don't want to mix them up
         "times_counted" = "num_events")
  
  # #rename some values that are the same but have diff names
  # mutate(parent_company = gsub("estle", "estlé", parent_company),
  #        parent_company = gsub("PT Mayora Indah Tbk", "Mayora Indah", parent_company),
  #        parent_company = gsub("Pepsico", "PepsiCo", parent_company))

Viz

I’m interested in which parent companies create the most plastic pollution. Here is the code for my final plot:

#I want to look at the top 10 `parent_company`s with the highest total `total_plastic_count` (of all countries and years)

#Create new dataframe to use for my plot
p_dat <- plastics_new %>% 
  
  #Get total `total_plastic_count` for each company
  group_by(parent_company) %>%
  summarise(total_plastic_count = sum(total_plastic_count)) %>%
  
  #Remove any parent_company where the name isn't actually 1 company 
  filter(!parent_company %in% c("null","NULL","Grand Total","Unbranded", "Assorted")) %>%
  
  #Keep the rows with the top 10 total_plastic_count values
  slice_max(order_by = total_plastic_count, n = 10) %>%
  
  #Turn all " " and "-" into "\n" to fit more on one line
  mutate(parent_company = gsub("Tamil Nadu Co-operative Milk Producers' Federation Ltd", "Tamil Nadu Co-operative \nMilk Producers' Federation Ltd", parent_company))
  

#Create my plot
p <- p_dat %>%
  ggplot() +
  
  #Add a point for each company's count
  geom_point(aes(y = reorder(parent_company, total_plastic_count),
                 x = total_plastic_count,
                 col = parent_company),
                size = 15) +
  
  #Add a line for each company (lollipop chart)
  geom_segment(aes(y = parent_company,
                   yend = parent_company,
                   x = 0,
                   xend = total_plastic_count,
                   col = parent_company),
                  size = 2) +
  
  geom_text(aes(y = reorder(parent_company, total_plastic_count),
                x = total_plastic_count,
                label = total_plastic_count),
            size = 4,
            col = "white",
            fontface = "bold") +
  
  #scale the y-axis on a log scale so the values aren't as spread out
  # scale_y_continuous(trans='log10', 
  #                    breaks = scales::trans_breaks("log10", function(x) 10^x),
  #                    labels = scales::trans_format("log2",
  #                                                  scales::math_format(10^.x))) +
  
  #Add all my labels
  labs(title = "Top 10 Creators of Plastic Pollutions Worldwide", 
       subtitle = "Out of all cleanup events in 2019 and 2020, these 10 companies had the most plastic items that were made by the company.",
       caption = "Data from breakfreefromplastic.org | Viz by Aubrey Shuga",
       y = "Company", 
       x = "Number of Plastic Items Found") +
  
  #All theme elements
  theme(legend.position ="none",
        panel.background = element_blank(), #remove gray background
        axis.text.y = element_text(face = "bold", size = 10), #no y-axis tick labels
        axis.ticks.y = element_blank(), #no y-axis tick marks
        axis.ticks.x = element_blank()) #no x-axis tick marks
p


#save each plot iteration so I can create a gif at the end
ggsave(plot = p, filename = file.path("iterations", paste0(Sys.time(),".png")))

Gif

While creating my viz for this week, I saved a copy of each iteration. I now want to create a gif to show how my plot chnaged from my initial stab at it to my final plot. some steps taken from this tutorial

library(magick)

## list file names in interations folder
imgs <- list.files("iterations", full.names = TRUE)

#For each filename in imgs, read the image and store it in img_list
img_list <- lapply(imgs, image_read)

## join the images together
img_joined <- image_join(img_list)

## animate at 2 frames per second
img_animated <- image_animate(img_joined, fps = 4)

## view animated image
img_animated

## save to disk
image_write(image = img_animated,
            path = "plastic_pollution.gif")

You can see that I wanted to add more, but I ended up giving up and going back to an earlier iteration.

#TidyTuesday - Plastic Pollution

Intro

The Data

Data Dictionary

Data Summary

Data Wrangling

Viz

Gif

Aubrey Shuga

Data Science Student

Related