This data set developed by the Environmental Protection Agency (EPA) contains air quality data per county. Other related data sets are available here.
In R, we need 2 packages to download and visualize the data. First, check if the packages are already installed. Install them if they are not:
packages <- c("tidyverse", "httr")
new.packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new.packages)>0) install.packages(new.packages)
Then, load them:
lapply(packages, library, character.only = TRUE)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.4.1 ✔ purrr 0.3.4
✔ tibble 3.2.1 ✔ dplyr 1.0.9
✔ tidyr 1.1.2 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
[[1]]
[1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
[7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
[13] "utils" "datasets" "methods" "base"
[[2]]
[1] "httr" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
[13] "grDevices" "utils" "datasets" "methods" "base"
Download the data set:
url <- "https://aqs.epa.gov/aqsweb/airdata/annual_aqi_by_county_2022.zip"
aqi <- GET(url)
data_file <-"aqi.zip"
writeBin(content(aqi, "raw"), data_file)
# Unzip the file
unzip(data_file)
Read the data set:
aqi <- read_csv('annual_aqi_by_county_2022.csv')
── Column specification ────────────────────────────────────────────────────────
cols(
State = col_character(),
County = col_character(),
Year = col_double(),
`Days with AQI` = col_double(),
`Good Days` = col_double(),
`Moderate Days` = col_double(),
`Unhealthy for Sensitive Groups Days` = col_double(),
`Unhealthy Days` = col_double(),
`Very Unhealthy Days` = col_double(),
`Hazardous Days` = col_double(),
`Max AQI` = col_double(),
`90th Percentile AQI` = col_double(),
`Median AQI` = col_double(),
`Days CO` = col_double(),
`Days NO2` = col_double(),
`Days Ozone` = col_double(),
`Days PM2.5` = col_double(),
`Days PM10` = col_double()
)
Select 5 states and make violin plots of the maximum air quality index per county in 2022:
aqi_5_states <- aqi %>%
filter(State %in% c("Alabama", "Colorado", "New Jersey", "Rhode Island", "West Virginia"))
ggplot(aqi_5_states) +
geom_violin(aes(x = `Max AQI`, y = State), draw_quantiles = c(.25, .5, .75)) +
theme_bw() +
ylab("State")
In Python, we need 5 libraries to download and visualize the data.
import requests
import zipfile
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Download the data set:
url = "https://aqs.epa.gov/aqsweb/airdata/annual_aqi_by_county_2022.zip"
aqi = requests.get(url)
data_file = "aqi.zip"
with open(data_file, 'wb') as f:
f.write(aqi.content)
data_file = "aqi.zip"
# Unzip the file
20589
with zipfile.ZipFile(data_file, 'r') as zip_ref:
zip_ref.extractall()
Read it:
csv_file = "annual_aqi_by_county_2022.csv"
aqi = pd.read_csv(csv_file)
Select 5 states and make violin plots of the maximum air quality index per county in 2022:
states = ["Alabama", "Colorado", "New Jersey", "Rhode Island", "West Virginia"]
aqi_5_states = aqi[aqi['State'].isin(states)]
plt.figure()
sns.catplot(data=aqi_5_states, x='Max AQI', y='State', kind='violin', bw=.15)