Let’s use data from the Smithsonian Conservation Biology Institute (SCBI) ForestGEO plot. They present their data at https://scbi-forestgeo.github.io/SCBI-ForestGEO-Data/:
“This is the public data portal for the SCBI ForestGEO plot, which points to archive locations for our various data products (some in this repository, many elsewhere).”
SCBI datasets are scattered across multiple organizations and repositories. I’ll use the ghr package to find and access some of those datasets, and the purrr package mostly to apply functions over multiple elements of a vector.
(I’ll also use fs to manipulate paths, and readr to read files. I’ll refer to functions from these packages using the syntax package::function().)
Climate data from SCBI is stored in the GitHub organization “forestgeo”, particularly in the repository “Climate”.
ghr_ls() lists GitHub directories in a way similar to how fs::dir_ls() lists local directories.
ghr_ls("forestgeo/Climate/Met_Station_Data/SCBI")
#> [1] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI"
#> [2] "Met_Station_Data/SCBI/Front Royal weather station"
#> [3] "Met_Station_Data/SCBI/README.md"
I can use a regular expression (regexp) to focus, for example, on .csv files.
# All files
ghr_ls("forestgeo/Climate/Met_Station_Data/SCBI/Front Royal weather station")
#> [1] "Met_Station_Data/SCBI/Front Royal weather station/Front Royal_NOAA_11162015.csv"
#> [2] "Met_Station_Data/SCBI/Front Royal weather station/README.md"
# .csv files
ghr_ls(
"forestgeo/Climate/Met_Station_Data/SCBI/ForestGEO_met_station-SCBI",
regexp = "[.]csv$"
)
#> [1] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2009.csv"
#> [2] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2010.csv"
#> [3] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2011.csv"
#> [4] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2012.csv"
#> [5] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2013.csv"
#> [6] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2014.csv"
#> [7] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2015.csv"
#> [8] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2016.csv"
#> [9] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2017.csv"
#> [10] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2018.csv"
#> [11] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/SCB_Metdata_5min_2019.csv"
#> [12] "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/mettower_metadata.csv"
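To see what the pattern "[.]csv$" selects without touching GitHub, here is a minimal base R sketch. It applies the same regular expression with grepl() to a hand-typed vector of file names copied from the listings above. That ghr_ls() filters its results in this same way internally is an assumption on my part.

```r
# File names copied from the listings above (typed by hand, not fetched).
files <- c(
  "Met_Station_Data/SCBI/Front Royal weather station/Front Royal_NOAA_11162015.csv",
  "Met_Station_Data/SCBI/Front Royal weather station/README.md",
  "Met_Station_Data/SCBI/ForestGEO_met_station-SCBI/mettower_metadata.csv"
)

# "[.]" matches a literal dot; "$" anchors the match at the end of the string.
csv_pattern <- "[.]csv$"
csv_files <- files[grepl(csv_pattern, files)]
csv_files
```

Writing the dot as [.] matters: a bare . matches any character, so a name ending in, say, "xcsv" would also pass.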
In addition to climate data, SCBI has a number of species-list datasets. These datasets live in the “species_lists” folder of the “SCBI-ForestGEO-Data” repository, in the GitHub organization “SCBI-ForestGEO”.
species_lists <- "SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists"
# Get the last part of the path
(subdirs <- fs::path_file(ghr_ls(species_lists)))
#> [1] "Full plant list" "GenBank" "Tree ecology"
#> [4] "insects_pathogens"
Let’s explore the .csv files in each subdirectory. To reduce duplication, I first create the vector paths to store the path to each of the subdirectories I want to explore. Then I apply ghr_ls() to each element of paths and use regexp to focus on .csv files only.
(paths <- fs::path(species_lists, subdirs))
#> SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists/Full plant list
#> SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists/GenBank
#> SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists/Tree ecology
#> SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists/insects_pathogens
paths %>%
map(~ ghr_ls(.x, regexp = "[.]csv$"))
#> Warning: Nothing in 'SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists/
#> GenBank' matches '[.]csv$'
#> [[1]]
#> [1] "species_lists/Full plant list/SCBI_all_sp_woody_&_herb.csv"
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] "species_lists/Tree ecology/SCBI_ForestGEO_sp_ecology.csv"
#> [2] "species_lists/Tree ecology/SCBI_ForestGEO_sp_ecology_metadata.csv"
#>
#> [[4]]
#> [1] "species_lists/insects_pathogens/insects_pathogens metadata.csv"
#> [2] "species_lists/insects_pathogens/insects_pathogens.csv"
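The empty character(0) for GenBank can get in the way of later steps. As a sketch in base R, using a hand-typed stand-in for the result above, the elements can be named after their subdirectory and the empty ones dropped. The object names listing and non_empty are illustrative only.

```r
# Stand-in for the list printed above (typed by hand, not fetched from GitHub).
subdirs <- c("Full plant list", "GenBank", "Tree ecology", "insects_pathogens")
listing <- list(
  "species_lists/Full plant list/SCBI_all_sp_woody_&_herb.csv",
  character(0),
  c(
    "species_lists/Tree ecology/SCBI_ForestGEO_sp_ecology.csv",
    "species_lists/Tree ecology/SCBI_ForestGEO_sp_ecology_metadata.csv"
  ),
  c(
    "species_lists/insects_pathogens/insects_pathogens metadata.csv",
    "species_lists/insects_pathogens/insects_pathogens.csv"
  )
)

# Name each element after its subdirectory so the output is self-describing.
names(listing) <- subdirs

# Keep only elements with at least one match; length(character(0)) is 0.
non_empty <- Filter(function(x) length(x) > 0, listing)
names(non_empty)
```

With purrr loaded, purrr::compact() should do the same job as the Filter() call.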
Instead of the file names, I can show the first few rows of each file. I use ghr_ls_download_url() to get the download URL of each .csv file in each subdirectory. To iterate over all subdirectories I use purrr::map().
download_urls <- paths %>%
map(~ ghr_ls_download_url(.x, regexp = "[.]csv$"))
#> Warning: Nothing in 'SCBI-ForestGEO/SCBI-ForestGEO-Data/species_lists/
#> GenBank' matches '[.]csv$'
download_urls
#> [[1]]
#> [1] "https://raw.githubusercontent.com/SCBI-ForestGEO/SCBI-ForestGEO-Data/master/species_lists/Full%20plant%20list/SCBI_all_sp_woody_%26_herb.csv"
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] "https://raw.githubusercontent.com/SCBI-ForestGEO/SCBI-ForestGEO-Data/master/species_lists/Tree%20ecology/SCBI_ForestGEO_sp_ecology.csv"
#> [2] "https://raw.githubusercontent.com/SCBI-ForestGEO/SCBI-ForestGEO-Data/master/species_lists/Tree%20ecology/SCBI_ForestGEO_sp_ecology_metadata.csv"
#>
#> [[4]]
#> [1] "https://raw.githubusercontent.com/SCBI-ForestGEO/SCBI-ForestGEO-Data/master/species_lists/insects_pathogens/insects_pathogens%20metadata.csv"
#> [2] "https://raw.githubusercontent.com/SCBI-ForestGEO/SCBI-ForestGEO-Data/master/species_lists/insects_pathogens/insects_pathogens.csv"
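Notice that the download URLs percent-encode spaces (%20) and ampersands (%26). The sketch below shows how such a URL could be assembled by hand in base R. The helper raw_url() is hypothetical and is not part of ghr; I assume the "master" branch only because it appears in the URLs above.

```r
# Hypothetical helper (not part of ghr): build a raw.githubusercontent.com URL
# from "owner/repo", a branch name, and a file path inside the repository.
raw_url <- function(repo, path, branch = "master") {
  # Encode each path segment separately so the "/" separators stay intact.
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  encoded <- vapply(
    segments,
    function(s) utils::URLencode(s, reserved = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
  paste0(
    "https://raw.githubusercontent.com/", repo, "/", branch, "/",
    paste(encoded, collapse = "/")
  )
}

url <- raw_url(
  "SCBI-ForestGEO/SCBI-ForestGEO-Data",
  "species_lists/Full plant list/SCBI_all_sp_woody_&_herb.csv"
)
url
```

The result should match the first URL printed above. Encoding segment by segment, rather than the whole path at once, avoids turning the directory separators into %2F.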
And finally I use readr::read_csv() to read each dataset into R. Again I use purrr::map(), this time to iterate over each download URL after flattening the list with unlist().
download_urls %>%
unlist() %>%
map(~ head(readr::read_csv(.x)))
#> [[1]]
#> # A tibble: 6 x 4
#> FAMILY `SCIENTIFIC NAME` `COMMON NAME` life_form
#> <chr> <chr> <chr> <chr>
#> 1 Acanthaceae Ruellia strepens limestone wild petunia herbaceous
#> 2 Adoxaceae Sambucus canadensis American black elderberry woody
#> 3 Adoxaceae Sambucus pubens Red elderberry woody
#> 4 Adoxaceae Viburnum acerifolium Mapleleaf viburnum woody
#> 5 Adoxaceae Viburnum prunifolium Black haw woody
#> 6 Adoxaceae Viburnum recognitum Southern arrowood woody
#>
#> [[2]]
#> # A tibble: 6 x 17
#> family genus species author spcode canopy_position live_form
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Sapin… Acer negundo Michx. acne understory tree
#> 2 Sapin… Acer platan… L. acpl canopy tree
#> 3 Sapin… Acer rubrum L. acru canopy tree
#> 4 Simar… Aila… altiss… (Mill… aial canopy tree
#> 5 Rosac… Amel… arborea (Mich… amar understory small tr…
#> 6 Annon… Asim… triloba (L.) … astr understory small tr…
#> # … with 10 more variables: native_status <chr>, habitat <chr>,
#> # successional_status <chr>, drought_tolerance <chr>,
#> # deer_herbivory <chr>, flower_reproduction <chr>, fruit_type <chr>,
#> # dispersal_vector <chr>, IUCN_status <chr>, References <chr>
#>
#> [[3]]
#> # A tibble: 6 x 3
#> Column Field Description
#> <chr> <chr> <chr>
#> 1 1 / A family Plant family name
#> 2 2 / B genus Genus name
#> 3 3 / C species Species epithet (it could include variety or subspec…
#> 4 4 / D author Author of species name as Flora of Virginia (2012)
#> 5 5 / E spcode Species code used at SCBI, four characters refers to…
#> 6 6 / F canopy_posi… Most common canopy position for individuals within a…
#>
#> [[4]]
#> # A tibble: 6 x 4
#> Column Field Description Variable.Codes
#> <chr> <chr> <chr> <chr>
#> 1 1 / A pest_pathoge… common name of pest or patho… -
#> 2 2 / B pest_pathoge… scientific name of pest or p… -
#> 3 3 / C tree_species… tree species affected by pes… -
#> 4 4 / D pathogen_type indicates pathogen type fungus, insect
#> 5 5 / E origin indicates whether pest/patho… native, exotic, cosmo…
#> 6 6 / F native range Geographic origin of pest/pa… -
#>
#> [[5]]
#> # A tibble: 6 x 17
#> pest_pathogen_c… pest_pathogen_s… tree_species_af… pathogen_type origin
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Emerald ash bor… Agrilus planipe… Fraxinus spp., … Insect exotic
#> 2 Balsam woolly a… Adelgis piceae Abies balsamea,… Insect exotic
#> 3 Elongate hemloc… Fiorinia externa Tsuga spp Insect exotic
#> 4 Gypsy moth Lymantria dispar Quercus and oth… Insect exotic
#> 5 Hemlock woolly … Adelgis tsugae Tsuga canadensi… Insect exotic
#> 6 Woolly beech sc… Cryptococcus fa… Fagus spp. Insect exotic
#> # … with 12 more variables: `native range` <chr>,
#> # year_introduced_north_america <chr>, Virginia_Blue_Ridge_status <chr>,
#> # Virginia_source <chr>, Virginia_year_intro <chr>, SNP_year_ID <chr>,
#> # SNP_notes <chr>, SCBI_year_ID <chr>, SCBI_observation_type <chr>,
#> # SCBI_notes <chr>, general_notes <chr>, citation <chr>
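The unlist() step above matters because map() returned a list with one element per subdirectory, one of them empty. The sketch below shows the same flatten-then-read pattern offline, in base R: lapply() and utils::read.csv() stand in for purrr::map() and readr::read_csv(), and two tiny temporary files (invented here for illustration) stand in for the download URLs.

```r
# Two small CSV files in a temporary directory stand in for the remote files.
tmp <- tempfile("csv_demo_")
dir.create(tmp)
write.csv(
  data.frame(genus = c("Acer", "Asimina"), spcode = c("acru", "astr")),
  file.path(tmp, "a.csv"),
  row.names = FALSE
)
write.csv(data.frame(x = 1:10), file.path(tmp, "b.csv"), row.names = FALSE)

# A nested result like the one above: one element per directory, one empty.
nested <- list(file.path(tmp, "a.csv"), character(0), file.path(tmp, "b.csv"))

# unlist() flattens to a plain character vector and drops the empty element.
flat <- unlist(nested)

# Read each file and keep only its first six rows.
previews <- lapply(flat, function(path) head(utils::read.csv(path)))
vapply(previews, nrow, integer(1))
```

Because head() keeps at most six rows, the preview of the ten-row file is truncated while the two-row file is shown in full.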