Jens Åström
25 Oct, 2021
This is a slidy presentation made with R Markdown. It probably won’t look good on your phone.
Sharing your data
## # A tibble: 9 × 10
## country region location habitat sample_round sample_time sample_id covariate_1 covariate_2 measurement_to_explain
## <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Norway region_1 lok_1 forest sampling_1 2020-05-01 1 0.426 6.90 29.0
## 2 Norway region_1 lok_1 forest sampling_2 2020-05-10 4 0.443 9.43 37.8
## 3 Norway region_1 lok_1 forest sampling_3 2020-06-10 7 0.786 8.95 17.6
## 4 Norway region_1 lok_2 grassland sampling_1 2020-05-01 2 0.358 5.76 91.7
## 5 Norway region_1 lok_2 grassland sampling_2 2020-05-10 5 0.345 5.99 58.0
## 6 Norway region_1 lok_2 grassland sampling_3 2020-06-10 8 0.296 9.50 97.4
## 7 Norway region_2 lok_3 forest sampling_1 2020-05-01 3 0.322 9.42 24.6
## 8 Norway region_2 lok_3 forest sampling_2 2020-05-10 6 0.556 8.59 32.9
## 9 Norway region_2 lok_3 forest sampling_3 2020-06-10 9 0.659 7.18 69.3
## Note that this could be split up further
## # A tibble: 4 × 4
## location habitat region country
## <chr> <chr> <chr> <chr>
## 1 lok_1 forest region_1 Norway
## 2 lok_2 grassland region_1 Norway
## 3 lok_3 forest region_2 Norway
## 4 lok_4 grassland region_2 Norway
## sample_id location sample_round sample_time
## 1 1 lok_1 sampling_1 2020-05-01
## 2 2 lok_2 sampling_1 2020-05-01
## 3 3 lok_3 sampling_1 2020-05-01
## 4 4 lok_1 sampling_2 2020-05-10
## 5 5 lok_2 sampling_2 2020-05-10
## 6 6 lok_3 sampling_2 2020-05-10
## 7 7 lok_1 sampling_3 2020-06-10
## 8 8 lok_2 sampling_3 2020-06-10
## 9 9 lok_3 sampling_3 2020-06-10
## # A tibble: 9 × 4
## sample_id covariate_1 covariate_2 measurement_to_explain
## <int> <dbl> <dbl> <dbl>
## 1 1 0.426 6.90 29.0
## 2 4 0.443 9.43 37.8
## 3 7 0.786 8.95 17.6
## 4 2 0.358 5.76 91.7
## 5 5 0.345 5.99 58.0
## 6 8 0.296 9.50 97.4
## 7 3 0.322 9.42 24.6
## 8 6 0.556 8.59 32.9
## 9 9 0.659 7.18 69.3
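Such a split can be done with a couple of select/distinct calls. A minimal sketch, assuming the wide table shown first is stored under the hypothetical name full_data:
locations <- full_data %>%
  select(location, habitat, region, country) %>%
  distinct()
samples <- full_data %>%
  select(sample_id, location, sample_round, sample_time) %>%
  distinct()
measurements <- full_data %>%
  select(sample_id, covariate_1, covariate_2, measurement_to_explain)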
## # A tibble: 5 × 26
## id type modified ownerInstitution… dynamicProperties eventID parentEventID samplingProtocol sampleSizeValue sampleSizeUnit eventDate
## <chr> <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dttm>
## 1 fffcc… Event 2020-04-20 14:10:32 Miljødirektoratet <NA> fffcc6… <NA> Sweep netting 1000 metre 2013-05-18 00:00:00
## 2 6aab1… Event 2020-04-20 14:10:32 Miljødirektoratet "{\"observerID\"… 6aab11… fffcc609-fbb… Sweep netting 50 metre 2013-05-18 00:00:00
## 3 02465… Event 2020-04-20 14:10:32 Miljødirektoratet "{\"observerID\"… 02465a… fffcc609-fbb… Sweep netting 50 metre 2013-05-18 00:00:00
## 4 54882… Event 2020-04-20 14:10:32 Miljødirektoratet "{\"observerID\"… 548821… fffcc609-fbb… Sweep netting 50 metre 2013-05-18 00:00:00
## 5 38a24… Event 2020-04-20 14:10:32 Miljødirektoratet "{\"observerID\"… 38a248… fffcc609-fbb… Sweep netting 50 metre 2013-05-18 00:00:00
## # … with 15 more variables: eventTime <chr>, eventRemarks <chr>, locationID <chr>, country <chr>, countryCode <chr>, stateProvince <chr>, municipality <chr>,
## # locality <chr>, locationRemarks <chr>, decimalLatitude <dbl>, decimalLongitude <dbl>, geodeticDatum <chr>, coordinateUncertaintyInMeters <dbl>,
## # footprintWKT <chr>, footprintSRS <chr>
## # A tibble: 5 × 19
## id modified basisOfRecord occurrenceID individualCount sex lifeStage occurrenceStatus eventID taxonID scientificName kingdom phylum class
## <chr> <dttm> <chr> <chr> <int> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 c57a8e… 2020-06-10 11:28:40 HumanObservat… 4ce65989-d9… 0 NA Adult absent c57a8ec… https:/… Melitaea diam… Animal… Arthr… Inse…
## 2 c57a8e… 2020-06-10 11:28:09 HumanObservat… a4436c79-88… 0 NA Adult absent c57a8ec… https:/… Bombus consob… Animal… Arthr… Inse…
## 3 c57a8e… 2020-04-20 11:53:51 HumanObservat… a6383b77-d2… 0 NA Adult absent c57a8ec… https:/… Vanessa cardui Animal… Arthr… Inse…
## 4 c57a8e… 2020-04-20 11:53:51 HumanObservat… bb9e2dec-0d… 0 NA Adult absent c57a8ec… https:/… Pieris napi Animal… Arthr… Inse…
## 5 c57a8e… 2020-06-02 14:46:15 HumanObservat… dab70776-c4… 0 NA Adult absent c57a8ec… <NA> Bombus sensu … Animal… Arthr… Inse…
## # … with 5 more variables: order <chr>, family <chr>, genus <chr>, specificEpithet <chr>, vernacularName <chr>
Data export from the Norwegian Bumblebee and Butterfly Monitoring program
Jens Åström 08 July, 2021
With data formatting for the European Butterfly Monitoring Scheme
Verified to work with the 2021 published version of the data (exporting data from 2009 - 2020 to BMS format)
suppressPackageStartupMessages(library(tidyverse))
require(DBI)
require(RPostgres)
require(sf)
require(maps)
require(tidyjson)
require(xml2)
Introduction
This script downloads observation records from the Norwegian national bumblebee and butterfly monitoring program and arranges the data to suit the formatting needs of the European Butterfly Monitoring Scheme (BMS). The data is stored within GBIF, and the first section shows how to download it from there. The rest of the script deals with rearranging the data from the GBIF format to the BMS format. Portions of the script can be used as an example for projects with other data formatting preferences. The data is freely available, given that you cite the source appropriately. See “Citation and meta-data” below.
Data structure
The data has a hierarchical nature, where we have 1) observations of the individual counts of each bumblebee and butterfly species within a 50 meter transect, 2) several such transects within each survey square (20 transects that together total 1000 m), and 3) several visits to each survey square throughout the years (typically 3 visits each year). The data contains a lot of individual counts of zero, since we also include zero-records of all the species found earlier in the project, plus potential additional species that we think could be recorded. The list of species might expand in the future as new species are recorded.
The hierarchical structure is handled within GBIF through the Darwin Core event core standard. Here, the observation records of species are located in an “occurrence” table, which is linked to an “event” table. Within the event table, there can be multiple levels of “parent events”. In our case, the 50 meter transect walks represent the base-level event. The centroid of each transect is used as the point geometry (latitude, longitude) for the individual observations, although the observations could have been made anywhere along the 50 m transect. The linestring geometry of the whole 50 m section is also included as a “footprintWKT” for each event. These events are linked to a parent event that represents the whole visit to the survey square, thus containing 20 such transect walks. With this design, GBIF can store an arbitrary number of parent events (hierarchical levels) within the same table, instead of normalizing the data across several tables. The downside is that you need to keep track of the eventIDs and parentEventIDs to recreate the original data structure.
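As a sketch of that bookkeeping, using the eventRaw table we read in further down: section-level rows carry a parentEventID that points at the eventID of their square visit, so the hierarchy can be recreated with a filter and a self-join (this mirrors what we do later in the script).
#Split the single event table into its two hierarchical levels
sections <- eventRaw %>% filter(!is.na(parentEventID))
squareVisits <- eventRaw %>% filter(is.na(parentEventID))
#Attach the square visit information back onto each 50m section
sections %>%
  left_join(squareVisits,
            by = c("parentEventID" = "eventID"),
            suffix = c(".section", ".visit"))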
Finding the raw data and how to cite it
The dataset can be found at http://gbif.org by searching for “Bumblebees and butterflies in Norway”, which takes you to the dataset’s webpage: https://www.gbif.org/dataset/aea17af8-5578-4b04-b5d3-7adf0c5a1e60. There you can access the dataset’s metadata. The data is freely available, but please cite the source. The citation is shown on the dataset’s GBIF webpage, and can also be fetched programmatically, as follows.
datasetID <- "aea17af8-5578-4b04-b5d3-7adf0c5a1e60" ##From the webpage URL
# Suggested citation: take the citation as downloaded from the GBIF website and replace "via GBIF.org" with the endpoint URL.
tmp <- tempfile()
download.file(paste0("http://api.gbif.org/v1/dataset/",datasetID,"/document"),tmp) # fetch metadata from the GBIF API
meta <- read_xml(tmp) %>% as_list() # create list from xml schema
gbif_citation <- meta$eml$additionalMetadata$metadata$gbif$citation[[1]] # extract citation
The citation to use is then:
Fetching the raw-data from GBIF
We need an “endpoint_url” to retrieve the actual dataset. This is found at the bottom of the dataset’s webpage, but can also be fetched programmatically, as shown here.
datasetURL <- paste0("http://api.gbif.org/v1/dataset/",datasetID,"/endpoint")
dataset <- RJSONIO::fromJSON(datasetURL)
endpoint_url <- dataset[[1]]$url # extracting URL from API call result
The endpoint for our dataset looks like this (also visible on the webpage for the dataset):
With the endpoint, we can download the data in a zipped format.
# Download the DwC-A (Darwin Core Archive) from the IPT into the data folder
download.file(endpoint_url, destfile="GBIF_data/gbif_download.zip", mode="wb")
We unzip it in the same folder.
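The unzipping step isn’t echoed above; base R’s unzip() does the job (a sketch, assuming the folder layout used in the download step):
unzip("GBIF_data/gbif_download.zip", exdir = "GBIF_data")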
Read in the raw GBIF files
The data is a simple tab-delimited text file. I use read_delim here, which unfortunately parses some columns incorrectly, probably since the file starts with lots of null values, so we specify the column types manually.
eventRaw <- read_delim("GBIF_data/event.txt",
delim = "\t",
locale = locale(encoding = "UTF-8"),
col_types = cols(id = col_character(),
type = col_character(),
modified = col_datetime(format = ""),
ownerInstitutionCode = col_character(),
dynamicProperties = col_character(),
eventID = col_character(),
parentEventID = col_character(),
samplingProtocol = col_character(),
sampleSizeValue = col_double(),
sampleSizeUnit = col_character(),
eventDate = col_datetime(format = ""),
eventTime = col_character(),
eventRemarks = col_character(),
locationID = col_character(),
country = col_character(),
countryCode = col_character(),
stateProvince = col_character(),
municipality = col_character(),
locality = col_character(),
locationRemarks = col_character(),
decimalLatitude = col_double(),
decimalLongitude = col_double(),
geodeticDatum = col_character(),
coordinateUncertaintyInMeters = col_double(),
footprintWKT = col_character(),
footprintSRS = col_character()
))
The occurrence data reads in more easily.
occurrenceRaw <- read_delim("GBIF_data/occurrence.txt",
delim = "\t",
locale = locale(encoding = "UTF-8")
)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## id = col_character(),
## modified = col_datetime(format = ""),
## basisOfRecord = col_character(),
## occurrenceID = col_character(),
## individualCount = col_double(),
## sex = col_logical(),
## lifeStage = col_character(),
## occurrenceStatus = col_character(),
## eventID = col_character(),
## taxonID = col_character(),
## scientificName = col_character(),
## kingdom = col_character(),
## phylum = col_character(),
## class = col_character(),
## order = col_character(),
## family = col_character(),
## genus = col_character(),
## specificEpithet = col_character(),
## vernacularName = col_character()
## )
A note on terminology
The terminology of the BMS differs from that of the original dataset in an important way. In the BMS format, the twenty 50 meter transects within the same 1.5 x 1.5 km survey square are combined and referred to as one “transect”. The original 50 meter transects are referred to as “sections” within each transect. From here on, we will use the BMS nomenclature.
Extracting the data from the GBIF raw format
We have to split the event data into section events (with parent IDs) and transect events (without parent IDs), since GBIF combines all hierarchical events into one table. If we had more hierarchical levels, this would be more convoluted.
transectEventRaw <- eventRaw %>%
filter(is.na(parentEventID))
sectionEventRaw <- eventRaw %>%
filter(!is.na(parentEventID))
Subset the butterflies
The original data also contains bumblebees, but the BMS only wants butterflies, so we subset the data. This is of course optional, and could also have been done at a later stage, but it is more efficient to do it now.
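The subsetting chunk isn’t echoed; a minimal sketch, assuming the taxonomic order column is what separates the butterflies (order Lepidoptera, which here also covers the burnet moths) from the bumblebees:
lepiRaw <- occurrenceRaw %>%
  filter(order == "Lepidoptera")
lepiRaw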
## # A tibble: 3,056,525 x 19
## id modified basisOfRecord occurrenceID individualCount sex
## <chr> <dttm> <chr> <chr> <dbl> <lgl>
## 1 7734… 2020-04-20 11:53:51 HumanObserva… f37a5321-ff… 0 NA
## 2 7734… 2020-04-20 11:53:51 HumanObserva… ff3a563a-07… 0 NA
## 3 7734… 2020-06-05 13:26:53 HumanObserva… e357de4a-56… 0 NA
## 4 7734… 2020-04-20 11:53:51 HumanObserva… df17a4bc-1c… 0 NA
## 5 7734… 2020-04-20 11:53:51 HumanObserva… ff9188a9-67… 0 NA
## 6 7734… 2021-07-01 08:58:24 HumanObserva… be1fddc5-14… 0 NA
## 7 7734… 2020-04-20 11:53:51 HumanObserva… c0499825-23… 0 NA
## 8 7734… 2020-04-20 11:53:51 HumanObserva… b58de6a9-02… 0 NA
## 9 7734… 2021-07-01 08:58:24 HumanObserva… bbaa3924-fe… 0 NA
## 10 7734… 2021-07-01 08:58:24 HumanObserva… d687da1d-f9… 0 NA
## # … with 3,056,515 more rows, and 13 more variables: lifeStage <chr>,
## # occurrenceStatus <chr>, eventID <chr>, taxonID <chr>, scientificName <chr>,
## # kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## # genus <chr>, specificEpithet <chr>, vernacularName <chr>
Extract the data that didn’t fit into the GBIF standard
We have additional information that doesn’t fit into the GBIF data format. It is bundled together into a JSON string in a GBIF column named “dynamicProperties”. We need to spread this content into separate columns.
## # A tibble: 1 x 1
## dynamicProperties
## <chr>
## 1 "{\"observerID\" : \"f35304ca-96ed-434f-9a3b-8b9ccf804127\", \"habitatType\" …
The following solution is a bit cumbersome, but the best I’ve found so far. The tidyjson package seems to have pretty poor performance on large datasets. I had a go at using purrr::map and rjson::fromJSON, but I can’t seem to get that to work at the moment (although it did work at some point).
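For completeness, a purrr-based version might look like the sketch below. This is untested at scale and assumes every non-missing dynamicProperties value is a single flat JSON object; the result could then be joined back to the section events by eventID.
tempDynamicAlt <- sectionEventRaw %>%
  collect() %>%
  mutate(parsed = purrr::map(dynamicProperties,
                             ~ if (is.na(.x)) list() else rjson::fromJSON(.x))) %>%
  tidyr::unnest_wider(parsed) %>%
  select(eventID, observerID, habitatType,
         `cloudCover%`, temperatureCelsius, flowerCover0123)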
Here, we split out the json string (dynamicProperties) into a separate table, and join it back to the original data afterwards. This takes about 30 seconds on a pretty fast machine.
system.time({
sectionEvent <- sectionEventRaw %>%
collect() %>%
mutate(`document.id` = 1:nrow(.))
tempDynamic <- sectionEventRaw %>%
#head() %>%
select(dyn = dynamicProperties) %>%
collect() %>%
unlist() %>%
spread_all() %>%
as_tibble()
sectionEvent <- sectionEvent %>%
left_join(tempDynamic,
by = c("document.id" = "document.id"))
})
## user system elapsed
## 18.940 0.189 19.129
The new columns contain a habitat type classification, cloud cover in percent, temperature, and a 4-level classification of the total flower cover for each transect. A sample of the extracted data:
sectionEvent %>%
select(observerID,
habitatType,
`cloudCover%`,
temperatureCelsius,
flowerCover0123) %>%
head()
## # A tibble: 6 x 5
## observerID habitatType `cloudCover%` temperatureCels… flowerCover0123
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 f35304ca-96ed-434f… forest 50 17.5 1
## 2 9b98f4cf-9e65-41ea… forest 10 18 1
## 3 a6495896-d41b-4998… forest 30 20 1
## 4 e67031d2-1c93-44d3… grassland 0 23 3
## 5 ef323094-1f26-48d2… forest 40 21 1
## 6 a6495896-d41b-4998… forest 0 17 1
Split the event times into start and end times
The GBIF format has a single column for event time, where we have stored both the start and end times. These need to be split into separate columns. First the “section” times, i.e. the survey times for each 50 m section, which range from 2 to 30 minutes depending on the number of individuals handled. Note that the effective search time should be the same irrespective of the handling time for identification, since the observers are instructed to walk at a set pace along the transects and stop while handling an individual.
sectionEvent <- sectionEvent %>%
separate(eventTime,
c("start_time", "end_time"),
sep = "/") %>%
mutate(start_time = as.POSIXct(start_time),
end_time = as.POSIXct(end_time))
sectionEvent %>%
select(start_time,
end_time)
## # A tibble: 29,675 x 2
## start_time end_time
## <dttm> <dttm>
## 1 2016-06-28 15:12:00 2016-06-28 15:18:00
## 2 2011-05-10 11:49:00 2011-05-10 11:52:00
## 3 2018-06-23 15:12:00 2018-06-23 15:18:00
## 4 2009-06-23 15:53:00 2009-06-23 15:57:00
## 5 2012-07-27 14:54:00 2012-07-27 14:59:00
## 6 2017-08-13 13:40:00 2017-08-13 13:42:00
## 7 2018-06-24 15:35:00 2018-06-24 15:43:00
## 8 NA NA
## 9 2017-06-05 12:50:00 2017-06-05 12:55:00
## 10 2014-05-01 13:48:00 2014-05-01 13:57:00
## # … with 29,665 more rows
We split the times for the parent events (“transects” in BMS lingo) the same way. Surveying an entire 1000 m “transect” can take anywhere from 20 minutes to several hours, depending on the number of individuals handled and potential weather breaks. Note that for many operations we need to collect the data from the database rather than use lazy queries. Also, later joins between lazy queries and tibbles don’t seem to work, so we collect everything before joining.
transectEvent <- transectEventRaw %>%
collect() %>%
separate(eventTime,
c("start_time", "end_time"),
sep = "/") %>%
mutate(start_time = as.POSIXct(start_time),
end_time = as.POSIXct(end_time))
transectEvent %>%
select(start_time,
end_time)
## # A tibble: 1,506 x 2
## start_time end_time
## <dttm> <dttm>
## 1 2011-06-30 14:55:00 2011-06-30 16:14:00
## 2 2013-07-20 12:15:00 2013-07-20 13:20:00
## 3 2016-06-09 13:00:00 2016-06-09 17:30:00
## 4 2011-08-03 13:10:00 2011-08-03 14:30:00
## 5 2014-05-22 12:30:00 2014-05-22 14:30:00
## 6 2014-07-10 14:54:00 2014-07-10 16:38:00
## 7 2019-06-29 13:30:00 2019-06-29 14:30:00
## 8 2015-06-12 11:00:00 2015-06-12 14:11:00
## 9 2019-06-11 12:00:00 2019-06-11 14:31:00
## 10 2018-05-18 15:30:00 2018-05-18 17:11:00
## # … with 1,496 more rows
Join the tables
We can now join the occurrence, section_event, and transect_event tables into a single large table. This could be a suitable endpoint for other uses that don’t need the data split into the tables used by the BMS. Just note that this join contains a lot of columns.
occTrans <- lepiRaw %>%
left_join(sectionEvent,
by = c("eventID" = "eventID"),
suffix = c(".occurr", ".section")) %>%
left_join(transectEvent,
by = c("parentEventID" = "eventID"),
suffix = c(".section", ".transect"))
#occTrans %>%
# print(n = 1, width = Inf)
Further arranging of the data to the BMS format
The following section only pertains to the BMS format, but could serve as an example for other cases where you want to normalize the data (split it into separate tables with unique records in each row).
Subset the data according to time
At the time of the first export, the BMS only wanted data up to and including 2018; this time around, they want all data up to and including 2020. We subset the joined table accordingly, filtering on the event dates.
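The date filter itself isn’t echoed; a sketch, assuming lubridate is available for the year extraction:
occTrans <- occTrans %>%
  filter(lubridate::year(eventDate.transect) <= 2020)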
Arrange a butterfly count data table
Here we want the species names, amount of individuals, and enough identifying information to link these to the transect walks.
#colnames(occTrans)
lepiCountTab <- occTrans %>%
select(visit_ID = parentEventID,
transect_ID = locationID.transect,
section_ID = locationID.section, #Note that we use unique identifiers here, not ids that are nested within transects (1, 2, 3, ..).
date = eventDate.transect,
species_name = scientificName,
count = individualCount
) %>%
arrange(visit_ID, transect_ID, section_ID, date, species_name)
We need to alter some species names to fit the BMS nomenclature. It turned out that only one species name was slightly different. Here are all the butterfly species in the dataset.
## # A tibble: 103 x 1
## species_name
## <chr>
## 1 Adscita statices
## 2 Aglais io
## 3 Aglais urticae
## 4 Agriades aquilo
## 5 Agriades optilete
## 6 Agriades orbitulus
## 7 Anthocharis cardamines
## 8 Aphantopus hyperantus
## 9 Aporia crataegi
## 10 Argynnis paphia
## 11 Aricia artaxerxes
## 12 Aricia nicias
## 13 Boloria aquilonaris
## 14 Boloria chariclea
## 15 Boloria eunomia
## 16 Boloria euphrosyne
## 17 Boloria freija
## 18 Boloria frigga
## 19 Boloria improba
## 20 Boloria napaea
## 21 Boloria polaris
## 22 Boloria selene
## 23 Boloria thore
## 24 Brenthis ino
## 25 Callophrys rubi
## 26 Carterocephalus palaemon
## 27 Carterocephalus silvicola
## 28 Celastrina argiolus
## 29 Coenonympha arcania
## 30 Coenonympha hero
## 31 Coenonympha pamphilus
## 32 Coenonympha tullia
## 33 Colias croceus
## 34 Colias hecla
## 35 Colias palaeno
## 36 Colias tyche
## 37 Cupido minimus
## 38 Cyaniris semiargus
## 39 Erebia disa
## 40 Erebia embla
## 41 Erebia ligea
## 42 Erebia pandrose
## 43 Erebia polaris
## 44 Erynnis tages
## 45 Eumedonia eumedon
## 46 Euphydryas iduna
## 47 Fabriciana adippe
## 48 Fabriciana niobe
## 49 Favonius quercus
## 50 Glaucopsyche alexis
## 51 Gonepteryx rhamni
## 52 Hesperia comma
## 53 Hipparchia hermione
## 54 Hipparchia semele
## 55 Issoria lathonia
## 56 Lasiommata maera
## 57 Lasiommata megera
## 58 Lasiommata petropolitana
## 59 Leptidea sinapis/reali
## 60 Limenitis populi
## 61 Lycaena helle
## 62 Lycaena hippothoe
## 63 Lycaena phlaeas
## 64 Lycaena virgaureae
## 65 Maniola jurtina
## 66 Melitaea athalia
## 67 Melitaea cinxia
## 68 Melitaea diamina
## 69 Nymphalis antiopa
## 70 Nymphalis polychloros
## 71 Ochlodes sylvanus
## 72 Oeneis bore
## 73 Oeneis jutta
## 74 Oeneis norna
## 75 Papilio machaon
## 76 Pararge aegeria
## 77 Parnassius apollo
## 78 Parnassius mnemosyne
## 79 Pieris brassicae
## 80 Pieris napi
## 81 Pieris rapae
## 82 Plebejus argus/idas
## 83 Plebejus argyrognomon
## 84 Polygonia c-album
## 85 Polyommatus amandus
## 86 Polyommatus icarus
## 87 Pontia daplidice
## 88 Pyrgus alveus
## 89 Pyrgus andromedae
## 90 Pyrgus centaureae
## 91 Pyrgus malvae
## 92 Satyrium w-album
## 93 Scolitantides orion
## 94 Speyeria aglaja
## 95 Thecla betulae
## 96 Thymelicus lineola
## 97 Vanessa atalanta
## 98 Vanessa cardui
## 99 Zygaena exulans
## 100 Zygaena filipendulae
## 101 Zygaena lonicerae
## 102 Zygaena osterodensis
## 103 Zygaena viciae
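The listing above was produced by a chunk that isn’t echoed; something along these lines reproduces it (a sketch):
lepiCountTab %>%
  distinct(species_name) %>%
  arrange(species_name) %>%
  print(n = Inf)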
lepiCountTab <- lepiCountTab %>%
mutate(species_name = ifelse(species_name == 'Colias croceus', 'Colias crocea', species_name))
The butterfly count table is now ready.
## # A tibble: 3,056,525 x 6
## visit_ID transect_ID section_ID date species_name count
## <chr> <chr> <chr> <dttm> <chr> <dbl>
## 1 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Adscita sta… 0
## 2 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Aglais io 0
## 3 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Aglais urti… 0
## 4 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Agriades aq… 0
## 5 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Agriades op… 0
## 6 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Agriades or… 0
## 7 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Anthocharis… 0
## 8 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Aphantopus … 0
## 9 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Aporia crat… 0
## 10 00161ef1-3… 20cb4183-069… 04901d87-48… 2011-06-30 00:00:00 Argynnis pa… 0
## # … with 3,056,515 more rows
Arrange a monitoring visit table
Here we summarize the characteristics for the individual transect walks.
In the original data, we have specific data on time, temperature, and cloud cover for each “section”. In rare cases, a “transect” can also be split over two days, so that there are two dates for a single “transect” visit. We therefore have information that is unique to the individual 50 m sections and might be worth including. This isn’t strictly compatible with the BMS format, which expects only one set of conditions for the whole “transect”.
We have two options here: either provide the data at the original resolution, with unique values per section visit, or aggregate the values to the “transect” (square-visit) resolution. We will show a solution for both approaches, arranging the data at the “section” level (our 50 m transects) in addition to the “transect” level that the BMS asks for.
First we take the section level arrangement.
monVisTabSection <- occTrans %>%
mutate(wind = NA, #We don't have wind data
completed = 1, #All recorded transects are implicitly completed
start_time = strftime(start_time.section, format = "%H:%M:%S"), #Remove the date part of the datetimes
end_time = strftime(end_time.section, format = "%H:%M:%S")) %>%
select(visit_ID = parentEventID,
recorder_id = observerID,
transect_ID = locationID.transect,
section = locationID.section,
date = eventDate.section,
start_time,
end_time,
temperature = temperatureCelsius,
cloud = `cloudCover%`,
wind,
completed) %>%
distinct()
monVisTabSection
## # A tibble: 29,675 x 11
## visit_ID recorder_id transect_ID section date start_time
## <chr> <chr> <chr> <chr> <dttm> <chr>
## 1 e5b7974… 2f0ccd4f-a… d53aef1f-7… dfb927… 2013-05-26 00:00:00 11:20:00
## 2 6c70518… 2f0ccd4f-a… d53aef1f-7… dfb927… 2013-07-23 00:00:00 11:45:00
## 3 6c81a4a… 2f0ccd4f-a… d53aef1f-7… dfb927… 2014-06-01 00:00:00 17:36:00
## 4 cac20b4… 2f0ccd4f-a… d53aef1f-7… dfb927… 2014-07-12 00:00:00 14:50:00
## 5 cdb2134… 2f0ccd4f-a… d53aef1f-7… dfb927… 2014-08-27 00:00:00 12:10:00
## 6 30792c0… 2f0ccd4f-a… d53aef1f-7… dfb927… 2015-07-04 00:00:00 12:00:00
## 7 4f44d6d… 2f0ccd4f-a… d53aef1f-7… dfb927… 2015-08-13 00:00:00 14:30:00
## 8 2689da3… 2f0ccd4f-a… d53aef1f-7… dfb927… 2016-05-27 00:00:00 15:55:00
## 9 d413064… 2f0ccd4f-a… d53aef1f-7… dfb927… 2016-07-20 00:00:00 15:09:00
## 10 3f292a1… 2f0ccd4f-a… d53aef1f-7… dfb927… 2017-06-27 00:00:00 15:37:00
## # … with 29,665 more rows, and 5 more variables: end_time <chr>,
## # temperature <dbl>, cloud <dbl>, wind <lgl>, completed <dbl>
Then the corresponding operation on the transect level (survey square). We first select the columns of interest.
#Select the data of interest
monVisTabTransect <- occTrans %>%
mutate(wind = NA, #We don't have wind data
completed = 1, #All recorded transects are implicitly completed
start_time = strftime(start_time.transect, format = "%H:%M:%S"), #Remove the date part of the datetimes
end_time = strftime(end_time.transect, format = "%H:%M:%S")) %>%
select(visit_ID = parentEventID,
recorder_id = observerID,
transect_ID = locationID.transect,
date = eventDate.transect,
start_time,
end_time,
temperature = temperatureCelsius,
cloud = `cloudCover%`,
wind,
completed) %>%
distinct()
Then we aggregate the data to the transect (survey square) level.
monVisTabTransect <- monVisTabTransect %>%
group_by(visit_ID) %>%
summarise(recorder_id = min(recorder_id), #Use the first recorder, if several have divided the sections between them
transect_ID = min(transect_ID), #Has only one value per survey event
date = min(date),
start_time = min(start_time),
end_time = max(end_time),
temperature = mean(temperature),
cloud = mean(cloud),
wind = mean(wind),
completed = mean(completed))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,506 x 10
## visit_ID recorder_id transect_ID date start_time end_time
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 00161ef… 9b98f4cf-9… 20cb4183-0… 2011-06-30 00:00:00 14:55:00 16:14:00
## 2 004d789… 76839443-4… cce42c93-e… 2013-07-20 00:00:00 12:15:00 13:20:00
## 3 005b7cf… fc6b9ecf-7… e0a3c18a-3… 2016-06-09 00:00:00 13:00:00 17:30:00
## 4 008d7ab… a939d930-1… e3756fff-2… 2011-08-03 00:00:00 13:10:00 14:30:00
## 5 009cf41… 6a478010-e… 911293b2-2… 2014-05-22 00:00:00 12:30:00 14:30:00
## 6 00ef24d… 99c120bf-8… 7bb92922-4… 2014-07-10 00:00:00 14:54:00 16:38:00
## 7 013b043… 2b780a04-1… a2834b8c-8… 2019-06-29 00:00:00 13:30:00 14:30:00
## 8 018d499… c2f4a043-e… ccf55543-4… 2015-06-12 00:00:00 11:00:00 14:11:00
## 9 01a22c7… c2f4a043-e… ccf55543-4… 2019-06-11 00:00:00 12:00:00 14:31:00
## 10 01a6668… a6495896-d… df94d08e-b… 2018-05-18 00:00:00 15:30:00 17:11:00
## # … with 1,496 more rows, and 4 more variables: temperature <dbl>, cloud <dbl>,
## # wind <dbl>, completed <dbl>
Arrange a site geographical information table
This output table contains the geographical information for the individual survey sections, i.e. the starting points, ending points, and centroids of the 50 m sections.
siteGeoInfo <- occTrans %>%
select(transect_ID = locationID.transect,
section = locationID.section,
decimalLatitude.section,
decimalLongitude.section,
footprintWKT.section) %>%
distinct()
##Register the geometry as SF object
siteGeoInfoSF <- siteGeoInfo %>%
st_as_sf(wkt = "footprintWKT.section",
crs = 4326)
#%>%
# st_transform(crs = 25833)
We can plot the locations of the transects on a simple map as a quality check. We can at least see that the transects lie within Norway, which rules out transformation and projection errors.
world1 <- sf::st_as_sf(map('world', plot = FALSE, fill = TRUE))
norway <- world1 %>% filter(ID == "Norway")
#%>%
# st_transform(crs = 25833) #To be able to use st_cast
plot(norway,
main = NULL,
reset = F,
col = "seagreen")
plot(siteGeoInfoSF["transect_ID"],
lwd = 4,
add = T)
We now extract the start and end points from the linestring that represents each section. It seems you have to jump through some hoops to get the start and end points from a linestring in sf. When we extract the points from the linestrings, they are first stored in separate rows instead of columns, and we have to spread them into separate columns manually.
points <- siteGeoInfoSF %>%
st_transform(crs = 25833) %>%
st_cast("POINT",
) %>%
st_transform(4326) %>%
select(section) %>%
as_tibble()
## Warning in st_cast.sf(., "POINT", ): repeating attributes for all sub-geometries
## for which they may not be constant
points <- points %>%
mutate(which = rep(c("start", "end"), nrow(.)/2)) %>%
group_by(section, which) %>%
mutate(row = row_number()) %>%
pivot_wider(values_from = footprintWKT.section,
names_from = which) %>%
select(-row)
#Join the points to the original data.
siteGeoInfoTab <- siteGeoInfoSF %>%
as_tibble() %>%
right_join(points, by = c("section" = "section"))
We can plot a transect as a sanity check. Starting points are shown in blue, end points in red, centroids in black. Notice that the order of the start and end points of the individual sections doesn’t necessarily follow the transect in a continuous way; some sections go left to right, while others go right to left. We would need to sort this out in the source database.
##Sanity check plot
##Seems to work, but the start and endpoints of the sections seem to be random.
#Register the centroid points as a geometry
tempSection <- siteGeoInfoTab %>%
st_as_sf(coords = c("decimalLongitude.section", "decimalLatitude.section"),
crs = 4326)
tempSection
## Simple feature collection with 1851 features and 2 fields
## Active geometry column: footprintWKT.section
## geometry type: LINESTRING
## dimension: XY
## bbox: xmin: 5.221364 ymin: 58.0589 xmax: 13.18196 ymax: 64.9271
## geographic CRS: WGS 84
## # A tibble: 1,851 x 6
## transect_ID section footprintWKT.section start
## * <chr> <chr> <LINESTRING [°]> <POINT [°]>
## 1 d53aef1f-7… dfb927… (5.221583 59.39742, 5.22… (5.221583 59.39742)
## 2 d53aef1f-7… efc706… (5.221583 59.39742, 5.22… (5.221583 59.39742)
## 3 d53aef1f-7… fff25b… (5.221911 59.39699, 5.22… (5.221911 59.39699)
## 4 d53aef1f-7… 316c21… (5.223379 59.39645, 5.22… (5.223379 59.39645)
## 5 d53aef1f-7… e07338… (5.224109 59.39617, 5.22… (5.224109 59.39617)
## 6 d53aef1f-7… 34a6ae… (5.224726 59.39583, 5.22… (5.224726 59.39583)
## 7 d53aef1f-7… 17696e… (5.225234 59.39545, 5.22… (5.225234 59.39545)
## 8 d53aef1f-7… cc8b58… (5.225234 59.39545, 5.22… (5.225234 59.39545)
## 9 d53aef1f-7… b191a5… (5.226234 59.39467, 5.22… (5.226234 59.39467)
## 10 d53aef1f-7… 0628de… (5.226706 59.39428, 5.22… (5.226706 59.39428)
## # … with 1,841 more rows, and 2 more variables: end <POINT [°]>,
## # geometry <POINT>
#Plot the linestring
tempSection %>%
filter(transect_ID == 'd53aef1f-7502-4724-9be7-dd49f44dce59') %>%
select(transect_ID) %>%
plot(reset = F,
key.pos = NULL,
main = NULL)
#Plot the start points
st_geometry(tempSection) <- "start"
tempSection %>%
filter(transect_ID == 'd53aef1f-7502-4724-9be7-dd49f44dce59') %>%
select(transect_ID) %>%
plot(add = T,
col = "blue")
#Plot the end points
st_geometry(tempSection) <- "end"
tempSection %>%
filter(transect_ID == 'd53aef1f-7502-4724-9be7-dd49f44dce59') %>%
select(transect_ID) %>%
plot(add = T,
col = "red")
#Plot the centroids
st_geometry(tempSection) <- "geometry"
tempSection %>%
filter(transect_ID == 'd53aef1f-7502-4724-9be7-dd49f44dce59') %>%
select(transect_ID) %>%
plot(add = T,
col = "black")
We see that we have the correct start, end, and centroid points of the transect line, although the order of the start and end points is a little off. We can then rename the columns and extract the columns of interest.
siteGeoInfoTab <- siteGeoInfoTab %>%
mutate(lon_start_point = st_coordinates(start)[,1],
lat_start_point = st_coordinates(start)[,2],
lon_end_point = st_coordinates(end)[,1],
lat_end_point = st_coordinates(end)[,2],
SRID_for_all_coordinates = 4326,
monitoring_type = 2) %>%
select(transect_ID,
section_ID = section,
lon_start_point,
lat_start_point,
lon_end_point,
lat_end_point,
lon_centroid = decimalLongitude.section,
lat_centroid = decimalLatitude.section,
SRID_for_all_coordinates,
monitoring_type
)
siteGeoInfoTab
## # A tibble: 1,851 x 10
## transect_ID section_ID lon_start_point lat_start_point lon_end_point
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 d53aef1f-7… dfb9272e-… 5.22 59.4 5.22
## 2 d53aef1f-7… efc706a7-… 5.22 59.4 5.22
## 3 d53aef1f-7… fff25b5b-… 5.22 59.4 5.22
## 4 d53aef1f-7… 316c2185-… 5.22 59.4 5.22
## 5 d53aef1f-7… e0733813-… 5.22 59.4 5.22
## 6 d53aef1f-7… 34a6aebd-… 5.22 59.4 5.22
## 7 d53aef1f-7… 17696e5a-… 5.23 59.4 5.22
## 8 d53aef1f-7… cc8b5806-… 5.23 59.4 5.23
## 9 d53aef1f-7… b191a51d-… 5.23 59.4 5.23
## 10 d53aef1f-7… 0628debb-… 5.23 59.4 5.23
## # … with 1,841 more rows, and 5 more variables: lat_end_point <dbl>,
## # lon_centroid <dbl>, lat_centroid <dbl>, SRID_for_all_coordinates <dbl>,
## # monitoring_type <dbl>
Arrange a habitat type table
This table contains information on the habitat type of each section. The BMS asks for the habitat type on both sides of a transect. In the national monitoring program, we haven’t differentiated between the sides of the transect, but only characterized the individual sections as either “grassland” or “forest”. All sections are located in open/low vegetation, and most often align with roads or trails.
“Grassland” refers to habitats that are dominated by continuous clearing/grazing/plowing. A substantial part is lowland semi-natural grasslands, but the sections also span more intensively managed farmland.
“Forest” refers to open but otherwise forested areas, dominated by managed forests but potentially including natural forests. These sections typically align with smaller gravel roads in conifer forests (mostly Norway spruce and pine), but can include powerline clearings.
habTypeTab <- occTrans %>%
select(transect_ID = locationID.transect,
section_ID = locationID.section,
habitat_Type = habitatType
) %>%
distinct()
habTypeTab
## # A tibble: 1,834 x 3
## transect_ID section_ID habitat_Type
## <chr> <chr> <chr>
## 1 d53aef1f-7502-4724-9be7-dd49f44… dfb9272e-be2a-4a7e-b882-4cc5a7… grassland
## 2 d53aef1f-7502-4724-9be7-dd49f44… efc706a7-9cef-4199-b5bf-5f5b97… grassland
## 3 d53aef1f-7502-4724-9be7-dd49f44… fff25b5b-7a7c-4fcc-a927-cb6047… grassland
## 4 d53aef1f-7502-4724-9be7-dd49f44… 316c2185-30c1-4077-9f39-52cfe5… grassland
## 5 d53aef1f-7502-4724-9be7-dd49f44… e0733813-696e-4ec5-9986-d98fe0… grassland
## 6 d53aef1f-7502-4724-9be7-dd49f44… 34a6aebd-a193-4851-a46b-98417f… grassland
## 7 d53aef1f-7502-4724-9be7-dd49f44… 17696e5a-5f7a-4a2b-b879-b3d6e1… grassland
## 8 d53aef1f-7502-4724-9be7-dd49f44… cc8b5806-0a52-4d58-8a6c-73c902… grassland
## 9 d53aef1f-7502-4724-9be7-dd49f44… b191a51d-6385-43d5-98a0-238c5e… grassland
## 10 d53aef1f-7502-4724-9be7-dd49f44… 0628debb-a252-4e14-ba19-0ef29a… grassland
## # … with 1,824 more rows
Arrange a species name table
The only discrepancy between the locally used Latin names and the Fauna Europaea Latin names is “Colias croceus”/“Colias crocea”. Note that some taxa are not distinguished to species. We also include the local vernacular names (Norwegian).
speciesNameTab <- occTrans %>%
select(national_species_latin = scientificName,
fauna_europea_species_latin = scientificName,
local_species_norwegian = vernacularName) %>%
mutate(national_species_latin = ifelse(national_species_latin == "Colias croceus", "Colias crocea", national_species_latin)) %>%
distinct()
speciesNameTab
## # A tibble: 103 x 3
## national_species_latin fauna_europea_species_latin local_species_norwegian
## <chr> <chr> <chr>
## 1 Vanessa cardui Vanessa cardui Tistelsommerfugl
## 2 Pieris brassicae Pieris brassicae Stor Kålsommerfugl
## 3 Polygonia c-album Polygonia c-album Hvit C
## 4 Callophrys rubi Callophrys rubi Grønnstjertvinge
## 5 Agriades optilete Agriades optilete Myrblåvinge
## 6 Melitaea diamina Melitaea diamina Mørk Rutevinge
## 7 Satyrium w-album Satyrium w-album Almestjertvinge
## 8 Issoria lathonia Issoria lathonia Sølvkåpe
## 9 Boloria improba Boloria improba Dvergperlemorvinge
## 10 Thecla betulae Thecla betulae Slåpetornstjertvinge
## # … with 93 more rows
Export the BMS data into CSV-files
We are now ready to export the tables. Here we use the CSV format, with the default encoding (UTF-8).
write_csv(monVisTabTransect,
          file = "out/MonitoringVisitTableTransectLevel.csv")
write_csv(monVisTabSection,
          file = "out/MonitoringVisitTableSectionLevel.csv")
write_csv(siteGeoInfoTab,
          file = "out/SiteGeographicalInformationTable.csv")
write_csv(habTypeTab,
          file = "out/HabitatTypeTable.csv")
write_csv(speciesNameTab,
          file = "out/SpeciesNameTable.csv")
Extract zeroes to avoid a large file size
Since we include count data for all species for all surveys, the count table contains a huge number of zeroes, resulting in a file that is too large to store on GitHub.
However, it is possible to exclude the rows with zero individuals and recreate them in R afterwards. To do that, we store all unique surveys together with the records that have non-zero counts.
uniqueVisits <- lepiCountTab %>%
select(visit_ID,
transect_ID,
section_ID,
date) %>%
distinct()
lepiCountTabNoZeroes <- lepiCountTab %>%
filter(count > 0)
Instead of storing 2.4 million rows, we are suddenly down to 23.5 thousand individual surveys and a little over 9 thousand non-zero observations, with no loss of information. We store these in separate CSV files. We already have all the species names stored in the species table.
write_csv(uniqueVisits,
          file = "out/uniqueVisits.csv")
write_csv(lepiCountTabNoZeroes,
          file = "out/ButterflyCountDataTableNoZeroes.csv")
Recreate the zero counts
Here is how to restore the full data, with zeroes. For completeness, we start by reading in the CSV files. Note that we also read in the full table of species names.
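The reading step is not echoed, but amounts to three read_csv calls (a sketch, using the file names from the export above):
uniqueVisits <- read_csv("out/uniqueVisits.csv")
lepiCountTabNoZeroes <- read_csv("out/ButterflyCountDataTableNoZeroes.csv")
speciesNameTab <- read_csv("out/SpeciesNameTable.csv")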
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## visit_ID = col_character(),
## transect_ID = col_character(),
## section_ID = col_character(),
## date = col_datetime(format = "")
## )
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## visit_ID = col_character(),
## transect_ID = col_character(),
## section_ID = col_character(),
## date = col_datetime(format = ""),
## species_name = col_character(),
## count = col_double()
## )
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## national_species_latin = col_character(),
## fauna_europea_species_latin = col_character(),
## local_species_norwegian = col_character()
## )
We then expand all combinations of species and survey times.
zeroesToJoin <- speciesNameTab %>%
  select(species_name = national_species_latin) %>%
  left_join(uniqueVisits,
            by = character()) #Joining by zero columns yields a full cross join
And join these to the records with non-zero counts.
lepiCountTabRecreatedZeroes <- zeroesToJoin %>%
left_join(lepiCountTabNoZeroes) %>%
mutate(count = coalesce(count, 0)) %>%
select(visit_ID,
transect_ID,
section_ID,
date,
species_name,
count) %>%
arrange(visit_ID, transect_ID, section_ID, date, species_name)
## Joining, by = c("species_name", "visit_ID", "transect_ID", "section_ID", "date")
We can check that the recreated data matches the original.
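The comparison chunk isn’t echoed; one way to do it is with all.equal (a sketch):
all.equal(lepiCountTab, lepiCountTabRecreatedZeroes)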
## [1] TRUE
Go back to the blog here: https://jenast.github.io/