Main goal of this exercise is to show main applications of the relational data concept with the dplyr package.
For more information, see relational data:
Following the previous example, use the same working directory and download this data file to the folder 03-joints-exercise within your working directory (see previous examples for more information).
Add columns with the explanation of the variables Trade.Flow.Code, Reporter.Code and Partner.Code using the left_join() or right_join() and the mapping tables from the package tradeAnalysis explained in the previous exercise. Use the subset of 20 rows of the data ctData, loaded as in the previous exercise.
library(tidyverse)
library(readr)
library(tradeAnalysis)
ctData <- read_rds("03-joints-exercise/subsetting-data.rds")
ctData %>%
sample_n(100) %>%
left_join(x = ., y = partners, by = "Partner.Code") %>%
left_join(x = ., y = reporters, by = "Reporter.Code") %>%
left_join(x = ., y = tradeAnalysis::regimes, by = "Trade.Flow.Code")
## # A tibble: 100 x 11
## Classification Year Trade.Flow.Code Reporter.Code Partner.Code
## <chr> <int> <dbl> <int> <int>
## 1 H4 2016 2 36 598
## 2 H4 2015 2 214 332
## 3 H4 2015 1 604 862
## 4 H3 2015 2 780 826
## 5 H4 2016 1 699 703
## 6 H4 2016 2 702 156
## 7 H4 2015 2 818 706
## 8 H4 2014 2 512 381
## 9 H4 2015 1 682 156
## 10 H4 2016 1 191 203
## # ... with 90 more rows, and 6 more variables: Commodity.Code <chr>,
## # Netweight..kg. <dbl>, Trade.Value..US.. <dbl>, Partner <chr>,
## # Reporter <chr>, Trade.Flow <chr>
Do the same with the help of the tradeAnalysis package.
ctData %>%
sample_n(100) %>%
join_labels_flows() %>%
join_labels_par() %>%
join_labels_rep()
## # A tibble: 100 x 11
## Classification Year Trade.Flow.Code Reporter.Code Partner.Code
## <chr> <int> <int> <int> <int>
## 1 H4 2015 1 528 591
## 2 H4 2015 1 376 804
## 3 H4 2015 1 579 842
## 4 H4 2016 2 466 384
## 5 H4 2015 1 702 642
## 6 H4 2014 2 616 834
## 7 H4 2014 1 558 604
## 8 H4 2015 1 440 100
## 9 H4 2015 2 604 188
## 10 H4 2016 2 710 616
## # ... with 90 more rows, and 6 more variables: Commodity.Code <chr>,
## # Netweight..kg. <dbl>, Trade.Value..US.. <dbl>, Trade.Flow <chr>,
## # Partner <chr>, Reporter <chr>
Do the same with the help of the tradeAnalysis package in one line.
ctData %>%
sample_n(100) %>%
join_labs()
## # A tibble: 100 x 12
## Classification Year Trade.Flow.Code Reporter.Code Partner.Code
## <chr> <int> <int> <int> <int>
## 1 H4 2015 2 372 724
## 2 H4 2016 1 124 604
## 3 H4 2016 1 410 792
## 4 H4 2014 2 208 4
## 5 H4 2015 2 388 796
## 6 H3 2014 1 690 784
## 7 H4 2014 1 834 381
## 8 H4 2015 1 376 188
## 9 H4 2016 1 222 32
## 10 H4 2014 2 191 705
## # ... with 90 more rows, and 7 more variables: Commodity.Code <chr>,
## # Netweight..kg. <dbl>, Trade.Value..US.. <dbl>, Commodity <chr>,
## # Partner <chr>, Reporter <chr>, Trade.Flow <chr>
Sometimes, when you need to filter a big data set, using multiple arguments, it is better to use join() functions then filter().
From ctData filter all EU countries import of wheat (1001) with the world. (Not the best example because it is easier to make it with the help of the filter function)
ctData %>%
right_join(getEU() %>%
select(-Partner) %>%
mutate(Commodity.Code = "1001",
Trade.Flow.Code = 1),
by = c("Reporter.Code" = "Partner.Code"))
## # A tibble: 1,265,298 x 10
## Classification Year Trade.Flow.Code.x Reporter.Code Partner.Code
## <chr> <int> <int> <int> <int>
## 1 H4 2014 1 20 0
## 2 H4 2014 2 20 0
## 3 H4 2014 1 20 251
## 4 H4 2014 2 20 251
## 5 H4 2014 1 20 724
## 6 H4 2014 2 20 724
## 7 H4 2014 1 20 899
## 8 H4 2014 1 20 0
## 9 H4 2014 2 20 0
## 10 H4 2014 1 20 251
## # ... with 1,265,288 more rows, and 5 more variables:
## # Commodity.Code.x <chr>, Netweight..kg. <dbl>, Trade.Value..US.. <dbl>,
## # Commodity.Code.y <chr>, Trade.Flow.Code.y <dbl>
Filtering joints are very useful, when there are multiple combinations of the variables that need to be filtered. for examples, use the following filtering data frame for sub-setting the data set. The data frame generated below represents 12 reporters and their trade with the 3 random partners.
set.seed(1)
fltDF <-
tibble(Reporter.Code = c(getFSR()$Partner.Code)) %>%
group_by(Reporter.Code) %>%
do({tibble(Partner.Code = sample_n(partners, 3)$Partner.Code)}) %>%
ungroup() %>%
mutate(Commodity.Code = "01") %>%
expand(Commodity.Code = c("01", "02", "03"), Partner.Code, Reporter.Code)
Use semi_join() and compare the results with the right_join().
semi_join(ctData, fltDF, by = c("Reporter.Code", "Partner.Code", "Commodity.Code"))
## # A tibble: 469 x 8
## Classification Year Trade.Flow.Code Reporter.Code Partner.Code
## <chr> <int> <int> <int> <int>
## 1 H4 2014 1 31 528
## 2 H4 2014 1 31 792
## 3 H4 2014 1 31 804
## 4 H4 2014 1 51 528
## 5 H4 2014 1 51 792
## 6 H4 2014 1 51 804
## 7 H4 2014 1 112 381
## 8 H4 2014 1 112 528
## 9 H4 2014 1 112 804
## 10 H4 2014 1 268 381
## # ... with 459 more rows, and 3 more variables: Commodity.Code <chr>,
## # Netweight..kg. <dbl>, Trade.Value..US.. <dbl>
The main idea of the tidy data concept is that we store and use data in the way that we can apply to it any analysis easily and without problems. We can easily split it subset it and re-arrange it any suitable way. For that purpose we use two main verbs: spread() and gather(). Also there are unite(), separate() and others, see tidy data.
Use the previously loaded data in order to return a table with the columns: Commodity.Code, Commodity, Reporter, Partner, Trade.Flow, 2014, 2015, 2016 and Trade Values on intersection only for EU countries trade with the world and commodity code is 1001.
ctData %>%
filter(Partner.Code == 0,
Reporter.Code %in% getEU()$Partner.Code,
Commodity.Code == "1001") %>%
join_labs() %>%
select(Year, Commodity.Code, Commodity, Reporter,
Partner, Trade.Flow, Trade.Value..US..) %>%
spread(Year, Trade.Value..US..)
## # A tibble: 59 x 8
## Commodity.Code Commodity Reporter Partner Trade.Flow
## * <chr> <chr> <chr> <chr> <chr>
## 1 1001 1001 - Wheat and meslin Andorra World Import
## 2 1001 1001 - Wheat and meslin Austria World Export
## 3 1001 1001 - Wheat and meslin Austria World Import
## 4 1001 1001 - Wheat and meslin Belgium World Export
## 5 1001 1001 - Wheat and meslin Belgium World Import
## 6 1001 1001 - Wheat and meslin Bulgaria World Export
## 7 1001 1001 - Wheat and meslin Bulgaria World Import
## 8 1001 1001 - Wheat and meslin Croatia World Export
## 9 1001 1001 - Wheat and meslin Croatia World Import
## 10 1001 1001 - Wheat and meslin Cyprus World Export
## # ... with 49 more rows, and 3 more variables: `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>
Copyright © 2017 Eduard Bukin. All rights reserved.