Concetp of relational data

Main goal of this exercise is to show main applications of the relational data concept with the dplyr package.

For more information, see relational data:

Following the previous example, use the same working directory and download this data file to the folder 03-joints-exercise within your working directory (see previous examples for more information).

Mutating joint

Add columns with the explanation of the variables Trade.Flow.Code, Reporter.Code and Partner.Code using the left_join() or right_join() and the mapping tables from the package tradeAnalysis explained in the previous exercise. Use the subset of 20 rows of the data ctData, loaded as in the previous exercise.

Solution

library(tidyverse)
library(readr)
library(tradeAnalysis)
ctData <- read_rds("03-joints-exercise/subsetting-data.rds")
ctData %>% 
  sample_n(100) %>% 
  left_join(x = ., y = partners, by = "Partner.Code") %>% 
  left_join(x = ., y = reporters, by = "Reporter.Code") %>% 
  left_join(x = ., y = tradeAnalysis::regimes, by = "Trade.Flow.Code")
## # A tibble: 100 x 11
##    Classification  Year Trade.Flow.Code Reporter.Code Partner.Code
##             <chr> <int>           <dbl>         <int>        <int>
##  1             H4  2016               2            36          598
##  2             H4  2015               2           214          332
##  3             H4  2015               1           604          862
##  4             H3  2015               2           780          826
##  5             H4  2016               1           699          703
##  6             H4  2016               2           702          156
##  7             H4  2015               2           818          706
##  8             H4  2014               2           512          381
##  9             H4  2015               1           682          156
## 10             H4  2016               1           191          203
## # ... with 90 more rows, and 6 more variables: Commodity.Code <chr>,
## #   Netweight..kg. <dbl>, Trade.Value..US.. <dbl>, Partner <chr>,
## #   Reporter <chr>, Trade.Flow <chr>

Do the same with the help of the tradeAnalysis package.

Solution

ctData %>% 
  sample_n(100) %>% 
  join_labels_flows() %>% 
  join_labels_par() %>% 
  join_labels_rep()
## # A tibble: 100 x 11
##    Classification  Year Trade.Flow.Code Reporter.Code Partner.Code
##             <chr> <int>           <int>         <int>        <int>
##  1             H4  2015               1           528          591
##  2             H4  2015               1           376          804
##  3             H4  2015               1           579          842
##  4             H4  2016               2           466          384
##  5             H4  2015               1           702          642
##  6             H4  2014               2           616          834
##  7             H4  2014               1           558          604
##  8             H4  2015               1           440          100
##  9             H4  2015               2           604          188
## 10             H4  2016               2           710          616
## # ... with 90 more rows, and 6 more variables: Commodity.Code <chr>,
## #   Netweight..kg. <dbl>, Trade.Value..US.. <dbl>, Trade.Flow <chr>,
## #   Partner <chr>, Reporter <chr>

Do the same with the help of the tradeAnalysis package in one line.

Solution

ctData %>% 
  sample_n(100) %>% 
  join_labs()
## # A tibble: 100 x 12
##    Classification  Year Trade.Flow.Code Reporter.Code Partner.Code
##             <chr> <int>           <int>         <int>        <int>
##  1             H4  2015               2           372          724
##  2             H4  2016               1           124          604
##  3             H4  2016               1           410          792
##  4             H4  2014               2           208            4
##  5             H4  2015               2           388          796
##  6             H3  2014               1           690          784
##  7             H4  2014               1           834          381
##  8             H4  2015               1           376          188
##  9             H4  2016               1           222           32
## 10             H4  2014               2           191          705
## # ... with 90 more rows, and 7 more variables: Commodity.Code <chr>,
## #   Netweight..kg. <dbl>, Trade.Value..US.. <dbl>, Commodity <chr>,
## #   Partner <chr>, Reporter <chr>, Trade.Flow <chr>

Filtering joints

Sometimes, when you need to filter a big data set, using multiple arguments, it is better to use join() functions then filter().

From ctData filter all EU countries import of wheat (1001) with the world. (Not the best example because it is easier to make it with the help of the filter function)

Solution

ctData %>% 
  right_join(getEU() %>% 
               select(-Partner) %>% 
               mutate(Commodity.Code = "1001",
                      Trade.Flow.Code = 1), 
             by = c("Reporter.Code" = "Partner.Code"))
## # A tibble: 1,265,298 x 10
##    Classification  Year Trade.Flow.Code.x Reporter.Code Partner.Code
##             <chr> <int>             <int>         <int>        <int>
##  1             H4  2014                 1            20            0
##  2             H4  2014                 2            20            0
##  3             H4  2014                 1            20          251
##  4             H4  2014                 2            20          251
##  5             H4  2014                 1            20          724
##  6             H4  2014                 2            20          724
##  7             H4  2014                 1            20          899
##  8             H4  2014                 1            20            0
##  9             H4  2014                 2            20            0
## 10             H4  2014                 1            20          251
## # ... with 1,265,288 more rows, and 5 more variables:
## #   Commodity.Code.x <chr>, Netweight..kg. <dbl>, Trade.Value..US.. <dbl>,
## #   Commodity.Code.y <chr>, Trade.Flow.Code.y <dbl>

Filtering joints are very useful, when there are multiple combinations of the variables that need to be filtered. for examples, use the following filtering data frame for sub-setting the data set. The data frame generated below represents 12 reporters and their trade with the 3 random partners.

Solution

set.seed(1)
fltDF <-
  tibble(Reporter.Code = c(getFSR()$Partner.Code)) %>% 
  group_by(Reporter.Code) %>% 
  do({tibble(Partner.Code = sample_n(partners, 3)$Partner.Code)}) %>% 
  ungroup() %>% 
  mutate(Commodity.Code = "01") %>% 
  expand(Commodity.Code = c("01", "02", "03"), Partner.Code, Reporter.Code)

Use semi_join() and compare the results with the right_join().

Solution

semi_join(ctData, fltDF, by = c("Reporter.Code", "Partner.Code", "Commodity.Code"))
## # A tibble: 469 x 8
##    Classification  Year Trade.Flow.Code Reporter.Code Partner.Code
##             <chr> <int>           <int>         <int>        <int>
##  1             H4  2014               1            31          528
##  2             H4  2014               1            31          792
##  3             H4  2014               1            31          804
##  4             H4  2014               1            51          528
##  5             H4  2014               1            51          792
##  6             H4  2014               1            51          804
##  7             H4  2014               1           112          381
##  8             H4  2014               1           112          528
##  9             H4  2014               1           112          804
## 10             H4  2014               1           268          381
## # ... with 459 more rows, and 3 more variables: Commodity.Code <chr>,
## #   Netweight..kg. <dbl>, Trade.Value..US.. <dbl>

Tidy data

The main idea of the tidy data concept is that we store and use data in the way that we can apply to it any analysis easily and without problems. We can easily split it subset it and re-arrange it any suitable way. For that purpose we use two main verbs: spread() and gather(). Also there are unite(), separate() and others, see tidy data.

Use the previously loaded data in order to return a table with the columns: Commodity.Code, Commodity, Reporter, Partner, Trade.Flow, 2014, 2015, 2016 and Trade Values on intersection only for EU countries trade with the world and commodity code is 1001.

Solution

ctData %>% 
  filter(Partner.Code == 0, 
         Reporter.Code %in% getEU()$Partner.Code, 
         Commodity.Code == "1001") %>% 
  join_labs() %>% 
  select(Year, Commodity.Code, Commodity, Reporter, 
         Partner, Trade.Flow, Trade.Value..US..) %>% 
  spread(Year, Trade.Value..US..)
## # A tibble: 59 x 8
##    Commodity.Code               Commodity Reporter Partner Trade.Flow
##  *          <chr>                   <chr>    <chr>   <chr>      <chr>
##  1           1001 1001 - Wheat and meslin  Andorra   World     Import
##  2           1001 1001 - Wheat and meslin  Austria   World     Export
##  3           1001 1001 - Wheat and meslin  Austria   World     Import
##  4           1001 1001 - Wheat and meslin  Belgium   World     Export
##  5           1001 1001 - Wheat and meslin  Belgium   World     Import
##  6           1001 1001 - Wheat and meslin Bulgaria   World     Export
##  7           1001 1001 - Wheat and meslin Bulgaria   World     Import
##  8           1001 1001 - Wheat and meslin  Croatia   World     Export
##  9           1001 1001 - Wheat and meslin  Croatia   World     Import
## 10           1001 1001 - Wheat and meslin   Cyprus   World     Export
## # ... with 49 more rows, and 3 more variables: `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>

Copyright © 2017 Eduard Bukin. All rights reserved.