Using COMTRADE bulk/regular downloads and loading data into R

COMTRADE
Regular CT download
- Loading regular CT data in R
Bulk CT download
- Loading zipped bulk download data into R
  - Manual unzipping
  - Unzipping files using R
CT data availability and API commands for data download
Downloading CT bulk data using CT API, R and an internet connection

COMTRADE

There are two main options of data download in COMTRADE (CT):

Regular - manually insert a query in the browser and get a .csv file response
Bulk - manually insert short query on a web page, or use the browser command line (API) to pass arguments to the server. In return, consumer downloads a zip archive with data that covers trade of at least one country in one time period in a specified coding system. Bulk download has an option to download data of all reporters for one period/coding system in file.

Regular CT download

Regular downloads are available at https://comtrade.un.org/data/. This is a query based approach to load CT data selecting a specific range of parameters (API calls are covered for bulk downloads). The downloaded CSV file can then be inserted into the R code.

On https://comtrade.un.org/data/, submit a query and download the following data:

Type of product: Goods
Frequency: Annual
HS: As reported (Most frequently used as it combines all codes that were ever recorded)
Period: 2014-2016
Reporter: Finland
Partners: World, USA
Trade flow: All
HS (as reported) commodity codes: Total, any 2, 4 and 6 digits code (11, 1101, 110100)

Loading regular CT data in R

Before starting practicing in R, make sure that you have setup a proper working directly. Best option would be to create an R project see and use the project folder as your working directory. To check what is your working directory use getwd().

After downloading the CSV file, it’s time to load the data in R, using the tools available in the base package. It’s good to always use the stringsAsFactors = FALSE argument to avoid having too many factors variables, which may complicate your work.

file <- "01-data-import/ct_regular_extraction.csv"
ctData <- read.csv(file, stringsAsFactors = FALSE) 
names(ctData)     # See the names of all variables

##  [1] "Classification"         "Year"                  
##  [3] "Period"                 "Period.Desc."          
##  [5] "Aggregate.Level"        "Is.Leaf.Code"          
##  [7] "Trade.Flow.Code"        "Trade.Flow"            
##  [9] "Reporter.Code"          "Reporter"              
## [11] "Reporter.ISO"           "Partner.Code"          
## [13] "Partner"                "Partner.ISO"           
## [15] "X2nd.Partner.Code"      "X2nd.Partner"          
## [17] "X2nd.Partner.ISO"       "Customs.Proc..Code"    
## [19] "Customs"                "Mode.of.Transport.Code"
## [21] "Mode.of.Transport"      "Commodity.Code"        
## [23] "Commodity"              "Qty.Unit.Code"         
## [25] "Qty.Unit"               "Qty"                   
## [27] "Alt.Qty.Unit.Code"      "Alt.Qty.Unit"          
## [29] "Alt.Qty"                "Netweight..kg."        
## [31] "Gross.weight..kg."      "Trade.Value..US.."     
## [33] "CIF.Trade.Value..US.."  "FOB.Trade.Value..US.." 
## [35] "Flag"

head(ctData)     # See the first 5 rows of the data frame
tail(ctData)     # See the last 4 rows of the data frame
str(ctData)     # See the structure of the data object
class(ctData)     # See the class of the data object

Attention: when assigning the .csv file to file specify the path leading to the .csv file from the working directory. To find out what the working directory is run getwd(). In the example above the .csv file is situated in the 01-data-directory, which itself is in the working directory.

There are more efficient packages than the base package to load data in R. We can, for example, do the same using the more user friendly tools from the tidyverse package:

library(tidyverse)  # Setups
library(readr)      # Setups

ctData <- read_csv(file)

Using read_csv() returns a tbl_df(), or tibble, object that allows you to explore all variables of the data frame without printing it all in the console. There is no need to use head(), tail(), str().

ctData

## # A tibble: 24 x 35
##    Classification  Year Period `Period Desc.` `Aggregate Level`
##             <chr> <int>  <int>          <int>             <int>
##  1             H4  2014   2014           2014                 2
##  2             H4  2014   2014           2014                 2
##  3             H4  2014   2014           2014                 2
##  4             H4  2014   2014           2014                 2
##  5             H4  2015   2015           2015                 2
##  6             H4  2015   2015           2015                 2
##  7             H4  2015   2015           2015                 2
##  8             H4  2015   2015           2015                 2
##  9             H4  2014   2014           2014                 4
## 10             H4  2014   2014           2014                 4
## # ... with 14 more rows, and 30 more variables: `Is Leaf Code` <int>,
## #   `Trade Flow Code` <int>, `Trade Flow` <chr>, `Reporter Code` <int>,
## #   Reporter <chr>, `Reporter ISO` <chr>, `Partner Code` <int>,
## #   Partner <chr>, `Partner ISO` <chr>, `2nd Partner Code` <chr>, `2nd
## #   Partner` <chr>, `2nd Partner ISO` <chr>, `Customs Proc. Code` <chr>,
## #   Customs <chr>, `Mode of Transport Code` <chr>, `Mode of
## #   Transport` <chr>, `Commodity Code` <chr>, Commodity <chr>, `Qty Unit
## #   Code` <int>, `Qty Unit` <chr>, Qty <chr>, `Alt Qty Unit Code` <chr>,
## #   `Alt Qty Unit` <int>, `Alt Qty` <chr>, `Netweight (kg)` <int>, `Gross
## #   weight (kg)` <chr>, `Trade Value (US$)` <dbl>, `CIF Trade Value
## #   (US$)` <chr>, `FOB Trade Value (US$)` <chr>, Flag <int>

Bulk CT download

Bulk downloads are available using the web page https://comtrade.un.org/data/bulk.

There are two options how to access data here:

Web based query in the browser https://comtrade.un.org/data/bulk as we have just seen.
API call based approach to load CT data. In this case the user constructs a query such as “http://comtrade.un.org/api/get/bulk/C/A/1994/251/HS” following the guidelines outlined on https://comtrade.un.org/data/doc/api/bulk/. To request a bulk data file we need to use the specific format explained on COMTRADE, as follows: http://comtrade.un.org/api/get/bulk/{type}/{freq}/{ps}/{r}/{px}?{token=}, with type (trade type), freq (frequency), ps (period), r (reporter), px (classification), and optionally token (see authentication).

Using the web query approach let’s download data for Franceâ€™s annual trade of commodities in 1994.

The zip archive returned by COMTRADE is named “type-C_r-251_ps-1994_freq-A_px-HS_pub-20020901_fmt-csv_ex-20150310.zip”, and contains the file “type-C_r-251_ps-1994_freq-A_px-HS_pub-20020901_fmt-csv_ex-20150310.csv” with data. Attention! Data files can be VERY HEAVY (up to 9-10 GB), which may lead to difficulties while loading them.

The same file can be obtained through the API call “http://comtrade.un.org/api/get/bulk/C/A/1994/251/H”.

Loading zipped bulk download data into R

Manual unzipping

Manually unzipping the bulk COMTRADE data file into the same folder and reading it is done in the same way as before.

file <- "01-data-import/type-C_r-251_ps-1994_freq-A_px-HS_pub-20020901_fmt-csv_ex-20150310.csv"
ctFranceData <- read_csv(file, guess_max = 1000000) # guess_max = 1000000 is used to avoid useless warnings.

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Classification = col_character(),
##   `Trade Flow` = col_character(),
##   Reporter = col_character(),
##   `Reporter ISO` = col_character(),
##   Partner = col_character(),
##   `Partner ISO` = col_character(),
##   `Commodity Code` = col_character(),
##   Commodity = col_character(),
##   `Qty Unit` = col_character(),
##   Qty = col_double(),
##   `Netweight (kg)` = col_double(),
##   `Trade Value (US$)` = col_double()
## )

## See spec(...) for full column specifications.

ctFranceData

## # A tibble: 465,137 x 22
##    Classification  Year Period `Period Desc.` `Aggregate Level`
##             <chr> <int>  <int>          <int>             <int>
##  1             H0  1994   1994           1994                 2
##  2             H0  1994   1994           1994                 2
##  3             H0  1994   1994           1994                 2
##  4             H0  1994   1994           1994                 2
##  5             H0  1994   1994           1994                 2
##  6             H0  1994   1994           1994                 2
##  7             H0  1994   1994           1994                 2
##  8             H0  1994   1994           1994                 2
##  9             H0  1994   1994           1994                 2
## 10             H0  1994   1994           1994                 2
## # ... with 465,127 more rows, and 17 more variables: `Is Leaf Code` <int>,
## #   `Trade Flow Code` <int>, `Trade Flow` <chr>, `Reporter Code` <int>,
## #   Reporter <chr>, `Reporter ISO` <chr>, `Partner Code` <int>,
## #   Partner <chr>, `Partner ISO` <chr>, `Commodity Code` <chr>,
## #   Commodity <chr>, `Qty Unit Code` <int>, `Qty Unit` <chr>, Qty <dbl>,
## #   `Netweight (kg)` <dbl>, `Trade Value (US$)` <dbl>, Flag <int>

To speed up the reading process, in the package tradeAnalysis there is a function readCTCSV() which reads the bulk downloaded COMTRADE data file and converts all columns into a readable format, as well as removing all irrelevant columns and formats.

To install the tradeAnalysis package:

install.packages("devtools") # Run if needed
devtools::install_github("EBukin/tradeAnalysis-pack")

library(tradeAnalysis)

And to read the file using the new readCTCSV() function:

ctData_2 <- readCTCSV(file)

Using tradeAnalysis the file is slightly lighter and loads faster. This is especially important when loaded files are 4 Gb or more in volume.

ctData_2

## # A tibble: 465,137 x 12
##    Classification  Year Period Trade.Flow.Code Reporter.Code Partner.Code
##             <chr> <int>  <chr>           <int>         <int>        <int>
##  1             H0  1994   1994               1           251            0
##  2             H0  1994   1994               2           251            0
##  3             H0  1994   1994               1           251            8
##  4             H0  1994   1994               1           251           12
##  5             H0  1994   1994               2           251           12
##  6             H0  1994   1994               1           251           20
##  7             H0  1994   1994               2           251           20
##  8             H0  1994   1994               1           251           32
##  9             H0  1994   1994               2           251           32
## 10             H0  1994   1994               1           251           36
## # ... with 465,127 more rows, and 6 more variables: Commodity.Code <chr>,
## #   Qty.Unit.Code <int>, Qty <dbl>, Netweight..kg. <dbl>,
## #   Trade.Value..US.. <dbl>, Flag <int>

Unzipping files using R

To load bulk CT data files into R directly from the .zip archive we need to use the tradeAnalysis package. In the tradeAnalysis library we use the function readCTZIP(). To learn more about the syntax of the readCTZIP()function run ?readCTZIP, to see the source code run readCTZIP, or select readCTZIP with the mouse and press F2.

ctData_3 <-
  readCTZIP(folder = "01-data-import/",
            file = "type-C_r-804_ps-2014_freq-A_px-HS_pub-20150317_fmt-csv_ex-20150609.zip")
ctData_3

## # A tibble: 193,015 x 12
##    Classification  Year Period Trade.Flow.Code Reporter.Code Partner.Code
##             <chr> <int>  <chr>           <int>         <int>        <int>
##  1             H4  2014   2014               1           804            0
##  2             H4  2014   2014               2           804            0
##  3             H4  2014   2014               2           804           31
##  4             H4  2014   2014               1           804           40
##  5             H4  2014   2014               2           804           51
##  6             H4  2014   2014               1           804           84
##  7             H4  2014   2014               1           804           90
##  8             H4  2014   2014               2           804          112
##  9             H4  2014   2014               1           804          124
## 10             H4  2014   2014               2           804          124
## # ... with 193,005 more rows, and 6 more variables: Commodity.Code <chr>,
## #   Qty.Unit.Code <int>, Qty <dbl>, Netweight..kg. <dbl>,
## #   Trade.Value..US.. <dbl>, Flag <int>

Attention: if the .zip file we want to open is placed in the home directory, specify folder = "." in the function.

CT data availability and API commands for data download

In R it is possible to see all the data available in the COMTRADE bulk download data files. This is achieved by using the COMTRADE API interface (https://comtrade.un.org/data/doc/api/bulk/#DataRequests).

Firstly, we run the command “http://comtrade.un.org/api//refs/da/bulk?parameters” in the browser address bar. After a certain loading time due to the large volume of the data, we will be returned the JSON object with thousands of rows of text in the browser. This Object can be parsed in R into a data frame showing all data available in COMTRADE and all query commands needed to download each data file.

To load this object into R we can do the following, using the package jsonlite:

install.packages("jsonlite")

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

availableCtData_1 <- jsonlite::fromJSON("http://comtrade.un.org/api/refs/da/bulk?parameters") %>% tbl_df()
availableCtData_1

## # A tibble: 51,288 x 10
##                                                                        name
##  *                                                                    <chr>
##  1 type-C_r-ALL_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170825.zip
##  2 type-C_r-ALL_ps-201705_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170825.zip
##  3 type-C_r-ALL_ps-201704_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170825.zip
##  4 type-C_r-124_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
##  5 type-C_r-152_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
##  6 type-C_r-203_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
##  7 type-C_r-208_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
##  8 type-C_r-233_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
##  9  type-C_r-31_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
## 10  type-C_r-32_ps-201706_freq-M_px-HS_pub-20170818_fmt-csv_ex-20170824.zip
## # ... with 51,278 more rows, and 9 more variables: r <chr>, px <chr>,
## #   ps <chr>, type <chr>, freq <chr>, publicationDate <chr>,
## #   extractDate <chr>, filesize <int>, downloadUri <chr>

This table illustrates the reporters (‘r’ variable), classification (‘px’), period (‘ps’) of each data file, and what data files exist.

glimpse(availableCtData_1)

## Observations: 51,288
## Variables: 10
## $ name            <chr> "type-C_r-ALL_ps-201706_freq-M_px-HS_pub-20170...
## $ r               <chr> "ALL", "ALL", "ALL", "124", "152", "203", "208...
## $ px              <chr> "HS", "HS", "HS", "HS", "HS", "HS", "HS", "HS"...
## $ ps              <chr> "201706", "201705", "201704", "201706", "20170...
## $ type            <chr> "COMMODITIES", "COMMODITIES", "COMMODITIES", "...
## $ freq            <chr> "MONTHLY", "MONTHLY", "MONTHLY", "MONTHLY", "M...
## $ publicationDate <chr> "2017-08-18T00:00:00", "2017-08-18T00:00:00", ...
## $ extractDate     <chr> "2017-08-25T00:00:00", "2017-08-25T00:00:00", ...
## $ filesize        <int> 45392464, 58406789, 109458248, 2837034, 113915...
## $ downloadUri     <chr> "/api/get/bulk/C/M/201706/ALL/HS", "/api/get/b...

The variable downloadUri provides a COMTRADE query (such as “/api/get/bulk/C/A/2016/428/H0”) that can be combined into one command with http://comtrade.un.org/.

The result would look something like “http://comtrade.un.org/api/get/bulk/C/A/2016/428/H0” and will return a zipped data file if run in the browser address bar.

The same list of available bulk data in COMTRADE can be accessed using the getCTParameters() function of the tradeAnalysis package, making some of the variables look more user friendly.

library(tradeAnalysis)
availableCtData_2 <- getCTParameters()
glimpse(availableCtData_2)

## Observations: 51,288
## Variables: 12
## $ name            <chr> "type-C_r-ALL_ps-201706_freq-M_px-HS_pub-20170...
## $ r               <chr> "ALL", "ALL", "ALL", "124", "152", "203", "208...
## $ px              <chr> "HS", "HS", "HS", "HS", "HS", "HS", "HS", "HS"...
## $ ps              <chr> "201706", "201705", "201704", "201706", "20170...
## $ type            <chr> "COMMODITIES", "COMMODITIES", "COMMODITIES", "...
## $ freq            <chr> "MONTHLY", "MONTHLY", "MONTHLY", "MONTHLY", "M...
## $ publicationDate <date> 2017-08-18, 2017-08-18, 2017-08-18, 2017-08-1...
## $ extractDate     <date> 2017-08-25, 2017-08-25, 2017-08-25, 2017-08-2...
## $ filesize        <int> 45392464, 58406789, 109458248, 2837034, 113915...
## $ downloadUri     <chr> "/api/get/bulk/C/M/201706/ALL/HS", "/api/get/b...
## $ year            <chr> "2017", "2017", "2017", "2017", "2017", "2017"...
## $ month           <chr> "06", "05", "04", "06", "06", "06", "06", "06"...

Downloading CT bulk data using CT API, R and an internet connection

After checking for data availability we may want to download some of this data.

To do the download, after loading the table of available data with tradeAnalysis::getCTParameters(), we need to select the row containing the name of the country that we want to download and the destination folder where we want to save it.

See the following example.

We want to download data about all annual trade of commodities by Ukraine (reporter code 804), in 2014, in the HS â€˜As reportedâ€™ classification. To see an extensive list of reporters and their codes, run tradeAnalysis::getCTReporters()

library(tidyverse)

Let’s first create oneLine, containing all the specifications for our query (the usage of the ‘filter()’ function will be discussed with more detail in the following lesson):

oneLine <-
  filter(availableCtData_2,
         r == 804,
         type == "COMMODITIES",
         freq == "ANNUAL",
         year == 2014,
         px == "HS")

To load the data file:

downloadCTZIP(df = oneLine, toFolder = "01-data-import/", token = NA)

Explore more about the token parameter with ?downloadCTZIP.

token is an important parameter, necessary to access COMTRADE data from outside FAO headquarters. To access the data from the outside, it is first necessary to register our FAO e-mail address in the COMTRADE system and then save the token provided on the computer. Inserting the token in the function downloadCTZIP() will ensure CT data can be downloaded from outside headquarters.

To generate an authentication token from the registered IP range (i.e. from within headquarters), we first need to go to https://comtrade.un.org/data/auth/login?ReturnUrl=%2Fdata%2Fbulk and insert our email into the Sign in using: IP Authentication: field. Pressing Authenticate my IP will ensure we are registered in the system. Subsequently we need to go to https://comtrade.un.org/api/swagger/ui/index#/ and select /getSubUserToken in the Auth section. After inserting our email into the corresponding field we press “try it out”. In “Response body” the token will appear as a long string of symbols and characters. This token is to be copied to a secure place and used to access data as follows:

downloadCTZIP(df = oneLine, toFolder = "01-data-import/", token = "token_content_as_it_is_in_comtrade")