Import data

MP223 - Applied Econometrics Methods for the Social Sciences

Eduard Bukin

R setup

library(tidyverse)
library(readxl)
library(readr)
library(janitor)       

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)

Data import

General notes

Data import must be done in a reproducible way!

  • Raw data must be stored together with the project.
  • Data import and cleaning should be done with scripts.

It is tricky to load data, because, we need to interact with the file system.

  • You may use interactive Users Interface to load data once,
  • But you should use R code to reload same data again.

Important

Always save data import R code in your scripts!

Data import: more materials

readr package (1/3)

readr package (2/3) key functions

  • read_csv() - for a coma separate data in the text file

  • read_dta() - for Stata data files.

The file, which we want to read is in

[1] "/home/runner/work/_temp/renv/cache/v5/R-4.2/x86_64-pc-linux-gnu/readr/2.1.2/9c59de1357dc209868b5feb5c9f0fe2f/readr/extdata/chickens.csv"

Instead of specifying the path to this file, we use readr_example("chickens.csv").

Rows: 5
Columns: 4
$ chicken   <chr> "Foghorn Leghorn", "Chicken Little", "Ginger", "Camilla the …
$ sex       <chr> "rooster", "hen", "hen", "hen", "rooster"
$ eggs_laid <dbl> 0, 3, 12, 7, 0
$ motto     <chr> "That's a joke, ah say, that's a joke, son.", "The sky is fa…

readr package (3/3) user interface

readxl package (1/3)

readxl package (2/3) Basic usage

  • First, locate the file.
[1] "/home/runner/work/_temp/renv/cache/v5/R-4.2/x86_64-pc-linux-gnu/readxl/1.4.0/170c35f745563bb307e963bde0197e4f/readxl/extdata/datasets.xls"
  • Then, open it manually to see if it is alright.
  • Then, check what sheets re present there:
[1] "iris"     "mtcars"   "chickwts" "quakes"  
Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 181…
$ feed   <chr> "horsebean", "horsebean", "horsebean", "horsebean", "horsebean"…

readxl package (3/3) user interface

janitor package (1/2)

Helps cleaning data and variables names

  • Documentation janitor + Source code GitHub

    • Check out the landing page with examples

janitor package (2/2) key functions

  • janitor::clean_names() - Cleans names of an object (usually a data.frame).

  • janitor::row_to_names(row_number = 1) - Elevate a row to be the column names of a data.frame.

  • janitor::convert_to_date() + excel_numeric_to_date() - Convert many date and datetime formats as may be received from Microsoft Excel

  • janitor::remove_empty() - Remove empty rows and/or columns from a data.frame or matrix.

Takeaways

  • Loading data: readr::read_csv() and readxl::excel_sheets() with readxl::read_excel();

  • Renaming variables: janitor::clean_names(...), dplyr::rename();

  • Converting excel dates to R-relevant <date> variable type: janitor::convert_to_date() and janitor::excel_numeric_to_date().