Chapter 1 Introduction to R and RStudio

1.1 The R Language

R is a free, open-source programming language developed for statistical computing and graphics. Open source means that anyone can access its source code and contribute to its development. R grew out of S, a programming language designed for data analysis, and has evolved into a system used not only by computer programmers and data analysts but also by physical scientists, psychologists, journalists, and others. The first stable release, R 1.0.0, came out in 2000. At the time of writing, the latest version is R 4.1.0, released on 2021-05-18.

Among scholarly articles published in 2018 and found on Google Scholar, R was the second most frequently used data science software, after SPSS.

1.2 Install R

R is available for Windows, Mac, and Linux operating systems. To install R, go to the Comprehensive R Archive Network (CRAN) and download the version compatible with your operating system. Windows and macOS users will probably want the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the CRAN webpage.

The downloaded version includes the base R package. Often, it is necessary to install other R packages to perform an analysis. For example, in this course we will use the lavaan package. I will show how to install that package in 1.4.4.

1.3 Install RStudio

RStudio is an integrated development environment (IDE) for R. It offers a more user-friendly environment than R itself for writing, running, and organizing R code and analyses. Essentially, RStudio is an interface between the user and R (there are other interfaces for R, e.g., R Commander). It depends on R, which means that R has to be installed for RStudio to work. Any R package or function can be used in RStudio.

To install RStudio, go to the RStudio download page, and download and install the free RStudio Desktop.

Open RStudio and you will see a window with four panes.

  • R script
  • R console
  • Environment/History/Connections/Tutorial
  • Files/Plots/Packages…

You can change the appearance and layout of the panes. Go to Tools -> Global Options -> Appearance to change the appearance, and to Tools -> Global Options -> Pane Layout to rearrange the panes.

1.4 Use RStudio

There are two basic ways to run R code in RStudio: (1) type the code directly in the R console, or (2) write an R script. Code typed directly in the console is not saved as a file, so it is hard to recover after you close your R session. Writing an R script (a text file with the “.R” extension) is recommended if you would like to save the code.
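
A script is simply a text file of R commands. As a minimal sketch, the snippet below writes a two-line script to a temporary file and runs the whole file with source(). The temporary file and the object name total are made up for illustration; in practice you would create and save the script in the RStudio script pane.

```r
# Write a small script to a temporary file (hypothetical file, for illustration only)
script_path <- tempfile(fileext = ".R")
writeLines(c("total <- 2 + 2", "print(total)"), script_path)

# Run every line of the script, as if each had been typed in the console
source(script_path)
## [1] 4
```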

1.4.1 Basic operations

1.4.1.1 entering data

2 + 2 #press Cmd/Ctrl + Enter to run the line
## [1] 4
1:20 #sequence
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
print("Hello World!")
## [1] "Hello World!"

1.4.1.2 assign value

a <- 5 #use <-, not =
3 -> b #can go other way, but silly
c <- d <- e <- 2 #multiple assignments

1.4.1.3 multiple values

x <- c(5, 3, 6, 9) #c means combine or concatenate
x
## [1] 5 3 6 9
(y <- c(1, 3, 0, 10)) #surround command with parentheses to also print
## [1]  1  3  0 10

1.4.1.4 sequences

0:5 #0 through 5
## [1] 0 1 2 3 4 5
5:0 #5 through 0
## [1] 5 4 3 2 1 0
seq(5) #1 to 5
## [1] 1 2 3 4 5
seq(5, 0, by = -2) #count down by 2 
## [1] 5 3 1

1.4.1.5 math

x + y       #adds corresponding elements in x and y
## [1]  6  6  6 19
x * 2       #multiplies each element in x by 2
## [1] 10  6 12 18
2^4         #powers/exponents
## [1] 16
sqrt(36)    #square root
## [1] 6
log(100)    #natural log: base e (2.71828...)
## [1] 4.60517
log10(100)  #base 10 log
## [1] 2

1.4.2 Data types in R

1.4.2.1 numeric

(n1 <- 3)
## [1] 3
typeof(n1)
## [1] "double"
(n2 <- 2.5)
## [1] 2.5
typeof(n2)
## [1] "double"

1.4.2.2 character

(c1 <- "c")
## [1] "c"
typeof(c1)
## [1] "character"
(c2 <- "a string of text")
## [1] "a string of text"
typeof(c2)
## [1] "character"

1.4.2.3 logical

(l1 <- TRUE) #must use uppercase
## [1] TRUE
typeof(l1)
## [1] "logical"
(l2 <- F) #must use uppercase
## [1] FALSE
typeof(l2)
## [1] "logical"
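
The numeric examples above are stored as "double" even when the value is a whole number. The following sketch shows how to request an integer explicitly and how to convert between types; the object name n3 is only an illustration.

```r
n3 <- 3L               # the L suffix stores 3 as an integer
typeof(n3)
## [1] "integer"
is.numeric(n3)         # integers and doubles both count as numeric
## [1] TRUE
as.numeric("2.5") + 1  # convert text to a number before doing math
## [1] 3.5
as.character(10)       # convert a number to text
## [1] "10"
```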

1.4.3 Data structures in R

1.4.3.1 vector

A vector has elements of the same data type.

(v1 <- c(1, 2, 3, 4, 5))
## [1] 1 2 3 4 5
is.vector(v1)
## [1] TRUE
(v2 <- c("a", "b", "c"))
## [1] "a" "b" "c"
is.vector(v2)
## [1] TRUE
(v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE))
## [1]  TRUE  TRUE FALSE FALSE  TRUE
is.vector(v3)
## [1] TRUE
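
Because a vector can hold only one data type, R silently converts mixed input to a common type. A short sketch (the names v4 and v5 are illustrative):

```r
v4 <- c(1, "a", TRUE)   # mixed input: everything becomes character
v4
## [1] "1"    "a"    "TRUE"
typeof(v4)
## [1] "character"
v5 <- c(1, TRUE, FALSE) # numbers and logicals: TRUE becomes 1, FALSE becomes 0
v5
## [1] 1 1 0
```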

1.4.3.2 matrix

A matrix is two-dimensional and has elements of the same type.

(m1 <- matrix(c(T, T, F, F, T, F), nrow = 2))
##      [,1]  [,2]  [,3]
## [1,] TRUE FALSE  TRUE
## [2,] TRUE FALSE FALSE
(m2 <- matrix(c("a", "b", 
               "c", "d"), 
               nrow = 2,
               byrow = T)) #default is FALSE
##      [,1] [,2]
## [1,] "a"  "b" 
## [2,] "c"  "d"
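
To make the effect of byrow easier to see, here is a small sketch that fills the same six numbers both ways and then picks out a single element. The names m3 and m4 are illustrative.

```r
m3 <- matrix(1:6, nrow = 2)               # filled column by column (the default)
m4 <- matrix(1:6, nrow = 2, byrow = TRUE) # filled row by row
m3
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m4
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
dim(m3)   # number of rows, number of columns
## [1] 2 3
m4[2, 3]  # element in row 2, column 3
## [1] 6
```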

1.4.3.3 array

An array can have more than two dimensions.

(a1 <- array(c( 1:24), c(4, 3, 2))) #four rows, three columns, and two tables
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
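
Elements of an array are selected with one index per dimension. A brief sketch, rebuilding the same 4 x 3 x 2 array under the illustrative name a2:

```r
a2 <- array(1:24, dim = c(4, 3, 2))
dim(a2)
## [1] 4 3 2
a2[2, 3, 1]     # row 2, column 3, first table
## [1] 10
a2[, , 2][1, ]  # first row of the second table
## [1] 13 17 21
```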

1.4.3.4 data frame

A data frame combines vectors of the same length.

Numeric   <- c(1, 2, 3)
Character <- c("a", "b", "c")
Logical   <- c(T, F, T)
(df1 <- cbind(Numeric, Character, Logical))  #cbind() returns a matrix, so all values are coerced to one type (here, character)
##      Numeric Character Logical
## [1,] "1"     "a"       "TRUE" 
## [2,] "2"     "b"       "FALSE"
## [3,] "3"     "c"       "TRUE"
(df2 <- as.data.frame(cbind(Numeric, Character, Logical))) #makes a data frame, but the columns keep the type coerced by cbind(), not the original three types
##   Numeric Character Logical
## 1       1         a    TRUE
## 2       2         b   FALSE
## 3       3         c    TRUE
(df3 <- data.frame(Numeric, Character, Logical)) #the right way!
##   Numeric Character Logical
## 1       1         a    TRUE
## 2       2         b   FALSE
## 3       3         c    TRUE
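
The two data frames print identically, so it helps to check the column types directly. The sketch below repeats the construction with illustrative names (via_cbind, direct); the comments describe the behavior in R 4.0 and later, where strings are not converted to factors by default.

```r
Numeric   <- c(1, 2, 3)
Character <- c("a", "b", "c")
Logical   <- c(TRUE, FALSE, TRUE)

via_cbind <- as.data.frame(cbind(Numeric, Character, Logical))
direct    <- data.frame(Numeric, Character, Logical)

sapply(via_cbind, class)  # all three columns are character
sapply(direct, class)     # numeric, character, logical

mean(direct$Numeric)      # works because the column is still numeric
## [1] 2
```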

1.4.3.5 list

A list can hold elements of any type and length, including other lists.

o1 <- c(1, 2, 3)
o2 <- c("a", "b", "c", "d")
o3 <- c(T, F, T, T, F)
(list1 <- list(o1, o2, o3))
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE
(list2 <- list(o1, o2, o3, list1))  #lists within lists!
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE
## 
## [[4]]
## [[4]][[1]]
## [1] 1 2 3
## 
## [[4]][[2]]
## [1] "a" "b" "c" "d"
## 
## [[4]][[3]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE
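
List elements can also be named, which makes them easier to retrieve. A small sketch with made-up element names:

```r
list3 <- list(numbers = c(1, 2, 3), letters = c("a", "b"), flag = TRUE)
list3$numbers        # access an element by name
## [1] 1 2 3
list3[["letters"]]   # the same idea with double brackets
## [1] "a" "b"
list3[[1]][2]        # second value of the first element
## [1] 2
length(list3)        # number of elements in the list
## [1] 3
```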

1.4.3.6 a few notes on R data structures

  • Matrices and arrays are vectors with the dim attribute.

  • Factors are integer vectors with the class and levels attributes.
    A factor is a vector that can contain only predefined values.

  • Data frames are built on top of lists and therefore have the list type.

  • Vectors are the most important family of data types in R. Lists are sometimes called generic vectors, and regular vectors are sometimes called atomic vectors.

  • Hadley Wickham gives a good description of the relationships between the different types
    in his Advanced R book.
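
These relationships can be checked in the console. The following is a minimal sketch with illustrative object names (v, f, d):

```r
v <- 1:6
dim(v) <- c(2, 3)   # adding a dim attribute turns the vector into a matrix
is.matrix(v)
## [1] TRUE

f <- factor(c("low", "high", "low"))
typeof(f)           # stored as integers
## [1] "integer"
levels(f)           # the predefined values, sorted alphabetically by default
## [1] "high" "low"
as.integer(f)       # the integer codes behind the labels
## [1] 2 1 2

d <- data.frame(x = 1:2)
typeof(d)           # a data frame is a list underneath
## [1] "list"
```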

1.4.4 R Packages

When you install R, the base package is installed. The datasets package, which contains example datasets, is also ready to use. Type plot(iris) in the RStudio console, or type it in an R script file and run it. The iris flower data set is a multivariate data set introduced by the British statistician Ronald Fisher.

plot(iris)

For this class, we will use the lavaan package. Type install.packages("lavaan") to install the package. You only need to install the package once on your computer.

To use the package, type either library(lavaan) or require(lavaan). You need to load the package every time you start a new R session.
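
The two loading functions behave differently when a package is missing. The sketch below uses a made-up package name (assumed not to be installed) so that it runs without downloading anything.

```r
pkg <- "notARealPackage123"  # hypothetical name, assumed not to be installed

# require() gives a warning and returns FALSE when the package is missing
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
loaded
## [1] FALSE

# library() stops with an error instead, so a missing package is noticed right away
result <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "library() failed"
)
result
## [1] "library() failed"
```

For that reason library() is usually the safer choice at the top of a script.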

1.4.5 RStudio Projects

An RStudio project provides a base folder for relative file paths. When you create an RStudio project, a folder is created, and file paths are interpreted relative to this folder. You can move or copy the folder to another place without having to change file paths. For example, you can create a data folder inside the RStudio project for this class and keep all data in that folder. You can then import a dataset into R using a path relative to the project folder.
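
As a rough sketch of what a relative path looks like in code, assuming a data folder inside the project that contains a file named example.sav:

```r
getwd()  # the working directory; the project folder when a project is open

# Build a path relative to the working directory
data_path <- file.path("data", "example.sav")
data_path
## [1] "data/example.sav"

file.exists(data_path)  # TRUE only if the file is really in the data folder
```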

1.4.6 Import and Export Data

There are several packages (e.g., foreign and haven) that can help import and export data in different statistical formats (e.g., SPSS, SAS, Stata, Excel). The one that I particularly like is the rio package (https://cran.r-project.org/web/packages/rio/index.html). Its import() and export() functions work with many different file formats. Install and load rio.

install.packages("rio")
library(rio)

Import an SPSS data file from the data folder within the RStudio project.

mydata <- import("data/example.sav")
head(mydata) # view first 6 observations
##   IDSTUD ITSEX BSMMAT01  BSBGHER
## 1  10301     1 514.4249 10.91822
## 2  10302     2 587.6541 11.62212
## 3  10303     2 582.5530  8.97690
## 4  10304     2 507.9202 10.91822
## 5  10305     2 534.9643 11.62212
## 6  10306     2 465.7001 10.91822
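
Exporting works in the opposite direction. The sketch below is a base R alternative rather than rio: it writes a small made-up data frame to a temporary CSV file with write.csv() and reads it back with read.csv(), so it runs without any extra package or data file.

```r
toy <- data.frame(id = 1:3, score = c(514.4, 587.7, 582.6))  # made-up values

csv_path <- tempfile(fileext = ".csv")       # temporary file, for illustration only
write.csv(toy, csv_path, row.names = FALSE)  # export
toy_back <- read.csv(csv_path)               # import

all.equal(toy, toy_back)  # the round trip returns the same data
## [1] TRUE
```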

1.4.6.1 a few functions for checking data

str(mydata) 
## 'data.frame':    10221 obs. of  4 variables:
##  $ IDSTUD  : num  10301 10302 10303 10304 10305 ...
##   ..- attr(*, "label")= chr "Student ID"
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num 1e+08
##   .. ..- attr(*, "names")= chr "Omitted or invalid"
##  $ ITSEX   : num  1 2 2 2 2 2 1 1 2 1 ...
##   ..- attr(*, "label")= chr "Sex of Students"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "labels")= Named num [1:3] 1 2 9
##   .. ..- attr(*, "names")= chr [1:3] "Female" "Male" "Omitted or invalid"
##  $ BSMMAT01: num  514 588 583 508 535 ...
##   ..- attr(*, "label")= chr "1ST PLAUSIBLE VALUE MATHEMATICS"
##   ..- attr(*, "format.spss")= chr "F6.2"
##   ..- attr(*, "labels")= Named num 999
##   .. ..- attr(*, "names")= chr "Omitted or invalid"
##  $ BSBGHER : num  10.92 11.62 8.98 10.92 11.62 ...
##   ..- attr(*, "label")= chr "Home Educational Resources/SCL"
##   ..- attr(*, "format.spss")= chr "F12.5"
##   ..- attr(*, "display_width")= int 12
##   ..- attr(*, "labels")= Named num 1e+06
##   .. ..- attr(*, "names")= chr "Omitted or invalid"
typeof(mydata)
## [1] "list"
class(mydata)
## [1] "data.frame"
dim(mydata) # alternatively, use `nrow()` and `ncol()`
## [1] 10221     4
length(mydata) # number of variables in a data frame; what `length()` returns depends on the data structure
## [1] 4

1.4.6.2 several functions for basic data management and descriptive statistics

names(mydata) # list variable names
## [1] "IDSTUD"   "ITSEX"    "BSMMAT01" "BSBGHER"
names(mydata)[names(mydata) == "ITSEX"] <- "gender" #rename a variable
names(mydata)[names(mydata) == "BSMMAT01"] <- "math" #rename another variable
var <- c("gender", "math", "BSBGHER")
dat1 <- mydata[, var] #subset data
dat2 <- na.omit(dat1) #listwise deletion
dat1$gender_c <- factor(dat1$gender) #create a categorical variable for `gender`
levels(dat1$gender_c) <- c("f", "m") #change levels for `gender` variable
summary(dat1) 
##      gender           math          BSBGHER       gender_c   
##  Min.   :1.000   Min.   :248.2   Min.   : 4.232   f   :5119  
##  1st Qu.:1.000   1st Qu.:459.0   1st Qu.: 9.623   m   :5098  
##  Median :1.000   Median :518.9   Median :10.918   NA's:   4  
##  Mean   :1.499   Mean   :515.8   Mean   :10.793              
##  3rd Qu.:2.000   3rd Qu.:574.7   3rd Qu.:11.622              
##  Max.   :2.000   Max.   :762.2   Max.   :13.884              
##  NA's   :4                       NA's   :103
table(dat1$gender_c) #compute frequencies for categorical variables
## 
##    f    m 
## 5119 5098
psych::describe(dat1) #compute descriptives for quantitative variables. The `psych` package must be installed; with the `psych::` prefix it does not need to be loaded
##           vars     n   mean    sd median trimmed   mad    min    max  range
## gender       1 10217   1.50  0.50   1.00    1.50  0.00   1.00   2.00   1.00
## math         2 10221 515.82 82.36 518.93  516.81 86.02 248.22 762.23 514.01
## BSBGHER      3 10118  10.79  1.67  10.92   10.77  1.83   4.23  13.88   9.65
## gender_c*    4 10217   1.50  0.50   1.00    1.50  0.00   1.00   2.00   1.00
##            skew kurtosis   se
## gender     0.00    -2.00 0.00
## math      -0.10    -0.28 0.81
## BSBGHER   -0.07    -0.03 0.02
## gender_c*  0.00    -2.00 0.00
attach(dat1) # saves some typing. Alternatively, use `math <- dat1$math` to create a separate vector.
mean(math, na.rm = TRUE)
## [1] 515.8243
median(math, na.rm = TRUE)
## [1] 518.9269
var(math, na.rm = TRUE)
## [1] 6782.527
sd(math, na.rm = TRUE)
## [1] 82.3561
quantile(math, na.rm = TRUE)
##       0%      25%      50%      75%     100% 
## 248.2235 458.9607 518.9269 574.6662 762.2316
detach(dat1)
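
Another option that avoids attach() and detach() altogether is with(), which evaluates an expression inside a data frame. The sketch uses a small made-up data frame (toy) and also shows what listwise deletion with na.omit() does.

```r
toy <- data.frame(math = c(500, 520, NA, 480), gender = c(1, 2, 2, NA))

with(toy, mean(math, na.rm = TRUE))  # math is looked up inside toy
## [1] 500

nrow(na.omit(toy))  # listwise deletion keeps only rows with no missing values
## [1] 2
```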

1.4.6.3 Pearson’s product moment correlation

Correlation between two variables, hypothesis test, and confidence interval.

cor.test(dat1$BSBGHER, dat1$math)
## 
##  Pearson's product-moment correlation
## 
## data:  dat1$BSBGHER and dat1$math
## t = 42.923, df = 10116, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3759021 0.4088708
## sample estimates:
##       cor 
## 0.3925126
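
The result of cor.test() can be saved as an object, and its pieces can be pulled out by name. This sketch uses the mtcars data set that ships with R instead of the course data, so it runs as is.

```r
ct <- cor.test(mtcars$mpg, mtcars$wt)  # mtcars is a built-in example data set

ct$estimate   # correlation coefficient
ct$conf.int   # 95 percent confidence interval
ct$p.value    # p-value of the test
ct$parameter  # degrees of freedom (n - 2)
```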

All pairwise correlations.

var <- c("gender", "math", "BSBGHER") 
cor(dat1[,var], use = "pairwise.complete.obs") #for numeric variables only; "gender" is still a numeric variable!
##              gender       math    BSBGHER
## gender   1.00000000 0.01305597 -0.0281559
## math     0.01305597 1.00000000  0.3925126
## BSBGHER -0.02815590 0.39251256  1.0000000
round(cor(dat1[,var], use = "pairwise.complete.obs"), 2) #round to 2 decimal places
##         gender math BSBGHER
## gender    1.00 0.01   -0.03
## math      0.01 1.00    0.39
## BSBGHER  -0.03 0.39    1.00
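
The use argument controls how missing values are handled. With "pairwise.complete.obs", each correlation uses all rows that are complete for that pair of variables; with "complete.obs", only rows complete on every variable are used. A small made-up data frame shows that the two can differ:

```r
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, 8, 10),
  c = c(NA, 1, 3, 2, 5)
)

r_pairwise <- cor(toy, use = "pairwise.complete.obs")
r_complete <- cor(toy, use = "complete.obs")

r_pairwise["b", "c"]  # based on rows 2 to 5 (complete for b and c)
r_complete["b", "c"]  # based on rows 2 to 4 (complete for a, b, and c)
## [1] 0.5
```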

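To get correlations together with their significance levels, use rcorr() from the Hmisc package, which needs to be installed and loaded. rcorr() expects a matrix, so the data frame has to be coerced to a matrix first. The output contains the correlation matrix, the pairwise sample sizes, and the p-values.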

install.packages("Hmisc")
library(Hmisc)
df <- as.matrix(dat1[,var]) # numeric variables only
rcorr(df)
##         gender math BSBGHER
## gender    1.00 0.01   -0.03
## math      0.01 1.00    0.39
## BSBGHER  -0.03 0.39    1.00
## 
## n
##         gender  math BSBGHER
## gender   10217 10217   10117
## math     10217 10221   10118
## BSBGHER  10117 10118   10118
## 
## P
##         gender math   BSBGHER
## gender         0.1870 0.0046 
## math    0.1870        0.0000 
## BSBGHER 0.0046 0.0000

1.5 Mplus

Mplus is a latent variable modeling program with a wide variety of analysis capabilities. There is a free Mplus demo version (with certain limitations) available for download. I find the Mplus User’s Guide particularly helpful.