Chapter 1 Introduction to R and RStudio
1.1 The R Language
R is a free, open-source programming language developed for statistical computing and graphics. Being open source means that everyone can access its source code and contribute to its development. R grew out of S, a programming language designed for data analysis, and has evolved into a system used not only by computer programmers and data analysts but also by physical scientists, psychologists, journalists, and others. The first stable version, R 1.0.0, was released in 2000. At the time of writing, the latest version is R 4.1.0, released on 2021-05-18.
Among scholarly articles published in 2018 and indexed by Google Scholar, R was the second most frequently used data science software, after SPSS.
1.2 Install R
R is available for Windows, macOS, and Linux. To install R, go to the Comprehensive R Archive Network (CRAN) and download the version compatible with your operating system. Windows and macOS users will probably want the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the CRAN webpage.
The version downloaded includes the base R package. Often, it is necessary to install other R packages to perform an analysis. For example, for this course, we will use the lavaan package. I will show how to install that package in Section 1.4.4.
1.3 Install RStudio
RStudio is an integrated development environment (IDE) for R. It provides an editor, a console, and other tools for writing and running R code, and it is easier to use than R's own interface. Essentially, RStudio is an interface between the user and R (there are other interfaces for R, e.g., R Commander). It depends on R, so R has to be installed before RStudio can run any R code. Any R package or function can be used in RStudio.
To install RStudio, go to the RStudio download page, and download and install the free RStudio Desktop.
Open RStudio and you will see a window with four panes:
- R script
- R console
- Environment/History/Connections/Tutorial
- Files/Plots/Packages…
You can change the appearance and layout of the panes. Go to Tools -> Global Options -> Appearance to change the appearance, and to Tools -> Global Options -> Pane Layout to rearrange the panes.
1.4 Use RStudio
There are two basic ways to run R code in RStudio: (1) type your code directly in the R console, or (2) write an R script. If you type your code directly in the R console, the code will no longer be accessible after you close your R session. Writing an R script is recommended if you would like to save the code (in a file with the “.R” extension).
1.4.1 Basic operations
1.4.1.1 entering data
2 + 2 #press Cmd/Ctrl + Enter to run the current line
## [1] 4
1:20 #sequence
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
print("Hello World!")
## [1] "Hello World!"
1.4.1.2 assign value
a <- 5 #use <-, not =
3 -> b #can go other way, but silly
c <- d <- e <- 2 #multiple assignments
1.4.1.3 multiple values
x <- c(5, 3, 6, 9) #c means combine or concatenate
x
## [1] 5 3 6 9
(y <- c(1, 3, 0, 10)) #surround command with parentheses to also print
## [1] 1 3 0 10
1.4.2 Data types in R
1.4.2.1 numeric
(n1 <- 3)
## [1] 3
typeof(n1)
## [1] "double"
(n2 <- 2.5)
## [1] 2.5
typeof(n2)
## [1] "double"
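Numeric values are only one of R's basic data types. As a minimal sketch, the lines below check a few other common types with typeof(); the object names n3, s1, and b1 are made up for illustration.

```r
# Other basic (atomic) data types, checked with typeof()
n3 <- 3L        # the L suffix creates an integer rather than a double
s1 <- "hello"   # a character string
b1 <- TRUE      # a logical value

typeof(n3)      # "integer"
typeof(s1)      # "character"
typeof(b1)      # "logical"

# Integers and doubles both count as numeric
is.numeric(n3)  # TRUE
is.numeric(s1)  # FALSE
```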
1.4.3 Data structures in R
1.4.3.1 vector
A vector has elements of the same data type.
(v1 <- c(1, 2, 3, 4, 5))
## [1] 1 2 3 4 5
is.vector(v1)
## [1] TRUE
(v2 <- c("a", "b", "c"))
## [1] "a" "b" "c"
is.vector(v2)
## [1] TRUE
(v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE))
## [1] TRUE TRUE FALSE FALSE TRUE
is.vector(v3)
## [1] TRUE
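Because all elements of a vector must share one data type, R silently converts (coerces) mixed input to a single type. The short sketch below illustrates this; v4 and v5 are hypothetical names used only here.

```r
# Mixing types in c() coerces everything to the most flexible type present
v4 <- c(1, "a", TRUE)
v4              # "1" "a" "TRUE"
typeof(v4)      # "character"

# Logical values become 1 (TRUE) and 0 (FALSE) when combined with numbers
v5 <- c(1, TRUE, FALSE)
v5              # 1 1 0
typeof(v5)      # "double"
```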
1.4.3.2 matrix
A matrix is two-dimensional and has elements of the same type.
(m1 <- matrix(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), nrow = 2))
##      [,1]  [,2]  [,3]
## [1,] TRUE FALSE  TRUE
## [2,] TRUE FALSE FALSE
(m2 <- matrix(c("a", "b",
                "c", "d"),
              nrow = 2,
              byrow = TRUE)) #default is FALSE
##      [,1] [,2]
## [1,] "a"  "b"
## [2,] "c"  "d"
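As a small illustration of working with a matrix once it exists, the sketch below builds a numeric matrix (the name m3 is hypothetical) and pulls out its dimensions, a single cell, and a whole column.

```r
# Values are filled column by column unless byrow = TRUE
m3 <- matrix(1:6, nrow = 2)

dim(m3)      # 2 3: two rows, three columns
m3[2, 3]     # row 2, column 3: 6
m3[, 1]      # the whole first column: 1 2
dim(t(m3))   # t() transposes, giving 3 rows and 2 columns
```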
1.4.3.3 array
An array can have more than two dimensions.
(a1 <- array(c(1:24), c(4, 3, 2))) #four rows, three columns, and two tables
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
1.4.3.4 data frame
A data frame combines vectors of the same length.
Numeric <- c(1, 2, 3)
Character <- c("a", "b", "c")
Logical <- c(TRUE, FALSE, TRUE)
(df1 <- cbind(Numeric, Character, Logical)) #a matrix, not a data frame: all values are coerced to character
##      Numeric Character Logical
## [1,] "1"     "a"       "TRUE"
## [2,] "2"     "b"       "FALSE"
## [3,] "3"     "c"       "TRUE"
(df2 <- as.data.frame(cbind(Numeric, Character, Logical))) #makes a data frame, but the columns keep the coerced type
##   Numeric Character Logical
## 1       1         a    TRUE
## 2       2         b   FALSE
## 3       3         c    TRUE
(df3 <- data.frame(Numeric, Character, Logical)) #the right way! each column keeps its own data type
##   Numeric Character Logical
## 1       1         a    TRUE
## 2       2         b   FALSE
## 3       3         c    TRUE
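The printed data frames look identical, so it helps to inspect the column types directly. The sketch below rebuilds the two versions under the hypothetical names via_cbind and direct and compares the class of each column. The exact class produced by the cbind() route depends on the R version (character in R 4.0 and later, factor in earlier versions), so no output is shown for it.

```r
Numeric <- c(1, 2, 3)
Character <- c("a", "b", "c")
Logical <- c(TRUE, FALSE, TRUE)

# Route 1: cbind() first builds a character matrix, so the types are already lost
via_cbind <- as.data.frame(cbind(Numeric, Character, Logical))

# Route 2: data.frame() keeps each vector's own type
direct <- data.frame(Numeric, Character, Logical)

sapply(via_cbind, class)   # no column is numeric or logical any more
sapply(direct, class)      # Numeric is "numeric", Logical is "logical"
```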
1.4.3.5 list
A list can hold elements of any type and length, including other lists.
o1 <- c(1, 2, 3)
o2 <- c("a", "b", "c", "d")
o3 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
(list1 <- list(o1, o2, o3))
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "a" "b" "c" "d"
##
## [[3]]
## [1] TRUE FALSE TRUE TRUE FALSE
(list2 <- list(o1, o2, o3, list1)) #lists within lists!
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "a" "b" "c" "d"
##
## [[3]]
## [1] TRUE FALSE TRUE TRUE FALSE
##
## [[4]]
## [[4]][[1]]
## [1] 1 2 3
##
## [[4]][[2]]
## [1] "a" "b" "c" "d"
##
## [[4]][[3]]
## [1] TRUE FALSE TRUE TRUE FALSE
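To get elements back out of a list, double brackets return the element itself, while single brackets return a shorter list. The sketch below uses a hypothetical named list, lst, to show the difference.

```r
# A small named list (names are optional but make access easier)
lst <- list(numbers = c(1, 2, 3), letters = c("a", "b"), flags = c(TRUE, FALSE))

lst[[1]]          # the first element itself: 1 2 3
lst$letters       # access by name: "a" "b"
class(lst[[1]])   # "numeric"
class(lst[1])     # "list": single brackets keep the list wrapper
length(lst)       # 3 elements
```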
1.4.3.6 a few notes on R data structures
- Matrices and arrays are vectors with the dim attribute.
- Factors are integer vectors with the class and levels attributes. A factor is a vector that can contain only predefined values.
- Data frames are built on top of lists and therefore have the list type.
- Vectors are the most important family of data types in R. Lists are sometimes called generic vectors, and regular vectors are sometimes called atomic vectors.
Hadley Wickham gives a good description of the relationships between the different types in his Advanced R book.
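These relationships can be checked directly in the console. The following is a minimal sketch with made-up objects v, f, and d.

```r
# A vector becomes a matrix once it has a dim attribute
v <- 1:6
dim(v) <- c(2, 3)
is.matrix(v)      # TRUE

# A factor is an integer vector with levels and class attributes
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
typeof(f)         # "integer"
attributes(f)     # shows $levels and $class
as.integer(f)     # 1 2 1: the underlying integer codes

# A data frame is a list of equal-length columns
d <- data.frame(x = 1:2, y = c("a", "b"))
typeof(d)         # "list"
is.list(d)        # TRUE
```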
1.4.4 R Packages
When you install R, the base package is installed. The datasets package with example datasets is also ready to use. Type plot(iris) in the RStudio console, or type it in an R script file and run it. The iris flower data set is a multivariate data set introduced by the British statistician Ronald Fisher.
plot(iris)
For this class, we will use the lavaan package. Type install.packages("lavaan") to install the package. You only need to install the package once on your computer.
To use the package, type either library(lavaan) or require(lavaan). You need to load the package every time you start a new R session.
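One way to combine the two steps is to check whether a package is installed before attaching it. The sketch below is only an illustration: is_installed is a hypothetical helper name, and the package name aPackageThatDoesNotExist is deliberately fake to show how require() behaves when a package is missing.

```r
# Hypothetical helper: TRUE if a package is installed (does not attach it)
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")   # TRUE: the stats package ships with R

# Attach lavaan if it is available; otherwise print a reminder to install it
if (is_installed("lavaan")) {
  library(lavaan)
} else {
  message("lavaan is not installed; run install.packages(\"lavaan\") once")
}

# library() stops with an error when a package is missing,
# whereas require() gives a warning and returns FALSE
ok <- suppressWarnings(
  require("aPackageThatDoesNotExist", character.only = TRUE, quietly = TRUE)
)
ok   # FALSE
```

Because library() fails loudly, it is usually the safer choice at the top of a script; require() is mainly useful when you want to test the result yourself.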
1.4.5 RStudio Projects
An RStudio project gives your work a fixed base folder. When you create an RStudio project, a folder is created, and RStudio sets the working directory to that folder whenever the project is opened, so all file paths can be written relative to it. You can copy or move this folder to another place without having to change file paths. For example, you can create a data folder inside the RStudio project for this class, keep all data in that folder, and import datasets into R from there.
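As a small sketch of what a relative path looks like in practice, the lines below build the path to a file in the data folder. The file name example.sav matches the one used in the next section; whether the file exists depends on your own project.

```r
# Inside an RStudio project, the working directory is the project folder
getwd()

# Build a path relative to the project folder
data_path <- file.path("data", "example.sav")
data_path                # "data/example.sav"

# Check that the file is really there before importing it
file.exists(data_path)
```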
1.4.6 Import and Export Data
There are several packages (e.g., foreign and haven) that can help import and export data in different statistical formats (e.g., SPSS, SAS, Stata, Excel). The one that I particularly like is the rio package (https://cran.r-project.org/web/packages/rio/index.html). Its import() and export() functions work with many different file formats. Install and load rio.
install.packages("rio")
library(rio)
Import an SPSS data file from the data folder within the RStudio project.
mydata <- import("data/example.sav")
head(mydata) # view first 6 observations
## IDSTUD ITSEX BSMMAT01 BSBGHER
## 1 10301 1 514.4249 10.91822
## 2 10302 2 587.6541 11.62212
## 3 10303 2 582.5530 8.97690
## 4 10304 2 507.9202 10.91822
## 5 10305 2 534.9643 11.62212
## 6 10306 2 465.7001 10.91822
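Exporting follows the same pattern as importing. The sketch below shows a write-then-read round trip using only base R functions (write.csv() and read.csv()) so that it runs without extra packages; with rio loaded, export() and import() would play the same two roles. The data frame toy and the temporary file are made up for illustration.

```r
# A small made-up data set
toy <- data.frame(id = 1:3, score = c(10.5, 12.0, 9.25))

# Write it to a temporary CSV file, then read it back in
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)
back <- read.csv(out_file)

# The data should survive the round trip unchanged
all.equal(toy, back)
```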
1.4.6.1 a few functions for checking data
str(mydata)
## 'data.frame': 10221 obs. of 4 variables:
## $ IDSTUD : num 10301 10302 10303 10304 10305 ...
## ..- attr(*, "label")= chr "Student ID"
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "labels")= Named num 1e+08
## .. ..- attr(*, "names")= chr "Omitted or invalid"
## $ ITSEX : num 1 2 2 2 2 2 1 1 2 1 ...
## ..- attr(*, "label")= chr "Sex of Students"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "labels")= Named num [1:3] 1 2 9
## .. ..- attr(*, "names")= chr [1:3] "Female" "Male" "Omitted or invalid"
## $ BSMMAT01: num 514 588 583 508 535 ...
## ..- attr(*, "label")= chr "1ST PLAUSIBLE VALUE MATHEMATICS"
## ..- attr(*, "format.spss")= chr "F6.2"
## ..- attr(*, "labels")= Named num 999
## .. ..- attr(*, "names")= chr "Omitted or invalid"
## $ BSBGHER : num 10.92 11.62 8.98 10.92 11.62 ...
## ..- attr(*, "label")= chr "Home Educational Resources/SCL"
## ..- attr(*, "format.spss")= chr "F12.5"
## ..- attr(*, "display_width")= int 12
## ..- attr(*, "labels")= Named num 1e+06
## .. ..- attr(*, "names")= chr "Omitted or invalid"
typeof(mydata)
## [1] "list"
class(mydata)
## [1] "data.frame"
dim(mydata) # alternatively, use `nrow()` and `ncol()`
## [1] 10221 4
length(mydata) # number of variables in a data frame; what `length()` returns depends on the data structure
## [1] 4
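To see how these functions behave on different data structures, here is a brief sketch with three made-up objects: a vector v, a matrix m, and a data frame d.

```r
v <- c(10, 20, 30)
m <- matrix(1:6, nrow = 2)
d <- data.frame(a = 1:3, b = c("x", "y", "z"))

length(v)   # 3: number of elements
length(m)   # 6: number of cells
length(d)   # 2: number of columns (variables)

nrow(d)     # 3
ncol(d)     # 2
dim(v)      # NULL: a plain vector has no dim attribute
```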
1.4.6.2 several functions for basic data management and descriptive statistics
names(mydata) # list variable names
## [1] "IDSTUD" "ITSEX" "BSMMAT01" "BSBGHER"
names(mydata)[names(mydata) == "ITSEX"] <- "gender" #rename a variable
names(mydata)[names(mydata) == "BSMMAT01"] <- "math" #rename another variable
var <- c("gender", "math", "BSBGHER")
dat1 <- mydata[, var] #subset data
dat2 <- na.omit(dat1) #listwise deletion
dat1$gender_c <- factor(dat1$gender) #create a categorical variable for `gender`
levels(dat1$gender_c) <- c("f", "m") #change levels for `gender` variable
summary(dat1)
##      gender           math          BSBGHER       gender_c
##  Min.   :1.000   Min.   :248.2   Min.   : 4.232   f   :5119
##  1st Qu.:1.000   1st Qu.:459.0   1st Qu.: 9.623   m   :5098
##  Median :1.000   Median :518.9   Median :10.918   NA's:   4
##  Mean   :1.499   Mean   :515.8   Mean   :10.793
##  3rd Qu.:2.000   3rd Qu.:574.7   3rd Qu.:11.622
##  Max.   :2.000   Max.   :762.2   Max.   :13.884
##  NA's   :4                       NA's   :103
table(dat1$gender_c) #compute frequencies for categorical variables
##
## f m
## 5119 5098
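The factor can also be created and labelled in one step by passing both levels and labels to factor(). The sketch below uses a small made-up vector of codes rather than the course data; the value 9 stands in for an "omitted" code and becomes NA because it is not listed in levels.

```r
# Toy codes: 1 = female, 2 = male, 9 = omitted (illustrative values only)
gender <- c(1, 2, 2, 1, 9, NA)

# levels says which codes are valid; labels gives them readable names
gender_c <- factor(gender, levels = c(1, 2), labels = c("f", "m"))
gender_c

# useNA = "ifany" adds a count of missing values to the frequency table
table(gender_c, useNA = "ifany")
```

Naming the levels explicitly avoids relying on the order in which the codes happen to be sorted.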
psych::describe(dat1) #compute descriptives for quantitative variables; requires the `psych` package to be installed
## vars n mean sd median trimmed mad min max range
## gender 1 10217 1.50 0.50 1.00 1.50 0.00 1.00 2.00 1.00
## math 2 10221 515.82 82.36 518.93 516.81 86.02 248.22 762.23 514.01
## BSBGHER 3 10118 10.79 1.67 10.92 10.77 1.83 4.23 13.88 9.65
## gender_c* 4 10217 1.50 0.50 1.00 1.50 0.00 1.00 2.00 1.00
## skew kurtosis se
## gender 0.00 -2.00 0.00
## math -0.10 -0.28 0.81
## BSBGHER -0.07 -0.03 0.02
## gender_c* 0.00 -2.00 0.00
attach(dat1) # save some typing. Alternatively, `math <- dat1$math` to create a separate vector.
mean(math, na.rm = TRUE)
## [1] 515.8243
median(math, na.rm = TRUE)
## [1] 518.9269
var(math, na.rm = TRUE)
## [1] 6782.527
sd(math, na.rm = TRUE)
## [1] 82.3561
quantile(math, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 248.2235 458.9607 518.9269 574.6662 762.2316
detach(dat1)
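An alternative to attach() and detach() is with(), which evaluates a single expression inside a data frame and leaves the search path untouched. The following is a minimal sketch with a made-up data frame, toy.

```r
toy <- data.frame(math = c(500, 520, NA, 480))

# Columns can be used by name inside with(); nothing needs to be detached later
with(toy, mean(math, na.rm = TRUE))   # 500

# Equivalent call using the $ operator
mean(toy$math, na.rm = TRUE)          # 500
```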
1.4.6.3 Pearson’s product moment correlation
Correlation between two variables, hypothesis test, and confidence interval.
cor.test(dat1$BSBGHER, dat1$math)
##
## Pearson's product-moment correlation
##
## data: dat1$BSBGHER and dat1$math
## t = 42.923, df = 10116, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3759021 0.4088708
## sample estimates:
## cor
## 0.3925126
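The t statistic reported by cor.test() can be reproduced by hand from the correlation r and the degrees of freedom n - 2, using t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below checks this on the built-in mtcars data (not the course data) and then plugs in the values from the output above.

```r
# Correlation and sample size from a built-in example data set
r <- cor(mtcars$mpg, mtcars$wt)
n <- nrow(mtcars)

# t statistic computed by hand
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# The same quantities as reported by cor.test()
res <- cor.test(mtcars$mpg, mtcars$wt)
all.equal(unname(res$statistic), t_manual)   # TRUE
res$parameter                                # df = n - 2 = 30

# Using r and df from the output above gives about 42.92
t_course <- 0.3925126 * sqrt(10116) / sqrt(1 - 0.3925126^2)
t_course
```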
All pairwise correlations.
var <- c("gender", "math", "BSBGHER")
cor(dat1[, var], use = "pairwise.complete.obs") #for numeric variables only; "gender" is still a numeric variable!
## gender math BSBGHER
## gender 1.00000000 0.01305597 -0.0281559
## math 0.01305597 1.00000000 0.3925126
## BSBGHER -0.02815590 0.39251256 1.0000000
round(cor(dat1[,var], use = "pairwise.complete.obs"), 2) #round to 2 decimal places
## gender math BSBGHER
## gender 1.00 0.01 -0.03
## math 0.01 1.00 0.39
## BSBGHER -0.03 0.39 1.00
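The use argument controls how missing values are handled. As a rough sketch on made-up data, the lines below contrast "complete.obs", which keeps only rows with no missing values at all, with "pairwise.complete.obs", which uses every row available for each pair of variables.

```r
# Toy data with missing values in different rows
toy <- data.frame(x = c(1, 2, 3, 4, NA),
                  y = c(2, 1, 4, 3, 5),
                  z = c(1, NA, 2, 4, 3))

# Listwise: only rows 1, 3, and 4 are complete on all three variables
r_listwise <- cor(toy, use = "complete.obs")

# Pairwise: x and y are correlated using rows 1 to 4
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

r_listwise["x", "y"]
r_pairwise["x", "y"]     # 0.6

# Number of non-missing values per variable
colSums(!is.na(toy))
```

Pairwise deletion keeps more data, but each correlation may then rest on a different set of cases, which is why the n matrix reported by rcorr() in the next example varies from cell to cell.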
Correlations and significance levels for a correlation matrix.
This requires installing and loading the Hmisc package.
The data frame must be coerced to a matrix to get both a correlation matrix and p-values.
install.packages("Hmisc")
library(Hmisc)
df <- as.matrix(dat1[, var]) # numeric variables only
rcorr(df)
##         gender math BSBGHER
## gender    1.00 0.01   -0.03
## math      0.01 1.00    0.39
## BSBGHER  -0.03 0.39    1.00
##
## n
##         gender  math BSBGHER
## gender   10217 10217   10117
## math     10217 10221   10118
## BSBGHER  10117 10118   10118
##
## P
##         gender math   BSBGHER
## gender         0.1870 0.0046
## math    0.1870        0.0000
## BSBGHER 0.0046 0.0000
1.5 Mplus
Mplus is a latent variable modeling program with a wide variety of analysis capabilities. There is a free Mplus demo version (with certain limitations) available for download. I find the Mplus User’s Guide particularly helpful.