Common R commands
I have lately been working with R for some data analysis. As I am not an R expert, I often forget the commands I need between R sessions. I suspect this happens because I google things, copy-paste and fiddle with the code I find online, and end up more interested in the results than in the code itself. The result is that I forget the actual syntax or how to do things. So in this post I will record some of the R commands I use most often, so that I can come back to them when I need to :-)
Word of caution: I am not an R expert, so some code might be less than ideal. However, I plan to update this as often as needed, so feel free to send me an email if you spot a mistake.
a <- c("1", "2") # make a character vector and assign it to variable a
# Make a data frame with 9 columns and 0 rows
df <- data.frame(matrix(ncol = 9, nrow = 0))
# Put "a" and "b" in front of the elements of vector a
x <- append(c("a", "b"), a)
# Add column names to my data frame (x should hold one name per column)
colnames(df) <- x
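Note that x above has four entries while df has nine columns, so the names do not line up with the columns. Here is a minimal sketch of the same idea with matching lengths. The names df_demo, base_names, col_names and new_row are hypothetical, used only for illustration.

```r
# Hypothetical example: build the column names first
base_names <- c("1", "2")
col_names <- append(c("a", "b"), base_names)  # "a" "b" "1" "2"

# Empty data frame with exactly one column per name
df_demo <- data.frame(matrix(ncol = length(col_names), nrow = 0))
colnames(df_demo) <- col_names

# Append one row; the new row must use the same column names
new_row <- data.frame(matrix(c(10, 20, 30, 40), nrow = 1))
colnames(new_row) <- col_names
df_demo <- rbind(df_demo, new_row)

print(df_demo)
```

Sizing the data frame from length(col_names) keeps the two in sync if the list of names changes later.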
# Let's assume we have this data (mpg, which ships with ggplot2) stored in df_mpg
library(ggplot2)
df_mpg <- mpg
head(df_mpg)
# It looks like this
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
# Select only the manufacturer column
df_mpg[1]
# A tibble: 234 x 1
manufacturer
<chr>
1 audi
2 audi
...
# Another option is to get the values of the manufacturer column as a vector
df_mpg$manufacturer # use $ followed by the column name
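The two options above return different types, which matters when the result is passed to other functions. Here is a small sketch using the built-in mtcars data so that it runs without extra packages. The variable names are only illustrative.

```r
# Single brackets keep the data frame structure (here: one column)
as_frame <- mtcars["mpg"]
class(as_frame)    # "data.frame"

# Double brackets and $ both return the column as a plain vector
as_vector <- mtcars[["mpg"]]
same_vector <- mtcars$mpg
class(as_vector)   # "numeric"

identical(as_vector, same_vector)  # TRUE
```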
- Read and write from/to CSV file
df <- read.csv(file)
write.csv(df, file)
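Here is a self-contained round trip, as a sketch. It assumes the built-in iris data and a temporary file instead of a real path, and csv_path and df_back are made-up names. One thing I keep forgetting is row.names = FALSE, without which write.csv adds the row names as an extra first column.

```r
# Write to a temporary file so nothing in the working directory is touched
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() from writing the row names as a column
write.csv(head(iris), csv_path, row.names = FALSE)

# Read the file back into a data frame and inspect its structure
df_back <- read.csv(csv_path)
str(df_back)

unlink(csv_path)  # clean up the temporary file
```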
Select + Filter data
- Select numeric columns
library(dplyr) # this provides us with the pipe operator %>%, and many other goodies
df_mpg %>% select(where(is.numeric)) # keep only the numeric columns of the dataset
# Result
displ year cyl cty hwy
<dbl> <int> <int> <int> <int>
1 1.8 1999 4 18 29
2 1.8 1999 4 21 29
3 2 2008 4 20 31
- Select and filter
df_mpg %>% select(everything()) # keeps all the columns
df_mpg %>% select(everything()) %>% filter(cyl > 4) # keep only the rows with cyl > 4
df_mpg %>% select(everything()) %>% filter(cyl > 4, class == "midsize" ) # keep only the rows with cyl > 4 and class midsize
df_mpg %>% select(everything()) %>% filter(cyl > 4, class == "midsize" ) %>% count(manufacturer) # count the matching rows per manufacturer
# Result
manufacturer n
<chr> <int>
1 audi 3
2 chevrolet 3
3 hyundai 3
4 nissan 5
5 pontiac 5
6 toyota 3
7 volkswagen 3
- Sort the counts, largest first
# Passing sort=TRUE to count() orders the result by n, descending
df_mpg %>% select(everything()) %>% filter(cyl > 4, class == "midsize" ) %>% count(manufacturer, sort=TRUE)
# Result
manufacturer n
<chr> <int>
1 nissan 5
2 pontiac 5
3 audi 3
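To remind myself what count does under the hood, here is a sketch that spells it out with group_by, summarise and arrange. It uses the built-in mtcars data rather than df_mpg so that it only needs dplyr. The names via_count and via_summarise are just for illustration.

```r
library(dplyr)

# count() with sort = TRUE ...
via_count <- mtcars %>% count(cyl, sort = TRUE)

# ... gives the same counts as grouping, counting rows, and sorting descending
via_summarise <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

via_count
via_summarise
```

The longer form is handy when more than a count is needed, since summarise can compute other statistics per group in the same call.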
Collinearity, VIF
Sometimes, I need to check if the independent variables in my data are correlated.
- Correlation matrix
# Correlation matrix.
# By default it uses Pearson correlation. See ?cor for other methods
df_mpg %>% select(where(is.numeric)) %>% cor() # a correlation matrix can only be computed on numeric columns
# or one can use the Hmisc package and the rcorr function
library("Hmisc")
corr_data <- df_mpg %>% select(where(is.numeric))
rcorr(as.matrix(corr_data))
# Result
displ year cyl cty hwy
displ 1.00 0.15 0.93 -0.80 -0.77
year 0.15 1.00 0.12 -0.04 0.00
cyl 0.93 0.12 1.00 -0.81 -0.76
cty -0.80 -0.04 -0.81 1.00 0.96
hwy -0.77 0.00 -0.76 0.96 1.00
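As an example of the "other possibilities" mentioned above, cor accepts a method argument. Here is a short sketch, assuming three columns of the built-in mtcars data. The names num_cols, cor_pearson and cor_spearman are made up for this example.

```r
# Three numeric columns from the built-in mtcars data
num_cols <- mtcars[, c("mpg", "hp", "wt")]

# Pearson (the default) measures linear association
cor_pearson <- cor(num_cols)

# Spearman works on ranks, so it picks up any monotonic relationship
cor_spearman <- cor(num_cols, method = "spearman")

round(cor_pearson, 2)
round(cor_spearman, 2)
```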
- Visualizing the correlation matrix
install.packages("corrplot")
library("corrplot")
corr_data <- df_mpg %>% select(where(is.numeric))
cor_matrix <- cor(corr_data)
corrplot(cor_matrix, method="circle")
corrplot(cor_matrix, method="number")

- Variance Inflation Factor
library(car) # the car package provides vif(), the variance inflation factor
model <- lm(hwy ~ cyl + year, data=df_mpg)
car::vif(model) # where model is the regression model
# Result
cyl year
1.015171 1.015171
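The two VIF values above are identical, which is expected for a model with exactly two predictors. For a numeric predictor, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other ones. Here is a sketch of the computation by hand, assuming the built-in mtcars data with cyl and hp as the two predictors (my choice for illustration, not the model above).

```r
# VIF for cyl: regress cyl on the other predictor and take 1 / (1 - R^2)
r2_cyl <- summary(lm(cyl ~ hp, data = mtcars))$r.squared
vif_cyl <- 1 / (1 - r2_cyl)

# With only two predictors, R^2 is just their squared correlation,
# so both predictors end up with the same VIF
vif_hp <- 1 / (1 - cor(mtcars$cyl, mtcars$hp)^2)

c(cyl = vif_cyl, hp = vif_hp)
```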
Statistical tests
I want to test whether two groups differ, here whether fuel consumption (mpg) differs between automatic and manual cars. We need a different dataset for that, so let's use mtcars instead.
- T-Test
df <- mtcars # has mpg and am (transmission: 0 = automatic, 1 = manual)
t.test(mpg ~ am, data = df) # Welch two-sample t-test (the default, var.equal = FALSE)
# Result
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
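The printed output is nice to read, but when I need the numbers in later code I store the result and pull the pieces out by name. Here is a sketch of that. The variable name res is arbitrary.

```r
res <- t.test(mpg ~ am, data = mtcars)

res$p.value    # p-value of the test
res$estimate   # the two group means
res$conf.int   # 95% confidence interval for the difference in means
res$statistic  # the t statistic

# Example: a simple decision at the 5% level
res$p.value < 0.05
```

The object returned by wilcox.test has the same kind of structure, so p.value and statistic can be extracted the same way.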
- Wilcoxon test
wilcox.test(mpg ~ am, data=df)
# Result
Wilcoxon rank sum test with continuity correction
data: mpg by am
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0
Plotting data with ggplot2
I quite like using ggplot2, but I often forget the order of arguments, how to do different things, etc. So here are some of the most common ggplot commands and their outputs. You can find a fantastic cheatsheet online.
library(ggplot2)
ggplot(data = df_mpg) + aes(x = displ, y = hwy, fill=class, color=class) + geom_point()
Add a regression line
ggplot(df_mpg) + aes(x = displ, y = hwy) + geom_point(mapping=aes(color=class)) + geom_smooth()
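One thing to observe here is that, compared to the previous graph, we need to move color=class into geom_point(mapping=aes()). Otherwise geom_smooth() would inherit the color mapping and draw one line per class instead of a single line for all points.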
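To see the difference, here is a sketch that builds both variants side by side. It uses mpg directly from ggplot2, and method = "lm" to get a straight linear fit (by default geom_smooth picks a loess curve for a dataset this small). The names p_per_class and p_single are made up.

```r
library(ggplot2)

# color = class in the global aes(): geom_smooth() inherits it
# and fits one line per class
p_per_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color = class only inside geom_point(): a single line for all points
p_single <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)

p_per_class
p_single
```

Setting se = FALSE only hides the confidence band, which keeps the per-class plot readable.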
Boxplot
ggplot(data = df_mpg, mapping = aes(x = displ, y = hwy, fill=class)) + geom_boxplot()
More advanced plotting
- Fill by color and label those fills using scale_fill_discrete
The labels passed to scale_fill_discrete should be written in the ascending order of your variable's values (here the variable is cyl). E.g., cyl has the values 4, 6, 8, so the first label goes to 4, the second to 6, and the third to 8.
ggplot(data=mtcars) + geom_boxplot(mapping=aes(x=hp, y=mpg, fill=factor(cyl))) # plot on the left
ggplot(data=mtcars) +
  geom_boxplot(mapping=aes(x=hp, y=mpg, fill=factor(cyl))) +
  scale_fill_discrete(labels=c("S4", "V6", "V8")) # plot on the right
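Since the labels depend on the order of the values, one way to avoid mixing them up is to attach the labels to the factor itself. Here is a sketch using base R only. The label texts and the name cyl_labelled are made up for this example.

```r
# Pair each level with its label explicitly, so the order cannot be mixed up
cyl_labelled <- factor(mtcars$cyl,
                       levels = c(4, 6, 8),
                       labels = c("4 cyl", "6 cyl", "8 cyl"))

levels(cyl_labelled)  # "4 cyl" "6 cyl" "8 cyl"
table(cyl_labelled)   # how many cars fall in each labelled group
```

Stored as a column of the data frame, such a factor can be mapped to fill directly, and the legend then shows the labels without a separate scale_fill_discrete call.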
- How to put two subplots in one figure, like the plot above?
Use the ggpubr library, in particular its ggarrange function.
library("ggpubr")
p1 <- ggplot(data=mtcars) +
  geom_boxplot(mapping=aes(x=hp, y=mpg, fill=factor(cyl))) # plot on the left
p2 <- ggplot(data=mtcars) +
  geom_boxplot(mapping=aes(x=cyl, y=mpg, fill=factor(am))) +
  scale_fill_discrete(labels=c("automatic", "manual"))
ggarrange(p1, p2)
There are several fantastic resources online. Here are a few: