Stefan's Blog

Common R commands

I have lately been working with R for some data analysis. As I am not an R expert, I often forget the commands I need between my R sessions. I suspect this happens because I look things up online, copy and fiddle with the code I find, and end up more interested in the results than in the code itself. The result is that I forget the syntax or how to do things. So in this post I record the R commands I use most often, so that I can come back to them when I need to :-)

Word of caution: I am not an R expert, so some of the code may be less than ideal. I plan to update this post as often as needed, so feel free to send me an email if you spot a mistake.

a <- c("1","2") # make a character vector and assign it to variable a

# Make a data frame with 9 columns and 0 rows
df <- data.frame(matrix(ncol = 9, nrow = 0))

# Prepend "a" and "b" to vector a, giving c("a", "b", "1", "2")
x <- append(c("a", "b"), a)

# Add column names to my data frame (x should hold one name per column)
colnames(df) <- x
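
The two steps above (create an empty data frame, then name its columns) can also be done in one call. A minimal sketch, assuming hypothetical column names (`id`, `name`, `score`) that are not from my data:

```r
# Hypothetical column names, chosen only for illustration
df_empty <- data.frame(
  id = integer(),
  name = character(),
  score = numeric(),
  stringsAsFactors = FALSE
)

dim(df_empty)   # number of rows and columns
names(df_empty) # the column names

# Append one row; the new row must use the same column names
new_row <- data.frame(id = 1L, name = "first", score = 9.5,
                      stringsAsFactors = FALSE)
df_filled <- rbind(df_empty, new_row)
nrow(df_filled)
```

Declaring each column with a typed empty vector also fixes the column types up front, which `matrix(ncol = 9, nrow = 0)` does not.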

# Let's assume we have the mpg dataset (it ships with ggplot2) stored in df_mpg
library(ggplot2)
df_mpg <- mpg

head(df_mpg)
# It looks like this
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

# Select only the manufacturer column
df_mpg[1]
# A tibble: 234 x 1
manufacturer
<chr>
1 audi
2 audi
...

# Another option is to extract the manufacturer column as a vector
df_mpg$manufacturer # use $ followed by the column name
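
The difference between the two selections above is the type of the result. A small sketch in base R, assuming a hypothetical three-row data frame called `small_df` instead of mpg:

```r
# Hypothetical data, used only for illustration
small_df <- data.frame(
  manufacturer = c("audi", "audi", "toyota"),
  cyl = c(4, 6, 4),
  stringsAsFactors = FALSE
)

one_col_df  <- small_df[1]                # data frame with one column
one_col_df2 <- small_df["manufacturer"]   # same, selected by name
as_vector   <- small_df[["manufacturer"]] # plain character vector
as_vector2  <- small_df$manufacturer      # same as [[ ]]

class(one_col_df)
class(as_vector)
```

Single brackets keep the data frame structure, while `[[ ]]` and `$` pull the column out as a vector.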
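# Read a CSV file into a data frame, and write a data frame to a CSV file
# (file is the path to the CSV file)
df <- read.csv(file)
write.csv(df, file)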
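
A round trip makes the read/write pair easier to remember. A minimal sketch, assuming a hypothetical data frame `scores` and a temporary file path:

```r
# Hypothetical data and a temporary file, used only for illustration
scores <- data.frame(
  name = c("a", "b"),
  value = c(1.5, 2.5),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv from adding a column of row numbers
write.csv(scores, path, row.names = FALSE)

# Read the file back and compare it with the original
scores_back <- read.csv(path, stringsAsFactors = FALSE)
all.equal(scores, scores_back)
```

Without `row.names = FALSE`, the file gets an extra first column that shows up as `X` when it is read back.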

Select + Filter data #

library(dplyr) # this provides us with the pipe operator %>%, and many other goodies
df_mpg %>% select(where(is.numeric)) # get all numeric columns in the dataset
# Result
# displ year cyl cty hwy
<dbl> <int> <int> <int> <int>
1 1.8 1999 4 18 29
2 1.8 1999 4 21 29
3 2 2008 4 20 31
df_mpg %>% select(everything()) # provides all the columns

df_mpg %>% select(everything()) %>% filter(cyl > 4) # keep only rows with cyl > 4

df_mpg %>% select(everything()) %>% filter(cyl > 4, class == "midsize") # keep rows with cyl > 4 and class midsize

df_mpg %>% select(everything()) %>% filter(cyl > 4, class == "midsize") %>% count(manufacturer) # count the matching rows per manufacturer
# Result
manufacturer n
<chr> <int>
1 audi 3
2 chevrolet 3
3 hyundai 3
4 nissan 5
5 pontiac 5
6 toyota 3
7 volkswagen 3
# Sort from largest to smallest count by passing sort=TRUE to count()
df_mpg %>% select(everything()) %>% filter(cyl > 4, class == "midsize" ) %>% count(manufacturer, sort=TRUE)
# Result
manufacturer n
<chr> <int>
1 nissan 5
2 pontiac 5
3 audi 3
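
It helps me to remember that `count()` is shorthand for grouping and summarising. A small sketch of the same filter-then-count pattern, assuming mtcars (which ships with base R) and grouping by `gear`, which is my choice for illustration:

```r
library(dplyr)

# Keep only cars with more than 4 cylinders
big_engines <- mtcars %>% filter(cyl > 4)

# count() is shorthand for group_by() followed by summarise(n = n())
by_count   <- big_engines %>% count(gear)
by_summary <- big_engines %>% group_by(gear) %>% summarise(n = n())

by_count
by_summary
```

The longer `group_by()` form is the one to reach for when more than a count is needed, for example a mean per group.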

Collinearity, VIF #

Sometimes, I need to check if the independent variables in my data are correlated.

# Correlation matrix.
# By default it uses the Pearson correlation. See ?cor for other methods
df_mpg %>% select(where(is.numeric)) %>% cor() # a correlation matrix only works on numeric columns

# or one can use the Hmisc package and the rcorr function
library("Hmisc")
corr_data <- df_mpg %>% select(where(is.numeric))
rcorr(as.matrix(corr_data))
# Result
displ year cyl cty hwy
displ 1.00 0.15 0.93 -0.80 -0.77
year 0.15 1.00 0.12 -0.04 0.00
cyl 0.93 0.12 1.00 -0.81 -0.76
cty -0.80 -0.04 -0.81 1.00 0.96
hwy -0.77 0.00 -0.76 0.96 1.00
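# Plot the correlation matrix with the corrplot package
install.packages("corrplot") # only needed once
library("corrplot")
corr_data <- df_mpg %>% select(where(is.numeric))
cor_matrix <- cor(corr_data)
corrplot(cor_matrix, method="circle") # draw each correlation as a circle
corrplot(cor_matrix, method="number") # print each correlation as a number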

[Figure: correlation plot produced by corrplot]
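
The other correlation methods are selected with the `method` argument of `cor()`. A minimal sketch in base R, assuming a few numeric columns of mtcars instead of mpg:

```r
# Numeric columns of mtcars (ships with base R), chosen only for illustration
num_data <- mtcars[, c("mpg", "disp", "hp", "wt")]

pearson  <- cor(num_data)                      # default method
spearman <- cor(num_data, method = "spearman") # rank-based correlation
kendall  <- cor(num_data, method = "kendall")  # Kendall's tau

round(spearman, 2)

# cor.test() adds a p-value for a single pair of variables
cor.test(mtcars$mpg, mtcars$wt)
```

Spearman and Kendall work on ranks, so they are less sensitive to outliers and to relationships that are monotonic but not linear.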

library(car) # the car package provides vif(), the variance inflation factor
model <- lm(hwy ~ cyl + year, data=df_mpg)

car::vif(model) # where model is the regression model
# Result
cyl year
1.015171 1.015171
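
For intuition, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. A minimal sketch in base R, assuming mtcars and the predictors `wt`, `hp` and `disp` (my choice for illustration, not the model above):

```r
# Regress one predictor (wt) on the remaining predictors (hp, disp)
aux_model <- lm(wt ~ hp + disp, data = mtcars)
r2_wt <- summary(aux_model)$r.squared

# The VIF grows as the other predictors explain more of wt
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```

For a model with only numeric predictors, this should agree with the `wt` entry of `car::vif(lm(mpg ~ wt + hp + disp, data = mtcars))`. A VIF of 1 means the predictor is uncorrelated with the others.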

Statistical tests #

I want to test whether two groups of data differ, for example whether cars with automatic and manual transmission have different fuel consumption. We need to use a different dataset, so let's use mtcars instead.

df <- mtcars # has mpg and am (0 = automatic, 1 = manual transmission)
t.test(mpg ~ am, data = df) # Welch's two-sample t-test (the default)
# Result
Welch Two Sample t-test

data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
wilcox.test(mpg ~ am, data=df) # non-parametric alternative to the t-test
# Result
Wilcoxon rank sum test with continuity correction

data:  mpg by am
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0
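
Both tests return an object, so the numbers can be pulled out rather than read off the printed output. A minimal sketch for the t-test above; the 5% threshold is a conventional assumption, not something the data dictates:

```r
res <- t.test(mpg ~ am, data = mtcars)

res$p.value   # p-value of the test
res$conf.int  # 95 percent confidence interval for the difference in means
res$estimate  # the two group means

# Simple decision rule at an assumed 5% significance level
if (res$p.value < 0.05) {
  message("The group means differ at the 5% level")
}
```

The object returned by `wilcox.test()` has a `p.value` element as well.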

Plotting data with ggplot2 #

I quite like using ggplot2, but I often forget the order of arguments, how to do different things, etc. Here are some of the most common ggplot commands and their outputs. You can find a fantastic cheatsheet online.

library(ggplot2)
ggplot(data = df_mpg) + aes(x = displ, y = hwy, fill=class, color=class) + geom_point()

[Figure: scatter plot produced by geom_point]

Add a regression line #

ggplot(df_mpg) + aes(x = displ, y = hwy) + geom_point(mapping=aes(color=class)) + geom_smooth()

[Figure: scatter plot with a regression line]

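Note that, compared to the previous graph, color=class has moved into geom_point(mapping=aes()). Aesthetics set at the top level are inherited by every layer, so leaving color=class there would make geom_smooth() fit one line per class instead of a single line for all points.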
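
The effect of where color=class is placed can be seen by building both variants. A minimal sketch using the mpg dataset from ggplot2; the object names `p_per_class` and `p_overall` are mine:

```r
library(ggplot2)

# color mapped at the top level: every layer inherits it,
# so geom_smooth() draws one fitted line per class
p_per_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth()

# color mapped only inside geom_point(): the points are coloured,
# but geom_smooth() fits a single line through all of them
p_overall <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()
```

Print either object (for example `p_overall`) to draw it. The per-class version may emit warnings for classes with very few cars.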

Boxplot #

ggplot(data = df_mpg, mapping = aes(x = displ, y = hwy, fill=class)) + geom_boxplot()

[Figure: boxplot produced by geom_boxplot]

More advanced plotting #

To change the legend labels, use the scale_fill_discrete function. The labels must be written in ascending order of the variable's values (here the variable is cyl, which has the values 4, 6, 8).

ggplot(data=mtcars) + geom_boxplot(mapping=aes(x=hp, y=mpg, fill=factor(cyl))) # plot on the left
ggplot(data=mtcars) +
  geom_boxplot(mapping=aes(x=hp, y=mpg, fill=factor(cyl))) +
  scale_fill_discrete(labels=c("S4", "V6", "V8")) # plot on the right

[Figure: boxplots with default (left) and custom (right) legend labels]
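
The ordering rule comes from how factors work: levels are sorted in ascending order by default, and labels are matched to levels by position. A small sketch in base R that attaches each label to its value explicitly; the label strings are hypothetical:

```r
# Default factor levels are sorted in ascending order
default_levels <- levels(factor(mtcars$cyl))
default_levels

# Pair each value with a label explicitly (labels are hypothetical)
cyl_label <- factor(
  mtcars$cyl,
  levels = c(4, 6, 8),
  labels = c("4 cylinders", "6 cylinders", "8 cylinders")
)
table(cyl_label)
```

Using a labelled factor like `cyl_label` as the fill variable should give the same legend text without calling scale_fill_discrete, and the pairing of value and label is visible in one place.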

To place several plots side by side, use the ggpubr library, in particular the function ggarrange.

library("ggpubr")
p1 <- ggplot(data=mtcars) +
  geom_boxplot(mapping=aes(x=hp, y=mpg, fill=factor(cyl))) # plot on the left

p2 <- ggplot(data=mtcars) +
  geom_boxplot(mapping=aes(x=cyl, y=mpg, fill=factor(am))) +
  scale_fill_discrete(labels=c("automatic", "manual")) # plot on the right

ggarrange(p1,p2)

[Figure: two plots arranged side by side with ggarrange]

There are several fantastic resources online. Here are a few:

R for Data Science

Hands-on Programming with R

ggplot2 Tutorial

R Graphics Cookbook

Awesome R