Introduction
Prerequisites
Numeric vectors are the backbone of data science. We still need the tidyverse because we'll use these base R functions inside tidyverse functions like mutate() and filter().
library(tidyverse)
library(nycflights13)
Making numbers
In most cases, you’ll get numbers already recorded in one of R’s numeric types:
integer
double.
In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.
Parsing numbers
readr provides two useful functions for parsing strings into numbers: parse_double() and parse_number(). Use parse_double() when you have numbers that have been written as strings:
x <- c("1.2", "5.6", "1e3")
parse_double(x)
[1] 1.2 5.6 1000.0
Use parse_number() when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:
x <- c("$1,234", "USD 3,513", "59%")
parse_number(x)
[1] 1234 3513 59
Counts
dplyr::count() is great for quick exploration and checks during analysis:
flights |> count(dest)
# A tibble: 105 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 265
3 ALB 439
4 ANC 8
5 ATL 17215
6 AUS 2439
7 AVL 275
8 BDL 443
9 BGR 375
10 BHM 297
# ℹ 95 more rows
Counts
If you want to see the most common values, add sort = TRUE:
flights |> count(dest, sort = TRUE)
# A tibble: 105 × 2
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
# ℹ 95 more rows
If you want to see all the values, pipe the result into |> View() or |> print(n = Inf).
Counts alternative
Same computation “by hand” with group_by(), summarize() and n().
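The equivalent "by hand" pipeline looks like this:

```r
library(tidyverse)
library(nycflights13)

# count(dest) is shorthand for grouping by dest and counting rows with n()
flights |>
  group_by(dest) |>
  summarize(n = n())
```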
There are a couple of variants of n() and count() that you might find useful:
n_distinct(x) counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:
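For example:

```r
library(tidyverse)
library(nycflights13)

# Which destinations are served by the most carriers?
flights |>
  group_by(dest) |>
  summarize(carriers = n_distinct(carrier)) |>
  arrange(desc(carriers))
```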
Transformation functions work well with mutate() because their output is the same length as the input.
Minimum and maximum
The arithmetic functions work with pairs of variables (or columns). Two closely related functions are pmin() and pmax(), which when given two or more variables will return the smallest or largest value in each row:
df <- tribble(
  ~x, ~y,
   1,  3,
   5,  2,
   7, NA,
)
df |> mutate(
  min = pmin(x, y, na.rm = TRUE),
  max = pmax(x, y, na.rm = TRUE)
)
# A tibble: 3 × 4
x y min max
<dbl> <dbl> <dbl> <dbl>
1 1 3 1 3
2 5 2 2 5
3 7 NA 7 7
Minimum and maximum
Note that these are different from the summary functions min() and max(), which take multiple observations and return a single value. You can tell that you've used the wrong form when all the minimums and all the maximums have the same value:
df |> mutate(
  min = min(x, y, na.rm = TRUE),
  max = max(x, y, na.rm = TRUE)
)
# A tibble: 3 × 4
x y min max
<dbl> <dbl> <dbl> <dbl>
1 1 3 1 7
2 5 2 1 7
3 7 NA 1 7
Modular arithmetic
Modular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder. In R, %/% does integer division and %% computes the remainder:
1:10 %/% 3
[1] 0 0 1 1 1 2 2 2 3 3
1:10 %% 3
[1] 1 2 0 1 2 0 1 2 0 1
Modular arithmetic
Modular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time variable into hour and minute:
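For example:

```r
library(tidyverse)
library(nycflights13)

# sched_dep_time stores e.g. 13:35 as the number 1335,
# so %/% 100 extracts the hour and %% 100 the minute
flights |>
  mutate(
    hour = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100,
    .keep = "used"
  )
```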
We can combine that with the mean(is.na(x)) trick from the chapter on logical vectors to see how the proportion of cancelled flights varies over the course of the day. The results are shown in Figure 1.
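One way to produce such a plot (dep_time is NA for cancelled flights):

```r
library(tidyverse)
library(nycflights13)

flights |>
  group_by(hour = sched_dep_time %/% 100) |>
  summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
  filter(hour > 1) |>
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line(color = "grey50") +
  geom_point(aes(size = n))
```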
Figure 1: A line plot with scheduled departure hour on the x-axis, and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.
Logarithms
Three logarithms
In R, you have a choice of three logarithms:
log() (the natural log, base e),
log2() (base 2),
log10() (base 10).
Logarithm recommendation
We recommend using log2() or log10():
log2() is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving;
log10() is easy to back-transform because (e.g.) a value of 3 on the log10 scale corresponds to 10^3 = 1000 on the original scale.
The inverse of log() is exp(); to compute the inverse of log2() or log10() you’ll need to use 2^ or 10^.
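A quick check of the back-transformations:

```r
x <- c(1, 8, 1024)
log2(x)          # 0, 3, 10
2^log2(x)        # back-transform recovers 1, 8, 1024
exp(log(x))      # likewise for the natural log
```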
Rounding
Use round(x) to round a number to the nearest integer:
round(123.456)
[1] 123
Rounding digits
You can control the precision of the rounding with the second argument, digits. round(x, digits) rounds to the nearest 10^-digits, so digits = 2 will round to the nearest 0.01. This definition is useful because it implies round(x, -3) will round to the nearest thousand, which indeed it does:
round(123.456, 2) # two digits
[1] 123.46
round(123.456, 1) # one digit
[1] 123.5
round(123.456, -1) # round to nearest ten
[1] 120
round(123.456, -2) # round to nearest hundred
[1] 100
Rounding digits - weirdness
There’s one weirdness with round() that seems surprising at first glance:
round(c(1.5, 2.5))
[1] 2 2
round() uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.
Rounding digits - floor/ceiling
round() is paired with floor() which always rounds down and ceiling() which always rounds up:
x <- 123.456
floor(x)
[1] 123
ceiling(x)
[1] 124
Cutting numbers into ranges
Use cut() to break up (aka bin) a numeric vector into discrete buckets:
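For example:

```r
x <- c(1, 2, 5, 10, 15, 20)
# Each value is assigned to the interval that contains it;
# by default intervals are open on the left: (0,5], (5,10], ...
cut(x, breaks = c(0, 5, 10, 15, 20))

# Values outside the range of breaks become NA:
cut(c(-1, 25), breaks = c(0, 5, 10, 15, 20))
```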
See the documentation for other useful arguments like right and include.lowest, which control if the intervals are [a, b) or (a, b] and if the lowest interval should be [a, b].
Cumulative and rolling aggregates
Base R provides cumsum(), cumprod(), cummin(), cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice:
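For example:

```r
x <- 1:10
cumsum(x)
# 1  3  6 10 15 21 28 36 45 55
```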
If you need more complex rolling or sliding aggregates, try the slider package.
General transformations
The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.
Ranks
dplyr provides a number of ranking functions inspired by SQL, but you should always start with dplyr::min_rank(). It uses the typical method for dealing with ties, e.g., 1st, 2nd, 2nd, 4th.
x <-c(1, 3, 2, 2, 4, 20, 15, NA)min_rank(x)
[1] 1 4 2 2 5 7 6 NA
Ranks
Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:
min_rank(desc(x))
[1] 7 4 5 5 3 1 2 NA
Ranks alternatives
If min_rank() doesn’t do what you need, look at the variants dplyr::row_number(), dplyr::dense_rank(), dplyr::percent_rank(), and dplyr::cume_dist(). See the documentation for details.
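The table below compares them on the same vector:

```r
library(dplyr)

df <- tibble(x = c(1, 3, 2, 2, 4, 20, 15, NA))
df |>
  mutate(
    row_number = row_number(x),
    dense_rank = dense_rank(x),
    percent_rank = percent_rank(x),
    cume_dist = cume_dist(x)
  )
```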
# A tibble: 8 × 5
x row_number dense_rank percent_rank cume_dist
<dbl> <int> <int> <dbl> <dbl>
1 1 1 1 0 0.143
2 3 4 3 0.5 0.571
3 2 2 2 0.167 0.429
4 2 3 2 0.167 0.429
5 4 5 4 0.667 0.714
6 20 7 6 1 1
7 15 6 5 0.833 0.857
8 NA NA NA NA NA
Offsets
dplyr::lead() and dplyr::lag() allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end:
x <- c(2, 5, 11, 11, 19, 35)
lag(x)
[1] NA 2 5 11 11 19
lead(x)
[1] 5 11 11 19 35 NA
Offsets - lag
x - lag(x) gives you the difference between the current and previous value.
x - lag(x)
[1] NA 3 6 0 8 16
x == lag(x) tells you when the current value is the same as the previous one (so x != lag(x) tells you when it changes).
x == lag(x)
[1] NA FALSE FALSE TRUE FALSE FALSE
You can lead or lag by more than one position by using the second argument, n.
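For example:

```r
x <- c(2, 5, 11, 11, 19, 35)
lag(x, n = 2)   # shift by two positions: NA NA 2 5 11 11
```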
Consecutive identifiers
Sometimes you want to start a new group every time some event occurs. For example, when you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after a gap of more than x minutes since the last activity. For example, imagine you have the times when someone visited a website:
But how do we go from that logical vector to something that we can group_by()? cumsum(), from Section 5.11, comes to the rescue: each gap, i.e. each place where has_gap is TRUE, increments group by one:
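A self-contained sketch of the whole pipeline, using hypothetical visit times (in minutes) where a new session starts after a gap of 5 or more minutes:

```r
library(dplyr)

events <- tibble(time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30))

events |>
  mutate(
    diff = time - lag(time, default = first(time)),
    has_gap = diff >= 5,
    group = cumsum(has_gap)  # each TRUE starts a new group
  )
```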
Another approach for creating grouping variables is consecutive_id(), which starts a new group every time one of its arguments changes. For example, inspired by this stackoverflow question, imagine you have a data frame with a bunch of repeated values:
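A sketch that reproduces the output below, keeping the first row of each run of repeated x values:

```r
library(dplyr)

df <- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)

# consecutive_id(x) increments every time x changes,
# so grouping by it splits the data into runs
df |>
  group_by(id = consecutive_id(x)) |>
  slice_head(n = 1)
```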
# A tibble: 7 × 3
# Groups: id [7]
x y id
<chr> <dbl> <int>
1 a 1 1
2 b 2 2
3 c 4 3
4 d 3 4
5 e 9 5
6 a 4 6
7 b 10 7
Numeric summaries
Just using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.
Center
Function mean()
Function median()
Minimum, maximum, and quantiles
min() and max() will give you the largest and smallest values.
Another powerful tool is quantile() which is a generalization of the median:
quantile(x, 0.25) will find the value of x that is greater than 25% of the values,
quantile(x, 0.5) is equivalent to the median,
quantile(x, 0.95) will find the value that’s greater than 95% of the values.
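For example:

```r
x <- c(1, 3, 2, 2, 4, 20, 15)
# Request several quantiles at once; quantile() interpolates
# between observations when needed
quantile(x, c(0.25, 0.5, 0.95))
```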
Minimum, maximum, and quantiles
For the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.
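For example:

```r
library(tidyverse)
library(nycflights13)

# q95 is much less sensitive to extreme delays than max
flights |>
  group_by(year, month, day) |>
  summarize(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = "drop"
  )
```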
There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position:
first(x),
last(x), and
nth(x, n).
Positions
For example, we can find the first, fifth and last departure for each day:
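```r
library(tidyverse)
library(nycflights13)

flights |>
  group_by(year, month, day) |>
  summarize(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE)
  )
```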
NB: Because dplyr functions use _ to separate components of function and argument names, these functions use na_rm instead of na.rm.
With mutate()
As the names suggest, the summary functions are typically paired with summarize(). However, because of the recycling rules we discussed earlier, they can also be usefully paired with mutate(), particularly when you want to do some sort of group standardization. For example:
x / sum(x) calculates the proportion of a total.
(x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).
(x - min(x)) / (max(x) - min(x)) standardizes to range [0, 1].
x / first(x) computes an index based on the first observation.
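As an illustration of the first pattern (a sketch, using arr_delay as the hypothetical variable of interest):

```r
library(tidyverse)
library(nycflights13)

# Each flight's share of its destination's total arrival delay
flights |>
  group_by(dest) |>
  mutate(
    prop_delay = arr_delay / sum(arr_delay, na.rm = TRUE),
    .keep = "used"
  )
```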