I never liked this, because the thing I was for-eaching over had to be the first input of the function, and any extra arguments had to be added separately after it. For example, if you want base 10 logs1 of a bunch of numbers:
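Something like this, say (assuming the tidyverse, including purrr, is loaded, and making up some numbers to take logs of):
x <- c(2, 5, 10)
map_dbl(x, log, 10)
The lambda-function notation lets you write the call out yourself instead. To take the square root of each of the numbers 1 through 10, for example:
map_dbl(1:10, ~sqrt(.))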
I would read this to myself in English as “for each thing in 1 through 10, work out the square root of it”, where ~ was read as “work out” and . (or .x if you prefer) was read as “it”.
You can also create a new column of a dataframe this way:
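For example, something like this (the dataframe and column names are my own):
d <- tibble(x = 1:10)
d %>% mutate(root = map_dbl(x, ~sqrt(.)))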
This is a little odd, for learners, because the thing inside the sqrt is crying out to be called x. I still think this is all right: “for each thing in x, work out the square root of it”, in the same way that you would use i as a loop index in a for loop.
The log examples both work more smoothly this way:
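For instance, the base 10 logs again, and then (making up an example) the log of 100 to each of several bases, where the “it” is now the second input to log:
x <- c(2, 5, 10)
map_dbl(x, ~log(., 10))
base <- c(2, exp(1), 10)
map_dbl(base, ~log(100, .))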
without the need to handle additional inputs specially, and without the requirement to have the “it” be the first input to the function. The call to the function looks pretty much the same as it does when you call it outside a map, which makes it easier to learn.
Method 3: anonymous functions
A third way of specifying what to “work out” is to use an “anonymous function”: a function, typically a one-liner, defined inline without a name, written with the backslash shorthand that is new in R 4.1. This is how it goes:
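For the square roots of the numbers 1 through 10, it might look like this:
map_dbl(1:10, \(x) sqrt(x))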
This one, to my mind, is not any clearer than the “work out” notation with a squiggle, though you can still cast your eyes over it and read “for each thing in 1 through 10, work out the square root of it” with a bit of practice.
This notation wins where the input things have names:3
number <- 1:10
map_dbl(number, \(number) sqrt(number))
The clarity comes from the ability to use the name of the input column also as the name of the input to the anonymous function, so that everything joins up: “for each thing in x, work out the square root of that x”.4
This also works if you are for-eaching over two columns, for example working out logs of different numbers to different bases:
x <- 2:4
base <- c(2, exp(1), 10)
base
[1] 2.000000 2.718282 10.000000
crossing (from tidyr) makes a dataframe out of all combinations of its inputs, and so:
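Something like this (the name of the new column is my choice):
crossing(x, base) %>%
  mutate(log_of_x = map2_dbl(x, base, \(x, base) log(x, base)))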
This doesn’t only apply to making dataframe columns, but again works nicely any time the input things have names:
u <- 1:5
v <- 11:15
map2_dbl(u, v, \(u, v) sqrt(u + v))
[1] 3.464102 3.741657 4.000000 4.242641 4.472136
Collatz
When I am teaching this stuff, I say that if the thing you are working out is complicated, write a function to do that first, and then worry about for-eaching it. For example, imagine you want a function that takes an integer as input, and the output is:
if the input is even, half the input
if the input is odd, three times the input plus one
This is a bit long to put in the anonymous function of a map, so we’ll define a function hotpo to do it first:5
hotpo <- function(x) {
  stopifnot(x == round(x)) # error out if input is not an integer
  if (x %% 2 == 0) {
    ans <- x %/% 2
  } else {
    ans <- 3 * x + 1
  }
  ans
}
hotpo(4)
[1] 2
hotpo(3)
[1] 10
hotpo(5.6)
Error in hotpo(5.6): x == round(x) is not TRUE
So now, we can use a map to work out hotpo of each of the numbers 1 through 6:
first <- 1:6
map_int(first, hotpo)
[1] 4 1 10 2 16 3
or
map_int(first, ~hotpo(.))
[1] 4 1 10 2 16 3
or
map_int(first, \(first) hotpo(first))
[1] 4 1 10 2 16 3
where we call our function in the anonymous function. The answer is the same any of these ways, and you can reasonably argue that the last one is the clearest because the inputs to the map_int and the function have the same name.
This one is map_int because hotpo returns a whole number.
This function is actually more than a random function defined on integers; it is part of an open problem in number theory called the Collatz conjecture. The idea is that if you do this:
10
[1] 10
hotpo(10)
[1] 5
hotpo(hotpo(10))
[1] 16
hotpo(hotpo(hotpo(10)))
[1] 8
hotpo(hotpo(hotpo(hotpo(10))))
[1] 4
hotpo(hotpo(hotpo(hotpo(hotpo(10)))))
[1] 2
hotpo(hotpo(hotpo(hotpo(hotpo(hotpo(10))))))
[1] 1
you obtain a sequence of integers. If you ever get to 1, you’ll go back to 4, 2, 1 and loop forever, so we’ll say the sequence ends when it gets to 1. The Collatz conjecture says that, no matter where you start, you will always get there eventually.6
Let’s assume that we are going to get to 1, and write a function to generate the whole sequence. The two key ingredients are: the hotpo function we wrote, and a while loop to keep going until we do get to 1:
hotpo_seq <- function(x) {
  ans <- x
  while (x != 1) {
    x <- hotpo(x)
    ans <- c(ans, x)
  }
  ans
}
and test it:
hotpo_seq(10)
[1] 10 5 16 8 4 2 1
the same short ride that we had above, and a rather longer one:
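Starting at 7, say:
hotpo_seq(7)
[1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1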
Now, let’s suppose that we want to make a dataframe with the sequences for the starting points 1 through 10. Each sequence is a vector rather than a single number, so we need to do this with map:7
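Something like this, where I’m calling the dataframe d and the list-column sequence:
d <- tibble(start = 1:10) %>%
  mutate(sequence = map(start, \(start) hotpo_seq(start)))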
and we have made a list-column. You can see by the lengths of the vectors in the list-column how long each sequence is.8 We might want to make explicit how long each sequence is, and how high it goes:
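One way, sticking with the names from my sketch above, is more map, with map_int for the length and map_dbl for the highest value:
d %>%
  mutate(length = map_int(sequence, \(sequence) length(sequence)),
         highest = map_dbl(sequence, \(sequence) max(sequence)))
# A tibble: 10 × 4
   start sequence   length highest
   <int> <list>      <int>   <dbl>
 1     1 <int [1]>       1       1
 2     2 <int [2]>       2       2
 3     3 <dbl [8]>       8      16
 4     4 <int [3]>       3       4
 5     5 <dbl [6]>       6      16
 6     6 <dbl [9]>       9      16
 7     7 <dbl [17]>     17      52
 8     8 <int [4]>       4       8
 9     9 <dbl [20]>     20      52
10    10 <dbl [7]>       7      16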
The sequence starting at 7 does indeed have a length of 17 and goes up as high as 52 before coming back down to 1.
Keeping and discarding by name
We don’t have to make a dataframe of these (though that, these days, is usually my preferred way of working). We can instead put the sequences in a list. This one is a “named list”, with each sequence paired with its starting point (its “name”):
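One way to build such a list (I’m calling it sequences; set_names comes with purrr):
sequences <- map(1:10, \(start) hotpo_seq(start)) %>%
  set_names(1:10)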
If these were in a dataframe as above, a filter would pick out the sequences for particular starting points. As an example, we will pick out the sequences for odd-numbered starting points. With a list, this gives us a chance to learn about the new keep_at and discard_at. There are already keep and discard,9 which select by value, but the new ones select by name.
There are different ways to use keep_at, but one is to write a function that accepts a name and returns TRUE if that is one of the names you want to keep. Mine is below. The names are text, so I convert the name to an integer and then test it for oddness as we did in hotpo:
# keep the sequences for odd-numbered starting points
is_odd <- function(x) {
  x <- as.integer(x)
  x %% 2 == 1
}
is_odd(3)
[1] TRUE
is_odd(4)
[1] FALSE
and now I keep the sequences that have odd starting points thus:
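Using the sequences list and the is_odd function from above:
keep_at(sequences, is_odd)
$`1`
[1] 1
$`3`
[1] 3 10 5 16 8 4 2 1
$`5`
[1] 5 16 8 4 2 1
$`7`
[1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`9`
[1] 9 28 14 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1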
Final thoughts
I have long been a devotee of the lambda-function notation with a map:
x <- 1:5
map_dbl(x, ~sqrt(.))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
but I have always had vague misgivings about teaching this, because it is not immediately obvious why the thing inside sqrt is not also x. The reason, of course, is the same as this in Python:
x = ['a', 'b', 'c']
for i in x:
    print(i)
a
b
c
where i stands for “the element of x that I am currently looking at”, but it takes a bit of thinking for the learner to get to that point.
Using the anonymous function approach makes things a bit clearer:
x <- 1:5
map_dbl(x, \(x) sqrt(x))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
where x appears three times in the map: first as the vector of values whose square roots we want, then as the argument of the anonymous function, and then as the input to sqrt, so that everything appears to line up.
But there is some sleight of hand here: the meaning of x actually changes as you go along! The first x is a vector, but the second and third x values are numbers, elements of the vector x. Maybe this is all right, because we are used to treating vectors elementwise in R:
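A small example of the kind of thing I mean:
tibble(x = 1:5) %>% mutate(root = sqrt(x))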
Functions like sqrt are vectorized, so the mutate really means something like “take the elements of x one at a time and take the square root of each one, gluing the result back together into a vector”. So, in the grand scheme of things, I am sold on the (new) anonymous function way of running map, and I think I will be using this rather than the lambda-function way of doing things in the future.
Now, if you’ll excuse me, I have to attend to all the times I’ve used map in my lecture notes!
Footnotes
R’s log function has two arguments: the number whose log you want, and then the base of the log, which defaults to e.↩︎
The logic here seems to require the vector to have a singular name.↩︎
The input to the anonymous function could be called anything, but it seems like a waste to not use the same name as the column being for-eached over.↩︎
%/% is integer division, discarding the remainder, and %% is the remainder itself. We need to be careful with the division because, for example, 4 / 2 is actually a decimal number, what we old FORTRAN programmers used to write as 2.0 or 2..↩︎
Spoiler: nobody has been able to prove that this is always true, but every starting point that has been tried gets to 1.↩︎
Using plain map means that its output will be a list, and in a dataframe will result in the new column being a list-column with something more than a single number stored in each cell.↩︎
I am a little bothered by most of them being dbl rather than int.↩︎
I must be having flashbacks of SAS, because I expected the opposite of “keep” to be “drop”.↩︎