Separating Hyperplanes

Why you should use = and never <-

Matthew Martin 2/09/2018 12:55:00 PM

In R, one of the ways to do variable assignment is like this:

x <- 42

This assigns the value 42 to the variable named x. It is directional, so you could do the same thing like this:

42 -> x

This differs from how variable assignment is done in the most commonly used programming languages today, which usually use a plain old equals sign:

x = 42

The equals sign is unidirectional, always assigning the value from the right-hand side to the variable name on the left-hand side. R actually allows this too. In fact there are, as far as I know, nine ways to make a variable assignment in R:

x <- 42
42 -> x
x <<- 42
42 ->> x
x = 42
assign("x", 42)
`=`("x", 42)
`<-`("x", 42)
`<<-`("x", 42)

These all make a variable assignment, but they are not equivalent ways of doing so. The <<- notation, for example, assigns in an enclosing scope rather than the current one: it rebinds an existing variable of that name in a parent environment or, if there is none, creates it in the global environment, so the variable is accessible outside the function in which it was assigned. That's bad though, don't do that.
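To see why <<- reaches outside the function, here is a minimal sketch. The names counter and increment are made up for illustration and do not come from any package.

```r
# Hypothetical example: `counter` and `increment` are illustrative names.
counter = 0

increment = function() {
    # <<- skips the function's own scope and rebinds `counter`
    # in the enclosing environment where it was first defined
    counter <<- counter + 1
    invisible(NULL)
}

increment()
increment()
print(counter)
```

Each call changes a variable the function never received as an argument, which is exactly the kind of hidden side effect that makes <<- hard to reason about.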

Everyone agrees that all of these other than =, <-, and -> are bad. But R users disagree about whether to use the equals sign or the arrows. It has become something of a convention to use the arrows, and even Google has put out guidance saying you should always use the arrows instead of the equals sign in R. This guidance is wrong.

This is the wrongest Google has ever been about anything.

Originally R only had the arrow-style assignment operators and did not allow equals signs for assignment; they were used only in the equality comparison (==) and for passing objects as arguments to functions. S, R's predecessor language, had borrowed this from the even older APL language, which had its own keyboards with a dedicated <- key, so that it was just one keystroke instead of two. In 2001, R began allowing = for assignment to make it easier for programmers coming from other languages, since most of them were more familiar with this notation. To make it easier for programmers. Here's Stack Overflow's list of programming languages that are more popular than R:

I'll set aside that somehow assembly language edged R out in popularity. All of the other languages use = for assignment, and while C and Objective-C have a bunch of other kinds of assignment operators, none of them use R's <-. The first, best, and only reason you actually need to favor = over <- is that it is easier. No one who is new to R is familiar with <-. No one.

What makes this quirk in R conventions more difficult for those coming from other, more popular languages is the fact that in many other respects R resembles the language they're coming from. Consider the immediately-invoked function expression:

(function(){
    return(25*25);
})();

This anonymous function, which simply returns 625, actually runs in both R and JavaScript. Wrapping a chunk of code in an anonymous function that is immediately invoked is a very common construct in JavaScript, used to encapsulate blocks of code so that they don't poison the namespace, and it is a technique that R users would greatly benefit from adopting. It is confusing for a language so similar to other C-like languages to have a totally bonkers symbol for the very basic task of assigning variables.
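As a rough sketch of what encapsulation buys you in R, the block below uses made-up names (area, scratch.width, scratch.height) to show that variables created inside the immediately-invoked function never reach the calling scope.

```r
# Hypothetical names: `area`, `scratch.width`, `scratch.height`.
area = (function() {
    scratch.width = 25     # exists only inside the anonymous function
    scratch.height = 25
    scratch.width * scratch.height
})()

print(area)
# The intermediate variables did not leak into the calling scope
print(exists("scratch.width"))
```

Base R's local() gives a similar effect, but the function form has the advantage of looking the same as it does in JavaScript.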

In addition to being more familiar to literally everyone, = is also easier to type now that no one has APL keyboards.

One argument against using = is that R uses that operator to pass named parameters to functions, an operation that is not the same thing as variable assignment, like so:

my.function(firstparam="foo", secondparam="bar")

It's true that this is not literally the same operator, and also true that in other languages, different symbols are used for each. In C#, for example, : is used to pass a named parameter in a method call, while = is used for variable assignment. I would argue, however, that using the same symbol for each is good because users of the language should be blind to the difference: we should act as if passing a named argument to a function is assigning a variable. Especially in the case of R. Unlike C#, R allows passing arbitrary named arguments to functions, that is, arguments not named in the function signature. Let's compare. The C# approach to passing an arbitrary number of parameters is to use the params keyword and pass an array of arbitrary length, like so:

public static void myFunction(params string[] args){
    for(int i=0; i < args.Length; i++){
        Console.WriteLine(args[i]);
    }
}
myFunction("foo","bar");

We can pass an arbitrary number of parameters in the args position, but there is no mechanism for naming these parameters in C#. We can actually give them variable names in R though:

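my.function=function(...){
    # collect every argument passed through ... into a named list
    args=list(...)
    arg.names=names(args)
    # seq_along handles the zero-argument case without a separate length check
    for(i in seq_along(args)){
        # [[i]] extracts the value itself rather than a one-element list
        print(paste0(arg.names[i],": ",args[[i]]))
    }
}
my.function(foo="foo value", bar="bar value")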

We cannot pass these arguments using <- instead of = because of the way the arrow operator evaluates: it will assign the value to foo in the calling scope and then pass that value to the function, and the name "foo" will not be accessible from inside the function, which is never ever the behavior we'd want. The expected behavior of the named parameters in the function call is exactly like assigning a variable within the function scope (also note another subtle difference from C#: in R, altering an object within the function scope doesn't mutate the object outside the function scope). This isn't some crank example I dredged up from the unexplored depths of the R universe either. This usage is now central to R syntax due to prominent packages like dplyr which rely on it. Consider for example
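The difference is easy to check with a small sketch. The helper arg.names.seen is a made-up function that just reports the names it receives through the dots.

```r
# Hypothetical helper: returns the argument names it receives through `...`
arg.names.seen = function(...) {
    names(list(...))
}

with.equals = arg.names.seen(foo = "foo value")
with.arrow = arg.names.seen(bar <- "bar value")

print(with.equals)          # the name "foo" reaches the function
print(is.null(with.arrow))  # with the arrow, no name reaches the function
print(exists("bar"))        # the arrow created `bar` in the calling scope instead
```

With =, the name travels with the value into the function. With <-, the function sees an unnamed value and the caller's workspace gains a variable it never asked for.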

library(tidyverse)

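# stat="identity" plots the precomputed quantity column;
# the default stat counts rows and rejects a y aesthetic
data.df %>%
    group_by(state) %>%
    summarize(quantity=n()) %>%
    ggplot(aes(x=state,y=quantity))+
    geom_bar(stat="identity")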

In the very popular dplyr package, passing arbitrary named parameters features prominently in functions like mutate and summarize, and this behavior is crucial to allowing the rest of the pipeline to work. We can see that here in the ggplot call, which needs to use the name of the parameter passed to summarize. Here's an even more basic scenario:

data.df=data.frame( state=as.factor(statevector), cumulative.quantity=cumsum(quantityvector))

Given that this is the predominant style of coding in the R universe, it makes no sense to continue making a visual distinction between variable assignment and passing a named parameter in a function call. The difference matters only to the internal mechanics and should be abstracted away from R users.

The equals sign is also just plain safer than the arrow operator. A common programming mistake is to use the assignment operator = instead of the comparison operator == in a conditional, which in a lot of languages leads to the program quietly misbehaving rather than raising an error. Not R though:

if(x=5){
    print("equals five")
}

This leads to an error instead of assigning 5 to x, which is good because in R any number other than 0 will evaluate to TRUE. However, R will quietly fail without warning if you use the arrow assignment operator in a conditional:

if(x<-5){
    print("Less than negative five")
}
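Here is a minimal sketch of both behaviors side by side. The variable names branch.taken and equals.attempt are made up, and the = version is passed to parse() as text because it would otherwise stop the whole script from parsing.

```r
x = -10
branch.taken = FALSE

# Intended as the comparison x < -5, but parsed as the assignment x <- 5
if (x<-5) {
    branch.taken = TRUE
}

print(x)             # x has been overwritten by the conditional
print(branch.taken)  # the branch ran because a nonzero number counts as TRUE

# The same slip with = never gets past the parser
equals.attempt = try(parse(text = "if (x = 5) { 1 }"), silent = TRUE)
print(inherits(equals.attempt, "try-error"))
```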

It will also quietly fail without warning if you use the arrow operator to pass function parameters:

data.df <- data.frame( state <- as.factor(statevector), cumulative.quantity <- cumsum(quantityvector))

This will return a data frame whose column names are garbage instead of warning you that you made a mistake.
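A small sketch with made-up sample data shows what comes back. The contents of statevector and quantityvector are invented here, and good.df and bad.df are illustrative names.

```r
# Made-up sample data; the vector names follow the snippet above
statevector = c("NY", "NY", "CA")
quantityvector = c(10, 20, 30)

good.df = data.frame(state = as.factor(statevector),
                     cumulative.quantity = cumsum(quantityvector))
bad.df = data.frame(state <- as.factor(statevector),
                    cumulative.quantity <- cumsum(quantityvector))

print(names(good.df))
print(names(bad.df))   # names generated from the text of each expression
# The arrows also created variables in the calling scope
print(exists("cumulative.quantity"))
```

Any later code that refers to the state column by name will not find it in the second data frame.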

I'll end with this parting note. Take a look at that last conditional statement above. Because the arrow operator is a two-part token, it is ambiguous. Did you mean assignment x=5 or comparison x < -5? The hierarchy R goes through to resolve these ambiguities is incredibly obscure and doesn't always resolve in the ways users would expect. Not only should we never use the arrow operator, we need to actually ban it from the language completely. Enough is enough.
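One way to see how R reads the two spellings is to inspect the unevaluated expressions with quote(). This is only a sketch, and arrow.call and compare.call are made-up names.

```r
# quote() returns the parsed expression without running it;
# the first element of a call is the function being invoked
arrow.call = quote(x<-5)
compare.call = quote(x < -5)

print(as.character(arrow.call[[1]]))    # which operator x<-5 was parsed as
print(as.character(compare.call[[1]]))  # which operator x < -5 was parsed as
```

The only thing separating an overwrite of x from a harmless comparison is a single space.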