A week ago, my subscription to the Apache Spark YouTube channel recommended a Bay Area meetup video, and it captured my attention immediately because it was about Spark + R. I watched the full video and headed to the project site.

“SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.” — SparkR authors

Hmm, sounds promising. I thought I had to give it a try, so I went to install the dependencies for the “devtools” package in R:

sudo yum install openssl-devel
sudo yum install libcurl4-gnutls-dev

I also had to upgrade R from 3.0.1 to 3.0.2, which was as simple as sudo yum upgrade R, since R 3.0.1 doesn’t support “devtools”.
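A one-liner worth running to double-check the upgrade from inside R (not part of the original steps, just a sanity check):

R.version.string   # should now report "R version 3.0.2 ..."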

Then I followed the instructions and ran:

library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")

After a 10-minute wait, hooray, the installation succeeded. But… I got this error when initializing the Spark context:

> library(SparkR)
Loading required package: rJava
[SparkR] Initializing with classpath /usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar
> sc <- sparkR.init("local")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.NoClassDefFoundError: scala.collection.immutable.Vector

I had no clue, and since SparkR is so new that there isn’t much about it on Google, I posted the issue on GitHub.

The next morning came a SURPRISE: both creators had replied, and the problem was solved by one simple command:

sudo -E R CMD javareconf
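That command re-runs R’s Java configuration so rJava picks up the right JVM settings. A minimal way to check that rJava can now start a JVM (my own quick test, not from the SparkR instructions):

library(rJava)
.jinit()   # returns 0 when the JVM starts successfully
.jcall("java/lang/System", "S", "getProperty", "java.version")   # shows which Java version rJava is using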

And everything started to work like a charm :) First the context initializes cleanly, and then the classic word count example runs over the R 3.1.0 README:

> library(SparkR)
Loading required package: rJava
[SparkR] Initializing with classpath /usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar

> sc <- sparkR.init("local")
14/04/20 07:45:54 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127..........; using 192.......... instead (on interface eth1)
14/04/20 07:45:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/04/20 07:45:56 INFO Slf4jLogger: Slf4jLogger started
> lines <- textFile(sc, "/home/cloudera/Desktop/R/R-3.1.0/README")
> words <- flatMap(lines,
+                  function(line) {
+                    strsplit(line, " ")[[1]]
+                  })
> wordCount <- lapply(words, function(word) { list(word, 1L) })
> counts <- reduceByKey(wordCount, "+", 2L)
> output <- collect(counts)
14/04/20 07:46:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/20 07:46:30 WARN LoadSnappy: Snappy native library not loaded
14/04/20 07:46:30 INFO FileInputFormat: Total input paths to process : 1
> for (wordcount in output) {
+   cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
+ }
problems :  1 
modify :  1 
superficially :  1 
manual :  2 
...
...
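Since collect() brings the pairs back as a plain R list, you can post-process them with base R on the driver. For example (my own addition, assuming the output list from the session above), printing the most frequent words:

# turn the list of (word, count) pairs into a named vector and sort it
counts_vec <- sapply(output, function(p) p[[2]])
names(counts_vec) <- sapply(output, function(p) p[[1]])
head(sort(counts_vec, decreasing = TRUE))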

Spark and R share a lot of similarities: they are two of the most powerful in-memory analytics tools, even though they were made for totally different purposes. Both the SparkR and jvmr packages aim at combining Spark and R, but the difference between running SparkR and calling the jvmr (rJava) package from Spark is fundamental, and they excel in different scenarios.
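To make the similarity concrete, here is the same word count in plain, single-machine R (a sketch for comparison, not SparkR code); the SparkR version above mirrors these base-R idioms almost one-to-one, just distributed across a cluster:

lines <- readLines("/home/cloudera/Desktop/R/R-3.1.0/README")   # same file as above
words <- unlist(strsplit(lines, " "))    # what flatMap did across partitions
counts <- table(words)                   # what lapply + reduceByKey did
head(counts)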


Published

20 April 2014