Discussion:
Running Mahout on a Spark cluster
Hoa Nguyen
2017-09-22 02:37:52 UTC
I apologize in advance if this is too much of a newbie question, but I'm
having a hard time running any Mahout example code on a distributed Spark
cluster. The code runs as advertised when Spark is running locally on one
machine, but the minute I point Spark to a cluster and master URL, I can't
get it to work, hitting the error: "WARN scheduler.TaskSchedulerImpl:
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient memory"

I know my Spark cluster is configured and working correctly because
non-Mahout code runs fine on the same distributed cluster. What am I doing
wrong? The only thing I can think of is that my Spark version (2.1.1) is too
recent for the Mahout version I'm using (0.13.0). Is that it, or am I
doing something else wrong?

Thanks for any advice,
Hoa
Trevor Grant
2017-09-22 05:09:32 UTC
Hi Hoa,

A few things could be happening here; I haven't run across that specific
error.

1) Spark 2.x with Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x; however,
you need to build it from source (not the binaries). You can do this by
downloading the Mahout source or cloning the repo and building with:
mvn clean install -Pspark-2.1,scala-2.11 -DskipTests

2) Have you set up Spark with Kryo serialization? How you do this depends on
whether you're in the shell/Zeppelin or using spark-submit; a minimal sketch
follows below.
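
For reference, here is a minimal sketch of the Kryo setup when you build the
SparkContext yourself; the registrator class name is the one from Mahout's
Spark bindings as I remember it, and the app name and master URL are just
placeholders, so double-check everything against your build:

import org.apache.spark.{SparkConf, SparkContext}

// Tell Spark to use Kryo and to register Mahout's math types with it.
val conf = new SparkConf()
  .setAppName("mahout-kryo-example")              // placeholder app name
  .setMaster("spark://master-host:7077")          // placeholder master URL
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")

val sc = new SparkContext(conf)
// Wrap sc in a Mahout DistributedContext (e.g. via the implicits in
// org.apache.mahout.sparkbindings) before running any Mahout jobs on it.

With spark-submit or the shell you would pass the same two settings as --conf
flags instead of setting them in code.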

However, in both of these cases it shouldn't even have run locally, as far as
I know, so the fact that it did tells me you've probably gotten this far?

Assuming you've done 1 and 2, can you share some code? I'll see if I can
recreate on my end.

Thanks!

tg
Hoa Nguyen
2017-09-23 03:06:01 UTC
Hey all,

Thanks for the offers of help. I've been able to narrow some of the
problems down to version incompatibilities, and I just wanted to give an update.
To backtrack a bit, my initial goal was to run Mahout on a
distributed cluster, whether that meant Hadoop MapReduce or Spark.

I started out trying to get it to run on Spark, with which I have some
familiarity, but that didn't seem to work. While the error message seems to
indicate there weren't enough resources on the workers ("WARN
scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory"), I'm pretty sure that wasn't the case: not only is
it a 4-node cluster of m4.xlarges, but I was also able to run another, simpler
Spark batch job on that same distributed cluster.

After a bit of wrangling, I was able to narrow down some of the issues. It
turns out I was somewhat blindly using this repo,
https://github.com/pferrel/3-input-cooc, as a guide without fully realizing
that it is from several years ago and based on Mahout 0.10.0, Scala 2.10,
and Spark 1.1.1. That is significantly different from my environment, which
has Mahout 0.13.0 and Spark 2.1.1 installed, which also means I have to use
Scala 2.11. After modifying the build.sbt file to account for those
versions, I now have compile-time type mismatch errors that I'm just not
savvy enough to fix (see the attached screenshot if you're interested).

Anyway, the good news is that I was finally able to get Mahout code running on
Hadoop MapReduce, though also after a bit of wrangling. It turned out my
instances were running Ubuntu 14, and apparently that doesn't play well with
Hadoop 2.7.4, which prevented me from running any of the sample Mahout code
(from here: https://github.com/apache/mahout/tree/master/examples/bin) that
relied on MapReduce. Those problems went away after I installed Hadoop
2.8.1 instead. Now I'm able to get the shell scripts running on a
distributed Hadoop cluster (yay!).

Anyway, if anyone can point me to more recent, working Spark/Scala code that
uses Mahout, I'd appreciate it.

Many thanks!
Hoa
Trevor Grant
2017-10-03 06:00:30 UTC
Hey, sorry for the long delay; I've been traveling.

Pat Ferrel was telling me he was having some similar issues with
Spark+Mahout+SBT recently, and that we need to re-examine our naming
conventions for JARs.

FWIW, I have several projects that use Spark+Mahout on Spark 2.1/Scala 2.11,
and we even test this in our Travis CI builds, but the trick is that we use
Maven for the build. Any chance you could use Maven? If not, maybe Pat can
chime in here; I'm just not an SBT user, so I'm not 100% sure what to tell
you.
Trevor Grant
2017-10-03 06:02:56 UTC
Code pointer:
https://github.com/rawkintrevo/cylons/tree/master/eigenfaces

However, I build Mahout (0.13.1-SNAPSHOT) locally with

mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests

That's how Maven was able to pick those up.
Pat Ferrel
2017-10-03 19:55:57 UTC
I'm the aforementioned pferrel.

@Hoa, thanks for that reference; I forgot I had that example. First, don't use the Hadoop part of Mahout: it is not supported and will be deprecated. The Spark version of cooccurrence will be supported; you'll find it in the SimilarityAnalysis object.
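
Roughly, the call looks like the sketch below; the method and argument names are from memory of the 0.13.x SimilarityAnalysis API, so treat it as approximate and check the Scaladoc, and the two IndexedDatasets are assumed to have been read from your own interaction logs (e.g. with IndexedDatasetSpark):

import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset

// Assume purchases and views are IndexedDatasets built from your event logs,
// with the primary action (purchases) first and any secondary actions after it.
def cooccurrenceIndicators(purchases: IndexedDataset,
                           views: IndexedDataset): List[IndexedDataset] = {
  // Computes LLR-thresholded cooccurrence for the primary action and
  // cross-cooccurrence for each secondary action, using the default limits.
  SimilarityAnalysis.cooccurrencesIDSs(Array(purchases, views))
}

// The first element is the item-item indicator matrix for purchases; the second
// holds the purchase/view cross-cooccurrence indicators. Write them out as needed.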

If you go back to the last release, you should be able to make https://github.com/pferrel/3-input-cooc work with version updates to Mahout 0.13.0 and its dependencies. To use the latest master of Mahout, there are the problems listed below.


I'm having a hard time building with sbt against the mahout-spark module when I build the latest Mahout master with `mvn clean install`. That puts the mahout-spark module in the local ~/.m2 Maven cache, but the structure doesn't match the path and filenames SBT expects.

The build.sbt `libraryDependencies` line *should* IMO be:
`"org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT"`

This is parsed by sbt to yield the path:
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

Unfortunately, the outcome of `mvn clean install` currently is (I think):
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

I can’t find a way to make SBT parse that structure and name.


Pat Ferrel
2017-10-03 19:58:47 UTC
Actually, if you require Scala 2.11 and Spark 2.1, you have to use the current master (0.13.0 does not support these) and also can't use sbt, unless you have some trick I haven't discovered.


Trevor Grant
2017-10-03 20:26:27 UTC
The Spark version is specified via a Maven classifier; the sbt line should be:

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
  "0.13.1-SNAPSHOT" classifier "spark_2.1"
Pat Ferrel
2017-10-03 21:12:12 UTC
Thanks Trevor,

This encoding leaves the Scala version hard-coded, but it's an appreciated clue and will get me going. There may be a way to use %% with this, or I can just explicitly add the Scala version string; a rough sketch of both forms is below.

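Roughly, the two forms would look like this in build.sbt (assuming the snapshot is in the local Maven cache from `mvn clean install`; the Scala patch version is just an example, and the %% form is untested on my side):

scalaVersion := "2.11.11" // example patch version

resolvers += Resolver.mavenLocal

// Form 1: Trevor's line, with the Scala suffix hard-coded in the artifact name.
libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
  "0.13.1-SNAPSHOT" classifier "spark_2.1"

// Form 2 (untested): %% appends the suffix from scalaVersion automatically,
// so it should resolve to the same mahout-spark_2.11 artifact. Use one form
// or the other, not both.
libraryDependencies += "org.apache.mahout" %% "mahout-spark" %
  "0.13.1-SNAPSHOT" classifier "spark_2.1"
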
@Hoa, I plan to update that repo.

