Actually, if you require Scala 2.11 and Spark 2.1 you have to use the current master (0.13.0 does not support these), and you also can’t use sbt, unless you have some trick I haven’t discovered.
On Oct 3, 2017, at 12:55 PM, Pat Ferrel <***@occamsmachete.com> wrote:
I’m the aforementioned pferrel
@Hoa, thanks for that reference, I forgot I had that example. First, don’t use the Hadoop part of Mahout; it is not supported and will be deprecated. The Spark version of cooccurrence will be supported; you’ll find it in the SimilarityAnalysis object.
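For anyone who wants a concrete starting point, here is a rough sketch of calling it through the Spark bindings. The reader helper, the context setup, and the "purchases.tsv" input path are my assumptions, so check them against the Mahout version you build (and against the 3-input-cooc example below):

// Rough sketch only -- helper names are assumptions to verify against your Mahout build.
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset
import org.apache.mahout.sparkbindings._

// Mahout-aware Spark context; the sparkbindings package object provides mahoutSparkContext.
implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "cooc-sketch")

// Read (user, item) interaction elements from a delimited file into an IndexedDataset.
val purchases: IndexedDataset =
  SparkEngine.indexedDatasetDFSReadElements("purchases.tsv")

// LLR-based cooccurrence indicators; pass several datasets to also get cross-cooccurrence.
val indicators = SimilarityAnalysis.cooccurrencesIDSs(Array(purchases))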
If you go back to the last release you should be able to make https://github.com/pferrel/3-input-cooc work with version updates to Mahout 0.13.0 and its dependencies. To use the latest master of Mahout, there are the problems listed below.
I’m having a hard time building with sbt against the mahout-spark module after building the latest Mahout master with `mvn clean install`. That puts the mahout-spark module in the local ~/.m2 Maven cache, but the resulting path and filenames don’t match what sbt expects.
The build.sbt `libraryDependencies` line *should* IMO be:
`"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`
This is parsed by sbt (which appends the Scala suffix because of `%%`) to yield the path:
org/apache/mahout/mahout-spark-2.1_2.11/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
Unfortunately, the output of `mvn clean install` currently is (I think):
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar
I can’t find a way to make SBT parse that structure and name.
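One workaround that might be worth trying, sketched below with coordinates I am guessing from the path above (artifactId mahout-spark, classifier spark_2.1), so verify them against what actually lands in ~/.m2:

// build.sbt sketch of a possible workaround, not a confirmed fix.
// Assumes the locally installed artifact really is
// ~/.m2/repository/org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar
resolvers += Resolver.mavenLocal

libraryDependencies +=
  "org.apache.mahout" % "mahout-spark" % "0.13.1-SNAPSHOT" classifier "spark_2.1"

// If sbt still refuses to resolve it, copying the jar into the project's lib/
// directory also works, since sbt picks up lib/ as unmanaged jars.

If the classifier trick doesn’t work against mavenLocal, the lib/ fallback is the blunt but reliable option.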
On Oct 2, 2017, at 11:02 PM, Trevor Grant <***@gmail.com> wrote:
Code pointer:
https://github.com/rawkintrevo/cylons/tree/master/eigenfaces
However, I built Mahout (0.13.1-SNAPSHOT) locally with
mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests
That's how Maven was able to pick those up.
Post by Hoa Nguyen
Hey all,
Thanks for the offers of help. I've been able to narrow down some of the
problems to version incompatibility and I just wanted to give an update.
Just to backtrack a bit, my initial goal was to run Mahout on a distributed cluster, whether that was running Hadoop MapReduce or Spark. I started out trying to get it to run on Spark, with which I have some familiarity, but that didn't seem to work. While the error messages seemed to indicate there weren't enough resources on the workers ("WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"), I'm pretty sure that wasn't the case: not only is it a 4-node cluster of m4.xlarges, but I was also able to run another, simpler Spark batch job on that same distributed cluster.
After a bit of wrangling, I was able to narrow down some of the issues. It turns out I was kind of blindly using this repo, https://github.com/pferrel/3-input-cooc, as a guide without fully realizing that it was from several years ago and based on Mahout 0.10.0, Scala 2.10 and Spark 1.1.1. That is significantly different from my environment, which has Mahout 0.13.0 and Spark 2.1.1 installed, which also means I have to use Scala 2.11. After modifying the build.sbt file to account for those versions, I now have compile-time type mismatch errors that I'm just not savvy enough to fix (see attached screenshot if you're interested).
Anyway, the good news is that I was finally able to get Mahout code running on Hadoop MapReduce, though also after a bit of wrangling. It turned out my instances were running Ubuntu 14 and apparently that doesn't play well with Hadoop 2.7.4, which prevented me from running any sample Mahout code (from here: https://github.com/apache/mahout/tree/master/examples/bin) that relied on MapReduce. Those problems went away after I installed Hadoop 2.8.1 instead. Now I'm able to get the shell scripts running on a distributed Hadoop cluster (yay!).
Anyway, if anyone can point me to more recent, working Spark Scala code that uses Mahout, I'd appreciate it.
Many thanks!
Hoa
Post by Trevor Grant
Hi Hoa,
A few things could be happening here; I haven't run across that specific error.
1) Spark 2.x - Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x; however, you need to build it from source (not the binaries). You can do this with
mvn clean install -Pspark-2.1,scala-2.11 -DskipTests
2) Have you set up Spark with Kryo serialization? How you do this depends on whether you're in the shell/Zeppelin or using spark-submit; a sketch follows below.
However, for both of these cases it shouldn't even have run locally AFAIK, so the fact that it did tells me you probably have gotten this far?
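For reference, here's a minimal sketch of the Kryo setup from Scala code; the registrator class name is the one shipped in the Spark bindings, but double-check it against the version you built:

// Minimal sketch of the Kryo settings Mahout's Spark bindings expect.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mahout-kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
    "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")

// With spark-submit, the same settings go on the command line:
//   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
//   --conf spark.kryo.registrator=org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator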
Assuming you've done 1 and 2, can you share some code? I'll see if I can
recreate on my end.
Thanks!
tg
Post by Hoa Nguyen
I apologize in advance if this is too much of a newbie question, but I'm having a hard time running any Mahout example code in a distributed Spark cluster. The code runs as advertised when Spark is running locally on one machine, but the minute I point Spark to a cluster and master URL, I can't get anything to run; all I get is "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory".
I know my Spark cluster is configured and working correctly because I ran non-Mahout code and it runs on a distributed cluster fine. What am I doing wrong? The only thing I can think of is that my Spark version is too recent -- 2.1.1 -- for the Mahout version I'm using -- 0.13.0. Is that it or am I doing something else wrong?
Thanks for any advice,
Hoa