Discussion:
spark-itemsimilarity out of memory problem
AlShater, Hani
2014-12-22 07:44:23 UTC
Hi All,

I am trying to use spark-itemsimilarity on a dataset of 160M user interactions.
The job launches and runs successfully on a small dataset of 1M actions.
However, when I try the larger dataset, some Spark stages repeatedly
fail with an out-of-memory exception.

I tried changing spark.storage.memoryFraction from the Spark default
configuration, but I hit the same issue again. How should I configure Spark
when using spark-itemsimilarity, or how else can I overcome this out-of-memory
issue?

Can you please advise?

Thanks,
Hani.

Hani Al-Shater | Data Science Manager - Souq.com <http://souq.com/>
Mob: +962 790471101 | Phone: +962 65821236 | Skype:
***@outlook.com | ***@souq.com <***@souq.com> |
www.souq.com
Nouh Al Romi Street, Building number 8, Amman, Jordan
--
*Download free Souq.com <http://souq.com/> mobile apps for iPhone
<https://itunes.apple.com/us/app/id675000850>, iPad
<https://itunes.apple.com/ae/app/souq.com/id941561129?mt=8>, Android
<https://play.google.com/store/apps/details?id=com.souq.app> or Windows
Phone
<http://www.windowsphone.com/en-gb/store/app/souq/63803e57-4aae-42c7-80e0-f9e60e33b1bc> **and never
miss a deal! *
Pat Ferrel
2014-12-22 17:30:42 UTC
The job has an option, -sem, to set the spark.executor.memory config. You can also change the runtime job config with -D:key=value to access any of the Spark config values.
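As a sketch of how those two flags combine (the input/output paths, master URL, and memory values below are placeholders, not settings from this thread):

```shell
# Hypothetical invocation: raise executor memory with -sem and pass extra
# Spark config through with -D:key=value. Paths, master URL, and the
# specific values are placeholders -- tune them for your own cluster.
mahout spark-itemsimilarity \
  --input hdfs://namenode:8020/user/hani/interactions.csv \
  --output hdfs://namenode:8020/user/hani/indicators \
  --master spark://master:7077 \
  -sem 4g \
  -D:spark.storage.memoryFraction=0.3 \
  -D:spark.default.parallelism=400
```

This is a CLI fragment that needs a running cluster, so treat it as a template rather than something to paste verbatim.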

Ted Dunning
2014-12-22 18:11:52 UTC
Can you say what kind of cluster you have?

How many machines? How much memory? How much memory is given to Spark?
Pat Ferrel
2014-12-22 18:52:30 UTC
Hi Hani,

I recently read about Souq.com. A very promising project.

If you are looking at spark-itemsimilarity for ecommerce-type recommendations, you may be interested in some slide decks and blog posts I’ve done on the subject.
Check out:
http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/
http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/

Also I put up a demo site that uses some of these techniques: https://guide.finderbots.com

Good luck,
Pat

Pat Ferrel
2014-12-22 19:17:21 UTC
Also Ted has an ebook you can download:
mapr.com/practical-machine-learning

hlqv
2014-12-23 09:17:47 UTC
Hi Pat Ferrel,
Using the option --omitStrength produces indexable data, but doesn't this
lead to less accuracy when querying, since the similarity values between
items are omitted? Is there a way to keep these values in order to improve
accuracy in a search engine?
AlShater, Hani
2014-12-23 15:23:49 UTC
@Pat, thanks for your answers. It seems I had cloned the snapshot
before the feature for configuring Spark was added. It works now in
local mode. Unfortunately, after trying the new snapshot and Spark,
submitting to the cluster in yarn-client mode raises the following error:
Exception in thread "main" java.lang.AbstractMethodError
    at org.apache.spark.Logging$class.log(Logging.scala:52)
    at org.apache.spark.deploy.yarn.Client.log(Client.scala:39)
    at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
    at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:39)
    at org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:103)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:60)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:323)
    at org.apache.mahout.sparkbindings.package$.mahoutSparkContext(package.scala:95)
    at org.apache.mahout.drivers.MahoutSparkDriver.start(MahoutSparkDriver.scala:81)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.start(ItemSimilarityDriver.scala:128)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:211)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:116)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:114)
    at scala.Option.map(Option.scala:145)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:114)
    at org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)

and submitting in yarn-cluster mode raises this error:
Exception in thread "main" org.apache.spark.SparkException: YARN mode not available ?
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1571)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
    at org.apache.mahout.sparkbindings.package$.mahoutSparkContext(package.scala:95)
    at org.apache.mahout.drivers.MahoutSparkDriver.start(MahoutSparkDriver.scala:81)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.start(ItemSimilarityDriver.scala:128)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:211)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:116)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:114)
    at scala.Option.map(Option.scala:145)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:114)
    at org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1566)
    ... 10 more

My cluster consists of 3 nodes, and I am using Hadoop 2.4.0. I fetched Spark
1.1.0 and the Mahout snapshot, then compiled, packaged, and installed them into
the local Maven repo. Am I missing something?

Thanks again



AlShater, Hani
2014-12-23 15:39:00 UTC
@Pat, I am aware of your blog and of Ted's practical machine learning books
and webinars. I have learned a lot from you guys ;)

@Ted, it is a small 3-node cluster for a POC. The Spark executor is given 2g and
YARN is configured accordingly. I am trying to avoid Spark memory caching.

@Simon, I am using Mahout rather than plain Spark because I need similarity, not
matrix factorization. The spark-itemsimilarity approach actually
gives a good way to augment content recommendations with collaborative
features. I found this approach more suitable for building a lambda
architecture that supports recommendations based on content, collaborative
features, and recent interaction events, in addition to other injected rules.
I don't think a predefined recommendation server can fit all these requirements
at once, which is why I am trying to use Mahout.



Ted Dunning
2014-12-23 16:05:31 UTC
Have you tried the map-reduce version?
Pat Ferrel
2014-12-23 17:25:59 UTC
There is a large-ish data structure in the Spark version of this algorithm. Each slave holds a copy of several BiMaps that translate your IDs into and out of Mahout IDs: one is created for user IDs, and one for each item-ID set, so for a single action that would be two BiMaps. These are broadcast values, so enough memory must be available for them, and their size depends on how many user and item IDs you have.
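As a rough back-of-the-envelope sketch (the per-entry byte cost and the ID counts below are assumptions, not measured figures; real JVM overhead varies with ID length and the map implementation), you can estimate the broadcast footprint from your ID counts:

```shell
# Rough per-executor estimate for the broadcast ID maps.
# 100 bytes per entry is an assumed average for a string ID plus
# hash-map overhead on the JVM -- adjust for your own ID lengths.
users=5000000          # distinct user IDs (example figure)
items=1000000          # distinct item IDs (example figure)
bytes_per_entry=100
echo "$(( (users + items) * bytes_per_entry / 1024 / 1024 )) MB"
```

If a figure like this approaches the executor heap, the broadcast maps alone can leave too little room for shuffle and task data, which is consistent with the out-of-memory stages reported at the top of the thread.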
Post by AlShater, Hani
@Ted, It is 3 nodes small cluster for POC. Spark executer is given 2g and
yarn is configured accordingly. I am trying to avoid spark memory caching.
Have you tried the map-reduce version?
Pat Ferrel
2014-12-23 17:16:08 UTC
Both errors happen when the SparkContext is created under YARN. I have no experience with YARN, so I would try it in standalone cluster mode first. Then, if all is well, check this page to make sure the Spark cluster is configured correctly for YARN:
https://spark.apache.org/docs/1.1.0/running-on-yarn.html

Are you able to run the Spark examples using YARN? If so, maybe some of the YARN config needs to be passed into the SparkConf using -D:key=value.

I’m very interested in helping with this; it has to work on Hadoop+Spark+YARN, so if it looks like a change needs to be made to Mahout, I’ll try to respond quickly.

To use the Hadoop map-reduce version (Ted's suggestion) you'll lose the cross-cooccurrence indicators, and you'll have to translate your IDs into Mahout IDs yourself. This means mapping your user and item IDs into non-negative integers representing the row (user) and column (item) numbers.
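A minimal sketch of that translation (assuming a CSV of user,item,strength rows; the sample data is invented for illustration) assigns a dense integer on first sight of each ID:

```shell
# Assign dense non-negative integer IDs to users (column 1) and items
# (column 2) in arrival order. Keep or dump the awk arrays if you need
# to map the job's output back to your original IDs afterwards.
printf 'alice,ipad,1\nbob,ipad,1\nalice,iphone,1\n' |
awk -F',' '{
  if (!($1 in u))  u[$1]  = uc++    # next unused row number
  if (!($2 in it)) it[$2] = ic++    # next unused column number
  print u[$1] "," it[$2] "," $3
}'
# -> 0,0,1
#    1,0,1
#    0,1,1
```

In practice you would feed the real interaction log in place of the printf and write the ID dictionaries out so the results can be translated back.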


BTW: Spark’s Maven artifacts were built incorrectly when using Hadoop 1.2.1. This is being fixed in a future Spark version, and in any case I don’t think it affects the Hadoop 2.x versions of the Spark artifacts, so you may not need to build Spark 1.1.0 yourself.

Ted Dunning
2014-12-23 17:29:08 UTC
I don't think that I was sufficiently discouraging about the map-reduce
version. To be avoided if feasible.
Pat Ferrel
2014-12-23 16:14:28 UTC
Permalink
Why do you say it will lead to less accuracy?

The weights are LLR weights, and they are used to filter and downsample the indicator matrix. Once the downsampling is done they are not needed. When you index the indicators in a search engine they will get TF-IDF weights, and this is a good effect: it downweights very popular items, which hold little value as indicators of a user’s taste.
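For reference, these weights come from Dunning's log-likelihood ratio test on the cooccurrence counts. Below is a minimal Python sketch of the G²/LLR score for a single item pair; it mirrors the usual entropy-based formulation but is illustrative only, not Mahout's actual implementation:

```python
import math

def x_log_x(x):
    # x * ln(x), defined as 0 at x = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # "unnormalized entropy" trick common in LLR implementations:
    # xLogX(total) - sum(xLogX(k))
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 cooccurrence table:
    k11 = users who interacted with both items,
    k12 / k21 = users who interacted with only one of them,
    k22 = users who interacted with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Items that cooccur far more often than chance get a high score...
print(llr(100, 10, 10, 1000))
# ...while a table consistent with independence scores (near) zero.
print(llr(5, 5, 5, 5))
```

Downsampling then keeps only the top-scoring indicators per item and discards the long tail of low scores.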

On Dec 23, 2014, at 1:17 AM, hlqv <***@gmail.com> wrote:

Hi Pat Ferrel,
Using the option --omitStrength to output indexable data leads to less
accuracy while querying, because the similarity values between items are
omitted. Can these values be kept in order to improve accuracy in a search
engine?
hlqv
2014-12-23 17:18:55 UTC
Permalink
Thank you for your explanation.

There is a situation that I'm not clear on. I have this item-similarity
result:

iphone  nexus:1 ipad:10
surface nexus:10 ipad:1 galaxy:1

If we omit the LLR weights:
- If a user A has the purchase history 'nexus', which one should the
recommendation engine prefer - 'iphone' or 'surface'?
- If a user B has the purchase history 'ipad', 'galaxy', then I think the
recommendation engine should recommend 'iphone' instead of 'surface' (if
TF-IDF weighting is applied, the recommendation engine will return
'surface').

I really don't know whether my understanding here has some mistake.
Pat Ferrel
2014-12-23 17:42:23 UTC
Permalink
First of all you need to index that indicator matrix with a search engine. Then the query will be your user’s history. The search engine weights with TF-IDF and the query is based on cosine similarity of doc to query terms. So the weights won’t be the ones you have below, they will be TF-IDF weights. This is as expected.

In a real-world setting you will have a great deal more data than below and the downsampling, which uses the LLR weights, will take only the highest weighted items and toss the lower weighted ones so the difference in weight will not really matter. The reason for downsampling is that the lower weighted items add very little value to the results. Leaving them all in will cause the algorithm to approach O(n^2) runtime.

In short the answer to the question of how to interpret the data below is: you don’t have enough data for real-world recs. Intuitions in the microscopic do not always scale up to real-world data.
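To make the query mechanics concrete, here is a toy Python stand-in for what the search engine does (a real deployment would use Solr or Elasticsearch, whose scoring differs in detail): index each item's indicator list as a document, then score items against the user-history query with TF-IDF weighted cosine similarity. The item names reuse the toy example from earlier in the thread; the smoothing in `idf` is one common convention, not a claim about any particular engine.

```python
import math
from collections import Counter

# Indicator "documents": item -> list of indicator items (strengths omitted).
indicators = {
    "iphone":  ["nexus", "ipad"],
    "surface": ["nexus", "ipad", "galaxy"],
}

def idf(term, docs):
    # Rare indicators get boosted; ubiquitous ones are downweighted.
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0  # smoothed idf

def tfidf_vector(terms, docs):
    tf = Counter(terms)
    return {t: tf[t] * idf(t, docs) for t in tf}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(history, docs):
    # The query is simply the user's interaction history.
    q = tfidf_vector(history, docs)
    scored = {item: cosine(q, tfidf_vector(terms, docs))
              for item, terms in docs.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])

# Query with a user's history:
print(recommend(["ipad", "galaxy"], indicators))
```

With only two documents, 'nexus' and 'ipad' appear everywhere and carry the minimum weight, so 'galaxy' dominates the score, which is exactly the microscopic-data effect discussed above.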


AlShater, Hani
2015-01-04 11:12:45 UTC
Permalink
Hi Pat,

Thanks again. spark-1.1.0 works without complications and the errors are
gone. But there is still an out-of-memory problem. The error occurs when
Spark is trying to write a broadcast variable to disk. I tried to give each
executor 25 GB of memory but the same error occurs again. Also, I noticed
that when memory is increased, Spark uses only one executor instead of
multiple. And surprisingly, the out-of-memory error occurs although there
is free memory available to YARN.

Do you have examples of dataset sizes (number of items, users, actions) and
the cluster memory needed to fit them?

If I understand you correctly, there is a large broadcast variable for
mapping IDs. Is it a kind of map-side join, to map recommendation results
back to IDs? Can it be avoided using Spark joins?

Best regards,
Hani

Pat Ferrel
2015-01-04 18:33:21 UTC
Permalink
The data structure is a HashBiMap from Guava. Yes, they could be replaced with joins, but there is some extra complexity: the code would have to replace each HashBiMap with some RDD-backed collection. But if there is memory available, perhaps something else is causing the error. Let’s think this through.

Do you know the physical memory required for your user and item ID HashBiMaps? Each HashBiMap is Int <-> String. How many users and items do you have in your complete dataset? You say that the error occurs when the HashBiMap is being written to disk? It is never explicitly written; do you mean serialized, as in broadcast to another executor? But you only have one. Can you attach the logs and send them to my email address?

One problem we in Mahout have is access to large public datasets. There is no large public ecom dataset I know of with multiple actions. We use the epinions dataset because it has two actions and is non-trivial, though not extra large either. Not sure of the size; I’ll look into it. It requires on the order of 6 GB of executor memory. There is only one copy of the broadcast HashBiMaps created on each node machine, where all local tasks use it read-only.

As to using only one executor, I’ve seen that too. It seems to be related to how data splits are created in Spark. You may have enough memory that no other executor is needed. Odd, because in some cases you might want more executors for CPU-bound problems, so there is probably some config to force more executors. I doubt very much that you are CPU bound, though, so it may be OK here.

If we really do have HashBiMaps that are too large then there are ways to remove them.

The special problem in spark-itemsimilarity is producing one collection of unique user IDs that spans all cross-cooccurrence indicators. A matrix multiply is performed for each cross-cooccurrence indicator, so the row space of _all_ matrices must be the same. This means that as new data for the secondary actions is read in, the dimensionality of the previously read matrices must be updated, and the user ID collection must be updated.

There are at least two ways to solve the user and item ID mapping that don’t require a HashMap:
1) Do it the way legacy hadoop Mahout did: ignore the issue and use only internal Mahout IDs, which means the developer must perform the mapping before and after the job. This would be relatively easy to do in spark-itemsimilarity; in fact it is noted as a “todo” in the code, for optimization purposes.
2) Restructure the input pipeline to read in all data before the Mahout Spark DRMs are created. This would allow easier use of joins and rdd.distinct for managing very large ID sets. I think the input would have to use external IDs initially, then join the distinct IDs with Mahout IDs to create a DRM. Another join would be required before output to get external IDs again. A partial solution might come from recent work to allow DRMs with non-Int IDs. I’ll ask about that, but it would only solve the user ID problem, not the item IDs; that may be enough for you.

#1 just puts the problem on the user of Mahout, and this has been a constant issue with previous versions, so unless someone is already doing the translation of IDs, it’s not very satisfying.
#2 would mean a fair bit longer runtime, since joins are much, much slower than hashes. But it may be an option, since there is probably no better one given the constraints. Optimizing to a hash when memory is not a problem, then using joins when memory is a constraint, may be the best solution.
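To illustrate the shared-row-space constraint, here is a hedged pure-Python sketch of the in-memory approach, with plain dictionaries standing in for the Guava HashBiMaps: external user and item IDs are interned to contiguous integers, and one user dictionary is shared across all actions so every matrix gets the same row numbering. All names and data here are illustrative, not the actual Mahout code; the join-based alternative would replace these dictionaries with something like rdd.distinct() plus an index join.

```python
# Toy interaction logs for two actions (purchase = primary, view = secondary).
purchases = [("hani", "iphone"), ("pat", "surface")]
views     = [("hani", "surface"), ("ted", "iphone"), ("pat", "galaxy")]

def intern(external_id, forward, reverse):
    """Assign the next free integer to an unseen external ID.
    forward/reverse together behave like a BiMap (Int <-> String)."""
    if external_id not in forward:
        idx = len(forward)
        forward[external_id] = idx
        reverse[idx] = external_id
    return forward[external_id]

user_fwd, user_rev = {}, {}          # ONE row space shared by all actions

def to_matrix_cells(interactions, user_fwd, user_rev):
    item_fwd, item_rev = {}, {}      # one column space per action
    cells = [(intern(u, user_fwd, user_rev),
              intern(i, item_fwd, item_rev)) for u, i in interactions]
    return cells, item_rev

purchase_cells, purchase_items = to_matrix_cells(purchases, user_fwd, user_rev)
view_cells, view_items = to_matrix_cells(views, user_fwd, user_rev)

# The same user maps to the same row index in both matrices, so the
# cooccurrence and cross-cooccurrence multiplies line up; the reverse
# maps translate internal integers back to external IDs on output.
print(user_fwd)
print(purchase_cells, view_cells)
```

The memory cost Pat describes is exactly these dictionaries: they must hold every distinct user and item string on every node.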
