Discussion:
Mahout ML vs Spark MLlib vs Mahout-Spark integration
Reth RM
2016-09-17 00:03:15 UTC
Hi,

I am trying to learn the key differences between Mahout ML and Spark ML, and
then the Mahout-Spark integration, specifically for clustering algorithms. I
learned through forums and blog posts that one of the major differences is
that Mahout runs as a batch process while Spark is backed by streaming APIs.
But I also see a Mahout-Spark integration, so I'm slightly confused and would
like to know the major differences that should be considered (looked into).

Background:
I'm working on a new research project that requires clustering of documents
(50M webpages for now), and the focus is only on clustering algorithms and
the LSH implementation. Right now I have started experimenting with Mahout
k-means (standalone, not streaming k-means) and also looked into LSH, which
is available in both frameworks, so the above questions arise at this point.
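For reference, the core LSH idea in question here (random-hyperplane hashing
for cosine similarity) can be sketched in a few lines of plain Python. This is
an illustration only, not Mahout's or Spark's implementation; all names below
are made up:

```python
import random

def signature(vec, planes):
    """One bit per hyperplane: the sign of the dot product with the vector."""
    return tuple(
        1 if sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(s, t):
    """Signatures of similar (small-angle) vectors differ in few bits."""
    return sum(x != y for x, y in zip(s, t))

random.seed(42)
dim, n_bits = 5, 16
# Random hyperplanes through the origin, one per signature bit.
planes = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

doc = [1.0, 0.5, 0.0, 0.2, 0.9]
print(signature(doc, planes))
```

Documents whose signatures agree on banded sub-signatures land in the same
bucket, so candidate pairs can be found without an all-pairs comparison.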

Looking forward to hearing thoughts and insights from all users here.
Thank you.
Andrew Musselman
2016-09-17 06:36:03 UTC
Mahout has changed a lot in the past couple of years, becoming more focused on
serving the needs of data workers and scientists who need to experiment
with large matrix math problems. To that end we've broadened the execution
engines that perform the distribution of computation to include Spark and
Flink, and we're thinking about just how many pre-built algorithms we
should include in the library versus working on performance behind the
scenes.

There is a new declarative language that is R/MATLAB-like and allows for
interactive sessions at scale; see the "Mahout-Samsara" tab in the
navigation on the home page http://mahout.apache.org.
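To give a flavor of that DSL: Samsara expresses distributed matrix math
directly, e.g. the Gramian A'A of a distributed row matrix is written roughly
as `val C = drmA.t %*% drmA` in Scala, and the optimizer computes it in a
single pass over the rows. A plain-Python sketch of that row-wise computation,
purely as an illustration of what the engine does under the hood:

```python
def gramian(rows):
    """Compute A' * A by accumulating the outer product of each row.
    This mirrors how a distributed engine can build A'A in one pass
    over row partitions, without materializing the transpose."""
    n = len(rows[0])
    C = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i, x in enumerate(row):
            for j, y in enumerate(row):
                C[i][j] += x * y
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
print(gramian(A))  # [[10.0, 14.0], [14.0, 20.0]]
```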

This book was written by two of the major contributors to the new
declarative language and is worth a look:
https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785

Thanks for your interest; we'll be happy to help you as you proceed if you
have any other questions.
Isabel Drost-Fromm
2017-01-31 11:01:40 UTC
Hi,
Post by Andrew Musselman
and we're thinking about just how many pre-built algorithms we
should include in the library versus working on performance behind the
scenes.
To pick this question up: I've been watching Mahout from a distance for quite
some time. From what limited background I have on Samsara, I really like its
approach of being able to run on more than one execution engine.

To give some advice to downstream users in the field - what would be your
advice for people tasked with concrete use cases (stuff like fraud detection,
anomaly detection, learning search ranking functions, building a recommender
system)? Is that something that can still be done with Mahout? What would it
take to get from raw data to a finished system? Is there something we can do
to help users get that accomplished? Is there even interest from users in such
a use-case-based perspective? If so, would there be interest among the Mahout
committers to help users publicly create docs/examples/modules to support
these use cases?


Isabel
Florent Empis
2017-01-31 11:31:44 UTC
Hi,

I am in the same spot as Isabel.
I used to use/understand most of the «old» standalone Mahout, and am now doing
some data transformation with Spark, but I am not sure where Samsara fits in
the ecosystem.
We also do quite a bit of computation in R.
Basically, we are willing to learn and support the project by, for instance,
buying the books Rob mentioned, but a short doc with the outline Isabel
describes would be great!

Many thanks,

Florent


Trevor Grant
2017-01-31 14:50:13 UTC
Hello Isabel and Florent,

I'm currently working on a side-by-side demo of R / Python / SparkML(Mllib)
/ Mahout, but in very broad strokes here is how I would compare them:

R - Most statistical functionality. Most flexibility. Implement your own
algorithms in a mathematically expressive language. Worst performance; handles
only "small" data sets. The language is 'math centric'. Easy to extend /
create new algos.

Python (sklearn/scikit) - Some mathematical/statistical functionality, more
focused on machine learning. The machine learning library is very
sophisticated, though. Much better performance than R, but still only single
node; "small to medium" data sets. The language is 'programmer centric'.
Somewhat difficult to extend / create new algos.

SparkML / MLlib - Very limited mathematical functionality (usually collects
to the driver to do anything of substance). Machine learning is rudimentary
compared to sklearn, but still non-trivial and one of the best available.
Excellent performance, well suited to "big" data sets. The language is
'programmer centric'. Very difficult to extend / create new algos.

(FlinkML - Fits in the same spot as SparkML, but is significantly less
developed.)

Mahout - Good mathematical functionality. Good performance relative to the
underlying engine (possibly superior with MAHOUT-1885). The language is 'math
centric'. Well suited to "medium and big" data sets. Fairly easy to extend /
create new algos (MAHOUT-1856).

I hope that provides a high level comparison.

Re use cases - the tool to use depends on the job at hand:
Highly advanced mathematical model, small dataset or sampling from the full
dataset OK -> use R
Machine learning on a small to medium data set, or sampling from the full
dataset OK -> use Python / sklearn
Less sophisticated machine learning on a large dataset -> SparkML
Custom mathematical/statistical model on medium to large data -> Mahout

^^ All of this is just my opinion.

Re: integration-

We're working on that too. Recently MAHOUT-1896 added convenience methods
for interacting with MLlib-type RDDs and DataFrames:
https://issues.apache.org/jira/browse/MAHOUT-1896

(No support yet for SparkML-type DataFrames, or for spitting DRMs back out
into RDDs/DataFrames.)

Finally, docs: there has been talk for some time of migrating the website
from the CMS to Jekyll, and it's something I strongly support. The CMS makes
it difficult to keep up with documentation, and Jekyll would open up
documentation/website maintenance to contributors.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Pat Ferrel
2017-01-31 17:21:28 UTC
My perspective comes from the data side. I work in recommenders, and that means log analysis of huge amounts of data. Even a small shop doing this will immediately run out of capacity in Python or R on a single node. MLlib is a set of prepackaged algorithms that will work (mostly) with big data. Mahout Samsara is the only general linear algebra tool I know of that will natively let you interactively run R-like code on a cluster of any size, then polish it for production, all without changing tools or language.

Going from analytics to recommenders means a jump in data size of several orders of magnitude and this is just one example.


Ted Dunning
2017-01-31 17:30:35 UTC
From my perspective, the state of the art of machine learning is in systems
like TensorFlow and dl4j. If you can deal with the limits of a non-clustered
GPU system, then Theano and Caffe are very useful. Keras papers over the
differences between the back-ends nicely.

TensorFlow and Theano can do a lot of mathematical and linear (tensor,
actually) algebra work nicely, especially if there is an optimization
problem lurking.

NVidia also has a very strong commercial offering that supports their GPU
clustering well.

Spark ML lags far behind this state of the art, but is still useful for
simpler situations.

For recommendations, the situation is very different. Almost all
applications are most easily and often most accurately solved using an
indicator-based approach and the go-to implementation of this is Mahout.

There is a lot of noise in the world about factorization-based
recommendation using ALS and such, but the noise is not warranted.
Deploying a recommender in a search engine is just better.
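The indicator-based approach mentioned above typically scores item
co-occurrence with the log-likelihood ratio test (the entropy formulation
from Dunning's work, which Mahout's LogLikelihood utility also uses). A
from-scratch sketch for a single 2x2 co-occurrence table, written here
purely as an illustration:

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy (N*H) of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence table.
    k11: users who interacted with both items,
    k12/k21: users who interacted with only one of them,
    k22: users who interacted with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Mahout additionally clamps tiny negative values (float noise) to 0.
    return 2.0 * (row + col - mat)

print(llr(100, 1, 1, 100))  # strongly associated pair: large score
```

Item pairs with high LLR scores become the "indicators" that are indexed in
the search engine; the query is then the user's recent interaction history.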

I have not personally used Samsara much, but the idea of a strong optimizer
over the top of a nice syntax for linear algebra is a good one.
Florent Empis
2017-01-31 20:54:41 UTC
From my point of view, Mahout as a whole has shifted from what it was in
2009-2012.
At the time, Mahout (and Mahout in Action is a great testimony of that era)
was a sum of bricks, full of relatively high-level mathematical concepts but
usable by what I'd call (myself included) wanna-be data scientists.
With an approach akin to "data science for hackers", it was possible to build
a crude but working ML tool, such as a recommender. My memory is lacking, but
I think my first experiments with Taste date from 2008. At the time, I had no
intimate mathematical knowledge of what the code written by Ted, Sean & many
others did. I fed order histories into it and got back recommendations that
"made sense". I then did the same with k-NN & more.
Over the years, I got better at understanding the mathematical concepts,
thanks in particular to scientists who took the time to explain to me, a
tech/data guy, the mathematical concepts behind the blocks I blindly used.

I'd say that "Mahout today" is a distributed mathematical toolbox. Nothing
wrong with that, absolutely nothing. It has its purposes, but I feel it's no
longer aimed at "tech people wanting to have a go at machine learning".

When I take a look at my company's code repository, even though I'm less and
less involved in day-to-day design decisions, I see that its "lively"
components are indeed using stuff like TensorFlow & dl4j.

My scientific credentials are obviously way less impressive than Ted's, whom
I had the pleasure to meet a few times, as well as quite a few MapR
employees, but coming from a tech/functional background I reach exactly the
same conclusion: for recommendation, don't bother reinventing the wheel or
using "fancy" ALS stuff (been there, done that; it showed no impressive gain
in practical use cases): buy an off-the-shelf solution (disclaimer: I sell
one ;-) ) or build it from Mahout Taste and do some data wrangling with a
search engine (but if you're in a hurry, definitely go and talk to vendors; a
few caveats apply :-) ). For everything else ML related, have a go at
TensorFlow implementations related to your use case; you will find books that
are as didactic as Mahout in Action was 6 years ago.

All in all: congrats to the Mahout team, past and current contributors. You
did a damn good job and got me into this field, for which I am very grateful!
scott cote
2017-01-31 20:41:45 UTC
Trevor gave a great presentation at our user group. It was live streamed on Periscope. Trevor, maybe you could share the URL? I don't have it handy at the moment.

SCott
Post by Trevor Grant
Hello Isabel and Florent,
I'm currently working on a side-by-side demo of R / Python / SparkML(Mllib)
R- Most statistical functionality. Most flexibility. Implement your own
algorithms- mathematically expressive language. Worst performance- handles
only "small" data sets. Language is 'math centric'. Easy to extend /
create new algos
Python (sklearn/scikit) - Some mathematical / statistical functionality,
more focused on machine learning. Machine learning library very
sophisticated though. Much better performance than R, still only single
node. "small to medium" data sets. Language is 'programmer centric'.
Somewhat difficult to extend / create new algos
SparkML / Mllib - Very Limited Mathematical functionality (usually collects
to driver to do anything of substance). Machine learning rudimentary
compared to sklearn, but still non-trivial one of the best available.
Exceeding performance, well suited to "big" data sets. Language is
'programmer centric'. Very difficult to extend / create new algos.
(FlinkML - Fits in same spot as SparkML, but significantly less developed)
Mahout - Good mathematical functionality. Good performance relative to
underlying engine (possibly superior with MAHOUT-1885). Language is 'math
centric'. Well suited to "medium and big" data sets. Fairly easy to extend
/ create new algos (MAHOUT-1856)
I hope that provides a high level comparison.
Re use cases- the tool to use depends on the job at hand.
Highly advanced mathematical model, small dataset or sampling from full
dataset OK -> Use R
Machine learning on small to medium data set or sampling from full dataset
OK -> Use Python / sklearn
Less sophisticated machine learning on Large dataset -> SparkML
Custom mathematical/statistical model on medium to large data -> Mahout
^^ All of this is just my opinion.
Re: integration-
We're working on that too. Recently MAHOUT-1896 added convenience methods
for interacting with MLLib type RDDs, and DataFrames
https://issues.apache.org/jira/browse/MAHOUT-1896
(No support yet for SparkML type dataframes, or spitting DRMs back out into
RDDs/DataFrames).
Finally, docs: there has been talk for some time of migrating the
website from the CMS to Jekyll, and it's something I strongly support. The CMS
makes it difficult to keep documentation up to date, and Jekyll would open up
documentation / website maintenance to contributors.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Florent Empis
Hi,
I am in the same spot as Isabel.
Used to use/understand most of the «old» standalone Mahout, now doing some
data transformation with Spark, but I am not sure where Samsara fits in the
ecosystem.
We also do quite a bit of computation in R.
Basically we are willing to learn and support the project by for instance
buying the books Rob mentioned, but a short doc with the outline Isabel
describes would be great!
Many thanks,
Florent
Hi,
Post by Andrew Musselman
and we're thinking about just how many pre-built algorithms we
should include in the library versus working on performance behind the
scenes.
To pick this question up: I've been watching Mahout from a distance for quite
some time. From what limited background I have on Samsara, I really like its
approach of being able to run on more than one execution engine.
To give some advice to downstream users in the field - what would be your advice
for people tasked with concrete use cases (stuff like fraud detection, anomaly
detection, learning search ranking functions, building a recommender system)? Is
that something that can still be done with Mahout? What would it take to get
from raw data to a finished system? Is there something we can do to help users get
that accomplished? Is there even interest from users in such a use-case-based
perspective? If so, would there be interest among the Mahout committers to help
users publicly create docs/examples/modules to support these use cases?
Isabel
Keith Aumiller
2017-01-31 20:43:18 UTC
Permalink
I was just watching it. ;)

https://trevorgrant.org/

Thanks Trevor!
Post by scott cote
Trevor gave a great presentation at our user group. It was live streamed
on Periscope. Trevor - maybe you could share the url? I don’t have it
handy at the moment.
SCott
--
Thanks,

Keith Aumiller
MBA - IT Professional
Lafayette Hill PA
314-369-0811
Dmitriy Lyubimov
2017-02-01 00:06:36 UTC
Permalink
Post by Isabel Drost-Fromm
Hi,
To give some advice to downstream users in the field - what would be your advice
for people tasked with concrete use cases (stuff like fraud detection, anomaly
detection, learning search ranking functions, building a recommender
system)?
If you are an off-the-shelf practitioner (most smaller startup companies
without a chief scientist), with very few exceptions you might want to look
for an off-the-shelf solution where one exists, and most likely it does not
exist on Samsara in the open domain. Except for several applied
off-the-shelf tools, Mahout has not (hopefully just yet) developed a
comprehensive set of things to use.

The off-the-shelf tools currently are cross-occurrence recommendations (which
still require a real-time serving component taken from elsewhere), SVD/PCA,
some algebra, and naive/complement Bayes at scale.
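For context, the cross-occurrence recommender mentioned here ranks item pairs with a log-likelihood ratio (G^2) test over a 2x2 cooccurrence contingency table. A self-contained plain-Scala sketch of that statistic (illustrative only; the function names are mine, not Mahout's API):

```scala
// Log-likelihood ratio over a 2x2 contingency table, the statistic behind
// cooccurrence-based recommendation. k11 = times both items were interacted
// with together, k12/k21 = one without the other, k22 = neither.
// (Illustrative sketch; not Mahout's actual API.)
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x.toDouble)

// Unnormalized Shannon entropy of a set of counts.
def entropy(elems: Long*): Double = xLogX(elems.sum) - elems.map(xLogX).sum

def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  // Clamp tiny negative values from floating-point rounding to zero.
  math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
}

// Independent events score ~0; strong cooccurrence scores high.
println(llr(10, 10, 10, 10))   // ~0.0
println(llr(100, 5, 5, 1000))  // large positive value
```

Items whose cooccurrence with a user's history scores anomalously high on this statistic become recommendation candidates; the real-time scoring side (e.g. via a search engine) still has to come from elsewhere, as noted above.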

Most of the bigger companies I worked for never dealt with completely
off-the-shelf open source solutions. It always requires more understanding
of their problem. (E.g., much as the COO recommender is wonderful, I don't
think Netflix would entertain taking Mahout's COO and running it verbatim.)

It is quite common that companies invest in their own specific
understanding of their problem and requirements, and in a specific solution to
their problem, through iterative experimentation with different
methodologies, most of which are either new-ish enough or proprietary
enough that a public solution does not exist.

That latter case was pretty much the motivation for Samsara. If you are a
practitioner solving numerical problems through an experimentation cycle, Mahout
is much more useful than any of the off-the-shelf collections.

So the idea, first, is to get an R-like platform out for the practitioners,
and grow packages (just like with R). The platform obviously needs work,
which unfortunately is not sufficiently sponsored, IMO, at the moment by
industry or academia, compared to other projects.

Post by Isabel Drost-Fromm
Is there even interest from users in such a use-case-based
perspective? If so, would there be interest among the Mahout committers to help
users publicly create docs/examples/modules to support these use cases?
yes
Post by Isabel Drost-Fromm
Isabel
Isabel Drost
2017-02-01 09:55:29 UTC
Permalink
Post by Dmitriy Lyubimov
Except for a several applied
off-the-shelves, Mahout has not (hopefully just yet) developed a
comprehensive set of things to use.
Do you think there would be value in having that? Funding aside, would now be a
good time to develop that, or do you think Samsara needs more work before
starting on that?

If there's value / good timing: do you think it would be possible to mentor
downstream users to help get this done? And a question to those still reading
this list: would you be interested and able (time-wise) to help out here?
Post by Dmitriy Lyubimov
The off-the-shelves currently are cross-occurrence recommendations (which
still require real time serving component taken from elsewhere), svd-pca,
some algebra, and Naive/complement Bayes at scale.
Most of the bigger companies i worked for never deal with completely the
off-the-shelf open source solutions. It always requires more understanding
of their problem. (E.g., much as COO recommender is wonderful, i don't
think Netflix would entertain taking Mahout's COO run on it verbatim).
Makes total sense to me. Would it be possible to build a base system that performs
OK and can be extended such that it performs fantastically with a bit of extra
secret sauce?
Post by Dmitriy Lyubimov
It is quite common that companies invest in their own specific
understanding of their problem and requirements and a specific solution to
their problem through iterative experimentation with different
methodologies, most of which are either new-ish enough or proprietary
enough that public solution does not exist.
While that does make a lot of sense, what I'm asking myself over and over is
this: back when I was more active on this list there was a pattern in the
questions being asked. Often people were looking for recommenders, fraud
detection, event detection. Is there still such a pattern? If so, it would be
interesting to think about which of those problems are widespread enough that
offering a standard package, integrated from data ingestion to prediction, would
make sense.
Post by Dmitriy Lyubimov
That latter case was pretty much motivation for Samsara. If you are a
practitioner solving numerical problems thru experimentation cycle, Mahout
is much more useful than any of the off-the-shelf collections.
+1. This is also why I think focusing on Samsara, and on making it
stable and scalable, makes a lot of sense.

The reason why I dug out this old thread comes from a slightly different angle:
we seem to have a solid base. But it's only really useful for a limited set of
experts. It will be hard to draw new contributors and committers from that set
of users (it will IMHO even be hard to find many users who are that skilled).
What I'm asking myself is if we should and can do something to make Mahout
useful for those who don't have that background.
Post by Dmitriy Lyubimov
Post by Isabel Drost-Fromm
perspective? If so, would there be interest among the Mahout committers to help
users publicly create docs/examples/modules to support these use cases?
yes
Where do we start? ;)


Isabel
Andrew Palumbo
2017-02-01 20:29:49 UTC
Permalink
________________________________
From: Isabel Drost <***@apache.org>
Sent: Wednesday, February 1, 2017 4:55 AM
To: Dmitriy Lyubimov
Cc: ***@mahout.apache.org
Subject: Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation
Post by Dmitriy Lyubimov
Except for a several applied
off-the-shelves, Mahout has not (hopefully just yet) developed a
comprehensive set of things to use.
Do you think there would be value in having that? Funding aside, would now be a
good time to develop that or do you think Samsara needs more work before
starting to work on that?

If there's value/ good timing: Do you think it would be possible to mentor
downstream users to help get this done? And a question to those still reading
this list: Would you be interested an able (time-wise) to help out here?


I'm sorry to cut in on the conversation here, but I wanted people to be aware of the algorithm framework effort that is currently underway.

I think that https://issues.apache.org/jira/browse/MAHOUT-1856, a solid framework for new algorithms, will go a long way towards helping new users understand how easy it is to add algorithms. There has been significant work on this issue already merged to master, with a fine OLS example including statistical tests for autocorrelation and heteroskedasticity. Trevor G. has been heading up the framework effort, which is still in development and will continue to be throughout the 0.13.x releases (and hopefully be added to in 0.14.x as well).

I believe that having the framework in place will both make Mahout more intuitive for new users and developers writing algorithms and pipelines, and provide a set of canned algos to those who are looking for something off-the-shelf.

Just wanted to get that into the conversation.
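The OLS example mentioned above fits a model and then reports diagnostics such as an autocorrelation test on the residuals. A self-contained plain-Scala sketch of the idea, using simple one-variable OLS and a Durbin-Watson statistic (illustrative only; the real framework operates on distributed DRMs, and these helper names are mine):

```scala
// One-variable OLS (y = b0 + b1*x) fit by the normal equations, plus a
// Durbin-Watson statistic on the residuals -- the kind of diagnostic the
// MAHOUT-1856 OLS example reports. Plain Scala, not the Mahout DRM API.
def olsFit(x: Array[Double], y: Array[Double]): (Double, Double) = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val b1 = x.zip(y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum /
           x.map(xi => (xi - mx) * (xi - mx)).sum
  val b0 = my - b1 * mx
  (b0, b1)
}

// Durbin-Watson: values near 2 suggest no first-order autocorrelation.
def durbinWatson(resid: Array[Double]): Double =
  resid.sliding(2).map { case Array(a, b) => (b - a) * (b - a) }.sum /
    resid.map(r => r * r).sum

val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
val y = Array(2.1, 3.9, 6.2, 7.8, 10.1)  // roughly y = 2x
val (b0, b1) = olsFit(x, y)
val resid = x.zip(y).map { case (xi, yi) => yi - (b0 + b1 * xi) }
println(s"b0=$b0 b1=$b1 DW=${durbinWatson(resid)}")
```

The framework's value is that the fit, the diagnostics, and the serializable model all follow one convention, so users get the tests "for free" after fitting.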
i***@apache.org
2017-02-07 10:22:50 UTC
Permalink
Hi,
Post by Andrew Palumbo
I think that https://issues.apache.org/jira/browse/MAHOUT-1856, a solid
framework for new algorithms, will go a long way towards helping new users
understand how easy it is to add algorithms. There has been significant work
on this issue already merged to master, with a fine OLS example including
statistical tests for autocorrelation and heteroskedasticity. Trevor G. has
been heading up the framework effort, which is still in development and will
continue to be throughout the 0.13.x releases (and hopefully be added to in
0.14.x as well).
No need to be sorry here, from a first glimpse at the issue it does look useful.
Post by Andrew Palumbo
I believe that having the framework in place will both make Mahout more
intuitive for new users and developers writing algorithms and pipelines,
and provide a set of canned algos to those who are looking for
something off-the-shelf.
Are there things within or on top of that framework that interested users could
help out with?
Post by Andrew Palumbo
Just wanted to get that into the conversation.
Thank you for that.


Isabel
Dmitriy Lyubimov
2017-02-01 23:32:24 UTC
Permalink
Isabel, if I understand it correctly, you are asking whether it makes sense
to add end-to-end scenarios based on Samsara to the current codebase?

The answer is: absolutely. Yes it does, for both rather isolated issues
(like computing clusters) and end-to-end scenarios.

The only problem with end-to-end scenarios is that they are often difficult to
demonstrate with a batch-oriented computational system only. That's what
prediction.io kind of picked up on with COO: they included all of data
ingestion, computation, and real-time scoring queries.

But yes, there's absolutely tons of value in that. Not everything fits
quite nicely, and not everything fits end-to-end (just like with R), but
some fairly significant pieces do fit to be written on top.
Post by Isabel Drost-Fromm
Post by Isabel Drost-Fromm
perspective? If so, would there be interest among the Mahout
committers to
Post by Isabel Drost-Fromm
help
users publicly create docs/examples/modules to support these use cases?
yes
Where do we start? ;)
I would start with figuring out a problem I want to solve AND I have a budget
to do it AND I can legally contribute on behalf of the IP owner.

Then we can think about whether it is a good fit (Samsara is mostly limited to
tensor-based data only, just like the MapReduce DRM was/is). Some things may
not have a convenient algebraic formulation.

-d
Isabel Drost
2017-02-07 10:30:19 UTC
Permalink
Post by Dmitriy Lyubimov
Isabel, if i understand it correctly, you are asking whether it makes sense
add end2end scenarios based on Samsara to current codebase?
Sorry for being fuzzy. The meta question that I'm trying to find an answer for
is whether there's something that can / should be done to increase the number of
people who could potentially be assimilated and turned into committers one day.
One specific idea I had in mind was to make the project easier to use for
beginners; one way to get that accomplished was to focus on end-to-end
implementations of popular use cases. (Sorry, fairly meta...)
Post by Dmitriy Lyubimov
The answer is, absolutely. Yes it does for both rather isolated issues
(like computing clusters) and end-2-end scenarios.
The only problem with end-to-end scenarios is that they are often difficult to
demonstrate with a batch-oriented computational system only. That's what
prediction.io kind of picked up on with COO: they included all of data
ingestion, computation, and real-time scoring queries.
But yes, there's, absolutely, tons of value in that. Not everything fits
quite nicely, and not everything fits end-2-end (just like with R), but
some fairly significant pieces do fit to be written on top.
Makes sense.
Post by Dmitriy Lyubimov
Post by Isabel Drost
Where do we start? ;)
I would start with figuring a problem I want to solve AND I have a budget
to do it AND i can legally contribute on behalf of the IP owner.
I guess, given the meta explanation above: if an increase in contributions were a
goal, one could also think about making potential areas of contribution explicit
and highlighting the value the project brings compared to other systems, with a
specific focus on Samsara. That's another angle of me asking weird questions
here.
Post by Dmitriy Lyubimov
Then we can think of whether it is a good fit (Samsara is mostly limited to
tensor based data only, just like Mapreduce DRM was/is). Some things may
not have a convenient algebraic formulation.
+1

Isabel
--
Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most likely involving some kind of mobile connection only.)
Trevor Grant
2017-02-07 16:47:37 UTC
Permalink
The idea that Andy briefly touched on is that the Algorithm Framework
(hopefully) paves the way for R/CRAN-like user contribution.

Increased contribution was a goal I had certainly hoped for. I have begun
promoting the idea at Meetups. There hasn't been a concerted effort to
push the idea, however it is a tagline / call to action I am planning on
pushing at talks and conferences this spring. Thank you for raising the
issue on the mailing list as well.

Using the Samsara framework and "Algorithms" framework, it is hoped that the
barrier to entry for new contributors will be very low, and that they can
introduce new algorithms or port them from R. Other 'Big Data' machine
learning frameworks suffer because they are not easily extensible.

The algorithms framework makes it (more) clear where a new algorithm would
go, and in general how it should behave (e.g. "This is a Regressor, so it
probably goes in the regressor package; it needs a fit method that takes a
DrmX and a DrmY, and a predict method that takes DrmX and returns
DrmY_hat"). The algorithms framework also provides a consistent interface
across algorithms and puts up "guard rails" to ensure common things are
done in an efficient manner (e.g. serializing just the model, not the
fitter and additional unneeded things; thank you Dmitriy). The Samsara
framework makes it easy to 'read' what the person is doing. This makes it
easier to review PRs, encourages community review, and if (hopefully not,
but in case it does happen) someone makes a so-called 'drive-by commit',
that is, commits an algorithm and is never heard from again, others can easily
understand and maintain the algorithm in the person's absence.
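The fit/predict convention described here can be sketched as a pair of small traits. This is a hypothetical plain-Scala shape with a trivial example regressor (the real framework works on Mahout DRMs, not in-memory arrays, and its type names differ):

```scala
// Minimal sketch of the fit/predict convention described above.
// Hypothetical types: real Mahout regressors take distributed DRMs.
trait RegressorModel {
  def predict(x: Array[Array[Double]]): Array[Double]
}

trait Regressor {
  def fit(x: Array[Array[Double]], y: Array[Double]): RegressorModel
}

// A deliberately trivial regressor that always predicts the training mean.
// Note only the model (here, one Double) needs to be serialized -- not the
// fitter -- which is the "guard rail" mentioned above.
class MeanRegressor extends Regressor {
  def fit(x: Array[Array[Double]], y: Array[Double]): RegressorModel = {
    val mean = y.sum / y.length
    new RegressorModel {
      def predict(x: Array[Array[Double]]): Array[Double] =
        Array.fill(x.length)(mean)
    }
  }
}

val model = new MeanRegressor().fit(
  Array(Array(1.0), Array(2.0)), Array(3.0, 5.0))
println(model.predict(Array(Array(9.0))).toList)  // List(4.0)
```

With one such interface, every new algorithm lands in a predictable package with predictable behavior, which is what keeps review and long-term maintenance cheap.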

There are a number of issues labeled as beginner in JIRA now, especially
with respect to the Algorithms package.

It would probably be good to include a lot of this information in a web
page, either here https://mahout.apache.org/developers/how-to-contribute.html
or on a page that is linked to by that.

Which leads me into the last 'piece of the puzzle' I would like to have in
place before aggressively advertising this as a "new-contributor friendly"
project: migrating the CMS to Jekyll,
https://issues.apache.org/jira/browse/MAHOUT-1933

The rationale for that is so that when new algorithms are submitted, the PR will
include relevant documentation (as a convention), and that documentation can
be corrected / expanded as needed in a more non-committer-friendly manner.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Isabel Drost
Post by Dmitriy Lyubimov
Isabel, if i understand it correctly, you are asking whether it makes
sense
Post by Dmitriy Lyubimov
add end2end scenarios based on Samsara to current codebase?
Sorry for being fuzzy. The meta question that I'm trying to find an answer for
is if there's something can/ should be done to increase the number of people
that potentially could be assimilated and turned into committers one day. One
specific idea I had on my mind was to make the project easier to use for
beginners, one idea to get that accomplished I had was to focus on end to end
implementations of popular use cases. (Sorry, fairly meta...)
Post by Dmitriy Lyubimov
The answer is, absolutely. Yes it does for both rather isolated issues
(like computing clusters) and end-2-end scenarios.
The only problem with end 2 end scenarious is they often difficult to
demonstrate with batch-oriented coputational system only. That's what
prediction.io kind of picked on with COO, they included all of data
ingestion, computation and real time scoring queries.
But yes, there's, absolutely, tons of value in that. Not everything fits
quite nicely, and not everything fits end-2-end (just like with R), but
some fairly significant pieces do fit to be written on top.
Makes sense.
Post by Dmitriy Lyubimov
Post by Isabel Drost
Where do we start? ;)
I would start with figuring a problem I want to solve AND I have a budget
to do it AND i can legally contribute on behalf of the IP owner.
I guess given the meta explanation above - if increase in contributions was a
goal one could also think about making potential areas of contribution explicit
and highlight the value the project brings compared to other systems with a
specific focus on samsara. That's another angle of me asking weird questions
here.
Post by Dmitriy Lyubimov
Then we can think of whether it is a good fit (Samsara is mostly limited
to
Post by Dmitriy Lyubimov
tensor based data only, just like Mapreduce DRM was/is). Some things may
not have a convenient algebraic formulation.
+1
Isabel
--
Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most
likely involving some kind of mobile connection only.)
Saikat Kanjilal
2017-02-07 20:31:21 UTC
Permalink
@Trevor Grant

The landscape in machine learning is getting more and more diluted with lots of tools. Here's a question: given that some folks are taking R and connecting it to Spark and MapReduce to make the R algorithms work at scale (https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler), what would be the additional value added in porting the R code using the algorithms/Samsara framework? To me, the MRS efforts and the approach you are proposing are two parallel tracks.

As far as the barriers to entry to contributing, I think it's largely due to the complexity of the codebase and the lack of familiarity with Samsara. I'd love to help create some good docs/tutorials on both the algorithms framework and Samsara when and where it makes sense. However, I feel it'd be useful to really identify the use cases where the algorithms/Samsara approach has clear wins versus MRS with Spark, or Spark by itself, or Python/scikit-learn. I've found that in general people don't really need custom algorithms in data science; they typically are answering some very basic classification or clustering question and can use linear/logistic regression or a variant of k-means. I'd also like to help dig into some use cases with Samsara and maybe put those use cases in the examples section.

Thoughts?
________________________________
From: Trevor Grant <***@gmail.com>
Sent: Tuesday, February 7, 2017 8:47 AM
To: ***@mahout.apache.org; ***@apache.org
Subject: Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

Trevor Grant
2017-02-08 02:53:51 UTC
Permalink
Answers inline below.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Post by Saikat Kanjilal
@Trevor Grant
The landscape in machine learning is getting more and more diluted with
lots of tools. Here's a question: given that some folks are taking R and
connecting it to Spark and MapReduce to make the R algorithms work at
scale (https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler), what
would be the additional value added in porting the R code using the
algorithms/Samsara framework? To me the MRS efforts and the approach you
are proposing are 2 parallel tracks,
Correct, one is a commercial product by Microsoft- the other is a
business-friendly open source Apache Software Foundation Project.
Post by Saikat Kanjilal
as far as the barriers to entry to contributing, I think it's largely due
to the complexity of the codebase and the lack of familiarity with Samsara,
This is what we hope to overcome with the Algorithms framework and perhaps
more documentation.

Post by Saikat Kanjilal
I'd love to help create some good docs/tutorials on both the algorithms
framework and Samsara when and where it makes sense,
Would love the help- will be easier once we get migrated to Jekyll. (More
motivation to do this).
Post by Saikat Kanjilal
however I feel like it'd be useful to really identify the use cases where
using the algorithms/samsara approach has clear wins versus MRS
When you don't want to pay Microsoft to use your work in production.
Post by Saikat Kanjilal
with spark or spark by itself or python/scikit-learn,
Out of scope for the Mahout project, but I do have a talk forthcoming that
will address this- stay tuned.
Post by Saikat Kanjilal
I've found that in general people don't really need custom algorithms in
data science; they typically are answering some very basic classification
or clustering question and can use linear/logistic regression or a variant
of kmeans.
That has not been my experience. In fact quite the opposite- most people
need more depth to their algorithms, and many other big data ML packages
imply they have more depth than basic linear/logistic regression + kmeans,
but in fact that is all there is. Not to say one is right or wrong- the
data scientists who are happy with simple tools can find them in
SparkML/FlinkML; those who need more advanced tools may turn to Mahout.
Post by Saikat Kanjilal
I'd also like to help dig into some use cases with Samsara and put those
use cases maybe in the examples section.
Tutorials would be great- q.e.d. - more documentation would be helpful.
Post by Saikat Kanjilal
Thoughts?
________________________________
Sent: Tuesday, February 7, 2017 8:47 AM
Subject: Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation
The idea that Andy briefly touched on, is that the Algorithm Framework
(hopefully) paves the way for R/CRAN like user contribution.
Increased contribution was a goal I had certainly hoped for. I have begun
promoting the idea at Meetups. There hasn't been a concerted effort to
push the idea, however it is a tagline / call to action I am planning on
pushing at talks and conferences this spring. Thank you for raising the
issue on the mailing list as well.
Using the Samsara framework and "Algorithms" framework, it is hoped that
the barrier to entry for new contributors will be very low, and that they
can introduce new algorithms or port them from R. Other 'Big Data' Machine
Learning frameworks suffer because they are not easily extensible.
The algorithms framework makes it (more) clear where a new algorithm would
go, and in general how it should behave. E.g. this is a Regressor, so it
probably goes in the regressor package; it needs a fit method that takes a
DrmX and a DrmY, and a predict method that takes DrmX and returns
DrmY_hat. The algorithms framework also provides a consistent interface
across algorithms and puts up "guard rails" to ensure common things are
done in an efficient manner (e.g. serializing just the model, not the
fitter and additional unneeded things, thank you Dmitriy). The Samsara
framework makes it easy to 'read' what the person is doing. This makes it
easier to review PRs, encourages community review, and if (hopefully not,
but in case it does happen) someone makes a so-called 'drive-by commit',
that is, commits an algorithm and is never heard from again, others can
easily understand and maintain the algorithm in the person's absence.
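As a rough illustration of the fit/predict convention described above, here
is a hypothetical sketch in plain Java. The names MeanModel and
MeanRegressor are invented for illustration, and double[][] stands in for
Mahout's distributed row matrices (DrmX/DrmY); the actual framework API in
Scala may differ.

```java
// Hypothetical sketch of the fit/predict convention; the real Mahout
// algorithms framework operates on DRMs (distributed row matrices),
// so plain double[][] stands in for DrmX here.
interface Regressor<M> {
    M fit(double[][] x, double[] y);          // train; return only the model
    double[] predict(M model, double[][] x);  // score new rows, DrmY_hat
}

// The "model" holds just the state needed to score, mirroring the
// "serialize only the model, not the fitter" guard rail.
class MeanModel {
    final double mean;
    MeanModel(double mean) { this.mean = mean; }
}

// A trivial regressor that always predicts the mean of the training targets.
class MeanRegressor implements Regressor<MeanModel> {
    public MeanModel fit(double[][] x, double[] y) {
        double sum = 0;
        for (double v : y) sum += v;
        return new MeanModel(sum / y.length);
    }
    public double[] predict(MeanModel model, double[][] x) {
        double[] out = new double[x.length];
        java.util.Arrays.fill(out, model.mean);
        return out;
    }
}
```

Note that only MeanModel carries state, so only the model, not the fitter,
would need to be serialized or shipped for scoring.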
There are a number of issues labeled as beginner in JIRA now, especially
with respect to the Algorithms package.
It would probably be good to include a lot of this information in a web
page either here
https://mahout.apache.org/developers/how-to-contribute.html
or on a page that is linked to by that.
Which leads me into the last 'piece of the puzzle' I would like to have in
place before aggressively advertising this as a "new-contributor friendly"
project: migrating the CMS to Jekyll
https://issues.apache.org/jira/browse/MAHOUT-1933
The rationale for that is so when new algorithms are submitted, the PR will
include relevant documentation (as a convention) and that documentation can
be corrected / expanded as needed in a more non-committer friendly manner.
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org