Mathematics and Statistics libraries

Ben Jones
I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell, and a semester's worth of coding experience in the language. I am a mathematics and cs double major with only a semester left and I am looking for information regarding what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest I would like to put together a project with this. I understand that such libraries are probably low priority, but if anyone has anything I would love to hear it.

Thanks for reading,
      -Benjamin

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Mathematics and Statistics libraries

Ryan Newton
I think such libraries are high priority!

My own experience with them is not deep, but I'll echo what I think is a common observation:
  • Matrix libraries are good
  • Statistics libs need more work
And as far as wrappers around machine learning or computer vision libs (openCV)... I'm not really sure about the status of those.



Re: Mathematics and Statistics libraries

Daniel Peebles
In reply to this post by Ben Jones
I'd like to see more statistics work, definitely. Bryan's statistics library is excellent, but Ed Kmett has been talking about some very interesting approaches to sampling from complicated distributions, which I'd like to see implemented eventually in a library.


Re: Mathematics and Statistics libraries

Aleksey Khudyakov
In reply to this post by Ben Jones
On 21.03.2012 21:24, Ben Jones wrote:

> I am a student currently interested in participating in Google Summer of
> Code. I have a strong interest in Haskell, and a semester's worth of
> coding experience in the language. I am a mathematics and cs double
> major with only a semester left and I am looking for information
> regarding what the community is lacking as far as mathematics and
> statistics libraries are concerned. If there is enough interest I would
> like to put together a project with this. I understand that such
> libraries are probably low priority, but if anyone has anything I would
> love to hear it.
>
There is an existing statistics-related GSoC project[1]. It proposes
implementing an analogue of R's data frames. I think it's rather
difficult, since there is no obvious design. I also think the
implementation will require a lot of type trickery.

[1] http://hackage.haskell.org/trac/summer-of-code/ticket/1596


Re: Mathematics and Statistics libraries

Gershom Bazerman
In reply to this post by Ryan Newton
On 3/21/12 3:00 PM, Ryan Newton wrote:
> I think such libraries are high priority!
>
> My own experience with them is not deep, but I'll echo what I think is a common observation:
>   • Matrix libraries are good
>   • Statistics libs need more work

I would also be very excited about a solid statistics proposal. The ticket Aleksey links to is a good start (as is the experience report linked from there), although I think that it would be possible to implement a core library with less type-trickery than he supposes. Such an interface wouldn't necessarily be perfectly statically safe, but other, trickier interfaces could be built on top of it (just as we have fancier type-level interfaces with statically checked dimensions on top of lower-level matrix libs, etc.). I envision a set of tools that let users get up and running with loading a dump of data and calculating a set of metrics on it with only a few lines. It should be designed such that the basic framework is easily extensible with various other analyses, and such that analyses compose fairly straightforwardly. Which indeed amounts to some Frame-type structure, and a core set of functions on it :-)

--g


Re: Mathematics and Statistics libraries

Tom Doris
In reply to this post by Ben Jones
If the goal is to help Haskell be a more acceptable choice for general
statistical analysis tasks, then  hmatrix, statistics, and the various
gsl wrappers already provide the majority of the functionality needed.
I think the bigger problem is that there is no guidance on which
libraries are industrial strength, and there's no glue layer making it
easier to use the APIs you'd want to, and GHCi isn't always ideal as a
repl for this workflow.

If you're interested in UI work, ideally we'd have something similar
to RStudio as an environment: a simple set of windows encapsulating an
editor, a repl, a plotting panel and help/history. This sounds
superficial, but it really has an impact when you're exploring a data
set and trying stuff out. However, it would be a bigger contribution
to get us to the point where we are able to just "import
Quant.Prelude" to bring into scope all the standard functionality
assumed in an environment like R or Matlab. In my experience most of
this can come from re-exporting existing libraries while occasionally
wrapping functions to simplify the interfaces and make them more
consistent (e.g., a quant doesn't particularly need to know why
Statistics.Sample.KernelDensity.kde uses unboxed vectors when the rest
of that lib uses Generic, and they certainly won't want to spend their
time remembering that they need to convert to call that function).

As an exercise, in GHCi, try loading a few arbitrary csv files of
tables including floating point columns, do a linear regression of one
such column on another, and then display a scatterplot with the
regression line, maybe throw in a check for the normality of the
residuals. Assume you'll need to be able to handle large data sets so
you need to use bytestring, attoparsec etc; beware that there's a
known bug that will cause a segfault/bus error if you use some
hmatrix/gsl functions from GHCi on x86_64, which is kind of a blocker
in itself. Maybe I missed something obvious but it took me a looong
time to figure out which containers, persistence + parsing, stats and
plotting packages I should choose.
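
The regression step of this exercise is small on its own. The following is a minimal sketch, assuming the two columns have already been parsed into lists of Double. It uses only base, and the names mean, linearFit and residuals are hypothetical helpers, not functions from hmatrix or statistics. CSV parsing, plotting and the normality check itself are left out.

```haskell
-- Ordinary least squares for y = intercept + slope * x, using only base.
-- All names here are illustrative and do not come from any library
-- mentioned in this thread.

mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- | Returns (intercept, slope), or Nothing when the fit is undefined:
-- mismatched lengths, fewer than two points, or a constant x column.
linearFit :: [Double] -> [Double] -> Maybe (Double, Double)
linearFit xs ys
  | length xs /= length ys = Nothing
  | length xs < 2          = Nothing
  | sxx == 0               = Nothing
  | otherwise              = Just (intercept, slope)
  where
    mx        = mean xs
    my        = mean ys
    sxx       = sum [(x - mx) ^ (2 :: Int) | x <- xs]
    sxy       = sum [(x - mx) * (y - my) | (x, y) <- zip xs ys]
    slope     = sxy / sxx
    intercept = my - slope * mx

-- | Residuals of a fitted line; these are what a normality check would
-- be run on.
residuals :: (Double, Double) -> [Double] -> [Double] -> [Double]
residuals (a, b) xs ys = [y - (a + b * x) | (x, y) <- zip xs ys]
```

In GHCi, linearFit [1,2,3,4] [3,5,7,9] evaluates to Just (1.0,2.0). For large data sets the lists would presumably be replaced by unboxed or storable vectors.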

I really disagree that we need a data frame type structure; they're an
abomination in R: they try to accommodate event records and time
series, and do neither well. Haskell records are fine for
inhomogeneous event series, and for homogeneous time series parallel
Vectors or Matrices are better, as they can be passed to BLAS and
LAPACK with consequent performance and clarity advantages. Column-oriented
storage rocks, and Haskell is already a good fit.

Having used C++, Matlab and R (the latter for quite a while) I now use
Haskell for all of my statistical analysis work. Despite the many
shortcomings, it's definitely worth it for the code clarity and type
checking, to say nothing of the pre-optimization performance and
robustness.

Best of luck, happy to share some preliminary code with you directly
if you're interested!
Tom




Re: Mathematics and Statistics libraries

Heinrich Apfelmus
Tom Doris wrote:
>
> If you're interested in UI work, ideally we'd have something similar
> to RStudio as an environment, a simple set of windows encapsulating an
> editor, a repl, a plotting panel and help/history, this sounds
> superficial but it really has an impact when you're exploring a data
> set and trying stuff out.

Concerning UI, the following project suggestion aims to give GHCi a web GUI

   http://hackage.haskell.org/trac/summer-of-code/ticket/1609

But one of your criteria is that a good UI should come with a help
system, too, right?


Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com



Re: Mathematics and Statistics libraries

Tom Doris
Hi Heinrich,

If we compare the GHCi experience with R or IPython, leaving aside any
GUIs, the help system they have at the repl level is just a lot more
intuitive and easy to use, and you get access to the full manual
entries. For example, compare what you see if you type :info sort into
GHCi versus ?sort in R. R gives you a view of the full docs for the
function, whereas in GHCi you just get the type signature.

I usually :def a command that calls out to ":!hoogle --info %", which
gives what you would expect :info to give. So, as is usually the case,
there's a solution in Haskell that matches the features in other
systems, but it's not the default and you have to invest effort
getting it set up right. This is fine for Haskell devs who do some
stats work, but it represents an offputtingly steep learning curve for
quants who are willing to learn a little Haskell but expect
(reasonably) some basic stuff like inline help to Just Work.
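
A minimal sketch of such a command, assuming the hoogle executable is installed and on the PATH. The command name :hoogle is a hypothetical choice; GHCi's :def expects an expression of type String -> IO String, and the returned string is then run as a GHCi command.

```haskell
-- In ~/.ghci (or typed at the GHCi prompt).
-- :hoogle is an illustrative name; the command shells out to the hoogle
-- executable with the argument wrapped in double quotes.
:def hoogle \name -> return (":! hoogle --info \"" ++ name ++ "\"")
```

After this, :hoogle sort at the prompt runs hoogle --info "sort" in a shell and prints whatever hoogle returns.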

Tom

On 25 March 2012 08:26, Heinrich Apfelmus <[hidden email]> wrote:

> Tom Doris wrote:
>>
>>
>> If you're interested in UI work, ideally we'd have something similar
>> to RStudio as an environment, a simple set of windows encapsulating an
>> editor, a repl, a plotting panel and help/history, this sounds
>> superficial but it really has an impact when you're exploring a data
>> set and trying stuff out.
>
>
> Concerning UI, the following project suggestion aims to give GHCi a web GUI
>
>  http://hackage.haskell.org/trac/summer-of-code/ticket/1609
>
> But one of your criteria is that a good UI should come with a help system,
> too, right?

Re: Mathematics and Statistics libraries

Aleksey Khudyakov
On 25.03.2012 14:52, Tom Doris wrote:
> Hi Heinrich,
>
> If we compare the GHCi experience with R or IPython, leaving aside any
> GUIs, the help system they have at the repl level is just a lot more
> intuitive and easy to use, and you get access to the full manual
> entries. For example, compare what you see if you type :info sort into
> GHCi versus ?sort in R. R gives you a view of the full docs for the
> function, whereas in GHCi you just get the type signature.
>
Integrating Haddock documentation into GHCi would be really helpful, but
it's a GSoC project on its own.

For me the most important difference between R's repl and GHCi is that
:reload wipes all local bindings. Effectively it forces you to write
everything in a file and to avoid doing anything that cannot be fitted
into a one-liner. It may not be bad, but it's definitely a different style.

And of course data visualization. The only library I know of is Chart[1],
but I don't like its API much.

I think talking about data frames is a bit pointless unless we specify
what a data frame is. Basically there are two representations of a tabular
data structure: array of tuples or tuple of arrays. If you want the first,
go for Data.Vector.Vector YourData. If you want the second, you'll probably
end up with some HList-like data structure to hold the arrays.
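
The two representations can be sketched as follows. This is an illustration only: plain lists stand in for Data.Vector so that the code needs nothing beyond base, the record Obs and its fields are made-up names, and the column layout is written as an ordinary record, not the HList-like structure mentioned above.

```haskell
-- One observation; the type and field names are illustrative.
data Obs = Obs { obsTime :: Double, obsLabel :: String }
  deriving (Eq, Show)

-- Array of tuples: a single container of whole rows.
type RowTable = [Obs]

-- Tuple of arrays: one container per column, all of the same length.
data ColTable = ColTable { colTime :: [Double], colLabel :: [String] }
  deriving (Eq, Show)

-- Convert the row layout into the column layout.
toColumns :: RowTable -> ColTable
toColumns rows = ColTable (map obsTime rows) (map obsLabel rows)

-- Convert back; zipWith silently truncates if the columns differ in length.
toRows :: ColTable -> RowTable
toRows (ColTable ts ls) = zipWith Obs ts ls

-- A column-wise operation only touches one field in the column layout.
meanTime :: ColTable -> Double
meanTime t = sum (colTime t) / fromIntegral (length (colTime t))
```

The row layout gets its types for free from the record, while the column layout has to maintain the equal-length invariant by hand.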



[1] http://hackage.haskell.org/package/Chart


Re: Mathematics and Statistics libraries

Ketil Malde
In reply to this post by Tom Doris
Tom Doris <[hidden email]> writes:

> If you're interested in UI work, ideally we'd have something similar
> to RStudio as an environment, a simple set of windows encapsulating an
> editor, a repl, a plotting panel and help/history, this sounds
> superficial but it really has an impact when you're exploring a data
> set and trying stuff out.

I agree, this sounds really nice.

> I really disagree that we need a data frame type structure; they're an
> abomination in R, they try to accommodate event records and time
> series, and do neither well.

Just to clarify (since I think the original suggestion was mine), I
don't want to copy R's data frame (which I never quite understood,
anyway), but I'd like some standardized data structure, ideally with an
option to label columns, and functions to slice and join.  The
underlying structure can just be a list of columns (Vector) or whatever.

-k
--
If I haven't seen further, it is by standing in the footprints of giants


Re: Mathematics and Statistics libraries

Richard A. O'Keefe

On 26/03/2012, at 8:35 PM, Ketil Malde wrote:
> Just to clarify (since I think the original suggestion was mine), I
> don't want to copy R's data frame (which I never quite understood,
> anyway)

A data.frame is
 - a record of vectors all the same length
 - which can be sliced and diced like a 2d matrix

It's not unlike an SQL table (think of a column-oriented data base
so a table is really a collection of named columns, but it _looks_
like a collection of rows).
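
As a rough Haskell rendering of that description, here is a minimal sketch restricted to numeric columns. It assumes an association list of named columns and uses only base; Frame, mkFrame, column, row and sliceRows are hypothetical names, not an existing library API.

```haskell
-- A frame of named numeric columns. The constructor is meant to be kept
-- private to a module so that every Frame has columns of equal length.
newtype Frame = Frame [(String, [Double])]
  deriving (Eq, Show)

-- | Smart constructor: rejects columns of differing length.
mkFrame :: [(String, [Double])] -> Maybe Frame
mkFrame cols
  | sameLength (map (length . snd) cols) = Just (Frame cols)
  | otherwise                            = Nothing
  where
    sameLength []       = True
    sameLength (n : ns) = all (== n) ns

-- | Column access by name: the "record of vectors" view.
column :: String -> Frame -> Maybe [Double]
column name (Frame cols) = lookup name cols

-- | Row access by index: the "2d matrix" view.
row :: Int -> Frame -> Maybe [Double]
row i (Frame cols)
  | null cols || i < 0                    = Nothing
  | any (\(_, xs) -> length xs <= i) cols = Nothing
  | otherwise                             = Just [xs !! i | (_, xs) <- cols]

-- | Keep the rows in the half-open index range [from, to).
sliceRows :: Int -> Int -> Frame -> Frame
sliceRows from to (Frame cols) =
  Frame [(name, take (to - lo) (drop lo xs)) | (name, xs) <- cols]
  where
    lo = max 0 from
```

Storage is by column, yet row and sliceRows let the frame be read as if it were a collection of rows.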



Re: Mathematics and Statistics libraries

Alexander McPhail
In reply to this post by Ben Jones
On Sun, 25 Mar 2012 17:54:11 +0400, Aleksey Khudyakov <[hidden email]> wrote:
> On 25.03.2012 14:52, Tom Doris wrote:
>> Hi Heinrich,
>
> And of course data visualization. Only library I know of is Chart[1] but
> I don't like API much.

There is the plot[1] library, which provides updateable plots from the GHCi REPL and has a gnuplot-like interface.  I wrote it for this very reason: a mathematics/statistics development environment.

It uses Data.Vector.Storable, which provides compatibility with both the statistics and hmatrix packages (as well as hstatistics).

> I think talking about data frames is a bit pointless unless we specify
> what is data frame. Basically there are two representations of tabular
> data structure: array of tuples or tuple of arrays. If you want first go
> for Data.Vector.Vector YourData. If you want second you'll probably end
> up with some HList-like data structure to hold arrays.

Matrices from hmatrix are easily converted to rows or columns of Data.Vector.Storable and can be sliced and otherwise manipulated.


[1] http://hackage.haskell.org/package/plot

Vivian



Re: Mathematics and Statistics libraries

Aleksey Khudyakov
> There is the plot[1] library which provides for updateable plots from GHCi
> REPL and has a gnuplot-like interface.  I wrote it for this very reason, a
> mathematics/statistics development environment.
>
> It uses Data.Vector.Storable, which provides for compatability with both
> statistics and hmatrix packages (as well as hstatistics).

Looks very interesting. I'll try it out.


>> I think talking about data frames is a bit pointless unless we specify
>> what is data frame. Basically there are two representations of tabular
>> data structure: array of tuples or tuple of arrays. If you want first go
>> for Data.Vector.Vector YourData. If you want second you'll probably end
>> up with some HList-like data structure to hold arrays.
>>
> Matrices from hmatrix are easily converted to rows or columns of
> Data.Vector.Storable and can be sliced and otherwise manipulated.
>
That's why I said that a homogeneous data frame is simple. But if you want
to have columns that hold values of different types, they are no longer a
matrix, and things become way more interesting.
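
One simple way to see the difficulty is to wrap each column in a closed sum type. This is a sketch under that assumption, with hypothetical names (Column, colLength, numeric) and lists in place of vectors. It is not the HList-style encoding; it trades static safety for run-time checks.

```haskell
-- Each column carries its own element type; the sum type keeps the set of
-- supported element types closed.
data Column
  = NumCol  [Double]
  | TextCol [String]
  | FlagCol [Bool]
  deriving (Eq, Show)

-- The length has to be computed per constructor, since the element types differ.
colLength :: Column -> Int
colLength (NumCol xs)  = length xs
colLength (TextCol xs) = length xs
colLength (FlagCol xs) = length xs

-- Typed access can fail at run time when the named column is missing or
-- holds another type. A type-level encoding would catch this statically.
numeric :: String -> [(String, Column)] -> Either String [Double]
numeric name table = case lookup name table of
  Nothing          -> Left ("no such column: " ++ name)
  Just (NumCol xs) -> Right xs
  Just _           -> Left ("column is not numeric: " ++ name)
```

Every operation that needs numbers now has to go through something like numeric and handle the Left case, which is where a mixed-type frame stops behaving like a matrix.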


Re: Mathematics and Statistics libraries

Carter Schonwald
In reply to this post by Aleksey Khudyakov
Hey All,

There are actually a number of issues that come up with an effective data-frame-like library for Haskell, and with data vis as well (both of which I have some strong personal opinions on for Haskell, and which I'm exploring / experimenting with this spring). While folks have touched on a bunch, I just thought I'd put my own opinions in the mix.

First of all: any good data manipulation (i.e. data-frame-like) library needs support for efficiently querying subsets of the data in various ways. Not just that, it really should provide a coherent way of dealing with out-of-core data! From there you might want to ask the question: "do I want to iterate through chunks of the data" or "do I want to allow more general patterns of data access, and perhaps even ways to parallelize?". The basic point (as others have remarked since this draft email got underway) is that you essentially want to support some SQL-like selection operations, and have them be efficient too, along with playing nicely with columns of differing types.

What sort of abstractions you provide is somewhat crucial, because that in turn affects how you can write algorithms! If you look closely, this is tantamount to saying that any sufficiently well designed (industrial grade) data frame lib for Haskell might wind up leading into a model for supporting mapreduce or graphlab (http://graphlab.org/) style algorithms in the multicore / not distributed regime, though a first version would pragmatically just provide an interface with sequentially chunked data and use pipes-core or one of the other enumerator libraries. There's also some need for the aforementioned fancy types for managing data, but that's not even the real challenge (in my opinion). Probably the best lib to take ideas from is the Python pandas library, or at least that's my personal opinion.
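
The sequentially chunked interface can be illustrated with a small sketch. It assumes the data arrives as a list of in-memory chunks and uses only base, not pipes-core or an enumerator library. Summary, chunks and summarizeChunked are hypothetical names. The point is that a summary with an associative combine can be computed per chunk and merged afterwards, which is also the shape a map-reduce style API needs.

```haskell
import Data.List (foldl')

-- Running summary of a numeric column: element count and sum.
data Summary = Summary !Int !Double
  deriving (Eq, Show)

-- Merging two partial summaries; the count merges exactly, the sum up to
-- floating-point rounding.
instance Semigroup Summary where
  Summary n s <> Summary m t = Summary (n + m) (s + t)

instance Monoid Summary where
  mempty = Summary 0 0

-- Summarize one chunk with a strict left fold.
summarize :: [Double] -> Summary
summarize = foldl' step mempty
  where
    step (Summary n s) x = Summary (n + 1) (s + x)

-- Split a list into chunks of at most n elements (stand-in for reading
-- fixed-size blocks from disk).
chunks :: Int -> [a] -> [[a]]
chunks n xs
  | n <= 0 || null xs = []
  | otherwise         = let (h, t) = splitAt n xs in h : chunks n t

-- Map each chunk to a partial summary, then reduce the partial summaries.
summarizeChunked :: Int -> [Double] -> Summary
summarizeChunked n = mconcat . map summarize . chunks n

-- Mean of the summarized data, if there was any.
meanOf :: Summary -> Maybe Double
meanOf (Summary 0 _) = Nothing
meanOf (Summary n s) = Just (s / fromIntegral n)
```

Because the per-chunk results are merged with (<>), they could in principle be computed in parallel or out of order; the sketch itself runs them sequentially.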

Now in the space of data vis, probably the best example of a good library in terms of ease of getting informative (and pretty) outputs is ggplot2 (also in R). If you look there, you'll see that it's VERY much integrated with the model fitting and data analysis functionality of R, and has a very compositional approach which could be ported pretty directly over to Haskell.
However, as with a good data-frame-like library, certain obstacles come up. If we insist on a type-safe way to do things while being at least as high level as R or Python, the absence of row types for frame column names makes specifying linear models that are statically well formed (as in only referencing column names that are actually in the underlying data frame) a bit tricky. While there are approaches that work some of the time, there's not really a good general-purpose way (as far as I can tell) to solve that small problem of resolving names as early as possible. Or at the very least I don't see a simple approach that I'm happy with.

These can be summarized, I think, as follows:
  • Any "practical" data frame lib needs to interact well with out-of-core data, and ideally also simplify the task of writing algorithms on top in a way that sort of gives out-of-core goodness for free. There are a lot of different ways this could be done under the covers, perhaps using one of the libraries like reducers, enumerator or pipes-core, but it really should be invisible to the client algorithm's author, or at least invisible by default. Moreover, I think any attack in that direction is essentially a precursor to sorting out map-reduce and graphlab-like tools for Haskell.
  • Any really nice high-level data vis tool really needs to have some data analysis / machine learning style library that it's working with, and this is probably best understood by looking at things already out there, such as ggplot2 in R.
That said, I'm all ears for other folks' takes on this, especially since I'm spending some time this spring experimenting in both these directions.

cheers
-Carter
