Loading a csv file with ~200 columns into Haskell Record


Loading a csv file with ~200 columns into Haskell Record

Gurudev Devanla
Hello All,

I am in the process of replicating some code in Python in Haskell.

In Python, I load a couple of CSV files, each with more than 100 columns, into Pandas data frames. A Pandas data frame, in short, is a tabular structure that lets me perform a bunch of joins and filter out data. I generate different shapes of reports using these operations. Of course, I would love some type checking to help me with these merge and join operations as I create different reports.
 
I am not looking to replicate the Pandas data-frame functionality in Haskell. The first thing I want to do is reach for the 'record' data structure. Here are some ideas I have:

1.  I need to declare all these 100+ columns across multiple record structures.
2.  Some of the columns can have NULL/NaN values, so some of the attributes of the record structure would be 'Maybe' values. I could also drop some columns during load and cut down the number of attributes I create per record structure.
3.  Create a dictionary of each record structure which will help me index into them.
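To make point 2 concrete, here is a minimal, dependency-free sketch (the column names and types are hypothetical) of a trimmed-down record with Maybe fields, where empty or NULL cells load as Nothing. A real loader would use a CSV library such as cassava rather than this naive splitter:

```haskell
-- Hypothetical trimmed-down record for a few of the 100+ columns,
-- with Maybe fields for columns that may contain NULL/NaN.
data Trade = Trade
  { symbol   :: String
  , price    :: Maybe Double   -- column may be empty
  , quantity :: Maybe Int
  } deriving (Eq, Show)

-- Naive field splitter for illustration only; a real loader would use
-- a CSV library that handles quoting and escapes.
splitFields :: String -> [String]
splitFields s = case break (== ',') s of
  (x, ',':rest) -> x : splitFields rest
  (x, _)        -> [x]

-- Treat empty cells and "NULL" as missing values.
readMaybeField :: Read a => String -> Maybe a
readMaybeField ""     = Nothing
readMaybeField "NULL" = Nothing
readMaybeField s = case reads s of
  [(v, "")] -> Just v
  _         -> Nothing

parseTrade :: String -> Maybe Trade
parseTrade row = case splitFields row of
  [sym, p, q] -> Just (Trade sym (readMaybeField p) (readMaybeField q))
  _           -> Nothing
```

This keeps the nullable columns honest in the types without committing to any particular CSV library.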

I would like some feedback on the first 2 points. It seems like there is a lot of boilerplate code I have to write to create 100s of record attributes. Is this the only sane way to do this? What other patterns should I consider while solving such a problem?

Also, I do not want to add too many dependencies to the project, but I am open to suggestions.

Any input/advice on this would be very helpful.

Thank you for the time!
Guru

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Re: Loading a csv file with ~200 columns into Haskell Record

Leandro Ostera
Two things come to mind.

The first one is *Crazy idea, bad pitch*: generate the record code from the data.

The second is to make the records dynamically typed:

Would it be simpler to define a Column type you can parameterize with a string for its name (GADTs?) so you automatically get a type of that specific column?

That way as you read the CSV files you could define the type of the columns based on the actual column name.

Rows would then become sets of pairings of defined columns and values; perhaps a Maybe would encode that a value for a particular column is missing. You could encode these pairings as a list, too.

At least there you can have type guarantees that you’re joining fields that are of the same column type. I think.
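A minimal sketch of that idea (the names are hypothetical): tag each value with its column name at the type level via a type-level string, so that two values can only be compared or joined when they come from the same column:

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}
import GHC.TypeLits (Symbol)

-- A value tagged with its column name at the type level.
data Column (name :: Symbol) a where
  Column :: a -> Column name a

getValue :: Column name a -> a
getValue (Column x) = x

-- A join predicate that only type-checks when both sides come from the
-- same column (same name, same value type).
matches :: Eq a => Column name a -> Column name a -> Bool
matches a b = getValue a == getValue b
```

With this, `matches (Column 7 :: Column "customer_id" Int) (Column "x" :: Column "name" String)` is rejected at compile time, which is the guarantee described above.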

Either way, my 2 cents and keep it up!



Re: Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda
If your data originates from a DB, read the DB schema and use code-gen or TH to generate your record structure. Please confirm beforehand that your Haskell data pipeline can handle records with 100+ fields. I have a strange feeling that some library or other is going to break at the 64-field mark.

If you don't have access to the underlying DB, read the CSV header and code-gen your data structures. This will still lead to a lot of boilerplate, because your code-gen script will need to maintain a col-name<>data-type mapping. See if you can peek at the first row of the data and take an educated guess at each column's data type based on its value. This will not be 100% accurate, but you can get good results by manually specifying only a few data types instead of all 100+.
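A rough sketch of such a guessing helper for a code-gen script might look like this (the NULL/NaN markers and the returned type names are assumptions; ambiguous columns would still need manual overrides):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Text.Read (readMaybe)

-- Guess a Haskell field type from one sample cell of a column.
-- Deliberately simplistic: a real script might sample several rows.
guessFieldType :: String -> String
guessFieldType cell
  -- A missing sample tells us nothing; assume a nullable numeric column.
  | cell `elem` ["", "NULL", "NaN"]      = "Maybe Double"
  | Just (_ :: Int)    <- readMaybe cell = "Int"
  | Just (_ :: Double) <- readMaybe cell = "Double"
  | otherwise                            = "Text"
```

The generator would then emit one record field per CSV header, pairing each column name with the guessed type.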

-- Saurabh.


Re: Loading a csv file with ~200 columns into Haskell Record

Imants Cekusins
> Is this the only sane way to do this?

Would Mutable Vector per column do?
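As a rough illustration of that column-oriented layout (the table and column names are hypothetical; IOUArray from the GHC-bundled array package stands in here for Data.Vector.Unboxed.Mutable to keep the sketch dependency-free):

```haskell
import Data.Array.IO

-- One mutable unboxed array per column, instead of one record per row.
data PriceTable = PriceTable
  { bidCol :: IOUArray Int Double
  , askCol :: IOUArray Int Double
  }

-- Allocate all columns for n rows, zero-initialised.
newPriceTable :: Int -> IO PriceTable
newPriceTable n =
  PriceTable <$> newArray (0, n - 1) 0 <*> newArray (0, n - 1) 0

-- Fill one row across all columns.
writeRow :: PriceTable -> Int -> Double -> Double -> IO ()
writeRow t i bid ask = do
  writeArray (bidCol t) i bid
  writeArray (askCol t) i ask
```

Column-wise operations (sums, filters) then scan one tight array instead of hopping across 200-field records.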



Re: Loading a csv file with ~200 columns into Haskell Record

Anthony Cowley



The Frames package generates a vinyl record based on your data (like an HList, with a functor parameter that can be Maybe to support missing data), storing each column in a vector for very good runtime performance. As you get past 100 columns, you may encounter compile-time performance issues. If you have a sample data file you can make available, I can help diagnose performance troubles.
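For a flavor of what Frames builds on, here is a toy, self-contained miniature of a vinyl-style record (not the actual vinyl API): a heterogeneous list whose fields all sit inside a functor f, so choosing f = Maybe makes every column optional at once instead of declaring 200 Maybe fields by hand:

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
import Data.Kind (Type)

-- A heterogeneous list indexed by its field types, with every field
-- wrapped in a functor f (vinyl's Rec works along these lines).
data Rec (f :: Type -> Type) (ts :: [Type]) where
  RNil :: Rec f '[]
  (:&) :: f t -> Rec f ts -> Rec f (t ': ts)
infixr 5 :&

-- The type of the first field is tracked statically.
rhead :: Rec f (t ': ts) -> f t
rhead (x :& _) = x

-- A row with an Int, String and Double column; the String is missing.
row :: Rec Maybe '[Int, String, Double]
row = Just 42 :& Nothing :& Just 1.5 :& RNil
```

The long type-level lists this produces for wide tables are also where the compile-time costs mentioned above come from.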

Anthony



Re: Loading a csv file with ~200 columns into Haskell Record

Gurudev Devanla
Thank you all for your helpful suggestions. As I wrote the original question, I was already trying to decide between using records to represent each row, or defining a vector for each column so that each vector becomes an attribute of a record. I was leaning towards the latter given the performance needs.

Since the data is currently available as a CSV, adding Persistent or any ORM library would be an extra dependency.

I was trying to solve this problem without too many dependencies on other libraries and without having to learn new DSLs. It's a tempting time killer, as everyone here would understand.

@Anthony Thank you for your answer as well. I have explored the Frames library in the past while looking for Pandas-like features in Haskell. The library is useful and I have played around with it, but I was never confident in adopting it for a serious project. Part of my reluctance is the learning curve, plus I would also need to familiarize myself with `lens`. But it looks like the project I have in hand is good motivation to do both. I will try to use Frames and then report back. Also, apologies for not being able to share the data I am working on.

With the original question, what I was trying to get at is how these kinds of problems are solved in real-world projects, such as when Haskell is used in data mining or in financial applications. I believe those applications deal with this kind of data, where the tables are wide. Not having something I can quickly start off with troubles me and makes me wonder whether the reason is my lack of understanding or just the pain of using static typing.

Regards






Re: Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda
> Having to not have something which I can quickly start off on

What do you mean by that? And what precisely is the discomfort between Haskell and Python for your use-case?


Re: Loading a csv file with ~200 columns into Haskell Record

Gurudev Devanla
Did not mean to complain. For example, using the data frame library in Pandas did not involve a big learning curve: with basic knowledge of Python it was easy to learn and start using it. Trying to replicate that kind of program in Haskell seems a lot more difficult for me. The leap from dynamic typing to static typing does involve this kind of boilerplate, and I am fine with that. But when I reach out to the libraries in use, it takes a lot of learning of each library's syntax, special operators, etc. to get stuff done.
I understand that this is the philosophy espoused by the Haskell community, but it takes up a lot of the spare time I have to both learn and build my toy projects. I love coding in Haskell, but that love takes a lot of time before it translates into any good code I can show. It could be just me.

Again, I am happy to do this out of my love for Haskell. But I am hesitant to recommend it to other team members, since it is difficult for me to quantify the gains. And I say this with limited experience building real-world Haskell applications, so my train of thought may be totally misguided.


Re: Loading a csv file with ~200 columns into Haskell Record

Neil Mayhew

On 2017-10-01 07:55 PM, Guru Devanla wrote:

> Having to not have something which I can quickly start off on troubles me and makes me wonder if the reason is my lack of understanding or just the pain of using static typing.

Something, somewhere, has to keep track of the type of each column, and since the data doesn’t carry that itself, you have to store it somewhere else. That could be another data file of some kind loaded at runtime, but then you would lose the benefit of static type checking by the compiler. So it’s better to have it in source code, even if that is generated by TH or some other process.

I recommend taking a look at the Cassava library. You can do some pretty neat things with it, including defining your own mapping from rows to records. In particular, if you need only a small subset of the 100 columns, you can provide a (de)serializer that looks at only the columns it needs. The library reads the row into a vector of Text, and your serialization code works with just the elements it needs. You could even have different record types (and associated serializers) for different tasks, all working off the same input record, since the serialization methods come from a typeclass and each record type can be a different instance of that class.

Cassava supports Applicative, which makes for some very succinct code, and it can make use of a header record at the start of the data. Here’s an example:

{-# LANGUAGE OverloadedStrings #-}

import Data.Csv                  -- cassava: FromField, ToField, (.:), (.=), namedRecord, …
import qualified Data.ByteString.Char8 as B
import Data.Fixed (Centi)
import Data.Text (Text)
import Data.Time.Calendar (Day)  -- FromField/ToField instances for Day assumed defined elsewhere

data Account = Business | Visa | Personal | Cash | None
    deriving (Eq, Ord, Show, Read, Enum, Bounded)

instance FromField Account where
    parseField f
        | f == "B"  = pure Business
        | f == "V"  = pure Visa
        | f == "P"  = pure Personal
        | f == "C"  = pure Cash
        | f == "CC" = pure Visa
        | f == ""   = pure None
        | otherwise = fail $ "Invalid account type: \"" ++ B.unpack f ++ "\""

instance ToField Account where
    toField Business  = "B"
    toField Visa      = "V"
    toField Personal  = "P"
    toField Cash      = "C"
    toField None      = ""

type Money = Centi

data Transaction = Transaction
    { date :: Day
    , description :: Text
    , category :: Text
    , account :: Account
    , debit :: Maybe Money
    , credit :: Maybe Money
    , business :: Money
    , visa :: Money
    , personal :: Money
    , cash :: Money }
    deriving (Eq, Ord, Show, Read)

instance FromNamedRecord Transaction where
    parseNamedRecord r = Transaction <$>
        r .: "Date" <*>
        r .: "Description" <*>
        r .: "Category" <*>
        r .: "Account" <*>
        r .: "Debit" <*>
        r .: "Credit" <*>
        r .: "Business" <*>
        r .: "Visa" <*>
        r .: "Personal" <*>
        r .: "Cash"

instance ToNamedRecord Transaction where
    toNamedRecord r = namedRecord [
        "Date" .= date r,
        "Description" .= description r,
        "Category" .= category r,
        "Account" .= account r,
        "Debit" .= debit r,
        "Credit" .= credit r,
        "Business" .= business r,
        "Visa" .= visa r,
        "Personal" .= personal r,
        "Cash" .= cash r]

Note that the code doesn’t assume fixed positions for the different columns, nor a total number of columns in a row, because it indirects through the column headers. There could be 1000 columns and the code wouldn’t care.



Re: Loading a csv file with ~200 columns into Haskell Record

Anthony Cowley



The pain is that of a rock yet to be smoothed by a running current: it is neither your lack of understanding nor something inherent to static typing. I ask for a sample file because the only way we can improve is through contact with real-world use. I can say that Frames has been demonstrated to give performance neck and neck with Pandas, in conjunction with greatly reduced (i.e. an order of magnitude less) memory use. You also get the confidence of writing transformation and reduction functions whose types are consistent with your actual data, and that consistency can be verified as you type by tooling like Intero.

Your concerns are justified: the problem with using Haskell for data processing is that, without attempts like Frames, you still have a disconnect between the types that characterize your data and the types delineating your program code. Add to this the comparative dearth of statistical analysis and plotting options in Haskell versus R or Python, and you can see that Haskell only makes sense if you want to use it for other reasons (e.g. familiarity, or integration with streaming or server libraries, where the Haskell ecosystem is healthy). In the realm of data analysis, you are taking a risk choosing Haskell, but it is not a thoughtless risk. The upside is compiler-verified safety, and runtime performance informed by that compile-time work.

So I’ll be happy if you can help improve the Frames story, but it is certainly a story still in progress.

Anthony 




On Sun, Oct 1, 2017 at 1:58 PM, Anthony Cowley <[hidden email]> wrote:


> On Sep 30, 2017, at 9:30 PM, Guru Devanla <[hidden email]> wrote:
>
> Hello All,
>
> I am in the process of replicating some code in Python in Haskell.
>
> In Python, I load a couple of csv files, each file having more than 100 columns into a Pandas' data frame. Panda's data-frame, in short is a tabular structure which lets me performs on bunch of joins, and filter out data. I generated different shapes of reports using these operations. Of course, I would love some type checking to help me with these merge, join operations as I create different reports.
>
> I am not looking to replicate the Pandas data-frame functionality in Haskell. First thing I want to do is reach out to the 'record' data structure. Here are some ideas I have:
>
> 1.  I need to declare all these 100+ columns into multiple record structures.
> 2.  Some of the columns can have NULL/NaN values. Therefore, some of the attributes of the record structure would be 'MayBe' values. Now, I could drop some columns during load and cut down the number of attributes i created per record structure.
> 3.  Create a dictionary of each record structure which will help me index into into them.'
>
> I would like some feedback on the first 2 points. Seems like there is a lot of boiler plate code I have to generate for creating 100s of record attributes. Is this the only sane way to do this?  What other patterns should I consider while solving such a problem.
>
> Also, I do not want to add too many dependencies into the project, but open to suggestions.
>
> Any input/advice on this would be very helpful.
>
> Thank you for the time!
> Guru

The Frames package generates a vinyl record based on your data (like hlist; with a functor parameter that can be Maybe to support missing data), storing each column in a vector for very good runtime performance. As you get past 100 columns, you may encounter compile-time performance issues. If you have a sample data file you can make available, I can help diagnose performance troubles.

Anthony
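The column-per-vector layout described above (one vector per column, with a functor parameter such as Maybe marking missing cells) can be approximated without any dependencies. This is only a sketch of the idea, not Frames' actual API: the names are illustrative, plain lists stand in for real Vectors, and the functor is fixed to Maybe:

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- One list per column, keyed by header name; Nothing marks NULL/NaN cells.
-- (Frames stores unboxed Vectors; lists keep this sketch dependency-free.)
type Frame = [(String, [Maybe Double])]

-- Transpose parsed rows into columns; unparseable cells become Nothing.
fromRows :: [String] -> [[String]] -> Frame
fromRows header rows =
  [ (name, [ readMaybe (row !! i) | row <- rows ])
  | (i, name) <- zip [0 ..] header ]

-- Sum a column, skipping missing cells.
columnSum :: String -> Frame -> Maybe Double
columnSum name f = sum . mapMaybe id <$> lookup name f

main :: IO ()
main = do
  let f = fromRows ["visa", "cash"] [["12.5", "10"], ["bad", "3"]]
  print (columnSum "cash" f)  -- Just 13.0
  print (columnSum "visa" f)  -- Just 12.5 ("bad" parses to Nothing, skipped)
```

The columnar layout is what makes whole-column reductions like `columnSum` cache-friendly once the lists are replaced with real vectors.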





Re: Loading a csv file with ~200 columns into Haskell Record

Gurudev Devanla
Yes, I am totally in agreement. My motivation to replicate this project and demonstrate the power of Haskell in these scenarios boils down to the two reasons you rightly mentioned:

>> You also get the confidence of writing transformation and reduction functions whose types are consistent with your actual data,

Just this aspect makes me lose sleep looking at Python code. I crave such guarantees at compile time, and that is why I am replicating this implementation in Haskell. I am sure I will get this guarantee in Haskell. *But at what cost is what I am in the process of understanding.*

>>  The upside is compiler-verified safety, and runtime performance informed by that compile-time work.

I agree with the compiler-verified safety. I will have to prove the performance part of this exercise to myself.

I was not able to share the data due to licensing restrictions, but I will get in touch with you offline once I am at a point where I can share some stats. Thank you very much for your input and for the effort you have been putting into Frames.

Regards
Guru








Re: Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda
In reply to this post by Gurudev Devanla
I wholeheartedly agree with your sentiment. I felt the same way in my initial days, and only my stubborn head prevented me from giving up on Haskell [1]

Haskell is **unnecessarily** hard. It doesn't have to be that way. Stop beating yourself up over what is essentially a tooling, API design, and documentation problem. Start speaking up instead.

Wrt the current problem at hand, try thinking of the types as a **spec** rather than boilerplate. That spec is necessary to give you your compile-time guarantees; without the spec the compiler can't do anything. This spec is non-existent in Python.

Also, be sure of what exactly the warm fuzzy feeling is that the compiler is giving you. From what you have described, most of your bugs are going to occur when you change your data transformation pipeline (core logic) or your CSV format. Compilation and static types will help with only one of those.



On 02-Oct-2017 8:20 AM, "Guru Devanla" <[hidden email]> wrote:
Did not mean to complain. For example, using the DataFrame library in Pandas did not involve a big learning curve: with basic knowledge of Python it was easy to learn and start using. Trying to replicate that kind of program in Haskell seems a lot more difficult for me. The leap from dynamic typing to static typing does involve this kind of boilerplate, and I am fine with it. But when I try to reach out to the libraries in use, it involves a lot of learning of library syntax, special operators, etc. to get stuff done.
I understand that is the philosophy espoused by the Haskell community, but it takes up a lot of the spare time I have to both learn and build my toy projects. I love coding in Haskell. But that love takes a lot of time before it translates into any good code I can show. It could be just me.

Again, I am happy to do this out of my love for Haskell. But I am hesitant to recommend it to other team members, since it is difficult for me to quantify the gains. And I say this with limited experience building real-world Haskell applications, so my train of thought may be totally misguided.

On Sun, Oct 1, 2017 at 7:22 PM, Saurabh Nanda <[hidden email]> wrote:
Having to not have something which I can quickly start off on

What do you mean by that? And what precisely is the  discomfort between Haskell vs python for your use-case? 





Re: Loading a csv file with ~200 columns into Haskell Record

Gurudev Devanla
Yes, thank you for the encouraging words. I will keep at it.

>> Also, be sure of what exactly is the warm fuzzy feeling that the compiler is giving you. From whatever you have described, most of your bugs are going to occur when you change your data transformation pipeline (core logic) or your CSV format. Compilation and static types will help in only one of those.

Yes, I am aware of that. I have tests for the core logic, but the mechanical part of type-checking the data that passes through this pipeline is much desired.





Re: Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda
Do evaluate the option of peeking at the first few rows of the CSV and generating the types via code generation. This will allow your transformation pipeline to fail fast if your CSV format changes.
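A lighter runtime variant of the same idea is to record the expected header in one place and refuse to run when the file diverges from it. The column names below are hypothetical, and `checkHeader` is a sketch, not part of any library:

```haskell
import Data.List ((\\))

-- Columns the pipeline's record type was generated from (hypothetical names).
expectedColumns :: [String]
expectedColumns = ["Date", "Business", "Visa", "Personal", "Cash"]

-- Fail fast if the file's header no longer matches the compiled-in spec.
checkHeader :: [String] -> Either String ()
checkHeader actual
  | null missing && null extra = Right ()
  | otherwise =
      Left ("CSV format changed; missing: " ++ show missing
              ++ ", unexpected: " ++ show extra)
  where
    missing = expectedColumns \\ actual
    extra   = actual \\ expectedColumns

main :: IO ()
main =
  case checkHeader ["Date", "Visa", "Personal", "Cash", "Tips"] of
    Left err -> putStrLn err  -- a real pipeline would call System.Exit.die
    Right () -> putStrLn "header ok"
```

Running this check before any rows are parsed turns a silent schema drift into an immediate, descriptive failure.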

On 02-Oct-2017 8:27 PM, "Guru Devanla" <[hidden email]> wrote:
Yes, Thank you for the encouraging words. I will keep at it.

>> Also, be sure of what exactly is the warm fuzzy feeling that the compiler is giving you. From whatever you have described, most of your bugs are going to occur when you change your data transformation pipeline (core logic) or your CSV format. Compilation and static types will help in only one of those.

Yes, i am aware of that. I have tests for the core logic, but the mechanical part of type checking the data that passes through this pipeline is much desired.




On Sun, Oct 1, 2017 at 10:00 PM, Saurabh Nanda <[hidden email]> wrote:
I whole heartedly agree with your sentiment. I have felt the same way in my initial days, and only my stubborn head prevented me from giving up on Haskell [1]

Haskell is **unnecessarily** hard. It doesn't have to be that way. Stop beating yourself up over what is essentially a tooling, API design, and documentation problem. Start speaking up instead. 

Wrt the current problem at hand, try thinking of the types as a **spec** rather than boilerplate. That spec is necessary to give you your compile time guarantees. Without the spec the compiler can't do anything. This spec is non-existent in python. 

Also, be sure of what exactly is the warm fuzzy feeling that the compiler is giving you. From whatever you have described, most of your bugs are going to occur when you change your data transformation pipeline (core logic) or your CSV format. Compilation and static types will help in only one of those. 



On 02-Oct-2017 8:20 AM, "Guru Devanla" <[hidden email]> wrote:
Did not mean to complain. For example, being able to use Data Frame library in Pandas, did not involve a big learning curve to understand the syntax of Pandas. With the basic knowledge of Python is was easy to learn and start using it.  Trying, to replicate that kind of program in Haskell seems to be a lot difficult for me. For example,  the leap from dynamic typing to static typing does involve this kind of boiler plate an I am fine with it. Now, when I try to reach out to the libraries in use, it involves a lot of learning of the library syntax/special operators etc to get stuff done. 
I understand that is the philosophy eschewed by Haskell community, but it takes up a lot of the spare time I have to both learn and also build my toy projects. I love coding in Haskell. But, that love takes a lot of time before it translates to any good code I  can show. It could be just me.

Again, I am happy to do this out of my love for Haskell. But, I am hesitant to recommend that to other team members since it is difficult for me to quantify the gains. And I say this with limited experience building real world Haskell applications and therefore my train of thought is totally mis-guided.

On Sun, Oct 1, 2017 at 7:22 PM, Saurabh Nanda <[hidden email]> wrote:
Having to not have something which I can quickly start off on

What do you mean by that? And what precisely is the  discomfort between Haskell vs python for your use-case? 

On 02-Oct-2017 7:29 AM, "Guru Devanla" <[hidden email]> wrote:
Thank you all for your helpful suggestions. As I wrote the original question, even I was trying to decide between the approach of using Records to represent each row or  define a vector for each column and each vector becomes an attribute of the record.  Even, I was leaning towards the latter given the performance needs.

Since, the file is currently available as a CSV adding Persistent and any ORM library would be an added dependency.

I was trying to solve this problem without too many dependencies of other libraries and wanting to learn new DSLs. Its a tempting time killer as everyone here would understand.

@Anthony Thank your for your answer as well. I have explored Frames library in the past as I tried to look for Pandas like features in Haskell The library is useful and I have played around with it. But, I was never confident in adopting it for a serious project. Part of my reluctance, would be the learning curve plus I also need to familiarize myself with `lens` as well. But, looks like this project I have in hand is a good motivation to do both. I will try to use Frames and then report back. Also, apologies for not being able to share the data I am working on.

With the original question, what I was trying to get to is, how are these kinds of problems solved in real-world projects. Like when Haskell is used in data mining, or in financial applications. I believe these applications deal with this kind of data where the tables are wide. Having to not have something which I can quickly start off on troubles me and makes me wonder if the reason is my lack of understanding or just the pain of using static typing.

Regards


On Sun, Oct 1, 2017 at 1:58 PM, Anthony Cowley <[hidden email]> wrote:


> On Sep 30, 2017, at 9:30 PM, Guru Devanla <[hidden email]> wrote:
>
> Hello All,
>
> I am in the process of replicating some code in Python in Haskell.
>
> In Python, I load a couple of csv files, each file having more than 100 columns into a Pandas' data frame. Panda's data-frame, in short is a tabular structure which lets me performs on bunch of joins, and filter out data. I generated different shapes of reports using these operations. Of course, I would love some type checking to help me with these merge, join operations as I create different reports.
>
> I am not looking to replicate the Pandas data-frame functionality in Haskell. First thing I want to do is reach out to the 'record' data structure. Here are some ideas I have:
>
> 1.  I need to declare all these 100+ columns into multiple record structures.
> 2.  Some of the columns can have NULL/NaN values. Therefore, some of the attributes of the record structure would be 'MayBe' values. Now, I could drop some columns during load and cut down the number of attributes i created per record structure.
> 3.  Create a dictionary of each record structure which will help me index into into them.'
>
> I would like some feedback on the first 2 points. Seems like there is a lot of boiler plate code I have to generate for creating 100s of record attributes. Is this the only sane way to do this?  What other patterns should I consider while solving such a problem.
>
> Also, I do not want to add too many dependencies into the project, but open to suggestions.
>
> Any input/advice on this would be very helpful.
>
> Thank you for the time!
> Guru

The Frames package generates a vinyl record type based on your data (an HList-like record with a functor parameter that can be Maybe, to support missing data), storing each column in a vector for very good runtime performance. As you get past 100 columns, you may encounter compile-time performance issues. If you have a sample data file you can make available, I can help diagnose performance troubles.
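[For intuition, the column-per-vector layout described above can be sketched without Frames at all. The sketch below uses plain lists in place of unboxed vectors and two hypothetical columns; it illustrates the storage idea only, and is not Frames's actual API.]

```haskell
import Data.Maybe (catMaybes)

-- Column-oriented storage: one list per column, rather than a list of
-- per-row records. Frames stores these as vectors; lists keep the
-- sketch dependency-free.
data Frame = Frame
  { colName :: [String]     -- a column with no missing values
  , colAge  :: [Maybe Int]  -- a column that may contain NULL/NaN
  }

-- Filtering inspects only the column it needs, then keeps the rows at
-- the surviving positions in every column.
filterByAge :: (Int -> Bool) -> Frame -> Frame
filterByAge p (Frame names ages) = Frame (pick names) (pick ages)
  where
    keep   = [ maybe False p a | a <- ages ]
    pick c = [ x | (x, k) <- zip c keep, k ]

-- The mean of a column, ignoring missing entries.
meanAge :: Frame -> Maybe Double
meanAge f = case catMaybes (colAge f) of
  [] -> Nothing
  xs -> Just (fromIntegral (sum xs) / fromIntegral (length xs))
```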

Anthony




_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.





Re: Loading a csv file with ~200 columns into Haskell Record

Mario Blazevic-2
In reply to this post by Gurudev Devanla
On 2017-09-30 09:30 PM, Guru Devanla wrote:

> ...
> I am not looking to replicate the Pandas data-frame functionality in
> Haskell. First thing I want to do is reach out to the 'record' data
> structure. Here are some ideas I have:
>
> 1.  I need to declare all these 100+ columns into multiple record
> structures.
> 2.  Some of the columns can have NULL/NaN values. Therefore, some of the
> attributes of the record structure would be 'MayBe' values. Now, I could
> drop some columns during load and cut down the number of attributes i
> created per record structure.
> 3.  Create a dictionary of each record structure which will help me
> index into into them.'
>
> I would like some feedback on the first 2 points. Seems like there is a
> lot of boiler plate code I have to generate for creating 100s of record
> attributes. Is this the only sane way to do this?  What other patterns
> should I consider while solving such a problem.


I can only offer a suggestion for point #2. Have a look at the README for the rank2classes package. You'd still need to generate the boilerplate code for the 100+ record fields, but only once.

http://hackage.haskell.org/package/rank2classes
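[The core idea behind that suggestion — declare the record once, parameterized by a functor, and reuse it for both the raw load and the validated result — can be sketched with nothing but base. The two-field record and the names `Row`, `rowName`, `rowAge` below are hypothetical; the real record would list all 100+ columns, and rank2classes derives the per-field traversal plumbing that is written out by hand here.]

```haskell
import Data.Functor.Identity (Identity (..))

-- One declaration serves every stage of the pipeline: the functor
-- parameter f wraps each field.
data Row f = Row
  { rowName :: f String
  , rowAge  :: f Int
  }

-- Row Maybe:    as parsed from CSV, any field may be missing.
-- Row Identity: fully validated, every field guaranteed present.
validate :: Row Maybe -> Maybe (Row Identity)
validate r =
  Row <$> (Identity <$> rowName r)
      <*> (Identity <$> rowAge r)
-- rank2classes generalizes this per-field plumbing (roughly, a
-- rank-2 traverse), so it need not be spelled out for 100+ fields.
```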