Simple data summarization

Andy Elvey
Hi all -

In the process of learning Haskell I'm wanting to do some simple data
summarization.
( Btw, I'm looking at putting any submitted code for this in the
"cookbook" section of
the Haskell wiki.  Imo it would be very useful there as a "next step" up
from just reading
in a file and printing it out.  )

This would involve reading in a delimited file like this - ( just a
contrived example of how many books
some people own ) -

Name,Gender,Age,Ethnicity,Books
Mary,F,14,NZ European, 11
Brian,M,13,NZ European, 6
Josh,M,12,NZ European, 14
Regan,M,14,NZ Maori, 9
Helen,F,15,NZ Maori, 17
Anna,F,14,NZ European, 16
Jess,F,14,NZ Maori, 21

.... and doing some operations on it.
As you can see, the file has column headings - I prefer to be able to
manipulate data with
headings (as it is what I do a lot of at work, using another programming
language).

I've tried to break the problem down into small parts as follows.
a) Read the file into a list of pairs.
The first element of the pair would be the column heading.
The second will be a list containing the data.
For example, ("Name", ["Mary", "Brian", "Josh", "Regan", ...])

b) Select a numeric variable to summarise ( "Books" in this example)
c) Do a fold to summarize the variable. I think a left-fold would be the
one to use here, but I may
be wrong....
 
After looking through previous postings on this list, I found some code
which is somewhat similar to what I'm after (although the data it was
crunching is very different).  This is what I've come up with so far -

summarize [] = []
summarize ls = let
        byvariable = head ls
        numeric_variable = last ls
        sum = foldl (+) 0 $ numeric_variable

    in (byvariable, sum) : sum ls

main = interact (unlines . map show . summarize . lines)

I think this might be a useful start, but I still need to read the data
into a list of pairs as mentioned, and I'm unsure as to how to
do that.

Many thanks in advance for any help received.  As mentioned, I'm sure
that examples like this could be very useful to other beginners, so I'm
keen to make sure that any help given is made maximum use of (by putting
any code on the Haskell wiki).
- Andy

Roland Zumkeller
Hi Andy,

Here is a function that parses a comma-separated list of strings:

> uncommas :: String -> [String]
> uncommas s = case break (==',') s of
>               (w,[]) -> [w]
>               (w,_:s') -> w : uncommas s'

We can then sum over the fifth column (index 4, since (!!) counts from zero) like this:

> main = putStrLn . show . sum . map (read . (!!4) . uncommas)
>        . tail . lines =<< getContents

This program is best read backwards: "getContents" gives stdin as a
string, "lines" breaks it into lines, and "tail" drops the header
line. The (!!) function yields the nth element of a list, counting
from zero. "read" and "show" convert between strings and integers.
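The same pipeline can be sketched with the column index as a parameter. This is only an illustration: the names splitOnComma and sumColumn are made up here, and it assumes every data line has at least i+1 comma-separated fields and that the chosen column holds whole numbers.

```haskell
-- Split a line on commas (the same idea as uncommas).
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (w, [])     -> [w]
  (w, _ : s') -> w : splitOnComma s'

-- Sum the column at zero-based index i, skipping the header line.
-- drop 1 is used instead of tail so that empty input gives 0 rather
-- than an error. read tolerates the leading space in fields like " 11".
sumColumn :: Int -> String -> Integer
sumColumn i = sum . map (read . (!! i) . splitOnComma) . drop 1 . lines
```

It could be hooked up to stdin with something like main = print . sumColumn 4 =<< getContents.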

Best,

Roland

Thomas Davie

On 10 Mar 2009, at 16:26, Roland Zumkeller wrote:

> Hi Andy,
>
> Here is a function that parses a comma-separated list of strings:
>
>> uncommas :: String -> [String]
>> uncommas s = case break (==',') s of
>>              (w,[]) -> [w]
>>              (w,_:s') -> w : uncommas s'
>
> We can then sum over the 4th column like this:
>
>> main = putStrLn . show . sum . map (read . (!!4) . uncommas)
>>       . tail . lines =<< getContents
>
> This program is best read backwards: "getContents" gives stdin as a
> string and "lines" breaks it into lines. The (!!) function yields the
> nth element of a list. "read" and "show" convert between strings and
> integers.

An alternative, though similar, solution is to define a data type
for the records and implement Read for it:

import Data.Char (toLower)
import Data.List (intersperse)
import Data.Maybe (fromMaybe, listToMaybe)

data Gender = Male | Female

data Ethnicity = European | Maori | ..........

data Record = R {name :: String, gender :: Gender, age :: Int,
                 ethnicity :: Ethnicity, books :: Int}

instance Read Gender where
   -- match on the field text itself; the file has bare M and F
   readsPrec _ s = case map toLower s of
     "m" -> [(Male,"")]
     "f" -> [(Female,"")]
     _   -> []

instance Read Ethnicity where
   ...

instance Read Record where
   readsPrec _ = buildRec . uncommas
     where
       buildRec [n,g,a,e,b] =
         fromMaybe [] $
           do (g',_) <- listToMaybe $ reads g
              (a',_) <- listToMaybe $ reads a
              (e',_) <- listToMaybe $ reads e
              (b',_) <- listToMaybe $ reads b
              -- the name is already a plain String, so it needs no parsing
              return [(R n g' a' e' b', "")]
       buildRec _ = []   -- wrong number of fields: no parse

Now you can get at just the names, for example, by mapping the getter
over the list (tail skips the header line, which is not a record):

main = putStrLn . ("Names: " ++) . concat . intersperse " " . map
       (name . read) . tail . lines =<< getContents
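For comparison, here is a rough, self-contained sketch of the same record idea that returns Maybe instead of going through Read instances. It is not the code above: the constructors NZEuropean and NZMaori and the helpers trim, fields, parseGender, parseEthnicity, parseRecord and names are all assumptions made for this example, and only the two ethnicities in the sample data are covered.

```haskell
import Data.Char (isSpace, toLower)
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

data Gender = Male | Female deriving (Show, Eq)

-- Assumption: only the two ethnicities in the sample data are modelled.
data Ethnicity = NZEuropean | NZMaori deriving (Show, Eq)

data Record = R { name :: String, gender :: Gender, age :: Int
                , ethnicity :: Ethnicity, books :: Int } deriving (Show, Eq)

-- Remove leading and trailing whitespace from a field.
trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Split a line on commas.
fields :: String -> [String]
fields s = case break (== ',') s of
  (w, [])     -> [w]
  (w, _ : s') -> w : fields s'

parseGender :: String -> Maybe Gender
parseGender s = case map toLower (trim s) of
  "m" -> Just Male
  "f" -> Just Female
  _   -> Nothing

parseEthnicity :: String -> Maybe Ethnicity
parseEthnicity s = case trim s of
  "NZ European" -> Just NZEuropean
  "NZ Maori"    -> Just NZMaori
  _             -> Nothing

-- Parse one data line; Nothing if any field is missing or malformed.
parseRecord :: String -> Maybe Record
parseRecord line = case fields line of
  [n, g, a, e, b] -> R (trim n) <$> parseGender g <*> readMaybe a
                       <*> parseEthnicity e <*> readMaybe b
  _               -> Nothing

-- Names of all well-formed records, skipping the header line.
names :: String -> [String]
names = map name . mapMaybe parseRecord . drop 1 . lines
```

Returning Maybe means a malformed line is skipped by mapMaybe rather than crashing the program, which is the main difference from calling read on every line.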

Bob

Patrick LeBoutillier
In reply to this post by Andy Elvey
Andy,

I came up with this solution that works like you described:


import Data.List.Split (wordsBy)   -- from the "split" package

-- split one line into its comma-separated fields
mysplit :: String -> [String]
mysplit = wordsBy (==',')

-- turn the header line plus data rows into (heading, column values) pairs
toPairs :: [String] -> [(String, [String])]
toPairs [] = []
toPairs (header:rows) = foldr f (initPairs header) $ splitRows rows
    where f row acc = zipWith (\x (h,r) -> (h,x:r)) row acc
          initPairs hdr = map (\h -> (h, [])) $ mysplit hdr
          splitRows = map mysplit

-- look up the column called var and fold agg over its values
summarizeByWith :: String -> (Int -> Int -> Int) -> [(String, [String])]
                -> (String, Int)
summarizeByWith var agg pairs = case lookup var pairs of
    Just vals -> (var, foldl agg 0 $ map read vals)
    Nothing   -> ("", 0)

main = interact (show . summarizeByWith "Books" (+) . toPairs . lines)


That said, in my opinion a solution like the one Roland proposed is
preferable, since it can process the input line by line instead of
storing it all in memory. It also seems simpler and probably more
efficient.

Still, it was interesting hacking at your algorithm because it made
me realize how you can use lists of pairs (association lists) in
Haskell where you might have used hash tables in another language.
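To make the association-list point concrete, here is a small sketch that needs no extra packages. It is an alternative to the foldr version, not a restatement of it: the columns are built with transpose, and the names fields, columns, columnMap and sumOf are invented for this example. It assumes the numeric column contains only whole numbers.

```haskell
import Data.List (transpose)
import qualified Data.Map as Map

-- Split a line on commas.
fields :: String -> [String]
fields s = case break (== ',') s of
  (w, [])     -> [w]
  (w, _ : s') -> w : fields s'

-- Build (heading, column) pairs by transposing the data rows.
-- Appending repeat [] gives every heading a (possibly empty) column
-- even when there are no data rows at all.
columns :: [String] -> [(String, [String])]
columns []              = []
columns (header : rows) =
  zip (fields header) (transpose (map fields rows) ++ repeat [])

-- The same data as a Map, closer to a hash table in other languages.
columnMap :: [String] -> Map.Map String [String]
columnMap = Map.fromList . columns

-- Sum a named numeric column; Nothing if the heading is absent.
sumOf :: String -> [(String, [String])] -> Maybe Int
sumOf var = fmap (sum . map read) . lookup var
```

lookup walks the list from the front, which is fine for five columns; Data.Map (from the containers package that ships with GHC) is the usual choice once there are many keys.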


Cheers,

Patrick



On Tue, Mar 10, 2009 at 4:33 AM, Andy Elvey <[hidden email]> wrote:

> [...]



--
=====================
Patrick LeBoutillier
Rosemère, Québec, Canada

Andy Elvey
Hi Patrick!

Thanks very much - that's great!  
Many thanks also to Roland and Thomas - your solutions are great too!

Although I'm still new to Haskell, I love its power and elegance.  It
does have a bit of a learning curve, mainly because it's *so* powerful
that it's a bit like trying to fly a jumbo jet..... ;)

Only one more question. If I wanted to do a crosstab (say, with
ethnicity down the left-hand side, and gender across the top), how could
that be done?  
In other words, the output would look like this -

               F     M
NZ European   xxx   xxx
NZ Maori      xxx   xxx

- where the x's are the totals for each category: (NZ European, F), (NZ
European, M), (NZ Maori, F), (NZ Maori, M).
I think it would involve zipWith and Array, but beyond that, I find it
hard to think this through in two dimensions.... :)
Crosstab code would be *great* to play around with!

Thanks again - bye for now -
 - Andy
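One way the crosstab could be sketched is with Data.Map.fromListWith keyed on (ethnicity, gender) pairs, rather than zipWith and Array. Nobody in the thread proposed this, so treat it as an illustration only: the names fields, trim, crosstab and render are made up, the column order Name,Gender,Age,Ethnicity,Books is assumed from the sample file, and the header line is assumed to have been dropped already.

```haskell
import Data.Char (isSpace)
import Data.List (nub, sort)
import qualified Data.Map as Map

-- Split a line on commas.
fields :: String -> [String]
fields s = case break (== ',') s of
  (w, [])     -> [w]
  (w, _ : s') -> w : fields s'

-- Remove leading and trailing whitespace from a field.
trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Total of Books for each (ethnicity, gender) pair.
-- Rows that do not have exactly five fields are silently skipped.
crosstab :: [String] -> Map.Map (String, String) Int
crosstab rows = Map.fromListWith (+)
  [ ((trim e, trim g), read b) | [_, g, _, e, b] <- map fields rows ]

-- Render with ethnicity down the side and gender across the top.
-- Combinations that never occur are shown as 0.
render :: Map.Map (String, String) Int -> String
render t = unlines (headerLine : map rowLine rowKeys)
  where
    rowKeys    = nub (sort (map fst (Map.keys t)))
    colKeys    = nub (sort (map snd (Map.keys t)))
    width      = maximum (0 : map length rowKeys)
    pad n s    = s ++ replicate (n - length s) ' '
    cell r c   = pad 6 (show (Map.findWithDefault 0 (r, c) t))
    headerLine = pad width "" ++ concatMap (\c -> "  " ++ pad 6 c) colKeys
    rowLine r  = pad width r ++ concatMap (\c -> "  " ++ cell r c) colKeys
```

It could be run over stdin with something like main = interact (render . crosstab . drop 1 . lines). fromListWith (+) adds up the values that share a key, so the two-dimensional table is really just a one-dimensional map whose keys happen to be pairs.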

