Code help requested


Code help requested

Joe Van Dyk
I've written two versions of the same program, one in Ruby and one in
Haskell.  Given words on stdin, find all the anagrams in those words.
For nicer display, we're only going to display the top 3 results.

I'm obviously new to Haskell.  The Ruby version runs about 5x as fast
on a large file.  How can I improve the Haskell version?

http://gist.github.com/274774

# Ruby version
input = STDIN.read.split("\n")
result = Hash.new([])
input.each do |word|
  sorted_word = word.split('').sort.join
  result[sorted_word] += [word]
end
values = result.values.sort { |a, b| b.size <=> a.size }
p values[0..3]

# Haskell version
import List
import qualified Data.Map as Map

-- Given as stdin
-- presents
-- serpents
-- no
-- on
-- whatever
-- Expected Output:
-- [["serpents","presents"],["on","no"]]

-- This version only displays words that have more than one
-- match in the list, and sorts by the words that got the most matches.

-- Can we do the map bit better?

main = do
  input <- getContents
  print $ anagrams $ lines input

anagrams words =
  sorted_anagrams
  where
    sorted_anagrams = sortBy sorter filtered_anagrams
    sorter a b = compare (length b)  (length a)
    filtered_anagrams = Map.elems $ Map.filter filter_function all_anagrams
    filter_function words = length words > 1
    all_anagrams = do_anagrams words Map.empty
    do_anagrams [] result = result
    do_anagrams words result = do_anagrams
                                 (tail words)
                                 (Map.unionWith
                                   (++)
                                   (Map.fromList
[(sorted_current_word, [current_word])])
                                   result)
      where
        current_word = head words
        sorted_current_word = sort current_word



--
Joe Van Dyk
http://fixieconsulting.com

Code help requested

David Frey
On January 11, 2010 5:22:49 pm Joe Van Dyk wrote:

> [original program snipped]


Can you provide a link to the data you are using as input?  I ran your program
over a list of 15000 words and it finished in 0.4 seconds.

Code help requested

Yusaku Hashimoto
In reply to this post by Joe Van Dyk
Hello,

I forked your gist and made some changes: http://gist.github.com/274956

The main change is using foldl' instead of explicit recursion (HLint may
point this out). This gives us clarity and strictness.

And it runs faster than the Ruby one on my 2GHz MacBook =)

I used ruby-1.9.1 as the Ruby interpreter and GHC 6.12.1 as the Haskell compiler.

time ruby ./anagram.rb < /usr/share/dict/words
[["caret", "carte", "cater", "crate", "creat", "creta", "react",
"recta", "trace"], ["ester", "estre", "reest", "reset", "steer",
"stere", "stree", "terse", "tsere"], ["angor", "argon", "goran",
"grano", "groan", "nagor", "orang", "organ", "rogan"]]

real 0m4.620s
user 0m4.473s
sys 0m0.150s

ghc --make anagram
time ./anagram < /usr/share/dict/words
[["caret","carte","cater","crate","creat","creta","react","recta","trace"],["angor","argon","goran","grano","groan","nagor","orang","organ","rogan"],["ester","estre","reest","reset","steer","stere","stree","terse","tsere"]]

real 0m3.797s
user 0m3.613s
sys 0m0.173s

The outputs are ordered slightly differently because of the ordering spec of Hash in Ruby 1.9.

see also:
http://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/
http://hackage.haskell.org/packages/archive/containers/0.2.0.1/doc/html/Data-Map.html#v:elems

Cheers
-- nwn

On Tue, Jan 12, 2010 at 10:22 AM, Joe Van Dyk <[hidden email]> wrote:

> [original program snipped]

Code help requested

Daniel Fischer-4
In reply to this post by Joe Van Dyk
On Tuesday, 12 January 2010, 02:22:49, Joe Van Dyk wrote:

> [...]
> anagrams words =
>   sorted_anagrams
>   where
>     sorted_anagrams = sortBy sorter filtered_anagrams
>     sorter a b = compare (length b)  (length a)
>     filtered_anagrams = Map.elems $ Map.filter filter_function all_anagrams
>     filter_function words = length words > 1
>     all_anagrams = do_anagrams words Map.empty
>     do_anagrams [] result = result
>     do_anagrams words result = do_anagrams
>                                  (tail words)

Here be dragons.
unionWith is O(n+m), where n and m are the sizes of the two maps, while the
insert variants are O(log n). So this may be quadratic [actually, if the
second map in unionWith is a singleton, it behaves better, but it's still
much slower than inserts], and it is very lazy, which means that without
optimisations all the unions form a giant thunk which overflows the stack
for large enough input. Using

all_anagrams
    = foldl' (\m w -> Map.insertWith' (++) (sort w) [w] m) Map.empty words

you get an O(n*log n) algorithm with sufficient strictness to not blow
the stack. For

$ wc -l /usr/share/dict/words
380645 /usr/share/dict/words

that is a heck of a difference.
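
For reference, a minimal self-contained sketch of the whole program with
that change applied (not Joe's exact code, just the idea; note that
Map.insertWith' is from the containers version of that era, newer
containers spell it Data.Map.Strict.insertWith):

import Data.List (foldl', sort, sortBy)
import qualified Data.Map as Map

-- Build the map with a strict left fold and per-word inserts: O(n * log n).
anagrams :: [String] -> [[String]]
anagrams ws =
    sortBy (\a b -> compare (length b) (length a))
  . Map.elems
  . Map.filter (\g -> length g > 1)
  $ foldl' (\m w -> Map.insertWith' (++) (sort w) [w] m) Map.empty ws

main :: IO ()
main = print . take 4 . anagrams . lines =<< getContents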

>                                  (Map.unionWith
>                                    (++)

Don't use Map.fromList [(key,value)], use Map.singleton key value instead.

>                                    (Map.fromList
> [(sorted_current_word, [current_word])])
>                                    result)
>       where
>         current_word = head words
>         sorted_current_word = sort current_word

While the original version got a stack overflow without optimisations, it
ran with -O2, but it took a *lot* of memory, was ~10% slower than the Ruby
version, and spent 68% of the time garbage collecting.

Change all_anagrams as above, and it uses reasonable memory (about 30% more
than Ruby if left to choose how much to use; it can run on less than Ruby
with +RTS -MxM, but that of course increases GC times a bit) and takes
about a third of the time of the Ruby version (*without optimisations*; -O2
makes only a small difference [~6%] here).

$ time ruby ./Anagrams.rb < /usr/share/dict/words > /dev/null
26.27user 0.21system 0:26.48elapsed 100%CPU

$ ./AnagramsH +RTS -sstderr < /usr/share/dict/words  > /dev/null
./AnagramsH +RTS -sstderr
   1,807,965,184 bytes allocated in the heap
     577,083,904 bytes copied during GC
      73,277,232 bytes maximum residency (12 sample(s))
         858,644 bytes maximum slop
             166 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  3437 collections,     0 parallel,  2.47s,  2.62s elapsed
  Generation 1:    12 collections,     0 parallel,  1.07s,  1.25s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.16s  (  5.19s elapsed)
  GC    time    3.54s  (  3.86s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.70s  (  9.05s elapsed)

  %GC time      40.7%  (42.7% elapsed)

  Alloc rate    350,629,823 bytes per MUT second

  Productivity  59.3% of total user, 57.0% of total elapsed


Code help requested

David Frey
In reply to this post by Joe Van Dyk
On January 11, 2010 5:22:49 pm Joe Van Dyk wrote:

> [original program snipped]

These are the numbers I got once I modified your Haskell program to only print
out 4 results the way the Ruby program does.

Your Haskell version: ~10.0 s
My Haskell version: ~2.5 s
Your Ruby version (Ruby 1.8): ~4.6 s
Your Ruby version (Ruby 1.9): ~4.2 s

This is my version of your program:

import Control.Monad (liftM)
import Data.Function (on)
import Data.List (sort, sortBy)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map

-- Given as stdin
-- presents
-- serpents
-- no
-- on
-- whatever
-- Expected Output:
-- [["serpents","presents"],["on","no"]]


main = do
    input <- liftM B.lines B.getContents
    let wordMap = buildMap $ map B.unpack input
    print $ take 4 (listAnagrams wordMap)


buildMap words = let
    entries = map (\x -> (sort x, [x])) words
    in Map.fromListWith (++) entries


listAnagrams wordMap = let
    anagrams = (Map.elems . Map.filter (\x -> length x > 1)) wordMap
    in sortBy (flip (compare `on` length)) anagrams


I found that the performance improved when I used ByteStrings to read the
input and then unpacked to regular strings before creating the Map.  For some
reason, using ByteStrings everywhere made the program slower.  Can anyone
tell me why?

Dave

Code help requested

Daniel Fischer-4
On Tuesday, 12 January 2010, 17:44:12, David Frey wrote:
> These are the numbers I got once I modified your Haskell program to only
> print out 4 results the way the Ruby program does.
>
> Your Haskell version: ~10.0 s
> My Haskell version: ~2.5 s
> Your Ruby version (Ruby 1.8): ~4.6 s
> Your Ruby version (Ruby 1.9): ~4.2 s

I sincerely hope your input file is smaller than mine :)

>
> [David's program snipped]
> I found that the performance improved when I used ByteStrings to read
> the input and then unpacked to regular strings before creating the Map.
> For some reason, using ByteStrings everywhere made the program slower.
> Can anyone tell me why?

Yes. ByteString's sort is a bucket sort. It allocates an array of
256*sizeof(CSize) bytes and counts the occurrences of each character.
That's fine for long ByteStrings, but for short ByteStrings like those we
consider here, allocating a bucket array of 1K or 2K per sort is a lot of
overhead.

Sorting plain Strings is faster (though not by much) and uses (much)
less memory when they are short.

You can further speed up your programme if you put lists of ByteStrings in
your Map (less memory, less GC) and unpack them only for sorting (and
finally for output):


main = do
    input <- liftM B.lines B.getContents
    let wordMap = buildMap input
    print $ take 4 (listAnagrams wordMap)

buildMap words = let
    entries = map (\x -> (B.pack . sort $ B.unpack x, [x])) words
    in Map.fromListWith (++) entries


listAnagrams wordMap = let
    anagrams = (Map.elems . Map.filter (\x -> length x > 1)) wordMap
    -- a small speedup can be obtained by not using length:
    -- Map.filter (not . null . drop 1)
    -- or Map.filter (\l -> case l of { (_:_:_) -> True; _ -> False })
    -- if there are many long lists in the map, the speedup will become
    -- significant
    in map (map B.unpack) $ sortBy (flip (compare `on` length)) anagrams


Yours:
$ ./DFAnagrams +RTS -sstderr < /usr/share/dict/words > /dev/null                                                                                        
./DFAnagrams +RTS -sstderr                                                                        
   1,218,862,708 bytes allocated in the heap                                                      
     544,113,420 bytes copied during GC                                                          
      98,018,856 bytes maximum residency (10 sample(s))                                          
         768,552 bytes maximum slop                                                              
             211 MB total memory in use (2 MB lost due to fragmentation)                          

  Generation 0:  2315 collections,     0 parallel,  1.98s,  2.01s elapsed
  Generation 1:    10 collections,     0 parallel,  1.28s,  1.53s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.60s  (  3.61s elapsed)
  GC    time    3.26s  (  3.54s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.86s  (  7.15s elapsed)

  %GC time      47.5%  (49.4% elapsed)

  Alloc rate    338,176,931 bytes per MUT second

  Productivity  52.4% of total user, 50.3% of total elapsed

Modified:
$ ./DFBAnagrams +RTS -sstderr < /usr/share/dict/words > /dev/null                                                                                        
./DFBTAnagrams +RTS -sstderr                                                                      
   1,108,946,552 bytes allocated in the heap
     237,869,304 bytes copied during GC
      41,907,844 bytes maximum residency (10 sample(s))
       4,374,152 bytes maximum slop
              89 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  2091 collections,     0 parallel,  1.14s,  1.19s elapsed
  Generation 1:    10 collections,     0 parallel,  0.40s,  0.50s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.90s  (  2.90s elapsed)
  GC    time    1.54s  (  1.69s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.44s  (  4.59s elapsed)

  %GC time      34.7%  (36.8% elapsed)

  Alloc rate    382,369,915 bytes per MUT second

  Productivity  65.2% of total user, 63.1% of total elapsed


versus 26.25s for the Ruby version (ruby 1.8.7).

Yay!

>
> Dave


Code help requested

Tim Perry-2
I compiled the original version, Yusaku's version, and a version I wrote and found the following:

$ time ./Anagram_me < /usr/share/dict/words > /dev/null
real    0m2.197s
user    0m2.040s
sys     0m0.160s

$ time ./Anagram_JoeVanDyke < /usr/share/dict/words > /dev/null
real    0m4.570s
user    0m4.290s
sys     0m0.260s

perry@emperor:~/haskell$ time ./Anagram_Yusaku < /usr/share/dict/words > /dev/null
real    0m1.337s
user    0m1.230s
sys     0m0.100s


From this, it looks like my version takes less than half the time of the
original. However, if I run a bigger dictionary (Ubuntu package
wamerican-large instead of wamerican-small) then I'm only about 30%
faster than the original. This makes me think I have some sort of
exponential data structure growth going on.  Here is my version. Can anyone
confirm that data structure growth is the problem with my approach?  Thanks, Tim



import Data.List as Lst
import Data.Map as Map

-- This version only displays words that have more than two
-- matches in the list, and sorts by the words that got the most matches.

-- Can we do the map bit better?

main = do
  input <- getContents
  print $ anagrams $ lines input

anagrams words =
  sorted_anagrams
  where
    sorted_anagrams = sortBy sorter filtered_anagrams
    sorter a b = compare (length b) (length a)
    longEnoughWords = [x | x <- words, length x > 1]
    filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $ wordPairs, length x > 2]
       where
         wordPairs =  zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
         insert (sorted, original) = insertWith (++) sorted [original]








Code help requested

Joe Van Dyk
On Tue, Jan 12, 2010 at 6:38 PM, Tim Perry <[hidden email]> wrote:
> I compiled the original version, Yusaku's version, and a version I wrote and found the following:

Thanks all for your help!

What I'm most interested in is getting my Haskell code to be more
readable and idiomatic, as opposed to as fast as possible.

Code help requested

Daniel Fischer-4
In reply to this post by Tim Perry-2
On Wednesday, 13 January 2010, 03:38:46, Tim Perry wrote:

> [...]
>     longEnoughWords = [x | x <- words, length x > 1]

The words are short here, so it's not catastrophic, but

*don't use length unless you really want to know the length*

Here, use (not . null . drop 1) [an input line might be empty, so don't use
tail]. In general, instead of

length list > k, use not . null $ drop k list;

if you want to check (length list == k),

case drop (k-1) list of
  (_:[]) -> True
  _      -> False

is O(min k (length list)); if there's a slight possibility that the list is
much longer than k, it's the safer choice.

However, it's unlikely that there are more than 52 one-letter words in the
word list, so filtering those out shouldn't make it faster.

>     filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $ wordPairs, length x > 2]
>        where

That's bad.

Using foldr to construct the map, you must have the whole list from which
to construct it in memory at once - since the list takes less memory
than the map, that is not a real problem; if you ran out of memory that
way, you would have anyway - and you can start constructing the map only
after the entire input has been read - this is the real problem.
You build a nice huge thunk that way, which may blow the stack. And it's
slow.

foldr is for the cases where you can start returning output before the
entire list has been consumed; a necessary condition for that is that the
accumulation function is lazy in its second argument, like (++), (&&),
(||).
In practically all other cases, you want foldl' (there might be a few cases
where foldl is what you want, but I haven't seen such a case yet).

>          wordPairs = zip (Prelude.map Lst.sort longEnoughWords) longEnoughWords
>          insert (sorted, original) = insertWith (++) sorted [original]

changing foldr to foldl' and insert to

insert m (sorted,original) = insertWith (++) sorted [original] m

reduces the time on my /usr/share/dict/words from 31 seconds to 8.

The difference is 0.64s vs 0.76s for 40,000 words, 1.17s vs. 1.80s on
70,000 words, 2.12s vs. 4.22s on 120,000 words.


Code help requested

Tim Perry-2
A side benefit of using foldl' instead of foldr is that I can now run it against the entire dictionary!

I found a good explanation of why here:
http://www.haskell.org/haskellwiki/Foldr_Foldl_Foldl%27





Code help requested

Daniel Fischer-4
On Wednesday, 13 January 2010, 20:58:17, Tim Perry wrote:
> A side benefit of using foldl' instead of foldr is that I can now run it
> against the entire dictionary!
>
> I found a good explanation of why here:
> http://www.haskell.org/haskellwiki/Foldr_Foldl_Foldl%27
>

Yep. One thing, though: seq, and hence foldl', only evaluates to weak head
normal form (WHNF), that is (leaving aside lambda expressions), to the
topmost constructor.
So, e.g., list `seq` value only checks whether list is [] or (_:_); that may
not be enough strictness in a fold. The constructors of Map are strict
enough to avoid large thunks in general with foldl', but, for example, to
compute the average of a list of numbers:

average :: [Double] -> Double
average list = sumList / countList
   where
      (sumList,countList) = foldl' add (0,0) list
      add (s,c) x = (s+x,c+1)

isn't strict enough: sumList and countList will be large thunks, because in
each step add (s,c) x is evaluated only enough to see that it is
indeed a pair.

For such cases, one needs to force the evaluation further by hand.
One possibility is to write a stricter function:

add' (s,c) x
   = let s1 = s+x
         c1 = c+1
     in s1 `seq` c1 `seq` (s1,c1)

or, using BangPatterns:

add' (!s,!c) x = (s+x,c+1)

Another possibility is to use a sufficiently strict data type

data DPair = DP !Double !Double

add (DP s c) x = DP (s+x) (c+1)

(the strict fields will keep the numbers completely evaluated at each
step). A third is to use rnf {- reduce to normal form -} from
Control.Parallel.Strategies (or another sufficiently strict strategy), or
deepseq.

Which one is the best choice varies of course.
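
For the deepseq route, a minimal sketch - assuming the deepseq package and
its Control.DeepSeq module - could look like this:

import Data.List (foldl')
import Control.DeepSeq (deepseq)

-- Force the accumulator pair to normal form at each step, so the running
-- sum and count never build up as thunks.
average :: [Double] -> Double
average xs = s / c
  where
    (s, c) = foldl' step (0, 0) xs
    step (s', c') x = let acc = (s' + x, c' + 1)
                      in acc `deepseq` acc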

Re: Code help requested

Joe Van Dyk
In reply to this post by Joe Van Dyk
Thanks all for your help.

Here's another one.  Seems like I could use a fold here, but am unsure
how that would work.  Also, if I pass in a search value that's too
big, then the function blows up.

(source at http://github.com/joevandyk/haskell/raw/master/pearls/binary_search/binary_search.hs
--  Feel free to fork it.)

-- A translation of
http://en.wikipedia.org/wiki/Binary_search_algorithm#Recursive
binary_find :: Ord a => [a] -> a -> Maybe Int
binary_find [] elem   = Nothing

binary_find list elem =
  do_search list elem 0 (length list)
  where
    do_search list elem low high =
      if high < low then Nothing
      else
        if list !! mid > elem
        then do_search list elem low (mid - 1)
        else
          if list !! mid < elem
          then do_search list elem (mid + 1) high
          else Just mid
      where
        mid = low + (high - low) `div` 2

main = do
  print $ binary_find [1] 1
  print $ binary_find [1,3] 1
  print $ binary_find [1,3,4] 3
  print $ binary_find [1,3,4] 4
  print $ binary_find [1,2,4,6,8,9,12,15,17,20] 17
  print $ binary_find "hello" 'l'
  print $ binary_find [0.0, 1.5, 3.0] 3.0

  print $ binary_find [] 1
  print $ binary_find [1,3] 2
  print $ binary_find [1,4,6,8,9,12,15,17,20] 2

  -- boom?
  print $ binary_find [1,4,6,8,9,12,15,17,20] 100

Re: Code help requested

Daniel Fischer-4
On Saturday, 16 January 2010, 01:17:55, Joe Van Dyk wrote:

> [...]
> binary_find :: Ord a => [a] -> a -> Maybe Int
> binary_find [] elem   = Nothing
>
> binary_find list elem =
>   do_search list elem 0 (length list)

That should be (length list - 1).

binary_find [1] 2
~> do_search [1] 2 0 1
~> mid = 0 + 1 `div` 2 = 0
~> [1] !! mid < 2
~> do_search [1] 2 (0+1) 1
~> mid = 1 + 0 `div` 2 = 1
~> [1] !! 1 => boom

>   where
>     do_search list elem low high =
>       if high < low then Nothing
>       else
>         if list !! mid > elem
>         then do_search list elem low (mid - 1)
>         else
>           if list !! mid < elem
>           then do_search list elem (mid + 1) high
>           else Just mid
>       where
>         mid = low + (high - low) `div` 2

I'd prefer
      mid = (low + high) `div` 2
here.

> [...]


However:
Lists are _not_ arrays.

list !! n

is O(n), except when n >= length list, in which case getting the error is
O(length list). And getting the length is O(length list), too.

So binary search on a list is O(l*log l), where l = length list, while
straightforward linear search is O(l).

You can make the binary search O(l) if you have

binaryFind list e = search list e (length list)
   where
      search _ _ 0 = Nothing
      search lst e len
         | x == e    = Just e
         | x < e     = search back e (len - half - 1)
         | otherwise = search front e half
           where
             half = (len - 1) `div` 2
             (front, x:back) = splitAt half lst
but in general, that is still much worse than straightforward search.

The binary search algorithm is for data structures with constant time
access (arrays, anything else?), not singly linked lists.


foldSearch list e = foldr f Nothing list
   where
      f x y
         | x == e    = Just x
         | otherwise = y
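
For the constant-time-access route, a sketch of the same idea over
Data.Array (with a made-up name, findInArray) might look like this; building
the array is still O(l), so it only pays off when you search the same data
repeatedly:

import Data.Array (Array, listArray, bounds, (!))

-- Each (!) lookup is O(1), so the search itself is O(log n).
findInArray :: Ord a => Array Int a -> a -> Maybe Int
findInArray arr e = go lo hi
  where
    (lo, hi) = bounds arr
    go low high
      | high < low    = Nothing
      | arr ! mid > e = go low (mid - 1)
      | arr ! mid < e = go (mid + 1) high
      | otherwise     = Just mid
      where mid = low + (high - low) `div` 2

For example, findInArray (listArray (0, 9) [1,2,4,6,8,9,12,15,17,20]) 17
gives Just 8.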


Re: Code help requested

Tim Perry-2
In reply to this post by Joe Van Dyk
Hi Joe,

I think you wanted (length list - 1) where you call do_search.  The version below is my rewrite with guards, this change, and "midVal" which keeps "list !! mid" from being evaluated twice per recursion.  Unfortunately, I have no idea how to work a fold into this.  Good luck!

--Tim


binary_find :: Ord a => [a] -> a -> Maybe Int
binary_find [] elem  = Nothing

binary_find list elem =
  do_search list elem 0 (length list -1)
  where
    do_search list elem low high
      | high < low = Nothing
      | midVal > elem = do_search list elem low (mid - 1)
      | midVal < elem = do_search list elem (mid + 1) high
      | otherwise = Just mid
      where
        midVal = list !! mid
        mid = low + (high - low) `div` 2

main = do
  print $ binary_find [1] 2
  print $ binary_find [1,3] 1
  print $ binary_find [1,3,4] 3
  print $ binary_find [1,3,4] 4
  print $ binary_find [1,2,4,6,8,9,12,15,17,20] 17
  print $ binary_find "hello" 'l'
  print $ binary_find [0.0, 1.5, 3.0] 3.0

  print $ binary_find [] 1
  print $ binary_find [1,3] 2
  print $ binary_find [1,4,6,8,9,12,15,17,20] 2

  -- boom?
  print $ binary_find [1,4,6,8,9,12,15,17,20] 19

Re: Code help requested

Joe Van Dyk
On Fri, Jan 15, 2010 at 5:17 PM, Tim Perry <[hidden email]> wrote:
> Hi Joe,
>
> I think you wanted (length list - 1) where you call do_search.  The version below is my rewrite with guards, this change, and "midVal" which keeps "list !! mid" from being evaluated twice per recursion.  Unfortunately, I have no idea how to work a fold into this.  Good luck!

Ah, I was trying to figure out how to use guards here, but I couldn't
figure out the syntax.  Thanks for the help!

What's the best way to unit test this function?  QuickCheck?  What
would that look like?

Joe
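
A property-based sketch of such a test - assuming the binary_find above is
in scope and using Test.QuickCheck - might look like this:

import Data.List (sort)
import Test.QuickCheck

-- Property: on a sorted list, any index binary_find returns points at the
-- element we searched for, and Nothing is only returned when it is absent.
prop_findsElement :: [Int] -> Int -> Bool
prop_findsElement xs e =
    case binary_find sorted e of
      Just i  -> sorted !! i == e
      Nothing -> e `notElem` sorted
  where
    sorted = sort xs

Running quickCheck prop_findsElement in GHCi then exercises it on random
lists.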

Re: Code help requested

Brandon S Allbery KF8NH
In reply to this post by Tim Perry-2
On Jan 15, 2010, at 20:17 , Tim Perry wrote:

> binary_find list elem =
>  do_search list elem 0 (length list -1)
>  where
>    do_search list elem low high
>      | high < low = Nothing
>      | midVal > elem = do_search list elem low (mid - 1)
>      | midVal < elem = do_search list elem (mid + 1) high
>      | otherwise = Just mid
>      where
>        midVal = list !! mid
>        mid = low + (high - low) `div` 2


Observation:  the first two parameters to do_search never change  
within an invocation of binary_find, and the corresponding arguments  
to binary_find are in scope; depending on the (lack of) cleverness of  
the compiler, you could see a speedup by not passing them around  
unnecessarily.  In addition, it's *conceptually* cleaner because  
passing them around explicitly suggests to someone reading the source  
that they *do* change when that isn't actually the case.
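
Applied to Tim's version, that suggestion might look like the following
sketch, with do_search closing over list and elem from the enclosing scope:

binary_find :: Ord a => [a] -> a -> Maybe Int
binary_find []   _    = Nothing
binary_find list elem =
  do_search 0 (length list - 1)
  where
    do_search low high
      | high < low    = Nothing
      | midVal > elem = do_search low (mid - 1)
      | midVal < elem = do_search (mid + 1) high
      | otherwise     = Just mid
      where
        midVal = list !! mid
        mid    = low + (high - low) `div` 2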

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [hidden email]
system administrator [openafs,heimdal,too many hats] [hidden email]
electrical and computer engineering, carnegie mellon university    KF8NH



Re: Code help requested

Joe Van Dyk
On Sat, Jan 16, 2010 at 5:30 AM, Brandon S. Allbery KF8NH
<[hidden email]> wrote:

> On Jan 15, 2010, at 20:17, Tim Perry wrote:
>> [Tim's binary_find with guards snipped]
>
> Observation: the first two parameters to do_search never change within an
> invocation of binary_find, and the corresponding arguments to binary_find
> are in scope; depending on the (lack of) cleverness of the compiler, you
> could see a speedup by not passing them around unnecessarily. [...]

Good one, thanks.  The reason I had them in there originally is that
I was thinking it would be simpler if a function was passed
all the inputs it was working with -- instead of referring to
variables defined outside its scope.

Joe

Re: Code help requested

Steven Chaplin
In reply to this post by Joe Van Dyk
On Sat, 2010-01-16 at 14:42 -0500, [hidden email] wrote:

> [...]
>
> I'd prefer
>       mid = (low + high) `div` 2
> here.
mid = (low + high) `div` 2
is buggy code. Java had this bug for 9 years until someone noticed.
http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

mid = low + (high - low) `div` 2
is useful because it avoids Int overflow errors (when using very, very
long lists).

Steve



Re: Code help requested

Daniel Fischer-4
On Sunday, 17 January 2010, 06:21:56, Steve wrote:

> On Sat, 2010-01-16 at 14:42 -0500, [hidden email] wrote:
> > [...]
>
> mid = (low + high) `div` 2
> is buggy code. Java had this bug for 9 years until someone noticed.
> http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
>
> mid = low + (high - low) `div` 2
> is useful because it avoids Int overflow errors (when using very, very
> long lists).

True - except I'm Out Of Memory long before overflow ;)

>
> Steve