# Optimizing nearest-k code

 Classic List Threaded
6 messages
Reply | Threaded
Open this post in threaded view
|

## Optimizing nearest-k code

 Dear Haskell Cafe,Given a set of sets, and a particular target set, I want to find the sets that are nearest (in terms of Hamming distance) to the target set. I am using the following code:import Data.List    import qualified Data.Set as Setnearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, v)]nearest_k k bs b = take k bs' where    bs' = sortOn (hamming b) bshamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Inthamming x (y, _) = hamming_distance x yhamming_distance :: Ord a => Set.Set a -> Set.Set a -> Inthamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size (Set.difference ys xs)subsets :: [a] -> [[a]]subsets []  = [[]]subsets (x:xs) = subsets xs ++ map (x:) (subsets xs)    int_lists :: [[Int]]int_lists = subsets [1..20]values :: [(Set.Set Int, Int)]values = map f (zip [1..] int_lists) where    f (i, x) = (Set.fromList x, i)test = nearest_k 8 values (Set.fromList [1,2,3])----This works ok for the test above (with sets of ints), but is rather slow in my actual application (in which the sets are large sets of ground atoms of first-order logic). Is there some major optimization I should be doing here? thanks,Richard _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafeOnly members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

## Re: Optimizing nearest-k code

 there are a few things you can do. I ran it locally and got 3s cpu time and 3gb of allocation, more or less. first off, your subset implementation computes the subsets twice. taking that off took it down to 2s. After that it's mostly in the Set implementation. Using Map a () instead gives you access to Data.Map.Merge.Strict.merge, which is a bit more efficient (going over the two data structures once only) ``` module Main where import Data.List import Data.Map.Strict (Map) import qualified Data.Map.Merge.Strict as Map import qualified Data.Map.Strict as Map nearest_k :: (Ord a) => Int -> [(Map a (), v)] -> Map a () -> [(Map a (), v)] nearest_k k bs b = take k bs' where     bs' = sortOn (hamming b) bs hamming :: (Ord a)=> Map a () -> (Map a (), v) -> Int hamming x (y, _) = hamming_distance x y hamming_distance :: (Ord a)=> Map a () -> Map a () -> Int hamming_distance xs ys = Map.size (Map.merge Map.preserveMissing Map.preserveMissing (Map.zipWithMaybeMatched (\_ _ _ -> Nothing)) xs ys) subsets :: [a] -> [[a]] subsets []  = [[]] subsets (x:xs) = let rs = subsets xs in rs  ++ map (x:) rs int_lists :: [[Int]] int_lists = subsets [1..20] values :: [(Map Int (), Int)] values = map f (zip [1..] int_lists) where     f (i, x) = (Map.fromList(zip  x (repeat ())), i) test = nearest_k 8 values (Map.fromList (zip [1,2,3] (repeat ()))) main = print test ``` that took it to 1s, and now profiling indicates more than half the time is spent in generation of the test values, so I'll leave it there. I think if you wanted to do better than this you'd have to do some algorithmic changes - for instance, once your worst candidate is n steps away, you can stop calculating the hamming distance for anything else once it's > n, as it can't contribute usefully to the nearest neighbours. cheers, Mark On Thu, May 24, 2018 at 12:15 PM, Richard Evans <[hidden email]> wrote: > Dear Haskell Cafe, > > Given a set of sets, and a particular target set, I want to find the sets > that are nearest (in terms of Hamming distance) to the target set. > > I am using the following code: > > import Data.List > import qualified Data.Set as Set > > nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, > v)] > nearest_k k bs b = take k bs' where >     bs' = sortOn (hamming b) bs > > hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int > hamming x (y, _) = hamming_distance x y > > hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int > hamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size > (Set.difference ys xs) > > > > subsets :: [a] -> [[a]] > subsets []  = [[]] > subsets (x:xs) = subsets xs ++ map (x:) (subsets xs) > > int_lists :: [[Int]] > int_lists = subsets [1..20] > > values :: [(Set.Set Int, Int)] > values = map f (zip [1..] int_lists) where >     f (i, x) = (Set.fromList x, i) > > test = nearest_k 8 values (Set.fromList [1,2,3]) > > ---- > > This works ok for the test above (with sets of ints), but is rather slow in > my actual application (in which the sets are large sets of ground atoms of > first-order logic). Is there some major optimization I should be doing here? > > thanks, > Richard > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe> Only members subscribed via the mailman list are allowed to post. -- A UNIX signature isn't a return address, it's the ASCII equivalent of a black velvet clown painting. It's a rectangle of carets surrounding a quote from a literary giant of weeniedom like Heinlein or Dr. Who.         -- Chris Maeda _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafeOnly members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

## Re: Optimizing nearest-k code

 Thanks Mark, this is very helpful. On Thu, May 24, 2018 at 6:36 PM, Mark Wotton wrote:there are a few things you can do. I ran it locally and got 3s cpu time and 3gb of allocation, more or less. first off, your subset implementation computes the subsets twice. taking that off took it down to 2s. After that it's mostly in the Set implementation. Using Map a () instead gives you access to Data.Map.Merge.Strict.merge, which is a bit more efficient (going over the two data structures once only) ``` module Main where import Data.List import Data.Map.Strict (Map) import qualified Data.Map.Merge.Strict as Map import qualified Data.Map.Strict as Map nearest_k :: (Ord a) => Int -> [(Map a (), v)] -> Map a () -> [(Map a (), v)] nearest_k k bs b = take k bs' where     bs' = sortOn (hamming b) bs hamming :: (Ord a)=> Map a () -> (Map a (), v) -> Int hamming x (y, _) = hamming_distance x y hamming_distance :: (Ord a)=> Map a () -> Map a () -> Int hamming_distance xs ys = Map.size (Map.merge Map.preserveMissing Map.preserveMissing (Map.zipWithMaybeMatched (\_ _ _ -> Nothing)) xs ys) subsets :: [a] -> [[a]] subsets []  = [[]] subsets (x:xs) = let rs = subsets xs in rs  ++ map (x:) rs int_lists :: [[Int]] int_lists = subsets [1..20] values :: [(Map Int (), Int)] values = map f (zip [1..] int_lists) where     f (i, x) = (Map.fromList(zip  x (repeat ())), i) test = nearest_k 8 values (Map.fromList (zip [1,2,3] (repeat ()))) main = print test ``` that took it to 1s, and now profiling indicates more than half the time is spent in generation of the test values, so I'll leave it there. I think if you wanted to do better than this you'd have to do some algorithmic changes - for instance, once your worst candidate is n steps away, you can stop calculating the hamming distance for anything else once it's > n, as it can't contribute usefully to the nearest neighbours. cheers, Mark On Thu, May 24, 2018 at 12:15 PM, Richard Evans <[hidden email]> wrote: > Dear Haskell Cafe, > > Given a set of sets, and a particular target set, I want to find the sets > that are nearest (in terms of Hamming distance) to the target set. > > I am using the following code: > > import Data.List > import qualified Data.Set as Set > > nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, > v)] > nearest_k k bs b = take k bs' where >     bs' = sortOn (hamming b) bs > > hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int > hamming x (y, _) = hamming_distance x y > > hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int > hamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size > (Set.difference ys xs) > > > > subsets :: [a] -> [[a]] > subsets []  = [[]] > subsets (x:xs) = subsets xs ++ map (x:) (subsets xs) > > int_lists :: [[Int]] > int_lists = subsets [1..20] > > values :: [(Set.Set Int, Int)] > values = map f (zip [1..] int_lists) where >     f (i, x) = (Set.fromList x, i) > > test = nearest_k 8 values (Set.fromList [1,2,3]) > > ---- > > This works ok for the test above (with sets of ints), but is rather slow in > my actual application (in which the sets are large sets of ground atoms of > first-order logic). Is there some major optimization I should be doing here? > > thanks, > Richard > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -- A UNIX signature isn't a return address, it's the ASCII equivalent of a black velvet clown painting. It's a rectangle of carets surrounding a quote from a literary giant of weeniedom like Heinlein or Dr. Who.         -- Chris Maeda _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafeOnly members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

## Re: Optimizing nearest-k code

 In reply to this post by Mark Wotton If you're looking for an efficient way to compute subsets, import Control.Monad subsets :: [a] -> [[a]] subsets = filterM (pure [True, False]) does the trick nicely. On 05/24/2018 12:36 PM, Mark Wotton wrote: > there are a few things you can do. > > I ran it locally and got 3s cpu time and 3gb of allocation, more or less. > > first off, your subset implementation computes the subsets twice. > taking that off took it down to 2s. > > After that it's mostly in the Set implementation. Using Map a () > instead gives you access to Data.Map.Merge.Strict.merge, which is a > bit more efficient (going over the two data structures once only) > > ``` > module Main where > > import Data.List > import Data.Map.Strict (Map) > import qualified Data.Map.Merge.Strict as Map > import qualified Data.Map.Strict as Map > > nearest_k :: (Ord a) => Int -> [(Map a (), v)] -> Map a () -> [(Map a (), v)] > nearest_k k bs b = take k bs' where >     bs' = sortOn (hamming b) bs > > hamming :: (Ord a)=> Map a () -> (Map a (), v) -> Int > hamming x (y, _) = hamming_distance x y > > hamming_distance :: (Ord a)=> Map a () -> Map a () -> Int > hamming_distance xs ys = Map.size (Map.merge Map.preserveMissing > Map.preserveMissing (Map.zipWithMaybeMatched (\_ _ _ -> Nothing)) xs > ys) > > subsets :: [a] -> [[a]] > subsets []  = [[]] > subsets (x:xs) = let rs = subsets xs in rs  ++ map (x:) rs > > int_lists :: [[Int]] > int_lists = subsets [1..20] > > values :: [(Map Int (), Int)] > values = map f (zip [1..] int_lists) where >     f (i, x) = (Map.fromList(zip  x (repeat ())), i) > > test = nearest_k 8 values (Map.fromList (zip [1,2,3] (repeat ()))) > > > main = print test > ``` > > that took it to 1s, and now profiling indicates more than half the > time is spent in generation of the test values, so I'll leave it > there. > > I think if you wanted to do better than this you'd have to do some > algorithmic changes - for instance, once your worst candidate is n > steps away, you can stop calculating the hamming distance for anything > else once it's > n, as it can't contribute usefully to the nearest > neighbours. > > cheers, > Mark > > On Thu, May 24, 2018 at 12:15 PM, Richard Evans > <[hidden email]> wrote: >> Dear Haskell Cafe, >> >> Given a set of sets, and a particular target set, I want to find the sets >> that are nearest (in terms of Hamming distance) to the target set. >> >> I am using the following code: >> >> import Data.List >> import qualified Data.Set as Set >> >> nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, >> v)] >> nearest_k k bs b = take k bs' where >>     bs' = sortOn (hamming b) bs >> >> hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int >> hamming x (y, _) = hamming_distance x y >> >> hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int >> hamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size >> (Set.difference ys xs) >> >> >> >> subsets :: [a] -> [[a]] >> subsets []  = [[]] >> subsets (x:xs) = subsets xs ++ map (x:) (subsets xs) >> >> int_lists :: [[Int]] >> int_lists = subsets [1..20] >> >> values :: [(Set.Set Int, Int)] >> values = map f (zip [1..] int_lists) where >>     f (i, x) = (Set.fromList x, i) >> >> test = nearest_k 8 values (Set.fromList [1,2,3]) >> >> ---- >> >> This works ok for the test above (with sets of ints), but is rather slow in >> my actual application (in which the sets are large sets of ground atoms of >> first-order logic). Is there some major optimization I should be doing here? >> >> thanks, >> Richard >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe>> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafeOnly members subscribed via the mailman list are allowed to post. signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

## Re: Optimizing nearest-k code

 In reply to this post by Mark Wotton Also: are you sure let rs = subsets xs in rs  ++ map (x:) rs is more efficient than subsets xs ++ map (x:) (subsets xs) I would have assumed these would be the same due to laziness, etc. On 05/24/2018 12:36 PM, Mark Wotton wrote: > there are a few things you can do. > > I ran it locally and got 3s cpu time and 3gb of allocation, more or less. > > first off, your subset implementation computes the subsets twice. > taking that off took it down to 2s. > > After that it's mostly in the Set implementation. Using Map a () > instead gives you access to Data.Map.Merge.Strict.merge, which is a > bit more efficient (going over the two data structures once only) > > ``` > module Main where > > import Data.List > import Data.Map.Strict (Map) > import qualified Data.Map.Merge.Strict as Map > import qualified Data.Map.Strict as Map > > nearest_k :: (Ord a) => Int -> [(Map a (), v)] -> Map a () -> [(Map a (), v)] > nearest_k k bs b = take k bs' where >     bs' = sortOn (hamming b) bs > > hamming :: (Ord a)=> Map a () -> (Map a (), v) -> Int > hamming x (y, _) = hamming_distance x y > > hamming_distance :: (Ord a)=> Map a () -> Map a () -> Int > hamming_distance xs ys = Map.size (Map.merge Map.preserveMissing > Map.preserveMissing (Map.zipWithMaybeMatched (\_ _ _ -> Nothing)) xs > ys) > > subsets :: [a] -> [[a]] > subsets []  = [[]] > subsets (x:xs) = let rs = subsets xs in rs  ++ map (x:) rs > > int_lists :: [[Int]] > int_lists = subsets [1..20] > > values :: [(Map Int (), Int)] > values = map f (zip [1..] int_lists) where >     f (i, x) = (Map.fromList(zip  x (repeat ())), i) > > test = nearest_k 8 values (Map.fromList (zip [1,2,3] (repeat ()))) > > > main = print test > ``` > > that took it to 1s, and now profiling indicates more than half the > time is spent in generation of the test values, so I'll leave it > there. > > I think if you wanted to do better than this you'd have to do some > algorithmic changes - for instance, once your worst candidate is n > steps away, you can stop calculating the hamming distance for anything > else once it's > n, as it can't contribute usefully to the nearest > neighbours. > > cheers, > Mark > > On Thu, May 24, 2018 at 12:15 PM, Richard Evans > <[hidden email]> wrote: >> Dear Haskell Cafe, >> >> Given a set of sets, and a particular target set, I want to find the sets >> that are nearest (in terms of Hamming distance) to the target set. >> >> I am using the following code: >> >> import Data.List >> import qualified Data.Set as Set >> >> nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, >> v)] >> nearest_k k bs b = take k bs' where >>     bs' = sortOn (hamming b) bs >> >> hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int >> hamming x (y, _) = hamming_distance x y >> >> hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int >> hamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size >> (Set.difference ys xs) >> >> >> >> subsets :: [a] -> [[a]] >> subsets []  = [[]] >> subsets (x:xs) = subsets xs ++ map (x:) (subsets xs) >> >> int_lists :: [[Int]] >> int_lists = subsets [1..20] >> >> values :: [(Set.Set Int, Int)] >> values = map f (zip [1..] int_lists) where >>     f (i, x) = (Set.fromList x, i) >> >> test = nearest_k 8 values (Set.fromList [1,2,3]) >> >> ---- >> >> This works ok for the test above (with sets of ints), but is rather slow in >> my actual application (in which the sets are large sets of ground atoms of >> first-order logic). Is there some major optimization I should be doing here? >> >> thanks, >> Richard >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe>> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafeOnly members subscribed via the mailman list are allowed to post. signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

## Re: Optimizing nearest-k code

 Yes. Showed up in testing - if you're using a decent chunk of the subsets, you can avoid computing twice. If the pressure is actually memory rather than cpu and you can consume lazily, you are quite right that the original could be better.On Fri, May 25, 2018, 10:00 AM Vanessa McHale <[hidden email]> wrote:Also: are you sure let rs = subsets xs in rs  ++ map (x:) rs is more efficient than subsets xs ++ map (x:) (subsets xs) I would have assumed these would be the same due to laziness, etc. On 05/24/2018 12:36 PM, Mark Wotton wrote: > there are a few things you can do. > > I ran it locally and got 3s cpu time and 3gb of allocation, more or less. > > first off, your subset implementation computes the subsets twice. > taking that off took it down to 2s. > > After that it's mostly in the Set implementation. Using Map a () > instead gives you access to Data.Map.Merge.Strict.merge, which is a > bit more efficient (going over the two data structures once only) > > ``` > module Main where > > import Data.List > import Data.Map.Strict (Map) > import qualified Data.Map.Merge.Strict as Map > import qualified Data.Map.Strict as Map > > nearest_k :: (Ord a) => Int -> [(Map a (), v)] -> Map a () -> [(Map a (), v)] > nearest_k k bs b = take k bs' where >     bs' = sortOn (hamming b) bs > > hamming :: (Ord a)=> Map a () -> (Map a (), v) -> Int > hamming x (y, _) = hamming_distance x y > > hamming_distance :: (Ord a)=> Map a () -> Map a () -> Int > hamming_distance xs ys = Map.size (Map.merge Map.preserveMissing > Map.preserveMissing (Map.zipWithMaybeMatched (\_ _ _ -> Nothing)) xs > ys) > > subsets :: [a] -> [[a]] > subsets []  = [[]] > subsets (x:xs) = let rs = subsets xs in rs  ++ map (x:) rs > > int_lists :: [[Int]] > int_lists = subsets [1..20] > > values :: [(Map Int (), Int)] > values = map f (zip [1..] int_lists) where >     f (i, x) = (Map.fromList(zip  x (repeat ())), i) > > test = nearest_k 8 values (Map.fromList (zip [1,2,3] (repeat ()))) > > > main = print test > ``` > > that took it to 1s, and now profiling indicates more than half the > time is spent in generation of the test values, so I'll leave it > there. > > I think if you wanted to do better than this you'd have to do some > algorithmic changes - for instance, once your worst candidate is n > steps away, you can stop calculating the hamming distance for anything > else once it's > n, as it can't contribute usefully to the nearest > neighbours. > > cheers, > Mark > > On Thu, May 24, 2018 at 12:15 PM, Richard Evans > <[hidden email]> wrote: >> Dear Haskell Cafe, >> >> Given a set of sets, and a particular target set, I want to find the sets >> that are nearest (in terms of Hamming distance) to the target set. >> >> I am using the following code: >> >> import Data.List >> import qualified Data.Set as Set >> >> nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, >> v)] >> nearest_k k bs b = take k bs' where >>     bs' = sortOn (hamming b) bs >> >> hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int >> hamming x (y, _) = hamming_distance x y >> >> hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int >> hamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size >> (Set.difference ys xs) >> >> >> >> subsets :: [a] -> [[a]] >> subsets []  = [[]] >> subsets (x:xs) = subsets xs ++ map (x:) (subsets xs) >> >> int_lists :: [[Int]] >> int_lists = subsets [1..20] >> >> values :: [(Set.Set Int, Int)] >> values = map f (zip [1..] int_lists) where >>     f (i, x) = (Set.fromList x, i) >> >> test = nearest_k 8 values (Set.fromList [1,2,3]) >> >> ---- >> >> This works ok for the test above (with sets of ints), but is rather slow in >> my actual application (in which the sets are large sets of ground atoms of >> first-order logic). Is there some major optimization I should be doing here? >> >> thanks, >> Richard >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafeOnly members subscribed via the mailman list are allowed to post.