Proposal : Data.Graph performance improvements (graph creation and lookups when nodes keys are consecutive)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Proposal : Data.Graph performance improvements (graph creation and lookups when nodes keys are consecutive)

Olivier S.
Hello,

This is a 3-part proposal for improving the time performance of Data.Graph.

The related PR for proposal I is https://github.com/haskell/containers/pull/549, including some graph benchmarks.

Reading the PR may be a good introduction to this material, if you're not yet familiar with Data.Graph!

I'm by advance sorry for the length of this mail, but since these are all related proposals I thought it was best to keep them together. I may be wrong, if you feel the number of issues addressed herein is too big, I can start a thread with just Proposal I for example.

Note that I may or may not have time to do more work than done in the current PR, but I hope this discussion will be of use to anyone wanting to follow-up on this.

That being said, I'm looking forward to hearing what the community thinks of these proposals!

- Proposal I : Optimize the time complexity of (key -> Maybe Vertex) lookups and graph creation.

Currently, (key -> Maybe Vertex) lookups returned by graphFromEdges consist of a binary search on an array, with a time complexity of O(log V) (I will use V for "Count of vertices", E for "Count of edges").

When key is Integral, and keys (of nodes passed to the graph creation function) form a set of /consecutive/ values, we can have an O(1) lookup by substracting the value of the smallest key, and checking for bounds.

Hence, graph creation complexity is improved, and user algorithms using (key -> Maybe Vertex) lookups will see their complexity reduced by a factor of up-to O(log V).

The PR introduces this lookup and uses it for functions graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys.

Note that in this PR, graphFromEdges in unchanged, so it doesn't benefit from these improvements (see Proposal II below).

Here is a summary of complexities as they stand in the current state of the PR:

-------------------------------------------------------------------------------------
|                                      |          Time complexities                 |
|          Functions                   |---------------------------------------------
|                                      | graph creation     | (key -> Maybe Vertex) |
-------------------------------------------------------------------------------------
| graphFromEdges                       | O( (V+E) * log V ) | O(log V)              |
| graphFromEdgesWithConsecutiveKeys    | O( E + (V*log V) ) | O(1)                  |
| graphFromEdgesWithConsecutiveAscKeys | O( V+E )           | O(1)                  |
-------------------------------------------------------------------------------------

- Proposal II : Make users of graphFromEdges benefit from performance improvements "for free" when they use Integral keys that are consecutive.

We could use a rewrite rule to redirect graphFromEdges (when key is Integral) to a function checking if the keys are successive or not, to decide if the optimized lookup can be used.

The problem with this is that some potential optimizations would be hidden-away, consider this example:

-- 'userAlgorithm' doesn't have the (Integral a) constraint
userAlgorithm :: (Ord a) => ...
userAlgorithm =
  (...)
-- hence, the rewrite rule here will never fire:
  graphFromEdges ...

userProgram :: (Integral a, Ord a) => a -> ...
userProgram input =
  userAlgorithm input
-- Eventhough input /is/ Integral (and its keys may be consecutive),
-- a potential optimization is lost because 'userAlgorithm' doesn't have the (Integral a) constraint.

To address this, I propose to introduce the following a breaking change : instead of using a rewrite rule, just add an (Integral key) constraint on graphFromEdges, and do the check (wether optimized lookups can be used) in graphFromEdges itself.

Then, in the previously described scenario, the user would see a compilation error indicating that (Integral a) could not be deduced from the context.

Hence, the user would be more likely to just add the Integral constraint on myAlgorithm (and benefit from optimizations) than to use graphFromEdgesOld, which will have been added to do exactly what graphFromEdges does today, and which is intended to be used when the key is genuinely not Integral.

And for completeness, note that we could have the idea of the rewrite rule on graphFromEdgesOld, eventhough I'm not sure it will benefit any real use case (but who knows...).

- Proposal III : Deprecate `graphFromEdges` taking [(node, key, [key])] in favor of `graphFromMap` taking (Map key (node,[key]))

While working on the PR, I found two arguments in favor of this change:

  -a) Today in graphFromEdges, if we pass the same key twice, it is undefined which node for that key will actually be used.

  graphFromMap, taking a (Map key (node,[key]) would alleviate this issue.

  -b) "sortBy is expensive, even on sorted input": I became aware of this as I looked
  at these 2 benchmarks:

    benchmarking Degree_Zero/1000_ascending_graphFromConsecutiveAscKeys
    time                 53.55 μs   (52.44 μs .. 54.76 μs)
                         0.991 R²   (0.986 R² .. 0.995 R²)
    mean                 57.41 μs   (55.02 μs .. 61.75 μs)
    std dev              10.64 μs   (5.100 μs .. 17.09 μs)
    variance introduced by outliers: 95% (severely inflated)

    benchmarking Degree_Zero/1000_ascending_graphFromConsecutiveKeys
    time                 75.34 μs   (73.79 μs .. 76.85 μs)
                         0.992 R²   (0.987 R² .. 0.995 R²)
    mean                 74.67 μs   (72.29 μs .. 77.47 μs)
    std dev              8.335 μs   (6.692 μs .. 10.70 μs)
    variance introduced by outliers: 85% (severely inflated)

In the second benchmark, we do the same work as in the first benchmark, except for an initial
sortBy on an already sorted list, which introduces a 30% overhead! Looking at the implementation
of sortBy, I saw that there was no initial check to see if the list was sorted or not.

Since we use sortBy in graphFromEdges, we could check wether the input is sorted or not, before using sortBy.

But also, if we used a Map (this is the reason I use this example here), there would be no need to sortBy because Map.toAscList gives exactly the sorted list we want.

This constitutes another argument for this proposal.

Note that graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys would also
be deprecated in favor of graphFromConsecutiveMap, because they have the same a) and b) issues.

Naming proposal N1:
    - graphFromEdges                 (takes a List, deprecated, existing function)
    - graphFromEdgesInMap            (takes a Map)
    - graphFromEdgesInConsecutiveMap (takes a Map with consecutive keys)
Naming proposal N2:
    - graphFromEdges                 (takes a List, deprecated, existing function)
    - graphFromMap
    - graphFromConsecutiveMap
 with these, to reflect the Map / List duality in the naming scheme:
    - graphFromList               (takes a List, deprecated, redirects to graphFromEdges)
    - graphFromConsecutiveList    (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveKeys)
    - graphFromConsecutiveAscList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveAscKeys)

--
Olivier Sohn

PS : when writing the criterion graph benchmarks, I discovered that Graphs can be
evaluated to normal form by criterion only when they are acyclic. Cyclic graphs
will make the evaluation loop infinitely. I wonder why that is, if anyone has an
explanation on that subject I'd be happy to hear it! Or should I submit a bug report
at criterion?


_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries