Does anyone know the reason for differeng measurements in profiling under -threaded?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Does anyone know the reason for differeng measurements in profiling under -threaded?

Jesper Louis Andersen-2
Hi,

I am asking if anyone has seen the following behaviour from GHC under
-threaded and heavy use of File I/O, network I/O and STM use. After
having run Combinatorrent for a while and then terminating it, we get
the following output from the RTS in GC statistics:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.22s  ( 60.06s elapsed)
  GC    time    0.05s  (  0.07s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.26s  ( 60.12s elapsed)

  %GC time      18.2%  (0.1% elapsed)

  Alloc rate    431,635,651 bytes per MUT second

  Productivity  80.3% of total user, 0.4% of total elapsed

Note that I am running it for exactly 60 seconds of wall-clock time.
Yet, the profile information reports:

        total time  =       11.20 secs   (560 ticks @ 20 ms)
        total alloc =  78,696,568 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc
ticks     bytes

MAIN                           MAIN                  80.4    0.2
450     20033
PeerControl                    Process.Peer          11.2    1.0
63     95272
Sender                         Process.Peer.Sender    6.2    0.0
35       387
sha1_digest                    Digest                 1.8    1.4
10    136377
checkPiece                     FS                     0.2   92.3
1   9083658
TorrentManager                 Process.TorrentManager   0.2    1.8
 1    179587

which is clearly wrong. Total time is nowhere near 11.20 secs with 560
ticks here. I am almost certain that the 450 ticks attributed to MAIN
is wrong as well. I have SCCs at each possible forkIO-point in the
program. -prof -auto-all -caf-all doesn't make any difference either,
I've tried. Has anyone seen this kind of problem and in that case,
what was the fix? Barring nobody knows, the only thing I have left to
try is to dive into the GHC RTS and figure out what is going on in
there.

As an aside: Before writing combinatorrent, I wrote etorrent, A
BitTorrent client in Erlang. The two clients follow more or less the
same basic structure and largely does the same ting. Combinatorrent
has better memory usage right now, mostly due to representation
choices: Haskell can do much better data representation than erlang.
On the other hand, etorrent still beats out combinatorrent in the CPU
department, using less CPU. And that somewhat perplexes me. This is
why I am looking at performance profiles and trying to find the
culprit. But with the above problem, I can't find it, since I don't
trust the profiler anymore to give me the right results. All logic
suggest that Haskell is ripping apart Erlang in all parts that has no
communication, mostly due to being compiled and having a static type
system (The erlang code is interpreted, even though they have a HiPE
JIT). In practice however, my guess is that the binary handling in
Erlang is quite effective, as is the communication between processes,
so this might turn the tides in the favor of the erlang-RTS.

As a fun thing, I tried downloading the same file in the test
environment and running strace -cf on both programs. Here are the top
winners. I don't think you can trust all the columns:

Erlang:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.57  285.220815        2241    127300     10518 futex
  2.16    6.453726         271     23828      9838 read
  2.15    6.420512      279153        23           wait4
  0.07    0.198851           3     64073         1 epoll_wait
  0.02    0.048004        4000        12           poll
  0.01    0.039131           2     16297           write
  0.01    0.026917           0     88057     37842 recvfrom
  0.01    0.018174           0     75803        36 epoll_ctl
  0.00    0.008001         667        12         1 select
  0.00    0.007133           1      5138         3 lseek
  0.00    0.001931          64        30         6 execve
  0.00    0.001912           0      5535           writev
  0.00    0.001334         191         7           vfork
  0.00    0.001209           0      2684      1995 open
  0.00    0.001128           0      3094      2257 stat
  0.00    0.000942           6       153           munmap

Haskell/GHC:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 32.20 15834.075616      176163     89883     10065 futex
 21.72 10682.355809      176009     60692     26672 recvfrom
 12.22 6009.064699      176023     34138        18 select
 10.25 5038.622913      176009     28627         1 write
  8.45 4153.652812      176010     23599         1 read
  4.84 2380.702238      176009     13526         1 writev
  3.59 1765.107536      175965     10031           rt_sigprocmask
  1.95  961.185526      176009      5461           fstat
  1.94  956.257032      176009      5433           lseek
  1.80  883.829682      175922      5024           getrusage
  0.85  419.781465      176009      2385       107 rt_sigreturn
  0.04   19.713008      176009       112           fcntl
  0.03   12.320630      176009        70           mmap
  0.02    9.680495      176009        55           socket
  0.02    9.504619      176011        54        53 connect
  0.02    8.976459      176009        51           getsockopt
  0.01    5.456279      176009        31           close

Apart from Erlang using epoll, there is not too much difference from
the perspective of the kernel, which I think is rather good :)

--
J.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Does anyone know the reason for differeng measurements in profiling under -threaded?

Iustin Pop
On Fri, Apr 23, 2010 at 12:14:29AM +0200, Jesper Louis Andersen wrote:
> Hi,
>
> I am asking if anyone has seen the following behaviour from GHC under
> -threaded and heavy use of File I/O, network I/O and STM use. After
> having run Combinatorrent for a while and then terminating it, we get
> the following output from the RTS in GC statistics:

[snip]

I don't know GHC internals, but from the description of the program (and
the fact that you don't use more than one core), I wonder why you use
-threaded?

How does the behaviour change if you use a single-threaded runtime?

(I'm asking this as the single-threaded runtime can be much faster in
certain use-cases.)

regards,
iustin
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Does anyone know the reason for differeng measurements in profiling under -threaded?

Jesper Louis Andersen-2
On Fri, Apr 23, 2010 at 2:19 AM, Iustin Pop <[hidden email]> wrote:
> I don't know GHC internals, but from the description of the program (and
> the fact that you don't use more than one core), I wonder why you use
> -threaded?
>

A worthy question indeed. The reason is that I am using registerDelay
and that requires -threaded. I could try rewriting the code to not use
that.


--
J.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe