Cloud Haskell real usage example

Cloud Haskell real usage example

Thiago Negri
Hello everyone. I'm taking my first steps in Cloud Haskell and got
some unexpected behaviors.

I used the code from Raspberry Pi in a Haskell Cloud [1] as a first
example. I tried to switch the code to use Template Haskell with no
luck, so I stuck with the verbose style.
I changed some of the code, from ProcessId-based messaging to typed
channel to receive the Pong; using "startSlave" to start the worker
nodes; and changed the master node to loop forever sending pings to
the worker nodes.

The unexpected behaviors:
- Dropping a worker node while the master is running makes the master
node crash.
- The master node does not see worker nodes started after the master process.

In order to fix this, I tried to "findSlaves" at the start of the
master process and send ping to only these ones, ignoring the list of
NodeId enforced by the type signature of "startMaster".

Now the master finds new slaves. The bad thing is that when I close
one of the workers, the master process freezes. It simply stops doing
anything. No more messages and no more Pings to other slaves. :(


My view of Cloud Haskell usage would be something similar to this: a
master node sending work to slaves; slave instances getting up or down
based on demand. So, the master node should be slave-failure-proof and
also find new slaves somehow.

Am I misunderstanding the big picture of Cloud Haskell or doing
anything wrong in the following code?

Code (skipped imports and wiring stuff):

--
newtype Ping = Ping (SendPort Pong)
        deriving (Typeable, Binary, Show)

newtype Pong = Pong ProcessId
        deriving (Typeable, Binary, Show)

worker :: Ping -> Process ()
worker (Ping sPong) = do
  wId <- getSelfPid
  say "Got a Ping!"
  sendChan sPong (Pong wId)

master :: Backend -> [NodeId] -> Process ()
master backend _ = forever $ do
  workers <- findSlaves backend
  say $ "Slaves: " ++ show workers

  (sPong, rPong) <- newChan

  forM_ workers $ \w -> do
    say $ "Sending a Ping to " ++ (show w) ++ "..."
    spawn w (workerClosure (Ping sPong))

  say $ "Waiting for reply from " ++ (show (length workers)) ++ " worker(s)"

  replicateM_ (length workers) $ do
      (Pong wId) <- receiveChan rPong
      say $ "Got back a Pong from " ++ (show $ processNodeId wId) ++ "!"

  (liftIO . threadDelay) 2000000 -- Wait a bit before return

main = do
  prog <- getProgName
  args <- getArgs

  case args of
    ["master", host, port] -> do
      backend <- initializeBackend host port remoteTable
      startMaster backend (master backend)

    ["worker", host, port] -> do
      backend <- initializeBackend host port remoteTable
      startSlave backend

    _ ->
      putStrLn $ "usage: " ++ prog ++ " (master | worker) host port"
--

[1] http://alenribic.com/writings/post/raspberry-pi-in-a-haskell-cloud

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Cloud Haskell real usage example

Felipe Almeida Lessa
On Tue, Aug 21, 2012 at 9:01 PM, Thiago Negri <[hidden email]> wrote:
> My view of Cloud Haskell usage would be something similar to this: a
> master node sending work to slaves; slave instances getting up or down
> based on demand. So, the master node should be slave-failure-proof and
> also find new slaves somehow.
>
> Am I misunderstanding the big picture of Cloud Haskell or doing
> anything wrong in the following code?

(Disclaimer: I can't speak for Cloud Haskell's developers.)

AFAIK this is CH's goal.  However, they're not quite there yet.  Their
network implementation is still quite naive, as you're seeing =).

Cheers,

--
Felipe.

Re: Cloud Haskell real usage example

yi huang
On Wed, Aug 22, 2012 at 8:30 AM, Felipe Almeida Lessa <[hidden email]> wrote:
> On Tue, Aug 21, 2012 at 9:01 PM, Thiago Negri <[hidden email]> wrote:
> > My view of Cloud Haskell usage would be something similar to this: a
> > master node sending work to slaves; slave instances getting up or down
> > based on demand. So, the master node should be slave-failure-proof and
> > also find new slaves somehow.
> >
> > Am I misunderstanding the big picture of Cloud Haskell or doing
> > anything wrong in the following code?
>
> (Disclaimer: I can't speak for Cloud Haskell's developers.)
>
> AFAIK this is CH's goal.  However, they're not quite there yet.  Their
> network implementation is still quite naive, as you're seeing =).

I believe this behavior is due to the use of a channel: receiveChan blocks until a message arrives, so you just need to implement some kind of timeout.
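
To illustrate the timeout idea: in the master loop from the original post, replicateM_ (length workers) keeps calling receiveChan until every expected Pong has arrived, so a single worker that dies after being pinged stalls everything. Below is a minimal sketch of a bounded wait, modelled in plain IO with Control.Concurrent.Chan and System.Timeout from base rather than with Cloud Haskell's typed channels. The names collectReplies and demo and their parameters are hypothetical. Inside the Process monad, distributed-process offers receiveChanTimeout for the same purpose, as far as I know.

```haskell
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import System.Timeout (timeout)

-- | Collect at most @n@ replies, waiting no longer than @micros@
-- microseconds for each one. Stops at the first timeout, so a sender
-- that never answers cannot block the caller forever.
-- (Hypothetical helper; a plain-IO stand-in for a Process-level receive.)
collectReplies :: Int -> Int -> Chan a -> IO [a]
collectReplies micros n replies = go n []
  where
    go 0 acc = return (reverse acc)
    go k acc = do
      reply <- timeout micros (readChan replies)
      case reply of
        Nothing -> return (reverse acc)   -- gave up waiting for the rest
        Just x  -> go (k - 1) (x : acc)

-- | Three replies are expected but only two ever arrive.
demo :: IO [Int]
demo = do
  replies <- newChan
  mapM_ (writeChan replies) [1, 2]
  collectReplies 200000 3 replies
```

Here demo returns the two replies that did arrive once the third wait times out, instead of hanging. The master could then log the missing worker and continue with the next round.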
 

--
http://yi-programmer.com/

Re: Cloud Haskell real usage example

Edsko de Vries-4
In reply to this post by Thiago Negri
Hi Thiago,

Let me address your questions one by one.

On Wed, Aug 22, 2012 at 1:01 AM, Thiago Negri <[hidden email]> wrote:
> Hello everyone. I'm taking my first steps in Cloud Haskell and got
> some unexpected behaviors.
>
> I used the code from Raspberry Pi in a Haskell Cloud [1] as a first
> example. I tried to switch the code to use Template Haskell with no
> luck, so I stuck with the verbose style.

I have pasted a version of your code that uses Template Haskell at
http://hpaste.org/73520. Where did you get stuck?

> I changed some of the code, from ProcessId-based messaging to typed
> channel to receive the Pong; using "startSlave" to start the worker
> nodes; and changed the master node to loop forever sending pings to
> the worker nodes.
>
> The unexpected behaviors:
> - Dropping a worker node while the master is running makes the master
> node crash.

There are two things going on here:

1. A bug in the SimpleLocalnet backend meant that if you dropped a
worker node findSlaves might not return. I have fixed this and
uploaded it to Hackage as version 0.2.0.5.

2. But even with this fix, you will still need to take into account
that workers may disappear once they have been reported by findSlaves.
spawn will actually throw an exception if the specified node is
unreachable (it is debatable whether this is the right behaviour --
see below).
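
A sketch of what tolerating a failed spawn might look like. This is plain IO using Control.Exception from base, not the Process monad, and dispatchAll is a hypothetical helper: its dispatch argument stands in for spawn, and catching SomeException is a simplification (real code would catch only the exception type the library actually throws).

```haskell
import Control.Exception (SomeException, try)
import Data.Either (partitionEithers)

-- | Run @dispatch@ against every node. An exception raised for one node
-- is recorded next to that node instead of aborting the whole round.
-- Returns (failures, successes).
-- (Hypothetical helper; @dispatch@ stands in for a spawn-like action.)
dispatchAll :: (n -> IO a) -> [n] -> IO ([(n, SomeException)], [(n, a)])
dispatchAll dispatch nodes = partitionEithers <$> mapM attempt nodes
  where
    attempt node = do
      result <- try (dispatch node)
      return $ case result of
        Left err -> Left (node, err)   -- node was unreachable or failed
        Right x  -> Right (node, x)    -- work was handed over
```

The master can then wait for replies only from the nodes in the success list, and simply skip the failed ones until the next discovery round.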

> - The master node does not see worker nodes started after the master process.

Yes, startMaster is merely a convenience function. I have modified the
documentation to specify more clearly what startMaster does:

-- | 'startMaster' finds all slaves /currently/ available on the local network,
-- redirects all log messages to itself, and then calls the specified process,
-- passing the list of slave nodes.
--
-- Terminates when the specified process terminates. If you want to terminate
-- the slaves when the master terminates, you should manually call
-- 'terminateAllSlaves'.
--
-- If you start more slave nodes after having started the master node, you can
-- discover them with later calls to 'findSlaves', but be aware that you will
-- need to call 'redirectLogsHere' to redirect their logs to the master node.
--
-- Note that you can use functionality of "SimpleLocalnet" directly (through
-- 'Backend'), instead of using 'startMaster'/'startSlave', if the master/slave
-- distinction does not suit your application.
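
Since findSlaves only reports the nodes reachable at the time of the call, a master that polls it repeatedly has to work out which nodes are new (and need their logs redirected) and which have gone away. A minimal sketch of that bookkeeping as a pure function; diffRounds is a hypothetical name, and the node identifiers can be any type with equality:

```haskell
import Data.List ((\\))

-- | Compare two discovery rounds.
-- Returns (nodes seen now but not before, nodes seen before but not now).
-- (Hypothetical helper; not part of any Cloud Haskell package.)
diffRounds :: Eq n => [n] -> [n] -> ([n], [n])
diffRounds previous current = (current \\ previous, previous \\ current)
```

For example, going from a round that found nodes "a" and "b" to one that finds "b" and "c" yields "c" as newly joined and "a" as dropped.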

Note that with these modifications there is still something slightly
unfortunate: if you delete a worker, and then restart it *at the same
port*, the master will not see it. There is a very good reason for
this: Cloud Haskell guarantees reliable ordered message passing, and
we want a clear semantics for this (unlike, say, in Erlang, where you
might send messages M1, M2 and M3 from P to Q, and Q might receive M1,
M3 but not M2, under certain circumstances). We (developers of Cloud
Haskell, Simon Peyton-Jones and some others) are still debating over
what the best approach is here; in the meantime, if you restart a
worker node, just give a different port number.

Let me know if you have any other questions, and feel free to open an
issue at https://github.com/haskell-distributed/distributed-process/issues?state=open
if you think you found a bug.

Edsko

Re: Cloud Haskell real usage example

Thiago Negri
| I have pasted a version of your code that uses Template Haskell at
| http://hpaste.org/73520. Where did you get stuck?

Your version worked like a charm. I'm quite new to Haskell, so I was
trying desperately to get TH working: I had forgotten to quote "worker"
in the mkClosure call.
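
For anyone hitting the same problem: mkClosure takes a Template Haskell Name, so the function is passed as 'worker (with a leading tick) rather than as worker. A minimal sketch of what the tick does, using only the template-haskell package that ships with GHC. It does not call mkClosure itself, and the worker body and the names workerName and workerLabel are stand-ins:

```haskell
{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH (Name, nameBase)

-- A stand-in for the real worker function.
worker :: Int -> IO ()
worker n = putStrLn ("working on " ++ show n)

-- With the leading tick, 'worker is the Name of the binding,
-- not the function value itself.
workerName :: Name
workerName = 'worker

-- The unqualified identifier that the Name refers to.
workerLabel :: String
workerLabel = nameBase workerName
```

Writing worker without the tick in that position is a type error, because a function of type Int -> IO () is not a Name.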


| 1. A bug in the SimpleLocalnet backend meant that if you dropped a
| worker node findSlaves might not return. I have fixed this and
| uploaded it to Hackage as version 0.2.0.5.

Updated to version 0.2.0.5 and it's working now. :-)


| 2. But even with this fix, you will still need to take into account
| that workers may disappear once they have been reported by findSlaves.
| spawn will actually throw an exception if the specified node is
| unreachable (it is debatable whether this is the right behaviour --
| see below).

Added exception catching, thanks.


| Note that with these modifications there is still something slightly
| unfortunate: if you delete a worker, and then restart it *at the same
| port*, the master will not see it. There is a very good reason for
| this: Cloud Haskell guarantees reliable ordered message passing, and
| we want a clear semantics for this (unlike, say, in Erlang, where you
| might send messages M1, M2 and M3 from P to Q, and Q might receive M1,
| M3 but not M2, under certain circumstances). We (developers of Cloud
| Haskell, Simon Peyton-Jones and some others) are still debating over
| what the best approach is here; in the meantime, if you restart a
| worker node, just give a different port number.

I trust you will make a good decision on this.


By the way, my new code with TH and exception catching: http://hpaste.org/73548

Thanks,
Thiago.
