Project idea, seeking feedback

Project idea, seeking feedback

Alex-6
Hi all,

I'm seeking feedback for a project I'd like to start. I have a bit of
experience developing large scale systems containing many
microservices, databases, message queues, and caches over many VMs. Time
and time again I find myself confronted with the same problems:

1. It is difficult to trace events through the system: Consider an HTTP
request made by a customer to a public API. Which microservices were
impacted by that request? What SQL queries were run as a result of that
request? What 3rd party APIs were consulted during the request's
fulfillment? Answers to these questions are essential to fixing bugs
quickly, and yet they are so difficult to answer (at least in my
experience).

2. Problems are difficult to reproduce: When Customer Success walks in
and says, "I have an angry customer on the phone. They want to know why
[FOO] wasn't properly [BAR]" it is often impossible to give an answer
without interactive troubleshooting and hours of grepping through
unstructured log files. Troubleshooting may incur additional expenses
too, since (for instance) you may hit your API request limit for a 3rd
party service.

3. Business and non-business logic are not well encapsulated: Often I
see code related to (for example) RabbitMQ interwoven with core business
logic when calls need to be made to other microservices. The fact that
RabbitMQ facilitates communication between microservices is an
implementation detail that I shouldn't have to think about.

4. Resource consumption is non-uniform: Some microservices are more
demanding than others in terms of CPU, memory, and disk usage.
Achieving optimal "packing" is difficult. In other words, some VMs
will have a high load and others will remain idle. Auto scaling groups
can help with this in theory, but I don't think they can achieve the
kind of density I would like to see.
   Moreover, what constitutes a "resource"? If a 3rd party service
rate limits requests by IP address, couldn't each request be considered
a resource unit which needs to be properly load balanced, just as you
would with CPU?

Given these motivations, I would like to flesh out some ideas for a
framework/platform which addresses these issues. These ideas are
half-baked and may not tie in well with one another.

I envision a distributed system as follows:

1. One kind of VM:
    DevOps people have a saying: "Treat your VMs like cattle, not
pets". In practice, "cattle" becomes "cows, chickens, pigs, and
lobster". VMs typically have an assigned role, and they become part of
a group which may or may not be auto-scaling. For a given instantiation
of this hypothetical platform, I would like to see a single kind of VM.
That is, every VM is identical to every other VM, and they all run the
same Haskell application.

2. Strict separation of business and non-business logic: The framework
should handle all aspects of communication between nodes (like Cloud
Haskell does) in a pluggable and transparent way, but that's not all.
The framework should have first class support for other integrations
(such as PagerDuty alerting, performance monitoring, etc) which are
described below.

3. Pool coordination via DSL: The entire pool of VMs is
orchestrated/coordinated by one or more "scripts" written in a DSL,
which is implemented as a Free Monad. Every single "operation" or
"primitive" in your AST data type is Serializable, and when the
framework interprets the DSL, it serializes the instruction and sends
it over the network to a node for execution. The particular node on
which the instruction gets executed is chosen by the platform, not the
developer.
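
To make the idea concrete, here is a minimal sketch (all primitive names are invented for illustration): a hand-rolled free monad over a small instruction functor, interpreted purely into a trace. A distributed interpreter would instead serialize each instruction and send it to whichever node the scheduler picks.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Hypothetical primitives; a real instruction set would be Serializable.
data Op next
  = FetchUser Int (String -> next)  -- look up a user by id
  | SendEmail String String next    -- recipient, body
  deriving Functor

-- A hand-rolled free monad (the `free` package provides this as
-- Control.Monad.Free).
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- Smart constructors hide the Free plumbing from business logic.
fetchUser :: Int -> Free Op String
fetchUser uid = Free (FetchUser uid Pure)

sendEmail :: String -> String -> Free Op ()
sendEmail to body = Free (SendEmail to body (Pure ()))

-- Pure interpreter: collects a trace instead of doing I/O. A distributed
-- interpreter would serialize each Op and ship it to a chosen node.
runPure :: Free Op a -> [String]
runPure (Pure _) = []
runPure (Free (FetchUser uid k)) =
  ("fetch user " ++ show uid) : runPure (k ("user-" ++ show uid))
runPure (Free (SendEmail to body k)) =
  ("email " ++ to ++ ": " ++ body) : runPure k

-- Business logic reads like a script; where it runs is not its concern.
script :: Free Op ()
script = do
  name <- fetchUser 42
  sendEmail name "hello"

main :: IO ()
main = mapM_ putStrLn (runPure script)
```

Swapping interpreters is the whole point: the same `script` runs locally for tests and remotely in production.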

4. Smart resource consumption: Each node brings with it a set of
resources. It is *not* my intention to create a system which views CPU,
memory, etc. as one contiguous pool. Rather, each primitive instruction in
the AST is viewed as a "black box" which can only consume as much CPU
and memory as the node has available to it. The framework is
responsible for profiling each instruction and scheduling future
instructions to a node for which resources are predicted to be
available.
   The developer should be able to define new resources such as 3rd
party API calls, bandwidth, database connections, etc, all of which are
profiled just as CPU and memory are.
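
One way to picture the accounting, as a rough sketch (resource names like "mailgun-api" are hypothetical): treat a resource vector as a map from names to units, profile each primitive as such a vector, and only schedule an instruction onto a node whose free vector covers it.

```haskell
import qualified Data.Map.Strict as M

-- A resource vector: user-defined names mapped to available/required units.
type Resources = M.Map String Double

data Node = Node { nodeName :: String, nodeFree :: Resources }

-- Profiled per-instruction cost, learned from historical runs.
type Profile = Resources

-- An instruction fits on a node if every predicted cost is covered.
-- Custom resources ("mailgun-api", "db-conns") work exactly like cpu.
fits :: Profile -> Node -> Bool
fits prof node =
  and [ M.findWithDefault 0 r (nodeFree node) >= c | (r, c) <- M.toList prof ]

-- Pick the first node with room; a real scheduler would rank candidates
-- by predicted headroom rather than take the first match.
schedule :: Profile -> [Node] -> Maybe String
schedule prof nodes =
  case filter (fits prof) nodes of
    (n:_) -> Just (nodeName n)
    []    -> Nothing

demo :: Maybe String
demo = schedule (M.fromList [("cpu", 1.0), ("mailgun-api", 1)])
  [ Node "a" (M.fromList [("cpu", 0.2)])
  , Node "b" (M.fromList [("cpu", 1.5), ("mailgun-api", 10)])
  ]

main :: IO ()
main = print demo
```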

5. Browser based control panel: Engineers should have a GUI at their
disposal which allows them to watch -- in real time -- the execution
flow of the DSL script.

6. Structured logs with advanced filtering: All log output should be
structured with first class support for shipping the data to
Logstash/ElasticSearch. The aforementioned GUI should be able to
selectively filter output based on certain pre-defined predicates and
display them to the developer. For example, if you're building an email
virus scanning system (which may see millions of emails per day), you
may want to limit the real-time debugging output to only a specific
customer.
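
A sketch of what such a predicate might look like (field names invented): log entries are plain records, and the GUI's filters are just functions over them.

```haskell
-- A structured log entry; in practice this would carry a timestamp,
-- trace id, and arbitrary key-value context.
data LogEntry = LogEntry
  { customer :: String
  , severity :: Int
  , message  :: String
  } deriving Show

-- A pre-defined predicate the control panel could toggle at runtime:
-- restrict the live stream to a single customer's events.
onlyCustomer :: String -> LogEntry -> Bool
onlyCustomer c e = customer e == c

demo :: [LogEntry]
demo =
  [ LogEntry "acme"   1 "mail scanned"
  , LogEntry "globex" 2 "mail quarantined"
  ]

main :: IO ()
main = mapM_ (putStrLn . message) (filter (onlyCustomer "acme") demo)
```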

7. First class integration with modern tools and services: The system
should integrate with Consul, PagerDuty, statsd, RabbitMQ, memcache,
DataDog, Logstash, and Slack, with new integrations being easy to add.
This is vital for clean separation of business and non-business logic.
For example, the developer should be able to cache certain bits of data
at will, without having to worry about opening and managing a TCP
connection to memcache.

This is my vision, and I want to build it completely in Haskell. What
do you all think?

--
Alex
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Re: Project idea, seeking feedback

Markus Läll-2
Hi Alex,

this is obviously highly ambitious if you want to get it right. If
you actually plan to start on this, then since there are plenty of
DSLs that eventually run on one machine, I would start with the
distributed part. I.e. make something that passes around an Int,
and have it deploy to any number of machines. Then gradually add
complexity, like distributed queues and workers, and ways to enforce
ordering on when results of work are submitted/accepted. Implementing
precedence graphs would be interesting. Then there is limiting
congestion, and probably many more kinds of limits you want to add at
different points in the graph.
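
A tiny sketch of the precedence-graph idea (task names invented): given tasks and their prerequisites, a Kahn-style pass yields an order in which results may be accepted.

```haskell
import Data.List (partition)

-- Tasks paired with their unmet prerequisites.
type Graph a = [(a, [a])]

-- Kahn-style topological order: repeatedly emit the tasks with no
-- remaining prerequisites and delete them from the rest of the graph.
topo :: Eq a => Graph a -> [a]
topo [] = []
topo g
  | null ready = error "cycle in precedence graph"
  | otherwise  = done ++ topo [ (t, filter (`notElem` done) ps) | (t, ps) <- rest ]
  where
    (ready, rest) = partition (null . snd) g
    done          = map fst ready

main :: IO ()
main = print (topo [ ("deploy", ["build", "test"])
                   , ("test",   ["build"])
                   , ("build",  [])
                   ])
```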

Just some random ideas. But start and deploy something very simple at
first.

--
Markus Läll


Re: Project idea, seeking feedback

MarLinn
In reply to this post by Alex-6

Hi Alex,

sounds ambitious. But you might be able to reduce the scope massively by relying on existing tools.

Examples:

  • Let something like Nagios do the monitoring. I know there are tools to control Nagios from Haskell. What I don't know is how up to date they are, and I haven't seen anything that reports internal performance data of a Haskell app to Nagios, but that should be simple to add if necessary.

  • Let something like Cassandra handle both the heaviest parts of messaging between your node controllers and the storage of their config data. If you base your WUI on top of the DB, you can separate it from the controllers as well.

  • Coordination of resources is a variant of scheduling, which is a "solved" problem. So there should be libraries you can use.

  • Logging has been worked on by many a commercial Haskeller. My guess is that filtering is just a matter of looking at one of the libraries from the right angle.

This leaves orchestration, API connectors, and the DSL as the missing parts, which sounds way more doable than having your tool do all the lifting itself.

Or just use Kubernetes. Whichever is easier. ;)

Cheers,
MarLinn



Re: Project idea, seeking feedback

Alex-6
On Wed, 15 Nov 2017 15:30:37 +0100
MarLinn <[hidden email]> wrote:

> Hi Alex,
>
> sounds ambitious. But you might be able to reduce the scope massively
> by relying on existing tools.
>

Yes! I do not wish to reinvent the wheel.

> Examples:
>
>   * Let something like Nagios do the monitoring. I know there's tools
>     to control Nagios from Haskell. What I don't know is how up-to-date
>     they are, and I haven't seen something that reports internal
>     performance data of a Haskell app to Nagios, but that should be
>     simple if necessary.
>

I don't think Nagios is a good fit because I want to do more than
monitor the performance of the interpreter. I want to rely on that
performance data so that I can use resources more effectively. For
example, I want to know what the load average of a particular node is,
and then I want to rely on historical performance data of the DSL
primitives to determine if the next instruction to be executed should
be scheduled to run on that node or a different one.

>   * Let something like Cassandra handle both the heaviest parts of
>     messaging between your node controllers and the storage of their
>     config data. If you base your WUI on top of the DB, you can
>     separate it from the controllers as well.
>
>   * Coordination of resources is a variant of scheduling, which is a
>     "solved" problem. So there should be libraries you can use.
>

For cluster coordination/configuration I was thinking of using Consul.

>   * Logging has been worked on by many a commercial Haskeller. My
>     guess is that filtering is just a matter of looking at one of the
>     libraries from the right angle.
>

I intend to leverage existing libraries where possible. I want to
create an environment in which the commercial Haskeller never has to
choose and wire in a logging library. The decision is already made by
the framework. They just need to insert logging statements where
appropriate.

> Or just use Kubernetes. Whichever is easier. ;)
>

Kubernetes is a great tool, but it doesn't do what I envision.

--
Alex

Re: Project idea, seeking feedback

Ben Kolera
Don't forget http://zipkin.io/. It's awesome. :)

Cheers,
Ben
