static analysis of a C-like language

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

static analysis of a C-like language

Olaf Klinke
Dear Cafe,

I want to build a linter for a C-like language. My company uses a UI system that comes with a C-like language that is used to provide functionality to the UI elements. The language is interpreted at runtime. The user, however, has no access to the built-in parser and checking functions. Only a basic syntax check comes with their editor. Therefore many stupid mistakes only show at runtime and can not be tested for statically.
The language is in many respects even simpler than C. No structs, no enumerations, only a handful of basic data types and arrays thereof, plus pointers and the usual control flow structures. On the other hand, there are some features that make static analysis near impossible, e.g. the possibility of dynamic registration of variables. Higher-order types are emulated by passing the name of other functions as strings. There is explicit referencing of other sources (#include statements) as well as implicit referencing within UI elements and within the project hierarchy.

We want to catch basic programming mistakes - misspelled variables, missing return statements, usage of variables before assignment, wrong number of function arguments/types. A project's code is scattered across multiple files and layers - a function declaration can mask a declaration from the layer below. Therefore, a per-file-linter will not suffice.
As a side-effect this may also yield a documentation generator (boy do I miss haddock while working with that), which is also missing in the system.

I have never done something like this and will seek help from local CS departments (If you are a CS student in Paderborn or Bielefeld, Germany, get in touch!). My company would be willing to sponsor a Bachelor or Master's thesis project. Ultimately this will be comercialized as an add-on, so anyone who helps will also profit from the result financially. The user community is not large, but includes big names such as Siemens, London Heathrow, the Metros in NYC and Bilbao, and CERN.

What kind of expertise shall I look for? Compiler theorists? What information shall I request from the language author? I feel that Haskell might be a good language to represent the AST in. But I am overwhelmed by the numerous AST libraries on hackage.

How similar is my problem to what is covered by, say, haskell-src-exts, hlint, alex or syntactic? Can I re-use code from these projects?

What solutions are out there for dependency resolution between source files?

I suppose for informative error messages I need to decorate each semantic bit with information about
- source file and location
- what is in scope at that position, plus the source of the symbols in scope
What data structures can be used to build and represent this information?

Any pointers welcome.
Olaf


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: static analysis of a C-like language

Haskell - Haskell-Cafe mailing list
Hello Olaf,

to me that sounds as if you want to do an abstract interpretation with a
forward collecting semantics that employs non-relational abstract
domains for the primitive data types and summarizes the dimensions of
arrays. That is what the Java compiler does when it analyzes whether a
variable is 'effectively final' (except that it doesn't even have to
summarize arrays for that), and it should suffice for the kind of bugs
you want to find.

I would start by writing a simple interpreter for the language to be
analyzed. That way you fix messy details before they bite you, e.g. the
order in which submodules are loaded and initialized. You also get a
clear formalization of the semantics, and you can annotate the syntax
tree and implement informative error messages for the bugs you want to
detect. You don't even need to implement every primitive function - it
suffices that they return some random value, provided that the intended
return value is one of the possible values. Lifting the interpreter to
an abstract interpreter can then be done in a canonical way.


Regards,

Holger


Am 21.11.2018 um 22:01 schrieb Olaf Klinke:

> Dear Cafe,
>
> I want to build a linter for a C-like language. My company uses a UI system that comes with a C-like language that is used to provide functionality to the UI elements. The language is interpreted at runtime. The user, however, has no access to the built-in parser and checking functions. Only a basic syntax check comes with their editor. Therefore many stupid mistakes only show at runtime and can not be tested for statically.
> The language is in many respects even simpler than C. No structs, no enumerations, only a handful of basic data types and arrays thereof, plus pointers and the usual control flow structures. On the other hand, there are some features that make static analysis near impossible, e.g. the possibility of dynamic registration of variables. Higher-order types are emulated by passing the name of other functions as strings. There is explicit referencing of other sources (#include statements) as well as implicit referencing within UI elements and within the project hierarchy.
>
> We want to catch basic programming mistakes - misspelled variables, missing return statements, usage of variables before assignment, wrong number of function arguments/types. A project's code is scattered across multiple files and layers - a function declaration can mask a declaration from the layer below. Therefore, a per-file-linter will not suffice.
> As a side-effect this may also yield a documentation generator (boy do I miss haddock while working with that), which is also missing in the system.
>
> I have never done something like this and will seek help from local CS departments (If you are a CS student in Paderborn or Bielefeld, Germany, get in touch!). My company would be willing to sponsor a Bachelor or Master's thesis project. Ultimately this will be comercialized as an add-on, so anyone who helps will also profit from the result financially. The user community is not large, but includes big names such as Siemens, London Heathrow, the Metros in NYC and Bilbao, and CERN.
>
> What kind of expertise shall I look for? Compiler theorists? What information shall I request from the language author? I feel that Haskell might be a good language to represent the AST in. But I am overwhelmed by the numerous AST libraries on hackage.
>
> How similar is my problem to what is covered by, say, haskell-src-exts, hlint, alex or syntactic? Can I re-use code from these projects?
>
> What solutions are out there for dependency resolution between source files?
>
> I suppose for informative error messages I need to decorate each semantic bit with information about
> - source file and location
> - what is in scope at that position, plus the source of the symbols in scope
> What data structures can be used to build and represent this information?
>
> Any pointers welcome.
> Olaf
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: static analysis of a C-like language

Siddharth Bhat
What is the canonical way to lift an interpreter to an abstract interpreter, while ensuring convergence ---figuring out how and when to widen?

Thanks
Siddharth

On Thu, 22 Nov, 2018, 18:52 Holger Siegel via Haskell-Cafe, <[hidden email]> wrote:
Hello Olaf,

to me that sounds as if you want to do an abstract interpretation with a
forward collecting semantics that employs non-relational abstract
domains for the primitive data types and summarizes the dimensions of
arrays. That is what the Java compiler does when it analyzes whether a
variable is 'effectively final' (except that it doesn't even have to
summarize arrays for that), and it should suffice for the kind of bugs
you want to find.

I would start by writing a simple interpreter for the language to be
analyzed. That way you fix messy details before they bite you, e.g. the
order in which submodules are loaded and initialized. You also get a
clear formalization of the semantics, and you can annotate the syntax
tree and implement informative error messages for the bugs you want to
detect. You don't even need to implement every primitive function - it
suffices that they return some random value, provided that the intended
return value is one of the possible values. Lifting the interpreter to
an abstract interpreter can then be done in a canonical way.


Regards,

Holger


Am 21.11.2018 um 22:01 schrieb Olaf Klinke:
> Dear Cafe,
>
> I want to build a linter for a C-like language. My company uses a UI system that comes with a C-like language that is used to provide functionality to the UI elements. The language is interpreted at runtime. The user, however, has no access to the built-in parser and checking functions. Only a basic syntax check comes with their editor. Therefore many stupid mistakes only show at runtime and can not be tested for statically.
> The language is in many respects even simpler than C. No structs, no enumerations, only a handful of basic data types and arrays thereof, plus pointers and the usual control flow structures. On the other hand, there are some features that make static analysis near impossible, e.g. the possibility of dynamic registration of variables. Higher-order types are emulated by passing the name of other functions as strings. There is explicit referencing of other sources (#include statements) as well as implicit referencing within UI elements and within the project hierarchy.
>
> We want to catch basic programming mistakes - misspelled variables, missing return statements, usage of variables before assignment, wrong number of function arguments/types. A project's code is scattered across multiple files and layers - a function declaration can mask a declaration from the layer below. Therefore, a per-file-linter will not suffice.
> As a side-effect this may also yield a documentation generator (boy do I miss haddock while working with that), which is also missing in the system.
>
> I have never done something like this and will seek help from local CS departments (If you are a CS student in Paderborn or Bielefeld, Germany, get in touch!). My company would be willing to sponsor a Bachelor or Master's thesis project. Ultimately this will be comercialized as an add-on, so anyone who helps will also profit from the result financially. The user community is not large, but includes big names such as Siemens, London Heathrow, the Metros in NYC and Bilbao, and CERN.
>
> What kind of expertise shall I look for? Compiler theorists? What information shall I request from the language author? I feel that Haskell might be a good language to represent the AST in. But I am overwhelmed by the numerous AST libraries on hackage.
>
> How similar is my problem to what is covered by, say, haskell-src-exts, hlint, alex or syntactic? Can I re-use code from these projects?
>
> What solutions are out there for dependency resolution between source files?
>
> I suppose for informative error messages I need to decorate each semantic bit with information about
> - source file and location
> - what is in scope at that position, plus the source of the symbols in scope
> What data structures can be used to build and represent this information?
>
> Any pointers welcome.
> Olaf
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
--
Sending this from my phone, please excuse any typos!

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: static analysis of a C-like language

Haskell - Haskell-Cafe mailing list
Hello Siddharth,


Am 22.11.2018 um 14:26 schrieb Siddharth Bhat:
> What is the canonical way to lift an interpreter to an abstract
> interpreter, while ensuring convergence ---figuring out how and when
> to widen?

Yes, exactly. In case of Olaf's imperative language with control
structures that would be the entry point of while-loops. Potentially
recursive function calls also have to be considered when you move from a
per-function analysis to interprocedural analysis.


Regards,

Holger
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: static analysis of a C-like language

Olaf Klinke
In reply to this post by Olaf Klinke
> Hello Olaf,
>
> to me that sounds as if you want to do an abstract interpretation with a
> forward collecting semantics that employs non-relational abstract
> domains for the primitive data types and summarizes the dimensions of
> arrays.
...
> I would start by writing a simple interpreter for the language to be
> analyzed. That way you fix messy details before they bite you, e.g. the
> order in which submodules are loaded and initialized.

I was hoping not having to write an interpreter (because the language author wrote a translation to C++ already), but if that is the way to go, I'll do it. As I understand it, the Haskell semantics should contain just enough substance so that the errors I am after will cause hiccups in the Haskell compiler? That is indeed a compelling approach.
What this does not address is the question about error reporting:  How could a translation to Haskell preserve information about scope, source position and masking? Can I leverage the ghc API for that?

Regards,
Olaf
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: static analysis of a C-like language

Carter Schonwald
Most parser libraries in Haskell provide facilities for including source information. 

And then you include in your syntax tree extra fields for those positions. 

Simple as that 


The paper : 
Is a great reference for how to add a bunch of program analysis features after you have a working interpreter .  Also it’s citations show a bunch of ways you can Add different flavored of anayses 

On Mon, Nov 26, 2018 at 3:37 PM Olaf Klinke <[hidden email]> wrote:
> Hello Olaf,
>
> to me that sounds as if you want to do an abstract interpretation with a
> forward collecting semantics that employs non-relational abstract
> domains for the primitive data types and summarizes the dimensions of
> arrays.
...
> I would start by writing a simple interpreter for the language to be
> analyzed. That way you fix messy details before they bite you, e.g. the
> order in which submodules are loaded and initialized.

I was hoping not having to write an interpreter (because the language author wrote a translation to C++ already), but if that is the way to go, I'll do it. As I understand it, the Haskell semantics should contain just enough substance so that the errors I am after will cause hiccups in the Haskell compiler? That is indeed a compelling approach.
What this does not address is the question about error reporting:  How could a translation to Haskell preserve information about scope, source position and masking? Can I leverage the ghc API for that?

Regards,
Olaf
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: static analysis of a C-like language

Paul Brauner-3
If you're interested I have a draft implemetation of that paper in Haskell that uses monad transformers. There's another one online that uses effect handlers. The original scheme implementation is also available. 

A caveat: while playing with my implementation I found some issue with the GC as described in the paper and was able to reproduce on the reference implementation: https://github.com/plum-umd/abstracting-definitional-interpreters/issues/82. I'm not knowledgeable enough to understand whether this is a flaw of the approach or something that can be easily fixed.

Paul

On Thu, Nov 29, 2018 at 3:31 PM Carter Schonwald <[hidden email]> wrote:
Most parser libraries in Haskell provide facilities for including source information. 

And then you include in your syntax tree extra fields for those positions. 

Simple as that 


The paper : 
Is a great reference for how to add a bunch of program analysis features after you have a working interpreter .  Also it’s citations show a bunch of ways you can Add different flavored of anayses 

On Mon, Nov 26, 2018 at 3:37 PM Olaf Klinke <[hidden email]> wrote:
> Hello Olaf,
>
> to me that sounds as if you want to do an abstract interpretation with a
> forward collecting semantics that employs non-relational abstract
> domains for the primitive data types and summarizes the dimensions of
> arrays.
...
> I would start by writing a simple interpreter for the language to be
> analyzed. That way you fix messy details before they bite you, e.g. the
> order in which submodules are loaded and initialized.

I was hoping not having to write an interpreter (because the language author wrote a translation to C++ already), but if that is the way to go, I'll do it. As I understand it, the Haskell semantics should contain just enough substance so that the errors I am after will cause hiccups in the Haskell compiler? That is indeed a compelling approach.
What this does not address is the question about error reporting:  How could a translation to Haskell preserve information about scope, source position and masking? Can I leverage the ghc API for that?

Regards,
Olaf
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.