What is a punctuation character?

Gabriel Dos Reis
Hi,

The lexical structure chapter defines the non-terminal uniSymbol as

     uniSymbol ::= any Unicode symbol or punctuation

There is a slight ambiguity here: is that description supposed to
be parsed as:
   (a) "Unicode (symbol or punctuation)", or
   (b) "(Unicode symbol) or punctuation"?

If (b), then what qualifies as "punctuation"?  As far as I can tell,
that is not defined anywhere in the Report.  Is it "punctuation" in the
basic ASCII charset or in the extended ASCII charset?  Everywhere
else the Report has been careful in listing which ASCII characters
are meant.

Thanks,

-- Gaby

_______________________________________________
Haskell-prime mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-prime
Re: What is a punctuation character?

Brandon Allbery
On Fri, Mar 16, 2012 at 14:08, Gabriel Dos Reis <[hidden email]> wrote:
> The lexical structure chapter defines the non-terminal uniSymbol as
>
>     uniSymbol ::= any Unicode symbol or punctuation
>
> There is a slight ambiguity here: is that description supposed to
> be parsed as:
>   (a) "Unicode (symbol or punctuation)", or
>   (b) "(Unicode symbol) or punctuation"?

(a) and I thought the report specified that the language's lexemes are defined in terms of Unicode properties so (a) is the only meaningful interpretation.  (b) is not particularly meaningful, as your own question demonstrates.

--
brandon s allbery                                      [hidden email]
wandering unix systems administrator (available)     (412) 475-9364 vm/sms


Re: What is a punctuation character?

Gabriel Dos Reis
On Fri, Mar 16, 2012 at 1:18 PM, Brandon Allbery <[hidden email]> wrote:

> On Fri, Mar 16, 2012 at 14:08, Gabriel Dos Reis
> <[hidden email]> wrote:
>>
>> The lexical structure chapter defines the non-terminal uniSymbol as
>>
>>     uniSymbol ::= any Unicode symbol or punctuation
>>
>> There is a slight ambiguity here: is that description supposed to
>> be parsed as:
>>   (a) "Unicode (symbol or punctuation)", or
>>   (b) "(Unicode symbol) or punctuation"?
>
>
> (a) and I thought the report specified that the language's lexemes are
> defined in terms of Unicode properties so (a) is the only meaningful
> interpretation.  (b) is not particularly meaningful, as your own question
> demonstrates.

It is not clear what "the language's lexemes are defined in terms of
Unicode properties" really means.  Why would you need ascSmall (and
similar ASCII character categories) then, when you already have uniSmall
and associates?

It is not clear that (b) is all that "not particularly meaningful".
Have a look at the production <symbol>: it excludes double quote (")
and apostrophe (') from uniSymbol.
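
To make the two readings concrete, here is a minimal Haskell sketch of reading (a) combined with the exclusions in the <symbol> production. It assumes that "symbol or punctuation" maps onto the Unicode general categories that Data.Char exposes through isSymbol and isPunctuation. The names isUniSymbol, isSymbolChar and excludedChars are hypothetical, and the exclusion list (the special characters, underscore, double quote and apostrophe) is recalled from the Haskell 2010 Report, so it should be checked against the Report before being relied on.

```haskell
import Data.Char (isPunctuation, isSymbol)

-- Reading (a): uniSymbol is any code point whose Unicode general category
-- is one of the Symbol (S*) or Punctuation (P*) categories.
isUniSymbol :: Char -> Bool
isUniSymbol c = isSymbol c || isPunctuation c

-- Characters removed from uniSymbol by the <symbol> production.
-- Assumption: special characters, underscore, double quote, apostrophe.
excludedChars :: String
excludedChars = "(),;[]`{}" ++ "_\"'"

-- A character that may appear in an operator symbol under reading (a).
isSymbolChar :: Char -> Bool
isSymbolChar c = isUniSymbol c && c `notElem` excludedChars
```

Under this sketch the double quote and the apostrophe are Unicode punctuation yet never operator characters, which is why the production has to exclude them explicitly.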

-- Gaby

Re: What is a punctuation character?

Brandon Allbery
On Fri, Mar 16, 2012 at 14:30, Gabriel Dos Reis <[hidden email]> wrote:
> It is not clear what "the language's lexemes are defined in terms of
> Unicode properties" really means.  Why would you need ascSmall (and
> similar ASCII character categories) then when you already have uniSmall
> and associates?

I have to assume that is a leftover from an earlier version of the report, because it is indeed already included.

See in section 2.1:

"Haskell uses the Unicode [11] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell."

I understand this to indicate that Unicode character classes are intended, and it does indeed hint that references to ASCII are references to older versions of the language (and should probably be considered fossils, as ASCII itself is; the American Standard Code for Information Interchange was obsoleted by ISO 8859, and modern references to "ASCII" usually should be taken to mean "ISO 8859/1").
 
> It is not clear that (b) is all that "not particularly meaningful".
> Have a look at the production <symbol>: it excludes double quote (")
> and apostrophe (') from uniSymbol.

The notion of "symbol with certain lexicals that have other meanings *that are specified elsewhere in the report*" is not precise enough?  It may be difficult to characterize things with your required precision, since every general statement will necessarily have to carry part or potentially all of the entire Report within it if it is not sufficient to use the statement's context (as describing some part of the Report).

--
brandon s allbery                                      [hidden email]
wandering unix systems administrator (available)     (412) 475-9364 vm/sms


Re: What is a punctuation character?

Gabriel Dos Reis
On Fri, Mar 16, 2012 at 1:49 PM, Brandon Allbery <[hidden email]> wrote:

> On Fri, Mar 16, 2012 at 14:30, Gabriel Dos Reis
> <[hidden email]> wrote:
>>
>> It is not clear what "the language's lexemes are defined in terms of
>> Unicode properties"
>> really means.  Why would you need ascSmall (and similar ASCII
>> character categories) then
>> when you already have uniSmall and associates?
>
>
> I have to assume that is a leftover from an earlier version of the report,
> because it is indeed already included.

I believe this part has seen very little change from the Revised
Haskell 98 Report.  It is not clear that it is an unintended leftover.
Section 2.1 that you quote below is the same as in the (Revised)
Haskell 98 report.

> See in section 2.1:
>
> "Haskell uses the Unicode [11] character set. However, source programs are
> currently biased toward the ASCII character set used in earlier versions of
> Haskell."
>
> I understand this to indicate that Unicode character classes are intended,
> and it does indeed hint that references to ASCII are references to older
> versions of the language (and should probably be considered fossils, as
> ASCII itself is; the American Standard Code for Information Interchange was
> obsoleted by ISO 8859, and modern references to "ASCII" usually should be
> taken to mean "ISO 8859/1").

Unicode support is clearly intended.  Also clearly, ASCII support is intended.
However, the Report does not say what the concrete syntax of a Unicode character
should be. (At least I have been unable to find it from the report.)

>>
>> It is not clear that (b) is all that "not particularly meaningful".
>> Have a look at the production
>> <symbol>: it excludes double quote(") and apostrophe (') from uniSymbol.
>
>
> The notion of "symbol with certain lexicals that have other meanings *that
> are specified elsewhere in the report*" is not precise enough?  It may be
> difficult to characterize things with your required precision, since every
> general statement will necessarily have to carry part or potentially all of
> the entire Report within it if it is not sufficient to use the statement's
> context (as describing some part of the Report).

Well, I hope nobody is suggesting that it is unreasonable to require precision
of a language definition -- especially of Haskell! :-)

A problem with "use the statement's context" is that the contexts themselves
are not unquestionably unambiguous -- which is part of the reason we are having
this conversation in the first place.

That being said, I am not sure how the passage you quote applies here
or conclusively answers the original questions.  Where else is punctuation
defined in the Report?  What is the concrete syntax of a punctuation?  If you
were going to write a lexer and a parser for Haskell, how would you recognize
a character as a punctuation?

-- Gaby

Re: What is a punctuation character?

Brandon Allbery
On Fri, Mar 16, 2012 at 15:20, Gabriel Dos Reis <[hidden email]> wrote:
> I believe this part has seen very little change from the Revised
> Haskell 98 Report.

I was in fact looking at the Haskell 98 report at the time.
 
> It is not clear that it is an unintended leftover.  Section 2.1 that

Nothing is ever clear.  This useless pedanticism being stipulated, there is no purpose to a completely overlapping category unless it is intended to relate to an earlier standard (say Haskell 1.4).

> Unicode support is clearly intended.  Also clearly, ASCII support is intended.
> However, the Report does not say what the concrete syntax of a Unicode character
> should be. (At least I have been unable to find it from the report.)

Maybe what needs to be pedantically specified is that the link to the Unicode standard is intended to be inclusion of that standard by reference (the [11] in the section I quoted is an endnote referencing the Unicode standard) and not merely informational.  Or are you insisting we are not precise enough unless we enumerate all the Unicode characters explicitly in the Haskell standard?

--
brandon s allbery                                      [hidden email]
wandering unix systems administrator (available)     (412) 475-9364 vm/sms


Re: What is a punctuation character?

Gabriel Dos Reis
On Fri, Mar 16, 2012 at 3:22 PM, Brandon Allbery <[hidden email]> wrote:

> On Fri, Mar 16, 2012 at 15:20, Gabriel Dos Reis
> <[hidden email]> wrote:
>>
>> I believe this part has seen very little change from the Revised
>> Haskell 98 Report.
>
>
> I was in fact looking at the Haskell 98 report at the time.
>
>>
>> It is not clear that it is an unintended leftover.  Section 2.1 that
>
>
> Nothing is ever clear.  This useless pedanticism being stipulated, there is

I very much appreciate any clarification you have on the topic.  However, I
believe we do best when we leave phrases like "useless pedanticism"
or "pedantically" out.  They are rarely constructive and add no substance to an
otherwise informative discussion.  At best, they would distract us.

(In matter of programming language definition, "pedanticism" should be the
least of our worries -- and it probably should not come with a modifier
such as "useless", we should probably wear it as badge of honor.)

> no purpose to a completely overlapping category unless it is intended to
> relate to an earlier standard (say Haskell 1.4).

which in itself is not an unambiguous interpretation :-)

>>
>> Unicode support is clearly intended.  Also clearly, ASCII support is
>> intended.
>> However, the Report does not say what the concrete syntax of a Unicode
>> character
>> should be. (At least I have been unable to find it from the report.)
>
>
> Maybe what needs to be pedantically specified is that the link to the
> Unicode standard is intended to be inclusion of that standard by reference
> (the [11] in the section I quoted is an endnote referencing the Unicode
> standard) and not merely informational.  Or are you insisting we are not
> precise enough unless we enumerate all the Unicode characters explicitly in
> the Haskell standard?

Giving a link to the Unicode standard does not really help with the
original questions.  I know where to find the Unicode standard; that
wasn't the issue.

One of the underlying questions is: what is the concrete syntax of a
Unicode character in a Haskell program?  Note that Chapter 2 goes to
great pains to specify the ASCII concrete syntax.

To put things in perspective, have a look at this specification of
programs supposed to be written using Unicode characters.

   http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.2

-- Gaby

Re: What is a punctuation character?

Malcolm Wallace-2
>> no purpose to a completely overlapping category unless it is intended to
>> relate to an earlier standard (say Haskell 1.4).

I believe all Haskell Reports, even since 1.0, have specified that the language "uses" Unicode.  If it helps to bring perspective to this discussion, it is my impression that the initial designers of Haskell did not know very much about Unicode, but wanted to avoid the trap of being stuck with ASCII-only, and so decided to reference "whatever Unicode does", as the most obvious and unambiguous way of not having to think about (or specify) these lexical issues themselves.

> One of the underlying questions is: what is the concrete syntax of a
> Unicode character in a Haskell program?  Note that Chapter 2 goes to a great pain to
> specify the ASCII concrete syntax.

In my view, the Haskell Report is deliberately agnostic on concrete syntax for Unicode, believing that to be outside the scope of a programming language standard, whilst entirely within the scope of the Unicode standards body.  Seeing as there are (in practice) numerous concrete representations of Unicode (UTF-8 and other encodings), it is largely up to individual compiler implementations which encodings they support for (a) source text, and (b) input/output at runtime.

Regards,
    Malcolm

Re: What is a punctuation character?

Gabriel Dos Reis
On Fri, Mar 16, 2012 at 6:00 PM, Malcolm Wallace <[hidden email]> wrote:
>>> no purpose to a completely overlapping category unless it is intended to
>>> relate to an earlier standard (say Haskell 1.4).
>
> I believe all Haskell Reports, even since 1.0, have specified that the language "uses" Unicode.  If it helps to bring perspective to this discussion, it is my impression that the initial designers of Haskell did not know very much about Unicode, but wanted to avoid the trap of being stuck with ASCII-only, and so decided to reference "whatever Unicode does", as the most obvious and unambiguous way of not having to think about (or specify) these lexical issues themselves.
>

OK.

>> One of the underlying questions is: what is the concrete syntax of a
>> Unicode character in a Haskell program?  Note that Chapter 2 goes to a great pain to
>> specify the ASCII concrete syntax.
>
> In my view, the Haskell Report is deliberately agnostic on concrete syntax for Unicode, believing that to be outside the scope of a programming language standard, whilst entirely within the scope of the Unicode standards body.

The trouble is that the Unicode standards body believes that the concrete
syntax is entirely within the scope of the programming language definition
(or any client using Unicode characters), whilst largely restricting itself
to talking about code points, which are more abstract.  So, the trick of
referencing the Unicode standards is not satisfactory :-(

> Seeing as there are (in practice) numerous concrete representations of Unicode (UTF-8 and other encodings), it is largely up to individual compiler implementations which encodings they support for (a) source text, and (b) input/output at runtime.

OK, thanks!  I guess a takeaway from this discussion is that what
is a punctuation is far less well defined than it appears...

A common practice (exemplified by the link I gave earlier) is to restrict the
concrete -syntax- of the input program to the ASCII charset, and use Unicode
escape sequences to include the entire Unicode charset.  It is common to use
\uNNNNNN or \UNNNNNN to introduce Unicode characters, but I suspect that is
out of the question for Haskell programs because it would clash with
lambda abstraction.
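
The clash can be seen in a small Haskell sketch. A backslash outside a literal already starts a lambda, so text shaped like a Java-style Unicode escape parses as a lambda whose parameter happens to be named u0041. The names addOne and capitalA are hypothetical and only serve the illustration.

```haskell
-- `\u0041 -> ...` is a lambda binding the identifier `u0041`;
-- it is not an escape sequence for U+0041 ('A').
addOne :: Int -> Int
addOne = \u0041 -> u0041 + 1

-- Inside character and string literals, Haskell does provide numeric
-- escapes, here the hexadecimal form for U+0041.
capitalA :: Char
capitalA = '\x41'
```

Numeric escapes therefore exist only inside literals; identifiers and operators have to contain the code points themselves.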

-- Gaby

Re: What is a punctuation character?

Ian Lynagh

Hi Gaby,

On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>
> OK, thanks!  I guess a take away from this discussion is that what
> is a punctuation is far less well defined than it appears...

I'm not really sure what you're asking. Haskell's uniSymbol includes all
Unicode characters (should that be codepoints? I'm not a Unicode expert)
in the punctuation category; I'm not sure what the best reference is,
but e.g. table 12 in
    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
lists a number of Px categories, and a meta-category P "Punctuation".
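
As a rough illustration of that meta-category, the following Haskell sketch tests membership in the seven P* general categories using Data.Char from base. The names punctuationCategories and isUnicodePunctuation are hypothetical, and the result depends on the Unicode version that the installed base library was generated from.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- The general categories grouped under the meta-category P "Punctuation".
punctuationCategories :: [GeneralCategory]
punctuationCategories =
  [ ConnectorPunctuation -- Pc
  , DashPunctuation      -- Pd
  , OpenPunctuation      -- Ps
  , ClosePunctuation     -- Pe
  , InitialQuote         -- Pi
  , FinalQuote           -- Pf
  , OtherPunctuation     -- Po
  ]

-- True when the code point's general category is one of the P* categories.
isUnicodePunctuation :: Char -> Bool
isUnicodePunctuation c = generalCategory c `elem` punctuationCategories
```

Data.Char also exports isPunctuation, which is documented as selecting the same set of categories; the explicit list is spelled out here only to show what "punctuation" would mean under this interpretation.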


Thanks
Ian


Re: What is a punctuation character?

Iavor Diatchki
Hello,
I am also not an expert but I got curious and did a bit of Wikipedia
reading.  Based on what I understood, here are two (related) questions
that it might be nice to clarify in a future version of the report:

1. What is the alphabet used by the grammar in the Haskell report?  My
understanding is that the intention is that the alphabet is unicode
codepoints (sometimes referred to as unicode characters).  There is no
way to refer to specific code-points by escaping as in Java (the link
that Gaby shared), you just have to write the code-points directly
(and there are plenty of encodings for doing that, e.g. UTF-8 etc.)

2. Do we respect "Unicode equivalence"
(http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
code?  The issue here is that, apparently, some sequences of Unicode
code points/characters are supposed to be morally the same.  For
example, it would appear that there are two different ways to write
the Spanish letter ñ: it has its own number, but it can also be made
by writing "n" followed by a modifier to put the wavy sign on top.

I would guess that implementing "unicode equivalence"  would not be
too hard---supposedly the unicode standard specifies a "text
normalization procedure".  However, this would complicate the report
specification, because now the alphabet becomes not just unicode
code-points, but equivalence classes of code points.
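
The ñ example can be written down directly in Haskell. This is only a sketch of what comparing code point sequences means; it performs no normalization, and the names precomposed, decomposed and sameCodePoints are hypothetical.

```haskell
-- Precomposed form: the single code point U+00F1
-- (LATIN SMALL LETTER N WITH TILDE).
precomposed :: String
precomposed = "\x00F1"

-- Decomposed form: U+006E ('n') followed by U+0303 (COMBINING TILDE).
decomposed :: String
decomposed = "n\x0303"

-- String equality compares code points one by one, so the two spellings
-- are different values even though they are canonically equivalent.
sameCodePoints :: Bool
sameCodePoints = precomposed == decomposed
```

Here sameCodePoints is False, and the two strings have lengths 1 and 2. Respecting canonical equivalence would require normalizing both sides before comparing, which base does not provide.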

Thoughts?

-Iavor






On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh <[hidden email]> wrote:

>
> Hi Gaby,
>
> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>
>> OK, thanks!  I guess a take away from this discussion is that what
>> is a punctuation is far less well defined than it appears...
>
> I'm not really sure what you're asking. Haskell's uniSymbol includes all
> Unicode characters (should that be codepoints? I'm not a Unicode expert)
> in the punctuation category; I'm not sure what the best reference is,
> but e.g. table 12 in
>    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
> lists a number of Px categories, and a meta-category P "Punctuation".
>
>
> Thanks
> Ian
>
>

Re: What is a punctuation character?

Gabriel Dos Reis
In reply to this post by Ian Lynagh
On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh <[hidden email]> wrote:

> Hi Gaby,
>
> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>
>> OK, thanks!  I guess a take away from this discussion is that what
>> is a punctuation is far less well defined than it appears...
>
> I'm not really sure what you're asking. Haskell's uniSymbol includes all
> Unicode characters (should that be codepoints? I'm not a Unicode expert)
> in the punctuation category; I'm not sure what the best reference is,
> but e.g. table 12 in
>    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
> lists a number of Px categories, and a meta-category P "Punctuation".
>
>
> Thanks
> Ian
>

Hi Ian,

I guess what I am asking was partly summarized in Iavor's message.

For me, the issue started with bullet number 4 in section 1.1

     http://www.haskell.org/onlinereport/intro.html#sect1.1

which states that:

       The lexical structure captures the concrete representation
       of Haskell programs in text files.

That combined with the opening section 2.1 (e.g. example of terminal syntax)
and the fact that the grammar routinely describes two non-terminals,
ascXXX (for ASCII characters) and uniXXX (for Unicode characters),
suggested that the concrete syntax of Haskell programs in text files
is in the ASCII charset.  Note this does not conflict with the
general statement that Haskell programs use the Unicode character set,
because the uniXXX could use the ASCII charset to introduce Unicode
characters -- this is not uncommon practice for programming languages
using Unicode characters; see the link I gave earlier.

However, if I understand Malcolm's message correctly, this is not the case.
Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
representation of Haskell programs in text files.  What it does is to capture
the structure of what is obtained from interpreting, *in some unspecified
encoding or unspecified alphabet*,  the concrete representation of Haskell
programs in text files.  This conclusion is unfortunate, but I believe
it is correct.
Since the encoding or the alphabet is unspecified, it is no longer necessarily
the case that two Haskell implementations would agree on the same lexical
interpretation when presented with the same exact text file containing
 a Haskell program.

In its current form, you are correct that the Report should say "codepoint"
instead of characters.

I join Iavor's request in clarifying the alphabet used in the grammar.

Thanks,

-- Gaby

RE: What is a punctuation character?

Simon Marlow
> On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh <[hidden email]> wrote:
> > Hi Gaby,
> >
> > On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
> >>
> >> OK, thanks!  I guess a take away from this discussion is that what is
> >> a punctuation is far less well defined than it appears...
> >
> > I'm not really sure what you're asking. Haskell's uniSymbol includes
> > all Unicode characters (should that be codepoints? I'm not a Unicode
> > expert) in the punctuation category; I'm not sure what the best
> > reference is, but e.g. table 12 in
> >    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
> > lists a number of Px categories, and a meta-category P "Punctuation".
> >
> >
> > Thanks
> > Ian
> >
>
> Hi Ian,
>
> I guess what I am asking was partly summarized in Iavor's message.
>
> For me, the issue started with bullet number 4 in section 1.1
>
>      http://www.haskell.org/onlinereport/intro.html#sect1.1
>
> which states that:
>
>        The lexical structure captures the concrete representation
>        of Haskell programs in text files.
>
> That combined with the opening section 2.1 (e.g. example of terminal
> syntax) and the fact that the grammar  routinely described two non-
> terminals ascXXX (for ASCII characters) and uniXXX for (Unicode character)
> suggested that the concrete syntax of Haskell programs in text files is in
> ASCII charset.  Note this does not conflict with the general statement
> that Haskell programs use the Unicode character because the uniXXX could
> use the ASCII charset to introduce Unicode characters -- this is not
> uncommon practice for programming languages using Unicode characters; see
> the link I gave earlier.
>
> However, if I understand Malcolm's message correctly, this is not the
> case.
> Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
> representation of Haskell programs in text files.  What it does is to
> capture the structure of what is obtained from interpreting, *in some
> unspecified encoding or unspecified alphabet*,  the concrete
> representation of Haskell programs in text files.  This conclusion is
> unfortunate, but I believe it is correct.
> Since the encoding or the alphabet is unspecified, it is no longer
> necessarily the case that two Haskell implementations would agree on the
> same lexical interpretation when presented with the same exact text file
> containing  a Haskell program.
>
> In its current form, you are correct that the Report should say
> "codepoint"
> instead of characters.
>
> I join Iavor's request in clarifying the alphabet used in the grammar.

The report gives meaning to a sequence of codepoints only; it says nothing about how that sequence of codepoints is represented as a string of bytes in a file, nor does it say anything about what those files are called, or even whether there are files at all.
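
The split between bytes and codepoints can be sketched with System.IO from base: the implementation, not the Report, picks the encoding that turns a file's bytes into the codepoint sequence the lexical grammar then sees. The helper name readSourceAs is hypothetical, and this is not a description of how GHC or any other compiler actually reads source files.

```haskell
import System.IO

-- Decode a file into a String (a sequence of codepoints) using an
-- explicitly chosen encoding rather than the locale default.
readSourceAs :: TextEncoding -> FilePath -> IO String
readSourceAs enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  hGetContents h -- lazy read; the handle is closed once the input is consumed
```

Calling readSourceAs utf8 and readSourceAs latin1 on the same file can produce different codepoint sequences whenever the file contains non-ASCII bytes, so two implementations that choose different encodings may lex the same file differently.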

Perhaps some clarification is in order in a future revision, and we should use the correct terminology where appropriate.  We should also clarify that "punctuation" means exactly the Punctuation class.

With regards to normalisation and equivalence, my understanding is that Haskell does not support either: two identifiers are equal if and only if they are represented by the same sequence of codepoints.  Again, we could add a clarifying sentence to the report.

Cheers,
        Simon



Re: What is a punctuation character?

Gabriel Dos Reis
On Mon, Mar 19, 2012 at 4:34 AM, Simon Marlow <[hidden email]> wrote:

>> On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh <[hidden email]> wrote:
>> > Hi Gaby,
>> >
>> > On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>> >>
>> >> OK, thanks!  I guess a take away from this discussion is that what is
>> >> a punctuation is far less well defined than it appears...
>> >
>> > I'm not really sure what you're asking. Haskell's uniSymbol includes
>> > all Unicode characters (should that be codepoints? I'm not a Unicode
>> > expert) in the punctuation category; I'm not sure what the best
>> > reference is, but e.g. table 12 in
>> >    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> > lists a number of Px categories, and a meta-category P "Punctuation".
>> >
>> >
>> > Thanks
>> > Ian
>> >
>>
>> Hi Ian,
>>
>> I guess what I am asking was partly summarized in Iavor's message.
>>
>> For me, the issue started with bullet number 4 in section 1.1
>>
>>      http://www.haskell.org/onlinereport/intro.html#sect1.1
>>
>> which states that:
>>
>>        The lexical structure captures the concrete representation
>>        of Haskell programs in text files.
>>
>> That combined with the opening section 2.1 (e.g. example of terminal
>> syntax) and the fact that the grammar  routinely described two non-
>> terminals ascXXX (for ASCII characters) and uniXXX for (Unicode character)
>> suggested that the concrete syntax of Haskell programs in text files is in
>> ASCII charset.  Note this does not conflict with the general statement
>> that Haskell programs use the Unicode character because the uniXXX could
>> use the ASCII charset to introduce Unicode characters -- this is not
>> uncommon practice for programming languages using Unicode characters; see
>> the link I gave earlier.
>>
>> However, if I understand Malcolm's message correctly, this is not the
>> case.
>> Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
>> representation of Haskell programs in text files.  What it does is to
>> capture the structure of what is obtained from interpreting, *in some
>> unspecified encoding or unspecified alphabet*,  the concrete
>> representation of Haskell programs in text files.  This conclusion is
>> unfortunate, but I believe it is correct.
>> Since the encoding or the alphabet is unspecified, it is no longer
>> necessarily the case that two Haskell implementations would agree on the
>> same lexical interpretation when presented with the same exact text file
>> containing  a Haskell program.
>>
>> In its current form, you are correct that the Report should say
>> "codepoint"
>> instead of characters.
>>
>> I join Iavor's request in clarifying the alphabet used in the grammar.
>
> The report gives meaning to a sequence of codepoints only, it says nothing about how that sequence of codepoints is represented as a string of bytes in a file, nor does it say anything about what those files are called, or even whether there are files at all.

Thanks, Simon.

The fact that the Report is silent about the encoding used to
represent concrete Haskell programs in text files adds
a certain level of non-portability (and confusion).  I found
last night that a proposal has been made to add some
support for encoding specification

    http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource

I believe that is a good start.  What are the odds of it being considered
for Haskell 2012?  I suspect the pragma proposal works only if something
is said about the position of that pragma in the source file (e.g. it
must be the first line, or within the first N bytes of the source file);
otherwise we have an infinite descent.


>
> Perhaps some clarification is in order in a future revision, and we should use the correct terminology where appropriate.  We should also clarify that "punctuation" means exactly the Punctuation class.

That would be great.  Do you have any comment about the
UnicodeInHaskellSource proposal?

> With regards to normalisation and equivalence, my understanding is that Haskell does not support either: two identifiers are equal if and only if they are represented by the same sequence of codepoints.  Again, we could add a clarifying sentence to the report.
>

Ugh.

Writing a parser for Haskell was an interesting exercise :-)

-- Gaby

Re: What is a punctuation character?

Brandon Allbery
On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis <[hidden email]> wrote:
> The fact that the Report is silent about encoding used to
> represent concrete Haskell programs in text files adds
> a certain level of non-portability (and confusion.)  I found

Specifying the encoding can *also* limit portability, if you specify an encoding that is not widely supported on some target platform.  (Please try to remember that the universe is not composed solely of Windows and Linux.  The fact that those are the only ones you care about is not relevant to the standard; nor is the list of platforms that GHC or any other implementation supports.)

Encoding does not belong in the language standard; it is an aspect of implementing the language standard on a given platform.

--
brandon s allbery                                      [hidden email]
wandering unix systems administrator (available)     (412) 475-9364 vm/sms



Re: What is a punctuation character?

Gabriel Dos Reis
On Mon, Mar 19, 2012 at 5:36 AM, Brandon Allbery <[hidden email]> wrote:

> On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis
> <[hidden email]> wrote:
>>
>> The fact that the Report is silent about encoding used to
>> represent concrete Haskell programs in text files adds
>> a certain level of non-portability (and confusion.)  I found
>
>
> Specifying the encoding can *also* limit portability, if you specify an
> encoding that is not widely supported on some target platform.

That is why I find the pragma suggestion attractive.

-- Gaby


Re: What is a punctuation character?

Colin Paul Adams
In reply to this post by Iavor Diatchki


    Iavor> report?  My understanding is that the intention is that the
    Iavor> alphabet is unicode codepoints (sometimes referred to as
    Iavor> unicode characters).

Unicode characters are not the same as Unicode codepoints. What we want
is Unicode characters.

We don't want to be able to write a Unicode codepoint, as that would
permit writing half of a surrogate pair, which is malformed Unicode.
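
To make the surrogate concern concrete, here is a minimal sketch using Data.Char from base. Haskell's Char type can hold any code point, surrogates included, so the distinction has to be checked explicitly. The names isScalarValue and wellFormedSource are made up for this illustration and are not taken from the Report or from GHC.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)

-- | True for every code point outside the surrogate range
-- U+D800..U+DFFF.
isScalarValue :: Char -> Bool
isScalarValue c = generalCategory c /= Surrogate

-- | In this sketch, a source text is acceptable only if it contains
-- no surrogate code points at all.
wellFormedSource :: String -> Bool
wellFormedSource = all isScalarValue
```

For example, wellFormedSource "caf\xE9" is True, while wellFormedSource "\xD800" is False because the string holds a lone high surrogate.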
--
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments


Re: What is a punctuation character?

Iavor Diatchki
In reply to this post by Iavor Diatchki
Hello,

So I looked at what GHC does with Unicode and to me it seems quite
reasonable:

* The alphabet is Unicode code points, so a valid Haskell program is
simply a list of those.
* Combining characters are not allowed in identifiers, so no need for
complex normalization rules: programs should always use the "short"
version of a character, or be rejected.
* Combining characters may appear in string literals, and there they
are left "as is" without any modification (so some string literals may
be longer than what's displayed in a text editor.)
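
The points above can be illustrated with a small sketch using only Data.Char from base. It compares the two spellings of the letter ñ as plain lists of code points, without any normalization; the binding names are chosen for this example only.

```haskell
import Data.Char (isMark)

-- | The "short" spelling: U+00F1 LATIN SMALL LETTER N WITH TILDE.
precomposed :: String
precomposed = "\x00F1"

-- | The "long" spelling: 'n' followed by U+0303 COMBINING TILDE.
decomposed :: String
decomposed = "n\x0303"

-- | Equality on code-point sequences, with no normalization applied.
sameCodePoints :: String -> String -> Bool
sameCodePoints = (==)

-- | True if the string contains any combining mark (general
-- categories Mn, Mc or Me).
hasCombiningMark :: String -> Bool
hasCombiningMark = any isMark
```

Here length precomposed is 1 and length decomposed is 2, and sameCodePoints precomposed decomposed is False even though both usually render as the same glyph. That is the sense in which a string literal can be longer than what a text editor displays.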

Perhaps this is simply what the report already states (I haven't
checked, for which I apologize) but, if not, perhaps we should clarify
things.

-Iavor
PS:  I don't think that there is any need to specify a particular
representation for the unicode code-points (e.g., utf-8 etc.) in the
language standard.





On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki
<[hidden email]> wrote:

> Hello,
> I am also not an expert but I got curious and did a bit of Wikipedia
> reading.  Based on what I understood, here are two (related) questions
> that it might be nice to clarify in a future version of the report:
>
> 1. What is the alphabet used by the grammar in the Haskell report?  My
> understanding is that the intention is that the alphabet is unicode
> codepoints (sometimes referred to as unicode characters).  There is no
> way to refer to specific code-points by escaping as in Java (the link
> that Gaby shared), you just have to write the code-points directly
> (and there are plenty of encodings for doing that, e.g. UTF-8 etc.)
>
> 2. Do we respect "unicode equivalence"
> (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
> code?  The issue here is that, apparently, some sequences of unicode
> code points/characters are supposed to be morally the same.  For
> example, it would appear that there are two different ways to write
> the Spanish letter ñ: it has its own number, but it can also be made
> by writing "n" followed by a modifier to put the wavy sign on top.
>
> I would guess that implementing "unicode equivalence"  would not be
> too hard---supposedly the unicode standard specifies a "text
> normalization procedure".  However, this would complicate the report
> specification, because now the alphabet becomes not just unicode
> code-points, but equivalence classes of code points.
>
> Thoughts?
>
> -Iavor
>
>
>
>
>
>
> On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh <[hidden email]> wrote:
>>
>> Hi Gaby,
>>
>> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>>
>>> OK, thanks!  I guess a take away from this discussion is that what
>>> is a punctuation is far less well defined than it appears...
>>
>> I'm not really sure what you're asking. Haskell's uniSymbol includes all
>> Unicode characters (should that be codepoints? I'm not a Unicode expert)
>> in the punctuation category; I'm not sure what the best reference is,
>> but e.g. table 12 in
>>    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> lists a number of Px categories, and a meta-category P "Punctuation".
>>
>>
>> Thanks
>> Ian


Re: What is a punctuation character?

Gabriel Dos Reis
On Tue, Mar 20, 2012 at 5:37 PM, Iavor Diatchki
<[hidden email]> wrote:

> Hello,
>
> So I looked at what GHC does with Unicode and to me it seems quite
> reasonable:
>
> * The alphabet is Unicode code points, so a valid Haskell program is
> simply a list of those.
> * Combining characters are not allowed in identifiers, so no need for
> complex normalization rules: programs should always use the "short"
> version of a character, or be rejected.
> * Combining characters may appear in string literals, and there they
> are left "as is" without any modification (so some string literals may
> be longer than what's displayed in a text editor.)
>
> Perhaps this is simply what the report already states (I haven't
> checked, for which I apologize) but, if not, perhaps we should clarify
> things.
>
> -Iavor
> PS:  I don't think that there is any need to specify a particular
> representation for the unicode code-points (e.g., utf-8 etc.) in the
> language standard.

Thanks Iavor.

If the Report intended to talk about code points only (and indeed ruling
out normalization suggests that), then the Report needs to be
clarified.  As you know, there is a distinction between a Unicode code
point and a Unicode character

    http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G25564

Until I sent my original query, I had been reading the Report as meaning
Unicode characters (as the grammar seemed to suggest), but now it is
clear to me that only code points were intended.  That seemed to be
confirmed by your investigation of the GHC code base.

-- Gaby
