Encoding of Haskell source files

Re: Encoding of Haskell source files

Yitzchak Gale
malcolm.wallace wrote:
>> BOM is not part of UTF8, because UTF8 is byte-oriented.  But applications
>> should be prepared to read and discard it, because some applications
>> erroneously generate it.

For maximum portability, the standard should require compilers
to accept and discard an optional BOM as the first character of a
source code file.
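
As a rough illustration of "accept and discard an optional BOM", here is a minimal sketch of how a tool might read source bytes. It is written in Python only for brevity; the function name `decode_source` is hypothetical and is not taken from any compiler.

```python
import codecs

def decode_source(raw: bytes) -> str:
    """Decode UTF-8 source bytes, discarding one optional leading BOM."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF (U+FEFF encoded as UTF-8).
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError.
    return raw.decode("utf-8")

with_bom = codecs.BOM_UTF8 + b'main = putStrLn "hi"\n'
without_bom = b'main = putStrLn "hi"\n'
print(decode_source(with_bom) == decode_source(without_bom))  # prints True
```

Only a BOM at the very start is dropped; a U+FEFF appearing later in the file is left for the lexer to deal with.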

Tako Schotanus wrote:
> That's not what the official unicode site says in its FAQ:
http://unicode.org/faq/utf_bom.html#bom4 and http://unicode.org/faq/utf_bom.html#bom5

That FAQ clearly states that BOM is part of some "protocols".
It carefully avoids stating whether it is part of the encoding.

It is certainly not erroneous to include the BOM
if it is part of the protocol for the applications being used.
Applications can include whatever characters they'd like, and
they can use whatever handshake mechanism they'd like to
agree upon an encoding. The BOM mechanism is common
on the Windows platform. It has since appeared in other
places as well, but it is certainly not universally adopted.

Python supports a pseudo-encoding called "utf-8-sig" that
automatically generates and discards the BOM in support
of that handshake mechanism. But it isn't really an encoding,
it's a convenience.
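
For reference, the codec in Python's standard library that behaves this way is registered under the name "utf-8-sig". The short sketch below shows the behaviour described above; the sample text is made up for illustration.

```python
text = "module Main where\n"

# Encoding with "utf-8-sig" prepends the three BOM bytes EF BB BF.
with_bom = text.encode("utf-8-sig")
plain = text.encode("utf-8")
assert with_bom == b"\xef\xbb\xbf" + plain

# Decoding with "utf-8-sig" discards a leading BOM if one is present,
# and also accepts input that has none.
assert with_bom.decode("utf-8-sig") == text
assert plain.decode("utf-8-sig") == text

# The plain "utf-8" codec does not discard it: U+FEFF stays in the text.
assert with_bom.decode("utf-8") == "\ufeff" + text
```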

Part of the source of all this confusion is some documentation
that appeared in the past on Microsoft's site which was unclear
about the fact that the BOM handshake is a protocol adopted
by Microsoft, not a part of the encoding itself. Some people
claim that this was intentional, part of the "extend and embrace"
tactic Microsoft allegedly employed in those days in an effort
to expand its monopoly.

The wording of the Unicode FAQ is obviously trying to tip-toe
diplomatically around this issue without arousing the ire of
either pro-Microsoft or anti-Microsoft developers.

Thanks,
Yitz

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Encoding of Haskell source files

Roel van Dijk-3
I made an official proposal on the haskell-prime list:

http://www.haskell.org/pipermail/haskell-prime/2011-April/003368.html

Let's have further discussion there.


Re: Encoding of Haskell source files

Tako Schotanus
In reply to this post by Yitzchak Gale


On Mon, Apr 4, 2011 at 17:51, Yitzchak Gale <[hidden email]> wrote:
> [...]
> Part of the source of all this confusion is some documentation
> that appeared in the past on Microsoft's site which was unclear
> about the fact that the BOM handshake is a protocol adopted
> by Microsoft, not a part of the encoding itself. Some people
> claim that this was intentional, part of the "extend and embrace"
> tactic Microsoft allegedly employed in those days in an effort
> to expand its monopoly.
>
> The wording of the Unicode FAQ is obviously trying to tip-toe
> diplomatically around this issue without arousing the ire of
> either pro-Microsoft or anti-Microsoft developers.

Some reliable sources for all this would be entertaining (although irrelevant for the rest of this discussion).

Cheers,
 -Tako
 


Re: Encoding of Haskell source files

Richard A. O'Keefe
In reply to this post by Daniel Fischer

On 4/04/2011, at 10:24 PM, Daniel Fischer wrote:
> Colin spoke of *leading* characters, for .hs files, that drastically
> reduces the possibilities - not for .lhs, though.

A .hs file can, amongst other things, begin with any "small" letter.




Re: Encoding of Haskell source files

Mark Lentczner-2
In reply to this post by Roel van Dijk-3
On Mon, Apr 4, 2011 at 3:52 PM, Roel van Dijk <[hidden email]> wrote:
> I made an official proposal on the haskell-prime list:
>
> http://www.haskell.org/pipermail/haskell-prime/2011-April/003368.html
>
> Let's have further discussion there.

I'm not on that mailing list, so I'll comment here:

My only caveat is that the encoding provision should apply when Haskell source is presented to the compiler as a bare stream of octets. Where Haskell source is interchanged as a stream of Unicode characters, the encoding is not relevant: it is likely governed by some outer protocol, and hence may not be UTF-8, but it is nonetheless invisible at the Haskell level.

Two examples where this might come into play are:

1) An IDE that stores module source in some database. It would not be relevant what encoding that IDE and database choose to store the source in if the source is presented to the integrated compiler as Unicode characters.

2) If a compilation system fetches module source via HTTP (I could imagine a compiler that chased down included modules directly off of Hackage, say), then HTTP already has a mechanism (via MIME types) of transmitting the encoding clearly. As such, there should be no problem if that outer protocol (HTTP) transmits the source to the compiler via some other encoding. There is no reason (and only potential interoperability restrictions) to enforce that UTF-8 be the only legal encoding here.
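
To make the HTTP case concrete, here is a minimal sketch of a fetcher that lets the outer protocol decide the encoding. It is in Python for illustration only. The function names, the `text/x-haskell` media type, and the fallback to UTF-8 when no charset is declared are all assumptions made for this example, not something the proposal specifies.

```python
from email.message import Message

def charset_from_content_type(header: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header
    # get_content_charset returns the lowercased charset parameter,
    # or the supplied default when the header declares none.
    return msg.get_content_charset(default)

def decode_fetched_source(body: bytes, content_type: str) -> str:
    """Turn a fetched response body into Unicode text for the compiler."""
    return body.decode(charset_from_content_type(content_type))

# A server sending Latin-1 bytes and saying so in the header.
body = 'name = "café"\n'.encode("latin-1")
source = decode_fetched_source(body, "text/x-haskell; charset=ISO-8859-1")
```

After decoding, the compiler only ever sees Unicode characters, so the transport encoding stays invisible at the Haskell level.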


Re: Encoding of Haskell source files

Herbert Valerio Riedel
In reply to this post by Roel van Dijk-3
On Mon, 2011-04-04 at 11:50 +0200, Roel van Dijk wrote:
> I am not aware of any algorithm that can reliably infer the character
> encoding used by just looking at the raw data. Why would people bother
> with stuff like <?xml version="1.0" encoding="UTF-8"?> if
> automatically figuring out the encoding was easy?

It is possible, if the syntax/grammar of the encoded content restricts
the set of allowed code-points in the first few characters.

For instance, valid JSON (see RFC 4627 section 3) requires the first two
characters to be plain "ASCII" code-points, thus which of the 5 BOM-less
UTF-encodings is used is uniquely determined by inspecting the first 4
bytes of the UTF encoded stream.
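
The detection rule can be sketched as follows, using the table of NUL-byte patterns from RFC 4627 section 3. The code is Python for illustration, and the function name `detect_json_encoding` is hypothetical. It assumes the input is a valid BOM-less JSON text in the RFC 4627 sense, i.e. an object or array, so the first two characters are ASCII.

```python
import json

# Which of the first four bytes are NUL, mapped to the encoding.
_NUL_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (False, True, False, True): "utf-16-le",  # xx 00 xx 00
}

def detect_json_encoding(data: bytes) -> str:
    """Guess the UTF encoding of a BOM-less JSON text from its first 4 bytes."""
    head = data[:4]
    if len(head) < 4:
        # Two ASCII characters already take 4 bytes in UTF-16 and 8 in
        # UTF-32, so anything shorter can only be UTF-8.
        return "utf-8"
    return _NUL_PATTERNS.get(tuple(b == 0 for b in head), "utf-8")

for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    raw = '{"k": 1}'.encode(enc)
    guessed = detect_json_encoding(raw)
    print(enc, guessed, json.loads(raw.decode(guessed)))
```

The same trick does not carry over directly to Haskell source unless the grammar constrains the leading characters in a similar way.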




Re: Encoding of Haskell source files

Roel van Dijk-3
In reply to this post by Mark Lentczner-2
On 5 April 2011 07:04, Mark Lentczner <[hidden email]> wrote:
> I'm not on that mailing list, so I'll comment here:

I recommend joining the prime list. It is very low traffic and the
place where language changes should be discussed.

> My only caveat is that the encoding provision should apply when Haskell
> source is presented to the compiler as a bare stream of octets. Where
> Haskell source is interchanged as a stream of Unicode characters, then
> encoding is not relevant -- but may be likely governed by some outer
> protocol - and hence may not be UTF-8 but nonetheless invisible at the
> Haskell level.

My intention is that every time you need an encoding for Haskell
sources, it must be UTF-8. At least if you want to call it Haskell.
This is not limited to compilers but concerns all tools that process
Haskell sources.

> Two examples where this might come into play are:
> 1) An IDE that stores module source in some database. It would not be
> relevant what encoding that IDE and database choose to store the source in
> if the source is presented to the integrated compiler as Unicode characters.

An IDE and database are free to store sources any way they see fit.
But as soon as you want to exchange that source with some
standards-conforming system it must be encoded as UTF-8.

> 2) If a compilation system fetches module source via HTTP (I could imagine a
> compiler that chased down included modules directly off of Hackage, say),
> then HTTP already has a mechanism (via MIME types) of transmitting the
> encoding clearly. As such, there should be no problem if that outer protocol
> (HTTP) transmits the source to the compiler via some other encoding. There
> is no reason (and only potential interoperability restrictions) to enforce
> that UTF-8 be the only legal encoding here.

This is an interesting example. What distinguishes this scenario from
others is that there is a clear understanding between two parties
(client and server) how a file should be interpreted. I could word my
proposal in such a way that it only concerns situations where such a
prior agreement doesn't or can't exist. For example, when storing
source on a file system.


Re: Encoding of Haskell source files

Daniel Fischer
In reply to this post by Richard A. O'Keefe
On Tuesday 05 April 2011 04:35:39, Richard O'Keefe wrote:
> On 4/04/2011, at 10:24 PM, Daniel Fischer wrote:
> > Colin spoke of *leading* characters, for .hs files, that drastically
> > reduces the possibilities - not for .lhs, though.
>
> A .hs file can, amongst other things, begin with any "small" letter.

D'oh, yes, I always forget that a module declaration isn't required.


Re: Encoding of Haskell source files

Colin Adams-3


On 5 April 2011 10:35, Daniel Fischer <[hidden email]> wrote:
> On Tuesday 05 April 2011 04:35:39, Richard O'Keefe wrote:
> > On 4/04/2011, at 10:24 PM, Daniel Fischer wrote:
> > > Colin spoke of *leading* characters, for .hs files, that drastically
> > > reduces the possibilities - not for .lhs, though.
> >
> > A .hs file can, amongst other things, begin with any "small" letter.
>
> D'oh, yes, I always forget that a module declaration isn't required.

True, but we could say that UTF-8 is compulsory in the absence of a module declaration.

--
Colin Adams
Preston, Lancashire, ENGLAND
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
