Encoding of Haskell source files

Roel van Dijk-3
Hello,

The Haskell 2010 language specification states that "Haskell uses
the Unicode character set" [1]. I interpret this as saying that,
at the lowest level, a Haskell program is a sequence of Unicode
code points. The standard doesn't say how such a sequence should
be encoded. You can argue that the encoding of source files is
not part of the language. But I think it would be highly
practical to standardise on an encoding scheme.

Strictly speaking it is not possible to reliably exchange Haskell
source files on the byte level. If I download some package from
hackage I can't tell how the source files are encoded from just
looking at the files.

I propose a few solutions:

A - Choose a single encoding for all source files.

This is what GHC does: "GHC assumes that source files are ASCII or
UTF-8 only, other encodings are not recognised" [2]. UTF-8 seems like
a good candidate for such an encoding.

B - Specify encoding in the source files.

Start each source file with a special comment specifying the encoding
used in that file. See Python for an example of this mechanism in
practice [3]. It would be nice to use already existing facilities to
specify the encoding, for example:
{-# ENCODING <encoding name> #-}

An interesting idea in the Python PEP is to also allow a form
recognised by most text editors:
# -*- coding: <encoding name> -*-

C - Option B + Default encoding

Like B, but also choose a default encoding in case no specific
encoding is specified.
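
Option C can be sketched roughly as follows. This is only an illustration: the ENCODING pragma is hypothetical (no compiler recognises it), the names encodingPragma and sourceEncoding are made up for this example, and the sketch assumes the first line can be read as ASCII before the real encoding is known.

```haskell
import Data.Char (isSpace)
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe, listToMaybe)

-- | Hypothetical: extract the name from a line of the form
--   {-# ENCODING <encoding name> #-}. No compiler recognises this pragma.
encodingPragma :: String -> Maybe String
encodingPragma line = do
  rest <- stripPrefix "{-#" (dropWhile isSpace line)
  case words rest of
    ["ENCODING", name, "#-}"] -> Just name
    _                         -> Nothing

-- | Option C: use the encoding declared on the first line,
--   otherwise fall back to the default, UTF-8.
sourceEncoding :: String -> String
sourceEncoding src =
  fromMaybe "UTF-8" (listToMaybe (lines src) >>= encodingPragma)

main :: IO ()
main = do
  putStrLn (sourceEncoding "{-# ENCODING latin1 #-}\nmodule Main where\n")
  putStrLn (sourceEncoding "module Main where\n")
```

Option B alone would correspond to rejecting the file when encodingPragma returns Nothing instead of falling back to a default.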

I would further like to propose specifying the encoding of Haskell
source files in the language standard. Encoding of source files
belongs somewhere between a language specification and specific
implementations. But the language standard seems to be the most
practical place.

This is not an official proposal. I am just interested in what the
Haskell community has to say about this.

Regards,
Roel


[1] - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-150002.1
[2] - http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/separate-compilation.html#source-files
[3] - http://www.python.org/dev/peps/pep-0263/

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Encoding of Haskell source files

Colin Adams-3


2011/4/4 Roel van Dijk <[hidden email]>
> Hello,
>
> The Haskell 2010 language specification states that "Haskell uses
> the Unicode character set" [1]. I interpret this as saying that,
> at the lowest level, a Haskell program is a sequence of Unicode
> code points. The standard doesn't say how such a sequence should
> be encoded. You can argue that the encoding of source files is
> not part of the language. But I think it would be highly
> practical to standardise on an encoding scheme.
>
> Strictly speaking it is not possible to reliably exchange Haskell
> source files on the byte level. If I download some package from
> hackage I can't tell how the source files are encoded from just
> looking at the files.

Not from looking with your eyes perhaps. Does that matter? Your text editor, and the compiler, can surely figure it out for themselves. There aren't many Unicode encoding formats, and there aren't very many possibilities for the leading characters of a Haskell source file, are there?

Re: Encoding of Haskell source files

Roel van Dijk-3
2011/4/4 Colin Adams <[hidden email]>:
> Not from looking with your eyes perhaps. Does that matter? Your text editor,
> and the compiler, can surely figure it out for themselves.
I am not aware of any algorithm that can reliably infer the character
encoding used by just looking at the raw data. Why would people bother
with stuff like <?xml version="1.0" encoding="UTF-8"?> if
automatically figuring out the encoding was easy?

> There aren't many Unicode encoding formats
From casually scanning some articles about encodings I can count at
least 70 character encodings [1].

> and there aren't very many possibilities for the
> leading characters of a Haskell source file, are there?
Since a Haskell program is a sequence of Unicode code points the
programmer can choose from up to 1,112,064 characters. Many of these
can legitimately be part of the interface of a module, as function
names, operators or names of types.


[1] - http://en.wikipedia.org/wiki/Character_encoding

Re: Encoding of Haskell source files

Michael Snoyman
Firstly, I personally would love to insist on using UTF-8 and be done with it. I see no reason to bother with other character encodings.

2011/4/4 Roel van Dijk <[hidden email]>
> 2011/4/4 Colin Adams <[hidden email]>:
> > Not from looking with your eyes perhaps. Does that matter? Your text editor,
> > and the compiler, can surely figure it out for themselves.
> I am not aware of any algorithm that can reliably infer the character
> encoding used by just looking at the raw data. Why would people bother
> with stuff like <?xml version="1.0" encoding="UTF-8"?> if
> automatically figuring out the encoding was easy?

There *is* an algorithm for determining the encoding of an XML file based on a combination of the BOM (byte order mark) and an assumption that the file will start with an XML declaration (i.e., <?xml ... ?>). But this isn't capable of determining every possible encoding on the planet, just distinguishing amongst varieties of UTF-(8|16|32)/(big|little) endian and EBCDIC. It cannot tell the difference between UTF-8, Latin-1, and Windows-1255 (Hebrew), for example.
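
The BOM half of such a scheme can be sketched in a few lines. This is a minimal illustration, not the XML algorithm itself: the type Utf and the function sniffBom are names invented here, and the input is assumed to be the raw bytes of the file as a list of Word8. A file without a BOM yields Nothing, which is exactly the case where UTF-8, Latin-1 and Windows-1255 cannot be told apart.

```haskell
import Data.List (isPrefixOf)
import Data.Word (Word8)

data Utf = Utf8 | Utf16BE | Utf16LE | Utf32BE | Utf32LE
  deriving (Eq, Show)

-- | Identify a UTF encoding from a leading byte order mark, if any.
--   UTF-32LE is tested before UTF-16LE because its BOM (FF FE 00 00)
--   begins with the UTF-16LE BOM (FF FE).
sniffBom :: [Word8] -> Maybe Utf
sniffBom bytes
  | [0x00, 0x00, 0xFE, 0xFF] `isPrefixOf` bytes = Just Utf32BE
  | [0xFF, 0xFE, 0x00, 0x00] `isPrefixOf` bytes = Just Utf32LE
  | [0xEF, 0xBB, 0xBF]       `isPrefixOf` bytes = Just Utf8
  | [0xFE, 0xFF]             `isPrefixOf` bytes = Just Utf16BE
  | [0xFF, 0xFE]             `isPrefixOf` bytes = Just Utf16LE
  | otherwise                                   = Nothing

main :: IO ()
main = do
  print (sniffBom [0xEF, 0xBB, 0xBF, 0x6D])  -- UTF-8 BOM followed by 'm'
  print (sniffBom [0x6D, 0x6F, 0x64])        -- "mod" with no BOM
```

Note that the bytes FF FE 00 00 are ambiguous in principle (a UTF-16LE BOM followed by a NUL character); the sketch simply prefers UTF-32LE.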
 
> > There aren't many Unicode encoding formats
> From casually scanning some articles about encodings I can count at
> least 70 character encodings [1].

I think the implication of "Unicode encoding formats" is something in the UTF family. An encoding like Latin-1 or Windows-1255 can be losslessly translated into Unicode code points, but it is not an encoding of Unicode: it covers only a subset of Unicode.
 
> > and there aren't very many possibilities for the
> > leading characters of a Haskell source file, are there?
> Since a Haskell program is a sequence of Unicode code points the
> programmer can choose from up to 1,112,064 characters. Many of these
> can legitimately be part of the interface of a module, as function
> names, operators or names of types.

My guess is that a large subset of Haskell modules start with one of left brace (starting with comment or language pragma), m (for starting with module), or some whitespace character. So it *might* be feasible to take a guess at things. But as I said before: I like UTF-8. Is there anyone out there who has a compelling reason for writing their Haskell source in EBCDIC?

Michael
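
A guess based on the leading character could look something like the sketch below. It assumes that the file has no byte order mark and that its first character is ASCII (a left brace, an 'm' or whitespace, as suggested above); the names Guess and guessByNulls are invented for this illustration. The position of the zero bytes around that first character then reveals the UTF-16 and UTF-32 variants, while every 8-bit-compatible encoding collapses into one case.

```haskell
import Data.Word (Word8)

data Guess = Utf32BE | Utf32LE | Utf16BE | Utf16LE | EightBit
  deriving (Eq, Show)

-- | Guess the encoding family from the first four bytes, assuming the
--   file has no BOM and starts with an ASCII character.
--   EightBit covers UTF-8, Latin-1 and other ASCII-compatible
--   encodings, which this method cannot distinguish.
guessByNulls :: [Word8] -> Maybe Guess
guessByNulls bytes = case take 4 bytes of
  [0, 0, 0, c] | c /= 0 -> Just Utf32BE
  [c, 0, 0, 0] | c /= 0 -> Just Utf32LE
  (0 : c : _)  | c /= 0 -> Just Utf16BE
  (c : 0 : _)  | c /= 0 -> Just Utf16LE
  (c : _)      | c /= 0 -> Just EightBit
  _                     -> Nothing

main :: IO ()
main = do
  print (guessByNulls [0x6D, 0x00, 0x6F, 0x00])  -- "mo" in UTF-16LE
  print (guessByNulls [0x6D, 0x6F, 0x64, 0x75])  -- "modu" in UTF-8 or Latin-1
```

The EightBit case is where the guessing stops being useful, which is one argument for simply fixing UTF-8 as the only encoding.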

Re: Encoding of Haskell source files

Daniel Fischer
In reply to this post by Roel van Dijk-3
On Monday 04 April 2011 11:50:03, Roel van Dijk wrote:
> > and there aren't very many possibilities for the
> > leading characters of a Haskell source file, are there?
>
> Since a Haskell program is a sequence of Unicode code points the
> programmer can choose from up to 1,112,064 characters. Many of these
> can legitimately be part of the interface of a module, as function
> names, operators or names of types.

Colin spoke of *leading* characters, for .hs files, that drastically
reduces the possibilities - not for .lhs, though.

Re: Encoding of Haskell source files

Roel van Dijk-3
In reply to this post by Michael Snoyman
On 4 April 2011 12:22, Michael Snoyman <[hidden email]> wrote:
> Firstly, I personally would love to insist on using UTF-8 and be done with
> it. I see no reason to bother with other character encodings.

This is also my preferred choice.

> There *is* an algorithm for determining the encoding of an XML file based on
> a combination of the BOM (byte order mark) and an assumption that the file
> will start with an XML declaration (i.e., <?xml ... ?>). But this isn't
> capable of determining every possible encoding on the planet, just
> distinguishing amongst varieties of UTF-(8|16|32)/(big|little) endian and
> EBCDIC. It cannot tell the difference between UTF-8, Latin-1, and
> Windows-1255 (Hebrew), for example.

I think I was confused between character encodings in general and
Unicode encodings.

> I think the implication of "Unicode encoding formats" is something in the
> UTF family. An encoding like Latin-1 or Windows-1255 can be losslessly
> translated into Unicode codepoints, but it's not exactly an encoding of
> Unicode, but rather a subset of Unicode.

That would validate Colin's point about there not being that many encodings.

> My guess is that a large subset of Haskell modules start with one of left
> brace (starting with comment or language pragma), m (for starting with
> module), or some whitespace character. So it *might* be feasible to take a
> guess at things. But as I said before: I like UTF-8. Is there anyone out
> there who has a compelling reason for writing their Haskell source in
> EBCDIC?

I think I misinterpreted the word 'leading'. I thought Colin meant
"most used". The set of characters with which Haskell programmes start
is indeed small. But like you I prefer no guessing and just default to
UTF-8.

Re: Encoding of Haskell source files

Daniel Fischer
In reply to this post by Roel van Dijk-3
On Monday 04 April 2011 10:46:46, Roel van Dijk wrote:
> I propose a few solutions:
>
> A - Choose a single encoding for all source files.
>
> This is what GHC does: "GHC assumes that source files are ASCII or
> UTF-8 only, other encodings are not recognised" [2]. UTF-8 seems like
> a good candidate for such an encoding.

If there's only a single encoding recognised, UTF-8 surely should be the
one (though perhaps Windows users might disagree, iirc, Windows uses UCS2
as standard encoding).

>
> B - Specify encoding in the source files.
>
> Start each source file with a special comment specifying the encoding
> used in that file. See Python for an example of this mechanism in
> practice [3]. It would be nice to use already existing facilities to
> specify the encoding, for example:
> {-# ENCODING <encoding name> #-}
>
> An interesting idea in the Python PEP is to also allow a form
> recognised by most text editors:
> # -*- coding: <encoding name> -*-
>
> C - Option B + Default encoding
>
> Like B, but also choose a default encoding in case no specific
> encoding is specified.

default = UTF-8
Laziness makes me prefer that over B.

>
> I would further like to propose specifying the encoding of Haskell
> source files in the language standard. Encoding of source files
> belongs somewhere between a language specification and specific
> implementations. But the language standard seems to be the most
> practical place.

I'd agree.

>
> This is not an official proposal. I am just interested in what the
> Haskell community has to say about this.
>
> Regards,
> Roel

Re: Encoding of Haskell source files

Felipe Lessa
In reply to this post by Roel van Dijk-3
2011/4/4 Roel van Dijk <[hidden email]>:
> On 4 April 2011 12:22, Michael Snoyman <[hidden email]> wrote:
>> Firstly, I personally would love to insist on using UTF-8 and be done with
>> it. I see no reason to bother with other character encodings.
>
> This is also my preferred choice.

+1

I'm also in favor of sticking with UTF-8 and being done with it. All
of Hackage *today* is UTF-8 (ASCII included), so why open a can of worms?
Also, this means that we would be standardizing the current practice.

Cheers,

--
Felipe.

Re: Encoding of Haskell source files

Max Bolingbroke-2
In reply to this post by Daniel Fischer
On 4 April 2011 11:34, Daniel Fischer <[hidden email]> wrote:
> If there's only a single encoding recognised, UTF-8 surely should be the
> one (though perhaps Windows users might disagree, iirc, Windows uses UCS2
> as standard encoding).

Windows APIs use UTF-16, but the encoding of files (which is the
relevant point here) is almost uniformly UTF-8 - though of course you
can find legacy apps making other choices.

Cheers,
Max

Re: Encoding of Haskell source files

Ketil Malde-5
In reply to this post by Michael Snoyman
Michael Snoyman <[hidden email]> writes:

> My guess is that a large subset of Haskell modules start with one of left
> brace (starting with comment or language pragma), m (for starting with
> module), or some whitespace character. So it *might* be feasible to take a
> guess at things. But as I said before: I like UTF-8. Is there anyone out
> there who has a compelling reason for writing their Haskell source in
> EBCDIC?

Probably not EBCDIC. :-)

Correct me if I'm wrong here, but I think nobody has compelling
reasons for using any other Unicode format than UTF-8.  Although some
systems use UTF-16 (or some approximation thereof) internally, UTF-8
seems to be the universal choice for external encoding.  However, there
probably exists a bit of code using Latin-1 and Windows charsets, and
here leading characters aren't going to help you all that much.

I think the safest thing to do is to require source to be ASCII, and
provide escapes for code points >127...

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Re: Encoding of Haskell source files

Steve Schafer
In reply to this post by Max Bolingbroke-2
On Mon, 4 Apr 2011 13:30:08 +0100, you wrote:

>Windows APIs use UTF-16...

The newer ones, at least. The older ones usually come in two flavors,
UTF-16LE and 8-bit code page-based.

>...but the encoding of files (which is the relevant point here) is
>almost uniformly UTF-8 - though of course you can find legacy apps
>making other choices.

If you're talking about files written and read by the operating system
itself, then perhaps. But my experience is that there are a lot of
applications that use UTF-16LE, especially ones that typically only work
with smaller files (configuration files, etc.).

As for Haskell, I would still vote for UTF-8 only, though. The only
reason to favor anything else is legacy compatibility with existing
Haskell source files, and that isn't really an issue here.

-Steve Schafer

Re: Encoding of Haskell source files

Antoine Latter-2
In reply to this post by Max Bolingbroke-2
On Mon, Apr 4, 2011 at 7:30 AM, Max Bolingbroke
<[hidden email]> wrote:
> On 4 April 2011 11:34, Daniel Fischer <[hidden email]> wrote:
>> If there's only a single encoding recognised, UTF-8 surely should be the
>> one (though perhaps Windows users might disagree, iirc, Windows uses UCS2
>> as standard encoding).
>
> Windows APIs use UTF-16, but the encoding of files (which is the
> relevant point here) is almost uniformly UTF-8 - though of course you
> can find legacy apps making other choices.
>

Would we need to specifically allow for a Windows-style leading BOM in
UTF-8 documents? I can never remember if it is truly a part of UTF-8
or not.

Re: Encoding of Haskell source files

Malcolm Wallace-2
BOM is not part of UTF8, because UTF8 is byte-oriented.  But applications should be prepared to read and discard it, because some applications erroneously generate it.
Regards,
    Malcolm
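
Reading and discarding a leading BOM is a small step. A minimal sketch, assuming the file contents are available as a list of raw bytes (the function name dropUtf8Bom is made up for this example):

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- | Drop a leading UTF-8 BOM (EF BB BF) if present;
--   otherwise return the input unchanged.
dropUtf8Bom :: [Word8] -> [Word8]
dropUtf8Bom bytes = fromMaybe bytes (stripPrefix [0xEF, 0xBB, 0xBF] bytes)

main :: IO ()
main = do
  print (dropUtf8Bom [0xEF, 0xBB, 0xBF, 0x6D, 0x6F])  -- BOM removed
  print (dropUtf8Bom [0x6D, 0x6F])                    -- left as is
```

When reading through a Handle instead, the utf8_bom encoding exported by System.IO in GHC's base library behaves like utf8 but ignores a BOM at the start of the input.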

Re: Encoding of Haskell source files

Tako Schotanus
That's not what the official Unicode site says in its FAQ: http://unicode.org/faq/utf_bom.html#bom4 and http://unicode.org/faq/utf_bom.html#bom5

Cheers,
-Tako


On Mon, Apr 4, 2011 at 15:18, malcolm.wallace <[hidden email]> wrote:
> BOM is not part of UTF8, because UTF8 is byte-oriented.  But applications should be prepared to read and discard it, because some applications erroneously generate it.
>
> Regards,
>     Malcolm

Re: Encoding of Haskell source files

Max Rabkin-2
In reply to this post by Ketil Malde-5
2011/4/4 Ketil Malde <[hidden email]>:
> I think the safest thing to do is to require source to be ASCII, and
> provide escapes for code points >127...

I used to think that until I realised it meant having

-- Author: Ma\xef N\xe5me

In code, single characters aren't bad (does Haskell have something
like Python's named escapes, e.g. "\N{LATIN SMALL LETTER A WITH RING ABOVE}"?), but
reading UI strings is less fun.

Also, Unicode symbols for -> and the like are becoming more popular.

--Max
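
On the question of escapes: Haskell 2010 string literals support numeric escapes (decimal, octal and hexadecimal), but the only named escapes are the ASCII control codes such as \NUL and \DEL, not Unicode character names. A small sketch using the author name from the example above (the identifiers are illustrative only):

```haskell
import Data.Char (chr, ord)

-- The author name from the message above, written with hexadecimal escapes.
author :: String
author = "Ma\xef N\xe5me"

-- The same string written with decimal escapes.
authorDecimal :: String
authorDecimal = "Ma\239 N\229me"

main :: IO ()
main = do
  print (author == authorDecimal)
  print (map ord author)
  -- The only named escapes are ASCII control codes.
  print ("\DEL" == "\127")
  -- The empty escape \& ends a numeric escape, so that a following
  -- hex digit such as 'e' is not absorbed into the code point.
  print ("\xe5\&e" == [chr 0xe5, 'e'])
```

The need for \& illustrates how easily escaped text goes wrong: without it, "\xe5e" would denote the single code point U+0E5E.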

Re: Encoding of Haskell source files

Roel van Dijk-3
In reply to this post by Ketil Malde-5
2011/4/4 Ketil Malde <[hidden email]>:
> I think the safest thing to do is to require source to be ASCII, and
> provide escapes for code points >127...

I do not think that that is the safest option. The safest is just
writing down whatever GHC does. Escape codes for non-ASCII would break
a lot of packages and make programming really painful. Consider the
following, UTF-8 encoded, file:

http://code.haskell.org/numerals/test/Text/Numeral/Language/ZH/TestData.hs

I don't want to imagine writing that with escape characters. It would
also be very error prone, not being able to readily read what you
write.

But the overall consensus appears to be UTF-8 as the default encoding.
I will write an official proposal to amend the Haskell language
specification (probably this evening, UTC+1).

Re: Encoding of Haskell source files

Brandon Moore-2
In reply to this post by Michael Snoyman
>From: Michael Snoyman <[hidden email]>
>Sent: Mon, April 4, 2011 5:22:02 AM
>
>Firstly, I personally would love to insist on using UTF-8 and be done with it. I
>
>see no reason to bother with other character encodings.

If by "insist" you mean that the standard should insist that
implementations support UTF-8 by default, I agree.

The rest of the standard already just talks about sequences of Unicode
characters, so I don't see much to be gained by prohibiting other encodings.

In particular, I have read that systems set up for East Asian scripts
often use UTF-16 as a default encoding.

Brandon


Re: Encoding of Haskell source files

Colin Adams-3


2011/4/4 Brandon Moore <[hidden email]>
>
> The rest of the standard already just talks about sequences of Unicode
> characters, so I don't see much to be gained by prohibiting other encodings.
>
> In particular, I have read that systems set up for East Asian scripts
> often use UTF-16 as a default encoding.


Presumably because this will use less disk space on average.

I too don't see any reason to forbid other Unicode encodings. Perhaps mandate support for UTF-8, and allow others with a pragma. But unless someone adds support to a Haskell compiler for such a pragma, it will be fairly pointless putting this in the standard.

--
Colin Adams
Preston, Lancashire, ENGLAND
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Re: Encoding of Haskell source files

Michael Snoyman
In reply to this post by Brandon Moore-2
2011/4/4 Brandon Moore <[hidden email]>:
>>From: Michael Snoyman <[hidden email]>
>>Sent: Mon, April 4, 2011 5:22:02 AM
>>
>>Firstly, I personally would love to insist on using UTF-8 and be done with it. I
>>
>>see no reason to bother with other character encodings.
>
> If by "insist", you mean the standard insist that implementations support
> UTF-8 by default.

No, I mean that compliant compilers should only support UTF-8. I don't
see a reason to allow the creation of Haskell files that can only be
read by some compilers.

> The rest of the standard already just talks about sequences of Unicode
> characters, so I don't see much to be gained by prohibiting other encodings.
>
> In particular, I have read that systems set up for East Asian scripts
> often use UTF-16 as a default encoding.

I don't know about that, but I'd be very surprised if there are any
editors out there that don't support UTF-8. If a user is
inconvenienced once because he/she needs to change the default
encoding to UTF-8, and the result is all Haskell files share the same
encoding, I'm OK with that.

@Colin: Even if UTF-16 was more space-efficient than UTF-8 (which I
highly doubt[1]), I'd be incredibly surprised if this held true for
Haskell source, which will almost certainly be at least 90%
code points below 128. For those code points, UTF-16 is twice the size
of UTF-8.

Michael

[1] http://www.haskell.org/pipermail/haskell-cafe/2010-August/082268.html
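
The size argument can be checked with a little arithmetic. The sketch below computes encoded sizes directly from code point ranges rather than calling an encoding library; the function names are invented for this illustration.

```haskell
import Data.Char (ord)

-- | Bytes needed to encode one code point in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- | Bytes needed to encode one code point in UTF-16
--   (a surrogate pair is used above U+FFFF).
utf16Len :: Char -> Int
utf16Len c = if ord c < 0x10000 then 2 else 4

-- | (UTF-8 size, UTF-16 size) of a string, in bytes.
sizes :: String -> (Int, Int)
sizes s = (sum (map utf8Len s), sum (map utf16Len s))

main :: IO ()
main = do
  print (sizes "module Main where")   -- ASCII-only source text
  print (sizes "\x65e5\x672c\x8a9e")  -- three CJK characters
```

For the ASCII line the result is (17, 34), so UTF-16 is twice as large; for the three CJK characters it is (9, 6), so UTF-16 only wins where such characters dominate, which is unlikely for source code.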

Re: Encoding of Haskell source files

Yitzchak Gale
In reply to this post by Brandon Moore-2
+1 for UTF-8 only.

Brandon Moore wrote:
> ...I don't see much to be gained by prohibiting other encodings.

Universal portability of Haskell source code with respect to its
encoding is to be gained. We can achieve that simplicity now
with almost no cost. Why miss the opportunity?

> In particular, I have read that systems set up for East Asian scripts
> often use UTF-16 as a default encoding.

Default encoding is not an issue for any normal source code
editing tool.

Thanks,
Yitz
