Proposal: Define UTF-8 to be the encoding of Haskell source files

Roel van Dijk-3
Per the Haskell Prime process I would like to make an official
proposal [1].


* Proposal

The Haskell 2010 language specification states that: "Haskell uses the
Unicode character set" [2]. It does not state what encoding should be
used. This means, strictly speaking, it is not possible to reliably
exchange Haskell source files on the byte level.

I propose to make UTF-8 the only allowed encoding for Haskell source
files. Implementations must discard an initial Byte Order Mark (BOM)
if present [3].
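
As a rough illustration of what the proposal asks of an implementation, here is a minimal sketch in Python (chosen only for brevity). The function name `decode_haskell_source` is hypothetical and is not part of any compiler; it drops one leading BOM and then decodes strictly, so any invalid byte sequence is an error.

```python
import codecs


def decode_haskell_source(raw: bytes) -> str:
    """Decode source bytes as strict UTF-8, discarding one initial BOM."""
    if raw.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        raw = raw[len(codecs.BOM_UTF8):]
    # errors="strict" raises UnicodeDecodeError on any invalid sequence.
    return raw.decode("utf-8", errors="strict")
```

A file that starts with a BOM and a file without one then decode to the same text, while input in any other encoding is rejected rather than guessed at.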


* Pros
- Ensures that Haskell source can be reliably exchanged on the byte
  level.
- Disallows implicit ISO-8859-* encodings in source code, ensuring
  portability.
- Little or no implementation burden for compiler writers.
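
The first two points can be made concrete with a small sketch (Python, for illustration only; the sample string is made up). The same text produces different bytes under different encodings, and the same bytes produce different text under different ISO-8859-* parts, which is why an unstated encoding is not portable.

```python
text = 'price = "10€"'

utf8_bytes = text.encode("utf-8")
latin9_bytes = text.encode("iso-8859-15")

# The same text yields different byte sequences under different encodings.
differs = utf8_bytes != latin9_bytes

# The same bytes yield different text under different ISO-8859-* parts:
# 0xA4 is the euro sign in ISO-8859-15 but the currency sign in ISO-8859-1.
as_latin9 = latin9_bytes.decode("iso-8859-15")
as_latin1 = latin9_bytes.decode("iso-8859-1")

# A strict UTF-8 decoder rejects the ISO-8859-15 bytes instead of guessing.
try:
    latin9_bytes.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```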


* Cons

- Existing code relying on a non-UTF-8, locale-/implementation-specific
  encoding will need conversion. (Only relevant for Hugs-only code.)


* Implementation status

** GHC
"GHC assumes that source files are ASCII or UTF-8 only, other
encodings are not recognised. However, invalid UTF-8 sequences will be
ignored in comments, so it is possible to use other encodings such as
Latin-1, as long as the non-comment source code is ASCII only." [4]

From this I deduce that all current code accepted by GHC is compatible
with UTF-8. No working code will be broken.

** JHC
"JHC allows unrestricted use of the Unicode character set in Haskell
source, treating input as UTF-8." [5]

** Hugs
Hugs treats input as being in the encoding specified by the current
locale, but permits Unicode only in comments and character and string
literals. [6]


* Related proposal

There is one related proposal, now five years old:
"SourceEncodingDetection" [5]. There it is proposed to detect the
encoding using an algorithm which can distinguish between UTF-8,
UTF-16 and (not always) UTF-32. It can also detect the endianness of
the document, if applicable.
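
To give a feel for what detection involves, here is a minimal sketch that only inspects a leading BOM. This is not the algorithm from the wiki page, which is not reproduced in this message; the name `sniff_bom` and the table are assumptions made for illustration (Python).

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0
```

Without a BOM this sketch returns `(None, 0)`, and a real detector would have to fall back on heuristics at that point. Fixing the encoding to UTF-8 removes that step entirely.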

I think choosing just UTF-8 is a better choice than a detection
algorithm. It places less burden on implementation writers and is even
more portable.


* Next step

Discussion! There was already some discussion on the haskell-cafe
mailing list [7].

Attached is a patch for the Haskell Report which adds a note stating
that source encodings must be UTF-8.


Regards,
Roel van Dijk


[1] - http://hackage.haskell.org/trac/haskell-prime/wiki/Process
[2] - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-150002.1
[3] - http://www.unicode.org/faq/utf_bom.html#bom5
[4] - http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/separate-compilation.html#source-files
[5] - http://hackage.haskell.org/trac/haskell-prime/wiki/SourceEncodingDetection
[6] - http://cvs.haskell.org/Hugs/pages/users_guide/locale.html
[7] - http://article.gmane.org/gmane.comp.lang.haskell.cafe/87815

_______________________________________________
Haskell-prime mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-prime

Attachment: utf8_encoding.dpatch (7K)

Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Tillmann Rendel-5
Hi,

Roel van Dijk wrote:
> I propose to make UTF-8 the only allowed encoding for Haskell source
> files. Implementations must discard an initial Byte Order Mark (BOM)
> if present [3].

How would that affect the non-code parts of literate Haskell (*.lhs)
files? In particular, would it place any burden on third-party tools
processing these files?

   Tillmann


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Yitzchak Gale
In reply to this post by Roel van Dijk-3
Roel van Dijk wrote:
> I propose to make UTF-8 the only allowed encoding for Haskell source
> files. Implementations must discard an initial Byte Order Mark (BOM)
> if present

I am in favor of this proposal.

However, you wrote:

> "GHC assumes that source files are ASCII or UTF-8 only, other
> encodings are not recognised. However, invalid UTF-8 sequences will be
> ignored in comments, so it is possible to use other encodings such as
> Latin-1, as long as the non-comment source code is ASCII only." [4]
>
> From this I deduce that all current code accepted by GHC is compatible
> with UTF-8. No working code will be broken.

No. If GHC is changed to conform to this proposal, source code
including invalid UTF-8 in comments which previously compiled
successfully will now be rejected.
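
A small sketch of the kind of file affected (Python, for illustration; the sample source and the helper name `first_invalid_utf8_offset` are made up). The code is pure ASCII, but the comment holds a Latin-1 byte, so the file as a whole is not valid UTF-8.

```python
# ASCII-only code, but the comment contains "café" encoded as Latin-1 (0xE9).
source = b"-- caf\xe9\nmain = return ()\n"


def first_invalid_utf8_offset(raw: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start
```

Re-encoding the same file as UTF-8 makes the check pass, which is the conversion such code would need.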

But anyway I think allowing invalid UTF-8 in comments is a
mistake. It could lead to the end of the comment being detected
in the wrong place, thus changing the meaning of the program in
very unexpected ways. Not likely, but possible.
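
One way this could happen, shown with a deliberately careless decoder (Python, for illustration). This is a hypothetical decoder, not a description of GHC's lexer: it trusts the length implied by a lead byte and skips that many bytes without checking them, so a Latin-1 byte at the end of a line comment swallows the newline.

```python
def naive_decode(raw: bytes) -> str:
    """Hypothetical careless decoder: never validates continuation bytes."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            # Trust the lead byte's claimed sequence length without checking.
            length = 4 if b >= 0xF0 else 3 if b >= 0xE0 else 2
            out.append("\ufffd")  # stand-in for the undecodable sequence
            i += length           # may skip over a newline
    return "".join(out)


# 0xE9 is "é" in Latin-1 but looks like the start of a 3-byte UTF-8 sequence.
source = b"-- caf\xe9\nx = 1\n"
decoded = naive_decode(source)
```

Here `decoded` has a single line: the newline and the `x` that followed it were consumed, so the rest of the definition ends up inside the comment.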

I doubt that there is a whole lot of code out there which would
be affected. And GHC can easily provide a certain degree of
backward compatibility with a flag and/or pragma.

Thanks,
Yitz


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Jason Reich
In reply to this post by Roel van Dijk-3
Tillmann Rendel wrote:
> How would that affect the non-code parts of literate Haskell (*.lhs)
> files? In particular, would it place any burden on third-party tools
> processing these files?

lhs2TeX already has limited support for UTF-8 for the rendering of
Literate Agda files.

Jason


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Duncan Coutts-4
In reply to this post by Roel van Dijk-3
On 4 April 2011 23:48, Roel van Dijk <[hidden email]> wrote:

> * Proposal
>
> The Haskell 2010 language specification states that: "Haskell uses the
> Unicode character set" [2]. It does not state what encoding should be
> used. This means, strictly speaking, it is not possible to reliably
> exchange Haskell source files on the byte level.
>
> I propose to make UTF-8 the only allowed encoding for Haskell source
> files. Implementations must discard an initial Byte Order Mark (BOM)
> if present [3].

> * Next step
>
> Discussion! There was already some discussion on the haskell-cafe
> mailing list [7].

This is a simple and obviously sensible proposal. I'm certainly in favour.

I think the only area where there might be some issue to discuss is
the language of the report. As far as I can see, the report does not
require that modules exist as files, does not require the ".hs"
extension and does not give the "standard" mapping from module name to
file name.

So since the goal is interoperability of source files then perhaps we
should also have a section somewhere with interoperability guidelines
for implementations that do store Haskell programs as OS files. The
section would describe the one module per file convention, the .hs
extension (this is already obliquely mentioned in the section on
literate Haskell syntax) and the mapping of module names to file names
in common OS file systems. Then this UTF-8 stipulation could go there
(and it would be clear that it applies only to conventional
implementations that store Haskell programs as files).

e.g.

Interoperability Guidelines
========================

This Report does not specify how Haskell programs are represented or
stored. There is however a conventional representation using OS files.
Implementations that conform to these guidelines will benefit from the
portability of Haskell program representations.

Haskell modules are stored as files, one module per file. These
Haskell source files are given the file extension ".hs" for usual
Haskell files and ".lhs" for literate Haskell files (see section
10.4).

Source files must be encoded as UTF-8 \cite{utf8}. Implementations
must discard an initial Byte Order Mark (BOM) if present.

To find a source file corresponding to a module name used in an import
declaration, the following mapping from module name to OS file name is
used. The '.' character is mapped to the OS's directory separator
string while all other characters map to themselves. The ".hs" or
".lhs" extension is added. Where both ".hs" and ".lhs" files exist for
the same module, the ".lhs" one should be used. The OS's standard
convention for representing Unicode file names should be used.

For example, on a UNIX-based OS, the module A.B would map to the file
name "A/B.hs" for a normal Haskell file or to "A/B.lhs" for a literate
Haskell file. Note that because it is rare for a Main module to be
imported, there is no restriction on the name of the file containing
the Main module. It is conventional, but not strictly necessary, that
the Main module use the ".hs" or ".lhs" extension.
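
The mapping in the draft above can be sketched as follows (Python, for illustration). The function names are hypothetical, search paths are left out, and the `exists` parameter is only there so the lookup can be tried without touching the file system.

```python
import os


def candidate_paths(module_name: str, sep: str = os.sep):
    """Map a module name to candidate file names, preferred one first."""
    base = module_name.replace(".", sep)
    # The draft says the ".lhs" file should be used when both exist.
    return [base + ".lhs", base + ".hs"]


def find_source(module_name: str, exists=os.path.exists, sep: str = os.sep):
    """Return the first candidate that exists, or None if there is none."""
    for path in candidate_paths(module_name, sep):
        if exists(path):
            return path
    return None
```

With `sep="/"`, the module A.B yields "A/B.lhs" and "A/B.hs", in that order.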


Duncan


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Ben Millwood
On Wed, Apr 6, 2011 at 2:13 PM, Duncan Coutts
<[hidden email]> wrote:

>
> Interoperability Guidelines
> ========================
>
> [...]
>
> To find a source file corresponding to a module name used in an import
> declaration, the following mapping from module name to OS file name is
> used. The '.' character is mapped to the OS's directory separator
> string while all other characters map to themselves. The ".hs" or
> ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for
> the same module, the ".lhs" one should be used. The OS's standard
> convention for representing Unicode file names should be used.
>

This standard isn't quite universal. For example, jhc will look for
Data.Foo in Data/Foo.hs but also Data.Foo.hs [1]. We could take this
as an opportunity to discuss that practice, or we could try to make
the changes to the report orthogonal to that issue.
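
A tiny sketch of the two file names mentioned above (Python, for illustration). The function name is made up, and jhc's real search rules may cover more cases than these two; see [1] for the authoritative description.

```python
def jhc_style_candidates(module_name: str, sep: str = "/"):
    """Two file names for one module: dots as directories, and dots kept."""
    return [module_name.replace(".", sep) + ".hs", module_name + ".hs"]
```

Only the first of the two names is found by an implementation that follows just the common convention.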

In some sense I think it's cute that the Report doesn't specify
anything about how Haskell modules are stored or represented, but I
don't think that freedom is actually used, so I'm happy to see it go.
I'd think, though, that in that case there would be more to discuss
than just the encoding, so if we could separate out the issues here, I
think that would be useful.

[1]: http://repetae.net/computer/jhc/manual.html#module-search-path


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Duncan Coutts-4
On Wed, 2011-04-06 at 16:09 +0100, Ben Millwood wrote:

> On Wed, Apr 6, 2011 at 2:13 PM, Duncan Coutts
> <[hidden email]> wrote:
> >
> > Interoperability Guidelines
> > ========================
> >
> > [...]
> >
> > To find a source file corresponding to a module name used in an import
> > declaration, the following mapping from module name to OS file name is
> > used. The '.' character is mapped to the OS's directory separator
> > string while all other characters map to themselves. The ".hs" or
> > ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for
> > the same module, the ".lhs" one should be used. The OS's standard
> > convention for representing Unicode file names should be used.
> >
>
> This standard isn't quite universal. For example, jhc will look for
> Data.Foo in Data/Foo.hs but also Data.Foo.hs [1]. We could take this
> as an opportunity to discuss that practice, or we could try to make
> the changes to the report orthogonal to that issue.

Indeed. But it's true to say that if you do support the common
convention then you get portability. This does not preclude JHC from
supporting something extra, but sources that take advantage of JHC's
extension are not portable to implementations that just use the common
convention.

> In some sense I think it's cute that the Report doesn't specify
> anything about how Haskell modules are stored or represented, but I
> don't think that freedom is actually used, so I'm happy to see it go.
> I'd think, though, that in that case there would be more to discuss
> than just the encoding, so if we could separate out the issues here, I
> think that would be useful.

It's not going. I hope I was clear in the example text that the
interoperability guidelines were not forcing implementations to use
files etc.; just that if they do, and they use these conventions, then
sources will be portable between implementations.

It doesn't stop an implementation using URLs, sticking multiple modules
in a file or keeping modules in a database.

Duncan



Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Roel van Dijk-3
In reply to this post by Duncan Coutts-4
On 6 April 2011 15:13, Duncan Coutts <[hidden email]> wrote:
> So since the goal is interoperability of source files then perhaps we
> should also have a section somewhere with interoperability guidelines
> for implementations that do store Haskell programs as OS files.

I think a set of interoperability guidelines is a great idea. It seems
these guidelines are already followed by GHC, Cabal, Hackage, Jhc and
possibly others.

Shall we consider this the proposal instead of just the encoding part?


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Tillmann Rendel-5
In reply to this post by Jason Reich
Hi,

Jason Reich wrote:
> Tillmann Rendel wrote:
>> How would that affect the non-code parts of literate Haskell (*.lhs)
>> files? In particular, would it place any burden on third-party tools
>> processing these files?
>
> lhs2TeX already has limited support for UTF-8 for the rendering of
> Literate Agda files.

My point is that literate Haskell programs are not just Haskell files,
but also, for example, Markdown or LaTeX files, or even database entries
representing a wiki page or a blog entry. Such programs are therefore
processed by third-party tools outside of the Haskell ecosystem, and it
seems unrealistic that the Haskell report could unilaterally mandate how
they are encoded.

I think the Haskell report should not discourage Haskell implementations
from being flexible about encoding.

   Tillmann


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Duncan Coutts-4
In reply to this post by Roel van Dijk-3
On Thu, 2011-04-07 at 09:07 +0200, Roel van Dijk wrote:

> On 6 April 2011 15:13, Duncan Coutts <[hidden email]> wrote:
> > So since the goal is interoperability of source files then perhaps we
> > should also have a section somewhere with interoperability guidelines
> > for implementations that do store Haskell programs as OS files.
>
> I think a set of interoperability guidelines is a great idea. It seems
> these guidelines are already followed by GHC, Cabal, Hackage, Jhc and
> possibly others.
>
> Shall we consider this the proposal instead of just the encoding part?

I would be happy to work with you and others to develop the report text
for such a proposal. I posted my first draft already :-)

Duncan



Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Doug McIlroy
In reply to this post by Roel van Dijk-3
This suggestion supposes that every system names files in
the same way modulo choice of "directory separator":

> To find a source file corresponding to a module name used in an import
> declaration, the following mapping from module name to OS file name is
> used. The '.' character is mapped to the OS's directory separator
> string while all other characters map to themselves. The ".hs" or
> ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for
> the same module, the ".lhs" one should be used.

This supposition is unwarranted.  We have all seen relative naming
systems that run both ways: a.b.c versus c(b(a)). And Haskellites
would simplify the latter to c$b$a.  Secondary storage may be
organized by files, segments, objects, etc.  Combinations of these
notions have been created in order to cater for legacy languages
that depend on particular models.

It is a step too far to try to predict how Haskell modules will
be adopted into every possible naming environment.

Doug McIlroy


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Roel van Dijk-3
On 7 April 2011 14:33, Doug McIlroy <[hidden email]> wrote:
> This supposition is unwarranted.  We have all seen relative naming
> systems that run both ways: a.b.c versus c(b(a)). And Haskellites
> would simplify the latter to c$b$a.  Secondary storage may be
> organized by files, segments, objects, etc.  Combinations of these
> notions have been created in order to cater for legacy languages
> that depend on particular models.
>
> It is a step too far to try to predict how Haskell modules will
> be adopted into every possible naming environment.

The proposal doesn't try to regulate the use of Haskell modules in
every possible naming environment. Just file systems. And there only
as a set of guidelines.

To quote Duncan Coutts previously in this thread:

"I hope I was clear in the example text that the
interoperability guidelines were not forcing implementations to use
files etc.; just that if they do, and they use these conventions, then
sources will be portable between implementations.
It doesn't stop an implementation using URLs, sticking multiple modules
in a file or keeping modules in a database."


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Roel van Dijk-3
In reply to this post by Duncan Coutts-4
On 7 April 2011 14:11, Duncan Coutts <[hidden email]> wrote:
> I would be happy to work with you and others to develop the report text
> for such a proposal. I posted my first draft already :-)

What would be a good way to proceed? Looking at the process I think we
should create a wiki page and a ticket for this proposal. If necessary
I'll volunteer to be the proposal owner.


Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Duncan Coutts-4
On Thu, 2011-04-07 at 15:44 +0200, Roel van Dijk wrote:
> On 7 April 2011 14:11, Duncan Coutts <[hidden email]> wrote:
> > I would be happy to work with you and others to develop the report text
> > for such a proposal. I posted my first draft already :-)
>
> What would be a good way to proceed? Looking at the process I think we
> should create a wiki page and a ticket for this proposal. If necessary
> I'll volunteer to be the proposal owner.

Ok, I can give you permissions on the wiki. What is your username on the
haskell-prime wiki?

Duncan



Re: Proposal: Define UTF-8 to be the encoding of Haskell source files

Roel van Dijk-3
> Ok, I can give you permissions on the wiki. What is your username on the
> haskell-prime wiki?

Great! My haskell-prime username is "roelvandijk".
