Haskell Platform Proposal: add the 'text' library


Don Stewart-2

= Proposal: Add Data.Text to the Haskell Platform =

Maintainer: Bryan O'Sullivan (submitted with his approval)

== Introduction ==

This is a proposal for the 'text' package to be included in the next
major release of the Haskell platform.

An up to date copy of this text is kept at:

    http://trac.haskell.org/haskell-platform/wiki/Proposals/text

Everyone is invited to review this proposal, following the standard
procedure for proposing and reviewing packages.

    http://trac.haskell.org/haskell-platform/wiki/AddingPackages

Review comments should be sent to the libraries mailing list by
October 1 so that we have time to discuss and resolve issues
before the final deadline on November 1.

    http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable 

== Credits ==

Proposal author and package maintainer: Bryan O'Sullivan, originally by
Tom Harper, based on ByteString and Vector (fusion) packages.

The following individuals contributed to the review process: Don
Stewart, Johan Tibell

== Abstract ==

The 'text' package provides an efficient packed, immutable Unicode text type
(both strict and lazy), with a powerful loop fusion optimization framework.

The 'Text' type represents Unicode character strings, in a time and
space-efficient manner. This package provides text processing
capabilities that are optimized for performance critical use, both
in terms of large data quantities and high speed.

The 'Text' type provides character-encoding, type-safe case
conversion via whole-string case conversion functions. It also
provides a range of functions for converting Text values to and from
'ByteStrings', using several standard encodings (see the 'text-icu'
package for a much larger variety of encoding functions).
 
Efficient, locale-sensitive text IO is also supported.
 
This module is intended to be imported qualified, to avoid name
clashes with Prelude functions, e.g.
 
    import qualified Data.Text as T
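
As a minimal sketch of what qualified use looks like in practice (the helper names `normaliseSpaces` and `wordCount` are hypothetical and not part of the package; the `T.words`, `T.unwords` and `T.pack` calls are assumed to behave as in recent releases of 'text'):

```haskell
import qualified Data.Text as T

-- | Trim a line and collapse each run of whitespace to a single space.
-- (Hypothetical helper, for illustration only.)
normaliseSpaces :: T.Text -> T.Text
normaliseSpaces = T.unwords . T.words

-- | Count the whitespace-separated words in a piece of text.
-- (Hypothetical helper, for illustration only.)
wordCount :: T.Text -> Int
wordCount = length . T.words
```

Because every library function carries the `T.` prefix, names such as `words` and `unwords` do not clash with their Prelude counterparts on String.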

Documentation and tarball from the hackage page:

    http://hackage.haskell.org/package/text

Development repo:

    darcs get http://code.haskell.org/text/

== Rationale ==

While Haskell's Char type is capable of representing Unicode code points, the
String sequence of such Chars has some drawbacks that prevent is general
use:

 1. unicode-unaware case conversion (map toUpper is an unsafe case conversion)
 2. the representation is space inefficient.
 3. the data structure is element-level lazy, whereas a number of
   applications require either some level of additional strictness

An intermediate solution to these was via 'Data.ByteString' (an
efficient byte sequence type, that addresses points 2 and 3), which,
when used in conjunction with utf8-string, provides very simple
non-latin1 encoding support (though with significant drawbacks in terms
of locale and encoding range).

The 'text' package addresses these shortcomings in a number of way:

 1. support whole-string case conversion (thus, type correct unicode
    transformations)
 2. a space and time efficient representation, based on unboxed Word16
    arrays
 3. either fully strict, or chunk-level lazy data types (in the style of
    Data.ByteString)
 4. full support for locale-sensitive, encoding-aware IO.
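
The difference between per-character and whole-string case conversion (point 1 in both lists) can be sketched with the German sharp s. This is an illustration rather than part of the proposal; the names `street`, `perChar` and `wholeString` are hypothetical, and the behaviour of `T.toUpper` is assumed to match its documentation in recent releases of 'text':

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- German sharp s (U+00DF), written as an escape to avoid
-- source-encoding issues.
street :: String
street = "stra\223e"

-- Per-character conversion: each Char maps to exactly one Char, so
-- U+00DF, which has no single-character uppercase form, is left alone.
perChar :: String
perChar = map toUpper street

-- Whole-string conversion: U+00DF expands to "SS", so the result is
-- one character longer than the input.
wholeString :: T.Text
wholeString = T.toUpper (T.pack street)
```

A function of type `Char -> Char` cannot express this mapping at all, which is why the conversion has to operate on the whole string.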

The 'text' library has rapidly become popular for a number of
applications, and is used by more than 50 other Hackage packages. As of
Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular),
in particular, in web programming. It is used by:

 * the blaze html pretty printing library
 * the hstringtemplate file templating library
 * *all* popular web frameworks: happstack, snap, salvia and yesod
 * the hexpat and libxml xml parsers

The design is based on experience from Data.Vector and Data.ByteString:
 
 * the underlying type is based on unpinned, packed arrays on the Haskell heap
    with an ST interface for memory effects.
 * pipelines of operations are optimized via conversion to and from the
   'stream' abstraction[1]
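
To give a feel for the 'stream' abstraction, here is a toy version over lists, loosely following the representation described in [1]. It is not the library's internal code: the paper hides the state type existentially, whereas this sketch exposes it as a type parameter to stay within plain Haskell, and the name `evensDoubled` is hypothetical.

```haskell
-- One step of a stream: finished, move to a new state without
-- producing anything, or yield an element together with a new state.
data Step s a = Done | Skip s | Yield a s

-- A stream is a step function plus an initial state.
data Stream s a = Stream (s -> Step s a) s

-- Convert a list into a stream whose state is the remaining list.
stream :: [a] -> Stream [a] a
stream = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

-- Run a stream to completion, building a list.
unstream :: Stream s a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'

-- Mapping only wraps the step function; no intermediate list is built.
mapS :: (a -> b) -> Stream s a -> Stream s b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

-- Filtering turns rejected elements into Skip steps.
filterS :: (a -> Bool) -> Stream s a -> Stream s a
filterS p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done    -> Done
      Skip s' -> Skip s'
      Yield x s'
        | p x       -> Yield x s'
        | otherwise -> Skip s'

-- A pipeline on streams, converted back to a list once at the end.
evensDoubled :: [Int] -> [Int]
evensDoubled = unstream . mapS (* 2) . filterS even . stream
```

In a fusing library each public function is defined as `unstream . f . stream`, and a rewrite rule removes adjacent `stream . unstream` pairs so that a whole pipeline collapses to a single loop. That rule is omitted from this sketch.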

== The API ==

The API is broken into several logical pieces, which are
self-explanatory:

 * combinators for operating on strict, abstract 'text's.
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text.html

 * an equivalent API for chunk-element lazy 'text's.
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy.html

 * encoding transformations, to and from bytestrings:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html

 * support for conversion to Ptr Word16:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Foreign.html

 * locale-aware IO layer:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-IO.html
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy-IO.html

== Design decisions ==

 * IO and pure combinators are in separate modules.
 * Both a fully strict and a partially strict type are provided.
 * The underlying optimization framework is stream fusion (not build/foldr), and is hidden from the user.
 * Unpinned arrays are used, to prevent fragmentation.
 * Large numbers of additional encodings are delegated to the text-icu package.
 * An 'IsString' instance is provided for String literals.
 * The implementation is OS and architecture neutral (portable).
 * The implementation uses a number of language extensions:

    CPP
    MagicHash
    UnboxedTuples
    BangPatterns
    Rank2Types
    RecordWildCards
    ScopedTypeVariables
    ExistentialQuantification
    DeriveDataTypeable

 * The implementation is entirely Haskell (no additional C code or libraries).
 * The package provides a QuickCheck/HUnit testsuite, and coverage data.
 * The package adds no new dependencies to the HP.
 * The package builds with Cabal's Simple build type.
 * There is no existing functionality for packed unicode text in the HP.
 * The package has complexity annotations.
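
As a small illustration of the 'IsString' item in the list above (a sketch only; the name `greeting` is hypothetical), the instance is what lets a string literal denote a Text value:

```haskell
import Data.String (fromString)
import qualified Data.Text as T

-- With the OverloadedStrings extension enabled, GHC inserts this
-- 'fromString' call around every string literal automatically, so one
-- could write simply: greeting = "hello"
greeting :: T.Text
greeting = fromString "hello"
```

The explicit `fromString` call is used here so that the snippet compiles without any language extension.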

== Open issues ==

The text-icu package is not part of this propposal.

== Notes ==

The implementation consists of 30 modules, and relies on cabal's package
hiding mechanism to expose only 5 modules. The implementation is around
8000 lines of text total.

The public modules expose none of these (?).

The Python standard library provides both a string and a unicode
sequence type. These are somewhat analogous to the
ByteString/String/Text split.

= References =

[1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts,
     Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

Re: Haskell Platform Proposal: add the 'text' library

Duncan Coutts-4
On 7 September 2010 16:26, Don Stewart <[hidden email]> wrote:
>
> = Proposal: Add Data.Text to the Haskell Platform =

> == The API ==
>
> The API is broken into several logical pieces, which are
> self-explanatory:

I just want to point out to people who would like to review the API
that Bryan recently released version 0.8 which contains some minor API
changes from version 0.7. So you should look at the 0.8 haddock
documentation rather than the 0.7 versions linked in the initial
proposal.

Don: since you're making the proposal, can you update the wiki version
of the proposal with links to the latest API docs?

   http://trac.haskell.org/haskell-platform/wiki/Proposals/text

Duncan

Re: Haskell Platform Proposal: add the 'text' library

Krasimir Angelov-2
In reply to this post by Don Stewart-2
I see that the text package provides its own encoding/decoding
functions. This overlaps with the Unicode API offered from the base
package. The API in base is oriented towards encoding/decoding of text
when doing file IO but definitely the conversion utils should be
reused. I implemented conversion functions myself using the internal
API:

http://code.haskell.org/gf/src/compiler/GF/Text/Coding.hs

This is String <-> ByteString conversion but it could work with Text as well.

This is mainly an implementation issue, but if we add text to the Haskell
Platform then it will be harder to change the API later if that is
needed in order to reuse the API from base. For instance in base there
is a notion of TextEncoding which I don't see in text.

Regards,
  Krasimir



Re: Haskell Platform Proposal: add the 'text' library

Ian Lynagh
In reply to this post by Don Stewart-2
On Tue, Sep 07, 2010 at 08:26:36AM -0700, Donald Bruce Stewart wrote:
>
> = Proposal: Add Data.Text to the Haskell Platform =

I feel silly saying this, but as this will probably serve as an example
of the policy I'll say it anyway: I think this should be:
    Proposal: Add 'text' to the Haskell Platform

> Proposal Author: Don Stewart
> Maintainer: Bryan O'Sullivan (submitted with his approval)

> Credits
> Proposal author and package maintainer: Bryan O'Sullivan, originally by
> Tom Harper, based on ByteString and Vector (fusion) packages.
>
> The following individuals contributed to the review process: Don
> Stewart, Johan Tibell

These two sections appear to contradict each other.

Also, the hackage page says
    Maintainer  Bryan O'Sullivan <[hidden email]>
                Tom Harper <[hidden email]>
                Duncan Coutts <[hidden email]>

> This is a proposal for the 'text' package

Should mention the version number, and link to the hackage page.

> This package provides text processing capabilities that are optimized
> for performance critical use, both in terms of large data quantities and
> high speed.

Are there other uses it is less suitable for, or are you just saying
that the code has been optimised?

If performance is important for the proposal, do you have evidence that
it performs well, or a way to check that performance has not regressed
in future releases?

> using several standard encodings

Just ASCII and UTF*, right?

Incidentally, I've just noticed some broken haddock markup for:
    I/O libraries /do not support locale-sensitive I\O
in
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-IO.html

> see the 'text-icu' package

Would be nice for this to link to the hackage page.

> a much larger variety of encoding functions

Why not bundle these in the text package, or also put this package in
the platform? hackage doesn't have the haddocks as I write this, but I
assume they are text-specific.

> http://hackage.haskell.org/package/text

Should link to the version-specific page.

This item of "Proposal content" on AddingPackages doesn't seem to be
covered:
    For library packages, an example of how the API is intended to be
    used should be given.



This is really a comment on the process rather than your proposal, but
    After a proposal is accepted (or conditionally accepted) the
    proposal must remain on the wiki.
and
    An explicit checklist of the package requirements below is not
    required. The proposal should state however that all the
    requirements are met
seem incompatible to me, as your
    All package requirements are met.
comment will become out of date as the requirement list evolves.


On
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
a number of haddocks say
    Subject to fusion.
but I can't see an explanation for the new user of what this means or
why they should care. Also, would it not be better to say
    Warning: Not subject to fusion.
for the handful that aren't? Currently it's hard to notice.


In
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Encoding-Error.html
I would expect lenientDecode etc to use the On{En,De}codeError type
synonyms defined above.


In
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Lazy.html
the choice 'B' seems odd:
    import qualified Data.Text.Lazy as B


I would have expected
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
to mention the existence of .Lazy in its description, and an explanation
of when I should use it.


Are there cases when Data.Text is significantly faster than
Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
Data.Text, but do we need the user to have a complete interface for
both?)


In
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
isInfixOf's docs say:
    O(n+m) The isInfixOf function takes two Texts and returns True iff the
    first is contained, wholly and intact, anywhere within the second.
    In (unlikely) bad cases, this function's time complexity degrades
    towards O(n*m).
I think the complexity at the start, in the same place as all the other
complexities, ought to be O(n*m), with the common case given afterwards.

And replace's docs just say
    O(m+n) Replace every occurrence of one substring with another.
but should presumably be O(n*m). It's also not necessarily clear what m
and n refer to.

> length :: Text -> Int
> O(n) Returns the number of characters in a Text. Subject to fusion.

Did you consider keeping the number of characters in the Text directly?
Is there a reason it couldn't be done?

> prevent is general use

"prevent its general use"

> a number of way:

"a number of ways:"

> unicode-unaware case conversion (map toUpper is an unsafe case
> conversion)

Surely this is something that should be added to Data.Char, irrespective
of whether text is added to the HP?

> the data structure is element-level lazy, whereas a number of
> applications require either some level of additional strictness

This sentence looks like it has been mis-edited?

And by "a number of applications" I think you mean "high performance
applications"?

> support whole-string case conversion (thus, type correct unicode transformations)

I don't really get what you mean by "type correct" here.

> based on unboxed Word16 arrays

Why Word16?

> As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most
> popular), in particular, in web programming.

I can't work out what you mean here. Ranked 27 by what metric? Why web
programming in particular?

> A large testsuite, with coverage data, is provided.

It would be nice if this was on the text package's page, rather than in
~dons.

> RecordWildCards

I'm not a fan, but I fear I may be in the minority.

> propposal

"proposal"

> to expose only 5 modules

9, no?

> The public modules expose none of these (?).

None of what?



I compared the API of Data.Text and Data.ByteString.Char8 and found a
number of differences:

BS:   break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
Text: break :: Text -> Text -> (Text, Text)
      breakEnd :: Text -> Text -> (Text, Text)
      breakBy :: (Char -> Bool) -> Text -> (Text, Text)

BS:   count :: Char -> ByteString -> Int
Text: count :: Text -> Text -> Int

BS:   find :: (Char -> Bool) -> ByteString -> Maybe Char
Text: find :: Text -> Text -> [(Text, Text)]
      findBy :: (Char -> Bool) -> Text -> Maybe Char

BS:   replicate :: Int -> Char -> ByteString
Text: replicate :: Int -> Text -> Text

BS:   split :: Char -> ByteString -> [ByteString]
Text: split :: Text -> Text -> [Text]

BS:   span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)

BS:   splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
Text: splitBy :: (Char -> Bool) -> Text -> [Text]

BS:   unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a)
Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text

BS:   zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a]
Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text

I think the two APIs ought to be brought into agreement.

There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and
ByteString has IO functions mixed in with the non-IO functions), but
those seemed to be the most significant ones. Also,
    prefixed :: Text -> Text -> Maybe Text
is analogous to
    stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
in Data.List

This also made me notice that Text haddocks tend to use 'b' as a type
variable rather than 'a', e.g.
    foldl :: (b -> Char -> b) -> b -> Text -> b


Thanks
Ian


Re: Haskell Platform Proposal: add the 'text' library

Duncan Coutts-4
I'll answer a few of Ian's questions about the design of the text package:

On 7 September 2010 22:50, Ian Lynagh <[hidden email]> wrote:

>> see the 'text-icu' package
>
> Would be nice for this to link to the hackage page.
>
>> a much larger variety of encoding functions
>
> Why not bundle these in the text package, or also put this package in
> the platform? hackage doesn't have the haddocks as I write this, but I
> assume they are text-specific.

It would depend on the ICU C library. Similarly if we added a
conversion lib based on iconv. The ones in the text package now are
pure Haskell.

> Are there cases when Data.Text is significantly faster than
> Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
> Data.Text, but do we need the user to have a complete interface for
> both?)

Mm, this is a fair question. In the case of bytestring we need both
because sometimes for dealing with foreign code or IO you need the
representation to be a contiguous block of memory. For text the
representation is more abstract so that need does not arise. One
might argue that if it is simply to control strictness then one could
use the lazy version and provide a deepseq instance.

Here's an alternative argument: suppose we change the representation
of strict text to be a tree of chunks (e.g. finger tree). We could
achieve efficient concatenation. This representation would be
impossible while preserving semantics of a lazy tail. A tree impl that
has any kind of balance needs to know the overall length so cannot
have a lazy tail.
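
For readers who have not used both types, the following sketch shows how the strict and lazy representations relate. It assumes the `fromChunks`, `toStrict`, `cycle` and `take` functions of Data.Text.Lazy as found in recent releases of 'text'; the names `lazyGreeting`, `strictGreeting` and `firstFive` are hypothetical.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- A lazy Text assembled from two strict chunks.
lazyGreeting :: TL.Text
lazyGreeting = TL.fromChunks [T.pack "hello, ", T.pack "world"]

-- Converting to a strict Text copies the chunks into a single array.
strictGreeting :: T.Text
strictGreeting = TL.toStrict lazyGreeting

-- The tail of a lazy Text is only built on demand, so taking a prefix
-- terminates even though the input is infinite.
firstFive :: TL.Text
firstFive = TL.take 5 (TL.cycle (TL.pack "ab"))
```

The last definition is exactly the "lazy tail" behaviour that a length-balanced tree representation could not preserve.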

> Did you consider keeping the number of characters in the Text directly?
> Is there a reason it couldn't be done?

There's little point. Knowing the length does not usually help you
save any other O(n) operations. It'd also only help for strict text,
not lazy. Just like lists, asking for the length is usually not a good
idea.

>> unicode-unaware case conversion (map toUpper is an unsafe case
>> conversion)
>
> Surely this is something that should be added to Data.Char, irrespective
> of whether text is added to the HP?

No, not to Data.Char. Case folding is not a per-Char operation; it
only works for [Char] / String / Text. It could be added to
Data.String or something.


>> based on unboxed Word16 arrays
>
> Why Word16?

It doesn't actually matter. It's an implementation detail. It was
originally chosen based on benchmarks. It could be changed again based
on new benchmarks without affecting the public API.
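
One consequence of a Word16 representation is that a Char does not always occupy a single array element. The sketch below shows the standard UTF-16 arithmetic; it is an illustration and not the library's code, and the function name `utf16Units` is hypothetical.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one code point as UTF-16 code units.
-- Code points up to U+FFFF fit in one unit; higher ones need a
-- surrogate pair. (Assumes the input is not itself a surrogate.)
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [fromIntegral hi, fromIntegral lo]
  where
    n  = ord c
    m  = n - 0x10000              -- 20-bit offset above the BMP
    hi = 0xD800 + (m `shiftR` 10) -- top 10 bits
    lo = 0xDC00 + (m .&. 0x3FF)   -- bottom 10 bits
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00. This variable width is also why `length` has to walk the array rather than read off the array size.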

> I compared the API of Data.Text and Data.ByteString.Char8 and found a
> number of differences:

Many of these are deliberate and sensible. The thing with text as
opposed to lists/arrays is that almost all operations you want to do
are substring based and not element based. A Unicode code point (a
Char) is sadly only roughly related to the human concept of a
character. In particular there are combining characters. So even if
you want to search or split on a particular "character" that may mean
searching for a short sequence of Chars / code points.

So where the ByteString API followed the List api by being byte
oriented, the Text API is substring oriented.
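
A short sketch of what "substring oriented" means in practice. Note the assumption: it uses the names from later releases of 'text', where (as far as I know) the substring versions are called `splitOn` and `breakOn` rather than the `split` and `break` of the 0.8 API quoted in this thread. The names `eAcute`, `fields`, `keyValue` and `occurrences` are hypothetical.

```haskell
import qualified Data.Text as T

-- 'e' followed by U+0301 (combining acute accent): one visible
-- character, but two code points.
eAcute :: T.Text
eAcute = T.pack "e\769"

-- Splitting on a multi-code-point "character" is an ordinary
-- substring split.
fields :: [T.Text]
fields = T.splitOn eAcute (T.pack "cafe\769s and the\769s")

-- Break at the first occurrence of a substring; the match stays at
-- the front of the second component.
keyValue :: (T.Text, T.Text)
keyValue = T.breakOn (T.pack "::") (T.pack "key::value::more")

-- Count non-overlapping occurrences of a substring.
occurrences :: Int
occurrences = T.count (T.pack "an") (T.pack "banana")
```

A Char-based `split` could not express the first example, because no single Char identifies the accented letter.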

> BS:   break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
> Text: break :: Text -> Text -> (Text, Text)
>      breakEnd :: Text -> Text -> (Text, Text)
>      breakBy :: (Char -> Bool) -> Text -> (Text, Text)
>
> BS:   count :: Char -> ByteString -> Int
> Text: count :: Text -> Text -> Int
>
> BS:   find :: (Char -> Bool) -> ByteString -> Maybe Char
> Text: find :: Text -> Text -> [(Text, Text)]
>      findBy :: (Char -> Bool) -> Text -> Maybe Char
>
> BS:   replicate :: Int -> Char -> ByteString
> Text: replicate :: Int -> Text -> Text
>
> BS:   split :: Char -> ByteString -> [ByteString]
> Text: split :: Text -> Text -> [Text]
>
> BS:   span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
> Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)
>
> BS:   splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
> Text: splitBy :: (Char -> Bool) -> Text -> [Text]
>
> BS:   unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a)
> Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text
>
> BS:   zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a]
> Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text
>
> I think the two APIs ought to be brought into agreement.

Perhaps. If so, then it is the ByteString.Char8 that ought to be
brought into agreement with Text, not the other way around. I think
Text is right in this area. On the other hand, perhaps it makes sense
for ByteString.Char8 to remain like the ByteString byte interface
which is byte oriented (and probably rightly so). I hope the
significance and use of ByteString.Char8 will decrease as Text becomes
more popular. ByteString.Char8 is really just for the cases where
you're handling ASCII-like protocols.

> There are a number of other differences which probably want to be tidied
> up (mostly functions which are in one package but not the other,

What are you thinking of specifically?

> ByteString has IO functions mixed in with the non-IO functions,

Which I don't think was a good idea. I would prefer to split them up.

> but those seemed to be the most significant ones. Also,

>    prefixed :: Text -> Text -> Maybe Text
> is analogous to
>    stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
> in Data.List

Ah, that one probably does make sense to change to match Data.List.
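
For comparison, here is the Data.List function next to a Text counterpart under the same name. This assumes a release of 'text' that exports `stripPrefix` from Data.Text, which later versions do; the names `listResult`, `textResult` and `noMatch` are hypothetical.

```haskell
import Data.List (stripPrefix)
import qualified Data.Text as T

-- Data.List version: drop a prefix from a String if it is present.
listResult :: Maybe String
listResult = stripPrefix "foo" "foobar"

-- The Text counterpart, under the Data.List name.
textResult :: Maybe T.Text
textResult = T.stripPrefix (T.pack "foo") (T.pack "foobar")

-- Both return Nothing when the prefix does not match.
noMatch :: Maybe T.Text
noMatch = T.stripPrefix (T.pack "bar") (T.pack "foobar")
```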

Duncan
|

Re: Haskell Platform Proposal: add the 'text' library

Bryan O'Sullivan
In reply to this post by Krasimir Angelov-2
On Tue, Sep 7, 2010 at 2:06 PM, Krasimir Angelov <[hidden email]> wrote:
> I see that the text package provides its own encoding/decoding
> functions. This overlaps with the Unicode API offered from the base
> package.

It doesn't really. As you note, the stuff in base is all tied to I/O on Handles, whereas the functions in the text package are pure.
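
The distinction can be sketched as follows. The function names `toBytes`, `fromBytes` and `useUtf8` are hypothetical; `encodeUtf8` and `decodeUtf8` are from Data.Text.Encoding, and `hSetEncoding` and `utf8` are from System.IO in base.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import System.IO (Handle, hSetEncoding, utf8)

-- Pure conversions from the text package: no Handle, no IO.
toBytes :: T.Text -> B.ByteString
toBytes = TE.encodeUtf8

-- Note: decodeUtf8 raises an exception on invalid input.
fromBytes :: B.ByteString -> T.Text
fromBytes = TE.decodeUtf8

-- base's TextEncoding takes effect by being attached to a Handle;
-- the actual conversion then happens during reads and writes.
useUtf8 :: Handle -> IO ()
useUtf8 h = hSetEncoding h utf8
```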
 
> The API in base is oriented towards encoding/decoding of text
> when doing file IO but definitely the conversion utils should be
> reused.

Unfortunately, that's not possible, as it would break backwards compatibility with 6.10, which some industrial users still need.


Re: Haskell Platform Proposal: add the 'text' library

Bryan O'Sullivan
In reply to this post by Ian Lynagh
Thanks for your comments, Ian. I appreciate your time and care in looking this over!
 
> Incidentally, I've just noticed some broken haddock markup for:
>    I/O libraries /do not support locale-sensitive I\O
> in
>    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-IO.html

Thanks for spotting that. It appears to be due to a Haddock bug, unfortunately.

>> a much larger variety of encoding functions
>
> Why not bundle these in the text package, or also put this package in
> the platform?

Either one would induce a dependency on text-icu, which is not as mature as text, and which would imply a dependency on the rather large ICU library. I do believe that text-icu should be submitted, but not until it's ready.

On
   http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
a number of haddocks say
   Subject to fusion.
but I can't see an explanation for the new user of what this means or
why they should care.

That's not quite true: it's actually the very first thing documented: http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html#1

However, that description is skimpy, and I've replaced it:

-- Most of the functions in this module are subject to /fusion/,
-- meaning that a pipeline of such functions will usually allocate at
-- most one 'Text' value.
--
-- As an example, consider the following pipeline:
--
-- > import qualified Data.Text as T
-- > import qualified Data.Text.Encoding as E
-- > import Data.ByteString (ByteString)
-- >
-- > countChars :: ByteString -> Int
-- > countChars = T.length . T.toUpper . E.decodeUtf8
--
-- From the type signatures involved, this looks like it should
-- allocate one 'ByteString' value, and two 'Text' values. However,
-- when a module is compiled with optimisation enabled under GHC, the
-- two intermediate 'Text' values will be optimised away, and the
-- function will be compiled down to a single loop over the source
-- 'ByteString'.
--
-- Functions that can be fused by the compiler are marked with the
-- phrase \"Subject to fusion\".
 
In
   http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Encoding-Error.html
I would expect lenientDecode etc to use the On{En,De}codeError type
synonyms defined above.

Good point. I've fixed that up.
 
In
   http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Lazy.html
the choice 'B' seems odd:
   import qualified Data.Text.Lazy as B

Yep. Fixed :-)
 
I would have expected
   http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
to mention the existence of .Lazy in its description, and an explanation
of when I should use it.

I've expanded that discussion.

Are there cases when Data.Text is significantly faster than
Data.Text.Lazy?

It's often about twice as fast, but that depends on the nature of the code and data involved.
 
In
   http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
isInfixOf's docs say:
   O(n+m) The isInfixOf function takes two Texts and returns True iff the
   first is contained, wholly and intact, anywhere within the second.
   In (unlikely) bad cases, this function's time complexity degrades
   towards O(n*m).
I think the complexity at the start, in the same place as all the other
complexities, ought to be O(n*m), with the common case given afterwards.

I'd prefer to keep this as is.
 
And replace's docs just say
   O(m+n) Replace every occurrence of one substring with another.
but should presumably be O(n*m). It's also not necessarily clear what m
and n refer to.

The two parameters to the function?
 
> unicode-unaware case conversion (map toUpper is an unsafe case
> conversion)

Surely this is something that should be added to Data.Char, irrespective
of whether text is added to the HP?

Yes, but that's a not-this-problem problem.

> A large testsuite, with coverage data, is provided.

It would be nice if this was on the text package's page, rather than in
~dons.

I don't know how to do that.
 
> RecordWildCards

I'm not a fan, but I fear I may be in the minority.

It's just used internally, so why do you mind?

There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and
ByteString has IO functions mixed in with the non-IO functions), but
those seemed to be the most significant ones. Also,
   prefixed :: Text -> Text -> Maybe Text
is analogous to
   stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
in Data.List

I hadn't seen that. Hmm. For use with view patterns, I prefer the name I'm using right now.
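
To illustrate the view-pattern use mentioned here, a minimal sketch follows. It assumes the `T.stripPrefix` function exported by current releases of text, which has the same type as the `prefixed` function under discussion; `describeUrl` is a hypothetical helper invented for the example.

```haskell
{-# LANGUAGE ViewPatterns #-}
import           Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical helper: label a URL by its scheme.
-- 'T.stripPrefix' returns 'Just' the remainder when the prefix matches,
-- so it can be applied directly inside a view pattern.
describeUrl :: Text -> Text
describeUrl (T.stripPrefix (T.pack "https://") -> Just rest) =
  T.append (T.pack "secure: ") rest
describeUrl (T.stripPrefix (T.pack "http://") -> Just rest) =
  T.append (T.pack "plain: ") rest
describeUrl other =
  T.append (T.pack "unknown: ") other
```

Each clause both tests for the prefix and binds the remainder, which is the usage the function name has to read well in.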

This also made me notice that Text haddocks tend to use 'b' as a type
variable rather than 'a', e.g.
   foldl :: (b -> Char -> b) -> b -> Text -> b

Historical artifact :-) 


Re: Haskell Platform Proposal: add the 'text' library

Krasimir Angelov-2
In reply to this post by Bryan O'Sullivan
2010/9/8 Bryan O'Sullivan <[hidden email]>:
> It doesn't really. As you note, the stuff in base is all tied to I/O on
> Handles, whereas the functions in the text package are pure.

It doesn't mean that it is not possible to make them pure. The
operations are pure in nature.

> Unfortunately, that's not possible, as it would break backwards
> compatibility with 6.10, which some industrial users still need.

I guess you mean that then the text package will not work with 6.10. I
prefer to have some intermediate version of text for compatibility
rather than to nail down this lack of synergy forever. I still think
that it is premature to add the text package if it is not in synchrony
with the existing packages.

Re: Haskell Platform Proposal: add the 'text' library

Brandon S Allbery KF8NH

On 9/7/10 23:37 , Krasimir Angelov wrote:

> 2010/9/8 Bryan O'Sullivan <[hidden email]>:
>> It doesn't really. As you note, the stuff in base is all tied to I/O on
>> Handles, whereas the functions in the text package are pure.
>
> It doesn't mean that it is not possible to make them pure. The
> operations are pure in nature.
>
>> Unfortunately, that's not possible, as it would break backwards
>> compatibility with 6.10, which some industrial users still need.
>
> I guess you mean that then the text package will not work with 6.10. I

He means that text won't work with 6.10 *if* it's changed to use the
conversion routines that only exist in 6.12+ as you seem to be demanding.

- --
brandon s. allbery     [linux,solaris,freebsd,perl]      [hidden email]
system administrator  [openafs,heimdal,too many hats]  [hidden email]
electrical and computer engineering, carnegie mellon university      KF8NH

Re: Haskell Platform Proposal: add the 'text' library

Krasimir Angelov-2
2010/9/8 Brandon S Allbery KF8NH <[hidden email]>:
> He means that text won't work with 6.10 *if* it's changed to use the
> conversion routines that only exist in 6.12+ as you seem to be demanding.

Exactly. But it is probably possible to make version of text which
with 6.10 uses some copy of the routines and with 6.12 uses the
routines in base. The compatibility issue is only temporary, i.e. it lasts
only as long as there are many users of 6.10. The API from text will have to
stay forever. For now at least the API should be made compatible with base.

For example, something like this:

encode :: TextEncoding -> Text -> ByteString
decode :: TextEncoding -> ByteString -> Text

where TextEncoding could be defined in the text package when it is
compiled with GHC 6.10 or just reexported from base when it is
compiled with GHC 6.12.

Krasimir

Re: Haskell Platform Proposal: add the 'text' library

Ivan Lazar Miljenovic
On 8 September 2010 13:58, Krasimir Angelov <[hidden email]> wrote:
> 2010/9/8 Brandon S Allbery KF8NH <[hidden email]>:
>> He means that text won't work with 6.10 *if* it's changed to use the
>> conversion routines that only exist in 6.12+ as you seem to be demanding.
>
> Exactly. But it is probably possible to make version of text which
> with 6.10 uses some copy of the routines and with 6.12 uses the
> routines in base.

I highly doubt that you can simply copy/paste the IO stuff from 6.12+
to use with 6.10, unless you were willing to copy a _lot_ of code.

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Haskell Platform Proposal: add the 'text' library

Bryan O'Sullivan
In reply to this post by Krasimir Angelov-2
On Tue, Sep 7, 2010 at 8:58 PM, Krasimir Angelov <[hidden email]> wrote:

Exactly. But it is probably possible to make version of text which
with 6.10 uses some copy of the routines and with 6.12 uses the
routines in base.

It might be possible, but I am not going to do it :-)

For now at least the API should be made compatible with base.

I'm afraid not. The TextEncoding type ties encoding and decoding together, when in pure code you need just one or the other. The TextEncoding design is fine for read/write Handles, where you may need both, but it does not make sense for pure code, where the current API provided by text is more appropriate.


Re: Haskell Platform Proposal: add the 'text' library

Krasimir Angelov-2
2010/9/8 Bryan O'Sullivan <[hidden email]>:
> I'm afraid not. The TextEncoding type ties encoding and decoding together,
> when in pure code you need just one or the other. The TextEncoding design is
> fine for read/write Handles, where you may need both, but it does not make
> sense for pure code, where the current API provided by text is more
> appropriate.

I don't see how this is related to purity. I can't believe that it is
not possible to design a single coherent API suitable for both purposes.
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Haskell Platform Proposal: add the 'text' library

Duncan Coutts-4
In reply to this post by Bryan O'Sullivan
On 8 September 2010 05:53, Bryan O'Sullivan <[hidden email]> wrote:
> On Tue, Sep 7, 2010 at 8:58 PM, Krasimir Angelov <[hidden email]>
> wrote:
>>
>> Exactly. But it is probably possible to make version of text which
>> with 6.10 uses some copy of the routines and with 6.12 uses the
>> routines in base.
>
> It might be possible, but I am not going to do it :-)

In the longer term I would also like to see these unified, but I don't
think it has to be done immediately. It will require more changes in
the TextEncoding stuff than in the text package. In particular the
TextEncoding will need to be changed to be pure, e.g. using the ST
monad rather than the IO monad as it uses currently. I hope that way,
the same encoding stuff can be used for IO handles and for pure
conversions and that it can perform well in both use cases.

>> For now at least the API should be made compatible with base.
>
> I'm afraid not. The TextEncoding type ties encoding and decoding together,
> when in pure code you need just one or the other. The TextEncoding design is
> fine for read/write Handles, where you may need both, but it does not make
> sense for pure code, where the current API provided by text is more
> appropriate.

I have to say I don't understand this. It's easy to use just one
direction of encode/decode. Are you saying there are encodings where
it only makes sense to implement one direction? Or are you saying that
writing decodeUtf8 :: ByteString -> Text is just that much nicer than
writing decode utf8 :: ByteString -> Text ?

Here is a possible solution: keep the current encodeFoo/decodeFoo in
Data.Text.Encoding. Later when we get a sensible reusable TextEncoding
abstraction (e.g. by pulling it out of GHC.IO.* and making it use ST
so it can be pure) then we add to Data.Text.Encoding:

encode :: TextEncoding -> Text -> ByteString
decode :: TextEncoding -> ByteString -> Text
decodeWith :: TextEncoding -> OnDecodeError -> ByteString -> Text

and internally redefine:

decodeUtf8 = decode utf8  -- or is it utf8_bom ?
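
A minimal sketch of how that layering could look follows. `PureEncoding` and its fields are hypothetical stand-ins invented for illustration; they are not the TextEncoding type from base. The sketch relies only on the existing pure functions in Data.Text.Encoding.

```haskell
import           Data.ByteString (ByteString)
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- | Hypothetical pure encoding abstraction (NOT base's TextEncoding).
data PureEncoding = PureEncoding
  { encName   :: String
  , encEncode :: Text -> ByteString
  , encDecode :: ByteString -> Text  -- partial: throws on invalid input
  }

utf8 :: PureEncoding
utf8 = PureEncoding "UTF-8" E.encodeUtf8 E.decodeUtf8

utf16le :: PureEncoding
utf16le = PureEncoding "UTF-16LE" E.encodeUtf16LE E.decodeUtf16LE

-- | Generic entry points taking the encoding as a value.
encode :: PureEncoding -> Text -> ByteString
encode = encEncode

decode :: PureEncoding -> ByteString -> Text
decode = encDecode

-- | The specialised name becomes a thin wrapper over the generic one.
decodeUtf8 :: ByteString -> Text
decodeUtf8 = decode utf8

-- | Check that encoding and then decoding returns the original text.
roundTrips :: PureEncoding -> Text -> Bool
roundTrips enc t = decode enc (encode enc t) == t

sample :: Text
sample = T.pack "Haskell Platform"
```

Callers that need only one direction simply never touch the other field, so bundling both in one value does not force them to use both.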

Duncan

Re: Haskell Platform Proposal: add the 'text' library

Johan Tibell-2
In reply to this post by Duncan Coutts-4
On Wed, Sep 8, 2010 at 12:21 AM, Duncan Coutts <[hidden email]> wrote:
On 7 September 2010 22:50, Ian Lynagh <[hidden email]> wrote:
> Are there cases when Data.Text is significantly faster than
> Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
> Data.Text, but do we need the user to have a complete interface for
> both?)

Mm, this is a fair question. In the case of bytestring we need both
because sometimes for dealing with foreign code or IO you need the
representation to be a contiguous block of memory. For text the
representation is more abstract so that need does not arise. One
might argue that if it is simply to control strictness then one could
use the lazy version and provide a deepseq instance.

Here's an alternative argument: suppose we change the representation
of strict text to be a tree of chunks (e.g. finger tree). We could
achieve efficient concatenation. This representation would be
impossible while preserving semantics of a lazy tail. A tree impl that
has any kind of balance needs to know the overall length so cannot
have a lazy tail.

The lazy version of Text uses one more word per value than the strict version. This can be significant for small strings (e.g. ~8 characters) where the overhead per character already is quite high. If I counted the size of the BA# constructor correctly, a strict Text has a fixed overhead of 7 words and a lazy Text has an overhead of 8 words. This matters when you e.g. want to use Texts as keys in a Map.
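
As a rough worked example of those figures, the sketch below turns the word counts into bytes. The assumptions are not from the thread: 64-bit machine words, two bytes per character (UTF-16 with no surrogate pairs), and a payload rounded up to whole words. The 7 and 8 word overheads are the ones quoted above.

```haskell
-- | Assumed word size in bytes (64-bit platform).
wordBytes :: Int
wordBytes = 8

-- | Payload size for a given number of characters, assuming 2 bytes per
-- character and rounding up to a whole number of words.
payloadBytes :: Int -> Int
payloadBytes chars = roundUp (2 * chars)
  where
    roundUp n = ((n + wordBytes - 1) `div` wordBytes) * wordBytes

-- | Total bytes for a value with the given fixed overhead in words.
totalBytes :: Int -> Int -> Int
totalBytes overheadWords chars = overheadWords * wordBytes + payloadBytes chars

-- | Overheads as quoted above: 7 words (strict), 8 words (lazy).
strictBytes, lazyBytes :: Int -> Int
strictBytes = totalBytes 7
lazyBytes   = totalBytes 8
```

Under these assumptions an 8-character key costs 72 bytes strict and 80 bytes lazy, of which only 16 bytes are character data.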

Btw, I see that the BA# constructor is not manually unpacked into the Array data type. Is that done automatically since ByteArray# is unlifted or is there some room for improvement here?

Cheers,
Johan



Re: Haskell Platform Proposal: add the 'text' library

Duncan Coutts-4
On 8 September 2010 10:56, Johan Tibell <[hidden email]> wrote:

> The lazy version of Text uses one more word per value than the strict
> version. This can be significant for small strings (e.g. ~8 characters)
> where the overhead per character already is quite high. If I counted the
> size of the BA# constructor correctly, a strict Text has a fixed overhead of
> 7 words and a lazy Text has an overhead of 8 words. This matters when you
> e.g. want to use Texts as keys in a Map.

Ah, well if we're playing that game then I have a representation where
lazy uses the same storage as strict. :-)

The trick is to save a word by using smaller length and offset fields
(e.g. 16bit). That can be done for lazy but not strict because with
lazy you can always break long strings into multiple 2^16 sized chunks
whereas for strict it's essential to be able to use 32/64 bit
length/offsets.
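
The chunk-splitting half of that trick can be sketched with the public API. This is only an illustration: `maxChunk` and `toBoundedLazy` are hypothetical names, and `T.chunksOf` counts characters rather than storage units, so a real implementation would have to split on the internal representation instead.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- | Largest value an unsigned 16-bit length field can hold.
maxChunk :: Int
maxChunk = 65535

-- | Build a lazy Text whose chunks never exceed 'maxChunk' characters,
-- so each chunk's length would fit in a 16-bit field.
toBoundedLazy :: Text -> TL.Text
toBoundedLazy = TL.fromChunks . T.chunksOf maxChunk
```

A strict Text has no such escape hatch: it is a single chunk, so its length field must be able to describe the whole string.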

> Btw, I see that the BA# constructor is not manually unpacked into the Array
> data type. Is that done automatically since ByteArray# is unlifted or is
> there some room for improvement here?

I'm not sure what you're referring to here, the definition is:

data UArray i e = UArray !i !i !Int ByteArray#

The ByteArray# is an unlifted type (but its representation is a
pointer to a heap object).

Duncan

Re: Haskell Platform Proposal: add the 'text' library

John Lato-2
In reply to this post by Don Stewart-2
I'd like to first say that I'm very impressed with Ian's thoroughness of review.

On the API differences between Data.Text and Data.ByteString.Char8, I agree with Duncan that the Data.Text API is more natural for text-oriented work, although I'm slightly uncomfortable with the similarities between Data.Text and Data.List.  Everything works the same, until it doesn't because of a minor API change you didn't notice.

Would it be useful to list the API incompatibilities in the docs, either as a list or at each relevant function?  Or would that just be extra noise?

John
 
> I compared the API of Data.Text and Data.ByteString.Char8 and found a
> number of differences:

Many of these are deliberate and sensible. The thing with text as
opposed to lists/arrays is that almost all operations you want to do
are substring based and not element based. A Unicode code point (a
Char) is sadly only roughly related to the human concept of a
character. In particular there are combining characters. So even if
you want to search or split on a particular "character" that may mean
searching for a short sequence of Chars / code points.

So where the ByteString API followed the List api by being byte
oriented, the Text API is substring oriented.

> BS:   break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
> Text: break :: Text -> Text -> (Text, Text)
>      breakEnd :: Text -> Text -> (Text, Text)
>      breakBy :: (Char -> Bool) -> Text -> (Text, Text)
>
> BS:   count :: Char -> ByteString -> Int
> Text: count :: Text -> Text -> Int
>
> BS:   find :: (Char -> Bool) -> ByteString -> Maybe Char
> Text: find :: Text -> Text -> [(Text, Text)]
>      findBy :: (Char -> Bool) -> Text -> Maybe Char
>
> BS:   replicate :: Int -> Char -> ByteString
> Text: replicate :: Int -> Text -> Text
>
> BS:   split :: Char -> ByteString -> [ByteString]
> Text: split :: Text -> Text -> [Text]
>
> BS:   span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
> Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)
>
> BS:   splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
> Text: splitBy :: (Char -> Bool) -> Text -> [Text]
>
> BS:   unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a)
> Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text
>
> BS:   zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a]
> Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text
>
> I think the two APIs ought to be brought into agreement.

Perhaps. If so, then it is the ByteString.Char8 that ought to be
brought into agreement with Text, not the other way around. I think
Text is right in this area. On the other hand, perhaps it makes sense
for ByteString.Char8 to remain like the ByteString byte interface
which is byte oriented (and probably rightly so). I hope the
significance and use of ByteString.Char8 will decrease as Text becomes
more popular. ByteString.Char8 is really just for the cases where
you're handling ASCII-like protocols.
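
The point about combining characters can be made concrete with a small sketch. It uses the names exported by current releases of text (`T.splitOn`, `T.breakOn`, `T.split`), which differ from the 0.8.0.0 names in the listing above; `eAcute`, `sample`, `bySubstring` and `byChar` are names invented for the example.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- | One user-perceived character made of two code points:
-- 'e' followed by U+0301 COMBINING ACUTE ACCENT.
eAcute :: Text
eAcute = T.pack "e\x301"

sample :: Text
sample = T.concat [T.pack "caf", eAcute, T.pack " au lait"]

-- | Substring-based split: treats the two code points as one unit.
bySubstring :: [Text]
bySubstring = T.splitOn eAcute sample

-- | Element-based split on 'e': leaves the combining accent orphaned
-- at the start of the second piece.
byChar :: [Text]
byChar = T.split (== 'e') sample
```

Here `bySubstring` yields "caf" and " au lait", while `byChar` yields "caf" and a second piece that begins with a stray U+0301.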


Re: Haskell Platform Proposal: add the 'text' library

Johan Tibell-2
In reply to this post by Duncan Coutts-4
On Wed, Sep 8, 2010 at 12:05 PM, Duncan Coutts <[hidden email]> wrote:
On 8 September 2010 10:56, Johan Tibell <[hidden email]> wrote:
> Btw, I see that the BA# constructor is not manually unpacked into the Array
> data type. Is that done automatically since ByteArray# is unlifted or is
> there some room for improvement here?

I'm not sure what you're referring to here, the definition is:

data UArray i e = UArray !i !i !Int ByteArray#

The ByteArray# is an unlifted type (but its representation is a
pointer to a heap object).

The BA# constructor also includes a length field. My question is whether that gets unpacked into the Array constructor (as in Data.Text.Array, not UArray).

-- Johan
 


Re[2]: Haskell Platform Proposal: add the 'text' library

Bulat Ziganshin-2
Hello Johan,

Wednesday, September 8, 2010, 3:13:36 PM, you wrote:

>  data UArray i e = UArray !i !i !Int ByteArray#
>  
>  The ByteArray# is an unlifted type (but its representation is a
>  pointer to a heap object).

> The BA# constructor also includes a length field.

It is the size of the allocated memory area, i.e. it is rounded up to a multiple of 4 (or 8) bytes.

--
Best regards,
 Bulat                            mailto:[hidden email]


Re: Haskell Platform Proposal: add the 'text' library

Ian Lynagh
In reply to this post by Bryan O'Sullivan
On Tue, Sep 07, 2010 at 07:10:27PM -0700, Bryan O'Sullivan wrote:
> Thanks for your comments, Ian. I appreciate your time and care in looking
> this over!

Actually, it's interesting you say that, because I don't think I looked
at the package carefully. I didn't look at the source at all, I briefly
skimmed the haddocks (mostly just to check that it looks like they all
existed, as that's one of the criteria), and I didn't check that the
package API looks sensible and consistent. In fact, the only reason I
looked at the API at all was that I had something to diff it against.

As a comment on the process, perhaps we should require that there are 2
or 3 people who can say that they have used the API (perhaps with hpc
results to see /how much/ they use it), and that it seems sensible (i.e.
they weren't having to work around missing or broken functionality).



Actually, I've just taken a quick look at a random bit of code, and
with Data.Text.Foreign.fromPtr and

init :: Text -> Text
init (Text arr off len) | len <= 0                   = emptyError "init"
                        | n >= 0xDC00 && n <= 0xDFFF = textP arr off (len-2)
                        | otherwise                  = textP arr off (len-1)
    where
      n = A.unsafeIndex arr (off+len-1)

it looks like I can create a Text with length -1 by doing
(init (fromPtr [0xDC00] 1)), which makes me nervous. I wonder if fromPtr
should be renamed unsafeFromPtr. init would still make me nervous,
though.
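
The arithmetic behind that concern can be reproduced with a simplified model. Everything below is hypothetical: `Model` stands in for the internal representation (UTF-16 code units plus offset and length), and `modelInit` copies only the branch structure of the `init` quoted above, without whatever checks `textP` performs.

```haskell
import Data.Word (Word16)

-- | Hypothetical model: code units, offset, length.
data Model = Model [Word16] Int Int
  deriving Show

modelLength :: Model -> Int
modelLength (Model _ _ len) = len

-- | Same three branches as the quoted 'init': drop two code units when
-- the last one is a low surrogate, otherwise drop one.
modelInit :: Model -> Model
modelInit (Model arr off len)
  | len <= 0                   = error "init: empty input"
  | n >= 0xDC00 && n <= 0xDFFF = Model arr off (len - 2)
  | otherwise                  = Model arr off (len - 1)
  where
    n = arr !! (off + len - 1)

-- | Invalid input: a low surrogate with no preceding high surrogate.
loneLowSurrogate :: Model
loneLowSurrogate = Model [0xDC00] 0 1
```

In this model, `modelLength (modelInit loneLowSurrogate)` is -1: the surrogate branch assumes a well-formed pair and subtracts two from a length of one.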


By the way, fromPtr asserts (len > 0), but from the haddock docs I'd
assume that (fromPtr p 0) is fine.

> > Incidentally, I've just noticed some broken haddock markup for:
> >    I/O libraries /do not support locale-sensitive I\O
> > in
> >
> > http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-IO.html
>
>
> Thanks for spotting that. It appears to be due to a Haddock bug,
> unfortunately.

Looking at the source, I'd guess you can work around it by moving the
linebreaks. And it actually looks like 2 haddock bugs: /.../ can't span
lines, and \/ isn't recognised inside /.../. Would be good to get
haddock tickets filed.

> >    Subject to fusion.
> > but I can't see an explanation for the new user of what this means or
> > why they should care.
>
>
> That's not quite true: it's actually the very first thing documented:

Sorry, my fault! I read the "Description" at the top, and erroneously
assumed that only function-specific docs would follow the "Synopsis".

> > And replace's docs just say
> >    O(m+n) Replace every occurrence of one substring with another.
> > but should presumably be O(n*m). It's also not necessarily clear what m
> > and n refer to.
>
> The two parameters to the function?

But replace takes 3 arguments!

The complexity must be at least the second and third multiplied
together, as
    replace "x" (replicate y 'y') (replicate z 'x')
makes y*z words in the heap.
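
A small sketch of that blow-up, using the current text API, in which `T.replicate` takes a Text rather than a Char and `T.replace` takes needle, replacement and haystack in that order. `blowUp` is a name invented for the example.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- | Replace each of the z occurrences of "x" with a run of y copies of
-- "y", so the result holds y * z characters.
blowUp :: Int -> Int -> Text
blowUp y z =
  T.replace (T.pack "x")
            (T.replicate y (T.pack "y"))
            (T.replicate z (T.pack "x"))
```

Since the output alone has length y * z, no bound that only adds the argument lengths can hold for this input.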

> > > unicode-unaware case conversion (map toUpper is an unsafe case
> > > conversion)
> >
> > Surely this is something that should be added to Data.Char, irrespective
> > of whether text is added to the HP?
>
> Yes, but that's a not-this-problem problem.

Oh, I didn't mean to suggest that you should fix it. I just don't think
it motivates adding the text package to the HP, and thus doesn't belong
in the proposal.

> > > RecordWildCards
> >
> > I'm not a fan, but I fear I may be in the minority.
>
> It's just used internally, so why do you mind?

I'm sure I'll need to look at the code at some point.

> > There are a number of other differences which probably want to be tidied
> > up (mostly functions which are in one package but not the other, and
> > ByteString has IO functions mixed in with the non-IO functions), but
> > those seemed to be the most significant ones. Also,
> >    prefixed :: Text -> Text -> Maybe Text
> > is analogous to
> >    stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
> > in Data.List
>
> I hadn't seen that. Hmm. For use with view patterns, I prefer the name I'm
> using right now.

I'd like us to proceed in a way that means we haven't still got
Data.List.stripPrefix and Data.Text.prefixed in the HP in 3 years time.


Thanks
Ian
