ANNOUNCE: text 0.8.0.0, fast Unicode text support

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

ANNOUNCE: text 0.8.0.0, fast Unicode text support

Bryan O'Sullivan
New in this release:
  • Substantial performance improvements.
  • Bug fixes, and better quality assurance via automated QuickCheck and HPC tests and coverage.
  • An even nicer API than before.
The text library provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework.

The 'Text' type represents Unicode character strings, in a time and space-efficient manner. This package provides text processing capabilities that are optimized for performance critical use, both in terms of large data quantities and high speed.

The text package provides character-encoding, type-safe case conversion via whole-string case conversion functions. It also provides a range of functions for converting Text values to and from ByteStrings, using several standard encodings (see the text-icu package for a much larger variety of encoding functions). Efficient locale-sensitive support for text I/O is also supported.


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

John Millikin
Is there a summary of the API changes available? I see a new module,
but Precis is choking on Data.Text and Data.Text.Lazy, so I'm not sure
what existing signatures have been modified.

> Don't forget, you can always improve the text library yourself. I love to receive
> patches, requests for improvement, and bug reports.

Are there any areas in particular you'd like help with, for either
library? I'm happy to assist any effort which will help reduce use of
String.

[aside] does anybody know how to get a list of what packages
somebody's uploaded to Hackage? I think I've updated all mine for the
new "text" version dependency, but I'm worried I forgot some.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Ivan Lazar Miljenovic
On 1 September 2010 15:14, John Millikin <[hidden email]> wrote:
> [aside] does anybody know how to get a list of what packages
> somebody's uploaded to Hackage? I think I've updated all mine for the
> new "text" version dependency, but I'm worried I forgot some.

Unfortunately, no; I've often wished the same thing (not for my
packages, but because I could have sworn someone uploaded a package
that did something, I remembered the maintainer but not the name of
the package...).

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Tako Schotanus
In reply to this post by John Millikin
On Wed, Sep 1, 2010 at 07:14, John Millikin <[hidden email]> wrote:

> Don't forget, you can always improve the text library yourself. I love to receive
> patches, requests for improvement, and bug reports.

Are there any areas in particular you'd like help with, for either
library? I'm happy to assist any effort which will help reduce use of
String.

As a Haskell noob I'm curious about this statement, is there something intrinsically wrong with String?
Or is it more a performance/resource problem when dealing with large amounts of text for example?
(Like having to use StringBuilder in Java if you want to avoid the penalty of repeated String allocations when simply concatenating for example)

Cheers,
 -Tako


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

David Virebayre
2010/9/1 Tako Schotanus <[hidden email]>:
> As a Haskell noob I'm curious about this statement, is there something
> intrinsically wrong with String?

String is just a linked list of Char which are unicode code points;
which is probably not the optimal way to store text.
For intensive use of text it takes too much memory, and/or it's not
fast enough.

Sometimes I'd love if I could program using String and the compiler
would automatically convert that to Text, or Bytestrings, but of
course it's wishful thinking.

David.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Kevin Jardine-3
In reply to this post by Tako Schotanus
Hi Tako,

The issues involved with String, ByteString, Text and a few related
libraries were discussed at great length recently in this thread:

http://groups.google.com/group/haskell-cafe/browse_thread/thread/52a21cf61ffb21b0/

Basically, Chars are 32 bit integers and Strings are represented as a
list of Chars.

This is very convenient for small computations but often very
inefficient for anything large scale.

The String API  is also missing various encoding related features.

Because of the limitations of String, various alternative libraries
have been proposed. Text is one important option.

You'll find much more detail on the above referenced thread.

Kevin

On Sep 1, 8:13 am, Tako Schotanus <[hidden email]> wrote:

> On Wed, Sep 1, 2010 at 07:14, John Millikin <[hidden email]> wrote:
>
> > > Don't forget, you can always improve the text library yourself. I love to
> > receive
> > > patches, requests for improvement, and bug reports.
>
> > Are there any areas in particular you'd like help with, for either
> > library? I'm happy to assist any effort which will help reduce use of
> > String.
>
> As a Haskell noob I'm curious about this statement, is there something
> intrinsically wrong with String?
> Or is it more a performance/resource problem when dealing with large amounts
> of text for example?
> (Like having to use StringBuilder in Java if you want to avoid the penalty
> of repeated String allocations when simply concatenating for example)
>
> Cheers,
>  -Tako
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Ivan Lazar Miljenovic
In reply to this post by David Virebayre
On 1 September 2010 16:27, David Virebayre <[hidden email]> wrote:

> 2010/9/1 Tako Schotanus <[hidden email]>:
>> As a Haskell noob I'm curious about this statement, is there something
>> intrinsically wrong with String?
>
> String is just a linked list of Char which are unicode code points;
> which is probably not the optimal way to store text.
> For intensive use of text it takes too much memory, and/or it's not
> fast enough.
>
> Sometimes I'd love if I could program using String and the compiler
> would automatically convert that to Text, or Bytestrings, but of
> course it's wishful thinking.

Well, there's OverloadedStrings for String literals...

But the problem with automagic conversion would be one of encoding for
Bytestrings at least.

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Tako Schotanus
In reply to this post by Kevin Jardine-3
Hi Kevin,

thanks for the pointer, although I was aware of the thread and had followed it quite closely, it was quite interesting.
But it never explained if and why String should be avoided, all I read is "test and decide depending on the circumstances", which in itself is good advise, but I'd like to have an idea of the reasons so I can form in idea before actually having to code any benchmarks :)

Knowing that String literally is a linked list of Char makes it a lot clearer. I figured that maybe Haskell could be using some more efficient mechanism for Strings internally, only treating it outwardly as a [Char]. But I guess that in a lot of circumstances where you're just working with small pieces of text in non-performance critical code it's perfectly okay to use String.

Cheers,
-Tako


On Wed, Sep 1, 2010 at 08:31, Kevin Jardine <[hidden email]> wrote:
Hi Tako,

The issues involved with String, ByteString, Text and a few related
libraries were discussed at great length recently in this thread:

http://groups.google.com/group/haskell-cafe/browse_thread/thread/52a21cf61ffb21b0/

Basically, Chars are 32 bit integers and Strings are represented as a
list of Chars.

This is very convenient for small computations but often very
inefficient for anything large scale.

The String API  is also missing various encoding related features.

Because of the limitations of String, various alternative libraries
have been proposed. Text is one important option.

You'll find much more detail on the above referenced thread.

Kevin

On Sep 1, 8:13 am, Tako Schotanus <[hidden email]> wrote:
> On Wed, Sep 1, 2010 at 07:14, John Millikin <[hidden email]> wrote:
>
> > > Don't forget, you can always improve the text library yourself. I love to
> > receive
> > > patches, requests for improvement, and bug reports.
>
> > Are there any areas in particular you'd like help with, for either
> > library? I'm happy to assist any effort which will help reduce use of
> > String.
>
> As a Haskell noob I'm curious about this statement, is there something
> intrinsically wrong with String?
> Or is it more a performance/resource problem when dealing with large amounts
> of text for example?
> (Like having to use StringBuilder in Java if you want to avoid the penalty
> of repeated String allocations when simply concatenating for example)
>
> Cheers,
>  -Tako
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Daniel Fischer-4
In reply to this post by Bryan O'Sullivan
On Wednesday 01 September 2010 06:08:07, Bryan O'Sullivan wrote:
> New in this release:
>
>    - Substantial performance improvements.
>    - Bug fixes, and better quality assurance via automated QuickCheck
> and HPC tests and coverage.
>    - An even nicer API than before.

Good job.

>
> The text library provides an efficient packed, immutable Unicode text
> type (both strict and lazy), with a powerful loop fusion optimization
> framework.
>
> The 'Text' type represents Unicode character strings, in a time
> and space-efficient manner. This package provides text
> processing capabilities that are optimized for performance critical use,
> both in terms of large data quantities and high speed.

Hmm, I have to throw a few grains of salt into that.

File: big.txt from Peter Norvig's spelling checker (6.2MiB)

1. read the file and output it to stdout (redirected to /dev/null)
ByteString.Lazy: <= 0.01s
text-0.7.2.1 (Lazy): 0.68s
text-0.8.0.0 (Lazy): 0.26s
String: 0.35s

ByteString doesn't do any de/encoding or validity checking, so it's clear
that can be way faster than I/O which does.
The interesting part is the comparison between text and vanilla String I/O,
the difference is smaller than I expected for text-0.8.0.0. It came as a
surprise that text-0.7.2.1 actually had much worse performance than vanilla
Strings.

Okay, now we know how long the I/O takes, let's measure some work.

2. read the file, replace some string, output to stdout (redirected to
/dev/null)
a) "Professor" -> "Teacher"
stringsearch (Lazy): 0.02s, 2MB total
text-0.7.2.1 (Lazy): 0.92s, 28MB total
text-0.7.2.1 (Strict): 0.90s, 65MB total
text-0.8.0.0 (Lazy): 0.44s, 12MB total
text-0.8.0.0 (Strict): 0.36S, 55MB total
String (naive): 0.51s, 1MB total
String (KMP): 0.46s, 1MB total

b) "deteriorate" -> "worsen"
stringsearch (Lazy): 0.02s, 2MB total
text-0.7.2.1 (Lazy): 1.01s, 39MB total
text-0.7.2.1 (Strict): 0.91s, 65MB total
text-0.8.0.0 (Lazy): 0.45s, 17MB total
text-0.8.0.0 (Strict): 0.36s, 55MB total
String (naive): 0.52s, 1MB total
String (KMP): 0.50s, 1MB total

c) "hen" -> "here"
stringsearch (Lazy): 0.04s, 2MB total
text-0.7.2.1 (Lazy): 1.10s, 2MB total
text-0.7.2.1 (Strict): 0.95s, 65MB total
text-0.8.0.0 (Lazy): 0.66s, 2MB total
text-0.8.0.0 (Strict): 0.41s, 55MB total
String (naive): 0.53s, 1MB total
String (KMP): 0.51s, 1MB total

d) "kre" -> "here"
stringsearch (Lazy): 0.03s, 2MB total
text-0.7.2.1 (Lazy): 1.23s, 39MB total
text-0.7.2.1 (Strict): 0.96s, 65MB total
text-0.8.0.0 (Lazy): 0.65s, 17MB total
text-0.8.0.0 (Strict): 0.40s, 55MB total
String (naive): 0.51s, 1MB total
String (KMP): 0.47s, 1MB total

The performance of text-0.8.0.0 has improved significantly over that of
text-0.7.2.1 (for the tested features), the improvement of the replacing
algorithm is however not as impressive as that of I/O.

Replacing isn't much faster than for Strings, however, for (some?) very
short patterns, even slower.
That's not so good.

What's *really* bad is the space behaviour.

My tests suggest that the space usage (for Text.Lazy) depends on the index
of the pattern to replace, the later it appears (if at all), the more
memory is used by text. Apparently it doesn't deliver the beginning before
a match has been found (or the whole string has been scanned).
That space leak ought to be fixed.

The space behaviour of the strict variant doesn't show that problem, it
seems to only depend on the file, but it uses disproportionally much
memory.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Bryan O'Sullivan
Hi, Daniel -

Thanks for taking the new code for a test drive!
 
The interesting part is the comparison between text and vanilla String I/O,
the difference is smaller than I expected for text-0.8.0.0.

Yes. Much of this is due to the new encoding stuff on Handles in GHC 6.12, which is slow. Its performance wasn't so noticeable when it was only shipping String around, but it's much more visible with Text. It's far slower on a Mac than on Linux, in case that's relevant.
 
The performance of text-0.8.0.0 has improved significantly over that of
text-0.7.2.1 (for the tested features), the improvement of the replacing
algorithm is however not as impressive as that of I/O.

I'd bet you that's mostly because the program is I/O bound, in the sense that it's spending time going through the layers of buffering and translation that now make up a Handle. Any improvement in other code is going to be hard to see because of that.

The other major consideration, both for this case and the first one you note, is that the inliner in 6.12 chokes on code that uses stream fusion: it boxes and unboxes vast quantities of state. That kills performance due to both the boxing and unboxing overhead and the increased number of nursery GCs.

The marvelous new 6.13 inliner does a much better job here - that's where I see those 3x performance improvements for free.

What's *really* bad is the space behaviour.

What are you using to measure that?

Also, please don't forget to post your benchmark code when you make observations like this, as that way I can reproduce your measurements and fix problems. I appreciate your help!

Regards,
Bryan.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Bryan O'Sullivan
In reply to this post by John Millikin
On Tue, Aug 31, 2010 at 10:14 PM, John Millikin <[hidden email]> wrote:
Is there a summary of the API changes available? I see a new module,
but Precis is choking on Data.Text and Data.Text.Lazy, so I'm not sure
what existing signatures have been modified.

Ouch. I'll try to do a diff and follow up, probably not until much later.
 
Are there any areas in particular you'd like help with, for either
library? I'm happy to assist any effort which will help reduce use of
String.

For the text library, my main interest is performance. I've been far more focused on making the code correct and complete than on making it fast, but now that the API is stable and the test coverage is solid, performance deserves more attention.

What specifics would be useful?
  • Reproducible, realistic (but preferably small) benchmarks
  • Patches that speed things up
Thanks for asking!

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Daniel Fischer-4
In reply to this post by Bryan O'Sullivan
On Wednesday 01 September 2010 18:15:19, Bryan O'Sullivan wrote:

> Hi, Daniel -
>
> Thanks for taking the new code for a test drive!
>
> > The interesting part is the comparison between text and vanilla String
> > I/O, the difference is smaller than I expected for text-0.8.0.0.
>
> Yes. Much of this is due to the new encoding stuff on Handles in GHC
> 6.12, which is slow. Its performance wasn't so noticeable when it was
> only shipping String around, but it's much more visible with Text. It's
> far slower on a Mac than on Linux, in case that's relevant.
>

I'm on Linux. I guess that's another point in favour of it:)
Do you happen to know why it's slower on a Mac?

> > The performance of text-0.8.0.0 has improved significantly over that
> > of text-0.7.2.1 (for the tested features), the improvement of the
> > replacing algorithm is however not as impressive as that of I/O.
>
> I'd bet you that's mostly because the program is I/O bound, in the sense
> that it's spending time going through the layers of buffering and
> translation that now make up a Handle. Any improvement in other code is
> going to be hard to see because of that.

Maybe. As a rough measure, I took (total time - time for just reading and
outputting) for an approximation of the processing (replacing) time.
For the first example (Lazy), that yields
0.7.2.1: 0.92s - 0.68s = 0.24s
0.8.0.0: 0.44s - 0.26s = 0.18s,
for the fourth
0.7.2.1: 1.23s - 0.68s = 0.55s
0.8.0.0: 0.65s - 0.26s = 0.39s

Yes, I know it's very crude, but TIO.readFile file >>= TIO.putStrLn does no
less translation than with a replacing in between, or does it?
So I tentatively believe most of the difference is spent doing the
replacements.

>
> The other major consideration, both for this case and the first one you
> note, is that the inliner in 6.12 chokes on code that uses stream
> fusion: it boxes and unboxes vast quantities of state. That kills
> performance due to both the boxing and unboxing overhead and the
> increased number of nursery GCs.

I haven't looked at the core, so I take your word for it. I know the
behaviour from other situations.

>
> The marvelous new 6.13 inliner does a *much* better job here - that's
> where I see those 3x performance improvements for free.

That's nice. Maybe I should go get the HEAD before 6.14.1 will be released.

>
> What's *really* bad is the space behaviour.
>
>
> What are you using to measure that?

+RTS -s and top. The figures I gave were the "total memory in use" of
+RTS -s, maximum residency was 20MB (vs. 39 total) for 0.7 and 12MB (vs. 17
total) for 0.8 in the bad tests.

>
> Also, please don't forget to post your benchmark code when you make
> observations like this,

It's not very interesting,

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO

main :: IO ()
main = do
    (file : pat : sub : _) <- getArgs
    let !spat = T.pack pat
        !ssub = T.pack sub
        work = T.replace spat ssub
    TIO.readFile file >>= TIO.putStrLn . work

with the obvious changes of imports for the strict Text and the almost
obvious changes for the ByteString code.

> as that way I can reproduce your measurements
> and fix problems. I appreciate your help!

I can now say more. Looking at Data.Text.Lazy.replace,

replace s d = intercalate d . split s

, I also got a space leak with that for BS.Lazy's intercalate and
stringsearch's split. Looking at intercalate,

intercalate t = concat . (Data.List.intersperse t)

, that's where you definitely get a space leak, because

intersperse             :: a -> [a] -> [a]
intersperse _   []      = []
intersperse _   [x]     = [x]
intersperse sep (x:xs)  = x : sep : intersperse sep xs

isn't lazy enough.
Given the context, it's not hard to see what's wrong here. Before
intersperse produces any output, it checks whether the list contains at
least two elements (if any). If a is a list-like type, like String, lazy
ByteStrings or lazy Text, where the elements of the list are lazily
produced one after the other, as is the case for split, each element must
be complete before it can be delivered and then consumed.
So indeed, replace needs O(index pat) space :(

I don't think that fixing Data.List.intersperse will fix your space leak,
though.

split pat src
    | null pat        = emptyError "split"
    | isSingleton pat = splitBy (== head pat) src
    | otherwise       = go 0 (indices pat src) src
  where
    go  _ []     cs = [cs]
    go !i (x:xs) cs = let h :*: t = splitAtWord (x-i) cs
                      in  h : go (x+l) xs (dropWords l t)
    l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat

Using the list of indices of the pattern, split can't deliver anything
before the first (next) occurrence of the pattern has been located (or the
end of the string reached). Again, that forces O(index pat) space
consumption.

I don't see how to avoid that without duplicating a large part of indices'
logic in a dedicated breaking/splitting function.

On a related note,

break :: Text -> Text -> (Text, Text)
break pat src
    | null pat  = emptyError "break"
    | otherwise = case indices pat src of
                    []    -> (src, empty)
                    (x:_) -> let h :*: t = splitAtWord x src
                             in  (h, t)

has the same problem.

Cheers,
Daniel


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Laziness bug in Data.List.intersperse (was: ANNOUNCE: text 0.8.0.0, fast Unicode text support)

Daniel Fischer-4
On Wednesday 01 September 2010 21:29:47, Daniel Fischer wrote:
> that's where you definitely get a space leak, because
>
> intersperse             :: a -> [a] -> [a]
> intersperse _   []      = []
> intersperse _   [x]     = [x]
> intersperse sep (x:xs)  = x : sep : intersperse sep xs
>
> isn't lazy enough.

Ticket created, http://hackage.haskell.org/trac/ghc/ticket/4282
I'm not keen on subscribing to libraries@ to follow the official proposal
process, any takers?

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Brandon S Allbery KF8NH
In reply to this post by Ivan Lazar Miljenovic

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/01/2010 02:44 AM, Ivan Lazar Miljenovic wrote:
> On 1 September 2010 16:27, David Virebayre
> [hidden email] wrote:
>> Sometimes I'd love if I could program using String and the
>> compiler would automatically convert that to Text, or
>> Bytestrings, but of course it's wishful thinking.
> Well, there's OverloadedStrings for String literals...
>
> But the problem with automagic conversion would be one of encoding
> for Bytestrings at least.


This reminds me:  it would be nice if Text, ByteString, etc. exported
their own definitions of "type String".  As most users have to import
them qualified to avoid collisions with Data.List (and the Prelude?)
anyway, this would enable someone to import their desired module as,
say, S, then use type S.String and string ops S.foo.  Anyone who *did*
use them unqualified would have to add (hiding (String)) to their
imports, of course.

The flip side of this would be a module which contained the definition
for String and re-exported Data.List, so that the above S.mumble could
work for String as well.  (The obvious choice would be Data.String,
but that's taken for IsString and of course that would be required to
pull this off so it can't just have the list implementation stuffed
into it.)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx/FsQACgkQIn7hlCsL25Vk8ACeKco9/IFwGi+8gc+BxSyT3QwY
oP8An3oabG1lbXHChShbenEWj9HWEq6Q
=PQsO
-----END PGP SIGNATURE-----


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Laziness bug in Data.List.intersperse (was: ANNOUNCE: text 0.8.0.0, fast Unicode text support)

Bryan O'Sullivan
In reply to this post by Daniel Fischer-4
On Wed, Sep 1, 2010 at 1:00 PM, Daniel Fischer <[hidden email]> wrote:
I'm not keen on subscribing to libraries@ to follow the official proposal
process, any takers?

I'll take it up. 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Bryan O'Sullivan
In reply to this post by Daniel Fischer-4
On Wed, Sep 1, 2010 at 12:29 PM, Daniel Fischer <[hidden email]> wrote:
I'm on Linux. I guess that's another point in favour of it:)
Do you happen to know why it's slower on a Mac?

I'd guess because of something to do with the system iconv.
 
So I tentatively believe most of the difference is spent doing the
replacements.

I have a Replace.hs benchmark in the main text repo, just to be sure we're talking about the same thing. Factoring out the time spent on I/O, with GHC HEAD, my replace code takes twice the space and time of that in the stringsearch package. Given that the space involved is just 121KB maximum residency while processing a 124MB file, I'm not concerned about it. And the time required isn't a bad place to start from, I think.

By the way, as this implies, I can't reproduce your space behaviour at all.

I can now say more. Looking at Data.Text.Lazy.replace,

replace s d = intercalate d . split s 
, I also got a space leak with that for BS.Lazy's intercalate and
stringsearch's split.

How did you observe the space leak? Looking at -hT charted with hp2ps shows me nothing, and the program executes in constant space regardless of input size.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Daniel Fischer-4
On Wednesday 08 September 2010 06:44:32, Bryan O'Sullivan wrote:
> I have a Replace.hs benchmark in the main text repo, just to be sure
> we're talking about the same thing.

Grabbed the darcs repo, compiled with the darcs version and also with the
installed text package, both exhibit the same behaviour - what I reported
before.

> Factoring out the time spent on I/O,
> with GHC HEAD, my replace code takes twice the space and time of that in
> the stringsearch package.

That's very good. Twice the space is a given for mostly ASCII text, twice
the time is okay, I think.
My timings are quite different, but that's probably because 6.12.3's
inliner doesn't give the full fusion benefit, so it'll improve
automatically with the next GHC release.

> Given that the space involved is just 121KB
> maximum residency while processing a 124MB file, I'm not concerned about
> it.

I wouldn't either.

> And the time required isn't a bad place to start from, I think.
>
> By the way, as this implies, I can't reproduce your space behaviour at
> all.
>

That's surprising.
Have you made sure to replace a pattern which does not occur in the text?
Can you reproduce the behaviour with a) Data.List.intersperse instead of
the lazier version now used, b) ghc-6.12.* instead of HEAD?

Anyway, I would've thought that with

split pat src
    | null pat        = emptyError "split"
    | isSingleton pat = splitBy (== head pat) src
    | otherwise       = go 0 (indices pat src) src
  where
    go  _ []     cs = [cs]
    go !i (x:xs) cs = let h :*: t = splitAtWord (x-i) cs
                      in  h : go (x+l) xs (dropWords l t)
    l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat

you can't start returning chunks before it's known whether the list of
indices is empty, so split would have O(index of pattern) space behaviour.

If HEAD manages to make the chunks available before they are complete
(before it's known how long they will be), it's even awesomer than I'd have
dared to hope.
Okay, so I'll have to try HEAD.

> > I can now say more. Looking at Data.Text.Lazy.replace,
> >
> > replace s d = intercalate d . split s
> >
> > , I also got a space leak with that for BS.Lazy's intercalate and
> >
> > stringsearch's split.
>
> How did you observe the space leak? Looking at -hT charted with hp2ps
> shows me nothing, and the program executes in constant space regardless
> of input size.
top and +RTS -s. I've attached the -hT graph of a run on a 74.3MB file with
a pattern that does not occur. It shows exactly the behaviour I expected
from the code of split, pretty constant growth of the heap until about
twice the file size is reached, then fast and pretty constant shrinking as
the text is output. The graphs are much more interesting if you do
replacements of patterns without large gaps between occurrences.


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

nbenchText.ps (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ANNOUNCE: text 0.8.0.0, fast Unicode text support

Daniel Fischer-4
On Wednesday 08 September 2010 13:46:13, Daniel Fischer wrote:
> My timings are quite different, but that's probably because 6.12.3's
> inliner doesn't give the full fusion benefit, so it'll improve
> automatically with the next GHC release.
>

Or maybe not so much. Just built the latest source bundle from the HEAD
branch and

6.12.3:
./nbench lazyText bigfile krkx rabi +RTS -s
   1,796,245,884 bytes allocated in the heap
       1,125,596 bytes copied during GC
     110,398,048 bytes maximum residency (8 sample(s))
      38,897,164 bytes maximum slop
             191 MB total memory in use (4 MB lost due to fragmentation)

  Generation 0:  3043 collections,     0 parallel,  3.06s,  3.17s elapsed
  Generation 1:     8 collections,     0 parallel,  0.00s,  0.01s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.03s  (  6.48s elapsed)
  GC    time    3.07s  (  3.18s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    9.10s  (  9.66s elapsed)

  %GC time      33.7%  (33.0% elapsed)

  Alloc rate    297,965,335 bytes per MUT second

  Productivity  66.3% of total user, 62.4% of total elapsed

6.13.20100831:
./hdbench lazyText bigfile krkx rabi +RTS -s                                            
     543,409,296 bytes allocated in the heap                                            
         699,364 bytes copied during GC                                                
     110,956,008 bytes maximum residency (8 sample(s))                                  
      38,893,040 bytes maximum slop                                                    
             191 MB total memory in use (4 MB lost due to fragmentation)                

  Generation 0:   652 collections,     0 parallel,  0.44s,  0.43s elapsed
  Generation 1:     8 collections,     0 parallel,  0.00s,  0.01s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.42s  (  5.77s elapsed)
  GC    time    0.44s  (  0.44s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.86s  (  6.21s elapsed)

  %GC time       7.5%  (7.1% elapsed)

  Alloc rate    100,327,729 bytes per MUT second

  Productivity  92.5% of total user, 87.2% of total elapsed


Sure, that's a significant improvement, but that's mostly the GC time, with
-A64M, 6.12.3 is much closer.

However, for ByteStrings, performance got worse:

6.12.3:
./nbench lazyBS bigfile krkx rabi +RTS -s                                            
      90,127,112 bytes allocated in the heap                                          
          31,116 bytes copied during GC                                              
         103,396 bytes maximum residency (1 sample(s))                                
          39,964 bytes maximum slop                                                  
               2 MB total memory in use (0 MB lost due to fragmentation)              

  Generation 0:   158 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.10s  (  0.20s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.11s  (  0.20s elapsed)

  %GC time       3.6%  (1.8% elapsed)

  Alloc rate    834,456,211 bytes per MUT second

  Productivity  92.9% of total user, 50.9% of total elapsed

6.13.20100831:
./hdbench lazyBS bigfile krkx rabi +RTS -s                                              
     478,710,672 bytes allocated in the heap                                            
         164,904 bytes copied during GC                                                
          86,992 bytes maximum residency (1 sample(s))                                  
          44,080 bytes maximum slop                                                    
               2 MB total memory in use (0 MB lost due to fragmentation)                

  Generation 0:   864 collections,     0 parallel,  0.00s,  0.01s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.17s  (  0.28s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.18s  (  0.29s elapsed)

  %GC time       2.3%  (4.1% elapsed)

  Alloc rate    2,783,039,776 bytes per MUT second

  Productivity  95.5% of total user, 57.3% of total elapsed


Not only got it slower, it also allocates more than five times as much as
before.

> > Given that the space involved is just 121KB
> > maximum residency while processing a 124MB file, I'm not concerned
> > about it.
>
> I wouldn't either.
>

But it needs more space here, so I am concerned.

> > And the time required isn't a bad place to start from, I think.
> >
> > By the way, as this implies, I can't reproduce your space behaviour at
> > all.
>
> That's surprising.
> Have you made sure to replace a pattern which does not occur in the
> text? Can you reproduce the behaviour with a) Data.List.intersperse
> instead of the lazier version now used, b) ghc-6.12.* instead of HEAD?
>
> Anyway, I would've thought that with
>
> split pat src
>
>     | null pat        = emptyError "split"
>     | isSingleton pat = splitBy (== head pat) src
>     | otherwise       = go 0 (indices pat src) src
>
>   where
>     go  _ []     cs = [cs]
>     go !i (x:xs) cs = let h :*: t = splitAtWord (x-i) cs
>                       in  h : go (x+l) xs (dropWords l t)
>     l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat
>
> you can't start returning chunks before it's known whether the list of
> indices is empty, so split would have O(index of pattern) space
> behaviour.
>
> If HEAD manages to make the chunks available before they are complete
> (before it's known how long they will be), it's even awesomer than I'd
> have dared to hope.
> Okay, so I'll have to try HEAD.
>

Doesn't do much here. Still leaking, still > 20 times slower than
ByteString, even for the now far worse ByteString times.
What's going on?

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe