FPS/Data.ByteString candidate

Re: Data.ByteString candidate 3

John Meacham
On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
> Using the Word8 API is not very pleasant, because all
> character constants etc are not Word8.

yeah, but using the version restricted to latin1 seems rather special
case. I can't imagine it being used in general internally (or I certainly
hope it won't be) unless people are already doing low level stuff. In this
day and age, I expect unicode to work pretty much everywhere.

> This is very useful for many purposes and does not mean that there
> should not be a fancy UTF8 module. Rather than arguing about killing
> this, wouldn't it be more productive to create the UTF8 module?

I am not saying we should kill the latin1 version, since there is
interest in it, just that it doesn't fill the need for a general fast
string replacement.

> > but note, do the people that want latin1 just need ASCII? because it should be
> > noted that if we have a UTF8 PackedString, then we can make
> > ASCII-specific access routines that are just as fast as the ones in the
> > Latin1 variety without giving up the ability to store full unicode
> > values in the string.
>
> Case conversions and ordering need to be different. Thus we need to newtype
> things to avoid having two conflicting Ord instances. The UTF8 layer
> should provide:

I don't see why. ascii is a subset of utf8, the routines building a
packedstring from an ascii string or a utf8 string can be identical, if
you know your string is ascii to begin with you can use an optimized
routine but the end result is the same as if you used the general utf8
version.

> * Unicode toUpper/toLower
> * Unicode collation (UCA) for Ord
> * Graphemes (see Perl6 for good ways to do this)
> * Normalisation

well, none of these are UTF8 specific, we should not worry about the
encoding and just think of what 'PackedString' should do, the encoding
is unimportant to the API and semantics, the fact that you just happen
to be able to quickly convert to/from ascii and utf8 should be the only
visible difference in behavior.

the proper thing for PackedString is to make it behave exactly as the
String instances behave, since it is supposed to be a drop-in
replacement. Which means the natural ordering based on the Char order
and the toLower and toUpper from the libraries.

unicode collation, graphemes, normalization, and localized sorting can be
provided as separate routines as another project (it would be nice to
have them work on both Strings and PackedStrings, so perhaps they could
be in a class?)

certainly a
newtype LocalizedPackedString = LocalizedPackedString PackedString
with different instances would be a useful thing too.
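
A rough sketch of the newtype idea, for illustration only. PackedString
here is a stand-in that just wraps a String, and the "localized" ordering
is a toy case-folded key, not real Unicode collation; foldKey and
unwrapLocalized are made-up helper names.

```haskell
import Data.Char (toLower)
import Data.List (sort)

-- Stand-in for the proposed type: wraps a String and derives the same
-- code-point ordering that String itself has.
newtype PackedString = PackedString String
  deriving (Eq, Ord, Show)

-- Same representation, different instances.
newtype LocalizedPackedString = LocalizedPackedString PackedString
  deriving Show

-- Toy collation key (an assumption, not UCA): the case-folded text,
-- with the raw text as a tie-breaker so Eq stays structural.
foldKey :: LocalizedPackedString -> (String, String)
foldKey (LocalizedPackedString (PackedString s)) = (map toLower s, s)

instance Eq LocalizedPackedString where
  a == b = foldKey a == foldKey b

instance Ord LocalizedPackedString where
  compare a b = compare (foldKey a) (foldKey b)

unwrapLocalized :: LocalizedPackedString -> String
unwrapLocalized (LocalizedPackedString (PackedString s)) = s

-- Sort the same words under both sets of instances.
plainSorted, localizedSorted :: [String] -> [String]
plainSorted ws = [s | PackedString s <- sort (map PackedString ws)]
localizedSorted =
  map unwrapLocalized . sort . map (LocalizedPackedString . PackedString)
```

With the words "apple", "Zebra", "banana", plainSorted puts "Zebra"
first (upper-case letters have smaller code points) while
localizedSorted puts it last. The underlying PackedString keeps the
String semantics either way.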

but this should be a separate but related project from just getting a
fast string replacement. (as in, it shouldn't hold up PackedString
development)

        John

--
John Meacham - ⑆repetae.net⑆john⑈
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

Re: Data.ByteString candidate 3

Einar Karttunen
On 25.04 17:26, John Meacham wrote:
> On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
> > Using the Word8 API is not very pleasant, because all
> > character constants etc are not Word8.
>
> yeah, but using the version restricted to latin1 seems rather special
> case, I can't imagine (or certainly hope) it won't be used in general
> internally unless people are already doing low level stuff. In this day
> and age, I expect unicode to work pretty much everywhere.

Like in protocols where some segments may be compressed binary data?
And they use ascii character based matching to distinguish header
fields, which may have text data that is actually Utf8?

> I am not saying we should kill the latin1 version, since there is
> interest in it, just that it doesn't fill the need for a general fast
> string replacement.

It mostly fills the "I want to use the Word8 module with a nicer API" niche.
But most of the time the data may not be Latin1. If we implement a Latin1
module then we should implement it properly. Also, if we implement Latin1
there is a case for implementing Latin2-5 too.

Of course the people really arguing for this module are not interested in
a proper Latin1 implementation but just want the agnostic ascii superset.

I think the wishes on the libraries list have been mainly:
* UTF8
* Word8 interface
* "Ascii superset"

The easiest way seems to be to have three modules - one for each. Then we
get to the naming part.

I would like:
* Data.ByteString.Word8
* Data.ByteString.Char8
* Data.ByteString.UTF

And select your favorite and make Data.ByteString export that one.
I think that could be the Word8 or the UTF one.

> I don't see why. ascii is a subset of utf8, the routines building a
> packedstring from an ascii string or a utf8 string can be identical, if
> you know your string is ascii to begin with you can use an optimized
> routine but the end result is the same as if you used the general utf8
> version.

Actually toUpper works differently on ascii + something in the high bytes
and ISO-8859-1. Same with all the isXXX predicates, fortunately not a problem
for things like whitespace.
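
To make the difference concrete, here is a small sketch comparing the
Unicode-aware toUpper from Data.Char with an ASCII-only version.
asciiToUpper is a made-up helper for illustration, not something from
the candidate library.

```haskell
import Data.Char (chr, ord, toUpper)

-- ASCII-only upper-casing: only 'a'..'z' are changed, so any value in
-- the high half (>= 128) passes through untouched.
asciiToUpper :: Char -> Char
asciiToUpper c
  | c >= 'a' && c <= 'z' = chr (ord c - 32)
  | otherwise            = c

-- U+00E9 (e with acute accent) is a lower-case letter in ISO-8859-1.
sample :: Char
sample = '\xE9'

-- True when the two notions of "upper case" give different answers
-- for the sample character.
disagree :: Bool
disagree = toUpper sample /= asciiToUpper sample
```

Data.Char's toUpper maps U+00E9 to U+00C9, while the ASCII-only
version leaves it alone, so which module a byte string goes through
changes the result for high bytes.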

> the proper thing for PackedString is to make it behave exactly as the
> String instances behave, since it is supposed to be a drop-in
> replacement. Which means the natural ordering based on the Char order
> and the toLower and toUpper from the libraries.

toUpper and toLower are the correct version in the standard
and they use the unicode tables. The natural ordering by
codepoint without any normalization is not very useful for
text handling, but works for e.g. putting strings in a Map.

> unicode collation, graphemes, normalization, and localized sorting can be
> provided as separate routines as another project (it would be nice to
> have them work on both Strings and PackedStrings, so perhaps they could
> be in a class?)

These are quite essential for really working with unicode characters.
It didn't matter much before as Haskell didn't provide good ways
to handle unicode chars with IO, but these are very important,
otherwise it becomes hard to do many useful things with the parsed
unicode characters.

How are we supposed to process user input without normalization
e.g. if we need to compare Strings for equivalence?

But a simple UTF8 layer with more features added later is a good way.

- Einar Karttunen

Re: Data.ByteString candidate 3

John Meacham
On Wed, Apr 26, 2006 at 04:48:52AM +0300, Einar Karttunen wrote:
> I would like:
> * Data.ByteString.Word8
> * Data.ByteString.Char8
> * Data.ByteString.UTF
>
> And select your favorite and make Data.ByteString export that one.
> I think that could be the Word8 or the UTF one.

ByteString should be the pure Word8 version. the others can be based on
it. ByteString is quite a useful data type independent of anything to do
with strings.

I'd like to see Data.PackedString be what you are calling
Data.ByteString.UTF and PackedString _specifically_ be a drop-in
replacement for String with an abstract internal representation and
should behave the same as String except when it comes to time and space.
I want to be able to just change a few types and routines to
PackedString from String in a library and be guaranteed I am not
affecting the meaning of a program. (or vice versa)


though, I do much much prefer the 'Char8' term to 'Latin1'. I think it
better represents what it does. just 'Chars truncated to 8 bits' while
'latin1' might have other unintended connotations. The fact that the
standard routines will interpret them as latin1 can be inferred from the
fact that the standard routines interpret Chars as unicode code points.

In particular, if you do something wacky where you don't store unicode
values in a 'Char' it doesn't magically become 'Latin1' just because you
store it in a latin1 string, it just becomes whatever you put in
truncated to 8 bits and hopefully you know what you are doing.
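
For illustration, a small sketch of the truncation, assuming the
Data.ByteString.Char8 module as it exists in today's bytestring package
(the candidate under discussion may differ in detail). storedBytes and
roundTrip are made-up names.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Word (Word8)

-- The bytes actually stored when a String goes through the Char8 layer.
storedBytes :: String -> [Word8]
storedBytes = B.unpack . C.pack

-- What comes back when the same ByteString is read as Chars again.
roundTrip :: String -> String
roundTrip = C.unpack . C.pack
```

Packing U+03BB (Greek lambda) keeps only the low 8 bits, 0xBB, and
unpacking gives back U+00BB rather than the original character. Chars
below 256 survive the round trip unchanged.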


> > I don't see why. ascii is a subset of utf8, the routines building a
> > packedstring from an ascii string or a utf8 string can be identical, if
> > you know your string is ascii to begin with you can use an optimized
> > routine but the end result is the same as if you used the general utf8
> > version.
>
> Actually toUpper works differently on ascii + something in the high bytes
> and ISO-8859-1. Same with all the isXXX predicates, fortunately not a problem
> for things like whitespace.

I am not sure what you mean, the data would always be utf8 full unicode
values in a PackedString, there would just be efficient ways to pull in
data you know is ascii since it can just use a memcpy rather than
recoding it from whatever format it is in. The fact that it happens to
just contain values < 128 won't make a difference for subsequent handling
of the string (except perhaps some routines will be faster). When I say
ASCII here, I just mean a utf8 string where all values happen to be <
128, which is happily binary compatible with ASCII.
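
A minimal sketch of that point, using plain lists of Word8 rather than a
real packed representation. encodeChar, encodeUtf8 and
packAsciiUnchecked are hypothetical names, and the encoder does not
reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- General path: encode one code point as UTF-8 (1 to 4 bytes).
-- Surrogates are not rejected here; this is only a sketch.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- leading bits of the code point, for the first byte
    hi s = fromIntegral (n `shiftR` s)
    -- a continuation byte: 10xxxxxx carrying six payload bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- Fast path: only valid when the caller knows every Char is < 128.
-- Nothing is checked; this is the "just memcpy it" case.
packAsciiUnchecked :: String -> [Word8]
packAsciiUnchecked = map (fromIntegral . ord)
```

For input that really is ASCII, such as "header: value", both functions
produce the same bytes, so code that handles the result later does not
need to know which one built it.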

> > the proper thing for PackedString is to make it behave exactly as the
> > String instances behave, since it is supposed to be a drop-in
> > replacement. Which means the natural ordering based on the Char order
> > and the toLower and toUpper from the libraries.
>
> toUpper and toLower are the correct version in the standard
> and they use the unicode tables. The natural ordering by
> codepoint without any normalization is not very useful for
> text handling, but works for e.g. putting strings in a Map.

yeah, and it is fast. I always thought we should have two Ord classes,
one for human-digestible ordering and the other for fast,
implementation-dependent ordering for use only in things like Map and Set.
but that is a different issue.

in any case, the point I was trying to make is that PackedString should
behave exactly like String, whether the instances for String are doing
the right thing is a different matter.

> > unicode collation, graphemes, normalization, and localized sorting can be
> > provided as separate routines as another project (it would be nice to
> > have them work on both Strings and PackedStrings, so perhaps they could
> > be in a class?)
>
> These are quite essential for really working with unicode characters.
> It didn't matter much before as Haskell didn't provide good ways
> to handle unicode chars with IO, but these are very important,
> otherwise it becomes hard to do many useful things with the parsed
> unicode characters.

yeah, they would be useful things to have. but no need to tie them
specifically to PackedString (though, they would operate on
PackedStrings most likely). ginsu and jhc both use unicode extensively
without these routines, so saying it is hard to do useful things is
somewhat strong. but they would definitely be very useful to have and
necessary for certain applications.

> How are we supposed to process user input without normalization
> e.g. if we need to compare Strings for equivalence?

we implement normalization and provide it as a library :)

> But a simple UTF8 layer with more features added later is a good way.

I don't think these features should be in PackedString proper unless
they are added to String as well. (as in, in the default instances),
however a 'UnicodeString' that is a newtype of PackedString would be
easy enough with just different instance declarations.

the library routines for performing these transformations can of course
be provided in PackedString if that makes sense, as long as they don't
conflict with any String operations of the same name.

but being able to do 'normalize a == normalize b' would be useful for
PackedStrings independent of UnicodeString.

        John

--
John Meacham - ⑆repetae.net⑆john⑈

Data.ByteString candidate 4

Donald Bruce Stewart
In reply to this post by Donald Bruce Stewart
Ok, I've tried to incorporate the suggestions from yesterday's
discussion.

API: http://www.cse.unsw.edu.au/~dons/fps/new/
Src: http://www.cse.unsw.edu.au/~dons/fps.html
   
Changes:
    * Char functions live in Data.ByteString.Char8
    * Improved docs
    * Anything that needs Data.Char is now in Char8 (lines, words..)
    * Confirmed that Char8 runs at the same speed as the Word8 layer
    * isSuffix is about 100x faster.
   
No claims about being a 'Char' packed string library. Don't make claims about
encodings. Char8.hs is just a no-op layer over the underlying Data.ByteString
Word8 ops.

I'm wary of claiming 'PackedString' status, as John says, it isn't a
drop in replacement, so Data.ByteString.Char8 seems fine to me.

-- Don

Re: Data.ByteString candidate 3

Axel Simon
In reply to this post by Einar Karttunen
On Wed, 2006-04-26 at 02:16 +0300, Einar Karttunen wrote:

> This is very useful for many purposes and does not mean that there
> should not be a fancy UTF8 module. Rather than arguing about killing
> this, wouldn't it be more productive to create the UTF8 module?

I've been following this thread with some frowning. I can see that some
people want to dish out text over the network *really fast* and thus
would like the ability to emit pure ASCII without the overhead of 4
bytes per character. Still, I don't see the need for a .Latin1 module
next to a .Word8 module.

When it comes to UTF8, I cringe. Dealing with UTF8 is such a nightmare
to get right and it won't show up until you test some Chinese texts
with it (or are there other common 4-byte characters?). Hence, UTF8
should not be a common interface for application developers. Haskell has
the advantage that changing Char from 8 bits to 32 bits doesn't add to
the space consumption of lists. With packed strings the situation is
different, but still, I propose to

- have a library that deals with packed strings of 32-bit Haskell Char
- have a library that deals with packed Word8 sequences

This way, it will hurt if you touch the bare-metal Word8 representation,
but then, using Word8 sequences is quite an optimisation that you don't
use when you start developing an application. A simplistic solution like
this avoids the whole discussion on whether there should be an Ord or
toUpper for Latin1, or how to coerce a packed Latin1 string to a packed
Word8 representation.

Axel.



Re: Data.ByteString candidate 3

Donald Bruce Stewart
A.Simon:

> different, but still, I propose to
>
> - have a library that deals with packed strings of 32-bit Haskell Char
> - have a library that deals with packed Word8 sequences
>
> This way, it will hurt if you touch the bare-metal Word8 representation,
> but then, using Word8 sequences is quite an optimisation that you don't
> use when you start developing an application. A simplistic solution like
> this avoids the whole discussion on whether there should be an Ord or
> toUpper for Latin1, or how to coerce a packed Latin1 string to a packed
> Word8 representation.

I'd like to say that all I want to do is have the Word8 "bare metal"
layer, and a minimal Char8 layer on top (where all conversions are
equivalent to id) to make the fast layer usable for speed-is-everything
projects. If we don't add the Char8 layer, the projects will end up
having to write their own anyway, since Word8 literals are unbearable.

This is what's currently implemented.
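
A small sketch of why the literals matter, assuming the Data.ByteString
and Data.ByteString.Char8 modules as found in today's bytestring
package. The header value and the nameWord8/nameChar8 names are invented
for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- An example protocol header line (made up for illustration).
header :: B.ByteString
header = C.pack "Content-Type: text/plain"

-- Word8 API: the colon has to be spelled as its byte value, 58.
nameWord8 :: B.ByteString
nameWord8 = B.takeWhile (/= 58) header

-- Char8 API: the same operation on the same bytes, with a Char literal.
nameChar8 :: B.ByteString
nameChar8 = C.takeWhile (/= ':') header
```

Both definitions return the same ByteString ("Content-Type"). Only the
spelling of the predicate differs, which is the sugar the Char8 layer
is meant to provide.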

I'm providing the 2nd part of your plan above, with a little sugar for
people like me and Einar who need it. The first part, 32-bit packed
haskell strings, is another piece of work.

I'm not sure we need 5 kinds of Foo-encoding layers, and I don't plan to
write them.

-- Don


Re: Data.ByteString candidate 4

John Meacham
In reply to this post by Donald Bruce Stewart
On Wed, Apr 26, 2006 at 01:21:17PM +1000, Donald Bruce Stewart wrote:
> I'm wary of claiming 'PackedString' status, as John says, it isn't a
> drop in replacement, so Data.ByteString.Char8 seems fine to me.

I like it a lot. perfect for what it does. is someone working on a
Data.PackedString or should I have a go at it? should I send patches to
your darcs repo?

        John

--
John Meacham - ⑆repetae.net⑆john⑈

Re: Data.ByteString candidate 3

Axel Simon
In reply to this post by Donald Bruce Stewart
On Wed, 2006-04-26 at 18:35 +1000, Donald Bruce Stewart wrote:

> A.Simon:
> > different, but still, I propose to
> >
> > - have a library that deals with packed strings of 32-bit Haskell Char
> > - have a library that deals with packed Word8 sequences
> >
> > This way, it will hurt if you touch the bare-metal Word8 representation,
> > but then, using Word8 sequences is quite an optimisation that you don't
> > use when you start developing an application. A simplistic solution like
> > this avoids the whole discussion on whether there should be an Ord or
> > toUpper for Latin1, or how to coerce a packed Latin1 string to a packed
> > Word8 representation.
>
> I'd like to say that all I want to do is have the Word8 "bare metal"
> layer, and a minimal Char8 layer on top (where all conversions are
> equivalent to id) to make the fast layer usable for speed-is-everything
> projects. If we don't add the Char8 layer, the projects will end up
> having to write their own anyway, since Word8 literals are unbearable.

I don't understand the need for the Char8 layer. How are Char8 literals
different from Word8 literals? You couldn't use "string" or 's' either
way.

> This is what's currently implemented.
>
> I'm providing the 2nd part of your plan above, with a little sugar for
> people like me and Einar who need it. The first part, 32-bit packed
> haskell strings, is another piece of work.

Ok, fair enough.

Axel.


Re[2]: Data.ByteString candidate 3

Bulat Ziganshin-2
In reply to this post by Donald Bruce Stewart
Hello Donald,

Wednesday, April 26, 2006, 12:35:16 PM, you wrote:

> I'm not sure we need 5 kinds of Foo-encoding layers, and I don't plan to
> write them.

let's count:

Latin1 - already written by you
UTF8 - requested by many people here, required to work with compiler's
input in ghc/jhc, and is the most compact representation for general
string
UCS4 - already implemented in Data.PackedString, fastest way to work
with general strings (i mean faster indexing and other direct-index
ops)
UTF16 - used in Windows API, so its implementation will be really
useful to simplify this API implementation and to allow application
programs to work directly with such strings instead of converting them
from/to some other format
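
To put rough numbers on the space side of this list, here is a sketch
that counts how many bytes a String would need under each of the three
Unicode encodings. All function names are made up for the example, and
UCS4 is taken to mean a fixed 4 bytes per character.

```haskell
import Data.Char (ord)

-- Bytes needed for one code point in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair.
utf16Len :: Char -> Int
utf16Len c = if ord c < 0x10000 then 2 else 4

-- Bytes needed in UCS4: always four, which is what makes indexing O(1).
ucs4Len :: Char -> Int
ucs4Len _ = 4

-- (UTF-8, UTF-16, UCS4) sizes in bytes for a whole string.
sizes :: String -> (Int, Int, Int)
sizes s = (sum (map utf8Len s), sum (map utf16Len s), sum (map ucs4Len s))
```

For example, sizes "hello" is (5, 10, 20), while the two-character
string "\26085\26412" gives (6, 4, 8), so which encoding is most
compact depends on the text.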

btw, what will be really useful now, imho, is the interface to
Text.Regex. how about working on it as next stage?

and one more suggestion - you can significantly speed up your code by
importing the 6.5's ForeignPtr implementation inside your library.
This type almost never appears in ByteString's external interface, so
this should not be a huge amount of work


--
Best regards,
 Bulat                            mailto:[hidden email]


Re: Data.ByteString candidate 3

Donald Bruce Stewart
bulat.ziganshin:
> Hello Donald,
> btw, what will be really useful now, imho, is the interface to
> Text.Regex. how about working on it as next stage?

This is already done actually, here:
    http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc
    http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc

:)
   
> and one more suggestion - you can significantly speed up your code by
> importing the 6.5's ForeignPtr implementation inside your library.
> This type almost never appears in ByteString's external interface, so
> this should not be a huge amount of work

Ah! That's a good idea.

-- Don

Re: Data.ByteString candidate 3

Ketil Malde-3
In reply to this post by Donald Bruce Stewart
[hidden email] (Donald Bruce Stewart) writes:

> I'd like to say that all I want to do is have the Word8 "bare metal"
> layer, and a minimal Char8 layer on top

> This is what's currently implemented.

I've now added a Latin1 module that works like Char8, but where
packing a Char > 255 is an error.  This means some extra checking:
packing 45M characters from [Char] to ByteString slows down from (very
roughly) 6.4 to 5.6 Mb/s.

Many (but probably less important) operations will be faster: checking
whether a Char c >= 256 is an `elem` of a Latin1 ByteString will be O(1)
and always False.  (Char8 will need to scan the string for c `mod` 256.)
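
A rough sketch of the checked variant, not the code in the darcs repo.
packLatin1 and elemLatin1 are made-up names, and this version reports
the offending Char with Either instead of raising an error.

```haskell
import qualified Data.ByteString as B
import Data.Char (ord)

-- Checked packing: refuses any Char above 255 instead of truncating.
-- Returns the first offending Char on failure (a variation; the module
-- described above signals an error instead).
packLatin1 :: String -> Either Char B.ByteString
packLatin1 s = case filter ((> 255) . ord) s of
  (bad : _) -> Left bad
  []        -> Right (B.pack (map (fromIntegral . ord) s))

-- Membership test on a ByteString known to hold Latin1 data: a Char
-- above 255 can never be present, so no scan is needed for it.
elemLatin1 :: Char -> B.ByteString -> Bool
elemLatin1 c bs
  | ord c > 255 = False
  | otherwise   = B.elem (fromIntegral (ord c)) bs
```

The extra filter pass is where the packing cost comes from; the early
False in elemLatin1 is the O(1) case mentioned above.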

Feel free to grab, read, criticize, or benchmark,

     darcs get http://www.ii.uib.no/~ketil/src/fps

-k
--
If I haven't seen further, it is by standing in the footprints of giants


Re: Data.ByteString candidate 3

Ketil Malde-3
Ketil Malde <[hidden email]> writes:

> I've now added a Latin1 module, that works like Char8, but where

And now there's an ASCII module, which instead of storing bytes > 127 in
the latin1 range, puts them in the "private" area of 0xF000..0xF07F.
This way, they won't be affected by other Char functions depending on
case etc.  Packing Chars outside of this area is still an error
(i.e. no 8 bit truncation)
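
One way the mapping could look, as a sketch. The exact offset (byte 128
goes to 0xF000, byte 255 to 0xF07F) is an assumption based on the range
given above, and byteToChar/charToByte are made-up names.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Reading: ASCII bytes map to themselves, high bytes are moved into
-- the private-use range 0xF000..0xF07F (assumed offset: b - 128).
byteToChar :: Word8 -> Char
byteToChar b
  | b < 128   = chr (fromIntegral b)
  | otherwise = chr (0xF000 + fromIntegral b - 128)

-- Packing: only ASCII and the private-use range are accepted; anything
-- else is rejected rather than truncated to 8 bits.
charToByte :: Char -> Maybe Word8
charToByte c
  | n < 128                    = Just (fromIntegral n)
  | n >= 0xF000 && n <= 0xF07F = Just (fromIntegral (n - 0xF000 + 128))
  | otherwise                  = Nothing
  where n = ord c
```

Under this mapping every byte round-trips through Char, and a Latin1
letter such as U+00E9 is rejected on packing instead of being stored
as byte 0xE9.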

IMHO this is the correct way to provide a Char interface to ASCII
(albeit at a performance penalty), but I simply can't wait to hear
what other people have to say about the matter. :-)

-k
--
If I haven't seen further, it is by standing in the footprints of giants


Re[2]: Data.ByteString candidate 3

Bulat Ziganshin-2
In reply to this post by Donald Bruce Stewart
Hello Donald,

Wednesday, April 26, 2006, 2:19:34 PM, you wrote:

>> btw, what will be really useful now, imho, is the interface to
>> Text.Regex. how about working on it as next stage?

> This is already done actually, here:
>     http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc
>     http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc

please include it in your lib, it is very useful thing imho


--
Best regards,
 Bulat                            mailto:[hidden email]


Re: Data.ByteString candidate 3

Ashley Yakeley
In reply to this post by Donald Bruce Stewart
Donald Bruce Stewart wrote:

> Ok, here's what I've done:
>     http://www.cse.unsw.edu.au/~dons/fps/new/

> The code has been partitioned into:
>     Data.ByteString         a Word8 only layer. All functions are in terms of Word8

Do the file-handling Word8 functions always work correctly, or do they
do some kind of round-trip Char conversion? We've needed Word8 file
access, so this would be very helpful. For instance:

  writeFile "myfile" (pack [0..255])

This should always write exactly the bytes 0 to 255, with no
text-related weirdness such as charset remapping or newline conversion.

--
Ashley Yakeley, Seattle WA
WWEWDD? http://www.cs.utexas.edu/users/EWD/


Re: Data.ByteString candidate 3

Donald Bruce Stewart
ashley:

> Donald Bruce Stewart wrote:
>
> >Ok, here's what I've done:
> >    http://www.cse.unsw.edu.au/~dons/fps/new/
>
> >The code has been partitioned into:
> >    Data.ByteString         a Word8 only layer. All functions are in terms
> >    of Word8
>
> Do the file-handling Word8 functions always work correctly, or do they
> do some kind of round-trip Char conversion? We've needed Word8 file
> access, so this would be very helpful. For instance:
>
>  writeFile "myfile" (pack [0..255])
>
> This should always write exactly the bytes 0 to 255, with no
> text-related weirdness such as charset remapping or newline conversion.

There is no round trip at all. Nothing is converted to Char in the Word8
code:

Prelude> Data.ByteString.writeFile "myfile" (Data.ByteString.pack [0..255])

$ od -t 'd1' myfile
0000000    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0000020   16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
0000040   32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
0000060   48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
0000100   64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
0000120   80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
0000140   96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
0000160  112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
0000200  128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
0000220  144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
0000240  160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
0000260  176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
0000300  192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
0000320  208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
0000340  224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
0000360  240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
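
The same check can be done from Haskell by reading the file back. A
minimal sketch, assuming the writeFile and readFile functions of
today's Data.ByteString; allBytes and roundTrips are made-up names.

```haskell
import qualified Data.ByteString as B

-- Every possible byte value, in order.
allBytes :: B.ByteString
allBytes = B.pack [0 .. 255]

-- Write the bytes, read them back, and compare. Overwrites the file at
-- the given path.
roundTrips :: FilePath -> IO Bool
roundTrips path = do
  B.writeFile path allBytes
  back <- B.readFile path
  return (back == allBytes && B.length back == 256)
```

In GHCi, roundTrips "myfile" should give True if no newline or charset
translation happens on the way to or from disk.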

-- Don

Re: Data.ByteString candidate 3

John Meacham
In reply to this post by Donald Bruce Stewart
On Wed, Apr 26, 2006 at 08:19:34PM +1000, Donald Bruce Stewart wrote:
> bulat.ziganshin:
> > Hello Donald,
> > btw, what will be really useful now, imho, is the interface to
> > Text.Regex. how about working on it as next stage?
>
> This is already done actually, here:
>     http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc
>     http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc

I have a regex interface to PCRE and some neat typeclass tricks to give
you Perl's (=~) operator but much more powerful here.

http://repetae.net/john/computer/haskell/JRegex/

It would be nice to get a PCRE binding in the libraries if it is
available.

if there is interest in including this in the fptools libraries I can
revisit and clean-up/modernize the code.

        John
--
John Meacham - ⑆repetae.net⑆john⑈

Re: Data.ByteString candidate 4

Donald Bruce Stewart
In reply to this post by John Meacham
john:
> On Wed, Apr 26, 2006 at 01:21:17PM +1000, Donald Bruce Stewart wrote:
> > I'm wary of claiming 'PackedString' status, as John says, it isn't a
> > drop in replacement, so Data.ByteString.Char8 seems fine to me.
>
> I like it a lot. perfect for what it does. is someone working on a
> Data.PackedString or should I have a go at it? should I send patches to
> your darcs repo?

I don't think anyone is working on it at the moment.
And I'm happy for patches, or maybe it should be another repo (so I can
just concentrate on getting Data.ByteString into the base libs). No
matter.

Also, today I checked that Data.ByteString.* runs in hugs (it does),
since it's H98 + FFI + cpp, so now I'm wondering if JHC can compile it...?

-- Don

Re: Data.ByteString candidate 3

Donald Bruce Stewart
In reply to this post by John Meacham
john:

> On Wed, Apr 26, 2006 at 08:19:34PM +1000, Donald Bruce Stewart wrote:
> > bulat.ziganshin:
> > > Hello Donald,
> > > btw, what will be really useful now, imho, is the interface to
> > > Text.Regex. how about working on it as next stage?
> >
> > This is already done actually, here:
> >     http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc
> >     http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
>
> I have a regex interface to PCRE and some neat typeclass tricks to give
> you Perl's (=~) operator but much more powerful here.
>
> http://repetae.net/john/computer/haskell/JRegex/
>
> It would be nice to get a PCRE binding in the libraries if it is
> available.
>
> if there is interest in including this in the fptools libraries I can
> revisit and clean-up/modernize the code.

We really longed for a high performance regex lib in the standard
libraries while working on the shootout earlier this year. Text.Regex is
far too inefficient due to all the pack/unpackings, and even then C's
regexes aren't so great.  In fact, Chris K ended up writing
Text.Regex.Lazy as a result of this effort.

Here's a nice benchmark for your code:
    http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all 

I wonder if JRegex would give us a faster entry?

After fast IO, regexes are the other thing we need to improve for ghc
6.6, I think. So at least the people who worked on the shootout would be
interested :)

-- Don


Re: Data.ByteString candidate 3

Simon Marlow-5
In reply to this post by John Meacham
John Meacham wrote:

> On Wed, Apr 26, 2006 at 08:19:34PM +1000, Donald Bruce Stewart wrote:
>
>>bulat.ziganshin:
>>
>>>Hello Donald,
>>>btw, what will be really useful now, imho, is the interface to
>>>Text.Regex. how about working on it as next stage?
>>
>>This is already done actually, here:
>>    http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc
>>    http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
>
>
> I have a regex interface to PCRE and some neat typeclass tricks to give
> you Perl's (=~) operator but much more powerful here.
>
> http://repetae.net/john/computer/haskell/JRegex/
>
> It would be nice to get a PCRE binding in the libraries if it is
> available.
>
> if there is interest in including this in the fptools libraries I can
> revisit and clean-up/modernize the code.

Actually yes, I did intend to replace/extend Text.Regex with JRegex at
some point.  Plus we can include PCRE, since it has a BSD license -
maybe it can replace the POSIX regex implementation that we have in GHC
right now (which was taken from FreeBSD's libc).

I imagine doing this as part of the library reorg we have planned for
6.6.  http://hackage.haskell.org/trac/ghc/ticket/710

Cheers,
        Simon

Re[2]: Data.ByteString candidate 4

Bulat Ziganshin-2
In reply to this post by Donald Bruce Stewart
Hello Donald,

Thursday, April 27, 2006, 11:09:24 AM, you wrote:

> Also, today I checked that Data.ByteString.* runs in hugs (it does),

that's great for debugging haskell programs

> since it's H98 + FFI + cpp, so now I'm wondering if JHC can compile it...?

and nhc/yhc too. after all, base libs is a common library for hugs, ghc and
nhc


--
Best regards,
 Bulat                            mailto:[hidden email]
