Hello.
It seems that regex-pcre has a bug dealing with UTF-8:

    Prelude> :m + Text.Regex.PCRE
    Prelude Text.Regex.PCRE> "país:Brasil" =~ "país:(.*)" :: (String,String,String,[String])
    ("","pa\237s:Brasil","",["rasil"])

Notice the missing 'B' in the result of the regex matching.

With regex-posix this does not happen:

    Prelude> :m + Text.Regex.Posix
    Prelude Text.Regex.Posix> "país:Brasil" =~ "país:(.*)" :: (String,String,String,[String])
    ("","pa\237s:Brasil","",["Brasil"])

I hope this bug can be fixed soon.

Is there a bug tracker to report the bug? If so, what is it?

Romildo
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
On 08/18/2012 06:16 PM, José Romildo Malaquias wrote:
> Hello.
>
> It seems that the regex-pcre has a bug dealing with utf-8:
>
> I hope this bug can be fixed soon.
>
> Is there a bug tracker to report the bug? If so, what is it?

You need something like this:

    let pat = makeRegexOpts (compUTF8 .|. defaultCompOpt) defaultExecOpt
                  ("@'(.+?)'@" :: B.ByteString)

and then pat will match correctly.
On Tue, Aug 21, 2012 at 10:25:53PM +0300, Konstantin Litvinenko wrote:
> On 08/18/2012 06:16 PM, José Romildo Malaquias wrote:
> > Hello.
> >
> > It seems that the regex-pcre has a bug dealing with utf-8:
> >
> > I hope this bug can be fixed soon.
> >
> > Is there a bug tracker to report the bug? If so, what is it?
>
> You need something like this:
>
>     let pat = makeRegexOpts (compUTF8 .|. defaultCompOpt) defaultExecOpt
>                   ("@'(.+?)'@" :: B.ByteString)
>
> and then pat will match correctly.

The bug is related to String (not ByteString) in a UTF-8 locale. Until it is fixed, I am using the workaround of converting the regular expression and the text to ByteString, doing the matching, and then converting the results back to String.

Romildo
On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
> On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
> > I do not have time to test this myself right now. But I will unravel my code a
> > bit for you.
> >
> > > By November 2011 it worked without problems in my application. Now that
> > > I have resumed developing the application, I have been faced with this
> > > behaviour. As it used to work before, I believe it is a bug in
> > > regex-pcre or libpcre.
> >
> > I believe it may be a problem in String <-> ByteString conversion. The "base"
> > library may have changed and your LOCALE information may be different or may be
> > being used differently by "base".
> >
> > > The (temporary) workaround I found is to convert the strings to
> > > byte-strings before matching, and then convert the results back to
> > > strings. With byte-strings it works well.
> >
> > That is an excellent sign that it is your LOCALE settings being picked up by
> > GHC's "base" package, see explanation below.
>
> I have written an application to test those things. There are 2 source
> files: test.hs and seestr.c, which are attached.
>
> The test does the following:
>
> 1. shows the getForeignEncoding
>
> 2. uses a C function to show the characters from a String (using
>    withCString) and from a ByteString (using useAsCString)
>
> 3. matches a PCRE regular expression using String and ByteString
>
> The test is run twice, with different LANG settings, and its output
> follows.
>
> As can be seen, regular expression matching does not work with
> en_US.UTF-8. But it works with en_US.ISO-8859-1.
>
> The test shows that withCString is working as expected too. This
> may suggest the problem is really with regex-pcre.

The previous tests were run on a Gentoo Linux system with ghc-7.4.1. I have also run the tests on Fedora 17 with ghc-7.0.4, which does not have the bug.

The sources are attached.
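The attachments are not reproduced here. As a rough, Haskell-only sketch of step 2 of the test (an assumption about what seestr.c reports, not the attached code), the bytes produced by marshalling can be inspected with GHC.Foreign.withCStringLen, which takes the encoding as an explicit argument instead of reading it from the locale. The helper names marshalledBytes and foreignEncodingName are hypothetical:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (castPtr)
import GHC.Foreign (withCStringLen)
import GHC.IO.Encoding (TextEncoding, getForeignEncoding, latin1, utf8)

-- | Name of the encoding that Foreign.C.String.withCString would pick up
-- from the environment (hypothetical helper, for display only).
foreignEncodingName :: IO String
foreignEncodingName = fmap show getForeignEncoding

-- | Bytes that a Haskell String is marshalled to under a given encoding.
-- Hypothetical helper; it is not part of test.hs or of regex-pcre.
marshalledBytes :: TextEncoding -> String -> IO [Word8]
marshalledBytes enc s =
  withCStringLen enc s $ \(ptr, len) ->
    -- The buffer holds CChar values; read them back as unsigned bytes.
    peekArray len (castPtr ptr)
```

In GHCi, marshalledBytes utf8 "pa\237s" gives [112,97,195,173,115] (five bytes), while marshalledBytes latin1 "pa\237s" gives [112,97,237,115] (four bytes). That one-byte difference per accented character is the length difference at issue in this thread.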
The test output follows:

    $ LANG=en_US.ISO-8859-1 && ./test
    testing with String
    code: 70, char: p
    code: 61, char: a
    code: ffffffed, char:
    code: 73, char: s
    result: 4
    testing with ByteString
    code: 70, char: p
    code: 61, char: a
    code: ffffffed, char:
    code: 73, char: s
    result: 4
    regex : pa�s:(.*)
    text : pa�s:Brasil
    String match : [["pa\237s:Brasil","Brasil"]]
    ByteString match : [["pa\237s:Brasil","Brasil"]]

    $ LANG=en_US.UTF-8 && ./test
    testing with String
    code: 70, char: p
    code: 61, char: a
    code: ffffffed, char:
    code: 73, char: s
    result: 4
    testing with ByteString
    code: 70, char: p
    code: 61, char: a
    code: ffffffed, char:
    code: 73, char: s
    result: 4
    regex : país:(.*)
    text : país:Brasil
    String match : [["pa\237s:Brasil","Brasil"]]
    ByteString match : [["pa\237s:Brasil","Brasil"]]

Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems that with ghc-7.0.4 withCString does not obey the UTF-8 locale and generates a latin1 C string.

Regards,

Romildo
Hello.
I think I have an explanation for the problem with regex-pcre, ghc-7.4.2 and UTF-8 Strings.

The Text.Regex.PCRE.String module uses withCString and withCStringLen from the module Foreign.C.String to pass a Haskell string to the C library pcre functions that compile regular expressions and execute them to match some text.

Recent versions of ghc have withCString and withCStringLen definitions that use the current system locale to define the marshalling of a Haskell string into a NUL-terminated C string using temporary storage.

With a UTF-8 locale the length of the C string will be greater than the length of the corresponding Haskell string in the presence of characters outside of the ASCII range. Therefore positions of corresponding characters in both strings do not match.

In order to compute matching positions, regex-pcre functions use C strings. But to compute matching strings they use those positions with Haskell strings.

That gives the mismatch shown earlier and repeated here with the attached program run on a system with a UTF-8 locale:

    $ LANG=en_US.UTF-8 && ./test1
    getForeignEncoding: UTF-8

    regex : país:(.*):(.*)
    text : país:Brasília:Brasil
    String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
    String match : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

    $ LANG=en_US.ISO-8859-1 && ./test1
    getForeignEncoding: ISO-8859-1

    regex : pa�s:(.*):(.*)
    text : pa�s:Bras�lia:Brasil
    String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
    String match : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]

I see two ways of fixing this bug:

1. make the matching functions compute the text using the C string and
   the positions calculated by the C function, and convert the text back
   to a Haskell string.

2. map the positions in the C string (if possible) to the corresponding
   positions in the Haskell string; this way the current definitions of
   the matching functions returning text will just work.
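A minimal sketch of the second option, assuming the C side reports (offset, length) pairs counted in UTF-8 bytes, as the matchOnce output above suggests. This is neither the regex-pcre code nor any actual patch; utf8Width, byteStarts, toCharSpan and extract are hypothetical names:

```haskell
import Data.Char (ord)

-- | Number of bytes UTF-8 needs for one character.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- | Byte offset at which each character starts; the last element is the
-- total byte length of the encoded string.
byteStarts :: String -> [Int]
byteStarts = scanl (+) 0 . map utf8Width

-- | Translate a (byte offset, byte length) pair into a
-- (character offset, character length) pair. The result is Nothing when
-- either end does not fall on a character boundary of the string.
toCharSpan :: String -> (Int, Int) -> Maybe (Int, Int)
toCharSpan s (off, len) = do
  start <- lookup off table
  end   <- lookup (off + len) table
  return (start, end - start)
  where
    -- Pairs of (byte offset, character index).
    table = zip (byteStarts s) [0 ..]

-- | Substring of the Haskell string selected by a byte-based span.
extract :: String -> (Int, Int) -> Maybe String
extract s byteSpan = do
  (start, len) <- toCharSpan s byteSpan
  return (take len (drop start s))
```

With the text from the example, toCharSpan "pa\237s:Bras\237lia:Brasil" (6,9) gives Just (5,8) and (16,6) gives Just (14,6), the same positions the ISO-8859-1 run reports. The lookup is linear per query, which is acceptable for a sketch; a real fix would probably walk the string once for all groups.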
I hope this helps in fixing the issue.

Regards,

Romildo
On Thu, Aug 23, 2012 at 08:59:52AM -0300, José Romildo Malaquias wrote:
> [...]
> I see two ways of fixing this bug:
>
> 1. make the matching functions compute the text using the C string and
>    the positions calculated by the C function, and convert the text back
>    to a Haskell string.
>
> 2. map the positions in the C string (if possible) to the corresponding
>    positions in the Haskell string; this way the current definitions of
>    the matching functions returning text will just work.
>
> I hope this would help fixing the issue.

I have a fix for this bug, and it would be nice if others took a look at it to see whether it is OK. It is based on the second way presented above.

Romildo
