Is it safe to index a little bit out of bounds

Andrew Martin
Let's say I have a GC-managed byte array of length 19. GHC promises that byte arrays are machine-word-aligned at the front. That is, on a 64-bit machine, this array starts at a memory address that is divisible by 8. However, the end of the array will certainly be unaligned, since 19 is not a multiple of 8. So, these two calls will be fine:

- indexWordArray# myArr# 0#
- indexWordArray# myArr# 1#

But this one is non-deterministic:

- indexWordArray# myArr# 2#

On a 64-bit machine, that read covers bytes 16 through 23, but only bytes 16 through 18 belong to the array, so some of the bytes in the word will have garbage in them. However, this could always be masked out with a bit mask (you have to know the platform endianness for this to work right). Is this safe? I don't think this could ever cause a segfault, but I wanted to check.
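The masking idea described above can be sketched as follows. This is only an illustration under stated assumptions, not code from any library: `validBytesMask` and `lastWordMasked` are hypothetical names, the byte order is taken from `GHC.ByteOrder` (base 4.11 or later), and `lastWordMasked` assumes the array is non-empty and that reading the final partial word is permitted, which is exactly the question being asked.

```haskell
{-# LANGUAGE MagicHash #-}

import Data.Bits (complement, finiteBitSize, shiftL, shiftR, (.&.))
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)
import GHC.Exts (ByteArray#, Int (I#), Word (W#), indexWordArray#)
import Numeric (showHex)

-- | Bytes per machine word (8 on a 64-bit platform).
wordBytes :: Int
wordBytes = finiteBitSize (0 :: Word) `div` 8

-- | Mask that keeps the first @n@ bytes, in memory order, of a loaded word.
-- On little-endian machines those are the low-order bytes; on big-endian
-- machines they are the high-order bytes.
validBytesMask :: ByteOrder -> Int -> Word
validBytesMask order n
  | n <= 0 = 0
  | n >= wordBytes = maxBound
  | otherwise = case order of
      LittleEndian -> (1 `shiftL` (8 * n)) - 1
      BigEndian -> complement (maxBound `shiftR` (8 * n))

-- | Hypothetical helper: load the word that holds the final bytes of an
-- array of @lenBytes@ bytes (assumed > 0) and zero every byte past the end.
lastWordMasked :: ByteArray# -> Int -> Word
lastWordMasked arr lenBytes =
  case lastIx of
    I# i -> W# (indexWordArray# arr i) .&. validBytesMask targetByteOrder valid
  where
    lastIx = (lenBytes - 1) `quot` wordBytes
    valid = lenBytes - lastIx * wordBytes -- in-bounds bytes in the last word

main :: IO ()
main =
  putStrLn ("mask for 3 valid bytes: 0x" ++ showHex (validBytesMask targetByteOrder 3) "")
```

For the 19-byte array on a 64-bit machine, `lastIx` is 2 and `valid` is 3, so only the three in-bounds bytes survive the mask.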

--
-Andrew Thaddeus Martin

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

Re: Is it safe to index a little bit out of bounds

Sven Panne-2
2018-03-08 15:19 GMT+01:00 Andrew Martin <[hidden email]>:
> [...] Some of the bytes in the word will have garbage in them. However, this could always be masked out with a bit mask (you have to know the platform endianness for this to work right). Is this safe? I don't think this could ever cause a segfault but I wanted to check.

Before doing such things, please make sure that e.g. valgrind or similar tools are happy with such Kung-Fu. I don't know off the top of my head how fine-grained their checks are, but there is various similar code out there in the wild which is a PITA to debug. You might force people to add suppressions or, even worse, make some valuable tools totally useless. This is not something which should be done lightly...


Re: Is it safe to index a little bit out of bounds

Herbert Valerio Riedel-3
In reply to this post by Andrew Martin
Hi,

On 2018-03-08 at 09:19:29 -0500, Andrew Martin wrote:
> Some of the bytes in the word will have garbage in them. However, this
> could always be masked out with a bit mask (you have to know the platform
> endianness for this to work right).
>
> Is this safe? I don't think this could ever cause a segfault but I
> wanted to check.

For historical reasons, this is indeed safe: the underlying
`StgArrBytes` structure must be word-aligned in size; otherwise bad
things are likely to happen.
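One way to read the remark above: if the payload is always sized in whole words, the space behind a byte array is its requested length rounded up to the next multiple of the word size. The arithmetic can be sketched like this; `paddedSize` and `readableWords` are hypothetical names used only for illustration and are not part of GHC's API.

```haskell
import Data.Bits (finiteBitSize)

-- | Bytes per machine word (8 on a 64-bit platform).
wordBytes :: Int
wordBytes = finiteBitSize (0 :: Word) `div` 8

-- | Smallest multiple of the word size that can hold @n@ bytes.
paddedSize :: Int -> Int
paddedSize n = ((n + wordBytes - 1) `div` wordBytes) * wordBytes

-- | Number of whole words that fit inside the padded payload.
readableWords :: Int -> Int
readableWords n = paddedSize n `div` wordBytes

main :: IO ()
main = do
  print (paddedSize 19)
  print (readableWords 19)
```

On a 64-bit platform, 19 bytes round up to 24, so word indices 0 through 2 all stay inside the padded payload.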

I've seen some code in the wild which relies on that, and as a data point,
I myself exploit that property in some operations (including the masking
and endianness-aware handling you refer to) of 'text-short'[1], which is
optimised for UTF-8-based strings (<shameless-plug>besides being a
practically useful library with its place in the text/bytearray
landscape[2], text-short also serves as an incubation area for
optimisation ideas and code, some of which may end up in one way or
another in the text-utf8 project[3]</shameless-plug>).

 [1]: https://hackage.haskell.org/package/text-short
 [2]: https://markkarpov.com/post/short-bs-and-text.html
 [3]: https://hackage.haskell.org/text-utf8

-- hvr

Re: Is it safe to index a little bit out of bounds

Andrew Martin
Thanks Herbert! This is exactly the kind of data point I was looking for. Good to know.


Re: Is it safe to index a little bit out of bounds

David Feuer
In reply to this post by Andrew Martin
What do you gain from this?




Re: Is it safe to index a little bit out of bounds

Andrew Martin
If you are looking for ASCII (or non-ASCII) characters in a byte array, you build a word-sized mask like 0b1000000010000000... and test a whole word at a time. However, on the last word, if you cannot read past the end, you have to go one byte at a time. But if you can read past the end, you can mask out the irrelevant bits and use the same mask as before.
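A minimal sketch of that check, operating on words that have already been loaded. The names `highBits`, `wordIsAscii`, and `lastWordIsAscii` are made up for this example, and the mask for the in-bounds bytes is passed in by the caller, since it depends on the array length and the platform endianness.

```haskell
import Data.Bits (complement, (.&.), (.|.))

-- | 0x80 repeated in every byte of a machine word (0x8080...80).
highBits :: Word
highBits = (maxBound `div` 0xFF) * 0x80

-- | True when no byte of the word has its top bit set, i.e. all are ASCII.
wordIsAscii :: Word -> Bool
wordIsAscii w = w .&. highBits == 0

-- | Same check for the final word: bytes past the end of the array are
-- cleared with @validMask@ first, so garbage there cannot affect the result.
lastWordIsAscii :: Word -> Word -> Bool
lastWordIsAscii validMask w = wordIsAscii (w .&. validMask)

main :: IO ()
main = do
  let validMask = 0xFFFFFF :: Word -- three in-bounds bytes, little-endian layout
      -- "abc" in the low three bytes, all-ones garbage in the rest
      lastWord = complement validMask .|. 0x636261
  print (wordIsAscii lastWord) -- False: the garbage bytes trip the check
  print (lastWordIsAscii validMask lastWord) -- True: garbage is masked out
```

With this shape, the loop over full words and the handling of the last word share the same `highBits` test, and only the final word needs the extra mask.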




