optimizing StgPtr allocate (Capability *cap, W_ n)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

optimizing StgPtr allocate (Capability *cap, W_ n)

Bulat Ziganshin-2
Hello,

i'm looking a the https://github.com/ghc/ghc/blob/23bb90460d7c963ee617d250fa0a33c6ac7bbc53/rts/sm/Storage.c#L680

if i correctly understand, it's speed-critical routine?

i think that it may be improved in this way:

StgPtr allocate (Capability *cap, W_ n)
{
    bdescr *bd;
    StgPtr p;

    TICK_ALLOC_HEAP_NOCTR(WDS(n));
    CCS_ALLOC(cap->r.rCCCS,n);

/// here starts new improved code:

    bd = cap->r.rCurrentAlloc;
    if (bd == NULL || bd->free + n > bd->end) {
        if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
            ....
        }
        if (bd->free + n <= bd->start + BLOCK_SIZE_W)
            bd->end = min (bd->start + BLOCK_SIZE_W, bd->free + LARGE_OBJECT_THRESHOLD)
            goto usual_alloc;
        }
        ....
    }

/// and here it stops

usual_alloc:
    p = bd->free;
    bd->free += n;

    IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
    return p;
}


i  think  it's  obvious - we consolidate two if's on the crirical path
into the single one plus avoid one ADD by keeping highly-useful bd->end pointer

further   improvements   may   include   removing  bd==NULL  check  by
initializing bd->free=bd->end=NULL   and   moving   entire   "if" body
into   separate   slow_allocate()  procedure  marked  "noinline"  with
allocate() probably marked as forceinline:

StgPtr allocate (Capability *cap, W_ n)
{
    bdescr *bd;
    StgPtr p;

    TICK_ALLOC_HEAP_NOCTR(WDS(n));
    CCS_ALLOC(cap->r.rCCCS,n);

    bd = cap->r.rCurrentAlloc;
    if (bd->free + n > bd->end)
        return slow_allocate(cap,n);

    p = bd->free;
    bd->free += n;

    IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
    return p;
}

this  change  will  greatly simplify optimizer's work. according to my
experience   current  C++  compilers  are  weak  on  optimizing  large
functions with complex execution paths and such transformations really
improve the generated code

--
Best regards,
 Bulat                          mailto:Bulat.Ziganshin at gmail.com


Reply | Threaded
Open this post in threaded view
|

optimizing StgPtr allocate (Capability *cap, W_ n)

Mikolaj Konarski-2
Hi Bulat!

Is this exactly the same email you posted to
glasgow-haskell-users@ previously (I'm asking to know
if I need to reread)? It was already forwarded here by none
other but SPJ. :)

Welcome,
Mikolaj


On Wed, Oct 15, 2014 at 12:01 AM, Bulat Ziganshin
<bulat.ziganshin at gmail.com> wrote:

> Hello,
>
> i'm looking a the https://github.com/ghc/ghc/blob/23bb90460d7c963ee617d250fa0a33c6ac7bbc53/rts/sm/Storage.c#L680
>
> if i correctly understand, it's speed-critical routine?
>
> i think that it may be improved in this way:
>
> StgPtr allocate (Capability *cap, W_ n)
> {
>     bdescr *bd;
>     StgPtr p;
>
>     TICK_ALLOC_HEAP_NOCTR(WDS(n));
>     CCS_ALLOC(cap->r.rCCCS,n);
>
> /// here starts new improved code:
>
>     bd = cap->r.rCurrentAlloc;
>     if (bd == NULL || bd->free + n > bd->end) {
>         if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
>             ....
>         }
>         if (bd->free + n <= bd->start + BLOCK_SIZE_W)
>             bd->end = min (bd->start + BLOCK_SIZE_W, bd->free + LARGE_OBJECT_THRESHOLD)
>             goto usual_alloc;
>         }
>         ....
>     }
>
> /// and here it stops
>
> usual_alloc:
>     p = bd->free;
>     bd->free += n;
>
>     IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
>     return p;
> }
>
>
> i  think  it's  obvious - we consolidate two if's on the crirical path
> into the single one plus avoid one ADD by keeping highly-useful bd->end pointer
>
> further   improvements   may   include   removing  bd==NULL  check  by
> initializing bd->free=bd->end=NULL   and   moving   entire   "if" body
> into   separate   slow_allocate()  procedure  marked  "noinline"  with
> allocate() probably marked as forceinline:
>
> StgPtr allocate (Capability *cap, W_ n)
> {
>     bdescr *bd;
>     StgPtr p;
>
>     TICK_ALLOC_HEAP_NOCTR(WDS(n));
>     CCS_ALLOC(cap->r.rCCCS,n);
>
>     bd = cap->r.rCurrentAlloc;
>     if (bd->free + n > bd->end)
>         return slow_allocate(cap,n);
>
>     p = bd->free;
>     bd->free += n;
>
>     IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
>     return p;
> }
>
> this  change  will  greatly simplify optimizer's work. according to my
> experience   current  C++  compilers  are  weak  on  optimizing  large
> functions with complex execution paths and such transformations really
> improve the generated code
>
> --
> Best regards,
>  Bulat                          mailto:Bulat.Ziganshin at gmail.com
>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://www.haskell.org/mailman/listinfo/ghc-devs
>

Reply | Threaded
Open this post in threaded view
|

optimizing StgPtr allocate (Capability *cap, W_ n)

Bulat Ziganshin-2
Hello Mikolaj,

Wednesday, October 15, 2014, 2:07:35 AM, you wrote:

i'm sorry, it's the same


> Hi Bulat!

> Is this exactly the same email you posted to
> glasgow-haskell-users@ previously (I'm asking to know
> if I need to reread)? It was already forwarded here by none
> other but SPJ. :)

> Welcome,
> Mikolaj


> On Wed, Oct 15, 2014 at 12:01 AM, Bulat Ziganshin
> <bulat.ziganshin at gmail.com> wrote:
>> Hello,
>>
>> i'm looking a the https://github.com/ghc/ghc/blob/23bb90460d7c963ee617d250fa0a33c6ac7bbc53/rts/sm/Storage.c#L680
>>
>> if i correctly understand, it's speed-critical routine?
>>
>> i think that it may be improved in this way:
>>
>> StgPtr allocate (Capability *cap, W_ n)
>> {
>>     bdescr *bd;
>>     StgPtr p;
>>
>>     TICK_ALLOC_HEAP_NOCTR(WDS(n));
>>     CCS_ALLOC(cap->r.rCCCS,n);
>>
>> /// here starts new improved code:
>>
>>     bd = cap->r.rCurrentAlloc;
>>     if (bd == NULL || bd->free + n > bd->end) {
>>         if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
>>             ....
>>         }
>>         if (bd->free + n <= bd->start + BLOCK_SIZE_W)
>>             bd->end = min (bd->start + BLOCK_SIZE_W, bd->free + LARGE_OBJECT_THRESHOLD)
>>             goto usual_alloc;
>>         }
>>         ....
>>     }
>>
>> /// and here it stops
>>
>> usual_alloc:
>>     p = bd->free;
>>     bd->free += n;
>>
>>     IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
>>     return p;
>> }
>>
>>
>> i  think  it's  obvious - we consolidate two if's on the crirical path
>> into the single one plus avoid one ADD by keeping highly-useful bd->end pointer
>>
>> further   improvements   may   include   removing  bd==NULL  check  by
>> initializing bd->free=bd->end=NULL   and   moving   entire   "if" body
>> into   separate   slow_allocate()  procedure  marked  "noinline"  with
>> allocate() probably marked as forceinline:
>>
>> StgPtr allocate (Capability *cap, W_ n)
>> {
>>     bdescr *bd;
>>     StgPtr p;
>>
>>     TICK_ALLOC_HEAP_NOCTR(WDS(n));
>>     CCS_ALLOC(cap->r.rCCCS,n);
>>
>>     bd = cap->r.rCurrentAlloc;
>>     if (bd->free + n > bd->end)
>>         return slow_allocate(cap,n);
>>
>>     p = bd->free;
>>     bd->free += n;
>>
>>     IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
>>     return p;
>> }
>>
>> this  change  will  greatly simplify optimizer's work. according to my
>> experience   current  C++  compilers  are  weak  on  optimizing  large
>> functions with complex execution paths and such transformations really
>> improve the generated code
>>
>> --
>> Best regards,
>>  Bulat                          mailto:Bulat.Ziganshin at gmail.com
>>
>> _______________________________________________
>> ghc-devs mailing list
>> ghc-devs at haskell.org
>> http://www.haskell.org/mailman/listinfo/ghc-devs
>>



--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin at gmail.com