RSS
 

Archive for August, 2009

SlimGen and You, Part ADD EAX, [EAX] of N

17 Aug

So far I’ve covered how SlimGen works and the difficulties in doing what it does, including calling convention issues that one must be made aware of when writing replacement methods for use with SlimGen.

So the next question arises, just how much of a difference can using SlimGen make? Well, a lot of that will depend on the developer and their skill level. But we also were pretty curious about this and so we slapped together a test sample that runs through a series of matrix multiplications and times it. It uses three arrays to perform the multiplications, two of the arrays contains 100,000 randomly generated matrixes, with the third being used as the destinations for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, and so they are overlap safe.

The timing results will vary, of course, from machine to machine depending on the processor in the machine, how much ram you have and also on what you’re doing at the time. Running the results against my Phenom 9850 I get:

Total Matrix Count Per Run:  100,000
Multiply        Total Ticks: 2,001,059
SlimGenMultiply Total Ticks: 1,269,200
Improvement:                 36.57 % 

While when I run it against my T8300 Core2 Duo laptop I get:

Total Matrix Count Per Run:  100,000
Multiply        Total Ticks: 2,175,380
SlimGenMultiply Total Ticks: 1,621,830
Improvement:                 25.45 %

Still, 25-35% improvement over the FPU based multiply is quite significant. Since X64 support hasn’t been fully hammered out (in that it “works” but hasn’t been sufficiently verified as working), those numbers are unavailable at the moment. However, they should be available in the near future as we finalize error handling and ensure that there are no bugs in the x64 assembly handling.

So why the great difference in performance? Well, part of it is the method size, the .Net method is 566 bytes of pure code, that’s over half a kilobyte of code that has to be walked through by the processor, code which needs to be brought into the instruction-cache on the CPU and executed, meanwhile the SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer hits you take and the more likely your code is to actually be IN the I-cache. Then there’s the instructions, SSE2 has been around for a while, and so it has had plenty of time to be wrangled around with by CPU manufacturers to ensure optimal performance. Finally there’s the memory hit issue, the SSE2 based code hits memory a minimal number of times, reducing the chances of cache misses, after the first read/write, except for a few cases.

Finally there’s how it deals with storage of the temporary results. The .Net FPU based version allocates a Matrix type on the stack, calls the constructor (which 0 initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, and copies the temporary matrix over the result matrix. The SSE2 version however doesn’t bother with initializing the stack and only stores three of the results on the stack, opting to write out the final result directly to the destination. The three other rows are then moved back into XMM registers and then back out to the destination.

The SSE2 source code, followed by the .Net source code, note that both are functionally equivalent:

start:      mov     eax, [esp + 4]
            movups  xmm4, [edx]
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]
 
            movups  xmm0, [ecx]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm1, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [esp - 0x20], xmm0 ; store row 0 of new matrix
 
            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [esp - 0x30], xmm0 ; store row 1 of new matrix
 
            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [esp - 0x40], xmm0 ; store row 2 of new matrix
 
            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x30], xmm0 ; store row 3 of new matrix
            movups  xmm0, [esp - 0x40]
            movups  [eax + 0x20], xmm0
            movups  xmm0, [esp - 0x30]
            movups  [eax + 0x10], xmm0
            movups  xmm0, [esp - 0x20]
            movups  [eax], xmm0
            ret     4

The .Net matrix multiplication source code:

public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {
    Matrix r;
    r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);
    r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);
    r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);
    r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);
    r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);
    r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);
    r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);
    r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);
    r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);
    r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);
    r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);
    r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);
    r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);
    r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);
    r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);
    r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);
    result = r;
}
 
Comments Off

Posted in .Net, SlimDX, SlimGen, Software Development

 

SlimGen and You, Part ADD AL, [RAX] of N

14 Aug

The question does arise though, when using SlimGen and writing your SSE replacement methods, what kind of calling convention does the CLR use?

The CLR uses a version of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like: X(‘c’, 2, 3.0f, “Hello”); becomes:

X('c', 2, 3.0f, "Hello");
00000025  push        40400000h ; 3.0f
0000002a  push        dword ptr ds:[03402088h] ;Address of "Hello"
00000030  mov         edx,2 
00000035  mov         ecx,63h ;'c'
0000003a  call        FFB8B040

The situation is the same for member functions as well, except with this being passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

p.Y(2, 3.0f);
0000006d  push        40400000h  ; 3.0f
00000072  mov         ecx,dword ptr [ebp-40h] ;this
00000075  mov         edx,2
0000007c  call        FFA1B048

So this all seems clear enough, but it’s important to note these differences, especially when you’re poking around in the low level bowels of the CLR or when you’re doing what SlimGen does: which is replacing actual method bodies.

So this does beget the question, what about on the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are in RCX, RDX, R8 and R9 (or smaller registers), unless those parameters are floating point types, in which case they are passed using XMM registers. 

Z('c', 2, 3.0f, "Hello", 1.0, pa);
000000c0  mov         r9,124D3100h 
000000ca  mov         r9,qword ptr [r9] ; "Hello"
000000cd  mov         rax,qword ptr [rsp+38h] ;pa (IntPtr[])
000000d2  mov         qword ptr [rsp+28h],rax ;pa - stack spill
000000d7  movsd       xmm0,mmword ptr [00000118h] ;1.0
000000df  movsd       mmword ptr [rsp+20h],xmm0 ;1.0 - stack spill
000000e5  movss       xmm2,dword ptr [00000110h] ;3.0f
000000ed  mov         edx,2 ;int (2)
000000f2  mov         cx,63h ;'c' 
000000f6  call        FFFFFFFFFFEC9300

Whew, that looks pretty nasty doesn’t it? But if you notice, pretty much every single parameter to that function is passed in a register. The stack spillage is part of the calling convention to allow for variables to be spilled into memory (or read back from memory) when the register needs to be used. Calling an instance method follows pretty much the same rules, except the this pointer is passed in RCX first.

p.Q(~0L, ~1L, ~2L, ~3);
0000010a  mov         rcx,qword ptr [rsp+30h] ; this pointer
0000010f  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh ;~3L, spilled to stack
00000118  mov         r9,0FFFFFFFFFFFFFFFDh ;~2L
0000011f  mov         r8,0FFFFFFFFFFFFFFFEh ;~1L
00000126  mov         rdx,0FFFFFFFFFFFFFFFFh ;~0L
0000012d  call        FFFFFFFFFFEC9310</p>

Calling a function and passing something larger than a register can store does pose an interesting problem, the CLR deals with it by moving the entire data onto the stack, and passing it (hence call by value)

var v = new Vector();
p.R(v);
00000169  lea         rcx,[rsp+40h] 
0000016e  mov         rax,qword ptr [rcx] 
00000171  mov         qword ptr [rsp+50h],rax 
00000176  mov         rax,qword ptr [rcx+8] 
0000017a  mov         qword ptr [rsp+58h],rax 
0000017f  lea         rdx,[rsp+50h] 
00000184  mov         rcx,r8 
00000187  call        FFFFFFFFFFEC9318

As you can see, it copies the data from the vector onto the stack, stores the this pointer in RCX, and then calls to the function. This is why pass by reference is the preferred method (for fast code) to move around structures that are non-trivial.

All of this goes into calcuating our matrix multiplication method (which assumes the output is not one of the inputs):

BITS        32
ORG         0x59f0
;           void Multiply(ref Matrix, ref Matrix, out Matrix)
start:      mov     eax, [esp + 4]
            movups  xmm4, [edx]
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]
 
            movups  xmm0, [ecx]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm1, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax], xmm0 ; Calculate row 0 of new matrix
 
            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x10], xmm0 ; Calculate row 1 of new matrix
 
            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x20], xmm0 ; Calculate row 2 of new matrix
 
            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x30], xmm0 ; Calculate row 3 of new matrix
            ret     4
 
Comments Off

Posted in .Net, SlimDX, SlimGen, Software Development

 

SlimGen and You, Part ADD [EAX], EAX of N

07 Aug

So previously we delved into one of the nastier performance corners on the .Net framework. Today I’m going to introduce you to a tool, that is in development currently, which allows you to take those slow math functions of yours and replace them with high performance SSE optimized methods.

We’ve called it SlimGen, which although not exactly accurate, does fit nicely in with the other Slim projects currently underway including SlimTune, and the flagship that started it all, SlimDX.

So what does SlimGen do? Well, you pass it a .Net assembly and it replaces the native method bodies, which are generated using NGEN, with replacement ones written in assembly (for now). This modified assembly then replaces the original assembly that was stored in the native image store. SlimGen can operate on signed and unsigned assemblies alike, as the native image is not signed, more on this later though.

Managed PE files contain a great deal of metadata stored in tables. You can enumerate these tables and parse them yourself, for instance if you were writing your own CLR. Thankfully though, the .Net framework comes with several COM interfaces that are very helpful in accessing these tables without having to manually parse them out of the PE file, this is especially useful since the table rows are are not a fixed format. Specifically, indexes in the tables can be either a 2 bytes or 4 bytes in size depending on the size of the dataset indexed. In the case of SlimGen we use the IMetaDataImport2 interface for accessing the metadata.

Of course, the managed metadata does not contain all of the information we need. NGEN manipulates the managed assembly and introduces pre-jitted versions of the functions contained within the assembly. However, their managed counterparts remain in the assembly and are what the metadata tables reference to. So how does one go from a managed method and its IL to the associated unmanaged code? Well, the CLR header of a PE file does contain a pointer to a table for a native header. However the exact format of that table is undocumented and as such it makes it hard to parse it and find the information we need. Therefore we have to use an alternative method…

When you load up an assembly the CLR generates, using the metadata and other information found in the PE file, a set of runtime tables that it uses to indicate information about where things are in memory, and their current state. For instance, it can tell if its jitted a method or not. When you load up an assembly that’s been NGENed, it checks the native images for an associated copy, assuming your assembly validates, and will load up the NGENed assembly and parse out the appropriate information from that. Therefore we need some way of gaining access to these runtime generated tables. Enter the debugger.

The .Net framework exposes debugging interfaces that are quite trivial to implement, but more important, they give you access to all of the runtime information available to the CLR. In the case of SlimGen what we do is load up your assembly (not run) into a host process and then simply have the host process execute a debugger breakpoint. The SlimGen analyzer first initializes its self as a debugger and then executes the host process as the attached debugger. When the breakpoint is hit, it breaks into the analyzer, which can then begin the work of processing the loaded assemblies. Since SlimGen knows which assembly it fed to the host, it is able to filter out all of the other assemblies that have been loaded and focus in on the one we care about. First we check and see if a native version of the assembly has been loaded, for if one hasn’t been loaded there is no point in continuing. if not then we simply report an error and cleanup. Assuming there is a native version of the assembly loaded then we use the aforementioned metadata interfaces to walk the assembly and find all of the methods that have been marked for replacement. Each method is examined to ensure that it has a native counterpart, and if it doesn’t another warning is issued and the method is skipped.

Now comes the annoying part. In .Net 1.x the framework had each method exist within a singular code chunk, which made extracting that code quite easy. However in .Net 2.x and forward the framework allows a method to have multiple code chunks, each with a different base address and length. This is theoretically to allow an optimizer to spread work its magic, but it does make extracting methods harder. SlimGen will generate an assembly file per chunk and all of the associated binaries for each chunk, generated from the assembly files, must be present for the method to be replaced. No dangling chunks please. The SlimGen analyzer extracts each base address from each chunk, along with the module base address. Using that information we can then calculate the relative virtual address of each method’s native counterpart within the NGENed file.

Using that information the SlimGen client simply walks a copy of the native image performing the replacement of each method, and then when done (and assuming no errors), copies it back over the original NGEN image. Tada, you now have your highly optimized SSE code running in a managed application with no managed –> unmanaged transitions in sight.

 
Comments Off

Posted in .Net, SlimDX, SlimGen, Software Development