
Author Archive

Building a SlimDX MiniTriangle sample with Direct3D11 and IronPython

26 Aug

I generally don’t post huge code dumps, mainly because I find them more annoying and less helpful than some books/authors might. But you know, I’ve been playing with IronPython/SlimDX recently and decided to do up another SlimDX Sample (demonstrating DX11), except in IronPython this time. This will be in the SlimDX samples sometime soon!

import clr
clr.AddReference('System.Windows.Forms')
clr.AddReference('System.Drawing')
clr.AddReference('SlimDX')
 
from System import *
from System.Drawing import Size
from System.Windows.Forms import Form, Application, MessageBox, FormBorderStyle
from SlimDX import *
from SlimDX.Direct3D11 import *
from SlimDX.DXGI import SwapChainDescription, SwapChainFlags, ModeDescription, SampleDescription, Usage, SwapEffect, Format, PresentFlags, Factory, WindowAssociationFlags
from SlimDX.D3DCompiler import *
from SlimDX.Windows import MessagePump
 
class GameObject:
    def Render(self):
        pass
    def Tick(self):
        pass
 
class GraphicsDevice(IDisposable):
    Context = property(lambda self: self.context)
    Device = property(lambda self: self.device)
    SwapChain = property(lambda self: self.swapChain)
 
    def __init__(self, control, fullscreen):
        self.fullscreen = fullscreen
        self.control = control
 
        control.Resize += lambda sender, args: self.Resize()
 
        swapChainDesc = self.CreateSwapChainDescription()
        success,self.device,self.swapChain = Device.CreateWithSwapChain(DriverType.Hardware, DeviceCreationFlags.None, Array[FeatureLevel]([FeatureLevel.Level_11_0, FeatureLevel.Level_10_1, FeatureLevel.Level_10_0]), swapChainDesc)
        self.context = self.Device.ImmediateContext
 
        with self.swapChain.GetParent[Factory]() as factory:
            factory.SetWindowAssociation(self.control.Handle, WindowAssociationFlags.IgnoreAll)
 
        with Resource.FromSwapChain[Texture2D](self.swapChain, 0) as backBuffer:
            self.backBufferRTV = RenderTargetView(self.Device, backBuffer)
 
        self.Resize()        
 
    def CreateSwapChainDescription(self):
        swapChainDesc = SwapChainDescription()
        swapChainDesc.IsWindowed = not self.fullscreen
        swapChainDesc.BufferCount = 1
        swapChainDesc.ModeDescription = ModeDescription(self.control.ClientSize.Width, self.control.ClientSize.Height, Rational(60, 1), Format.R8G8B8A8_UNorm)
        swapChainDesc.Flags = SwapChainFlags.None
        swapChainDesc.SwapEffect = SwapEffect.Discard
        swapChainDesc.Usage = Usage.RenderTargetOutput
        swapChainDesc.SampleDescription = SampleDescription(1, 0)
        swapChainDesc.OutputHandle = self.control.Handle
        return swapChainDesc
 
    def Resize(self):
        self.Context.ClearState()
        self.backBufferRTV.Dispose()
        self.swapChain.ResizeBuffers(1, self.control.ClientSize.Width, self.control.ClientSize.Height, Format.R8G8B8A8_UNorm, SwapChainFlags.None)
        with Resource.FromSwapChain[Texture2D](self.swapChain, 0) as backBuffer:
            self.backBufferRTV = RenderTargetView(self.Device, backBuffer)
        self.Context.Rasterizer.SetViewports(Viewport(0, 0, self.control.ClientSize.Width, self.control.ClientSize.Height, 0.0, 1.0))
 
    def BeginRender(self):
        self.Context.ClearRenderTargetView(self.backBufferRTV, Color4(0, 0, 0, 0))
        self.Context.OutputMerger.SetTargets(self.backBufferRTV)
 
 
    def EndRender(self):
        self.swapChain.Present(0, PresentFlags.None)
 
    def Dispose(self):
        self.backBufferRTV.Dispose()
        self.swapChain.Dispose()
        self.device.Dispose()
 
 
class TriangleObject(GameObject):
    def __init__(self, game):
        self.game = game
        device = game.GraphicsDevice.Device
        context = game.GraphicsDevice.Context
 
        err = clr.Reference[str]()
        with ShaderBytecode.CompileFromFile("SimpleTriangle10.fx", "fx_5_0", ShaderFlags.None, EffectFlags.None, None, None, err) as shaderByteCode:
            self.effect = Effect(device, shaderByteCode)
 
        shaderTechnique = self.effect.GetTechniqueByIndex(0)
        self.shaderPass = shaderTechnique.GetPassByIndex(0)
 
        sig = self.shaderPass.Description.Signature
        self.inputLayout = InputLayout(device, sig, Array[InputElement]([InputElement("POSITION", 0, Format.R32G32B32A32_Float, 0, 0), InputElement("COLOR", 0, Format.R32G32B32A32_Float, 16, 0)]))
 
        bufferDesc = BufferDescription(3 * 32, ResourceUsage.Dynamic, BindFlags.VertexBuffer, CpuAccessFlags.Write, ResourceOptionFlags.None, 0)
        self.vertexBuffer = Buffer(device, bufferDesc)
 
        stream = context.MapSubresource(self.vertexBuffer, 0, 3 * 32, MapMode.WriteDiscard, MapFlags.None).Data
        data = Array[Vector4]([
            Vector4(0.0, 0.5, 0.5, 1.0), Vector4(1.0, 0.0, 0.0, 1.0),
            Vector4(0.5, -0.5, 0.5, 1.0), Vector4(0.0, 1.0, 0.0, 1.0),
            Vector4(-0.5, -0.5, 0.5, 1.0), Vector4(0.0, 0.0, 1.0, 1.0)
        ])
        stream.WriteRange(data)
        context.UnmapSubresource(self.vertexBuffer, 0)
 
    def Render(self):
        context = self.game.GraphicsDevice.Context
        context.InputAssembler.InputLayout = self.inputLayout
        context.InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList
        context.InputAssembler.SetVertexBuffers(0, VertexBufferBinding(self.vertexBuffer, 32, 0))
        self.shaderPass.Apply(context)
        context.Draw(3, 0)
 
    def Dispose(self):
        self.effect.Dispose()
        self.inputLayout.Dispose()
        self.vertexBuffer.Dispose()
 
class Game(IDisposable):
    GraphicsDevice = property(lambda self: self.graphicsDevice)
 
    def __init__(self, width, height, fullscreen = False):
        self.fullscreen = fullscreen
        self.form = GameForm(width, height, fullscreen)
        self.form.Visible = True
        self.graphicsDevice = GraphicsDevice(self.form, self.fullscreen)
        self.gameObjects = [TriangleObject(self)]
 
    def Run(self):
        Application.Idle += self.OnIdle
        Application.Run(self.form)
 
    def OnIdle(self, sender, ea):
        while MessagePump.IsApplicationIdle:
            self.Update()
            self.Render()
 
    def Update(self):
        for i in self.gameObjects:
            i.Tick()
 
    def Render(self):
        self.GraphicsDevice.BeginRender()
        for i in self.gameObjects:
            i.Render()
        self.GraphicsDevice.EndRender()
 
    def Dispose(self):
        self.GraphicsDevice.Dispose()
        for i in self.gameObjects:
            if 'Dispose' in dir(i):
                i.Dispose()
        self.form.Dispose(True)
 
class GameForm(Form):
    def __init__(self, width, height, fullscreen):
        self.ClientSize = Size(width, height)
        if fullscreen:
            self.FormBorderStyle = FormBorderStyle.None
 
if __name__ == "__main__":
    try:
        with Game(640, 480) as game:
            game.Run()
    except Exception as e:
        MessageBox.Show(e.ToString())
 
Posted in .Net, Python, SlimDX

 

Is It Really A Bug For A Beginner To Be Using C-Strings In C++?

26 Apr

Depends, but probably yes.

A beginning programmer should be focusing on learning to program. That is: the process of taking a concept and turning it into an application. Problem solving, in other words. Learning to program is not the same thing as learning a programming language. Learning a programming language is about learning the syntax and standard library that come with said programming language; it may involve the process of problem solving, but that is not its primary concern.

Given that, one can quickly see that the best way to introduce a beginning programmer to programming is to get them to use a language that is quick and easy to get up and running in. There are many languages that are quick and easy to get up and running in, and they all tend to share a rather similar component… which is a lack of verbosity. Python and Ruby are two prime examples, both of which have a very simple language syntax which allows for a lot of leeway for the programmer, without all the extra clutter that many other languages have (C++ *cough*). Another good choice, in my opinion, is C# which, when combined with Microsoft Visual C#, provides a very robust but easy to learn language. These languages all have many key features which make them easy to learn and use: all of them are generally garbage collected, they all have fairly simple syntax with few (if any) corner cases, and all of them have huge standard libraries that provide a great deal of quick and easy to use functionality with minimal programmer effort.

C++ has almost none of those things. While there is Visual Studio for it, the IntelliSense is still not perfect, even with the help of tools like WholeTomato’s VAX. The standard library is quite small, dealing mainly with file and console IO, and some minimal containers. It leaves the rest of the work up to the developer. This means that for any sufficiently complex project you will either end up implementing a majority of the behaviors needed yourself, or having to dig up third party libraries and APIs for said behavior. Even the recent C++0x work hasn’t really alleviated the problem. Then you have the language complexity, on which I’ve commented previously.

However, the C++ standard library does provide some features that should be in every developer’s pocketbook… such as std::string. std::string behaves a lot more like how a beginning programmer expects a primitive type to behave. They’ve learned that you can add integers and floats together, so why can’t they add strings together? Well, with std::string they can, but with c-strings they can’t. They’ve learned to compare integers and floats using the standard == operator, so why can’t they do that with strings? With std::string they can, but with c-strings they can’t (well, they “can”, but the behavior is not what they want). They’ve learned how to read in integers and floats from std::cin, so why can’t they do the same with strings? They can with std::string, but with c-strings they have to be careful of the length and also that they’ve pre-allocated it, which has hazards of its own… such as stack space issues when they try to create a char array of 5000 characters.

C-strings do not behave intuitively. They have no inherent length, instead relying on a null terminator to indicate the end of the string. They cannot be trivially concatenated, instead requiring the user to ensure that appropriate space is available, and then they have to use various function calls to copy the string, and then they have to ensure that those string functions had the space required to copy the null terminator (which strncpy and other functions MAY omit if there isn’t enough space in the destination). Comparison requires the use of functions like strcmp, which doesn’t return true/false, but instead returns an integer indicating the string difference, with 0 being no differences. In a language where the user has been taught that 0/null generally means failure, remembering to test for 0 in that one-off corner case is rather strange.
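
To make the contrast concrete, here’s a minimal C++ sketch of the two operations discussed above, concatenation and comparison, done both ways. The function names (joinCStrings, joinStrings, sameCString, sameString) are made up for illustration, and the C-string versions are only one reasonable way of doing the bookkeeping by hand.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// C-string concatenation: the caller supplies the buffer and its size, and we
// must check by hand that both pieces plus the null terminator will fit.
bool joinCStrings(char* dest, std::size_t destSize, const char* a, const char* b) {
    const std::size_t needed = std::strlen(a) + std::strlen(b) + 1; // +1 for '\0'
    if (dest == nullptr || destSize < needed)
        return false; // not enough room; writing anyway would overflow the buffer
    std::strcpy(dest, a);
    std::strcat(dest, b);
    return true;
}

// std::string concatenation: operator+ allocates whatever space is needed.
std::string joinStrings(const std::string& a, const std::string& b) {
    return a + b;
}

// C-string equality: strcmp returns 0 when the strings match, so the result
// has to be compared against 0 explicitly.
bool sameCString(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// std::string equality: == compares contents, as a beginner would expect.
bool sameString(const std::string& a, const std::string& b) {
    return a == b;
}
```

Note how the std::string versions are one-liners built from the operators a beginner already knows, while the C-string versions need a size parameter, a failure path and an explicit test against 0.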

For a beginner, all that strangeness doesn’t equate to extra power or better performance. Instead it equates to extra confusion, and strange crashes. Had they been taught std::string first, they would have been free and clear, able to use the familiar operators they are used to, while being safe and secure in the bosom that is std::string. In fact, it generally gets worse than that, as c-strings are usually taught before pointers! This makes it even more confusing for the poor beginner, because then they’re introduced to arrays and pointers (instead of say std::vector), and now have a whole slew of new functionality to basically kill themselves with.

Thus, in conclusion, if you see a c-string in a beginner’s code, it probably means they have a bug somewhere in their code.

 
Posted in C++

 

SlimDX Direct3D10 X Loader

28 Nov

Here’s a useful class for loading X files (using SlimDX) into a Direct3D10 Mesh object. This is based off of Jack Hoxley’s C++ code from his journal post on GameDev.Net.

A few things to note about it: It doesn’t handle multiple materials (or materials at all). To handle that would require you to be sure to optimize the D3D9 mesh in place, then harvest the EffectInstances and also the materials. That way you could load the appropriate textures and bind them during rendering of the appropriate attributes. For simple X meshes this isn’t an issue, but some (like the Airplane model that comes with the DirectX SDK) have multiple textures.

using System;
using System.Windows.Forms;
using DXGI = SlimDX.DXGI;
using D3D9 = SlimDX.Direct3D9;
using SlimDX.Direct3D10;
 
namespace XMeshLoader {
    class XLoader : IDisposable {
        public XLoader() {
            CreateNullDevice();
        }
 
        #region IDisposable
        ~XLoader() {
            Dispose(false);
        }
 
        public void Dispose() {
            Dispose(true);
            // Cleanup is done, so the finalizer has nothing left to do.
            GC.SuppressFinalize(this);
        }
 
        private void Dispose(bool disposeManagedObjects) {
            if (disposeManagedObjects) {
                device9.Dispose();
                form.Dispose();
            }
        }
        #endregion
 
        public Mesh CreateMesh(Device device, D3D9.Mesh mesh9, out InputElement[] outDecls) {
            var inDecls = mesh9.GetDeclaration();
            outDecls = new InputElement[inDecls.Length - 1];
            ConvertDecleration(inDecls, outDecls);
 
            var flags = MeshFlags.None;
            if ((mesh9.CreationOptions & D3D9.MeshFlags.Use32Bit) != 0)
                flags = MeshFlags.Has32BitIndices;
 
            var mesh = new Mesh(device, outDecls, D3D9.DeclarationUsage.Position.ToString().ToUpper(), mesh9.VertexCount, mesh9.FaceCount, flags);
 
            ConvertIndexBuffer(mesh9, mesh);
            ConvertVertexBuffer(mesh9, mesh);
            ConfigureAttributeTable(mesh9, mesh);
 
            mesh.GenerateAdjacencyAndPointRepresentation(0);
            mesh.Optimize(MeshOptimizeFlags.Compact | MeshOptimizeFlags.AttributeSort | MeshOptimizeFlags.VertexCache);
 
            mesh.Commit();
            return mesh;
        }
 
        public Mesh LoadFile(Device device, string filename, out InputElement[] outDecls) {
            using (var mesh9 = D3D9.Mesh.FromFile(device9, filename, D3D9.MeshFlags.SystemMemory)) {
                return CreateMesh(device, mesh9, out outDecls);
            }
        }
 
        #region Implementation Details
        private static void ConfigureAttributeTable(D3D9.BaseMesh inMesh, Mesh outMesh) {
            var inAttribTable = inMesh.GetAttributeTable();
 
            if (inAttribTable == null || inAttribTable.Length == 0) {
                outMesh.SetAttributeTable(new[] {new MeshAttributeRange {
                    FaceCount = outMesh.FaceCount,
                    FaceStart = 0,
                    Id = 0,
                    VertexCount = outMesh.VertexCount,
                    VertexStart = 0
                }});
            } else {
                var outAttribTable = new MeshAttributeRange[inAttribTable.Length];
                for (var i = 0; i < inAttribTable.Length; ++i) {
                    outAttribTable[i].Id = inAttribTable[i].AttribId;
                    outAttribTable[i].FaceCount = inAttribTable[i].FaceCount;
                    outAttribTable[i].FaceStart = inAttribTable[i].FaceStart;
                    outAttribTable[i].VertexCount = inAttribTable[i].VertexCount;
                    outAttribTable[i].VertexStart = inAttribTable[i].VertexStart;
                }
                outMesh.SetAttributeTable(outAttribTable);
            }
 
            outMesh.GenerateAttributeBufferFromTable();
        }
 
        private static void ConvertIndexBuffer(D3D9.BaseMesh inMesh, Mesh outMesh) {
            using (var inStream = inMesh.LockIndexBuffer(D3D9.LockFlags.None))
            using (var outBuffer = outMesh.GetIndexBuffer()) {
                using (var outStream = outBuffer.Map()) {
                    // Copy indices as 32-bit or 16-bit values, matching the mesh's index format.
                    if ((outMesh.Flags & MeshFlags.Has32BitIndices) != 0)
                        outStream.WriteRange(inStream.ReadRange<int>(inMesh.FaceCount * 3));
                    else
                        outStream.WriteRange(inStream.ReadRange<short>(inMesh.FaceCount * 3));
                }
                outBuffer.Unmap();
            }
            inMesh.UnlockIndexBuffer();
        }
 
        private static void ConvertVertexBuffer(D3D9.BaseMesh inMesh, Mesh outMesh) {
            using (var inStream = inMesh.LockVertexBuffer(D3D9.LockFlags.None))
            using (var outBuffer = outMesh.GetVertexBuffer(0)) {
                using (var outStream = outBuffer.Map()) {
                    outStream.WriteRange(inStream.ReadRange<byte>(inMesh.VertexCount * inMesh.BytesPerVertex));
                }
                outBuffer.Unmap();
            }
            inMesh.UnlockVertexBuffer();
        }
 
        private static void ConvertDecleration(D3D9.VertexElement[] inDecls, InputElement[] outDecls) {
            for (var i = 0; i < inDecls.Length - 1; ++i) {
                outDecls[i].SemanticName = ConvertSemanticName(inDecls[i].Usage);
                outDecls[i].SemanticIndex = inDecls[i].UsageIndex;
                outDecls[i].AlignedByteOffset = inDecls[i].Offset;
                outDecls[i].Slot = inDecls[i].Stream;
                outDecls[i].Classification = InputClassification.PerVertexData;
                outDecls[i].InstanceDataStepRate = 0;
                outDecls[i].Format = ConvertFormat(inDecls[i].Type);
            }
        }
 
        private static string ConvertSemanticName(D3D9.DeclarationUsage usage) {
            switch (usage) {
                case D3D9.DeclarationUsage.TextureCoordinate:
                    return "TEXCOORD";
                case D3D9.DeclarationUsage.PositionTransformed:
                    return "POSITIONT";
                case D3D9.DeclarationUsage.TessellateFactor:
                    return "TESSFACTOR";
                case D3D9.DeclarationUsage.PointSize:
                    return "PSIZE";
                default:
                    return usage.ToString().ToUpper();
            }
        }
 
        private static DXGI.Format ConvertFormat(D3D9.DeclarationType type) {
            switch (type) {
                case D3D9.DeclarationType.Float1: return DXGI.Format.R32_Float;
                case D3D9.DeclarationType.Float2: return DXGI.Format.R32G32_Float;
                case D3D9.DeclarationType.Float3: return DXGI.Format.R32G32B32_Float;
                case D3D9.DeclarationType.Float4: return DXGI.Format.R32G32B32A32_Float;
                case D3D9.DeclarationType.Color: return DXGI.Format.R8G8B8A8_UNorm;
                case D3D9.DeclarationType.Ubyte4: return DXGI.Format.R8G8B8A8_UInt;
                case D3D9.DeclarationType.Short2: return DXGI.Format.R16G16_SInt;
                case D3D9.DeclarationType.Short4: return DXGI.Format.R16G16B16A16_SInt;
                case D3D9.DeclarationType.UByte4N: return DXGI.Format.R8G8B8A8_UNorm;
                case D3D9.DeclarationType.Short2N: return DXGI.Format.R16G16_SNorm;
                case D3D9.DeclarationType.Short4N: return DXGI.Format.R16G16B16A16_SNorm;
                case D3D9.DeclarationType.UShort2N: return DXGI.Format.R16G16_UNorm;
                case D3D9.DeclarationType.UShort4N: return DXGI.Format.R16G16B16A16_UNorm;
                case D3D9.DeclarationType.UDec3: return DXGI.Format.R10G10B10A2_UInt;
                case D3D9.DeclarationType.Dec3N: return DXGI.Format.R10G10B10A2_UNorm;
                case D3D9.DeclarationType.HalfTwo: return DXGI.Format.R16G16_Float;
                case D3D9.DeclarationType.HalfFour: return DXGI.Format.R16G16B16A16_Float;
                default: return DXGI.Format.Unknown;
            }
        }
 
        private void CreateNullDevice() {
            form = new Form();
            using (var direct3D = new D3D9.Direct3D())
                device9 = new D3D9.Device(direct3D, 0, D3D9.DeviceType.NullReference, form.Handle, D3D9.CreateFlags.HardwareVertexProcessing, new D3D9.PresentParameters {
                    BackBufferCount = 1,
                    BackBufferFormat = D3D9.Format.A8R8G8B8,
                    BackBufferHeight = 1,
                    BackBufferWidth = 1,
                    SwapEffect = D3D9.SwapEffect.Copy,
                    Windowed = true
                });
        }
 
        private Form form;
        private D3D9.Device device9;
        #endregion
    }
}
 
Posted in Uncategorized

 

A Simple C++ Quiz

07 Oct

Recently some people have been pestering me to post back up my C++ quizzes. So…without further ado here is the first one. The answers will be posted later.

  1. Given the following three lines of code, answer these questions
    int* p = new int[10];
    int* j = p + 11;
    int* k = p + 10;

    1. Is the second line well defined behavior?
    2. If the second line is well defined, where does the pointer point to?
    3. What are some of the legal operations that can be performed on the third pointer?
  2. What output should the following lines of code produce?
    int a = 10;
    std::cout<<a<<a++<<--a;
  3. Assuming the function called in the following block of code has no default parameters, and that no operators are overloaded, how many parameters does it take? Which objects are passed to it?
    f((a, b, c), d, e, ((g, h), i));
  4. Assuming the function called in the following block of code takes an A* and a B*, what is potentially wrong with the code?
    f(new A(), new B());
 
Posted in C++, Quizzes

 

SlimGen and You, Part ADD EAX, [EAX] of N

17 Aug

So far I’ve covered how SlimGen works and the difficulties in doing what it does, including calling convention issues that one must be made aware of when writing replacement methods for use with SlimGen.

So the next question arises: just how much of a difference can using SlimGen make? Well, a lot of that will depend on the developer and their skill level. But we also were pretty curious about this and so we slapped together a test sample that runs through a series of matrix multiplications and times it. It uses three arrays to perform the multiplications: two of the arrays contain 100,000 randomly generated matrices, with the third being used as the destination for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, and so they are overlap safe.

The timing results will vary, of course, from machine to machine depending on the processor in the machine, how much ram you have and also on what you’re doing at the time. Running the results against my Phenom 9850 I get:

Total Matrix Count Per Run:  100,000
Multiply        Total Ticks: 2,001,059
SlimGenMultiply Total Ticks: 1,269,200
Improvement:                 36.57 % 

While when I run it against my T8300 Core2 Duo laptop I get:

Total Matrix Count Per Run:  100,000
Multiply        Total Ticks: 2,175,380
SlimGenMultiply Total Ticks: 1,621,830
Improvement:                 25.45 %

Still, a 25-37% improvement over the FPU-based multiply is quite significant. Since x64 support hasn’t been fully hammered out (in that it “works” but hasn’t been sufficiently verified as working), those numbers are unavailable at the moment. However, they should be available in the near future as we finalize error handling and ensure that there are no bugs in the x64 assembly handling.
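
The “Improvement” figures quoted above are consistent with a plain relative reduction in ticks. Here’s a small C++ sketch of that arithmetic; the function name improvementPercent is mine, and the formula is an assumption that happens to reproduce both percentages from the tick counts shown, not something taken from the test sample’s source.

```cpp
// Percentage reduction in elapsed ticks relative to the baseline run.
// Assumed formula: (baseline - optimized) / baseline * 100.
double improvementPercent(double baselineTicks, double optimizedTicks) {
    return (baselineTicks - optimizedTicks) / baselineTicks * 100.0;
}
```

Feeding it the Phenom numbers (2,001,059 and 1,269,200 ticks) gives roughly 36.57, and the Core2 Duo numbers (2,175,380 and 1,621,830 ticks) give roughly 25.45.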

So why the great difference in performance? Well, part of it is the method size. The .Net method is 566 bytes of pure code; that’s over half a kilobyte of code that has to be brought into the instruction cache on the CPU and executed. Meanwhile the SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer hits you take and the more likely your code is to actually be IN the I-cache. Then there are the instructions: SSE2 has been around for a while, and so CPU manufacturers have had plenty of time to wrangle with it to ensure optimal performance. Then there’s the memory hit issue: after the first read/write, and except for a few cases, the SSE2-based code hits memory a minimal number of times, reducing the chances of cache misses.

Finally there’s how it deals with storage of the temporary results. The .Net FPU-based version allocates a Matrix type on the stack, calls the constructor (which zero-initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, and copies the temporary matrix over the result matrix. The SSE2 version, however, doesn’t bother with initializing the stack and only stores three of the result rows on the stack, opting to write out the final row directly to the destination. The three stored rows are then moved back into XMM registers and then back out to the destination.

The SSE2 source code comes first, followed by the .Net source code. Note that both are functionally equivalent:

start:      mov     eax, [esp + 4]
            movups  xmm4, [edx]
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]
 
            movups  xmm0, [ecx]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [esp - 0x20], xmm0 ; store row 0 of new matrix
 
            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [esp - 0x30], xmm0 ; store row 1 of new matrix
 
            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [esp - 0x40], xmm0 ; store row 2 of new matrix
 
            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x30], xmm0 ; store row 3 of new matrix
            movups  xmm0, [esp - 0x40]
            movups  [eax + 0x20], xmm0
            movups  xmm0, [esp - 0x30]
            movups  [eax + 0x10], xmm0
            movups  xmm0, [esp - 0x20]
            movups  [eax], xmm0
            ret     4

The .Net matrix multiplication source code:

public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {
    Matrix r;
    r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);
    r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);
    r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);
    r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);
    r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);
    r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);
    r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);
    r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);
    r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);
    r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);
    r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);
    r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);
    r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);
    r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);
    r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);
    r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);
    result = r;
}
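
For anyone who doesn’t read assembly, the row-at-a-time structure of the SSE2 routine above can be sketched in portable C++. This is only an illustration of the same arithmetic, not SlimGen’s code: the Matrix4 alias and the multiplyRows name are hypothetical, and plain loops stand in for the shufps/mulps/addps sequences. Each element of a left-hand row scales the matching row of the right-hand matrix, and the four scaled rows are summed to give one row of the result.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the Matrix type: 4 rows of 4 floats, row-major.
using Matrix4 = std::array<std::array<float, 4>, 4>;

// Computes result = left * right one row at a time. Rows are built in a
// temporary first and copied out at the end, so result may alias left or
// right (the "overlap safe" property both versions above rely on).
void multiplyRows(const Matrix4& left, const Matrix4& right, Matrix4& result) {
    Matrix4 temp;
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            // left[i][k] plays the role of the broadcast (shuffled) element,
            // right[k] the role of the row held in xmm4..xmm7.
            temp[i][j] = left[i][0] * right[0][j]
                       + left[i][1] * right[1][j]
                       + left[i][2] * right[2][j]
                       + left[i][3] * right[3][j];
        }
    }
    result = temp;
}
```

Every element of temp is written before it is read, so, like the SSE2 version, the sketch skips the zero-initialization step that the .Net version pays for.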
 
Posted in .Net, SlimDX, SlimGen, Software Development

 

SlimGen and You, Part ADD AL, [RAX] of N

14 Aug

The question does arise though, when using SlimGen and writing your SSE replacement methods, what kind of calling convention does the CLR use?

The CLR uses a version of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like X('c', 2, 3.0f, "Hello"); becomes:

X('c', 2, 3.0f, "Hello");
00000025  push        40400000h ; 3.0f
0000002a  push        dword ptr ds:[03402088h] ;Address of "Hello"
00000030  mov         edx,2 
00000035  mov         ecx,63h ;'c'
0000003a  call        FFB8B040

The situation is the same for member functions as well, except with this being passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

p.Y(2, 3.0f);
0000006d  push        40400000h  ; 3.0f
00000072  mov         ecx,dword ptr [ebp-40h] ;this
00000075  mov         edx,2
0000007c  call        FFA1B048

So this all seems clear enough, but it’s important to note these differences, especially when you’re poking around in the low level bowels of the CLR or when you’re doing what SlimGen does: which is replacing actual method bodies.
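
As a rough mental model of the x86 rules above, here’s a C++ sketch that works out where each argument of a call would go. Everything in it (ArgKind, ArgPlacement, placeArgsX86) is a hypothetical helper written for this post, not part of SlimGen or the CLR, and it assumes that floating point arguments always go on the stack, which matches the two listings above but is otherwise my assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Just enough classification for this model.
enum class ArgKind { FitsInDword, Float, Larger };

struct ArgPlacement {
    std::vector<std::string> registers;  // e.g. "ECX=arg0"
    std::vector<std::size_t> pushOrder;  // argument indices, in the order pushed
};

// Models the x86 convention described above: `this` (if any) takes ECX, the
// remaining register(s) go to the first DWORD-or-smaller non-float arguments,
// and everything else is pushed left to right.
ArgPlacement placeArgsX86(const std::vector<ArgKind>& args, bool hasThis) {
    static const char* const regs[] = {"ECX", "EDX"};
    ArgPlacement placement;
    std::size_t nextReg = 0;
    if (hasThis)
        placement.registers.push_back(std::string(regs[nextReg++]) + "=this");
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == ArgKind::FitsInDword && nextReg < 2)
            placement.registers.push_back(std::string(regs[nextReg++]) + "=arg" + std::to_string(i));
        else
            placement.pushOrder.push_back(i);  // left-to-right push order
    }
    return placement;
}
```

For X('c', 2, 3.0f, "Hello") the model puts arguments 0 and 1 in ECX and EDX and pushes argument 2 and then argument 3, which is the order the first listing shows. For p.Y(2, 3.0f) it puts this in ECX, the 2 in EDX, and pushes the float.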

So this does raise the question: what about the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers), unless those parameters are floating point types, in which case they are passed in XMM registers.

Z('c', 2, 3.0f, "Hello", 1.0, pa);
000000c0  mov         r9,124D3100h 
000000ca  mov         r9,qword ptr [r9] ; "Hello"
000000cd  mov         rax,qword ptr [rsp+38h] ;pa (IntPtr[])
000000d2  mov         qword ptr [rsp+28h],rax ;pa - stack spill
000000d7  movsd       xmm0,mmword ptr [00000118h] ;1.0
000000df  movsd       mmword ptr [rsp+20h],xmm0 ;1.0 - stack spill
000000e5  movss       xmm2,dword ptr [00000110h] ;3.0f
000000ed  mov         edx,2 ;int (2)
000000f2  mov         cx,63h ;'c' 
000000f6  call        FFFFFFFFFFEC9300

Whew, that looks pretty nasty, doesn’t it? But if you look closely, the first four parameters are all passed in registers. Only the fifth and sixth (1.0 and pa) go through memory, at [rsp+20h] and [rsp+28h]; those are the stores marked as stack spills in the listing. They sit just above the 32 bytes of stack that the caller reserves so the callee can spill the four register parameters to memory (and read them back) when it needs those registers for something else. Calling an instance method follows pretty much the same rules, except that the this pointer is passed in RCX first.

p.Q(~0L, ~1L, ~2L, ~3);
0000010a  mov         rcx,qword ptr [rsp+30h] ; this pointer
0000010f  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh ;~3L, spilled to stack
00000118  mov         r9,0FFFFFFFFFFFFFFFDh ;~2L
0000011f  mov         r8,0FFFFFFFFFFFFFFFEh ;~1L
00000126  mov         rdx,0FFFFFFFFFFFFFFFFh ;~0L
0000012d  call        FFFFFFFFFFEC9310
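The same kind of rough Python model works for x64. The names here are again hypothetical. It assumes each of the first four argument positions owns one integer register and one XMM register, and that later arguments are stored in 8-byte slots starting at [rsp+20h], as seen at the call sites above.

```python
INT_REGS = ["RCX", "RDX", "R8", "R9"]
FLOAT_REGS = ["XMM0", "XMM1", "XMM2", "XMM3"]


def place_args_x64(args, is_instance=False):
    """Return (label, location) pairs for a call under the x64 rules above.

    args is a list of (label, kind) pairs, with kind "int" or "float".
    """
    if is_instance:
        # The this pointer takes the first position, and therefore RCX.
        args = [("this", "int")] + list(args)

    placement = []
    for slot, (label, kind) in enumerate(args):
        if slot < 4:
            # The position picks the register; the kind picks the register file.
            reg = FLOAT_REGS[slot] if kind == "float" else INT_REGS[slot]
            placement.append((label, reg))
        else:
            # Fifth and later arguments are stored above the 32 reserved bytes.
            offset = 0x20 + 8 * (slot - 4)
            placement.append((label, "[rsp+%02Xh]" % offset))
    return placement


for label, where in place_args_x64(
        [("'c'", "int"), ("2", "int"), ("3.0f", "float"),
         ('"Hello"', "int"), ("1.0", "float"), ("pa", "int")]):
    print(label, "->", where)
```

For the Z call this puts 3.0f in XMM2 and "Hello" in R9, with 1.0 and pa at [rsp+20h] and [rsp+28h], matching the first x64 listing. With is_instance=True and the four arguments of p.Q, the last argument lands at [rsp+20h], matching the second.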

Passing something larger than a register can hold poses an interesting problem. The CLR deals with it by copying the entire value onto the stack and passing the address of that copy (hence call by value):

var v = new Vector();
p.R(v);
00000169  lea         rcx,[rsp+40h] 
0000016e  mov         rax,qword ptr [rcx] 
00000171  mov         qword ptr [rsp+50h],rax 
00000176  mov         rax,qword ptr [rcx+8] 
0000017a  mov         qword ptr [rsp+58h],rax 
0000017f  lea         rdx,[rsp+50h] 
00000184  mov         rcx,r8 
00000187  call        FFFFFFFFFFEC9318

As you can see, it copies the data from the vector onto the stack, puts the address of that copy in RDX, stores the this pointer in RCX, and then calls the function. This is why pass by reference is the preferred method (for fast code) to move around structures that are non-trivial.

All of this goes into writing our matrix multiplication method (which assumes the output is not one of the inputs):

BITS        32
ORG         0x59f0
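;           void Multiply(ref Matrix, ref Matrix, out Matrix)
;           ecx = first matrix, edx = second matrix, [esp + 4] = output matrix
start:      mov     eax, [esp + 4]      ; eax = address of the output matrix
            movups  xmm4, [edx]         ; xmm4..xmm7 = rows 0..3 of the second matrix
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]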
 
            movups  xmm0, [ecx]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm1, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax], xmm0 ; Calculate row 0 of new matrix
 
            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x10], xmm0 ; Calculate row 1 of new matrix
 
            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x20], xmm0 ; Calculate row 2 of new matrix
 
            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
 
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
 
            movups  [eax + 0x30], xmm0 ; Calculate row 3 of new matrix
            ret     4                  ; return and pop the one stack argument
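If the shuffles make it hard to see what is being computed, the same row-at-a-time scheme can be written out in plain Python. This is only a reference sketch for checking results, not SlimGen or SlimDX code, and the function name is made up. It also adds the four products in a different order than the assembly does, so floating point results could differ in the last bits.

```python
def multiply_rows(left, right):
    """Multiply two 4x4 row-major matrices the way the SSE routine does.

    For each row of `left`, every element is broadcast across a register
    (the shufps step), multiplied by the matching row of `right` (mulps),
    and the four products are added together (addps).
    """
    result = []
    for row in left:
        out = [0.0, 0.0, 0.0, 0.0]
        for j in range(4):
            broadcast = row[j]  # shufps with 0x00, 0x55, 0xAA or 0xFF
            for k in range(4):
                out[k] += broadcast * right[j][k]  # mulps, then addps
        result.append(out)
    return result


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
sample = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(multiply_rows(identity, sample) == sample)  # prints True
```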
 

Posted in .Net, SlimDX, SlimGen, Software Development

 

SlimGen and You, Part ADD [EAX], EAX of N

07 Aug

So previously we delved into one of the nastier performance corners of the .Net framework. Today I’m going to introduce you to a tool, currently in development, that allows you to take those slow math functions of yours and replace them with high performance SSE optimized methods.

We’ve called it SlimGen, which although not exactly accurate, does fit nicely in with the other Slim projects currently underway including SlimTune, and the flagship that started it all, SlimDX.

So what does SlimGen do? Well, you pass it a .Net assembly and it replaces the native method bodies, which are generated using NGEN, with replacement ones written in assembly (for now). This modified assembly then replaces the original assembly that was stored in the native image store. SlimGen can operate on signed and unsigned assemblies alike, as the native image is not signed. More on this later though.

Managed PE files contain a great deal of metadata stored in tables. You can enumerate these tables and parse them yourself, for instance if you were writing your own CLR. Thankfully though, the .Net framework comes with several COM interfaces that are very helpful in accessing these tables without having to manually parse them out of the PE file. This is especially useful since the table rows are not a fixed format. Specifically, indexes in the tables can be either 2 bytes or 4 bytes in size, depending on the size of the dataset indexed. In the case of SlimGen we use the IMetaDataImport2 interface for accessing the metadata.
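To give a feel for why the row format is not fixed, here is a small Python sketch of the index size rule as I understand it from the ECMA-335 metadata format. The function names are made up, and SlimGen itself never needs this because IMetaDataImport2 hides it. A plain index into one table is 2 bytes while that table has fewer than 2^16 rows. A coded index, which can point into one of several tables, gives up a few low bits to a tag that says which table, so it has to grow to 4 bytes sooner.

```python
def simple_index_size(row_count):
    """Size in bytes of an index into a single metadata table."""
    return 2 if row_count < 2 ** 16 else 4


def coded_index_size(row_counts):
    """Size in bytes of a coded index over several candidate tables.

    row_counts holds the row count of every table the index may refer to.
    The tag needs enough bits to tell those tables apart.
    """
    tag_bits = max(1, (len(row_counts) - 1).bit_length())
    return 2 if max(row_counts) < 2 ** (16 - tag_bits) else 4


print(simple_index_size(70000))           # 4
print(coded_index_size([120, 40, 5]))     # 2
print(coded_index_size([20000, 40, 5]))   # 4, three tables need 2 tag bits
```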

Of course, the managed metadata does not contain all of the information we need. NGEN manipulates the managed assembly and introduces pre-jitted versions of the functions contained within the assembly. However, their managed counterparts remain in the assembly and are what the metadata tables refer to. So how does one go from a managed method and its IL to the associated unmanaged code? Well, the CLR header of a PE file does contain a pointer to a table for a native header. However, the exact format of that table is undocumented, which makes it hard to parse it and find the information we need. Therefore we have to use an alternative method…

When you load up an assembly the CLR generates, using the metadata and other information found in the PE file, a set of runtime tables that it uses to track where things are in memory and their current state. For instance, it can tell whether it has jitted a method or not. When you load up an assembly that’s been NGENed, it checks the native images for an associated copy and, assuming your assembly validates, loads up the NGENed assembly and parses out the appropriate information from that. Therefore we need some way of gaining access to these runtime generated tables. Enter the debugger.

The .Net framework exposes debugging interfaces that are quite trivial to implement, but more importantly, they give you access to all of the runtime information available to the CLR. In the case of SlimGen what we do is load up your assembly (not run it) into a host process and then simply have the host process execute a debugger breakpoint. The SlimGen analyzer first initializes itself as a debugger and then executes the host process with itself as the attached debugger. When the breakpoint is hit, it breaks into the analyzer, which can then begin the work of processing the loaded assemblies. Since SlimGen knows which assembly it fed to the host, it is able to filter out all of the other assemblies that have been loaded and focus in on the one we care about. First we check whether a native version of the assembly has been loaded, for if one hasn’t been loaded there is no point in continuing. If not, we simply report an error and clean up. Assuming there is a native version of the assembly loaded, we use the aforementioned metadata interfaces to walk the assembly and find all of the methods that have been marked for replacement. Each method is examined to ensure that it has a native counterpart, and if it doesn’t, a warning is issued and the method is skipped.

Now comes the annoying part. In .Net 1.x the framework had each method exist within a singular code chunk, which made extracting that code quite easy. However, in .Net 2.x and forward the framework allows a method to have multiple code chunks, each with a different base address and length. This is theoretically to allow an optimizer to work its magic, but it does make extracting methods harder. SlimGen will generate an assembly file per chunk, and all of the associated binaries for each chunk, generated from the assembly files, must be present for the method to be replaced. No dangling chunks please. The SlimGen analyzer extracts each base address from each chunk, along with the module base address. Using that information we can then calculate the relative virtual address of each method’s native counterpart within the NGENed file.
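The address arithmetic itself is just a subtraction per chunk. A minimal Python sketch, with a made-up function name and made-up addresses:

```python
def chunk_rvas(module_base, chunks):
    """Turn absolute chunk addresses into relative virtual addresses.

    chunks is a list of (start_address, length) pairs as seen in the loaded
    process; module_base is where the native image was loaded.
    """
    rvas = []
    for start, length in chunks:
        if start < module_base:
            raise ValueError("chunk starts before the module base")
        rvas.append((start - module_base, length))
    return rvas


# Hypothetical method split into two chunks inside a module loaded at 0x10000000.
for rva, length in chunk_rvas(0x10000000, [(0x10001200, 64), (0x10004000, 16)]):
    print(hex(rva), length)
```

Because the result is relative to the module base, it stays valid no matter where the image happens to be loaded the next time.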

Using that information the SlimGen client simply walks a copy of the native image performing the replacement of each method, and then when done (and assuming no errors), copies it back over the original NGEN image. Tada, you now have your highly optimized SSE code running in a managed application with no managed –> unmanaged transitions in sight.
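As a rough illustration of that last step, here is a Python sketch of patching one chunk in a copy of an image held in memory. It is not how the SlimGen client is actually written. The function name, the rule that a replacement must fit inside the original chunk, and the int3 (0xCC) padding are all assumptions made for the example, and it takes a file offset as given rather than deriving it from the relative virtual address.

```python
def replace_chunk(image, file_offset, chunk_length, new_code, filler=0xCC):
    """Return a patched copy of `image` with one code chunk overwritten.

    Assumption: the replacement may be shorter than the chunk, in which case
    the remainder is padded with `filler`, but it may never be longer.
    """
    if len(new_code) > chunk_length:
        raise ValueError("replacement is larger than the original chunk")
    if file_offset < 0 or file_offset + chunk_length > len(image):
        raise ValueError("chunk lies outside the image")

    patched = bytearray(image)  # work on a copy, never on the original bytes
    padding = bytes([filler]) * (chunk_length - len(new_code))
    patched[file_offset:file_offset + chunk_length] = bytes(new_code) + padding
    return bytes(patched)


original = bytes(range(16))
patched = replace_chunk(original, 4, 4, b"\x90\xc3")
print(patched[4:8].hex())  # 90c3cccc
```

Working on a copy and only swapping it in once every chunk has been replaced without errors mirrors the order of operations described above.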

 

Posted in .Net, SlimDX, SlimGen, Software Development

 

SlimGen and You, Part ADD [EAX], AL of N

31 Jul

Imagine you could have the safety of managed code, and the speed of SIMD all in one? Sounds like one of those weird dreams Trent has, or perhaps you are already thinking of using C++/CLI to wrap SIMD methods to help reduce the unmanaged transition overhead. You might also be thinking about pinvoking DLL methods such as those used in the D3DX framework to take advantage of its SIMD capabilities.

While all of those are quite possible, and for sufficiently large problems quite efficient too, they also have a relatively high cost of invocation. Managed to unmanaged transitions, even in the best of cases, cost a pretty penny. Registers have to be saved, marshalling of non-fundamental types has to be performed, and in many cases an interop thunk has to be created/jitted. This is a case where the best option is to do as much work as you can in one area before transitioning to the next.

But you can’t always do tons of work at once; a prime example is that of managing your game state. You’ll have discrete transformations of objects, but batching up those transformations to perform them all at once becomes a management nightmare. You have to craft special data structures to avoid marshalling, use pinned arrays, and in general you end up doing a lot of work maintaining the two, spend plenty of time debugging your interface, and may still not gain anything speed-wise.

If you’re wondering just how bad the interop transition is, you can take a look at my previous entries, where I explored the topic in some detail.

In the .Net framework, most code runs almost as fast as, as fast as, or faster than its comparable native counterpart. There are cases where the framework is significantly faster, and cases where it loses out by about 10% in the worst case. 10% isn’t a horrible loss, and it’s not a consistent loss either. The cost will vary depending on factors such as: is JITting required, is memory allocation performed, are you doing FPU math that would be vectorized in native code?

In fact, that 10% figure isn’t accurate either: if a method requires JITting the first time it is called, which could cost you 10% on the first invocation, future invocations will not need JITting and so the cost may end up being the same as its native counterpart henceforth. If the method is called a thousand times, then that’s only an additional .01% cost over the entire set of invocations.
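The arithmetic behind that last figure, as a tiny Python sketch (the function name is made up, and every call is assumed to cost the same apart from the one-time JIT penalty):

```python
def amortized_overhead(one_time_overhead, calls):
    """Spread a one-time cost, given as a fraction of one call, over all calls."""
    return one_time_overhead / calls


overhead = amortized_overhead(0.10, 1000)
print("%.2f%%" % (overhead * 100))  # prints 0.01%
```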

The only real area where the .Net framework seriously loses out to unmanaged code is in the math department. The inability to use vectorization can significantly increase the cost of managed math over that of unmanaged math code; that 10% figure rears its ugly head here. On the integer math side of things managed code is almost on equal footing with unmanaged code. There are some vectorized operations you can perform that will enhance integer operations quite significantly, but in general the two add up to be about the same. However, when it comes to floating point performance managed code loses out due to its dependency on the FPU or single float SSE instructions. The ability to vectorize large chunks of floating point math can work wonders for unmanaged code.

Well, all is not lost for those of us who love the managed world… SlimGen is here. Exactly what SlimGen is will be delved into later, but here’s a sample preview of what it can do:

SlimDX.Matrix.Multiply(SlimDX.Matrix ByRef, SlimDX.Matrix ByRef, SlimDX.Matrix ByRef)
Begin 5a856e64, size 293
5A856E64 8B442404         mov         eax,dword ptr [esp+4]
5A856E68 0F1022           movups      xmm4,xmmword ptr [edx]
5A856E6B 0F106A10         movups      xmm5,xmmword ptr [edx+10h]
5A856E6F 0F107220         movups      xmm6,xmmword ptr [edx+20h]
5A856E73 0F107A30         movups      xmm7,xmmword ptr [edx+30h]
5A856E77 0F1001           movups      xmm0,xmmword ptr [ecx]
5A856E7A 0F28C8           movaps      xmm1,xmm0
 

Posted in .Net, SlimDX, SlimGen, Software Development

 

VMWare Server 2.x

24 Jun

Virtualization has become quite the hot business topic (and buzzword too) nowadays. It offers the promise of server consolidation, ease of management, personnel reduction, and monetary savings in miscellaneous areas (such as power consumption). Of course, there is always the question of whether it actually delivers on any of those promises.

I’ve been using VMWare Server for quite some time, and have been pleased with the product overall. Well, up until I tried out version 2.x.

It was quite a disappointment.

First things first, the web interface: it’s rather slow and clunky. It does display more information than you used to get out of VMWare Server 1.x, such as memory usage and CPU usage, but you have to constantly refresh the pages to get more up to date information. There is a somewhat “ajax” feel to the interface in some areas, and in others you can definitely feel the page refreshes. You also have to install a console plugin to even view your VM running, which is a major annoyance since the installation only works on a select subset of browsers (Firefox and IE basically). The final issue with the web interface, which is a real killer, is the memory usage. Running two VMs, with neither console open, they were taking up their respective amounts of RAM. Meanwhile the web interface was hogging another half gig of RAM doing nothing (I was not even on the interface at the time, just checking it via host OS tools). That’s a rather large chunk of overhead for something you shouldn’t have to touch much. The good news is that you can turn off the web interface, if you’re willing to edit some batch files. The bad news is, you then have to buy VMWare vCenter if you want to manage it without the web interface.

Then there are the speed issues. While running an x64 VM, attempting to install Windows Server x64, with the guest being allocated 2GB of RAM and two processor cores, the machine ran inordinately slow. The installation of Windows Server x64 took almost 3 HOURS just to get it far enough along that it would be usable. It was about this point in time that I started to wonder if v2.x of VMWare Server was worth it. Research indicated that others had experienced similar issues with VMWare Server 2.x and multicore installations. After disabling one of the cores it was noticeably faster, but still quite slow. Reads, and especially writes, to the virtual disk were noticeably slow, and patching the operating system took longer than I cared to wait.

The last thing that really made me decide against using VMWare Server 2.x was the feature removal. For instance, you can no longer trivially create a virtual machine that has its backing storage on a physical disk. While the virtualization subsystem DOES support this, you have to craft a VMDK by hand (or use third party tools) just to mount it up. At which point the web interface has problems managing the virtual machine, as it’s not designed to deal with such capabilities, and will typically display an error message when attempting to manage other features of that virtual machine. Considering the rather significant performance advantages using actual backing disks can have, and also considering that this was a rather simple feature of VMWare Server 1.x to use… it’s quite annoying to see it vanish.

 

Posted in Virtualization

 

Server outage

19 Jun

Well, site’s back up, on WordPress now…

Server crashed something spectacularly the other day. Both drives in the RAID 1 decided to take a nice vacation, and so I’m having to restore from backups I’ve kept.

Thanks to Google I’ve got most of my old posts, and will be dumping them up here through the day as I edit them and put them together again.

 

Posted in Uncategorized