<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
<channel>
  <title>AMD Developer Blogs </title> 
  <description /> 
  <link>http://forums.amd.com/devblog/index.cfm?forumid=8</link>
  <language>en-US</language>
  <generator>FuseTalk Hosting Executive Plan 3.2 Build 80405</generator>

	<creativeCommons:license>http://creativecommons.org/licenses/by-nd/2.0/</creativeCommons:license><image><link>http://creativecommons.org/licenses/by-nd/2.0/</link><url>http://creativecommons.org/images/public/somerights20.gif</url><title>Some Rights Reserved</title></image><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/AmdDeveloperBlogs" type="application/rss+xml" /><feedburner:emailServiceId>AmdDeveloperBlogs</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<dc:creator>Simon Solotko</dc:creator>
		<title>Dealing With Reality | The Interview | ATI Stream and OpenCL | Part 2</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/DNCSqsclYE8/blogpost.cfm</link> 
		<pubDate>2009-10-13T14:44:36 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=335&amp;threadid=120276#comments</comments>
		<trackback:ping>1</trackback:ping>
		<description>&lt;p&gt;In &lt;a href="http://links.amd.com/openftw"&gt;Part I on the AMD At Home Blog&lt;/a&gt;, Simon Solotko gave an overview of open, parallel computing with ATI Stream and OpenCL. Here, in Part 2, Simon Solotko and Ben Sander discuss the power of ATI Stream technology and the elegant, standards-based interface now available with OpenCL for GPU.&lt;/p&gt;
&lt;p&gt;Ben, what have we created with OpenCL and what does it do?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: Sure, with OpenCL we created a C-based interface for programming a range of parallel processors. Developers write OpenCL Kernels, sub-routines which developers seek to accelerate or offload, and embed these in their applications. OpenCL includes a runtime component which allows these OpenCL Kernels to be compiled at runtime for either a CPU or GPU. AMD has contributed to the development of the OpenCL specification and written the implementation for x86 processors and GPUs - a runtime environment which compiles the code near runtime, then schedules and executes the code at runtime. &lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What are the benefits of being able to compile an application for a CPU or a GPU?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: Developers can write one piece of code and easily support a variety of compute devices in the platform - CPUs and GPUs, from multiple vendors. Code can be load-balanced between CPU and GPU depending on the capabilities in the final platform. For example, we expect that some applications or parts of applications will run faster on the CPU than the GPU, while others perform better on the GPU. Finally, the OpenCL CPU implementation leverages the CPU hardware debug features to provide excellent debug capabilities, using familiar debug environments, at full CPU speed. &lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When exactly during runtime is the Kernel compiled?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: There are specific commands within the body of your application which you call to compile the Kernel, and direct it to be compiled for the CPU or GPU. At that point, the Kernel code is translated into a binary. The binary later executes natively when the Kernel is called. The code is not interpreted in the hot spot of the loop, it's not like Java in that regard.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So the code within a Kernel looks like C but can be compiled to execute on the GPU?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: Exactly. Because a GPU looks and functions differently than a CPU, however, you have to think differently when you write the Kernel for GPU, because at that point, you are executing your code directly on the GPU. There are constraints imposed on Kernel code to accommodate the specialized functionality of the GPU. Kernels are based on C99 with extensions provided by OpenCL-C for vectors and address spaces. &lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Give me some examples of the special ways in which the C code within a Kernel is different from the standard code in the body of the application?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: To understand writing a Kernel it is important to understand that the code is actually executing on a GPU, despite the fact that the functions you are performing are syntactically the same as other C code. A GPU has a small fast cache (local memory) and larger main GPU memory (global memory). You move data in blocks, and complete as much of the task on that block as possible before moving the block out and moving the next block in. With a GPU we have a lot of compute bandwidth relative to memory bandwidth making it advantageous to do as much as you can to data within the cache. With OpenCL the blocking process does not necessarily get easier, but you can control it from C code.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;How do we move data from main memory to the GPU memory for use by a Kernel function?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: A Kernel cannot move memory from main memory, that is done in your application code. So there are standard functions to copy memory into GPU memory from the application, and pointers to this memory can then be passed to a Kernel function. The Kernel function can then copy memory into the fast cache or "local" memory.&lt;/em&gt;&lt;/p&gt;
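&lt;p&gt;The flow Ben describes can be sketched as a toy model. This is plain Python rather than real OpenCL code: lists stand in for host memory, GPU global memory and local memory, and the names copy_to_device, sum_of_squares_kernel and LOCAL_SIZE are hypothetical, chosen only for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
# Toy model of the data flow described above; plain Python lists stand in
# for host memory, GPU global memory and the small, fast local memory.
LOCAL_SIZE = 4  # hypothetical capacity of local memory, in elements


def copy_to_device(host_data):
    """Host-side step: copy application data into (simulated) global memory."""
    return list(host_data)


def sum_of_squares_kernel(global_buffer):
    """Kernel-side step: stage one block at a time in local memory."""
    total = 0
    for start in range(0, len(global_buffer), LOCAL_SIZE):
        # Copy the next block from global memory into local memory.
        local_block = global_buffer[start:start + LOCAL_SIZE]
        # Finish all the work on this block before fetching the next one.
        total += sum(value * value for value in local_block)
    return total


host_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
device_buffer = copy_to_device(host_data)   # done by the application
result = sum_of_squares_kernel(device_buffer)  # done by the Kernel
print(result)
```
&lt;/pre&gt;
&lt;p&gt;The application performs the copy into device memory, and the Kernel only ever touches the device buffer and its own local block, which mirrors the split of responsibilities in the answer above.&lt;/p&gt;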
&lt;p&gt;This sounds a bit complicated, but I have to remind myself, this is all standard C code, and we are discussing the optimization that makes something run fast on the GPU, and the memory management tools that are available, now within standard C through the OpenCL library, to do that.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: That's right. The magic is that a Kernel is C code which is compiled by the runtime component of OpenCL to run on a GPU or CPU, with some extra tools to ensure it can take full advantage of the extremely high compute-to-memory bandwidth capability of the fast, parallel math engine of the GPU.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So as time goes on, we anticipate that people will write and optimize many useful Kernels which will simplify the development of complex applications?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ben: Yes. It is relatively straightforward to port applications written for other GPGPU languages like Brook+ and CUDA to OpenCL. This is a huge step forward from proprietary GPU code: you now have a standard way to get at GPU code and memory from C in a platform-independent way.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;With ATI Stream technology and the standardization of the programming model with OpenCL for GPU almost any aspiring GPGPU developer can download the tools necessary to get started and develop platform-independent software fueled by the power of the evolved GPU. I have collected resources below to get you started, enjoy blazing the trail of a new frontier in computing!&lt;/p&gt;
&lt;p&gt;For more information, watch as &lt;a href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx"&gt;AMD's Mike Houston discusses OpenCL &lt;/a&gt;and what the future has in store for software applications that use it.&lt;/p&gt;
&lt;p&gt;If you are ready to get started with OpenCL, you can begin with &lt;a href="http://ati.amd.com/technology/streamcomputing/opencl.html"&gt;AMD's OpenCL resource page&lt;/a&gt; here.&amp;nbsp;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Simon has regular posts on the AMD At Home blog and you can check out &lt;a href="http://blogs.amd.com/home/2009/07/29/the-home-central-computer-a-hypothetical-inteview/"&gt;The Digital Nexus&lt;/a&gt; series here.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/DNCSqsclYE8" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=335&amp;threadid=120276</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>AMD Developer Inside Track, Episode 2:  OpenCL Introduction</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/Hlq9mSVsjHE/blogpost.cfm</link> 
		<pubDate>2009-09-15T15:58:30 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=335&amp;threadid=118929#comments</comments>
		<trackback:ping>4</trackback:ping>
		<description>&lt;p&gt;&lt;span&gt;AMD has always been an advocate of open standards that build on and extend proven technologies (example: x86-64).&amp;nbsp; As such, it is a natural fit for AMD to embrace OpenCL&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;as part of its &lt;a href="http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx"&gt;ATI Stream &lt;/a&gt;offering.&amp;nbsp; But, just what is OpenCL?&amp;nbsp; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;In this month's episode of the &lt;a title="AMD Developer Inside Track" href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx" target="_self"&gt;AMD Developer Inside Track &lt;/a&gt;I interview Mike Houston,&amp;nbsp;GPG System Architect.&amp;nbsp; He talks about what OpenCL is, what the transition to this new language will be like and he gets into what applications could benefit from OpenCL, as well as what the future has in store for software applications that use it.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;One of the advantages of OpenCL is its advanced queuing system which is great for game development. It is also designed to work very well with various graphics APIs such as OpenGL, DirectX 9 and DirectX 10.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;Game developers aren't the only ones who can take advantage of OpenCL though.&amp;nbsp; According to Michael, it is going to be very useful for applications such as media encoding, virus scanning, and physics to name a few.&amp;nbsp; It makes a&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;lot of sense for AMD to move to a ubiquitous computing language that runs on platforms everywhere.&amp;nbsp; The next few years will be an interesting time for GPGPU technology as several hardware and software vendors get on board.&amp;nbsp; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx"&gt;ATI Stream technology &lt;/a&gt;is gaining significant momentum.&lt;span&gt;&amp;nbsp; &lt;/span&gt;Some cool and unexpected examples of ATI Stream technology in action are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;&lt;a href="mailto:Folding@home"&gt;Folding@home&lt;/a&gt;: &lt;/span&gt;&lt;a href="http://folding.stanford.edu/"&gt;&lt;span&gt;http://folding.stanford.edu/&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;a href="mailto:Milkyway@home"&gt;Milkyway@home&lt;/a&gt;:&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;a href="http://milkyway.cs.rpi.edu/milkyway/"&gt;&lt;span&gt;http://milkyway.cs.rpi.edu/milkyway/&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An example of gaming technology and OpenCL:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Havok demo:&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;a href="http://www.engadget.com/2009/03/27/havok-and-amd-show-off-opencl-with-pretty-pretty-dresses/"&gt;&lt;span&gt;http://www.engadget.com/2009/03/27/havok-and-amd-show-off-opencl-with-pretty-pretty-dresses/&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Watch the &lt;a title="AMD Developer Inside Track Video Series" href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx" target="_self"&gt;AMD Developer Inside Track&lt;/a&gt;, Episode 2 for the full story.&lt;/p&gt;
&lt;p&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;span&gt;-Sharon Troia, AMD Developer Outreach&lt;/span&gt;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/Hlq9mSVsjHE" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=335&amp;threadid=118929</feedburner:origLink></item>
	
	<item>
		<dc:creator>Ramesh J</dc:creator>
		<title>Framewave Multipass Build System</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/vgRVRkHqE5U/blogpost.cfm</link> 
		<pubDate>2009-09-07T05:31:28 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=118534#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p align="center"&gt;&lt;strong&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Developing libraries can be difficult, fun and interesting; an equally difficult task is testing the library and distributing it, so that other developers can use the library in their projects.&amp;nbsp; The big advantage of using libraries to accomplish certain functionalities is that libraries are already tested and optimized for various platforms. &amp;nbsp;For the libraries optimized for particular platforms, there needs to be a dispatch mechanism to select the best optimized path depending on the processor.&amp;nbsp; I have found that the build system from the Framewave library provides a good solution to accomplish this.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;Derived from the&amp;nbsp;&lt;strong&gt;AMD Performance Library&lt;/strong&gt;, Framewave is a free, open-source collection of popular image and signal processing routines designed to accelerate application development, debugging, multi-threading and optimization on x86-class processor platforms. This library has three paths of optimized code: &amp;nbsp;a reference code (C code) path, an SSE2 code path, and an SSE3 and F10H code path. One reason I found it interesting is that it is open-source: I can go through the code, understand it, and modify it as per my requirements. It also has a single source bundle for four operating systems (Linux&amp;reg;, Mac, Windows&amp;reg;, and Solaris operating systems).&lt;/p&gt;
&lt;p&gt;&amp;nbsp;Framewave has a different implementation for each of the paths, and the Framewave build system takes care of combining them together and exposing a single signature. To achieve this, Framewave has a custom build system based on the SCons build tool (&lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.scons.org/"&gt;http://www.scons.org&lt;/a&gt;&lt;/span&gt;). The advantage of using SCons is that it uses the Python scripting language for its configuration files.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;Framewave has a single source bundle that is termed platform independent and is compiled using a single build system across all the platforms. The tool sets supported are GCC, MSVC, and Sun CC. This build system allows me to build 32/64-bit shared/static libraries with the ability to build either a debug or release version.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;This build system picks up the file and compiles it &lt;em&gt;n&lt;/em&gt; times, &lt;em&gt;n&lt;/em&gt; being the number of optimized paths, producing &lt;em&gt;n&lt;/em&gt; object files. These &lt;em&gt;n&lt;/em&gt; object files are linked together to the stub function which is exported as the actual function. To understand the build system more, refer to the architecture description here: &lt;a href="http://framewave.sourceforge.net/DesignDoc/FramewaveBuildSystem-Architecture.htm"&gt;http://framewave.sourceforge.net/DesignDoc/FramewaveBuildSystem-Architecture.htm&lt;/a&gt;&lt;/p&gt;
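&lt;p&gt;&amp;nbsp;To make the single-signature idea concrete, here is a minimal sketch in Python, not Framewave code. The function names, the feature strings and the selection order are all assumptions made for illustration; the real library does this in compiled code, and the architecture document linked above is the authority on how.&lt;/p&gt;
&lt;pre&gt;
```python
# Toy model of single-signature dispatch. All names are hypothetical and
# do not come from the Framewave sources.

def add_ref(a, b):
    """Reference path: plain element-by-element code."""
    return [x + y for x, y in zip(a, b)]


def add_sse2(a, b):
    """Stand-in for the SSE2 build of the same source file."""
    return [x + y for x, y in zip(a, b)]


def add_sse3(a, b):
    """Stand-in for the SSE3 / F10H build of the same source file."""
    return [x + y for x, y in zip(a, b)]


# Paths ordered from most to least specialised: (required feature, function).
PATHS = [("sse3", add_sse3), ("sse2", add_sse2), (None, add_ref)]


def select_path(cpu_features):
    """Return the first path whose required feature the CPU reports."""
    for feature, function in PATHS:
        if feature is None or feature in cpu_features:
            return function
    raise RuntimeError("no usable code path")


def make_stub(cpu_features):
    """Build the single exported entry point for a given processor."""
    chosen = select_path(cpu_features)

    def add(a, b):
        return chosen(a, b)

    return add


add = make_stub({"sse2"})  # pretend the processor reports SSE2 only
print(add([1, 2, 3], [10, 20, 30]))
```
&lt;/pre&gt;
&lt;p&gt;&amp;nbsp;Callers only ever see add; which of the n compiled variants runs is decided once, behind the stub.&lt;/p&gt;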
&lt;p&gt;&amp;nbsp;Producing one DLL file and having only one signature exported for each function is a better option than having multiple DLL files for each of the optimized code paths and then loading the particular DLL depending on the processor. The advantage of having one single large DLL file for the library is that I end up adding only one file to the &lt;em&gt;n&lt;/em&gt; files present in my project.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;Overall this build system offers a unique way to bundle software that has different implementations for each processor.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;I'd like to hear what you think.&amp;nbsp; Is this build system useful in your own work? &amp;nbsp;What do you like about it, what do you dislike about it?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;Watch out for my next post on Using SCons for building the build system.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/vgRVRkHqE5U" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=118534</feedburner:origLink></item>
	
	<item>
		<dc:creator>Stephan Diestelhorst</dc:creator>
		<title>Evaluation of the Advanced Synchronization Facility (ASF)</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/mbKvpcu_M5Y/blogpost.cfm</link> 
		<pubDate>2009-09-04T05:14:20 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=317&amp;threadid=118419#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;In a &lt;a href="blogpost.cfm?threadid=114715&amp;catid=317"&gt;previous entry&lt;/a&gt; on the &lt;strong&gt;Advanced Synchronization Facility&lt;/strong&gt; (ASF), my colleague Michael pointed you to the &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://developer.amd.com/cpu/ASF/Pages/default.aspx"&gt;current ASF specification proposal&lt;/a&gt;&lt;/span&gt; and showed some nifty use-cases for the feature. In this blog entry I'll try to make this a little more practical and show you how you can get some more hands-on experience with ASF.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a name="Running_ASF"&gt;&lt;/a&gt;&lt;strong&gt;Running ASF &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;ASF is an experimental feature which means that we do not yet have access to a "toy implementation" in silicon to play with. As with all other cases where the real thing is not available for testing (such as with early crash tests for cars) we resort to &lt;em&gt;simulation&lt;/em&gt; to analyse important properties of ASF. Simulation also allows us to get a feeling for how ASF can be used by applications and operating system kernels, and might be integrated into compilers and language runtimes.&lt;/p&gt;
&lt;p&gt;The approach of simulation is nothing new inside AMD and we have a rich set of simulation tools available for all kinds of purposes. Several aspects of ASF, however, made us use another external open-source simulator called &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.ptlsim.org/"&gt;PTLsim&lt;/a&gt;&lt;/span&gt; for our analysis. On the one hand, we want to have detailed AMD64 simulation capabilities to provide some performance predictions, get fine-grained thread interleaving right, and support simulation of operating system kernels. Furthermore, we would like to have an understanding of how ASF interacts with other features employed in today's processor cores. On the other hand, all of this should not have prohibitive overheads in terms of simulation speed and prototyping effort.&lt;/p&gt;
&lt;p&gt;In addition to the technical requirements, we appreciate PTLsim's open-source license, which makes it easier to share our prototypical ASF simulator implementation with the public and in related projects (such as the EU-funded &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.velox-project.eu/"&gt;VELOX&lt;/a&gt;&lt;/span&gt; project, which Martin will cover in the next post in this series).&lt;/p&gt;
&lt;p&gt;Although PTLsim certainly has an impressive list of features, several of these features come at the price of a somewhat large infrastructural requirement. To allow simulation of the entire operating system, PTLsim relies on Xen to provide the first-order hardware abstraction. Xen in turn, however, may demand an elaborate test machine setup.&lt;/p&gt;
&lt;p&gt;Besides "just" adding the ASF functionality to PTLsim, I've spent a fair amount of effort adding supportive features, such as a true multi-core simulation model that improves on the previously existing SMT (symmetric multi-threading) model. With the new multi-core model, each logical thread has its own set of resources (functional units and caches) and cores can modify the contents of other caches (for example by invalidating data in other caches by local updates). These interactions were not captured by the SMT model, as threads there shared functional units and caches. Other modifications to the upstream version of PTLsim mostly fix bugs in several subsystems of PTLsim. I regularly hang out on the &lt;a href="mailto:ptlsim-devel@ptlsim.org"&gt;&lt;span style="text-decoration: underline;"&gt;ptlsim-devel mailing list&lt;/span&gt;&amp;nbsp;&lt;/a&gt;:-).&lt;/p&gt;
&lt;p&gt;&lt;a name="Evaluating_ASF"&gt;&lt;/a&gt;&lt;strong&gt;Evaluating ASF &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our initial evaluation of ASF started with an (internal) predecessor of the currently available version; let's just call it ASF1. Although ASF1 is a more restricted form of the current ASF specification, its implementation and analysis have been published already. You can take a look at our &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.amd64.org/publications.html#Manycore"&gt;EPHAM 2008 paper&lt;/a&gt;&lt;/span&gt; (or at my much more detailed thesis at the same location, if you're adventurous) to get an overview of how things behaved back in 2008. ASF1 basically has a more static phase layout; there is a strict separation between a 'declaration phase' and an 'atomic phase', in which you can add elements to your speculative working sets in the declaration phase only, and then modify them inside the subsequent atomic phase.&lt;/p&gt;
&lt;p&gt;The static phase layout makes ASF1 unsuitable for applications that want to interleave modifications and working-set discovery within a single atomic region, unnecessarily restricting programmers' flexibility. Nevertheless we did find ASF1 extremely powerful and we showed an 80% performance improvement over a conventional lock-free implementation of a linked list, and 20% for accelerating a software transactional memory (STM) run-time (you can find more details in the documents referenced above).&lt;/p&gt;
&lt;p&gt;ASF1 gives you the flexibility you need to make a lock-free linked-list implementation practical, actually even fairly straightforward. If you have some experience with lock-free linked lists, you'll know that the traditional &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://en.wikipedia.org/wiki/Compare-and-swap"&gt;CAS (compare-and-swap)&lt;/a&gt;&lt;/span&gt; is not easily usable for element removal from the list. In order to safely remove the element you have to change the preceding element's next-pointer (make it point to the deleted element's successor) and at the same time ensure that nobody concurrently adds an element just after the deleted element. With just CAS it is difficult to ensure that &lt;strong&gt;two&lt;/strong&gt; memory locations do atomically change / keep their value. It is almost trivial to do this with ASF, even ASF1. Just have a look at Michael's DCAS example in the previous blog post.&lt;/p&gt;
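&lt;p&gt;If you'd like to see the removal problem in isolation, here is a small single-threaded simulation of one unlucky interleaving. It is a sketch in Python, not ASF code: the cas and dcas helpers are hypothetical models of the atomic primitives, and a real implementation would run them atomically on concurrent threads.&lt;/p&gt;
&lt;pre&gt;
```python
# Single-threaded simulation of one unlucky interleaving; a model of the
# problem, not ASF code. All names are hypothetical.

class Node:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


def cas(node, expected, new):
    """Compare-and-swap on node.next (atomic on real hardware)."""
    if node.next is expected:
        node.next = new
        return True
    return False


def dcas(node1, expected1, new1, node2, expected2, new2):
    """Double CAS: both locations must match before either is written."""
    if node1.next is expected1 and node2.next is expected2:
        node1.next = new1
        node2.next = new2
        return True
    return False


def names(head):
    """Walk the list and collect node names, for inspection."""
    out = []
    while head is not None:
        out.append(head.name)
        head = head.next
    return out


def build():
    """Create the list A, B, C and return its three nodes."""
    c = Node("C")
    b = Node("B", c)
    a = Node("A", b)
    return a, b, c


# Plain CAS: the remover reads B's successor, then the inserter runs.
a, b, c = build()
successor = b.next                  # remover sees C
x = Node("X", c)
inserted = cas(b, c, x)             # inserter links X after B
removed = cas(a, b, successor)      # remover unlinks B, unaware of X
cas_result = names(a)               # X has been lost together with B

# DCAS: the removal also requires B.next to still be C, so it fails here.
a, b, c = build()
successor = b.next
x = Node("X", c)
cas(b, c, x)
dcas_removed = dcas(a, b, successor, b, successor, successor)
dcas_result = names(a)              # list is intact; the remover can retry

print(cas_result, removed)
print(dcas_result, dcas_removed)
```
&lt;/pre&gt;
&lt;p&gt;With plain CAS both operations report success, yet the freshly inserted element is unreachable. With the two-location check the removal fails cleanly and can simply be retried.&lt;/p&gt;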
&lt;p&gt;Besides making the currently specified ASF implementation available for you to play with (see below), we are currently testing and extending the implementation thoroughly. For example, we are porting the &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.tmware.org/tmunit"&gt;TMunit&lt;/a&gt;&lt;/span&gt; testing application and looking at other larger applications. We also analyse various ways of implementing ASF, see how we can make use of the increased flexibility (over ASF1) for accelerating STMs better than with ASF1, and look at new lock-free use cases for ASF.&lt;/p&gt;
&lt;p&gt;Finally, we constantly strive to improve ASF to fit the needs of programmers wanting to use it -- so again, if you have any comments on the &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://developer.amd.com/cpu/ASF/Pages/default.aspx"&gt;current ASF specification proposal&lt;/a&gt;&lt;/span&gt;, leave us a comment or send email to &lt;span style="text-decoration: underline;"&gt;&lt;a href="mailto:ASF_Feedback@amd.com"&gt;ASF_Feedback@amd.com&lt;/a&gt;&lt;/span&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a name="Hands_on"&gt;&lt;/a&gt;&lt;strong&gt;Hands on &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.amd64.org/research/multi-and-manycore-systems.html#Downloads"&gt;our downloads section&lt;/a&gt;&lt;/span&gt; you can find all the ingredients needed to brew your own magic ASF1 potion: the tweaked simulator implementing ASF1; the benchmarks in which we have used ASF1 to accelerate (and simplify!) a lock-free linked list implementation and an STM; and various explanatory documents, such as our EPHAM 2008 paper and my Diploma thesis. I'm currently cleaning up the implementation of the current ASF specification in PTLsim and it will become available there shortly, too.&lt;/p&gt;
&lt;p&gt;I'm aware that setting up the toolchain might be daunting, largely due to the Xen requirement, and sometimes less than 100% stable thanks to the research nature of the upstream project. If you have any specific questions regarding simulator setup and usage, please &lt;span style="text-decoration: underline;"&gt;&lt;a href="mailto:stephan.diestelhorst@amd.com"&gt;leave&lt;/a&gt;&lt;/span&gt; me a comment.&lt;/p&gt;
&lt;p&gt;&lt;a name="About_me"&gt;&lt;/a&gt;&lt;strong&gt;About me &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I joined AMD's OSRC group in Dresden in May 2007 as a student intern and started implementing the original ASF proposal (ASF1 above) in PTLsim. This implementation work laid the foundation for my &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.amd64.org/fileadmin/user_upload/pub/sdiestel-diplom.pdf"&gt;Master's thesis&lt;/a&gt;&lt;/span&gt; (mostly in English, ignore the German front pages) which I wrote to finish my studies of Computer Science at TU Dresden and the &lt;span style="text-decoration: underline;"&gt;&lt;a href="http://www.amd64.org/fileadmin/user_upload/pub/epham08-asf-eval.pdf"&gt;paper&lt;/a&gt;&lt;/span&gt; mentioned above. I graduated in February 2008 and have continued my work on ASF as a full employee at the OSRC since then.&lt;/p&gt;
&lt;p&gt;I'm interested in most computer science and engineering topics, but I'm currently focusing on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Microarchitecture: 	Cores, caches and interconnects&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Memory model semantics&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Simulation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Parallel programming: 	Transactional memory, lock-free programming&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Computer graphics&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'd like to hear what your thoughts are on ASF, and what uses you have for it.&lt;/p&gt;
&lt;p&gt;--&lt;/p&gt;
&lt;p&gt;Stephan Diestelhorst, Software Engineer 1&lt;br /&gt;AMD Operating System Research Center, Dresden&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/mbKvpcu_M5Y" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=317&amp;threadid=118419</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>AMD Developer Inside Track - Taking Advantage of Multi-Core</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/LLn7mDmZLN0/blogpost.cfm</link> 
		<pubDate>2009-08-17T14:17:16 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=117588#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;I was fortunate to have the opportunity to host a panel discussion on &lt;a href="http://developers.sun.com/events/communityone/2009/west/pdfs/S305066_B2.pdf"&gt;application development and multi-core at CommunityOne West&lt;/a&gt; this year. It was a fantastic opportunity to meet and work with software experts who are in the trenches every day, working on parallel programming solutions. The basic question here was: "How do I get started in taking advantage of multi-core processors?" To answer this question, everybody involved brought unique experiences and perspectives to the table. In the above link, you can see a view of AMD's roadmap - from our perspective, you should take away that from the desktop to the server, multi-core will be king.&amp;nbsp;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Check out the &lt;a href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx"&gt;&lt;span style="color: #009966;"&gt;AMD Developer Inside Track video&lt;/span&gt;&lt;/a&gt;&amp;nbsp;for a snapshot of three of our partners from this panel and myself answering the question of how to start taking advantage of multi-core processors.&lt;/p&gt;
&lt;p&gt;After these events I often get asked the same how-to-get-started question, but with more detail. Someone will say, "Okay, but let me tell you about this..." - so we talk it over. The questions I ask usually include at least some of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Who do you work for?&lt;/li&gt;
&lt;li&gt;What field are you in?&lt;/li&gt;
&lt;li&gt;What are you trying to do?&lt;/li&gt;
&lt;li&gt;Where is your code spending the most time now?&lt;/li&gt;
&lt;li&gt;What are your primary bottlenecks (CPU, I/O, Memory)?&lt;/li&gt;
&lt;li&gt;Do you need to scale up, or scale out?&lt;/li&gt;
&lt;li&gt;Are you trying to reduce response time?&lt;/li&gt;
&lt;li&gt;Are you trying to increase throughput?&lt;/li&gt;
&lt;li&gt;Where and how big is your data?&lt;/li&gt;
&lt;li&gt;What are your data dependencies?&lt;/li&gt;
&lt;li&gt;Are you using a managed runtime environment?&lt;/li&gt;
&lt;li&gt;What tools are you using?&lt;/li&gt;
&lt;li&gt;Are you open to using other tools?&lt;/li&gt;
&lt;li&gt;Will you be able to rewrite code?&lt;/li&gt;
&lt;li&gt;Who have you talked to in researching your problem?&lt;/li&gt;
&lt;li&gt;Do you have an n-tier infrastructure?&lt;/li&gt;
&lt;li&gt;What hardware are you using right now?&lt;/li&gt;
&lt;li&gt;What are your hardware upgrade plans?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These questions help decompose the problem and also provide a high-level view.&amp;nbsp; I find these discussions often touch on a mix of abstract principles combined with some specific practical advice. Below, I have some basic getting-started suggestions which I've mapped to the above questions, along with my perspectives on how they bear on the problem. For simplicity's sake, I've decided to map a question once to a single suggestion, though it may really have multiple applications.&lt;/p&gt;
&lt;table border="1" cellspacing="0" cellpadding="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Suggestion&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Relevant Questions&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Perspectives&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Identify your problem domain.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Who do you work for?&lt;/p&gt;
&lt;p&gt;What field are you in?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Telecommunications, financial services, manufacturing, scientific programming &amp; HPC, web services, database, ERP/CRM, BI: for these and many other segments there is typically an ecosystem of software tools for building products and solutions, in many cases with significant experience in parallelism.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Don't be afraid to ask for advice -- talk to your community of experts.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Who have you talked to in researching your problem?&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Your community of experts can be found at conferences, in online forums, and at your tools vendors.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Clearly define your performance problem and the associated metrics.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;What are you trying to do?&lt;/p&gt;
&lt;p&gt;Do you need to scale up, or scale out?&lt;/p&gt;
&lt;p&gt;Are you trying to reduce response time?&lt;/p&gt;
&lt;p&gt;Are you trying to increase throughput?&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;This is critical in explaining the problem to yourself and others. It should be a simple, easy-to-understand statement that includes a baseline.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Analyze and identify primary bottlenecks.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Where is your code spending the most time now?&lt;/p&gt;
&lt;p&gt;What are your primary bottlenecks (CPU, I/O, Memory)?&lt;/p&gt;
&lt;p&gt;Where and how big is your data?&lt;/p&gt;
&lt;p&gt;What are your data dependencies?&lt;/p&gt;
&lt;p&gt;Do you have an n-tier infrastructure?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;If you don't know the answers to these questions then you need to do some analysis.&amp;nbsp; Diagram your infrastructure.&amp;nbsp; Use the performance analysis tools found in your OS and available from your tools vendors.&amp;nbsp; There are usually a few places in your code where most of the time is spent. &lt;br /&gt;&lt;br /&gt;As with any optimization effort, you'll analyze first, then re-measure and re-analyze throughout your parallelization effort.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Review alternate algorithms.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Will you be able to rewrite code?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;After some initial analysis you should take a high-level look at your overall algorithm. It may not be the best choice. It also may place constraints on how easily you can parallelize.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Review current tools and look for acceptable alternates.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Are you using a managed runtime environment?&lt;/p&gt;
&lt;p&gt;What tools are you using?&lt;/p&gt;
&lt;p&gt;Are you open to using other tools?&lt;/p&gt;
&lt;p&gt;Will you be able to rewrite code?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;This is often closely related to the problem domain and associated business requirements.&amp;nbsp; Maybe you can adopt a new Fortran compiler that supports parallelization with OpenMP, or maybe you need to focus on a new math library.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Review current hardware and evaluate new hardware.&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;What hardware are you using right now?&lt;/p&gt;
&lt;p&gt;What are your hardware upgrade plans?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="213" valign="top"&gt;
&lt;p&gt;Along with looking at the architectural and tools aspects of your software, think about how much you could improve your basic situation with new hardware, be it more RAM, more or faster CPUs, or bigger or faster disks.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In conclusion, I want to emphasize that after carefully stating your problem and doing some initial analysis, you should try new implementations with caution.&amp;nbsp; Measure with appropriate precision, and make sure your measurements are repeatable.&amp;nbsp; Only then can you be sure that your work is worthwhile. Finally, take a look at AMD Developer Central for &lt;a href="http://developer.amd.com/documentation/articles/Pages/default.aspx#parallel"&gt;parallelization articles&lt;/a&gt;, our &lt;a href="http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx"&gt;CPU analysis tool CodeAnalyst&lt;/a&gt;, and our &lt;a href="http://developer.amd.com/cpu/Libraries/Pages/default.aspx"&gt;performance libraries&lt;/a&gt;.&amp;nbsp;&lt;/p&gt;
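&lt;p&gt;One way to check that a measurement is repeatable is to run the same workload several times and compare the median with the spread between the fastest and slowest runs. The Java sketch below is only an illustration: the class name, the workload, and the trial counts are all made up for this example, and a real measurement would use your own code and a proper benchmarking harness.&lt;/p&gt;
&lt;pre&gt;
```java
import java.util.Arrays;

class RepeatableTiming {
    // Written to a volatile field so the JIT cannot discard the workload result.
    static volatile long sink;

    // Hypothetical workload standing in for the code under test.
    static long workload(int n) {
        long sum = 0;
        for (int i = 0; i != n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Runs the workload `trials` times and returns the elapsed times
    // in nanoseconds, sorted from fastest to slowest.
    static long[] measure(int trials, int n) {
        long[] elapsed = new long[trials];
        for (int t = 0; t != trials; t++) {
            long start = System.nanoTime();
            sink = workload(n);
            elapsed[t] = System.nanoTime() - start;
        }
        Arrays.sort(elapsed);
        return elapsed;
    }

    public static void main(String[] args) {
        measure(5, 1_000_000);                  // warm-up runs, results discarded
        long[] times = measure(11, 1_000_000);  // measured runs
        long median = times[times.length / 2];
        long spread = times[times.length - 1] - times[0];
        System.out.println("median ns: " + median + ", min-max spread ns: " + spread);
    }
}
```
&lt;/pre&gt;
&lt;p&gt;If the spread is large compared with the median, the measurement is probably not yet repeatable enough to judge whether a change helped.&lt;/p&gt;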
&lt;p&gt;Be sure to check out the first &lt;a href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx"&gt;AMD Developer Inside Track video&lt;/a&gt; featuring three of AMD's software tools partners giving their perspectives on taking advantage of multi-cores.&lt;/p&gt;
&lt;p&gt;&lt;span style="font-family: tahoma,arial,helvetica,sans-serif;"&gt;-&lt;span style="font-size: x-small;"&gt;&lt;span style="font-size: small;"&gt;Tracy&lt;/span&gt; &lt;span style="font-size: small;"&gt;Carver, Software Developer and Evangelist, AMD&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/LLn7mDmZLN0" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=117588</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>Introducing the AMD Developer Inside Track - a New Monthly Video Series</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/k3571BNUU0M/blogpost.cfm</link> 
		<pubDate>2009-08-17T13:58:45 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=117586#comments</comments>
		<trackback:ping>2</trackback:ping>
		<description>&lt;p&gt;I'm a member of AMD's software division (and yes, you read it correctly - I said software).&amp;nbsp; It turns out that a lot of people are surprised to hear that AMD has a software division.&amp;nbsp; I can't count the number of times that we've been at tradeshows showing off the &lt;a href="http://developer.amd.com/CPU/CODEANALYST/Pages/default.aspx"&gt;AMD CodeAnalyst Performance Analyzer&lt;/a&gt; or our &lt;a href="http://developer.amd.com/CPU/LIBRARIES/Pages/default.aspx"&gt;Performance Libraries&lt;/a&gt; and people have wondered why the heck AMD was at a software developer conference.&amp;nbsp; The answer is simple: you can't run the hardware without software.&amp;nbsp; We have a significant investment in software within AMD and with our software partners.&amp;nbsp; I've vowed to do my part to get you behind-the-scenes, one-on-one time with AMD software developers and our software partners to get the scoop on what AMD is doing that matters to software developers.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The first installment of the &lt;a href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx"&gt;AMD Developer Inside Track&lt;/a&gt; is available now.&amp;nbsp; This one features a panel of our software developer tools partners from &lt;a href="http://www.allinea.com/"&gt;Allinea&lt;/a&gt;, &lt;a href="http://www.pervasivesoftware.com/Pages/default.aspx"&gt;Pervasive&lt;/a&gt; and &lt;a href="http://www.roguewave.com/"&gt;Rogue Wave&lt;/a&gt; talking about taking advantage of multi-core processing.&amp;nbsp; I was able to pull them aside after the &lt;a href="http://developers.sun.com/events/communityone/2009/west/sessions.jsp"&gt;CommunityOne West 2009 Multicore Panel&lt;/a&gt; sponsored by AMD.&amp;nbsp; Check out the &lt;a href="http://developer.amd.com/documentation/videos/InsideTrack/Pages/default.aspx"&gt;video&lt;/a&gt;, &lt;a href="http://forums.amd.com/devblog/blogpost.cfm?threadid=117588&amp;catid=208"&gt;Tracy Carver's blog&lt;/a&gt;, and the &lt;a href="http://developers.sun.com/events/communityone/2009/west/pdfs/S305066_B2.pdf"&gt;slides&lt;/a&gt; that were presented.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Next month we will be talking with Michael Houston about &lt;a href="../devforum/categories.cfm?catid=390"&gt;OpenCL&lt;/a&gt;.&amp;nbsp; And we have a multitude of topics planned for the rest of the year.&amp;nbsp; If you have a topic in mind, let us know by making a comment on this blog post, or on our &lt;a href="../devforum/categories.cfm?catid=390"&gt;forums&lt;/a&gt;.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;-Sharon Troia, Sr. Developer Relations Engineer&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;P.S.&amp;nbsp; If you experience any viewing problems, please let me know.&amp;nbsp; Over the next two weeks we will be adding different formats, lower-resolution versions to download, and transcripts.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/k3571BNUU0M" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=117586</feedburner:origLink></item>
	
	<item>
		<dc:creator>Tom Deneau</dc:creator>
		<title>Java Generics Performance Puzzler Part 2</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/BOJbR4LPR2Y/blogpost.cfm</link> 
		<pubDate>2009-08-14T15:32:25 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=117461#comments</comments>
		<trackback:ping>4</trackback:ping>
		<description>&lt;div style="margin: 0in 0in 10pt;"&gt;
&lt;table class=" FCK__ShowTableBorders" style="width: 100%;" border="0" cellspacing="0" cellpadding="0" width="100%"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="background-color: transparent; border: #ece9d8; padding: 0in;" valign="top"&gt;
&lt;table class=" FCK__ShowTableBorders" style="width: 100%;" border="0" cellspacing="0" cellpadding="0" width="100%"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="background-color: transparent; border: #ece9d8; padding: 0in;" valign="top"&gt;
&lt;table class=" FCK__ShowTableBorders" style="width: 100%;" border="0" cellspacing="0" cellpadding="0" width="100%"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="border-bottom: #ece9d8; border-left: #ece9d8; padding-bottom: 0in; background-color: transparent; padding-left: 0in; padding-right: 0in; border-top: #8f8f8f 1pt solid; border-right: #ece9d8; padding-top: 4.5pt;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;In a &lt;a href="blogpost.cfm?threadid=114296&amp;catid=313"&gt;&lt;span style="color: #800080;"&gt;previous blog&lt;/span&gt;&lt;/a&gt;, we looked at a microbenchmark where we were pulling an item from a collections class like an ArrayList and eventually&amp;nbsp;putting it in another collection.&amp;nbsp; And we saw that there could be a significant performance difference between the following two versions:&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;em&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;&amp;nbsp;(Note: In the following examples, we show only the parts where we access the ArrayLists and leave out any subsidiary logic.)&lt;/span&gt;&lt;/em&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;ArrayList aListSrc, aListDest1, aListDest2;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;1&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt 0.5in;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;while (idxSrc &amp;lt; NUMOBJS) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(idxDest, aListSrc.get(idxSrc++));&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;span style="color: #4c4c4c; font-size: 10pt;"&gt;and&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;2&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt 0.5in;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;while (idxSrc &amp;lt; NUMOBJS) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; MyClass myc = aListSrc.get(idxSrc++);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(idxDest, myc);&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;with version 2 being slower because it requires a castcheck to check that the Object returned by aListSrc.get could be cast to a MyClass.&amp;nbsp;The performance impact was because the castcheck required touching an object that did not need to be touched in version 1.&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;In the microbenchmark code above, we navigated thru the ArrayList by incrementing an integer index to the ArrayList.get method.&amp;nbsp; What if we had used an explicit iterator or used the implied iterator in Java&amp;rsquo;s &amp;nbsp;for-each statement?&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;First let&amp;rsquo;s look at the least cluttered implementation, which uses for-each loop&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;3&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt 0.5in;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;for (MyClass myc : aListSrc) {&lt;br /&gt;&amp;nbsp;&amp;nbsp; aListDest1.add(myc);&lt;br /&gt;&amp;nbsp;&amp;nbsp; // ...&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;and remembering that the for-each loop is syntactic sugar for the following:&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;3b&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt 0.5in;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;for (Iterator iter = aListSrc.iterator(); iter.hasNext() ) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; MyClass myc = iter.next();&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; //body of loop&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(myc);&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;we can see that, unfortunately, this suffers from the same castcheck as Version 2.&amp;nbsp;&amp;nbsp; And, once again, we cannot get around the castcheck by making the for-each variable an Object, because the compiler wisely will not let you add an Object to an ArrayList:&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;4 (will not compile)&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt 0.5in;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;for (Object myc : aListSrc) {&lt;br /&gt;&amp;nbsp;&amp;nbsp; aListDest1.add(myc);&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // &lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;szlig;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt; error here&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;Looking at the expanded code for the for-each loop, we see that we can still both use an explicit iterator&amp;nbsp;and avoid the castcheck by getting rid of the temporary variable from Version 3b and ending up with something like the following:&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;5&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt 0.5in;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;for (Iterator iter = aListSrc.iterator(); iter.hasNext() ) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(iter.next());&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;Like Version 1, this passes all the compile-time checks.&amp;nbsp;And at run time, because of type erasure,&amp;nbsp;iter.next() returns an Object and aListDest1.add consumes an Object .&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;But ideally we would want to be able to use the less cluttered for-each notation and&amp;nbsp;still get rid of the castcheck.&amp;nbsp; Can that be done?&amp;nbsp;&amp;nbsp;Brian Goetz's excellent article &lt;a href="http://www.ibm.com/developerworks/java/library/j-jtp04298.html?S_TACT=105AGX02&amp;S_CMP=EDU"&gt;&lt;span style="color: #000000; text-decoration: none; text-underline: none;"&gt;Going Wild with Generics&lt;/span&gt;&lt;/a&gt;&amp;nbsp;talks about using generic methods to force the compiler to use type inference to solve a&amp;nbsp;problem with wildcards in generics.&amp;nbsp; To quote his article "The Java compiler doesn't perform type inference in very many places, but one place it does is in inferring the type parameter for generic methods".&amp;nbsp; I wanted to see if the type inference from generic methods would solve our problem here and sure enough it does.&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;If we code up version 6 as a generic helper method&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;6&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: 12pt; margin: 0in 0in 10pt;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;private&amp;lt;V&amp;gt; void splitHelper(ArrayList&amp;lt;V&amp;gt; src, ArrayList&amp;lt;V&amp;gt; dest1, ArrayList&amp;lt;V&amp;gt; dest2) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (V elem&amp;nbsp;: src) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; dest1.add(elem);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // ... &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;}&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;and we can then call the helper with something like&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: 20.4pt; margin: 0in 0in 10pt;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; splitHelper(aListSrc, aListDest1, aListDest2);&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;If we&amp;nbsp;run version 6 thru javac and look at the generated bytecodes, we see that the checkcast bytecode that we saw in version 3 is not there, leading to better performance.&lt;/span&gt;&lt;/div&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&lt;span style="line-height: 115%; color: #000000; font-size: 10pt;"&gt;So we have found a for-each based solution that has gotten rid of the castcheck, but do others find this&amp;nbsp;behavior surprising?&amp;nbsp; The difference between Versions 3 and&amp;nbsp;6 seems very minor and it seems that if the compiler could eliminate the castcheck in Version 6, it could also do so in Version 3.&lt;/span&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div style="margin: 0in 0in 10pt;"&gt;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/BOJbR4LPR2Y" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=117461</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>Windows® 7 and AMD  -  Technical Collaboration</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/K6fRF9dUVc8/blogpost.cfm</link> 
		<pubDate>2009-08-05T20:05:01 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=117048#comments</comments>
		<trackback:ping>1</trackback:ping>
		<description>&lt;p&gt;AMD is a close collaborator with Microsoft.&amp;nbsp; We work together to help ensure the operating system runs smoothly and efficiently on AMD platforms.&amp;nbsp; Here are some of the key technology collaborations for Windows&amp;reg; 7:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power Management:&lt;/strong&gt; AMD worked closely with Microsoft to support a new AMD product-specific power management driver in Windows 7. This in-box driver supports older processors as well as the latest generation AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processor and AMD Phenom&lt;sup&gt;TM&lt;/sup&gt; II processor. In addition to the power management driver, AMD collaborated with Microsoft to fine tune default power policy parameters that control power state transitions to help optimize for power and performance. And since this driver is "in-box", there's no need to download.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virtualization:&lt;/strong&gt; AMD provided code to Microsoft and worked with the Hypervisor teams to help ensure that Hyper-V R2 and Windows Virtual PC in Windows 7 utilize Rapid Virtualization Indexing (aka nested page tables) for improved performance of VM guests. All of the third-generation AMD Opteron processors, AMD Phenom processors, and AMD Phenom II processors support Rapid Virtualization Indexing. In addition, most of AMD's shipping processors (other than AMD Sempron&lt;sup&gt;TM&lt;/sup&gt; processors) include AMD-V&lt;sup&gt;TM&lt;/sup&gt; technology and thus support Windows XP Mode for Windows 7.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stability &amp; Performance:&lt;/strong&gt; Current and upcoming reference platforms containing multi-core processors from AMD were loaned to Microsoft's labs to vet out potential incompatibilities with Windows 7 and Windows Server 2008 R2.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graphics:&lt;/strong&gt; AMD has been working hard to support DirectX&amp;reg; 11, and plans to have native DirectX 11 hardware available in its ATI Radeon&lt;sup&gt;TM&lt;/sup&gt; GPUs when Windows 7 is released.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU Compute:&lt;/strong&gt; DirectX 11 Compute Shader (CS) is a new API in Windows 7 that helps enable rich applications through the use of compute on the GPU (General Purpose GPU or GPGPU). Rich experiences such as drag-and-drop media transcoding, physics, and AI are a few areas that DirectX 11 CS can help enable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;For more information on AMD and Microsoft technical collaboration visit the &lt;a href="http://developer.amd.com/zones/windows/Pages/default.aspx"&gt;Windows Zone&lt;/a&gt; on &lt;a href="http://developer.amd.com/"&gt;developer.amd.com&lt;/a&gt;.&amp;nbsp; For more information on what AMD is doing overall with Microsoft for end users, check out the &lt;a href="http://sites.amd.com/us/microsoft/Pages/default.aspx"&gt;Microsoft &amp; AMD&lt;/a&gt; corporate site, or see the AMD video on Microsoft's &lt;a href="http://readyset7.com/"&gt;Ready. Set. 7&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;-Robin Maffeo, Microsoft Alliance Manager&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/K6fRF9dUVc8" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=117048</feedburner:origLink></item>
	
	<item>
		<dc:creator>Michael Chu</dc:creator>
		<title>ATI Stream SDK and OpenCL(TM)</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/LI4BpueL7UE/blogpost.cfm</link> 
		<pubDate>2009-08-05T01:58:44 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=335&amp;threadid=116993#comments</comments>
		<trackback:ping>3</trackback:ping>
		<description>&lt;p&gt;It's been a while since we've had an update on the ATI Stream Developer Blog... Over the past year since the last blog posting, a lot has happened. ATI Stream SDK v1.x saw two releases (v1.3-beta at the end of last year and v1.4-beta at the beginning of this year). With each of those releases of the SDK, and Brook+ in particular, we focused on stability and adding more exciting features.&lt;/p&gt;
&lt;p&gt;We've even launched an &lt;a href="http://developer.amd.com/gpu/ATIStreamSDKBetaProgram/Pages/default.aspx"&gt;ATI Stream Developer Showcase&lt;/a&gt; site where quite a few of your fellow developers have submitted their ATI Stream applications to show the developer community (you) the exciting things they have done with the ATI Stream SDK. &lt;a href="http://developer.amd.com/gpu/ATISTREAMPOWERTOY/Pages/default.aspx"&gt;ATI Stream Power Toys&lt;/a&gt; came into existence, and we plan to keep growing it as we come up with fun and useful tools for you that just can't wait for the next ATI Stream SDK release. And &lt;a href="http://developer.amd.com/gpu/acmlgpu/Pages/default.aspx"&gt;ACML-GPU&lt;/a&gt; finally made it out of alpha/beta testing and is now released on AMD Developer Central. All truly exciting stuff!&lt;/p&gt;
&lt;p&gt;But what has been even more anticipated since the middle of last year is OpenCL(TM). If you don't know much about OpenCL and how it meshes with the rest of GPGPU history, take a look &lt;a href="http://www.amd.com/streamopencl"&gt;here&lt;/a&gt;. It was a tremendous amount of work that kept our engineering team up late for many nights... but, finally, we were able to release a beta version of our ATI Stream SDK v2.0 with OpenCL x86 CPU support today. It's part of our complete OpenCL development platform and is designed to help accelerate your applications with OpenCL today on multi-core CPUs, and to help you take advantage of the added speed of GPUs later this year. If you are interested in giving it a try, visit our &lt;a href="http://developer.amd.com/streambeta"&gt;ATI Stream SDK v2.0 Beta Program page&lt;/a&gt; to download the beta release.&lt;/p&gt;
&lt;p&gt;Benedict Gaster, our OpenCL compiler architect here at AMD, has written an introductory tutorial for OpenCL to help developers get started learning and getting comfortable programming in OpenCL. You can find his OpenCL tutorial article &lt;a href="http://developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also take a look at Patricia Harrell's blog, &lt;a href="http://links.amd.com/OpenCLGameChanger"&gt;OpenCL Changes the Game&lt;/a&gt;. Patricia is the Director of Stream Computing here at AMD.&lt;/p&gt;
&lt;p&gt;Stay tuned for even more information about ATI Stream SDK developments.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/LI4BpueL7UE" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=335&amp;threadid=116993</feedburner:origLink></item>
	
	<item>
		<dc:creator>Anton Chernoff</dc:creator>
		<title>Performance Profiling Without the Overhead</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/NofWXXU_WHA/blogpost.cfm</link> 
		<pubDate>2009-07-24T11:04:19 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=116487#comments</comments>
		<trackback:ping>1</trackback:ping>
		<description>
&lt;p&gt;Performance Profiling Without the Overhead&lt;/p&gt;
&lt;p&gt;Here at AMD, we know that in order to improve program performance, you have to be able to measure it. &lt;a href="http://developer.amd.com/cpu/LWP"&gt;AMD's Lightweight Profiling feature (LWP)&lt;/a&gt; is designed to make performance measurement even easier and with negligible overhead. In this post, I'll give you an overview of LWP and tell you why we think it's an exciting next step in the area of performance tuning.&lt;/p&gt;
&lt;p&gt;First, a little history. Late in 2007, AMD announced Lightweight Profiling as a proposed extension to the AMD64 architecture that would allow an application to gather performance statistics about itself with low overhead. We posted the preliminary specification and asked for feedback from the developer community. Much to our delight, many of you responded with comments, criticisms, and suggestions on the proposal. We've read all of your feedback, and last week we posted the current version of the LWP specification. The announcement and the link to the spec are &lt;a href="http://developer.amd.com/cpu/LWP"&gt;here&lt;/a&gt;. Thanks to all of you who helped us out.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What came before...&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It's important to be able to measure the details of a program's performance in order to find ways to speed it up. Until now, there have been just two ways to do this. The first is via &lt;em&gt;instrumentation&lt;/em&gt;, i.e., adding code to the program to watch the clock, or the cycle counter, or just to count the number of times an instruction or loop is executed. Instrumentation can be added by the programmer or by a compiler. Unfortunately, it seriously perturbs the application, and the instrumented code usually doesn't have the same characteristics as the original code, especially when dealing with the data and instruction caches. Also, instrumentation can't observe the hardware caches, so it can't gather data about cache behavior.&lt;/p&gt;
&lt;p&gt;The second traditional method of monitoring performance is to use the hardware performance counters. These count hardware events and generate an interrupt after a programmed number of events have happened. The counters can report on events that are too hard to instrument (like counting each x86 instruction) or are not visible to software (like cache misses). These counters are used by the &lt;a href="http://developer.amd.com/cpu/CodeAnalyst"&gt;AMD CodeAnalyst Performance Analyzer&lt;/a&gt; and provide deep insight into application and system performance. However, each time a data sample is gathered, the processor must take an interrupt to a kernel-mode driver, and that takes hundreds or thousands of cycles. The driver, by simply executing, changes the contents of the data cache and the instruction cache and may perturb the application's performance. The counters can only be configured, started, and stopped from kernel mode, so an application must call a driver or the operating system to control them. Finally, some systems do not context-switch the performance counters when changing threads or processes, and on those systems, performance monitoring can only be done globally by a single user at a time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Introducing LWP&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After reading about current technology, you might think that an ideal performance monitor should:&lt;/p&gt;
&lt;ul type="disc"&gt;
&lt;li&gt;Operate entirely in user mode &lt;/li&gt;
&lt;li&gt;Cause little or no perturbation of the application &lt;/li&gt;
&lt;li&gt;Be controlled separately for each thread &lt;/li&gt;
&lt;li&gt;Have low overhead to allow for higher sampling rates &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that describes LWP!&lt;/p&gt;
&lt;p&gt;Lightweight Profiling adds a set of user-controlled counters to the AMD64 architecture. They can monitor multiple events simultaneously. An application thread starts profiling by providing the address of an LWP control block (LWPCB) as the operand to the new &lt;strong&gt;LLWPCB&lt;/strong&gt; instruction. The LWPCB specifies which events to count and how often to sample them. It also points to a ring buffer in the application's memory into which the hardware will store event records. That's it.&lt;/p&gt;
&lt;p&gt;Once started, LWP counts the specified events. When an event counter underflows, it stores an event record at the head of the ring buffer and resets the counter. (If requested, LWP randomizes the bottom bits of the new counter value to prevent "beating" against constant length loops.) LWP stores the record without interrupting the flow of the program, so the only perturbation to the program's performance is writing the record (usually affecting only a single data cache line) and a few cycles to perform the write. The record contains the event type, the address of the instruction that caused the underflow, and other information about the event. All event types share one ring buffer and can be sorted out by the event type field in the record.&lt;/p&gt;
&lt;p&gt;Of course, eventually the buffer will fill up. What then? Well, a program has two options for emptying the ring buffer. First, it can simply poll the buffer and remove event records from the tail of the ring. When software rewrites the tail pointer, the LWP hardware knows it can reuse the newly emptied region of the ring buffer. Since the buffer is in user memory, the program can even share the memory with another process, and that second process can be responsible for draining the buffer. Second, the application can specify that it wants LWP to generate an interrupt when the ring buffer is filled past a certain threshold. For instance, it can configure a buffer to hold 10,000 event records and tell LWP to interrupt whenever there are more than 9,000 records in the buffer. The interrupt does indeed perturb the program, but it does so 1/9000th as often as the traditional performance counters would. Better still, since the buffer is in user memory, the application can catch the interrupt and do whatever it wants with the data. It can store it to disk for later analysis, or it can process it immediately and even try to fix performance problems as they are happening.&lt;/p&gt;
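&lt;p&gt;To make the head and tail bookkeeping concrete, here is a minimal software model of the ring buffer in Java. It is only a sketch: the class RingBufferSketch, its methods, and the use of a plain long as an event record are invented for illustration, and none of it is the real LWP interface, which is defined in the specification. The 10,000 and 9,000 figures are the ones from the example above.&lt;/p&gt;
&lt;pre&gt;
```java
// Software model of a ring buffer with a fill threshold (illustrative only).
class RingBufferSketch {
    private final long[] records;   // stand-in for event records
    private final int threshold;    // notify when more than this many records are pending
    private int head = 0;           // next slot the modelled hardware writes
    private int tail = 0;           // next slot software reads
    private int pending = 0;        // records written but not yet drained
    private int notifications = 0;  // how many times the threshold was crossed
    private int dropped = 0;        // records lost because the ring was full

    RingBufferSketch(int capacity, int threshold) {
        this.records = new long[capacity];
        this.threshold = threshold;
    }

    // Models the hardware storing one event record at the head of the ring.
    void store(long record) {
        if (pending == records.length) {
            dropped++;              // assumption of this sketch: a full ring just counts the loss
            return;
        }
        records[head] = record;
        head = (head + 1) % records.length;
        pending++;
        if (pending == threshold + 1) {
            notifications++;        // the modelled "interrupt" fires once per crossing
        }
    }

    // Models software draining: read from the tail, then advance the tail pointer.
    long[] drain() {
        long[] out = new long[pending];
        for (int i = 0; i != out.length; i++) {
            out[i] = records[tail];
            tail = (tail + 1) % records.length;
        }
        pending = 0;
        return out;
    }

    int notifications() { return notifications; }
    int dropped() { return dropped; }

    public static void main(String[] args) {
        RingBufferSketch ring = new RingBufferSketch(10000, 9000);
        for (int i = 0; i != 9001; i++) {
            ring.store(i);
        }
        System.out.println("notifications: " + ring.notifications());
        System.out.println("records drained: " + ring.drain().length);
    }
}
```
&lt;/pre&gt;
&lt;p&gt;In this model, 9,001 stored records produce a single notification, which is the point of the threshold: the application is disturbed once per batch rather than once per sample.&lt;/p&gt;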
&lt;p&gt;In addition, LWP is a per-thread feature. Each thread on the system can be monitoring different events at different rates without interference. If a thread is not using LWP, there is no impact on its performance even if other threads have LWP active.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Some LWP Details&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The LWP events are a small subset of the events available in the traditional performance counters. They include Instructions Retired, Branches Retired, and DCache Misses. The Branches Retired event can be filtered by whether the branch is direct or indirect, conditional or unconditional, or other criteria. It captures the target address of the branch, a useful value when looking at indirect branches. The DCache miss event can be filtered by cache level to capture only "expensive" cache misses.&lt;/p&gt;
&lt;p&gt;One exciting feature of LWP is the ability to insert events into the ring buffer under program control. There are two new instructions to do this:&lt;/p&gt;
&lt;ul type="disc"&gt;
&lt;li&gt;&lt;strong&gt;LWPINS&lt;/strong&gt; inserts a record into the      ring buffer containing data taken from the arguments to the instruction. A      program can use LWPINS to insert a marker to indicate an important event,      such as loading or unloading a shared library, that influences the way      addresses should be interpreted in subsequent event records. &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LWPVAL&lt;/strong&gt; uses an event counter and      decrements the counter each time it is executed, much the way the hardware      event counters work. When the counter underflows, it inserts a record into      the ring buffer containing data from its arguments. A program uses LWPVAL      to implement a technique called value profiling. For instance, it can      profile the divisor of a commonly executed DIV instruction and if the data      show that the divisor is frequently the same number, it can rewrite the      instruction to test for that value and execute an optimized code sequence.      Similarly, it can profile the target of a hot indirect branch and generate      better code if one way of the branch is dominant. &lt;/li&gt;
&lt;/ul&gt;
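&lt;p&gt;The value-profiling idea behind LWPVAL can be sketched in plain Java. This is a software stand-in, not the instruction itself: the class ValueProfileSketch and its observe and share methods are hypothetical names, the workload is made up, and a real LWPVAL would cost far less than a method call. The sketch counts down on every call and records the observed value only when the counter reaches zero, then reports how often a candidate value shows up among the samples.&lt;/p&gt;
&lt;pre&gt;
```java
// Sampled value profiling in software (illustrative only).
class ValueProfileSketch {
    private final int interval;    // record one sample every 'interval' calls
    private int counter;           // counts down, like an event counter
    private final long[] samples;  // sampled values, a stand-in for event records
    private int count = 0;         // number of samples recorded so far

    ValueProfileSketch(int interval, int maxSamples) {
        this.interval = interval;
        this.counter = interval;
        this.samples = new long[maxSamples];
    }

    // Call at the profiled site, for example just before a hot division.
    void observe(long value) {
        counter--;
        if (counter != 0) {
            return;                // most calls do nothing but decrement
        }
        counter = interval;        // reset the counter and record this value
        if (count != samples.length) {
            samples[count++] = value;
        }
    }

    // Fraction of recorded samples equal to 'candidate'.
    double share(long candidate) {
        if (count == 0) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i != count; i++) {
            if (samples[i] == candidate) {
                hits++;
            }
        }
        return (double) hits / count;
    }

    public static void main(String[] args) {
        ValueProfileSketch profile = new ValueProfileSketch(7, 256);
        for (int i = 0; i != 1000; i++) {
            long divisor = (i % 10 == 0) ? 3 : 8;   // made-up workload in which 8 dominates
            profile.observe(divisor);
        }
        System.out.println("share of 8: " + Math.round(profile.share(8) * 100) + "%");
    }
}
```
&lt;/pre&gt;
&lt;p&gt;If the reported share is high enough, an optimizer could test for that divisor first and take a specialized path, which is the rewrite described in the LWPVAL bullet above.&lt;/p&gt;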
&lt;p&gt;&lt;strong&gt;Who will use LWP?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;LWP can be used in many different application environments. These include:&lt;/p&gt;
&lt;ul type="disc"&gt;
&lt;li&gt;&lt;strong&gt;Managed Runtime Environment:&lt;/strong&gt; Managed Runtimes (MRTEs) are      programming environments such as Java and the Microsoft&amp;reg; .NET Framework.      These environments have the ability to generate AMD x86 or x64 code for      routines coded in a high level managed language (such as Java or C#), and      they can do that on the fly as a program is running. The MRTE can enable      LWP and periodically look for performance problems. If (when!) it finds      them, it can generate better code for the hot spots and improve the      program's overall performance. LWP is lightweight enough that it can run      continuously. &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Optimizer:&lt;/strong&gt; A Dynamic Optimizer is a      program that monitors an application and attempts to improve its      performance by modifying it as it runs. In this case, the target application      is compiled to native code from a traditional language like C or C++. The      Dynamic Optimizer can gather performance data without affecting the flow      of control in the application. &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compiler Feedback:&lt;/strong&gt; Most modern compilers have an      option to build an instrumented program which the developer runs to gather      information on the program's performance. Unfortunately, the added      instrumentation (and the fact that optimization levels are often cranked      down in a feedback compilation) perturbs the program so much that what's      being measured is substantially different from the "real"      program. With LWP, the compiler can gather statistics on the program      execution without changes, and it can insert LWPVAL instructions to      profile interesting areas without adding a large block of instrumentation      code and without clobbering any registers. If the application runs without      turning on LWP, the LWPVAL instructions act as NOPs and only take a few      cycles. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We're very excited about Lightweight Profiling, and I hope this note has piqued your interest. You can read the full specification at the &lt;a href="http://developer.amd.com/cpu/LWP"&gt;LWP page on Developer Central&lt;/a&gt;. There's also an email link you can use to send us your comments and suggestions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;My colleagues suggested that I make this more "bloggy" by adding references to "traditional performance values" and "herbal performance enhancers". This postscript is dedicated to them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Anton Chernoff is a Senior Fellow and architect at AMD.&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; His postings are his own opinions and may not represent AMD's positions, strategies or opinions. &lt;/em&gt;&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=116487</feedburner:origLink></item>
	
	<item>
		<dc:creator>Gary Frost</dc:creator>
		<title>Final Words</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/JkxP400lSUI/blogpost.cfm</link> 
		<pubDate>2009-07-23T12:02:06 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=116433#comments</comments>
		<trackback:ping>6</trackback:ping>
		<description>&lt;p&gt;I have always been a little unhappy with the decision to overload the use of the 'final' keyword to enable local variables to be made available to methods in inner classes. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;Let's recap. Here is a method which launches a thread which prints integers 0 thru 9.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;public void launch(){&lt;br /&gt;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=0; i&amp;lt;10; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;We decide to refactor this method to take two arguments (launch(int min, int max)) so that we can control the start and end values of the count.&amp;nbsp; We might be tempted to try&lt;br /&gt;&amp;nbsp;&lt;br /&gt;// will not compile&lt;br /&gt;public void launch(int min, int max){ &lt;br /&gt;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=min; i&amp;lt;max; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;But this will fail to compile. 
&lt;br /&gt;&amp;nbsp;&lt;br /&gt;The problem is that the parameters min and max are not available to the run() method in the anonymous inner class implementation of Runnable(). In fact, because the run() method is being executed in another thread, it is likely that the original call to launch() has returned before the run() method has even started, so the variables that were on the stack when we created our Runnable() are long gone.&amp;nbsp; To solve this problem Java needs a way to signal that a variable should be captured into the scope of any anonymous inner class that wants to use it. If Annotations had been around, I suspect that an Annotation would have worked well for this. Unfortunately, this 'requirement' predated Annotations, and it was decided to 'overload' the use of the final keyword to convey this intent.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;public void launch(final int min, final int max){ &lt;br /&gt;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=min; i&amp;lt;max; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;The above method will now compile and will function as intended.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;But 'final' seems wrong here.&amp;nbsp; I understand that there is a reluctance to add new key/reserved words to a language (just look at all the trouble that enum and assert created!), but final seems to be a weird choice.&amp;nbsp; I think it breaks the law of 'least astonishment'.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;Let's refactor our method one more time.&amp;nbsp; 
This time we will launch 10 threads per count value and we will print the 'number' of each thread. Here is our first attempt&lt;br /&gt;&amp;nbsp;&lt;br /&gt;// Won't compile&lt;br /&gt;public void launch(final int min, final int max){ &lt;br /&gt;&amp;nbsp;&amp;nbsp; for (int c=0; c&amp;lt;10; c++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=min; i&amp;lt;max; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("Thread "+c+" i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;Again our compilation issue is that the 'c' variable is not available in the run method of the anonymous inner class. 
&lt;br /&gt;We need c to be a final variable.&amp;nbsp; Let's make it final&lt;br /&gt;&amp;nbsp;&lt;br /&gt;// Won't compile for a different reason ;)&lt;br /&gt;public void launch(final int min, final int max){ &lt;br /&gt;&amp;nbsp;&amp;nbsp; for (final int c=0; c&amp;lt;10; c++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=min; i&amp;lt;max; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("Thread "+c+" i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;Doh! Of course c can't be final; it is a loop variable. If we mark it as 'final' we are applying the traditional (you can't mutate this) meaning of final, yet we need to mark it as final for the variable to be made available to the inner class. We are forced to do 'weird things' to get around this, like create a local final value for the purpose of capturing the value for the inner class. 
&lt;br /&gt;&amp;nbsp;&lt;br /&gt;public void launch(final int min, final int max){ &lt;br /&gt;&amp;nbsp;&amp;nbsp; for (int c=0; c&amp;lt;10; c++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; final int fc = c;&amp;nbsp;&amp;nbsp; // fc is only used to expose a final value to the innerclass&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=min; i&amp;lt;max; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("Thread "+fc+" i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;Yuck!&lt;br /&gt;&amp;nbsp;&lt;br /&gt;However you might be even more surprised by this solution ;)&lt;br /&gt;&amp;nbsp;&lt;br /&gt;public void launch(final int min, final int max){ &lt;br /&gt;&amp;nbsp;&amp;nbsp; for (final int c: new int[]{0,1,2,3,4,5,6,7,8,9}){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; new Thread(new Runnable(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; public void run(){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (int i=min; i&amp;lt;max; i++){&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; System.out.println("Thread "+c+" i="+i);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }).start();&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;}&lt;br /&gt;&amp;nbsp;&lt;br /&gt;What?&lt;br /&gt;&amp;nbsp;&lt;br /&gt;So it looks like we can declare a loop variable to be final provided we are using the new for-each form.&amp;nbsp; The code appears happy to mutate it (so it's not really final, is it? Strictly speaking, the for-each form declares a fresh variable on each iteration, so no single variable is ever reassigned) and also makes it available to appropriate inner classes.&amp;nbsp;&lt;/p&gt;
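&lt;p&gt;Here is a self-contained variant of the for-each example that runs without threads, so its output is deterministic. The class ForEachCapture and its capture helper are hypothetical names for this sketch. It relies on the enhanced for statement declaring a new variable for each iteration, which is how the Java Language Specification defines it.&lt;/p&gt;
&lt;pre&gt;
```java
import java.util.Arrays;

// Shows that each anonymous inner class keeps its own copy of a final for-each variable.
class ForEachCapture {
    // Hypothetical helper: builds one task per value, runs the tasks only after
    // the loop has finished, and returns what each task saw for its captured variable.
    static int[] capture(int[] values) {
        final int[] seen = new int[values.length];
        Runnable[] tasks = new Runnable[values.length];
        int slot = 0;
        for (final int c : values) {      // a fresh final 'c' is declared on every iteration
            final int index = slot++;     // 'slot' is reassigned, so it is copied into a final local
            tasks[index] = new Runnable() {
                public void run() {
                    seen[index] = c;      // reads the value captured when the task was created
                }
            };
        }
        for (Runnable task : tasks) {
            task.run();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(capture(new int[]{0, 1, 2})));   // prints [0, 1, 2]
    }
}
```
&lt;/pre&gt;
&lt;p&gt;Every task reports the value its own c held, even though all of the tasks run after the loop has ended.&lt;/p&gt;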
&lt;p&gt;How bizarre. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;Next time we will look at how these final variables actually get captured/cloned into the inner classes.&amp;nbsp; One might be surprised by what is happening at the bytecode level to allow these 'final' values to be made available to inner classes.&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=116433</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>HT Assist - what is it?</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/BLx2dD_N5J4/blogpost.cfm</link> 
		<pubDate>2009-07-21T17:01:09 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=271&amp;threadid=116331#comments</comments>
		<trackback:ping>2</trackback:ping>
		<description>&lt;p&gt;Scalable Performance with HyperTransport&lt;sup&gt;TM&lt;/sup&gt; Technology HT Assist:&lt;/p&gt;
&lt;p&gt;With the release of the Six-Core AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processor, formerly code-named "Istanbul", an important new hardware feature called HT Assist has been included that helps increase performance on 4-socket and 8-socket AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; 8400 Series processor-based systems.&lt;/p&gt;
&lt;p&gt;As you scale the number of sockets and, thus, processors in a system, maintaining data coherency becomes a more complex and important issue. &amp;nbsp;On a single-socket system with a multi-core processor, the single processor only has to maintain cache coherency between its own cores; there are no other sockets or processors to communicate with or stay coherent with.&lt;/p&gt;
&lt;p&gt;In a multi-socket system, each processor has to communicate with each other processor to make sure it is working on the latest data, or cache line, to maintain coherency (and thus program correctness). &amp;nbsp;This communication is done over HyperTransport&lt;sup&gt;TM&lt;/sup&gt; technology links between the processor sockets in the case of systems based on HyperTransport technology. &amp;nbsp;With a broadcast coherence protocol, the latency of a memory access is always the longer of 2 paths: the time it takes to return data from DRAM and the time it takes to probe all the caches in the system. &amp;nbsp;Only when the processor has received the data and all probe responses can it actually process the required transaction. &amp;nbsp;With a 4-socket or 8-socket system (24 or 48 total processor cores with Six-Core AMD Opteron processor-based systems) the HyperTransport technology links between processors can increasingly be loaded with a significant amount of latency-sensitive cache probe requests checking for data coherency.&lt;/p&gt;
&lt;p&gt;In a 4-socket system, one cache line coherency check can generate 10 or more messages over the 4 HyperTransport links connecting the 4 processors together. &amp;nbsp;These transactions include all the probe requests, probe responses, data requests, and data responses. With HT Assist, though, this same check may generate only 2-3 messages. &amp;nbsp;This significantly reduces the latency of the coherency check and the number of transactions over the HyperTransport links.&lt;/p&gt;
&lt;p&gt;HT Assist, or the Probe Filter as it is sometimes called, works by using part of the processor's L3 cache as a directory cache. &amp;nbsp;This directory cache tracks all cache lines cached in the system.&amp;nbsp; Instead of generating numerous cache probes when checking a cache line the processor does a Probe Filter Lookup. &amp;nbsp;This helps lower latency for accesses to local DRAM because there is no need to wait for probe responses when accessing local data. &amp;nbsp;This also means there is less queuing delay due to the lower HyperTransport technology traffic. &amp;nbsp;With significantly reduced probe traffic it effectively also increases system bandwidth performance.&amp;nbsp; It also should be noted that the directory cache uses 1MB of the 6MB L3 cache in the case of the Six-Core AMD Opteron processor.&amp;nbsp; As well, HT Assist is only enabled on 4-socket and 8-socket systems, where the performance benefits largely outweigh the small decrease in available L3 data cache.&amp;nbsp; On the other hand, HT Assist is not enabled on 2-socket systems where there is much less cache probe traffic and the full L3 cache is utilized.&lt;/p&gt;
&lt;p&gt;We've measured the difference HT Assist makes on Six-Core AMD Opteron processors, and the results are nothing short of stunning. &amp;nbsp;On the same 4-socket system, we measured 42GB/s of memory bandwidth with the STREAM benchmark with HT Assist enabled, but only 25.5GB/s with HT Assist disabled.* For 4-socket and 8-socket Six-Core AMD Opteron processor-based systems, this can translate into a significant performance uplift for applications that depend on cache performance, memory bandwidth, and system scalability.&lt;/p&gt;
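&lt;p&gt;As a quick check on what those two figures mean in relative terms, the arithmetic can be written out in a few lines of Java. The class HtAssistUplift and its upliftPercent helper are hypothetical names; the only inputs are the two STREAM bandwidth numbers quoted above.&lt;/p&gt;
&lt;pre&gt;
```java
// Relative bandwidth gain computed from the two STREAM figures quoted in the post.
class HtAssistUplift {
    // Gain of 'enabled' over 'disabled', expressed as a percentage.
    static double upliftPercent(double enabled, double disabled) {
        return (enabled / disabled - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double enabled = 42.0;    // GB/s with HT Assist enabled
        double disabled = 25.5;   // GB/s with HT Assist disabled
        System.out.println(Math.round(upliftPercent(enabled, disabled)) + "% more bandwidth");
    }
}
```
&lt;/pre&gt;
&lt;p&gt;That works out to roughly 65% more measured memory bandwidth on this particular configuration.&lt;/p&gt;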
&lt;p&gt;Applications that will naturally benefit from HT Assist include Database, Virtualization, and High Performance Computing (HPC). &amp;nbsp;And there is no need for software developers to change their code; just enjoy the extra performance from AMD!&lt;/p&gt;
&lt;p&gt;-Justin Boggs&lt;/p&gt;
&lt;p&gt;ISV Developer Relations&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;* 42GB/s using 4 x Six-Core AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processors ("Istanbul") Model 8435 in Tyan Thunder n4250QE (S4985-E) motherboard, 32GB (16x2GB DDR2-800) memory, SuSE Linux&amp;reg; Enterprise Server 10 SP1 64-bit with HT Assist enabled vs. 25.5GB/s with HT Assist disabled.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=271&amp;threadid=116331</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>Just released:  Sun Studio 12 Update 1, featuring optimizations for AMD Opteron™ processors</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/bjp4xssdTE4/blogpost.cfm</link> 
		<pubDate>2009-07-17T12:55:38 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=116158#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;&lt;em&gt;Featuring guest blogger from Sun Microsystems, Darryl Gove.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The release of a new version of &lt;a href="http://developers.sun.com/sunstudio/features/index.jsp"&gt;Sun Studio&lt;/a&gt; is always an exciting moment for Sun Studio enthusiasts. Sun Studio 12 came out pretty much two years ago, and a lot has changed in that time.&lt;/p&gt;
&lt;p&gt;One particular trend has been that multicore processors have become mainstream. One way of illustrating that is to look at the number of threads per chip for all the submitted &lt;a href="http://www.spec.org/cpu2006/results/cint2006.html"&gt;SPEC&amp;reg; CPU2006 integer speed results&lt;/a&gt;*. The following chart shows the cumulative number of submitted results since the benchmark came out in 2006 until the middle of June 2009 broken down by the number of threads that the chip could support.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://developer.amd.com/blog_assets/ss12u1fig1.jpg" alt="Cumulative number of CPU2006 Integer Speed results submitted" width="668" height="470" /&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Two years ago, when Sun Studio 12 came out, chips that could support two threads were starting to become common. Now we're looking at that being a minimal thread count, and we're starting to see the ramp up of chips that can support more than 4 threads - the latest AMD processors support six threads per chip. In tandem with the growth in thread count, we're seeing much more interest in developing applications that can use this core count. Sun is fortunate that with Solaris and Sun Studio, we have a very comprehensive, and long-standing, investment in multiprocessor technology: from &lt;a href="http://www.sun.com/solutions/virtualization/index.jsp"&gt;virtualisation&lt;/a&gt;, through &lt;a href="http://www.sun.com/bigadmin/content/zones/"&gt;Zones&lt;/a&gt;, to scalability to &lt;a href="http://www.sun.com/servers/coolthreads/t5440/"&gt;huge core counts&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sun Studio has always been on the leading edge of developing parallel applications. There are two ways of leveraging multiple cores: either through libraries provided with the compiler or through the parallelisation of your application. For those people using the &lt;a href="http://developers.sun.com/sunstudio/overview/topics/perflib_index.html"&gt;Performance Library&lt;/a&gt;, this is now optimised to take advantage of the latest AMD Quad-core and Six-core processors.&lt;/p&gt;
&lt;p&gt;The easiest way of producing parallel code is using automatic parallelisation. Sun was the first company to submit &lt;a href="http://www.spec.org/cpu2000/results/res2003q2/cpu2000-20030326-02001.html"&gt;automatically parallelised results for SPEC&amp;reg; CPU2000&lt;/a&gt;. Automatic parallelisation is a great technology. It takes some of the work of making parallel codes away from the developer, and places it firmly into the category of "just another compiler flag".&lt;/p&gt;
&lt;p&gt;However, the compiler can't do this for all codes, which is why Sun was also one of the first companies to support the &lt;a href="http://www.openmp.org/mp-documents/spec30.pdf"&gt;OpenMP 3.0 specification&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The OpenMP 3.0 specification is a very important step in making parallel programming easier. The 2.5 specification that was supported by Sun Studio 12 allows developers to identify loops that can be performed in parallel, and different sections of code that can be run simultaneously. The big improvement in the 3.0 specification is the support for &lt;a href="http://wikis.sun.com/display/openmp/Using+the+Tasking+Feature"&gt;Tasks&lt;/a&gt;. A task is a unit of work that one thread can request another thread to do. The developer defines the tasks in the source code, but which tasks are executed, and in what order, is determined dynamically at runtime. This massively increases the range of applications that can be parallelised using OpenMP.&lt;/p&gt;
&lt;p&gt;Of course, writing parallel applications becomes much harder without the tools to support this. Sun Studio 12 Update 1 includes these tools: the &lt;a href="http://developers.sun.com/sunstudio/overview/topics/debug_index.html"&gt;Debugger&lt;/a&gt; for diagnosing bugs in parallel applications, the &lt;a href="http://developers.sun.com/sunstudio/overview/topics/analyzer_index.html"&gt;Performance Analyzer&lt;/a&gt; for determining the activity of all the threads in an application, and the &lt;a href="http://developers.sun.com/sunstudio/downloads/ssx/tha/tha_using.html"&gt;Thread Analyzer&lt;/a&gt; for identifying data races in parallel applications. The Performance Analyzer has been enhanced to support hardware counters in the latest AMD processors. The hardware performance counters are an optimal way of determining exactly what the processor is doing during the run of your application.&lt;/p&gt;
&lt;p&gt;Performance is often one of the motivating factors for any compiler upgrade. In a compiler suite performance comes from two sources: enabling the developer to identify opportunities to improve performance, and the ability of the compiler to produce good code for the processor. The performance analyzer is able to profile all kinds of parallel applications including those parallelised with OpenMP directives as well as &lt;a href="http://developers.sun.com/sunstudio/documentation/techart/mpi_apps/"&gt;distributed MPI applications&lt;/a&gt;. This enables you to quickly determine where, at a source code level, the application is spending its time, and to drill down into that source to understand the performance at the level of hardware events.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://developer.amd.com/blog_assets/ss12u1fig2.jpg" alt="Sun Studio screenshot" width="666" height="425" /&gt;&lt;/p&gt;
&lt;p&gt;The goal for the Sun Studio compiler has always been to produce code that runs as fast as possible on all SPARC and x86 processors. Sun has worked closely with AMD to ensure that the compiler is aware of the best practices for producing code for the latest AMD processors. Sun Studio 12 Update 1 includes this support and continues the long track record of delivering &lt;a href="http://www.sun.com/servers/x64/x4600/benchmarks.jsp#4"&gt;superior performance on AMD processors&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As well as supporting all SPARC and x86 processors, Sun Studio is supported on a number of platforms: Solaris, OpenSolaris, and &lt;a href="http://developers.sun.com/sunstudio/overview/topics/linux_index.html"&gt;Linux (for x86)&lt;/a&gt;. Perhaps most importantly, Sun Studio 12 Update 1 is &lt;a href="http://developers.sun.com/sunstudio/downloads/index.jsp"&gt;free of charge to download and use&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;* SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation.&amp;nbsp; Benchmark results stated above reflect results posted on &lt;a href="http://www.spec.org/"&gt;www.spec.org&lt;/a&gt; as of 15 June 2009.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Darryl Gove is a Senior Staff Engineer in the compiler team at Sun Microsystems. He works on the optimisation and tuning of applications and benchmarks. He is the author of the books "&lt;/em&gt;&lt;a href="http://www.sun.com/books/catalog/solaris_app_programming.xml"&gt;&lt;em&gt;Solaris Application Programming&lt;/em&gt;&lt;/a&gt;&lt;em&gt;" and "&lt;/em&gt;&lt;a href="http://my.safaribooksonline.com/0595352510"&gt;&lt;em&gt;The Developer's Edge&lt;/em&gt;&lt;/a&gt;&lt;em&gt;", and maintains a blog at &lt;/em&gt;&lt;a href="http://blogs.sun.com/d/"&gt;&lt;em&gt;http://blogs.sun.com/d/&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.&lt;/em&gt;&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=116158</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>The scoop on the x86 Open64 Compiler Suite</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/bAU-DmBv3D4/blogpost.cfm</link> 
		<pubDate>2009-07-14T16:30:25 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=116028#comments</comments>
		<trackback:ping>3</trackback:ping>
		<description>&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;You may have seen the recent &lt;/span&gt;&lt;a href="http://blogs.amd.com/nigeldessau/2009/06/22/sweet-suite/"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;blog post&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt; from our CMO Nigel Dessau about the release of the x86 Open64 Compiler Suite. Nigel makes some great points about why AMD feels this open source project is important, so I won&amp;rsquo;t go into that here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Instead, I&amp;rsquo;ll provide an overview of &lt;/span&gt;&lt;a href="http://developer.amd.com/cpu/open64/Pages/default.aspx"&gt;&lt;span style="font-size: small; color: #0000ff; font-family: verdana,geneva;"&gt;the latest release&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt; and what the features can mean for your development work. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;Like other compilers, Open64 optimizes applications aggressively in many dimensions. What sets it apart is its use of innovative techniques that stem from an understanding of the underlying hardware architecture, such as laying out data structures in space- and cache-efficient ways and applying aggressive loop-nest optimizations to promote locality. The biggest benefit is in multi-core scalability, a measure of throughput when running multiple applications simultaneously on multiple cores, where the memory subsystem is often stressed.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-bidi-font-family: Calibri;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="mso-bidi-font-family: Calibri;"&gt;While the Open64 compiler suite was created to optimize software development for all x86-based architectures, it utilizes many features that take particular advantage of AMD&amp;rsquo;s technology. One such example is enabling the use of 2MB huge pages for programs built with O&lt;/span&gt;pen64 to help reduce TLB misses. Another important feature is enhanced code generation and instruction scheduling to take advantage of core pipeline hardware features.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Also, software data prefetching is better tuned to work with the hardware prefetcher and DRAM prefetcher to effectively hide memory latencies. This latest release also offers preview features of OpenMP and automatic parallelization to map program parallelism to multiple cores.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;Here&amp;rsquo;s the full list of new features in x86 Open64 4.2.2 that AMD added (also detailed in the &lt;/span&gt;&lt;a href="http://developer.amd.com/cpu/open64/assets/ReleaseNotes.txt"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;release notes&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;):&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpFirst" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l1 level1 lfo2;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Support for 2 MB huge pages.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpMiddle" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l1 level1 lfo2;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Improved loop fusion and loop unrolling.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpMiddle" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l1 level1 lfo2;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Improved head/tail duplication, if-merging, scalar replacement and constant folding optimizations.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpMiddle" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l0 level1 lfo1;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Improved interprocedural alias analysis.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpMiddle" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l0 level1 lfo1;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Improved partial inlining and inlining of virtual functions.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpMiddle" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l0 level1 lfo1;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;More aggressive re-layout optimization for structure members.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpMiddle" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l0 level1 lfo1;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Improved instruction selection and instruction scheduling.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoListParagraphCxSpLast" style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-add-space: auto; mso-list: l0 level1 lfo1;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"&gt;&lt;span style="mso-list: Ignore;"&gt;&amp;middot;&lt;span style="font: 7pt "Times New Roman";"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Improved tuning of library functions.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-bidi-font-family: Calibri;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;What this compiler suite really enables is highly optimized performance when running multiple applications at the same time, which is pretty much the norm for real-world workloads.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the spirit of open source projects, we&amp;rsquo;d like your feedback on how to improve this compiler suite.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If you would like to suggest features for future releases, leave us a comment.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;While we can&amp;rsquo;t promise that the features will be added, we certainly take your feedback under serious consideration.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="font-family: verdana,geneva;"&gt;&lt;span style="font-size: small;"&gt;Roy Ju&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: 11pt; line-height: 115%; font-family: "Calibri","sans-serif"; mso-fareast-font-family: Calibri; mso-bidi-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA;"&gt;&lt;span style="font-size: small; font-family: verdana,geneva;"&gt;AMD Fellow&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=116028</feedburner:origLink></item>
	
	<item>
		<dc:creator>Bragadeesh Natarajan</dc:creator>
		<title>IEEE floating point exception handling in Windows® OS</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/bV5qlXaFSnE/blogpost.cfm</link> 
		<pubDate>2009-07-06T19:11:41 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=115708#comments</comments>
		<trackback:ping>5</trackback:ping>
		<description>&lt;p&gt;In this blog post, we present an example of how IEEE floating-point (FP) exceptions can be caught when programming in C++ for Microsoft&amp;reg; Windows&amp;reg; using Microsoft Visual Studio (VS). We employ the __try/__except extension available in the VS C++ compiler and the _fpieee_flt filter function to handle exceptions. We specifically talk about IEEE exceptions raised by SSE FP instructions, how the MXCSR register behaves, and some behind-the-scenes details.&lt;/p&gt;
&lt;p&gt;FP arithmetic in the x86 world has traditionally been done by x87 instructions. But after the advent of the x86-64 (AMD64) architecture, FP math is increasingly done using the SSE FP instructions. Like their x87 counterparts, SSE instructions also raise IEEE exceptions during certain FP arithmetic operations. These exceptions are hardware exceptions raised by the processor to signal abnormal cases and conditions. By default, FP exceptions are masked, which means that they are recorded in a status register but prevented from actually getting raised. On the other hand, if they are unmasked, they will be raised and can alter the program flow. The MXCSR register controls the masking of FP exceptions for the SSE FP instructions. It also acts as the status register that records FP exceptions when those exceptions do occur.&lt;/p&gt;
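&lt;p&gt;The masked case described above, where an exception is recorded in the status flags without being raised, can be observed with a short sketch. This is an assumption-laden illustration rather than part of the original example: it uses the portable C++11 floating-point environment header (cfenv) instead of the Visual Studio routines used later in this post, and the helper name masked_divide_sets_flag is hypothetical. On x86-64 implementations these status flags typically reflect the exception bits of the MXCSR register.&lt;/p&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: returns true if dividing 1.0 by zero sets the
// divide-by-zero status flag while the exception is masked (assumed here to
// be the default floating-point environment).
bool masked_divide_sets_flag() {
    // volatile keeps the compiler from evaluating the division at compile time.
    volatile double zero = 0.0;
    volatile double one = 1.0;

    std::feclearexcept(FE_ALL_EXCEPT);    // start with clean status flags
    volatile double result = one / zero;  // masked: no trap is taken
    (void)result;

    // The exception was not raised, but it was recorded in the status flags.
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```
]]>
&lt;p&gt;With exceptions masked, masked_divide_sets_flag() is expected to return true and the program continues normally. If the zero-divide exception were unmasked instead, the division would raise a hardware exception and control would leave the normal program flow.&lt;/p&gt;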
&lt;p&gt;The IEEE FP exceptions are hardware exceptions and hence need support from the OS to get control back to user code when these exceptions occur. The structured exception handling (SEH) mechanism of Windows makes this possible. (Refer to &lt;a href="http://msdn.microsoft.com/en-us/library/ms680657(VS.85).aspx"&gt;http://msdn.microsoft.com/en-us/library/ms680657(VS.85).aspx&lt;/a&gt;.) The _fpieee_flt function acts as the bridge in SEH to the user-defined handler function. (Refer to &lt;a href="http://msdn.microsoft.com/en-us/library/te2k2f2t(VS.80).aspx"&gt;http://msdn.microsoft.com/en-us/library/te2k2f2t(VS.80).aspx&lt;/a&gt;.) The handler is registered using this function, and when the exceptions get filtered by SEH, control is transferred to the handler with all the relevant information about the exception.&lt;/p&gt;
&lt;p&gt;Here is an example program to illustrate:&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;#include&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; &lt;span style="color: #a31515;"&gt;&amp;lt;iostream&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;#include&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; &lt;span style="color: #a31515;"&gt;&amp;lt;float.h&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;#include&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; &lt;span style="color: #a31515;"&gt;&amp;lt;math.h&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;#include&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; &lt;span style="color: #a31515;"&gt;&amp;lt;fpieee.h&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;#include&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; &lt;span style="color: #a31515;"&gt;&amp;lt;windows.h&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #a31515; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;extern&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; &lt;span style="color: #a31515;"&gt;"C"&lt;/span&gt; &lt;span style="color: #0000ff;"&gt;int&lt;/span&gt; handler(_FPIEEE_RECORD *p)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;{&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;std::cout &amp;lt;&amp;lt; &lt;span style="color: #a31515;"&gt;"In the handler invoked by _fpieee_flt"&lt;/span&gt; &amp;lt;&amp;lt; std::endl;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;if&lt;/span&gt;(p-&amp;gt;Operation&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;== _FpCodeLog) &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;return&lt;/span&gt; EXCEPTION_CONTINUE_EXECUTION;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;else&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;return&lt;/span&gt; EXCEPTION_EXECUTE_HANDLER;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;}&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #0000ff; font-family: "Courier New"; mso-no-proof: yes;"&gt;int&lt;/span&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt; main()&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;{&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;unsigned&lt;/span&gt; &lt;span style="color: #0000ff;"&gt;int&lt;/span&gt; cw;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #008000;"&gt;// Get control word&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;_controlfp_s(&amp;cw, 0, 0); &lt;span style="color: #008000;"&gt;// Line A&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #008000; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #008000;"&gt;// Enable zero-divide exception&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;_controlfp_s(0, ~_EM_ZERODIVIDE, _MCW_EM); &lt;span style="color: #008000;"&gt;// Line B&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #008000; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;for&lt;/span&gt;(&lt;span style="color: #0000ff;"&gt;int&lt;/span&gt; i=0; i&amp;lt;2; i++)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;{&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;__try&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;{&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;double&lt;/span&gt; b, a = 0.0;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;if&lt;/span&gt;(i==0)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;b = log(a); &lt;span style="color: #008000;"&gt;// Line C&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;else&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;b = 1/a; &lt;span style="color: #008000;"&gt;// Line D&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #008000; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;std::cout &amp;lt;&amp;lt; &lt;span style="color: #a31515;"&gt;"b: "&lt;/span&gt; &amp;lt;&amp;lt; b &amp;lt;&amp;lt; std::endl;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;__except&lt;/span&gt;(_fpieee_flt(GetExceptionCode(),&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;GetExceptionInformation(), handler))&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;{&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;std::cout &amp;lt;&amp;lt; &lt;span style="color: #a31515;"&gt;"In the __except block"&lt;/span&gt; &amp;lt;&amp;lt; std::endl;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #008000;"&gt;// Restore control word&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;_controlfp_s(0, cw, _MCW_EM); &lt;span style="color: #008000;"&gt;// Line E&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; color: #008000; font-family: "Courier New"; mso-no-proof: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="color: #0000ff;"&gt;return&lt;/span&gt; 0;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 0pt; mso-layout-grid-align: none;"&gt;&lt;span style="font-size: 10pt; font-family: "Courier New"; mso-no-proof: yes;"&gt;}&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This code was run on VS 2008 targeting the x64 platform. Since it is a 64-bit target, the code generated will contain SSE FP instructions to perform the FP arithmetic operations.&lt;/p&gt;
&lt;p&gt;The _controlfp_s function is the interface to access and modify the MXCSR register. In line A, we store the control word for restoring it later. If the MXCSR register (not the variable cw) is examined we see it is set to 1f80h. This shows that all FP exceptions are masked (Refer to AMD64 architecture programmer's manual volume 1). At Line B, we enable the zero-divide FP exception. Now the MXCSR register changes to 1d80h to unmask that particular exception.&lt;/p&gt;
&lt;p&gt;Next, we try two scenarios in which the zero-divide exception can occur. The first is taking the logarithm of zero. According to the IEEE 754 standard's recommendation, this operation should raise an FP zero-divide exception and the log function does that. The second scenario is a simple divide operation that will raise this exception.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The FP exception handler function checks if the exception was thrown by a log operation. If it is, it returns a code asking for the execution to continue in the __try block. If not, the return code notifies the program to execute the __except block. Refer to &lt;a href="http://msdn.microsoft.com/en-us/library/s58ftw19(VS.80).aspx"&gt;http://msdn.microsoft.com/en-us/library/s58ftw19(VS.80).aspx&lt;/a&gt; to learn more about __try/__except blocks and exception-handling constants.&lt;/p&gt;
&lt;p&gt;In the first iteration when line C is executed, control is transferred to the handler, which then asks control be given back to the __try block where the exception occurred and hence back to the log function. The log function continues and an output of negative infinity is produced. Examining the MXCSR register at various points shows that all FP exceptions are temporarily masked when the control is in the handler (1f80h) and restored when control gets back to the __try block (1d80h).&lt;/p&gt;
&lt;p&gt;In the second iteration when line D is executed, control goes to the handler and then to the __except block. In this case the MXCSR register changes to 1d84h after line D and stays that way until the exception masks are restored at line E. If you disassemble the program, you will see that line D is compiled as a divsd instruction. During execution this SSE instruction sets the zero-divide status bit in MXCSR (the 4 in 1d84h), and since the zero-divide mask bit is cleared it causes a hardware FP exception. This exception is trapped by the OS and the control is transferred back to user code through SEH.&lt;/p&gt;
&lt;p&gt;In the first case with the log operation, it is not hard to see that the temporary masking of the exceptions was done by the log function and not by the SEH mechanism of the OS. In this case, the IEEE FP exception was simulated by software (similar to a call to the RaiseException function) and not raised by a single hardware instruction, as it was in the second scenario.&lt;/p&gt;
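&lt;p&gt;The MXCSR values quoted above (1f80h, 1d80h and 1d84h) can be decoded by hand. The following is a minimal sketch, not part of the original program: the helper names (MxcsrView, decode_mxcsr) are made up for illustration, and the bit positions are the ones documented for MXCSR in the AMD64 Architecture Programmer's Manual, Volume 1 (exception flags in bits 0 to 5, exception masks in bits 7 to 12).&lt;/p&gt;

```cpp
#include <cstdint>

// MXCSR bit positions (AMD64 Architecture Programmer's Manual, Volume 1):
//   bit 2      ZE  zero-divide exception status flag
//   bit 9      ZM  zero-divide exception mask
//   bits 7-12  the six exception mask bits (IM, DM, ZM, OM, UM, PM)
constexpr std::uint32_t kZeroDivideFlag = 1u << 2;     // 0x0004
constexpr std::uint32_t kZeroDivideMask = 1u << 9;     // 0x0200
constexpr std::uint32_t kAllMasks       = 0x3Fu << 7;  // 0x1F80

// Hypothetical helper type: the three facts the discussion above relies on.
struct MxcsrView {
    bool all_masked;          // every FP exception is masked
    bool zero_divide_masked;  // ZM is set, so a zero divide does not trap
    bool zero_divide_flag;    // ZE is set, so a zero divide has been recorded
};

// Decode a raw MXCSR value into the view above.
MxcsrView decode_mxcsr(std::uint32_t mxcsr) {
    return MxcsrView{
        (mxcsr & kAllMasks) == kAllMasks,
        (mxcsr & kZeroDivideMask) != 0,
        (mxcsr & kZeroDivideFlag) != 0,
    };
}
```

&lt;p&gt;With this helper, 1f80h decodes as "all masked", 1d80h as "zero-divide unmasked, no flag set", and 1d84h as "zero-divide unmasked, zero-divide flag set", which matches the three states observed at lines A, B and D.&lt;/p&gt;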
&lt;p&gt;We hope you find this discussion and example useful. If you have any questions or comments, please post them. In the future, we will discuss similar techniques for Linux&amp;reg;.&lt;/p&gt;
&lt;p&gt;Visit AMD's Windows zone (&lt;a href="http://developer.amd.com/zones/windows/Pages/default.aspx"&gt;http://developer.amd.com/zones/windows/Pages/default.aspx&lt;/a&gt;) for general Windows related information.&lt;/p&gt;
</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=115708</feedburner:origLink></item>
	
	<item>
		<dc:creator>Chip Freitag</dc:creator>
		<title>ACML 4.3.0 Performance Data</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/3ITAF6T-FFs/blogpost.cfm</link> 
		<pubDate>2009-06-30T11:46:05 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=115425#comments</comments>
		<trackback:ping>1</trackback:ping>
		<description>&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto;"&gt;&lt;span style="font-family: 'Arial','sans-serif'; color: black; font-size: 10pt;"&gt;Now that the ACML 4.3.0 release is completed and posted live on AMD Developer Central, I&amp;rsquo;ve been spending time collecting all the performance data needed to document the improvements in the 4.3.0 release. &amp;nbsp;&amp;nbsp;There are several new features that should show up nicely in performance graphs. &amp;nbsp;Improvements include a new SGEMM kernel for AMD Family 10h, new DGEMM and SGEMM for Woodcrest, Penryn, and Nehalem Intel processors, improved level 1 BLAS kernels, 3D FFT work, and new scalar acml_mv functions. &amp;nbsp;It&amp;rsquo;s a really long list!&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto;"&gt;&lt;span style="font-family: 'Arial','sans-serif'; color: black; font-size: 10pt;"&gt;You can easily demonstrate these new performance features by using the examples in the performance directory of the ACML installation. &amp;nbsp;There are examples for a few different routines, and these can be easily modified to demonstrate other routines as well.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto;"&gt;&lt;span style="font-family: 'Arial','sans-serif'; color: black; font-size: 10pt;"&gt;A couple of trends are jumping out from the data collected so far. &amp;nbsp;First, the 4.3.0 Level 3 BLAS routines run much better than previous versions on Intel machines.&amp;nbsp; They are very&amp;nbsp;competitive with MKL on Intel processors!&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto;"&gt;&lt;span style="font-family: 'Arial','sans-serif'; color: black; font-size: 10pt;"&gt;Second, the Intel Nehalem is a very impressive processor. &amp;nbsp;However, Istanbul&amp;rsquo;s 6 cores can crank out a bunch of raw DGEMM flops. &amp;nbsp;This graph tells the story:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto;"&gt;&lt;span style="color: black;"&gt;&lt;img src="http://developer.amd.com/PublishingImages/acml-DGEMMperf.JPG" alt="" /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto;"&gt;&lt;span style="color: black;"&gt;More information on ACML 4.3.0 is available on the ACML home page.&amp;nbsp; If you have feedback on how the new release improves performance for your application, we'd love to hear about it.&lt;/span&gt;&lt;/p&gt;
</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=115425</feedburner:origLink></item>
	
	<item>
		<dc:creator>Jim Conyngham</dc:creator>
		<title>Removing C wrapper functions from the AMD Core Math Library (ACML) to resolve linking issues.</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/X94rLDkuBFA/blogpost.cfm</link> 
		<pubDate>2009-06-29T15:39:49 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=115391#comments</comments>
		<trackback:ping>1</trackback:ping>
		<description>&lt;p style="TEXT-ALIGN: left"&gt;ACML is a significant library of (mostly) FORTRAN subroutines, provided in binary form and available for download at http://developer.amd.com/acml.&amp;nbsp; Each version of the library has been compiled with a particular FORTRAN compiler, and is compatible with application programs written and compiled with the same compiler.&lt;/p&gt;
&lt;p&gt;Although FORTRAN programming has hardly disappeared, if you're reading this blog, the odds are that you're developing in C/C++ or C#.&lt;/p&gt;
&lt;p&gt;Calling FORTRAN subroutines from C/C++/C# is doable, but there are a lot of potential problems and pitfalls.&amp;nbsp; The C and FORTRAN languages have completely different subroutine naming and argument-passing conventions.&amp;nbsp; For example, where C/C++ passes parameters by value (except for arrays), FORTRAN passes them by reference.&amp;nbsp; When you have a multi-dimensional array, FORTRAN stores the data in column-major order; C/C++ uses row-major order.&amp;nbsp; Different FORTRAN compilers have different conventions for passing strings, for the name of the subroutine entry point, etc.&lt;/p&gt;
&lt;p&gt;To help make ACML useful to C/C++/C# programmers, some versions of the library come with support for C compilers, including an "acml.h" header and "C wrapper" functions.&amp;nbsp; These alternate entry points take care of &lt;span style="text-decoration: underline;"&gt;most&lt;/span&gt; of the hassle for you (although it's up to the user to watch out for the row-major versus column-major array problem).&lt;/p&gt;
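&lt;p&gt;To make the row-major versus column-major issue concrete, here is a minimal sketch of the two indexing schemes and a conversion between them. The function names are made up for illustration and are not part of ACML; indices are zero-based, and the matrix is assumed to be stored densely with no padding.&lt;/p&gt;

```cpp
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a column-major (FORTRAN-style) matrix
// that has `rows` rows. Hypothetical helper, zero-based indices.
inline std::size_t column_major_index(std::size_t row, std::size_t col,
                                      std::size_t rows) {
    return col * rows + row;
}

// Offset of element (row, col) in a row-major (C/C++-style) matrix
// that has `cols` columns. Hypothetical helper, zero-based indices.
inline std::size_t row_major_index(std::size_t row, std::size_t col,
                                   std::size_t cols) {
    return row * cols + col;
}

// Copy a dense row-major matrix into a new column-major buffer, which is
// the layout a FORTRAN-convention routine expects.
std::vector<double> to_column_major(const std::vector<double>& a,
                                    std::size_t rows, std::size_t cols) {
    std::vector<double> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[column_major_index(r, c, rows)] = a[row_major_index(r, c, cols)];
        }
    }
    return out;
}
```

&lt;p&gt;For a 2x3 matrix stored row-major as 1 2 3 4 5 6, the column-major copy is 1 4 2 5 3 6. An explicit copy like this is only one option; many BLAS-style routines can instead be told to treat the data as transposed.&lt;/p&gt;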
&lt;p&gt;For example, suppose you consulted the section "&lt;em&gt;Determining the best ACML version for your system&lt;/em&gt;" in the ACML manual (online here: &lt;a href="http://developer.amd.com/cpu/Libraries/acml/onlinehelp/Documents/BestLibrary.html#BestLibrary"&gt;http://developer.amd.com/cpu/Libraries/acml/onlinehelp/Documents/BestLibrary.html#BestLibrary&lt;/a&gt;), and chose to download the Linux IFort64 version for your project.&amp;nbsp;&amp;nbsp; You would be able to code your project with either Intel (R) FORTRAN&amp;nbsp; or a compatible C/C++ compiler.&amp;nbsp; Your choice.&lt;/p&gt;
&lt;p&gt;So how does this work?&amp;nbsp; If a FORTRAN module containing:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; CALL DNRM2 (...) &lt;br /&gt;or&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SUBROUTINE DNRM2 (...) &lt;br /&gt;is compiled with the 64-bit ifort compiler, the linkage name passed to the linker is "&lt;strong&gt;dnrm2_&lt;/strong&gt;" (note the lower-case symbol name &lt;span style="text-decoration: underline;"&gt;with&lt;/span&gt; the trailing underscore).&amp;nbsp; Both the caller and the callee assume that all parameters are passed by reference.&lt;/p&gt;
&lt;p&gt;If a C program module containing:&amp;nbsp;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;#include &amp;nbsp;&amp;lt;acml.h&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dnrm2 (...) &lt;br /&gt;is compiled with the 64-bit GNU gcc compiler, the linkage name passed to the linker is "&lt;strong&gt;dnrm2&lt;/strong&gt;"&amp;nbsp; (lower-case symbol name &lt;span style="text-decoration: underline;"&gt;without&lt;/span&gt; the trailing underscore).&amp;nbsp; The caller passes array parameters by reference, but all other parameters are passed by value.&lt;/p&gt;
&lt;p&gt;You can use the "objdump" or "nm" utilities from the GNU binutils package to confirm the external linkage symbols in an object or library file.&lt;/p&gt;
&lt;p&gt;So, we can provide a single library with both FORTRAN-callable and C-callable versions of the same routine, because the linkage names used for subroutines are different for the two languages.&amp;nbsp; The ACML library contains two object modules for each routine defined in "acml.h".&amp;nbsp; The FORTRAN version exports the symbol with the trailing underscore as the entry point with the FORTRAN calling convention.&amp;nbsp; A separate "C wrapper" module exports the symbol without the underscore as the entry point for a short routine that resolves the differences in calling conventions and then calls the FORTRAN-compatible version.&lt;/p&gt;
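&lt;p&gt;The two-entry-point arrangement described above can be sketched as follows. This is an illustration only, not ACML source: the names demo_nrm2_ and demo_nrm2 are hypothetical, the body is a naive sum of squares rather than the scaled algorithm a real BLAS uses, and a positive stride is assumed.&lt;/p&gt;

```cpp
#include <cmath>

// Hypothetical stand-in for the FORTRAN-convention entry point: the linkage
// name carries a trailing underscore and every argument, including the
// scalars n and incx, arrives by reference.
extern "C" double demo_nrm2_(const int* n, const double* x, const int* incx) {
    double sum = 0.0;
    for (int i = 0; i < *n; ++i) {
        const double v = x[i * (*incx)];  // assumes *incx > 0
        sum += v * v;
    }
    return std::sqrt(sum);
}

// Hypothetical "C wrapper": same name without the underscore, scalars passed
// by value. It only takes the addresses of its by-value arguments and
// forwards them to the FORTRAN-convention routine.
extern "C" double demo_nrm2(int n, const double* x, int incx) {
    return demo_nrm2_(&n, x, &incx);
}
```

&lt;p&gt;A C or C++ caller would write demo_nrm2(2, x, 1), while a FORTRAN-convention caller would reach demo_nrm2_ with the addresses of n and incx. If such a caller were instead resolved to demo_nrm2, the wrapper would interpret an address as the value of n.&lt;/p&gt;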
&lt;p&gt;So all is well as long as your project is built with the specific FORTRAN compiler or a compatible C compiler or some combination of those.&amp;nbsp; But you can run into trouble if yet another compiler is thrown into the mix, or another 3&lt;sup&gt;rd&lt;/sup&gt;-party library which was built with another compiler is used.&lt;/p&gt;
&lt;p&gt;One of our users recently ran into exactly this situation.&amp;nbsp; They wanted to link together their program code, which was compiled with Intel (R) FORTRAN, plus ACML, plus yet another linear algebra library (which I won't name - let's call it library X).&amp;nbsp; Library X was linked with object code from a different FORTRAN compiler which did &lt;em&gt;&lt;span style="text-decoration: underline;"&gt;not&lt;/span&gt;&lt;/em&gt; append a trailing underscore to the linkage name.&amp;nbsp; The calling routine would push references (addresses) of the scalar parameters (such as the array sizes) onto the stack and then call the symbol "dnrm2" (without the underscore).&amp;nbsp; The linker would match that name with the "C wrapper" for dnrm2, which would expect those parameters to have been passed by value.&amp;nbsp; It would then execute the dnrm2 algorithm using &lt;span style="text-decoration: underline;"&gt;the address of&lt;/span&gt; the array size variable N in place of N itself.&amp;nbsp; This would probably just crash with a segmentation violation.&amp;nbsp; If by some miracle it did not crash, it certainly would not compute the correct results.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In some cases&lt;/em&gt; the ACML user can make local customizations to the ACML library to work around these problems.&amp;nbsp; Of course, it is strictly the user's responsibility to ensure that these customizations are appropriate and generate correct linkages.&amp;nbsp;&amp;nbsp; In this case, the work-around was to &lt;span style="text-decoration: underline;"&gt;remove all of the C wrappers from libacml.a.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The script below shows how this can be done.&amp;nbsp;&amp;nbsp; The technique used is a quick-and-dirty hack, and not the most efficient or elegant way of accomplishing the same effect.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
&lt;table style="height: 195px;" border="0" cellspacing="0" cellpadding="0" width="474"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;pre&gt;&lt;em&gt;#! /bin/sh&lt;/em&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;em&gt;#&amp;#160;&amp;#160; Make a local copy of the ifort64 ACML static library&lt;/em&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="color: #ff00ff;"&gt;cp&lt;/span&gt;&lt;/strong&gt; /opt/acml4.1.0/ifort64/lib/libacml.a &lt;strong&gt;.&lt;/strong&gt;/libacml.a&lt;/pre&gt;
&lt;pre&gt;&lt;em&gt;#&amp;#160;&amp;#160; Create a list of all of the C-wrapper modules (names ending in _cw.o)&lt;/em&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="color: #ff00ff;"&gt;ar&lt;/span&gt;&lt;/strong&gt; -t libacml.a &lt;strong&gt;|&lt;/strong&gt; &lt;strong&gt;&lt;span style="color: #ff00ff;"&gt;egrep&lt;/span&gt;&lt;/strong&gt;&amp;#160; '_cw\.o$' &lt;strong&gt;&amp;gt;&lt;/strong&gt; wrapperlist&lt;/pre&gt;
&lt;pre&gt;&lt;em&gt;#&amp;#160;&amp;#160;&amp;#160; Create a script to delete all of the C-wrapper modules&lt;/em&gt;&lt;br /&gt;&lt;em&gt;#&amp;#160;&amp;#160;&amp;#160; and execute it.&lt;/em&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="color: #ff00ff;"&gt;sed&lt;/span&gt;&lt;/strong&gt; &lt;span style="color: #ff0000;"&gt;"s/.*/ar -dv libacml.a &amp;/"&lt;/span&gt; wrapperlist &lt;strong&gt;|&lt;/strong&gt; &lt;strong&gt;&lt;span style="color: #ff00ff;"&gt;bash&lt;/span&gt;&lt;/strong&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;em&gt;#&amp;#160;&amp;#160;&amp;#160; Clean up&lt;/em&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="color: #ff00ff;"&gt;rm&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;.&lt;/strong&gt;/wrapperlist&lt;/pre&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/p&gt;
&lt;p&gt;One undocumented piece of information makes it easier to remove the "C wrapper" functions from this version of libacml.a:&amp;nbsp; All of those object modules have names with the suffix "_cw.o".&amp;nbsp; &lt;em&gt;There is no guarantee that this will be true in other versions of the library or in future releases.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;With this knowledge, the "ar -t" and "sed ... | bash" lines of the script are all that is needed to remove these modules.&amp;nbsp; Of course, this will remove them one at a time, which is remarkably slow and inefficient.&amp;nbsp; On the other hand, you only need to do this once.&amp;nbsp; You should expect this script to take a good fraction of an hour to execute, and plan accordingly;&amp;nbsp; start it when you're ready to leave for lunch or a meeting.&lt;/p&gt;
&lt;p&gt;Let us know if this makes ACML more useful for you; we'd like to hear what you're doing with the library.&amp;nbsp;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=253&amp;threadid=115391</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>Just released:  Advanced Synchronization Facility (ASF) specification</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/kzLg4UeVx_4/blogpost.cfm</link> 
		<pubDate>2009-06-15T13:57:39 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=317&amp;threadid=114715#comments</comments>
		<trackback:ping>3</trackback:ping>
		<description>&lt;p&gt;Recently AMD released an experimental specification for a proposed AMD64 architecture feature that may be of interest to all programmers of highly concurrent programs, libraries, runtimes, and operating systems: &lt;a title="ASF page" href="http://developer.amd.com/cpu/ASF" target="_blank"&gt;Advanced Synchronization Facility&lt;/a&gt;, or ASF for short. This is the first of three blog articles describing why AMD's Operating System Research Center (OSRC) became involved in the development of ASF, how we are evaluating ASF, and how this and other activities fit into the EU-funded VELOX project aiming at improving the state of the art for software-transactional-memory systems.&lt;/p&gt;
&lt;p&gt;In this posting I will give you a quick overview of what ASF is and how it works, along with some example code. I'll also describe how I became involved in developing ASF and why we are releasing this spec proposal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;About ASF&lt;br /&gt;&lt;/strong&gt;In a nutshell, ASF is intended to make it easier to write efficient, highly concurrent programs.&lt;/p&gt;
&lt;p&gt;When AMD introduced multicore CPUs to the x86 world, we acknowledged that individual CPU cores weren't getting much faster with each silicon-technology generation. Instead, we decided to provide multiple CPU cores in one processor. This put the burden of making programs run faster on newer processors on the software community: programs have to be changed to take advantage of the parallelism.&lt;/p&gt;
&lt;p&gt;Writing efficient, concurrent programs or parallelizing an existing sequential program is a hard endeavor. The trickiest part is making sure that all program threads have a consistent view of all shared data. ASF is intended to address this very problem, known as synchronization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How does ASF work?&lt;br /&gt;&lt;/strong&gt;ASF provides a mechanism to update multiple shared memory locations atomically without having to rely on locks for mutual exclusion. It's quite flexible as the semantics of the update are not fixed, but can be provided using standard x86 instructions.&lt;/p&gt;
&lt;p&gt;Here's an example. This code snippet implements a two-word compare-and-swap primitive, with new instructions highlighted in red:&lt;/p&gt;
&lt;pre&gt;; DCAS Operation:
; IF ((mem1 = RAX) &amp;&amp; (mem2 = RBX))
; {
;   mem1 = RDI
;   mem2 = RSI
;   RCX = 0
; }
; ELSE
; {
;   RAX = mem1
;   RBX = mem2
;   RCX = 1
; }
; (R8, R9 modified)
;
DCAS:
 MOV      R8, RAX
 MOV      R9, RBX
retry:
 &lt;span style="color: #ff0000;"&gt;SPECULATE&lt;/span&gt;                   &amp;#160;; Speculative region begins
 JNZ      retry              &amp;#160;; Page fault, interrupt, or contention
 MOV      RCX, 1             &amp;#160;; Default result, overwritten on success
 &lt;span style="color: #ff0000;"&gt;LOCK MOV RAX, [mem1]&lt;/span&gt;        &amp;#160;; Specification begins
 &lt;span style="color: #ff0000;"&gt;LOCK MOV RBX, [mem2]&lt;/span&gt;
 CMP      R8, RAX            &amp;#160;; DCAS semantics
 JNZ      out
 CMP      R9, RBX
 JNZ      out
 &lt;span style="color: #ff0000;"&gt;LOCK MOV [mem1], RDI&lt;/span&gt;        &amp;#160;; Update protected memory
 &lt;span style="color: #ff0000;"&gt;LOCK MOV [mem2], RSI&lt;/span&gt;
 XOR      RCX, RCX           &amp;#160;; Success indication
out:
 &lt;span style="color: #ff0000;"&gt;COMMIT&lt;/span&gt;                      &amp;#160;; End of speculative region
&lt;/pre&gt;
&lt;p&gt;The SPECULATE-COMMIT pair wraps a speculative region, which speculatively reads from and writes to protected memory locations using the LOCK MOV instructions. The speculative memory updates will become visible to other CPUs only when the speculative region completes successfully.&lt;/p&gt;
&lt;p&gt;Here's what the speculative region does in this example: The initial LOCK MOV instructions signify the memory locations that need to be monitored for external modifications and also read the memory operands into the RAX and RBX registers. The code then compares these operands with the original register operands (saved to R8 and R9 at the outset of the routine). The DCAS operation may fail because of a miscomparison at that point, bypassing the memory update. The RCX register returns a pass-fail indication.&lt;/p&gt;
&lt;p&gt;A speculative region may also be aborted, for example when a contending program thread accesses a protected memory location or when an interrupt occurs. In this case, all speculative memory updates are discarded, and the program flow (instruction and stack pointer) is rolled back to just after SPECULATE, where software can inspect the reason for the abort in the rAX and rFLAGS registers. The code in this example examines RFLAGS immediately after SPECULATE using a JNZ instruction that branches to the abort handler, which in this case just attempts a retry. A real implementation might have a more elaborate recovery strategy, for example, exponential backoff if the abort was due to contention.&lt;/p&gt;
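&lt;p&gt;For readers who want to check the intended result of the DCAS routine without ASF hardware, the following is a software reference model of its semantics only. It is a sketch under loud assumptions: a global mutex stands in for the speculative region (exactly the lock that ASF is designed to avoid), there is no abort or retry path, and the names DcasResult, dcas_model_lock and dcas_model are made up for illustration.&lt;/p&gt;

```cpp
#include <cstdint>
#include <mutex>

// Result of the model: `success` corresponds to RCX == 0 in the assembly
// version; the observed values correspond to RAX and RBX on return.
struct DcasResult {
    bool success;
    std::uint64_t observed1;
    std::uint64_t observed2;
};

// Single global lock standing in for the SPECULATE/COMMIT region.
// This is an emulation device, not how ASF works internally.
inline std::mutex& dcas_model_lock() {
    static std::mutex lock;
    return lock;
}

// Two-word compare-and-swap: if both words hold the expected values, store
// the new values into both; otherwise change nothing. Either way, report
// the values that were observed.
DcasResult dcas_model(std::uint64_t& mem1, std::uint64_t& mem2,
                      std::uint64_t expected1, std::uint64_t expected2,
                      std::uint64_t new1, std::uint64_t new2) {
    std::lock_guard<std::mutex> guard(dcas_model_lock());
    const std::uint64_t seen1 = mem1;  // models LOCK MOV RAX, [mem1]
    const std::uint64_t seen2 = mem2;  // models LOCK MOV RBX, [mem2]
    if (seen1 == expected1 && seen2 == expected2) {
        mem1 = new1;                   // models LOCK MOV [mem1], RDI
        mem2 = new2;                   // models LOCK MOV [mem2], RSI
        return DcasResult{true, seen1, seen2};
    }
    return DcasResult{false, seen1, seen2};
}
```

&lt;p&gt;The model is only meaningful if every access to the two words goes through dcas_model; ASF, by contrast, detects conflicting accesses from other CPUs in hardware and aborts the speculative region.&lt;/p&gt;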
&lt;p&gt;&lt;strong&gt;How we are developing ASF&lt;br /&gt;&lt;/strong&gt;ASF really is a team effort, with team members looking at various software applications, hardware implementation, and the specification itself.&lt;/p&gt;
&lt;p&gt;When I joined AMD's OSRC at the end of 2006, I quickly discovered ASF as it existed at that time: a mechanism for improving the efficiency of highly parallel, lock-free synchronization code. In previous work I had used lock-free data structures for building a real-time microkernel operating system, and I had often craved a feature for multi-word atomic updates such as ASF. This might explain why I was so enthralled by ASF.&lt;/p&gt;
&lt;p&gt;In the meantime, I have become the editor of the ASF specification proposal. I'm working with the ASF team to evaluate the feature in various application scenarios, and to further develop ASF based on our findings. We have expanded its focus to include software transactional memory (STM) as well; more on that in a later blog post.&lt;/p&gt;
&lt;p&gt;We are also actively discussing ASF with both academic and industrial partners to learn about interesting application areas and to derive requirements for an eventual implementation in future products.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The ASF specification&lt;/strong&gt;&lt;br /&gt;ASF is an experimental architecture extension currently in proposal stage. AMD has not yet committed to including this feature into any future CPU product. Instead, we are soliciting input from developers and researchers that would help us refine the ASF specification to better meet software development requirements.&lt;/p&gt;
&lt;p&gt;ASF is not the first feature we have proposed in this way. A year and a half ago, AMD decided to be more open in developing extensions to the AMD64 architecture to help ensure we meet the needs of the software development community and to encourage cross-vendor compatibility. At that time, we proposed the &lt;a title="LWP specification" href="http://developer.amd.com/LWP" target="_blank"&gt;Lightweight Profiling (LWP)&lt;/a&gt; and SSE5 features in a similar spirit, and we received extremely valuable input from the programming community that helped us improve our future products - to your benefit. SSE5 has just recently evolved into the AVX-compatible XOP, which we described in a &lt;a href="blogpost.cfm?catid=208&amp;threadid=112934" target="_self"&gt;previous blog entry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Please download the &lt;a title="ASF specification" href="http://developer.amd.com/assets/45432-ASF_Spec_2.1.pdf" target="_blank"&gt;ASF specification proposal&lt;/a&gt; and send your comments to &lt;a href="mailto:ASF_Feedback@amd.com"&gt;ASF_Feedback@amd.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;---&lt;br /&gt;Michael Hohmuth, MTS&lt;br /&gt;AMD Operating System Research Center, Dresden&lt;/p&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=317&amp;threadid=114715</feedburner:origLink></item>
	
	<item>
		<dc:creator>Gary Frost</dc:creator>
		<title>JavaOne 2009</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/ry4a19B30BY/blogpost.cfm</link> 
		<pubDate>2009-06-11T18:30:22 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=114574#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;I was lucky enough to go to JavaOne last week and thought I'd share some comments, highlights, a few quibbles, and a way to make some serious money if you are in the beanbag industry. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I felt that this year's JavaOne was a little subdued -- attendance seemed lower (we can probably all guess that the economy was a factor here) and generally there were fewer 'cool!' exclamations from the audiences.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;strong&gt;Monday&lt;/strong&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;This was 'Community One' day. I attended a couple of sessions (Hadoop and Cloud related) but really spent most of the day bumping into people and catching up. CommunityOne looked a little sparsely attended at times. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I did attend a session where the presenters and attendees discussed how to get the most out of their JUGs (Java User Groups).&amp;nbsp; This was a really good session. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I did enjoy hanging out in the AMD sponsored 'Hang Space,' and I had my first 'patent pending idea' here watching all of the laptop users sitting on the floor next to the walls (where the 110vac was served) and not on the comfy beanbags! 
So beanbag builders of the world, we need beanbags which incorporate 110v sockets.&amp;nbsp; These could be sold in strings which connect together and will allow those slacking off at conferences to actually partake in the bean-bag offerings rather than sit on the floor.&amp;nbsp; Of course, one might ask why the beanbags were not dragged to the walls, and the answer would be, you wouldn't be able to watch the episodes of 'The Office - US version' that were being served up on the big screen, obviously.&amp;nbsp;&amp;nbsp;&amp;nbsp; I, of course, could happily sit in a beanbag, pretend to work and watch Dwight, Jim, and Pam wrestle with their plight because I have an AMD-powered HP dv2 - whose battery lasted way longer than Season 1 of "The Office." &lt;img src="i/expressions/face-icon-small-wink.gif" border="0"&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;strong&gt;Tuesday&lt;br /&gt;&lt;/strong&gt;&amp;nbsp;&lt;br /&gt;It was good to see Scott McNealy hand over (the keynote, not Sun just yet) to Larry Ellison. Larry's remarks regarding the importance of Java to Oracle must have made a few folks sleep easier on Tuesday evening and I suspect that the JavaFX team will be particularly pleased with Larry calling out JavaFX by name and pushing a possible OpenOffice/JavaFX integration down the line. That should be good for JavaFX and hopefully good for OpenOffice.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;So where is JavaFX in 2009? I count this as the third JavaOne where Sun has pushed JavaFX. 2007 was kind of a preview, and I enjoyed the demos but that was really all it was. It dominated in 2008, but was still really not cooked and I walked out of the lab session when I was asked to sign an NDA -- an NDA for a lab session at a conference that I paid to attend seemed a bit weird. 
Now in 2009 I really do think it might start gaining some traction.&amp;nbsp; The addition of charting was smart (and pretty obvious really) and I was pleased that even Eclipse users got something in the form of a fairly cool Eclipse plugin.&amp;nbsp; Now it really feels that JavaFX is not just for Netbeans anymore.&amp;nbsp; The demos were slicker and the downloaded Eclipse plugin worked like a charm.&amp;nbsp;&amp;nbsp; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;Having worked on a large Flex application a few years back, and having seen some extremely cool Flex apps, I have always seen JavaFX as too little too late. Flash and Flex have pretty much carved up the R part of RIA (although AJAX is not dead yet!). Now I am a little more hopeful for JavaFX to at least find an audience. The more natural Java integration and the impressive binding support will appeal to those who really took to mxml+actionscript, and I can see the story developing.&amp;nbsp; The effort that has gone into jnlp/applet deployment (on jre 1.6_10 +) has helped enormously and once we can find a way to get JavaFX to launch faster (Flash still seems to launch way faster than even trivial JavaFX apps) I think that JavaFX will come into its own.&amp;nbsp; I look forward to kicking the tyres some more. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;Joshua Bloch (Google, Inc) and Neal Gafter's (Microsoft) "Return of the Puzzlers: Schlock and Awe" session was as well attended as ever. These guys do a great job presenting these infuriating corner cases. I liked the fact that they&amp;nbsp; acknowledged making some of the mistakes presented; it makes us all feel a little less incompetent. &lt;img src="i/expressions/face-icon-small-wink.gif" border="0"&gt; I think I got more answers right this year, although my success rate is still not impressive.&amp;nbsp; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;The "Small Language Changes in JDK(tm) Release 7" session by Joseph Darcy, Sun Microsystems, Inc. 
was interesting.&amp;nbsp; I really like the 'Elvis operator' ?: and also look forward to using some of the suggestions for&amp;nbsp; less verbose 'Generic' declaration/initializations. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;The "Asynchronous I/O Tricks and Tips" session by Jean-Fran&amp;ccedil;ois Arcand and Alan Bateman from Sun Microsystems, Inc. was an informative session. I really am guilty of not tracking nio enough (when will the 'n' in 'nio' seem really inappropriate?), and I look forward to using some of these tricks, especially using a 'Future' to access the response from an asynchronous read.&amp;nbsp; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;One of my favourite sessions was "Toward a Renaissance VM" by Brian Goetz and John Rose from&amp;nbsp; Sun Microsystems. Sometimes I feel my head is way too small to understand this JSR 292 stuff, but I actually felt that I had a grasp of how this will help dynamic languages and also how it might apply to frameworks which currently rely on bytecode engines/injection and reflection to do their work.&amp;nbsp; I still need to track down more information on this but the fog is lifting for me. &lt;img src="i/expressions/face-icon-small-smile.gif" border="0"&gt; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I wish I had caught the "The Feel of Scala" session by Bill Venners of Artima, Inc.&amp;nbsp; Only as the week progressed did I realize that I need to track Scala. I look forward to the slides and video of this presentation. 
&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;strong&gt;Wednesday&lt;/strong&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;I attended a great session called 'State: You're Doing It Wrong -- Alternative Concurrency Paradigms on the JVM&amp;trade; Machine' in the morning from Jonas Bon&amp;eacute;r of Scalable Solutions.&amp;nbsp; This session proposed State, Actor message passing and Data Flow mechanisms to improve concurrency.&amp;nbsp; For me the Actor-based demos (based on Scala) not only prompted me to look at this approach in my Java apps, but also were a great example of how Scala can be scaled out.&amp;nbsp; As I mentioned earlier I really need to dig into Scala some more. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I regret missing "The Modular Java(tm) Platform and Project Jigsaw" by Mark Reinhold of Sun Microsystems, Inc. From what I have read elsewhere this modular approach is really going to help deployment and packaging. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;Joshua Bloch's (from Google) ""Effective Java": Still Effective After All These Years" was another opportunity to see the 'Billy Mays' of Java (I really mean no disrespect - Josh is a pitch-perfect pitch man) do what he does flawlessly.&amp;nbsp; His 'Effective Java' book is like the movie 'Brazil'; you need to reread/review every year to catch what you missed previously. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I enjoyed "The Ghost in the Virtual Machine: A Reference to References" session from Bob Lee, Google Inc., which went into depth regarding GC, references, and finalization issues.&amp;nbsp; I look forward to walking through the slide deck on this one.&amp;nbsp; I learned a lot and also know a bunch slipped on past me. 
&lt;br /&gt;&amp;nbsp;&lt;br /&gt;I watched a cool demo which redefined classes in a running JVM using a java agent and some classloader tricks.&amp;nbsp; This BOF session "Runtime Update of Java(tm) Technology-Based Applications, Using Dynamic Class Redefinition" by Allan Gregersen from University of Southern Denmark was fun and educational. The presenter built a Swing-based game incrementally by adding fields and methods, changing class hierarchies, etc., all without ever restarting the JVM.&amp;nbsp; Although in practice I feel this javaagent-based chaining approach may not scale particularly well, if this can be pushed down into the JVM (as the presenter suggested) then this whole area has some great potential. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;I must apologise to my fellow AMDer, Richard West, and David Gilbert from Object Refinery Limited for missing their "JFreeChart: Surviving and Thriving" BOF.&amp;nbsp; I look forward to picking Richard's brain about this great toolkit.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;strong&gt;Thursday&lt;br /&gt;&lt;/strong&gt;&amp;nbsp;&lt;br /&gt;Occasionally I like to see what is going on in the Swing world.&amp;nbsp; I don't really get to write much in Swing but there are some really great toolkits out there. I particularly enjoyed "Swing Rocks: A Tribute to Filthy-Rich Clients" by Martin Gunnarsson and P&amp;auml;r Sik&amp;ouml; from Epsilon Information Technology. Swing really can look compelling.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;The "Matchmaking in the Cloud: Hadoop and EC2 at eHarmony" session from Steve Kuo and Joshua Tuberville of eHarmony, Inc. was a good presentation (and from a show of hands there were two attendees that actually got married through eHarmony so there was a cool validation of eHarmony's matching algorithm!). It walked through the technical and economic considerations around using these technologies. 
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;br /&gt;"Garbage Collection Tuning in the Java HotSpot(tm) Virtual Machine" from Charlie Hunt and Antonios Printezis of Sun Microsystems, Inc. was a good, informative session that walked through a number of great slides highlighting what to do and what not to do.&amp;nbsp; I still feel that GC tuning should be less of a 'dark art.'&amp;nbsp; I worry how many JVMs are sitting out there thrashing when a few command line options would smooth the way.&amp;nbsp; I do wish for a -XX:+GCAdvise option which (possibly at the end of each GC) would suggest which command-line flags would be optimal for a specific workload. I know that I am supposed to use the printgc options and/or use visualvm to show me the graphs that I should use to determine what flags will be optimal, but this seems way too hard.&amp;nbsp; Surely after running for a while the GC engine/subsystem would have enough data to generate an 'I suggest running with these flags ... because ....' style report, instead of 'here are a bunch of graphs and text dumps, now go away and work out what you did wrong and come back.'&amp;nbsp;&amp;nbsp; Sometimes I don't want to learn to fish; sometimes I would just like to eat some fish. &lt;img src="i/expressions/face-icon-small-wink.gif" border="0"&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;Cliff Click (from Azul Systems) and Brian Goetz's (Sun Microsystems)&amp;nbsp; session,&amp;nbsp; "This Is Not Your Father's Von Neumann Machine; How Modern Architecture Impacts Your Java(tm) Apps" was another one of the highlights of the conference.&amp;nbsp; It was a great presentation and allowed folk without a deep understanding of microprocessor architecture to walk away with some understanding of what happens under the hood. 
The slide deck in the middle which walked through the issues relating to how multi-core architectures executing speculatively have to handshake over the cache was very, very slick. I am looking forward to Cliff and Brian's Boxed Set being released. &lt;img src="i/expressions/face-icon-small-wink.gif" border="0"&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;There were some great sessions on&amp;nbsp; "Actor-Based Concurrency in Scala" from Philipp Haller of EPFL and Frank Sommers of Artima which really rammed home how effectively Scala and this Actor-based communication mechanism can simplify some concurrency problems.&amp;nbsp;&amp;nbsp; As I mentioned before this was brought up in an earlier session, and I enjoyed digging deeper in this dedicated session.&lt;br /&gt;&amp;nbsp;&lt;br /&gt;I stayed late to enjoy the "Java(tm) Programming Language Tools in JDK(tm) Release 7" BOF on Thursday night hosted by Maurizio Cimadamore and&amp;nbsp; Jonathan Gibbons from Sun Microsystems, Inc.&amp;nbsp; I applaud the upcoming refactoring of javap and also enjoyed the discussion on how we might get better error reporting out of javac. I also vote for the option of getting compilation rendered to xml to help tool chaining.&amp;nbsp; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;strong&gt;Friday&lt;br /&gt;&lt;/strong&gt;&amp;nbsp;&lt;br /&gt;Gosling's "Toy Show" (Friday morning) did have some cool stuff; the JavaFX studio tool for composing JavaFX without coding does look very, very good. Also the image analysis toolkit which generated analytical 'hashes' for images and then allowed image related searching/matching was very impressive. 
My favourite was the Printer/Copier based Java app for creating arbitrary multiple choice exam papers or surveys on plain paper: you print a bunch of the question papers off and then, by feeding a special page with the answers and the response papers into the scanner, let the copier/printer grade the papers. &lt;img src="i/expressions/face-icon-small-wink.gif" border="0"&gt;&amp;nbsp; Very smart.&amp;nbsp; &lt;br /&gt;&amp;nbsp;&lt;br /&gt;The "Under the Hood: Inside a High-Performance JVM(tm) Machine" session from Trent Gray-Donald of IBM was excellent.&amp;nbsp; This provided some more insight into what happens when your code is executed by a modern JVM. &lt;br /&gt;&amp;nbsp;&lt;br /&gt;Sadly I missed afternoon sessions because I had to get to the airport to get home to watch season two of 'The Office.' &lt;img src="i/expressions/face-icon-small-wink.gif" border="0"&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;There certainly is enough to dig into to keep me busy until next year. &lt;br /&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/ry4a19B30BY" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=114574</feedburner:origLink></item>
	
	<item>
		<dc:creator>Ben Pollan</dc:creator>
		<title>How Complex Is Your JRE Command-line?</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/vMwxtTpJtfo/blogpost.cfm</link> 
		<pubDate>2009-06-11T12:04:41 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=114556#comments</comments>
		<trackback:ping>1</trackback:ping>
		<description>&lt;p&gt;Guess how many command-line flags there are for the server JRE in the OpenJDK?&amp;nbsp; I'm hearing 42.&amp;nbsp; Kudos to all of you fans of the late Douglas Adams, but you're slightly short of the real answer.&amp;nbsp; It's 477 (give or take a flag or two).&amp;nbsp; To confirm, just go into src\share\vm\runtime\globals.hpp and src\share\vm\opto\c2_globals.hpp, which define them.&amp;nbsp; The flags control all sorts of things, some of which you are probably very familiar with like the heap (-Xms -Xmx), and some which you may not know about, such as the memory footprint settings (-XX:ReservedCodeCacheSize and -XX:InitialCodeCacheSize).&lt;/p&gt;
&lt;p&gt;I'm not asking you this because I want to know if you have intimate knowledge of the JRE (although if you can keep bits of trivia like this in your head, I am truly impressed).&amp;nbsp; My question really comes out of the world of performance analysis of Java runtimes.&amp;nbsp; Suffice it to say that as the Java Labs works to improve JRE performance, sometimes our analysis leads to improvements that can be realized by tuning these existing command-line flags.&amp;nbsp; But here's my theory...I bet most of you use few, if any, of these flags in production.&amp;nbsp; You probably have very good reasons for doing this.&amp;nbsp; You may not have access to the command line, or you may have different applications, some of which may benefit from certain flags, while others won't.&amp;nbsp; If true, the result is the same...when we look to improve JRE performance, we really need to do it in a way that is engineered to help potentially any application in a flexible way that does not require changes to the command line.&lt;/p&gt;
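&lt;p&gt;If you are not sure which flags your production JVMs are actually started with, the JVM can report them itself through the standard management API. Here is a minimal sketch, assuming Java 10 or later (it uses var); the class name ShowJvmFlags and the helper describeFlags are made up for illustration, and the output covers only the arguments passed to the JVM at startup, not the default values of the other flags.&lt;/p&gt;
&lt;pre&gt;
```java
import java.lang.management.ManagementFactory;

// Hypothetical helper class; the name is illustrative, not from the post.
public class ShowJvmFlags {

    // Returns the arguments the JVM itself was started with, one per line.
    // Arguments passed to main() are not included.
    static String describeFlags() {
        var flags = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (flags.isEmpty()) {
            return "(no JVM flags set)";
        }
        return String.join(System.lineSeparator(), flags);
    }

    public static void main(String[] args) {
        System.out.println(describeFlags());
    }
}
```
&lt;/pre&gt;
&lt;p&gt;Running it as, say, java -Xms64m -Xmx256m ShowJvmFlags should list those two heap flags; with no flags at all it prints the placeholder line instead.&lt;/p&gt;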
&lt;p&gt;So answer these two questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do you set any command line flags in production?&lt;/li&gt;
&lt;li&gt;If yes, what are they?&lt;/li&gt;
&lt;/ul&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/vMwxtTpJtfo" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=114556</feedburner:origLink></item>
	
	<item>
		<dc:creator>Tom Deneau</dc:creator>
		<title>A Java Generics Performance Puzzler</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/c2FWWwJAT94/blogpost.cfm</link> 
		<pubDate>2009-06-05T15:41:31 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=114296#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span style="font-size: larger;"&gt;&lt;span style="color: #000000;"&gt;In this entry, we&amp;rsquo;ll go down that well-worn path of looking at some microbenchmark results and trying to explain them.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span style="font-size: larger;"&gt;&lt;span style="color: #000000;"&gt;This microbenchmark created an ArrayList such that if one went thru the ArrayList in order, the entries were randomly distributed in memory.&amp;nbsp;We also had enough elements in the list that it would take some time to go thru the list.&amp;nbsp;We then wanted to go thru the list in order and &amp;ldquo;split&amp;rdquo; it so that we created two new ArrayLists, one for all the even elements and one for all the odd elements.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span style="font-size: larger;"&gt;&lt;span style="color: #000000;"&gt;There are a number of ways to code the splitting but let&amp;rsquo;s start with an approach that doesn&amp;rsquo;t use Iterators, but just uses an integer index to the get method for the source and then adds (appends) to the destination ArrayList.&amp;nbsp;The body of the loop might then look like the following:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;1&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;ArrayList&amp;lt;MyClass&amp;gt; aListSrc, aListDest1, aListDest2;&lt;br /&gt;&lt;/span&gt;&amp;nbsp;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt; ...&lt;br /&gt;&amp;nbsp;&amp;nbsp; while (idxSrc &amp;lt; NUMOBJS) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(aListSrc.get(idxSrc));&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest2.add(aListSrc.get(idxSrc+1));&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; idxSrc+=2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;After measuring version 1, you decide those add method lines are a bit wordy so you break them into two statements, using a local variable to hold the intermediate result.&amp;nbsp;Or perhaps you wanted to print some debug information for each element as you are copying it, and you needed a local variable to hold the element reference (and you then removed the debug statements).&amp;nbsp;So you end up with something like:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version&amp;nbsp;2&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;ArrayList&amp;lt;MyClass&amp;gt; aListSrc, aListDest1, aListDest2;&lt;br /&gt;&lt;/span&gt;&amp;nbsp;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt; ...&lt;br /&gt;&amp;nbsp;&amp;nbsp; while (idxSrc &amp;lt; NUMOBJS) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; MyClass myc = aListSrc.get(idxSrc);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(myc);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; myc = aListSrc.get(idxSrc+1);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest2.add(myc);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; idxSrc+=2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;But when you measure version 2, you find that it is much slower than version 1 (about 1/3 the speed in my measurements).&amp;nbsp;Before reading on, you might try to figure out why.&amp;nbsp;Is the JVM perhaps not able to optimize away the store to the local variable?&amp;nbsp;And if so, is the store to the local variable&amp;nbsp;really that expensive?&amp;nbsp; I will add that in both cases, the get and add methods got inlined nicely into the timed loop.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;strong&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: small;"&gt;Answer&lt;/span&gt;&lt;/span&gt;&lt;/strong&gt;&lt;/span&gt;&lt;strong&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;You may recall that generics in Java are implemented with type checking at compile time but with type erasure at run time.&amp;nbsp;How does that impact us?&amp;nbsp;Well for one it means that at runtime the call to &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListSrc.get(idxSrc);&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;really returns an Object, even though aListSrc is an ArrayList&amp;lt;MyClass&amp;gt;.&amp;nbsp;Therefore the statement from version 2:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; MyClass myc = aListSrc.get(idxSrc);&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;requires a runtime castcheck that the Object returned by aListSrc really is a MyClass.&amp;nbsp;If you look at the byte codes generated for such a statement, you will see a checkcast bytecode.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;To&amp;nbsp;check&amp;nbsp;whether the object returned by aListSrc.get really is of type MyClass (or a child of MyClass) the JVM must read the header of the object.&amp;nbsp;In our list splitting operation however we never had any other reason to look at any of the fields of the MyClass objects as we went thru the list.&amp;nbsp;We just copied each MyClass reference from the source list into one of the destination lists.&amp;nbsp;So by having to look at the header as part of the castcheck, we must now wait until&amp;nbsp;the object is read from memory into the processor&amp;rsquo;s cache.&amp;nbsp;And with lots of objects in the list, it makes it less likely that an object is already in the cache when we need it.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;How did we avoid the castcheck in Version 1?&amp;nbsp;&amp;nbsp; In version 1, the javac compiler used the list&amp;rsquo;s type declaration ArrayList&amp;lt;MyClass&amp;gt; to guarantee at compile time that the returned object was of type MyClass.&amp;nbsp;&amp;nbsp; And at runtime the types from the generics&amp;nbsp;were erased so basically we have&amp;nbsp;a get method returning an Object which is passed to an add method which&amp;nbsp;takes an Object.&amp;nbsp;&amp;nbsp;So no checkcast is necessary.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;Note that we can try to get around the checkcast&amp;nbsp;&amp;nbsp;by just declaring the local variable to be an Object rather than a MyClass, but now the javac compiler will rightly complain when we try to do an add&amp;nbsp;of an Object into an ArrayList&amp;lt;MyClass&amp;gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;strong&gt;&lt;em&gt;&lt;span style="text-decoration: underline;"&gt;&lt;span style="color: #000000; font-size: 12pt;"&gt;Version 3 (will not compile)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;ArrayList&amp;lt;MyClass&amp;gt; aListSrc, aListDest1, aListDest2;&lt;br /&gt;&lt;/span&gt;&amp;nbsp;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt; ...&lt;br /&gt;&amp;nbsp;&amp;nbsp; while (idxSrc &amp;lt; NUMOBJS) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;strong&gt;Object&lt;/strong&gt; myc = aListSrc.get(idxSrc);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; aListDest1.add(myc);&amp;nbsp;&amp;nbsp; // error&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&amp;nbsp;&lt;span style="color: #000000; font-size: 10pt;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="color: #000000; font-size: 10pt;"&gt; ...&lt;br /&gt;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;I should note here that if our original algorithm had looked at fields of the MyClass objects to make some decision on how to split the list, then the object would already have had to be read from memory for the other field accesses and the extra time to do the header check for the castcheck would have been insignificant.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;Even though the above is explainable by type erasure, I&amp;rsquo;m not sure it follows the principle of least surprise.&amp;nbsp;After all, I declared aListSrc to be ArrayList&amp;lt;MyClass&amp;gt; and all I did was assign the .get output to a MyClass variable.&amp;nbsp;If the javac compiler knew enough to eliminate the castcheck between the output of the get and the input of the add, why couldn&amp;rsquo;t it eliminate it between the output of the get and the&amp;nbsp;assignment to the local variable?&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;Looking at this from another angle, one might ask whether the JVM can&amp;nbsp;optimize away&amp;nbsp;the castcheck at runtime.&amp;nbsp;A check with the Hotspot folks indicated that the bytecodes are saying "throw an exception if aListSrc.get ever returns a non-MyClass object".&amp;nbsp;&amp;nbsp;And the JVM cannot elide bytecodes that &lt;em&gt;could &lt;/em&gt;cause an exception like this.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;So the message is: don't assign your return from the Collections classes to a typed local variable like this&amp;nbsp;if you don't need to.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
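&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;To see the difference without reading bytecodes, here is a rough sketch of what the two versions look like once the generics are erased. It is a hand-written approximation, not javac output: raw ArrayLists stand in for the erased ArrayList&amp;lt;MyClass&amp;gt;, the explicit (MyClass) casts stand in for the checkcast bytecodes, and the names ErasureDemo, splitDirect and splitViaLocal are made up for illustration. It shows which version performs the type check; it does not measure the timing.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;pre&gt;
```java
import java.util.ArrayList;

// Illustrative only: raw types mimic the erased form of the generic lists.
@SuppressWarnings({"rawtypes", "unchecked"})
public class ErasureDemo {
    static class MyClass {}

    // Erased Version 1: get() feeds add() directly. Both deal in Object,
    // so no cast is needed and the element itself is never inspected.
    static void splitDirect(ArrayList src, ArrayList even, ArrayList odd) {
        for (int i = 0; src.size() > i + 1; i += 2) {
            even.add(src.get(i));
            odd.add(src.get(i + 1));
        }
    }

    // Erased Version 2: the typed local needs a cast on every element,
    // which means the object header has to be read each time.
    static void splitViaLocal(ArrayList src, ArrayList even, ArrayList odd) {
        for (int i = 0; src.size() > i + 1; i += 2) {
            MyClass myc = (MyClass) src.get(i);   // stands in for checkcast
            even.add(myc);
            myc = (MyClass) src.get(i + 1);       // stands in for checkcast
            odd.add(myc);
        }
    }

    public static void main(String[] args) {
        ArrayList src = new ArrayList();
        src.add(new MyClass());
        src.add("not a MyClass");   // only possible because the list is raw

        // The direct copy never looks at the elements, so it succeeds.
        splitDirect(src, new ArrayList(), new ArrayList());
        System.out.println("direct copy: no type check happened");

        // The typed local checks every element and rejects the String.
        try {
            splitViaLocal(src, new ArrayList(), new ArrayList());
        } catch (ClassCastException e) {
            System.out.println("typed local: the cast rejected the String");
        }
    }
}
```
&lt;/pre&gt;
&lt;div style="line-height: normal; margin: 0in 0in 10pt; vertical-align: top;"&gt;&lt;span&gt;&lt;span style="color: #000000; font-size: small;"&gt;The deliberately wrong String element is only there to make the check visible: the direct copy passes it along untouched, while the typed-local version has to look at every object and so throws.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;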
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/c2FWWwJAT94" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=313&amp;threadid=114296</feedburner:origLink></item>
	
	<item>
		<dc:creator>Quentin Neill</dc:creator>
		<title>Adventures in Dual Booting OpenSolaris</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/W2UdPSsrnbI/blogpost.cfm</link> 
		<pubDate>2009-06-02T12:14:32 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=209&amp;threadid=114137#comments</comments>
		<trackback:ping>2</trackback:ping>
		<description>&lt;p&gt;This year I entered into a new role as a performance engineer for AMD, assigned with tackling any and all Sun compiler performance engineering issues for AMD's Sun alliance.&lt;/p&gt;
&lt;p&gt;This blog entry focuses on how I got &lt;strong&gt;&lt;a href="http://wn.wikipedia.org/wiki/Multi_boot"&gt;multi-boot&lt;/a&gt;&lt;/strong&gt; working on a system with both &lt;a href="http://www.novell.com/linux/sp2highlights.html"&gt;SuSE Linux&amp;reg; Enterprise Server 10 SP2&lt;/a&gt; and &lt;a href="http://www.opensolaris.org/"&gt;OpenSolaris&lt;sup&gt;TM&lt;/sup&gt; 2008.11&lt;/a&gt;, even though&lt;strong&gt; &lt;/strong&gt;OpenSolaris is installed on the second partition (most of the blogs and articles I found online recommended that OpenSolaris be installed on the primary partition).&lt;/p&gt;
&lt;p&gt;Back in the day, it wasn't called "multi-boot," it was just "dual-boot" (I suppose because having &lt;em&gt;two&lt;/em&gt; operating systems installed on one disk was almost a freak of nature). Multi-booting operating systems is somewhat of a black art, mainly involving choosing, installing, and configuring the &lt;a href="http://en.wikipedia.org/wiki/Booting#Second-stage_boot_loader"&gt;boot loader&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As a software developer in the past, I have performed such ad-hoc system setups frequently, mostly focused on bootstrapping a project. It is no different as a performance engineer.&amp;nbsp; So I recently found myself engaged in setting up a shiny new system with a couple of AMD "&lt;a href="http://www.youtube.com/watch?v=D11uY5dOE2c"&gt;Istanbul&lt;/a&gt;" processors hot off the fab. The activity generated a surprising amount of excitement ...&lt;/p&gt;
&lt;p&gt;&lt;em&gt;... A crowd of engineers gather around the latest machine. A few twists of a knob here, a button there, and a fiery glow lights their faces. They hunger for performance numbers! Overnight &lt;a href="http://spec.org/benchmarks.html#cpu"&gt;SPEC&amp;reg; CPU2006&lt;/a&gt; runs are almost too much to endure. Can we speed up the install? Should we add more memory? We want those results!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Back to the real world - I need SLES 10 SP2 for our initial studies, so that goes on first. Anticipating the need to multi-boot, I divide the disk into 3 partitions while installing SLES to the first one, namely (hd0,0). I set up and configure the SPEC benchmarks and get those started.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;... The benchmarks results finally come in. The engineers ooh and ahh over the towering new SPEC numbers. Abruptly they disperse, returning to their cubicles to digest. I finally have the machine to myself (moo hoo ha ha).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now comes the OpenSolaris install on the second partition, namely (hd0,1). It goes smoothly, except it installs a new copy of GRUB &lt;a href="http://www.gnu.org/software/grub/"&gt;(GRand Unified Bootloader)&lt;/a&gt; which &lt;strong&gt;doesn't seem to know anything&lt;/strong&gt; about the original SLES partition. When I reboot, I can't get back to the original install!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;... I have broken the shiny new machine! The light glows but it is a strange color, not the fiery glow the engineers will need any day now. Both hands inside the box, I am certain if I stop to scratch my nose I will lose control and it will fly around the cube and out the window. &lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I have two paths to try: 1) configure the OpenSolaris GRUB to see SLES 10, or 2) configure the SLES GRUB to see OpenSolaris.&lt;/p&gt;
&lt;p&gt;First I try configuring the OpenSolaris GRUB by editing GRUB's menu.lst file. Booting OpenSolaris, I look for /boot/grub/menu.lst but eventually I discover that OpenSolaris' GRUB menu file is in /rpool/boot/grub/menu.lst. I cook up an entry like this:&lt;/p&gt;
&lt;pre class="style1"&gt;&lt;span class="ColorBlue"&gt;&lt;br /&gt;	title SLES 10 SP2, kernel 2.6.16.60-0.21-smp&lt;br /&gt;	root (hd0,0)&lt;br /&gt;	kernel /boot/vmlinuz-2.6.16-60-0.21-smp \&lt;br /&gt;		root=/dev/dsk/by-id/scsi-SATA_ST3250410AS_6RYC836A-part1 \&lt;br /&gt;		vga=normal showopts ide=nodma apm=off acpi=off noresume edd=off 3&lt;br /&gt;	initrd /boot/initrd-2.6.16.60-0.21-smp&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;but after several rounds of tweaking, each ending with GRUB complaining that it can't find a valid OS, I can't get the recipe exactly right. I move on to option #2, getting the SLES GRUB booting again.&lt;/p&gt;
&lt;p&gt;At first I try OpenSolaris' &lt;strong&gt;fdisk&lt;/strong&gt; command, but I don't find an easy way to determine the &lt;strong&gt;device name&lt;/strong&gt; of the SLES disk partition (because I am unfamiliar with the OpenSolaris way of naming devices). So I decide to do it from SLES: if I can boot the SLES partition, or mount it somehow from a rescue disk, I can modify its GRUB configuration by editing the /boot/grub/menu.lst file. After some Googling, I create a SLES 10 SP2 install CD, boot it in rescue mode, mount the partition, and then add this entry:&lt;/p&gt;
&lt;pre class="style1"&gt;&lt;span class="ColorBlue"&gt;&lt;br /&gt;	title OpenSolaris 2008.11&lt;br /&gt;	root (hd0,1)&lt;br /&gt;	chainloader +1&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;While booted, I use the SLES fdisk to mark the SLES partition as bootable. When I reboot, the machine comes up and boots SLES 10 SP2 without intervention! Whew!&lt;/p&gt;
&lt;p&gt;And now I can choose the &lt;strong&gt;OpenSolaris 2008.11&lt;/strong&gt; partition at boot time, which then displays the OpenSolaris GRUB menu, which knows how to boot the OpenSolaris ZFS partition. If I had to, I could use &lt;strong&gt;fdisk&lt;/strong&gt; again to make the machine boot to OpenSolaris every time, but for now it will reboot to SLES 10 SP2 each time.&lt;/p&gt;
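&lt;p&gt;In case it helps anyone retracing these steps, here is a rough sketch of what the rescue session looks like. The device name /dev/sda and the mount point /mnt are assumptions for illustration (I am assuming the first disk shows up as /dev/sda in the rescue system), so check the partition table before changing anything:&lt;/p&gt;
&lt;pre class="style1"&gt;&lt;span class="ColorBlue"&gt;&lt;br /&gt;	# list the partitions to confirm which one holds SLES (assumed here: /dev/sda1)&lt;br /&gt;	fdisk -l /dev/sda&lt;br /&gt;	# mount the SLES root partition and edit its GRUB menu&lt;br /&gt;	mount /dev/sda1 /mnt&lt;br /&gt;	vi /mnt/boot/grub/menu.lst&lt;br /&gt;	umount /mnt&lt;br /&gt;	# interactive fdisk: 'p' prints the table, 'a' toggles the bootable flag&lt;br /&gt;	# on a partition (enter 1 when asked), 'w' writes the change and exits&lt;br /&gt;	fdisk /dev/sda&lt;/span&gt;&lt;/pre&gt;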
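&lt;p&gt;For the curious, this is how I read the chainloader entry, written out as comments. It is my understanding of GRUB Legacy rather than anything specific to this machine, and the partition numbers simply follow the layout I described (SLES on (hd0,0), OpenSolaris on (hd0,1)):&lt;/p&gt;
&lt;pre class="style1"&gt;&lt;span class="ColorBlue"&gt;&lt;br /&gt;	# GRUB numbers disks and partitions from zero, so (hd0,1) is the&lt;br /&gt;	# second partition of the first disk, where OpenSolaris was installed&lt;br /&gt;	title OpenSolaris 2008.11&lt;br /&gt;	root (hd0,1)&lt;br /&gt;	# +1 is a one-sector block list: load the first sector of that partition&lt;br /&gt;	# and jump to it, which starts the GRUB that OpenSolaris installed&lt;br /&gt;	chainloader +1&lt;/span&gt;&lt;/pre&gt;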
&lt;p&gt;&lt;em&gt;... The light is restored and the machine is ready. When the engineers return to take the machine they will be able to use it as they did before, but I have left a door to my little workshop where I can return when their interests move on to the next shiny problem. &lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/W2UdPSsrnbI" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=209&amp;threadid=114137</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>"Istanbul" overview</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/HrxIjTvuk2Y/blogpost.cfm</link> 
		<pubDate>2009-06-01T12:20:27 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=271&amp;threadid=114108#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;Today, AMD is launching the "Istanbul" processor. Since our first dual-core processor, those of us in AMD's CPU ISV team have been evangelizing that more cores are coming. This processor contains 6 cores on one die. Just to be clear, these are 6 distinct physical cores, just as the "Shanghai" processors contained 4 distinct physical cores. Each core has 512 KB of L2 cache and 128 KB of L1 cache. The L3 is a 6 MB cache shared by the six cores. The "Istanbul" processors are MP capable, supporting up to 8 processors (48 cores). There have been numerous refinements made to this processor. One notable change is the addition of a Probe Filter, which you may also see referred to as HyperTransport&lt;sup&gt;TM&lt;/sup&gt; technology HT Assist, or simply HT Assist. Simply put, this filter can greatly reduce HT traffic between multiple sockets, which in turn can improve memory bandwidth, especially on 4-socket platforms. For those with an interest in silicon, the "Istanbul" processors are fabricated with the 45nm SOI process. And did I mention that these processors use AMD's existing Socket F (1207) infrastructure? That means that on many platforms all that is needed is a simple BIOS upgrade. &amp;nbsp;Other additions include HT3 capability and numerous power-saving features. &amp;nbsp;More blogs to come on the cool new features of "Istanbul" - otherwise known as the new Six-Core AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processors.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://sites.amd.com/us/atwork/promo/Pages/six-core-opteron.aspx" target="_blank"&gt;&lt;img src="http://developer.amd.com/PublishingImages/47217D_SixCore_Opteron_Blac.jpg" border="0" alt="AMD Opteron(TM) processors" /&gt;&lt;/a&gt;&lt;br /&gt;Six-core AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processor&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/HrxIjTvuk2Y" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=271&amp;threadid=114108</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>"Shanghai" blog category is now "Istanbul" blog category</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/xtBlW3meiac/blogpost.cfm</link> 
		<pubDate>2009-06-01T12:15:50 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=271&amp;threadid=114107#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;With the launch of the new Six-Core AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processors (codenamed "Istanbul"), the powerful follow-up to the "Shanghai" processors, we're updating the title of this blog category to reflect the information you will now find here.&amp;nbsp; Don't worry, the previous content isn't going away - it's still very valid, since the "Istanbul" processors build on foundations that were laid by the "Barcelona" and "Shanghai" processors, and add advancements in many features.&amp;nbsp; Check back often for new write-ups on these features, and visit our "Istanbul" Zone for a round-up of everything you need to know about this enhanced generation.&lt;/p&gt;
&lt;p&gt;We'd appreciate hearing what you think about the new "Istanbul" processors, so leave us a comment!&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/xtBlW3meiac" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=271&amp;threadid=114107</feedburner:origLink></item>
	
	<item>
		<dc:creator>AMD DeveloperCentral</dc:creator>
		<title>"Shanghai" Zone is now "Istanbul" Zone</title>
		<link>http://feedproxy.google.com/~r/AmdDeveloperBlogs/~3/6yZTH_U5rv0/blogpost.cfm</link> 
		<pubDate>2009-06-01T12:13:01 -05.00</pubDate>
		<comments>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=114106#comments</comments>
		<trackback:ping>0</trackback:ping>
		<description>&lt;p&gt;Looking for our "Shanghai" Zone?&amp;nbsp; All the content and resources you expected to find are still there, but we've added some new information about AMD's follow-up to the Quad-Core AMD Opteron&lt;sup&gt;TM&lt;/sup&gt; processors (codenamed "Barcelona" and "Shanghai") and have renamed the content section to "Istanbul" Zone.&amp;nbsp; The new Six-Core AMD Opteron processors (codenamed "Istanbul") retain all the features of the "Barcelona" and "Shanghai" processors and add further advancements in those technologies for even better performance.&amp;nbsp; Find out what's new with this six-core processor in the &lt;a href="http://developer.amd.com/zones/istanbul/Pages/default.aspx"&gt;"Istanbul" Zone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We'd appreciate hearing what you think about the new "Istanbul" processors, so leave us a comment!&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/AmdDeveloperBlogs/~4/6yZTH_U5rv0" height="1" width="1"/&gt;</description>
	<feedburner:origLink>http://forums.amd.com/devblog/blogpost.cfm?catid=208&amp;threadid=114106</feedburner:origLink></item>
	
</channel>
</rss>
