WWDC2004 Session 601
Transcript
Kind: captions Language: en

Okay, good morning. My name is Doug Brooks; I'm a hardware product manager at Apple, and this session is entitled HPC Technology Update. In this session we'd like to take a look at Apple and HPC, and specifically at Apple products and technologies that can contribute to HPC deployments. We'll also look at industry-leading third-party products that complement those solutions, and we'll hear from two customers who will talk about their HPC deployments using Apple technology.

First, let's take a look at Apple and HPC. In the last year, when you think of Apple and HPC, the first thing that comes to your mind is this... if the remote works. There we go, technical failure. Virginia Tech. Virginia Tech really was an early leader, in that they saw the vision and the power of the G5 and Mac OS X Server. Combining over a thousand Power Mac G5s and Mac OS X Server software, and using InfiniBand technology for the interconnect, they achieved phenomenal performance: over ten teraflops of computing power. On its debut it ranked number three on the Top500 list, and it is the number one academic supercomputer in the world. An amazing achievement. And they also really proved the value and the price point: that you could build and deliver a very high-performance supercomputer system with Apple technology.

Now, what's interesting is that while they may have been the first, and definitely the largest, G5 cluster deployment, they were definitely not the first Apple cluster deployment. Matter of fact, a lot of the early cluster work on the Macintosh platform was done a number of years earlier, most notably at UCLA with the AppleSeed work. You may remember this from the late '90s; I believe this is circa 1998. UCLA, with the AppleSeed project, took what was at the time pretty fast technology: beige G3 233 MHz systems with 10/100 Ethernet as an interconnect. It was running early on Mac OS 8 and was using Apple Events as the middleware. Nevertheless,
doing the high-energy physics work that they were doing, it achieved phenomenal performance at very low cost, and actually the system you see here on the screen outperformed a Cray Y-MP on similar codes. So again, showing the value: some of the same things Virginia Tech has proved with the G5, they had actually shown much earlier.

Of course, we've come a long way since then, from the beige G3 days to the G4, introducing desktop supercomputing with the Velocity Engine and bringing phenomenal vector processing capabilities that many applications have been able to take tremendous benefit from. And of course, most recently, the G5, bringing phenomenal floating-point performance and vector processing with a system that delivers very high memory bandwidth and system throughput, providing a phenomenal foundation for computing.

And of course the customers have responded. Higher education customers tend to be on the leading edge; we've seen some of our strongest early adopters in the higher education market. And if you were at the last session, you heard about some of the deployments in the scientific field. Matter of fact, life sciences in particular has been an early adopter of our technology, primarily because many of the key applications that are run day in and day out in the life sciences arena have been Velocity Engine optimized and run at very high performance on the G4 and G5 processors. So we're seeing lots of deployments in the sciences field taking advantage of our tools and our technology, combined with the ease of use that we provide in our systems.

And of course we've had a phenomenal community of vendors bringing technologies and products. You developers, bringing your applications to Mac OS X, have really grown this community of people and applications that are able to run on this platform, and so we're seeing more and more solutions become available to the
HPC community on Mac OS X, and growing all the time. We've also seen a community of expertise build up in the HPC community and the cluster community around Mac OS X and around the G5. A notable one is The BioTeam, a consulting company that focuses on the life sciences arena, really delivering great solutions for life sciences based in large part on Apple technologies.

And of course Apple has responded, with products really focused and targeted at this market. Most specifically, the Xserve cluster node: a machine tailored and streamlined specifically for clusters and high-performance computing configurations, with full dual-processor compute performance without a lot of the extras you don't need when you fill a whole rack with Xserves. And we've taken that even further. Matter of fact, most recently, with the Apple Workgroup Cluster, we've taken a complete solutions approach to this product line, being able to offer not just hardware but hardware, software, cables, power, rack, interconnect: everything you need for a complete solution around bioinformatics. And it's been very, very popular; the response has been phenomenal. We're very proud that it won Best of Show at Bio-IT World earlier this year. And again, for customers doing bioinformatics, it is the easiest cluster to set up, really bringing Apple performance and ease of use to the cluster space.

And finally, we've had phenomenal customer response. This is a small sampling of customers who have recently deployed clusters based on Apple technology, Xserve and Mac OS X Server. We really see this market continuing to grow, with more very exciting customer deployments, and again, you're going to hear about two of them later on in this session. Matter of fact, we're seeing right now roughly forty percent of our Xserve units going into clusters and high-performance computing, and as Xserve continues to grow, we see this continuing to grow as well as an important slice
of the Xserve pie, you might say.

So what does it take to actually put a cluster together with Apple technology? Obviously it's a lot more than just racking a bunch of Xserves in a rack. So what we want to do is take a look at the HPC technology stack: what are the components on Mac OS X that it takes to build an HPC deployment, and what technologies and products are available from Apple and third parties to complement this stack?

So this is a view of what we'll call the HPC building blocks, the components that are required to build an HPC deployment. If we look from the bottom up: the hardware platform itself, the operating system, the interconnect, compiler and optimization tools, the communications middleware, and finally the management tools. To really do a complete cluster deployment, we need components from all of these. Now, if we take a look at what Apple products and technologies provide, you see Apple provides products that fit in about four of the six areas. And if we look at industry-leading third-party technologies, we have again about four of the six components to choose from. As we walk through, you'll see we have a wide selection of products and technologies for deploying clusters on Mac OS X.

So let's take a quick review, walking the stack from the bottom up, starting from an Apple perspective with the hardware component. This is pretty straightforward. First and foremost we have the G5 processor. The G5 really stands out as a processor for high-performance computing, with its 64-bit capabilities and its massive floating-point and Velocity Engine support. Coupled with the power advantages it has with the smaller 90-nanometer technology, it really delivers phenomenal bang for your buck, and for your power and heat output as well. So obviously that becomes one of the core foundation pieces of
the product. As we look up, of course, we then wrap that up in the Xserve G5, providing dual-processor performance in a 1U form factor: a system that delivers peak performance of 16 gigaflops of double-precision floating point, or 32 gigaflops of single-precision floating point with the Velocity Engine. Again, a very, very powerful system. Couple that with the latest I/O technologies, PCI-X, ECC memory, and integrated hardware monitoring for systems management, and again very low power. When we compare it to competitors, a compute node configuration of Xserve at one hundred percent CPU is under 250 watts of total power usage, and well under a thousand BTUs an hour in heat output for the data center that needs to cool these systems. That's significantly lower than competitive systems, and a great advantage that we have with the G5 processor.

We also have Xserve RAID. The corollary to a lot of computing power is the data that needs to be fed into those systems, and that data needs to be stored somewhere. We see Xserve RAID as the ideal storage device for high-performance computing clusters, with phenomenal performance and phenomenal capacity. You have the ability, through Fibre Channel, to put quite a large amount of storage online and serve it throughout your cluster. And again, with breakthrough price/performance, Xserve RAID is just an ideal storage device for high-performance computing clusters.

We then take the next step up. From an operating system perspective we have Mac OS X, which has really provided a key foundation for cluster applications. What's really unique about Mac OS X is that we have the ability to combine the power of UNIX under the hood, which allows us to bring applications and technologies over and compile and run them on Mac OS X, but combine that with the great
ease of use, ease of deployment, and ease of management that the Mac OS X Server services provide, and of course, with the G5 optimizations, to deliver the performance out of the G5 processor.

And if you've been to any of the sessions yesterday, you know that we've introduced Tiger and Tiger Server. One of the most important things we're able to bring to the HPC space with Tiger is a true 64-bit user space environment, breaking that 4-gigabyte barrier in user space. We've always been able to access more than 4 gigabytes of memory with the G5 system, even on Panther, but now applications have the ability to access large data, so code coming over from other platforms and other operating systems, especially, will be able to take advantage of that large memory footprint. We also expect to see a number of across-the-board improvements in other areas of Tiger: improved SMP performance, improved network performance, improved NFS performance. Things that we think are going to deliver a phenomenal platform for the future of high-performance computing with Mac OS X.

If you're interested, I'd also encourage you to attend some of the Xsan sessions later this week. Xsan is Apple's SAN file system for Mac OS X, and Xsan has a role to play in clusters as well, especially on larger clusters where file I/O bandwidth is a concern. Xsan gives you a file system that has the ability to scale out in file I/O services, so Xsan again plays a role in high-performance clusters, letting you share a large data pool across a large clustered environment.

Then, as we move up the stack, we look at cluster interconnects. This is an area where we really work with industry-leading third-party vendors to provide phenomenal solutions on Mac OS X. Today, with interconnects, there are really a couple of leading choices. First and foremost we have Gigabit Ethernet, kind of the common denominator for cluster interconnects, and of
course, from an Apple perspective, we provide on Xserve G5 two very high-performance Gigabit Ethernet ports, so out of the box you're ready to connect those systems together. But for clusters that really need higher bandwidth at lower latency, there are two leading choices available for Mac OS X: the first is Myrinet technology and the second is InfiniBand, and I wanted to touch on both of those.

First, a little bit about Myrinet. Myrinet is a technology that's been in the HPC space for many, many years. It's really established itself in this space; matter of fact, if you look at the Top500 list, you'll see quite a number of clusters that are built on Myrinet technology. We're really proud that Mac OS X has very mature drivers that deliver excellent performance with Myrinet. Matter of fact, when we first did the original Xserve, which introduced 66 MHz PCI slots, Myrinet cards were the cards we actually used to tune the bus and verify we were getting maximum performance out of the system for PCI transfers. In working with their engineers, they told us we were seeing much higher throughput than even the latest PC chipsets at the time. So Myrinet is a really excellent choice.

Here are some views of their latest PCI-X card and switch components, the key components used in a Myrinet deployment: a PCI card that goes in your system. And here are some performance numbers, provided by Myricom, on what Myrinet is capable of achieving. You can see here that the latencies are much, much lower than Gigabit Ethernet, so for applications where that's really critical, Myrinet is an option you can choose. The way they achieve this is actually the same way many of these interconnects work: they are able to bypass the largest section of code that contributes a lot of latency in
this system, and that's specifically the IP stack. As you can see from the diagram here, applications that call through MPI stacks over Myrinet bypass the IP stack and go directly to the hardware, and are able to achieve much lower latency than what Gigabit Ethernet can provide. So for applications that require that, it becomes really key. Myrinet actually had a number of recent announcements at the last supercomputing conference earlier this month. Most significantly, what they debuted are much larger switch form factors, enabling you to scale Myrinet to much higher cluster node counts at much lower price points. So Myrinet is definitely a very compelling solution on Mac OS X for people looking for lower latency at a very attractive price point.

What I'd like to do now is introduce John [surname unclear] from Voltaire, who's going to talk about InfiniBand on Mac OS X. John?

Thank you, Doug. All right. So to talk about InfiniBand is certainly to talk about clusters, and Doug just gave a good introduction there. What we're seeing in the HPC market is a lot of transition from supercomputers, mainframes, and SMP machines to a bunch of interconnected servers. Why? Cost is the overriding reason. We can put together one of these clusters with a bunch of servers for, realistically, one tenth of the cost of some of the existing SMP machines. Historically, though, that's been fraught with complexities: underutilization of processors, storage bottlenecks, just the complexity of hooking these up.

So what does it take to build an effective, efficient cluster? You have physical distribution and logical consolidation. On the physical side you need high bandwidth (InfiniBand offers 10 gigabits and 30 gigabits per second today); low latency for inter-processor communications; low CPU overhead, since you don't want the CPUs spending all their processing time communicating with the other processors; and the ability to scale. On the logical
consolidation side, you need the ability to logically group nodes and systems into logical sets, groups, or domains. So we need a high-performance interconnect and an intelligent interconnect.

What InfiniBand brings to the table is that it's an open standard: the first open-standard interconnect designed from the ground up for high-performance interconnect, with RDMA support. So the extraneous features and functions that you might get in other technologies are not there; it was designed from the ground up with high-end clustering in mind. Because of that, we think it has a significantly lower cost/performance ratio than the other options available for clustering. The latency is about 140 nanoseconds per hop and 5.8 microseconds end to end. A key feature of InfiniBand is that it supports multiple types of traffic over the same fabric: file, block, network, IPC, all using a single technology efficiently. From the beginning, extensive management and monitoring capability has been built in, so high availability, quality of service, and partitioning-type capabilities were built in from day one. And again, it already supports 30 gigabits per second today; there's something called DDR, double data rate, and QDR, quad data rate, supporting up to 120 gigabits per second, being worked on today. So the real key points on InfiniBand: high bandwidth, low latency, low CPU utilization, and the ability to scale.

This is a high-level representation of an SMP machine on the left, shown as an 8-processor system with a proprietary interconnect, versus four two-way servers interconnected with InfiniBand. I want to make it clear that SMP, symmetric multiprocessor, systems will still be a perfect solution for a lot of applications. To use the cluster you have to be able to parallelize the application, and we're looking at near-memory speeds, but the cost/performance difference compared to the SMP is drastic.

On the InfiniBand link protocol, I'm just really trying to
make one or two points on this overhead. One is that with a single event you can move a large amount of data, up to two gigabytes, and the entire link protocol is handled in hardware, so reassembly, segmentation, and all of that is handled in hardware. And we are on the third generation of ASIC technology for InfiniBand.

Some more InfiniBand link attributes. Each packet is sent with a service level; there are up to 16 service levels, or SLs, supported. There's also something called a VL, a virtual lane; there are 15 virtual lanes possible over a single physical link. The SL is mapped to a VL, which is then arbitrated across the physical link. Basically, that is the basis for your quality-of-service implementation, which lets you efficiently mix and match different types of traffic across the same InfiniBand connection.

InfiniBand also defines what it calls partitions. This would be similar to zoning in Fibre Channel or VLANs in the IT world. It's a mechanism for defining isolated domains, so each port or node can be assigned to a partition to communicate only with nodes in that partition, and be given full or limited rights within that group. That's all defined by the subnet manager; the SM manages it by assigning partition keys.

InfiniBand is based on a 2.5 gigahertz signaling rate. So when you hear the rates for InfiniBand, 1X is 2.5 gigabits per second, and there were really no implementations done at that data rate. 4X is where most of the implementations are today, and 12X is 30 gigabits per second. So today at Voltaire we can support the 10 and 30 gigabit per second rates. Over copper cabling it's a 17-meter distance limitation, and one kilometer with multimode fiber. And efforts toward five and ten gigahertz signaling rates are in process.

InfiniBand has a very rich protocol stack defined. On the upper layer you'll see a bunch of stuff that looks familiar, you know: NFS, RDMA; and
for NFS we'll have RDMA InfiniBand support. MPI, as I think Doug mentioned, the Message Passing Interface, is far and away the most popular IPC API in the HPC world. I think that was four three-letter acronyms in one short sentence; let me see if I can do it again. So the Message Passing Interface is the most popular application programming interface for inter-process communications in the HPC market, and that's supported in the Apple world today. iSCSI is for storage support across the fabric. SDP is the Sockets Direct Protocol, and any application with a sockets-level API can utilize that. Of course there's TCP and IP over IB. And DAPL, the Direct Access Programming Library (a lot of acronyms here): DAPL defines the API to RDMA. And then there's a full suite of InfiniBand services below that for management and monitoring; in my 10-minute time slot we won't be going through those right now. And HCA is the terminology used for the host channel adapter. That's sort of an InfiniBand term for a network interface or host bus adapter; they call them HCAs.

So what value does InfiniBand bring to the HPC market? It's the first industry standard to enable server clustering. Doug mentioned Myrinet, and there's Quadrics: two other interconnects out there, which are proprietary. Clustering is the fastest-growing segment in the HPC market, so that's why companies like Voltaire and Apple, working together, are very interested in that space. There are excellent performance advantages over the other options. We're currently seeing lots of interest at the universities and labs; the DOE labs specifically are very aggressive in pushing the standard forward, purchasing products, and implementing large clusters. And pricing has dropped quite a bit. We're certainly not at what I would call economies of scale yet, but we've still seen about a fifty percent price reduction in the last 12 months or so, and we think we'll see more reductions as
volume increases. And the Virginia Tech system Doug mentioned, 1,100 G5 nodes, a $5 million cluster, was number three on last November's Top500 list. What's really key is the $5.2 million, which is a lot of money anywhere, but for this type of system it is literally one tenth the cost, or an even bigger difference than that, compared with the other systems in the top ten. I want to thank Doug and Apple for having us here and working with us. Ten minutes; that's all.

Thanks, John. We're really excited about Voltaire's InfiniBand offerings, because for customers who are looking for a very versatile interconnect with great latency and bandwidth properties, InfiniBand is very attractive and is gaining tremendous momentum in the HPC space.

I'd like to move on, go back to walking up our HPC building-blocks stack here, and take a look at compiler and optimization tools. The interesting thing about HPC is that it's really a segment of advanced users and developers. I've never met an HPC deployment that's not taking advantage of its own tools or compiling its own programs, and so the compiler and the optimization tools that are needed to really eke the most performance out of their code become a very important piece of the technology. From Apple's perspective, of course, we have Xcode, and Xcode is just a phenomenal development environment: being able to leverage the productivity features, with a great user interface and great tools to build, develop, and debug applications. And the fact that I can write a program on my PowerBook and then send it up to my cluster for execution, optimized for the G5, is incredibly powerful. And of course, as we improve Xcode, the pre-release versions you've received this week actually begin to introduce some of the 64-bit capabilities for large memory space, so you can already begin working with those tools.

A very important part of Xcode is the CHUD tools. If you're not familiar with them, CHUD stands for
Computer Hardware Understanding Development, and these are tools originally written internally within Apple to help us optimize and understand the implications of code executing on our systems. These tools turned out to be extremely powerful, and have been made available as part of our developer toolset; they're now a standard part of the Xcode installation. If you have Xcode installed, you'll find them right in your Developer folder. These tools are incredibly important in this space, to really understand the performance bottlenecks and implications of your code running on our systems. I've seen numerous examples: for example, people who are convinced their code is processor-bound on a G5, and with some simple profiling with Shark, one of the key tools in the CHUD set, find out that maybe it's more memory-bound, and that there's some tuning that can be done to improve throughput. So these are very, very important tools in our toolset, and if you have an opportunity, I really encourage you to go to some of the sessions this week on the CHUD tools for a better understanding of their capabilities.

Another important piece of the space is Fortran. Fortran continues to be one of the top scientific programming languages, and it's an area where Apple works with third parties to develop and provide great solutions on top of Mac OS X. So it gives me a lot of pleasure to introduce Dave Paul mark from IBM, who's going to talk about XL Fortran. Dave?

Great, thanks, Doug. I'm really happy to be here today. It's great to be able to stand in front of a bunch of Apple developers as an IBMer and talk about our technology. It's not just the processor this time: we're going to talk about some software today, and some hardware too. Let's see... there we go.

So, what we've brought to the Apple processor and to Mac OS X: we've got a compiler that's got a long history behind it. This is the IBM Fortran compiler that's been behind our systems since the very early '90s, and even
going beyond that. We use this technology inside our C compilers as well, and we have XL compilers for C, C++, and Fortran on this platform. It's been used by very important IBM customers, mainly on AIX, but we're starting to see some movement to Linux and Mac now: people like LLNL, NERSC, NCAR, and a European weather forecasting group. We deal with these people every day; we understand their problems, we understand the kinds of applications they have to develop, and we built a compiler for them.

Now, when you pick up XL Fortran, you're not just getting performance; you're getting language standards conformance. This helps a lot with porting: if you have something that runs somewhere else and is conformant, we're going to be able to handle it. So we're fully Fortran 77, fully Fortran 90, fully Fortran 95, and we've started on Fortran 2003, which we expect to be ratified, hopefully, by the end of this year. People have asked us for some things early from that, and as the standard congealed and got a little more stable, we went ahead and did things like the IEEE modules, allocatable components, and stream I/O, things that people were asking us for. We have people on those standards committees, so we know what's coming and we have a voice in there. We also handle things like OpenMP, where we also have folks on the standards committee.

We're not just standard conformant; we also have extensions. Surprise, it's Fortran. So we do OpenMP 2.0, and we're fully compliant with that. On Mac OS X it's a technology preview as yet: a preview of some of the technology that we've deployed on AIX and on Linux on PowerPC, which has been out there for quite a while now. But we do other things from other companies: we've got Cray pointers, 128-bit floating point, 64-bit integers, STRUCTURE, RECORD, UNION, MAP, and so on and so on. There are way too many options to try to describe them all in this group. But suffice to say, for things like STRUCTURE, RECORD, UNION,
MAP, we've had customers come to us and say, you know, we would like to buy IBM hardware, but you don't have this. Well, they got it, and now they have IBM hardware. We do that sort of thing all the time. These kinds of requests come in; we want to hear what you need, and we'll talk to you about that.

We've also got some very important extensions for the PowerPC in particular. The PowerPC hardware intrinsic functions and directives give you access, at the source level, to the hardware instructions. So you can code something as a directive or as a function call, and what you're going to get is the particular instruction that you need at that point, something like data prefetch, for example. Very powerful. We also give you an XLF utility module that you can use to get access to some common system services, so you don't need to go off and code that yourself.

Now, we're in Xcode, and that's real exciting, but for folks that still like a UNIX command line, we're there too. Makefiles still work, gdb works with us, and we work well with gdb. Now, as you go up the opt levels, obviously things start to go down a bit (sorry, the support goes down a bit), but we do have some directives and so on that you can put in your source at certain points to get you the information you need to debug, you know, to get back to that traceback or whatever it is you're having trouble with. And this time, something that isn't here that should be: Shark. I love that tool; I wish we had it on AIX. We work well with Shark, and that's the message: use it, and you'll find some amazing things. It's one of the most popular things in Toronto for digging down into the problems that we get from our customers when analyzing performance problems.

But it's not always just debuggers and so on. We also give you some options to use for finding problems in your code. So you can automatically insert checks to find, you know, "oh, I went off the bounds of that array": it'll trap and tell you that,
stop it from going off and corrupting memory; you know, automatic initialization of variables where you need that to happen; and a rich set of listing information that you can dig through to understand what's going on with your program.

The runtime environment: we have our own Fortran runtime that we ship with the compiler, and the message there is that if you build an application with XL Fortran, you can take that runtime and give it to your customers as well, so that they can run that Fortran code on their systems. We give you a lot of tuning levers and buttons and dials through environment variables. You can control things such as the characteristics of the I/O that's going on, and error reporting: what kinds of messages you want. Do you want to know when you're doing something that isn't Fortran 90 conformant, for example? You can do that sort of thing. But in this space, certainly, thread scheduling models, number of threads, and thread profiling environment variables are real important things, and of course all the things that OpenMP defines, we've got those.

Binary compatibility is a very important thing in this. You can take our objects and work with other objects from GCC, g++, and of course IBM XL C and C++; take that whole bundle, put it together, and there you go: you've got your application, mixed with as many languages as you like. And we've added some things like -qfloat=complexgcc and -qextname, just options, but the message is that where we needed them, we've added some things to help out with that binary compatibility.

There we go. Now, we exist because of optimization. If we weren't a good optimizing compiler, we wouldn't be there in Toronto doing this every day for the last 14 or 15 years. So the optimization components that are in XL Fortran are in all of IBM's core compilers on all our important systems: C, C++, COBOL, PL/I, on AIX, Linux, the mainframes, and now Mac; pSeries, iSeries, and of course
G4 and G5 the message is we've taken all that that we built up over those number of years and all those different platforms and brought it down to the Apple platform and we're seeing some really important success with that the XL compiler is used by IBM on AIX to announce SPEC performance numbers so again the message is we know how to tune for those chips IBM does the chips we work with the chip designers we know what's coming we know how to tune for those things and we build our own software with it AIX DB2 Lotus Domino they are all built with these IBM compilers as you might expect optimization options we go to five at the base level so -O0 all the way to -O5 and you can go from basically almost no optimization up to wow what did this thing do to my code I can't recognize it anymore and we've got a whole set of again switches dials knobs and levers that you can play with in order to tune the optimization to what you need to have happen on your application things like -qhot enables the high-order transformation loop optimizer that was built to understand Fortran 90 array language and syntax and can take those loops and do some amazing things with them it'll also work with C code as well when you use it in our C compiler the -qarch option tells us on the Mac OS machine do you want to target a generic PowerPC in other words G4 or do you want to go to G5 which I'm sure most of this group is interested in and that enables inside of our optimizer all the modeling and tuning capabilities that we bring to bear a lot of person-years of effort have gone specifically into the G5 using -qarch=g5 allows the optimizer to precisely model your code as the chip is going to see it because it understands the chip understands how many units there are and how to keep that processor busy that's what it's trying to do with
the scheduling model and using -qarch=g5 also gives you access to that rich set of PowerPC intrinsics again that you can use for things like cache control certain arithmetic operations that you might need and floating-point control if you want to toggle things in the status and control register for example and the nice thing about that if you're interested in moving your code from one IBM kind of system from one IBM chip to another those same intrinsics work on compatible chips if you're going to AIX or Linux same story in the other direction if you have some code up there on AIX that you want to bring down to G5 those intrinsics are going to work too IPA is sort of the keystone of our optimization technology and really differentiates us and what we can do with your application when you've got IPA involved in your compile and it runs automatically at -O4 and -O5 what it does for you is when you compile your code it inserts information into your object which is essentially invisible to the linker so if you just take those objects and feed them into ld out comes your a.out you're happy but if you then use IPA when you link your application it then extracts that information that is hidden away in the .o's and reoptimizes your code again this time not on a file by file basis but it's got the entire application there it's got all the .o's that make up the whole thing and it understands that this was called with this and that was called with that and so we don't have to worry about this parameter we'll just stick a 7 in there that kind of thing so what it can do is repartition your application into more logical units that are easier to keep in memory together and do massive amounts of inlining where it makes sense and it can even go across languages so if you build your application mixed mode with C C++ and Fortran if you build all that with IBM XL compilers and run the IPA link step
you can do things it will do things like take your C code and inline it into your Fortran application and yeah that's an amazing technology that we bring down to the G5 and of course after it does all that then we go back down into the low-level optimizer again which is the one that really understands the chip and tunes for that PDF profile-directed feedback is another important technology especially useful for codes where you may have it instrumented with debug or perf or perhaps some tuning information that you want to use to gather statistics what PDF will do for you is you build your application once with -qpdf1 run your application with typical sample data that will write out a statistics file compile your application again with -qpdf2 and it will read that statistics file and that will tell the optimizer oh look 99 percent of the time you take the branch this way not this way and so we can take your most frequently executed code and put that in line and the stuff that almost never executes goes off to the side and you get much better performance out of that and of course again the message is the XL compilers share the technology so if you build stuff with our C compilers and use PDF you can mix that in with the Fortran compiler OpenMP and SMP are very important to this space and we've got a lot of experience with these again technology preview on Mac OS X right now but again we're bringing that down from some platforms where we've had a lot of time to work on that we fully implement the 2.0 standard and the important thing about OpenMP for us is our optimizer fully understands what OpenMP is and what SMP is and so we can take things like the -qsmp=auto option and put it in our compiler where it can take a look at your application and automatically parallelize things where it makes sense to do so so you've got a couple of choices in the way you want to do
things if you want to code to the OpenMP standard that's great we'll handle that but we'll also automatically parallelize for you where we can and again it's another one with dozens of switches that I can't talk about right now we'll give you lots of directives and options on the optimizer as I said before there's a couple of variations on this where you can go into your source and say things about your code to say this loop has this characteristic and that will give the optimizer even more opportunities to go and do things that it might not be able to recognize otherwise but in some cases you want to constrain the optimizer a lot of older code especially may not be a hundred percent standards-compliant so things like -qalias=nostd will let you crank up the optimization level and still have your code run correctly even though there might not be as many opportunities as if your code was standards conformant and of course things like -qprefetch will automatically insert prefetching directives where that's useful we had a great example of that yesterday where we had a gentleman in the lab across the hall working with us he brought his code in and just with some analysis with Shark and looking at things we stuck in one directive and sped up the core loop in his application by a factor of two just by doing a prefetch so the summary is IBM XL Fortran and XL C bring to the Apple G5 systems technology that has been in the works at IBM since honestly the mid-80s and it's been improved every year with a large team in Toronto and we work closely with the chip folks we're fully backed by IBM's premier customer service it doesn't matter if you buy the compilers from Absoft or you buy them from IBM it's still the team in Toronto that's going to be looking after you and our standards compliance and the large range of extensions that we have let you bring your code down
from pretty much anywhere and we'll help you out with the things that you need thank you okay great thank you Dave okay continuing along our stack I want to talk a little bit about communications middleware this is typically what we see as the MPI layers of a cluster the great thing about Mac OS X again leveraging off that UNIX foundation is that just about all the major MPI stacks have been brought over to Mac OS X and run really really well as a matter of fact some of them have been really optimized for Mac OS X for example LAM/MPI is available as a package installer for real ease of installation right on top of Mac OS X so a great selection of tools as a matter of fact if you have experience with a particular MPI stack hopefully you'll see that the exact same stack is available on Mac OS X and you can leverage that familiarity on the platform so both open source and commercial stacks are available for Mac OS X there are a number of other pieces of middleware obviously we talked about OpenMP Globus PVM Paradise and Linda from SCA and a recent product XLR8 from GridIron Software also fall into this communications middleware stack and of course all are available on Mac OS X finally I wanted to touch on management tools this is an area where we think Mac OS X really shines because again you have the best of breed tools available from Apple to really make managing these systems particularly head nodes and things where you're providing file services and network services very easy for system administrators managing you know whether it be a small cluster or a large cluster we also have the benefit of great open source tools to really provide added value and functionality so if we drill into this of course first of all we start with Apple's management tools Server Admin and Workgroup Manager for providing kind of the bread and butter you know file services DNS DHCP directory services things that you kind of forget about but you
know it's a network infrastructure you need these to support cluster operations one of the highlights is Server Monitor the tool that's unique to Xserve Xserve G5 has over 30 sensors on the logic board I like to joke it's one of the most instrumented 1U servers in the industry Server Monitor is the tool that allows you to wrap up that data and provide that status information about the hardware temperatures predictive drive failures power consumption all that data is available in Server Monitor it's a great complement when you're managing a large number of machines beyond that we also have a new piece of technology from Apple introduced not too long ago as a technology preview which is Xgrid again taking that ease-of-use approach of how do we make deploying clusters easier how do we make distributed computing easier Xgrid is really a great solution for this class of problems where you want to distribute workloads across a number of machines what's interesting about it is that not only can it take advantage of dedicated cluster resources such as a rack of Xserves you can also bring in ad-hoc resources through Rendezvous technology out across desktops and other machines on your network the recent Technology Preview 2 added MPI support which makes running and dispatching MPI jobs across your cluster much easier and of course it provides a great user interface all the way down to the tachometer to let you see how much performance you're getting on your jobs so we're really excited about Xgrid and of course now that it's being brought into Tiger it's going to be very broadly available to Mac OS X systems finally again I wanted to touch on some of the leading open source and commercial tools in this space most notably schedulers you know again the top schedulers available in the industry are available on the Mac OS X platform LSF in the commercial space PBS and OpenPBS Sun Grid Engine now called N1 Grid Engine even the Maui scheduler are available for Mac OS X and also
some of the leading cluster management and monitoring tools tools like Ganglia and Big Brother are also available for Mac OS X and are very valuable resources there so in summary if we look all the way from the hardware up to the management tools we have a really compelling set of products and technologies both from Apple and industry-leading third parties that allow you to build really phenomenal cluster solutions with Mac OS X and PowerPC G5 at the foundation of this stack what I'd like to do now is introduce some customers who are going to talk about how they've deployed Xserve and Mac OS X Server to solve some of their high performance computing needs the first customer I'd like to introduce is Ben Singer from Princeton University who's going to talk about his deployment of Xserve in their center Ben thanks Doug delighted to be here I'm here to talk a little bit about the Princeton Xserve cluster at the Center for the Study of Brain Mind and Behavior that we're still setting up we got it about a month ago and we're having fun setting it up what is the CSBMB well we're a consortium of Princeton faculty interested in the neural basis of cognition and really what that is is one of the great unanswered questions in science which is how does all this activity in the brain lead to consciousness and awareness and action and motivation and all the associated behaviors that we do every day and just take for granted the consortium's made up of faculty from applied mathematics computer science chemistry physics biology psychology and philosophy and actually psychology is our home building and so we have a lot of people in psychology that are working with us and to point out one of the others in applied mathematics our biggest collaboration is actually with Ingrid Daubechies who is the mother of wavelets and is applying some algorithms to brain imaging analysis so really what we are is a place that provides resources for all these faculty and we
have staff and we have resources in the computing and data acquisition area and on the staff side there's software engineers MRI physicists the system administrator and administrators running the center the big data acquisition instrument that I was alluding to is the MRI brain scanner from Siemens that we picked up a few years ago and at the time it was installed it was the first research-only installation so most of the time when you use an MRI it's in a hospital setting so it has first priority for clinical applications and you end up doing work at three in the morning or something and one nice thing about our facility is that it's there just a few doors away in the psychology building from the CSBMB staff center and that provides all the data that I'm going to be talking about and why we ended up getting an Xserve cluster we already had a file server when I went shopping for a cluster which was actually the first thing they had me do when I came about six months ago and that was in place already it's a BlueArc nine terabyte file server to store all this data that comes from the MRI brain scanner and we need to back it up and we need to process it so that's how we ended up with 64 Xserve G5 nodes and I'm going to explain a little bit about how we chose the Xserve but before I do that I want to just say what it is from a computing perspective about what we do that motivated us to pursue an Xserve in the first place we have a whole lot of brain data coming out of the MRI a single study will produce hundreds of gigabytes of data you take a single scan from somebody and if you're doing functional MRI even though a single slice of the brain is at lower resolution a 64 squared image you're taking 25 slices and then you're taking maybe 30 of these a second and in one experiment recently we had subjects watch Raiders of the Lost Ark for two hours and recorded their brain for two hours that produces a lot of data and so we did that with multiple subjects
too because we want to see are brains doing the same thing when they're watching this movie sort of a fun example and to crunch through that is going to take some computing power the other thing is people are moving their head in the scanner they have a little head rest and we tell them not to move but they still do and that's natural and so we need to align every image with the first one or some reference and that takes a lot of time in the workflow so does filtering in space and time there's a lot of noise in this data when you first get it it's not like suddenly something pops out at you and you know exactly what's happening except in very simple cases there's a lot of noise in the data that needs to be filtered out there's other machines in the room that will put a signature in the data maybe some low frequency noise maybe high frequency noise so you have to do filtering and then finally you need to do a statistical analysis and you're comparing brains where they were just sitting there doing nothing with when they're doing the task that you have them do and so comparing those two things is a simple statistical test but you need to do it for every voxel in the brain so that's thousands hundreds of thousands of statistical tests and that can take traditionally days of CPU time to do a single study and one problem with that is that when people have got all this data and it takes all this time to analyze it they don't tend to play with it much they don't tend to try new things or look at it from a new angle because there's a big cost to doing that they're going to tie up the lab resources for a day they can't just put this data on their portable and run away with it they have to stay and use up the center resources and sometimes people won't do it and so it stifles creativity for one thing so why did we choose Xserve well when I first started looking and we all were a group but I was you know sort of
the one that was doing it at the time I got really my head deep into benchmarks and although the Xserve does really well with benchmarks I think the reason we chose it wasn't just because of the benchmarks but anyway let me point out the benchmark that I have on the slide the AFNI speed score is from the AFNI package it's from the National Institutes of Health it's a free software package for analyzing MRI brain scans and off the website where they publish their single processor 32-bit benchmarks come the bottom three bars here and then I ran it last week on our Xserve and it came out a little better this benchmark tests the whole system so maybe it was I/O or something that caused the Xserve to do better than the desktop even though it has the same speed chip again like I was saying it wasn't only benchmarks there were some benchmarks actually which early on if we compiled say on 64-bit Linux Opteron systems would come out with different results but Xserve was always close and what we were doing when we were buying this cluster was looking at the whole package and thinking about the future and being able to actually house this thing and to be able to maintain it and we also knew that in the future Apple's operating system would be fully 64-bit to do a fair comparison with that compile on the 64-bit Linux so we knew it would get better the power and the cooling that's been alluded to earlier were a great story for us because we are in a small area and we said we want to get 64 nodes and the facilities people laughed at us so we said well how about if we put it in that room there and they said well good luck so we thought about it we did get them to put in some additional air conditioning and then we looked at the stats in the specs and for the G5 Xserve which had just come out at the time the specs showed that it used about half the power and the cooling I think roughly at that time and we were really happy with that so we could
actually could actually buy it and that was actually a great great feeling and we knew that we'd be able to cool it and it's also very quiet I think in the last session Bud Tribble mentioned that we went into the room with them all on for the first time and there was this strange high-pitched noise and we thought oh great you know it's going to be kind of noisy it turned out it was the two Dell PowerEdges in the far rack so we were happy the other great thing we have a whole lot of people that were coming off an SGI system and a bunch of Linux systems so there were a lot of people that said well am I going to be able to use these open source packages and are we going to have to recompile them well I've been at the beginning of the process of porting and it's been really pretty easy to do the G5 is becoming a more and more popular target in GCC makefiles for the packages that we depend on including AFNI and in fact they even have a binary distribution for the G5s already so that was great and the administration of this thing has been really straightforward so far we're really still bringing it up but the Server Admin and the Server Monitor tools that Doug showed in one of his slides have been really helpful I can just bring up my G4 desktop and look on the screen and see what's going on with the cluster and I don't have to be a full-time sysadmin this is our system what's a little different about us is what I emphasized in this slide we have all this data and we have this file server already so we sort of had to work with it and so we have the second network each G5 node has two built-in Ethernet ports on the back and so we decided to use both of them which when we were setting it up we realized meant we had to do 256 crimps of network cable so we were wondering why we did that but then we realized once we set it all up and we had a lot of help from Apple doing that and someone came and did most of the crimps and redid the
ones that I did and we got all this stuff up and running and what we have here is a Foundry switch that we got along with this thing this Xserve cluster I shouldn't call it a thing but it's got a few gigabit ports on it which go to our existing BlueArc file server which is an appliance it sort of has most of its software in firmware and we connect out to the world with that into the head node of our cluster our cluster in red there uses what Apple ships with the cluster the Asante gigabit switches and then down below are the Foundry connections and that's where all the NFS traffic is it stays on its own network so it doesn't interfere with what's going on on the other network and most of our applications are single processor embarrassingly parallel so we don't have a need for MPI yet or any of the high-speed interconnects so we were happy with this we just need to move a lot of data in churn and churn and work on it for a couple hours and then write it back out so we didn't have a need to do MPI yet but there will be a lot of opportunity for that just to conclude we're seeing five to ten times faster results than we had before with the system that we moved off-site an SGI Origin and we haven't even begun to try to optimize so that's something that we want to do although a lot of this software isn't really written by us originally but it's fertile ground because like I said for every voxel in the brain the analysis is independent so you know you could have a node for every voxel if you wanted so there's a great opportunity to speed things up we don't need to hire anyone at least I hope that's not why I was hired just to do this but I'm having so much fun doing it that actually I might spend more time playing with it than what I'm really supposed to do and our students are happy they have a feeling that something's coming they see it there they see the lights blinking and they know that they're
going to get a chance to play with it soon and they're going to have less of an excuse to just say that their work is not done yet because it'll be done quicker but the good thing about that is they'll be able to try new things do more stuff and actually probably discover some things that they wouldn't have discovered otherwise and that's the bottom line and like I said one reason we chose the Xserve is just that I'm an Apple user and so is the director of the center and we know that Apple is always innovating the hardware is solid the operating system is too and with Xgrid for instance you see that Apple's eye is on this problem and that more great things are going to come so we're looking forward to that thanks thank you Ben okay the second customer I'd like to introduce is Dr. John Medeiros now if you read any news last week you may have heard about the small little cluster going in down in Huntsville Alabama a 1566 node cluster being put together by COLSA and I'd like to introduce Dr. John Medeiros the senior scientist from COLSA who is going to tell you a little bit more about it thank you as Doug mentioned I want to talk to you about our cluster tell you a bit about it and about why we got it but to put things in context a little bit I'd like to tell you a bit about who we are what kind of computing we do and why we need so much of it and what process we went through to pick the cluster system that we did and why we picked the Xserve cluster that we did in any case who is COLSA well we're a small engineering services contractor about 800 people based in Huntsville Alabama as Doug mentioned and we have a few offices throughout the US I have to mention our company president nelson and Dr.
Tony DiRienzo our executive vice president who actually championed this project for us providing a lot of corporate support for the vision that we had in terms of bringing that system to bear the particular project I'm involved in is the Hypersonic Missile Technology program and we have a dedicated corporate facility that we recently renovated for the system called the Research and Operations Center so that makes us the HMT ROC which sounds a bit like a radio station but really the only music there is on the iPods anyway the program manager there is Dwight Whitlock and I'm the technical lead on the project the primary customer is the US Army's Research Development and Engineering Command out of Redstone Arsenal RDECOM and the principal scientists there on that project are Drs. Bill Walker and Kevin Kennedy so what kind of computing requirements are there for the project well we're supporting their hypersonic aerodynamic analysis of flight and scramjet engines and the focus is on the computational fluid dynamic analysis of the hypersonic endo-atmospheric regime that is very fast in near-earth atmosphere the cartoons on the right show some of the visualized data that comes out of it where you display parameters of the space around an object flying very fast and it's in fact a very complex and difficult problem that we are simply attacking by brute force using a code that's a proprietary double precision Fortran code that solves the Navier-Stokes partial differential equations and it explores the full combustion chemistry that goes on in that regime and we explore problem sizes with the space around the object divided into 20 million or more individual points at which the computations are done that's a lot of points but the good news is that blocks of those points can be assigned to a given processor and computations are carried out in that processor and then the results are compared and iterations continue and as a result of the way that this
whole process works the problem is very CPU intensive and very little time relatively is spent in interprocessor communications it's the category which you might call almost embarrassingly parallel which is good from our point of view and in fact drove the design of the kind of cluster that we went after now we've been doing some computing in this project for a while and we've done it well the systems we have include a traditional sort of computer system we have an IBM SP POWER3 system of 284 CPUs that when we got it as pictured there back in June of 2000 came in at about number 47 on the Top 500 list and four years later today it's completely off the list which gives you an idea of how things are progressing so our goal was I mean for this project it's like you can't be too rich or too thin or have too much computational power we need a lot more than this and mainframe systems as expensive as they were while they work very well were too expensive to get to the kind of computational levels that we wanted so we began exploring and in the interim since we got that mainframe system we acquired and put together and played with a number of clusters and explored a whole range of architectures from major vendors including AMD Intel and Apple our first system back in June 2000 was a 34 processor AMD Athlon system and in about the same time frame a little later we acquired a G4 system of about the same size which at that time I believe was probably one of the biggest Apple clusters around and it performed fairly well but it was only 800 megahertz and we wanted to scale up quite substantially and so we wanted to look at other architectures including rack mounted obviously a few of the systems that we have now that we work with are shown here the upper two for historical reasons were from the early days of looking at clustering so we got tower systems for PCs and Apples and that little Apple system we affectionately called Apple Orchard we've looked at 64-bit systems very
extensively including the Opteron system and a lot of our computations now are done on Intel architecture 32-bit systems you see there our largest cluster now is a 522 processor system but we in fact need larger and more so we explored additional possibilities now the whole thing is set up in this unprepossessing building in Huntsville that we acquired back late last year and the building was gutted internally virtually entirely and this shows the computer room being put together back last fall and we renovated literally top to bottom the ceiling and the computer room floor about 3,000 square feet of computer room floor and this shows our configuration with that 522 processor Intel architecture system on the left as you see it our SP mainframe system on the right and in the center is what I'm going to be talking about here the cluster system that we are acquiring from Apple okay how do you pick such a system you have to benchmark it and from our point of view the benchmark that counted was our application so we ran our code using a sort of simplified geometry but the full complexity of the problem in terms of a reasonable problem size and full combustion chemistry with the whole range of chemical species what we did find among other things in testing across the whole range of processors is that the interprocessor communication as I mentioned is a small fraction of the total compute time that is a given iteration we found might take typically a few seconds and the amount of time that was spent communicating between processors between those iterations was in the range of milliseconds so there was very little penalty from the interconnect which is why in fact we've chosen gigabit Ethernet for the system let me go back a minute just to make the last point about it part of the reason for doing that of course is that with these other interconnects you've already heard about previously in the session you can get better performance for a broader range of applications
but the cost difference is not trivial compared to a gigabit Ethernet switch okay this shows some of the data that we used and the kinds of things that drove our decision what you're seeing there and this is a log-log plot is the time to do a given step of the computation as a function of the number of processors you throw at the problem for all the different processors five are shown there on this chart for all the different processors you see they actually scaled very well for our kind of problem that is as you double the number of processors the processing time cuts in half the grouping there if you can make it out this is a log-log plot does break up naturally into two groups the upper two for the Athlon and Xeon systems are 32-bit systems and the bottom three Opteron Itanium 2 and G5 are the 64-bit systems lower is better on this chart the less time taken for a given iteration the more runs you can do in a given amount of time the more processing you can do and on that basis you can see that the G5 in fact performed the best now a strict comparison is maybe a little bit unfair because these were different processors with different speeds so the next chart is the same data this time where all results were normalized as if each of the processors had the same two gigahertz clock speed they didn't but you can normalize the data that way just for demonstration purposes what you see here is the results are essentially the same and have not changed very much and the only difference here now is that the Itanium 2 looks a little bit better than the G5 it's a little bit faster on that chart but you've got to keep in mind that Itanium 2 is not available at any cost at two gigahertz and at its fastest implementation about one and a half gigahertz a system built with Itanium 2s comparable to the G5 would cost about five times as much so G5 it is okay well processors are one thing but there's a lot of issues in putting together a big cluster and we've put together as I
just showed you, clusters that are a pretty reasonable size, but even for us this was a big cluster, and there are a lot of issues that come up in terms of scale. There's a whole laundry list of things I show here, and I'm not going to go through them in detail except to highlight a couple of things that you've heard before today and that I want to emphasize yet again: power and cooling, on the bottom. Especially at this kind of scale, that is very much non-trivial. For example, for the current system that Apple's delivering to us right now, as we get it in, we've had to upgrade the power into the building; I'll tell you about that in a minute. But just to give you an idea, something we haven't shared with our corporate executives yet: just to run the system, our utility bill for the year is going to run about two hundred fifty thousand dollars, just to keep it going. Cooling is also, of course, a very important issue, and just like power, you can calculate how much cooling you need and you can get that cooling into your facility. But in addition we have the added complication that you've got to get the cooling to the right place, and you've got to look at how you distribute that and remove the heat and bring in the cooler air. So that's something that we're playing with; we expect to actually have to fuss with the air a bit over the next little while. So how did we do this? The processor, we decided, was going to be G5s, but you can't say that when you buy on a government contract; you've got to be generic, and we were. We put a general request for cluster quotes out to the community at large, and one of the quotes we got back in fact was the G5 system, just coincidentally. But the requirements that we had included a theoretical performance for the system of at least 25 teraflops, we wanted the processor count in excess of 3,000, we wanted it all to fit into a thousand-square-foot footprint, we wanted minimal power and cooling requirements, and we wanted it all delivered by 12
July this year. And we didn't want to pay a lot for this cluster. We didn't share with the vendors what the price point we had in mind was; we had to go with the lowest bid, but we wanted the whole thing, including the switch and all the ancillary equipment we need with it, to come in under six million dollars, and we're going to meet that target. So the system award, exclusive of the network component, was done on the 17th of May this year, so that's really a three-week turnaround, which in this business is a very short time frame for getting that done, but we wanted it fairly soon as well. So, okay, more about the system itself, some of the details. We're calling it MACH 5; it stands for Multiple Advanced Computers for Hypersonics using G5s. We've got 1,562 dual Xserve G5 compute nodes and four head nodes, and these nodes are being delivered as we speak. There's a lot of complaining back home that I get to play and come here and attend WWDC while they're working on putting the system together back there, and I'm going to fly back there tomorrow, so I couldn't get much more time off than that. We've taken delivery of about 350 nodes as of yesterday, and it's coming in at a tractor-trailer load's worth a day, which consists of 25 pallets of a dozen Xserves, and everybody's got to pull them out, rack them, and get them hooked up. At that kind of scale it gets kind of interesting. So the physical configuration is going to be set up in 40 racks with 39 Xserve nodes in each. Each rack is a 42U rack, with a 48-port gigabit Ethernet switch in that rack, which we are getting from Foundry Networks; it is actually a very high-performance gigabit switch and will work great for our purposes, we believe. And one rack includes the four head nodes, a couple more cluster nodes, and a large 320-port GigE main switch, to which each of the individual 48-port switches is trunked, and which acts as the nexus for the cluster network. The whole thing occupies less than 600 square feet, so
it beat the thousand-square-foot limitation that we imposed, and we're expecting to draw about four hundred kilowatts of peak power for the system. We didn't have enough power being brought into the building at that point, and we had Huntsville Utilities bring us in a new transformer rated at over two megawatts; we're planning to build an actually bigger system, but that's another story. For cooling, we required about 110 tons of cooling. For those of you who may not be familiar with that, the ton unit that's used to rate these big chillers is an archaic unit in the heating industry that relates to being able to remove the latent heat of fusion from one ton of water in one day; that is, make a ton of ice in a day. So we've got 150 tons of that installed, and if we ever get out of the computer business I guess we could make a lot of ice, but not at that price. Okay, details on the nodes. The head nodes themselves are of course dual two-gigahertz G5 units with mirrored 80-gigabyte hard drives and eight gigabytes of RAM installed, with a CD-rewritable drive and a video card. The compute nodes themselves, 1,562 of them, are also dual two-gigahertz Xserve units, with a single 80-gigabyte hard drive and three and a half gigabytes of RAM per node, just under two gig per processor; no CD-ROM or video card is required on the Xserve compute nodes. So in total there are 3,132 CPUs, and at eight gigaflops per CPU the theoretical performance of the system comes in at a tad over 25 teraflops, as required. Okay, as I mentioned, the system is being delivered, and these are pictures we took last week before we came out here; those were the first 40 units, packed in the high bay, being delivered. Some work is still going on in the computer room in terms of getting the rest of the infrastructure set up, and you see some of the guys working on putting some of the hardware in the racks. Now, they've got 40 racks, and to mount all these Xserves in the racks you have to put in these little
clips into which you can screw to get the unit in there. You've got to put them in the front and the back, and with this many we calculated the guys had to put in over 14,000 such clips. They did it in an afternoon; we had a bunch of folks working on it. Okay, so to kind of wind down the story a little bit and give you some perspective on the progression of computer technology, I'm comparing here that mainframe system we acquired back in two thousand to MACH 5 coming in now. Cost-wise, we're paying a little bit more, 6 million compared to 4 million back then, so fifty percent more. Floor space is about twice as much. However, for that we get more than 10 times more processors and we get more than 50 times more performance. So in summary, we chose Apple's Xserve G5 architecture for a major production cluster for computational fluid dynamics analysis in hypersonic flight. The proposal we got from Apple on the Xserve G5 did in fact deliver the best bang for the buck, in essence the best price/performance. Now, as I've kind of mentioned, MACH 5 has been designed for a compute-intensive problem with relatively little demand on the network. That means that in terms of the standard measures that put systems on the Top 500, it will not do as well, relatively, as a system purpose-designed with the highest-speed network. That being said, we fully expect to achieve something over 12 teraflops of real performance, and we believe we might be able to get up to 15 teraflops of real performance. If we can do that, we'll still be easily in the top five when the November list comes out; hopefully we can get there. So, as I mentioned, to finish up: the system's being installed and we hope to get into production, actually get it working, by the fall. From the solicitation of the system to actual production work, we're looking at a six-month time frame, which is pretty phenomenal for a system at this kind of scale. We'll hope it works out. Thank you. Okay, well, we're running a little late, so just to summarize and finish up the
session, what I really wanted to say is, you know, Apple is investing in the high-performance computing market. We're doing it through our products, our technologies, and the solutions that we're providing, working very closely to make sure the right solutions are available, both from third parties and the open source community. And the adoption has really been phenomenal, and momentum continues. So in summary, you know, Apple's products range from the Workgroup Cluster, turnkey and easy to use for bioinformatics, all the way up to, you know, the top supercomputers. So with that, thank you very much. Unfortunately we're running out of time for formal Q&A, but if there are any questions, we'll be available up front for any questions you might have.
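
The headline figures quoted for the MACH 5 cluster in this session can be cross-checked with simple arithmetic. What follows is a minimal sketch, not something shown in the session: the constant and function names are illustrative, and the conversion of one ton of refrigeration to roughly 3.517 kW is a standard value that the speaker did not state. It also assumes that all electrical power drawn ends up as heat that the chillers must remove.

```python
# Figures as quoted in the talk (names are illustrative, not from the session).
COMPUTE_NODES = 1562          # dual-processor Xserve G5 compute nodes
HEAD_NODES = 4                # dual-processor head nodes
CPUS_PER_NODE = 2
PEAK_GFLOPS_PER_CPU = 8.0     # quoted peak for a 2 GHz G5
PEAK_POWER_KW = 400.0         # expected peak electrical draw

# Assumption (standard conversion, not stated in the talk):
# 1 ton of refrigeration = 12,000 BTU/h, roughly 3.517 kW of heat removal.
KW_PER_TON = 3.517


def total_cpus() -> int:
    """Count CPUs across compute nodes and head nodes."""
    return (COMPUTE_NODES + HEAD_NODES) * CPUS_PER_NODE


def peak_teraflops() -> float:
    """Theoretical peak: CPU count times per-CPU peak, in teraflops."""
    return total_cpus() * PEAK_GFLOPS_PER_CPU / 1000.0


def cooling_tons(power_kw: float = PEAK_POWER_KW) -> float:
    """Tons of cooling needed if all electrical power becomes heat."""
    return power_kw / KW_PER_TON


if __name__ == "__main__":
    print(f"CPUs: {total_cpus()}")
    print(f"Theoretical peak: {peak_teraflops():.3f} TFLOPS")
    print(f"Cooling for {PEAK_POWER_KW:.0f} kW: {cooling_tons():.1f} tons")
```

Under these assumptions the CPU count comes to 3,132 and the theoretical peak to just over 25 teraflops, matching the figures given in the talk, and the cooling estimate for a 400-kilowatt peak draw lands in the same range as the roughly 110 tons the speaker quotes.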