---
title: WWDC2004 Session 642
framework: wwdc
role: article
path: wwdc/wwdc2004-642
---

# WWDC2004 Session 642

## Transcript

I'm Warner Yuen, and welcome to this afternoon's session on HPC software optimization. Once again, a lot of people are always wondering about Apple and high-performance computing, and what they think about is this. But the reality is there are a lot of great tools for building high-performance computers with Apple hardware, and today in this session we'll be talking about some of these things: specifically, what's new at Apple for high-performance computing, some of the things for preparing your Apple hardware for high-performance computing, and a great introduction on how to write parallel code.

We don't have time to show you all of the cool new Apple tools, but I wanted to give a plug for the new CHUD performance optimization tools. Specifically, tomorrow there's a great session called Got Shark that talks about performance profiling. Then there are high-performance compilers; this is what has really allowed us to play in the high-performance computing world. Specifically new this week is the 64-bit compiling ability with GCC in your Tiger previews.

In today's session, what we'll get into is the performance benefits of using Mac OS X's Accelerate framework for high-performance computing. We'll have a presenter talking about streamlining the OS services for performance, and another person will be coming up to talk about what options are available for building high-performance computers with Apple hardware. Then we'll have our introductions on creating parallel code with both MPI and a next-generation parallel computing developer framework.

So who's talking today? We have some really great speakers. We'll have Steve Peters from Apple's vector numerics group coming up to talk about Mac OS X. We'll have Josh Durham from Virginia Tech, who will come up and talk about the use of Mac OS X for high-performance computing. Dean Dauger from Dauger Research will come up to talk about writing parallel MPI code, and finally we
have Steve Ford from GridIron Software to talk about a next-generation parallel development framework. So with that, I'm just going to go to our first speaker, Steve Peters.

Thank you, Warner. I'm going to tell you about the mathematical facilities in Mac OS X, the base of much of the scientific high-performance computing that's available for Mac OS X. The agenda today is to survey the APIs that we ship with Mac OS X, tell you about new Tiger APIs, show a little bit of comparative performance to let you know that we're, I think, leading the field in this price space anyway, and then reinforce the mantra: first check out our frameworks when you're interested in getting math performance out of the machines.

We start out with chips that are IEEE 754 compliant, like much of the industry. This is what gives us our substrate in math. We then layer on top of that libm, a C99-compliant library full of the elementary transcendental functions that everyone knows and loves: sine, cosine, square roots, and so forth. We present these in float (single), double, and long double precisions, in the complex and in the real domain.

Linear algebra is a place where many vendors add value, and so do we. We take the basic linear algebra subroutines as shipped in the ATLAS open-source package, do some additional tuning, and ship that as part of our Accelerate framework. So if you're looking for any of the BLAS, look first to the Accelerate framework, where they've been closely matched to the hardware on each of our platforms. Those come in all the familiar flavors, and layered on top of that is the gold standard of dense numerical linear algebra solvers, LAPACK, again in all the familiar flavors.

When we begin to talk about performance, the G5 is our flagship. It's really gotten us into a really interesting place in the HPC space, and in my opinion, for scientific work, it's the dual floating-point cores that have taken us there. So for every 970 CPU you get two
floating-point cores, each capable of doing double- and single-precision IEEE arithmetic. The CPU can dispatch to both of those floating-point units on every cycle; it can start a new operation on every cycle. All the basic arithmetic operations are present, as well as hardware square root. That's new to PowerPC, as we've seen it anyway, and good for us.

The G5 also offers a class of instructions called fused multiply-add. These are three-operand instructions: basically, multiply the first two together and add the third, and do that as one machine instruction, saving a rounding and finishing the job in a smaller number of cycles than a back-to-back multiply followed by an add. So fused multiply-add fuses those together. Why do we make a fuss about this? Some people I've heard have said, oh, what's the big deal about fused multiply-add? Well, it's fundamental to linear algebra. It's the essential piece of the dot product, and that is fundamental to matrix multiplication. A big part of the fast Fourier transform, the butterflies, is essentially multiplies and adds, which we can fuse. And if you're doing function evaluations, say by Horner's rule, you'll arrange polynomials in a way that can take advantage of fused multiply-adds.

A fused multiply-add counts for two flops, a multiply and an add, so we're credited with four floating-point ops per cycle on a modern G5. On a dual G5 that gives us eight flops per cycle across both processors, and we get 2.5 billion cycles per second, so we peak out at 20 theoretical double-precision gigaflops on the new dual G5 2.5 GHz Power Mac.

So how do you take advantage of this coming to the platform? Well, if you've already got compiled binaries for Mac OS X, bring them to the G5, our flagship. They'll immediately see some advantage from the ability of the CPU to schedule to both of those floating-point cores. Recompile and you get even better performance, because now the compiler
knows there are two floating-point units out there and can rearrange the order of operations in your code to take advantage of both and make efficient use of the dual floating-point cores. And if you have the opportunity to think about your algorithms, you may be able to cast them in ways that can squeeze out a bit more performance. This is the kind of attention to detail we've paid in our libraries: libm, vForce, which I'll talk about in a moment, the BLAS, LAPACK, and our digital signal processing library, vDSP.

In the area of single-precision floating point, we have a very formidable capability on both the G4 and the G5: the AltiVec SIMD processor. It's a four-way parallel single-precision engine, again with all the basic arithmetic operations and a vector fused multiply-add. We top out here at 40 single-precision gigaflops on the new Power Macs, and there are some codes that can get fairly close to using most of those. Convolutions are very, very effective on that box.

How do you get to high performance on the AltiVec unit? You've got to work a little bit harder. You're really going to have to think your algorithms through and cast them in terms of parallel operations. We have some advice on the web about how to do that. But first of all, it's probably wise to profile, and here's another plug for the CHUD folks. They have wonderful profiling tools that will focus you on that ten percent of the code where you're spending ninety percent of your time. Look there first. Autovectorization is an option, and an even better option since it was announced this week that GCC 3.5, available later in the year, will have autovectorization features. That's a good way to get into the AltiVec game. And finally, the level of detail that gets really good AltiVec performance in single precision we've already paid in vForce, the BLAS, vDSP, and vImage.

How do you use these things? We try to make it straightforward and try to hide, at least a bit, the nature of the platform from your code. You call the
API, and we'll dispatch to the proper code suited to the underlying chip. libm, the math library, is linked in by default; there's nothing special you need to do. If you want the long double facilities and the complex APIs, we have libmx, libm extended; that's a flag on the link line for GCC. And for our value-added library, the Accelerate framework, you simply specify `-framework Accelerate`, and that gets you on the air. And of course we ship these performance libraries on every copy of Mac OS X that goes out the door, so you can always expect to find them there.

Well, what did we do that's new in Tiger? We've added a library called vForce. It had been called to our attention that the C99 APIs for the familiar elementary functions were data starved on our machines: we were seeing bubbles in the floating-point pipes that were going unused, so unused cycles, and we hate to see those go by. Also, C99 and IEEE demand very careful attention to the rounding modes and the way exceptions are handled, and that adds quite a bit of overhead for APIs that are processing just one operand at a time. So the idea in vForce was to pass many operands through a single call. For example, if you need 768 values of the sine of x, there's a call named vvsinf that lets you pass them all in at once. We amortize the overhead and get back in a big, big hurry, and in fact that code runs about 12 times faster than a naive loop calling the traditional sine function. There are some caveats here: you have to expect the IEEE default rounding mode, and you won't see any exceptions. We're sort of expecting that you'll give arguments that are within the domain of the function, and so forth.

Using these ideas opened up a number of performance opportunities. On single precision, hit AltiVec; that gives us four-way parallelism to begin with. On double precision, we've got two FPUs, so let's make sure we schedule those effectively, do some software
pipelining, and fill up those bubbles with independent parallel streams of computation. And then we take great care in the choice of algorithms to avoid branching, which is very tricky on pipelined machines. vForce is generally as accurate as the traditional libm elementary functions, but we're not always bitwise identical. We handle nearly all the edge cases according to the C99 specs; plus and minus zero are the occasional exceptions, and there's documentation that tells you where. We make no alignment requirements, but if you really want top performance, align to 16-byte boundaries; that lets our SIMD engines collect the data most efficiently. We're tuned for G5, but we also run very well on G4 and G3 too. Here's what's in the inventory: some simple division-like functions, roots, exponentials, logs and powers, trigonometrics, arc trigonometrics, hyperbolics, and some integer manipulation. How do you code to these things? Well, it couldn't be simpler. In blue on the top is C, below is Fortran, and in orange are the obvious command-line compilations.

What else is new in Tiger? We updated to ATLAS 3.6 and did some additional Mac OS X-specific tune-ups. We get some LAPACK performance gains, since it relies on the BLAS, and from some compiler advances. Here's a little performance chart showing in blue our matrix multiply performance, and in the cluster of orange, green, and burnt orange the matrix decompositions: LU, the symmetric decomposition L L-transpose, and the symmetric U-transpose U. You'll notice that matrix multiply tops out around 13 gigaflops on a new dual 2.5 GHz G5 Power Mac, and the decompositions at, let's say, 11. Here's what the Xeon gets to: a 3.0 GHz Xeon running MKL 6, with matrix multiply topping out well under 10 and decompositions just about getting to 8. And finally Opteron. These are quite old numbers we picked up from last summer. They were topping out at about five and a half last summer. Let's give them fifty percent, since they've upclocked by that much; that would be perfect scaling,
and that takes them, well, not quite to 8, I think, on matrix multiply. So the G5 is cruising along near 13, the Opteron at eight, maybe nine, and the Xeon probably closer to 12 in the 3.6 GHz incarnation.

All right. Finally, we bring long double back to the Mac platform; it was present in Mac OS 9 and earlier. We also bring complex long double and the C99 tgmath, type-generic math, so you can say sine of a complex number and the compiler figures out what you mean in C. Isn't that nice? I'm going to pass by this, since I am running very close to my time, and let me just jump to the end here, where I show on the left our elementary functions, the number of cycles to get in and out of our library functions; in the middle column, what the competition publishes for their x87 microcoded hardware implementations of the transcendentals; and finally, when those get wrapped in the GNU/Linux system with library calls, where they have to take the additional care to get the rounding modes and the exceptions right, it costs them a bit more. I think if you compare most of those, the G5 is clearly ahead.

There are several documents out on the web on our developer site that can get you started with this kind of stuff. We've already had the Accelerate framework talk, we're in this talk, and if you haven't seen the CHUD stuff, by all means please go see it. And going on to Josh.

Thanks, Steve. So today I'm going to do a brief overview of Virginia Tech's System X, which I'm pretty sure most of you have heard of; it's the 1,100-node cluster at Virginia Tech. Then we'll go into some detail about some of the services in OS X that you can turn off, briefly go over what kinds of things we did at Virginia Tech to improve our benchmark scores, and very briefly go into some of the management tools that we use at Virginia Tech.

So Virginia Tech's System X is 1,100 dual-processor Xserves, so of course 2,200 PowerPC 970s. Each Xserve cluster node has four gigs of memory, which gives us an overall 4.4
terabytes of system memory for the whole cluster. One of the things people said early on when we were deploying this was, well, they've got to be running Linux, why would they run OS X, or maybe they're running Darwin. But we really are running OS X, the same OS X that ships with it. We did a little bit of modification, which I'm going to go into, but it is OS X.

Briefly, the interconnect that we use at Virginia Tech is InfiniBand, and we went with 24 Mellanox 96-port InfiniBand switches. These switches give 20 gigabits per second full duplex per port, which is 1.92 terabits per second of overall bandwidth for the switch. We get about 12 microseconds of latency across the entire network and about eight and a half microseconds of latency across a single switch. We use a fat-tree topology, which I did a really rudimentary diagram of at the bottom there, and in our case it's half bisection. Half bisection means that at any time, if you have half the cluster trying to talk to the other half, you're guaranteed half the bandwidth, which in our case is five gigabits per second. In addition to that, we also have a secondary gigabit network, which comprises six Cisco switches with 240 ports each, and we use that for management and some file sharing, basically to get the system up and running and deal with some of the administrative stuff on it.

Power and cooling can never be emphasized enough when you're looking at a cluster, especially one this size. At Virginia Tech we're lucky to have this really wonderful computer facility that has three megawatts of electrical power, half of which is dedicated to System X, the cluster. We have redundancy with our power: we have a UPS system, and we actually have a diesel generator, which is pretty much the size of a
diesel locomotive; it just sits back on a pad, and it's gigantic. As I said, about half of that, 1.5 megawatts, is reserved for System X.

For cooling, we have two million BTUs of cooling capacity using Liebert's extreme-density cooling. I have some pictures later that I can point out, but basically it's a rack-mounted system that blows cold air down from above, and we also have floor cooling as well. It uses standard refrigerant and overhead chillers. We were looking at different kinds of cooling. In a regular data center, what you usually have is air conditioners throughout the room that bring in air from the top, chill it, and push it out under the floor, and then you put tiles in the right places to get that air. We looked at trying to do that, and if we had, we would have had wind velocities of about 60 miles per hour underneath the floor. So just pulling off one of those tiles, you'd get shot with 60-mile-per-hour wind.

So that's basically what we have to say about System X. One of the things I'm going to go over is some of the services in OS X, and how we turn off some of them to optimize it slightly for more HPC-type applications. OS X Server by default comes with about 40 processes; that's just the default install. So why do we want to reduce the services? Well, one, of course, it's going to free up resources like memory and CPU time. It increases security: obviously, if you're not running things like a web server, you don't have to worry about securing it. And it reduces the amount of time for system startup, which kind of lowers your mean time between failures.

The other thing I want to emphasize: I always use this analogy when you're turning off services. It's kind of like these guys that buy a Honda Civic and rip out the engine, put a turbocharger on it, put big spoilers on it, and soup it up. So, you know,
they kind of design it for their own purpose, and the problem with that is, you know, Honda's not going to do any sort of hardware support for you. So the thing to keep in mind as you start to turn off these services is that this isn't something Apple's going to recommend you do; this is just what we did at Virginia Tech.

So I'm going to step through the different places in OS X where you turn off services, and the first one is `/etc/hostconfig`. In orange there are some of the services that run from hostconfig. We have things like cupsd, which is the printing service. We have automount, which basically handles mounting removable file systems and network file systems. And we have crashreporterd, which sort of sounds important, but it's just for creating crash logs for the GUI applications. We have servermgrd: if you've ever used Server Admin, Server Monitor, or any of those tools with the Xserve, this is what they use. So one thing I'd like to point out is that if you want to use those tools, and they're great tools, you want to keep this service on. Basically this file has a list of services, and just changing the service from equals yes to equals no will disable the service the next time you reboot.

The next thing we do, and this is kind of blasphemy in the OS X world, is turn off the GUI. We have 1,100 GUIs running, and there's no real need for that; no one's ever going to see these GUIs. The place to do that is in `/etc/ttys`. There's this very complicated line there, and it needs to be commented out. Commenting out that one line is going to prevent the WindowServer from running and loginwindow from running. Actually, this is just loginwindow; the WindowServer is in another place. In OS X there are all sorts of different places where these services get started, and this one is in a
directory called `/etc/mach_init.d`. The way I disable things in here is just a personal preference: I could delete the file or remove it, but instead I create another directory and move the script into that directory. That way, if I change my mind later, I don't have to go find it or make sure it's the right thing; I can just move it back. The next thing that we turn off is ATSServer, which basically provides font services. Since we disabled the GUI, we obviously don't need font services on the system. It's another thing that gets run out of `/etc/mach_init.d`, so just moving that plist into the disabled directory is going to do that.

The next stop on turning off the services is modifying watchdog. Watchdog is a process that monitors your system and makes sure that processes are running; if they aren't running, it restarts them. Another thing that watchdog does that's really nice is that it enables the system to reboot if it crashes. That's actually pretty nice in an HPC setting, because the system will come back up and hopefully rejoin the network, and it kind of reduces your time between failures, because it's almost like a self-healing kind of thing. So in `/etc/watchdog.conf` we're going to disable two services that we don't need. The print service monitor, obviously; we're not going to be printing from the cluster nodes. And master: I didn't know what it was when I first saw it, but it sounded really important. It's actually just the main process for the mail server, so we turn that off.

One more thing that's in there is hwmond, which is the hardware monitor. This thing is polling every, I think, five seconds by default, and it's basically keeping track of your hardware: all your fans and your temperatures throughout the system. It records that, and it can also send
notifications and stuff like that. Every five seconds is a little too much, so we bump that up by adding an interval of 60, and that makes it run only once a minute, which reduces the CPU overhead of this one service.

The next thing I turn off is mDNSResponder, and I have a feeling that this is not going to be something we'll be able to turn off as more and more things start to rely on Rendezvous. We have things like Xgrid, and if you plan on using Xgrid, you don't want to turn this off, because Xgrid is going to use Rendezvous to find other compute nodes. There are also things like the distributed compile option in Xcode, which also uses Rendezvous. So if you don't plan on doing anything with Rendezvous, this is something you can turn off; it will reduce the amount of network traffic the node sends out and a little bit of CPU overhead. This one has a script in the `/System/Library/StartupItems` folder, and I just comment out the line that starts it.

So in OS X there are lots of different places to look for services, and you can see there the grayed-out services that we went through. There are some things on there that people will disagree with, that either need to stay or need to go. I leave the time server on, because I think that's important for the cluster that I run. I leave cron turned on, because we actually use cron to do things every so often on the cluster, but some people can turn that off and not have any issues.

Now I'm going to talk about the Linpack optimization and the things we did at Virginia Tech. Linpack is the benchmark that's used for the Top 500 list; we were number three in the world in November. The way that's established is that you have to run a benchmark called HPL. This was about a year ago, and this was before
a lot of the optimizations went into the Accelerate framework. We had a person from Japan, Kazushige Goto, and he did some assembly-level optimizations on major subroutines, basically on the DGEMM subroutines. I have a website there for more information if you want to look at his optimizations. One of the things we had to do, though, was write our own memory manager, because the BLAS routines that he was writing did a much better job if they were guaranteed a contiguous physical region of memory, as opposed to having it paged in or out or partitioned. With those optimizations we actually had about a ten percent increase over the Apple vecLib at the time. Remember, this was using Jaguar, so we didn't have Accelerate; we were still using vecLib. With the optimizations and some of the tweaking that we did at Virginia Tech, we actually got 10.28 teraflops, which was the third fastest in the world, and without those optimizations we'd probably have gotten around 8.4 teraflops, which would probably have put us lower on that list.

Very quickly, I just want to go over some of the system management stuff. I could talk for two hours or five hours or twelve hours on this; it's something I do a lot of work with. The tool that I love for system management is called Ganglia, and I know BioTeam uses it inside their package. What's really great about Ganglia is that it runs on each system, gathers some status, and broadcasts that out on the network. By default it has a couple of displays, and I have a few of their displays there; at the top there's the cluster load percentage. It's really great to see what's going on with your 1,100 systems: you get to take a step back and see what's really going on in the cluster. And what I love about this is that you can drill down. We have that big cluster overview, but you can drill down and look at a specific node,
and what I love about it is that it's XML data, so you can parse that XML data. At Virginia Tech we made a custom display there that shows a physical representation of what our cluster is doing, so we can see if a CPU is doing something weird, or we can look at temperatures and loads and get a physical view of it. It really helps with quickly discovering what's going on with our system.

So the things I talked about were an overview of System X, reducing the number of services, what we did for the Linpack scores, and some of the management features. If you went to Dr. Varadarajan's presentation yesterday, you probably saw some of this, but people keep asking us what's going on with System X, since we dropped off the list. That's because we swapped out our Power Macs and we're bringing in Xserves. I can say that people are very hard at work installing systems, and we have 850 in. Here are some of the racks that we have. One of the things that's really interesting is that we're using about a third of the space that we did with the Power Macs, so we only have one aisle where we can do all the cabling, and it gets kind of crowded. I don't know how many people are in that picture, but that's a small space for a whole bunch of people. You can see us doing the wiring in the background: we have to wire the Ethernet, do the power, and run InfiniBand, so there's quite a bit of cabling going on. With that, I'm going to introduce Dean Dauger from Dauger Research.

All right, thank you, Josh. It's definitely a pleasure to be here today and to be speaking to you. I very much appreciate the kind people at Apple inviting me to come out and talk about plug-and-play clustering and how you can build your cluster in minutes. What I'd like to go over first is an outline of what I'd like to talk about: first of all, why parallel
computing, why parallel computing is interesting to do; what we did to go about essentially reinventing the cluster, inventing the Mac cluster; an introduction to basic message-passing code; and then a description of how you can build your own Mac cluster. And hopefully, if the demo gods are kind to us, I can show you what we can do with a Mac cluster.

So why parallel computing? Really, parallel computing is good for problems that are too large to solve, in one sense or another, on one computer. The simple reason is that they take too much time, too much CPU time. But in many cases they also require too much memory; some problems can easily outgrow the RAM capacity that's available on a single box. I know of code that runs 15 billion particles and has to keep all that data in RAM, so multiply that by however many dozens of bytes per particle and you see that's quite a bit of memory.

The other thing that's happened in the last decade or so is that the programming API has become standardized on what's known as the Message Passing Interface, also known as MPI. It's a specification that was established in 1994, and by the end of the 1990s it became the dominant software interface available at supercomputing centers, such as the San Diego Supercomputer Center as well as NERSC, and also on many cluster systems. This development enabled the possibility of having portable parallel code, code that's portable between the supercomputing centers and the clusters, in both Fortran and C, by using MPI. That's been a real benefit to scientists and many other users of such systems.

To give you an idea of some of our experience, this is a current picture of the UCLA Physics AppleSeed cluster, established in 1998. As you can see, we use a mixture of G5s and G4s connected with a Fast Ethernet switch, and we are running a mix of various versions of OS X as well as
OS 9, so we're able to mix and match nodes, older and newer hardware. We can then combine this cluster with machines that are on people's desks, such as my colleagues', professors', postdocs', or graduate students', and combine them as needed when people are away, gone home from work, or on vacation. Or if a colleague needs time just before a conference, they can go ahead and use the machines, with permission, and combine them all together, and that's really saved a lot of people's work.

Just a quick note: this is a picture from just last week of the Dawson cluster. It's going to be 256 dual-processor Xserves, with currently 128 online, literally just physically assembled last week, and we were able to get this picture. It's connected with gigabit and running 10.3, so we'll definitely be having some results from that later in the month.

So, cluster computing with Mac OS X: essentially we went about reinventing the cluster computer, and it really is a very nice approach to cluster computing, much more reliable than many other systems I'm familiar with. It's independent of shared storage, any kind of command-line login, or static data like machine lists or static IP files. That leads to a great deal of reliability, because you don't have to make sure that every little switch is just right in order for the cluster to work. This all results in the lowest barrier to entry for people who are using clusters, and it really saves a lot of money. The purpose of this whole approach is to enable users to focus on getting useful work done, so they don't have to be bogged down with the mechanics of the cluster; they can actually get real research and real work done. That was our motivation in assembling and designing the Mac cluster.

The Mac cluster technology design is really divided into two parts: MacMPI, which serves as the MPI layer, and the Pooch application, which supplies the cluster
infrastructure. MacMPI is a source code library, normally compiled within the application, and it forms an MPI wrapper over the TCP/IP stack in the operating system. We have two flavors of it: MacMPI_X, which runs on both OS 9 and X using Carbon, and an S version that uses Unix sockets, which results in better latency. The Pooch application is the utility that dynamically manages a cluster and the parallel applications running on it, and monitors the health and status of the cluster. It queries the nodes using Rendezvous, and has done so since 2002, as well as SLP, to provide compatibility with OS 9, and queries them to determine more information about the cluster. It provides four user interfaces to the cluster: a GUI, an AppleScript interface, you can direct it with a command line, and other applications can direct Pooch using Apple events to grab nodes on the cluster. And it supports three different MPIs currently: MacMPI, as well as MPICH, common on Linux, and a commercial MPI named MPI/Pro. So let me give you an introduction to parallel code using MPI. Basically, it's code that coordinates its work using messages. The model is that there are N tasks, or virtual processors, that are running simultaneously. You label them from 0 to N minus 1, and these executables often use this identification data to determine what part of the work they're going to do and how to coordinate work between them. And so they pass messages between all these virtual processors or tasks to organize the data and organize the work. It's really analogous to a number of employees at a company who make phone calls to each other or have meetings to coordinate work among them and accomplish a much larger project. Any group of tasks can communicate, which implies that there are order N squared connections that are supported by MPI, and that can support simple sends and receives
as well as collective calls such as broadcast, where you're sending from one task to all the others, or gather, where you're collecting data, say for data output, or reduction operations, such as computing the maximum of an array that's spread across the cluster, or the sum, or other parameters like that, and also matrix operations such as transpose, and vector operations. And synchronization is not required between the tasks. No precise synchronization is necessary; it's only implied by the fact that messages need to get from one task to the other. So to give you an idea of what it looks like, I'll introduce a simple example, the simplest example I know of message passing, which we call Parallel Knock. In this diagram the time axis is down, and we have two tasks that are communicating with each other. At first, task 0 sends a message to task 1, and they both print that message: task 0 prints the message it just sent, and task 1 prints the message it just received. Then a reply is sent back from task 1 to task 0, which is then printed by both tasks: task 0 prints the reply it just received, and task 1 prints the reply it just sent. So to give an idea of what the code looks like, this is an example that's in C, but it's also available in Fortran; you can go and check out our website for that. The way this is set up, idproc is the ID of the virtual processor for that particular task, and the top part of this if statement is executed by all the even tasks and the bottom half of the if statement is executed by all the odd tasks. And so what happens first is that, for example, task 0 performs an MPI send of its send message to idproc plus 1, which is 1, and at the same time the lower half of the if statement is executed by task 1. It performs an MPI receive, which receives that message from idproc minus 1, which is 1 minus 1,
which is 0. And so that performs the message passing, and then they both print their messages. Then a reply is sent from task 1 by the MPI send in the lower half of the if statement, sending the reply message back to task 0. Correspondingly there's a receive by task 0, and then they print those messages. And so the result of running this code looks like this: task 0 says "knock knock", task 1 says "who's there?" So that's an example of a simple conversation going on between the two. Now, the next example I'd like to go over is Pascal's triangle. This is an example that illustrates local propagation. By propagation I mean that every element eventually interacts with every other element in the problem, but the interactions are all local, because any one element is simply the sum of the two neighboring elements in the preceding line. And so eventually they all interact with each other, but every interaction itself is local. This is similar to a variety of physical problems, such as fluid modeling, where you're looking at, say, fluid flow inside a plasma or inside a blood vessel or things like that, and you can use partial differential equations, where you have neighboring elements interacting with each other; as well as elastic deformation, which occurs when you're trying to simulate, say, using finite element modeling, how the Earth's crust is going to deform when, say, a fault slips; or a Gaussian blur, where you're talking about one point spreading its information to all its neighboring ones and so forth using a localized convolution; or molecular dynamics, where you have molecules interacting with each other in a local manner; or certain parts of particle-based plasma models. All those kinds of codes are good examples of local propagation. So, parallel Pascal's triangle. The way that you recognize where
the message passing is: first, lay it out. You can think of this as the time axis being down, from 1, to 1 1, and so forth, and the thing to recognize is to understand where the communication is happening in the problem in order to perform the computation. And so what I've drawn here is all the arrows indicating all the places where there's a certain amount of information being propagated, or data being propagated, from element to element. And so the thing to recognize is that when you partition the problem, let's say, up into three different sections, you can recognize that there's a certain amount of information or data being propagated through the partitions in between each section of the problem. And so you can handle all the internal communication as normal for any single-processor code, but then the MPI calls correspond to the arrows that cross the red boundaries that are here. But by just choosing this arrangement for the partitioning, the computation becomes proportional to the volume of the problem and the communication becomes proportional to the surface area, so you can think of it sort of physically: you'll probably end up with a good communication-to-computation ratio with this kind of organization. So by splitting it up into the three different sections, imagine the three different tasks running. These are the messages being sent and received, so that for every odd and even line you're sending messages to the left or to the right, left then right, for every alternate line. And so for the computation, all the computation needs to do is simply this: there's an array, and to compute the value of an element in one line you simply sum the previous two. But what the message passing does is fill in the gaps as it needs to, to propagate the information in between each section. And so you can see, say, the left edge of the middle task is a duplicate of the
right edge of the left task. And so the fact that there's a duplicate, this is also known as guard cells, where you're able to set up these kinds of guard cells to allow the computation to proceed as if it was the only processor running, but then the MPI simply fills in the guard cells at the moment when it's needed. And so this is actually a fairly prototypical example of a lot of local-propagation-type problems. So here's the code example; again, this is available in Fortran as well as C. In this case this if statement alternates between odd and even lines of the problem. For example, if we start at the top part of the if statement, we have an MPI_Irecv that's performed on the right edge of the array from the right processor. What the I means is that it's an immediate receive; it immediately returns, also known as an asynchronous receive, so you're allowed to continue to execute while the receive is happening. Then an MPI_Send is performed on the left part of the array, array element 0, to the left processor, and then an MPI_Wait is performed to balance out that Irecv and complete the Irecv that came before. So in this case, since everybody is sending something to the left, that means that you're receiving something from the right, and so that's what that corresponds to. Likewise, in the lower half of the if statement we're doing an Irecv from the left, and then a send to the right, and then a wait to complete the receive, and so we're all sending to the right instead. And so the result of this code is like this, if we divide it among three different tasks. The way that this is drawn is that all the odd values are drawn with an asterisk and all the even values with a space. And so we can see that essentially task 1 has a seed at the top, which then propagates through and propagates out of the boundaries, across the partitions, into the sections on task 0 and task 2. And so by performing the iteration this way, we can see that we actually have
maintained our guard cells. If you look carefully, the left edge of task 1 is identical to the right edge of task 0, and so those guard cells were maintained by the MPI, and so we see that it's successful there. The other thing that we see out of this is that this forms a shape also known as the Sierpinski gasket in the Pascal's triangle. So we're able to perform this problem using MPI in this way. So that's just one of many possible message-passing patterns that are supported using MPI. For example, that's the nearest-neighbor example on the left, and of course the arrows are reversible, so you can do left, right, left, right. In the upper left is another common message-passing pattern, also known as master-slave. This is something that's relatively simple, and so in this case it shows a broadcast from one task to all the others, or you can reverse these, of course, to correspond to a gather. But there's also the all-to-all communications pattern, where every node is communicating with every other one. That's very important for data transposes of matrices, and that's important, say, for performing a very large 1D FFT in parallel: you have to go through data transposes, and consequently in the message-passing patterns you have a lot of all-to-all communications. Or you can organize the message passing like a tree, where one is sending to two others and they in turn send to two others; or an irregular pattern for a more irregular problem; or any combination of the above, or any multi-dimensional versions of the above. So these are all things that are possible with MPI and are important for a variety of interesting problems. So to give an idea of some of those interesting problems, these are applications that have been run on Mac clusters that I'm familiar with. For example, in the upper left is a picture of the Electric Tokamak device. A tokamak is a plasma device that attempts to hold a plasma in confinement in a ring-shaped
pattern, or torus-shaped pattern. One of the things about many kinds of tokamaks is that the plasma in it is very hot and very hard to handle, and so it typically leaks out to the walls around it, and so they want to try to confine it better. But in order to see inside it, if you stick a probe in it, it's so hot it could just vaporize, so it's very hard to really probe in there. So that's why computational simulations are very interesting to do. And this is an example, a QuickTime movie made from a gyrokinetic simulation of a tokamak plasma in cross-section, showing the electric potential and how it evolves from the linear state to a saturated state. Whoops. And then on the right is a planetarium rendering that was performed by a customer over at Northern Kentucky University. This was submitted to the first-ever fulldome festival and actually won an award. This was performed on a 50-node Mac cluster, rendering out the three-dimensional simulation inside a planetarium. On the lower left is an example that comes from Dr. Huelsenbeck over at UCSD in biology, where he and his colleagues wrote a program called pMrBayes
that computes the posterior probabilities of phylogenetic trees, so say that five times fast. What it does is look at the similarities in DNA between various species and try to determine the evolutionary path between them. This was a code that he consulted with me on to do parallelization as well as vectorization. With vectorization we were able to get a three-times speedup, and of course with the parallelization we were able to boost that even more by the number of processors. On the lower right are some diagrams from a quantum PIC simulation. In this case this is showing a two-dimensional quantum wave function in a simple harmonic oscillator, and showing the circulation of the electron around the wave function. This is actually work that was based on my doctoral dissertation, which I did entirely on Mac clusters, and what it involves is an approximation of Feynman path integrals, to be able to sample the possible classical paths and use the plasma code to push those paths forward and determine the evolution of the quantum wave function. So, the Mac cluster recipe. Basically this is all the description that you need to assemble a Mac cluster. The ingredients: simply take a bunch of Power Macs or Xserves, G4s or G5s, upgrade the memory as you need to, get a Fast Ethernet switch, or faster if you have more money, and get a bunch of Ethernet cables. Then the directions are: connect the cables from the Macs to the switch, download Pooch, which you can get from our website, and install Pooch. It only takes seconds per node to install Pooch. And then use the AltiVec Fractal demo to test the cluster. So what I'd like to do is see if I can give you a demonstration. So if we could switch to demo two, yes, thanks. And let's see, okay, good. So let me give you sort of a prototypical idea of a numerically intensive code
that we have here. This is known as the AltiVec Fractal demo. Right now it's not using the vector processor that's in this G5 here, and it uses a z-to-the-fourth computation, something I thought was a little bit more numerically challenging, and it also counts up how many floating-point operations it does and times itself to determine how many megaflops it achieves, and it gets about 1,100 megaflops in this case. But if I use the vector processor, I can go ahead and use that and it goes quite a bit faster. It's about, you know, five or six gigaflops or so, which is pretty nice. And I can get this also to make use of the dual processors, and that gives me another factor of two. But what if I want to get beyond a factor of two? Well, that's where parallel computing comes in, and that's where Pooch comes in. So this is how long it takes to install Pooch: just double-click on the installer and there we are. Pooch is an acronym: Parallel OperatiOn and Control Heuristic application. And let's say I just need to log in to the cluster and start up a new job. I go ahead and pick New Job from the File menu, and it opens up a new job window. This has two panes: the job window holds a list of files on the left, and the executable, that will be copied to the machines listed on the right and executed as a parallel computing job there. So if I click on Select App, I can go ahead and use the file dialog to navigate through the files, but I really don't prefer doing it that way; I prefer using drag-and-drop. So how many parallel computers can you think of that you could launch using drag-and-drop? There really aren't too many. By default it selects the node I'm on, which is Nob Hill demo two, and if I click on Select Nodes this opens up a new network scan window, which uses both Rendezvous and SLP simultaneously to determine the names and IP addresses of other machines on the
local area network that's here. And so I can see that it uses this information, it uses the IP addresses, to contact the Pooches on the other machines, in this case the Xserves that are here, and involve those, and then determine whether or not they're busy or OK. Busy means that they're running a parallel job; it'll show up in red letters. It also shows how much RAM they have, and it queries other information such as, you know, what's their clock speed, or what operating system, or how much load does it have, how much disk space, when was the last time someone touched the mouse, and it uses this information to form a rating of the cluster, and so it helps you choose the nodes that are more suitable for running in the cluster. It actually can give you a recommendation: you can go ahead and choose the Add Best function if you want, or you can go ahead and drag and drop or double-click on the nodes that are there. So if I click on the options of the job window, this opens up the options drawer, and you can, say, place the executable in a particular directory on each one of the machines, or perhaps delay the launch till some later time of day, like after a colleague leaves work to go home. And you can also pretend that you're on a very, very large system by launching more tasks; by default it launches as many tasks as there are processors, but you can also use this for benchmarking or to stress test your code. We support three different MPIs, as I described earlier, and if you want to get through a firewall you can use a particular port number, or queue the job for later execution. So to launch the job I go ahead and click on Launch Job, and it copies the executable to the other machines and then passes control to the parallel computing code, which then divides this into the various different sections and then collects results back here for display, and we get something like 44
gigaflops in this case. So, thank you. Let's see, I just want to check something. No, okay. So from there, just to show you that this isn't just for fractals, this is an example of a physics code that we have. And let me go ahead and, actually, oh, that's fine, I'll just involve the same nodes that are here. And so what's happening here is that this is actually a plasma physics code; it's running at least a few-million-particle simulation, and it's being performed on the nine processors that are available here. And if I go ahead and run this job, we can see that the electrostatic potential is going to show that there's a plasma instability that grows out of that. And we can see in the lower right there's the MacMPI monitor window, which is very useful for diagnosing and debugging parallel codes in MPI. In the top part of the window it shows the messages: white means it's not sending any, red means it's receiving, green means it's sending, yellow means it's swapping. And so a typical thing that happens when you're learning how to write a parallel computing code is that a lock-up happens, and so it freezes in the light pattern of the hang. But also down below there's a histogram of the messages being sent and received as a function of message size, and that's to encourage you to send fewer large messages rather than too many smaller messages. It also shows you dials of how much time it's been communicating or how many megabytes per second are being sent or received between these machines. So this is a utility that myself and many colleagues at many venerable institutions have used to diagnose and debug their codes. And to give you another example of a code, also a physics code, this is an example of a code that performs a Fresnel diffraction problem, where you have a point source of light producing
spherical waves and projecting a diffraction image on the screen. And so from there, this actually has a feature where it's able to automatically launch itself in parallel on the cluster. And so this is a way that I would hope applications become so easy to use: you can just simply use a menu click to have it launch itself onto the cluster and make use of the resources that are there. You can see again the MacMPI monitor window showing the messages being received, mostly very large messages. Let me make the problem bigger; it goes so fast I've got to make the problem size bigger. And then it's also showing in colors the different parts of the problem that are being assigned to different processors. And one other thing I want to show you, something that was just announced this week, is what we call Pooch Pro. It has a new Users menu where you can actually assign a certain amount of quota for each user, and so it computes how much compute time is being used. And this is the only cluster software I know of that has rollover minutes, so you can roll over your compute time from week to week, let's say. And also, this is something you would only see as an administrator: you can actually administrate the users that are there. An A means that the user has administrative capabilities, a Q means the user has a quota, and there are rollover minutes, migrating, and password changing; you can have different passwords and so on. So I can double-click on a particular one and edit, say, how much time or quota the user has, or how many CPUs. Let's say I'll give them really little time or something like that, and not allow them to change the password. Okay. Anyway, so these are the kinds of things that are available in Pooch Pro. So that will be it for the demo. Can we switch back to slides? Thank you. So just very quickly,
for more information, the reference library that you can peruse basically goes with the Dauger Research website. You can find a whole bunch of information: the Pooch website, where you can find the cluster recipe and download a trial version, and we have a tutorial on writing parallel codes as well as a zoology of parallel computing that has a description of the different parallel computing types, and this will all be linked from the WWDC URLs, as well as the Parallel Knock tutorial, with code examples in both Fortran and C, a parallel adder tutorial in both languages, parallel Pascal's triangle, as well as related publications, and actually another video, a little bit longer than what we've displayed here, of some of the work that we've done. So I'd like to introduce Steve Forde, and thank you very much for your attention. Steve Forde, I'm CEO of GridIron Software. I'm going to go over a real brief overview of what we would call a next-generation parallel computing framework, and we're going to do that really from a very commercial perspective. So probably a lot of the same points that you've heard before I'll go through a little bit, but we'll go from here. So one of the key things that the ISVs that we work with are looking for is obviously speed, but a lot of times the resources that are available to end users for products that they ship are not vast. So you get into a scenario where I need to provide one hundred percent performance, or provide linearity for every CPU that I add, because you might have a company like a Pixar that has thousands of machines, but you also might have a small post-production facility or something like that, sitting in the basement with just one or two machines kicking around. Is that actually going to provide some value for them? So the challenge for developers is: how do you build a parallel application that provides this performance in a very easy-to-use and seamless fashion? Our ISVs are really interested in the money quotient;
this is what I would like to call our million-dollar slide. This was a customer we have in the print space, and we actually did a comparison between five G5 Xserves and a Sun SPARC Fire 6800 with 12 CPUs. The interesting thing is that this was the result, and this is the cost that's generally associated with machines like that, and you can kind of get the idea of why commercial software vendors are very interested in seeing how they can provide this functionality, from a commercial perspective, to everyone in their user base. So, methods of grid computing. We've heard a lot about different things, but from the grid perspective there are three basic kinds: there's the middleware perspective, there's the opportunity for message passing, which is being talked about a lot, and some development tools that try to make this whole black art of parallelism a little bit easier on you, the developer. Scripted distribution: there are obviously some pros and cons. It's very good, as we see with distributed resource managers, and if you're familiar with things like Xgrid and that kind of stuff, to go out and say, okay, I'm going to use existing resources with existing applications and do things across them. But there generally needs to be some sort of skill set for the end user to understand how to do those things. So it's very useful in areas of scientific computing and in research, but when you go into a shrink-wrapped application and you're trying to put that onto a CD, it's a little tough for a lot of the user base to grasp. Message passing, as is being talked about a lot, is used quite extensively in the scientific and research areas. But the interesting thing that we found as we went through our engagements with several ISVs, from the commercial perspective, obviously there are pros and cons, but the biggest thing was that there wasn't a lot of confidence in their ability to ship that with a product. So it was the learning curve associated with actually putting
that into their products and getting their users to understand how this thing works. Development tools is where you're probably going to see a lot more emphasis down the road, especially as chip design and so on is going to move in a few different directions. But from our perspective, we wanted to create an application development environment that would have a very high level of abstraction. So message passing is a message-passing interface, but you turn around and say, okay, you still have to write a parallel application: it provides you the messages, but everything else is up to you. So you not only have to worry about how you partition your algorithm, but then how you message, and then how you build all the things such as discovery and resource fault tolerance, all those kinds of things. You've got some good tools, again, like Dean's tool, that can come along and work with MPI and so on on the top, but what happens from within, that's something that's very important. GridIron XLR8, just as a real brief overview: it's a peer-to-peer based distributed computing architecture, and the APIs are built into source; it's more wrapping the source. I'm going to give a quick example of that, what we did for some MPEG encoding. And then the work is dynamically distributed across the network, and that can be to a dedicated cluster or to specific resources on desktops; it doesn't really matter. The key thing is that you can go into that scenario very quickly and, from within your application, once it's been programmatically added, provide an end user with a very engaging experience. So the development tools have a lot of the same, you know, pros and cons. It obviously requires code modification, and we as developers don't like to modify code. It's a very non-trivial thing, especially when you get into busting up algorithms; anybody who works in multi-threading can, you know, agree with that assessment. But from our perspective we're kind of like
a hybrid between OpenMP and MPI. What we wanted the ability to provide, in the demo I'm going to show you with Adobe After Effects, was to be able to take advantage of just another machine, or another CPU within the same box, for a very serial, thread-based application. To grid or not to grid, obviously that's the big question: what kind of development work or effort do you have to put into the parallelization of your code to return some results, and is it worth it? That's the big question, and that's always why parallelism has been the black art. So from our perspective we wanted to really start to get rid of that black art connotation and provide an interesting framework to do this. A lot of times, I mean, there's been a lot of talk also about processor-intensive applications, but ironically most of the applications that we've worked with have more data problems: data movement, I/O, you know, reads and writes and those kinds of things seem to be the major bottleneck with a lot of the applications that we've worked with. So we wanted to come up with various different means, again saying to you, the developer: really, you focus on your algorithm, you focus on the thing you know very well, and we'll provide you a parallel application that you can call into. So which grid method to use? Obviously, if you're doing embarrassingly parallel, scriptable-type things, distributed resource managers and scripted batch queue systems are very good. If your source code is not available, that's probably the only route that you have. But then you also have the opportunity, depending on the resources at your disposal, to go into message passing or into another type of framework such as a development environment. Quickly, on grid-enabling an application: obviously 90/10, 80/20, we mean basically the same thing, but when you're looking at the 80/20 rule, focus on that twenty percent of the code that does eighty percent of the work. The abstraction level again is very important here, because what really
we're trying to do from a development perspective, well, there's no such thing as real automatic parallelism, but from that perspective maybe there are ways to wrap and provide hints instead of breaking your algorithms. In other words, they can still run the same way that they did before, and you don't have to worry about totally wrecking your application. Application modification: we've broken out our architecture into three plugins: defining tasks, task computation, and then result reassembly. But again, the goal is to provide the end user with a really engaging experience, where they can basically think that it was just done on their PC, just done on their Mac, and they can use whatever machines are on the network. MPEG encoding was a challenge that we were given by a certain company that does a lot of encoding, and we wanted to see what kind of results we'd get, because this was high data, so this is HD video. We actually went in and did a modification, and we ran the test on several Xserves, and we got some very interesting results. Basically we tried it out originally on 12; we did actually go up to 40, but we started seeing some degradation in the curve, diminishing returns. But we took an HD encode and brought that down from two and a half hours to 26 minutes and provided a seamless result. The nice thing is I also didn't have a lot of disk space, I didn't need to have a lot of things; it just went and dynamically moved the data when it needed to and brought the results back to the end user. More importantly, from a development perspective, we only modified 100 lines. Out of 1,100 source files we modified three, basically about a thousand lines. Of course, we published a white paper on this whole thing, there's a lot of comments, and that's available on our website if you do want to check that out. So another thing that we did, and what we're shipping with right now, is with Adobe After Effects, and I wanted to kind of show that to you, if I can switch over to demo two. I wanted to
show that to you very quickly, because this is what we feel the end user has to experience, and this is the challenge for us as developers: to bring an engaging experience to your end customers.

The funny thing is, and I'm just going to make a reference: I think in the keynote on Monday there was a reference to the challenge of bringing chips into a smaller design. Well, a lot of folks are now announcing the ability to go into a core multi-threading or a multiple-cell type of chip, so parallelism is going to be absolutely key if our software environments are actually going to take advantage of it.

So one of the things that we did in this scenario is that we're actually going to use these four Xserves here. But from an end-user perspective, they don't know anything. I mean, this is a product that ships; you can buy it at Fry's for 900 bucks. And from that perspective, if you go in there, they don't know anything about DHCP, they don't know anything about DNS, they don't know anything other than "I have to plug it in and I hit go." So from that perspective, that's literally all they have to do. It'll go out, it'll automatically and dynamically find all the other machines, and it'll pass the data that's relevant for them to work on. But the most important thing is to provide the results right in the application, in a method that they're very familiar with.

So if you look down here, we're starting to bring in... this is an HD 1080i clip, by the way, for those who are interested. But the interesting thing is that we're bringing the results right into the RAM cache of After Effects, so from the user's perspective this looks just like the way they always worked, and that's very engaging. The other side, too, though, is that we have an interesting side effect of using a grid to do all the work, and that is you can render and work at the same time. That's never been doable before in a single-threaded application like this. I can do things and render and so on.

So I'll go back to the slides. That's just a very
quick demo, and the power that provides an end user is very engaging. We've been able to see that: this has been shipping for a month, and I think the stat was seventeen thousand users are using this, cobbling together machines in their basements and using it in very large infrastructure as well. The NBA Finals were brought to you by this. From that perspective, if we can, as developers, bring engaging experiences to our end customers, our products will come to the market in a very engaging light.

So, summary: obviously speed, speed's great. Is it worth the work you have? That's really up to you. You need to look at environments that are going to help you get to a more optimized and parallel infrastructure without the headaches or the worry of breaking your code. There are going to be new hardware technologies coming down the road, specifically multi-cell chips, that are going to mean parallelism is absolutely key, so we've got to start thinking about it now. And significant linear performance is really the thing that customers want to buy.
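
The three-stage architecture described in the talk (defining batch tasks, computing them, and reassembling the results) can be illustrated with a minimal sketch. This is not the speaker's framework or its API: the function names `split_into_tasks`, `encode_chunk`, `reassemble`, and `run_on_grid` are hypothetical, a local thread pool stands in for machines on the network, and doubling a number stands in for the real per-frame work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(frames, chunk_size):
    """Stage 1: cut the job into independent (start_index, chunk) tasks."""
    return [
        (start, frames[start:start + chunk_size])
        for start in range(0, len(frames), chunk_size)
    ]


def encode_chunk(task):
    """Stage 2: hypothetical stand-in for the existing serial algorithm."""
    start, chunk = task
    return start, [frame * 2 for frame in chunk]


def reassemble(results):
    """Stage 3: order partial results by start index and concatenate them."""
    output = []
    for _, part in sorted(results, key=lambda item: item[0]):
        output.extend(part)
    return output


def run_on_grid(frames, chunk_size=4, workers=4):
    """Run all three stages; the thread pool is an assumed stand-in for a grid."""
    tasks = split_into_tasks(frames, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(encode_chunk, tasks))
    return reassemble(results)


if __name__ == "__main__":
    print(run_on_grid(list(range(10)), chunk_size=3, workers=2))
```

The point of the shape is that `encode_chunk` keeps running the same way it did before; only the splitting and reassembly code around it is new, which is the kind of small, localized modification the talk argues for.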