WWDC2004 Session 601

Transcript

Kind: captions
Language: en
Okay, good morning. My name is Doug Brooks, I'm a product manager in Apple hardware, and I'd like to welcome you. This session is entitled "HPC Technology Update." In this session we'd like to take a look at Apple and HPC, and specifically at Apple products and technologies that can contribute to HPC deployments. We'll also look at industry-leading third-party products that complement those solutions, and we'll hear from two customers who will talk about their HPC deployments using Apple technology. First, let's take a look at Apple and HPC. Now, in the last year, when you think of Apple and HPC, is the first thing that comes to your mind this? If the remote works...
There we go, a technical failure. Virginia Tech. Virginia Tech really was an early leader, in that they saw the vision: the power of the G5 and Mac OS X Server. Combining over a thousand Power Mac G5s and Mac OS X Server software, using InfiniBand technology for the interconnect, they achieved phenomenal performance, over ten teraflops of computing power. On its debut it ranked number three on the Top 500 list, and it is the number one academic supercomputer in the world. An amazing achievement. And they also really proved the value and the price point: that you could build and deliver a very high-performance supercomputer system with Apple technology. Now, what's interesting is that while they may have been the first, and definitely the largest, G5 cluster deployment, they were definitely not the first Apple cluster deployment.
Matter of fact, a lot of the early cluster work done on the Macintosh platform was actually done a number of years earlier, most notably at UCLA with the AppleSeed work. You may remember this from the late '90s; actually, I believe this is circa 1998. UCLA, with the AppleSeed project, took what at the time was pretty fast technology, beige G3 233 MHz systems, with 10/100 Ethernet as an interconnect, running early on Mac OS 8 and using Apple events as the middleware. Nevertheless, for the high-energy physics work they were doing, it achieved phenomenal performance at very low cost; actually, the system you see here on the screen outperformed a Cray Y-MP on similar codes. So again, showing the value: some of the same things Virginia Tech has proved with the G5, they actually showed much earlier.
Of course, we've come a long way since then, from the beige G3 days to the G4, which introduced desktop supercomputing with the Velocity Engine, bringing phenomenal vector processing capabilities that many applications have been able to take tremendous benefit from by leveraging that power and that technology. And of course, most recently, the G5, bringing phenomenal floating-point performance and dual processing, with a system that delivers very high memory bandwidth and system throughput to the processor, providing a phenomenal foundation for computing.
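
For a sense of what Velocity Engine code looks like, here is a minimal C sketch, assuming Apple GCC's -faltivec flag (with mainline GCC you would include <altivec.h> and build with -maltivec instead); the values are arbitrary illustrations:

    #include <stdio.h>

    int main(void) {
        /* 16-byte alignment lets vec_ld/vec_st move data directly */
        float a[4] __attribute__((aligned(16))) = {1, 2, 3, 4};
        float b[4] __attribute__((aligned(16))) = {10, 20, 30, 40};
        float c[4] __attribute__((aligned(16)));

        vector float va = vec_ld(0, a);
        vector float vb = vec_ld(0, b);
        vector float vz = (vector float)vec_splat_u32(0); /* four 0.0f */

        /* one fused multiply-add processes four floats at once */
        vector float vc = vec_madd(va, vb, vz);
        vec_st(vc, 0, c);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }
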
And of course, the customers have responded. Higher-education customers tend to be on the leading edge; they're some of our strongest early adopters in the higher-education market. And if you were at the last session, you heard about some of the deployments in the scientific field. Matter of fact, the life sciences in particular have adopted our technology, primarily because many of the key applications that are run day in and day out in the life sciences arena have been Velocity Engine optimized and run at very high performance on the G4 and G5 processors. So we're seeing lots of deployments taking advantage of our tools and our technology, combined with the ease of use that we provide in our systems, in the sciences field.
And of course, we've had a phenomenal community of vendors bringing technologies and products. With your developers and your applications coming to Mac OS X, this community of people and applications able to run on this platform has really grown, and so we're seeing more and more solutions become available to the HPC community on Mac OS X, growing all the time. We've also seen a community of expertise build up in the HPC community, in the cluster community, around Mac OS X and around the G5. Matter of fact, a notable one is The BioTeam, a consulting company that focuses on the life sciences arena, really delivering great solutions for life sciences based in many parts on Apple technologies. And of course, Apple has responded, with products really focused and targeted at this market, most specifically the Xserve cluster node: a machine tailored and streamlined specifically for clusters and high-performance computing configurations. Streamlined, full dual-processor compute performance, without a lot of the extras you don't need when you rack up a whole rack full of these systems.
And we've seen that go even further. Matter of fact, most recently, the Apple Workgroup Cluster takes a complete-solutions approach to this product line, being able to offer not just hardware but hardware, software, cables, power, rack, and interconnect: everything you need to form a complete solution around bioinformatics. And it's been very, very popular; the response has been phenomenal. We're very proud that it won Best of Show at Bio-IT World earlier this year. Again, for customers doing bioinformatics, that is the easiest cluster to set up, really bringing Apple performance and ease of use to the cluster space. And finally, we've had phenomenal customer response. This is a small sampling of customers who have recently deployed clusters based on Apple technology, Xserve and Mac OS X Server. We really see this market continuing to grow, with more very exciting customer deployments, and again, you're going to hear about two of them later on in this session. Matter of fact, we're seeing right now roughly forty percent of our Xserve units going into clusters and high-performance computing, and as Xserve continues to grow, we see this continuing to grow as well, as an important slice of the Xserve pie, you might say. So what does it take to actually put a cluster together with Apple technology? Obviously, it's a lot more than just racking a bunch of Xserves in a rack.
look at the HPC technology stack what
are the components on Mac OS 10 that it
takes to build an htpc deployment and
what technologies and products are
available from Apple and third parties
to complement this stack so this is a
view of again what will stay hpc
building blocks the components that are
required to to build a HPC deployment in
in technology space so if we look from
the bottom up from the hardware the
actual hardware platform itself the
operating system the interconnect
compiler and optimization tools the
communications middleware and finally
the management tools can really do a
complete cluster deployment we need
components from all of these now if we
take a look at what Apple Apple products
and technologies provide you see apple
provides products that fit really in
about four of the six areas and if we
look at third party technologies really
industry leading third-party
technologies we have again about 46 of
those components that we have to choose
from and as we walk through you'll be
able to see have a wide selection of
products and technologies for deploying
clusters on Mac OS 10 so let's take the
quick review walking the stack from the
bottom up I must take a quick review at
from an apple perspective at the
hardware component so this is pretty
pretty pretty straightforward first and
First and foremost, we have the G5 processor. The G5 really stands out as a processor for high-performance computing, with its 64-bit capabilities and the massive floating-point and Velocity Engine support in this processor, coupled with the power advantages it has from the smaller 90-nanometer process technology. It really delivers a phenomenal bang for your buck, and for your power and heat output as well, so obviously that becomes one of the core foundation pieces of the product. As we look up the stack, we wrap that up in the Xserve G5, being able to provide dual-processor performance in a 1U form factor: a system that delivers peak performance of 16 gigaflops of double-precision floating point, or 32 gigaflops of single precision with the Velocity Engine. Again, a very powerful system delivering phenomenal performance. Couple that with the latest I/O technologies, PCI-X, ECC memory, and integrated hardware monitoring for systems management and monitoring, and again, very low power when we look at it compared to competitors: a compute-node configuration of Xserve at one hundred percent CPU is under 250 watts of total power usage, and well under a thousand BTUs an hour in heat output for the data center that needs to cool these systems. That's significantly lower than competitive systems, and a great advantage that we have with the G5 processor.
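
As a rough back-of-the-envelope check on those figures, using the standard conversion of 1 watt to about 3.412 BTU per hour:

    250 W x 3.412 BTU/h per W  =  ~853 BTU/h

which is consistent with the claim of well under a thousand BTUs an hour per node.
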
We also have Xserve RAID. Kind of the corollary to a lot of computing power is the data that needs to be fed into those systems, and that data needs to be stored somewhere. We see Xserve RAID as the ideal storage device for storing data for high-performance computing clusters, with phenomenal performance and phenomenal capacity: you have the ability, through Fibre Channel, to put quite a large amount of storage online and serve it throughout your cluster. And again, with breakthrough price/performance, Xserve RAID is just an ideal storage device for high-performance computing clusters.
We then take a look up at the next step. From an operating-system perspective, we have Mac OS X, and Mac OS X has really provided a key foundation for cluster applications. What's really unique about Mac OS X is that we have the ability to combine the power of UNIX under the hood, which allows us to bring applications and technologies over, to compile and run those applications on Mac OS X, with the great ease of use, ease of deployment, and ease of management that the Mac OS X Server services provide. And of course, with the G5 optimizations, we've been able to deliver the performance out of the G5 processor. And of course, if you've been to any of the sessions yesterday, you know that we've introduced Tiger, and with Tiger Server, one of the most important things we're able to bring in the HPC space is a true 64-bit user-space environment, being able to break that 4-gigabyte barrier in user space. We've always been able to access more than 4 gigabytes of memory with a G5 system, even on Panther, but now we have the ability to have applications access large data sets, so especially for code coming over from other platforms, from other operating systems, you'll be able to take advantage of that large memory footprint. Of course, we also expect to see a number of across-the-board improvements in other areas of Tiger: improved SMP performance, improved network performance, improved NFS performance, things that we think are going to really deliver a phenomenal platform for the future of high-performance computing with Mac OS X.
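
As a minimal sketch of what that 64-bit user space permits, here is a C fragment; the 6 GB figure is arbitrary, chosen only because it exceeds the old 4-gigabyte per-process barrier:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t bytes = 6ULL * 1024 * 1024 * 1024;   /* 6 GB */
        double *data = malloc(bytes);   /* cannot succeed in a 32-bit process */
        if (data == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        data[bytes / sizeof(double) - 1] = 1.0;     /* touch the far end */
        printf("allocated %zu bytes\n", bytes);
        free(data);
        return 0;
    }
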
It's also interesting, and I'd encourage you to attend, if you're interested, some of the Xsan sessions later this week. Xsan is Apple's SAN file system for Mac OS X, and Xsan has a role to play in clusters as well, especially on larger clusters where file I/O bandwidth is a concern: Xsan gives you a file system that has the ability to scale out in file I/O services. So Xsan again plays a role in high-performance clusters, being able to share a large data pool across a large clustered environment.
Then, as we move up the stack, we look at cluster interconnects, and this is an area where we really work with industry-leading third-party vendors to provide phenomenal solutions on Mac OS X. Today, for the interconnect, there are really a couple of leading choices. First and foremost, we have Gigabit Ethernet, kind of the common denominator for cluster interconnects, and of course, from an Apple perspective, we provide on the Xserve G5 two very high-performance Gigabit Ethernet ports, so out of the box you're ready to connect those systems together. But for clusters that really need higher bandwidth at lower latency, there are two really leading choices available for Mac OS X: the first is Myrinet technology, and the second is InfiniBand, and I wanted to touch on both of those.
First, I want to talk a little bit about Myrinet. Myrinet is a technology that's been in the HPC space for many, many years; it's really established itself in this space, and matter of fact, if you look at the Top 500 list, you'll see quite a number of clusters that are built on Myrinet technology. We're really proud that Mac OS X has very mature drivers that really deliver excellent performance with Myrinet. Matter of fact, Myrinet was actually one of the cards, when we first did the original Xserve, which introduced 66 MHz PCI slots, that we actually used to tune the bus, to verify we were getting maximum performance out of the system for PCI transfers. Matter of fact, working with their engineers, they told us we were seeing much higher throughput than even the latest PC chipsets at the time. So Myrinet is a really excellent choice. Here are some views of their latest PCI-X card and switch components, the key components that are used in a Myrinet deployment; it's a PCI card that goes in your system. And here are some performance numbers, provided by Myricom, on what Myrinet is capable of achieving. You can see here that the latencies are much, much lower than Gigabit Ethernet, so for applications where that's really critical, Myrinet is an option you can choose. The way they achieve this, and it's actually the same way many of these interconnects work, is that they are able to bypass the largest section of code that introduces latency in the system, and that's specifically the IP stack. As you can see from the diagram here, applications that call through MPI stacks over Myrinet bypass the IP stack and go directly to the hardware, and are able to achieve that much lower latency than what Gigabit Ethernet can provide. So for applications that require it, that becomes really key. Myricom actually had a number of recent announcements at the last Supercomputing conference, and most significantly, what they've debuted are much larger switch form factors, enabling you to scale Myrinet to much higher cluster node counts at much lower price points. So Myrinet is definitely a very compelling solution on Mac OS X for people looking for lower latency at a very attractive price point.
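
For illustration, here is a minimal MPI ping-pong in C, the kind of microbenchmark these latency comparisons come from; the message size and iteration count are arbitrary choices, not Myricom's numbers:

    #include <mpi.h>
    #include <stdio.h>

    /* run with exactly two ranks */
    int main(int argc, char **argv) {
        int rank;
        char buf[1] = {0};          /* 1-byte message isolates latency */
        const int iters = 1000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("half round-trip latency: %.2f us\n",
                   (t1 - t0) / iters / 2 * 1e6);

        MPI_Finalize();
        return 0;
    }

With LAM/MPI or a similar stack, this would typically be built with mpicc and launched with mpirun -np 2 across two nodes, once over plain Gigabit Ethernet and once over the low-latency interconnect, to see the difference the IP-stack bypass makes.
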
What I'd like to do now is introduce John from Voltaire, who's going to talk about InfiniBand on Mac OS X. John. Thank you, Doug. All right, so to talk about InfiniBand is certainly to talk about clusters, and Doug just gave a good introduction there. What we're seeing in the HPC market is a lot of transition from supercomputers, mainframes, and SMP machines to bunches of interconnected servers. Why? Cost is the overriding reason. We'll see that we can put together one of these clusters with a bunch of servers for, realistically, one tenth of the cost of some of the existing SMP machines. Historically, though, that's been fraught with complexities: underutilization of processors, storage bottlenecks, just the complexity of hooking these up. So what does it take to build an effective, efficient cluster? You need physical distribution and logical consolidation. On the physical side, you need high bandwidth, and InfiniBand offers 10 gigabits and 30 gigabits per second today; low latency for interprocessor communications; low CPU overhead, because you don't want the CPUs spending all their processing time communicating with the other processors; and the ability to scale. On the logical consolidation side, the ability to logically group nodes and systems into logical sets or groups or domains. So we need a high-performance interconnect and an intelligent interconnect.
What InfiniBand brings to the table is that it's an open standard, the first open-standard interconnect designed from the ground up for high-performance interconnects, with RDMA support. The extraneous features and functions that you might get in other technologies are not there; it was designed from the ground up with high-end clustering in mind. Because of that, it has what we think is a significantly lower cost-to-performance ratio than the other options available for clustering. The latency is about 140 nanoseconds per hop, and 5.8 microseconds end-to-end latency. A key feature of InfiniBand is that it supports multiple types of traffic over the same fabric, so file, block, network, and IPC traffic are all using a single technology, efficiently. From the beginning, extensive management and monitoring capability was built in, so high availability, quality of service, and partitioning-type capabilities have been built in from day one. Again, it already supports 30 gigabits per second today, and there are DDR, double data rate, and QDR, quad data rate, variants supporting up to 120 gigabits per second being worked on today. So those are the real key points for InfiniBand: high bandwidth, low latency, low CPU utilization, and the ability to scale.
This is a high-level representation of an SMP machine on the left, shown with eight processors and a proprietary interconnect, versus four two-way servers interconnected with InfiniBand. I want to make it clear that SMP, symmetric multiprocessor systems, will still be a perfect solution for a lot of applications. With the cluster, we have to parallelize the application, whereas with SMP we're looking at near-memory speeds; but the cost/performance compared to the SMP is drastic. On the InfiniBand link protocol, I'm really just trying to make one or two points with this overhead. One is that with a single event you can move a large amount of data, up to two gigabytes of data, and the entire link protocol is handled in hardware, so all the reassembly and segmentation and all of that is handled in hardware, and we are on the third generation of ASIC technology for InfiniBand.
Some more InfiniBand link attributes. Each packet is sent with a service level, and there are up to 16 service levels, SLs, supported. There's also something called a VL, a virtual lane, and there are 15 virtual lanes possible over a single physical link. An SL is mapped to a VL, which is then arbitrated across the physical link. Basically, that is the basis for your quality-of-service type of implementation, which lets you mix and match different types of traffic efficiently across the same InfiniBand connection, instead of having to build separate fabrics for each. Partitions: this would be similar to zoning in Fibre Channel, or VLANs in the IP world. It's a mechanism for defining isolated domains, so each port or node can be defined into a partition to communicate with only the nodes in that partition, or given full or limited rights within that group, and that's all defined by the subnet manager; the SM manages it by assigning partition keys.
InfiniBand is based on a 2.5-gigahertz signaling rate, so when you hear the rates for InfiniBand: 1X is 2.5 gigabits per second, though there were really no implementations done at that data rate; 4X is really where all the implementations are, most of the implementations today; and 12X is the 30 gigabits per second. So today, at Voltaire, we can support the 10 and 30 gigabit per second rates over copper cabling, with a 17-meter distance limitation, or one kilometer with multimode fiber. And there are efforts in the works for 5 and 10 gigahertz signaling rates, in process today.
InfiniBand has a very rich protocol stack defined. On the upper layers, you'll see a bunch of stuff that looks familiar: NFS over RDMA, and in version 4, NFS will have RDMA InfiniBand support. MPI, which I think Doug mentioned, message passing interface, is far and away the most popular; MPI is the most popular IPC API in the HPC world. I think that was four three-letter acronyms in one short sentence; let me see if I can do it again: message passing interface is the most popular application programming interface for interprocess communications in the HPC market, and it's supported in the Apple world today. iSCSI is for storage support across the fabric. SDP is Sockets Direct Protocol, and any application with a sockets-level API can utilize that. Of course, there's TCP and IP over IB. And DAPL, the Direct Access Programming Library, a lot of acronyms here: DAPL defines the API to RDMA. Then there's a full suite of InfiniBand services below that for management and monitoring; in my 10-minute time slot we won't be going through those right now. And that HCA terminology is used for the host channel adapter; that's sort of an InfiniBand term, just for a network interface or host bus adapter; they're called HCAs.
So what do InfiniBand and Voltaire bring to the HPC market? It's the first industry standard to enable server clustering; Doug mentioned Myrinet, and there's Quadrics, two other ones out there that are proprietary interconnects that are available. Clustering is the fastest-growing segment in the HPC market, so that's why, as a company, Voltaire and Apple, working together, are very interested in that space. There are excellent performance advantages over the other options. We're currently seeing lots of interest at the universities and labs; the DOE labs specifically are very aggressive in pushing the standard forward, purchasing products, and implementing large clusters. And pricing has dropped quite a bit. We're certainly not at what I would call economies of scale yet, but we've still seen about a fifty percent price reduction in the last 12 months or so, and we think we'll see more reductions as volume increases. And the Virginia Tech system Doug mentioned, an 1,100-node, 5.2-million-dollar cluster, was number three on last November's Top 500 list. The 5.2 million is really key: that's a lot of money anywhere, but for this type of system it is literally one tenth the cost, or better, of the other systems in the top ten. I want to thank Doug and Apple for having us here, and I'm within my 10 minutes, so that's all.
Thanks, John. We're really excited about Voltaire's InfiniBand offerings, because for customers who are looking for a very versatile interconnect with great latency and bandwidth properties, InfiniBand is very attractive, and it's gaining tremendous momentum in the HPC space. I'd like to move on and go back to walking up our HPC building blocks, and take a look at compiler and optimization tools. You know, the interesting thing about HPC is that it's really a segment of end users who are also developers: I've never met an HPC deployment that's not taking advantage of its own tools or compiling its own programs. And so the compiler and the optimization tools that are needed to really eke the most performance out of their code become a very important piece of the technology.
From Apple's perspective, of course, we have Xcode, and Xcode is just a phenomenal development environment, being able to leverage the productivity features, being able to deliver a great user interface and great tools to build, develop, and debug applications. And the fact that I can write a program on my PowerBook and then send it up to my cluster for execution, optimized for the G5, is incredibly powerful. And of course, as we improve Xcode: for example, the betas, the pre-release versions you've received this week, actually begin to introduce some of the 64-bit capabilities for large memory spaces, so you can already begin working with those tools.
A very important part of Xcode is actually the CHUD tools. If you're not familiar with them, CHUD stands for Computer Hardware Understanding Development, and these are tools originally written internally within Apple to help us optimize and understand the implications of code executing on our systems. These tools turned out to be extremely powerful, and they've actually been made available as part of our developer toolset; they're now a standard part of the Xcode installation, so if you have Xcode installed, you'll find them right in your Developer folder. These tools are incredibly important in this space, to be able to really understand the performance bottlenecks and implications of your code running on our systems. I've seen numerous examples of, for instance, people who were convinced their code was processor-bound on a G5, and with some simple profiling with Shark, for example, one of the key tools in the CHUD set, they found out that maybe it's more memory-bound, and there's some tuning that can be done to improve throughput. So these are very, very important tools in our toolset, and if you have an opportunity, I really encourage you to go to some of the sessions this week on the CHUD tools for a better understanding of their capabilities.
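
A hypothetical example of the kind of thing a Shark profile can expose: two loops that do identical arithmetic, where the strided walk becomes memory-bound on large arrays while the sequential walk stays cache-friendly (names and the array size are invented for illustration):

    #include <stdlib.h>

    #define N 2048

    double sum_row_major(double (*a)[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)        /* sequential, cache-friendly */
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_col_major(double (*a)[N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)        /* strided walk, memory-bound */
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        double (*a)[N] = calloc(N, sizeof *a);
        if (a == NULL) return 1;
        double r = sum_row_major(a);
        double c = sum_col_major(a);
        free(a);
        return (r == c) ? 0 : 1;
    }
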
Another important piece of this space is Fortran. Fortran continues to be one of the top scientific programming languages, and it's an area where Apple works with third parties to really develop and provide great solutions on top of Mac OS X. So it gives me a lot of pleasure to introduce Dave Paulmark from IBM, who's going to talk about XL Fortran. Dave.
Great, thanks, Doug. I'm really happy to be here today. It's great to be able to stand in front of a bunch of Apple developers as an IBMer and talk about our technology; it's not just the processor this time, we're going to talk about some software today, and some hardware too. There we go. So what we've brought to the Apple processor and Mac OS X is a compiler that's got a long history behind it. This is the IBM Fortran compiler that's been behind our systems since the very early '90s, and even going beyond that, we use this technology inside of our C compilers as well, and we have XL compilers, both C, C++, and Fortran, on this platform. And it's been used by very important IBM customers, mainly on AIX, but we're starting to see some movement to Linux and Mac now: people like LLNL, NERSC, and a European weather-forecasting group. We deal with these people every day; we understand their problems, we understand the kinds of applications they have to develop, and we built a compiler for them.
Now, when you pick up XL Fortran, you're not just getting performance: you're getting language standards and conformance. This helps a lot with porting: if you have something that runs somewhere else and it's conformant, we're going to be able to handle it. So we're fully Fortran 77, fully Fortran 90, fully Fortran 95, and we've started on Fortran 2003, which we expect to be ratified, hopefully, at the end of this year. People have asked us for some things early from it, and once the standard congealed and got a little bit more stable, we went ahead and did things like the IEEE modules, allocatable components, and stream I/O, things that people were asking us for. And we have people on those standards committees, so we know what's coming, and we have a voice in there. We also handle things like OpenMP, where we also have focus on those standards committees. We're not just standards-conformant; we also have extensions. Surprise: it's Fortran! So we do OpenMP 2.0, fully compliant, on Mac OS X. It's a technology preview as yet: a preview of some of the technology that we've deployed on AIX and on Linux PowerPC, where it's been out there for quite a while now. But we also do things from other companies: we've got Cray pointers, 128-bit floating point, 64-bit integers, STRUCTURE, RECORD, UNION, MAP, and so on and so on. There are way too many options to try to describe them all in this group, but suffice it to say, with things like STRUCTURE, RECORD, UNION, and MAP, we've had customers come to us with things like, you know, "We would like to buy IBM hardware, but you don't have this." Well, now they've got it, and now they have IBM hardware. We do that sort of thing all the time; these kinds of requests come in. We want to hear things; we want to hear what you need.
And we'll talk to you about that. We've also got some very important extensions for the PowerPC in particular: the PowerPC hardware intrinsic functions and directives get you access, at the source level, to the hardware instructions. So you can code something as a directive or as a function call, and what you're going to get there is the particular instruction that you need at that point, something like data prefetch, for example. Very powerful. We also give you an XLF utility module that you can use to get access to some common system services, so you don't have to go off and code that yourself. Now, we're in Xcode, and that's real exciting, but for folks that still like a UNIX command line, we're there too: makefiles still work, and gdb works with us; we work well with gdb.
Now, as you go up the optimization levels, obviously things start to go down a bit on the debugging side; the support goes down a bit. But we do have some directives and so on that you can put in your source at certain points to get you the information you need to debug, you know, to get back that traceback or whatever it is you're having trouble with. And something that isn't here that shouldn't be missed: Shark. I love that tool; I wish we had that on AIX. We work well with Shark, and that's the message: use it, and you'll find some amazing things. It's one of the most popular things in Toronto for digging down into these problems that we get from our customers when analyzing performance problems. But it's not always just debuggers and so on; we also give you some options to use for finding problems in your code. So you can automatically insert checks to find, you know, "oh, I went off the bounds of that array": it'll trap and tell you that, stopping it from going off and corrupting memory. There's automatic initialization of variables, where you need that to happen, and a rich set of listing information that you can dig through to understand what's going on with your program.
with your program the runtime
environment we have our own fortran run
time that we ship with the compiler and
the message there is that is something
that if you build an application with XL
fortran you can take that runtime
and give it to your customers as well so
that they can run that Fortran compiler
sorry run that fortran code on their
systems we give you a lot of tuning
levers and buttons and dials through
environment variables and you can
control things such as the
characteristics of the i/o that's going
on when you're doing that error
reporting what kinds of messages you
want do you want to know when you're
doing something that is in Fortran 90
conformant for example you can do that
sort of thing but in this space
certainly thread scheduling models
number of threads thread profiling
environment variables Israel important
things and of course all the things that
openmp defines we've got those the
binary compatibility is a very important
thing in this you can take our objects
work work with other objects in GCC g
plus plus and of course i BMX well c and
c++ take that whole bundle put it
together and there you go you've got
your applications mixed as many
languages as you like and we've added
some things like q flo complex GCC minus
QX name just auctioning but that the
messages we wearing we need them we've
added some things to help out with that
binary compatibility there we go now we
Now, we exist because of optimization: if we weren't a good optimizing compiler, we wouldn't be there in Toronto doing this every day for the last 14, 15 years. The optimization components that are in XL Fortran are in all of IBM's core compilers on all our important systems: C, C++, COBOL, PL/I, on AIX, Linux, the mainframes, Mac, and now pSeries, iSeries, and of course the G4 and G5. The message is, we've taken everything that we built up over those years and all those different platforms, and brought it down to the Apple platform, and we're seeing some really important success with that. The XL compiler is used by IBM on AIX to announce SPEC performance numbers, so again, the message is: we know how to tune for those chips. IBM does the chips; we work with the chip designers; we know what's coming; we know how to tune for those things. And we build our own software with it: AIX, DB2, Lotus Domino, they're all built with IBM's compilers, as you might expect.
expect optimization options we go to
five dots at the base level so 0 0 all
the way to 05 and you can go from
basically almost no optimization up to
wow what it wasn't that what did this
thing due to my at my code I can't
recognize it anymore and we've got a
whole set of against witches dials lock
knobs and levers that you can play with
in order to tune the optimization to
what you need to have happen on your
application things like minus Q hot
enables the high-order transformation
loop optimizer that it was built to
understand Fortran 90 array language and
syntax and can take those loops and do
some amazing things with them it'll also
work with C kodas with sealants as well
when you use it in our C compiler makyo
arch option tells you on the mac OS
machine do you want to target a generic
PowerPC in other words g4 or do you want
to go to g5 which I'm sure most of this
group is interested in and that enables
inside of our optimizer all the modeling
and tuning capabilities that we that we
bring to bear from that we brought up
and specifically done well person years
of effort to and into the g5 gives you
access to all it doesn't give you gives
the optima using using QR h g5 allows
the optimizer to precisely model your
code as it's going to see it because it
understands the chip understands how
many units are going and how to how to
keep that process or busy that's what
it's trying to do with the scheduling
model and using QR gt5 also gives you
access to those risks a tapar PC
intrinsic sagain that you can use for
things like cache control
all certain during arithmetic operations
that you might need and floating-point
control you want to toggle things in the
status and control register for example
and the nice thing about that if you're
interested in moving your code from one
IBM kind of system certain one IBM ship
to another those same intrinsics work on
compatible chips if you're going to AIX
or linux same story in the other
direction you have some code up there on
AIX want to bring down to g5 those
intrinsic is going to work too I PA is
IPA is sort of the keystone of our optimization technology, and it really differentiates us in what we can do with your application. When you've got IPA involved in your compile, and it runs automatically at -O4 and -O5, what it does for you is this: when you compile your code, it inserts information into your object which is essentially invisible to the linker. So if you just take those objects and feed them into ld, out comes your a.out, and you're happy. But if you then use IPA when you link your application, it extracts that information hidden away in the .o files and re-optimizes your code again, this time not on a file-by-file basis: it's got the entire application there, all the .o files that make up the whole thing, and it understands that this was called with this, and that was called with that, and so we don't have to worry about this parameter, we'll just stick a 7 in there, that kind of thing. So what it can do is repartition your application into more logical units that are easy to keep together in memory, do massive amounts of inlining where it makes sense, and it can even go across languages. So if you build your application mixed-mode with C, C++, and Fortran, and you build all that with the IBM XL compilers and run the IPA link step, it will do things like take your C code and inline it into your Fortran application. That's an amazing technology that we've brought down to the G5. And of course, after it does all that, we go back down into the low-level optimizer again, which is the one that really understands the chip and tunes for it.
PDF, profile-directed feedback, is another important technology, especially useful for codes where you may have them instrumented with debug, or perhaps some tuning information that you want to use to gather statistics. What PDF will do for you is this: you build your application once with -qpdf1, you run your application with typical sample data, and that will write out a statistics file. Then you compile your application again with -qpdf2, and it will read that statistics file, and that will tell the optimizer, "oh look, 99 percent of the time you take the branch this way, not that way." And so we can take your most frequently executed code and put that inline, and the stuff that almost never executes goes off to the side, and you get much better performance out of that. And of course, again, the message is that the XL compilers share the technology, so if you build stuff with our C compilers using PDF, you can mix that in with the Fortran compiler.
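
A minimal sketch of that profile-directed-feedback workflow; the compiler invocation and file names in the comments below are illustrative assumptions, not IBM's documentation:

    /*
     *   xlc -O3 -qpdf1 hot.c -o hot    (instrumented build)
     *   ./hot 1000000                  (run on typical data; writes stats)
     *   xlc -O3 -qpdf2 hot.c -o hot    (re-optimize using the profile)
     */
    #include <stdio.h>
    #include <stdlib.h>

    /* A data-dependent branch: PDF can discover which side dominates
       on real inputs and lay out the hot path accordingly. */
    static long process(long x) {
        if (x % 100 != 0)
            return x + 1;          /* taken ~99% of the time: hot path */
        return x * 31;             /* cold path, moved out of line */
    }

    int main(int argc, char **argv) {
        long n = (argc > 1) ? atol(argv[1]) : 1000000;
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += process(i);
        printf("%ld\n", sum);
        return 0;
    }
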
OpenMP and SMP are very important to this space, and we've got a lot of experience with these. Again, it's a technology preview on Mac OS X right now, but again, we're bringing it down from platforms where we've had a lot of time to work on it. We fully implement the 2.0 standard, and the important thing about OpenMP for us is that our optimizer fully understands what OpenMP is and what SMP is, so we can do things like the -qsmp=auto option we've put in our compiler, where it can take a look at your application and automatically parallelize things where it makes sense to do so. So you've got a couple of choices in the way you want to do things: if you want to code to the OpenMP standard, that's great, we'll handle that, but we'll also automatically parallelize for you where we can.
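
A minimal OpenMP 2.0 sketch of the kind of loop this applies to; the array size is arbitrary, and -qsmp=auto aims to find loops like this even without the pragma:

    /* built with an OpenMP-capable compiler, e.g. something like
       xlc -qsmp=omp (invocation names vary by platform) */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        double sum = 0.0;

        /* each iteration is independent, so the loop parallelizes cleanly */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            b[i] = a[i] + 1.0;
            sum += a[i] * b[i];
        }

        printf("threads=%d sum=%g\n", omp_get_max_threads(), sum);
        return 0;
    }
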
And again, it's another one with dozens of switches that I can't talk about right now. We'll give you lots of directives and options on the optimizer. As I said before, there are a couple of variations on this where you can go into your source and say things about your code, say that this loop has this characteristic, and that will give the optimizer even more opportunities to go and do things that it might not be able to recognize otherwise. But in some cases you want to constrain the optimizer: a lot of older code, especially, may not be a hundred percent standards-compliant, so things like -qalias=nostd will let you crank up the optimization level and still have your code run correctly, even though it might not get as many opportunities as if your code were standards-conformant. And of course, things like -qprefetch will automatically insert prefetching directives where that's useful. We had a great example of that yesterday: we had a gentleman in the lab across the hall working with us, who brought his code in, and just with some analysis with Shark, and looking at things, we stuck in one directive and sped up the core loop in his application by a factor of two, just by doing a prefetch.
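
A hedged sketch of what an explicit prefetch like that can look like in C source; __dcbt (data cache block touch) is an XL C hardware intrinsic, the __xlC__ guard is how that compiler identifies itself, and the look-ahead distance of 8 is an invented value that would normally be tuned with a profiler:

    void scale(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
    #ifdef __xlC__
            if (i + 8 < n)
                __dcbt((void *)&b[i + 8]);   /* touch a future cache line */
    #endif
            a[i] = 2.0 * b[i];
        }
    }
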
So the summary is: IBM XL Fortran and XL C bring to the Apple G5 systems technology that has been in the works at IBM since, honestly, the mid-'80s, and it's been improved every year by a large team in Toronto, and we work closely with the chip folks. We're fully backed by IBM's premier customer service: it doesn't matter if you buy the compilers from Absoft or you buy them from IBM, it's still the team in Toronto that's going to be looking after you. And our standards compliance and the large range of extensions that we have let you bring your code down from pretty much anywhere, and we'll help you out with the things that you need. Thanks. Great, thank you, Dave.
Okay, continuing along our stack, I want to talk a little bit about communications middleware. This is typically what we see as the MPI layers of a cluster, and the great thing about Mac OS X, again leveraging off that UNIX foundation, is that just about all the major MPI stacks have been brought over to Mac OS X and run really, really well. Matter of fact, some of them have been really optimized for Mac OS X and are available; for example, LAM/MPI comes as a package installer, for real ease of installation right on top of Mac OS X. So there's a great selection of tools. Matter of fact, if you have experience with a particular MPI stack, hopefully you'll see that the exact same stack is available on Mac OS X, and you can leverage that familiarity on the platform. So both open-source and commercial stacks are available for Mac OS X. There are a number of other pieces of middleware, too: obviously we talked about OpenMP, and Globus, PVM, Paradise and Linda from SCA, and a recent product, XLR8 from GridIron Software, also all fall into this communications middleware stack, and of course, all are available on Mac OS X.
Finally, I wanted to touch on management tools. This is an area where we think Mac OS X really shines, because again, you have best-of-breed tools available from Apple to really make managing these systems, particularly head nodes and systems where you're providing file services and network services, very easy for the system administrators, whether it be a small cluster or a large cluster. We also have the benefit of great open-source tools to really provide added value and functionality. So if we drill into this: first of all, we start with Apple's management tools, Server Admin and Workgroup Manager, providing kind of the bread and butter, you know, file services, DNS, DHCP, directory services, things that people kind of forget about, but it's the network infrastructure you need to support cluster operations. One of the highlights is Server Monitor, the tool that's unique to Xserve. The Xserve G5 has over 30 sensors on the logic board; I like to joke that it's one of the most instrumented 1U servers in the industry. Server Monitor is the tool that allows you to wrap up that data and provide that status information about the hardware: temperatures, predictive drive failures, power consumption. All that data is available in Server Monitor, and it's a great complement when you're managing a large number of machines.
Beyond that, we also have a new piece of technology from Apple, introduced not too long ago as a technology preview, which is Xgrid, again taking that ease-of-use approach: how do we make deploying clusters easier, how do we make distributed computing easier? Xgrid is really a great solution for the class of problems where you want to distribute workloads across a number of machines. What's interesting about it is that not only can it take advantage of dedicated cluster resources, such as a rack of Xserves; you can also bring in ad-hoc resources, through Rendezvous technologies, out across desktops and other machines on your network. The recent Technology Preview 2 added MPI support, which makes running and dispatching MPI jobs across your cluster much easier, and of course, it provides a great user interface, all the way down to the tachometer that lets you see how much performance you're getting on your jobs. So we're really excited about Xgrid, and of course, now that it's being brought into Tiger, it's going to be very broadly available to Mac OS X systems.
Finally, again, I wanted to touch on some of the leading open-source and commercial tools in this space, most notably schedulers. Again, the top schedulers available in the industry are available on the Mac OS X platform: LSF in the commercial space; PBS and OpenPBS; Sun Grid Engine, now called N1 Grid Engine; and even the Maui scheduler is available for Mac OS X. And also some of the leading cluster management and monitoring tools, tools like Ganglia and Big Brother, are also available for Mac OS X, and they're very valuable resources there.
So, in summary, if we look all the way from the hardware up to the management tools, we have a really compelling set of products and technologies, both from Apple and from industry-leading third parties, that allow you to build really phenomenal cluster solutions with Mac OS X and the PowerPC G5 at the foundation of the stack. What I'd like to do now is introduce some customers who are going to talk about how they've deployed Xserve and Mac OS X Server to solve some of their high-performance computing needs. The first customer I'd like to introduce is Ben Singer from Princeton University, who's going to talk about the deployment of Xserve in their center. Ben. Thanks, Doug. Delighted to be here. I'm here to talk a little bit about the Princeton Xserve cluster at the Center for the Study of Brain, Mind and Behavior, which we're still setting up; we got it about a month ago, and we're having fun setting it up. What is the CSBMB? Well, we're a consortium of Princeton faculty interested in the neural basis of cognition, and really what that is, is one of the great unanswered questions in science, which is: how does all this activity in the brain lead to consciousness and awareness and action and motivation and all the associated behaviors that we do every day and just take for granted? The consortium is made up of faculty from applied mathematics, computer science, chemistry, physics, biology, psychology, and philosophy. Psychology is actually our home building, so we have a lot of people in psychology who are working with us. And to point out one of the others: in applied mathematics, our biggest collaboration is actually with Ingrid Daubechies, who is the mother of wavelets and is supplying some algorithms for brain-imaging analysis. So really, what we are is a place that provides resources for all these faculty. We have staff, and we have resources in the computing and data-acquisition areas; on the staff side, there are software engineers, MRI physicists, the system administrator, and the administrators running the center.
The big data-acquisition instrument that I was alluding to is the MRI brain scanner from Siemens that we picked up a few years ago. At the time it was installed, it was the first research-only installation: most of the time, when you use an MRI, it's in a hospital setting, so it has first priority for clinical applications, and you end up doing work at three in the morning or something. One nice thing about our facility is that it's just a few doors away, in the psychology building, from the CSBMB staff center. And that provides all the data that I'm going to be talking about, and it's why we ended up getting an Xserve cluster. We already had a file server when I went shopping for a cluster, which was actually the first thing they had me do when I came, about six months ago; that was in place already, a BlueArc nine-terabyte file server, to store all this data that comes from the MRI brain scanner. We need to back it up, and we need to process it, and that's how we ended up with 64 Xserve G5 nodes.
I'm going to explain a little bit about how we chose the Xserve, but before I do that, I want to just say what it is, from a computing perspective, about what we do that motivated us to pursue an Xserve in the first place. We have a whole lot of brain data coming out of the MRI: a single study will produce hundreds of gigabytes of data. You take a single scan from somebody, and if you're doing functional MRI, even though a single slice of the brain is a lower-resolution, 64-squared image, you're taking 25 slices, and then you're taking maybe 30 of these a second. In one experiment recently, we had subjects watch Raiders of the Lost Ark for two hours, and we recorded their brains for two hours. That produces a lot of data, and we did that with multiple subjects, too, because we want to see whether our brains are doing the same thing when we're watching this movie. It's a sort of fun example, and to crunch through that is going to take some computing power.
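
As a rough back-of-the-envelope on those numbers (assuming 16-bit samples; all figures illustrative):

    64 x 64 x 25 voxels/volume x 2 bytes   ~  0.2 MB per volume
    0.2 MB/volume x 30 volumes/s           ~  6 MB/s
    6 MB/s x 7,200 s (two hours)           ~  44 GB per subject

so a multi-subject, multi-session study plausibly reaches the hundreds of gigabytes he describes.
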
The other thing is, people are moving their heads in the scanner. They have a little head rest, and we tell them not to move, but they still do, and that's natural, so we need to align every image with the first one, or some reference, and that takes a lot of time in the workflow. So does filtering in space and time. There's a lot of noise: this data, when you first get it, is not, it's not like suddenly something pops out at you and you know exactly what's happening, except in very simple cases. There's a lot of noise in the data that needs to be filtered out; there are other machines in the room that will put a signature in the data, maybe some low-frequency noise, maybe high-frequency noise, so you have to do filtering. And then, finally, you need to do a statistical analysis: you're comparing brains where they were just sitting there doing nothing with what happens when they're doing the task that you have them do. Comparing those two things is a simple statistical test, but you need to do it for every voxel in the brain, so that's thousands, hundreds of thousands, of statistical tests, and that can traditionally take days of CPU time to do a single study. And one problem with that is that when people have all this data, and it takes all this time to analyze it, they don't tend to play with it much. They don't tend to try new things or look at it from a new angle, because there's a big cost to doing that: they're going to tie up the lab resources for a day. They can't just put this data on their portable and run away with it; they have to stay and use up the center's resources, and sometimes people won't do it. So it just stifles creativity, for one thing.
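
A hypothetical sketch of the per-voxel analysis he describes: a two-sample t statistic computed independently at every voxel, which is exactly why the workload splits so cleanly across nodes (function and variable names are invented for illustration):

    #include <math.h>

    /* t statistic comparing n rest frames and n task frames at one voxel */
    static double voxel_t(const float *rest, const float *task, int n) {
        double mr = 0, mt = 0, vr = 0, vt = 0;
        for (int i = 0; i < n; i++) { mr += rest[i]; mt += task[i]; }
        mr /= n; mt /= n;
        for (int i = 0; i < n; i++) {
            vr += (rest[i] - mr) * (rest[i] - mr);
            vt += (task[i] - mt) * (task[i] - mt);
        }
        vr /= n - 1; vt /= n - 1;
        return (mt - mr) / sqrt((vr + vt) / n);
    }

    /* Every voxel is independent: any slice of [0, nvox) can go to any node. */
    void analyze(const float *rest, const float *task,
                 double *t, int nvox, int nframes) {
        for (int v = 0; v < nvox; v++)
            t[v] = voxel_t(rest + (long)v * nframes,
                           task + (long)v * nframes, nframes);
    }
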
So why did we choose Xserve? Well, when I first started looking, and we were all a group, but I was, you know, sort of the one who was doing it at the time, I got really deep into benchmarks, and although the Xserve does really well with benchmarks, I think the reason we chose it wasn't just the benchmarks. But anyway, let me point out the benchmark that I have on the slide. The speed score is from the AFNI package, from the National Institutes of Health; it's a free software package for analyzing MRI brain scans, and off the website where they publish their single-processor, 32-bit benchmarks come the bottom three bars here. Then I ran it last week on our Xserve, and it came out a little better. This benchmark tests the whole system, so maybe it was I/O or something that caused the Xserve to do better than the desktop, even though it has the same-speed chip. Again, like I was saying, it wasn't only benchmarks. There were some benchmarks, actually, early on, where if we compiled on, say, 64-bit Linux Opteron systems, we came out with different results, but the Xserve was always close. What we were doing when we were buying this cluster was looking at the whole package and thinking about the future, and about being able to actually house this thing and maintain it. And we also knew that in the future Apple's operating system would be fully 64-bit, to do a fair comparison with that compile on the 64-bit Linux, so we knew it would get better.
The power and the cooling that have been alluded to earlier were a great story for us, because we are in a small area. We said we wanted to get 64 nodes, and the facilities people laughed at us. So we said, well, how about if we put it in that room there? And they said, well, good luck. So we thought about it; we did get them to put in some additional air conditioning, and then we looked at the stats in the specs, and the G5 Xserve, which had just come out at the time: the specs showed that it used about half the power and the cooling, I think, roughly, at that time, and we were really happy with that, because we could actually buy it. That was actually a great, great feeling, and we knew that we'd be able to cool it. And it's also very quiet. I think in the last session Bud Tribble mentioned that we went into the room with them all on for the first time, and there was this strange high-pitched noise, and we thought, oh great, you know, this is going to be kind of noisy. It turned out it was the two Dell PowerEdges in there, so we were happy.
The great thing: we have a whole lot of people who were coming up on SGI systems and a bunch of Linux systems, so there were a lot of people who said, well, am I going to be able to use these open-source packages, and are we going to have to recompile them? Well, I've been in the process of porting, and it's been really pretty easy to do. The G5 is becoming a more and more popular target in the GCC makefiles for the packages that we depend on, including AFNI, and in fact some even have a binary distribution for the G5s already, so that was great. And the administration of this thing has been really straightforward so far. We're really still bringing it up, but the Server Admin and Server Monitor tools that Doug showed in one of his slides have been really helpful: I can just bring up my G4 desktop and look on the screen and see what's going on with the cluster, and I don't have to be a full-time sysadmin.
This is our system. What's a little different about us is what I emphasized in this slide: we have all this data, and we have this file server already, so we sort of had to work with it. And so we have a second network. The G5 node has two built-in Ethernet ports on the back, and we decided to use both of them, which, when we were setting it up, we realized meant we had to do 256 crimps of network cable, so we were wondering why we did that. But then we realized it was worth it once we set it all up, and we had a lot of help from Apple doing that: someone came and did most of the crimps, and redid the ones that I did. And we got all this stuff up and running, and what we have here: there's a Foundry switch that we got along with this Xserve cluster, I shouldn't call it a thing, and it's got a few gigabit ports on it, which go to our existing BlueArc file server, which is an appliance, sort of, as most of it is software and firmware. And we connect out to the world with that, into the head node of our cluster. Our cluster, in red there, uses what Apple ships with the cluster, the Asanté gigabit switches, and then down below are the Foundry connections, and that's where all the NFS traffic is; it stays on its own network, so it doesn't interfere with what's going on on the other network. And most of our applications are single-processor, embarrassingly parallel, so we don't have a need for MPI yet, or any of the high-speed interconnects. So we were happy with this: we just need to move a lot of data on, churn and work on it for a couple of hours, and then write it back out. So we didn't have a need to do MPI yet, though there will be a lot of opportunity for that.
Just to conclude: we're seeing five to ten times faster results than we had before, with the system that we moved off-site, an SGI Origin. We haven't even begun to try to optimize, so that's something that we want to do, although a lot of this software wasn't really written by us originally. But it's fertile ground, because, like I said, for every voxel in the brain the analysis is independent, so you know, you could have a node for every voxel if you wanted, so there's a great opportunity to speed things up. We don't need to hire anyone; at least, I hope that's not why I was hired, just to do this. But I'm having so much fun doing it that I might actually spend more time playing with it than on what I'm really supposed to do. And our students are happy. They have a feeling that something's coming: they see it there, they see the lights blinking, and they know that they're going to get a chance to play with it soon, and they're going to have less of an excuse to just say that their work is not done yet, because it'll be done quicker. But the good thing about that is they'll be able to try new things, do more stuff, and actually probably discover some things that they wouldn't have discovered otherwise, and that's the bottom line. And like I said, one reason we chose the Xserves is just that we've both had Apples in the past: I'm an Apple user, and so is the director of the center, and we know that Apple is always innovating. The hardware is solid, and the operating system is too, and with Xgrid, for instance, you see that Apple's eye is on this problem and that more great things are going to come. So we're looking forward to that. Thanks.
Thank you, Ben. Okay, the second customer I'd like to introduce is Dr. John Medeiros. Now, if you read any news last week, you may have heard about a small little cluster going in down in Huntsville, Alabama: a 1,566-node cluster being put together by COLSA. I'd like to introduce Dr. John Medeiros, the senior scientist from COLSA, who is going to tell you a little bit more about it.
Thank you. As Doug mentioned, I want to talk to you about our cluster, tell you a bit about it and about why we got it. But to put things in context a little bit, I'd like to tell you a bit about who we are, what kind of computing we do and why we need so much of it, and what process we went through to pick the cluster system that we did, and why we picked the Xserve cluster that we did. In any case, who is COLSA? Well, we're a small engineering services contractor, about 800 people, based in Huntsville, Alabama, as Doug mentioned, and we have a few offices throughout the U.S. I have to mention our company president, Nelson, and Dr. Tony, our vice president, who actually championed this project for us, providing a lot of corporate support for the vision that we had in terms of bringing that system to bear. The particular project I'm involved in is the Hypersonic Missile Technology program, and we have a dedicated corporate facility that we recently renovated for the system, called the Research and Operations Center, so that makes us the HMT ROC, which sounds a bit like a radio station, but really the only music there is on the iPods. Anyway, the program manager there is Dwight Whitlock, and I'm the technical lead on the project. The primary customer is the US Army's Research, Development and Engineering Command, RDECOM, out of Redstone Arsenal, and the principal scientists there on that project are Drs. Bill Walker and Kevin Kennedy.
So what kind of computing requirements are there for the project? Well, we're supporting their work on hypersonic aerodynamics, hypersonic flight and scramjet engines, and the focus is on the computational fluid dynamic analysis of the hypersonic endo-atmospheric regime, that is, very fast, in the near-Earth atmosphere. The cartoons on the right show some of the visualized data that comes out of it, where you display parameters of the space around an object flying very fast. It's in fact a very complex and difficult problem that we are simply attacking by brute force, using a proprietary double-precision Fortran code that solves the Navier-Stokes partial differential equations and fully explores the full combustion chemistry that goes on in that regime. We explore problem sizes with the space around the object divided into 20 million or more individual points at which the computations are done. That's a lot of points, but the good news is that blocks of those points can be assigned to a given processor, the computations are carried out in that processor, and then the results are compared step to step and the iterations continue. As a result of the way this whole process works, the problem is very CPU intensive, and relatively very little time is spent in interprocessor communication. It's in the category which you might call almost embarrassingly parallel, which is good from our point of view, and it in fact drove the design of the kind of cluster that we went after.
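As a rough illustration of that structure, and not the proprietary solver itself, here is a toy sketch in which each block of grid points iterates independently and only thin boundary layers are exchanged between steps; the grid size, block count, and update rule are all invented:

```python
# Toy block-decomposed iteration: heavy per-block compute, tiny
# boundary exchange between steps. Sizes and the update rule are
# illustrative stand-ins for a real CFD solver's work.
import numpy as np

N_BLOCKS = 4                                # one block per "processor"
grid = np.random.rand(N_BLOCKS, 500, 500)   # blocks of grid points

def iterate_block(block):
    # Stand-in for the expensive per-block computation (the seconds).
    return 0.25 * (np.roll(block, 1, 0) + np.roll(block, -1, 0) +
                   np.roll(block, 1, 1) + np.roll(block, -1, 1))

for step in range(10):
    grid = np.stack([iterate_block(b) for b in grid])
    # "Communication": neighboring blocks swap only one edge row each
    # (the milliseconds), tiny next to the block interiors.
    for i in range(N_BLOCKS - 1):
        lo = grid[i, -1, :].copy()
        hi = grid[i + 1, 0, :].copy()
        grid[i, -1, :] = hi
        grid[i + 1, 0, :] = lo
```

On a real cluster each block would live on its own node and the edge swap would be a message exchange, but the ratio of interior work to boundary traffic is what makes the problem nearly embarrassingly parallel.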
Now, we've been doing computing on this project for a while, and we've done it well. The systems we have include a traditional sort of supercomputer: an IBM SP POWER3 system of 284 CPUs that, when we got it, as pictured there back in June of 2000, came in at about number 47 on the Top 500 list. Four years later, today, it's completely off the list, which gives you an idea of how things are progressing. For this project, like you can't be too rich or too thin, you can't have too much computational power; we need a lot more than this. And the main thing is expense: while systems like that work very well, they were too expensive to get to the kind of computational levels that we wanted, so we began exploring.
In the interim, since we got that mainframe system, we have acquired, put together, and played with a number of clusters and explored a whole range of architectures from major vendors including AMD, Intel, and Apple. Our first system, back in June 2000, was a 34-processor AMD Athlon system, and in about the same time frame, a little later, we acquired a G4 system of about the same size, which at that time I believe was probably one of the biggest Apple clusters around. It performed fairly well, but it was only 800 megahertz, and we wanted to scale up quite substantially, so we wanted to look at other architectures, including rack mounted, obviously. Coming to a few of the systems that we work with now, shown here: the upper two are there for historical reasons, from the early days of looking at clustering, when we got tower systems, PCs and Apples, including that little Apple system we affectionately called the Apple Orchard. We've looked at 64-bit systems very extensively, including the Opteron system, and a lot of our computations now are done on an Intel Architecture 32-bit system; you see there our largest cluster now is a 522-processor system. But we in fact need much more, so we explored additional possibilities.
Now, the whole thing is set up in an unprepossessing building in Huntsville that we acquired back late last year. The building was gutted internally, virtually entirely; this shows the computer room being put together back last fall, and we renovated it literally from the top of the ceiling to the bottom of the computer room floor, about 3,000 square feet of computer room floor. And this shows our configuration: that 522-processor Intel Architecture system on the left as you see it, our SP mainframe system on the right, and in the center what I'm going to be talking about here, the cluster system that we are acquiring from Apple.
Okay, how do you pick such a system? You have to benchmark it, and from our point of view the benchmark that counted was our applications. So we ran our code using a sort of simplified geometry but the full complexity of the problem, in terms of a reasonable problem size and full combustion chemistry with the whole range of chemical species. What we did find, among other things, in testing across the whole range of processors, is that the interprocessor communication, as I mentioned, is a small fraction of the total compute time. That is, in a given iteration, we found the computation might typically take a few seconds, while the amount of time spent communicating between processors between those iterations was in the range of milliseconds, so there was very little penalty in worrying about the interconnect, which is in fact why we chose gigabit Ethernet for the system. Let me go back a minute to the last point about that: part of the trade-off, of course, is that with these other interconnects you've already heard about previously in the session, you can get better performance for a broader range of applications, but the cost difference is not trivial compared to a gigabit Ethernet switch.
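To put rough numbers on that trade-off, take illustrative values of two seconds of compute per iteration and five milliseconds of communication (the talk gives only "a few seconds" and "milliseconds"); the communication fraction is then

$$f_{\text{comm}} = \frac{t_{\text{comm}}}{t_{\text{comm}} + t_{\text{compute}}} \approx \frac{0.005}{0.005 + 2} \approx 0.25\%,$$

so even an interconnect that made communication free would shave only about a quarter of a percent off each iteration.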
Okay, this shows some of the data that we used and the kinds of things that drove our decision. What you're seeing there is a log-log plot of the time to do a given step of the computation as a function of the number of processors you throw at the problem, for all the different processors; five are shown on this chart. You see they all actually scale very well for our kind of problem; that is, as you double the number of processors, the processing time cuts in half. The grouping there, if you can make it out on this log-log plot, breaks up naturally into two groups: the upper two, the Athlon and Xeon systems, are the 32-bit systems, and the bottom three, the Opteron, the Itanium 2, and the G5, are the 64-bit systems. Lower is better on this chart: the less time taken per iteration, the more runs you can do in a given amount of time, and the more processing you can do. On that basis you can see that the G5 in fact performed the best.
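That behavior is ideal strong scaling: if the time per step is $t(p) = t(1)/p$, then on a log-log plot

$$\log t(p) = \log t(1) - \log p,$$

a straight line of slope $-1$, which is the pattern the five processor families traced for this code.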
Now, that comparison is maybe not quite fair, because these were different processors running at different speeds, so the next chart is the same data, this time with all the results normalized as if each of the processors ran at the same two-gigahertz clock speed. They didn't, but you can normalize the data that way just for demonstration purposes. What you see here is that the results are essentially the same, not changed very much; the only difference now is that the Itanium 2 looks a little bit better than the G5, a little bit faster on that chart. But you've got to keep in mind that the Itanium 2 is not available at any cost at two gigahertz; its fastest implementation is about one and a half gigahertz, and a system built with Itanium 2s comparable to the G5 system would cost about five times as much. So G5 it is.
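The normalization here is presumably the simple clock-rate scaling, on the assumption that step time is inversely proportional to clock frequency:

$$t_{\text{norm}} = t_{\text{meas}} \times \frac{f_{\text{actual}}}{2.0\ \text{GHz}},$$

so, for example, a measurement on a 1.5 GHz Itanium 2 is credited with $0.75\,t_{\text{meas}}$, as if the chip ran at 2 GHz.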
Okay, well, the processor is one thing, but there are a lot of issues in putting together a big cluster. We've put together, as you just saw, clusters of pretty reasonable size, but even for us this was a big cluster, and there are a lot of issues that come up in terms of scale. There's a whole laundry list of things I'm showing here, and I'm not going to go through them in detail, except to highlight a couple of things that you've heard before today; I want to emphasize them yet again. Power and cooling, on the bottom, especially at this kind of scale, are very much non-trivial. For example, for the current system that Apple's delivering to us right now, we've had to upgrade the power into the building; I'll tell you about that in a minute. But just to give you an idea, and this is something we haven't shared with our corporate executives yet, just to run the system, our utility bill for the year is going to run about two hundred fifty thousand dollars, just to keep it going. Cooling is also of course a very important issue, and just like power you can calculate how much cooling you need and you can get that cooling into your facility, but in addition we have the added complication of getting the cooling to the right place; you've got to look at how you distribute that, remove the heat and bring in the cooler air. So that's something that we're playing with; we expect to actually have to fuss with it a fair bit over the next little while.
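As a sanity check on that figure: at the roughly 400 kilowatts of peak draw mentioned below, and assuming an illustrative industrial rate of about seven cents per kilowatt-hour (the actual rate isn't given),

$$400\ \text{kW} \times 8{,}760\ \text{h/yr} \approx 3.5 \times 10^{6}\ \text{kWh/yr}, \qquad 3.5 \times 10^{6}\ \text{kWh} \times \$0.07/\text{kWh} \approx \$245{,}000/\text{yr},$$

right in line with the quarter-million-dollar estimate.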
So how did we do this? The processor we decided on was going to be G5s, but you can't say that when you buy on a government contract; you've got to be generic, and we were. We put a general request for cluster quotes out to the community at large, and one of the lowest quotes we got back, in fact, was the G5 system, just coincidentally. The requirements that we had included a theoretical performance for the system of at least 25 teraflops; we wanted a processor count in excess of 3,000; we wanted it all to fit into a thousand-square-foot footprint; we wanted minimal power and cooling requirements; and we wanted it all delivered by 12 July of this year. And we didn't want to pay a lot for this cluster. We didn't share with the vendors what the price target we had in mind was, and we had to go with the lowest bid, but we wanted the whole thing, including the switch and all the ancillary equipment we needed with it, to come in under six million dollars, and we're going to make that target. The system award, exclusive of the network component, was done on the 17th of May this year, so that's really a three-week turnaround, which in this business is a very short timeframe for getting that done, but we wanted it fairly soon as well.
Okay, about the system itself, some of the details. We're calling it MACH5, which stands for Multiple Advanced Computers for Hypersonics using G5s. We've got 1,562 dual Xserve G5 compute nodes and four head nodes, and these nodes are being delivered as we speak. There's a lot of complaining back home that I get to play and come here and attend WWDC while they're working on putting the system together back there, and I'm going to fly back tomorrow, so I couldn't get much more time off than that. We've taken delivery of about 350 nodes as of yesterday, and it's coming in at a tractor-trailer load's worth a day, which consists of 25 pallets of a dozen Xserves each, and everybody's got to pull them out, rack them, and get them hooked up; at that kind of scale it gets kind of interesting.
So the physical configuration is set up in 40 racks with 39 Xserve nodes in each. Each rack is a 42U rack with a 48-port gigabit Ethernet switch in it, which we are getting from Foundry Networks; it's actually a very high-performance gigabit switch, and it will work great for our purposes, we believe. One rack includes the four head nodes, a couple more cluster nodes, and a large 320-port GigE main switch, to which each of the individual 48-port switches is trunked, and which acts as the nexus for the cluster network. The whole thing occupies less than 600 square feet, so it beat the thousand-square-foot limitation that we imposed, and we're expecting to draw about four hundred kilowatts of peak power for the system.
We didn't have that much power being brought into the building at that point, so we had Huntsville Utilities bring us in a new transformer rated at over two megawatts; we're actually planning to build a bigger system, but that's another story. For cooling, we required about 110 tons of cooling. For those of you who may not be familiar with that, the ton, the unit used to rate these big chillers, is an archaic unit from the heating industry: it refers to being able to remove the latent heat of fusion of one ton of water in one day, that is, to make a ton of ice in a day. We have 150 tons of that installed, so if we ever get out of the computing business I guess we could make a lot of ice, but not at that price.
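Converting that archaic unit: one ton of refrigeration is 2,000 pounds of ice times the 144 BTU-per-pound latent heat of fusion, spread over 24 hours,

$$1\ \text{ton} = \frac{2{,}000\ \text{lb} \times 144\ \text{BTU/lb}}{24\ \text{h}} = 12{,}000\ \text{BTU/h} \approx 3.52\ \text{kW},$$

so the 110-ton requirement is roughly 387 kW of heat removal, matching the roughly 400 kW the system draws.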
Okay, details on the nodes. The head nodes themselves are of course dual two-gigahertz G5 units with mirrored 80-gigabyte hard drives and eight gigabytes of RAM installed, with a CD-RW drive and a video card. The compute nodes, 1,562 of them, are also dual two-gigahertz Xserve units, with a single 80-gigabyte hard drive and three and a half gigabytes of RAM per node, just under two gigabytes per processor; no CD-ROM or video card is required on the Xserve compute nodes. So in total there are 3,132 CPUs, and at eight gigaflops per CPU the theoretical performance of the system comes in at a tad over 25 teraflops, as required.
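That peak figure is straightforward arithmetic: the G5 has two double-precision floating-point units, each able to retire one fused multiply-add (two flops) per cycle, so at 2 GHz each CPU peaks at 8 gigaflops, and

$$3{,}132\ \text{CPUs} \times 8\ \text{GFlops/CPU} = 25{,}056\ \text{GFlops} \approx 25.06\ \text{TFlops}.$$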
Okay, as I mentioned, the system is being delivered; these are pictures we took last week before we came out here. Those were the first 40 units, packed in the high bay, being delivered. Some work is still going on in the computer room in terms of getting the rest of the infrastructure set up, and you see some of the guys working on putting some of the hardware in the racks. Now, they've got 40 racks, and to mount all these Xserves in the racks you have to put in these little clips that you screw into to get the rack mounts in; they go in the front and the back, and with this many servers we calculated the guys had to put in over 14,000 such clips. They did it in an afternoon; we had a bunch of folks working on it.
Okay, to kind of wind down the story a little bit and give you some perspective on the progression of computer technology, comparing here that mainframe system we acquired back in 2000 to MACH5 coming in now: cost-wise we're paying a little bit more, 6 million compared to 4 million back then, so fifty percent more; floor space is about twice as much; however, for that we get more than 10 times more processors, and we get more than 50 times more performance.
So in summary, we chose Apple's Xserve G5 architecture for a major production cluster for computational fluid dynamic analysis of hypersonic flight. The proposal we got from Apple on the Xserve G5 in fact delivered the best bang for the buck, in essence the best price/performance. Now, as I've mentioned, MACH5 has been designed for a compute-intensive problem with relatively little demand on the network. That means that in terms of the standard measures that put systems on the Top 500, it will not do as well, relatively, as a system purpose-designed with the highest-speed network. That being said, we fully expect to achieve something over 12 teraflops of real performance, and we believe we might be able to get up to 15 teraflops of real performance. If we can do that, we'll still easily be in the top five when the November list comes out; hopefully we can get there.
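For context, 12 to 15 sustained teraflops against the 25.06-teraflop peak works out to a sustained-to-peak efficiency of

$$\eta = \frac{R_{\max}}{R_{\text{peak}}} \approx \frac{12\ \text{to}\ 15}{25.06} \approx 48\%\ \text{to}\ 60\%,$$

a plausible range for a gigabit Ethernet cluster, where purpose-built high-speed interconnects typically land higher.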
So, as I mentioned, to finish up: the system is being installed and we hope to get it into production, actually get it working, by the fall, and from the solicitation of the system to actual production work we're looking at a six-month time frame, which is pretty phenomenal for a system of this kind of scale. We hope it works out. Thank you.
Okay, we're running a little late, so just to summarize and finish up the session. What I really wanted to say is, you know, Apple is investing in the high-performance computing market. We're doing it through our products, our technologies, and the solutions that we're providing, and we're working very closely to make sure the right solutions are available, both from third parties and from the open source community. The adoption has really been phenomenal and the momentum continues. So in summary, you know, Apple's products span from the Workgroup Cluster, turnkey and easy to use for bioinformatics, all the way up to, you know, the top supercomputers. So with that, thank you very much. Unfortunately we're running out of time for formal Q&A, but we'll be available up front for any questions you might have.